In this notebook, I am exploring different recommendation approaches for building an article recommendation engine for the IBM Watson Studio platform.
I. Exploratory Data Analysis
II. Rank Based Recommendations
III. User-User Based Collaborative Filtering
IV. Content Based Recommendations (EXTRA - NOT REQUIRED)
V. Matrix Factorization
VI. Extras & Concluding
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
import seaborn as sns
from collections import defaultdict
%matplotlib inline
df = pd.read_csv('../data/user-item-interactions.csv')
df_content = pd.read_csv('../data/articles_community.csv')
del df['Unnamed: 0']
del df_content['Unnamed: 0']
df.head()
df_content.head()
1.
What is the distribution of how many articles a user interacts with in the dataset?
def hist_box_plot(x: pd.Series,
                  x_label: str,
                  y_label: str,
                  bin_incr: int) -> None:
    '''Take a pandas series as input and draw a histogram with a boxplot above it'''
    fig, (ax_box, ax_hist) = plt.subplots(2,
                                          sharex=True,
                                          gridspec_kw={
                                              "height_ratios": (.15, .85)},
                                          figsize=(14, 6))
    sns.boxplot(x, ax=ax_box)
    bins = np.arange(0, x.max() + bin_incr, bin_incr)
    x.hist(grid=False, bins=bins, ax=ax_hist)
    ax_box.set(yticks=[])
    ax_hist.set_ylabel(y_label)
    ax_hist.set_xlabel(x_label)
    sns.despine(ax=ax_hist)
    sns.despine(ax=ax_box, left=True)
user_interactions = pd.crosstab(df.email, columns=[df.article_id]).T.sum()
hist_box_plot(user_interactions, 'No of Interactions', 'Frequency', 1)
user_interactions.value_counts().nlargest(10)
The distribution of the number of user interactions is right (positively) skewed, with the majority of users (1416) having the minimum number of interactions (1).
user_interactions.describe()
2.
Explore and remove duplicate articles from the df_content dataframe.
# Find and explore duplicate articles
df_content.info()
df_content[df_content.duplicated(subset='article_id', keep=False)]
df_content[df_content.duplicated(subset='article_id')]
There are 5 duplicate rows that need to be removed.
# Remove any rows that have the same article_id - only keep the first
len_before = df_content.shape[0]
df_content = df_content.drop_duplicates(subset='article_id')
# Check that the df_content no of rows has been reduced by 5
assert len_before - df_content.shape[0] == 5
# a. The number of unique articles that have an interaction with a user.
articles_interactions = pd.crosstab(df.email, columns=[df.article_id]).sum()
articles_interactions[articles_interactions>0].size
# b. The number of unique articles in the dataset (whether they have any interactions or not).
df_content.article_id.nunique()
# c. The number of unique users in the dataset
df.email.nunique()
# d. The number of user-article interactions in the dataset.
df.shape[0]
unique_articles = 714 # The number of unique articles that have at least one interaction
total_articles = 1051 # The number of unique articles on the IBM platform
unique_users = 5148 # The number of unique users
user_article_interactions = 45993 # The number of user-article interactions
# the top 10
articles_interactions.sort_values(ascending=False).iloc[:10]
articles_interactions.max()
def email_mapper():
    '''Map the user email to a user_id column and remove the email column'''
    coded_dict = dict()
    cter = 1
    email_encoded = []
    for val in df['email']:
        if val not in coded_dict:
            coded_dict[val] = cter
            cter += 1
        email_encoded.append(coded_dict[val])
    return email_encoded
email_encoded = email_mapper()
del df['email']
df['user_id'] = email_encoded
# show header
df.head()
We don't actually have ratings for whether a user liked an article or not. We only know that a user has interacted with an article. In these cases, the popularity of an article can really only be based on how often an article was interacted with.
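As a quick sanity check of that idea (a minimal illustration, essentially equivalent to the crosstab used in get_top_articles below), the interaction counts per article already give a popularity ranking:
# Interactions per article, most-interacted first (a simple popularity proxy)
df.article_id.value_counts().head(10)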
def get_top_articles(n, df=df):
    '''
    INPUT:
    n - (int) the number of top articles to return
    df - (pandas dataframe) df as defined at the top of the notebook
    OUTPUT:
    top_articles - (list) A list of the top 'n' article titles
    '''
    articles_interactions = pd.crosstab(df.user_id, columns=[df.title]).sum()
    top_articles = articles_interactions.nlargest(n).index.tolist()
    return top_articles  # Return the top article titles from df (not df_content)
def get_top_article_ids(n, df=df):
    '''
    INPUT:
    n - (int) the number of top articles to return
    df - (pandas dataframe) df as defined at the top of the notebook
    OUTPUT:
    top_articles - (list) A list of the top 'n' article ids
    '''
    articles_interactions = pd.crosstab(df.user_id, columns=[df.article_id]).sum()
    top_articles = articles_interactions.nlargest(n).index.tolist()
    return top_articles  # Return the top article ids
print(get_top_articles(10))
print(get_top_article_ids(10))
# create the user-article matrix with 1's and 0's
def create_user_item_matrix(df):
    '''
    INPUT:
    df - pandas dataframe with article_id, title, user_id columns
    OUTPUT:
    user_item - user item matrix
    Description:
    Return a matrix with user ids as rows and article ids on the columns with 1 values where a user interacted with
    an article and a 0 otherwise
    '''
    user_item = pd.crosstab(df.user_id, columns=[df.article_id])
    user_item[user_item > 0] = 1
    return user_item  # return the user_item matrix
user_item = create_user_item_matrix(df)
def find_similar_users(user_id, user_item=user_item):
    '''
    INPUT:
    user_id - (int) a user_id
    user_item - (pandas dataframe) matrix of users by articles:
                1's when a user has interacted with an article, 0 otherwise
    OUTPUT:
    similar_users - (list) an ordered list where the closest users (largest dot product users)
                    are listed first
    Description:
    Computes the similarity of every pair of users based on the dot product
    Returns an ordered list of the other user ids, from most to least similar
    '''
    # compute similarity of each user to the provided user (excluding the user's own id)
    similarity = user_item[user_item.index != user_id].dot(user_item.loc[user_id])
    # sort by similarity and keep just the ids
    most_similar_users = similarity.sort_values(ascending=False).index.tolist()
    return most_similar_users  # return a list of the users in order from most to least similar
print("The 10 most similar users to user 1 are: {}".format(find_similar_users(1)[:10]))
print("The 5 most similar users to user 3933 are: {}".format(find_similar_users(3933)[:5]))
print("The 3 most similar users to user 46 are: {}".format(find_similar_users(46)[:3]))
def get_article_names(article_ids, df=df):
    '''
    INPUT:
    article_ids - (list) a list of article ids
    df - (pandas dataframe) df as defined at the top of the notebook
    OUTPUT:
    article_names - (list) a list of article names associated with the list of article ids
                    (this is identified by the title column)
    '''
    df.article_id = df.article_id.astype(str)
    article_names = (df
                     .drop_duplicates('article_id')
                     .set_index('article_id')
                     .loc[article_ids]
                     .title
                     .tolist()
                     )
    return article_names  # Return the article names associated with list of article ids
def get_user_articles(user_id, user_item=user_item):
    '''
    INPUT:
    user_id - (int) a user id
    user_item - (pandas dataframe) matrix of users by articles:
                1's when a user has interacted with an article, 0 otherwise
    OUTPUT:
    article_ids - (list) a list of the article ids seen by the user
    article_names - (list) a list of article names associated with the list of article ids
                    (this is identified by the title column in df)
    Description:
    Provides a list of the article_ids and article titles that have been seen by a user
    '''
    user = user_item.loc[user_id]
    article_ids = user[user == 1].index.tolist()
    article_ids = [str(i) for i in article_ids]
    article_names = get_article_names(article_ids)
    return article_ids, article_names  # return the ids and names
def user_user_recs(user_id, m=10):
    '''
    INPUT:
    user_id - (int) a user id
    m - (int) the number of recommendations you want for the user
    OUTPUT:
    recs - (list) a list of recommendations for the user
    Description:
    Loops through the users based on closeness to the input user_id
    For each user - finds articles the user hasn't seen before and provides them as recs
    Does this until m recommendations are found
    Notes:
    Users who are the same closeness are chosen arbitrarily as the 'next' user
    For the user where the number of recommended articles starts below m
    and ends exceeding m, the last items are chosen arbitrarily
    '''
    seen_articles = get_user_articles(user_id)[0]
    users = find_similar_users(user_id)
    recs = []
    for user in users:
        rec_articles = get_user_articles(user)[0]
        for article in rec_articles:
            if article not in seen_articles and article not in recs:
                recs.append(article)
                if len(recs) == m:
                    return recs  # return your recommendations for this user_id
# Check Results
get_article_names(user_user_recs(1, 10)) # Return 10 recommendations for user 1
def get_top_sorted_users(user_id, df=df, user_item=user_item):
    '''
    INPUT:
    user_id - (int)
    df - (pandas dataframe) df as defined at the top of the notebook
    user_item - (pandas dataframe) matrix of users by articles:
                1's when a user has interacted with an article, 0 otherwise
    OUTPUT:
    neighbors_df - (pandas dataframe) a dataframe with:
                   neighbor_id - is a neighbor user_id
                   similarity - measure of the similarity of each user to the provided user_id
                   num_interactions - the total number of article interactions the neighbor has in df
    Other Details - sort the neighbors_df by the similarity and then by number of interactions where
                    highest of each is higher in the dataframe
    '''
    neighbors_df = pd.DataFrame()
    # compute similarity of each user to the provided user
    similarity = user_item[user_item.index != user_id].dot(user_item.loc[user_id])
    # sort by similarity
    most_similar_users = similarity.sort_values(ascending=False)
    neighbors_df['neighbor_id'] = most_similar_users.index.tolist()
    neighbors_df['similarity'] = most_similar_users.tolist()
    neighbors_df['num_interactions'] = neighbors_df['neighbor_id'].apply(lambda x: df[df['user_id'] == x].shape[0])
    # Return the dataframe specified in the doc_string
    return neighbors_df.sort_values(['similarity', 'num_interactions'], ascending=False)
def user_user_recs_part2(user_id, m=10):
    '''
    INPUT:
    user_id - (int) a user id
    m - (int) the number of recommendations you want for the user
    OUTPUT:
    recs - (list) a list of recommendations for the user by article id
    rec_names - (list) a list of recommendations for the user by article title
    Description:
    Loops through the users based on closeness to the input user_id
    For each user - finds articles the user hasn't seen before and provides them as recs
    Does this until m recommendations are found
    Notes:
    * Choose the users that have the most total article interactions
      before choosing those with fewer article interactions.
    * Choose the articles with the most total interactions
      before choosing those with fewer total interactions.
    '''
    seen_articles = get_user_articles(user_id)[0]
    users = get_top_sorted_users(user_id)['neighbor_id']
    recs = []
    users_artic_dict = defaultdict(list)
    for user in users:
        rec_articles = get_user_articles(user)[0]
        for article in rec_articles:
            if article not in seen_articles and article not in recs:
                # Keep track of which user contributed which articles
                users_artic_dict[user].append(article)
                recs.append(article)
        if len(recs) >= m:
            break
    # Rank every article by its total number of interactions
    top_articles = get_top_article_ids(df.shape[0])
    top_articles_dict = {article: i for i, article in enumerate(top_articles)}
    # Remove the articles contributed by the final (tipping-point) user ...
    final_user_articles = users_artic_dict[user]
    recs = recs[:len(recs) - len(final_user_articles)]
    # ... and add them back in order of overall popularity until we reach m recs
    sort_articles = sorted(final_user_articles, key=lambda a: top_articles_dict[a])
    for article in sort_articles:
        recs.append(article)
        if len(recs) == m:
            break
    return recs, get_article_names(recs)  # return your recommendations for this user_id
rec_ids, rec_names = user_user_recs_part2(20, 10)
print("The top 10 recommendations for user 20 are the following article ids:")
print(rec_ids)
print()
print("The top 10 recommendations for user 20 are the following article names:")
print(rec_names)
get_top_sorted_users(1).head(1)
get_top_sorted_users(131).head(10).iloc[-1]
Since we have no information about the user, it makes sense to recommend the most viewed articles across all existing users (a simple, rank-based recommender).
new_user = '0.0'
# What would your recommendations be for this new user '0.0'? As a new user, they have no observed articles.
# Provide a list of the top 10 article ids you would give to this new user.
new_user_recs = get_top_article_ids(10) # Your recommendations here
assert set(new_user_recs) == set(['1314.0','1429.0','1293.0','1427.0','1162.0','1364.0','1304.0','1170.0','1431.0','1330.0']), "Oops! It makes sense that in this case we would want to recommend the most popular articles, because we don't know anything about these users."
print("That's right! Nice job!")
def make_content_recs():
    '''
    INPUT:
    OUTPUT:
    '''
    # make recommendations for a brand new user
    # make recommendations for a user who has only interacted with article id '1427.0'
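Section IV was left as an optional extra; a minimal content-based sketch (assuming scikit-learn is available and that df_content keeps its doc_description and doc_full_name columns) could rank articles by TF-IDF similarity of their descriptions, for example:
# Rough content-based sketch - assumes scikit-learn and the doc_description / doc_full_name columns in df_content
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def content_recs_sketch(article_id, m=10, df_content=df_content):
    '''Return up to m article titles whose descriptions are most similar to the given df_content article_id.'''
    content = df_content.reset_index(drop=True)
    # TF-IDF representation of each article's description
    tfidf = TfidfVectorizer(stop_words='english').fit_transform(content.doc_description.fillna(''))
    # Row position of the seed article
    idx = content.index[content.article_id == article_id][0]
    sims = cosine_similarity(tfidf[idx], tfidf).flatten()
    # Most similar articles first, excluding the seed article itself
    top_idx = [i for i in sims.argsort()[::-1] if i != idx][:m]
    return content.iloc[top_idx].doc_full_name.tolist()

# e.g. content_recs_sketch(0)  # articles whose descriptions are most similar to article 0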
# Load the matrix
user_item_matrix = pd.read_pickle('../user_item_matrix.p')
# quick look at the matrix
user_item_matrix.head()
# Perform SVD on the User-Item Matrix
u, s, vt = np.linalg.svd(user_item_matrix, full_matrices=False) # use the built in to get the three matrices
num_latent_feats = np.arange(10,700+10,20)
sum_errs = []
for k in num_latent_feats:
    # restructure with k latent features
    s_new, u_new, vt_new = np.diag(s[:k]), u[:, :k], vt[:k, :]
    # take dot product
    user_item_est = np.around(np.dot(np.dot(u_new, s_new), vt_new))
    # compute error for each prediction to actual value
    diffs = np.subtract(user_item_matrix, user_item_est)
    # total errors and keep track of them
    err = np.sum(np.sum(np.abs(diffs)))
    sum_errs.append(err)
plt.plot(num_latent_feats, 1 - np.array(sum_errs)/df.shape[0]);
plt.xlabel('Number of Latent Features');
plt.ylabel('Accuracy');
plt.title('Accuracy vs. Number of Latent Features');
df_train = df.head(40000)
df_test = df.tail(5993)
def create_test_and_train_user_item(df_train, df_test):
    '''
    INPUT:
    df_train - training dataframe
    df_test - test dataframe
    OUTPUT:
    user_item_train - a user-item matrix of the training dataframe
                      (unique users for each row and unique articles for each column)
    user_item_test - a user-item matrix of the testing dataframe
                     (unique users for each row and unique articles for each column)
    test_idx - all of the test user ids
    test_arts - all of the test article ids
    '''
    user_item_train = create_user_item_matrix(df_train)
    user_item_test = create_user_item_matrix(df_test)
    # Find users both in test and in train
    train_idx = set(user_item_train.index)
    test_idx = set(user_item_test.index)
    match_idx = train_idx.intersection(test_idx)
    # Find articles both in test and in train
    train_arts = set(user_item_train.columns)
    test_arts = set(user_item_test.columns)
    match_cols = train_arts.intersection(test_arts)
    user_item_test = user_item_test.loc[match_idx, match_cols]
    return user_item_train, user_item_test, test_idx, test_arts
user_item_train, user_item_test, test_idx, test_arts = create_test_and_train_user_item(df_train, df_test)
# How many users can we make predictions for in the test set?
user_item_test.shape[0]
# How many users in the test set are we not able to make predictions for because of the cold start problem?
len(test_idx) - user_item_test.shape[0]
# How many articles can we make predictions for in the test set?
user_item_test.shape[1]
# How many articles in the test set are we not able to make predictions for because of the cold start problem?
len(test_arts) - user_item_test.shape[1]
# fit SVD on the user_item_train matrix
u_train, s_train, vt_train = np.linalg.svd(user_item_train, full_matrices=False) # fit svd similar to above then use the cells below
# Use these cells to see how well you can use the training
# decomposition to predict on test data
row_idxs = user_item_train.index.isin(test_idx)
col_idxs = user_item_train.columns.isin(test_arts)
u_test = u_train[row_idxs, :]
vt_test = vt_train[:, col_idxs]
num_latent_feats = np.arange(0, 700+10, 20)
sum_errs_train = []
sum_errs_test = []
all_errs = []
for k in num_latent_feats:
    # restructure with k latent features
    s_train_lat, u_train_lat, vt_train_lat = np.diag(s_train[:k]), u_train[:, :k], vt_train[:k, :]
    u_test_lat, vt_test_lat = u_test[:, :k], vt_test[:k, :]
    # take dot product
    user_item_train_preds = np.around(np.dot(np.dot(u_train_lat, s_train_lat), vt_train_lat))
    user_item_test_preds = np.around(np.dot(np.dot(u_test_lat, s_train_lat), vt_test_lat))
    all_errs.append(1 - ((np.sum(user_item_test_preds) + np.sum(np.sum(user_item_test))) / (user_item_test.shape[0] * user_item_test.shape[1])))
    # compute error for each prediction to actual value
    diffs_train = np.subtract(user_item_train, user_item_train_preds)
    diffs_test = np.subtract(user_item_test, user_item_test_preds)
    # total errors and keep track of them
    err_train = np.sum(np.sum(np.abs(diffs_train)))
    err_test = np.sum(np.sum(np.abs(diffs_test)))
    sum_errs_train.append(err_train)
    sum_errs_test.append(err_test)
plt.plot(num_latent_feats, 1 - np.array(sum_errs_train)/(user_item_train.shape[0]*user_item_train.shape[1]), label='Train', color='r');
plt.plot(num_latent_feats, 1 - np.array(sum_errs_test)/(user_item_test.shape[0]*user_item_test.shape[1]), label='Test', color='c');
plt.plot(num_latent_feats, all_errs, label='All Data');
plt.xlabel('Number of Latent Features');
plt.ylabel('Accuracy');
plt.title('Accuracy vs. Number of Latent Features');
plt.legend();
The accuracy of around 96% at roughly 300 latent features appears good, but it is based on only 20 overlapping test users and the classes are heavily imbalanced (1s are approximately 1% of the matrix), so plain accuracy is misleading. Calculating and plotting precision, recall and the F1 score would give better insight into the actual performance of the predictions.
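A rough sketch of that evaluation for a single choice of latent features, reusing the test decomposition from above (k = 300 here is just an illustrative value read off the plot):
# Precision / recall / F1 on the test matrix for one illustrative k (sketch, not a tuned evaluation)
from sklearn.metrics import precision_score, recall_score, f1_score

k = 300
s_lat, u_test_lat, vt_test_lat = np.diag(s_train[:k]), u_test[:, :k], vt_test[:k, :]
test_preds = np.clip(np.around(np.dot(np.dot(u_test_lat, s_lat), vt_test_lat)), 0, 1)

y_true = user_item_test.values.flatten()
y_pred = test_preds.flatten()
print('precision: {:.3f}'.format(precision_score(y_true, y_pred)))
print('recall:    {:.3f}'.format(recall_score(y_true, y_pred)))
print('f1:        {:.3f}'.format(f1_score(y_true, y_pred)))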
A potential solution to the problem of limited users would be to set up an online experiment in which users are randomly split into a control group that receives no recommendations and a treatment group that does. The null hypothesis would be that there is no difference in the number of interactions between the two groups; testing it against a chosen alpha threshold would determine whether the recommendations have a statistically significant effect.
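A minimal sketch of that analysis, assuming interactions per user had been logged for each group (the arrays below are placeholders, not real experiment data):
# Hypothetical A/B test: compare mean interactions per user between the two groups.
# The arrays are placeholders; in practice they would come from the experiment logs.
from scipy import stats

control_interactions = np.array([3, 5, 2, 8, 1, 4])    # users shown no recommendations
treatment_interactions = np.array([4, 7, 3, 9, 2, 6])  # users shown recommendations

t_stat, p_value = stats.ttest_ind(treatment_interactions, control_interactions, equal_var=False)
alpha = 0.05
print('p-value: {:.3f}'.format(p_value))
if p_value < alpha:
    print('Reject the null: recommendations changed interaction counts.')
else:
    print('Fail to reject the null at alpha = {}.'.format(alpha))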
TO BE CONTINUED