Article Recommendation Solutions for the IBM Watson Studio Platform

In this notebook, I am exploring different recommendation approaches for building an article recommendation engine for the IBM Watson Studio platform.

Table of Contents

I. Exploratory Data Analysis
II. Rank Based Recommendations
III. User-User Based Collaborative Filtering
IV. Content Based Recommendations (EXTRA - NOT REQUIRED)
V. Matrix Factorization
VI. Extras & Concluding

In [79]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
import seaborn as sns
from collections import defaultdict

%matplotlib inline

df = pd.read_csv('../data/user-item-interactions.csv')
df_content = pd.read_csv('../data/articles_community.csv')
del df['Unnamed: 0']
del df_content['Unnamed: 0']

df.head()
Out[79]:
article_id title email
0 1430.0 using pixiedust for fast, flexible, and easier... ef5f11f77ba020cd36e1105a00ab868bbdbf7fe7
1 1314.0 healthcare python streaming application demo 083cbdfa93c8444beaa4c5f5e0f5f9198e4f9e0b
2 1429.0 use deep learning for image classification b96a4f2e92d8572034b1e9b28f9ac673765cd074
3 1338.0 ml optimization using cognitive assistant 06485706b34a5c9bf2a0ecdac41daf7e7654ceb7
4 1276.0 deploy your python model as a restful api f01220c46fc92c6e6b161b1849de11faacd7ccb2
In [80]:
df_content.head()
Out[80]:
doc_body doc_description doc_full_name doc_status article_id
0 Skip navigation Sign in SearchLoading...\r\n\r... Detect bad readings in real time using Python ... Detect Malfunctioning IoT Sensors with Streami... Live 0
1 No Free Hunch Navigation * kaggle.com\r\n\r\n ... See the forest, see the trees. Here lies the c... Communicating data science: A guide to present... Live 1
2 ☰ * Login\r\n * Sign Up\r\n\r\n * Learning Pat... Here’s this week’s news in Data Science and Bi... This Week in Data Science (April 18, 2017) Live 2
3 DATALAYER: HIGH THROUGHPUT, LOW LATENCY AT SCA... Learn how distributed DBs solve the problem of... DataLayer Conference: Boost the performance of... Live 3
4 Skip navigation Sign in SearchLoading...\r\n\r... This video demonstrates the power of IBM DataS... Analyze NY Restaurant data using Spark in DSX Live 4

Part I : Exploratory Data Analysis

1. What is the distribution of how many articles a user interacts with in the dataset?

In [81]:
def hist_box_plot(x: pd.Series,
                  x_label: str,
                  y_label: str,
                  bin_incr: int) -> None:
    '''Take a pandas series as input and draw a histogram with a boxplot above it'''
    fig, (ax_box, ax_hist) = plt.subplots(2,
                                        sharex=True,
                                        gridspec_kw={
                                            "height_ratios": (.15, .85)},
                                        figsize=(14, 6))

    sns.boxplot(x, ax=ax_box)
    bins = np.arange(0, x.max() + bin_incr, bin_incr)
    x.hist(grid=False, bins=bins, ax=ax_hist)  # draw the histogram on the lower axes
    ax_box.set(yticks=[])
    ax_hist.set_ylabel(y_label)
    ax_hist.set_xlabel(x_label)
    sns.despine(ax=ax_hist)
    sns.despine(ax=ax_box, left=True)

user_interactions = pd.crosstab(df.email, columns=[df.article_id]).T.sum()

hist_box_plot(user_interactions, 'No of Interactions', 'Frequency', 1)
In [82]:
user_interactions.value_counts().nlargest(10)
Out[82]:
1     1416
2      694
3      485
4      351
5      277
6      228
7      182
8      156
10     124
9      115
dtype: int64

The distribution of the number of user interactions is right (positively) skewed, with the majority of users (1416) having the minimum number of interactions (1).

In [83]:
user_interactions.describe()
Out[83]:
count    5148.000000
mean        8.930847
std        16.802267
min         1.000000
25%         1.000000
50%         3.000000
75%         9.000000
max       364.000000
dtype: float64

2. Explore and remove duplicate articles from the df_content dataframe.

In [84]:
# Find and explore duplicate articles
df_content.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1056 entries, 0 to 1055
Data columns (total 5 columns):
doc_body           1042 non-null object
doc_description    1053 non-null object
doc_full_name      1056 non-null object
doc_status         1056 non-null object
article_id         1056 non-null int64
dtypes: int64(1), object(4)
memory usage: 41.3+ KB
In [85]:
df_content[df_content.duplicated(subset='article_id', keep=False)]
Out[85]:
doc_body doc_description doc_full_name doc_status article_id
50 Follow Sign in / Sign up Home About Insight Da... Community Detection at Scale Graph-based machine learning Live 50
221 * United States\r\n\r\nIBM® * Site map\r\n\r\n... When used to make sense of huge amounts of con... How smart catalogs can turn the big data flood... Live 221
232 Homepage Follow Sign in Get started Homepage *... If you are like most data scientists, you are ... Self-service data preparation with IBM Data Re... Live 232
365 Follow Sign in / Sign up Home About Insight Da... During the seven-week Insight Data Engineering... Graph-based machine learning Live 50
399 Homepage Follow Sign in Get started * Home\r\n... Today’s world of data science leverages data f... Using Apache Spark as a parallel processing fr... Live 398
578 This video shows you how to construct queries ... This video shows you how to construct queries ... Use the Primary Index Live 577
692 Homepage Follow Sign in / Sign up Homepage * H... One of the earliest documented catalogs was co... How smart catalogs can turn the big data flood... Live 221
761 Homepage Follow Sign in Get started Homepage *... Today’s world of data science leverages data f... Using Apache Spark as a parallel processing fr... Live 398
970 This video shows you how to construct queries ... This video shows you how to construct queries ... Use the Primary Index Live 577
971 Homepage Follow Sign in Get started * Home\r\n... If you are like most data scientists, you are ... Self-service data preparation with IBM Data Re... Live 232
In [86]:
df_content[df_content.duplicated(subset='article_id')]
Out[86]:
doc_body doc_description doc_full_name doc_status article_id
365 Follow Sign in / Sign up Home About Insight Da... During the seven-week Insight Data Engineering... Graph-based machine learning Live 50
692 Homepage Follow Sign in / Sign up Homepage * H... One of the earliest documented catalogs was co... How smart catalogs can turn the big data flood... Live 221
761 Homepage Follow Sign in Get started Homepage *... Today’s world of data science leverages data f... Using Apache Spark as a parallel processing fr... Live 398
970 This video shows you how to construct queries ... This video shows you how to construct queries ... Use the Primary Index Live 577
971 Homepage Follow Sign in Get started * Home\r\n... If you are like most data scientists, you are ... Self-service data preparation with IBM Data Re... Live 232

There are 5 duplicate rows (same article_id) that need to be removed.

In [87]:
# Remove any rows that have the same article_id - only keep the first

len_before = df_content.shape[0]

df_content = df_content.drop_duplicates(subset='article_id')

# Check that the df_content no of rows has been reduced by 5
assert len_before - df_content.shape[0] == 5
In [88]:
# a. The number of unique articles that have an interaction with a user.
articles_interactions = pd.crosstab(df.email, columns=[df.article_id]).sum()
articles_interactions[articles_interactions>0].size
Out[88]:
714
In [89]:
# b. The number of unique articles in the dataset (whether they have any interactions or not).
df_content.article_id.nunique()
Out[89]:
1051
In [90]:
# c. The number of unique users in the dataset
df.email.nunique()
Out[90]:
5148
In [91]:
# d. The number of user-article interactions in the dataset.
df.shape[0]
Out[91]:
45993
In [92]:
unique_articles = 714 # The number of unique articles that have at least one interaction
total_articles = 1051 # The number of unique articles on the IBM platform
unique_users = 5148 # The number of unique users
user_article_interactions = 45993 # The number of user-article interactions
In [93]:
# the top 10
articles_interactions.sort_values(ascending=False).iloc[:10]
Out[93]:
article_id
1429.0    937
1330.0    927
1431.0    671
1427.0    643
1364.0    627
1314.0    614
1293.0    572
1170.0    565
1162.0    512
1304.0    483
dtype: int64
In [94]:
# the number of interactions of the most viewed article
articles_interactions[articles_interactions.idxmax()]
Out[94]:
937
In [95]:
def email_mapper():
    '''Map the user email to a user_id column and remove the email column'''
    coded_dict = dict()
    cter = 1
    email_encoded = []
    
    for val in df['email']:
        if val not in coded_dict:
            coded_dict[val] = cter
            cter+=1
        
        email_encoded.append(coded_dict[val])
    return email_encoded

email_encoded = email_mapper()
del df['email']
df['user_id'] = email_encoded

# show header
df.head()
Out[95]:
article_id title user_id
0 1430.0 using pixiedust for fast, flexible, and easier... 1
1 1314.0 healthcare python streaming application demo 2
2 1429.0 use deep learning for image classification 3
3 1338.0 ml optimization using cognitive assistant 4
4 1276.0 deploy your python model as a restful api 5

Part II: Rank-Based Recommendations

We don't actually have ratings for whether a user liked an article or not. We only know that a user has interacted with an article. In these cases, the popularity of an article can really only be based on how often an article was interacted with.

In [96]:
def get_top_articles(n, df=df):
    '''
    INPUT:
    n - (int) the number of top articles to return
    df - (pandas dataframe) df as defined at the top of the notebook 
    
    OUTPUT:
    top_articles - (list) A list of the top 'n' article titles 
    
    '''

    articles_interactions = pd.crosstab(df.user_id, columns=[df.title]).sum()
    top_articles = articles_interactions.nlargest(n).index.tolist()
    
    return top_articles # Return the top article titles from df (not df_content)

def get_top_article_ids(n, df=df):
    '''
    INPUT:
    n - (int) the number of top articles to return
    df - (pandas dataframe) df as defined at the top of the notebook 
    
    OUTPUT:
    top_articles - (list) A list of the top 'n' article ids
    
    '''

    
    articles_interactions = pd.crosstab(df.user_id, columns=[df.article_id]).sum()
    top_articles = articles_interactions.nlargest(n).index.tolist()
    
    return top_articles # Return the top article ids
In [97]:
print(get_top_articles(10))
print(get_top_article_ids(10))
['use deep learning for image classification', 'insights from new york car accident reports', 'visualize car data with brunel', 'use xgboost, scikit-learn & ibm watson machine learning apis', 'predicting churn with the spss random tree algorithm', 'healthcare python streaming application demo', 'finding optimal locations of new store using decision optimization', 'apache spark lab, part 1: basic concepts', 'analyze energy consumption in buildings', 'gosales transactions for logistic regression model']
[1429.0, 1330.0, 1431.0, 1427.0, 1364.0, 1314.0, 1293.0, 1170.0, 1162.0, 1304.0]
In [98]:
# create the user-article matrix with 1's and 0's

def create_user_item_matrix(df):
    '''
    INPUT:
    df - pandas dataframe with article_id, title, user_id columns
    
    OUTPUT:
    user_item - user item matrix 
    
    Description:
    Return a matrix with user ids as rows and article ids on the columns with 1 values where a user interacted with 
    an article and a 0 otherwise
    '''
    # Fill in the function here
    
    user_item = pd.crosstab(df.user_id, columns=[df.article_id])

    user_item[user_item > 0] = 1
        
    return user_item # return the user_item matrix 

user_item = create_user_item_matrix(df)
In [99]:
def find_similar_users(user_id, user_item=user_item):
    '''
    INPUT:
    user_id - (int) a user_id
    user_item - (pandas dataframe) matrix of users by articles: 
                1's when a user has interacted with an article, 0 otherwise
    
    OUTPUT:
    similar_users - (list) an ordered list where the closest users (largest dot product users)
                    are listed first
    
    Description:
    Computes the similarity of every other user to the provided user based on the dot product
    Returns an ordered list of user ids, from most to least similar
    
    '''
    # compute the similarity of each user to the provided user (excluding the user's own id)
    similarity = user_item[user_item.index != user_id].dot(user_item.loc[user_id])
    
    # sort by similarity # create list of just the ids 
    most_similar_users = similarity.sort_values(ascending=False).index.tolist()
    
       
    return most_similar_users # return a list of the users in order from most to least similar
In [100]:
print("The 10 most similar users to user 1 are: {}".format(find_similar_users(1)[:10]))
print("The 5 most similar users to user 3933 are: {}".format(find_similar_users(3933)[:5]))
print("The 3 most similar users to user 46 are: {}".format(find_similar_users(46)[:3]))
The 10 most similar users to user 1 are: [3933, 23, 3782, 203, 4459, 3870, 131, 4201, 46, 3697]
The 5 most similar users to user 3933 are: [1, 3782, 23, 203, 4459]
The 3 most similar users to user 46 are: [4201, 3782, 23]
In [101]:
def get_article_names(article_ids, df=df):
    '''
    INPUT:
    article_ids - (list) a list of article ids
    df - (pandas dataframe) df as defined at the top of the notebook
    
    OUTPUT:
    article_names - (list) a list of article names associated with the list of article ids 
                    (this is identified by the title column)
    '''

    # convert article_id to str (in place) so that lookups match the string ids used by the recommendation functions
    df.article_id = df.article_id.astype(str)
    
    article_names = (df
                     .drop_duplicates('article_id')
                     .set_index('article_id')
                     .loc[article_ids]
                     .title
                     .tolist()
                    )
    
    return article_names # Return the article names associated with list of article ids


def get_user_articles(user_id, user_item=user_item):
    '''
    INPUT:
    user_id - (int) a user id
    user_item - (pandas dataframe) matrix of users by articles: 
                1's when a user has interacted with an article, 0 otherwise
    
    OUTPUT:
    article_ids - (list) a list of the article ids seen by the user
    article_names - (list) a list of article names associated with the list of article ids 
                    (this is identified by the doc_full_name column in df_content)
    
    Description:
    Provides a list of the article_ids and article titles that have been seen by a user
    '''


    user = user_item.loc[user_id] 
    article_ids = user[user==1].index.tolist()
    article_ids = [str(i) for i in article_ids]
    
    
    article_names = get_article_names(article_ids)
    
    return article_ids, article_names # return the ids and names


def user_user_recs(user_id, m=10):
    '''
    INPUT:
    user_id - (int) a user id
    m - (int) the number of recommendations you want for the user
    
    OUTPUT:
    recs - (list) a list of recommendations for the user
    
    Description:
    Loops through the users based on closeness to the input user_id
    For each user - finds articles the user hasn't seen before and provides them as recs
    Does this until m recommendations are found
    
    Notes:
    Users with the same closeness are chosen arbitrarily as the 'next' user
    
    If a user's articles take the number of recommendations from below m 
    to above m, the final items are chosen arbitrarily
    
    '''

    seen_articles  = get_user_articles(user_id)[0]
    users = find_similar_users(user_id)
    
    recs = []
    for user in users:
        rec_articles = get_user_articles(user)[0]

        for article in rec_articles:
            if article not in seen_articles and article not in recs:
                recs.append(article)
                if len(recs) == m:
                    return recs # return your recommendations for this user_id  
In [102]:
# Check Results
get_article_names(user_user_recs(1, 10)) # Return 10 recommendations for user 1
Out[102]:
['this week in data science (april 18, 2017)',
 'timeseries data analysis of iot events by using jupyter notebook',
 'got zip code data? prep it for analytics. – ibm watson data lab – medium',
 'higher-order logistic regression for large datasets',
 'using machine learning to predict parking difficulty',
 'deep forest: towards an alternative to deep neural networks',
 'experience iot with coursera',
 'using brunel in ipython/jupyter notebooks',
 'graph-based machine learning',
 'the 3 kinds of context: machine learning and the art of the frame']
In [103]:
def get_top_sorted_users(user_id, df=df, user_item=user_item):
    '''
    INPUT:
    user_id - (int)
    df - (pandas dataframe) df as defined at the top of the notebook 
    user_item - (pandas dataframe) matrix of users by articles: 
            1's when a user has interacted with an article, 0 otherwise
    
            
    OUTPUT:
    neighbors_df - (pandas dataframe) a dataframe with:
                    neighbor_id - is a neighbor user_id
                    similarity - measure of the similarity of each user to the provided user_id
                    num_interactions - the number of articles viewed by the user
                    
    Other Details - sort the neighbors_df by similarity and then by number of interactions, 
                    with the highest of each at the top of the dataframe
     
    '''

    neighbors_df = pd.DataFrame()
    
    # compute similarity of each user to the provided user
    similarity = user_item[user_item.index != user_id].dot(user_item.loc[user_id])
    
    # sort by similarity
    most_similar_users = similarity.sort_values(ascending=False)
    
    neighbors_df['neighbor_id'] = most_similar_users.index.tolist()
    neighbors_df['similarity'] = most_similar_users.tolist()
    neighbors_df['num_interactions'] = neighbors_df['neighbor_id'].apply(lambda x: df[df['user_id'] == x].shape[0])
    
    # Return the dataframe specified in the doc_string
    return neighbors_df.sort_values(['similarity', 'num_interactions'], ascending=False) 


def user_user_recs_part2(user_id, m=10):
    '''
    INPUT:
    user_id - (int) a user id
    m - (int) the number of recommendations you want for the user
    
    OUTPUT:
    recs - (list) a list of recommendations for the user by article id
    rec_names - (list) a list of recommendations for the user by article title
    
    Description:
    Loops through the users based on closeness to the input user_id
    For each user - finds articles the user hasn't seen before and provides them as recs
    Does this until m recommendations are found
    
    Notes:
    * Choose the users that have the most total article interactions 
    before choosing those with fewer article interactions.

    * Choose articles with the articles with the most total interactions 
    before choosing those with fewer total interactions. 
   
    '''

    seen_articles  = get_user_articles(user_id)[0]
    users = get_top_sorted_users(user_id)['neighbor_id']
    
    recs = []
    users_artic_dict = defaultdict(list)
    for user in users:
        if len(recs) <= m:
            rec_articles = get_user_articles(user)[0]

            for article in rec_articles:
                if article not in seen_articles and article not in recs:
                    # Keep track of users and articles
                    users_artic_dict[user].append(article)

                    recs.append(article)
                    
 
    
    top_articles = get_top_article_ids(df.shape[0])
    # Dictionary for ranking top articles
    top_articles_dict = {article:i for i, article in enumerate(top_articles)}
    
    # remove the final user articles
    no_of_final_user_articles = len(users_artic_dict[user])
    recs = recs[:-no_of_final_user_articles]
    
    # sort rec_articles of the final user based on top articles
    sort_articles = [(article, top_articles_dict[article]) for article in rec_articles]
    sort_articles.sort(key=lambda x: x[1])
    for article_tuple in sort_articles:
        article = article_tuple[0]
        if article not in seen_articles and article not in recs:
            recs.append(article)
            if len(recs) == m:
                return recs, get_article_names(recs) # return your recommendations for this user_id
In [104]:
rec_ids, rec_names = user_user_recs_part2(20, 10)
print("The top 10 recommendations for user 20 are the following article ids:")
print(rec_ids)
print()
print("The top 10 recommendations for user 20 are the following article names:")
print(rec_names)
The top 10 recommendations for user 20 are the following article ids:
['1330.0', '1427.0', '1364.0', '1170.0', '1162.0', '1304.0', '1351.0', '1160.0', '1354.0', '1368.0']

The top 10 recommendations for user 20 are the following article names:
['insights from new york car accident reports', 'use xgboost, scikit-learn & ibm watson machine learning apis', 'predicting churn with the spss random tree algorithm', 'apache spark lab, part 1: basic concepts', 'analyze energy consumption in buildings', 'gosales transactions for logistic regression model', 'model bike sharing data with spss', 'analyze accident reports on amazon emr spark', 'movie recommender system with spark machine learning', 'putting a human face on machine learning']
In [105]:
get_top_sorted_users(1).head(1)
Out[105]:
neighbor_id similarity num_interactions
0 3933 35 45
In [106]:
get_top_sorted_users(131).head(10).iloc[-1]
Out[106]:
neighbor_id         242
similarity           25
num_interactions    148
Name: 9, dtype: int64

Since we have no information about a brand-new user, it makes sense to recommend the most viewed articles across all existing users (Simple Recommender).

In [107]:
new_user = '0.0'

# What would your recommendations be for this new user '0.0'?  As a new user, they have no observed articles.
# Provide a list of the top 10 article ids you would give to a new user
new_user_recs = get_top_article_ids(10) # Your recommendations here
In [108]:
assert set(new_user_recs) == set(['1314.0','1429.0','1293.0','1427.0','1162.0','1364.0','1304.0','1170.0','1431.0','1330.0']), "Oops!  It makes sense that in this case we would want to recommend the most popular articles, because we don't know anything about these users."

print("That's right!  Nice job!")
That's right!  Nice job!
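
Tying the rank-based and user-user pieces together, a small wrapper could fall back to the rank-based recommendations whenever the user is unknown. This is a sketch only; the name make_recs is illustrative and not part of the project template.

In [ ]:
# Sketch of combining the two recommenders (make_recs is an illustrative name).
# Known users get user-user collaborative filtering; brand-new users get the
# rank-based fallback, since we have no interaction history for them.
def make_recs(user_id, m=10):
    if user_id in user_item.index:
        rec_ids, rec_names = user_user_recs_part2(user_id, m)
    else:
        rec_ids = get_top_article_ids(m)
        rec_names = get_article_names(rec_ids)
    return rec_ids, rec_names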
In [109]:
def make_content_recs():
    '''
    INPUT:
    
    OUTPUT:
    
    '''
In [110]:
# make recommendations for a brand new user


# make recommendations for a user who has only interacted with article id '1427.0'
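
The content-based part is left above as an optional extra. As a minimal sketch of one possible approach (assuming scikit-learn is available, and with make_content_recs_sketch and n_recs as illustrative names), the article titles in df_content could be compared with TF-IDF and cosine similarity:

In [ ]:
# Minimal sketch of a content-based recommender (assumptions: scikit-learn is available,
# and make_content_recs_sketch / n_recs are illustrative names, not part of the template).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def make_content_recs_sketch(article_id, n_recs=10, df_content=df_content):
    '''Return up to n_recs article ids from df_content whose titles are most similar
    (cosine similarity of TF-IDF vectors) to the title of the given article_id.'''
    titles = df_content.doc_full_name.reset_index(drop=True)
    ids = df_content.article_id.reset_index(drop=True)

    # TF-IDF vectors of the article titles
    tfidf = TfidfVectorizer(stop_words='english').fit_transform(titles)

    # positional index of the query article (the id must exist in df_content)
    pos = ids[ids == article_id].index[0]
    sims = cosine_similarity(tfidf[pos], tfidf).flatten()

    # most similar articles first, excluding the query article itself
    order = [i for i in sims.argsort()[::-1] if i != pos]
    return ids.iloc[order[:n_recs]].tolist()

# A brand-new user has no history to build a content profile from, so the rank-based
# recommendations (get_top_article_ids) would still be the sensible fallback there.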
In [111]:
# Load the matrix
user_item_matrix = pd.read_pickle('../user_item_matrix.p')

# quick look at the matrix
user_item_matrix.head()
Out[111]:
article_id 0.0 100.0 1000.0 1004.0 1006.0 1008.0 101.0 1014.0 1015.0 1016.0 ... 977.0 98.0 981.0 984.0 985.0 986.0 990.0 993.0 996.0 997.0
user_id
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 714 columns

In [112]:
# Perform SVD on the User-Item Matrix
u, s, vt = np.linalg.svd(user_item_matrix, full_matrices=False) # use the built in to get the three matrices
In [113]:
num_latent_feats = np.arange(10,700+10,20)
sum_errs = []

for k in num_latent_feats:
    # restructure with k latent features
    s_new, u_new, vt_new = np.diag(s[:k]), u[:, :k], vt[:k, :]
    
    # take dot product
    user_item_est = np.around(np.dot(np.dot(u_new, s_new), vt_new))
    
    # compute error for each prediction to actual value
    diffs = np.subtract(user_item_matrix, user_item_est)
    
    # total errors and keep track of them
    err = np.sum(np.sum(np.abs(diffs)))
    sum_errs.append(err)
    
    
plt.plot(num_latent_feats, 1 - np.array(sum_errs)/df.shape[0]);
plt.xlabel('Number of Latent Features');
plt.ylabel('Accuracy');
plt.title('Accuracy vs. Number of Latent Features');
In [114]:
df_train = df.head(40000)
df_test = df.tail(5993)

def create_test_and_train_user_item(df_train, df_test):
    '''
    INPUT:
    df_train - training dataframe
    df_test - test dataframe
    
    OUTPUT:
    user_item_train - a user-item matrix of the training dataframe 
                      (unique users for each row and unique articles for each column)
    user_item_test - a user-item matrix of the testing dataframe 
                    (unique users for each row and unique articles for each column)
    test_idx - all of the test user ids
    test_arts - all of the test article ids
    
    '''

    
    user_item_train = create_user_item_matrix(df_train)
    user_item_test = create_user_item_matrix(df_test)

    # Find users both in test and in train
    train_idx = set(user_item_train.index)
    test_idx = set(user_item_test.index)
    match_idx = train_idx.intersection(test_idx)
    
    # Find articles both in test and in train
    train_arts = set(user_item_train.columns)
    test_arts = set(user_item_test.columns)
    match_cols = train_arts.intersection(test_arts)


    user_item_test = user_item_test.loc[match_idx, match_cols]
    
    return user_item_train, user_item_test, test_idx, test_arts

user_item_train, user_item_test, test_idx, test_arts = create_test_and_train_user_item(df_train, df_test)
In [115]:
# How many users can we make predictions for in the test set?
user_item_test.shape[0]
Out[115]:
20
In [116]:
# How many users in the test set are we not able to make predictions for because of the cold start problem?
len(test_idx) - user_item_test.shape[0] 
Out[116]:
662
In [117]:
# How many articles can we make predictions for in the test set?
user_item_test.shape[1]
Out[117]:
574
In [118]:
# How many articles in the test set are we not able to make predictions for because of the cold start problem?
len(test_arts) - user_item_test.shape[1] 
Out[118]:
0
In [119]:
# fit SVD on the user_item_train matrix
u_train, s_train, vt_train = np.linalg.svd(user_item_train, full_matrices=False) # fit svd similar to above then use the cells below
In [120]:
# Use these cells to see how well you can use the training 
# decomposition to predict on test data
In [121]:
row_idxs = user_item_train.index.isin(test_idx)
col_idxs = user_item_train.columns.isin(test_arts)
u_test = u_train[row_idxs, :]
vt_test = vt_train[:, col_idxs]

num_latent_feats = np.arange(0, 700+10, 20)
sum_errs_train = []
sum_errs_test = []
all_errs = []

for k in num_latent_feats:
    # restructure with k latent features
    s_train_lat, u_train_lat, vt_train_lat = np.diag(s_train[:k]), u_train[:, :k], vt_train[:k, :]
    u_test_lat, vt_test_lat = u_test[:, :k], vt_test[:k, :]
    
    # take dot product
    user_item_train_preds = np.around(np.dot(np.dot(u_train_lat, s_train_lat), vt_train_lat))
    user_item_test_preds = np.around(np.dot(np.dot(u_test_lat, s_train_lat), vt_test_lat))
    all_errs.append(1 - ((np.sum(user_item_test_preds)+np.sum(np.sum(user_item_test)))/(user_item_test.shape[0]*user_item_test.shape[1])))
    
    
    # compute error for each prediction to actual value
    diffs_train = np.subtract(user_item_train, user_item_train_preds)
    diffs_test = np.subtract(user_item_test, user_item_test_preds)
    
    # total errors and keep track of them
    err_train = np.sum(np.sum(np.abs(diffs_train)))
    err_test = np.sum(np.sum(np.abs(diffs_test)))
    
    sum_errs_train.append(err_train)
    sum_errs_test.append(err_test)
    
plt.plot(num_latent_feats, 1 - np.array(sum_errs_train)/(user_item_train.shape[0]*user_item_train.shape[1]), label='Train', color='r');
plt.plot(num_latent_feats, 1 - np.array(sum_errs_test)/(user_item_test.shape[0]*user_item_test.shape[1]), label='Test', color='c');
plt.plot(num_latent_feats, all_errs, label='All Data');
plt.xlabel('Number of Latent Features');
plt.ylabel('Accuracy');
plt.title('Accuracy vs. Number of Latent Features');
plt.legend();

The accuracy of around 96% with around 300 latent features appears good, but it has two drawbacks: it is based on only 20 test users, and the classes are highly imbalanced (1s are approximately 1% of the dataset). Calculating and plotting precision, recall and the F1 score would give better insight into the actual performance of the predictions.
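
As a minimal sketch of that idea (assuming scikit-learn is available in this environment), the rounded predictions from the last iteration of the loop above could be scored against user_item_test like this:

In [ ]:
# Minimal sketch (assumption: scikit-learn is available). Scores the rounded SVD
# predictions from the final iteration of the loop above against the actual test matrix.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = user_item_test.values.flatten()            # actual 0/1 interactions
y_pred = user_item_test_preds.clip(0, 1).flatten()  # rounded predictions, clipped to 0/1

print('precision: {:.3f}'.format(precision_score(y_true, y_pred)))
print('recall:    {:.3f}'.format(recall_score(y_true, y_pred)))
print('f1 score:  {:.3f}'.format(f1_score(y_true, y_pred)))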

A potential solution to the problem of limited users would be to set up an online experiment in which we randomly split users into a control group that receives no recommendations and a treatment group that receives recommendations. The null hypothesis would be that there is no difference in the number of interactions between the two groups, which we can test against an alpha threshold to determine whether the recommendations have a statistically significant effect.
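
As a rough sketch of how that test could be run once the interaction counts per user have been collected from the experiment (assuming scipy is available; evaluate_experiment and its arguments are illustrative names, not project code):

In [ ]:
# Rough sketch of evaluating the online experiment (assumptions: scipy is available,
# and evaluate_experiment / its arguments are illustrative names).
from scipy import stats

def evaluate_experiment(control_interactions, treatment_interactions, alpha=0.05):
    '''Welch's two-sample t-test on interactions per user.
    control_interactions   - observed interaction counts for users shown no recommendations
    treatment_interactions - observed interaction counts for users shown recommendations
    H0: the mean number of interactions is the same in both groups.'''
    t_stat, p_value = stats.ttest_ind(control_interactions,
                                      treatment_interactions,
                                      equal_var=False)
    return t_stat, p_value, p_value < alpha  # True -> reject H0 at the alpha level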

TO BE CONTINUED