{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Practice PS06: Recommendation engines (interactions-based)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For this assignment we will build and apply item-based and model-based collaborative filtering recommenders for movies."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Author: Your name here\n",
"\n",
"E-mail: Your e-mail here\n",
"\n",
"Date: The current date here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. The Movies dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will use the same dataset as in PS05: the 25M version of the [MovieLens dataset](https://grouplens.org/datasets/movielens/), released in late 2019. We will use a subset containing only movies released in the 2000s, and only 10% of the users with all of their ratings.\n",
"\n",
"* **MOVIES** are described in `movies-2000s.csv` in the following format: `movieId,title,genres`.\n",
"* **RATINGS** are contained in `ratings-2000s.csv` in the following format: `userId,movieId,rating`\n",
"* **TAGS** are contained in `tags-2000s.csv` in the following format: `userId,movieId,tag,timestamp`\n",
"\n",
"(Remove this cell when delivering.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1.1. Load the input files"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Leave this code as-is\n",
"\n",
"import numpy as np\n",
"import pandas as pd \n",
"import matplotlib.pyplot as plt\n",
"from math import*\n",
"from scipy.sparse.linalg import svds\n",
"from sklearn.metrics.pairwise import linear_kernel"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Leave this code as-is\n",
"\n",
"FILENAME_MOVIES = \"movies-2000s.csv\"\n",
"FILENAME_RATINGS = \"ratings-2000s.csv\"\n",
"FILENAME_TAGS = \"tags-2000s.csv\""
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
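"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>movie_id</th>\n",
"      <th>title</th>\n",
"      <th>genres</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>2769</td>\n",
"      <td>Yards, The (2000)</td>\n",
"      <td>Crime|Drama</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>3177</td>\n",
"      <td>Next Friday (2000)</td>\n",
"      <td>Comedy</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>3190</td>\n",
"      <td>Supernova (2000)</td>\n",
"      <td>Adventure|Sci-Fi|Thriller</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>3225</td>\n",
"      <td>Down to You (2000)</td>\n",
"      <td>Comedy|Romance</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>3228</td>\n",
"      <td>Wirey Spindell (2000)</td>\n",
"      <td>Comedy</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"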
],
"text/plain": [
" movie_id title genres\n",
"0 2769 Yards, The (2000) Crime|Drama\n",
"1 3177 Next Friday (2000) Comedy\n",
"2 3190 Supernova (2000) Adventure|Sci-Fi|Thriller\n",
"3 3225 Down to You (2000) Comedy|Romance\n",
"4 3228 Wirey Spindell (2000) Comedy"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
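"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>user_id</th>\n",
"      <th>movie_id</th>\n",
"      <th>rating</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>4</td>\n",
"      <td>1</td>\n",
"      <td>3.0</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>4</td>\n",
"      <td>260</td>\n",
"      <td>3.5</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>4</td>\n",
"      <td>296</td>\n",
"      <td>4.0</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>4</td>\n",
"      <td>541</td>\n",
"      <td>4.5</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>4</td>\n",
"      <td>589</td>\n",
"      <td>4.0</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"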
],
"text/plain": [
" user_id movie_id rating\n",
"0 4 1 3.0\n",
"1 4 260 3.5\n",
"2 4 296 4.0\n",
"3 4 541 4.5\n",
"4 4 589 4.0"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Leave this code as-is\n",
"\n",
"movies = pd.read_csv(FILENAME_MOVIES, \n",
" sep=',', \n",
" engine='python', \n",
" encoding='latin-1',\n",
" names=['movie_id', 'title', 'genres'])\n",
"display(movies.head(5))\n",
"\n",
"ratings_raw = pd.read_csv(FILENAME_RATINGS, \n",
" sep=',', \n",
" encoding='latin-1',\n",
" engine='python',\n",
" names=['user_id', 'movie_id', 'rating'])\n",
"display(ratings_raw.head(5))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1.2. Merge the data into a single dataframe"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Join the data into a single dataframe that should contain the columns user_id, movie_id, rating, title, and genres.\n",
"\n",
"(Remove this cell when delivering.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace this cell with your code from the previous practice that joined these dataframes using \"merge\" into a single dataframe named \"ratings\". Print the first 5 rows of the resulting dataframe, which should contain the columns \"user_id\", \"movie_id\", \"rating\", \"title\", and \"genres\"."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace this cell with your code from the previous practice for \"find_movies\", which lists the movies whose title contains a keyword."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"movie_id: 4993, title: Lord of the Rings: The Fellowship of the Ring, The (2001)\n",
"movie_id: 5952, title: Lord of the Rings: The Two Towers, The (2002)\n",
"movie_id: 7153, title: Lord of the Rings: The Return of the King, The (2003)\n"
]
}
],
"source": [
"# LEAVE AS-IS\n",
"\n",
"# For testing, this should print:\n",
"# movie_id: 4993, title: Lord of the Rings: The Fellowship of the Ring, The (2001)\n",
"# movie_id: 5952, title: Lord of the Rings: The Two Towers, The (2002)\n",
"# movie_id: 7153, title: Lord of the Rings: The Return of the King, The (2003)\n",
"find_movies(\"Lord of the Rings\", movies)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following function, which you can leave as-is, returns the title of a movie given its movie_id.\n",
"\n",
"(Remove this cell when delivering.)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"# LEAVE AS-IS\n",
"\n",
"def get_title(movie_id, movies):\n",
"    return movies[movies['movie_id'] == movie_id].title.iloc[0]"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Lord of the Rings: The Return of the King, The (2003)\n"
]
}
],
"source": [
"# LEAVE AS-IS\n",
"\n",
"# For testing, should print \"Lord of the Rings: The Return of the King, The (2003)\")\n",
"print(get_title(7153, movies))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1.3. Count unique records"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Count the number of unique users and unique movies in the `ratings` variable. Use [unique()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html). Also print the total number of movies in the `movies` variable. Your code should print:\n",
"\n",
"```\n",
"Number of users who have rated a movie : 12676\n",
"Number of movies that have been rated : 2049\n",
"Total number of movies : 33168\n",
"```\n",
"\n",
"Note that ratings are heavily concentrated on a few popular movies.\n",
"\n",
"(Remove this cell when delivering.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace this cell with your own code to indicate the number of unique users and unique movies in the \"ratings\" variable."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. Item-based Collaborative Filtering"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two main types of interactions-based recommender systems, also known as *collaborative filtering* algorithms, are:\n",
"\n",
"1. **User-based Collaborative Filtering**: To recommend items for user A, we first look at other users B1, B2, ..., Bk with a similar behavior to A, and aggregate their preferences. For instance, if all Bi like a movie that A has not watched, it would be a good candidate to be recommended. \n",
"\n",
"\n",
"2. **Item-based Collaborative Filtering**: To recommend items for user A, we first look at all the items I1, I2, ..., Ik that user A has consumed, and find items that elicit similar ratings from other users. For instance, an item that is rated positively by the same users who rate the Ii items positively, and negatively by the same users who rate the Ii items negatively, would be a good candidate to be recommended.\n",
"\n",
"In both cases, a similarity matrix needs to be built. For user-based filtering, the **user-similarity matrix** contains a **similarity metric** computed for every pair of users. For item-based filtering, the **item-similarity matrix** contains the similarity for every pair of items.\n",
"\n",
"As we already know, there are several metrics for measuring the \"similarity\" of two items. Some of the most used are Jaccard, cosine, and Pearson. Jaccard similarity is the number of users who have rated both item A and item B divided by the number of users who have rated either A or B (very useful when there is no numeric rating but just a boolean value, such as a product being bought). Pearson and cosine similarities, in contrast, measure the similarity between two vectors of ratings.\n",
"\n",
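"As a sketch, the three metrics can be written as formulas. The notation is an assumption of this note rather than part of the assignment: $U_a$ is the set of users who rated item $a$, $r_{u,a}$ is the rating user $u$ gave to item $a$, and $\\bar{r}_a$ is the mean rating of item $a$ over the users in $U_a \\cap U_b$. Restricting cosine and Pearson to users who rated both items is one common convention, not the only one.\n",
"\n",
"$$\\mathrm{Jaccard}(a,b) = \\frac{|U_a \\cap U_b|}{|U_a \\cup U_b|}$$\n",
"\n",
"$$\\mathrm{Cosine}(a,b) = \\frac{\\sum_{u \\in U_a \\cap U_b} r_{u,a}\\, r_{u,b}}{\\sqrt{\\sum_{u \\in U_a \\cap U_b} r_{u,a}^2}\\,\\sqrt{\\sum_{u \\in U_a \\cap U_b} r_{u,b}^2}}$$\n",
"\n",
"$$\\mathrm{Pearson}(a,b) = \\frac{\\sum_{u \\in U_a \\cap U_b} (r_{u,a} - \\bar{r}_a)(r_{u,b} - \\bar{r}_b)}{\\sqrt{\\sum_{u \\in U_a \\cap U_b} (r_{u,a} - \\bar{r}_a)^2}\\,\\sqrt{\\sum_{u \\in U_a \\cap U_b} (r_{u,b} - \\bar{r}_b)^2}}$$\n",
"\n",
"For example, if 30 users rated A, 20 users rated B, and 10 users rated both, then $|U_a \\cup U_b| = 30 + 20 - 10 = 40$ and the Jaccard similarity is $10 / 40 = 0.25$. Pearson similarity is the cosine similarity of the mean-centered rating vectors, so it ranges from -1 to 1 and is not affected by one movie being rated higher on average than the other.\n",
"\n",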
"For the purpose of this assignment, we will use **Pearson similarity** and we will implement **item-based collaborative filtering**.\n",
"\n",
"(Remove this cell when delivering.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.1. Data pre-processing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, create a new dataframe called \"rated_movies\" that is simply the \"ratings\" dataframe with the column `genres` removed, using the [drop](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) method.\n",
"\n",
"(Remove this cell when delivering.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace this cell with your code to generate \"rated_movies\" and print the first ten rows. This should have columns user_id, movie_id, rating, title"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, using the `rated_movies` dataframe, create a new dataframe named `ratings_summary` containing the following columns:\n",
"\n",
"* movie_id\n",
"* title\n",
"* ratings_mean (average rating)\n",
"* ratings_count (number of people who have rated this movie)\n",
"\n",
"You can use the following operations:\n",
"\n",
"* Initialize `ratings_summary` to be only the movie_id and title of all movies in `rated_movies`\n",
"    * To group dataframe `df` by column `a` and keep only one unique row per value of `a`, use: `df.groupby('a').first()`\n",
"* Compute two series: `ratings_mean` and `ratings_count`:\n",
"    * To obtain a series with the average of column `a` for each distinct value of column `b` in dataframe `df`, use: `df.groupby('b')['a'].mean()`\n",
"    * To obtain a series with the count of column `a` for each distinct value of column `b` in dataframe `df`, use: `df.groupby('b')['a'].count()`\n",
"* Add these series to the `ratings_summary`\n",
"    * To add a series `s` with column name `a` to dataframe `df`, use: `df['a'] = s`\n",
" \n",
"(Remove this cell when delivering.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace this cell with your code to generate \"ratings_summary\" and print the first 5 rows."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To select from dataframe A the rows whose column C is greater than or equal to N, you can do `A[A.C >= N]`.\n",
"\n",
"To sort dataframe A by decreasing values of column C, you can do `A.sort_values(by='C', ascending=False)`.\n",
"\n",
"(Remove this cell when delivering.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace this cell with code to print the top 5 highest rated movies, considering only movies receiving at least 2500 ratings."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Repeat this, but this time consider movies receiving at least 3 ratings. What is the difference? How would you explain this?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.2. Compute the user-movie matrix"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before calculating the **similarity matrix**, we create a table where columns are movies and rows are users, and each movie-user cell contains the score of that user for that movie.\n",
"\n",
"We will use the [pivot_table](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html) function of Pandas, which receives a dataframe plus the variable that will make the rows, the variable that will make the columns, and the variable that will make the cells, and transforms it into a matrix with the specified rows, columns, and cells.\n",
"\n",
"For instance, if you have a dataframe D containing:\n",
"\n",
"```\n",
"U V W\n",
"1 a 3.0\n",
"1 b 2.0\n",
"2 a 1.0\n",
"2 c 4.0\n",
"```\n",
"\n",
"Calling `D.pivot_table(index='U', columns='V', values='W')` will create the following:\n",
"\n",
"```\n",
"V a b c\n",
"U\n",
"1 3.0 2.0 NaN\n",
"2 1.0 NaN 4.0\n",
"```\n",
"\n",
"(Remove this cell when delivering.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace this cell with code to generate a \"user_movie\" matrix by calling \"pivot_table\" on \"rated_movies\". Print the first 5 rows. It might take about one minute to compute, depending on your computer."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace this cell with a brief commentary indicating why you think the \"user_movie\" matrix has so many \"NaN\" values. What do we call this characteristic of user ratings in recommender systems?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.3. Explore some correlations in the user-movie matrix"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let us explore whether correlations in this matrix make sense.\n",
"\n",
"1. Locate the movie_id for the following three movies:\n",
"    * [Lord of the Rings: The Fellowship of the Ring (2001)](https://en.wikipedia.org/wiki/The_Lord_of_the_Rings:_The_Fellowship_of_the_Ring) -- name this id_pivot\n",
"    * [Finding Nemo (2003)](https://en.wikipedia.org/wiki/Finding_Nemo) -- name this id_m1\n",
"    * [Talk to Her (Hable con Ella) (2002)](https://en.wikipedia.org/wiki/Talk_to_Her) -- name this id_m2\n",
"2. Obtain the ratings for each of these movies: `user_movie[movie_id].dropna()`. For each movie, you will obtain a series containing its ratings.\n",
"3. Consolidate these three series into a single dataframe: `ratings3 = pd.concat([s1, s2, s3], axis=1)`\n",
"4. Drop from `ratings3` all rows containing a *NaN*. This will keep only the users that have rated all three movies.\n",
"5. Display the first 10 rows from this table.\n",
"\n",
"(Remove this cell when delivering.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace this cell with code to compute and display the first 10 rows of the \"ratings3\" table as described above."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To compute Pearson correlation, we use the [corr](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.corr.html) method.\n",
"\n",
"To compute the correlation between two columns `a`, `b` in dataframe `df`, we use: `df[a].corr(df[b])`.\n",
"\n",
"Compute the correlations between all pairs of columns of the `ratings3` table. You should display:\n",
"\n",
"```\n",
"Similarity between 'Lord of the Rings: The Fellowship of the Ring, The (2001)' and 'Finding Nemo (2003)': 0.38\n",
"Similarity between 'Lord of the Rings: The Fellowship of the Ring, The (2001)' and 'Talk to Her (Hable con Ella) (2002)': 0.16\n",
"Similarity between 'Finding Nemo (2003)' and 'Talk to Her (Hable con Ella) (2002)': 0.20\n",
"```\n",
"\n",
"(Remove this cell when delivering.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace this cell with code to compute all correlations between these three movies, as described above."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace this cell with a brief commentary on the correlations you find."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let us take the first movie selected above, the one with movie_id `id_pivot`.\n",
"\n",
"Select the column corresponding to this movie in `user_movie` and compute its correlation with all other columns in `user_movie`. This can be done with [corrwith](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corrwith.html).\n",
"\n",
"*Note 1*: You might receive a runtime warning on degrees of freedom and/or division by zero, which you can safely ignore. It simply means that there are some columns that have no elements in common with the given column, or only one element in common, and thus for which the correlation cannot be computed.\n",
"\n",
"*Note 2*: The similarities that you computed above with `corr` are limited to the set of users who rated all three movies. In contrast, the similarities that you compute below with `corrwith` use, for each movie, all users who rated both the pivot and that movie. Hence, the results could be different.\n",
"\n",
"(Remove this cell when delivering.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace this cell with code to create a \"similar_to_pivot\" series that contains the computed correlations, dropping the NaNs in the series."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, create a dataframe `corr_with_pivot` by using `similar_to_pivot` and `ratings_summary`. This dataframe should have the following columns:\n",
"\n",
"* corr - the correlation between each movie and the selected movie\n",
"* title\n",
"* ratings_mean\n",
"* ratings_count\n",
"\n",
"To create a dataframe `df` from a series `s`, use: `df = pd.DataFrame(s, columns=['colname'])`. \n",
"\n",
"Keep only rows in which *ratings_count* > 500, i.e., popular movies. To filter a dataframe `df` and keep only rows having column `c` larger than `x`, use `df[df[c] > x]`.\n",
"\n",
"Display the top 10 rows with the largest correlation. To select the largest `n` rows from dataframe `df` according to column `c`, use `df.sort_values(c, ascending=False).head(n)`. \n",
"\n",
"(Remove this cell when delivering.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace this cell with code to create a \"corr_with_pivot\" dataframe as specified above, and to print the 10 movies (rated more than 500 times) with the highest correlation with the selected movie."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace this cell with a brief commentary about the movies you see on this list. What happens if you set the condition on *ratings_count* to a much larger value? What happens if you set it to a much smaller value?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.4. Implement the item-based recommendations"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we believe that this type of correlation sort of makes sense, let us implement the item-based recommender. We need all correlations between columns in `user_movie`.\n",
"\n",
"To compute all correlations between columns in a dataframe, use [corr](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html). This function receives a matrix with *r* rows and *c* columns, and returns a square matrix of *c x c* containing all pair-wise correlations between columns.\n",
"\n",
"**This process may take a few minutes.** Print the first 5 rows of the resulting matrix when done.\n",
"\n",
"(Remove this cell when delivering.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace this cell with your code to compute all correlations between columns (movies) in the matrix user_movie. Store this in \"item_similarity\", and print the first 5 rows."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similarities between movies that do not have many ratings in common are unreliable. Fortunately, the `corr` method includes a parameter `min_periods` that establishes a minimum number of elements in common that two columns must have to compute the correlation.\n",
"\n",
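"The effect of `min_periods` can be sketched as a formula. The notation is an assumption of this note: $U_a$ is the set of users who rated movie $a$, $n_{ab} = |U_a \\cap U_b|$ is the number of users who rated both movies $a$ and $b$, $\\rho_{ab}$ is the Pearson correlation between their ratings computed over those users, and $m$ is the value passed as `min_periods`. Each entry of the resulting matrix is then\n",
"\n",
"$$S_{ab} = \\begin{cases} \\rho_{ab} & \\text{if } n_{ab} \\ge m \\\\ \\text{NaN} & \\text{otherwise} \\end{cases}$$\n",
"\n",
"A larger $m$ therefore yields a matrix with more *NaN* entries, while the entries that remain are supported by more users.\n",
"\n",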
"Re-generate item_similarity setting min_periods to 100.\n",
"\n",
"This process will also take a few minutes. Print the first 5 rows of the resulting matrix when done.\n",
"\n",
"(Remove this cell when delivering.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace this cell with your code to compute all correlations between columns (movies) in the matrix user_movie, but considering only movies having at least 100 ratings in common. Store this in \"item_similarity_min_ratings\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will need to test our function so let us select a couple of interesting users.\n",
"\n",
"Our first user, `user_id_super` will be someone who has given the following 3 films a rating higher than 4.5:\n",
"\n",
"* movie_id=5349: *Spider-Man (2002)*\n",
"* movie_id=3793: *X-Men (2000)*\n",
"* movie_id=6534: *Hulk (2003)*\n",
"\n",
"Our second user, `user_id_drama` will be someone who has given the following 3 films a rating higher than 4.5:\n",
"\n",
"* movie_id=6870: *Mystic River (2003)*\n",
"* movie_id=5995: *Pianist, The (2002)*\n",
"* movie_id=3555: *U-571 (2000)*\n",
"\n",
"To filter a dataframe `df` by multiple conditions on its columns `a` and `b` you can use, e.g., `df[(df.a > 1) & (df.b > 2)]`.\n",
"\n",
"**Important**: these particular users have watched lots of movies, so we cannot tell for sure they have only these interests.\n",
"\n",
"(Remove this cell when delivering.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace this cell with your code to find the user ids of two example users: user_id_super (the one who liked the three superhero movies), and user_id_drama (the one who liked the three dramas)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will need some auxiliary functions that are provided below. You can leave them as-is.\n",
"\n",
"(Remove this cell when delivering.)"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [],
"source": [
"# Leave this code as-is\n",
"\n",
"# Gets a list of watched movies for a user_id\n",
"def get_watched_movies(user_id, user_movie):\n",
"    return list(user_movie.loc[user_id].dropna().sort_values(ascending=False).index)\n",
"\n",
"# Gets the rating a user_id has given to a movie_id\n",
"def get_rating(user_id, movie_id, user_movie):\n",
"    return user_movie[movie_id][user_id]\n",
"\n",
"# Print watched movies\n",
"def print_watched_movies(user_id, user_movie, movies):\n",
"    for movie_id in get_watched_movies(user_id, user_movie):\n",
"        print(\"%d %.1f %s \" %\n",
"              (movie_id, get_rating(user_id, movie_id, user_movie), get_title(movie_id, movies)))\n"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5502 5.0 Signs (2002) \n",
"5445 5.0 Minority Report (2002) \n",
"6156 5.0 Shanghai Knights (2003) \n",
"5952 5.0 Lord of the Rings: The Two Towers, The (2002) \n",
"5944 5.0 Star Trek: Nemesis (2002) \n",
"5816 5.0 Harry Potter and the Chamber of Secrets (2002) \n",
"5618 5.0 Spirited Away (Sen to Chihiro no kamikakushi) (2001) \n",
"5524 5.0 Blue Crush (2002) \n",
"5480 5.0 Stuart Little 2 (2002) \n",
"5459 5.0 Men in Black II (a.k.a. MIIB) (a.k.a. MIB 2) (2002) \n",
"5420 5.0 Windtalkers (2002) \n",
"4388 5.0 Scary Movie 2 (2001) \n",
"5389 5.0 Spirit: Stallion of the Cimarron (2002) \n",
"5349 5.0 Spider-Man (2002) \n",
"5218 5.0 Ice Age (2002) \n",
"5064 5.0 The Count of Monte Cristo (2002) \n",
"4993 5.0 Lord of the Rings: The Fellowship of the Ring, The (2001) \n",
"4973 5.0 Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001) \n",
"4896 5.0 Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001) \n",
"4886 5.0 Monsters, Inc. (2001) \n",
"6186 5.0 Gods and Generals (2003) \n",
"6333 5.0 X2: X-Men United (2003) \n",
"6377 5.0 Finding Nemo (2003) \n",
"6534 5.0 Hulk (2003) \n",
"30816 5.0 Phantom of the Opera, The (2004) \n",
"30812 5.0 Aviator, The (2004) \n",
"8972 5.0 National Treasure (2004) \n",
"8644 5.0 I, Robot (2004) \n",
"8636 5.0 Spider-Man 2 (2004) \n",
"8622 5.0 Fahrenheit 9/11 (2004) \n",
"8533 5.0 Notebook, The (2004) \n",
"8368 5.0 Harry Potter and the Prisoner of Azkaban (2004) \n",
"8361 5.0 Day After Tomorrow, The (2004) \n",
"8360 5.0 Shrek 2 (2004) \n",
"7454 5.0 Van Helsing (2004) \n",
"7324 5.0 Hidalgo (2004) \n",
"7153 5.0 Lord of the Rings: The Return of the King, The (2003) \n",
"6946 5.0 Looney Tunes: Back in Action (2003) \n",
"6761 5.0 Tibet: Cry of the Snow Lion (2002) \n",
"6565 5.0 Seabiscuit (2003) \n",
"6539 5.0 Pirates of the Caribbean: The Curse of the Black Pearl (2003) \n",
"4701 5.0 Rush Hour 2 (2001) \n",
"33162 5.0 Kingdom of Heaven (2005) \n",
"4270 5.0 Mummy Returns, The (2001) \n",
"3624 5.0 Shanghai Noon (2000) \n",
"3785 5.0 Scary Movie (2000) \n",
"3753 5.0 Patriot, The (2000) \n",
"3988 5.0 How the Grinch Stole Christmas (a.k.a. The Grinch) (2000) \n",
"3793 5.0 X-Men (2000) \n",
"3751 5.0 Chicken Run (2000) \n",
"3827 5.0 Space Cowboys (2000) \n",
"4306 5.0 Shrek (2001) \n",
"3623 5.0 Mission: Impossible II (2000) \n",
"3598 5.0 Hamlet (2000) \n",
"3863 5.0 Cell, The (2000) \n",
"3578 5.0 Gladiator (2000) \n",
"4370 5.0 A.I. Artificial Intelligence (2001) \n",
"3755 5.0 Perfect Storm, The (2000) \n",
"6567 4.5 Buffalo Soldiers (2001) \n",
"5994 4.5 Nicholas Nickleby (2002) \n",
"6541 4.5 League of Extraordinary Gentlemen, The (a.k.a. LXG) (2003) \n",
"5992 4.5 Hours, The (2002) \n",
"5956 4.5 Gangs of New York (2002) \n",
"3981 4.5 Red Planet (2000) \n",
"4386 4.5 Cats & Dogs (2001) \n",
"6617 4.5 Open Range (2003) \n",
"6766 4.5 Camera Obscura (2000) \n",
"6947 4.5 Master and Commander: The Far Side of the World (2003) \n",
"7143 4.5 Last Samurai, The (2003) \n",
"7164 4.5 Peter Pan (2003) \n",
"8526 4.5 Around the World in 80 Days (2004) \n",
"8640 4.5 King Arthur (2004) \n",
"8961 4.5 Incredibles, The (2004) \n",
"8977 4.5 Alexander (2004) \n",
"3481 4.5 High Fidelity (2000) \n",
"3408 4.5 Erin Brockovich (2000) \n",
"5881 4.5 Solaris (2002) \n",
"6211 4.5 Ten (2002) \n",
"5171 4.5 Time Machine, The (2002) \n",
"3354 4.5 Mission to Mars (2000) \n",
"4016 4.5 Emperor's New Groove, The (2000) \n",
"4368 4.5 Dr. Dolittle 2 (2001) \n",
"5378 4.5 Star Wars: Episode II - Attack of the Clones (2002) \n",
"5313 4.5 The Scorpion King (2002) \n",
"4020 4.5 Gift, The (2000) \n",
"5444 4.5 Lilo & Stitch (2002) \n",
"4638 4.5 Jurassic Park III (2001) \n",
"4783 4.5 Endurance: Shackleton's Legendary Antarctic Expedition, The (2000) \n",
"4995 4.0 Beautiful Mind, A (2001) \n",
"4299 4.0 Knight's Tale, A (2001) \n",
"7255 4.0 Win a Date with Tad Hamilton! (2004) \n",
"3996 4.0 Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000) \n",
"3745 4.0 Titan A.E. (2000) \n",
"8639 4.0 Clearing, The (2004) \n",
"4254 4.0 Crocodile Dundee in Los Angeles (2001) \n",
"3744 4.0 Shaft (2000) \n",
"3555 4.0 U-571 (2000) \n",
"4069 4.0 Wedding Planner, The (2001) \n",
"6550 4.0 Johnny English (2003) \n",
"4367 4.0 Lara Croft: Tomb Raider (2001) \n",
"6959 4.0 Timeline (2003) \n",
"3826 4.0 Hollow Man (2000) \n",
"6537 4.0 Terminator 3: Rise of the Machines (2003) \n",
"4366 4.0 Atlantis: The Lost Empire (2001) \n",
"5833 4.0 Dog Soldiers (2002) \n",
"30846 4.0 Assassination of Richard Nixon, The (2004) \n",
"5876 4.0 Quiet American, The (2002) \n",
"5530 4.0 Simone (S1m0ne) (2002) \n",
"4446 4.0 Final Fantasy: The Spirits Within (2001) \n",
"3997 4.0 Dungeons & Dragons (2000) \n",
"3998 4.0 Proof of Life (2000) \n",
"6060 4.0 Guru, The (2002) \n",
"3977 4.0 Charlie's Angels (2000) \n",
"3889 4.0 Highlander: Endgame (Highlander IV) (2000) \n",
"5463 4.0 Reign of Fire (2002) \n",
"8666 4.0 Catwoman (2004) \n",
"8371 4.0 Chronicles of Riddick, The (2004) \n",
"8979 3.5 Guerrilla: The Taking of Patty Hearst (2004) \n",
"3564 3.5 Flintstones in Viva Rock Vegas, The (2000) \n",
"8981 3.5 Closer (2004) \n",
"4756 3.5 Musketeer, The (2001) \n",
"4643 3.5 Planet of the Apes (2001) \n",
"3510 3.5 Frequency (2000) \n",
"5670 3.5 Comedian (2002) \n",
"4958 3.5 Behind Enemy Lines (2001) \n",
"5872 3.5 Die Another Day (2002) \n",
"5882 3.5 Treasure Planet (2002) \n",
"5071 3.5 Maelström (2000) \n",
"4232 3.5 Spy Kids (2001) \n",
"6373 3.5 Bruce Almighty (2003) \n",
"4310 3.5 Pearl Harbor (2001) \n",
"6264 3.5 Core, The (2003) \n",
"6220 3.0 Willard (2003) \n",
"30894 3.0 White Noise (2005) \n",
"3986 3.0 6th Day, The (2000) \n",
"5504 3.0 Spy Kids 2: The Island of Lost Dreams (2002) \n",
"6157 3.0 Daredevil (2003) \n",
"8907 3.0 Shark Tale (2004) \n",
"6703 3.0 Order, The (2003) \n",
"5328 3.0 Rain (2001) \n",
"7846 3.0 Tremors 3: Back to Perfection (2001) \n",
"8865 2.5 Sky Captain and the World of Tomorrow (2004) \n",
"7458 2.5 Troy (2004) \n",
"3593 2.5 Battlefield Earth (2000) \n",
"5478 0.5 Eight Legged Freaks (2002) \n",
"3300 0.5 Pitch Black (2000) \n"
]
}
],
"source": [
"# LEAVE AS-IS (TESTING CODE)\n",
"\n",
"print_watched_movies(user_id_super, user_movie, movies)"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3967 5.0 Billy Elliot (2000) \n",
"4014 5.0 Chocolat (2000) \n",
"4034 5.0 Traffic (2000) \n",
"5995 5.0 Pianist, The (2002) \n",
"7147 5.0 Big Fish (2003) \n",
"4995 5.0 Beautiful Mind, A (2001) \n",
"3555 5.0 U-571 (2000) \n",
"6870 5.0 Mystic River (2003) \n",
"5991 5.0 Chicago (2002) \n",
"8464 5.0 Super Size Me (2004) \n",
"5669 5.0 Bowling for Columbine (2002) \n",
"8622 5.0 Fahrenheit 9/11 (2004) \n",
"30707 5.0 Million Dollar Baby (2004) \n",
"6953 4.5 21 Grams (2003) \n",
"5015 4.5 Monster's Ball (2001) \n",
"5464 4.5 Road to Perdition (2002) \n",
"3510 4.5 Frequency (2000) \n",
"5989 4.5 Catch Me If You Can (2002) \n",
"4022 4.0 Cast Away (2000) \n",
"5010 4.0 Black Hawk Down (2001) \n",
"5299 4.0 My Big Fat Greek Wedding (2002) \n",
"3897 4.0 Almost Famous (2000) \n",
"3755 4.0 Perfect Storm, The (2000) \n",
"4308 4.0 Moulin Rouge (2001) \n",
"4447 3.5 Legally Blonde (2001) \n",
"4246 3.5 Bridget Jones's Diary (2001) \n",
"4975 3.5 Vanilla Sky (2001) \n",
"4019 3.5 Finding Forrester (2000) \n",
"5377 3.5 About a Boy (2002) \n",
"3948 3.5 Meet the Parents (2000) \n",
"5956 3.0 Gangs of New York (2002) \n",
"6281 3.0 Phone Booth (2002) \n",
"7143 3.0 Last Samurai, The (2003) \n",
"7458 3.0 Troy (2004) \n",
"5349 3.0 Spider-Man (2002) \n",
"5902 3.0 Adaptation (2002) \n",
"5418 3.0 Bourne Identity, The (2002) \n",
"4993 3.0 Lord of the Rings: The Fellowship of the Ring, The (2001) \n",
"4310 3.0 Pearl Harbor (2001) \n",
"4025 2.5 Miss Congeniality (2000) \n",
"3578 2.5 Gladiator (2000) \n",
"8644 2.5 I, Robot (2004) \n",
"4018 2.5 What Women Want (2000) \n",
"5679 2.5 Ring, The (2002) \n",
"30825 2.5 Meet the Fockers (2004) \n",
"4027 2.5 O Brother, Where Art Thou? (2000) \n",
"4776 2.5 Training Day (2001) \n",
"4963 2.5 Ocean's Eleven (2001) \n",
"7438 2.0 Kill Bill: Vol. 2 (2004) \n",
"3916 2.0 Remember the Titans (2000) \n",
"8665 2.0 Bourne Supremacy, The (2004) \n",
"30793 2.0 Charlie and the Chocolate Factory (2005) \n",
"7153 2.0 Lord of the Rings: The Return of the King, The (2003) \n",
"6934 1.5 Matrix Revolutions, The (2003) \n",
"3752 1.0 Me, Myself & Irene (2000) \n",
"8798 1.0 Collateral (2004) \n"
]
}
],
"source": [
"# LEAVE AS-IS (TESTING CODE)\n",
"\n",
"print_watched_movies(user_id_drama, user_movie, movies)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For every user, we will consider that the importance of a new movie (a movie s/he has not rated) will be equal to the sum of the similarities between that new movie and all the movies the user has already rated.\n",
"\n",
"To further improve this, we will compute a weighted sum, in which the weight of each similarity is the rating the user gave to the already-rated movie.\n",
"\n",
"For instance, suppose a user has rated movies as follows:\n",
"\n",
"```\n",
"movie_id rating\n",
"1 2.0\n",
"2 3.0\n",
"3 NaN\n",
"4 NaN\n",
"```\n",
"\n",
"And that movie similarities are as follows (values with a \".\" do not matter in this example):\n",
"\n",
"```\n",
"movie_id 1 2 3 4\n",
"1 ...............\n",
"2 ...............\n",
"3 0.1 0.2 NaN ...\n",
"4 0.9 0.8 ... NaN\n",
"```\n",
"\n",
"The importance of movie 3 to this user will be:\n",
"\n",
"```\n",
"2.0 * 0.1 + 3.0 * 0.2 = 0.8\n",
"```\n",
"\n",
"While the importance of movie 4 to this user will be:\n",
"\n",
"```\n",
"2.0 * 0.9 + 3.0 + 0.8 = 5.6\n",
"```\n",
"\n",
"As we can see, we are favoring movies that are highly similar to many movies that the user has rated high.\n",
"\n",
"(Remove this cell when delivering.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a function `get_movies_relevance` that returns a dataframe with columns `movie_id` and `relevance`. You can use the following template:\n",
"\n",
"```python\n",
"def get_movies_relevance(user_id, user_movie, item_similarity_matrix):\n",
" \n",
" # Create an empty series\n",
" movies_relevance = ...\n",
" \n",
" # Iterate through the movies the user has watched\n",
" for watched_movie in ...\n",
" \n",
" # Obtain the rating given\n",
" rating_given = ...\n",
" \n",
" # Obtain the vector containing the similarities of watched_movie\n",
" # with all other movies in item_similarity_matrix\n",
" similarities = ...\n",
" \n",
" # Multiply this vector by the given rating\n",
" weighted_similarities = ...\n",
" \n",
" # Append these terms to movies_relevance\n",
" movies_relevance = movies_relevance.append(weighted_similarities)\n",
" \n",
" # Compute the sum for each movie\n",
" movies_relevance = movies_relevance.groupby(movies_relevance.index).sum()\n",
" \n",
" # Convert to a dataframe\n",
" movies_relevance_df = pd.DataFrame(movies_relevance, columns=['relevance'])\n",
" movies_relevance_df['movie_id'] = movies_relevance_df.index\n",
" \n",
" return movies_relevance_df\n",
"\n",
"```\n",
"\n",
"(Remove this cell when delivering.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace this cell with your code for \"get_movies_relevance\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Apply `get_movies_relevance` to the two users we have selected, `user_id_super` and `user_id_drama`.\n",
"\n",
"The result will contain only `movie_id` and `relevance`, you will have to merge with the `movies` dataframe on the `movie_id` attribute.\n",
"\n",
"Sort the results by descending relevance and print the top 10 for each case.\n",
"\n",
"(Remove this cell when delivering.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace this cell with your code to obtain the 5 most relevant movies for the users user_id_super (who likes superhero movies) and user_id_drama (who likes dramas)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace this cell with a brief commentary on the movies you see on these lists. How many of them look relevant for the intended users? Feel free to use IMDB or Wikipedia to get info on these movies.\n",
"\n",
"All those trivial facts you learned about 1980s and 1990s pop culture were supposed to be useful one day; that day has arrived :-)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, you only need to remove the movies the user has watched. To do so:\n",
"\n",
"* Obtain the dataframe of relevant movies with `get_movies_relevance`\n",
"* Set this dataframe index to 'movie_id'\n",
"* Obtain the list of movie_ids of watched movies with `get_watched_movies`\n",
"* Drop from the relevant movies dataframe the watched movies\n",
"\n",
"(Remove this cell when delivering.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace this cell with your code implementing \"get_recommended_movies\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace this cell with your code to obtain the 10 most recommended movies for the users user_id_super and user_id_drama"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace this cell with a brief commentary on these recommendations. Do you think they are relevant? Why or why not? After removing the movies the user has already watched, are the relevance scores of the remaining items comparable to the previous lists that contained all relevant movies?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# DELIVER (individually)\n",
"\n",
"Remember to read the section on \"delivering your code\" in the [course evaluation guidelines](https://github.com/chatox/data-mining-course/blob/master/upf/upf-evaluation.md).\n",
"\n",
"Deliver a zip file containing:\n",
"\n",
"* This notebook\n",
"\n",
"## Extra points available\n",
"\n",
"For more learning and extra points, use the [surprise](http://surpriselib.com/) library to generate recommendations for the same two users. Display the generated recommendations and comment on them.\n",
"\n",
"**Note:** if you go for the extra points, add ``Additional results: surprise library`` at the top of your notebook.\n",
"\n",
"(Remove this cell when delivering.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I hereby declare that, except for the code provided by the course instructors, all of my code, report, and figures were produced by myself."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}