2 most commonly used approaches to building a Recommendation Engine
When shopping on an e-commerce platform, one of the most important factors in deciding whether you buy an item is how quickly you can locate it. One popular way in which e-commerce platforms achieve this is through recommendation systems (also known as recommender systems), e.g. the ‘customers who bought this item also bought…’ section on Amazon.
Recommender systems are used to attract and retain users by understanding their tastes. They have become popular because they can provide personalized content that matches each user's interests.
The approach used to build a recommender system depends on the nature of the problem and on the data available.
The two most popular approaches to building a recommender system are:
- Content-Based filtering
- Collaborative filtering
Content-Based filtering
Content here refers to the content or features of the products you like. The idea in content-based filtering is to tag products with certain keywords, understand what the user likes, look up those keywords in the database, and recommend other products with the same attributes.
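As a minimal sketch of this idea, assuming a tiny hypothetical catalogue in which each product carries a set of keyword tags, the snippet below builds a profile from the products a user liked and ranks the remaining products by how many tags they share with that profile. The product names, the tags, and the `recommend_by_tags` helper are illustrative only and not part of any library.

```python
# Hypothetical catalogue: product -> keyword tags (illustrative data)
catalogue = {
    "trail shoes": {"running", "outdoor", "footwear"},
    "road shoes": {"running", "footwear"},
    "yoga mat": {"fitness", "indoor"},
    "hiking boots": {"outdoor", "footwear"},
    "dumbbells": {"fitness", "indoor", "strength"},
}

def recommend_by_tags(liked, catalogue, top_n=2):
    """Rank products the user has not liked yet by tag overlap with their profile."""
    # The user profile is the union of the tags of everything they liked
    profile = set()
    for product in liked:
        profile |= catalogue[product]
    # Score each remaining product by the number of tags it shares with the profile
    scores = {
        product: len(tags & profile)
        for product, tags in catalogue.items()
        if product not in liked
    }
    # Keep only products with at least one shared tag, best matches first
    ranked = sorted(
        (p for p in scores if scores[p] > 0),
        key=lambda p: (-scores[p], p),
    )
    return ranked[:top_n]

recommendations = recommend_by_tags(["trail shoes"], catalogue)
print(recommendations)
```

Counting shared tags is the simplest possible scoring rule; the hands-on section later in this article replaces it with tf-idf vectors and cosine similarity.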
The working methodology of content based filtering
We can make use of a utility matrix for content-based methods. A utility matrix captures each user's preference for certain items. Using the data gathered from users, it lets us relate the items a user likes to those the user dislikes. We assign a value to each user-item pair, known as the degree of preference, and arrange these values in a matrix of users against items to identify their preference relationships.
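A small sketch may make the utility matrix concrete. It assumes explicit ratings on a 1-5 scale as the degree of preference and an arbitrary threshold of 3 to separate liked from disliked items; the users, movies, and threshold are all hypothetical.

```python
# Hypothetical explicit ratings on a 1-5 scale: (user, item, degree of preference)
ratings = [
    ("alice", "Movie A", 5),
    ("alice", "Movie B", 1),
    ("bob", "Movie A", 4),
    ("bob", "Movie C", 2),
]

users = sorted({user for user, _, _ in ratings})
items = sorted({item for _, item, _ in ratings})

# Utility matrix as nested dicts; None marks a user-item pair with no known preference
utility = {user: {item: None for item in items} for user in users}
for user, item, value in ratings:
    utility[user][item] = value

# Split one user's items into liked and disliked around an assumed threshold of 3
liked = [i for i, v in utility["alice"].items() if v is not None and v >= 3]
disliked = [i for i, v in utility["alice"].items() if v is not None and v < 3]
print(utility)
print(liked, disliked)
```

Most cells of a real utility matrix are empty, because each user interacts with only a small fraction of the items.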
How is the Similarity Computed between the different products?
Similarity is the key concept in content-based recommendation systems: the items most similar to what we are currently viewing get recommended to us. The question is how?
Below are two similarity measures that are commonly used to compute it.
Euclidean Distance: This distance metric is used when we have numeric data. The smaller the distance between two items, the more similar they are; a distance of 0 means their feature vectors are identical.
Cosine Similarity: This metric is commonly used to compute the similarity of textual data. We convert the text into vectors using tf-idf vectorization and compute the cosine of the angle between two vectors. The closer the cosine is to 1 (an angle of 0), the more similar the two texts are; a cosine of 0 means they share no terms.
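The two measures can be sketched in a few lines of plain Python. The vectors below are made-up feature vectors chosen so that one pair points in the same direction and another pair is orthogonal; the function names are illustrative.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

item_a = [1.0, 2.0, 3.0]
item_b = [2.0, 4.0, 6.0]   # same direction as item_a, twice the magnitude
item_c = [3.0, 0.0, -1.0]  # orthogonal to item_a

print(euclidean_distance(item_a, item_a))
print(euclidean_distance(item_a, item_b))
print(cosine_similarity(item_a, item_b))
print(cosine_similarity(item_a, item_c))
```

Note the difference between the two: `item_a` and `item_b` are a non-zero Euclidean distance apart, yet their cosine similarity is (up to rounding) 1, because cosine similarity ignores magnitude and looks only at direction. That property is one reason it suits text, where documents vary a lot in length.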
Hands-On Implementation Of Content-Based Filtering
Let’s build a movie recommendation system using content-based filtering.
The dataset used for this task contains information about movies, spread across 24 columns. Of these, we use the title of each movie (original_title) and its overview text.
The following steps are used to build the movie recommendation system with content-based filtering.
Step 1: Loading the data (CSV file)
# Import pandas
import pandas as pd
# Load the data
data = pd.read_csv('/content/movies_metadata.csv')
# Top 5 rows
data.head()
Step 2: Preprocess and define input features
# Selecting the features
features = data[['original_title','overview']]
# Check for null values
features.isna().sum()
# Drop rows with null values (reassign instead of inplace to avoid modifying a slice of data)
features = features.dropna()
Step 3: Vectorizing the features and Computing the similarity between movies
# Import text vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# Vectorize the input features (min_df=1 is the default; recent scikit-learn versions reject the integer 0)
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')
# Get vectorized features
matrix = tf.fit_transform(features['overview'])
matrix.data
Now we are ready to compute the cosine similarity between all pairs of movies, based on the overview column of the dataset.
# Import linear kernel
from sklearn.metrics.pairwise import linear_kernel
# Get the similarities (TF-IDF rows are L2-normalised by default, so the dot product equals the cosine similarity)
cosine_similarities = linear_kernel(matrix,matrix)
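To see roughly what the vectorizer and the linear kernel compute, here is a simplified plain-Python sketch. It is not scikit-learn's implementation: it splits on whitespace only, uses single words rather than n-grams, and removes no stop words. It does use the smoothed idf formula documented for scikit-learn's default settings, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation. The three overviews and the helper names are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return L2-normalised TF-IDF vectors (as dicts) for non-empty, whitespace-tokenised documents."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Term count times smoothed idf: ln((1 + n) / (1 + df)) + 1
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())

# Hypothetical overviews, for illustration only
overviews = [
    "a mafia family fights for power",
    "a crime family struggles for power in the city",
    "a scientist turns into a green monster",
]
vectors = tfidf_vectors(overviews)
# Because every vector has unit length, the dot product is the cosine similarity
similarity = [[dot(u, v) for v in vectors] for u in vectors]
for row in similarity:
    print([round(score, 3) for score in row])
```

Each movie is most similar to itself, and the two crime overviews score higher with each other than either does with the third. On the real dataset the full similarity matrix has one row and one column per movie, so its memory use grows quadratically with the number of movies.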
Step 4: Generating recommendations
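# Reset the index so that row labels match the row positions in cosine_similarities
features = features.reset_index(drop=True)
movie_title = features['original_title']
# Map each title to its row position (if a title occurs more than once, keep the first)
indices = pd.Series(features.index, index=features['original_title'])
indices = indices[~indices.index.duplicated(keep='first')]
# Function to get recommendations
def movie_recommend(original_title):
    # Row position of the queried movie
    idx = indices[original_title]
    # Pair every movie position with its similarity to the queried movie
    sim_scores = list(enumerate(cosine_similarities[idx]))
    # Sort from most to least similar
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the top hit (normally the movie itself) and keep the next 30
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    movie_score = [i[1] for i in sim_scores]
    return pd.DataFrame(data={'Movie names': list(movie_title.iloc[movie_indices]), 'Similarity score': movie_score})
Now we will compute the top 5 recommendations for two different movies.
# Test 1
movie_recommend('The Godfather').head()
# Test 2
movie_recommend('Hulk').head()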
Scores of Test 1 are as below:
Scores of Test 2 are as below:
In the above two recommendations, the similarity scores observed for The Godfather are higher than those for Hulk; this might be because we have not used the full dataset. Even so, the recommendations for both movies are quite relevant.
So, this is how we can build a recommendation system using content-based filtering in Python on a real-world dataset.
Collaborative Filtering
Collaborative filtering is a method for recommender systems that is based solely on the past interactions recorded between users and items in order to produce new recommendations. It tries to find what similar users would like: users are grouped into clusters of similar types, and each user receives recommendations according to the preferences of their cluster.
The Working Principle Of Collaborative Filtering
The main idea that governs collaborative methods is that past user-item interactions, when processed by the system, are sufficient to detect similar users or similar items and to make predictions from them. Recommender systems based on collaborative filtering can be categorized in the following ways:
Item-based: This type of recommendation engine finds similarities between items or products. It does so by collecting data on how many users bought two or more items together; if the system finds a high correlation in the item-item matrix, it treats the products as similar. For example, if two products X and Y are highly correlated, then when a user buys X the system recommends buying Y as well.
User-based: This type of system finds similar users based on the items they select. For example, if one user uses a helmet, knee guard, and elbow guard for bike riding and a second user uses only a helmet and elbow guard, a user-based recommendation system will recommend a knee guard to the second user. The method first looks for users who share the same rating patterns as the active user (the user for whom the prediction is made) and then uses the ratings of those like-minded users to calculate a prediction for the active user.
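The bike-gear example can be sketched as follows. This is one simple way to do user-based matching, assuming purchase histories stored as sets and Jaccard overlap as the user-to-user similarity; the user names, the 0.5 threshold, and the helper functions are hypothetical choices made for illustration.

```python
def jaccard(a, b):
    """Share of items two users have in common, out of all items either one uses."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical purchase histories, following the bike-gear example
baskets = {
    "user_1": {"helmet", "knee guard", "elbow guard"},
    "user_2": {"helmet", "elbow guard"},
    "user_3": {"novel", "bookmark"},
}

def recommend_from_neighbours(target, baskets, min_similarity=0.5):
    """Suggest items owned by sufficiently similar users that the target lacks."""
    owned = baskets[target]
    suggestions = {}
    for other, items in baskets.items():
        if other == target:
            continue
        similarity = jaccard(owned, items)
        if similarity >= min_similarity:
            for item in items - owned:
                # Keep the best similarity score seen for each candidate item
                suggestions[item] = max(suggestions.get(item, 0.0), similarity)
    return sorted(suggestions, key=lambda item: (-suggestions[item], item))

print(recommend_from_neighbours("user_2", baskets))
```

Here `user_2` shares two of three items with `user_1`, so the knee guard is suggested, while `user_3` has no overlap with anyone and receives no suggestions. With ratings instead of purchases, the set overlap would typically be replaced by a correlation between rating vectors.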
Hands-On Implementation Of Collaborative Filtering
We are going to recommend movies from the ratings that users have given: the ratings are analyzed, and a new set of movies is recommended to a user based on the movies they queried. Note that the code below correlates movies with one another through their user ratings, which makes it an item-based form of collaborative filtering.
The dataset used for this task comes in two files, movies.csv and ratings.csv. movies.csv contains the movieId, title, and genres of each movie; ratings.csv contains the userId, movieId, rating, and timestamp. Of these features, we will use the movie titles and ratings.
The following steps are taken to build the recommendation system using collaborative filtering.
Step 1: Load and read the data
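# Let's start by importing the necessary libraries
# pandas for data handling
import pandas as pd
# NumPy for numerical operations
import numpy as np
# Movie titles dataset
movies_title = pd.read_csv('/content/dataset/movies.csv')
movies_title.head()
# Ratings dataset (assumption: ratings.csv sits in the same folder as movies.csv)
ratings = pd.read_csv('/content/dataset/ratings.csv')
# Join ratings with titles on movieId; the later steps work on title_ratings
title_ratings = pd.merge(ratings, movies_title, on='movieId')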
Step 2: Prepare the data for recommendation
# Dropping the irrelevant columns
title_ratings.drop(['genres','timestamp'], axis=1,inplace=True)
Now we will create a pivot table that captures the interaction between each user and each movie.
# Pivot table: one row per user, one column per movie title
UserRatings = title_ratings.pivot_table(index=['userId'],columns=['title'],values='rating')
print("Before: ",UserRatings.shape)
# Keep movies with at least 10 ratings, then fill the missing ratings with 0
UserRatings = UserRatings.dropna(thresh=10, axis=1).fillna(0)
print("After: ",UserRatings.shape)
UserRatings.head()
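The filtering done by `dropna(thresh=10, axis=1)` followed by `fillna(0)` can be mimicked on a toy example in plain Python. The rows below are hypothetical, and the threshold is lowered to 2 so that the effect is visible on six ratings.

```python
# Hypothetical (userId, title, rating) rows, standing in for title_ratings
rows = [
    (1, "Movie A", 4.0), (2, "Movie A", 5.0), (3, "Movie A", 3.0),
    (1, "Movie B", 2.0),
    (2, "Movie C", 4.5), (3, "Movie C", 1.0),
]
min_ratings = 2  # the pandas code above uses thresh=10; a smaller value suits this toy data

# Count how many ratings each title received
counts = {}
for _, title, _ in rows:
    counts[title] = counts.get(title, 0) + 1
kept_titles = sorted(t for t, c in counts.items() if c >= min_ratings)

# Build the user x title table, filling missing ratings with 0
user_ids = sorted({user for user, _, _ in rows})
table = {user: {title: 0.0 for title in kept_titles} for user in user_ids}
for user, title, rating in rows:
    if title in table[user]:
        table[user][title] = rating

print(kept_titles)
for user in user_ids:
    print(user, table[user])
```

Movie B is dropped because only one user rated it. One caveat of this design is that filling with 0 makes "not rated" look like a very low rating, which influences any correlation computed from the table afterwards.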
Step 3: Building the correlation matrix
Now we will compute the correlation between movies using the Pearson correlation approach.
# Pearson correlations
relation_metrix = UserRatings.corr(method='pearson')
relation_metrix.head()
# User-defined function to recommend movies
def get_similar(movie_name,rating):
    # Scale the correlations by how far the rating is from 2.5: ratings above it boost similar movies, ratings below it penalize them
    similar_ratings = relation_metrix[movie_name]*(rating-2.5)
    similar_ratings = similar_ratings.sort_values(ascending=False)
    return similar_ratings
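For readers who want to see what the Pearson correlation and the (rating - 2.5) weighting do, here is a self-contained sketch on three made-up rating columns. It is an illustration rather than a reimplementation of pandas: for a constant column it returns 0.0 (an assumption made here for simplicity), and the `pearson` and `similar_to` helpers are hypothetical names.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length rating columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant column as uncorrelated
    return cov / math.sqrt(var_x * var_y)

# Hypothetical rating columns: one entry per user, 0 meaning "not rated"
columns = {
    "Movie A": [5.0, 4.0, 0.0, 1.0],
    "Movie B": [4.0, 5.0, 0.0, 2.0],
    "Movie C": [1.0, 0.0, 5.0, 4.0],
}

def similar_to(movie, rating, midpoint=2.5):
    """Weight each movie's correlation with `movie` by how far `rating` is from the midpoint."""
    weight = rating - midpoint
    scores = {other: pearson(columns[movie], col) * weight for other, col in columns.items()}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

print(similar_to("Movie A", 5))
print(similar_to("Movie A", 1))
```

With a rating of 5 the weight is positive, so movies that correlate with Movie A rank first. With a rating of 1 the weight is negative and the order flips: the movie that is negatively correlated with Movie A comes out on top.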
Step 4: Generating similar movies
# Getting similar movies
movies= [("Skyfall (2012)",5),("Mission: Impossible III (2006)",4)]
# One row of scores per rated movie (DataFrame.append was removed in pandas 2.0, so the rows are collected in a list)
similar_movies = pd.DataFrame([get_similar(movie,rating) for movie,rating in movies])
# Top 10 movies that are similar to queries
similar_movies.sum().sort_values(ascending=False).head(10)
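The final aggregation simply adds up, movie by movie, the weighted scores obtained from each rated movie. A toy version with invented scores, assuming two rated movies and three candidates:

```python
# Hypothetical per-movie score rows, like the Series returned by get_similar
score_rows = [
    {"Movie X": 2.0, "Movie Y": 0.5, "Movie Z": -1.0},  # from a movie rated 5
    {"Movie X": 0.9, "Movie Y": 1.2, "Movie Z": 0.3},   # from a movie rated 4
]

# Sum the scores column by column, as similar_movies.sum() does
totals = {}
for row in score_rows:
    for title, score in row.items():
        totals[title] = totals.get(title, 0.0) + score

# Highest combined score first
ranking = sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
print(ranking)
```

Since every movie correlates perfectly with itself, the queried movies tend to appear at the top of the real result; in practice they would typically be filtered out before showing recommendations.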
So this is how we can build a recommendation system using collaborative filtering in Python on a real-world dataset.
Issues in recommendation systems
The limitations of the two types of recommender systems are as follows:
- The cold start problem: How will you provide recommendations when the website is just starting up and you have no previous data about products or users? How can the system know what a new user will like?
- Explicit and implicit feedback: How can you collect information on what users like?
- Recommending only a narrow range of items: How can you ensure that recommendations provided are not all similar to each other and have sufficient variation among them?
Possible Solutions for issues in Recommendation System:
One way to solve the cold start problem in content-based filtering is to explicitly ask the users what they like. In collaborative filtering, explicit feedback is given in the form of ratings.
When only a few users have rated the items, you can use alternative methods to infer ratings. One such way is called implicit feedback. In implicit feedback, you observe the behavior of the users, such as their browsing history, items they have searched for in the past, etc., and use this data to predict ratings.
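One possible way to turn observed behavior into ratings is sketched below. The event types, their weights, and the cap at 5 are assumptions made up for this illustration; a real system would have to choose or learn them from its own data.

```python
# Hypothetical event weights: stronger signals count more (values are assumptions)
EVENT_WEIGHTS = {"view": 1.0, "search": 2.0, "add_to_cart": 3.0, "purchase": 5.0}
MAX_RATING = 5.0

def implicit_ratings(events):
    """Turn (user, item, event) logs into capped pseudo-ratings per user-item pair."""
    scores = {}
    for user, item, event in events:
        key = (user, item)
        # Unknown event types contribute nothing
        scores[key] = scores.get(key, 0.0) + EVENT_WEIGHTS.get(event, 0.0)
    # Cap the accumulated score so it stays on the same scale as explicit ratings
    return {key: min(score, MAX_RATING) for key, score in scores.items()}

log = [
    ("alice", "laptop", "view"),
    ("alice", "laptop", "search"),
    ("alice", "mouse", "view"),
    ("bob", "laptop", "purchase"),
    ("bob", "laptop", "view"),
]
inferred = implicit_ratings(log)
print(inferred)
```

The resulting pseudo-ratings can then be fed into the same user-item table that explicit ratings would fill.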
Another problem that arises in content-based systems is that new and unrelated items are rarely recommended. For example, in a news recommender system, if you have only read politics- and cricket-related articles in the past, the system will keep recommending such articles in the future. It will probably not recommend technology articles, even if you like reading about technology.
This is not a problem with collaborative systems. They will recommend other topics if similar users like them, and hence they recommend diverse items to you.
Conclusion:
There are other filtering techniques, such as hybrid, knowledge-based, and context-aware filtering, that are used to build recommendation engines in industry, but they are less common than the two discussed in this article. I hope the content has provided some basic insights into building a recommendation engine on real-world datasets.
I would appreciate any feedback in the comments section.