NotFlix


About

This is Notflix, a free movie database and recommendation website.

This website is simply a side project that aims to display a fixed dataset of movies and provide recommendations for other movies to watch. I am building it mostly for fun, and also to have a nice playground for implementing various recommendation algorithms using Machine Learning.

NotFlix is based on data from the following sources:

  • The MovieLens ml-1m dataset
  • The OMDb API (the Open Movie Database)

Installation

Prerequisites:

  1. Install the following software, used in the steps below: Docker and Docker Compose, Python (with virtualenv and pip), and make.
  2. Download the movielens data:
    • Download the ml-1m dataset from the MovieLens website (https://grouplens.org/datasets/movielens/1m/)
    • Unzip it and place it under datasets/movielens/ml-1m
  3. Get an OMDb API key. To add movie metadata to the downloaded dataset
    (such as the actors, …), I chose to use the OMDb API (the Open Movie Database). Support its maintainer on Patreon and get a key that you will place in a text file called omdb.key at the root of this repository (see the sketch after this list).
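
    A minimal sketch (assuming the requests package) to check that your key works; NotFlix itself accesses OMDb through its downloader code:

        import requests

        # Read the API key stored in omdb.key at the repository root.
        with open("omdb.key") as key_file:
            api_key = key_file.read().strip()

        # Query OMDb for a movie by title and year; "apikey", "t" and "y" are
        # standard OMDb query parameters.
        response = requests.get(
            "http://www.omdbapi.com/",
            params={"apikey": api_key, "t": "Toy Story", "y": 1995},
        )
        print(response.json().get("Actors"))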

Once you have all the prerequisites set up, follow these steps:

  1. Copy the db-credentials.env template and add the credentials you want:

    cp -n db-credentials.env.dist db-credentials.env;
    
  2. Create a virtual environment and install the required packages:

    virtualenv venv;
    source venv/bin/activate;
    pip install -r requirements.txt;
    
  3. Build the Docker images:

    docker-compose build;
    
  4. Launch the PostgreSQL database:

    docker-compose up -d postgres
    
  5. Use the following Flask CLI commands to insert the data into the DB:

    export FLASK_APP="src/web";
    export POSTGRES_HOST="127.0.0.1";
    
    flask init-db;
    flask insert-engines;
    flask insert-pages;
    flask download-movies;
    flask insert-movies;
    flask train-engines;
    flask upload-engines;
    
  6. Launch the application with make start, then visit localhost:5000.

Usage

So far, Notflix exposes the following pages:

  • A home page, displaying popular movies, the user's browsing history
    and some personalized recommendations.
  • A movie page, displaying basic information about the selected movie
    and recommendations for similar movies to watch.
  • A genres page that lets you browse movies by genre.
  • A search page that lets you search for movies.

The configuration for engines and pages is handled through the display.json file. You can use it to change which engines are displayed, their names, and their order on each page (an illustrative sketch follows).
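
The exact schema of display.json is defined by the project; the snippet below is only an illustrative guess at what an entry might look like, mirroring the Engine fields documented further down (type, display_name, priority):

    import json

    # Hypothetical structure for display.json -- the real schema may differ.
    example_display = {
        "home": [
            {"type": "TopRated", "display_name": "Top rated movies", "priority": 1},
            {"type": "UserHistory", "display_name": "Your browsing history", "priority": 2},
        ],
        "item": [
            {"type": "SameGenres", "display_name": "More like this", "priority": 1},
        ],
    }

    print(json.dumps(example_display, indent=2))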

Repository organization

The repository is organized the following way:

  • .circleci: Configuration file for CircleCI
  • datasets: Folder containing the datasets
    (so far only movielens)
  • docs: Folder containing the documentation, auto-generated
    by Sphinx.
  • logs: Log files are saved here
  • models: Machine Learning models are saved here.
    • Under bin we save the model weights.
    • Under csv we save CSV files containing the predictions made by a given engine.
  • notebooks: The exploratory Jupyter notebooks
  • src: Source code
    • api: Flask API, responsible for computing the recommendations displayed on the web app.
    • data_interface: Code for interacting with the cache or the database.
    • recommender: Everything related to computing recommendations.
    • tracker: Code for tracking user events.
    • utils: Various utility functions
    • web: Code for the Flask web application
  • tests: Unit test code

Notes

  • I am deliberately showing multiple engines on a web page to show
    how the recommendation results differ from one algorithm to another.
  • I am not removing duplicate movies across engines, for the same
    reason as above.
  • The Machine Learning algorithms are not very well trained yet; I spent
    some time working on the application to make it easy to add new engines later.

So far, the movie page looks like this:

[Screenshot of the movie page]

notflix

config module

src package

Subpackages
src.api package
src.api.create_app(test_config=None)[source]
Submodules
src.api.errors module
src.api.errors.page_not_found(e)[source]
src.api.recommend module
src.api.recommend.item(item_id)[source]
src.api.recommend.session_recommendations(session_id)[source]
src.api.recommend.user(user_id)[source]
src.api.wsgi module
src.data_interface package
Submodules
src.data_interface.cache module
class src.data_interface.cache.Cache[source]

Bases: object

append(key, value)[source]

Append to a redis list

Parameters:
  • key (str) – cache key
  • value (str) – object to store in cache
get(key, start=None, end=None)[source]

Get an object from cache by its key

Parameters:
  • key (str) – cache key
  • start (int) – when querying a redis list, starting range of the list.
  • end (int) – when querying a redis list, ending range of the list.
Returns:

cached object

Return type:

str

set(key, value)[source]

Set an object in cache by its key

Parameters:
  • key (str) – cache key
  • value (str) – object to store in cache
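
A hedged usage sketch of the Cache class above, using only the documented methods; how the underlying Redis connection is configured is handled elsewhere in the project:

    from src.data_interface.cache import Cache

    cache = Cache()

    # Simple key/value usage.
    cache.set("movie:42:title", "Alien")
    title = cache.get("movie:42:title")

    # Redis-list usage: append viewed items, then read back a range.
    cache.append("session:abc:history", "42")
    cache.append("session:abc:history", "7")
    history = cache.get("session:abc:history", start=0, end=-1)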
src.data_interface.downloader module

This module contains wrappers to download various movie datasets. So far we only use MovieLens, but more can be added.

Every dataset should have its wrapper class that inherits from Downloader.

class src.data_interface.downloader.Downloader[source]

Bases: abc.ABC

download_to_file()[source]
insert_in_db()[source]
item_from_api(id)[source]
read_api_key(key_filepath)[source]
class src.data_interface.downloader.MovielensDownloader[source]

Bases: src.data_interface.downloader.Downloader

download_to_file()[source]
insert_in_db()[source]
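
As the module docstring says, every dataset should get a wrapper class that inherits from Downloader. A minimal sketch of what an additional, hypothetical wrapper could look like, with placeholder bodies (the two overridden methods mirror what MovielensDownloader overrides):

    from src.data_interface.downloader import Downloader


    class MyDatasetDownloader(Downloader):
        """Hypothetical wrapper for an additional movie dataset."""

        def download_to_file(self):
            # Fetch the raw dataset and write it under datasets/<dataset-name>/.
            raise NotImplementedError

        def insert_in_db(self):
            # Parse the downloaded files and insert the rows into the database.
            raise NotImplementedError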
src.data_interface.model module
class src.data_interface.model.BaseTable(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Model

created_at = Column(None, DateTime(), table=None, nullable=False, default=ColumnDefault(<function datetime.utcnow>))
id = Column(None, Integer(), table=None, primary_key=True, nullable=False)
updated_at = Column(None, DateTime(), table=None, onupdate=ColumnDefault(<function datetime.utcnow>), default=ColumnDefault(<function datetime.utcnow>))
class src.data_interface.model.Engine(**kwargs)[source]

Bases: src.data_interface.model.BaseTable

created_at
display_name
id
priority
type
updated_at
class src.data_interface.model.Genre(**kwargs)[source]

Bases: src.data_interface.model.BaseTable

created_at
id
name
updated_at
class src.data_interface.model.Movie(**kwargs)[source]

Bases: src.data_interface.model.BaseTable

actors
as_dict()[source]
awards
country
created_at
description
director
duration
genres
id
image
language
name
rating
updated_at
year
class src.data_interface.model.Page(**kwargs)[source]

Bases: src.data_interface.model.BaseTable

created_at
engines
id
name
updated_at
class src.data_interface.model.Recommendation(**kwargs)[source]

Bases: src.data_interface.model.BaseTable

created_at
engine_name
id
recommended_item_id
score
source_item_id
source_item_id_kind
updated_at
class src.data_interface.model.User(**kwargs)[source]

Bases: src.data_interface.model.BaseTable

created_at
email
favorite_genres
id
password
updated_at
username
src.data_interface.model.init()[source]
src.data_interface.model.insert(to_insert)[source]
src.data_interface.model.truncate_table(table)[source]
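
A hedged sketch of using the model helpers above to create the tables and insert a row; the exact column types, and whether insert() expects a single row or a list, are assumptions to verify against the code:

    from src.data_interface import model

    # Create the tables defined by the SQLAlchemy models
    # (requires the database to be reachable, see the installation steps).
    model.init()

    # Build a Movie row with a few of the documented columns (values are made up).
    movie = model.Movie(
        name="Toy Story",
        year=1995,
        duration=81,
        description="A cowboy doll is profoundly threatened by a new spaceman toy.",
    )

    # insert() takes `to_insert`; whether that is one row or a list is an assumption here.
    model.insert(movie)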
src.recommender package
Subpackages
src.recommender.engines package

This is the engines package, where we define the recommendation engines.

This package contains the following modules:

  • engine: Module where the base classes are defined. All created
    engines should inherit from one of these base classes, which define the skeleton of an engine, such as which methods must be overridden.
  • collaborative_filtering: All collaborative filtering engines
  • content_based: All content-based filtering engines
  • generic: All the generic engines that are not really collaborative
    or content-based (e.g. display the most popular items, or the items in the user browsing history, etc.)
Submodules
src.recommender.engines.collaborative_filtering module
class src.recommender.engines.collaborative_filtering.Item2Vec[source]

Bases: src.recommender.engines.engine.OfflineEngine

train()[source]

Method for training the engine. This method should load the dataset, compute the recommendations and then persist them to disk using save_recommendations_to_csv.

class src.recommender.engines.collaborative_filtering.Item2VecOnline[source]

Bases: src.recommender.engines.engine.OnlineEngine

load_model()[source]

Load the ML model from disk and return it

Returns:The ML model to be saved as self.model
predict(context)[source]

Predict using the loaded model and the context.

Parameters:context (src.recommender.wrappers.Context) – Context wrapper
Returns:
  • ids (list(int)) – list of recommended ids sorted by descending score
  • scores (list(float)) – list of scores for each recommended item
train()[source]

Train a ML model and save it to disk

class src.recommender.engines.collaborative_filtering.ItemBasedCF[source]

Bases: src.recommender.engines.engine.OfflineEngine

train()[source]

Method for training the engine. This method should load the dataset, compute the recommendations and then persist them to disk using save_recommendations_to_csv.

src.recommender.engines.content_based module
class src.recommender.engines.content_based.OneHotMultiInput[source]

Bases: src.recommender.engines.engine.OfflineEngine

train()[source]

Method for training the engine. This method should load the dataset, compute the recommendations and then persist them to disk using save_recommendations_to_csv.

class src.recommender.engines.content_based.SameGenres[source]

Bases: src.recommender.engines.engine.QueryBasedEngine

compute_query(context)[source]

Abstract method that computes the SQL query using SQLAlchemy

Parameters:context (recommender.wrappers.Context) – context wrapper
Returns:query result
class src.recommender.engines.content_based.TfidfGenres[source]

Bases: src.recommender.engines.engine.OfflineEngine

train()[source]

Method for training the engine. This method should load the dataset, compute the recommendations and then persist them to disk using save_recommendations_to_csv.

src.recommender.engines.engine module
class src.recommender.engines.engine.Engine[source]

Bases: abc.ABC

Abstract class for all engines. You should not directly use this class, instead use the classes that inherit from this class.

init_recommendations(context)[source]

Create an empty src.recommender.wrappers.Recommendations object and fill in the engine type, display name and priority based on the information stored in the DB.

Parameters:context (src.recommender.wrappers.Context) – Context wrapper, containing useful information for the engine.
Returns:Recommendations object filled with engine type, display name and priority
Return type:(src.recommender.wrappers.Recommendations)
recommend(context)[source]

Abstract method for all engines for recommending items.

The context wrapper stores all the information the engine might need to compute the recommendations, like the current item_id, the current user_id, the user browsing history, etc.

Every engine must override this method. They have to call self.init_recommendations first to create an empty src.recommender.wrappers.Recommendations object and then enrich it with the recommended items.

Parameters:context (src.recommender.wrappers.Context) – the context
Returns:the recommendation object
Return type:src.recommender.wrappers.Recommendations
class src.recommender.engines.engine.OfflineEngine[source]

Bases: src.recommender.engines.engine.QueryBasedEngine

These engines are a special kind of QueryBasedEngine because they require training.

Most of the offline Machine Learning algorithms will inherit from this class.

The recommendations are computed offline with the train method, then saved on disk with save_recommendations_to_csv and finally uploaded to the DB using upload.

compute_query(context)[source]

Get the recommended items from the DB.

Parameters:context (src.recommender.wrappers.Context) – Context wrapper
Returns:list of Recommendation
Return type:list
save_recommendations_to_csv(recommendations)[source]

Save recommendations to a CSV file.

Parameters:recommendations (list(tuple)) – List of recommendation tuple corresponding to: (movie_id, recommended_movie_id, input_kind, score)
train()[source]

Method for training the engine. This method should load the dataset, compute the recommendations and then persist them to disk using save_recommendations_to_csv.

upload()[source]

Upload the recommendations from a CSV file to the DB.

class src.recommender.engines.engine.OnlineEngine[source]

Bases: src.recommender.engines.engine.Engine

Online Machine Learning Engines that do not get their recommendations from a SQL query but from a loaded model.

The model is trained with the train method, and loaded at runtime with the load_model method.

load_model()[source]

Load the ML model from disk and return it

Returns:The ML model to be saved as self.model
predict(context)[source]

Predict using the loaded model and the context.

Parameters:context (src.recommender.wrappers.Context) – Context wrapper
Returns:
  • ids (list(int)) – list of recommended ids sorted by descending score
  • scores (list(float)) – list of scores for each recommended item
recommend(context)[source]

Recommend movies based on context

Parameters:context (src.recommender.wrappers.Context) – Context wrapper
Returns:src.recommender.wrappers.Recommendations as dict
Return type:recommendations (dict)
train()[source]

Train a ML model and save it to disk

class src.recommender.engines.engine.QueryBasedEngine[source]

Bases: src.recommender.engines.engine.Engine

Abstract class for an engine based on a SQL query performed at every call. These engines require no training; for instance, an engine that recommends random items from the DB.

compute_query(context)[source]

Abstract method that computes the SQL query using SQLAlchemy

Parameters:context (recommender.wrappers.Context) – context wrapper
Returns:query result
recommend(context)[source]

Method for recommending items, by calling self.compute_query.

Parameters:context (recommender.wrappers.Context) – context wrapper
Returns:recommendations as list of dict
Return type:list(dict)
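
A hedged sketch of a new query-based engine following the contract above: subclass QueryBasedEngine and implement compute_query. How the project obtains its SQLAlchemy session inside compute_query is not shown in these docs, so get_session() below is only a placeholder:

    from src.data_interface.model import Movie
    from src.recommender.engines.engine import QueryBasedEngine


    def get_session():
        """Placeholder: return a SQLAlchemy session (NotFlix has its own session handling)."""
        raise NotImplementedError


    class RecentTopRated(QueryBasedEngine):
        """Hypothetical engine: recent, highly rated movies."""

        def compute_query(self, context):
            session = get_session()
            # Most recent movies first, ties broken by rating; limited to 10 items.
            return (
                session.query(Movie)
                .order_by(Movie.year.desc(), Movie.rating.desc())
                .limit(10)
                .all()
            )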
src.recommender.engines.generic module
class src.recommender.engines.generic.MostRecent[source]

Bases: src.recommender.engines.engine.QueryBasedEngine

compute_query(context)[source]

Abstract method that computes the SQL query using SQLAlchemy

Parameters:context (recommender.wrappers.Context) – context wrapper
Returns:query result
class src.recommender.engines.generic.Random[source]

Bases: src.recommender.engines.engine.QueryBasedEngine

compute_query(context)[source]

Abstract method that computes the SQL query using SQLAlchemy

Parameters:context (recommender.wrappers.Context) – context wrapper
Returns:query result
class src.recommender.engines.generic.TopRated[source]

Bases: src.recommender.engines.engine.QueryBasedEngine

compute_query(context)[source]

Abstract method that computes the SQL query using SQLAlchemy

Parameters:context (recommender.wrappers.Context) – context wrapper
Returns:query result
class src.recommender.engines.generic.UserHistory[source]

Bases: src.recommender.engines.engine.QueryBasedEngine

compute_query(context)[source]

Abstract method that computes the SQL query using SQLAlchemy

Parameters:context (recommender.wrappers.Context) – context wrapper
Returns:query result
Submodules
src.recommender.metrics module

The metrics functions are copied from this gist: https://gist.github.com/bwhite/3726239

src.recommender.metrics.dcg_at_k(r, k, method=0)[source]

Score is discounted cumulative gain (dcg). Relevance is positive real values. Can use binary as the previous methods. Example from http://www.stanford.edu/class/cs276/handouts/EvaluationNew-handout-6-per.pdf

>>> r = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
>>> dcg_at_k(r, 1)
3.0
>>> dcg_at_k(r, 1, method=1)
3.0
>>> dcg_at_k(r, 2)
5.0
>>> dcg_at_k(r, 2, method=1)
4.2618595071429155
>>> dcg_at_k(r, 10)
9.6051177391888114
>>> dcg_at_k(r, 11)
9.6051177391888114

Parameters:
  • r – Relevance scores (list or numpy) in rank order (first element is the first item)
  • k – Number of results to consider
  • method – If 0 then weights are [1.0, 1.0, 0.6309, 0.5, 0.4307, …] If 1 then weights are [1.0, 0.6309, 0.5, 0.4307, …]
Returns:

Discounted cumulative gain

src.recommender.metrics.evaluate_recommendations(predictions, target, k)[source]

Evaluate the quality of recommendations with NDCG. We compare the predictions set with the target set that should reflect what items are relevant.

Parameters:
  • predictions (list) – List of recommended items. Ordered by descending score.
  • target (list) – List of relevant items.
  • k (int) – Only consider the k first items in the set
Returns:

NDCG at k score

Return type:

float

src.recommender.metrics.ndcg_at_k(r, k, method=0)[source]

Score is normalized discounted cumulative gain (ndcg). Relevance is positive real values. Can use binary as the previous methods. Example from http://www.stanford.edu/class/cs276/handouts/EvaluationNew-handout-6-per.pdf

>>> r = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
>>> ndcg_at_k(r, 1)
1.0
>>> r = [2, 1, 2, 0]
>>> ndcg_at_k(r, 4)
0.9203032077642922
>>> ndcg_at_k(r, 4, method=1)
0.96519546960144276
>>> ndcg_at_k([0], 1)
0.0
>>> ndcg_at_k([1], 2)
1.0

Parameters:
  • r – Relevance scores (list or numpy) in rank order (first element is the first item)
  • k – Number of results to consider
  • method – If 0 then weights are [1.0, 1.0, 0.6309, 0.5, 0.4307, …] If 1 then weights are [1.0, 0.6309, 0.5, 0.4307, …]
Returns:

Normalized discounted cumulative gain
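
The normalization used in the gist referenced above is the standard one: the DCG of the list divided by the DCG of the same relevance scores in the ideal (descending) order. A small sketch using the dcg_at_k function documented above:

    from src.recommender.metrics import dcg_at_k

    def ndcg(r, k, method=0):
        # Ideal DCG: the same relevance scores in the best possible order.
        ideal = dcg_at_k(sorted(r, reverse=True), k, method)
        return dcg_at_k(r, k, method) / ideal if ideal else 0.0

    # Matches the doctest above: a list whose first element is maximal gives NDCG@1 = 1.0.
    print(ndcg([3, 2, 3, 0, 0, 1, 2, 2, 3, 0], 1))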

src.recommender.recommender module
class src.recommender.recommender.Recommender[source]

Bases: object

Recommender System base class.

recommend(context, restrict_to_engines=[])[source]

Call all the active engines based on a context and return their recommendations.

It is possible to restrict to a list of engines by using the restrict_to_engines parameter.

Parameters:context (recommender.wrappers.Context) – Context wrapper, providing information about the current item, user or session.
Returns:List of recommendations as dictionaries
Return type:list(dict)
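
A hedged usage sketch of the Recommender above; the keyword arguments accepted by Context and the identifiers expected in restrict_to_engines are assumptions:

    from src.recommender.recommender import Recommender
    from src.recommender.wrappers import Context

    recommender = Recommender()

    # Context(**kwargs): the accepted keyword arguments (e.g. item_id) are an assumption.
    context = Context(item_id=42)

    # Ask only a subset of engines for recommendations; the engine identifier
    # used here ("SameGenres") is illustrative.
    recommendations = recommender.recommend(context, restrict_to_engines=["SameGenres"])

    # Each entry is one engine's recommendations as a dictionary.
    for entry in recommendations:
        print(entry)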
src.recommender.wrappers module
class src.recommender.wrappers.Context(**kwargs)[source]

Bases: object

A wrapper for context that will help engines make recommendations.

class src.recommender.wrappers.Recommendations[source]

Bases: object

A recommendation object that is returned by the engines

to_dict()[source]

Convert recommendation object to dict

Returns:the recommendations as a dictionary
Return type:dict
to_string()[source]

Convert recommendation object to a string, for debugging purposes

Returns:the recommendations as a string, stating the type and number of items recommended.
Return type:str
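
A short sketch of the two wrappers from an engine's point of view; note that engines normally obtain a filled Recommendations object through Engine.init_recommendations rather than building one directly:

    from src.recommender.wrappers import Context, Recommendations

    # Context carries whatever the engines need; the keyword arguments are an assumption.
    context = Context(user_id=1, item_id=42)

    # An empty Recommendations object; an engine would enrich it with recommended items.
    recommendations = Recommendations()
    print(recommendations.to_string())   # debug summary: engine type and number of items
    payload = recommendations.to_dict()  # dictionary form returned to the web app / API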
src.tracker package
Submodules
src.tracker.tracker module
class src.tracker.tracker.Tracker[source]

Bases: object

get_views_history(key, n=-1)[source]
store_item_viewed(key, item)[source]

Store item viewed in cache

Parameters:
  • key (str) – cache key
  • item (str) – value
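
A hedged usage sketch of the Tracker above, using only the documented methods; the key format is an assumption (the cache key presumably identifies the session or user), and the underlying Redis cache must be reachable:

    from src.tracker.tracker import Tracker

    tracker = Tracker()

    # Record two viewed items under a session key (key format is illustrative).
    tracker.store_item_viewed("session:abc", "42")
    tracker.store_item_viewed("session:abc", "7")

    # Read back the viewing history; n defaults to -1 (presumably the whole history).
    history = tracker.get_views_history("session:abc")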

src.utils package
Submodules
src.utils.data module
src.utils.data.matrix_from_df_with_vect(df, groupby_column, data_column, vectorizer)[source]
src.utils.data.recommendations_from_similarity_matrix(movie_ids, sim_matrix, n_recommendations, input_kind)[source]
src.utils.data.sparse_matrix_from_df(df, groupby, indicator)[source]

Make a scipy sparse matrix from a pandas Dataframe

Parameters:
  • df (pd.DataFrame) – DataFrame with the desired matrix rows as index
  • groupby (str) – Name of the column to set as matrix column
  • indicator (str) – Name of the column that will serve as data
Returns:

sparse matrix (scipy.sparse.csr_matrix), row values (list), column values (list)
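
A small usage sketch of sparse_matrix_from_df based on the documented parameters; the column names and values are made up:

    import pandas as pd
    from src.utils.data import sparse_matrix_from_df

    # One row per (user, movie) interaction; the desired matrix rows (user ids)
    # are set as the DataFrame index, per the documented contract.
    ratings = pd.DataFrame(
        {"movie_id": [1, 2, 1], "rating": [5, 4, 3]},
        index=pd.Index([10, 10, 11], name="user_id"),
    )

    # Returns the sparse matrix plus the row and column values it was built from.
    matrix, row_values, column_values = sparse_matrix_from_df(
        ratings, groupby="movie_id", indicator="rating"
    )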

src.utils.logging module
src.utils.logging.setup_logging(config_path)[source]

Setup logging configuration

src.web package
src.web.create_app()[source]
Submodules
src.web.errors module
src.web.errors.page_not_found(e)[source]
src.web.genres module
src.web.genres.genre(genre)[source]
src.web.genres.index()[source]
src.web.home module
src.web.home.index()[source]
src.web.item module
src.web.item.index(item_id)[source]
src.web.login module
src.web.login.signin()[source]
src.web.login.signout()[source]
src.web.login.signup()[source]
src.web.login.valid_signin(username, password)[source]
src.web.login.valid_signup(username, password)[source]
src.web.search module
src.web.search.search()[source]
src.web.wsgi module
src.web.you module
src.web.you.index()[source]
src.web.you.taste()[source]
