# Autoencoders_cf **Repository Path**: abc-pedicle/Autoencoders_cf ## Basic Information - **Project Name**: Autoencoders_cf - **Description**: No description available - **Primary Language**: Unknown - **License**: Not specified - **Default Branch**: master - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2023-11-08 - **Last Updated**: 2023-11-08 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # Hybrid Collaborative Filtering with Neural Networks Collaborative Fltering uses the ratings history of users and items. The feedback of one user on some items is combined with the feedback of all other users on all items to predict a new rating. For instance, if someone rated a few books, Collaborative Filtering aims at estimating the ratings he would have given to thousands of other books by using the ratings of all the other readers. The following module tackles Collaborative Filtering by using sparse denoising autoencoders. More information can be found in those papers - Collaborative Filtering with Stacked Denoising AutoEncoders and Sparse Inputs (NIPS workshop - ecommerce): https://hal.archives-ouvertes.fr/hal-01256422/document - Hybrid Collaborative Filtering with Autoencoders (RecSys Workshop DLRS) : https://arxiv.org/abs/1606.07659 [TEMPO] A step-by-step tutorial is available [here](https://github.com/fstrub95/torch.github.io/blob/master/blog/_posts/2016-02-21-cfn.md) . It will be pushed soon :) Dependencies: - torch - nn - xlua - nnsparse - optim (optional) anaconda2 ## Summary ``` git clone git@github.com:fstrub95/Autoencoders_cf.git cd Autoencoders_cf cd data wget http://files.grouplens.org/datasets/movielens/ml-10m.zip unzip ml-10m.zip cd ../src th data.lua -ratings ../data/ml-10M100K/ratings.dat -metaItem ../data/ml-10M100K/movies.dat -out ../data/ml-10M100K/movieLens-10M.t7 -fileType movieLens -ratio 0.9 th main.lua -file ../data/ml-10M100K/movieLens-10M.t7 -conf ../conf/conf.movieLens.10M.V.lua -save network.t7 -type V -meta 1 -gpu 1 th computeMetrics.lua -file ../data/ml-10M100K/movieLens-10M.t7 -network network.t7 -type V -gpu 1 ``` Your network is ready! (Average time ~25min) ## Step 1 : Convert the dataset ``` th data.lua -xargs ``` This script will turn an external raw dataset into torch format. The dataset will be split into a training/testing set by using the training ratio. When side inforamtion exist, they are automatically appended to the inputs. The [MovieLens](http://grouplens.org/datasets/movielens/) and [Douban](https://www.cse.cuhk.edu.hk/irwin.king/pub/data/douban) dataset are supported by default. ``` Options -ratings [compulsary] The relative path to your data file -metaUser The relative path to your metadata file for users -metaItem The relative path to your metadata file for items -tags The relative path to your tag file -fileType [compulsary] The data file format (movieLens/douban/classic) -out [compulsary] The data file format (movieLens/douban/classic) -ratio [compulsary] The training ratio -seed seed ``` Example: ``` th data.lua -ratings ../data/movieLens-10M/ratings.dat -metaItem ../data/movieLens-10M/movies.dat -out ../data/movieLens-10M/movieLens-10M.t7 -fileType movieLens -ratio 0.9 ``` For information, the datasets contains the following side information | Dataset | user info | item info | item tags | | :------- | --------: | :--------: | --------: | | [MovieLens-1M](http://grouplens.org/datasets/movielens/1m/) | true | true | false | | [MovieLens-10M](http://grouplens.org/datasets/movielens/10m/) | false | true | true | | [MovieLens-20M](http://grouplens.org/datasets/movielens/20m/) | false | true | true | | [Douban](https://www.cse.cuhk.edu.hk/irwin.king/pub/data/douban) | true | info | false | To compute tags, please use the script sparsesvd.py : ```sparsesvd.py [in] [out] [rank]``` Example: ``` python2 sparsesvd.py ml-10M100K/tags.dat ml-10M100K/tags.dense.csv 50 th data.lua -xargs ... -tags ml-10M100K/tags.dense.csv ``` If you have want to use external data (for benchmarking purpose), please use the Classic mode. The classic mode takes up to four file as input: - training ratings - testing ratings - user side information - item side information **Training/Testing** : You have to create two files: - [fileName].train - [fileName].test and provide the following argument to the scrit data.lua ``` ls dataset* dataset.txt.train dataset.txt.test th data.lua -ratings dataset.txt ``` Please use the following format for the training/testing datasets: ```[idUser] [idItem] [rating]``` - idUser > 0 (id must start at 1) - idItem > 0 - rating \in [-1;1] Example: ``` 1 2 0.31 2 3 0.5 1 5 -0.1 ``` NB If your ratings are not included in [-1,1], you can modify the function preprocessing() in data/ClassicLoader.lua For instance, if the ratings are included in [1-5], use: ```preprocessing(x) return (x-3)/2 end``` **Side information** : You can create two files: - [userFileName].txt - [itemFileName].txt ``` ls dataset* dataset.txt.train dataset.txt.test th data.lua -ratings [fileName] -metaUser [userFileName].txt -metaItem [itemFileName].txt ``` Please use the following format for the side information datasets: - user side info : ```[idUser] [noInfo] [idUserInfo]:[value] [idUserInfo]:[value] ...``` - user item info : ```[idItem] [noInfo] [idItemInfo]:[value] [idItemInfo]:[value] ...``` where - idUser/idItem > 0 (id must correspond to the training/testing datasets) - idUserInfo/idItemInfo > 0 (id must start at 1) - value \in [-1;1] Example: ``` 1 2 5:0.31 12:-1 2 0 1 3 5:0.28 4:1 12:0.5 ``` ## Step 2 : Train the Network ``` th main.lua -xargs ``` You can either train a U-Autoencoders/V-Autoencoders. Both will compute a final matrix of ratings. Yet, U-encoders will mainly learn a representation of users while V-Autoencoders will mainly learn representation of items. Training a network requires to use an external configuration file (cf further for more explanation regarding this file). Basic configuration files are provided for both MovieLens and Douban datasets. ``` Options -file [compulsary] The relative path to your data file (torch format). Please use data.lua to create such file. -conf [compulsary] The relative path to the lua configuration file -seed The seed. random = 0 -meta [compulsary] use metadata false=0, true=1 -type [compulsary] Pick either the U/V Autoencoder. -gpu [compulsary] use gpu. CPU = 0, GPU > 0 with GPU the index of the device -save Store the final network in an external file ``` Example: ``` th main.lua -file ../data/movieLens-10M/movieLens-10M.t7 -conf ../conf/conf.movieLens.10M.V.lua -save network.t7 -type V -meta 1 -gpu 1 ``` NB: Saving the network let you use it for recommendation tasks. You can configure the network architecture and training by modifying the file config.template.lua it has the following structure: ```lua local config = { layer1 = { layerSize = 100, { Training 1 } }, layer2 = { layerSize = 50, { Training 1 }, --inner hidden layers { Training 2 }, --final network }, layer3 = { layerSize = 20, { Training 1 }, -- inner hidden layers { Training 2 }, -- intermediate hidden layers { Training 3 }, -- final network } etc. } return config ``` Autoencoders are iteratively trained, stacked and fine-tuned. "Training" is defined as follow: ``` { noEpoch = 15, -- number of epoch to train the layer miniBatchSize = 35, -- minibatch size learningRate = 0.02, -- Learning rate learningRateDecay = 0.5, -- Learning rate decay lrt = lrt / (1+lrt_dec) weightDecay = 0.03, -- L2 regulizer criterion = cfn.SDAECriterionGPU(nn.MSECriterion(), -- define the training loss { alpha = 1, -- prediction hyperparameter beta = 0.5, -- reconstruction hyperparameter hideRatio = 0.2, -- Maksing noise ratio }), } ``` ## Step 3 : Recommender System Once the network is trained, it is possible to use it as a recommender system. For now, it is possible to compute the RMSE by sorting the users/items regarding their number of ratings. Further work will enable to directly suggest items to users (or users to items!) ## Benchmarks ## The SVD and ALS-WR algorithms are provided for benchmarking for medium size datasets. For bigger datasets, we adivese to use [mahout](http://mahout.apache.org/) - ALS-WR : ``` th ALS.lua -xargs -file The relative path to your data file. -lambda Rank of the final matrix -rank Regularisation -seed The random seed ``` - Gradient : ``` th GradDescent.lua -xargs -file The relative path to your data file. -lambda Rank of the final matrix -rank Regularisation -lrt Learning Rate -seed The random seed ```