# selavi

**Repository Path**: facebookresearch/selavi

## Basic Information

- **Project Name**: selavi
- **Description**: This repo covers the implementation for Labelling unlabelled videos from scratch with multi-modal self-supervision, which learns clusters from multi-modal data in a self-supervised way.
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2023-07-30
- **Last Updated**: 2024-10-28

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# SeLaVi: Labelling unlabelled videos from scratch with multi-modal self-supervision

This code provides a PyTorch implementation and pretrained models for **SeLaVi** (Labelling unlabelled videos from scratch with multi-modal self-supervision), as described in the paper [Labelling unlabelled videos from scratch with multi-modal self-supervision](https://arxiv.org/abs/2006.13662).
*SeLaVi illustration*

SeLaVi is an efficient and simple method for learning labels of multi-modal audio-visual data.

# Key contributions

**(1) Clustering does not come for free**

Even very strong feature representations, such as a supervised pretrained R(2+1)D-18 or a MIL-NCE S3D network, underperform our method, which _learns_ clusters.

**(2) Truly multi-modal clustering yields robust clusters**

Since our method treats each modality as an _augmentation_ of the other, it learns to give stable predictions even if one modality is degraded.

# Model Zoo

We provide several baseline SeLaVi pre-trained models, with an R(2+1)D-18 video and a ResNet-9 audio architecture in torchvision format, trained on different datasets.

| Method | Dataset        | Clusters | Setting   | Heads | NMI   | Accuracy | url                                                                       |
|--------|----------------|----------|-----------|-------|-------|----------|---------------------------------------------------------------------------|
| SeLaVi | AVE            | 28       | MA, G, MH | 10    | 66.2% | 57.9%    | [model](https://dl.fbaipublicfiles.com/selavi/selavi_ave.pth)             |
| SeLaVi | Kinetics-Sound | 32       | MA, G, MH | 10    | 47.5% | 41.2%    | [model](https://dl.fbaipublicfiles.com/selavi/selavi_kinetics_sound.pth)  |
| SeLaVi | Kinetics       | 400      | MA, G, MH | 10    | 27.1% | 7.8%     | [model](https://dl.fbaipublicfiles.com/selavi/selavi_kinetics.pth)        |
| SeLaVi | VGG-Sound      | 309      | MA, G, MH | 10    | 55.9% | 31.0%    | [model](https://dl.fbaipublicfiles.com/selavi/selavi_vgg_sound.pth)       |

MA = Modality Alignment, G = Gaussian Marginals, DH = Decorrelated Heads (see paper for details)

## Further details for the VGG-Sound trained model

| Model            | NMI   | aNMI  | aRI   | Accuracy | Purity | HMDB-51 (3-fold)         | UCF-101 (3-fold)         |
|------------------|-------|-------|-------|----------|--------|--------------------------|--------------------------|
| SeLaVi VGG-Sound | 54.6% | 52.0% | 20.6% | 30.9%    | 36.2%  | 55.1% (55.4, 54.8, 55.1) | 86.1% (86.0, 85.9, 86.5) |
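If you only want to reuse the released weights outside this repo, the snippet below is a minimal sketch of loading the video branch into torchvision's R(2+1)D-18. The checkpoint key layout (a possible `model`/`state_dict` wrapper and `module.`/`video_network.`/`base.` prefixes) is an assumption rather than documented behaviour, so inspect the file and adapt the prefix handling to what you find.

```python
# Minimal sketch: load SeLaVi video-backbone weights into a torchvision R(2+1)D-18.
# Assumptions (verify against the actual checkpoint): the .pth file holds a state
# dict, possibly nested under "model"/"state_dict" and with wrapper prefixes.
import torch
from torchvision.models.video import r2plus1d_18

ckpt = torch.load("selavi_vgg_sound.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt.get("state_dict", ckpt)) if isinstance(ckpt, dict) else ckpt

# Strip common wrapper prefixes so keys line up with torchvision's naming.
cleaned = {}
for k, v in state_dict.items():
    for prefix in ("module.", "video_network.", "base."):  # assumed prefixes
        if k.startswith(prefix):
            k = k[len(prefix):]
    cleaned[k] = v

backbone = r2plus1d_18(num_classes=400)
missing, unexpected = backbone.load_state_dict(cleaned, strict=False)
print(f"missing: {len(missing)}, unexpected: {len(unexpected)}")
```

With `strict=False`, audio-branch or clustering-head parameters that do not match the video backbone are reported rather than raising an error.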
## Download and/or visualize clusters

You can download the csv files for our clusters here: [VGG-Sound](https://www.robots.ox.ac.uk/~vgg/research/selavi/data/vgg_sound_clusters.csv), [Kinetics](https://www.robots.ox.ac.uk/~vgg/research/selavi/data/kinetics_clusters.csv). Note: as everywhere in the paper, we only take a single crop in space and time for generating these.

To interactively visualize the clusters we obtain for Kinetics and VGG-Sound, as we do on our [homepage](https://www.robots.ox.ac.uk/~vgg/research/selavi/#demo), run:
```
python3 cluster_vis/get_clusters_vggsounds.py --ckpt_path ${VGG_SOUND_CKPT_PATH};
python3 cluster_vis/get_clusters_kinetics.py --ckpt_path ${KINETICS_CKPT_PATH};
cd cluster_vis;
python3 preprocess.py --kinetics_path selavi_kinetics.pkl --vgg_sound_path selavi_vgg_sounds.pkl
# open index.html in your browser
```

# Running SeLaVi unsupervised training

## Installation

This repo was tested with Ubuntu 16.04.5 LTS, Python 3.7.5, PyTorch 1.3.1, Torchvision 0.4.1, and CUDA 10.0.

1. Install the required packages using `conda env create -f environment.yml`
2. Activate the conda environment using `conda activate lab_vid`
3. Ensure the pre-training datasets (VGG-Sound, Kinetics, AVE) are pre-processed such that the folder structure is of the form:
```
{dataset_name}/{train,val,test}/{class_name}/{video_name}.mp4
```
N.B. Kinetics-Sound is a subset of Kinetics.

## Single-node training

SeLaVi is very simple to implement and experiment with. Our implementation consists of a [main.py](./main.py) file from which the following are imported: the dataset definition [dataset/AVideoDataset.py](./dataset/AVideoDataset.py), the model architecture [model.py](model.py), the Sinkhorn-Knopp algorithm [src/sk_utils.py](src/sk_utils.py), and some miscellaneous training utilities [utils.py](utils.py).
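For intuition about the clustering step, here is a minimal, illustrative sketch of a Sinkhorn-Knopp style balanced assignment; the actual implementation in [src/sk_utils.py](src/sk_utils.py) differs in its details (e.g. Gaussian marginals and multiple heads), and `sinkhorn_assignments`, its parameters, and the usage example below are ours, not the repo's API.

```python
# Illustrative sketch (not the repo's exact implementation) of a Sinkhorn-Knopp
# iteration that turns model scores into (soft) balanced cluster assignments.
import torch

def sinkhorn_assignments(scores: torch.Tensor, eps: float = 0.05, n_iters: int = 3) -> torch.Tensor:
    """scores: (num_samples, num_clusters) similarity/logit matrix."""
    # Transportation matrix Q, normalised so samples are spread roughly
    # equally across clusters (uniform marginals in this simplified sketch).
    q = torch.exp((scores - scores.max()) / eps).t()  # (num_clusters, num_samples)
    q /= q.sum()
    k, n = q.shape
    for _ in range(n_iters):
        q /= q.sum(dim=1, keepdim=True)  # normalise cluster marginals (rows)
        q /= k
        q /= q.sum(dim=0, keepdim=True)  # normalise sample marginals (columns)
        q /= n
    q *= n  # each column (sample) now sums to 1: a soft assignment
    return q.t()  # (num_samples, num_clusters)

# Example usage with random scores; hard pseudo-labels via argmax.
if __name__ == "__main__":
    scores = torch.randn(128, 32)
    assignments = sinkhorn_assignments(scores)
    pseudo_labels = assignments.argmax(dim=1)
    print(pseudo_labels.shape)  # torch.Size([128])
```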
For example, to train a SeLaVi baseline on a single node with 8 GPUs for 200 epochs on VGG-Sound, run:
```
python -m torch.distributed.launch --nproc_per_node=8 main.py \
--root_dir /path/to/VGGSound \
--epochs 200 \
--batch_size 16 \
--base_lr 1e-2 \
--ds_name vgg_sound \
--use_mlp True \
--mlp_dim 309 \
--headcount 10 \
--match True \
--distribution gauss \
--ind_groups 2
```

## Multi-node training

Distributed training is available via Slurm. We provide a customizable [SBATCH script](./scripts/master.sh) to reproduce our SeLaVi models. For example, to train SeLaVi on 8 nodes and 64 GPUs with a batch size of 1024 for 200 epochs, run:
```
sbatch ./scripts/master.sh
```
Note that you might need to remove the copyright header from the sbatch file to launch it.

**Set up the `dist_url` parameter**: We refer the user to the PyTorch distributed documentation ([env](https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization), [file](https://pytorch.org/docs/stable/distributed.html#shared-file-system-initialization) or [tcp](https://pytorch.org/docs/stable/distributed.html#tcp-initialization)) for setting the distributed initialization method (parameter `dist_url`) correctly. In the provided sbatch files, we use the [tcp init method](https://pytorch.org/docs/stable/distributed.html#tcp-initialization) (see [\*](https://github.com/facebookresearch/swav/blob/master/scripts/swav_800ep_pretrain.sh#L17-L20) for an example).

# Evaluating models

## Evaluate models: Clustering quality

To evaluate the clustering quality of SeLaVi pretraining:
```
python3 get_clusters.py \
--dataset {vggsound, kinetics, ave, kinetics_sound} \
--root_dir /path/to/dataset \
--weights_path ${WEIGHTS_PATH} \
--output_dir ${OUTPUT_DIR} \
--exp_desc ${EXP_DESC} \
--mode train \
--headcount ${HEADCOUNT}

python3 clustering_metircs.py \
--path ${OUTPUT_DIR}/${EXP_DESC}.pkl \
--ncentroids ${NUM_CLS}
# Set NUM_CLS={kinetics: 400, ave: 28, vggsound: 309, kinetics_sound: 32}
```

## Evaluate models: Video Action Recognition

To evaluate SeLaVi pretraining on video action recognition:
```
python3 finetune_video.py \
--dataset {ucf101, hmdb51} \
--root_dir /path/to/dataset \
--fold {1,2,3} \
--batch_size 32 \
--workers 10 \
--weights_path ${WEIGHTS_PATH} \
--output_dir ${OUTPUT_DIR} \
--num_clusters ${NUM_CLUSTERS}
```

## Evaluate models: Video Retrieval

To evaluate SeLaVi pretraining on video action retrieval:
```
python3 video_retrieval.py \
--dataset {ucf101, hmdb51} \
--root_dir /path/to/dataset \
--fold {1,2,3} \
--batch_size 32 \
--workers 10 \
--weights_path ${WEIGHTS_PATH} \
--output_dir ${OUTPUT_DIR}
```

## Visualize output distributions

To visualize the output distributions of SeLaVi pretraining:
```
python3 plot_distributions.py
```

![Gaussian marginals](cluster_vis/gaussian_hist_0.png)

# Citation

If you find this repository useful in your research, please cite:
```
@inproceedings{asano2020labelling,
  title={Labelling unlabelled videos from scratch with multi-modal self-supervision},
  author={Yuki M. Asano and Mandela Patrick and Christian Rupprecht and Andrea Vedaldi},
  year={2020},
  booktitle={NeurIPS}
}
```