# MultiMAE

**Repository Path**: guopingpan/MultiMAE

## Basic Information

- **Project Name**: MultiMAE
- **Description**: MultiMAE: Multi-modal Multi-task Masked Autoencoders
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2023-01-14
- **Last Updated**: 2024-10-21

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# MultiMAE: Multi-modal Multi-task Masked Autoencoders

[Roman Bachmann*](https://roman-bachmann.github.io/), [David Mizrahi*](https://dmizrahi.com), [Andrei Atanov](https://andrewatanov.github.io/), [Amir Zamir](https://vilab.epfl.ch/zamir/)

[`Website`](https://multimae.epfl.ch) | [`arXiv`](https://arxiv.org/abs/2204.01678) | [`BibTeX`](#citation)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/EPFL-VILAB/MultiMAE/blob/main/MultiMAE_Demo.ipynb) [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/EPFL-VILAB/MultiMAE)

Official PyTorch implementation and pre-trained models for MultiMAE: Multi-modal Multi-task Masked Autoencoders.

We introduce Multi-modal Multi-task Masked Autoencoders (**MultiMAE**), an efficient and effective pre-training strategy for Vision Transformers. Given a small random sample of visible patches from multiple modalities, the MultiMAE pre-training objective is to reconstruct the masked-out regions (an illustrative sketch of this masking strategy is included at the end of this README). Once pre-trained, a single MultiMAE encoder can then be used for both single-modal and multi-modal downstream transfer, yielding results that are competitive with or significantly better than those of the baselines.

## Catalog

- [x] Pre-trained models
- [x] MultiMAE pre-training code
- [x] ImageNet-1K classification fine-tuning code
- [x] Semantic segmentation fine-tuning code (single-modal & multi-modal)
- [x] Depth estimation fine-tuning code
- [x] Taskonomy fine-tuning code
- [x] Colab & Hugging Face demos
- [x] Download links for ImageNet-1K depth and semantic segmentation pseudo labels

## Pre-trained models

We provide the weights of our pre-trained MultiMAE ViT-B model, in MultiViT (multi-modal) format and [timm](https://github.com/rwightman/pytorch-image-models/tree/master/timm) (RGB-only) format.

For comparison, we also provide the weights of a MAE ViT-B model that we pre-trained using the [official MAE codebase](https://github.com/facebookresearch/mae) following the recommended settings.

| Method | Arch. | Pre-training<br>modalities | Pre-training<br>epochs | Weights<br>(MultiViT) | Weights<br>(timm) | Config |
|--------------|-------|---------|------|-----------------------|-------------------|--------|
| MAE | ViT-B | RGB | 1600 | [download](https://github.com/EPFL-VILAB/MultiMAE/releases/download/pretrained-weights/mae-b_dec512d8b_1600e_multivit-c477195b.pth) | [download](https://github.com/EPFL-VILAB/MultiMAE/releases/download/pretrained-weights/mae-b_dec512d8b_1600e_timm-f74f3a8d.pth) | See [MAE](https://github.com/facebookresearch/mae/blob/main/PRETRAIN.md) |
| **MultiMAE** | ViT-B | RGB+D+S | 1600 | [**download**](https://github.com/EPFL-VILAB/MultiMAE/releases/download/pretrained-weights/multimae-b_98_rgb+-depth-semseg_1600e_multivit-afff3f8c.pth) | [**download**](https://github.com/EPFL-VILAB/MultiMAE/releases/download/pretrained-weights/multimae-b_98_rgb+-depth-semseg_1600e_timm-bafa5499.pth) | [link](cfgs/pretrain/multimae-b_98_rgb+-depth-semseg_1600e.yaml) |

These pre-trained models can then be fine-tuned using this codebase to reach the following performance:
| Method | Classif. (@1)<br>ImageNet-1K<br>(RGB) | Semantic segmentation (mIoU)<br>ADE20K<br>(RGB) | Semantic segmentation (mIoU)<br>Hypersim<br>(RGB / D / RGB + D) | Semantic segmentation (mIoU)<br>NYUv2<br>(RGB / D / RGB + D) | Depth (δ1)<br>NYUv2<br>(RGB) |
|-------------|------|------|--------------------|--------------------|------|
| Sup. (DeiT) | 81.8 | 45.8 | 33.9 / - / -       | 50.1 / - / -       | 80.7 |
| MAE         | 83.3 | 46.2 | 36.5 / - / -       | 50.8 / - / -       | 85.1 |
| MultiMAE    | 83.3 | 46.2 | 37.0 / 38.5 / 47.6 | 52.0 / 41.4 / 56.0 | 86.4 |
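
For reference, the timm-format weights linked above can be loaded into a standard ViT-B/16 like any other timm checkpoint. The following is a minimal sketch, not an official loader: it assumes the file holds a plain state dict (or one nested under a `"model"` key) and uses `strict=False` since a classifier head is not part of the pre-trained weights.

```python
# Minimal sketch (not an official loader): load the timm-format MultiMAE
# checkpoint from the table above into a standard timm ViT-B/16.
# Assumptions: the file holds a plain state dict or one under a "model" key,
# and head weights may be absent, hence strict=False.
import timm
import torch

URL = ("https://github.com/EPFL-VILAB/MultiMAE/releases/download/pretrained-weights/"
       "multimae-b_98_rgb+-depth-semseg_1600e_timm-bafa5499.pth")

model = timm.create_model("vit_base_patch16_224", pretrained=False)
ckpt = torch.hub.load_state_dict_from_url(URL, map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # unwrap if nested under "model"
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing keys:", missing)
print("unexpected keys:", unexpected)
```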
### Model formats

We provide pre-trained weights in two different formats: the single-modal ViT / timm format, which is compatible with other popular ViT repositories (e.g., [timm](https://github.com/rwightman/pytorch-image-models/tree/master/timm), [DINO](https://github.com/facebookresearch/dino), [MAE](https://github.com/facebookresearch/mae)), and the multi-modal MultiMAE / MultiViT format, which is used throughout this codebase for multi-modal pre-training and fine-tuning.

See [`multimae/multimae.py`](multimae/multimae.py) for the documentation and implementation of MultiMAE / MultiViT.

You can convert between these formats using the provided [`vit2multimae_converter.py`](tools/vit2multimae_converter.py) and [`multimae2vit_converter.py`](tools/multimae2vit_converter.py) scripts.

## Usage

### Set-up

See [SETUP.md](SETUP.md) for set-up instructions.

### Pre-training

See [PRETRAINING.md](PRETRAINING.md) for pre-training instructions.

### Fine-tuning

See [FINETUNING.md](FINETUNING.md) for fine-tuning instructions.

## Demo & visualizations

For interactive demos, please see our [`website`](https://multimae.epfl.ch). Open our [`Colab notebook`](https://colab.research.google.com/github/EPFL-VILAB/MultiMAE/blob/main/MultiMAE_Demo.ipynb) to play around with the visualization code, or simply upload an image to our [`Hugging Face Spaces demo`](https://huggingface.co/spaces/EPFL-VILAB/MultiMAE).

## Acknowledgement

This repository is built using the [timm](https://github.com/rwightman/pytorch-image-models/tree/master/timm), [DeiT](https://github.com/facebookresearch/deit), [DINO](https://github.com/facebookresearch/dino), [MoCo v3](https://github.com/facebookresearch/moco-v3), [BEiT](https://github.com/microsoft/unilm/tree/master/beit), [MAE-priv](https://github.com/BUPT-PRIV/MAE-priv), and [MAE](https://github.com/facebookresearch/mae) repositories.

## License

This project is under the CC-BY-NC 4.0 license. See [LICENSE](LICENSE) for details.

## Citation

If you find this repository helpful, please consider citing our work:

```BibTeX
@inproceedings{bachmann2022multimae,
  author    = {Roman Bachmann and David Mizrahi and Andrei Atanov and Amir Zamir},
  title     = {{MultiMAE}: Multi-modal Multi-task Masked Autoencoders},
  booktitle = {European Conference on Computer Vision},
  year      = {2022},
}
```
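
As referenced in the introduction, MultiMAE keeps only a small random subset of patches visible across modalities and reconstructs the rest. The paper allocates the visible-token budget across modalities by sampling proportions from a symmetric Dirichlet distribution. The sketch below is an illustrative paraphrase of that sampling strategy, not the repository's implementation; the budget of 98 visible tokens is taken from the pre-training config name above, and the `sample_visible_indices` helper is hypothetical.

```python
# Illustrative sketch of MultiMAE-style multi-modal masking (not the repo's
# implementation): split a budget of visible tokens across modalities via a
# symmetric Dirichlet, then pick that many random patches per modality.
# The budget of 98 visible tokens matches the pre-training config name above;
# alpha is the symmetric Dirichlet concentration described in the paper.
import torch

def sample_visible_indices(modalities, tokens_per_modality=196,
                           num_visible=98, alpha=1.0):
    """Return {modality: LongTensor of visible patch indices}."""
    proportions = torch.distributions.Dirichlet(
        torch.full((len(modalities),), alpha)).sample()
    # Rounding means the total can be off by a token or two; the actual
    # implementation enforces the exact budget.
    counts = (proportions * num_visible).round().long()
    counts = counts.clamp(max=tokens_per_modality)
    visible = {}
    for modality, count in zip(modalities, counts):
        perm = torch.randperm(tokens_per_modality)
        visible[modality] = perm[:count]
    return visible

if __name__ == "__main__":
    vis = sample_visible_indices(["rgb", "depth", "semseg"])
    for mod, idx in vis.items():
        print(f"{mod}: {len(idx)} visible of 196 patches")
```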