# MambaVision: A Hybrid Mamba-Transformer Vision Backbone

Official PyTorch implementation of [**MambaVision: A Hybrid Mamba-Transformer Vision Backbone**](https://arxiv.org/abs/2407.08083).

[![Star on GitHub](https://img.shields.io/github/stars/NVlabs/MambaVision.svg?style=social)](https://github.com/NVlabs/MambaVision/stargazers)

[Ali Hatamizadeh](https://research.nvidia.com/person/ali-hatamizadeh) and [Jan Kautz](https://jankautz.com/).

For business inquiries, please visit our website and submit the form: [NVIDIA Research Licensing](https://www.nvidia.com/en-us/research/inquiries/)

Try MambaVision: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1WR8LAzRMoK19RiFA-Br0Xxir_Htb3pLf)

---

MambaVision demonstrates strong performance by achieving a new SOTA Pareto front in terms of Top-1 accuracy and throughput.

We introduce a novel mixer block by creating a symmetric path without SSM to enhance the modeling of global context:

MambaVision has a hierarchical architecture that employs both self-attention and mixer blocks:

![teaser](./mambavision/assets/arch.png)

## 💥 News 💥

- **[06.10.2025]** The MambaVision [poster](https://github.com/NVlabs/MambaVision/blob/main/mambavision/assets/mamba_vision_poster_cvpr25.pdf) will be presented at CVPR 2025 in Nashville on Sunday, June 15, 2025, from 10:30 a.m. to 12:30 p.m. CDT in Exhibit Hall D, Poster #403.
- **[06.10.2025]** Semantic segmentation code and models released [here](https://github.com/NVlabs/MambaVision/tree/main/semantic_segmentation) !
- **[06.07.2025]** Object detection code and models released [here](https://github.com/NVlabs/MambaVision/tree/main/object_detection) !
- **[03.29.2025]** You can now easily run MambaVision in Google Colab. Try here: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1WR8LAzRMoK19RiFA-Br0Xxir_Htb3pLf)
- **[03.29.2025]** New MambaVision [pip package](https://pypi.org/project/mambavision/) released !
- **[03.25.2025]** Updated [manuscript](https://arxiv.org/pdf/2407.08083) is now available on arXiv !
- **[03.25.2025]** 21K models and code added to the repository.
- **[03.25.2025]** MambaVision is the **first** mamba-based vision backbone at scale !
- **[03.24.2025]** [MambaVision-L3-512-21K](https://huggingface.co/nvidia/MambaVision-L3-512-21K) achieves a **Top-1 accuracy of 88.1%**
- **[03.24.2025]** New ImageNet-21K models have been added to the [MambaVision Hugging Face collection](https://huggingface.co/collections/nvidia/mambavision-66943871a6b36c9e78b327d3)
- **[02.26.2025]** MambaVision has been accepted to CVPR 2025 !
- **[07.24.2024]** MambaVision [Hugging Face](https://huggingface.co/collections/nvidia/mambavision-66943871a6b36c9e78b327d3) models are released !
- **[07.14.2024]** We added support for processing images of any resolution.
- **[07.12.2024]** [Paper](https://arxiv.org/abs/2407.08083) is now available on arXiv !
- **[07.11.2024]** [MambaVision pip package](https://pypi.org/project/mambavision/) is released !
- **[07.10.2024]** We have released the code and model checkpoints for MambaVision !

## Quick Start

### Google Colab

You can try image classification with MambaVision in Google Colab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1WR8LAzRMoK19RiFA-Br0Xxir_Htb3pLf)

### Hugging Face (Classification + Feature extraction)

Pretrained MambaVision models can be used via the [Hugging Face](https://huggingface.co/collections/nvidia/mambavision-66943871a6b36c9e78b327d3) library with **a few lines of code**. First install the requirements:

```bash
pip install mambavision
```

The model can then be imported:

```python
>>> from transformers import AutoModelForImageClassification
>>> model = AutoModelForImageClassification.from_pretrained("nvidia/MambaVision-T-1K", trust_remote_code=True)
```

We demonstrate an end-to-end image classification example in the following.

Given the following image from the [COCO dataset](https://cocodataset.org/#home) val set as an input:

The following snippet can be used:

```python
from transformers import AutoModelForImageClassification
from PIL import Image
from timm.data.transforms_factory import create_transform
import requests

model = AutoModelForImageClassification.from_pretrained("nvidia/MambaVision-T-1K", trust_remote_code=True)

# eval mode for inference
model.cuda().eval()

# prepare image for the model
url = 'http://images.cocodataset.org/val2017/000000020247.jpg'
image = Image.open(requests.get(url, stream=True).raw)
input_resolution = (3, 224, 224)  # MambaVision supports any input resolutions

transform = create_transform(input_size=input_resolution,
                             is_training=False,
                             mean=model.config.mean,
                             std=model.config.std,
                             crop_mode=model.config.crop_mode,
                             crop_pct=model.config.crop_pct)
inputs = transform(image).unsqueeze(0).cuda()

# model inference
outputs = model(inputs)
logits = outputs['logits']
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
```

The predicted label is `brown bear, bruin, Ursus arctos`.

You can also use Hugging Face MambaVision models for feature extraction. The model provides the outputs of each stage of the model (hierarchical multi-scale features in 4 stages) as well as the final average-pooled features, which are flattened. The former is used for downstream tasks such as classification and detection.

The following snippet can be used for feature extraction:

```python
from transformers import AutoModel
from PIL import Image
from timm.data.transforms_factory import create_transform
import requests

model = AutoModel.from_pretrained("nvidia/MambaVision-T-1K", trust_remote_code=True)

# eval mode for inference
model.cuda().eval()

# prepare image for the model
url = 'http://images.cocodataset.org/val2017/000000020247.jpg'
image = Image.open(requests.get(url, stream=True).raw)
input_resolution = (3, 224, 224)  # MambaVision supports any input resolutions

transform = create_transform(input_size=input_resolution,
                             is_training=False,
                             mean=model.config.mean,
                             std=model.config.std,
                             crop_mode=model.config.crop_mode,
                             crop_pct=model.config.crop_pct)
inputs = transform(image).unsqueeze(0).cuda()

# model inference
out_avg_pool, features = model(inputs)
print("Size of the averaged pool features:", out_avg_pool.size())  # torch.Size([1, 640])
print("Number of stages in extracted features:", len(features))  # 4 stages
print("Size of extracted features in stage 1:", features[0].size())  # torch.Size([1, 80, 56, 56])
print("Size of extracted features in stage 4:", features[3].size())  # torch.Size([1, 640, 7, 7])
```

Currently, we offer [MambaVision-T-1K](https://huggingface.co/nvidia/MambaVision-T-1K), [MambaVision-T2-1K](https://huggingface.co/nvidia/MambaVision-T2-1K), [MambaVision-S-1K](https://huggingface.co/nvidia/MambaVision-S-1K), [MambaVision-B-1K](https://huggingface.co/nvidia/MambaVision-B-1K), [MambaVision-L-1K](https://huggingface.co/nvidia/MambaVision-L-1K) and [MambaVision-L2-1K](https://huggingface.co/nvidia/MambaVision-L2-1K) on Hugging Face. All models can also be viewed [here](https://huggingface.co/collections/nvidia/mambavision-66943871a6b36c9e78b327d3).

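
As a rough guide to the feature-extraction output above, the sketch below estimates the per-stage feature shapes for a given input resolution. It assumes total strides of 4, 8, 16 and 32 and a channel width that doubles at each stage starting from 80. This matches the stage 1 and stage 4 shapes printed above for MambaVision-T at 224x224, but the intermediate stages are an assumption, and the helper name `expected_feature_shapes` is hypothetical and not part of the MambaVision API. It also assumes that height and width are divisible by the largest stride.

```python
def expected_feature_shapes(height, width, base_dim=80, batch=1, num_stages=4):
    """Estimate (batch, channels, H, W) for each stage.

    Assumption: stage i has total stride 4 * 2**i and base_dim * 2**i channels.
    This is an illustrative helper, not a function of the mambavision package.
    """
    max_stride = 4 * 2 ** (num_stages - 1)
    if height % max_stride or width % max_stride:
        raise ValueError(f"height and width must be divisible by {max_stride}")
    shapes = []
    for stage in range(num_stages):
        stride = 4 * 2 ** stage          # 4, 8, 16, 32 (assumed)
        channels = base_dim * 2 ** stage  # 80, 160, 320, 640 (assumed)
        shapes.append((batch, channels, height // stride, width // stride))
    return shapes


for idx, shape in enumerate(expected_feature_shapes(224, 224), start=1):
    print(f"stage {idx}: {shape}")
```

For a 224x224 input this reproduces the `[1, 80, 56, 56]` and `[1, 640, 7, 7]` sizes shown in the snippet above. For other model variants, `base_dim` would need to be read from the actual model output rather than assumed.
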
### Classification (pip package)

We can also import pre-trained MambaVision models from the pip package with **a few lines of code**:

```bash
pip install mambavision
```

A pretrained MambaVision model with default hyper-parameters can be created as follows:

```python
>>> from mambavision import create_model

>>> # Define mamba_vision_T model
>>> model = create_model('mamba_vision_T', pretrained=True, model_path="/tmp/mambavision_tiny_1k.pth.tar")
```

The available pretrained models are `mamba_vision_T`, `mamba_vision_T2`, `mamba_vision_S`, `mamba_vision_B`, `mamba_vision_L` and `mamba_vision_L2`.

We can also test the model by passing a dummy image with **any resolution**. The output is the logits:

```python
>>> import torch

>>> image = torch.rand(1, 3, 512, 224).cuda()  # place image on cuda
>>> model = model.cuda()  # place model on cuda
>>> output = model(image)  # output logit size is [1, 1000]
```

Using the pretrained models from our pip package, you can run validation:

```
python validate_pip_model.py --model mamba_vision_T --data_dir=$DATA_PATH --batch-size $BS
```

## Results + Pretrained Models

### ImageNet-21K

| Name | Acc@1(%) | Acc@5(%) | #Params(M) | FLOPs(G) | Resolution | HF | Download |
|---|---|---|---|---|---|---|---|
| MambaVision-B-21K | 84.9 | 97.5 | 97.7 | 15.0 | 224x224 | link | model |
| MambaVision-L-21K | 86.1 | 97.9 | 227.9 | 34.9 | 224x224 | link | model |
| MambaVision-L2-512-21K | 87.3 | 98.4 | 241.5 | 196.3 | 512x512 | link | model |
| MambaVision-L3-256-21K | 87.3 | 98.3 | 739.6 | 122.3 | 256x256 | link | model |
| MambaVision-L3-512-21K | 88.1 | 98.6 | 739.6 | 489.1 | 512x512 | link | model |

### ImageNet-1K

| Name | Acc@1(%) | Acc@5(%) | Throughput(Img/Sec) | Resolution | #Params(M) | FLOPs(G) | HF | Download |
|---|---|---|---|---|---|---|---|---|
| MambaVision-T | 82.3 | 96.2 | 6298 | 224x224 | 31.8 | 4.4 | link | model |
| MambaVision-T2 | 82.7 | 96.3 | 5990 | 224x224 | 35.1 | 5.1 | link | model |
| MambaVision-S | 83.3 | 96.5 | 4700 | 224x224 | 50.1 | 7.5 | link | model |
| MambaVision-B | 84.2 | 96.9 | 3670 | 224x224 | 97.7 | 15.0 | link | model |
| MambaVision-L | 85.0 | 97.1 | 2190 | 224x224 | 227.9 | 34.9 | link | model |
| MambaVision-L2 | 85.3 | 97.2 | 1021 | 224x224 | 241.5 | 37.5 | link | model |

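
To make the accuracy/throughput trade-off in the ImageNet-1K table concrete, the following minimal sketch checks which of the listed variants are Pareto-optimal with respect to each other. The numbers are copied from the table above; the `pareto_front` helper is a hypothetical illustration and not part of the repository. It only compares the MambaVision variants among themselves and does not reproduce the comparison against other backbones.

```python
# (name, Top-1 accuracy in %, throughput in img/sec), copied from the table above
MODELS = [
    ("MambaVision-T", 82.3, 6298),
    ("MambaVision-T2", 82.7, 5990),
    ("MambaVision-S", 83.3, 4700),
    ("MambaVision-B", 84.2, 3670),
    ("MambaVision-L", 85.0, 2190),
    ("MambaVision-L2", 85.3, 1021),
]


def pareto_front(entries):
    """Return the entries that no other entry dominates.

    An entry is dominated if another entry is at least as good in both
    accuracy and throughput and strictly better in at least one of them.
    """
    front = []
    for name, acc, thr in entries:
        dominated = any(
            a >= acc and t >= thr and (a > acc or t > thr)
            for _, a, t in entries
        )
        if not dominated:
            front.append((name, acc, thr))
    return front


for name, acc, thr in pareto_front(MODELS):
    print(f"{name}: {acc}% Top-1 at {thr} img/sec")
```

Within this set every variant trades throughput for accuracy, so none dominates another. Adding measurements for other backbones to `MODELS` would show which points remain on the front.
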
## Detection Results + Models

| Backbone | Detector | Lr Schd | box mAP | mask mAP | #Params(M) | FLOPs(G) | Config | Log | Model Ckpt |
|---|---|---|---|---|---|---|---|---|---|
| MambaVision-T-1K | Cascade Mask R-CNN | 3x | 51.1 | 44.3 | 86 | 740 | config | log | model |
| MambaVision-S-1K | Cascade Mask R-CNN | 3x | 52.3 | 45.2 | 108 | 828 | config | log | model |
| MambaVision-B-1K | Cascade Mask R-CNN | 3x | 52.8 | 45.7 | 145 | 964 | config | log | model |

## Segmentation Results + Models

| Backbone | Method | Lr Schd | mIoU | #Params(M) | FLOPs(G) | Config | Log | Model Ckpt |
|---|---|---|---|---|---|---|---|---|
| MambaVision-T-1K | UPerNet | 160K | 46.0 | 55 | 945 | config | log | model |
| MambaVision-S-1K | UPerNet | 160K | 48.2 | 84 | 1135 | config | log | model |
| MambaVision-B-1K | UPerNet | 160K | 49.1 | 126 | 1342 | config | log | model |
| MambaVision-L3-512-21K | UPerNet | 160K | 53.2 | 780 | 3670 | config | log | model |


## Installation

We provide a [docker file](./Dockerfile). In addition, assuming that a recent [PyTorch](https://pytorch.org/get-started/locally/) package is installed, the dependencies can be installed by running:

```bash
pip install -r requirements.txt
```

## Evaluation

The MambaVision models can be evaluated on the ImageNet-1K validation set using the following:

```
python validate.py \
--model --checkpoint --data_dir --batch-size