# ActionFormer: Localizing Moments of Actions with Transformers
## Introduction
This code repo implements ActionFormer, one of the first Transformer-based models for temporal action localization: detecting the onsets and offsets of action instances and recognizing their action categories. Without bells and whistles, ActionFormer achieves 71.0% mAP at tIoU=0.5 on THUMOS14, outperforming the best prior model by 14.1 absolute percentage points and crossing the 60% mAP threshold for the first time. ActionFormer also demonstrates strong results on ActivityNet 1.3 (36.56% average mAP) and the more challenging EPIC-Kitchens 100 (+13.5% average mAP over prior works). Our paper was accepted to ECCV 2022; an arXiv version can be found at [this link](https://arxiv.org/abs/2202.07925).
In addition, ActionFormer is the backbone of many winning solutions in the Ego4D Moment Queries Challenge 2022. Our submission in particular ranked 2nd, with a record 21.76% average mAP and 42.54% Recall@1x (tIoU=0.5), nearly three times the official baseline. An arXiv version of our tech report can be found at [this link](https://arxiv.org/abs/2211.09074). We invite our audience to try out the code.
Specifically, we adopt a minimalist design and develop a Transformer-based model for temporal action localization, inspired by the recent success of Transformers in NLP and vision. Our method adapts local self-attention to model temporal context in untrimmed videos, classifies every moment in an input video, and regresses the corresponding action boundaries. The result is a deep model that is trained with standard classification and regression losses and can localize moments of actions in a single shot, without action proposals or pre-defined anchor windows.
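To make the single-shot decoding concrete, here is a minimal sketch of how per-moment outputs can be turned into candidate action segments. This is an illustration only, not the repo's actual implementation; the names (`cls_scores`, `reg_left`, `reg_right`) and the plain-NumPy decoding loop are assumptions for exposition.
```python
import numpy as np

def decode_segments(cls_scores, reg_left, reg_right, stride_sec, score_thresh=0.1):
    """Turn per-moment outputs into candidate action segments.

    cls_scores: (T, C) array; probability that class c covers moment t.
    reg_left, reg_right: (T,) arrays; regressed distances (in seconds)
        from moment t to the action onset / offset.
    stride_sec: temporal stride between consecutive moments, in seconds.
    """
    T, C = cls_scores.shape
    segments = []
    for t in range(T):
        t_sec = t * stride_sec  # timestamp of this moment
        for c in range(C):
            if cls_scores[t, c] >= score_thresh:
                start = t_sec - reg_left[t]  # regressed onset
                end = t_sec + reg_right[t]   # regressed offset
                segments.append((start, end, c, float(cls_scores[t, c])))
    # In practice, candidates would then go through score ranking and NMS.
    return segments

# Toy usage: 5 moments, 2 classes, ~0.133 s stride (as for THUMOS I3D features).
rng = np.random.default_rng(0)
print(decode_segments(rng.random((5, 2)), rng.random(5), rng.random(5), 4 / 30))
```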
## Changelog
* 11/18/2022: We have released the [tech report](https://arxiv.org/abs/2211.09074) for our submission to the [Ego4D Moment Queries (MQ) Challenge](https://eval.ai/web/challenges/challenge-page/1626/overview). The code repo now includes config files, pre-trained models and results on the Ego4D MQ benchmark.
* 08/29/2022: Updated arXiv version.
* 08/01/2022: Updated code repo with latest results on ActivityNet.
* 07/08/2022: The paper is accepted to ECCV 2022.
* 05/09/2022: Pre-trained models have been updated.
* 05/08/2022: We have updated the code repo based on the community feedback and our code review, leading to significantly better average mAP on THUMOS14 (>66.0%) and slightly improved results on ActivityNet and EPIC-Kitchens 100.
## Code Overview
The structure of this code repo is heavily inspired by Detectron2. Some of the main components are
* ./libs/core: Parameter configuration module.
* ./libs/datasets: Data loader and IO module.
* ./libs/modeling: Our main model with all its building blocks.
* ./libs/utils: Utility functions for training, inference, and postprocessing.
## Installation
* Follow INSTALL.md for installing necessary dependencies and compiling the code.
## Frequently Asked Questions
* See FAQ.md.
## To Reproduce Our Results on THUMOS14
**Download Features and Annotations**
* Download *thumos.tar.gz* (`md5sum 375f76ffbf7447af1035e694971ec9b2`) from [this Box link](https://uwmadison.box.com/s/glpuxadymf3gd01m1cj6g5c3bn39qbgr) or [this Google Drive link](https://drive.google.com/file/d/1zt2eoldshf99vJMDuu8jqxda55dCyhZP/view?usp=sharing) or [this BaiduYun link](https://pan.baidu.com/s/1TgS91LVV-vzFTgIHl1AEGA?pwd=74eh).
* The file includes I3D features, action annotations in JSON format (similar to the ActivityNet annotation format), and external classification scores.
**Details**: The features are extracted from two-stream I3D models pretrained on Kinetics using clips of `16 frames` at the video frame rate (`~30 fps`) and a stride of `4 frames`. This gives one feature vector per `4/30 ~= 0.1333` seconds.
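For reference, here is a small helper that converts a feature index into an (approximate) time interval under the extraction settings above (16-frame clips, stride of 4 frames, ~30 fps). The helper itself is not part of the repo; the constants simply restate the numbers given here, and per-video frame rates may deviate slightly from 30 fps.
```python
# Approximate timing for the THUMOS14 I3D features: one vector every
# 4 frames of a ~30 fps video, each covering a 16-frame clip.
CLIP_LEN, STRIDE, FPS = 16, 4, 30.0

def feature_index_to_time(i):
    """Return (start, center, end) in seconds for the i-th feature vector."""
    start = i * STRIDE / FPS
    end = (i * STRIDE + CLIP_LEN) / FPS
    return start, (start + end) / 2, end

print(feature_index_to_time(0))  # (0.0, ~0.27, ~0.53)
```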
**Unpack Features and Annotations**
* Unpack the file under *./data* (or elsewhere and link to *./data*).
* The folder structure should look like
```
This folder
│ README.md
│ ...
│
└───data/
│ └───thumos/
│ │ └───annotations
│ │ └───i3d_features
│ └───...
│
└───libs
│
│ ...
```
**Training and Evaluation**
* Train our ActionFormer with I3D features. This will create an experiment folder under *./ckpt* that stores training config, logs, and checkpoints.
```shell
python ./train.py ./configs/thumos_i3d.yaml --output reproduce
```
* [Optional] Monitor the training using TensorBoard
```shell
tensorboard --logdir=./ckpt/thumos_i3d_reproduce/logs
```
* Evaluate the trained model. The expected average mAP should be around 62.6% as in Table 1 of our main paper. **With recent commits, the expected average mAP should be higher than 66.0%.**
```shell
python ./eval.py ./configs/thumos_i3d.yaml ./ckpt/thumos_i3d_reproduce
```
* Training our model on THUMOS requires ~4.5GB of GPU memory, while inference may require over 10GB of GPU memory. We recommend using a GPU with at least 12GB of memory.
**[Optional] Evaluating Our Pre-trained Model**
We also provide a pre-trained model for THUMOS14. The model with all training logs can be downloaded from [this Google Drive link](https://drive.google.com/file/d/1isG3bc1dG5-llBRFCivJwz_7c_b0XDcY/view?usp=sharing). To evaluate the pre-trained model, please follow the steps listed below.
* Create a folder *./pretrained* and unpack the file under *./pretrained* (or elsewhere and link to *./pretrained*).
* The folder structure should look like
```
This folder
│ README.md
│ ...
│
└───pretrained/
│ └───thumos_i3d_reproduce/
│ │ └───thumos_reproduce_log.txt
│ │ └───thumos_reproduce_results.txt
│ │ └───...
│ └───...
│
└───libs
│
│ ...
```
* The training config is recorded in *./pretrained/thumos_i3d_reproduce/config.txt*.
* The training log is located at *./pretrained/thumos_i3d_reproduce/thumos_reproduce_log.txt* and also *./pretrained/thumos_i3d_reproduce/logs*.
* The pre-trained model is *./pretrained/thumos_i3d_reproduce/epoch_034.pth.tar*.
* Evaluate the pre-trained model.
```shell
python ./eval.py ./configs/thumos_i3d.yaml ./pretrained/thumos_i3d_reproduce/
```
* The results (mAP at tIoUs) should be
| Method | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | Avg |
| ------------ | ----- | ----- | ----- | ----- | ----- | ----- |
| ActionFormer | 82.13 | 77.80 | 70.95 | 59.40 | 43.87 | 66.83 |
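As a quick sanity check, the Avg column here is the mean mAP over the five tIoU thresholds:
```python
# Average mAP on THUMOS14 = mean over tIoU in {0.3, 0.4, 0.5, 0.6, 0.7}
maps = [82.13, 77.80, 70.95, 59.40, 43.87]
print(round(sum(maps) / len(maps), 2))  # 66.83
```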
## To Reproduce Our Results on ActivityNet 1.3
**Download Features and Annotations**
* Download *anet_1.3.tar.gz* (`md5sum c415f50120b9425ee1ede9ac3ce11203`) from [this Box link](https://uwmadison.box.com/s/aisdoymowukc99zoc7gpqegxbb4whikx) or [this Google Drive Link](https://drive.google.com/file/d/1VW8px1Nz9A17i0wMVUfxh6YsPCLVqL-S/view?usp=sharing) or [this BaiduYun Link](https://pan.baidu.com/s/1tw5W8B5YqDvfl-mrlWQvnQ?pwd=xuit).
* The file includes TSP features, action annotations in JSON format (similar to the ActivityNet annotation format), and external classification scores.
**Details**: The features are extracted from the R(2+1)D-34 model pretrained with TSP on ActivityNet using clips of `16 frames` at a frame rate of `15 fps` and a stride of `16 frames` (*i.e.,* **non-overlapping** clips). This gives one feature vector per `16/15 ~= 1.067` seconds. The features are converted into numpy files for our code.
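Because the features ship as NumPy files, a quick sanity check like the one below can confirm that a video's feature count is consistent with its duration (one vector per 16/15 ≈ 1.067 s). The file name and the (T, D) array layout are illustrative assumptions; check the unpacked files for the actual naming scheme.
```python
import numpy as np

FEAT_STRIDE_SEC = 16.0 / 15.0  # non-overlapping 16-frame clips at 15 fps

# Hypothetical file name; actual files are named after the dataset's video ids.
feats = np.load("./data/anet_1.3/tsp_features/v_example.npy")
num_feats, feat_dim = feats.shape  # assuming a (T, D) layout
approx_duration = num_feats * FEAT_STRIDE_SEC
print(f"{num_feats} features of dim {feat_dim} ~= {approx_duration:.1f} s of video")
```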
**Unpack Features and Annotations**
* Unpack the file under *./data* (or elsewhere and link to *./data*).
* The folder structure should look like
```
This folder
│ README.md
│ ...
│
└───data/
│ └───anet_1.3/
│ │ └───annotations
│ │ └───tsp_features
│ └───...
│
└───libs
│
│ ...
```
**Training and Evaluation**
* Train our ActionFormer with TSP features. This will create an experiment folder under *./ckpt* that stores training config, logs, and checkpoints.
```shell
python ./train.py ./configs/anet_tsp.yaml --output reproduce
```
* [Optional] Monitor the training using TensorBoard
```shell
tensorboard --logdir=./ckpt/anet_tsp_reproduce/logs
```
* Evaluate the trained model. The expected average mAP should be around 36.5% as in Table 1 of our main paper.
```shell
python ./eval.py ./configs/anet_tsp.yaml ./ckpt/anet_tsp_reproduce
```
* Training our model on ActivityNet requires ~4.6GB of GPU memory, while inference may require over 10GB of GPU memory. We recommend using a GPU with at least 12GB of memory.
**[Optional] Evaluating Our Pre-trained Model**
We also provide a pre-trained model for ActivityNet 1.3. The model with all training logs can be downloaded from [this Google Drive link](https://drive.google.com/file/d/1JKh3w14ngAjgzuuP22BnjhkhIcBSqteJ/view?usp=sharing). To evaluate the pre-trained model, please follow the steps listed below.
* Create a folder *./pretrained* and unpack the file under *./pretrained* (or elsewhere and link to *./pretrained*).
* The folder structure should look like
```
This folder
│ README.md
│ ...
│
└───pretrained/
│ └───anet_tsp_reproduce/
│ │ └───anet_tsp_reproduce_log.txt
│ │ └───anet_tsp_reproduce_results.txt
│ │ └───...
│ └───...
│
└───libs
│
│ ...
```
* The training config is recorded in *./pretrained/anet_tsp_reproduce/config.txt*.
* The training log is located at *./pretrained/anet_tsp_reproduce/anet_tsp_reproduce_log.txt* and also *./pretrained/anet_tsp_reproduce/logs*.
* The pre-trained model is *./pretrained/anet_tsp_reproduce/epoch_014.pth.tar*.
* Evaluate the pre-trained model.
```shell
python ./eval.py ./configs/anet_tsp.yaml ./pretrained/anet_tsp_reproduce/
```
* The results (mAP at tIoUs) should be as follows. Note that the average mAP follows the ActivityNet protocol and is computed over tIoU thresholds [0.5:0.05:0.95], so it is not the mean of the three columns shown.
| Method | 0.5 | 0.75 | 0.95 | Avg |
| ------------ | ----- | ----- | ---- | ----- |
| ActionFormer | 54.67 | 37.81 | 8.36 | 36.56 |
**[Optional] Reproducing Our Results with I3D Features**
* Download *anet_1.3_i3d.tar.gz* (`md5sum e649425954e0123401650312dd0d56a7`) from [this Google Drive Link](https://drive.google.com/file/d/16239kUT2Z-j6S6PXIT1b_31OJi35QW_o/view?usp=sharing).
**Details**: The features are extracted from the I3D model pretrained on Kinetics using clips of `16 frames` at a frame rate of `25 fps` and a stride of `16 frames`. This gives one feature vector per `16/25 = 0.64` seconds. The features are converted into numpy files for our code.
* Unpack the file under *./data* (or elsewhere and link to *./data*), similar to TSP features.
* Train our ActionFormer with I3D features. This will create an experiment folder under *./ckpt* that stores training config, logs, and checkpoints.
```shell
python ./train.py ./configs/anet_i3d.yaml --output reproduce
```
* Evaluate the trained model. The expected average mAP should be around 36.0%. This is slightly better than the number in our paper; the improvement comes from a better training scheme and hyperparameters (see the comments in the config file).
```shell
python ./eval.py ./configs/anet_i3d.yaml ./ckpt/anet_i3d_reproduce
```
* The pre-trained model with all training logs can be downloaded from [this Google Drive link](https://drive.google.com/file/d/152dw2JDoNPssSnaQDaNolQUSFgcHlxe3/view?usp=sharing). To produce the results, create a folder *./pretrained*, unpack the file under *./pretrained* (or elsewhere and link to *./pretrained*), and run
```shell
python ./eval.py ./configs/anet_i3d.yaml ./pretrained/anet_i3d_reproduce/
```
* The results (mAP at tIoUs) with I3D features should be
| Method | 0.5 | 0.75 | 0.95 | Avg |
| ------------ | ----- | ----- | ---- | ----- |
| ActionFormer | 54.29 | 36.71 | 8.24 | 36.03 |
## To Reproduce Our Results on EPIC Kitchens 100
**Download Features and Annotations**
* Download *epic_kitchens.tar.gz* (`md5sum add9803756afd9a023bc9a9c547e0229`) from [this Box link](https://uwmadison.box.com/s/vdha47qnce6jhqktz9g4mq1gc40w82yj) or [this Google Drive Link](https://drive.google.com/file/d/1Z4U_dLuu6_cV5NBIrSzsSDOOj2Uar85X/view?usp=sharing) or [this BaiduYun Link](https://pan.baidu.com/s/15tOdX6Yp4AJ9lFGjbQ8dgg?pwd=f3tx).
* The file includes SlowFast features as well as action annotations in JSON format (similar to the ActivityNet annotation format).
**Details**: The features are extracted from the SlowFast model pretrained on the training set of EPIC Kitchens 100 (action classification) using clips of `32 frames` at a frame rate of `30 fps` and a stride of `16 frames`. This gives one feature vector per `16/30 ~= 0.5333` seconds.
**Unpack Features and Annotations**
* Unpack the file under *./data* (or elsewhere and link to *./data*).
* The folder structure should look like
```
This folder
│ README.md
│ ...
│
└───data/
│ └───epic_kitchens/
│ │ └───annotations
│ │ └───features
│ └───...
│
└───libs
│
│ ...
```
**Training and Evaluation**
* On EPIC Kitchens, we train separate models for nouns and verbs.
* To train our ActionFormer on verbs with SlowFast features, use
```shell
python ./train.py ./configs/epic_slowfast_verb.yaml --output reproduce
```
* To train our ActionFormer on nouns with SlowFast features, use
```shell
python ./train.py ./configs/epic_slowfast_noun.yaml --output reproduce
```
* Evaluate the trained model for verbs. The expected average mAP should be around 23.4% as in Table 2 of our main paper.
```shell
python ./eval.py ./configs/epic_slowfast_verb.yaml ./ckpt/epic_slowfast_verb_reproduce
```
* Evaluate the trained model for nouns. The expected average mAP should be around 21.9% as in Table 2 of our main paper.
```shell
python ./eval.py ./configs/epic_slowfast_noun.yaml ./ckpt/epic_slowfast_noun_reproduce
```
* Training our model on EPIC Kitchens requires ~4.5GB of GPU memory, while inference may require over 10GB of GPU memory. We recommend using a GPU with at least 12GB of memory.
**[Optional] Evaluating Our Pre-trained Model**
We also provide pre-trained models for EPIC-Kitchens 100. The models with all training logs can be downloaded from [this Google Drive link](https://drive.google.com/file/d/1Ta4ggKSj2YcszSrDbePlHe1ECF1CFKK4/view?usp=sharing) (verb) and [this Google Drive link](https://drive.google.com/file/d/1OTlxeiWj8JE9n1-LsRYogHmqgUdsE5PR/view?usp=sharing) (noun). To evaluate the pre-trained models, please follow the steps listed below.
* Create a folder *./pretrained* and unpack the file under *./pretrained* (or elsewhere and link to *./pretrained*).
* The folder structure should look like
```
This folder
│ README.md
│ ...
│
└───pretrained/
│ └───epic_slowfast_verb_reproduce/
│ │ └───epic_slowfast_verb_reproduce_log.txt
│ │ └───epic_slowfast_verb_reproduce_results.txt
│ │ └───...
│ └───epic_slowfast_noun_reproduce/
│ │ └───epic_slowfast_noun_reproduce_log.txt
│ │ └───epic_slowfast_noun_reproduce_results.txt
│ │ └───...
│ └───...
│
└───libs
│
│ ...
```
* The training config is recorded in *./pretrained/epic_slowfast_(verb|noun)_reproduce/config.txt*.
* The training log is located at *./pretrained/epic_slowfast_(verb|noun)_reproduce/epic_slowfast_(verb|noun)_reproduce_log.txt* and also *./pretrained/epic_slowfast_(verb|noun)_reproduce/logs*.
* The pre-trained model is *./pretrained/epic_slowfast_(verb|noun)_reproduce/epoch_020.pth.tar*.
* Evaluate the pre-trained model for verbs.
```shell
python ./eval.py ./configs/epic_slowfast_verb.yaml ./pretrained/epic_slowfast_verb_reproduce/
```
* Evaluate the pre-trained model for nouns.
```shell
python ./eval.py ./configs/epic_slowfast_noun.yaml ./pretrained/epic_slowfast_noun_reproduce/
```
* The results (mAP at tIoUs) should be
| Method | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | Avg |
| ------------------- | ----- | ----- | ----- | ----- | ----- | ----- |
| ActionFormer (verb) | 26.58 | 25.42 | 24.15 | 22.29 | 19.09 | 23.51 |
| ActionFormer (noun) | 25.21 | 24.11 | 22.66 | 20.47 | 16.97 | 21.88 |
## To Reproduce Our Results on Ego4D Moment Queries Benchmark
**Download Features and Annotations**
* Download the official SlowFast and Omnivore features from [the Ego4D website](https://ego4d-data.org/#download) and the official EgoVLP features from [this link](https://github.com/showlab/EgoVLP/issues/1#issuecomment-1219076014). Please note that we are not authorized to release the features and annotations. Instead, we provide our script for feature and annotation conversion at `./tools/convert_ego4d_trainval.py`.
**Details**: All features are extracted at `1.875 fps` from videos at `30 fps` (i.e., one feature every 16 frames). This gives one feature vector per `16/30 ~= 0.5333` seconds. Please refer to Ego4D and EgoVLP's documentation for more details on feature extraction.
**Unpack Features and Annotations**
* Unpack the file under *./data* (or elsewhere and link to *./data*).
* The folder structure should look like
```
This folder
│ README.md
│ ...
│
└───data/
│ └───ego4d/
│ │ └───annotations
│ │ └───slowfast_features
│ │ └───omnivore_features
│ │ └───egovlp_features
│ └───...
│
└───libs
│
│ ...
```
**Training and Evaluation**
* We provide config files for training ActionFormer with different feature combinations. For example, the following command trains on Omnivore and EgoVLP features and creates an experiment folder under *./ckpt* that stores the training config, logs, and checkpoints.
```shell
python ./train.py ./configs/ego4d_omnivore_egovlp.yaml --output reproduce
```
* [Optional] Monitor the training using TensorBoard
```shell
tensorboard --logdir=./ckpt/ego4d_omnivore_egovlp_reproduce/logs
```
* Evaluate the trained model. The expected average mAP and Recall@1x (tIoU=0.5) should be around 22.0% and 40.0%, respectively.
```shell
python ./eval.py ./configs/ego4d_omnivore_egovlp.yaml ./ckpt/ego4d_omnivore_egovlp_reproduce
```
* Training our model on Ego4D with all three features requires ~4.5GB of GPU memory, while inference may require over 10GB of GPU memory. We recommend using a GPU with at least 12GB of memory.
**[Optional] Evaluating Our Pre-trained Model**
We also provide pre-trained models for Ego4D, trained with all feature combinations. The models with all training logs can be downloaded from [this Google Drive link](https://drive.google.com/drive/folders/1NpAECS0ZhcCuehXkF9OhLQDPFrNdStJb?usp=sharing). To evaluate a pre-trained model, please follow the steps listed below.
* Create a folder *./pretrained* and unpack the file under *./pretrained* (or elsewhere and link to *./pretrained*).
* An example folder structure looks like
```
This folder
│ README.md
│ ...
│
└───pretrained/
│ └───ego4d_omnivore_egovlp_reproduce/
│ │ └───ego4d_omnivore_egovlp_reproduce_log.txt
│ │ └───ego4d_omnivore_egovlp_reproduce_results.txt
│ │ └───...
│ └───...
│
└───libs
│
│ ...
```
* The training config is recorded in *./pretrained/ego4d_omnivore_egovlp_reproduce/config.txt*.
* The training log is located at *./pretrained/ego4d_omnivore_egovlp_reproduce/ego4d_omnivore_egovlp_reproduce_log.txt* and also *./pretrained/ego4d_omnivore_egovlp_reproduce/logs*.
* The pre-trained model is *./pretrained/ego4d_omnivore_egovlp_reproduce/epoch_010.pth.tar*.
* Evaluate the pre-trained model.
```shell
python ./eval.py ./configs/ego4d_omnivore_egovlp.yaml ./pretrained/ego4d_omnivore_egovlp_reproduce/
```
* The results (mAP at tIoUs) should be as follows (S = SlowFast, O = Omnivore, E = EgoVLP features).
| Method | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | Avg |
| -------------------- | ----- | ----- | ----- | ----- | ----- | ----- |
| ActionFormer (S) | 20.09 | 17.45 | 14.44 | 12.46 | 10.00 | 14.89 |
| ActionFormer (O) | 23.87 | 20.78 | 18.39 | 15.33 | 12.65 | 18.20 |
| ActionFormer (E) | 26.84 | 23.86 | 20.57 | 17.19 | 14.54 | 20.60 |
| ActionFormer (S+E) | 27.98 | 24.46 | 21.21 | 18.56 | 15.60 | 21.56 |
| ActionFormer (O+E) | 27.99 | 24.94 | 21.94 | 19.05 | 15.98 | 21.98 |
| ActionFormer (S+O+E) | 28.26 | 24.69 | 21.88 | 19.35 | 16.28 | 22.09 |
* The results (Recall@1x at tIoUs) should be
| Method | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | Avg |
| -------------------- | ----- | ----- | ----- | ----- | ----- | ----- |
| ActionFormer (S) | 52.25 | 45.84 | 40.60 | 36.58 | 31.33 | 41.32 |
| ActionFormer (O) | 54.63 | 48.72 | 43.03 | 37.76 | 33.57 | 43.54 |
| ActionFormer (E) | 59.53 | 54.39 | 48.97 | 42.75 | 37.12 | 48.55 |
| ActionFormer (S+E) | 59.96 | 53.75 | 48.76 | 44.00 | 38.96 | 49.09 |
| ActionFormer (O+E) | 61.03 | 54.15 | 49.79 | 45.17 | 39.88 | 49.99 |
| ActionFormer (S+O+E) | 60.85 | 54.16 | 49.60 | 45.12 | 39.87 | 49.92 |
## Training and Evaluating Your Own Dataset
Work in progress. Stay tuned.
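In the meantime, note that the annotations used above all follow an ActivityNet-style JSON layout, so a custom dataset will likely need something similar. The sketch below writes a minimal file of that shape; the field names follow the public ActivityNet format, and the exact schema this repo expects may differ (compare against the provided annotation files and the loaders in ./libs/datasets).
```python
import json
import os

# A minimal ActivityNet-style annotation file (illustrative only; verify
# the exact fields against the annotations shipped with each dataset).
annotations = {
    "version": "custom-1.0",
    "database": {
        "video_0001": {
            "subset": "training",
            "duration": 120.5,  # video length in seconds
            "annotations": [
                {"label": "my_action", "segment": [12.3, 18.9]},  # [start, end] in seconds
            ],
        },
    },
}

os.makedirs("./data/my_dataset/annotations", exist_ok=True)
with open("./data/my_dataset/annotations/my_dataset.json", "w") as f:
    json.dump(annotations, f, indent=2)
```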
## Contact
Yin Li (yin.li@wisc.edu)
## References
If you are using our code, please consider citing our paper.
```
@inproceedings{zhang2022actionformer,
title={ActionFormer: Localizing Moments of Actions with Transformers},
author={Zhang, Chen-Lin and Wu, Jianxin and Li, Yin},
booktitle={European Conference on Computer Vision},
series={LNCS},
volume={13664},
  pages={492--510},
year={2022}
}
```
If you cite our results on Ego4D, please consider citing our tech report in addition to the main paper.
```
@article{mu2022actionformerego4d,
title={Where a Strong Backbone Meets Strong Features -- ActionFormer for Ego4D Moment Queries Challenge},
  author={Mu, Fangzhou and Mo, Sicheng and Wang, Gillian and Li, Yin},
journal={arXiv e-prints},
year={2022}
}
```
If you are using TSP features, please cite
```
@inproceedings{alwassel2021tsp,
title={{TSP}: Temporally-sensitive pretraining of video encoders for localization tasks},
author={Alwassel, Humam and Giancola, Silvio and Ghanem, Bernard},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops},
pages={3173--3183},
year={2021}
}
```