# Distil-Whisper

[[Paper]](https://arxiv.org/abs/2311.00430)
[[Models]](https://huggingface.co/collections/distil-whisper/distil-whisper-models-65411987e6727569748d2eb6)
[[Colab]](https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/Distil_Whisper_Benchmark.ipynb)

Distil-Whisper is a distilled version of Whisper that is **6 times faster**, 49% smaller, and performs **within 1% word error rate (WER)** on out-of-distribution evaluation sets.

| Model                                                                      | Params / M | Rel. Latency | Short-Form WER | Long-Form WER |
|----------------------------------------------------------------------------|------------|--------------|----------------|---------------|
| [whisper-large-v2](https://huggingface.co/openai/whisper-large-v2)         | 1550       | 1.0          | **9.1**        | 11.7          |
|                                                                            |            |              |                |               |
| [distil-large-v2](https://huggingface.co/distil-whisper/distil-large-v2)   | 756        | 5.8          | 10.1           | **11.6**      |
| [distil-medium.en](https://huggingface.co/distil-whisper/distil-medium.en) | **394**    | **6.8**      | 11.1           | 12.4          |

**Note:** Distil-Whisper is currently only available for English speech recognition. Multilingual support will be provided soon.

## 1. Usage

Distil-Whisper is supported in Hugging Face 🤗 Transformers from version 4.35 onwards. To run the model, first install the latest version of the Transformers library. For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub:

```bash
pip install --upgrade pip
pip install --upgrade transformers accelerate datasets[audio]
```

### Short-Form Transcription

Short-form transcription is the process of transcribing audio samples that are less than 30 seconds long, which is the maximum receptive field of the Whisper models. This means the entire audio clip can be processed in one go without the need for chunking.

First, we load Distil-Whisper via the convenient [`AutoModelForSpeechSeq2Seq`](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForSpeechSeq2Seq) and [`AutoProcessor`](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoProcessor) classes. We load the model in `float16` precision and pass `low_cpu_mem_usage=True` so that loading takes as little time as possible. In addition, we ensure the model is loaded in [`safetensors`](https://github.com/huggingface/safetensors) format by passing `use_safetensors=True`:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)
```

The model and processor can then be passed to the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline). Note that if you would like to have more control over the generation process, you can make use of `model.generate(...)` directly, as shown [here](https://huggingface.co/docs/transformers/v4.34.1/en/model_doc/whisper#transformers.WhisperForConditionalGeneration.forward.example).
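As a minimal sketch of that lower-level path (assuming the `model`, `processor`, `device` and `torch_dtype` defined above, and the same LibriSpeech dummy sample used in the pipeline example below), you featurize the audio yourself and decode the generated token ids:

```python
from datasets import load_dataset

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

# Convert the raw waveform to log-mel input features
inputs = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt")
input_features = inputs.input_features.to(device, dtype=torch_dtype)

# Generate token ids, then decode them to text
predicted_ids = model.generate(input_features, max_new_tokens=128)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```

The pipeline wraps exactly these steps, so for most use cases it is the more convenient option: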
```python
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)
```

Next, we load an example short-form audio sample from the LibriSpeech corpus:

```python
from datasets import load_dataset

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]
```

Finally, we can call the pipeline to transcribe the audio:

```python
result = pipe(sample)
print(result["text"])
```

To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:

```python
result = pipe("audio.mp3")
print(result["text"])
```

For more information on how to customize the automatic speech recognition pipeline, please refer to the ASR pipeline [docs](https://huggingface.co/docs/transformers/v4.34.1/en/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline). We also provide an end-to-end [Google Colab](https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/Distil_Whisper_Benchmark.ipynb) that benchmarks Whisper against Distil-Whisper.

### Long-Form Transcription

Distil-Whisper uses a chunked algorithm to transcribe audio files longer than 30 seconds. In practice, this chunked long-form algorithm is 9x faster than the sequential algorithm proposed by OpenAI in the Whisper paper (see Table 7 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430)).

We can load the model and processor as before:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)
```

To enable chunking, pass the `chunk_length_s` parameter to the `pipeline`. For Distil-Whisper, a chunk length of 15 seconds is optimal. To activate batching, pass the argument `batch_size`:

```python
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)
```

The argument `max_new_tokens` controls the maximum number of tokens generated *per chunk*. In typical speech, no more than 3 words are spoken per second, so a 15-second input contains at most 45 words (approximately 60 tokens). We set the maximum number of generated tokens per chunk to 128 to truncate any hallucinations that occur at the end of a segment. Since these tokens are removed at the chunk borders by the long-form chunking algorithm anyway, it is more efficient to truncate them directly during generation and avoid redundant decoder steps.
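To make that budget concrete, here is the back-of-the-envelope arithmetic as a tiny sketch (the constants are the illustrative figures from the paragraph above, not measured values):

```python
chunk_length_s = 15        # optimal chunk length for Distil-Whisper
words_per_second = 3       # rough upper bound for typical speech
tokens_per_word = 60 / 45  # ~1.33 tokens per word, per the estimate above

expected_tokens = chunk_length_s * words_per_second * tokens_per_word
print(expected_tokens)     # 60.0 -> max_new_tokens=128 leaves ample headroom
```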
Now, let's load a long-form audio sample. Here, we use an example of concatenated samples from the LibriSpeech corpus:

```python
from datasets import load_dataset

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]
```

Finally, we can call the pipeline to transcribe the audio:

```python
result = pipe(sample)
print(result["text"])
```

For more information on how to customize the automatic speech recognition pipeline, please refer to the ASR pipeline [docs](https://huggingface.co/docs/transformers/v4.34.1/en/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline).

### Speculative Decoding

Distil-Whisper can be used as an assistant model to Whisper for speculative decoding. As a refresher, we recommend reading Joao's [amazing blog post](https://huggingface.co/blog/assisted-generation) or taking a look at [the original paper](https://arxiv.org/abs/2211.17192). Speculative decoding mathematically ensures that exactly the same outputs as Whisper are obtained, while being 2 times faster. This makes it a perfect drop-in replacement for existing Whisper pipelines, since identical outputs are guaranteed.

For speculative decoding, we need to load both the teacher, [`openai/whisper-large-v2`](https://huggingface.co/openai/whisper-large-v2), and the assistant (*a.k.a.* student), [`distil-whisper/distil-large-v2`](https://huggingface.co/distil-whisper/distil-large-v2).

Let's start by loading the teacher model and processor, in much the same way we loaded the Distil-Whisper model in the previous examples:

```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v2"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)
```

Now let's load the assistant. Since Distil-Whisper shares exactly the same encoder as the teacher model, we only need to load its 2-layer decoder as a "decoder-only" model:

```python
from transformers import AutoModelForCausalLM

assistant_model_id = "distil-whisper/distil-large-v2"

assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
assistant_model.to(device)
```

The assistant model shares the same processor as the teacher, so there's no need to load a separate student processor.
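If you prefer to work below the pipeline level, the assistant can also be passed directly to `model.generate` (a minimal sketch, assuming the teacher `model`, `assistant_model`, and `processor` loaded above, plus an audio `sample` like those loaded earlier):

```python
# Featurize the audio with the shared processor
inputs = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt")
input_features = inputs.input_features.to(device, dtype=torch_dtype)

# The assistant drafts candidate tokens; the teacher verifies them,
# so the output is identical to running the teacher alone
predicted_ids = model.generate(
    input_features, assistant_model=assistant_model, max_new_tokens=128
)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```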
We can now pass the assistant model to the pipeline to be used for speculative decoding. We pass it via `generate_kwargs`, under the key [`"assistant_model"`](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationMixin.generate.assistant_model), so that speculative decoding is enabled:

```python
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)
```

As before, we can pass any sample to the pipeline to be transcribed:

```python
from datasets import load_dataset

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```

**Note:** speculative decoding should on average be 2x faster than using Whisper large-v2 alone, at a mere 8% increase in VRAM usage, while mathematically guaranteeing the same results. This makes it the perfect replacement for Whisper large-v2 in existing speech recognition pipelines.

### Additional Speed & Memory Improvements

You can apply additional speed and memory improvements to Distil-Whisper, which we cover in the following sections.

#### Flash Attention

We recommend using [Flash Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2) if your GPU allows for it. To do so, you first need to install [Flash Attention](https://github.com/Dao-AILab/flash-attention):

```
pip install flash-attn --no-build-isolation
```

You can then pass `use_flash_attention_2=True` to `from_pretrained` to enable Flash Attention 2:

```diff
- model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, use_flash_attention_2=True)
```

#### Torch Scaled Dot-Product Attention (SDPA)

If your GPU does not support Flash Attention, we recommend making use of [BetterTransformer](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#bettertransformer). To do so, you first need to install Optimum:

```
pip install --upgrade optimum
```

And then convert your model to a "BetterTransformer" model before using it:

```diff
  model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
+ model = model.to_bettertransformer()
```

## 2. Why use Distil-Whisper? ⁉️

Distil-Whisper is designed to be a drop-in replacement for Whisper on English speech recognition. Here are 5 reasons for making the switch to Distil-Whisper:

1. **Faster inference:** 6 times faster inference speed, while performing to within 1% WER of Whisper on out-of-distribution audio: