Whisper

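Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper
Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI.
Trained on more than 5 million hours of labeled data, the model generalizes strongly to many datasets and domains in a zero-shot setting.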

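Whisper large-v3-turbo is a fine-tuned version of a pruned Whisper large-v3.
In other words, it is the exact same model, except that the number of decoding layers has been reduced from 32 to 4.
As a result, the model is significantly faster, at the cost of a minor quality degradation. You can find more details in this GitHub discussion.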

Disclaimer: Content for this model card has partly been written by the 🤗 Hugging Face team, and partly copied and pasted from the original model card.

Usage

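Whisper large-v3-turbo is supported in Hugging Face 🤗 Transformers. To run the model, first install the Transformers library.
For this example, we'll also install 🤗 Datasets to load an audio dataset from the Hugging Face Hub, and 🤗 Accelerate to reduce the model loading time: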

pip install --upgrade pip
pip install --upgrade transformers datasets[audio] accelerate

The model can be used with the pipeline class to transcribe audio of arbitrary length:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])

To transcribe a local audio file, simply pass the path to the file when you call the pipeline:

result = pipe("audio.mp3")

Multiple audio files can be transcribed in parallel by passing them as a list and setting the batch_size parameter:

result = pipe(["audio_1.mp3", "audio_2.mp3"], batch_size=2)
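When there are many local files, it can help to group the paths into fixed-size batches before handing them to the pipeline, for example to keep memory use predictable. The following is a minimal sketch only: the helper name batched_paths and the file names are hypothetical, and the commented-out call assumes pipe has been built as shown earlier.

```python
from typing import Iterator, List, Sequence


def batched_paths(paths: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `batch_size` paths."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(paths), batch_size):
        yield list(paths[start:start + batch_size])


# Hypothetical file names, used only for illustration.
audio_files = ["audio_1.mp3", "audio_2.mp3", "audio_3.mp3"]
groups = list(batched_paths(audio_files, batch_size=2))

# Each group could then be transcribed with the pipeline defined earlier:
# for group in groups:
#     results = pipe(group, batch_size=len(group))
```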

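Transformers is compatible with all Whisper decoding strategies, such as temperature fallback and conditioning on previous tokens. The following example demonstrates how to enable these heuristics: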

generate_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
}

result = pipe(sample, generate_kwargs=generate_kwargs)
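To make the temperature fallback heuristic more concrete, here is a simplified sketch of the idea, not the Transformers implementation. It retries decoding at increasing temperatures until the output is neither too repetitive (high zlib compression ratio) nor too uncertain (low average log-probability). The names decode_with_fallback, decode_fn, and fake_decode are hypothetical, and the ratio here is measured over UTF-8 bytes rather than in token space.

```python
import zlib
from typing import Callable, Sequence, Tuple


def compression_ratio(text: str) -> float:
    """Ratio of raw to zlib-compressed size; repetitive text scores high."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def decode_with_fallback(
    decode_fn: Callable[[float], Tuple[str, float]],
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold: float = 1.35,
    logprob_threshold: float = -1.0,
) -> Tuple[str, float, float]:
    """Try each temperature in order and return the first acceptable result.

    `decode_fn` is a hypothetical callable that maps a temperature to
    (text, average log-probability). If every attempt fails the checks,
    the last attempt is returned.
    """
    if not temperatures:
        raise ValueError("at least one temperature is required")
    for temperature in temperatures:
        text, avg_logprob = decode_fn(temperature)
        too_repetitive = compression_ratio(text) > compression_ratio_threshold
        too_uncertain = avg_logprob < logprob_threshold
        if not (too_repetitive or too_uncertain):
            break
    return text, avg_logprob, temperature


def fake_decode(temperature: float) -> Tuple[str, float]:
    """Stand-in decoder: greedy decoding gets stuck in a repetition loop."""
    if temperature == 0.0:
        return "the the the " * 50, -0.2
    return "The quick brown fox jumps over the lazy dog.", -0.3


text, avg_logprob, used_temperature = decode_with_fallback(fake_decode)
```

In this toy run the greedy attempt is rejected for being repetitive, so the result comes from the next temperature in the tuple.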

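Whisper automatically predicts the language of the source audio. If the source audio language is known a priori, it can be passed as an argument to the pipeline: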

result = pipe(sample, generate_kwargs={"language": "english"})

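By default, Whisper performs speech transcription, where the source audio language is the same as the target text language. To perform speech translation, where the target text is in English, set the task to "translate":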

result = pipe(sample, generate_kwargs={"task": "translate"})

Finally, the model can be made to predict timestamps. For sentence-level timestamps, pass the return_timestamps argument:

result = pipe(sample, return_timestamps=True)
print(result["chunks"])

For word-level timestamps:

result = pipe(sample, return_timestamps="word")
print(result["chunks"])
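The chunks returned with timestamps are a list of dictionaries, each holding a "text" string and a "timestamp" pair of start and end seconds. As one possible use, the sketch below converts such a list into SubRip (SRT) subtitle text. The helper names and the example data are hypothetical, and a missing end time is assumed to fall back to the start time.

```python
from typing import Dict, List


def format_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def chunks_to_srt(chunks: List[Dict]) -> str:
    """Turn pipeline-style chunks into SRT subtitle text."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            # Assumption: if the end time is missing, reuse the start time.
            end = start
        entries.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(entries)


# Hand-written example data in the same shape as result["chunks"].
example_chunks = [
    {"timestamp": (0.0, 2.5), "text": " Hello there."},
    {"timestamp": (2.5, None), "text": " Goodbye."},
]
srt_text = chunks_to_srt(example_chunks)
```

With a real transcription, chunks_to_srt(result["chunks"]) could be written straight to a .srt file.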

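The above arguments can be used in isolation or in combination. For example, to translate French source audio into English and return sentence-level timestamps, use the following: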

result = pipe(sample, return_timestamps=True, generate_kwargs={"language": "french", "task": "translate"})
print(result["chunks"])
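Because these options combine freely, a small wrapper can catch typos in the task name before the model runs. This is a minimal sketch under stated assumptions: build_pipeline_kwargs is a hypothetical helper, not part of Transformers, and it only assembles the keyword arguments used in the calls above.

```python
from typing import Any, Dict, Optional, Union

VALID_TASKS = ("transcribe", "translate")


def build_pipeline_kwargs(
    language: Optional[str] = None,
    task: str = "transcribe",
    return_timestamps: Union[bool, str] = False,
) -> Dict[str, Any]:
    """Assemble keyword arguments for a pipeline call."""
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {VALID_TASKS}, got {task!r}")
    generate_kwargs: Dict[str, Any] = {"task": task}
    if language is not None:
        # Omit the key entirely so Whisper can detect the language itself.
        generate_kwargs["language"] = language
    kwargs: Dict[str, Any] = {"generate_kwargs": generate_kwargs}
    if return_timestamps:
        kwargs["return_timestamps"] = return_timestamps
    return kwargs


kwargs = build_pipeline_kwargs(
    language="french", task="translate", return_timestamps=True
)

# With the pipeline and sample defined earlier:
# result = pipe(sample, **kwargs)
```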

For more control over the generation parameters, use the model + processor API directly:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import Audio, load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]

inputs = processor(
    sample["array"],
    sampling_rate=sample["sampling_rate"],
    return_tensors="pt",
    truncation=False,
    padding="longest",
    return_attention_mask=True,
)
inputs = inputs.to(device, dtype=torch_dtype)

gen_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
}

pred_ids = model.generate(**inputs, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False)

print(pred_text)

Additional Speed & Memory Improvements

You can apply additional speed and memory improvements to Whisper to further increase inference speed and reduce VRAM requirements.

Chunked Long-Form

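Whisper has a receptive field of 30 seconds. To transcribe audio longer than this, one of two long-form algorithms is required:
1. Sequential: uses a "sliding window" for buffered inference, transcribing 30-second slices one after the other
2. Chunked: splits long audio files into shorter segments (with a small overlap between segments), transcribes each segment independently, and stitches the resulting transcriptions at the boundaries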
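The splitting step of the chunked algorithm can be illustrated with a few lines of arithmetic. The sketch below only computes overlapping segment boundaries in seconds. It is not the Transformers implementation, the helper name chunk_boundaries is hypothetical, and the 5-second overlap is an assumed value chosen for the example.

```python
from typing import List, Tuple


def chunk_boundaries(
    total_s: float, chunk_length_s: float = 30.0, overlap_s: float = 5.0
) -> List[Tuple[float, float]]:
    """Return (start, end) times of overlapping segments covering the audio."""
    if chunk_length_s <= 0:
        raise ValueError("chunk_length_s must be positive")
    if not 0 <= overlap_s < chunk_length_s:
        raise ValueError("overlap_s must be in [0, chunk_length_s)")
    total_s = float(total_s)
    step = chunk_length_s - overlap_s  # how far each new segment advances
    boundaries = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        boundaries.append((start, end))
        if end >= total_s:
            break
        start += step
    return boundaries


# A 70-second file split into 30-second segments that overlap by 5 seconds.
segments = chunk_boundaries(70.0)
```

Each segment shares its last few seconds with the start of the next one, which is what lets the transcriptions be stitched together at the boundaries.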

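The sequential long-form algorithm should be used in either of the following scenarios:
1. Transcription accuracy is the most important factor, and speed is less of a consideration
2. You are transcribing batches of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate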

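Conversely, the chunked algorithm should be used when:
1. Transcription speed is the most important factor
2. You are transcribing a single long audio file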
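The accuracy difference above is stated in terms of word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. To compare the two algorithms on your own audio, a minimal sketch of the metric looks like this. The function name is hypothetical, and text normalization (casing, punctuation), which is normally applied before scoring, is left out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1] / len(ref_words)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```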

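By default, Transformers uses the sequential algorithm. To enable the chunked algorithm, pass the chunk_length_s parameter to the pipeline. For large-v3, a chunk length of 30 seconds is optimal. To activate batching over long audio files, pass the batch_size argument: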

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    batch_size=16,  # batch size for inference - set based on your device
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])

Torch compile

The Whisper forward pass is compatible with torch.compile for 4.5x speed-ups.

Note: torch.compile is currently not compatible with the chunked long-form algorithm or Flash Attention 2 ⚠️

import torch
from torch.nn.attention import SDPBackend, sdpa_kernel
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
from tqdm import tqdm

torch.set_float32_matmul_precision("high")

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)

# Enable static cache and compile the forward pass
model.generation_config.cache_implementation = "static"
model.generation_config.max_new_tokens = 256
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

# 2 warmup steps
for _ in tqdm(range(2), desc="Warm-up step"):
    with sdpa_kernel(SDPBackend.MATH):
        result = pipe(sample.copy(), generate_kwargs={"min_new_tokens": 256, "max_new_tokens": 256})

# fast run
with sdpa_kernel(SDPBackend.MATH):
    result = pipe(sample.copy())

print(result["text"])
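Since the first calls to a compiled model include compilation time, it is worth timing warm-up and steady-state runs separately when checking the speed-up on your own hardware. The helper below is a generic sketch: time_runs is a hypothetical name, and the workload is a stand-in so the snippet runs without a GPU or a model download.

```python
import time
from typing import Callable, List, Tuple


def time_runs(
    fn: Callable[[], object], warmup: int = 2, runs: int = 3
) -> Tuple[List[float], List[float]]:
    """Return wall-clock durations of the warm-up calls and the measured calls."""

    def timed_call() -> float:
        start = time.perf_counter()
        fn()
        return time.perf_counter() - start

    warmup_times = [timed_call() for _ in range(warmup)]
    run_times = [timed_call() for _ in range(runs)]
    return warmup_times, run_times


# Stand-in workload, used only so the example is self-contained.
warmup_times, run_times = time_runs(lambda: sum(range(10_000)))

# With the compiled pipeline from above, the call could instead be:
# warmup_times, run_times = time_runs(lambda: pipe(sample.copy()))
```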

Flash Attention 2

We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flash
