whisper-large-v3-turbo
Introduction
Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting.
Whisper
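Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper
Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI.
Trained on more than 5 million hours of labeled data, Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting.
Whisper large-v3-turbo is a finetuned version of a pruned Whisper large-v3.
In other words, it is the exact same model, except that the number of decoding layers has been reduced from 32 to 4.
As a result, the model is significantly faster, at the expense of a minor quality degradation. You can find more details about it in this GitHub discussion.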
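As a rough illustration of why dropping decoder layers matters, the sketch below estimates how many weights 28 decoder layers account for. The hidden size (1280) and feed-forward width (5120) are assumptions about the large-model configuration, not values stated in this card, and biases and layer norms are ignored, so treat the result as a back-of-the-envelope figure only.

```python
# Back-of-the-envelope estimate. D_MODEL and FFN_DIM are assumed values,
# not taken from this model card.
D_MODEL = 1280  # assumed hidden size of Whisper large
FFN_DIM = 5120  # assumed feed-forward width (4 * D_MODEL)


def decoder_layer_weights(d_model: int, ffn_dim: int) -> int:
    """Approximate weight count of one decoder layer (biases and norms ignored)."""
    self_attention = 4 * d_model * d_model   # query, key, value and output projections
    cross_attention = 4 * d_model * d_model  # same shapes, attending to the encoder output
    feed_forward = 2 * d_model * ffn_dim     # up-projection and down-projection
    return self_attention + cross_attention + feed_forward


per_layer = decoder_layer_weights(D_MODEL, FFN_DIM)
removed = (32 - 4) * per_layer  # layers dropped when going from 32 to 4

print(f"~{per_layer / 1e6:.1f}M weights per decoder layer")
print(f"~{removed / 1e6:.0f}M weights removed by keeping 4 of 32 decoder layers")
```

Because the decoder runs once per generated token, removing most of its layers tends to reduce generation time more than the raw weight count alone would suggest.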
Disclaimer: Content for this model card has partly been written by the 🤗 Hugging Face team, and partly copied and pasted from the original model card.
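Usage
Whisper large-v3-turbo is supported in Hugging Face 🤗 Transformers. To run the model, first install the Transformers library.
For this example, we'll also install 🤗 Datasets to load an audio dataset from the Hugging Face Hub, and 🤗 Accelerate to reduce the model loading time: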
pip install --upgrade pip
pip install --upgrade transformers datasets[audio] accelerate
The model can be used with the pipeline class to transcribe audio of arbitrary length:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
result = pipe("audio.mp3")
Multiple audio files can be transcribed in parallel by specifying them as a list and setting the batch_size parameter:
result = pipe(["audio_1.mp3", "audio_2.mp3"], batch_size=2)
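When transcribing a batch, it can be handy to keep each transcript paired with its file. The following is a minimal sketch, assuming the pipeline returns one result dict per input file, in input order. The helper name transcribe_files is hypothetical and not part of Transformers.

```python
from typing import Callable, Dict, List


def transcribe_files(pipe: Callable, paths: List[str], batch_size: int = 2) -> Dict[str, str]:
    """Run an ASR pipeline over several files and map each path to its transcript."""
    results = pipe(paths, batch_size=batch_size)
    # Pair every input path with the text of the result at the same position.
    return {path: result["text"].strip() for path, result in zip(paths, results)}
```

With the pipeline built above, this could be called as transcribe_files(pipe, ["audio_1.mp3", "audio_2.mp3"]).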
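Transformers is compatible with all Whisper decoding strategies, such as temperature fallback and conditioning on previous tokens. The following example demonstrates how to enable these heuristics: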
generate_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
}

result = pipe(sample, generate_kwargs=generate_kwargs)
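To make the temperature fallback heuristic more concrete, here is a simplified sketch of the retry logic. It is not the Transformers implementation: decode is a hypothetical callable that returns a transcript and its average log-probability for a given temperature, and the compression ratio is measured on UTF-8 text bytes for simplicity, whereas the comment above says Transformers works in token space.

```python
import zlib
from typing import Callable, Sequence, Tuple


def compression_ratio(text: str) -> float:
    """Raw size divided by zlib-compressed size; highly repetitive text scores high."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def decode_with_fallback(
    decode: Callable[[float], Tuple[str, float]],  # hypothetical: temperature -> (text, avg_logprob)
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold: float = 1.35,
    logprob_threshold: float = -1.0,
) -> Tuple[str, float]:
    """Retry at increasing temperatures until an attempt passes both checks.

    Returns the accepted text and the temperature that produced it. If every
    attempt fails, the last attempt is returned.
    """
    if not temperatures:
        raise ValueError("at least one temperature is required")
    text, used = "", temperatures[0]
    for used in temperatures:
        text, avg_logprob = decode(used)
        too_repetitive = compression_ratio(text) > compression_ratio_threshold
        too_unlikely = avg_logprob < logprob_threshold
        if not (too_repetitive or too_unlikely):
            break  # this attempt looks healthy, stop retrying
    return text, used
```

The idea is that greedy decoding (temperature 0.0) is tried first, and sampling at higher temperatures is only used when the output looks degenerate, either because it repeats itself or because the model assigns it low probability.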
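Whisper predicts the language of the source audio automatically. If the source audio language is known a priori, it can be passed as an argument to the pipeline: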
result = pipe(sample, generate_kwargs={"language": "english"})
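By default, Whisper performs the task of speech transcription, where the source audio language is the same as the target text language. To perform speech translation, where the target text is in English, set the task to "translate":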
result = pipe(sample, generate_kwargs={"task": "translate"})
Finally, the model can be made to predict timestamps. For sentence-level timestamps, pass the return_timestamps argument:
result = pipe(sample, return_timestamps=True)
print(result["chunks"])
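The chunks returned with sentence-level timestamps can be turned into subtitles. The sketch below assumes each chunk is a dict with a "text" string and a "timestamp" pair of (start, end) in seconds, and that the end time may occasionally be missing (None). The helper names and the fallback duration are illustrative choices, not part of Transformers.

```python
from typing import List


def format_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def chunks_to_srt(chunks: List[dict], fallback_duration: float = 2.0) -> str:
    """Convert timestamped chunks into SRT subtitle text."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:  # assumed edge case: no end time predicted for this chunk
            end = start + fallback_duration
        entries.append(
            f"{index}\n{format_time(start)} --> {format_time(end)}\n{chunk['text'].strip()}\n"
        )
    return "\n".join(entries)
```

With the result above, chunks_to_srt(result["chunks"]) would produce text that can be written to an .srt file.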
And for word-level timestamps:
result = pipe(sample, return_timestamps="word")
print(result["chunks"])
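Word-level chunks are often regrouped into short phrases, for example for captions. A minimal sketch, assuming each word chunk has the same "text" and "timestamp" fields as above. The pause threshold max_gap_s and the function name group_words are hypothetical.

```python
from typing import List


def group_words(word_chunks: List[dict], max_gap_s: float = 0.5) -> List[dict]:
    """Merge word-level chunks into phrases, splitting where the pause exceeds max_gap_s."""
    phrases: List[dict] = []
    for chunk in word_chunks:
        start, end = chunk["timestamp"]
        if end is None:  # assumed edge case: treat a missing end time as zero duration
            end = start
        word = chunk["text"].strip()
        if phrases and start - phrases[-1]["timestamp"][1] <= max_gap_s:
            # Short pause: extend the current phrase.
            last = phrases[-1]
            last["text"] += " " + word
            last["timestamp"] = (last["timestamp"][0], end)
        else:
            # First word or long pause: start a new phrase.
            phrases.append({"text": word, "timestamp": (start, end)})
    return phrases
```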
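The above arguments can be used in isolation or in combination. For example, to perform speech translation where the source audio is in French and return sentence-level timestamps, the following can be used: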
result = pipe(sample, return_timestamps=True, generate_kwargs={"language": "french", "task": "translate"})
print(result["chunks"])
For more control over the generation parameters, use the model + processor API directly:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import Audio, load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]

inputs = processor(
    sample["array"],
    sampling_rate=sample["sampling_rate"],
    return_tensors="pt",
    truncation=False,
    padding="longest",
    return_attention_mask=True,
)
inputs = inputs.to(device, dtype=torch_dtype)

gen_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
}

pred_ids = model.generate(**inputs, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False)

print(pred_text)
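Additional Speed & Memory Improvements
You can apply additional speed and memory improvements to Whisper to further increase inference speed and reduce VRAM requirements.
Chunked Long-Form
Whisper has a receptive field of 30 seconds. To transcribe audio longer than this, one of two long-form algorithms is required:
1. Sequential: uses a "sliding window" for buffered inference, transcribing 30-second slices one after the other
2. Chunked: splits long audio files into shorter ones (with a small overlap between segments), transcribes each segment independently, and stitches the resulting transcriptions at the boundaries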
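The splitting step of the chunked algorithm can be sketched as follows. This only computes the chunk boundaries. The 5-second overlap is an illustrative value rather than the one Transformers uses, and chunk_bounds is a hypothetical helper.

```python
from typing import List, Tuple


def chunk_bounds(
    total_s: float, chunk_length_s: float = 30.0, overlap_s: float = 5.0
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds of overlapping chunks covering the audio."""
    if not 0 <= overlap_s < chunk_length_s:
        raise ValueError("overlap_s must be non-negative and smaller than chunk_length_s")
    step = chunk_length_s - overlap_s  # how far each new chunk advances
    bounds = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        bounds.append((start, end))
        if end >= total_s:  # the last chunk reaches the end of the audio
            break
        start += step
    return bounds


print(chunk_bounds(70.0))
# → [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
```

Each chunk can then be transcribed independently (and therefore batched), with the overlapping regions used to stitch neighbouring transcripts together.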
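The sequential long-form algorithm should be used in either of the following scenarios:
1. Transcription accuracy is the most important factor, and speed is less of a consideration
2. You are transcribing batches of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate
Conversely, the chunked algorithm should be used when:
1. Transcription speed is the most important factor
2. You are transcribing a single long audio file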
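To compare the two algorithms on your own data, you can measure the word error rate (WER) of each against a reference transcript. The sketch below computes WER as the word-level edit distance divided by the number of reference words. It applies no text normalisation (casing, punctuation), which real evaluations usually do, so its numbers are only indicative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1       # reference word missing from the hypothesis
            insertion = current[j - 1] + 1   # extra word in the hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

For example, word_error_rate("the cat sat on the mat", "the cat sit on mat") counts one substitution and one deletion over six reference words.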
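By default, Transformers uses the sequential algorithm. To enable the chunked algorithm, pass the chunk_length_s parameter to the pipeline. For large-v3, a chunk length of 30 seconds is optimal. To activate batching over long audio files, pass the batch_size argument: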
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    batch_size=16,  # batch size for inference - set based on your device
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
Torch compile
The Whisper forward pass is compatible with torch.compile for 4.5x speed-ups.
Note: torch.compile is currently not compatible with the Chunked long-form algorithm or Flash Attention 2 ⚠️
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
from tqdm import tqdm

torch.set_float32_matmul_precision("high")

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)

# Enable static cache and compile the forward pass
model.generation_config.cache_implementation = "static"
model.generation_config.max_new_tokens = 256
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

# 2 warmup steps
for _ in tqdm(range(2), desc="Warm-up step"):
    with sdpa_kernel(SDPBackend.MATH):
        result = pipe(sample.copy(), generate_kwargs={"min_new_tokens": 256, "max_new_tokens": 256})

# fast run
with sdpa_kernel(SDPBackend.MATH):
    result = pipe(sample.copy())

print(result["text"])
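The speed-up you actually get depends on your hardware and inputs, so it is worth measuring. A minimal timing sketch, using only the standard library, that discards warm-up calls (where compilation happens) before averaging. The helper name time_call is hypothetical.

```python
import time
from typing import Callable


def time_call(fn: Callable[[], object], warmup: int = 2, runs: int = 3) -> float:
    """Return the mean wall-clock seconds per call, measured after the warm-up calls."""
    for _ in range(warmup):
        fn()  # warm-up calls are executed but not timed
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs
```

For instance, time_call(lambda: pipe(sample.copy())) could be run once with the compiled model and once with an uncompiled one to compare the two.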
Flash Attention 2
We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flash
Details
- Vendor: openai
- Task: automatic-speech-recognition
- Framework: transformers
- Model type: whisper
- License (HF): mit
- Languages: en, zh, de, es, ru, ko, fr, ja, pt, tr, pl, ca, nl, ar, sv, it, id, hi, fi, vi, he, uk, el, ms, cs, ro, da, hu, ta, no, th, ur, hr, bg, lt, la, mi, ml, cy, sk, te, fa, lv, bn, sr, az, sl, kn, et, mk, br, eu, is, hy, ne, mn, bs, kk, sq, sw, gl, mr, pa, si, km, sn, yo, so, af, oc, ka, be, tg, sd, gu, am, yi, lo, uz, fo, ht, ps, tk, nn, mt, sa, lb, my, bo, tl, mg, as, tt, haw, ln, ha, ba, jw, su