whisper-large-v3-turbo

Introduction

Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting.

Model card

License: mit
Languages:
en zh de es ru ko fr ja pt tr pl ca nl ar sv it id hi fi vi he uk el ms cs ro da hu ta no th ur hr bg lt la mi ml cy sk te fa lv bn sr az sl kn et mk br eu is hy ne mn bs kk sq sw gl mr pa si km sn yo so af oc ka be tg sd gu am yi lo uz fo ht ps tk nn mt sa lb my bo tl mg as tt haw ln ha ba jw su
Framework: transformers
Task: automatic-speech-recognition
Tags: audio, automatic-speech-recognition

Model configuration

Model type: whisper
Architecture: WhisperForConditionalGeneration

Model details

Whisper

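Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. Trained on more than 5 million hours of labeled data, it generalizes well to many datasets and domains in a zero-shot setting.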

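Whisper large-v3-turbo is a fine-tuned version of a pruned Whisper large-v3. In other words, it is the same model, except that the number of decoder layers has been reduced from 32 to 4. As a result, the model is significantly faster, at the expense of a minor quality degradation. More details are available in this GitHub discussion.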
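To give a feel for how much of the decoder is removed by going from 32 to 4 layers, the arithmetic below estimates the weights held by the 28 dropped layers. The hidden size (1280) and feed-forward size (5120) are assumptions about the large-v3 architecture that this card does not state, and biases and layer norms are ignored, so this is a back-of-envelope figure rather than an official parameter count.

```python
# Back-of-envelope estimate of the weights removed by pruning the decoder.
# ASSUMPTIONS (not stated in this model card): hidden size 1280, feed-forward size 5120.
D_MODEL = 1280
FFN_DIM = 5120


def decoder_layer_weights(d_model: int, ffn_dim: int) -> int:
    """Weight-matrix parameters in one Transformer decoder layer (biases and layer norms ignored)."""
    self_attention = 4 * d_model * d_model   # query, key, value and output projections
    cross_attention = 4 * d_model * d_model  # the same four projections, attending to the encoder
    feed_forward = 2 * d_model * ffn_dim     # up-projection and down-projection
    return self_attention + cross_attention + feed_forward


layers_before, layers_after = 32, 4  # decoder depth before and after pruning
removed = (layers_before - layers_after) * decoder_layer_weights(D_MODEL, FFN_DIM)
print(f"approx. weights removed: {removed / 1e6:.0f}M")
```

The encoder is untouched by this pruning, which is why the speed-up comes mainly from the autoregressive decoding loop.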

Disclaimer: Content for this model card has partly been written by the 🤗 Hugging Face team, and partly copied and pasted from the original model card.

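Usage

Whisper large-v3-turbo is supported in Hugging Face 🤗 Transformers. To run the model, first install the Transformers library. For this example, we'll also install 🤗 Datasets to load an audio dataset from the Hugging Face Hub, and 🤗 Accelerate to reduce the model loading time: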

pip install --upgrade pip
pip install --upgrade transformers "datasets[audio]" accelerate

The model can be used with the pipeline class to transcribe audio of arbitrary length:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])

To transcribe a local audio file, pass the path to the file when you call the pipeline:

result = pipe("audio.mp3")

Multiple audio files can be transcribed in parallel by passing them as a list and setting the batch_size parameter:

result = pipe(["audio_1.mp3", "audio_2.mp3"], batch_size=2)
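When the files live in one folder, the list can be built programmatically. The helper below is a minimal sketch: `transcribe_directory` is a hypothetical name, not part of Transformers, and it only assumes that `pipe` returns one `{"text": ...}` dict per input file when given a list, as in the call above.

```python
from pathlib import Path
from typing import Callable, Dict, List


def transcribe_directory(
    pipe: Callable, directory: str, pattern: str = "*.mp3", batch_size: int = 2
) -> Dict[str, str]:
    """Transcribe every file matching `pattern` in `directory`; map file name to text."""
    paths: List[str] = sorted(str(p) for p in Path(directory).glob(pattern))
    if not paths:
        return {}
    # For a list input, the pipeline returns one result dict per file, in input order.
    results = pipe(paths, batch_size=batch_size)
    return {Path(p).name: r["text"] for p, r in zip(paths, results)}
```

A call such as `transcribe_directory(pipe, "recordings")` would then return a dictionary keyed by file name; `"recordings"` is a placeholder path.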

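Transformers is compatible with all Whisper decoding strategies, such as temperature fallback and conditioning on previous tokens. The following example demonstrates how to enable these heuristics: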

generate_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
}

result = pipe(sample, generate_kwargs=generate_kwargs)
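The idea behind temperature fallback can be sketched in a few lines: decode greedily first, and retry at a higher temperature only if the output looks degenerate. This is an illustration of the heuristic, not the Transformers implementation. `decode_fn` is a hypothetical callable returning `(text, average_log_probability)`, and the compression ratio here is measured on UTF-8 text for simplicity, whereas the comment in the snippet above notes that the library works in token space.

```python
import zlib
from typing import Callable, Sequence, Tuple


def compression_ratio(text: str) -> float:
    """Raw byte length divided by zlib-compressed length; repetitive text scores high."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def decode_with_fallback(
    decode_fn: Callable[[float], Tuple[str, float]],
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold: float = 1.35,
    logprob_threshold: float = -1.0,
) -> Tuple[str, float]:
    """Try each temperature in turn; return (text, temperature) of the first acceptable output.

    If no attempt passes both checks, the result of the last temperature is returned.
    """
    text, temperature = "", temperatures[0]
    for temperature in temperatures:
        text, avg_logprob = decode_fn(temperature)
        too_repetitive = compression_ratio(text) > compression_ratio_threshold
        too_uncertain = avg_logprob < logprob_threshold
        if not (too_repetitive or too_uncertain):
            break  # output looks healthy, stop retrying
    return text, temperature
```

A looping transcript such as "la la la ..." compresses extremely well, so it fails the ratio check and triggers a retry at the next temperature.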

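Whisper predicts the language of the source audio automatically. If the source audio language is known a priori, it can be passed as an argument to the pipeline: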

result = pipe(sample, generate_kwargs={"language": "english"})

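By default, Whisper performs speech transcription, where the source audio language is the same as the target text language. To perform speech translation, where the target text is in English, set the task to "translate":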

result = pipe(sample, generate_kwargs={"task": "translate"})

Finally, the model can be made to predict timestamps. For sentence-level timestamps, pass the return_timestamps argument:

result = pipe(sample, return_timestamps=True)
print(result["chunks"])

And for word-level timestamps:

result = pipe(sample, return_timestamps="word")
print(result["chunks"])

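The above arguments can be used in isolation or in combination. For example, to translate French source audio into English and return sentence-level timestamps, use: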

result = pipe(sample, return_timestamps=True, generate_kwargs={"language": "french", "task": "translate"})
print(result["chunks"])

For more control over the generation parameters, use the model + processor API directly:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import Audio, load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]

inputs = processor(
    sample["array"],
    sampling_rate=sample["sampling_rate"],
    return_tensors="pt",
    truncation=False,
    padding="longest",
    return_attention_mask=True,
)
inputs = inputs.to(device, dtype=torch_dtype)

gen_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
}

pred_ids = model.generate(**inputs, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False)

print(pred_text)

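Additional speed and memory improvements

You can apply additional speed and memory improvements to Whisper to further increase inference speed and reduce VRAM requirements.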

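Chunked long-form

Whisper has a receptive field of 30 seconds. To transcribe audio longer than this, one of two long-form algorithms is required:
1. Sequential: uses a "sliding window" for buffered inference, transcribing 30-second slices one after the other
2. Chunked: splits long audio files into shorter ones (with a small overlap between segments), transcribes each segment independently, and stitches the resulting transcriptions at the boundaries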
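The splitting step of the chunked approach can be pictured as follows. This is a simplified illustration, not the code Transformers runs: `chunk_spans` and its `overlap_s` parameter are hypothetical, and stitching the transcriptions at the boundaries is left out.

```python
from typing import List, Tuple


def chunk_spans(
    total_s: float, chunk_s: float = 30.0, overlap_s: float = 5.0
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds of overlapping chunks covering the audio."""
    if total_s <= 0:
        raise ValueError("total_s must be positive")
    if not 0 <= overlap_s < chunk_s:
        raise ValueError("overlap_s must be non-negative and smaller than chunk_s")
    step = chunk_s - overlap_s  # each new chunk starts this far after the previous one
    spans = []
    start = 0.0
    while True:
        end = min(start + chunk_s, total_s)
        spans.append((start, end))
        if end >= total_s:
            break
        start += step
    return spans


print(chunk_spans(70.0))
```

Because every chunk is independent of the others, the chunks can be batched through the model together, which is where the speed advantage of this approach comes from.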

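The sequential long-form algorithm should be used in either of the following scenarios:
1. Transcription accuracy is the most important factor, and speed is less of a consideration
2. You are transcribing batches of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate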

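Conversely, the chunked algorithm should be used when:
1. Transcription speed is the most important factor
2. You are transcribing a single long audio file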
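For readers unfamiliar with the WER (word error rate) metric used in the comparison above, it is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The function below is a minimal sketch of that definition. It applies no text normalization (casing, punctuation), which evaluation setups commonly add, so its numbers are not directly comparable to published figures.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions, deletions, insertions) per reference word."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference prefix seen so far and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

Comparing the two long-form algorithms on your own recordings then amounts to computing this value for each transcript against the same reference.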

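By default, Transformers uses the sequential algorithm. To enable the chunked algorithm, pass the chunk_length_s parameter to the pipeline. For large-v3, a chunk length of 30 seconds is optimal. To activate batching over long audio files, pass the batch_size argument: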

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    batch_size=16,  # batch size for inference - set based on your device
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])

Torch compile

The Whisper forward pass is compatible with torch.compile for 4.5x speed-ups.

Note: torch.compile is currently not compatible with the chunked long-form algorithm or Flash Attention 2 ⚠️

import torch
from torch.nn.attention import SDPBackend, sdpa_kernel
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
from tqdm import tqdm

torch.set_float32_matmul_precision("high")

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)

# Enable static cache and compile the forward pass
model.generation_config.cache_implementation = "static"
model.generation_config.max_new_tokens = 256
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

# 2 warmup steps
for _ in tqdm(range(2), desc="Warm-up step"):
    with sdpa_kernel(SDPBackend.MATH):
        result = pipe(sample.copy(), generate_kwargs={"min_new_tokens": 256, "max_new_tokens": 256})

# fast run
with sdpa_kernel(SDPBackend.MATH):
    result = pipe(sample.copy())

print(result["text"])
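To check what speed-up compilation gives on your own hardware, the warm-up and timed runs can be wrapped in a small harness. `mean_latency` is a hypothetical helper, not part of Transformers or PyTorch. It measures wall-clock time only, which should be adequate here on the assumption that the pipeline call returns decoded text and has therefore finished its GPU work by the time it returns.

```python
import time
from typing import Callable


def mean_latency(fn: Callable[[], object], warmup: int = 2, runs: int = 3) -> float:
    """Call `fn` `warmup` times untimed, then return the mean seconds per call over `runs` calls."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    for _ in range(warmup):
        fn()  # compilation and cache allocation happen during these calls
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs
```

For example, `mean_latency(lambda: pipe(sample.copy()))` could be run once with the compiled model and once with an uncompiled copy, and the two results compared.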

Flash Attention 2

We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flash
