whisper-large-v3

openai/whisper-large-v3

License: apache-2.0

Introduction

Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting.

Model card

Languages

en zh de es ru ko fr ja pt tr pl ca nl ar sv it id hi fi vi he uk el ms cs ro da hu ta no th ur hr bg lt la mi ml cy sk te fa lv bn sr az sl kn et mk br eu is hy ne mn bs kk sq sw gl mr pa si km sn yo so af oc ka be tg sd gu am yi lo uz fo ht ps tk nn mt sa lb my bo tl mg as tt haw ln ha ba jw su

Task: automatic-speech-recognition

Tags: audio, automatic-speech-recognition, hf-asr-leaderboard

Model configuration

Model type: whisper

Architecture: WhisperForConditionalGeneration

Model details

Whisper

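Whisper large-v3 has the same architecture as the previous large and large-v2 models, except for the following minor differences:

1. The spectrogram input uses 128 Mel frequency bins instead of 80
2. A new language token was added for Cantonese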
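To make the first difference concrete, the sketch below computes the shape of the log-Mel input for one 30-second window. The 16 kHz sampling rate and 160-sample hop length are assumptions based on the usual Whisper feature extractor settings, not values stated in this card, and the function name is hypothetical.

```python
def log_mel_shape(n_mels, window_s=30, sampling_rate=16_000, hop_length=160):
    """Return (n_mels, n_frames) for one fixed-length input window.

    sampling_rate and hop_length are assumed defaults, not taken from the card.
    """
    n_samples = window_s * sampling_rate   # samples in one window
    n_frames = n_samples // hop_length     # one spectrogram frame per hop
    return (n_mels, n_frames)


shape_v2 = log_mel_shape(n_mels=80)    # large / large-v2
shape_v3 = log_mel_shape(n_mels=128)   # large-v3
print(shape_v2, shape_v3)
```

Because the number of Mel bins differs, features prepared for large-v2 should not be expected to work with large-v3; loading the processor from the same checkpoint as the model avoids the mismatch.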

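The Whisper large-v3 model was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio collected using Whisper large-v2. The model was trained for 2.0 epochs over this mixture dataset.

The large-v3 model shows improved performance across a wide variety of languages, with a 10% to 20% reduction in errors compared to Whisper large-v2. For more details on the different checkpoints available, refer to the Model details section.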
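As a rough illustration of what a 10% to 20% reduction means, the sketch below assumes the figure is a relative reduction in word error rate. The WER values are hypothetical and chosen only to make the arithmetic easy to follow; they are not measured results.

```python
def relative_reduction(old_error, new_error):
    """Fraction by which new_error improves on old_error."""
    return (old_error - new_error) / old_error


# Hypothetical numbers, not benchmark results.
wer_v2 = 10.0
for wer_v3 in (9.0, 8.0):
    reduction = relative_reduction(wer_v2, wer_v3)
    print(f"WER {wer_v2:.1f}% to {wer_v3:.1f}%: {reduction:.0%} relative reduction")
```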

Disclaimer: Parts of this model card were written by the 🤗 Hugging Face team, and parts were copied and pasted from the original model card.

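Usage

Whisper large-v3 is supported in Hugging Face 🤗 Transformers. To run the model, first install the Transformers library. For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub, and 🤗 Accelerate to reduce the model loading time: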

pip install --upgrade pip
pip install --upgrade transformers datasets[audio] accelerate

The model can be used with the pipeline class to transcribe audio of arbitrary length:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])

To transcribe a local audio file, pass the path to the file when you call the pipeline:

result = pipe("audio.mp3")

Multiple audio files can be transcribed in parallel by passing them as a list and setting the batch_size parameter:

result = pipe(["audio_1.mp3", "audio_2.mp3"], batch_size=2)
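When a list is passed, the pipeline returns one result per input, in the same order. The helper below is a small sketch that pairs each path with its transcription; transcribe_files is a hypothetical name, and pipe is assumed to be the pipeline object built earlier.

```python
from pathlib import Path


def transcribe_files(pipe, paths, batch_size=2):
    """Transcribe several files in one call and map each path to its text.

    pipe is assumed to behave like the ASR pipeline above: given a list of
    inputs, it returns a list of dicts that each contain a "text" key.
    """
    paths = [str(Path(p)) for p in paths]
    results = pipe(paths, batch_size=batch_size)
    return {path: result["text"] for path, result in zip(paths, results)}
```

For example, transcribe_files(pipe, ["audio_1.mp3", "audio_2.mp3"]) would return a dictionary keyed by file path.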

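Transformers is compatible with all Whisper decoding strategies, such as temperature fallback and conditioning on previous tokens. The following example demonstrates how to enable these heuristics: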

generate_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
}

result = pipe(sample, generate_kwargs=generate_kwargs)
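The idea behind temperature fallback can be sketched in plain Python: decode at the lowest temperature first, and retry at the next temperature whenever the output looks repetitive (high compression ratio) or low-confidence (low average log-probability). This is a simplified illustration, not the Transformers implementation. decode_fn is a hypothetical callable, and the compression ratio here is measured over the decoded text, whereas the comment in the example above refers to token space, so the thresholds are not interchangeable.

```python
import zlib


def compression_ratio(text):
    """Ratio of raw size to zlib-compressed size; repetitive text scores high."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def decode_with_fallback(decode_fn, temperatures,
                         compression_ratio_threshold, logprob_threshold):
    """Try each temperature in order and keep the first acceptable result.

    decode_fn is a hypothetical callable: temperature -> (text, avg_logprob).
    If every attempt fails the checks, the last attempt is returned.
    """
    result = None
    for temperature in temperatures:
        text, avg_logprob = decode_fn(temperature)
        result = (temperature, text)
        too_repetitive = compression_ratio(text) > compression_ratio_threshold
        too_uncertain = avg_logprob < logprob_threshold
        if not (too_repetitive or too_uncertain):
            break  # this attempt passed both checks
    return result
```

Ordering the temperatures from 0.0 upward means greedy decoding is used whenever it produces an acceptable result, and sampling is only a fallback.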

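Whisper predicts the language of the source audio automatically. If the source audio language is known a priori, it can be passed as an argument to the pipeline: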

result = pipe(sample, generate_kwargs={"language": "english"})

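By default, Whisper performs speech transcription, where the source audio language is the same as the target text language. To perform speech translation, where the target text is in English, set the task to "translate":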

result = pipe(sample, generate_kwargs={"task": "translate"})

Finally, the model can predict timestamps. For sentence-level timestamps, pass the return_timestamps argument:

result = pipe(sample, return_timestamps=True)
print(result["chunks"])

And for word-level timestamps:

result = pipe(sample, return_timestamps="word")
print(result["chunks"])
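Each entry in the chunks list is a dictionary with a "timestamp" pair (start and end, in seconds) and a "text" string. The sketch below formats such a list into readable lines. The sample data is hypothetical, format_chunks is an illustrative helper rather than part of Transformers, and the handling of a missing end time is a defensive assumption.

```python
def format_chunks(chunks):
    """Turn pipeline-style chunks into '[start -> end] text' lines."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        # Assumption: the end time may be None if it could not be predicted.
        end_label = f"{end:.2f}" if end is not None else "?"
        lines.append(f"[{start:.2f} -> {end_label}] {chunk['text'].strip()}")
    return lines


# Hypothetical chunks in the shape the pipeline returns.
example_chunks = [
    {"timestamp": (0.0, 2.5), "text": " Hello world."},
    {"timestamp": (2.5, None), "text": " Goodbye."},
]
for line in format_chunks(example_chunks):
    print(line)
```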

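The above arguments can be used in isolation or in combination. For example, to perform speech transcription where the source audio is in French and sentence-level timestamps should be returned, the following can be used:

result = pipe(sample, return_timestamps=True, generate_kwargs={"language": "french", "task": "transcribe"})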
print(result["chunks"])

For more control over the generation parameters, use the model + processor API directly:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import Audio, load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]

inputs = processor(
    sample["array"],
    sampling_rate=sample["sampling_rate"],
    return_tensors="pt",
    truncation=False,
    padding="longest",
    return_attention_mask=True,
)
inputs = inputs.to(device, dtype=torch_dtype)

gen_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
}

pred_ids = model.generate(**inputs, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False)

print(pred_text)

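Additional Speed & Memory Improvements

You can apply additional speed and memory improvements to Whisper to further reduce inference time and VRAM requirements.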

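Chunked Long-Form

Whisper has a receptive field of 30 seconds. To transcribe audio longer than this, one of two long-form algorithms is required:
1. Sequential: uses a "sliding window" for buffered inference, transcribing 30-second slices one after the other
2. Chunked: splits long audio files into shorter segments (with a small overlap between segments), transcribes each segment independently, and stitches the resulting transcriptions together at the boundaries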
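The splitting step of the chunked algorithm can be sketched as follows. This only computes the segment boundaries; it does not reproduce the stitching logic used by Transformers. The function name and the 5-second overlap are hypothetical choices for illustration.

```python
def chunk_bounds(duration_s, chunk_length_s=30.0, overlap_s=5.0):
    """Return (start, end) times of overlapping chunks that cover the audio."""
    if overlap_s >= chunk_length_s:
        raise ValueError("overlap_s must be smaller than chunk_length_s")
    step = chunk_length_s - overlap_s   # how far each new chunk advances
    bounds = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, duration_s)
        bounds.append((start, end))
        if end >= duration_s:           # the last chunk reaches the end of the audio
            break
        start += step
    return bounds


print(chunk_bounds(70.0))
```

Because the chunks are independent of one another, they can be batched and transcribed in parallel, which is where the speed advantage of the chunked algorithm comes from.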

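The sequential long-form algorithm should be used in either of the following scenarios:
1. Transcription accuracy is the most important factor, and speed is a secondary consideration
2. You are transcribing batches of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate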

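Conversely, the chunked algorithm should be used when:
1. Transcription speed is the most important factor
2. You are transcribing a single long audio file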

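By default, Transformers uses the sequential algorithm. To enable the chunked algorithm, pass the chunk_length_s parameter to the pipeline. For large-v3, a chunk length of 30 seconds is optimal. To activate batching over long audio files, pass the argument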