mms-300m-1130-forced-aligner



Model card

License cc-by-nc-4.0
Languages
ab af ak am ar as av ay az ba bm be bn bi bo sh br bg ca cs ce cv ku cy da de dv dz el en eo et eu ee fo fa fj fi fr fy ff ga gl gn gu zh ht ha he hi hu hy ig ia ms is it jv ja kn ka kk kr km ki rw ky ko kv lo la lv ln lt lb lg mh ml mr mk mg mt mn mi my nl no ne ny oc om or os pa pl pt ps qu ro rn ru sg sk sl sm sn sd so es sq su sv sw ta tt te tg tl th ti ts tr uk vi wo xh yo zu za
Task automatic-speech-recognition
mms wav2vec2 audio voice speech forced-alignment

Model configuration

Model type wav2vec2
Architecture Wav2Vec2ForCTC

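Model details

Forced Alignment with Hugging Face CTC Models

This Python package provides an efficient way to perform forced alignment between text and audio using Hugging Face's pretrained models. It also features an improved implementation that uses much less memory than the TorchAudio forced alignment API.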
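To make the idea of forced alignment concrete, the following is a minimal pure-Python sketch of CTC Viterbi alignment on a toy emission matrix. It illustrates the general technique only and is not the package's memory-optimized implementation; the function names `ctc_forced_align` and `frames_to_spans`, the three-symbol vocabulary, and the toy probabilities are all assumptions made up for this example.

```python
import math


def ctc_forced_align(log_probs, targets, blank=0):
    """Return the most likely per-frame label path that spells `targets`.

    log_probs: list of T frames, each a list of V log-probabilities.
    targets:   list of token ids, without blanks.
    """
    # Interleave blanks around the targets: [blank, t1, blank, t2, ..., blank].
    ext = [blank]
    for tok in targets:
        ext += [tok, blank]
    T, S = len(log_probs), len(ext)
    NEG = -math.inf

    score = [[NEG] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = log_probs[0][ext[0]]
    if S > 1:
        score[0][1] = log_probs[0][ext[1]]

    for t in range(1, T):
        for s in range(S):
            # Allowed moves: stay, advance by one state, or skip the blank
            # that sits between two different tokens.
            cands = [(score[t - 1][s], s)]
            if s >= 1:
                cands.append((score[t - 1][s - 1], s - 1))
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append((score[t - 1][s - 2], s - 2))
            best, prev = max(cands)
            score[t][s] = best + log_probs[t][ext[s]]
            back[t][s] = prev

    # A valid path ends on the last token or on the trailing blank.
    s = S - 1
    if S > 1 and score[T - 1][S - 2] > score[T - 1][S - 1]:
        s = S - 2
    if score[T - 1][s] == NEG:
        raise ValueError("not enough frames to align the target sequence")

    # Follow the back-pointers from the last frame to the first.
    path = []
    for t in range(T - 1, -1, -1):
        path.append(ext[s])
        s = back[t][s]
    return path[::-1]


def frames_to_spans(path, blank=0):
    """Group consecutive frames into (token, start_frame, end_frame_exclusive)."""
    spans = []
    prev = blank
    for t, tok in enumerate(path):
        if tok != blank:
            if tok == prev:
                spans[-1] = (tok, spans[-1][1], t + 1)
            else:
                spans.append((tok, t, t + 1))
        prev = tok
    return spans


# Toy vocabulary: 0 = blank, 1 = "a", 2 = "b"; five frames of probabilities.
toy_probs = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
    [0.6, 0.1, 0.3],
    [0.1, 0.1, 0.8],
]
toy_log_probs = [[math.log(p) for p in frame] for frame in toy_probs]

path = ctc_forced_align(toy_log_probs, targets=[1, 2])
spans = frames_to_spans(path)
print(path)   # [0, 1, 1, 0, 2]
print(spans)  # [(1, 1, 3), (2, 4, 5)]
```

Multiplying each frame index by the duration of one frame turns these spans into timestamps, which is the kind of output a forced aligner produces for real audio.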

The model checkpoint uploaded here is the MMS-300M checkpoint, trained on a forced-alignment dataset, converted from torchaudio to HF Transformers.
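Because the checkpoint is in HF Transformers format with a `Wav2Vec2ForCTC` architecture, it can presumably be loaded with the generic Transformers auto classes as well. The sketch below assumes that the repository ships tokenizer files alongside the weights and that `transformers` and `torch` are installed; the helper name `load_checkpoint` is hypothetical and not part of any library.

```python
MODEL_ID = "MahmoudAshraf/mms-300m-1130-forced-aligner"


def load_checkpoint(model_id=MODEL_ID, device="cpu"):
    """Load the CTC model and its tokenizer from the Hugging Face Hub."""
    # Imported lazily so that defining this helper needs no extra dependencies.
    from transformers import AutoModelForCTC, AutoTokenizer

    # Assumption: the repository contains both model weights and tokenizer files.
    model = AutoModelForCTC.from_pretrained(model_id).to(device).eval()
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return model, tokenizer
```

Calling `load_checkpoint()` downloads the weights on first use, so it needs network access.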

Installation

pip install git+https://github.com/MahmoudAshraf97/ctc-forced-aligner.git

Usage

import torch
from ctc_forced_aligner import (
    load_audio,
    load_alignment_model,
    generate_emissions,
    preprocess_text,
    get_alignments,
    get_spans,
    postprocess_results,
)

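audio_path = "your/audio/path"  # path to the audio file to align
text_path = "your/text/path"  # path to a plain-text transcript of that audio
language = "iso"  # replace with an ISO 639-3 language code, e.g. "eng" for English
device = "cuda" if torch.cuda.is_available() else "cpu"
batch_size = 16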

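# Load the alignment model and tokenizer (half precision on GPU, full precision on CPU).
alignment_model, alignment_tokenizer = load_alignment_model(
    device,
    dtype=torch.float16 if device == "cuda" else torch.float32,
)

# Load the audio with the same dtype and on the same device as the model.
audio_waveform = load_audio(audio_path, alignment_model.dtype, alignment_model.device)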

# Read the transcript and collapse it into a single line of text.
with open(text_path, "r", encoding="utf-8") as f:
    text = f.read().replace("\n", " ").strip()

emissions, stride = generate_emissions(
    alignment_model, audio_waveform, batch_size=batch_size
)

tokens_starred, text_starred = preprocess_text(
    text,
    romanize=True,
    language=language,
)

segments, scores, blank_token = get_alignments(
    emissions,
    tokens_starred,
    alignment_tokenizer,
)

spans = get_spans(tokens_starred, segments, blank_token)

word_timestamps = postprocess_results(text_starred, spans, stride, scores)
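
Once `word_timestamps` is available, it can be turned into a subtitle-style listing. The sketch below assumes that each entry is a dict with `"start"` and `"end"` times in seconds and a `"text"` field; that structure is an assumption about the return value, so inspect the actual output before relying on it. The helpers `format_timestamp` and `to_srt` and the `sample` data are hypothetical and exist only for illustration.

```python
def format_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm (the SubRip timestamp layout)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(word_timestamps):
    """Render one numbered SubRip block per aligned entry."""
    blocks = []
    for index, word in enumerate(word_timestamps, start=1):
        start = format_timestamp(word["start"])
        end = format_timestamp(word["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{word['text']}\n")
    return "\n".join(blocks)


# Made-up sample in the assumed format, standing in for real aligner output.
sample = [
    {"start": 0.0, "end": 0.42, "text": "hello"},
    {"start": 0.5, "end": 1.02, "text": "world"},
]
srt_text = to_srt(sample)
print(srt_text)
```

With real output, `to_srt(word_timestamps)` can be written to a `.srt` file for inspection in a video player.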

Details

Author
MahmoudAshraf
Task
automatic-speech-recognition
Framework
transformers
Model type
wav2vec2
License (HF)
cc-by-nc-4.0