mms-300m-1130-forced-aligner
MahmoudAshraf
automatic-speech-recognition
transformers
MahmoudAshraf/mms-300m-1130-forced-aligner
Downloads: 3,477,232
Favorites: 87
Views: 9
License: cc-by-nc-4.0
Introduction
Forced Alignment with Hugging Face CTC Models. This Python package provides an efficient way to perform forced alignment between text and audio using Hugging Face's pretrained models. It also features an improved implementation that uses much less memory than the TorchAudio forced alignment API.
Model card
License
cc-by-nc-4.0
Languages
ab, af, ak, am, ar, as, av, ay, az, ba, bm, be, bn, bi, bo, sh, br, bg, ca, cs, ce, cv, ku, cy, da, de, dv, dz, el, en, eo, et, eu, ee, fo, fa, fj, fi, fr, fy, ff, ga, gl, gn, gu, zh, ht, ha, he, hi, sh, hu, hy, ig, ia, ms, is, it, jv, ja, kn, ka, kk, kr, km, ki, rw, ky, ko, kv, lo, la, lv, ln, lt, lb, lg, mh, ml, mr, ms, mk, mg, mt, mn, mi, my, zh, nl, no, no, ne, ny, oc, om, or, os, pa, pl, pt, ms, ps, qu, qu, qu, qu, qu, qu, qu, qu, qu, qu, qu, qu, qu, qu, qu, qu, qu, qu, qu, qu, qu, qu, ro, rn, ru, sg, sk, sl, sm, sn, sd, so, es, sq, su, sv, sw, ta, tt, te, tg, tl, th, ti, ts, tr, uk, ms, vi, wo, xh, ms, yo, ms, zu, za
Task
automatic-speech-recognition
Tags
mms
wav2vec2
audio
voice
speech
forced-alignment
Model configuration
Model type
wav2vec2
Architecture
Wav2Vec2ForCTC
Model details
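Forced Alignment with Hugging Face CTC Models
This Python package provides an efficient way to perform forced alignment between text and audio using Hugging Face's pretrained models. It also features an improved implementation that uses significantly less memory than TorchAudio's forced alignment API.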
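To make the idea of CTC forced alignment concrete, here is a minimal, pure-Python sketch of Viterbi alignment over a CTC trellis. It is a toy illustration of the general technique, not the package's implementation; the function names `ctc_forced_align` and `frames_to_spans`, the three-symbol vocabulary, and the emission values are all hypothetical. Converting frames to seconds assumes roughly 20 ms per frame, the usual wav2vec2 output stride at 16 kHz.

```python
import math


def ctc_forced_align(log_probs, tokens, blank=0):
    """Viterbi-align `tokens` to frames of `log_probs` (shape [T][V]).

    Returns one entry per frame: the index into `tokens` the frame is
    assigned to, or None when the frame is assigned to the blank symbol.
    """
    num_frames = len(log_probs)
    # Interleave blanks around the tokens: [blank, t0, blank, t1, ..., blank].
    ext = [blank]
    for tok in tokens:
        ext += [tok, blank]
    num_states = len(ext)

    neg_inf = -math.inf
    dp = [[neg_inf] * num_states for _ in range(num_frames)]
    back = [[0] * num_states for _ in range(num_frames)]

    # A valid path starts in the leading blank or in the first token.
    dp[0][0] = log_probs[0][ext[0]]
    if num_states > 1:
        dp[0][1] = log_probs[0][ext[1]]

    for t in range(1, num_frames):
        for s in range(num_states):
            # Allowed predecessors: stay, advance by one, or skip a blank
            # between two different tokens.
            cands = [s]
            if s >= 1:
                cands.append(s - 1)
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(s - 2)
            best = max(cands, key=lambda p: dp[t - 1][p])
            if dp[t - 1][best] == neg_inf:
                continue  # state not reachable at this frame
            dp[t][s] = dp[t - 1][best] + log_probs[t][ext[s]]
            back[t][s] = best

    # A valid path ends in the trailing blank or in the last token.
    s = num_states - 1
    if num_states > 1 and dp[-1][num_states - 2] > dp[-1][num_states - 1]:
        s = num_states - 2
    if dp[-1][s] == neg_inf:
        raise ValueError("not enough frames to align all tokens")

    # Backtrack from the last frame to the first.
    path = [0] * num_frames
    for t in range(num_frames - 1, -1, -1):
        path[t] = s
        s = back[t][s]

    # Odd trellis states are tokens, even states are blanks.
    return [(p - 1) // 2 if p % 2 == 1 else None for p in path]


def frames_to_spans(frame_tokens):
    """Group per-frame assignments into (token_index, start_frame, end_frame)."""
    spans = {}
    for t, idx in enumerate(frame_tokens):
        if idx is None:
            continue
        start, _ = spans.get(idx, (t, t))
        spans[idx] = (start, t + 1)  # end frame is exclusive
    return [(idx, s, e) for idx, (s, e) in sorted(spans.items())]


# Toy emissions: 5 frames over a vocabulary of {0: blank, 1: "a", 2: "b"}.
probs = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
    [0.8, 0.1, 0.1],
]
log_probs = [[math.log(p) for p in frame] for frame in probs]

frame_tokens = ctc_forced_align(log_probs, tokens=[1, 2])
spans = frames_to_spans(frame_tokens)

FRAME_SECONDS = 0.02  # assumption: 20 ms per emission frame
for idx, start, end in spans:
    print(idx, round(start * FRAME_SECONDS, 2), round(end * FRAME_SECONDS, 2))
```

The trellis here is a plain list of lists, so memory grows with frames times states; a production aligner would typically work on tensors and process long audio in windows.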
The model checkpoint uploaded here is the MMS-300M checkpoint, trained on a forced-alignment dataset and converted from torchaudio to HF Transformers.
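Because the checkpoint is in HF Transformers format with the `Wav2Vec2ForCTC` architecture, it should also be loadable directly through the Transformers auto classes. The sketch below is one way to do that; the helper name `load_checkpoint` is hypothetical, and it assumes the repository ships tokenizer files alongside the weights. The package's own `load_alignment_model` remains the documented entry point.

```python
MODEL_ID = "MahmoudAshraf/mms-300m-1130-forced-aligner"


def load_checkpoint(device="cpu"):
    """Hypothetical helper: load the converted checkpoint with Transformers."""
    # Imported lazily so this module can be imported without the heavy
    # dependencies installed.
    import torch
    from transformers import AutoModelForCTC, AutoTokenizer

    # Half precision on GPU, full precision on CPU (mirrors the usage example).
    dtype = torch.float16 if str(device).startswith("cuda") else torch.float32
    model = AutoModelForCTC.from_pretrained(MODEL_ID, torch_dtype=dtype)
    model = model.to(device).eval()
    # Assumption: the repository includes tokenizer files.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    return model, tokenizer
```

Calling `load_checkpoint()` downloads the weights from the Hub on first use, so it needs network access and enough disk space for the model.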
Installation
pip install git+https://github.com/MahmoudAshraf97/ctc-forced-aligner.git
Usage
import torch
from ctc_forced_aligner import (
    load_audio,
    load_alignment_model,
    generate_emissions,
    preprocess_text,
    get_alignments,
    get_spans,
    postprocess_results,
)

audio_path = "your/audio/path"
text_path = "your/text/path"
language = "iso"  # ISO-639-3 language code
device = "cuda" if torch.cuda.is_available() else "cpu"
batch_size = 16

# Load the alignment model and tokenizer; use half precision on GPU.
alignment_model, alignment_tokenizer = load_alignment_model(
    device,
    dtype=torch.float16 if device == "cuda" else torch.float32,
)

# Load the audio with the same dtype and device as the model.
audio_waveform = load_audio(audio_path, alignment_model.dtype, alignment_model.device)

# Read the transcript and collapse it into a single line of text.
with open(text_path, "r") as f:
    lines = f.readlines()
text = "".join(line for line in lines).replace("\n", " ").strip()

# Compute frame-level emissions from the audio in batches.
emissions, stride = generate_emissions(
    alignment_model, audio_waveform, batch_size=batch_size
)

# Romanize and tokenize the transcript for the target language.
tokens_starred, text_starred = preprocess_text(
    text,
    romanize=True,
    language=language,
)

# Align the tokens to the emission frames.
segments, scores, blank_token = get_alignments(
    emissions,
    tokens_starred,
    alignment_tokenizer,
)

# Group aligned frames into spans, then convert them to word timestamps.
spans = get_spans(tokens_starred, segments, blank_token)
word_timestamps = postprocess_results(text_starred, spans, stride, scores)
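As a follow-up, the word timestamps can be written out as subtitles. The sketch below assumes `word_timestamps` is a list of dictionaries with `"start"` and `"end"` in seconds and a `"text"` field; that structure is an assumption, not something stated above, so check the actual return value first. The helpers `format_srt_time` and `to_srt` and the sample values are hypothetical.

```python
def format_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, total_ms = divmod(total_ms, 3_600_000)
    minutes, total_ms = divmod(total_ms, 60_000)
    secs, millis = divmod(total_ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(word_timestamps):
    """Render one SRT cue per word.

    Assumption: each item is a dict with "start", "end" (seconds) and "text".
    """
    blocks = []
    for index, word in enumerate(word_timestamps, start=1):
        start = format_srt_time(word["start"])
        end = format_srt_time(word["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{word['text']}\n")
    return "\n".join(blocks)


# Made-up sample data standing in for the aligner's output.
sample = [
    {"start": 0.0, "end": 0.42, "text": "hello"},
    {"start": 0.5, "end": 1.0, "text": "world"},
]
srt_text = to_srt(sample)
print(srt_text)
```

One cue per word is the simplest mapping; grouping several words into a cue would usually read better on screen.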