paraphrase-multilingual-MiniLM-L12-v2
Downloads: 46,703,265
Favorites: 1424
Views: 93
License: apache-2.0
Overview
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
Model card
License
apache-2.0
Languages
multilingual, ar, bg, ca, cs, da, de, el, en, es, et, fa, fi, fr, gl, gu, he, hi, hr, hu, hy, id, it, ja, ka, ko, ku, lt, lv, mk, mn, mr, ms, my, nb, nl, pl, pt, ro, ru, sk, sl, sq, sr, sv, th, tr, uk, ur, vi
Framework
sentence-transformers
Task
sentence-similarity
sentence-transformers
feature-extraction
sentence-similarity
transformers
Model configuration
Model type
bert
Architecture
BertModel
Model details
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
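This is a sentence-transformers model: it maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks such as clustering or semantic search.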
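To make the clustering use case concrete, here is a minimal sketch that groups vectors by cosine similarity. It runs on toy 2-dimensional vectors so it needs no model download; in practice the inputs would be the 384-dimensional embeddings this model produces. The helper names `normalize` and `greedy_cluster` and the threshold of 0.75 are assumptions made for illustration, not part of sentence-transformers, and this greedy scheme is only one simple way to cluster.

```python
import math
from typing import List, Sequence


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (a zero vector stays all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [0.0 for _ in vec]
    return [x / norm for x in vec]


def greedy_cluster(vectors: Sequence[Sequence[float]],
                   threshold: float = 0.75) -> List[List[int]]:
    """Assign each vector to the first cluster whose seed (first member)
    has cosine similarity >= threshold; otherwise start a new cluster.
    The threshold is an illustrative value, not a recommended setting."""
    unit = [normalize(v) for v in vectors]
    clusters: List[List[int]] = []
    for i, v in enumerate(unit):
        for members in clusters:
            seed = unit[members[0]]
            # Dot product of unit vectors equals cosine similarity.
            if sum(a * b for a, b in zip(v, seed)) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Toy stand-ins for sentence embeddings: two clearly separated directions.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(greedy_cluster(toy))  # [[0, 1], [2, 3]]
```

The result depends on input order and on the threshold, so for real data a standard clustering algorithm applied to the embeddings may be a better fit.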
Usage (Sentence-Transformers)
Using this model is straightforward once sentence-transformers is installed:
pip install -U sentence-transformers
Then you can use the model like this:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]
model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')
embeddings = model.encode(sentences)
print(embeddings)
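The embeddings returned by `model.encode` can be compared with cosine similarity to rank candidate sentences against a query. The following is a minimal sketch of that idea. The helper names `cosine_similarity`, `rank_by_similarity`, and `semantic_search` are hypothetical, and `model` is assumed to be any object whose `encode` method accepts a list of strings and returns one vector per string, as `SentenceTransformer` does.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec: Sequence[float],
                       corpus_vecs: Sequence[Sequence[float]]) -> List[Tuple[int, float]]:
    """Return (corpus index, score) pairs, most similar first."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def semantic_search(query: str, corpus: Sequence[str], model) -> List[Tuple[str, float]]:
    """Embed the query and corpus in one call, then rank corpus sentences.
    `model` is assumed to expose encode(list_of_str) -> sequence of vectors."""
    vectors = model.encode([query] + list(corpus))
    ranked = rank_by_similarity(vectors[0], vectors[1:])
    return [(corpus[i], score) for i, score in ranked]
```

With the model loaded as above, a call such as `semantic_search("How is the weather?", sentences, model)` would return the sentences ordered by similarity to the query. For larger corpora, the vectorized helpers in `sentence_transformers.util` are likely to be faster than this pure-Python loop.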
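Usage (HuggingFace Transformers)
Without sentence-transformers, you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.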
from transformers import AutoTokenizer, AutoModel
import torch
# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')
model = AutoModel.from_pretrained('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
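To see why the attention mask matters in `mean_pooling`, the same arithmetic can be traced by hand on a single toy sentence. This sketch uses plain Python lists instead of tensors, and the function name `masked_mean` is hypothetical. It mirrors the steps above: multiply each token vector by its mask value, sum, and divide by the number of real tokens (clamped to avoid division by zero).

```python
from typing import List


def masked_mean(token_embeddings: List[List[float]],
                attention_mask: List[int]) -> List[float]:
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    for vec, mask_value in zip(token_embeddings, attention_mask):
        for j in range(dim):
            sums[j] += vec[j] * mask_value
    # Same guard as torch.clamp(..., min=1e-9) in the tensor version.
    count = max(float(sum(attention_mask)), 1e-9)
    return [s / count for s in sums]


# Two real tokens followed by one padding token with arbitrary values.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector `[100.0, 100.0]` does not affect the result. An unmasked average over all three positions would be pulled toward it, which is why padded batches need the mask.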
Full Model Architecture
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
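The `max_seq_length` of 128 in the architecture above means that longer inputs are truncated before pooling. One possible workaround, shown here purely as an illustrative sketch and not as something the model card prescribes, is to split a long token sequence into windows that each fit the limit and embed them separately. The function `split_into_windows` and its `overlap` parameter are hypothetical. Because the limit also counts special tokens added by the tokenizer, a real pipeline would probably need a slightly smaller window.

```python
from typing import List


def split_into_windows(token_ids: List[int],
                       max_len: int = 128,
                       overlap: int = 0) -> List[List[int]]:
    """Split a token sequence into consecutive windows of at most max_len
    items, with `overlap` items shared between neighbouring windows."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    windows: List[List[int]] = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_len])
        # Stop once a window has reached the end of the sequence.
        if start + max_len >= len(token_ids):
            break
    return windows


# 300 placeholder token ids stand in for a tokenized long document.
windows = split_into_windows(list(range(300)), max_len=128)
print([len(w) for w in windows])  # [128, 128, 44]
```

How the per-window embeddings are combined afterwards (for example by averaging) is a separate design choice that this sketch leaves open.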
Citing & Authors
This model was trained by sentence-transformers.
If you find this model helpful, feel free to cite our publication Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks:
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "http://arxiv.org/abs/1908.10084",
}
Tags
tf
onnx
openvino
bert
feature-extraction
multilingual
ar
bg
Details
- Vendor: sentence-transformers
- Task: sentence-similarity
- Framework: sentence-transformers
- Model type: bert
- License (HF): apache-2.0
- Languages: multilingual, ar, bg, ca, cs, da, de, el, en, es, et, fa, fi, fr, gl, gu, he, hi, hr, hu, hy, id, it, ja, ka, ko, ku, lt, lv, mk, mn, mr, ms, my, nb, nl, pl, pt, ro, ru, sk, sl, sq, sr, sv, th, tr, uk, ur, vi