paraphrase-multilingual-mpnet-base-v2

Model card

License apache-2.0
Languages
multilingual ar bg ca cs da de el en es et fa fi fr gl gu he hi hr hu hy id it ja ka ko ku lt lv mk mn mr ms my nb nl pl pt ro ru sk sl sq sr sv th tr uk ur vi
Framework sentence-transformers
Task sentence-similarity
sentence-transformers feature-extraction sentence-similarity transformers text-embeddings-inference

Model configuration

Model type xlm-roberta
Architecture XLMRobertaModel

Model details

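This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.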

Usage (Sentence-Transformers)

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')
embeddings = model.encode(sentences)
print(embeddings)
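
Once embeddings are available, semantic search comes down to ranking vectors by cosine similarity. The following is a minimal sketch of that step, not part of the original model card: the helper names `cosine_scores` and `semantic_search` are hypothetical, and the 3-dimensional toy vectors only stand in for the 768-dimensional output of `model.encode(...)`.

```python
import numpy as np


def cosine_scores(query, corpus):
    """Cosine similarity between one query vector and each row of corpus."""
    query = np.asarray(query, dtype=np.float64)
    corpus = np.asarray(corpus, dtype=np.float64)
    # Normalise to unit length so the dot product equals the cosine.
    query_unit = query / np.linalg.norm(query)
    corpus_unit = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return corpus_unit @ query_unit


def semantic_search(query, corpus, top_k=2):
    """Return (index, score) pairs for the top_k most similar corpus rows."""
    scores = cosine_scores(query, corpus)
    order = np.argsort(-scores)[:top_k]  # highest score first
    return [(int(i), float(scores[i])) for i in order]


# Toy stand-ins (assumption): real embeddings would come from model.encode(...)
# and have 768 dimensions.
corpus_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
])
query_embedding = np.array([1.0, 0.05, 0.0])

hits = semantic_search(query_embedding, corpus_embeddings, top_k=2)
print(hits)  # rows 0 and 2 rank highest for this query
```

With real data, `corpus_embeddings` would be the array returned by `model.encode(corpus_sentences)` and `query_embedding` a single row from `model.encode([query])`.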

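Usage (HuggingFace Transformers)

Without sentence-transformers, you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.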

from transformers import AutoTokenizer, AutoModel
import torch

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')
model = AutoModel.from_pretrained('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
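
To see why the attention mask matters in the pooling step, the same masked mean can be worked through by hand on a toy input. This is an illustrative NumPy sketch, not code from the model card: `mean_pooling_np` is a hypothetical name, and the tiny 2-dimensional "token embeddings" are made up so the arithmetic is easy to follow.

```python
import numpy as np


def mean_pooling_np(token_embeddings, attention_mask):
    """Masked mean over the token axis.

    token_embeddings: shape (batch, tokens, dim)
    attention_mask:   shape (batch, tokens), 1 for real tokens, 0 for padding
    """
    token_embeddings = np.asarray(token_embeddings, dtype=np.float64)
    # Add a trailing axis so the mask broadcasts across the embedding dimension.
    mask = np.asarray(attention_mask, dtype=np.float64)[:, :, None]
    summed = (token_embeddings * mask).sum(axis=1)
    # Clip the token count to avoid dividing by zero for an all-padding row.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


# One "sentence" with three token slots; the last slot is padding (mask = 0),
# so its large values must not affect the average.
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

pooled = mean_pooling_np(tokens, mask)
print(pooled)  # [[2. 3.]]
```

A plain `tokens.mean(axis=1)` would instead average in the padding slot, which is the error the mask-aware version avoids.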

Usage (Text Embeddings Inference (TEI))

Text Embeddings Inference (TEI) is a blazing-fast inference solution for text embedding models.

  • CPU:
docker run -p 8080:80 -v hf_cache:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-latest --model-id sentence-transformers/paraphrase-multilingual-mpnet-base-v2 --pooling mean --dtype float16
  • NVIDIA GPU:
docker run --gpus all -p 8080:80 -v hf_cache:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cuda-latest --model-id sentence-transformers/paraphrase-multilingual-mpnet-base-v2 --pooling mean --dtype float16

Send a request to /v1/embeddings to generate embeddings via the OpenAI Embeddings API:

curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
    "input": "This is an example sentence"
  }'

Or check the Text Embeddings Inference API specification.
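
The same request can be issued from Python. The sketch below uses only the standard library and mirrors the curl call above. The helper names are hypothetical, the base URL assumes the container is mapped to port 8080 as in the docker commands, and the response parsing assumes the OpenAI-style shape (a `data` list whose items carry an `embedding` field), which should be checked against the TEI API specification.

```python
import json
import urllib.request

MODEL_ID = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"


def build_embeddings_request(text, base_url="http://localhost:8080"):
    """Build a POST request equivalent to the curl command above."""
    payload = {"model": MODEL_ID, "input": text}
    return urllib.request.Request(
        url=f"{base_url}/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_embeddings(response_body):
    """Pull the vectors out of an OpenAI-style response (assumed shape)."""
    parsed = json.loads(response_body)
    return [item["embedding"] for item in parsed["data"]]


def embed(text, base_url="http://localhost:8080"):
    """Send the request to a running TEI server and return its embeddings."""
    request = build_embeddings_request(text, base_url)
    with urllib.request.urlopen(request) as response:
        return extract_embeddings(response.read())
```

Calling `embed("This is an example sentence")` requires one of the containers above to be running; `build_embeddings_request` and `extract_embeddings` can be exercised without a server.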

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

Citing & Authors

This model was trained by sentence-transformers.

If you find this model helpful, feel free to cite our publication Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks:

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "http://arxiv.org/abs/1908.10084",
}

Tags

tf onnx openvino xlm-roberta feature-extraction text-embeddings-inference multilingual ar
