
gte-multilingual-base

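gte-multilingual-base is the latest model in the GTE (General Text Embedding) family. Its key attributes are:

High performance: achieves state-of-the-art (SOTA) results among models of similar size on multilingual retrieval tasks and multi-task representation model evaluations.
Training architecture: trained with an encoder-only transformer architecture, which yields a smaller model. Unlike earlier models built on a decoder-only LLM architecture (e.g., gte-qwen2-1.5b-instruct), this model has lower hardware requirements for inference and offers a 10x increase in inference speed.
Long context: supports text lengths of up to 8192 tokens.
Multilingual capability: supports over 70 languages.
Elastic dense embedding: supports elastic output dimensions for the dense representation while maintaining effectiveness on downstream tasks, which significantly reduces storage costs and improves execution efficiency.
Sparse vectors: in addition to dense representations, the model can also produce sparse vectors.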
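To make the storage claim for elastic embeddings concrete, here is a back-of-the-envelope sketch. It assumes vectors are stored as raw float32 values with no index overhead. The corpus size of one million documents and the helper name `index_size_mib` are illustrative assumptions, not figures from the model card; the 128 and 768 endpoints are the dimension range mentioned in the usage code.

```python
def index_size_mib(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw storage for a dense index in MiB (float32 by default, no index overhead)."""
    return num_vectors * dim * bytes_per_value / (1024 ** 2)

NUM_DOCS = 1_000_000  # hypothetical corpus size, chosen only for illustration

full = index_size_mib(NUM_DOCS, 768)   # full-width embeddings
small = index_size_mib(NUM_DOCS, 128)  # smallest supported elastic width

print(f"768-dim: {full:.0f} MiB, 128-dim: {small:.0f} MiB, ratio: {full / small:.0f}x")
```

Under these assumptions the saving is simply the ratio of the dimensions; whether retrieval quality at the smaller width is acceptable has to be checked on your own data.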

Paper: mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval

Model Information

Model size: 305M
Embedding dimension: 768
Max input tokens: 8192
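As a rough sizing aid, the parameter count translates into weight memory as sketched below. This treats "305M" as exactly 305,000,000 parameters, which is an approximation, and counts weights only: activations, the tokenizer, and framework overhead are not included, so real memory use will be higher.

```python
PARAMS = 305_000_000  # "305M" from the model information, taken as a round figure

def weight_memory_gb(num_params: int, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9

fp32_gb = weight_memory_gb(PARAMS, 4)  # float32: 4 bytes per parameter
fp16_gb = weight_memory_gb(PARAMS, 2)  # float16: 2 bytes per parameter

print(f"float32 weights: ~{fp32_gb:.2f} GB, float16 weights: ~{fp16_gb:.2f} GB")
```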

Usage

We recommend installing xformers and enabling unpadding for acceleration; see enable-unpadding-and-xformers.
How to use it offline: new-impl/discussions/2
How to use it with TEI: refs/pr/7

Get Dense Embeddings with Transformers

# Requires transformers>=4.36.0

import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "北京",
    "快排算法介绍"
]

model_name_or_path = 'Alibaba-NLP/gte-multilingual-base'
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModel.from_pretrained(model_name_or_path, trust_remote_code=True)

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=8192, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)

dimension = 768  # Output embedding dimension; should be in [128, 768]
# Take the first-token vector of every text, then keep its first `dimension` components
embeddings = outputs.last_hidden_state[:, 0, :dimension]

embeddings = F.normalize(embeddings, p=2, dim=1)
scores = embeddings[:1] @ embeddings[1:].T  # cosine similarity of text 0 against the rest
print(scores.tolist())

# [[0.3016996383666992, 0.7503870129585266, 0.3203084468841553]]
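The intended steps of the snippet above (keep the leading `dimension` components of each vector, L2-normalize, then take dot products) can be checked without downloading the model. The sketch below uses plain Python lists; the toy vectors and the helper names `truncate_and_normalize` and `cosine_scores` are made up for illustration and are not part of the model's code.

```python
import math

def truncate_and_normalize(vectors, dimension):
    """Keep the first `dimension` components of each vector, then L2-normalize it."""
    result = []
    for vec in vectors:
        head = vec[:dimension]  # slice along the feature axis, not the batch axis
        norm = math.sqrt(sum(x * x for x in head))
        result.append([x / norm for x in head] if norm > 0 else list(head))
    return result

def cosine_scores(query, docs):
    """Dot products of unit vectors, i.e. cosine similarities."""
    return [sum(q * d for q, d in zip(query, doc)) for doc in docs]

# Toy 4-dimensional "embeddings" (made-up numbers), truncated to 2 dimensions
toy = [[3.0, 4.0, 1.0, 2.0], [4.0, 3.0, 9.0, 9.0], [-3.0, -4.0, 0.0, 5.0]]
emb = truncate_and_normalize(toy, 2)
scores = cosine_scores(emb[0], emb[1:])
print(scores)
```

Every text keeps its own row and only the feature width shrinks, so the number of embeddings is unchanged after truncation.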

Use with sentence-transformers

# Requires sentence-transformers>=3.0.0

from sentence_transformers import SentenceTransformer

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "北京",
    "快排算法介绍"
]

model_name_or_path="Alibaba-NLP/gte-multilingual-base"
model = SentenceTransformer(model_name_or_path, trust_remote_code=True)
embeddings = model.encode(input_texts, normalize_embeddings=True) # embeddings.shape (4, 768)

# sim scores
scores = model.similarity(embeddings[:1], embeddings[1:])

print(scores.tolist())
# [[0.301699697971344, 0.7503870129585266, 0.32030850648880005]]
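To turn the similarity row into a ranking, sort the candidate texts by score. The sketch below reuses the scores printed above so that it runs without the model; the helper name `rank` is an illustrative assumption.

```python
# Candidates are input_texts[1:]; the query is input_texts[0]
docs = [
    "how to implement quick sort in python?",
    "北京",
    "快排算法介绍",
]
# Scores copied from the printed output above
scores = [[0.301699697971344, 0.7503870129585266, 0.32030850648880005]]

def rank(documents, row):
    """Return (document, score) pairs sorted from most to least similar."""
    order = sorted(range(len(documents)), key=lambda i: row[i], reverse=True)
    return [(documents[i], row[i]) for i in order]

ranked = rank(docs, scores[0])
for doc, score in ranked:
    print(f"{score:.4f}  {doc}")
```

The English query about China's capital ranks the Chinese text "北京" first, which illustrates the cross-lingual matching the model is meant to provide.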

Use with infinity

Usage via Docker and infinity, which is MIT licensed.

docker run --gpus all -v $PWD/data:/app/.cache -p "7997":"7997" \
michaelf34/infinity:0.0.69 \
v2 --model-id Alibaba-NLP/gte-multilingual-base --revision "main" --dtype float16 --batch-size 32 --device cuda --engine torch --port 7997

Use with Text Embeddings Inference (TEI)

Usage via Docker and Text Embeddings Inference (TEI):

CPU:

docker run --platform linux/amd64 \
  -p 8080:80 \
  -v $PWD/data:/data \
  --pull always \
  ghcr.io/huggingface/text-embeddings-inference:cpu-1.7 \
  --model-id Alibaba-NLP/gte-multilingual-base \
  --dtype float16

GPU:

docker run --gpus all \
  -p 8080:80 \
  -v $PWD/data:/data \
  --pull always \
  ghcr.io/huggingface/text-embeddings-inference:1.7 \
  --model-id Alibaba-NLP/gte-multilingual-base \
  --dtype float16

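You can then send requests to the deployed API through the OpenAI-compatible v1/embeddings route (see the OpenAI Embeddings API documentation for more information). The container is published as plain HTTP on port 8080, so the URL uses http rather than https:

curl http://0.0.0.0:8080/v1/embeddings \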
  -H "Content-Type: application/json" \
  -d '{
    "input": [
      "what is the capital of China?",
      "how to implement quick sort in python?",
      "北京",
      "快排算法介绍"
    ],
    "model": "Alibaba-NLP/gte-multilingual-base",
    "encoding_format": "float"
  }'
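The same request can be sent from Python using only the standard library. This is a minimal sketch, assuming the server is reachable at `http://localhost:8080` (the port mapping from the docker commands above) and that the response follows the OpenAI embeddings shape, with a `data` list whose items carry `index` and `embedding`. The helper names `build_request`, `parse_embeddings`, and `embed` are illustrative.

```python
import json
import urllib.request

# Assumed address: plain HTTP on the host port mapped in the docker commands above
TEI_URL = "http://localhost:8080/v1/embeddings"

def build_request(texts, model="Alibaba-NLP/gte-multilingual-base", url=TEI_URL):
    """Build the POST request with the same JSON body as the curl example."""
    payload = {"input": texts, "model": model, "encoding_format": "float"}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def parse_embeddings(body):
    """Extract the vectors from an OpenAI-style response, ordered by input index."""
    items = sorted(json.loads(body)["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]

def embed(texts):
    """Send the request to a running server and return one vector per input text."""
    with urllib.request.urlopen(build_request(texts), timeout=30) as resp:
        return parse_embeddings(resp.read())
```

With a container running, `embed(["what is the capital of China?", "北京"])` would return two lists of floats. Sorting by `index` is a defensive choice, so the result order matches the input order even if the server returns items out of order.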

Use with custom code to get dense embeddings and sparse token weights

# You can find the script gte_embedding.py in https://huggingface.co/Alibaba-NLP/gte-multilingual-base/blob/main/scripts/gte_embedding.py

from gte_embedding import GTEEmbeddidng

model_name_or_path = 'Alibaba-NLP/gte-multilingual-base'
model = GTEEmbeddidng(model_name_or_path)
query = "中国的首都在哪儿"

docs = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "北京",
    "快排算法介绍"
]

embs = model.encode(docs, return_dense=True, return_sparse=True)
print('dense_embeddings vecs', embs['dense_embeddings'])
print('token_weights', embs['token_weights'])
pairs = [(query, doc) for doc in docs]
dense_scores = model.compute_scores(pairs, dense_weight=1.0, sparse_weight=0.0)
sparse_scores = model.compute_scores(pairs, dense_weight=0.0, sparse_weight=1.0)
hybrid_scores = model.compute_scores(pairs, dense_weight=1.0, sparse_weight=0.3)

print('dense_scores', dense_scores)
print('sparse_scores', sparse_scores)
print('hybrid_scores', hybrid_scores)

# dense_scores [0.85302734375, 0.257568359375, 0.76953125, 0.325439453125]
# sparse_scores [0.0, 0.0, 4.600879669189453, 1.570279598236084]
# hybrid_scores [0.85302734375, 0.257568359375, 2.1497951507568356, 0.7965233325958252]
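The printed numbers are consistent with the hybrid score being a weighted sum of the dense and sparse scores. The sketch below reproduces the hybrid row from the other two. The token-overlap formula in `sparse_score` is an assumption about how a sparse score of this kind is typically computed, not something confirmed from `gte_embedding.py`, and the toy token weights are made up.

```python
def sparse_score(query_weights, doc_weights):
    """Assumed lexical score: sum of weight products over tokens shared by both texts."""
    return sum(w * doc_weights[t] for t, w in query_weights.items() if t in doc_weights)

def hybrid_score(dense, sparse, dense_weight=1.0, sparse_weight=0.3):
    """Weighted sum of a dense (cosine) score and a sparse (lexical) score."""
    return dense_weight * dense + sparse_weight * sparse

# Scores copied from the printed output above
dense_scores = [0.85302734375, 0.257568359375, 0.76953125, 0.325439453125]
sparse_scores = [0.0, 0.0, 4.600879669189453, 1.570279598236084]

hybrid = [hybrid_score(d, s) for d, s in zip(dense_scores, sparse_scores)]
print(hybrid)

# Toy token weights (made up): only "首都" is shared, so the score is 2.0 * 1.0
toy = sparse_score({"首都": 2.0, "中国": 1.5}, {"首都": 1.0, "北京": 3.0})
print(toy)
```

The sparse scores are zero for the two English documents because they share no tokens with the Chinese query, so the hybrid score there equals the dense score. Sparse scores are unbounded, unlike cosine similarity, which is why a small `sparse_weight` is used.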

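Evaluation

We validated the performance of gte-multilingual-base on multiple downstream tasks, including multilingual retrieval, cross-lingual retrieval, long-text retrieval, and general text representation evaluation on the MTEB leaderboard, among others.

Retrieval Task

Retrieval results on MIRACL and MLDR (multilingual), MKQA (cross-lingual), BEIR and LoCo (English).

Detailed results on MLDR

Detailed results on LoCo

MTEB

Results on MTEB English, Chinese, French, and Polish

More detailed experimental results can be found in the paper.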

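Cloud API Services

In addition to the open-source GTE series models, GTE series models are also available as commercial API services on Alibaba Cloud.

Embedding models: three versions of the text embedding models are available: text-embedding-v1/v2/v3, with v3 being the latest API service.
Reranking models: the gte-rerank model service is available.

Note that the models behind the commercial APIs are not entirely identical to the open-source models.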

Citation

If you find our paper or models helpful, please consider citing:

@inproceedings{zhang2024mgte,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author={Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Wen and Dai, Ziqi and Tang, Jialong and Lin, Huan and Yang, Baosong and Xie, Pengjun and Huang, Fei and others},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track},
  pages={1393--1412},
  year={2024}
}
