siglip-so400m-patch14-384

google zero-shot-image-classification transformers

google/siglip-so400m-patch14-384

License: apache-2.0

Summary

SigLIP model pre-trained on WebLI at resolution 384x384. It was introduced in the paper Sigmoid Loss for Language Image Pre-Training by Zhai et al. and first released in this repository.

Model card

License: apache-2.0

vision

Model configuration

Model type: siglip

Architecture: SiglipModel

Model details

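SigLIP (shape-optimized model)

SigLIP model pre-trained on WebLI at resolution 384x384. It was introduced in the paper Sigmoid Loss for Language Image Pre-Training by Zhai et al. and first released in this repository.

This model has the SoViT-400m architecture, which is the shape-optimized version presented in Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design by Alabdulmohsin et al.

Disclaimer: The team releasing SigLIP did not write a model card for this model, so this model card has been written by the Hugging Face team.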

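Model description

SigLIP is CLIP, a multimodal model, with a better loss function. The sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. This allows the batch size to be scaled up further, while also performing better at smaller batch sizes.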
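The pairwise nature of the loss can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy embeddings, and the `temperature` and `bias` defaults are all chosen here for illustration, and the embeddings are assumed to be L2-normalized.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_sigmoid_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Illustrative sigmoid loss over a batch of paired embeddings.

    image_embs[i] and text_embs[i] are assumed to be a matching pair;
    every other combination is treated as a negative. The default
    temperature and bias are illustrative assumptions.
    """
    n = len(image_embs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            similarity = sum(a * b for a, b in zip(image_embs[i], text_embs[j]))
            logit = temperature * similarity + bias
            label = 1.0 if i == j else -1.0
            # Each pair is an independent binary decision, so no softmax
            # normalization over the whole batch is needed.
            total += log_sigmoid(label * logit)
    return -total / n


# Toy, already-normalized 2-dimensional embeddings (hypothetical values).
images = [[1.0, 0.0], [0.0, 1.0]]
texts_matched = [[1.0, 0.0], [0.0, 1.0]]
texts_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = pairwise_sigmoid_loss(images, texts_matched)
loss_swapped = pairwise_sigmoid_loss(images, texts_swapped)
print(loss_matched < loss_swapped)
```

Because every term in the double loop depends only on one image and one text, the sum can be split across devices without first gathering all similarities, which is the property the description above refers to.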

A TL;DR of SigLIP by one of the authors can be found here.

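Intended uses & limitations

You can use the raw model for tasks such as zero-shot image classification and image-text retrieval. See the model hub to look for other versions on a task that interests you.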
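Image-text retrieval reduces to ranking images by their similarity to a text embedding. The sketch below shows only that ranking step in plain Python; the embeddings are hypothetical 3-dimensional vectors standing in for real model outputs, and `l2_normalize` and `rank_images` are names introduced here for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rank_images(text_emb, image_embs):
    """Return image indices sorted from most to least similar to the text."""
    query = l2_normalize(text_emb)
    scores = [
        sum(q * x for q, x in zip(query, l2_normalize(img)))
        for img in image_embs
    ]
    return sorted(range(len(image_embs)), key=lambda i: scores[i], reverse=True)


# Hypothetical embeddings; real ones would come from the model.
text_embedding = [0.9, 0.1, 0.0]
image_embeddings = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]

ranking = rank_images(text_embedding, image_embeddings)
print(ranking)
```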

How to use

Here is how to use this model to perform zero-shot image classification:

from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("google/siglip-so400m-patch14-384")
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

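# Candidate captions; the image is scored against each one independently.
texts = ["a photo of 2 cats", "a photo of 2 dogs"]
# padding="max_length" pads every caption to the fixed length used in pre-training.
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")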

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image) # these are the probabilities
print(f"{probs[0][0]:.1%} that image 0 is '{texts[0]}'")
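Note that the probabilities above come from a sigmoid, not a softmax, so each caption is scored on its own and the values need not sum to 1. A small standalone sketch of the difference, using hypothetical logit values rather than real model outputs:

```python
import math


def sigmoid(x: float) -> float:
    """Map a single logit to an independent probability."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    """Map a list of logits to probabilities that compete and sum to 1."""
    shifted = [v - max(values) for v in values]
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one image against three candidate texts.
example_logits = [2.0, -1.0, -6.0]

sigmoid_probs = [sigmoid(v) for v in example_logits]
softmax_probs = softmax(example_logits)

print([round(p, 3) for p in sigmoid_probs])
print(round(sum(sigmoid_probs), 3), round(sum(softmax_probs), 3))
```

With sigmoid scoring, all captions can receive a low probability when none of them fits the image, which a softmax over the same logits cannot express.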

Alternatively, one can use the pipeline API, which hides this complexity from the user:

from transformers import pipeline
from PIL import Image
import requests

# load pipe
image_classifier = pipeline(task="zero-shot-image-classification", model="google/siglip-so400m-patch14-384")

# load image
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# inference
outputs = image_classifier(image, candidate_labels=["2 cats", "a plane", "a remote"])
outputs = [{"score": round(output["score"], 4), "label": output["label"] } for output in outputs]
print(outputs)

For more code examples, refer to the documentation.

Training procedure

Training data

SigLIP is pre-trained on the WebLI dataset (Chen et al., 2023).

Preprocessing

Images are resized/rescaled to the same resolution (384x384) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
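The normalization arithmetic can be checked by hand. The sketch below assumes 8-bit channel values that are first rescaled by 1/255 (an assumption about the rescaling step, which the text does not spell out); `normalize_pixel` is a name introduced here for illustration.

```python
def normalize_pixel(value: int, mean: float = 0.5, std: float = 0.5) -> float:
    """Rescale an 8-bit channel value to [0, 1], then normalize it."""
    rescaled = value / 255.0  # assumed rescale factor of 1/255
    return (rescaled - mean) / std


for v in (0, 128, 255):
    print(v, round(normalize_pixel(v), 4))
```

With mean 0.5 and standard deviation 0.5, the full 8-bit range maps onto the interval from -1 to 1.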

Texts are tokenized and padded to the same length (64 tokens).
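Fixed-length padding can be sketched as follows. The padding id of 0 is hypothetical (the real tokenizer defines its own), and truncating over-long inputs is an assumption added here so that every output has exactly 64 ids.

```python
MAX_LENGTH = 64  # sequence length stated above
PAD_ID = 0       # hypothetical padding id; the real tokenizer defines its own


def pad_or_truncate(token_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Return exactly max_length ids, truncating or right-padding as needed."""
    clipped = list(token_ids[:max_length])
    return clipped + [pad_id] * (max_length - len(clipped))


short = pad_or_truncate([5, 17, 42])
long = pad_or_truncate(range(100))
print(len(short), len(long))
```

In the usage example earlier in this card, passing padding="max_length" to the processor plays this role.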

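Compute

The model was trained on 16 TPU-v4 chips for three days.

Evaluation results

The evaluation of SigLIP compared with CLIP is shown below (taken from the paper).

BibTeX entry and citation info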

@misc{zhai2023sigmoid,
      title={Sigmoid Loss for Language Image Pre-Training}, 
      author={Xiaohua Zhai and Basil Mustafa and Alexander Kolesnikov and Lucas Beyer},
      year={2023},
      eprint={2303.15343},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}