mobilevit-small

apple image-classification transformers

apple/mobilevit-small

Introduction

MobileViT model pre-trained on ImageNet-1k at resolution 256x256. It was introduced in MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer by Sachin Mehta and Mohammad Rastegari, and first released in this repository. The license used is Apple sample code license.

Model card

License: other

Datasets

imagenet-1k

vision image-classification

Model configuration

Model type: mobilevit

Architecture: MobileViTForImageClassification

Model details

MobileViT (small-sized model)

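MobileViT model pre-trained on ImageNet-1k at resolution 256x256. It was introduced in MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer by Sachin Mehta and Mohammad Rastegari, and first released in this repository. The license used is Apple sample code license.

Disclaimer: The team releasing MobileViT did not write a model card for this model, so this model card has been written by the Hugging Face team.

Model description

MobileViT is a light-weight, low-latency convolutional neural network that combines MobileNetV2-style layers with a new block that replaces the local processing of convolutions with global processing using transformers. As with ViT (Vision Transformer), the image data is converted into flattened patches before it is processed by the transformer layers. Afterwards, the patches are "unflattened" back into feature maps. This allows the MobileViT block to be placed anywhere inside a CNN. MobileViT does not require any positional embeddings.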
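As a rough illustration of the flatten/unflatten round trip described above, the following minimal sketch splits a single-channel feature map into non-overlapping patches and then reassembles it. It is not the transformers implementation: the function names `unfold_patches` and `fold_patches`, the 2x2 patch size, and the use of plain Python lists are assumptions made for readability, and the real block works on multi-channel tensors.

```python
def unfold_patches(feature_map, patch_h, patch_w):
    """Split an H x W grid into non-overlapping patches, each flattened to a list."""
    height, width = len(feature_map), len(feature_map[0])
    if height % patch_h or width % patch_w:
        raise ValueError("feature map size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch_h):
        for left in range(0, width, patch_w):
            patches.append([
                feature_map[top + dy][left + dx]
                for dy in range(patch_h)
                for dx in range(patch_w)
            ])
    return patches


def fold_patches(patches, height, width, patch_h, patch_w):
    """Inverse of unfold_patches: place each flattened patch back into the grid."""
    patches_per_row = width // patch_w
    feature_map = [[None] * width for _ in range(height)]
    for index, patch in enumerate(patches):
        top = (index // patches_per_row) * patch_h
        left = (index % patches_per_row) * patch_w
        for offset, value in enumerate(patch):
            dy, dx = divmod(offset, patch_w)
            feature_map[top + dy][left + dx] = value
    return feature_map


# Hypothetical 4x4 single-channel feature map with values 0..15
feature_map = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = unfold_patches(feature_map, 2, 2)
restored = fold_patches(patches, 4, 4, 2, 2)
```

In this sketch a transformer layer would operate on `patches` between the two calls. Because folding restores the original grid layout, the result can be handed to ordinary convolution layers, which is why such a block can sit anywhere inside a CNN.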

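Intended uses & limitations

You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you.

How to use

Here is how to use this model to classify an image from the COCO 2017 dataset into one of the 1,000 ImageNet classes: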

import requests
import torch
from PIL import Image
from transformers import MobileViTImageProcessor, MobileViTForImageClassification

# Download a sample image from the COCO 2017 validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# MobileViTImageProcessor replaces the deprecated MobileViTFeatureExtractor
image_processor = MobileViTImageProcessor.from_pretrained("apple/mobilevit-small")
model = MobileViTForImageClassification.from_pretrained("apple/mobilevit-small")

# Preprocess the image and return PyTorch tensors
inputs = image_processor(images=image, return_tensors="pt")

# Inference only, so gradient tracking is disabled
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits

# The model predicts one of the 1,000 ImageNet classes
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])

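Currently, both the feature extractor and the model support PyTorch.

Training data

The MobileViT model was pretrained on ImageNet-1k, a dataset consisting of 1 million images and 1,000 classes.

Training procedure

Preprocessing

Training requires only basic data augmentation, i.e. random resized cropping and horizontal flipping.

To learn multi-scale representations without requiring fine-tuning, a multi-scale sampler was used during training, with image sizes randomly sampled from: (160, 160), (192, 192), (256, 256), (288, 288), (320, 320).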
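The sampling step can be sketched as follows. This is a simplified illustration, not the original training code: the function name `sample_batch_resolutions`, the per-batch granularity, and the fixed seed are assumptions, and the sketch keeps the batch size constant, which the original sampler may not do.

```python
import random

# Resolutions listed in the model card
TRAIN_SIZES = [(160, 160), (192, 192), (256, 256), (288, 288), (320, 320)]


def sample_batch_resolutions(num_batches, sizes=TRAIN_SIZES, seed=0):
    """Draw one (height, width) pair per batch, uniformly at random."""
    rng = random.Random(seed)  # local generator keeps the draw reproducible
    return [rng.choice(sizes) for _ in range(num_batches)]


# Every image in batch i would be resized to resolutions[i] before the forward pass
resolutions = sample_batch_resolutions(num_batches=8)
```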

At inference time, images are resized/rescaled to the same resolution (288x288) and center-cropped to 256x256.
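The center crop can be expressed as a small coordinate calculation. The helper below is a minimal sketch with a hypothetical name, `center_crop_box`; it returns a box in the (left, upper, right, lower) order that `PIL.Image.crop` accepts and assumes the image has already been resized.

```python
def center_crop_box(width, height, crop_size):
    """Return (left, upper, right, lower) for a centered square crop."""
    if crop_size > width or crop_size > height:
        raise ValueError("crop size must not exceed the image size")
    left = (width - crop_size) // 2
    upper = (height - crop_size) // 2
    return (left, upper, left + crop_size, upper + crop_size)


# A 288x288 image cropped to 256x256 loses a 16-pixel border on each side
box = center_crop_box(288, 288, 256)
```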

Pixel values are normalized to the range [0, 1]. Images are expected to be in BGR pixel order, not RGB.
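For illustration only, the two pixel-level steps can be written in plain Python as below. The function name `to_bgr_unit_range` and the nested-list image layout are assumptions; in practice the image processor shipped with the model performs these steps.

```python
def to_bgr_unit_range(rgb_rows):
    """Scale 8-bit RGB pixels to [0, 1] and reorder the channels to BGR."""
    return [
        [(blue / 255.0, green / 255.0, red / 255.0) for (red, green, blue) in row]
        for row in rgb_rows
    ]


# Hypothetical 1x2 image: one pure-red pixel and one pure-blue pixel
rgb_image = [[(255, 0, 0), (0, 0, 255)]]
bgr_image = to_bgr_unit_range(rgb_image)
```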

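Pretraining

The MobileViT networks are trained from scratch for 300 epochs on 8 NVIDIA GPUs with an effective batch size of 1024. The learning rate is warmed up for 3k steps and then follows a cosine annealing schedule. Label-smoothing cross-entropy loss and L2 weight decay are also used. Training resolution varies from 160x160 to 320x320, using multi-scale sampling.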
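The schedule and loss named above can be sketched in a few lines. This is a minimal sketch under stated assumptions rather than the original training code: the model card gives only the 3k warmup steps, so `base_lr`, `min_lr`, `total_steps`, the linear shape of the warmup, and the `smoothing` value of 0.1 are placeholder assumptions, and both function names are hypothetical.

```python
import math


def learning_rate(step, total_steps, warmup_steps=3000, base_lr=2e-3, min_lr=2e-4):
    """Linear warmup from min_lr to base_lr, then cosine annealing back to min_lr."""
    if step < warmup_steps:
        return min_lr + (base_lr - min_lr) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def label_smoothing_cross_entropy(logits, target, smoothing=0.1):
    """Cross-entropy against a target mixed with a uniform distribution."""
    # Numerically stable log-softmax
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    log_probs = [x - log_sum_exp for x in logits]

    num_classes = len(logits)
    loss = 0.0
    for index, log_prob in enumerate(log_probs):
        # Every class gets smoothing / K; the true class also gets 1 - smoothing
        weight = smoothing / num_classes
        if index == target:
            weight += 1.0 - smoothing
        loss -= weight * log_prob
    return loss


# Hypothetical run of 10,000 steps
peak_lr = learning_rate(3000, total_steps=10_000)
final_lr = learning_rate(10_000, total_steps=10_000)
loss = label_smoothing_cross_entropy([2.0, 0.5, -1.0], target=0)
```

With `smoothing=0.0` the loss reduces to ordinary cross-entropy, which is a convenient sanity check when adapting the sketch.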

Evaluation results

Model	ImageNet top-1 accuracy (%)	ImageNet top-5 accuracy (%)	Parameters	URL
MobileViT-XXS	69.0	88.9	1.3 M	https://huggingface.co/apple/mobilevit-xx-small
MobileViT-XS	74.8	92.3	2.3 M	https://huggingface.co/apple/mobilevit-x-small
MobileViT-S	78.4	94.1	5.6 M	https://huggingface.co/apple/mobilevit-small
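For readers unfamiliar with the two metrics in the table, the sketch below shows how top-k accuracy is commonly computed from per-image logits. The helper name `top_k_accuracy` and the tiny three-class example are hypothetical, and the code does not reproduce the evaluation that produced the numbers above.

```python
def top_k_accuracy(logits_batch, labels, k):
    """Percentage of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for logits, label in zip(logits_batch, labels):
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


# Hypothetical logits for three images over three classes
logits_batch = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
labels = [1, 1, 0]

top1 = top_k_accuracy(logits_batch, labels, k=1)
top2 = top_k_accuracy(logits_batch, labels, k=2)
```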

BibTeX entry and citation info

@inproceedings{vision-transformer,
title = {MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer},
author = {Sachin Mehta and Mohammad Rastegari},
year = {2022},
URL = {https://arxiv.org/abs/2110.02178}
}