Qwen2-VL-2B-Instruct

Model ID: Qwen/Qwen2-VL-2B-Instruct
Tags: Qwen, image-text-to-text, transformers, en
Downloads: 4,134,844
Favorites: 500
Views: 7
License: apache-2.0

Model card

License: apache-2.0
Language: en
Framework: transformers
Task: image-text-to-text
Tags: multimodal

Model configuration

Model type: qwen2_vl
Architecture: Qwen2VLForConditionalGeneration

Model details

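Introduction

We're excited to unveil Qwen2-VL, the latest iteration of our Qwen-VL model, representing nearly a year of innovation.

What's New in Qwen2-VL

Key enhancements:

  • SoTA understanding of images of various resolutions and aspect ratios: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, and MTVQA.

  • Understanding videos of 20+ minutes: Qwen2-VL can understand videos longer than 20 minutes for high-quality video-based question answering, dialog, and content creation.

  • Agent that can operate mobile phones, robots, and other devices: with its complex reasoning and decision-making abilities, Qwen2-VL can be integrated with devices such as mobile phones and robots for automatic operation based on the visual environment and text instructions.

  • Multilingual support: to serve global users, besides English and Chinese, Qwen2-VL now understands text in many languages inside images, including most European languages, Japanese, Korean, Arabic, and Vietnamese.

Model architecture updates:

  • Naive Dynamic Resolution: unlike its predecessor, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, which offers a more human-like visual processing experience.

  • Multimodal Rotary Position Embedding (M-ROPE): decomposes the positional embedding into parts that capture 1D textual, 2D visual, and 3D video positional information, enhancing the model's multimodal processing capabilities.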
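To make the M-ROPE idea more concrete, here is a minimal sketch of how three-part position IDs could be assigned to a mixed text and image sequence. It follows the general description in the Qwen2-VL paper (arxiv:2409.12191) as we understand it: text tokens reuse one index for all three parts, image tokens share a temporal index while height and width vary over the patch grid, and the next segment continues after the largest index used. The function names are hypothetical, and this is not the library's implementation.

```python
from typing import List, Tuple

Position = Tuple[int, int, int]  # (temporal, height, width)


def text_positions(start: int, length: int) -> List[Position]:
    """Text tokens: all three components share the same 1D index."""
    return [(start + i, start + i, start + i) for i in range(length)]


def image_positions(start: int, grid_h: int, grid_w: int) -> List[Position]:
    """Image tokens: constant temporal index, row and column offsets for height and width."""
    return [(start, start + r, start + c) for r in range(grid_h) for c in range(grid_w)]


def next_start(positions: List[Position]) -> int:
    """The following segment continues after the largest index used so far."""
    return max(max(p) for p in positions) + 1


# A 3-token text prompt, a 2x3 grid of image tokens, then 2 more text tokens.
prompt = text_positions(0, 3)
image = image_positions(next_start(prompt), grid_h=2, grid_w=3)
reply = text_positions(next_start(prompt + image), 2)

for name, segment in (("prompt", prompt), ("image", image), ("reply", reply)):
    print(name, segment)
```

In this sketch the six image tokens consume only three new index values along each axis, so position indices grow more slowly than the token count for visual inputs.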

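We provide three models with 2 billion, 7 billion, and 72 billion parameters. This repository contains the instruction-tuned 2B Qwen2-VL model. For more information, visit our blog and GitHub.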

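Evaluation

Image benchmarks

| Benchmark | InternVL2-2B | MiniCPM-V 2.0 | Qwen2-VL-2B |
|---|---|---|---|
| MMMU (val) | 36.3 | 38.2 | 41.1 |
| DocVQA (test) | 86.9 | - | 90.1 |
| InfoVQA (test) | 58.9 | - | 65.5 |
| ChartQA (test) | 76.2 | - | 73.5 |
| TextVQA (val) | 73.4 | - | 79.7 |
| OCRBench | 781 | 605 | 794 |
| MTVQA | - | - | 20.0 |
| VCR (en, easy) | - | - | 81.45 |
| VCR (zh, easy) | - | - | 46.16 |
| RealWorldQA | 57.3 | 55.8 | 62.9 |
| MME (sum) | 1876.8 | 1808.6 | 1872.0 |
| MMBench-EN (test) | 73.2 | 69.1 | 74.9 |
| MMBench-CN (test) | 70.9 | 66.5 | 73.5 |
| MMBench-V1.1 (test) | 69.6 | 65.8 | 72.2 |
| MMT-Bench (test) | - | - | 54.5 |
| MMStar | 49.8 | 39.1 | 48.0 |
| MMVet (GPT-4-Turbo) | 39.7 | 41.0 | 49.5 |
| HallBench (avg) | 38.0 | 36.1 | 41.7 |
| MathVista (testmini) | 46.0 | 39.8 | 43.0 |
| MathVision | - | - | 12.4 |

Video benchmarks

| Benchmark | Qwen2-VL-2B |
|---|---|
| MVBench | 63.2 |
| PerceptionTest (test) | 53.9 |
| EgoSchema (test) | 54.9 |
| Video-MME (wo/w subs) | 55.6/60.4 |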

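Requirements

The Qwen2-VL code has been merged into the latest Hugging Face transformers. We advise building from source with the command pip install git+https://github.com/huggingface/transformers; otherwise you may encounter the following error:

KeyError: 'qwen2_vl'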
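Before loading the model, it can help to check whether the installed transformers is recent enough to know the qwen2_vl model type. The sketch below uses only the standard library. The minimum version (4.45.0) is our assumption about the first release that shipped Qwen2-VL support and is not stated on this page; the helper names are hypothetical.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple

# Assumption: Qwen2-VL support first appeared in the transformers 4.45.0 release.
MIN_VERSION = (4, 45, 0)


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract the leading numeric components, e.g. '4.46.0.dev0' -> (4, 46, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups(default="0"))


def supports_qwen2_vl(installed: Optional[str]) -> bool:
    """True if the given version string is at or above the assumed minimum."""
    return installed is not None and parse_version(installed) >= MIN_VERSION


def installed_transformers_version() -> Optional[str]:
    """Return the installed transformers version, or None if it is not installed."""
    try:
        return version("transformers")
    except PackageNotFoundError:
        return None


found = installed_transformers_version()
if supports_qwen2_vl(found):
    print(f"transformers {found} should recognize 'qwen2_vl'")
else:
    print(f"transformers {found} may raise KeyError: 'qwen2_vl'; consider upgrading")
```

A source build typically reports a .dev0 suffix; the parser ignores it and compares only the numeric part.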

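Quickstart

We offer a toolkit that helps you handle various types of visual input more conveniently, including base64, URLs, and interleaved images and videos. You can install it with the following command:

pip install qwen-vl-utils

The following code snippet shows how to use the chat model with transformers and qwen_vl_utils: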

import torch  # only needed for the optional flash_attention_2 variant below
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-2B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
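The trimming step above exists because, for decoder-only models, generate returns each prompt followed by the newly generated tokens. The following sketch reproduces that slicing on plain Python lists so it can be run without a model; the token IDs are made up and trim_prompt is a hypothetical helper name.

```python
from typing import List


def trim_prompt(input_ids: List[List[int]], generated_ids: List[List[int]]) -> List[List[int]]:
    """Drop the echoed prompt from each generated sequence, keeping only new tokens."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Made-up token IDs: two prompts padded to length 4 (0 stands in for the pad token).
input_ids = [[0, 11, 12, 13], [21, 22, 23, 24]]
generated_ids = [[0, 11, 12, 13, 101, 102], [21, 22, 23, 24, 201, 202]]

print(trim_prompt(input_ids, generated_ids))  # [[101, 102], [201, 202]]
```

Because every row of a padded batch has the same length, slicing by len(in_ids) removes the prompt and its padding together.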

Without qwen_vl_utils

from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# Load the model in half-precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
output_ids = model.generate(**inputs, max_new_tokens=128)
# Keep only the newly generated tokens by slicing off each prompt
generated_ids = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)

Multi-image inference

# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Video inference

# Messages containing a list of images as a video and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
# Alternatively: messages containing a video file and a text query (this assignment replaces the list above)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
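The "fps" field in the video messages controls how densely frames are sampled. As a rough illustration, the sketch below picks evenly spaced frame indices for a target sampling rate. It is a simplified stand-in and not the actual qwen_vl_utils logic, which may apply additional constraints; sample_frame_indices and its parameters are hypothetical names.

```python
from typing import List


def sample_frame_indices(total_frames: int, native_fps: float, target_fps: float) -> List[int]:
    """Pick evenly spaced frame indices so that roughly `target_fps` frames per second remain."""
    if total_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        raise ValueError("all arguments must be positive")
    duration_s = total_frames / native_fps
    # At least one frame, and never more frames than the video contains.
    n_samples = min(total_frames, max(1, int(duration_s * target_fps)))
    step = total_frames / n_samples
    return [int(i * step) for i in range(n_samples)]


# A 10-second clip recorded at 30 fps, sampled at 1 frame per second.
print(sample_frame_indices(total_frames=300, native_fps=30.0, target_fps=1.0))
# [0, 30, 60, 90, 120, 150, 180, 210, 240, 270]
```

Under the same assumptions, a 20-minute video at 30 fps sampled at "fps": 1.0 would yield 1200 frames, which is why limiting per-frame resolution with "max_pixels" matters for long videos.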

Batch inference

# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine messages for batch processing
messages = [messages1, messages2]

# Preparation for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Batch Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
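Batching prompts of different lengths only works because padding=True pads them to a common length and returns an attention mask marking the real tokens. The sketch below shows that idea on plain lists. Which side a given processor pads on depends on its tokenizer configuration, so the side is a parameter here; pad_batch is a hypothetical helper and the token IDs are made up.

```python
from typing import List, Tuple


def pad_batch(
    sequences: List[List[int]], pad_id: int, side: str = "left"
) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad token ID lists to a common length and build the matching attention mask."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    width = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        fill = [pad_id] * (width - len(seq))
        real = [1] * len(seq)
        ignored = [0] * len(fill)
        if side == "left":
            padded.append(fill + seq)
            mask.append(ignored + real)
        else:
            padded.append(seq + fill)
            mask.append(real + ignored)
    return padded, mask


# Made-up token IDs for a longer multimodal prompt and a short text-only prompt.
batch, attention_mask = pad_batch([[5, 6, 7, 8], [9, 10]], pad_id=0)
print(batch)           # [[5, 6, 7, 8], [0, 0, 9, 10]]
print(attention_mask)  # [[1, 1, 1, 1], [0, 0, 1, 1]]
```

Left padding keeps the last prompt token adjacent to the first generated token, which is a common choice for batched generation with decoder-only models.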

More usage tips

For input images, we support local files, base64, and URLs. For videos, we currently support only local files.

# You can insert a local file path, a URL, or a base64-encoded image at the desired position in the text.
## Local file path
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
## Image URL
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "http://path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
## Base64 encoded image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "data:image;base64,/9j/..."},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
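For the base64 variant, the image bytes have to be encoded into a string with the data:image;base64, prefix shown above. A minimal sketch using only the standard library follows; to_data_uri is a hypothetical helper name, and the three bytes used here are just the JPEG file signature standing in for a real image.

```python
import base64


def to_data_uri(image_bytes: bytes) -> str:
    """Encode raw image bytes as a 'data:image;base64,...' string."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:image;base64,{encoded}"


# JPEG files start with these bytes; a real call would pass the whole file content,
# e.g. open(path, "rb").read() for a path of your choosing.
jpeg_signature = b"\xff\xd8\xff"
uri = to_data_uri(jpeg_signature)
print(uri)  # data:image;base64,/9j/

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": uri},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
```

This also explains why base64-encoded JPEG strings, like the one in the example above, begin with /9j/.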

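Image resolution for performance boost

The model supports a wide range of input resolutions. By default, it uses the native resolution of the input, but higher resolutions can improve performance at the cost of more computation. Users can set the minimum and maximum number of pixels (min_pixels and max_pixels) to suit their needs, for example a token count range of 256-1280, to balance speed and memory usage.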
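To get a feel for how resolution translates into cost, the sketch below estimates the number of visual tokens for an image under given pixel bounds. It assumes one visual token per 28x28 pixel area, as implied by the min_pixels = 256*28*28 example in the quickstart, and uses the default range of 4-16384 tokens mentioned there. The resizing rule is a simplified approximation, not the processor's actual code, and estimate_visual_tokens is a hypothetical name.

```python
import math

PATCH = 28  # assumption: one visual token per 28x28 pixel area


def estimate_visual_tokens(
    height: int,
    width: int,
    min_pixels: int = 4 * PATCH * PATCH,
    max_pixels: int = 16384 * PATCH * PATCH,
) -> int:
    """Approximate the visual token count after resizing into [min_pixels, max_pixels]."""
    # Snap both sides to a multiple of the patch size.
    h = max(PATCH, round(height / PATCH) * PATCH)
    w = max(PATCH, round(width / PATCH) * PATCH)
    if h * w > max_pixels:
        # Shrink while keeping the aspect ratio, rounding down to stay under the cap.
        scale = math.sqrt(height * width / max_pixels)
        h = math.floor(height / scale / PATCH) * PATCH
        w = math.floor(width / scale / PATCH) * PATCH
    elif h * w < min_pixels:
        # Enlarge while keeping the aspect ratio, rounding up to reach the floor.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / PATCH) * PATCH
        w = math.ceil(width * scale / PATCH) * PATCH
    return (h // PATCH) * (w // PATCH)


print(estimate_visual_tokens(1120, 1680))  # 2400
print(estimate_visual_tokens(1120, 1680, max_pixels=1280 * PATCH * PATCH))
print(estimate_visual_tokens(56, 56, min_pixels=256 * PATCH * PATCH))
```

With the 256-1280 token range, the large image is scaled down to fit under 1280 tokens and the tiny one is scaled up to 256 tokens, which illustrates the speed and memory trade-off described above.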

Tags

qwen2_vl multimodal conversational en arxiv:2409.12191 arxiv:2308.12966 base_model:Qwen/Qwen2-VL-2B base_model:finetune:Qwen/Qwen2-VL-2B
