Qwen2-VL-2B-Instruct

Introduction

We are excited to introduce Qwen2-VL, the latest iteration of our Qwen-VL model, representing nearly a year of innovation.

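What's New in Qwen2-VL

Key enhancements:

SoTA understanding of images of various resolutions and aspect ratios: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, and others.
Understanding videos longer than 20 minutes: Qwen2-VL can understand videos over 20 minutes long, enabling high-quality video-based question answering, dialogue, content creation, and more.
Agent that can operate mobile phones, robots, and other devices: with complex reasoning and decision-making abilities, Qwen2-VL can be integrated with devices such as mobile phones and robots to operate them automatically based on the visual environment and text instructions.
Multilingual support: to serve global users, besides English and Chinese, Qwen2-VL now supports understanding text in many languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, and others.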

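Model architecture updates:

Naive Dynamic Resolution: unlike its predecessor, Qwen2-VL can handle arbitrary image resolutions, mapping them to a dynamic number of visual tokens, which offers a more human-like visual processing experience.
Multimodal Rotary Position Embedding (M-ROPE): decomposes the positional embedding into several components to capture 1D textual, 2D visual, and 3D video positional information, enhancing the model's multimodal processing capabilities.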
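The decomposition that M-ROPE describes can be pictured with a small, purely illustrative sketch. The helper names `text_positions` and `vision_positions`, and the way the `start` offset is applied, are assumptions made for this example and are not taken from the Qwen2-VL implementation. The only idea carried over from the description above is that each token gets a (temporal, height, width) triple: text uses the same index in all three slots, while visual tokens vary by frame, row, and column.

```python
from typing import List, Tuple

# Hypothetical representation: one (temporal, height, width) triple per token.
Position = Tuple[int, int, int]


def text_positions(num_tokens: int, start: int = 0) -> List[Position]:
    """Text is 1D, so all three components share the same running index."""
    return [(start + i, start + i, start + i) for i in range(num_tokens)]


def vision_positions(frames: int, rows: int, cols: int, start: int = 0) -> List[Position]:
    """Visual tokens get separate frame / row / column indices.

    A still image is the special case frames == 1 (2D); a video has
    frames > 1 (3D). The `start` offset convention is an assumption.
    """
    return [
        (start + t, start + r, start + c)
        for t in range(frames)
        for r in range(rows)
        for c in range(cols)
    ]


if __name__ == "__main__":
    print(text_positions(3))
    print(vision_positions(frames=1, rows=2, cols=2, start=3))
```

In this sketch a 2x2 image following three text tokens keeps a constant temporal index while the row and column indices change, which is the sense in which the position is "decomposed".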

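We provide three models, with 2 billion, 7 billion, and 72 billion parameters. This repository contains the instruction-tuned 2B Qwen2-VL model. For more information, visit our blog and GitHub.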

Evaluation

Image Benchmarks

Benchmark	InternVL2-2B	MiniCPM-V 2.0	Qwen2-VL-2B
MMMU (val)	36.3	38.2	41.1
DocVQA (test)	86.9	-	90.1
InfoVQA (test)	58.9	-	65.5
ChartQA (test)	76.2	-	73.5
TextVQA (val)	73.4	-	79.7
OCRBench	781	605	794
MTVQA	-	-	20.0
VCR (en, easy)	-	-	81.45
VCR (zh, easy)	-	-	46.16
RealWorldQA	57.3	55.8	62.9
MME (sum)	1876.8	1808.6	1872.0
MMBench-EN (test)	73.2	69.1	74.9
MMBench-CN (test)	70.9	66.5	73.5
MMBench-V1.1 (test)	69.6	65.8	72.2
MMT-Bench (test)	-	-	54.5
MMStar	49.8	39.1	48.0
MMVet (GPT-4-Turbo)	39.7	41.0	49.5
HallBench (avg)	38.0	36.1	41.7
MathVista (testmini)	46.0	39.8	43.0
MathVision	-	-	12.4

Video Benchmarks

Benchmark	Qwen2-VL-2B
MVBench	63.2
PerceptionTest (test)	53.9
EgoSchema (test)	54.9
Video-MME (wo/w subs)	55.6/60.4

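Requirements

The code for Qwen2-VL has been merged into the latest Hugging Face transformers. We recommend building from source with the command pip install git+https://github.com/huggingface/transformers; otherwise you may encounter the following error: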

KeyError: 'qwen2_vl'
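Before loading the model, it can help to check whether the installed transformers already ships Qwen2-VL support. The sketch below is one possible check, not an official one: the function name `supports_qwen2_vl` is hypothetical, and it assumes that support corresponds to the presence of the `transformers.models.qwen2_vl` subpackage.

```python
import importlib.util


def supports_qwen2_vl() -> bool:
    """Return True if the installed transformers contains the qwen2_vl model package.

    Assumption: Qwen2-VL support lives in `transformers.models.qwen2_vl`.
    """
    try:
        spec = importlib.util.find_spec("transformers.models.qwen2_vl")
    except ImportError:
        # transformers itself is missing or cannot be imported.
        return False
    return spec is not None


if __name__ == "__main__":
    if supports_qwen2_vl():
        print("transformers includes Qwen2-VL support")
    else:
        print("Qwen2-VL not found; consider installing transformers from source")
```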

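Quickstart

We provide a toolkit that makes it easier to handle various types of visual input, including base64, URLs, and interleaved images and videos. You can install it with the following command: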

pip install qwen-vl-utils
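As an illustration of the base64 input mentioned above, the following sketch wraps raw image bytes into a message. It assumes the toolkit accepts strings with a `data:image;base64,` prefix in the "image" field; verify this against the qwen-vl-utils documentation before relying on it. The helper names `image_bytes_to_data_uri` and `image_message` are hypothetical.

```python
import base64
from typing import Any, Dict, List


def image_bytes_to_data_uri(data: bytes) -> str:
    """Encode raw image bytes as a base64 data string.

    Assumption: the "data:image;base64," prefix is the form the toolkit expects.
    """
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:image;base64,{encoded}"


def image_message(data: bytes, prompt: str) -> List[Dict[str, Any]]:
    """Build a single-turn user message with one base64 image and a text prompt."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_bytes_to_data_uri(data)},
                {"type": "text", "text": prompt},
            ],
        }
    ]
```

A typical use would be reading a file with `open(path, "rb").read()` and passing the bytes to `image_message`.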

The following code snippet shows how to use the chat model with transformers and qwen_vl_utils:

import torch  # needed only for the optional flash_attention_2 variant below
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-2B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
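The commented min_pixels and max_pixels lines in the snippet above suggest that one visual token corresponds to a 28x28 pixel block (256 tokens for 256*28*28 pixels). Under that assumption, the sketch below estimates how many visual tokens an image of a given size would produce once its pixel count is clamped to the budget. It is a simplified estimate for planning memory use, not the processor's actual resizing code, and the function name `estimate_visual_tokens` is hypothetical.

```python
import math

# Assumption taken from the 28*28 factor in the min_pixels / max_pixels comment.
BLOCK = 28


def estimate_visual_tokens(
    height: int,
    width: int,
    min_pixels: int = 4 * BLOCK * BLOCK,
    max_pixels: int = 16384 * BLOCK * BLOCK,
) -> int:
    """Estimate the visual token count for an image under a pixel budget."""
    # Snap each side to a multiple of the block size (at least one block).
    h = max(BLOCK, round(height / BLOCK) * BLOCK)
    w = max(BLOCK, round(width / BLOCK) * BLOCK)

    if h * w > max_pixels:
        # Shrink while keeping the aspect ratio, rounding down to stay in budget.
        scale = math.sqrt(height * width / max_pixels)
        h = max(BLOCK, math.floor(height / scale / BLOCK) * BLOCK)
        w = max(BLOCK, math.floor(width / scale / BLOCK) * BLOCK)
    elif h * w < min_pixels:
        # Enlarge while keeping the aspect ratio, rounding up to reach the minimum.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / BLOCK) * BLOCK
        w = math.ceil(width * scale / BLOCK) * BLOCK

    return (h // BLOCK) * (w // BLOCK)


if __name__ == "__main__":
    print(estimate_visual_tokens(280, 280))
    print(estimate_visual_tokens(2800, 2800, max_pixels=1280 * BLOCK * BLOCK))
```

Lowering max_pixels caps the token count for large images, which is the speed and memory trade-off the comment refers to.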

Without qwen_vl_utils

from PIL import Image
import requests
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# Load the model on the available device(s); torch_dtype="auto" uses the dtype stored with the checkpoint
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
# Drop the prompt tokens so that only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids) :]
    for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)

Multi-image inference

# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Video inference

# Messages containing a list of images as a video and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
# Alternatively: messages containing a video file and a text query (this assignment replaces the one above)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
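When the frames come from a directory or an extraction step, the frame-list form of the video message shown above can be built programmatically. The helper below is a small sketch; its name `frames_to_video_message` is hypothetical, and it only reproduces the message structure from the snippet without loading any files.

```python
from typing import Any, Dict, List, Sequence


def frames_to_video_message(
    frame_uris: Sequence[str], prompt: str, fps: float = 1.0
) -> List[Dict[str, Any]]:
    """Build a single-turn user message that passes a list of frames as one video."""
    if not frame_uris:
        raise ValueError("at least one frame is required")
    return [
        {
            "role": "user",
            "content": [
                # Same keys as in the example above: "video" holds the frame list.
                {"type": "video", "video": list(frame_uris), "fps": fps},
                {"type": "text", "text": prompt},
            ],
        }
    ]


if __name__ == "__main__":
    frames = [f"file:///path/to/frame{i}.jpg" for i in range(1, 5)]
    messages = frames_to_video_message(frames, "Describe this video.")
    print(messages[0]["content"][0]["video"])
```

The resulting `messages` list can be passed through the same preparation and inference steps as above.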

Batch inference

# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine messages for batch processing
messages = [messages1, messages2]

# Preparation for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
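Each snippet above slices the prompt off the generated sequence before decoding. The sketch below replays that step on plain Python lists so the effect is easy to see. It assumes, as the slicing in the snippets implies, that each generated sequence starts with its (padded) prompt followed by the new tokens; the token values and the name `trim_prompt` are made up for illustration.

```python
from typing import List


def trim_prompt(
    input_ids: List[List[int]], generated_ids: List[List[int]]
) -> List[List[int]]:
    """Remove the leading prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


if __name__ == "__main__":
    # Two prompts padded to the same length (0 stands in for a pad token).
    prompts = [[1, 2, 3], [0, 0, 7]]
    # Hypothetical generate() output: prompt followed by two new tokens each.
    outputs = [[1, 2, 3, 10, 11], [0, 0, 7, 12, 13]]
    print(trim_prompt(prompts, outputs))
```

Because padding gives every prompt in the batch the same length, the same slice offset removes the prompt from every row.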
