Qwen2-VL-7B-Instruct

Introduction

We're excited to introduce Qwen2-VL, the latest iteration of our Qwen-VL model, representing nearly a year of innovation.

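What's New in Qwen2-VL?

Key Enhancements:

SoTA understanding of images at various resolutions and aspect ratios: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, and MTVQA.
Understanding videos of 20 minutes or more: Qwen2-VL can understand videos longer than 20 minutes for high-quality video-based question answering, dialog, and content creation.
Agent that can operate mobile phones, robots, and other devices: with its complex reasoning and decision-making abilities, Qwen2-VL can be integrated with devices such as mobile phones and robots to operate them automatically based on the visual environment and text instructions.
Multilingual support: to serve global users, besides English and Chinese, Qwen2-VL now understands text in many other languages inside images, including most European languages, Japanese, Korean, Arabic, and Vietnamese.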

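Model Architecture Updates:

Naive Dynamic Resolution: unlike its predecessors, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens and offering a more human-like visual processing experience.
Multimodal Rotary Position Embedding (M-ROPE): decomposes the positional embedding into parts that capture 1D textual, 2D visual, and 3D video positional information, enhancing the model's multimodal processing capabilities.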
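
To make the M-ROPE idea above more concrete, here is a minimal sketch of how three-component position ids might be assigned to a sequence made of text tokens, one image, and more text. The function name `mrope_position_ids` and the offset convention are assumptions for illustration only; the actual logic lives inside the model implementation and also covers video and multiple images.

```python
def mrope_position_ids(n_text_before, grid_h, grid_w, n_text_after):
    """Illustrative (temporal, height, width) position ids for text + one image + text.

    Hypothetical helper: it only mirrors the idea of splitting a position
    into three components, not the library's real implementation.
    """
    temporal, height, width = [], [], []

    # Text tokens: all three components share the same 1D index.
    for i in range(n_text_before):
        temporal.append(i)
        height.append(i)
        width.append(i)

    # Image tokens: the temporal component stays constant, while the
    # height and width components follow the token's row and column.
    start = n_text_before
    for row in range(grid_h):
        for col in range(grid_w):
            temporal.append(start)
            height.append(start + row)
            width.append(start + col)

    # Text after the image resumes from the largest id used so far, plus one.
    used = temporal + height + width
    next_id = max(used) + 1 if used else 0
    for i in range(n_text_after):
        temporal.append(next_id + i)
        height.append(next_id + i)
        width.append(next_id + i)

    return temporal, height, width
```

For plain text the three components are identical, so the scheme reduces to ordinary 1D positions; only visual tokens make the components diverge.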

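We provide three models with 2 billion, 7 billion, and 72 billion parameters. This repository contains the instruction-tuned 7B Qwen2-VL model. For more information, visit our blog and GitHub.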

Evaluation

Image Benchmarks

Benchmark	InternVL2-8B	MiniCPM-V 2.6	GPT-4o-mini	Qwen2-VL-7B
MMMU (val)	51.8	49.8	60	54.1
DocVQA (test)	91.6	90.8	-	94.5
InfoVQA (test)	74.8	-	-	76.5
ChartQA (test)	83.3	-	-	83.0
TextVQA (val)	77.4	80.1	-	84.3
OCRBench	794	852	785	845
MTVQA	-	-	-	26.3
VCR (en, easy)	-	73.88	83.60	89.70
VCR (zh, easy)	-	10.18	1.10	59.94
RealWorldQA	64.4	-	-	70.1
MME (sum)	2210.3	2348.4	2003.4	2326.8
MMBench-EN (test)	81.7	-	-	83.0
MMBench-CN (test)	81.2	-	-	80.5
MMBench-V1.1 (test)	79.4	78.0	76.0	80.7
MMT-Bench (test)	-	-	-	63.7
MMStar	61.5	57.5	54.8	60.7
MMVet (GPT-4-Turbo)	54.2	60.0	66.9	62.0
HallBench (avg)	45.2	48.1	46.1	50.6
MathVista (testmini)	58.3	60.6	52.4	58.2
MathVision	-	-	-	16.3

Video Benchmarks

Benchmark	InternVL2-8B	LLaVA-OneVision-7B	MiniCPM-V 2.6	Qwen2-VL-7B
MVBench	66.4	56.7	-	67.0
PerceptionTest (test)	-	57.1	-	62.3
EgoSchema (test)	-	60.1	-	66.7
Video-MME (wo/w subs)	54.0/56.9	58.2/-	60.9/63.6	63.3/69.0

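Requirements

The code for Qwen2-VL has been merged into the latest Hugging Face transformers. We recommend building from source with the command pip install git+https://github.com/huggingface/transformers; otherwise you may encounter the following error: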

KeyError: 'qwen2_vl'
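
As a quick way to tell in advance whether the installed transformers is recent enough, one could check whether the `qwen2_vl` model type is registered. This is a minimal sketch: the helper name is hypothetical, and it relies on `CONFIG_MAPPING_NAMES`, an internal mapping in transformers that is assumed here to list the known model types.

```python
def transformers_knows_qwen2_vl() -> bool:
    """Return True if the installed transformers registers the 'qwen2_vl' model type."""
    try:
        # Internal mapping of model-type strings to config class names
        # (an assumption of this sketch, not a documented public API).
        from transformers.models.auto.configuration_auto import CONFIG_MAPPING_NAMES
    except ImportError:
        # transformers is missing, or too old to expose this mapping.
        return False
    return "qwen2_vl" in CONFIG_MAPPING_NAMES


if __name__ == "__main__":
    if not transformers_knows_qwen2_vl():
        print("transformers does not know 'qwen2_vl'; consider installing from source.")
```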

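Quickstart

We provide a toolkit that makes it easier to handle various types of visual input, including base64, URLs, and interleaved images and videos. You can install it with the following command: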

pip install qwen-vl-utils
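
Since the toolkit accepts base64 input, a small helper for turning raw image bytes into a message entry may be useful. The sketch below is illustrative: the helper names are hypothetical, and the `data:image;base64,` prefix is an assumption about the string form the toolkit expects.

```python
import base64


def to_data_uri(image_bytes: bytes) -> str:
    """Encode raw image bytes as a base64 data URI (prefix is an assumption)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return "data:image;base64," + encoded


def image_message(image_ref: str, prompt: str) -> list:
    """Build a single-turn user message with one image reference and a text prompt.

    image_ref may be a URL, a file:// path, or a data URI from to_data_uri().
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_ref},
                {"type": "text", "text": prompt},
            ],
        }
    ]
```

For a local file, something like `image_message(to_data_uri(open(path, "rb").read()), "Describe this image.")` would produce a message in the same shape as the examples in this section.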

The following code snippet shows how to use the chat model with transformers and qwen_vl_utils:

from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# This variant requires `import torch` and an installed flash-attn package.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-7B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the device the model was loaded on
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
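
The comments in the snippet above relate `min_pixels` and `max_pixels` to a visual token range through the factor 28*28. As a rough sketch of that relationship, the function below estimates how many visual tokens an image of a given size would produce. It assumes a 14-pixel patch with 2x2 merging (hence 28) and a resize rule that snaps both sides to multiples of 28; the function name and the exact rounding are assumptions, and the processor remains the authoritative source.

```python
import math

FACTOR = 28  # assumed: 14-pixel patches merged 2x2, so one token covers 28x28 pixels


def estimate_visual_tokens(height, width,
                           min_pixels=4 * 28 * 28,
                           max_pixels=16384 * 28 * 28):
    """Estimate the visual token count for an image (illustrative approximation)."""
    # Snap each side to the nearest multiple of FACTOR (at least one cell).
    h = max(FACTOR, round(height / FACTOR) * FACTOR)
    w = max(FACTOR, round(width / FACTOR) * FACTOR)

    if h * w > max_pixels:
        # Too large: shrink while keeping the aspect ratio, rounding down.
        scale = math.sqrt((height * width) / max_pixels)
        h = math.floor(height / scale / FACTOR) * FACTOR
        w = math.floor(width / scale / FACTOR) * FACTOR
    elif h * w < min_pixels:
        # Too small: enlarge while keeping the aspect ratio, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / FACTOR) * FACTOR
        w = math.ceil(width * scale / FACTOR) * FACTOR

    return (h // FACTOR) * (w // FACTOR)
```

Under these assumptions, lowering `max_pixels` caps the token count for large images, which is what trades detail for speed and memory.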

Without qwen_vl_utils

from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# Load the model in the checkpoint's dtype on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
# Move the inputs to the device the model was loaded on
inputs = inputs.to(model.device)

# Inference: Generation of the output
output_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens so only the newly generated tokens are decoded
generated_ids = [
    out_ids[len(in_ids) :]
    for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)

Multi-image inference

# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
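
The multi-image message above hard-codes two `file://` entries. When the images come from a list of local paths, the same structure can be built programmatically. This is a small sketch with a hypothetical helper name; it only constructs the message and does not load the images.

```python
from pathlib import Path


def multi_image_messages(paths, prompt):
    """Build one user turn containing several local images followed by a text query."""
    content = [
        # resolve() makes the path absolute so that as_uri() can produce file://...
        {"type": "image", "image": Path(p).resolve().as_uri()}
        for p in paths
    ]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]
```

The returned list can be passed to `processor.apply_chat_template` and `process_vision_info` in the same way as the hand-written `messages`.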

Video inference

# Messages containing a list of images as a video and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
# Alternatively: messages containing a video file and a text query (this replaces the frame-list variant above)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the device the model was loaded on
inputs = inputs.to(model.device)

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
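
The first message format above passes a video as a list of frame images. To prepare such a list from a decoded video at roughly one frame per second, one needs to decide which frame indices to keep. The sketch below shows one simple way to do that; the function name and the sampling rule are assumptions for illustration and are independent of whatever sampling qwen_vl_utils applies to video files.

```python
def sample_frame_indices(total_frames, native_fps, target_fps=1.0):
    """Pick evenly spaced frame indices so the result plays at about target_fps."""
    if total_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        raise ValueError("total_frames, native_fps and target_fps must be positive")

    # Distance between kept frames, never less than one source frame.
    step = max(1.0, native_fps / target_fps)

    indices = []
    position = 0.0
    while round(position) < total_frames:
        indices.append(int(round(position)))
        position += step
    return indices
```

Under this rule a three-second clip recorded at 30 fps yields three frames at a target of 1 fps, and a 20-minute video yields about 1200.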

Batch inference

# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine messages for batch processing
messages = [messages1, messages2]

# Preparation for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the device the model was loaded on
inputs = inputs.to(model.device)

# Batch Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
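
Every example in this section slices the prompt tokens off the generated sequences before decoding. The toy sketch below uses plain Python lists instead of tensors to show what that slicing does; the helper name and the token ids are made up for illustration.

```python
def trim_prompts(input_ids, generated_ids):
    """Drop the prompt prefix from each generated sequence.

    generate() returns the prompt followed by the new tokens, so removing the
    first len(prompt) ids from each row leaves only the model's reply.
    """
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Two padded prompts of equal length and their (made-up) generated sequences.
prompts = [[1, 2, 3], [0, 4, 5]]
outputs = [[1, 2, 3, 9, 8], [0, 4, 5, 7, 6]]
replies = trim_prompts(prompts, outputs)
```

Because the batch is padded to a common length, every row is trimmed by the same number of tokens.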
