Qwen2-VL-7B-Instruct

Model card

License: apache-2.0
Language: en
Framework: transformers
Task: image-text-to-text
Tags: multimodal

Model configuration

Model type: qwen2_vl
Architecture: Qwen2VLForConditionalGeneration

Introduction

We're excited to unveil Qwen2-VL, the latest iteration of our Qwen-VL model, representing nearly a year of innovation.

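What's new in Qwen2-VL?

Key enhancements:

  • SoTA understanding of images of various resolutions and aspect ratios: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.

  • Understanding videos of 20+ minutes: Qwen2-VL can understand videos longer than 20 minutes for high-quality video question answering, dialog, content creation, etc.

  • Agent that can operate mobile phones, robots, and other devices: with its complex reasoning and decision-making abilities, Qwen2-VL can be integrated with devices such as mobile phones and robots for automatic operation based on the visual environment and text instructions.

  • Multilingual support: to serve global users, besides English and Chinese, Qwen2-VL now understands text in many languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.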

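Model architecture updates:

  • Naive Dynamic Resolution: unlike before, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, which offers a more human-like visual processing experience.

  • Multimodal Rotary Position Embedding (M-ROPE): decomposes the positional embedding into parts that capture 1D textual, 2D visual, and 3D video positional information, enhancing the model's multimodal processing capabilities.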
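To make the M-ROPE bullet more concrete, here is a minimal sketch of how three-part position ids could be assigned to a mixed text and image sequence. It is a simplified illustration based on the description in the Qwen2-VL paper (arXiv:2409.12191), not the model's actual implementation; the function names and the example sequence are hypothetical.

```python
from typing import List, Tuple

# One position per token: (temporal, height, width).
Position = Tuple[int, int, int]


def text_positions(start: int, length: int) -> List[Position]:
    """Text tokens: all three components share the same 1D index."""
    return [(start + i, start + i, start + i) for i in range(length)]


def vision_positions(start: int, frames: int, grid_h: int, grid_w: int) -> List[Position]:
    """Visual tokens: frame index, row, and column, each offset by `start`."""
    return [
        (start + t, start + h, start + w)
        for t in range(frames)
        for h in range(grid_h)
        for w in range(grid_w)
    ]


def next_start(positions: List[Position]) -> int:
    """The next segment starts one past the largest id used so far."""
    return max(max(p) for p in positions) + 1


# Hypothetical sequence: 2 text tokens, one image as a 2x2 token grid, 1 text token.
prefix = text_positions(0, 2)
image = vision_positions(next_start(prefix), frames=1, grid_h=2, grid_w=2)
suffix = text_positions(next_start(prefix + image), 1)

for pos in prefix + image + suffix:
    print(pos)
```

For text the three components are identical, so the scheme reduces to ordinary 1D positions; for a single image the temporal component stays constant while the height and width components follow the token grid. A video would use `frames > 1`.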

We provide three models with 2, 7, and 72 billion parameters. This repository contains the instruction-tuned 7B Qwen2-VL model. For more information, visit our blog and GitHub.

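Evaluation

Image benchmarks

| Benchmark | InternVL2-8B | MiniCPM-V 2.6 | GPT-4o-mini | Qwen2-VL-7B |
| --- | --- | --- | --- | --- |
| MMMU (val) | 51.8 | 49.8 | 60 | 54.1 |
| DocVQA (test) | 91.6 | 90.8 | - | 94.5 |
| InfoVQA (test) | 74.8 | - | - | 76.5 |
| ChartQA (test) | 83.3 | - | - | 83.0 |
| TextVQA (val) | 77.4 | 80.1 | - | 84.3 |
| OCRBench | 794 | 852 | 785 | 845 |
| MTVQA | - | - | - | 26.3 |
| VCR (en, easy) | - | 73.88 | 83.60 | 89.70 |
| VCR (zh, easy) | - | 10.18 | 1.10 | 59.94 |
| RealWorldQA | 64.4 | - | - | 70.1 |
| MME (sum) | 2210.3 | 2348.4 | 2003.4 | 2326.8 |
| MMBench-EN (test) | 81.7 | - | - | 83.0 |
| MMBench-CN (test) | 81.2 | - | - | 80.5 |
| MMBench-V1.1 (test) | 79.4 | 78.0 | 76.0 | 80.7 |
| MMT-Bench (test) | - | - | - | 63.7 |
| MMStar | 61.5 | 57.5 | 54.8 | 60.7 |
| MMVet (GPT-4-Turbo) | 54.2 | 60.0 | 66.9 | 62.0 |
| HallBench (avg) | 45.2 | 48.1 | 46.1 | 50.6 |
| MathVista (testmini) | 58.3 | 60.6 | 52.4 | 58.2 |
| MathVision | - | - | - | 16.3 |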

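Video benchmarks

| Benchmark | InternVL2-8B | LLaVA-OneVision-7B | MiniCPM-V 2.6 | Qwen2-VL-7B |
| --- | --- | --- | --- | --- |
| MVBench | 66.4 | 56.7 | - | 67.0 |
| PerceptionTest (test) | - | 57.1 | - | 62.3 |
| EgoSchema (test) | - | 60.1 | - | 66.7 |
| Video-MME (wo/w subs) | 54.0/56.9 | 58.2/- | 60.9/63.6 | 63.3/69.0 |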

Requirements

The code for Qwen2-VL has been merged into the latest Hugging Face transformers. We recommend building it from source:

pip install git+https://github.com/huggingface/transformers

Otherwise you may encounter the following error:

KeyError: 'qwen2_vl'
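Before loading the model, it can help to check whether the installed transformers build already knows the qwen2_vl model type. The sketch below is one possible check, assuming that the `CONFIG_MAPPING` registry exported by transformers lists every supported model type; the helper name `qwen2_vl_supported` is hypothetical.

```python
import importlib.util


def qwen2_vl_supported() -> bool:
    """Return True if the installed transformers registers the 'qwen2_vl' model type."""
    if importlib.util.find_spec("transformers") is None:
        return False  # transformers is not installed at all
    try:
        from transformers import CONFIG_MAPPING
    except ImportError:
        return False  # installed, but the registry could not be imported
    return "qwen2_vl" in CONFIG_MAPPING


if __name__ == "__main__":
    if qwen2_vl_supported():
        print("transformers recognizes qwen2_vl")
    else:
        print("qwen2_vl not found; consider installing transformers from source")
```

If the check returns False, loading the checkpoint would most likely fail with the KeyError shown above.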

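Quickstart

We provide a toolkit that makes it easier to handle various types of visual input, including base64, URLs, and interleaved images and videos. You can install it with the following command: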

pip install qwen-vl-utils
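As a rough illustration of the input forms mentioned above, the sketch below builds an image entry from a local path, a URL, or raw bytes. The helper names are hypothetical, and the `data:image;base64,` prefix is an assumption about the string form the toolkit expects for base64 input; please verify it against the toolkit's own documentation.

```python
import base64


def image_from_file(path: str) -> dict:
    """Reference a local image by absolute POSIX path, in the file:// form used in this card."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute POSIX path")
    return {"type": "image", "image": f"file://{path}"}


def image_from_url(url: str) -> dict:
    """Reference a remote image by URL."""
    return {"type": "image", "image": url}


def image_from_bytes(data: bytes) -> dict:
    """Embed raw image bytes as base64 (prefix is an assumption, see the note above)."""
    encoded = base64.b64encode(data).decode("ascii")
    return {"type": "image", "image": f"data:image;base64,{encoded}"}


print(image_from_file("/path/to/image1.jpg"))
print(image_from_bytes(b"abc"))
```

Each helper returns one entry for the "content" list of a user message.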

The following snippet shows how to use the chat model with transformers and qwen_vl_utils:

import torch  # only needed for the optional flash_attention_2 variant below
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-7B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
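The min_pixels and max_pixels comment in the snippet above implies that one visual token covers roughly a 28x28 pixel area (256 tokens correspond to 256*28*28 pixels). Under that assumption, the sketch below estimates how many visual tokens an image of a given size would produce once it is fitted into the pixel budget. It is an approximation for planning memory use, not the processor's actual resizing code; `estimate_visual_tokens` is a hypothetical helper.

```python
import math

# Assumption: one visual token covers a 28x28 pixel area, as implied by the
# min_pixels = 256*28*28 comment in the snippet above.
FACTOR = 28


def estimate_visual_tokens(
    height: int,
    width: int,
    min_pixels: int = 4 * FACTOR * FACTOR,
    max_pixels: int = 16384 * FACTOR * FACTOR,
) -> int:
    """Estimate the visual token count after fitting the image into the pixel budget."""
    # Snap each side to a multiple of FACTOR (at least one cell per side).
    h = max(FACTOR, round(height / FACTOR) * FACTOR)
    w = max(FACTOR, round(width / FACTOR) * FACTOR)
    if h * w > max_pixels:
        # Shrink while keeping the aspect ratio, rounding down to stay under budget.
        scale = math.sqrt(height * width / max_pixels)
        h = math.floor(height / scale / FACTOR) * FACTOR
        w = math.floor(width / scale / FACTOR) * FACTOR
    elif h * w < min_pixels:
        # Enlarge while keeping the aspect ratio, rounding up to reach the budget.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / FACTOR) * FACTOR
        w = math.ceil(width * scale / FACTOR) * FACTOR
    return (h // FACTOR) * (w // FACTOR)


# The 256-1280 token budget from the commented-out processor configuration.
budget = dict(min_pixels=256 * FACTOR * FACTOR, max_pixels=1280 * FACTOR * FACTOR)
for size in [(56, 56), (448, 448), (2800, 2800)]:
    print(size, estimate_visual_tokens(*size, **budget))
```

With this budget, a tiny image is scaled up to the 256-token floor and a very large one is scaled down to stay under 1280 tokens, which is how the two parameters trade accuracy against speed and memory.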

Without qwen_vl_utils

from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# Load the model in half-precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens so that only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)

Multi-image inference

# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
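When the number of images varies, the message structure above can be generated rather than written by hand. The following is a small sketch of such a helper; `build_image_message` is a hypothetical name, and it only assembles the dictionary layout shown in the example.

```python
from typing import Dict, List


def build_image_message(image_refs: List[str], question: str) -> List[Dict]:
    """Build a single-turn user message with any number of images followed by a text query."""
    content: List[Dict] = [{"type": "image", "image": ref} for ref in image_refs]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


messages = build_image_message(
    ["file:///path/to/image1.jpg", "file:///path/to/image2.jpg"],
    "Identify the similarities between these images.",
)
print(messages)
```

The result has the same shape as the hand-written `messages` list and can be passed to the same preparation steps.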

Video inference

# Messages containing a list of images as a video and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
# Alternatively: messages containing a video file and a text query (this assignment replaces the one above)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Batch inference

# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine messages for batch processing
messages = [messages1, messages2]

# Preparation for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
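The trimming step in the batch example works because padding=True brings every prompt in the batch to the same length, so slicing each output row at the length of its input row removes exactly the prompt part. The sketch below shows this with plain lists. The token ids, the pad id, and the use of left padding are illustrative assumptions; decoder-only models are commonly left-padded for generation, but the processor's actual padding side should be checked.

```python
from typing import List

PAD_ID = 0  # hypothetical padding token id, for illustration only


def left_pad(rows: List[List[int]], pad_id: int = PAD_ID) -> List[List[int]]:
    """Pad every row on the left to the length of the longest row."""
    width = max(len(row) for row in rows)
    return [[pad_id] * (width - len(row)) + row for row in rows]


def trim_prompts(input_ids: List[List[int]], generated_ids: List[List[int]]) -> List[List[int]]:
    """Drop the prompt prefix from each generated row."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Two hypothetical prompts of different lengths and made-up continuations.
input_ids = left_pad([[11, 12, 13, 14], [21, 22]])
new_tokens = [[91, 92], [93, 94]]
generated_ids = [row + new for row, new in zip(input_ids, new_tokens)]

print(input_ids)
print(trim_prompts(input_ids, generated_ids))
```

With left padding the real prompt tokens sit directly before the generated ones, which is why this padding side is usually preferred for batched generation.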

More usage tips

For input images, we support local files, base64, and URLs. For videos, we currently only support

Tags

qwen2_vl multimodal conversational en arxiv:2409.12191 arxiv:2308.12966 base_model:Qwen/Qwen2-VL-7B base_model:finetune:Qwen/Qwen2-VL-7B
