Qwen2-VL-2B-Instruct
Introduction
We're excited to unveil **Qwen2-VL**, the latest iteration of our Qwen-VL model, representing nearly a year of innovation.
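What's New in Qwen2-VL?
Key Enhancements:
- **SoTA understanding of images of various resolutions and aspect ratios**: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, and MTVQA.
- **Understanding videos of 20 minutes or longer**: Qwen2-VL can understand videos over 20 minutes long, enabling high-quality video-based question answering, dialog, and content creation.
- **Agent that can operate mobile phones, robots, and other devices**: With complex reasoning and decision-making abilities, Qwen2-VL can be integrated with devices such as mobile phones and robots for automatic operation based on the visual environment and text instructions.
- **Multilingual support**: To serve global users, besides English and Chinese, Qwen2-VL now understands text in many languages inside images, including most European languages, Japanese, Korean, Arabic, and Vietnamese.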
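Model Architecture Updates:
- **Naive Dynamic Resolution**: Unlike its predecessor, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, which offers a more human-like visual processing experience.
- **Multimodal Rotary Position Embedding (M-ROPE)**: Decomposes the positional embedding into parts that capture 1D textual, 2D visual, and 3D video positional information, enhancing the model's multimodal processing capabilities.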
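To make the M-ROPE idea more concrete, here is a minimal sketch of how three-part position IDs (temporal, height, width) could be assigned to a mixed text and vision sequence. It is an illustration under assumptions, not the model's actual implementation: the function name `mrope_position_ids` and the segment format are hypothetical, and the rule that each new segment starts one past the largest position used so far is a simplification.

```python
from typing import List, Tuple, Union

# Hypothetical segment format: ("text", n_tokens) or ("vision", (t, h, w)),
# where (t, h, w) is the grid of visual tokens (frames, rows, columns).
Segment = Tuple[str, Union[int, Tuple[int, int, int]]]


def mrope_position_ids(segments: List[Segment]) -> List[Tuple[int, int, int]]:
    """Assign a (temporal, height, width) position triple to every token."""
    positions: List[Tuple[int, int, int]] = []
    next_pos = 0  # first position available to the next segment
    for kind, spec in segments:
        if kind == "text":
            # Text is 1D: all three components share the same index.
            for i in range(spec):
                positions.append((next_pos + i,) * 3)
            next_pos += spec
        else:
            t, h, w = spec
            # Vision tokens vary independently along time, height, and width.
            for ti in range(t):
                for hi in range(h):
                    for wi in range(w):
                        positions.append(
                            (next_pos + ti, next_pos + hi, next_pos + wi)
                        )
            # The next segment starts after the largest index used so far.
            next_pos += max(t, h, w)
    return positions


# Two text tokens, one 2x2 image (a single frame), then one more text token.
example = mrope_position_ids([("text", 2), ("vision", (1, 2, 2)), ("text", 1)])
print(example)
```

In this sketch, an image keeps a constant temporal index while its height and width indices vary, and a video would additionally step the temporal index per frame.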
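We provide three models with 2 billion, 7 billion, and 72 billion parameters. This repository contains the instruction-tuned 2B Qwen2-VL model. For more information, visit our Blog and GitHub.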
Evaluation
Image Benchmarks
| Benchmark | InternVL2-2B | MiniCPM-V 2.0 | Qwen2-VL-2B |
|---|---|---|---|
| MMMU (val) | 36.3 | 38.2 | 41.1 |
| DocVQA (test) | 86.9 | - | 90.1 |
| InfoVQA (test) | 58.9 | - | 65.5 |
| ChartQA (test) | 76.2 | - | 73.5 |
| TextVQA (val) | 73.4 | - | 79.7 |
| OCRBench | 781 | 605 | 794 |
| MTVQA | - | - | 20.0 |
| VCR (en, easy) | - | - | 81.45 |
| VCR (zh, easy) | - | - | 46.16 |
| RealWorldQA | 57.3 | 55.8 | 62.9 |
| MME (sum) | 1876.8 | 1808.6 | 1872.0 |
| MMBench-EN (test) | 73.2 | 69.1 | 74.9 |
| MMBench-CN (test) | 70.9 | 66.5 | 73.5 |
| MMBench-V1.1 (test) | 69.6 | 65.8 | 72.2 |
| MMT-Bench (test) | - | - | 54.5 |
| MMStar | 49.8 | 39.1 | 48.0 |
| MMVet (GPT-4-Turbo) | 39.7 | 41.0 | 49.5 |
| HallBench (avg) | 38.0 | 36.1 | 41.7 |
| MathVista (testmini) | 46.0 | 39.8 | 43.0 |
| MathVision | - | - | 12.4 |
Video Benchmarks
| Benchmark | Qwen2-VL-2B |
|---|---|
| MVBench | 63.2 |
| PerceptionTest (test) | 53.9 |
| EgoSchema (test) | 54.9 |
| Video-MME (wo/w subs) | 55.6/60.4 |
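Requirements
The code for Qwen2-VL is included in the latest Hugging Face transformers. We advise building from source with the command pip install git+https://github.com/huggingface/transformers; otherwise you may encounter the following error: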
KeyError: 'qwen2_vl'
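Before loading the model, it can help to check whether the installed transformers already exposes the Qwen2-VL class. The following is a minimal sketch; the helper name `supports_qwen2_vl` is hypothetical and not part of any library.

```python
def supports_qwen2_vl() -> bool:
    """Return True if the installed transformers exposes the Qwen2-VL class."""
    try:
        # This import fails when transformers is missing or predates Qwen2-VL.
        from transformers import Qwen2VLForConditionalGeneration  # noqa: F401
    except (ImportError, RuntimeError):
        # RuntimeError covers lazy-import failures inside transformers itself.
        return False
    return True


if not supports_qwen2_vl():
    print(
        "transformers is missing or too old for Qwen2-VL; try: "
        "pip install git+https://github.com/huggingface/transformers"
    )
```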
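Quickstart
We offer a toolkit to help you handle various types of visual input more conveniently, including base64, URLs, and interleaved images and videos. You can install it with the following command:
pip install qwen-vl-utils
The following snippet shows how to use the chat model with transformers and qwen_vl_utils: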
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
"Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
)
# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
# "Qwen/Qwen2-VL-2B-Instruct",
# torch_dtype=torch.bfloat16,
# attn_implementation="flash_attention_2",
# device_map="auto",
# )
# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
},
{"type": "text", "text": "Describe this image."},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
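The trimming step in the snippet above is needed because generate returns the prompt tokens followed by the newly generated tokens. Here is a minimal sketch of that slicing on plain Python lists; the token IDs are made up for illustration and the helper name `trim_prompt_tokens` is hypothetical.

```python
from typing import List


def trim_prompt_tokens(
    input_ids: List[List[int]], generated_ids: List[List[int]]
) -> List[List[int]]:
    """Drop the leading prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Made-up token IDs: each output repeats its prompt, then adds new tokens.
prompts = [[101, 7, 8], [101, 9, 10]]
outputs = [[101, 7, 8, 42, 43], [101, 9, 10, 44, 45]]
print(trim_prompt_tokens(prompts, outputs))  # [[42, 43], [44, 45]]
```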
Without qwen_vl_utils
from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
# Load the model in half-precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
"Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
conversation = [
{
"role": "user",
"content": [
{
"type": "image",
},
{"type": "text", "text": "Describe this image."},
],
}
]
# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'
inputs = processor(
text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")
# Inference: Generation of the output
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [
out_ids[len(in_ids) :]
for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)
Multi-image inference
# Messages containing multiple images and a text query
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "file:///path/to/image1.jpg"},
{"type": "image", "image": "file:///path/to/image2.jpg"},
{"type": "text", "text": "Identify the similarities between these images."},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Video inference
# Messages containing a list of images as a video and a text query
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": [
"file:///path/to/frame1.jpg",
"file:///path/to/frame2.jpg",
"file:///path/to/frame3.jpg",
"file:///path/to/frame4.jpg",
],
"fps": 1.0,
},
{"type": "text", "text": "Describe this video."},
],
}
]
# Messages containing a video and a text query
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": "file:///path/to/video1.mp4",
"max_pixels": 360 * 420,
"fps": 1.0,
},
{"type": "text", "text": "Describe this video."},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
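When a video is supplied as a list of frames, the message can be assembled programmatically. The sketch below builds the same structure as the frame-list example above; the helper name `build_frame_video_message` is hypothetical, and the frame paths are placeholders that do not need to exist for the message to be constructed.

```python
from pathlib import Path
from typing import Dict, List


def build_frame_video_message(
    frame_paths: List[str], prompt: str, fps: float = 1.0
) -> List[Dict]:
    """Build a single-turn user message whose video is a list of frame URIs."""
    # as_uri() requires an absolute path, so resolve each path first.
    frame_uris = [Path(p).resolve().as_uri() for p in frame_paths]
    return [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": frame_uris, "fps": fps},
                {"type": "text", "text": prompt},
            ],
        }
    ]


messages = build_frame_video_message(
    ["/path/to/frame1.jpg", "/path/to/frame2.jpg"], "Describe this video."
)
print(messages[0]["content"][0]["video"])
```

The frames are kept in the order given, so the caller is responsible for sorting them chronologically.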
Batch inference
# Sample messages for batch inference
messages1 = [
{
"role": "user",
"content": [
{"type": "image", "image": "file:///path/to/image1.jpg"},
{"type": "image", "image": "file:///path/to/image2.jpg"},
{"type": "text", "text": "What are the common elements in these pictures?"},
],
}
]
messages2 = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who are you?"},
]
# Combine messages for batch processing
messages = [messages1, messages2]
# Preparation for batch inference
texts = [
processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=texts,
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Batch Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
More Usage Tips
For input images, we support local files, base64, and URLs. For videos, we currently support only local files.
# You can directly insert a local file path, a URL, or a base64-encoded image into the position where you want in the text.
## Local file path
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "file:///path/to/your/image.jpg"},
{"type": "text", "text": "Describe this image."},
],
}
]
## Image URL
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "http://path/to/your/image.jpg"},
{"type": "text", "text": "Describe this image."},
],
}
]
## Base64 encoded image
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "data:image;base64,/9j/..."},
{"type": "text", "text": "Describe this image."},
],
}
]
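For the base64 form shown above, the string can be produced from raw image bytes with the standard library. This is a minimal sketch; the helper name `to_base64_image_uri` is hypothetical, and the three bytes used in the example are only the JPEG magic number, not a complete image.

```python
import base64


def to_base64_image_uri(image_bytes: bytes) -> str:
    """Encode raw image bytes as a 'data:image;base64,...' string."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return "data:image;base64," + encoded


# JPEG files begin with the bytes FF D8 FF, which encode to "/9j/".
print(to_base64_image_uri(b"\xff\xd8\xff"))  # data:image;base64,/9j/
```

In practice the bytes would come from reading an image file in binary mode, and the resulting string goes into the "image" field of the message.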
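Image Resolution for Performance Boost
The model supports a wide range of input resolutions. By default, it uses the native resolution of the input, but higher resolutions can improve performance at the cost of more computation. Users can set the minimum and maximum number of pixels (min_pixels and max_pixels) to balance speed and memory usage.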
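To get a feel for how min_pixels and max_pixels translate into a visual token budget, here is a rough estimator. It is a sketch under assumptions: it takes one visual token per 28x28 pixel block (as implied by the 256*28*28 and 1280*28*28 settings in the quickstart), and its rounding and rescaling rules are an approximation, not the processor's exact behavior. The name `estimate_visual_tokens` is hypothetical.

```python
import math

PATCH = 28  # assumed side length, in pixels, of the area covered by one visual token


def estimate_visual_tokens(
    height: int,
    width: int,
    min_pixels: int = 4 * PATCH * PATCH,
    max_pixels: int = 16384 * PATCH * PATCH,
) -> int:
    """Estimate the visual token count for an image under a pixel budget."""
    # Snap each side to a multiple of PATCH (at least one patch per side).
    h = max(PATCH, round(height / PATCH) * PATCH)
    w = max(PATCH, round(width / PATCH) * PATCH)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same factor, rounding down.
        scale = math.sqrt((height * width) / max_pixels)
        h = math.floor(height / scale / PATCH) * PATCH
        w = math.floor(width / scale / PATCH) * PATCH
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same factor, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / PATCH) * PATCH
        w = math.ceil(width * scale / PATCH) * PATCH
    return (h // PATCH) * (w // PATCH)


budget = dict(min_pixels=256 * PATCH * PATCH, max_pixels=1280 * PATCH * PATCH)
for size in [(448, 448), (2800, 2800), (56, 56)]:
    print(size, estimate_visual_tokens(*size, **budget))
```

A lower max_pixels reduces the token count for large images, which lowers memory use and latency at some cost in detail.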