vitpose-plus-base

usyd-community keypoint-detection transformers en

usyd-community/vitpose-plus-base

Introduction

ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation and ViTPose+: Vision Transformer Foundation Model for Generic Body Pose Estimation. It obtains 81.1 AP on the MS COCO Keypoint test-dev set.

Model card

License: apache-2.0

Language: en

Framework: transformers

Task: keypoint-detection

Model configuration

Model type: vitpose

Architecture: VitPoseForPoseEstimation

VitPose Model Card

ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation and ViTPose+: Vision Transformer Foundation Model for Generic Body Pose Estimation. It obtains 81.1 AP on the MS COCO Keypoint test-dev set.

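Model Details

Although no specific domain knowledge is considered in their design, plain vision transformers have shown excellent performance in visual recognition tasks. However, little effort has been made to reveal the potential of such simple structures for pose estimation tasks. In this paper, we show the surprisingly good capabilities of plain vision transformers for pose estimation from several aspects, namely simplicity in model structure, scalability in model size, flexibility in training paradigm, and transferability of knowledge between models, through a simple baseline model called ViTPose. Specifically, ViTPose employs plain and non-hierarchical vision transformers as backbones to extract features for a given person instance, and a lightweight decoder for pose estimation. By taking advantage of the scalable model capacity and high parallelism of transformers, it can be scaled up from 100M to 1B parameters, setting a new Pareto front between throughput and performance. Besides, ViTPose is very flexible regarding the attention type, input resolution, pre-training and fine-tuning strategy, as well as dealing with multiple pose tasks. We also empirically demonstrate that the knowledge of large ViTPose models can be easily transferred to small ones via a simple knowledge token. Experimental results show that our basic ViTPose model outperforms representative methods on the challenging MS COCO Keypoint Detection benchmark, while the largest model sets a new state of the art, i.e., 80.9 AP on the MS COCO test-dev set. The code and models are available at https://github.com/ViTAE-Transformer/ViTPose.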
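
For readers unfamiliar with the AP figures quoted above: COCO keypoint AP is computed from object keypoint similarity (OKS) and averaged over OKS thresholds from 0.50 to 0.95. The following is a minimal sketch of OKS for a single predicted pose, not the official evaluation code. The function name and argument layout are assumptions made for illustration, and the per-keypoint constants must come from the benchmark definition.

```python
import math

def object_keypoint_similarity(pred, gt, visibility, area, kappas):
    """Sketch of OKS between one predicted pose and one ground-truth pose.

    pred, gt:   sequences of (x, y) keypoint coordinates
    visibility: ground-truth flags v_i; only keypoints with v_i > 0 count
    area:       ground-truth object area, used as the squared scale s^2
    kappas:     per-keypoint constants k_i that control the falloff
    """
    total, labeled = 0.0, 0
    for (px, py), (gx, gy), v, k in zip(pred, gt, visibility, kappas):
        if v <= 0:
            continue  # unlabeled keypoints do not contribute
        d2 = (px - gx) ** 2 + (py - gy) ** 2  # squared pixel distance
        total += math.exp(-d2 / (2.0 * area * k ** 2))
        labeled += 1
    return total / labeled if labeled else 0.0

# Hypothetical values, chosen only to show the call signature.
print(object_keypoint_similarity([(10, 10)], [(12, 11)], [2], 900.0, [0.1]))
```

A prediction that matches the ground truth exactly scores 1.0, and the score decays toward 0 as keypoints drift relative to the object's size.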

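Model Description

This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.

Developed by: Yufei Xu, Jing Zhang, Qiming Zhang, Dacheng Tao
Funded by: ARC FL-170100117 and IH-180100002.
License: Apache-2.0
Ported to 🤗 Transformers by: Sangbum Choi and Niels Rogge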

Model Sources

Original repository: https://github.com/ViTAE-Transformer/ViTPose
Paper: https://arxiv.org/pdf/2204.12484
Demo: https://huggingface.co/spaces?sort=trending&search=vitpose

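Uses

The ViTPose model, developed by the ViTAE-Transformer team, is primarily designed for pose estimation tasks. Some direct uses of the model are:

Human Pose Estimation: The model can be used to estimate the poses of people in images or videos. This involves locating key body joints such as the head, shoulders, elbows, wrists, hips, knees, and ankles.

Action Recognition: By analyzing poses over time, the model can help recognize various human actions and activities.

Surveillance: In security and surveillance applications, ViTPose can be used to monitor and analyze human behavior in public spaces or on private premises.

Health and Fitness: The model can be used in fitness applications to track and analyze exercise poses and to give feedback on form and technique.

Gaming and Animation: ViTPose can be integrated into gaming and animation systems to create more realistic character movements and interactions.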
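
As an illustration of the health and fitness use case, feedback on form often reduces to measuring the angle at a joint from three detected keypoints. The snippet below is a minimal sketch under that assumption. The helper name joint_angle is hypothetical and not part of the model or of transformers, and the coordinates are the left shoulder, elbow, and wrist of "Person #0" in the example output later in this card.

```python
import math

def joint_angle(a, b, c):
    """Angle in degrees at point b, between the segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    norm1 = math.hypot(*v1)
    norm2 = math.hypot(*v2)
    if norm1 == 0 or norm2 == 0:
        raise ValueError("degenerate joint: two keypoints coincide")
    cos_angle = (v1[0] * v2[0] + v1[1] * v2[1]) / (norm1 * norm2)
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.degrees(math.acos(cos_angle))

# (x, y) pixel coordinates of L_Shoulder, L_Elbow, L_Wrist
shoulder, elbow, wrist = (440.74, 176.95), (436.30, 197.08), (429.91, 217.90)
print(f"Left elbow angle: {joint_angle(shoulder, elbow, wrist):.1f} degrees")
```

An angle close to 180 degrees indicates a nearly straight arm. Because the computation works on 2D image coordinates, it is sensitive to camera viewpoint, and low-confidence keypoints should be filtered out first.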

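Bias, Risks, and Limitations

In this paper, we propose a simple yet effective vision transformer baseline for pose estimation, i.e., ViTPose. Despite having no elaborate structural design, ViTPose obtains SOTA performance on the MS COCO dataset. However, the potential of ViTPose has not been fully explored with more advanced techniques, such as complex decoders or FPN structures, which may further improve performance. Besides, although ViTPose demonstrates exciting properties such as simplicity, scalability, flexibility, and transferability, more research could be done, e.g., exploring prompt-based tuning to further demonstrate the flexibility of ViTPose. In addition, we believe ViTPose can also be applied to other pose estimation datasets, e.g., animal pose estimation [47, 9, 45] and face keypoint detection [21, 6]. We leave these as future work.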

How to Get Started with the Model

Use the code below to get started with the model.

import torch
import requests

from PIL import Image

from transformers import (
    AutoProcessor,
    RTDetrForObjectDetection,
    VitPoseForPoseEstimation,
)

device = "cuda" if torch.cuda.is_available() else "cpu"

url = "http://images.cocodataset.org/val2017/000000000139.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# ------------------------------------------------------------------------
# Stage 1. Detect humans in the image
# ------------------------------------------------------------------------

# Any person detector can be used here; RT-DETR is just one option
person_image_processor = AutoProcessor.from_pretrained("PekingU/rtdetr_r50vd_coco_o365")
person_model = RTDetrForObjectDetection.from_pretrained("PekingU/rtdetr_r50vd_coco_o365", device_map=device)

inputs = person_image_processor(images=image, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = person_model(**inputs)

results = person_image_processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([(image.height, image.width)]), threshold=0.3
)
result = results[0]  # take first image results

# The person class has label index 0 in the COCO dataset
person_boxes = result["boxes"][result["labels"] == 0]
person_boxes = person_boxes.cpu().numpy()

# Convert boxes from VOC (x1, y1, x2, y2) to COCO (x1, y1, w, h) format
person_boxes[:, 2] = person_boxes[:, 2] - person_boxes[:, 0]
person_boxes[:, 3] = person_boxes[:, 3] - person_boxes[:, 1]

# ------------------------------------------------------------------------
# Stage 2. Detect keypoints for each person found
# ------------------------------------------------------------------------

image_processor = AutoProcessor.from_pretrained("usyd-community/vitpose-plus-base")
model = VitPoseForPoseEstimation.from_pretrained("usyd-community/vitpose-plus-base", device_map=device)

inputs = image_processor(image, boxes=[person_boxes], return_tensors="pt").to(device)

# ViTPose+ uses a mixture-of-experts (MoE) architecture, so a dataset index
# in the range 0..5 must be specified for each image
inputs["dataset_index"] = torch.tensor([0], device=device)

with torch.no_grad():
    outputs = model(**inputs)

pose_results = image_processor.post_process_pose_estimation(outputs, boxes=[person_boxes], threshold=0.3)
image_pose_result = pose_results[0]  # results for first image

for i, person_pose in enumerate(image_pose_result):
    print(f"Person #{i}")
    for keypoint, label, score in zip(
        person_pose["keypoints"], person_pose["labels"], person_pose["scores"]
    ):
        keypoint_name = model.config.id2label[label.item()]
        x, y = keypoint
        print(f" - {keypoint_name}: x={x.item():.2f}, y={y.item():.2f}, score={score.item():.2f}")

Output:

Person #0
 - Nose: x=428.81, y=171.53, score=0.92
 - L_Eye: x=429.32, y=168.30, score=0.92
 - R_Eye: x=428.84, y=168.47, score=0.82
 - L_Ear: x=434.60, y=166.54, score=0.90
 - R_Ear: x=440.14, y=165.80, score=0.80
 - L_Shoulder: x=440.74, y=176.95, score=0.96
 - R_Shoulder: x=444.06, y=177.52, score=0.68
 - L_Elbow: x=436.30, y=197.08, score=0.91
 - R_Elbow: x=432.29, y=201.22, score=0.79
 - L_Wrist: x=429.91, y=217.90, score=0.84
 - R_Wrist: x=421.08, y=212.72, score=0.90
 - L_Hip: x=446.15, y=223.88, score=0.74
 - R_Hip: x=449.32, y=223.45, score=0.65
 - L_Knee: x=443.73, y=255.72, score=0.76
 - R_Knee: x=450.72, y=255.21, score=0.73
 - L_Ankle: x=452.14, y=287.30, score=0.66
 - R_Ankle: x=456.02, y=285.99, score=0.72
Person #1
 - Nose: x=398.22, y=181.60, score=0.88
 - L_Eye: x=398.67, y=179.84, score=0.87
 - R_Eye: x=396.07, y=179.44, score=0.87
 - R_Ear: x=388.94, y=180.38, score=0.87
 - L_Shoulder: x=397.11, y=194.19, score=0.71
 - R_Shoulder: x=384.75, y=190.74, score=0.55
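
For downstream processing it is often more convenient to look up keypoints by name than to print them. The following is a minimal sketch, assuming each per-person result has the "keypoints", "labels", and "scores" fields used in the loop above. The helper keypoints_by_name and the min_score parameter are hypothetical names introduced here, and the example uses plain Python values copied from "Person #1" in place of tensors (the label indices 0, 5, and 6 are assumed to follow the COCO keypoint order).

```python
def keypoints_by_name(person_pose, id2label, min_score=0.5):
    """Map keypoint names to (x, y, score), dropping low-confidence points."""
    named = {}
    for keypoint, label, score in zip(
        person_pose["keypoints"], person_pose["labels"], person_pose["scores"]
    ):
        if float(score) < min_score:
            continue  # skip keypoints the model is unsure about
        x, y = keypoint
        named[id2label[int(label)]] = (float(x), float(y), float(score))
    return named

# Stand-in for model.config.id2label, restricted to the labels used below.
id2label = {0: "Nose", 5: "L_Shoulder", 6: "R_Shoulder"}
sample_pose = {
    "keypoints": [(398.22, 181.60), (397.11, 194.19), (384.75, 190.74)],
    "labels": [0, 5, 6],
    "scores": [0.88, 0.71, 0.55],
}
named = keypoints_by_name(sample_pose, id2label, min_score=0.6)
print(sorted(named))  # R_Shoulder is dropped because 0.55 < 0.6
```

With real model output, the same function should work when passed an element of image_pose_result together with model.config.id2label, since float() and int() accept single-element tensors.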

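Training Details

Training Data

Dataset details. We use the MS COCO [28], AI Challenger [41], MPII [3], and CrowdPose [22] datasets for training and evaluation. The OCHuman [54] dataset is involved only in the evaluation stage, to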
