Sapiens: Foundation for Human Vision Models
Abstract
We present Sapiens, a family of models for four fundamental human-centric vision tasks - 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. Our models natively support 1K high-resolution inference and are remarkably easy to adapt to individual tasks by simply finetuning models pretrained on over 300 million in-the-wild human images. We observe that, given the same computational budget, self-supervised pretraining on a curated dataset of human images significantly boosts performance across a diverse set of human-centric tasks. The resulting models exhibit strong generalization to in-the-wild data, even when labeled data is scarce or entirely synthetic. Our simple model design also brings scalability - model performance across tasks improves as we scale the number of parameters from 0.3 to 2 billion. Sapiens consistently surpasses existing baselines across various human-centric benchmarks. We achieve significant improvements over the prior state-of-the-art: 7.6 mAP on Humans-5K (pose), 17.1 mIoU on Humans-2K (part-seg), 22.4% relative RMSE on Hi4D (depth), and 53.5% relative angular error on THuman2 (normal).
- Introduction
In recent years, remarkable progress has been made in generating photorealistic humans in 2D [17, 28, 50, 118] and 3D [69, 89, 102, 109]. The success of these methods is largely attributed to the robust estimation of various assets such as 2D keypoints [14, 67], fine-grained body-part segmentation [119], depth [113], and surface normals [89, 108]. However, robust and accurate estimation of these assets remains an active research area, and the complicated systems built to boost performance on individual tasks often hinder wider adoption. Moreover, obtaining accurate ground-truth annotations in the wild is notoriously hard to scale. Our goal is to provide a unified framework and models to infer these assets in the wild, unlocking a wide range of human-centric applications for everybody.
We argue that such human-centric models should satisfy three criteria: generalization, broad applicability, and high fidelity. Generalization ensures robustness to unseen conditions, enabling the model to perform consistently across varied environments. Broad applicability indicates the versatility of the model, making it suitable for a wide range of tasks with minimal modification. High fidelity denotes the ability of the model to produce precise, high-resolution outputs, essential for faithful human generation tasks. This paper details the development of models that embody these attributes, collectively referred to as Sapiens.
Following the insights in [34, 79, 91], leveraging large datasets and scalable model architectures is key to generalization. For broader applicability, we adopt the pretrain-then-finetune approach, so that adaptation to specific tasks after pretraining can be kept minimal. This approach raises a critical question: what type of data is most effective for pretraining? Given computational constraints, should the emphasis be on collecting as many human images as possible, or is it preferable to pretrain on a less curated set to better reflect real-world diversity? Existing methods often overlook the pretraining data distribution in the context of downstream tasks. To study the influence of the pretraining data distribution on human-specific tasks, we collect Humans-300M, a dataset of 300 million diverse human images. These unlabeled images are used to pretrain a family of vision transformers [27] from scratch, with parameter counts ranging from 300M to 2B.
Among the various self-supervised methods for learning general-purpose visual features from large datasets [5, 19, 34, 47, 48, 121], we choose the masked autoencoder (MAE) approach [48] for its simplicity and efficiency in pretraining. In contrast to contrastive or multi-inference strategies, MAE uses a single-pass inference model, allowing a larger volume of images to be processed under the same computational budget. For higher fidelity, and unlike prior methods, we raise the native input resolution of pretraining to 1024 pixels, an approximately 4x increase in FLOPs compared to the largest existing vision backbone [91]. Each model is pretrained on 1.2 trillion tokens. Table 1 provides an overview of the comparison with earlier approaches. For finetuning on human-centric tasks [15, 101, 113, 119], we use a consistent encoder-decoder architecture. The encoder is initialized with the pretrained weights, while the decoder, a lightweight and task-specific head, is initialized randomly. Both components are then finetuned end-to-end. We focus on four key tasks - 2D pose estimation, body-part segmentation, depth, and normal estimation, as shown in Fig. 1.
Consistent with prior studies [56, 122], we affirm the critical impact of label quality on in-the-wild model performance. Public benchmarks [23, 40, 55] often contain noisy labels, providing inconsistent supervisory signals during model finetuning. At the same time, it is important to use fine-grained and precise annotations that align closely with our primary goal of 3D human digitization. To this end, we propose a substantially denser set of 2D whole-body keypoints for pose estimation and a detailed class vocabulary for body-part segmentation, surpassing the scope of previous datasets (see Fig. 1). Specifically, we introduce a comprehensive collection of 308 keypoints spanning the body, hands, feet, surface, and face. Moreover, we expand the segmentation class vocabulary to 28 classes, covering body parts such as the hair, tongue, teeth, upper/lower lip, and torso. To guarantee annotation quality and consistency, together with a high degree of automation, we use a multi-view capture setup to collect pose and segmentation annotations. We also leverage human-centric synthetic data for depth and normal estimation, using 600 detailed scans from RenderPeople [84] to generate high-resolution depth maps and surface normals.
We show that domain-specific large-scale pretraining, combined with limited yet high-quality annotations, leads to robust in-the-wild generalization. Overall, our approach demonstrates an effective strategy for developing highly precise discriminative models that perform in real-world scenarios without the need to collect a costly and diverse annotation set.
Our contributions are summarized as follows.
- We introduce Sapiens, a family of vision transformers pretrained on a large-scale dataset of human images.
- This study shows that simple data curation and large-scale pretraining, under the same computational budget, significantly improve model performance.
- Our models demonstrate in-the-wild generalization even when finetuned with high-quality or fully synthetic labels.
- To our knowledge, these are the first models to natively support high-fidelity inference at 1K resolution for human-centric tasks, achieving state-of-the-art performance on benchmarks for 2D pose, body-part segmentation, depth, and normal estimation.
- Related Work
Our work explores the limits of training large architectures on a large volume of in-the-wild human images. We build on prior work across several areas: large-scale pretraining, human vision tasks, and large vision transformers.
Large-Scale Pretraining. The remarkable success of large-scale pretraining [26, 95] followed by task-specific finetuning for language modeling [2, 13, 53, 96, 99, 100] has established this approach as standard practice. Similarly, computer vision methods [1, 4, 33, 34, 42, 79, 82, 85, 87, 120] are progressively embracing large-scale data for pretraining. The emergence of large datasets such as LAION-5B [90], Instagram-3.5B [77], JFT-300M [92], LVD-142M [79], Visual Genome [60], and YFCC100M [97] has enabled the exploration of data corpora beyond the scope of traditional benchmarks [61, 67, 86]. Prominent works in this area include DINOv2 [79], MAWS [91], and AIM [34]. DINOv2 achieves state-of-the-art performance for generating self-supervised features by scaling the contrastive iBOT [121] method on the LVD-142M dataset [79]. MAWS [91] studies the scaling of masked autoencoders (MAE) [48] to a billion images. AIM [34] explores the scalability of autoregressive visual pretraining, similar to BERT [26], for vision transformers [27]. Unlike these methods, which primarily target general image pretraining or zero-shot image classification, we take a distinctly human-centric approach: our models leverage a vast collection of human images for pretraining and are subsequently finetuned on a range of human-related tasks.
Human Vision Tasks. Large-scale 3D human digitization [8, 44, 64, 74] remains a pivotal goal in computer vision [12]. Significant progress has been made in controlled or studio settings [3, 59, 63, 69, 70, 76, 89], but extending these methods to unconstrained environments remains challenging [29]. To address these challenges, it is essential to develop versatile models capable of performing multiple fundamental tasks, such as keypoint estimation [21, 35, 46, 51, 57, 78, 80, 93, 106], body-part segmentation [36, 40, 41, 75, 104, 105], depth estimation [9, 10, 32, 43, 52, 66, 83, 113], and surface normal prediction [6, 7, 31, 39, 62, 88, 101, 108], from images captured in natural settings. In this work, we aim to develop models for these fundamental human vision tasks that generalize to in-the-wild settings.
Scaling Architectures. At present, the largest publicly accessible language models contain more than 100B parameters [49], while the more commonly used language models [94, 100] contain around 7B parameters. In contrast, although vision transformers (ViT) [27] share a similar architecture, they have not been scaled successfully to this degree. While there are notable attempts in this direction, including a dense ViT-4B [20] trained on both text and images and techniques for the stable training of a ViT-22B [25], commonly used vision backbones still range between 300M and 600M parameters [24, 38, 45, 68] and are primarily pretrained at an image resolution of around 224 pixels. Similarly, existing transformer-based image generation models such as DiT [81] use fewer than 700M parameters and operate on a highly compressed latent space. To address this gap, we introduce Sapiens - a family of large, high-resolution ViT models natively pretrained at an image resolution of 1024 pixels on millions of human images.
- Method
3.1. Humans-300M Dataset
We curate our pretraining dataset from a large proprietary collection of approximately 1 billion in-the-wild images, focusing exclusively on human images. The preprocessing involves discarding images with watermarks, text, artistic depictions, or unnatural elements. Subsequently, we use an off-the-shelf person bounding-box detector [103] to filter images, retaining those with a detection score above 0.9 and bounding-box dimensions exceeding 300 pixels. Fig. 2 provides an overview of the distribution of the number of people per image in our dataset; over 248 million images contain multiple subjects.
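To make the filtering rule concrete, the following is a minimal sketch of the retention criterion described above, assuming a hypothetical `detections` list produced by the off-the-shelf person detector [103]; only the two thresholds (score 0.9, box side 300 pixels) come from the text.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def keep_image(detections: List[Tuple[Box, float]],
               min_score: float = 0.9, min_side: float = 300.0) -> bool:
    """Keep an image if at least one detected person passes the score and size thresholds."""
    for (x1, y1, x2, y2), score in detections:
        if score > min_score and (x2 - x1) > min_side and (y2 - y1) > min_side:
            return True
    return False
```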
3.2. Pretraining
We follow the masked-autoencoder [48] (MAE) approach for pretraining. Our model is trained to reconstruct the original human image given its partial observation. Like all autoencoders, our model has an encoder that maps the visible image to a latent representation and a decoder that reconstructs the original image from this latent representation. Our pretraining dataset consists of both single- and multi-human images; each image is resized to a fixed size with a square aspect ratio. Similar to ViT [27], we divide an image into regular non-overlapping patches with a fixed patch size. A subset of these patches is randomly selected and masked, leaving the rest visible. The proportion of masked patches to visible ones is defined as the masking ratio, which remains fixed throughout training. We refer to MAE [48] for more details. Fig. 3 (Top) shows the reconstruction of our pretrained model on unseen human images.
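As an illustration, the fixed-ratio random patch masking can be sampled as below. This is a sketch in the spirit of MAE [48]; the per-sample shuffling scheme and the 75% ratio shown here are illustrative assumptions, not the released implementation.

```python
import torch


def random_patch_mask(batch: int, num_patches: int, mask_ratio: float) -> torch.Tensor:
    """Return a boolean mask of shape (batch, num_patches); True marks a masked patch."""
    num_masked = int(num_patches * mask_ratio)
    noise = torch.rand(batch, num_patches)        # independent noise per sample
    ids_shuffle = torch.argsort(noise, dim=1)     # random permutation of patch indices
    mask = torch.zeros(batch, num_patches, dtype=torch.bool)
    mask.scatter_(1, ids_shuffle[:, :num_masked], True)
    return mask


# Example: a 1024x1024 image with 16x16 patches yields 4096 patches.
mask = random_patch_mask(batch=2, num_patches=(1024 // 16) ** 2, mask_ratio=0.75)
```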
Our models exhibit generalization across a variety of image characteristics including scales, crops, the age and ethnicity of subjects, and the number of subjects. Each patch token in our model accounts for 0.02% of the image area, compared to 0.4% in standard ViTs, a 16\times reduction that enables fine-grained inter-token reasoning in our models. Fig. 3 (Bottom) shows that even with an increased mask ratio of 95%, our model achieves a plausible reconstruction of human anatomy on held-out samples.
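For concreteness, the per-token area figures follow directly from the patch and image sizes, assuming a 16-pixel patch at 1024 \times 1024 resolution for Sapiens and a 14-pixel patch at 224 \times 224 resolution for the standard ViT (the configuration consistent with the quoted 0.4% figure):
\frac{16^{2}}{1024^{2}}=\frac{1}{4096} \approx 0.024\%, \qquad \frac{14^{2}}{224^{2}}=\frac{1}{256} \approx 0.39\%, \qquad \frac{1/256}{1/4096}=16 .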
3.3. 2D Pose Estimation
We follow the top-down paradigm, which aims to detect the locations of K keypoints from an input image \mathbf{I} \in \mathbb{R}^{H \times W \times 3}. Most methods pose this problem as heatmap prediction, where each of the K heatmaps represents the probability of the corresponding keypoint being at any spatial location. Similar to [111], we define a pose estimation transformer, \mathcal{P}, for keypoint detection. The bounding box at training and inference is scaled to H \times W and provided as input to \mathcal{P}. Let \mathbf{y} \in \mathbb{R}^{H \times W \times K} denote the K heatmaps corresponding to the ground-truth keypoints for a given input \mathbf{I}. The pose estimator transforms the input \mathbf{I} into a set of predicted heatmaps, \hat{\mathbf{y}} \in \mathbb{R}^{H \times W \times K}, such that \hat{\mathbf{y}}=\mathcal{P}(\mathbf{I}). \mathcal{P} is trained to minimize the mean squared loss, \mathcal{L}_{\text{pose}}=\operatorname{MSE}(\mathbf{y}, \hat{\mathbf{y}}). During finetuning, the encoder of \mathcal{P} is initialized with the weights from pretraining, and the decoder is initialized randomly. The aspect ratio H:W is set to 4:3, with the pretrained positional embedding being interpolated accordingly [58]. We use lightweight decoders with deconvolution and convolution operations.
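As a sketch of this supervision, the snippet below renders Gaussian target heatmaps at the keypoint locations and applies the mean squared loss \mathcal{L}_{\text{pose}}; the Gaussian rendering and its width \sigma are a common choice assumed here, not a detail specified in the paper.

```python
import torch
import torch.nn.functional as F


def gaussian_heatmaps(keypoints: torch.Tensor, H: int, W: int, sigma: float = 2.0) -> torch.Tensor:
    """keypoints: (K, 2) pixel coordinates (x, y). Returns target heatmaps y of shape (K, H, W)."""
    ys = torch.arange(H, dtype=torch.float32).view(H, 1)
    xs = torch.arange(W, dtype=torch.float32).view(1, W)
    maps = [torch.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2)) for x, y in keypoints]
    return torch.stack(maps)


def pose_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L_pose = MSE(y, y_hat), averaged over all K heatmaps and spatial locations."""
    return F.mse_loss(pred, target)
```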
We finetune the encoder and decoder in \mathcal{P} across multiple skeletons, including K=17 [67], K=133 [55], and a new highly detailed skeleton with K=308, as shown in Fig. 4 (Left). Compared to existing formats with at most 68 facial keypoints, our annotations consist of 243 facial keypoints, including representative points around the eyes, lips, nose, and ears. This design is tailored to meticulously capture the nuanced details of facial expressions in the real world. With these keypoints, we manually annotated 1 million images at 4K resolution from an indoor capture setup.
3.4. Body-Part Segmentation
Commonly referred to as human parsing, body-part segmentation aims to classify pixels in the input image \mathbf{I} into C classes. Most methods [40] transform this problem to estimating per-pixel class probabilities to create a probability map \hat{\mathbf{p}} \in \mathbb{R}^{H \times W \times C} such that \hat{\mathbf{p}}=\mathcal{S}(\mathbf{I}) , where \mathcal{S} is the segmentation model. As outlined previously, we adopt the same encoder-decoder architecture and initialization scheme for \mathcal{S} . \mathcal{S} is finetuned to minimize the weighted cross-entropy loss between the actual \mathbf{p} and predicted \hat{\mathbf{p}} probability maps, \mathcal{L}_{\text {seg }}=\operatorname{WeightedCE}(\mathbf{p}, \hat{\mathbf{p}}) .
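A minimal sketch of this objective is shown below; the per-class weights are left to the caller, since their values are not specified in the paper.

```python
import torch
import torch.nn.functional as F


def seg_loss(logits: torch.Tensor, labels: torch.Tensor, class_weights: torch.Tensor) -> torch.Tensor:
    """Weighted cross-entropy between logits (B, C, H, W) and integer labels (B, H, W)."""
    return F.cross_entropy(logits, labels, weight=class_weights)


# Example with the proposed 28-class vocabulary and uniform weights.
C = 28
loss = seg_loss(torch.randn(1, C, 64, 64), torch.randint(0, C, (1, 64, 64)), torch.ones(C))
```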
We finetune \mathcal{S} across two part-segmentation vocabularies: a standard set with C=20 [40] and a new larger vocabulary with C=28, as illustrated in Fig. 4 (Right). Our proposed vocabulary goes beyond previous datasets in important ways. It distinguishes between the upper and lower halves of limbs and incorporates more detailed classifications such as upper/lower lips, teeth, and tongue. To this end, we manually annotate 100K images at 4K resolution with this vocabulary.
3.5. Depth Estimation
For depth estimation, we adopt the architecture used for segmentation, with the modification that the decoder output channel is set to 1 for regression. We denote the ground-truth depth map of image \mathbf{I} by \mathbf{d} \in \mathbb{R}^{H \times W}, the depth estimator by \mathcal{D}, where \hat{\mathbf{d}}=\mathcal{D}(\mathbf{I}), and M as the number of human pixels in the image. For relative depth estimation, we normalize \mathbf{d} to the range [0,1] using the max and min depths in the image. The \mathcal{L}_{\text{depth}} loss [32] for \mathcal{D} is defined as follows:
\begin{aligned}
\Delta \mathbf{d} &= \log(\mathbf{d}) - \log(\hat{\mathbf{d}}), \\
\overline{\Delta \mathbf{d}} &= \frac{1}{M} \sum_{i=1}^{M} \Delta \mathbf{d}_{i}, \\
\mathcal{L}_{\text{depth}} &= \sqrt{\frac{1}{M} \sum_{i=1}^{M} \left(\Delta \mathbf{d}_{i}\right)^{2} - \frac{1}{2}\left(\overline{\Delta \mathbf{d}}\right)^{2}}
\end{aligned}
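Our reading of this scale-invariant log-depth loss, restricted to human pixels, is sketched below in PyTorch; the small epsilon for numerical stability is an addition of ours.

```python
import torch


def depth_loss(pred: torch.Tensor, gt: torch.Tensor, human_mask: torch.Tensor,
               eps: float = 1e-6) -> torch.Tensor:
    """pred, gt: (H, W) relative depth maps; human_mask: (H, W) boolean mask of human pixels."""
    d = torch.log(gt[human_mask] + eps) - torch.log(pred[human_mask] + eps)  # delta d over M pixels
    return torch.sqrt((d ** 2).mean() - 0.5 * d.mean() ** 2)
```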
We render 500,000 synthetic images using 600 high-resolution photogrammetry human scans, as shown in Fig. 5, to obtain a robust and high-fidelity monocular depth estimation model. A random background is selected from a collection of 100 HDRI environment maps. We place a virtual camera within the scene, randomly adjusting its focal length, rotation, and translation to capture images and their associated ground-truth depth maps at 4K resolution.
3.6. Surface Normal Estimation
Similar to previous tasks, we set the decoder output channels of the normal estimator \mathcal{N} to 3, corresponding to the xyz components of the normal vector at each pixel. The generated synthetic data is also used as supervision for surface normal estimation. Let \mathbf{n} be the ground-truth normal map for image \mathbf{I} and \hat{\mathbf{n}}=\mathcal{N}(\mathbf{I}). As with depth, the loss \mathcal{L}_{\text{normal}} is computed only for human pixels in the image and is defined as follows:
\mathcal{L}_{\text{normal}} = \left\| \mathbf{n} - \hat{\mathbf{n}} \right\|_{1} + \left( 1 - \mathbf{n} \cdot \hat{\mathbf{n}} \right)
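Likewise, a minimal PyTorch sketch of \mathcal{L}_{\text{normal}} over human pixels, assuming both normal maps are already unit-length per pixel:

```python
import torch


def normal_loss(pred: torch.Tensor, gt: torch.Tensor, human_mask: torch.Tensor) -> torch.Tensor:
    """pred, gt: (H, W, 3) unit normal maps; human_mask: (H, W) boolean mask of human pixels."""
    n_hat, n = pred[human_mask], gt[human_mask]   # (M, 3)
    l1 = (n - n_hat).abs().sum(dim=-1).mean()     # L1 term
    cos = (n * n_hat).sum(dim=-1).mean()          # cosine-similarity term
    return l1 + (1.0 - cos)
```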
- Experiments
In this section, we initially provide an overview of the implementation details. Subsequently, we conduct comprehensive benchmarking across four tasks: pose estimation, part segmentation, depth estimation, and normal estimation.
4.1. Implementation Details
Our largest model, Sapiens-2B, is pretrained using 1024 A100 GPUs for 18 days using PyTorch. We use the AdamW [73] optimizer for all our experiments. The learning schedule includes a brief linear warm-up, followed by cosine annealing [72] for pretraining and linear decay [65] for finetuning. All models are pretrained from scratch at a resolution of 1024 \times 1024 with a patch size of 16. For finetuning, the input image is resized to a 4:3 ratio, i.e., 1024 \times 768. We use standard augmentations like cropping, scaling, flipping, and photometric distortions. A random background from non-human COCO [67] images is added for the segmentation, depth, and normal prediction tasks. Importantly, we use differential learning rates [114] to preserve generalization, i.e., lower learning rates for initial layers and progressively higher rates for subsequent layers. The layer-wise learning rate decay is set to 0.85 with a weight decay of 0.1 for the encoder. We detail the design specifications of Sapiens in Table 2. Following [34, 100], we prioritize scaling models by width rather than depth. Note that the Sapiens-0.3B model, while architecturally similar to the traditional ViT-Large, involves roughly twenty times more FLOPs due to its higher resolution.
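A sketch of the layer-wise learning-rate decay is given below, where encoder block i out of L receives base_lr * decay^(L+1-i); the parameter naming (patch_embed, blocks.<idx>) follows a generic ViT layout and is an assumption, not the released training code.

```python
import torch


def layerwise_lr_groups(model: torch.nn.Module, base_lr: float, num_layers: int,
                        decay: float = 0.85, weight_decay: float = 0.1):
    """Build AdamW parameter groups with progressively lower learning rates for earlier layers."""
    groups = []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if "blocks." in name:                     # encoder transformer blocks
            layer_id = int(name.split("blocks.")[1].split(".")[0]) + 1
        elif "patch_embed" in name:               # earliest layer
            layer_id = 0
        else:                                     # e.g. task-specific decoder head
            layer_id = num_layers + 1
        scale = decay ** (num_layers + 1 - layer_id)
        groups.append({"params": [param], "lr": base_lr * scale, "weight_decay": weight_decay})
    return groups


# optimizer = torch.optim.AdamW(layerwise_lr_groups(model, base_lr=1e-4, num_layers=24))
```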
4.2. 2D Pose Estimation
We finetune Sapiens for face, body, feet, and hand (K=308) pose estimation on our high-fidelity annotations. For training, we use the train set with 1M images, and for evaluation, we use the test set, named Humans-5K, with 5K images. Our evaluation is top-down [111], i.e., we use an off-the-shelf detector [37] for bounding boxes and conduct single-human pose inference. Table 3 shows a comparison of our models with existing methods for whole-body pose estimation. We evaluate all methods on 114 common keypoints between our 308-keypoint vocabulary and the 133-keypoint vocabulary from COCO-WholeBody [55]. Sapiens-0.6B surpasses the current state-of-the-art, DWPose-l [115], by +2.8 AP. Contrary to DWPose [115], which utilizes a complex student-teacher framework with feature distillation tailored for the task, Sapiens adopts a general encoder-decoder architecture with large-scale human-centric pretraining.
Interestingly, even with the same parameter count, our models demonstrate superior performance compared to their counterparts. For instance, Sapiens-0.3B exceeds ViTPose+-L by +5.6 AP, and Sapiens-0.6B outperforms ViTPose+-H by +7.9 AP. Within the Sapiens family, our results indicate a direct correlation between model size and performance. Sapiens-2B sets a new state-of-the-art with 61.1 AP, a significant improvement of +7.6 AP over the prior art. Despite being finetuned with annotations from an indoor capture studio, Sapiens demonstrates robust generalization to real-world scenarios, as shown in Fig. 6.
4.3. Body-Part Segmentation
We fine-tune and evaluate on our annotations with a segmentation vocabulary of 28 classes. Our train set consists of 100K images, and the test set, Humans-2K, consists of 2K images. We compare Sapiens with existing body-part segmentation methods fine-tuned on our train set. Importantly, we use the pretrained checkpoints suggested by each method as initialization. Similar to pose, we observe strong generalization in segmentation, as shown in Table 4.
Interestingly, our smallest model, Sapiens-0.3B, outperforms existing state-of-the-art segmentation methods like Mask2Former [22] and DeepLabV3+ [18] by 12.6 mIoU due to its higher resolution and large-scale human-centric pretraining. Furthermore, increasing the model size improves segmentation performance. Sapiens-2B achieves the best performance of 81.2 mIoU and 89.4 mAcc on the test set. Fig. 7 shows the qualitative results of our models.
4.4. Depth Estimation
We evaluate our models on the THuman2.0 [117] and Hi4D [116] datasets for depth estimation. THuman2.0 consists of 526 high-quality human scans, from which we derive three sets of images for testing: a) face, b) upper body, and c) full body, using a virtual camera. THuman2.0, with 1578 images, thus enables the evaluation of our models' performance on single-human images across multiple scales. Conversely, the Hi4D dataset focuses on multi-human scenarios, with each sequence showcasing two subjects engaged in activities involving human-human interaction. We select sequences from pairs 28, 32, and 37, featuring 6 unique subjects captured from camera 4, totaling 1195 multi-human real images for testing. We follow the relative-depth evaluation protocol established by MiDaS v3.1 [11], reporting standard metrics such as AbsRel and \delta_{1}. In addition, we also report RMSE as our primary metric, since \delta_{1} does not effectively reflect performance in human scenes characterized by subtle depth variations.
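For reference, the per-image metrics can be computed as sketched below over human pixels; the scale-and-shift alignment prescribed by the MiDaS protocol [11] is assumed to have been applied to the prediction beforehand, and the standard 1.25 threshold for \delta_{1} is our assumption.

```python
import torch


def relative_depth_metrics(pred: torch.Tensor, gt: torch.Tensor):
    """pred, gt: (M,) aligned depths over human pixels. Returns (AbsRel, delta_1, RMSE)."""
    abs_rel = ((pred - gt).abs() / gt).mean().item()
    delta_1 = (torch.maximum(pred / gt, gt / pred) < 1.25).float().mean().item()
    rmse = torch.sqrt(((pred - gt) ** 2).mean()).item()
    return abs_rel, delta_1, rmse
```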
Table 5 compares our models with existing state-of-the-art monocular depth estimators. Sapiens-2B, finetuned solely on synthetic data, remarkably outperforms prior art across all single-human scales and multi-human scenarios. We observe a 20% RMSE reduction compared to the top-performing Depth-Anything model on Hi4D images. It is important to highlight that while the baseline models are trained on a variety of scenes, Sapiens specializes in human-centric depth estimation. Fig. 8 presents a qualitative comparison of depth estimation between Sapiens-1B and Depth-Anything-L. To ensure a fair comparison, the predicted depth is renormalized using the human mask in the baseline visualizations.
4.5. Surface Normal Estimation
The datasets for surface normal evaluation are identical to those used for depth estimation. Following [30], we report the mean and median angular error, along with the percentage of pixels within t^{\circ} error for t \in \{11.25^{\circ}, 22.5^{\circ}, 30^{\circ}\}. Table 6 compares our models with existing human-specific surface normal estimators. All our models outperform existing methods by a significant margin. Sapiens-2B achieves a mean error of around 12^{\circ} on the THuman2.0 (single-human) and Hi4D (multi-human) datasets. We qualitatively compare Sapiens-1B with PIFuHD [89] and ECON [108] for surface normal estimation in Figure 9. Note that PIFuHD [89] is trained with the identical set of 3D scans as ours, and ECON [108] is trained with 4000 scans that are a superset of our 3D scan data.
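A sketch of the angular-error metrics reported in Table 6, assuming unit-normalized predicted and ground-truth normals restricted to human pixels:

```python
import torch


def angular_errors(pred: torch.Tensor, gt: torch.Tensor):
    """pred, gt: (M, 3) unit normals. Returns mean/median error in degrees and % within thresholds."""
    cos = (pred * gt).sum(dim=-1).clamp(-1.0, 1.0)
    err_deg = torch.rad2deg(torch.acos(cos))
    within = {t: (err_deg < t).float().mean().item() * 100.0 for t in (11.25, 22.5, 30.0)}
    return err_deg.mean().item(), err_deg.median().item(), within
```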
4.6. Discussion
Importance of Pretraining Data Source. The feature quality is closely linked to the pretraining data quality. We assess the importance of pretraining on various data sources for human-centric tasks by pretraining Sapiens-0.3B on each dataset under identical training schedules and numbers of iterations. We finetune the model on each task and select early checkpoints for evaluation, reasoning that early-stage finetuning better reflects the model's generalization capability. We investigate the impact of pretraining at scale on general images (which may include humans) versus exclusively human images using Sapiens. We randomly select 100 million and 300 million general images from our 1 billion image corpus to create the General-100M and General-300M datasets, respectively. Table 7 shows the comparison of pretraining outcomes. We report mAP for pose on Humans-5K, mIoU for segmentation on Humans-2K, RMSE for depth on THuman2.0, and mean angular error in degrees for normal estimation on Hi4D. Aligned with findings from [112], our results show that pretraining with Humans-300M leads to superior performance across all metrics, highlighting the benefits of human-centric pretraining within a fixed computational budget.
We also study the effect of the number of unique human images seen during pretraining on surface normal estimation performance, reporting the percentage of pixels within 30^{\circ} error. We again maintain identical conditions for Sapiens-0.3B pretraining and finetuning. Fig. 10 shows a steady improvement in performance as the pretraining data size increases, without saturation. In summary, the diversity of human images observed during pretraining directly correlates with improved generalization to downstream tasks.
Zero-Shot Generalization. Our models exhibit broad generalization to a variety of settings. For instance, in segmentation, Sapiens is finetuned on single-human images with limited subject diversity, minimal background variation, and solely third-person views (see Fig. 4). Nevertheless, our large-scale pretraining enables generalization across a varying number of subjects, different ages, and egocentric views, as shown in Fig. 11. These observations hold similarly for the other tasks.
Limitations. While our models generally perform well, they are not perfect. Human images with complex/rare poses, crowding, and severe occlusion are challenging (see supplemental for details). Although aggressive data augmentation and a detect-and-crop strategy could mitigate these issues, we envision our models as a tool for acquiring large-scale, real-world supervision with human-in-the-loop to develop the next generations of human vision models.
- Conclusion
Sapiens represents a significant step toward elevating human-centric vision models into the realm of foundation models. Our models demonstrate strong generalization capabilities on a variety of human-centric tasks. We attribute the state-of-the-art performance of our models to: (i) large-scale pretraining on a large curated dataset specifically tailored to understanding humans, (ii) scaled high-resolution and high-capacity vision transformer backbones, and (iii) high-quality annotations on augmented studio and synthetic data. We believe that these models can become a key building block for a multitude of downstream tasks, and provide access to high-quality vision backbones to a significantly wider part of the community. A potential direction for future work would be extending Sapiens to 3D and multi-modal datasets.
Acknowledgements: We would like to acknowledge He Wen and Srivathsan Govindarajan for their contributions to training and optimizing Sapiens.