
DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM (Paper Walkthrough)

Contents

  • Preface
  • 1. Introduction
  • 2. Related Work
    • 1. Multimodal Large Language Models on Detection Tasks
    • 2. Visual Prompting
    • 3. Multimodal Chain-of-Thought
  • 3. Method
    • 3.1 Detection Chain-of-Thought for Object Detection
    • 3.2 Detection Prompting Toolkits
      • Visual Processing Prompts
      • Detection Reasoning Prompts
  • 4. Experiments
    • 4.1 Experimental Setup
      • Benchmarks
      • Models
      • Implementations
    • 4.2 Experimental Results
      • Open Vocabulary Object Detection
      • Described Object Detection
      • Referring Expression Comprehension
      • Oriented Object Detection
  • 5. Conclusion

Preface

It has been quite a while since my last paper walkthrough. This paper happens to be about unleashing the detection ability of MLLMs, and I wanted to get to the bottom of how it works, so here is my reading of it. The paper's abstract, in short: the authors propose DetToolChain, a new prompting paradigm designed to unleash the zero-shot object detection ability of multimodal large language models (MLLMs) such as GPT-4V and Gemini. The approach consists of a detection prompting toolkit inspired by high-precision detection priors, plus a new Chain-of-Thought to apply these prompts. Concretely, the prompts in the toolkit are designed to guide the MLLM to focus on regional information (e.g., zooming in), to read coordinates off measurement standards (e.g., overlaid rulers and compasses), and to reason from contextual information (e.g., overlaid scene graphs). On top of these tools, the new detection Chain-of-Thought automatically decomposes the task into simple subtasks, diagnoses the predictions, and plans step-by-step refinement of the bounding boxes. The framework's effectiveness is verified on a range of detection tasks, especially hard cases. Compared with existing state-of-the-art methods, GPT-4V with DetToolChain improves AP50 by 21.5% on open-vocabulary detection over MS COCO novel classes, accuracy by 24.23% on zero-shot referring expression comprehension on the RefCOCO val set, and AP by 14.5% on the D-cube described object detection FULL setting.

Paper: https://arxiv.org/pdf/2403.12488

1. Introduction

Large language models (LLMs), e.g., GPT-3 [6], Gemini [42], InternLM [43], and Qwen [2], have shown unprecedented capabilities in understanding human languages and solving practical problems such as scientific question answering and code generation. When integrated with visual encoders, large language models can be upgraded to multimodal large language models (MLLMs), which can achieve abilities similar to human visual intelligence and tackle visual understanding tasks such as image captioning. Despite these advances, the potential of MLLMs in detection tasks is still underestimated among common vision tasks [45,58,63,67]. When required to give precise coordinates in complicated object detection tasks, e.g., detecting highly occluded objects, rotated objects, or small objects in scene images, MLLMs often miss the target objects or answer with inaccurate bounding boxes [58]. The poor performance on object detection significantly limits the applications of MLLMs in the real world, e.g., defect detection [8, 16, 70] and sports analysis [7, 44, 46].

To enhance the detection capabilities of MLLMs, prior efforts fall into two classes. (1) Finetuning MLLMs with high-quality question-answer instructions whose answers contain abundant location information [3, 12, 27, 34]. Despite the considerable improvements achieved, preparing high-quality question-answer pairs requires great manual effort, and finetuning multimodal large language models incurs large computational costs. Furthermore, since current state-of-the-art MLLMs [1, 3, 42] are closed-source and their performance is significantly superior to open-source models, the "finetuning" route cannot be applied to the most powerful MLLMs at the moment (and most likely in the future), which significantly limits its potential to continuously improve emerging state-of-the-art MLLMs. (2) Designing textual or visual prompts with location information to advance the localization ability of MLLMs. While intuition-based prompting methods have greatly advanced regional comprehension tasks such as compositional reasoning [33] and spatial understanding [10], their effectiveness on detection tasks remains underexplored.

This work explores how the detection ability of multimodal large language models can be unlocked by a new chain of thought over a detection prompting toolkit (dubbed DetToolChain). The new DetToolChain is motivated by three ideas. First, visual prompts are identified as a crucial component of the detection prompting toolkit. They offer a more direct and intuitive approach to enhancing the spatial comprehension of MLLMs than language prompts, because current MLLMs still struggle to accurately translate textual coordinates and descriptions into precise regions and visual information. Visual prompts drawn directly in the image can significantly narrow the gap between visual and textual information and ultimately contribute to the improved detection ability of MLLMs. Second, detecting challenging instances, such as occluded and small objects, can be tackled more efficiently by breaking the task down into smaller, simpler subtasks. Third, the detection results should be refined step by step with a Chain-of-Thought, similar to the progressive refinement of bounding boxes in current state-of-the-art object detection algorithms such as DETR [9], Sparse R-CNN [41], and DiffusionDet [14]. Based on these ideas, our proposed DetToolChain consists of a detection prompting toolkit, including visual processing prompts and detection reasoning prompts, and a multimodal Chain-of-Thought to properly apply these detection prompts and unleash the detection ability of MLLMs. The critical designs and insights involve:

(1) A comprehensive set of visual processing prompts that support a wide range of detection tasks. The visual processing prompts process the given images to facilitate the detection performance of MLLMs, following well-accepted prior knowledge and techniques from effective detectors. Specifically, the visual processing prompts fall into three categories: the regional amplifier, the spatial measurement standard, and the scene image parser. These prompts address different key factors of better detectors. First, the regional amplifier consists of image splitting and zooming in, capable of highlighting the region of interest in detection tasks. Second, the spatial measurement standard includes rulers and compasses with linear graduations, which provide translational and rotational references in object detection (particularly in rotated object detection), respectively. Similar to human intelligence, these spatial measurement standards help MLLMs locate objects and read their coordinates out. Third, the scene parser marks the predicted positions or spatial relations of objects in the images with convex hulls, bounding boxes, and scene graphs. These markers facilitate the detection capability of MLLMs by encouraging them to reason from contextual information in the scene images.

(2) A comprehensive set of detection reasoning prompts that help MLLMs diagnose the detection results and reason about the next visual processing prompts to apply. Different from visual processing prompts, the detection reasoning prompts do not process the images; they are dedicated to evaluating the predicted bounding boxes and diagnosing predictions that remain inaccurate even after visual processing prompts have been applied. By analyzing the relationships and reasoning about the co-occurrence of detected objects in the scene image with the commonsense knowledge in the MLLM, these prompts help the MLLM avoid hallucination on detection tasks.

(3) A multimodal detection Chain-of-Thought (Det-CoT) that enables the MLLM to manage the whole process of detecting targets. With these powerful detection prompting toolkits, the detection Chain-of-Thought helps the MLLM comprehend spatial detection tasks and produce reliable bounding boxes. As illustrated in Fig. 1, the multimodal detection Chain-of-Thought is instructed to (1) format the raw input with a suitable instruction template, (2) decompose the complex detection task into smaller sub-tasks and select the corresponding tools from the detection prompting toolkit, (3) execute the detection prompting tools iteratively, and (4) apply its own reasoning and critical thinking to oversee the whole detection process and return the final response. Given a detection task, our proposed Det-CoT can automatically decompose it into multimodal subtasks and refine the predicted bounding boxes progressively.

DetToolChain allows MLLMs to support various detection tasks without instruction tuning. Concretely, it improves the GPT-4V and Gemini baselines by 20%-50% on the main metrics of open-vocabulary detection, described object detection, referring expression comprehension, and oriented object detection. Compared to existing state-of-the-art methods, GPT-4V with our DetToolChain improves on state-of-the-art object detectors by +21.5% AP50 on the MS COCO novel-class set for open-vocabulary detection, +24.23% Acc on the RefCOCO val set for zero-shot referring expression comprehension, and +14.5% AP on the D-cube described object detection FULL setting. Notably, our method, without being trained on the COCO train2017 set, even achieves a better AP50 on COCO val2017 than DETR trained on COCO train2017. To summarize, our contributions are two-fold: (1) We propose a new prompting paradigm that instructs the MLLM to manage object detection by applying tools from the detection prompting toolkit with a multimodal detection Chain-of-Thought. (2) We propose detection prompting toolkits, including visual processing prompts and detection reasoning prompts, to facilitate MLLMs on detection tasks. We showcase that our method achieves remarkable performance on a range of detection tasks.


2. Related Work

1. Multimodal Large Language Models on Detection Tasks

The implementation of Multimodal Large Language Models (MLLMs) in detection tasks has been a recent focus. The widely used strategy is to finetune MLLMs on a high-quality image-text instruction tuning dataset built from detection problems. To promote detection performance, BuboGPT [73], Kosmos-2 [34], Shikra [12], Qwen-VL [3], SPHINX [27], and Ferret [64] constructed instruction datasets with high-quality question-and-answer pairs about detection. Consequently, they consumed great manual effort and heavy computing costs, and suffered from unsatisfactory generalization to objects unseen in the instruction dataset. Furthermore, since most state-of-the-art MLLMs are closed-source [1, 3, 42], it is infeasible to apply instruction tuning to them. In this paper, we believe the current state-of-the-art MLLMs are potentially zero-shot detectors given proper prompting, and therefore design a Chain-of-Thought with new detection prompts to unleash their potential on detection tasks.

2. Visual Prompting

Visual prompting, inspired by textual prompting in Natural Language Processing [6,35], manipulates images to improve the visual perception abilities of MLLMs, such as classification and spatial reasoning. For example, RedCircle [39] used a circle marker to guide the model to focus on a specific region for fine-grained classification, while FGVP [57], SCAFFOLD [25], and SOM [56] explored prompts for spatial reasoning with dot matrices or pretrained models such as SAM [23]. Although CPT [62] and SOM [56] are claimed to perform well on visual grounding and referring expression comprehension, they merely choose the correct answer among bounding boxes and segmentation masks pre-extracted by high-quality detectors [65] and segmentors [23], which is essentially not the ability of object detection. To unleash the detection potential of MLLMs, we propose a general detection prompting toolkit that incorporates well-adopted prior knowledge and strategies but does not include any pretrained model for detection or segmentation.

3. Multimodal Chain-of-Thought

Chain-of-Thought [50,71] methods and their variants [5,24,60,61] have greatly boosted the reasoning ability of large language models (LLMs). By encouraging LLMs to follow a human-like, step-by-step reasoning process, these methods decompose complex tasks into easier subtasks and solve them sequentially to get the final answer. However, the original CoT shows inferior performance [33] on vision-language tasks, e.g., compositional reasoning. Therefore, recent research pivoted towards developing multimodal Chain-of-Thought methods, e.g., Multimodal CoT [72], DDCoT [74], and HoT [59] for multimodal reasoning, Chameleon [30] and Compositional CoT [33] for compositional reasoning, and Spatial CoT [10] for spatial comprehension. Nevertheless, multimodal Chain-of-Thought methods tailored to the detection ability of MLLMs remain unexplored. Our research fills this gap by introducing a detection-tailored Chain-of-Thought approach, coupled with a detection prompting toolkit. Our method differs from other practices in its visual prompts, which use detection priors to progressively refine detection results through the CoT process.

3. Method

While state-of-the-art MLLMs, e.g., GPT-4V and Gemini, exhibit commendable reasoning and recognition capabilities, their detection proficiency has yet to be fully unleashed. To unlock the potential of MLLMs in object detection, we introduce a comprehensive detection prompting toolkit (Sec. 3.2), including visual processing prompts and detection reasoning prompts, and a detection Chain-of-Thought (Det-CoT) to reason about the sequential application of the detection prompts in the toolkit (Sec. 3.1).

3.1 Detection Chain-of-Thought for Object Detection

Let L and I denote the set of finite language strings and the set of images, respectively. Given a test-time query x = {x_l, x_i | x_l ∈ L, x_i ∈ I}, which describes a specific detection task in images and natural language, we aim to obtain the corresponding textual outputs with the help of a frozen MLLM. This model, like GPT-4V [1], also maintains a prompt history H that may include a list of previous messages, and produces a corresponding textual response y. The textual response y includes the detection outputs, the diagnosis of the current outputs, and suggestions for the next prompts to call from the toolkit TK = {T_1, ..., T_n}, where T_i denotes the i-th tool in the toolkit. Finally, we define a string extractor e, which is designed to retrieve a substring enclosed within specific delimiters.
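As a small illustration of the string extractor e: the delimiters are not specified in this excerpt, so the `<final>...</final>` markers below are a hypothetical choice, and the function is only a sketch of the idea.

```python
import re
from typing import Optional

def extract_final_answer(response: str) -> Optional[str]:
    """String extractor e: return the substring wrapped in the (hypothetical)
    <final> ... </final> delimiters, or None if no final detection is present yet."""
    match = re.search(r"<final>(.*?)</final>", response, re.DOTALL)
    return match.group(1).strip() if match else None
```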
Algorithmic Procedure. As illustrated in Algorithm 1, we provide a conceptual overview of the procedure below (a minimal code sketch follows the list):

  1. Formatting the Test-time Query: Using the transformation function f, the raw query is replaced with a suitable template for the MLLM.
  2. Loop Iteration at each prompting step:
    (a) Proposing a thought based on the response y_s: The current message history H_s guides the MLLM to propose a thought, either to stop here and return the final outputs, or to continue the algorithm and engage another visual prompting tool.
    (b) Engaging a detection prompting tool: If the MLLM does not return the final result, it selects the appropriate visual prompting tool and constructs it with the corresponding instructions. The constructed prompt is appended to the message history H_s and sent to the next iteration.
  3. Returning the Final Response: If the string extractor e discovers the final detection with special markers in the MLLM's response y_s, the algorithm ends and returns the extracted bounding boxes.
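The minimal sketch below paraphrases this loop in Python. It is not the authors' implementation: the message format and the helper names (`format_query`, `parse_tool_call`, `parse_boxes`, and the `call_mllm` callable) are assumptions made for illustration, and `extract_final_answer` is the toy extractor sketched above.

```python
def det_cot(query_text, query_image, toolkit, call_mllm, max_steps=10):
    """Rough sketch of the detection Chain-of-Thought loop (Algorithm 1, paraphrased)."""
    # Step 1: format the raw test-time query with a task template (function f).
    history = [format_query(query_text, query_image)]          # hypothetical helper
    for _ in range(max_steps):
        # Step 2a: the MLLM proposes a thought given the current history H_s.
        response = call_mllm(history)
        final = extract_final_answer(response)                 # string extractor e
        if final is not None:
            # Step 3: final detection found, return the extracted bounding boxes.
            return parse_boxes(final)                          # hypothetical helper
        # Step 2b: otherwise apply the suggested visual prompting tool and append
        # the prompted image plus instruction to the message history.
        tool_name, tool_args = parse_tool_call(response)       # hypothetical helper
        prompted_image, instruction = toolkit[tool_name](query_image, **tool_args)
        history.append({"image": prompted_image, "text": instruction})
    return []  # no final answer within the step budget
```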

3.2 Detection Prompting Toolkits

Compared with language prompts, visual prompts are more intuitive and can effectively teach MLLMs prior knowledge for detection. Following this insight, we design a comprehensive set of visual prompting toolkits to enhance various detection tasks. The toolkits are summarized in Fig. 2 and Tab. 1 and described as follows.


Visual Processing Prompts

We introduce visual processing prompts to preprocess the input image and enhance the detection performance of MLLMs. Inspired by prior detection techniques, we design our visual processing prompts (see Fig. 2) in three types focusing on different aspects: better visibility of details, more accurate spatial reference, and better contextual comprehension of the given images. A rough code sketch of two of these prompts follows the list.

  1. Regional Amplifier aims at enhancing the visibility of the region of interest for MLLMs. Specifically, Image Split crops the image into disparate parts, and Zoom in enables close-up inspection of specific regions in the image.
  2. Spatial Measurement Standard provides a more explicit reference for object detection by overlaying rulers and compasses with linear graduations on the original image, as depicted in Fig. 2(2). The auxiliary ruler and compass enable MLLMs to output accurate coordinates and angles with the help of the translational and rotational references overlaid in the image. Essentially, this standard simplifies the detection task, allowing MLLMs to read out the coordinates of the objects instead of directly predicting them.
  3. Scene Image Parser marks the predicted locations or relations of objects, enabling a more localized understanding of images using spatial and contextual information. The proposed scene image parser can be divided into two categories. First, we mark the predicted objects with centroids, convex hulls, and bounding boxes with label names and box indices. These markers represent the object location information in different formats, enabling the MLLM to detect diverse objects with different shapes and backgrounds, especially objects with irregular shapes or significant occlusions. For example, Convex Hull Marker marks the boundary points of objects and connects them as the convex hull to enhance detection performance when objects have very irregular shapes. Second, we mark the scene graph by connecting the centers of different objects with Scene Graph Marker to highlight the relationships of objects in the image. Based on the scene graph, the MLLM can leverage its contextual reasoning abilities to refine its predicted boxes and avoid hallucinations. For instance, as shown in Fig. 2(3), Jerry Mouse is about to eat the cheese, so their bounding boxes should be quite close to each other.
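To make two of these prompts concrete, here is a rough sketch using Pillow: a zoom-in crop for the regional amplifier and a ruler overlay with linear graduations for the spatial measurement standard. The tick count, colors, and normalized labels are illustrative choices, not details taken from the paper.

```python
from PIL import Image, ImageDraw

def zoom_in(image: Image.Image, box, scale: int = 2) -> Image.Image:
    """Regional amplifier: crop the region of interest and enlarge it."""
    x1, y1, x2, y2 = box
    crop = image.crop((x1, y1, x2, y2))
    return crop.resize((crop.width * scale, crop.height * scale))

def overlay_ruler(image: Image.Image, ticks: int = 10) -> Image.Image:
    """Spatial measurement standard: draw graduated rulers along the top and left
    edges so the MLLM can read coordinates off the image instead of guessing them."""
    img = image.copy()
    draw = ImageDraw.Draw(img)
    w, h = img.size
    for i in range(ticks + 1):
        x, y = int(i * w / ticks), int(i * h / ticks)
        draw.line([(x, 0), (x, 10)], fill="red", width=2)        # top ruler tick
        draw.line([(0, y), (10, y)], fill="red", width=2)        # left ruler tick
        draw.text((x + 2, 12), f"{i / ticks:.1f}", fill="red")   # normalized x label
        draw.text((12, y + 2), f"{i / ticks:.1f}", fill="red")   # normalized y label
    return img
```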

Detection Reasoning Prompts

To improve the reliability of the predicted boxes, we introduce detection reasoning prompts (illustrated in Tab. 1) to examine the predictions and diagnose potential problems. Based on these intuitions, we first propose the Problem Insight Guider, which highlights difficult issues and offers effective detection suggestions for query images. For example, given Fig. 3, the prompt will point out the issue of small object detection and suggest solving it by zooming into the region of the surfboard. Second, to leverage the inherent spatial and contextual abilities of MLLMs, we design the Spatial Relation Analyzer and the Contextual Object Predictor to ensure the detection results accord with common sense. As shown in Fig. 3, surfboards may co-occur with the ocean (contextual knowledge), and a surfing person should have a surfboard very close to their feet (spatial knowledge). Furthermore, we apply the Self-Verification Promoter to enhance the consistency of multi-round responses. To further promote the MLLMs' reasoning ability, we also incorporate widely adopted prompting methods, e.g., debating [18, 76] and self-debugging [15, 37].
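The exact prompt wording is not given in this excerpt; the dictionary below only illustrates the kind of instructions each reasoning prompt might carry, with wording invented for this walkthrough.

```python
# Illustrative wording only; not the prompt text released by the authors.
DETECTION_REASONING_PROMPTS = {
    "problem_insight_guider": (
        "State what makes this query difficult (e.g., small, occluded, or rotated "
        "objects) and suggest a visual processing prompt, such as zooming into a region."
    ),
    "spatial_relation_analyzer": (
        "Check whether the predicted boxes respect plausible spatial relations, "
        "e.g., a surfboard should be very close to the surfer's feet."
    ),
    "contextual_object_predictor": (
        "List objects that commonly co-occur with the detected ones "
        "(e.g., surfboards with the ocean) and verify the predictions are consistent."
    ),
    "self_verification_promoter": (
        "Re-examine your answers from previous rounds and keep only the boxes "
        "you can consistently justify."
    ),
}
```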


4. Experiments

4.1 Experimental Setup

Benchmarks

To assess the effectiveness of our proposed methodology, we undertake a comprehensive evaluation across a diverse array of tasks and datasets related to object detection. We also extend our method to referring expression comprehension, which likewise requires localizing the objects referred to in an expression. These evaluations encompass:
– Object detection on the MS COCO 2017 dataset [26], with results reported on the validation subset by default (a sketch of the standard COCO evaluation call follows this list).
– Described object detection on the D-cube dataset [53, 54], aiming to confirm the presence of objects described by arbitrary open-set expressions and to localize them accordingly.
– Referring expression comprehension, a multimodal task that involves grounding a referent based on a given expression, tested on three representative benchmarks: RefCOCO [66], RefCOCO+ [66], and RefCOCOg [31].
– Oriented object detection on the widely recognized HRSC2016 dataset [29], aiming to identify and locate objects of arbitrary orientation within images.
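For readers who want to reproduce COCO-style AP/AP50 numbers from a file of predictions, the standard pycocotools pipeline looks roughly as follows; the file paths are placeholders, and the paper does not state that it uses this exact code.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths: COCO val2017 annotations and the model's predictions in the
# standard COCO results format [{"image_id", "category_id", "bbox": [x, y, w, h], "score"}, ...].
coco_gt = COCO("annotations/instances_val2017.json")
coco_dt = coco_gt.loadRes("dettoolchain_predictions.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP, AP50, AP75, and the size-wise breakdown
```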

Models

We utilize the following multimodal large language models (MLLMs) in our experiments to explore the performance of the proposed DetToolChain:
– GPT-4V [1], specifically the gpt-4-vision-preview model.
– Gemini [42], specifically the gemini-pro-vision variant.

Implementations

As shown in Fig. 4, the system instruction is set so that the MLLM comprehends its ultimate task objective and the preceding steps. Subsequently, the task description and input image are provided as the first round of instructions. Upon reaching the final answer, an automatic function extracts the coordinate results, streamlining the prediction collection process. To accommodate varying image sizes across datasets, we transform absolute coordinates into normalized ones (a minimal sketch of this conversion is given below); the returned normalized coordinates are ultimately converted back to absolute ones for standard evaluation. For comparison, we implement GPT-4V and Gemini baselines, providing a foundational prompt so that results are returned successfully. Across all experiments, we uniformly apply identical parameters and instructions, setting the maximum token count to 1024 and leaving other settings at default values.
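The coordinate transformation mentioned above is a simple rescaling by the image size. A minimal sketch, assuming boxes in `[x1, y1, x2, y2]` format:

```python
def normalize_box(box, width, height):
    """Map absolute pixel coordinates [x1, y1, x2, y2] into the [0, 1] range."""
    x1, y1, x2, y2 = box
    return [x1 / width, y1 / height, x2 / width, y2 / height]

def denormalize_box(box, width, height):
    """Map normalized [0, 1] coordinates back to absolute pixels for evaluation."""
    x1, y1, x2, y2 = box
    return [x1 * width, y1 * height, x2 * width, y2 * height]
```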

4.2 Experimental Results

Open Vocabulary Object Detection

As shown in Tab. 2, we evaluate our method on open-vocabulary detection (OVD), reporting AP50 results on the 17 novel classes, 48 base classes, and all classes of the COCO OVD benchmark [4]. With our DetToolChain, the performance of both GPT-4V and Gemini is remarkably enhanced. GPT-4V + DetToolChain significantly outperforms the state-of-the-art method CORA [52] by 22.7, 4.1, and 9.0 AP50 on the Novel, Base, and All classes, respectively. Benefiting from general comprehension abilities, MLLMs can already recognize most objects in the image, including those belonging to novel classes; however, they fall short in predicting precise textual coordinates. When integrated with our DetToolChain, the MLLM is instructed to locate positions using the overlaid measurement standards. This enhancement allows the MLLM to recognize and accurately locate objects across novel classes, effectively transforming it into a super open-vocabulary detector.

To position our method relative to previous state-of-the-art object detectors, we further compare our zero-shot detection performance with models trained on COCO train2017 in Tab. 3. While the baseline AP of GPT-4V and Gemini is only 0.2 and 0.3, respectively, there is a significant improvement after employing our DetToolChain: Gemini's AP, AP50, and AP75 increase to 30.6, 58.0, and 27.9, and GPT-4V's AP, AP50, and AP75 increase to 34.5, 64.8, and 31.5. Notably, GPT-4V with DetToolChain surpasses Faster R-CNN on AP50 by 6.4, and even achieves better performance than recent detectors such as DETR-R50 [9] and VisionLLM-R50 [49], demonstrating its potential as a general object detector in the real world. The higher AP50 but lower AP75 compared with state-of-the-art detectors, e.g., Mask R-CNN and DETR, indicates that our method enables MLLMs to detect most objects in a zero-shot manner but with less accurate box boundaries, because accurate boundaries involve annotation priors of each dataset and can only be regressed after training on that training set.

Described Object Detection

Described Object Detection [54] is a new task that can be regarded as a superset of open-vocabulary detection and referring expression comprehension. The D-cube dataset [53,54] used in this task presents two primary challenges: (1) the presence of negative samples, where the object referred to by a sentence does not necessarily exist within the given image, which demands the model's capability to understand semantically incongruent concepts; and (2) exceedingly complex referring sentences, often containing negation, such as "the person riding a horse who is not wearing a hat", which evaluates the model's comprehension of complex and lengthy sentences. The experiment evaluates the model's ability across three categories: FULL, PRES, and ABS, which respectively denote full descriptions (422 categories), presence descriptions (316 categories), and absence descriptions (106 categories).

As shown in Tab. 4, with our DetToolChain method, both GPT-4V and Gemini significantly outperform existing approaches; e.g., GPT-4V + DetToolChain improves on the best prior method, FIBER-B [17], by 14.5, 16.0, and 10.2 AP in the FULL, PRES, and ABS settings, respectively. Such improvements come from two aspects. First, Gemini and GPT-4V have a stronger ability than other models, e.g., OFA [48] and OWL-ViT [32], to comprehend detection-related descriptions, especially negative concepts, and to discard unrelated concepts in both the PRES and ABS settings. Second, our proposed DetToolChain significantly unleashes the ability of MLLMs to detect the described objects through the prompts in the detection toolkit and the new detection Chain-of-Thought.

Referring Expression Comprehension

To demonstrate the effectiveness of our method on referring expression comprehension, we compare our approach with other zero-shot methods on the RefCOCO, RefCOCO+, and RefCOCOg datasets (Tab. 5). First, our DetToolChain improves the GPT-4V baseline by 44.53%, 46.11%, and 24.85% on val, test-A, and test-B, respectively, exhibiting the best zero-shot referring expression comprehension performance on RefCOCO. Since the referring sentences in RefCOCO contain a substantial number of directional words, such as "right" and "left", our designed visual processing prompts, e.g., the Ruler Marker, help MLLMs understand the spatial rationale more effectively and detect the target objects. Furthermore, while other zero-shot methods [20, 40] introduce an off-the-shelf object detector [65] to generate proposals, our DetToolChain instructs the MLLM to predict and refine the detection boxes with its inherent spatial and contextual abilities, which is simpler and more adaptable in real deployments. In addition, our method significantly narrows the gap between zero-shot methods and instruction-tuned MLLMs [3,11,12,27], indicating a promising direction for designing new multimodal prompts.

Oriented Object Detection

To investigate the efficacy of our DetToolChain in detecting rotated objects, we conduct experiments on the HRSC2016 test set. The images in the HRSC2016 dataset contain several ships of varying orientations. We employ mIoU_precision (mIoU_p) and mIoU_recall (mIoU_r) as performance metrics: mIoU_p reflects the precision of bounding-box coordinate prediction for the objects that have been identified, and mIoU_r evaluates recall, i.e., the model's ability to recognize all targeted objects. As shown in the last two columns of Tab. 6, our DetToolChain significantly improves the baseline results, by 0.46 mIoU_p and 0.47 mIoU_r on Gemini, and by 0.50 mIoU_p and 0.51 mIoU_r on GPT-4V. Furthermore, compared with the other combinations of detection prompts in Tab. 6, the Compass Marker brings the most substantial enhancement on oriented object detection, with gains of 0.34 mIoU_p and 0.25 mIoU_r over the GPT-4V baseline, which is consistent with our expectations. With a compass drawn on the query image by the Compass Marker, it is easier for MLLMs to read the angle out, which ultimately improves the accuracy of ship angle predictions.
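The excerpt does not spell out the exact definition of the two metrics; under one plausible reading (mean best IoU over predictions for the precision side, mean best IoU over ground-truth boxes for the recall side), and assuming a `rotated_iou(pred, gt)` helper for oriented boxes, a sketch could look like this:

```python
def miou_metrics(pred_boxes, gt_boxes, rotated_iou):
    """One plausible reading of mIoU_p / mIoU_r for oriented boxes (not the paper's code)."""
    # Precision side: how well each prediction overlaps its best-matching ground truth.
    per_pred = [max((rotated_iou(p, g) for g in gt_boxes), default=0.0) for p in pred_boxes]
    # Recall side: how well each ground-truth box is covered by its best prediction.
    per_gt = [max((rotated_iou(p, g) for p in pred_boxes), default=0.0) for g in gt_boxes]
    miou_p = sum(per_pred) / len(per_pred) if per_pred else 0.0
    miou_r = sum(per_gt) / len(per_gt) if per_gt else 0.0
    return miou_p, miou_r
```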

5. Conclusion

We have presented DetToolChain, a groundbreaking prompting paradigm that unleashes the potential of multimodal large language models (MLLMs), like GPT-4V and Gemini, as zero-shot detectors. By employing a multimodal detection Chain-of-Thought (Det-CoT) to manage the proposed visual processing prompts and detection reasoning prompts, MLLMs are instructed to perceive objects through regional focus, progressive refinement of predictions, and contextual inference. Our approach not only improves interpretability and precision in detection tasks across a variety of challenging scenarios, including occluded, small, and oriented objects, but also sets new records in open-vocabulary detection, described object detection, and referring expression comprehension without instruction tuning.

