
Testing SpatialLM for Spatial Semantic Recognition

article 2025/3/30 12:25:17

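I previously used sam-vit-base to assist detection, but that approach still requires finding prompt points, which most likely means locating them through visual recognition anyway. The consulting party asked me to consider whether any other approach is feasible. There is still some time before the bid, so I am using it to look into recent models from the vision community. While browsing the community I came across an introduction to SpatialLM.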

Introduction

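SpatialLM is a large language model focused on spatial understanding. It is well suited to processing 3D point cloud data and producing structured 3D scene understanding output. The semantic understanding part of the open-source models builds on Llama and Qwen.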

SpatialLM is a 3D large language model designed to process 3D point cloud data and generate structured 3D scene understanding outputs. These outputs include architectural elements like walls, doors, windows, and oriented object bounding boxes with their semantic categories. Unlike previous methods that require specialized equipment for data collection, SpatialLM can handle point clouds from diverse sources such as monocular video sequences, RGBD images, and LiDAR sensors. This multimodal architecture effectively bridges the gap between unstructured 3D geometric data and structured 3D representations, offering high-level semantic understanding. It enhances spatial reasoning capabilities for applications in embodied robotics, autonomous navigation, and other complex 3D scene analysis tasks.


SpatialLM Models

Model	Download
SpatialLM-Llama-1B	🤗 HuggingFace
SpatialLM-Qwen-0.5B	🤗 HuggingFace

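SpatialLM can process point clouds from monocular video sequences, RGBD images, and LiDAR sensors, which makes it a candidate for embodied robotics, autonomous navigation, and other complex 3D scene analysis tasks.

However, the benchmark results show that SpatialLM mainly targets indoor scenes and 3D spatial understanding. It recognizes the following object classes: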

Objects (F1 @.25 IoU, 3D)	SpatialLM-Llama-1B	SpatialLM-Qwen-0.5B
curtain	27.35	28.59
nightstand	57.47	54.39
chandelier	38.92	40.12
wardrobe	23.33	30.60
bed	95.24	93.75
sofa	65.50	66.15
chair	21.26	14.94
cabinet	8.47	8.44
dining table	54.26	56.10
plants	20.68	26.46
tv cabinet	33.33	10.26
coffee table	50.00	55.56
side table	7.60	2.17
air conditioner	20.00	13.04
dresser	46.67	23.53

Thin Objects (F1 @.25 IoU, 2D)	SpatialLM-Llama-1B	SpatialLM-Qwen-0.5B
painting	50.04	53.81
carpet	31.76	45.31
tv	67.31	52.29
door	50.35	42.15
window	45.4	45.9

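The model does not currently support checking directly whether the tarpaulin over a truck bed fully covers the load.
It can process monocular camera video, but its detection classes include neither trucks nor tarpaulins, so additional training would likely be needed. My inclination is that the model could be adapted to this task through fine-tuning, but that would take more development work.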

First, let's see how the demo performs.

Install dependencies

conda create -n spatiallm python=3.11
conda activate spatiallm
conda install -y nvidia/label/cuda-12.4.0::cuda-toolkit conda-forge::sparsehash

pip install poetry && poetry config virtualenvs.create false --local
poetry install

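torchsparse is installed from source. It is a library for sparse point cloud data, and SpatialLM likely relies on it to turn point cloud files (such as .ply) into input the model accepts. If the torchsparse version is incompatible with SpatialLM's implementation, data processing or tensor operations can fail.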

https://github.com/mit-han-lab/torchsparse/releases/tag/v2.0.0
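Because a version mismatch is one plausible source of trouble, it can help to record what is actually installed before running inference. The following is a minimal sketch using only the standard library; the helper name `installed_versions` and the list of distribution names are assumptions for illustration, and a source build may register under a different name than expected.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distribution names are assumptions; compare against `pip list` if unsure.
    report = installed_versions(["torch", "torchsparse", "transformers"])
    for name, version in report.items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Keeping this output next to the error log makes it easier to tell later whether a failure came from the environment or from the model code.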

Run inference on the scene point cloud pcd/scene0000_00.ply

python inference.py --point_cloud pcd/scene0000_00.ply --output scene0000_00.txt --model_path ./spatiallm-llama-1B/

The run fails with this error:

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Exception in thread Thread-2 (generate):
Traceback (most recent call last):
File "/opt/miniconda3/envs/spatiallm/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
self.run()
File "/opt/miniconda3/envs/spatiallm/lib/python3.11/threading.py", line 982, in run
self._target(*self._args, **self._kwargs)
File "/opt/miniconda3/envs/spatiallm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/envs/spatiallm/lib/python3.11/site-packages/transformers/generation/utils.py", line 2215, in generate
result = self._sample(
^^^^^^^^^^^^^
File "/opt/miniconda3/envs/spatiallm/lib/python3.11/site-packages/transformers/generation/utils.py", line 3206, in _sample
outputs = self(**model_inputs, return_dict=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/envs/spatiallm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/envs/spatiallm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/chenrui/SpatialLM/spatiallm/model/spatiallm_qwen.py", line 208, in forward
cur_new_input_embeds = torch.cat(
^^^^^^^^^^
RuntimeError: Tensors must have same number of dimensions: got 2 and 3

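The traceback points at a torch.cat call on line 208. My guess was that cur_point_features (the point cloud features from torchsparse) and cur_input_embeds (the language model's text embeddings) have mismatched dimensions. To check, I added print statements just before line 208 of spatiallm_qwen.py: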

print("cur_input_embeds.shape:", cur_input_embeds.shape)
print("cur_point_features.shape:", cur_point_features.shape)

Printed shapes:

cur_input_embeds.shape: torch.Size([221, 896])

cur_point_features.shape: torch.Size([65, 20, 896])

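cur_input_embeds.shape: torch.Size([221, 896])
- A 2D tensor: the sequence length is 221 and each token has an embedding dimension of 896.
cur_point_features.shape: torch.Size([65, 20, 896])
- A 3D tensor: 65 points (possibly the number of points in the point cloud), each with 20 features (possibly a spatial or temporal axis), and a feature dimension of 896.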

This explains the earlier RuntimeError: Tensors must have same number of dimensions: got 2 and 3. torch.cat tries to concatenate along dim=0, but cur_input_embeds is 2D ([221, 896]) while cur_point_features is 3D ([65, 20, 896]). The number of dimensions differs, so the concatenation fails.

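Problem analysis

Dimension mismatch:
- torch.cat requires all input tensors to have the same number of dimensions, and the same size in every dimension except the concatenation dimension (dim=0 here).
- cur_input_embeds[: point_start_token_pos + 1] and cur_input_embeds[point_end_token_pos:] are slices of cur_input_embeds, so they are still 2D, with shapes [X, 896] and [Y, 896] (the exact lengths depend on point_start_token_pos and point_end_token_pos).
- cur_point_features is 3D with shape [65, 20, 896]. It has an extra dimension (20) and cannot be concatenated directly with 2D tensors.
SpatialLM's design intent:
- Judging from its shape, cur_point_features is probably the feature representation of the point cloud after torchsparse processing. 65 may be the number of points, 20 may be the number of feature vectors per point (for example spatial coordinates, color, or some multi-channel feature), and 896 is the embedding dimension shared with cur_input_embeds.
- The code presumably aims to insert the point cloud features into the text embedding sequence, but the current implementation does not handle the multi-dimensional structure of cur_point_features correctly.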

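Solution

To resolve the mismatch, cur_point_features has to be reshaped to be compatible with cur_input_embeds, that is, converted from [65, 20, 896] to a 2D shape [N, 896] before concatenation. There are several possible ways to do this, and the right one depends on SpatialLM's design intent.

Flatten cur_point_features to 2D

If the [65, 20] in cur_point_features means 65 points with 20 features each, and all of those features should be inserted into the sequence in order, it can be flattened to [65 * 20, 896]: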

# Flatten dims 0 and 1; reshape also works if the tensor is not contiguous
cur_point_features = cur_point_features.reshape(-1, cur_point_features.shape[-1])  # [65 * 20, 896] = [1300, 896]
cur_new_input_embeds = torch.cat(
    (
        cur_input_embeds[: point_start_token_pos + 1],
        cur_point_features,
        cur_input_embeds[point_end_token_pos:],
    ),
    dim=0,
)

Running inference again succeeds.
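The shape arithmetic behind this workaround can be checked in isolation. The sketch below uses dummy tensors with the shapes printed earlier; the values of `point_start_token_pos` and `point_end_token_pos` are hypothetical and chosen only so the slices are non-empty. It reproduces the failure and then applies the flattening. It says nothing about whether flattening matches what the point cloud encoder was meant to produce.

```python
import torch

# Dummy tensors with the shapes printed above; the values are irrelevant here.
cur_input_embeds = torch.zeros(221, 896)
cur_point_features = torch.zeros(65, 20, 896)

# Hypothetical marker positions, for illustration only.
point_start_token_pos = 10
point_end_token_pos = 12

prefix = cur_input_embeds[: point_start_token_pos + 1]  # [11, 896]
suffix = cur_input_embeds[point_end_token_pos:]         # [209, 896]

# Reproduce the failure: 2D and 3D tensors cannot be concatenated.
try:
    torch.cat((prefix, cur_point_features, suffix), dim=0)
    cat_failed = False
except RuntimeError as err:
    cat_failed = True
    print("torch.cat failed:", err)

# Workaround: merge the first two axes so every row is one 896-dim vector.
flat_features = cur_point_features.reshape(-1, cur_point_features.shape[-1])
merged = torch.cat((prefix, flat_features, suffix), dim=0)
print(merged.shape)  # 11 + 1300 + 209 = 1520 rows
```

Note that flattening lengthens the sequence by 1300 positions. If the extra axis is really a symptom of an incompatible torchsparse build, the code will run but the model may receive features it was never trained on.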

Next, run:

# Convert the predicted layout to Rerun format
python visualize.py \
--point_cloud pcd/scene0000_00.ply \
--layout scene0000_00.txt \
--save scene0000_00.rrd

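What does an .rrd file record?

An .rrd file is a log file of multimodal data recorded through the Rerun SDK. In essence it is a sequence of log messages appended in time order and encoded with Apache Arrow for efficient storage and transfer. What it contains depends on what was logged, and may include:
- Metadata: timestamps, entity paths, component information.
- Time series: sensor readings, scalar values.
- Image data: RGB images, depth maps.
- Spatial data: 3D point clouds, meshes, transforms.

Running rerun scene0000_00.rrd fails: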

Error: winit EventLoopError: os error at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/winit-0.30.7/src/platform_impl/linux/mod.rs:764: neither WAYLAND_DISPLAY nor WAYLAND_SOCKET nor DISPLAY is set.

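This error means that Rerun (built on the Rust `winit` library) tried to create a graphical window (the Rerun Viewer) to display the data in `scene0000_00.rrd`, but could not find a display server. On Linux, graphical applications usually run on X11 (through the `DISPLAY` environment variable) or Wayland (through `WAYLAND_DISPLAY` or `WAYLAND_SOCKET`). The error says none of these variables is set, so no usable graphical environment was detected.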
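A quick way to confirm this diagnosis before launching the viewer is to inspect the same three environment variables the error message names. This is a minimal sketch with a hypothetical helper, `find_display`; it only reports whether a variable is set and does not verify that the display server is actually reachable.

```python
import os


def find_display(env=None):
    """Return the name of the first display variable that is set, or None."""
    env = os.environ if env is None else env
    for var in ("WAYLAND_DISPLAY", "WAYLAND_SOCKET", "DISPLAY"):
        if env.get(var):
            return var
    return None


if __name__ == "__main__":
    found = find_display()
    if found:
        print(f"{found} is set; a native viewer window should be able to open.")
    else:
        print("No display variable is set; open the .rrd file on a machine "
              "with a desktop session instead.")
```

On a headless server such as the one used here, the simplest route is usually to copy the .rrd file to a workstation and open it there.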

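The original test file, pcd/scene0000_00.ply

File format: .ply stands for Polygon File Format, also known as the Stanford Triangle Format, a common 3D data format originally developed at Stanford University. A .ply file usually stores 3D geometry, mainly:

Vertices: point coordinates (x, y, z) in 3D space.
Faces: polygons (usually triangles) formed by connecting vertices.
Additional properties: may include color (RGB), normals, texture coordinates, and so on.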
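To see which of these elements and properties a particular file actually carries, it is enough to read the text header, which is present in both ASCII and binary PLY files. The sketch below uses only the standard library; `read_ply_header` is a hypothetical helper, not part of SpatialLM.

```python
def read_ply_header(path):
    """Parse a PLY header and return (format, {element: (count, [properties])})."""
    fmt = None
    elements = {}
    current = None
    with open(path, "rb") as f:
        if f.readline().strip() != b"ply":
            raise ValueError("not a PLY file")
        for raw in f:
            parts = raw.decode("ascii", errors="replace").split()
            if not parts:
                continue
            if parts[0] == "end_header":
                break
            if parts[0] == "format":
                fmt = parts[1]
            elif parts[0] == "element":
                current = parts[1]
                elements[current] = (int(parts[2]), [])
            elif parts[0] == "property" and current is not None:
                # The property name is always the last token on the line.
                elements[current][1].append(parts[-1])
    return fmt, elements


if __name__ == "__main__":
    fmt, elements = read_ply_header("pcd/scene0000_00.ply")
    print("format:", fmt)
    for name, (count, props) in elements.items():
        print(f"{name}: {count} entries, properties {props}")
```

Comparing the header of the sample scene with the header of a converted file shows at a glance whether color or normal properties were lost along the way.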

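scene0000_00.ply, the input to SpatialLM, is a point cloud representing the 3D structure of a scene, used for subsequent feature extraction or generation. I previously used PE3R to generate a .glb file from multiple photos; see this post:

A Brief Analysis of PE3R: Perception-Efficient 3D Reconstruction (CSDN blog). Recent advances in 2D-to-3D perception have significantly improved the understanding of 3D scenes from 2D images. However, existing methods face several key challenges, including limited generalization across scenes, suboptimal perception accuracy, and slow reconstruction. To overcome these limitations, the authors propose Perception-Efficient 3D Reconstruction (PE3R), a framework intended to improve both accuracy and efficiency. PE3R uses a feed-forward architecture to achieve fast 3D semantic field reconstruction. The framework shows strong zero-shot generalization across diverse scenes and objects and significantly improves reconstruction speed. Extensive experiments on 2D-to-3D open-vocabulary segmentation and 3D reconstruction validate its effectiveness and versatility. The code is open source. https://blog.csdn.net/u011564831/article/details/146312329?spm=1011.2415.3001.5331

The next step is to convert the .glb file to a .ply file, which is a common 3D format conversion task. .glb is the binary version of glTF (GL Transmission Format) and is typically used to store 3D models (vertices, faces, textures, and so on), while .ply is a simpler format that mainly stores point cloud or mesh data.

Reading the .glb

Converting GLB to PLY in Blender preserves as much information as possible (vertices, faces, vertex colors, normals, UV coordinates, and so on), because Blender is a full 3D modeling tool that handles complex 3D data far better than ordinary Python libraries such as trimesh. However, the PLY format itself has limitations (for example, no support for complex materials, animation, or multiple textures), so "preserving all information" is in practice bounded by what PLY can represent.

Here I use pygltflib to export the .glb to .ply (vertex positions only):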

from pygltflib import GLTF2
import numpy as np
import base64


def get_buffer_data(buffer_view, gltf):
    """Return the bytes covered by buffer_view as a uint8 array."""
    buffer = gltf.buffers[buffer_view.buffer]
    if buffer is None:
        raise ValueError("Buffer is missing.")

    # Load the whole buffer first, wherever it is stored
    if buffer.uri is None:
        # Embedded GLB: read from the binary chunk
        raw = gltf.binary_blob()
    elif buffer.uri.endswith('.bin'):
        with open(buffer.uri, 'rb') as f:
            raw = f.read()
    elif buffer.uri.startswith('data:'):
        _, b64_data = buffer.uri.split(',', 1)
        raw = base64.b64decode(b64_data)
    else:
        raise ValueError(f"Unsupported buffer URI: {buffer.uri}")

    # Slice out the bufferView, so the returned data always starts at its offset
    start = buffer_view.byteOffset or 0
    data = raw[start:start + buffer_view.byteLength]

    print(f"Buffer view byteOffset: {buffer_view.byteOffset}, byteLength: {buffer_view.byteLength}")
    print(f"Extracted data size: {len(data)} bytes")
    return np.frombuffer(data, dtype=np.uint8)


# Load the GLB file
gltf = GLTF2().load("scene.glb")

# Collected point arrays, one per primitive
all_points = []

# Walk every mesh and primitive
for mesh in gltf.meshes:
    for primitive in mesh.primitives:
        # Extract vertex positions
        position_attr = primitive.attributes.POSITION
        if position_attr is not None:
            position_accessor = gltf.accessors[position_attr]
            position_buffer_view = gltf.bufferViews[position_accessor.bufferView]

            print(f"\nProcessing mesh: {mesh.name}, primitive: {primitive}")
            print(
                f"Position accessor: count={position_accessor.count}, type={position_accessor.type}, componentType={position_accessor.componentType}")
            print(f"Position buffer view: {position_buffer_view}")

            if position_buffer_view is None:
                raise ValueError("Position buffer view is missing.")

            # The accessor offset is relative to the bufferView. The bufferView
            # offset is already applied in get_buffer_data, so it is not added again.
            byte_offset = position_accessor.byteOffset or 0
            count = position_accessor.count

            # Get the bufferView bytes
            buffer_data = get_buffer_data(position_buffer_view, gltf)

            # Expected size, assuming tightly packed float32 VEC3 (no byteStride)
            expected_bytes = count * 12  # 3 float32 values per point = 12 bytes
            print(f"Expected bytes: {expected_bytes}, Actual buffer size: {len(buffer_data)}")

            # Make sure there is enough data
            if len(buffer_data) < byte_offset + expected_bytes:
                print("Warning: not enough buffer data, skipping this primitive")
                continue

            # Extract the position bytes
            points_data = buffer_data[byte_offset:byte_offset + expected_bytes]
            print(f"Extracted points data size: {len(points_data)} bytes")

            # Reinterpret as float32 and reshape to (count, 3)
            points = points_data.view(dtype=np.float32).reshape((count, 3))
            all_points.append(points)

# Write all points to an ASCII PLY file
if all_points:
    with open("scene.ply", "w") as ply_file:
        ply_file.write("ply\n")
        ply_file.write("format ascii 1.0\n")
        total_points = sum(len(p) for p in all_points)
        ply_file.write(f"element vertex {total_points}\n")
        ply_file.write("property float x\n")
        ply_file.write("property float y\n")
        ply_file.write("property float z\n")
        ply_file.write("end_header\n")

        # Write one point per line
        for points in all_points:
            for point in points:
                ply_file.write(f"{point[0]} {point[1]} {point[2]}\n")
    print(f"PLY file written with {total_points} points")
else:
    print("No point data found")

Use spatiallm-qwen-0.5B to run spatial semantic recognition on the converted .ply

python inference.py \
   --point_cloud pcd/scene.ply \
   --output pcd/scene.txt \
   --model_path ./spatiallm-qwen-0.5B/

(spatiallm) [root@node126 pcd]# cat scene.txt
wall_0=Wall(0.04201483726501465,-0.04488217458128929,0.045627253130078316,1.2920148372650146,-0.04488217458128929,0.045627253130078316,2.2399999999999998,0.0)
wall_1=Wall(0.04201483726501465,-0.04488217458128929,0.045627253130078316,0.04201483726501465,1.4551178254187107,0.045627253130078316,2.2399999999999998,0.0)
wall_2=Wall(1.2920148372650146,-0.04488217458128929,0.045627253130078316,1.2920148372650146,1.4551178254187107,0.045627253130078316,2.2399999999999998,0.0)
wall_3=Wall(0.04201483726501465,1.4551178254187107,0.045627253130078316,1.2920148372650146,1.4551178254187107,0.045627253130078316,2.2399999999999998,0.0)
door_0=Door(wall_3,0.8920148372650146,1.4551178254187107,0.8956272531300783,0.68,1.64)
window_0=Window(wall_0,0.8920148372650146,-0.04488217458128929,1.2956272531300783,0.76,1.1199999999999999)
bbox_0=Bbox(toilet,0.29201483726501465,1.0551178254187108,0.24562725313007833,-1.5708000000000002,0.28125,0.5,0.34375)
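The layout file is plain text with one `name=Type(args)` entry per line, so it is easy to load for a quick sanity check. The sketch below is a generic parser written for this post, not SpatialLM's own reader. The helper names are hypothetical, and `wall_length` assumes that the first six numbers of a Wall entry are two 3D endpoints, which is an interpretation of the output above rather than something the post confirms.

```python
import math
import re

LINE_RE = re.compile(r"^(\w+)=(\w+)\((.*)\)$")


def parse_layout(text):
    """Parse 'name=Type(arg, ...)' lines into (name, type, args) tuples."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        name, kind, arg_text = match.groups()
        args = []
        for token in arg_text.split(","):
            token = token.strip()
            try:
                args.append(float(token))
            except ValueError:
                args.append(token)  # e.g. a wall reference or a class label
        entries.append((name, kind, args))
    return entries


def wall_length(args):
    """Assumption: args[0:6] are two endpoints (ax, ay, az, bx, by, bz)."""
    return math.dist(args[0:3], args[3:6])


if __name__ == "__main__":
    with open("pcd/scene.txt") as f:
        for name, kind, args in parse_layout(f.read()):
            if kind == "Wall":
                print(f"{name}: length {wall_length(args):.2f}")
            else:
                print(f"{name}: {kind} {args}")
```

Under that endpoint assumption, the four walls in the output above come out 1.25, 1.5, 1.5, and 1.25 units long. That gives a concrete number to compare against the real scene when judging the prediction.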

I had not planned to post the .rrd visualization at all. The result is poor and falls far short of what was advertised.