
[CVPR 2022] Cross-view Transformers for real-time Map-view Semantic Segmentation

Paper link: Cross-View Transformers for Real-Time Map-View Semantic Segmentation

Code: cross_view_transformers/cross_view_transformer at master · bradyz/cross_view_transformers · GitHub

The English here is typed entirely by hand, summarizing and paraphrasing the original paper. Unavoidable spelling and grammar mistakes may appear; if you spot any, feel free to point them out in the comments. This post is more of a personal note, so read with caution.

Contents

1. Takeaways

2. Section-by-Section Reading of the Paper

2.1. Abstract

2.2. Introduction

2.3. Related Works

2.4. Cross-view transformers

2.4.1. Cross-view attention

2.4.2. A cross-view transformer architecture

2.5. Implementation Details

2.6. Results

2.6.1. Comparison to prior work

2.6.2. Ablations of cross-view attention

2.6.3. Camera-aware positional embeddings

2.6.4. Accuracy vs distance

2.6.5. Robustness to sensor dropout

2.6.6. Qualitative Results

2.6.7. Geometric reasoning in cross-view attention

2.7. Conclusion

3. Reference


1. Takeaways

(1) Although one needs to challenge oneself, one cannot always be challenging oneself

2. Section-by-Section Reading of the Paper

2.1. Abstract

        ①They perform map-view semantic segmentation with a camera-aware cross-view attention mechanism

2.2. Introduction

        ①Depth projection and measurement are a bottleneck of existing approaches

        ②Their cross-view transformer does not explicitly reason about geometry; instead it "learns to map between views through a geometry-aware positional embedding"

        ③Schematic of the cross-view transformer:

2.3. Related Works

(1)Monocular 3D object detection

        ①Lists monocular detection methods, which reduce 3D object detection to 2D detection plus depth prediction

monocular  adj. one-eyed, single-lens;  n. a monocular (single-tube telescope)

(2)Depth estimation

        ①Depth estimation methods rely on the camera model and accurate calibration

(3)Semantic mapping in the map-view

        ①This line of work separates inputs and outputs into two coordinate frames: inputs lie in calibrated camera views, while outputs are rasterized onto a map

        ②The authors reckon that implicit geometric reasoning performs as well as explicit geometric models

rasterized  v. converted into a raster (pixel grid)

2.4. Cross-view transformers

        ①Inputs: n monocular views (I_k,K_k,R_k,t_k)_{k=1}^n, where I_k\in\mathbb{R}^{H\times W\times3} denotes the input image, K_k\in\mathbb{R}^{3\times3} the camera intrinsics, R_k\in\mathbb{R}^{3\times3} the extrinsic rotation, and t_k\in\mathbb{R}^3 the translation relative to the center of the ego-vehicle

        ②They aim to predict a binary semantic segmentation mask y\in\left\{0,1\right\}^{h\times w\times C}

        ③Pipeline of cross-view transformer:

where the positional embedding shares the same encoder as the image
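
A minimal sketch (shapes only, names illustrative, not the authors' code) of the inputs and output defined in ① and ② above:

```python
# Sketch of the input/output interface of the cross-view transformer.
import torch

n = 6                       # number of camera views
H, W = 224, 448             # input image size (see Implementation Details)
h, w, C = 200, 200, 2       # map-view grid and number of semantic classes (C assumed)

images = torch.rand(n, 3, H, W)              # I_k
intrinsics = torch.eye(3).repeat(n, 1, 1)    # K_k
rotations = torch.eye(3).repeat(n, 1, 1)     # R_k, extrinsic rotation
translations = torch.zeros(n, 3)             # t_k, relative to the ego-vehicle center

# The model maps the calibrated views to a binary map-view mask y ∈ {0,1}^{h×w×C}.
y = torch.zeros(h, w, C)
```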

2.4.1. Cross-view attention

        ①For any world coordinate x^{(W)}\in\mathbb{R}^3, the perspective transformation converts it to the corresponding image coordinate x^{(I)}\in\mathbb{R}^3 by:

x^{(I)}\simeq K_kR_k(x^{(W)}-t_k)

where \simeq denotes equality up to a scale factor and x^{(I)}=(\cdot,\cdot,1) is expressed in homogeneous coordinates

        ②Recast the geometric relationship between world and image coordinates as a cosine similarity:

\begin{aligned} sim_k\left(x^{(I)},x^{(W)}\right)=\frac{\left(R_k^{-1}K_k^{-1}x^{(I)}\right)\cdot\left(x^{(W)}-t_k\right)}{\left\|R_k^{-1}K_k^{-1}x^{(I)}\right\|\left\|x^{(W)}-t_k\right\|} \end{aligned}
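
A minimal sketch, not the authors' code, of the two formulas above: unproject an image coordinate into a world-space ray and compare it with the direction from the camera origin to a world point via cosine similarity.

```python
import torch
import torch.nn.functional as F

def geometric_similarity(x_img, x_world, K, R, t):
    """x_img: (3,) homogeneous pixel (u, v, 1); x_world: (3,) world point;
    K, R: (3, 3) intrinsics / extrinsic rotation; t: (3,) camera origin."""
    direction = torch.linalg.inv(R) @ torch.linalg.inv(K) @ x_img   # R_k^-1 K_k^-1 x^(I)
    offset = x_world - t                                            # x^(W) - t_k
    return F.cosine_similarity(direction, offset, dim=0)            # sim_k(x^(I), x^(W))

# Usage with dummy calibration: a ray through the image center and a point 5 m ahead.
K, R, t = torch.eye(3), torch.eye(3), torch.zeros(3)
print(geometric_similarity(torch.tensor([0., 0., 1.]), torch.tensor([0., 0., 5.]), K, R, t))
```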

(1)Camera-aware positional encoding

        ①Each unprojected image coordinate d_{k,i}=R_k^{-1}K_k^{-1}x_i^{(I)} is a direction vector from the origin t_k of camera k toward the image plane at depth 1

        ②An MLP encodes each direction vector into a D-dimensional embedding \delta_{k,i}\in\mathbb{R}^{D}

        ③Learned positional encoding: c^{(0)}\in\mathbb{R}^{w\times h\times D}

        ④Each cross-view attention block refines this embedding, producing updated map-view embeddings c^{(1)},c^{(2)},\ldots

(2)Map-view latent embedding

        ①Combining the two positional embeddings into one similarity measure:

sim(\delta_{k,i},\phi_{k,i},c_j^{(n)},\tau_k)=\frac{(\delta_{k,i}+\phi_{k,i})\cdot\left(c_j^{(n)}-\tau_k\right)}{\|\delta_{k,i}+\phi_{k,i}\|\|c_j^{(n)}-\tau_k\|}
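
A hedged, single-head sketch of how the combined similarity above could drive attention: image features φ plus the camera-aware direction embedding δ on the key side, and the map-view query c minus a camera embedding τ on the query side. The embedding dimension D = 128 and the module names are assumptions, not taken from the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 128                                                    # embedding dimension (assumed)
direction_mlp = nn.Sequential(nn.Linear(3, D), nn.ReLU(), nn.Linear(D, D))

def cross_view_weights(phi, directions, c, tau):
    """phi: (N, D) image features, directions: (N, 3) unprojected rays d_{k,i},
    c: (M, D) map-view queries, tau: (D,) camera embedding. Returns (M, N) attention."""
    delta = direction_mlp(directions)          # δ_{k,i}, camera-aware positional embedding
    keys = F.normalize(phi + delta, dim=-1)    # (δ + φ) / ||δ + φ||
    queries = F.normalize(c - tau, dim=-1)     # (c - τ) / ||c - τ||
    return (queries @ keys.T).softmax(dim=-1)  # cosine similarities -> attention weights

# Usage: 100 image tokens attended by a 25×25 map-view grid.
phi, rays = torch.rand(100, D), torch.rand(100, 3)
c, tau = torch.rand(25 * 25, D), torch.rand(D)
print(cross_view_weights(phi, rays, c, tau).shape)         # torch.Size([625, 100])
```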

(3)Cross-view attention

        ①Image encoder: a pretrained and fine-tuned EfficientNet-B4

2.4.2. A cross-view transformer architecture

        ①Patch embeddings from the encoder: \{\phi_1^1,\phi_1^2,\ldots,\phi_n^R\}, where R is the number of resolutions
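
A hedged sketch, assuming the timm library, of extracting multi-resolution features from an EfficientNet-B4 backbone as described in (3)① and here; the exact feature indices and wrappers in the authors' repo may differ.

```python
import timm
import torch

# features_only returns intermediate feature maps; indices (2, 3) pick two scales
# (roughly 8x and 16x downsampling). Set pretrained=True to load ImageNet weights
# as in the paper; False here only to keep the sketch runnable offline.
backbone = timm.create_model('efficientnet_b4', pretrained=False,
                             features_only=True, out_indices=(2, 3))

images = torch.rand(6, 3, 224, 448)        # n = 6 camera views
features = backbone(images)                # list of R = 2 feature maps
for f in features:
    print(f.shape)                         # per-view patch embeddings at each resolution
```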

2.5. Implementation Details

(1)Architecture

        ①Two image feature scales: (28, 60) and (14, 30), corresponding to 8× and 16× downscaling

        ②The grid size w=h=25

        ③Attention heads: 4, each with embedding size d_{head}=64

        ④Decoder: three (bilinear upsample + conv) layers (see the sketch after this list)

        ⑤Output resolution: 200 \times 200
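
A minimal sketch of such a decoder (layer widths are assumed), lifting the 25 × 25 map-view grid to the 200 × 200 output:

```python
import torch
import torch.nn as nn

def up_block(c_in, c_out):
    # one (bilinear upsample + conv) stage; BatchNorm/ReLU are assumptions
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

decoder = nn.Sequential(up_block(128, 128), up_block(128, 64), up_block(64, 32))
head = nn.Conv2d(32, 2, kernel_size=1)     # per-class logits (channel counts assumed)

x = torch.rand(1, 128, 25, 25)             # latent map-view grid (w = h = 25)
print(head(decoder(x)).shape)              # torch.Size([1, 2, 200, 200])
```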

(2)Training

        ①Loss: focal loss

        ②Batch size: 4 per GPU

        ③Epochs: 30

        ④Optimizer: AdamW with initial learning rate 1e-2 and weight decay 1e-7
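
A hedged sketch pulling the training settings above together (focal loss, batch size 4 per GPU, AdamW with lr = 1e-2 and weight decay = 1e-7); the focal-loss hyper-parameters and the stand-in model are assumptions, not taken from the paper.

```python
import torch
from torchvision.ops import sigmoid_focal_loss

model = torch.nn.Conv2d(3, 2, kernel_size=1)   # stand-in for the cross-view transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2, weight_decay=1e-7)

logits = model(torch.rand(4, 3, 200, 200))             # batch size 4 per GPU
targets = torch.randint(0, 2, logits.shape).float()    # binary map-view labels
loss = sigmoid_focal_loss(logits, targets, reduction='mean')

optimizer.zero_grad()
loss.backward()
optimizer.step()
```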

2.6. Results

(1)Dataset

        ①Dataset: nuScenes

        ②Scenes: 1k, covering different weather, time-of-day, and traffic conditions

        ③Frames: each nuScenes scene has 40 frames over 20 seconds; the Argoverse dataset provides 10k frames

        ④Samples: 40k

        ⑤Views: 360° coverage from 6 camera views

        ⑥Each camera view has calibrated intrinsics K and extrinsics (R, t) at every timestep

        ⑦Default resize resolution: 224 × 448

        ⑧3D bounding boxes are annotated from LiDAR data

        ⑨Ground-truth labels are generated by projecting the 3D box annotations onto the ground plane at a resolution of (200, 200), as sketched below
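
A hedged sketch, not the official tooling, of how a 3D box footprint could be rasterized onto a (200, 200) BEV grid as in ⑨; the map extent and the coordinate mapping are assumptions.

```python
import numpy as np
import cv2

def rasterize_box(corners_xy, label, extent=50.0, size=200):
    """corners_xy: (4, 2) box footprint in meters around the ego-vehicle,
    assumed to lie in a [-extent, extent] square mapped onto a size×size grid."""
    pixels = ((corners_xy + extent) / (2 * extent) * size).astype(np.int32)
    cv2.fillPoly(label, [pixels], 1)        # mark the cells covered by the footprint
    return label

label = np.zeros((200, 200), dtype=np.uint8)
box = np.array([[2.0, 10.0], [4.0, 10.0], [4.0, 14.5], [2.0, 14.5]])  # a car footprint
label = rasterize_box(box, label)
print(label.sum())                          # number of occupied BEV cells
```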

(2)Evaluation

        ①Setting 1 uses a 100m×50m area around the vehicle and samples a map at a 25cm resolution

        ②Setting 2 uses a 100m×100m area around the vehicle, with a 50cm sampling resolution
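
As a quick sanity check on the grid sizes these settings imply: in setting 2, 100 m / 0.50 m = 200 cells per side, i.e. exactly the 200 × 200 output grid from Section 2.5; in setting 1, 100 m / 0.25 m = 400 and 50 m / 0.25 m = 200, giving a 400 × 200 map.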

2.6.1. Comparison to prior work

        ①They compared their model with single-timestep models:

        ②Comparison on different datasets, where the top rows are on nuScenes and the bottom rows are on Argoverse:

2.6.2. Ablations of cross-view attention

        ①Ablations of cross-view attention mechanism:

2.6.3. Camera-aware positional embeddings

        ①Ablations of the camera-aware positional embeddings:

2.6.4. Accuracy vs distance

        ①Performance as distance increases:

        ②Qualitative results on scenes with varying degrees of occlusion:

where the overhead view on the left is the prediction and the one on the right is the ground truth

2.6.5. Robustness to sensor dropout

        ①How the number of cameras affects performance:

2.6.6. Qualitative Results

        ①This refers to the large figure above containing the real-scene images; the authors note that distant or occluded vehicles are not perceived well

2.6.7. Geometric reasoning in cross-view attention

        ①Visualization of cross-view attention:

2.7. Conclusion

        ~

3. Reference

Zhou, B. & Krähenbühl, P. (2022) 'Cross-view Transformers for real-time Map-view Semantic Segmentation', CVPR.

