[CVPR 2022] Cross-view Transformers for real-time Map-view Semantic Segmentation
Paper: Cross-View Transformers for Real-Time Map-View Semantic Segmentation
Code: cross_view_transformers/cross_view_transformer at master · bradyz/cross_view_transformers · GitHub
The English below is typed entirely by hand, summarizing and paraphrasing the original paper. Some spelling and grammatical errors are hard to avoid; if you spot any, feel free to point them out in the comments. This post is written as personal notes, so read it with caution.
Table of Contents
1. Takeaways
2. Section-by-Section Reading
2.1. Abstract
2.2. Introduction
2.3. Related Works
2.4. Cross-view transformers
2.4.1. Cross-view attention
2.4.2. A cross-view transformer architecture
2.5. Implementation Details
2.6. Results
2.6.1. Comparison to prior work
2.6.2. Ablations of cross-view attention
2.6.3. Camera-aware positional embeddings
2.6.4. Accuracy vs distance
2.6.5. Robustness to sensor dropout
2.6.6. Qualitative Results
2.6.7. Geometric reasoning in cross-view attention
2.7. Conclusion
3. Reference
1. Takeaways
(1) Although people need to challenge themselves, they should not always be challenging themselves
2. Section-by-Section Reading
2.1. Abstract
①They perform map-view semantic segmentation with a camera-aware cross-view attention mechanism
2.2. Introduction
①Depth projection and measurement are a bottleneck
②Their cross-view transformer never explicitly reasons about geometry; instead it "learns to map between views through a geometry-aware positional embedding"
③Schematic of cross view Transformer:
2.3. Related Works
(1)Monocular 3D object detection
①Lists monocular detection methods, which convert 3D objects to 2D detections and predict their depth
monocular (adj.): using a single eye or lens; (n.) a single-lens telescope
(2)Depth estimation
①Depth estimation methods rely on the camera model and accurate calibration
(3)Semantic mapping in the map-view
①These methods separate inputs and outputs into two coordinate frames: inputs lie in calibrated camera views, while outputs are rasterized onto a map
②The authors reckon that implicit geometric reasoning performs as well as explicit geometric models
rasterized (v.): converted into a regular grid of cells
2.4. Cross-view transformers
①Monocular views: $(I_k, K_k, R_k, t_k)$ with $k = 1, \ldots, n$, where $I_k$ denotes the input image, $K_k$ the camera intrinsics, $R_k$ the extrinsic rotation, and $t_k$ the translation relative to the center of the ego-vehicle
②They aim to predict a binary semantic segmentation mask in the map view
③Pipeline of cross-view transformer:
where the positional embedding shares the same encoder as the image
2.4.1. Cross-view attention
①For any world coordinate $x^{(W)}$, a perspective transformation converts it to the corresponding image coordinate $x^{(I)}$ by:
$$x^{(I)} \simeq K_k R_k \left( x^{(W)} - t_k \right)$$
where $\simeq$ denotes equality up to a scale factor and the coordinates adopt homogeneous coordinates
②Rewrites the geometric relationship between world and image coordinates as a cosine similarity between the unprojected image ray and the camera-to-point vector:
$$\frac{\left( R_k^{-1} K_k^{-1} x^{(I)} \right) \cdot \left( x^{(W)} - t_k \right)}{\left\| R_k^{-1} K_k^{-1} x^{(I)} \right\| \, \left\| x^{(W)} - t_k \right\|}$$
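As a sanity check on the two relations above, here is a minimal NumPy sketch (all calibration values are illustrative, not taken from the paper or its code): it projects a world point with $x^{(I)} \simeq K_k R_k (x^{(W)} - t_k)$, unprojects the pixel into the direction $R_k^{-1} K_k^{-1} x^{(I)}$, and verifies that the cosine similarity with $x^{(W)} - t_k$ is 1.

```python
import numpy as np

# Hypothetical calibration for one camera; values are illustrative, not from the paper.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])            # intrinsics
R = np.array([[0.0, -1.0,  0.0],                 # extrinsic rotation: ego frame -> camera frame
              [0.0,  0.0, -1.0],                 # (camera looks along the ego +x axis)
              [1.0,  0.0,  0.0]])
t = np.array([1.5, 0.0, 1.6])                    # camera position relative to the ego centre (m)

x_world = np.array([11.5, 0.5, 0.6])             # a 3D point in the ego/world frame

# Perspective projection x_I ~ K R (x_W - t), defined up to a scale factor.
x_img_h = K @ R @ (x_world - t)                  # homogeneous image coordinate
x_img = x_img_h / x_img_h[2]                     # pixel coordinate (u, v, 1)

# Unproject the pixel into a depth-free direction vector d = R^{-1} K^{-1} x_I.
d = np.linalg.inv(R) @ np.linalg.inv(K) @ x_img

# Cosine similarity between the viewing ray and the camera-to-point vector.
v = x_world - t
print(d @ v / (np.linalg.norm(d) * np.linalg.norm(v)))   # ~1.0: the ray points at the point
```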
(1)Camera-aware positional encoding
①Each unprojected image coordinate $R_k^{-1} K_k^{-1} x^{(I)}$ is a direction vector from the origin of camera $k$ to the image plane at depth 1
②Each direction vector is encoded into a $D$-dimensional positional embedding by an MLP (see the sketch after this list)
③Learned positional encoding:
④A new embedding is generated by learning every element
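A hedged PyTorch-style sketch of item ② (the module name dir_mlp, the layer sizes, and D = 128 are my own choices, not the authors'): every pixel is unprojected into a direction vector and passed through an MLP to obtain a camera-aware positional embedding.

```python
import torch
import torch.nn as nn

def pixel_directions(K_inv, R_inv, H, W):
    """Unproject every pixel centre into a direction vector d = R^{-1} K^{-1} x_I."""
    u, v = torch.meshgrid(torch.arange(W) + 0.5, torch.arange(H) + 0.5, indexing="xy")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1)     # (H, W, 3) homogeneous pixels
    return pix.reshape(-1, 3) @ K_inv.T @ R_inv.T             # (H*W, 3) direction vectors

# Assumed embedding width D; the note only says the directions are encoded by an MLP.
D = 128
dir_mlp = nn.Sequential(nn.Linear(3, D), nn.ReLU(), nn.Linear(D, D))

K_inv = torch.eye(3)      # placeholder inverse intrinsics for one camera
R_inv = torch.eye(3)      # placeholder inverse extrinsic rotation
dirs = pixel_directions(K_inv, R_inv, H=14, W=30)             # one feature-map scale
delta = dir_mlp(dirs)                                         # (H*W, D) camera-aware positions
```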
(2)Map-view latent embedding
①Combining 2 positional embeddings:
(3)Cross-view attention
①Image encoder: a pretrained and fine-tuned EfficientNet-B4
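A minimal single-resolution sketch of cross-view attention as described above (all shapes, the 25×25 map grid, and the tensor names are my assumptions, not the released implementation): the map-view embedding queries the image features of all cameras, with keys that carry the camera-aware positional embeddings.

```python
import torch
import torch.nn.functional as F

D = 128                        # embedding dimension (assumed)
n_cams, HW = 6, 14 * 30        # six camera views, one feature-map resolution
Q = 25 * 25                    # flattened map-view latent grid (size assumed)

map_queries = torch.randn(Q, D)              # learned map-view embedding (queries)
img_feats   = torch.randn(n_cams, HW, D)     # per-camera image features from the encoder
cam_pos     = torch.randn(n_cams, HW, D)     # camera-aware positional embeddings (delta)

# Keys mix appearance and geometry; values carry the appearance features only.
keys   = (img_feats + cam_pos).reshape(n_cams * HW, D)
values = img_feats.reshape(n_cams * HW, D)

# Scaled dot-product cross-attention over all cameras and pixels at once.
attn = F.softmax(map_queries @ keys.T / D ** 0.5, dim=-1)   # (Q, n_cams * HW)
map_update = attn @ values                                  # (Q, D) refined map embedding
```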
2.4.2. A cross-view transformer architecture
①Patch embeddings from the encoder at multiple scales, where $R$ is the number of resolutions
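Under my reading, the map-view embedding is refined once per encoder resolution, coarse to fine; a schematic sketch with placeholder modules (CrossViewBlock and all shapes are mine, and nn.MultiheadAttention merely stands in for the paper's cross-view attention):

```python
import torch
import torch.nn as nn

class CrossViewBlock(nn.Module):
    """Placeholder block: one cross-view attention refinement at one resolution."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, map_emb, keys, values):
        out, _ = self.attn(map_emb, keys, values)
        return map_emb + out                       # residual update of the map embedding

dim, Q = 128, 25 * 25
blocks = nn.ModuleList([CrossViewBlock(dim) for _ in range(2)])    # R = 2 resolutions here
map_emb = torch.randn(1, Q, dim)                                   # learned map-view embedding

# Flattened per-resolution image tokens (features + camera-aware positions), all cameras stacked.
tokens = [torch.randn(1, 6 * 28 * 60, dim), torch.randn(1, 6 * 14 * 30, dim)]
for block, tok in zip(blocks, tokens):
    map_emb = block(map_emb, keys=tok, values=tok)                 # coarse-to-fine refinement
```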
2.5. Implementation Details
(1)Architecture
①Two image feature scales: (28, 60) and (14, 30), i.e. 8x and 16x downscaling
②The grid size
③Attention heads: 4, with embedding size
④Decoder: consists of three (bilinear upsample + conv) layers (see the sketch after this list)
⑤Output resolution: (200, 200)
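A sketch of what item ④ could look like, assuming a 25 × 25 latent map grid and placeholder channel widths (neither is confirmed by the note); three 2x upsampling stages take the grid to the (200, 200) output:

```python
import torch
import torch.nn as nn

def up_block(c_in, c_out):
    """One decoder stage: 2x bilinear upsampling followed by a 3x3 convolution."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

decoder = nn.Sequential(
    up_block(128, 128),
    up_block(128, 64),
    up_block(64, 32),
    nn.Conv2d(32, 1, kernel_size=1),    # one logit per map cell for the binary mask
)

x = torch.randn(1, 128, 25, 25)         # refined map-view embedding reshaped to a 2D grid
print(decoder(x).shape)                 # torch.Size([1, 1, 200, 200])
```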
(2)Training
①Loss: focal loss (a sketch follows this list)
②Batch size: 4 per GPU
③Epoch: 30
④Initial learning rate: 1e-2 with 1e-7 weight decay by AdamW
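A hedged sketch of a training step matching items ①-④ (the focal-loss parameters alpha/gamma and the stand-in model are illustrative choices, not values from the paper):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss on per-cell logits; down-weights easy examples."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                                    # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

model = torch.nn.Conv2d(128, 1, kernel_size=1)               # stand-in for the full network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2, weight_decay=1e-7)

# One illustrative step with the reported batch size of 4 per GPU.
x = torch.randn(4, 128, 200, 200)                            # fake map-view features
y = (torch.rand(4, 1, 200, 200) > 0.95).float()              # sparse binary map-view labels
optimizer.zero_grad()
loss = focal_loss(model(x), y)
loss.backward()
optimizer.step()
```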
2.6. Results
(1)Dataset
①Dataset: nuScenes
②Scenes: 1k over different weather, time and traffic conditions
③Frames: each nuScenes scene spans 20 seconds with 40 annotated frames; the Argoverse dataset provides 10k frames
④Samples: 40k
⑤View: 360° with 6 camera views
⑥Each camera view has calibrated intrinsics and extrinsics at every timestep
⑦Default reshaping size: 224 × 480
⑧3D bounding boxes are annotated from LiDAR data
⑨Ground-truth labels are generated by projecting the 3D box annotations onto the ground plane at a resolution of (200, 200)
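A small NumPy illustration of item ⑨ under the 100 m / 50 cm setting (the box pose and size are hypothetical, and nuScenes-specific I/O is omitted): a single 3D box footprint is rasterized onto a 200 × 200 ground-plane grid.

```python
import numpy as np

def rasterize_box(center, size, yaw, grid=200, extent=100.0):
    """Rasterize one box footprint (top-down) onto a grid x grid map covering
    extent x extent metres around the ego-vehicle (here 50 cm per cell)."""
    res = extent / grid
    xs = (np.arange(grid) + 0.5) * res - extent / 2     # cell-centre coordinates in metres
    X, Y = np.meshgrid(xs, xs)
    c, s = np.cos(-yaw), np.sin(-yaw)                   # rotate offsets into the box frame
    dx, dy = X - center[0], Y - center[1]
    u = c * dx - s * dy
    v = s * dx + c * dy
    return (np.abs(u) <= size[0] / 2) & (np.abs(v) <= size[1] / 2)

mask = np.zeros((200, 200), dtype=bool)
# Hypothetical vehicle: 4.5 m x 2.0 m footprint, 10 m ahead of the ego, rotated 30 degrees.
mask |= rasterize_box(center=(10.0, 0.0), size=(4.5, 2.0), yaw=np.deg2rad(30.0))
print(mask.sum())    # number of occupied cells (~ 4.5 * 2.0 / 0.5**2 = 36)
```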
(2)Evaluation
①Setting 1 uses a 100m×50m area around the vehicle and samples a map at a 25cm resolution
②Setting 2 uses a 100m×100m area around the vehicle, with a 50cm sampling resolution
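The grid sizes implied by the two settings, worked out explicitly:

```python
# Setting 1: a 100 m x 50 m area sampled at 25 cm per cell.
print(int(100 / 0.25), int(50 / 0.25))   # 400 200 -> a 400 x 200 grid
# Setting 2: a 100 m x 100 m area sampled at 50 cm per cell.
print(int(100 / 0.5), int(100 / 0.5))    # 200 200 -> a 200 x 200 grid
```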
2.6.1. Comparison to prior work
①They compared their model with single-timestep models:
②Comparison on different datasets, where the top rows are nuScenes and the bottom rows are Argoverse:
2.6.2. Ablations of cross-view attention
①Ablations of cross-view attention mechanism:
2.6.3. Camera-aware positional embeddings
①Ablations of the camera-aware positional embeddings:
2.6.4. Accuracy vs distance
①Performance as distance increases:
②Qualitative results on scenes with varying degrees of occlusion:
where the overhead views on the left are the predicted results and those on the right are the ground truth
2.6.5. Robustness to sensor dropout
①How the number of cameras affects performance:
2.6.6. Qualitative Results
①This refers to the large figure above with the real scene images; the authors note that distant or occluded vehicles are not perceived well
2.6.7. Geometric reasoning in cross-view attention
①Visualization of cross-view attention:
2.7. Conclusion
~
3. Reference
Zhou, B. & Krähenbühl, P. (2022) 'Cross-view Transformers for real-time Map-view Semantic Segmentation', CVPR.