
Yuchen Zhou
Nanjing Agricultural University
Nanjing, China
2023119007@stu.njau.edu.cn
Yan Luo
Nanjing Agricultural University
Nanjing, China
luoyan@njau.edu.cn
Xiangang Wang
Desay SV Automotive Co., Ltd.
Nanjing, China
Xiangang.Wang@desaysv.com
Xingjian Gu
Nanjing Agricultural University
Nanjing, China
guxingjian@njau.edu.cn
Mingzhou Lu
Nanjing Agricultural University
Nanjing, China
mingzhou.lu@njau.edu.cn
Abstract

Efficient and high-accuracy 3D occupancy prediction is crucial for ensuring the performance of autonomous driving (AD) systems. However, many current methods focus on high accuracy at the expense of real-time processing needs. To address this challenge of balancing accuracy and inference speed, we propose a directional pure 2D approach. Our method involves slicing 3D voxel features to preserve complete vertical geometric information. This strategy compensates for the loss of height cues in Bird’s-Eye View (BEV) representations, thereby maintaining the integrity of the 3D geometric structure. By employing a directional attention mechanism, we efficiently extract geometric features from different orientations, striking a balance between accuracy and computational efficiency. Experimental results highlight the significant advantages of our approach for autonomous driving. On the Occ3D-nuScenes benchmark, the proposed method achieves an mIoU of 39.3% and an inference speed of 27.7 FPS. In simulations on edge devices, the inference speed reaches 14.8 FPS, further demonstrating the method’s applicability for real-time deployment in resource-constrained environments.

Keywords: Autonomous Driving · 3D Occupancy Prediction · Directional Attention Mechanism · Geometric Structure Preservation

1 Introduction

Vision-based occupancy prediction aims to estimate both object occupancy and semantic information in 3D voxel space using surround-view images captured by ego-vehicle cameras [1, 2, 3, 4]. Accurate 3D geometry and semantic understanding of the surrounding environment are essential for autonomous driving (AD) [5]. Compared to conventional 3D object detection [6, 7], occupancy prediction offers finer-grained 3D scene understanding and is capable of recognizing general objects by learning their occupancy patterns.

Figure 1: The inference speed (FPS) and accuracy (mIoU) of various methods evaluated on the Occ3D-nuScenes benchmark [2]. Following the definition proposed by [8], we consider an occupancy prediction method to be real-time if it achieves at least 10 FPS.

Occupancy prediction offers several advantages, including enhanced scene understanding, improved handling of unknown obstacles, and joint modeling of geometry and semantics. However, many existing methods [9, 10, 11, 5] focus primarily on precision, often at the expense of real-time processing due to the complexity of their model architectures and large parameter counts. Consequently, these methods suffer from high computational costs and latency, making them impractical for deployment in real-time AD systems [12]. To ensure real-time inference speed and facilitate deployment, some approaches adopt 2D methods to predict 3D voxel occupancy [13, 4, 14, 15, 8]. These methods typically compress 3D voxel features into bird’s-eye view (BEV) representations, retaining only horizontal information. As a result, certain semantic and vertical geometric details are lost, which reduces accuracy. Figure 1 illustrates the limitations of current methods: high accuracy often comes with slower inference, while methods designed for faster inference tend to suffer significant accuracy loss, with performance differences exceeding 10%. This issue arises from the compression of vertical geometric structures, as shown in Figure 2. Methods relying solely on 2D BEV representations struggle to capture fine-grained spatial structures, which are essential for accurate scene understanding [12, 15, 5, 8]. The visualization results further emphasize the importance of preserving vertical geometric features to maintain the overall semantic and structural integrity of the 3D voxel space.

Figure 2: The top section illustrates how traditional 2D BEV methods cause geometric structures to collapse during compression, particularly in the vertical direction. The bottom section demonstrates how our approach preserves these structures while maintaining efficiency, even in a purely 2D framework.

To ensure the preservation of geometric structure and achieve more accurate object representation in 3D coordinates, we propose a novel approach for processing 3D voxel features. Our method involves performing slicing operations on the 3D voxel features, which retains the full vertical (height) geometric structure, and subsequently fuses this information with the 2D BEV feature map. This fusion restores the vertical geometric details that are often lost in conventional 2D BEV representations, providing a more comprehensive understanding of the scene.

To maintain high inference speed while preserving accuracy, we propose a directional attention-based occupancy prediction method (DA-Occ). This approach efficiently captures geometric features in both horizontal and vertical directions, enabling the model to selectively focus on relevant features in different orientations. As a result, DA-Occ optimizes computational performance without compromising the integrity of the 3D voxel structure. With an mIoU of 39.3% and an inference speed of 27.7 FPS on the Occ3D-nuScenes dataset, DA-Occ achieves a balance between high accuracy and fast inference, making it well-suited for real-time applications in autonomous driving.

2 Related Work

2.1 2D-to-3D View Transformation.

Many existing methods utilize estimated depth information to project 2D image features into 3D voxel space [16, 17, 18, 19, 20]. LSS [21] explicitly predicts depth distributions to lift 2D image features into 3D space. More recent approaches [3, 22, 23] have highlighted the importance of accurate depth estimation in the view transformation process. For instance, BEVDepth [3] encodes both intrinsic and extrinsic camera parameters into its depth refinement module for improved 3D object detection. BEVStereo [22] introduces an efficient temporal stereo mechanism to enhance depth estimation quality. In addition, some studies [24, 25, 26, 27] focus on optimizing the projection stage itself. However, existing approaches have largely overlooked the potential of height prediction as a direct supervisory signal for view transformation. Inspired by DHD [5], our method uses height scores predicted from 2D images to guide the transformation into 3D voxel space. This enables more accurate height estimation in the voxel representation and lays a solid foundation for subsequent BEV feature reconstruction.

2.2 3D Occupancy Prediction.

3D occupancy prediction reconstructs the 3D scene by predicting a dense voxel grid that encodes spatial occupancy [12]. A straightforward approach is to extend the BEV representation used in 3D object detection into voxel space by lifting BEV features into 3D [14] and applying a segmentation head [11, 4, 8, 13]. However, this strategy disrupts the geometric integrity of the voxel representation due to the lack of vertical structure awareness. Directly processing full 3D voxel features, on the other hand, incurs substantial computational overhead. To address this, COTR [9] introduces a compact voxel representation via downsampling, while PanoOcc [28] proposes a novel panoramic occupancy segmentation task and leverages sparse 3D convolutions to reduce computation. Nonetheless, these methods still suffer from limited inference speed, making them less suitable for real-time deployment. To tackle this challenge, we propose a 2D-based approach that preserves the geometric structure of the voxel space while significantly improving inference efficiency, offering a more deployable and scalable solution for 3D occupancy prediction.

3 Method

3.1 Problem Formulation

In 3D occupancy prediction tasks, the surrounding environment is discretized into a voxel-based volumetric grid [21]. Assuming the ego vehicle is located at the origin of the global coordinate system, the 3D perception range is defined by the bounds $[X_s, Y_s, Z_s, X_e, Y_e, Z_e]$, where $[X_s, Y_s, Z_s]$ and $[X_e, Y_e, Z_e]$ denote the lower and upper bounds along the $X$, $Y$, and $Z$ axes, respectively. The scene volume is partitioned into a grid of shape $[X, Y, Z]$ (e.g., $[200, 200, 1]$ as in [14]). Accordingly, the physical size of a single voxel along each axis is computed as $\left[\frac{X_e - X_s}{X}, \frac{Y_e - Y_s}{Y}, \frac{Z_e - Z_s}{Z}\right]$. At inference time, each voxel is assigned a binary occupancy state (occupied or empty), along with a semantic label from a predefined set of categories or an unknown class. The final voxel-wise predictions are derived from visual features extracted from multi-view images, denoted as $\mathbf{X} \in \mathbb{R}^{N \times C \times H \times W}$, where $N$ is the number of cameras and $[C, H, W]$ is the feature size. Each camera is associated with both intrinsic and extrinsic parameters ($K$ and $[R|t]$), which are used to transform 2D image features into the unified 3D coordinate space. These parameters enable precise projection and aggregation of multi-view information into the voxel grid for occupancy and semantic prediction.
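As a small worked example of the voxel-size formula, the sketch below uses the perception range and grid implied by the benchmark settings in Section 4.1 (an 80 m × 80 m × 6.4 m volume at 0.4 m resolution, i.e. a 200 × 200 × 16 grid). The function names `voxel_size` and `point_to_voxel` are hypothetical helpers introduced only for illustration; they are not part of the method.

```python
import math


def voxel_size(bounds, grid):
    """Per-axis voxel edge length.

    bounds: (Xs, Ys, Zs, Xe, Ye, Ze) in metres; grid: (X, Y, Z) voxel counts.
    """
    xs, ys, zs, xe, ye, ze = bounds
    nx, ny, nz = grid
    return ((xe - xs) / nx, (ye - ys) / ny, (ze - zs) / nz)


def point_to_voxel(point, bounds, grid):
    """Map a 3D point in the ego frame to voxel indices, or None if outside."""
    size = voxel_size(bounds, grid)
    idx = tuple(
        int(math.floor((p - lo) / s))
        for p, lo, s in zip(point, bounds[:3], size)
    )
    inside = all(0 <= i < n for i, n in zip(idx, grid))
    return idx if inside else None


# Range and grid assumed from the Occ3D-nuScenes settings quoted in Section 4.1.
BOUNDS = (-40.0, -40.0, -1.0, 40.0, 40.0, 5.4)
GRID = (200, 200, 16)
print(voxel_size(BOUNDS, GRID))
print(point_to_voxel((0.1, 0.1, 0.1), BOUNDS, GRID))
```

Points that fall outside the perception range map to `None`, which mirrors the fact that only voxels inside the bounds receive a prediction.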

3.2 Overall Architecture

Figure 3: Overall architecture of DA-Occ. The left side shows the input images processed by the Backbone, generating feature maps $\mathbf{F}_{n}$ that are fed into the DepthNet and HeightNet for depth and height predictions. These features are then used to construct 3D features $\mathbf{F}_{3D}$ (with height) and BEV features $\mathbf{F}_{bev}$ (without height). The DBA and DHA are applied to enhance the feature representation. Finally, these features are fused to produce the final output, which is visualized on the right side. (For ease of understanding, some feature maps are depicted using the original images.)

The overall architecture of DA-Occ is depicted in detail in Fig. 3. It comprises the following key components: (1) Image Backbone: A convolutional neural network (e.g., ResNet [29]) is employed to extract high-level visual features from multi-view input images. (2) Geometry-Aware Feature Encoder: This integrates a depth estimation network (DepthNet) and a height attention network (HeightNet) to enrich the geometric understanding and improve the structural fidelity of subsequent 3D representations. (3) 2D-to-3D View Transformation: Image-plane features are lifted into the 3D voxel space using calibrated projection matrices, enabling spatial alignment across multiple views. (4) Geometry-Aware Feature Decoder: The final decoding stage incorporates geometric priors to enhance 3D structural consistency and improve the granularity of semantic predictions within the voxel grid.

3.3 Directional Attention Mechanism

Figure 4: Internal operations of the Directional Attention Mechanism. The process first performs directional average compression on the input feature $\mathbf{F}_{in}$ (in the horizontal or vertical direction) to generate one of $\mathbf{F}_{h}$ or $\mathbf{F}_{v}$. This intermediate result is passed through a Multi-Layer Perceptron (MLP) to generate dynamic convolutional weights. These weights are then applied to a concatenated feature tensor via convolution, producing the final output feature $\mathbf{F}_{out}$.

To preserve the consistency of 3D geometric structures, particularly along the vertical axis, which is often underrepresented in 2D BEV-based representations, we introduce the Directional Attention Mechanism. This mechanism is used in both the Geometry-Aware Feature Encoder and the Geometry-Aware Feature Decoder, but at different stages of the model pipeline. In the Encoder, it is applied before the 2D-to-3D View Transformation to dynamically acquire height scores, which help enhance the vertical geometric features. After the transformation, in the Decoder, the Directional Height Attention (DHA) and Directional BEV Attention (DBA) modules are employed to refine and reconstruct the vertical and horizontal geometric structures, ensuring a comprehensive and accurate 3D representation. Inspired by ParC-Net [30], which captures long-range dependencies in 2D image space, we extend this mechanism to the 3D voxel domain. Our extension dynamically extracts 3D geometric features, enabling the model to better capture long-range spatial dependencies and enhance the representation of 3D structures. By introducing direction-aware attention that models dependencies along both the vertical and horizontal axes, our approach preserves fine-grained 3D structure within a purely 2D framework. Specifically, DHA is designed to capture detailed height-dependent patterns by processing features along the $Z$-axis (vertical), while DBA captures geometric dependencies along the horizontal plane within the BEV representation. The directional attention mechanism can be expressed as $\mathbf{F}_{out} = \mathrm{DA}(\mathbf{F}_{in}, \text{dir}=i)$, where $i$ refers to either the horizontal or vertical direction in feature maps. Figure 4 illustrates the structure of this mechanism. The computation proceeds as:

$$
\begin{aligned}
W &= \mathrm{MLP}(\mathrm{Cp}(\mathbf{F}_{in}, \text{dir}=i)) \\
\mathbf{F}_{ch} &= \mathrm{Cat}(\mathbf{F}_{in}, \mathbf{F}_{in}[:,:,:-1,:]) \\
\mathbf{F}_{cv} &= \mathrm{Cat}(\mathbf{F}_{in}, \mathbf{F}_{in}[:,:,:,:-1]) \\
\mathbf{F}_{out} &= \mathrm{Conv}(\mathbf{F}_{ch/cv} + p_{i}, W)
\end{aligned}
\tag{1}
$$

where $W$ denotes the convolutional weights dynamically generated by the MLP. The function $\mathrm{Cp}(\cdot)$ performs average pooling along direction $i$ to compress spatial information. $\mathbf{F}_{ch}$ and $\mathbf{F}_{cv}$ represent horizontally and vertically concatenated features, respectively. The position encoding $p_{i}$ is a learnable parameter with a shape of $H \times 1$ or $1 \times W$, and the final convolution $\mathrm{Conv}$ integrates both structure and learned weights to generate $\mathbf{F}_{out}$.
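The computation in Eq. (1) can be sketched in NumPy as follows. This is a minimal illustration under several assumptions that the text does not specify: the MLP is a two-layer ReLU network acting on the pooled axis, the dynamic kernel is applied per channel (depthwise), and the position encoding is added before the feature is concatenated with its truncated copy. The names `directional_attention`, `w1`, and `w2` are hypothetical.

```python
import numpy as np


def directional_attention(f_in, pos, w1, w2, direction="h"):
    """Directional attention sketch on a (B, C, H, W) tensor.

    direction="h" processes axis 2 and direction="v" processes axis 3,
    following the slicing pattern of F_ch and F_cv in Eq. (1).
    pos has shape (L, 1); w1 has shape (L, hidden); w2 has shape (hidden, L),
    where L is the length of the processed axis.
    """
    if direction == "v":
        out = directional_attention(f_in.swapaxes(2, 3), pos, w1, w2, "h")
        return out.swapaxes(2, 3)

    length = f_in.shape[2]
    pooled = f_in.mean(axis=3)                  # Cp: average over the other axis -> (B, C, L)
    hidden = np.maximum(pooled @ w1, 0.0)       # MLP layer 1 with ReLU
    kernel = hidden @ w2                        # dynamic weights W -> (B, C, L)

    x = f_in + pos                              # add position encoding (assumed placement)
    x_cat = np.concatenate([x, x[:, :, :-1, :]], axis=2)  # (B, C, 2L-1, W)

    # Valid convolution of the extended tensor with a length-L kernel,
    # which is equivalent to a circular convolution over the original axis.
    out = np.zeros(f_in.shape, dtype=float)
    for k in range(length):
        out += kernel[:, :, k, None, None] * x_cat[:, :, k:k + length, :]
    return out


rng = np.random.default_rng(0)
feat = rng.normal(size=(2, 3, 4, 5))
out_h = directional_attention(
    feat, rng.normal(size=(4, 1)), rng.normal(size=(4, 8)), rng.normal(size=(8, 4)), "h"
)
print(out_h.shape)
```

Concatenating the feature with its own truncated copy means every output position sees the whole axis, which is how a single convolution can capture long-range dependencies along one direction.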

3.4 Geometry-Aware Feature Encoder

HeightNet, first introduced in [5], shares a similar structure with the DepthNet module in BEVDepth [3]. It is used to generate a predicted height map $H_{pred}$, which is supervised by a ground-truth height map $H_{gt}$ to facilitate accurate height estimation. The generation of $H_{gt}$ proceeds as follows: a LiDAR point $p_{l} = [x_{l}, y_{l}, z_{l}, 1]^{T}$ in the LiDAR coordinate frame is transformed to the ego-vehicle coordinate system as $p_{e} = [x_{e}, y_{e}, z_{e}, 1]^{T}$. The corresponding 2D projection onto the image plane is computed using:

$$
d\,[u, v, 1]^{T} = K\,[R|t]\,[x_{l}, y_{l}, z_{l}, 1]^{T}
\tag{2}
$$

where $d$ is the depth value in the camera coordinate system, $[u, v, 1]$ denotes the pixel coordinates in homogeneous form, $K$ is the intrinsic matrix, and $R \in \mathbb{R}^{3 \times 3}$, $t \in \mathbb{R}^{3}$ are the camera extrinsic parameters.

Following [5], we construct a set of tuples $[u, v, d, z_{e}]^{T}$, where each pixel location $[u, v]$ is assigned both a projected depth $d$ and a corresponding height $z_{e}$ in the ego coordinate system. For each pixel, only the point with the smallest depth is retained to eliminate occluded redundancies, resulting in a ground-truth height map $H_{gt}$ that represents the visible uppermost surface in the scene.
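The ground-truth construction described above can be sketched as follows. This is an illustrative NumPy version, not the authors' implementation: the function name `height_map_gt` and the arguments `Rt` (the 3×4 matrix $[R|t]$), `lidar2ego` (a 4×4 LiDAR-to-ego transform), and `hw` (image height and width) are assumptions, and pixels that receive no LiDAR point are marked with NaN.

```python
import numpy as np


def height_map_gt(points_lidar, K, Rt, lidar2ego, hw):
    """Build a sparse height map from LiDAR points of shape (N, 3).

    Each pixel keeps the ego-frame height z_e of its nearest projected point.
    """
    H, W = hw
    ones = np.ones((points_lidar.shape[0], 1))
    homo = np.concatenate([points_lidar, ones], axis=1)   # (N, 4) homogeneous points
    z_ego = (homo @ lidar2ego.T)[:, 2]                    # height in the ego frame
    uvd = homo @ Rt.T @ K.T                               # rows are d * [u, v, 1], Eq. (2)
    d = uvd[:, 2]

    front = d > 1e-6                                      # keep points in front of the camera
    uvd, d, z_ego = uvd[front], d[front], z_ego[front]
    u = np.floor(uvd[:, 0] / d).astype(int)
    v = np.floor(uvd[:, 1] / d).astype(int)

    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    u, v, d, z_ego = u[inside], v[inside], d[inside], z_ego[inside]

    # Sort by depth, then keep the first (nearest) point for every pixel.
    order = np.argsort(d, kind="stable")
    flat = (v * W + u)[order]
    pixels, first = np.unique(flat, return_index=True)

    height = np.full(H * W, np.nan)
    height[pixels] = z_ego[order][first]
    return height.reshape(H, W)
```

Sorting by depth and then taking the first occurrence per pixel avoids relying on the write order of repeated indices in a fancy-index assignment, which NumPy does not guarantee.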

The Directional Height Attention (DHA) network is designed to predict $H_{pred}$ by leveraging structural priors along the vertical ($Z$-axis) direction. It incorporates both the geometric shape prior of objects in height and the contextual information from adjacent horizontal locations. To efficiently capture vertical geometry while keeping computation lightweight, DHA replaces the last two residual blocks in the original DepthNet with directional attention modules. This architectural modification enables more effective extraction of height-sensitive features with minimal additional cost.

The entire Geometry-Aware Feature Encoder process is described as follows: $\mathbf{F}_{n}$ is processed by DepthNet, generating the depth score $D_{pred}$ and the feature map $\mathbf{F}_{feat}$. $\mathbf{F}_{n}$ is then passed through HeightNet, producing the height score $H_{pred}$.

3.5 2D-to-3D View Transformation

To preserve the integrity of 3D geometric structures while maintaining computational efficiency, we adopt the Lift-Splat-Shoot (LSS) framework [21] for view transformation. Existing approaches such as FlashOcc [14] rely on a purely 2D transformation grid (e.g., $1 \times 200 \times 200$) and compress the vertical dimension, resulting in significant geometric distortion. In contrast, we adopt a lightweight 3D voxel grid (e.g., $16 \times 32 \times 32$) to enrich the BEV feature map with enhanced vertical geometric information. This enables the BEV feature map to retain vertical spatial cues more effectively and recover structural geometry with higher fidelity under similar computational constraints.

Specifically, we construct a 3D voxel grid that enables finer-grained depth and height modeling during the lifting process. The generated 3D voxel features are then sliced and fused along the depth axis ($Y$-axis), effectively capturing vertical structure while still allowing the final output to remain in a 2D feature map format for efficient processing. In the Splat stage of LSS, we enhance the lifted 3D voxel representation with the predicted height confidence scores. The resulting voxel projection is formulated as:

$$
\mathbf{F}_{3D} = \sum_{i \in \text{ranks\_bev}[i]=v} \mathrm{softmax}(H_{pred}) \cdot \mathbf{F}_{feat,i}
\tag{3}
$$

where $H_{pred}$ denotes the scores along the height direction, and $\mathbf{F}_{feat,i}$ represents the $i$-th pixel feature. This design enhances the model’s perception of vertical geometric structures while maintaining real-time inference efficiency, leading to improved spatial feature representation. After constructing the 3D voxel feature volume $\mathbf{F}_{3D} \in \mathbb{R}^{B \times C \times Z \times Y \times X}$, we perform slicing along the depth axis (i.e., the $Y$-axis) and concatenate the resulting slices along the horizontal axis ($X$-axis). This transforms the 3D voxel structure into a 2D feature map with enhanced perception of vertical (height-wise) geometric structures, facilitating more efficient extraction of height-aware spatial representations. This process is described as:

$$
\mathbf{F}_{height} = \text{Concat}_{k=0}^{k=Y-1}\, \mathbf{F}_{3D}[:,:,:,k,:],\quad dim=4
\tag{4}
$$

where the concatenation is applied along the spatial $X$-axis, resulting in a feature map $\mathbf{F}_{height} \in \mathbb{R}^{B \times C \times Z \times (Y \times X)}$.
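The slicing step in Eq. (4) can be checked with a few lines of NumPy. The sketch below is only an illustration with a hypothetical function name, `slice_to_height_map`. Each slice is a 4D array, so under 0-based indexing the concatenation runs along its last axis (axis 3).

```python
import numpy as np


def slice_to_height_map(f_3d):
    """(B, C, Z, Y, X) -> (B, C, Z, Y*X) by concatenating Y-slices along X."""
    num_slices = f_3d.shape[3]
    slices = [f_3d[:, :, :, k, :] for k in range(num_slices)]  # each (B, C, Z, X)
    return np.concatenate(slices, axis=3)


f_3d = np.arange(2 * 3 * 4 * 5 * 6, dtype=float).reshape(2, 3, 4, 5, 6)
f_height = slice_to_height_map(f_3d)
print(f_height.shape)
```

For a contiguous array this is equivalent to `f_3d.reshape(B, C, Z, Y * X)`, so no voxel is dropped: the height axis $Z$ stays intact as one of the two spatial axes of the resulting 2D map.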

3.6 Geometry-Aware Feature Decoder

Figure 5: Combined effects of the DHA and DBA modules. The right side presents an equivalent depiction of these effects, highlighting the collaborative extraction of geometric features along the $X$, $Y$, and $Z$ axes and the synergy between the three axes in capturing spatial information.

The BEV feature map $\mathbf{F}_{bev}$, which primarily encodes the target’s depth-related information, and the sliced height-aware feature map $\mathbf{F}_{height}$, which captures the target’s vertical structural cues, are obtained following the view transformation stage. To enhance the geometric fidelity of $\mathbf{F}_{bev}$ and $\mathbf{F}_{height}$, we introduce DBA and DHA, which specialize in refining horizontal and vertical geometric features through direction-aware attention.

Direction-Aware Attention. The BEV feature map $\mathbf{F}_{bev}$ captures the relative distance between the ego vehicle and surrounding targets. This spatial relationship can be further decomposed into front-back and left-right directional components within the BEV plane. To better model depth-related geometric structures, we apply a directional attention operator $\text{DA}(\cdot, \text{dir})$ along both the $X$ and $Y$ axes of the BEV feature map. The refined BEV representation is computed as:

$$
\mathbf{F}_{BEV} = \text{DA}(\mathbf{F}_{bev}, \text{dir}=h) + \text{DA}(\mathbf{F}_{bev}, \text{dir}=v)
\tag{5}
$$

where the output $\mathbf{F}_{BEV}$ has the same shape as the input $\mathbf{F}_{bev}$, and the fusion enhances geometric sensitivity in both planar directions.

Because $\mathbf{F}_{height}$ is produced by the slicing operation described above, it preserves the geometric structure in the vertical direction. The DHA performs direction-aware processing that better aligns with this vertical structure. This directional sensitivity allows it to capture and refine height-dependent geometric features more effectively. The refinement process is defined as:

$$
\begin{aligned}
\mathbf{F}_{h3D} &= \text{reshape}(\text{DA}(\mathbf{F}_{height}, \text{dir}=h)) \\
\mathbf{F}_{Height} &= \text{Concat}_{z=0}^{Z-1}(\mathbf{F}_{h3D}[:,:,z,:,:]),\quad \text{dim}=1
\end{aligned}
\tag{6}
$$

where ??h?3?D?B×C×Z×Y×X\mathbf{F}_{h3D}\in\mathbb{R}^{B\times C\times Z\times Y\times X}bold_F start_POSTSUBSCRIPT italic_h 3 italic_D end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_B × italic_C × italic_Z × italic_Y × italic_X end_POSTSUPERSCRIPT is the reshaped 3D height-aware feature volume, and ??H?e?i?g?h?t?B×(C?Z)×Y×X\mathbf{F}_{Height}\in\mathbb{R}^{B\times(C\cdot Z)\times Y\times X}bold_F start_POSTSUBSCRIPT italic_H italic_e italic_i italic_g italic_h italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_B × ( italic_C ? italic_Z ) × italic_Y × italic_X end_POSTSUPERSCRIPT is the final height-enhanced feature map obtained by concatenating the ZZitalic_Z axes slices along the channel dimension.

This formulation enables fine-grained recovery of vertical geometry and improves the spatial expressiveness of the height feature representation. The final feature representation ??d?_?h\mathbf{F}_{d\_h}bold_F start_POSTSUBSCRIPT italic_d _ italic_h end_POSTSUBSCRIPT is obtained by fusing ??H?e?i?g?h?t\mathbf{F}_{Height}bold_F start_POSTSUBSCRIPT italic_H italic_e italic_i italic_g italic_h italic_t end_POSTSUBSCRIPT and ??B?E?V\mathbf{F}_{BEV}bold_F start_POSTSUBSCRIPT italic_B italic_E italic_V end_POSTSUBSCRIPT, effectively combining fine-grained height-aware and depth-aware information to enhance 3D spatial understanding. The whole process looks like Figure?5.
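The reshape-and-concatenate step of Eq. (6) can be sketched in NumPy as follows. The DA operator is omitted here, and the assumed 2D layout of the height feature before reshaping, $B\times C\times(Z\cdot Y)\times X$, is our assumption rather than something stated in this section. The grid size $16\times 32\times 32$ follows the height-score grid reported in the implementation details.

```python
import numpy as np


def height_slices_to_channels(f_h3d):
    """Concatenate the Z slices of a (B, C, Z, Y, X) volume along channels."""
    num_slices = f_h3d.shape[2]
    slices = [f_h3d[:, :, z, :, :] for z in range(num_slices)]
    return np.concatenate(slices, axis=1)  # -> (B, C * Z, Y, X)


B, C, Z, Y, X = 2, 3, 16, 32, 32

# Assumed 2D layout of the (already attended) height feature.
f_height_2d = np.arange(B * C * Z * Y * X, dtype=np.float32).reshape(B, C, Z * Y, X)

# reshape(.) in Eq. (6): recover the explicit height axis.
f_h3d = f_height_2d.reshape(B, C, Z, Y, X)

# Concat over z in Eq. (6): fold height into the channel dimension.
f_Height = height_slices_to_channels(f_h3d)

# The same result as a single permute + reshape (channel index = z * C + c).
equivalent = f_h3d.transpose(0, 2, 1, 3, 4).reshape(B, Z * C, Y, X)

print(f_h3d.shape, f_Height.shape, np.array_equal(f_Height, equivalent))
```

The permute-and-reshape form avoids the Python loop and is usually the more convenient way to express this step in a tensor library.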

4 Experiment

4.1 Dataset and Evaluation Metrics

We train and evaluate our model on the Occ3D-nuScenes benchmark [2], which is built upon the nuScenes dataset [31]. The dataset comprises 1,000 video sequences, split into 700 for training, 150 for validation, and 150 for testing. Each keyframe includes a 32-beam LiDAR point cloud, six RGB images from surround-view cameras, and dense voxel-level semantic occupancy annotations. The perception range is defined as $[-40\,\text{m},-40\,\text{m},-1\,\text{m},40\,\text{m},40\,\text{m},5.4\,\text{m}]$, and the voxel resolution is set to $[0.4\,\text{m},0.4\,\text{m},0.4\,\text{m}]$. Each voxel is assigned one of 18 semantic classes: 16 predefined object categories, one "others" class for unknown objects, and one "empty" class representing free space. To assess the performance of our approach, we adopt the metric widely used in 3D occupancy prediction, the mean Intersection-over-Union (mIoU) [32], computed over all semantic categories.
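To make the setup concrete, the following sketch derives the voxel grid shape implied by the perception range and voxel resolution above, and computes a per-class IoU average on toy labels. It is a simplified illustration, not the official benchmark evaluation: the helper names are ours, classes absent from both prediction and ground truth are skipped, and details such as visibility masking are not modeled.

```python
def voxel_grid_shape(pc_range, voxel_size):
    """Number of voxels per axis for [x_min, y_min, z_min, x_max, y_max, z_max]."""
    mins, maxs = pc_range[:3], pc_range[3:]
    return tuple(round((hi - lo) / size) for lo, hi, size in zip(mins, maxs, voxel_size))


def mean_iou(pred, target, num_classes):
    """Average of per-class intersection-over-union on flat label lists."""
    ious = []
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:  # skip classes that never occur
            ious.append(inter / union)
    return sum(ious) / len(ious)


grid = voxel_grid_shape([-40, -40, -1, 40, 40, 5.4], [0.4, 0.4, 0.4])
print(grid)  # (200, 200, 16)

# Toy example with 3 classes and 5 voxels.
pred = [0, 0, 1, 1, 2]
target = [0, 1, 1, 1, 2]
print(round(mean_iou(pred, target, num_classes=3), 4))  # 0.7222
```

With an 80 m extent in $X$ and $Y$ and a 6.4 m extent in $Z$, the 0.4 m resolution gives a $200\times 200\times 16$ grid, which matches the $200\times 200$ BEV grid and the 16 height slices used elsewhere in the paper.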

4.2 Implementation Details

Following common practice [9, 4, 32], we train our models using 4 NVIDIA H800 GPUs with a batch size of 4 per GPU. Unless otherwise specified, all models are optimized using the AdamW optimizer with a learning rate of $1\times 10^{-4}$ and a weight decay of 0.05. During the LSS stage, the grid resolution for the depth score is set to $[1\times 200\times 200]$, and the grid resolution for the height score is set to $[16\times 32\times 32]$. The loss function is designed around the geometric structure recovery in DA-Occ. Specifically, we follow the formulation proposed in MonoScene [1] and define the overall loss as

$$\mathcal{L}=\lambda_{1}\mathcal{L}_{\text{bce}}^{\text{depth}}+\lambda_{2}\mathcal{L}_{\text{bce}}^{\text{height}}+\lambda_{3}\mathcal{L}_{\text{ce}}+\lambda_{4}\mathcal{L}_{\text{scal}}^{\text{sem}}+\lambda_{5}\mathcal{L}_{\text{scal}}^{\text{geo}}$$

where $\lambda_{1},\lambda_{2},\lambda_{3},\lambda_{4},\lambda_{5}$ are set to 1.0, 1.0, 10.0, 1.0, and 1.0, respectively. For inference, we use a single NVIDIA A100 GPU with a batch size of 1. The frames-per-second (FPS) metric is measured using the official mmdetection3d codebase.
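The weighting of the five loss terms can be written down directly. The snippet below is a minimal sketch that only combines already-computed scalar loss values with the weights given above; the dictionary keys and the function name `total_loss` are illustrative choices of ours, and the individual loss terms themselves are not implemented here.

```python
# Weights lambda_1 .. lambda_5 as reported in the text.
LOSS_WEIGHTS = {
    "bce_depth": 1.0,   # lambda_1: BCE on the depth score
    "bce_height": 1.0,  # lambda_2: BCE on the height score
    "ce": 10.0,         # lambda_3: cross-entropy on semantic occupancy
    "scal_sem": 1.0,    # lambda_4: semantic scene-class affinity loss
    "scal_geo": 1.0,    # lambda_5: geometric scene-class affinity loss
}


def total_loss(terms, weights=LOSS_WEIGHTS):
    """Weighted sum of scalar loss terms; every weighted term must be present."""
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * terms[name] for name in weights)


# Toy values standing in for per-batch losses.
example_terms = {
    "bce_depth": 0.2,
    "bce_height": 0.3,
    "ce": 0.1,
    "scal_sem": 0.4,
    "scal_geo": 0.5,
}
print(total_loss(example_terms))
```

With these weights the cross-entropy term is scaled by a factor of 10 relative to the other four terms.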

4.3 Main Results

Table 1: Comparison with state-of-the-art methods on the Occ3D-nuScenes benchmark. FPS is tested on a single NVIDIA A100 using the mmdetection3d codebase. DA-Occ* and FlashOcc* denote variants that use Stereo4D temporal fusion [11]; the 16-frame variants adopt BEV-level fusion [32].

| Method | Venue | History Frames | Backbone | Image Size | mIoU (%) | FPS |
|---|---|---|---|---|---|---|
| MonoScene | CVPR’22 | - | ResNet-101 | 928×1600 | 6.06 | 1.2 |
| TPVFormer | CVPR’23 | - | ResNet-101 | 928×1600 | 27.83 | 3.1 |
| CTF-Occ | NIPS’24 | - | ResNet-101 | 928×1600 | 28.53 | - |
| OccFormer | ICCV’23 | - | ResNet-50 | 256×704 | 20.40 | - |
| BEVDetOcc | arXiv’22 | - | ResNet-50 | 256×704 | 31.64 | - |
| FlashOcc | arXiv’23 | - | ResNet-50 | 256×704 | 31.95 | 79.4 |
| DA-Occ (Ours) | - | - | ResNet-50 | 256×704 | 34.38 | 39.6 |
| BEVDetOcc-4D-Stereo | arXiv’22 | 1 | ResNet-50 | 256×704 | 36.01 | 1.8 |
| FB-OCC | ICCV’23 | 1 | ResNet-50 | 256×704 | 37.52 | 6.9 |
| FlashOcc* | arXiv’23 | 1 | ResNet-50 | 256×704 | 37.84 | 9.1 |
| GSD-Occ | AAAI’25 | 1 | ResNet-50 | 256×704 | 34.84 | 22.7 |
| EOPIA | ICAIIC’25 | 1 | ResNet-50 | 256×704 | 35.26 | 28.7 |
| DA-Occ (Ours)* | - | 1 | ResNet-50 | 256×704 | 38.97 | 7.3 |
| DA-Occ (Ours) | - | 1 | ResNet-50 | 256×704 | 35.77 | 31.5 |
| FastOCC | ICRA’24 | 16 | ResNet-101 | 320×800 | 37.2 | 10.1 |
| COTR | CVPR’24 | 16 | ResNet-50 | 256×704 | 44.5 | 0.9 |
| GSD-Occ | AAAI’25 | 16 | ResNet-50 | 256×704 | 39.4 | 20.0 |
| DA-Occ (Ours) | - | 16 | ResNet-50 | 256×704 | 39.3 | 27.7 |

Table 1 presents a comparison between DA-Occ and state-of-the-art (SOTA) methods on the Occ3D-nuScenes benchmark. The results are grouped by the number of history frames used for temporal fusion. DA-Occ demonstrates outstanding efficiency while maintaining high accuracy. Notably, even when fusing only a single history frame, DA-Occ* reaches 38.97% mIoU, which is comparable to many multi-frame models. This is largely attributed to the use of the Stereo4D strategy in its temporal fusion module [11]. Without any history frames, the proposed method improves mIoU by 2.43 percentage points over FlashOcc (34.38% vs. 31.95%). By employing frame fusion at the BEV level [32] with 16 history frames, DA-Occ achieves 27.7 FPS and 39.3% mIoU. These results highlight DA-Occ’s ability to strike an effective balance between efficiency and accuracy. Our comparisons are restricted to methods whose contributions lie in the model itself, namely MonoScene [1], TPVFormer [33], CTF-Occ [2], OccFormer [34], BEVDetOcc [11], FlashOcc [14], FB-OCC [4], GSD-Occ [12], EOPIA [13] (the source article does not name its model, so we use an abbreviation of its title), and COTR [9]. These results underscore the effectiveness of DA-Occ in preserving geometric structures using a purely 2D-based approach. To further validate this capability, we conduct both ablation studies and qualitative visualizations, which consistently confirm the robustness of DA-Occ in maintaining geometric structural integrity.
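One way to read the efficiency-accuracy balance discussed above is as a Pareto comparison. The sketch below takes the (mIoU, FPS) pairs of the 16-frame group in Table 1 and lists the methods that no other method in the group matches or beats on both axes at once. The helper `pareto_front` is our own illustrative function, not part of the evaluation protocol.

```python
# (mIoU %, FPS) for the 16-history-frame group of Table 1.
RESULTS_16_FRAMES = {
    "FastOCC": (37.2, 10.1),
    "COTR": (44.5, 0.9),
    "GSD-Occ": (39.4, 20.0),
    "DA-Occ": (39.3, 27.7),
}


def dominates(a, b):
    """True if point a is at least as good as b on both axes and better on one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def pareto_front(results):
    """Names of methods that are not dominated by any other method."""
    return [
        name
        for name, point in results.items()
        if not any(dominates(other, point) for other_name, other in results.items() if other_name != name)
    ]


print(pareto_front(RESULTS_16_FRAMES))
```

In this group, COTR is the most accurate but slowest method, DA-Occ is the fastest, and GSD-Occ sits between them at 0.1 points higher mIoU than DA-Occ but lower FPS.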

Figure 6: Qualitative comparison of the proposed method with FlashOcc. Our method recovers geometric structures more faithfully, resulting in more accurate segmentation in challenging areas.

4.4 Ablation Studies

Table 2: Effect of voxel grid resolution on mIoU, inference speed, and FLOPs.

| Size of Grid | mIoU (%) | FPS | GFLOPs |
|---|---|---|---|
| 16×32×32 | 34.38 | 39.6 | 276.59 |
| 16×64×64 | 34.45 | 38.7 | 276.65 |
| 8×32×32 | 33.67 | 39.9 | 276.58 |
| 4×32×32 | 32.89 | 40.1 | 276.55 |

Table 3: Effect of DHA and DBA modules on mIoU, inference speed, and FLOPs.

| DHA | DBA | mIoU (%) | FPS | GFLOPs |
|---|---|---|---|---|
| ✓ | ✓ | 34.38 | 39.6 | 276.59 |
| ✓ | - | 34.01 | 42.5 | 271.24 |
| - | ✓ | 33.76 | 41.3 | 276.57 |
| - | - | 32.58 | 44.7 | 271.21 |

Table 4: The mIoU and FPS of each model without the visible mask.

| Method | Venue | mIoU (%) | FPS |
|---|---|---|---|
| MonoScene | CVPR’22 | 6.06 | 1.2 |
| OccFormer | ICCV’23 | 20.40 | - |
| CTF-Occ | NIPS’24 | 28.53 | - |
| BEVDet | arXiv’22 | 18.05 | - |
| SparseOcc | ECCV’24 | 30.64 | 17.7 |
| DA-Occ | - | 31.31 | 27.7 |

Table 5: Model inference speed on a simulated edge device.

| Precision | Threads | GPU Memory | FPS |
|---|---|---|---|
| FP32 (A100) | Full | Full | 39.6 |
| FP16 (A100) | 2 | 4GB | 14.8 |

To accelerate experimental evaluation, we conducted the ablation experiments on the model without DepthNet and without temporal fusion. To investigate the impact of height information on occupancy prediction, we varied the resolution of the 3D voxel grid and analyzed how vertical geometric granularity influences performance. The impact of the $X$ and $Y$ grid size on accuracy is negligible (34.45% mIoU at 16×64×64 versus 34.38% at 16×32×32), as it primarily encodes depth information, which is predominantly obtained from the BEV feature map. In contrast, reducing the height resolution from 16 to 4 lowers mIoU from 34.38% to 32.89%. The results, summarized in Table 2, highlight the crucial role of height-aware geometry in achieving accurate 3D occupancy predictions.

Next, to assess the effectiveness of the DHA and DBA modules in preserving geometric structures, we performed additional ablation studies on the nuScenes dataset, as shown in Table 3. The results confirm that the direction-specific dynamic convolutions in DHA and DBA are effective at capturing structural features at a small computational cost: enabling both modules raises mIoU from 32.58% to 34.38%, while the cost grows only from 271.21 to 276.59 GFLOPs and the speed drops from 44.7 to 39.6 FPS.

To further evaluate the robustness of DA-Occ, we conducted experiments without using the Visible Mask. This setting simulates real-world scenarios where the model may not have access to complete or perfect information, enabling us to evaluate the model’s ability to generalize and maintain accuracy under challenging conditions. The results, presented in Table 4, demonstrate that DA-Occ achieves an mIoU of 31.31% without the Visible Mask, indicating its strong resilience to missing data and its robustness in less-than-ideal environments.

Finally, we simulate the inference speed of DA-Occ on edge devices by imposing several hardware constraints to mimic real-world deployment environments. Specifically, we limit the available GPU memory to 4GB (with the remaining memory occupied to simulate concurrent tasks on the edge device) and restrict the GPU’s maximum power consumption to 90W. As shown in Table 5, even under these simulated deployment constraints, DA-Occ still runs at 14.8 FPS. This supports the applicability of DA-Occ for real-time inference on edge devices.
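The FPS values in this paper are measured with the mmdetection3d codebase. For readers who want to reproduce a comparable number in their own setup, a generic timing loop might look like the sketch below. It is not the benchmark script of that codebase: `measure_fps` and `dummy_inference` are hypothetical names, and the dummy workload merely stands in for a model forward pass.

```python
import time


def measure_fps(run_once, warmup=5, iters=20):
    """Average frames per second of a callable that processes one frame.

    Warm-up iterations are excluded from timing so that one-off costs
    (caching, lazy initialization) do not distort the result.
    """
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    elapsed = time.perf_counter() - start
    return iters / elapsed


def dummy_inference():
    """Placeholder workload standing in for a single-frame forward pass."""
    return sum(i * i for i in range(10_000))


fps = measure_fps(dummy_inference)
print(f"{fps:.1f} FPS")
```

When timing GPU inference, the device should be synchronized before each timer read, because GPU kernels are launched asynchronously and the host-side clock would otherwise stop before the work has finished.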

4.5 Visualization Studies

In the visualization experiments, we adopt FlashOcc [14] as the baseline model, including both non-temporal variants and deeply supervised versions for comparison. A variety of scenarios from the validation set are used to ensure a comprehensive evaluation. All visualizations are generated using the official FlashOcc codebase. As shown in Figure 6, the red box highlights a significant semantic loss along the vertical axis in FlashOcc. In contrast, DA-Occ maintains geometric structures in both horizontal and vertical directions, even within a purely 2D framework. This structural consistency contributes substantially to its mIoU performance.

5 Conclusion

In this work, we propose DA-Occ, a novel directional 2D framework for efficient and accurate 3D occupancy prediction. The method leverages height-aware voxel slicing and incorporates a directional attention mechanism, including Directional Height Attention (DHA) and Directional BEV Attention (DBA). This design effectively preserves geometric structures, particularly along the vertical axis, while maintaining high inference speed. Extensive experiments on the Occ3D-nuScenes dataset demonstrate that DA-Occ achieves a favorable trade-off between accuracy and efficiency. Compared with existing approaches, it shows superior real-time performance and stronger deployment potential. The proposed framework enhances scene understanding in autonomous driving scenarios and provides a practical solution for real-world applications.

Acknowledgments

This work was supported by Nanjing Desay SV Automotive Co., Ltd. under Grant No. TRD2025001.

References

  • [1] Anh-Quan Cao and Raoul De Charette. Monoscene: Monocular 3d semantic scene completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3991–4001, 2022.
  • [2] Xiaoyu Tian, Tao Jiang, Longfei Yun, Yucheng Mao, Huitong Yang, Yue Wang, Yilun Wang, and Hang Zhao. Occ3d: A large-scale 3d occupancy prediction benchmark for autonomous driving. Advances in Neural Information Processing Systems, 36:64318–64330, 2023.
  • [3] Yinhao Li, Zheng Ge, Guanyi Yu, Jinrong Yang, Zengran Wang, Yukang Shi, Jianjian Sun, and Zeming Li. Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages 1477–1485, 2023.
  • [4] Zhiqi Li, Zhiding Yu, Wenhai Wang, Anima Anandkumar, Tong Lu, and Jose M Alvarez. Fb-bev: Bev representation from forward-backward view transformations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6919–6928, 2023.
  • [5] Yuan Wu, Zhiqiang Yan, Zhengxue Wang, Xiang Li, Le Hui, and Jian Yang. Deep height decoupling for precise vision-based 3d occupancy prediction. arXiv preprint arXiv:2409.07972, 2024.
  • [6] Tingting Liang, Hongwei Xie, Kaicheng Yu, Zhongyu Xia, Zhiwei Lin, Yongtao Wang, Tao Tang, Bing Wang, and Zhi Tang. Bevfusion: A simple and robust lidar-camera fusion framework. Advances in Neural Information Processing Systems, 35:10421–10434, 2022.
  • [7] Tianwei Yin, Xingyi Zhou, and Philipp Krahenbuhl. Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11784–11793, 2021.
  • [8] Jiawei Hou, Xiaoyan Li, Wenhao Guan, Gang Zhang, Di Feng, Yuheng Du, Xiangyang Xue, and Jian Pu. Fastocc: Accelerating 3d occupancy prediction by fusing the 2d bird’s-eye view and perspective view. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 16425–16431. IEEE, 2024.
  • [9] Qihang Ma, Xin Tan, Yanyun Qu, Lizhuang Ma, Zhizhong Zhang, and Yuan Xie. Cotr: Compact occupancy transformer for vision-based 3d occupancy prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19936–19945, 2024.
  • [10] Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Jie Zhou, and Jiwen Lu. Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21729–21740, 2023.
  • [11] Junjie Huang, Guan Huang, Zheng Zhu, Yun Ye, and Dalong Du. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv preprint arXiv:2112.11790, 2021.
  • [12] Yulin He, Wei Chen, Siqi Wang, Tianci Xun, and Yusong Tan. Achieving speed-accuracy balance in vision-based 3d occupancy prediction via geometric-semantic disentanglement. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 3455–3463, 2025.
  • [13] Sungjin Park, Jaeha Song, and Soonmin Hwang. Efficient occupancy prediction with instance-level attention. In 2025 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), pages 0103–0107. IEEE, 2025.
  • [14] Zichen Yu, Changyong Shu, Jiajun Deng, Kangjie Lu, Zongdai Liu, Jiangyong Yu, Dawei Yang, Hui Li, and Yan Chen. Flashocc: Fast and memory-efficient occupancy prediction via channel-to-height plugin. arXiv preprint arXiv:2311.12058, 2023.
  • [15] Feng Li, Kun Xu, Zhaoyue Wang, Yunduan Cui, Mohammad Masum Billah, and Jia Liu. Instancebev: Unifying instance and bev representation for global modeling. arXiv preprint arXiv:2505.13817, 2025.
  • [16] Zhiqiang Yan, Kun Wang, Xiang Li, Zhenyu Zhang, Jun Li, and Jian Yang. Desnet: Decomposed scale-consistent network for unsupervised depth completion. In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages 3109–3117, 2023.
  • [17] Zhiqiang Yan, Xiang Li, Le Hui, Zhenyu Zhang, Jun Li, and Jian Yang. Rignet++: Semantic assisted repetitive image guided network for depth completion. International Journal of Computer Vision, pages 1–23, 2025.
  • [18] Zhiqiang Yan, Zhengxue Wang, Kun Wang, Jun Li, and Jian Yang. Completion as enhancement: A degradation-aware selective image guided network for depth completion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26943–26953, 2025.
  • [19] Kun Wang, Zhenyu Zhang, Zhiqiang Yan, Xiang Li, Baobei Xu, Jun Li, and Jian Yang. Regularizing nighttime weirdness: Efficient self-supervised monocular depth estimation in the dark. In Proceedings of the IEEE/CVF international conference on computer vision, pages 16055–16064, 2021.
  • [20] Zhiqiang Yan, Yupeng Zheng, Deng-Ping Fan, Xiang Li, Jun Li, and Jian Yang. Learnable differencing center for nighttime depth perception. Visual Intelligence, 2(1):15, 2024.
  • [21] Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, pages 194–210. Springer, 2020.
  • [22] Yinhao Li, Han Bao, Zheng Ge, Jinrong Yang, Jianjian Sun, and Zeming Li. Bevstereo: Enhancing depth estimation in multi-view 3d object detection with temporal stereo. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 1486–1494, 2023.
  • [23] Ruihang Miao, Weizhou Liu, Mingrui Chen, Zheng Gong, Weixin Xu, Chen Hu, and Shuchang Zhou. Occdepth: A depth-aware method for 3d semantic scene completion. arXiv preprint arXiv:2302.13540, 2023.
  • [24] Jinqing Zhang, Yanan Zhang, Qingjie Liu, and Yunhong Wang. Sa-bev: Generating semantic-aware bird’s-eye-view feature for multi-view 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3348–3357, 2023.
  • [25] Yichen Xie, Chenfeng Xu, Marie-Julie Rakotosaona, Patrick Rim, Federico Tombari, Kurt Keutzer, Masayoshi Tomizuka, and Wei Zhan. Sparsefusion: Fusing multi-modal sparse representations for multi-sensor 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17591–17602, 2023.
  • [26] Junjie Huang and Guan Huang. Bevpoolv2: A cutting-edge implementation of bevdet toward deployment. arXiv preprint arXiv:2211.17111, 2022.
  • [27] Xiaowei Chi, Jiaming Liu, Ming Lu, Rongyu Zhang, Zhaoqing Wang, Yandong Guo, and Shanghang Zhang. Bev-san: Accurate bev 3d object detection via slice attention networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17461–17470, 2023.
  • [28] Yuqi Wang, Yuntao Chen, Xingyu Liao, Lue Fan, and Zhaoxiang Zhang. Panoocc: Unified occupancy representation for camera-based 3d panoptic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17158–17168, 2024.
  • [29] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [30] Haokui Zhang, Wenze Hu, and Xiaoyu Wang. Parc-net: Position aware circular convolution with merits from convnets and transformer. In European conference on computer vision, pages 613–630. Springer, 2022.
  • [31] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020.
  • [32] Haisong Liu, Yang Chen, Haiguang Wang, Zetong Yang, Tianyu Li, Jia Zeng, Li Chen, Hongyang Li, and Limin Wang. Fully sparse 3d occupancy prediction. In European Conference on Computer Vision, pages 54–71. Springer, 2024.
  • [33] Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. Tri-perspective view for vision-based 3d semantic occupancy prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9223–9232, 2023.
  • [34] Yunpeng Zhang, Zheng Zhu, and Dalong Du. Occformer: Dual-path transformer for vision-based 3d semantic occupancy prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9433–9443, 2023.