Chensheng Peng

PhD Student

University of California, Berkeley

About me

I am a Ph.D. student at UC Berkeley, affiliated with the Berkeley AI Research (BAIR) Lab and Berkeley DeepDrive (BDD).

I received my Bachelor's degree from Shanghai Jiao Tong University.

Research Interests

Generation / Reconstruction
- 3D/4D Generation
- Gaussian Splatting/NeRF
Perception
- Multi-Sensor Fusion
- Efficient Vision Algorithms
- Foundation Models

Publications/Preprints

* denotes equal contribution

DeSiRe-GS: 4D Street Gaussians for Static-Dynamic Decomposition and Surface Reconstruction for Urban Driving Scenes

Chensheng Peng*, Chengwei Zhang, Yixiao Wang, Chenfeng Xu, Yichen Xie, Wenzhao Zheng, Kurt Keutzer, Masayoshi Tomizuka

2025 Computer Vision and Pattern Recognition Conference (CVPR)

We present DeSiRe-GS, a self-supervised Gaussian splatting representation that enables effective static-dynamic decomposition and high-fidelity surface reconstruction in complex driving scenarios. Combined with the introduced geometric regularizations, our method is able to address the over-fitting issues caused by data sparsity in autonomous driving, reconstructing physically plausible Gaussians that align with object surfaces rather than floating in the air.

A Lesson in Splats: Teacher-Guided Diffusion for 3D Gaussian Splats Generation with 2D Supervision

Chensheng Peng, Ido Sobol, Masayoshi Tomizuka, Kurt Keutzer, Chenfeng Xu, Or Litany

2024 arXiv

We introduce a diffusion model for Gaussian Splats, SplatDiffusion, to enable generation of three-dimensional structures from single images, addressing the ill-posed nature of lifting 2D inputs to 3D. Existing methods rely on deterministic, feed-forward predictions, which limit their ability to handle the inherent ambiguity of 3D inference from 2D data. Diffusion models have recently shown promise as powerful generative models for 3D data, including Gaussian splats; however, standard diffusion frameworks typically require the target signal and denoised signal to be in the same modality, which is challenging given the scarcity of 3D data. To overcome this, we propose a novel training strategy that decouples the denoised modality from the supervision modality. By using a deterministic model as a noisy teacher to create the noised signal and transitioning from single-step to multi-step denoising supervised by an image rendering loss, our approach significantly enhances performance compared to the deterministic teacher. Additionally, our method is flexible, as it can learn from various 3D Gaussian Splat (3DGS) teachers with minimal adaptation; we demonstrate this by surpassing the performance of two different deterministic models as teachers, highlighting the potential generalizability of our framework. Our approach further incorporates a guidance mechanism to aggregate information from multiple views, enhancing reconstruction quality when more than one view is available.

X-Drive: Cross-modality consistent multi-sensor data synthesis for driving scenarios

Yichen Xie*, Chenfeng Xu*, Chensheng Peng, Shuqi Zhao, Nhat Ho, Alexander T. Pham, Mingyu Ding, Masayoshi Tomizuka, Wei Zhan

2025 International Conference on Learning Representations (ICLR)

We propose a novel framework, X-DRIVE, to model the joint distribution of point clouds and multi-view images via a dual-branch latent diffusion model architecture. Considering the distinct geometrical spaces of the two modalities, X-DRIVE conditions the synthesis of each modality on the corresponding local regions from the other modality, ensuring better alignment and realism. To further handle the spatial ambiguity during denoising, we design the cross-modality condition module based on epipolar lines to adaptively learn the cross-modality local correspondence.

CompGS: Unleashing 2D Compositionality for Compositional Text-to-3D via Dynamically Optimizing 3D Gaussians

Chongjian Ge, Chenfeng Xu, Yuanfeng Ji, Chensheng Peng, Masayoshi Tomizuka, Ping Luo, Mingyu Ding, Wei Zhan, Varun Jampani

2025 Computer Vision and Pattern Recognition Conference (CVPR)

We introduce CompGS, a novel generative framework that employs 3D Gaussian Splatting (GS) for efficient, compositional text-to-3D content generation. CompGS automatically decomposes 3D Gaussians into distinct entity parts, enabling optimization at both the entity and composition levels.

Q-SLAM: Quadric Representations for Monocular SLAM

Chensheng Peng*, Chenfeng Xu*, Yue Wang, Mingyu Ding, Heng Yang, Masayoshi Tomizuka, Kurt Keutzer, Marco Pavone

2024 Conference on Robot Learning (CoRL)

In this study, we propose a novel approach that reimagines volumetric representations through the lens of quadric forms. We posit that most scene components can be effectively represented as quadric planes. Leveraging this assumption, we reshape the volumetric representations, which consist of millions of cubes, into several quadric planes, leading to more accurate and efficient modeling of 3D scenes in SLAM contexts.

DELFlow: Dense Efficient Learning of Scene Flow for Large-Scale Point Clouds

Chensheng Peng, Guangming Wang, Xian Wan Lo, Xinrui Wu, Chenfeng Xu, Masayoshi Tomizuka, Wei Zhan, Hesheng Wang

2023 International Conference on Computer Vision (ICCV)

Point clouds are naturally sparse, while image pixels are dense. This inconsistency limits the fusion of features from the two modalities for point-wise scene flow estimation. We regularize raw points to a dense format by storing 3D coordinates in 2D grids. We also present a novel warping projection technique to alleviate the information loss problem. Extensive experiments demonstrate the efficiency and effectiveness of our method, which outperforms prior art on the FlyingThings3D and KITTI datasets.
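
As a rough illustration of storing 3D coordinates in a 2D grid, the sketch below projects points through a pinhole camera and writes each point's XYZ into the cell it lands in. It is not the paper's implementation: the function name `points_to_grid`, the intrinsics `fx, fy, cx, cy`, and the rule that the nearest point wins a contested cell are all assumptions made for this example.

```python
import numpy as np

def points_to_grid(points, fx, fy, cx, cy, height, width):
    """Store 3D points in an (H, W, 3) grid indexed by their pinhole projection.

    Hypothetical helper: when several points fall into one cell, only the
    nearest one is kept (an assumption of this sketch).
    """
    grid = np.zeros((height, width, 3), dtype=np.float32)
    valid = np.zeros((height, width), dtype=bool)

    pts = points[points[:, 2] > 0]  # keep points in front of the camera
    u = np.round(fx * pts[:, 0] / pts[:, 2] + cx).astype(int)
    v = np.round(fy * pts[:, 1] / pts[:, 2] + cy).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    pts, u, v = pts[inside], u[inside], v[inside]

    # Sort by depth (nearest first), then keep the first point per cell.
    order = np.argsort(pts[:, 2], kind="stable")
    cell = (v * width + u)[order]
    _, first = np.unique(cell, return_index=True)
    keep = order[first]

    grid[v[keep], u[keep]] = pts[keep]
    valid[v[keep], u[keep]] = True
    return grid, valid
```

The returned `valid` mask marks which cells hold a point, so the dense grid can be processed alongside image features. This sketch simply discards points that collide in a cell; it does not attempt the warping projection technique mentioned above.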

Interactive multi-scale fusion of 2D and 3D features for multi-object vehicle tracking

Chensheng Peng*, Guangming Wang*, Yingying Gu, Jinpeng Zhang, Hesheng Wang

2023 IEEE Transactions on Intelligent Transportation Systems

In this paper, we propose multi-scale interactive query and fusion between pixel-wise and point-wise features to obtain more discriminative features. In addition, an attention mechanism is used to conduct soft feature fusion between multiple pixels and points, avoiding the inaccurate matching problems of previous single pixel-point fusion methods. Our method achieves 90.32% MOTA and 72.44% HOTA on the KITTI benchmark and outperforms approaches that do not use multi-scale soft feature fusion.
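
The idea of soft fusion between one point and multiple pixels can be sketched as attention-weighted averaging. This is only an illustration under assumptions: the function name `soft_fuse`, the scaled dot-product scoring, and the concatenation at the end are choices made for this example, not details taken from the paper.

```python
import numpy as np

def soft_fuse(point_feat, pixel_feats):
    """Fuse one point feature (C,) with K candidate pixel features (K, C).

    Hypothetical sketch: the point feature acts as the query, and the pixel
    features are averaged with softmax weights instead of picking one pixel.
    """
    scores = pixel_feats @ point_feat / np.sqrt(point_feat.shape[0])
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    fused_pixels = weights @ pixel_feats     # (C,) weighted pixel feature
    return np.concatenate([point_feat, fused_pixels]), weights
```

Because every candidate pixel contributes according to its weight, a slightly misprojected point still receives information from nearby pixels rather than from a single, possibly wrong, match.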

PNAS-MOT: Multi-Modal Object Tracking with Pareto Neural Architecture Search

Chensheng Peng, Zhaoyu Zeng, Jinling Gao, Jundong Zhou, Masayoshi Tomizuka, Xinbing Wang, Chenghu Zhou, Nanyang Ye

2024 IEEE Robotics and Automation Letters (RAL)

In this paper, we explore the use of neural architecture search (NAS) methods to find efficient architectures for tracking, aiming for low real-time latency while maintaining relatively high accuracy. We also propose a multi-modal framework to improve robustness. Experiments demonstrate that our algorithm can run on edge devices within low latency constraints, greatly reducing the computational requirements for multi-modal object tracking.
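
The latency-accuracy trade-off behind a Pareto-style search can be illustrated by filtering candidate architectures down to the non-dominated ones. This is a generic sketch, not the search procedure from the paper: the function name `pareto_front` and the `(name, latency_ms, accuracy)` tuple layout are assumptions made for this example.

```python
def pareto_front(candidates):
    """Return the candidates that no other candidate dominates.

    Each candidate is a hypothetical (name, latency_ms, accuracy) tuple.
    A candidate is dominated if another one is at least as good on both
    criteria (lower latency, higher accuracy) and strictly better on one.
    """
    front = []
    for name, lat, acc in candidates:
        dominated = any(
            (l2 <= lat and a2 >= acc) and (l2 < lat or a2 > acc)
            for _, l2, a2 in candidates
        )
        if not dominated:
            front.append((name, lat, acc))
    return front
```

Any architecture left on the front cannot be made faster without losing accuracy, or more accurate without adding latency, so a deployment target such as an edge device only needs to choose among these.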

