Photo with Shuning in 2025.
linzhan [at] princeton [dot] edu
I am a first-year CS Ph.D. student at Princeton, advised by Prof. Szymon Rusinkiewicz. I also collaborate closely with Prof. Adam Finkelstein.
I work on vision, robot learning, and machine learning in the Princeton PIXL Group, with recent interests in world modeling and multi-modal reasoning for embodied agents.
DIMO: Diverse 3D Motion Generation for Arbitrary Objects
ICCV 2025 (Highlight)
[abs]
[arXiv]
[website]
[code]
[video]
[poster]
We present DIMO, a generative approach that produces diverse 3D motions for arbitrary objects from a single image. The core idea is to leverage the rich priors in well-trained video models to extract common motion patterns and embed them into a shared low-dimensional latent space. Specifically, we first generate multiple videos of the same object with diverse motions. We then embed each motion into a latent vector and train a shared motion decoder to learn the distribution of motions, represented in a structured and compact form as neural key-point trajectories. Canonical 3D Gaussians are driven by these key points and fused to model geometry and appearance. At inference time, the learned latent space lets us sample diverse 3D motions in a single forward pass, supporting applications such as 3D motion interpolation and language-guided motion generation.
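As a rough illustration of the sampling mechanism described above, the sketch below stands in a toy "motion decoder" (a fixed random linear map, not the learned network) that turns a latent code into key-point trajectories; all dimensions and names are hypothetical, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: latent dim D, key points K, frames T.
D, K, T = 16, 32, 24

# Toy stand-in for the learned shared motion decoder: a fixed linear map
# from a latent motion code to per-frame 3D key-point positions.
W = rng.normal(size=(D, T * K * 3)) * 0.01

def decode_motion(z):
    """Map a latent motion code z of shape (D,) to trajectories (T, K, 3)."""
    return (z @ W).reshape(T, K, 3)

# Sampling a diverse motion is a single forward pass per latent draw.
z1, z2 = rng.normal(size=D), rng.normal(size=D)
traj1, traj2 = decode_motion(z1), decode_motion(z2)

# 3D motion interpolation: blend in latent space, then decode once.
alpha = 0.5
traj_mid = decode_motion((1 - alpha) * z1 + alpha * z2)

print(traj1.shape, traj_mid.shape)
```

Because the decoder is linear here, interpolating latents interpolates trajectories exactly; the learned decoder is nonlinear, but the sampling and interpolation pattern is the same.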
Let Occ Flow: Self-Supervised 3D Occupancy Flow Prediction
CoRL 2024
[abs]
[arXiv]
[website]
[code]
[video]
[poster]
Accurate perception of the dynamic environment is a fundamental task for autonomous driving and robotic systems. This paper introduces Let Occ Flow, the first self-supervised work for joint 3D occupancy and occupancy flow prediction using only camera inputs, eliminating the need for 3D annotations. Using a tri-perspective view (TPV) for unified scene representation and deformable attention layers for feature aggregation, our approach incorporates a novel attention-based temporal fusion module to capture dynamic object dependencies, followed by a 3D refinement module for a fine-grained volumetric representation. In addition, our method extends differentiable rendering to 3D volumetric flow fields, leveraging zero-shot 2D segmentation and optical flow cues for dynamic decomposition and motion optimization. Extensive experiments on the nuScenes and KITTI datasets demonstrate the competitive performance of our approach against prior state-of-the-art methods.
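The extension of differentiable rendering to volumetric flow fields amounts to alpha-compositing a per-sample flow vector along each ray with the usual NeRF-style rendering weights. A minimal single-ray sketch, with purely illustrative numbers (densities, step sizes, and the constant flow field are assumptions, not values from the paper):

```python
import numpy as np

# One ray with N samples; each sample carries a density and a 3D flow vector.
N = 8
sigma = np.full(N, 0.5)                 # per-sample densities (illustrative)
delta = np.full(N, 1.0)                 # distances between adjacent samples
flow = np.tile(np.array([1.0, 0.0, 0.0]), (N, 1))  # per-sample 3D flow

# Standard volume-rendering weights: opacity times accumulated transmittance.
alpha = 1.0 - np.exp(-sigma * delta)
trans = np.cumprod(np.concatenate(([1.0], 1.0 - alpha[:-1])))
weights = trans * alpha

# Composited (expected) flow for this ray, comparable against 2D flow cues.
rendered_flow = (weights[:, None] * flow).sum(axis=0)
print(rendered_flow)
```

Since every step is differentiable, a loss between the rendered flow and 2D optical-flow supervision can propagate gradients back into the volumetric flow field, which is the idea behind self-supervised motion optimization here.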