Driving with DINO: Vision Foundation Features as a Unified Bridge for Sim-to-Real Generation in Autonomous Driving

Xuyang Chen1,2*, Conglang Zhang3,4*, Chuanheng Fu3,4*, Zihao Yang5, Kaixuan Zhou3†✉, Yizhi Zhang3, Jianan He3, Yanfeng Zhang2, Mingwei Sun3,4, Zengmao Wang4✉, Zhen Dong4, Xiaoxiao Long6, Liqiu Meng1
1Technical University of Munich, 2Huawei Hilbert Research Center (Dresden), 3Huawei Riemann Lab, 4Wuhan University, 5University of Science and Technology of China, 6Nanjing University
*Equal contribution, †Project lead, ✉Corresponding author
DwD Teaser Figure

Sim-to-Real Results

Drag the slider to compare simulation input (left) with our DwD output (right). Use the arrows to browse different scenes.

Abstract

With the emergence of controllable video diffusion models, existing Sim2Real methods for autonomous driving video generation typically rely on explicit intermediate representations to bridge the domain gap. However, these modalities face a fundamental Consistency-Realism Dilemma: low-level signals (e.g., edges, blurred images) ensure precise control but compromise realism by "baking in" synthetic artifacts, whereas high-level priors (e.g., depth, semantics, HD maps) facilitate photorealism but lack the structural detail required for consistent guidance.

In this work, we present Driving with DINO (DwD), a novel framework that leverages Vision Foundation Model (VFM) features as a unified bridge between the simulation and real-world domains. We first observe that these features encode a spectrum of information, from high-level semantics to fine-grained structure. To exploit this spectrum effectively, we apply Principal Subspace Projection to discard the high-frequency components responsible for "texture baking," while concurrently introducing Random Channel Tail Drop to mitigate the structural loss inherent in a rigid dimensionality reduction, thereby reconciling realism with control consistency. Furthermore, to fully leverage DINOv3's high-resolution capabilities for enhancing control precision, we introduce a learnable Spatial Alignment Module that adapts these high-resolution features to the diffusion backbone. Finally, we propose a Causal Temporal Aggregator that uses causal convolutions to explicitly preserve historical motion context when integrating frame-wise DINO features, which mitigates motion blur and ensures temporal stability.

Extensive experiments show that our approach achieves state-of-the-art performance, significantly outperforming existing baselines in generating photorealistic driving videos that remain faithfully aligned with the simulation input.
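To make the feature-conditioning steps described in the abstract more concrete, below is a minimal PyTorch sketch, assuming per-frame DINOv3 patch features of shape (B, T, C, H, W), a precomputed PCA basis, and a simple per-sample tail-drop schedule. All names, shapes, and hyperparameters (basis, k, k_min, kernel_size) are illustrative assumptions rather than the actual DwD implementation, and the learnable Spatial Alignment Module is omitted.

```python
# Illustrative sketch only: shapes, names, and hyperparameters are assumptions,
# not the DwD implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


def principal_subspace_projection(feats, basis, k):
    """Project DINO features onto the top-k principal directions.

    feats: (B, T, C, H, W) per-frame patch features
    basis: (C, C) orthonormal PCA basis, columns sorted by explained variance
    k:     number of leading components to keep
    Dropping the trailing components removes the high-frequency detail that
    would otherwise "bake" synthetic textures into the generated video.
    """
    B, T, C, H, W = feats.shape
    flat = feats.permute(0, 1, 3, 4, 2).reshape(-1, C)           # (B*T*H*W, C)
    coeffs = flat @ basis[:, :k]                                 # (B*T*H*W, k)
    return coeffs.reshape(B, T, H, W, k).permute(0, 1, 4, 2, 3)  # (B, T, k, H, W)


def random_channel_tail_drop(coeffs, k_min, training=True):
    """Randomly zero a trailing block of components per sample during training,
    so the model does not overfit to one fixed, rigid dimensionality cut-off."""
    if not training:
        return coeffs
    B, T, K, H, W = coeffs.shape
    keep = torch.randint(k_min, K + 1, (B,), device=coeffs.device)  # per-sample cut
    idx = torch.arange(K, device=coeffs.device).view(1, 1, K, 1, 1)
    mask = (idx < keep.view(B, 1, 1, 1, 1)).to(coeffs.dtype)
    return coeffs * mask


class CausalTemporalAggregator(nn.Module):
    """Aggregate per-frame conditioning features with a causal 1D convolution
    over time, so each frame only sees its own and past frames."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.pad = kernel_size - 1
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):                                   # x: (B, T, C, H, W)
        B, T, C, H, W = x.shape
        seq = x.permute(0, 3, 4, 2, 1).reshape(B * H * W, C, T)
        seq = F.pad(seq, (self.pad, 0))                     # left-pad => causal
        out = self.conv(seq)                                # (B*H*W, C, T)
        return out.reshape(B, H, W, C, T).permute(0, 4, 3, 1, 2)
```

In this reading, left-padding the temporal convolution keeps every output frame dependent only on current and past frames, which is how historical motion context can be preserved without peeking ahead; the projected, tail-dropped coefficients would then be spatially aligned to the diffusion backbone's latent resolution before being used as conditioning.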

Method Overview

DwD Method Overview

Comparison with Other Methods

Compare our DwD method with other Sim-to-Real approaches. Use the dropdown to select different methods, and drag the slider to compare with the simulation input. Use the arrows to browse different scenes.

Simulation
DwD (Ours)

Long Video Generation

Our method can generate long driving videos with consistent quality. Compare different methods on a long driving sequence.

Simulation
DwD (Ours)

Zero-shot Results on Reconstructed Bird's-Eye-View Meshes

Our method generalizes to novel viewpoints rendered from reconstructed bird's-eye-view meshes without any fine-tuning.

Rendered BEV Mesh
DwD Output

Qualitative Comparison Charts

Detailed visual comparisons of our method against baselines across different scenarios.

BibTeX

Coming soon.