Abstract
Mesh models have become increasingly accessible for numerous cities; however, the lack of realistic textures restricts their application in virtual urban navigation and autonomous driving. To address this, this paper proposes MeSS (Mesh-based Scene Synthesis) for generating high-quality, style-consistent outdoor scenes with city mesh models serving as the geometric prior. While image and video diffusion models can leverage spatial layouts (such as depth maps or HD maps) as control conditions to generate street-level perspective views, they are not directly applicable to 3D scene generation. Video diffusion models excel at synthesizing consistent view sequences that depict scenes but often struggle to adhere to predefined camera paths or align accurately with rendered control videos. In contrast, image diffusion models, though unable to guarantee cross-view visual consistency, can produce more geometry-aligned results when combined with ControlNet. Building on this insight, our approach enhances image diffusion models by improving cross-view consistency. The pipeline comprises three key stages: first, we generate geometrically consistent sparse views using Cascaded Outpainting ControlNets; second, we propagate denser intermediate views via a component dubbed AGInpaint; and third, we globally eliminate visual inconsistencies (e.g., varying exposure) using the GCAlign module. Concurrently with generation, a 3D Gaussian Splatting (3DGS) scene is reconstructed by initializing Gaussian balls on the mesh surface. Our method outperforms existing approaches in both geometric alignment and generation quality. Once synthesized, the scene can be rendered in diverse styles through relighting and style transfer techniques.
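The abstract mentions that the 3DGS scene is reconstructed by initializing Gaussians directly on the mesh surface. The snippet below is a minimal sketch of how such an initialization could look for a triangle mesh, assuming uniform area-weighted surface sampling; it is an illustrative assumption, not the authors' released implementation, and the function and parameter names are hypothetical.

```python
import numpy as np

def init_gaussians_on_mesh(vertices, faces, n_points=100_000, rng=None):
    """Sample Gaussian centers uniformly on a triangle mesh surface.

    vertices: (V, 3) float array, faces: (F, 3) int array of vertex indices.
    Returns centers (N, 3), normals (N, 3), and isotropic scales (N,).
    Schematic initialization only; the paper's exact procedure may differ.
    """
    rng = np.random.default_rng() if rng is None else rng
    tri = vertices[faces]                       # (F, 3, 3) triangle corners
    e1, e2 = tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0]
    face_normals = np.cross(e1, e2)             # unnormalized -> area-weighted
    areas = 0.5 * np.linalg.norm(face_normals, axis=1)

    # Sample faces proportionally to their area so coverage is uniform.
    face_idx = rng.choice(len(faces), size=n_points, p=areas / areas.sum())

    # Uniform barycentric sampling inside each chosen triangle.
    u, v = rng.random(n_points), rng.random(n_points)
    flip = u + v > 1.0
    u[flip], v[flip] = 1.0 - u[flip], 1.0 - v[flip]
    centers = (tri[face_idx, 0]
               + u[:, None] * e1[face_idx]
               + v[:, None] * e2[face_idx])

    normals = face_normals[face_idx]
    normals /= np.linalg.norm(normals, axis=1, keepdims=True) + 1e-12

    # Heuristic: tie the initial Gaussian footprint to local triangle size.
    scales = np.sqrt(areas[face_idx])
    return centers, normals, scales
```

Anchoring the initial centers and normals to the mesh surface gives the subsequent optimization a strong geometric prior, which is what lets the generated views stay aligned with the untextured city geometry.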
Method
The MeSS pipeline synthesizes viewpoints for reconstructing a Gaussian scene following a sparse-to-dense scheme. Given a 3D city map (i.e., a mesh model with semantic and instance labels but without texture), we specify a virtual camera path via a sequence of M views. In Stage I, we generate a subset of N key view images along the sequence via a warp-and-outpaint procedure: starting from the initial key frame generated by the geometry-conditioned ControlNet-S, each preceding key frame is warped into the next view and serves as an additional condition for outpainting the new key frame with ControlNet-N (Sec. 3.2). After obtaining all key frames, we use them to construct a Gaussian field by optimizing Gaussian surfels on the surface of the mesh model. In Stage II, we render from the Gaussian scene the intermediate views between each pair of consecutive key views. Artifacts such as silhouettes in the intermediate frames are filled in by Appearance-Guided Inpainting (Sec. 3.3). Lastly, Global Consistency Alignment (Sec. 3.4) further enhances the appearance consistency of the Gaussian surfels learned from different views.
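To make the Stage I warp-and-outpaint procedure concrete, the following Python-style sketch walks through the sparse key-view loop described above: the first key frame is generated purely from geometric conditions rendered from the mesh, and each subsequent key frame is outpainted from a warp of the previous one. All callables here (render_mesh_conditions, warp_to_view, controlnet_s, controlnet_n) are hypothetical placeholders for illustration, not the released API.

```python
import torch

@torch.no_grad()
def generate_key_views(mesh, key_cams, controlnet_s, controlnet_n,
                       render_mesh_conditions, warp_to_view, prompt):
    """Stage-I sketch: warp-and-outpaint propagation of sparse key frames.

    controlnet_s: geometry-conditioned model that synthesizes the first view.
    controlnet_n: outpainting model additionally conditioned on the warped
                  previous key frame. All callables are assumed interfaces.
    """
    key_frames = []

    # First key frame: generated from geometric conditions (e.g., depth and
    # semantic maps) rasterized from the untextured city mesh.
    cond0 = render_mesh_conditions(mesh, key_cams[0])
    key_frames.append(controlnet_s(prompt, cond0))

    for prev_cam, cam in zip(key_cams[:-1], key_cams[1:]):
        cond = render_mesh_conditions(mesh, cam)

        # Warp the previous key frame into the new camera using mesh geometry;
        # pixels without a source correspondence stay empty and define the
        # outpainting mask.
        warped, mask = warp_to_view(key_frames[-1], prev_cam, cam, mesh)

        # Outpaint the disoccluded regions while keeping the warped content
        # fixed, so the new key frame stays geometry-aligned and consistent
        # with its predecessor.
        key_frames.append(controlnet_n(prompt, cond, warped, mask))

    return key_frames
```

The returned key frames then supervise the optimization of the Gaussian surfels before Stage II densifies the sequence with intermediate views.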
Stylized Videos through Relighting or SDEdit
BibTeX
@article{chen2025mess,
title={MeSS: City mesh-guided outdoor scene generation with cross-view consistent diffusion},
author={Chen, Xuyang and Zhai, Zhijun and Zhou, Kaixuan and Wang, Zengmao and He, Jianan and Wang, Dong and Zhang, Yanfeng and Westermann, R{\"u}diger and Schindler, Konrad and Meng, Liqiu and others},
journal={arXiv preprint arXiv:2508.15169},
year={2025}
}