SemCityLoc: Aerial 6DoF Localization Using Semantic 3D City Models

Jingfeng Mao¹, Xuyang Chen¹, Qilin Zhang¹, Oussema Dhaouadi¹, Guangming Wang², Brian Sheil², Daniel Cremers¹, Yan Xia³, Olaf Wysocki²
¹Technical University of Munich, ²University of Cambridge, ³University of Science and Technology of China
ECCV 2026
[Figure: SemCityLoc teaser]

SemCityLoc estimates 6DoF UAV poses by aligning foundation-model semantics and monocular depth with lightweight, semantically structured 3D city models — without radiometric scene reconstructions. We further introduce SemCityLockeD, a centimeter-accurate benchmark combining standardized LoD city models with challenging low-altitude UAV imagery.

[Video: SemCityLoc in 2 Minutes]

Abstract

Aerial 6DoF localization typically relies on precise GNSS signals or radiometrically rich 3D reconstructions, limiting scalability and on-board deployment. We propose SemCityLoc, a semantic–geometric alignment system that reframes aerial pose estimation as structured surface registration between foundation-model-derived visual priors and standardized LoD-compliant 3D city models.

Instead of matching sparse contours or dense texture, our method aligns semantic surfaces and monocular depth with lightweight semantic 3D building models, increasing pose discriminability in repetitive and occluded urban environments. To enable accurate evaluation, we introduce SemCityLockeD, the first real-world benchmark combining centimeter-accurate UAV poses with standardized LoD1–LoD3 semantic city models and challenging low-altitude imagery.

Experiments show substantial gains over existing map-based approaches: recall rises by up to 36 percentage points and mean positional error drops from 9.89 m to 2.62 m in challenging urban canyons. Our results indicate that semantically structured geometry provides sufficient and scalable constraints for high-precision aerial localization without radiometric scene reconstructions.

  • +34 percentage points: recall gain at 2m–2° (35.1% → 69.2%)
  • 2.62 m: mean error in urban canyons (from 9.89 m)
  • 0.88 s: per-image runtime (GPU rasterization)
  • 962: cm-accurate UAV images with LoD1–LoD3 models

Contributions

  • Semantic–geometric aerial localization. We propose SemCityLoc, an approach that reformulates aerial localization as structured semantic–geometric surface alignment between foundation-model predictions and lightweight LoD city models.
  • Real-world benchmark. We present SemCityLockeD, a challenging dataset combining centimeter-accurate UAV poses, close-range urban imagery, and standardized LoD1–LoD3 semantic city models for accurate evaluation of map-based aerial localization.
  • Extensive evaluation. We demonstrate consistent improvements over state-of-the-art map-based localization approaches on SemCityLockeD and on existing benchmarks, including challenging urban scenarios.

Method Overview

SemCityLoc follows a coarse-to-fine semantic–geometric alignment strategy. Given a query image and an initial pose prior, we first perform a 4D semantic cost-volume search to obtain a coarse pose estimate, then refine it via joint semantic and depth alignment using a particle-filter optimization scheme to recover the final 6DoF camera pose.

[Figure: SemCityLoc pipeline]

Two-stage pose matching. In the coarse stage, a DINOv3-based semantic segmentation model and batch rendering construct a 4D cost volume for coarse pose estimation. In the fine stage, a particle filter fuses semantic and monocular depth (MoGe-2) cues for refined pose estimation.

Coarse Pose Selection

A frozen DINOv3 ViT backbone with a lightweight DPT decoder predicts semantic masks for the query image. Candidate poses are sampled around the prior over position and yaw (x, y, z, ψ); roll and pitch are fixed from gravity. For each pose we render the LoD semantic mask and score it against the query mask with a per-class IoU, building a 4D cost volume whose best-scoring (highest-IoU) entry yields the coarse pose.
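The coarse search can be illustrated with a minimal sketch, assuming label masks stored as small 2D lists and a hypothetical `render_mask(pose)` callable standing in for the LoD renderer. The toy renderer, the grid offsets, and the choice to average IoU over the classes present are illustrative assumptions, not the authors' implementation.

```python
from itertools import product

def per_class_iou(query, rendered, classes):
    """Mean IoU over the classes that appear in at least one of the two masks."""
    ious = []
    for c in classes:
        inter = union = 0
        for q_row, r_row in zip(query, rendered):
            for q, r in zip(q_row, r_row):
                in_q, in_r = q == c, r == c
                inter += in_q and in_r
                union += in_q or in_r
        if union:  # skip classes absent from both masks
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

def coarse_search(query, render_mask, prior, offsets, classes):
    """Score every candidate on a 4D (x, y, z, yaw) grid around the prior."""
    best_pose, best_score = None, -1.0
    for dx, dy, dz, dyaw in product(*offsets):
        pose = (prior[0] + dx, prior[1] + dy, prior[2] + dz, prior[3] + dyaw)
        score = per_class_iou(query, render_mask(pose), classes)
        if score > best_score:
            best_pose, best_score = pose, score
    return best_pose, best_score

def toy_render(pose, size=8):
    """Hypothetical renderer: a square roof (1) with one facade strip (2)."""
    x, y, z, yaw = pose
    w = 4 - z  # toy rule: the footprint shrinks as altitude grows
    mask = [[0] * size for _ in range(size)]
    for r in range(size):
        for c in range(size):
            if y <= r < y + w and x <= c < x + w:
                mask[r][c] = 1  # roof
            elif yaw == 0 and r == y + w and x <= c < x + w:
                mask[r][c] = 2  # facade below the roof
            elif yaw == 1 and c == x + w and y <= r < y + w:
                mask[r][c] = 2  # facade right of the roof
    return mask

query_mask = toy_render((3, 2, 1, 1))          # pretend this came from the segmenter
offsets = [(-1, 0, 1), (-1, 0, 1), (0, 1), (0, 1)]
best_pose, best_score = coarse_search(query_mask, toy_render, (2, 2, 0, 0),
                                      offsets, classes=(1, 2))
print(best_pose, best_score)
```

In a real system the inner loop would be replaced by batched GPU rendering, as the text describes; the sketch only shows how the scores over the 4D grid reduce to a single coarse pose.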

Pose Refinement

We add monocular depth from MoGe-2, aligning estimated and rendered depth via a robust scale–shift term. The final cost combines semantic and depth terms. A particle filter iteratively perturbs the coarse hypothesis, evaluates the cost, and updates the pose in a weighted manner — enabling stable, efficient convergence without exhaustive high-dimensional grid search.
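As a rough illustration of the scale–shift alignment, the sketch below fits a scale s and shift b between predicted and rendered depth, then folds the residual into a combined cost. It uses ordinary least squares as a stand-in, since the exact robust estimator is not given here. The function names, the mean-absolute residual, and the weight `w_depth` are all assumptions.

```python
def fit_scale_shift(pred, ref):
    """Closed-form least squares for s, b minimising sum((s * pred + b - ref) ** 2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_r = sum(ref) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    s = cov / var_p if var_p > 0 else 1.0  # constant prediction: fall back to s = 1
    b = mean_r - s * mean_p
    return s, b

def depth_cost(pred, ref):
    """Mean absolute residual after aligning predicted depth to rendered depth."""
    s, b = fit_scale_shift(pred, ref)
    return sum(abs(s * p + b - r) for p, r in zip(pred, ref)) / len(pred)

def total_cost(semantic_iou, pred_depth, rendered_depth, w_depth=0.5):
    """Hypothetical combination: semantic disagreement plus weighted depth residual."""
    return (1.0 - semantic_iou) + w_depth * depth_cost(pred_depth, rendered_depth)

# Predicted depth that differs from the rendered depth by scale 2 and shift 5.
pred_depth = [1.0, 2.0, 3.0, 4.0]
rendered_depth = [2.0 * d + 5.0 for d in pred_depth]
scale, shift = fit_scale_shift(pred_depth, rendered_depth)
print(scale, shift, total_cost(0.9, pred_depth, rendered_depth))
```

Because the fit absorbs any global scale and offset, only the shape of the depth map contributes to the cost, which is what makes a monocular depth prior usable against metric model renderings.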

Unlike contour-based alignment over sparse edges, semantic surface partitions add region-level structural constraints that increase pose observability — transforming localization from edge matching into structured surface registration.
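The perturb, evaluate, and weighted-update loop of the refinement stage can be sketched as follows. This is a simplified illustration under stated assumptions: exponential (softmin) weights, a fixed annealing schedule, and a toy quadratic cost in place of the semantic and depth terms. The parameter names and default values are hypothetical, and angles are averaged naively, which is only reasonable for small perturbations.

```python
import math
import random

def refine_pose(cost_fn, coarse_pose, sigma, n_particles=64, n_iters=30,
                temperature=0.05, decay=0.85, seed=0):
    """Perturb, score, and re-average pose hypotheses; lower cost means higher weight."""
    rng = random.Random(seed)
    pose, sigma = list(coarse_pose), list(sigma)
    for _ in range(n_iters):
        # Keep the current estimate as one particle, then add Gaussian perturbations.
        particles = [pose] + [
            [p + rng.gauss(0.0, s) for p, s in zip(pose, sigma)]
            for _ in range(n_particles)
        ]
        costs = [cost_fn(p) for p in particles]
        best = min(costs)
        # Softmin weights; subtracting the best cost keeps at least one weight at 1.
        weights = [math.exp(-(c - best) / temperature) for c in costs]
        total = sum(weights)
        pose = [sum(w * p[i] for w, p in zip(weights, particles)) / total
                for i in range(len(pose))]
        sigma = [s * decay for s in sigma]  # shrink the search radius each iteration
    return pose

# Toy stand-in for the semantic + depth cost: squared distance to a hidden pose
# given as (x, y, z, roll, pitch, yaw) in metres and radians.
true_pose = (1.0, -2.0, 0.5, 0.01, -0.02, 0.30)
coarse_pose = (2.5, -3.0, 1.3, 0.04, 0.00, 0.25)

def toy_cost(pose):
    return sum((a - b) ** 2 for a, b in zip(pose, true_pose))

initial_cost = toy_cost(coarse_pose)
refined = refine_pose(toy_cost, coarse_pose, sigma=(1.0, 1.0, 1.0, 0.02, 0.02, 0.05))
final_cost = toy_cost(refined)
print(f"cost before: {initial_cost:.3f}, after: {final_cost:.5f}")
```

Each iteration evaluates only `n_particles + 1` hypotheses, so the cost of refinement grows with the particle count rather than with the size of a dense 6D grid.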

The SemCityLockeD Benchmark

SemCityLockeD is the first real-world benchmark pairing centimeter-accurate UAV poses with standardized LoD semantic 3D city models and challenging low-altitude imagery. It captures a densely built urban environment with closely spaced buildings, narrow urban canyons, and diverse architecture (19th–21st century).

  • 962 images (586 / 188 / 188 train/val/test) at 5,280×3,956 px, 1.6 cm GSD, nadir + oblique.
  • DJI Matrice 350 RTK + Zenmuse L2, ~75 m altitude; poses via RTK-GNSS + IMU + GCPs (~2 cm accuracy, EPSG:32632).
  • Standardized LoD1, LoD2 (semantic), and LoD3 models from the Bavarian state geoportal, plus textured meshes for synthetic rendering.
  • Automatic geometry-consistent semantic labels by projecting model semantics into image space; privacy-filtered imagery.
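The label-projection idea in the last item can be sketched with a pinhole camera and a z-buffer. The sketch assumes labelled 3D sample points rather than full mesh rasterization, a world-to-camera rotation `R`, and a camera centre `C`. All names, intrinsics, and coordinates below are hypothetical values chosen for illustration.

```python
def project_point(point, R, C, fx, fy, cx, cy):
    """Pinhole projection of a world point; returns (u, v, depth) or None if behind the camera."""
    d = [p - c for p, c in zip(point, C)]
    xc, yc, zc = (sum(R[i][j] * d[j] for j in range(3)) for i in range(3))
    if zc <= 0:
        return None
    return fx * xc / zc + cx, fy * yc / zc + cy, zc

def rasterize_labels(points, labels, R, C, fx, fy, cx, cy, width, height):
    """Keep the label of the nearest point per pixel; pixels hit by no point are absent."""
    zbuf, out = {}, {}
    for point, label in zip(points, labels):
        proj = project_point(point, R, C, fx, fy, cx, cy)
        if proj is None:
            continue
        u, v, z = proj
        px = (int(round(u)), int(round(v)))
        inside = 0 <= px[0] < width and 0 <= px[1] < height
        if inside and z < zbuf.get(px, float("inf")):
            zbuf[px], out[px] = z, label
    return out

# Nadir camera 75 m above the origin: image x points east, image y points south,
# and the optical axis points straight down (world axes: east, north, up).
R = [[1, 0, 0], [0, -1, 0], [0, 0, -1]]
C = (0.0, 0.0, 75.0)
points = [
    (3.75, 5.0, 0.0),   # ground point on the same viewing ray as the roof point
    (3.0, 4.0, 15.0),   # roof point, closer to the camera, so it wins the pixel
    (0.0, 0.0, 0.0),    # ground directly below the camera
    (0.0, 0.0, 100.0),  # above the camera, hence behind it and discarded
]
labels = ["ground", "roof", "ground", "sky"]
label_map = rasterize_labels(points, labels, R, C, fx=600, fy=600, cx=320, cy=240,
                             width=640, height=480)
print(label_map)
```

The z-buffer is what keeps the labels geometry-consistent in this toy setup: a ground point hidden behind a roof never overwrites the roof label.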

Pose Quality Comparison

(a) Swiss-EPFL (>75 m, sub-urban)

(b) UAVD4L (high altitude)

(c) SemCityLockeD (nadir)

(d) SemCityLockeD (oblique)

Georeferenced keypoints projected with ground-truth poses align closely with image structures in (c) and (d), indicating high pose consistency under substantially more challenging close-range urban-canyon geometry than in the high-altitude datasets (a) and (b).

Results

Qualitative Comparison vs. LoD-Loc

[Figure: Qualitative comparison]

Projecting LoD geometry with the estimated poses reveals noticeable misalignment for LoD-Loc, especially in oblique views and facade-occluded urban canyons. SemCityLoc produces geometrically consistent overlays and accurate depth alignment (especially apparent in view 3).

Camera Pose Evaluation on SemCityLockeD

Method                          2m–2°    3m–3°    5m–5°  Yaw (°)  XYZ (m)
MC-Loc (DINOv2)                  5.32     8.51    18.09     5.90    10.71
MC-Loc (RoMa)                       0     1.06     3.19     6.67    18.61
LoD-Loc                         35.11    47.87    53.19     1.78     9.89
SemCityLoc (no coarse sel.)     41.49    56.38    72.34     1.11     5.94
SemCityLoc (no refine)          56.38    70.21    81.91     1.04     3.20
SemCityLoc (full)               69.15    84.04    89.36     0.42     2.62

Recall in %, yaw error in degrees, position error in meters. The CAD-Loc variants (e-LoFTR / RoMa) score 0 at all thresholds and are omitted. Coarse selection and semantic–depth refinement are complementary: removing either one lowers recall at every threshold.
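For readers unfamiliar with the recall columns, the sketch below shows one common way such a metric is computed. It assumes an image counts as localized only when both its position error and its rotation error fall within the threshold pair, and that the comparison is inclusive. The error values are made up for illustration.

```python
def recall_at(pos_err_m, rot_err_deg, max_m, max_deg):
    """Percentage of images whose position AND rotation errors are within the thresholds."""
    hits = sum(p <= max_m and r <= max_deg for p, r in zip(pos_err_m, rot_err_deg))
    return 100.0 * hits / len(pos_err_m)

# Made-up per-image errors for five query images.
pos_err_m = [0.8, 1.9, 2.5, 4.0, 7.5]
rot_err_deg = [0.3, 2.4, 1.0, 2.9, 1.2]

recalls = {
    f"{t}m-{t}deg": recall_at(pos_err_m, rot_err_deg, t, t) for t in (2, 3, 5)
}
print(recalls)  # {'2m-2deg': 20.0, '3m-3deg': 60.0, '5m-5deg': 80.0}
```

If the table is computed on the 188-image test split, each image contributes about 0.53 percentage points, so 69.15% would correspond to 130 of 188 images.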

UAVD4L-LoD (out-of-Traj)

Method        2m–2°    5m–5°  XYZ (m)
LoD-Loc       88.18    99.54     1.38
SemCityLoc    89.64    99.95     1.26

Swiss-EPFL (out-of-Place)

Method        2m–2°    5m–5°  XYZ (m)
LoD-Loc       17.41    48.55    12.54
SemCityLoc    35.36    89.18     3.09

The semantic–geometric alignment generalizes across heterogeneous environments: in the challenging out-of-Place setting, recall roughly doubles (17.41% → 35.36% at 2m–2°, 48.55% → 89.18% at 5m–5°) and mean positional error falls from 12.54 m to 3.09 m.

Training Convergence

[Figure: Convergence comparison]

SemCityLoc converges within a few epochs and stabilizes after ~15 epochs. Fine-tuning a lightweight head on pretrained features with an IoU-based objective yields smoother gradients than reprojection-based losses trained from scratch.

Robustness Across LoDs

[Figure: LoD levels]

Performance improves from LoD1 to LoD3. Notably, semantic LoD2 matches or surpasses LoD3 despite lower geometric resolution — semantically partitioned surfaces enhance pose observability and partially compensate for missing fine-grained geometry.

Further Analyses

  • Segmentation quality. A lightweight head on a shared DINOv3 backbone reaches 88% / 85% / 78% mIoU on the three benchmarks — sufficient for stable alignment. Fully zero-shot CLIP+Semantic-SAM (32% mIoU) and Grounded-SAM2 (37% mIoU) degrade recall sharply due to terrestrial-to-aerial domain mismatch.
  • Pose-prior noise. Recall degrades smoothly under translational perturbations, staying at 62.8% / 70.2% / 76.6% even with 40–50 m per-axis noise, but drops sharply beyond ~100 m — revealing an operational boundary.
  • Runtime. With nvdiffrast GPU rasterization and frustum culling, the full pipeline runs in 0.878 s/image (0.116 s inference + 0.409 s coarse search + 0.353 s refinement) — feasible for practical online UAV relocalization.
  • Limitations. The method relies on the availability and accuracy of semantic 3D city models. It is robust to dynamic objects but may degrade when geometric observability is severely limited (e.g., a single visible facade).

BibTeX

@inproceedings{semcityloc2026,
  author    = {Mao, Jingfeng and Chen, Xuyang and Zhang, Qilin and Dhaouadi, Oussema
               and Wang, Guangming and Sheil, Brian and Cremers, Daniel
               and Xia, Yan and Wysocki, Olaf},
  title     = {SemCityLoc: Aerial 6DoF Localization Using Semantic 3D City Models},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}