Aerial 6DoF localization typically relies on precise GNSS signals or radiometrically rich 3D
reconstructions, which limits scalability and hinders on-board deployment. We propose
SemCityLoc, a semantic–geometric alignment system that reframes
aerial pose estimation as structured surface registration between foundation-model-derived
visual priors and standardized LoD-compliant 3D city models.
Instead of matching sparse contours or dense texture, our method aligns semantic surfaces and
monocular depth with lightweight semantic 3D building models, increasing pose discriminability in
repetitive and occluded urban environments. To enable accurate evaluation, we introduce
SemCityLockeD, the first real-world benchmark combining
centimeter-accurate UAV poses with standardized LoD1–LoD3 semantic city models and challenging
low-altitude imagery.
In experiments, our method substantially outperforms existing map-based approaches, improving recall
by up to 36% and reducing the mean positional error from 9.89 m to 2.62 m in
challenging urban canyons. Our results indicate that semantically structured geometry provides
sufficient and scalable constraints for high-precision aerial localization without radiometric scene
reconstructions.
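
The two reported metrics, recall and mean positional error, can be illustrated with a minimal sketch. It assumes that the positional error is the Euclidean distance between estimated and ground-truth camera positions in metres, and that recall is the fraction of frames whose error falls within a distance threshold. The function names and the threshold are hypothetical and are not taken from the paper's evaluation protocol, which may also include a rotational criterion.

```python
import math

def position_errors(estimated, ground_truth):
    """Per-frame Euclidean distance (m) between estimated and true positions."""
    if len(estimated) != len(ground_truth):
        raise ValueError("pose lists must have the same length")
    return [math.dist(e, g) for e, g in zip(estimated, ground_truth)]

def recall_at(errors, threshold_m):
    """Fraction of frames localized within threshold_m (assumed definition)."""
    return sum(e <= threshold_m for e in errors) / len(errors)

def mean_positional_error(errors):
    """Arithmetic mean of the per-frame positional errors."""
    return sum(errors) / len(errors)

# Toy positions (x, y, z) in metres; illustrative values only, not benchmark data.
estimated = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 10.0)]
ground_truth = [(0.0, 0.0, 0.0)] * 3

errors = position_errors(estimated, ground_truth)
recall_5m = recall_at(errors, threshold_m=5.0)  # hypothetical 5 m threshold
mean_error = mean_positional_error(errors)
```

Under these assumptions, a lower mean error combined with a higher recall at a fixed threshold indicates that more frames are localized and that the localized frames are more accurate.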