Aerial to Street View Image Translation using Cascaded Conditional GANs

Kshitij Singh, Alexia Briassouli a and Mirela Popa
Department of Data Science and Knowledge Engineering, Maastricht University, The Netherlands
mirela.popa@maastrichtuniversity.nl

Keywords: Cross View Image Translation, Conditional GANs, Semantic Segmentation, U-Net.

Abstract: Cross view image translation is a challenging case of viewpoint translation which involves generating the street view image when the aerial view image is given, and vice versa. As there is no overlap between the two views, a single-stage generation network fails to capture the complex scene structure of objects in these two views. Our work tackles the task of generating street-level view images from aerial view images on the CVUSA benchmark dataset with a cascade pipeline consisting of three smaller stages: street view image generation, semantic segmentation map generation, and image refinement, trained together in a constrained manner in a Conditional GAN (CGAN) framework. Our contributions are twofold: (1) The first stage of our pipeline examines the use of the alternative architectures ResNet and ResUNet++ in a framework similar to the current State-of-the-Art (SoA), leading to useful insights and comparable or improved results in some cases. (2) In the third stage, ResUNet++ is used for the first time for image refinement. U-Net performs best for street view image generation and semantic map generation as a result of the skip connections between encoders and decoders, while ResUNet++ performs best for image refinement because of the attention module in its decoders. Qualitative and quantitative comparisons with existing methods show that our model outperforms all others on the KL Divergence metric and ranks amongst the best on the other metrics.
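The data flow of the three-stage cascade can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: it assumes, as the introduction describes, that the stage 1 output is passed to both the semantic map generator and the refinement network. The names `run_cascade`, `CascadeOutput`, and the three `toy_*` functions are hypothetical stand-ins; in the paper each stage is a trained CGAN generator (U-Net, ResNet, or ResUNet++), and the adversarial training itself is not shown.

```python
from dataclasses import dataclass
from typing import Callable, List

# Toy image type (assumption): a grid of float pixel values.
Image = List[List[float]]
Stage = Callable[[Image], Image]


@dataclass
class CascadeOutput:
    coarse_street: Image   # stage 1: generated street view image
    semantic_map: Image    # stage 2: semantic map predicted from stage 1 output
    refined_street: Image  # stage 3: refined street view image


def run_cascade(aerial: Image,
                generate_street: Stage,
                generate_semantic_map: Stage,
                refine_street: Stage) -> CascadeOutput:
    """Chain the three stages; stage 1 output feeds stages 2 and 3."""
    coarse = generate_street(aerial)
    semantic = generate_semantic_map(coarse)
    refined = refine_street(coarse)
    return CascadeOutput(coarse, semantic, refined)


# Hypothetical stand-ins for trained generators (not real networks).
def toy_generate(img: Image) -> Image:
    # Mirrors each row, standing in for the aerial-to-street mapping.
    return [row[::-1] for row in img]


def toy_segment(img: Image) -> Image:
    # Thresholds pixels into two "classes", standing in for segmentation.
    return [[1.0 if p > 0.5 else 0.0 for p in row] for row in img]


def toy_refine(img: Image) -> Image:
    # Clamps pixels to [0, 1], standing in for image refinement.
    return [[min(1.0, max(0.0, p)) for p in row] for row in img]


aerial_example: Image = [[0.2, 0.9], [1.4, -0.1]]
result = run_cascade(aerial_example, toy_generate, toy_segment, toy_refine)
```

Passing the stages as callables keeps the sketch independent of any particular architecture, which mirrors the paper's aim of comparing different generators in each stage.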
1 INTRODUCTION

The task of generating outdoor scenes from a variety of viewpoints is a challenging one that has recently gained considerable attention, with applications in domains such as autonomous driving, virtual reality, and geo-tagging. Generating a novel viewpoint involves transforming objects in a scene from a given view to the desired view in a natural setting, while maintaining the photo-realism of the transformation.

Cross view image translation is a special case of viewpoint translation, where the desired view has no overlap with the given view (aerial to street or vice versa). This is much more challenging due to occlusion and the large degree of deformation when transforming from one view to the other. Moreover, when transforming from aerial view to street view, the orientation in which the street view will be synthesized is uncertain.

Existing methods (Zhai et al., 2017; Regmi and Borji, 2018; Tang et al., 2019) show that a single-stage image translation model fails to transfer fine details of the objects. Thus, a multi-step process is needed, with image refinement after street view image generation (Tang et al., 2019). A semantic segmentation map generator, whose output is compared with the ground truth semantic map, is added to the multi-step process to guide image generation. The final pipeline consists of 3 steps: a street view image is first generated, and is then provided to the image refinement and semantic map generation networks.

a https://orcid.org/0000-0002-0545-3215

Our work builds upon Tang et al. (2019), investigating which architectures are best suited for each of the 3 steps/subtasks (street view image generation, semantic map generation, and image refinement). Our contributions are: (1) Stage 1 of our pipeline examines the alternative SoA architectures ResNet and ResUNet++ in a framework similar to that of Tang et al. (2019), leading to useful insights and comparable or improved results.
(2) In stage 3, ResUNet++ is used for the first time for image refinement.

Singh, K., Briassouli, A. and Popa, M. Aerial to Street View Image Translation using Cascaded Conditional GANs. DOI: 10.5220/0010814000003124. In Proceedings of the 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2022) - Volume 4: VISAPP, pages 372-379. ISBN: 978-989-758-555-5; ISSN: 2184-4321. Copyright © 2022 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved.

Conditional GANs (CGANs), proven to be very effective in image translation (Isola et al., 2017), are used as the framework for each step. In addition to U-Net, which is the standard for image translation with CGANs, ResNet and ResUNet++ (Jha et al., 2019) are