INVESTIGATING THE GEOGRAPHIC BIAS IN CLOUD COVER OVERESTIMATION OF SENTINEL-2 LEVEL 1C AND LEVEL 2A PRODUCTS

Dirk Tiede 1*, Martin Sudmanns 1, Hannah Augustin 1 and Andrea Baraldi 2,3

1 Department of Geoinformatics - Z_GIS, University of Salzburg, Schillerstr. 30, 5020, Salzburg, Austria; dirk.tiede@sbg.ac.at, martin.sudmanns@sbg.ac.at, hannah.augustin@sbg.ac.at
2 Baraldi Consultancy in Remote Sensing (BACRES), Modena, Italy & Spatial Services GmbH, Salzburg, Austria; andrea6311@gmail.com

ABSTRACT

Using exploratory methods, we evaluate a previously published geographic bias, i.e., Sentinel-2 Level 1C and Level 2A cloud cover overestimation for very high altitude areas, and visualize the detected bias using the example of South America. We narrow the problem down to specific areas and estimate the amount of data affected by these geographic characteristics, which results in a strong image selection bias for global analyses. This should raise awareness of potentially omitted yet valid Sentinel-2 data in global big data applications.

Index Terms— cloud cover overestimation, geographical bias, high altitude areas, Sentinel-2 Level 1C, Sentinel-2 Level 2A

1. INTRODUCTION

Big data analyses must rely on proper pre-processing; analysis ready data (ARD) aims to fill this gap and should therefore achieve a certain minimum quality. If ARD fails to do so, analysis results lose reliability, e.g., because of data quality issues or because data are excluded based on incorrect quality indicators. Many studies rely on the cloud cover estimate of Sentinel-2 images (Level 1C or Level 2A) as a first filter to select a subset of qualified images. Cloud estimation procedures usually aim to avoid inaccurate cloud cover estimates, with a focus on avoiding an underestimation of clouds. However, the opposite, cloud cover overestimation, can also bias the image analysis process.
This is especially concerning if the cloud overestimation is not randomly distributed, but instead affects some areas of the globe systematically and significantly more than others due to specific geographic characteristics. Elevation has proven to be one such geographic characteristic systematically affecting the quality of cloud cover estimates.

Figure 1 shows an example of an archive query of all Sentinel-2 images for granule 19JDJ (Andes, Chile/Argentina, mean elevation ~4000 m) with an estimated cloud cover greater than 80%. This rather unusual filter returns mainly cloud-free images (see the thumbnails depicted in Fig. 1) for the time frame 2015-07 to 2019-07, contrary to the estimated cloud cover. The inverse filter, images with less than 20% estimated cloud cover, returns only two images for the same period.

In [1] we showed that the use of the cirrus band of Sentinel-2 plays a crucial role in this overestimation of clouds, since cirrus detection can fail under certain conditions such as high elevation, low water vapor content, bright surfaces [2]-[4], or a combination of these factors. This is one reason why more recent versions of the Sen2Cor algorithm (since v2.8 and similar to processing line 2.12) use a DEM threshold of 1500 m altitude to exclude cirrus cloud detection in higher altitude areas.

Fig. 1. Query for all Sentinel-2 Level 1C images available (query conducted on July 10, 2019) for granule 19JDJ and relative orbit R96 (Andes, border region between Chile and Argentina). Selection criterion: cloud cover estimate of >= 80% in the image metadata. The result shows the thumbnails of the 90 images found in the database; most of them have no cloud cover or much less than 80%. Images are returned from all seasons and show minimal variance over the year. A query for images with < 20% cloud cover returned two images for the whole time frame. [source: [1]]
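
The metadata query behind Fig. 1 can be illustrated with a minimal sketch. It is not the query implementation used in the study: the record type `SceneRecord`, the function `select_scenes`, the placeholder granule "XXXXX" and all cloud cover values below are hypothetical and chosen only to show how a filter on the cloud cover estimate in the metadata selects or discards images, regardless of what the images actually contain.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class SceneRecord:
    """Hypothetical, simplified metadata record for one Sentinel-2 image."""
    granule: str            # tile identifier, e.g. "19JDJ"
    relative_orbit: int     # relative orbit number, e.g. 96
    sensing_date: str       # ISO date string
    cloud_cover_pct: float  # cloud cover estimate taken from the metadata


def select_scenes(
    records: Iterable[SceneRecord],
    granule: str,
    relative_orbit: int,
    keep: Callable[[float], bool],
) -> List[SceneRecord]:
    """Return the records of one granule and orbit whose cloud cover
    estimate satisfies the predicate `keep`."""
    return [
        r for r in records
        if r.granule == granule
        and r.relative_orbit == relative_orbit
        and keep(r.cloud_cover_pct)
    ]


# Made-up example records (NOT data from the study).
records = [
    SceneRecord("19JDJ", 96, "2018-01-05", 95.2),
    SceneRecord("19JDJ", 96, "2018-04-15", 88.0),
    SceneRecord("19JDJ", 96, "2018-07-24", 12.5),
    SceneRecord("19JDJ", 96, "2018-10-02", 99.1),
    SceneRecord("19JDJ", 96, "2019-01-10", 45.0),
    SceneRecord("XXXXX", 96, "2018-01-05", 90.0),  # placeholder granule
]

# Filter as in Fig. 1: cloud cover estimate of at least 80 %.
high = select_scenes(records, "19JDJ", 96, lambda c: c >= 80.0)

# Inverse filter: cloud cover estimate below 20 %.
low = select_scenes(records, "19JDJ", 96, lambda c: c < 20.0)

print(len(high), "records with >= 80 % estimated cloud cover")
print(len(low), "records with < 20 % estimated cloud cover")
```

A workflow that keeps only the second selection as a first filter would discard every image in the first selection based on the metadata value alone, which is how an overestimated cloud cover turns into an image selection bias.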