Description

With the availability of enormous numbers of remote sensing images produced by satellites and airborne sensors, high-resolution remote sensing image analysis has stimulated a flood of interest in the domains of remote sensing and computer vision (Toth and Jóźków, 2016), covering tasks such as image classification and land cover mapping (Cheng et al., 2017; Gómez et al., 2016; You and Dong, 2020; Zhao et al., 2016), image retrieval (Wang et al., 2016), and object detection (Cheng et al., 2014; Han et al., 2014). The great potential offered by these platforms in terms of observation capability also poses great challenges for semantic scene understanding (Bazi, 2019); for instance, the data are obtained from different locations, at different times, and even with different satellites or sensors, which complicates scene interpretation. Multi-label data can support image semantic segmentation (Xia et al., 2015), and models trained on multi-label data can be migrated to other visual tasks such as image object recognition (Gong et al., 2019). Multi-label datasets therefore attract increasing attention in the remote sensing community, as they are inexpensive to build yet hold considerable research potential.

For these reasons, multi-label annotation is necessary to capture more details of an image and to improve the performance of scene understanding. In addition, multi-label annotation exposes correlations among labels: for example, "road" and "car" tend to co-occur in a remote sensing image, and "grass" and "water" often accompany "golf course". Such correlations provide a richer understanding of scene images than single-label annotation can offer, which makes annotating images with multiple labels a vital step for semantic scene understanding in remote sensing.

Moreover, previous studies have shown that traditional machine learning methods cannot adequately mine ground-object scene information (Cordts et al., 2016; Jeong et al., 2019; Kendall et al., 2015; Zhu et al., 2019). Deep learning approaches, by contrast, have shown great potential for solving problems related to semantic scene understanding, and many scholars have conducted relevant studies (Fang et al., 2020; Han et al., 2018; Hu et al., 2015; Ma et al., 2019; Paoletti et al., 2018; Wang et al., 2018; Zhang et al., 2016; Zhou et al., 2019). For example, a highly reliable end-to-end real-time situation recognition system based on object detection was proposed for autonomous vehicles (Jeong et al., 2019); fully convolutional networks were shown to achieve decent results in urban scene understanding (Cordts et al., 2016); and scene classification CNNs were shown to significantly outperform previous approaches (Zhou et al., 2017). Workman et al. (2017) proposed a novel CNN architecture for estimating geospatial functions such as population density, land cover, and land use, and CNNs have also been used to identify weeds (Hung et al., 2014) and vehicles (Chen et al., 2014). Additionally, a logarithmic relationship between the performance of deep learning methods on vision tasks and the quantity of training data used for representation learning was demonstrated recently (Sun et al., 2017): the power of CNNs on large-scale image recognition tasks can be substantially improved when they are trained on large, multi-perspective sample sets.
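To make that last relationship concrete, it can be written as score ≈ a · log10(N) + b for N training samples. The sketch below is purely illustrative: the coefficients a and b and the printed scores are made up, while Sun et al. (2017) report task- and model-dependent values.

```python
import math

# Hypothetical coefficients chosen purely for illustration; Sun et al. (2017)
# observe that vision-task performance grows roughly logarithmically with the
# amount of training data: score ~ a * log10(N) + b.
a, b = 5.0, 40.0

def predicted_score(n_samples: int) -> float:
    """Toy logarithmic scaling curve: score as a function of dataset size."""
    return a * math.log10(n_samples) + b

# Each tenfold increase in data adds a constant ~a points, illustrating
# diminishing (but never vanishing) returns from more labeled samples.
for n in (10_000, 100_000, 1_000_000, 10_000_000):
    print(f"N = {n:>10,} -> predicted score = {predicted_score(n):.1f}")
```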
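Similarly, the label correlations discussed above ("road" with "car"; "grass" and "water" with "golf course") can be measured directly from multi-label annotations. A minimal sketch, assuming each image carries a binary indicator vector over the label set (the label names and the toy annotation matrix below are illustrative, not taken from any particular dataset):

```python
import numpy as np

labels = ["road", "car", "grass", "water", "golf course"]  # illustrative label set

# Toy annotation matrix Y of shape (n_images, n_labels): Y[i, j] = 1 if
# label j is present in image i. Real multi-label datasets provide this.
Y = np.array([
    [1, 1, 0, 0, 0],   # road + car
    [1, 1, 1, 0, 0],   # road + car + grass
    [0, 0, 1, 1, 1],   # grass + water + golf course
    [0, 0, 1, 1, 1],
])

# Co-occurrence counts: C[j, k] = number of images containing both j and k.
C = Y.T @ Y

# Conditional frequency P(k | j): how often label k appears given label j;
# np.maximum guards against division by zero for unused labels.
P = C / np.maximum(C.diagonal()[:, None], 1)

j, k = labels.index("road"), labels.index("car")
print(f"P(car | road) = {P[j, k]:.2f}")   # 1.00 in this toy example
```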
At present, several widely used annotated datasets of various scales exist, including image classification datasets such as ImageNet (Deng et al., 2009), Places (Zhou et al., 2017), PASCAL VOC (Everingham et al., 2015), and YouTube-8M (Abu-El-Haija et al., 2016), and semantic segmentation datasets such as PASCAL Context (Mottaghi et al., 2014), Microsoft COCO (Lin et al., 2014), Cityscapes (Cordts et al., 2016), and the Mapillary Vistas Dataset (Neuhold et al., 2017). However, in these benchmarks, the data for outdoor objects on the ground are usually collected from ground-level views. In addition, object-centric remote sensing image datasets have been constructed for scene classification, for instance AID (Xia et al., 2017), NWPU-RESISC45 (Cheng et al., 2017), the Brazilian coffee scene dataset (Penatti et al., 2015), the UC-Merced dataset (Yang and Newsam, 2010), and the WHU-RS19 dataset (Xia et al., 2010), but these datasets are insufficient for scene understanding because of their high intra-class diversity, low inter-class variation, and limited number of remote sensing images (Xia et al., 2017). The SEN12MS dataset (Schmitt et al., 2019) has recently attracted increasing attention in the domain of land use mapping. It consists of 180,662 triplets sampled over all meteorological seasons; each triplet comprises a dual-pol synthetic aperture radar (SAR) image patch, a multi-spectral Sentinel-2 image patch, and four different MODIS land cover maps following different internationally established classification schemes. However, SEN12MS contains no more than 17 classes under any selected classification scheme, which may also be insufficient for understanding the complex real world.
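One way to picture the triplet structure described above is as a simple record pairing the two image patches with the land cover maps. This is only an illustrative sketch, not the dataset's official API: the field names are invented, and the channel counts and patch size are assumptions based on the dataset description (dual-pol SAR, 13-band Sentinel-2, four land cover maps).

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Sen12msTriplet:
    """Illustrative container for one SEN12MS-style sample (not the official API)."""
    sar: np.ndarray          # dual-pol Sentinel-1 SAR patch, e.g. (2, H, W)
    msi: np.ndarray          # multi-spectral Sentinel-2 patch, e.g. (13, H, W)
    landcover: np.ndarray    # four MODIS land cover maps, e.g. (4, H, W), one per
                             # internationally established classification scheme

# Toy instance with all-zero arrays, just to show the assumed shapes.
H = W = 256
sample = Sen12msTriplet(
    sar=np.zeros((2, H, W), dtype=np.float32),
    msi=np.zeros((13, H, W), dtype=np.float32),
    landcover=np.zeros((4, H, W), dtype=np.int16),
)
print(sample.sar.shape, sample.msi.shape, sample.landcover.shape)
```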
