With the availability of enormous numbers of remote sensing images
produced by satellites and airborne sensors, high-resolution remote
sensing image analyses have stimulated a flood of interest in the domain
of remote sensing and computer vision (Toth and Jóźków, 2016), such as
image classification or land cover mapping (Cheng et al., 2017; Gómez
et al., 2016; You and Dong, 2020; Zhao et al., 2016), image retrieval
(Wang et al., 2016), and object detection (Cheng et al., 2014; Han et al.,
2014), etc. The great potential offered by these platforms in terms of
observation capability poses great challenges for semantic scene understanding (Bazi, 2019). For instance, as these data are obtained from different locations, at different times, and even with different satellites or sensors, the same semantic scene can vary considerably in appearance. Multi-label datasets can serve tasks such as image semantic segmentation (Xia et al., 2015), and models trained on multi-label data can be migrated to other visual tasks (e.g., image object recognition) (Gong et al., 2019). Therefore, multi-label datasets now attract increasing attention in the remote sensing community, as they are inexpensive to build yet offer considerable research potential. For these reasons,
multi-label annotation of an image is necessary to present more details
of the image and improve the performance of scene understanding. In
addition, the multi-label annotation of an image can reveal potential correlations among labels: for example, "road" and "car" tend to co-occur in a remote sensing image, and "grass" and "water" often accompany "golf course". This provides a richer understanding of scene images, which is impossible for single-label scene understanding. Therefore, annotating images with multiple labels is a vital
step for semantic scene understanding in remote sensing.
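To make this concrete, the following minimal sketch (with a hypothetical label set and hand-made annotations) encodes multi-label annotations as multi-hot vectors and counts label co-occurrences, which is one simple way to expose correlations such as "road" with "car":

```python
# Minimal sketch: multi-hot encoding of multi-label annotations and a
# label co-occurrence count. The label set and annotations are hypothetical.
import numpy as np

classes = ["road", "car", "grass", "water", "golf course"]

# Each image carries a set of labels rather than a single label.
annotations = [
    {"road", "car"},                    # e.g., a highway scene
    {"grass", "water", "golf course"},  # e.g., a golf course scene
    {"road", "car", "grass"},           # e.g., a suburban scene
]

# Multi-hot matrix Y: Y[i, j] = 1 if image i contains class j.
Y = np.zeros((len(annotations), len(classes)), dtype=int)
for i, labels in enumerate(annotations):
    for label in labels:
        Y[i, classes.index(label)] = 1

# Co-occurrence matrix: entry (j, k) counts images containing both class j
# and class k, exposing correlations that single labels cannot express.
cooccurrence = Y.T @ Y
print(cooccurrence[classes.index("road"), classes.index("car")])  # -> 2
```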
What is more, previous studies have shown that traditional machine learning methods cannot adequately mine ground object scene information (Cordts et al., 2016; Jeong et al., 2019; Kendall et al., 2015; Zhu et al., 2019). Recently, deep learning approaches have shown great potential for solving problems related to semantic scene understanding, and many scholars have
conducted relevant studies (Fang et al., 2020; Han et al., 2018; Hu et al.,
2015; Ma et al., 2019; Paoletti et al., 2018; Wang et al., 2018; Zhang
et al., 2016; Zhou et al., 2019). For example, a highly reliable end-to-end real-time object detection-based situation recognition system was proposed
for autonomous vehicles (Jeong et al., 2019). In another work (Cordts
et al., 2016), the authors determined that fully convolutional networks
achieve decent results in urban scene understanding. Scene classification CNNs have also been shown to significantly outperform previous approaches (Zhou et al., 2017). In a further study (Workman et al., 2017),
a novel CNN architecture for estimating geospatial functions, such as
population density, land cover, or land use, was proposed. Moreover, CNNs have also been used to identify weeds (Hung et al., 2014) and vehicles (Chen et al., 2014), among other targets.
Additionally, a logarithmic relationship between the performance of deep learning methods on vision tasks and the quantity of training data used for representation learning was recently demonstrated (Sun et al., 2017). This work showed that the power of CNNs on large-scale image recognition tasks can be substantially improved if the CNNs are trained on large multi-perspective samples.
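As a rough illustration of such a logarithmic relationship, the sketch below fits accuracy = a·log(n) + b to hypothetical (dataset size, accuracy) pairs; the numbers are invented for illustration and are not taken from Sun et al. (2017):

```python
# Minimal sketch: fit a logarithmic data-scaling curve to hypothetical points.
import numpy as np

n = np.array([1e4, 1e5, 1e6, 1e7])        # hypothetical training-set sizes
acc = np.array([0.55, 0.63, 0.71, 0.79])  # hypothetical task accuracies

# Least-squares fit of acc ~= a * log(n) + b.
a, b = np.polyfit(np.log(n), acc, 1)
print(f"accuracy ~= {a:.3f} * log(n) + {b:.3f}")

# Under this fit, every 10x increase in training data adds about
# a * log(10) to accuracy.
print(f"gain per 10x data: {a * np.log(10):.3f}")
```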
At present, there exist some widely used annotated datasets of various scales, including image classification datasets like ImageNet (Deng et al., 2009), Places (Zhou et al., 2017), PASCAL VOC (Everingham et al., 2015), and YouTube-8M (Abu-El-Haija et al., 2016), and semantic segmentation datasets like PASCAL Context (Mottaghi et al., 2014), Microsoft COCO (Lin et al., 2014), Cityscapes (Cordts et al., 2016), and the Mapillary Vistas Dataset (Neuhold et al., 2017). However, in these benchmarks, the data of
outdoor objects on the ground are usually collected from ground-level
views. In addition, several object-centric remote sensing image datasets have been constructed for scene classification, for instance, AID (Xia et al., 2017), NWPU-RESISC45 (Cheng et al., 2017), the Brazilian coffee scene dataset (Penatti et al., 2015), the UC-Merced dataset (Yang and Newsam, 2010), and the WHU-RS19 dataset (Xia et al., 2010). However, these datasets are insufficient for scene understanding owing to their high intra-class diversity, low inter-class variation, and limited number of remote sensing images (Xia et al., 2017). The SEN12MS dataset (Schmitt et al., 2019)
has recently attracted increasing attention in the domain of land use mapping. It
consists of 180,662 triplets sampled over all meteorological seasons.
Each triplet includes a dual-pol synthetic aperture radar (SAR) image patch, a multi-spectral Sentinel-2 image patch, and four different MODIS land cover maps following different internationally established classification schemes. However, SEN12MS contains no more than 17 classes under any selected classification scheme, which may also be insufficient for understanding the complex real world.
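For readers unfamiliar with the triplet layout, the following sketch models one such triplet as a simple container; the field names, array shapes, and scheme keys are assumptions for illustration, not the dataset's actual file format or API:

```python
# Minimal sketch of a SEN12MS-style triplet; shapes and scheme names are
# assumptions for illustration, not the dataset's actual API.
from dataclasses import dataclass
from typing import Dict
import numpy as np

@dataclass
class Triplet:
    sar: np.ndarray            # dual-pol SAR patch, e.g. (2, H, W)
    s2: np.ndarray             # multi-spectral Sentinel-2 patch, e.g. (13, H, W)
    lc: Dict[str, np.ndarray]  # four land cover maps, keyed by scheme name

H = W = 256  # assumed patch size
triplet = Triplet(
    sar=np.zeros((2, H, W), dtype=np.float32),
    s2=np.zeros((13, H, W), dtype=np.float32),
    lc={scheme: np.zeros((H, W), dtype=np.uint8)
        for scheme in ("scheme_1", "scheme_2", "scheme_3", "scheme_4")},
)
assert len(triplet.lc) == 4  # one map per classification scheme
```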