
        Multi-Attention Fusion and Fine-Grained Alignment for Bidirectional Image-Sentence Retrieval in Remote Sensing

        IEEE/CAA Journal of Automatica Sinica, August 2022

        Qimin Cheng, Yuzhuo Zhou, Haiyan Huang, and Zhongyuan Wang

        Dear editor,

        Cross-modal retrieval of remote sensing (RS) data has inspired increasing enthusiasm due to its merits of flexible input and efficient query. In this letter, we aim to establish the semantic relationship between RS images and their description sentences. Specifically, we propose a multi-attention fusion and fine-grained alignment network, termed MAFA-Net, for bidirectional cross-modal image-sentence retrieval in RS. While multiple attention mechanisms are fused to enhance the discriminative ability of visual features for RS images with complex scenes, a fine-grained alignment strategy is introduced to study the hidden connection between RS observations and sentences. To validate the capability of MAFA-Net, we leverage four captioning benchmark datasets with paired RS images and descriptions, i.e., UCM-Captions, Sydney-Captions, RSICD and NWPU-Captions. Experimental results on the four datasets demonstrate that MAFA-Net yields better performance than current state-of-the-art approaches.

        Related work: The rapid advancement of earth observation technology has brought an explosive growth of multi-modal and multi-source remote sensing data. Cross-modal retrieval in RS enables flexible and efficient query, has attracted extensive interest in recent years, and can be applied to natural disaster early warning, military intelligence generation, etc.

        Significant efforts have been devoted to cross-modal retrieval for natural images. To probe fine-grained relationships between images and sentences, Chen et al. [1] proposed a cross-modal retrieval model (IMRAM) based on a recurrent attention technique. Lee et al. [2] proposed a stacked attention mechanism-based image-text retrieval model (SCAN) to learn more discriminative textual and visual feature representations. Wang et al. [3] proposed a multi-modal tensor fusion network (MTFN) to directly measure the similarity between different modalities through rank-based tensor fusion. Wang et al. [4] proposed a position focused attention network (PFAN) to improve cross-modal matching performance. Besides, to satisfy industrial requirements, Wu et al. [5] proposed a hashing approach that achieves large-scale cross-modal retrieval by learning a unified hash representation and deep hashing functions for different modalities in a self-supervised way. Although these methods have achieved inspiring results on retrieval tasks for natural images, their robustness and generalization ability still need to be verified when transferred to the RS field, owing to the intrinsic and extrinsic properties of RS data.

        Motivated by the burgeoning demand for multi-modal queries in RS, such as military intelligence generation, researchers have paid more attention to RS cross-modal retrieval in recent years. To explore the semantic correlation between visual features and textual descriptions of RS data, Abdullah et al. [6] proposed a novel deep bidirectional triplet network (DBTN) for the Text-to-Image (T2I) matching task through a feature fusion strategy. With regard to Image-to-Text (I2T) retrieval for RS data, Cheng et al. [7] proposed a cross-attention mechanism and a gating mechanism to enhance the association between RS images and descriptions, which was the first attempt to demonstrate the feasibility of bidirectional T2I and I2T retrieval in RS. Afterwards, Lv et al. [8] proposed a fusion-based correlation learning model (FCLM) to capture multi-modal complementary information and fusion features and to further supervise the learning of the feature extraction network. Yuan et al. [9] proposed an asymmetric multimodal feature matching network (AMFMN) that extracts salient visual features of RS images through a multi-scale visual self-attention technique and exploits them to guide textual feature extraction. Moreover, they further designed a concise and efficient version of their cross-modal retrieval model, namely LW-MCR [10], on the basis of knowledge distillation. For fast and efficient retrieval on large-scale RS data, Mikriukov et al. [11] introduced a novel deep unsupervised cross-modal contrastive hashing model. Beyond image-sentence retrieval, there has also been work on visual-audio retrieval [12], image-sketch retrieval [13], cross-source panchromatic-multispectral image retrieval [14], [15] and zero-shot image-word matching [16].

        There is no doubt that all the above work has partly advanced cross-modal retrieval in RS from different aspects, including visual feature representation and description optimization strategies. However, current work on bidirectional image-sentence retrieval in RS is deficient in three respects: 1) Achievements on bidirectional image-sentence retrieval for RS data are very limited, and comprehensive analysis is still lacking; current work [6]–[11] conducts comparative experiments exclusively against baselines designed for natural images. 2) The generalization of existing approaches to much larger and more challenging RS captioning datasets needs to be verified; the size of the datasets used by existing approaches [6], [8]–[11] is limited (at most 24 333 original captions in RSICD [17] and 23 715 granular captions in RSITMD [9]). 3) The semantic ambiguity of complex RS scenes remains unsolved.

        To address these limitations, we propose a novel cross-modal network for bidirectional T2I and I2T retrieval in RS. The contributions of our work are twofold: 1) We differentiate visual features for complex scene representation by fusing multiple attention mechanisms and reinforce the inter-modality semantic association through a fine-grained alignment strategy. 2) We evaluate the effectiveness and robustness of the proposed network on a much larger dataset, NWPU-Captions, with 157 500 captions in total, along with several popular benchmark datasets.

        MAFA-Net: The motivation of MAFA-Net is twofold. The first is to depict RS images, especially those with complex scenes, with more abstract and discriminative feature representations. The second is to address the semantic ambiguity existing across the different modalities of RS data by establishing fine-grained relevance between RS image regions and words.

        To this end, MAFA-Net consists of two main parts: a multi-attention fusion module and a fine-grained alignment module. The multi-attention fusion module aims to weaken interference from background noise in RS images and enhance the salient objects, thereby improving the discriminative ability of the visual features. The fine-grained alignment module exploits sentence features as context information to further optimize and update the visual features of RS images. The overall architecture of MAFA-Net is shown in Fig. 1.

        Fig. 1. The overall architecture of MAFA-Net.

        Fig. 2. The architecture of multi-attention fusion module.
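        Since the letter does not spell out the exact formulation of the two modules, the following PyTorch sketch is only an illustrative interpretation of the pipeline described above; the gate designs, attention operators, and feature dimensions are our assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiAttentionFusion(nn.Module):
    """Illustrative fusion of channel and spatial attention over region features.

    Input: region features of shape (B, R, D); output: the same shape,
    re-weighted to suppress background regions and highlight salient objects.
    (Hypothetical design; not the authors' exact module.)
    """
    def __init__(self, dim: int):
        super().__init__()
        self.channel_gate = nn.Sequential(nn.Linear(dim, dim // 16), nn.ReLU(),
                                          nn.Linear(dim // 16, dim), nn.Sigmoid())
        self.spatial_gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        channel_w = self.channel_gate(regions.mean(dim=1, keepdim=True))  # (B, 1, D)
        spatial_w = self.spatial_gate(regions)                            # (B, R, 1)
        return regions * channel_w * spatial_w

def fine_grained_alignment(regions: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
    """Cross-attention alignment: each word attends to image regions, and the
    image-sentence score pools the word-level cosine similarities.

    regions: (B, R, D) visual features; words: (B, T, D) textual features.
    Returns a (B,) similarity score for matched image-sentence pairs.
    """
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)
    attn = torch.softmax(words @ regions.transpose(1, 2), dim=-1)  # (B, T, R)
    attended = attn @ regions                                      # word-grounded visual context
    return F.cosine_similarity(words, attended, dim=-1).mean(dim=-1)

# Toy usage with hypothetical dimensions (D = 512, 36 regions, 20 words).
regions = torch.randn(4, 36, 512)
words = torch.randn(4, 20, 512)
fused = MultiAttentionFusion(512)(regions)
scores = fine_grained_alignment(fused, words)   # shape (4,)
```

        In this reading, the fusion module re-weights region features along both the channel and spatial dimensions, and the alignment module lets each word attend to image regions before the word-level similarities are pooled into an image-sentence score.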

        Dataset and metrics: Four RS datasets are selected to evaluate the performance of different approaches on the cross-modal image-sentence retrieval task.

        1) UCM-Captions: This dataset is released by [18] based on the UC Merced dataset. The size of each image is 256×256 pixels, and the pixel resolution is 0.3048 m. Each image is described with five different sentences, so the dataset contains 10 500 descriptions in total.

        2) Sydney-Captions: This dataset is released by [18] based on the Sydney dataset and includes 3065 descriptions for 613 cropped images. The original image has a size of 18 000×14 000 pixels and a pixel resolution of 0.5 m. Each cropped image is described by five different sentences.

        3) RSICD: This dataset [17] contains 10 921 RS images and 24 333 original descriptions in total, a scale larger than that of the two aforementioned datasets. Images are resized to 224×224 pixels, and 54 605 sentences are obtained by randomly duplicating existing descriptions.

        4) NWPU-Captions: NWPU-Captions is provided by Wuhan University and Huazhong University of Science and Technology based on the NWPU-RESISC45 dataset. It covers 45 different classes, each containing 700 instances. Each image is described by five sentences according to certain annotation rules, giving 157 500 descriptions in total. This dataset is challenging due to its large scale and large variations.

        We use the R@K criterion (K = 1, 5, 10), i.e., the percentage of queries whose ground-truth match appears among the top-K retrieved results, to evaluate the performance of different approaches. A larger R@K indicates better performance.
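        As a concrete reference, the sketch below computes R@K from a query-by-gallery similarity matrix under the simplifying assumption of one relevant gallery item per query (in practice, when each image has five captions, the best rank among its captions is typically taken).

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, ks=(1, 5, 10)):
    """R@K for retrieval from a query-by-gallery similarity matrix.

    similarity[i, j] scores query i against gallery item j; the ground-truth
    match of query i is assumed to be gallery item i (one relevant item each).
    """
    num_queries = similarity.shape[0]
    order = np.argsort(-similarity, axis=1)          # gallery indices, best first
    # 0-based rank at which the ground-truth item appears for each query.
    gt_rank = np.argmax(order == np.arange(num_queries)[:, None], axis=1)
    return {f"R@{k}": 100.0 * float(np.mean(gt_rank < k)) for k in ks}

# Toy usage: 100 queries scored against 100 gallery items.
rng = np.random.default_rng(0)
print(recall_at_k(rng.random((100, 100))))
```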

        Experimental settings: In the training process, we set the batch size to 16 and the learning rate to 0.0005, which decays by a factor of 0.7 every 20 epochs. Training runs for 120 epochs in total. The margin threshold δ in the loss function is set to 0.2. The visual feature of each image region is 2048-dimensional, while each word feature is 300-dimensional. The hidden dimension of the Bi-GRU is 2048. During training, word features are initialized randomly and fed to the Bi-GRU.
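        The letter only states the margin value, so the code below is a minimal sketch of the bidirectional hinge-based triplet ranking loss with hardest in-batch negatives that is commonly used for this kind of image-sentence matching; it is an assumption for illustration, not necessarily the exact objective of MAFA-Net.

```python
import torch

def bidirectional_triplet_loss(sim: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """Hinge-based bidirectional ranking loss over a batch similarity matrix.

    sim[i, j] is the similarity between image i and sentence j; the diagonal
    holds the matched pairs. Hardest negatives are mined within the batch.
    """
    batch_size = sim.size(0)
    pos = sim.diag().view(batch_size, 1)             # scores of matched pairs
    mask = torch.eye(batch_size, dtype=torch.bool, device=sim.device)

    # Image-to-sentence: penalize sentences scoring higher than the match.
    cost_i2t = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0)
    # Sentence-to-image: symmetric term over the columns.
    cost_t2i = (margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)

    # Keep only the largest violation per query (hardest negative).
    return cost_i2t.max(dim=1)[0].mean() + cost_t2i.max(dim=0)[0].mean()

# Toy usage: similarity matrix of an 8-pair batch.
loss = bidirectional_triplet_loss(torch.randn(8, 8), margin=0.2)
```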

        Results and analysis: We conduct experiments on the four benchmark datasets, and Tables 1–4 report the results of various methods, including representative cross-modal models for natural images, such as IMRAM [1], SCAN [2], MTFN [3] and PFAN [4], and recent models for RS data, such as FCLM [8], AMFMN [9] and LW-MCR [10].

        It can be seen from Tables 1–4 that MAFA-Net generally achieves better retrieval performance than the other models on the four datasets, although on the first three datasets it occasionally and slightly underperforms on some metrics. This might be related to the relatively small amount of data in the UCM-Captions and Sydney-Captions datasets and to the unbalanced category distribution of the Sydney-Captions dataset itself. On the much larger and more challenging NWPU-Captions dataset, however, MAFA-Net achieves the best results on all evaluation metrics. The results of MAFA-Net on the four different datasets also demonstrate its robustness.

        Table 1.Comparative Experimental Results on UCM-Captions

        Table 2.Comparative Experimental Results on Sydney-Captions

        We also conduct ablation experiments to evaluate the contributions of the multi-attention fusion module (MA) and the fine-grained alignment module (FA) to MAFA-Net. Table 5 reports the results on NWPU-Captions, in which _nMA_nFA denotes the basic network without the two modules, _nMA denotes the network without the MA module, and _nFA denotes the network without the FA module. It can be seen that each of the two modules significantly improves the retrieval performance of MAFA-Net on its own, and their contributions are relatively close. Table 5 also tabulates the training and testing time of the different models on NWPU-Captions.

        We further show the visualization results of our MAFA-Net in Figs. 3–6.

        It can be seen that most of the retrieval results match the input, which indicates that the MAFA-Net proposed in this letter maintains a good semantic correspondence between RS images and sentences. It is worth mentioning that even for challenging high-density scenes with a great number of small, clustered objects, MAFA-Net still performs well (see Fig. 6).

        Conclusion: In this letter, we propose a multi-attention fusion and fine-grained alignment network (MAFA-Net) for the cross-modal image-sentence retrieval task in the remote sensing domain. MAFA-Net aims at addressing the multi-scale properties of RS data and the semantic ambiguity that exists in cross-modal retrieval of RS data. Specifically, we design a multi-attention fusion module to improve the feature representation ability. Meanwhile, a fine-grained alignment module is designed to make the information from the two different modalities (i.e., visual and textual) interact. Besides the three publicly available benchmark datasets, a much larger captioning dataset, NWPU-Captions, is used to evaluate the performance of MAFA-Net. Experimental results show that MAFA-Net outperforms current approaches and obtains satisfying results even for challenging high-density scenes. In the future, we would like to consider more modalities, such as LiDAR or multispectral images, and domain adaptation [19] for RS visual applications.

        Table 3.Comparative Experimental Results on RSICD

        Table 4.Comparative Experimental Results on NWPU-Captions

        Table 5.Ablation Experimental Results on NWPU-Captions

        Acknowledgments: This work was supported by the National Natural Science Foundation of China (42090012), Special Research and 5G Project of Jiangxi Province in China (20212ABC03A09), Guangdong-Macao Joint Innovation Project (2021A0505080008), Key R&D Project of Sichuan Science and Technology Plan (2022YFN0031), and Zhuhai Industry University Research Cooperation Project of China (ZH22017001210098PWC).

        Fig. 3. Visualization results of MAFA-Net on UCM-Captions.

        Fig. 4. Visualization results of MAFA-Net on Sydney-Captions.

        Fig. 5. Visualization results of MAFA-Net on RSICD.

        Fig. 6. Visualization results of MAFA-Net on NWPU-Captions.