HCMUS at MediaEval 2020: Image-Text Fusion for Automatic News-Images Re-Matching

Thuc Nguyen-Quang∗1,3, Tuan-Duy H. Nguyen∗1,3, Thang-Long Nguyen-Ho∗1,3, Anh-Kiet Duong∗1,3, Nhat Hoang-Xuan∗1,3, Vinh-Thuyen Nguyen-Truong∗1,3, Hai-Dang Nguyen1,3, Minh-Triet Tran1,2,3
1 University of Science, VNU-HCM, 2 John von Neumann Institute, VNU-HCM, 3 Vietnam National University, Ho Chi Minh City, Vietnam
{nqthuc,nhtduy,ntvthuyen}@apcs.vn, {nhtlong,hxnhat,nhdang}@selab.hcmus.edu.vn, 18120046@student.hcmus.edu.vn, tmtriet@fit.hcmus.edu.vn

ABSTRACT
Matching text and images based on their semantics plays an important role in cross-media retrieval. In news in particular, the connection between text and images is highly ambiguous. In the context of the MediaEval 2020 Challenge, we propose three multi-modal methods for mapping the text and images of news articles to a shared space in order to perform efficient cross-retrieval. Our methods show systematic improvement and validate our hypotheses, while the best-performing method reaches a recall@100 score of 0.2064.

1 INTRODUCTION
News articles represent a complex class of multimedia whose textual content and accompanying images might not be explicitly related [25]. Existing research in the multimedia and recommender-system domains mostly investigates image-text pairs with simple relationships, e.g., image captions that literally describe components of the images [16]. To address this, the MediaEval 2020 NewsImages Task calls for researchers to investigate the real-world relationship between news text and images in more depth, in order to understand its implications for journalism and news recommendation systems [19]. Our team at HCMUS responds to this call by addressing the Image-Text Re-Matching task. Specifically, given a set of image-text pairs in the wild, the task requires us to correctly re-assign images to their decoupled articles, with the aim of understanding how journalists choose illustrative images.
Our methods mainly concern fusing cross-modal embeddings for automatic matching. We experimented with a range of embedded information, including simple set intersection, deep neural features, and knowledge-graph-enhanced neural features, and combined these features in various ways across our experiments. We obtain our best result with an ensemble of the experimented methods.

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'20, 14-15 December 2020, Online.

2 METHODS

2.1 Metric Learning
The primary idea of this baseline method is to use metric learning to project image and text embeddings into a shared space in which matching pairs are highly similar. We use two approaches to embed image features: global context embedding and local context embedding. In the first approach, we use EfficientNet [30], a state-of-the-art classification architecture, to extract features of the image, taking the flattened output features. Our motivation in the latter approach is to harness critical local information beyond the extracted global context. Thus, we use the bottom-up-attention model [3] to extract the top-k objects by confidence score before passing them to a self-attention sequential model. For both routines, we employ the BERT [12] language model to embed the textual content, then project the textual and image embeddings to the same dimension. Finally, we train our Triplet Loss [15] model with positive and negative pairs produced by a hard-sample miner.

2.2 Image-Text Matching via Categorization
In this method, we train two gradient-boosting decision trees [18], one for categorizing images and the other for categorizing articles. The target categories are ['nrw', 'kultur', 'region', 'panorama', 'sport', 'wirtschaft', 'koeln', 'ratgeber', 'politik', 'unknown'], which are deduced from URLs in the training set. We use features extracted from the images and text to train the decision trees.
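The metric-learning objective of Sec. 2.1 can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the dimensions, margin value, and linear projection are stand-ins for the EfficientNet/BERT encoders and learned projection heads, and the hard-sample miner is omitted.

```python
import numpy as np

def project(x, W, b):
    """Linear projection of modality-specific features into the shared space
    (stand-in for the paper's learned projection to a common dimension)."""
    return x @ W + b

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss: pull matching image-text pairs together and push
    non-matching pairs at least `margin` farther apart than the matches."""
    d_pos = np.linalg.norm(anchor - positive, axis=-1)  # distance to the true match
    d_neg = np.linalg.norm(anchor - negative, axis=-1)  # distance to a mismatch
    return np.maximum(0.0, d_pos - d_neg + margin).mean()
```

In training, a hard-sample miner would pick the negatives with the smallest d_neg in each batch, which is what makes the margin constraint informative.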
To augment the data, we use VGG16, InceptionResNetV2, MobileNetV2, EfficientNetB1-7, Xception, ResNet152V2, NASNetLarge, and DenseNet201 [10, 14, 17, 27–30, 32] for images, while using pretrained BERT models [2, 8, 9, 11] and pretrained ELECTRA models [1, 9] to extract contextual features. We presume that images and articles of the same category are likely to be related. Moreover, the rank of a matched category also affects the ranking: for example, an image-text pair sharing only a 3rd-ranked category is likely less relevant than a pair sharing a 1st-ranked category. Hence, instead of using Jaccard similarity, we propose an iterative ranking method that takes the order of matched categories into account. At the k-th iteration, our method first finds the top-k categories for each image and the top-k categories for each article. Then, for each article, we create a list of candidate images whose top-k categories intersect those of the article. The list of candidates at the k-th iteration is appended to the final list. Finally, the remaining images that never become candidates are kept in their original order and appended to the end of the final list.

2.3 Graph-based Face-Name Matching
We observe that, in many instances, the publisher illustrates an article with a portrait of somebody mentioned in the text. We therefore build a face-name graph to represent the relations between names and faces.

Person name extraction: To automatically extract people's names from the text, we use entity-fishing [23], an open-source, high-performance entity recognition and disambiguation tool. It relies on Random Forest and Gradient Tree Boosting to recognize named entities, in our case people's names, and links them to Wikidata entities using their word embeddings and the Wikidata entities' embeddings.

Face encoding: We use an open-source face recognition library [13] to detect faces and represent each face as a 128-dimensional vector.
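The iterative category-based ranking of Sec. 2.2 can be sketched as below. This is a simplified sketch: the category lists here are placeholders, whereas in our method they come from the two gradient-boosted classifiers, and only a single article is ranked.

```python
def iterative_rank(article_cats, image_cats, max_k=3):
    """Rank candidate images for one article by the depth k at which their
    top-k predicted categories first intersect the article's top-k.

    article_cats: list of predicted categories for the article, best first.
    image_cats:   dict image_id -> list of predicted categories, best first.
    """
    ranked, seen = [], set()
    for k in range(1, max_k + 1):
        art_top = set(article_cats[:k])
        for img, cats in image_cats.items():
            # Candidate at iteration k: top-k categories intersect the article's.
            if img not in seen and art_top & set(cats[:k]):
                ranked.append(img)
                seen.add(img)
    # Images that never become candidates keep their original order at the end.
    ranked += [img for img in image_cats if img not in seen]
    return ranked
```

Images matched at an earlier (smaller) k are ranked ahead of those matched only at a deeper k, which encodes the intuition that sharing a 1st-ranked category matters more than sharing a 3rd-ranked one.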
The tool uses a pre-trained model from the dlib-models repository [20], with ResNet as the backbone for face feature extraction. Using the training set, we connect each person mentioned in the articles with the features extracted from the accompanying faces. During testing, we encode the faces in each candidate image and count the number of matched faces connected to the people mentioned in the text. Two faces are matched if the ℓ2 distance between their vectors is less than 0.6. Images are ranked by their total number of matched faces.
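The face-matching vote above can be sketched as follows. The encodings here are toy stand-ins for the 128-dimensional dlib vectors; the 0.6 threshold is the one we use.

```python
import numpy as np

THRESHOLD = 0.6  # two faces match if their l2 distance is below this

def match_score(image_faces, known_faces):
    """Count how many faces in an image match any face vector linked
    (via the training set) to a person named in the article text."""
    score = 0
    for f in image_faces:
        if any(np.linalg.norm(f - k) < THRESHOLD for k in known_faces):
            score += 1
    return score

def rank_images(images, known_faces):
    """Sort candidate images by their total number of matched faces."""
    return sorted(images,
                  key=lambda img: match_score(images[img], known_faces),
                  reverse=True)
```

A usage example: if the article names one person whose known encoding is `known_faces[0]`, an image containing that person's face (distance below 0.6) outranks an image containing only strangers.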