Quality Assessment of Biomedical Metadata Using Topic Modeling

Stuti Nayak¹, Amrapali Zaveri¹, and Michel Dumontier¹

¹ Institute of Data Science, Maastricht University, Maastricht, The Netherlands, firstname.lastname@maastrichtuniversity.nl

Abstract. There is an abundance of biomedical data present on the Web. However, this data is not re-usable because it is not described with sufficiently rich metadata. The recently published FAIR principles specify desirable criteria that metadata and their corresponding datasets need to meet to be Findable, Accessible, Interoperable, and Reusable. However, the quality of biomedical metadata is currently poor, which makes data re-use extremely difficult. To tackle this problem, we propose the use of topic modeling, specifically non-negative matrix factorization (NMF), as a first step towards dimensionality reduction when dealing with large amounts of data. In this position paper, as a use case, we apply NMF to the BioSamples metadata and present preliminary results.

Keywords: Metadata, Quality, Biomedical, NMF, Topic Modeling

1 Introduction

There is an abundance of biomedical data present on the Web [5]. This biomedical data is instrumental in enabling several medical use cases and should be shared and re-used by other investigators. In order to understand the structure of the data, there is an urgent need for an accurate, structured, and complete description of the data, defined as metadata. The recently published FAIR principles specify desirable criteria that metadata and their corresponding datasets should meet to be Findable, Accessible, Interoperable, and Reusable (FAIR) [14]. For data to be FAIR, metadata needs to be accurate and uniform (e.g., relying on controlled terms where possible). However, a large amount of current biomedical metadata is of poor quality, i.e., extremely heterogeneous, which makes data re-use extremely difficult [4].
Thus, we need to perform quality assessment of metadata to identify quality issues and ultimately improve the metadata. Currently, the challenges with metadata quality assessment are: (i) the size of the data and (ii) the heterogeneity of the data.

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents [9]. In particular, topic modeling techniques allow examining a large set of documents and discovering, based on the occurrence frequency of the words, what the topics might be. The metadata elements are then associated with one or none of the topics, thus allowing one to easily detect erroneous ones.
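
To make the idea concrete, the following is a minimal sketch of NMF-based topic discovery, not the implementation used in this work. It assumes whitespace tokenization, raw term counts, and the standard multiplicative update rules for the factorization V ≈ W · H. The four toy strings are invented for illustration and are not taken from BioSamples; the helper names (build_term_matrix, nmf, dominant_topic, top_terms) and the min_weight cutoff are likewise hypothetical.

```python
import random
from collections import Counter


def build_term_matrix(docs):
    """Return (vocab, V), where V[i][j] counts how often term j occurs in document i."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    column = {term: j for j, term in enumerate(vocab)}
    matrix = []
    for tokens in tokenized:
        row = [0.0] * len(vocab)
        for term, count in Counter(tokens).items():
            row[column[term]] = float(count)
        matrix.append(row)
    return vocab, matrix


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def nmf(v, k, iterations=500, seed=0, eps=1e-9):
    """Factorize V (docs x terms) into non-negative W (docs x k) and H (k x terms)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h


def dominant_topic(weights, min_weight=1e-6):
    """Index of the strongest topic, or None if every weight is below min_weight."""
    best = max(range(len(weights)), key=lambda a: weights[a])
    return best if weights[best] >= min_weight else None


def top_terms(topic_row, vocab, n=3):
    """The n vocabulary terms with the largest weight in one row of H."""
    ranked = sorted(range(len(vocab)), key=lambda j: topic_row[j], reverse=True)
    return [vocab[j] for j in ranked[:n]]


# Invented toy "metadata values"; real input would be far larger and noisier.
docs = [
    "homo sapiens human",
    "human homo sapiens",
    "mus musculus mouse",
    "mouse mus musculus",
]

vocab, v = build_term_matrix(docs)
w, h = nmf(v, k=2)
topics = [dominant_topic(row) for row in w]

for doc, topic in zip(docs, topics):
    print(f"{doc!r} -> topic {topic}")
for index, row in enumerate(h):
    print(f"topic {index}: {top_terms(row, vocab)}")
```

Each row of W gives the topic weights of one element, and each row of H gives the term weights of one topic. An element whose weights are all near zero is associated with none of the topics and would be a candidate for closer inspection. For data of realistic size, a library implementation of NMF would be a more practical choice than this pure-Python loop.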