Biomedical Question Answering: A Comprehensive Review

Qiao Jin 1,2*, Zheng Yuan 1,2, Guangzhi Xiong 1, Qianlan Yu 1, Chuanqi Tan 2, Mosha Chen 2, Songfang Huang 2, Xiaozhong Liu 3, Sheng Yu 1
1 Tsinghua University, 2 Alibaba Group, 3 Indiana University Bloomington

Abstract

Question Answering (QA) is a benchmark Natural Language Processing (NLP) task in which models predict the answer to a given question using related documents, images, knowledge bases and question-answer pairs. Automatic QA has been successfully applied in various domains such as search engines and chatbots. However, in specialized domains like biomedicine, QA systems are still rarely used in real-life settings. Biomedical QA (BQA), as an emerging QA task, enables innovative applications to effectively perceive, access and understand complex biomedical knowledge. In this work, we provide a critical review of recent efforts in BQA. We comprehensively investigate prior BQA approaches, which we classify into 6 major methodologies (open-domain, knowledge base, information retrieval, machine reading comprehension, question entailment and visual QA), 4 topics of contents (scientific, clinical, consumer health and examination) and 5 types of formats (yes/no, extraction, generation, multi-choice and retrieval). Finally, we highlight several key challenges of BQA and explore potential directions for future work.
1 Introduction

Biomedical knowledge acquisition is an important task in information retrieval and knowledge management, and biomedical professionals as well as the general public need effective assistance to access, understand and consume complex biomedical concepts. For example, doctors need to be aware of up-to-date clinical evidence for the diagnosis and treatment of diseases under the scheme of Evidence-based Medicine (EBM, Sackett 1997), and the general public is increasingly interested in learning about their own health conditions on the Internet (Fox and Duggan, 2012).

The process of acquisition through abstraction, induction, and conception can be challenging. Traditionally, Information Retrieval (IR) systems, i.e. search engines like Google and PubMed, are used to meet such information needs. However, classical IR is still not efficient enough: for instance, Russell-Rose and Chamberlain (2017) report that answering complex medical queries with search engines requires 4 expert hours. Compared with retrieval systems, which typically return a list of relevant documents for the users to read, Question Answering (QA) systems that provide direct answers to users' questions are more straightforward and intuitive. In general, QA itself is a challenging benchmark Natural Language Processing (NLP) task for evaluating the abilities of intelligent systems to understand a question, retrieve and utilize relevant materials, and finally generate its answer.

* jqa14@mails.tsinghua.edu.cn
With the rapid development of computer hardware, modern QA models, especially those based on deep learning (Cheng et al., 2016; Seo et al., 2016; Chen et al., 2017; Peters et al., 2018; Devlin et al., 2019), achieve comparable or even better performance than humans on many benchmark datasets (Hermann et al., 2015; Rajpurkar et al., 2016; Joshi et al., 2017; Rajpurkar et al., 2018; Yang et al., 2018), and have been successfully adopted in general-domain search engines and conversational assistants (Qiu et al., 2017; Zhou et al., 2020). The Text REtrieval Conference (TREC) QA Track triggered modern QA research (Voorhees, 2001), when QA models were mostly based on IR. Zweigenbaum (2003) identified the distinct characteristics of BQA over general-domain QA. Later, many traditional open-domain BQA systems were proposed, such as EPoCare (Niu et al., 2003), PICO- and knowledge-

arXiv:2102.05281v1 [cs.CL] 10 Feb 2021