Asking Questions the Human Way: Scalable Question-Answer Generation from Text Corpus

Bang Liu 1, Haojie Wei 2, Di Niu 1, Haolan Chen 2, Yancheng He 2
1 University of Alberta, Edmonton, AB, Canada
2 Platform and Content Group, Tencent, Shenzhen, China

ABSTRACT

The ability to ask questions is important in both human and machine intelligence. Learning to ask questions helps knowledge acquisition, improves question-answering and machine reading comprehension tasks, and helps a chatbot to keep a conversation flowing with a human. Existing question generation models are ineffective at generating large numbers of high-quality question-answer pairs from unstructured text, since, given an answer and an input passage, question generation is inherently a one-to-many mapping. In this paper, we propose Answer-Clue-Style-aware Question Generation (ACS-QG), which aims at automatically generating high-quality and diverse question-answer pairs from unlabeled text corpora at scale by imitating the way a human asks questions. Our system consists of: i) an information extractor, which samples from the text multiple types of assistive information to guide question generation; ii) neural question generators, which generate diverse and controllable questions, leveraging the extracted assistive information; and iii) a neural quality controller, which removes low-quality generated data based on text entailment. We compare our question generation models with existing approaches and resort to voluntary human evaluation to assess the quality of the generated question-answer pairs. The evaluation results suggest that our system dramatically outperforms state-of-the-art neural question generation models in terms of generation quality, while remaining scalable. With models trained on a relatively small amount of data, we can generate 2.8 million quality-assured question-answer pairs from a million sentences found in Wikipedia.
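The three-stage pipeline described in the abstract (extract assistive inputs, generate questions, filter by entailment) can be sketched as follows. This is a toy, runnable illustration only: every function is a hypothetical stand-in for the paper's neural components, not the actual implementation.

```python
# Toy sketch of the ACS-QG pipeline: (i) information extractor,
# (ii) question generator, (iii) quality controller. All logic here is
# an illustrative assumption; the paper's components are neural models.
from typing import List, Tuple

def extract_inputs(sentence: str) -> List[Tuple[str, str, str]]:
    # Stage i (toy): treat each capitalized token as a candidate answer,
    # the full sentence as the clue, and fix the style to "What".
    answers = [w for w in sentence.split() if w[0].isupper()]
    return [(a, sentence, "What") for a in answers]

def generate_question(answer: str, clue: str, style: str) -> str:
    # Stage ii (toy): a template in place of the neural generator.
    return f"{style} is mentioned alongside '{answer}'?"

def quality_filter(question: str, answer: str, sentence: str) -> bool:
    # Stage iii (toy): keep a pair only if the answer is supported by
    # the sentence; entailment is approximated here by a substring test.
    return answer in sentence

def acs_qg(sentence: str) -> List[Tuple[str, str]]:
    # Run all three stages and collect the surviving (question, answer) pairs.
    pairs = []
    for answer, clue, style in extract_inputs(sentence):
        q = generate_question(answer, clue, style)
        if quality_filter(q, answer, sentence):
            pairs.append((q, answer))
    return pairs

pairs = acs_qg("Obsessed won the 2010 MTV Movie Award for Best Fight.")
```

The point of the sketch is the data flow, not the stubs: each stage is independently replaceable, which is what lets the real system swap in neural extractors, generators, and entailment-based filters.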
CCS CONCEPTS

- Computing methodologies → Natural language processing; Natural language generation; Machine translation.

KEYWORDS

Question Generation, Sequence-to-Sequence, Machine Reading Comprehension

ACM Reference Format:
Bang Liu, Haojie Wei, Di Niu, Haolan Chen, Yancheng He. 2020. Asking Questions the Human Way: Scalable Question-Answer Generation from Text Corpus. In Proceedings of The Web Conference 2020 (WWW '20), April 20–24, 2020, Taipei, Taiwan. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3366423.3380270

This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution. WWW '20, April 20–24, 2020, Taipei, Taiwan. © 2020 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License. ACM ISBN 978-1-4503-7023-3/20/04. https://doi.org/10.1145/3366423.3380270

Example sentence: The fight scene finale between Sharon and the character played by Ali Larter, from the movie Obsessed, won the 2010 MTV Movie Award for Best Fight.

Answer: MTV Movie Award for Best Fight
Clue: from the movie Obsessed
Style: Which
Q: A fight scene from the movie, Obsessed, won which award?

Answer: MTV Movie Award for Best Fight
Clue: The fight scene finale between Sharon and the character played by Ali Larter
Style: Which
Q: Which award did the fight scene between Sharon and the role of Ali Larter win?

Answer: Obsessed
Clue: won the 2010 MTV Movie Award for Best Fight
Style: What
Q: What is the name of the movie that won the 2010 MTV Movie Award for Best Fight?

Figure 1: Given the same input sentence, we can ask diverse questions based on the different choices about i) what the target answer is; ii) which answer-related chunk is used as a clue; and iii) what type of question is asked.
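The Figure 1 examples can be captured as structured generation inputs: the same sentence paired with different (answer, clue, style) choices. The sketch below encodes those three examples in a simple data structure; the class name and fields are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical representation of the (answer, clue, style) inputs shown
# in Figure 1. One sentence plus three different input tuples yields
# three different questions.
from dataclasses import dataclass

SENTENCE = ("The fight scene finale between Sharon and the character "
            "played by Ali Larter, from the movie Obsessed, won the "
            "2010 MTV Movie Award for Best Fight.")

@dataclass
class ACSInput:
    """One generation input: a chosen answer span, a clue chunk, and a
    question style, all grounded in the same source sentence."""
    sentence: str
    answer: str   # target answer span, copied from the sentence
    clue: str     # answer-related chunk guiding question content
    style: str    # interrogative word controlling the question type

# The three (answer, clue, style) choices from Figure 1.
inputs = [
    ACSInput(SENTENCE, "MTV Movie Award for Best Fight",
             "from the movie Obsessed", "Which"),
    ACSInput(SENTENCE, "MTV Movie Award for Best Fight",
             "The fight scene finale between Sharon and the character "
             "played by Ali Larter", "Which"),
    ACSInput(SENTENCE, "Obsessed",
             "won the 2010 MTV Movie Award for Best Fight", "What"),
]

for x in inputs:
    # Both the answer and the clue are literal substrings of the sentence.
    assert x.answer in x.sentence and x.clue in x.sentence
```

Making the answer, clue, and style explicit inputs is what turns the one-to-many mapping into a set of controllable one-to-one generation problems.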
1 INTRODUCTION

Automatically generating question-answer pairs from unlabeled text passages is of great value to many applications, such as assisting the training of machine reading comprehension systems [10, 44, 45], generating queries/questions from documents to improve search engines [17], training chatbots to get and keep a conversation going [40], generating exercises for educational purposes [7, 18, 19], and generating FAQs for web documents [25]. Many question-answering tasks, such as machine reading comprehension and chatbots, require a large amount of labeled samples for supervised training, and acquiring such samples is time-consuming and costly. Automatic question-answer generation makes it possible to provide these systems with scalable training data and to transfer a pre-trained model to new domains that lack manually labeled training samples.

Despite a large number of studies on Neural Question Generation, it remains a significant challenge to generate high-quality QA pairs from unstructured text in large quantities. Most existing neural question generation approaches try to solve the answer-aware question generation problem, where an answer chunk and the surrounding passage are provided as input to the model, while the output is the question to be generated. They formulate the task as a Sequence-to-Sequence (Seq2Seq) problem, and design various encoders, decoders, and input features to improve the quality of generated questions [10, 11, 22, 27, 39, 41, 53]. However, answer-aware question generation models are far from sufficient, since question generation from a passage is inherently a one-to-many