Asking Questions the Human Way:
Scalable Question-Answer Generation from Text Corpus
Bang Liu¹, Haojie Wei², Di Niu¹, Haolan Chen², Yancheng He²
¹University of Alberta, Edmonton, AB, Canada
²Platform and Content Group, Tencent, Shenzhen, China
ABSTRACT
The ability to ask questions is important in both human and ma-
chine intelligence. Learning to ask questions helps knowledge acqui-
sition, improves question-answering and machine reading compre-
hension tasks, and helps a chatbot to keep the conversation flowing
with a human. Existing question generation models are ineffective
at generating a large amount of high-quality question-answer pairs
from unstructured text, since given an answer and an input passage,
question generation is inherently a one-to-many mapping. In this
paper, we propose Answer-Clue-Style-aware Question Generation
(ACS-QG), which aims at automatically generating high-quality and
diverse question-answer pairs from an unlabeled text corpus at scale
by imitating the way a human asks questions. Our system consists
of: i) an information extractor, which samples from the text multiple
types of assistive information to guide question generation; ii) neu-
ral question generators, which generate diverse and controllable
questions, leveraging the extracted assistive information; and iii)
a neural quality controller, which removes low-quality generated
data based on text entailment. We compare our question generation
models with existing approaches and resort to voluntary human
evaluation to assess the quality of the generated question-answer
pairs. The evaluation results suggest that our system dramatically
outperforms state-of-the-art neural question generation models in
terms of the generation quality, while remaining scalable. With models
trained on a relatively small amount of data,
we can generate 2.8 million quality-assured question-answer pairs
from a million sentences found in Wikipedia.
CCS CONCEPTS
· Computing methodologies → Natural language processing; Natural language generation; Machine translation.
KEYWORDS
Question Generation, Sequence-to-Sequence, Machine Reading
Comprehension
ACM Reference Format:
Bang Liu, Haojie Wei, Di Niu, Haolan Chen, and Yancheng He. 2020. Asking
Questions the Human Way: Scalable Question-Answer Generation from
Text Corpus. In Proceedings of The Web Conference 2020 (WWW ’20), April
20–24, 2020, Taipei, Taiwan. ACM, New York, NY, USA, 12 pages.
https://doi.org/10.1145/3366423.3380270
This paper is published under the Creative Commons Attribution 4.0 International
(CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their
personal and corporate Web sites with the appropriate attribution.
WWW ’20, April 20–24, 2020, Taipei, Taiwan
© 2020 IW3C2 (International World Wide Web Conference Committee), published
under Creative Commons CC-BY 4.0 License.
ACM ISBN 978-1-4503-7023-3/20/04.
https://doi.org/10.1145/3366423.3380270
The fight scene finale between Sharon and the character played by Ali Larter,
from the movie Obsessed, won the 2010 MTV Movie Award for Best Fight.
Answer: MTV Movie Award for Best Fight
Clue: from the movie Obsessed
Style: Which
Q: A fight scene from the movie, Obsessed, won which award?
Answer: MTV Movie Award for Best Fight
Clue: The fight scene finale between Sharon and the character played by
Ali Larter
Style: Which
Q: Which award did the fight scene between Sharon and the role of Ali
Larter win?
Answer: Obsessed
Clue: won the 2010 MTV Movie Award for Best Fight
Style: What
Q: What is the name of the movie that won the 2010 MTV Movie Award
for Best Fight?
Figure 1: Given the same input sentence, we can ask diverse
questions based on different choices of: i) what the
target answer is; ii) which answer-related chunk is used as a
clue; and iii) what type of question is asked.
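The three-stage system described in the abstract (an information extractor, neural question generators, and an entailment-based quality controller) can be sketched roughly as follows. All class and function names here are hypothetical placeholders, not the paper's actual code; the generator and the quality controller are stubbed out, whereas the paper implements them as neural models.

```python
# Hypothetical sketch of the ACS-QG pipeline; names are illustrative only.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class QGInput:
    sentence: str  # source sentence from the corpus
    answer: str    # sampled answer chunk
    clue: str      # answer-related chunk used as a clue
    style: str     # question style, e.g. "Which", "What", "Who"

def extract_inputs(sentence: str) -> List[QGInput]:
    """Stage i): sample (answer, clue, style) triples from a sentence.
    Hard-coded here with two of the triples from Figure 1."""
    return [
        QGInput(sentence, "MTV Movie Award for Best Fight",
                "from the movie Obsessed", "Which"),
        QGInput(sentence, "Obsessed",
                "won the 2010 MTV Movie Award for Best Fight", "What"),
    ]

def generate_question(inp: QGInput) -> str:
    """Stage ii): a neural generator would decode a fluent question
    conditioned on the triple; this stub just fills a template."""
    return f"{inp.style}-style question about '{inp.answer}'?"

def passes_quality_control(inp: QGInput, question: str) -> bool:
    """Stage iii): an entailment model would check that the sentence
    entails the (question, answer) pair; stubbed as always-True."""
    return True

sentence = ("The fight scene finale between Sharon and the character "
            "played by Ali Larter, from the movie Obsessed, won the "
            "2010 MTV Movie Award for Best Fight.")
qa_pairs: List[Tuple[str, str]] = []
for inp in extract_inputs(sentence):
    q = generate_question(inp)
    if passes_quality_control(inp, q):
        qa_pairs.append((q, inp.answer))
```

Because stage i) can sample many distinct (answer, clue, style) triples from one sentence, a single input sentence yields multiple question-answer pairs, which is how the one-to-many nature of question generation is made explicit.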
1 INTRODUCTION
Automatically generating question-answer pairs from unlabeled
text passages is of great value to many applications, such as as-
sisting the training of machine reading comprehension systems
[10, 44, 45], generating queries/questions from documents to im-
prove search engines [17], training chatbots to get and keep a
conversation going [40], generating exercises for educational pur-
poses [7, 18, 19], and generating FAQs for web documents [25].
Many question-answering tasks such as machine reading compre-
hension and chatbots require a large number of labeled samples
for supervised training, which are time-consuming and costly
to acquire. Automatic question-answer generation makes it possible to
provide these systems with scalable training data and to transfer
a pre-trained model to new domains that lack manually labeled
training samples.
Despite a large number of studies on Neural Question Generation,
it remains a significant challenge to generate high-quality QA pairs
from unstructured text in large quantities. Most existing neural
question generation approaches try to solve the answer-aware
question generation problem, where an answer chunk and the
surrounding passage are provided as an input to the model while
the output is the question to be generated. They formulate the
task as a Sequence-to-Sequence (Seq2Seq) problem, and design
various encoder, decoder, and input features to improve the quality
of generated questions [10, 11, 22, 27, 39, 41, 53]. However, answer-
aware question generation models are far from sufficient, since
question generation from a passage is inherently a one-to-many