KDAP: An Open Source Toolkit to Accelerate Knowledge
Building Research
Amit Arjun Verma
SCCI Labs, IIT Ropar
Rupnagar, India
2016csz0003@iitrpr.ac.in
S.R.S. Iyengar
SCCI Labs, IIT Ropar
Rupnagar, Punjab, India
sudarshan@iitrpr.ac.in
Simran Setia
SCCI Labs, IIT Ropar
Rupnagar, Punjab, India
2017csz0001@iitrpr.ac.in
Neeru Dubey
SCCI Labs, IIT Ropar
Rupnagar, Punjab, India
neerudubey@iitrpr.ac.in
ABSTRACT
With the success of crowdsourced portals, such as Wikipedia, Stack
Overflow, Quora, and GitHub, a class of researchers is driven towards
understanding the dynamics of knowledge building on these
portals. Even though collaborative knowledge building portals are
known to be better than expert-driven knowledge repositories, limited
research has been performed to understand the knowledge
building dynamics in the former. This is mainly due to two reasons:
first, the unavailability of a standard data representation format, and
second, the lack of proper tools and libraries to analyze the knowledge
building dynamics.
We describe Knowledge Data Analysis and Processing Platform
(KDAP), a programming toolkit that is easy to use and provides
high-level operations for analysis of knowledge data. We propose
Knowledge Markup Language (Knol-ML), a standard representation
format for the data of collaborative knowledge building portals.
KDAP can process the massive data of crowdsourced portals like
Wikipedia and Stack Overflow efficiently. As a part of this toolkit,
a data-dump of various collaborative knowledge building portals
is published in Knol-ML format. The combination of Knol-ML and
the proposed open-source library will help the knowledge building
community to perform benchmark analysis.
URL: https://github.com/descentis/kdap
Supplementary Material: https://bit.ly/2Z3tZK5
CCS CONCEPTS
· Human-centered computing → Collaborative and social
computing systems and tools; Wikis; Open source software;
Computer supported cooperative work; Empirical studies in collabo-
rative and social computing.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org.
OpenSym 2020, August 25–27, 2020, Virtual conference, Spain
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-8779-8/20/08. . . $15.00
https://doi.org/10.1145/3412569.3412575
KEYWORDS
Knowledge Building, Wikipedia, datasets, open-source library, Q&A
ACM Reference Format:
Amit Arjun Verma, S.R.S. Iyengar, Simran Setia, and Neeru Dubey. 2020.
KDAP: An Open Source Toolkit to Accelerate Knowledge Building Research.
In 16th International Symposium on Open Collaboration (OpenSym 2020),
August 25–27, 2020, Virtual conference, Spain. ACM, New York, NY, USA,
11 pages. https://doi.org/10.1145/3412569.3412575
1 INTRODUCTION
With progress in computational power, research in various domains
is primarily based on the availability of data and appropriate tools
for analysis. Open access to libraries and data enhances the ease and
pace of research [26]. The impact of open-source tools (like Python,
R, and Scilab) can be verified by their expanding use within the
research community [41]. For example, a simple
task like matrix inversion requires multiple lines of code when
written in plain Python, whereas the NumPy library reduces
this task to a single line of code. Similar examples
can be found in various domains, where the usage of analysis tools
reduces the complexity of tasks in terms of time and effort. It is
useful to note that in recent years, the scientific community has been
positively influenced by a growing number of libraries, such as
scikit-learn for machine learning, NumPy and SciPy for statistical
computing, and matplotlib for visualization [52].
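The matrix inversion example can be made concrete with a short sketch. The function name invert_pure_python, the choice of Gauss-Jordan elimination, and the sample matrix are illustrative assumptions and do not come from the cited works; the only library call relied upon is numpy.linalg.inv.

```python
import numpy as np


def invert_pure_python(matrix):
    """Invert a square matrix with Gauss-Jordan elimination (illustrative helper)."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Partial pivoting: use the row with the largest absolute entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so that the pivot entry becomes 1.
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        # Eliminate the current column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    # The right half of the augmented matrix now holds the inverse.
    return [row[n:] for row in aug]


A = [[4.0, 7.0], [2.0, 6.0]]  # assumed sample input
inverse_manual = invert_pure_python(A)      # many lines without a library
inverse_numpy = np.linalg.inv(np.array(A))  # a single line with NumPy
```

Both approaches should agree up to floating-point error; the library call additionally delegates the numerical details to well-tested routines, which is the reduction in effort referred to above.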
The advancement in computational power and storage facilities
allows crowdsourced portals, such as Wikipedia, Stack Overflow,
Quora, Reddit, and GitHub, to host their data on publicly available
servers. The popularity and open access to the datasets of these
crowdsourced portals have drawn the attention of researchers from
various communities. Observing the collaboration and contribu-
tion of the crowd, researchers have generalized the knowledge
development on these portals to the actual knowledge building process [12].
From predicting box office success of movies to building
state-of-the-art software, these portals have helped the research
communities in various aspects [13, 28, 46].
The diverse and rich content present on Wikipedia is used to
study online collaboration dynamics [10, 21], to examine its impact
on other online collaborative portals [45], and to train state-of-the-art
artificial intelligence algorithms [30]. Similarly, the massive
growth of users and posts on crowdsourced QnA portals like Stack