KDAP: An Open Source Toolkit to Accelerate Knowledge Building Research

Amit Arjun Verma, SCCI Labs, IIT Ropar, Rupnagar, India, 2016csz0003@iitrpr.ac.in
S.R.S. Iyengar, SCCI Labs, IIT Ropar, Rupnagar, Punjab, India, sudarshan@iitrpr.ac.in
Simran Setia, SCCI Labs, IIT Ropar, Rupnagar, Punjab, India, 2017csz0001@iitrpr.ac.in
Neeru Dubey, SCCI Labs, IIT Ropar, Rupnagar, Punjab, India, neerudubey@iitrpr.ac.in

ABSTRACT
With the success of crowdsourced portals, such as Wikipedia, Stack Overflow, Quora, and GitHub, a class of researchers is driven towards understanding the dynamics of knowledge building on these portals. Even though collaborative knowledge building portals are known to be better than expert-driven knowledge repositories, limited research has been performed to understand the knowledge building dynamics in the former. This is mainly due to two reasons: first, the unavailability of a standard data representation format; second, the lack of proper tools and libraries to analyze the knowledge building dynamics. We describe the Knowledge Data Analysis and Processing Platform (KDAP), a programming toolkit that is easy to use and provides high-level operations for the analysis of knowledge data. We propose Knowledge Markup Language (Knol-ML), a standard representation format for the data of collaborative knowledge building portals. KDAP can efficiently process the massive data of crowdsourced portals like Wikipedia and Stack Overflow. As a part of this toolkit, a data dump of various collaborative knowledge building portals is published in Knol-ML format. The combination of Knol-ML and the proposed open-source library will help the knowledge building community to perform benchmark analysis.
URL: https://github.com/descentis/kdap
Supplementary Material: https://bit.ly/2Z3tZK5

CCS CONCEPTS
· Human-centered computing → Collaborative and social computing systems and tools; Wikis; Open source software; Computer supported cooperative work; Empirical studies in collaborative and social computing.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
OpenSym 2020, August 25–27, 2020, Virtual conference, Spain
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-8779-8/20/08...$15.00
https://doi.org/10.1145/3412569.3412575

KEYWORDS
Knowledge Building, Wikipedia, datasets, open-source library, Q&A

ACM Reference Format:
Amit Arjun Verma, S.R.S. Iyengar, Simran Setia, and Neeru Dubey. 2020. KDAP: An Open Source Toolkit to Accelerate Knowledge Building Research. In 16th International Symposium on Open Collaboration (OpenSym 2020), August 25–27, 2020, Virtual conference, Spain. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3412569.3412575

1 INTRODUCTION
With progress in computational power, research in various domains is primarily based on the availability of data and appropriate tools for analysis. Open access to libraries and data enhances the ease and pace of research [26]. The impact of open-source tools (like Python, R, and Scilab) is evident in their growing use by the research community [41].
For example, a simple task like matrix inversion requires multiple lines of code when written in plain Python, whereas the NumPy library reduces this task to a single line of code. Similar examples can be found in various domains, where the usage of analysis tools reduces the complexity of tasks in terms of time and effort. In recent years, the scientific community has been positively influenced by a growing number of libraries, such as scikit-learn for machine learning, NumPy and SciPy for statistical computing, and matplotlib for visualization [52].

The advancement in computational power and storage facilities allows crowdsourced portals, such as Wikipedia, Stack Overflow, Quora, Reddit, and GitHub, to host their data on publicly available servers. The popularity of these crowdsourced portals and the open access to their datasets have drawn the attention of researchers from various communities. Observing the collaboration and contribution of the crowd, researchers have generalized the knowledge development on these portals to the actual knowledge building process [12]. From predicting the box office success of movies to building state-of-the-art software, these portals have helped the research communities in various aspects [13, 28, 46].

The diverse and rich content present on Wikipedia is used to study online collaboration dynamics [10, 21], to examine its impact on other online collaborative portals [45], and to train state-of-the-art artificial intelligence algorithms [30]. Similarly, the massive growth of users and posts on crowdsourced QnA portals like Stack