An Empirical Study to Predict the Quality
of Wikipedia Articles
Imran Khan¹(✉), Shahid Hussain², Hina Gul³, Muhammad Shahid⁴,
and Muhammad Jamal⁴
¹ Virtual University of Pakistan, Lahore, Pakistan
ikniazi786@gmail.com
² COMSATS University, Islamabad, Pakistan
shussain@comsats.edu.pk
³ IQRA National University, Peshawar, Pakistan
hinaafridi1984@gmail.com
⁴ Govt College No 1, DIKhan, Pakistan
bluefiber08@gmail.com, mustafvimasood@gmail.com
Abstract. Wikipedia is widely regarded as a more effective way to deliver
content than other types of encyclopedia. However, the quality of Wikipedia
articles remains a concern. The aim of the proposed research is to perform an
empirical study to predict the quality of Wikipedia articles. In the proposed
methodology, we consider a few metrics, namely article length (total number of
words in an article), number of edits, article age (in days), and article
ranking, and perform a few statistical tests to analyze the quality of
Wikipedia articles. Moreover, we observe a significant correlation between the
proposed metrics and the rating of articles, which helps to identify their quality.
Keywords: Wikipedia · Correlation · Linear regression · Article length ·
Number of edits · Article age
1 Introduction
Wikipedia is one of the world's most trusted online, open-source, nonprofit
encyclopedias. It hosts a large number of articles covering almost every topic
and has a huge viewership (a total of 35,147,128 registered users). Wikipedia is
one example of socially produced Big Data. According to Liu and Ram [8], more
than 5,472,000 articles were available on English Wikipedia in September 2017.
Data is now produced faster than ever: more than 2.5K petabytes of data are
generated per day, which gives rise to the widely discussed idea of Big Data.
By the end of 2018, English Wikipedia contained more than 5,763,800 unique
articles¹. Wikipedia is a highly dynamic system; article content can be changed
as often as desired. Accordingly, the quality of an article is a time-dependent
function, and a single article may contain high- and low-quality content in
different spans of its
¹ https://en.wikipedia.org/wiki/Special:Statistics
© Springer Nature Switzerland AG 2019
Á. Rocha et al. (Eds.): WorldCIST'19 2019, AISC 932, pp. 485–492, 2019.
https://doi.org/10.1007/978-3-030-16187-3_47