Abstract
One of the most popular collaborative knowledge bases on the Internet is Wikipedia. Articles of this free encyclopaedia are created and edited by users from different countries in about 300 languages. Depending on topic and language version, quality of information there may vary. This study presents and classifies measures that can be extracted from Wikipedia articles for the purpose of automatic quality assessment in different languages. Based on a state of the art analysis and own experiments, specific measures for various aspects of quality have been defined. Additional, in this work they were also defined measures for quality assessment of data contained in the structural parts of Wikipedia articles - infoboxes. This study describes also an extraction methods for various sources of measures, that can be used in quality assessment.
Keywords
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsReferences
Abramowicz, W., Auer, S., Heath, T.: Linked data in business. Bus. Inf. Syst. Eng. 58(5), 323–326 (2016). https://doi.org/10.1007/s12599-016-0446-0
Alexa: Wikipedia.org traffic, demographics and competitors. https://www.alexa.com/siteinfo/wikipedia.org
Altmetric: free tools. https://www.altmetric.com/products/free-tools/
Anderka, M.: Analyzing and predicting quality flaws in user-generated content: the case of Wikipedia. Ph.D. Bauhaus-Universitaet Weimar Germany (2013)
Blumenstock, J.E.: Automatically assessing the quality of Wikipedia articles. Technical report (2008). https://doi.org/10.1080/17439880802324251
Blumenstock, J.E.: Size matters: word count as a measure of quality on Wikipedia. In: WWW, pp. 1095–1096 (2008). https://doi.org/10.1145/1367497.1367673
Bormuth, J.R.: Readability: a new approach. Read. Res. Q. 1, 79–132 (1966)
Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst. 30(1–7), 107–117 (1998)
De la Calzada, G., Dekhtyar, A.: On measuring the quality of Wikipedia articles. In: Proceedings of the 4th Workshop on Information Credibility, pp. 11–18. ACM (2010)
Caylor, J.S., Sticht, T.G.: Development of a simple readability index for job reading material (1973)
Chen, H.H.: How to use readability formulas to access and select English reading materials. J. Educ. Media Libr. Sci. 50(2), 229–254 (2012)
Coleman, M., Liau, T.L.: A computer readability formula designed for machine scoring. J. Appl. Psychol. 60(2), 283 (1975)
Conti, R., Marzini, E., Spognardi, A., Matteucci, I., Mori, P., Petrocchi, M.: Maturity assessment of Wikipedia medical articles. In: 2014 IEEE 27th International Symposium on Computer-Based Medical Systems (CBMS), pp. 281–286. IEEE (2014)
Dale, E., Chall, J.S.: A formula for predicting readability: instructions. Educ. Res. Bull. 18, 37–54 (1948)
Dalip, D.H., Gonçalves, M.A., Cristo, M., Calado, P.: Automatic quality assessment of content created collaboratively by web communities: a case study of Wikipedia. In: Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 295–304 (2009). https://doi.org/10.1145/1555400.1555449
Dalip, D.H., Gonçalves, M.A., Cristo, M., Calado, P.: Automatic assessment of document quality in web collaborative digital libraries. J. Data Inf. Quality 2(3), 1–30 (2011). https://doi.org/10.1145/2063504.2063507
Dang, Q.V., Ignat, C.L.: Measuring quality of collaboratively edited documents: the case of Wikipedia. In: 2016 IEEE 2nd International Conference on Collaboration and Internet Computing (CIC), pp. 266–275. IEEE (2016)
DBpedia: Main Page. https://wiki.dbpedia.org
Einstein, A.: The Meaning of Relativity. Routledge, Abingdon (2003)
English Wikipedia: API sandbox. https://en.wikipedia.org/wiki/Special:ApiSandbox
English Wikipedia: Criticism of Wikipedia. https://en.wikipedia.org/wiki/Criticism_of_Wikipedia
English Wikipedia: Featured article criteria. https://en.wikipedia.org/wiki/Wikipedia:Featured_article_criteria
English Wikipedia: Featured articles. https://en.wikipedia.org/wiki/Wikipedia:Featured_articles
English Wikipedia: Good articles. https://en.wikipedia.org/wiki/Wikipedia:Good_articles
English Wikipedia: Verifiability. https://en.wikipedia.org/wiki/Wikipedia:Verifiability
English Wikipedia: Wikiproject tabular data. https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Tabular_Data
Eppler, M.J.: Managing Information Quality: Increasing the Value of Information in Knowledge-Intensive Products and Processes. Springer, Heidelberg (2003). https://doi.org/10.1007/3-540-32225-6
Ferschke, O., Gurevych, I., Rittberger, M.: FlawFinder: a modular system for predicting quality flaws in Wikipedia. In: CLEF (Online Working Notes/Labs/Workshop), pp. 1–10 (2012)
Filipiak, D., Filipowska, A.: Improving the quality of art market data using linked open data and machine learning. In: Abramowicz, W., Alt, R., Franczyk, B. (eds.) BIS 2016. LNBIP, vol. 263, pp. 418–428. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-52464-1_39
Flekova, L., Ferschke, O., Gurevych, I.: What makes a good biography? Multidimensional quality analysis based on Wikipedia article feedback data. In: Proceedings of the 23rd International Conference on World Wide Web, pp. 855–866. ACM (2014)
Flesch, R.: A new readability yardstick. J. Appl. Psychol. 32(3), 221 (1948)
Greenfield, G.R.: Classic readability formulas in an EFL context: are they valid for Japanese speakers? Ph.D. thesis. Temple University (1999)
Gunning, R.: The Technique of Clear Writing. McGraw-Hill, New York (1952)
Hazen, B.T., Boone, C.A., Ezell, J.D., Jones-Farmer, L.A.: Data quality for data science, predictive analytics, and big data in supply chain management: an introduction to the problem and suggestions for research and applications. Int. J. Prod. Econ. 154, 72–80 (2014). https://doi.org/10.1016/j.ijpe.2014.04.018
Infoboxes.net: quality comparison of infoboxes in Miltilingual Wikipedia. http://infoboxes.net
Juran, J., Godfrey, A.B.: Quality Handbook, pp. 173–178. McGraw-Hill, New York (1999)
Kane, G.C.: A multimethod study of information quality in Wiki collaboration. ACM Trans. Manag. Inf. Syst. (TMIS) 2(1), 4 (2011)
Kincaid, J.P., Fishburne Jr, R.P., Rogers, R.L., Chissom, B.S.: Derivation of new readability formulas (automated readability index, fog count and Flesch reading ease formula) for navy enlisted personnel. Technical report. Naval Technical Training Command Millington TN Research Branch (1975)
Kittur, A., Kraut, R.E.: Harnessing the wisdom of crowds in Wikipedia: quality through coordination. In: Proceedings of the ACM 2008 Conference on Computer Supported Cooperative Work - CSCW 2008, p. 37 (2008). https://doi.org/10.1145/1460563.1460572
Kontokostas, D., et al.: Test-driven evaluation of linked data quality. In: Proceedings of the 23rd International Conference on World Wide Web, pp. 747–758. ACM (2014)
Lerner, J., Lomi, A.: Knowledge categorization affects popularity and quality of Wikipedia articles. PloS One 13(1), e0190674 (2018)
Lewoniewski, W.: Completeness and reliability of Wikipedia infoboxes in various languages. In: Abramowicz, W. (ed.) BIS 2017. LNBIP, vol. 303, pp. 295–305. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-69023-0_25
Lewoniewski, W.: Enrichment of information in multilingual Wikipedia based on quality analysis. In: Abramowicz, W. (ed.) BIS 2017. LNBIP, vol. 303, pp. 216–227. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-69023-0_19
Lewoniewski, W., Härting, R.-C., Wecel, K., Reichstein, C., Abramowicz, W.: Application of SEO metrics to determine the quality of Wikipedia articles and their sources. In: Damaševičius, R., Vasiljevienė, G. (eds.) ICIST 2018. CCIS, vol. 920, pp. 139–152. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99972-2_11
Lewoniewski, W., Khairova, N., Węcel, K., Stratiienko, N., Abramowicz, W.: Using morphological and semantic features for the quality assessment of Russian Wikipedia. In: Damaševičius, R., Mikašytė, V. (eds.) ICIST 2017. CCIS, vol. 756, pp. 550–560. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-67642-5_46
Lewoniewski, W., Węcel, K.: Relative quality assessment of Wikipedia articles in different languages using synthetic measure. In: Abramowicz, W. (ed.) BIS 2017. LNBIP, vol. 303, pp. 282–292. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-69023-0_24
Lewoniewski, W., Węcel, K., Abramowicz, W.: Quality and importance of Wikipedia articles in different languages. In: Dregvaite, G., Damasevicius, R. (eds.) ICIST 2016. CCIS, vol. 639, pp. 613–624. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46254-7_50
Lewoniewski, W., Węcel, K., Abramowicz, W.: Relative quality and popularity evaluation of multilingual Wikipedia articles. Informatics 4 (2017). https://doi.org/10.3390/informatics4040043
Lewoniewski, W., Węcel, K., Abramowicz, W.: Determining quality of articles in polish Wikipedia based on linguistic features. In: Damaševičius, R., Vasiljevienė, G. (eds.) ICIST 2018. CCIS, vol. 920, pp. 546–558. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99972-2_45
Lih, A.: Wikipedia as participatory journalism: reliable sources? Metrics for evaluating collaborative media as a news resource. In: 5th International Symposium on Online Journalism, p. 31 (2004)
Liu, J., Ram, S.: Using big data and network analysis to understand Wikipedia article quality. Data Knowl. Eng. 115, 80–93 (2018)
Lucassen, T., Schraagen, J.M.: Trust in Wikipedia: how users trust information from an unknown source. In: Proceedings of the 4th Workshop on Information Credibility, pp. 19–26. ACM (2010)
Mc Laughlin, G.H.: SMOG grading-a new readability formula. J. Read. 12(8), 639–646 (1969)
Mendes, P.N., Mühleisen, H., Bizer, C.: Sieve: linked data quality assessment and fusion. In: Proceedings of the 2012 Joint EDBT/ICDT Workshops, pp. 116–123. ACM (2012)
Microsoft Azure: Cloud computing platform & services. https://azure.microsoft.com/en-us/
Moyer, D., Carson, S.L., Dye, T.K., Carson, R.T., Goldbaum, D.: Determining the influence of reddit posts on Wikipedia pageviews. In: Ninth International AAAI Conference on Web and Social Media, pp. 75–82. AAAI Press Oxford, UK (2015)
O’Brien, J.A., Marakas, G.M.: Introduction to Information Systems, vol. 13. McGraw-Hill/Irwin, New York City (2005)
OECD Glossary of Statistical Terms: ISO 8402 - quality. http://stats.oecd.org/glossary/detail.asp?ID=5150
Ransbotham, S., Kane, G.: Membership turnover and collaboration success in online communities: explaining rises and falls from grace in Wikipedia. MIS Q. 35(3), 613–627 (2011)
Ransbotham, S., Kane, G.C., Lurie, N.H.: Network characteristics and the value of collaborative user-generated content. Mark. Sci. 31(3), 387–405 (2012)
di Sciascio, C., Strohmaier, D., Errecalde, M., Veas, E.: WikiLyzer: interactive information quality assessment in Wikipedia. In: Proceedings of the 22nd International Conference on Intelligent User Interfaces, pp. 377–388. ACM (2017)
Senter, R., Smith, E.A.: Automated readability index. Technical report, University of Cincinnati, Ohio (1967)
Shang, W.: A comparison of the historical entries in Wikipedia and Baidu Baike. In: Chowdhury, G., McLeod, J., Gillet, V., Willett, P. (eds.) iConference 2018. LNCS, vol. 10766, pp. 74–80. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-78105-1_9
Shen, A., Qi, J., Baldwin, T.: A hybrid model for quality assessment of Wikipedia articles. In: Proceedings of the Australasian Language Technology Association Workshop, pp. 43–52 (2017)
Soonthornphisaj, N., Paengporn, P.: Thai Wikipedia article quality filtering algorithm. In: Proceedings of the International Multi Conference of Engineers and Computer Scientists, vol. 1 (2017)
Stróżyna, M., Eiden, G., Abramowicz, W., Filipiak, D., Małyszko, J., Węcel, K.: A framework for the quality-based selection and retrieval of open data - a use case from the maritime domain. Electron. Mark. 28(2), 219–233 (2018). https://doi.org/10.1007/s12525-017-0277-y
Stvilia, B., Twidale, M.B., Gasser, L., Smith, L.C.: Information quality discussions in Wikipedia. In: Proceedings of the 2005 International Conference on Knowledge Management, pp. 101–113. Citeseer (2005)
Stvilia, B., Twidale, M.B., Smith, L.C., Gasser, L.: Assessing information quality of a community-based encyclopedia. In: Proceedings of ICIQ, pp. 442–454 (2005)
Wang, R.Y., Strong, D.M.: Beyond accuracy: what data quality means to data consumers. J. Manag. Inf. Syst. 12(4), 5–33 (1996)
Warncke-wang, M., Cosley, D., Riedl, J.: Tell me more : an actionable quality model for Wikipedia. In: In: WikiSym 2013, pp. 1–10 (2013). https://doi.org/10.1145/2491055.2491063
Warncke-Wang, M., Ranjan, V., Terveen, L.G., Hecht, B.J.: Misalignment between supply and demand of quality content in peer production communities. In: ICWSM, pp. 493–502 (2015)
Węcel, K., Lewoniewski, W.: Modelling the quality of attributes in Wikipedia infoboxes. In: Abramowicz, W. (ed.) BIS 2015. LNBIP, vol. 228, pp. 308–320. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-26762-3_27
WikiBest: Online game about comparing data quality between various languages of the wikipedia. https://wikibest.net
Wikidata: Main page. https://www.wikidata.org/wiki/Wikidata:Main_Page
Wikimedia Downloads: English Wikipedia latest database backup dumps. https://dumps.wikimedia.org/enwiki/latest/
Wikipedia Meta-Wiki: List of Wikipedias. https://meta.wikimedia.org/wiki/List_of_Wikipedias
Wikipedia Quality: Scientific works. https://wikipediaquality.com/wiki/Category:Scientific_works
WikiRank: Quality and popularity assessment of Wikipedia. https://wikirank.net
Wilkinson, D.M., Huberman, B.A.: Assessing the value of cooperation in Wikipedia. arXiv preprint arXiv: cs/0702140 (2007)
Wilkinson, D.M., Huberman, B.A.: Cooperation and quality in Wikipedia. In: Proceedings of the 2007 International Symposium on Wikis WikiSym 2007, pp. 157–164 (2007). https://doi.org/10.1145/1296951.1296968
Wu, K., Zhu, Q., Zhao, Y., Zheng, H.: Mining the factors affecting the quality of Wikipedia articles. In: 2010 International Conference of Information Science and Management Engineering (ISME), vol. 1, pp. 343–346. IEEE (2010)
Yaari, E., Baruchson-Arbib, S., Bar-Ilan, J.: Information quality assessment of community generated content: a user study of wikipedia. J. Inf. Sci. 37(5), 487–498 (2011)
Zaveri, A., Rula, A., Maurino, A., Pietrobon, R., Lehmann, J., Auer, S.: Quality assessment for linked data: a survey. Semant. Web 7(1), 63–93 (2016)
Zhang, S., Hu, Z., Zhang, C., Yu, K.: History-based article quality assessment on Wikipedia. In: 2018 IEEE International Conference on Big Data and Smart Computing (BigComp), pp. 1–8. IEEE (2018)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Lewoniewski, W. (2019). Measures for Quality Assessment of Articles and Infoboxes in Multilingual Wikipedia. In: Abramowicz, W., Paschke, A. (eds) Business Information Systems Workshops. BIS 2018. Lecture Notes in Business Information Processing, vol 339. Springer, Cham. https://doi.org/10.1007/978-3-030-04849-5_53
Download citation
DOI: https://doi.org/10.1007/978-3-030-04849-5_53
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-04848-8
Online ISBN: 978-3-030-04849-5
eBook Packages: Computer ScienceComputer Science (R0)