
Measures for Quality Assessment of Articles and Infoboxes in Multilingual Wikipedia

  • Conference paper
Business Information Systems Workshops (BIS 2018)

Abstract

One of the most popular collaborative knowledge bases on the Internet is Wikipedia. Articles of this free encyclopaedia are created and edited by users from different countries in about 300 languages. Depending on the topic and language version, the quality of information may vary. This study presents and classifies measures that can be extracted from Wikipedia articles for the purpose of automatic quality assessment in different languages. Based on a state-of-the-art analysis and the author's own experiments, specific measures for various aspects of quality have been defined. Additionally, this work defines measures for assessing the quality of data contained in the structured parts of Wikipedia articles, the infoboxes. The study also describes extraction methods for the various sources of measures that can be used in quality assessment.
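As a rough illustration of what extracting measures from an article can look like, the sketch below computes a few simple counts from raw wikitext. The specific measures (reference count, section count, share of filled infobox parameters), the function names, and the sample text are assumptions chosen for illustration; they are not the definitions or the extraction methods used in the paper.

```python
import re


def extract_infobox(wikitext: str) -> str:
    """Return the first {{Infobox ...}} template, or "" if none is found."""
    start = re.search(r"\{\{\s*Infobox", wikitext, flags=re.IGNORECASE)
    if not start:
        return ""
    depth = 0
    i = start.start()
    # Walk the text and track nesting of {{ ... }} to find the matching close.
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return wikitext[start.start():i]
        else:
            i += 1
    return ""  # unbalanced braces: treat as no infobox


def basic_measures(wikitext: str) -> dict:
    """Compute a few illustrative (hypothetical) measures from wikitext."""
    # Opening <ref> or <ref name=...> tags; closing </ref> tags are not counted.
    references = len(re.findall(r"<ref[\s>]", wikitext, flags=re.IGNORECASE))
    # Headings such as "== History ==" on their own line.
    sections = len(
        re.findall(r"^==+[^=\n].*?==+[ \t]*$", wikitext, flags=re.MULTILINE)
    )
    # Infobox parameters written one per line as "| name = value".
    total = filled = 0
    for line in extract_infobox(wikitext).splitlines():
        match = re.match(r"^\s*\|\s*([^=|]+?)\s*=\s*(.*)$", line)
        if match:
            total += 1
            if match.group(2).strip():
                filled += 1
    return {
        "length_chars": len(wikitext),
        "references": references,
        "sections": sections,
        "infobox_params": total,
        "infobox_filled": filled,
        "infobox_completeness": filled / total if total else 0.0,
    }


sample = """{{Infobox settlement
| name = Example
| country = Exampleland
| population =
}}
'''Example''' is a place.<ref>Source A</ref>

== History ==
Text.<ref name="b">Source B</ref>
"""

m = basic_measures(sample)
print(m)
```

Counts like these depend only on markup, so the same code can be applied to different language versions; measures based on the text itself would need language-specific processing.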



Author information

Corresponding author

Correspondence to Włodzimierz Lewoniewski.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Lewoniewski, W. (2019). Measures for Quality Assessment of Articles and Infoboxes in Multilingual Wikipedia. In: Abramowicz, W., Paschke, A. (eds) Business Information Systems Workshops. BIS 2018. Lecture Notes in Business Information Processing, vol 339. Springer, Cham. https://doi.org/10.1007/978-3-030-04849-5_53

  • DOI: https://doi.org/10.1007/978-3-030-04849-5_53

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-04848-8

  • Online ISBN: 978-3-030-04849-5

  • eBook Packages: Computer Science (R0)
