Extracting Knowledge from the Structure of Wikipedia Links
|Date and time||10.10.2019, 11:00 – 12:00|
|Place and room|
|Speaker||Cristian Consonni. Cristian Consonni is a PhD student in Computer Science at the Department of Information Engineering and Computer Science (DISI) at the University of Trento, Italy. He is part of the dbTrento group. He is interested in machine learning and data mining techniques on time-evolving graphs.|
|Category||Conferences - Seminars|
Surfing the links between Wikipedia articles is a valuable way to acquire new knowledge related to a topic. In Wikipedia parlance, these links are called internal links or wikilinks. We introduce WikiLinkGraphs: a complete, longitudinal dataset of the network of internal Wikipedia links for the 9 largest language editions. The dataset contains yearly snapshots of the network and spans 17 years, from the creation of Wikipedia in 2001 to March 1st, 2018. Equipped with this data, we explore the problem of establishing which topics are most relevant to a given page. In Wikipedia, connections are so dense that, starting from a single page, it is possible to reach virtually any other topic in the encyclopedia, so reachability alone says little about relevance. A well-known option to solve this problem is Personalized PageRank; its performance, however, is hindered by pages with high indegree that function as hubs and obtain high scores regardless of the starting point. In this talk, we present CycleRank, a novel algorithm based on cyclic paths aimed at finding the most relevant nodes related to a topic. We compare the results of CycleRank with those of Personalized PageRank and other algorithms derived from it, both with qualitative examples and with an extensive quantitative evaluation. We perform different experiments based on ground truths such as the number of clicks that links receive from visitors and the set of related articles highlighted by editors in the “See also” section of each article. We find that CycleRank tends to identify pages that are more relevant to the selected topic. Finally, we show that computing CycleRank is two orders of magnitude faster than computing the other baselines.
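The contrast between the two approaches can be illustrated with a small Python sketch. It is not the speaker's implementation: the toy graph, the function names `personalized_pagerank` and `cycle_score`, the cycle length limit `max_len`, and the `exp(-length)` weighting are all assumptions made for illustration. The first function is a plain power iteration for Personalized PageRank; the second scores each node by the short simple cycles that pass through both it and the starting page, which is one simple way to realise a score "based on cyclic paths".

```python
from math import exp


def personalized_pagerank(graph, source, alpha=0.85, iters=100):
    """Power iteration in which the random surfer always teleports to `source`."""
    score = {n: 0.0 for n in graph}
    score[source] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in graph}
        for n, out in graph.items():
            if out:
                share = alpha * score[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                # Dangling node: send its mass back to the source.
                nxt[source] += alpha * score[n]
        nxt[source] += 1.0 - alpha
        score = nxt
    return score


def cycle_score(graph, source, max_len=3):
    """Score nodes by simple cycles through `source` with at most `max_len` links.

    Each cycle adds exp(-length) to every node on it (an assumed weighting).
    """
    score = {n: 0.0 for n in graph}

    def walk(node, path):
        for m in graph[node]:
            if m == source:
                if len(path) >= 2:  # ignore self-loops
                    weight = exp(-len(path))  # len(path) == number of links
                    for p in path:
                        score[p] += weight
            elif m not in path and len(path) < max_len:
                walk(m, path + [m])

    walk(source, [source])
    return score


# Hypothetical toy link graph: "United States" is linked from several pages
# but no path leads from it back to "Pasta".
graph = {
    "Pasta": ["Italy", "Spaghetti", "United States"],
    "Spaghetti": ["Pasta", "Italy"],
    "Italy": ["Pasta", "United States"],
    "United States": ["English language"],
    "English language": ["United States"],
}

ppr = personalized_pagerank(graph, "Pasta")
cyc = cycle_score(graph, "Pasta")

for page in graph:
    print(f"{page:18s} PPR={ppr[page]:.3f}  cycle={cyc[page]:.3f}")
```

On this toy graph the hub "United States" receives a positive Personalized PageRank score but a cycle score of zero, because no link path leads from it back to "Pasta"; requiring a return path is what filters out such hubs in this sketch.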