Wikipedia:Wikipedia Signpost/2020-05-31/Recent research

Recent research

Automatic detection of covert paid editing; Wiki Workshop 2020

A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.

Automatic detection of undisclosed paid editing

Figure from the paper: "Article network: two articles are connected by an edge if they have been edited by a common user. Colors indicate articles create by the same sockpuppet group of undisclosed paid editors (UPEs). Negative articles (in gray) are articles never edited by an UPE."

In a paper published in the proceedings of last month's (virtual) The Web Conference, four researchers from Boise State University (collaborating with an English Wikipedia administrator) present a machine learning framework for "automatically detecting Wikipedia undisclosed paid contributions, so that they can be quickly identified and flagged for removal."

Their approach is based on constructing two datasets, of articles and editors, each consisting of undisclosed paid editing (UPE; as previously confirmed by Wikipedia administrators) and a control group of articles/users assumed to be "benign" (i.e., not the result of, or engaged in, UPE). In more detail, the authors started from a previously published dataset that had collected the results of 23 past sockpuppet investigations, yielding 1,006 known UPE accounts, and added 98 manually determined UPE accounts. A sample of articles newly created in March 2019 (limited to those created by users with fewer than 200 edits who were manually assessed as not being engaged in paid editing) was used to build the benign parts of the two datasets.

For both articles and editors, the authors tested three different classification algorithms (logistic regression, support vector machine, and random forest) on a relatively simple set of features (e.g., for articles, the number of categories, or for editors, the average time between two consecutive edits made by the user). Still, the resulting method appears quite effective for detecting undisclosed paid articles.

Among the most effective features was "the percentage of edits made by a user that are less than 10 bytes. Undisclosed paid editors try to become autoconfirmed users; thus they typically make around 10 minor edits before creating a promotional article."
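To make the two editor features mentioned above more concrete, here is a minimal sketch of how they might be computed from a user's edit history. It is not the authors' code: the function name editor_features, the input format (a list of timestamp and byte-change pairs), and the choice to measure edit size as the absolute change in bytes are all assumptions made for illustration, and the sample history is invented.

```python
from datetime import datetime
from statistics import mean


def editor_features(edits, small_edit_bytes=10):
    """Compute two illustrative per-editor features.

    edits: list of (timestamp, size_delta_bytes) tuples for one user.
    The input format and the use of abs(size_delta) are assumptions,
    not details taken from the paper.
    """
    if not edits:
        raise ValueError("need at least one edit")
    ordered = sorted(edits, key=lambda e: e[0])
    # Gaps in seconds between each pair of consecutive edits.
    gaps = [(b[0] - a[0]).total_seconds() for a, b in zip(ordered, ordered[1:])]
    avg_gap = mean(gaps) if gaps else 0.0
    # Count edits whose absolute size change is under the threshold.
    small = sum(1 for _, delta in ordered if abs(delta) < small_edit_bytes)
    return {
        "avg_seconds_between_edits": avg_gap,
        "pct_small_edits": 100.0 * small / len(ordered),
    }


# Hypothetical history: three tiny edits, then one large article creation.
history = [
    (datetime(2019, 3, 1, 12, 0, 0), 4),
    (datetime(2019, 3, 1, 12, 1, 0), -3),
    (datetime(2019, 3, 1, 12, 2, 0), 7),
    (datetime(2019, 3, 1, 12, 5, 0), 5200),
]
features = editor_features(history)
print(features)
```

In this invented history, three of the four edits change fewer than 10 bytes, which is the kind of pattern the quoted feature is meant to pick up. Feature vectors like this one would then be passed to a classifier such as those named in the paper.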

Overall, the results appear to hold high promise for a practical application that could be of significant assistance to the editing community in combating the abuse of Wikipedia for promotional purposes, which is an ongoing and pervasive problem (compare e.g. this month's Signpost coverage of a recent investigation on the French Wikipedia). Obviously, any output of such an algorithm would need to be vetted manually, considering the relatively small but (in absolute terms) still considerable number of false positives. The paper contains little discussion of possible limitations of the sockpuppet investigations dataset used (e.g., how representative it might be of UPE efforts overall, as opposed to focused on the activities of some specific PR agencies), leaving open the possibility of overfitting.
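The point about false positives being small in relative terms but considerable in absolute terms can be illustrated with a short calculation. Every number below is hypothetical and chosen only for the arithmetic (none is taken from the paper), and the function name expected_flags is likewise an assumption.

```python
def expected_flags(n_articles, upe_share, recall, false_positive_rate):
    """Return (true positives, false positives, precision) for a classifier
    applied to n_articles, of which a fraction upe_share are UPE articles.
    All inputs are hypothetical illustration values."""
    n_upe = n_articles * upe_share
    n_benign = n_articles - n_upe
    true_pos = n_upe * recall
    false_pos = n_benign * false_positive_rate
    flagged = true_pos + false_pos
    precision = true_pos / flagged if flagged else 0.0
    return true_pos, false_pos, precision


# Hypothetical scenario: 10,000 new articles, 2% of them UPE,
# a classifier with 90% recall and a 1% false positive rate.
tp, fp, precision = expected_flags(10_000, 0.02, 0.90, 0.01)
print(f"true positives: {tp:.0f}, false positives: {fp:.0f}, "
      f"precision: {precision:.2f}")
```

Under these made-up assumptions, a false positive rate of only 1% would still mean that roughly a third of the flagged articles are benign, which is why manual vetting of the output would remain necessary.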

The paper also includes an analysis of the network of the articles in the dataset, with two articles connected by an edge if the same user had edited both (see figure), but its results do not appear to have been used in the detection method. Among the findings: "there is less user collaboration among positive articles [as measured by local clustering coefficient and PageRank]. UPEs only work on a limited number of Wikipedia titles that they are interested in promoting, whereas genuine users edit more pages related to their field of expertise."
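As a rough illustration of the network described above, the following sketch builds an article graph from a mapping of articles to their editors and computes the local clustering coefficient of a node. The function names, the input format, and the toy data are assumptions for illustration only; the sketch does not reproduce the authors' pipeline and leaves out PageRank.

```python
from itertools import combinations


def build_article_network(article_editors):
    """article_editors: dict mapping article title -> set of user names.
    Two articles are linked if at least one user edited both."""
    neighbors = {article: set() for article in article_editors}
    for a, b in combinations(article_editors, 2):
        if article_editors[a] & article_editors[b]:
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def local_clustering(neighbors, node):
    """Fraction of pairs of a node's neighbors that are themselves linked."""
    nbrs = neighbors[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in neighbors[u])
    return 2.0 * links / (k * (k - 1))


# Toy data (invented): five articles and the users who edited them.
article_editors = {
    "A": {"u1", "u2"},
    "B": {"u2", "u3"},
    "C": {"u1", "u3"},
    "D": {"u3"},
    "E": {"u4"},
}
net = build_article_network(article_editors)
for article in article_editors:
    print(article, sorted(net[article]), round(local_clustering(net, article), 3))
```

In this toy graph, article E shares no editor with any other article and so has a clustering coefficient of 0, while A, B, C and D, which share editors, form a densely connected cluster.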

The authors highlight the importance of sockpuppets, observing that "undisclosed paid editors typically act as a group of sockpuppet accounts" and basing most of their ground truth dataset on sockpuppet cases. A brief literature review covers previous research on the automatic detection of sockpuppets on Wikipedia, including a paper from the 2016 Web Conference presenting a method able "to detect 99% of fake accounts," and an earlier stylometric method (cf. our 2013 coverage: "Sockpuppet evidence from automated writing style analysis" / "New sockpuppet corpus"). An ongoing research project by the Wikimedia Foundation (presented at last year's Wikimania) concerns the practical implementation of such a tool.


Wiki Workshop 2020

As part of The Web Conference, the annual Wiki Workshop "[brought] together researchers exploring all aspects of Wikipedia, Wikidata, and other Wikimedia projects", this year held as a one-day Zoom meeting with over 100 participants. Among the papers (see also proceedings):

"Layered Graph Embedding for Entity Recommendation using Wikipedia in the Yahoo! Knowledge Graph"

From the abstract:

See also slides


"WikiHist.html: English Wikipedia's Full Revision History in HTML Format"

From the abstract:

See also slides and the underlying 7-terabyte dataset with code


"Collaboration of Open Content News in Wikipedia: The Role and Impact of Gatekeepers"

From the abstract:

See also slides


"Domain-Specific Automatic Scholar Profiling Based on Wikipedia"

From the abstract:

See also slides


"Beyond Performing Arts: Network Composition and Collaboration Patterns"

From the abstract:

See also slides


"Content Growth and Attention Contagion in Information Networks: Addressing Information Poverty on Wikipedia"

From the abstract:

See also slides


"The Positioning Matters: Estimating Geographical Bias in the Multilingual Record of Biographies on Wikipedia"

From the abstract:

"Citation Detective: a Public Dataset to Improve and Quantify Wikipedia Citation Quality at Scale"

From the abstract:

See also code and blog post.


For coverage of some other papers from Wiki Workshop 2020, see last month's issue ("What is trending on (which) Wikipedia?"), and upcoming issues. This blog post about the event covers several non-paper aspects of the schedule, including the keynote by Jess Wade.


Uses material from the Wikipedia article Wikipedia:Wikipedia Signpost/2020-05-31/Recent research, released under the CC BY-SA 4.0 license.