Wikipedia:Wikipedia Signpost/2019-11-29/Recent research
Bot census; discussions differ on Spanish and English Wikipedia; how nature's seasons affect pageviews
A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
"First census of Wikipedia bots"
- Reviewed by Indy beetle and Tilman Bayer
A paper titled "The Roles Bots Play in Wikipedia", published in Proceedings of the ACM on Human-Computer Interaction by five researchers from the Stevens Institute of Technology, was presented at this month's CSCW conference. Bots are a core component of English Wikipedia, accounting for approximately 10 percent of all edits as of 2019. After retrieving all 1,601 registered bots (as of 28 February 2019), the researchers used a machine learning procedure to organise them into a taxonomy with nine key "roles":
- Generator, e.g. Rambot
- Fixer, doing tasks such as correcting typos or adjusting links, e.g. Xqbot
- Connector, connecting English Wikipedia to other wikis or external sites, e.g. Citation bot
- Tagger, adding and modifying templates and categories
- Clerk, "updating statistical information, documenting user status, updating maintenance pages, and delivering article alert to Wikiprojects", e.g. WP 1.0 bot
- Archiver
- Protector, e.g. COIbot, XLinkBot, and ClueBot NG
- Advisor
- Notifier
Some bots act in several roles (e.g. AnomieBOT as Tagger, Clerk and Archiver).
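The paper's exact classification pipeline is not reproduced in the review, but the general technique it relies on — multi-label classification, since a bot can hold several roles at once — can be sketched as follows. This is an illustrative sketch using scikit-learn with invented bot descriptions and labels; the authors' actual features and training data may well differ.

```python
# Illustrative sketch only: multi-label classification of bots into roles,
# assuming each bot is represented by text such as its edit summaries.
# The training examples below are invented for demonstration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

bots = [
    "archiving old discussion threads to archive subpage",
    "fixing double redirects and correcting typos",
    "adding maintenance templates and updating categories",
    "reverting likely vandalism and warning the user",
]
roles = [["Archiver"], ["Fixer"], ["Tagger", "Clerk"], ["Protector", "Notifier"]]

binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(roles)

# One binary classifier per role, so a bot can be assigned several roles.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(bots, y)

# Predict roles for an unseen bot.
predicted = model.predict(["archiving stale talk page sections and fixing links"])
print(binarizer.inverse_transform(predicted))
```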
The last part of the paper concerns the impact of bots on the new editors they interact with. Extending previous research that had found increased retention among newcomers invited to the Teahouse support space by HostBot, an "Advisor" bot, the researchers show that other Advisor bots have a significant positive effect as well (although for the example cited, SuggestBot, it would have been worth mentioning the confounding factor that users need to opt in to receive its messages). Likewise confirming previous research, messages from ClueBot NG were found to have a negative effect, but this wasn't the case for other "Protector" bots: "The newcomers seem to not care about the bot signing their comments (SineBot) and are even positively influenced by the bot reverting their added links that violate Wikipedia’s copyright policy (XLinkBot)."
A press release, titled "Rise of the bots: Team completes first census of Wikipedia bots", quoted one of the authors as saying "People don't mind being criticized by bots, as long as they're polite about it. Wikipedia's transparency and feedback mechanisms help people to accept bots as legitimate members of the community."
The authors note the relevance of Wikidata to their study, where the proportion of bot edits "has reached 88%" (citing a 2014 paper), and find that the move of interlanguage link information to Wikidata led to a decrease in "Connector" bot activity on Wikipedia. At last year's CSCW, a paper titled "Bot Detection in Wikidata Using Behavioral and Other Informal Cues" had presented a machine learning approach for identifying undeclared bot edits, showing that "in some cases, unflagged bot activities can significantly misrepresent human behavior in analyses". In the present study about Wikipedia, it would have been interesting to read whether the authors see any limitations in the data source they used (Category:All Wikipedia bots).
Seasonality in pageviews reflects plants blooming and birds migrating
- Reviewed by Tilman Bayer


A paper in PLoS Biology uses Wikipedia pageview data for "the first broad exploration of seasonal patterns of interest in nature across many species and cultures". Specifically, the researchers looked at the traffic for articles about 31,751 different species across 245 Wikipedia language editions. They found "that seasonality plays an important role in how and when people interact with plants and animals online. ... Pageview seasonality varies across taxonomic clades in ways that reflect observable patterns in phenology, with groups such as insects and flowering plants having higher seasonality than mammals. Differences between Wikipedia language editions are significant; pages in languages spoken at higher latitudes exhibit greater seasonality overall, and species seldom show the same pattern across multiple language editions." Seasonality was often found to "clearly correspond with phenological patterns (e.g., bird migration or breeding...)", but in other cases also to human-made events such as annual holidays. For example, traffic for the English Wikipedia's article on the wild turkey (Meleagris gallopavo) spiked during Thanksgiving in the US, and saw a softer peak during "the spring hunting season for wild turkey in many US states."
Overall, articles about plants and animals exhibited seasonality much more often than articles about other topics. (Concretely, 20.2% of the species articles in the dataset were found to have seasonal traffic, compared to 6.51% in a random selection of nonspecies articles. One quarter of species had a seasonal article in at least one language. Technically, seasonality was determined via a method that involved, among other steps, fitting the pageviews time series to a sinusoidal model with one or two annual peaks, using a manually defined threshold.)
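The paper's full procedure is more involved, but the core idea — fit each article's pageview time series to a sinusoid with one or two annual peaks and declare it seasonal if the fit is good enough — can be sketched roughly as follows. The log transform, the joint annual/semi-annual harmonics, and the 0.5 cutoff below are illustrative choices, not the paper's exact parameters.

```python
# Rough sketch of a seasonality test: fit a monthly pageview series to a
# sinusoidal model with one and two annual peaks and flag it as seasonal
# if the fit explains enough variance. Details are illustrative only.
import numpy as np

def seasonality_score(pageviews_monthly):
    """Return R^2 of a least-squares fit with annual and semi-annual harmonics."""
    y = np.log1p(np.asarray(pageviews_monthly, dtype=float))  # damp traffic spikes
    t = np.arange(len(y))
    w = 2 * np.pi / 12.0  # one cycle per 12 months
    # Design matrix: intercept, linear trend, and sine/cosine terms
    # for one and two peaks per year.
    X = np.column_stack([
        np.ones_like(t), t,
        np.sin(w * t), np.cos(w * t),
        np.sin(2 * w * t), np.cos(2 * w * t),
    ])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

# Example: three years of synthetic monthly pageviews with a spring peak.
rng = np.random.default_rng(0)
views = 1000 + 800 * np.sin(2 * np.pi * np.arange(36) / 12) + rng.normal(0, 50, 36)
print(seasonality_score(views) > 0.5)  # manually chosen threshold, as in the paper's spirit
```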
See also earlier coverage of a related paper involving some of the same authors: "Using Wikipedia page views to explore the cultural importance of global reptiles"

Editor Interactions in Spanish and English Wikipedia
- Reviewed by Isaac Johnson
"How Does Editor Interaction Help Build the Spanish Wikipedia?" by Taryn Bipat, Diana Victoria Davidson, Melissa Guadarrama, Nancy Li, Ryder Black, David W. McDonald, and Mark Zachry of University of Washington, published in the 2019 CSCW Companion, examines talk page discussions in Spanish Wikipedia with a specific eye to how they might be different from the types of interactions in English Wikipedia. It replicates work from ACM GROUP 2007 that had developed a classification scheme for how editors use policy to discuss article changes.
This is a short paper, so it does not have the depth of analysis you would expect in a full-length conference paper, but the authors select 38 talk pages from Spanish Wikipedia (presumably using the methods from the original work, which focused specifically on talk pages with high levels of conversation over the course of a month) and code them according to how often policies are linked and in what context. The contextual codes applied are: "article scope", "legitimacy of source", "prior consensus", "power of interpretation", "threat of sanction", "practice on other pages", and "legitimacy of contributor". They find that "power of interpretation" and "article scope" are the most-used strategies, followed by "legitimacy of source". They also found a number of examples of editors linking to English Wikipedia pages.
While I would love to see a more robust analysis comparing English and Spanish talk pages that were sampled with the same strategy and from the same time periods, this work is an example of much-needed analyses of how the frameworks and models that are designed for one language community do or do not apply to other language communities. It would be fascinating to further understand the degree to which editors who are active across multiple languages adapt their discussion strategies to the local community versus apply similar strategies across all communities.
Too many editors spoil the broth - at least for global warming
- Reviewed by Tilman Bayer
In this article, three researchers from China present "a system dynamic model of Wikipedia based on the co-evolution theory, and [investigate] the interrelationships among topic popularity, group size, collaborative conflict, coordination mechanism, and information quality by using the vector error correction model (VECM)."
These five factors ("PSCCQ") are each represented by a monthly time series:
- number of searches in Google Trends (for the topic of global warming), indicating popularity
- number of unique editors contributing to the article ("group size")
- number of rollbacks in the article, as a measure of conflict
- monthly accumulated number of discussions recorded in the article's talk page, quantifying coordination effort
- number of edits to the article, which "provides a good indicator of a 'high level of quality' for Wikipedia articles"

In the paper, these are analyzed for the English Wikipedia's article on global warming over the timespan of February 2004 to November 2015. First, the researchers apply Granger causality tests to identify which of the five variables tend to predict which, resulting in the depicted graph. For example, popularity is predicted by coordination (the number of talk page discussions, the only predictor in this case), perhaps indicating that Wikipedia editors tend to debate new information about global warming before the general public takes it as an occasion to look the topic up on Google. Furthermore, the authors calculate impulse response functions for each of the 20 possible pairs; in the above example, this indicates how the popularity measure tends to "react" to a given increase in coordination. The application of a third technique, forecast error variance decomposition, further corroborates the results about how the five variables relate to each other.
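For readers who want to experiment with this kind of analysis, the pipeline (minus the forecast error variance decomposition) can be sketched with standard time series tooling. The following is a minimal, illustrative example using Python's statsmodels on synthetic stand-in series; the lag order, cointegration rank, and other specification details are chosen automatically here and do not reproduce the paper's actual model.

```python
# Illustrative sketch: VECM estimation, Granger causality tests, and impulse
# responses for five monthly "PSCCQ" series. The data below are synthetic
# placeholders; in the paper they come from Google Trends and the article's
# edit and talk page history (Feb 2004 - Nov 2015).
import numpy as np
import pandas as pd
from statsmodels.tsa.vector_ar.vecm import VECM, select_coint_rank, select_order

rng = np.random.default_rng(42)
n = 142  # number of months
common_trend = np.cumsum(rng.normal(0, 1, n))  # shared stochastic trend
data = pd.DataFrame({
    "popularity":   common_trend + rng.normal(0, 1, n),
    "group_size":   0.8 * common_trend + rng.normal(0, 1, n),
    "conflict":     np.cumsum(rng.normal(0, 1, n)),
    "coordination": 0.5 * common_trend + rng.normal(0, 1, n),
    "quality":      common_trend + rng.normal(0, 1, n),
})

# Choose lag order (information criterion) and cointegration rank (trace test);
# the max() guards keep this synthetic example well-defined.
lags = max(select_order(data, maxlags=6).aic, 1)
rank = max(select_coint_rank(data, det_order=0, k_ar_diff=lags).rank, 1)

res = VECM(data, k_ar_diff=lags, coint_rank=rank).fit()

# Granger causality: does coordination (talk page activity) help predict popularity?
print(res.test_granger_causality(caused="popularity", causing="coordination").summary())

# Impulse response of popularity to a shock in coordination over 12 months.
irf = res.irf(periods=12)
print(irf.irfs[:, 0, 3])  # [horizon, response=popularity, impulse=coordination]
```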
The study presents two quite far-reaching takeaways from the relations it identified between the five factors:
- "the critical importance of coordination mechanism [i.e. talk page discussions] in effectively harnessing the 'wisdom of the crowd'"
- "too many contributors involved in a particular project may be detrimental to group performance. Wikipedia managers should not necessarily pursue a more-is-better strategy towards the number of contributors."
An obvious limitation of this research, only somewhat coyly mentioned in the paper, is its restriction to a single article (and only one Wikipedia language version). While an effort is made to justify the choice of global warming as a high-traffic page with a substantial amount of controversy, it remains unclear how far the takeaways can be generalized.
Conferences and events
See the page of the monthly Wikimedia Research Showcase for videos and slides of past presentations.
Other recent publications
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.
- Compiled by Tilman Bayer and Miriam Redi
"Does Sleep Deprivation Cause Online Incivility? Evidence from a Natural Experiment"
This paper found that on English Wikipedia talk pages, about 22% more uncivil messages originate from impacted regions on the Mondays following the shift to daylight saving time.
"Smaller, more tightly-knit" WikiProjects may be more efficient
From the abstract:
"Web Traffic Prediction of Wikipedia Pages"
From the abstract:
"Anomaly Detection in the Dynamics of Web and Social Networks Using Associative Memory"
From the abstract:
"Operationalizing Conflict and Cooperation between Automated Software Agents in Wikipedia: A Replication and Expansion of 'Even Good Bots Fight'"
This paper from CSCW 2017 "replicates, extends, and refutes conclusions" of a paper by Yasseri et al. that had received wide and prolonged media attention for its claims that Wikipedia bots are fighting each other (cf. previous review: "Wikipedia bot wars capture the imagination of the popular press - but are they real?").
"The digital knowledge economy index: mapping content production"
From the abstract:
Linking 20 GB of data from Wikidata with a biodiversity database in 10 minutes
From the abstract:
"Inspiration, Captivation, and Misdirection: Emergent Properties in Networks of Online Navigation"
From the abstract:
"Different Topic, Different Traffic: How Search and Navigation Interplay on Wikipedia"
This paper aims to understand two paradigms of information seeking in Wikipedia: search by formulating a query, and navigation by following hyperlinks.
References
- Supplementary references and notes:
Discuss this story
@HaeB and Miriam (WMF): your "aggregated clickstream data" link for the Gildersleve and Yasseri "Inspiration..." paper is broken, and does not appear to be in the paper itself. EllenCT (talk) 00:49, 30 November 2019 (UTC)