Wikipedia:Wikipedia Signpost/2023-10-03/Recent research
Readers prefer ChatGPT over Wikipedia; concerns about limiting "anyone can edit" principle "may be overstated"
A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
In blind test, readers prefer ChatGPT output over Wikipedia articles in terms of clarity, and see both as equally credible

A preprint titled "Do You Trust ChatGPT? -- Perceived Credibility of Human and AI-Generated Content" presents what the authors (four researchers from Mainz, Germany) call surprising and troubling findings:
The human-generated texts were taken from the lead section of four English Wikipedia articles (Academy Awards, Canada, malware and United States Senate). The LLM-generated versions were obtained from ChatGPT using the prompt Write a dictionary article on the topic "[TITLE]". The article should have about [WORDS] words.
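As an illustration (the paper itself used the ChatGPT web interface, not the API), the study's prompt template could be filled in programmatically; `make_prompt` is a hypothetical helper, not from the paper:

```python
def make_prompt(title: str, words: int) -> str:
    """Fill in the study's reported prompt template for a given article
    title and target word count (chosen to match the length of the
    corresponding Wikipedia lead section)."""
    return (
        f'Write a dictionary article on the topic "{title}". '
        f"The article should have about {words} words"
    )

# Example: the prompt for the "Canada" article with a 300-word target
print(make_prompt("Canada", 300))
```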
The researchers report that
One caveat about these results (only indirectly acknowledged in the paper's "Limitations" section) is that the study focused on four quite popular (i.e. non-obscure) topics – Academy Awards, Canada, malware and the US Senate. Also, the texts sought to present only the most important information about each of these, in the form of a dictionary entry (as per the ChatGPT prompt) or the lead section of a Wikipedia article. It is well known that the output of LLMs tends to contain fewer errors when it draws on information that is amply present in their training data (see e.g. our previous coverage of a paper that, for this reason, called for assessing the factual accuracy of LLM output on a benchmark that specifically includes lesser-known "tail topics"). Indeed, the authors of the present paper "manually checked the LLM-generated texts for factual errors and did not find any major mistakes" – something that is widely reported not to be the case for ChatGPT output in general. That said, it has similarly been claimed that Wikipedia, too, is less reliable on obscure topics. Also, the paper used the freely available version of ChatGPT (in its 23 March 2023 revision), which is based on the GPT-3.5 model, rather than the premium "ChatGPT Plus" version which, since March 2023, has been using the more powerful GPT-4 model (as does Microsoft's free Bing chatbot). GPT-4 has been found to have a significantly lower hallucination rate than GPT-3.5.
FlaggedRevs study finds that concerns about limiting Wikipedia's "anyone can edit" principle "may be overstated"
A paper titled "The Risks, Benefits, and Consequences of Prepublication Moderation: Evidence from 17 Wikipedia Language Editions", from last year's CSCW conference, addresses a longstanding open question in Wikipedia research, with important implications for some current issues.
Wikipedia famously allows anyone to edit, which generally means that even unregistered editors can make changes to content that go live immediately – only subject to "postpublication moderation" by other editors afterwards. Less well known is that on many Wikipedia language versions, this principle has long been limited by a software feature called Flagged Revisions (FlaggedRevs), which was developed and designed at the request of the German Wikipedia community and deployed there first in 2008, and has since been adopted by various other Wikimedia projects. (These do not include the English Wikipedia, which after much discussion implemented a system called "Pending Changes" that is very similar, but is only applied on a case-by-case basis to a small percentage of pages.) As summarized by the authors:
The paper studies the impact of the introduction of FlaggedRevs "on 17 Wikipedia language communities: Albanian, Arabic, Belarusian, Bengali, Bosnian, Esperanto, Persian, Finnish, Georgian, German, Hungarian, Indonesian, Interlingua, Macedonian, Polish, Russian, and Turkish" (leaving out a few non-Wikipedia sister projects that also use the system). The overall findings are that
In the "Discussion" section, the authors write
(This reviewer agrees in particular regarding the lack of notifications for new and unregistered editors that their edit has been approved – having filed, in vain, a proposal to implement this uncontroversially beneficial and already designed software feature to the annual "Community Wishlist", in 2023, 2022, and 2019.)
Interestingly, while the FlaggedRevs feature was (as summarized by the authors) developed by the Wikimedia Foundation and the German Wikimedia chapter (Wikimedia Deutschland), community complaints about a lack of support from the Foundation for the system were present even then, e.g. in a talk at Wikimania 2008 (notes, video recording) by User:P. Birken, a main driving force behind the project. Perhaps relatedly, the authors of the present study highlight a lack of researcher attention:
Still, it may be worth mentioning that there have been at least two preceding attempts to study this question (neither of these has been published in peer-reviewed form, thus their omission from the present study is understandable). They likewise don't seem to have identified major concerns that FlaggedRevs might contribute to community decline:
- A talk at Wikimania 2010 presented preliminary results from a study commissioned by Wikimedia Germany, e.g. that on German Wikipedia, "In general, flagged revisions did not [affect] anonymous editing" and that "most revisions got approved very rapidly" (the latter result surely doesn't hold everywhere; e.g. on Russian Wikipedia, the median time for an unregistered editor's edit to get reviewed is over 13 days at the time of writing). It also found, unsurprisingly, a "reduced impact of vandalism", consistent with the present study.
- An informal investigation of an experiment conducted by the Hungarian Wikipedia in 2018/19 similarly found that FlaggedRevs had "little impact on the growth of the editor community" overall. The experiment consisted of deactivating the feature of FlaggedRevs that hides unreviewed revisions from readers. As a second question, the Hungarian Wikipedians asked "How much extra load does [deactivating FlaggedRevs] put on patrollers?" They found that "[t]he ratio of bad faith or damaging edits grew minimally (2-3 percentage points); presumably it is a positive feedback for vandals that they see their edits show up publicly. The absolute number of such edits grew significantly more than that, since the number of anonymous edits grew [...]."
In any case, the CSCW paper reviewed here presents a much more comprehensive and methodical approach, not just because it examines the impact of FlaggedRevs across multiple wikis, but also in its formalization of explicit research hypotheses and its use of more reliable statistical techniques.
The findings in detail
In more detail, the researchers formalized four groups of research hypotheses about the impact of FlaggedRevs [our bolding]:
- First, the study assessed whether the "system is indeed functioning as intended", by hypothesizing that it reduces the "number of visible rejected contributions" (i.e. edits that were reverted after being approved, i.e. becoming visible to the general reader), both from users affected by the restriction (H1a) and from all editors (H1c), but not from those editors not affected (H1b). All three of these sub-hypotheses were confirmed in an interrupted time series (ITS) analysis of the monthly counts of such reverts (aggregated for all users in each group, over the entire wiki), covering the timespan from 12 months before to 12 months after the date (or month) on which FlaggedRevs was activated on a particular Wikipedia version. The researchers conclude that "In general, the results we see are in line with our expectations and provide strong evidence that FlaggedRevs achieves its primary goal of hiding low-quality contributions from the general public."
- Secondly, "Our H2 hypotheses suggest that prepublication review will affect the quality of contributions overall. We operationalize quality in two ways. First, we use the number of rejected contributions that we operationalize as the number of reverts. [...] We also test our second hypotheses using average quality that we operationalize as revert rate [...] measured as the proportion of contributions that are eventually [reverted]." Like the first hypothesis, this is separately assessed for affected users, non-affected users and all users. The authors anticipated a rise in the quality of contributions overall (H2c) and of contributions from affected users such as IP editors (H2a), reasoning that "proactive measures of content moderation and production control can play an important role in encouraging prosocial behavior." For unaffected users, they again hypothesized a null effect (H2b). This set of hypotheses was again examined in an ITS analysis of the time series of monthly aggregate counts of such edits. Here, the authors "find little evidence of the prepublication moderation system having a major impact on the quality of contributions. Thus, we cannot conclude that FlaggedRevs alters the quantity or quality of newcomers’ contributions."
- The third group of hypotheses was motivated by existing "research that has shown that additional quality control policies may negatively affect the growth of a peer production community" (citing several papers which have been covered here before, see e.g. "'The rise and decline' of the English Wikipedia"). Again, this is split into three sub-hypotheses for affected users (H3a), unaffected users (H3b) and the community overall (H3c). The authors chose the aggregate number of mainspace (article) edits as their measure of productivity, and hypothesized that FlaggedRevs would decrease it in all three cases – for affected (non-trusted) editors because of a "reduced sense of self-efficacy" (i.e. the lack of satisfaction that comes with seeing one's change immediately being shown to the public), but also for unaffected (trusted) editors, because "prepublication [review] systems require effort from experienced contributors and may result in a net increase in the demands on these users’ time". These hypotheses are again tested using an ITS analysis of aggregate (per-wiki) monthly numbers. Regarding H3a, this confirms a significant decrease for IP editors as one group of affected users, but not for newly registered editors as the other group. (Unfortunately, the analysis appears to treat these as static groups, without examining the possibility that FlaggedRevs may have motivated at least some people who had habitually contributed without logging in to do so under an account instead, with the anticipation of becoming a trusted/unaffected user after passing the applicable threshold.) The study finds that "the deployment of the prepublication [review] discouraged the participation of the group of editors with the lowest commitment and most targeted by the additional safeguard [i.e. IP editors], but not the other groups."
In particular, FlaggedRevs did not cause a significant decline in article edits overall, contradicting the expectations formed based on the aforementioned previous research.
- The fourth hypothesis was similarly motivated by previous research that had found that the "barrier to entry posed by prepublication review, combined with the delayed intrinsic reward, might be disheartening enough to drive newcomers away" (in the case of the creation of new articles on English Wikipedia, see our previous coverage: "Newcomer productivity and pre-publication review"). Here, the authors "hypothesize that the deployment of [FlaggedRevs] will negatively affect the return rate of newcomers (H4)." Differently from the previous three hypotheses, this effect on retention rate is tested using a per-user (instead of aggregate) dataset. The study finds "that although FlaggedRevs did negatively affect the return rate of newcomers in a way that was statistically significant, the size of this effect is extremely small." Again though, the analysis is limited by treating this group as static, without being able to consider the possibility that FlaggedRevs may motivate more people to create an account instead of contributing under an IP. What's more, the authors caution that their analysis was limited by the fact that "we do not have access to wiki-level configuration data on FlaggedRev" (referring to settings such as the edit number threshold at which an editor is automatically promoted to trusted status). However, the Wikimedia Foundation does in fact publish this information, so there might be opportunities for future research to examine this research question more thoroughly. Relatedly, while the paper promises that "[a] replication dataset including data, code, and other supplementary material has been placed in the Harvard Dataverse archive and is available at: https://doi.org/10.7910/DVN/G1YFLE", that URL does not (yet) contain such material for most of the paper's results. (In March 2023, the authors acknowledged this issue and planned to remedy it, but at the time of writing the data repository appears unchanged.)
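For readers unfamiliar with the method, the interrupted time series (ITS) design used throughout these analyses amounts to a segmented regression on monthly counts, with terms for the pre-existing trend, a level shift at deployment, and a slope change after deployment. The sketch below uses simulated data and a deliberately minimal model; the paper's actual specification and covariates differ:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated monthly revert counts for one wiki, 12 months before and
# 12 months after FlaggedRevs deployment (illustrative numbers only):
# baseline 50, upward trend, and a level drop of 20 at deployment.
rng = np.random.default_rng(0)
months = np.arange(-12, 12)          # month 0 = deployment
after = (months >= 0).astype(int)    # post-intervention indicator
counts = 50 + 1.5 * months - 20 * after + rng.normal(0, 3, months.size)

df = pd.DataFrame({
    "t": months,                 # linear time trend
    "after": after,              # level shift at deployment
    "t_after": months * after,   # slope change after deployment
    "y": counts,
})

# Segmented regression: y = b0 + b1*t + b2*after + b3*(t*after).
# A clearly negative b2 would indicate a drop in (here) visible
# rejected contributions when FlaggedRevs is switched on.
model = smf.ols("y ~ t + after + t_after", data=df).fit()
print(model.params)
```

In the paper's terms, a hypothesis like H1a corresponds to testing whether the level-shift coefficient is significantly negative for the affected user group's monthly series.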
(Disclosure: This reviewer provided some input to the authors at the beginning of their research project, as acknowledged in the paper, but was not involved in it otherwise.)
See also related earlier coverage: "Sociological analysis of debates about flagged revisions in the English, German and French Wikipedias" (2012)
Briefly
- Wikimania, the annual global conference of the Wikimedia movement, took place in Singapore in August (as an in-person event again for the first time since 2019). Its research track included the by now traditional "State of Wikimedia Research" presentation highlighting research trends from the past year (with involvement by members of this research newsletter), see our blog post with videos and slides. Videos and slides from other presentations are being uploaded, too.
- See the page of the monthly Wikimedia Research Showcase for videos and slides of past presentations.
Other recent publications
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.
"Wikidata as Semantic Infrastructure: Knowledge Representation, Data Labor, and Truth in a More-Than-Technical Project"
From the abstract:
"Naked data: curating Wikidata as an artistic medium to interpret prehistoric figurines"
From the abstract:
"The Wikipedia Republic of Literary Characters"
From the abstract:
"What makes Individual I's a Collective We; Coordination mechanisms & costs"
From the abstract:
"Wikidata Research Articles Dataset"
From the abstract:
"Speech Wikimedia: A 77 Language Multilingual Speech Dataset"
From the abstract:
15 years later, repetition of philosophy vandalism experiment yields "surprisingly similar results"
From the paper:
References
- Supplementary references and notes:
Discuss this story
FlaggedRevs
There seem to be some errors in the assumptions that The Risks, Benefits, and Consequences of Prepublication Moderation: Evidence from 17 Wikipedia Language Editions (https://arxiv.org/abs/2202.05548) makes about how FlaggedRevs works. For example:
--Zache (talk) 09:11, 4 October 2023 (UTC)[reply]
In Russian Wikipedia, as well as in Russian Wikinews, FlaggedRevs is a disaster. You say Germans are guilty in that? --ssr (talk) 06:09, 27 October 2023 (UTC)[reply]
Wikidata
ChatGPT v. Wikipedia
The study authors comment on prose quality. I happened to ask ChatGPT yesterday to explain what government shutdowns in the U.S. are and what effects they have. I got the following answer:
I then compared that to the lead of Government shutdowns in the United States:
Personally I found ChatGPT's output a lot more readable than the Wikipedia lead – it is just better written. The English Wikipedia text often required me to go back and read the sentence again.
Take the first sentence:
At first I parsed "when funding legislation" as an indication of when shutdowns occur (i.e. "when you are funding legislation"). I needed to read on to realise that this wasn't where the sentence was going.
Next, Wikipedia uses the rather technical expression "when funding legislation ... is not enacted" (which is also passive voice) where ChatGPT uses the much easier-to-understand "when Congress fails to pass a budget" (active voice).
Where ChatGPT speaks of a "temporary suspension of non-essential government services", Wikipedia says the federal government "curtails agency activities and services, ceases non-essential operations", etc. I find the ChatGPT phrase easier to understand and faster to read while providing much the same information as the quoted Wikipedia passage (a point the study authors commented on specifically).
The Wikipedia sentence
leaves me wondering even now what the word "it" at the end of the sentence is meant to refer to.
I suspect our sentence construction and word use are not helping us win friends. It's one thing when we are the only service available; it's another when there is a new kid on the block. Andreas JN466 13:56, 4 October 2023 (UTC)[reply]
Even if ChatGPT or its successor becomes the predominant internet search tool, that doesn't mean Wikipedia will be obsolete. It likely means that Wikipedia will go back to its theoretical origin as a reference work rather than the internet search tool many readers use it as. Thebiguglyalien (talk) 16:11, 4 October 2023 (UTC)[reply]
Ah, the rise of AI. I've used it to get ideas for small projects in the past, but people prefer LLMs over Wikipedia? That's, just... sad. The Master of Hedgehogs is back again! 22:09, 4 October 2023 (UTC)[reply]