Wikipedia:Wikipedia Signpost/2025-02-07/Recent research
GPT-4 writes better edit summaries than human Wikipedians
A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
GPT-4 is better at writing edit summaries than human Wikipedia editors
A preprint by researchers from EPFL and the Wikimedia Foundation presents "Edisum", a system designed to automatically generate edit summaries for Wikipedia edits.

This solution was designed to meet the performance and open-source requirements of a live service deployed on Wikimedia Foundation servers. It consists of a "very small" language model (ca. 220 million parameters), based on Google's LongT5 (an extension of the company's T5 model from 2019, available under an Apache-2.0 license).
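The paper is not reproduced in code here; purely as an illustration of what a model in this size class involves, the following is a minimal sketch using the Hugging Face transformers library and the publicly available google/long-t5-tglobal-base checkpoint (roughly the parameter count described above). Edisum's own weights, input format and prompt may differ; the diff text below is a made-up example.

```python
# A minimal sketch, not the authors' code: loading a small LongT5 checkpoint
# with Hugging Face transformers and asking it to summarize an edit.
# "google/long-t5-tglobal-base" is a public checkpoint of roughly the size the
# paper describes; Edisum's own weights, input format and prompt may differ.
from transformers import AutoTokenizer, LongT5ForConditionalGeneration

checkpoint = "google/long-t5-tglobal-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = LongT5ForConditionalGeneration.from_pretrained(checkpoint)

# Hypothetical text-to-text input: the sentences an edit removed and added.
edit_diff = (
    "summarize edit: REMOVED: The band released two albums. "
    "ADDED: The band released three albums, most recently in 2023."
)

inputs = tokenizer(edit_diff, return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Without fine-tuning on edit/summary pairs, such an off-the-shelf checkpoint would not produce useful summaries; the sketch only shows how lightweight the inference side of a model of this size is.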
Separately, the authors also tested several contemporaneous large language models (GPT-4, GPT-3.5 and Llama 3 8B). GPT-4's edit summaries in particular were rated as significantly better than those provided by the human Wikipedia editors who originally made the edits in the sample – both using an automated scoring method based on semantic similarity, and in a quality ranking by human raters (where "to ensure high-quality results, instead of relying on the crowdsourcing platforms [like Mechanical Turk, frequently used in similar studies], we recruited 3 MSc students to perform the annotation").
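For readers unfamiliar with this kind of automated evaluation, the following is a generic sketch of semantic-similarity scoring with sentence embeddings (via the sentence-transformers library); the specific metric and embedding model used in the paper are not reproduced here and may differ.

```python
# Illustrative only: scoring a model-written edit summary against a reference
# by cosine similarity of sentence embeddings. The paper's actual automated
# metric and choice of embedding model may differ.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose encoder

reference = "Added Stefan Brüggemann to the list of artists whose works are included"
candidate = "added artist"

emb_ref = embedder.encode(reference, convert_to_tensor=True)
emb_cand = embedder.encode(candidate, convert_to_tensor=True)
score = util.cos_sim(emb_ref, emb_cand).item()
print(f"semantic similarity: {score:.2f}")  # higher means closer in meaning
```

The point of such a metric is that a terse human summary and a more verbose model-written one can still score as close if they describe the same change.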
This outcome adds to other recent research indicating that modern LLMs can match or even surpass the average Wikipedia editor in certain tasks (see e.g. our coverage: "'Wikicrow' AI less 'prone to reasoning errors (or hallucinations)' than human Wikipedia editors when writing gene articles").
A substantial part of the paper is devoted to showing that this particular task (generating good edit summaries) is both important and in need of improvement, motivating the use of AI to "overcome this problem and help editors write useful edit summaries":
In more detail:
The paper discusses various other nuances and special cases in interpreting these results and in deriving suitable training data for the "Edisum" model. (For example, "edit summaries should ideally explain why the edit was performed, along with what was changed, which often requires external context" that is not available to the model – or indeed to any human apart from the editor who made the edit.) The authors' best-performing approach relies on fine-tuning the aforementioned LongT5 model on 100% synthetic data, generated using an LLM (gpt-3.5-turbo) as an intermediate step.
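To make that two-step idea concrete, here is a rough sketch: generate synthetic summaries for sampled diffs with gpt-3.5-turbo, then use those pairs as training data for the small model. The prompt wording, data format and function names below are illustrative assumptions, not the authors' actual pipeline.

```python
# A rough sketch of the two-step idea described in the paper, not the authors'
# actual pipeline: (1) have gpt-3.5-turbo write synthetic edit summaries for
# sampled diffs, (2) use those pairs to fine-tune the small LongT5 model.
# The prompt wording, data format and function names are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def synthetic_summary(removed: str, added: str) -> str:
    """Ask gpt-3.5-turbo for a short, neutral edit summary describing one diff."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Write a concise Wikipedia edit summary describing the change."},
            {"role": "user",
             "content": f"Removed text:\n{removed}\n\nAdded text:\n{added}"},
        ],
        max_tokens=40,
        temperature=0.2,
    )
    return response.choices[0].message.content.strip()

# Each (diff, synthetic summary) pair then becomes one training example for
# fine-tuning LongT5, e.g. with transformers' Seq2SeqTrainer.
print(synthetic_summary(
    removed="The band released two albums.",
    added="The band released three albums, most recently in 2023.",
))
```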
Overall, they conclude that
The authors wisely refrain from suggesting the complete replacement of human-generated edit summaries. (It is intriguing, however, to observe that Wikidata, a fairly successful sister project of Wikipedia, has for many years been content to rely almost entirely on auto-generated edit summaries. And the present paper focuses exclusively on the English Wikipedia – Wikipedias in other languages might have fairly different guidelines or quality issues regarding edit summaries.)
Still, there might be great value in deploying Edisum as an opt-in tool for editors willing to be mindful of its potential pitfalls. (While the English Wikipedia community has rejected proposals for a policy or guideline about LLMs, a popular essay advises that while their use for generating original content is discouraged, "LLMs can be used for certain tasks (like copyediting, summarization, and paraphrasing) if the editor has substantial prior experience in the intended task and rigorously scrutinizes the results before publishing them.")
On that matter, it is worth noting that the paper was first published (as a preprint) ten months ago, in April 2024. (It appears to have been submitted for review at an ACL conference, but does not seem to have been published in peer-reviewed form yet.) Given the current extremely fast-paced developments in large language models, this likely means that the paper is already quite outdated regarding several of the constraints that Edisum was developed under. Specifically, the authors write that
But the performance of open LLMs (at least those released under the kind of license that is regarded as open-source in the paper) has greatly improved over the past year, while the costs of using LLMs in general have dropped.
Besides the Foundation's licensing requirements, its hardware constraints also played a big role:
Here, too, one wonders whether the situation might have improved in the year since the paper was first published. Unlike much of the rest of the industry, the Wikimedia Foundation avoids NVIDIA GPUs because of their proprietary CUDA software layer and uses AMD GPUs instead, which are known to pose some challenges for running standard open LLMs – but conceivably, AMD's software support and performance optimizations for LLMs have been improving. Also, given the size of the WMF's overall budget, it is notable that compute budget constraints would apparently prevent the deployment of a better-performing tool to support editors in an important task.
Briefly
- Submissions are open until March 9, 2025 for Wiki Workshop 2025, to take place on May 21-22, 2025. The virtual event will be the 12th in this annual series (formerly part of The Web Conference), and has been extended from one to two days this time. It is organized by the Wikimedia Foundation's research team with other collaborators. The call for contributions asks for 2-page extended abstracts which will be "non-archival, which means that they can be ongoing, completed, or already published work."
- See the page of the monthly Wikimedia Research Showcase for videos and slides of past presentations.
Other recent publications
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.
"Scholarly Wikidata: Population and Exploration of Conference Data in Wikidata using LLMs"
From the abstract:
"Migration and Segregated Spaces: Analysis of Qualitative Sources Such as Wikipedia Using Artificial Intelligence"
This study uses Wikipedia articles about neighborhoods in Madrid and Barcelona to predict immigrant concentration and segregation. From the abstract:
"On the effective transfer of knowledge from English to Hindi Wikipedia"
From the abstract:
Discuss this story
We should have a gadget using AI to write edit summaries. But of course, some will try to veto it because of anti-AI sentiment. In 20 years, when everyone is using AI for everything and the anti-AI Luddite sentiment dies out, maybe we will do a test run, I guess. Personal context: I am happily using AI to generate DYK hooks and article abstracts – of course, I am proofreading and fact-checking them, and often copyediting further. And while I use edit summaries sometimes, I am sure I could do so more often; but, sorry, I do not consider it an efficient use of my time (also because nobody ever complains about it), and this looks like a nice tool to have for popularizing what is a best practice. --Piotr Konieczny aka Prokonsul Piotrus| reply here 06:32, 7 February 2025 (UTC)[reply]
Looking at those edit summary comparisons, I don't necessarily consider them "better". More verbose, certainly, but these are being judged without the context of the actual edit. When comparing the diffs between two edits, "added artist", for example, is just as much an explanation as "Added Stefan Brüggemann to the list of artists whose works are included", because the diff clearly shows that's what's happening. On a slightly different point, the summary "This "however" doesn't make sense here" is actually clearer than "Removed the word "However," from the beginning of the sentence", etc. The bigger problem is that all the LLM summaries (and some of the human ones) fail on one of the key points of what an edit summary is supposed to do, which isn't to explain what the edit was, but why it was done. AI may be able to put in ten words what has been done, but the six words from a human explain why. - SchroCat (talk) 07:36, 7 February 2025 (UTC)[reply]
I really don't think a lot of these AI-generated summaries are needed. I would also note that I would still have to check and review a lot of these edits, as the AI shows no signs of thought or credibility. Unless I know an article well, there is a chance that I don't even know what the edits are referring to, human or AI. One beneficial thing about human edits is that I can track patterns across edits. For example, a spam of edits by someone on an article may be a red flag, but if I see that it is undergoing a GA review and the editor is relatively well known, I don't feel I am as needed to check it and I can spend my time elsewhere instead of tediously checking the veracity of every edit. ✶Quxyz✶ 15:45, 9 February 2025 (UTC)[reply]