Wikipedia:Wikipedia Signpost/2025-02-07/Recent research

File:A human writer and a creature with the head and wings of a crow, both sitting and typing on their own laptops, experiencing mild hallucinations (DALL-E illustration).webp (image by HaeB, CC0)
Recent research

GPT-4 writes better edit summaries than human Wikipedians


A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.


GPT-4 is better at writing edit summaries than human Wikipedia editors

A preprint by researchers from EPFL and the Wikimedia Foundation presents "Edisum", a language model for automatically generating Wikipedia edit summaries.

Average aggregated human evaluation scores for edit summaries generated by language models and by the human editors who originally made the edits

This solution was designed to meet the performance and open-source requirements for a live service deployed on Wikimedia Foundation servers. It consists of a "very small" language model (ca. 220 million parameters), based on Google's LongT5 (an extension of the company's T5 model from 2019, available under an Apache-2.0 license).
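To give a concrete sense of the scale involved, a LongT5 model of roughly this size can be loaded and run with the Hugging Face transformers library. The sketch below uses the generic "google/long-t5-tglobal-base" checkpoint and a made-up prompt format as placeholders; it does not reproduce the authors' fine-tuned Edisum weights or their actual input encoding.

```python
# Minimal sketch: running a LongT5-sized seq2seq model for edit-summary-style
# generation. The checkpoint and prompt format are placeholders, not the
# paper's fine-tuned Edisum model.
from transformers import AutoTokenizer, LongT5ForConditionalGeneration

model_name = "google/long-t5-tglobal-base"  # ~250M parameters; stand-in checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = LongT5ForConditionalGeneration.from_pretrained(model_name)

# Hypothetical diff-like input; the paper's exact input representation may differ.
edit_diff = ("Before: The battle took place in 1456. "
             "After: The battle took place in 1457.")
inputs = tokenizer("summarize the edit: " + edit_diff,
                   return_tensors="pt", truncation=True)
summary_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```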

Separately, the authors also tested several contemporaneous large language models (GPT-4, GPT-3.5 and Llama 3 8B). GPT-4's edit summaries in particular were rated as significantly better than those provided by the human Wikipedia editors who originally made the edits in the sample – both using an automated scoring method based on semantic similarity, and in a quality ranking by human raters (where "to ensure high-quality results, instead of relying on the crowdsourcing platforms [like Mechanical Turk, frequently used in similar studies], we recruited 3 MSc students to perform the annotation").
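The review above mentions "an automated scoring method based on semantic similarity" without detailing it; one common way to implement such a score is to embed both summaries and compare them by cosine similarity, for instance with the sentence-transformers library. The encoder choice and example texts below are illustrative assumptions, not the paper's actual setup.

```python
# Hedged sketch: embedding-based similarity between a model-generated and a
# human-written edit summary. The paper's actual metric may differ.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder choice

human_summary = "fixed the year of the battle per the cited source"
model_summary = "corrected battle date from 1456 to 1457"

embeddings = encoder.encode([human_summary, model_summary], convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"semantic similarity: {score:.3f}")  # closer to 1.0 = closer in meaning
```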

This outcome joins some other recent research indicating that modern LLMs can match or even surpass the average Wikipedia editor in certain tasks (see e.g. our coverage: "'Wikicrow' AI less 'prone to reasoning errors (or hallucinations)' than human Wikipedia editors when writing gene articles").

A substantial part of the paper is devoted to showing that this particular task (generating good edit summaries) is both important and in need of improvement, motivating the use of AI to "overcome this problem and help editors write useful edit summaries":

In more detail:

The paper discusses various other nuances and special cases in interpreting these results and in deriving suitable training data for the "Edisum" model. (For example, "edit summaries should ideally explain why the edit was performed, along with what was changed, which often requires external context" that is not available to the model – or, indeed, to any human apart from the editor who made the edit.) The authors' best-performing approach relies on fine-tuning the aforementioned LongT5 model on 100% synthetic data generated using an LLM (gpt-3.5-turbo) as an intermediate step.
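As a rough illustration of what such a synthetic-data step might look like, the sketch below prompts gpt-3.5-turbo to write an edit summary for a given diff and stores the (diff, summary) pair for later fine-tuning of the smaller model. The prompt wording and data format are assumptions made for this example, not taken from the paper.

```python
# Sketch: generating a synthetic (diff, edit summary) training pair with
# gpt-3.5-turbo, as an intermediate step before fine-tuning a small seq2seq
# model. Prompt and output format are assumed for illustration only.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def synthetic_summary(edit_diff: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Write a concise, neutral Wikipedia edit summary for the given diff."},
            {"role": "user", "content": edit_diff},
        ],
        max_tokens=32,
        temperature=0.2,
    )
    return response.choices[0].message.content.strip()


diff = ("Before: The battle took place in 1456. "
        "After: The battle took place in 1457.")
training_pair = {"input": diff, "target": synthetic_summary(diff)}
print(training_pair)
```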

Overall, they conclude that

The authors wisely refrain from suggesting the complete replacement of human-generated edit summaries. (It is intriguing, however, to observe that Wikidata, a fairly successful sister project of Wikipedia, has been content with relying almost entirely on auto-generated edit summaries for many years. And the present paper exclusively focuses on English Wikipedia – Wikipedias in other languages might have fairly different guidelines or quality issues regarding edit summaries.)

Still, there might be great value in deploying Edisum as an opt-in tool for editors willing to be mindful of its potential pitfalls. (While the English Wikipedia community has rejected proposals for a policy or guideline about LLMs, a popular essay advises that while their use for generating original content is discouraged, "LLMs can be used for certain tasks (like copyediting, summarization, and paraphrasing) if the editor has substantial prior experience in the intended task and rigorously scrutinizes the results before publishing them.")

On that matter, it is worth noting that the paper was first published as a preprint ten months ago, in April 2024. (It appears to have been submitted for review at an ACL conference, but does not seem to have appeared in peer-reviewed form yet.) Given the extremely fast-paced developments in large language models since then, the paper is likely already quite outdated with regard to several of the constraints that Edisum was developed for. Specifically, the authors write that

But the performance of open LLMs (at least those released under the kind of license that is regarded as open-source in the paper) has greatly improved over the past year, while the costs of using LLMs in general have dropped.

Besides the Foundation's licensing requirements, its hardware constraints also played a big role:

Here, too, one wonders whether the situation might have improved in the year since the paper was first published. Unlike much of the rest of the industry, the Wikimedia Foundation avoids NVIDIA GPUs because of their proprietary CUDA software layer and uses AMD GPUs instead, which are known to pose some challenges for running standard open LLMs – but conceivably, AMD's software support and performance optimizations for LLMs have been improving as well. Also, given the size of the WMF's overall budget, it is notable that compute budget constraints would apparently prevent the deployment of a better-performing tool for supporting editors in an important task.
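For what it's worth, PyTorch's ROCm builds expose AMD GPUs through the same torch.cuda interface used for NVIDIA hardware, so a basic deployment check is hardware-agnostic. The following minimal sketch simply reports which accelerator (if any) is visible; it makes no claim about the WMF's actual serving stack.

```python
# Minimal sketch: under a ROCm build of PyTorch, AMD GPUs show up via the usual
# torch.cuda API, so the same device-selection code runs on AMD or NVIDIA hardware.
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    print("accelerator:", torch.cuda.get_device_name(0))
    print("ROCm/HIP build:", torch.version.hip is not None)
else:
    device = torch.device("cpu")
    print("no GPU visible; falling back to CPU")
```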


Briefly

  • Submissions are open until March 9, 2025 for Wiki Workshop 2025, to take place on May 21-22, 2025. The virtual event will be the 12th in this annual series (formerly part of The Web Conference), and has been extended from one to two days this time. It is organized by the Wikimedia Foundation's research team with other collaborators. The call for contributions asks for 2-page extended abstracts which will be "non-archival, which means that they can be ongoing, completed, or already published work."
  • See the page of the monthly Wikimedia Research Showcase for videos and slides of past presentations.

Other recent publications

Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

"Scholarly Wikidata: Population and Exploration of Conference Data in Wikidata using LLMs"

From the abstract:

"Migration and Segregated Spaces: Analysis of Qualitative Sources Such as Wikipedia Using Artificial Intelligence"

This study uses Wikipedia articles about neighborhoods in Madrid and Barcelona to predict immigrant concentration and segregation. From the abstract:

"On the effective transfer of knowledge from English to Hindi Wikipedia"

From the abstract:


