Wikipedia:Wikipedia Signpost/2025-02-07/Recent research
GPT-4 writes better edit summaries than human Wikipedians
A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
GPT-4 is better at writing edit summaries than human Wikipedia editors
A preprint by researchers from EPFL and the Wikimedia Foundation presents "Edisum", a system designed to automatically generate edit summaries for Wikipedia edits.

This solution was designed to meet the performance and open-source requirements of a live service deployed on Wikimedia Foundation servers. It consists of a "very small" language model (ca. 220 million parameters), based on Google's LongT5 (an extension of the company's T5 model from 2019, available under an Apache-2.0 license).
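The paper is not reproduced in code here; purely as an illustration of what a model in this size class involves, the following is a minimal sketch using the Hugging Face transformers library and the publicly available google/long-t5-tglobal-base checkpoint (roughly the parameter count described above). Edisum's own weights, input format and prompt may differ; the diff text below is a made-up example.

```python
# A minimal sketch, not the authors' code: loading a small LongT5 checkpoint
# with Hugging Face transformers and asking it to summarize an edit.
# "google/long-t5-tglobal-base" is a public checkpoint of roughly the size the
# paper describes; Edisum's own weights, input format and prompt may differ.
from transformers import AutoTokenizer, LongT5ForConditionalGeneration

checkpoint = "google/long-t5-tglobal-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = LongT5ForConditionalGeneration.from_pretrained(checkpoint)

# Hypothetical text-to-text input: the sentences an edit removed and added.
edit_diff = (
    "summarize edit: REMOVED: The band released two albums. "
    "ADDED: The band released three albums, most recently in 2023."
)

inputs = tokenizer(edit_diff, return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Without fine-tuning on edit/summary pairs, such an off-the-shelf checkpoint would not produce useful summaries; the sketch only shows how lightweight the inference side of a model of this size is.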
Separately, the authors also tested several contemporaneous large language models (GPT-4, GPT-3.5 and Llama 3 8B). GPT-4's edit summaries in particular were rated as significantly better than those provided by the human Wikipedia editors who originally made the edits in the sample – both using an automated scoring method based on semantic similarity, and in a quality ranking by human raters (where "to ensure high-quality results, instead of relying on the crowdsourcing platforms [like Mechanical Turk, frequently used in similar studies], we recruited 3 MSc students to perform the annotation").
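For readers unfamiliar with this kind of automated evaluation, the following is a generic sketch of semantic-similarity scoring with sentence embeddings (via the sentence-transformers library); the specific metric and embedding model used in the paper are not reproduced here and may differ.

```python
# Illustrative only: scoring a model-written edit summary against a reference
# by cosine similarity of sentence embeddings. The paper's actual automated
# metric and choice of embedding model may differ.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose encoder

reference = "Added Stefan Brüggemann to the list of artists whose works are included"
candidate = "added artist"

emb_ref = embedder.encode(reference, convert_to_tensor=True)
emb_cand = embedder.encode(candidate, convert_to_tensor=True)
score = util.cos_sim(emb_ref, emb_cand).item()
print(f"semantic similarity: {score:.2f}")  # higher means closer in meaning
```

The point of such a metric is that a terse human summary and a more verbose model-written one can still score as close if they describe the same change.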
This outcome adds to other recent research indicating that modern LLMs can match or even surpass the average Wikipedia editor in certain tasks (see e.g. our coverage: "'Wikicrow' AI less 'prone to reasoning errors (or hallucinations)' than human Wikipedia editors when writing gene articles").
A substantial part of the paper is devoted to showing that this particular task (generating good edit summaries) is both important and in need of improvement, motivating the use of AI to "overcome this problem and help editors write useful edit summaries":
In more detail:
The paper discusses various other nuances and special cases in interpreting these results and in deriving suitable training data for the "Edisum" model. (For example, "edit summaries should ideally explain why the edit was performed, along with what was changed, which often requires external context" that is not available to the model – or indeed to any human apart from the editor who made the edit.) The authors' best-performing approach relies on fine-tuning the aforementioned LongT5 model on 100% synthetic data, generated using an LLM (gpt-3.5-turbo) as an intermediate step.
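To make that two-step idea concrete, here is a rough sketch: generate synthetic summaries for sampled diffs with gpt-3.5-turbo, then use those pairs as training data for the small model. The prompt wording, data format and function names below are illustrative assumptions, not the authors' actual pipeline.

```python
# A rough sketch of the two-step idea described in the paper, not the authors'
# actual pipeline: (1) have gpt-3.5-turbo write synthetic edit summaries for
# sampled diffs, (2) use those pairs to fine-tune the small LongT5 model.
# The prompt wording, data format and function names are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def synthetic_summary(removed: str, added: str) -> str:
    """Ask gpt-3.5-turbo for a short, neutral edit summary describing one diff."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Write a concise Wikipedia edit summary describing the change."},
            {"role": "user",
             "content": f"Removed text:\n{removed}\n\nAdded text:\n{added}"},
        ],
        max_tokens=40,
        temperature=0.2,
    )
    return response.choices[0].message.content.strip()

# Each (diff, synthetic summary) pair then becomes one training example for
# fine-tuning LongT5, e.g. with transformers' Seq2SeqTrainer.
print(synthetic_summary(
    removed="The band released two albums.",
    added="The band released three albums, most recently in 2023.",
))
```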
Overall, they conclude that
The authors wisely refrain from suggesting the complete replacement of human-generated edit summaries. (It is intriguing, however, to observe that Wikidata, a fairly successful sister project of Wikipedia, has for many years been content to rely almost entirely on auto-generated edit summaries. And the present paper focuses exclusively on the English Wikipedia – Wikipedias in other languages might have fairly different guidelines or quality issues regarding edit summaries.)
Still, there might be great value in deploying Edisum as an opt-in tool for editors willing to be mindful of its potential pitfalls. (While the English Wikipedia community has rejected proposals for a policy or guideline about LLMs, a popular essay advises that while their use for generating original content is discouraged, "LLMs can be used for certain tasks (like copyediting, summarization, and paraphrasing) if the editor has substantial prior experience in the intended task and rigorously scrutinizes the results before publishing them.")
On that matter, it is worth noting that the paper was first published (as a preprint) ten months ago, in April 2024. (It appears to have been submitted for review at an ACL conference, but does not seem to have been published in peer-reviewed form yet.) Given the current extremely fast-paced developments in large language models, this likely means that the paper is already quite outdated regarding several of the constraints that Edisum was developed under. Specifically, the authors write that
But the performance of open LLMs (at least those released under the kind of license that is regarded as open-source in the paper) has greatly improved over the past year, while the costs of using LLMs in general have dropped.
Besides the Foundation's licensing requirements, its hardware constraints also played a big role:
Here, too, one wonders whether the situation might have improved in the year since the paper was first published. Unlike much of the rest of the industry, the Wikimedia Foundation avoids NVIDIA GPUs because of their proprietary CUDA software layer and uses AMD GPUs instead, which are known to pose some challenges for running standard open LLMs – but conceivably, AMD's software support and performance optimizations for LLMs have been improving. Also, given the size of the WMF's overall budget, it is notable that compute budget constraints would apparently prevent the deployment of a better-performing tool to support editors in an important task.
Briefly
- Submissions are open until March 9, 2025 for Wiki Workshop 2025, to take place on May 21-22, 2025. The virtual event will be the 12th in this annual series (formerly part of The Web Conference), and has been extended from one to two days this time. It is organized by the Wikimedia Foundation's research team with other collaborators. The call for contributions asks for 2-page extended abstracts which will be "non-archival, which means that they can be ongoing, completed, or already published work."
- See the page of the monthly Wikimedia Research Showcase for videos and slides of past presentations.
Other recent publications
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.
"Scholarly Wikidata: Population and Exploration of Conference Data in Wikidata using LLMs"
From the abstract:
"Migration and Segregated Spaces: Analysis of Qualitative Sources Such as Wikipedia Using Artificial Intelligence"
This study uses Wikipedia articles about neighborhoods in Madrid and Barcelona to predict immigrant concentration and segregation. From the abstract:
"On the effective transfer of knowledge from English to Hindi Wikipedia"
From the abstract:
Discuss this story
We should have a gadget using AI to write edit summaries. But of course, some will try to veto it because of anti-AI sentiment. In 20 years, when everyone is using AI for everything and the anti-AI Luddite sentiment dies out, maybe we will do a test run, I guess. Personal context: I am happily using AI to generate DYK hooks and article abstracts – of course, I am proofreading and fact-checking them, and often copyediting further. And while I use edit summaries sometimes, I am sure I could do so more often; but, sorry, I do not consider it an efficient use of my time (also because nobody ever complains about it), and this looks like a nice tool to have for popularizing what is a best practice. --Piotr Konieczny aka Prokonsul Piotrus| reply here 06:32, 7 February 2025 (UTC)[reply]
Looking at those edit summary comparisons, I don't necessarily consider them "better". More verbose, certainly, but these are being judged without the context of the actual edit. When comparing the diffs between two edits, "added artist", for example, is just as much an explanation as "Added Stefan Brüggemann to the list of artists whose works are included", because the diff clearly shows that's what's happening. On a slightly different point, the summary "This "however" doesn't make sense here" is actually clearer than "Removed the word "However," from the beginning of the sentence", etc. The bigger problem is that all the LLM summaries (and some of the human ones) fail on one of the key points of what an edit summary is supposed to do, which isn't to explain what the edit was, but why it was done. AI may be able to put in ten words what has been done, but the six words from a human explain why. - SchroCat (talk) 07:36, 7 February 2025 (UTC)[reply]
I really don't think a lot of these AI-generated summaries are needed. I would also note that I would still have to check and review a lot of these edits, as the AI shows no signs of thought or credibility. Unless I know an article well, there is a chance that I don't even know what the edits are referring to, human or AI. One beneficial thing about human edits is that I can track patterns across edits. For example, a spam of edits by someone on an article may be a red flag, but if I see that it is undergoing a GA review and the editor is relatively well known, I don't feel I am as needed to check it and I can spend my time elsewhere instead of tediously checking the veracity of every edit. ✶Quxyz✶ 15:45, 9 February 2025 (UTC)[reply]