Wikipedia:Wikipedia Signpost/2024-11-18/Recent research

[File: SPINACH (SPARQL-Based Information Navigation for Challenging Real-World Questions) logo. By Liu, Shicheng; Semnani, Sina; Triedman, Harold; Xu, Jialiang; Zhao, Isaac Dan; Lam, Monica. CC BY 4.0]
Recent research

SPINACH: AI help for asking Wikidata "challenging real-world questions"


A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.

"SPINACH": LLM-based tool to translate "challenging real-world questions" into Wikidata SPARQL queries

SPINACH's logo or custom emoji (from the paper's title, which we regret not being able to reproduce faithfully here)

A paper presented at last week's EMNLP conference reports on a promising new AI-based tool (available at https://spinach.genie.stanford.edu/ ) to retrieve information from Wikidata using natural language questions. It can successfully answer complicated real-world questions.

The authors note that Wikidata is

"one of the largest publicly available knowledge bases [and] currently contains 15 billion facts", and claim that it is "of significant value to many scientific communities". However, they observe that "[e]ffective access to Wikidata data can be challenging, requiring use of the SPARQL query language."
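To illustrate the barrier the authors describe, here is a minimal sketch (ours, not from the paper) of querying the public Wikidata Query Service. Note that even a trivial question ("list some house cats") requires knowing opaque identifiers: P31 is the "instance of" property and Q146 the item for "house cat".

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

def cats_query(limit: int = 5) -> str:
    """Return a SPARQL query for items that are instances of house cat (Q146)."""
    return f"""
    SELECT ?item ?itemLabel WHERE {{
      ?item wdt:P31 wd:Q146 .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    LIMIT {limit}
    """

def run_sparql(query: str) -> list[dict]:
    """POST the query to the Wikidata Query Service and return result bindings."""
    data = urllib.parse.urlencode({"query": query, "format": "json"}).encode()
    req = urllib.request.Request(
        WDQS_ENDPOINT,
        data=data,
        headers={"User-Agent": "signpost-review-example/0.1"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["results"]["bindings"]
```

A non-technical user asking "show me some cats" has no way to guess `wdt:P31` or `wd:Q146` — which is exactly the gap the tool aims to bridge.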

This motivates the use of large language models to convert natural language questions into SPARQL queries, which could obviously be of great value to non-technical users. The paper is far from being the first such attempt; see also below for a more narrowly tailored effort. And in fact, some of its authors (including Monica S. Lam and members of her group at Stanford) had already built such a system – "WikiSP" – themselves last year, obtained by fine-tuning an LLM; see our review: "Fine-tuned LLMs Know More, Hallucinate Less with Few-Shot Sequence-to-Sequence Semantic Parsing over Wikidata". (Readers of this column may also recall coverage of Wikipedia-related publications out of Lam's group, see "STORM: AI agents role-play as 'Wikipedia editors' and 'experts' to create Wikipedia-like articles" and "WikiChat, 'the first few-shot LLM-based chatbot that almost never hallucinates'" – a paper that received the Wikimedia Foundation's "Research Award of the Year".)

The SPINACH dataset

More generally, this kind of task is called "Knowledge Base Question Answering" (KBQA). The authors observe that many benchmarks have been published for it over the last decade, and that recently, "the KBQA community has shifted toward using Wikidata as the underlying knowledge base for KBQA datasets". However, they criticize those existing benchmarks as "either contain[ing] only simple questions [...] or synthetically generated complex logical forms that are not representative enough of" real-world queries. To remedy this, they constructed a new dataset of real questions, drawn from the archives of Wikidata's "Request a Query" forum.

In more detail, the researchers scraped the "Request a Query" forum's archive from 2016 up to May 2024, obtaining 2780 discussions that had resulted in a valid SPARQL query, which were then filtered by various criteria and sampled to a subset of "920 conversations spanning many domains" for consideration. Those were then further winnowed down with a "focus on end-users rather than Wikipedia and Wikidata contributors interested in obscure optimizations or formatting". The remaining conversations were manually annotated with "a self-contained, decontextualized natural language question that accurately captures the meaning of the user-written SPARQL". These steps include disambiguation of terms in the question as originally asked in the forum ("For example, instead of asking 'where a movie takes place', we distinguish between the 'narrative location' and the 'filming location'"; thus avoiding an example that had confused the authors' own WikiSP system). This might be regarded as attaching training wheels, i.e. artificially making the task a little bit easier. However, another step goes in the other direction, by "refrain[ing] from directly using [Wikidata's] entity and property names, instead using a more natural way to express the meaning. For instance, instead of asking 'what is the point of time of the goal?', a more natural question with the same level of accuracy like 'when does the goal take place?' should be used."

The SPINACH agent

The paper's second contribution is an LLM-based system, also called "SPINACH", that on the authors' own dataset "outperforms all baselines, including the best GPT-4-based KBQA agent by a large margin", and also "achiev[es] a new state of the art on several existing KBQA benchmarks", although it narrowly remains behind the aforementioned WikiSP model on the WikiWebQuestions dataset (both also out of Lam's lab).

This agent is given several tools to use, namely

  • searching Wikidata for the QID for a string (like a human user would using the search box on the Wikidata site). This addresses an issue that thwarts many naive attempts to use e.g. ChatGPT directly for generating SPARQL queries, which the aforementioned WikiSP paper already pointed out last year: "While zero-shot LLMs [e.g. ChatGPT] can generate SPARQL queries for the easiest and most common questions, they do not know all the PIDs and QIDs [property and item IDs in Wikidata]."
  • retrieving the Wikidata entry for a QID (i.e. all the information on its Wikidata page)
  • retrieving "a few examples demonstrating the use of the specified property" in Wikidata
  • running a SPARQL query on the Wikidata Query Service
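The first of these tools can be sketched as a thin wrapper around the public MediaWiki API's `wbsearchentities` action (our own wrapper for illustration, not the paper's code). Mirroring the paper, results are capped at 8 entities:

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def search_params(text: str, limit: int = 8) -> dict:
    """Build wbsearchentities parameters for a free-text entity search."""
    return {
        "action": "wbsearchentities",
        "search": text,
        "language": "en",
        "type": "item",
        "limit": limit,
        "format": "json",
    }

def search_wikidata(text: str) -> list[tuple[str, str]]:
    """Return (QID, label) candidates for a string, as a human would get
    from the search box on the Wikidata site."""
    url = WIKIDATA_API + "?" + urllib.parse.urlencode(search_params(text))
    req = urllib.request.Request(url, headers={"User-Agent": "example/0.1"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        hits = json.load(resp).get("search", [])
    return [(h["id"], h.get("label", "")) for h in hits]
```

Giving the agent this lookup step, rather than expecting the LLM to recall QIDs and PIDs from its training data, is what addresses the hallucinated-identifier problem described above.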

The authors note that "[i]mportantly, the results of the execution of each action are put in a human-readable format to make it easier for the LLM to process. To limit the amount of information that the agent has to process, we limit the output of search results to at most 8 entities and 4 properties, and limit large results of SPARQL queries to the first and last 5 rows." That LLMs and humans have similar problems reading through copious Wikidata query results is a somewhat intriguing observation, considering that Wikidata was conceived as a machine-readable knowledge repository. (In an apparent effort to address the low usage of Wikidata in today's AI systems, Wikimedia Deutschland recently announced "a project to simplify access to the open data in Wikidata for AI applications" by "transformation of Wikidata’s data into semantic vectors.")

The SPINACH system uses the popular ReAct (Reasoning and Acting) framework for LLM agents, where the model alternates between reasoning about its task (e.g. "It seems like there is an issue with the QID I used for the University of Washington. I should search for the correct QID") and acting (e.g. using its search tool: search_wikidata("University of Washington")).
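The alternation described above can be sketched as a minimal ReAct-style loop (a simplified sketch of the general framework, not SPINACH's implementation): a policy — in SPINACH, an LLM driven by a prompt — emits a thought plus an action, the action is dispatched to a tool, and the tool's observation is appended to the history for the next turn.

```python
def react_loop(policy, tools: dict, max_actions: int = 15):
    """Alternate reasoning and acting until the policy stops or the budget runs out."""
    history = []
    for _ in range(max_actions):
        thought, tool_name, arg = policy(history)
        if tool_name == "stop":            # policy signals it has a final answer
            return arg, history
        observation = tools[tool_name](arg)
        history.append((thought, tool_name, arg, observation))
    return None, history                   # action budget exhausted

# Toy demo with a stubbed policy and tool (QID below is a placeholder):
def toy_policy(history):
    if not history:
        return ("Need the QID for UW.", "search_wikidata", "University of Washington")
    return ("Found it.", "stop", history[-1][-1])

tools = {"search_wikidata": lambda q: "Q12345"}  # stub, not a real lookup
answer, trace = react_loop(toy_policy, tools)
```

In the real system each `policy` call is an LLM invocation, which is why — as the authors acknowledge below — latency and cost grow with the number of turns.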

The generation of these thought + action pairs in each turn is driven by an "agent policy" prompt.

Successfully answering a question with a correct SPARQL query can require numerous turns. The researchers limit these by providing the agent with "a budget of 15 actions to take, and an extra 15 actions to spend on [...] 'rollbacks' of such actions". Even so, "[s]ince [the] SPINACH agent makes multiple LLM calls for each question, its latency and cost are higher compared to simpler systems. [...] This seems to be the price for a more accurate KBQA system."
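This budget scheme — as we read it from the paper, sketched in toy form — keeps two separate allowances, one for forward actions and one for rollbacks:

```python
class ActionBudget:
    """Toy accounting for the paper's described limits: 15 forward actions
    plus a separate allowance of 15 rollbacks."""

    def __init__(self, actions: int = 15, rollbacks: int = 15):
        self.actions = actions
        self.rollbacks = rollbacks

    def take_action(self) -> bool:
        """Spend one forward action; False once the budget is exhausted."""
        if self.actions == 0:
            return False
        self.actions -= 1
        return True

    def rollback(self) -> bool:
        """Undo a step, paid from the separate rollback allowance."""
        if self.rollbacks == 0:
            return False
        self.rollbacks -= 1
        return True

budget = ActionBudget()
budget.take_action()   # one action spent; 14 remain, rollbacks untouched
```

Keeping rollbacks in a separate pool means an agent that backtracks a lot is not starved of the forward actions it still needs.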

Still, for the time being, an instance is available for free at https://spinach.genie.stanford.edu/ , and also on-wiki as a bot (operated by one of the authors, a – now former – Wikimedia Foundation employee), which has already answered about 30 user queries since its introduction some months ago.

Example from the paper: "The sequence of 13 actions that the SPINACH agent takes to answer a sample question from the SPINACH validation set. Here, the agent goes through several distinct phases, only with the high-level instruction [prompt]. Note that every step includes a thought, action and observation, but some are omitted here for brevity."

Briefly

Other recent publications

Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

"SPARQL Generation: an analysis on fine-tuning OpenLLaMA for Question Answering over a Life Science Knowledge Graph"

From the abstract:

From the paper:

A small number of "culprits" cause over 10 million "Disjointness Violations in Wikidata"

This preprint identifies 51 pairs of classes on Wikidata that should be disjoint (e.g. "natural object" vs. "artificial object") but aren't, with over 10 million violations, caused by a small number of "culprits". From the abstract:


"Automatic Quality Assessment of Wikipedia Articles - A Systematic Literature Review"

From the abstract:

References

Supplementary references and notes:


Uses material from the Wikipedia article Wikipedia:Wikipedia Signpost/2024-11-18/Recent research, released under the CC BY-SA 4.0 license.