Wikipedia:Wikipedia Signpost/2021-02-28/Recent research
Take an AI-generated flashcard quiz about Wikipedia; Wikipedia's anti-feudalism
A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
"WikiFlash: Generating Flashcards from Wikipedia Articles"
- Reviewed by Tilman Bayer
Flashcards are a popular method for memorizing information. A paper by six Zurich-based researchers, presented earlier this month at the annual AAAI conference, describes a tool to automatically extract flashcards from Wikipedia articles, aiming "to make independent education more attractive to a broader audience."
A proof-of-concept version is available online, and its results can be exported in a format usable with the popular flashcard software Anki. Users can choose from four different variants based on either the entire Wikipedia article or just its introductory section.
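As a rough illustration of what such an export could look like, here is a minimal sketch. It assumes Anki's plain-text import with one note per line and tab-separated fields. The review does not say which export format WikiFlash actually uses, and the function name `cards_to_anki_tsv` is hypothetical.

```python
import csv
import io
from typing import Iterable, Tuple


def cards_to_anki_tsv(cards: Iterable[Tuple[str, str]]) -> str:
    """Serialize (question, answer) pairs as tab-separated text.

    Assumption: the target is Anki's plain-text import, where each line
    is one note and fields are separated by tabs. This is not taken from
    the WikiFlash tool itself.
    """
    buffer = io.StringIO()
    # The csv module quotes any field that itself contains a tab,
    # a quote character or a newline.
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for question, answer in cards:
        writer.writerow([question, answer])
    return buffer.getvalue()


# Hypothetical sample cards, taken from the quiz output quoted in this review.
sample_cards = [
    ("Who hosts Wikipedia?", "Wikimedia Foundation"),
    ("What language was Wikipedia initially available in?", "English"),
]
anki_text = cards_to_anki_tsv(sample_cards)
```

The resulting string can be written to a `.txt` file and loaded through Anki's import dialog, with the first field mapped to the front of the card and the second to the back.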
The researchers emphasize that "generating meaningful flashcards from an arbitrary piece of text is not a trivial problem" (also concerning the computational effort), and that there is currently no single model that can do this. They separate the task into four stages, each making use of existing NLP techniques:
- summarization, to first extract the most relevant information from Wikipedia (the user can also choose to have this step skipped and instead generate flashcards based on the full text)
- answer identification, where a model extracts answer statements from a given sentence based on context information from the surrounding paragraph
- question generation, where a model constructs a question from the statement generated in the previous step, again taking context information from the surrounding paragraph into account
- filtering, a final step to improve quality, where a question-answering model tries to reconstruct the answer based on the paragraph from which the question was extracted, and the generated flashcard is discarded if the reconstructed answer does not overlap enough with the pre-generated answer.
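The filtering stage can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the review does not specify how overlap is measured or where the cutoff lies, so the sketch uses a token-level F1 score and a hypothetical threshold of 0.5. The question-answering model is passed in as a plain callable, and `stub_qa_model` is a stand-in for it.

```python
import re
from collections import Counter
from typing import Callable, List, NamedTuple


class Flashcard(NamedTuple):
    question: str
    answer: str      # answer produced by the answer identification stage
    paragraph: str   # source paragraph used as context


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def token_f1(predicted: str, reference: str) -> float:
    """Token-level F1 overlap between two answers (assumed metric)."""
    pred, ref = tokenize(predicted), tokenize(reference)
    if not pred or not ref:
        return 0.0
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)


def filter_cards(
    cards: List[Flashcard],
    answer_question: Callable[[str, str], str],
    min_overlap: float = 0.5,  # hypothetical threshold, not from the paper
) -> List[Flashcard]:
    """Keep cards whose answer the QA model can reconstruct."""
    kept = []
    for card in cards:
        reconstructed = answer_question(card.question, card.paragraph)
        if token_f1(reconstructed, card.answer) >= min_overlap:
            kept.append(card)
    return kept


def stub_qa_model(question: str, paragraph: str) -> str:
    """Stand-in for a real question-answering model (illustration only)."""
    if question == "Who hosts Wikipedia?":
        return "the Wikimedia Foundation"
    return "unknown"


candidate_cards = [
    Flashcard("Who hosts Wikipedia?", "Wikimedia Foundation",
              "Wikipedia is hosted by the Wikimedia Foundation."),
    Flashcard("What is hosted?", "an encyclopedia",
              "Wikipedia is hosted by the Wikimedia Foundation."),
]
kept_cards = filter_cards(candidate_cards, stub_qa_model)
```

With the stub above, only the first card survives: its reconstructed answer shares two of three tokens with the pre-generated one, while the second card's reconstruction shares none.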
Apart from evaluating the results using quantitative text measures, the researchers also conducted a user study comparing the output of their tool to human-generated flashcards from two topic areas, geography and history, with cards rated on helpfulness, comprehensibility and perceived correctness. The "results show that in the case of geography there is no statistically meaningful difference between human-created and our cards for either of the three aspects. For history, the difference for helpfulness and comprehensibility is statistically significant (p < 0.01), with human cards being marginally better than our cards. Neither category revealed a statistically significant difference in perceived correctness." (However, the sample was rather small, with 50 Mechanical Turk users split into two groups for geography and history.)
A quick test of the tool with the article Wikipedia (introduction only) yielded the following result (text reproduced without changes):
Question: What does Wikipedia use to maintain it's [sic] content?
Answer: wiki-based editing system
Question: In 2021, where was Wikipedia ranked?
Answer: 13th
Question: What language was Wikipedia initially available in?
Answer: English
Question: How many articles are in English version of Wikipedia [sic] as of February 2021?
Answer: 6.3 million
Question: Who hosts Wikipedia?
Answer: Wikimedia Foundation
Question: Whose vision did Time magazine believe made Wikipedia the best encyclopedia in the world?
Answer: Jimmy Wales
Question: What is a systemic bias on Wikipedia?
Answer: gender bias
Question: What did Wikipedia receive praise for in the 2010s?
Answer: unique structure, culture, and absence of commercial bias
Question: What two social media sites announced in 2018 that they would help users detect fake news by suggesting links to related Wikipedia articles?
Answer: Facebook and YouTube
Briefly
- See the page of the monthly Wikimedia Research Showcase for videos and slides of past presentations.
- @WikiResearch, the Twitter feed associated with this monthly research update, celebrated its ninth anniversary today. Over the past 9 years, we have shared on average 1.9 tweets per day about Wikimedia-related research. The feed is also available in syndicated form on Facebook and Mastodon.
Other recent publications
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.
- Compiled by Tilman Bayer and Miriam Redi
Wikipedia's "sophisticated democracy" resists the "implicit feudalism" of online communities
A paper in New Media & Society argues that
"Most scientific articles cited by Wikipedia articles are uncited or untested by subsequent studies"
From the abstract:
"HopRetriever: Retrieve Hops over Wikipedia to Answer Complex Questions"
From the abstract:
(See also the above review of the "WikiFlash" paper presented at the same conference)
"Structured Knowledge: Have we made progress? An extrinsic study of KB [knowledge base] coverage over 19 years"
From the abstract:
See also the video recording of a talk by the authors at Wikidata Workshop 2020.
"A Review of Public Datasets in Question Answering Research"
Presented at the ACM Special Interest Group on Information Retrieval (SIGIR) forum last December, this paper found that the majority of Question Answering (QA) datasets are based on Wikipedia data.
Wikipedia has "become more popular in research on knowledge representation and natural language processing" in recent years
From the "Evaluation" section of an AAAI'21 paper titled "Identifying Used Methods and Datasets in Scientific Publications":
"SF-QA: Simple and Fair Evaluation Library for Open-domain Question Answering"
The contributions of this paper include
"The Truth is Out There: Investigating Conspiracy Theories in Text Generation"
This preprint includes a dataset consisting of 17 conspiracy theory topics from Wikipedia (including e.g. the articles Death of Marilyn Monroe, Men in black, Sandy Hook school shooting) and comes with a content warning ("Note: This paper contains examples of potentially offensive conspiracy theory text").
"Spontaneous versus interaction-driven burstiness in human dynamics: The case of Wikipedia edit history"
From the abstract:
(See also our earlier coverage of research on editors' burstiness)