Wikipedia:Typo Team/moss/Archive

DNA

DNA sequences, like those in

Hmm, I will have to ask around on MoS or something. Thanks for finding that. -- Beland (talk) 01:38, 19 July 2018 (UTC)

If not, we could make one: a template with, at bare minimum, <span class="dna-sequence">{{{1}}}</span>. Do similarly for the poem structure patterns. We did this with trade designations for horticultural plants, and it has worked out well: {{tdes}}. Turns out the nomenclature authority requires them (in a scientific name) to be in a different font, so we used kerned monospace (it supports extra options, but that part was probably a bad idea). Anyway, here are all "Template:"-namespace pages with "dna" in their titles, and here are those with "gene", in case there's already a template for this (I have not pored over them).  — SMcCandlish ¢ 😼 14:48, 21 July 2018 (UTC)
Oof, that has resulted in some pretty ugly plant name typography; I wish we hadn't followed the typographical conventions of that source. I do like the idea of a template, though - that would make it easy for anyone who is interested to find all of the DNA sequences on Wikipedia...which is a thing that could happen? I put your code in {{DNA sequence}} and applied that to this article; thanks for throwing that together! I'll ponder poem patterns a bit more. -- Beland (talk) 05:44, 25 July 2018 (UTC)

Fixed

These are due to difficult-to-parse mixtures of tables and templates. ::sigh:: I think I can fix this in code. -- Beland (talk) 00:49, 19 July 2018 (UTC)

These should be ignored in the next run (20 July 2018 dump or later). -- Beland (talk) 22:28, 25 July 2018 (UTC)
ghola is either related to West Bengal, Pakistan, Afghanistan, or related to the Dune universe; Wiktionary does not have either wikt:ghola or wikt:gholas;
  24 - "gholas" : of 24 matches only one (Hasnabad (community development block)) is not from the Dune universe
427 - "ghola"
  59 - "ghola" -"bengal"
Of the 59, only 7 are not about the Dune universe: Ghoul; Prem Pujari/List of songs recorded by Kishore Kumar (a song title); Mount Paiko/Kharkoo (places); Bogeyman; List of rampage killers/List of rampage killers (familicides) (a town).
So this is the plural of a word that is most often a made-up term from the Dune universe, not exactly ready for Wiktionary! What to do? Shenme (talk) 19:12, 28 August 2018 (UTC)
Ah, we have a redirect from ghola; I can add redirects to the exclusion list. I'll have to be careful of those with {{R from misspelling}} and variations, and we'll have to go through all untagged redirects and tag those that are also misspellings. (In the end, I think all redirects will be tagged; categorizing them helps projects decide whether or not they are worthy for inclusion in a print version or CD, etc.) -- Beland (talk) 19:53, 13 September 2018 (UTC)
Oh, redirects are already included in the dictionary. I just created a redirect from gholas, so this should be ignored on the next run. -- Beland (talk) 04:43, 24 September 2018 (UTC)
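The redirect handling discussed above can be sketched roughly as follows. This is a minimal illustration, not the moss code: the function names and the input shape (a mapping from redirect title to redirect wikitext) are assumptions, and only the literal {{R from misspelling}} tag is recognized, not the "variations" mentioned in the thread.

```python
import re
from typing import Dict, Set

# Matches the {{R from misspelling}} tag named in the discussion.
# Aliases and variations of the template are NOT covered (assumption:
# a real implementation would need a fuller list).
MISSPELLING_TAG = re.compile(r"\{\{\s*R from misspelling\b", re.IGNORECASE)


def is_misspelling_redirect(wikitext: str) -> bool:
    """Return True if the redirect page is tagged as a misspelling."""
    return bool(MISSPELLING_TAG.search(wikitext))


def redirect_titles_for_dictionary(redirects: Dict[str, str]) -> Set[str]:
    """Collect lowercased redirect titles that may be treated as known words.

    `redirects` is a hypothetical mapping of redirect title -> page wikitext.
    Redirects tagged as misspellings are left out so that the spell checker
    keeps reporting them.
    """
    return {
        title.lower()
        for title, wikitext in redirects.items()
        if not is_misspelling_redirect(wikitext)
    }
```

Untagged redirects that are in fact misspellings would still slip through, which is why the thread proposes tagging them by hand.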

Notes from Apr 2018

Poems

These are patterns used to describe poetry. Not sure they are appropriate for Wiktionary; if not, I will whitelist them. -- Beland (talk) 00:49, 19 July 2018 (UTC)

There may be a better and even conventionally marked-up way to represent these. Check poetry sources? Maybe they are done as c-d-c-d or whatever.  — SMcCandlish ¢ 😼 14:38, 21 July 2018 (UTC)

Oh, there are lots more where that came from. Maybe these should be tagged or maybe I can fix in code with a pattern recognizer or something. I'll have to ponder. -- Beland (talk) 01:42, 19 July 2018 (UTC)

From longest:

I think these should be capitalized or enclosed in quotes, either of which would prevent them from showing up here as spelling errors. I started a discussion at Wikipedia talk:Manual of Style § Rhyme scheme patterns. -- Beland (talk) 22:26, 25 July 2018 (UTC)

Continued at Wikipedia:Typo Team/moss#Repeating patterns. -- Beland (talk) 02:04, 17 August 2018 (UTC)
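For illustration, the "pattern recognizer" idea floated in this thread might look something like the sketch below. It is only a heuristic under stated assumptions, not what moss does: the function name, the minimum length, and the rule that each new letter must be the next letter of the alphabet are all choices made for this example. It is meant to run only on tokens the spell checker has already flagged as unknown, since ordinary words such as "noon" also fit the shape.

```python
import re

# Lowercase letters, optionally hyphen-separated (e.g. "abab" or "c-d-c-d").
SCHEME_SHAPE = re.compile(r"[a-z]+(?:-[a-z]+)*")


def looks_like_rhyme_scheme(token: str, min_length: int = 4) -> bool:
    """Heuristically decide whether an unknown token is a rhyme-scheme pattern.

    Assumed rules (not taken from the source):
    - only lowercase letters and hyphens;
    - at least `min_length` letters;
    - every newly introduced letter is the alphabetical successor of the
      highest letter seen so far (a, then b, then c, ...);
    - at least one letter repeats, since a rhyme needs a partner line.
    """
    if not SCHEME_SHAPE.fullmatch(token):
        return False
    letters = token.replace("-", "")
    if len(letters) < min_length:
        return False

    seen = {letters[0]}
    highest = letters[0]
    has_repeat = False
    for ch in letters[1:]:
        if ch in seen:
            has_repeat = True
        elif ord(ch) == ord(highest) + 1:
            seen.add(ch)
            highest = ch
        else:
            return False  # letter introduced out of sequence
    return has_repeat
```

Capitalized or quoted patterns, as proposed in the discussion, would not need this check at all because they would never reach the typo list.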

Notes from Jan 2019

Statistics

2018-04 to 2018-09

The spell checker has been getting smarter over time, so more recent versions report fewer false alarms. This explains most of the drop in the number of possible typos reported. Most of the gains for pages with more than 100 possible typos are due to changes that ignore pages with {{cleanup}} and similar tags, which indicate the page may not be ready for spell checking. I have been specifically tagging pages with a high number of possible typos to bring them to the attention of interested editors. Pages tagged for cleanup are reported in the statistics of cleanup-related work queues.

Some variation in the number of typos fixed between runs is also explained by the differences in the amount of time between runs. The biggest sources of variance are the unusually long time between the first two runs and the fact that dumps snapshotted on the first day of the month (which have a lot of additional data the spell checker doesn't need) take longer for Wikimedia servers to generate than the dumps snapshotted on the twentieth day of the month. There is also considerable activity from other editors writing new material and correcting typos as they find them while reading or editing articles.

moss project participants have been correcting hundreds or thousands of typos per month (yay!), mostly in articles with a single typo. We have also been adding somewhere from handfuls to dozens of entries to Wiktionary a month. Looking only at the generated reports, these numbers are difficult to separate from the other changes in data and code, but we do see progress as we strike through or remove items from the todo lists.

Since figuring out which words are not typos is such a big part of the problem to be solved, the code may need to get smarter in the future, but we're probably going to have an upcoming period of relative stability as we work through some low-hanging fruit. Hopefully upcoming statistics will reflect progress in actually reducing typos more than changes in spell checker code. -- Beland (talk) 18:20, 12 October 2018 (UTC)

2018-09 to 2019-03

At least 10% of possible typos reported in the old statistics are definitely misspellings, but it's unclear how many of the remaining 90% are. Below is a new way of breaking down possible typos, by type instead of count per article. The "T1" items are almost all typos, and those are what we've been working on in the main "by article" section. Some of the other types have their own reports on this page, but most will require further analysis to either automatically distinguish typos vs. legitimate strings, or produce a more useful report for human editors.

2019-03 to 2020-02

From 2018-09-20 to 2019-03-01, the number of typos classified as T1 (edit distance 1 from an English word, the most likely to be actual misspellings) dropped by 35,488, or 32%, and this appears to be due to the hard work of editors participating in the moss project fixing typos on the T1 lists. Amazing progress! The numbers for categories we aren't fixing have remained relatively stable, though for all categories there is some bouncing around as new typos are created and fixed in the normal course of writing and editing articles.

While processing the 2019-03-01 dump, I made a major change to how typos are classified. (You can see the old method in the archived statistics.) I've dropped categories with an edit distance greater than 3 from an English word (T4 thru T16) since these are quite unlikely to be misspellings. Most of the reported typos that are not likely English misspellings are either compound words or non-English words. (Some of the non-English words are also misspelled.) Some English compounds end up as TS, if they are caught by a conventional spell checker; the rest are now classified as ME. (There are various other categories for compounds, all starting with M, and these will all need to be refined later because a fair number of words are up there that don't belong.) In an effort to exclude as many non-English words as possible, I've started looking at non-English Wiktionaries; any words found there but not in the English Wiktionary are classified as W. Romanizations are not eligible for Wiktionary; words native to non-Latin writing systems are entered under those other systems. I've written some code that attempts to perform transliteration from any given writing system. It's starting to catch a few thousand words (classified as L) but is obviously missing a lot and so will need to be further refined. I've also added some categories for bad HTML tags and similar problems.
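As a rough illustration of the edit-distance categories described here, the sketch below assigns T1 through T3 by distance to the nearest known English word and drops anything farther away. It assumes plain Levenshtein distance, under which a swapped pair of letters counts as 2; the source does not say which distance measure moss uses, and the function names and the brute-force scan over the word list are choices made for this example only.

```python
from typing import Iterable, Optional


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute), row by row."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            substitution_cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete from a
                    current[j - 1] + 1,                   # insert into a
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def classify(word: str, english_words: Iterable[str],
             max_distance: int = 3) -> Optional[str]:
    """Return "T1".."T3" for a possible typo, or None if it is not one.

    None is returned when the word is itself known (distance 0), when the
    nearest known word is more than `max_distance` edits away, or when the
    word list is empty.
    """
    nearest = min(
        (edit_distance(word, known) for known in english_words),
        default=None,
    )
    if nearest is None or nearest == 0 or nearest > max_distance:
        return None
    return f"T{nearest}"
```

Scanning every dictionary word is far too slow for a full dump; it is used here only to keep the category logic visible.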

Since the classification changes make the new numbers incomparable with the old numbers, I've started a new table below. I've started posting some TS typos as well as T1s, so expect both of those numbers to improve significantly in the coming months. -- Beland (talk) 07:30, 23 March 2019 (UTC)

* Affected by significant algorithm changes. 1 Sep 2019: Added BC and BW. (Parse failures dropped due to JWB-powered MOS:STRAIGHT cleanup.) 20 Sep 2019: BC and BW restricted to lowercase; added TS+COMMA, TS+BRACKET, TS+EXTRA.

  • red = Probably need to fix
  • yellow = Unsorted
  • blue = Probably OK (but may need to verify)
  • bold = actively working on fixing

2020 statistics

In the year from March 2019 to March 2020, moss volunteers fixed over 94,000 typos! The most impressive progress is in the T1 category (single-letter misspellings), where we eliminated about half from the English Wikipedia. During this period we also started fixing missing spaces (focusing on those around punctuation) and those have dropped by about one-fifth. As we make progress, clear misspellings are increasingly mixed in with unclear cases; I'll be doing some more work on separation algorithms to keep the typo reports useful, so you'll probably see some more changes to typo classifications. Thanks to everyone who has been helping out! -- Beland (talk) 16:54, 28 April 2020 (UTC)

  • red = Probably need to fix
  • yellow = Unsorted
  • blue = Probably OK (but may need to verify)
  • bold = actively working on fixing

* Identification of Z was broken
** Affected by major bug fix for counting inter-word typos (e.g. involving punctuation)

2021 statistics

A major upgrade to word categorization was made in October 2021. The same dump is shown on the old and new systems for comparison. R, I, W, MI, MW, and ML were eliminated and sorted by language as TE or TF instead. New categories:

  • A = mAth
  • T/ = Suspected MOS:SLASH violation
  • TE = AI thinks it's trying to be English
  • TF = AI thinks it's trying to be a non-English language (Foreign to English Wikipedia), sorted by language (e.g. TF+el)

2022 statistics

* ae346b0 started ignoring content inside curly quotes
† 1432a2f excluded more end sections
‡ 5ee7ffd started ignoring italicised content
⹋ 6965e1f started ignoring content inside single quotes
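The kind of filtering these footnotes describe (skipping quoted and italicised text before spell checking) could be approximated as in the sketch below. It is not the code from the commits listed above: the patterns, the function name, and the assumption that italics appear as wikitext ''...'' at this stage are all illustrative. Bold markup and nested quotes are deliberately not handled.

```python
import re

# Order matters: wikitext italics are removed before straight single quotes
# so that the doubled apostrophes are not misread as quote marks.
IGNORED_SPANS = [
    re.compile(r"“[^”]*”"),                # curly double quotes
    re.compile(r"''[^']+''"),              # wikitext italics (assumed form)
    re.compile(r"(?<!\w)'[^']+'(?!\w)"),   # straight single quotes; the
                                           # lookarounds skip apostrophes
                                           # inside words such as "don't"
]


def strip_ignored_spans(text: str) -> str:
    """Replace quoted and italicised spans with a space before spell checking."""
    for pattern in IGNORED_SPANS:
        text = pattern.sub(" ", text)
    return text
```

Replacing each span with a space rather than an empty string keeps the neighbouring words from being fused into a new false typo.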

2023 statistics

* Due to software issues, language detection wasn't working for this run.

Likely new words by frequency (non-English)

From 2019-02-01 dump:

From 2019-02-01 dump, but clearly not foreign words (need to figure out what to do with them):

Case notes from 2019-06-01 dump

Uses material from the Wikipedia article Wikipedia:Typo Team/moss/Archive, released under the CC BY-SA 4.0 license.