<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Wiktionary word lists</title>
<meta content="width=device-width,initial-scale=1" name="viewport">
<meta property="og:title" content="Wiktionary word lists">
<meta property="og:type" content="article">
<meta property="og:url" content="https://s.pommicket.com/wiktionary/index.html">
<meta property="og:locale" content="en_US">
<meta property="og:site_name" content="pommicket.com">
<meta property="article:author" content="pommicket">
</head>
<body>
<h2>pommicket's Wiktionary-based word lists</h2>
<p>
These are various lists of words extracted from Wiktionary data dumps. Some of the code
used to produce them is available <a href="https://github.com/pommicket/wiktionary" target="_blank">here</a>.<br>
Of course, all these lists undoubtedly contain errors because Wiktionary contains errors.<br>
You can do whatever you like with them, subject to
<a href="https://en.wiktionary.org/wiki/Wiktionary:Copyrights" target="_blank">Wiktionary's licensing</a>, where applicable.
</p>
<ul>
<li>
English definitions:
<a href="/wiktionary/en-definitions.txt.xz">en-definitions.txt.xz (23MB compressed, 127MB uncompressed, 1,629,482 entries)</a>
and<br>Translingual definitions:
<a href="/wiktionary/trans-definitions.txt.xz">trans-definitions.txt.xz (696KB compressed, 4.3MB uncompressed, 48,138 entries)</a>.¹<br>
Every English/Translingual definition in English Wiktionary.
Format is <code style="white-space: pre;">WORD  PART_OF_SPEECH DEFINITION</code>
on each line (note the two spaces between word and part of speech).<br>
A word with multiple definitions gets one line per definition.<br>
<code>PART_OF_SPEECH</code> is one of the following:
<ul>
<li><code>%adjective</code> (e.g. <i>unbelievable</i>)</li>
<li><code>%noun</code> (e.g. <i>belief</i>)</li>
<li><code>%noun.proper</code> (e.g. <i>France</i>)</li>
<li><code>%verb</code> (e.g. <i>believe</i>)</li>
<li><code>%adverb</code> (e.g. <i>unbelievably</i>)</li>
<li><code>%interjection</code> (e.g. <i>yowza</i>)</li>
<li><code>%particle</code> (e.g. <i>O</i>)</li>
<li><code>%conjunction</code> (e.g. <i>unless</i>)</li>
<li><code>%preposition</code> (e.g. <i>into</i>)</li>
<li><code>%determiner</code> (e.g. <i>the</i>)</li>
<li><code>%pronoun</code> (e.g. <i>yourself</i>)</li>
<li><code>%contraction</code> (e.g. <i>woulda</i>)</li>
<li><code>%number</code> (e.g. <i>2</i>, <i>twenty-seven</i>)</li>
<li><code>%phrase</code> (e.g. <i>you'd better believe it</i>)</li>
<li><code>%phrase.prepositional</code> (e.g. <i>beyond belief</i>)</li>
<li><code>%phrase.proverb</code> (e.g. <i>seeing is believing</i>)</li>
<li><code>%affix</code> (e.g. <i>🅱</i>, a “simulfix”)</li>
<li><code>%affix.prefix</code> (e.g. <i>un-</i>)</li>
<li><code>%affix.suffix</code> (e.g. <i>-ism</i>)</li>
<li><code>%affix.infix</code> (e.g. <i>-fuckin-</i>)</li>
<li><code>%affix.circumfix</code> (e.g. <i>a- -ing</i>)</li>
<li><code>%affix.interfix</code> (rare, e.g. <i>-retin-</i>)</li>
<li><code>%symbol</code> (e.g. <i>℞</i>)</li>
<li><code>%symbol.punctuation</code> (e.g. <i>…</i>)</li>
<li><code>%symbol.letter</code> (e.g. <i>b</i>)</li>
<li><code>%symbol.diacritic</code> (e.g. <i>◌́</i>)</li>
<li><code>%unknown</code> — couldn’t be determined/none of the above</li>
</ul>
<code>DEFINITION</code> is in wikitext format.<br>
It’s possible that there are parsing errors, but I haven’t spotted any yet.
</li>
<li>
All English animal terms: <a href="/wiktionary/animalia.txt.xz">animalia.txt.xz (62KB compressed, 192KB uncompressed)</a>.¹<br>
This includes both nouns referring to animals (e.g. <i>dog</i>) and animal-related adjectives (e.g. <i>canine</i>).
There may well be errors due to bad parsing, but I have spot-checked a number of entries at random and they look fine.
</li>
<li>
The Big List: <a href="/tmt/word-list.txt.xz">word-list.txt.xz (27MB compressed, 120MB uncompressed, 9,878,558 entries)</a>.¹<br>
Every English Wikipedia article title and every English Wiktionary entry that contains only ASCII a-z/A-Z/space and has at most 2 words.<br>
Words labelled <i>offensive</i> on Wiktionary were filtered out (overly aggressively: some totally inoffensive words were removed in the process).
</li>
</ul>
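<p>
The definitions format described above (word, two spaces, part of speech, one space, definition) can be read with a short script.
What follows is only a minimal Python sketch, assuming the compressed files contain UTF-8 text and every line follows that layout.
The names <code>Entry</code>, <code>parse_line</code> and <code>read_entries</code> are made up for this example and are not taken from the linked repository.
</p>
<pre><code>import lzma
from collections import namedtuple

# Hypothetical record type for illustration; not part of the published lists.
Entry = namedtuple("Entry", ["word", "part_of_speech", "definition"])


def parse_line(line):
    """Split one 'WORD  PART_OF_SPEECH DEFINITION' line into an Entry.

    Returns None if the line does not match the expected layout.
    """
    line = line.rstrip("\n")
    # Two spaces end the word, so the word itself may contain single spaces.
    word, sep, rest = line.partition("  ")
    if not sep or not rest.startswith("%"):
        return None
    # The part of speech has no spaces; the remainder is the wikitext definition.
    part_of_speech, _, definition = rest.partition(" ")
    return Entry(word, part_of_speech, definition)


def read_entries(path):
    """Yield entries from an .xz list, assuming the file is UTF-8 text."""
    with lzma.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            entry = parse_line(line)
            if entry is not None:
                yield entry
</code></pre>
<p>
For example, <code>read_entries("en-definitions.txt.xz")</code> would yield one <code>Entry</code> per definition line, so a word with several definitions appears several times.
Splitting on the first double space is what lets multi-word entries such as <i>beyond belief</i> keep their internal single spaces.
</p>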
<p>¹ Derived from the <a href="https://dumps.wikimedia.org/enwiktionary/20250701/" target="_blank">enwiktionary-20250701</a> dump.</p>
</body>
</html>