Diffstat (limited to 'index.html')
-rw-r--r-- | index.html | 41 |
1 files changed, 38 insertions, 3 deletions
@@ -16,6 +16,7 @@
 <p>
 These are various lists of words extracted from Wiktionary data dumps.
 Some of the code used to produce them is available <a href="https://github.com/pommicket/wiktionary" target="_blank">here</a>.<br>
+Of course, all these lists undoubtedly contain errors because Wiktionary contains errors.<br>
 You can do whatever you like with them, subject to <a href="https://en.wiktionary.org/wiki/Wiktionary:Copyrights" target="_blank">Wiktionary's licensing</a>, where applicable.
 </p>

@@ -26,10 +27,44 @@
 Words labelled <i>offensive</i> on Wiktionary were filtered out (overly aggressively—some totally inoffensive words were removed in the process).
 </li>
 <li>
-English definitions: <a href="/wiktionary/en-definitions.txt.xz">en-definitions.txt.xz (22MB compressed, 115MB uncompressed, 1,629,682 entries)</a>.¹<br>
-Every English definition in English wiktionary. Format is <code style="white-space: pre;">WORD  DEFINITION</code>
-on each line (note: delimiter is <b>2</b> spaces).<br>
+English definitions:
+<a href="/wiktionary/en-definitions.txt.xz">en-definitions.txt.xz (23MB compressed, 127MB uncompressed, 1,629,482 entries)</a>
+and<br>Translingual definitions:
+<a href="/wiktionary/trans-definitions.txt.xz">trans-definitions.txt.xz (MB compressed, MB uncompressed, entries)</a>.¹<br>
+Every English/Translingual definition in English wiktionary.
+Format is <code style="white-space: pre;">WORD  PART_OF_SPEECH DEFINITION</code>
+on each line (note the two spaces between word and part of speech).<br>
 Words can have multiple definitions; they are listed as separate lines.<br>
+<code>PART_OF_SPEECH</code> is one of the following:
+<ul>
+<li><code>%adjective</code> (e.g. <i>unbelievable</i>)</li>
+<li><code>%noun</code> (e.g. <i>belief</i>)</li>
+<li><code>%noun.proper</code> (e.g. <i>France</i>)</li>
+<li><code>%verb</code> (e.g. <i>believe</i>)</li>
+<li><code>%adverb</code> (e.g. <i>unbelievably</i>)</li>
+<li><code>%interjection</code> (e.g. <i>yowza</i>)</li>
+<li><code>%particle</code> (e.g. <i>O</i>)</li>
+<li><code>%conjunction</code> (e.g. <i>unless</i>)</li>
+<li><code>%preposition</code> (e.g. <i>into</i>)</li>
+<li><code>%determiner</code> (e.g. <i>the</i>)</li>
+<li><code>%pronoun</code> (e.g. <i>yourself</i>)</li>
+<li><code>%contraction</code> (e.g. <i>woulda</i>)</li>
+<li><code>%number</code> (e.g. <i>2</i>, <i>twenty-seven</i>)</li>
+<li><code>%phrase</code> (e.g. <i>you'd better believe it</i>)</li>
+<li><code>%phrase.prepositional</code> (e.g. <i>beyond belief</i>)</li>
+<li><code>%phrase.proverb</code> (e.g. <i>seeing is believing</i>)</li>
+<li><code>%affix</code> (e.g. <i>🅱</i>, a “simulfix”)</li>
+<li><code>%affix.prefix</code> (e.g. <i>un-</i>)</li>
+<li><code>%affix.suffix</code> (e.g. <i>-ism</i>)</li>
+<li><code>%affix.infix</code> (e.g. <i>-fuckin-</i>)</li>
+<li><code>%affix.circumfix</code> (e.g. <i>a- -ing</i>)</li>
+<li><code>%affix.interfix</code> (rare, e.g. <i>-retin-</i>)</li>
+<li><code>%symbol</code> (e.g. <i>℞</i>)</li>
+<li><code>%symbol.punctuation</code> (e.g. <i>…</i>)</li>
+<li><code>%symbol.letter</code> (e.g. <i>b</i>)</li>
+<li><code>%symbol.diacritic</code> (e.g. <i>◌́</i>)</li>
+<li><code>%unknown</code> — couldn’t be determined/none of the above</li>
+</ul>
 <code>DEFINITION</code> is in the wikitext format.<br>
 It’s possible that there are parsing errors, but I haven’t spotted any yet.
 </li>
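
The line format introduced by this change (word, two spaces, a `%`-prefixed part of speech, one space, definition) could be read with something like the sketch below. It is not taken from the linked repository: the names `Entry`, `parse_line` and `iter_entries` are hypothetical, the sample definitions in the usage note are made up, and it assumes the files are UTF-8 and that a word never itself contains two consecutive spaces.

```python
import lzma
from typing import Iterator, NamedTuple, Optional


class Entry(NamedTuple):
    word: str
    part_of_speech: str  # e.g. "%noun" or "%affix.prefix"
    definition: str      # left as raw wikitext


def parse_line(line: str) -> Optional[Entry]:
    """Split one line of the form 'WORD  PART_OF_SPEECH DEFINITION'.

    Returns None if the line does not match that shape.
    """
    line = line.rstrip("\n")
    # Words may contain single spaces (phrases), so split on the first
    # run of two spaces rather than on any whitespace.
    word, sep, rest = line.partition("  ")
    if not sep or not rest.startswith("%"):
        return None
    # The part of speech has no spaces; everything after it is the definition.
    part_of_speech, _, definition = rest.partition(" ")
    return Entry(word, part_of_speech, definition)


def iter_entries(path: str) -> Iterator[Entry]:
    """Yield entries from an .xz-compressed list such as en-definitions.txt.xz."""
    with lzma.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            entry = parse_line(line)
            if entry is not None:  # skip lines that do not fit the format
                yield entry
```

For example, `parse_line("beyond belief  %phrase.prepositional Not to be believed.")` would give an `Entry` whose `word` is `beyond belief`, and a line without the two-space delimiter would give `None`.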