-rw-r--r--  .gitignore   2
-rw-r--r--  index.html  39
2 files changed, 40 insertions, 1 deletions
@@ -1,6 +1,6 @@
 /target
 enwiktionary-*.xml-p*
-definitions.txt
+*definitions.txt*
 .*.tmp
 *~
 .vscode
diff --git a/index.html b/index.html
new file mode 100644
index 0000000..3bd880c
--- /dev/null
+++ b/index.html
@@ -0,0 +1,39 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+ <meta charset="utf-8">
+ <title>Wiktionary word lists</title>
+ <meta content="width=device-width,initial-scale=1" name="viewport">
+ <meta property="og:title" content="Wiktionary word lists">
+ <meta property="og:type" content="article">
+ <meta property="og:url" content="https://s.pommicket.com/wiktionary/index.html">
+ <meta property="og:locale" content="en_US">
+ <meta property="og:site_name" content="pommicket.com">
+ <meta property="article:author" content="pommicket">
+</head>
+<body>
+ <h2>pommicket's Wiktionary-based word lists</h2>
+ <p>
+ These are various lists of words extracted from Wiktionary data dumps. Some of the code
+ used to produce them is available <a href="https://github.com/pommicket/wiktionary" target="_blank">here</a>.<br>
+ You can do whatever you like with them, subject to
+ <a href="https://en.wiktionary.org/wiki/Wiktionary:Copyrights" target="_blank">Wiktionary's licensing</a>, where applicable.
+ </p>
+ <ul>
+ <li>
+ The Big List: <a href="/tmt/word-list.txt.xz">word-list.txt.xz (27MB compressed, 120MB uncompressed, 9,878,558 entries)</a>.¹<br>
+ Every English Wikipedia article title & entry in English Wiktionary; containing only ASCII a-z/A-Z/space, max 2 words.<br>
+ Words labelled <i>offensive</i> on Wiktionary were filtered out (overly aggressively—some totally inoffensive words were removed in the process).
+ </li>
+ <li>
+ English definitions: <a href="/wiktionary/en-definitions.txt.xz">en-definitions.txt.xz (22MB compressed, 115MB uncompressed, 1,629,682 entries)</a>.¹<br>
+ Every English definition in English wiktionary. Format is <code style="white-space: pre;">WORD  DEFINITION</code>
+ on each line (note: delimiter is <b>2</b> spaces).<br>
+ Words can have multiple definitions; they are listed as separate lines.<br>
+ <code>DEFINITION</code> is in the wikitext format.<br>
+ It’s possible that there are parsing errors, but I haven’t spotted any yet.
+ </li>
+ </ul>
+ <p>¹ Derived from <a href="https://dumps.wikimedia.org/enwiktionary/20250701/" target="_blank">enwiktionary-20250701</a> dump.</p>
+</body>
+</html>
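The page added in this diff describes en-definitions.txt.xz as one `WORD  DEFINITION` pair per line, separated by two spaces, with a word's multiple definitions on separate lines. A minimal sketch of how a consumer might read that format follows. It is not part of the commit: the function names `parse_definitions` and `load_definitions` are hypothetical, and it assumes the file is UTF-8 and that the word itself never contains two consecutive spaces.

```python
import lzma
from collections import defaultdict


def parse_definitions(lines):
    """Group definitions by word from 'WORD  DEFINITION' lines.

    Assumption: the delimiter is the first run of two spaces on the line,
    as the page states; lines without that delimiter are skipped.
    """
    entries = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        # partition() splits on the first occurrence only, so any later
        # double spaces stay inside the (wikitext) definition.
        word, sep, definition = line.partition("  ")
        if not sep:
            continue  # no two-space delimiter: not a WORD  DEFINITION line
        entries[word].append(definition)
    return dict(entries)


def load_definitions(path):
    """Read an .xz-compressed definitions file (UTF-8 is an assumption)."""
    with lzma.open(path, "rt", encoding="utf-8") as f:
        return parse_definitions(f)
```

Calling `load_definitions("en-definitions.txt.xz")` on a downloaded copy would return a dict mapping each word to its list of wikitext definitions. Holding all 1,629,682 entries in memory at once is a convenience choice; iterating over the file line by line would use less memory.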