summaryrefslogtreecommitdiff
path: root/README.md
diff options
context:
space:
mode:
Diffstat (limited to 'README.md')
-rw-r--r--README.md99
1 files changed, 99 insertions, 0 deletions
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..141047a
--- /dev/null
+++ b/README.md
@@ -0,0 +1,99 @@
+# boostrapping a (Linux x86-64) C compiler
+
+Compilers nowadays are written in languages like C, which themselves need to be
+compiled. But then, you need a C compiler to compile your C compiler! Of course,
+the very first C compiler was not written in C (because how would it be
+compiled?). Instead, it was slowly built up, starting from a very basic
+assembler, eventually reacing a full-scale compiler. This process is known as
+bootstrapping. In this repository, we'll explore how that's done. Each directory
+represents a new "stage" in the process. The first one, `00`, is a hand-written
+executable, and the last one will be a C compiler. Each directory has its own
+README explaining what's going on.
+
+You can run `bootstrap.sh` to run through and test every stage.
+
+## the basics
+
+In this series, I want to explain *everything* that's going on. I'm going to
+need to assume some passing knowledge about computers, but here's a quick
+overview of what you'll want to know before starting. I can't explain everything
+so you may need to do your own research. You don't need to understand each of
+these in full, just get a general idea at least:
+
+- what an operating system is
+- what memory is
+- what a programming language is
+- what a compiler is
+- what an executable file is
+- number bases -- if a number is preceded by 0x, 0o, or 0b in this series, that
+means hexadecimal/octal/binary respectively. So 0xff = FF hexadecimal = 255
+decimal.
+- what a CPU is
+- what a CPU architecture is
+- what a CPU register is
+- what a pointer is
+- bits, bytes, kilobytes, etc.
+- bitwise operations (not, or, and, xor, left shift, right shift)
+- 2's complement
+- null-terminated strings
+- how floating-point numbers work
+- maybe some basic Intel-style x86-64 assembly (you can probably pick it up on
+the way though)
+
+
+## instruction set
+
+x86-64 has a *gigantic* instruction set. The manual for it is over 2,000 pages
+long! So, it makes sense to select only a small subset of it to use for all the
+stages of our compiler. The set I've chosen can be found in `instructions.txt`.
+I think it achieves a pretty good balance between having few enough
+instructions to be manageable and having enough instructions to be useable.
+To be clear, you don't need to read that file to understand the series, at least
+not right away.
+
+## principles
+
+- as simple as possible
+
+Bootstrapping a compiler is not an easy task, so we're trying to make it as easy
+as possible. We don't even necessarily need a standard-compliant C compiler, we
+only need enough to compile someone else's C compiler, specifically TCC
+(https://bellard.org/tcc/) since that's a compiler with very few dependencies.
+
+- efficiency is not a concern
+
+We will create big and slow executables, and that's okay. It doesn't really
+matter if compiling TCC takes 8 as opposed to 0.01 seconds; once we compile TCC
+with itself, we'll get the same executable either way.
+
+## reflections on trusting trust
+
+In 1984, Ken Thompson wrote the well-known article
+[*Reflections on Trusting Trust*](http://users.ece.cmu.edu/~ganger/712.fall02/papers/p761-thompson.pdf).
+This is one of the things that inspired me to start this project. To summarize
+the article: it is possible to create a malicious C compiler which will
+replicate its own malicious functionalities (e.g. detecting password-checking
+routines to make them also accept another password the attacker knows) when used
+to compile other C compilers. For all we know, such a compiler was used to
+compile GCC, say, and so all programs around today could be compromised. Of
+course, this is practically definitely not the case, but it's still an
+interesting experiment to try to create a fully trustable compiler. This
+project can't necessarily even do that though, because the Linux kernel, which
+we depend on, is compiled from C, so we can't fully trust *it*. To *truly*
+create a fully trustable compiler, you'd need to manually write to a USB with a
+circuit, create an operating system from nothing (without even a text editor),
+and then follow this series, or maybe you don't even trust your CPU vendor...
+I'll leave that to someone else
+
+## license
+
+```
+This project is in the public domain. Any copyright protections from any law
+for this project are forfeited by the author(s). No warranty is provided for
+this project, and the author(s) shall not be held liable in connection with it.
+```
+
+## contributing
+
+If you notice a mistake/want to clarify something, you can submit a pull request
+via GitHub, or email `pommicket at pommicket.com`. Translations are welcome!