diff options
Diffstat (limited to 'README.md')
-rw-r--r-- | README.md | 99 |
1 files changed, 99 insertions, 0 deletions
diff --git a/README.md b/README.md new file mode 100644 index 0000000..141047a --- /dev/null +++ b/README.md @@ -0,0 +1,99 @@ +# boostrapping a (Linux x86-64) C compiler + +Compilers nowadays are written in languages like C, which themselves need to be +compiled. But then, you need a C compiler to compile your C compiler! Of course, +the very first C compiler was not written in C (because how would it be +compiled?). Instead, it was slowly built up, starting from a very basic +assembler, eventually reacing a full-scale compiler. This process is known as +bootstrapping. In this repository, we'll explore how that's done. Each directory +represents a new "stage" in the process. The first one, `00`, is a hand-written +executable, and the last one will be a C compiler. Each directory has its own +README explaining what's going on. + +You can run `bootstrap.sh` to run through and test every stage. + +## the basics + +In this series, I want to explain *everything* that's going on. I'm going to +need to assume some passing knowledge about computers, but here's a quick +overview of what you'll want to know before starting. I can't explain everything +so you may need to do your own research. You don't need to understand each of +these in full, just get a general idea at least: + +- what an operating system is +- what memory is +- what a programming language is +- what a compiler is +- what an executable file is +- number bases -- if a number is preceded by 0x, 0o, or 0b in this series, that +means hexadecimal/octal/binary respectively. So 0xff = FF hexadecimal = 255 +decimal. +- what a CPU is +- what a CPU architecture is +- what a CPU register is +- what a pointer is +- bits, bytes, kilobytes, etc. +- bitwise operations (not, or, and, xor, left shift, right shift) +- 2's complement +- null-terminated strings +- how floating-point numbers work +- maybe some basic Intel-style x86-64 assembly (you can probably pick it up on +the way though) + + +## instruction set + +x86-64 has a *gigantic* instruction set. The manual for it is over 2,000 pages +long! So, it makes sense to select only a small subset of it to use for all the +stages of our compiler. The set I've chosen can be found in `instructions.txt`. +I think it achieves a pretty good balance between having few enough +instructions to be manageable and having enough instructions to be useable. +To be clear, you don't need to read that file to understand the series, at least +not right away. + +## principles + +- as simple as possible + +Bootstrapping a compiler is not an easy task, so we're trying to make it as easy +as possible. We don't even necessarily need a standard-compliant C compiler, we +only need enough to compile someone else's C compiler, specifically TCC +(https://bellard.org/tcc/) since that's a compiler with very few dependencies. + +- efficiency is not a concern + +We will create big and slow executables, and that's okay. It doesn't really +matter if compiling TCC takes 8 as opposed to 0.01 seconds; once we compile TCC +with itself, we'll get the same executable either way. + +## reflections on trusting trust + +In 1984, Ken Thompson wrote the well-known article +[*Reflections on Trusting Trust*](http://users.ece.cmu.edu/~ganger/712.fall02/papers/p761-thompson.pdf). +This is one of the things that inspired me to start this project. To summarize +the article: it is possible to create a malicious C compiler which will +replicate its own malicious functionalities (e.g. detecting password-checking +routines to make them also accept another password the attacker knows) when used +to compile other C compilers. For all we know, such a compiler was used to +compile GCC, say, and so all programs around today could be compromised. Of +course, this is practically definitely not the case, but it's still an +interesting experiment to try to create a fully trustable compiler. This +project can't necessarily even do that though, because the Linux kernel, which +we depend on, is compiled from C, so we can't fully trust *it*. To *truly* +create a fully trustable compiler, you'd need to manually write to a USB with a +circuit, create an operating system from nothing (without even a text editor), +and then follow this series, or maybe you don't even trust your CPU vendor... +I'll leave that to someone else + +## license + +``` +This project is in the public domain. Any copyright protections from any law +for this project are forfeited by the author(s). No warranty is provided for +this project, and the author(s) shall not be held liable in connection with it. +``` + +## contributing + +If you notice a mistake/want to clarify something, you can submit a pull request +via GitHub, or email `pommicket at pommicket.com`. Translations are welcome! |