author pommicket <pommicket@gmail.com> 2022-02-18 17:16:49 -0500
committer pommicket <pommicket@gmail.com> 2022-02-18 17:16:49 -0500
commit 59b7931165ecbd189214142b95d3d2033f4f579f
tree 9cce37eb4b003f07222ca9298b793b7f8117c4c7 /05
parent 06def8fb862286658ef8cfc5cebb2711faadba1e
more 05 readme
Diffstat (limited to '05')
-rw-r--r-- 05/README.md 100
1 files changed, 98 insertions, 2 deletions
diff --git a/05/README.md b/05/README.md
index 276aac6..5e5ba1d 100644
--- a/05/README.md
+++ b/05/README.md
@@ -16,11 +16,107 @@ it, you should get the output
Hello, world!
```
-So now we just compile TCC with itself, and we're done, right?
-Well, not quite...
+## the C compiler
+
+The C compiler for this stage is written in the [04 language](../04/README.md), using the [04a preprocessor](../04a/README.md)
+and is spread out across multiple files:
+
+```
+util.b - various utilities (syscall, puts, memset, etc.)
+constants.b - numerical and string constants used by the rest of the program
+idents.b - functions for creating mappings from identifiers to arbitrary 64-bit values
+preprocess.b - preprocesses C files
+tokenize.b - turns preprocessing tokens into tokens (see explanation below)
+parse.b - turns tokens into a nice representation of the program
+codegen.b - turns parse.b's representation into actual code
+main.b - puts everything together
+```
+
+The whole thing is ~12,000 lines of code, which is ~280KB when compiled.
+
+### the C standard
+
+In 1989, the C programming language was standardized by the [ANSI](https://en.wikipedia.org/wiki/American_National_Standards_Institute).
+
+The C89 standard (in theory) defines which C programs are legal, and exactly what any particular legal C program does.
+A draft of it, which is about as good as the real thing, is [available here](http://port70.net/~nsz/c/c89/c89-draft.html).
+
+Since 1989, more features have been added to C, and so more C standards have been published.
+To keep things simple, our compiler only supports the features from C89 (with a few exceptions).
+
+
+### compiling a C program
+
+Compiling a C program involves several "translation phases" (C89 standard § 2.1.1.2).
+Here, I'll only be outlining the process our C compiler uses. The technical details
+of the standard are slightly different.
+
+First, each time a backslash is immediately followed by a newline, both are deleted, e.g.
+```
+Hel\
+lo,
+wo\
+rld!
+```
+becomes
+```
+Hello,
+world!
+```
+Well, we actually turn this into
+```
+Hello,
+
+world!
+
+```
+so that line numbers are preserved for errors (this doesn't change the meaning of any program).
+This feature exists so that you can spread one line of code across multiple lines, which is useful sometimes.
+
+Then, comments are deleted (technically, replaced with spaces), and the file is split up into
+*preprocessing tokens*. A preprocessing token is one of:
+
+- A number (e.g. `5`, `10.2`, `3.6.6`)
+- A string literal (e.g. `"Hello"`)
+- A symbol (e.g. `<`, `{`, `.`)
+- An identifier (e.g. `int`, `x`, `main`)
+- A character constant (e.g. `'a'`, `'\n'`)
+- A space character
+- A newline character
+
+Note that preprocessing tokens are just strings of characters, and aren't assigned any meaning yet; `3.6.6e-.3` is a valid
+"preprocessing number" even though it's gibberish.
+
+Next, preprocessor directives are executed. These include things like
+```
+#define A_NUMBER 4
+```
+which will replace every preprocessing token consisting of the identifier `A_NUMBER` in the rest of the program with `4`. Also in this phase,
+```
+#include "X"
+```
+is replaced with the (preprocessing tokens in the) file named `X`.
+
+Then preprocessing tokens are turned into *tokens*.
+Tokens are one of:
+
+- A keyword (e.g. `int`, `while`)
+- A symbol (e.g. `<`, `-`, `{`)
+- An identifier (e.g. `main`, `f`, `x_3`)
+- An integer literal (e.g. `77`, `0x123`)
+- A character literal (e.g. `'a'`, `'\n'`)
+- A floating-point literal (e.g. `3.6`, `5e10`)
+
+## limitations
+
+## modifications of tcc's source code
+
## the nightmare begins
+So now we just compile TCC with itself, and we're done, right?
+Well, not quite...
+
The issue here is that to compile TCC/GCC with TCC, we need libc, the C standard library functions.
Our C compiler just includes these functions in the standard header files, but normally
the code for them is located in a separate library file (called something like