diff options
-rw-r--r-- | 05/README.md | 100 |
1 files changed, 98 insertions, 2 deletions
diff --git a/05/README.md b/05/README.md index 276aac6..5e5ba1d 100644 --- a/05/README.md +++ b/05/README.md @@ -16,11 +16,107 @@ it, you should get the output Hello, world! ``` -So now we just compile TCC with itself, and we're done, right? -Well, not quite... +## the C compiler + +The C compiler for this stage is written in the [04 language](../04/README.md), using the [04a preprocessor](../04a/README.md) +and is spread out across multiple files: + +``` +util.b - various utilities (syscall, puts, memset, etc.) +constants.b - numerical and string constants used by the rest of the program +idents.b - functions for creating mappings from identifiers to arbitrary 64-bit values +preprocess.b - preprocesses C files +tokenize.b - turns preprocessing tokens into tokens (see explanation below) +parse.b - turns tokens into a nice representation of the program +codegen.b - turns parse.b's representation into actual code +main.b - puts everything together +``` + +The whole thing is ~12,000 lines of code, which is ~280KB when compiled. + +### the C standard + +In 1989, the C programming language was standardized by the [ANSI](https://en.wikipedia.org/wiki/American_National_Standards_Institute). + +The C89 standard (in theory) defines which C programs are legal, and exactly what any particular legal C program does. +A draft of it, which is about as good as the real thing, is [available here](http://port70.net/~nsz/c/c89/c89-draft.html). + +Since 1989, more features have been added to C, and so more C standards have been published. +To keep things simple, our compiler only supports the features from C89 (with a few exceptions). + + +### compiling a C program + +Compiling a C program involves several "translation phases" (C89 standard ยง 2.1.1.2). +Here, I'll only be outlining the process our C compiler uses. The technical details +of the standard are slightly different. + +First, each time a backslash is immediately followed by a newline, both are deleted, e.g. +``` +Hel\ +lo, +wo\ +rld! +``` +becomes +``` +Hello, +world! +``` +Well, we actually turn this into +``` +Hello, + +world! + +``` +so that line numbers are preserved for errors (this doesn't change the meaning of any program). +This feature exists so that you can spread one line of code across multiple lines, which is useful sometimes. + +Then, comments are deleted (technically, replaced with spaces), and the file is split up into +*preprocesing tokens*. A preprocessing token is one of: + +- A number (e.g. `5`, `10.2`, `3.6.6`) +- A string literal (e.g. `"Hello"`) +- A symbol (e.g. `<`, `{`, `.`) +- An identifier (e.g. `int`, `x`, `main`) +- A character constant (e.g. `'a'`, `'\n'`) +- A space character +- A newline character + +Note that preprocessing tokens are just strings of characters, and aren't assigned any meaning yet; `3.6.6e-.3` is a valid +"preprocessing number" even though it's gibberish. + +Next, preprocessor directives are executed. These include things like +``` +#define A_NUMBER 4 +``` +which will replace every preprocessing token consisting of the identifier `A_NUMBER` in the rest of the program with `4`. Also in this phase, +``` +#include "X" +``` +is replaced with the (preprocessing tokens in the) file named `X`. + +Then preprocessing tokens are turned into *tokens*. +Tokens are one of: + +- A keyword (e.g. `int`, `while`) +- A symbol (e.g. `<`, `-`, `{`) +- An identifier (e.g. `main`, `f`, `x_3`) +- An integer literal (e.g. `77`, `0x123`) +- A character literal (e.g. `'a'`, `'\n'`) +- A floating-point literal (e.g. `3.6`, `5e10`) + +## limitations + +## modifications of tcc's source code + ## the nightmare begins +So now we just compile TCC with itself, and we're done, right? +Well, not quite... + The issue here is that to compile TCC/GCC with TCC, we need libc, the C standard library functions. Our C compiler just includes these functions in the standard header files, but normally the code for them is located in a separate library file (called something like |