summaryrefslogtreecommitdiff
path: root/05/README.md
diff options
context:
space:
mode:
Diffstat (limited to '05/README.md')
-rw-r--r--05/README.md145
1 files changed, 143 insertions, 2 deletions
diff --git a/05/README.md b/05/README.md
index 5e5ba1d..5f54a9a 100644
--- a/05/README.md
+++ b/05/README.md
@@ -9,8 +9,8 @@ make
```
to build our C compiler and TCC. This will take some time (approx. 25 seconds on my computer).
-A test program, `test.out` will be compiled using `tcc`. If you run
-it, you should get the output
+Two test programs will be produced: `a.out`, compiled using our C compiler, and
+`test.out`, compiled using `tcc`. If you run either one, you should get the output
```
Hello, world!
@@ -107,8 +107,149 @@ Tokens are one of:
- A character literal (e.g. `'a'`, `'\n'`)
- A floating-point literal (e.g. `3.6`, `5e10`)
+Next, an internal representation of the program is constructed in memory.
+This is where we read the tokens `if` `(` `a` `)` `printf` `(` `"Hello!\n"` `)` `;`
+and interpret it as an if statement, whose condition is the variable `a`, and whose
+body consists of the single statement calling the `printf` function with the argument `"Hello!\n"`.
+
+Finally, we output the code for every function.
+
+## executable format
+
+This compiler's executables are much more sophisticated than the previous ones'.
+Instead of storing code and data all in one segment, we have three segments: one
+6MB segment for code (the program's functions are only allowed to use up 4MB of that, though),
+one 4MB segment for read-only data (strings), and one 4MB segment for read-write data.
+
+Well, it *should* only be read-write, but unfortunately it also has to be executable...
+
+## syscalls
+
+Of course, we need some way of making system calls in C.
+We do this with a macro, `__syscall`, which you'll find in `stdc_common.h`:
+
+```
+static unsigned char __syscall_data[] = {
+ // mov rax, [rsp+24]
+ 0x48, 0x8b, 0x84, 0x24, 24, 0, 0, 0,
+ // mov rdi, rax
+ 0x48, 0x89, 0xc7,
+ // mov rax, [rsp+32]
+ 0x48, 0x8b, 0x84, 0x24, 32, 0, 0, 0,
+ // mov rsi, rax
+ 0x48, 0x89, 0xc6,
+ // mov rax, [rsp+40]
+ 0x48, 0x8b, 0x84, 0x24, 40, 0, 0, 0,
+ // mov rdx, rax
+ 0x48, 0x89, 0xc2,
+ // mov rax, [rsp+48]
+ 0x48, 0x8b, 0x84, 0x24, 48, 0, 0, 0,
+ // mov r10, rax
+ 0x49, 0x89, 0xc2,
+ // mov rax, [rsp+56]
+ 0x48, 0x8b, 0x84, 0x24, 56, 0, 0, 0,
+ // mov r8, rax
+ 0x49, 0x89, 0xc0,
+ // mov rax, [rsp+64]
+ 0x48, 0x8b, 0x84, 0x24, 64, 0, 0, 0,
+ // mov r9, rax
+ 0x49, 0x89, 0xc1,
+ // mov rax, [rsp+16]
+ 0x48, 0x8b, 0x84, 0x24, 16, 0, 0, 0,
+ // syscall
+ 0x0f, 0x05,
+ // mov [rsp+8], rax
+ 0x48, 0x89, 0x84, 0x24, 8, 0, 0, 0,
+ // ret
+ 0xc3
+};
+
+#define __syscall(no, arg1, arg2, arg3, arg4, arg5, arg6)\
+ (((unsigned long (*)(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long))__syscall_data)\
+ (no, arg1, arg2, arg3, arg4, arg5, arg6))
+```
+
+The `__syscall_data` array contains machine language instructions which perform a system call, and the
+`__syscall` macro "calls" the array as if it were a function. This is why we need a read-write-executable data
+segment -- otherwise we'd need to implement system calls in the compiler.
+
+## C standard library
+
+The C89 standard specifies a bunch of "standard library" functions which any implementation has to make available, e.g.
+`printf()`, `atoi()`, `exit()`.
+Fortunately, we don't have to write these functions in the 04 language; we can write them in C.
+
+To use a particular function, a C program needs to include the appropriate header file, e.g.
+`#include <stdio.h>` lets you use `printf()` and other I/O-related functions. Normally,
+these header files just declare what types the parameters to the functions should be,
+but we actually put the function implementations there.
+
+Let's take a look at the contents of `ctype.h`, which provides the functions `islower`, `isupper`, etc.:
+```
+#ifndef _CTYPE_H
+#define _CTYPE_H
+
+#include <stdc_common.h>
+
+int islower(int c) {
+ return c >= 'a' && c <= 'z';
+}
+
+int isupper(int c) {
+ return c >= 'A' && c <= 'Z';
+}
+
+int isalpha(int c) {
+ return isupper(c) || islower(c);
+}
+
+int isalnum(int c) {
+ return isalpha(c) || isdigit(c);
+}
+
+...
+
+#endif
+```
+The first two lines and last line prevent problems when the file is included multiple times.
+We begin by including `stdc_common.h`, which has a bunch of functions and type definitions which all
+our header files use, and then we define each of the necessary C standard library functions.
+
+
## limitations
+There are various minor ways in which this compiler doesn't actually handle all of C89.
+Here is a list of things we do wrong (this list is probably missing things, though):
+
+- [trigraphs](https://en.wikipedia.org/wiki/Digraphs_and_trigraphs#C) are not handled
+- `char[]` string literal initializers can't contain null characters (e.g. `char x[] = "a\0b";` doesn't work)
+- you can only access members of l-values (e.g. `int x = function_which_returns_struct().member` doesn't work)
+- no default-int (this is a legacy feature of C, e.g. `main() { }` can technically stand in for `int main() {}`)
+- the keyword `auto` is not handled (again, a legacy feature of C)
+- `default:` must be the last label in a switch statement.
+- external variable declarations are ignored (e.g. `extern int x; int main() { return x; } int x = 5; ` doesn't work)
+- `typedef`s, and `struct`/`union`/`enum` declarations aren't allowed inside functions
+- conditional expressions aren't allowed inside `case` (horribly, `switch (x) { case 5 ? 6 : 3: ; }` is legal C).
+- bit-fields aren't handled
+- Technically, `1[array]` is equivalent to `array[1]`, but we don't handle that.
+- C89 has *very* weird typing rules about `void*`/`non-void*` inside conditional expressions. We don't handle that properly.
+- C89 allows calling functions without declaring them, for legacy reasons. We don't handle that.
+- Floating-point constant expressions are very limited. Only `double` literals and 0 are supported (it was hard enough
+to parse floating-point literals in a language without floating-point variables!)
+- Floating-point literals can't have their integer part greater than 2<sup>64</sup>-1.
+- Redefining a macro is always an error, even if it's the same definition.
+- You can't have a variable/function/etc. called `defined`.
+- Various little things about when macros are evaluated in some contexts.
+setjmp.h:// @NONSTANDARD: we don't actually support setjmp
+stddef.h:// @NONSTANDARD: we don't have wchar_t
+stdlib.h:// @NONSTANDARD: we don't define MB_CUR_MAX or any of the mbtowc functions
+time.h:// @NONSTANDARD(except in UTC+0): we don't support local time in timezones other than UTC+0.
+time.h: // @NONSTANDARD-ish.
+
+
+Also, the keywords `signed`, `volatile`, `register`, and `const` are all ignored. This shouldn't have an effect
+on any legal C program, though.
+
## modifications of tcc's source code