From 7bb8ab02f70c0a436a00e29275ab87b5bb56d584 Mon Sep 17 00:00:00 2001
From: pommicket
Date: Sun, 14 Nov 2021 00:33:40 -0500
Subject: 03 README

---
 01/README.md |   2 +-
 02/README.md |   6 +--
 03/Makefile  |   2 +-
 03/README.md | 168 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 03/ex03      | 105 ++++++++++++++++++++++++++-----------
 03/in02      |   3 ++
 03/in03      |   6 +--
 README.md    |   9 ++--
 8 files changed, 259 insertions(+), 42 deletions(-)
 create mode 100644 03/README.md

diff --git a/01/README.md b/01/README.md
index a67d28b..01a4db0 100644
--- a/01/README.md
+++ b/01/README.md
@@ -1,7 +1,7 @@
 # stage 01
 
 The code for the compiler for this stage is in the file `in00`. And yes, that's
-an input to our previous program, `hexcompile`, from stage 00! To compile it,
+an input to our [previous program](../00/README.html), `hexcompile`, from stage 00! To compile it,
 run `../00/hexcompile` from this directory. You will get a file, `out00`. That is
 the executable for this stage's compiler. Run it (it'll read from the file `in01`
 I've provided) and you'll get a file `out01`. That executable will print
diff --git a/02/README.md b/02/README.md
index d09c7ca..1083ec9 100644
--- a/02/README.md
+++ b/02/README.md
@@ -1,6 +1,6 @@
 # stage 02
 
-The compiler for this stage is in the file `in01`, an input for our previous compiler.
+The compiler for this stage is in the file `in01`, an input for our [previous compiler](../01/README.md).
 So if you run `../01/out00`, you'll get the file `out01`, which is this stage's compiler.
 
 The specifics of how this compiler works are in the comments in `in01`, but here I'll
@@ -187,5 +187,5 @@ if you use a label without defining it, it uses address 0, rather than outputtin
 an error message. This could be fixed: if the value in the label table is 0 and we are on
 the second pass, output an error message. Also, duplicate labels aren't detected.
 
-But thanks to labels, for future compilers at least we won't have to calculate
-any jump offsets manually.
+But thanks to labels, at least we won't have to calculate
+any jump offsets manually anymore. With that, let's move on to [stage 03](../03/README.md).
diff --git a/03/Makefile b/03/Makefile
index 3d71765..2a50640 100644
--- a/03/Makefile
+++ b/03/Makefile
@@ -1,4 +1,4 @@
-all: out02 out03
+all: out02 out03 README.html
 out02: in02 ../02/out01
 	../02/out01
 out03: out02 in03
diff --git a/03/README.md b/03/README.md
new file mode 100644
index 0000000..b4ab20b
--- /dev/null
+++ b/03/README.md
@@ -0,0 +1,168 @@
+# stage 03
+The code for this compiler (the file `in02`, an input for our [stage 02 compiler](../02/README.md))
+is 2700 lines—quite a bit larger than the previous ones. And as we'll see, it's a lot more powerful too.
+To compile it, run `../02/out01` from this directory.
+Let's take a look at `in03`, the example program I've written for it:
+```
+B=:hello_world
+call :puts
+; exit code 0
+J=d0
+syscall x3c
+
+:hello_world
+str Hello, world!
+xa
+x0
+
+; output null-terminated string in rbx
+:puts
+	R=B
+	call :strlen
+	D=A
+	I=R
+	J=d1
+	syscall d1
+	return
+
+; calculate length of string in rbx
+:strlen
+	; keep pointer to start of string
+	D=B
+	I=B
+	:strlen_loop
+		C=1I
+		?C=0:strlen_loop_end
+		I+=d1
+		!:strlen_loop
+	:strlen_loop_end
+	I-=D
+	A=I
+	return
+```
+This language looks a lot nicer than the previous one. No more obscure two-letter label names
+and commands! Furthermore, try changing `:strlen_loop` on line 31
+to a typo like `:strlen_lop`. You should get:
+```
+Bad label 001f
+```
+Not only do we get an error message, we also get the line number
+of the error! It's in hexadecimal, unfortunately, but that's
+better than nothing.
+
+I spent a while on this compiler (perhaps I went a bit overboard
+on the features), because the 02 language
+was the first that was actually pleasant to use!
+It's much less sophisticated than even most assembly languages,
+but being able to use labels without having to worry about filling
+in the offsets later made it way nicer to use than the previous
+languages.
+
+In addition to `in03`, this directory also has `ex03`,
+which gives examples of all of the instructions supported by this compiler.
+
+Seeing as this is a relatively large compiler,
+here is an overview of how it works:
+
+## functions
+
+Thanks to labels, we can actually use functions in this compiler, without
+it being a complete nightmare. Functions are called like this:
+```
+im
+--fu
+cl (this would call the function ::fu)
+```
+and at the end of each function, we get `re`, which returns from the function.
+I've used the convention of storing return values in `rax` and
+passing the argument to a unary function in `rbx`.
+
+This compiler ended up having a lot of functions, some of them used in all sorts
+of different places.
+
+## execution
+
+Just as with the 02 compiler, we need two passes:
+the first one
+computes the address of each label,
+and the second one uses the correct addresses to
+write the executable.
+
+Each pass is a loop, which starts by incrementing
+the line number (`::L#`). Then we read in a line
+from the source file, `in03`. This is done one character
+at a time, until a newline is reached. The line is stored
+in the buffer `::LI`. In the remainder of the program we
+(mostly) use the fact that the line is newline-terminated,
+rather than keeping track of how long it is.
+
+Once the line is read in, a bunch of tests are performed on it.
+We start by looking at the first character: if it's a `;`,
+the line is a comment; if it's a `!`, it's an unconditional jump; etc.
+Failing that, we look at the second character, to see if it's
+`=`, `+=`, `-=`, etc. If it doesn't match any of them, we use
+the `::s=` (string equals) function, which conveniently lets you
+set the terminator. We check if the line is equal to `"syscall"`
+up to a terminator of `' '` to check if it's a syscall, for example.
+
+## `+=`, et al.
+
+We can emit the correct instructions for `D+=C` with:
+
+- `mov rbx, rdx`
+- `mov rax, rcx`
+- `add rax, rbx`
+- `mov rdx, rax`
+
+A similar pattern can be used for `-=`, `&=`, etc.
+This made it pretty easy to write the implementation of all of these:
+there's one function for setting `rbx` to the first operand (`::B1`),
+another for setting `rax` to the second operand (`::A2`), and another for
+setting the first operand to `rax` (`::1A`). The implementations of
+`+=`/`-=`/etc. just call those three functions, with a bit of stuff in between
+to perform the corresponding operation.
+A similar approach also works for loading/storing values in memory.
+
+## label list
+
+Instead of a label table, we now have a "label list" (or array
+if you prefer) at `::LB`.
+A pointer to the current end of the list is stored at `::L$`.
+Each entry is the name of the label, including the `:`, then a newline,
+then the 4-byte address.
+`::ll` is used to look up labels. If it's the first pass,
+`::ll` just returns 0. Otherwise, it looks up the label by
+comparing it to each entry using `::s=` with a terminator of `'\n'`.
+If no label matches, we get an error.
+
+## alignment
+A lot of data used in this program is
+[not correctly aligned](https://en.wikipedia.org/wiki/Bus_error#Unaligned_access)—e.g.
+8-byte values are not always stored at an address that is a multiple of 8.
+This would be a problem on some processors, but x86-64 can handle it.
+It's still not a good idea in practice—reading unaligned memory
+is much slower. But we're not really concerned about performance here,
+and it would be a bit finicky to align everything correctly.
+However, I have introduced `align` into this language,
+which you can put before a label to ensure that its address is aligned
+to 8 bytes.
+
+## errors
+
+Errors are handled in functions beginning with `!`, e.g. `::!n` for "bad number".
+Each of these ends up calling `::er`. `::er` prints
+a string specific to the type of error, then
+converts the line number to a string, and prints it.
+The line number is always converted to a 4-digit hexadecimal number.
+This means it won't fully work past 65,535 lines, but
+let's hope we don't need to write any programs that long!
+
+## limitations
+
+Functions in this 03 language will probably overwrite the previous values
+of registers. This can make it kind of annoying to call functions, since
+you need to make sure you store away any information you'll need after the function.
+And the language definitely won't be as nice to use as something with real variables. But overall,
+I'm very happy with this compiler, considering it's written in a language with 2-letter label
+names.
+
diff --git a/03/ex03 b/03/ex03
index 510018e..0270bb9 100644
--- a/03/ex03
+++ b/03/ex03
@@ -1,42 +1,87 @@
+; You can use registers like variables: rax = A, rbx = B, rcx = C, rdx = D, rsi = I, rdi = J, rsp = S, rbp = R
+; However, because of the way things are implemented, you should be careful about using A/B as variables:
+; they sometimes might not work correctly, and will be overwritten by a lot of statements
+
+; set register to...
+; decimal
+D=d123
+; hexadecimal
+D=x1ef
+; another register
+D=R we can have a comment here and in some other places. not after numbers or labels though.
+; label address
+D=:label
+; add
 D+=d4
+D+=R
+; subtract
+D-=d123
+D-=R
+; left/right shift (only rcx is supported for variable shifts)
+D<=C
+D<=d33
+D>=C
+D>=x12
+; arithmetic right shift
 D]=d7
 D]=C
-D^=C
-D|=C
-D&=C
-~C
-B|=A
-8D=C
-A=1B
-B>=d33
-call :funciton
+; bitwise xor, or, and
+D^=R
+D|=R
+D&=R
+D^=d1
+D|=d1
+D&=d1
+; bitwise not
+; (this sets D to ~D)
+~D
+; dereference
+; set 8 bytes at rdx to rbp
+8D=R
+; set 4 bytes at rdx to ebp
+4D=R
+2D=R
+1D=R
+; set rcx/ecx/cx/cl to 8/4/2/1 bytes at rdx
+C=8D
+C=4D
+C=2D
+C=1D
+; call a function
+call :function
+; return
+return
+; label declarations
+;:function
+;:label
+; literal byte
 x4b
+'H
+'i
+; string
+str This text will appear in the executable!
+; unconditional jump
 !:label
-?JS:label
+; (unsigned comparisons above/below)
+?RaS:label
+?RbS:label
+; syscall
 syscall x3c
+; align to 8 bytes
 align
-:label
+; reserve some number of bytes of memory
 reserve d1000
-B+=J
-B<=d9
-B-=J
-?J=B:label
-?A!B:label
-?A>B:label
-A=:label
-x3c
-return
+; signed/unsigned multiply/divide
 imul
 idiv
 mul
 div
-:funciton
-call A
-str Here is some text which will be put in the executable!
-?CaD:label
-
+; e.g. to compute 5*3 into rcx (note rdx is wiped in the process):
+A=d5
+B=d3
+mul
diff --git a/03/in02 b/03/in02
index 879e17a..1632de1 100644
--- a/03/in02
+++ b/03/in02
@@ -2886,6 +2886,9 @@ jm
 ~~
 ::LI line buffer
 ~~
+~~
+~~
+~~
 ::L$ end of current label list
 --LB
 ::LB labels
diff --git a/03/in03 b/03/in03
index ef0640a..a8d8744 100644
--- a/03/in03
+++ b/03/in03
@@ -1,6 +1,6 @@
-; write to stdout
 B=:hello_world
 call :puts
+; exit code 0
 J=d0
 syscall x3c
 
@@ -11,15 +11,15 @@ x0
 
 ; output null-terminated string in rbx
 :puts
+	R=B
 	call :strlen
-	I=D
 	D=A
+	I=R
 	J=d1
 	syscall d1
 	return
 
 ; calculate length of string in rbx
-; keeps pointer to start of string in rdx, end of string in rsi
 :strlen
 	; keep pointer to start of string
 	D=B
diff --git a/README.md b/README.md
index 9a97c8a..231c09f 100644
--- a/README.md
+++ b/README.md
@@ -24,6 +24,7 @@ hexadecimal digit pairs to a binary file.
 - [stage 01](01/README.md) - a language with comments, and 2-character command codes.
 - [stage 02](02/README.md) - a language with labels
+- [stage 03](03/README.md) - a language with longer labels, better error messages, and less register manipulation
 - more coming soon (hopefully)
 
 ## prerequisite knowledge
 
@@ -93,10 +94,10 @@ compile GCC, say, and so all programs around today could be compromised. Of
 course, this is practically definitely not the case, but it's still an
 interesting experiment to try to create a fully trustable compiler. This
 project can't necessarily even do that though, because the Linux kernel, which
-we depend on, is compiled from C, so we can't fully trust *it*. To *truly*
-create a fully trustable compiler, you'd need to manually write to a USB with a
-circuit, create an operating system from nothing (without even a text editor),
-and then follow this series, or maybe you don't even trust your CPU...
+we depend on, is compiled from C, so we can't fully trust *it*. To
+create a *fully* trustable compiler, you'd need to manually write
+an operating system to a USB key with a circuit or something,
+assuming you trust your CPU...
 I'll leave that to someone else.
 
 ## license
-- 
cgit v1.2.3
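
Reviewer note on the "label list" and "errors" sections of the new `03/README.md`: the lookup described there can be sketched in a few lines of Python. This is only an illustration of the layout the README describes (label name including the `:`, a newline, then a 4-byte address), not the code in `in02`. The names `LabelList`, `add` and `lookup` are hypothetical, and the little-endian byte order is an assumption based on the x86-64 target, since the README does not state it.

```python
# Illustrative sketch only; the real compiler is written in the stage 02 language.
# LabelList, add and lookup are hypothetical names, not taken from in02.

class LabelList:
    def __init__(self):
        # Entries sit back to back: b":name" + b"\n" + 4-byte address.
        self.data = bytearray()

    def add(self, name: bytes, address: int) -> None:
        # Byte order is an assumption (little-endian, as on x86-64).
        self.data += name + b"\n" + address.to_bytes(4, "little")

    def lookup(self, name: bytes, first_pass: bool, line_number: int) -> int:
        if first_pass:
            # Addresses are not known yet on the first pass, so return 0.
            return 0
        pos = 0
        while pos < len(self.data):
            # The name is terminated by the first newline after pos.
            end = self.data.index(b"\n", pos)
            if self.data[pos:end] == name:
                return int.from_bytes(self.data[end + 1:end + 5], "little")
            pos = end + 5  # skip the newline and the 4-byte address
        # Errors report the line number as 4 hexadecimal digits.
        raise ValueError("Bad label %04x" % line_number)


labels = LabelList()
labels.add(b":puts", 0x400090)
labels.add(b":strlen", 0x4000B0)
print(hex(labels.lookup(b":strlen", False, 15)))
```

A linear scan like this is slow for large programs, but it matches the README's description and keeps the data structure trivial to build in a language with no real variables.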