Diffstat (limited to '03')
-rw-r--r-- | 03/Makefile | 2
-rw-r--r-- | 03/README.md | 168
-rw-r--r-- | 03/ex03 | 105
-rw-r--r-- | 03/in02 | 3
-rw-r--r-- | 03/in03 | 6
5 files changed, 250 insertions, 34 deletions
diff --git a/03/Makefile b/03/Makefile
index 3d71765..2a50640 100644
--- a/03/Makefile
+++ b/03/Makefile
@@ -1,4 +1,4 @@
-all: out02 out03
+all: out02 out03 README.html
 out02: in02 ../02/out01
 	../02/out01
 out03: out02 in03
diff --git a/03/README.md b/03/README.md
new file mode 100644
index 0000000..b4ab20b
--- /dev/null
+++ b/03/README.md
@@ -0,0 +1,168 @@
+# stage 03
+The code for this compiler (the file `in02`, an input for our [stage 02 compiler](../02/README.md))
+is 2700 lines, quite a bit larger than the previous ones. And as we'll see, it's a lot more powerful too.
+To compile it, run `../02/out01` from this directory.
+Let's take a look at `in03`, the example program I've written for it:
+```
+B=:hello_world
+call :puts
+; exit code 0
+J=d0
+syscall x3c
+
+:hello_world
+str Hello, world!
+xa
+x0
+
+; output null-terminated string in rbx
+:puts
+	R=B
+	call :strlen
+	D=A
+	I=R
+	J=d1
+	syscall d1
+	return
+
+; calculate length of string in rbx
+:strlen
+	; keep pointer to start of string
+	D=B
+	I=B
+	:strlen_loop
+	C=1I
+	?C=0:strlen_loop_end
+	I+=d1
+	!:strlen_loop
+	:strlen_loop_end
+	I-=D
+	A=I
+	return
+```
+This language looks a lot nicer than the previous one. No more obscure two-letter label names
+and commands! Furthermore, try changing `:strlen_loop` on line 31
+to a typo like `:strlen_lop`. You should get:
+```
+Bad label 001f
+```
+Not only do we get an error message, we also get the line number
+of the error! It's in hexadecimal, unfortunately, but that's
+better than nothing.
+
+I spent a while on this compiler (perhaps I went a bit overboard
+on the features), because the 02 language
+was the first that was actually pleasant to use!
+It's much less sophisticated than even most assembly languages,
+but being able to use labels without having to worry about filling
+in the offsets later made it way nicer to use than the previous
+languages.
+
+In addition to `in03`, this directory also has `ex03`,
+which gives examples of all of the instructions supported by this compiler.
+
+Seeing as this is a relatively large compiler,
+here is an overview of how it works:
+
+## functions
+
+Thanks to labels, we can actually use functions in this compiler, without
+it being a complete nightmare. Functions are called like this:
+```
+im
+--fu
+cl (this would call the function ::fu)
+```
+and at the end of each function, we get `re`, which returns from the function.
+I've used the convention of storing return values in `rax` and
+passing the argument to a unary function in `rbx`.
+
+This compiler ended up having a lot of functions, some of them used in all sorts
+of different places.
+
+## execution
+
+Just as with the 02 compiler, we need two passes:
+the first one
+computes the address of each label,
+and the second one uses the correct addresses to
+write the executable.
+
+Each pass is a loop, which starts by incrementing
+the line number (`::L#`). Then we read in a line
+from the source file, `in03`. This is done one character
+at a time, until a newline is reached. The line is stored
+in the buffer `::LI`. In the remainder of the program we
+(mostly) use the fact that the line is newline-terminated,
+rather than keeping track of how long it is.
+
+Once the line is read in, a bunch of tests are performed on it.
+We start by looking at the first character: if it's a `;`,
+the line is a comment; if it's a `!`, it's an unconditional jump; etc.
+Failing that, we look at the second character, to see if it's
+`=`, `+=`, `-=`, etc. If it doesn't match any of them, we use
+the `::s=` (string equals) function, which conveniently lets you
+set the terminator. We check if the line is equal to `"syscall"`
+up to a terminator of `' '` to check if it's a syscall, for example.
+## `+=`, et al.
+
+We can emit the correct instruction for `D+=C` with:
+
+- `mov rbx, rdx`
+- `mov rax, rcx`
+- `add rax, rbx`
+- `mov rdx, rax`
+
+A similar pattern can be used for `-=`, `&=`, etc.
+This made it pretty easy to write the implementation of all of these:
+there's one function for setting `rbx` to the first operand (`::B1`),
+another for setting `rax` to the second operand (`::A2`), and another for
+setting the first operand to `rax` (`::1A`). The implementations of
+`+=`/`-=`/etc. just call those three functions, with a bit of stuff in between
+to perform the corresponding operation.
+A similar approach also works for loading/storing values in memory.
+
+## label list
+
+Instead of a label table, we now have a "label list" (or array
+if you prefer) at `::LB`.
+A pointer to the current end of the list is stored at `::L$`.
+Each entry is the name of the label, including the `:`, then a newline,
+then the 4-byte address.
+`::ll` is used to look up labels. If it's the first pass,
+`::ll` just returns 0. Otherwise, it looks up the label by
+comparing it to each entry using `::s=` with a terminator of `'\n'`.
+If no label matches, we get an error.
+
+## alignment
+A lot of data used in this program is
+[not correctly aligned](https://en.wikipedia.org/wiki/Bus_error#Unaligned_access): e.g.
+8-byte values are not always stored at an address that is a multiple of 8.
+This would be a problem on some processors, but x86-64 can handle it.
+It's still not a good idea in practice: reading unaligned memory
+is much slower. But we're not really concerned about performance here,
+and it would be a bit finicky to align everything correctly.
+However, I have introduced `align` into this language,
+which you can put before a label to ensure that its address is aligned
+to 8 bytes.
+
+## errors
+
+Errors are handled in functions beginning with `!`, e.g. `::!n` for "bad number".
+Each of these ends up calling `::er`. `::er` prints
+a string specific to the type of error, then
+converts the line number to a string, and prints it.
+The line number is always converted to a 4-digit hexadecimal number.
+This means it won't fully work past 65,535 lines, but
+let's hope we don't need to write any programs that long!
+
+## limitations
+
+Functions in this 03 language will probably overwrite the previous values
+of registers. This can make it kind of annoying to call functions, since
+you need to make sure you store away any information you'll need after the function.
+And the language definitely won't be as nice to use as something with real variables. But overall,
+I'm very happy with this compiler, considering it's written in a language with 2-letter label
+names.
+
diff --git a/03/ex03 b/03/ex03
--- a/03/ex03
+++ b/03/ex03
@@ -1,42 +1,87 @@
+; You can use registers like variables: rax = A, rbx = B, rcx = C, rdx = D, rsi = I, rdi = J, rsp = S, rbp = R
+; However, because of the way things are implemented, you should be careful about using A/B as variables:
+; they sometimes might not work correctly, and will be overwritten by a lot of statements
+
+; set register to...
+; decimal
+D=d123
+; hexadecimal
+D=x1ef
+; another register
+D=R we can have a comment here and in some other places. not after numbers or labels though.
+; label address
+D=:label
+; add
 D+=d4
+D+=R
+; subtract
+D-=d123
+D-=R
+; left/right shift (only rcx is supported for variable shifts)
+D<=C
+D<=d33
+D>=C
+D>=x12
+; arithmetic right shift
 D]=d7
 D]=C
-D^=C
-D|=C
-D&=C
-~C
-B|=A
-8D=C
-A=1B
-B>=d33
-call :funciton
+; bitwise xor, or, and
+D^=R
+D|=R
+D&=R
+D^=d1
+D|=d1
+D&=d1
+; bitwise not
+; (this sets D to ~D)
+~D
+; dereference
+; set 8 bytes at rdx to rbp
+8D=R
+; set 4 bytes at rdx to ebp
+4D=R
+2D=R
+1D=R
+; set rcx/ecx/cx/cl to 8/4/2/1 bytes at rdx
+C=8D
+C=4D
+C=2D
+C=1D
+; call a function
+call :function
+; return
+return
+; label declarations
+;:function
+;:label
+; literal byte
 x4b
+'H
+'i
+; string
+str This text will appear in the executable!
+; unconditional jump
 !:label
-?J<B:label
-:label
-1B=C
-; :l ba b
-J=d0
-A=d60
+; conditional jump
+?R<S:label
+?R=S:label
+?R!S:label
+?R>S:label
+; (unsigned comparisons above/below)
+?RaS:label
+?RbS:label
+; syscall
 syscall x3c
+; align to 8 bytes
 align
-:label
+; reserve some number of bytes of memory
 reserve d1000
-B+=J
-B<=d9
-B-=J
-?J=B:label
-?A!B:label
-?A>B:label
-A=:label
-x3c
-return
+; signed/unsigned multiply/divide
 imul
 idiv
 mul
 div
-:funciton
-call A
-str Here is some text which will be put in the executable!
-?CaD:label
-
+; e.g. to compute 5*3 into rcx (note rdx is wiped in the process):
+A=d5
+B=d3
+mul
diff --git a/03/in02 b/03/in02
--- a/03/in02
+++ b/03/in02
@@ -2886,6 +2886,9 @@ jm
 ~~
 ::LI line buffer
 ~~
+~~
+~~
+~~
 ::L$ end of current label list
 --LB
 ::LB labels
diff --git a/03/in03 b/03/in03
--- a/03/in03
+++ b/03/in03
@@ -1,6 +1,6 @@
-; write to stdout
 B=:hello_world
 call :puts
+; exit code 0
 J=d0
 syscall x3c
 
@@ -11,15 +11,15 @@ x0
 
 ; output null-terminated string in rbx
 :puts
+	R=B
 	call :strlen
-	I=D
 	D=A
+	I=R
 	J=d1
 	syscall d1
 	return
 
 ; calculate length of string in rbx
-; keeps pointer to start of string in rdx, end of string in rsi
 :strlen
 	; keep pointer to start of string
 	D=B
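
Reviewer note: the README's "execution", "label list" and "errors" sections describe a two-pass scheme. The first pass records label addresses, and the second pass resolves them and reports unknown labels with a 4-digit hexadecimal line number. The sketch below models that flow in Python. It is not the 02-language implementation: the names `assemble` and `lookup_label`, the fixed `instruction_size`, and the use of an exception in place of printing and exiting are all assumptions made for illustration. Only labels, comments and `!:label` jumps are modelled.

```python
def lookup_label(labels, name, line_no, first_pass):
    """Rough model of ::ll: return 0 on the first pass, else the label's address."""
    if first_pass:
        return 0  # addresses are not known yet, so any value will do
    for entry_name, address in labels:  # linear scan over the label list
        if entry_name == name:
            return address
    # The README says line numbers are printed as 4-digit hexadecimal.
    # Raising here is an assumption; the real compiler prints and exits.
    raise ValueError("Bad label %04x" % line_no)


def assemble(lines, instruction_size=4):
    """Run two passes and return (address, target) for every jump.

    instruction_size is a hypothetical fixed size per statement; the real
    compiler emits x86-64 instructions of varying length.
    """
    labels = []
    jumps = []
    for first_pass in (True, False):
        address = 0
        jumps = []
        for line_no, line in enumerate(lines, start=1):
            line = line.strip()
            if not line or line.startswith(";"):
                continue  # blank line or comment
            if line.startswith(":"):
                if first_pass:
                    labels.append((line, address))  # the name includes the ':'
                continue
            if line.startswith("!"):
                target = lookup_label(labels, line[1:], line_no, first_pass)
                jumps.append((address, target))
            address += instruction_size
    return jumps
```

With this model, a misspelled label on line 31 of a program raises `Bad label 001f`, matching the message quoted in the README.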