# stage 03
The code for this compiler (in the file `in02`, an input for our [stage 02 compiler](../02/README.md))
is 2700 lines—quite a bit longer than the previous ones.
To compile it, run `../02/out01` from this directory.
Let's take a look at `in03`, the example program I've written for it:
```
B=:hello_world
call :puts
; exit code 0
J=d0
syscall x3c

:hello_world
str Hello, world!
xa
x0

; output null-terminated string in rbx
:puts
	R=B
	call :strlen
	D=A
	I=R
	J=d1
	syscall d1
	return

; calculate length of string in rbx
:strlen
	; keep pointer to start of string
	D=B
	I=B
	:strlen_loop
	C=1I
	?C=0:strlen_loop_end
	I+=d1
	!:strlen_loop
	:strlen_loop_end
	I-=D
	A=I	
	return
```
This language looks a lot nicer than the previous one. No more obscure two-letter label names
and commands! Furthermore, try changing `:strlen_loop` on line 31
to a typo like `:strlen_lop`. You should get:
```
Bad label 001f
```
Not only do we get an error message, we also get the line number
of the error! It's in hexadecimal, unfortunately, but that's
better than nothing.

I spent a while on this compiler (perhaps I went a bit overboard
on the features), because the 02 language
was the first that was actually pleasant to use!
It's much less sophisticated than even most assembly languages,
but being able to use labels without having to worry about filling
in the offsets later made it way nicer to use than the previous
languages.

In addition to `in03`, this directory also has `ex03`,
which gives examples of all of the instructions supported by this compiler.

Seeing as this is a relatively large compiler,
here's an overview of how it works:

## functions

Thanks to labels, we can actually use functions in this compiler, without
it being a complete nightmare. Functions are called like this:
```
im
--fu
cl    (this would call the function ::fu)
```
and at the end of each function, we get `re`, which returns from the function.
I've used the convention of storing return values in `rax` and
passing the argument to a unary function in `rbx`.

This compiler ended up having a lot of functions, some of them used in all sorts
of different places.

## execution

Just as with the 02 compiler, we need two passes:
the first one
computes the address of each label,
and the second one uses the correct addresses to
write the executable.
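
Here is a rough Python model of the two-pass idea (the real compiler is written in the 02 language and emits x86-64 machine code). The names `run_pass`, `assemble` and `JUMP_SIZE` are made up for this sketch, and it pretends that every jump takes a fixed 5 bytes and that only labels and jumps exist.

```python
# Simplified model: each line is either ":label" or "!:label" (a jump).
JUMP_SIZE = 5   # assumed fixed size, chosen only for illustration

def run_pass(lines, labels, second_pass):
    """Walk the source once and return a list of (address, target) jumps."""
    address = 0
    jumps = []
    for line in lines:
        if line.startswith(":"):
            if not second_pass:
                labels[line] = address      # pass 1: record where the label is
        elif line.startswith("!"):
            # pass 1 may not have seen the label yet, so it just uses 0
            target = labels[line[1:]] if second_pass else 0
            jumps.append((address, target))
            address += JUMP_SIZE
    return jumps

def assemble(lines):
    labels = {}
    run_pass(lines, labels, second_pass=False)   # compute label addresses
    return run_pass(lines, labels, second_pass=True)

program = ["!:end", ":loop", "!:loop", ":end"]
print(assemble(program))   # [(0, 10), (5, 5)]
```

The forward jump to `:end` is the reason one pass is not enough: when the first line is processed, the address of `:end` has not been seen yet.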

Each pass is a loop, which starts by incrementing
the line number (`::L#`). Then we read in a line
from the source file, `in03`. This is done one character
at a time, until a newline is reached. The line is stored
in the buffer `::LI`. In the remainder of the program we
(mostly) use the fact that the line is newline-terminated,
rather than keeping track of how long it is.
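
In Python, the line-reading loop might look roughly like this. `read_line` is a hypothetical name, the `bytearray` stands in for `::LI`, and how the real compiler detects the end of the file is an assumption on my part.

```python
import io

def read_line(source, buffer):
    """Read one byte at a time into buffer until a newline (or end of file).

    Returns False if nothing could be read, i.e. the input is exhausted.
    """
    buffer.clear()
    while True:
        byte = source.read(1)
        if not byte:
            break                # end of file (assumed handling)
        buffer += byte
        if byte == b"\n":
            break                # the newline stays in the buffer as terminator
    return len(buffer) > 0

source = io.BytesIO(b"B=:hello_world\ncall :puts\n")
line = bytearray()               # plays the role of ::LI
line_number = 0                  # plays the role of ::L#
while True:
    line_number += 1             # each iteration starts by bumping the line number
    if not read_line(source, line):
        break
    print(line_number, bytes(line))
```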

Once the line is read in, a bunch of tests are performed on it.
We start by looking at the first character: if it's a `;`,
the line is a comment; if it's a `!`, it's an unconditional jump; etc.
Failing that, we look at the text starting at the second character, to see if it's
`=`, `+=`, `-=`, etc. If it doesn't match any of them, we use
the `::s=` (string equals) function, which conveniently lets you
set the terminator. We check if the line is equal to `"syscall"`
up to a terminator of `' '` (space) to check if it's a syscall, for example.
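
The chain of tests can be sketched in Python like so. `string_equals` and `classify` are hypothetical names standing in for `::s=` and the dispatch code, only a few of the cases are covered, and the sketch assumes the line has no leading indentation.

```python
def string_equals(a, b, terminator):
    """Compare a and b character by character, stopping at the terminator.

    The strings are equal only if they agree on every character up to and
    including the first terminator.
    """
    i = 0
    while True:
        ca = a[i] if i < len(a) else None
        cb = b[i] if i < len(b) else None
        if ca != cb:
            return False
        if ca == terminator or ca is None:
            return True
        i += 1

def classify(line):
    """Return a rough category for one newline-terminated source line."""
    if line[:1] == ";":
        return "comment"
    if line[:1] == "!":
        return "jump"
    if line[1:2] == "=":
        return "assignment"
    if line[1:3] in ("+=", "-="):
        return "arithmetic"
    if string_equals(line, "syscall ", " "):
        return "syscall"
    return "unknown"

for text in ("; exit code 0\n", "!:strlen_loop\n", "D=A\n", "I+=d1\n", "syscall x3c\n"):
    print(repr(text), "->", classify(text))
```

Being able to choose the terminator is what makes one comparison function work both for keywords (stop at the space) and for labels (stop at the newline).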

## `+=`, et al.

We can emit the correct instructions for `D+=C` with:

- `mov rbx, rdx`
- `mov rax, rcx`
- `add rax, rbx`
- `mov rdx, rax`

A similar pattern can be used for `-=`, `&=`, etc.
This made it pretty easy to write the implementation of all of these:
there's one function for setting `rbx` to the first operand (`::B1`),
another for setting `rax` to the second operand (`::A2`), and another for
setting the first operand to `rax` (`::1A`). The implementations of
`+=`/`-=`/etc. just call those three functions, with a bit of stuff in between
to perform the corresponding operation.
A similar approach also works for loading/storing values in memory.
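
As a sketch of that structure in Python, producing assembly text rather than the machine code the real compiler writes: the three helper functions play the roles of `::B1`, `::A2` and `::1A`, but their Python names, the `REGISTERS` table and the `MIDDLE` table are all assumptions of mine. In particular, the two-instruction middle step for `-=` is just one way to get the operand order right; I'm not claiming it is what the compiler does.

```python
# Register letters inferred from the D+=C example above (an assumption).
REGISTERS = {"A": "rax", "B": "rbx", "C": "rcx", "D": "rdx"}

# The "bit of stuff in between": rbx holds the first operand, rax the second.
MIDDLE = {
    "+=": ["add rax, rbx"],
    "&=": ["and rax, rbx"],
    # subtraction is not commutative, so compute first - second in rbx
    "-=": ["sub rbx, rax", "mov rax, rbx"],
}

def rbx_from_first(first):       # role of ::B1
    return f"mov rbx, {REGISTERS[first]}"

def rax_from_second(second):     # role of ::A2
    return f"mov rax, {REGISTERS[second]}"

def first_from_rax(first):       # role of ::1A
    return f"mov {REGISTERS[first]}, rax"

def emit(line):
    """Turn a line such as 'D+=C' into a list of instructions."""
    first, op, second = line[0], line[1:3], line[3]
    return [rbx_from_first(first), rax_from_second(second),
            *MIDDLE[op], first_from_rax(first)]

print("\n".join(emit("D+=C")))
```

Only the middle step changes between operators, which is why adding a new one is cheap.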

## label list

Instead of a label table, we now have a "label list" (or array
if you prefer) at `::LB`.
A pointer to the current end of the list is stored at `::L$`.
Each entry is the name of the label, including the `:`, then a newline,
then the 4-byte address.
`::ll` is used to look up labels. If it's the first pass,
`::ll` just returns 0. Otherwise, it looks up the label by
comparing it to each entry using `::s=` with a terminator of `'\n'`.
If no label matches, we get an error.
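
The layout and lookup could be modelled in Python along these lines. `add_label` and `lookup_label` are hypothetical names, the `bytearray` stands in for `::LB` (with its length playing the part of `::L$`), and storing the address little-endian is an assumption based on this being x86-64.

```python
import struct

label_list = bytearray()         # plays the role of ::LB

def add_label(name, address):
    """Append the name (with its ':'), a newline, then a 4-byte address."""
    label_list.extend(name.encode() + b"\n" + struct.pack("<I", address))

def lookup_label(name, second_pass):
    """Return the label's address, or 0 during the first pass."""
    if not second_pass:
        return 0                 # addresses are not known yet
    key = name.encode()
    pos = 0
    while pos < len(label_list):
        end = label_list.index(b"\n", pos)     # end of this entry's name
        if label_list[pos:end] == key:
            (address,) = struct.unpack_from("<I", label_list, end + 1)
            return address
        pos = end + 5            # skip the newline and the 4 address bytes
    raise KeyError(f"Bad label {name}")

add_label(":puts", 0x400100)
add_label(":strlen", 0x40010a)
print(hex(lookup_label(":strlen", second_pass=True)))   # 0x40010a
```

Because each name ends in a newline and the address has a fixed size, walking the list needs no separate length field per entry.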

## alignment

A lot of data used in this program is
[not correctly aligned](https://en.wikipedia.org/wiki/Bus_error#Unaligned_access)—e.g.
8-byte values are not always stored at an address that is a multiple of 8.
This would be a problem on some processors, but x86-64 can handle it.
It's still not a good idea in practice: reading unaligned memory
can be slower. But we're not really concerned about performance here,
and it would be a bit finicky to align everything correctly.
However, I have introduced `align` into this language,
which you can put before a label to ensure that its address is aligned
to 8 bytes.
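
Aligning to 8 bytes just means rounding an address up to the next multiple of 8. A minimal Python sketch of that arithmetic, where `align_up` is a hypothetical helper and says nothing about how the compiler actually implements `align`:

```python
def align_up(address, alignment=8):
    """Round address up to the next multiple of alignment (a power of two)."""
    return (address + alignment - 1) & ~(alignment - 1)

for address in (0x400078, 0x400079, 0x40007f):
    print(hex(address), "->", hex(align_up(address)))
```

An address that is already a multiple of 8 is left alone; anything else moves forward by at most 7 bytes.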

## errors

Errors are handled in functions beginning with `!`, e.g. `::!n` for "bad number".
Each of these ends up calling `::er`. `::er` prints
a string specific to the type of error, then
converts the line number to a string, and prints it.
The line number is always converted to a 4-digit hexadecimal number.
This means it won't fully work past 65,535 lines, but
let's hope we don't need to write any programs that long!
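
To make the 4-digit format concrete, here is a small Python sketch. `format_error` is a hypothetical name, and dropping the higher bits past 65,535 is simply what this sketch does, not necessarily how the real routine misbehaves.

```python
def format_error(message, line_number):
    """Return the message followed by the line number as 4 hex digits."""
    digits = "0123456789abcdef"
    text = ""
    for shift in (12, 8, 4, 0):              # most significant digit first
        text += digits[(line_number >> shift) & 0xF]
    return f"{message} {text}"

print(format_error("Bad label", 31))         # Bad label 001f
print(format_error("Bad label", 65536))      # Bad label 0000
print(int("001f", 16))                       # 31
```

The last line goes the other way: it turns the hexadecimal number from an error message back into a decimal line number.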

## limitations

Functions in this 03 language will probably overwrite the previous values
of registers. This can make it kind of annoying to call functions, since
you need to make sure you store away any information you'll need after the function.
And the language definitely won't be as nice to use as something with real variables. But overall,
I'm very happy with this compiler, especially considering it's written in a language with 2-letter label
names.
With that, let's move on to the [next stage](../04a/README.md).