Diffstat (limited to '00')
-rw-r--r-- 00/README.md | 84
1 files changed, 43 insertions, 41 deletions
diff --git a/00/README.md b/00/README.md
index a8d4c38..b30adb8 100644
--- a/00/README.md
+++ b/00/README.md
@@ -5,13 +5,14 @@ takes input file `in00` containing space/newline/(any character)-separated
hexadecimal digit pairs (e.g. `3f`) and outputs them as bytes to the file
`out00`. On 64-bit Linux, try running `./hexcompile` from this directory (I've
already provided an `in00` file, which you can take a look at), and you will get
-a file named `out00` containing the text `Hello, world!`. This stage is needed
-so that you can use your favorite text editor to write executables by hand
-(which have bytes outside of ASCII/UTF-8). I wrote it with a program called
-hexedit, which can be found on most Linux distributions. Only 64-bit Linux is
-supported, because each OS/architecture combination would need its own separate
-executable. The executable is just 632 bytes long, and you could definitely make
-it even smaller if you wanted to. Let's take a look at what's inside (`od -t x1
+a file named `out00` containing the text `Hello, world!`. This stage
+lets you use your favorite text editor to write executables
+(which have bytes outside of ASCII/UTF-8).
+I made `hexcompile` with a program called
+[hexedit](https://github.com/pixel/hexedit),
+which can be found in most Linux package managers.
+The executable is just 632 bytes long.
+Let's take a look at what's inside (`od -t x1
-An -v hexcompile`):
```
@@ -74,12 +75,12 @@ version of ELF)
- `02 00` Object type = executable file (not a dynamic library/etc.)
- `3e 00` Architecture x86-64
- `01 00 00 00` Version 1 of ELF, again
-- `78 00 40 00 00 00 00 00` **Entry point of the executable** = 0x400078 (explained later)
-- `40 00 00 00 00 00 00 00` Program header table offset in bytes from start of file (see below)
+- `78 00 40 00 00 00 00 00` **Entry point of the executable** = 0x400078
+- `40 00 00 00 00 00 00 00` Program header table offset in bytes from start of file
- `00 00 00 00 00 00 00 00` Section header table offset (we're not using sections)
-- `00 00 00 00` Flags (not important)
+- `00 00 00 00` Flags (not important to us)
- `40 00` The size of this header, in bytes = 64
-- `38 00` Size of the program header (see below) = 56
+- `38 00` Size of the program header = 56
- `01 00` Number of program headers = 1
- `00 00` Size of each section header (unused)
- `00 00` Number of section headers (unused)
@@ -88,8 +89,8 @@ version of ELF)
You might notice that all the numbers are backwards, e.g. `38 00` for the number
0x0038 (56 decimal). This is because almost all modern architectures (including
x86-64) are little-endian, meaning that the *least significant byte* goes first,
-and the most significant byte goes last. There are various reasons why this is
-easier to deal with, but I won't explain that here.
+and the most significant byte goes last.
+There are reasons for this ([see here](https://en.wikipedia.org/wiki/Endianness#Optimization), for example, if you're interested).
## program header
The program header describes a segment of data that is loaded into memory when
@@ -108,11 +109,12 @@ Without further ado, here's the contents of the program header:
**wait a minute, what's that?**
-We just specified the *virtual address* of this segment. This is the virtual
-memory address that the segment will be loaded to. Virtual memory means that
-addresses in our program do not actually correspond to where the memory is
-physically stored in RAM, with the CPU translating between virtual and physical
-memory addresses. There are many reasons for this: making sure each process has
+This is the virtual
+memory address that the segment will be loaded to.
+Nowadays, computers use virtual memory, meaning that
+addresses in our program don't actually correspond to where the memory is
+physically stored in RAM (the CPU translates between virtual and physical
+memory addresses). There are many reasons for this: making sure each process has
its own memory space, memory protection, etc. You can read more about it
elsewhere.
@@ -129,11 +131,10 @@ that is a multiple of 4096. Our program needs to be loaded into a memory page,
so its *virtual address* needs to be a multiple of 4096. We're using `0x400000`.
But wait! Didn't we use `0x400078` for the virtual address? Well, yes but that's
because the *data in the file* is loaded to address `0x400078`. The actual page
-of memory that the OS will allocate for our code will start at `0x400000`. The
-reason we need to start `0x78` bytes in is that Linux expects the data *in the
-file* to be at the same position in the page as when it will be loaded, and it
-appears at offset `0x78` in our file. But don't worry if you don't understand
-that.
+of memory that the OS will allocate for our segment will start at `0x400000`. The
+reason we need to start `0x78` bytes in is that Linux expects the data in the
+file to be at the same position in the page as when it will be loaded, and it
+appears at offset `0x78` in our file.
## the code
@@ -163,16 +164,17 @@ about it with `man 2 open`.
The first argument, `0x40026d`, is a pointer to some data at the very end of
this segment (see further down). Specifically, it holds the bytes
`69 6e 30 30 00`, the null-terminated ASCII string `"in00"`.
-This indicates the name of the file. The second argument (`O_RDONLY`, or 0)
-specifies that we will be reading from this file. There is a third argument to
+This indicates the name of the file. The second argument, `0`,
+specifies that we will (only) be reading from this file. There is a third argument to
this syscall (we'll get to it later), but it's not applicable here so we don't
set it.
-This call gives us back a *file descriptor*, which can be used to read from the
+This call gives us back a *file descriptor*, a number which we can use to read from the
file, in register `rax`. But we don't actually need to look at what file
descriptor Linux gave us. This is because Linux assigns file descriptor numbers
-sequentially, starting from `0` for standard input, `1` for standard output, `2`
-for standard error, and then `3, 4, 5, ...` for any files our program opens. So
+sequentially, starting from
+[0 for stdin, 1 for stdout, 2 for stderr](https://en.wikipedia.org/wiki/Standard_streams),
+and then 3, 4, 5, ... for any files our program opens. So
this file, the first one our program opens, will have descriptor `3`.
Now we open our output file:
@@ -194,8 +196,8 @@ file (`O_WRONLY = 0x01`), that we want to create it if it doesn't exist
(`O_TRUNC = 0x200`). Secondly, we are setting the third argument this time. It
specifies the permissions our file is created with (`0o755` means user
read/write/execute, group/other read/execute). This is not very important to
-the actual execution of the program, so don't worry if you don't know what it
-means.
+the actual execution of the program, so don't worry if you don't know
+about UNIX permissions.
Now we can start reading from the file. We're going to loop back to this part of
the code every time we want to read a new hexadecimal number from the input
@@ -210,7 +212,7 @@ file.
- `0f 05` `syscall`
In C, this is `read(3, 0x40026a, 3)`. Here we call syscall #0, `read`, with
-arguments:
+three arguments:
- `fd = 3` This is the descriptor number of our input file.
- `buf = 0x40026a` This is the memory address we want Linux to output the data
@@ -221,7 +223,6 @@ We're telling Linux to output to `0x40026a`, which is just a part of this
segment (see further down). Normally you would read to a different segment of
the program from where the code is, but we want this to be as simple as
possible.
-
The number of bytes *actually read*, taking into account that we might have
reached the end of the file, is stored in `rax`.
@@ -231,8 +232,8 @@ reached the end of the file, is stored in `rax`.
- `0f 8f 50 01 00 00` `jg 0x400250`
This tells the CPU to jump to a later part of the code (address `0x400250`) if 3
-is greater than the number of bytes read in (in other words, if we reached the
-end of the file). Note that we don't specifiy the *address* to jump to, but
+is greater than the number of bytes we got, in other words, if we reached the
+end of the file. Note that we don't specify the *address* to jump to, but
instead the *relative address*, relative to the first byte after the jump
instruction (so here we're saying to jump `0x150` bytes forward). There are
reasons for this which I won't get into here.
@@ -299,7 +300,7 @@ the one above:
- `48 89 fb` `mov rbx, rdi`
- `48 09 d8` `or rax, rbx`
-Okay, now we have the byte specified by the two hex digits we read in `rax`.
+Okay, now `rax` contains the byte specified by the two hex digits we read.
- `48 89 c3` `mov rbx, rax`
- `48 b8 6c 02 40 00 00 00 00 00` `mov rax, 0x40026c`
@@ -343,8 +344,8 @@ These bytes aren't actually used by our program, and could be set to anything.
These are here because I wasn't sure how long the program would be when I
started, so I just set the segment size to 512 bytes, which turned out to be
more than enough. I could have cut these out and edited all the addresses to get
-a smaller, cleaner executable, but I'm leaving them in because that's what you
-probably would do if you were doing this for non-instructional purposes.
+a smaller executable, but really there's no point—modern
+computers can definitely handle 600-byte files.
- `31 c0` `mov rax, 0`
- `48 89 c7` `mov rdi, rax`
@@ -355,7 +356,7 @@ This is where we conditionally jumped to way back when we determined if we
reached the end of the file. This calls syscall #60, `exit`, with one argument,
0 (exit code 0, indicating we exited successfully).
-You'd normally close the files first (with syscall #3), to tell Linux you're
+Normally, you should close file descriptors (with syscall #3), to tell Linux you're
done with them, but we don't need to. It'll automatically close all our open
file descriptors when our program exits.
@@ -373,8 +374,8 @@ editor and get them translated into a binary file.
There are many ways in which this is a bad program. It will *only* properly
handle lowercase hexadecimal digit pairs, separated by exactly one character,
-with a terminating character. What's worse, a bad input file (maybe you
-accidentally write `3F` instead of `3f`) won't print out a nice error message,
+with a terminating character. What's worse, a bad input file (maybe someone
+accidentally writes `3F` instead of `3f`) won't print out a nice error message,
but instead continue processing as usual, without any indication that anything's
gone wrong, giving you an unexpected result.
Also, we only read in data *three bytes at a time*, and output one byte at a
@@ -385,4 +386,5 @@ a while.
But these problems aren't really a big deal. We'll only be running this on
little programs and we'll be sure to check that our input is in the right
-format. And with that, we are ready to move on to the next stage...
+format. And with that, we are ready to move on to the
+[next stage...](../01/README.md).
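
For readers of this diff who want a feel for the loop the README walks through (read three bytes, turn two lowercase hex digits into one byte, write it, stop on a short read), here is a rough Python model. It is only a sketch, not the program itself: the name `hexcompile`, the `HEX_DIGITS` table, and the use of in-memory streams are assumptions made for illustration. It also deliberately differs from the real executable in one way: it raises `ValueError` on a non-lowercase-hex character, whereas the README says the real program continues silently.

```python
import io

# Lowercase only, matching the README's note that `3F` is not handled.
HEX_DIGITS = "0123456789abcdef"


def hexcompile(src, dst):
    """Hypothetical model of the read/convert/write loop described in the README."""
    while True:
        chunk = src.read(3)  # two hex digits plus one separator character
        if len(chunk) < 3:   # short read: same condition as the `jg` to the exit code
            break
        high = HEX_DIGITS.index(chr(chunk[0]))  # raises ValueError on bad input
        low = HEX_DIGITS.index(chr(chunk[1]))
        dst.write(bytes([(high << 4) | low]))   # combine the two nibbles into one byte


if __name__ == "__main__":
    source = io.BytesIO(b"48 65 6c 6c 6f\n")
    output = io.BytesIO()
    hexcompile(source, output)
    print(output.getvalue())
```

Because the loop stops as soon as fewer than three bytes are available, a final pair without a trailing character is dropped, which is the "terminating character" requirement the README mentions.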