On this page:
1.1.1 Picking the target language
1.1.2 Programming in x64
1.1.3 System Calls, and Flavours of x64
1.1.3.1 x64-linux
1.1.3.2 x64-macos
1.1.3.3 x64-windows
1.1.4 Conclusion

1.1 A Compiler Begins with a Language

To me, the study of compilers is the study of languages. As a mathematical object, a compiler is a transformation between languages, transforming a source language into a target language—but that is for another time. As a program, a compiler is a function that takes any program in a source language and produces some program that "behaves the same" in a target language.

This definition tells us we need three things before we can start designing our compiler: We need a source language, target language, and some definition of "behaviour" for programs in each. Each of these is a design choice. Often, we choose a pre-existing target language. Picking a target language is an interesting point in the design of any compiler. By contrast, the source language is often designed to achieve a balance between aesthetic and pragmatic goals.

In this course, we will pick a starting target language (x64), and systematically design many source languages. Each of our source languages will be designed to remove a restriction from a target language, or add high-level features to improve usability. We also take care to ensure the source languages can be compiled efficiently. For each of these languages, their behaviour is defiend by an interpreter. For x64, the behaviour is defined by whatever happens when we turn it into binary and run it on a CPU (The First Interpreter).

This last step, understanding the behaviour of programs in the language, is a key step. We will write programs that validate, analyze, and transform programs. We will write templates for programs, whole classes of programs. That is the nature of a compiler. So we must understand what a program means.

So we begin by picking and studying our choosen target language.

1.1.1 Picking the target language

The target language of a compiler is the language in which the compiler writes programs.

Our first language, and choosen target language, is x86-64 plus system calls, which we call x64 for short. In fact, we’ll only use a subset x64 of this language, which will expand it slightly throughout the course; x64 will refer to the current subset. I elaborate on this language and what system calls are shortly.

This is an unusual choice for a target language in a contemporary compiler. Contemporary compilers often choose higher-level languages, such as C, LLVM, or even JavaScript as target language.

There are good reasons to a choose higher-level language as a target. Often, a higher-level language provides more convenient abstractions, making generating programs simpler. The language may already have an efficient implementation, making writing optimizations less important. This frees the compiler writer to focus on other design goals of the new source language. Perhaps they want to focus on ruling out more errors statically, or providing convenient syntax for certain programming patterns they find common. These are common goals in source language and domain-specific language design.

However, choosing such languages as a target has a cost. These targets come with their own implementations, their own compilers and run-time systems. If we want to use them, we commit ourselves to using their toolchains as well. This may complicate our own development if it is difficult to setup those toolchains. Choosing a language also fixes an abstraction boundary—we cannot choose to work at a level of abstraction lower than our target language. This may complicate or entirely prevent us from implementing certain functionality or optimizations.

For this course, we choose x64 for two main reasons. First, for educational goals, we want to write transformations and optimizations at some of the lowest abstraction layers we can. Second, for pragmatic goals, it is significantly easier on the students (many) to setup the toolchain for x64 than a C, LLVM, or JavaScript toolchain, at a minor cost to the course staff (few) who must maintain some x64 support code. The second reason is a common rationale in compiler design—simplifying something for the many users at the expense of the few compiler writers is usually a good trade-off.

1.1.2 Programming in x64

Having chosen our target language, we must now understand the language—its behaviour, the abstractions it provides, the restrictions it requires us to work with, and even its syntax. As far as we are concerned, the target language was not designed—it was discovered

Take a computer architecture course if you want to learn how it was designed.

It is the language of a wild CPU found by geologists in cave somewhere, probably in the Amazon. It is primordial. It is not sensible to ask questions like "why"—x64 is. And, because we want to talk to the CPU, we must deal with it.

While we have choosen x64 as our target language, x64 itself does not run on most hardware. Instead, it is the source language yet another compiler, which transforms x64 to a binary machine code that I call bin64 for now. In this course, we use the nasm assembler to compile x64 into bin64. The compiled programs can run natively on most hardware, but you may need an interpreter depending on your machine.

Usually, interpreters for machine code are called "virtual machines".

To get started with x64, let’s consider the factorial function fact. In Racket, following the design recipe, we would simply write:

Examples:
; Natural -> Natural
; Takes a natural number n and returns the factorial of n.
; (fact 0) -> 1
; (fact 1) -> 1
; (fact 5) -> 120
; Follows the template for self-referential natural numbers.
> (define (fact n)
    (if (zero? n)
        1
        (* n (fact (sub1 n)))))
> (module+ test
    (require rackunit)
    (check-equal? (fact 0) 1)
    (check-equal? (fact 1) 1)
    (check-equal? (fact 5) 120))
> (fact 5)

120

We run this and the computer prints 120.

We can implement the equivalent computation in x64 as follows.

x64 actually supports at least two syntaxes. We use Intel assembler syntax, as it maps more closely to familiar imperative language syntax.

global start   ; Declare starting block for linker.

section .text                   ; Start of the "code"

;; Compute (fact 5)
start:
  mov r8, 5

;; Compute the factorial of the value in r8.
;; Expects  r8 : PositiveInt64
;; Produces r9 : PositiveInt64, the result of the factorial of r8.
;; Constraint: factorial of r8 must be less than 2^63-1.
fact:
  mov r9, 1                     ; Initialize accumulator to (fact 0).

;; Computes the factorial of r8 with accumulator r9
;; r9 is PositiveInt64; contains the result so far of computing the factorial of the input.
;;
;; Follows the template for self-referential natural numbers.
fact_acc:
  cmp r8, 0                     ; compare r8 and 0, i.e., (zero? r8)
  je fact_done                  ; jump if last comparison was equal
  imul r9, r8                   ; (set! r9 (* r9 r8))
  dec r8                        ; (set! r8 (sub1 r8))
  jmp fact_acc                  ; (fact r8)

fact_done:
;; Done. The result is in r9.

section .data ; The data section. This is where we declare initialized memory locations.

dummy: db 0
> nasm -f elf64 -o plain-fact-x64.o plain-fact-x64.s
> ld -e start -o plain-fact-x64.exe plain-fact-x64.o
> ./plain-fact-x64.exe

In this example, we compute the result, which is stored in r9. Intuitively, the meaning of this program is a transformation over a register machine. It expects a machine with at least two registers, r8 and r9, whose initial values are arbitrary. It begins executing the first instruction, then executes the next, and so on. After running the machine is left with in a state where r9 contains the result. Unfortunately, that’s not what happens if we try to run it.

Instead, after reaching fact_done, the machine happily tries to execute the next instruction in memory. There is no next instruction, so it executes whatever happens to be in that memory location. The program hopefully crashes with a SEGFAULT or SIGBUS error, but this depends on the operating system (OS).

This fact that the behaviour depends on the OS is our first clue that this example is not implemented in just x86-64, the language of the CPU. The meaning of the program, and thus the actual language in which we were programming, depends critically on the OS.

Each OS makes different assumptions about and enforce different restrictions on x64 programs. Most operating systems require that you explicitly "exit" the process, by calling a special, OS-specific "exit" procedure at the end of the process. Each OS defines a different set of these procedures, which we can think of as the "standard library" for x64. OS, and exactly where the OS expects the final answer to be differs. macOS imposes a restrictions: the .data section, which is used to declare initialized memory locations, cannot be empty, and the linker rejects the program if it is. In the above example, we don’t actually use the heap, so we declare a dummy value to ensure compatibility with macOS. This example is compatible with macOS, Linux, and Windows, as long as you specify "start" as the entry label, and don’t mind exiting with a bang.

Because of this critical dependence on the OS, the target language in this course cannot really be "plain" x64x64 is a family of programming languages, indexed by an operating system. In this class, we never program the raw CPU—we program the operating system. The CPU together with the operating system implements a different programming language than the CPU by itself. From a programming languages perspective, the operating system (e.g., Linux) is the run-time system for the OS-flavoured x64 programming language (e.g., x64-linux).

In x64, the result of the computation is not implicitly returned to the user via the REPL like it is in Racket. Instead, we have to explicitly say what the result is, and how to communicate it to the user. To do this, we need to use a system call, a procedure predefined by the operating system that defines how an x64 program interacts with the system.

After compiling our x64 program with nasm, we are left with program in the first target language—a machine code binary, which is just about the raw language of the CPU. In fact, it is machine code in a binary format described by the OS, which contains x64 instructions arranged according to the OS specification, with additional boilerplate (often installed by the linker) to ensure correct interoperation with the OS. I call this target language bin64, and like x64, it is a family of languages. As we rely on nasm to generate bin64, we do not study this target language in any detail, and normally forget about its existence except when directing nasm to generate the right language.

Take an operating systems course if you want to know more about how to program the "raw" CPU to implement the "lowest" level of programming language.

Of course, a chip designer would really be implementing the "lowest" level language by implementing the interpreter in raw silicon.

Or, well, I guess the electrical engineer, who builds the transistors that the chip designer uses, would be implementing the "lowest" language, which implements an interpreter for binary gates in the language of raw, analogue electrical signals.

But what about the chemist who ... It’s languages all the way down!

1.1.3 System Calls, and Flavours of x64

Now that we know x64 is actually a family of languages, let us study the different flavours of x64.

In this course, the main difference between the flavours of x64 are in the system calls. System calls are x64 primitives provided by the OS. Once we start using system calls, code becomes OS-specific. One of the first things a compiler writer will do is abstract away from system calls. But to abstract away from them, we need to understand how they work in various flavours of x64.

Our earlier example x64 program was limited because we did not know how to communicate with the OS. In the rest of this section, we walk through how to use basic system calls to make a complete program that can exit properly and communicate its result. We will introduce further system calls throughout this course.

1.1.3.1 x64-linux

In this course, x64-linux refers to the Linux-flavoured x64. This is x64, using Linux system calls, compiled to elf64. This version of x64 has few quirky restrictions and will be our "default", the point of comparison for other flavours of x64.

There are multiple system calls we could use to return a value to the system in Linux. For example, we could modify the program to return a value as an exit code through the exit system call. In Linux, each process must call the exit system call to terminate properly. exit takes an exit code, a number between 0 and 255, which is passed to the parent process after the child process exits. Usually, the exit code indicates whether the process terminated normally (0) or not (non-zero). But we could just as well use it to return the value of factorial.

global start

section .text

start:
  mov r8, 5

fact:
  mov r9, 1

fact_acc:
  cmp r8, 0
  je fact_done
  imul r9, r8
  dec r8
  jmp fact_acc

fact_done:
exit:
  mov     rax, 60         ; I'm about to call the OS sys_exit function
  mov     rdi, r9         ; The exit code is the value from r9.
  syscall                 ; Perform the system call operation
> nasm -f elf64 -o exit-fact-x64-linux.o exit-fact-x64-linux.s
> ld -e start -o exit-fact-x64-linux.exe exit-fact-x64-linux.o
> ./exit-fact-x64-linux.exe
> echo $?
> 120

Now, when fact completes, we call the exit system call. This requires we set the register rax to the value 60 (in decimal), which corresponds to the exit system call in Linux. This system call expects one argument, the exit code, passed in register rdi.

In general, the Linux system call interface uses the following calling convention. The system call number is set in register rax. The arguments are set according to the System V AMD64 ABI. The first six integer or pointer arguments are passed in registers (in order): rdi, rsi, rdx, rcx, r8, and r9. Remaining arguments are passed on the stack, although we will not use any system calls with more than six arguments. We also won’t be dealing with floating point arguments, which are passed in separate registers. For more details, Wikipedia is not a bad place to start: https://en.wikipedia.org/wiki/X86_calling_conventions#System_V_AMD64_ABI. But you could also go to the source: https://github.com/hjl-tools/x86-psABI/wiki/x86-64-psABI-1.0.pdf.

To compile this on Linux, we run nasm -f elf64 exit-fact-x64-linux.s -o exit-fact-x64-linux.o. This will generate a binary in the elf64 target language. The elf64 is the Linux-flavoured bin64, a binary format used on Linux. It includes the x64 instructions and additional structure that allows the Linux operating system to control the binary.

Next, we link the file by running ld -e start -o exit-fact-x64-linux.exe exit-fact-x64-linux.o, which essentially connects the binary to the operating system and any other external libraries, and specifies the entry point where execution should begin (the start label). Now we can execute the file exit-fact-x64-linux.exe on the command line in the usual way, e.g., set it to executable using chmod +x exit-fact-x64-linux.exe then execute with ./exit-fact-x64-linux.exe. We can get the result of the exit code using echo $? or echo $status, depending on your shell.

Most programming languages communicate their result by printing. Thankfully, the OS does provide a print system call. We could write the program below to print the answer instead of using the exit status, to get a somewhat more friendly user interface.

global start

section .text

start:
  mov r8, 5

fact:
  mov r9, 1

fact_acc:
  cmp r8, 0
  je fact_done
  imul r9, r8
  dec r8
  jmp fact_acc

fact_done:
  mov r8, msg  ; Move the address of `msg` into `r8`.
  mov [r8], r9 ; Move r9 into the 0th index of the address pointed to by r8.

print_msg:
  mov     rax, 1      ; Say, "I'm about to call the OS sys_write function"
  mov     rdi, 1      ; And I want it to write to the standard output port
  mov     rsi, msg    ; The message to print is stored in msg
  mov     rdx, len    ; Length is len.
  syscall

exit:
  mov     rax, 60
  mov     rdi, 0                ; The exit code is 0; normal.
  syscall

section .data

len:   equ   1         ; The constant 1, representing the length of the message.
msg:   times len dw 0  ; A `len` length, in bytes, buffer, initialized to be full of 0.
> nasm -f elf64 -o fact-x64-linux.o fact-x64-linux.s
> ld -e start -o fact-x64-linux.exe fact-x64-linux.o
> ./fact-x64-linux.exe
x

To use the write system call, we need to create a memory buffer to store the message we wish to print. We can statically allocate a buffer in the data section by creating a label, msg:, and using the times keyword to allocate len bytes (indicated by dw) of space, and initializing each to 0. Then we call the write system call, number 1, specifying the standard output port as the file to print to by setting rdi to 1, specifying msg as the address to print from in rsi, and the length of the buffer in rdx. When we run the program, it prints the expected result—x.

Question: Why is x the expected result instead of 120?

You can find complete lists of the Linux system calls online:

1.1.3.2 x64-macos

macOS is very similar to Linux, and we can easily adapt the above examples to macOS. On macOS, there are 4 system call tables, computed by adding the system call number to a specific prefix. The BSD system call table, the most generic and Linux-like, uses the hex prefix #x2000000. The exit system call is system call 1 in the BSD table, so we use the system call number #x2000001. The example "exit-fact-x64-linux.s" from above is rewritten for macOS below.

global start

section .text

start:
  mov r8, 5

fact:
  mov r9, 1

fact_acc:
  cmp r8, 0
  je fact_done
  imul r9, r8
  dec r8
  jmp fact_acc

fact_done:
exit:
  mov     rax, 0x2000001       ; I'm about to call the OS sys_exit function.
                               ; Specifying the value in hex for readability,
                               ; but could convert it to decimal for a simpler
                               ; assembler.
  mov     rdi, r9              ; The exit code is the value from r9.
  syscall

section .data

dummy: db 0
> nasm -f macho64 -o exit-fact-x64-macos.o exit-fact-x64-macos.s
> ld -no_pie -macosx_version_min 10.6 -e start -o exit-fact-x64-macos.exe exit-fact-x64-macos.o
> ./exit-fact-x64-macos.exe
> echo $?
> 120

To compile this file, we run nasm -f macho64 exit-fact-x64-macos.s exit-fact-x64-macos.o. macho64 is the binary formatted used by 64-bit macOS. To link, we run ld -macosx_version_min 10.6 -e start -o exit-fact-x64-macos.exe exit-fact-x64-macos.o. We need to specify a minimum macOS version number of 10.6, otherwise the linker will ignore the custom entry label and fail to link. We can execute the file exit-fact-x64-macos.exe on the command line in the usual way, and can get the result of the exit code using echo $? or echo $status, depending on your shell.

macOS also has a similar write system call: number 4 in the BSD table. The file fact-x64-linux.s is ported to macOS below.

global start

section .text

start:
  mov r8, 5

fact:
  mov r9, 1

fact_acc:
  cmp r8, 0
  je fact_done
  imul r9, r8
  dec r8
  jmp fact_acc

fact_done:
  mov r8, msg  ; Move the address of `msg` into `r8`.
  mov [r8], r9 ; Move r9 into the 0th index of the address pointed to by r8.

print_msg:
  mov     rax, 0x2000004    ; Say, "I'm about to call the OS sys_write function"
  mov     rdi, 1            ; And I want it to write to the standard output port
  mov     rsi, msg          ; The message to print is stored in msg
  mov     rdx, len          ; Length is len.
  syscall

exit:
  mov     rax, 0x2000001
  mov     rdi, 0                ; The exit code is 0; normal.
  syscall

section .data

len:   equ   1         ; The constant 1, representing the length of the message.
msg:   times len dw 0  ; A `len` length, in bytes, buffer, initialized to be full of 0.
> nasm -f macho64 -o fact-x64-macos.o fact-x64-macos.s
> ld -no_pie -macosx_version_min 10.6 -e start -o fact-x64-macos.exe fact-x64-macos.o
> ./fact-x64-macos.exe
x

1.1.3.3 x64-windows

Windows does not allow user processes to perform system calls. Instead, we have to call kernel functions that provide similar functionality. The ExitProcess kernel function provides the same functionality as the exit system call on Linux and macOS. However, making a function call is more complex than making a system call. We have to declare external functions for the linker, and ensure we link with the library that includes that function.

global Start   ; GoLink expects "Start"
  extern ExitProcess ; Declare that ExitProcess is an external symbol to be resolved by the linker.

section .text

Start:
  mov r8, 5

fact:
  mov r9, 1

fact_acc:
  cmp r8, 0
  je fact_done
  imul r9, r8
  dec r8
  jmp fact_acc

fact_done:
exit:
  mov rcx, r8                   ; Windows 64-bit calling convention uses rcx as first argument
  call ExitProcess              ; Windows doesn't expose system calls to user processes; call ExitProcess function to exit.
> nasm -f win64 -o exit-fact-x64-windows.o exit-fact-x64-windows.s
> golink /entry Start /fo exit-fact-x64-windows.exe exit-fact-x64-windows.o kernel32.dll
> ./exit-fact-x64-windows.exe
> echo $?
> 120

Windows also doesn’t ship with a linker. GoLink is a small, freely available linker. After downloading nasm and GoLink, you can compile using nasm -f win64 exit-fact-x64-windows.s -o exit-fact-x64-windows.o and link using golink /entry Start /fo exit-fact-x64-windows.exe exit-fact-x64-windows.o kernel32.dll.

1.1.4 Conclusion

You should now have an understanding of the first target language we will use, x64. We looked at its syntax, its behaviour, and the existing toolchain to compile and execute it (an operating system, nasm and a linker such as ld, and an x86-64 CPU). We by no means saw a complete description of everything that language offers us, but it is enough that we can start designing a source language that abstracts aways some of the annoying aspects of the language, and a compiler to write programs in x64 for us.