1.2 Abstracting Boilerplate (v1)

In this chapter, we walk through the design of an entire compiler. It’s a small one that merely abstracts some boilerplate, but it serves as a concrete example of the entire process.

We will design our compilers by starting from a fixed abstraction boundary, an existing target language, and building a new layer of abstraction atop it. Our goal is to systematically design and build a new layer of abstraction, formalized as a new programming language, a source language, that can be implemented by translation into a corresponding target language. The new abstraction is meant to solve some problem, such as software development being error-prone, software design being complex, or software produced in the language being unsafe, unportable, or verbose. Ideally, we solve this problem without introducing some cost, such as a high learning curve, or some performance penalty.

To solve a problem, we must first concretely identify one, and then design a layer of abstraction to address it. Starting from the target language, we ask a question:

What’s wrong with this language?

Once we identify a concrete problem, we design a new source language that solves the problem, and then design a compiler to transform the source language into the target language. The source language in its definition should capture all there is to the new abstraction. Our goal is to faithfully implement the new abstraction in terms of the old one.

We aim to derive each compiler from these language definitions, letting the abstractions be our guide. This is easiest when we focus on a single problem at a time, so we will introduce many intermediate languages in the process—languages that are both source and target languages, serving some intermediate compilers. Each of these intermediate compilers is more called a compiler pass or just pass, while compiler is reserved for the transformation from a source language in which a programmer writes programs. Each of our passes may be as simple as inserting some code, or as complex as analyzing and rewriting the structure of many pieces of interacting code. While some of our passes are easily derived from our language definitions, some have been designed after years of iteration to effectively implement some well-known abstraction.

A note on conventions. In each chapter, we incrementally design a whole compiler. Each subsequent design will be an iteration on the previous design. To distinguish them, we include a version tag for each; the version tag in this chapter is v1. Since our compilers are designed around languages, we include this version tag in the language name, even if the set of features in the language does not change from one chapter to the next. For example, in this chapter, we design Paren-x64 v1 (paren-x64-v1), a subset of x64. This language will go through many iterations, and its version tag will change with each.
This version tag is also used in the various support libraries that accompany this book when support code must be sensitive to the compiler or language versions. For example, the reference implementations of the v1 languages are found in the library cpsc411/langs/v1, and the test suites for the v1 compiler as in cpsc411/test-suite/public/v1.
The tag is usually, but not necessarily, a number. When a chapter does not change any interface at all, the version number may remain the same with some additional suffix added to the tag, akin to semantic versioning.

1.2.1 Designing an Abstraction

Our current abstraction boundary, i.e., our current language, is x64. So we start by asking: what’s wrong with x64?

There is a lot wrong with x64, but we’ll start with one problem that many programmers are familiar with, and one that is simple to address: boilerplate. Writing programs in x64 requires the programmer to insert repetitive boilerplate, such as the declaration of the initial label, and some code to exit the program and report the result to the user. This boilerplate prevents the user from focusing on the program and requires them to copy and paste the same snippets of code into their programs, an error-prone process if that snippet ever needs to change.

We could solve this problem by allowing the programmer to focus on writing and composing sequences of instructions, by giving those sequences meaning independent of the boilerplate and the run-time system. For example, a programmer should be able to write two sequences, separately:

;; Program 1

mov r9, 42

Somewhere, separately:

;; Program 2

mov rax, r9

And, in our desired language, combining these would be well defined, such as in the following example, without ever referring to the boilerplate required by nasm.

;; Program 3, by combining Program 1 and Program 2

mov r9, 42

mov rax, r9

Our goal is to introduce this abstraction of instruction sequences: lists of instructions that represent the code of an x64 program. Instruction sequences separate the code from the boilerplate. As a result, we get a notion of program composition, allowing us to focus on the program, and decompose a program into separate pieces that we can easily stitch together. Supposing p_1 and p_2 are both instruction sequences, then there exists (p-append p_1 p_2) (for some definition of p-append) which first executes the instructions in p_1 and then executes the instructions in p_2.

Below, we select the subset of x64 instructions we plan to support in our compiler, but note that the abstraction of instruction sequences applies to all x64 instructions.

mov triv, triv
This string represents the move instruction, which moves a value from one location to another, or moves a value into a location.
x64 imposes further restrictions on mov (remember—x64 is, we must not ask why). We can only move a value into a register, or a value in one register to another register.
add triv, triv
This string represents the add instruction, which intuitively adds two values.
In fact, we cannot add values in x64: we can add a value (i.e., an integer), to a register add reg, integer, or add the values of two registers, add reg_1, reg_2.
Furthermore, when using an integer directly in the add statement, it must be a 32-bit integer, in the range -2^31 <= i <= 2^31 - 1. Yes, that’s right, 32-bit not 64-bit; we do not ask "why" of x64. For example, add rax, 2147483647 is valid, but add rax, 2147483648 is not. Instead, we would first need to move 2147483648 into a register to add it to rax. For example:
mov rbx, 2147483648
add rax, rbx
imul triv, triv
This string represents the multiply instruction, which intuitively multiplies two values.
Multiplication is further restricted by x64. Again, we cannot multiply values directly. We can only multiply the value of a register by a 32-bit integer imul reg_1, int32, or the value in a register by the value in a register imul reg_1, reg_2.

Below is an example of a valid instruction sequence in our subset of x64.

mov rax, 170679
mov rdi, rax
add rdi, rdi
mov rsp, rdi
imul rsp, rsp
mov rbx, 8991

Note that this does not correspond to a x64 program, as it is missing much of the structure: the starting label, the section declarations, etc. It has no meaning on its own in nasm, and part of the job of our compiler is to transform it so that it does have meaning when given to nasm. You probably have some idea of how to fix this program to have meaning, but we will be systematic in our approach in defining its meaning.

We represent x64 instruction sequences as Racket strings, with each instruction separated by newline characters. For example, the x64 instruction sequence mov rax, 42 corresponds to the Racket string " mov rax, 42".

Note that this representation of x64 instruction sequences is whitespace sensitive.For example, the following instruction sequence is represented by both the Racket string " mov rbx, 2147483648\\n add rax, rbx" and "\\nmov rbx, 2147483648\\n\\nadd rax, rbx\\n".

mov rbx, 2147483648
add rax, rbx

This is one of the first problems we will solve, by moving away from strings as a representation.

Before we can begin compiling, we need to define what an instruction sequence means, independent of how it is implemented. Otherwise, we would not know whether we are compiling them correctly.

We address both problems next by designing a source language that captures the meaning of instruction sequences, for our chosen subset of x64, and choosing a new representation of programs that we will use for the rest of this book.

1.2.2 Defining a Source Language

Our next step is to capture this abstraction in its own language.

When defining languages, we start by defining their abstract syntax via an eBNF grammar, such as the one below. When we described instruction sequences, we described them in terms of the concrete syntax of x64 and strings, but this syntax is inconvenient for a compiler to manipulate. Concrete syntax often contains irrelevant details that are useful for human programmers, but irrelevant to a machine. For example, strings are difficult to work with as they lack structure for conveniently accessing substructures, and whitespace sensitivity means two identical programs have multiple representations.

From now on, we’ll work almost entirely in abstract syntax. Our abstract syntax is meant to be represented as quasiquoted data, so the expression (begin (set! rax 42)) is represented in Racket as `(begin (set! rax 42)), or (equivalently) (list 'begin (list 'set! 'rax 42)). See Quasiquoting: quasiquote and ‘ for more.

Below, we define a new language, Paren-x64 v1, with our new instruction sequence abstraction. We first present a simple definition of the abstract syntax, then gradually refine the definition to encode more constraints.

Every Paren-x64 v1 program, p, begins with begin, followed by a sequence of statements (instructions). The non-terminal p defines the instruction sequence abstraction. We can define the composition operation as follows:

(define (p-append p1 p2)
  (match `(,p1 ,p2)
    [`((begin ,s1 ...) (begin ,s2 ...))
     `(begin ,@s1 ,@s2)]))

Each instruction s corresponds to one of the x64 instructions described earlier. For example, the x64 instruction mov reg, integer corresponds to the Paren-x64 v1 instruction (set! reg integer). Our abstract syntax makes more clear that the arithmetic operations are actually a combination of an arithmetic operation and an operation that changes state. The instruction add reg, integer is represented (set! reg (+ reg integer)). Note that this duplicates the register in the syntax, and the two occurrences must be the same to correspond to a valid x64 instruction.

Some restrictions from x64 are not apparent in the definition of the grammar. To simplify precisely defining languages as grammars without too much additional English specification, we create two extensions in our eBNF language.

Any time we suffix a non-terminal by an underscore and a number, such as reg_1, this is a reference to a particular instance of a non-terminal, and it is restricted to be identical to any other instance of the same non-terminal with the same underscore suffix within the same expression. So (set! reg_1 (+ reg_1 integer)) represents an add instruction in Paren-x64 v1, and both occurrences of reg_1 must be the same register. However, two instances of non-terminals with different suffixes, like reg_1 and reg_2, are not necessarily different—they are only not guaranteed to be the same. As usual in eBNF, any reference to a non-terminal without an underscore is also unrestricted. For example, in (set! reg reg), neither occurrence of reg is guaranteed to be the same nor different.
Any non-terminal that is not defined as syntax, such as int64, may be defined by a Racket predicate, such as int64?. Such definitions link to the documentation defining the predicate.

Using these two features, we define the final grammar of Paren-x64 v1 (paren-x64-v1) with all x64 restrictions as follows.

Since we are frequently extending and comparing languages, we typeset them in a tabbed interface with one or more diffs comparing two languages, and usually a tab containing the language definition on its own. The diff emphasize what has been added to the language, and what has been removed from the language. Anything that is unchanged between the two is left without additional formatting.

Paren-x64 v1DiffParen-x64 v1 Simplified

p	::=	(begin s ...)

s	::=	(set! reg int64)
	\|	(set! reg reg)
	\|	(set! reg_1 (binop reg_1 int32))
	\|	(set! reg_1 (binop reg_1 reg))

reg	::=	rsp
	\|	rbp
	\|	rax
	\|	rbx
	\|	rcx
	\|	rdx
	\|	rsi
	\|	rdi
	\|	r8
	\|	r9
	\|	r10
	\|	r11
	\|	r12
	\|	r13
	\|	r14
	\|	r15

binop	::=	*
	\|	+

int64	::=	int64?

int32	::=	int32?

p	::=	(begin s ...)

s	::=	(set! reg int64)
	\|	(set! reg reg)
	\|	(set! reg_1 (binop reg_1 int32) triv triv)
	\|	(set! reg_1 triv (binop reg_1 reg triv triv))

triv	::=	reg
	\|	integer

binop	::=	+
	\|	*

reg	::=	rsp
	\|	rbp
	\|	rax
	\|	rbx
	\|	rcx
	\|	rdx
	\|	rsi
	\|	rdi
	\|	r8
	\|	r9
	\|	r10
	\|	r11
	\|	r12
	\|	r13
	\|	r14
	\|	r15

binop	::=	*
	\|	+

int64	::=	int64?

int32	::=	int32?

integer	::=	integer?

p	::=	(begin s ...)

s	::=	(set! triv triv)
	\|	(set! triv (binop triv triv))

triv	::=	reg
	\|	integer

binop	::=	+
	\|	*

reg	::=	rsp
	\|	rbp
	\|	rax
	\|	rbx
	\|	rcx
	\|	rdx
	\|	rsi
	\|	rdi
	\|	r8
	\|	r9
	\|	r10
	\|	r11
	\|	r12
	\|	r13
	\|	r14
	\|	r15

integer	::=	integer?

The predicates int64? and int32? are defined by the support library cpsc411/compiler-lib, and return #t if and only if given an integer in the range for a 64-bit or 32-bit, respectively, signed two’s complement integer.

A grammar is not enough to define our language. When we create a new language, we want to ensure we understand the meaning of that grammar separate from how it is compiled. This is for two reasons. First, optimizations depend on when various programs in a language are equivalent. We need to understand the language to understand when programs are equivalent. Second, we cannot know whether the compiler is correct if we do not know the meaning of programs before they are compiled. Unit tests will help us debug, but when we know the meaning of all programs in the language, we can say whether that meaning is preserved through compilation.

We can define the meaning of a language by writing an interpreter. For most of our languages, we assume the grammar has an interpreter that roughly corresponds to embedding the program in Racket, although we describe in detail any features that don’t correspond closely to existing Racket features. For some languages, including this first source language, we walk through the design of an interpreter explicitly.

Reference implementations of all our languages are available the CPSC 411 Infrastructure Package

Since Paren-x64 v1 is an imperative language and does not return values, we must decide how to interpret a Paren-x64 v1 program as a value. Unlike x64, there is no preexisting convention for how to produce a value, so we must create our own. We create a convention, or pattern that every program must follow to be well-defined, for producing a final value in Paren-x64 v1 by deciding that the final value is the value of rax when the program is finished executing (has no instructions in the instruction sequence left to execute), modulo 256. This choice is completely up to us, but this choice is designed to continue to work well as we build up our compiler. We explain the modulo 256 later.

Design an interpreter that produces the value of a Paren-x64 v1 is straightforward. We implement a register machine: a recursive function over instruction sequences that interprets each instruction using an accumulator mapping registers to values. When there are no instructions left, the interpreter returns the value of rax, the register designated by our convention.

When we begin writing our interpreter, we will immediately notice an edge case. What is a register has no value before it is used? For example, what is the value of a program when rax is never initialized?

There are several ways to deal with this, but we take the simplest that will enable us to efficiently compile Paren-x64 v1: we also require that any register is initialized before it is accessed. We inherit this restriction from x64. In x64, the value of an uninitialized register is undefined, that is, accessing an uninitialized register results in undefined behaviour, behaviour that has no specified definition in the language specification. Since we don’t want to insert code to check every register is initialized (if that’s even possible, it would be expensive), or insert extra code to initialize registers to arbitrary values (how would we distinguish them from real values?), we simply restrict the language.

Undefined behaviour is common in low-level languages that lack a strong enough enforcement mechanism for checking assumptions. Eliminating undefined behaviour by adding static or dynamic checks in the source language improves the ability of programmers to predict behaviour of all programs in your language. However, it is not always practical to achieve. It may be too difficult to statically check an assumption and still allow all the programs you want to allow, or too expensive to check a property dynamically. In these cases, we are forced to make assumptions that we cannot enforce, injecting undefined behaviour into our language.

In this book, we make it a non-negotiable goal: source languages must never have undefined behaviour. If they might, we (temporarily) sacrifice expressivity until we have enough expressivity to remove the undefined behaviour. So in Paren-x64 v1, there are no uninitialized registers.

In the interpreter, we assume the input is a valid Paren-x64 v1 program. Not only syntactically, but also obeying any restrictions or conventions required by the language. In a user interface, we would validate all input, but in the implementation of the interpreter, we keep the two concerns separate. Instead, the interpreter is free to assume all integers are in the right range, arithmetic instructions correctly refer to the same register in both operand positions, and all registers are initialized before use. For example, in the instruction (set! reg_1 (+ reg_2 integer)), we assume reg_1 and reg_2 are identical, and integer is a 32-bit integer, since otherwise the input would not have been a valid Paren-x64 v1 program. In fact, it would be bad style to check these again in the interpreter, since this mixes concerns and duplicates code.

procedure
(interp-paren-x64 x) → int64?
  x : paren-x64-v1?
Interprets the Paren-x64 v1 program, returning the final value as an exit code in the range 0–255.
Examples:
> (interp-paren-x64
   '(begin
      (set! rax 0)
      (set! rax (+ rax 42))))
42
> (interp-paren-x64
   '(begin
      (set! rax 170679)
      (set! rdi rax)
      (set! rdi (+ rdi rdi))
      (set! rsp rdi)
      (set! rsp (* rsp rsp))
      (set! rbx 8991)))
183

To properly implement arithmetic operations, you need to handle two’s complement arithmetic, which overflows on large positive numbers and underflows on small negative numbers. You may want to use x64-add and x64-mul from cpsc411/compiler-lib.

Now we can begin designing a compiler.

1.2.3 Enforcing Assumptions

One of the jobs of the front-end of a compiler is to enforce assumptions the rest of the compiler makes. Enforcement mechanisms take various forms, such as parsers, type checkers, linters, and static analyses.

We’re going to design a function check-paren-x64 to validate Paren-x64 v1 programs. It is similar to a parser. It reads an arbitrary value, expected to represent a Paren-x64 v1 program, and returns a valid program in the language Paren-x64 v1, or raises an error. However, it is a trivial parser. Usually we think of parsers as transforming from one representation to another, but this parser does not transform the representation of input if it is valid—it only transforms the type, or interpretation, of the data.

You could view check-paren-x64 as a type checker. In this view, it checks for a single type: The-Paren-x64-Type, which every valid instruction has, and which has simple typing rules. check-paren-x64 checks that the input program is following the typing disciplines of the language (which aren’t very restrictive).

I’ll call passes of this kind validator, a generic name for a compiler passes that does not transform the representation of data, but does transform the type or interpretation of that data.

Writing validators for intermediate language programs, including those produced by your compiler, is a powerful debugging technique. By designing them in the same way as check-paren-x64, so that they return the input if it’s valid, you can easily add them as passes in your compiler and detect when an early pass produces an invalid program. This is a form of property-based testing, and will catch many more bugs than unit testing alone.

To ensure no undefined behaviour, our validator should check the following.

The input conforms to the grammar of Paren-x64 v1, including restrictions regarding the valid range of integers and when the same register must appear in two places.
No register is referred to before is it initialized. We assume that no register is initialized at the beginning of a program.
The register rax is initialized before the end of the program.

We split this into two separate functions to enable separation of concerns. We always want our syntax to be valid, but because Paren-x64 v1 will serve as a target language, we may not always want to enforce that registers are provably initialized. For example, the language may evolve to the point where checking this will be undecidable, and source languages will be responsible for enforcing the guarantee instead. By separating the two checks, we’ll be able to reuse code in this eventuality.

First, we check that the syntax is valid.

procedure
(check-paren-x64-syntax x) → paren-x64-v1?
  x : any/c
Takes an arbitrary value and either returns it, if it is valid Paren-x64 v1 syntax, or raises an error with a descriptive error message.
Examples:
> (check-paren-x64-syntax
   `(begin (set! rax ,(min-int 64))))
'(begin (set! rax -9223372036854775808))
> (check-paren-x64-syntax
   `(begin (set! rax ,(- (min-int 64) 1))))
check-paren-x64: Invalid statement; expected one of:
  Expected a triv, got -9223372036854775809
  something of the form `(set! ,reg_1 (,binop ,reg_2
,int32)), but got '(set! rax -9223372036854775809)
  something of the form `(set! ,reg_1 (,binop ,reg_2
,reg_3)), but got '(set! rax -9223372036854775809)
> (check-paren-x64-syntax
   '(begin
      (set! r17 170679)))
check-paren-x64: Invalid statement; expected one of:
  Expected a register, got r17
  something of the form `(set! ,reg_1 (,binop ,reg_2
,int32)), but got '(set! r17 170679)
  something of the form `(set! ,reg_1 (,binop ,reg_2
,reg_3)), but got '(set! r17 170679)
> (check-paren-x64-syntax
   '(begin
      (set! rax 170679)
      (set! rdi rax)
      (set! rdi (+ rdi rdi))
      (set! rsp rdi)
      (set! rsp (* rsp rsp))
      (set! rbx 8991)))
'(begin
   (set! rax 170679)
   (set! rdi rax)
   (set! rdi (+ rdi rdi))
   (set! rsp rdi)
   (set! rsp (* rsp rsp))
   (set! rbx 8991))

Then, we check register initialization. Note that this procedure can assume its input is well-formed Paren-x64 v1 syntax, and only concern itself with register initialization.

procedure
(check-paren-x64-init x) → paren-x64-v1?
  x : paren-x64-v1?
Takes valid Paren-x64 v1 syntax, and returns a valid Paren-x64 v1 program or raises an error with a descriptive error message.
Examples:
> (check-paren-x64-init
   '(set! (+ rax rdi) 42))
check-paren-x64-init: contract violation
  expected: paren-x64-v1?
  given: '(set! (+ rax rdi) 42)
  in: the 1st argument of
      the check-paren-x64-init method in
      (class/c
       (check-paren-x64 (->m any/c paren-x64-v1?))
       (check-paren-x64-syntax
        (->m any/c paren-x64-v1?))
       (check-paren-x64-init
        (->m paren-x64-v1? paren-x64-v1?))
       (generate-x64 (->m paren-x64-v1? string?))
       (interp-paren-x64
        (->m paren-x64-v1? int64?))
       (wrap-x64-run-time (->m string? string?))
       (wrap-x64-boilerplate
        (->m string? string?)))
  contract from: (definition a1-compile)
  contract on: a1-compile
  blaming: <pkgs>/cpsc411-reference-lib/cpsc411/reference/a1
-solution.rkt
   (assuming the contract is correct)
  at: <pkgs>/cpsc411-reference-lib/cpsc411/reference/a1-solu
tion.rkt:261:17
> (check-paren-x64-init
   '(begin
      (set! (+ rax rdi) 42)))
check-paren-x64-init: contract violation
  expected: paren-x64-v1?
  given: '(begin (set! (+ rax rdi) 42))
  in: the 1st argument of
      the check-paren-x64-init method in
      (class/c
       (check-paren-x64 (->m any/c paren-x64-v1?))
       (check-paren-x64-syntax
        (->m any/c paren-x64-v1?))
       (check-paren-x64-init
        (->m paren-x64-v1? paren-x64-v1?))
       (generate-x64 (->m paren-x64-v1? string?))
       (interp-paren-x64
        (->m paren-x64-v1? int64?))
       (wrap-x64-run-time (->m string? string?))
       (wrap-x64-boilerplate
        (->m string? string?)))
  contract from: (definition a1-compile)
  contract on: a1-compile
  blaming: <pkgs>/cpsc411-reference-lib/cpsc411/reference/a1
-solution.rkt
   (assuming the contract is correct)
  at: <pkgs>/cpsc411-reference-lib/cpsc411/reference/a1-solu
tion.rkt:261:17
> (check-paren-x64-init
   '(begin
      (set! rax (+ rax 42))))
check-paren-x64-init: Cannot reference the uninitialized
register rax in the statement (set! rax (+ rax 42))
> (check-paren-x64-init
   '(begin
      (set! rax (+ rdi 42))))
check-paren-x64-init: contract violation
  expected: paren-x64-v1?
  given: '(begin (set! rax (+ rdi 42)))
  in: the 1st argument of
      the check-paren-x64-init method in
      (class/c
       (check-paren-x64 (->m any/c paren-x64-v1?))
       (check-paren-x64-syntax
        (->m any/c paren-x64-v1?))
       (check-paren-x64-init
        (->m paren-x64-v1? paren-x64-v1?))
       (generate-x64 (->m paren-x64-v1? string?))
       (interp-paren-x64
        (->m paren-x64-v1? int64?))
       (wrap-x64-run-time (->m string? string?))
       (wrap-x64-boilerplate
        (->m string? string?)))
  contract from: (definition a1-compile)
  contract on: a1-compile
  blaming: <pkgs>/cpsc411-reference-lib/cpsc411/reference/a1
-solution.rkt
   (assuming the contract is correct)
  at: <pkgs>/cpsc411-reference-lib/cpsc411/reference/a1-solu
tion.rkt:261:17
> (check-paren-x64-init
   '(begin
      (set! rax 170679)
      (set! rdi rax)
      (set! rdi (+ rdi rdi))
      (set! rsp rdi)
      (set! rsp (* rsp rsp))
      (set! rbx 8991)))
'(begin
   (set! rax 170679)
   (set! rdi rax)
   (set! rdi (+ rdi rdi))
   (set! rsp rdi)
   (set! rsp (* rsp rsp))
   (set! rbx 8991))

In these examples, the reference implementation raises contract errors when the input is invalid. The reference implementation uses contracts on every pass to detect invalid input and output. These errors are separate from the errors raised by the validator.

For convenience, we define a single validator that validates all properties of the language as check-paren-x64.

procedure
(check-paren-x64 x) → paren-x64-v1?
x : any/c
Takes an arbitrary value and either returns it, if it is valid Paren-x64 v1 program, or raises an error with a descriptive error message.

1.2.4 Compiling

Finally, we get to compiling. We have designed our new abstraction, made it precise in the form of a language, and enforced our assumptions.

The job of our compiler is to translate one level of abstraction into another. We currently have three levels of abstraction: (1) Paren-x64 v1, the abstract syntax representation of x64 instruction sequences, (2) the string representation of x64 instruction sequences, and (3), x64 programs. The structure of our compiler is determined partially by these levels of abstractions, both the source layer and the target layer. There are two pieces that need to be added to a Paren-x64 v1 program to make it a complete x64 program: (1) boilerplate, such as the declaration of the starting label and (2) the run-time system, which implements our convention for the meaning of instruction sequences.

We design our compiler as a series of compiler passes along these natural lines to separate concerns as much as possible. First, we translate from Paren-x64 v1 into the string representation of instruction sequences. Then, we introduce the run-time system. Finally, we wrap the whole thing in boilerplate. This order is not arbitrary; it is dictated by our abstract layers. Implementing the run-time system requires instructions outside the subset used by Paren-x64 v1, so we must reach a lower level of abstraction to implement it. However, it can be implemented as an instruction sequence, and we want to take advantage of instruction sequence composition if we can. We won’t be able to do that after introducing boilerplate.

procedure
(generate-x64 p) → string?
  p : paren-x64-v1?
Compiles a Paren-x64 v1 program into a x64 instruction sequence represented as a string.
Examples:
> (generate-x64
   '(begin
      (set! rax 0)
      (set! rax (+ rax 42))))
"mov rax, 0\nadd rax, 42\n"
> (require racket/pretty)
> (pretty-display
   (generate-x64
    '(begin
       (set! rax 0)
       (set! rax (+ rax 42)))))
mov rax, 0
add rax, 42
> (pretty-display
   (generate-x64
    '(begin
       (set! rax 170679)
       (set! rdi rax)
       (set! rdi (+ rdi rdi))
       (set! rsp rdi)
       (set! rsp (* rsp rsp))
       (set! rbx 8991))))
mov rax, 170679
mov rdi, rax
add rdi, rdi
mov rsp, rdi
imul rsp, rsp
mov rbx, 8991

1.2.4.1 Implementing a Run-time System

The abstractions provided by the operating system for running x64 are not the same as the convention we just created for Paren-x64 v1. The operating system does not know it should "return" the result of rax, whatever "return" means. To implement this convention, we need to write some x64 code (ideally, an instruction sequence) that will take any instruction sequence implementing the Paren-x64 v1 convention and communicate the result to the operating system. This code is a very simple run-time system.

The run-time system provides all run-time support required by the language but that that is not provided by the underlying machine. Exactly what this run-time support is depends on the language. Typically, the language run-time provides memory allocation and deallocation, provides initialization of the process environment such as the stack, handles returning values to the user, and provides any built-in procedures that all programs in the language can expect to use. For Paren-x64 v1, the only run-time support we require is returning the final value to the operating system and exiting.

Our choice of run-time system depends on the abstractions provided by the language, the machine, the operating system, and the user interface we desire. Some languages print the final result. Some languages discard the final result, relying on the user program to print the result, or modify the filesystem, or modify the state of the machine (or the world) in some other non-temporary way.

Our language does not provide the user with any way to observe the state of the machine, so our run-time system must do the job of communicating the return value to the user.

We could print the result, but as we saw in the factorial example above, the operating system’s definition of "print" does not match our intuition. When trying to print "120", we get the character "x". This would make for a very confusing user interface, or a very complicated run-time system.

Instead, we opt for a very simple run-time system: communicate via the operating system exit code. This exit code is a number between 0 and 255 given to the exit system call, and is easily accessible in shells via the variable $? (or $status in some shells). In Racket, we can access the exit code of a subprocess using system/exit-code. This limits how much our programs can communicate; we will lift that restriction in later versions of our compiler.

Our run-time system is an x64 instruction sequence which expects to be composed after another instruction sequence. The run-time system assumes that the first instruction sequence must initialize rax. The run-time system then calls the exit system call with the value of rax passed as the exit code. cpsc411/compiler-lib provides some definitions, such as sys-exit, that are helpful for this.

For formatting strings in Racket, you may want to investigate format, ~a, and at-exp.

procedure
(wrap-x64-run-time x) → string?
x : string?
Installs the Paren-x64 v1 run-time system. The input is the same as the output for generate-x64: a string representing an x64 instruction sequence. The run-time system is composed with the input as a second instruction sequence.
Note that in the string representation, string concatenation implements instruction sequence composition.

1.2.4.2 Implementing Instruction Sequences

Finally, we implement a simple pass to turn the instruction sequence into a program, by introducing the x64 boilerplate described in A Compiler Begins with a Language.

procedure
(wrap-x64-boilerplate x) → string?
x : string?
Takes an x64 instruction sequence and wraps it with the necessary boilerplate to return a complete x64 program in Intel syntax.

After that, we have a complete compiler. The compiler is easily defined by composing all the individual passes.

(define (paren-x64-v1-compiler x)
(wrap-x64-boilerplate (wrap-x64-run-time (generate-x64 x))))

(define paren-x64-v1-compiler^
(compose wrap-x64-boilerplate wrap-x64-run-time generate-x64))

Examples:

> (interp-paren-x64
   '(begin
      (set! rax 170679)
      (set! rdi rax)
      (set! rdi (+ rdi rdi))
      (set! rsp rdi)
      (set! rsp (* rsp rsp))
      (set! rbx 8991)))
183
> (current-pass-list
   (list
    check-paren-x64
    generate-x64
    wrap-x64-run-time
    wrap-x64-boilerplate))
> (execute
   '(begin
      (set! rax 170679)
      (set! rdi rax)
      (set! rdi (+ rdi rdi))
      (set! rsp rdi)
      (set! rsp (* rsp rsp))
      (set! rbx 8991))
    nasm-run/exit-code)
183

The support library cpsc411/compiler-lib provides a few abstractions for deriving the compiler from a list of passes. See current-pass-list, compile, and execute.

1.2.5 Is the Compiler Correct?

Now that we have a compiler the meaning of all our languages is fully defined. We have an interpreter for the source language to define its meaning. We have an "interpreter" for the target language (the CPU). So we can define what it means for a compiler to be correct.

A compiler for Paren-x64 v1 is correct if:

the meaning (as defined by the interpreter) of a program p is the value integer_1
we compile p and execute it as a x64 program and get the value integer_2
the values integer_1 and integer_2 are equivalent. In general, we have to define equivalence for each pair of source and target languages. In this case, the interpreter and the compiler should return the same value.

Instead of defining Paren-x64 v1 to produce a value modulo 256, we could have instead defined its meaning as the final value of rax, and then defined equivalence between Paren-x64 v1 and x64 differently. In that case, the interpreter and compiled programs would produce different results for some programs, but they would always be equivalent modulo 256.

This gives us yet another design choice in our compiler: do we restrict the definition of our source language to ensure the compiler is correct, or design a more complex equivalence relation that can decide whether the compiler is correct? We won’t spend much more time on this, and in general, choose to ensure the interpreter and compiler produce "the same" value.

1.2.6 Appendix: Compiler Overview

contents ← prev up next →

1	Book
2	Project Logistics
3	Project Iterations
4	CPSC 411 Infrastructure Package
5	Credits

1.1	A Compiler Begins with a Language
1.2	Abstracting Boilerplate (v1)
1.3	Abstract Locations
1.4	Value Orientation
1.5	Register Allocation
1.6	Structured Control Flow
1.7	Procedural Abstraction: Call
1.8	Procedural Abstraction: Return
1.9	Algebraic Expressions
1.10	Data types: Immediates
1.11	Data types: Structured Data and Heap Allocation
1.12	First-class Procedures: Code is Data
1.13	Recursive Data
1.14	Syntactic Sugar
1.15	Appendix: Racket Preliminaries
1.16	Appendix: Compiler Design Recipe

1.2.1	Designing an Abstraction
1.2.2	Defining a Source Language
1.2.3	Enforcing Assumptions
1.2.4	Compiling
1.2.5	Is the Compiler Correct?
1.2.6	Appendix: Compiler Overview