1.11 Data types: Structured Data and Heap Allocation

1.11.1 Preface: What’s wrong with Exprs-Lang v7

Exprs-lang v7 gained proper data types, which is a huge step forward in expressivity and high-level reasoning. However, it still does not allow us to express structured data. Real languages require structured heap-allocated data—such as strings, vectors, and linked lists—to express interesting programs over data larger than a single word. Functional languages use procedures, a data structure, to provide functions as first-class values.

To express data larger than a single word, we need support from the low-level languages to get access to locations larger than a single word in size. All our locations so far, registers and frame locations, are only a single word in size. We need access to the heap—arbitrary unstructured and unrestricted memory locations.

Once we have the ability to allocate arbitrary space in heap memory, and pointers to that memory, we can add structured data types. We’ll use the same tagging approach introduced in the last chapter to tag pointers, so we can distinguish pointers to vectors vs pointers to lists, perform dynamic checking, etc.

Since our low-level languages don’t have any operations on memory address except frame variables, we’ll need to introduce new low-level abstraction. Our run-time system will also need to provide access to the heap memory in some way.

1.11.2 Implementing Structured Data

In this chapter, we’ll design and implement two heap-allocated data types, described below.

Pairs are constructed using (cons e_1 e_2). The predicate pair? should return #t when passed any value constructed this way, and #f for any other value—(eq? (pair? (cons e_1 e_2)) #t). (car e) returns the value of the first element of the pair, and (cdr e) returns the value of the second element.
As usual, we want to ensure safety in our source language. That is, (eq? (car (cons e_1 e_2)) e_1) and (eq? (cdr (cons e_1 e_2)) e_2), while (if (not (pair? e)) (eq? (cdr e) (error uint8)) #t). We do not need to worry about uninitialized values, since the only constructors ensures all parts of the pair are initialized.
Vectors are arrays that know their length. They are constructed using (make-vector e); the constructor takes the length of the vector as the argument. The predicate vector? should return #t for any value constructed this way, and #f for any other value—(eq? (vector? (make-vector e)) #t). (vector-ref e_1 e_2) returns the value at index e_2 in the vector e_1. (vector-set! e_1 e_2 e_3) mutates the index e_2 in the vector e_1, setting its value to the value of e_3, and returning (void)
To ensure safety, the appropriate arguments must be ensured to be vectors, the index arguments must be positive integers, and make-vector or vector-ref must work to ensure that we never get an uninitialized value. We’ll implement these with dynamic checks and by making make-vector initialize all values in the vector to 0.

These two are interesting, since one has a statically known size, while one has a dynamically determined size. Each of these properties requires attention in our compiler. We could add more, but these two are enough to demonstrate the general approach to compiling structured data.

As we’re adding new data types, we need new tags. These are two very commonly used data types, so we assign them primary tags. Here is our updated list of primary tags:

#b000, fixnums, fixed-sized integers
#b001, pairs
#b010, unused
#b011, vectors
#b100, unused
#b101, unused
#b110, non-fixnum immediates (booleans, etc)
#b111, unused

To add immediate data, we needed three operations: tagging, untagging, and tag checking. We need all of these to implement structured data types and follow the same pattern as Data types: Immediates for them, but we also need three additional operations: memory allocation, dynamically computed memory assignment, and dynamically computed memory dereference.

First, to implement constructors, we need the ability to allocate memory. For now, we’ll assume the existence of some abstraction alloc that can do this (which we become responsible for implementing later). cons creates a pair, so it should allocate two words of space by producing `(alloc ,(current-pair-size)). make-vector allocates one word for the length, and then one word for every element of the vector. That is, it should allocate n+1 words for a vector of length n.

Second, to initialize and mutate structured data, we need the ability to assign to a memory location that is dynamically computed. With frame variables, we added displacement mode operands, which could statically offset from the frame base pointer (whose location is statically known), and we could use this as an operand to a set! instruction. However, for structured data, we need something more general. We want the ability to (1) use an arbitrary location storing the base address, since we are dynamically allocating structured data, and passing that pointer around as a value, so the address itself could end up in any register or frame variable and (2) a dynamically determined offset from that base address, since we may want pass as an argument the index into vector. This is quite different from the abstractions we used to add frame variables.

We’ll assume a new abstraction, (mset! value value value), where the first operand is an arbitrary expression that evaluates to a base pointer (and is assumed to be produced from alloc), the second operand is an expression that evaluates to an offset in bytes from that base pointer, and the third operand evaluates to the value to be stored in the memory location computed by adding the offset to the base. As usual, we become responsible for implementing this abstraction at some point further down the compiler pipeline.

Using this abstraction, we can implement our constructors as specified. cons is meant to not only allocate, but also initialize the two words it allocated. For example, intuitively, we would transform (cons value_1 value_2).

(let ([x.1 (alloc 16)])
   (begin
     (mset! x.1 0 value_1)
     (mset! x.1 8 value_2)
     x.1))

First we allocate 2 words, then set the first word to contain the first value, and the second word to contain the second value.

However, we must also tag the pointer as a pair, to create a ptr. Previously, we used a combination of shifting and bitwise-and to tag data. This limited the range of our immediate data. To tag pointers, we need to similar limit the range of addresses that we get pointers to, to ensure some bits are available as tags. This is relatively easy to achieve—most operating systems enable requesting aligned pointers, ensuring the 3 low-order bits are 0. We could also over-allocate, then find an address in the allocated range that is aligned properly to consider our base. We’ll assume the run-time system is capable of giving us aligned pointers, so we avoid wasting memory through over allocation. However, this method means we get pointers whose high-order bits contain part of the address, and low-order bits are already zero—we do not need to shift the pointer. Instead, to tag, we combine the pointer with the tag bits using bitwise-ior and untag by using the bitwise-xor on the tagged pointer and the tag, which resets only the tag bits to 0.

For example, we generate an aligned pointer for pairs using (alloc 16). The primary tag for pairs is 1. We immediately tag the pointer, and anytime we access the pointer (for example, to initialize the pair), we must untag it first. For example, we would essentially implement (cons value_1 value_2) as:

(let ([x.1 (bitwise-ior (alloc 16) 1)])
  (begin
    (mset! (bitwise-xor x.1 1) 0 value_1)
    (mset! (bitwise-xor x.1 1) 8 value_2)
    x.1))

We can optimize some memory operations to avoid masking the pointer by taking advantage of pointer arithmetic. For example, (bitwise-ior (alloc 16) 1) is the same as (+ (alloc 16) 1). We can therefore adjust the index by -1 to access the base of the pointer, instead of untagging the pointer. Performing this optimization for pairs, we would instead implement (cons value_1 value_2) into

(let ([x.1 (+ (alloc 16) 1)])
   (begin
     (mset! x.1 -1 value_1)
     (mset! x.1 7 value_2)
     x.1))

The same optimization holds for vectors with different constants.

Finally, we need some way to actually access memory locations to implement the destructors for our data types. This requires similar functionality to mset!—it needs to take dynamically computed pointers and offsets in order to access a dynamically computed memory address. We’ll assume some abstraction (mref value value), where the first operand is the pointer and the second operand is the offset, and which returns the value stored at the pointer plus the offset. Using this we could implement (car value_1) as (mref value_1 -1), using the previous optimization to avoid the explicit untagging.

In summary, our strategy is to add three intermediate abstractions that can be used to easily create new data structures for its surface language.

These forms are:

(alloc value) allocates a number of bytes specified by value and returns a pointer to the base address of those bytes.
(mref value_base value_index) dereferences the pointer at value_base with the offset specified by value_index. Thinking in terms of pointer arithmetic, this dereferences (+ value_base value_index). The value of (+ value_base value_index) should always be word-aligned, i.e., a multiple of 8, and point to a uninitialized heap allocated value.
(mset! value_base value_index value) stores the value of value in the address (+ value_base value_index), i.e., in the address given by pointer at value_base with the offset specified by value_index. The value of (+ value_base value_index) should always be word-aligned, i.e., a multiple of 8.

To implement these new memory operations, or mops (pronounced em ops), we need to expose additional features from x64 and from the run-time system. Namely, we need low-level pointer operations that are different from those for frame variables, and we need run-time support for getting aligned addresses from the operating system to enable tagging pointers efficiently.

We then reuse our object tagging approach from the previous chapter to tag, untag, and tag check pointers to our new structured data types.

1.11.3 Memory Allocation and Access in x64

To implement the memory allocation and access abstractions we’ve just introduced, we need additional support from x64 and from the run-time system.

To access memory, we expose the index-mode operand from x64. It is written QWORD [reg_base + reg_offset], and refers to the memory location whose base pointer is stored in the register reg_base and whose index is stored in the register reg_offset. Unlike displacement-mode operands, both components of this operand are registers, so both the base and offset can be computed dynamically. We also enable a generalization of the displacement-mode operand, so we can access statically known offsets from base pointers other than the frame base pointer. This generalized operand, QWORD [reg_base + int32], adds some static offset to a base pointer to compute an address. Note that unlike the version used to implement frame variables, we add to (rather than subtract from) the base pointer, and the index is not restricted to a multiple of 8.

Each of these operands can be as part of move and arithmetic instructions. For example:

mov r10, 0

mov QWORD [r12 + r10], 0

add QWORD [r12 + r10], 1

mov rax, QWORD [r12 + r10]

moves 0 into the address r12 + r10, adds 1 to it, then moves the result into the return value register. It assumes r12 contains some valid pointer. Index-mode operands are subject to similar restrictions as with the displacement-mode operand. We formalize these restrictions in our definition of Paren-x64 v8.

To implement allocation, we need some strategy for managing memory. Our run-time system or compiler needs to know what memory is in use, how to allocate (mark free memory as in use), and how to deallocate memory (return in-use memory to the pool of free memory and ultimately to the system). There are many strategies for this, such as "let the user deal with it", "add a process to the run-time system that dynamically manages memory", or "make the type system so complex that the compiler can statically manage memory". Each of these has merits.

We choose an all together different approach: memory is infinite so just keep allocating. Of course, memory isn’t quite infinite, but if our programs are relatively short lived, and the operating system cleans up a process’s memory when it finishes running, then this approach will work and save us a lot of effort. This isn’t as unrealistic a strategy as it might as first appear; it’s used in practice by short-lived programs, such as some Unix command line tools and the control systems for some missiles.

Design digression:
In general, a language implementation might abstract access to the mmap system call for allocation, and implement a strategy (such as garbage collection) to deallocate memory that is no longer used. Garbage collection is tricky to implement so we avoid it for now to focus on implement structured data.
For a quick introduction to garbage collection, see this short video https://mastodon.social/@TartanLlama/109643267561353728.

To implement our allocation-only strategy, we need the run-time system to provide an initial base pointer from which we can start allocating. We reserve another register, the current-heap-base-pointer-register (abbreviated hbp). The run-time system initializes this register to point to the base of the heap, with all positive-integer indexes after this base as free memory. Allocation is implemented by copying the current value of this pointer, and incrementing it by the number of bytes we wish to allocate. The pointer must only be incremented by word-size multiples of bytes, to ensure the 3 low-order bits are 0 and the pointer can be tagged. Any other access to this register is now undefined behaviour, similar to accesses to fbp that do not obey the stack of frames discipline.

1.11.4 Implementing Structured Data

We design Exprs-lang v8 below. The language is large, as we include several new structured data types and their primitives.

Diff vs v7Exprs-lang v8

p	::=	(module (define x (lambda (x ...) value)) ... value)

value	::=	triv
	\|	(let ([x value] ...) value)
	\|	(if value value value)
	\|	(call value value ...)

triv	::=	x
	\|	fixnum
	\|	#t
	\|	#f
	\|	empty
	\|	(void)
	\|	(error uint8)
	\|	ascii-char-literal

x	::=	name?
	\|	prim-f

prim-f	::=	binop
	\|	unop

binop	::=	*
	\|	+
	\|	-
	\|	eq?
	\|	<
	\|	<=
	\|	>
	\|	>=

unop	::=	fixnum?
	\|	boolean?
	\|	empty?
	\|	void?
	\|	ascii-char?
	\|	error?
	\|	not
	\|	pair?
	\|	vector?
	\|	cons
	\|	car
	\|	cdr
	\|	make-vector
	\|	vector-length
	\|	vector-set!
	\|	vector-ref

fixnum	::=	int61?

uint8	::=	uint8?

ascii-char-literal	::=	ascii-char-literal?

p	::=	(module (define x (lambda (x ...) value)) ... value)

value	::=	triv
	\|	(let ([x value] ...) value)
	\|	(if value value value)
	\|	(call value value ...)

triv	::=	x
	\|	fixnum
	\|	#t
	\|	#f
	\|	empty
	\|	(void)
	\|	(error uint8)
	\|	ascii-char-literal

x	::=	name?
	\|	prim-f

prim-f	::=	binop
	\|	unop

binop	::=	*
	\|	+
	\|	-
	\|	eq?
	\|	<
	\|	<=
	\|	>
	\|	>=

unop	::=	fixnum?
	\|	boolean?
	\|	empty?
	\|	void?
	\|	ascii-char?
	\|	error?
	\|	not
	\|	pair?
	\|	vector?
	\|	cons
	\|	car
	\|	cdr
	\|	make-vector
	\|	vector-length
	\|	vector-set!
	\|	vector-ref

fixnum	::=	int61?

uint8	::=	uint8?

ascii-char-literal	::=	ascii-char-literal?

Since the number of primitive operations is growing, we simplify the syntax to only give primops, rather than distinguishing unops, binops, and so on, so we can easily group like primops with like.

As usual, our first task is to uniquify. Below we define the target language, Exprs-unique-lang v8.

Diff vs v7 (excerpts)Diff vs SourceExprs-unique-lang v8

p	::=	(module (define label (lambda (aloc ...) value)) ... value)

value	::=	triv
	\|	(call value value ...)
	\|	(let ([aloc value] ...) value)
	\|	(if value value value)

triv	::=	aloc
	\|	label
	\|	prim-f
	\|	fixnum
	\|	#t
	\|	#f
	\|	empty
	\|	(void)
	\|	(error uint8)
	\|	ascii-char-literal

prim-f	::=	binop
	\|	unop

binop	::=	*
	\|	+
	\|	-
	\|	eq?
	\|	<
	\|	eq?
	\|	<=
	\|	>
	\|	>=

unop	::=	fixnum?
	\|	boolean?
	\|	empty?
	\|	void?
	\|	ascii-char?
	\|	error?
	\|	not
	\|	pair?
	\|	vector?
	\|	cons
	\|	car
	\|	cdr
	\|	make-vector
	\|	vector-length
	\|	vector-set!
	\|	vector-ref

p	::=	(module (define label x (lambda (aloc x ...) value)) ... value)

value	::=	triv
	\|	(call value value ...)
	\|	(let ([aloc x value] ...) value)
	\|	(if value value value)
	\|	(call value value ...)

triv	::=	aloc
	\|	label
	\|	prim-f
	\|	x
	\|	fixnum
	\|	#t
	\|	#f
	\|	empty
	\|	(void)
	\|	(error uint8)
	\|	ascii-char-literal

x	::=	name?
	\|	prim-f

prim-f	::=	binop
	\|	unop

binop	::=	*
	\|	+
	\|	-
	\|	eq?
	\|	<
	\|	<=
	\|	>
	\|	>=

unop	::=	fixnum?
	\|	boolean?
	\|	empty?
	\|	void?
	\|	ascii-char?
	\|	error?
	\|	not
	\|	pair?
	\|	vector?
	\|	cons
	\|	car
	\|	cdr
	\|	make-vector
	\|	vector-length
	\|	vector-set!
	\|	vector-ref

aloc	::=	aloc?

label	::=	label?

fixnum	::=	int61?

uint8	::=	uint8?

ascii-char-literal	::=	ascii-char-literal?

p	::=	(module (define label (lambda (aloc ...) value)) ... value)

value	::=	triv
	\|	(call value value ...)
	\|	(let ([aloc value] ...) value)
	\|	(if value value value)

triv	::=	aloc
	\|	label
	\|	prim-f
	\|	fixnum
	\|	#t
	\|	#f
	\|	empty
	\|	(void)
	\|	(error uint8)
	\|	ascii-char-literal

prim-f	::=	binop
	\|	unop

binop	::=	*
	\|	+
	\|	-
	\|	eq?
	\|	<
	\|	<=
	\|	>
	\|	>=

unop	::=	fixnum?
	\|	boolean?
	\|	empty?
	\|	void?
	\|	ascii-char?
	\|	error?
	\|	not
	\|	pair?
	\|	vector?
	\|	cons
	\|	car
	\|	cdr
	\|	make-vector
	\|	vector-length
	\|	vector-set!
	\|	vector-ref

aloc	::=	aloc?

label	::=	label?

fixnum	::=	int61?

uint8	::=	uint8?

ascii-char-literal	::=	ascii-char-literal?

procedure
(uniquify p) → exprs-unique-lang-v8?
p : exprs-lang-v8?
Resolves top-level lexical identifiers into unique labels, and all other lexical identifiers into unique abstract locations.

Next, we implement new safe primitive operations. All the accessors for the new data types can result in undefined behaviour if used on the wrong ptr. Similarly, vector reference can access undefined values if the vector is constructed but never initialized. We fix this by wrapping each primitive to perform dynamic checks and remove undefined behaviour.

We design the target language, Exprs-unsafe-data-lang v8, below.

Diff vs v7Exprs-unsafe-data-lang v8

p	::=	(module (define label (lambda (aloc ...) value)) ... value)

value	::=	triv
	\|	(primop value ...)
	\|	(call value value ...)
	\|	(let ([aloc value] ...) value)
	\|	(if value value value)
	\|	(begin effect ... value)

effect	::=	(primop value ...)
	\|	(begin effect ... effect)

triv	::=	aloc
	\|	label
	\|	fixnum
	\|	#t
	\|	#f
	\|	empty
	\|	(void)
	\|	(error uint8)
	\|	ascii-char-literal

primop	::=	binop
	\|	unop

binop	::=	unsafe-fx*
	\|	unsafe-fx+
	\|	unsafe-fx-
	\|	eq?
	\|	unsafe-fx<
	\|	unsafe-fx<=
	\|	unsafe-fx>
	\|	unsafe-fx>=

unop	::=	fixnum?
	\|	boolean?
	\|	empty?
	\|	void?
	\|	ascii-char?
	\|	error?
	\|	not
	\|	pair?
	\|	vector?
	\|	cons
	\|	unsafe-car
	\|	unsafe-cdr
	\|	unsafe-make-vector
	\|	unsafe-vector-length
	\|	unsafe-vector-set!
	\|	unsafe-vector-ref

aloc	::=	aloc?

label	::=	label?

fixnum	::=	int61?

uint8	::=	uint8?

ascii-char-literal	::=	ascii-char-literal?

p	::=	(module (define label (lambda (aloc ...) value)) ... value)

value	::=	triv
	\|	(primop value ...)
	\|	(call value value ...)
	\|	(let ([aloc value] ...) value)
	\|	(if value value value)
	\|	(begin effect ... value)

effect	::=	(primop value ...)
	\|	(begin effect ... effect)

triv	::=	aloc
	\|	label
	\|	fixnum
	\|	#t
	\|	#f
	\|	empty
	\|	(void)
	\|	(error uint8)
	\|	ascii-char-literal

primop	::=	binop
	\|	unop

binop	::=	unsafe-fx*
	\|	unsafe-fx+
	\|	unsafe-fx-
	\|	eq?
	\|	unsafe-fx<
	\|	unsafe-fx<=
	\|	unsafe-fx>
	\|	unsafe-fx>=

unop	::=	fixnum?
	\|	boolean?
	\|	empty?
	\|	void?
	\|	ascii-char?
	\|	error?
	\|	not
	\|	pair?
	\|	vector?
	\|	cons
	\|	unsafe-car
	\|	unsafe-cdr
	\|	unsafe-make-vector
	\|	unsafe-vector-length
	\|	unsafe-vector-set!
	\|	unsafe-vector-ref

aloc	::=	aloc?

label	::=	label?

fixnum	::=	int61?

uint8	::=	uint8?

ascii-char-literal	::=	ascii-char-literal?

Note that in this language, we expose effect context and add begin. In the source, we have an effect procedure, namely vector-set!. The user must manually call impure functions and bind the result. The result of an effectful function is either void, or an error. The unsafe variant, unsafe-vector-set!, does not need to produce a value, so we expose effect context.

Introducing this contextual distinction, we require primops to respect contexts. Only primops that produce values can appear in value context, while only imperative primops (only unsafe-vector-set! so far) can be directly called in effect context. Otherwise, we’d need to support values in effect context sometime later, which we haven’t considered how to do yet.

To implement the safe language, we wrap all accessors to perform dynamic tag checking before using the unsafe operations. We also wrap unsafe-make-vector to initialize all elements to 0.

To implement the safe primops, it may be useful to abstract out the safety specification for each primitive operation. For example, we could write our compiler as a a function over the program and the following specification language.

; Symbol x Symbol x (List-of Parameter-Types)
; The first symbol is the name of a function in the source language.
; The second is either the name of a primop or a label in the target language implementing the
; behaviour safely, assuming well-typed parameters.
; The third is list of predicates, one for each argument to the source
; function, to check the parameters with. `any?` is specially recognized to
; not be checked.
(define prim-f-specs
  `((* unsafe-fx* (fixnum? fixnum?))
    (+ unsafe-fx+ (fixnum? fixnum?))
    (- unsafe-fx- (fixnum? fixnum?))
    (< unsafe-fx< (fixnum? fixnum?))
    (<= unsafe-fx<= (fixnum? fixnum?))
    (> unsafe-fx> (fixnum? fixnum?))
    (>= unsafe-fx>= (fixnum? fixnum?))

    (make-vector ,make-init-vector-label (fixnum?))
    (vector-length unsafe-vector-length (vector?))
    (vector-set! ,unsafe-vector-set!-label (vector? fixnum? any?))
    (vector-ref ,unsafe-vector-ref-label (vector? fixnum?))

    (car unsafe-car (pair?))
    (cdr unsafe-cdr (pair?))

    ,@(map (lambda (x) `(,x ,x (any?)))
           '(fixnum? boolean? empty? void? ascii-char? error? pair?
                     vector? not))
    ,@(map (lambda (x) `(,x ,x (any? any?)))
           '(cons eq?))))

procedure
(implement-safe-primops p) → exprs-unsafe-data-lang-v8?
p : exprs-unique-lang-v8?

Implement safe primitive operations by inserting procedure definitions for each primitive operation which perform dynamic tag checking, to ensure type and memory safety.

Finally, we must extend specify-representation to implement the new data structures and primitives. We design the target Exprs-bits-lang v8 below.

Diff vs v7 (excerpts)Diff vs SourceExprs-bits-lang v8

p	::=	(module (define label (lambda (aloc ...) value)) ... value)

pred	::=	(relop value value)
	\|	(true)
	\|	(false)
	\|	(not pred)
	\|	(let ([aloc value] ...) pred)
	\|	(if pred pred pred)
	\|	(begin effect ... pred)

value	::=	triv
	\|	(binop value value)
	\|	(mref value value)
	\|	(alloc value)
	\|	(call value value ...)
	\|	(let ([aloc value] ...) value)
	\|	(if pred value value)
	\|	(begin effect ... value)

effect	::=	(mset! value value value)
	\|	(begin effect ... effect)

p	::=	(module (define label (lambda (aloc ...) value)) ... value)

pred	::=	(relop value value)
	\|	(true)
	\|	(false)
	\|	(not pred)
	\|	(let ([aloc value] ...) pred)
	\|	(if pred pred pred)
	\|	(begin effect ... pred)

value	::=	triv
	\|	(binop value value)
	\|	(mref value value)
	\|	(alloc primop value ...)
	\|	(call value value ...)
	\|	(let ([aloc value] ...) value)
	\|	(if pred value value value)
	\|	(begin effect ... value)

effect	::=	(mset! value value value)
	\|	(primop value ...)
	\|	(begin effect ... effect)

primop	::=	binop
	\|	unop

unop	::=	fixnum?
	\|	boolean?
	\|	empty?
	\|	void?
	\|	ascii-char?
	\|	error?
	\|	not
	\|	pair?
	\|	vector?
	\|	cons
	\|	unsafe-car
	\|	unsafe-cdr
	\|	unsafe-make-vector
	\|	unsafe-vector-length
	\|	unsafe-vector-set!
	\|	unsafe-vector-ref

fixnum	::=	int61?

uint8	::=	uint8?

ascii-char-literal	::=	ascii-char-literal?

p	::=	(module (define label (lambda (aloc ...) value)) ... value)

pred	::=	(relop value value)
	\|	(true)
	\|	(false)
	\|	(not pred)
	\|	(let ([aloc value] ...) pred)
	\|	(if pred pred pred)
	\|	(begin effect ... pred)

value	::=	triv
	\|	(binop value value)
	\|	(mref value value)
	\|	(alloc value)
	\|	(call value value ...)
	\|	(let ([aloc value] ...) value)
	\|	(if pred value value)
	\|	(begin effect ... value)

effect	::=	(mset! value value value)
	\|	(begin effect ... effect)

triv	::=	aloc
	\|	label
	\|	int64

binop	::=	*
	\|	+
	\|	-
	\|	bitwise-and
	\|	bitwise-ior
	\|	bitwise-xor
	\|	arithmetic-shift-right

aloc	::=	aloc?

label	::=	label?

relop	::=	<
	\|	<=
	\|	=
	\|	>=
	\|	>
	\|	!=

int64	::=	int64?

We implement the new data structures using the approach described earlier, combining tagging with the new allocation and memory abstractions. We compile each constructor, namely cons and unsafe-make-vector, to alloc plus tagging, producing the tagged pointer as the value. We compile each destructor to mref, detagging or statically adjusting the index to the pointer first. Initialization, done by cons, and mutation, done by unsafe-vector-set!, are both compiled to mset!.

procedure
(specify-representation p) → exprs-bits-lang-v8?
p : exprs-unsafe-data-lang-v8?

Compiles data types and primitive operations into their implementations as ptrs and primitive bitwise operations on ptrs.

Examples:

> (specify-representation '(module (cons 5 6)))
'(module
   (let ((tmp.1 (+ (alloc 16) 1)))
     (begin (mset! tmp.1 -1 40) (mset! tmp.1 7 48) tmp.1)))
> (specify-representation '(module (unsafe-car (cons 5 6))))
'(module
   (mref
    (let ((tmp.2 (+ (alloc 16) 1)))
      (begin (mset! tmp.2 -1 40) (mset! tmp.2 7 48) tmp.2))
    -1))
> (specify-representation '(module (unsafe-vector-ref (unsafe-make-vector 3) 6)))
'(module
   (mref
    (let ((tmp.3 (+ (alloc 32) 3))) (begin (mset! tmp.3 -3 24) tmp.3))
    53))

1.11.5 Front-end Extensions

We’ve added effect context and a new effect higher in the compiler pipeline than previously. This requires some attention to a few of our front-end passes.

Our next pass, remove-complex-opera*, is responsible for imposing left-to right evaluation for operands and arguments, and explicitly binding all algebraic expressions. We design Values-bits-lang v8 below.

Diff vs v7 (excerpts)Diff vs SourceValues-bits-lang v8

p	::=	(module (define label (lambda (aloc ...) tail)) ... tail)

pred	::=	(relop opand opand)
	\|	(true)
	\|	(false)
	\|	(not pred)
	\|	(let ([aloc value] ...) pred)
	\|	(if pred pred pred)
	\|	(begin effect ... pred)

tail	::=	value
	\|	(let ([aloc value] ...) tail)
	\|	(if pred tail tail)
	\|	(call triv opand ...)
	\|	(begin effect ... tail)

value	::=	triv
	\|	(binop opand opand)
	\|	(mref aloc opand)
	\|	(alloc opand)
	\|	(let ([aloc value] ...) value)
	\|	(if pred value value)
	\|	(call triv opand ...)
	\|	(begin effect ... value)

effect	::=	(mset! aloc opand value)
	\|	(let ([aloc value] ...) effect)
	\|	(begin effect ... effect)

opand	::=	aloc
	\|	int64

p	::=	(module (define label (lambda (aloc ...) value tail)) ... value tail)

pred	::=	(relop value value opand opand)
	\|	(true)
	\|	(false)
	\|	(not pred)
	\|	(let ([aloc value] ...) pred)
	\|	(if pred pred pred)
	\|	(begin effect ... pred)

tail	::=	value
	\|	(let ([aloc value] ...) tail)
	\|	(if pred tail tail)
	\|	(call triv opand ...)

value	::=	triv
	\|	(binop value value opand opand)
	\|	(mref value value)
	\|	(alloc value)
	\|	(call value value ...)
	\|	(let ([aloc value] ...) value)
	\|	(if pred value value)
	\|	(begin effect ... value)
	\|	(call triv opand ...)

effect	::=	(mset! value value value)
	\|	(begin effect ... effect)

p	::=	(module (define label (lambda (aloc ...) tail)) ... tail)

pred	::=	(relop opand opand)
	\|	(true)
	\|	(false)
	\|	(not pred)
	\|	(let ([aloc value] ...) pred)
	\|	(if pred pred pred)
	\|	(begin effect ... pred)

tail	::=	value
	\|	(let ([aloc value] ...) tail)
	\|	(if pred tail tail)
	\|	(call triv opand ...)
	\|	(begin effect ... tail)

value	::=	triv
	\|	(binop opand opand)
	\|	(mref aloc opand)
	\|	(alloc opand)
	\|	(let ([aloc value] ...) value)
	\|	(if pred value value)
	\|	(call triv opand ...)
	\|	(begin effect ... value)

effect	::=	(mset! aloc opand value)
	\|	(let ([aloc value] ...) effect)
	\|	(begin effect ... effect)

opand	::=	aloc
	\|	int64

triv	::=	opand
	\|	label

binop	::=	*
	\|	+
	\|	-
	\|	bitwise-and
	\|	bitwise-ior
	\|	bitwise-xor
	\|	arithmetic-shift-right

relop	::=	<
	\|	<=
	\|	=
	\|	>=
	\|	>
	\|	!=

int64	::=	int64?

aloc	::=	aloc?

label	::=	label?

We add an effect context to support mset!, and a begin expression for convenience. Previously, all expressions in the Values-lang languages were pure—they evaluated and produced the same value regardless of in which order expressions were evaluated. We could freely reorder expressions, as long as we respected scope. Now, however, mset! modifies memory during its execution. It not safe to reorder expressions after an mset!. Furthermore, mset! does not return a useful value.

To deal with this, we introduce a contextual distinction in the language. We add the non-terminal effect to represent an impure computation. A effect represents an expression that does not have a value, and is executed only for its effect. We can use effect in certain expression contexts using begin. If we’re already in an impure context, that is, in a effect, then we can freely nest other effects.

This contextual distinction is similar to the one we introduce to distinguish tail calls from non-tail calls.

Supporting effect context requires paying attention to order when designing remove-complex-opera*, but does not significantly complicate anything.

procedure
(remove-complex-opera* p) → values-bits-lang-v8?
p : exprs-bits-lang-v8?
Performs the monadic form transformation, unnesting all non-trivial operators and operands, making data flow explicit and and simple to implement imperatively.

Next, we transform into imperative instructions. Now that effects can appear on the right-hand side of a let expression, it MAY not longer be safe to reorder them. This is a design choice: we could make it clear to the programmer that let does not guarantee a particular order of evaluation for its bindings, but then effects on the right-hand side lead to undefined behaviour. Or, we could impose a particular order, such as left-to-right, forbidding possible optimizations. A middle ground is to impose such an order only if any effects are detected in the right-hand side of a let (or rather, if we can guarantee no effects are present, because Rice still does not let us know for sure).

As usual, we choose to forbid exposing undefined behaviour to the source language, so design sequentialize-let to impose left-to-right (unless we know its safe to reorder).

We design the target language, Imp-mf-lang v8, below.

Diff vs v7 (excerpts)Diff vs Source (Excerpts)Imp-mf-lang v8

p	::=	(module (define label (lambda (aloc ...) tail)) ... tail)

pred	::=	(relop opand opand)
	\|	(true)
	\|	(false)
	\|	(not pred)
	\|	(begin effect ... pred)
	\|	(if pred pred pred)

tail	::=	value
	\|	(call triv opand ...)
	\|	(begin effect ... tail)
	\|	(if pred tail tail)

value	::=	triv
	\|	(call triv opand ...)
	\|	(binop opand opand)
	\|	(mref aloc opand)
	\|	(alloc opand)
	\|	(begin effect ... value)
	\|	(if pred value value)

effect	::=	(set! aloc value)
	\|	(mset! aloc opand value)
	\|	(begin effect ... effect)
	\|	(if pred effect effect)

p	::=	(module (define label (lambda (aloc ...) tail)) ... tail)

pred	::=	(relop opand opand)
	\|	(true)
	\|	(false)
	\|	(not pred)
	\|	(begin effect ... pred)
	\|	(let ([aloc value] ...) pred)
	\|	(if pred pred pred)
	\|	(begin effect ... pred)

tail	::=	value
	\|	(let ([aloc value] ...) tail)
	\|	(if pred tail tail)
	\|	(call triv opand ...)
	\|	(begin effect ... tail)
	\|	(if pred tail tail)

value	::=	triv
	\|	(call triv opand ...)
	\|	(binop opand opand)
	\|	(mref aloc opand)
	\|	(alloc opand)
	\|	(begin effect ... value)
	\|	(let ([aloc value] ...) value)
	\|	(if pred value value)
	\|	(call triv opand ...)
	\|	(begin effect ... value)

effect	::=	(set! aloc value)
	\|	(mset! aloc opand value)
	\|	(let ([aloc value] ...) effect)
	\|	(begin effect ... effect)
	\|	(if pred effect effect)

p	::=	(module (define label (lambda (aloc ...) tail)) ... tail)

pred	::=	(relop opand opand)
	\|	(true)
	\|	(false)
	\|	(not pred)
	\|	(begin effect ... pred)
	\|	(if pred pred pred)

tail	::=	value
	\|	(call triv opand ...)
	\|	(begin effect ... tail)
	\|	(if pred tail tail)

value	::=	triv
	\|	(call triv opand ...)
	\|	(binop opand opand)
	\|	(mref aloc opand)
	\|	(alloc opand)
	\|	(begin effect ... value)
	\|	(if pred value value)

effect	::=	(set! aloc value)
	\|	(mset! aloc opand value)
	\|	(begin effect ... effect)
	\|	(if pred effect effect)

opand	::=	aloc
	\|	int64

triv	::=	opand
	\|	label

binop	::=	*
	\|	+
	\|	-
	\|	bitwise-and
	\|	bitwise-ior
	\|	bitwise-xor
	\|	arithmetic-shift-right

relop	::=	<
	\|	<=
	\|	=
	\|	>=
	\|	>
	\|	!=

aloc	::=	aloc?

label	::=	label?

int64	::=	int64?

procedure
(sequentialize-let p) → imp-mf-lang-v8?
p : values-bits-lang-v8?
Picks a particular order to implement let expressions using set!.

With the addition of an mset!, normalize-bind must be updated. This operator enables the same kind of nesting as set!, and like set!, its operand needs to be normalized.

Next we design Proc-imp-cmf-lang v8

Diff vs v7Diff vs SourceProc-imp-cmf-lang v8

p	::=	(module (define label (lambda (aloc ...) entry)) ... entry)

entry	::=	tail

pred	::=	(relop opand opand)
	\|	(true)
	\|	(false)
	\|	(not pred)
	\|	(begin effect ... pred)
	\|	(if pred pred pred)

tail	::=	value
	\|	(call triv opand ...)
	\|	(begin effect ... tail)
	\|	(if pred tail tail)

value	::=	triv
	\|	(binop opand opand)
	\|	(mref aloc opand)
	\|	(alloc opand)
	\|	(call triv opand ...)

effect	::=	(set! aloc value)
	\|	(mset! aloc opand triv)
	\|	(begin effect ... effect)
	\|	(if pred effect effect)

opand	::=	aloc
	\|	int64

triv	::=	opand
	\|	label

binop	::=	*
	\|	+
	\|	-
	\|	bitwise-and
	\|	bitwise-ior
	\|	bitwise-xor
	\|	arithmetic-shift-right

relop	::=	<
	\|	<=
	\|	=
	\|	>=
	\|	>
	\|	!=

int64	::=	int64?

aloc	::=	aloc?

label	::=	label?

int64	::=	int64?

p	::=	(module (define label (lambda (aloc ...) entry tail)) ... entry tail)

entry	::=	tail

pred	::=	(relop opand opand)
	\|	(true)
	\|	(false)
	\|	(not pred)
	\|	(begin effect ... pred)
	\|	(if pred pred pred)

tail	::=	value
	\|	(call triv opand ...)
	\|	(begin effect ... tail)
	\|	(if pred tail tail)

value	::=	triv
	\|	(call triv opand ...)
	\|	(binop opand opand)
	\|	(mref aloc opand)
	\|	(alloc opand)
	\|	(call triv opand ...)
	\|	(begin effect ... value)
	\|	(if pred value value)

effect	::=	(set! aloc value)
	\|	(mset! aloc opand triv value)
	\|	(begin effect ... effect)
	\|	(if pred effect effect)

opand	::=	aloc
	\|	int64

triv	::=	opand
	\|	label

binop	::=	*
	\|	+
	\|	-
	\|	bitwise-and
	\|	bitwise-ior
	\|	bitwise-xor
	\|	arithmetic-shift-right

relop	::=	<
	\|	<=
	\|	=
	\|	>=
	\|	>
	\|	!=

int64	::=	int64?

aloc	::=	aloc?

label	::=	label?

int64	::=	int64?

p	::=	(module (define label (lambda (aloc ...) entry)) ... entry)

entry	::=	tail

pred	::=	(relop opand opand)
	\|	(true)
	\|	(false)
	\|	(not pred)
	\|	(begin effect ... pred)
	\|	(if pred pred pred)

tail	::=	value
	\|	(call triv opand ...)
	\|	(begin effect ... tail)
	\|	(if pred tail tail)

value	::=	triv
	\|	(binop opand opand)
	\|	(mref aloc opand)
	\|	(alloc opand)
	\|	(call triv opand ...)

effect	::=	(set! aloc value)
	\|	(mset! aloc opand triv)
	\|	(begin effect ... effect)
	\|	(if pred effect effect)

opand	::=	aloc
	\|	int64

triv	::=	opand
	\|	label

binop	::=	*
	\|	+
	\|	-
	\|	bitwise-and
	\|	bitwise-ior
	\|	bitwise-xor
	\|	arithmetic-shift-right

relop	::=	<
	\|	<=
	\|	=
	\|	>=
	\|	>
	\|	!=

int64	::=	int64?

aloc	::=	aloc?

label	::=	label?

Question: Why can’t (or shouldn’t) we allow the index position to also be a value?

procedure
(normalize-bind p) → proc-imp-cmf-lang-v8?
p : imp-mf-lang-v8?

Pushes set! and mset! under begin and if so that the right-hand-side of each is simple value-producing operand.

This normalizes Imp-mf-lang v8 with respect to the equations

(set! loc

(begin effect_1 ...

value))

(begin effect_1 ...

(set! loc value))

(set! loc

(if pred

value_1

value_2))

(if pred

(set! loc value_1)

(set! loc value_2))

(mset! loc opand

(begin effect_1 ...

value))

(begin effect_1 ...

(mset! loc opand value))

(mset! loc opand

(if pred

value_1

value_2))

(if pred

(mset! loc opand value_1)

(mset! loc opand value_2))

impose-calling-conventions and select-instructions require only minor changes, so we leave those changes as an exercise for the reader.

procedure
(impose-calling-conventions p) → imp-cmf-lang-v8?
p : proc-imp-cmf-lang-v8?
Compiles Proc-imp-cmf-lang v8 to Imp-cmf-lang v8 by imposing calling conventions on all calls (both tail and non-tail calls), and entry points.

procedure
(select-instructions p) → asm-alloc-lang-v8?
p : imp-cmf-lang-v8?
Selects appropriate sequences of abstract assembly instructions to implement the operations of the source language.

1.11.6 Implementing Allocation

As previously discussed, our allocation strategy is to simply grab the current heap base pointer, and increment beyond the bytes we want to use for our data structure. The question is where to implement this operation. We want to do this after we have access to physical locations, so after impose-calling-conventions, but before the register allocator passes, so we do not have to update those passes to know that alloc introduces a reference to a register. This puts the pass right between select-instructions and uncover-locals.

Below, we design Asm-alloc-lang v8, the source language for this pass. We typeset the differences compared to asm-pred-lang-v7.

Diff vs v7 (excerpts)Full

p	::=	(module info (define label info tail) ... tail)

info	::=	(#:from-contract (info/c (new-frames (frame ...))))

frame	::=	(aloc ...)

pred	::=	(relop loc opand)
	\|	(true)
	\|	(false)
	\|	(not pred)
	\|	(begin effect ... pred)
	\|	(if pred pred pred)

tail	::=	(jump trg loc ...)
	\|	(begin effect ... tail)
	\|	(if pred tail tail)

effect	::=	(set! loc triv)
	\|	(set! loc_1 (binop loc_1 opand))
	\|	(set! loc_1 (mref loc_2 index))
	\|	(set! loc (alloc index))
	\|	(mset! loc index triv)
	\|	(begin effect ... effect)
	\|	(if pred effect effect)
	\|	(return-point label tail)

index	::=	loc
	\|	int64

int32	::=	int32?

p	::=	(module info (define label info tail) ... tail)

info	::=	(#:from-contract (info/c (new-frames (frame ...))))

frame	::=	(aloc ...)

pred	::=	(relop loc opand)
	\|	(true)
	\|	(false)
	\|	(not pred)
	\|	(begin effect ... pred)
	\|	(if pred pred pred)

tail	::=	(jump trg loc ...)
	\|	(begin effect ... tail)
	\|	(if pred tail tail)

effect	::=	(set! loc triv)
	\|	(set! loc_1 (binop loc_1 opand))
	\|	(set! loc_1 (mref loc_2 index))
	\|	(set! loc (alloc index))
	\|	(mset! loc index triv)
	\|	(begin effect ... effect)
	\|	(if pred effect effect)
	\|	(return-point label tail)

opand	::=	loc
	\|	int64

triv	::=	opand
	\|	label

loc	::=	rloc
	\|	aloc

trg	::=	loc
	\|	label

index	::=	loc
	\|	int64

binop	::=	*
	\|	+
	\|	-
	\|	bitwise-and
	\|	bitwise-ior
	\|	bitwise-xor
	\|	arithmetic-shift-right

relop	::=	<
	\|	<=
	\|	=
	\|	>=
	\|	>
	\|	!=

int64	::=	int64?

int32	::=	int32?

aloc	::=	aloc?

label	::=	label?

rloc	::=	register?
	\|	fvar?

This language contains our intermediate mops, including (set! loc (alloc index)). This is the low-level form of our allocation operation.

We design the target language, Asm-pred-lang v8, below. This language removes the (alloc index) form and is the highest-level language parameterized by current-heap-base-pointer-register. This language, and all languages between it and x64, assumes all accesses to hbp obey the restrictions described in Paren-x64 v8.

Diff vs Source (excerpts)Diff vs v7 (excerpts)Asm-pred-lang v8

p	::=	(module info (define label info tail) ... tail)

info	::=	(#:from-contract (info/c (new-frames (frame ...))))

frame	::=	(aloc ...)

pred	::=	(relop loc opand)
	\|	(true)
	\|	(false)
	\|	(not pred)
	\|	(begin effect ... pred)
	\|	(if pred pred pred)

tail	::=	(jump trg loc ...)
	\|	(begin effect ... tail)
	\|	(if pred tail tail)

effect	::=	(set! loc triv)
	\|	(set! loc_1 (binop loc_1 opand))
	\|	(set! loc_1 (mref loc_2 index))
	\|	(set! loc (alloc index))
	\|	(mset! loc index triv)
	\|	(begin effect ... effect)
	\|	(if pred effect effect)
	\|	(return-point label tail)

index	::=	loc
	\|	int64

p	::=	(module info (define label info tail) ... tail)

info	::=	(#:from-contract (info/c (new-frames (frame ...))))

frame	::=	(aloc ...)

pred	::=	(relop loc opand)
	\|	(true)
	\|	(false)
	\|	(not pred)
	\|	(begin effect ... pred)
	\|	(if pred pred pred)

tail	::=	(jump trg loc ...)
	\|	(begin effect ... tail)
	\|	(if pred tail tail)

effect	::=	(set! loc triv)
	\|	(set! loc_1 (binop loc_1 opand))
	\|	(set! loc_1 (mref loc_2 index))
	\|	(mset! loc index triv)
	\|	(begin effect ... effect)
	\|	(if pred effect effect)
	\|	(return-point label tail)

index	::=	loc
	\|	int64

p	::=	(module info (define label info tail) ... tail)

info	::=	(#:from-contract (info/c (new-frames (frame ...))))

frame	::=	(aloc ...)

pred	::=	(relop loc opand)
	\|	(true)
	\|	(false)
	\|	(not pred)
	\|	(begin effect ... pred)
	\|	(if pred pred pred)

tail	::=	(jump trg loc ...)
	\|	(begin effect ... tail)
	\|	(if pred tail tail)

effect	::=	(set! loc triv)
	\|	(set! loc_1 (binop loc_1 opand))
	\|	(set! loc_1 (mref loc_2 index))
	\|	(mset! loc index triv)
	\|	(begin effect ... effect)
	\|	(if pred effect effect)
	\|	(return-point label tail)

opand	::=	loc
	\|	int64

triv	::=	opand
	\|	label

loc	::=	rloc
	\|	aloc

trg	::=	loc
	\|	label

index	::=	loc
	\|	int64

binop	::=	*
	\|	+
	\|	-
	\|	bitwise-and
	\|	bitwise-ior
	\|	bitwise-xor
	\|	arithmetic-shift-right

relop	::=	<
	\|	<=
	\|	=
	\|	>=
	\|	>
	\|	!=

int64	::=	int64?

int32	::=	int32?

aloc	::=	aloc?

label	::=	label?

rloc	::=	register?
	\|	fvar?

Intuitively, we will transform each `(set! ,loc (alloc ,index)) into

`(begin
(set! ,loc ,hbp)
(set! ,hbp (+ ,hbp ,index)))

procedure
(expose-allocation-pointer p) → asm-pred-lang-v8?
p : asm-alloc-lang-v8?
Implements the allocation primitive in terms of pointer arithmetic on the current-heap-base-pointer-register.

1.11.7 Implementing mops

The new mops require minor changes to most of the pipeline between Imp-cmf-lang-v8 to Para-asm-lang v8, where we will start implementing these abstractions in the low-level languages.

Like when we implemented fvars to support working with frame locations, we added mops to simplify working with heap addresses. We want to keep that simplification around as long as possible, to avoid complicating the already complex logic for rewriting set! instructions.

The last pass that needs to do non-trivial rewriting of set! is patch-instructions, and it would benefit from that simplification. So we decide to leave mops in place and implement them after patch-instruction.

Below, we design Para-asm-lang v8, the new source language for patch-instructions.

Diff vs v7 (excerpts)Para-asm-lang v8

p	::=	(begin s ...)

s	::=	(set! loc triv)
	\|	(set! loc_1 (binop loc_1 opand))
	\|	(set! loc_1 (mref loc_2 index))
	\|	(mset! loc_1 index triv)
	\|	(with-label label s)
	\|	(jump trg)
	\|	(compare loc opand)
	\|	(jump-if relop trg)

triv	::=	opand
	\|	label

loc	::=	reg
	\|	addr

index	::=	loc
	\|	int64

addr	::=	(fbp - dispoffset)

fbp	::=	frame-base-pointer-register?

int32	::=	int32?

dispoffset	::=	dispoffset?

p	::=	(begin s ...)

s	::=	(set! loc triv)
	\|	(set! loc_1 (binop loc_1 opand))
	\|	(set! loc_1 (mref loc_2 index))
	\|	(mset! loc_1 index triv)
	\|	(with-label label s)
	\|	(jump trg)
	\|	(compare loc opand)
	\|	(jump-if relop trg)

trg	::=	loc
	\|	label

triv	::=	opand
	\|	label

opand	::=	loc
	\|	int64

loc	::=	reg
	\|	addr

index	::=	loc
	\|	int64

reg	::=	rsp
	\|	rbp
	\|	rax
	\|	rbx
	\|	rcx
	\|	rdx
	\|	rsi
	\|	rdi
	\|	r8
	\|	r9
	\|	r12
	\|	r13
	\|	r14
	\|	r15

addr	::=	(fbp - dispoffset)

fbp	::=	frame-base-pointer-register?

binop	::=	*
	\|	+
	\|	-
	\|	bitwise-and
	\|	bitwise-ior
	\|	bitwise-xor
	\|	arithmetic-shift-right

relop	::=	<
	\|	<=
	\|	=
	\|	>=
	\|	>
	\|	!=

int64	::=	int64?

int32	::=	int32?

label	::=	label?

dispoffset	::=	dispoffset?

The target language, Paren-x64-mops v8, is given below.

Diff vs Source (excerpts)Diff vs Paren-x64 v7Paren-x64-mops v8

p	::=	(begin s ...)

s	::=	(set! addr int32)
	\|	(set! addr trg)
	\|	(set! reg loc)
	\|	(set! reg loc triv)
	\|	(set! reg_1 (binop reg_1 int32))
	\|	(set! reg_1 loc_1 (binop reg_1 loc loc_1 opand))
	\|	(set! reg_1 (mref reg_2 index))
	\|	(mset! reg_1 index int32)
	\|	(mset! reg_1 index trg)
	\|	(with-label label s)
	\|	(jump trg)
	\|	(compare reg loc opand)
	\|	(jump-if relop label trg)

triv	::=	trg
	\|	int64
	\|	opand
	\|	label

loc	::=	reg
	\|	addr

index	::=	int32
	\|	reg

addr	::=	(fbp - dispoffset)

fbp	::=	frame-base-pointer-register?

int32	::=	int32?

dispoffset	::=	dispoffset?

p	::=	(begin s ...)

s	::=	(set! addr int32)
	\|	(set! addr trg)
	\|	(set! reg loc)
	\|	(set! reg triv)
	\|	(set! reg_1 (binop reg_1 int32))
	\|	(set! reg_1 (binop reg_1 loc))
	\|	(set! reg_1 (mref reg_2 index))
	\|	(mset! reg_1 index int32)
	\|	(mset! reg_1 index trg)
	\|	(with-label label s)
	\|	(jump trg)
	\|	(compare reg opand)
	\|	(jump-if relop label)

triv	::=	trg
	\|	int64

loc	::=	reg
	\|	addr

index	::=	int32
	\|	reg

addr	::=	(fbp - dispoffset)

fbp	::=	frame-base-pointer-register?

int32	::=	int32?

dispoffset	::=	dispoffset?

p	::=	(begin s ...)

s	::=	(set! addr int32)
	\|	(set! addr trg)
	\|	(set! reg loc)
	\|	(set! reg triv)
	\|	(set! reg_1 (binop reg_1 int32))
	\|	(set! reg_1 (binop reg_1 loc))
	\|	(set! reg_1 (mref reg_2 index))
	\|	(mset! reg_1 index int32)
	\|	(mset! reg_1 index trg)
	\|	(with-label label s)
	\|	(jump trg)
	\|	(compare reg opand)
	\|	(jump-if relop label)

trg	::=	reg
	\|	label

triv	::=	trg
	\|	int64

opand	::=	reg
	\|	int64

loc	::=	reg
	\|	addr

index	::=	int32
	\|	reg

reg	::=	rsp
	\|	rbp
	\|	rax
	\|	rbx
	\|	rcx
	\|	rdx
	\|	rsi
	\|	rdi
	\|	r8
	\|	r9
	\|	r10
	\|	r11
	\|	r12
	\|	r13
	\|	r14
	\|	r15

addr	::=	(fbp - dispoffset)

fbp	::=	frame-base-pointer-register?

binop	::=	*
	\|	+
	\|	-
	\|	bitwise-and
	\|	bitwise-ior
	\|	bitwise-xor
	\|	arithmetic-shift-right

relop	::=	<
	\|	<=
	\|	=
	\|	>=
	\|	>
	\|	!=

int64	::=	int64?

int32	::=	int32?

label	::=	label?

dispoffset	::=	dispoffset?

As usual, patch-instructions makes a lot of changes to the operands of each instructions.

By leaving mops separate forms, we only need to patch the new instructions, and leave patching of old instructions untouched. We do not need to pay attention to and patch a new kind of operand.

The main restrictions on mops are that the operands must be registers, and that the index must be a 32-bit integer. The operands for mops have to be registers since they are compiled to low-level instructions where one operand is implicitly an index-mode operand. Similar to patching displacement-mode operands, both operands cannot be reference to memory. The index must be patches to 32-bit literals only for the same reason that binary operations must only contain 32-bit literals. This makes index and opand coincide, syntactically, but they are conceptually different so we maintain separate non-terminal definitions.

procedure
(patch-instructions p) → paren-x64-mops-v8?
p : para-asm-lang-v8?
Patches instructions that have no x64 analogue into to a sequence of instructions and an auxiliary register from current-patch-instructions-registers.

Finally, we can translate mops into index-mode operands and displacement-mode operands. Note that the compiler pass doesn’t need to know which it is generating; their syntax is the same except for type of the second operand. The target language is Paren-x64 v8, which we define below.

Diff vs SourceDiff vs v7Paren-x64 v8

p	::=	(begin s ...)

s	::=	(set! addr int32)
	\|	(set! addr trg)
	\|	(set! reg loc)
	\|	(set! reg triv)
	\|	(set! reg_1 (binop reg_1 int32))
	\|	(set! reg_1 (binop reg_1 loc))
	\|	(set! reg_1 (mref reg_2 index))
	\|	(mset! reg_1 index int32)
	\|	(mset! reg_1 index trg)
	\|	(with-label label s)
	\|	(jump trg)
	\|	(compare reg opand)
	\|	(jump-if relop label)

addr	::=	(fbp - dispoffset)
	\|	(reg + int32)
	\|	(reg + reg)

p	::=	(begin s ...)

s	::=	(set! addr int32)
	\|	(set! addr trg)
	\|	(set! reg loc)
	\|	(set! reg triv)
	\|	(set! reg_1 (binop reg_1 int32))
	\|	(set! reg_1 (binop reg_1 loc))
	\|	(with-label label s)
	\|	(jump trg)
	\|	(compare reg opand)
	\|	(jump-if relop label)

addr	::=	(fbp - dispoffset)
	\|	(reg + int32)
	\|	(reg + reg)

p	::=	(begin s ...)

s	::=	(set! addr int32)
	\|	(set! addr trg)
	\|	(set! reg loc)
	\|	(set! reg triv)
	\|	(set! reg_1 (binop reg_1 int32))
	\|	(set! reg_1 (binop reg_1 loc))
	\|	(with-label label s)
	\|	(jump trg)
	\|	(compare reg opand)
	\|	(jump-if relop label)

trg	::=	reg
	\|	label

triv	::=	trg
	\|	int64

opand	::=	reg
	\|	int64

loc	::=	reg
	\|	addr

reg	::=	rsp
	\|	rbp
	\|	rax
	\|	rbx
	\|	rcx
	\|	rdx
	\|	rsi
	\|	rdi
	\|	r8
	\|	r9
	\|	r10
	\|	r11
	\|	r12
	\|	r13
	\|	r14
	\|	r15

addr	::=	(fbp - dispoffset)
	\|	(reg + int32)
	\|	(reg + reg)

fbp	::=	frame-base-pointer-register?

binop	::=	*
	\|	+
	\|	-
	\|	bitwise-and
	\|	bitwise-ior
	\|	bitwise-xor
	\|	arithmetic-shift-right

relop	::=	<
	\|	<=
	\|	=
	\|	>=
	\|	>
	\|	!=

int64	::=	int64?

int32	::=	int32?

dispoffset	::=	dispoffset?

label	::=	label?

The language contains a new addr representing the x64 index-mode operand (reg + reg). This supports accessing a memory location by the index stored in another register. For example, in x64, we represent loading the nth element of an array into r10 using mov r10 [r11 + r12], where the base of the array is stored at r11 and the value of n is stored in r12.

The index-mode operand is not restricted to use a particular register, unlike the displacement-mode operand from Paren-x64 v7.

procedure
(implement-mops p) → paren-x64-v8?
p : paren-x64-mops-v8?
Compiles mops to instructions on pointers with index- and displacement-mode operands.

Finally, we update generate-x64 to emit the string representation of the index-mode operand.

procedure
(generate-x64 p) → string?
p : paren-x64-v8?
Compile the Paren-x64 v8 program into a valid sequence of x64 instructions, represented as a string.

1.11.8 Appendix: Overview

Figure 9: Overview of Compiler Version 8

contents ← prev up next →

1	Book
2	Project Logistics
3	Project Iterations
4	CPSC 411 Infrastructure Package
5	Credits

1.1	A Compiler Begins with a Language
1.2	Abstracting Boilerplate (v1)
1.3	Abstract Locations
1.4	Value Orientation
1.5	Register Allocation
1.6	Structured Control Flow
1.7	Procedural Abstraction: Call
1.8	Procedural Abstraction: Return
1.9	Algebraic Expressions
1.10	Data types: Immediates
1.11	Data types: Structured Data and Heap Allocation
1.12	First-class Procedures: Code is Data
1.13	Recursive Data
1.14	Syntactic Sugar
1.15	Appendix: Racket Preliminaries
1.16	Appendix: Compiler Design Recipe

1.11.1	Preface: What’s wrong with Exprs-Lang v7
1.11.2	Implementing Structured Data
1.11.3	Memory Allocation and Access in x64
1.11.4	Implementing Structured Data
1.11.5	Front-end Extensions
1.11.6	Implementing Allocation
1.11.7	Implementing mops
1.11.8	Appendix: Overview