1.10 Data types: Immediates

1.10.1 Preface: What’s wrong with Exprs-lang v6.5

Exprs-lang v6.5 gained the ability to nest expressions pretty much arbitrarily. However, our functions are still limited to work on machine integers. A realistic language would allow us to express programs over more interesting data types than mere machine integers.

Unfortunately, once we add data types, we have a problem distinguishing between any two data types. Everything is eventually represented as 64-bit integers in x64. We need some way to distinguish a collection of 64 bits as an integer, from a boolean, from the empty list, from a pointer to a procedure. Otherwise, we will have serious problems: undefined behaviours, unsafe casts, and/or memory safety errors.

This is not only a problem for ruling out unsafe behaviour in the source language. A static type system could take care to prevent the user from calling an integer as a procedure, for example. However, the language itself may need to distinguish different kinds of data at run time. For example, a garbage collector may need to traverse data structures, but not immediate data like integers; a pretty printer may want to print different data differently.

To enable the language to distinguish different kinds of data, we can steal a few of our 64 bits to represent a data type tag. This limits the range of machine integers, but allows us to us to distinguish data dynamically, enabling safety and abstractions that must differentiate data dynamically.

Our goal in this assignment is to implement the following language, Exprs-lang v7.

Exprs-lang v7Diff vs v6.5

p	::=	(module (define x (lambda (x ...) value)) ... value)

value	::=	triv
	\|	(let ([x value] ...) value)
	\|	(if value value value)
	\|	(call value value ...)

triv	::=	x
	\|	fixnum
	\|	#t
	\|	#f
	\|	empty
	\|	(void)
	\|	(error uint8)
	\|	ascii-char-literal

x	::=	name?
	\|	prim-f

prim-f	::=	binop
	\|	unop

binop	::=	*
	\|	+
	\|	-
	\|	eq?
	\|	<
	\|	<=
	\|	>
	\|	>=

unop	::=	fixnum?
	\|	boolean?
	\|	empty?
	\|	void?
	\|	ascii-char?
	\|	error?
	\|	not

fixnum	::=	int61?

uint8	::=	uint8?

ascii-char-literal	::=	ascii-char-literal?

p	::=	(module (define x (lambda (x ...) value)) ... value)

pred	::=	(relop triv triv)
	\|	(true)
	\|	(false)
	\|	(not pred)
	\|	(let ([x value] ...) pred)
	\|	(if pred pred pred)

value	::=	triv
	\|	(binop value value)
	\|	(let ([x value] ...) value)
	\|	(if value pred value value)
	\|	(call value x value ...)

triv	::=	x
	\|	fixnum
	\|	#t
	\|	#f
	\|	empty
	\|	(void)
	\|	(error uint8)
	\|	ascii-char-literal

triv	::=	x
	\|	int64

x	::=	name?
	\|	prim-f

prim-f	::=	binop
	\|	unop

binop	::=	*
	\|	+
	\|	-
	\|	eq?
	\|	<
	\|	<=
	\|	>
	\|	>=

unop	::=	fixnum?
	\|	boolean?
	\|	empty?
	\|	void?
	\|	ascii-char?
	\|	error?
	\|	not

fixnum	::=	int61?

uint8	::=	uint8?

ascii-char-literal	::=	ascii-char-literal?

relop	::=	<
	\|	<=
	\|	=
	\|	>=
	\|	>
	\|	!=

int64	::=	int64?

We add a bunch of new values in Exprs-lang v7, including booleans, the empty list, the void object, (printable) ASCII character literals, and an error value indicating an exit code.

We restrict ASCII characters to the printable ones, so we don’t have to figure out how to print non-printable characters. The run-time system will work with some non-printable characters, but the results will not be converted to Racket properly.

We want to allow Exprs-lang v7 programs to return any of these values, which requires additional support from the run-time system. The run-time system must be able to distinguish these different values in order to pretty print or otherwise return them to the user (such as via an exit code in the case of the error value). This means our run-time representation of values cannot just be 64-bit machine integers; we need some way for the run-time system to differentiate values dynamically.

The support libraries include an implementation of this new run-time system in cpsc411/ptr-run-time, which supports pretty printing all the new values. cpsc411/compiler-lib also contains parsers for the values. execute takes a new optional second parameter, which can be used to change your view of the result. By default, it parsers the printed value as a Racket value. You can pass nasm-run/print-string to get back the string representation of the result, which will be handy to view the result of (void) correctly. You can pass nasm-run/exit-code to get the exit code, which is helpful for viewing the result of (error uint8), which sets the exit code to uint8.

We also add new primitive operations, primarily predicates on our new data types. The interpretation of several operations will also change to add dynamic type checking. This will prevent those operations from being a source of undefined behaviour.

With proper booleans, we can finally allow an arbitrary value in the predicate position of an if expression in the surface language.

1.10.2 Introduction to Tagged Object Representation

The key challenge we need to solve in adding data types to our language is how to implement data-type checking. That is, given some value triv, branch on whether it is a particular type data type such as a boolean or a number: (if (fixnum? triv) value_1 value_2) and (if (boolean? triv) value_1 value_2) should be supported. Ideally, each data type is unique: (boolean? triv) and (fixnum? triv) should never both be true.

These don’t necessarily need to be dynamic checks, as long as they are somehow expressible by the compiler, whether as expressions that dynamically check or meta-data that is statically checked. However, even in a statically typed language, we may need some ability for the run-time system to distinguish different kinds of data dynamically.

Depending on how these type checks are implement, this alone is enough to let us implement different data types safely. We could use procedural abstraction to wrap existing binary operations to ensure, e.g., that + is only operating on numbers, and not booleans. We could expand if to check that the value in predicate position is both a boolean (that is, that (boolean? triv) is equal to some value using a relop), and that the value is equal to some representation of true.

A common approach to efficiently represent word-sized data types is called object tagging. This lets us implement data-type checking by providing the following primitive operations (which we implement with further, low-level primitives in x64):

Tagging, i.e., given some machine integer, tag it to indicate what data type it represents, producing a tagged representation of the underlying data. This tagged representation will happen to correspond to some machine integer, since all sequences of bits do, but maybe not in any meaningful way.
Untagging, i.e., given some tagged representation, remove the tag returning the underlying data that can be used with primitive x64 instructions.
Tag checking, i.e., given some tagged representation, get the tag or compare the tag to something.

So far, we’ve treated all of our data as 64-bit integers. Now, we treat all our data as 64 bits, which we manipulate to implement various data types (including some fixed-sized integers). We choose a particular representation of object tagging below that has some nice properties, and implement each of these operation by expanding them into bitwise operations.

Some of our clever tag choices, and some terminology, is borrowed from R. Kent Dybvig. You can learn more about its use in Chez Scheme from this talk by his former PhD student Andy Keep: https://www.youtube.com/watch?v=BcC3KScZ-yA.

Each data type in our language will now be represented by a ptr (pronounced like footer). A ptr is a machine word whose n least-significant bits represent the primary tag, and whose upper (- (* 8 (current-word-size-bytes)) n) bits represent the data.

In our setting, we have 64-bit words and will use the 3 least-significant bits for primary tags. With 3 bits, we can represent 8 primary data types in a ptr. We want to reserve these for the most common and most performance critical data that can fit in a ptr, but we have additional constraints. This will limit the size of data we can represent, but gives us additional data types—probably a good trade off.

If we want more than 8 data types, which most languages support, we must reserve one tag for "everything else" (a pointer to memory). The structure in memory has infinite space in which to store additional tags. This seems to leave us only 7 possible data types without going to memory.

Some data types require fewer than 64 bits, and we can exploit this fact to gain a few extra data types. For example, booleans really only require 1 bit. The empty list and the void object only require a tag, since they do not contain any data. ASCII characters only require 8 bits. We can make all of these share a single primary tag, and steal some of the unnecessary high-order bits for a secondary tag. This gives us 4 data types for the price of only 1 primary tag—what a bargain!

This leaves us 6 primary tags. One obvious choice is to use one for fixed sized integers; these will now be 61-bit integers, which we call fixnums. Integer operations are fast, and we don’t want to slow them down by making them go through heap allocation and pointer indirection. If we wanted to support graphics and scientific operations, we would reserve one for floating-point numbers too.

We reserve the remaining primary tags for tagged pointers, pointers to common heap-allocated structures. Storing the tag on the pointer instead of in memory avoids the expense of an additional memory access to check a tag and the additional memory overhead of storing the tag in memory. We’ll address the implementation of tagged pointers and heap-allocated structured data in the next chapter.

Here is the default set of tags we will use in this assignment, given in base 2.

#b000, fixnums, fixed-sized integers
#b001, unused
#b010, unused
#b011, unused
#b100, unused
#b101, unused
#b110, non-fixnum immediates (booleans, etc)
#b111, unused

Racket supports binary literals, and automatically interprets them as two’s complement integers. If you type #b111, you get back 7, and the two values are equal?. This will be helpful when writing this assignment and writing tests.

For the non-fixnum immediates, we use the following secondary tags. Note that the 3 least-significant bits of secondary tags are the shared primary tag.

#b00000110, for #f
#b00001110, for #t
#b00010110, for empty
#b00011110, for (void)
#b00101110, for an ASCII character
#b00111110, for the error exit-code value

For annoying technical reasons, our representation of the empty list differs slightly from Racket. We use empty, while Racket uses '().

The integer 7 is #b111 in base 2, and would be represented, in base 2, as the ptr #b111000. The three low-order bits are the tag #b000, and the high-order bits are the data #b111 (implicitly padded with 0 to 64-bits).

To implement untagging, we can simply right-shift the ptr by the size of a primary tag to get a two’s complement integer:

Example:

> (arithmetic-shift 56 -3)
7

All these notes are typeset using base-2 literals, but Racket cannot distinguish them from integers, so they may get accidentally rendered in base 10.

A handy fact about the choice of tag for fixnums is that any number n is represented as (* 8 n). This fact allows for implementing fast fixnum operations on the ptr representation directly. That is, we can implement binary operations on fixnums without detagging followed by tagging, by relying on some algebraic facts about operations on integers.

Similarly, the ASCII character 7 would be as a ptr, in base 2, #b011011100101110.

See https://en.wikipedia.org/wiki/ASCII#Printable_characters for the representation of ASCII characters.

For character operations, to detag we right-shift by the size of a secondary tag (8). The character #b0110111 is the ASCII representation of 7. Combined with its tag, this is #b011011100101110, which reads in Racket as 14126.

Example:

> (integer->char (arithmetic-shift 14126 -8))
#\7

We can implement tag checks using bitwise masking. For example, to check whether a ptr is a fixnum, we mask the ptr with tag #b111 (7) and compare the result to the primary tag for fixnums, #b000 (0). For example, we can ask whether the above two examples are fixnums (and whether the latter is a character) as follows:

Examples:

> (eq? (bitwise-and 56 7) 0)
#t
> (eq? (bitwise-and 14126 7) 0)
#f
> (eq? (bitwise-and 14126 255) 46)
#t

Our representation of the number 7 is a fixnum, while the representation of the character 7 is not.

The representation of booleans is a bit tricky. We can think of booleans either as the 61-bit values 0 and 1, with tag #b110, or as two separate secondary tags (with no payload) #b00001110 and #b00000110. I typically think in the the latter interpretation, where we represent #t and #f as separate data types, whose tags are #b1110 (14) and #b110 (6).

To implement tag checking for booleans, we use the mask #b11110111 (247), and compare to the tag #b110. This succeed regardless of whether the fourth bit is set, as long as all the other secondary tag bits are 0.

Examples:

> (eq? (bitwise-and 14 247) 6)
#t
> (eq? (bitwise-and 6 247) 6)
#t

To extract the payload for booleans, all we really need to know is whether the boolean is false. We can perform this test without untagging, merely by comparing to #b110 (with all other bits 0, implicitly). This gives us the same result as explicitly untagging, and compares to 0.

Examples:

> (eq? (arithmetic-shift 6 -3) 0)
#t
> (eq? (arithmetic-shift 14 -3) 0)
#f
> (eq? 6 6)
#t
> (eq? 14 6)
#f

To implement branching on booleans in terms of predicates, we can compile a boolean value to the predicate (!= value 6).

The representation of error, which represents an exit code, is inefficient. We only require 8 bits of data for the error code, so 24 bits are wasted. But we’re not adding more data types with secondary tags, so it is not worth over-engineering. Besides, a better error data type would use at least a string payload, which requires allocating space on the heap anyway.

The particular choice of tags is not important for correctness, although clever choices like above can help us implement common operations more efficiently. We therefore should introduce abstractions to keep the compiler abstract with respect to particular tag choices, so we can change our minds later if a better representation occurs to us.

To implement ptrs, we need bitwise operations such as arithmetic-shift so that we can implement tag checking, tagging, and untagging. This means we need to expose bitwise operations from x64, all the way through the compiler pipeline. This is not very interesting, so we begin by implementing tagged objects in terms of these, and then extending the rest of the compiler to expose bitwise operations.

1.10.3 Extending the source and front-end passes

We start by discussing the design of Exprs-lang v7.

Diff vs v6.5Exprs-lang v7

p	::=	(module (define x (lambda (x ...) value)) ... value)

pred	::=	(relop triv triv)
	\|	(true)
	\|	(false)
	\|	(not pred)
	\|	(let ([x value] ...) pred)
	\|	(if pred pred pred)

value	::=	triv
	\|	(binop value value)
	\|	(let ([x value] ...) value)
	\|	(if value pred value value)
	\|	(call value x value ...)

triv	::=	x
	\|	fixnum
	\|	#t
	\|	#f
	\|	empty
	\|	(void)
	\|	(error uint8)
	\|	ascii-char-literal

triv	::=	x
	\|	int64

x	::=	name?
	\|	prim-f

prim-f	::=	binop
	\|	unop

binop	::=	*
	\|	+
	\|	-
	\|	eq?
	\|	<
	\|	<=
	\|	>
	\|	>=

unop	::=	fixnum?
	\|	boolean?
	\|	empty?
	\|	void?
	\|	ascii-char?
	\|	error?
	\|	not

fixnum	::=	int61?

uint8	::=	uint8?

ascii-char-literal	::=	ascii-char-literal?

relop	::=	<
	\|	<=
	\|	=
	\|	>=
	\|	>
	\|	!=

int64	::=	int64?

p	::=	(module (define x (lambda (x ...) value)) ... value)

value	::=	triv
	\|	(let ([x value] ...) value)
	\|	(if value value value)
	\|	(call value value ...)

triv	::=	x
	\|	fixnum
	\|	#t
	\|	#f
	\|	empty
	\|	(void)
	\|	(error uint8)
	\|	ascii-char-literal

x	::=	name?
	\|	prim-f

prim-f	::=	binop
	\|	unop

binop	::=	*
	\|	+
	\|	-
	\|	eq?
	\|	<
	\|	<=
	\|	>
	\|	>=

unop	::=	fixnum?
	\|	boolean?
	\|	empty?
	\|	void?
	\|	ascii-char?
	\|	error?
	\|	not

fixnum	::=	int61?

uint8	::=	uint8?

ascii-char-literal	::=	ascii-char-literal?

As usual, we allow arbitrary shadowing. This means users can shadow prim-fs.

Example:

> (interp-exprs-lang-v7
    '(module
       (let ([* 1] [- 2])
         (call + * -))))
3

Now, all operations on data types in the source are procedures. This is because we’ll need to implement binary operations by detagging, operating on the underlying data, and retagging. For safety, we also should perform dynamic tag checking, to ensure we never add booleans or characters, or other odd and undefined behaviour.

Example:

> (interp-exprs-lang-v7
   '(module
      (call * #f #\y)))
Dynamic type error in *
  *: contract violation
  expected: number?
  given: #f
'(error 255)

We could distinguish procedure calls from primitive operations, and implement primitive operations by expanding to expressions, but this risks duplicating code (although, not much code). But, at some point, primitive operations will need to be wrapped as procedures if we want to use them with functional abstractions, so we might as well do it now, and leave some other pass to implement procedure inlining.

We also expose the tag checking operations to the source, again as procedures. Exposing these is a design choice, and this choice moves us toward a dynamically typed source language.

Examples:

> (interp-exprs-lang-v7
   '(module
      (call fixnum? #t)))
#f
> (interp-exprs-lang-v7
   '(module
      (call fixnum? 5)))
#t

These choices mean we can change our validator, check-exprs-lang. We no longer need to type check the arguments (previously, operands) to primitive operations, except to call, which still must call a statically known procedure (since we don’t have a tagged representation of procedures, yet). Instead, we statically allow expressions like (call * #f #\y), but expect this to raise a dynamic error.

procedure
(check-exprs-lang p) → exprs-lang-v7?
p : any/c
Checks that a Exprs-lang v7 program is well typed (only that procedures are called with the right number of arguments), and well scoped.

We extend uniquify as usual. Below, we design Exprs-unique-lang v7.

Diff vs v6.5Diff vs SourceExprs-unique-lang v7

p	::=	(module (define label (lambda (aloc ...) value)) ... value)

pred	::=	(relop opand opand)
	\|	(true)
	\|	(false)
	\|	(not pred)
	\|	(let ([aloc e] ...) pred)
	\|	(if pred pred pred)

value	::=	triv
	\|	(binop value value)
	\|	(call value triv value ...)
	\|	(let ([aloc value] ...) value)
	\|	(if value pred value value)

triv	::=	aloc
	\|	label
	\|	prim-f
	\|	fixnum
	\|	#t
	\|	#f
	\|	empty
	\|	(void)
	\|	(error uint8)
	\|	ascii-char-literal

prim-f	::=	binop
	\|	unop

opand	::=	aloc
	\|	int64

triv	::=	opand
	\|	label

binop	::=	*
	\|	+
	\|	-
	\|	<
	\|	eq?
	\|	<=
	\|	>
	\|	>=

unop	::=	fixnum?
	\|	boolean?
	\|	empty?
	\|	void?
	\|	ascii-char?
	\|	error?
	\|	not

relop	::=	<
	\|	<=
	\|	=
	\|	>=
	\|	>
	\|	!=

aloc	::=	aloc?

label	::=	label?

fixnum	::=	int61?

uint8	::=	uint8?

ascii-char-literal	::=	ascii-char-literal?

int64	::=	int64?

p	::=	(module (define label x (lambda (aloc x ...) value)) ... value)

value	::=	triv
	\|	(call value value ...)
	\|	(let ([aloc x value] ...) value)
	\|	(if value value value)
	\|	(call value value ...)

triv	::=	aloc
	\|	label
	\|	prim-f
	\|	x
	\|	fixnum
	\|	#t
	\|	#f
	\|	empty
	\|	(void)
	\|	(error uint8)
	\|	ascii-char-literal

x	::=	name?
	\|	prim-f

prim-f	::=	binop
	\|	unop

binop	::=	*
	\|	+
	\|	-
	\|	<
	\|	eq?
	\|	<
	\|	<=
	\|	>
	\|	>=

unop	::=	fixnum?
	\|	boolean?
	\|	empty?
	\|	void?
	\|	ascii-char?
	\|	error?
	\|	not

aloc	::=	aloc?

label	::=	label?

fixnum	::=	int61?

uint8	::=	uint8?

ascii-char-literal	::=	ascii-char-literal?

p	::=	(module (define label (lambda (aloc ...) value)) ... value)

value	::=	triv
	\|	(call value value ...)
	\|	(let ([aloc value] ...) value)
	\|	(if value value value)

triv	::=	aloc
	\|	label
	\|	prim-f
	\|	fixnum
	\|	#t
	\|	#f
	\|	empty
	\|	(void)
	\|	(error uint8)
	\|	ascii-char-literal

prim-f	::=	binop
	\|	unop

binop	::=	*
	\|	+
	\|	-
	\|	<
	\|	eq?
	\|	<=
	\|	>
	\|	>=

unop	::=	fixnum?
	\|	boolean?
	\|	empty?
	\|	void?
	\|	ascii-char?
	\|	error?
	\|	not

aloc	::=	aloc?

label	::=	label?

fixnum	::=	int61?

uint8	::=	uint8?

ascii-char-literal	::=	ascii-char-literal?

As usual, we transform names into either abstract locations or labels. Data types do not complicate this.

procedure
(uniquify p) → exprs-unique-lang-v7?
p : exprs-lang-v7?
Resolves top-level lexical identifiers into unique labels, and all other lexical identifiers into unique abstract locations.

And now, we figure out where to place our new passes.

As we saw in Introduction to Tagged Object Representation, many of the operations we want to perform on ptrs are easily expressed as algebraic expressions. For example, the expression (call fixnum? 7) is compiled to as (= (bitwise-and 7 7) 0).

Thankfully, we just added algebraic expressions. If we position our new passes above remove-complex-opera*, we don’t need to manually introduce auxiliary variables, or generate code with additional let-bindings. The compiler will do this for us, and allow us to write the code we want to write.

1.10.4 Specifying Data Type Representation

First we design Exprs-unsafe-data-lang v7. In this language, we compile our dynamically typed safe primitive procedures into lower level primitive operations on data types.

Diff vs SourceExprs-unsafe-data-lang v7

p	::=	(module (define label (lambda (aloc ...) value)) ... value)

value	::=	triv
	\|	(primop value ...)
	\|	(call value value ...)
	\|	(let ([aloc value] ...) value)
	\|	(if value value value)

triv	::=	aloc
	\|	label
	\|	prim-f
	\|	fixnum
	\|	#t
	\|	#f
	\|	empty
	\|	(void)
	\|	(error uint8)
	\|	ascii-char-literal

primop prim-f	::=	binop
	\|	unop

binop	::=	unsafe-fx*
	\|	unsafe-fx+
	\|	unsafe-fx-
	\|	eq?
	\|	unsafe-fx<
	\|	unsafe-fx<=
	\|	unsafe-fx>
	\|	unsafe-fx>=

binop	::=	*
	\|	+
	\|	-
	\|	<
	\|	eq?
	\|	<=
	\|	>
	\|	>=

unop	::=	fixnum?
	\|	boolean?
	\|	empty?
	\|	void?
	\|	ascii-char?
	\|	error?
	\|	not

aloc	::=	aloc?

label	::=	label?

fixnum	::=	int61?

uint8	::=	uint8?

ascii-char-literal	::=	ascii-char-literal?

p	::=	(module (define label (lambda (aloc ...) value)) ... value)

value	::=	triv
	\|	(primop value ...)
	\|	(call value value ...)
	\|	(let ([aloc value] ...) value)
	\|	(if value value value)

triv	::=	aloc
	\|	label
	\|	fixnum
	\|	#t
	\|	#f
	\|	empty
	\|	(void)
	\|	(error uint8)
	\|	ascii-char-literal

primop	::=	binop
	\|	unop

binop	::=	unsafe-fx*
	\|	unsafe-fx+
	\|	unsafe-fx-
	\|	eq?
	\|	unsafe-fx<
	\|	unsafe-fx<=
	\|	unsafe-fx>
	\|	unsafe-fx>=

unop	::=	fixnum?
	\|	boolean?
	\|	empty?
	\|	void?
	\|	ascii-char?
	\|	error?
	\|	not

aloc	::=	aloc?

label	::=	label?

fixnum	::=	int61?

uint8	::=	uint8?

ascii-char-literal	::=	ascii-char-literal?

To make clear the distinction between the source and target implementations of these operations, we prefix the unsafe operators that require a dynamic check with unsafe-. call is still unsafe, since we do not know how to tag procedures yet.

To implement this language, we essentially "link" the definitions of each procedure wrapper for each primitive operation and replace the reserved prim-f names for the functions with the appropriate fresh labels. Since our compiler has not provided any means of linking separately compiled modules, we implement this linking by directly adding new definitions to the module. For example, the program:

(module (call + 1 2))

should compile to the something equivalent to the following (although perhaps with different error codes).

(module
  (define L.+.1
    (lambda (tmp.3 tmp.4)
      (if (fixnum? tmp.4)
          (if (fixnum? tmp.3)
              (unsafe-fx+ tmp.3 tmp.4)
              (error 2))
          (error 2))))
  (call L.+.1 1 2))

Each safe procedure should produce some error code indicating the kind of error that occurred. Be sure to document your error codes. For now, we have enough error codes that we can encode which operation and which argument caused the error.

procedure
(implement-safe-primops p) → exprs-unsafe-data-lang-v7?
p : exprs-unique-lang-v7?

Implement safe primitive operations by inserting procedure definitions for each primitive operation which perform dynamic tag checking, to ensure type safety.

Design digression:
To avoid producing unnecessary definitions, you may consider how to design this pass so that a definition for a primitive procedure is only added to the module if that procedure is actually used.
You may also consider an alternative implementation of this pass that avoids inserting dynamic checks at all. Note that the the language definitions allow this behaviour for all well-typed programs.

Next we design Exprs-bits-lang v7, a language with all values represented as 64 bits and only primitive bitwise operations, but that allows algebraic expressions in most positions. The predicate position of if is again restricted to the predicate sublanguage.

Diff vs SourceExprs-bits-lang v7

p	::=	(module (define label (lambda (aloc ...) value)) ... value)

pred	::=	(relop value value)
	\|	(true)
	\|	(false)
	\|	(not pred)
	\|	(let ([aloc value] ...) pred)
	\|	(if pred pred pred)

value	::=	triv
	\|	(binop value primop value ...)
	\|	(call value value ...)
	\|	(let ([aloc value] ...) value)
	\|	(if pred value value value)

triv	::=	aloc
	\|	label
	\|	int64

binop	::=	*
	\|	+
	\|	-
	\|	bitwise-and
	\|	bitwise-ior
	\|	bitwise-xor
	\|	arithmetic-shift-right

triv	::=	aloc
	\|	label
	\|	fixnum
	\|	#t
	\|	#f
	\|	empty
	\|	(void)
	\|	(error uint8)
	\|	ascii-char-literal

primop	::=	binop
	\|	unop

binop	::=	unsafe-fx*
	\|	unsafe-fx+
	\|	unsafe-fx-
	\|	eq?
	\|	unsafe-fx<
	\|	unsafe-fx<=
	\|	unsafe-fx>
	\|	unsafe-fx>=

unop	::=	fixnum?
	\|	boolean?
	\|	empty?
	\|	void?
	\|	ascii-char?
	\|	error?
	\|	not

aloc	::=	aloc?

label	::=	label?

relop	::=	<
	\|	<=
	\|	=
	\|	>=
	\|	>
	\|	!=

int64	::=	int64?

fixnum	::=	int61?

uint8	::=	uint8?

ascii-char-literal	::=	ascii-char-literal?

p	::=	(module (define label (lambda (aloc ...) value)) ... value)

pred	::=	(relop value value)
	\|	(true)
	\|	(false)
	\|	(not pred)
	\|	(let ([aloc value] ...) pred)
	\|	(if pred pred pred)

value	::=	triv
	\|	(binop value value)
	\|	(call value value ...)
	\|	(let ([aloc value] ...) value)
	\|	(if pred value value)

triv	::=	aloc
	\|	label
	\|	int64

binop	::=	*
	\|	+
	\|	-
	\|	bitwise-and
	\|	bitwise-ior
	\|	bitwise-xor
	\|	arithmetic-shift-right

aloc	::=	aloc?

label	::=	label?

relop	::=	<
	\|	<=
	\|	=
	\|	>=
	\|	>
	\|	!=

int64	::=	int64?

This is the hard part of adding data types—translating our data types and their operations into bits and primitive operations on bits.

First, we translate each value literal to a ptr.

For booleans, empty, and void, this is trivial. Since these contain no payload data, their ptr representation is just a statically known tag. We simply emit their ptr representation.

You can find some parameters for ptrs defined in cpsc411/compiler-lib.

For data types with a payload (fixnum and ASCII characters) we need to do some work to merge the payload data with the tag. The general strategy is to first left shift the data by the number of bits in the tag, then perform a bitwise inclusive or with the tag. After the left shift, the least significant bits will be all 0, and a bitwise inclusive or will simply set the tags bits. For example, to tag the character "x" as an ASCII character, we would shift left by 8 (the number of bits in the secondary tag), and bitwise or with the tag:

(bitwise-ior (arithmetic-shift (char->integer #\x) 8) #b00101110)

For a fixnum, which uses a 3 bit primary tag, we could shift left by 3 and bitwise or with the fixnum tag:

(bitwise-ior (arithmetic-shift 7 3) #b000)

Similarly, to implement operations on our ptrs, in general we need to untag them. We can untag easily be performing an arithmetic right shift by the size of the tag. For example, below we tag the character #\x, producing a tagged ptr representation. To untag it, we simply shift right by 8 (the length of the secondary tag), which is implement by shift by -8.

Examples:

> (bitwise-ior (arithmetic-shift (char->integer #\x) 8) 46)
30766
> (integer->char
   (arithmetic-shift
    (bitwise-ior (arithmetic-shift (char->integer #\x) 8) 46)
    -8))
#\x

For fixnum operations, we can avoid some of these steps, due to our choice of tag. Because the fixnum tag is all 0s, we can omit the bitwise or.

Example:

> (= (bitwise-ior (arithmetic-shift 7 3) 0) (arithmetic-shift 7 3))
#t

For operations on fixnums, we can use some handy algebraic facts to avoid some tagging and untagging. Recall that every fixnum n is represented by a ptr whose value is (* 8 n). For + and -, this means we don’t need to do anything at all, since 8x + 8y = 8(x + y), and similarly 8x - 8y = 8(x - y). Similarly, <, <=, >, >=, and = can all work on ptrs directly without untagging. However, these are boolean operations in Exprs-unsafe-data-lang v7, so their implementation must return a boolean ptr.

Only * poses a problem, since 8x * 8y = 64(x * y). However, we do not need to adjust both arguments: we observe that 8x * y = 8(x * y), and similarly x * 8y = 8(x * y). We only need to shift one operand before performing * to get the correct result as a ptr. If either argument is constant, we can perform the shift at compile time, completely eliminating the additional overhead. Otherwise, we translate (* e_1 e_2) to (roughly) (* e_1 (arithmetic-shift-right e_2 3)).

Next, we translate if, which should be translated from an operation on booleans to an operation on preds. We’ll do some work later transforming booleans into predicates, for optimization, but for now we just consider how to implement booleans correctly. Racket and Scheme are falsey languages—any thing that is not #f is considered true. We can implement this easily: simply compare to the ptr for #f.

(= value #,(current-false-ptr))

When translating the booleans unops and binops on ptrs, we need to produce something that the translation of if can consume. if is expecting a boolean value, so each unop should be translated to an expression that returns a boolean. As we saw earlier, type predicates are implemented by masking the ptr using bitwise-and, and comparing the result to the tag using =. But the target language = is a relop, not a boolean operation, so we translate (fixnum? value) to (if (= (bitwise-and value 7) 0) #t #f) (although, of course, #t and #f must be compiled to ptrs too).

procedure
(specify-representation p) → exprs-bits-lang-v7?
p : exprs-unsafe-data-lang-v7?

Compiles immediate data and primitive operations into their implementations as ptrs and primitive bitwise operations on ptrs.

1.10.5 Exposing Bitwise Operations

Specifying the tagged representation happens very close to our source language. We need to expose these bitwise operations all the way up to what was previously Values-unique-lang-v6. Below, we design the new Values-bits-lang v7, typeset with differences compared to Values-unique-lang-v6.

Diff vs v6Diff vs SourceValues-bits-lang-v7

p	::=	(module (define label (lambda (aloc ...) tail)) ... tail)

pred	::=	(relop opand opand)
	\|	(true)
	\|	(false)
	\|	(not pred)
	\|	(let ([aloc value] ...) pred)
	\|	(if pred pred pred)

tail	::=	value
	\|	(let ([aloc value] ...) tail)
	\|	(if pred tail tail)
	\|	(call triv opand ...)

value	::=	triv
	\|	(binop opand opand)
	\|	(let ([aloc value] ...) value)
	\|	(if pred value value)
	\|	(call triv opand ...)

opand	::=	aloc
	\|	int64

triv	::=	opand
	\|	label

binop	::=	*
	\|	+
	\|	-
	\|	bitwise-and
	\|	bitwise-ior
	\|	bitwise-xor
	\|	arithmetic-shift-right

relop	::=	<
	\|	<=
	\|	=
	\|	>=
	\|	>
	\|	!=

aloc	::=	aloc?

label	::=	label?

int64	::=	int64?

p	::=	(module (define label (lambda (aloc ...) tail value)) ... tail value)

pred	::=	(relop opand opand value value)
	\|	(true)
	\|	(false)
	\|	(not pred)
	\|	(let ([aloc value] ...) pred)
	\|	(if pred pred pred)

tail	::=	value
	\|	(let ([aloc value] ...) tail)
	\|	(if pred tail tail)
	\|	(call triv opand ...)

value	::=	triv
	\|	(binop opand opand value value)
	\|	(call value value ...)
	\|	(let ([aloc value] ...) value)
	\|	(if pred value value)
	\|	(call triv opand ...)

opand	::=	aloc
	\|	int64

triv	::=	opand
	\|	aloc
	\|	label
	\|	int64

binop	::=	*
	\|	+
	\|	-
	\|	bitwise-and
	\|	bitwise-ior
	\|	bitwise-xor
	\|	arithmetic-shift-right

relop	::=	<
	\|	<=
	\|	=
	\|	>=
	\|	>
	\|	!=

aloc	::=	aloc?

label	::=	label?

relop	::=	<
	\|	<=
	\|	=
	\|	>=
	\|	>
	\|	!=

int64	::=	int64?

p	::=	(module (define label (lambda (aloc ...) tail)) ... tail)

pred	::=	(relop opand opand)
	\|	(true)
	\|	(false)
	\|	(not pred)
	\|	(let ([aloc value] ...) pred)
	\|	(if pred pred pred)

tail	::=	value
	\|	(let ([aloc value] ...) tail)
	\|	(if pred tail tail)
	\|	(call triv opand ...)

value	::=	triv
	\|	(binop opand opand)
	\|	(let ([aloc value] ...) value)
	\|	(if pred value value)
	\|	(call triv opand ...)

opand	::=	aloc
	\|	int64

triv	::=	opand
	\|	label

binop	::=	*
	\|	+
	\|	-
	\|	bitwise-and
	\|	bitwise-ior
	\|	bitwise-xor
	\|	arithmetic-shift-right

relop	::=	<
	\|	<=
	\|	=
	\|	>=
	\|	>
	\|	!=

aloc	::=	aloc?

label	::=	label?

int64	::=	int64?

The new operations do not have large effects on the language designs or compiler passes between remove-complex-opera* and generate-x64, so these details are left unspecified in this chapter. Redesigning the intermediate languages with the new operations is left as an exercise for the reader.

A specification of these languages is given in the support library, if you wish to use it.

procedure
(remove-complex-opera* p) → values-bits-lang-v7?
p : exprs-bits-lang-v7?
Performs the monadic form transformation, unnesting all non-trivial operators and operands to binops, calls, and relopss, making data flow explicit and simple to implement imperatively.

1.10.6 Exposing Bitwise Operations in Paren-x64

Finally, we need to expose the following x64 instructions.

sar loc, int
Perform arithmetic right-shift by int on loc.
This instruction requires that its second operand be an integer literal between 0 and 63, inclusive, i.e., 0 <= int <= 63. We will assume this constraint in the intermediate languages, and never expose this operation in Exprs-lang v7.
and loc, op
Compute the bitwise "and" of loc and op, storing the result in loc. Like with other binary operations, when op is an integer, it must be an int32, and when op is an addr, loc cannot also be an addr.
or loc, op
Compute the bitwise "inclusive or" of loc and op, storing the result in loc. Like with other binary operations, when op is an integer, it must be an int32, and when op is an addr, loc cannot also be an addr.
xor loc, op
Compute the bitwise "exclusive or" of loc and op, storing the result in loc. Like with other binary operations, when op is an integer, it must be an int32, and when op is an addr, loc cannot also be an addr.

We add each of these operations as a binop to Paren-x64 v7 below.

Diff vs v6 (excerpts)Paren-x64 v7

p	::=	(begin s ...)

s	::=	(set! addr int32)
	\|	(set! addr trg)
	\|	(set! reg loc)
	\|	(set! reg triv)
	\|	(set! reg_1 (binop reg_1 int32))
	\|	(set! reg_1 (binop reg_1 loc))
	\|	(with-label label s)
	\|	(jump trg)
	\|	(compare reg opand)
	\|	(jump-if relop label)

triv	::=	trg
	\|	int64

opand	::=	reg
	\|	int64

loc	::=	reg
	\|	addr

binop	::=	*
	\|	+
	\|	-
	\|	bitwise-and
	\|	bitwise-ior
	\|	bitwise-xor
	\|	arithmetic-shift-right

p	::=	(begin s ...)

s	::=	(set! addr int32)
	\|	(set! addr trg)
	\|	(set! reg loc)
	\|	(set! reg triv)
	\|	(set! reg_1 (binop reg_1 int32))
	\|	(set! reg_1 (binop reg_1 loc))
	\|	(with-label label s)
	\|	(jump trg)
	\|	(compare reg opand)
	\|	(jump-if relop label)

trg	::=	reg
	\|	label

triv	::=	trg
	\|	int64

opand	::=	reg
	\|	int64

loc	::=	reg
	\|	addr

reg	::=	rsp
	\|	rbp
	\|	rax
	\|	rbx
	\|	rcx
	\|	rdx
	\|	rsi
	\|	rdi
	\|	r8
	\|	r9
	\|	r10
	\|	r11
	\|	r12
	\|	r13
	\|	r14
	\|	r15

addr	::=	(fbp - dispoffset)

fbp	::=	frame-base-pointer-register?

binop	::=	*
	\|	+
	\|	-
	\|	bitwise-and
	\|	bitwise-ior
	\|	bitwise-xor
	\|	arithmetic-shift-right

relop	::=	<
	\|	<=
	\|	=
	\|	>=
	\|	>
	\|	!=

int64	::=	int64?

int32	::=	int32?

dispoffset	::=	dispoffset?

label	::=	label?

procedure
(generate-x64 p) → (and/c string? x64-instructions?)
p : paren-x64-v7?
Compile the Paren-x64 v7 program into a valid sequence of x64 instructions, represented as a string.

1.10.7 Appendix: Overview

Figure 8: Overview of Compiler Version 7

contents ← prev up next →

1	Book
2	Project Logistics
3	Project Iterations
4	CPSC 411 Infrastructure Package
5	Credits

1.1	A Compiler Begins with a Language
1.2	Abstracting Boilerplate (v1)
1.3	Abstract Locations
1.4	Value Orientation
1.5	Register Allocation
1.6	Structured Control Flow
1.7	Procedural Abstraction: Call
1.8	Procedural Abstraction: Return
1.9	Algebraic Expressions
1.10	Data types: Immediates
1.11	Data types: Structured Data and Heap Allocation
1.12	First-class Procedures: Code is Data
1.13	Recursive Data
1.14	Syntactic Sugar
1.15	Appendix: Racket Preliminaries
1.16	Appendix: Compiler Design Recipe

1.10.1	Preface: What’s wrong with Exprs-lang v6.5
1.10.2	Introduction to Tagged Object Representation
1.10.3	Extending the source and front-end passes
1.10.4	Specifying Data Type Representation
1.10.5	Exposing Bitwise Operations
1.10.6	Exposing Bitwise Operations in Paren-x64
1.10.7	Appendix: Overview