1.4 Value Orientation

1.4.1 Preface: What’s wrong with our language?

In Asm-lang v2, which abstracted away from machine-specific physical locations exist and, therefore, instructions that operate on physical locations.

However, the source language is still not something in which we want to program. In order to program, we, as Asm-lang v2 programmers, can only compose operations by moving values in and out of locations. We cannot focus on their values, but instead must focus on their locations. This makes composing operations harder. For example, when doing arithmetic, we cannot simply write (+ 2 2), i.e., "Add the values 2 and 2". Instead, we must write "first move 2 into a location, then move 2 into a different location, now add the contents of the two locations".

We want to move towards a value-oriented language, i.e., a language where operations consume and produce values directly, and away from an imperative language that manipulates some underlying machine state. This would free the programmer from keeping the state of the machine in mind at all times. The structure of the computation could instead be reflected in the syntax of the program.

To this end we design a new source language, Values-lang v3. Implementing this language is the goal for this chapter. The language is designed primarily to address the above limitations.

p	::=	(module tail)

tail	::=	value
	\|	(let ([x value] ...) tail)

value	::=	triv
	\|	(binop triv triv)
	\|	(let ([x value] ...) value)

triv	::=	x
	\|	int64

x	::=	name?

binop	::=	*
	\|	+

int64	::=	int64?

We can see the language has changed significantly. The binary operations now act uniformly on trivs, which represent expressions that directly compute to values, rather than alocs, which represent locations on the machine where values are stored. Intermediate operations can be named using let, which binds an arbitrary number of independent computations to names.

This language introduces two new abstractions: names and expressions.

A name, or lexical identifier, is a placeholder that is indistinguishable from the value of the expression it names. Like abstract locations, names can be created at will. Unlike abstract locations, names do not unique identify a physical location. Names obey lexical binding, shadowing the same name if an existing name is reused but without overwriting the old value. For example, the following example uses the same names multiple times, but this doesn’t overwrite any other use of the name.

procedure
(interp-values-lang p) → int64?
p : values-lang-v3?
Interprets the Values-lang v3 program p as a value. For all p, the value of (interp-values-lang p) should equal to (execute p).

Example:

> (interp-values-lang
   '(module
      (let ([x (let ([y 1]
                     [x 2])
                 (+ y x))])
        (let ([y (let ([x 3]) x)])
          (+ x y)))))
6

Expressions are computations that represent values, without reference to any machine state. The expression (+ 2 2) represents and computes to the value 4, and hides any reference to where these values are located on the machine. This gives the programmer the ability to compose computations locally, without knowing anything about the state of the machine. For the moment, our expressions are very limited—we can only compute over values, and to compose two expressions, we explicitly name the value of the expression using let. This makes dependencies between expressions explicit in our syntax. But, with names, we are able to make up a brand new name without fear that we will erroneously overwrite some existing value; at worst, we locally shadow it. This again hides the details of the machine state from the programmer.

Some of the names for non-terminals are counter-intuitive. For example, triv for trivial values makes sense, but why is the expression (+ 2 2) a value? In this language, we use the name of a non-terminal to indicate in which context it is valid, not to merely describe the non-terminal’s productions. (+ 2 2) is valid in any context that expects a value, since it computes to a value and is indistinguishable from a value by the operations of our language. One context that expects a value is the right-hand side of a let binding. Only a value or a let whose body is a tail is valid in tail context.

In Values-lang v3, the let expression implements a particular kind of lexical binding. The same name can be shadowed by nested let expressions, and nested let expressions can refer to the bindings of prior let expressions. However, in the same let expression, the binding are considered parallel—they do not shadow, and cannot refer to each other. This means we can freely reorder the bindings in a single let statement. This can be useful for optimization. However, it means we cannot allow duplicate names to be bound in a single let or allow references to names within the bound expressions the same let expression. Otherwise, the ability to reorder could introduce undefined behaviour.

To guard against this undefined behaviour, we introduce a validator.

We must also check for reference to unbound names, since these may be compiled into reference to uninitialized memory, resulting in undefined behaviour.

procedure
(check-values-lang p) → values-lang-v3?
  p : any/c
Takes an arbitrary value and either returns it, if it is a valid Values-lang v3 program, or raises an error with a descriptive error message.
Examples:
> (check-values-lang
    '(module
       (let ([x 5]
             [y 6])
         x)))
'(module (let ((x 5) (y 6)) x))
> (check-values-lang
   '(module
      (let ([x 5]
            [y 6])
        (let ([y x])
          y))))
'(module (let ((x 5) (y 6)) (let ((y x)) y)))
> (check-values-lang
   '(module
      (let ([x 5]
            [y x])
        y)))
check-values-lang: Invalid triv; expected one of:
  something of the form int, but got 'x
  a bound name, got x in environment ()
> (check-values-lang
   '(module
      (let ([x 5]
            [x 6])
        x)))
check-values-lang: Invalid triv; expected one of:
  something of the form int, but got '(let ((x 5) (x 6)) x)
  a bound name, got (let ((x 5) (x 6)) x) in environment ()
> (check-values-lang
   '(module (let () 5)))
'(module (let () 5))
> (check-values-lang
    '(module (let () x)))
check-values-lang: Invalid triv; expected one of:
  something of the form int, but got 'x
  a bound name, got x in environment ()

1.4.2 Abstracting from Assembly

With Asm-lang v2, we abstracted all machine details—there are no more registers, no more odd machine-specific restrictions, just a location-oriented abstract assembly language.

The language is a still an extremely imperative, non-compositional language. We first add the ability to more easily compose imperative expressions, before moving to composing non-imperative expressions representing values.

But the assembly language is still an (extremely) imperative machine language. The operations it provides are: (1) move a value to a location (2) perform a binary operation on a location. This requires the programmer to always remember the values of abstract locations and manipulate this underlying state to program.

We design Imp-cmf-lang v3, an imperative language that allows expressing operations on values directly, and composing them by sequencing these computation through abstract locations. This removes some amount of imperativity from the language. The programmer can reason about each primitive operation using only values, and needs only to think about the state of the machine when composing operations.

Imp-cmf-lang v3 is in a variant of A-normal form (ANF), a syntactic form that restricts all operations to trivial values, and forbids nesting in our value position. It is roughly equivalent to other compiler intermediate forms, such as static-single assignment.

Imp-cmf-lang-v3DiffAsm-lang-v2

p	::=	(module tail)

tail	::=	value
	\|	(begin effect ... tail)

value	::=	triv
	\|	(binop triv triv)

effect	::=	(set! aloc value)
	\|	(begin effect ... effect)

triv	::=	aloc
	\|	int64

binop	::=	*
	\|	+

aloc	::=	aloc?

int64	::=	int64?

p	::=	(module info tail)

info	::=	info?

tail	::=	value
	\|	(halt triv)
	\|	(begin effect ... tail)

value	::=	triv
	\|	(binop triv triv)

effect	::=	(set! aloc value triv)
	\|	(set! aloc_1 (binop aloc_1 triv))
	\|	(begin effect ... effect)

triv	::=	aloc
	\|	int64

binop	::=	*
	\|	+

aloc	::=	aloc?

int64	::=	int64?

p	::=	(module info tail)

info	::=	info?

tail	::=	(halt triv)
	\|	(begin effect ... tail)

effect	::=	(set! aloc triv)
	\|	(set! aloc_1 (binop aloc_1 triv))
	\|	(begin effect ... effect)

triv	::=	aloc
	\|	int64

binop	::=	*
	\|	+

aloc	::=	aloc?

int64	::=	int64?

We add the value non-terminal to represent operations that produce values at run time. Now, set! simply assigns any value, or operation that produces a value, to an abstract location. At the top-level, a program is a tail, which represents the last computation executed in a program. It is either an operation that produces a value, or a sequence of such operations composed by storing the result of the value in an intermediate abstract locations. The Imp-cmf-lang v3 program implicitly halts with the final value of the tail.

To implement this language, we compile each operation to a sequence of Asm-lang v2 instructions. This involves select inserting an explicit halt in the final value of the tail, and selecting instruction sequences to implement primitive operations, and possibly introducing auxiliary abstract locations.

procedure
(select-instructions p) → asm-lang-v2?
  p : imp-cmf-lang-v3?
Compiles Imp-cmf-lang v3 to Asm-lang v2, selecting appropriate sequences of abstract assembly instructions to implement the operations of the source language.
Examples:
> (select-instructions '(module (+ 2 2)))
'(module () (begin (set! tmp.1 2) (set! tmp.1 (+ tmp.1 2)) (halt tmp.1)))
> (select-instructions
   '(module
      (begin (set! x.1 5) x.1)))
'(module () (begin (set! x.1 5) (halt x.1)))
> (select-instructions
   '(module
      (begin
        (set! x.1 (+ 2 2))
        x.1)))
'(module () (begin (set! x.1 2) (set! x.1 (+ x.1 2)) (halt x.1)))
> (select-instructions
   '(module
      (begin
        (set! x.1 2)
        (set! x.2 2)
        (+ x.1 x.2))))
'(module
   ()
   (begin
     (set! x.1 2)
     (set! x.2 2)
     (set! tmp.2 x.1)
     (set! tmp.2 (+ tmp.2 x.2))
     (halt tmp.2)))

Imp-cmf-lang v3 forces us to carefully linearize all effects at the top-level. This is an annoying detail that the language could manage instead.

Below, we design Imp-mf-lang v3, which allows nesting effects in most contexts.

DiffImp-mf-lang-v3Imp-cmf-lang-v3

p	::=	(module tail)

tail	::=	value
	\|	(begin effect ... tail)

value	::=	triv
	\|	(binop triv triv)
	\|	(begin effect ... value)

effect	::=	(set! aloc value)
	\|	(begin effect ... effect)

triv	::=	aloc
	\|	int64

binop	::=	*
	\|	+

aloc	::=	aloc?

int64	::=	int64?

p	::=	(module tail)

tail	::=	value
	\|	(begin effect ... tail)

value	::=	triv
	\|	(binop triv triv)
	\|	(begin effect ... value)

effect	::=	(set! aloc value)
	\|	(begin effect ... effect)

triv	::=	aloc
	\|	int64

binop	::=	*
	\|	+

aloc	::=	aloc?

int64	::=	int64?

p	::=	(module tail)

tail	::=	value
	\|	(begin effect ... tail)

value	::=	triv
	\|	(binop triv triv)

effect	::=	(set! aloc value)
	\|	(begin effect ... effect)

triv	::=	aloc
	\|	int64

binop	::=	*
	\|	+

aloc	::=	aloc?

int64	::=	int64?

Design digression:
Imp-mf-lang v3 is in monadic form (MF), a syntactic form that allows composing operations that operate on values and have no side-effect (such as changing the value of an abstract location), but requires explicit sequencing any effectful operations. This allows additional nesting compared to ANF. Monadic form is often used in high-level functional languages to support reasoning about when side-effecting operations happen, and is also a useful compiler intermediate form for the same reason.
ANF almost corresponds to a canonical form for MF, a syntactic form in which equal programs (for some notion of equality) have the same representation. The form is canonical in the sense that there is one right way to represent every program. All ANF programs are also in MF, but not all MF programs are in ANF.
Writing transformations and optimizations over canonical forms is often easier since we do not have to manually consider two equal programs as they have the same representation. Unfortunately, transformations over canonical forms are often tricky because the transformation may not preserve canonicity. In the case of of MF and ANF, it is often easier to write the same optimization over MF, since MF frees the optimization from attempting to unnest operations. On the other hand, selecting instructions over ANF is simpler since assembly features no nesting, and ANF guarantees that no nesting exists.
Strictly speaking, our Imp-cmf-lang v3 is a variant of ANF that is not truly canonical, since we can nest effects. For example, the following two tails represent the same instruction sequence: (begin (begin (set! x.1 5)) x.1) and (begin (set! x.1 5) x.1). However, the language is canonical for the equation we care about, and this little bit of non-canonicity simplifies the work of the compiler in some cases, and only complicates a single pass in the short-term—once we add new features, it will not complicate anything, since we will be forced to deal with the complication anyway.
This design choice is an example of where the compiler design, like most software design, benefits from iteration. While the bottom-up approach of building layers on abstraction is often beneficial, some design decisions may not be obvious until the software evolves.

procedure
(normalize-bind p) → imp-cmf-lang-v3?
  p : imp-mf-lang-v3?
Compiles Imp-mf-lang v3 to Imp-cmf-lang v3, pushing set! under begin so that the right-hand-side of each set! is a simple value-producing operation. This normalizes Imp-mf-lang v3 with respect to the equations
(set! aloc
     (begin effect_1 ...
            value))

=

(begin effect_1 ...
       (set! aloc value))

1.4.3 Be, not Do

Now we want to abstract further, away from locations and focus on values and operations on values. This allows us to express computation as something that represents a value, not a sequence of operations to compute a value. That is, to declare what is, not instruct what to do. This is a move toward declarative programming.

Values-lang v3 is the final source language for this week. It abstracts away from operations on locations and instead has operations on values. It also includes a feature for binding values to names for reuse. The value of a Values-lang v3 program is the final value of an expression.

But dealing with names is hard, so we make a simplifying assumption. We assume that actually, someone has already resolved all names into abstract locations. This gives us the language Values-unique-lang v3, defined below.

Values-unique-lang-v3DiffImp-mf-lang-v3

p	::=	(module tail)

tail	::=	value
	\|	(let ([aloc value] ...) tail)

value	::=	triv
	\|	(binop triv triv)
	\|	(let ([aloc value] ...) value)

triv	::=	aloc
	\|	int64

binop	::=	*
	\|	+

aloc	::=	aloc?

int64	::=	int64?

p	::=	(module tail)

tail	::=	value
	\|	(let ([aloc value] ...) tail)
	\|	(begin effect ... tail)

value	::=	triv
	\|	(binop triv triv)
	\|	(let ([aloc value] ...) value)
	\|	(begin effect ... value)

effect	::=	(set! aloc value)
	\|	(begin effect ... effect)

triv	::=	aloc
	\|	int64

binop	::=	*
	\|	+

aloc	::=	aloc?

int64	::=	int64?

p	::=	(module tail)

tail	::=	value
	\|	(begin effect ... tail)

value	::=	triv
	\|	(binop triv triv)
	\|	(begin effect ... value)

effect	::=	(set! aloc value)
	\|	(begin effect ... effect)

triv	::=	aloc
	\|	int64

binop	::=	*
	\|	+

aloc	::=	aloc?

int64	::=	int64?

This language is similar to Imp-mf-lang v3, but uses let to compose operations rather than sequential, imperative set! instructions. These two operations have different semantics. All of the bindings in a single let are assumed to be independent of each other, and can happen in any order. They should not be thought of as sequential, but simply declarations that some bindings to some values exist. For example, the following two Values-unique-lang v3 programs are equivalent:

(let ([x.1 5] [y.2 6])

(+ x.1 y.2))

(let ([y.2 6] [x.1 5])

(+ x.1 y.2))

However, the apparently similar Imp-mf-lang v3 programs are not necessarily equal, since set!s modify global state. The following two tails are not necessarily equal, if any other part of the program relies on x.1 or y.2.

(begin (set! x.1 5)

(set! y.2 6)

(+ x.1 y.2))

≠

(begin (set! y.2 6)

(set! x.1 5)

(+ x.1 y.2))

This means we can easily implement optimizations over Values-unique-lang v3 that are not possible or are difficult in Imp-mf-lang v3, although no such optimizations are apparent yet.

procedure
(optimize-let-bindings p) → values-unique-lang-v3?
p : values-unique-lang-v3?
Optimizes let bindings by reordering them to minimize or maximize some metric.

To implement Values-unique-lang v3, we must sequentialize let.

procedure
(sequentialize-let p) → imp-mf-lang-v3?
p : values-unique-lang-v3?
Compiles Values-unique-lang v3 to Imp-mf-lang v3 by picking a particular order to implement let expressions using set!.

Finally, we must discharge our assumption that all names are unique. Below we define Values-lang v3, a value-oriented language with simple binary expressions and lexical binding.

DiffValues-lang-v3Values-unique-lang-v3

p	::=	(module tail)

tail	::=	value
	\|	(let ([x aloc value] ...) tail)

value	::=	triv
	\|	(binop triv triv)
	\|	(let ([x aloc value] ...) value)

triv	::=	x
	\|	aloc
	\|	int64

x	::=	name?

binop	::=	*
	\|	+

aloc	::=	aloc?

int64	::=	int64?

p	::=	(module tail)

tail	::=	value
	\|	(let ([x value] ...) tail)

value	::=	triv
	\|	(binop triv triv)
	\|	(let ([x value] ...) value)

triv	::=	x
	\|	int64

x	::=	name?

binop	::=	*
	\|	+

int64	::=	int64?

p	::=	(module tail)

tail	::=	value
	\|	(let ([aloc value] ...) tail)

value	::=	triv
	\|	(binop triv triv)
	\|	(let ([aloc value] ...) value)

triv	::=	aloc
	\|	int64

binop	::=	*
	\|	+

aloc	::=	aloc?

int64	::=	int64?

x is a short-hand non-terminal for names, which are arbitrary symbols. To ensure transform names into abstract location, it suffices to append a unique number to each.

To implement this language, we must resolve all lexical binding and replace names by abstract locations.

procedure
(uniquify p) → values-unique-lang-v3?
  p : values-lang-v3?
Compiles Values-lang v3 to Values-unique-lang v3 by resolving all lexical identifiers to abstract locations.
Examples:
> (uniquify '(module (+ 2 2)))
'(module (+ 2 2))
> (uniquify
   '(module
      (let ([x 5])
        x)))
'(module (let ((x.3 5)) x.3))
> (uniquify
   '(module
      (let ([x (+ 2 2)])
        x)))
'(module (let ((x.4 (+ 2 2))) x.4))
> (uniquify
   '(module
      (let ([x 2])
        (let ([y 2])
          (+ x y)))))
'(module (let ((x.5 2)) (let ((y.6 2)) (+ x.5 y.6))))
> (uniquify
   '(module
      (let ([x 2])
        (let ([x 2])
          (+ x x)))))
'(module (let ((x.7 2)) (let ((x.8 2)) (+ x.8 x.8))))

1.4.4 Appendix: Overview

Figure 2: Overview of Compiler Version 3

contents ← prev up next →

1	Book
2	Project Logistics
3	Project Iterations
4	CPSC 411 Infrastructure Package
5	Credits

1.1	A Compiler Begins with a Language
1.2	Abstracting Boilerplate (v1)
1.3	Abstract Locations
1.4	Value Orientation
1.5	Register Allocation
1.6	Structured Control Flow
1.7	Procedural Abstraction: Call
1.8	Procedural Abstraction: Return
1.9	Algebraic Expressions
1.10	Data types: Immediates
1.11	Data types: Structured Data and Heap Allocation
1.12	First-class Procedures: Code is Data
1.13	Recursive Data
1.14	Syntactic Sugar
1.15	Appendix: Racket Preliminaries
1.16	Appendix: Compiler Design Recipe

1.4.1	Preface: What’s wrong with our language?
1.4.2	Abstracting from Assembly
1.4.3	Be, not Do
1.4.4	Appendix: Overview