On this page:
1.16.1 Systematic Program Design for Compilers
1.16.2 The Compiler Template Recipe
8.10

1.16 Appendix: Compiler Design Recipe

Compilers are complex pieces of software, and it can be difficult to know where to start. Thankfully, most of you are already indoctrinated into the ways of systematic program design from CPSC 110, so you know that we should start with the design recipe. If you are not yet familiar with systematic program design, don’t worry: you will be. The weekly assignments in this course were written following the systematic program design recipe taught in CPSC 110. By following the design recipe, you will always know where to start.

In this document, we explain how the design recipe applies to compilers, which pieces of the design recipe you are given, and which pieces you will need to complete your weekly assignment.

1.16.1 Systematic Program Design for Compilers

Figure 12 describes a design recipe from How to Design Programs, on which CPSC 110 is based.

  1. From Problem Analysis to Data Definitions

  2. Signature, Purpose Statement, Header

  3. Functional Examples

  4. Function Template

  5. Function Definition

  6. Testing

Figure 12: The Design Recipe

The hardest step in this recipe is Step 1: creating a data definition from a problem analysis. A compiler is a transformation between languages, so in this course, most data definitions are language definitions. The data are programs from a language. Almost all of the creativity in the compilers we will design is in designing the languages that the compilers manipulate. In class, we will analyze some language design problem then discuss design decisions that led to the source languages we present in the weekly assignment. You are not yet experienced language developers, so we have done the work of describing the languages, and will give you these data definitions each week.

We will also give you most of Step 2. For compilers, Step 2 follows a regular pattern: every compiler pass takes a source (input) language and produces a target (output) language. The pass will have one of two purposes. Either to implement an abstraction for the source in terms of the target, or to parse some information implicit in the source into an explicit representation in the target.

From the language definitions you provide, you should complete Step 3 by hand compiling some programs. These will also form your tests for Step 6. The weekly assignments will include a few examples, but it is up to you to come up with enough examples to test your compiler thoroughly.

Step 4, the template, will be uniform for each compiler you write. Every compiler is defined over a source language, and each source language has a common tree structure. In the rest of this document, I’ll explain the template pattern you should use.

By the time you have worked your way through the steps of the design recipe, there is only one step, Step 5, that requires thinking about writing code. You will be able to write most of your compiler by following the systematic process.

1.16.2 The Compiler Template Recipe

The data definition of each language is essentially a tree, given as a BNF grammar, which satisfies some properties and can be evaluated to a value. Each instance of a tree, i.e., each program in the language, satisfies some common properties, such as having no references to unbound variables. Each program has an interpretation as a value, given by the interpreter for the language.

Since every data structure is a tree, every compiler pass will follow a template for treelike data. By following the design recipe, and using the template, systematic design will write half of your code for you.

The structure of our languages is so regular that we have a recipe for writing a template for each new language. You can develop the template for each language using the process in Figure 13.

  1. Design predicates for each data non-terminal of the grammar.

  2. Design one function for each code non-terminal of the grammar.

  3. Design each structural function by pattern matching on the input expression, creating one branch for each case of non-terminal.

  4. Following the natural recursion: call the corresponding non-terminal function on each structural sub-expression.

  5. Transform the leaves of the tree, i.e., the expressions with no complex sub-expressions.

Figure 13: Compiler Template Recipe

For example, consider implementing a compiler for the following two languages. The compiler should add a subtraction operation to the source language by compiling it into addition and multiplication.

  p = (begin s ...)
     
  s = (set! loc int64)
  | (set! loc loc)
  | (set! loc (binop loc loc))
     
  binop = *
  | +
  | -
     
  loc = reg
  | addr
     
  reg = rsp
  | rbp
  | rax
  | rbx
  | rcx
  | rdx
  | rsi
  | rdi
  | r8
  | r9
  | r10
  | r11
  | r12
  | r13
  | r14
  | r15
     
  addr = (rbp - integer)
  | (rbp + integer)

Figure 14: Paren-asm-sub, the source language

In Figure 14, p (for program) and s (for statement) are code non-terminals. These give the structure of code in the language, and are what we recur on and process when designing a compiler.

The rest, such as binop, reg, and int64, are data non-terminals. These define data that appears in the structure of programs, but that the compiler will typically not process in a complex way.

  p = (begin s ...)
     
  s = (set! loc int64)
  | (set! loc loc)
  | (set! loc (binop loc loc))
     
  binop = *
  | +
     
  loc = reg
  | addr
     
  reg = rsp
  | rbp
  | rax
  | rbx
  | rcx
  | rdx
  | rsi
  | rdi
  | r8
  | r9
  | r10
  | r11
  | r12
  | r13
  | r14
  | r15
     
  addr = (rbp - integer)
  | (rbp + integer)

Figure 15: Paren-asm, the target language

Note that Figure 15 is essentially the same, except binop excludes the symbol -.

We would design this compiler using the following template for Paren-asm-sub.

#lang racket
(define-syntax-rule (TODO . rest)
  (error "Template not yet completed"))
 
(define (max-int word-size) (sub1 (expt 2 (sub1 word-size))))
(define (min-int word-size) (* -1 (expt 2 (sub1 word-size))))
 
(define (int-size? word-size i)
  (and (integer? i)
       (<= (min-int word-size) i (max-int word-size))))
 
(define (int32? i) (int-size? 32 i))
(define (int64? i) (int-size? 64 i))
 
(define (Paren-asm-sub-binop? op) (and (member op '(* + -)) #t))
(define (Paren-asm-binop? op) (and (member op '(* +)) #t))
(define (location? loc) (or (register? loc) (address? loc)))
(define (register? r)
  (and
    (member r '(rsp rbp rax rbx rcx rdx rsi rdi r8 r9 r10 r11 r12 r13 r14 r15))
    #t))
 
(define (address? a)
  (match a
    [`(rbp ,op ,int)
      #:when (and (member op '(- +))
                  (integer? int))
     #t]
    [_ #f]))
 
(define (compiler-sub p)
  (define (process-p p)
    (match p
      [`(begin ,s ...)
       (TODO process-s s)]))
  (define (process-s s)
    (match s
      [`(set! ,loc ,int64)
       #:when (and (location? loc) (int64? int64))
       (TODO)]
      [`(set! ,loc1 ,loc2)
       #:when (and (location? loc1) (location? loc2))
       (TODO)]
      [`(set! ,loc1 (,binop ,loc2 ,loc3))
       #:when (and (location? loc1)
                   (Paren-asm-sub-binop? binop)
                   (location? loc2)
                   (location? loc3))
       (TODO)]))
  (process-p p))

Racket list predicates, such as member, often return a non-boolean result. It’s a good idea to force predicates to return a boolean result using the (and ... #t) pattern, to avoid type confusion.

None of this code required creativity; it is the template induced by the definition of Paren-asm-sub. We have two non-terminals: programs p and statements s, so we design two functions. While the language definitions includes other non-terminals in the grammar, these define atomic data. Each function pattern matches on the expression using match. Nothing interesting happens to programs, so the template for process-p simply calls process-s on each statement. In process-s, we pattern match, and use #:when guards to disambiguate between statements with similar structure but different atoms in the expression. While the last statement is unambigious in the current language definition, it’s a good idea to disambiguate it anyway, as the language definitions may change throughout the course as we iteratively refine them.