Performance Analysis and Evaluation

Fall 2017

turtle and the
  hare

Department of Computer Science, Vrije Universiteit Brussel

Instructor: Jennifer B. Sartor

When: Thursdays 1-3pm
Where: 10F744 SOFT at VUB
Instructor Email: jsartor@soft.vub.ac.be
Question and Answer Time: Thursdays 3-4pm, F.10.739

Course Overview

As the computer industry has grown over the last 50 years, the focal point has often been performance. Machines have gotten faster and faster, smaller and smaller. We have much larger and more complex pieces of software now than we did 40 years ago. This course will be about de-mystifying how to evaluate and analyze the performance of computer programs, particularly those written in managed languages. We will look at the software stack, and discuss how each of the layers -- the application, the compiler, the (potential) runtime environment, the operating system, the processor, and the memory system -- all impact the end performance of a program. We will look at some tools that shed light on why a program gets the execution time it does - analyzing its memory behavior or in what methods the bottlenecks might be. These are profiling or performance monitoring tools that are relatively easy to use to analyze performance. We will talk about metrics and statistics - once you run a program or a suite of programs, how do you accurately present and summarize results about their performance? A part of this is understanding the managed runtime running underneath a Java application, and how the dynamic compiler or memory manager impacts performance. Finally, we will touch on how the multicore era has also influenced the industry, and talk about the opportunities and limitations programming in parallel bring to performance. We will include a discussion on heterogeneous hardware, as well, and how some computer architects have built simulators to help design future processors that are both well-performing and energy-efficient.

Here is a preliminary list of topics to be discussed:

Of the layers of the software/system stack, we will focus more on the runtime environment and memory system than other layers. We will touch on details about adaptive or dynamic compilers that are used in managed runtime environments. We will also discuss processors and how they execute instructions. Understanding how they work at a high level helps to understand how an application achieves the performance it does. Also, the memory system is clearly linked to the processor and has a huge impact on overall performance. So we will focus on how caches are laid out and used, and how a runtime environment lays out memory and reclaims dead memory. To gather information about the hardware's behavior, we will explore hardware performance counters, and use tools that can read them and summarize results from them.

Two important aspects of analyzing performance are 1) how to properly set up experiments to measure and evaluate what you want to show, and 2) how to present results if you do perform experiments. We will discuss proper experimental methodology (for Java applications), such as how to eliminate non-determinism and exploring the space-time tradeoff of garbage collection. I will also teach good practices on how to repeat experiments and take averages over a set of results, and present confidence intervals in graphs.

Finally, we will explore a bit of how parallel programming affects performance. We'll assume a shared-memory model. We will discuss speedup and Amdahl's Law and Gustafson's law. We'll talk a little about cache coherence and false sharing. We'll talk a bit about the OS's job scheduling tasks. We will also talk briefly about the hardware reordering instructions and memory consistency, if time allows. We will also touch on power as a prime concern, which led to the rise of heterogeneous machines.

Schedule

Date Material Assignments
28 Sept Introduction and the evolution of processors (to superscalar and out-of-order)
5 Oct How caches and the memory subsystem work
12 Oct Performance analysis tools and a bit about methodology Project 1: Matrix multiplication, due Monday 6 November 18:00. Use your own machine or ask for a login to Serenity, a server within SOFT.
19 Oct: 1pm to 5pm Analytical modeling of an out-of-order processor, CPI stacks
AND Simulators, Heterogeneous Machines (a bit about in-order processors), Power
Project 2: Use a simulator (through VirtualBox), due Wednesday 22 November 23:59
26 Oct NO CLASS (SPLASH conference)
2 Nov NO CLASS (HOLIDAY)
9 Nov Language virtual machine or runtime environment, dynamic compilers
16 Nov Garbage collection
23 Nov Experimental methodology for managed languages, and data presentation/analysis Project 3: Using Jikes Research Virtual Machine to measure garbage collectors
30 Nov Multicore
7 Dec Multicore Project 4: Multicore
14 Dec
21 Dec
X Jan ORAL EXAM

Prerequisites

I expect advanced programming skills. In particular, I assume you have basic knowledge of object-oriented programming in a mainstream OO language such as C++/Java/C#. We will focus on Java in this course. I will also use code examples in C, and will assume you will be able to understand the code. You will have to write your own basic C programs, or extend existing C code that I give you as a starting point.

I will teach you the basics of how modern processors work, and the memory subsystem, so only minimal computer architecture knowledge is expected. However, it will be useful to have some system-level background from a prior course. Prior statistics classes could be useful for the part of the class where we talk about how to present results, but are not required. It is highly recommended that you have already taken a compilers course prior to this course. I will go into some details of dynamic and adaptive compilers in a runtime environment, but will assume you have a basic knowledge of compilers and routine optimizations that they perform.

Study Material

The main study material used in this course are slides, which can be downloaded from Pointcarre. For some lectures, I also provide relevant pointers to the literature (mostly to articles, books or papers freely available online). Reading material will be listed below, which you are encouraged to read. Some papers could be required reading (which will be clearly specified).

I will be taking content from the following online course taught at MIT called Performance Engineering of Software Systems Performance
Engineering of Software Systems [Saman Amarasinghe, and Charles Leiserson. 6.172 Performance Engineering of Software Systems, Fall 2010. (Massachusetts Institute of Technology: MIT OpenCourseWare), http://ocw.mit.edu (Accessed). License: Creative Commons BY-NC-SA].

I will also borrow material from Professor Lieven Eeckhout at Ghent University, and from his two books:

I will also be using material from other professor's courses: Dr. Kathryn McKinley's Memory Management course, and Dr. Kathryn McKinley's Advanced Compilers course.

Useful Links

Guide to C for Java programmers (slides)
How to Survive the Multicore Software Revolution article.
The Cilk++ concurrency platform book.
The Cilkview scalability analyzer article.
Producing Wrong Data without Doing Anything Obviously Wrong!" article on experimental methodology.

Grading and Assessment

For the theoretical part of this course, students are expected to attend the lectures and are encouraged to read the reading material, if it is listed. Although this course is 6 credits, we won't have a fixed lab scheduled. If there are readings listed, they will count as your extra work for those weeks. The practical part of this course will be done on your own, and you will have to complete 4 projects throughout the semester. The projects, roughly, will be over the following topics:

  1. Analyze a simple program with hardware performance counters and optimize it to improve performance. You should be able to demonstrate the improvements in not only execution time, but also memory behavior. This project will be done in C.
  2. Run some small applications on a simulator, and analyze and categorize their behavior. You will then do further analysis on how to schedule them on a heterogeneous machine to save energy.
  3. Compare various garbage collection algorithms and their space-time tradeoff. For this you will need to run many Java applications, and compare their performance, varying certain runtime parameters. You will also be expected to use best practices to present your results.
  4. Analyze some multi-threaded programs with tools that show why they don't show perfect speedup. Optimize those programs to achieve better parallelization.
Your grade will be based on these 4 projects, and your final exam. The final exam will be an oral exam in January. There will be a written preparation part that you have to complete within the exam slot and bring to the oral exam with you. The oral exam can cover concepts from the entire semester.

The grade for this course will be divided as follows. Each of the 4 projects will be 15% of your grade, for a total of 60% during the semester. These projects will be completed individually. You are required to turn in all 4 projects to pass this class! You are also required to take the oral examination at the end to pass. Your final exam will then be worth 40% of your total grade. To pass the entire course, you are required to pass the project part of the course (60%) as well as the exam portion of the course (40%).

Academic Honesty

You are required to do your own work. Don't copy code or answers from projects. It's okay to talk about general concepts or algorithms, or even benchmark sets for the projects, but don't share pseudocode or code or data. You can talk about problems you are having on assignments, but do not show code or your project output data to classmates to get help. Either use tools or ask the instructor for more specific help. The best way to do this is to avoid talking to others about the program while you are at the computer. If you have questions on what constitutes cheating or questions about this policy, please talk to the instructor.

Learning Outcomes

Knowledge and Insight
This course will give students the opportunity to gain deep insight into a computer program's performance -- why it might be slow, how to analyze bottlenecks, how to break down each layer of the software/system stack and understand how it contributes to the program's end performance. Furthermore, students will understand specifically why managed languages that run on top of a runtime environment are more complicated to analyze and evaluate than native languages. Students will have a deep understanding of how the memory subsystem contributes to performance. Students will also know the practical usage of computer architecture simulators.

Application of knowledge and insight
After successful completion of this course, students will know how to design their own performance experiments, and how to summarize, aggregate, and present results. Students will also be able to make changes in their program's memory access pattern to obtain improved memory efficiency. Furthermore, students will be able to analyze the sources of inefficient code using hardware performance counters, and analyze the sources of limited scalability in parallel programs.

Development of judgment
After successful completion of this course, sutdents will be able to evaluate how a program accesses memory, whether efficiently or not, and will be able to evaluate the scalability of parallel applications.

Communication skills
This course will develop students' ability to present meaningful and comprehensive experimental results in the form of graphs, and to be able to analyze the meaning of the results.

Learning skills
This course gives students a system-level view of the computer, across the software stack. Thus, students will have a deep understanding of how a program gets executed, and what layers affect its overall performance. This will enable students to more insightfully write efficient computer programs, taking the memory system into account, and understanding the ramifications of choosing a managed language. Furthermore, students will have learned how to analyze the computer program's behavior in detail in order to inform optimization.