Why Are My Bytecode Interpreters Slow? Hunting Truffles with VTune

As part of our work on the AST vs. Bytecode Interpreters paper, I briefly looked at what the native code of ahead-of-time-compiled bytecode loops looks like. But except for finding much more code than I expected, I didn’t look too closely at what was going on.

After the paper was completed, I went on to make big plans, falling into the old trap of simply assuming that I roughly knew where the interpreters spend their time, and based on these assumptions, I developed grand ideas to make them faster. None of these ideas would be easy, of course, possibly requiring months of work. Talking to my colleagues working on the Graal compiler, I was very kindly reminded that I should know, and not guess, where the execution time goes. I remember hearing that before, probably when I told my students the same thing…

So, here we are. This blog post has two goals: to document, for future me, how to build the various Truffle interpreters and how to use VTune, and to discuss a bit where Truffle bytecode interpreters spend much of their time.

To avoid focusing on implementation-specific aspects, I’ll look at four different Truffle-based bytecode interpreters: my own trusty TruffleSOM, Espresso (a JVM on Truffle), GraalPy, and GraalWasm.

To keep things brief, below I’ll just give the currently working magic incantations that produce ahead-of-time-compiled binaries for the bytecode interpreters, as well as how to run them as pure interpreters using the classic Mandelbrot benchmark.

Building the Interpreters

Let’s start with TruffleSOM. The following command line builds all dependencies and the bytecode interpreter. It also builds the binary with just-in-time compilation disabled, something the other implementations control with a command-line flag. The result is a binary in the same folder.

TruffleSOM$ mx build-native --build-native-image-tool --build-trufflesom --no-jit --type BC

Espresso is part of the Graal repository, and all necessary build settings are conveniently maintained there. The second command tells us the folder that contains the resulting binary:

espresso$ mx --env native-ce build
espresso$ mx --env native-ce graalvm-home

GraalPy lives in its own repository, but can be built similarly. The build also conveniently prints the path where we find the result.

graalpy$ mx python-svm

Last but not least, GraalWasm takes a little more convincing to get the same result, since the build configuration isn’t included in the repository.

wasm$ export DYNAMIC_IMPORTS=/substratevm,/sdk,/truffle,/compiler,/wasm,/tools
wasm$ export COMPONENTS=cmp,cov,dap,gvm,gwa,ins,insight,insightheap,lg,lsp,pro,sdk,sdkl,tfl,tfla,tflc,tflm
wasm$ export NATIVE_IMAGES=lib:jvmcicompiler,lib:wasmvm
wasm$ export NON_REBUILDABLE_IMAGES=lib:jvmcicompiler
wasm$ export DISABLE_INSTALLABLES=False
wasm$ mx build
wasm$ mx graalvm-home

At this point, we have our four interpreters ready for use.

Building the Benchmarks

As the benchmark, I’ll use the Are We Fast Yet version of Mandelbrot. This is mostly for my own convenience: for this experiment, we just need a benchmark that spends most of its time in the bytecode loop, and Mandelbrot does a good-enough job of that.

TruffleSOM and Python take care of compilation implicitly, but for Java and Wasm, we need to produce the jar and wasm files ourselves. For Wasm, I used the C++ version of the benchmark and Emscripten to compile it.

Java$  ant jar   # creates a benchmarks.jar
C++$   CXX=em++ OPT='-O3 -sSTANDALONE_WASM' build.sh

The -sSTANDALONE_WASM flag makes sure Emscripten gives us a wasm module that works without further issues on GraalWasm.

Running the Benchmarks

Executing the Mandelbrot benchmark is now relatively straightforward, though I’ll skip over the full path details below. For Espresso, GraalPy, and GraalWasm, we use command-line flags to disable just-in-time compilation, as shown below.

Note that the executables are the ones we built above. Running, for instance, java --version should show something like the following:

openjdk 21.0.2 2024-01-16
OpenJDK Runtime Environment GraalVM CE 21.0.2-dev+13.1 (build 21.0.2+13-jvmci-23.1-b33)
Espresso 64-Bit VM GraalVM CE 21.0.2-dev+13.1 (build 21-espresso-24.1.0-dev, mixed mode)

With this, running the benchmarks uses roughly the following commands:

som-native-interp-bc -cp Smalltalk Harness.som Mandelbrot 10 500
java    --experimental-options --engine.Compilation=false -cp benchmarks.jar Harness Mandelbrot 10 500
graalpy --experimental-options --engine.Compilation=false ./harness.py Mandelbrot 10 500
wasm    --experimental-options --engine.Compilation=false ./harness-em++-O3-sSTANDALONE_WASM Mandelbrot 10 500

Using VTune

There are various useful profilers out there, but my colleagues specifically asked me to have a look at VTune, and I figured it might be a convenient way to grab various hardware details from an Intel CPU.

However, I do not have direct access to an Intel workstation. So, instead of using the VTune desktop user interface or command line, I’ll actually use the VTune server on one of our benchmarking machines. This was surprisingly convenient and seems useful for rerunning previous experiments with different settings or binaries.

The machine is suitably protected, but I can’t recommend using the following in an open environment:

vtune-server --web-port $PORT --enable-server-profiling --reset-passphrase

This prints the URL where the web interface can be accessed. The server is also configured so that we can run experiments directly from the web interface, which helps with finding the various interesting options.

For all four interpreters, I’ll focus on what VTune calls the Hotspots profiling runs. I used the Hardware Event-Based Sampling setting with additional performance insights.

After the benchmark run finishes, VTune opens a Summary of the results with various statistics. Though, for this investigation, the most interesting part is the Bottom-up view of where the program spent its time. For all four interpreters, the top function is the bytecode loop.

Opening the top function allows us to view the assembly and group it by Basic Block / Address. This neatly adds up the time of each instruction in a basic block and gives us an impression of how much time is spent in each block.

The Bytecode Dispatch

VTune gives us a convenient way to identify which parts of the compiled code are executed and how much time we spend in them. What surprised me is that about 50% of all time is spent in bytecode dispatch. Not the bytecode operation itself, no, but the code executed for every single bytecode leading up to and including the dispatch.
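As a reminder of what we are looking at: a bytecode loop is conceptually nothing more than a loop that fetches the next opcode and dispatches on it. Here is a minimal sketch, written in Python purely for illustration; the Truffle interpreters are of course written in Java, and the opcodes and semantics here are made up:

PUSH_CONST, ADD, RETURN = 0, 1, 2   # made-up opcodes for this sketch

def interpret(bytecodes, constants):
    stack = []
    pc = 0
    while True:
        opcode = bytecodes[pc]      # fetch the next bytecode
        pc += 1
        if opcode == PUSH_CONST:    # dispatch to the matching handler
            stack.append(constants[bytecodes[pc]])
            pc += 1
        elif opcode == ADD:
            right = stack.pop()
            stack.append(stack.pop() + right)
        elif opcode == RETURN:
            return stack.pop()

# interpret([PUSH_CONST, 0, PUSH_CONST, 1, ADD, RETURN], [2, 3]) returns 5.
# The "dispatch" measured below is everything the compiled code does between
# finishing one handler and entering the next one.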

Below is the full code of the “bytecode dispatch” for GraalWasm. As far as I can see, all four interpreters have roughly the same native code structure. It starts with a very long sequence of instructions that likely read various bits out of Truffle’s VirtualFrame objects, and then proceeds to do the actual dispatch via what the Graal compiler calls an IntegerSwitchNode in its intermediate representation, for which the RangeTableSwitchOp strategy is used for compilation. This encodes the bytecode dispatch by looking up the jump target in a table, and then performing the jmp %rdx instruction at the very end of the code below.

Address Assembly CPU Time
  Block 18 11.963s
0x1f5b229 xorpd %xmm0, %xmm0 0.306s
0x1f5b22d mov $0xfffffffffffff, %rdx 0.486s
0x1f5b237 movq %rdx, 0x1f0(%rsp) 0.080s
0x1f5b23f mov $0xeae450, %rdx 0.070s
0x1f5b249 mov %r8d, %ebx 0.346s
0x1f5b24c movzxb 0x10(%r10,%rbx,1), %ebx 0.391s
0x1f5b252 movq 0x18(%rdi), %rbp 0.050s
0x1f5b256 movq %rbp, 0xd0(%rsp) 0.030s
0x1f5b25e movq 0x10(%rdi), %rbp 0.256s
0x1f5b262 movq %rbp, 0xc8(%rsp) 0.441s
0x1f5b26a lea (%r14,%rbp,1), %rsi 0.256s
0x1f5b26e movq %rsi, 0xc0(%rsp) 0.020s
0x1f5b276 lea 0xe(%r8), %esi 0.611s
0x1f5b27a lea 0xd(%r8), %ebp 0.356s
0x1f5b27e lea 0xc(%r8), %edi 0.135s
0x1f5b282 lea 0xb(%r8), %r11d 0.055s
0x1f5b286 lea 0xa(%r8), %r13d 0.306s
0x1f5b28a movl %r13d, 0x1ec(%rsp) 0.401s
0x1f5b292 lea 0x9(%r8), %r12d 0.125s
0x1f5b296 movl %r12d, 0x1e8(%rsp) 0.065s
0x1f5b29e lea 0x8(%r8), %edx 0.326s
0x1f5b2a2 lea 0x7(%r8), %ecx 0.416s
0x1f5b2a6 movl %esi, 0x1e4(%rsp) 0.115s
0x1f5b2ad lea 0x6(%r8), %esi 0.030s
0x1f5b2b1 movl %ebp, 0x1e0(%rsp) 0.311s
0x1f5b2b8 lea 0x5(%r8), %ebp 0.356s
0x1f5b2bc movl %edi, 0x1dc(%rsp) 0.150s
0x1f5b2c3 lea 0x4(%r8), %edi 0.040s
0x1f5b2c7 movl %r11d, 0x1d8(%rsp) 0.321s
0x1f5b2cf lea 0x3(%r8), %r11d 0.416s
0x1f5b2d3 movl %r11d, 0x1d4(%rsp) 0.135s
0x1f5b2db lea 0x2(%r8), %r13d 0.065s
0x1f5b2df movl %r13d, 0x1d0(%rsp) 0.276s
0x1f5b2e7 mov %r8d, %r12d 0.426s
0x1f5b2ea inc %r12d 0.125s
0x1f5b2ed movl %r12d, 0x1cc(%rsp) 0.045s
0x1f5b2f5 movl %edx, 0x1c8(%rsp) 0.321s
0x1f5b2fc mov %r9d, %edx 0.441s
0x1f5b2ff inc %edx 0.120s
0x1f5b301 movl %edx, 0x1c4(%rsp) 0.040s
0x1f5b308 lea -0x3(%r9), %edx 0.366s
0x1f5b30c movl %edx, 0x1c0(%rsp) 0.311s
0x1f5b313 lea -0x2(%r9), %edx 0.221s
0x1f5b317 movl %edx, 0x1bc(%rsp) 0.045s
0x1f5b31e lea 0xf(%r8), %edx 0.376s
0x1f5b322 mov %r9d, %r8d 0.366s
0x1f5b325 dec %r8d 0.085s
0x1f5b328 movl %r8d, 0x1b8(%rsp) 0.045s
0x1f5b330 movl %edx, 0x1b4(%rsp) 0.371s
0x1f5b337 mov %ebx, %r9d 0.481s
0x1f5b33a cmp $0xfe, %r9d 0.035s
0x1f5b341 jnbe 0x1f7c658  
     
  Block 19 0.982s
0x1f5b347 lea 0xa(%rip), %rdx 0.035s
0x1f5b34e movsxdl (%rdx,%r9,4), %r9 0.321s
0x1f5b352 add %r9, %rdx 0.506s
0x1f5b355 jmp %rdx 0.120s

Someone who writes bytecode interpreters directly in assembly might be mortified by this code. Though, to me this is more of an artifact of some missed optimization opportunities in the otherwise excellent Graal compiler, which hopefully can be fixed.

I won’t include the results for the other interpreters here, but to summarize, let’s count the instructions of the bytecode dispatch for each of them:

  • TruffleSOM: 31 instructions in 1 basic block (after some extra optimization)
  • Espresso: 79 instructions in 2 basic blocks
  • GraalPy: 81 instructions in 3 basic blocks
  • GraalWasm: 56 instructions in 2 basic blocks

For GraalPy, it is even a little more complex. There are various other basic blocks involved, and none of the bytecode handlers jump back directly to the same block. Instead, there seems to be some more code after each handler before it jumps back to the top of the loop.

The First Micro-Optimization Opportunity

As mentioned for TruffleSOM, I already looked into one optimization opportunity. The very careful reader might have noticed the end of Block 18 above.

Address Assembly CPU Time
0x1f5b33a cmp $0xfe, %r9d 0.035s
0x1f5b341 jnbe 0x1f7c658  

This is a correctness check for the switch/case statement in the Java code. It makes sure that the value we switch over is covered by the cases in the switch. Otherwise, we’re jumping to some default block, or just back to the top of the loop.

The implementation in Graal is a little bit too hard-coded for my taste. While it has all the logic to eliminate unreachable cases, for instance when it sees that the value we switch over is guaranteed to exclude some cases, it handles the default case directly in the machine-specific lowering code.

It seems like one could generalize this a little more and possibly handle the default case like the other cases. The relevant Graal issue for this micro-optimization is #8425. However, when I applied this to TruffleSOM’s version of Graal and eliminated those two instructions, it didn’t make a difference. The remaining 31 instructions still dominate the bytecode dispatch.
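To spell out the idea of handling the default case like the other cases, here is a sketch of the concept only, in Python for illustration; this is not Graal’s lowering code. If the dispatch table has a target for every value an opcode byte can take, the separate bounds check becomes unnecessary:

MAX_OPCODE_VALUE = 255  # an opcode read as an unsigned byte cannot exceed this

def build_dispatch_table(handlers, default_handler):
    # Built once: every representable opcode value gets a jump target;
    # unused values simply point at the default handler.
    table = list(handlers)
    table += [default_handler] * (MAX_OPCODE_VALUE + 1 - len(table))
    return table

def dispatch(table, opcode):
    # With the padded table, the compare-and-branch guard
    # (the cmp/jnbe pair above) is no longer needed: the lookup alone is safe.
    return table[opcode]()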

Conclusion

The most important take-away message here is of course: know, don’t assume. Or, more specifically: measure, don’t guess.

For the problem at hand, it looks like Graal struggles with hoisting some common reads out of the bytecode loops. If there’s a way to fix this, this could give a massive speedup to all Truffle-based bytecode interpreters, perhaps enough to invalidate our AST vs. Bytecode Interpreters paper. Wouldn’t that be fun?! 🤓 🧑🏻‍🔬

The mentioned micro-optimization would also avoid a few instructions for every switch/case in normal Java code that doesn’t need the default case. So, it might be relevant for more than just bytecode interpreters.

For questions, comments, or suggestions, please find me on Twitter @smarr or Mastodon.

Addendum: Dispatch Code for PySOM’s RPython-based Bytecode Interpreter

After turning my notes into this blog post yesterday, I figured today I should also look at what RPython is doing for my PySOM bytecode interpreter.

The 13 instructions below are the bytecode dispatch for the interpreter. While the sequence is much shorter, it also contains the safety check cmp $0x45, %al to make sure the bytecode is within the set of jump targets. Ideally, a bytecode verifier would have ensured that already, or perhaps we could set up the native code so that there are simply no unsafe jump targets, avoiding the check every single time; at least based on VTune, it seems to consume a considerable amount of the overall run time. Also somewhat concerning is that the check is done twice: Block 21 already has a cmp $0x45, %rax, which should make the second test unnecessary. (Correction: the second check was unrelated, and I managed to remove it by applying an optimization I already had in TruffleSOM, but not yet in PySOM.)

So, yeah, I guess PySOM could be a bit faster on every single bytecode, which might mean PyPy could possibly also improve its interpreted performance.

Address Assembly CPU Time
  Block 20 0.085s
0xb45d8 cmp %r14, %rcx 0.085s
0xb45db jle 0xb6ba0  
  Block 21 0.631s
0xb45e1 movzxb 0x10(%rdx,%r14,1), %eax 0.311s
0xb45e7 cmp $0x45, %rax 0.321s
0xb45eb jnle 0xb6c00  
  Block 22 4.601s
0xb45f1 lea 0x69868(%rip), %rdi 0.571s
0xb45f8 movq 0x10(%rdi,%rax,8), %r15 0.020s
0xb45fd add %r14, %r15 2.942s
0xb4600 cmp $0x45, %al 1.068s
0xb4602 jnbe 0xb6c59  
  Block 23 0.476s
0xb4608 movsxdl (%rbx,%rax,4), %rax 0s
0xb460c add %rbx, %rax 0.015s
0xb460f jmp %rax 0.461s

Rank 10 Language Implementations

Please rank 10 language implementations by their median performance, based on your best guess or estimate.

Boxplot of the performance of 10 language implementations, without labels.

The above plot is based on the performance of 9 microbenchmarks and 5 slightly larger benchmarks, which were designed to study the effectiveness of compilers.

Use your best guess:

My goal with this poll is to see what the general performance expectations for various language implementations are.

For questions, comments, or suggestions, please find me on Twitter @smarr or Mastodon.

Thanks for playing!

The Changing “Guarantees” Given by Python's Global Interpreter Lock

In this blog post, I will look into the implementation details of CPython’s Global Interpreter Lock (GIL) and how they changed between Python 3.9 and the current development branch that will become Python 3.13.

My goal is to understand which concrete “guarantees” the GIL gives in both versions, which “guarantees” it does not give, and which ones one might assume based on testing and observation. I am putting “guarantees” in quotes, because with a future no-GIL Python, none of the discussed properties should be considered language guarantees.

While Python has various implementations, including CPython, PyPy, Jython, IronPython, and GraalPy, I’ll focus on CPython as the most widely used implementation. PyPy and GraalPy also use a GIL, but their implementations subtly differ from CPython’s, as we will see a little later.

1. What Is the GIL?

Let’s recap a bit of background. When CPython started to support multiple operating system threads, it became necessary to protect various CPython-internal data structures from concurrent access. Instead of adding locks or using atomic operations to protect the correctness of, for instance, reference counting, the contents of lists and dictionaries, or other internal data structures, the CPython developers decided to take a simpler approach and use a single global lock, the GIL, to protect all of these data structures from incorrect concurrent accesses. As a result, one can start multiple threads in CPython, though only one of them runs Python bytecode at any given time.

The main benefit of this approach is its simplicity and single-threaded performance. Because there’s only a single lock to worry about, it’s easy to get the implementation correct without risking deadlocks or other subtle concurrency bugs at the level of the CPython interpreter. Thus, the GIL represented a suitable point in the engineering trade-off space between correctness and performance.

2. Why Does the Python Community Think About Removing the GIL?

Of course, the obvious downside of this design is that only a single thread can execute Python bytecode at any given time. I am talking about Python bytecode here again, because operations that may take a long time, for instance reading a file into memory, can release the GIL and allow other threads to run in parallel.

For programs that spend most of their time executing Python code, the GIL is of course a huge performance bottleneck, and thus, PEP 703 proposes to make the GIL optional. The PEP mentions various use cases, including machine learning, data science, and other numerical applications.

3. Which “Guarantees” Does the GIL Provide?

So far, I only mentioned that the GIL is there to protect CPython’s internal data structures from concurrent accesses to ensure correctness. However, when writing Python code, I am more interested in the “correctness guarantees” the GIL gives me for the concurrent code that I write. To know these “correctness guarantees”, we need to delve into the implementation details of when the GIL is acquired and released.

The general approach is that a Python thread obtains the GIL when it starts executing Python bytecode. It will hold the GIL as long as it needs to and eventually release it, for instance when it is done executing, or when it is executing an operation that would often be long-running and does not itself require the GIL for correctness. This includes, for instance, the aforementioned file reading operation or, more generally, any I/O operation. However, a thread may also release the GIL when executing specific bytecodes.

This is where Python 3.9 and 3.13 differ substantially. Let’s start with Python 3.13, which I think roughly corresponds to what Python has been doing since version 3.10 (roughly since this PR). Here, the most relevant bytecodes are for function or method calls as well as bytecodes that jump back to the top of a loop or function. Thus, only a few bytecodes check whether there was a request to release the GIL.

In contrast, in Python 3.9 and earlier versions, the GIL is released at least in some situations by almost all bytecodes. Only a small set of bytecodes including stack operations, LOAD_FAST, LOAD_CONST, STORE_FAST, UNARY_POSITIVE, IS_OP, CONTAINS_OP, and JUMP_FORWARD do not check whether the GIL should be released.

The bytecodes that do check use the CHECK_EVAL_BREAKER() macro on 3.13 (src) or DISPATCH() on 3.9 (src), which eventually checks (3.13, 3.9) whether another thread requested the GIL to be released by setting the GIL_DROP_REQUEST bit in the interpreter’s state.

What makes “atomicity guarantees” more complicated to reason about is that this bit is set by threads waiting for the GIL based on a timeout (src). The timeout is specified by sys.setswitchinterval().
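The switch interval can be inspected and changed from Python code; by default it is 5 milliseconds:

import sys

print(sys.getswitchinterval())   # 0.005 seconds (5 ms) by default
sys.setswitchinterval(0.0005)    # ask for more frequent thread switches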

In practice, what does this mean?

For Python 3.13, this should mean that a function that contains only bytecodes that do not lead to a CHECK_EVAL_BREAKER() check should be atomic.

For Python 3.9, this means only a very small set of bytecode sequences is atomic; except for a tiny set of specific cases, one should assume that a bytecode sequence is not atomic.
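To see which bytecodes a given function actually compiles to on the Python version at hand, the dis module from the standard library is handy (the output differs between 3.9 and 3.13):

import dis

def bump(counter):
    counter['value'] += 1

dis.dis(bump)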

However, since the Python community is taking steps that may lead to the removal of the GIL, the changes in recent Python versions to give much stronger atomicity “guarantees” are likely a step in the wrong direction for the correctness of concurrent Python code. I mean this in the sense that people may accidentally come to rely on these implementation details, leading to hard-to-find concurrency bugs when running on a no-GIL Python.

4. Which Guarantees Might One Incorrectly Assume the GIL Provides?

Thanks to @cfbolz, I have at least one very concrete example of code that someone assumed to be atomic:

request_id = self._next_id
self._next_id += 1

The code tries to solve a classic problem: handing out unique request ids. However, it breaks when multiple threads execute this code at the same time, or rather interleaved with each other, because then we can end up handing out the same id multiple times. This concrete bug was fixed by making the reading and incrementing atomic using a lock.
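A minimal sketch of such a lock-based fix; the class and attribute names here are mine, for illustration only:

import threading

class RequestIdGenerator:
    def __init__(self):
        self._next_id = 0
        self._lock = threading.Lock()

    def next_id(self):
        # Holding the lock makes the read and the increment one atomic step,
        # independent of where the interpreter decides to switch threads.
        with self._lock:
            request_id = self._next_id
            self._next_id += 1
        return request_id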

On Python 3.9, we can relatively easily demonstrate the issue:

def get_id(self):
    # expected to be atomic: start
    request_id = self.id
    self.id += 1
    # expected to be atomic: end

    self.usage[request_id % 1_000] += 1

Running this on multiple threads will allow us to observe an inconsistent number of usage counts. They should all be the same, but they are not. Arguably, it’s not clear whether the observed atomicity issue comes from the request_id or from the usage counts, but the underlying issue is the same in both cases. For the full example, see 999_example_bug.py.

This repository contains a number of other examples that demonstrate the difference between different Python implementations and versions.

Generally, on Python 3.13, most bytecode sequences without function calls will be atomic. On Python 3.9, far fewer are, and I believe that is better, since it keeps people from writing code that relies on the very strong guarantees that Python 3.13 gives.

As mentioned earlier, because the GIL is released based on a timeout, one may also perceive bytecode sequences as atomic when experimenting.

Let’s assume we run the following two functions on threads in parallel:

def insert_fn(list):
    for i in range(100_000_000):
        list.append(1)
        list.append(1)
        list.pop()
        list.pop()
    return True


def size_fn(list):
    for i in range(100_000_000):
        l = len(list)
        assert l % 2 == 0, f"List length was {l} at attempt {i}"
    return True
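A minimal driver for the two functions might look as follows; the shared list and the thread setup are my own sketch, not necessarily what the examples in the repository do:

import threading

shared_list = []
inserter = threading.Thread(target=insert_fn, args=(shared_list,))
checker = threading.Thread(target=size_fn, args=(shared_list,))
inserter.start()
checker.start()
inserter.join()
checker.join()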

Depending on how fast the machine is, it may take 10,000 or more iterations of the loop in size_fn before we see an odd list length. This means it takes 10,000 iterations before the function calls to append or pop allow the GIL to be released between the first and second append(1), or between the first and second pop().

Without looking at the CPython source code, one might have concluded easily that these bytecode sequences are atomic.

Though, there’s a way to make it visible earlier. By setting the thread switch interval to a very small value, for instance with sys.setswitchinterval(0.000000000001), one can observe an odd list length after only a few or few hundred iterations of the loop.

5. Comparing Observable GIL Behavior for Different CPython Versions, PyPy, and GraalPy

In my gil-sem-demos repository, I have a number of examples that try to demonstrate observable differences in GIL behavior.

Of course, the very first example tries to show the performance benefit of running multiple Python threads in parallel. Using the no-GIL implementation, one indeed sees the expected parallel speedup.

On the other tests, we see the major differences between Python 3.8 – 3.9 and the later 3.10 – 3.13 versions. The later versions usually execute the examples without showing results at bytecode-level atomicity granularity. Instead, they suggest that loop bodies without function calls are pretty much atomic.

For PyPy and GraalPy, it is also harder to observe the bytecode-level atomicity granularity, because they are simply faster. Lowering the switch interval makes it a little more observable, except for GraalPy, which likely aggressively removes the checks for whether to release the GIL.

Another detail for the no-GIL implementation: it crashes for our earlier bug example. It complains about *** stack smashing detected ***.

A full log is available as a gist.

6. Conclusion

In this blog post, I looked into the implementation details of CPython’s Global Interpreter Lock (GIL). The semantics between Python 3.9 and 3.13 differ substantially. Python 3.13 gives much stronger atomicity “guarantees”, releasing the GIL basically only on function calls and jumps back to the top of a loop or function.

If the Python community intends to remove the GIL, this seems problematic. I would expect more people to implicitly rely on these much stronger guarantees, whether consciously or not.

My guess would be that this change was done mostly in an effort to improve the single-threaded performance of CPython.

To enable people to test their code on these versions with semantics closer to a no-GIL implementation, I would suggest adding a compile-time option to CPython that forces a GIL release and thread switch after every bytecode that may trigger behavior visible to other threads. This way, people would have a chance to test on a stable system that is closer to the future no-GIL semantics and probably only minimally slower at executing unit tests.

Any comments, suggestions, or questions are also greatly appreciated perhaps on Twitter @smarr or Mastodon.
