I’m in the middle of an investigation of the branch predictor on newer Intel chips. Read the previous article to get some background.
Where I left off I’d just decided to look into static prediction myself. I’ve previously used Agner Fog’s tools to do these kinds of investigations, and so that was my first stop.
Agner’s tools install a kernel module to give user-mode access to the hardware performance monitoring counters inside the Intel chips. Each CPU has four counters that can be used to count one of a number of internal CPU events. Agner’s tools then run micro-benchmarks while counting the various internal things going on inside the processor. This seems perfect for us!
Each CPU microarchitecture has a different set of counters available, so the first stop was to pick a CPU and choose the counters. As a starting point for my investigations I picked my laptop’s CPU – an Arrendale CPU (Core(TM) i5 CPU M 520 @ 2.40GHz). This is a mobile version of the Westmere chip. Later I’d run similar investigations on other CPUs.
Digging out the docs (Intel Architectures Software Developer Manual Volume 3B, part 2, pages 362-393), there’s a bewildering array of useful-sounding counters.
Back to Agner’s tools and my plan for the test started to form – I would write a series of microbenchmarks with:
Within each microbenchmark I’d count the number of instructions and branch mispredictions (and some other useful-sounding counters). The expectation is that on the first run of the benchmark we would see a number of mispredictions, as the Branch Target Buffer (BTB) would miss for every branch. Then the decoder would be the first to spot the branch, and it has a choice as to whether it re-steers the fetcher to the branch destination or not. Either way, once the instruction retires we get to see if the decoder made the right guess or not.
The expectation is if the decoder assumes the branch is not taken, then we should see no mispredictions for forward-not-taken and backward-not-taken, but we should see pretty much a 100% misprediction rate for the taken cases.
If the decoder assumes backwards branches are loops and are likely to be taken, then we’d expect forward-not-taken and backwards-taken to have no mispredicts, and the other two cases would be close to 100% misprediction rate.
After a number of false starts with Agner’s set-up (and noting that some of his counter definitions seem to be incorrect for Westmere/Arrendale), I had some initial results. The test was 1000 back-to-back branches:
Pretty compelling evidence for loop-style static prediction here!1
I ran the same test on an Ivy Bridge (E5-2667 v2) and a Haswell (E5-1650 v3):
Here the evidence is a little more puzzling. The Ivy seems to weakly predict ahead as not taken, but there’s enough variability to make me wonder if something else is going on. For Haswell it seems no branch is ever predicted as taken. Interesting, and worth more investigation.
The full results – including some notes I made and the other counters – are available in a Google Sheet.
By this point I had resolved to improve upon Agner’s tools which are a little finicky and easy to misuse. To that end I branched his code and started on a little refactoring. That refactored code is available on GitHub as the agner project. The Google Sheet link above links to the git hash of the exact point at which the data was collected, should you wish to try reproducing.
So: job done? Seems the Arrendale does static prediction of loop-like branches, but newer Intels don’t, so case closed?
Agner’s tests run each microbenchmark multiple times, back-to-back. By default 10 iterations are performed. Obviously, once the BPU has learned about the whereabouts of each branch, the misprediction rate drops to essentially zero. This was useful initially in ensuring I was using the right counters: after the first iteration the misprediction rate ought to drop to zero.
But then I thought I should be able to “scramble” the BPU in-between
iterations to get a consistent misprediction rate on each iteration. I wrote
some code (
ScrambleBTB) to flood the CPU with loads of branches in an effort
to evict the whole BTB. This I placed between iterations (but outside of the
To my surprise it took a lot more effort than I thought to make the routine effective: and I’m still not quite sure what is actually required to trick the BTB properly.
In order to find out what’s going on, I decided to further lift the lid and try and work out the size, arrangement and eviction policies of the BTB in my Arrendale.
More on that next time though!
The misprediction number being above 1000 in some cases is an artifact of some branches around the outside of the test setup. ↩
Over the last week or so I’ve been investigating the static branch prediction on modern Intel processors. A thread on the excellent Mechnical Sympathy mailing list got me thinking about it: a claim was made that static prediction is still used on Intel processors; and my understanding from Agner Fog’s excellent resources is that newer Intel processors no longer do so.
This has led to quite an odyssey of understanding, which I’m still embroiled in; so forgive the length of this post and the fact that it’s the first in a series…
So what’s branch prediction? And what’s static prediction?
As you may know, modern processors have long pipelines composed of Fetch / Decode / Execute / Retire1 stages. An instruction is fetched, then decoded, then executed, and finally retired. This process is overlapped between adjacent instructions; so while one is retiring, the next is executing, the next is decoding and another is being fetched.
On older CPUs these stages would be sequential2: most instructions would take several cycles before the next even started, so pipelining is a big win. However, branches throw a spanner into the works: when a branch gets to the execution stage it changes what instruction will be executed next. This means anything in the Fetch and Decode stage needs to be thrown away – wasted work!
This is where branch prediction comes in: something ahead of even the Fetch stage makes a guess as to where the branches might be – and in the case of conditional branches, whether they’re going to be taken or not. This stage is the branch predictor (or branch prediction unit, BPU).
The BPU has two jobs then: before the instruction stream has even been decoded it must make a guess whether there are any branches coming up. For unconditional branches, it makes a guess where the branch is going, and steers the fetcher accordingly. For conditional branches, it makes a guess whether the branch is going to be taken too.
On Intel CPUs whose pipelines are far more complex and often have 10+ stages between the fetcher and the execution unit, the BPU’s job is critical for performance. Intel have understandably put a lot of effort into making the BPU very good at its job. Older CPUs (Pentium M era) have been reversed engineered to some extent, but Westmere and beyond haven’t really been looked at3.
Intel processors fetch instructions in blocks of 16 bytes every cycle and each block can contain several branches. As best is understood, a cache-like structure called the Branch Target Buffer (BTB) is used to look up previously-seen branches. Just like a memory cache, the BTB has a number of sets, and each set has a number of ways. Some number of bits of the current program counter are used to pick a set within the BTB, and then another subset of the bits is compared with the tags in each of the ways of the set. Any hits found contain information about potential branches. Because there can be multiple matching branches in a set, some logic may have to pick the “first” encountered branch. For conditional branches, a second system is used to predict the outcome. I won’t go into too many details about this part in this post.
The guesses of the branch destination are sent on, and then in the decode stage, those guesses are checked against the actual decoded instruction stream. If a branch destination is wrong – or if a non-branch was mis-guessed as a branch – it’s noted there, the BTB is corrected, and the fetchers are re-steered to the right destination.
It’s at this point the “static prediction” comes in: If the decoder spots a branch that the BPU hadn’t predicted, it has to re-steer the fetcher. If it’s a conditional, the decoder gets a chance to pick whether it’s predicted taken or not. This guess is made based on static rules instead of any kind of knowledge about that particular branch.
What might such a static prediction algorithm be? The most obvious is “predict not taken” – just assume the fetcher should carry on to the instruction beyond the branch.
Another sensible static prediction algorithm is to assume conditional branches to previous addresses are the branch at the end of a loop and so are more likely to be taken than not. Conditional branches forward are considered not taken. This latter algorithm is what the Pentium Ms and some earlier processors used, according to the documentation.
A final choice is kind of not really a choice: instead use whatever prediction the dynamic “is it taken or not” predictor gives for this branch, even though it may know nothing about this branch.
So, returning to the point in hand: I was reading the Mechanical Sympathy thread and – having spent rather a long time reading about and thinking about this kind of thing – I had an xkcd moment:
And so began a long process of me trying to work out what type of static prediction modern processors use. Over the next few days I’ll blog about how I went about this, and what I’ve found so far.
Next: some results.
This of course is a huge simplification as this post is already getting gargantuan. Modern Intels have probably 10-15 cycles’ worth of cycles in its front end before it even reaches the even-more-complex Out-Of-Order system. If your interest is piqued in this, check out my YouTube video on the subject. ↩
Though even the simple (but awesome) 6502 had a very simple pipeline where the fetch of the next instruction was overlapped with the execution of the previous. ↩
Though a friend has told me there’s some research in a paper I haven’t yet been able to get hold of: “Shah Mohammad Faizur Rahman, Zhe Wang, and Daniel A. Jiménez, “Studying Microarchitectural Structures with Object Code Reordering”, Proceedings of the 2009 Workshop on Binary Instrumentation and Applications (WBIA), December, 2009.” ↩
I’ve been given permission to release the video; so here it is, warts and all:
The slides are now available online. Hopefully they make enough sense by themselves to be interesting. Please note that you can go both left/right and up/down: Some slides have more information if you go down.
The slides were made using reveal with a bit of custom work to get the 6502 assembly syntax highlighted.
Update: I’ve just been given permission to release the video.