Written by me, proof-read by an LLM.
Details at end.
Yesterday’s LICM post ended with the compiler pulling invariants like size() and get_range out of our loop - clean assembly, great performance. Job done, right?
Not quite. Let’s see how that optimisation can disappear.
Written by me, proof-read by an LLM.
Details at end.
Look back at our simple loop example - there’s an optimisation I completely glossed over. Let me show you what I mean:
Written by me, proof-read by an LLM.
Details at end.
Sometimes the compiler decides the best way to optimise your loop is to… write it twice. Sounds counterintuitive? Let’s change our sum example from before to optionally return a sum-of-squares1:
Written by me, proof-read by an LLM.
Details at end.
Who among us hasn’t looked at a number and wondered, “How many one bits are in there?” No? Just me then?
Actually, this “population count” operation can be pretty useful in some cases like data compression algorithms, cryptography, chess, error correction, and sparse matrix representations. How might one write some simple C to return the number of one bits in an unsigned 64 bit value?
Written by me, proof-read by an LLM.
Details at end.
A common theme for helping the compiler optimise is to give it as much information as possible. Using the right signedness of types, targeting the right CPU model, keeping loop iterations independent, and for today’s topic: telling it how many loop iterations there are going to be ahead of time.
Taking the range-based sum example from our earlier post on loops, but using a std::span1, we can explore this ability. Let’s take a look at what happens if we use a dynamically-sized span - we’d expect it to look very similar to the std::vector version:
Written by me, proof-read by an LLM.
Details at end.
Loop optimisations often surprise us. What looks expensive might be fast, and what looks clever might be slow.
Yesterday we saw how the compiler canonicalises loops so it (usually) doesn’t matter how you write them, they’ll come out the same. What happens if we do something a little more expensive inside the loop?
Written by me, proof-read by an LLM.
Details at end.
Which loop style is “best”? This very question led to the creation of Compiler Explorer! In 2011 I was arguing with my team about whether we could switch all our loops from ordinal or iterator-style to the “new” range-for1. I wrote a small shell script to iteratively show the compiler output as I edited code in vim, and the seed of Compiler Explorer was born.
C and C++ have many ways to phrase loops: for(), while(), do..while(), range-for (for (x : container)), STL algorithms like std::for_each, and now range transformations! Let’s see if loop style actually matters for performance.
Written by me, proof-read by an LLM.
Details at end.
I occasionally give presentations to undergraduates, and one of my favourites is taking the students on a journey of optimising a “binary to decimal” routine1. There are a number of tricks, which I won’t go in to here, but the opening question I have is “how do you even turn a number into its ASCII representation?”
If you’ve never stopped to think about it, take a moment now to do so, it can be a fun problem.
Matt Godbolt is a C++ developer living in Chicago. He works for Hudson River Trading on super fun but secret things. He is one half of the Two's Complement podcast. Follow him on Mastodon or Bluesky.