Counting CPU cycles on ESP32-C3 and ESP32-C6 microcontrollers
In this post I discuss how to count the number of CPU cycles spent executing code on the ESP32-C3 and ESP32-C6 microcontrollers, and why you might want to do that. I work through a small example of measuring the CPU cycles spent executing a number of iterations of a simple loop, and describe how we can use the results to build a model describing how many CPU cycles are required for each instruction in the loop.
Introduction
When bit banging protocols, especially high-speed protocols that push the speed limits of the CPU, timing is everything. It's not uncommon to find yourself having to hand-write assembly code to perform a particular protocol function, in order to ensure that the implementation respects the timing constraints dictated by the protocol spec.
The dedicated IO feature on ESP32-C6 and other microcontrollers allows us to manipulate a GPIO output signal within a single CPU cycle, but you need to hand-write assembly code to use it. For example, we can use the feature to generate a clock signal with a given frequency, by writing a loop that toggles the signal level every so many CPU cycles.
What is so hard about writing precisely timed code?
When writing signaling loops we need to know exactly how long each instruction in the loop takes, not just the instruction that actually manipulates the output signal. For example, to generate a simple perpetually repeating clock signal with a 5 MHz frequency, we need to write a loop that toggles the signal twice per 200 ns signal period (one transition from high to low, and one from low to high). If the CPU frequency is 160 MHz, that means we have to toggle the output once every 16 CPU cycles.
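Spelled out, the arithmetic looks like this:

$$ T_{\text{signal}} = \frac{1}{5\ \text{MHz}} = 200\ \text{ns}, \qquad \frac{T_{\text{signal}}}{2} = 100\ \text{ns}, \qquad \frac{100\ \text{ns}}{1 / 160\ \text{MHz}} = \frac{100\ \text{ns}}{6.25\ \text{ns}} = 16\ \text{CPU cycles} $$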
That seems simple enough, but as we saw in my previous post on the dedicated IO feature, even this simple task can be a little tricky, because in order to write an infinite loop we need to use a `j` jump instruction, and that instruction requires a different number of CPU cycles on different chips. On ESP32-C3 chips the instruction always takes two CPU cycles, but on ESP32-C6 chips it can take either one or two CPU cycles, depending on the situation.
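For illustration, here's a minimal sketch of what such an infinite clock loop could look like as Rust inline assembly. The CSR address used for the dedicated IO output (0x803) and the assumption that the `csrrsi`/`csrrci` writes and each `nop` take one cycle are things to verify against the technical reference manual and against actual measurements; the point is mainly to show how the `j` instruction's cycle count ends up baked into the timing.

```rust
use core::arch::asm;

/// Sketch of a 5 MHz clock loop on a 160 MHz CPU: 16 cycles high, 16 cycles
/// low. Assumes the dedicated IO output is reachable via CSR 0x803, that the
/// CSR writes and each `nop` take one cycle, and that `j` takes two cycles
/// (as on the ESP32-C3); on the ESP32-C6 the filler would need re-tuning.
fn clock_loop() -> ! {
    unsafe {
        asm!(
            "1:",
            "csrrsi zero, 0x803, 0x1",  // drive dedicated IO channel 0 high (1 cycle)
            ".rept 15", "nop", ".endr", // 15 filler cycles: high phase = 16 cycles
            "csrrci zero, 0x803, 0x1",  // drive channel 0 low (1 cycle)
            ".rept 13", "nop", ".endr", // 13 filler cycles, leaving room for...
            "j 1b",                     // ...the jump back (2 cycles on ESP32-C3)
            options(noreturn)
        )
    }
}
```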
It's clear that getting the timing right for such a simple loop is already a fragile affair. Now, what happens if we want to do more than just generate a perpetually repeating signal? What if we want to emit a signal of a given length instead?
Instead of an unconditional jump instruction, we would probably need to use a counter register plus integer addition and conditional branch instructions, to break out of the loop at the appropriate moment.
Or what if rather than transmitting a signal, we want to receive one by sampling the GPIO input? In that case we might need to detect some kind of start or end condition after which we should start or stop sampling for new data, and hence we'd also need to break out of the loop, again relying on conditional branch instructions.
The more complex the loop, the more important it becomes to know exactly how long each iteration of the loop takes, and therefore how long each instruction takes. However, for various reasons the number of CPU cycles spent per instruction can be quite variable and situation-dependent. For example, branch predictors in the CPU core might cause branch instructions to take fewer CPU cycles the 2nd, 3rd, and subsequent times they're encountered, compared to the first time they're encountered.
This means that any approach that requires precisely timed behavior from a piece of code will inherently be quite fragile. The only way to try and get such code to work is to measure its behavior in practice, and even then you'll have to hope that the behavior you observed is deterministic and consistent.
Measuring CPU cycles using hardware performance counters
Since my Niccle project revolves around bit banging the Ethernet 10BASE-T protocol, I'm forced to try and work around this inherent fragility. To make my life a little easier, I set out to measure the behavior of my chip's CPU under various circumstances that would be relevant to my project.
Many chips, including the ESP32-C3 and ESP32-C6 chips, support a "hardware performance counter" feature that makes this task easier. These counters allow you to run a bit of code and then read back how many CPU cycles were spent executing that code, how many instructions were executed, how many of the CPU cycles were "idle", etc.
On the ESP32-C3 and ESP32-C6 in particular, this can be done by accessing a set of custom control and status registers (CSRs) called `mpcer` (Machine Performance Counter Event Register, at address 0x7E0), `mpcmr` (Machine Performance Counter Mode Register, at address 0x7E1), and `mpccr` (Machine Performance Counter Count Register, at address 0x7E2). They are documented in the ESP32-C3 technical reference manual and the ESP32-C6 manual.
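To make this a little more concrete, here is a minimal sketch of how these CSRs could be used from Rust. The helper name `count_cycles` is just something I use for illustration, and the values written to `mpcer` and `mpcmr` (1 in both cases, assumed to select the CPU-cycle event and to enable the counter) should be double-checked against the field layouts described in the manuals.

```rust
use core::arch::asm;

/// Runs `f` and returns the value of the mpccr counter afterwards.
/// Note that the count includes the overhead of calling `f` and of the
/// surrounding CSR accesses, so in practice you'd subtract a baseline
/// measurement taken with an empty closure.
fn count_cycles(f: impl FnOnce()) -> u32 {
    let cycles: u32;
    unsafe {
        asm!(
            "csrw 0x7E1, zero", // stop the counter while we configure it
            "csrw 0x7E2, zero", // reset the counter value to zero
            "csrwi 0x7E0, 1",   // select the event to count (assumed: CPU cycles)
            "csrwi 0x7E1, 1",   // start counting
        );
    }
    f();
    unsafe {
        asm!(
            "csrw 0x7E1, zero", // stop the counter again
            "csrr {0}, 0x7E2",  // read back the accumulated cycle count
            out(reg) cycles,
        );
    }
    cycles
}
```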
With this basic functionality, it becomes quite easy to instrument and inspect the CPU's behavior in various situations. For example, by writing a piece of looping code and running 1, 2, 3, or more iterations of that loop, we can observe how many CPU cycles are spent in each case. From that information we can then work backwards and deduce how many cycles were spent on each executed instruction in the loop.
Deducing the CPU cycles taken by specific instructions
To show how this works in practice, I'll use the simple finite loop we discussed above as an example. The following code measures $n$ iterations of that loop.
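What's shown below is a minimal sketch rather than the exact benchmark code: it assumes $n \ge 1$ and reuses the hypothetical `count_cycles` helper sketched above to read the counter CSRs.

```rust
use core::arch::asm;

/// Counts down from `n` to zero, using one `addi` and one `bne` per iteration.
/// Assumes `n >= 1` (with `n == 0` the counter would wrap around and the loop
/// would run for a very long time).
fn finite_loop(n: u32) {
    unsafe {
        asm!(
            "1:",
            "addi {i}, {i}, -1", // decrement the iteration counter
            "bne {i}, zero, 1b", // branch back up while the counter is non-zero
            i = inout(reg) n => _,
        );
    }
}

// For example, measuring 1,000 iterations:
// let cycles = count_cycles(|| finite_loop(1_000));
```

To attribute the resulting count to just the loop itself, any overhead from starting and stopping the counter needs to be excluded, for example by subtracting a baseline measurement.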
When I measured this loop for 1, 2, 3, 4 and 1,000 iterations, I got the following results.
| Iterations ($n$) | CPU cycles (ESP32-C3) | CPU cycles (ESP32-C6) |
|---|---|---|
| 1 | 2 | 5 |
| 2 | 6 | 8 |
| 3 | 10 | 10 |
| 4 | 14 | 12 |
| 1,000 | 3,998 | 2,004 |
The number of CPU cycles spent when executing the above finite loop for the given number of iterations.
Note that these results show quite a bit of variability. For example, looking at the ESP32-C6 results:
- A single iteration takes 5 CPU cycles.
- Two iterations take 8 cycles, a difference of three CPU cycles.
- Three iterations take 10 cycles, a smaller difference of only two cycles.
- Every iteration after that adds another two cycles.
One way to explain these observations is that the cycles taken by the instructions fall into three different situations: the last iteration, the second to last iteration, and all other iterations.
To come up with a model that can predict the number of CPU cycles taken we can actually treat these observations as a system of linear equations, which we can then solve to figure out the number of CPU cycles spent on each instruction in each situation. To do this we will:
- Assume that the `addi` instruction only takes a single CPU cycle, and does so consistently in each iteration.
- Assume that the `bne` branch instruction uses a different number of CPU cycles in each of the three situations. I.e. it takes either $x$, $y$, or $z$ cycles, with $x$ representing the cycles taken in the last iteration, $y$ the cycles taken in the second to last iteration, and $z$ the cycles taken in all other iterations.
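Under these assumptions, each iteration costs one cycle for the `addi` plus whatever the `bne` costs in that iteration's situation, so a benchmark of $n$ iterations should take:

$$ \text{cycles}(n) = \begin{cases} 1 + x & \text{if } n = 1 \\ 2 + x + y & \text{if } n = 2 \\ n + x + y + (n - 2)z & \text{if } n \geq 3 \end{cases} $$

where the leading $n$ accounts for the `addi` executed once per iteration.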
Applying this approach to the results for the ESP32-C6 chip gives us the following (overdetermined) system of equations.
$$ \begin{cases} \begin{split} 5&=x + 1 \\ 8&=x + y + 2 \\ 10&=x + y + z + 3 \\ 12&=x + y + 2z + 4 \\ 2004&=x + y + 998z + 1000 \end{split} \end{cases} $$
which has the following solution
$$ \begin{cases} \begin{split} x&=4 \\ y&=2 \\ z&=1 \end{split} \end{cases} $$
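Working through the system from top to bottom: the first equation gives $x = 5 - 1 = 4$, the second then gives $y = 8 - 4 - 2 = 2$, and the third gives $z = 10 - 4 - 2 - 3 = 1$. The remaining two equations serve as a consistency check, since $4 + 2 + 2 \cdot 1 + 4 = 12$ and $4 + 2 + 998 \cdot 1 + 1000 = 2004$, so this solution fits all five observations.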
We can follow a similar approach for the ESP32-C3. In that chip's case, the `bne` instruction only exhibits two different behaviors: one in the last iteration, and one in all other iterations.
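Writing $x$ for the cycles taken by the untaken branch in the last iteration and $z$ for the cycles taken by the taken branch in every other iteration, the ESP32-C3 measurements give the smaller system

$$ \begin{cases} \begin{split} 2&=x + 1 \\ 6&=x + z + 2 \\ 10&=x + 2z + 3 \\ 14&=x + 3z + 4 \\ 3998&=x + 999z + 1000 \end{split} \end{cases} $$

which is solved by $x = 1$ and $z = 3$.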
Combining this with the ESP32-C6 solution above leads us to the following model of the CPUs' behavior in this particular benchmark.
| Instruction | CPU cycles (ESP32-C3) | CPU cycles (ESP32-C6) |
|---|---|---|
| `addi` | 1 | 1 |
| `bne`, when branch is not taken (the last iteration) | 1 | 4 |
| `bne`, when branch is taken (second to last iteration) | 3 | 2 |
| `bne`, when branch is taken (subsequent iterations) | 3 | 1 |
A table describing the number of CPU cycles taken up by each instruction in the previous benchmark, in different situations. Note how strongly the ESP32-C3 and ESP32-C6 CPUs differ in their behavior.
So for this particular benchmark run on the ESP32-C6, we can infer that the CPU's implementation optimizes for taken backward branches (since taken branches take fewer CPU cycles than untaken branches), and that after the first taken branch some level of branch prediction seems to kick in (since subsequent taken branches take one less CPU cycle than the first taken branch).
Note that the ESP32-C3 doesn't behave the same way at all, and in fact seems to optimize for untaken backward branches instead.[^1] It also does not seem to have any branch prediction.
Limitations to this approach
There are important limitations to the approach I took here. In particular, I made some assumptions that may not always be true, such as that the `addi` instruction always takes a constant number of cycles (although this is in fact the case for `addi`, it may not always be true for other instructions).
I also assumed that on the ESP32-C6 chip the `bne` instruction requires the same number of CPU cycles in the last iteration in each of the benchmarks, regardless of whether the benchmark ran only a single iteration, or whether it ran multiple iterations. That is, I assumed that an "untaken branch" always takes the same number of CPU cycles.

I could be wrong about that. For example, the `bne` instruction implementation could perhaps be described by the following model instead.
| Instruction | CPU cycles (ESP32-C6) |
|---|---|
| `addi` | 1 |
| `bne`, when branch is not taken and it was never taken before (i.e. when there is only a single iteration) | 4 |
| `bne`, when branch is not taken and it was taken in a previous iteration (i.e. when there are two or more iterations) | 5 |
| `bne`, when branch is taken | 1 |
A table describing an alternate model of the ESP32-C6 CPU's behavior, for the benchmark above. Note that this does not reflect the actual CPU behavior, but is being shown to illustrate that there could be more than one way to reason about a benchmark's results.
This model would also be consistent with our benchmark results, and it would indicate that the CPU behavior is even more variable. There is ultimately no way to know, just from the performance counter data, which of the two models is actually the correct one. One way to resolve this question is to instead update the benchmark to emit a signal at the start of each iteration, or even between each instruction. We can then measure the time between signals and infer the precise CPU cycles spent on each instruction.
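Here is a rough sketch of what such an instrumented loop could look like, again assuming that the dedicated IO output sits behind CSR 0x803 (something to verify against the TRM). Note that the marker instructions themselves consume cycles, so they change the absolute timing; only the relative spacing between the observed edges is meaningful.

```rust
use core::arch::asm;

/// The benchmark loop with a dedicated IO marker: the output goes high at the
/// start of each iteration and low again just before the branch, so an
/// oscilloscope can show how long each part of the iteration takes.
fn finite_loop_with_markers(n: u32) {
    unsafe {
        asm!(
            "1:",
            "csrrsi zero, 0x803, 0x1", // raise the marker signal
            "addi {i}, {i}, -1",       // the addi under test
            "csrrci zero, 0x803, 0x1", // lower the marker signal
            "bne {i}, zero, 1b",       // the bne under test
            i = inout(reg) n => _,
        );
    }
}
```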
For my own curiosity I did, in fact, take that step, and with my oscilloscope I was able to observe that the last iteration in each benchmark does always take the same amount of time. Hence, I was able to conclude that the model I described in the previous section is the correct one. Even so, it's important to remember that this model is based only on a single benchmark, and still might not exhaustively describe the CPU's behavior for the `bne` instruction in every single case.
Conclusion
The results in this post confirm what we probably all already knew: the number of CPU cycles spent executing a given piece of code cannot easily be predicted just by reading the assembly code. The number of cycles taken by an instruction can vary significantly depending on the situation. The results also show that there can be significant differences in the behavior of different chips, even if they support the same instruction set: the RISC-V based ESP32-C3 and ESP32-C6 chips show very different results, after all. I'm sure this is true for many other chips as well.
These observations show that if precise timing of a piece of code is important, it is crucial to measure the actual CPU behavior in practice. A performance counter feature, if the CPU has one, makes it a lot easier to inspect the CPU's behavior under various conditions and to form a hypothesis, but to be absolutely sure you should probably still validate your conclusions with an external measuring instrument such as an oscilloscope.
The hardware performance counter feature can also come in handy in many other situations, such as when you're trying to optimize a piece of code more generally. I first saw the feature used on the ESP32-C3 when reading this esp-rs/esp-hal pull request, which used the performance counter-based benchmarks in onsdagens/esp32c3-rt-benchmarks to conclude that a much more optimized implementation of interrupt vectoring was possible. That made me take a closer look and realize that the feature would be quite useful for my needs as well.
Lastly, this post only discusses the behavior of the ESP32-C3 and ESP32-C6 chips on a single, very small benchmark. Because I'll need to write a lot more assembly code than this for my Niccle project, I've actually written a little binary that tests many more scenarios on the ESP32-C3 and ESP32-C6 CPUs, and I've added it to the Niccle repository. Take a look if you're interested!
Footnotes
[^1]: It's interesting that the CPUs in the ESP32-C3 and ESP32-C6 differ so much in their behavior here. The ESP32-C6 chip seems to optimize for taken backward branches, which is actually in line with the guidance in Section 2.5 of the RISC-V spec, which states the following: "Software should be optimized such that the sequential code path is the most common path, with less-frequently taken code paths placed out of line. Software should also assume that backward branches will be predicted taken and forward branches as not taken, at least the first time they are encountered. Dynamic predictors should quickly learn any predictable branch behavior." This reflects the fact that the most natural way to write a loop in assembly is in fact a structure where the taken backward branch is the common case, and the untaken backward branch is the rarer, end-of-loop case. The ESP32-C3, on the other hand, seems to optimize for the opposite case of untaken backward branches, which is a bit unexpected.