Counting CPU cycles on ESP32-C3 and ESP32-C6 microcontrollers

In this post I discuss how to count the number of CPU cycles spent executing code on the ESP32-C3 and ESP32-C6 microcontrollers, and why you might want to do that. I work through a small example of measuring the CPU cycles spent executing a number of iterations of a simple loop, and describe how we can use the results to build a model describing how many CPU cycles are required for each instruction in the loop.

Introduction

When bit banging protocols, especially high-speed protocols that push the speed limits of the CPU, timing is everything. It's not uncommon to find yourself having to hand-write assembly code to perform a particular protocol function, in order to ensure that the implementation respects the timing constraints dictated by the protocol spec.

The dedicated IO feature on ESP32-C6 and other microcontrollers allows us to manipulate a GPIO output signal within a single CPU cycle, but you need to hand-write assembly code to use it. For example, we can use the feature to generate a clock signal with a given frequency, by writing a loop that toggles the signal level every so many CPU cycles.

What is so hard about writing precisely timed code?

When writing signaling loops we need to know exactly how long each instruction in the loop takes, not just the instruction that actually manipulates the output signal. For example, to generate a simple perpetually repeating clock signal with a 5 MHz frequency, we need to write a loop that toggles the signal twice per 200 ns signal period (one transition from high to low, and one from low to high). If the CPU frequency is 160 MHz, that means we have to toggle the output once every 16 CPU cycles.
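Spelling out the arithmetic behind those numbers:

$$ \frac{1}{5\ \text{MHz}} = 200\ \text{ns per period} \qquad \frac{200\ \text{ns}}{2} = 100\ \text{ns per phase} \qquad 100\ \text{ns} \times 160\ \text{MHz} = 16\ \text{CPU cycles per phase} $$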

That seems simple enough, but as we saw in my previous post on the dedicated IO feature, even this simple task can be a little tricky, because in order to write an infinite loop we need to use a j jump instruction, and that instruction requires a different number of CPU cycles on different chips. On ESP32-C3 chips the instruction always takes two CPU cycles, but on ESP32-C6 chips it can take either one or two CPU cycles, depending on the situation.

1: csrrsi ...              // Set the pin high.
   nop, nop, nop, nop, ... // Idle for some number of CPU cycles.
   csrrci ...              // Set the pin low.
   nop, nop, nop, ...      // Idle for a smaller number of CPU cycles. The
                           // exact number depends on the number of cycles taken
                           // by the `j` instruction.
   j 1b                    // Loop back to label 1.

The general structure of an infinite loop that generates a repeating signal with a perfect 50% duty cycle using the "dedicated IO" feature on ESP32-C3/C6 chips. Note that to achieve the correct duty cycle we must know how many CPU cycles will be spent on each of the instructions in the loop, and then manually balance the CPU cycles spent in each phase of the loop.
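To make that balancing act concrete, here is the cycle budget for the 5 MHz example written out symbolically. Suppose, purely for the sake of illustration, that the csrrsi and csrrci instructions each take $c$ CPU cycles, that the j instruction takes $c_j$ cycles, and that the pin level changes at the same point within csrrsi and csrrci (the real values depend on the chip, as discussed above). With $n_{\text{high}}$ and $n_{\text{low}}$ nop instructions in the high and low phases, each phase must add up to 16 CPU cycles:

$$ \begin{cases} \begin{split} c + n_{\text{high}}&=16 \\ c + n_{\text{low}} + c_j&=16 \end{split} \end{cases} $$

so $n_{\text{low}} = n_{\text{high}} - c_j$, which is exactly why the number of cycles spent on the j instruction determines how many nop instructions the second phase needs.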

It's clear that getting the timing right for such a simple loop is already a fragile affair. Now, what happens if we want to do more than just generate a perpetually repeating signal? What if we want to emit a signal of a given length instead?

Instead of an unconditional jump instruction, we would probably need to use a counter register plus integer addition and conditional branch instructions, to break out of the loop at the appropriate moment.

   li t0, N                // Configure the loop to run for N iterations.
   li t1, 0
1: csrrsi ...              // Set the pin high.
   nop, nop, nop, nop, ... // Idle for some number of CPU cycles.
   csrrci ...              // Set the pin low.
   nop, nop, ...           // Idle for a smaller number of CPU cycles. The
                           // exact number depends on the number of cycles taken
                           // by the `addi` and `bne` instructions.
   addi t1, t1, 1          // Increment the counter register t1 by one.
   bne t0, t1, 1b          // Loop back to label 1 if t1 != t0.

The general structure of a loop that generates a signal with a perfect 50% duty cycle, and of a limited length. Note the introduction of the addi addition and bne conditional branch instructions, which replace the j unconditional jump instruction from the previous snippet.

Or what if rather than transmitting a signal, we want to receive one by sampling the GPIO input? In that case we might need to detect some kind of start or end condition after which we should start or stop sampling for new data, and hence we'd also need to break out of the loop, again relying on conditional branch instructions.

The more complex the loop, the more important it becomes to know exactly how long each iteration of the loop takes, and therefore how long each instruction takes. However, for various reasons the number of CPU cycles spent per instruction can be quite variable and situation-dependent. For example, branch predictors in the CPU core might cause branch instructions to take fewer CPU cycles the 2nd, 3rd, and subsequent times they're encountered, compared to the first time they're encountered.

This means that any approach that requires precisely timed behavior from a piece of code will inherently be quite fragile. The only way to get such code to work is to measure its behavior in practice, and even then you'll have to hope that the behavior you observed is deterministic and consistent.

Measuring CPU cycles using hardware performance counters

Since my Niccle project revolves around bit banging the Ethernet 10BASE⁠-⁠T protocol, I'm forced to try and work around this inherent fragility. To make my life a little easier, I set out to measure the behavior of my chip's CPU under various circumstances that would be relevant to my project.

Many chips, including the ESP32-C3 and ESP32-C6 chips, support a "hardware performance counter" feature that makes this task easier. These counters allow you to run a bit of code and then read back how many CPU cycles were spent executing that code, how many instructions were executed, how many of the CPU cycles were "idle", etc.

On the ESP32-C3 and ESP32-C6 in particular, this can be done by accessing a set of custom control and status registers (CSRs) called mpcer (Machine Performance Counter Event Register, at address 0x7E0), mpcmr (Machine Performance Counter Mode Register, at address 0x7E1), and mpccr (Machine Performance Counter Count Register, at address 0x7E2). They are documented in the ESP32-C3 and ESP32-C6 technical reference manuals.

// Configure the performance counter to count CPU cycles.
csrwi 0x7E0, 1
// Enable the performance counter.
csrwi 0x7E1, 1
// Read the performance counter into register t0.
csrr t0, 0x7E2

// Run the code we want to measure
...

// Read the performance counter again into register t1.
csrr t1, 0x7E2

An example of using the hardware performance counter on ESP32-C3 and ESP32-C6 chips. The number of spent CPU cycles can be calculated by subtracting the start value (register t0) from the end value (register t1).

With this basic functionality, it becomes quite easy to instrument and inspect the CPU's behavior in various situations. For example, by writing a piece of looping code and running 1, 2, 3, or more iterations of that loop, we can observe how many CPU cycles are spent in each case. From that information we can then work backwards and deduce how many cycles were spent on each executed instruction in the loop.

Deducing the CPU cycles taken by specific instructions

To show how this works in practice, I'll use the simple finite loop we discussed above as an example. The following code measures {n} iterations of that loop:

// Run n iterations.
li t0, {n}
li t1, 0
// Configure and read the performance counter.
csrwi 0x7E0, 1
csrwi 0x7E1, 1
csrr t2, 0x7E2

// Run the code we want to measure.
1: addi t1, t1, 1  // Increment the counter register t1 by one.
   bne t0, t1, 1b  // Loop back to label 1 if t1 != t0.

// Read the performance counter again.
csrr t3, 0x7E2

Measuring the CPU cycles spent executing a simple loop.
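As an aside, here is a rough sketch of how this benchmark could be wrapped in a Rust function using inline assembly, since the Niccle project and esp-hal are written in Rust. This is my own illustration rather than code from the Niccle repository: the function name is made up and it only builds for riscv32 targets, but the CSR addresses, the values written to them, and the measured loop are exactly the ones shown above.

use core::arch::asm;

/// Runs the addi/bne benchmark loop for `n` iterations (n >= 1) and returns the
/// number of CPU cycles reported by the mpccr counter.
fn measure_loop_cycles(n: u32) -> u32 {
    let start: u32;
    let end: u32;
    unsafe {
        asm!(
            "csrwi 0x7E0, 1",      // mpcer: select the "CPU cycles" event.
            "csrwi 0x7E1, 1",      // mpcmr: enable the counter.
            "csrr {start}, 0x7E2", // mpccr: read the counter before the loop.
            "li {i}, 0",
            "1:",
            "addi {i}, {i}, 1",    // Increment the loop counter by one.
            "bne {n}, {i}, 1b",    // Loop back to label 1 if i != n.
            "csrr {end}, 0x7E2",   // mpccr: read the counter again after the loop.
            n = in(reg) n,
            i = out(reg) _,
            start = out(reg) start,
            end = out(reg) end,
        );
    }
    end.wrapping_sub(start)
}

A sketch of wrapping the benchmark in a Rust function. Calling it with n set to 1, 2, 3, 4, and 1,000 should reproduce the kind of numbers shown in the table below.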

When I measured this loop for 1, 2, 3, 4 and 1,000 iterations, I got the following results.

Iterations {n}    CPU cycles (ESP32-C3)    CPU cycles (ESP32-C6)
1                 2                        5
2                 6                        8
3                 10                       10
4                 14                       12
1,000             3,998                    2,004

The number of CPU cycles spent when executing the above finite loop for the given number of iterations.

Note that these results show quite a bit of variability. For example, looking at the ESP32-C6 results:

  • A single iteration takes 5 CPU cycles.
  • Two iterations take 8 cycles, a difference of three CPU cycles.
  • Three iterations take 10 cycles, a smaller difference of only two cycles.
  • Each additional iteration after that seems to add another two cycles (four iterations take 12 cycles, and 1,000 iterations take 2,004).

One way to explain these observations is that the cycles taken by the instructions fall into three different situations: the last iteration (where the branch is not taken), the first iteration (where the branch is taken for the first time), and all iterations in between.

To come up with a model that can predict the number of CPU cycles taken we can actually treat these observations as a system of linear equations, which we can then solve to figure out the number of CPU cycles spent on each instruction in each situation. To do this we will:

  • Assume that the addi instruction only takes a single CPU cycle, and does so consistently in each iteration.
  • Assume that the bne branch instruction takes a different number of CPU cycles in each of the three situations. I.e. it takes either $x$, $y$, or $z$ cycles, with $x$ representing the cycles taken in the last iteration (when the branch is not taken), $y$ the cycles taken in the first iteration (the first time the branch is taken), and $z$ the cycles taken in all other iterations.

Applying this approach to the results for the ESP32-C6 chip gives us the following (overdetermined) system of equations.

$$ \begin{cases} \begin{split} 5&=x + 1 \\ 8&=x + y + 2 \\ 10&=x + y + z + 3 \\ 12&=x + y + 2z + 4 \\ 2004&=x + y + 998z + 1000 \end{split} \end{cases} $$

which has the following solution

$$ \begin{cases} \begin{split} x&=4 \\ y&=2 \\ z&=1 \end{split} \end{cases} $$
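As a quick sanity check, substituting this solution back into the 1,000-iteration equation reproduces the measured count: $4 + 2 + 998 \times 1 + 1000 = 2004$.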

We can follow a similar approach for the ESP32-C3. In that chip's case, the bne instruction only exhibits two different behaviors: one when the branch is not taken (the last iteration), and one when it is taken (all other iterations).
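With $a$ denoting the cycles spent on a taken bne and $b$ the cycles spent on the untaken bne (and, as before, one cycle per addi), the ESP32-C3 measurements give:

$$ \begin{cases} \begin{split} 2&=b + 1 \\ 6&=a + b + 2 \\ 10&=2a + b + 3 \\ 14&=3a + b + 4 \\ 3998&=999a + b + 1000 \end{split} \end{cases} $$

which has the solution $a=3$ and $b=1$. Together, the two solutions give us the following model of the CPUs' behavior in this particular benchmark.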

Instruction                                                               CPU cycles (ESP32-C3)    CPU cycles (ESP32-C6)
addi                                                                      1                        1
bne, when the branch is not taken (the last iteration)                    1                        4
bne, when the branch is taken for the first time (the first iteration)    3                        2
bne, when the branch is taken again (subsequent iterations)               3                        1

A table describing the number of CPU cycles taken up by each instruction in the previous benchmark, in different situations. Note how strongly the ESP32-C3 and ESP32-C6 CPUs differ in their behavior.

So for this particular benchmark run on the ESP32-C6, we can infer that the CPU's implementation optimizes for taken backward branches (since taken branches require fewer CPU cycles than the untaken branch), and that after the first taken branch some level of branch prediction seems to kick in (since subsequent taken branches take one CPU cycle less than the first taken branch).

Note that the ESP32-C3 doesn't behave the same way at all, and in fact seems to optimize for untaken backward branches instead.1 It also does not seem to have any branch prediction.

Limitations to this approach

There are important limitations to the approach I took here. In particular, I made some assumptions that may not always be true, such as that the addi instruction always takes a constant number of cycles (although this is in fact the case here, the same may not hold for other instructions).

I also assumed that on the ESP32-C6 chip the bne instruction requires the same number of CPU cycles in the last iteration in each of the benchmarks, regardless of whether the benchmark ran only a single iteration, or whether it ran multiple iterations. That is, I assumed that an "untaken branch" always takes the same number of CPU cycles.

I could be wrong about that. For example, the bne instruction implementation could perhaps be described by the following model instead.

Instruction                                                                                                             CPU cycles (ESP32-C6)
addi                                                                                                                    1
bne, when the branch is not taken and it was never taken before (i.e. when there's only a single iteration)             4
bne, when the branch is not taken and it was taken in a previous iteration (i.e. when there's two or more iterations)   5
bne, when the branch is taken                                                                                           1

A table describing an alternate model of the ESP32-C6 CPU's behavior, for the benchmark above. Note that this does not reflect the actual CPU behavior, but is being shown to illustrate that there could be more than one way to reason about a benchmark's results.
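As a quick check, this alternative model also reproduces the measurements: a single iteration costs $1 + 4 = 5$ cycles, and 1,000 iterations cost $1000 \times 1 + 999 \times 1 + 5 = 2004$ cycles (1,000 addi instructions, 999 taken branches, and one final untaken branch that had been taken before).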

There is ultimately no way to know, just from the performance counter data, which of the two models is the correct one; this alternative model would simply indicate that the CPU's behavior is even more variable. One way to resolve the question is to update the benchmark to emit a signal at the start of each iteration, or even between each instruction. We can then measure the time between signals and infer the precise number of CPU cycles spent on each instruction.

For my own curiosity I did, in fact, take that step, and with my oscilloscope I was able to observe that the last iteration in each benchmark does always take the same amount of time. Hence, I was able to conclude that the model I described in the previous section is the correct one. Even so, it's important to remember that this model is based only on a single benchmark, and still might not exhaustively describe the CPU's behavior for the bne instruction in every single case.

Conclusion

The results in this post confirm what we probably all already knew: the number of CPU cycles spent executing a given piece of code cannot easily be predicted just by reading the assembly code. The number of cycles taken by an instruction can vary significantly depending on the situation. The results also show that there can be significant differences in the behavior of different chips, even if they support the same instruction set. The RISC-V based ESP32-C3 and ESP32-C6 chips show very different results, after all. I'm sure this is true for many other chips as well.

These observations show that if precise timing of a piece of code is important, it is crucial to measure the actual CPU behavior in practice. A hardware performance counter makes it a lot easier to inspect the CPU's behavior under various conditions and to form a hypothesis, but to be absolutely sure you should probably still validate your conclusions with an external measuring instrument such as an oscilloscope.

The hardware performance counter feature can also come in handy in many other situations, such as when you're trying to optimize a piece of code more generally. I first saw the feature used on the ESP32-C3 when reading this esp-rs/esp-hal pull request, which used the performance counter based benchmarks in onsdagens/esp32c3-rt-benchmarks to conclude that a much more optimized implementation of interrupt vectoring was possible. That made me take a closer look and realize that the feature would be quite useful for my own needs.

Lastly, this post only discusses the behavior of the ESP32-C3 and ESP32-C6 chips on a single, very small benchmark. Because I'll need to write a lot more assembly code than this for my Niccle project, I've actually written a little binary that tests many more scenarios on the ESP32-C3 and ESP32-C6 CPUs, and I've added it to the Niccle repository. Take a look if you're interested!

Footnotes

1

It's interesting that the CPUs in the ESP32-C3 and ESP32-C6 differ so much in their behavior here. The ESP32-C6 chip seems to optimize for taken backward branches, which is actually in line with the guidance in Section 2.5 of the RISC-V spec, which states the following: "Software should be optimized such that the sequential code path is the most common path, with less-frequently taken code paths placed out of line. Software should also assume that backward branches will be predicted taken and forward branches as not taken, at least the first time they are encountered. Dynamic predictors should quickly learn any predictable branch behavior." This reflects the fact that the most natural way to write a loop in assembly is a structure in which the taken backward branch is the common case, and the untaken backward branch is the rarer, end-of-loop case. The ESP32-C3, on the other hand, seems to optimize for the opposite case of untaken backward branches, which is a bit unexpected.