The GPIO speed of ESP32-C3 and ESP32-C6 microcontrollers

  • Published on
  • • Updated on
  • • 28 min read
  • • Tags: 
  • esp32
  • risc-v
  • rust

In this post I explore the maximum GPIO output and input speed we can achieve with ESP32-C3 and ESP32-C6 microcontrollers. The standard way of doing GPIO is limited to signal frequencies of less than 10 ⁠MHz, but the "dedicated IO" mechanism available on the ESP-C and ESP-S chip series provides an alternative solution that can output and sample signals with frequencies in the 40 ⁠MHz to 80 ⁠MHz range.

Table of Contents

Why care about the GPIO speed?

It's worth considering why we would care about the GPIO speed at all. Simple tasks like controlling an LED certainly don't require very high signaling speeds, and for more complex tasks microcontrollers often contain dedicated peripherals that can take care of higher-speed signaling without needing to rely on the GPIO function (e.g. for I²C or SPI connections). Hence, most projects don't need to use GPIO for high-speed signaling.

However, there are cases where using GPIO at high speeds is necessary or desired, and then it becomes important to know what the maximum achievable GPIO speed actually is. One such case is if you want to write a bit banged implementation of a high-speed protocol like "high speed" I²C, SPI, or Ethernet. You might want to bit bang a protocol if your microcontroller doesn't have a dedicated peripheral for it. Or, as is the case for me, perhaps you want to write a software-only implementation of some protocol for educational purposes, simply for the sake of learning something new.

I've been working on a bit banged Ethernet 10BASE⁠-⁠T implementation (one that doesn't rely on another dedicated IC to perform either MAC or PHY functions), which at the very least requires being able to transmit and receive signals with frequencies up to 10 ⁠MHz, since that's the bandwidth that protocol uses. If my microcontroller's GPIO cannot toggle pin values fast enough to achieve that signaling frequency, then it won't be possible to build a bit banged implementation, no matter how good my software is.

// A hand-wavy example of bit banging the start of an 10BASE⁠-⁠T Ethernet frame
// transmission (the "preamble"), consisting of a signal that alternates
// between high and low values every 100 ⁠ns.
 for i in 0..8 {
  pin0.set_high();
  delay_ns(100);
  pin0.set_low();
  delay_ns(100);
}

Defining GPIO speed

Let's define the concept of "GPIO speed" a bit more formally, as it often gets discussed in somewhat vague and differing terms around the internet.

I find it easiest to reason about GPIO speed in the context of a minimal benchmark. For example, we can use the following benchmark to establish an upper bound for the output speed of a single GPIO pin, by showing us how quickly we can toggle a its value from high to low and back to high again.

// An example of setting a pin high, low, and high again in a tight loop.
while true {
  pin0.set_high();
  pin0.set_low();
}

Given such a benchmark, we can then define the observed GPIO output speed $f_{out}$ as the fundamental frequency of the signal generated by that benchmark

$$f_{out} = \frac{1}{T}$$

where $T$ is the period of generated signal, the time it takes to complete one cycle of pin toggling. The unit of this quantity is Hz. The maximum achievable GPIO speed can then be defined as the GPIO speed we can achieve with the fastest benchmark we can come up with.

For example, if we are able to cycle a pin's value from high to low and then high again within a period of 200 ⁠ns (that is, 100 ⁠ns spent at the high value, and 100 ⁠ns spent at the low value), then we have achieved a 5 ⁠MHz output speed.

This definition of GPIO output speed is convenient, because it allows us to directly compare it to the required spectral bandwidth or clock speed of the protocol we're interested in bit banging (e.g. 10 ⁠MHz of bandwidth for Ethernet 10BASE⁠-⁠T, or a 100 ⁠KHz clock speed for "fast-mode" I²C). We could define "speed" in other ways, such as by the time it takes to output a single signal transition (100 ⁠ns in the example above), but those won't be as directly comparable. By choosing to define the GPIO speed using an actual benchmark, we also know that the upper bound we define is achievable in practice, and not just based on something we read in a datasheet.

The limitations of a simple benchmark

Now, a bit banged implementation will really only be feasible in practice if such a benchmark can reach a much higher signalling speed than the target protocol's required bandwidth, perhaps 2x or 4x as high. That's because we need to account for the additional delays between signal transitions that will be incurred in a real implementation, which a simple on-off toggling benchmark ignores:

  • the delays introduced by reading the data to be transmitted from memory, or by storing received data into memory,

  • the delays introduced by the conditional branches needed to terminate the loop iteration after transmitting a real, finite-length output signal,

  • the delays introduced by having to signal on multiple pins at the same time (e.g. such as with I²C where there's both a clock and a data signal), since some approaches might only be able to reach the maximum speed on a single pin at a time.

In addition to those concerns, there's other aspects that will impact the feasibility of bit banging a protocol as well. For example,

  • the rise time of the signal,

  • the achievable duty cycle: in most cases we'll want to achieve a 50% duty cycle at the target frequency,

  • the speed at which we can receive and process incoming signals, in addition to the speed at which we can output them. I wont perform any input speed benchmarks in this post, but I will discuss the topic more below.

Despite these concerns, the benchmark-based definition of output speed serves as a good way to establish an optimistic upper bound, which we can then use to do some basic feasibility calculations for a given project.

Benchmarks

On most of the ESP32 microcontrollers we can perform GPIO in a number of ways:

  • By using the standard setter functions provided by the framework we're using (e.g. ESP-IDF).

  • By using the same underlying "simple GPIO" mechanism that those setter functions use, but by writing to the output registers directly, rather than through function calls.

  • By using the "dedicated GPIO" mechanism.

I benchmarked each of these approaches in the sections below. I performed my benchmarks using both the ESP32-C3 and ESP32-C6 microcontrollers (using a ESP32-C3-DevKitC-02 and an ESP32-C6-DevKitC-1-N8). However, most of the findings should apply to other microcontrollers in the ESP32 family as well, such as the ESP32-S series of chips.

During the benchmarks both chips were mounted in their own breadboard, and I left the GPIO output pins floating, with no load provided to them.

The "simple GPIO output" mechanism

Most frameworks used for programming microcontrollers contain helper functions for configuring GPIO pins and for controlling their output values. For the ESP32 family of chips these APIs use what is referred to in the technical reference manuals as the "simple GPIO output" mechanism, under the hood (e.g. for ESP32-C6 see Section 7.5.3 of the manual).

With this mechanism, you first configure the pin's function (setting it to "simple GPIO") and direction (setting to "output"), and then you can control the pin's output value by manipulating the GPIO_OUT_REG register, either by writing to it directly or by using the auxiliary GPIO_OUT_W1TS or GPIO_OUT_W1TC registers to set or clear a specific bit in the output register.

If you're using the ESP-IDF framework, then this is exactly what the gpio_set_level function does (see its definition which eventually calls into a lower-level chip-specific implementation). Instead of ESP-IDF I'm currently using the Rust-based esp-rs/esp-hal framework for my personal projects, and it similarly implements this in the GpioPin::set_high/set_low setter functions (their implementation also utilizes a helper function that actually manipulates the register). Note that these links are for the ESP32-C6 chip, but you can navigate the codebases to find implementations for the other chips as well.

Benchmarking GPIO output with setter functions

Let's take the following benchmark that uses the Rust esp-rs/esp-hal framework:

#[ram]
fn benchmark_simple_gpio_one_pin(io: IO) -> ! {
    // Set GPIO10 as output.
    let mut gpio10 = io.pins.gpio10.into_push_pull_output();

    // Toggle the pin's state in a busy loop.
    loop {
        gpio10.set_high().unwrap();
        gpio10.set_low().unwrap();
    }
}

A "simple GPIO" benchmark that toggles a single pin using setter functions.
Full source on Github.

You'll notice that this is exactly the type of benchmark we proposed in the earlier section. It's not shown in the snippet, but in this and all following benchmarks I've also configured the CPU's clock speed to be at its maximum value of 160 ⁠MHz, to try and maximize the speed we can achieve. When running the benchmark I also use cargo build --release, to ensure we're running with all compiler optimizations applied. Lastly, I've ensured that the code is immediately loaded into RAM at boot-up before the code is executed (using the #[ram] directive), to avoid any performance impact from having to load the code from flash into RAM when the code is first accessed.

Let's take a look at the signal this generates on my ESP32-C6 chip.

A trace of a periodic square waveform with a 1.74 ⁠MHz frequency and 47.8% duty cycle

An oscilloscope trace of the "simple GPIO" benchmark running on ESP32-C6.

And let's look at the signal on the ESP32-C3 chip as well:

A trace of a periodic square waveform with a 4.22 ⁠MHz frequency and 47.3% duty cycle

An oscilloscope trace of the "simple GPIO" benchmark running on ESP32-C3.

We're able to achieve a speed of about 1.74 ⁠MHz on the ESP32-C6 and 4.22 ⁠MHz on the ESP32-C3. Note that this is markedly less than the CPU clock speed of 160 ⁠MHz. This is could be due to a number of reasons:

  • The setter functions we use may cause a lot of CPU cycles to spent be simply calling the function and its helper functions, in between each actual signal change.

  • The simple GPIO output mechanism uses the microcontroller's APB bus to apply the output signals, and this bus runs at a lower speed (80 ⁠MHz) than the CPU itself, while possibly also introducing buffering delays.

  • The CPU clock speed describes the frequency with which CPU cycles are completed, and in the best case a single instruction is executed per CPU cycle. Even if we could effect an GPIO output value transition with a single instruction, we'd still always need two instructions to effect a full cycle of transitions (one instruction to go high, one instruction to go low). Hence, given a clock speed of 160 ⁠MHz, we can only hope to output at most a 80 ⁠MHz signal. We're clearly still quite far from that upper limit, however.

The first source of delays can be easily diagnosed as well as well mitigated, so let's take a look at that next.

Setter function overhead

To do so, let's take a look out the disassembled code for our benchmark for the ESP32-C6 chip.1 The sequence of instructions corresponding to the busy loop is as follows (note that I've omitted a few verbose bits marked with ...).

## Main busy loop
40808f16: ... li      a0,1024   # 1024 corresponds to bit #11, i.e. GPIO10.
40808f1a: ... auipc   ra,0x17fa
40808f1e: ... jalr    -1196(ra) # Function call to <...write_output_set...>
40808f22: ... li      a0,1024
40808f26: ... auipc   ra,0x17fa
40808f2a: ... jalr    -1200(ra) # Function call to <...write_output_clear...>
40808f2e: ... j       40808f16  # Loop back to the start.

## The write_output_set function.
42002a6e <...write_output_set...>:
42002a6e: ... lui     a1,0x60091
42002a72: ... sw      a0,8(a1)
42002a74: ... ret

## The write_output_clear function.
42002a76 <...write_output_clear...>:
42002a76: ... lui     a1,0x60091
42002a7a: ... sw      a0,12(a1)
42002a7c: ... ret

I'm not showing the generated code for the ESP32-C3 binary, but it looks practically identical. Given that the code is identical, it's quite interesting that the ESP32-C3 is able to achieve a much faster signal frequency than the ESP32-C6, right out of the box.

We can see that each iteration of the loop requires executing quite a few instructions. I count 7 in the main loop and 3 more for each of the write_output_set and write_output_clear functions (the GpioPin::set_high/set_low functions seem to have been inlined by the compiler, avoiding what would've otherwise been an additional function call). This adds up to 13 instructions per iteration, just to perform two signal transitions.

Even assuming a best-case scenario of one CPU cycle per instruction, this many instructions would only allow for a signal frequency of up to of 12.3 ⁠MHz, and that's likely an optimistic estimate since at least a few of the instructions will take longer than one CPU cycle to execute (e.g. some of the jump instructions).

On the other hand, the observed frequencies of 1.74 ⁠MHz and 4.22 ⁠MHz are quite far from that 12.3 ⁠MHz frequency anyway. This indicates that the instruction count is likely only part of the bottleneck. Let's see if we can reduce the instruction count, and see how much we can gain from that.

Benchmarking GPIO output with direct register manipulation

To reduce the loop's instruction count we can rewrite our benchmark to manipulate the GPIO output register directly via inline assembly instructions rather than via a function call. This will give us a clearer sense of the maximum GPIO output speed we can achieve using the simple GPIO output mechanism, which will still be limited by the APB bus speed.

#[ram]
fn benchmark_simple_gpio_one_pin_with_assembly(mut io: IO) -> ! {
    // Set GPIO10 as output.
    io.pins.gpio10.set_to_push_pull_output();

    // The GPIO matrix address for ESP32-C6, see "Table 5-2" in the technical
    // reference manual.
    const GPIO_MATRIX_ADDRESS: usize = 0x6009_1000;

    // Toggle the pin's state in a busy loop.
    unsafe {
        asm!("
            // Ensures that the '1' label corresponds to a 4 byte-aligned
            // address, thereby allowing the `j` instruction to take only a
            // single CPU cycle on ESP32-C6 chips.
            .align 4
            1:
            // Set the pin high.
            sw {gpio_pin}, 0x8({gpio_matrix}) // 0x8 is the GPIO_OUT_W1TS_REG offset.
            // Set the pin low.
            sw {gpio_pin}, 0xc({gpio_matrix}) // 0xc is the GPIO_OUT_W1TC_REG offset.
            // Loop back to label 1.
            j 1b",
            // GPIO10 corresponds to bit #11 in the GPIO_OUT_W1TS_REG and
            // GPIO_OUT_W1TC_REG registers.
            gpio_pin = in(reg) 1 << 10,
            gpio_matrix = in(reg) GPIO_MATRIX_ADDRESS,
            // The assembly code will loop forever and never return.
            options(noreturn)
        );
    }
}

A "simple GPIO" benchmark that toggles a single pin by manipulating the GPIO_OUT registers directly.
Full source on Github.

Let's take a look at the signal this generates on the ESP32-C6:

A trace of a periodic square waveform with a 2.22 ⁠MHz frequency and 50% duty cycle

An oscilloscope trace of the "simple GPIO with direct register manipulation" benchmark running on ESP32-C6.

As well as on the ESP32-C3:

A trace of a periodic square waveform with a 8.85 ⁠MHz frequency and 44.2% duty cycle

An oscilloscope trace of the "simple GPIO with direct register manipulation" benchmark running on ESP32-C3.

We're now reaching 2.22 ⁠MHz and 8.85 ⁠MHz on the ESP32-C6 and ESP32-C3, respectively. For the ESP32-C6 that's not a very big improvement (compared to the 1.75 ⁠MHz we reached before). For the ESP32-C3 it's almost a doubling in speed, but nevertheless the results are still lower than I'd like them to be, and at this point it seems we're clearly bottlenecked not by the instruction count, but by the underlying bus/peripheral speed. Fortunately, we can do much better than this.

Signal rise time

However, before I move on let's also take a quick look at the signal rise times for the previous benchmark, since that's another aspect of the signal waveform we might interested in. For the ESP32-C6:

A zoomed in trace of a waveform showing a 3.3 ⁠ns rise time

An oscilloscope trace of the signal rise time on ESP32-C6.

For the ESP32-C3:

A zoomed in trace of a waveform showing a 3.3 ⁠ns rise time

An oscilloscope trace of the signal rise time on ESP32-C3.

We can see that in both cases the rise time calculated by the oscilloscope is approximately 3.3 ⁠ns. It defines rise time as the time it takes to go from 10% to 90% of a rising waveform.

The "dedicated IO" mechanism

Most of the microcontrollers in the ESP32 family support an alternate mechanism for doing GPIO that is distinct from the "simple GPIO" mechanism that we used in the previous sections, and which is more suitable for high-speed signaling. It is called "dedicated GPIO" in the ESP-IDF documentation, and "dedicated IO" in Section 14.1 of the ESP32-C6 manual, which describes the mechanism as follows (other ESP chips support the mechanism as well, but don't document it clearly in the manual):

Normally, GPIOs are an APB peripheral, which means that changes to outputs and reads from inputs can get stuck in write buffers or behind other transfers, and in general are slower because generally the APB bus runs at a lower speed than the CPU. As an alternative, the CPU core implements I/O processors specific CPU registers (CSRs) which are directly connected to the GPIO matrix or IO pads. As these registers can get accessed in one instruction, speed is fast.

To use this mechanism, you first need to configure the pin's function and direction just like you would with the "simple GPIO" mechanism, but you subsequently redirect the pin to be connected to one of the CPU_GPIO_OUT{n} IO signals. The CPU has 8 such input/output signals, so we can drive or read up to 8 pins this way.

The ESP-IDF framework defines APIs for configuring the pins for this mechanism, as well as for manipulating the pins via setter functions. However, in order to reap the benefits of the mechanism, the setter functions should be avoided. The Rust esp-rs/esp-hal framework also supports the mechanism via its GpioPin::connect_peripheral_to_output API, by passing it one of the OutputSignal::CPU_GPIO_OUT{n} values.

In both cases, once the mechanism is configured, the GPIO output value of one or more pins can then be set in a single CPU cycle by using the RISC-V Control and Status Register instructions to manipulate the CPU_GPIO_OUT CSR (address 0x805 on the ESP32-C3/C6 chips).

This means that with a clock speed of 160 ⁠MHz, we should be able to reach a maximum GPIO speeds of 80 ⁠MHz. Let's try it out.

Benchmarking GPIO output using dedicated IO

#[ram]
fn benchmark_dedicated_io(io: IO) -> ! {
    // Set GPIO10 as an output.
    let mut gpio10 = io.pins.gpio10.into_push_pull_output();
    // Connect GPIO10 to the CPU_GPIO_OUT0 peripheral signal.
    gpio10.connect_peripheral_to_output_with_options(
        hal::gpio::OutputSignal::CPU_GPIO_OUT0, ...);

    // Toggle the pin in a busy loop.
    unsafe {
        asm!("
            // Ensures that the '1' label corresponds to a 4 byte-aligned
            // address, thereby allowing the `j` instruction to take only a
            // single CPU cycle on ESP32-C6 chips.
            .align 4
            1:
            // Set the pin high.
            csrrsi zero, {csr_cpu_gpio_out}, {cpu_gpio_signal}
            // Set the pin low.
            csrrci zero, {csr_cpu_gpio_out}, {cpu_gpio_signal}
            // Loop back to label 1.
            j 1b",
            csr_cpu_gpio_out = const(0x805),
            cpu_gpio_signal = const(1), // CPU_GPIO_OUT0 maps to register bit 0.
            options(noreturn)
        );
    }
}

A benchmark that toggles a single pin using the "dedicated IO" mechanism.
Full source on GitHub.

Notice how similar it is to the previous benchmark, but how instead of writing to a memory location to set/clear the pin, we write to the CSR register. Let's take a look at the waveform this generates for the ESP32-C6:

A trace of a somewhat sinusoidal waveform with a frequency of 53.2 ⁠MHz and a duty cycle of 31.9%

An oscilloscope trace of the "dedicated IO" benchmark running on ESP32-C6.

And on the ESP32-C3:

A trace of a somewhat sinusoidal waveform with a frequency of 39.7 ⁠MHz and a duty cycle of 27%

An oscilloscope trace of the "dedicated IO" benchmark running on ESP32-C3.

In this case, we see that we achieve a GPIO speed of about 53.2 ⁠MHz, with a lopsided duty cycle of about 32% on the ESP32-C6, while the ESP32-C3 reaches a 39.7 ⁠MHz frequency with a 27% duty cycle. Still not quite 80 ⁠MHz...

This is explained by the fact that we need three instructions to complete the benchmark's loop on the ESP32-C6 and each instruction takes one CPU cycle, meaning that the pin is held high for one CPU cycle but held low for two CPU cycles. On the ESP32-C3 the jump instruction seems to take two CPU cycles, so the signal is even more lopsided there, and the frequency even lower.

If we unroll the loop for a number of iterations then we should be able to reach a speed of 80 ⁠MHz, at least for short bursts. The following benchmark implements this approach.

#[ram]
fn benchmark_dedicated_io(io: IO) -> ! {
    // Set GPIO10 as an output.
    let mut gpio10 = io.pins.gpio10.into_push_pull_output();
    // Connect GPIO10 to the CPU_GPIO_OUT0 peripheral signal.
    gpio10.connect_peripheral_to_output_with_options(
        hal::gpio::OutputSignal::CPU_GPIO_OUT0, ...);

    // Toggle the pin in a busy loop.
    unsafe {
        asm!(
            ".macro toggle_pin
            // Set the pin high.
            csrrsi zero, {csr_cpu_gpio_out}, {cpu_gpio_signal}
            // Set the pin low.
            csrrci zero, {csr_cpu_gpio_out}, {cpu_gpio_signal}
            .endm

            // Ensures that the '1' label corresponds to a 4 byte-aligned
            // address, thereby allowing the `j` instruction to take only a
            // single CPU cycle on ESP32-C6 chips.
            .align 4
            1:
            // Toggle the pin 8 times, and then loop back. This manual unrolling
            // of the loop ensures we can gauge the absolute maximum achievable
            // speed, even if we can only reach it for short bursts of time.
            toggle_pin,
            toggle_pin,
            toggle_pin,
            toggle_pin,
            toggle_pin,
            toggle_pin,
            toggle_pin,
            toggle_pin,
            // Loop back to label 1.
            j 1b",
            csr_cpu_gpio_out = const(0x805),
            cpu_gpio_signal = const(1), // CPU_GPIO_OUT0 maps to register bit 0.
            options(noreturn)
        );
    }
}

A benchmark that toggles a single pin using the "dedicated IO" mechanism using 8 unrolled loop iterations.
Full source on GitHub.

The following screenshots show that this way we are indeed able to reach 80 ⁠MHz signal frequencies for short bursts (note the wider gaps on the left sides of the screenshots, corresponding to the extra delay from the jump instruction at the end of the unrolled loop). On the ESP32-C6:

A trace of a sinusoidal waveform with a frequency of 80.6 ⁠MHz and a duty cycle of 48.4%

An oscilloscope trace of the "dedicated IO with unrolled loop" benchmark running on ESP32-C6.

And on the ESP32-C3:

A trace of a sinusoidal waveform with a frequency of 80.6 ⁠MHz and a duty cycle of 61.29%

An oscilloscope trace of the "dedicated IO with unrolled loop" benchmark running on ESP32-C3.

Note that because the signaling speeds are now getting to be so fast as to approach the signal rise times we observed earlier, the resulting waveform is now also much more sinusoidal than the waveforms we saw in the earlier benchmarks.

Emitting multiple signals simultaneously

Before I conclude this post, I want to highlight again that the dedicated IO mechanism allows us to manipulate up to 8 pins per CPU cycle. This means we can achieve high signal frequencies across multiple outputs simultaneously. For example, on the ESP32-C6 the following benchmark generates four 40 ⁠MHz signals in two groups of two, each group being 180 degrees out of phase with the other group.

#[ram]
fn benchmark_dedicated_io(io: IO) -> ! {
    let (mut gpio_a, mut gpio_b, mut gpio_c, mut gpio_d) = (
        io.pins.gpio10.into_push_pull_output(),
        io.pins.gpio11.into_push_pull_output(),
        io.pins.gpio22.into_push_pull_output(),
        io.pins.gpio23.into_push_pull_output(),
    );
    let (output_signal_a, output_signal_b, output_signal_c, output_signal_d) = (
        hal::gpio::OutputSignal::CPU_GPIO_OUT0,
        hal::gpio::OutputSignal::CPU_GPIO_OUT1,
        hal::gpio::OutputSignal::CPU_GPIO_OUT2,
        hal::gpio::OutputSignal::CPU_GPIO_OUT3,
    );
    // Connect the GPIOs to the CPU_GPIO_OUTn peripheral signals.
    gpio_a.connect_peripheral_to_output_with_options(output_signal_a, ...);
    gpio_b.connect_peripheral_to_output_with_options(output_signal_b, ...);
    gpio_c.connect_peripheral_to_output_with_options(output_signal_c, ...);
    gpio_d.connect_peripheral_to_output_with_options(output_signal_d, ...);

    // Toggle the pins in groups of two in a busy loop.
    unsafe {
        asm!(
            "
            .align 4
            1:
            // Set pins A & B high and pins C & D low.
            csrrwi zero, {csr_cpu_gpio_out}, {ab_high_cd_low}
            // Wait one CPU cycle, to ensure the loop takes a total of four CPU
            // cycles on ESP32-C6. One ESP32-C3 the jump takes two cycles, for a
            // total of five instructions per loop.
            nop
            // Set pins A & B low and pins C & D high.
            csrrwi zero, {csr_cpu_gpio_out}, {ab_low_cd_high}
            // Loop back to label 1.
            j 1b",
            csr_cpu_gpio_out = const(0x805),
            ab_low_cd_high = const((1 << 0) | (1 << 1)),// CPU_GPIO_OUT0/OUT1 high.
            ab_high_cd_low = const((1 << 2) | (1 << 3)),// CPU_GPIO_OUT2/OUT3 high.
            options(noreturn)
        );
    }
}

A benchmark that toggles four pins simultaneously using the "dedicated IO" mechanism.
Full source on GitHub.

The resulting waveform on the ESP32-C6 (note the clean 50% duty cycle):

Four traces of neatly sinusoidal waveforms with frequencies of around 40 ⁠MHz and duty cycles of close to 50%, in two pairs and with each pair's phase offset by 180 degrees

An oscilloscope trace of the "dedicated IO with four output signals" benchmark running on ESP32-C6.

On the ESP32-C3 the benchmark can't quite reach 40 ⁠MHz frequencies with 50% duty cycles, because the jump instruction takes two CPU cycles, resulting in five CPU cycles per iteration:

Four traces of somewhat sinusoidal waveforms with frequencies of around 31.6 ⁠MHz and duty cycles of around 59.5%, in two pairs and with each pair's phase offset by 180 degrees

An oscilloscope trace of the "dedicated IO with four output signals" benchmark running on ESP32-C3.

Instead we see an imbalanced signal with a 40/60% duty cycle and a frequency of almost 32 ⁠MHz (since the 160 ⁠MHz CPU clock is divided by five CPU cycles).

What about the GPIO input speed?

So far I've focused solely on measuring the maximum GPIO output speed we can achieve, but that doesn't tell us anything about the maximum GPIO input speed we can achieve. That is, the frequency with which we can sample an incoming signal.

I won't attempt to define a benchmark to measure the input speed with in this post. However, just as the "dedicated IO" mechanism supports driving a pin output with every CPU cycle, it supports sampling an input pin value with every CPU cycle as well, albeit with a certain amount of sampling latency (two CPU cycles in the case of the ESP32-C6).

This means that on the ESP32-C3 and ESP32-C6, we could in theory sample signals with short bursts of a 160 ⁠MHz sampling frequency. When using a repeating loop like the one below we could reach sustained sampling frequencies of up to 80 ⁠MHz.

        ...
        asm!("
            .align 4
            1:
            // Sample the value of all CPU-connected input pins (up to 8 pins
            // simultaneously). Bit 0 of t0 will contain the value of
            // CPU_GPIO_IN0, etc.
            csrrsi t0, {csr_cpu_gpio_in}, 0
            // Loop back to label 1.
            j 1b",
            csr_cpu_gpio_in = const(0x804),
            options(noreturn)
        );
        ...

However, remember that to reconstruct a signal with a bandwidth of $B$, it needs to be sampled at a sampling frequency of at least $2B$ (this is the Nyquist-Shannon sampling theorem). Hence, a sustained 80 ⁠MHz sampling frequency could only reconstruct signals with frequencies of at most 40 ⁠MHz.

And of course in practice you'd need to add a number of additional instructions to this most-minimal sampling loop, which would limit the the sampling frequency further. This is similar to how our benchmarks ignored various other delays you'd likely need to account for in practice as well. You'd likely need instructions to terminate the loop iteration, to store the sampled data into memory, etc.

So as a rule of thumb, we should probably reduce this maximum achievable sampling rate by a factor of 2x to 3x when considering real-world use cases. On the ESP32-C6 this means we should be able to successfully reconstruct input signals with bandwidths up to 13.33 ⁠MHz, perhaps even up to 20 ⁠MHz.

Conclusion

I've shown that while the the standard "simple GPIO" mechanism is limited to lower GPIO speeds, the "dedicated IO" mechanism provides a powerful alternative for doing high-speed signaling via GPIO. It can produce output signals with frequencies up to half of the chip's maximum CPU clock speed (which is 160 ⁠MHz for the ESP32-C3 and ESP32-C6 chips), since each signal change requires one CPU cycle.

ResultESP32-C3ESP32-C6
Maximum burst output speed using dedicated IO80 ⁠MHz
50% duty cycle
80 ⁠MHz
50% duty cycle
Maximum sustained output speed using dedicated IO32 ⁠MHz
40/60% duty cycle
40 ⁠MHz
50% duty cycle
Maximum "simple GPIO" output speed using assembly code8.85 ⁠MHz2.22 ⁠MHz
Maximum "simple GPIO" output speed using setter APIs4.22 ⁠MHz1.74 ⁠MHz

Summary of the GPIO output speed benchmark results.

The dedicated IO mechanism also supports sensing an input value every CPU cycle, which allows for sustained sampling frequencies of up to 80 ⁠MHz (enough to reconstruct signals with up 40 ⁠MHz of bandwidth, at least in theory).

The dedicated IO mechanism is available on both the ESP32-C3 and ESP32-C6, as well as most other chips in the ESP32 lineup. The ESP32-S2 and ESP32-S3 chips should be able to reach even higher output speeds of 120 ⁠MHz, given their CPU clocks go up 240 ⁠MHz. I didn't have any of those chips to test, however.

Note that these results should be treated as optimistic upper bounds. Depending on the specific use case the maximum signal frequencies that can be managed successfully are going to be lower. Nevertheless, having the ability to do such high-speed signaling using just GPIO opens up a number of interesting use cases. Even if we assume that in practice we'll be limited to only 0.5x or 0.25x of these speeds, that would make it feasible to bit bang many commonly used high-speed protocols with bandwidths up to 10 or even 20 ⁠MHz.

For example, since we know that a four CPU cycle loop reaches an output speed of 40 ⁠MHz, we can reason that to output a 10 ⁠MHz signal we would have 16 CPU cycles to spend per signal period, or 8 cycles per signal transition. This budget of extra CPU cycles should be enough to allow us to load the next bit to transmit from memory, perform conditional branches to handle the end of a data transmission, etc. between each signal transition instruction.

That's exactly what I'll try out in my next set of posts, as I'll attempt to write a bit banged Ethernet 10BASE⁠-⁠T implementation, which will require signal frequencies up to 10 ⁠MHz. Stay tuned!

Footnotes

1

I generated the disassembled code using the riscv64-unknown-elf-objdump -d {the_binary} command on the binary generated by the cargo build --release command. On Debian-based systems the riscv64-unknown-elf-objdump utility can be installed using apt install binutils-riscv64-unknown-elf.