Benchmarking latency across common wireless links for microcontrollers

Scott
I was recently trying to quantify the tradeoffs in user-experience for a wireless product and successfully nerd-sniped myself into evaluating a super-set of wireless modules and protocols.
While standards groups, radio chipset vendors, and IOT system integrators happily talk about improvements to bandwidth, long-range capabilities, or how low their power consumption is, I’ve really struggled to find substantive information about latency beyond hand-wavy marketing superlatives.
Calculating symbol rate and latency figures from radio first-principles is doable, but modern radio chipsets are also subject to protocol specific behaviours and increasingly complex software stacks. So let’s experimentally compare them in ‘typical’ implementations!
Microbenchmarking embedded hardware
With any real-world project there are dozens of hardware and firmware design choices and optimisations that could meaningfully impact performance, and the matrix of potential tests becomes rather unapproachable if we also test across environments representative of real-world interference conditions.
While hardcore optimisation of each implementation isn’t the primary focus here, I do want any comparisons to be fairly representative of the technologies and the teams of engineers who’ve built them.
So I need to simplify this first round of testing by:
- Picking a smaller set of popular hardware options and protocols,
- Only performing ‘bench tests’ in a semi-controlled environment,
- Trying to answer one specific question: “How responsive can one-way wireless user interaction be?”
Some of the most common examples of this behaviour also happen to be the most latency sensitive: toggling lightbulbs, real-time analysis of sensor streams, and wireless control of actuators or robots.
Now that we know what we’re testing for, let’s work out how to measure the results.
Sizing Test Packets
Different use-cases may prioritise data rate or power consumption over responsiveness – a tank-level sensor has relaxed bandwidth and latency requirements, while a quadcopter control signal needs to be delivered with consistent latency and at a high rate.
Every communication link has a different set of design goals, but I want the tests to allow these design choices to be shown if possible. In my experience with embedded systems, typical applications might describe their transfers with these common groups:
- Small packets with one or two small pieces of data, like a sensor reading or heart-beat value,
- Longer structures of data, many sensor fields, a set of configuration values,
- ‘Large’ packets containing chunked historical data, audio or images, and user-facing file transfers
So I’ll test three different payload lengths: 12B, 128B, and 1024B, which should help shape a reasonable picture of how these wireless links behave. Some of the protocols have an MTU (Maximum Transmission Unit) which might not fit the larger packets, so where needed I’ll break them into multiple packets.
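As a rough illustration of what that splitting means for each link, here is a minimal Python sketch that counts how many packets each test payload needs. The per-packet limits are the ones quoted in the radio sections of this post; the function names are made up for this example, and protocol header overhead inside each packet is ignored.

```python
import math

# Payload sizes used throughout the benchmark (bytes).
PAYLOAD_SIZES = (12, 128, 1024)

# Per-packet limits quoted in the radio sections of this post (bytes).
# Header overhead inside each packet is ignored in this sketch.
LINK_LIMITS = {
    "nRF24": 32,
    "802.15.4": 127,
    "BLE (ESP32)": 200,
    "ESPNow": 250,
    "LoRa": 255,
    "BT Classic SPP (ESP32)": 990,
}


def packets_needed(payload_len: int, limit: int) -> int:
    """Number of packets needed when each packet carries up to `limit` bytes."""
    return math.ceil(payload_len / limit)


def split_payload(payload: bytes, limit: int) -> list:
    """Split a payload into consecutive chunks of at most `limit` bytes."""
    return [payload[i:i + limit] for i in range(0, len(payload), limit)]


if __name__ == "__main__":
    for name, limit in LINK_LIMITS.items():
        counts = [packets_needed(size, limit) for size in PAYLOAD_SIZES]
        print(f"{name:24s} {counts}")
```

Only the 1024B test needs splitting on most links; the nRF24 is the exception, where the 128B test already needs four packets.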
Timing Capture
To keep things manageable later on, I’ll start each transfer with the rising edge of a logic-level signal, which triggers a high priority interrupt that sends a new packet.
The receiving end will indicate a valid packet has arrived by driving an IO pin high.
There are a few reasons for this:
- Support for external trace probes, debug peripherals, and internal timekeeping quality varies between devices,
- External test equipment can measure timing information for all targets equally in the lab,
- Testing latency and jitter over longer distances and in real-world crowded RF environments will be difficult using lab gear!
- Externally synchronised trigger pulses offer some semblance of consistency (GPS PPS, PTP?)
I’ll capture the timing information with a Saleae Logic analyser sampling at 100 Msamples/sec (10 ns), and then post-process exported edge timestamps with a simple R script.
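The post-processing was done in R, but the core idea is small enough to sketch in Python. This is a sketch under assumptions, not the script from the repo: it takes two plain lists of rising-edge timestamps in seconds (trigger and receive pin), assumes each packet arrives before the next trigger fires, and treats a trigger with no matching receive edge as a lost packet. The function names are made up for this example.

```python
import statistics


def match_latencies(trigger_edges, receive_edges):
    """Pair each trigger edge with the first receive edge before the next trigger.

    Both inputs are sorted lists of rising-edge timestamps in seconds.
    Returns one entry per trigger: the latency in seconds, or None if lost.
    """
    latencies = []
    rx_index = 0
    for i, t_trigger in enumerate(trigger_edges):
        has_next = i + 1 < len(trigger_edges)
        t_next = trigger_edges[i + 1] if has_next else float("inf")
        # Skip any receive edges that happened before this trigger.
        while rx_index < len(receive_edges) and receive_edges[rx_index] < t_trigger:
            rx_index += 1
        if rx_index < len(receive_edges) and receive_edges[rx_index] < t_next:
            latencies.append(receive_edges[rx_index] - t_trigger)
            rx_index += 1
        else:
            latencies.append(None)  # no receive edge before the next trigger
    return latencies


def summarise(latencies):
    """Basic statistics over the packets that arrived."""
    valid = [x for x in latencies if x is not None]
    return {
        "count": len(valid),
        "lost": len(latencies) - len(valid),
        "min": min(valid),
        "median": statistics.median(valid),
        "max": max(valid),
    }


if __name__ == "__main__":
    triggers = [0.0, 1.0, 2.0]
    receives = [0.25, 2.5]
    print(summarise(match_latencies(triggers, receives)))
```

For the slowest links the trigger interval has to be longer than the worst-case transfer, otherwise the pairing assumption breaks down.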
Test Validation
I’m fully aware of the complexities that come with remote communications in an embedded project, and that development time is better spent testing and optimising for power consumption or connection reliability. I expect the timing behaviour of most links’ physical layers and respective protocols to trivialise any test fixture overheads, but I still want to quantify and eliminate benchmark artifacts.
Skimming the surface of possible optimisations sees us consider compiler optimisation settings, microcontroller clock tree configuration, peripheral use and configuration, use of an RTOS, and the weight of different hardware abstraction layers such as ST’s LL (LowLayer), STM32Cube’s HAL or the Arduino framework.
Without blowing this into a full dissertation detailing each hardware target, I’ll try to demonstrate the impact of some of these choices using one of my preferred microcontroller families.
First, let’s check how severe the impact of software choices might be on the results. Hardware is kept consistent across these tests – an STM32F429ZI micro is clocked at 168 MHz and running nearly identical code that catches the trigger signal via interrupt and drives an IO pin high.
While we expect release builds to be faster than debug and a bare-metal LL project to exhibit less overhead than an Arduino sketch, this ‘quick example’ still raises some interesting questions:
- Why does optimising for size (-Os) run faster than optimising for performance (-O3)?
  - Due to different handling of a boolean check! Godbolt comparison here.
  - For this contrived example, -O1, -O2 and -O3 release builds give identical performance.
  - Over the next few tests with greater complexity, -Os was consistently slower.
- Why do the Arduino results have such a wide variance?
  - Interestingly, the stm32duino project uses ST’s LL internally, but deciphering where performance is lost needs its own discussion…
  - As a simple answer, EXTI redirection and heavier peripheral housekeeping.
This generally matched my expectations. While I’ll only show data using the LL and built with -O3 from here on, we should take a look at a more significant cause of test variations – hardware configuration.
Most modern micros support different ways to manage performance critical peripherals: by polling registers, IRQ (Interrupt Request), or with DMA offload (Direct Memory Access). These configuration details are far more likely to impact latency, but it’s important to point out that these choices are normally made to access specific features, reduce power consumption, or get out of the way of other application logic.
We need to communicate with some of our wireless modules using serial, so let’s do a quick test of the UART peripheral using all three approaches and see why one of the best features of using DMA backed peripherals might not be so fantastic for these tests.
The majority of DMA tests have to wait for the UART peripheral to detect when the RX line is idle and generate an interrupt to wake the micro (usually one byte’s worth of time, or ~87 µs). The small cluster of outliers nearly matching IRQ results is due to the DMA half-complete or complete interrupts firing when the final byte of the test sequence arrives.
This is usually less of a practical concern when handling other tasks or sending larger packets. The benefits for real projects are massively improved power consumption because the core can sleep for as long as possible without missing data, or other tasks can be executed with reduced overhead and fewer context switches.
While the polling test appears to perform as well as IRQ, the micro needs to spend all of its time sending and checking for data. In real-world applications, it’ll miss data without careful cooperative sharing of CPU time with application workloads.
It’s important to remember the context of these tests and the simple fact that implementation details are insignificant compared to increasing throughput – just waiting for 12 bytes at 115200 baud was responsible for 1041 µs of the 1050 µs, or ~99%.
Baud | Bits/s | Bit period | 8N1 byte period |
---|---|---|---|
115200 | 115200 bits/s | 8.681 µs | 86.806 µs |
230400 | 230400 bits/s | 4.340 µs | 43.403 µs |
921600 | 921600 bits/s | 1.085 µs | 10.851 µs |
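The table values follow directly from the baud rate. A minimal Python sketch of the arithmetic, assuming 8N1 framing (one start bit, eight data bits, one stop bit, so ten bit periods per byte); the function names are made up for this example.

```python
def uart_bit_time_us(baud: int) -> float:
    """Duration of one bit on the wire, in microseconds."""
    return 1e6 / baud


def uart_byte_time_us(baud: int, data_bits: int = 8,
                      parity_bits: int = 0, stop_bits: int = 1) -> float:
    """Duration of one framed byte (start bit + data + parity + stop)."""
    bits_per_byte = 1 + data_bits + parity_bits + stop_bits
    return bits_per_byte * uart_bit_time_us(baud)


def uart_transfer_time_us(n_bytes: int, baud: int) -> float:
    """Time to clock out n back-to-back 8N1 bytes with no gaps between them."""
    return n_bytes * uart_byte_time_us(baud)


if __name__ == "__main__":
    for baud in (115200, 230400, 921600):
        print(f"{baud:7d} baud: bit {uart_bit_time_us(baud):.3f} us, "
              f"byte {uart_byte_time_us(baud):.3f} us, "
              f"12B {uart_transfer_time_us(12, baud):.1f} us")
```

At 115200 baud the 12 byte payload works out to roughly 1041.7 µs on the wire, which is the theoretical duration that dominates the measured result.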
So we actually want to look at the overhead, ideally as we increase baudrate to reduce the total transfer duration. This plot shows a range of tests where I’ve subtracted the theoretical duration for the 12 byte payload from the results.
When we ignore the DMA implementation’s line-idle behaviour, the implementations have pretty similar overheads even as we increase the throughput by 16x. So it’s probably reasonable to suggest that round-tripping (transmit and receive) data through my FIFO handling implementation results in an overhead under ~4.5 µs.
You didn’t sign up for a lecture on embedded systems fundamentals, so I’ll get into the actual tests now, but I don’t want to contribute to the sea of subpar ‘benchmark’ blog posts without pointing out the importance of double-checking underlying implementation details.
Firmware, logic traces, R scripts, and raw/processed logs are in the git repo.
Radio Tests
Integrating wireless communications into any embedded project is ultimately an exercise in balancing compromises. The radio is often the most power hungry hardware for battery powered devices, and integration complexity and ecosystem interoperability often drive the cost.
For 90% of use cases we start by considering the most forcing requirements:
Picking a small enough set of modules to cover all of these edges was hard. While I’ve tried to use the F429 Nucleo-144 with external radio modules, the most popular platforms for Bluetooth and WiFi are integrated microcontroller+radio parts. I’ve used the ESP32, ESP32-C6, and nRF52840 to help round out the test hardware.
SiK
Open-source modules based on the (now ageing) SiliconLabs 10x0-GM RF+8051 micro running SiK firmware have been commonly used for the last decade as long range UAV telemetry radios. In normal configurations they operate as a transparent serial link, though many have MAVLink aware firmware and support communications with multiple nodes.
The smaller modules have been implemented and cloned dozens of times and use a minimal implementation typically rated for 100 mW output. The beefier RFD900 modules offer diversity antenna switching, better filters, and additional amplification (TX up to 1W).
These modules are tested with their default configuration – wired serial at 57600 baud, air data-rate at 64 kbit/s, and output power of 20 dBm (100 mW).
Hold on… on paper the process to send a 12 byte packet should naively take about 6 milliseconds (UART takes 12B at 57600 = 2 ms per side, 12B at 64kbit/sec = ~1.5 ms airtime) but we see a huge spread of latencies from a pretty reasonable 8 ms up to 130 ms.
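The naive estimate is just three serialisation delays added together. A short Python sketch of that arithmetic, assuming 8N1 UART framing (10 bits per byte) and ignoring any radio preamble, header, or buffering; the function names are made up for this example.

```python
def uart_time_ms(n_bytes: int, baud: int, bits_per_byte: int = 10) -> float:
    """Time to move n bytes over an 8N1 UART, in milliseconds."""
    return 1000.0 * n_bytes * bits_per_byte / baud


def air_time_ms(n_bytes: int, air_rate_bps: int) -> float:
    """Raw payload airtime in milliseconds, ignoring preamble and headers."""
    return 1000.0 * n_bytes * 8 / air_rate_bps


def naive_latency_ms(n_bytes: int, baud: int, air_rate_bps: int) -> float:
    """UART into the sending module, over the air, UART out of the receiver."""
    return 2 * uart_time_ms(n_bytes, baud) + air_time_ms(n_bytes, air_rate_bps)


if __name__ == "__main__":
    estimate = naive_latency_ms(12, baud=57600, air_rate_bps=64000)
    print(f"Naive 12B estimate: {estimate:.2f} ms")
```

This comes out a little under 6 ms, which is why a spread reaching 130 ms needs explaining.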
Why isn’t the magic transparent serial pipe just sending data when I do?
With some adaptors to connect an antenna to the spectrum analyser, we can peek into the transmission behaviour of the radio link to work out what’s happening.
SiK radios use FHSS (Frequency Hopping Spread Spectrum) which rapidly changes the channel in a pseudorandom sequence. This spreads the signal over a wider bandwidth to help reduce interference and meet regulatory requirements.
By letting the analyser accumulate data for a little while, we can count out the 50 hopping channels across their configured 915-928 MHz frequency range. Nothing unexpected yet…
Looking at PvT (Power versus Time) plots, we can see distinct periodic transmit bursts from each of the radios with lots of off-time. The ‘receiving’ radio module is an additional meter away from the spectrum analyser and has a slightly weaker signal in these screenshots.
By triggering on a power level threshold (shown as a blue horizontal line), we can get a more stable look at the radio while running the 12 byte test pattern.
What we’re seeing is Time Division Multiplexing (TDM) behaviour interacting with transmit behaviour, which can be grossly simplified into some simple steps:
- Synced modules hop to a new channel frequency at an agreed time,
- Each module is allocated a transmit window long enough for 3 packets,
- If nothing is in the buffer, send a zero length packet to yield to other radios (~2 ms).
- Up to ~232 bytes of buffered data is packetised with a preamble and header (~133 µs/byte). The 12B test payload should use ~3.7 ms of air time.
- If other radios aren’t using their transmit time slots, continue sending packets if needed.
- When nothing else needs sending, listen in receive mode until the next hop!
Running the spectrum analyser’s trigger output through a frequency counter tells us the modems hop frequency every 120 ms.
So the underlying radio behaviour is actually pretty close to our theoretical transmit duration, but the test conditions don’t take into account that pending data is buffered by the module until the start of the next channel hop, leading to the massive variation in latency results we saw earlier.
That’s also why the results are so evenly distributed – we’re actually measuring the time we spend waiting for the next transmit window, and as long as the UART transfer arrives before the next window we don’t gain any immediate benefit from a higher UART baudrate.
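A toy model makes the even spread easier to see. The Python sketch below is a deliberate simplification rather than a description of the SiK firmware: it assumes data waits until the start of the next 120 ms hop, then pays a fixed cost equal to the ~8 ms best case from the results. Both constants and all function names are assumptions for illustration.

```python
import random

HOP_PERIOD_MS = 120.0    # hop interval measured with the frequency counter
FIXED_LATENCY_MS = 8.0   # assumed fixed cost: the best-case result seen above


def latency_for_phase(trigger_phase_ms: float,
                      hop_period_ms: float = HOP_PERIOD_MS,
                      fixed_ms: float = FIXED_LATENCY_MS) -> float:
    """Latency when the trigger lands `trigger_phase_ms` after a hop started."""
    wait_for_next_hop = (hop_period_ms - trigger_phase_ms) % hop_period_ms
    return fixed_ms + wait_for_next_hop


def simulate(n_samples: int, seed: int = 1) -> list:
    """Latencies for triggers that are not synchronised to the hop timing."""
    rng = random.Random(seed)
    return [latency_for_phase(rng.uniform(0.0, HOP_PERIOD_MS))
            for _ in range(n_samples)]


if __name__ == "__main__":
    results = simulate(10000)
    print(f"min {min(results):.1f} ms, max {max(results):.1f} ms, "
          f"mean {sum(results) / len(results):.1f} ms")
```

An unsynchronised trigger gives a flat distribution between the fixed cost and the fixed cost plus one hop period, which is the same shape as the measured 8 ms to 130 ms spread.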
For a quick bit of fun, I tried using the RF power level (yellow trace) as a trigger input to the signal generator to synchronise the test IO stimulus signal (blue trace) with a configurable offset. With a 113 ms delay applied on the sig-gen, the microcontroller can reliably send its packet just before an upcoming transmission window.
And we can now achieve a stable 9-15 ms latency result!
But that’s not how these modules are configured or intended to be used, and the lab gear needed to achieve this timing hack is out of reach for most!
While the low latency result isn’t indicative of real-world performance for these radios, I do think the process of exploring why is instructive and an indicator of what’s possible for ‘simple’ point-to-point packet radios with a different design goal.
LoRa
In situations where periodic reporting of small messages from edge devices is needed, low power wide area networks (LPWAN) are an increasingly common choice for asset tracking, smart power meters, and agri-sensing. The goal of these kinds of networks is to support large fleets of low-power nodes using one-hop star networks with an internet connected gateway.
LoRa (Long Range) is the brand name for the modulation scheme (physical layer), which uses CSS (Chirp Spread Spectrum) to achieve long distance communication with very low power consumption.
It’s worth pointing out that LoRa has rather low data rates compared to the other radios I’m testing, maxing out at 37.5 kbps.
In the spectrum analyser waterfall below we can see part of a typical LoRa transmission. By reading from the bottom of the trace upwards, we see 8 preamble sweeps, followed by 2 reverse-direction sync message sweeps, then payload chirps continuing past the top of the waterfall. We can see the start and stop frequencies change for payload chirps, which is how LoRa transmits symbols.
LoRaWAN is the most popular of the higher level protocols (MAC) built on LoRa and has three classes describing when nodes can transmit, receive, or sleep. It also handles authentication, encryption, and message forwarding to upstream network services.
LoRaWAN uses a clever trick to improve capacity and reduce interference – by using different IQ (in-phase and quadrature) configurations for TX and RX modes, nodes can only hear transmissions from gateway radios and not other node transmissions.
The hardware under test is the Semtech SX1276 transceiver in the HopeRF RFM95W module (on an Adafruit Breakout board). The modules communicate with the STM32F429 using 10 MHz clocked SPI, and my LL based driver minimises timing overheads by using the transceiver’s interrupt lines.
I’ll test a point-to-point LoRa link, as I don’t have an existing network or gateway on hand. If these tests were being performed with a real-world LoRaWAN I’d really be measuring any timing restrictions applied by the network – best practice is to normally sleep for minutes between packets to minimise air-time and power consumption.
I ran tests with two different chirp configurations representing sensible ‘high speed’ and long range use-cases. LoRa’s maximum payload is 255 bytes, so the 1 kiB payload is broken into 5 transmissions.
Configuration | Bandwidth | Coding Rate | Spreading Factor | Data rate |
---|---|---|---|---|
High Speed | 250 kHz | 4/5 (1.25x overhead) | 7 = 128 chips/symbol | 10.9 kbps |
Long Range | 62.5 kHz | 4/6 (1.5x overhead) | 11 = 2048 chips/symbol | 224 bps |
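The data rate column can be reproduced from the other three columns using the standard LoRa relationship: bit rate = SF × (bandwidth / 2^SF) × coding rate. A minimal Python sketch follows, with a made-up function name. The long range row is evaluated with the SX1276’s 62.5 kHz bandwidth setting, since that is the value that reproduces the 224 bps figure.

```python
def lora_bit_rate_bps(spreading_factor: int, bandwidth_hz: float,
                      coding_rate_denominator: int) -> float:
    """Nominal LoRa bit rate for a coding rate of 4/denominator."""
    chips_per_symbol = 2 ** spreading_factor
    symbol_rate = bandwidth_hz / chips_per_symbol   # symbols per second
    coding_rate = 4 / coding_rate_denominator       # e.g. 4/5 = 0.8
    return spreading_factor * symbol_rate * coding_rate


if __name__ == "__main__":
    high_speed = lora_bit_rate_bps(7, 250e3, 5)
    long_range = lora_bit_rate_bps(11, 62.5e3, 6)
    fastest = lora_bit_rate_bps(6, 500e3, 5)
    print(f"High speed: {high_speed / 1000:.1f} kbps")
    print(f"Long range: {long_range:.0f} bps")
    print(f"SF6 / 500 kHz / 4/5: {fastest / 1000:.1f} kbps")
```

The last line uses the fastest combination the formula allows (SF6, 500 kHz, 4/5), which gives the 37.5 kbps ceiling mentioned earlier.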
Semtech’s LoRa web calculator provides air-time durations which we can compare our results against. For the high speed configuration we can expect a 128 byte transmission to take ~107 ms.
Experimental results line up with the theoretical air-time timing, and the whole system has less than 10 µs of jitter (ignoring a dozen 128B outliers arriving 250 ms late) which is surprisingly well controlled.
As someone who predominantly works with micro-controllers, I’m most familiar thinking in milliseconds and microseconds, so seeing a calculated air-time of 5.4 seconds to transmit 128 bytes using the long range configuration hinted at a pretty scary 1 KiB transmit duration.
These are the longest transfer times of the hardware I tested, and I needed to increase the stimulus pulse interval up to 50 seconds for the 1 kiB tests – this meant I stopped capturing after ~100 samples (over an hour).
Benchmark results align with theoretical timings pretty well and are a good demonstration of the importance of minimising data transfer through careful payload design. These modules should better show their strengths in range and power measurement tests.
I’d love to know how many days deployed nodes have spent collecting chunks of firmware updates…
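For anyone who would rather not use the web calculator, the time-on-air formula from the SX1276 datasheet is short enough to sketch in Python. The defaults below are assumptions on my part rather than settings confirmed by the firmware: 8 preamble symbols (matching the waterfall), explicit header, CRC enabled, and low data rate optimisation off. The function name is made up for this example.

```python
import math


def lora_airtime_ms(payload_len: int, spreading_factor: int, bandwidth_hz: float,
                    coding_rate_denominator: int, preamble_symbols: int = 8,
                    explicit_header: bool = True, crc: bool = True,
                    low_data_rate_opt: bool = False) -> float:
    """Time on air for one LoRa packet, per the SX1276 datasheet formula."""
    t_symbol_ms = (2 ** spreading_factor) / bandwidth_hz * 1000.0
    t_preamble_ms = (preamble_symbols + 4.25) * t_symbol_ms

    implicit_header = 0 if explicit_header else 1
    de = 1 if low_data_rate_opt else 0
    numerator = (8 * payload_len - 4 * spreading_factor + 28
                 + 16 * int(crc) - 20 * implicit_header)
    denominator = 4 * (spreading_factor - 2 * de)
    # A 4/5 coding rate emits 5 symbols per block, 4/6 emits 6, and so on.
    coded_symbols = math.ceil(numerator / denominator) * coding_rate_denominator
    n_payload_symbols = 8 + max(coded_symbols, 0)

    return t_preamble_ms + n_payload_symbols * t_symbol_ms


if __name__ == "__main__":
    high_speed = lora_airtime_ms(128, 7, 250e3, 5)
    long_range = lora_airtime_ms(128, 11, 62.5e3, 6)
    print(f"High speed, 128B: {high_speed:.1f} ms")
    print(f"Long range, 128B: {long_range / 1000:.2f} s")
```

With these assumptions the high speed configuration comes out at about 107.6 ms for 128 bytes, in line with the ~107 ms figure, and SF11 at 62.5 kHz with a 4/6 coding rate gives about 5.4 s.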
nRF24
Nordic’s nRF24 family of 2.4GHz transceiver modules has been a commonly chosen option for custom wireless links for more than 15 years (early datasheets appear ~2006).
Over the years Nordic have co-packaged the transceiver with microcontrollers and USB interface hardware for tighter integration, and while they aren’t recommended for new designs I still see these part numbers appearing regularly in research papers and hobby projects. The most widely known commercial use was in older Logitech wireless receivers.
While it’s hard to check if my ‘genuine’ modules are using cloned silicon or not, you can find barebones nRF24L01 modules using PCB antennas as cheap as $2 in single quantities on eBay (almost certainly clones) and fancier modules with low-noise amplifiers (LNA) and transmit amplifiers (PA) come with an external antenna for less than $10.
My interrupt driven implementation clocks the SPI link at 10 MHz (rated max) and configures the modules for maximum throughput with a 2 Mbps air rate. By enabling Nordic’s Enhanced Shockburst the modules transparently handle automatic 16-bit checksums, acks, and re-transmit behaviour.
For the 128 and 1024 byte tests, the payload data is sent in chunks due to the nRF24’s 32 byte payload limit. The next chunk is sent once the module’s transmit success interrupt arrives. Enabling the dynamic payload length functionality impacted reliability, so any chunks requiring less than 32B are padded with 0x00 bytes (a 12B only packet test is shown below as Raw 12B).
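The fixed-size chunking is easy to get subtly wrong, so here is a small Python sketch of the padding scheme described above. It only models the byte layout, not the radio driver. How the receiver learns the true payload length is not covered here and would need to be handled separately (for example with a length field inside the payload, which is an assumption rather than something stated above). The names are made up for this example.

```python
NRF24_PAYLOAD_LIMIT = 32  # fixed payload size with dynamic payload length disabled


def padded_chunks(payload: bytes, size: int = NRF24_PAYLOAD_LIMIT,
                  pad: bytes = b"\x00") -> list:
    """Split a payload into fixed-size chunks, zero-padding the final chunk."""
    chunks = []
    for offset in range(0, len(payload), size):
        chunk = payload[offset:offset + size]
        chunks.append(chunk.ljust(size, pad))  # no-op when already full size
    return chunks


if __name__ == "__main__":
    for length in (12, 128, 1024):
        chunks = padded_chunks(bytes(range(256)) * 4)[:0] or padded_chunks(bytes(length))
        print(f"{length:5d}B -> {len(chunks)} chunk(s) of {len(chunks[0])} bytes")
```

The 12B payload still occupies a full 32 byte packet on air, which is why the padded 12B result differs from the Raw 12B test.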
Achieving a lower-bound latency of 300 microseconds for a 12 byte transfer is a great result and the tight clustering shows highly consistent behaviour.
The nRF24’s low jitter is easily visualised with some RF PvT traces (shown in yellow). We can see the module starts its first RF burst about 100 µs after the test stimulus trigger (horizontally offset by -2 ms) and completes the sequence of bursts within 5 ms. The slightly lower amplitude bursts are the RX module acknowledging the transmissions.
By looking at the PvT behaviour with longer payload sizes, it looks like some of the variation is caused by occasional quiet periods between chunks. I haven’t been able to work out why these happen.
I found it interesting that reducing the air-data rate to 256 kbps for ‘long range’ performance didn’t impact latency as much as we might expect from the ~8x reduction in rated throughput.
Long-range mode maintains highly consistent results but incurs a slightly less than 4x increase in transfer duration, less than half of what we would have expected from the air-data rate reduction.
Payload | 256 kbps | 2 Mbps | Difference |
---|---|---|---|
12B (padded) | 1.5 ms | 0.4 ms | 3.75x |
128B | 7 ms | 1.9 ms | 3.6x |
1024B | 68 ms | 23 ms | 3x |
If I saturated the long-range link with more frequent test packets I think the difference would become more apparent.
ESPNOW
The ESP32 and ESP8266 are probably the most popular hobby microcontrollers we’ve seen over the past 5 years, mostly due to incredibly low cost, integrated WiFi/BT, and pretty good development tooling from launch. While the community has quickly grown fond of them, they’re also found in lots of commercial IOT products.
ESPNow is Espressif’s proprietary point-to-point networking protocol running in the 2.4GHz band and is self-described as a low complexity option for smart lighting, sensors, and remote-control applications without a bridge or gateway. It uses a custom action-frame in the 802.11 Wi-Fi standard for specific device functionality which provides 250 bytes of usable payload space and normally runs at 1 Mbps.
The website does have a latency claim that it “can achieve a millisecond-level delay” which we can try to replicate.
Using Espressif’s IDF example as reference, my stripped down implementation doesn’t send complex structured payloads.
- At startup, the ESP32 boards find each other with some broadcast packets.
- If the broadcast came from a MAC address that hasn’t been seen yet, add it to the peer list.
- If a trigger pulse interrupt occurs, search the peer list for our destination MAC address.
- Blindly send the test payload to that address as the only user payload information,
- Because the 1024B test exceeds the 250 byte limit, 5 packets need to be sent. I wait for transmit completion callbacks to succeed before sending the next chunk.
- The espnow task callbacks provide inbound packets which are passed to the main task loop with a FreeRTOS queue,
- Test packets are checked for valid length and checksum values using the same logic as earlier tests.
The results are pretty good – typical end-to-end latency for a single packet transfer is consistently ~5 ms. The 1 kiB payload shows good scaling behaviour as the 5 packet sequence takes ~24 ms to complete.
Interestingly, enabling long-range mode (which limits the PHY to 512Kbps or 256Kbps) hurts the 5-packet sequence slightly more than I’d expect.
802.15.4
IEEE 802.15.4 is a standardised physical and MAC layer protocol used most commonly for wireless home automation networks. Zigbee, Matter, and Thread are all higher layer protocols built on IEEE 802.15.4.
Designed for embedded devices and low power consumption, it offers a reasonable range of data rates up to 250 kbit/second, three operating bands across 868/915/2450 MHz, and can operate point-to-point or with star network topologies.
I’m using a pair of Espressif’s official ESP32-C6-MINI devboards for this test, which are 2.4 GHz only.
I opted to use the ESP-IDF’s low-level ieee802154 library directly because it’s small and really easy to work with (though building 802.15.4 MAC frames manually is tedious). There are also Zigbee and OpenThread example projects for these chips.
whsniff + Wireshark gives us a view of the 9 chunks it takes to send a 1 KiB packet due to the 127B MTU.
I benchmarked both sending packets without acknowledgement aka “Blind”, and with the acknowledgement request bit enabled.
The tight clustering of results is great to see, and a ~2.5 ms lower bound for small packets is fairly impressive. Generally speaking these results are similar to the nRF24’s comparable 256 Kbps configuration.
Despite trying for a little while, I wasn’t able to work out why the acknowledged 128B test outperformed blind transmission. The difference isn’t too meaningful, but if anyone reading knows why I’d love to hear from you.
Enabling the IDF Menuconfig’s “Throughput Optimisation” setting didn’t make any measurable impact for this test.
Bluetooth SPP
Often packaged alongside products as a ‘wireless RS-232 dongle’ or ‘Bluetooth serial adaptor’, modules implementing Bluetooth SPP (Serial Port Profile) act as transparent serial bridges and are the starting point for the dive into Bluetooth based transports.
HC-05
The HC-05/HC-06 modules are one model commonly found embedded in hobby electronics projects as a zero-effort way to send UART data to a phone, PC, or between microcontrollers. These use Bluetooth 2.0 + EDR (now called Bluetooth Classic) and can allegedly reach air-rates of 1 Mbps at close range.
My modules arrived running 2.0-20100601 firmware and default to 9600 baud UART. I used AT commands to set one as ‘master’ to auto-bind to the second module.
The default configuration doesn’t give great results, which is mostly caused by the low default UART speed.
At 9600 baud (8N1) it takes 1066 ms to transfer 1 kiB from the STM32 microcontroller to the HC-05 module. Increasing the baudrate immediately improves the situation.
Looking at logic analyser traces (diagrams simplified for clarity), we measure a 20 ±4 ms overhead duration between the UART transfers for the 12B test. This behaviour is consistent across every UART configuration.
For the larger 128 and 1024 byte payloads the modules behave consistently at 9600 and 57600 baud. The receiving side starts emitting the payload before the full payload has been written out but the output is still one consistent stream.
460800 baud was the highest my modules would accept and still pass each of the payload tests. We still see the ~20 ms latency between sending the last byte and seeing it on the other side, but the stream now appears to arrive in variable length bursts. I’ve seen these bursts range from a single byte to 254 bytes, on a ~5 ±2 ms slot interval.
At these rates it’s easy to overwhelm the modules by sending too much data – they handle this by dropping data randomly. This mostly justifies the slow default baudrate as it removes the need to consider rate limiting in userspace.
ESP32
We can compare the HC-05 behaviour against a pair of ESP32 modules as the Bluedroid stack supports Classic BT. We should also be able to reduce latency a little bit because the ESP32 doesn’t need to incur micro-to-radio transfer overheads!
Similar to the other ESP32 test firmwares, my implementation follows Espressif’s example but uses a FreeRTOS queue to pass write completion and inbound data events from SPP callbacks to a user task to handle the benchmark logic.
As the ESP32 in BT Classic mode has an MTU of 990 bytes, the 1 kiB payload requires splitting into two transfers.
The ESP32 outperforms the HC-05 modules, achieving almost half the latency for smaller packets with the performance gap widening as payloads increase in size.
Espressif’s docs reiterate some common sense – sending larger payloads less frequently is more efficient than high frequency smaller payloads. I experimented by forcing the 1 KiB payload into 32 and 64 byte chunks to compare against the 990 byte MTU result.
As expected, increased overheads mean smaller chunk sizes take longer, but we also see more variability in transfer timing.
Adding instrumentation and logging narrowed down the most likely cause as increased congestion events. Congestion flags bubble up from one of the lower levels of the Bluetooth stack (L2CAP) and signal that we shouldn’t send more chunks until the flag is cleared.
Bluetooth LE
Introduced alongside Bluetooth 4 in late 2009, BLE (Bluetooth Low Energy) was designed ground-up for low power devices with the goal of improving compatibility with user devices like smartphones. Since then we’ve seen an explosion in app-connected products across pretty much all consumer markets and even industrial hardware, with the majority of these devices using BLE.
BLE devices communicate using the GATT (Generic ATTribute Profile) server-client model: the server describes data (characteristics) and metadata (attributes), and client devices (a user’s phone) read or write against these characteristics. You’ll often see them called Peripheral and Central in Bluetooth documentation.
The BLE specification restricts the minimum connection interval to 7.5 ms, so I’m expecting all of the implementations to achieve lower-bound results under 10 ms. The BLE Throughput Primer on Memfault’s blog covers many of the underlying behaviours being exercised in this section.
ESP32 Bluedroid
The take a look at firmware configures a pair of ESP32 boards utilizing a typical strategy for connecting a sensor node (peripheral/server) to an ‘finish person’ fashion machine (central/shopper). GATT servers have the power to ‘push’ information to the shopper utilizing both an Indication (requiring acknowledgement) or Notification (with out acknowledgement).
I in contrast the latency of the server notification strategy towards the shopper’s ‘Write With out Response‘ by supporting each instructions of information switch within the implementation and easily swapping the take a look at setup to set off the shopper board.
I hadn’t correctly examined this particular element earlier than and did not count on to see a lot distinction in latency, however we’ve got a ~5ms distinction with some outlier shopper writes extending previous 40ms. I might be shocked if this element can be a show-stopper for real-world tasks although.
The Notification test data show some distinct distribution bands of higher density.
These groups are approximately 7 ms apart, which correlates strongly with the minimum 7.5 ms connection interval for BLE.
Continuing with the different test payloads sent via Notification, both the 12B and 128B tests fit inside the ESP32’s recommended 200 byte MTU, but the 1 KiB test needs 6 packets to send.
These results are fairly good, but it’s hard to see an improvement over the older Classic SPP results from the ESP32 without considering the differences in power consumption and the vastly better end-user connection experience.
ESP32 NimBLE
While the Bluedroid stack was used for the previous ESP32 BLE tests, the ESP-IDF also supports Apache’s MyNewt NimBLE stack which has been developed specifically for low-power and memory constrained hardware.
I couldn’t find any information about potential performance or latency benefits, so I re-implemented the BLE SPP firmware to see how NimBLE stacks up.
It’s worth pointing out that Espressif’s NimBLE examples don’t match their README. Also, be prepared to dig through scraps of MyNewt documentation to implement MTU exchange and subscriptions to achieve feature parity with the Bluedroid example.
Odd, that’s not what I expected at all. Before we jump to any conclusions, let’s test the other payload sizes with the same setup as the previous SPP and GATT tests…
There are clearly a few things wrong here: these NimBLE test results are meaningfully slower and far less consistent than the Bluedroid stack. I’m also a little confused by the inversion in latency for client writes.
I asked Espressif if everything was working properly, then went poking for a few days while waiting for a response.
At the time of publishing there’s been no official response…
Things improved significantly after manually specifying faster ble_gap_upd_params for interval and connection timings (fixing the 40 ms gap spacings), but it took a hint from @xyzzy42 in the issue thread a few weeks later to point me towards some more configuration.
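The timing fields in a connection parameter update aren’t expressed in milliseconds, which is easy to trip over. A minimal sketch of the conversions, assuming the standard BLE units (connection interval in 1.25 ms steps, supervision timeout in 10 ms steps). The helper names are hypothetical; as far as I understand the NimBLE headers, the results would be written into fields such as itvl_min, itvl_max and supervision_timeout of ble_gap_upd_params.

```c
#include <assert.h>
#include <stdint.h>

/* Connection interval is expressed in units of 1.25 ms (1250 us).
 * Valid range per the BLE specification: 7.5 ms to 4 s. */
static uint16_t conn_interval_units_from_us(uint32_t interval_us)
{
    assert(interval_us >= 7500u && interval_us <= 4000000u);
    return (uint16_t)(interval_us / 1250u);
}

/* Supervision timeout is expressed in units of 10 ms.
 * Valid range per the BLE specification: 100 ms to 32 s. */
static uint16_t supervision_timeout_units_from_ms(uint32_t timeout_ms)
{
    assert(timeout_ms >= 100u && timeout_ms <= 32000u);
    return (uint16_t)(timeout_ms / 10u);
}
```

For example, the 7.5 ms minimum interval becomes 6 units, and a 40 ms interval becomes 32 units.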
Looking at some sniffed BLE captures in Wireshark we can see that our packets are being broken into six 26B fragments even though the peer confirmed our larger requested 200B MTU value during connection.
So the low-level controller’s MTU doesn’t seem to be affected and packets are being automatically fragmented and reassembled by L2CAP. Espressif uses an intermediate VHCI (Virtual Host-Controller Interface) layer between the NimBLE Host and the underlying Bluetooth controller, which might be where this rough edge comes into play.
By calling NimBLE’s ble_gap_set_data_len(handle, tx_octets, tx_time), we’re making a (wrapped) call against the ESP32’s Bluetooth HCI which does make the configuration change we wanted. We can sniff the connection and see it behaving correctly in Wireshark.
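As a rough guide to the values passed as tx_octets and tx_time, the sketch below estimates the on-air time of a single link-layer data packet. It assumes the uncoded 1M PHY (8 µs per byte) and 14 bytes of framing around the payload (preamble, access address, header, MIC and CRC). The helper name is hypothetical, and the values I actually used aren’t shown here.

```c
#include <assert.h>
#include <stdint.h>

#define LL_PAYLOAD_MIN_OCTETS 27u   /* default link-layer payload size */
#define LL_PAYLOAD_MAX_OCTETS 251u  /* maximum with data length extension */
/* preamble (1) + access address (4) + header (2) + MIC (4) + CRC (3) */
#define LL_FRAMING_OCTETS 14u
#define US_PER_OCTET_1M_PHY 8u

/* Estimated transmit time in microseconds for one packet on the 1M PHY. */
static uint16_t ll_tx_time_us_1m(uint16_t tx_octets)
{
    assert(tx_octets >= LL_PAYLOAD_MIN_OCTETS && tx_octets <= LL_PAYLOAD_MAX_OCTETS);
    return (uint16_t)((tx_octets + LL_FRAMING_OCTETS) * US_PER_OCTET_1M_PHY);
}
```

Under those assumptions a 27 byte payload takes 328 µs and a 251 byte payload takes 2120 µs, which is why the larger data length cuts down the number of fragments per notification so dramatically.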
Running the benchmarks again, we see that these changes have contributed to a huge improvement in variance, and both the best and worst-case latency results are halved for the 1024B test.
A generally better result, and mostly matching Bluedroid’s defaults. I haven’t investigated power consumption or resource utilisation deeply (yet) but I’d still pick Bluedroid with the ESP32 based purely on the quality of examples and documentation.
nRF52
Implementing the same BLE behaviour with a platform designed around Bluetooth is worthy of comparison. I bought a pair of Nordic’s nRF52840-DK boards and implemented the benchmark tests using the “Nordic UART Bridge Service (NUS)” library which provides helper functions and vendor-standardisation for a GATT-based generic data transport.
The implementation is similar to the handful of ESP32 projects tested earlier, but Zephyr RTOS has some slight differences in approach and runs on a higher tick rate than FreeRTOS. Just like most of the other tests, I needed to spend some time experimenting with BLE configuration to get fair results as the defaults were a bit relaxed.
One notable difference to the other BLE implementations was consistent performance regardless of the transfer direction between boards.
Something I found interesting during early tests was impressively tight clustering of results (within ±1 ms) for shorter benchmark sequences. Over longer spans of many minutes the distribution of results ranged more evenly.
This was a clear demonstration of unintended alignment and subtle drift between the stimulus signal and the boards waiting for the next BLE transmission slot. As a result, I’m including a variation test run here to give a better impression of latency spread in less controlled environments:
- Normal periodic trigger pulse interval,
- ‘Randomised’ trigger intervals to mitigate synchronisation biases with the connection interval. The signal generator’s sweep functionality slowly added +50 ms to the normal pulse interval.
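The alignment effect is easy to reproduce on paper. The sketch below assumes a simplified model where a payload queued at trigger time t has to wait for the next connection event on a fixed grid; it ignores clock drift, retransmissions and stack latency, and the function and parameter names are my own. It reports the spread (max minus min) of that wait across a run of triggers, with an optional per-pulse increment standing in for the signal generator sweep.

```c
#include <assert.h>
#include <stdint.h>

/* Time until the next connection event for a payload queued at t_us. */
static uint32_t wait_for_next_event_us(uint64_t t_us, uint32_t conn_interval_us)
{
    uint32_t phase = (uint32_t)(t_us % conn_interval_us);
    return (phase == 0u) ? 0u : conn_interval_us - phase;
}

/* Spread (max - min) of the wait over `count` triggers.
 * step_us grows the trigger period on every pulse to mimic a slow sweep. */
static uint32_t wait_spread_us(uint32_t first_trigger_us, uint32_t period_us,
                               uint32_t step_us, uint32_t count,
                               uint32_t conn_interval_us)
{
    assert(count > 0u && conn_interval_us > 0u);
    uint64_t t = first_trigger_us;
    uint32_t lo = UINT32_MAX;
    uint32_t hi = 0u;
    for (uint32_t i = 0; i < count; i++) {
        uint32_t w = wait_for_next_event_us(t, conn_interval_us);
        if (w < lo) lo = w;
        if (w > hi) hi = w;
        t += (uint64_t)period_us + (uint64_t)i * step_us;
    }
    return hi - lo;
}
```

In this model a trigger period that is an exact multiple of the connection interval (say 75 ms against 7.5 ms) gives a spread of zero, mirroring the tight clusters from the short runs. Adding a small per-pulse step walks the trigger phase across the interval and the spread grows towards the full 7.5 ms.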
The tight clumps of results are aligned to multiples of the connection interval. We can also see the expected behaviour that larger packets increase the widths of the clusters. Outliers aren’t anywhere to be seen, and the lower-bound latency for the 1 KiB packet is half of the ESP32’s best BLE result.
Sniffing a 1 KiB test with Wireshark shows us the idealised 1 KiB transfer sequence in action: 6 packets immediately after each other, taking ~6 ms combined. So we’re still really limited by the connection interval!
Nordic get brownie points for their documentation and examples working first try without modification, and the inclusion of GATT based Latency and Throughput APIs shows us what the bare minimum should be for developers to reasonably reproduce and test their hardware.
WiFi
Given the ESP32’s core feature is its WiFi support, we really should see how it stacks up. Again, we’re only looking at latency and ignoring the limited range and higher power consumption.
First up, non-blocking TCP and UDP socket implementations between the two boards using existing WiFi infrastructure (Unifi U6+ about 5 meters away, 8 other 2.4 GHz clients).
All of the test payloads can fit in a single packet and the boards are running at 72 Mbps PHY rate (HT20) which should trivialise test payload timings with sheer throughput.
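To put ‘sheer throughput’ into numbers, here is a back-of-envelope sketch of how long the raw payload bits occupy the air at a given PHY rate. It deliberately ignores preambles, headers, acknowledgements and contention, so real frames take longer. The 1 Mbps comparison figure is an assumption for a basic BLE PHY rather than something measured here, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Nanoseconds needed to clock out payload_bytes at phy_rate_mbps.
 * Counts payload bits only; all protocol overhead is ignored. */
static uint64_t payload_airtime_ns(uint32_t payload_bytes, uint32_t phy_rate_mbps)
{
    assert(phy_rate_mbps > 0u);
    uint64_t bits = (uint64_t)payload_bytes * 8u;
    return (bits * 1000u) / phy_rate_mbps; /* 1 Mbps == 1 bit per 1000 ns */
}
```

Under those assumptions the 1 KiB payload needs roughly 0.11 ms at 72 Mbps against about 8.2 ms at 1 Mbps, so on WiFi the payload size should barely register next to scheduling and stack overheads.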
Once again, we’re seeing the ESP-IDF default configuration underperforming the expected latency results. For context, pinging either of the boards from my workstation gives ~7 ms results.
Going through the docs and forums shows us a few knobs we can turn – playing with the modem’s power-saving modes, ensuring the WiFi and LwIP stacks are in IRAM, and disabling Nagle’s Algorithm for TCP.
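Of those knobs, disabling Nagle’s Algorithm is the easiest to show in isolation because LwIP exposes the familiar BSD socket API. This is a minimal sketch using plain POSIX sockets so it also builds on a desktop; on the ESP32 the same setsockopt call applies to an LwIP socket. The helper names are my own, and the modem power-save setting is a separate call (esp_wifi_set_ps in ESP-IDF) that isn’t shown.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable Nagle's Algorithm on a TCP socket so small writes go out immediately.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt). */
static int tcp_disable_nagle(int sock_fd)
{
    assert(sock_fd >= 0);
    int no_delay = 1;
    return setsockopt(sock_fd, IPPROTO_TCP, TCP_NODELAY, &no_delay, sizeof(no_delay));
}

/* Read the option back: 1 if TCP_NODELAY is set, 0 if not, -1 on error. */
static int tcp_nagle_disabled(int sock_fd)
{
    int value = 0;
    socklen_t len = sizeof(value);
    if (getsockopt(sock_fd, IPPROTO_TCP, TCP_NODELAY, &value, &len) != 0) {
        return -1;
    }
    return value != 0;
}
```

Reading the option back after setting it is a cheap way to confirm the change actually landed on the socket being benchmarked.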
Much better. Both TCP and UDP were able to achieve the same lower-bound latencies and comparable worst-case outliers. UDP is shown with a ~2.5 ms higher median which I attributed to the more even distribution of results in the span, but I’d consider it too close to call for this micro-benchmark.
In real-world projects it’s more common to see high-level protocols than raw sockets, especially given how often integrations need to support phones, web services, and third party systems. WebSockets are a pretty popular choice and we’d expect them to perform similarly on the ESP32 to our TCP results.
Espressif’s example projects (esp_websockets_client and ws_echo_server) provided a better starting point for the benchmark implementation than the LwIP socket implementations.
At this point I shouldn’t have been surprised, but I really struggled to achieve consistent run-to-run results across many testing and optimisation attempts.
While drafting a highly detailed GitHub issue I worked out that I was being thwarted by Nagle’s Algorithm again, needing a slightly different approach to disable it when using the httpd server library.
#include "esp_http_server.h"
#include "esp_log.h"
#include "lwip/sockets.h"

// TAG is the logging tag defined elsewhere in the project
static esp_err_t ws_server_handler(httpd_req_t *req)
{
    if (req->method == HTTP_GET)
    {
        ESP_LOGI(TAG, "WS Handshake Complete");

        // Modify the underlying TCP socket. Surely there's a better way?
        int sock_id = httpd_req_to_sockfd(req);
        int no_delay = 1;
        setsockopt(sock_id, IPPROTO_TCP, TCP_NODELAY, &no_delay, sizeof(int));
        return ESP_OK;
    }

    // Rest of websocket packet handling code
As expected, there’s no meaningful impact of packet size due to the high link throughput, but we can see that WebSockets have cost us roughly 6 ms over the lower-level TCP socket implementation (this could be phrased as “double the latency” for clickbait?). Larger packets repeatedly tested faster than smaller transfers for some unknown reason.
So while WebSockets are a lot easier to work with, as implemented, they do have a latency cost on the ESP32.
For a quick comparison, a pair of Raspberry Pis using onboard WiFi can achieve WebSocket transfer results on par with the ESP32’s TCP results, with a pretty minimal NodeJS implementation.
The Pi’s outliers are spread a bit wider than I’d expect (especially when on Ethernet) but digging into network and performance tuning of Linux and run-times like Node isn’t something I’ll be doing in this post!
Results
We finally made it! It only took 10k lines of code, a new year, and running more than 200 tests across the different targets…
To make comparisons easier, we’ll start with a barchart of the upper quartile latency figures as I think they’re the most statistically fair across the board.
When I started these micro-benchmarks I didn’t expect the nRF24 module to perform so well – it recorded the lowest minimum, lower/upper quartile, and median for 12B and 128B payloads.
We can see an interesting trend with the top three results: despite strong performance for the small and medium payloads, their 1 KiB results are alongside the BLE implementations in the midfield. The common element between them is the small MTU which requires many chunk transfers.
Unfortunately, the nRF52 is placed second-to-last on this chart due to its upper quartile (75%) latency result being just slightly higher than the HC-05 and LoRa 12B results, even though its median should place it alongside the ESP32 BLE (Bluedroid) implementation.
The 915 MHz LoRa and SiK modules come in last place as expected – low-throughput links optimised for long range are going to struggle in a test that favours high throughput. I expect these modules will fare better when I look into range and congested RF environments in future tests!
60fps is often considered the lower bound for playable game framerates, at roughly 16.7 ms per frame.
Generally, most radios achieved average small-packet results lower than that!
Because so many of these tests used an ESP32, we should have enough data to compare the different protocols and wireless stacks when using the same RF front-end. Most protocols had similar performance with 12B and 128B packets – they all have an MTU exceeding the 128B test and used (relatively) high throughput links.
I’ve plotted the cumulative probability distribution for each protocol against latency. This lets us make more intuitive comparisons between protocols than another set of box-plots. Picking a point on a line tells us what percentage of a test’s results had finished prior to that duration.
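Reading a value off one of those curves is equivalent to asking what fraction of the recorded latencies were at or below a given duration. A minimal sketch of that calculation follows; the function name is hypothetical and this isn’t the plotting code used for the charts.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction (0.0 .. 1.0) of samples that completed within threshold_ms. */
static double fraction_within(const double *latency_ms, size_t count, double threshold_ms)
{
    assert(latency_ms != NULL && count > 0u);
    size_t done = 0;
    for (size_t i = 0; i < count; i++) {
        if (latency_ms[i] <= threshold_ms) {
            done++;
        }
    }
    return (double)done / (double)count;
}
```

Sweeping threshold_ms across the range of results and plotting the returned fraction traces out the curve for one protocol.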
There are a few general findings that are fairly obvious:
- As packet size increases, WiFi’s higher throughput beats everything.
- When using WiFi with TCP/IP transfers, there are only small differences between TCP, UDP and WebSockets performance for these benchmark conditions.
- SPP leads over the BLE results, probably due to its ~5x larger 990 byte MTU.
- If using BLE on the ESP32, the Bluedroid stack is lower latency than NimBLE.
Development Experience
The easiest implementation was the ESP32-C6 with IEEE 802.15.4, followed by the transparent UART bridges SiK and HC-05, and ESPNOW.
The most time consuming part was implementing and testing the RFM95 LoRa modules, as I burnt time trying a few different OSS libraries with design issues ranging from blocking sleeps, to bugs, to polling the module’s status IO instead of using interrupts.
However the most frustrating work was troubleshooting the NimBLE stack on the ESP32. It was the perfect storm of sub-par default performance, stale example projects, and needing to repeatedly cross-reference between the Espressif and MyNewt documentation websites and source-code.
What wireless module is best for my project?
I see this question all the time online, and it’s hard to advocate for a specific radio module or protocol over another based on latency alone.
One of the takeaways of these tests should be how capable modern radios are, and when they have the ability to use multiple protocols (often at the same time), this becomes more of a software choice than a hardware one!
For microcontrollers with integrated radios:
- Newer Espressif parts like the ESP32-C6 are a compelling choice for their low cost, reasonable tooling, community support and capable hardware.
- I’ll definitely be building with the C6 and 802.15.4 in future projects…
- If you can tolerate a steeper learning curve and have climbed a DeviceTree before, then Nordic’s nRF parts offer a more consistent developer experience and first-party examples.
- ST’s WB series is worth a look, but has caused me a lot of pain previously.
If you’re adding an external radio to your microcontroller and don’t need the highest performance, any transparent UART bridge is a good low-effort choice.
The nRF24 performed well in these tests but is a little dated at this point. It’s probably still a reasonable choice for simple one-off projects, otherwise look at the nRF5’s Shockburst support for newer options.
Still not sure? Look for modules that use SPI to maximise performance. A good starting point might be the RadioHead Arduino library which supports a wide range of modules.
Key Takeaways
The process of implementing and testing each of these modules reinforced a few useful lessons:
- Incredible low-latency communication links are more accessible than ever.
- Unsurprisingly the balance between power consumption, throughput, and latency matters – and default settings are often on the conservative side.
- If you need the lowest latency and tight control over your system’s behaviour, you’ll probably find the best results with a wireless stack that isn’t trying to co-exist with other protocols or devices.
- Even when you’re doing everything properly, validate with scope traces and Wireshark captures.
- Benchmarking things properly is really time consuming!
I’m thinking of doing a set of real-world range tests against a sub-set of these devices, and would love feedback if you found this helpful or interesting (or have any corrections/suggestions).