Nebula is Not the Fastest Mesh VPN
When have you ever seen a vendor or developer publish benchmarks where their product or project was represented as being on par with (or behind) its direct competitors? There must be examples of this, but I'm struggling to come up with any. Regardless, why would anyone choose to do this? I hope this article clarifies why we're publishing these results and helps clear up some quite common misconceptions about performance.
The first section describes the how and why, but if you want to skip ahead to the results, feel free. I'd encourage you to refer back to the earlier sections if you need clarification on this somewhat complex subject.
Background
We started developing Nebula in early 2017, and open sourced it nearly three years later, at the end of 2019. Because we built Nebula inside Slack, a company that was growing quickly, we were forced to think about scale and reliability from the beginning.
You might be surprised to learn that we've also been benchmarking Nebula against similar software since <checks notes> October of 2018, according to this git commit:
These benchmarks have been a valuable means of validating major changes we've made to Nebula over the years, but they've also helped us see where we stand compared to our peers. As the field has evolved, we've seen results improve for nearly all of the options we test against. For our own purposes, we also benchmark Nebula against older versions of Nebula to ensure we catch and resolve problems like memory leaks or unexpected CPU use between versions. When your software connects the infrastructure of a service millions of people depend on, it is important to do performance regression testing. Consistency and predictability in resource use and performance are things we value.
Despite the fact that we've been doing this for years, there is no good public version of data like this. In the cases where benchmarks do exist, they're often dreadfully unscientific, or they're point-in-time snapshots, as opposed to an ongoing effort. There are good people improving mesh VPN options all the time, but nobody is tracking their progress as it happens. We aim to change that by opening our benchmarking methodology to public review and contribution.
I recommend you read the paper "Fair Benchmarking Considered Difficult: Common Pitfalls In Database Performance Testing" from Centrum Wiskunde & Informatica (CWI), authored by Mark Raasveldt, Pedro Holanda, Tim Gubner & Hannes Mühleisen. It's excellent and absolutely worth your time. The title mentions databases, which are their particular expertise, but the paper itself makes excellent points that are applicable to all benchmarking. Figure 1 is so accurate that it made me laugh out loud.
Our Guidelines for Meaningful Benchmark Testing
We've put a lot of thought into how to make our testing useful and fair over the years. Here are some of the most important guidelines we follow in our testing:
- Be as objective as possible. Although we're quite fond of Nebula, we aim to be as objective as possible regarding this testing and these results. For years these tests have only been used internally, so there is no incentive to manipulate or distort the data. This has always been for our own benefit, and it will continue to be an extremely valuable part of our testing process.
- Buy some hardware. At first, we made the same mistake as everyone else, trying to get meaningful and reproducible benchmark numbers by spinning up hosts at various cloud providers. To be fair, this can be useful for testing changes to your codebase, if you accept the very real caveats of such testing. But it's a terrible way to compare software from different vendors, if your goal is accuracy. We have extensive experience running Nebula at massive scale on hosts at various cloud providers, and the only consistent thing we've seen is the inconsistency of the results. A few years ago, we bought five very boring Dell desktop computers with relatively boring i7-10700 CPUs, installed 10 gigabit network interface cards in each of them, and connected them to a switch that cost more than any of the computers. Testing is done by netbooting fresh OSes every single time we run our benchmarks. We generally boot the latest LTS release of Ubuntu before every round of testing. Our results are repeatable and consistent, and we run multiple rounds of the same test to validate the results.
- Detune the hardware so you aren't fighting thermal issues. CPUs and cooling solutions can have a lot of variability, so if the chip fab was having a bad day, or your fan has a blade with an invisible turbulent crack, it is entirely possible for two "identical" boxes to perform differently once you reach the top end. To remove this variable, we've disabled some of the speed states that can result in inconsistent performance. Hyperthreading is also disabled on these hosts. (A sketch of this kind of tuning appears after this list.)
- Test multiple streams between multiple hosts. When evaluating mesh VPN software, you should be transmitting and receiving traffic from all of the hosts simultaneously. There are a surprising number of new and different performance characteristics that emerge once you involve multiple hosts. Often, when someone posts benchmarks, you'll see them spin up two hosts and use a single iperf3 stream between them, but that's of limited value. For one thing, iperf3 is typically using a full core by itself, so iperf3 becomes the bottleneck. This can make for nonsensical results. (Author's opinion: iperf3 is a remarkably easy to use and useful tool, which is a blessing and a curse.) If you are using a mesh VPN, you probably have more than two hosts talking at any given time. Honestly, if you only care about a point-to-point connection, use whatever you like. Wireguard is great. IPsec exists. OpenVPN isn't even that bad these days.
- Compare functionally equivalent things. Historically, we benchmarked Nebula not only against mesh VPN software, such as ZeroTier and Tinc, but also against classical VPNs, such as Wireguard and OpenVPN. There were fewer options in the space back then, but that has changed significantly in the past few years. The goals of Nebula and Wireguard are quite different, and Wireguard doesn't concern itself with things like ACLs or host identity. This article and this subset of our testing are purposefully limited to things functionally equivalent to Nebula, to avoid the inevitable lengthy explanations about the differences between dissimilar software.
- Level the playing field. Every single one of the mesh VPN options tested here uses a different default MTU. There's nothing wrong with this, but the only way to meaningfully compare performance between these options is to set a lowest common denominator packet size. The applications you use decide how much data they send at a given moment, not you, so assuming they will always send packets that take full advantage of a large MTU is unrealistic. We've gone back and forth between an effective MTU of 1240 and 1040 various times over the years, by using MSS clamping within iperf3. As you'll see in the results, the most relevant metric is often packets per second. As you scale up and down through various MTU options, the peak number of packets per second remains the bottleneck. Most networking hardware vendors speak in these terms, and ultimately, the number of packets you can transmit and receive in a particular timeframe is the only thing that matters.
- Never comingle the software you're testing on a host. I've witnessed some absurdly bad results caused by sloppy testing. Most mesh VPNs continuously try to discover new paths between peers. If you happen to run two of them at once, and don't tell them to exclude each other as a viable path, you can end up sending, for instance, Nebula traffic through a ZeroTier tunnel. I've accidentally done this myself, in fact. Additionally, some mesh VPNs modify routing tables, add iptables rules, and do all manner of things to the host system that can affect the next test. These are variables that should be eliminated. (A quick way to confirm where your test traffic is actually flowing is sketched after this list.)
- Learn to tune everything, not just your thing. Over time, we've invested a lot of effort in understanding the performance characteristics of everything we test. There's usually no incentive to learn how to make your competitor's software perform well, but in our case, we want to know when it is happening and learn from those results. A series of tests where you tune your own thing and ignore everything else is dubious.
- Have fun. Just kidding. This has been a lot of hard work over the years. It's interesting, but we (I) wouldn't say it's particularly fun.
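To make the detuning above concrete, here is a minimal sketch of the kind of adjustments we mean on Linux. The sysfs paths assume an Intel CPU using the intel_pstate driver; other CPUs and drivers expose different knobs.

```bash
# Pin the CPU near its base frequency by disabling turbo boost
# (intel_pstate-specific; assumes your test hosts use this driver)
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

# Disable hyperthreading (SMT) so "identical" boxes behave identically
echo off | sudo tee /sys/devices/system/cpu/smt/control
```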
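And to catch comingling problems before they poison a run, it's worth asking the kernel which interface will actually carry your test traffic. A hypothetical check, assuming a peer overlay address of 192.168.100.5:

```bash
# Show the route (and therefore the interface) the kernel will use for the
# peer's overlay address; it should be the tunnel under test (e.g. "dev
# nebula1"), never another mesh VPN's interface
ip route get 192.168.100.5
```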
Mesh VPNs Compared
For this first public release of data, we've chosen to test the most popular mesh VPN options we're aware of, so we will be comparing Nebula (in AES mode), Netmaker, Tailscale, and ZeroTier (note: this list is intentionally in alphabetical order, as further confirmation of our commitment to fairness). There's an extremely important caveat to consider when comparing these options, due to its performance implications. Only Nebula and Tailscale directly implement stateful packet filtering. You can use either of them without applying additional rules on the virtual network interfaces associated with their mesh overlay. ZeroTier's stateless firewall is more limited in capability, but a discussion of the merits of stateful vs stateless packet filtering is out of scope for this writing.
Netmaker has something called "ACLs", but in reality, they can only prevent entire hosts from communicating with each other. They cannot be used to specify particular ports or protocols. Netmaker recommends that you use iptables or similar for fine-grained control. It might be assumed that iptables is fast enough to be effectively "free", but that is absolutely not the case. Even a single stateful conntrack rule on the INPUT chain can impact performance to the tune of about 10% in our testing (an example of such a rule is shown below). We decided not to use such a rule when testing Netmaker here, despite the fact that most large deployments would and should use fine-grained filtering of traffic between hosts.
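For reference, this is the kind of rule we mean: a single hypothetical stateful entry that makes the INPUT chain consult conntrack for every inbound packet.

```bash
# One stateful rule: accept packets that belong to established flows.
# Merely consulting conntrack per packet cost roughly 10% throughput in
# our testing, which is why we left it out of the Netmaker runs.
sudo iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
```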
Suggestions Welcome
A case might be made for us to include (insert another project here), but most others are still based on Wireguard or wireguard-go. We'll consider requests to add other Wireguard-based projects if folks are willing to send us evidence of an alternative performing significantly differently than the similar options we've tested here.
These benchmarks are meant to be used as an ongoing record of performance. We'll settle into a cadence for publishing/evaluating things once we get a feel for the demand for this information. This is a time-intensive task, so there will be limits, but the configurations, test parameters and command lines, and raw results from the tests will be made available on GitHub every time we do a round of testing. If the authors or users of any of these projects would like to help us further tune and refine our testing, we'll gladly integrate any reasonable changes and benchmark new versions when possible.
The Tests
We've done several different tests over the years, but we've distilled them down to just three primary tests that let us usefully compare different mesh VPN options. We'll describe the testing method used, and then show visualizations of the various results along with our interpretation of the data and any caveats.
Test 1: multi-host unidirectional transmit
Description: A single host transmits data to the other four hosts simultaneously, with no predetermined rate limit, for ten minutes. This test intentionally focuses on the transmit side of each option, so we can determine whether there are any asymmetrical bottlenecks.
Method used: iperf3 [host2,3,4,5] -M 1200 -t 600
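In practice, that means one iperf3 client per peer, launched in parallel from host1 against iperf3 servers already listening on hosts 2 through 5. A minimal sketch of how such a run might be driven (host names and log file names are placeholders):

```bash
# On each of host2..host5, an iperf3 server is already running: iperf3 -s
# From host1, start one MSS-clamped ten-minute stream per peer, in parallel
for h in host2 host3 host4 host5; do
  iperf3 -c "$h" -M 1200 -t 600 --logfile "tx_${h}.log" &
done
wait  # block until all four streams complete
```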
This graph shows that three of the four options, Nebula, Netmaker, and Tailscale, can reach throughput that matches the limits of the underlying hardware, nearly 10 gigabits per second (Gbps). The fourth option, ZeroTier, is single threaded, meaning it cannot take advantage of hosts with numerous CPU cores, unlike the others. This leaves ZeroTier's performance significantly limited compared to the others. The Tailscale result is a bit more variable than the other two at the top, and you can see various short drops and slightly inconsistent performance, which is +/- ~900 Mbps over the course of the testing.
Note: The strange drop in ZeroTier happens briefly on every transmit-only test we've done, though at different times. We have not yet determined the cause of these momentary throughput drops.
This is the total memory used by the processes of three of the four options. Tailscale memory use is highly variable during our testing, and appears to be related to its efforts to coalesce packets for segmentation offloading. Some of this memory might be recovered through garbage collection after the tests, but that is also out of scope for this writing. We have seen memory use exceed 1 GB during our tests, and the variability has been difficult to isolate. The memory results here are from the best case run we've recorded (where Tailscale used the least memory, compared to other runs).
Nebula and ZeroTier are extremely consistent in memory use, with almost no notable changes throughout testing. Nebula averages 27 megabytes of memory used, and ZeroTier averages 10 megabytes.
Note: Because Netmaker on Linux uses the Wireguard kernel module, it isn't possible to meaningfully gather data on its memory use, but it is generally efficient and consistent, from external observation.
This graph shows the relationship between throughput and CPU resources. You can see that ZeroTier and Nebula are quite similar here, with ZeroTier being a bit more variable. Nebula scales very linearly with more CPUs. The Tailscale result looks significantly better here, thanks to its use of various Linux segmentation offloading mechanisms, which let it make fewer syscalls in its packet processing path. It should be noted, however, that this increases CPU use by the kernel, which has to deal with these 'superpackets', so while segmentation offloading is certainly an efficiency boost, it isn't as dramatic as this graph makes it appear once you account for total system resources. Regardless, segmentation offloading is impressive, and it is what allowed Tailscale's Linux performance to catch up to Nebula's throughput numbers in early 2023. See note 1 at the end of this article for some important caveats regarding non-Linux platforms.
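If you want to check whether these offloads are active on your own system, one way is to inspect the offload flags on the relevant virtual interface. A hypothetical example (the interface name is an assumption; Tailscale's is typically tailscale0):

```bash
# Show segmentation/receive offload flags for a tunnel interface
ethtool -k tailscale0 | grep -E 'tcp-segmentation-offload|generic-(segmentation|receive)-offload'
```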
Note: Netmaker is not included here because it is hard to quantify kernel thread CPU use. It should be noted that it is quite similar to the others, and uses significant resources at these speeds.
Test 2: multi-host unidirectional receive
Description: Four hosts transmit data to a single host as fast as possible for ten minutes. This test intentionally focuses on the receive side of each option, so we can determine whether there are any asymmetrical bottlenecks.
Method used: iperf3 [host2,3,4,5] -M 1200 -t 600 -R
This graph shows that two of the four options, Netmaker and Tailscale, can reach throughput that matches the limits of the underlying hardware, nearly 10 Gbps. Compare the line for Netmaker with Figure 2 and you'll see that on the receive side, Netmaker's line is no longer flat. This is because it is CPU limited on the receive side, just like all of the other options; the receive side of kernel Wireguard has a different processing model, which becomes a bottleneck. You can see that Nebula has fallen behind on this test, and is consistently about 900 Mbit/s behind the leaders. As before, ZeroTier is held back by its inability to use multiple CPU cores for packet processing.
This is again the total memory used by the processes of three of the four options, but for the receive side. Tailscale memory use is still highly variable during our testing, though a bit less so when receiving. Once again, some of Tailscale's memory might be recovered through garbage collection after the tests, but that is also out of scope for this writing. The memory results here are again from the best case run we've recorded (where Tailscale used the least memory, compared to other runs).
Nebula and ZeroTier are extremely consistent in memory use, and again we see no notable changes throughout testing. Nebula again averages 27 megabytes of memory used, and ZeroTier averages 10 megabytes.
Note: Because Netmaker on Linux uses the Wireguard kernel module, it isn't possible to meaningfully gather data on its memory use, but it is generally efficient and consistent, from external observation.
Similar to Figure 4, this graph shows the relationship between throughput and CPU resources. You can see that ZeroTier is quite similar to before, but Nebula appears to be more efficient. While this is true, it is for a reason similar to why Tailscale appears more efficient in Figure 4: on the receive path, Nebula has long used the recvmmsg syscall, which shifts some of the burden toward the kernel, though perhaps less significantly than segmentation offloading does (further testing is needed to confirm this).
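One rough way to observe this batching from the outside is to count the syscalls a process makes during a run. A hypothetical example (assumes the process is named nebula and that you have root):

```bash
# Tally batched vs. single-message UDP syscalls for 30 seconds; a high
# recvmmsg count indicates batched receives on the hot path
sudo timeout 30 strace -f -c -e trace=recvmmsg,sendmmsg,recvmsg,sendmsg \
  -p "$(pidof nebula)"
```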
The Tailscale result again looks significantly better here, thanks to its use of various Linux segmentation offloading mechanisms, but this is once more a matter of shifting some of the packet processing overhead into the kernel via segmentation offloading.
Note: Netmaker is again not included because it is hard to quantify kernel thread CPU use. It should be noted that it is quite similar to the others, and uses significant resources at these speeds.
Test 3: multi-host bidirectional transmit and receive
Description: A single host transmits and receives data to/from the other four hosts simultaneously, with no predetermined rate limit, for ten minutes. This test intentionally combines the send and receive streams, so we can determine whether there are any bottlenecks when a box is sending and receiving at its limit.
Method used: iperf3 [host2,3,4,5] -M 1200 -t 600
and iperf3 [host2,3,4,5] -M 1200 -t 600 -R
are run simultaneously on host 1
This graph shows independent lines for the simultaneous send/receive traffic, which are added together into a total throughput number. It should be noted that the maximum achievable number here is between 19 and 20 Gbps, and you can see that none of the options reach this. Netmaker achieves a total send/receive throughput average of ~13 Gbps, followed by Nebula and Tailscale roughly tied at an average of ~9.6 Gbps. ZeroTier again comes in behind the rest, but you'll notice that it does have the ability to handle sending and receiving independently, and it sees an improvement over the directional tests, averaging ~3 Gbps.
Note: The strange drop in ZeroTier doesn't appear to happen during bidirectional tests.
The results are consistent with earlier memory use, with Nebula and ZeroTier using a steady amount, and Tailscale being more variable and using significantly more memory to process packets.
Note: Because Netmaker on Linux uses the Wireguard kernel module, it isn't possible to meaningfully gather data on its memory use, but it is generally efficient and consistent, from external observation.
As in Figure 4 and Figure 7, this graph shows the relationship between throughput and CPU resources. You can see that ZeroTier is quite similar to before, as is Nebula. This test ends up being somewhat more transmit heavy, so the results tend to represent the transmit side more prominently. Tailscale again appears more efficient, with the caveat that the kernel is doing more of the work.
Note: Netmaker is again not included because it is hard to quantify kernel thread CPU use. It should be noted that it is quite similar to the others, and uses significant resources at these speeds.
Conclusion
There is no single "best" option. Nebula, Netmaker, and Tailscale can realistically achieve performance that saturates a 10 Gbps network in a single direction on modern-ish CPUs, and they tend to have quite similar profiles with regard to total CPU use. Tailscale consistently uses significantly more memory than the rest of the options tested. (Note: Historically, Tailscale was quite far behind, but segmentation offloading has allowed it to achieve much, much better performance, despite the high overhead of its complex internal packet handling paths.)
As noted above, only Nebula and Tailscale have stateful packet filtering, which is an important consideration here. If folks would like an updated version of this test showing the impact of iptables rules on Netmaker, please let us know. For now, it was safer to give Netmaker a slight unfair advantage than to try to explain why we had added seemingly unnecessary iptables rules.
If your network is gigabit, ZeroTier is just as capable as the rest, with the lowest memory use of the three we measured. It's quite efficient, but held back by its lack of multithreading.
Finally, I guess we have to start sourcing 40 gigabit ethernet hardware, because the underlying network is now the limit in some tests.
Thank You + Additional Notes
A sincere "thank you" for taking the time to read this relatively dense performance exploration. If you read this hoping for a simple answer about the most performant mesh, I'm sure you're sorely disappointed. In fairness, we did telegraph that this would be the case in the title of the article. The GitHub repository with the raw data and configurations will be available soon, and we're happy to run the tests again with reasonable suggestions for tweaking performance.
Bonus notes that didn't fit anywhere in particular, but are included because perhaps someone will find them interesting:
- Most of the performance optimizations implemented by these projects only affect Linux. Things like segmentation offload aren't as commonly available on Windows and macOS. While it isn't in scope here, we have data showing that Nebula is significantly more efficient than wireguard-go (which both Netmaker and Tailscale use on non-Linux platforms), and if folks care to see this data, we may write a followup article.
- Depending on the underlying network, you can use larger MTU values to increase total throughput significantly. In AWS the default MTU is 9001, so Nebula's MTU is 8600. But there's another important detail: when you leave an AZ, your MTU drops to 1500. If you send a large Nebula packet, it will become fragmented, and AWS (and others) rate limit fragmented packets aggressively, so watch out for that. Nebula has a built-in capability where different network ranges can use different MTUs, which allows you to take advantage of large MTUs within an AZ but default to smaller ones elsewhere. (A sketch of this kind of configuration follows these notes.)
- We've excluded Tinc from the results, simply because it is rarely competitive, performance-wise. I respect Guus Sliepen and the work folks have done on Tinc over the years, and I was a happy user well into the mid 2010s. Additionally, it has some internal routing features that are novel and not replicated by any of the other options.
- The Raspberry Pi 5 is the first Pi that supports AES instructions, and Nebula in AES mode easily saturates gigabit without using more than a single core.
- It can be hard (or impossible) to evaluate kernel-based options directly, since kernel threads aren't as easy to measure as userspace processes. This is why we don't have memory use numbers for kernel Wireguard (Netmaker). We could probably get reasonably close by indirect observation, but we aren't confident that would be accurate enough for our purposes.
- I own a Lightning Ethernet adapter because I wanted to remove the variability of WiFi from the testing on iOS, but the Lightning Ethernet adapter has far more variability than WiFi. So, that's fun.
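Tying off the MTU note above, here is a minimal sketch of what per-range MTU configuration can look like in a Nebula config. The addresses and values are hypothetical; consult Nebula's example config for the authoritative format.

```yaml
# Hypothetical excerpt of a Nebula config (config.yml)
tun:
  # Default MTU for overlay traffic when no route below matches
  mtu: 1300
  routes:
    # Use a jumbo MTU only toward peers in this range (e.g. within one AZ)
    - mtu: 8600
      route: 10.10.0.0/16
```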