VPP on FreeBSD – Part 2

Ever since I first saw VPP – the Vector Packet Processor – I've been deeply impressed with its
performance and flexibility. For those of us who have used Cisco IOS/XR devices, like the classic
ASR (aggregation services router), VPP will look and feel quite familiar, as most of the approaches
are shared between the two. Over the years, folks have often asked me "What about BSD?" and to
my surprise, late last year I read an announcement from the FreeBSD Foundation
[ref] as they looked back
over 2023 and forward to 2024:
Porting the Vector Packet Processor to FreeBSD
Vector Packet Processing (VPP) is an open-source, high-performance userspace networking stack
that provides fast packet processing suitable for software-defined networking and network function
virtualization applications. VPP aims to optimize packet processing through vectorized operations
and parallelism, making it well-suited for high-speed networking applications. In November of this
year, the Foundation began a contract with Tom Jones, a FreeBSD developer focused on network
performance, to port VPP to FreeBSD. Under the contract, Tom will also allocate time for other
tasks such as testing FreeBSD on common virtualization platforms to improve the desktop
experience, improving hardware support on arm64 platforms, and adding support for low power idle
on Intel and arm64 hardware.
In my first [article], I wrote a sort of hello world
by installing FreeBSD 14.0-RELEASE on both a VM and a bare metal Supermicro, and showed that Tom's
VPP branch compiles, runs and pings. In this article, I'll take a look at some comparative
performance numbers.
Comparing implementations
FreeBSD has an extensive network stack, including regular kernel based functionality such as
routing, filtering and bridging, a faster netmap based datapath, along with some userspace
utilities like a netmap bridge, and of course completely userspace based dataplanes, such as the
VPP project that I'm working on here. Last week, I learned that VPP has a netmap driver, and from
previous travels I'm already quite familiar with its DPDK based forwarding. I decide to do a
baseline loadtest for each of these on the Supermicro Xeon-D1518 that I installed last week. See the
[article] for details on the setup.
The loadtests will use a standard set of different configurations, using Cisco T-Rex's default
benchmark profile called bench.py:
- var2-1514b: Large Packets, multiple flows with modulating source and destination IPv4
addresses, often called an 'iperf test', with packets of 1514 bytes.
- var2-imix: Mixed Packets, multiple flows, often called an 'imix test', which includes a
bunch of 64b, 390b and 1514b packets.
- var2-64b: Small Packets, still multiple flows, 64 bytes, which allows for multiple receive
queues and kernel or application threads.
- 64b: Small Packets, but now single flow, often called a 'linerate test', with a packet size
of 64 bytes, limiting to one receive queue.
Each of these four loadtests can run either unidirectionally (port0 -> port1) or bidirectionally
(port0 <-> port1). This yields eight different loadtests, each taking about 8 minutes. I put the kettle
on and get underway.
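For reference, a profile like this is started from the T-Rex stateless console roughly as follows.
The tunables (vm=..., size=...) match the test names in the tables below; the multiplier differs
per run, so treat this as a sketch rather than my literal command line. The first form sends from
port 0 only (unidirectional), the second from both ports (bidirectional):

trex>start -f stl/bench.py -t vm=var2,size=64 -m 10mpps -p 0
trex>start -f stl/bench.py -t vm=var2,size=64 -m 10mpps -p 0 1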
FreeBSD 14: Kernel Bridge
The machine I’m testing has a quad-port Intel i350 (1Gbps copper, utilizing the FreeBSD igb(4)
driver),
a dual-port Intel X522 (10Gbps SFP+, utilizing the ix(4)
driver), and a dual-port Intel i710-XXV
(25Gbps SFP28, utilizing the ixl(4)
driver). I resolve to dwell it up just a little, and select the 25G ports
for my loadtests right now, even when I believe this machine with its comparatively low-end Xeon-D1518 CPU
will battle just a little bit at very excessive packet charges. No ache, no acquire, amirite?
I take my recent FreeBSD 14.0-RELEASE set up, with none tinkering aside from compiling a GENERIC
kernel that has assist for the DPDK modules I’ll want later. For my first loadtest, I create a
kernel based mostly bridge as follows, simply tying the 2 25G interfaces collectively:
[pim@france /usr/obj]$ uname -a
FreeBSD france 14.0-RELEASE FreeBSD 14.0-RELEASE #0: Sat Feb 10 22:18:51 CET 2024 root@france:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64
[pim@france ~]$ dmesg | grep ixl
ixl0: <Intel(R) Ethernet Controller XXV710 for 25GbE SFP28 - 2.3.3-k> mem 0xf8000000-0xf8ffffff,0xf9008000-0xf900ffff irq 16 at device 0.0 on pci7
ixl1: <Intel(R) Ethernet Controller XXV710 for 25GbE SFP28 - 2.3.3-k> mem 0xf7000000-0xf7ffffff,0xf9000000-0xf9007fff irq 16 at device 0.1 on pci7
[pim@france ~]$ sudo ifconfig bridge0 create
[pim@france ~]$ sudo ifconfig bridge0 addm ixl0 addm ixl1 up
[pim@france ~]$ sudo ifconfig ixl0 up
[pim@france ~]$ sudo ifconfig ixl1 up
[pim@france ~]$ ifconfig bridge0
bridge0: flags=1008843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1500
        options=0
        ether 58:9c:fc:10:6c:2e
        id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
        maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
        root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
        member: ixl1 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 4 priority 128 path cost 800
        member: ixl0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 3 priority 128 path cost 800
        groups: bridge
        nd6 options=9<PERFORMNUD,IFDISABLED>
One thing that I quickly realize is that FreeBSD, when using hyperthreading, does have 8 threads
available, but only four of them participate in forwarding. When I put the machine under load, I see a
curious 399% spent in kernel while I see 402% in idle:
When I then do a single-flow unidirectional loadtest, the expected result is that only one CPU
participates (100% in kernel and 700% in idle), and if I perform a single-flow bidirectional
loadtest, my expectations are confirmed again, seeing two CPU threads do the work (200% in kernel
and 600% in idle).
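In case you want to follow along at home: these kernel/idle percentages are what top(1) reports
when summed across the CPU threads. An invocation along these lines, showing system processes,
individual threads, and per-CPU usage (the exact flags are a matter of taste), is what I keep
running in a second terminal:

[pim@france ~]$ top -SHP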
While the math checks out, the performance is a little bit less impressive:
Type | Uni/BiDir | Packets/Sec | L2 Bits/Sec | Line Rate |
---|---|---|---|---|
vm=var2,size=1514 | Unidirectional | 2.02Mpps | 24.77Gbps | 99% |
vm=var2,size=imix | Unidirectional | 3.48Mpps | 10.23Gbps | 43% |
vm=var2,size=64 | Unidirectional | 3.61Mpps | 2.43Gbps | 9.7% |
size=64 | Unidirectional | 1.22Mpps | 0.82Gbps | 3.2% |
vm=var2,size=1514 | Bidirectional | 3.77Mpps | 46.31Gbps | 93% |
vm=var2,size=imix | Bidirectional | 3.81Mpps | 11.22Gbps | 24% |
vm=var2,size=64 | Bidirectional | 4.02Mpps | 2.69Gbps | 5.4% |
size=64 | Bidirectional | 2.29Mpps | 1.54Gbps | 3.1% |
Conclusion: FreeBSD’s kernel on this Xeon-D1518 processor can deal with about 1.2Mpps per CPU
thread, and I can use solely 4 of them. FreeBSD is completely satisfied to ahead large packets, and I can
fairly attain 2x25Gbps however as soon as I begin ramping up the packets/sec by decreasing the packet dimension,
issues in a short time deteriorate.
FreeBSD 14: netmap Bridge
Tom pointed out a tool in the source tree, called the netmap bridge, originally written by Luigi
Rizzo and Matteo Landi. FreeBSD ships the source code, but you can also take a look at their GitHub
repository [ref].
What’s netmap anyway? It’s a framework for very quick and environment friendly packet I/O for userspace
and kernel purchasers, and for Digital Machines. It runs on FreeBSD, Linux and a few variations of
Home windows. As an apart, my buddy Pavel from FastNetMon identified a blogpost from 2015 during which
Cloudflare of us described a solution to do DDoS mitigation on Linux utilizing visitors classification to
program the community playing cards to maneuver sure offensive visitors to a devoted {hardware} queue, and
service that queue from a netmap shopper. Should you’re curious (I actually was!), you would possibly take a
have a look at that cool write-up
[here].
I compile the code and put it to work, and the man-page tells me that I need to fiddle with the
interfaces a bit. They must be:
- set to promiscuous, which makes sense as they have to receive ethernet frames sent to MAC
addresses other than their own
- stripped of any hardware offloading, notably -rxcsum -txcsum -tso4 -tso6 -lro
- and my user needs write permission to /dev/netmap to bind the interfaces from userspace.
[pim@france /usr/src/tools/tools/netmap]$ make
[pim@france /usr/src/tools/tools/netmap]$ cd /usr/obj/usr/src/amd64.amd64/tools/tools/netmap
[pim@france .../tools/netmap]$ sudo ifconfig ixl0 -rxcsum -txcsum -tso4 -tso6 -lro promisc
[pim@france .../tools/netmap]$ sudo ifconfig ixl1 -rxcsum -txcsum -tso4 -tso6 -lro promisc
[pim@france .../tools/netmap]$ sudo chmod 660 /dev/netmap
[pim@france .../tools/netmap]$ ./bridge -i netmap:ixl0 -i netmap:ixl1
065.804686 main [290] ------- zerocopy supported
065.804708 main [297] Wait 4 secs for link to come up...
075.810547 main [301] Ready to go, ixl0 0x0/4 <-> ixl1 0x0/4.
I start my first loadtest, which pretty immediately fails. It's an interesting behavior pattern which
I've not seen before. After staring at the problem, and reading the code of bridge.c
, which is a
remarkably straightforward program, I restart the bridge application, and traffic passes again but only
for a little while. Whoops!
I took a [screencast] in case any kind soul on freebsd-net
wants to take a closer look at this:
I start a little bit of trial and error, in which I conclude that if I send lots of traffic (like 10Mpps),
forwarding is fine; but if I send a little traffic (like 1kpps), at some point forwarding stops
altogether. So while it's not great, this does allow me to measure the total throughput simply by
sending lots of traffic, say 30Mpps, and seeing what amount comes out the other side.
Here I go, and I'm having fun:
Type | Uni/BiDir | Packets/Sec | L2 Bits/Sec | Line Rate |
---|---|---|---|---|
vm=var2,size=1514 | Unidirectional | 2.04Mpps | 24.72Gbps | 100% |
vm=var2,size=imix | Unidirectional | 8.16Mpps | 23.76Gbps | 100% |
vm=var2,size=64 | Unidirectional | 10.83Mpps | 5.55Gbps | 29% |
size=64 | Unidirectional | 11.42Mpps | 5.83Gbps | 31% |
vm=var2,size=1514 | Bidirectional | 3.91Mpps | 47.27Gbps | 96% |
vm=var2,size=imix | Bidirectional | 11.31Mpps | 32.74Gbps | 77% |
vm=var2,size=64 | Bidirectional | 11.39Mpps | 5.83Gbps | 15% |
size=64 | Bidirectional | 11.57Mpps | 5.93Gbps | 16% |
Conclusion: FreeBSD’s netmap implementation can be certain by packets/sec, and on this
setup, the Xeon-D1518 machine is able to forwarding roughly 11.2Mpps. What I discover cool is that
single circulation or a number of flows doesn’t appear to matter that a lot, in truth bidirectional 64b single
circulation loadtest was most favorable at 11.57Mpps, which is an order of magnitude higher than utilizing simply
the kernel (which clocked in at 1.2Mpps).
FreeBSD 14: VPP with netmap
It’s good to have a baseline on this machine on how the FreeBSD kernel itself performs. However of
course this sequence is about Vector Packet Processing, so I now flip my consideration to the VPP department
that Tom shared with me. I wrote a bunch of particulars concerning the VM and naked steel set up in my
[first article] so I’ll simply go straight to the
configuration components:
DBGvpp# create netmap name ixl0
DBGvpp# create netmap name ixl1
DBGvpp# set int state netmap-ixl0 up
DBGvpp# set int state netmap-ixl1 up
DBGvpp# set int l2 xconnect netmap-ixl0 netmap-ixl1
DBGvpp# set int l2 xconnect netmap-ixl1 netmap-ixl0
DBGvpp# show int
Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count
local0 0 down 0/0/0/0
netmap-ixl0 1 up 9000/0/0/0 rx packets 25622
rx bytes 1537320
tx packets 25437
tx bytes 1526220
netmap-ixl1 2 up 9000/0/0/0 rx packets 25437
rx bytes 1526220
tx packets 25622
tx bytes 1537320
At this point I can pretty much rule out that the netmap bridge.c
is the issue, because a
few seconds after introducing 10Kpps of traffic and seeing it successfully pass, the loadtester
receives no more packets, even though T-Rex is still sending it. However, a minute or so later
I can also see the RX and TX counters continue to increase in the VPP dataplane:
DBGvpp# show int
Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count
local0 0 down 0/0/0/0
netmap-ixl0 1 up 9000/0/0/0 rx packets 515843
rx bytes 30950580
tx packets 515657
tx bytes 30939420
netmap-ixl1 2 up 9000/0/0/0 rx packets 515657
rx bytes 30939420
tx packets 515843
tx bytes 30950580
.. and I can see that every packet that VPP received is accounted for: interface ixl0
has received
515843 packets, and ixl1
claims to have transmitted exactly that number of packets. So I think
perhaps they're getting lost somewhere on egress between the kernel and the Intel i710-XXV network
card.
However, contrary to the previous case, I cannot sustain any reasonable amount of traffic; be it
1Kpps, 10Kpps or 10Mpps, the system quite consistently comes to a halt mere seconds after
introducing the load. Restarting VPP makes it forward traffic again for a few seconds, just to end
up in the same upset state. I don't learn much.
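For anyone who wants to poke at a similar stall: the usual first stops in the VPP CLI are the
error counters and the driver-level interface details, which often show where packets are being
dropped or which ring has stopped being serviced. These are stock VPP commands; I omit their
(lengthy) output here:

DBGvpp# show errors
DBGvpp# show hardware-interfaces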
Conclusion: This setup with VPP using netmap doesn't yield results, for the moment. I have a
suspicion that whatever the cause is of the netmap bridge stalling in the previous test, is likely also
the culprit for this test.
FreeBSD 14: VPP with DPDK
But not all is lost – I have one test left, and judging by what I learned last week when bringing up
the first test environment, this one is going to be a fair bit better. In my previous loadtests, the
network interfaces were using their regular kernel driver (ixl(4)
in the case of the Intel i710-XXV
interfaces), but now I'm going to mix it up a little, and rebind these interfaces to a specific DPDK
driver called nic_uio(4)
which stands for Network Interface Card Userspace Input/Output:
[pim@france ~]$ cat << EOF | sudo tee -a /boot/loader.conf
nic_uio_load="YES"
hw.nic_uio.bdfs="6:0:0,6:0:1"
EOF
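The bus:device:function values in hw.nic_uio.bdfs correspond to the PCI selectors that
pciconf(8) prints – 6:0:0 and 6:0:1 are simply the two XXV710 ports on this machine. To double
check on your own hardware (output abbreviated and illustrative):

[pim@france ~]$ pciconf -l | grep ixl
ixl0@pci0:6:0:0: class=0x020000 ...
ixl1@pci0:6:0:1: class=0x020000 ...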
When I reboot, the network interfaces are gone from the output of ifconfig(8)
, which is good. I
start up VPP with a minimal config file [ref], which defines
three worker threads and starts DPDK with 3 RX queues and 4 TX queues. It's a common question why
there would be one more TX queue than RX queues. The answer is that in VPP, there is one (1) main thread and
zero or more worker threads. If the main thread wants to send traffic (for example, in a plugin
like LLDP which sends periodic announcements), it is best for it to use a transmit queue
specific to that main thread. Any return traffic will be picked up by the DPDK Process on worker
threads (as main doesn't have one of these). That's why the general rule is num(TX) = num(RX)+1.
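The [ref] above has my actual config; as a reminder of what the relevant stanzas roughly look
like, here is a minimal sketch. The unix/cpu/dpdk blocks are standard VPP startup options; the
PCI addresses assume the two XXV710 ports sit at 06:00.0 and 06:00.1, matching the bdfs values
earlier, and the comments are mine:

unix { interactive cli-listen /run/vpp/cli.sock }
cpu {
  main-core 0
  # three workers -> 3 RX queues; TX queues = workers + main = 4
  corelist-workers 1-3
}
dpdk {
  dev 0000:06:00.0 { num-rx-queues 3 num-tx-queues 4 }
  dev 0000:06:00.1 { num-rx-queues 3 num-tx-queues 4 }
}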
[pim@france ~/src/vpp]$ export STARTUP_CONF=/home/pim/src/startup.conf
[pim@france ~/src/vpp]$ gmake run-release
vpp# set int l2 xconnect TwentyFiveGigabitEthernet6/0/0 TwentyFiveGigabitEthernet6/0/1
vpp# set int l2 xconnect TwentyFiveGigabitEthernet6/0/1 TwentyFiveGigabitEthernet6/0/0
vpp# set int state TwentyFiveGigabitEthernet6/0/0 up
vpp# set int state TwentyFiveGigabitEthernet6/0/1 up
vpp# show int
Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count
TwentyFiveGigabitEthernet6/0/0 1 up 9000/0/0/0 rx packets 11615035382
rx bytes 1785998048960
tx packets 700076496
tx bytes 161043604594
TwentyFiveGigabitEthernet6/0/1 2 up 9000/0/0/0 rx packets 700076542
rx bytes 161043674054
tx packets 11615035440
tx bytes 1785998136540
local0 0 down 0/0/0/0
And with that, the dataplane shoots to life and starts forwarding (lots of) packets. To my great
relief, sending either 1kpps or 1Mpps "just works". I can run my loadtest as per usual, first with
1514 byte packets, then imix, then 64 byte packets, and finally single-flow 64 byte packets. And of
course, both unidirectionally and bidirectionally.
I take a look at the system load while the loadtests are running:
It’s totally anticipated that the VPP course of is spinning 300% +epsilon of CPU time. It’s because it
has began three employee threads, and these are execuing the DPDK Ballot Mode Driver which is
primarily a good loop that asks the community playing cards for work, and if there are any packets
arriving, executes on that work. As such, every employee thread is at all times burning 100% of its
assigned CPU.
That said, I can take a look at finer grained statistics in the dataplane itself:
vpp# show run
Thread 0 vpp_main (lcore 0)
Time .9, 10 sec internal node vector rate 0.00 loops/sec 297041.19
vector rates in 0.0000e0, out 0.0000e0, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
ip4-full-reassembly-expire-wal any wait 0 0 18 2.39e3 0.00
ip6-full-reassembly-expire-wal any wait 0 0 18 3.08e3 0.00
unix-cli-process-0 active 0 0 9 7.62e4 0.00
unix-epoll-input polling 13066 0 0 1.50e5 0.00
---------------
Thread 1 vpp_wk_0 (lcore 1)
Time .9, 10 sec internal node vector rate 12.38 loops/sec 1467742.01
vector rates in 5.6294e6, out 5.6294e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
TwentyFiveGigabitEthernet6/0/1 active 399663 5047800 0 2.20e1 12.63
TwentyFiveGigabitEthernet6/0/1 active 399663 5047800 0 9.54e1 12.63
dpdk-input polling 1531252 5047800 0 1.45e2 3.29
ethernet-input active 399663 5047800 0 3.97e1 12.63
l2-input active 399663 5047800 0 2.93e1 12.63
l2-output active 399663 5047800 0 2.53e1 12.63
unix-epoll-input polling 1494 0 0 3.09e2 0.00
(et cetera)
I showed only one worker thread's output, but there are actually three worker threads, and they are
all doing similar work, because they are each picking up 33% of the traffic assigned to the three RX
queues in the network card.
While the overall CPU load is 300%, here I can see a different picture. Thread 0 (the main thread)
is doing essentially ~nothing. It's polling a set of unix sockets in the node called
unix-epoll-input
, but other than that, main doesn't have much on its plate. Thread 1 however is
a worker thread, and I can see that it's busy doing work:
- dpdk-input: it's polling the NIC for work; it has been called 1.53M times, and in total it has
handled just over 5.04M vectors (which are packets). So I can derive that each time the Poll
Mode Driver offers work, on average there are 3.29 vectors (packets), and each packet is
taking about 145 CPU clocks.
- ethernet-input: the DPDK vectors are all ethernet frames coming from the loadtester. Seeing as
I've cross connected all traffic from Tf6/0/0 to Tf6/0/1 and vice-versa, VPP knows that it
should handle the packets in the L2 forwarding path.
- l2-input: is called with the (list of N) ethernet frames, which all get cross connected to the
output interface, in this case Tf6/0/1.
- l2-output: prepares the ethernet frames for output into their egress interface.
- TwentyFiveGigabitEthernet6/0/1-output (note: the name is truncated): if this had
been L3 traffic, this would be the place where the destination MAC address is inserted into the
ethernet frame, but since this is an L2 cross connect, the node simply passes the ethernet frames
through to the final egress node in DPDK.
- TwentyFiveGigabitEthernet6/0/1-tx (note: the name is truncated): hands them to the DPDK
driver for marshalling on the wire.
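One way to watch packets walk exactly this node path is VPP's packet tracer, which captures the
next N packets from a given input node and prints every node they visit, one stanza per packet –
it reads much like the list above. Output omitted here:

vpp# trace add dpdk-input 10
vpp# show trace
vpp# clear trace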
Halfway through, I see that there is an issue with the distribution of ingress traffic over the
three workers; maybe you can spot it too:
---------------
Thread 1 vpp_wk_0 (lcore 1)
Time 56.7, 10 sec internal node vector rate 38.59 loops/sec 106879.84
vector rates in 7.2982e6, out 7.2982e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
TwentyFiveGigabitEthernet6/0/0 active 6689553 206899956 0 1.34e1 30.93
TwentyFiveGigabitEthernet6/0/0 active 6689553 206899956 0 1.37e2 30.93
TwentyFiveGigabitEthernet6/0/1 active 6688572 206902836 0 1.45e1 30.93
TwentyFiveGigabitEthernet6/0/1 active 6688572 206902836 0 1.34e2 30.93
dpdk-input polling 7128012 413802792 0 8.77e1 58.05
ethernet-input active 13378125 413802792 0 2.77e1 30.93
l2-input active 6809002 413802792 0 1.81e1 60.77
l2-output active 6809002 413802792 0 1.68e1 60.77
unix-epoll-input polling 6954 0 0 6.61e2 0.00
---------------
Thread 2 vpp_wk_1 (lcore 2)
Time 56.7, 10 sec internal node vector rate 256.00 loops/sec 7702.68
vector rates in 4.1188e6, out 4.1188e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
TwentyFiveGigabitEthernet6/0/0 active 456112 116764672 0 1.27e1 256.00
TwentyFiveGigabitEthernet6/0/0 active 456112 116764672 0 2.64e2 256.00
TwentyFiveGigabitEthernet6/0/1 active 456112 116764672 0 1.39e1 256.00
TwentyFiveGigabitEthernet6/0/1 active 456112 116764672 0 2.74e2 256.00
dpdk-input polling 456112 233529344 0 1.41e2 512.00
ethernet-input active 912224 233529344 0 5.71e1 256.00
l2-input active 912224 233529344 0 3.66e1 256.00
l2-output active 912224 233529344 0 1.70e1 256.00
unix-epoll-input polling 445 0 0 9.59e2 0.00
---------------
Thread 3 vpp_wk_2 (lcore 3)
Time 56.7, 10 sec internal node vector rate 256.00 loops/sec 7742.43
vector rates in 4.1188e6, out 4.1188e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
TwentyFiveGigabitEthernet6/0/0 active 456113 116764928 0 8.94e0 256.00
TwentyFiveGigabitEthernet6/0/0 active 456113 116764928 0 2.81e2 256.00
TwentyFiveGigabitEthernet6/0/1 active 456113 116764928 0 9.54e0 256.00
TwentyFiveGigabitEthernet6/0/1 active 456113 116764928 0 2.72e2 256.00
dpdk-input polling 456113 233529856 0 1.61e2 512.00
ethernet-input active 912226 233529856 0 4.50e1 256.00
l2-input active 912226 233529856 0 2.93e1 256.00
l2-output active 912226 233529856 0 1.23e1 256.00
unix-epoll-input polling 445 0 0 1.03e3 0.00
Thread 1 (vpp_wk_0
) is handling 7.29Mpps and moderately loaded, while Threads 2 and 3 are handling
4.11Mpps each and are completely pegged. That said, the relative amount of CPU clocks they are
spending per packet is reasonably similar, but the numbers don't quite add up:
- Thread 1 is doing 7.29Mpps and is spending on average 449 CPU cycles per packet. I get this
number by adding up all of the values in the Clocks column, except for the unix-epoll-input
node. But that's somewhat strange, because this Xeon D1518 clocks at 2.2GHz – and yet 7.29M *
449 is 3.27GHz. My experience (in Linux) is that these numbers actually line up quite well.
- Thread 2 is doing 4.12Mpps and is spending on average 816 CPU cycles per packet. This kind of
makes sense as the cycles/packet is roughly double that of thread 1, and the packets/sec is
roughly half ... and the total of 4.12M * 816 is 3.36GHz.
- I can see similar values for thread 3: 4.12Mpps and also 819 CPU cycles per packet, which
amounts to VPP self-reporting using 3.37GHz worth of cycles on this thread.
When I take a look at the thread to CPU placement, I get another surprise:
vpp# show threads
ID Name Type LWP Sched Policy (Priority) lcore Core Socket State
0 vpp_main 100346 (nil) (n/a) 0 42949674294967
1 vpp_wk_0 workers 100473 (nil) (n/a) 1 42949674294967
2 vpp_wk_1 workers 100474 (nil) (n/a) 2 42949674294967
3 vpp_wk_2 workers 100475 (nil) (n/a) 3 42949674294967
vpp# show cpu
Model name: Intel(R) Xeon(R) CPU D-1518 @ 2.20GHz
Microarch model (family): [0x6] Broadwell ([0x56] Broadwell DE) stepping 0x3
Flags: sse3 pclmulqdq ssse3 sse41 sse42 avx rdrand avx2 bmi2 rtm pqm pqe
rdseed aes invariant_tsc
Base frequency: 2.19 GHz
The numbers in show threads
are all messed up, and I don't quite know what to make of it yet. I
suppose the perhaps overly specific Linux implementation of the thread pool management is throwing
FreeBSD off a bit. Perhaps some profiling could be useful, so I make a note to discuss this with Tom or
the freebsd-net mailing list, who will know a fair bit more about this type of stuff on FreeBSD than
I do.
Anyway, functionally: this works. Performance wise: I have some questions 🙂 I let all eight
loadtests complete, and without further ado, here are the results:
Type | Uni/BiDir | Packets/Sec | L2 Bits/Sec | Line Rate |
---|---|---|---|---|
vm=var2,size=1514 | Unidirectional | 2.01Mpps | 24.45Gbps | 99% |
vm=var2,size=imix | Unidirectional | 8.07Mpps | 23.42Gbps | 99% |
vm=var2,size=64 | Unidirectional | 23.93Mpps | 12.25Gbps | 64% |
size=64 | Unidirectional | 12.80Mpps | 6.56Gbps | 34% |
vm=var2,size=1514 | Bidirectional | 3.91Mpps | 47.35Gbps | 86% |
vm=var2,size=imix | Bidirectional | 13.38Mpps | 38.81Gbps | 82% |
vm=var2,size=64 | Bidirectional | 15.56Mpps | 7.97Gbps | 21% |
size=64 | Bidirectional | 20.96Mpps | 10.73Gbps | 28% |
Conclusion: I’ve to say: 12.8Mpps on a unidirectional 64b single-flow loadtest (thereby solely
with the ability to make use of 1 DPDK employee), and 20.96Mpps on a bidirectional 64b single-flow
loadtest, isn’t too shabby. However seeing as one CPU thread can do 12.8Mpps, I might think about that
three CPU threads would carry out at 38.4Mpps or there-abouts, however I’m seeing solely 23.9Mpps and a few
unexplained variance in per-thread efficiency.
Results
I learned a lot! Some highlights:
- The netmap implementation isn't playing ball for the moment, as forwarding consistently stops, in
both the bridge.c
tool as well as the VPP plugin.
- It's clear though that netmap is a fair bit faster (11.4Mpps) than kernel forwarding, which came in at
roughly 1.2Mpps per CPU thread. What's a bit troubling is that netmap doesn't seem to work
very well in VPP – traffic forwarding also stops here.
- DPDK performs quite well on FreeBSD: I manage to see a throughput of 20.96Mpps, which is almost
twice the throughput of netmap. That's cool, but I can't quite explain the stark variance
in throughput between the worker threads. Perhaps VPP is placing the workers on hyperthreads?
Perhaps an equivalent of isolcpus
in the Linux kernel would help? (See the sketch below.)
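FreeBSD has no direct isolcpus equivalent, but cpuset(1) might approximate it: one could shrink
the default cpuset (set 1), in which all other processes run, so that the worker cores are left
alone, and then pin VPP onto them. A sketch, assuming the workers sit on CPUs 1-3 as show threads
suggested, with <vpp-pid> standing in for the dataplane's process id:

[pim@france ~]$ sudo cpuset -l 0,4-7 -s 1
[pim@france ~]$ sudo cpuset -l 1-3 -p <vpp-pid>

I haven't tried this yet; it goes on the list of things to discuss with Tom.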
For the curious, I've bundled up a few files that describe the machine and its setup:
[dmesg]
[pciconf]
[loader.conf]
[VPP startup.conf]