
VPP on FreeBSD – Part 2

2024-02-18 19:59:58

FreeBSD

Ever since I first saw VPP – the Vector Packet Processor – I have been deeply impressed with its
performance and flexibility. For those of us who have used Cisco IOS/XR devices, like the classic
ASR (aggregation services router), VPP will look and feel quite familiar, as most of the approaches
are shared between the two. Over the years, folks have regularly asked me "What about BSD?", and to
my surprise, late last year I read an announcement from the FreeBSD Foundation
[ref] as they looked back
over 2023 and forward to 2024:

Porting the Vector Packet Processor to FreeBSD

Vector Packet Processing (VPP) is an open-source, high-performance user space networking stack
that provides fast packet processing suitable for software-defined networking and network function
virtualization applications. VPP aims to optimize packet processing through vectorized operations
and parallelism, making it well-suited for high-speed networking applications. In November of this
year, the Foundation began a contract with Tom Jones, a FreeBSD developer focusing on network
performance, to port VPP to FreeBSD. Under the contract, Tom will also allocate time for other
tasks such as testing FreeBSD on common virtualization platforms to improve the desktop
experience, improving hardware support on arm64 platforms, and adding support for low power idle
on Intel and arm64 hardware.

In my first [article], I wrote a sort of hello world
by installing FreeBSD 14.0-RELEASE on both a VM and a bare metal Supermicro, and showed that Tom's
VPP branch compiles, runs and pings. In this article, I'll take a look at some comparative
performance numbers.

Comparing implementations

FreeBSD has an extensive network stack, including regular kernel based functionality such as
routing, filtering and bridging, a faster netmap based datapath together with some userspace
utilities like a netmap bridge, and of course completely userspace based dataplanes, such as the
VPP project that I'm working on here. Last week, I learned that VPP has a netmap driver, and from
previous travels I'm already quite familiar with its DPDK based forwarding. I decide to do a
baseline loadtest for each of these on the Supermicro Xeon-D1518 that I installed last week. See the
[article] for details on the setup.

The loadtests will use a standard set of different configurations, using Cisco T-Rex's default
benchmark profile called bench.py:

  1. var2-1514b: Large Packets, multiple flows with modulating source and destination IPv4
    addresses, often called an 'iperf test', with packets of 1514 bytes.
  2. var2-imix: Mixed Packets, multiple flows, often called an 'imix test', which includes a
    bunch of 64b, 390b and 1514b packets.
  3. var2-64b: Small Packets, still multiple flows, 64 bytes, which allows for multiple receive
    queues and kernel or application threads.
  4. 64b: Small Packets, but now single flow, often called a 'linerate test', with a packet size
    of 64 bytes, limiting to one receive queue.

Each of these four loadtests can run either unidirectionally (port0 -> port1) or bidirectionally
(port0 <-> port1). This yields eight different loadtests, each taking about 8 minutes. I put the kettle
on and get underway.
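For reference, kicking off one of these tests from the T-Rex stateless console looks roughly like
the sketch below. The profile name and tunables come straight from bench.py; the multiplier and
port selection here are illustrative, not the exact values I used:

# Sketch only: commands issued from T-Rex's interactive console (./trex-console)
trex> start -f stl/bench.py -t vm=var2,size=64 -m 100% -p 0     # unidirectional, port0 -> port1
trex> start -f stl/bench.py -t vm=var2,size=64 -m 100% -p 0 1   # bidirectional, both ports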

FreeBSD 14: Kernel Bridge

The machine I'm testing has a quad-port Intel i350 (1Gbps copper, using the FreeBSD igb(4) driver),
a dual-port Intel X522 (10Gbps SFP+, using the ix(4) driver), and a dual-port Intel i710-XXV
(25Gbps SFP28, using the ixl(4) driver). I decide to live it up a little, and choose the 25G ports
for my loadtests today, even if I think this machine with its relatively low-end Xeon-D1518 CPU
will struggle a little bit at very high packet rates. No pain, no gain, amirite?

I take my fresh FreeBSD 14.0-RELEASE install, without any tinkering other than compiling a GENERIC
kernel that has support for the DPDK modules I'll need later. For my first loadtest, I create a
kernel based bridge as follows, simply tying the two 25G interfaces together:

[pim@france /usr/obj]$ uname -a
FreeBSD france 14.0-RELEASE FreeBSD 14.0-RELEASE #0: Sat Feb 10 22:18:51 CET 2024     root@france:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64

[pim@france ~]$ dmesg | grep ixl
ixl0: <Intel(R) Ethernet Controller XXV710 for 25GbE SFP28 - 2.3.3-k> mem 0xf8000000-0xf8ffffff,0xf9008000-0xf900ffff irq 16 at device 0.0 on pci7
ixl1: <Intel(R) Ethernet Controller XXV710 for 25GbE SFP28 - 2.3.3-k> mem 0xf7000000-0xf7ffffff,0xf9000000-0xf9007fff irq 16 at device 0.1 on pci7

[pim@france ~]$ sudo ifconfig bridge0 create
[pim@france ~]$ sudo ifconfig bridge0 addm ixl0 addm ixl1 up
[pim@france ~]$ sudo ifconfig ixl0 up
[pim@france ~]$ sudo ifconfig ixl1 up
[pim@france ~]$ ifconfig bridge0
bridge0: flags=1008843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1500
	options=0
	ether 58:9c:fc:10:6c:2e
	id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
	maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
	root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
	member: ixl1 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
	        ifmaxaddr 0 port 4 priority 128 path cost 800
	member: ixl0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
	        ifmaxaddr 0 port 3 priority 128 path cost 800
	groups: bridge
	nd6 options=9<PERFORMNUD,IFDISABLED>
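As an aside: this bridge only needs to live for the duration of the loadtest, but if I wanted it
to survive a reboot, the usual FreeBSD way would be a few lines in /etc/rc.conf, along these lines
(a sketch, not something I configured for this test):

# Hypothetical /etc/rc.conf snippet to make the bridge persistent across reboots
cloned_interfaces="bridge0"
ifconfig_bridge0="addm ixl0 addm ixl1 up"
ifconfig_ixl0="up"
ifconfig_ixl1="up"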

One thing that I quickly realize is that FreeBSD, when using hyperthreading, does have 8 threads
available, but only 4 of them participate in forwarding. When I put the machine under load, I see a
curious 399% spent in kernel while I see 402% in idle:

FreeBSD top

When I then do a single-flow unidirectional loadtest, the expected result is that only one CPU
participates (100% in kernel and 700% in idle), and if I perform a single-flow bidirectional
loadtest, my expectations are confirmed again, seeing two CPU threads do the work (200% in kernel
and 600% in idle).

While the math checks out, the performance is a little bit less impressive:

Type                Uni/BiDir       Packets/Sec  L2 Bits/Sec  Line Rate
vm=var2,size=1514   Unidirectional  2.02Mpps     24.77Gbps    99%
vm=var2,size=imix   Unidirectional  3.48Mpps     10.23Gbps    43%
vm=var2,size=64     Unidirectional  3.61Mpps     2.43Gbps     9.7%
size=64             Unidirectional  1.22Mpps     0.82Gbps     3.2%
vm=var2,size=1514   Bidirectional   3.77Mpps     46.31Gbps    93%
vm=var2,size=imix   Bidirectional   3.81Mpps     11.22Gbps    24%
vm=var2,size=64     Bidirectional   4.02Mpps     2.69Gbps     5.4%
size=64             Bidirectional   2.29Mpps     1.54Gbps     3.1%

Conclusion: FreeBSD's kernel on this Xeon-D1518 processor can handle about 1.2Mpps per CPU
thread, and I can use only four of them. FreeBSD is perfectly happy to forward big packets, and I can
easily reach 2x25Gbps, but once I start ramping up the packets/sec by lowering the packet size,
things deteriorate very quickly.

FreeBSD 14: netmap Bridge

Tom pointed out a tool in the source tree, called the netmap bridge, originally written by Luigi
Rizzo and Matteo Landi. FreeBSD ships the source code, but you can also take a look at their GitHub
repository [ref].

What’s netmap anyway? It’s a framework for very quick and environment friendly packet I/O for userspace
and kernel purchasers, and for Digital Machines. It runs on FreeBSD, Linux and a few variations of
Home windows. As an apart, my buddy Pavel from FastNetMon identified a blogpost from 2015 during which
Cloudflare of us described a solution to do DDoS mitigation on Linux utilizing visitors classification to
program the community playing cards to maneuver sure offensive visitors to a devoted {hardware} queue, and
service that queue from a netmap shopper. Should you’re curious (I actually was!), you would possibly take a
have a look at that cool write-up
[here].
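On FreeBSD 14, netmap is part of the GENERIC kernel, so there is nothing extra to load. If you
want to confirm it is available and peek at its tunables, something like this should do the trick
(a sketch; the exact set of sysctls may differ between versions):

# Quick sanity check that netmap is present in the running kernel
[pim@france ~]$ ls -l /dev/netmap
[pim@france ~]$ sysctl dev.netmap | head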

I compile the code and put it to work, and the man page tells me that I need to fiddle with the
interfaces a bit. They must be:

  • set to promiscuous, which makes sense as they have to receive ethernet frames sent to MAC
    addresses other than their own
  • have all hardware offloading turned off, notably -rxcsum -txcsum -tso4 -tso6 -lro
  • my user needs write permission to /dev/netmap to bind the interfaces from userspace.
[pim@france /usr/src/tools/tools/netmap]$ make
[pim@france /usr/src/tools/tools/netmap]$ cd /usr/obj/usr/src/amd64.amd64/tools/tools/netmap
[pim@france .../tools/netmap]$ sudo ifconfig ixl0 -rxcsum -txcsum -tso4 -tso6 -lro promisc
[pim@france .../tools/netmap]$ sudo ifconfig ixl1 -rxcsum -txcsum -tso4 -tso6 -lro promisc
[pim@france .../tools/netmap]$ sudo chmod 660 /dev/netmap
[pim@france .../tools/netmap]$ ./bridge -i netmap:ixl0 -i netmap:ixl1
065.804686 main [290] ------- zerocopy supported
065.804708 main [297] Wait 4 secs for link to come up...
075.810547 main [301] Ready to go, ixl0 0x0/4 <-> ixl1 0x0/4.

Warning

I start my first loadtest, which pretty much immediately fails. It's an interesting behavior pattern
which I haven't seen before. After staring at the problem, and reading the code of bridge.c, which is a
remarkably straightforward program, I restart the bridge utility, and traffic passes again but only
for a little while. Whoops!

I took a [screencast] in case any kind soul on freebsd-net
wants to take a closer look at this:

FreeBSD netmap Bridge

I start a bit of trial and error, in which I conclude that if I send a lot of traffic (like 10Mpps),
forwarding is fine; but if I send a little traffic (like 1kpps), at some point forwarding stops
altogether. So while it's not great, this does allow me to measure the total throughput simply by
sending lots of traffic, say 30Mpps, and seeing what amount comes out the other side.

Here I go, and I'm having fun:

Type                Uni/BiDir       Packets/Sec  L2 Bits/Sec  Line Rate
vm=var2,size=1514   Unidirectional  2.04Mpps     24.72Gbps    100%
vm=var2,size=imix   Unidirectional  8.16Mpps     23.76Gbps    100%
vm=var2,size=64     Unidirectional  10.83Mpps    5.55Gbps     29%
size=64             Unidirectional  11.42Mpps    5.83Gbps     31%
vm=var2,size=1514   Bidirectional   3.91Mpps     47.27Gbps    96%
vm=var2,size=imix   Bidirectional   11.31Mpps    32.74Gbps    77%
vm=var2,size=64     Bidirectional   11.39Mpps    5.83Gbps     15%
size=64             Bidirectional   11.57Mpps    5.93Gbps     16%

Conclusion: FreeBSD's netmap implementation is also bound by packets/sec, and in this
setup, the Xeon-D1518 machine is capable of forwarding roughly 11.2Mpps. What I find cool is that
single flow or multiple flows doesn't seem to matter that much; in fact the bidirectional 64b single
flow loadtest was the most favorable at 11.57Mpps, which is an order of magnitude better than using just
the kernel (which clocked in at 1.2Mpps).

FreeBSD 14: VPP with netmap

It’s good to have a baseline on this machine on how the FreeBSD kernel itself performs. However of
course this sequence is about Vector Packet Processing, so I now flip my consideration to the VPP department
that Tom shared with me. I wrote a bunch of particulars concerning the VM and naked steel set up in my
[first article] so I’ll simply go straight to the
configuration components:

DBGvpp# create netmap name ixl0
DBGvpp# create netmap name ixl1
DBGvpp# set int state netmap-ixl0 up
DBGvpp# set int state netmap-ixl1 up
DBGvpp# set int l2 xconnect netmap-ixl0 netmap-ixl1
DBGvpp# set int l2 xconnect netmap-ixl1 netmap-ixl0

DBGvpp# show int
    Name       Idx    State  MTU (L3/IP4/IP6/MPLS)   Counter      Count     
local0         0     down          0/0/0/0       
netmap-ixl0    1      up          9000/0/0/0     rx packets       25622
                                                 rx bytes       1537320
                                                 tx packets       25437
                                                 tx bytes       1526220
netmap-ixl1    2      up          9000/0/0/0     rx packets       25437
                                                 rx bytes       1526220
                                                 tx packets       25622
                                                 tx bytes       1537320

At this point I can pretty much rule out that the netmap bridge.c is the issue, because a
few seconds after introducing 10Kpps of traffic and seeing it successfully pass, the loadtester
receives no more packets, even though T-Rex is still sending them. However, a minute or so later
I can also see that the RX and TX counters continue to increase in the VPP dataplane:

DBGvpp# show int
    Name       Idx    State  MTU (L3/IP4/IP6/MPLS)   Counter      Count     
local0         0     down          0/0/0/0       
netmap-ixl0    1      up          9000/0/0/0     rx packets      515843
                                                 rx bytes      30950580
                                                 tx packets      515657
                                                 tx bytes      30939420
netmap-ixl1    2      up          9000/0/0/0     rx packets      515657
                                                 rx bytes      30939420
                                                 tx packets      515843
                                                 tx bytes      30950580

.. and I can see that every packet that VPP received is accounted for: interface ixl0 has received
515843 packets, and ixl1 claims to have transmitted exactly that number of packets. So I think
perhaps they're getting lost somewhere on egress between the kernel and the Intel i710-XXV network
card.

However, contrary to the previous case, I cannot sustain any reasonable amount of traffic, be it
1Kpps, 10Kpps or 10Mpps; the system quite consistently comes to a halt mere seconds after
introducing the load. Restarting VPP makes it forward traffic again for a few seconds, just to end
up in the same upset state. I don't learn much.

Conclusion: This setup with VPP using netmap doesn't yield results, for the moment. I have a
suspicion that whatever is causing the netmap bridge to stall in the previous test is likely also the
culprit for this test.

FreeBSD 14: VPP with DPDK

But not all is lost – I have one test left, and judging by what I learned last week when bringing up
the first test environment, this one is going to be a fair bit better. In my previous loadtests, the
network interfaces were on their regular kernel driver (ixl(4) in the case of the Intel i710-XXV
interfaces), but now I'm going to mix it up a little, and rebind these interfaces to a special DPDK
driver called nic_uio(4), which stands for Network Interface Card Userspace Input/Output:

[pim@france ~]$ cat << EOF | sudo tee -a /boot/loader.conf
nic_uio_load="YES"
hw.nic_uio.bdfs="6:0:0,6:0:1"
EOF
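The bus:device:function values in hw.nic_uio.bdfs are the PCI addresses of the two 25G ports. If
you don't know them off-hand, pciconf(8) will tell you; roughly like this (illustrative,
abbreviated output):

# Looking up the BDF values of the XXV710 ports (output abbreviated)
[pim@france ~]$ pciconf -l | grep ixl
ixl0@pci0:6:0:0:  class=0x020000 ... vendor=0x8086 ...
ixl1@pci0:6:0:1:  class=0x020000 ... vendor=0x8086 ...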

When I reboot, the network interfaces are gone from the output of ifconfig(8), which is good. I
start up VPP with a minimal config file [ref], which defines
three worker threads and starts DPDK with 3 RX queues and 4 TX queues. It's a common question why
there would be one more TX queue. The reason is that in VPP, there is one (1) main thread and
zero or more worker threads. If the main thread wants to send traffic (for example, in a plugin
like LLDP which sends periodic announcements), it is easiest for it to use a transmit queue
specific to that main thread. Any return traffic will be picked up by the DPDK process on the worker
threads (as main doesn't have one of those). That's why the general rule is num(TX) = num(RX)+1.
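The actual startup.conf is linked at the end of this article; the relevant parts would look
roughly like the following minimal sketch, assuming the 25G ports live at PCI 06:00.0 and 06:00.1
as configured above:

# Minimal sketch of the relevant startup.conf stanzas (the real file is linked below)
cpu {
  main-core 0
  corelist-workers 1-3
}
dpdk {
  dev 0000:06:00.0 { num-rx-queues 3 num-tx-queues 4 }
  dev 0000:06:00.1 { num-rx-queues 3 num-tx-queues 4 }
}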

[pim@france ~/src/vpp]$ export STARTUP_CONF=/home/pim/src/startup.conf
[pim@france ~/src/vpp]$ gmake run-release

vpp# set int l2 xconnect TwentyFiveGigabitEthernet6/0/0 TwentyFiveGigabitEthernet6/0/1
vpp# set int l2 xconnect TwentyFiveGigabitEthernet6/0/1 TwentyFiveGigabitEthernet6/0/0
vpp# set int state TwentyFiveGigabitEthernet6/0/0 up
vpp# set int state TwentyFiveGigabitEthernet6/0/1 up
vpp# show int
              Name               Idx    State  MTU (L3/IP4/IP6/MPLS)     Counter          Count     
TwentyFiveGigabitEthernet6/0/0    1      up          9000/0/0/0     rx packets           11615035382
                                                                    rx bytes           1785998048960
                                                                    tx packets             700076496
                                                                    tx bytes            161043604594
TwentyFiveGigabitEthernet6/0/1    2      up          9000/0/0/0     rx packets             700076542
                                                                    rx bytes            161043674054
                                                                    tx packets           11615035440
                                                                    tx bytes           1785998136540
local0                            0     down          0/0/0/0       

And with that, the dataplane springs to life and starts forwarding (lots of) packets. To my great
relief, sending either 1kpps or 1Mpps "just works". I can run my loadtests as usual, first with
1514 byte packets, then imix, then 64 byte packets, and finally single-flow 64 byte packets. And of
course, both unidirectionally and bidirectionally.

I take a look at the system load while the loadtests are running:

FreeBSD top

It’s totally anticipated that the VPP course of is spinning 300% +epsilon of CPU time. It’s because it
has began three employee threads, and these are execuing the DPDK Ballot Mode Driver which is
primarily a good loop that asks the community playing cards for work, and if there are any packets
arriving, executes on that work. As such, every employee thread is at all times burning 100% of its
assigned CPU.

That said, I can take a look at finer grained statistics in the dataplane itself:

vpp# show run
Thread 0 vpp_main (lcore 0)
Time .9, 10 sec internal node vector rate 0.00 loops/sec 297041.19
  vector rates in 0.0000e0, out 0.0000e0, drop 0.0000e0, punt 0.0000e0
             Name                 State         Calls   Vectors   Suspends   Clocks  Vectors/Call  
ip4-full-reassembly-expire-wal  any wait            0         0         18   2.39e3          0.00
ip6-full-reassembly-expire-wal  any wait            0         0         18   3.08e3          0.00
unix-cli-process-0               active             0         0          9   7.62e4          0.00
unix-epoll-input                 polling        13066         0          0   1.50e5          0.00
---------------
Thread 1 vpp_wk_0 (lcore 1)
Time .9, 10 sec internal node vector rate 12.38 loops/sec 1467742.01
  vector rates in 5.6294e6, out 5.6294e6, drop 0.0000e0, punt 0.0000e0
             Name                 State         Calls   Vectors   Suspends   Clocks  Vectors/Call  
TwentyFiveGigabitEthernet6/0/1   active        399663   5047800          0   2.20e1         12.63
TwentyFiveGigabitEthernet6/0/1   active        399663   5047800          0   9.54e1         12.63
dpdk-input                       polling      1531252   5047800          0   1.45e2          3.29
ethernet-input                   active        399663   5047800          0   3.97e1         12.63
l2-input                         active        399663   5047800          0   2.93e1         12.63
l2-output                        active        399663   5047800          0   2.53e1         12.63
unix-epoll-input                 polling         1494         0          0   3.09e2          0.00

(et cetera)

I showed only one worker thread's output, but there are actually three worker threads, and they are
all doing similar work, because they each pick up 33% of the traffic, assigned to the three RX
queues in the network card.
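A quick way to double check which worker polls which receive queue is VPP's rx-placement overview.
The output below is an illustration of what I would expect to see with three workers and three RX
queues per port, abbreviated:

vpp# show interface rx-placement
Thread 1 (vpp_wk_0):
  node dpdk-input:
    TwentyFiveGigabitEthernet6/0/0 queue 0 (polling)
    TwentyFiveGigabitEthernet6/0/1 queue 0 (polling)
(and likewise for threads 2 and 3 with queues 1 and 2)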

While the overall CPU load is 300%, here I can see a different picture. Thread 0 (the main thread)
is doing essentially ~nothing. It's polling a set of unix sockets in the node called
unix-epoll-input, but other than that, main doesn't have much on its plate. Thread 1 however is
a worker thread, and I can see that it's busy doing work:

  • dpdk-input: it's polling the NIC for work; it has been called 1.53M times, and in total it has
    handled just over 5.04M vectors (which are packets). So I can derive that each time the Poll
    Mode Driver offers work, on average there are 3.29 vectors (packets), and each packet is
    taking about 145 CPU clocks.
  • ethernet-input: The DPDK vectors are all ethernet frames coming from the loadtester. Seeing as
    I've cross connected all traffic from Tf6/0/0 to Tf6/0/1 and vice-versa, VPP knows that it
    should handle the packets in the L2 forwarding path.
  • l2-input is called with the (list of N) ethernet frames, which all get cross connected to the
    output interface, in this case Tf6/0/1.
  • l2-output prepares the ethernet frames for output on their egress interface.
  • TwentyFiveGigabitEthernet6/0/1-output (Note: the name is truncated) If this were
    L3 traffic, this would be the place where the destination MAC address is inserted into the
    ethernet frame, but since this is an L2 cross connect, the node simply passes the ethernet frames
    through to the final egress node in DPDK.
  • TwentyFiveGigabitEthernet6/0/1-tx (Note: the name is truncated) hands them to the DPDK
    driver for marshalling onto the wire.
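If you want to see this path for yourself, VPP can capture a handful of packets at dpdk-input and
show exactly which nodes they visited; something along these lines should work (output omitted
for brevity):

vpp# trace add dpdk-input 5
vpp# show trace
vpp# clear trace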

Halfway through, I see that there is an issue with the distribution of ingress traffic over the
three workers; maybe you can spot it too:

---------------
Thread 1 vpp_wk_0 (lcore 1)
Time 56.7, 10 sec internal node vector rate 38.59 loops/sec 106879.84
  vector rates in 7.2982e6, out 7.2982e6, drop 0.0000e0, punt 0.0000e0
             Name                 State       Calls     Vectors  Suspends   Clocks  Vectors/Call  
TwentyFiveGigabitEthernet6/0/0   active     6689553   206899956         0   1.34e1         30.93
TwentyFiveGigabitEthernet6/0/0   active     6689553   206899956         0   1.37e2         30.93
TwentyFiveGigabitEthernet6/0/1   active     6688572   206902836         0   1.45e1         30.93
TwentyFiveGigabitEthernet6/0/1   active     6688572   206902836         0   1.34e2         30.93
dpdk-input                       polling    7128012   413802792         0   8.77e1         58.05
ethernet-input                   active    13378125   413802792         0   2.77e1         30.93
l2-input                         active     6809002   413802792         0   1.81e1         60.77
l2-output                        active     6809002   413802792         0   1.68e1         60.77
unix-epoll-input                 polling       6954           0         0   6.61e2          0.00
---------------
Thread 2 vpp_wk_1 (lcore 2)
Time 56.7, 10 sec internal node vector rate 256.00 loops/sec 7702.68
  vector rates in 4.1188e6, out 4.1188e6, drop 0.0000e0, punt 0.0000e0
             Name                 State       Calls     Vectors  Suspends   Clocks  Vectors/Call  
TwentyFiveGigabitEthernet6/0/0   active      456112   116764672         0   1.27e1        256.00
TwentyFiveGigabitEthernet6/0/0   active      456112   116764672         0   2.64e2        256.00
TwentyFiveGigabitEthernet6/0/1   active      456112   116764672         0   1.39e1        256.00
TwentyFiveGigabitEthernet6/0/1   active      456112   116764672         0   2.74e2        256.00
dpdk-input                       polling     456112   233529344         0   1.41e2        512.00
ethernet-input                   active      912224   233529344         0   5.71e1        256.00
l2-input                         active      912224   233529344         0   3.66e1        256.00
l2-output                        active      912224   233529344         0   1.70e1        256.00
unix-epoll-input                 polling        445           0         0   9.59e2          0.00
---------------
Thread 3 vpp_wk_2 (lcore 3)
Time 56.7, 10 sec internal node vector rate 256.00 loops/sec 7742.43
  vector rates in 4.1188e6, out 4.1188e6, drop 0.0000e0, punt 0.0000e0
             Name                 State       Calls     Vectors  Suspends  Clocks  Vectors/Call  
TwentyFiveGigabitEthernet6/0/0   active      456113   116764928         0  8.94e0        256.00
TwentyFiveGigabitEthernet6/0/0   active      456113   116764928         0  2.81e2        256.00
TwentyFiveGigabitEthernet6/0/1   active      456113   116764928         0  9.54e0        256.00
TwentyFiveGigabitEthernet6/0/1   active      456113   116764928         0  2.72e2        256.00
dpdk-input                       polling     456113   233529856         0  1.61e2        512.00
ethernet-input                   active      912226   233529856         0  4.50e1        256.00
l2-input                         active      912226   233529856         0  2.93e1        256.00
l2-output                        active      912226   233529856         0  1.23e1        256.00
unix-epoll-input                 polling        445           0         0  1.03e3          0.00

Thread 1 (vpp_wk_0) is handling 7.29Mpps and is moderately loaded, while threads 2 and 3 are handling
4.11Mpps each and are completely pegged. That said, the relative number of CPU clocks they are
spending per packet is reasonably similar, but the numbers don't quite add up:

  • Thread 1 is doing 7.29Mpps and is spending on average 449 CPU cycles per packet. I get this
    number by adding up all the values in the Clocks column, except for the unix-epoll-input
    node (the sum is written out below this list). But that's somewhat strange, because this Xeon
    D1518 clocks at 2.2GHz – and yet 7.29M * 449 is 3.27GHz. My experience (on Linux) is that these
    numbers usually line up quite well.
  • Thread 2 is doing 4.12Mpps and is spending on average 816 CPU cycles per packet. This sort of
    makes sense, as the cycles/packet is roughly double that of thread 1, and the packets/sec is
    roughly half ... and the total of 4.12M * 816 is 3.36GHz.
  • I see similar values for thread 3: 4.12Mpps and also 819 CPU cycles per packet, which
    amounts to VPP self-reporting using 3.37GHz worth of cycles on this thread.
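To make that first sum explicit, these are the Clocks values for thread 1 from the show run output
above, excluding unix-epoll-input:

13.4 + 137 + 14.5 + 134 + 87.7 + 27.7 + 18.1 + 16.8 ≈ 449 clocks per packet
449 clocks/packet * 7.29 Mpps ≈ 3.27e9 clocks/sec, on a CPU with a base clock of 2.2 GHz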

When I look at the thread to CPU placement, I get another surprise:

vpp# show threads 
ID     Name                Type        LWP     Sched Policy (Priority)  lcore  Core   Socket State
0      vpp_main                        100346  (nil) (n/a)              0      42949674294967
1      vpp_wk_0            workers     100473  (nil) (n/a)              1      42949674294967
2      vpp_wk_1            workers     100474  (nil) (n/a)              2      42949674294967
3      vpp_wk_2            workers     100475  (nil) (n/a)              3      42949674294967

vpp# show cpu 
Model name:               Intel(R) Xeon(R) CPU D-1518 @ 2.20GHz
Microarch model (family): [0x6] Broadwell ([0x56] Broadwell DE) stepping 0x3
Flags:                    sse3 pclmulqdq ssse3 sse41 sse42 avx rdrand avx2 bmi2 rtm pqm pqe
                          rdseed aes invariant_tsc 
Base frequency:           2.19 GHz

The numbers in show threads are all messed up, and I don't quite know what to make of it yet. I
guess the perhaps overly Linux-specific implementation of the thread pool management is throwing off
FreeBSD a bit. Perhaps some profiling could be useful, so I make a note to discuss this with Tom or
the freebsd-net mailing list, who will know a fair bit more about this kind of stuff on FreeBSD than
I do.

Anyway, functionally: this works. Performance wise: I have some questions 🙂 I let all eight
loadtests complete and without further ado, here are the results:

Type                Uni/BiDir       Packets/Sec  L2 Bits/Sec  Line Rate
vm=var2,size=1514   Unidirectional  2.01Mpps     24.45Gbps    99%
vm=var2,size=imix   Unidirectional  8.07Mpps     23.42Gbps    99%
vm=var2,size=64     Unidirectional  23.93Mpps    12.25Gbps    64%
size=64             Unidirectional  12.80Mpps    6.56Gbps     34%
vm=var2,size=1514   Bidirectional   3.91Mpps     47.35Gbps    86%
vm=var2,size=imix   Bidirectional   13.38Mpps    38.81Gbps    82%
vm=var2,size=64     Bidirectional   15.56Mpps    7.97Gbps     21%
size=64             Bidirectional   20.96Mpps    10.73Gbps    28%

Conclusion: I’ve to say: 12.8Mpps on a unidirectional 64b single-flow loadtest (thereby solely
with the ability to make use of 1 DPDK employee), and 20.96Mpps on a bidirectional 64b single-flow
loadtest, isn’t too shabby. However seeing as one CPU thread can do 12.8Mpps, I might think about that
three CPU threads would carry out at 38.4Mpps or there-abouts, however I’m seeing solely 23.9Mpps and a few
unexplained variance in per-thread efficiency.

Results

I learned a lot! Some highlights:

  1. The netmap implementation isn't playing ball for the moment, as forwarding stops consistently, in
    both the bridge.c tool as well as the VPP plugin.
  2. It is clear though that netmap is a fair bit faster (11.4Mpps) than kernel forwarding, which came in at
    roughly 1.2Mpps per CPU thread. What's a bit troubling is that netmap doesn't seem to work
    very well in VPP – traffic forwarding also stops here.
  3. DPDK performs quite well on FreeBSD; I manage to see a throughput of 20.96Mpps, which is almost
    twice the throughput of netmap, which is cool, but I can't quite explain the stark variance
    in throughput between the worker threads. Perhaps VPP is placing the workers on hyperthreads?
    Perhaps an equivalent of isolcpus in the Linux kernel would help?

For the curious, I've bundled up a few files that describe the machine and its setup:
[dmesg] [pciconf] [loader.conf] [VPP startup.conf]
