The quantum state of a TCP port
Have you noticed how simple questions sometimes lead to complex answers? Today we will tackle one such question. Category: our favorite, Linux networking.
When can two TCP sockets share a local address?
If I navigate to https://blog.cloudflare.com/, my browser will connect to a remote TCP address, which might be 104.16.132.229:443 in this case, from the local IP address assigned to my Linux machine and a randomly chosen local TCP port, say 192.0.2.42:54321. What happens if I then decide to visit a different site? Is it possible to establish another TCP connection from the same local IP address and port?
To find the answer, let’s do a bit of learning by discovering. We have prepared eight quiz questions. Each will let you discover one aspect of the rules that govern local address sharing between TCP sockets under Linux. Fair warning, it might get a bit mind-boggling.
Questions are split into two groups by test scenario:
In the first test scenario, two sockets connect from the same local port to the same remote IP and port. However, the local IP is different for each socket.
In the second scenario, the local IP and port are the same for all sockets, but the remote address, or actually just the IP address, differs.
In our quiz questions, we will either:
- let the OS automatically select the local IP and/or port for the socket, or
- explicitly assign the local address with bind() before connect()’ing the socket, a method also known as bind-before-connect.
Because we will be examining corner cases in the bind() logic, we need a way to exhaust available local addresses, that is (IP, port) pairs. We could just create lots of sockets, but it will be easier to tweak the system configuration and pretend that there is just one ephemeral local port which the OS can assign to sockets:
sysctl -w net.ipv4.ip_local_port_range="60000 60000"
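Before running the snippets, it can be worth double-checking that the setting took effect. The following is a minimal sketch that assumes the sysctl is exposed at the usual procfs path; the helper names parse_port_range and read_port_range are made up for illustration.

```python
from pathlib import Path

# Assumed procfs location of the sysctl; the helper names are hypothetical.
PORT_RANGE_PATH = Path("/proc/sys/net/ipv4/ip_local_port_range")


def parse_port_range(text):
    """Parse the sysctl value, e.g. '60000\t60000', into (low, high)."""
    low, high = (int(field) for field in text.split())
    return low, high


def read_port_range(path=PORT_RANGE_PATH):
    """Read the ephemeral port range of the current network namespace."""
    return parse_port_range(path.read_text())
```

With the quiz setup applied, read_port_range() should report a range that contains a single port.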
Each quiz question is a short Python snippet. Your task is to predict the outcome of running the code. Does it succeed? Does it fail? If so, what fails? Asking ChatGPT is not allowed.
There is always a common setup procedure to keep in mind. We will omit it from the quiz snippets to keep them short:
from os import system
from socket import *
# Missing constants
IP_BIND_ADDRESS_NO_PORT = 24
# Our network namespace has just *one* ephemeral port
system("sysctl -w net.ipv4.ip_local_port_range='60000 60000'")
# Open a listening socket at *:1234. We will connect to it.
ln = socket(AF_INET, SOCK_STREAM)
ln.bind(("", 1234))
ln.listen(SOMAXCONN)
With the formalities out of the way, let us begin. Ready. Set. Go!
Scenario #1: When the local IP is unique, but the local port is the same
In Scenario #1 we connect two sockets to the same remote address, 127.9.9.9:1234. The sockets will use different local IP addresses, but is that enough to share the local port?
local IP | local port | remote IP | remote port |
---|---|---|---|
unique | same | same | same |
127.0.0.1, 127.1.1.1, 127.2.2.2 | 60_000 | 127.9.9.9 | 1234 |
Quiz #1
On the local side, we bind two sockets to distinct, explicitly specified IP addresses. We will allow the OS to select the local port. Remember: our local ephemeral port range contains just one port (60,000).
s1 = socket(AF_INET, SOCK_STREAM)
s1.bind(('127.1.1.1', 0))
s1.connect(('127.9.9.9', 1234))
s1.getsockname(), s1.getpeername()
s2 = socket(AF_INET, SOCK_STREAM)
s2.bind(('127.2.2.2', 0))
s2.connect(('127.9.9.9', 1234))
s2.getsockname(), s2.getpeername()
GOTO Answer #1
Quiz #2
Here, the setup is almost identical to before. However, we ask the OS to select the local IP address and port for the first socket. Do you think the result will differ from the previous question?
s1 = socket(AF_INET, SOCK_STREAM)
s1.connect(('127.9.9.9', 1234))
s1.getsockname(), s1.getpeername()
s2 = socket(AF_INET, SOCK_STREAM)
s2.bind(('127.2.2.2', 0))
s2.connect(('127.9.9.9', 1234))
s2.getsockname(), s2.getpeername()
GOTO Answer #2
Quiz #3
This quiz question is just like the one above. We just changed the ordering. First, we connect a socket from an explicitly specified local address. Then we ask the system to select a local address for us. Obviously, such an ordering change should not make any difference, right?
s1 = socket(AF_INET, SOCK_STREAM)
s1.bind(('127.1.1.1', 0))
s1.connect(('127.9.9.9', 1234))
s1.getsockname(), s1.getpeername()
s2 = socket(AF_INET, SOCK_STREAM)
s2.connect(('127.9.9.9', 1234))
s2.getsockname(), s2.getpeername()
GOTO Answer #3
Scenario #2: When the local IP and port are the same, but the remote IP differs
In Scenario #2 we reverse our setup. Instead of multiple local IPs and one remote address, we now have one local address, 127.0.0.1:60000, and two distinct remote addresses. The question remains the same: can two sockets share the local port? Reminder: the ephemeral port range is still of size one.
local IP | local port | remote IP | remote port |
---|---|---|---|
same | same | unique | same |
127.0.0.1 | 60_000 | 127.8.8.8, 127.9.9.9 | 1234 |
Quiz #4
Let’s start from the basics. We connect() to two distinct remote addresses. This is a warm-up.
s1 = socket(AF_INET, SOCK_STREAM)
s1.connect(('127.8.8.8', 1234))
s1.getsockname(), s1.getpeername()
s2 = socket(AF_INET, SOCK_STREAM)
s2.connect(('127.9.9.9', 1234))
s2.getsockname(), s2.getpeername()
GOTO Answer #4
Quiz #5
What if we bind() to a local IP explicitly but let the OS select the port? Does anything change?
s1 = socket(AF_INET, SOCK_STREAM)
s1.bind(('127.0.0.1', 0))
s1.connect(('127.8.8.8', 1234))
s1.getsockname(), s1.getpeername()
s2 = socket(AF_INET, SOCK_STREAM)
s2.bind(('127.0.0.1', 0))
s2.connect(('127.9.9.9', 1234))
s2.getsockname(), s2.getpeername()
GOTO Answer #5
Quiz #6
This time we explicitly specify the local address and port. Sometimes there is a need to specify the local port.
s1 = socket(AF_INET, SOCK_STREAM)
s1.bind(('127.0.0.1', 60_000))
s1.connect(('127.8.8.8', 1234))
s1.getsockname(), s1.getpeername()
s2 = socket(AF_INET, SOCK_STREAM)
s2.bind(('127.0.0.1', 60_000))
s2.connect(('127.9.9.9', 1234))
s2.getsockname(), s2.getpeername()
GOTO Answer #6
Quiz #7
Just when you thought it couldn’t get any weirder, we add SO_REUSEADDR into the mix.
First, we ask the OS to allocate a local address for us. Then we explicitly bind to the same local address, which we know the OS must have assigned to the first socket. We enable local address reuse for both sockets. Is this allowed?
s1 = socket(AF_INET, SOCK_STREAM)
s1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
s1.connect(('127.8.8.8', 1234))
s1.getsockname(), s1.getpeername()
s2 = socket(AF_INET, SOCK_STREAM)
s2.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
s2.bind(('127.0.0.1', 60_000))
s2.connect(('127.9.9.9', 1234))
s2.getsockname(), s2.getpeername()
GOTO Answer #7
Quiz #8
Finally, a cherry on top. This is Quiz #7 but in reverse. Common sense dictates that the outcome should be the same, but is it?
s1 = socket(AF_INET, SOCK_STREAM)
s1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
s1.bind(('127.0.0.1', 60_000))
s1.connect(('127.9.9.9', 1234))
s1.getsockname(), s1.getpeername()
s2 = socket(AF_INET, SOCK_STREAM)
s2.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
s2.connect(('127.8.8.8', 1234))
s2.getsockname(), s2.getpeername()
GOTO Answer #8
The secret tri-state life of a local TCP port
Is it all clear now? Well, probably not. It feels like reverse engineering a black box. So what is happening behind the scenes? Let’s take a look.
Linux tracks all TCP ports in use in a hash table named bhash. It is not to be confused with the ehash table, which tracks sockets with both local and remote address already assigned.
Each hash table entry points to a chain of so-called bind buckets, which group together sockets that share a local port. To be precise, sockets are grouped into buckets by:
- the network namespace they belong to, and
- the VRF device they are bound to, and
- the local port number they are bound to.
But in the simplest possible setup (single network namespace, no VRFs) we can say that sockets in a bind bucket are grouped by their local port number.
The set of sockets in each bind bucket, that is, the sockets sharing a local port, is backed by a linked list named owners.
When we ask the kernel to assign a local address to a socket, its task is to check for a conflict with any existing socket. That is because a local port number can be shared only under some conditions:
/* There are a few simple rules, which allow for local port reuse by
 * an application. In essence:
 *
 *   1) Sockets bound to different interfaces may share a local port.
 *      Failing that, goto test 2.
 *   2) If all sockets have sk->sk_reuse set, and none of them are in
 *      TCP_LISTEN state, the port may be shared.
 *      Failing that, goto test 3.
 *   3) If all sockets are bound to a specific inet_sk(sk)->rcv_saddr local
 *      address, and none of them are the same, the port may be
 *      shared.
 *      Failing this, the port cannot be shared.
 *
 * The interesting point, is test #2. This is what an FTP server does
 * all day. To optimize this case we use a specific flag bit defined
 * below. As we add sockets to a bind bucket list, we perform a
 * check of: (newsk->sk_reuse && (newsk->sk_state != TCP_LISTEN))
 * As long as all sockets added to a bind bucket pass this test,
 * the flag bit will be set.
 * ...
 */
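To make the three tests from the comment more tangible, here is a toy Python model. It is a sketch under simplifying assumptions, not the kernel's algorithm: the Sock class, its field names, and may_share_port are hypothetical, and details such as SO_REUSEPORT, IPv6, and socket ownership are ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sock:
    """Toy stand-in for a socket that owns a local port (not a kernel type)."""
    rcv_saddr: Optional[str] = None  # bound local IP, None means wildcard
    bound_dev: Optional[str] = None  # bound interface, None means any
    reuse: bool = False              # models sk->sk_reuse (SO_REUSEADDR)
    listening: bool = False          # models sk_state == TCP_LISTEN


def may_share_port(new, owners):
    """Apply tests 1-3 from the comment to `new` against every owner."""
    for old in owners:
        # Test 1: sockets bound to different interfaces may share the port.
        if new.bound_dev and old.bound_dev and new.bound_dev != old.bound_dev:
            continue
        # Test 2: both sockets have reuse set and neither is listening.
        if new.reuse and old.reuse and not (new.listening or old.listening):
            continue
        # Test 3: both are bound to specific, different local addresses.
        if new.rcv_saddr and old.rcv_saddr and new.rcv_saddr != old.rcv_saddr:
            continue
        return False  # all three tests failed for this owner
    return True
```

For instance, may_share_port(Sock("127.1.1.1"), [Sock("127.2.2.2")]) passes on test 3, while two sockets bound to the same IP only pass if both set reuse and neither is listening.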
The comment above hints that the kernel tries to optimize for the happy case of no conflict. To this end the bind bucket holds additional state which aggregates the properties of the sockets it holds:
struct inet_bind_bucket {
	/* ... */
	signed char		fastreuse;
	signed char		fastreuseport;
	kuid_t			fastuid;
#if IS_ENABLED(CONFIG_IPV6)
	struct in6_addr		fast_v6_rcv_saddr;
#endif
	__be32			fast_rcv_saddr;
	unsigned short		fast_sk_family;
	bool			fast_ipv6_only;
	/* ... */
};
Let’s focus our attention just on the first aggregate property, fastreuse. It has existed since the now prehistoric Linux 2.1.90pre1, initially in the form of a bit flag, as the comment says, only to evolve into a byte-sized field over time.
The other six fields came much later, with the introduction of SO_REUSEPORT in Linux 3.9. Because they play a role only when there are sockets with the SO_REUSEPORT flag set, we are going to ignore them today.
Whenever the Linux kernel needs to bind a socket to a local port, it first has to look up the bind bucket for that port. What makes life a bit more complicated is the fact that the search for a TCP bind bucket exists in two places in the kernel. The bind bucket lookup can happen early, at bind() time, or late, at connect() time. Which one gets called depends on how the connected socket has been set up.
However, whether we land in inet_csk_get_port or __inet_hash_connect, we always end up walking the bucket chain in the bhash looking for the bucket with a matching port number. The bucket might already exist or we might have to create it first. But once it exists, its fastreuse field is in one of three possible states: -1, 0, or +1. As if Linux developers were inspired by quantum mechanics.
That state reflects two aspects of the bind bucket:
- What sockets are in the bucket?
- When can the local port be shared?
So let us try to decipher the three possible fastreuse states, and what they mean in each case.
First, what does the fastreuse property say about the owners of the bucket, that is, the sockets using that local port?
fastreuse is | owners list contains |
---|---|
-1 | sockets connect()’ed from an ephemeral port |
0 | sockets bound without SO_REUSEADDR |
+1 | sockets bound with SO_REUSEADDR |
While this is not the whole truth, it is close enough for now. We will soon get to the bottom of it.
When it comes to port sharing, the situation is far less straightforward:
Can I … when … | fastreuse = -1 | fastreuse = 0 | fastreuse = +1 |
---|---|---|---|
bind() to the same port (ephemeral or specified) | yes IFF local IP is unique ① | ← idem | ← idem |
bind() to the specific port with SO_REUSEADDR | yes IFF local IP is unique OR conflicting socket uses SO_REUSEADDR ① | ← idem | yes ② |
connect() from the same ephemeral port to the same remote (IP, port) | yes IFF local IP unique ③ | no ③ | no ③ |
connect() from the same ephemeral port to a unique remote (IP, port) | yes ③ | no ③ | no ③ |
① Determined by inet_csk_bind_conflict() called from inet_csk_get_port() (specific port bind) or inet_csk_get_port() → inet_csk_find_open_port() (ephemeral port bind).
② Because inet_csk_get_port() skips conflict check for fastreuse == 1 buckets.
③ Because inet_hash_connect() → __inet_hash_connect() skips buckets with fastreuse != -1.
While it all looks rather complicated at first sight, we can distill the table above into a few statements that hold true, and are a bit easier to digest:
- bind(), or early local address allocation, always succeeds if there is no local IP address conflict with any existing socket,
- connect(), or late local address allocation, always fails when the TCP bind bucket for a local port is in any state other than fastreuse = -1,
- connect() only succeeds if there is no local and remote address conflict,
- the SO_REUSEADDR socket option allows local address sharing, if all conflicting sockets also use it (and none of them is in the listening state).
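The port-sharing table and the statements above can also be written down as a toy decision helper. This is only a sketch of the rules as stated in this post, not kernel code: the function names can_bind and can_connect and their parameters are invented for illustration, and listening sockets and SO_REUSEPORT are left out.

```python
def can_bind(local_ip_unique, new_reuseaddr=False, owners_reuseaddr=False,
             fastreuse=0):
    """First two table rows: early local address allocation with bind()."""
    if local_ip_unique:
        return True
    if new_reuseaddr:
        # A fastreuse == +1 bucket skips the conflict check; otherwise
        # the conflicting owners must use SO_REUSEADDR as well.
        return fastreuse == 1 or owners_reuseaddr
    return False


def can_connect(fastreuse, local_ip_unique, remote_unique):
    """Last two table rows: late local address allocation with connect()."""
    if fastreuse != -1:
        return False  # connect() skips such buckets entirely
    # Within a -1 bucket, either the local or the remote side must differ.
    return remote_unique or local_ip_unique
```

For example, can_connect(0, True, True) is False: once a bucket has left the -1 state, a unique local IP or a unique remote address no longer helps connect().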
This is crazy. I don’t believe you.
Fortunately, you don’t have to. With drgn, the programmable debugger, we can examine the bind bucket state on a live kernel:
#!/usr/bin/env drgn
"""
dump_bhash.py - List all TCP bind buckets in the current netns.
Script is not aware of VRF.
"""
import os
from drgn.helpers.linux.list import hlist_for_each, hlist_for_each_entry
from drgn.helpers.linux.net import get_net_ns_by_fd
from drgn.helpers.linux.pid import find_task
def dump_bind_bucket(head, net):
    for tb in hlist_for_each_entry("struct inet_bind_bucket", head, "node"):
        # Skip buckets not from this netns
        if tb.ib_net.net != net:
            continue
        port = tb.port.value_()
        fastreuse = tb.fastreuse.value_()
        owners_len = len(list(hlist_for_each(tb.owners)))
        print(
            "{:8d} {:{sign}9d} {:7d}".format(
                port,
                fastreuse,
                owners_len,
                sign="+" if fastreuse != 0 else " ",
            )
        )
def get_netns():
    pid = os.getpid()
    task = find_task(prog, pid)
    with open(f"/proc/{pid}/ns/net") as f:
        return get_net_ns_by_fd(task, f.fileno())
def main():
    print("{:8} {:9} {:7}".format("TCP-PORT", "FASTREUSE", "#OWNERS"))
    tcp_hashinfo = prog.object("tcp_hashinfo")
    net = get_netns()
    # Iterate over all bhash slots
    for i in range(0, tcp_hashinfo.bhash_size):
        head = tcp_hashinfo.bhash[i].chain
        # Iterate over bind buckets in the slot
        dump_bind_bucket(head, net)
main()
Let’s take this script for a spin and try to confirm what Table 1 claims to be true. Keep in mind that to produce the ipython --classic session snippets below I’ve used the same setup as for the quiz questions.
Two connected sockets sharing ephemeral port 60,000:
>>> s1 = socket(AF_INET, SOCK_STREAM)
>>> s1.connect(('127.1.1.1', 1234))
>>> s2 = socket(AF_INET, SOCK_STREAM)
>>> s2.connect(('127.2.2.2', 1234))
>>> !./dump_bhash.py
TCP-PORT FASTREUSE #OWNERS
1234 0 3
60000 -1 2
>>>
Two bound sockets reusing port 60,000:
>>> s1 = socket(AF_INET, SOCK_STREAM)
>>> s1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
>>> s1.bind(('127.1.1.1', 60_000))
>>> s2 = socket(AF_INET, SOCK_STREAM)
>>> s2.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
>>> s2.bind(('127.1.1.1', 60_000))
>>> !./dump_bhash.py
TCP-PORT FASTREUSE #OWNERS
1234 0 1
60000 +1 2
>>>
A mix of bound sockets with and without REUSEADDR sharing port 60,000:
>>> s1 = socket(AF_INET, SOCK_STREAM)
>>> s1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
>>> s1.bind(('127.1.1.1', 60_000))
>>> !./dump_bhash.py
TCP-PORT FASTREUSE #OWNERS
1234 0 1
60000 +1 1
>>> s2 = socket(AF_INET, SOCK_STREAM)
>>> s2.bind(('127.2.2.2', 60_000))
>>> !./dump_bhash.py
TCP-PORT FASTREUSE #OWNERS
1234 0 1
60000 0 2
>>>
With such tooling, proving that Table 2 holds true is just a matter of writing a bunch of exploratory tests.
But what has happened in that last snippet? The bind bucket has clearly transitioned from one fastreuse state to another. This is what Table 1 fails to capture. And it means that we still don’t have the full picture.
We have yet to find out when the bucket’s fastreuse state can change. This calls for a state machine.
Das State Machine
As we have just seen, a bind bucket does not need to stay in the initial fastreuse state throughout its lifetime. Adding sockets to the bucket can trigger a state change. As it turns out, it can only transition into fastreuse = 0, if we happen to bind() a socket that:
- doesn’t conflict with existing owners, and
- doesn’t have the SO_REUSEADDR option enabled.
And while we could have figured it all out by carefully reading the code in inet_csk_get_port → inet_csk_update_fastreuse, it certainly doesn’t hurt to confirm our understanding with a few more tests.
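The transitions described above are small enough to model in a few lines of Python. This is a simplified sketch, not the kernel logic: initial_state and next_state are hypothetical names, the socket being added is assumed not to conflict with existing owners, and listening sockets and SO_REUSEPORT are ignored.

```python
def initial_state(via_connect, reuseaddr=False):
    """fastreuse of a freshly created bucket, given its first owner."""
    if via_connect:
        return -1  # port allocated late, at connect() time
    return 1 if reuseaddr else 0  # port allocated early, at bind() time


def next_state(state, via_connect, reuseaddr=False):
    """fastreuse after one more non-conflicting socket joins the bucket."""
    if not via_connect and not reuseaddr:
        return 0  # a bind() without SO_REUSEADDR is the only trigger
    return state  # everything else leaves the state untouched
```

Replaying the last ipython session: initial_state(False, reuseaddr=True) gives +1, and next_state(1, False, reuseaddr=False) then gives 0, which matches the script output.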
Now that we have the full picture, this begs the question…
Why are you telling me all this?
Firstly, so that the next time the bind() syscall rejects your request with EADDRINUSE, or connect() refuses to cooperate by throwing the EADDRNOTAVAIL error, you will know what is happening, or at least have the tools to find out.
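In Python both failures surface as OSError, so telling them apart comes down to checking errno. A minimal sketch follows; the helper name explain_addr_error and the wording of the hints are ours, and the hints only cover the two cases discussed in this post.

```python
import errno


def explain_addr_error(exc):
    """Map the two errno values discussed here to a human-readable hint."""
    hints = {
        errno.EADDRINUSE:
            "bind(): the requested local address conflicts with an existing socket",
        errno.EADDRNOTAVAIL:
            "connect(): no usable local port could be found",
    }
    return hints.get(exc.errno, "unrelated error")
```

A typical use is to wrap the bind() or connect() call in try/except OSError and log explain_addr_error(exc) before re-raising.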
Secondly, because we have previously advertised a technique for opening connections from a specific range of ports which involves bind()’ing sockets with the SO_REUSEADDR option. What we did not realize back then is that there exists a corner case in which the same port can’t be shared with the regular, connect()’ed sockets. While that is not a deal-breaker, it is good to understand the consequences.
To make things better, we have worked with the Linux community to extend the kernel API with a new socket option that lets the user specify the local port range. The new option will be available in the upcoming Linux 6.3. With it we no longer have to resort to bind()-tricks. This makes it possible to yet again share a local port with regular connect()’ed sockets.
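For the curious, here is a rough sketch of how such an option could be used from Python. It rests on assumptions not stated in this post: that the option is IP_LOCAL_PORT_RANGE with the number 51 from the Linux UAPI headers, and that its value is a 32-bit integer carrying the lower bound in the low 16 bits and the upper bound in the high 16 bits. The helper names are made up.

```python
import socket
import struct

# Assumption: option number 51, as defined in the Linux UAPI headers.
IP_LOCAL_PORT_RANGE = getattr(socket, "IP_LOCAL_PORT_RANGE", 51)


def pack_port_range(low, high):
    """Encode a range as u32: low 16 bits = lower bound, high 16 = upper."""
    if not (0 < low <= high <= 0xFFFF):
        raise ValueError("expected 0 < low <= high <= 65535")
    # Packed as bytes because the value can exceed the signed int range.
    return struct.pack("=I", (high << 16) | low)


def set_local_port_range(sock, low, high):
    """Ask the kernel to pick this socket's local port from [low, high]."""
    sock.setsockopt(socket.IPPROTO_IP, IP_LOCAL_PORT_RANGE,
                    pack_port_range(low, high))
```

On a kernel without the option, set_local_port_range() is expected to fail with an OSError, so callers would need a fallback.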
Closing thoughts
Today we posed a relatively straightforward question (when can two TCP sockets share a local address?) and worked our way towards an answer. An answer that is too complex to compress into a single sentence. What is more, it’s not even the full answer. After all, we have decided to ignore the existence of the SO_REUSEPORT feature, and did not consider conflicts with TCP listening sockets.
If there is a simple takeaway, though, it is that bind()’ing a socket can have tricky consequences. When using bind() to select an egress IP address, it is best to combine it with the IP_BIND_ADDRESS_NO_PORT socket option, and leave the port assignment to the kernel. Otherwise we might unintentionally block local TCP ports from being reused.
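As a minimal sketch of that advice, the helper below selects the egress IP with bind() but defers port allocation to connect() time. The function name connect_from_ip is hypothetical, and the fallback value 24 is the constant from the quiz setup, used when the Python socket module does not export the option.

```python
import socket

# Not exported by every Python version; 24 is the value used in the setup code.
IP_BIND_ADDRESS_NO_PORT = getattr(socket, "IP_BIND_ADDRESS_NO_PORT", 24)


def connect_from_ip(local_ip, remote):
    """Connect to `remote` from `local_ip`, letting the kernel pick the port late."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Tell bind() to reserve only the IP; the port is chosen in connect().
        s.setsockopt(socket.IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
        s.bind((local_ip, 0))
        s.connect(remote)
    except OSError:
        s.close()
        raise
    return s
```

For example, connect_from_ip('127.0.0.1', ('127.9.9.9', 1234)) would connect to the listening socket from the quiz setup.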
It is too bad that the same advice does not apply to UDP, where IP_BIND_ADDRESS_NO_PORT does not really work today. But that is another story.
Until next time.
If you enjoy scratching your head while reading the Linux kernel source code, we are hiring.