
Sockets in the Linux Kernel - Part 2: UDP Socket Lookup on Rx

In this article series I'd like to explore the implementation of sockets in the Linux Kernel and the source code surrounding them. While most of my previous articles focused primarily on OSI Layer 3, this series attempts a dive into OSI Layer 4. Due to the sheer complexity, it is very easy for readers of the kernel source code to not see the forest for the trees. It is my intent to throw a lifeline here, to help you navigate the code and hold on to what is essential.

Articles of the series

Sockets in the Linux Kernel - Part 1: L4 Protocol Demultiplexing on Rx
Sockets in the Linux Kernel - Part 2: UDP Socket Lookup on Rx (this article)
Sockets in the Linux Kernel - Part 31)

Overview

In this 2nd article of the series, I'd like to dive into the socket lookup for UDP packets, which determines the socket on which an incoming UDP packet will actually be received. While separate receive paths exist for IPv4 and for IPv6 packets, the socket lookup still has a lot of joint components which are used on both receive paths, and the overall way the lookup works is nearly identical in both cases. I'll walk through the general packet flow, the involved components like the socket implementation and the UDP socket lookup table, and finally the socket lookup itself. While the implementation of the socket lookup for the TCP protocol has many similarities and joint components compared to UDP, I intentionally cover only UDP here; otherwise, this article would simply become too long. I'll try to cover the TCP socket lookup in a separate future article.

Rx Packet Flow

Figure 1 visualizes the receive path of unicast UDP packets in the Linux kernel, encapsulated either in IPv4 or IPv6, from initial packet reception in the NIC driver until being added to the queue of a receiving socket, with special focus on the socket lookup. Please compare this figure to its counterpart in my previous article Routing Decisions in the Linux Kernel - Part 1: Lookup and packet flow, in which I focus on what happens to received network packets on L3.

Figure 1: Packet flow on the receive path from L2 → L3 → L4 for locally received UDP packets with focus on the socket lookup.

In this example, the routing decision determines that the received IPv4/IPv6 packet is to be locally received and not forwarded. Based on that decision, an indirect function call to the local receive handler is performed, which calls ip_local_deliver() in case of IPv4 and ip6_input() in case of IPv6. In both cases, the packet now traverses the Netfilter Input hook2) and is then demultiplexed based on the L4 protocol it carries. I described this in the 1st article of this series Sockets in the Linux Kernel - Part 1: L4 Protocol Demultiplexing on Rx. In this case it is determined to be a UDP packet, so function udp_rcv() is called in case of IPv4 and udp6_rcv() in case of IPv6. Both functions perform some initial checks (UDP header, checksum, …) and then call function __udp4_lib_lookup() (IPv4) or function __udp6_lib_lookup() (IPv6) respectively, which represents the actual socket lookup. Both implementations use a central hash table as back-end, which I'll describe in detail in one of the next sections. If a matching socket is found, the packet is added to the receive queue of that socket. Otherwise, an ICMP/ICMPv6 destination unreachable (port unreachable) error message is sent back to the sender of the packet.
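You can observe that last detail from plain user space. The following little C program is a minimal sketch of my own (not kernel code; port 47999 is an arbitrary choice assumed to be unused): it sends a datagram from a connected UDP socket to a local port where no socket is bound, so the socket lookup fails, the kernel answers with ICMP port unreachable, and the next recv() on the connected socket surfaces this as ECONNREFUSED.

#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
    struct sockaddr_in dst = { .sin_family = AF_INET,
                               .sin_port   = htons(47999) };
    inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);

    connect(fd, (struct sockaddr *)&dst, sizeof(dst)); /* connected UDP socket */
    send(fd, "ping", 4, 0);   /* Rx lookup on loopback finds no match -> ICMP */

    char buf[16];
    if (recv(fd, buf, sizeof(buf), 0) < 0)
        perror("recv");       /* expected output: recv: Connection refused */

    close(fd);
    return 0;
}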

Socket structs

Before getting to the actual socket lookup, I'd like to give you a tour of the building blocks which are involved here, first of all the sockets themselves. In the kernel those are represented by a complex object-oriented hierarchy of structs. The implementation is a deep rabbit hole and I will only go in as deep as necessary here.

Figure 2: Simplified example instance of an IPv4 UDP socket, created and initialized by syscalls socket(2), bind(2) and connect(2) (the actual structs contain by far more member variables).

Figure 2 shows the instances of structs which are allocated and initialized in the kernel when you use the socket(2) syscall to create a UDP socket in the domain of IPv4, then use the bind(2) syscall to bind this socket to a local IP address and port, and finally use the connect(2) syscall to connect this socket to a peer IP address and port. You should be familiar with those syscalls, as they represent the basic tools of socket programming. All involved kernel socket structs have far more member variables than shown in Figure 2; I only show the ones I consider relevant for the topic at hand. So, let's walk through this example: The socket(2) syscall, represented by function __sys_socket() in the kernel, allocates an instance of struct socket and an instance of struct udp_sock, which both hold pointers to each other. The latter includes a hierarchy of sub-structs struct inet_sock, struct sock and struct sock_common, as shown in Figure 2. The bind(2) syscall, represented by function inet_bind() in the kernel (in case of IPv4), here binds the socket to 127.0.0.1:8080 and saves this IP address in member variable skc_rcv_saddr and the port 8080 in member variable skc_num. It calculates and saves hashes and adds your socket to two different hash tables (more on that in the section below). This results in the bind port 8080 additionally being saved as a hash in member variable skc_u16hashes[0]. That detail is relevant, as that member variable is used in the socket lookup. The connect(2) syscall, represented by function inet_dgram_connect() in the kernel (in case of UDP), here connects the socket to peer 127.0.0.2:12345 and saves this peer IP address in member skc_daddr and the peer port in skc_dport.
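If you like to recreate exactly the socket shown in Figure 2 on your own system, the following self-contained user-space program (a sketch of my own, error handling kept to a minimum) issues precisely those three syscalls with the addresses and ports used above:

#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    /* socket(2) -> __sys_socket() in the kernel */
    int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

    /* bind(2) to 127.0.0.1:8080 -> inet_bind() in the kernel */
    struct sockaddr_in local = { .sin_family = AF_INET,
                                 .sin_port   = htons(8080) };
    inet_pton(AF_INET, "127.0.0.1", &local.sin_addr);
    if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0)
        perror("bind");

    /* connect(2) to 127.0.0.2:12345 -> inet_dgram_connect() in the kernel */
    struct sockaddr_in peer = { .sin_family = AF_INET,
                                .sin_port   = htons(12345) };
    inet_pton(AF_INET, "127.0.0.2", &peer.sin_addr);
    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0)
        perror("connect");

    pause();  /* keep the socket alive so it can be inspected, e.g. with ss(8) */
    return 0;
}

While it runs, ss -uapn should list the socket with local address 127.0.0.1:8080 and peer address 127.0.0.2:12345.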

Please be aware that those member variables are frequently accessed under several different alias names instead of their actual name in different parts of the code. Most of the structs in this hierarchy have a bunch of preprocessor defines which serve as alias names or “shortcuts” to access member variables of the structs lower in the hierarchy. See e.g. this list of defines which is part of struct sock. In this article I'll always use the actual name and not one of these alias names when referring to a member variable.
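To give you an idea, here is a small, abridged selection of those defines (from struct sock in include/net/sock.h, kernel v6.12), covering exactly the member variables used in this article:

#define sk_family        __sk_common.skc_family
#define sk_num           __sk_common.skc_num
#define sk_dport         __sk_common.skc_dport
#define sk_daddr         __sk_common.skc_daddr
#define sk_rcv_saddr     __sk_common.skc_rcv_saddr
#define sk_v6_daddr      __sk_common.skc_v6_daddr
#define sk_v6_rcv_saddr  __sk_common.skc_v6_rcv_saddr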

Figure 3: Simplified example instance of an IPv6 UDP socket, created and initialized by syscalls socket(2), bind(2) and connect(2) (the actual structs contain by far more member variables).

Figure 3 shows the very same example as Figure 2, this time for a UDP socket in the domain of IPv6. It shows you the instances of structs which are allocated and initialized in the kernel when you use the socket(2), bind(2) and connect(2) syscalls in the same way as described above, just this time with IPv6 addresses. Let's walk through this IPv6 variant of the example: The socket(2) syscall allocates an instance of struct socket and an instance of struct udp6_sock, which both hold pointers to each other. The latter includes nearly the same hierarchy of sub-structs as in the IPv4 UDP example: struct udp6_sock merely acts as a wrapper around the original hierarchy consisting of struct udp_sock, struct inet_sock, struct sock and struct sock_common, and adds an additional struct ipv6_pinfo. The bind(2) syscall, represented by function inet6_bind() in the kernel (in case of IPv6), here binds the socket to [2001:db8::1]:8080 and saves this IP address in member variable skc_v6_rcv_saddr. It saves the port 8080 in member skc_num and in skc_u16hashes[0] and adds the socket to the two already mentioned hash tables. The connect(2) syscall here connects the socket to peer [2001:db8::2]:12345 and saves this peer IP address in member skc_v6_daddr and the peer port in member skc_dport.

UDP hash tables

Figure 4: Global instance of struct udp_table, holding hash tables *hash and *hash2.

The UDP socket lookup table is implemented as a global instance of struct udp_table, whose members are allocated and initialized in function udp_table_init(), which is called by function udp_init() during early kernel boot; see Figure 4. It holds two separate hash tables in its members *hash and *hash2. The entries (“buckets”) of those hash tables are represented by instances of struct udp_hslot (that struct has a memory footprint/alignment of 16 bytes). Both hash tables possess the same number of entries, which can vary between 2^8 and 2^16 and is determined dynamically during allocation, depending on the available memory of the system. On one of my systems I observed them to possess 2^14 = 16384 entries each. The allocation size and number of entries can easily be observed, because the kernel outputs a log message during early boot which appears in your dmesg and journald logs:

$ journalctl -b 0 -k -g 'UDP hash table entries'
UDP hash table entries: 16384 (order: 7, 524288 bytes, linear)
# meaning:
# 16384 buckets in *hash,  16 bytes each
# 16384 buckets in *hash2, 16 bytes each
# 16384 * 32 bytes = 524288 bytes (both tables combined)
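For reference, this is what the two structs look like in the kernel source (slightly abridged from include/net/udp.h, kernel v6.12); the alignment attribute is what gives struct udp_hslot its 16-byte footprint on 64-bit systems:

struct udp_hslot {
    struct hlist_head  head;   /* head of the bucket's list of sockets */
    int                count;  /* number of sockets in this bucket */
    spinlock_t         lock;
} __attribute__((aligned(2 * sizeof(long))));

struct udp_table {
    struct udp_hslot  *hash;   /* lookup by netns + port */
    struct udp_hslot  *hash2;  /* lookup by netns + address + port */
    unsigned int       mask;   /* number of buckets - 1 */
    unsigned int       log;    /* log2(number of buckets) */
};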

Further, the (read-only) sysctl udp_hash_entries can also show you that same number of entries3):

$ sudo sysctl net.ipv4.udp_hash_entries
net.ipv4.udp_hash_entries = 16384

As shown in Figure 5 and mentioned in the sections above, the bind(2) syscall adds your socket to both of these hash tables. Thus, these two tables in fact hold all UDP sockets on the system which are bound to a network address and port. This includes IPv4 and IPv6 sockets, connected and unconnected sockets, and sockets of all network namespaces.

Figure 5: Sockets being added to both hash tables by bind(2); hash table lookup in *hash based on netns + port and in *hash2 based on netns + address + port.

Each table has a slightly different purpose. The udp_table.hash table is used to look up a socket based on the local port number it binds to. The udp_table.hash2 table is used to look up a socket based on the local address and port number it binds to. It is this second table which is used by the socket lookup for received UDP packets (more on that in the next section). Both tables implement the usual hash table patterns, which means a socket lookup in one of them consists of two steps: First, a hash is calculated, in case of udp_table.hash based on the network namespace4) + a port number and in case of udp_table.hash2 based on the network namespace + an IP address + a port number. The calculated hash is used as array index to determine the correct “bucket” in which the socket you are searching for resides. Each bucket contains a doubly linked list of socket instances. A second step then loops through the socket instances in that bucket and compares socket member variables like the bind address and port to find the correct match. The socket structs contain the usual connectors (*next and *prev pointers) for doubly linked lists, by which means the bind(2) syscall adds them to the doubly linked list of their bucket; see again Figures 2 and 3 where I mention those. The 2nd step of the lookup uses the common container_of() pointer magic to obtain a pointer to the actual socket struct from the connector pointer. In case you are not yet familiar with those hash table patterns, I recommend taking a look at my article Connection tracking (conntrack) - Part 2: Core Implementation, where I explain the hash table of the connection tracking system in more detail, which essentially implements the same pattern.
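To make this two-step pattern and the container_of() magic tangible, here is a self-contained user-space toy version (entirely a sketch of my own; all names are invented and the hash function is deliberately trivial) which mirrors the principle: hash to a bucket, walk the embedded doubly linked list, recover the surrounding struct via container_of():

#include <stdio.h>
#include <stddef.h>

/* same pointer magic the kernel uses: from a pointer to the embedded
 * list node, compute a pointer to the surrounding struct */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct list_node { struct list_node *next, *prev; };

struct toy_sock {
    unsigned short   num;      /* bound port, cf. skc_num */
    struct list_node node;     /* connector embedded in the struct */
};

#define NBUCKETS 8
static struct list_node buckets[NBUCKETS];

static unsigned int hash_port(unsigned short port)
{
    return port % NBUCKETS;    /* step 1: hash determines the bucket */
}

static void toy_bind(struct toy_sock *sk)   /* what bind(2) does, in spirit */
{
    struct list_node *head = &buckets[hash_port(sk->num)];
    sk->node.next = head->next;
    sk->node.prev = head;
    head->next->prev = &sk->node;
    head->next = &sk->node;
}

static struct toy_sock *toy_lookup(unsigned short port)
{
    struct list_node *head = &buckets[hash_port(port)];
    for (struct list_node *n = head->next; n != head; n = n->next) {
        struct toy_sock *sk = container_of(n, struct toy_sock, node);
        if (sk->num == port)   /* step 2: compare member variables */
            return sk;
    }
    return NULL;
}

int main(void)
{
    for (int i = 0; i < NBUCKETS; i++)
        buckets[i].next = buckets[i].prev = &buckets[i];

    struct toy_sock s = { .num = 8080 };
    toy_bind(&s);
    printf("toy_lookup(8080) = %p (socket s is at %p)\n",
           (void *)toy_lookup(8080), (void *)&s);
    printf("toy_lookup(9090) = %p (no match)\n", (void *)toy_lookup(9090));
    return 0;
}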

Option: Network namespace with individual instance of struct udp_table

As mentioned above, by default only one global instance of struct udp_table exists and is used in all network namespaces. Each network namespace still keeps an individual pointer to that global instance in net->ipv4.udp_table. This pointer is initialized in udp_set_table(), which is called at creation time of each network namespace. The socket lookup actually uses this pointer to access udp_table. However, a feature has been added with kernel 6.2 which optionally enables network namespaces to allocate their own individual instance of struct udp_table. To activate this, you need to set sysctl net.ipv4.udp_child_hash_entries to a value n between 7 and 16 and then create a child network namespace. That child network namespace will then allocate its own instance of udp_table with 2^n entries/buckets; see again function udp_set_table(). You can confirm this by reading sysctl net.ipv4.udp_hash_entries inside the child network namespace.

By default this feature is switched off and the involved sysctls will look like this5):

net.ipv4.udp_child_hash_entries = 0
net.ipv4.udp_hash_entries =  16384  # in main network namespace
net.ipv4.udp_hash_entries = -16384  # in other network namespaces

If switched on with n=8, the involved sysctls will look like this:

net.ipv4.udp_child_hash_entries = 8
net.ipv4.udp_hash_entries = 16384   # in main network namespace
net.ipv4.udp_hash_entries = 256     # in child network namespace

Socket lookup IPv4

Now we are finally getting into the meat of things. The actual socket lookup for locally received IPv4 UDP packets is implemented in function __udp4_lib_lookup() and visualized in Figure 6.

Figure 6: Socket lookup for IPv4 UDP packets in detail.

It mainly consists of two successive lookups into udp_table.hash2[]. The first lookup calculates a hash based on the network namespace in which the packet has been received and the destination IP address and destination UDP port of the packet. Thereby the correct bucket inside the hash table is determined. Then function udp4_lib_lookup2() is called, which loops through all sockets in the doubly linked list of that bucket to find the one socket which matches best. It calls compute_score() for each of those sockets which, as the name suggests, computes a “score” for each one and lets the highest score win. To reach the minimum score which is considered a match, a socket must at least match the received network packet regarding its network namespace and the IPv4 address and UDP port it binds to. For this comparison, the socket struct member variables skc_net, skc_rcv_saddr and skc_u16hashes[0] are used; see once again Figure 2. If it is a connected socket and the IPv4 address and UDP port of the peer it is connected to match the source IP address and source port of the packet, then it achieves a higher score. Here the socket struct members skc_daddr and skc_dport are compared6). There are further socket member variables which are checked here and can make a socket achieve an even higher score, but those are corner cases, like e.g. a socket which binds to a specific network device7), and I won't go into that here. Thereby the best matching socket is determined, and if it actually is a connected socket, the whole lookup ends here, returning that successful match. If a matching socket has been found which is not connected, and thereby only matches the destination IP address and destination port of the packet, then this is also considered a successful match and the lookup ends here, too. However, as you can see in Figure 6, an optional eBPF lookup happens in between those checks (see info box below for details on that).
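The following self-contained user-space snippet is a simplified re-narration of my own (not the kernel's actual compute_score(); names and score values are merely illustrative) which captures the essence of that scoring logic:

#include <stdio.h>

struct toy_sk {
    unsigned int   rcv_saddr;  /* cf. skc_rcv_saddr */
    unsigned int   daddr;      /* cf. skc_daddr, 0 = not connected */
    unsigned short num;        /* cf. skc_num */
    unsigned short dport;      /* cf. skc_dport, 0 = not connected */
};

/* returns -1 for "no match", otherwise a score; highest score wins */
static int toy_score(const struct toy_sk *sk,
                     unsigned int pkt_dst, unsigned short pkt_dport,
                     unsigned int pkt_src, unsigned short pkt_sport)
{
    int score = 2;

    /* minimum requirement: bind address and port must match */
    if (sk->rcv_saddr != pkt_dst || sk->num != pkt_dport)
        return -1;

    if (sk->daddr) {             /* connected socket... */
        if (sk->daddr != pkt_src || sk->dport != pkt_sport)
            return -1;           /* ...connected to another peer: no match */
        score += 4;              /* exact 4-tuple match scores higher */
    }
    return score;
}

int main(void)
{
    struct toy_sk unconnected = { .rcv_saddr = 0x7f000001, .num = 8080 };
    struct toy_sk connected   = { .rcv_saddr = 0x7f000001, .num = 8080,
                                  .daddr = 0x7f000002, .dport = 12345 };

    /* packet 127.0.0.2:12345 -> 127.0.0.1:8080 */
    printf("unconnected: %d\n",
           toy_score(&unconnected, 0x7f000001, 8080, 0x7f000002, 12345)); /* 2 */
    printf("connected:   %d\n",
           toy_score(&connected,   0x7f000001, 8080, 0x7f000002, 12345)); /* 6 */
    return 0;
}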

If all that didn't produce a match, a 2nd lookup is done into udp_table.hash2[]. This time the hash is calculated based on the network namespace in which the packet has been received, the destination UDP port of the packet and the “any” IP address 0.0.0.0, thereby searching for a matching socket which binds to the “any” address (= to all local addresses). The hash determines the bucket and once again function udp4_lib_lookup2() is called, to loop through the sockets of that bucket to find the best (if any) match. For sockets which bind to the any address, their member variable skc_rcv_saddr is obviously set to 0.0.0.0. Function udp4_lib_lookup2() here does not compare this member to the actual destination IP address of the network packet, but instead checks whether it is indeed 0.0.0.0.

Some additional remarks on how to interpret what happens here:

Based on the sequence of steps described here, you might at first glance get the impression that a connected socket here always wins8) against an unconnected socket and that a socket which binds to a specific local IP address always wins against a socket which binds to the any address. Neither is actually the case here. Don't confuse this with TCP. While with TCP an established socket (= a socket which is already connected to a peer) which binds to local address A and port B can coexist with another (“listening”) socket which also binds to local address A and port B, the socket API doesn't allow this for UDP9). If both cannot coexist at the same time on the system, then neither can “win” against the other. The same goes for sockets which bind to the any (0.0.0.0) address and port B. If such a socket exists, no other UDP socket which binds to a specific local IP address and also to port B can exist at the same time on the system. The bind(2) syscall won't allow it (see the little demo below). So, it is not relevant that the 2nd lookup, which checks for the any address, actually happens after the lookup for a specific address. Two separate lookups are merely necessary here due to the nature of the hash table, as the specific IP address (or the any address) is part of the calculated hash. You cannot check for both with just one lookup.
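You can easily confirm this coexistence rule yourself. The following self-contained demo (a sketch of my own; port 8080 is assumed to be free, no SO_REUSEADDR/SO_REUSEPORT involved) first binds a UDP socket to 127.0.0.1:8080 and then tries to bind a second one to 0.0.0.0:8080, which fails:

#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

static int bind_udp(const char *addr, unsigned short port)
{
    int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
    struct sockaddr_in sa = { .sin_family = AF_INET, .sin_port = htons(port) };
    inet_pton(AF_INET, addr, &sa.sin_addr);
    if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
        perror(addr);     /* expected for the 2nd call: Address already in use */
        close(fd);
        return -1;
    }
    return fd;
}

int main(void)
{
    int a = bind_udp("127.0.0.1", 8080);  /* succeeds */
    int b = bind_udp("0.0.0.0",   8080);  /* fails with EADDRINUSE */
    if (a >= 0) close(a);
    if (b >= 0) close(b);
    return 0;
}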

eBPF sk_lookup

As shown in Figure 6, an eBPF program can be loaded into the kernel to be executed at this sk_lookup hook. This can be used to override the regular socket lookup and to select a target socket for the network packet to be received at, based on criteria other than the default ones covered by the regular lookup. The initial lookup based on netns + dstAddr + dstPort is still performed before that eBPF hook; matches to connected UDP sockets thereby cannot be overridden by an eBPF program, while matches to unconnected UDP sockets can. An eBPF program attached to this hook is loaded into a specific network namespace; in other words, each network namespace has its own individual eBPF hook here. I collected some useful links for those who'd like to get deeper into this topic.
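To give you a rough idea what such a program looks like, here is a minimal, hedged sketch of my own (untested here; the map name and port 8080 are arbitrary choices) which steers all UDP traffic destined to local port 8080 towards a socket which user space has previously placed into a BPF sockmap:

#include <linux/bpf.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_SOCKMAP);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} dest_sock SEC(".maps");    /* user space puts the target socket in here */

SEC("sk_lookup")
int steer_udp(struct bpf_sk_lookup *ctx)
{
    __u32 key = 0;
    struct bpf_sock *sk;

    if (ctx->protocol != IPPROTO_UDP || ctx->local_port != 8080)
        return SK_PASS;              /* not ours: regular lookup proceeds */

    sk = bpf_map_lookup_elem(&dest_sock, &key);
    if (sk) {
        bpf_sk_assign(ctx, sk, 0);   /* on success this socket will be used */
        bpf_sk_release(sk);
    }
    return SK_PASS; /* with a socket assigned: use it; without: regular lookup */
}

char _license[] SEC("license") = "GPL";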

Socket lookup IPv6

Despite being implemented separately from the IPv4 lookup described above, the socket lookup for locally received IPv6 UDP packets, visualized in Figure 7, works nearly identically to its IPv4 counterpart.

Figure 7: Socket lookup for IPv6 UDP packets in detail.

The entire lookup is implemented in function __udp6_lib_lookup(). It performs the very same two successive lookups into udp_table.hash2, with the optional eBPF socket lookup in between. The looping through the hash table buckets is implemented here in function udp6_lib_lookup2(), which internally uses an IPv6 variant of the compute_score() function to compute the score of matching sockets. The only real difference to the IPv4 implementation is the obvious fact that IPv6 addresses are compared here. So here the destination IPv6 address of the received UDP packet is compared to the socket struct member skc_v6_rcv_saddr and, in case of connected sockets, the source IPv6 address of the network packet is compared to member skc_v6_daddr. Additionally, a socket will only be considered a match if it belongs to address family (domain) PF_INET610). That value is stored in socket member skc_family and clearly categorizes a socket as an IPv6 (or dual-stack) socket. Please compare all this to Figure 3, which shows all these socket member variables.

Dual-stack sockets

When you create an IPv6 UDP socket and bind it to an existing local address and port and not to the any address [::], then this socket is thereby (as you would expect) an IPv6-only socket. Syscall bind(2) in this case sets socket member skc_ipv6only:1 to 1 (take another look at Figure 3), which is equivalent to setting socket option IPV6_V6ONLY to true. However, if you create an IPv6 UDP socket and bind it to the IPv6 any address [::], then you thereby create a dual-stack socket, which is able to handle IPv4 as well as IPv6 traffic. Precondition is sysctl net.ipv6.bindv6only being set to 0 (default). This sysctl represents the default value for socket option IPV6_V6ONLY. Thus, either by setting sysctl net.ipv6.bindv6only=111) before creating a socket or by explicitly setting socket option IPV6_V6ONLY to true for the socket in question before binding it to the IPv6 any address, you can still enforce this socket to be IPv6-only. This works because the socket(2) syscall initializes socket struct member skc_ipv6only:1 with the current value of net.ipv6.bindv6only, and after that you still have the opportunity to change the value of that member by setting socket option IPV6_V6ONLY. This subtle difference can be made visible with the iproute2 ss tool. Here is an example where ss lists 3 UDP sockets which all bind to the any address:

$ ss -l
Netid   State   ...  Local Address:Port    ... 
udp     UNCONN  ...        0.0.0.0:111     ...  # IPv4 socket (AF_INET)
udp     UNCONN  ...           [::]:53      ...  # IPv6-only socket (AF_INET6)
udp     UNCONN  ...              *:80      ...  # dual-stack socket (AF_INET6)
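The following self-contained snippet (a sketch of my own; port 8080 is an arbitrary choice) shows how to enforce the IPv6-only behavior per socket, independent of the current net.ipv6.bindv6only setting; with v6only set to 0 instead, you get the dual-stack variant:

#include <stdio.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET6, SOCK_DGRAM, IPPROTO_UDP);

    /* socket(2) initialized skc_ipv6only:1 from net.ipv6.bindv6only;
     * override it per socket before binding: */
    int v6only = 1;   /* 1 = IPv6-only, 0 = dual-stack */
    setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &v6only, sizeof(v6only));

    struct sockaddr_in6 sa = { .sin6_family = AF_INET6,
                               .sin6_port   = htons(8080),
                               .sin6_addr   = in6addr_any };   /* [::] */
    if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0)
        perror("bind");

    pause();   /* inspect with ss -ulpn: "[::]:8080" vs. "*:8080" */
    return 0;
}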

But how does the socket lookup work in case of a dual-stack socket? To answer this, we first need to explain how the socket struct member variables are initialized by the socket(2) and bind(2) syscalls in this case:

socket(AF_INET6, SOCK_DGRAM, IPPROTO_UDP) = 3
bind(3, {sa_family=AF_INET6, sin6_port=htons(8080), sin6_flowinfo=htonl(0), \
   inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, 28) = 0

Please compare this to the example shown in Figure 3. I'll only explain here what happens differently in case of a dual-stack socket: The bind(2) syscall here binds the socket to the IPv6 any address [::] and saves this IP address in member variable skc_v6_rcv_saddr. It additionally sets member skc_rcv_saddr to zero, which means it sets it to the IPv4 any address 0.0.0.0. It doesn't touch member skc_ipv6only:1 and leaves it at its value 0 already set by the socket(2) syscall. As a result we now have a socket of protocol family AF_INET6 (PF_INET6) which binds to both [::] and 0.0.0.0 → a dual-stack socket.

In case an IPv4 UDP packet with a matching destination port number is received, the IPv4 UDP socket lookup is performed as described above; let's say we have now arrived at the 2nd hash table lookup, which is based on netns + address 0.0.0.0 + the packet's destination port (Figure 6). A minor detail I didn't mention so far is that function udp4_lib_lookup2(), when looping through the sockets in the hash table bucket, actually does not limit its search to IPv4 sockets (sockets of protocol family AF_INET/PF_INET). It also considers IPv6 sockets (sockets of protocol family AF_INET6/PF_INET6) as a match, checking for member skc_ipv6only:1 to be 012). The compute_score() function merely gives a PF_INET6 socket a slightly lower score than a PF_INET socket (= than a “real” IPv4 socket). Thus, an IPv4 UDP packet can indeed find a match here, assuming that everything else, like the destination port number, fits.

In case an IPv6 UDP packet with a matching destination port number is received, the IPv6 UDP socket lookup is performed exactly as described in the section above and its 2nd hash table lookup, which checks for the any address [::], produces a match (Figure 7).

IPv4-mapped IPv6 addresses
In case you are wondering how you can handle receiving IPv4 UDP packets on an AF_INET6 socket: When doing socket programming with IPv4 (AF_INET), IP addresses and ports are handled in the form of struct sockaddr_in with its member sin_addr of type struct in_addr, which holds an actual IPv4 address. When doing socket programming with IPv6 (AF_INET6), IP addresses and ports are handled in the form of struct sockaddr_in6 with its member sin6_addr of type struct in6_addr, which holds an actual IPv6 address. Dual-stack sockets are AF_INET6 sockets and thereby still always use the latter structs for addresses. IPv4 addresses are handled here in the form of IPv4-mapped IPv6 addresses, which embed an IPv4 address like 1.2.3.4 into a pseudo IPv6 address ::ffff:1.2.3.4, so it can be stored within an instance of struct in6_addr. Those addresses are specified in RFC 4291:

|                80 bits               | 16 |      32 bits        |
+--------------------------------------+----+---------------------+
|0000..............................0000|FFFF|    IPv4 address     |
+--------------------------------------+----+---------------------+

IPv4-mapped IPv6 addresses never actually appear on the wire (= within network packet headers). They are merely meant for special use cases like this one. So if you receive an IPv4 UDP packet on a dual-stack socket with syscall recvfrom(2), you will be provided with an instance of struct sockaddr_in6 which holds the source IPv4 address of the received packet in the form of an IPv4-mapped IPv6 address.
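Here is a small self-contained receiver (a sketch of my own; port 8080 is an arbitrary choice) which makes this visible: it binds a dual-stack socket to [::]:8080, receives one datagram and prints the peer address, flagging IPv4-mapped senders:

#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET6, SOCK_DGRAM, IPPROTO_UDP);
    struct sockaddr_in6 local = { .sin6_family = AF_INET6,
                                  .sin6_port   = htons(8080),
                                  .sin6_addr   = in6addr_any };
    if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0)
        perror("bind");

    char buf[1500], astr[INET6_ADDRSTRLEN];
    struct sockaddr_in6 peer;
    socklen_t plen = sizeof(peer);
    recvfrom(fd, buf, sizeof(buf), 0, (struct sockaddr *)&peer, &plen);

    inet_ntop(AF_INET6, &peer.sin6_addr, astr, sizeof(astr));
    printf("datagram from %s port %u%s\n", astr, ntohs(peer.sin6_port),
           IN6_IS_ADDR_V4MAPPED(&peer.sin6_addr) ? " (IPv4-mapped)" : "");

    close(fd);
    return 0;
}

Sending to it with e.g. echo hi | nc -u 127.0.0.1 8080 should print ::ffff:127.0.0.1 (IPv4-mapped), while sending via ::1 prints a plain IPv6 address.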

Context

This article describes the source code and behavior of the Linux Kernel v6.12.

Kernel 6.13

This is a perfect example of how quickly documentation ages. While I was still working on this article, kernel 6.13 was released (2025-01-19), which introduced a significant change to the UDP socket lookup. After reviewing this change, I decided to keep this article as it is. No other significant change to the UDP socket lookup had been introduced in quite a while, so this article correctly describes the implementation and behavior of a whole bunch of kernel versions, up to and including 6.12. Further, 6.12 is an LTS kernel which will thereby continue to be used by many people in the foreseeable future. The change introduced with 6.13 does not replace or heavily modify the implementation of 6.12. Instead, it adds a 3rd hash table and yet another lookup, and this change is mostly only relevant for connected sockets in combination with the socket option SO_REUSEPORT. I intend to cover both SO_REUSEPORT and the change of 6.13 in the next article of this series. For now, I'll give you an appetizer on what this change in 6.13 is all about. The mails on the netdev mailing list adding the patchset describe it quite well. In summary:

In addition to the existing two hash tables *hash and *hash2, a third hash table has been added to struct udp_table, named *hash4. It gets the same number of entries as the other two hash tables, and its key feature is that it provides a hash lookup based on netns and the full 4-tuple (srcAddr, srcPort, dstAddr, dstPort) of a received UDP packet, implemented in new function udp4_lib_lookup4(). This lookup is added to the existing __udp4_lib_lookup() function before all other hash lookups; compare to Figure 6. Its intent is to improve performance of the overall lookup in cases where a large number of connected sockets exists. That is commonly the case when running a UDP server application which handles a large number of connected sockets and uses socket option SO_REUSEPORT. Sockets are added to the *hash4 table by syscall connect(2), so they can be found more quickly during the lookup, based on the 4-tuple. The same new behavior has been added to the IPv6 UDP socket lookup in function __udp6_lib_lookup() by newly added function udp6_lib_lookup4(); compare to Figure 7.

Feedback

Feedback on this article is very welcome! Please be aware that I did not develop or contribute to any of the software components described here. I'm merely a developer who took a look at the source code and did some practical experimenting. If you find something which I might have misunderstood or described incorrectly here, I would be very grateful if you brought it to my attention; of course I'll then fix my content asap accordingly.

published 2025-03-02, last modified 2025-03-02

1)
coming soon…
2)
to be precise: the IPv4 or IPv6 variant of the Input hook, as both are implemented independently. See my article Nftables - Packet flow and Netfilter hooks in detail.
3)
What might be a slightly confusing fact is that, despite those two hash tables in struct udp_table being used to look up instances of IPv4 UDP as well as IPv6 UDP sockets, this sysctl is still located in net.ipv4 and there is no equivalent sysctl in net.ipv6.
4)
Obviously it is necessary to include the network namespace here, as sockets from all network namespaces on the system are present within these hash tables and a clear distinction must be possible.
5)
taking the 16384 entries which I observed to be allocated on my system as an example
6)
A socket is actually only considered connected if the values of both those members are not 0. Only then does this comparison actually take place.
7)
SO_BINDTODEVICE
8)
With “wins” I mean that if both sockets exist and are present in the hash table, then the one which “wins” will be the preferred match to be chosen.
9)
At least not without further “magic” like SO_REUSEPORT, but in that case the lookup will anyway work a little differently than described here. I'll try to cover that in one of the next articles.
10)
which, as you can see, is actually the same as AF_INET6 and represents the domain set by the socket(2) syscall
11)
which takes effect within the current network namespace
12)
Both IPv4 and IPv6 sockets possess this member; it is always 0 for IPv4 sockets, while its value for IPv6 sockets is handled as described above.