
Sockets in the Linux Kernel - Part 3: TCP Socket Lookup on Rx

In this article series I'd like to explore the implementation of sockets in the Linux Kernel and the source code surrounding them. While most of my previous articles focused primarily on OSI Layer 3, this series attempts a dive into OSI Layer 4. Due to the sheer complexity, it is very easy for readers of the kernel source code to not see the forest for the trees. My intent is to throw a lifeline here, to help navigate the code and to hold on to what is essential.

Articles of the series

Overview

In this 3rd article of the series, I'd like to dive into the socket lookup for TCP packets, which determines the socket on which an incoming TCP packet will actually be received and handled. While separate receive paths exist for IPv4 and IPv6 packets, the socket lookup and the implementation of the TCP protocol both primarily consist of shared components that are used on both receive paths. The overall way the lookup works is nearly identical in both cases. I'll walk through the general packet flow, the involved components like the socket structs and the hash tables, and finally the socket lookup itself.

Rx Packet Flow

Figure 1 visualizes the receive path of unicast TCP packets in the Linux kernel, encapsulated either in IPv4 or IPv6, from initial packet reception in the NIC driver until being handled by the TCP state machine, with special focus on the socket lookup1).

Figure 1: Packet flow on the receive path from L2 → L3 → L4 for locally received TCP packets with focus on the socket lookup.

In the packet flow depicted in Figure 1, the routing decision determines that the IPv4/IPv6 packet is to be locally received. Based on that decision, an indirect function call to the local receive handler is made, which is ip_local_deliver() in case of IPv4 and ip6_input() in case of IPv6. In both cases, the packet now traverses the Netfilter Input hook2) and is then demultiplexed3) based on the L4 protocol it carries. In this case the packet carries a TCP segment, so tcp_v4_rcv() is called in case of IPv4 and tcp_v6_rcv() in case of IPv6. Both functions do some initial checks of the TCP header and checksum and then call __inet_lookup() (IPv4) or __inet6_lookup() (IPv6) respectively4), which represent the actual socket lookup. Both implementations use two central hash tables as back-end, which I'll describe in detail in the next sections. If a matching socket is found, the packet is handed over to the TCP state machine, which processes it further and, depending on the packet details and connection state, adds its payload (if any) to the receive buffer of that socket.
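For orientation, here is a reduced sketch (not the verbatim kernel code) of how that demultiplexing is wired up for TCP on the IPv4 path: during boot a struct net_protocol instance for TCP is registered (see net/ipv4/af_inet.c), and its handler is what the indirect call in ip_local_deliver_finish() ends up invoking.

/* Sketch, reduced to the essentials of net/ipv4/af_inet.c */
static const struct net_protocol tcp_protocol = {
        .handler     = tcp_v4_rcv,   /* locally delivered TCP segments land here */
        .err_handler = tcp_v4_err,   /* ICMP errors concerning TCP land here */
};

/* registered during boot: */
inet_add_protocol(&tcp_protocol, IPPROTO_TCP);

/* ip_local_deliver_finish() later looks up the registered net_protocol
 * by the protocol field of the IPv4 header and calls its ->handler(),
 * i.e. tcp_v4_rcv(), indirectly. */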

Socket structs

Before getting to the actual socket lookup, I'd like to give you a tour of the building blocks involved here, first of all the sockets themselves. In the kernel those are represented by a complex object-oriented hierarchy of structs. The implementation is a deep rabbit hole and I will only go as deep as necessary here.

Figure 2: Simplified example instance of an IPv4 TCP listening socket or child socket.

Figure 2 shows the instances of structs which are allocated and initialized in the kernel when you use the socket(2) syscall to create a TCP socket in the IPv4 domain, then use the bind(2) syscall to bind this socket to a local IPv4 address and port, and then use the listen(2) syscall to make this socket listen on the bound address and port. All involved kernel socket structs have far more member variables than shown in Figure 2 and the other figures in this section. I only show the ones I consider relevant for the topic at hand. So, let's walk through this example: The socket(2) syscall, primarily represented by function __sys_socket() in the kernel, allocates an instance of struct socket and an instance of struct tcp_sock5), which both hold pointers to each other. The latter includes a hierarchy of sub-structs struct inet_connection_sock, struct inet_sock, struct sock and struct sock_common, as shown in Figure 2. The bind(2) syscall, primarily represented by function inet_bind() in the kernel (in case of IPv4), here binds the socket to 127.0.0.1:8080 and saves this IP address in member variable skc_rcv_saddr and the port 8080 in member variable skc_num. The listen(2) syscall, primarily represented by function inet_listen() in the kernel (in case of TCP), here makes the socket listening by changing its state and adding it to a hash table (more on that in the section below).
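To make that nesting tangible, here is a heavily reduced sketch of how these structs contain each other; the real definitions (with many more members) live in include/net/sock.h, include/net/inet_sock.h, include/net/inet_connection_sock.h and include/linux/tcp.h.

/* Heavily reduced sketch of the struct hierarchy shown in Figure 2 */
struct sock_common {                      /* innermost part, shared by all socket types */
        __be32          skc_daddr;        /* peer IPv4 address */
        __be32          skc_rcv_saddr;    /* bound local IPv4 address */
        __be16          skc_dport;        /* peer port */
        __u16           skc_num;          /* bound local port (host byte order) */
        unsigned char   skc_state;        /* e.g. TCP_CLOSE, TCP_LISTEN, TCP_ESTABLISHED */
        possible_net_t  skc_net;          /* network namespace */
        /* ... */
};

struct sock                 { struct sock_common          __sk_common; /* ... */ };
struct inet_sock            { struct sock                 sk;          /* ... */ };
struct inet_connection_sock { struct inet_sock            icsk_inet;   /* ... */ };
struct tcp_sock             { struct inet_connection_sock inet_conn;   /* ... */ };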

Please be aware that those member variables are frequently accessed under several different alias names instead of their actual names in different parts of the code. Most of the structs in this hierarchy have a bunch of preprocessor defines which serve as alias names or “shortcuts” to access member variables of the structs which are lower in the hierarchy. See e.g. this list of defines which is part of struct sock. In this article I'll always use the actual name and not one of these alias names when referring to a member variable.
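Just to give you a feeling for what these aliases look like, here is a small excerpt of such defines (as found in include/net/sock.h and include/net/inet_sock.h):

/* in struct sock: aliases into the embedded struct sock_common */
#define sk_num          __sk_common.skc_num
#define sk_dport        __sk_common.skc_dport
#define sk_daddr        __sk_common.skc_daddr
#define sk_rcv_saddr    __sk_common.skc_rcv_saddr
#define sk_state        __sk_common.skc_state

/* in struct inet_sock: aliases reaching down through the embedded struct sock */
#define inet_daddr      sk.__sk_common.skc_daddr
#define inet_rcv_saddr  sk.__sk_common.skc_rcv_saddr
#define inet_dport      sk.__sk_common.skc_dport
#define inet_num        sk.__sk_common.skc_num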

Figure 3: Simplified example instance of an IPv6 TCP listening socket or child socket.

Figure 3 shows the very same example as Figure 2, however this time for a TCP socket in the IPv6 domain. It shows you the instances of structs which are allocated and initialized in the kernel when you use the socket(2), bind(2) and listen(2) syscalls in the same way as described above, just this time with IPv6 addresses. Let's walk through this IPv6 variant of the example: The socket(2) syscall allocates an instance of struct socket and an instance of struct tcp6_sock, which both hold pointers to each other. The latter includes nearly the same hierarchy of sub-structs as in the IPv4 TCP example. The struct tcp6_sock merely acts as a wrapper around the original hierarchy consisting of struct tcp_sock, struct inet_connection_sock, struct inet_sock, struct sock and struct sock_common, and adds an additional struct ipv6_pinfo. The bind(2) syscall, represented by function inet6_bind() in the kernel (in case of IPv6), here binds the socket to [2001:db8::1]:8080 and saves this IP address in member variable skc_v6_rcv_saddr. It saves the port 8080 in member skc_num. The listen(2) syscall, same as with IPv4, adds the socket to a hash table and changes its state (I'll get to that).
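The wrapper itself is tiny; its definition in include/linux/ipv6.h boils down to:

struct tcp6_sock {
        struct tcp_sock   tcp;     /* the same hierarchy as in the IPv4 case */
        /* ipv6_pinfo has to be the last member */
        struct ipv6_pinfo inet6;   /* additional IPv6-specific state */
};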

Figure 4: Simplified example instance of an IPv4 TCP request socket.
Figure 5: Simplified example instance of an IPv6 TCP request socket.

So, what are those “request sockets” shown in Figures 4 and 5? Those are “lightweight” TCP sockets, which are used to represent new connection requests during the TCP 3-way handshake. The full-blown TCP sockets shown in Figures 2 and 3 are only used as either listening sockets or as sockets which represent established connections. During the TCP 3-way handshake and also during the TCP FIN handshake, “lightweight” sockets are used instead. I'll not explain the ones used in the FIN handshake here. A TCP request socket is created when an initial TCP SYN packet is received from a peer client. In case of IPv4 it consists of a hierarchy of structs tcp_request_sock, inet_request_sock, request_sock and sock_common, see again Figure 4. In case of IPv6 it consists of the exact same hierarchy of structs, just the whole thing is wrapped into an instance of struct tcp6_request_sock at the top, see Figure 5.
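Analogous to the full-blown sockets, the nesting of the request-socket structs can be sketched roughly like this (heavily reduced; see include/net/request_sock.h, include/net/inet_sock.h, include/linux/tcp.h and include/linux/ipv6.h):

/* Heavily reduced sketch of the request-socket hierarchy (Figures 4 and 5) */
struct request_sock {
        struct sock_common __req_common;  /* 4-tuple, state, netns, skc_listener, ... */
        /* ... retransmit counter, timer, ... */
};

struct inet_request_sock { struct request_sock      req; /* ... */ };
struct tcp_request_sock  { struct inet_request_sock req; /* ... */ };

/* IPv6 merely wraps the very same hierarchy once more: */
struct tcp6_request_sock { struct tcp_request_sock  tcp6rsk_tcp; };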

TCP Syncookies

This is a feature which makes handling new connection requests received on a listening TCP socket even more “lightweight”, by not using any “request sockets” or other kinds of socket instances at all during the 3-way handshake. Syncookies are usually used when the queue of pending connection requests (the SYN backlog) is already full. I'll not explain them any further in this article.

TCP hash tables

Figure 6: Global instance of struct inet_hashinfo, named tcp_hashinfo, holding the TCP hash tables; tables *ehash and *lhash2 are used in the socket lookup.

The hash tables which are used to hold TCP sockets are placed in a global instance of struct inet_hashinfo, named tcp_hashinfo, whose members are allocated and initialized in function tcp_init() and the functions called by it during early kernel boot, see Figure 6. This struct actually holds 4 different hash tables: ehash, bhash, bhash2 and lhash2. Tables bhash and bhash2 are used for socket management related to the bind() syscall and do not actually hold socket instances. They are not relevant for the scope of this article. Tables ehash and lhash2 do hold socket instances and are used in the TCP socket lookup, so those two are our main focus here. While lhash2 holds sockets which are in TCP_LISTEN state (listening sockets), ehash holds sockets in all other states (sockets which represent actual TCP connections). The entries (“buckets” or sometimes also called “slots”) of table ehash are of type struct inet_ehash_bucket, which has a memory footprint of 8 bytes and holds the head pointer of a linked list. The entries of lhash2 are of type struct inet_listen_hashbucket, which has a memory footprint of 16 bytes and additionally contains a spinlock. By default, the number of entries of both tables is decided dynamically during allocation, based on the amount of memory on the system. For ehash an entry is created for each 128KB of memory on the system, with a hard limit of 2^19 entries. For lhash2 an entry is created for each 2MB of memory on the system, with a hard limit of 2^16 entries. On one of my systems with 32 GB of RAM I observed ehash to have 2^18 = 262144 entries6) and lhash2 to have 2^14 = 16384 entries7). The allocation size and number of entries of both tables can easily be observed by means of log messages during early boot which appear in your dmesg and journald logs8):

$ journalctl -b 0 -k -g 'TCP established'
TCP established hash table entries: 262144 (order: 9, 2097152 bytes, linear)
# meaning:
# 262144 buckets in *ehash × 8 bytes each = 2097152 bytes
 
$ journalctl -b 0 -k -g 'tcp_listen_portaddr_hash'
tcp_listen_portaddr_hash hash table entries: 16384 (order: 6, 262144 bytes, linear)
# meaning:
# 16384 buckets in *lhash2 × 16 bytes each = 262144 bytes

Further, the (read-only) sysctl tcp_ehash_entries can also show you that same number of entries for ehash. There seems to be no equivalent for lhash2:

$ sudo sysctl net.ipv4.tcp_ehash_entries
net.ipv4.tcp_ehash_entries = 262144

I'll describe in more detail in the next section how these tables are actually used. For now, let's just note that both tables together hold all the sockets which can be matched to a locally received packet during the TCP socket lookup. This includes TCP/IPv4 as well as TCP/IPv6 sockets and, by default, also sockets of all network namespaces.
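To tie this together, here is a reduced sketch of tcp_hashinfo's type and the two bucket types used by the lookup (see include/net/inet_hashtables.h; only the members relevant here are shown):

struct inet_ehash_bucket {                 /* one ehash bucket: 8 bytes */
        struct hlist_nulls_head chain;     /* head of the socket list */
};

struct inet_listen_hashbucket {            /* one lhash2 bucket: 16 bytes */
        spinlock_t              lock;
        struct hlist_nulls_head nulls_head;
};

struct inet_hashinfo {
        struct inet_ehash_bucket      *ehash;       /* sockets in all states except TCP_LISTEN */
        unsigned int                   ehash_mask;  /* number of buckets - 1 */
        struct inet_bind_hashbucket   *bhash;       /* bind() management, no socket instances */
        struct inet_bind_hashbucket   *bhash2;
        unsigned int                   lhash2_mask; /* number of buckets - 1 */
        struct inet_listen_hashbucket *lhash2;      /* sockets in state TCP_LISTEN */
        /* ... */
};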

Option: Network namespace with individual instance of ehash table

As mentioned above, by default only one global instance of struct inet_hashinfo exists, named tcp_hashinfo, which holds both the ehash and the lhash2 table and is used in all network namespaces. Each network namespace keeps an individual pointer to that global instance in net->ipv4.tcp_death_row.hashinfo. This pointer is initialized in function tcp_set_hashinfo(), which is called at creation time of each network namespace. The socket lookup actually uses this pointer to access ehash and lhash2. However, a sysctl exists which optionally enables network namespaces to allocate their own individual instance of ehash (but not of lhash2!). If that feature is activated, then function tcp_set_hashinfo() creates a new instance of struct inet_hashinfo individually for each new child network namespace and allocates an individual instance of the ehash table. The *lhash2 pointer in that struct will, however, keep pointing to the global instance of lhash2. To activate this, you need to set sysctl net.ipv4.tcp_child_ehash_entries to the number of entries you wish to be allocated for ehash and then create a child network namespace. That child network namespace will then allocate its own instance of ehash with the configured number of entries; see again function tcp_set_hashinfo(). You can confirm this by reading sysctl net.ipv4.tcp_ehash_entries inside the child network namespace. By default this feature is switched off and the involved sysctls will look like this9):

net.ipv4.tcp_child_ehash_entries = 0
net.ipv4.tcp_ehash_entries =  262144  # in main network namespace
net.ipv4.tcp_ehash_entries = -262144  # in other network namespaces

If switched on and set to 1048576 (2^20) entries, the involved sysctls will look like this:

net.ipv4.tcp_child_ehash_entries = 1048576
net.ipv4.tcp_ehash_entries = 262144   # in main network namespace
net.ipv4.tcp_ehash_entries = 1048576  # in child network namespace

Socket lookup IPv4

Now we are finally getting into the meat of things. The actual socket lookup for locally received IPv4 TCP packets is implemented in function __inet_lookup() and visualized in Figure 7. It primarily consists of a lookup into table ehash, implemented in __inet_lookup_established(), followed by two successive lookups into lhash2, implemented in __inet_lookup_listener(). While the names of those functions suggest that the lookup into ehash is a lookup for established sockets and the lookups into lhash2 are lookups for listening sockets, reality is a little more complex.

Figure 7: Socket lookup for IPv4 TCP packets in detail.

As I already mentioned in the previous section, ehash holds sockets which can have all possible states of the TCP state machine except TCP_LISTEN, while lhash2 solely holds sockets which are in state TCP_LISTEN. The lookup into ehash happens based on the 4-tuple of the received TCP packet, that is, its srcIP + srcPort + dstIP + dstPort. A hash is calculated based on that10). In other words, the lookup into ehash can find a matching socket which represents a TCP connection that is either established or currently in the middle of the TCP 3-way handshake or the TCP FIN handshake. The lookup works in typical hash table manner11): The calculated hash serves as an array index in ehash and thereby determines the correct bucket/slot within the table. Then the code loops through the linked list of all socket instances located in that bucket/slot, comparing the 4-tuple and netns of the packet to each socket to find the matching one. This is done by function inet_match(). Here the socket member variables skc_net, skc_rcv_saddr, skc_num, skc_daddr and skc_dport are used for comparison, see again Figure 2. If a matching socket is found, the whole lookup finishes here. Otherwise the code proceeds to the lookups into lhash2. As indicated in Figure 7, there is an optional eBPF lookup happening in between, see inet_lookup_run_sk_lookup(). By default this is a no-op and you can ignore it; see the info box below for more details on this optional lookup. Now a lookup into lhash2 is done based on the 2-tuple consisting of the dstIP and dstPort of the received packet. A hash is calculated based on the 2-tuple and the netns to determine the bucket/slot, and then the code uses function inet_lhash2_lookup() to loop through the sockets of the linked list of that bucket/slot. This lookup is able to find a matching listening socket which is bound to a specific local IP address and port. If no match is found, another lookup is done into lhash2, this time based on a 2-tuple consisting of the “any” IP address 0.0.0.0 and the dstPort of the received packet. Obviously, this lookup is able to find a matching listening socket which is bound to the “any” IP address and a specific port. From the sequence of lookups performed here you can clearly see that matching sockets which represent actual TCP connections (matched on the 4-tuple) take precedence over matches to listening sockets bound to specific addresses, which in turn take precedence over matches to listening sockets bound to the “any” address.
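Condensed into a sketch, the order of lookups described above looks like this. Note that this is not the verbatim kernel code: lookup_ehash() and lookup_lhash2() are hypothetical stand-ins for __inet_lookup_established() and inet_lhash2_lookup(), and the parameter lists are reduced for readability.

struct sock *simplified_inet_lookup(struct net *net, struct inet_hashinfo *h,
                                    __be32 saddr, __be16 sport,
                                    __be32 daddr, unsigned short hnum)
{
        struct sock *sk;

        /* (1) 4-tuple lookup into ehash: finds sockets representing
         *     connections (established, 3-way handshake, FIN handshake) */
        sk = lookup_ehash(net, h, saddr, sport, daddr, hnum);
        if (sk)
                return sk;

        /* (2) optional eBPF sk_lookup program may select a listening
         *     socket here; no-op if no program is attached in this netns */

        /* (3) 2-tuple lookup into lhash2: listener bound to daddr:hnum */
        sk = lookup_lhash2(net, h, daddr, hnum);
        if (sk)
                return sk;

        /* (4) 2-tuple lookup into lhash2: listener bound to 0.0.0.0:hnum */
        return lookup_lhash2(net, h, htonl(INADDR_ANY), hnum);
}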

eBPF sk_lookup

As shown in Figure 7, an eBPF program can be loaded into the kernel to be executed at this sk_lookup hook. This can be used to override the regular socket lookup and to select a target socket for the network packet to be received at, based on criteria other than the default ones covered by the regular lookup. The initial lookup based on the 4-tuple into table ehash is still performed before that eBPF hook and thereby takes precedence. In other words, the eBPF lookup cannot override the regular lookup for sockets representing existing (established) TCP connections. It can only be used to override the lookups for listening sockets. An eBPF program at this hook is loaded into a specific network namespace; in other words, each network namespace has its own individual eBPF hook here. I collected some useful links for those who would like to dig deeper into this topic:
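As an illustration of what such a program can do, here is a minimal, hypothetical sk_lookup sketch which steers matching packets to whatever listening socket user space has stored at key 0 of a SOCKMAP. The map name, the port check and the program name are made up for this example; the overall pattern follows the one documented in Documentation/bpf/prog_sk_lookup.rst.

#include <linux/bpf.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_SOCKMAP);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u64);
} target_sock SEC(".maps");            /* hypothetical map name */

SEC("sk_lookup")
int steer(struct bpf_sk_lookup *ctx)
{
        __u32 key = 0;
        struct bpf_sock *sk;
        long err;

        /* only touch TCP traffic to port 8080 (arbitrary example condition) */
        if (ctx->protocol != IPPROTO_TCP || ctx->local_port != 8080)
                return SK_PASS;

        sk = bpf_map_lookup_elem(&target_sock, &key);
        if (!sk)
                return SK_PASS;        /* nothing stored: fall back to regular lookup */

        err = bpf_sk_assign(ctx, sk, 0);   /* select this socket for reception */
        bpf_sk_release(sk);
        return err ? SK_DROP : SK_PASS;
}

char LICENSE[] SEC("license") = "GPL";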

Socket lookup IPv6

The socket lookup for locally received IPv6 TCP packets, depicted in Figure 8, works exactly like its IPv4 counterpart described in the previous section. It is merely implemented in a separate set of functions. The entire lookup is done in __inet6_lookup(). That function calls __inet6_lookup_established() to do the lookup into ehash and then __inet6_lookup_listener() to do the lookups into lhash2, including the optional eBPF lookup, which here is implemented in inet6_lookup_run_sk_lookup().

Figure 8: Socket lookup for IPv6 TCP packets in detail.

There's only a tiny difference compared to IPv4: these lookups are based on IPv6 instead of IPv4 addresses, and since IPv6 addresses are stored in different member variables of the socket instances, the members skc_v6_rcv_saddr and skc_v6_daddr are used here instead of their IPv4 counterparts, see again Figure 3.

Example: TCP server socket

Let's walk through the initial steps of a “passive open” of an IPv4 TCP socket: We create a new TCP socket, which we want to use as a listening socket within a TCP server application. After creation we bind it to a local address and port and make it listen on that address and port. Once we receive a TCP SYN packet from a client which matches our listening socket, a request socket is created, and on successful completion of the TCP 3-way handshake that request socket is replaced by a new full-blown TCP socket which represents the now established TCP connection. That one is usually called a “child socket”. While walking through these steps, let's focus on the involved socket instances, their states, which hash tables they reside in and what that means for the socket lookup which is executed for each packet we receive from the peer client.
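To make these steps concrete, the following minimal user-space sketch performs exactly this passive open (error handling is kept short; the address 127.0.0.1:8080 matches the example used in the figures above):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
        /* (1) create TCP socket -> struct tcp_sock is allocated, state TCP_CLOSE */
        int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
        if (fd < 0) {
                perror("socket");
                return 1;
        }

        /* (2) bind to 127.0.0.1:8080 -> skc_rcv_saddr and skc_num are set */
        struct sockaddr_in addr = { 0 };
        addr.sin_family = AF_INET;
        addr.sin_port = htons(8080);
        inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
                perror("bind");
                return 1;
        }

        /* (3) listen -> state becomes TCP_LISTEN, socket is added to lhash2 */
        if (listen(fd, 128) < 0) {
                perror("listen");
                return 1;
        }

        /* (4)-(7) the handshake is handled entirely in the kernel; accept()
         * then returns the established child socket cloned from the listener */
        int client = accept(fd, NULL, NULL);
        if (client >= 0)
                close(client);
        close(fd);
        return 0;
}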

(1) Syscall socket()

# socket() syscall, shown here in strace style
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3

We use syscall socket() to create a new TCP socket. The main steps of that syscall are done in functions __sock_create(), inet_create()12) and tcp_v4_init_sock()13): An instance of the whole hierarchy of structs shown in Figure 2 above is created, see Figure 9. The state of this new socket is set to TCP_CLOSE for now. It is not yet added to any hash table. Instead the syscall returns a file descriptor (here: 3) to the caller in user space, which serves as a reference.

Figure 9: New instance of struct tcp_sock created by syscall socket(). Initial state is TCP_CLOSE.

(2) Syscall bind()

bind(3, {sa_family=AF_INET, sin_port=htons(8080), 
     sin_addr=inet_addr("127.0.0.1")}, 16) = 0

We use syscall bind() to bind our new socket to the local address and port 127.0.0.1:8080. The main steps of that syscall are done in __inet_bind()14) and inet_csk_get_port()15): New entries are added to hash tables bhash and bhash216). The specified local address and port are saved in member variables skc_rcv_saddr and skc_num of the socket, see Figure 10.

Figure 10: Syscall bind() binds socket to 127.0.0.1:8080.

(3) Syscall listen()

listen(3, 128) = 0

Then we call syscall listen(). The main steps of that syscall are done in __inet_listen_sk(), inet_csk_listen_start() and __inet_hash()17). This marks our socket as “passive”, i.e. it makes the socket listen on the bound local address and port for incoming TCP connections from other “active” sockets (clients). It does that by adding the socket to table lhash2 and setting its state to TCP_LISTEN. The hash, which is used as index into lhash2, is calculated by function ipv4_portaddr_hash()18) based on the network namespace and the 2-tuple of the bind address and port. The connector pointers skc_node19) are used to add it to the linked list of the correct bucket of the table, see Figure 11.
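For reference, this hash function is a small inline in include/net/ip.h; in current kernels it boils down to roughly the following (a per-netns seed mixed with the bound address, XORed with the port):

static inline u32 ipv4_portaddr_hash(const struct net *net,
                                     __be32 saddr, unsigned int port)
{
        /* jhash of the bound address, seeded per network namespace,
         * then XORed with the bound port */
        return jhash_1word((__force u32)saddr, net_hash_mix(net)) ^ port;
}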

Figure 11: Syscall listen() adds socket to table lhash2 and sets state to TCP_LISTEN.

(4) TCP SYN received

Let's say a TCP SYN packet is received from a client and it matches the listening TCP socket we just created. The TCP socket lookup doesn't find a match among the sockets in the ehash table, but it finds our matching socket in the lhash2 table, see Figure 12.

Figure 12: TCP socket lookup performed in __inet_lookup() for received TCP SYN finds our matching listening socket in lhash2 based on 2-tuple of the packet's dstIP and dstPort.

Thereby we have detected a new connection request, and a new “request socket” is created. For IPv4 this is a socket of type struct tcp_request_sock, see Figure 13. Figure 4 above shows its internal structure in detail. Its purpose is to serve as a “lightweight” socket which represents the detected (and now ongoing) connection request. The 4-tuple consisting of srcIP:srcPort + dstIP:dstPort of the received TCP SYN packet is saved in its member variables. Its state is set to TCP_NEW_SYN_RECV. Member skc_listener is set to point to our listening socket, which the socket lookup matched for this packet.

Figure 13: Instance of struct tcp_request_sock is created, holding 4-tuple of received TCP SYN.

After being initialized, the new request socket is added to table ehash by inet_ehash_insert(), see Figure 14. The hash, which is used as index into ehash, is calculated by inet_ehashfn() based on the network namespace and the 4-tuple saved in the request socket. The connector pointers skc_node of the request socket are used to add it to the linked list of the correct bucket of the table. Finally, a TCP SYN+ACK reply packet is sent back to the client.
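The step from calculated hash to bucket is plain masking, as the table size is always a power of two. Here is a sketch of what inet_ehash_insert() effectively does; pick_bucket() is just a hypothetical name for illustration, not a kernel function.

static struct inet_ehash_bucket *pick_bucket(struct inet_hashinfo *hashinfo,
                                             struct net *net,
                                             __be32 laddr, __u16 lport,
                                             __be32 faddr, __be16 fport)
{
        /* hash over netns + 4-tuple, as described above */
        u32 hash = inet_ehashfn(net, laddr, lport, faddr, fport);

        /* masking with ehash_mask selects the bucket; compare
         * inet_ehash_bucket() in include/net/inet_hashtables.h */
        return &hashinfo->ehash[hash & hashinfo->ehash_mask];
}
/* the request socket's skc_nulls_node is then linked into the
 * hlist_nulls chain of that bucket */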

Figure 14: TCP request socket is added to table ehash.

(5) TCP SYN received (2nd)

Let's say now another TCP SYN packet is received from the same client. That might not be the most common case, as we are currently waiting for the client to send us a TCP ACK to complete the 3-way handshake. However, it might happen, as that 2nd TCP SYN could e.g. be a retransmission in case our SYN+ACK reply got lost on its way to the client. It is interesting to look at this case to see how the request socket is used here: At this point the request socket and our original listening socket are both present in the hash tables as shown in Figure 15. The TCP socket lookup finds our request socket in the ehash table based on the packet's 4-tuple + netns and thereby detects that the received packet is part of an ongoing connection request. Most of the packet handling after that happens in tcp_check_req() and tcp_rtx_synack(). In the most common case the kernel will here once again reply with another TCP SYN+ACK20).

Figure 15: TCP socket lookup for a 2nd TCP SYN received during handshake finds our matching request socket in ehash based on the packet's 4-tuple.

(6) TCP ACK received

Let's assume now a TCP ACK is received from the same client, which completes the TCP 3-way handshake. The request socket and our original listening socket are both present in the hash tables as shown in Figure 16. The TCP socket lookup finds our request socket in the ehash table based on the packet's 4-tuple21) and thereby detects that the received TCP ACK packet is part of an ongoing connection request.

Figure 16: TCP socket lookup for received TCP ACK finds our matching request socket in ehash based on the packet's 4-tuple.

Most of the packet handling which follows now is done in function tcp_check_req() and the functions called by it. Further checks are done, and if the packet turns out to be valid, a new full-blown TCP socket instance is created, which will represent the established TCP connection. This new instance of struct tcp_sock is cloned from our listening socket. Cloning here means that most of the internal member variables are copied to the new instance. Fittingly, this new socket is usually referred to as a child socket (a child of the listening socket). However, certain variables like the 4-tuple are copied from our request socket instead, see Figure 17. The state of this new socket is initially set to TCP_SYN_RECV.

Figure 17: New struct tcp_sock cloned from listening socket, 4-tuple taken from request socket.

Then the request socket is removed from table ehash and replaced by the newly created child socket, see Figure 18. The remaining handling now happens below tcp_child_process(). Among other things, the state of the new child socket is now changed to TCP_ESTABLISHED.

Figure 18: Request socket is being replaced by new child socket in table ehash.

(7) TCP PSH ACK received

If we now receive further TCP packets from the client, e.g. a TCP PSH ACK carrying payload data, the TCP socket lookup finds our new established child socket in table ehash as a match based on the 4-tuple, see Figure 19. If everything checks out, the TCP state machine then delivers the packet's payload data to the receive buffer of the child socket.

Figure 19: TCP socket lookup for received TCP PSH ACK finds our matching established child socket in ehash based on the packet's 4-tuple.

(8) Syscall accept()

I'll likely extend this article a little in the near future, adding details about how the accept queue of a listening TCP socket works.

Context

This article describes the source code and behavior of Linux Kernel v6.12.

Feedback

Feedback on this article is very welcome! Please be aware that I did not develop or contribute to any of the software components described here. I'm merely a developer who took a look at the source code and did some practical experimenting. If you find something which I might have misunderstood or described incorrectly here, I would be very grateful if you bring it to my attention, and of course I'll then fix my content asap.

published 2026-01-06, last modified 2026-01-06

2)
To be precise: the IPv4 or IPv6 variant of the Input hook, as both are implemented independently.
See my article Nftables - Packet flow and Netfilter hooks in detail.
4)
Granted, there is an intermediate layer of functions __inet_lookup_skb() or __inet6_lookup_skb() being called here, which first call inet_steal_sock() or inet6_steal_sock() respectively. However, that is only relevant for the early-demux feature, which I do not focus on here.
5)
To be precise, it usually uses a slab cache for that.
6)
which makes sense for one entry per 128KB: 32GB = 32768MB, and 32768 * 8 = 262144
7)
which makes sense for one entry per 2MB: 32GB = 32768MB, and 32768 / 2 = 16384
8)
Yes, of course you can also observe the sizes of the other two hash tables bhash and bhash2 in this way:
$ journalctl -b 0 -k -g 'TCP bind'
TCP bind hash table entries: 65536 (order: 9, 2097152 bytes, linear)
# meaning:
# 65536 buckets in *bhash  × 16 bytes each
# 65536 buckets in *bhash2 × 16 bytes each
# 65536 * 32 bytes = 2097152 bytes
9)
taking the 262144 entries which I observed to be allocated on my system here as an example
10)
The network namespace in which the packet has been received is also included in the hash calculation, as table ehash is a global table which holds sockets from more than one (or all) network namespaces, so it must be possible to distinguish between them.
11)
In case you need a refresher on how hash tables work… I described the inner workings of the hash table of the connection tracking system of the kernel in more detail in my article Connection tracking (conntrack) - Part 2: Core Implementation.
12) , 14)
in case of IPv4 (AF_INET)
13) , 15) , 17)
in case of TCP (SOCK_STREAM)
16)
Just as a side note, as those do not concern us within the scope of this article. Those entries, BTW, are not pointers to our socket, but management data related to it.
18)
In case of IPv6, the equivalent ipv6_portaddr_hash() function would have been used.
19)
to be precise, skc_nulls_node, as that's a union
20)
There's a security feature in place which prevents sending TCP SYN+ACK replies if multiple TCP SYNs are coming in at high frequency, as that would likely be a DoS attack.
21)
and netns… I won't continue to mention that explicitly each time