Sockets in the Linux Kernel - Part 1: L4 Protocol Demultiplexing on Rx
In this article series I would like to explore the implementation of sockets in the Linux kernel and the source code surrounding them. While most of my previous articles focused primarily on OSI Layer 3, this series will attempt a dive into OSI Layer 4.
Articles of the series
- Sockets in the Linux Kernel - Part 2: Socket Lookup on Rx (coming soon …)
Overview
As a prelude to this series, this article explores how the Linux kernel demultiplexes locally received network packets based on their OSI Layer 4 (transport layer) protocol. While this mechanism is not directly related to sockets, it is an adjacent topic which represents the “glue” between OSI Layers 3 and 4 on the packet receive path. The Layer 4 protocol encapsulated within a network packet (e.g. a TCP segment or a UDP datagram) is specified by a field in the Layer 3 protocol header. In case of IPv4 it is specified by the Protocol field (8 bits) within the IPv4 header and in case of IPv6 it is specified by the Next Header field (8 bits) in the IPv6 header. The protocol numbers used to identify L4 protocols are the same in IPv4 and in IPv6 1) and are maintained by IANA, which provides an official list of protocol numbers. Wikipedia also provides that same List of IP protocol numbers. You can further find this list in the file /etc/protocols on your Linux system, which is used by software components like glibc and certain user space network utilities. The Linux kernel implements lookup tables in which L4 protocol receive handler functions are registered. The L4 protocol numbers specified in the L3 headers of received network packets are used as array indices into these tables to select the correct handler function.
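To make the header fields mentioned above a bit more tangible, here is a minimal user space sketch which reads the L4 protocol number from a raw IPv4 or IPv6 header. The function name l4_proto_from_l3_header() and the PROTO_* constants are made up for this illustration and do not exist in the kernel. The byte offsets (9 for the IPv4 Protocol field, 6 for the IPv6 Next Header field) follow from the respective header layouts.

```c
#include <stddef.h>
#include <stdint.h>

/* A few IANA protocol numbers; the same values apply to IPv4 and IPv6. */
enum { PROTO_ICMP = 1, PROTO_TCP = 6, PROTO_UDP = 17, PROTO_GRE = 47 };

/*
 * Hypothetical helper (not a kernel function): return the L4 protocol
 * number carried in a raw IPv4 or IPv6 header, or -1 if the buffer is
 * too short or the version nibble is neither 4 nor 6.
 */
static int l4_proto_from_l3_header(const uint8_t *hdr, size_t len)
{
	if (len < 1)
		return -1;

	switch (hdr[0] >> 4) {          /* IP version: upper nibble of byte 0 */
	case 4:
		if (len < 20)           /* minimum IPv4 header length */
			return -1;
		return hdr[9];          /* IPv4 Protocol field */
	case 6:
		if (len < 40)           /* fixed IPv6 header length */
			return -1;
		return hdr[6];          /* IPv6 Next Header field */
	default:
		return -1;
	}
}
```

Called with a buffer that starts at the first byte of the IP header, the function returns e.g. 17 for a UDP datagram, independent of whether it is carried in IPv4 or IPv6.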
Rx Packet Flow
Figure 1 shows the network packet receive path in the Linux kernel with emphasis on the L3 receive paths of IPv4 and IPv6 and L4 protocol demultiplexing. Please compare this figure and the description in this section with its counterpart in my previous article Routing Decisions in the Linux Kernel - Part 1: Lookup and packet flow, where I describe what is happening on L2 and L3 on the receive path in more detail.
As you can see, early after a network packet has been received, a first demultiplexing step already happens on L2 based on the EtherType field in the Ethernet frame header, which selects the correct L3 protocol receive handler function for the packet. The principles of that demultiplexing step are similar to what I describe here, but are not in the scope of this article. Figure 1 only shows the L3 receive paths of the IPv4 and the IPv6 protocols. In case of IPv4, either the receive handler function ip_rcv() or its list counterpart ip_list_rcv() 2) is called, while in case of IPv6, the receive handler function ipv6_rcv() or its list counterpart ipv6_list_rcv() is called. Both receive paths are independent implementations which, however, perform roughly the same steps semantically: they let the received packet traverse the Netfilter Prerouting hook 3) and then do a routing lookup based on the destination IP address of the packet. In the case described here, the routing lookup returns a result of type RTN_LOCAL, which determines that the packet is to be received locally 4). Thus, in case of IPv4, function ip_local_deliver() is called and, in case of IPv6, function ip6_input() is called. Thereby the packet traverses either the IPv4 or the IPv6 variant of the Netfilter Input hook. Then, in case of IPv4, function ip_protocol_deliver_rcu() is called or, in case of IPv6, function ip6_protocol_deliver_rcu() is called. This is where the Protocol field (IPv4 header) or the Next Header field (IPv6 header) is examined and L4 protocol demultiplexing happens.
Figure 1 here merely shows the L4 receive handler functions of the most prominent L4 protocols TCP and UDP. In each of these receive handlers a socket lookup is being performed to find a matching destination socket for the received packet. I intend to describe that lookup in the 2nd article of this series (not yet published).
IPv4 L4 demux table
A global array named inet_protos is declared in the kernel which represents the L4 protocol demux table for IPv4, see Figure 2. It is an array of pointers to instances of struct net_protocol and has the fixed size of MAX_INET_PROTOS (256) elements. struct net_protocol is defined as follows:
struct net_protocol {
	int		(*handler)(struct sk_buff *skb);
	int		(*err_handler)(struct sk_buff *skb, u32 info);
	unsigned int	no_policy:1,
			icmp_strict_tag_validation:1;
	u32		secret;
};
Its function pointer member named handler is meant to point to an L4 protocol receive handler function.
A second function pointer member named err_handler is meant to point to an L4 protocol error handler function. The kernel calls it when an ICMP error message is received which refers to a packet of that L4 protocol, so that the protocol can react to the error, e.g. by notifying the affected socket. Each L4 protocol implementation is required to create and hold an instance of struct net_protocol for itself and initialize the function pointers with its respective handler functions.
Function inet_add_protocol() and its counterpart inet_del_protocol() are provided to register/unregister those instances of struct net_protocol within the demux table inet_protos.
To register, an L4 protocol needs to call inet_add_protocol() and provide its already initialized instance of struct net_protocol and an L4 protocol number as arguments. The protocol number is used as index in the array inet_protos[] and this element of the array is then set to point to the given struct net_protocol instance. The inverse happens when calling inet_del_protocol(), where the pointer at the specified index is set back to NULL.
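The register/unregister mechanics can be modeled in user space with a few lines of C. This is only a sketch and not the kernel code: the names model_add_protocol(), model_del_protocol(), struct proto_entry and proto_table are made up for illustration. It further assumes that a slot is only claimed while it is still empty and only cleared while it still holds the given instance, which is modeled here with a C11 atomic compare-and-exchange.

```c
#include <stdatomic.h>
#include <stddef.h>

#define MAX_PROTOS 256   /* mirrors MAX_INET_PROTOS */

struct sk_buff;          /* opaque stand-in for the kernel's packet buffer */

/* Reduced stand-in for struct net_protocol: only the receive handler. */
struct proto_entry {
	int (*handler)(struct sk_buff *skb);
};

/* The demux table: one atomically updated pointer per protocol number. */
static _Atomic(const struct proto_entry *) proto_table[MAX_PROTOS];

/* Claim the slot only if it is still empty; 0 on success, -1 otherwise. */
static int model_add_protocol(const struct proto_entry *prot, unsigned char num)
{
	const struct proto_entry *expected = NULL;

	return atomic_compare_exchange_strong(&proto_table[num], &expected, prot) ? 0 : -1;
}

/* Clear the slot only if it still holds 'prot'; 0 on success, -1 otherwise. */
static int model_del_protocol(const struct proto_entry *prot, unsigned char num)
{
	const struct proto_entry *expected = prot;
	const struct proto_entry *empty = NULL;

	return atomic_compare_exchange_strong(&proto_table[num], &expected, empty) ? 0 : -1;
}
```

In this model a second registration for an already occupied protocol number fails with -1, and so does an attempt to unregister an instance which is not the one currently stored in the slot.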
L4 protocols like TCP, UDP, ICMP and IGMP perform this registration step already during the early boot phase in function inet_init(). The unregister function never gets called for any of them, because those protocols are mandatory parts of the TCP/IP suite. Figure 2 shows the receive handler functions of those protocols.
Other L4 protocols are able to register/unregister later during kernel runtime. E.g. the GRE protocol will register/unregister itself with protocol number 47 when you load/unload its kernel module net/ipv4/gre.ko 5).
The following table lists the respective handler and error handler functions of several L4 protocols in the kernel (no guarantee for completeness):
But how is all this used on the packet receive path? As mentioned in the section above, L4 protocol demultiplexing for IPv4 happens in function ip_protocol_deliver_rcu(). The Protocol field value is already retrieved from the IPv4 header in the calling function ip_local_deliver_finish() and is used as index in the demux table to obtain the correct instance of struct net_protocol. Then an indirect function call using the handler function pointer is performed. Thus, the correct L4 protocol receive handler function is called:
/* simplified pseudo code */
struct net_protocol *ipprot;

ipprot = inet_protos[protocol];
if (ipprot) {
	ipprot->handler(skb);
} else {
	icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PROT_UNREACH, 0);
}
As you can see, in case the value of the Protocol field in the IPv4 header specifies an L4 protocol which is not registered in the demux table, an ICMP error reply of type destination unreachable (3) and code destination protocol unreachable (2) is sent.
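The lookup-and-dispatch step can also be tried out in user space. The following is a small model under simplifying assumptions, not kernel code: struct pkt, struct proto_entry, the model_* functions and the verdict values are all made up for illustration, and the “ICMP error” is merely represented by a return value.

```c
#include <stddef.h>

#define MAX_PROTOS 256

/* Minimal stand-in for the kernel's struct sk_buff. */
struct pkt {
	unsigned char protocol;   /* value of the IPv4 Protocol field */
};

/* Reduced stand-in for struct net_protocol. */
struct proto_entry {
	int (*handler)(struct pkt *p);
};

enum verdict { DELIVERED_TCP = 1, DELIVERED_UDP, PROT_UNREACH };

static int model_tcp_rcv(struct pkt *p) { (void)p; return DELIVERED_TCP; }
static int model_udp_rcv(struct pkt *p) { (void)p; return DELIVERED_UDP; }

static const struct proto_entry model_tcp = { .handler = model_tcp_rcv };
static const struct proto_entry model_udp = { .handler = model_udp_rcv };

/* Designated initializers: every slot which is not listed stays NULL. */
static const struct proto_entry *proto_table[MAX_PROTOS] = {
	[6]  = &model_tcp,
	[17] = &model_udp,
};

/* Use the protocol number as array index, then call through the pointer. */
static int model_deliver(struct pkt *p)
{
	const struct proto_entry *e = proto_table[p->protocol];

	if (e)
		return e->handler(p);
	return PROT_UNREACH;   /* the kernel would send an ICMP error here */
}
```

Because the Protocol field is 8 bits wide and the table has 256 elements, every possible field value is a valid index, so no bounds check is needed before the lookup.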
Some subtle details, which lead deeper into the rabbit hole:
If you take another look at the code of function ip_protocol_deliver_rcu(), then you'll see that xfrm4_policy_check() is called there under certain conditions. This is the IPsec “input policy”. It is only relevant in case you are decrypting an IPsec packet. My article Nftables - Netfilter and VPN/IPsec packet flow explains this in detail. If you take another look at the definition of struct net_protocol, then you'll see it also contains a bitfield with a bit named no_policy. L4 protocol implementations can set (1) or unset (0) this bit (it seems most of them set it to 1). If set, the IPsec “input policy” check is skipped in this function for that particular L4 protocol. Skipping it here gives L4 protocols the option to either do or not do this check later within their own receive handlers.
This is especially relevant for UDP, as things there are a little messy: The IPsec “input policy” is meant to be applied only to already decrypted packets and not to encrypted ones. In case of IPsec NAT-traversal, encrypted IPsec (ESP) packets are additionally encapsulated in UDP (port 4500). As a result, received UDP packets which carry IPsec with NAT-T come through here twice, first in still encrypted form, then a second time in decrypted (and decapsulated → tunnel mode) form. Thus, the UDP receive handler needs to do that IPsec “input policy” check itself at the appropriate time. Those details are handled in udp_queue_rcv_one_skb(), several call stack frames below udp_rcv().
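The effect of the no_policy bit can be sketched in user space as follows. This is a simplified model and not the kernel code: struct proto_entry is reduced to the one bit, model_policy_check() is a made-up stand-in for xfrm4_policy_check() whose verdict is simply supplied by the caller, and the counter policy_checks only exists to make visible whether the check ran at all.

```c
#include <stdbool.h>

/* Reduced stand-in for struct net_protocol: only the policy bit. */
struct proto_entry {
	unsigned int no_policy:1;
};

enum outcome { DROPPED_BY_POLICY, HANDLER_CALLED };

static int policy_checks;   /* how often the stand-in policy check ran */

/* Stand-in for xfrm4_policy_check(); 'verdict' is supplied by the caller. */
static bool model_policy_check(bool verdict)
{
	policy_checks++;
	return verdict;
}

/* With no_policy set, the check is skipped entirely at this stage. */
static enum outcome model_deliver(const struct proto_entry *e, bool verdict)
{
	if (!e->no_policy && !model_policy_check(verdict))
		return DROPPED_BY_POLICY;
	return HANDLER_CALLED;   /* the protocol's receive handler would run here */
}
```

A protocol with no_policy set to 1 reaches its handler regardless of the verdict and leaves policy_checks untouched, which corresponds to deferring the check to the protocol's own receive handler.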
IPv6 L4 demux table
The IPv6 implementation of L4 protocol demultiplexing is semantically very similar to its IPv4 counterpart. Here, a global array named inet6_protos is declared in the kernel which represents the L4 protocol demux table for IPv6, see Figure 3. It is an array of pointers to instances of struct inet6_protocol and has the fixed size of MAX_INET_PROTOS (256) elements. struct inet6_protocol is defined as follows:
struct inet6_protocol {
	int		(*handler)(struct sk_buff *skb);
	int		(*err_handler)(struct sk_buff *skb,
				       struct inet6_skb_parm *opt,
				       u8 type, u8 code, int offset,
				       __be32 info);
	unsigned int	flags;	/* INET6_PROTO_xxx */
	u32		secret;
};
Its function pointer member named handler is meant to point to an L4 protocol receive handler function.
A second function pointer member named err_handler is meant to point to an L4 protocol error handler function. The kernel calls it when an ICMPv6 error message is received which refers to a packet of that L4 protocol, so that the protocol can react to the error, e.g. by notifying the affected socket. Each L4 protocol implementation is required to create and hold an instance of struct inet6_protocol for itself and initialize the function pointers with its respective handler functions.
Function inet6_add_protocol() and its counterpart inet6_del_protocol() are provided to register/unregister those instances of struct inet6_protocol within the demux table inet6_protos.
To register, an L4 protocol needs to call inet6_add_protocol() and provide its already initialized instance of struct inet6_protocol and an L4 protocol number as arguments. The protocol number is used as index in the array inet6_protos[] and this element of the array is then set to point to the given struct inet6_protocol instance. The inverse happens when calling inet6_del_protocol(), where the pointer at the specified index is set back to NULL.
The following table lists the respective handler and error handler functions of several L4 protocols in the kernel (no guarantee for completeness). Some of those are IPv6 extension headers and no actual L4 protocols:
But how is all this used on the packet receive path? As mentioned in the section above, L4 protocol demultiplexing for IPv6 happens in function ip6_protocol_deliver_rcu(). The Next Header field value is retrieved from the IPv6 header and used as index in the demux table to obtain the correct instance of struct inet6_protocol. Then an indirect function call using the handler function pointer is performed. Thus, the correct L4 protocol receive handler function is called:
/* simplified pseudo code */
struct inet6_protocol *ipprot;

ipprot = inet6_protos[nexthdr];
if (ipprot) {
	ipprot->handler(skb);
} else {
	icmpv6_send(skb, ICMPV6_PARAMPROB, ICMPV6_UNK_NEXTHDR, nhoff);
}
As you can see, in case the value of the Next Header field in the IPv6 header specifies an L4 protocol which is not registered in the demux table, an ICMPv6 error reply of type parameter problem (4) and code unrecognized next header type (1) is sent.
Some subtle details, which lead deeper into the rabbit hole:
If you take another look at the code of function ip6_protocol_deliver_rcu(), then you'll see that xfrm6_policy_check() is called there under certain conditions. This is the IPv6 equivalent of the feature which I described in the info box for IPv4 above. Here it is implemented based on the flag INET6_PROTO_NOPOLICY, which is set in member flags of struct inet6_protocol, while for IPv4 it is based on the no_policy bit.
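The difference to the IPv4 bitfield is merely one of representation: a bit mask in an integer member instead of a named one-bit field. A minimal user space sketch of such a flag test could look like this. All names are made up for illustration, and the numeric flag values are assumptions chosen for this example, not taken from the kernel headers. MODEL6_OTHER stands for any second, unrelated flag.

```c
#include <stdbool.h>

/*
 * Illustrative flag bits (assumed values). MODEL6_NOPOLICY plays the
 * role of the kernel's INET6_PROTO_NOPOLICY; MODEL6_OTHER is just an
 * arbitrary second flag to show that the bits are independent.
 */
#define MODEL6_NOPOLICY 0x1u
#define MODEL6_OTHER    0x2u

/* Reduced stand-in for struct inet6_protocol: only the flags member. */
struct proto6_entry {
	unsigned int flags;
};

/* The policy check is due unless the "no policy" flag is set. */
static bool model6_needs_policy_check(const struct proto6_entry *e)
{
	return !(e->flags & MODEL6_NOPOLICY);
}
```

Other bits in flags do not influence the result, so a protocol can combine the “no policy” flag with further flags freely.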
Context
This article describes the source code and behavior of the Linux Kernel v6.12.
Feedback
Feedback on this article is very welcome! Please be aware that I did not develop or contribute to any of the software components described here. I'm merely a developer who took a look at the source code and did some practical experimenting. If you find something which I might have misunderstood or described incorrectly here, then I would be very grateful if you brought this to my attention, and of course I'll then fix my content accordingly as soon as possible.
References
published 2025-01-05, last modified 2025-01-05
RTN_UNICAST, the packet would be forwarded instead.
err_handler not implemented for this protocol. Thus, it seems the error handler is not mandatory.