Thermalcircle

climbing the thermals

blog:linux:routing_decisions_in_the_linux_kernel_2_caching

Differences

This shows you the differences between two versions of the page.

blog:linux:routing_decisions_in_the_linux_kernel_2_caching [2022-07-31] – old revision restored (2022-07-31) Andrej Stender
blog:linux:routing_decisions_in_the_linux_kernel_2_caching [2025-01-19] (current) – connected sockets in tx socket caching Andrej Stender
Line 35: Line 35:
 You've probably heard the term //routing cache// here and there, right? You've probably heard the term //routing cache// here and there, right?
 An actual full-blown //routing cache// was present in older kernels and An actual full-blown //routing cache// was present in older kernels and
-has been removed with v3.6. However, that does not mean that caching thereby +had been removed with v3.6. However, that does not mean that caching thereby 
-vanished completely. Other caching an optimization mechanisms regarding+vanished completely. Other caching and optimization mechanisms regarding
 routing decisions have been added to the kernel since that removal. routing decisions have been added to the kernel since that removal.
 Further, some already existing ones have been kept. The Internet Further, some already existing ones have been kept. The Internet
Line 77: Line 77:
 </figure> </figure>
  
-Now let's take a look at the local output path in kernel v3.5, illustrated in Figure {{ref>routingcache_send1}}. The cache and routing lookup nearly work the same way than on the receive path; however, there are some minor differences. +Now let's take a look at the local output path in kernel v3.5, illustrated in Figure {{ref>routingcache_send1}}. The cache and routing lookup nearly work the same way as on the receive path; however, there are some minor differences. 
 Like in the previous article, I again take function ''[[https://elixir.bootlin.com/linux/v3.5/source/net/ipv4/ip_output.c#L335|ip_queue_xmit()]]'' as an example, which is used when a TCP socket likes to send data on the network. It calls ''[[https://elixir.bootlin.com/linux/v3.5/source/include/net/route.h#L140|ip_route_output_ports()]]'', which in turn calls ''[[https://elixir.bootlin.com/linux/v3.5/source/net/ipv4/route.c#L2931|ip_route_output_flow()]]'', which in turn calls ''[[https://elixir.bootlin.com/linux/v3.5/source/net/ipv4/route.c#L2809|__ip_route_output_key()]]''. This is where the cache lookup is implemented. The lookup key is generated here based on data provided by the sending socket, like source and destination IP address, egress network interface index, IPv4 TOS field, ''%%skb->mark%%'' and the //network namespace//. Obviously locally generated packets which are to be sent out do not possess an ingress network interface. Thus, instead the egress network interface is being used here. But wait a minute. Isn't the egress network interface actually being determined by the routing lookup? So, how can it be an input parameter for the routing cache lookup? Same goes for the source IP address, which would be implicitly determined once the egress interface got determined. Well, both parameters can be predetermined in case the socket which sends this packet is bound to a specific network interface or to an IP address which is assigned to one of the network interfaces on this system. In all other cases it is of course the routing lookup itself which determines these parameters. Ok, let's get back to packet processing: In case of a cache miss, function ''[[https://elixir.bootlin.com/linux/v3.5/source/net/ipv4/route.c#L2616|ip_route_output_slow()]]'' is called, which in turn calls ''fib_lookup()'', same as on the receive path.  Based on its result then a routing decision object is allocated and initialized and then added to the routing cache. 
Like in the previous article, I again take function ''[[https://elixir.bootlin.com/linux/v3.5/source/net/ipv4/ip_output.c#L335|ip_queue_xmit()]]'' as an example, which is used when a TCP socket likes to send data on the network. It calls ''[[https://elixir.bootlin.com/linux/v3.5/source/include/net/route.h#L140|ip_route_output_ports()]]'', which in turn calls ''[[https://elixir.bootlin.com/linux/v3.5/source/net/ipv4/route.c#L2931|ip_route_output_flow()]]'', which in turn calls ''[[https://elixir.bootlin.com/linux/v3.5/source/net/ipv4/route.c#L2809|__ip_route_output_key()]]''. This is where the cache lookup is implemented. The lookup key is generated here based on data provided by the sending socket, like source and destination IP address, egress network interface index, IPv4 TOS field, ''%%skb->mark%%'' and the //network namespace//. Obviously locally generated packets which are to be sent out do not possess an ingress network interface. Thus, instead the egress network interface is being used here. But wait a minute. Isn't the egress network interface actually being determined by the routing lookup? So, how can it be an input parameter for the routing cache lookup? Same goes for the source IP address, which would be implicitly determined once the egress interface got determined. Well, both parameters can be predetermined in case the socket which sends this packet is bound to a specific network interface or to an IP address which is assigned to one of the network interfaces on this system. In all other cases it is of course the routing lookup itself which determines these parameters. Ok, let's get back to packet processing: In case of a cache miss, function ''[[https://elixir.bootlin.com/linux/v3.5/source/net/ipv4/route.c#L2616|ip_route_output_slow()]]'' is called, which in turn calls ''fib_lookup()'', same as on the receive path.  Based on its result then a routing decision object is allocated and initialized and then added to the routing cache.
 However, no matter whether the routing decision object had just been allocated or came from the routing cache, its attachment to the network packet is actually not handled as part of this whole routing lookup. That happens a few function call layers up in function ''[[https://elixir.bootlin.com/linux/v3.5/source/net/ipv4/ip_output.c#L380|ip_queue_xmit()]]''. This might seem like an irrelevant detail right now, but it will become relevant later in the sections below, you'll see. However, no matter whether the routing decision object had just been allocated or came from the routing cache, its attachment to the network packet is actually not handled as part of this whole routing lookup. That happens a few function call layers up in function ''[[https://elixir.bootlin.com/linux/v3.5/source/net/ipv4/ip_output.c#L380|ip_queue_xmit()]]''. This might seem like an irrelevant detail right now, but it will become relevant later in the sections below, you'll see.
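To make the lookup flow described in this hunk more tangible, here is a minimal userspace sketch in C. It is not kernel code and does not reproduce the v3.5 implementation. All names (''rt_key'', ''rt_cache'', ''route_output()'', ''demo_fib_lookup()'') are hypothetical, and the toy routing policy in ''demo_fib_lookup()'' is an assumption made purely for illustration. The sketch only shows the idea: build a key from what the sending socket provides, try the cache first, and fall back to a slow lookup whose result is then added to the cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lookup key, modelled on the fields named in the text. */
struct rt_key {
    uint32_t saddr;     /* source IPv4 address, 0 if the socket is not bound */
    uint32_t daddr;     /* destination IPv4 address */
    int oif;            /* egress interface index, 0 if the socket is not bound */
    uint8_t tos;        /* IPv4 TOS field */
    uint32_t mark;      /* packet mark */
    const void *netns;  /* stands in for the network namespace */
};

/* A cached "routing decision"; here it only records the egress interface. */
struct rt_entry {
    struct rt_key key;
    int result_oif;
    struct rt_entry *next;
};

#define RT_BUCKETS 64
#define RT_POOL 128

/* Must be zero-initialized before first use. */
struct rt_cache {
    struct rt_entry *bucket[RT_BUCKETS];
    struct rt_entry pool[RT_POOL];  /* fixed pool, entries are never freed */
    size_t used;
    unsigned slow_lookups;          /* counts cache misses */
};

static bool rt_key_equal(const struct rt_key *a, const struct rt_key *b)
{
    return a->saddr == b->saddr && a->daddr == b->daddr &&
           a->oif == b->oif && a->tos == b->tos &&
           a->mark == b->mark && a->netns == b->netns;
}

static unsigned rt_hash(const struct rt_key *k)
{
    uint32_t h = k->daddr ^ (k->saddr * 31u) ^
                 ((uint32_t)k->oif * 131u) ^ k->mark ^ k->tos;
    h ^= h >> 16;
    return h % RT_BUCKETS;
}

/* Toy stand-in for the slow routing lookup (assumption, not real policy):
 * a pre-bound egress interface wins, otherwise 10.0.0.0/8 goes out via
 * interface 2 and everything else via interface 1. */
static int demo_fib_lookup(const struct rt_key *k)
{
    if (k->oif != 0)
        return k->oif;
    return (k->daddr >> 24) == 10 ? 2 : 1;
}

/* Cache lookup first; on a miss, run the slow path and cache its result. */
static int route_output(struct rt_cache *c, const struct rt_key *k,
                        int (*slow_path)(const struct rt_key *))
{
    assert(c && k && slow_path);

    unsigned b = rt_hash(k);
    for (struct rt_entry *e = c->bucket[b]; e; e = e->next)
        if (rt_key_equal(&e->key, k))
            return e->result_oif;           /* cache hit */

    c->slow_lookups++;                      /* cache miss */
    int oif = slow_path(k);
    if (c->used < RT_POOL) {                /* add result to the cache */
        struct rt_entry *e = &c->pool[c->used++];
        e->key = *k;
        e->result_oif = oif;
        e->next = c->bucket[b];
        c->bucket[b] = e;
    }
    return oif;
}
```

Calling ''route_output()'' twice with the same key on a zero-initialized ''rt_cache'' triggers the slow path only once. A key with a preset ''oif'' (a socket bound to an interface) is a different key and therefore a separate cache entry.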
Line 276: Line 276:
 ===== Socket Caching (TX) ===== ===== Socket Caching (TX) =====
 This caching feature is being used on the local output path; thus, only for sending locally This caching feature is being used on the local output path; thus, only for sending locally
-generated packets, not for received or forwarded packets. When a socket on the system +generated packets, not for received or forwarded packets. When a connected/established 
-(e.g. TCP or UDP socket, client or server)+socket on the system (e.g. TCP or UDP socket, client or server)
 is sending data on the network, then it is sufficient to only do a routing lookup is sending data on the network, then it is sufficient to only do a routing lookup
-for the very first packet it sends. After all, the destination IP address is the same+for the very first packet it sends. After all, the destination IP address will be the same
 for all following packets it sends and so is the routing decision. Thus, the //routing for all following packets it sends and so is the routing decision. Thus, the //routing
 decision// object, once obtained from the initial lookup, is being cached within the decision// object, once obtained from the initial lookup, is being cached within the
Line 327: Line 327:
  
 <figure early_demux1> <figure early_demux1>
-{{ :linux:early_demux_01.png?nolink |}} +{{ :linux:early_demux_01.png?direct |}} 
-<caption>//IP early demultiplexing// for TCP/UDP sockets before Routing Lookup on receive path.</caption>+<caption>//IP early demultiplexing// for TCP/UDP sockets before Routing Lookup on the receive path (click to enlarge).</caption>
 </figure> </figure>
  
Line 367: Line 367:
 an existing socket on the system reaches //established// state an existing socket on the system reaches //established// state
 and this is done within OSI layer 4 handling, e.g. for TCP and this is done within OSI layer 4 handling, e.g. for TCP
-in function ''[[https://elixir.bootlin.com/linux/v5.14.7/source/net/ipv4/tcp_input.c#L5742|tcp_rcv_established()]]''. Sysctls are implemented +in  ''[[https://elixir.bootlin.com/linux/v5.14.7/source/net/ipv4/tcp_input.c#L5742|tcp_rcv_established()]]''.  
-on network namespace level to switch this feature on/off either globally  +I described this feature here for IPv4. An equivalent implementation 
-for IPv4 within the network namespace or just for TCP and/or UDP (default: on).+exists on the IPv6 receive path. Sysctls are implemented on network 
 +namespace level to switch this feature on/off either globally for IPv4 
 +and IPv6((Yes, despite their naming, the //sysctls// listed below represent the on/off switches 
 +for //early demux// for both IPv4 and IPv6!)) within the network namespace or just for TCP and/or UDP (default: on).
  
 <code bash> <code bash>
Line 391: Line 394:
  
 <figure ip_list_rcv_finish1> <figure ip_list_rcv_finish1>
-{{ :linux:ip_list_rcv_finish_01.png?nolink |}} +{{ :linux:ip_list_rcv_finish_01.png?direct |}} 
-<caption>List-based packet handling on the receive path.</caption>+<caption>List-based packet handling on the receive path (click to enlarge).</caption>
 </figure> </figure>
  
Line 413: Line 416:
  
 <figure hint_caching1> <figure hint_caching1>
-{{ :linux:hint_caching_01.png?nolink |}} +{{ :linux:hint_caching_01.png?direct |}} 
-<caption>Using previous packet as //hint// and check, whether its attached //routing decision// can be re-used.</caption>+<caption>Using previous packet as //hint// and check, whether its attached //routing decision// can be re-used (click to enlarge).</caption>
 </figure> </figure>
  
Line 440: Line 443:
  
 ===== Flowtables ===== ===== Flowtables =====
-[[https://www.kernel.org/doc/html/latest/networking/nf_flowtable.html|Flowtables]] is a software (and under certain circumstances also hardware) fastpath mechanism implemented by the //Netfilter// developers, which speeds up forwarded network packets by making them skip a big part of the normal slowpath handling. It is set up by Nftables rules, is based on the //[[connection_tracking_1_modules_and_hooks|connection tracking]]// feature and caches //routing decisions// in a hash table. Lookup into that table and thereby fastpath/slowpath demultiplexing for a network packet happens early on reception in the //[[nftables_packet_flow_netfilter_hooks_detail|Netfilter Ingress hook]]//. I'll describe //Flowtables// in detail in a later article. A link will be placed here once that article has been published. +//Flowtables// is a //software// (and under certain circumstances also //hardware//) //fastpath// mechanism implemented by the //Netfilter// developers, which speeds up forwarded network packets by making them skip a big part of the normal network stack ("slowpath") handling. It is set up by Nftables rules, is based on the //[[connection_tracking_1_modules_and_hooks|connection tracking]]// feature and caches //routing decisions// in a hash table. Lookup into that table and thereby fastpath/slowpath demultiplexing for a network packet happens early on reception in the //Netfilter Ingress hook//. I describe //Flowtables// in detail in a separate article series starting with [[flowtables_1_a_netfilter_nftables_fastpath|Flowtables - Part 1: A Netfilter/Nftables Fastpath]]. 
 ===== Cache Invalidation ===== ===== Cache Invalidation =====
 What actually happens in case something changes, like a new entry being added to a routing table What actually happens in case something changes, like a new entry being added to a routing table
Line 490: Line 494:
 //generation identifier// is simply set to the current value of the global //generation identifier// is simply set to the current value of the global
 //generation identifier//. The integer ''[[https://elixir.bootlin.com/linux/v5.14.7/source/include/net/dst.h#L55|obsolete]]'' also plays a role here. Both functions set it to ''[[https://elixir.bootlin.com/linux/v5.14.7/source/include/net/dst.h#L58|DST_OBSOLETE_FORCE_CHK]]''. This value specifies //generation identifier//. The integer ''[[https://elixir.bootlin.com/linux/v5.14.7/source/include/net/dst.h#L55|obsolete]]'' also plays a role here. Both functions set it to ''[[https://elixir.bootlin.com/linux/v5.14.7/source/include/net/dst.h#L58|DST_OBSOLETE_FORCE_CHK]]''. This value specifies
-that the //generation identifier// mechanism has be be used for that instance.+that the //generation identifier// mechanism has to be used for that instance.
 Thus, when a //routing decision// is about to be used once more (e.g. when it is about to be Thus, when a //routing decision// is about to be used once more (e.g. when it is about to be
 attached to yet another skb), its //generation identifier// value needs to be compared attached to yet another skb), its //generation identifier// value needs to be compared
 to the global one, which is done by function ''[[https://elixir.bootlin.com/linux/v5.14.7/source/net/ipv4/route.c#L399|rt_is_expired()]]''. If both are unequal, the cache entry is considered expired. to the global one, which is done by function ''[[https://elixir.bootlin.com/linux/v5.14.7/source/net/ipv4/route.c#L399|rt_is_expired()]]''. If both are unequal, the cache entry is considered expired.
-In case of e.g. the //nexthop// caching mechanism described above, that function is wrapped inside function ''[[https://elixir.bootlin.com/linux/v5.14.7/source/net/ipv4/route.c#L1567|rt_cache_valid()]]'', which first checks checks the value of member ''obsolete'':+In case of e.g. the //nexthop// caching mechanism described above, that function is wrapped inside function ''[[https://elixir.bootlin.com/linux/v5.14.7/source/net/ipv4/route.c#L1567|rt_cache_valid()]]'', which first checks the value of member ''obsolete'':
  
 <code c> <code c>
Line 553: Line 557:
 to my attention and of course I'll then fix my content asap accordingly.  to my attention and of course I'll then fix my content asap accordingly. 
  
-//published 2022-07-31//, //last modified 2022-07-31//+//published 2022-07-31//, //last modified 2025-01-19//
  
blog/linux/routing_decisions_in_the_linux_kernel_2_caching.1659291563.txt.gz · Last modified: 2022-07-31 by Andrej Stender