Thermalcircle.de

climbing the thermals

blog:linux:nftables_ipsec_packet_flow

Differences

This shows you the differences between two versions of the page.

blog:linux:nftables_ipsec_packet_flow [2021-06-29] – added publishing date Andrej Stender
blog:linux:nftables_ipsec_packet_flow [2022-08-14] (current) – added details about xfrm bundle Andrej Stender
Line 1: Line 1:
-{{tag>linux netfilter nftables ipsec strongswan charon swanctl xfrm}}+{{tag>linux kernel netfilter nftables ipsec strongswan charon swanctl xfrm}}
 ====== Nftables - Netfilter and VPN/IPsec packet flow ====== ====== Nftables - Netfilter and VPN/IPsec packet flow ======
 ~~META: ~~META:
 date created = 2020-05-30 date created = 2020-05-30
 ~~ ~~
- 
-~~NOTOC~~ 
  
 In this article I would like to explain how the packet flow through  In this article I would like to explain how the packet flow through 
Line 31: Line 29:
 so-called //Security Associations// (SAs). Usually there are (at least) three SAs negotiated for each VPN tunnel connection: The IKE_SA which, once established, represents the secured communication channel for IKE itself and (at least) two more CHILD_SAs, one for each data flow direction, which represent the secured communication channels for packets which shall flow through the VPN tunnel.  so-called //Security Associations// (SAs). Usually there are (at least) three SAs negotiated for each VPN tunnel connection: The IKE_SA which, once established, represents the secured communication channel for IKE itself and (at least) two more CHILD_SAs, one for each data flow direction, which represent the secured communication channels for packets which shall flow through the VPN tunnel. 
  
-In addition to the SAs, IPsec also introduces the concept of the so-called //Security Policies// (SPs), which are also created during IKE handshake. Those are either defined by the IPsec tunnel configuration provided by the admin/user and/or (depending on case) can also at least partly result from dynamic IKE negotiation. The purpose of the SPs is to act as "traffic selectors" on each VPN endpoint to decide which network packet shall travel through the VPN tunnel and which not. Usually the SPs make this distinction based on source and destination IP addresses (/subnets) of the packets, but additional parameters (e.g. protocols, port numbers, ...) can also be considered.+In addition to the SAs, IPsec also introduces the concept of the so-called //Security Policies// (SPs), which are also created during IKE handshake. Those are either defined by the IPsec tunnel configuration provided by the admin/user and/or (depending on case) can also at least partly result from dynamic IKE negotiation. The purpose of the SPs is to act as "traffic selectors" on each VPN endpoint to decide which network packet shall travel through the VPN tunnel and which not. Usually the SPs make this distinction based on source and destination IP addresses (/subnets) of the packets, but additional parameters (e.g. protocols, port numbers, ...) can also be considered. If a packet shall travel through the VPN tunnel, the SP further specifies which SA is to be applied.
  
 Be aware that both SAs and SPs merely are volatile and not persistent data. Their lifetime is defined by the lifetime of the VPN tunnel connection. It might even be shorter because of key re-negotiations / "rekeying". Be aware that both SAs and SPs merely are volatile and not persistent data. Their lifetime is defined by the lifetime of the VPN tunnel connection. It might even be shorter because of key re-negotiations / "rekeying".
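
To make the "traffic selector" role of an SP more tangible, here is a minimal Python sketch. It is only a conceptual model and not how the kernel implements it: the class ''SecurityPolicy'', the function ''select_sa()'', the ''sa_id'' handle and the example subnets are all hypothetical names and values chosen for illustration. A policy matches on source and destination subnet and, on a match, names the SA to apply.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Iterable, Optional


@dataclass(frozen=True)
class SecurityPolicy:
    """Hypothetical, simplified SP: a traffic selector plus the SA it refers to."""
    src_net: str   # source subnet the selector matches on
    dst_net: str   # destination subnet the selector matches on
    sa_id: int     # illustrative handle of the SA to apply on a match


def select_sa(policies: Iterable[SecurityPolicy], src: str, dst: str) -> Optional[int]:
    """Return the SA handle of the first matching policy, or None if no policy matches."""
    for sp in policies:
        if ip_address(src) in ip_network(sp.src_net) and ip_address(dst) in ip_network(sp.dst_net):
            return sp.sa_id
    return None  # packet does not travel through the VPN tunnel


# Example subnets are made up for this sketch.
policies = [SecurityPolicy("192.168.1.0/24", "192.168.2.0/24", sa_id=1)]
print(select_sa(policies, "192.168.1.10", "192.168.2.20"))  # packet selected for the tunnel
print(select_sa(policies, "192.168.1.10", "203.0.113.5"))   # packet stays outside the tunnel
```

A real SP can additionally match on protocols and port numbers, which this sketch leaves out.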
Line 39: Line 37:
  
 | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|tcp|payload|</code></html>((Of course, in the actual packet there is no "blank space" between the ''eth'' (ethernet) and the ''ip'' header. I just show it like this here to emphasize WHERE in the packet the ESP header and the outer IP header are being inserted.)) | A "normal" packet which shall travel through the VPN tunnel, is encrypted and encapsulated like this while traversing the VPN gateway. | | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|tcp|payload|</code></html>((Of course, in the actual packet there is no "blank space" between the ''eth'' (ethernet) and the ''ip'' header. I just show it like this here to emphasize WHERE in the packet the ESP header and the outer IP header are being inserted.)) | A "normal" packet which shall travel through the VPN tunnel, is encrypted and encapsulated like this while traversing the VPN gateway. |
-| <html><code>|eth|<span style="color: darkgreen;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|tcp|payload|</span></code></html>((The darker grey background here shall indicate that this is the part of the whole packet which gets encrypted.))((There is one further detail which I intentionally omit here because it is not relevant for the topic at hand: The ESP protocol in reality does not only possess a //header// but also a //trailer// part which comes after the payload. So, to be completely accurate, the ESP encapsulation in reality looks like this: ''|eth|ip|esp-header|ip|tcp|payload|esp-trailer|''.)) | :::  |+| <html><code>|eth|<span style="color: red;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|tcp|payload|</span></code></html>((The darker grey background here shall indicate that this is the part of the whole packet which gets encrypted.))((There is one further detail which I intentionally omit here because it is not relevant for the topic at hand: The ESP protocol in reality does not only possess a //header// but also a //trailer// part which comes after the payload. So, to be completely accurate, the ESP encapsulation in reality looks like this: ''|eth|ip|esp-header|ip|tcp|payload|esp-trailer|''.)) | :::  |
  
 If //Nat-traversal// is active, then ESP is additionally encapsulated in UDP: If //Nat-traversal// is active, then ESP is additionally encapsulated in UDP:
  
 | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|tcp|payload|</code></html> |In case of //Nat-traversal// additional encapsulation in UDP (same port ''4500'' as for IKE is then used here). | | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|tcp|payload|</code></html> |In case of //Nat-traversal// additional encapsulation in UDP (same port ''4500'' as for IKE is then used here). |
-| <html><code>|eth|<span style="color: darkgreen;">ip</span>|udp|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|tcp|payload|</span></code></html> | :::  |+| <html><code>|eth|<span style="color: red;">ip</span>|udp|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|tcp|payload|</span></code></html> | :::  |
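
The two header layouts above can be reproduced with a few lines of Python. This is just a sketch of the header ordering in tunnel-mode (it does no encryption and omits the ESP trailer); the function name ''encapsulate()'' is a hypothetical helper invented for this illustration.

```python
def encapsulate(layers, nat_traversal=False):
    """Return the header sequence after tunnel-mode ESP encapsulation.

    `layers` is the original sequence, e.g. ["eth", "ip", "tcp", "payload"].
    The new outer headers are inserted between the ethernet header and the
    original (inner) IP header.
    """
    eth, inner = layers[0], layers[1:]
    # With NAT-traversal, ESP is additionally wrapped in UDP.
    outer = ["ip", "udp", "esp"] if nat_traversal else ["ip", "esp"]
    return [eth] + outer + inner


def render(layers):
    """Format a header sequence in the |a|b|c| notation used in the tables above."""
    return "|" + "|".join(layers) + "|"


plain = ["eth", "ip", "tcp", "payload"]
print(render(encapsulate(plain)))                      # |eth|ip|esp|ip|tcp|payload|
print(render(encapsulate(plain, nat_traversal=True)))  # |eth|ip|udp|esp|ip|tcp|payload|
```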
  
  
Line 54: Line 52:
 {{:linux:linux-ipsec-impl1.png?direct&700|}} {{:linux:linux-ipsec-impl1.png?direct&700|}}
 <caption> <caption>
-Block diagram showing userspace and kernel part of IPsec implementation +Block diagram showing userspace and kernel part of the IPsec implementation 
-on Linux (StrongSwan and Xfrm framework) and interfaces between both.+on Linux (StrongSwan and Xfrm framework) and interfaces between both.
 </caption> </caption>
 </figure> </figure>
Line 84: Line 82:
 which then feeds this config into the //charon// daemon via the //Vici// IPC interface. which then feeds this config into the //charon// daemon via the //Vici// IPC interface.
 Further, the syntax of ''swanctl.conf'' looks slightly different than the syntax Further, the syntax of ''swanctl.conf'' looks slightly different than the syntax
-of ''ipsec.conf'' from the //Stroke// interface, but the semantics are the same. +of ''ipsec.conf'' from the //Stroke// interface, but the semantics are nearly the same. 
  
 An additional config file ''/etc/strongswan.conf''((and it includes a bunch of further files)) exists, which contains general/global strongswan settings, which are not directly related to individual VPN connections. An additional config file ''/etc/strongswan.conf''((and it includes a bunch of further files)) exists, which contains general/global strongswan settings, which are not directly related to individual VPN connections.
Line 92: Line 90:
 ==== The Xfrm framework ==== ==== The Xfrm framework ====
 The so-called //Xfrm framework// is a component within the Linux kernel. The so-called //Xfrm framework// is a component within the Linux kernel.
-As the man page ''man 8 ip-xfrm'' states, it is an //"IP framework for transforming packets +As one of the //iproute2// man pages((''man 8 ip-xfrm'')) states, 
-(such as encrypting their payloads)"//. Thus "Xfrm" stands for "transform".+it is an //"IP framework for transforming packets 
 +(such as encrypting their payloads)"//. Thus "Xfrm" stands for "transform".
 While the userspace part (Strongswan) handles the overall IPsec orchestration and While the userspace part (Strongswan) handles the overall IPsec orchestration and
 runs the IKEv1/IKEv2 protocol to buildup/teardown VPN tunnels/connections, runs the IKEv1/IKEv2 protocol to buildup/teardown VPN tunnels/connections,
 the kernel part is responsible for encrypting+encapsulating and decrypting+decapsulating the kernel part is responsible for encrypting+encapsulating and decrypting+decapsulating
 network packets which travel through the VPN tunnel and to select/decide which packets network packets which travel through the VPN tunnel and to select/decide which packets
-go through the VPN tunnel at all. To do that, it requires +go through the VPN tunnel at all. To do that, it requires all SA and SP 
-all SA and SP instances which define the VPN tunnel/connection to +instances which define the VPN tunnel/connection to be present within the kernel. 
-be present in form of data structures within the kernel. Only then it can +Only then it can make decisions on which packet shall be encrypted/decrypted 
-make decisions on which packet shall be encrypted/decrypted and which not +and which not and which encryption algorithms and keys to use.
-and which encryption algorithms and keys to use.+
  
 The Xfrm framework implements the so-called //Security Association Database// (SAD) The Xfrm framework implements the so-called //Security Association Database// (SAD)
-and the //Security Policy Database// (SPD) for holding SA and SP instances in the kernel. +and the //Security Policy Database// (SPD) for holding SA and SP instances.
-Userspace components (like //Strongswan//, the //iproute2// tool collection and others) can use a //Netlink// socket to talk to the kernel and to show/create/adjust/delete SA and SP instances in the SAD and SPD. You can e.g. use the //iproute2// tool ''ip'' to show the SA and SP instances which currently exist in those databases:+An SA is represented by ''[[https://elixir.bootlin.com/linux/v5.10.46/source/include/net/xfrm.h#L149|struct xfrm_state]]'' and an SP by ''[[https://elixir.bootlin.com/linux/v5.10.46/source/include/net/xfrm.h#L498|struct xfrm_policy]]'' in the kernel. 
 +Userspace components (like //Strongswan//, the //iproute2// tool collection and others) use a //Netlink// socket to communicate with the kernel to show/create/modify/delete SA and SP instances in the SAD and SPD. You can e.g. use the //iproute2// tool ''ip'' to show the SA and SP instances which currently exist in those databases:
  
   * command ''ip xfrm state'' shows SA instances   * command ''ip xfrm state'' shows SA instances
   * command ''ip xfrm policy'' shows SP instances   * command ''ip xfrm policy'' shows SP instances
  
-You can even use ''ip'' as a low-level config tool to create/delete SA and SP instances. There is a [[https://backreference.org/2014/11/12/on-the-fly-ipsec-vpn-with-iproute2/|very good article]] which explains how to do that. However in practice you leave the duty of creating/deleting SA and SP instances to //Strongswan//.+You can further use ''ip'' as a low-level config tool to create/delete SA and SP instances. There is a [[https://backreference.org/2014/11/12/on-the-fly-ipsec-vpn-with-iproute2/|very good article]] which explains how to do that. However in practice you leave the duty of creating/deleting SA and SP instances to //Strongswan//.
  
 SP instances can be created for three different "data directions": SP instances can be created for three different "data directions":
Line 117: Line 116:
 | "output policy"  | ''dir out'' | SP works as a selector on outgoing packets to select which are to be encrypted+encapsulated and which not | | "output policy"  | ''dir out'' | SP works as a selector on outgoing packets to select which are to be encrypted+encapsulated and which not |
 | "input policy"   | ''dir in''  | SP works as a selector on incoming packets which already have been decrypted+decapsulated and have a destination IP local on the system | | "input policy"   | ''dir in''  | SP works as a selector on incoming packets which already have been decrypted+decapsulated and have a destination IP local on the system |
-| "forward policy" | ''dir fwd'' | SP works as a selector on incoming packets which already have been decrypted+decapsulated and have a destination IP which is not local, thereby packets which are to be forwarded (routed) | +| "forward policy" | ''dir fwd'' | SP works as a selector on incoming packets which already have been decrypted+decapsulated and have a destination IP which is not local, thereby packets which are to be forwarded |
- +
-Due to the fact that IPsec is a mandatory part of the IPv6 protocol (but is +
-also available for the IPv4 protocol), the implementation of the Xfrm +
-framework or the "IPsec stack" is very interwoven with the implementation of +
-the IPv4 and IPv6 protocols in the kernel which makes things very complex when +
-you look into details. I saw statements which claim that the Xfrm +
-framework is the most complex part of the entire network stack in the Linux +
-kernel. +
- +
-So, what I describe in the following is a "simplified view" on how things work. It is a +
-sufficient model to understand how the network packet flow works and how the +
-Xfrm framework relates to the Netfilter framework (Netfilter +
-and Xfrm are implemented independently from each other in the kernel!).+
  
 If you are working with Nftables or Iptables, then you probably are If you are working with Nftables or Iptables, then you probably are
Line 136: Line 122:
 [[:linux:netfilter_packet_flow_image|Netfilter packet flow image]], [[:linux:netfilter_packet_flow_image|Netfilter packet flow image]],
 which illustrates the packet flow through the Netfilter hooks and which illustrates the packet flow through the Netfilter hooks and
-Iptables //chains//. One great thing about this image is, that it covers +Iptables //chains//. One great thing about this image is, that it also covers 
-the Xfrm framework, too (at least from the bird's eye view). It illustrates +the Xfrm framework. It illustrates four distinct "Xfrm actions"
-four distinct Xfrm "decision +in the network packet flow path, named //xfrm/socket lookup//, //xfrm decode//, 
-points"((I intentionally call them "decision points" and not "hooks" here, +//xfrm lookup// and //xfrm encode//. However, this illustration is 
-because "hooks" is a term which is used in the Netfilter framework.)) in +kind of a bird's eye view. These four "actions" do not resemble the 
-the network packet flow path and shows clearly where those are located in relation  +actual Xfrm implementation very closely. The actual framework 
-to the Iptables //chains//. In Figure {{ref>nfhooksxfrm1}} I created a simplified +works a little bit differently, which means that there actually are  
-version of this image, which only shows the Netfilter hooks (blue boxes) and the Xfrm  +more than four points within the packet flow path where Xfrm takes 
-"decision points" (grey boxes), and thereby is not focused on the subtle +action and also the locations of those are a little bit different. 
-differences between Iptables and Nftables. Because my focus is +Figure {{ref>nfhooksxfrm1}} represents a simplified view of the packet flow 
-on the Xfrm framework I added a fifth "decision point" here, which is +with main focus on Netfilter and the Xfrm framework. 
-not shown in the original [[:linux:netfilter_packet_flow_image|Netfilter packet flow image]], but I know that it exists from reading in the +It shows the Netfilter hooks in blue color and the locations where 
-kernel source code:+the Xfrm framework takes action in magenta color. 
 +If you are not yet familiar with the Netfilter hooks and their relation 
 +to Nftables/Iptables, then please take a look at my other article [[nftables_packet_flow_netfilter_hooks_detail|Nftables - Packet flow and 
 +Netfilter hooks in detail]] before proceeding here.
  
 <figure nfhooksxfrm1> <figure nfhooksxfrm1>
-{{:linux:nf-hooks-xfrm1.png?direct&700|}} +{{:linux:packet-flow-ipsec-tunnel.png?direct&700|}} 
-<caption>Block diagram of Netfilter hooks and Xfrm decision points</caption>+<caption>Block diagram of Netfilter hooks and Xfrm actions in IPsec tunnel-mode (click to enlarge)</caption>
 </figure> </figure>
  
-I'll explain the Xfrm "decision points" here, while assuming that you +| <WRAP>{{:linux:routing-step.png?nolink |}} The routing lookup is performed for incoming as well as local outgoing packets, see Figure {{ref>nfhooksxfrm1}}. Function ''[[https://elixir.bootlin.com/linux/v5.10.46/source/include/net/ip_fib.h#L363|fib_lookup()]]'' performs the actual lookup into the policy routing rules and routing tables. The routing decision resulting from this lookup is attached to the traversing network packet (skb). 
-are already familiar with the Netfilter hooks which I covered in +It is an instance of two combined structs, the outer ''[[https://elixir.bootlin.com/linux/v5.10.46/source/include/net/route.h#L49|struct rtable]]'' and the inner ''[[https://elixir.bootlin.com/linux/v5.10.46/source/include/net/dst.h#L24|struct dst_entry]]''. Together, both contain information like the output network interface, the IP address of the next hop gateway (if existing), function pointers which determine the path this packet takes through the remaining part of the kernel network stack, and more. My [[routing_decisions_in_the_linux_kernel_1_lookup_packet_flow|article series on routing]] explains that in detail. At first glance, routing has nothing to do with the Xfrm framework, but it is relevant in this context as you will see below.</WRAP> |
-much detail in my other article +| <WRAP>{{:linux:xfrm-action-policy-out.png?nolink |}} This action is performed for forwarded as well as for local outgoing packets after the routing lookup, see Figure {{ref>nfhooksxfrm1}} and function ''[[https://elixir.bootlin.com/linux/v5.10.46/source/net/xfrm/xfrm_policy.c#L3183|xfrm_lookup()]]''. The Xfrm framework performs a lookup into the IPsec SPD, searching for a matching output policy (''dir out'' SP). If no matching policy is found, the network packet stays unchanged and simply continues on its way. If a matching policy is found, a lookup into the SAD is performed to resolve an SA which corresponds to the matching SP (shown as an attached magenta box named //Xfrm lookup state//). If the resolved SA specifies tunnel-mode, then yet another routing lookup is performed, this time for the (future) outer IPv4 packet which will later encapsulate the current packet. The actual packet transformation does not yet happen at this point. Instead, a "bundle" of transformation instructions for this packet is assembled. The term "bundle" stems from the kernel source code and refers to a bunch of struct instances pointing to each other. Among those are the original routing decision of this packet, the SP, the SA((there can be more than one SA being applied to a packet, but that is a less common case)), the routing decision for the future outer IP packet and more. Those are usually assembled around one or several instances of ''[[https://elixir.bootlin.com/linux/v5.10.46/source/include/net/xfrm.h#L925|struct xfrm_dst]]''. The bundle is attached to the network packet (skb), replacing the originally attached routing decision. Function pointers within the bundle ensure that the packet takes a different path through the remaining part of the kernel network stack. Figure {{ref>xfrm_dst}} shows what a bundle actually looks like. 
-[[nftables_packet_flow_netfilter_hooks_detail|Nftables - Packet flow and +</WRAP> |
-Netfilter hooks in detail]] (if not, please read that article first).+| <WRAP>{{:linux:xfrm-action-encode.png?nolink |}} This is where packets which shall travel through the VPN tunnel are being encrypted and encapsulated. The Xfrm framework transforms a packet according to the instructions in the "bundle" attached to it. A function pointer within the bundle makes sure that the packet takes a "detour" into this transformation code after traversing the Netfilter //Postrouting// hook. For IPv4 packets, the entry function which leads the packet on this path is ''[[https://elixir.bootlin.com/linux/v5.10.46/source/net/ipv4/xfrm4_output.c#L31|xfrm4_output()]]''. In case of tunnel-mode this transformation means encapsulating the IP packet into a new outer IP packet and then encapsulating the inner IP packet into ESP protocol and encrypting it and its payload. After completion of the transformation, the xfrm components/instructions are being removed from the "bundle", leaving only the routing decision for the outer IP packet attached to the packet.</WRAP> | 
 +| <WRAP>{{:linux:xfrm-action-decode.png?nolink |}} This is where packets which have been received through the VPN tunnel are being decrypted and decapsulated. If an IP packet on the local input path contains an ESP packet, then the Xfrm framework performs a lookup into the SAD (//Xfrm lookup state//), based on the SPI((SPI is an integer value in the unencrypted part of the ESP header)) and the destination IP address. If no matching SA is found, the packet is dropped. If a matching SA is found, the ESP packet is decrypted and decapsulated. In case of tunnel-mode the outer IP packet is decapsulated. The remaining inner IP packet is then re-inserted into the receive path on OSI layer 2. Packets remember which SA has been used on them((in an skb extension named ''sec_path'')). That becomes relevant when they later traverse the //Xfrm lookup in policy// or //Xfrm lookup fwd policy// action.\\ \\ In case of //Nat-traversal// mode, both IKE and ESP packets arrive on the local input path being encapsulated in UDP on port 4500 and the kernel must distinguish between both. This is done based on the so-called //Non-ESP Marker//((4 zero bytes ''0x00000000'' at beginning of UDP payload, defined in [[https://datatracker.ietf.org/doc/html/rfc3948|RFC3948]])). IKE packets are given to a UDP socket where the userspace application Strongswan is listening((There is more to that. Strongswan is required to set a special socket option called ''UDP_ENCAP'' or else it won't receive any IKE packets on port 4500. But that is an implementation detail.)), while ESP packets are decrypted and decapsulated as described above.</WRAP> |
 +| <WRAP>{{:linux:xfrm-action-policy-in.png?nolink |}} If already decrypted+decapsulated packets arrive here, 
 +they must match the "input policy" (''dir in'' SP) which corresponds to the SA which has been used to decrypt+decapsulate them. If they match, they simply pass; if not, they are dropped. 
 +"Normal" network packets which have not been decrypted+decapsulated, by default simply pass here if they do not match any "input policy"((You can actually define additional input policies for those packets and thereby use the Xfrm framework to do packet filtering. However, this seems to be an exotic and barely documented feature. Netfilter with Nftables/Iptables does this job much better and is well documented, so that should probably be your preferred choice for doing packet filtering.)). 
 +ESP packets bypass this action anyway, as you can see in Figure {{ref>nfhooksxfrm1}}; "input policies" are not meant for them.</WRAP> |
 +| <WRAP>{{:linux:xfrm-action-policy-fwd.png?nolink |}} If already decrypted+decapsulated packets arrive here, 
 +they must match the "forward policy" (''dir fwd'' SP) which corresponds to the SA which has been used to decrypt+decapsulate them. If they match, they simply pass; if not, they are dropped. "Normal" network packets which have not been decrypted+decapsulated, by default simply pass here if they do not match any "forward policy"((You can actually define additional forward policies for those packets and thereby use the Xfrm framework to do packet filtering. However, this seems to be an exotic and barely documented feature. Netfilter with Nftables/Iptables does this job much better and is well documented, so that should probably be your preferred choice for doing packet filtering.)).</WRAP> |
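
The distinction between IKE and ESP packets on UDP port 4500, followed by the SAD lookup for ESP, can be sketched in Python as follows. This is a conceptual model under stated assumptions, not the kernel code path: the function names ''classify_udp4500()'' and ''lookup_sa()'' are hypothetical, and the SAD is modelled as a plain dictionary keyed by SPI and destination IP address. The only facts relied upon are the Non-ESP Marker (4 zero bytes at the beginning of the UDP payload) and the SPI being the first 32-bit field of the ESP header.

```python
import struct

NON_ESP_MARKER = b"\x00\x00\x00\x00"  # 4 zero bytes, see RFC 3948


def classify_udp4500(payload: bytes) -> str:
    """Classify a UDP payload received on port 4500 as 'ike', 'esp' or 'other'."""
    if len(payload) < 4:
        return "other"  # too short to carry a marker or an SPI
    if payload[:4] == NON_ESP_MARKER:
        return "ike"    # would be handed to the userspace UDP socket
    return "esp"        # would be decrypted+decapsulated by Xfrm


def lookup_sa(sad: dict, esp_packet: bytes, dst_ip: str):
    """Look up an SA by (SPI, destination IP); None means the packet is dropped."""
    (spi,) = struct.unpack("!I", esp_packet[:4])  # SPI: first 32 bits, network byte order
    return sad.get((spi, dst_ip))


# Made-up example values for this sketch.
sad = {(0xC0FFEE01, "10.0.0.2"): "sa-r1-to-r2"}
esp = struct.pack("!II", 0xC0FFEE01, 1) + b"ciphertext"
ike = NON_ESP_MARKER + b"ike-message"

print(classify_udp4500(ike))
print(classify_udp4500(esp))
print(lookup_sa(sad, esp, "10.0.0.2"))
```

Keying the dictionary on both SPI and destination address mirrors the lookup criteria named in the //Xfrm decode// description; a real SAD holds far more per-SA state (algorithms, keys, mode, lifetime).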
  
  
-| //Xfrm lookup// | This is where the SPD is used to check if the traversing packets are matching to any "output policy" (''dir out'' SP) and if yes, they are given to the //Xfrm encode// step to be encrypted+encapsulated | +The Xfrm framework implementation does NOT use virtual network interfaces to distinguish between VPN and non-VPN traffic. This is a relevant difference compared to other implementations like the older //KLIPS// IPsec stack which was used in kernel v2.4 and earlier. Why is this relevant? It is true that virtual network interfaces are not required, because the concept of the SPs does all the distinction which is required for the VPN to operate. However, the absence of virtual network interfaces makes it harder for Netfilter-based packet filtering systems like Iptables and Nftables to distinguish between VPN and non-VPN packets within their rules.
-| //Xfrm encode// | This is where packets which shall travel through the VPN tunnel are being encrypted and encapsulated based on SA instances within the SAD (which SA to use and how it relates to SP instances is regulated via the integer identifiers like ''reqid'' and ''spi''; watch out for the ''tmpl'' keyword in the output of command ''ip xfrm policy'') | +
-| //Xfrm/socket lookup// | This "decision point" is actually a conglomerate of several checks which happen in different situations: (1) incoming ESP packets are checked against the SAD and if a matching SA is found (source and destination IP and ''spi'' match), then those packets are given to the //Xfrm decode// step. (2) If already decrypted+decapsulated packets arrive here, the SPD is used to check if they match to any "input policy (''dir in'' SP). If yes, they simply pass, if no they are dropped.\\ \\ This step needs to do a little more "magic" if //Nat-traversal// is used. In that case both IKE and ESP packets arrive here being encapsulated in UDP on port 4500 and here the kernel must distinguish between both((Done with a so-called //Non-ESP Marker// (4 zero bytes ''0x00000000'' at beginning of UDP payload), defined in [[https://tools.ietf.org/html/rfc3948|RFC3948]].)) and give the IKE packets to a userspace UDP socket where Strongswan((There is even more to that. Strongswan is required to set a special socket option called ''UDP_ENCAP'' or else it won't receive any IKE packets on port 4500. But that is an implementation detail.)) is listening and check the ESP packets against the SAD and give them to the //Xfrm decode// step. | +
-| //Xfrm decode// | This is where packets which have been received through the VPN tunnel are being decrypted and decapsulated based on SA instances within the SAD. | +
-| (//Xfrm fwd lookup//) | This step is not shown in the [[:linux:netfilter_packet_flow_image|Netfilter packet flow image]] (probably because it is considered less relevant), but I mention it here for completeness:  If already decrypted+decapsulated packets arrive here, the SPD is used to check if they match to any "forward policy" (''dir fwd'' SP) instance. If yes, they simply pass, if no they are dropped. | +
- +
-It is very important to mention that the Xfrm framework implementation does NOT use virtual network interfaces to distinguish between VPN and non-VPN traffic. This is a relevant difference compared to other +
-implementations like the older //KLIPS// IPsec stack which was used in kernel v2.4 and earlier. Why is this relevant? It is true that virtual network interfaces are not required, because the concept of the SPs +
-does all the distinction which is required for the VPN to operate. However, the absence of virtual network +
-interfaces makes it harder for Netfilter-based packet filtering systems like Iptables and Nftables to distinguish between VPN and non-VPN packets within their rules.+
  
 It is obvious that an Nftables rule would be easy to write if all VPN traffic goes through a virtual network interface e.g. called ''ipsec0''. In case of the Xfrm framework that is not the case, at least not It is obvious that an Nftables rule would be easy to write if all VPN traffic goes through a virtual network interface e.g. called ''ipsec0''. In case of the Xfrm framework that is not the case, at least not
Line 178: Line 165:
 are optional to use and never became the default. The Strongswan documentation calls VPN setups based on those virtual network interfaces [[https://wiki.strongswan.org/projects/strongswan/wiki/RouteBasedVPN|"Route-based VPNs"]]. It seems that essentially two types of virtual interfaces have been introduced in this context over the years: The older ''vti'' interfaces and the newer ''xfrm'' interfaces((Additional kinds of virtual network interfaces exist in this context, like e.g. the ''gre'' interfaces, but they represent an entirely different concept and protocol (''GRE'' protocol) which is e.g. used to build DMVPN setups. That is an advanced topic which works a little differently than the "normal" IPsec VPN which I describe here.)). In the remaining part of this article I will describe what the IPsec-based VPN looks like from the Netfilter point of view in the "normal" case where NO virtual network interfaces are used. are optional to use and never became the default. The Strongswan documentation calls VPN setups based on those virtual network interfaces [[https://wiki.strongswan.org/projects/strongswan/wiki/RouteBasedVPN|"Route-based VPNs"]]. It seems that essentially two types of virtual interfaces have been introduced in this context over the years: The older ''vti'' interfaces and the newer ''xfrm'' interfaces((Additional kinds of virtual network interfaces exist in this context, like e.g. the ''gre'' interfaces, but they represent an entirely different concept and protocol (''GRE'' protocol) which is e.g. used to build DMVPN setups. That is an advanced topic which works a little differently than the "normal" IPsec VPN which I describe here.)). In the remaining part of this article I will describe what the IPsec-based VPN looks like from the Netfilter point of view in the "normal" case where NO virtual network interfaces are used.
  
 +<figure xfrm_dst>
 +{{:linux:xfrm_dst.png?direct&700|}}
 +<caption>Simplified illustration of an Xfrm bundle, attached to a network packet
 +(click to enlarge). In IPsec tunnel-mode, the bundle contains two //routing decisions//,
 +references to IPsec SA and SP and function pointers to lead the packet
 +on the Xfrm encrypt+encapsulate path. Compare it to a normal
 +//routing decision// object, which I described in my
 +[[routing_decisions_in_the_linux_kernel_1_lookup_packet_flow#the_routing_decision_object|article series on routing]].
 +</caption>
 +</figure>
  
 ===== Example Site-to-site VPN ===== ===== Example Site-to-site VPN =====
 <figure ipsecsstopo1> <figure ipsecsstopo1>
-{{ :linux:site-to-site-topo1.png?direct&600 |}} +{{ :linux:site-to-site-topo1.png?nolink&600 |}} 
-<caption>IPsec site-to-site example setup with VPN gateways ''r1'' and ''r2'' (can be roughly compared to the [[https://www.strongswan.org/testing/testresults/swanctl/net2net-psk/|Strongswan: Test swanctl/net2net-psk]]).+<caption>IPsec site-to-site example setup with VPN gateways ''r1'' and ''r2''
 </caption> </caption>
 </figure> </figure>
  
-It is better to have a practical example as basis for further diving into the topic. Here I will use a site-to-site VPN setup, which is created between two VPN gateways ''r1'' and ''r2'' (IPsec, tunnel-mode, IKEv2, ESP, IPv4) as shown in Figure {{ref>ipsecsstopo1}}. The VPN tunnel will connect the local subnets behind ''r1'' and ''r2''. Additionally, both ''r1'' and ''r2'' operate as SNAT edge routers when forwarding non-VPN traffic, but not for VPN traffic. This creates the necessity to distinguish between VPN and non-VPN packets in Nftables rules, but more on that later. The router ''rx'' is just a placeholder for an arbitrary cloud (e.g. the Internet) between both VPN gateways.+It is better to have a practical example as basis for further diving into the topic. Here I will use a site-to-site VPN setup, which is created between two VPN gateways ''r1'' and ''r2'' (IPsec, tunnel-mode, IKEv2, ESP, IPv4) as shown in Figure {{ref>ipsecsstopo1}}.  
 +It can be roughly compared to the [[https://www.strongswan.org/testing/testresults/ikev2/net2net-psk/|strongSwan KVM Test ikev2/net2net-psk]] setup. 
 +The VPN tunnel will connect the local subnets behind ''r1'' and ''r2''. Additionally, both ''r1'' and ''r2'' operate as SNAT edge routers when forwarding non-VPN traffic, but not for VPN traffic. This creates the necessity to distinguish between VPN and non-VPN packets in Nftables rules, but more on that later. The router ''rx'' is just a placeholder for an arbitrary cloud (e.g. the Internet) between both VPN gateways.
  
   * [[:linux:ipsec:example:ss1:ip_setup|IP setup (addresses, routes) of the example topology]]   * [[:linux:ipsec:example:ss1:ip_setup|IP setup (addresses, routes) of the example topology]]
Line 261: Line 260:
 }</code></WRAP></WRAP> }</code></WRAP></WRAP>
 <caption> <caption>
-Strongswan configuration on ''r1'' and ''r2''((Be aware that config on ''r1'' and ''r2'' here merely is an example configuration which helps me to elaborate on the interaction between Strongswan, Xfrm and Netfilter+Nftables. It is NOT necessarily a good setup to be used in the real world out in the field! E.g. to keep things simple I work with PSKs here instead of certificates, which would already be a questionable decision regarding a real world appliance.))+Strongswan configuration on ''r1'' and ''r2''((Be aware that config on ''r1'' and ''r2'' here merely is an example configuration which helps me to elaborate on the interaction between Strongswan, Xfrm and Netfilter+Nftables. It is NOT necessarily a good setup to be used in the real world out in the field! To keep things simple I e.g. work with PSKs here instead of certificates, which would already be a questionable decision regarding a real world appliance.))
 </caption> </caption>
 </figure> </figure>
Line 341: Line 340:
 </figure> </figure>
  
-The content of the following two Figures {{ref>echo_request_r1_traversal}} and {{ref>echo_reply_r1_traversal}} is the result of experiments I did using the ''trace'' and ''log'' features of Nftables. Those features will make this traversal visible to you, however they are only able to cover Nftables //chains//, //rules// and thereby Netfilter //hooks//. They cannot show you what is going on in the Xfrm framework. You just see this indirectly by observing a packet being still unencrypted while traversing one Netfilter hook and then appearing encrypted+encapsulated in the next Netfilter hook. Also I played a little with the ''ip xfrm policy'' command to find out how the behavior changes when I remove one of the SP instances set by Strongswan. From what I read in the Internet, you can do more or less "packet filtering" things with the SPs ("traffic selectors") in the Xfrm framework, but how their logic works in detail seems to be documented nowhere. I will add more info here, if I learn more on that later.+The following Figures {{ref>echo_request_r1_traversal}} and 
 +{{ref>echo_reply_r1_traversal}} show in detail how the ICMP //echo-request// 
 +and the corresponding ICMP //echo-reply// traverse the kernel network stack on 
 +''r1''. Those Figures and the attached tables and descriptions below are the 
 +result of me doing a lot of experimenting and reading in the kernel source code. 
 +I used the Nftables ''trace'' and ''log'' features to make 
 +chain traversal, and thereby Netfilter hook traversal, visible. Further, I 
 +used ''ftrace'' to get a ''function_graph'' of the journey a network packet 
 +takes through the kernel network stack while being encrypted/decrypted. 
 +On some occasions I used ''gdb'' together with ''qemu''+''KVM'' to set 
 +breakpoints within the kernel of a Linux virtual machine and thereby 
 +observe the content of data structures involved with a traversing packet. 
 +While reading source code, the book [[https://ramirose.wixsite.com/ramirosen|Linux Kernel Networking - Implementation and Theory (Rami Rosen, Apress, 2014)]] proved to be a great 
 +help for me to find orientation within the kernel network stack. I hope 
 +this gives you a head start in case you intend to dive deep into that topic 
 +yourself, too.
  
 <figure echo_request_r1_traversal> <figure echo_request_r1_traversal>
-{{:linux:nf-hooks-xfrm-encode1.png?direct&650|}}+{{:linux:packet-flow-ipsec-tunnel-encrypt.png?direct&700|}} 
 +<caption> ICMP echo-request ''h1'' -> ''h2'', ''r1'' traversal, encrypt+encapsulate (click to enlarge)</caption> 
 +</figure>
  
-^    ^ Netfilter / Xfrm  ^ Encapsulation ^ ''iif''  ^ ''oif''  ^ ''ip saddr''   ^ ''ip daddr''   ^ +^ Step ^^ Encapsulation ^ ''iif''  ^ ''oif''  ^ ''ip saddr''   ^ ''ip daddr''   ^ 
-| 1  | <html><b><span style="background-color:#7FB3D5;">Prerouting</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth0'' |          | <html><code><span style="color: navy;">192.168.1.100</span></code></html> | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | +| 1  | **eth0** | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth0'' |          | <html><code><span style="color: navy;">192.168.1.100</span></code></html> | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | 
-| 2  | **Routing** | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth0'' | ...      | <html><code><span style="color: navy;">192.168.1.100</span></code></html> | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | +| 2  | <html><b><span style="background-color:#eeeeee;">taps</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth0'' |          | <html><code><span style="color: navy;">192.168.1.100</span></code></html> | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | 
-| 3  | <html><b><span style="background-color:#E5E7E9;">Xfrm fwd lookup</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth0'' | ''eth1'' | <html><code><span style="color: navy;">192.168.1.100</span></code></html> | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | +| 3  | <html><b><span style="background-color:#dddddd;">qdisc ingress</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth0'' |          | <html><code><span style="color: navy;">192.168.1.100</span></code></html> | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | 
-| 4  | <html><b><span style="background-color:#7FB3D5;">Forward</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth0'' | ''eth1'' | <html><code><span style="color: navy;">192.168.1.100</span></code></html> | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | +| 4  | <html><b><span style="background-color:#729fcf;">Ingress</span></b></html> **(eth0)** | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth0'' |          | <html><code><span style="color: navy;">192.168.1.100</span></code></html> | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | 
-| 5  | <html><b><span style="background-color:#7FB3D5;">Postrouting</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> |          | ''eth1'' | <html><code><span style="color: navy;">192.168.1.100</span></code></html> | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | +| 5  | <html><b><span style="background-color:#729fcf;">Prerouting</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth0'' |          | <html><code><span style="color: navy;">192.168.1.100</span></code></html> | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | 
-| 6  | <html><b><span style="background-color:#E5E7E9;">Xfrm lookup</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> |          | ''eth1'' | <html><code><span style="color: navy;">192.168.1.100</span></code></html> | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | +| 6  | **Routing** | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth0'' | ...      | <html><code><span style="color: navy;">192.168.1.100</span></code></html> | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | 
-| 7  | <html><b><span style="background-color:#E5E7E9;">Xfrm encode</span></b></html> | <html><code>|eth|......|<span style="color: navy;">ip</span>|icmp|</code></html> |          | ''eth1'' | ...            | ...            | +| 7  | <html><b><span style="background-color:#d4abbe;">Xfrm fwd policy</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth0'' | ''eth1'' | <html><code><span style="color: navy;">192.168.1.100</span></code></html> | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | 
-| 8  | <html><b><span style="background-color:#7FB3D5;">Output</span></b></html> | <html><code>|eth|<span style="color: darkgreen;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> |          | ''eth1'' | <html><code><span style="color: darkgreen;">8.0.0.1</span></code></html> | <html><code><span style="color: darkgreen;">9.0.0.1</span></code></html> | +| 8  | <html><b><span style="background-color:#d4abbe;">Xfrm out policy</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth0'' | ''eth1'' | <html><code><span style="color: navy;">192.168.1.100</span></code></html> | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | 
-| 9  | <html><b><span style="background-color:#7FB3D5;">Postrouting</span></b></html> | <html><code>|eth|<span style="color: darkgreen;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> |          | ''eth1'' | <html><code><span style="color: darkgreen;">8.0.0.1</span></code></html> | <html><code><span style="color: darkgreen;">9.0.0.1</span></code></html> | +| 9  | <html><b><span style="background-color:#729fcf;">Forward</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth0'' | ''eth1'' | <html><code><span style="color: navy;">192.168.1.100</span></code></html> | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | 
-| 10 | <html><b><span style="background-color:#E5E7E9;">Xfrm lookup</span></b></html> | <html><code>|eth|<span style="color: darkgreen;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> |          | ''eth1'' | <html><code><span style="color: darkgreen;">8.0.0.1</span></code></html> | <html><code><span style="color: darkgreen;">9.0.0.1</span></code></html> | +| 10 | <html><b><span style="background-color:#729fcf;">Postrouting</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> |          | ''eth1'' | <html><code><span style="color: navy;">192.168.1.100</span></code></html> | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | 
 +| 11 | <html><b><span style="background-color:#d4abbe;">Xfrm encode</span></b></html> | <html><code>|eth|......|<span style="color: navy;">ip</span>|icmp|</code></html> |          | ''eth1'' | ...            | ...            | 
 +| 12 | <html><b><span style="background-color:#729fcf;">Output</span></b></html> | <html><code>|eth|<span style="color: red;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> |          | ''eth1'' | <html><code><span style="color: red;">8.0.0.1</span></code></html> | <html><code><span style="color: red;">9.0.0.1</span></code></html> | 
 +| 13 | <html><b><span style="background-color:#729fcf;">Postrouting</span></b></html> | <html><code>|eth|<span style="color: red;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> |          | ''eth1'' | <html><code><span style="color: red;">8.0.0.1</span></code></html> | <html><code><span style="color: red;">9.0.0.1</span></code></html> | 
 +| 14 | <html><b><span style="background-color:#a3d196;">neighbor</span></b></html> | <html><code>|eth|<span style="color: red;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> |          | ''eth1'' | <html><code><span style="color: red;">8.0.0.1</span></code></html> | <html><code><span style="color: red;">9.0.0.1</span></code></html> | 
 +| 15 | <html><b><span style="background-color:#dddddd;">qdisc egress</span></b></html>           | <html><code>|eth|<span style="color: red;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> |          | ''eth1'' | <html><code><span style="color: red;">8.0.0.1</span></code></html>    | <html><code><span style="color: red;">9.0.0.1</span></code></html>    | 
 +| 16 | <html><b><span style="background-color:#eeeeee;">taps</span></b></html>           | <html><code>|eth|<span style="color: red;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> |          | ''eth1'' | <html><code><span style="color: red;">8.0.0.1</span></code></html>    | <html><code><span style="color: red;">9.0.0.1</span></code></html>    | 
 +| 17 | **eth1** | <html><code>|eth|<span style="color: red;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> |          | ''eth1'' | <html><code><span style="color: red;">8.0.0.1</span></code></html>    | <html><code><span style="color: red;">9.0.0.1</span></code></html>    |
  
-<caption> ICMP echo-request ''h1'' -> ''h2'', ''r1'' traversal 
-  * (1), (2), (3), (4), (5):\\ The //ICMP echo-request// from ''h1'' to ''h2'' is received on the ''eth0'' interface of ''r1'' and thus first (1) traverses the //Prerouting// Netfilter hook, then (2) the routing decision is made, and  ''eth1'' is determined as the output interface for this packet. Then (3) a lookup into the //Xfrm// //SPD// is done to check if the packet matches any of the "forward policies" (''dir fwd'' SPs, see Figure {{ref>r1_ip_xfrm_policy}}), but it does not and thus simply passes that step. Then it traverses the (4) //Forward// and the (5) //Postrouting// Netfilter hooks. This behavior so far is identical to any "normal" network packet which is being forwarded on a router. 
-  * (6), (7):\\ The //ICMP echo-request// packet traverses (6) the //Xfrm lookup// decision point. A lookup into the //SPD// is done to check if it matches any "output policy" (''dir out'' //SP//) and yes, because of its source and destination IP addresses ''192.168.1.100'' and ''192.168.2.100'' the packet matches, see Figure {{ref>r1_ip_xfrm_policy}}. Thus this packet is NOT immediately sent out on ''eth1'' and instead (7) it is given to //Xfrm encode// step where it is encrypted+encapsulated into ESP and an outer IP packet based on the corresponding SA (Figure {{ref>r1_ip_xfrm_state}}) in the SAD; see ''tmpl'' statement in the matching output policy (Figure {{ref>r1_ip_xfrm_policy}}). The outer IP header has the source and destination IP addresses ''8.0.0.1'' and ''9.0.0.1''. 
-  * (8), (9), (10):\\ The resulting packet now traverses the (8) //Output// and (9) //Postrouting// Netfilter hooks and then (10) once again the //Xfrm lookup// decision point. It is already encrypted+encapsulated, thus there is no further match in the SPD and it simply passes and is finally sent out on ''eth1''. 
-</caption> 
-</figure> 
-  
-Important to note regarding Figure {{ref>echo_request_r1_traversal}} is that the ''oif'', since determined by the routing decision, always stays ''eth1''. This is what it means that the //Xfrm framework// does not use virtual network interfaces. If virtual network interfaces would instead be used here (e.g. a ''vti'' interface named ''vti0''), then ''oif'' would be ''vti0'' in steps (3) till (7) instead of ''eth1'', but ''oif'' would still be ''eth1'' in steps (8) till (10). 
  
-<figure echo_reply_r1_traversal> +__Steps (1) (2) (3) (4) (5):__ The //ICMP echo-request// packet from ''h1'' to ''h2'' is received on the ''eth0'' interface of ''r1''. It traverses //taps// (where e.g. Wireshark/tcpdump could listen), the //ingress// queueing discipline (network packet scheduler, ''tc''), and the Netfilter //Ingress// hook of ''eth0'' (where e.g. a //flowtable// could be placed). Then it traverses the Netfilter //Prerouting// hook.
-{{:linux:nf-hooks-xfrm-decode1.png?direct&650|}}+
  
-^    ^ Netfilter / Xfrm ^ Encapsulation ^ ''iif''  ^ ''oif''  ^ ''ip saddr''   ^ ''ip daddr''   ^ +__Step (6):__ The routing lookup is performed. It determines that this packet 
-| 1  | <html><b><span style="background-color:#7FB3D5;">Prerouting</span></b></html>       | <html><code>|eth|<span style="color: darkgreen;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> | ''eth1'' |          | <html><code><span style="color: darkgreen;">9.0.0.1</span></code></html> | <html><code><span style="color: darkgreen;">8.0.0.1</span></code></html>+needs to be forwarded and sent out on ''eth1'' and attaches this routing 
-| 2  | **Routing**                                                                         | <html><code>|eth|<span style="color: darkgreen;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> | ''eth1'' |          | <html><code><span style="color: darkgreen;">9.0.0.1</span></code></html> | <html><code><span style="color: darkgreen;">8.0.0.1</span></code></html>+decision to the packet.
-| 3  | <html><b><span style="background-color:#7FB3D5;">Input</span></b></html>            | <html><code>|eth|<span style="color: darkgreen;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> | ''eth1'' |          | <html><code><span style="color: darkgreen;">9.0.0.1</span></code></html> | <html><code><span style="color: darkgreen;">8.0.0.1</span></code></html>+
-| 4  | <html><b><span style="background-color:#E5E7E9;">Xfrm/sock lookup</span></b></html> | <html><code>|eth|<span style="color: darkgreen;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> | ''eth1'' |          | <html><code><span style="color: darkgreen;">9.0.0.1</span></code></html> | <html><code><span style="color: darkgreen;">8.0.0.1</span></code></html>+
-| 5  | <html><b><span style="background-color:#E5E7E9;">Xfrm decode</span></b></html>      | <html><code>|eth|......|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth1'' |          | ...            | ...            | +
-| 6  | <html><b><span style="background-color:#7FB3D5;">Prerouting</span></b></html>       | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth1'' |          | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html> | +
-| 7  | **Routing**                                                                         | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth1'' | ...      | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html>+
-| 8  | <html><b><span style="background-color:#E5E7E9;">Xfrm fwd lookup</span></b></html>  | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth1'' | ''eth0'' | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html> | +
-| 9  | <html><b><span style="background-color:#7FB3D5;">Forward</span></b></html>          | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth1'' | ''eth0'' | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html> | +
-| 10 | <html><b><span style="background-color:#7FB3D5;">Postrouting</span></b></html>      | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> |          | ''eth0'' | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html>+
-| 11 | <html><b><span style="background-color:#E5E7E9;">Xfrm lookup</span></b></html>      | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> |          | ''eth0'' | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html> |+
  
-<caption> ICMP echo-reply ''h2'' -> ''h1'', ''r1'' traversal +__Step (7):__ The Xfrm framework performs a lookup into the IPsec SPD, 
-  * (1), (2), (3):\\ The //ICMP echo-reply// sent from ''h2'' back to ''h1'' is received on ''eth1'' interface of ''r1'' and thus first (1) traverses the //Prerouting// Netfilter hook, then (2) the routing decision is made. Because this packet is still encrypted and encapsulated, its outer IP header has the destination IP address ''8.0.0.1'' (the IP address of ''r1'' ''eth1'' interface) and thus routing decides this is a packet targeted for local reception and (3) gives it to the //Input// Netfilter hook.+searching for a matching //forward policy// (''dir fwd'' SP). No match is 
-  * (4), (5):\\ The packet traverses the //Xfrm/socket lookup// decision point. Because this packet contains an ESP header, a lookup into the SAD is done and a matching SA is found, see Figure {{ref>r1_ip_xfrm_state}}. Thus, this packet is given to (5) //Xfrm decode//, where it is decapsulated and decrypted (ESP and outer IP header are stripped away) based on the matching SA.+found and the packet passes. 
-  * (6), (7), (8):\\ The packet is now re-inserted in the input path of network interface ''eth1''. It traverses (6) the //Prerouting// hook and (7) the routing decision is made and ''eth0'' is determined as the output interface for this packet. It (8) traverses the //Xfrm fwd lookup// decision point where a lookup into the SPD is done to check if the packet matches any "forward policy" (''dir fwd'' SP, see Figure {{ref>r1_ip_xfrm_policy}}). It matches and thereby passes((While experimenting, I removed the ''dir fwd'' policy. Then the packet was dropped exactly at this point.)).  + 
-  * (9), (10), (11):\\ The packet now traverses the (9) //Forward// and the (10) //Postrouting// Netfilter hook and finally also the //Xfrm lookup// decision point (there is no SP match and the packet simply passes), before being sent out on ''eth0''.  +__Step (8):__ The Xfrm framework performs a lookup into the IPsec SPD, 
-</caption>+searching for a matching //output policy// (''dir out'' SP) and yes, because of its 
 +source and destination IP addresses ''192.168.1.100'' and ''192.168.2.100'' the packet 
 +matches, see Figure {{ref>r1_ip_xfrm_policy}}. 
 +An IPsec SA is resolved (see Figure {{ref>r1_ip_xfrm_state}}), which corresponds to 
 +the matching SP. The Xfrm framework detects that tunnel-mode is configured in 
 +this SA. Thus, it now performs yet another routing lookup, this time for the (future) outer 
 +IPv4 packet, which will later encapsulate the current packet. A “bundle” of 
 +transformation instructions for this packet is assembled, which contains the 
 +original routing decision from step (6), the SP, the SA, the routing decision 
 +for the future outer IP packet and more. It is attached to the packet, 
 +replacing the attached routing decision from step (6).
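
The policy lookups of steps (7) and (8) can be illustrated with a small Python model. This is only a sketch under assumptions, not how the kernel implements it: the ''/24'' selectors, the ''reqid'' value and the names ''SecurityPolicy'', ''SPD_R1'' and ''spd_lookup'' are hypothetical, and real SPs carry more fields (priority, template, protocol selectors) which are left out here.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import List, Optional


@dataclass(frozen=True)
class SecurityPolicy:
    direction: str  # "out" or "fwd", like the "dir" field of `ip xfrm policy`
    src: str        # source selector (subnet)
    dst: str        # destination selector (subnet)
    reqid: int      # ties the policy to its SA (value assumed for illustration)


# Hypothetical SPD on r1; the /24 selectors and the reqid are assumptions.
SPD_R1 = [
    SecurityPolicy("out", "192.168.1.0/24", "192.168.2.0/24", 1),
    SecurityPolicy("fwd", "192.168.2.0/24", "192.168.1.0/24", 1),
]


def spd_lookup(spd: List[SecurityPolicy], direction: str,
               saddr: str, daddr: str) -> Optional[SecurityPolicy]:
    """Return the first policy of this direction whose selectors match."""
    for sp in spd:
        if (sp.direction == direction
                and ip_address(saddr) in ip_network(sp.src)
                and ip_address(daddr) in ip_network(sp.dst)):
            return sp
    return None


# Echo-request h1 -> h2 on r1:
# step (7) looks for a forward policy, step (8) for an output policy.
fwd_match = spd_lookup(SPD_R1, "fwd", "192.168.1.100", "192.168.2.100")
out_match = spd_lookup(SPD_R1, "out", "192.168.1.100", "192.168.2.100")
print("step (7) fwd policy:", fwd_match)
print("step (8) out policy:", out_match)
```

In this model the forward lookup finds nothing for the echo-request, while the output lookup returns the policy whose SA would then be resolved for the bundle.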
 + 
 +__Steps (9) (10):__ The packet traverses the Netfilter //Forward// and 
 +//Postrouting// hooks. 
 + 
 +__Step (11):__ The Xfrm framework transforms the packet 
 +according to the instructions in the attached “bundle”. In this case this 
 +means encapsulating the IP packet into a new outer IP packet 
 +with source IP address ''8.0.0.1'' and destination IP address ''9.0.0.1'' 
 +and then encapsulating the inner IP packet into ESP protocol and encrypting it and its 
 +payload. After that the transformation instructions are removed from the 
 +“bundle”, leaving only the routing decision for the new outer IP packet 
 +attached to the packet.
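
The change of the //Encapsulation// column between steps (10) and (12) can be modelled in a few lines of Python. This is a deliberately simplified sketch: the function names ''encapsulate_tunnel_mode'' and ''render'' are hypothetical, and actual encryption, the ESP trailer and the integrity check value are not modelled.

```python
from typing import List


def encapsulate_tunnel_mode(layers: List[str], outer_src: str,
                            outer_dst: str) -> List[str]:
    """Wrap the original IP packet into ESP and a new outer IP header."""
    eth, *inner = layers  # inner = original IP header plus its payload
    outer_ip = f"ip({outer_src}->{outer_dst})"
    return [eth, outer_ip, "esp", *inner]


def render(layers: List[str]) -> str:
    """Render the layers in the notation of the Encapsulation column."""
    return "|" + "|".join(layer.split("(")[0] for layer in layers) + "|"


plain = ["eth", "ip(192.168.1.100->192.168.2.100)", "icmp"]
encrypted = encapsulate_tunnel_mode(plain, "8.0.0.1", "9.0.0.1")

print(render(plain))      # packet before step (11)
print(render(encrypted))  # packet after step (11)
```

The inner IP header keeps its original addresses, which is why Nftables rules in the //Output// and //Postrouting// hooks of steps (12) and (13) only see ''8.0.0.1'' and ''9.0.0.1''.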
 + 
 +__Steps (12) (13) (14) (15) (16) (17):__ The packet is re-inserted into the local 
 +output path. It traverses the Netfilter //Output// hook and then again the 
 +Netfilter //Postrouting// hook. Then it traverses the //neighboring 
 +subsystem// which in this case resolves the next hop gateway IP address 
 +''8.0.0.2'' from the routing decision attached to this packet into a MAC address 
 +(by doing an ARP lookup, if the address is not yet in the cache). 
 +Finally, it traverses the //egress// queueing discipline (network packet 
 +scheduler, ''tc''), //taps// (where e.g. Wireshark/tcpdump could listen) and 
 +then is sent out on ''eth1''. 
 + 
 +The output interface ''oif'' of the packet, since determined by the routing decision(s) in steps (6) and (8), always stays ''eth1'' when you check for it in an Nftables rule within one of the Netfilter hooks coming after that. This is what it means that the //Xfrm framework// does not use virtual network interfaces. If virtual network interfaces were used here instead (e.g. a ''vti'' interface named ''vti0''), then ''oif'' would be ''vti0'' in steps (9) and (10) instead of ''eth1'', but ''oif'' would still be ''eth1'' in steps (12) and (13).
 + 
 +<figure echo_reply_r1_traversal> 
 +{{:linux:packet-flow-ipsec-tunnel-decrypt.png?direct&700|}} 
 +<caption> ICMP echo-reply ''h2'' -> ''h1'', ''r1'' traversal, decrypt+decapsulate (click to enlarge)</caption>
 </figure> </figure>
  
-Important to note regarding Figure {{ref>echo_reply_r1_traversal}} is that the ''iif'' during this whole traversal stays ''eth1''. This is what it means that the Xfrm framework does not use virtual network interfaces. If virtual network interfaces would instead be used here (e.g. a ''vti'' interface named ''vti0''), then ''iif'' would still be ''eth1'' in steps (1) till (5), but would be ''vti0'' in steps (6) till (9). +^ Step ^^ Encapsulation                             ^ ''iif''  ^ ''oif''  ^ ''ip saddr''   ^ ''ip daddr''   ^ 
 +| 1  | **eth1** | <html><code>|eth|<span style="color: red;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> | ''eth1'' |          | <html><code><span style="color: red;">9.0.0.1</span></code></html> | <html><code><span style="color: red;">8.0.0.1</span></code></html>
 +| 2  | <html><b><span style="background-color:#eeeeee;">taps</span></b></html> | <html><code>|eth|<span style="color: red;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> | ''eth1'' |          | <html><code><span style="color: red;">9.0.0.1</span></code></html> | <html><code><span style="color: red;">8.0.0.1</span></code></html>
 +| 3  | <html><b><span style="background-color:#dddddd;">qdisc ingress</span></b></html> | <html><code>|eth|<span style="color: red;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> | ''eth1'' |          | <html><code><span style="color: red;">9.0.0.1</span></code></html> | <html><code><span style="color: red;">8.0.0.1</span></code></html>
 +| 4  | <html><b><span style="background-color:#729fcf;">Ingress</span></b></html> **(eth1)** | <html><code>|eth|<span style="color: red;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> | ''eth1'' |          | <html><code><span style="color: red;">9.0.0.1</span></code></html> | <html><code><span style="color: red;">8.0.0.1</span></code></html>
 +| 5  | <html><b><span style="background-color:#729fcf;">Prerouting</span></b></html> | <html><code>|eth|<span style="color: red;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> | ''eth1'' |          | <html><code><span style="color: red;">9.0.0.1</span></code></html> | <html><code><span style="color: red;">8.0.0.1</span></code></html>
 +| 6  | **Routing**                                                                         | <html><code>|eth|<span style="color: red;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> | ''eth1'' |          | <html><code><span style="color: red;">9.0.0.1</span></code></html> | <html><code><span style="color: red;">8.0.0.1</span></code></html>
 +| 7  | <html><b><span style="background-color:#729fcf;">Input</span></b></html>            | <html><code>|eth|<span style="color: red;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> | ''eth1'' |          | <html><code><span style="color: red;">9.0.0.1</span></code></html> | <html><code><span style="color: red;">8.0.0.1</span></code></html>
 +| 8  | <html><b><span style="background-color:#d4abbe;">Xfrm decode</span></b></html>      | <html><code>|eth|......|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth1'' |          | ...            | ...            | 
 +| 9  | **eth1** | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth1'' |          | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html>
 +| 10  | <html><b><span style="background-color:#eeeeee;">taps</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth1'' |          | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html>
 +| 11 | <html><b><span style="background-color:#dddddd;">qdisc ingress</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth1'' |          | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html>
 +| 12 | <html><b><span style="background-color:#729fcf;">Ingress</span></b></html>       | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth1'' |          | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html>
 +| 13 | <html><b><span style="background-color:#729fcf;">Prerouting</span></b></html>       | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth1'' |          | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html>
 +| 14 | **Routing**                                                                         | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth1'' | ...      | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html>
 +| 15 | <html><b><span style="background-color:#d4abbe;">Xfrm fwd policy</span></b></html>  | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth1'' | ''eth0'' | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html>
 +| 16 | <html><b><span style="background-color:#d4abbe;">Xfrm out policy</span></b></html>  | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth1'' | ''eth0'' | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html>
 +| 17 | <html><b><span style="background-color:#729fcf;">Forward</span></b></html>          | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth1'' | ''eth0'' | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html>
 +| 18 | <html><b><span style="background-color:#729fcf;">Postrouting</span></b></html>      | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> |          | ''eth0'' | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html>
 +| 19 | <html><b><span style="background-color:#a3d196;">neighbor</span></b></html>      | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> |          | ''eth0'' | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html>
 +| 20 | <html><b><span style="background-color:#dddddd;">qdisc egress</span></b></html>      | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> |          | ''eth0'' | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html>
 +| 21 | <html><b><span style="background-color:#eeeeee;">taps</span></b></html>      | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> |          | ''eth0'' | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html>
 +| 22 | **eth0** | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> |          | ''eth0'' | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html>
 + 
 + 
 +__Steps (1) (2) (3) (4) (5):__ The //ICMP echo-reply// sent from ''h2'' back to ''h1'' is received on ''eth1'' interface of ''r1'' in its encrypted+encapsulated form. It traverses //taps// (where e.g. Wireshark/tcpdump could listen), the //ingress// queueing discipline (network packet scheduler,  ''tc''), and the Netfilter //Ingress// hook of ''eth1'' (where e.g. a //flowtable// could be placed). Then it traverses the Netfilter //Prerouting// hook.  
 + 
 +__Steps (6) (7):__ The routing lookup is performed. In this case here, the routing subsystem determines that this packet's destination IP ''8.0.0.1'' matches the IP address of interface ''eth1'' of this host ''r1''. Thus, this packet is destined for local reception and the lookup attaches a corresponding routing decision to it. As a result, the packet then traverses the Netfilter //Input// hook.  
 + 
 +__Steps (8) (9):__ The Xfrm framework has a layer 4 receive handler waiting for incoming ESP 
 +packets at this point. It parses the SPI value from the ESP header of the 
 +packet and performs a lookup into the SAD for a matching IPsec SA (lookup 
 +based on SPI and destination IP address). A matching SA is found, see Figure 
 +{{ref>r1_ip_xfrm_state}}, which 
 +specifies the further steps to take here for this packet. It is decrypted and 
 +the ESP header is decapsulated. Now the internal IP packet becomes visible. 
 +The SA specifies tunnel-mode, so the outer IPv4 header is decapsulated. A lot 
 +of packet meta data is changed here, e.g. the attached routing decision (of 
 +the outer IP packet, which is now removed) is stripped away, the reference to 
 +connection tracking is removed, and a pointer to the SA which has been used 
 +here to transform the packet is attached (via skb extension 
 +''[[https://elixir.bootlin.com/linux/v5.10.46/source/include/net/xfrm.h#L1029|struct sec_path]]''). As a result, other kernel components can later 
 +recognize that this packet has been decrypted by Xfrm. Finally, the packet is 
 +now re-inserted into the OSI layer 2 receive path of ''eth1''
 + 
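The SA lookup and the metadata changes of steps (8) and (9) can be sketched as a small Python model. This is only an illustration and not kernel code: the SPI value, the dictionary layout and the names ''sad_lookup'' and ''xfrm_decode'' are assumptions made up for this sketch, and the real lookup considers more fields than SPI and destination address.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SecurityAssociation:
    spi: int    # Security Parameter Index carried in the ESP header
    daddr: str  # outer destination address the SA is bound to
    mode: str   # "tunnel" or "transport"


# Toy SAD keyed by (SPI, destination address); the SPI value is made up.
SAD = {
    (0xC0FFEE01, "8.0.0.1"): SecurityAssociation(0xC0FFEE01, "8.0.0.1", "tunnel"),
}


def sad_lookup(spi: int, daddr: str) -> Optional[SecurityAssociation]:
    """Return the SA matching an incoming ESP packet, or None."""
    return SAD.get((spi, daddr))


def xfrm_decode(outer_daddr: str, spi: int, inner_packet: dict) -> Optional[dict]:
    """Model of steps (8)(9): find the SA, then hand back the inner packet."""
    sa = sad_lookup(spi, outer_daddr)
    if sa is None:
        return None                # no SA found: the packet cannot be decoded
    decoded = dict(inner_packet)   # decryption itself is omitted in this model
    decoded["route"] = None        # routing decision of the outer packet is gone
    decoded["sec_path"] = [sa]     # remember which SA transformed the packet
    return decoded


pkt = xfrm_decode("8.0.0.1", 0xC0FFEE01,
                  {"saddr": "192.168.2.100", "daddr": "192.168.1.100"})
```

The point of the model is the ''sec_path'' entry: it is the only trace that later tells other components that this packet came out of the VPN tunnel.
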
 +__Steps (10) (11) (12) (13):__ Now history repeats... the packet once again traverses 
 +//taps//, the //ingress// queueing discipline and the Netfilter //Ingress// hook of 
 +''eth1''. Then it traverses the Netfilter //Prerouting// hook.  
 + 
 +__Step (14):__ The routing lookup is performed. It determines that this 
 +packet needs to be forwarded and sent out on ''eth0'' and attaches this routing decision 
 +to the packet. 
 + 
 +__Step (15):__ The Xfrm framework recognizes, that this packet has been transformed  
 +according to the SA, whose pointer is still attached to the packet  (within skb extension ''[[https://elixir.bootlin.com/linux/v5.10.46/source/include/net/xfrm.h#L1029|struct sec_path]]''). 
 +It checks if a ''forward policy'' (''dir fwd'' SP) exists which corresponds to this SA, and yes, 
 +a match is found here, see Figure {{ref>r1_ip_xfrm_policy}}. So the packet passes. 
 + 
 +__Step (16):__ The Xfrm framework performs a lookup into the IPsec SPD, searching for a matching 
 +//output policy// (''dir out'' SP). No match is found. Still, the packet passes. 
 +The idea of the //output policy// is to detect packets which shall be encrypted with IPsec. 
 +Packets which do not match, just pass.  
 + 
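The two policy checks of steps (15) and (16) can be illustrated with a simplified Python model that only compares selectors. The ''out'' selector is the one quoted in this article for ''r1''; the ''fwd'' selector is assumed here to be its mirror image, and the name ''policy_lookup'' is made up for this sketch. The real forward policy check additionally compares the SA attached to the packet with the policy, which this model leaves out.

```python
from ipaddress import ip_address, ip_network

# Toy SPD for r1: (direction, source selector, destination selector).
# The "fwd" entry is an assumption, mirrored from the "out" selector.
SPD = [
    ("out", ip_network("192.168.1.0/24"), ip_network("192.168.2.0/24")),
    ("fwd", ip_network("192.168.2.0/24"), ip_network("192.168.1.0/24")),
]


def policy_lookup(direction: str, saddr: str, daddr: str):
    """Return the first policy of the given direction whose selector matches."""
    src, dst = ip_address(saddr), ip_address(daddr)
    for pol_dir, pol_src, pol_dst in SPD:
        if pol_dir == direction and src in pol_src and dst in pol_dst:
            return (pol_dir, pol_src, pol_dst)
    return None


# The decrypted echo-reply h2 -> h1 while it is being forwarded by r1:
fwd_match = policy_lookup("fwd", "192.168.2.100", "192.168.1.100")
out_match = policy_lookup("out", "192.168.2.100", "192.168.1.100")
```

In this model ''fwd_match'' holds the forward policy while ''out_match'' is ''None'', which corresponds to the packet passing step (15) because of a match and passing step (16) because there is nothing to encrypt.
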
 +__Steps (17) (18) (19) (20) (21) (22):__ The packet traverses the Netfilter 
 +//Forward// and //Postrouting// hooks. Then it traverses the //neighboring 
 +subsystem// which does resolve the destination IP address, which is now 
 +''192.168.1.100'', into a MAC address (by doing ARP lookup, if address not yet 
 +in cache). Finally, it traverses the //egress// queueing discipline, //taps// 
 +and then is sent out on ''eth0''
 + 
 +The input interface ''iif'' of the packet during this whole traversal stays ''eth1'' when you check for it in an Nftables rule within one of the Netfilter hooks((obviously except the //Postrouting// hook, where the input interface is not remembered anymore)). 
 +This is what it means that the Xfrm framework does not use virtual network interfaces. If virtual network interfaces would instead be used here (e.g. a ''vti'' interface named ''vti0''), then ''iif'' would still be ''eth1'' in steps (4)(5) and (7), but would be ''vti0'' in steps (12)(13) and (17). 
 ===== SNAT, Nftables ===== ===== SNAT, Nftables =====
 Now to add the SNAT behavior to ''r1'' and ''r2'', we apply the following Now to add the SNAT behavior to ''r1'' and ''r2'', we apply the following
Line 410: Line 524:
 and ''h2'' e.g. can now finally ping ''rx''((which intentionally does not know the routes to the subnets behind ''r1'' and ''r2'', because routers in the Internet commonly do not know the routes to private subnets behind edge routers.)) and from ''rx'' point of view it looks like the ping came from ''r1'' or respectively ''r2'' and ''h2'' e.g. can now finally ping ''rx''((which intentionally does not know the routes to the subnets behind ''r1'' and ''r2'', because routers in the Internet commonly do not know the routes to private subnets behind edge routers.)) and from ''rx'' point of view it looks like the ping came from ''r1'' or respectively ''r2''
  
-However, let's take another look at the example from Figure {{ref>echo_request_r1_traversal}} above (the ICMP echo-request ''h1'' -> ''h2'' traversing ''r1'') and examine how the behavior differs now: In step (5) in the example the still unencrypted ICMP echo-request packet traverses the Netfilter //Postrouting// hook and thereby the Nftables //postrouting// chain. Because of the ''masquerade'' rule, its source IP address is now replaced with the address ''8.0.0.1'' of the ''eth1'' interface of ''r1''. Resulting from that, in step (6) this packet now does NOT match any IPsec "output policy" anymore, thus it is not encrypted+encapsulated and does not travel through the VPN tunnel. Obviously this is not our intended behavior and further, this ping is now anyway doomed to fail, because ''rx'' does +However, let's take another look at the example from Figure {{ref>echo_request_r1_traversal}} above (the ICMP echo-request ''h1'' -> ''h2'' traversing ''r1'') and examine how the behavior differs now: In step (10) in the example the still unencrypted ICMP echo-request packet traverses the Netfilter //Postrouting// hook and thereby the Nftables //postrouting// chain. Because of the ''masquerade'' rule, its source IP address is now replaced with the address ''8.0.0.1'' of the ''eth1'' interface of ''r1''. 
-not know the route to the target subnet and even if it did, ''r2'' would then drop the packet +Resulting from that, this packet now does not match the IPsec //output policy// anymore. Thus, it won't get encrypted+encapsulated! Obviously that is not our intended behavior, but let's first dig deeper to understand what actually happens here: In step (8) this packet still had its original source and destination IP addresses ''192.168.1.100'' and ''192.168.2.100'' and thereby matched the //output policy// selector ''src 192.168.1.0/24 dst 192.168.2.0/24'' (see Figure {{ref>r1_ip_xfrm_policy}}). Now in step (10), with its source IP address changed to ''8.0.0.1'', it does not match that //output policy// anymore. But wait! That Xfrm lookup for the //output policy// has already been done in step (8) and the "bundle" with transform instructions is already attached to the packet. Isn't the chaos complete now??? No, because the NAT implementation indeed is aware that this situation can occur. In case NAT actually changes something in the packet traversing the //Postrouting// hook (like in this case the source IP address), then it calls function ''[[https://elixir.bootlin.com/linux/v5.10.46/source/net/netfilter/nf_nat_core.c#L150|nf_xfrm_me_harder()]]''((BTW: Very funny function name! "transform me harder..." You kernel developers made my day! LOL)) which essentially does the whole step (8) all over again. Thus, the Xfrm lookup for a matching //output policy// is actually repeated in this case. For the ICMP echo-request packet in our example this actually means that the original "plain" routing decision is being restored and the "bundle" with the Xfrm instructions is removed.  
-because it also is now configured as SNAT router and thereby drops incoming ''new'' connections + 
-in the ''forward'' chain.+Ok, now we understand it ... the ping is natted, but then sent out plain and unencrypted. That is not what we want. Further, this ping is now anyway doomed to fail, because ''rx'' does not know the route to the target subnet and even if it did, ''r2'' would then drop the packet because it also is now configured as SNAT router and thereby drops incoming ''new'' connections in the ''forward'' chain.
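
The sequence "lookup in step (8), masquerade in step (10), repeated lookup" can be sketched as a small Python model. It is only an illustration of the effect described above and not of the kernel implementation: the packet is a plain dictionary, ''"esp-tunnel"'' is a made-up placeholder for the attached bundle, and the function names are assumptions of this sketch.

```python
from ipaddress import ip_address, ip_network

# Output policy selector of r1 as quoted in the text.
OUT_SRC = ip_network("192.168.1.0/24")
OUT_DST = ip_network("192.168.2.0/24")


def out_policy_matches(pkt: dict) -> bool:
    """True if the packet addresses fall into the output policy selector."""
    return (ip_address(pkt["saddr"]) in OUT_SRC
            and ip_address(pkt["daddr"]) in OUT_DST)


def xfrm_out_lookup(pkt: dict) -> dict:
    """Step (8): attach a bundle on a policy match, else keep the plain route."""
    pkt["bundle"] = "esp-tunnel" if out_policy_matches(pkt) else None
    return pkt


def masquerade(pkt: dict, new_saddr: str) -> dict:
    """Step (10): rewrite the source address, then repeat the Xfrm lookup."""
    changed = pkt["saddr"] != new_saddr
    pkt["saddr"] = new_saddr
    if changed:
        xfrm_out_lookup(pkt)  # models the effect of nf_xfrm_me_harder()
    return pkt


pkt = xfrm_out_lookup({"saddr": "192.168.1.100", "daddr": "192.168.2.100"})
bundle_before_nat = pkt["bundle"]
masquerade(pkt, "8.0.0.1")
bundle_after_nat = pkt["bundle"]
```

Before the masquerade the model has a bundle attached, afterwards it is gone, so the packet would leave ''r1'' natted but unencrypted.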
  
 How to fix that? It is our intended behavior, that network packets from subnet How to fix that? It is our intended behavior, that network packets from subnet
Line 435: Line 549:
 The complete rulesets then look like this: [[:linux:ipsec:example:ss1:nftables_ruleset|Nftables rulesets on r1 on r2]] The complete rulesets then look like this: [[:linux:ipsec:example:ss1:nftables_ruleset|Nftables rulesets on r1 on r2]]
  
-Let's look at the example from Figure {{ref>echo_request_r1_traversal}} again (ICMP echo-request ''h1'' -> ''h2'' traversing ''r1''): In step (5) when traversing the //postrouting// chain, now the inserted rule ''oif eth1 ip daddr 192.168.2.0/24 accept'' prevents that the packet is natted by accepting it and thereby preventing that it traverses the ''masquerade'' rule((The remaining packets of that connection are then anyway handled by //connection tracking// and not by this //chain// and the //connection tracking// now knows that this connection is NOT to be natted.)). The remaining steps of the r1 traversal now happen exactly as in the example and we reached our intended behavior.+Let's look at the example from Figure {{ref>echo_request_r1_traversal}} again (ICMP echo-request ''h1'' -> ''h2'' traversing ''r1''): In step (10) when traversing the //postrouting// chain, now the inserted rule ''oif eth1 ip daddr 192.168.2.0/24 accept'' prevents that the packet is natted by accepting it and thereby preventing that it traverses the ''masquerade'' rule((The remaining packets of that connection are then anyway handled by the //nat// module in cooperation with //connection tracking// and not by this //chain// and thereby //connection tracking// from now on knows that this connection is NOT to be natted.)). The remaining steps of the ''r1'' traversal now happen exactly as in the example and we reached our intended behavior.
  
 When the ICMP echo-request packet is received by and traverses ''r2'', the behavior follows the same principles as described in the example in Figure {{ref>echo_reply_r1_traversal}} above (ICMP echo-reply ''h2'' -> ''h1'' traversing ''r1''). However, it is necessary to add the rule ''iif eth1 oif eth0 ip saddr 192.168.1.0/24 ct state new accept'' to the //forward// chain, because from ''r2'' point of view this is a packet of a ''new'' incoming connection and those we want to allow for packets which traveled through the VPN tunnel, but not for other packets. When the ICMP echo-request packet is received by and traverses ''r2'', the behavior follows the same principles as described in the example in Figure {{ref>echo_reply_r1_traversal}} above (ICMP echo-reply ''h2'' -> ''h1'' traversing ''r1''). However, it is necessary to add the rule ''iif eth1 oif eth0 ip saddr 192.168.1.0/24 ct state new accept'' to the //forward// chain, because from ''r2'' point of view this is a packet of a ''new'' incoming connection and those we want to allow for packets which traveled through the VPN tunnel, but not for other packets.
Line 441: Line 555:
 ===== Distinguish VPN/non-VPN traffic ===== ===== Distinguish VPN/non-VPN traffic =====
 No matter if your idea is to do NAT or your intentions are other kinds of packet manipulation or packet filtering, it all boils down to distinguishing between VPN and non-VPN traffic.  No matter if your idea is to do NAT or your intentions are other kinds of packet manipulation or packet filtering, it all boils down to distinguishing between VPN and non-VPN traffic. 
-The Nftables rulesets I applied to ''r1'' and ''r2'' in the example above are a crude way to do that, but it works. It is crude, because the distinction is made solely based on the target subnet (the destination IP address) of a traversing packet. As I mentioned above, things would be easier, if the Xfrm framework would use virtual network interfaces, because those then could serve as basis for making this distinction. +The Nftables rulesets I applied to ''r1'' and ''r2'' in the example above are a crude way to do that, but it works. It is crude, because the distinction is made solely based on the target subnet (the destination IP address) of a traversing packet. Further, the added rule in the //forward// chain now allows incoming new connections from WAN side, if the source address is from the peer's subnet. However, the rule does not check, whether that traffic has actually been received via the VPN tunnel! Essentially, it opens a hole in the stateful firewall. And there are further issues which this crude solution does not address: What happens when the VPN tunnel is not up, but the mentioned Nftables rulesets are in place? What happens if the subnets behind the VPN gateways are not statically configured but instead are part of dynamic IKE negotiation during IKE handshake? ... 
-The way described here does not address issues like: What happens when the VPN tunnel is not up, but the mentioned Nftables rulesets are in place? What happens if the subnets behind the VPN gateways are not statically configured but instead are part of dynamic IKE negotiation during IKE handshake? ...+As I mentioned above, things would be easier, if the Xfrm framework would use virtual network interfaces, because those then could serve as basis for making this distinction.
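
Why the added //forward// rule opens a hole can be shown with a small Python model of its match conditions. This is a sketch under assumptions: packets are plain dictionaries, ''forward_rule_accepts'' is a made-up name, and the ''decrypted'' flag stands in for the SA pointer which Xfrm attaches to packets it has decoded. The rule being modeled is the one added on ''r2'': ''iif eth1 oif eth0 ip saddr 192.168.1.0/24 ct state new accept''.

```python
from ipaddress import ip_address, ip_network

PEER_SUBNET = ip_network("192.168.1.0/24")


def forward_rule_accepts(pkt: dict) -> bool:
    """Model of: iif eth1 oif eth0 ip saddr 192.168.1.0/24 ct state new accept"""
    return (pkt["iif"] == "eth1"
            and pkt["oif"] == "eth0"
            and ip_address(pkt["saddr"]) in PEER_SUBNET
            and pkt["ct_state"] == "new")


# A packet which really arrived through the tunnel and was decrypted by Xfrm.
from_tunnel = {"iif": "eth1", "oif": "eth0", "saddr": "192.168.1.100",
               "ct_state": "new", "decrypted": True}

# A plain packet from the WAN side which merely claims a peer subnet address.
spoofed = {"iif": "eth1", "oif": "eth0", "saddr": "192.168.1.100",
           "ct_state": "new", "decrypted": False}

accepted = [forward_rule_accepts(p) for p in (from_tunnel, spoofed)]
```

Both packets are accepted, because the rule never looks at the ''decrypted'' information. Any real fix has to bring that information into the rule in some way.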
  
 Several means have been implemented to address those kinds of issues: Several means have been implemented to address those kinds of issues:
   * Strongswan provides an optional ''_updown'' script which is called each time when the VPN tunnel comes up or goes down. You can use it to dynamically set/remove the Nftables rules you require. The default version of that script already sets some Iptables rules for you, but depending on your system (kernel/Nftables version) this can create more problems than it solves. And who says that these default rules do fit well with your intended behavior? Thus, if you use this, you probably need to replace the default script with your own version.   * Strongswan provides an optional ''_updown'' script which is called each time when the VPN tunnel comes up or goes down. You can use it to dynamically set/remove the Nftables rules you require. The default version of that script already sets some Iptables rules for you, but depending on your system (kernel/Nftables version) this can create more problems than it solves. And who says that these default rules do fit well with your intended behavior? Thus, if you use this, you probably need to replace the default script with your own version.
-  * Nftables offers IPSEC EXPRESSIONS (syntax ''ipsec {in | out} [ spnum NUM ] ...'') which make it possible to make lookups into the Xfrm framework within a rule to determine if a packet is part of the VPN context or not. However, you require a very recent version of Nftables and of the Linux kernel for that. Check your man page ''man 8 nft''. See also section [[#Context]] below.+  * Nftables offers IPSEC EXPRESSIONS (syntax ''ipsec {in | out} [ spnum NUM ] ...'') which make it possible to determine whether a packet is part of the VPN context or not. I published a follow-up article which covers that topic in detail: [[nftables_demystifying_ipsec_expressions|Nftables - Demystifying IPsec expressions]].
   * Nftables offers "markers" which you can set and read on traversing packets (syntax ''meta mark'') which can help you to mark packets (set the mark) as being part of the VPN context in one hook/chain and in a later hook/chain you can read the mark again.   * Nftables offers "markers" which you can set and read on traversing packets (syntax ''meta mark'') which can help you to mark packets (set the mark) as being part of the VPN context in one hook/chain and in a later hook/chain you can read the mark again.
   * So-called ''vti'' or ''xfrm'' virtual network interfaces can optionally be used "on top" of the default Xfrm framework behavior.   * So-called ''vti'' or ''xfrm'' virtual network interfaces can optionally be used "on top" of the default Xfrm framework behavior.
- 
-I am planning to describe some of those means in more detail in another article, however I still need to write that one. ;-) I'll place a link here once I find the time to write it. 
  
  
Line 459: Line 571:
 Debian 10 (buster) system using Debian //backports// on //amd64// architecture. Debian 10 (buster) system using Debian //backports// on //amd64// architecture.
  
-  * kernel: ''5.4.19-1~bpo10+1'' +  * kernel: ''5.10.46-4~bpo10+1'' 
-  * nftables: ''0.9.3-2~bpo10+1'' +  * nftables: ''0.9.6-1~bpo10+1'' 
-  * libnftnl: ''1.1.5-1~bpo10+1'' +  * strongswan: ''5.7.2-1+deb10u1''
-  * strongswan: ''5.7.2-1''+
  
 ===== Feedback ===== ===== Feedback =====
-[[:feedback|Feedback]] to this article is very welcome!+[[:feedback|Feedback]] to this article is very welcome! Please be aware that I did not develop or contribute to any of the software components described here. I'm merely some developer who took a look at the source code and did some practical experimenting. If you find something which I might have misunderstood or described incorrectly here, then I would be very grateful, if you bring this to my attention and of course I'll then fix my content asap accordingly. 
  
 +===== References =====
 +  * [[https://www.strongswan.org/]]
 +  * [[https://backreference.org/2014/11/12/on-the-fly-ipsec-vpn-with-iproute2/|backreference.org: “On the fly” IPsec VPN with iproute2 (2014)]]
 +  * [[https://www.linux-magazin.de/ausgaben/2004/12/sicherer-brandstifter/|linux-magazin.de: Firewalling bei IPsec-Einsatz unter Kernel 2.6 (2004)]]
 +  * [[https://ramirose.wixsite.com/ramirosen|Linux Kernel Networking - Implementation and Theory (Rami Rosen, Apress, 2014)]]
  
-//published 2020-05-30//+//published 2020-05-30//, //last modified 2022-08-14//
  
blog/linux/nftables_ipsec_packet_flow.1624917862.txt.gz · Last modified: 2021-06-29 by Andrej Stender