Thermalcircle.de

Differences

This shows you the differences between two versions of the page.

blog:linux:nftables_ipsec_packet_flow [2020-06-01] – [Table] Andrej Stender
blog:linux:nftables_ipsec_packet_flow [2022-08-14] (current) – added details about xfrm bundle Andrej Stender
Line 1: Line 1:
 +{{tag>linux kernel netfilter nftables ipsec strongswan charon swanctl xfrm}}
 ====== Nftables - Netfilter and VPN/IPsec packet flow ======
 ~~META:
Line 5: Line 6:
  
 In this article I'd like to explain what the packet flow through
-//Netfilter// hooks looks like on a host which works as VPN gateway +Netfilter hooks looks like on a host which works as an IPsec-based VPN gateway in tunnel-mode.
-based on IPsec (Strongswan, tunnel-mode, IKEv2, ESP). I'll focus on //Nftables// +Obviously, network packets which are to be sent through a VPN tunnel are encrypted+encapsulated on a VPN gateway, and packets received through the tunnel are decapsulated and decrypted. But in which sequence exactly does
-in favor of the older //Iptables// and regarding Strongswan I'll focus on the newer  +this happen, and which packet traverses which Netfilter hook in which sequence and in which form (encrypted/not yet encrypted/already decrypted)?
-//vici// interface (using ''swanctl'') in favor of the older //stroke// interface. +I'll do a short recap of IPsec in general, explain the IPsec implementation on Linux as it is commonly used today (Strongswan + Xfrm framework) and explain packet traversal through the VPN gateways in an example site-to-site VPN setup (IPsec in tunnel-mode, IKEv2, ESP, IPv4). I'll focus on Nftables rather than the older Iptables, and I'll set up the VPN via the modern //Vici/swanctl// configuration interface of Strongswan instead of the older //Stroke// interface.
  
-See also [[nftables_packet_flow_netfilter_hooks_detail|my other article]] which covers packet flow through //Netfilter// hooks in general. 
  
 +===== IPsec short recap =====
 +A comprehensive recap on the topic IPsec would require an entire book. I'll merely
 +provide a very short recap here focused on protocols and ports to put my actual topic into context.
  
-===== IPsec recap ===== +==== IKE protocol ==== 
-A comprehensive recap on the topic IPsec would require a whole book. I'll merely +An IPsec-based VPN possesses a "management channel" between both VPN endpoint hosts((in this case here "VPN Gateways")), which is the IKE protocol((in this case here IKEv2)). It is responsible for bringing up, managing, and tearing down the VPN tunnel connection between both VPN endpoints. This gives both endpoints the opportunity to authenticate to each other and negotiate encryption, authentication and data-integrity algorithms and the keys those algorithms require 
-provide a very short recap here to put my actual topic into context. +(usually session-based keys). IKE is encapsulated in UDP and uses UDP port 500. 
 +In case of //NAT-traversal// (= if a NAT router is detected between both endpoints during IKE handshake) it dynamically switches to UDP port 4500 during the handshake. Thus, IKE encapsulation on an Ethernet-based network looks like this:
  
-===== IPsec implementation in Linux ===== +| ''|eth|ip|udp|ikev2|'' | IKEv2 packet on UDP/500, //Nat-traversal//: UDP/500 -> UDP/4500 |
-IPsec implementation in Linux consists of a userspace part and a kernel part. +
-Nowadays the userspace part is represented by the [[wp>StrongSwan]] suite (there have been predecessors) +
-and the kernel part is represented by the //Xfrm framework//, which is sometimes called the //Netkey stack// and is present in the kernel since v2.6. With the following image I like to show these components and how they interact in a simple block diagram style. +
  
-{{:wiki:linux:linux-ipsec-impl1.png?direct&700|}}+ 
 + 
 +==== SAs and SPs ==== 
 +The mentioned algorithms and keys which are negotiated during IKE handshake are being organized in 
 +so-called //Security Associations// (SAs). Usually there are (at least) three SAs negotiated for each VPN tunnel connection: The IKE_SA which, once established, represents the secured communication channel for IKE itself and (at least) two more CHILD_SAs, one for each data flow direction, which represent the secured communication channels for packets which shall flow through the VPN tunnel.  
 + 
 +In addition to the SAs, IPsec also introduces the concept of so-called //Security Policies// (SPs), which are also created during IKE handshake. Those are either defined by the IPsec tunnel configuration provided by the admin/user and/or (depending on the case) can also at least partly result from dynamic IKE negotiation. The purpose of the SPs is to act as "traffic selectors" on each VPN endpoint to decide which network packets shall travel through the VPN tunnel and which shall not. Usually the SPs make this distinction based on source and destination IP addresses (/subnets) of the packets, but additional parameters (e.g. protocols, port numbers, ...) can also be considered. If a packet shall travel through the VPN tunnel, the SP further specifies which SA is to be applied. 
 + 
 +Be aware that both SAs and SPs are merely volatile, not persistent data. Their lifetime is defined by the lifetime of the VPN tunnel connection. It might even be shorter because of key re-negotiations / "rekeying".
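
To make the "traffic selector" role of an SP a bit more tangible, here is a minimal Python sketch. It is only a toy model and not how the kernel implements it: the class ''SecurityPolicy'', the function ''select_sa()'', the label ''child-sa-out'' and the subnets are names and values I made up for illustration. It merely shows the idea that a packet's source and destination addresses are compared against the selectors of each SP, and that a matching SP names the SA to apply.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import List, Optional


@dataclass(frozen=True)
class SecurityPolicy:
    """Toy stand-in for an SP: a traffic selector plus the SA it refers to."""
    src: str      # source subnet selector (illustrative value)
    dst: str      # destination subnet selector (illustrative value)
    sa_name: str  # hypothetical label of the SA this policy points to

    def matches(self, src_ip: str, dst_ip: str) -> bool:
        # A packet matches if both addresses fall into the selector subnets.
        return (ip_address(src_ip) in ip_network(self.src)
                and ip_address(dst_ip) in ip_network(self.dst))


def select_sa(policies: List[SecurityPolicy], src_ip: str, dst_ip: str) -> Optional[str]:
    """Return the SA label of the first matching policy, or None (no tunnel)."""
    for sp in policies:
        if sp.matches(src_ip, dst_ip):
            return sp.sa_name
    return None


# Assumed example: one policy selecting traffic between two private subnets.
spd = [SecurityPolicy("192.168.1.0/24", "192.168.2.0/24", "child-sa-out")]

print(select_sa(spd, "192.168.1.10", "192.168.2.20"))  # selected for the tunnel
print(select_sa(spd, "192.168.1.10", "8.8.8.8"))       # not selected
```

A real SP can additionally match on protocol and port numbers, as mentioned above; the sketch leaves that out to stay short.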
 +==== ESP protocol, tunnel-mode ==== 
 +After the initial IKE handshake has been successfully finished, the VPN tunnel between both endpoints thereby is "up" and 
 +packets can travel through it. In case of //tunnel-mode// the IP packets which shall travel through the VPN tunnel are being encrypted and then encapsulated in packets of the so-called ESP protocol((Yes, there is also the AH protocol, which can be used as alternative to ESP, but AH only provides authentication and data integrity and no encryption. Thus, it is neither relevant for this article nor widely used in practice.)). The whole thing is then further encapsulated into another (an "outer") IP packet. The reason is that the VPN tunnel itself is merely a point-to-point connection between two VPN endpoints (one source and one destination IP address), but those endpoints are in that case VPN gateways which are used to connect entire subnets on both ends of the tunnel. Thus, the source and destination IP addresses of the "payload" packets which travel through the VPN gateway need to be kept independent from the source and destination IP addresses of the "outer" IP packets. Encapsulation then looks like this (example for a TCP connection): 
 + 
 +| <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|tcp|payload|</code></html>((Of course, in the actual packet there is no "blank space" between the ''eth'' (ethernet) and the ''ip'' header. I just show it like this here to emphasize WHERE in the packet the ESP header and the outer IP header are being inserted.)) | A "normal" packet which shall travel through the VPN tunnel, is encrypted and encapsulated like this while traversing the VPN gateway. | 
 +| <html><code>|eth|<span style="color: red;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|tcp|payload|</span></code></html>((The darker grey background here shall indicate that this is the part of the whole packet which gets encrypted.))((There is one further detail which I intentionally omit here because it is not relevant for the topic at hand: The ESP protocol in reality does not only possess a //header// but also a //trailer// part which comes after the payload. So, to be completely accurate, the ESP encapsulation in reality looks like this: ''|eth|ip|esp-header|ip|tcp|payload|esp-trailer|''.)) | :::  | 
 + 
 +If //Nat-traversal// is active, then ESP is additionally encapsulated in UDP: 
 + 
 +| <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|tcp|payload|</code></html> |In case of //Nat-traversal// additional encapsulation in UDP (same port ''4500'' as for IKE is then used here). | 
 +| <html><code>|eth|<span style="color: red;">ip</span>|udp|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|tcp|payload|</span></code></html> | :::  | 
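
The layering shown in the tables above can also be written down as a tiny Python sketch. It only manipulates protocol names as strings (no real packets are built), and the function names ''encapsulate()'' and ''render()'' are my own invention for this illustration. As in the tables, the ESP trailer is intentionally omitted.

```python
from typing import List


def encapsulate(inner: List[str], nat_traversal: bool = False) -> List[str]:
    """Return the header sequence of a tunnel-mode ESP packet.

    `inner` is the original packet without its Ethernet header,
    e.g. ["ip", "tcp", "payload"]. The ESP trailer is not modeled.
    """
    outer = ["eth", "ip"]      # new Ethernet frame and outer IP header
    if nat_traversal:
        outer.append("udp")    # ESP-in-UDP (port 4500) when NAT-traversal is active
    outer.append("esp")        # ESP header; everything after it is encrypted
    return outer + inner


def render(layers: List[str]) -> str:
    """Format a header sequence in the |a|b|c| notation used in this article."""
    return "|" + "|".join(layers) + "|"


inner_packet = ["ip", "tcp", "payload"]
print(render(encapsulate(inner_packet)))                      # |eth|ip|esp|ip|tcp|payload|
print(render(encapsulate(inner_packet, nat_traversal=True)))  # |eth|ip|udp|esp|ip|tcp|payload|
```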
 + 
 + 
 +===== IPsec Linux implementation ===== 
 +IPsec implementation in Linux consists of a userspace part and a kernel part.  
 +Several implementations have been created over the years. Nowadays the most commonly used implementation of the userspace part seems to be [[wp>StrongSwan]]. (There were/are other implementations like //Openswan// and //FreeS/WAN//. I'll focus on //Strongswan// here.) The IPsec implementation in modern Linux kernels is the so-called //Xfrm framework//, which is sometimes also called the //Netkey stack//. It has been present in the Linux kernel since v2.6. There have been predecessors, like the //KLIPS// IPsec stack which was used in kernel v2.4 and earlier. With Figure {{ref>linuxipsecimpl1}} I'd like to show the responsibilities of Strongswan and the Xfrm framework and how both interact with each other in a simple block diagram style. 
 + 
 +<figure linuxipsecimpl1> 
 +{{:linux:linux-ipsec-impl1.png?direct&700|}} 
 +<caption> 
 +Block diagram showing userspace and kernel part of the IPsec implementation 
 +on Linux (StrongSwan and Xfrm framework) and interfaces between both 
 +</caption> 
 +</figure>
 ==== Strongswan ====
 The essential part of Strongswan is the userspace daemon //charon// which implements
-IKEv1/IKEv2 and acts as the central "orchestrator" of IPsec-based VPN (the main active component) +IKEv1/IKEv2 and acts as the central "orchestrator" of IPsec-based VPN tunnels/connections  
-on the system.+on each VPN endpoint host.
  
-It provides an interface to the user/administrator to configure IPsec on the system.+It provides an interface to the user/admin for configuration of IPsec on the system.
 Actually, more precisely, it provides two different interfaces to do that:
 One is the so-called //Stroke// interface. It provides means to configure IPsec
 via two main config files ''/etc/ipsec.conf'' and ''/etc/ipsec.secrets''.
-This is the older of the two interfaces and it can be considered deprecated (however +This is the older of the two interfaces and it can be considered deprecated 
-it is still supported). +by now (however it is still supported). 
-The other and newer one is the so-called //Vici// interface. It is an IPC mechanism,  + 
-which means the //charon// daemon listens on a Unix-domain socket and client tools  +The other and newer one is the so-called //Vici// interface. This is an IPC mechanism,  
-(like Strongswan's own cmdline tool ''swanctl'', but also other tools like e.g. the +which means the //charon// daemon is listening on a Unix domain socket and client tools  
-NHRP daemon of the FRR routing protocol engine, which is used in DMVPN setups) +like Strongswan's own cmdline tool ''swanctl''((But also other tools like e.g. the 
 +//NHRP// daemon ''nhrpd'' of the //FRR// routing protocol engine, which is used in //DMVPN// setups.))
 can connect to it to configure IPsec.
 This way of configuration is more powerful than the //Stroke// interface, because it
 makes it easier for other tools to provide and adjust configuration dynamically
 and event-driven at any time.
- 
 However, in many common IPsec setups the configuration is still simply being
-supplied via config files. When using //Vici//, the difference is merely, that+supplied via config files. When using //Vici//, the difference merely is that
 the config file(s) (mainly the file ''/etc/swanctl/swanctl.conf'') are not interpreted
 by the //charon// daemon directly, but instead are interpreted by the cmdline tool ''swanctl''
 which then feeds this config into the //charon// daemon via the //Vici// IPC interface.
 +Further, the syntax of ''swanctl.conf'' looks slightly different from the syntax
 +of ''ipsec.conf'' from the //Stroke// interface, but the semantics are nearly the same. 
  
-The //charon// daemon uses a //Netlink// socket as a communication channel +An additional config file ''/etc/strongswan.conf''((and it includes a bunch of further files)) exists, which contains general/global Strongswan settings that are not directly related to individual VPN connections.
-into the kernel.+
  
-==== The xfrm framework ==== +So let's say you created a ''swanctl.conf'' config file on both of your VPN endpoint hosts with the intended configuration of your VPN tunnel and you used the ''swanctl'' tool to load this configuration into the //charon// daemon and to initiate the IKE handshake with the peer endpoint((There are other ways to trigger the IKE handshake, e.g. "on demand" by network traffic itself, but that topic is beyond this article.)). The //charon// daemon, when executing the IKE handshake, negotiates all the details of the VPN tunnel with the peer as explained above and thereby creates the SA and SP instances which define the VPN connection. It keeps the IKE_SA for itself in userspace, because this is its own "secure channel" for IKE communication with the peer endpoint. It feeds the other SA and SP instances into the kernel via a //Netlink// socket. The VPN connection/tunnel is now "up". If you later decide to terminate the VPN connection again, //charon// removes the SA and SP instances from the kernel.
-The so-called //xfrm// framework is a component within the Linux kernel. While +
-the userspace part (Strongswan) handles the overall IPsec orchestration and +
-runs the IKEv1/IKEv2 protocol to buildup/teardown VPNs, the kernel part +
-handles all what can be considered the "VPN payload". It implements the +
-//Security Association Database// (''SAD'') and the //Security Policy +
-Database// (''SPD'')+
  
-This means the userspace +==== The Xfrm framework ==== 
-daemon charon feeds the actual IPsec //Security Association// (''SA'')  +The so-called //Xfrm framework// is a component within the Linux kernel. 
-instances and //Security Policy// (''SP'') instances, which result from +As one of the //iproute2// man pages((''man 8 ip-xfrm'')) states, 
-configuration and from IKEv1/IKEv2 handshake into the kernel and the kernel +it is an //"IP framework for transforming packets 
-maintains and uses those to encrypt and decrypt the actual "payload" network +(such as encrypting their payloads)"//. Thus, "Xfrm" stands for "transform". 
-packets of the VPN. +While the userspace part (Strongswan) handles the overall IPsec orchestration and 
 +runs the IKEv1/IKEv2 protocol to build up/tear down VPN tunnels/connections, 
 +the kernel part is responsible for encrypting+encapsulating and decrypting+decapsulating 
 +network packets which travel through the VPN tunnel and for deciding which packets 
 +go through the VPN tunnel at all. To do that, it requires all SA and SP 
 +instances which define the VPN tunnel/connection to be present within the kernel. 
 +Only then can it decide which packets shall be encrypted/decrypted 
 +and which not, and which encryption algorithms and keys to use.
  
-You can use the //iproute2// tool ''ip'' as low-level admin tool to show  +The Xfrm framework implements the so-called //Security Association Database// (SAD) 
-the ''SA''s and ''SP''s which are currently configured in the +and the //Security Policy Database// (SPD) for holding SA and SP instances. 
-databases inside the kernel: <code bash> +An SA is represented by ''[[https://elixir.bootlin.com/linux/v5.10.46/source/include/net/xfrm.h#L149|struct xfrm_state]]'' and an SP by ''[[https://elixir.bootlin.com/linux/v5.10.46/source/include/net/xfrm.h#L498|struct xfrm_policy]]'' in the kernel. 
-#list SAs which are currently configured in the kernel +Userspace components (like //Strongswan//, the //iproute2// tool collection and others) use a //Netlink// socket to communicate with the kernel to show/create/modify/delete SA and SP instances in the SAD and SPD. You can e.g. use the //iproute2// tool ''ip'' to show the SA and SP instances which currently exist in those databases:
-ip xfrm state+
  
-#list SPs which are currently configured in the kernel +  * command ''ip xfrm state'' shows SA instances 
-ip xfrm policy +  * command ''ip xfrm policy'' shows SP instances 
 + 
 +You can further use ''ip'' as a low-level config tool to create/delete SA and SP instances. There is a [[https://backreference.org/2014/11/12/on-the-fly-ipsec-vpn-with-iproute2/|very good article]] which explains how to do that. However, in practice you leave the duty of creating/deleting SA and SP instances to //Strongswan//.
 + 
 +SP instances can be created for three different "data directions": 
 +^ security policy   ^ syntax((This syntax is e.g. used in the output of command ''ip xfrm policy''.)) ^ meaning ^ 
 +| "output policy"  | ''dir out'' | SP works as a selector on outgoing packets to select which are to be encrypted+encapsulated and which not | 
 +| "input policy"   | ''dir in''  | SP works as a selector on incoming packets which already have been decrypted+decapsulated and have a destination IP local on the system | 
 +| "forward policy" | ''dir fwd'' | SP works as a selector on incoming packets which already have been decrypted+decapsulated and have a destination IP which is not local, thereby packets which are to be forwarded | 
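
The following small Python sketch restates the table above as a decision function. It is a simplification under my own assumptions: the function name ''policy_dir()'', its parameters and the list of local addresses are hypothetical and only serve the illustration. It answers the question "which kind of SP is consulted at this point of the packet flow", nothing more.

```python
from ipaddress import ip_address
from typing import Iterable


def policy_dir(on_output_path: bool, dst_ip: str, local_addrs: Iterable[str]) -> str:
    """Return which policy direction ("out", "in" or "fwd") is consulted.

    on_output_path: True for packets about to leave the host
                    (locally generated or being forwarded out).
    dst_ip:         destination address of the (inner) packet.
    local_addrs:    addresses configured on this host (assumed values).
    """
    if on_output_path:
        return "out"   # selects packets to be encrypted+encapsulated
    local = {ip_address(a) for a in local_addrs}
    # Receive side: local destination -> "in", otherwise the packet is forwarded.
    return "in" if ip_address(dst_ip) in local else "fwd"


# Assumed local addresses of a gateway, purely for illustration.
GATEWAY_ADDRS = ["8.0.0.1", "192.168.1.1"]

print(policy_dir(False, "192.168.1.1", GATEWAY_ADDRS))   # in
print(policy_dir(False, "192.168.1.50", GATEWAY_ADDRS))  # fwd
print(policy_dir(True, "192.168.2.50", GATEWAY_ADDRS))   # out
```

Note that a forwarded packet can meet two of these checks on its way through the host: a ''dir fwd'' check on the receive side and a ''dir out'' check before it leaves.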
 + 
 +If you are working with Nftables or Iptables, then you probably are 
 +familiar with the widely used 
 +[[:linux:netfilter_packet_flow_image|Netfilter packet flow image]], 
 +which illustrates the packet flow through the Netfilter hooks and 
 +Iptables //chains//. One great thing about this image is, that it also covers 
 +the Xfrm framework. It illustrates four distinct "Xfrm actions" 
 +in the network packet flow path, named //xfrm/socket lookup//, //xfrm decode//, 
 +//xfrm lookup// and //xfrm encode//. However, this illustration is 
 +kind of a bird's-eye view. These four "actions" do not resemble the 
 +actual Xfrm implementation very closely. The actual framework 
 +works a little bit differently, which means that there actually are  
 +more than four points within the packet flow path where Xfrm takes 
 +action, and the locations of those are also a little bit different. 
 +Figure {{ref>nfhooksxfrm1}} represents a simplified view of the packet flow 
 +with main focus on Netfilter and the Xfrm framework. 
 +It shows the Netfilter hooks in blue color and the locations where 
 +the Xfrm framework takes action in magenta color. 
 +If you are not yet familiar with the Netfilter hooks and their relation 
 +to Nftables/Iptables, then please take a look at my other article [[nftables_packet_flow_netfilter_hooks_detail|Nftables - Packet flow and 
 +Netfilter hooks in detail]] before proceeding here. 
 + 
 +<figure nfhooksxfrm1> 
 +{{:linux:packet-flow-ipsec-tunnel.png?direct&700|}} 
 +<caption>Block diagram of Netfilter hooks and Xfrm actions in IPsec tunnel-mode (click to enlarge)</caption> 
 +</figure> 
 + 
 +| <WRAP>{{:linux:routing-step.png?nolink |}} The routing lookup is performed for incoming as well as local outgoing packets, see Figure {{ref>nfhooksxfrm1}}. Function ''[[https://elixir.bootlin.com/linux/v5.10.46/source/include/net/ip_fib.h#L363|fib_lookup()]]'' performs the actual lookup into the policy routing rules and routing tables. The routing decision resulting from this lookup is attached to the traversing network packet (skb). 
 +It is an instance of two combined structs, the outer ''[[https://elixir.bootlin.com/linux/v5.10.46/source/include/net/route.h#L49|struct rtable]]'' and the inner ''[[https://elixir.bootlin.com/linux/v5.10.46/source/include/net/dst.h#L24|struct dst_entry]]''. Together, both contain information like the output network interface, the IP address of the next hop gateway (if existing), function pointers which determine the path this packet takes through the remaining part of the kernel network stack, and more. My [[routing_decisions_in_the_linux_kernel_1_lookup_packet_flow|article series on routing]] explains that in detail. At first glance, routing has nothing to do with the Xfrm framework, but it is relevant in this context as you will see below.</WRAP>
 +| <WRAP>{{:linux:xfrm-action-policy-out.png?nolink |}} This action is performed for forwarded as well as for local outgoing packets after the routing lookup, see Figure {{ref>nfhooksxfrm1}} and function ''[[https://elixir.bootlin.com/linux/v5.10.46/source/net/xfrm/xfrm_policy.c#L3183|xfrm_lookup()]]''. The Xfrm framework performs a lookup into the IPsec SPD, searching for a matching output policy (''dir out'' SP). If no matching policy is found, the network packet stays unchanged and simply continues on its way. If a matching policy is found, a lookup into the SAD is performed to resolve an SA which corresponds to the matching SP (shown as an attached magenta box named //Xfrm lookup state//). If the resolved SA specifies tunnel-mode, then yet another routing lookup is performed, this time for the (future) outer IPv4 packet which will later encapsulate the current packet. The actual packet transformation does not yet happen at this point. Instead, a "bundle" of transformation instructions for this packet is assembled. The term "bundle" stems from the kernel source code and refers to a bunch of struct instances pointing to each other. Among those are the original routing decision of this packet, the SP, the SA((there can be more than one SA being applied to a packet, but that is a less common case)), the routing decision for the future outer IP packet and more. Those are usually assembled around one or several instances of ''[[https://elixir.bootlin.com/linux/v5.10.46/source/include/net/xfrm.h#L925|struct xfrm_dst]]''. The bundle is attached to the network packet (skb), replacing the originally attached routing decision. Function pointers within the bundle ensure that the packet takes a different path through the remaining part of the kernel network stack. Figure {{ref>xfrm_dst}} shows what such a bundle looks like. 
 +</WRAP>
 +| <WRAP>{{:linux:xfrm-action-encode.png?nolink |}} This is where packets which shall travel through the VPN tunnel are being encrypted and encapsulated. The Xfrm framework transforms a packet according to the instructions in the "bundle" attached to it. A function pointer within the bundle makes sure, that the packet takes a "detour" into this transformation code after traversing the Netfilter //Postrouting// hook. For IPv4 packets, the entry function which leads the packet on this path is ''[[https://elixir.bootlin.com/linux/v5.10.46/source/net/ipv4/xfrm4_output.c#L31|xfrm4_output()]]''. In case of tunnel-mode this transformation means encapsulating the IP packet into a new outer IP packet and then encapsulating the inner IP packet into ESP protocol and encrypting it and its payload. After completion of the transformation, the xfrm components/instructions are being removed from the "bundle", leaving only the routing decision for the outer IP packet attached to the packet.</WRAP>
 +| <WRAP>{{:linux:xfrm-action-decode.png?nolink |}} This is where packets which have been received through the VPN tunnel are being decrypted and decapsulated. If an IP packet on the local input path contains an ESP packet, then the Xfrm framework performs a lookup into the SAD (//Xfrm lookup state//), based on the SPI((SPI is an integer value in the unencrypted part of the ESP header)) and the destination IP address. If no matching SA is found, the packet is dropped. If a matching SA is found, the ESP packet is decrypted and decapsulated. In case of tunnel-mode the outer IP packet is decapsulated. The remaining inner IP packet then is re-inserted into the receive path on OSI layer 2. Packets remember which SA has been used on them((in an skb extension named ''sec_path'')). That becomes relevant when they later traverse the //Xfrm lookup in policy// or //Xfrm lookup fwd policy// action.\\ \\ In case of //Nat-traversal// mode, both IKE and ESP packets arrive on the local input path being encapsulated in UDP on port 4500 and the kernel must distinguish between both. This is done based on the so-called //Non-ESP Marker//((4 zero bytes ''0x00000000'' at beginning of UDP payload, defined in [[https://datatracker.ietf.org/doc/html/rfc3948|RFC3948]])). IKE packets are given to a UDP socket where userspace application Strongswan is listening ((There is more to that. Strongswan is required to set a special socket option called ''UDP_ENCAP'' or else it won't receive any IKE packets on port 4500. But that is an implementation detail.)), while ESP packets are decrypted and decapsulated as described above.</WRAP>
 +| <WRAP>{{:linux:xfrm-action-policy-in.png?nolink |}} If already decrypted+decapsulated packets arrive here, 
they must match the "input policy" (''dir in'' SP) which corresponds to the SA which has been used to decrypt+decapsulate them. If they match, they simply pass; if not, they are dropped. 
"Normal" network packets which have not been decrypted+decapsulated simply pass here by default if they do not match any "input policy"((You can actually define additional input policies for those packets and thereby use the Xfrm framework to do packet filtering. However, this seems to be an exotic and almost undocumented feature. Netfilter with Nftables/Iptables does this job much better and is well documented, so that should probably be your preferred choice for doing packet filtering.)). 
 +ESP packets anyway circumvent this action, as you can see in Figure {{ref>nfhooksxfrm1}}; "input policies" are not meant for them.</WRAP>
 +| <WRAP>{{:linux:xfrm-action-policy-fwd.png?nolink |}} If already decrypted+decapsulated packets arrive here, 
they must match the "forward policy" (''dir fwd'' SP) which corresponds to the SA which has been used to decrypt+decapsulate them. If they match, they simply pass; if not, they are dropped. "Normal" network packets which have not been decrypted+decapsulated simply pass here by default if they do not match any "forward policy"((You can actually define additional forward policies for those packets and thereby use the Xfrm framework to do packet filtering. However, this seems to be an exotic and almost undocumented feature. Netfilter with Nftables/Iptables does this job much better and is well documented, so that should probably be your preferred choice for doing packet filtering.)).</WRAP>
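
The distinction between IKE and ESP packets on UDP port 4500, which I described for the //Xfrm decode// action above, can be sketched in a few lines of Python. This is only an illustration of the //Non-ESP Marker// idea and not the kernel's code: the function name ''classify_udp4500_payload()'' and the sample byte strings (including the SPI value) are made up by me. The sketch also ignores other special cases which may occur on that port.

```python
NON_ESP_MARKER = b"\x00\x00\x00\x00"  # 4 zero bytes at the start of the UDP payload


def classify_udp4500_payload(payload: bytes) -> str:
    """Classify a UDP/4500 payload as "ike", "esp" or "other" (too short).

    IKE packets on port 4500 start with the Non-ESP Marker. An ESP packet
    starts with its SPI instead, which is not zero, so the first 4 bytes
    are enough to tell both apart.
    """
    if len(payload) < 4:
        return "other"
    if payload[:4] == NON_ESP_MARKER:
        return "ike"   # would be handed to the UDP socket of the IKE daemon
    return "esp"       # would be decrypted+decapsulated by the Xfrm framework


# Made-up sample payloads for illustration.
ike_sample = NON_ESP_MARKER + b"rest-of-ike-message"
esp_sample = bytes.fromhex("c1234567") + b"sequence-number-and-ciphertext"

print(classify_udp4500_payload(ike_sample))  # ike
print(classify_udp4500_payload(esp_sample))  # esp
```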
 + 
 + 
 +The Xfrm framework implementation does NOT use virtual network interfaces to distinguish between VPN and non-VPN traffic. This is a relevant difference compared to other implementations like the older //KLIPS// IPsec stack which was used in kernel v2.4 and earlier. Why is this relevant? It is true that virtual network interfaces are not required, because the concept of the SPs does all the distinction which is required for the VPN to operate. However, the absence of virtual network interfaces makes it harder for Netfilter-based packet filtering systems like Iptables and Nftables to distinguish between VPN and non-VPN packets within their rules. 
 + 
 +It is obvious that an Nftables rule would be easy to write if all VPN traffic goes through a virtual network interface e.g. called ''ipsec0''. In case of the Xfrm framework that is not the case, at least not 
 +by default. Additional features have been developed over the years to address this problem. Some of them 
 +re-introduce the concept of virtual network interfaces "on top" of the Xfrm framework, but those 
 +are optional to use and never became the default. The Strongswan documentation calls VPN setups based on those virtual network interfaces [[https://wiki.strongswan.org/projects/strongswan/wiki/RouteBasedVPN|"Route-based VPNs"]]. It seems that essentially two types of virtual interfaces have been introduced in this context over the years: The older ''vti'' interfaces and the newer ''xfrm'' interfaces((Additional kinds of virtual network interfaces exist in this context, like e.g. the ''gre'' interfaces, but they represent an entirely different concept and protocol (''GRE'' protocol) which is e.g. used to build DMVPN setups. That is an advanced topic which works a little differently than the "normal" IPsec VPN which I describe here.)). In the remaining part of this article I will describe what the IPsec-based VPN looks like from the Netfilter point of view in the "normal" case where NO virtual network interfaces are used. 
 + 
 +<figure xfrm_dst> 
 +{{:linux:xfrm_dst.png?direct&700|}} 
 +<caption>Simplified illustration of an Xfrm bundle, attached to a network packet 
 +(click to enlarge). In IPsec tunnel-mode, the bundle contains two //routing decisions//, 
 +references to IPsec SA and SP and function pointers to lead the packet 
 +on the Xfrm encrypt+encapsulate path. Compare it to a normal 
 +//routing decision// object, which I described in my 
 +[[routing_decisions_in_the_linux_kernel_1_lookup_packet_flow#the_routing_decision_object|article series on routing]]. 
 +</caption> 
 +</figure> 
 + 
 +===== Example Site-to-site VPN ===== 
 +<figure ipsecsstopo1> 
 +{{ :linux:site-to-site-topo1.png?nolink&600 |}} 
 +<caption>IPsec site-to-site example setup with VPN gateways ''r1'' and ''r2'' 
 +</caption> 
 +</figure> 
 + 
 +It is better to have a practical example as basis for further diving into the topic. Here I will use a site-to-site VPN setup, which is created between two VPN gateways ''r1'' and ''r2'' (IPsec, tunnel-mode, IKEv2, ESP, IPv4) as shown in Figure {{ref>ipsecsstopo1}}.  
 +It can be roughly compared to the [[https://www.strongswan.org/testing/testresults/ikev2/net2net-psk/|strongSwan KVM Test ikev2/net2net-psk]] setup. 
 +The VPN tunnel will connect the local subnets behind ''r1'' and ''r2''. Additionally, both ''r1'' and ''r2'' operate as SNAT edge routers when forwarding non-VPN traffic, but not for VPN traffic. This creates the necessity to distinguish between VPN and non-VPN packets in Nftables rules, but more on that later. The router ''rx'' is just a placeholder for an arbitrary cloud (e.g. the Internet) between both VPN gateways. 
 + 
 +  * [[:linux:ipsec:example:ss1:ip_setup|IP setup (addresses, routes) of the example topology]] 
 + 
 +<figure swanctl_conf_r1_r2> 
 +<WRAP group><WRAP half column> 
 +<code json r1: swanctl.conf> 
 +connections { 
 +  gw-gw { 
 +    local_addrs  = 8.0.0.1 
 +    remote_addrs = 9.0.0.1 
 +    local { 
 +      auth = psk 
 +      id = r1 
 +    } 
 +    remote { 
 +      auth = psk 
 +      id = r2 
 +    } 
 +    children { 
 +      net-net { 
 +        mode = tunnel 
 +        local_ts  = 192.168.1.0/24 
 +        remote_ts = 192.168.2.0/24 
 +        esp_proposals = aes128gcm128 
 +      } 
 +    } 
 +    version = 2 
 +    mobike = no 
 +    reauth_time = 10800 
 +    proposals = aes128-sha256-modp3072 
 +  } 
 +
 +secrets { 
 +  ike-1 { 
 +    id-1 = r1 
 +    id-2 = r2 
 +    secret = "ohCeiVi5iez36ieFu" 
 +  } 
 +}</code></WRAP><WRAP half column> 
 +<code json r2: swanctl.conf> 
 +connections { 
 +  gw-gw { 
 +    local_addrs  = 9.0.0.1 
 +    remote_addrs = 8.0.0.1  
 +    local { 
 +      auth = psk 
 +      id = r2 
 +    } 
 +    remote { 
 +      auth = psk 
 +      id = r1 
 +    } 
 +    children { 
 +      net-net { 
 +        mode = tunnel 
 +        local_ts  = 192.168.2.0/24 
 +        remote_ts = 192.168.1.0/24 
 +        esp_proposals = aes128gcm128 
 +      } 
 +    } 
 +    version = 2 
 +    mobike = no 
 +    reauth_time = 10800 
 +    proposals = aes128-sha256-modp3072 
 +  } 
 +
 +secrets { 
 +  ike-1 { 
 +    id-1 = r1 
 +    id-2 = r2 
 +    secret = "ohCeiVi5iez36ieFu" 
 +  } 
 +}</code></WRAP></WRAP> 
 +<caption> 
 +Strongswan configuration on ''r1'' and ''r2''((Be aware that config on ''r1'' and ''r2'' here merely is an example configuration which helps me to elaborate on the interaction between Strongswan, Xfrm and Netfilter+Nftables. It is NOT necessarily a good setup to be used in the real world out in the field! To keep things simple I e.g. work with PSKs here instead of certificates, which would already be a questionable decision regarding a real world appliance.)) 
 +</caption> 
 +</figure> 
 + 
 +Execute command ''%%swanctl --load-all%%'' on ''r1'' and ''r2'' to load the configuration shown in Figure {{ref>swanctl_conf_r1_r2}} into the //charon// daemon. Execute command ''%%swanctl --initiate --child net-net%%'' on ''r1'' to initiate the VPN connection between both gateways. This triggers the IKEv2 handshake with ''r1'' as "initiator" and ''r2'' as "responder". After a successful IKE handshake, the //charon// daemon feeds the SAs and SPs which define this VPN tunnel into the kernel via a Netlink socket; with that, the tunnel is up. You can use the //iproute2// tool ''ip'' as a low-level admin tool to show the SAs and SPs which currently exist in the 
 +databases inside the kernel, see Figures {{ref>r1_ip_xfrm_state}} and {{ref>r1_ip_xfrm_policy}}. 
 + 
 +<figure r1_ip_xfrm_state> 
 +<code bash> 
 +root@r1:~# ip xfrm state 
 +src 8.0.0.1 dst 9.0.0.1 
 +        proto esp spi 0xc5400599 reqid 1 mode tunnel 
 +        replay-window 0 flag af-unspec 
 +        aead rfc4106(gcm(aes)) 0x8849c107d9f6972da27a5faef554a68b10f3b938 128 
 +        anti-replay context: seq 0x0, oseq 0x9, bitmap 0x00000000 
 +src 9.0.0.1 dst 8.0.0.1 
 +        proto esp spi 0xcd7dff80 reqid 1 mode tunnel 
 +        replay-window 32 flag af-unspec 
 +        aead rfc4106(gcm(aes)) 0x3c0497d489904175bdb446f3e09ae4c3acaf5d45 128 
 +        anti-replay context: seq 0x9, oseq 0x0, bitmap 0x000001ff
 </code> </code>
 +<caption>Showing SAs currently present in SAD in kernel on ''r1'' with ''ip xfrm'' command</caption>
 +</figure>
  
-The ''ip'' tool uses the same means (//Netlink// socket) to communicate with +<figure r1_ip_xfrm_policy> 
-the kernelYou could also use it as a low-level config tool to +<code bash> 
-create/edit/delete ''SA''s and ''SP''s in the kernel, however in practice +root@r1:~# ip xfrm policy 
-you leave those duties to Strongswan.+src 192.168.1.0/24 dst 192.168.2.0/24  
 +        dir out priority 375423 ptype main  
 +        tmpl src 8.0.0.1 dst 9.0.0.1 
 +                proto esp spi 0xc5400599 reqid 1 mode tunnel 
 +src 192.168.2.0/24 dst 192.168.1.0/24  
 +        dir fwd priority 375423 ptype main  
 +        tmpl src 9.0.0.1 dst 8.0.0.1 
 +                proto esp reqid 1 mode tunnel 
 +src 192.168.2.0/24 dst 192.168.1.0/24  
 +        dir in priority 375423 ptype main  
 +        tmpl src 9.0.0.1 dst 8.0.0.1 
 +                proto esp reqid 1 mode tunnel 
 +src 0.0.0.0/0 dst 0.0.0.0/0  
 +        socket in priority 0 ptype main  
 +src 0.0.0.0/0 dst 0.0.0.0/ 
 +        socket out priority 0 ptype main  
 +src 0.0.0.0/0 dst 0.0.0.0/0  
 +        socket in priority 0 ptype main  
 +src 0.0.0.0/0 dst 0.0.0.0/0  
 +        socket out priority 0 ptype main  
 +src ::/0 dst ::/0  
 +        socket in priority 0 ptype main  
 +src ::/0 dst ::/0  
 +        socket out priority 0 ptype main  
 +src ::/0 dst ::/0  
 +        socket in priority 0 ptype main  
 +src ::/0 dst ::/0  
 +        socket out priority 0 ptype main  
 +</code> 
 +<caption>Showing SPs currently present in SPD in kernel on ''r1'' with ''ip xfrm'' command</caption> 
 +</figure>
  
-==== hooks ==== +Just for completeness: To tear the VPN tunnel down again, you would need to execute command  
-{{:wiki:linux:nf-hooks-xfrm1.png?direct&700|}}+''%%swanctl --terminate --child net-net%%'' to terminate the ESP tunnel (SPs and SAs get removed from the kernel databases) and then command ''%%swanctl --terminate --ike gw-gw%%'' to terminate the IKE connection/association (IKE_SA).
  
  
 +===== Packet Flow =====
 +Let's say the VPN tunnel in the example described above is now up and running. To start simple, let's also assume that we have not yet configured any Nftables //ruleset// on the VPN gateways ''r1'' and ''r2''. Thus, those two hosts do not yet do SNAT (we will do this later). The VPN will already work normally, which means any IP packet which is sent from any host in subnet ''192.168.1.0/24'' to any host in subnet ''192.168.2.0/24'' or vice versa will travel through the VPN tunnel.
  
-===== Table =====+This means that packets which are traversing one of the VPN gateways ''r1'' or ''r2'' and are about to "enter the VPN tunnel" are encrypted and then encapsulated in ESP and an outer IP packet. Packets which are "leaving the tunnel" are decapsulated (the outer IP packet and ESP header are stripped away) and decrypted. Let's observe that on VPN gateway ''r1'' by sending a single ping from ''h1'' to ''h2'', see Figure {{ref>r1_icmp}}.
  
-**''ICMP'' ''echo-request'' ''h1''-> ''h2''''r1'' traversal**\\ +<figure r1_icmp> 
 +{{:linux:r1-traversal1.png?direct&200 |}} 
 +\\  
 +<code bash> 
 +h1$ ping -c1 192.168.2.100 
 +#       h1                                    h2 
 +# 192.168.1.100 -> ICMP echo-request -> 192.168.2.100 
 +# 192.168.1.100 <- ICMP echo-reply   <- 192.168.2.100 
 +</code> 
 +<caption>Single ping from ''h1'' to ''h2'' traversing ''r1''; ICMP echo-request entering VPN tunnel, ICMP echo-reply leaving VPN tunnel.</caption> 
 +</figure>
  
-{{:wiki:linux:nf-hooks-xfrm-encode1.png?direct&650|}}+The following Figures {{ref>echo_request_r1_traversal}} and 
 +{{ref>echo_reply_r1_traversal}} show in detail how the ICMP //echo-request// 
 +and the corresponding ICMP //echo-reply// traverse the kernel network stack on 
 +''r1''Those Figures and the attached tables and descriptions below are the 
 +result of me doing a lot of experimenting and reading in the kernel source code. 
 +I used the Nftables ''trace'' and ''log'' features to make 
 +chain traversal, and thereby Netfilter hook traversal, visible. Further, I 
 +used ''ftrace'' to get a ''function_graph'' of the journey a network packet 
 +takes through the kernel network stack while being encrypted/decrypted. 
 +On some occasions I used ''gdb'' together with ''qemu''+''KVM'' to set 
 +breakpoints within the kernel of a Linux virtual machine and thereby 
 +observe the content of data structures involved with a traversing packet. 
 +While reading source code, the book [[https://ramirose.wixsite.com/ramirosen|Linux Kernel Networking - Implementation and Theory (Rami Rosen, Apress, 2014)]] proved to be a great 
 +help to me to find orientation within the kernel network stack. I hope 
 +this gives you a head start in case you intend to dive deep into that topic 
 +yourself, too.
  
-^ step ^ netfilter hook / xfrm ^ encapsulation                             ^ ''iif''  ^ ''oif''  ^ ''ip saddr''   ^ ''ip daddr''+<figure echo_request_r1_traversal
-| 1    | ''prerouting''        | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth0'' |          | <html><code><span style="color: navy;">10.0.1.100</span></code></html> | <html><code><span style="color: navy;">10.0.2.100</span></code></html> | +{{:linux:packet-flow-ipsec-tunnel-encrypt.png?direct&700|}} 
-| 2    | ''forward''           | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="colornavy;">ip</span>|icmp|</code></html> | ''eth0'' | ''eth1'' | <html><code><span style="colornavy;">10.0.1.100</span></code></html> | <html><code><span style="color: navy;">10.0.2.100</span></code></html>+<captionICMP echo-request ''h1'' -> ''h2''''r1'' traversal, encrypt+encapsulate (click to enlarge)</caption
-| 3    | ''postrouting''       | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> |          | ''eth1'' | <html><code><span style="color: navy;">10.0.1.100</span></code></html> | <html><code><span style="color: navy;">10.0.2.100</span></code></html> +</figure>
-| 4    | ''xfrm lookup''       <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> |          |          |                |                | +
-| 5    | ''xfrm encode''       | <html><code>|eth|......|<span style="color: navy;">ip</span>|icmp|</code></html> |          |          |                |                | +
-| 6    | ''output''            | <html><code>|eth|<span style="color: darkgreen;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> |          | ''eth1'' | <html><code><span style="color: darkgreen;">3.0.0.1</span></code></html>    | <html><code><span style="color: darkgreen;">5.0.0.1</span></code></html>    | +
-| 7    | ''postrouting''       | <html><code>|eth|<span style="color: darkgreen;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> |          | ''eth1'' | <html><code><span style="color: darkgreen;">3.0.0.1</span></code></html>    | <html><code><span style="color: darkgreen;">5.0.0.1</span></code></html>    |+
  
-**''ICMP'' ''echo-reply'' ''h2''-> ''h1''''r1'' traversal**\\ +^ Step ^^ Encapsulation                                                                                                                                                          ^ ''iif''  ^ ''oif''  ^ ''ip saddr''   ^ ''ip daddr''   ^ 
 +| 1  | **eth0** | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth0'' |          | <html><code><span style="color: navy;">192.168.1.100</span></code></html> | <html><code><span style="color: navy;">192.168.2.100</span></code></html> |
 +| 2  | <html><b><span style="background-color:#eeeeee;">taps</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth0'' |          | <html><code><span style="color: navy;">192.168.1.100</span></code></html> | <html><code><span style="color: navy;">192.168.2.100</span></code></html> |
 +| 3  | <html><b><span style="background-color:#dddddd;">qdisc ingress</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth0'' |          | <html><code><span style="color: navy;">192.168.1.100</span></code></html> | <html><code><span style="color: navy;">192.168.2.100</span></code></html> |
 +| 4  | <html><b><span style="background-color:#729fcf;">Ingress</span></b></html> **(eth0)** | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth0'' |          | <html><code><span style="color: navy;">192.168.1.100</span></code></html> | <html><code><span style="color: navy;">192.168.2.100</span></code></html> |
 +| 5  | <html><b><span style="background-color:#729fcf;">Prerouting</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth0'' |          | <html><code><span style="color: navy;">192.168.1.100</span></code></html> | <html><code><span style="color: navy;">192.168.2.100</span></code></html> |
 +| 6  | **Routing** | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth0'' | ...      | <html><code><span style="color: navy;">192.168.1.100</span></code></html> | <html><code><span style="color: navy;">192.168.2.100</span></code></html> |
 +| 7  | <html><b><span style="background-color:#d4abbe;">Xfrm fwd policy</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth0'' | ''eth1'' | <html><code><span style="color: navy;">192.168.1.100</span></code></html> | <html><code><span style="color: navy;">192.168.2.100</span></code></html> |
 +| 8  | <html><b><span style="background-color:#d4abbe;">Xfrm out policy</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth0'' | ''eth1'' | <html><code><span style="color: navy;">192.168.1.100</span></code></html> | <html><code><span style="color: navy;">192.168.2.100</span></code></html> |
 +| 9  | <html><b><span style="background-color:#729fcf;">Forward</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth0'' | ''eth1'' | <html><code><span style="color: navy;">192.168.1.100</span></code></html> | <html><code><span style="color: navy;">192.168.2.100</span></code></html> |
 +| 10 | <html><b><span style="background-color:#729fcf;">Postrouting</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> |          | ''eth1'' | <html><code><span style="color: navy;">192.168.1.100</span></code></html> | <html><code><span style="color: navy;">192.168.2.100</span></code></html> |
 +| 11 | <html><b><span style="background-color:#d4abbe;">Xfrm encode</span></b></html> | <html><code>|eth|......|<span style="color: navy;">ip</span>|icmp|</code></html> |          | ''eth1'' | ...            | ...            |
 +| 12 | <html><b><span style="background-color:#729fcf;">Output</span></b></html> | <html><code>|eth|<span style="color: red;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> |          | ''eth1'' | <html><code><span style="color: red;">8.0.0.1</span></code></html> | <html><code><span style="color: red;">9.0.0.1</span></code></html> |
 +| 13 | <html><b><span style="background-color:#729fcf;">Postrouting</span></b></html> | <html><code>|eth|<span style="color: red;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> |          | ''eth1'' | <html><code><span style="color: red;">8.0.0.1</span></code></html> | <html><code><span style="color: red;">9.0.0.1</span></code></html> |
 +| 14 | <html><b><span style="background-color:#a3d196;">neighbor</span></b></html> | <html><code>|eth|<span style="color: red;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> |          | ''eth1'' | <html><code><span style="color: red;">8.0.0.1</span></code></html> | <html><code><span style="color: red;">9.0.0.1</span></code></html> |
 +| 15 | <html><b><span style="background-color:#dddddd;">qdisc egress</span></b></html> | <html><code>|eth|<span style="color: red;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> |          | ''eth1'' | <html><code><span style="color: red;">8.0.0.1</span></code></html> | <html><code><span style="color: red;">9.0.0.1</span></code></html> |
 +| 16 | <html><b><span style="background-color:#eeeeee;">taps</span></b></html> | <html><code>|eth|<span style="color: red;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> |          | ''eth1'' | <html><code><span style="color: red;">8.0.0.1</span></code></html> | <html><code><span style="color: red;">9.0.0.1</span></code></html> |
 +| 17 | **eth1** | <html><code>|eth|<span style="color: red;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> |          | ''eth1'' | <html><code><span style="color: red;">8.0.0.1</span></code></html> | <html><code><span style="color: red;">9.0.0.1</span></code></html> |
  
-{{:wiki:linux:nf-hooks-xfrm-decode1.png?direct&650|}} 
  
-step ^ netfilter hook / xfrm encapsulation                             ^ ''iif''  ^ ''oif''  ^ ''ip saddr''   ^ ''ip daddr''   ^ +__Steps (1) (2) (3) (4) (5):__ The //ICMP echo-request// packet from ''h1'' to ''h2'' is received on the ''eth0'' interface of ''r1''. It traverses //taps// (where e.g. Wireshark/tcpdump could listen), the //ingress// queueing discipline (network packet scheduler,  ''tc''), and the Netfilter //Ingress// hook of ''eth0'' (where e.g. a //flowtable// could be placed). Then it traverses the Netfilter //Prerouting// hook.  
-| 1    | ''prerouting''        | <html><code>|eth|<span style="color: darkgreen;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> | ''eth1'' |          | <html><code><span style="color: darkgreen;">5.0.0.1</span></code></html> | <html><code><span style="color: darkgreen;">3.0.0.1</span></code></html>+ 
-2    | ''input''             | <html><code>|eth|<span style="color: darkgreen;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> | ''eth1'' |          | <html><code><span style="color: darkgreen;">5.0.0.1</span></code></html> | <html><code><span style="color: darkgreen;">3.0.0.1</span></code></html>+__Step (6):__ The routing lookup is performed. It determines that this packet 
-3    | ''xfrm/socket lookup'' | <html><code>|eth|<span style="color: darkgreen;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html>         |          |                               +needs to be forwarded and sent out on ''eth1'' and attaches this routing 
-4    | ''xfrm decode''       | <html><code>|eth|......|<span style="color: navy;">ip</span>|icmp|</code></html> |          |          |                               +decision to the packet. 
-5    | ''prerouting''        | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth1'' |          | <html><code><span style="color: navy;">10.0.2.100</span></code></html> | <html><code><span style="color: navy;">10.0.1.100</span></code></html>+ 
-6    | ''forward''           | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth1'' | ''eth0'' | <html><code><span style="color: navy;">10.0.2.100</span></code></html> | <html><code><span style="color: navy;">10.0.1.100</span></code></html>+__Step (7):__ The Xfrm framework performs a lookup into the IPsec SPD, 
-7    | ''postrouting''       | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> |          | ''eth0'' | <html><code><span style="color: navy;">10.0.2.100</span></code></html> | <html><code><span style="color: navy;">10.0.1.100</span></code></html> |+searching for a matching //forward policy// (''dir fwd'' SP). No match is 
 +found and the packet passes. 
 + 
 +__Step (8):__ The Xfrm framework performs a lookup into the IPsec SPD, 
 +searching for a matching //output policy// (''dir out'' SP) and yes, because of its 
 +source and destination IP addresses ''192.168.1.100'' and ''192.168.2.100'' the packet 
 +matches, see Figure {{ref>r1_ip_xfrm_policy}}. 
 +An IPsec SA is resolved (see Figure {{ref>r1_ip_xfrm_state}}), which corresponds to 
 +the matching SP. The Xfrm framework detects that tunnel-mode is configured in 
 +this SA. Thus, it now performs yet another routing lookup, this time for the (future) outer 
 +IPv4 packet, which will later encapsulate the current packet. A “bundle” of 
 +transformation instructions for this packet is assembled, which contains the 
 +original routing decision from step (6), the SP, the SA, the routing decision 
 +for the future outer IP packet and more. It is attached to the packet, 
 +replacing the attached routing decision from step (6). 
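
The SPD lookups of steps (7) and (8) can be pictured with a small user-space model. This is only an illustrative sketch, not kernel code: the ''Policy'' class, the ''spd_lookup'' function and the one-line template strings are hypothetical names I made up, and the model assumes that a lower ''priority'' value wins, which does not matter here because all three policies share the same priority. The selectors are copied from Figure {{ref>r1_ip_xfrm_policy}}.

<code python>
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass(frozen=True)
class Policy:
    src: str        # source selector (CIDR)
    dst: str        # destination selector (CIDR)
    direction: str  # "in", "fwd" or "out"
    priority: int   # assumption in this model: lower value wins
    tmpl: str       # simplified, human-readable stand-in for the template


# The three non-socket policies shown by "ip xfrm policy" on r1.
SPD = [
    Policy("192.168.1.0/24", "192.168.2.0/24", "out", 375423,
           "esp tunnel 8.0.0.1 -> 9.0.0.1"),
    Policy("192.168.2.0/24", "192.168.1.0/24", "fwd", 375423,
           "esp tunnel 9.0.0.1 -> 8.0.0.1"),
    Policy("192.168.2.0/24", "192.168.1.0/24", "in", 375423,
           "esp tunnel 9.0.0.1 -> 8.0.0.1"),
]


def spd_lookup(saddr: str, daddr: str, direction: str) -> Optional[Policy]:
    """Return the best matching policy for this direction, or None."""
    matches = [
        p for p in SPD
        if p.direction == direction
        and ip_address(saddr) in ip_network(p.src)
        and ip_address(daddr) in ip_network(p.dst)
    ]
    return min(matches, key=lambda p: p.priority, default=None)


# Echo-request h1 -> h2 on r1: no "fwd" match (step 7), "out" match (step 8).
print(spd_lookup("192.168.1.100", "192.168.2.100", "fwd"))
print(spd_lookup("192.168.1.100", "192.168.2.100", "out"))
</code>

For the echo-request the ''fwd'' lookup finds nothing because the only ''dir fwd'' policy has its selectors in the opposite direction, while the ''out'' lookup hits the policy whose template points to the tunnel endpoints ''8.0.0.1'' and ''9.0.0.1''.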
 + 
 +__Steps (9) (10):__ The packet traverses the Netfilter //Forward// and 
 +//Postrouting// hooks. 
 + 
 +__Step (11):__ The Xfrm framework transforms the packet 
 +according to the instructions in the attached “bundle”. In this case this 
 +means encapsulating the IP packet into a new outer IP packet  
 +with source IP address ''8.0.0.1'' and destination IP address ''9.0.0.1'' 
 +and then encapsulating the inner IP packet into ESP and encrypting it and its 
 +payload. After that the transformation instructions are removed from the 
 +“bundle”, leaving only the routing decision for the new outer IP packet 
 +attached to the packet. 
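
To make the change in packet layout between step (10) and step (12) concrete, here is a toy model of tunnel-mode encapsulation. It is a sketch under strong simplifications: the ''Packet'' class and both functions are hypothetical, headers are just names in a tuple, and no actual encryption or ESP trailer handling takes place.

<code python>
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Packet:
    saddr: str               # source address of the outermost IP header
    daddr: str               # destination address of the outermost IP header
    layers: Tuple[str, ...]  # header stack above Ethernet, outermost first


def encode_tunnel(inner: Packet, tun_src: str, tun_dst: str) -> Packet:
    """Wrap the complete inner IP packet into ESP and a new outer IP header."""
    return Packet(tun_src, tun_dst, ("ip", "esp") + inner.layers)


def decode_tunnel(outer: Packet, inner_saddr: str, inner_daddr: str) -> Packet:
    """Strip outer IP and ESP again. In reality the inner addresses are read
    from the decrypted inner header; this toy model takes them as arguments."""
    assert outer.layers[:2] == ("ip", "esp")
    return Packet(inner_saddr, inner_daddr, outer.layers[2:])


echo_request = Packet("192.168.1.100", "192.168.2.100", ("ip", "icmp"))
on_the_wire = encode_tunnel(echo_request, "8.0.0.1", "9.0.0.1")
print(on_the_wire)
</code>

After the transformation only the outer addresses ''8.0.0.1'' and ''9.0.0.1'' are visible to anything that inspects the IP header, which is what the ''ip saddr'' and ''ip daddr'' columns show from step (12) onwards.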
 + 
 +__Steps (12) (13) (14) (15) (16) (17):__ The packet is re-inserted into the local 
 +output path. It traverses the Netfilter //Output// hook and then again the 
 +Netfilter //Postrouting// hook. Then it traverses the //neighboring 
 +subsystem// which in this case resolves the next hop gateway IP address 
 +''8.0.0.2'' from the routing decision attached to this packet into a MAC address 
 +(by doing an ARP lookup if the address is not yet in the cache). 
 +Finally, it traverses the //egress// queueing discipline (network packet 
 +scheduler, ''tc''), //taps// (where e.g. Wireshark/tcpdump could listen) and 
 +then is sent out on ''eth1''
 + 
 +The output interface ''oif'' of the packet, since determined by the routing decision(s) in steps (6) and (8), always stays ''eth1'' when you check for it in an Nftables rule within one of the Netfilter hooks coming after that. This is what it means that the //Xfrm framework// does not use virtual network interfaces. If virtual network interfaces were used here instead (e.g. a ''vti'' interface named ''vti0''), then ''oif'' would be ''vti0'' in steps (9) and (10) instead of ''eth1'', but ''oif'' would still be ''eth1'' in steps (12) and (13). 
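
One practical consequence is that the same packet shows up in the //Postrouting// hook twice with the same ''oif''. The following sketch replays the relevant rows of the table above against a hypothetical rule match. The ''TRAVERSAL'' list and the ''matching_steps'' function are names invented for this illustration; this is not how Nftables evaluates rules internally.

<code python>
from ipaddress import ip_address, ip_network

# (step, hook, oif, ip saddr, ip daddr) of the echo-request on r1,
# copied from the table above for the Netfilter hooks after routing.
TRAVERSAL = [
    (9, "forward", "eth1", "192.168.1.100", "192.168.2.100"),
    (10, "postrouting", "eth1", "192.168.1.100", "192.168.2.100"),
    (12, "output", "eth1", "8.0.0.1", "9.0.0.1"),
    (13, "postrouting", "eth1", "8.0.0.1", "9.0.0.1"),
]


def matching_steps(hook: str, oif: str, daddr_net: str) -> list:
    """Steps at which a rule 'oif <oif> ip daddr <daddr_net>' in a chain
    registered at <hook> would match the traversing packet."""
    net = ip_network(daddr_net)
    return [
        step for step, h, o, _saddr, daddr in TRAVERSAL
        if h == hook and o == oif and ip_address(daddr) in net
    ]


# Matching on oif alone: hits the plain AND the encapsulated packet.
print(matching_steps("postrouting", "eth1", "0.0.0.0/0"))
# Adding the remote subnet as destination: hits only the plain packet.
print(matching_steps("postrouting", "eth1", "192.168.2.0/24"))
</code>

A rule which relies on ''oif eth1'' alone therefore cannot tell the not yet encrypted packet of step (10) from the ESP packet of step (13); additional criteria such as the addresses are needed for that.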
 + 
 +<figure echo_reply_r1_traversal> 
 +{{:linux:packet-flow-ipsec-tunnel-decrypt.png?direct&700|}} 
 +<caption> ICMP echo-reply ''h2'' -> ''h1'', ''r1'' traversal, decrypt+decapsulate (click to enlarge)</caption> 
 +</figure> 
 + 
 +^ Step ^^ Encapsulation ^ ''iif''  ^ ''oif''  ^ ''ip saddr''   ^ ''ip daddr''   ^
 +| 1  | **eth1** | <html><code>|eth|<span style="color: red;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> | ''eth1'' |          | <html><code><span style="color: red;">9.0.0.1</span></code></html> | <html><code><span style="color: red;">8.0.0.1</span></code></html> |
 +| 2  | <html><b><span style="background-color:#eeeeee;">taps</span></b></html> | <html><code>|eth|<span style="color: red;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> | ''eth1'' |          | <html><code><span style="color: red;">9.0.0.1</span></code></html> | <html><code><span style="color: red;">8.0.0.1</span></code></html> |
 +| 3  | <html><b><span style="background-color:#dddddd;">qdisc ingress</span></b></html> | <html><code>|eth|<span style="color: red;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> | ''eth1'' |          | <html><code><span style="color: red;">9.0.0.1</span></code></html> | <html><code><span style="color: red;">8.0.0.1</span></code></html> |
 +| 4  | <html><b><span style="background-color:#729fcf;">Ingress</span></b></html> **(eth1)** | <html><code>|eth|<span style="color: red;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> | ''eth1'' |          | <html><code><span style="color: red;">9.0.0.1</span></code></html> | <html><code><span style="color: red;">8.0.0.1</span></code></html> |
 +| 5  | <html><b><span style="background-color:#729fcf;">Prerouting</span></b></html> | <html><code>|eth|<span style="color: red;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> | ''eth1'' |          | <html><code><span style="color: red;">9.0.0.1</span></code></html> | <html><code><span style="color: red;">8.0.0.1</span></code></html> |
 +| 6  | **Routing** | <html><code>|eth|<span style="color: red;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> | ''eth1'' |          | <html><code><span style="color: red;">9.0.0.1</span></code></html> | <html><code><span style="color: red;">8.0.0.1</span></code></html> |
 +| 7  | <html><b><span style="background-color:#729fcf;">Input</span></b></html> | <html><code>|eth|<span style="color: red;">ip</span>|esp<span style="background-color:#E5E4E2;">|<span style="color: navy;">ip</span>|icmp|</span></code></html> | ''eth1'' |          | <html><code><span style="color: red;">9.0.0.1</span></code></html> | <html><code><span style="color: red;">8.0.0.1</span></code></html> |
 +| 8  | <html><b><span style="background-color:#d4abbe;">Xfrm decode</span></b></html> | <html><code>|eth|......|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth1'' |          | ...            | ...            |
 +| 9  | **eth1** | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth1'' |          | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html> |
 +| 10 | <html><b><span style="background-color:#eeeeee;">taps</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth1'' |          | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html> |
 +| 11 | <html><b><span style="background-color:#dddddd;">qdisc ingress</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth1'' |          | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html> |
 +| 12 | <html><b><span style="background-color:#729fcf;">Ingress</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth1'' |          | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html> |
 +| 13 | <html><b><span style="background-color:#729fcf;">Prerouting</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth1'' |          | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html> |
 +| 14 | **Routing** | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth1'' | ...      | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html> |
 +| 15 | <html><b><span style="background-color:#d4abbe;">Xfrm fwd policy</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth1'' | ''eth0'' | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html> |
 +| 16 | <html><b><span style="background-color:#d4abbe;">Xfrm out policy</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth1'' | ''eth0'' | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html> |
 +| 17 | <html><b><span style="background-color:#729fcf;">Forward</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> | ''eth1'' | ''eth0'' | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html> |
 +| 18 | <html><b><span style="background-color:#729fcf;">Postrouting</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> |          | ''eth0'' | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html> |
 +| 19 | <html><b><span style="background-color:#a3d196;">neighbor</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> |          | ''eth0'' | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html> |
 +| 20 | <html><b><span style="background-color:#dddddd;">qdisc egress</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> |          | ''eth0'' | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html> |
 +| 21 | <html><b><span style="background-color:#eeeeee;">taps</span></b></html> | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> |          | ''eth0'' | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html> |
 +| 22 | **eth0** | <html><code>|eth|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|<span style="color: navy;">ip</span>|icmp|</code></html> |          | ''eth0'' | <html><code><span style="color: navy;">192.168.2.100</span></code></html> | <html><code><span style="color: navy;">192.168.1.100</span></code></html> |
 + 
 + 
 +__Steps (1) (2) (3) (4) (5):__ The //ICMP echo-reply// sent from ''h2'' back to ''h1'' is received on ''eth1'' interface of ''r1'' in its encrypted+encapsulated form. It traverses //taps// (where e.g. Wireshark/tcpdump could listen), the //ingress// queueing discipline (network packet scheduler,  ''tc''), and the Netfilter //Ingress// hook of ''eth1'' (where e.g. a //flowtable// could be placed). Then it traverses the Netfilter //Prerouting// hook.  
 + 
 +__Steps (6) (7):__ The routing lookup is performed. In this case, the routing subsystem determines that this packet's destination IP ''8.0.0.1'' matches the IP address of interface ''eth1'' of this host ''r1''. Thus, this packet is destined for local reception and the lookup attaches a corresponding routing decision to it. As a result, the packet then traverses the Netfilter //Input// hook.
 + 
 +__Steps (8) (9):__ The Xfrm framework has a layer 4 receive handler waiting for incoming ESP 
 +packets at this point. It parses the SPI value from the ESP header of the 
 +packet and performs a lookup into the SAD for a matching IPsec SA (lookup 
 +based on SPI and destination IP address). A matching SA is found, see Figure 
 +{{ref>r1_ip_xfrm_state}}, which 
 +specifies the further steps to take here for this packet. It is decrypted and 
 +the ESP header is decapsulated. Now the internal IP packet becomes visible. 
 +The SA specifies tunnel-mode, so the outer IPv4 header is decapsulated. A lot 
 +of packet meta data is changed here, e.g. the attached routing decision (of 
 +the outer IP packet, which is now removed) is stripped away, the reference to 
 +connection tracking is removed, and a pointer to the SA which has been used 
 +here to transform the packet is attached (via skb extension 
 +''[[https://elixir.bootlin.com/linux/v5.10.46/source/include/net/xfrm.h#L1029|struct sec_path]]''). As a result, other kernel components can later 
 +recognize that this packet has been decrypted by Xfrm. Finally, the packet is 
 +now re-inserted into the OSI layer 2 receive path of ''eth1''.
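 +
 +To look at the SA which is found in this lookup, the SAD can be queried with ''ip xfrm state''. This is a small sketch, assuming it is run as root on ''r1'' while the tunnel is up. The filter values are the ones from this example; the SPI itself is negotiated by IKE and differs on each system:
 +
 +<code bash>
 +# list the inbound ESP SA(s) whose destination is the eth1 address of r1
 +ip xfrm state list dst 8.0.0.1 proto esp
 +# same, but with statistics (e.g. how much traffic this SA has handled so far)
 +ip -s xfrm state list dst 8.0.0.1 proto esp
 +</code>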
 + 
 +__Steps (10) (11) (12) (13):__ Now history repeats... the packet once again traverses 
 +//taps//, the //ingress// queueing discipline and the Netfilter //Ingress// hook of 
 +''eth1''. Then it traverses the Netfilter //Prerouting// hook.  
 + 
 +__Step (14):__ The routing lookup is performed. It determines that this 
 +packet needs to be forwarded and sent out on ''eth0'' and attaches this routing decision 
 +to the packet. 
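 +
 +Both routing lookups, the one in step (6) and the one here in step (14), can be reproduced roughly with ''ip route get''. This is only a sketch using the addresses of this example. The exact output depends on the system, and the second command may be rejected if reverse path filtering does not accept the given source address on ''eth1'':
 +
 +<code bash>
 +# step (6): destination is a local address of r1 -> local delivery
 +ip route get 8.0.0.1
 +# step (14): decrypted inner packet arriving on eth1 -> to be forwarded via eth0
 +ip route get 192.168.1.100 from 192.168.2.100 iif eth1
 +</code>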
 + 
 +__Step (15):__ The Xfrm framework recognizes that this packet has been transformed
 +according to the SA, whose pointer is still attached to the packet (within skb extension ''[[https://elixir.bootlin.com/linux/v5.10.46/source/include/net/xfrm.h#L1029|struct sec_path]]'').
 +It checks if a ''forward policy'' (''dir fwd'' SP) exists which corresponds to this SA, and yes, 
 +a match is found here, see Figure {{ref>r1_ip_xfrm_policy}}. So the packet passes. 
 + 
 +__Step (16):__ The Xfrm framework performs a lookup into the IPsec SPD, searching for a matching 
 +//output policy// (''dir out'' SP). No match is found. Still, the packet passes. 
 +The idea of the //output policy// is to detect packets which shall be encrypted with IPsec. 
 +Packets which do not match, just pass.  
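 +
 +The SPs consulted in steps (15) and (16) can be listed per direction with ''ip xfrm policy''. A minimal sketch, to be run as root on ''r1'':
 +
 +<code bash>
 +# forward policies, checked in step (15) for packets which came out of the tunnel
 +ip xfrm policy list dir fwd
 +# output policies, checked in step (16) to detect packets which shall be encrypted
 +ip xfrm policy list dir out
 +</code>
 +
 +In this example, the //output policy// selector is ''src 192.168.1.0/24 dst 192.168.2.0/24''. The decrypted echo-reply travels in the opposite direction (''192.168.2.100'' -> ''192.168.1.100'') and therefore does not match it.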
 + 
 +__Steps (17) (18) (19) (20) (21) (22):__ The packet traverses the Netfilter 
 +//Forward// and //Postrouting// hooks. Then it traverses the //neighboring
 +subsystem//, which resolves the destination IP address, which is now
 +''192.168.1.100'', into a MAC address (by doing an ARP lookup, if the address is not yet
 +in the cache). Finally, it traverses the //egress// queueing discipline and //taps//,
 +and is then sent out on ''eth0''.
 + 
 +The input interface ''iif'' of the packet during this whole traversal stays ''eth1'' when you check for it in an Nftables rule within one of the Netfilter hooks((obviously except the //Postrouting// hook, where the input interface is not remembered anymore)). 
 +This is what it means that the Xfrm framework does not use virtual network interfaces. If virtual network interfaces were used here instead (e.g. a ''vti'' interface named ''vti0''), then ''iif'' would still be ''eth1'' in steps (4), (5) and (7), but would be ''vti0'' in steps (12), (13) and (17).
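 +
 +One way to observe this yourself is a pair of counters in the //Prerouting// hook. The following is only a sketch; the table and chain names ''demo'' and ''pre'' as well as the priority are arbitrary choices of mine:
 +
 +<code bash>
 +nft add table demo
 +nft add chain demo pre { type filter hook prerouting priority -300\; }
 +# step (5): packet is still encrypted, received on eth1
 +nft add rule demo pre iif eth1 ip protocol esp counter
 +# step (13): packet is already decrypted, iif is still eth1
 +nft add rule demo pre iif eth1 ip protocol icmp counter
 +# show the counters
 +nft list table demo
 +</code>
 +
 +With a ping running from ''h1'' to ''h2'', both counters should increase with each echo-reply which traverses ''r1''.
 +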
 +===== SNAT, Nftables ===== 
 +Now to add the SNAT behavior to ''r1'' and ''r2'', we apply the following 
 +Nftables //ruleset// on ''r1'' and ''r2'': 
 + 
 +<code bash> 
 +nft add table nat 
 +nft add chain nat postrouting { type nat hook postrouting priority 100\; } 
 +nft add rule nat postrouting oif eth1 masquerade 
 +nft add table filter 
 +nft add chain filter forward { type filter hook forward priority 0\; policy drop\; } 
 +nft add rule filter forward iif eth0 oif eth1 accept 
 +nft add rule filter forward iif eth1 oif eth0 ct state established,related accept 
 +</code> 
 + 
 +This //ruleset// is identical on both hosts. 
 +What is the resulting change in behavior? Well, for non-VPN traffic now all works as intended. Hosts ''h1'' 
 +and ''h2'' e.g. can now finally ping ''rx''((which intentionally does not know the routes to the subnets behind ''r1'' and ''r2'', because routers in the Internet commonly do not know the routes to private subnets behind edge routers.)) and from the point of view of ''rx'' it looks like the ping came from ''r1'' or ''r2'', respectively.
 + 
 +However, let's take another look at the example from Figure {{ref>echo_request_r1_traversal}} above (the ICMP echo-request ''h1'' -> ''h2'' traversing ''r1'') and examine how the behavior differs now: In step (10) in the example the still unencrypted ICMP echo-request packet traverses the Netfilter //Postrouting// hook and thereby the Nftables //postrouting// chain. Because of the ''masquerade'' rule, its source IP address is now replaced with the address ''8.0.0.1'' of the ''eth1'' interface of ''r1''.  
 +As a result, this packet now does not match the IPsec //output policy// anymore. Thus, it won't get encrypted+encapsulated! Obviously that is not our intended behavior, but let's first dig deeper to understand what actually happens here: In step (8) this packet still had its original source and destination IP addresses ''192.168.1.100'' and ''192.168.2.100'' and thereby matched the //output policy// selector ''src 192.168.1.0/24 dst 192.168.2.0/24'' (see Figure {{ref>r1_ip_xfrm_policy}}). Now in step (10), with its source IP address changed to ''8.0.0.1'', it does not match that //output policy// anymore. But wait! That Xfrm lookup for the //output policy// has already been done in step (8) and the "bundle" with transform instructions is already attached to the packet. Isn't the chaos complete now??? No, because the NAT implementation is aware that this situation can occur. If NAT actually changes something in the packet traversing the //Postrouting// hook (like the source IP address in this case), then it calls the function ''[[https://elixir.bootlin.com/linux/v5.10.46/source/net/netfilter/nf_nat_core.c#L150|nf_xfrm_me_harder()]]''((BTW: Very funny function name! "transform me harder..." You kernel developers made my day! LOL)), which essentially does the whole step (8) all over again. Thus, the Xfrm lookup for a matching //output policy// is actually repeated in this case. For the ICMP echo-request packet in our example this means that the original "plain" routing decision is restored and the "bundle" with the Xfrm instructions is removed.
 + 
 +Ok, now we understand it ... the ping is natted, but then sent out plain and unencrypted. That is not what we want. Further, this ping is now anyway doomed to fail, because ''rx'' does not know the route to the target subnet and even if it did, ''r2'' would then drop the packet because it also is now configured as SNAT router and thereby drops incoming ''new'' connections in the ''forward'' chain. 
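 +
 +This can be checked with a capture on the WAN interface of ''r1'' while pinging from ''h1'' to ''h2''. A sketch, assuming ''tcpdump'' is installed on ''r1'':
 +
 +<code bash>
 +# show ICMP and ESP (IP protocol 50) packets on eth1, without name resolution
 +tcpdump -n -i eth1 'icmp or ip proto 50'
 +</code>
 +
 +With only the SNAT ruleset from above in place, the outgoing echo-requests should show up as plain ICMP with source address ''8.0.0.1''. ESP packets are what we actually want to see here.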
 + 
 +How to fix that? It is our intended behavior, that network packets from subnet 
 +''192.168.1.0/24'' to subnet ''192.168.2.0/24'' and vice versa shall travel through the 
 +VPN tunnel and shall not be natted. Also, it shall be possible to establish connections 
 +with connection oriented protocols (e.g. TCP((Connection tracking in Linux handles a ping (ICMP 
 +echo-request + echo-reply) as a connection oriented protocol, thus the same applies to ping here.))) 
 +in both ways through the VPN tunnel. One simple way to achieve this behavior is to add these two rules on ''r1'': 
 +<code bash r1> 
 +nft insert rule nat postrouting oif eth1 ip daddr 192.168.2.0/24 accept 
 +nft add rule filter forward iif eth1 oif eth0 ip saddr 192.168.2.0/24 ct state new accept 
 +</code> 
 +For the first rule I used ''insert'' instead of ''add'', so that it is inserted as first 
 +rule of the //postrouting// chain, thus BEFORE the ''masquerade'' rule. Obviously we 
 +need to do the equivalent (but not identical!) thing on ''r2'': 
 +<code bash r2> 
 +nft insert rule nat postrouting oif eth1 ip daddr 192.168.1.0/24 accept 
 +nft add rule filter forward iif eth1 oif eth0 ip saddr 192.168.1.0/24 ct state new accept 
 +</code> 
 + 
 +The complete rulesets then look like this: [[:linux:ipsec:example:ss1:nftables_ruleset|Nftables rulesets on r1 and r2]]
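 +
 +For ''r1'', the commands from above add up to roughly the following ruleset. This is my own condensed sketch in ruleset file syntax, assembled only from the rules shown in this article; it assumes that the tables ''nat'' and ''filter'' do not exist yet when it is loaded:
 +
 +<code bash r1>
 +# load the whole r1 ruleset in one go, reading it from stdin
 +nft -f - <<'EOF'
 +table ip nat {
 +    chain postrouting {
 +        type nat hook postrouting priority 100; policy accept;
 +        # traffic into the peer subnet: accept, i.e. no NAT
 +        oif "eth1" ip daddr 192.168.2.0/24 accept
 +        # all other traffic leaving on eth1: SNAT
 +        oif "eth1" masquerade
 +    }
 +}
 +table ip filter {
 +    chain forward {
 +        type filter hook forward priority 0; policy drop;
 +        iif "eth0" oif "eth1" accept
 +        iif "eth1" oif "eth0" ct state established,related accept
 +        # new connections from the peer subnet
 +        iif "eth1" oif "eth0" ip saddr 192.168.2.0/24 ct state new accept
 +    }
 +}
 +EOF
 +</code>
 +
 +On ''r2'' the same applies with ''192.168.1.0/24'' in place of ''192.168.2.0/24''.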
 + 
 +Let's look at the example from Figure {{ref>echo_request_r1_traversal}} again (ICMP echo-request ''h1'' -> ''h2'' traversing ''r1''): In step (10) when traversing the //postrouting// chain, the inserted rule ''oif eth1 ip daddr 192.168.2.0/24 accept'' now prevents the packet from being natted: it accepts the packet, which therefore never reaches the ''masquerade'' rule((The remaining packets of that connection are then anyway handled by the //nat// module in cooperation with //connection tracking// and not by this //chain//; //connection tracking// from now on knows that this connection is NOT to be natted.)). The remaining steps of the ''r1'' traversal now happen exactly as in the example and we have reached our intended behavior.
 + 
 +When the ICMP echo-request packet is received by and traverses ''r2'', the behavior follows the same principles as described in the example in Figure {{ref>echo_reply_r1_traversal}} above (ICMP echo-reply ''h2'' -> ''h1'' traversing ''r1''). However, it is necessary to add the rule ''iif eth1 oif eth0 ip saddr 192.168.1.0/24 ct state new accept'' to the //forward// chain, because from the point of view of ''r2'' this packet belongs to a ''new'' incoming connection. We want to allow such connections for packets which traveled through the VPN tunnel, but not for other packets.
 + 
 +===== Distinguish VPN/non-VPN traffic ===== 
 +No matter if your idea is to do NAT or your intentions are other kinds of packet manipulation or packet filtering, it all boils down to distinguishing between VPN and non-VPN traffic.  
 +The Nftables rulesets I applied to ''r1'' and ''r2'' in the example above are a crude way to do that, but it works. It is crude, because the distinction is made solely based on the target subnet (the destination IP address) of a traversing packet. Further, the added rule in the //forward// chain now allows incoming new connections from WAN side, if the source address is from the peer's subnet. However, the rule does not check, whether that traffic has actually been received via the VPN tunnel! Essentially, it opens a hole in the stateful firewall. And there are further issues which this crude solution does not address: What happens when the VPN tunnel is not up, but the mentioned Nftables rulesets are in place? What happens if the subnets behind the VPN gateways are not statically configured but instead are part of dynamic IKE negotiation during IKE handshake? ... 
 +As I mentioned above, things would be easier if the Xfrm framework used virtual network interfaces, because those could then serve as a basis for making this distinction.
 + 
 +Several means have been implemented to address these kinds of issues:
 +  * Strongswan provides an optional ''_updown'' script which is called each time when the VPN tunnel comes up or goes down. You can use it to dynamically set/remove the Nftables rules you require. The default version of that script already sets some Iptables rules for you, but depending on your system (kernel/Nftables version) this can create more problems than it solves. And who says that these default rules do fit well with your intended behavior? Thus, if you use this, you probably need to replace the default script with your own version. 
 +  * Nftables offers IPSEC EXPRESSIONS (syntax ''ipsec {in | out} [ spnum NUM ] ...'') which make it possible to determine whether a packet is part of the VPN context or not. I published a follow-up article which covers that topic in detail: [[nftables_demystifying_ipsec_expressions|Nftables - Demystifying IPsec expressions]]. 
 +  * Nftables offers "markers" which you can set and read on traversing packets (syntax ''meta mark'') which can help you to mark packets (set the mark) as being part of the VPN context in one hook/chain and in a later hook/chain you can read the mark again. 
 +  * So-called ''vti'' or ''xfrm'' virtual network interfaces can optionally be used "on top" of the default Xfrm framework behavior. 
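 +
 +To give a rough idea of the second and third option, here is a sketch of how the crude //forward// rule on ''r1'' could be made stricter. I use ''meta ipsec exists'', which matches packets that have been decrypted by Xfrm. The table name ''mangle'', the chain priority and the mark value ''0x1'' are arbitrary choices of mine, and each variant is meant to be used instead of the crude rule, not in addition to it:
 +
 +<code bash r1>
 +# variant A: accept new inbound connections only if the packet came out of the tunnel
 +nft add rule filter forward iif eth1 oif eth0 meta ipsec exists ip saddr 192.168.2.0/24 ct state new accept
 +
 +# variant B: mark decrypted packets in prerouting, check the mark later in forward
 +nft add table mangle
 +nft add chain mangle prerouting { type filter hook prerouting priority -150\; }
 +nft add rule mangle prerouting meta ipsec exists meta mark set 0x1
 +nft add rule filter forward iif eth1 oif eth0 meta mark 0x1 ct state new accept
 +</code>
 +
 +Both variants no longer accept a ''new'' connection just because its source address claims to be from the peer's subnet.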
  
  
Line 119: Line 569:
 ===== Context ===== ===== Context =====
 The described behavior and implementation has been observed on a The described behavior and implementation has been observed on a
-Debian 10 (buster) system with using Debian //backports// on ''amd64'' architecture.+Debian 10 (buster) system with using Debian //backports// on //amd64// architecture.
  
-  * kernel: ''5.4.19-1~bpo10+1'' +  * kernel: ''5.10.46-4~bpo10+1'' 
-  * nftables: ''0.9.3-2~bpo10+1'' +  * nftables: ''0.9.6-1~bpo10+1'' 
-  * libnftnl: ''1.1.5-1~bpo10+1'' +  * strongswan: ''5.7.2-1+deb10u1''
-  * strongswan: ''5.7.2-1''+
  
 ===== Feedback ===== ===== Feedback =====
-[[:feedback|Feedback]] to this article is very welcome!+[[:feedback|Feedback]] to this article is very welcome! Please be aware that I did not develop or contribute to any of the software components described here. I'm merely some developer who took a look at the source code and did some practical experimenting. If you find something which I might have misunderstood or described incorrectly here, then I would be very grateful, if you bring this to my attention and of course I'll then fix my content asap accordingly.  
 + 
 +===== References ===== 
 +  * [[https://www.strongswan.org/]] 
 +  * [[https://backreference.org/2014/11/12/on-the-fly-ipsec-vpn-with-iproute2/|backreference.org: “On the fly” IPsec VPN with iproute2 (2014)]] 
 +  * [[https://www.linux-magazin.de/ausgaben/2004/12/sicherer-brandstifter/|linux-magazin.de: Firewalling bei IPsec-Einsatz unter Kernel 2.6 (2004)]] 
 +  * [[https://ramirose.wixsite.com/ramirosen|Linux Kernel Networking - Implementation and Theory (Rami Rosen, Apress, 2014)]]
  
 +//published 2020-05-30//, //last modified 2022-08-14//
  
-{{tag>linux netfilter nftables ipsec strongswan charon swanctl xfrm}} 
blog/linux/nftables_ipsec_packet_flow.1591013678.txt.gz · Last modified: 2020-06-01 by Andrej Stender