Thermalcircle.de

Nftables - Netfilter and VPN/IPsec packet flow

In this article I'd like to explain what the packet flow through the Netfilter hooks looks like on a host which works as an IPsec-based VPN gateway in tunnel mode. Obviously, network packets which are to be sent through a VPN tunnel are encrypted+encapsulated on a VPN gateway and packets received through the tunnel are decapsulated and decrypted… but in which exact sequence does this happen, and which packet traverses which Netfilter hook in which form (encrypted / not yet encrypted / already decrypted)?

I'll do a short recap of IPsec in general, explain the IPsec implementation on Linux as it is commonly used today (Strongswan + Xfrm framework), and explain packet traversal through the VPN gateways in an example site-to-site VPN setup (IPsec in tunnel-mode, IKEv2, ESP, IPv4). I'll focus on Nftables instead of the older Iptables, and I'll set up the VPN via the modern Vici/swanctl configuration interface of Strongswan instead of the older Stroke interface.

IPsec short recap

A comprehensive recap of IPsec would require an entire book. I'll merely provide a very short recap here, focused on protocols and ports, to put my actual topic into context.

IKE protocol

An IPsec-based VPN possesses a “management channel” between both VPN endpoint hosts1), which is the IKE protocol2). It is responsible for bringing up, managing, and tearing down the VPN tunnel connection between both VPN endpoints. This gives both endpoints the opportunity to authenticate each other and to negotiate encryption, authentication and data-integrity algorithms and the keys those algorithms require (usually session-based keys). IKE is encapsulated in UDP and uses UDP port 500. In case of NAT-traversal (= if a NAT router is detected between both endpoints during the IKE handshake), it dynamically switches to UDP port 4500 during the handshake. Thus, IKE encapsulation on an Ethernet-based network looks like this:

|eth|ip|udp|ikev2| IKEv2 packet on UDP/500, Nat-traversal: UDP/500 → UDP/4500

SAs and SPs

The mentioned algorithms and keys which are negotiated during IKE handshake are being organized in so-called Security Associations (SAs). Usually there are (at least) three SAs negotiated for each VPN tunnel connection: The IKE_SA which, once established, represents the secured communication channel for IKE itself and (at least) two more CHILD_SAs, one for each data flow direction, which represent the secured communication channels for packets which shall flow through the VPN tunnel.

In addition to the SAs, IPsec also introduces the concept of the so-called Security Policies (SPs), which are also created during IKE handshake. Those are either defined by the IPsec tunnel configuration provided by the admin/user and/or (depending on the case) can also at least partly result from dynamic IKE negotiation. The purpose of the SPs is to act as “traffic selectors” on each VPN endpoint, deciding which network packets shall travel through the VPN tunnel and which not. Usually the SPs make this distinction based on the source and destination IP addresses (/subnets) of the packets, but additional parameters (e.g. protocols, port numbers, …) can also be considered.

Be aware that both SAs and SPs are volatile, not persistent, data. Their lifetime is bounded by the lifetime of the VPN tunnel connection. It might even be shorter because of key re-negotiations (“rekeying”).

ESP protocol, tunnel-mode

After the initial IKE handshake has successfully finished, the VPN tunnel between both endpoints is “up” and packets can travel through it. In case of tunnel-mode, the IP packets which shall travel through the VPN tunnel are encrypted and then encapsulated in packets of the so-called ESP protocol3). The whole thing is then further encapsulated into another (an “outer”) IP packet. The reason is that the VPN tunnel itself is merely a point-to-point connection between two VPN endpoints (one source and one destination IP address), but those endpoints in this case are VPN gateways which connect entire subnets on both ends of the tunnel. Thus, the source and destination IP addresses of the “payload” packets which travel through the VPN tunnel need to be kept independent from the source and destination IP addresses of the “outer” IP packets. Encapsulation then looks like this (example for a TCP connection):

|eth|      |ip|tcp|payload|4) A “normal” packet which shall travel through the VPN tunnel, is encrypted and encapsulated like this while traversing the VPN gateway.
|eth|ip|esp|ip|tcp|payload|5)6)

If Nat-traversal is active, then ESP is additionally encapsulated in UDP:

|eth|          |ip|tcp|payload| In case of Nat-traversal additional encapsulation in UDP (same port 4500 as for IKE is then used here).
|eth|ip|udp|esp|ip|tcp|payload|
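If you want to observe these encapsulations on the wire, tcpdump can make them visible. A small sketch (requires root; the interface name eth1 is taken from the example setup later in this article, adjust it to your system):

```shell
# Observe the IKE handshake (UDP/500, or UDP/4500 with Nat-traversal)
# on the tunnel-facing interface:
tcpdump -ni eth1 'udp port 500 or udp port 4500'

# Observe the encrypted ESP packets themselves (ESP = IP protocol 50):
tcpdump -ni eth1 esp
```

With NAT-traversal active, the ESP packets show up in the first capture as well, because they are then UDP-encapsulated on port 4500.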

IPsec Linux implementation

The IPsec implementation in Linux consists of a userspace part and a kernel part. Several implementations have been created over the years. Nowadays the most commonly used implementation of the userspace part seems to be Strongswan (there were/are other implementations like Openswan and FreeS/WAN; I'll focus on Strongswan here). The IPsec implementation in modern Linux kernels is the so-called Xfrm framework, which is sometimes also called the Netkey stack. It has been present in the Linux kernel since v2.6. There have been predecessors, like the KLIPS IPsec stack, which was used in kernel v2.4 and earlier. With Figure 1 I'd like to show the responsibilities of Strongswan and the Xfrm framework and how both interact with each other, in a simple block diagram style.

Figure 1: Block diagram showing userspace and kernel part of IPsec implementation on Linux (StrongSwan and Xfrm framework) and interfaces between both.

Strongswan

The essential part of Strongswan is the userspace daemon charon which implements IKEv1/IKEv2 and acts as the central “orchestrator” of IPsec-based VPN tunnels/connections on each VPN endpoint host.

It provides an interface to the user/admin for configuration of IPsec on the system. Actually, more precisely, it provides two different interfaces to do that: One is the so-called Stroke interface. It provides means to configure IPsec via two main config files /etc/ipsec.conf and /etc/ipsec.secrets. This is the older of the two interfaces and it can be considered deprecated by now (however it is still supported).

The other and newer one is the so-called Vici interface. This is an IPC mechanism, which means the charon daemon is listening on a Unix domain socket and client tools like Strongswan's own cmdline tool swanctl7) can connect to it to configure IPsec. This way of configuration is more powerful than the Stroke interface, because it makes it easier for other tools to provide and adjust configuration dynamically and event-driven at any time. However, in many common IPsec setups the configuration is still simply supplied via config files. When using Vici, the difference merely is that the config file(s) (mainly the file /etc/swanctl/swanctl.conf) are not interpreted by the charon daemon directly, but instead are interpreted by the cmdline tool swanctl, which then feeds this config into the charon daemon via the Vici IPC interface. Further, the syntax of swanctl.conf differs slightly from the syntax of ipsec.conf of the Stroke interface, but the semantics are the same.

An additional config file /etc/strongswan.conf8) exists, which contains general/global strongswan settings, which are not directly related to individual VPN connections.

So let's say you created a swanctl.conf config file on both of your VPN endpoint hosts with the intended configuration of your VPN tunnel and you used the swanctl tool to load this configuration into the charon daemon and to initiate the IKE handshake with the peer endpoint9). The charon daemon, when executing the IKE handshake, negotiates all the details of the VPN tunnel with the peer as explained above and thereby creates the SA and SP instances which define the VPN connection. It keeps the IKE_SA for itself in userspace, because this is its own “secure channel” for IKE communication with the peer endpoint. It feeds the other SA and SP instances into the kernel via a Netlink socket. The VPN connection/tunnel is now “up”. If you later decide to terminate the VPN connection again, charon removes the SA and SP instances from the kernel.
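You can actually watch this Netlink communication between charon and the kernel happen. A small sketch (requires root; the connection/child names are the ones from the example setup later in this article):

```shell
# Terminal 1: print SA/SP changes in the kernel's databases
# in real time as they happen:
ip xfrm monitor

# Terminal 2: bring the tunnel up and down again; the monitor
# will show the SA/SP instances being added and deleted by charon:
swanctl --initiate --child net-net
swanctl --terminate --child net-net
```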

The Xfrm framework

The so-called Xfrm framework is a component within the Linux kernel. As the man page man 8 ip-xfrm states, it is an “IP framework for transforming packets (such as encrypting their payloads)”. Thus, “Xfrm” stands for “transform”. While the userspace part (Strongswan) handles the overall IPsec orchestration and runs the IKEv1/IKEv2 protocol to build up / tear down VPN tunnels/connections, the kernel part is responsible for encrypting+encapsulating and decrypting+decapsulating the network packets which travel through the VPN tunnel, and for selecting/deciding which packets go through the VPN tunnel at all. To do that, it requires all SA and SP instances which define the VPN tunnel/connection to be present in the form of data structures within the kernel. Only then can it decide which packets shall be encrypted/decrypted and which not, and which encryption algorithms and keys to use.

The Xfrm framework implements the so-called Security Association Database (SAD) and the Security Policy Database (SPD) for holding SA and SP instances in the kernel. Userspace components (like Strongswan, the iproute2 tool collection and others) can use a Netlink socket to talk to the kernel and to show/create/adjust/delete SA and SP instances in the SAD and SPD. You can e.g. use the iproute2 tool ip to show the SA and SP instances which currently exist in those databases:

  • command ip xfrm state shows SA instances
  • command ip xfrm policy shows SP instances

You can even use ip as a low-level config tool to create/delete SA and SP instances. There is a very good article which explains how to do that. However in practice you leave the duty of creating/deleting SA and SP instances to Strongswan.

SP instances can be created for three different “data directions”:

security policy    syntax10)  meaning
“output policy”    dir out    SP works as a selector on outgoing packets to select which are to be encrypted+encapsulated and which not
“input policy”     dir in     SP works as a selector on incoming packets which already have been decrypted+decapsulated and have a destination IP local to the system
“forward policy”   dir fwd    SP works as a selector on incoming packets which already have been decrypted+decapsulated and have a destination IP which is not local, thus packets which are to be forwarded (routed)
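Just to illustrate what such low-level configuration looks like, here is a sketch of manually creating one SA and a matching output policy with ip (requires root; all addresses, the spi and the key are made-up example values; in practice Strongswan negotiates and installs these for you):

```shell
# Create an SA for direction 8.0.0.1 -> 9.0.0.1 (tunnel mode, ESP,
# AES-GCM; the 160-bit key+salt value here is made up):
ip xfrm state add src 8.0.0.1 dst 9.0.0.1 \
    proto esp spi 0x12345678 reqid 1 mode tunnel \
    aead 'rfc4106(gcm(aes))' 0x0123456789abcdef0123456789abcdef01234567 128

# Create the matching "output policy" (dir out SP); the tmpl part
# links the policy to the SA above via reqid:
ip xfrm policy add src 192.168.1.0/24 dst 192.168.2.0/24 dir out \
    tmpl src 8.0.0.1 dst 9.0.0.1 proto esp reqid 1 mode tunnel
```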

Due to the fact that IPsec is a mandatory part of the IPv6 protocol (but is also available for the IPv4 protocol), the implementation of the Xfrm framework (or the “IPsec stack”) is deeply interwoven with the implementation of the IPv4 and IPv6 protocols in the kernel, which makes things very complex when you look into the details. I have seen statements claiming that the Xfrm framework is the most complex part of the entire network stack in the Linux kernel.

So, what I describe in the following is a “simplified view” on how things work. It is a sufficient model to understand how the network packet flow works and how the Xfrm framework relates to the Netfilter framework (Netfilter and Xfrm are implemented independently from each other in the kernel!).

If you are working with Nftables or Iptables, then you are probably familiar with the widely used Netfilter packet flow image, which illustrates the packet flow through the Netfilter hooks and Iptables chains. One great thing about this image is that it covers the Xfrm framework, too (at least from a bird's-eye view). It illustrates four distinct Xfrm “decision points”11) in the network packet flow path and shows clearly where those are located in relation to the Iptables chains. In Figure 2 I created a simplified version of this image, which only shows the Netfilter hooks (blue boxes) and the Xfrm “decision points” (grey boxes), and thereby is not focused on the subtle differences between Iptables and Nftables. Because my focus is on the Xfrm framework, I added a fifth “decision point” here, which is not shown in the original Netfilter packet flow image, but which I know exists from reading the kernel source code:

Figure 2: Block diagram of Netfilter hooks and Xfrm decision points

I'll explain the Xfrm “decision points” here, while assuming that you are already familiar with the Netfilter hooks which I covered in much detail in my other article Nftables - Packet flow and Netfilter hooks in detail (if not, please read that article first).

  • Xfrm lookup: This is where the SPD is used to check if traversing packets match any “output policy” (dir out SP). If yes, they are given to the Xfrm encode step to be encrypted+encapsulated.
  • Xfrm encode: This is where packets which shall travel through the VPN tunnel are encrypted and encapsulated based on SA instances within the SAD (which SA to use and how it relates to SP instances is regulated via integer identifiers like reqid and spi; watch out for the tmpl keyword in the output of command ip xfrm policy).
  • Xfrm/socket lookup: This “decision point” is actually a conglomerate of several checks which happen in different situations: (1) Incoming ESP packets are checked against the SAD, and if a matching SA is found (source and destination IP and spi match), those packets are given to the Xfrm decode step. (2) If already decrypted+decapsulated packets arrive here, the SPD is used to check if they match any “input policy” (dir in SP). If yes, they simply pass; if no, they are dropped.
    This step needs to do a little more “magic” if Nat-traversal is used. In that case both IKE and ESP packets arrive here encapsulated in UDP on port 4500, and the kernel must distinguish between both12): it gives the IKE packets to a userspace UDP socket where Strongswan13) is listening, and checks the ESP packets against the SAD and gives them to the Xfrm decode step.
  • Xfrm decode: This is where packets which have been received through the VPN tunnel are decrypted and decapsulated based on SA instances within the SAD.
  • (Xfrm fwd lookup): This step is not shown in the Netfilter packet flow image (probably because it is considered less relevant), but I mention it here for completeness: If already decrypted+decapsulated packets arrive here, the SPD is used to check if they match any “forward policy” (dir fwd SP) instance. If yes, they simply pass; if no, they are dropped.

It is very important to mention that the Xfrm framework implementation does NOT use virtual network interfaces to distinguish between VPN and non-VPN traffic. This is a relevant difference compared to other implementations like the older KLIPS IPsec stack, which was used in kernel v2.4 and earlier. Why is this relevant? It is true that virtual network interfaces are not required, because the concept of the SPs provides all the distinction which is required for the VPN to operate. However, the absence of virtual network interfaces makes it harder for Netfilter-based packet filtering systems like Iptables and Nftables to distinguish between VPN and non-VPN packets within their rules.

It is obvious that an Nftables rule would be easy to write if all VPN traffic went through a virtual network interface, e.g. called ipsec0. In case of the Xfrm framework that is not the case, at least not by default. Additional features have been developed over the years to address this problem. Some of them re-introduce the concept of virtual network interfaces “on top” of the Xfrm framework, but those are optional to use and never became the default. The Strongswan documentation calls VPN setups based on those virtual network interfaces "Route-based VPNs". Essentially, two types of virtual interfaces have been introduced in this context over the years: the older vti interfaces and the newer xfrm interfaces14). In the remaining part of this article I will describe what the IPsec-based VPN looks like from a Netfilter point of view in the “normal” case where NO virtual network interfaces are used.
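For completeness, a sketch of how such an xfrm interface would be created with iproute2 (requires root; the interface name, if_id value and subnet are made up; on the Strongswan side the CHILD_SA then needs to be bound to the same interface id via its if_id_in/if_id_out options):

```shell
# Create a virtual xfrm interface bound to interface id 42,
# with eth1 as the underlying physical device:
ip link add ipsec0 type xfrm dev eth1 if_id 42
ip link set ipsec0 up

# Route traffic for the remote subnet through the virtual interface:
ip route add 192.168.2.0/24 dev ipsec0
```

Nftables rules could then simply match on oif/iif ipsec0 to distinguish VPN traffic. The rest of this article assumes the default setup without such interfaces.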

Example Site-to-site VPN

Figure 3: IPsec site-to-site example setup with VPN gateways r1 and r2 (can be roughly compared to the Strongswan: Test swanctl/net2net-psk).

It is better to have a practical example as basis for further diving into the topic. Here I will use a site-to-site VPN setup, which is created between two VPN gateways r1 and r2 (IPsec, tunnel-mode, IKEv2, ESP, IPv4) as shown in Figure 3. The VPN tunnel will connect the local subnets behind r1 and r2. Additionally, both r1 and r2 operate as SNAT edge routers when forwarding non-VPN traffic, but not for VPN traffic. This creates the necessity to distinguish between VPN and non-VPN packets in Nftables rules, but more on that later. The router rx is just a placeholder for an arbitrary cloud (e.g. the Internet) between both VPN gateways.

r1: swanctl.conf
connections {
  gw-gw {
    local_addrs  = 8.0.0.1
    remote_addrs = 9.0.0.1
    local {
      auth = psk
      id = r1
    }
    remote {
      auth = psk
      id = r2
    }
    children {
      net-net {
        mode = tunnel
        local_ts  = 192.168.1.0/24
        remote_ts = 192.168.2.0/24
        esp_proposals = aes128gcm128
      }
    }
    version = 2
    mobike = no
    reauth_time = 10800
    proposals = aes128-sha256-modp3072
  }
}
secrets {
  ike-1 {
    id-1 = r1
    id-2 = r2
    secret = "ohCeiVi5iez36ieFu"
  }
}
r2: swanctl.conf
connections {
  gw-gw {
    local_addrs  = 9.0.0.1
    remote_addrs = 8.0.0.1 
    local {
      auth = psk
      id = r2
    }
    remote {
      auth = psk
      id = r1
    }
    children {
      net-net {
        mode = tunnel
        local_ts  = 192.168.2.0/24
        remote_ts = 192.168.1.0/24
        esp_proposals = aes128gcm128
      }
    }
    version = 2
    mobike = no
    reauth_time = 10800
    proposals = aes128-sha256-modp3072
  }
}
secrets {
  ike-1 {
    id-1 = r1
    id-2 = r2
    secret = "ohCeiVi5iez36ieFu"
  }
}

Figure 4: Strongswan configuration on r1 and r215)

Execute command swanctl --load-all on r1 and r2 to load the configuration of Figure 4 into the charon daemon. Execute command swanctl --initiate --child net-net on r1 to initiate the VPN connection between both gateways. This triggers the IKEv2 handshake with r1 as “initiator” and r2 as “responder”. On successful IKE handshake, the charon daemon feeds the SAs and SPs which define this VPN tunnel via Netlink socket into the kernel, and thereby the tunnel is now up. You can use the iproute2 tool ip as a low-level admin tool to show the SAs and SPs which currently exist in the databases inside the kernel, see Figures 5 and 6.

root@r1:~# ip xfrm state
src 8.0.0.1 dst 9.0.0.1
        proto esp spi 0xc5400599 reqid 1 mode tunnel
        replay-window 0 flag af-unspec
        aead rfc4106(gcm(aes)) 0x8849c107d9f6972da27a5faef554a68b10f3b938 128
        anti-replay context: seq 0x0, oseq 0x9, bitmap 0x00000000
src 9.0.0.1 dst 8.0.0.1
        proto esp spi 0xcd7dff80 reqid 1 mode tunnel
        replay-window 32 flag af-unspec
        aead rfc4106(gcm(aes)) 0x3c0497d489904175bdb446f3e09ae4c3acaf5d45 128
        anti-replay context: seq 0x9, oseq 0x0, bitmap 0x000001ff

Figure 5: Showing SAs currently present in SAD in kernel on r1 with ip xfrm command

root@r1:~# ip xfrm policy
src 192.168.1.0/24 dst 192.168.2.0/24 
        dir out priority 375423 ptype main 
        tmpl src 8.0.0.1 dst 9.0.0.1
                proto esp spi 0xc5400599 reqid 1 mode tunnel
src 192.168.2.0/24 dst 192.168.1.0/24 
        dir fwd priority 375423 ptype main 
        tmpl src 9.0.0.1 dst 8.0.0.1
                proto esp reqid 1 mode tunnel
src 192.168.2.0/24 dst 192.168.1.0/24 
        dir in priority 375423 ptype main 
        tmpl src 9.0.0.1 dst 8.0.0.1
                proto esp reqid 1 mode tunnel
src 0.0.0.0/0 dst 0.0.0.0/0 
        socket in priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
        socket out priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
        socket in priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
        socket out priority 0 ptype main 
src ::/0 dst ::/0 
        socket in priority 0 ptype main 
src ::/0 dst ::/0 
        socket out priority 0 ptype main 
src ::/0 dst ::/0 
        socket in priority 0 ptype main 
src ::/0 dst ::/0 
        socket out priority 0 ptype main 

Figure 6: Showing SPs currently present in SPD in kernel on r1 with ip xfrm command

Just for completeness: To tear the VPN tunnel down again, you would need to execute command swanctl --terminate --child net-net to terminate the ESP tunnel (SPs and SAs get removed from the kernel databases) and then command swanctl --terminate --ike gw-gw to terminate the IKE connection/association (IKE_SA).

Packet Flow

Let's say the VPN tunnel in the example described above is now up and running. To start simple, let's also assume that we have not yet configured any Nftables ruleset on the VPN gateways r1 and r2. Thus, those two hosts do not yet do SNAT (we will do this later). The VPN will already work normally, which means any IP packet which is sent from any host in subnet 192.168.1.0/24 to any host in subnet 192.168.2.0/24 or vice versa will travel through the VPN tunnel.

This means, packets which are traversing one of the VPN gateways r1 or r2 and are about to “enter the VPN tunnel” are being encrypted and then encapsulated in ESP and an outer IP packet. Packets which are “leaving the tunnel” are being decapsulated (the outer IP packet and ESP header are stripped away) and decrypted. Let's observe that on VPN gateway r1 by sending a single ping from h1 to h2, see Figure 7.


h1$ ping -c1 192.168.2.100
#       h1                                    h2
# 192.168.1.100 -> ICMP echo-request -> 192.168.2.100
# 192.168.1.100 <- ICMP echo-reply   <- 192.168.2.100

Figure 7: Single ping from h1 to h2 traversing r1; ICMP echo-request entering VPN tunnel, ICMP echo-reply leaving VPN tunnel.

The content of the following two Figures 8 and 9 is the result of experiments I did using the trace and log features of Nftables. Those features make this traversal visible to you; however, they only cover Nftables chains, rules and thereby Netfilter hooks. They cannot show you what is going on in the Xfrm framework. You see this only indirectly, by observing a packet being still unencrypted while traversing one Netfilter hook and then appearing encrypted+encapsulated in the next Netfilter hook. Also I played a little with the ip xfrm policy command to find out how the behavior changes when I remove one of the SP instances set by Strongswan. From what I have read on the Internet, you can do more or less “packet filtering” things with the SPs (“traffic selectors”) in the Xfrm framework, but how their logic works in detail seems to be documented nowhere. I will add more info here if I learn more on that later.
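The mentioned trace feature can be enabled roughly like this; a minimal sketch (requires root; table/chain names are made up for this illustration):

```shell
# Set the nftrace bit on interesting packets in an early chain ...
nft add table ip t
nft add chain ip t pre { type filter hook prerouting priority -300\; }
nft add rule ip t pre icmp type echo-request meta nftrace set 1

# ... and then watch these packets traverse all chains and hooks:
nft monitor trace
```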

Netfilter / Xfrm Encapsulation iif oif ip saddr ip daddr
1 Prerouting |eth|      |ip|icmp| eth0 192.168.1.100 192.168.2.100
2 Routing |eth|      |ip|icmp| eth0 192.168.1.100 192.168.2.100
3 Xfrm fwd lookup |eth|      |ip|icmp| eth0 eth1 192.168.1.100 192.168.2.100
4 Forward |eth|      |ip|icmp| eth0 eth1 192.168.1.100 192.168.2.100
5 Postrouting |eth|      |ip|icmp| eth1 192.168.1.100 192.168.2.100
6 Xfrm lookup |eth|      |ip|icmp| eth1 192.168.1.100 192.168.2.100
7 Xfrm encode |eth|......|ip|icmp| eth1
8 Output |eth|ip|esp|ip|icmp| eth1 8.0.0.1 9.0.0.1
9 Postrouting |eth|ip|esp|ip|icmp| eth1 8.0.0.1 9.0.0.1
10 Xfrm lookup |eth|ip|esp|ip|icmp| eth1 8.0.0.1 9.0.0.1

Figure 8: ICMP echo-request h1h2, r1 traversal

  • (1), (2), (3), (4), (5):
    The ICMP echo-request from h1 to h2 is received on the eth0 interface of r1 and thus first (1) traverses the Prerouting Netfilter hook, then (2) the routing decision is made, and eth1 is determined as the output interface for this packet. Then (3) a lookup into the Xfrm SPD is done to check if the packet matches any of the “forward policies” (dir fwd SPs, see Figure 6), but it does not and thus simply passes that step. Then it traverses the (4) Forward and the (5) Postrouting Netfilter hooks. This behavior so far is identical to any “normal” network packet which is being forwarded on a router.
  • (6), (7):
    The ICMP echo-request packet traverses (6) the Xfrm lookup decision point. A lookup into the SPD is done to check if it matches any “output policy” (dir out SP) and yes, because of its source and destination IP addresses 192.168.1.100 and 192.168.2.100 the packet matches, see Figure 6. Thus this packet is NOT immediately sent out on eth1 and instead (7) it is given to Xfrm encode step where it is encrypted+encapsulated into ESP and an outer IP packet based on the corresponding SA (Figure 5) in the SAD; see tmpl statement in the matching output policy (Figure 6). The outer IP header has the source and destination IP addresses 8.0.0.1 and 9.0.0.1.
  • (8), (9), (10):
    The resulting packet now traverses the (8) Output and (9) Postrouting Netfilter hooks and then (10) once again the Xfrm lookup decision point. It is already encrypted+encapsulated, thus there is no further match in the SPD and it simply passes and is finally sent out on eth1.

Important to note regarding Figure 8 is that the oif, once determined by the routing decision, always stays eth1. This is a consequence of the Xfrm framework not using virtual network interfaces. If virtual network interfaces were instead used here (e.g. a vti interface named vti0), then oif would be vti0 in steps (3) till (7) instead of eth1, but oif would still be eth1 in steps (8) till (10).

Netfilter / Xfrm Encapsulation iif oif ip saddr ip daddr
1 Prerouting |eth|ip|esp|ip|icmp| eth1 9.0.0.1 8.0.0.1
2 Routing |eth|ip|esp|ip|icmp| eth1 9.0.0.1 8.0.0.1
3 Input |eth|ip|esp|ip|icmp| eth1 9.0.0.1 8.0.0.1
4 Xfrm/sock lookup |eth|ip|esp|ip|icmp| eth1 9.0.0.1 8.0.0.1
5 Xfrm decode |eth|......|ip|icmp| eth1
6 Prerouting |eth|      |ip|icmp| eth1 192.168.2.100 192.168.1.100
7 Routing |eth|      |ip|icmp| eth1 192.168.2.100 192.168.1.100
8 Xfrm fwd lookup |eth|      |ip|icmp| eth1 eth0 192.168.2.100 192.168.1.100
9 Forward |eth|      |ip|icmp| eth1 eth0 192.168.2.100 192.168.1.100
10 Postrouting |eth|      |ip|icmp| eth0 192.168.2.100 192.168.1.100
11 Xfrm lookup |eth|      |ip|icmp| eth0 192.168.2.100 192.168.1.100

Figure 9: ICMP echo-reply h2h1, r1 traversal

  • (1), (2), (3):
    The ICMP echo-reply sent from h2 back to h1 is received on eth1 interface of r1 and thus first (1) traverses the Prerouting Netfilter hook, then (2) the routing decision is made. Because this packet is still encrypted and encapsulated, its outer IP header has the destination IP address 8.0.0.1 (the IP address of r1 eth1 interface) and thus routing decides this is a packet targeted for local reception and (3) gives it to the Input Netfilter hook.
  • (4), (5):
    The packet traverses the Xfrm/socket lookup decision point. Because this packet contains an ESP header, a lookup into the SAD is done and a matching SA is found, see Figure 5. Thus, this packet is given to (5) Xfrm decode, where it is decapsulated and decrypted (ESP and outer IP header are stripped away) based on the matching SA.
  • (6), (7), (8):
    The packet is now re-inserted in the input path of network interface eth1. It traverses (6) the Prerouting hook and (7) the routing decision is made and eth0 is determined as the output interface for this packet. It (8) traverses the Xfrm fwd lookup decision point where a lookup into the SPD is done to check if the packet matches any “forward policy” (dir fwd SP, see Figure 6). It matches and thereby passes16).
  • (9), (10), (11):
    The packet now traverses the (9) Forward and the (10) Postrouting Netfilter hook and finally also the Xfrm lookup decision point (there is no SP match and the packet simply passes), before being sent out on eth0.

Important to note regarding Figure 9 is that the iif during this whole traversal stays eth1. This is a consequence of the Xfrm framework not using virtual network interfaces. If virtual network interfaces were instead used here (e.g. a vti interface named vti0), then iif would still be eth1 in steps (1) till (5), but would be vti0 in steps (6) till (9).

SNAT, Nftables

Now to add the SNAT behavior to r1 and r2, we apply the following Nftables ruleset on r1 and r2:

nft add table nat
nft add chain nat postrouting { type nat hook postrouting priority 100\; }
nft add rule nat postrouting oif eth1 masquerade
nft add table filter
nft add chain filter forward { type filter hook forward priority 0\; policy drop\; }
nft add rule filter forward iif eth0 oif eth1 accept
nft add rule filter forward iif eth1 oif eth0 ct state established,related accept

This ruleset is identical on both hosts. What is the resulting change in behavior? Well, for non-VPN traffic all now works as intended. Hosts h1 and h2 e.g. can now finally ping rx17), and from rx point of view it looks like the ping came from r1 or r2, respectively.

However, let's take another look at the example from Figure 8 above (the ICMP echo-request h1h2 traversing r1) and examine how the behavior differs now: In step (5) the still unencrypted ICMP echo-request packet traverses the Netfilter Postrouting hook and thereby the Nftables postrouting chain. Because of the masquerade rule, its source IP address is now replaced with the address 8.0.0.1 of the eth1 interface of r1. As a result, in step (6) this packet does NOT match any IPsec “output policy” anymore; thus it is not encrypted+encapsulated and does not travel through the VPN tunnel. Obviously this is not our intended behavior. Further, this ping is doomed to fail anyway, because rx does not know the route to the target subnet, and even if it did, r2 would drop the packet, because it also is now configured as an SNAT router and thereby drops incoming new connections in its forward chain.

How to fix that? It is our intended behavior that network packets from subnet 192.168.1.0/24 to subnet 192.168.2.0/24 and vice versa shall travel through the VPN tunnel and shall not be NATed. Also, it shall be possible to establish connections with connection-oriented protocols (e.g. TCP18)) in both directions through the VPN tunnel. One simple way to achieve this behavior is to add these two rules on r1:

r1
nft insert rule nat postrouting oif eth1 ip daddr 192.168.2.0/24 accept
nft add rule filter forward iif eth1 oif eth0 ip saddr 192.168.2.0/24 ct state new accept

For the first rule I used insert instead of add, so that it is inserted as the first rule of the postrouting chain, thus BEFORE the masquerade rule. Obviously we need to do the equivalent (but not identical!) thing on r2:

r2
nft insert rule nat postrouting oif eth1 ip daddr 192.168.1.0/24 accept
nft add rule filter forward iif eth1 oif eth0 ip saddr 192.168.1.0/24 ct state new accept

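Assembled from the commands above, the complete resulting ruleset on r1, roughly as nft list ruleset would print it, should look like this (r2 analogous, with the subnets swapped):

```
table ip nat {
	chain postrouting {
		type nat hook postrouting priority 100; policy accept;
		oif "eth1" ip daddr 192.168.2.0/24 accept
		oif "eth1" masquerade
	}
}
table ip filter {
	chain forward {
		type filter hook forward priority 0; policy drop;
		iif "eth0" oif "eth1" accept
		iif "eth1" oif "eth0" ct state established,related accept
		iif "eth1" oif "eth0" ip saddr 192.168.2.0/24 ct state new accept
	}
}
```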

Let's look at the example from Figure 8 again (ICMP echo-request h1h2 traversing r1): In step (5), when the packet traverses the postrouting chain, the inserted rule oif eth1 ip daddr 192.168.2.0/24 accept now accepts it before it can reach the masquerade rule, and thereby prevents it from being NATed19). The remaining steps of the r1 traversal now happen exactly as in the example, and we achieve our intended behavior.

When the ICMP echo-request packet is received by and traverses r2, the behavior follows the same principles as described in the example in Figure 9 above (ICMP echo-reply h2h1 traversing r1). However, it is necessary to add the rule iif eth1 oif eth0 ip saddr 192.168.1.0/24 ct state new accept to the forward chain, because from r2 point of view this is a packet of a new incoming connection, and those we want to allow for packets which traveled through the VPN tunnel, but not for other packets.

Distinguish VPN/non-VPN traffic

Whether you want to do NAT, some other kind of packet manipulation, or packet filtering, it all boils down to distinguishing between VPN and non-VPN traffic. The Nftables rulesets I applied to r1 and r2 in the example above are a crude way to do that, but it works. It is crude, because the distinction is made solely based on the target subnet (the destination IP address) of a traversing packet. As I mentioned above, things would be easier if the Xfrm framework used virtual network interfaces, because those could then serve as the basis for making this distinction. The way described here does not address issues like: What happens when the VPN tunnel is not up, but the mentioned Nftables rulesets are in place? What happens if the subnets behind the VPN gateways are not statically configured but instead are part of dynamic IKE negotiation during IKE handshake? …

Several means have been implemented to address those kinds of issues:

  • Strongswan provides an optional _updown script which is called each time the VPN tunnel comes up or goes down. You can use it to dynamically set/remove the Nftables rules you require. The default version of that script already sets some Iptables rules for you, but depending on your system (kernel/Nftables version) this can create more problems than it solves. And who says that these default rules fit well with your intended behavior? Thus, if you use this, you probably need to replace the default script with your own version.
  • Nftables offers IPSEC EXPRESSIONS (syntax ipsec {in | out} [ spnum NUM ] …) which make it possible to look up a packet's IPsec state in the Xfrm framework from within a rule, to determine whether the packet is part of the VPN context or not. However, this requires a fairly recent version of Nftables and of the Linux kernel. Check your man page man 8 nft. See also section Context below.
  • Nftables offers packet marks which you can set and read on traversing packets (syntax meta mark): you can mark a packet as being part of the VPN context in one hook/chain and read the mark again in a later hook/chain.
  • So-called vti or xfrm virtual network interfaces can optionally be used “on top” of the default Xfrm framework behavior.
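To make the ipsec expression and mark approaches a little more concrete, here is a hedged sketch of how such rules could look. This assumes the table/chain names from the example rulesets above; the reqid value 1 is a hypothetical example (it must match a reqid configured for your IPsec policies), and the ipsec expression requires a recent enough Nftables and kernel:

```
# match forwarded packets which WILL be encrypted by an Xfrm output policy,
# instead of matching crudely on the destination subnet
nft insert rule nat postrouting ipsec out reqid 1 accept

# mark packets which WERE decrypted by an Xfrm input state in prerouting,
# then accept them in the forward chain based on that mark
nft add rule filter prerouting ipsec in reqid 1 meta mark set 0x1
nft add rule filter forward meta mark 0x1 ct state new accept
```

Note that ipsec in is only valid in prerouting/forward/input hooks and ipsec out only in forward/output/postrouting hooks, and the filter prerouting base chain assumed here may need to be created first.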

I am planning to describe some of those means in more detail in another article, however I still need to write that one. ;-) I'll place a link here once it is written.

Context

The described behavior and implementation has been observed on a Debian 10 (buster) system on amd64 architecture, using packages from Debian backports.

  • kernel: 5.4.19-1~bpo10+1
  • nftables: 0.9.3-2~bpo10+1
  • libnftnl: 1.1.5-1~bpo10+1
  • strongswan: 5.7.2-1

Feedback

Feedback to this article is very welcome!

1)
in this case here “VPN Gateways”
2)
in this case here IKEv2
3)
Yes, there is also the AH protocol, which can be used as an alternative to ESP, but AH only provides authentication and data integrity, no encryption. Thus, it is neither relevant for this article nor widely used in practice.
4)
Of course, in the actual packet there is no “blank space” between the eth (ethernet) and the ip header. I just show it like this here to emphasize WHERE in the packet the ESP header and the outer IP header are being inserted.
5)
The darker grey background here shall indicate that this is the part of the whole packet which gets encrypted.
6)
There is one further detail which I intentionally omit here because it is not relevant for the topic at hand: The ESP protocol in reality does not only possess a header but also a trailer part which comes after the payload. So, to be completely accurate, the ESP encapsulation in reality looks like this: |eth|ip|esp-header|ip|tcp|payload|esp-trailer|.
7)
But also other tools like e.g. the NHRP daemon nhrpd of the FRR routing protocol engine, which is used in DMVPN setups.
8)
and it includes a bunch of further files
9)
There are other ways to trigger the IKE handshake, e.g. “on demand” by network traffic itself, but that topic is beyond this article.
10)
This syntax is e.g. used in the output of command ip xfrm policy.
11)
I intentionally call them “decision points” and not “hooks” here, because “hooks” is a term which is used in the Netfilter framework.
12)
Done with a so-called Non-ESP Marker (4 zero bytes 0x00000000 at beginning of UDP payload), defined in RFC3948.
13)
There is even more to that. Strongswan is required to set a special socket option called UDP_ENCAP or else it won't receive any IKE packets on port 4500. But that is an implementation detail.
14)
Additional kinds of virtual network interfaces exist in this context, like e.g. the gre interfaces, but they represent an entirely different concept and protocol (GRE protocol) which is e.g. used to build DMVPN setups. That is an advanced topic which works a little different than the “normal” IPsec VPN which I describe here.
15)
Be aware that config on r1 and r2 here merely is an example configuration which helps me to elaborate on the interaction between Strongswan, Xfrm and Netfilter+Nftables. It is NOT necessarily a good setup to be used in the real world out in the field! E.g. to keep things simple I work with PSKs here instead of certificates, which would already be a questionable decision regarding a real world appliance.
16)
While experimenting, I removed the dir fwd policy. Then the packet was dropped exactly at this point.
17)
which intentionally does not know the routes to the subnets behind r1 and r2, because routers in the Internet commonly do not know the routes to private subnets behind edge routers.
18)
Connection tracking in Linux handles a ping (ICMP echo-request + echo-reply) as a connection oriented protocol, thus the same applies to ping here.
19)
The remaining packets of that connection are then anyway handled by connection tracking and not by this chain and the connection tracking now knows that this connection is NOT to be natted.
blog/linux/nftables_ipsec_packet_flow.1613849226.txt.gz · Last modified: 2021-02-20 by Andrej Stender