Nftables - Netfilter and VPN/IPsec packet flow
In this article I'd like to explain what the packet flow through the Netfilter hooks looks like on a host which works as an IPsec-based VPN gateway in tunnel-mode. Obviously network packets which are to be sent through a VPN tunnel are encrypted+encapsulated on a VPN gateway and packets received through the tunnel are decapsulated and decrypted… but in which sequence does this exactly happen, and which packet traverses which Netfilter hook in which order and in which form (encrypted / not yet encrypted / already decrypted)? I'll do a short recap of IPsec in general, explain the IPsec implementation on Linux as it is commonly used today (Strongswan + Xfrm framework) and explain packet traversal through the VPN gateways in an example site-to-site VPN setup (IPsec in tunnel-mode, IKEv2, ESP, IPv4). I'll focus on Nftables rather than the older Iptables, and I'll set up the VPN via the modern Vici/swanctl configuration interface of Strongswan instead of the older Stroke interface.
IPsec short recap
A comprehensive recap of IPsec would require an entire book. I'll merely provide a very short recap here, focused on protocols and ports, to put my actual topic into context.
IKE protocol
An IPsec-based VPN possesses a “management channel” between both VPN endpoint hosts1), which is the IKE protocol2). It is responsible for bringing up, managing, and tearing down the VPN tunnel connection between both VPN endpoints. This gives both endpoints the opportunity to authenticate to each other and to negotiate encryption, authentication and data-integrity algorithms and the keys those algorithms require (usually session-based keys). IKE is encapsulated in UDP and uses UDP port 500. In case of Nat-traversal (= if a NAT router is detected between both endpoints during the IKE handshake) it dynamically switches to UDP port 4500 during the IKE handshake. Thus, IKE encapsulation on an Ethernet-based network looks like this:
|eth|ip|udp|ikev2| — IKEv2 packet on UDP/500; with Nat-traversal: UDP/500 → UDP/4500
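As a side note on firewalling, these protocol/port details translate directly into the traffic a VPN gateway has to permit. A minimal sketch of such rules (the table/chain names and the drop policy are assumptions for illustration, not part of this article's setup):

```shell
# Sketch: permit IKE, Nat-traversal and ESP traffic addressed to the gateway itself.
nft add table inet vpnfw
nft add chain inet vpnfw input '{ type filter hook input priority 0; policy drop; }'
nft add rule inet vpnfw input udp dport '{ 500, 4500 }' accept   # IKE + Nat-traversal
nft add rule inet vpnfw input meta l4proto esp accept            # ESP (IP protocol 50)
```

These are configuration commands which require root privileges and a recent nft binary; they are shown only to connect the port numbers above to concrete rules.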
SAs and SPs
The mentioned algorithms and keys which are negotiated during IKE handshake are being organized in so-called Security Associations (SAs). Usually there are (at least) three SAs negotiated for each VPN tunnel connection: The IKE_SA which, once established, represents the secured communication channel for IKE itself and (at least) two more CHILD_SAs, one for each data flow direction, which represent the secured communication channels for packets which shall flow through the VPN tunnel.
In addition to the SAs, IPsec also introduces the concept of so-called Security Policies (SPs), which are also created during the IKE handshake. Those are either defined by the IPsec tunnel configuration provided by the admin/user and/or (depending on the case) can also at least partly result from dynamic IKE negotiation. The purpose of the SPs is to act as “traffic selectors” on each VPN endpoint which decide which network packets shall travel through the VPN tunnel and which not. Usually the SPs make this distinction based on source and destination IP addresses (/subnets) of the packets, but additional parameters (e.g. protocols, port numbers, …) can also be considered.
Be aware that both SAs and SPs are volatile, not persistent, data. Their lifetime is bounded by the lifetime of the VPN tunnel connection. It might even be shorter because of key re-negotiations / “rekeying”.
ESP protocol, tunnel-mode
Once the initial IKE handshake has finished successfully, the VPN tunnel between both endpoints is “up” and packets can travel through it. In case of tunnel-mode, the IP packets which shall travel through the VPN tunnel are encrypted and then encapsulated in packets of the so-called ESP protocol3). The whole thing is then further encapsulated into another (an “outer”) IP packet. The reason is that the VPN tunnel itself is merely a point-to-point connection between two VPN endpoints (one source and one destination IP address), but those endpoints are in this case VPN gateways which are used to connect entire subnets on both ends of the tunnel. Thus, the source and destination IP addresses of the “payload” packets which travel through the VPN gateway need to be kept independent from the source and destination IP addresses of the “outer” IP packets. Encapsulation then looks like this (example for a TCP connection):
|eth| |ip|tcp|payload| 4) — a “normal” packet which shall travel through the VPN tunnel…
|eth|ip|esp|ip|tcp|payload| 5)6) — …is encrypted and encapsulated like this while traversing the VPN gateway.
If Nat-traversal is active, then ESP is additionally encapsulated in UDP:
|eth| |ip|tcp|payload|
|eth|ip|udp|esp|ip|tcp|payload| — the same UDP port 4500 as for IKE is used here.
IPsec Linux implementation
The IPsec implementation in Linux consists of a userspace part and a kernel part. Several implementations have been created over the years. Nowadays the most commonly used implementation of the userspace part seems to be Strongswan (there were/are other implementations like Openswan and FreeS/WAN; I'll focus on Strongswan here). The IPsec implementation in modern Linux kernels is the so-called Xfrm framework, which is sometimes also called the Netkey stack. It has been present in the Linux kernel since v2.6. There have been predecessors, like the KLIPS IPsec stack which was used in kernel v2.4 and earlier. With Figure 1 I'd like to show the responsibilities of Strongswan and the Xfrm framework and how both interact with each other in a simple block diagram style.
Strongswan
The essential part of Strongswan is the userspace daemon charon which implements IKEv1/IKEv2 and acts as the central “orchestrator” of IPsec-based VPN tunnels/connections on each VPN endpoint host.
It provides an interface to the user/admin for configuration of IPsec on the system.
Actually, more precisely, it provides two different interfaces to do that: One is the so-called Stroke interface. It provides means to configure IPsec via two main config files /etc/ipsec.conf and /etc/ipsec.secrets. This is the older of the two interfaces and it can be considered deprecated by now (however it is still supported). The other and newer one is the so-called Vici interface. This is an IPC mechanism, which means the charon daemon listens on a Unix domain socket and client tools like Strongswan's own cmdline tool swanctl 7) can connect to it to configure IPsec. This way of configuration is more powerful than the Stroke interface, because it makes it easier for other tools to provide and adjust configuration dynamically and event-driven at any time. However, in many common IPsec setups the configuration is still simply supplied via config files. When using Vici, the difference merely is that the config file(s) (mainly the file /etc/swanctl/swanctl.conf) are not interpreted by the charon daemon directly, but instead are interpreted by the cmdline tool swanctl, which then feeds this config into the charon daemon via the Vici IPC interface. Further, the syntax of swanctl.conf looks slightly different than the syntax of ipsec.conf from the Stroke interface, but the semantics are the same.
An additional config file /etc/strongswan.conf 8) exists, which contains general/global Strongswan settings which are not directly related to individual VPN connections.
So let's say you created a swanctl.conf config file on both of your VPN endpoint hosts with the intended configuration of your VPN tunnel and you used the swanctl tool to load this configuration into the charon daemon and to initiate the IKE handshake with the peer endpoint9). The charon daemon, when executing the IKE handshake, negotiates all the details of the VPN tunnel with the peer as explained above and thereby creates the SA and SP instances which define the VPN connection. It keeps the IKE_SA for itself in userspace, because this is its own “secure channel” for IKE communication with the peer endpoint. It feeds the other SA and SP instances into the kernel via a Netlink socket. The VPN connection/tunnel is now “up”. If you later decide to terminate the VPN connection again, charon removes the SA and SP instances from the kernel.
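In practice this workflow boils down to a handful of swanctl invocations; a sketch (the connection/child names are placeholders, not fixed values):

```shell
swanctl --load-all                        # parse /etc/swanctl/swanctl.conf, feed it to charon via Vici
swanctl --list-conns                      # show the loaded connection configurations
swanctl --initiate --child <child-name>   # trigger the IKE handshake, bring the tunnel up
swanctl --list-sas                        # show established IKE_SAs and CHILD_SAs
swanctl --terminate --ike <conn-name>     # tear the connection down again
```

These commands talk to a running charon daemon over the Vici socket, so they only work on a host where Strongswan is installed and running.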
The Xfrm framework
The so-called Xfrm framework is a component within the Linux kernel. As the man page man 8 ip-xfrm states, it is an “IP framework for transforming packets (such as encrypting their payloads)”. Thus, “Xfrm” stands for “transform”.
While the userspace part (Strongswan) handles the overall IPsec orchestration and runs the IKEv1/IKEv2 protocol to bring up/tear down VPN tunnels/connections, the kernel part is responsible for encrypting+encapsulating and decrypting+decapsulating network packets which travel through the VPN tunnel, and for selecting/deciding which packets go through the VPN tunnel at all. To do that, it requires all SA and SP instances which define the VPN tunnel/connection to be present in the form of data structures within the kernel. Only then can it decide which packets shall be encrypted/decrypted and which not, and which encryption algorithms and keys to use.
The Xfrm framework implements the so-called Security Association Database (SAD)
and the Security Policy Database (SPD) for holding SA and SP instances in the kernel.
Userspace components (like Strongswan, the iproute2 tool collection and others) can use a Netlink socket to talk to the kernel and to show/create/adjust/delete SA and SP instances in the SAD and SPD. You can e.g. use the iproute2 tool ip to show the SA and SP instances which currently exist in those databases:
- command ip xfrm state shows SA instances
- command ip xfrm policy shows SP instances
You can even use ip as a low-level config tool to create/delete SA and SP instances. There is a very good article which explains how to do that. However, in practice you leave the duty of creating/deleting SA and SP instances to Strongswan.
SP instances can be created for three different “data directions”:
| security policy | syntax10) | meaning |
|---|---|---|
| “output policy” | dir out | SP works as a selector on outgoing packets to select which are to be encrypted+encapsulated and which not |
| “input policy” | dir in | SP works as a selector on incoming packets which have already been decrypted+decapsulated and have a destination IP local to the system |
| “forward policy” | dir fwd | SP works as a selector on incoming packets which have already been decrypted+decapsulated and have a destination IP which is not local, i.e. packets which are to be forwarded (routed) |
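To make the three directions more tangible, here is a sketch of how matching SP instances could be created by hand with ip xfrm (the subnets and the address 8.0.0.1 come from the example used later in this article, but the peer address 9.0.0.1 and the reqid value are made-up illustrations; in practice Strongswan creates all of this for you):

```shell
# "output policy": outgoing packets 192.168.1.0/24 -> 192.168.2.0/24 shall enter the tunnel
ip xfrm policy add src 192.168.1.0/24 dst 192.168.2.0/24 dir out \
    tmpl src 8.0.0.1 dst 9.0.0.1 proto esp reqid 1 mode tunnel
# "input policy": decrypted packets with a local destination must match this to pass
ip xfrm policy add src 192.168.2.0/24 dst 192.168.1.0/24 dir in \
    tmpl src 9.0.0.1 dst 8.0.0.1 proto esp reqid 1 mode tunnel
# "forward policy": decrypted packets which are to be routed must match this to pass
ip xfrm policy add src 192.168.2.0/24 dst 192.168.1.0/24 dir fwd \
    tmpl src 9.0.0.1 dst 8.0.0.1 proto esp reqid 1 mode tunnel
```

These are root-only configuration commands which modify the kernel's SPD; without matching SA instances in the SAD they would cause matching traffic to be dropped, which is why this is shown only as an illustration of the three directions.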
Because IPsec is a mandatory part of the IPv6 protocol (and is also available for IPv4), the implementation of the Xfrm framework or “IPsec stack” is deeply interwoven with the implementation of the IPv4 and IPv6 protocols in the kernel, which makes things very complex when you look into the details. I have seen statements claiming that the Xfrm framework is the most complex part of the entire network stack in the Linux kernel.
So, what I describe in the following is a “simplified view” on how things work. It is a sufficient model to understand how the network packet flow works and how the Xfrm framework relates to the Netfilter framework (Netfilter and Xfrm are implemented independently from each other in the kernel!).
If you are working with Nftables or Iptables, then you are probably familiar with the widely used Netfilter packet flow image, which illustrates the packet flow through the Netfilter hooks and Iptables chains. One great thing about this image is that it covers the Xfrm framework, too (at least from a bird's eye view). It illustrates four distinct Xfrm “decision points”11) in the network packet flow path and shows clearly where those are located in relation to the Iptables chains. In Figure 2 I created a simplified version of this image, which only shows the Netfilter hooks (blue boxes) and the Xfrm “decision points” (grey boxes), and thereby is not focused on the subtle differences between Iptables and Nftables. Because my focus is on the Xfrm framework I added a fifth “decision point” here, which is not shown in the original Netfilter packet flow image, but I know that it exists from reading the kernel source code:
I'll explain the Xfrm “decision points” here, while assuming that you are already familiar with the Netfilter hooks which I covered in much detail in my other article Nftables - Packet flow and Netfilter hooks in detail (if not, please read that article first).
Xfrm lookup | This is where the SPD is used to check whether traversing packets match any “output policy” (dir out SP) and, if yes, they are given to the Xfrm encode step to be encrypted+encapsulated |
Xfrm encode | This is where packets which shall travel through the VPN tunnel are encrypted and encapsulated based on SA instances within the SAD (which SA to use and how it relates to SP instances is regulated via integer identifiers like reqid and spi; watch out for the tmpl keyword in the output of command ip xfrm policy) |
Xfrm/socket lookup | This “decision point” is actually a conglomerate of several checks which happen in different situations: (1) Incoming ESP packets are checked against the SAD and, if a matching SA is found (source and destination IP and spi match), those packets are given to the Xfrm decode step. (2) If already decrypted+decapsulated packets arrive here, the SPD is used to check if they match any “input policy” (dir in SP). If yes, they simply pass; if no, they are dropped. This step needs to do a little more “magic” if Nat-traversal is used. In that case both IKE and ESP packets arrive here encapsulated in UDP on port 4500 and the kernel must distinguish between both12), give the IKE packets to a userspace UDP socket where Strongswan13) is listening, and check the ESP packets against the SAD and give them to the Xfrm decode step. |
Xfrm decode | This is where packets which have been received through the VPN tunnel are decrypted and decapsulated based on SA instances within the SAD. |
(Xfrm fwd lookup) | This step is not shown in the Netfilter packet flow image (probably because it is considered less relevant), but I mention it here for completeness: If already decrypted+decapsulated packets arrive here, the SPD is used to check if they match any “forward policy” (dir fwd SP) instance. If yes, they simply pass; if no, they are dropped. |
It is very important to mention that the Xfrm framework implementation does NOT use virtual network interfaces to distinguish between VPN and non-VPN traffic. This is a relevant difference compared to other implementations like the older KLIPS IPsec stack which was used in kernel v2.4 and earlier. Why is this relevant? It is true that virtual network interfaces are not required, because the concept of the SPs does all the distinction which is required for the VPN to operate. However, the absence of virtual network interfaces makes it harder for Netfilter-based packet filtering systems like Iptables and Nftables to distinguish between VPN and non-VPN packets within their rules.
It is obvious that an Nftables rule would be easy to write if all VPN traffic went through a virtual network interface, e.g. one called ipsec0. In case of the Xfrm framework that is not the case, at least not by default. Additional features have been developed over the years to address this problem. Some of them re-introduce the concept of virtual network interfaces “on top” of the Xfrm framework, but those are optional to use and never became the default. The Strongswan documentation calls VPN setups based on those virtual network interfaces “route-based VPNs”. Essentially, two types of virtual interfaces have been introduced in this context over the years: the older vti interfaces and the newer xfrm interfaces14). In the remaining part of this article I will describe how the IPsec-based VPN looks from the Netfilter point of view in the “normal” case where NO virtual network interfaces are used.
Example Site-to-site VPN
It is better to have a practical example as a basis for further diving into the topic. Here I will use a site-to-site VPN setup, which is created between two VPN gateways r1 and r2 (IPsec, tunnel-mode, IKEv2, ESP, IPv4) as shown in Figure 3. The VPN tunnel will connect the local subnets behind r1 and r2. Additionally, both r1 and r2 operate as SNAT edge routers when forwarding non-VPN traffic, but not for VPN traffic. This creates the necessity to distinguish between VPN and non-VPN packets in Nftables rules, but more on that later. The router rx is just a placeholder for an arbitrary cloud (e.g. the Internet) between both VPN gateways.
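I don't reproduce Figure 4 here, but a minimal swanctl.conf for such a setup could look roughly like the following sketch (shown for r1; the WAN address 9.0.0.1 of r2 and the PSK value are assumptions for illustration, not taken from the figures):

```
# /etc/swanctl/swanctl.conf on r1 (sketch)
connections {
    gw-gw {
        local_addrs  = 8.0.0.1
        remote_addrs = 9.0.0.1      # assumed WAN address of r2
        version = 2                 # IKEv2
        local  { auth = psk }
        remote { auth = psk }
        children {
            net-net {
                local_ts  = 192.168.1.0/24   # traffic selectors -> become the SPs
                remote_ts = 192.168.2.0/24
                mode = tunnel
            }
        }
    }
}
secrets {
    ike-r2 {
        id = 9.0.0.1
        secret = "changeme"         # placeholder PSK
    }
}
```

On r2 the same file would be mirrored (addresses and traffic selectors swapped). The connection name gw-gw and child name net-net correspond to the names used by the swanctl commands below.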
Execute command swanctl --load-all on r1 and r2 to load the configuration of Figure 4 into the charon daemon. Execute command swanctl --initiate --child net-net on r1 to initiate the VPN connection between both gateways. This triggers the IKEv2 handshake with r1 as “initiator” and r2 as “responder”. On successful IKE handshake, the charon daemon feeds the SAs and SPs which define this VPN tunnel via Netlink socket into the kernel and thereby the tunnel is now up. You can use the iproute2 tool ip as a low-level admin tool to show the SAs and SPs which currently exist in the databases inside the kernel, see Figures 5 and 6.
Just for completeness: To tear the VPN tunnel down again, you would need to execute command swanctl --terminate --child net-net to terminate the ESP tunnel (SPs and SAs get removed from the kernel databases) and then command swanctl --terminate --ike gw-gw to terminate the IKE connection/association (IKE_SA).
Packet Flow
Let's say the VPN tunnel in the example described above is now up and running. To start simple, let's also assume that we have not yet configured any Nftables ruleset on the VPN gateways r1 and r2. Thus, those two hosts do not yet do SNAT (we will add this later). The VPN will already work normally, which means any IP packet which is sent from any host in subnet 192.168.1.0/24 to any host in subnet 192.168.2.0/24 or vice versa will travel through the VPN tunnel.
This means packets which are traversing one of the VPN gateways r1 or r2 and are about to “enter the VPN tunnel” are encrypted and then encapsulated in ESP and an outer IP packet. Packets which are “leaving the tunnel” are decapsulated (the outer IP packet and ESP header are stripped away) and decrypted. Let's observe that on VPN gateway r1 by sending a single ping from h1 to h2, see Figure 7.
The content of the following two Figures 8 and 9 is the result of experiments I did using the trace and log features of Nftables. Those features make this traversal visible to you; however, they are only able to cover Nftables chains, rules and thereby Netfilter hooks. They cannot show you what is going on in the Xfrm framework. You just see this indirectly by observing a packet being still unencrypted while traversing one Netfilter hook and then appearing encrypted+encapsulated in the next Netfilter hook. I also played a little with the ip xfrm policy command to find out how the behavior changes when I remove one of the SP instances set by Strongswan. From what I read on the Internet, you can do more or less “packet filtering” things with the SPs (“traffic selectors”) in the Xfrm framework, but how their logic works in detail seems to be documented nowhere. I will add more info here if I learn more on that later.
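If you like to reproduce such experiments yourself, the trace feature can be enabled with a rule that sets nftrace on the packets of interest, and the resulting trace events can then be watched with nft monitor. A sketch (the table/chain names and the ICMP match are assumptions chosen to fit the ping experiment):

```shell
# mark matching packets for tracing in an early hook (raw priority, prerouting)
nft add table ip trace
nft add chain ip trace pre '{ type filter hook prerouting priority -300; }'
nft add rule ip trace pre icmp type echo-request meta nftrace set 1
# then watch the trace events of these packets traversing the whole ruleset
nft monitor trace
```

Each traversed base chain then emits trace events, which is exactly how the per-hook observations in Figures 8 and 9 can be collected.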
Important to note regarding Figure 8 is that the oif, once determined by the routing decision, always stays eth1. This is what it means that the Xfrm framework does not use virtual network interfaces. If virtual network interfaces were instead used here (e.g. a vti interface named vti0), then oif would be vti0 in steps (3) till (7) instead of eth1, but oif would still be eth1 in steps (8) till (10).
Important to note regarding Figure 9 is that the iif during this whole traversal stays eth1. This is what it means that the Xfrm framework does not use virtual network interfaces. If virtual network interfaces were instead used here (e.g. a vti interface named vti0), then iif would still be eth1 in steps (1) till (5), but would be vti0 in steps (6) till (9).
SNAT, Nftables
Now to add the SNAT behavior to r1 and r2, we apply the following Nftables ruleset on r1 and r2:

nft add table nat
nft add chain nat postrouting { type nat hook postrouting priority 100\; }
nft add rule nat postrouting oif eth1 masquerade
nft add table filter
nft add chain filter forward { type filter hook forward priority 0\; policy drop\; }
nft add rule filter forward iif eth0 oif eth1 accept
nft add rule filter forward iif eth1 oif eth0 ct state established,related accept
This ruleset is identical on both hosts.
What is the resulting change in behavior? Well, for non-VPN traffic all now works as intended. Hosts h1 and h2 e.g. can now finally ping rx 17), and from rx's point of view it looks like the ping came from r1 or r2, respectively.
However, let's take another look at the example from Figure 8 above (the ICMP echo-request h1 → h2 traversing r1) and examine how the behavior differs now: In step (5) the still unencrypted ICMP echo-request packet traverses the Netfilter Postrouting hook and thereby the Nftables postrouting chain. Because of the masquerade rule, its source IP address is now replaced with the address 8.0.0.1 of the eth1 interface of r1. As a result, in step (6) this packet does NOT match any IPsec “output policy” anymore; thus it is not encrypted+encapsulated and does not travel through the VPN tunnel. Obviously this is not our intended behavior, and further, this ping is now doomed to fail anyway, because rx does not know the route to the target subnet, and even if it did, r2 would then drop the packet, because it also is now configured as a SNAT router and thereby drops incoming new connections in the forward chain.
How to fix that? It is our intended behavior that network packets from subnet 192.168.1.0/24 to subnet 192.168.2.0/24 and vice versa shall travel through the VPN tunnel and shall not be natted. Also, it shall be possible to establish connections with connection-oriented protocols (e.g. TCP18)) in both directions through the VPN tunnel. One simple way to achieve this behavior is to add these two rules on r1:
r1:
nft insert rule nat postrouting oif eth1 ip daddr 192.168.2.0/24 accept
nft add rule filter forward iif eth1 oif eth0 ip saddr 192.168.2.0/24 ct state new accept
For the first rule I used insert instead of add, so that it is inserted as the first rule of the postrouting chain, thus BEFORE the masquerade rule. Obviously we need to do the equivalent (but not identical!) thing on r2:
r2:
nft insert rule nat postrouting oif eth1 ip daddr 192.168.1.0/24 accept
nft add rule filter forward iif eth1 oif eth0 ip saddr 192.168.1.0/24 ct state new accept
The complete rulesets on r1 and r2 then combine all of the rules above.
Let's look at the example from Figure 8 again (ICMP echo-request h1 → h2 traversing r1): In step (5), when traversing the postrouting chain, the inserted rule oif eth1 ip daddr 192.168.2.0/24 accept now prevents the packet from being natted, by accepting it and thereby preventing it from reaching the masquerade rule19). The remaining steps of the r1 traversal now happen exactly as in the example and we have reached our intended behavior.
When the ICMP echo-request packet is received by and traverses r2, the behavior follows the same principles as described in the example in Figure 9 above (ICMP echo-reply h2 → h1 traversing r1). However, it is necessary to add the rule iif eth1 oif eth0 ip saddr 192.168.1.0/24 ct state new accept to the forward chain, because from r2's point of view this is a packet of a new incoming connection, and those we want to allow for packets which traveled through the VPN tunnel, but not for other packets.
Distinguish VPN/non-VPN traffic
No matter if your idea is to do NAT or your intentions are other kinds of packet manipulation or packet filtering, it all boils down to distinguishing between VPN and non-VPN traffic.
The Nftables rulesets I applied to r1 and r2 in the example above are a crude way to do that, but it works. It is crude because the distinction is made solely based on the target subnet (the destination IP address) of a traversing packet. As I mentioned above, things would be easier if the Xfrm framework used virtual network interfaces, because those could then serve as a basis for making this distinction.
The way described here does not address issues like: What happens when the VPN tunnel is not up, but the mentioned Nftables rulesets are in place? What happens if the subnets behind the VPN gateways are not statically configured but instead are part of dynamic IKE negotiation during IKE handshake? …
Several means have been implemented to address those kinds of issues:
- Strongswan provides an optional _updown script which is called each time the VPN tunnel comes up or goes down. You can use it to dynamically set/remove the Nftables rules you require. The default version of that script already sets some Iptables rules for you, but depending on your system (kernel/Nftables version) this can create more problems than it solves. And who says that these default rules fit well with your intended behavior? Thus, if you use this, you probably need to replace the default script with your own version.
- Nftables offers IPSEC EXPRESSIONS (syntax ipsec {in | out} [ spnum NUM ] …) which make it possible to do lookups into the Xfrm framework within a rule, to determine whether a packet is part of the VPN context or not. However, you need a very recent version of Nftables and of the Linux kernel for that. Check your man page man 8 nft. See also section Context below.
- Nftables offers “markers” which you can set and read on traversing packets (syntax meta mark), which can help you mark packets as being part of the VPN context in one hook/chain and read the mark again in a later hook/chain.
- So-called vti or xfrm virtual network interfaces can optionally be used “on top” of the default Xfrm framework behavior.
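As a taste of the second option, here is a hedged sketch of how the crude destination-subnet rules from above could instead be expressed with the ipsec expression (the reqid value 1 is a made-up illustration; the actual value of your tunnel can be seen in the output of ip xfrm policy):

```shell
# exempt packets which match an IPsec output policy (hypothetical reqid 1) from SNAT
nft insert rule nat postrouting ipsec out reqid 1 accept
# accept new forwarded connections which were received through the tunnel
nft add rule filter forward ipsec in reqid 1 ct state new accept
```

These are configuration commands which require a kernel and nftables version with ipsec expression support; the advantage over the subnet-based rules is that they keep working even when the traffic selectors are negotiated dynamically.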
I am planning to describe some of those means in more detail in another article, however I still need to write that one. I'll place a link here once I find the time to write it.
Context
The described behavior and implementation has been observed on a Debian 10 (buster) system using Debian backports on the amd64 architecture.
- kernel: 5.4.19-1~bpo10+1
- nftables: 0.9.3-2~bpo10+1
- libnftnl: 1.1.5-1~bpo10+1
- strongswan: 5.7.2-1
Feedback
Feedback to this article is very welcome!
published 2020-05-30, last modified 2021-08-07
Footnotes
- …eth (ethernet) and the ip header. I just show it like this here to emphasize WHERE in the packet the ESP header and the outer IP header are being inserted: |eth|ip|esp-header|ip|tcp|payload|esp-trailer|
- …nhrpd of the FRR routing protocol engine, which is used in DMVPN setups.
- …ip xfrm policy.
- …0x00000000 at the beginning of the UDP payload), defined in RFC 3948.
- …UDP_ENCAP, or else it won't receive any IKE packets on port 4500. But that is an implementation detail.
- …gre interfaces, but they represent an entirely different concept and protocol (the GRE protocol), which is e.g. used to build DMVPN setups. That is an advanced topic which works a little differently than the “normal” IPsec VPN which I describe here.
- …r1 and r2 here is merely an example configuration which helps me to elaborate on the interaction between Strongswan, Xfrm and Netfilter+Nftables. It is NOT necessarily a good setup to be used in the real world out in the field! E.g. to keep things simple I work with PSKs here instead of certificates, which would already be a questionable decision for a real-world appliance.
- …dir fwd policy. Then the packet was dropped exactly at this point.
- …r1 and r2, because routers in the Internet commonly do not know the routes to private subnets behind edge routers.