Nftables - Packet flow and Netfilter hooks in detail
If you are using Iptables or the newer Nftables (I consider the latter one to be the default case nowadays) and you are merely doing some simple packet filtering with IPv4, then you'll probably get enough info out of the official documentation and by a quick look through websites which provide example configurations.
However if you are working on a little bit more complex stuff like writing Nftables rules while caring for both IPv4 and IPv6, while using IPsec and doing NAT, or other of the “more interesting” stuff… then things tend to get a little more tricky. If you want to be sure to know what you are doing and to create and place your tables, chains and rules correctly to make them do the right thing… then it is beneficial to understand the flow of network packets and the internal workings of Nftables and the underlying Netfilter framework in a little more detail.
I always like to know how things work and to dig a little deeper than the bare minimum required to solve the issue at hand. The available documentation on this topic isn't bad, but like most documentation it leaves some gaps and unanswered questions, and it is often outdated. In this case especially, many of the more interesting details are only covered by older articles focused on the predecessor Iptables.
After digging through a lot of websites and some kernel source code, and after doing some practical experiments involving the trace and log features of Nftables, I would like to share some of the things I've learned. In this article I'll try to explain Nftables concepts like base chains, priority and traffic classes and put them in relation to the actual network packet flow through the Netfilter hooks.
Worth a thousand words
Over the years several images have been created which intend to visualize the network packet flow through the Netfilter hooks in the Linux kernel, and thereby the packet flow through the tables, chains and rules of Iptables or Nftables. Probably the most famous, detailed and best maintained image is the following one. The original author is Jan Engelhardt and it has been published on Wikipedia under the Creative Commons Attribution-Share Alike 3.0 Unported license1).
However, what this image shows is the packet flow through the Netfilter hooks and thereby through the tables and chains as they existed in old Iptables. In Nftables, however, you are free to create and name tables and chains to your liking, so things will probably look a little different. The image still remains very useful, especially because it contains a lot of further details like bridging, the ingress hook and IPsec/xfrm, but when interpreting it you are required to “read a little between the lines”.
Netfilter
The Netfilter framework within the Linux kernel is the basic building block on which packet selection systems like Iptables or the newer Nftables are built. It provides a number of hooks inside the Linux kernel, which network packets traverse as they flow through the kernel. Other kernel components can register callback functions with those hooks, which enables them to examine the packets and to decide whether a packet shall be dropped (=deleted) or accepted (=continues on its way through the kernel). The following is a simplified version of the Netfilter packet flow image which shows these hooks (the blue boxes in the image):
A network packet received on a network device first traverses the Prerouting hook. Then the routing decision happens, in which the kernel determines whether the packet is destined for a local process (e.g. the socket of a server listening on the system) or whether it shall be forwarded (in that case the system works as a router). In the first case the packet traverses the Input hook and is then given to the local process. In the second case the packet traverses the Forward hook and finally the Postrouting hook, before being sent out on a network device. A packet which has been generated by a local process (e.g. client or server software which wants to send something out on the network) first traverses the Output hook and then also the Postrouting hook, before it is sent out on a network device.
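The three paths described above can be written down as a small lookup. This is only a sketch of the bird's eye view: the enum names and the function `hooks_on_path()` are made up for illustration and do not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* The five hooks, named as in this article (illustrative enum). */
enum hook {
    HOOK_PREROUTING,
    HOOK_INPUT,
    HOOK_FORWARD,
    HOOK_OUTPUT,
    HOOK_POSTROUTING
};

/* The three packet paths described in the text. */
enum path {
    PATH_LOCAL_IN,   /* received, destined for a local process */
    PATH_FORWARDED,  /* received, routed on to another device  */
    PATH_LOCAL_OUT   /* generated by a local process           */
};

/*
 * Writes the hooks a packet traverses, in order, into out[]
 * (room for 3 entries) and returns how many were written.
 */
size_t hooks_on_path(enum path p, enum hook out[3])
{
    assert(out != NULL);

    switch (p) {
    case PATH_LOCAL_IN:
        out[0] = HOOK_PREROUTING;
        out[1] = HOOK_INPUT;
        return 2;
    case PATH_FORWARDED:
        out[0] = HOOK_PREROUTING;
        out[1] = HOOK_FORWARD;
        out[2] = HOOK_POSTROUTING;
        return 3;
    case PATH_LOCAL_OUT:
        out[0] = HOOK_OUTPUT;
        out[1] = HOOK_POSTROUTING;
        return 2;
    }
    return 0; /* unknown path: no hooks */
}
```

Note that a forwarded packet never touches Input or Output, which is why rules meant for routed traffic have no effect in chains attached to those two hooks.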
Those five hooks have been present in the Linux kernel for a very long time. You can, for example, already find an equivalent of the image above in the Linux netfilter Hacking HOWTO from 2002. The good news is that, at least from a bird's eye view, all this is still accurate today. Of course, if you look into the details, things are more complex now (“now” here means kernel v5.4.0). I try to show that in the image below. The courier font in the image indicates how things are named within the Linux kernel source code.
As you can see, those five hooks exist independently for the IPv4 and for the IPv6 protocol (meaning IPv4 and IPv6 packets each traverse their own hooks). Further hooks exist which are traversed by ARP packets or when you do bridging (I do not go into details about those here). An additional ingress hook exists independently for each network device. The list goes on… no guarantee for completeness 2). Nftables abstracts these things with what it calls Address Families (ip, ip6, inet, arp, bridge, netdev), but more about that later.
Network Namespaces
If you do not work with or care about network namespaces or if you do not know what they are, then you can ignore this section. Just be aware: Even if you do not explicitly make use of network namespaces (e.g. by creating additional ones), still one instance, the default network namespace “init_net”, always exists and then all the networking happens inside it.
All the mentioned hooks exist independently (=are being re-created) within each network namespace3). That means the data structures in the Linux kernel which hold the list of callback functions which are registered with the hooks, are re-created (initially empty) for each new network namespace. Thus who is registered with those hooks is different and individual to each network namespace. Of course the actual concept of network namespaces and its impact goes far beyond just that, but that's not the topic of this article.
Register callbacks
As already mentioned, the idea of the hooks is to give other kernel components the opportunity to register callback functions with a hook, which are then called for each network packet which traverses this hook. Netfilter provides an API to do that, and both Iptables and Nftables as well as further systems like Connection Tracking make use of it. This API provides two functions to register/unregister a callback function with a specific hook: nf_register_net_hook() and nf_unregister_net_hook().
Several callback functions can be registered with the same hook. Netfilter holds the function pointers of those callback functions (together with some metadata) in an array, which is dynamically grown or shrunk each time some component registers/unregisters a callback. Each hook has its own array, implemented as an instance of struct nf_hook_entries in the kernel.
Priority
The sequence of callbacks in this array is important, because network packets which traverse the hook will traverse the callbacks in the sequence in which those are present within the array. When registering a callback, the caller needs to specify a priority value (shown in red in the image above), which Netfilter then uses to determine WHERE to insert the new callback into the array. The priority is a signed integer value (int) and the whole value range of that data type can be used. As you can see in the image, Netfilter sorts the callbacks in ascending order from lower to higher priority values, thus a callback with a lower value like -200 comes BEFORE a callback with a higher value like 100. In practice, however, the full range of values does not seem to be used. The kernel contains several enums which define some common discrete priority values. Things seem a little messy here, because those enums are (a little) different for each protocol (= for each Address Family, as Nftables would call it). Here, as an example, is the enum for the IPv4 protocol:
    /* from include/uapi/linux/netfilter_ipv4.h (kernel v5.4.0) */
    enum nf_ip_hook_priorities {
        NF_IP_PRI_FIRST = INT_MIN,
        NF_IP_PRI_RAW_BEFORE_DEFRAG = -450,
        NF_IP_PRI_CONNTRACK_DEFRAG = -400,
        NF_IP_PRI_RAW = -300,
        NF_IP_PRI_SELINUX_FIRST = -225,
        NF_IP_PRI_CONNTRACK = -200,
        NF_IP_PRI_MANGLE = -150,
        NF_IP_PRI_NAT_DST = -100,
        NF_IP_PRI_FILTER = 0,
        NF_IP_PRI_SECURITY = 50,
        NF_IP_PRI_NAT_SRC = 100,
        NF_IP_PRI_SELINUX_LAST = 225,
        NF_IP_PRI_CONNTRACK_HELPER = 300,
        NF_IP_PRI_CONNTRACK_CONFIRM = INT_MAX,
        NF_IP_PRI_LAST = INT_MAX,
    };
I go into such detail here because this enum shows you the discrete priority values which are used by kernel components like connection tracking when registering their own callbacks with a Netfilter hook. This is relevant for Iptables and Nftables, as you will see below.
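To make the ascending order concrete, here is a minimal sketch which sorts a handful of callbacks by priority, using values taken from the enum above. The struct `hook_entry` and the function `sort_for_traversal()` are hypothetical names for this illustration. The kernel does not sort after the fact like this; it keeps its array ordered on every registration.

```c
#include <assert.h>
#include <stdlib.h>

/* A registered callback reduced to a label and a priority (illustrative). */
struct hook_entry {
    const char *name;
    int priority;
};

/* Orders entries by ascending priority; avoids overflow from subtraction. */
static int by_priority(const void *a, const void *b)
{
    int pa = ((const struct hook_entry *)a)->priority;
    int pb = ((const struct hook_entry *)b)->priority;

    return (pa > pb) - (pa < pb);
}

/* Sorts entries into the order in which a packet would traverse them. */
void sort_for_traversal(struct hook_entry *entries, size_t n)
{
    assert(entries != NULL);
    qsort(entries, n, sizeof(*entries), by_priority);
}
```

Given entries with the priorities 0 (filter), -200 (conntrack), 100 (srcnat) and -300 (raw), the resulting traversal order is raw, conntrack, filter, srcnat.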
Hard-coded vs. Flexibility
The Netfilter hooks themselves are hard-coded into the Linux kernel network stack. You'll find them in the source code if you search for function calls named NF_HOOK() 4). In case you are wondering why other kernel components are required to register callbacks with these hooks at runtime and why those callbacks are not also hard-coded… well, I did not write this code, so my guess is as good as yours. There are many potential reasons which might have led to these design decisions, but common sense (and comments on some websites) made at least these two reasons obvious to me:
- For one, this kind of runtime flexibility is an essential requirement in a kernel where many components (including Netfilter, Nftables, Iptables and connection tracking) can potentially be loaded or unloaded at runtime as kernel modules, and which employs powerful concepts of further abstraction like network namespaces.
- Performance is a crucial issue. Every network packet needs to traverse all callbacks registered with a hook. Thus those callbacks should be registered in an economical way. This is probably one of the driving reasons why base chains in Nftables need to be explicitly created by the user in contrast to the more or less “hard-coded” chains of Iptables (more details below).
Hook traversal and verdict
Now let's take a more detailed look at how the callbacks which are registered with the same hook are traversed by network packets. The image above shows this.
For each network packet which traverses this hook, the callback functions are called one by one in the order in which they are present within the array of the hook (the order defined by the priority value). Network packets are represented within the Linux kernel as instances of struct sk_buff (often abbreviated as “skb”). A pointer to such an skb instance is passed as a function argument to each of these callback functions, so each one can examine the packet. Each callback is required to give a “verdict” back to Netfilter as its return value. There are several possible values for the “verdict”, but for understanding these concepts only two are relevant: NF_ACCEPT and NF_DROP. NF_ACCEPT tells Netfilter that the callback “accepts” the network packet. This means the packet now traverses the next callback registered with this hook (if one exists). If all callbacks of this hook return NF_ACCEPT, then the packet finally continues its traversal of the kernel network stack. However, if a callback returns NF_DROP, then the packet is “dropped” (=deleted) and no further callbacks or parts of the network stack are traversed.
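The traversal logic can be sketched in a few lines. This is a userspace model under simplifying assumptions, not kernel code: `struct packet` stands in for struct sk_buff, the `VERDICT_*` values stand in for NF_DROP and NF_ACCEPT, and `traverse_hook()` is a made-up name.

```c
#include <assert.h>
#include <stddef.h>

/* Verdict values for this sketch only; the kernel defines its own. */
enum verdict { VERDICT_DROP, VERDICT_ACCEPT };

/* Stand-in for struct sk_buff: just enough to base a decision on. */
struct packet {
    unsigned int dst_port;
};

typedef enum verdict (*hook_fn)(const struct packet *pkt);

/*
 * Calls the callbacks in array order and stops at the first drop.
 * *called receives how many callbacks actually ran.
 */
enum verdict traverse_hook(const hook_fn *fns, size_t n,
                           const struct packet *pkt, size_t *called)
{
    size_t i;

    assert(pkt != NULL && called != NULL);
    *called = 0;
    for (i = 0; i < n; i++) {
        (*called)++;
        if (fns[i](pkt) == VERDICT_DROP)
            return VERDICT_DROP;  /* packet is gone, skip the rest */
    }
    return VERDICT_ACCEPT;        /* every callback accepted */
}

/* Example callback: accepts everything. */
enum verdict accept_all(const struct packet *pkt)
{
    (void)pkt;
    return VERDICT_ACCEPT;
}

/* Example callback: drops packets addressed to port 23. */
enum verdict drop_telnet(const struct packet *pkt)
{
    return pkt->dst_port == 23 ? VERDICT_DROP : VERDICT_ACCEPT;
}
```

With the callbacks registered in the order accept_all, drop_telnet, accept_all, a packet to port 23 is dropped by the second callback and the third one never sees it, while a packet to port 22 passes all three.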
Iptables
To put things into context, let's take a short look at Iptables as the predecessor of Nftables. Iptables organizes its rules into tables and chains, where tables are merely a means (a container) to group chains together which have something in common (e.g. chains which are used for nat belong to the nat table). The actual rules reside inside the chains.
Iptables registers its chains with the Netfilter hooks by registering its own callback functions as described above. This means when a network packet traverses a hook (e.g. Prerouting), then this packet traverses the chains which are registered with this hook and thereby traverses their rules.
In case of Iptables all that is already pre-defined. A fixed set of tables exists, each table containing a fixed set of chains5). The chains are named like the hooks with which they are registered.
table | contains chains | command to show that |
---|---|---|
filter | INPUT, FORWARD, OUTPUT | iptables [-t filter] -L |
nat | PREROUTING, (INPUT) 6), OUTPUT, POSTROUTING | iptables -t nat -L |
mangle | PREROUTING, INPUT, FORWARD, OUTPUT, POSTROUTING | iptables -t mangle -L |
raw | PREROUTING, OUTPUT | iptables -t raw -L |
The sequence in which the chains are being traversed when a packet traverses the hook (their priority) is also already fixed. The Netfilter packet flow image shows this sequence in detail. In the image, each chain registered with a hook is represented by a box like the following, containing the name of the table and the chain it belongs to.
I additionally show the priority here (in red) because I would like to elaborate on it further; note that the priority value is not shown in the original Netfilter packet flow image.
The iptables cmdline tool itself is only responsible for configuring tables, chains and rules for handling IPv4 packets, thus its corresponding kernel component only registers its chains with the five Netfilter hooks of the IPv4 protocol. To cover all the protocol families, the complete Iptables suite is split up into several distinct cmdline tools and corresponding kernel components:
- iptables for IPv4 / NFPROTO_IPV4
- ip6tables for IPv6 / NFPROTO_IPV6
- arptables for ARP / NFPROTO_ARP
- ebtables for Bridging / NFPROTO_BRIDGE
Let's take a look at iptables for IPv4. Because the Iptables chains are named after the hooks they are registered with, interpreting the image is straightforward:
Connection tracking
As you can see in the image above, the connection tracking system also registers itself with the Netfilter hooks, and based on the priority value (-200) you can clearly see which Iptables chain is called BEFORE and which AFTER the connection tracking callback. There is much more to tell about connection tracking. I'll probably cover this in another article.
Nftables
In general Nftables organizes its rules into tables and chains in the same way Iptables does. Tables are again containers for chains and chains are carrying the rules. However, in contrast to Iptables, no pre-defined tables or chains exist. All tables and chains have to be explicitly created by the user. The user can give arbitrary names to the tables and chains when creating them. Nftables distinguishes between so-called base chains and regular chains. A base chain is a chain which is being registered with a Netfilter hook (by means of callback functions as described above) and you must specify that hook when you create the chain. A regular chain is not registered with any hook (regular chains are not covered in this article)7). Thus the user is not forced to name the base chains like the hooks they will be registered with. This obviously offers more freedom and flexibility, but thereby also has more potential to create confusion.
Address Families
In contrast to Iptables, Nftables is not split up into several userspace tools and corresponding kernel components to address the different groups of hooks which Netfilter provides. It solves this issue by introducing the concept of so-called Address Families. When you create a table, you need to specify which Address Family it belongs to. The following Address Families exist and map to the following groups of Netfilter hooks:
- ip: maps to IPv4 protocol hooks / NFPROTO_IPV4 (default)
- ip6: maps to IPv6 protocol hooks / NFPROTO_IPV6
- inet: maps to both IPv4 and IPv6 protocol hooks
- arp: maps to ARP protocol hooks / NFPROTO_ARP
- bridge: maps to bridging hooks / NFPROTO_BRIDGE
- netdev: maps to ingress hook / NFPROTO_NETDEV
As a result, all base chains which you create within a table will be registered with the specified Netfilter hook of the Address Family which you selected for the table. The ip Address Family (IPv4) is the default one. So, if you do not specify any Address Family when creating a table, then this table will belong to ip.
In the following example I intentionally mention the ip Address Family to emphasize what is happening:

    #create a new table named 'foo', belonging to address family 'ip'
    nft create table ip foo

    #create new base chain named 'bar' in table 'foo', register it with
    #netfilter hook 'input' of the 'ip' address family (=IPv4 protocol)
    #and specify priority '0'
    nft create chain ip foo bar { type filter hook input priority 0\; }
The inet family
The inet Address Family is special. When you create a table belonging to that family and then create a base chain within that table, then this base chain will get registered with two Netfilter hooks: the equivalent hooks of IPv4 and IPv6. This means both IPv4 and IPv6 packets will traverse the rules of this chain. Example:

    nft create table inet foo

    #this base chain will get registered with the Netfilter 'input'
    #hook of IPv4 and also to the Netfilter 'input' hook of IPv6
    nft create chain inet foo bar { type filter hook input priority 0\; }
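The mapping from Address Family to hook groups, including the default and the inet special case, can be summarized as a small lookup. This is only a model of the description in this section: the `GROUP_*` flags and `groups_for_family()` are hypothetical and say nothing about how Nftables implements this internally.

```c
#include <assert.h>  /* only needed when trying the function out */
#include <stddef.h>
#include <string.h>

/* Bit flags for the hook groups named in this section (illustrative). */
enum hook_group {
    GROUP_IPV4   = 1 << 0,
    GROUP_IPV6   = 1 << 1,
    GROUP_ARP    = 1 << 2,
    GROUP_BRIDGE = 1 << 3,
    GROUP_NETDEV = 1 << 4
};

/*
 * Returns the set of hook groups a base chain of the given family is
 * registered with. A NULL family means "not specified" and falls back
 * to ip. Unknown names yield 0.
 */
unsigned int groups_for_family(const char *family)
{
    if (family == NULL || strcmp(family, "ip") == 0)
        return GROUP_IPV4;
    if (strcmp(family, "ip6") == 0)
        return GROUP_IPV6;
    if (strcmp(family, "inet") == 0)
        return GROUP_IPV4 | GROUP_IPV6;  /* registered with both */
    if (strcmp(family, "arp") == 0)
        return GROUP_ARP;
    if (strcmp(family, "bridge") == 0)
        return GROUP_BRIDGE;
    if (strcmp(family, "netdev") == 0)
        return GROUP_NETDEV;
    return 0;
}
```

The inet case is the only one which returns more than one flag, which mirrors the fact that a single inet base chain ends up registered with two hooks.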
Priority
In the examples above you already saw that Nftables requires you to specify a priority value when creating a base chain. This is the very same priority which I already described in detail when covering Netfilter above. You can specify integer values, but newer versions of Nftables also define placeholder names for several discrete priority values, analogous to the mentioned enums in Netfilter.
When creating a base chain, you can, for example, specify priority filter, which translates into priority 0. The available placeholder names are 8):
Name | Priority Value |
---|---|
raw | -300 |
mangle | -150 |
conntrack 9) | -200 |
dstnat | -100 |
filter | 0 |
security | 50 |
srcnat | 100 |
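As a sketch of how such placeholder names resolve to integers, the table above can be expressed as a lookup. The function `priority_from_name()` is a hypothetical helper for illustration and is not how the nft tool itself parses priorities.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct prio_name {
    const char *name;
    int value;
};

/* Names and values copied from the table above. */
static const struct prio_name prio_names[] = {
    { "raw",       -300 },
    { "mangle",    -150 },
    { "conntrack", -200 },
    { "dstnat",    -100 },
    { "filter",       0 },
    { "security",    50 },
    { "srcnat",     100 },
};

/* Returns 1 and stores the value in *out if name is known, else 0. */
int priority_from_name(const char *name, int *out)
{
    size_t i;

    assert(name != NULL && out != NULL);
    for (i = 0; i < sizeof(prio_names) / sizeof(prio_names[0]); i++) {
        if (strcmp(prio_names[i].name, name) == 0) {
            *out = prio_names[i].value;
            return 1;
        }
    }
    return 0;
}
```

The values match the corresponding constants in the IPv4 enum shown earlier, e.g. dstnat corresponds to NF_IP_PRI_NAT_DST and srcnat to NF_IP_PRI_NAT_SRC.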
The following example creates a table named myfilter in the ip address family (IPv4) and then creates two base chains named foo and bar, registering them with the Netfilter hook input, but each with a different priority:

    nft create table ip myfilter
    nft create chain ip myfilter foo { type filter hook input priority 0\; }
    nft create chain ip myfilter bar { type filter hook input priority 50\; }

    #alternatively you could create the same chains using named priority values
    nft create chain ip myfilter foo \
        { type filter hook input priority filter\; }
    nft create chain ip myfilter bar \
        { type filter hook input priority security\; }

As a result, IPv4 network packets traversing the Netfilter hook input will first traverse the foo chain and then the bar chain.
Negative Values
Nftables currently has a limitation (see bug ticket) which makes it difficult (or at least uncomfortable) to enter negative integer values for the priority on the nft command line. Using the placeholder names is probably the most comfortable workaround. However, if you really want to enter a negative integer value, one possible way to do so is this:

    #adding '--' makes it possible to specify negative priority
    nft -- add chain foo bar { type nat hook input priority -100\; }
What if priority is equal?
What actually happens when you register two base chains with the same hook which both have the same priority, e.g. by creating two Nftables base chains like this?

    nft create chain ip table1 chain1 { type filter hook input priority 0\; }
    nft create chain ip table1 chain2 { type filter hook input priority 0\; }
The source code of Netfilter answers this question. It actually allows callbacks with the same priority value to be registered with the same hook. In the example above, the function nf_register_net_hook() is first called for chain1 and then for chain2. I checked the kernel source code 10) and was able to confirm the behavior with the Nftables nftrace feature: The kernel code places chain2 BEFORE (in front of) chain1 in the array of callbacks for this hook. As a result, network packets traverse chain2 BEFORE chain1. This means that here the order in which you register the two chains becomes relevant!
However, I guess it is best practice to consider the order in which two chains with equal priority on the same hook are traversed to be “undefined”, and thus to either avoid this case or to design the rules added to those chains in a way in which they do not depend on the order of chain traversal. After all, the behavior I describe here is internal kernel behavior which is undocumented, and the implementation could change with any newer kernel version. Thus you should not rely on it!
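The observed tie behavior can be reproduced with a simplified model of a sorted insert. The assumption here is that a new entry is placed in front of the first existing entry whose priority is not lower than its own. That matches what is described above for kernel v5.4.0, but `struct hook_array` and `hook_array_insert()` are made-up names and this is not the kernel's code.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ENTRIES 16

/* One registered callback, reduced to a label and its priority. */
struct entry {
    const char *name;
    int priority;
};

/* Simplified stand-in for the per-hook array (not the kernel's layout). */
struct hook_array {
    struct entry entries[MAX_ENTRIES];
    size_t count;
};

/*
 * Inserts e so that entries stay sorted by ascending priority.
 * Among equal priorities the NEW entry ends up in front.
 * Returns 0 on success, -1 if the array is full.
 */
int hook_array_insert(struct hook_array *h, struct entry e)
{
    size_t pos = 0, i;

    assert(h != NULL);
    if (h->count >= MAX_ENTRIES)
        return -1;

    /* Skip only entries with a strictly lower priority value. */
    while (pos < h->count && h->entries[pos].priority < e.priority)
        pos++;

    /* Shift the tail one slot to the right and place the new entry. */
    for (i = h->count; i > pos; i--)
        h->entries[i] = h->entries[i - 1];
    h->entries[pos] = e;
    h->count++;
    return 0;
}
```

Inserting chain1 and then chain2, both with priority 0, leaves chain2 at index 0 and chain1 at index 1. Changing the `<` in the while loop to `<=` would flip that order, which illustrates how small an implementation detail this is and why relying on it is a bad idea.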
Best practices
It seems to have become common practice among most users to use the same naming concepts for tables and chains in Nftables as were used in Iptables (mainly naming the base chains after the hooks they get registered with). Further, it is good practice to only create the tables and chains you require for your use case. This not only makes your ruleset smaller and potentially easier to read and maintain, it is also relevant for performance. Be aware that each chain you create and register with one of the Netfilter hooks (= each “base chain”) will actually be traversed by network packets and thereby incurs a performance penalty.
Example: NAT edge router
If you, for example, want to do some simple IPv4 packet filtering and snat (masquerading) on an edge router, which is a very common case, then this set of tables and chains would probably be sufficient:
table | base chains |
---|---|
filter | input, forward, output |
nat | postrouting |
You create these tables and chains in address family ip:

    nft create table ip nat
    nft create chain ip nat postrouting \
        { type nat hook postrouting priority srcnat\; }

    nft create table ip filter
    nft create chain ip filter input \
        { type filter hook input priority filter\; }
    nft create chain ip filter forward \
        { type filter hook forward priority filter\; }
    nft create chain ip filter output \
        { type filter hook output priority filter\; }
As a result, the chains registered with the IPv4 Netfilter hooks will look like this:
Then you add some simple masquerading and packet filtering rules:
    nft add rule ip nat postrouting oif eth1 masquerade
    nft add rule ip filter forward iif eth1 \
        ct state new,invalid,untracked drop
    nft add rule ip filter input iif eth1 ip protocol != icmp \
        ct state new,invalid,untracked drop
(I merely gave a minimalist example here. One could even remove the output chain again, because I did not add any rules to it. In reality you will certainly add a more complex set of rules.)
Context
The described Nftables behavior and implementation has been observed on a Debian buster system with backports on amd64 platform.
- kernel #1 SMP Debian 5.4.19-1~bpo10+1 (2020-03-09)
- nftables 0.9.3-2~bpo10+1
- libnftnl 1.1.5-1~bpo10+1
Feedback
Feedback to this article is very welcome!
2) […] DECNET, which I consider historic.
6) […] nat table also contains an INPUT chain. The command iptables -t nat -L shows it. However, at the time of writing it is not yet shown in the latest version of the Netfilter packet flow image. It seems to have been added sometime around kernel 2.6.35 with this commit.
[…] man 8 nft) for details.
10) […] nf_hook_entries_grow() in net/netfilter/core.c in kernel v5.4.0