Thermalcircle.de

climbing the thermals


blog:linux:nftables_packet_flow_netfilter_hooks_detail

Differences

This shows you the differences between two versions of the page.

blog:linux:nftables_packet_flow_netfilter_hooks_detail [2020-10-16] – [Example: NAT edge router] Andrej Stenderblog:linux:nftables_packet_flow_netfilter_hooks_detail [2022-08-07] (current) – activated TOC Andrej Stender
Line 1: Line 1:
 +{{tag>linux kernel netfilter nftables iptables}}
 ====== Nftables - Packet flow and Netfilter hooks in detail ====== ====== Nftables - Packet flow and Netfilter hooks in detail ======
 ~~META: ~~META:
 date created = 2020-05-17  date created = 2020-05-17 
 ~~ ~~
- 
-~~NOTOC~~ 
  
 If you are using //Iptables// or the newer //Nftables// and you are merely doing some simple If you are using //Iptables// or the newer //Nftables// and you are merely doing some simple
Line 10: Line 9:
 official documentation and by a quick look through websites which official documentation and by a quick look through websites which
 provide example configurations.  provide example configurations. 
- 
 However, if you are working on a little bit more complex stuff like writing However, if you are working on a little bit more complex stuff like writing
-//Nftables// rules while caring for both IPv4 and IPv6, while using IPsec((Check out my other article [[:blog:linux:nftables_ipsec_packet_flow|Nftables - Netfilter and VPN/IPsec packet flow]], where I cover that topic.))+//Nftables// rules while caring for both IPv4 and IPv6, while using IPsec
 and doing NAT, or other of the "more interesting" stuff... then things tend and doing NAT, or other of the "more interesting" stuff... then things tend
 to get a little more tricky. to get a little more tricky.
Line 21: Line 19:
 in a little more detail. in a little more detail.
  
 +===== Rationale =====
 I for myself always like to know how things work and to dig a little deeper than I for myself always like to know how things work and to dig a little deeper than
 just gaining the very minimum knowledge required to solve the issue at hand. just gaining the very minimum knowledge required to solve the issue at hand.
Line 27: Line 26:
 the available documentation is outdated. Many of the more interesting details the available documentation is outdated. Many of the more interesting details
 are often only covered by older articles focused on the //Nftables// predecessor //Iptables//. are often only covered by older articles focused on the //Nftables// predecessor //Iptables//.
- 
 After digging through a lot of websites, some kernel source code and doing some practical After digging through a lot of websites, some kernel source code and doing some practical
 experimenting involving the //trace// and //log// features of //Nftables//, experimenting involving the //trace// and //log// features of //Nftables//,
Line 41: Line 39:
 and thereby the packet flow through the //tables//, //chains// and //rules// of and thereby the packet flow through the //tables//, //chains// and //rules// of
 //Iptables// or //Nftables//. Probably the most famous, detailed and best //Iptables// or //Nftables//. Probably the most famous, detailed and best
-maintained image is the following one. +maintained image is shown in Figure {{ref>nfpackflowofficial}}.
-The original author is Jan Engelhardt and it has been published on [[https://commons.wikimedia.org/wiki/File:Netfilter-packet-flow.svg|Wikipedia]] under the [[https://en.wikipedia.org/wiki/en:Creative_Commons|Creative Commons]] [[https://creativecommons.org/licenses/by-sa/3.0/deed.en|Attribution-Share Alike 3.0 Unported]] license((This kindly allows me to use it as I publish my content under a compatible license. Thank you. See my licensing statement on page bottom.)).+
  
 +<figure nfpackflowofficial>
 {{:linux:netfilter-packet-flow.png?direct&700|}} {{:linux:netfilter-packet-flow.png?direct&700|}}
 +<caption>Netfilter Packet Flow image, published on [[https://commons.wikimedia.org/wiki/File:Netfilter-packet-flow.svg|Wikipedia]], [[https://creativecommons.org/licenses/by-sa/3.0/deed.en|CC BY-SA 3.0]]((Author: Jan Engelhardt\\ This kindly allows me to use this image as I publish my content under a compatible license. Thank you. See my licensing statement on page bottom.))
 +</caption>
 +</figure>
  
-However, what this image shows you is the packet flow through the //Netfilter hooks// and thereby the packet flow through the //tables// and //chains// like they existed in old //Iptables//. In //Nftables// however you are free to create and name //tables// and //chains// to your liking, so things will probably look a little different then. The image still remains very useful, especially because it contains a lot further details like //bridging//, //ingress// hook and //IPsec//%%/%%//xfrm//, however when interpreting it you are required to "read a little between the lines".+However, what this image shows you is the packet flow through the //Netfilter hooks// and thereby the packet flow through the //tables// and //chains// like they existed in old //Iptables//. In //Nftables// however you are free to create and name //tables// and //chains// to your liking, so things will probably look a little different then. The image still remains very useful, especially because it contains a lot of further details like //bridging//, //ingress// hook and //IPsec//%%/%%//Xfrm//((Check out my other article [[:blog:linux:nftables_ipsec_packet_flow|Nftables - Netfilter and VPN/IPsec packet flow]], where I cover that topic.)), however when interpreting it you are required to "read a little bit between the lines".
  
 ===== Netfilter ===== ===== Netfilter =====
 The [[wp>Netfilter|Netfilter]] framework within the Linux kernel is the basic building block on which packet selection systems like //Iptables// or the newer //Nftables// are built upon. The [[wp>Netfilter|Netfilter]] framework within the Linux kernel is the basic building block on which packet selection systems like //Iptables// or the newer //Nftables// are built upon.
-It provides a bunch of //hooks// inside the Linux kernel, which are being traversed by network packets as those flow through the kernel. Other kernel components can register callback functions with those hooks, which enables them to examine the packets and to make decisions on whether packets shall be //dropped// (=deleted) or be //accepted// (=keep going on their way through the kernel). +It provides a bunch of //hooks// inside the Linux kernel, which are being traversed by network packets as those flow through the kernel. Other kernel components can register callback functions with those hooks, which enables them to examine the packets and to make decisions on whether packets shall be //dropped// (=deleted) or be //accepted// (=keep going on their way through the kernel). Figure {{ref>nfhookssimple}} is a simplified version of the Netfilter packet flow image which shows these hooks.
-The following is a simplified version of the netfilter packet flow image which shows these hooks (the blue boxes in the image):+
  
 +<figure nfhookssimple>
 {{ :linux:nf-hooks-simple1.png?nolink&700 }} {{ :linux:nf-hooks-simple1.png?nolink&700 }}
 +<caption>Netfilter hooks - simple block diagram</caption>
 +</figure>
  
 A network packet received on a network device first traverses the //Prerouting// hook. Then the routing decision happens and thereby the kernel determines whether this packet is destined for a local process (e.g. socket of a server listening on the system) or whether the packet shall be forwarded (in that case the system works as a router). In the first case the packet then traverses the //Input// hook and is then given to the local process. In the second case the packet traverses the //Forward// hook and finally the //Postrouting// hook, before being sent out on a network device. A packet which has been generated by a local process (e.g. a client or server software which wants to send something out on the network), first traverses the //Output// hook and then also the //Postrouting// hook, before it is sent out on a network device. A network packet received on a network device first traverses the //Prerouting// hook. Then the routing decision happens and thereby the kernel determines whether this packet is destined for a local process (e.g. socket of a server listening on the system) or whether the packet shall be forwarded (in that case the system works as a router). In the first case the packet then traverses the //Input// hook and is then given to the local process. In the second case the packet traverses the //Forward// hook and finally the //Postrouting// hook, before being sent out on a network device. A packet which has been generated by a local process (e.g. a client or server software which wants to send something out on the network), first traverses the //Output// hook and then also the //Postrouting// hook, before it is sent out on a network device.
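The three paths described above can be written down as a small lookup. This is only a user-space sketch to make the sequence explicit: all names (''model_hook_id'', ''model_path'', ''model_hooks_on_path()'') are made up for this illustration and are not kernel identifiers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for this sketch; these are not kernel identifiers. */
enum model_hook_id {
    MODEL_PREROUTING,
    MODEL_INPUT,
    MODEL_FORWARD,
    MODEL_OUTPUT,
    MODEL_POSTROUTING
};

enum model_path {
    MODEL_PATH_TO_LOCAL_PROCESS,   /* received, destined for a local socket */
    MODEL_PATH_FORWARDED,          /* received, routed on to another host */
    MODEL_PATH_LOCALLY_GENERATED   /* sent by a local process */
};

#define MODEL_MAX_HOOKS_ON_PATH 3

/*
 * Writes the hooks a packet traverses on the given path into out[]
 * (in traversal order) and returns how many hooks were written.
 */
size_t model_hooks_on_path(enum model_path path,
                           enum model_hook_id out[MODEL_MAX_HOOKS_ON_PATH])
{
    assert(out != NULL);

    switch (path) {
    case MODEL_PATH_TO_LOCAL_PROCESS:
        out[0] = MODEL_PREROUTING;
        out[1] = MODEL_INPUT;
        return 2;
    case MODEL_PATH_FORWARDED:
        out[0] = MODEL_PREROUTING;
        out[1] = MODEL_FORWARD;
        out[2] = MODEL_POSTROUTING;
        return 3;
    case MODEL_PATH_LOCALLY_GENERATED:
        out[0] = MODEL_OUTPUT;
        out[1] = MODEL_POSTROUTING;
        return 2;
    }
    return 0; /* unknown path: no hooks */
}
```

Note that only the forwarded path touches three hooks, and that //Postrouting// is shared by forwarded and locally generated packets.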
  
-Those five hooks have been present in the Linux kernel for a very long time. You can e.g. already find an equivalent of the image above in the [[https://netfilter.org/documentation/HOWTO//netfilter-hacking-HOWTO-3.html|Linux netfilter Hacking HOWTO]] from 2002. The good news is that at least from a bird's eye view all this is still accurate today. Of course, if you look into details, things are more complex now. I try to show that in the image below (click to enlarge). The //courier// font in the image indicates how things are named within the Linux kernel source code.+Those five hooks have been present in the Linux kernel for a very long time. You can e.g. already find an equivalent of Figure {{ref>nfhookssimple}} in the [[https://netfilter.org/documentation/HOWTO//netfilter-hacking-HOWTO-3.html|Linux netfilter Hacking HOWTO]] from 2002. The good news is that at least from a bird's eye view all this is still accurate today. Of course, if you look into details, things are more complex now. I try to show that in Figure {{ref>nfhooksdetail}}. The //courier// font within the image indicates how things are named within the Linux kernel source code.
  
 +<figure nfhooksdetail>
 {{:linux:nf-hooks-detail1.jpg?direct&700|}} {{:linux:nf-hooks-detail1.jpg?direct&700|}}
 +<caption>Netfilter hooks in more detail: IPv4, IPv6, ARP, Bridging, network namespaces and Ingress\\ (click to enlarge)</caption>
 +</figure>
  
 As you can see, those five hooks exist independently for the IPv4 and for the IPv6 protocol (meaning IPv4 and IPv6 packets each traverse their own hooks). Further hooks exist to be traversed by ARP packets or when you do //bridging// (I do not go into details about those here). An additional //ingress// hook exists, which exists independently for each network device. The list goes on... no guarantee for completeness((I already intentionally left some things unmentioned like e.g. ''DECNET'', which I consider historic.)). //Nftables// abstracts these things with what it calls //Address Families// (''ip'', ''ip6'', ''inet'', ''arp'', ''bridge'', ''netdev''), but more about that later. As you can see, those five hooks exist independently for the IPv4 and for the IPv6 protocol (meaning IPv4 and IPv6 packets each traverse their own hooks). Further hooks exist to be traversed by ARP packets or when you do //bridging// (I do not go into details about those here). An additional //ingress// hook exists, which exists independently for each network device. The list goes on... no guarantee for completeness((I already intentionally left some things unmentioned like e.g. ''DECNET'', which I consider historic.)). //Nftables// abstracts these things with what it calls //Address Families// (''ip'', ''ip6'', ''inet'', ''arp'', ''bridge'', ''netdev''), but more about that later.
Line 68: Line 74:
 explicitly make use of network namespaces (e.g. by creating additional ones), explicitly make use of network namespaces (e.g. by creating additional ones),
 still one instance, the default network namespace //"init_net"//, always exists still one instance, the default network namespace //"init_net"//, always exists
-and then all the networking happens inside it.+and then all the networking happens inside this namespace.
  
 All the mentioned hooks exist independently (=are being re-created) within each All the mentioned hooks exist independently (=are being re-created) within each
 network namespace((The only exception here is the //ingress// hook which is bound to network namespace((The only exception here is the //ingress// hook which is bound to
-an individual //network device// and thereby (at least not directly) to a //network namespace//.)). That means the data structures in the Linux kernel which hold the list of callback+an individual //network device// and thereby (at least not directly) to a //network namespace//.)), as shown in Figure {{ref>nfhooksdetail}}. That means the data structures in the Linux kernel which hold the list of callback
 functions which are registered with the hooks, are re-created (initially empty) functions which are registered with the hooks, are re-created (initially empty)
-for each new network namespace. Thus who is registered with those hooks is+for each new network namespace. Thus, who is registered with those hooks is
 different and individual to each network namespace. different and individual to each network namespace.
 Of course the actual concept of network namespaces and its impact goes Of course the actual concept of network namespaces and its impact goes
Line 80: Line 86:
  
  
-==== Register callbacks ==== +==== Register hook functions ==== 
-As already mentioned, the idea of the hooks is to give other kernel components the opportunity to register //callback// functions with a hook which are then being called for each network packet which traverses this hook. //Netfilter// provides an API to do that and both //Iptables// and //Nftables// and further systems like //Connection Tracking// make use of it. This API provides these two functions  to register/unregister a callback function with a specific hook: ''nf_register_net_hook()'' and ''nf_unregister_net_hook()''.+As already mentioned, the idea of the hooks is to give other kernel components the opportunity to register //callback// functions with a Netfilter hook which are then being called for each network packet which traverses this hook. //Netfilter// provides an API to do that and both //Iptables// and //Nftables// and further systems like //Connection Tracking// make use of it. This API provides the functions ''[[https://elixir.bootlin.com/linux/v5.4.19/source/net/netfilter/core.c#L449|nf_register_net_hook()]]'' and ''[[https://elixir.bootlin.com/linux/v5.4.19/source/net/netfilter/core.c#L425|nf_unregister_net_hook()]]''((and further variations of those functions)) to register/unregister a callback function with a specific hook. Figure {{ref>nfhookregister}} visualizes this.
  
 +<figure nfhookregister>
 {{ :linux:nf-hook-entries-register1.png?direct&600 }} {{ :linux:nf-hook-entries-register1.png?direct&600 }}
 +<caption>Netfilter API to register/unregister callbacks ("hook functions") with a hook</caption>
 +</figure>
  
-Several callback functions can be registered with the same hook. //Netfilter// holds the function pointers of those callback functions (together with some meta data) in an array, which is dynamically being grown or shrunk each time when some component registers/unregisters a callback. Each hook has its own array,
-implemented as an instance of ''struct nf_hook_entries'' in the kernel.
+Several callback functions can be registered with the same hook. //Netfilter// holds the function pointers of those functions (together with some meta data) in an array, which is dynamically being grown or shrunk each time when some component registers/unregisters a function. Each Netfilter hook has its own array, implemented as an instance of ''struct nf_hook_entries'' in the kernel.
+In most other documentation on the Internet as well as in discussions among the Netfilter developer community, those registered callback functions are usually referred to as "hook functions"((Sometimes they are simply referred to as "hooks", which creates some ambiguity. Be careful when you read something about a "hook" somewhere on the Internet... the meaning might be a "Netfilter hook", but it might also be a "callback function" registered with one of the Netfilter hooks.)). Thus, I will also refer to them as "hook functions" from now on.
  
 ==== Priority ==== ==== Priority ====
-The sequence of callbacks in this array is important, because network packets which traverse the hook, will traverse the callbacks in the sequence in which those are present within the array. When registering a callback, the caller needs to specify a //priority// value (shown in red color in the image above), which is then used by //Netfilter// to determine WHERE to insert the new callback into the array. The //priority// is a signed integer value (''int'') and the whole value range of that data type can be used. As you see in the image, //Netfilter// sorts the callbacks in ascending order from lower to higher //priority// values, thus callback with lower value like ''-200'' comes BEFORE a callback with a higher value like ''100''. However in practice not the full range of values of the //priority// integer seems to be used. The kernel contains several //enums// which define some common discrete //priority// values. Things seem a little messy here, because those enums are (a little) different for each protocol (= for each //Address Family// how //Nftables// would call it). Here as an example the enum for the IPv4 protocol:<code c>
-/* from include/uapi/linux/netfilter_ipv4.h (kernel v5.4.0) */
+The sequence of hook functions in this array is important, because network packets which traverse the hook will traverse the hook functions in the sequence in which those are present within the array. When registering a hook function, the caller needs to specify a //priority// value (shown in red color in Figure {{ref>nfhookregister}}), which is then used by //Netfilter// to determine WHERE to insert the new hook function into the array. The //priority// is a signed integer value (''int'') and the whole value range of that data type can be used. As you see in Figure {{ref>nfhookregister}}, //Netfilter// sorts the hook functions in ascending order from lower to higher //priority// values. Thus, a hook function with a lower value like ''-200'' comes BEFORE a hook function with a higher value like ''100''.
+However, in practice the full range of values of the //priority// integer does not seem to be used. The kernel contains several //enums// which define some common discrete //priority// values. Things seem a little messy here, because those enums are (a little) different for each protocol (= for each //Address Family//, as //Nftables// would call it). Figure {{ref>nfipv4hookpriorities}} shows the enum for the IPv4 protocol as an example.
 +<figure nfipv4hookpriorities> 
 +<code c>
 enum nf_ip_hook_priorities { enum nf_ip_hook_priorities {
  NF_IP_PRI_FIRST = INT_MIN,  NF_IP_PRI_FIRST = INT_MIN,
Line 109: Line 120:
 }; };
 </code> </code>
 +<caption>IPv4 hook priorities //enum//\\ 
 +Source code extract from ''[[https://elixir.bootlin.com/linux/v5.4.19/source/include/uapi/linux/netfilter_ipv4.h#L30|include/uapi/linux/netfilter_ipv4.h]]''</caption>
 +</figure>
  
-I go into such detail here, because this enum shows you the discrete //priority// values which are being used by kernel components like //connection tracking// when registering their own callbacks with a //Netfilter// hook. This is relevant for //Iptables// and //Nftables// as you will see below.+I go into such detail here, because this enum shows you the discrete //priority// values which are being used by kernel components like //connection tracking// when registering their own hook functions with a Netfilter hook. This is relevant for //Iptables// and //Nftables// as you will see below.
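To make the ordering rule concrete, here is a small user-space model of one hook's array. It is a sketch and not the kernel implementation: the real array is an instance of ''struct nf_hook_entries'' and is grown or shrunk dynamically, while this model uses a fixed-size array and made-up names (''model_hook'', ''model_register()'', ''model_unregister()''). How entries with equal //priority// are ordered is an arbitrary choice of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_ENTRIES 16  /* arbitrary limit, only for this sketch */

/* One registered hook function; the name is just a label for illustration. */
struct model_entry {
    const char *name;
    int priority;
};

/* Model of the per-hook array, kept sorted by ascending priority. */
struct model_hook {
    struct model_entry entries[MODEL_MAX_ENTRIES];
    size_t count;
};

/*
 * Inserts a new entry so that the array stays sorted from lower to higher
 * priority. Returns 0 on success, -1 if the array is full.
 * Assumption of this sketch: equal priorities go after existing entries.
 */
int model_register(struct model_hook *hook, const char *name, int priority)
{
    size_t pos = 0;
    size_t i;

    assert(hook != NULL && name != NULL);
    if (hook->count == MODEL_MAX_ENTRIES)
        return -1;

    /* Find the first entry with a strictly higher priority. */
    while (pos < hook->count && hook->entries[pos].priority <= priority)
        pos++;

    /* Shift the tail one slot to the right to make room. */
    for (i = hook->count; i > pos; i--)
        hook->entries[i] = hook->entries[i - 1];

    hook->entries[pos].name = name;
    hook->entries[pos].priority = priority;
    hook->count++;
    return 0;
}

/* Removes the first entry with the given name. Returns 0 if found, else -1. */
int model_unregister(struct model_hook *hook, const char *name)
{
    size_t i, j;

    assert(hook != NULL && name != NULL);
    for (i = 0; i < hook->count; i++) {
        if (strcmp(hook->entries[i].name, name) == 0) {
            for (j = i; j + 1 < hook->count; j++)
                hook->entries[j] = hook->entries[j + 1];
            hook->count--;
            return 0;
        }
    }
    return -1;
}
```

Registering entries with the priorities ''100'', ''-200'' and ''-150'' in that order leaves the array as ''-200'', ''-150'', ''100'': the registration order does not matter, only the //priority// value does.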
  
 ==== Hard-coded vs. Flexibility ==== ==== Hard-coded vs. Flexibility ====
-The //Netfilter// hooks themselves are hard-coded into the Linux kernel network stack. You'll find them in the source code if you search for function calls named ''NF_HOOK()''((or similar... a few variations exist)). In case you are wondering, why other kernel components are required to register callbacks with these hooks at
-runtime and why those callbacks are not also hard coded... well I did not write this code, so your guess is as good as mine. There are many potential reasons which might have led to these design decisions, but common sense (and comments on some websites) made at least these two reasons obvious to me:
+The Netfilter hooks themselves are hard-coded into the Linux kernel network stack. You'll find them in the source code if you search for function calls named ''NF_HOOK()''((or similar... a few variations exist)). In case you are wondering why other kernel components are required to register hook functions with these Netfilter hooks at runtime and why those hook functions are not also hard-coded... well, I did not write this code, so your guess is as good as mine. There are many potential reasons which might have led to these design decisions, but common sense (and comments on some websites) made at least these two reasons obvious to me:
- +
-  - For once this kind of flexibility during runtime is an essential basic requirement in a kernel where many components (also //Netfilter//, //Nftables//, //Iptables// and //connection tracking//) can potentially be loaded or unloaded during runtime as //kernel modules// and which employs powerful concepts of further abstraction like //network namespaces//+
-  - Performance is a crucial issue. Every network packet needs to traverse all callbacks registered with a hook. Thus those callbacks should be registered in an economical way. This is probably one of the driving reasons why //base chains// in //Nftables// need to be explicitly created by the user in contrast to the pre-defined chains of //Iptables// (more details below).+
  
 +  - For one thing, this kind of flexibility during runtime is an essential basic requirement in a kernel where many components (also //Nftables//, //Iptables// and //connection tracking//) can potentially be loaded or unloaded during runtime as //kernel modules// and which employs powerful concepts of further abstraction like //network namespaces//.
 +  - Performance is a crucial issue. Every network packet needs to traverse all hook functions registered with a Netfilter hook. Thus, those hook functions should be registered in an economical way. This is probably one of the driving reasons why //base chains// in //Nftables// need to be explicitly created by the user in contrast to the pre-defined chains of //Iptables// (more details below).
  
 ==== Hook traversal and verdict ==== ==== Hook traversal and verdict ====
-{{ :linux:nf-hook-entries-flow1.png?direct&700 }}
-Now let's take a more detailed look on how the callbacks which are registered with the same hook are being traversed by network packets. The image above shows this (click to enlarge).
-For each network packet which traverses this hook, the callback functions are being called one by one
+Now let's take a more detailed look at how the hook functions which are registered with the same Netfilter hook are being traversed by network packets.
+For each network packet which traverses this hook, the hook functions are being called one by one
 in the sequence/order in which they are present within the array of the hook (the sequence defined by in the sequence/order in which they are present within the array of the hook (the sequence defined by
-the //priority// value). Network packets are represented within the Linux kernel as instances
-of ''struct sk_buff'' (often abbreviated as //"skb"//). A pointer to such an //skb// instance is given as function argument to all these callback functions, so each one can examine the packet. Each callback is required to give "verdict" back to //Netfilter// as //return-value//. There are several possible values for the "verdict", but for understanding these concepts only these two are relevant: ''NF_ACCEPT'' and ''NF_DROP''. ''NF_ACCEPT'' tells //Netfilter// that the overall "verdict" of the callback is that it "accepts" the network packet. This means the packet now traverses the next callback registered with this hook (if existing). If all callbacks of this hook return ''NF_ACCEPT'', then the packet finally continues its traversal of the kernel network stack. However if a callback returns ''NF_DROP'' then the packet is being "dropped" (=deleted) and no further callbacks or parts of the network stack are being traversed.
+the //priority// value).
 +<figure nfhookentriesflow> 
 +{{ :linux:nf-hook-entries-flow1.png?direct&700 }} 
 +<caption>Packet flow through hook functions registered with a Netfilter hook (click to enlarge)</caption> 
 +</figure>
  
 +Network packets are represented within the Linux kernel as instances of ''struct sk_buff'' (often referred to as "socket buffer" and abbreviated as //"skb"//). A pointer to such an //skb// instance is given as function argument to all these hook functions, so each one can examine the packet. Each hook function is required to give a "verdict" back to //Netfilter// as //return-value//. There are several possible values for the "verdict", but for understanding these concepts only these two are relevant: ''NF_ACCEPT'' and ''NF_DROP''. ''NF_ACCEPT'' tells //Netfilter// that the hook function "accepts" the network packet. This means the packet now traverses the next hook function registered with this hook (if existing). If all hook functions of this hook return ''NF_ACCEPT'', then the packet finally continues its traversal of the kernel network stack. However, if a hook function returns ''NF_DROP'', then the packet is being "dropped" (=deleted) and no further hook functions or parts of the network stack are being traversed.
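The traversal logic can be sketched in a few lines of user-space C. This is a simplified model under stated assumptions, not kernel code: ''model_packet'' stands in for ''struct sk_buff'', the hook function signature is reduced to a single argument, only the two verdicts discussed above are modelled, and all ''model_*'' names as well as the two example hook functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Only the two verdicts discussed in the text are modelled here. */
enum model_verdict { MODEL_DROP = 0, MODEL_ACCEPT = 1 };

/* Stand-in for struct sk_buff; a real skb carries far more than this. */
struct model_packet {
    unsigned int dst_port;
};

/* Simplified hook function type: examine the packet, return a verdict. */
typedef enum model_verdict (*model_hookfn)(const struct model_packet *pkt);

/* Example hook function which accepts every packet. */
enum model_verdict model_accept_all(const struct model_packet *pkt)
{
    (void)pkt;
    return MODEL_ACCEPT;
}

/* Example hook function which drops packets destined for port 23. */
enum model_verdict model_drop_port_23(const struct model_packet *pkt)
{
    return pkt->dst_port == 23 ? MODEL_DROP : MODEL_ACCEPT;
}

/*
 * Calls the hook functions one by one in array order. Traversal stops at
 * the first MODEL_DROP. If called is not NULL, it receives the number of
 * hook functions which were actually invoked.
 */
enum model_verdict model_traverse(const model_hookfn *fns, size_t n,
                                  const struct model_packet *pkt,
                                  size_t *called)
{
    size_t i;

    assert(pkt != NULL);
    if (called != NULL)
        *called = 0;

    for (i = 0; i < n; i++) {
        enum model_verdict verdict = fns[i](pkt);

        if (called != NULL)
            *called = i + 1;
        if (verdict == MODEL_DROP)
            return MODEL_DROP;  /* later hook functions never see the packet */
    }
    return MODEL_ACCEPT;        /* packet continues through the network stack */
}
```

With the array ''{ model_accept_all, model_drop_port_23, model_accept_all }'', a packet to port 80 passes all three functions, while a packet to port 23 is dropped by the second one and the third is never called.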
  
 ===== Iptables ===== ===== Iptables =====
-To put things into context, let's take a short look at //Iptables// as the predecessor of //Nftables//. //Iptables// organizes its //rules// into //tables// and //chains//, whereas //tables// merely are a means (a container) to group //chains// together, which have something in common (e.g. //chains// which are used for //nat// belong to the ''nat'' //table//). The actual //rules// reside inside the //chains//. +To put things into context, let's take a short look at //Iptables// as the predecessor of //Nftables//. //Iptables// organizes its //rules// into //tables// and //chains//, whereas //tables// for the most part merely are a means (a container) to group //chains// together, which have something in common. E.g. //chains// which are used for //nat// belong to the ''nat''((Well, ''nat'' is already a special case and there is more magic behind it. E.g. only the very first packet of each connection will traverse the //chains// of the ''nat'' table, but that topic is beyond this article.)) //table//. The actual //rules// reside inside the //chains//. //Iptables// registers its //chains// with the Netfilter hooks by registering its own hook functions as described above. This means when a network packet traverses a hook (e.g. //Prerouting//), then this packet traverses the //chains// which are registered with this hook and thereby traverses their //rules//.
-//Iptables// registers its //chains// with the //Netfilter// hooks by registering its own callback functions as described above. This means when a network packet traverses a hook (e.g. //Prerouting//), then this packet traverses the //chains// which are registered with this hook and thereby traverses their //rules//.+
  
-In case of //Iptables// all that is already pre-defined. A fixed set of //tables// exists, each //table// containing a fixed set of //chains//((Ok, as a user you can also create additional //chains// if you want, but those are not registered with //Netfilter// hooks and anyway that is a different topic.)). The //chains// are named like the hooks with which they are registered. +In case of //Iptables// all that is already pre-defined. A fixed set of //tables// exists, each //table// containing a fixed set of //chains//((Ok, as a user you can also create additional //chains// if you want, but those are not registered with Netfilter hooks and anyway that is a different topic.)). The //chains// are named like the Netfilter hooks with which they are registered. 
  
 ^ table ^ contains chains ^ command to show that ^ ^ table ^ contains chains ^ command to show that ^
Line 141: Line 156:
 | ''raw''    | ''PREROUTING'',  ''OUTPUT'' | ''iptables -t raw -L'' | | ''raw''    | ''PREROUTING'',  ''OUTPUT'' | ''iptables -t raw -L'' |
  
-The sequence in which the //chains// are being traversed when a packet traverses the hook (their //priority//) is also already fixed. The {{:linux:netfilter-packet-flow.png?linkonly|Netfilter packet flow image}} shows this sequence in detail. In the image, each //chain// registered with a hook is represented by a box like the following, containing the name of the //chain// and the //table// it belongs to.+The sequence in which the //chains// are being traversed when a packet traverses the hook (their //priority//) is also already fixed. The Netfilter packet flow image (Figure {{ref>nfpackflowofficial}}) shows this sequence in detail. In the image, each //chain// registered with a hook is represented by a block like the following in Figure {{ref>nfhookentrylegend}}, containing the name of the //chain// and the //table// it belongs to.
  
 +<figure nfhookentrylegend>
 {{ :linux:nf-hook-entry-legend1.jpg?direct&200 |}} {{ :linux:nf-hook-entry-legend1.jpg?direct&200 |}}
 +<caption>Understanding how tables/chains are visualized in the Netfilter packet flow image ({{ref>nfpackflowofficial}}).</caption>
 +</figure>
  
-I additionally show the //priority// here (in red color) because I like to further elaborate on it, however the //priority// value is not shown in the original Netfilter packet flow image. +I additionally show the //priority// here (in red color) because I like to further elaborate on it. However, the //priority// value is not shown in the original Netfilter packet flow image. 
-The ''iptables'' cmdline tool itself is only responsible for configuring //tables//, //chains// and //rules// for handling IPv4 packets, thus its corresponding kernel component only registers its //chains// with the five //Netfilter// hooks of the IPv4 protocol. To cover all the protocol families, the complete //Iptables// suite is split up into several distinct cmdline tools and corresponding kernel components:+The ''iptables'' cmdline tool itself is only responsible for configuring //tables//, //chains// and //rules// for handling IPv4 packets. Thus, its corresponding kernel component only registers its //chains// with the five //Netfilter// hooks of the IPv4 protocol. To cover all the protocol families, the complete //Iptables// suite is split up into several distinct cmdline tools and corresponding kernel components:
  
   * ''iptables'' for IPv4 / ''NFPROTO_IPV4''   * ''iptables'' for IPv4 / ''NFPROTO_IPV4''
Line 153: Line 171:
   * ''ebtables'' for Bridging / ''NFPROTO_BRIDGE''   * ''ebtables'' for Bridging / ''NFPROTO_BRIDGE''
  
-Let's take a look at ''iptables'' for IPv4. Because the //Iptables// //chains// are named after the hooks they are registered with, interpreting the image is straightforward (click to enlarge):+Let's take a look at ''iptables'' for IPv4. Because the //Iptables// //chains// are named after the hooks they are registered with, interpreting the Netfilter packet flow image is straightforward, as shown in Figure {{ref>nfthooksiptables}}.
  
 +<figure nfthooksiptables>
 {{ :linux:nf-hooks-iptables1.png?direct&700 |}} {{ :linux:nf-hooks-iptables1.png?direct&700 |}}
 +<caption>//Iptables// chains registered with IPv4 Netfilter hooks (+conntrack) (click to enlarge) (compare to {{ref>nfpackflowofficial}})</caption> 
 +</figure>
  
 ===== Connection tracking ===== ===== Connection tracking =====
-As you can see in the image above, the //connection tracking// system also registers itself with the //Netfilter// hooks and based on the //priority// value (''-200'') you can clearly see which //Iptables// //chain// is called BEFORE and which AFTER the //connection tracking// callback. +As you can see in Figure {{ref>nfthooksiptables}}, the //connection tracking// system also registers itself with the Netfilter hooks, and based on the //priority// value (''-200'') you can clearly see which //Iptables// //chain// is called BEFORE and which AFTER the //connection tracking// hook function. There is much more to tell about //connection tracking//. If you look further into the details, you'll see that the //connection tracking// system actually registers even more hook functions with the Netfilter hooks than shown here. However, the two hook functions shown represent a sufficient model to understand the behavior of //connection tracking// when creating //Iptables// or //Nftables// rules. I elaborate on the topic //connection tracking// in detail in a separate series of blog articles, starting with [[connection_tracking_1_modules_and_hooks|Connection tracking - Part 1: Modules and Hooks]].
  
-There is much more to tell about //connection tracking//. If you further look into details, then you'll see that the //connection tracking// system actually even registers more callback functions with the //Netfilter// hooks, than shown here. However, the two callbacks shown here represent a sufficient model to understand the behavior of //connection tracking// when creating //Iptables// or //Nftables// rules.  
-A very good article exists on this topic, written by Pablo Neira Ayuso, the Linux kernel maintainer of the //Netfilter// subsystem: [[http://people.netfilter.org/pablo/docs/login.pdf|Netfilter's connection tracking system]]. 
 ===== Nftables ===== ===== Nftables =====
 In general //Nftables// organizes its //rules// into //tables// and //chains// in the same way //Iptables// does. //Tables// are again containers for //chains// and //chains// are carrying the //rules//. In general //Nftables// organizes its //rules// into //tables// and //chains// in the same way //Iptables// does. //Tables// are again containers for //chains// and //chains// are carrying the //rules//.
 However, in contrast to //Iptables//, no pre-defined //tables// or //chains// exist. All //tables// and //chains// have to be explicitly created by the user. The user can give arbitrary names to the //tables// and //chains// when creating them. However, in contrast to //Iptables//, no pre-defined //tables// or //chains// exist. All //tables// and //chains// have to be explicitly created by the user. The user can give arbitrary names to the //tables// and //chains// when creating them.
-//Nftables// distinguishes between so-called //base chains// and //regular chains//. A //base chain// is a //chain// which is being registered with a //Netfilter// hook (by means of callback functions as described above) and you must specify that hook when you create the //chain//. +//Nftables// distinguishes between so-called //base chains// and //regular chains//. A //base chain// is a //chain// which is being registered with a Netfilter hook (by means of hook functions as described above) and you must specify that hook when you create the //chain//.
 A //regular chain// is not registered with any hook (//regular chains// are not covered in this article)((The //regular chains// represent the same feature as I already mentioned for //Iptables//. The user can create an arbitrary number of //chains// which are not registered to any hook and use them similar as you would use //functions// in a programming language. But that is an entirely different topic.)).  A //regular chain// is not registered with any hook (//regular chains// are not covered in this article)((The //regular chains// represent the same feature as I already mentioned for //Iptables//. The user can create an arbitrary number of //chains// which are not registered to any hook and use them similar as you would use //functions// in a programming language. But that is an entirely different topic.)). 
-Thus the user is not forced to name the //base chains// like the hooks they will be registered with. This obviously offers more freedom and flexibility, but thereby also has more potential to create confusion. +Thus, the user is not forced to name the //base chains// like the Netfilter hooks they will be registered with. This obviously offers more freedom and flexibility, but thereby also has more potential to create confusion.
  
 ==== Address Families ==== ==== Address Families ====
Line 183: Line 200:
  
 As a result, all //base chains// which you create within a //table// will be registered with the specified //Netfilter// hook of that //Address Family// which you selected for the //table//. The ''ip'' //Address Family// (IPv4) is the default one. So, if you do not specify any //Address Family// when creating a //table//, then this //table// will belong to ''ip''. As a result, all //base chains// which you create within a //table// will be registered with the specified //Netfilter// hook of that //Address Family// which you selected for the //table//. The ''ip'' //Address Family// (IPv4) is the default one. So, if you do not specify any //Address Family// when creating a //table//, then this //table// will belong to ''ip''.
-In the following example I intentionally mention the ''ip'' //Address Family// to emphasize what is happening:+ 
 +The following example creates a new table named ''foo'', belonging to address family ''ip'', then creates a new base chain named ''bar'' in table ''foo'', registering it with //Netfilter// hook ''input'' of the ''ip'' address family (=IPv4 protocol) and specifying priority ''0''. (I explicitly specify the ''ip'' //Address Family// here just to emphasize what is happening; it can be omitted.)
 <code bash> <code bash>
-#create a new table named 'foo', belonging to address family 'ip' 
 nft create table ip foo nft create table ip foo
- +nft create chain ip foo bar {type filter hook input priority 0\;}
-#create new base chain named 'bar' in table 'foo', register it with +
-#netfilter hook 'input' of the 'ip' address family (=IPv4 protocol) +
-#and specify priority '0' +
-nft create chain ip foo bar { type filter hook input priority 0\; }+
 </code> </code>
  
-=== The inet family === +The ''inet'' //Address Family// is special. When you create a //table// belonging to that family and then create a //base chain// within that //table//, then this //base chain// will get registered with two //Netfilter// hooks: The equivalent //hooks// of IPv4 and IPv6. This means both IPv4 and IPv6 packets will traverse the //rules// of this //chain//. 
-The ''inet'' //Address Family// is special. When you create a //table// belonging to that family and then create a //base chain// within that //table//, then this //base chain// will get registered with two //Netfilter// hooks: The equivalent //hooks// of IPv4 and IPv6. This means both IPv4 and IPv6 packets will traverse the //rules// of this //chain//. Example: +The following example creates a table ''foo'' and a base chain ''bar'' in address family ''inet''. Base chain ''bar'' will get registered with the Netfilter ''input'' hook of IPv4 and also with the Netfilter ''input'' hook of IPv6.
  
 <code bash> <code bash>
 nft create table inet foo nft create table inet foo
- +nft create chain inet foo bar {type filter hook input priority 0\;}
-#this base chain will get registered with the Netfilter 'input' +
-#hook of IPv4 and also to the Netfilter 'input' hook of IPv6 +
-nft create chain inet foo bar { type filter hook input priority 0\; }+
 </code> </code>
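The same ''inet'' table can also be written in the declarative ruleset syntax and loaded in one go. The following is only a minimal sketch under a few assumptions: it needs root privileges, it assumes an ''nft'' version which accepts ''-f -'' to read the ruleset from standard input, and the two counter rules are purely illustrative additions which are not part of the example above. They merely make visible that both IPv4 and IPv6 packets traverse the single base chain ''bar''.

```shell
#!/bin/sh
# Sketch only: needs root privileges. Assumes an nft version which
# accepts "-f -" to read a ruleset from standard input.
nft -f - <<'EOF'
table inet foo {
    chain bar {
        type filter hook input priority 0; policy accept;
        # Illustrative rules (not part of the example above): count
        # packets per protocol family within the same chain.
        meta nfproto ipv4 counter comment "IPv4 packets"
        meta nfproto ipv6 counter comment "IPv6 packets"
    }
}
EOF

# Show the table including the current counter values.
nft list table inet foo
```

If both counters increase while the host receives IPv4 and IPv6 traffic, this indicates that the chain is indeed attached to the ''input'' hooks of both families.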
 +
 ==== Priority ==== ==== Priority ====
 In the examples above you already saw that //Nftables// requires you to specify a //priority// In the examples above you already saw that //Nftables// requires you to specify a //priority//
 value when creating a //base chain//. This is the very same //priority// as I described already value when creating a //base chain//. This is the very same //priority// as I described already
 in detail when covering //Netfilter// above. You can specify integer values, but the newer in detail when covering //Netfilter// above. You can specify integer values, but the newer
-versions of //Nftables// also define placeholder names for several discrete //priority// values +versions of //Nftables// also define placeholder names for several discrete //priority// values analogous to the mentioned //enums// in //Netfilter//. The following table lists those placeholder names((No guarantee for completeness. At the time of writing, this still seems to be under heavy development. See the man page (''man 8 nft'') for details.)).
-analog to the mentioned //enums// in //Netfilter//. +
-When creating a //base chain//, you can e.g. specify ''priority filter'' which translates into ''priority 0''. The available placeholder names are((No guarantee for correctness or completeness... by time of writing this still seems to be under heavy development. See man page of //Nftables// (''man 8 nft'') for details.)):
  
 ^ Name         ^ Priority Value ^ ^ Name         ^ Priority Value ^
 | ''raw''      | ''-300'' | ''raw''      | ''-300''
 | ''mangle''   | ''-150'' | ''mangle''   | ''-150''
-| conntrack((As you can guess, this is NOT one of the placeholder names you can use. I added it here as a reminder which //priority// value is reserved for the //connection tracking// callback.))    | ''-200'' |+| conntrack((As you can guess, this is NOT one of the placeholder names you can use. I added it here as a reminder which //priority// value is reserved for the //connection tracking// hook function.))    | ''-200'' |
 | ''dstnat''   | ''-100'' | ''dstnat''   | ''-100''
 | ''filter''   | ''0''    |  | ''filter''   | ''0''    | 
Line 221: Line 231:
 | ''srcnat''   | ''100''  | ''srcnat''   | ''100'' 
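The effect of these values can be modeled without touching the kernel at all: within one hook, hook functions are called in ascending order of their //priority// value. The following is just a toy model in plain shell and does not use any //Nftables// interface. The helper name ''traversal_order'' is hypothetical, and the value ''50'' for ''security'' is taken from the example further below, where ''priority 50'' and ''priority security'' are used interchangeably.

```shell
#!/bin/sh
# Toy model, not an Nftables API: reads "name priority" pairs on stdin
# and prints the names in the order in which a packet would traverse
# them within one hook (lowest priority value first).
# The function name is hypothetical.
traversal_order() {
    sort -k2,2n | awk '{ print $1 }'
}

printf '%s\n' \
    'filter 0' \
    'srcnat 100' \
    'raw -300' \
    'security 50' \
    'conntrack -200' \
    'dstnat -100' \
    'mangle -150' | traversal_order
# prints: raw conntrack mangle dstnat filter security srcnat (one per line)
```

This makes it easy to see, for instance, that a //base chain// with ''priority raw'' is traversed before the //connection tracking// hook function, while one with ''priority mangle'' is traversed after it.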
  
-The following example creates a //table// named ''myfilter'' in the ''ip'' //address family// (IPv4) and then creates two //base chains// named ''foo'' and ''bar'', registering them with the //Netfilter// hook //input//, but each with different //priority//: +When creating a //base chain//, you can e.g. specify ''priority filter'', which translates into ''priority 0''. The following example creates a //table// named ''myfilter'' in the ''ip'' //address family// (IPv4). It then creates two //base chains// named ''foo'' and ''bar'', registering them with the //Netfilter// IPv4 hook //input//, but each with a different //priority//. Figure {{ref>nftex3}} shows the result. IPv4 network packets traversing the //Netfilter// hook //input// will first traverse the ''foo'' //chain// and then the ''bar'' //chain//.
 <code bash> <code bash>
 nft create table ip myfilter nft create table ip myfilter
 +nft create chain ip myfilter foo {type filter hook input priority 0\;}
 +nft create chain ip myfilter bar {type filter hook input priority 50\;}
  
-nft create chain ip myfilter foo { type filter hook input priority 0\; } +# alternatively you could create the same chains using named priority values: 
-nft create chain ip myfilter bar { type filter hook input priority 50\; } +nft create chain ip myfilter foo {type filter hook input priority filter\;} 
- +nft create chain ip myfilter bar {type filter hook input priority security\;}
-#alternatively you could create the same chains using named priority values +
-nft create chain ip myfilter foo +
- { type filter hook input priority filter\; } +
-nft create chain ip myfilter bar +
- { type filter hook input priority security\; }+
 </code> </code>
  
-As a result, IPv4 network packets traversing the //Netfilter// hook //input// +<figure nftex3>
-will first traverse the ''foo'' //chain// and then the ''bar'' //chain//: +
 {{:linux:netfilter-input-hook-nft-example1.png?nolink&200|}} {{:linux:netfilter-input-hook-nft-example1.png?nolink&200|}}
 +<caption>Base chains ''foo'' and ''bar'' registered with the //Netfilter// IPv4 //input// hook</caption>
 +</figure>
 +
 +//Nftables// currently has a limitation (see [[https://bugzilla.netfilter.org/show_bug.cgi?id=1083|bug ticket]]) which makes it difficult (or at least uncomfortable) to enter negative integer values for the //priority// on the ''nft'' command line. Using the placeholder names is probably the most comfortable workaround. Adding ''%%--%%'' after ''nft'' is another way to do it:
  
-=== Negative Values === 
-//Nftables// currently has a limitation (see [[https://bugzilla.netfilter.org/show_bug.cgi?id=1083|bug ticket]]) which makes it difficult (or at least uncomfortable) to enter negative integer values for the //priority// on the ''nft'' command line. Using the placeholder names is probably the most comfortable workaround. However if you really want to enter a negative integer value, one possible way to enter it is this: 
 <code bash> <code bash>
-#adding '--' makes it possible to specify negative priority +nft -- add chain foo bar {type nat hook input priority -100\;}
-nft -- add chain foo bar { type nat hook input priority -100\; }+
 </code> </code>
  
-=== What if priority is equal? === +But what actually happens when you register two //base chains// with the same hook which both have the same //priority//? The source code of //Netfilter// answers this question. It actually allows registering hook functions with the same hook which have the same //priority// value. In the following example, function ''nf_register_net_hook()'' is first called for //chain1// and then for //chain2//.
-What actually happens when you register two //base chains// with the same hook + 
-which both have the same //priority//, e.g. by creating two //Nftables// //base +<code bash>
-chains// like this:<code bash> +nft create chain ip table1 chain1 {type filter hook input priority 0\;} 
-nft create chain ip table1 chain1 { type filter hook input priority 0\; } +nft create chain ip table1 chain2 {type filter hook input priority 0\;}
-nft create chain ip table1 chain2 { type filter hook input priority 0\; }+
 </code> </code>
-The source code of //Netfilter// answers this question. It actually allows + 
-to register callbacks with the same hook which have the same //priority// value. +I checked the kernel source code((see function ''nf_hook_entries_grow()'' in
-In case of the example above, function ''nf_register_net_hook()'' is +
-first called for //chain1// and then for //chain2//. I checked the kernel +
-source code((see function ''nf_hook_entries_grow()'' in+
 ''net/netfilter/core.c'' in kernel v5.4.0)) and was able to confirm the behavior with the ''net/netfilter/core.c'' in kernel v5.4.0)) and was able to confirm the behavior with the
 //Nftables// ''nftrace'' feature: The kernel code places //chain2// BEFORE //Nftables// ''nftrace'' feature: The kernel code places //chain2// BEFORE
-(in front of) //chain1// in the array of callbacks for this hook. As a result, +(in front of) //chain1// in the array of hook functions for this hook. As a result, 
 network packets then traverse //chain2// BEFORE //chain1//. This means here network packets then traverse //chain2// BEFORE //chain1//. This means here
-the sequence/order in which you register both chains becomes relevant! +the sequence/order in which you issue the commands to register both chains becomes relevant!
 However, I guess it is best practice to consider the sequence in which two However, I guess it is best practice to consider the sequence in which two
 chains with equal //priority// on the same hook are traversed to be chains with equal //priority// on the same hook are traversed to be
-"undefined" and thus to either avoid this case or to design the //rules// added +"undefined" and thus to either avoid this case or to design the //rules// added
 to those //chains// in a way in which they do not depend on the sequence of to those //chains// in a way in which they do not depend on the sequence of
 //chain// traversal. After all, the behavior I describe here is an internal //chain// traversal. After all, the behavior I describe here is an internal
 kernel behavior which is undocumented and implementation could change with any kernel behavior which is undocumented and implementation could change with any
-newer kernel version. Thus you should not rely on it! +newer kernel version. Thus, you should not rely on it!
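For readers who would like to check the traversal order on their own system, the following is a minimal sketch of how ''nftrace'' could be used for that. It is not necessarily the exact procedure I used. It needs root privileges and assumes that ''table1'', ''chain1'' and ''chain2'' exist as created in the example above.

```shell
#!/bin/sh
# Sketch only: needs root privileges and assumes table1, chain1 and
# chain2 were created as in the example above.

# Enable tracing in both chains, so that packets get traced no matter
# which of the two chains they reach first.
nft add rule ip table1 chain1 meta nftrace set 1
nft add rule ip table1 chain2 meta nftrace set 1

# Print trace events for traced packets until interrupted with Ctrl-C.
# The order in which chain1 and chain2 show up for one and the same
# trace id is the order of traversal.
nft monitor trace
```

Since the result reflects undocumented kernel behavior, it should be treated as an observation about the kernel version at hand and nothing more.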
  
  
Line 279: Line 281:
  
 ==== Example: NAT edge router ==== ==== Example: NAT edge router ====
-{{:linux:edge-router1.png?nolink&100 |}} +The example in Figure {{ref>nftedgerouter}} demonstrates an edge router, doing some simple IPv4 packet filtering and //SNAT// (masquerading). I merely gave a minimalist example here. One could even remove the //output// //chain// again, because I did not add any rules to it. In reality, you will certainly add a more complex set of rules.
-If you e.g. like to do some simple IPv4 packet filtering and //snat// (masquerading) on an edge router, which is very common case, then this set of //tables// and //chains// would probably be sufficient:+
  
-^ table ^ base chains ^ +<figure nftedgerouter> 
-''filter'' | ''input'', ''forward'', ''output''+{{ :linux:edge-router1.png?nolink&100 |}}
-| ''nat''    | ''postrouting''+
- +
-You create these //tables// and //chains// in //address family// ''ip'':+
 <code bash> <code bash>
 nft create table ip nat nft create table ip nat
-nft create chain ip nat postrouting +nft create chain ip nat postrouting {type nat hook postrouting priority srcnat\;} 
- { type nat hook postrouting priority srcnat\; }+nft add rule ip nat postrouting oif eth1 masquerade
  
 nft create table ip filter nft create table ip filter
-nft create chain ip filter input +nft create chain ip filter input {type filter hook input priority filter\;} 
- { type filter hook input priority filter\; } +nft create chain ip filter forward {type filter hook forward priority filter\;} 
-nft create chain ip filter forward +nft create chain ip filter output {type filter hook output priority filter\;}
- { type filter hook forward priority filter\; } +
-nft create chain ip filter output +
- { type filter hook output priority filter\; } +
-</code> +
- +
-As a result, the //chains// registered with the IPv4 //Netfilter// hooks will look like this (click to enlarge): +
- +
-{{ :linux:nf-hooks-nftables-ex2.png?direct&700 }} +
- +
-Then you add some simple masquerading and packet filtering rules: +
-<code bash> +
-nft add rule ip nat postrouting oif eth1 masquerade +
 nft add rule ip filter forward iif eth1 oif eth0 ct state new,invalid drop nft add rule ip filter forward iif eth1 oif eth0 ct state new,invalid drop
 nft add rule ip filter input iif eth1 ip protocol != icmp ct state new,invalid drop nft add rule ip filter input iif eth1 ip protocol != icmp ct state new,invalid drop
 </code> </code>
- +{{ :linux:nf-hooks-nftables-ex2.png?direct&700 }} 
-(I merely gave a minimalist example here. One could even remove the //output// //chain// again, because I did not add any rules to it. In reality you for sure will add a more complex set of rules.) +<caption>Example of a minimalistic //Nftables// //ruleset// for an edge router doing //SNAT// and the 
 +//base chains// getting registered with the //Netfilter// IPv4 hooks resulting from that (+conntrack) (click to enlarge).</caption> 
 +</figure>
  
  
 +==== List hook functions (coming soon) ====
 +Nftables developers in July 2021 announced a new feature, which will
 +likely be included in the next version of Nftables to be released;
 +see [[http://git.netfilter.org/nftables/commit/?id=4694f7230195bfcff179ed418ddcdd5ff7d5a8e1|this recent git commit]]. This feature lets Nftables list all the hook functions which are currently
 +registered with a specified Netfilter hook together with their assigned
 +priorities. If you e.g. like to list all hook functions currently registered with the Netfilter
 +IPv4 Prerouting hook, the syntax to do that will probably be something like
 +''nft list hook ip prerouting''.
 ===== Context ===== ===== Context =====
 The described behavior and implementation has been observed on a The described behavior and implementation has been observed on a
Line 328: Line 323:
 [[:feedback|Feedback]] to this article is very welcome! [[:feedback|Feedback]] to this article is very welcome!
  
-{{tag>linux netfilter nftables iptables}}+ 
 +//published 2020-05-17//, //last modified 2022-08-07// 
  
blog/linux/nftables_packet_flow_netfilter_hooks_detail.1602859441.txt.gz · Last modified: 2020-10-16 by Andrej Stender