{{tag>linux kernel netfilter conntrack nftables iptables}}
====== Connection tracking (conntrack) - Part 1: Modules and Hooks ======
~~META:
date created = 2021-04-04
~~

With this article series I would like to take a closer look at the connection tracking subsystem of the Linux kernel,
which provides the basis for features like stateful packet filtering and NAT.
I refer to it as the "ct system" throughout the series.
It is not my intention to replace or repeat existing documentation. Great articles on the topic already exist, however most of them are a little dated; see [[#References]] below. I intend to provide an up-to-date view as of the time of writing, based on LTS kernel 5.10, and to complement existing documentation by taking a deep look under the hood and showing how things actually work. In this first article, I give an overview of the ct system's purpose and elaborate on how it relates to other kernel components like Netfilter and Nftables. I explain what happens when network packets traverse its Netfilter hook functions and how it serves as the basis for stateful packet filtering.
  
===== Articles of the series =====
  * [[connection_tracking_1_modules_and_hooks|Connection tracking (conntrack) - Part 1: Modules and Hooks]]
  * [[connection_tracking_2_core_implementation|Connection tracking (conntrack) - Part 2: Core Implementation]]
  * [[connection_tracking_3_state_and_examples|Connection tracking (conntrack) - Part 3: State and Examples]]
  
===== Overview =====
What is the purpose of connection tracking and what does it do? Once activated, connection tracking (the ct system inside the Linux kernel) examines IPv4 and/or IPv6 network packets and their payload, with the intention to determine which packets are associated with each other, e.g. in the scope of a connection-oriented protocol like TCP. The ct system performs this task as a transparent observer and does not take active part in the communication between endpoints. It is not relevant for the ct system whether the endpoints of a connection are local or remote. They could be located on remote hosts, in which case the ct system would observe them while running on a host which merely is routing or bridging the packets of a particular connection. Alternatively, one or even both of the endpoints could be local sockets on the very same host where the ct system is running. It makes no difference.
The ct system maintains an up-to-date (live) list of all tracked connections. Based on that it "categorizes" network packets while those are traversing the kernel network stack, by supplying each one with a reference (a pointer) to one of its tracked connection instances. As a result, other kernel components can access this connection association and make decisions based on that. The two most prominent candidates which make use of that are the NAT subsystem and the stateful packet filtering / stateful packet inspection (SPI) modules of Iptables and Nftables.
The ct system itself never alters/manipulates packets. It usually also never drops packets, however that can happen in certain rare cases. When inspecting packet content, its main focus is on OSI layers 3 and 4. It is able to track TCP, UDP, ICMP, ICMPv6, SCTP, DCCP and GRE connections. Obviously, the ct system's definition of a "connection" is not limited to connection-oriented protocols, as several of the protocols just mentioned are not connection-oriented. It e.g. considers and handles an ICMP echo-request plus echo-reply (ping) as a "connection". The ct system provides several helper/extension components, which extend its tracking abilities into the application layer and e.g. track protocols like FTP, TFTP, IRC, PPTP, SIP, … Those are the basis for further use cases like [[wp>Application_Layer_Gateway|Application Layer Gateways]].
  
nft add rule ip filter forward iif eth0 ct state established accept
</code>
<caption>Example, adding Nftables rules with //CONNTRACK EXPRESSIONS// to a chain named //forward//((Obviously you first would have to create that chain and a table ...
<code bash>
nft add table ip filter
  
===== Netfilter hooks =====
Like Iptables and Nftables, the ct system is built on top of the Netfilter framework. It implements hook functions to be able to observe network packets and registers those with the Netfilter hooks.
If you are not yet very familiar with Netfilter hooks, it is best to first take a look at my other article [[nftables_packet_flow_netfilter_hooks_detail|Nftables - Packet flow and Netfilter hooks in detail]] before proceeding here. From the bird's eye view, the //Netfilter Packet Flow// image shown in Figure {{ref>nfpackflowofficial}}, which has been created by the Netfilter developers and can thereby be considered official documentation, already gives a good hint on what is going on.
  
<figure nfpackflowofficial>
</figure>
  
The blocks named //conntrack// within that image represent the hook functions of the ct system. While this is probably a sufficient model to keep in mind when writing Iptables/Nftables rules for stateful packet filtering, the actual implementation is more complex. The very same Netfilter hooks, like //Prerouting//, //Input//, //Forward//, //Output// and //Postrouting//, all exist independently within each network namespace. For this reason, they represent the actual "on"/"off" switch to enable/disable the ct system individually within a network namespace: The ct system's hook functions are being registered/unregistered only with the hooks of that specific network namespace where the ct system shall be enabled/disabled. Thereby the ct system only "sees" the network packets of network namespaces which it shall see.
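
If you have never seen what such a registration looks like at the code level, the following minimal sketch shows the generic pattern any kernel component (the ct system included) uses to hang a hook function into a Netfilter hook of one specific network namespace. The names ''my_observe''/''my_ops''/''my_enable'' are made up for illustration; ''struct nf_hook_ops'', ''NF_INET_PRE_ROUTING'', ''NF_IP_PRI_CONNTRACK'' and ''nf_register_net_hooks()'' are the actual kernel symbols involved.
<code c>
/* Minimal sketch (NOT code taken from the ct system): registering a hook
 * function with a Netfilter hook of ONE network namespace. my_observe(),
 * my_ops and my_enable() are made-up names; the rest is real kernel API. */
#include <linux/kernel.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>

static unsigned int my_observe(void *priv, struct sk_buff *skb,
                               const struct nf_hook_state *state)
{
        /* look at skb here ... a pure observer simply lets the packet pass */
        return NF_ACCEPT;
}

static const struct nf_hook_ops my_ops[] = {
        {
                .hook     = my_observe,
                .pf       = NFPROTO_IPV4,
                .hooknum  = NF_INET_PRE_ROUTING,
                .priority = NF_IP_PRI_CONNTRACK,  /* -200, like the ct system */
        },
};

static int my_enable(struct net *net)
{
        /* registration happens per network namespace (struct net *) */
        return nf_register_net_hooks(net, my_ops, ARRAY_SIZE(my_ops));
}
</code>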
  
===== Module nf_conntrack =====
Let's get back to the example above and take a look at the kernel module of the ct system itself. When the first Nftables rule containing a //CONNTRACK EXPRESSION// is added to the ruleset of your current network namespace, the Nftables code (indirectly) triggers loading of kernel module ''nf_conntrack'' as described above, if not already loaded. After that, the Nftables code calls
''[[https://elixir.bootlin.com/linux/v5.10.19/source/net/netfilter/nf_conntrack_proto.c#L583|nf_ct_netns_get()]]''. This is a function which is exported (=provided) by the just loaded ''nf_conntrack'' module. When called, it registers the hook functions of the ct system with the Netfilter hooks of the current network namespace.
The Nftables rules shown in Figure {{ref>nftctex1}} specify //address family// ''ip''. Thus, in that case the ct system registers the four hook functions shown in Figure {{ref>nfcthooks1}} with the IPv4 Netfilter hooks. In case of //address family// ''ip6'', the ct system instead would register the same four hook functions with the Netfilter hooks of IPv6. In case of //address family// ''inet'', it would register its hook functions with both the IPv4 and the IPv6 Netfilter hooks.
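
For reference, the four registrations depicted in Figure {{ref>nfcthooks1}} look roughly like this in the kernel source (abridged from ''net/netfilter/nf_conntrack_proto.c'', kernel 5.10; follow the link in the figure caption for the exact code). The following two sections discuss what these hook functions actually do.
<code c>
/* Abridged from net/netfilter/nf_conntrack_proto.c (kernel 5.10): the four
 * IPv4 hook registrations of the ct system. Two "main" entries with priority
 * NF_IP_PRI_CONNTRACK (-200) and two "help+confirm" entries with priority
 * NF_IP_PRI_CONNTRACK_CONFIRM (INT_MAX). */
static const struct nf_hook_ops ipv4_conntrack_ops[] = {
        {
                .hook     = ipv4_conntrack_in,
                .pf       = NFPROTO_IPV4,
                .hooknum  = NF_INET_PRE_ROUTING,
                .priority = NF_IP_PRI_CONNTRACK,
        },
        {
                .hook     = ipv4_conntrack_local,
                .pf       = NFPROTO_IPV4,
                .hooknum  = NF_INET_LOCAL_OUT,
                .priority = NF_IP_PRI_CONNTRACK,
        },
        {
                .hook     = ipv4_confirm,
                .pf       = NFPROTO_IPV4,
                .hooknum  = NF_INET_LOCAL_IN,
                .priority = NF_IP_PRI_CONNTRACK_CONFIRM,
        },
        {
                .hook     = ipv4_confirm,
                .pf       = NFPROTO_IPV4,
                .hooknum  = NF_INET_POST_ROUTING,
                .priority = NF_IP_PRI_CONNTRACK_CONFIRM,
        },
};
</code>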
  
<figure nfcthooks1>
{{ :linux:nf-ct-hooks-ipv4.png?direct&700 |}}
<caption>
The four conntrack hook functions registered with Netfilter IPv4 hooks (click to enlarge). See ''[[https://elixir.bootlin.com/linux/v5.10.19/source/net/netfilter/nf_conntrack_proto.c#L206|net/netfilter/nf_conntrack_proto.c]]''((for their IPv6 equivalent see [[https://elixir.bootlin.com/linux/v5.10.19/source/net/netfilter/nf_conntrack_proto.c#L401|here]])). The [[http://www.netfilter.org/documentation/HOWTO/netfilter-hacking-HOWTO-3.html#ss3.2|Linux netfilter Hacking HOWTO]] also briefly mentions those functions in some ASCII art drawing.
</caption>
</figure>
  
While function ''[[https://elixir.bootlin.com/linux/v5.10.19/source/net/netfilter/nf_conntrack_proto.c#L583|nf_ct_netns_get()]]'' serves to register the hook functions of the ct system, function ''[[https://elixir.bootlin.com/linux/v5.10.19/source/net/netfilter/nf_conntrack_proto.c#L610|nf_ct_netns_put()]]'' serves to unregister them.
Both functions internally do reference counting. This means that in the current network namespace, maybe one, maybe several kernel components at some point require connection tracking and thereby call ''nf_ct_netns_get()''. However, this function only registers the ct hook functions the first time it is called; on successive calls it merely increments a reference counter. If a component at some point does not require connection tracking anymore, it calls ''nf_ct_netns_put()'', which decrements the reference counter. If the counter reaches zero, ''nf_ct_netns_put()'' unregisters the ct hook functions. In our example this e.g. would happen if you delete all the Nftables rules in your ruleset in the current namespace which contain //CONNTRACK EXPRESSIONS//. This mechanism ensures that the ct system is really only enabled where needed.
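
The following is a strongly simplified sketch which only illustrates this reference counting principle. It is not the actual implementation (the real counters live per address family inside each ''struct net'') and the helper names ''register_ct_hooks()''/''unregister_ct_hooks()'' are made up:
<code c>
/* Strongly simplified sketch of the idea behind nf_ct_netns_get()/put().
 * NOT the real kernel code: the real counters live per address family inside
 * struct net; register_ct_hooks()/unregister_ct_hooks() are placeholders. */
static int register_ct_hooks(struct net *net);     /* placeholder */
static void unregister_ct_hooks(struct net *net);  /* placeholder */

static int ct_users;    /* number of components that currently want conntrack */

static int ct_netns_get(struct net *net)
{
        int err = 0;

        if (ct_users == 0)                      /* first user in this netns ... */
                err = register_ct_hooks(net);   /* ... registers the hooks      */
        if (err == 0)
                ct_users++;
        return err;
}

static void ct_netns_put(struct net *net)
{
        if (--ct_users == 0)                    /* last user gone ...           */
                unregister_ct_hooks(net);       /* ... unregisters the hooks    */
}
</code>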
  
==== The main ct hook functions ====
The two hook functions which get registered with priority -200 in the //Prerouting// hook and in the //Output// hook in Figure {{ref>nfcthooks1}} are the very same //conntrack// hook functions shown in the official //Netfilter Packet Flow// image in Figure {{ref>nfpackflowofficial}}. Internally, both of them (nearly) do the same thing. Wrapped by some outer functions which do slightly different things, the major function called by both of them is ''[[https://elixir.bootlin.com/linux/v5.10.19/source/net/netfilter/nf_conntrack_core.c#L1793|nf_conntrack_in()]]''. Thus, the major difference between them is merely their placement... the one in the //Prerouting// hook handles packets received on the network, while the one in the //Output// hook handles outgoing packets generated on this host. These two can be considered the "main" hook functions of the ct system, because most of what the ct system does with traversing network packets happens inside them... analyzing packets, associating them with tracked connections, and supplying those packets with a reference (pointer) to tracked connection instances... I'll elaborate on that in more detail in the sections below.
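
To make that concrete, this is roughly what the wrapper registered in the IPv4 //Prerouting// hook looks like (paraphrased from ''net/netfilter/nf_conntrack_proto.c'', kernel 5.10); the //Output// wrapper does a little extra work first, but also ends up in ''nf_conntrack_in()''.
<code c>
/* Roughly the IPv4 Prerouting wrapper in kernel 5.10 (paraphrased from
 * net/netfilter/nf_conntrack_proto.c); the actual work is done by
 * nf_conntrack_in(), which is shared with the Output case. */
static unsigned int ipv4_conntrack_in(void *priv, struct sk_buff *skb,
                                      const struct nf_hook_state *state)
{
        return nf_conntrack_in(skb, state);
}
</code>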
  
  
==== The help+confirm hook functions ====
Another two hook functions get registered with MAX priority in the //Input//
hook and in the //Postrouting// hook in Figure {{ref>nfcthooks1}}. Priority
MAX means the highest possible (signed) integer value, i.e. ''INT_MAX''. A hook function with this
priority will be traversed as the very last one within the Netfilter hook and no
other hook function can get registered to be traversed after it. Those two hook functions
here are not shown in Figure {{ref>nfpackflowofficial}} and can be considered
some internal thing which is not worth mentioning from the bird's eye view.
What is essential about
both is their placement in the Netfilter hooks, which makes sure that ALL
network packets, no matter whether they are incoming, outgoing or forwarded packets, traverse one
of them as the very last thing after having traversed all other hook functions.
I refer to them as the //conntrack "help+confirm"// hook functions in this article
series, hinting that they have two independent purposes. One is to execute
"helper" code, which is an advanced feature which is only used in certain
specific use cases; I won't cover that topic in the scope of this first
article. The second is to "confirm" new tracked connections; see ''[[https://elixir.bootlin.com/linux/v5.10.19/source/net/netfilter/nf_conntrack_core.c#L1073|__nf_conntrack_confirm()]]''.
I'll elaborate on what that means in the sections below.
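
To give a rough idea what "confirming" means at the code level, here is a slightly simplified version of the inline helper which the //help+confirm// hook functions end up calling (based on ''nf_conntrack_confirm()'' in ''include/net/netfilter/nf_conntrack_core.h'', kernel 5.10; event delivery is omitted here):
<code c>
/* Slightly simplified version of nf_conntrack_confirm() from
 * include/net/netfilter/nf_conntrack_core.h (kernel 5.10): only a connection
 * which is not yet confirmed gets "confirmed" here, i.e. inserted into the
 * central ct table by __nf_conntrack_confirm(). */
static inline int nf_conntrack_confirm(struct sk_buff *skb)
{
        struct nf_conn *ct = (struct nf_conn *)skb_nfct(skb);

        if (ct && !nf_ct_is_confirmed(ct))
                return __nf_conntrack_confirm(skb);
        return NF_ACCEPT;
}
</code>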
  
<WRAP round info>
Only in recent kernel versions as of the time of writing (here kernel v5.10.19) do
both mentioned features, "helper" and "confirm", exist combined
within the same hook functions. Not too long ago both still existed in the form of
separate ct hook functions in the //Input// and the //Postrouting//
Netfilter hooks: The "helper" hook function with priority
300 and the "confirm" hook function with priority MAX.
See e.g. [[https://elixir.bootlin.com/linux/v4.19.98/source/net/netfilter/nf_conntrack_proto.c#L486|LTS kernel v4.19]]. They have been combined in this
[[https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=827318feb69cb07ed58bb9b9dd6c2eaa81a116ad|git
  
===== Modules nf_defrag_ipv4/6 =====
As shown in Figure {{ref>nft_ct_depends}}, module ''nf_conntrack'' depends on the modules ''nf_defrag_ipv4'' and ''nf_defrag_ipv6''. What is important to know here is that those take care of re-assembling (=defragmenting) IPv4 and IPv6 fragments respectively, if those occur. Usually, defragmentation is supposed to happen at the receiving communication endpoint and not along the way through the hops between both endpoints. However, in this case it is necessary. Connection tracking can only do its job if ALL packets of a connection can be identified and no packet can slip through the fingers of the ct system. The problem with fragments is that they do not all contain the necessary protocol header information required to identify and associate them with a tracked connection.
  
<figure nfdefraghooks1>
{{ :linux:nf-defrag-hooks-ipv4.png?direct&700 |}}
<caption>Defrag hook functions registered with Netfilter IPv4 hooks (click to enlarge).\\ See ''[[https://elixir.bootlin.com/linux/v5.10.19/source/net/ipv4/netfilter/nf_defrag_ipv4.c#L92|net/ipv4/netfilter/nf_defrag_ipv4.c]]''((for their IPv6 equivalent see [[https://elixir.bootlin.com/linux/v5.10.19/source/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c#L75|here]])).
</caption>
</figure>
  
Like the ct system itself, those defrag modules do not become globally active on module load. They export (=provide) functions ''[[https://elixir.bootlin.com/linux/v5.10.19/source/net/ipv4/netfilter/nf_defrag_ipv4.c#L130|nf_defrag_ipv4_enable()]]'' and ''[[https://elixir.bootlin.com/linux/v5.10.19/source/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c#L131|nf_defrag_ipv6_enable()]]'' respectively, which register their own hook function with the Netfilter hooks.
Figure {{ref>nfdefraghooks1}} shows this for module ''nf_defrag_ipv4'' and the IPv4 Netfilter hooks: Internally this module provides function ''[[https://elixir.bootlin.com/linux/v5.10.19/source/net/ipv4/netfilter/nf_defrag_ipv4.c#L61|ipv4_conntrack_defrag()]]'' to handle defragmentation of traversing network packets.
This function is registered as a hook function with the Netfilter //Prerouting// hook and also with the Netfilter //Output// hook. In both those places it is registered with priority -400, which ensures that packets traverse it BEFORE traversing the //conntrack// hook functions, which are registered with priority -200.
The ct system's function ''nf_ct_netns_get()'' mentioned in the section above does call ''nf_defrag_ipv4_enable()'' and/or its IPv6 counterpart respectively, before registering the ct system's hook functions. Thus, the //defrag// hook functions get registered together with the //conntrack// hook functions. However, no reference counting is implemented here, which means that once this hook function is registered, it stays registered (until someone explicitly removes/unloads the kernel module).
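
By the way, these priority values are not magic numbers; they come from ''enum nf_ip_hook_priorities'' in ''include/uapi/linux/netfilter_ipv4.h''. Here is an abridged excerpt (kernel 5.10), showing only the values relevant for this article:
<code c>
/* Abridged excerpt of enum nf_ip_hook_priorities,
 * include/uapi/linux/netfilter_ipv4.h (kernel 5.10). */
enum nf_ip_hook_priorities {
        NF_IP_PRI_CONNTRACK_DEFRAG  = -400,     /* defrag hook functions          */
        NF_IP_PRI_RAW               = -300,     /* iptables raw table             */
        NF_IP_PRI_CONNTRACK         = -200,     /* main conntrack hook functions  */
        NF_IP_PRI_FILTER            = 0,        /* iptables filter table          */
        NF_IP_PRI_CONNTRACK_HELPER  = 300,
        NF_IP_PRI_CONNTRACK_CONFIRM = INT_MAX,  /* help+confirm: traversed last   */
};
</code>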
  
===== Hooks Summary =====
Figure {{ref>nfhooks-complete1}} summarizes things by showing the mentioned
//conntrack// and
//defrag// hook functions together with the well known packet filtering chains
of Iptables. For completeness I also show the priority values here. This should
provide for a comfortable comparison to what you see in the official //Netfilter
Packet Flow// image in Figure {{ref>nfpackflowofficial}}.
<figure nfhooks-complete1>
{{ :linux:nf-ct-iptables-hooks-ipv4.png?direct&700 |}}
<caption>//Conntrack//+//Defrag// hook functions, //Iptables// chains registered with IPv4 Netfilter hooks\\ (click to enlarge)</caption>
</figure>
  
Nftables chains, in contrast, are user-defined and do not have fixed, well-known names, so there is nothing equivalent to show here. Thus, showing the old but well known Iptables chains still seemed
like the most pragmatic thing to do.
The important thing which Figure {{ref>nfhooks-complete1}} shows is that, regarding traversal of the //defrag// and //conntrack// hook functions, the same thing happens for all kinds of packets, no matter whether they are to be received by a local socket, are generated by a local socket and sent out on the network, or are merely forwarded (routed) packets:
They all first traverse one of the //defrag// hook functions, either the one in the //Prerouting//
or the //Output// hook. This ensures that these functions can defragment potential fragments
before the ct system is able to see them. After that, the packets traverse a potential
Iptables chain of the raw table (if existing / in use) and then one of the main //conntrack//
hook functions (the ones with priority -200). Other Iptables chains, like the ones with priority 0
which are commonly used for packet filtering, are traversed after that. Then, as the very
last thing, the packets traverse one of the //conntrack "help+confirm"// hook functions.
  
  
===== How it works... =====
I know... so far I kept beating around the bush. Now let's finally talk about how the ct system actually operates and what it does to network packets traversing its hook functions. Please be aware that what I describe in this section covers the basics and not everything the ct system actually does. The ct system maintains the connections which it is tracking in a central table. Each tracked connection is represented by an instance of ''[[https://elixir.bootlin.com/linux/v5.10.19/source/include/net/netfilter/nf_conntrack.h#L58|struct nf_conn]]''. That structure contains all necessary details the ct system learns about the connection over time while tracking it. From the ct system's point of view, every network packet which traverses one of its main hook functions (those with priority -200) is one of four possible things:
  - It is either part of or related to one of its tracked connections.
  - It is the first seen packet of a connection which is not yet tracked.
  - It is an invalid packet, which is broken or doesn't fit in somehow.
  - It is marked as NOTRACK, which tells the ct system to ignore it.
<figure nfct-lookup>
{{ :linux:nf-ct-nfct-lookup.png?nolink&600 |}}
<caption>
Network packet traversing ct main hook function (priority -200) in //Prerouting// hook, lookup in central
ct table finds that packet belongs to already tracked connection,
packet is given pointer to that connection.
</caption>
</figure>
  
Figure {{ref>nfct-lookup}} shows an example of the first possibility, an incoming packet
being part of an already tracked connection. When that packet traverses the main //conntrack// hook function (the one with priority -200), the ct system first performs some initial validity checks on it. If the packet passes those, the ct system then does a lookup into its central table to find the potentially matching connection. In this case, a match is found and the packet is provided with a pointer to the matching tracked connection instance. For this purpose, the ''[[https://elixir.bootlin.com/linux/v5.10.19/source/include/linux/skbuff.h#L713|skb]]''((Network packets are represented as instances of ''struct sk_buff'' within the Linux kernel network stack. This struct is often referred to as "socket buffer" or "skb".)) of each packet possesses member variable ''[[https://elixir.bootlin.com/linux/v5.10.19/source/include/linux/skbuff.h#L759|_nfct]]''((I intentionally omit a tiny detail here: The data type of ''_nfct'' actually is not ''struct nf_conn *'', but instead is ''unsigned long''. Actually the 3 least significant bits of that integer are used in a special way (used for ''ctinfo'') and are not used as a pointer. The remaining bits are used as a pointer to ''struct nf_conn''. This is just a messy implementation detail, which you can ignore for now. I'll get back to it in a later article.)). This means, the network packet thereby is kind-of being "marked" or "categorized" by the ct system.
Further, the OSI layer 4 protocol of the packet is now analyzed and the latest protocol state and details are saved to its tracked connection instance. Then the packet continues on its way through other hook functions and the network stack. Other kernel components, like Nftables with //CONNTRACK EXPRESSION// rules, can now obtain connection information about the packet without a further ct table lookup, by simply dereferencing the ''%%skb->_nfct%%'' pointer. This is shown in Figure {{ref>nfct-lookup}} in form of an example Nftables chain with priority 0 in the //Prerouting// hook. If you would place a rule with expression ''ct state established'' in that chain, that rule would match. The very last thing that packet traverses before being received by a local socket is the //conntrack "help+confirm"// hook function in the //Input// hook. Nothing happens to the packet here. That hook function is targeted at other cases.
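
To illustrate what "simply dereferencing the ''%%skb->_nfct%%'' pointer" means in practice, here is a small sketch. The helpers ''nf_ct_get()'' and ''enum ip_conntrack_info'' are the real kernel API for this; the function ''my_pkt_is_established()'' itself is made up for illustration and is not how the Nftables ''ct'' expression is actually implemented:
<code c>
/* Sketch: how some kernel component could read the association which the main
 * conntrack hook function left behind in skb->_nfct. nf_ct_get() and
 * enum ip_conntrack_info are real kernel helpers; my_pkt_is_established() is
 * made up and NOT how nft_ct really evaluates "ct state established". */
#include <net/netfilter/nf_conntrack.h>

static bool my_pkt_is_established(const struct sk_buff *skb)
{
        enum ip_conntrack_info ctinfo;
        struct nf_conn *ct = nf_ct_get(skb, &ctinfo);   /* reads skb->_nfct */

        if (!ct)        /* no association: e.g. untracked or invalid packet */
                return false;

        return ctinfo == IP_CT_ESTABLISHED ||
               ctinfo == IP_CT_ESTABLISHED_REPLY;
}
</code>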
  
<figure nfct-new>
{{ :linux:nf-ct-nfct-new.png?nolink&600 |}}
<caption>
Packet traversing ct main hook function (priority -200), lookup in central ct table
finds no match, packet is considered first one of new connection, new connection
is created and packet is given pointer to it, new connection is later "confirmed"
and added to ct table in "help+confirm" hook function (priority MAX).
</caption>
</figure>
Figure {{ref>nfct-new}} shows an example of the second possibility, an incoming packet
being the first one representing a new connection which is not yet tracked by the ct system.
When that packet traverses the main //conntrack// hook function (the one with priority -200), let's assume that
it passes the already mentioned validity checks. However, in this case the lookup in the ct table (1) does not find a matching connection. As a result, the ct system considers the packet to be the first one of a new connection((To be precise: The first one the ct system has //seen// from that connection. That does not necessarily mean that this always must be the actual very first packet of a new connection, because there might be cases where the ct system for whatever reason did not see the first few packets of an actual connection and kind-of starts tracking in the middle of an already existing connection.)). A new instance of ''struct nf_conn'' is created (2) and member ''%%skb->_nfct%%'' of the packet is initialized to point to that instance. The ct system considers the new connection as "unconfirmed" at this point. Thus, the new connection instance is not yet added to the central table. It is temporarily parked on the so-called //unconfirmed list//.
Further, the OSI layer 4 protocol of the packet is now analyzed and protocol state and details are saved to its tracked connection instance. Then the packet continues on its way through other hook functions and the network stack.
Figure {{ref>nfct-new}} also shows an example Nftables chain with priority 0 in the //Prerouting// hook. If you would place a rule with expression ''ct state new'' in that chain, it would match.
The very last thing that packet traverses before being received by a local socket is the //conntrack "help+confirm"// hook function in the //Input// hook. It is the job of that function to "confirm" new connections, which means setting a status bit accordingly and moving the connection instance from the //unconfirmed list// to the actual ct table (3). The idea behind this behavior is that a packet like this might get dropped somewhere on the way between the main conntrack hook function and the //conntrack "help+confirm"// hook function, e.g. by an Nftables rule, by the routing system, or by whoever... The idea is to prevent "unconfirmed" new connections from cluttering up the central ct table or consuming an unnecessarily high amount of CPU power. A very common scenario would e.g. be that an Nftables rule like ''iif eth0 ct state new drop'' exists to prevent new connections coming in on interface ''eth0''. Naturally, connection attempts matching that rule should get dropped while consuming as little CPU power as possible and should not show up in the ct table at all. In that case, the very first packet of connections like that would get dropped within the Nftables chain and never reach the //conntrack "help+confirm"// hook function. Thus, the new connection instance would never get confirmed and would die an untimely death while still being on the //unconfirmed list//. In other words, it would get deleted together with the dropped network packet. This especially makes sense when you think about someone doing a port scan or a TCP SYN flooding attack.
But even if a client which is trying to establish e.g. a TCP connection by sending a TCP SYN packet is behaving normally, it would still send out several TCP SYN packets as retransmissions if it does not receive any reply from the peer side. Thus, if you have a ''ct state new drop'' rule in place, this mechanism ensures that the ct system intentionally does not remember this (denied!) connection and thereby treats all succeeding TCP SYN packets (retransmissions) again as new packets, which then will be dropped by the same ''ct state new drop'' rule.
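
The "status bit" mentioned above is a real bit named ''IPS_CONFIRMED_BIT'' in ''ct->status''. The helper which checks it is tiny; essentially as found in ''include/net/netfilter/nf_conntrack.h'' (kernel 5.10):
<code c>
/* Essentially as in include/net/netfilter/nf_conntrack.h (kernel 5.10): a
 * connection counts as "confirmed" once IPS_CONFIRMED_BIT is set in
 * ct->status, which happens when the help+confirm hook function inserts the
 * connection into the central ct table. */
static inline bool nf_ct_is_confirmed(const struct nf_conn *ct)
{
        return test_bit(IPS_CONFIRMED_BIT, &ct->status);
}
</code>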
  
The third possibility is that the ct system considers a packet //invalid//. This e.g. happens when a packet does not pass the mentioned initial validity checks of the main conntrack hook function in the //Prerouting// hook in Figure {{ref>nfct-lookup}} or {{ref>nfct-new}}, e.g. because of a broken or incomplete protocol header which cannot be parsed. It further can happen when a packet fails the detailed analysis of its OSI layer 4 protocol. E.g. in case of TCP the ct system observes receive window and sequence numbers, and a packet which does not match regarding its sequence numbers would be considered //invalid//.
However, it is not the job of the ct system to drop invalid packets((However, there are a few rare cases, like an overflow of the ct table, where it indeed drops packets.)). The ct system leaves that decision to other parts of the kernel network stack. If it considers a packet //invalid//, the ct system simply leaves ''%%skb->_nfct=NULL%%''. If you would place an Nftables rule with expression ''ct state invalid'' in the example chain in Figure {{ref>nfct-lookup}} or {{ref>nfct-new}}, then that rule would match.
  
The fourth possibility is a means for other kernel components like Nftables to mark packets with a "do not track" bit((Actually that bit is named ''IP_CT_UNTRACKED'' and it is placed in ''ctinfo''. More on that in a later article...)) which tells the ct system to ignore them. For this to work with Nftables, you would need to create a chain with a priority smaller than -200 (e.g. -300), which ensures it is traversed before the main ct hook function, and place a rule with a ''notrack'' statement in that chain; see the [[https://wiki.nftables.org/wiki-nftables/index.php/Setting_packet_connection_tracking_metainformation|Nftables wiki]]. If you then would place an Nftables rule with expression ''ct state untracked'' in the example chain in Figure {{ref>nfct-lookup}} or {{ref>nfct-new}}, that rule would match. This is kind-of a corner case topic and I won't go into further details within the scope of this article.
  
  
Once created and
initialized as described in the sections above, a new connection is first added to
the //unconfirmed list//. If the network packet which triggered its creation is
dropped before reaching the ct system's //help+confirm// hook function, then that
connection is removed from the list and deleted. If the packet however passes
the //help+confirm// hook function, then the connection is moved to the central ct table
and is marked as "confirmed". There it stays until this connection is considered to be "expired". This is handled via timeouts. In simple words... if no further network packet arrives for a tracked connection for a certain amount of time, then that connection will be considered "expired". The actual timeout value defining that amount
of time strongly depends on the network protocol, state and traffic behavior of that
connection. Once "expired", the connection is moved to the //dying list// and it is
finally deleted.
===== References =====
  * [[http://www.netfilter.org/documentation/HOWTO/netfilter-hacking-HOWTO.html|Linux netfilter Hacking HOWTO (Rusty Russell and Harald Welte, 2002)]]
  * [[http://people.netfilter.org/pablo/docs/login.pdf|Netfilter’s connection tracking system (Pablo Neira Ayuso, 2006)]]
  * [[https://www.frozentux.net/iptables-tutorial/iptables-tutorial.html#STATEMACHINE|Iptables tutorial 1.2.2: Chapter 7. The state machine (Oskar Andreasson, 2006)]]
  * [[https://wiki.aalto.fi/download/attachments/70789072/netfilter-paper-final.pdf|Netfilter Connection Tracking and NAT Implementation (Magnus Boye, 2012)]]
  * [[http://arthurchiao.art/blog/conntrack-design-and-implementation/|Connection Tracking: Design and Implementation Inside Linux Kernel (Arthur Chiao, 2020)]]
  
  
//published 2021-04-04//, //last modified 2023-08-15//
  