* [PATCH nf] netfilter: flowtable: infer TCP state and timeout before flow teardown
@ 2024-03-18  9:39 Pablo Neira Ayuso
  2024-03-18 10:05 ` Sven Auhagen
  2024-03-20  8:39 ` Sven Auhagen
  0 siblings, 2 replies; 27+ messages in thread
From: Pablo Neira Ayuso @ 2024-03-18  9:39 UTC (permalink / raw)
  To: netfilter-devel; +Cc: sven.auhagen, cratiu, ozsh, vladbu, gal, fw

In case either a FIN or RST packet is seen, infer the current TCP state
based on the TCP packet flags before setting the _TEARDOWN flag:

- FIN packets result in TCP_CONNTRACK_FIN_WAIT, which uses a default
  timeout of 2 minutes.
- RST packets lead to TCP_CONNTRACK_CLOSE, which uses a default timeout
  of 10 seconds.

Therefore, the TCP established state with a low timeout is no longer
used when handing over the flow to the classic conntrack path;
otherwise, a FIN packet coming in the reply direction could re-offload
this flow again.

If flow teardown is required for other reasons, e.g. no traffic seen
after NF_FLOW_TIMEOUT, then use the TCP established state but set the
timeout to that of the TCP_CONNTRACK_UNACK state, which conntrack uses
to pick up connections from the middle (default is 5 minutes).

Fixes: e5eaac2beb54 ("netfilter: flowtable: fix TCP flow teardown")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
Compile-tested only.
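
For reference, these are the per-state defaults involved (a sketch of
the relevant tcp_timeouts[] entries from
net/netfilter/nf_conntrack_proto_tcp.c, assuming no sysctl tuning):

	[TCP_CONNTRACK_ESTABLISHED]	= 5 DAYS,
	[TCP_CONNTRACK_FIN_WAIT]	= 2 MINS,
	[TCP_CONNTRACK_CLOSE]		= 10 SECS,
	[TCP_CONNTRACK_UNACK]		= 5 MINS,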

 include/net/netfilter/nf_flow_table.h |  1 +
 net/netfilter/nf_flow_table_core.c    | 45 +++++++++++++++++++++------
 net/netfilter/nf_flow_table_ip.c      |  2 +-
 3 files changed, 37 insertions(+), 11 deletions(-)

diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h
index a763dd327c6e..924f3720143f 100644
--- a/include/net/netfilter/nf_flow_table.h
+++ b/include/net/netfilter/nf_flow_table.h
@@ -293,6 +293,7 @@ int nf_flow_table_init(struct nf_flowtable *flow_table);
 void nf_flow_table_free(struct nf_flowtable *flow_table);
 
 void flow_offload_teardown(struct flow_offload *flow);
+void flow_offload_teardown_tcp(struct flow_offload *flow, bool fin);
 
 void nf_flow_snat_port(const struct flow_offload *flow,
 		       struct sk_buff *skb, unsigned int thoff,
diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
index a0571339239c..481fe3d96bbc 100644
--- a/net/netfilter/nf_flow_table_core.c
+++ b/net/netfilter/nf_flow_table_core.c
@@ -165,10 +165,22 @@ void flow_offload_route_init(struct flow_offload *flow,
 }
 EXPORT_SYMBOL_GPL(flow_offload_route_init);
 
-static void flow_offload_fixup_tcp(struct ip_ct_tcp *tcp)
+static s32 flow_offload_fixup_tcp(struct net *net, struct nf_conn *ct,
+				  enum tcp_conntrack tcp_state)
 {
-	tcp->seen[0].td_maxwin = 0;
-	tcp->seen[1].td_maxwin = 0;
+	struct nf_tcp_net *tn = nf_tcp_pernet(net);
+
+	ct->proto.tcp.state = tcp_state;
+	ct->proto.tcp.seen[0].td_maxwin = 0;
+	ct->proto.tcp.seen[1].td_maxwin = 0;
+
+	/* Similar to mid-connection pickup with loose=1.
+	 * Avoid large ESTABLISHED timeout.
+	 */
+	if (tcp_state == TCP_CONNTRACK_ESTABLISHED)
+		return tn->timeouts[TCP_CONNTRACK_UNACK];
+
+	return tn->timeouts[tcp_state];
 }
 
 static void flow_offload_fixup_ct(struct nf_conn *ct)
@@ -178,12 +190,8 @@ static void flow_offload_fixup_ct(struct nf_conn *ct)
 	s32 timeout;
 
 	if (l4num == IPPROTO_TCP) {
-		struct nf_tcp_net *tn = nf_tcp_pernet(net);
-
-		flow_offload_fixup_tcp(&ct->proto.tcp);
-
-		timeout = tn->timeouts[ct->proto.tcp.state];
-		timeout -= tn->offload_timeout;
+		timeout = flow_offload_fixup_tcp(net, ct,
+						 TCP_CONNTRACK_ESTABLISHED);
 	} else if (l4num == IPPROTO_UDP) {
 		struct nf_udp_net *tn = nf_udp_pernet(net);
 		enum udp_conntrack state =
@@ -346,12 +354,29 @@ static void flow_offload_del(struct nf_flowtable *flow_table,
 
 void flow_offload_teardown(struct flow_offload *flow)
 {
+	flow_offload_fixup_ct(flow->ct);
+	smp_mb__before_atomic();
 	clear_bit(IPS_OFFLOAD_BIT, &flow->ct->status);
 	set_bit(NF_FLOW_TEARDOWN, &flow->flags);
-	flow_offload_fixup_ct(flow->ct);
 }
 EXPORT_SYMBOL_GPL(flow_offload_teardown);
 
+void flow_offload_teardown_tcp(struct flow_offload *flow, bool fin)
+{
+	enum tcp_conntrack tcp_state;
+
+	if (fin)
+		tcp_state = TCP_CONNTRACK_FIN_WAIT;
+	else /* rst */
+		tcp_state = TCP_CONNTRACK_CLOSE;
+
+	flow_offload_fixup_tcp(nf_ct_net(flow->ct), flow->ct, tcp_state);
+	smp_mb__before_atomic();
+	clear_bit(IPS_OFFLOAD_BIT, &flow->ct->status);
+	set_bit(NF_FLOW_TEARDOWN, &flow->flags);
+}
+EXPORT_SYMBOL_GPL(flow_offload_teardown_tcp);
+
 struct flow_offload_tuple_rhash *
 flow_offload_lookup(struct nf_flowtable *flow_table,
 		    struct flow_offload_tuple *tuple)
diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c
index e45fade76409..13b6c453d8bc 100644
--- a/net/netfilter/nf_flow_table_ip.c
+++ b/net/netfilter/nf_flow_table_ip.c
@@ -29,7 +29,7 @@ static int nf_flow_state_check(struct flow_offload *flow, int proto,
 
 	tcph = (void *)(skb_network_header(skb) + thoff);
 	if (unlikely(tcph->fin || tcph->rst)) {
-		flow_offload_teardown(flow);
+		flow_offload_teardown_tcp(flow, tcph->fin);
 		return -1;
 	}
 
-- 
2.30.2


* Re: [PATCH nf] netfilter: flowtable: infer TCP state and timeout before flow teardown
  2024-03-18  9:39 [PATCH nf] netfilter: flowtable: infer TCP state and timeout before flow teardown Pablo Neira Ayuso
@ 2024-03-18 10:05 ` Sven Auhagen
  2024-03-20  8:39 ` Sven Auhagen
  1 sibling, 0 replies; 27+ messages in thread
From: Sven Auhagen @ 2024-03-18 10:05 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, cratiu, ozsh, vladbu, gal, fw

On Mon, Mar 18, 2024 at 10:39:15AM +0100, Pablo Neira Ayuso wrote:
> In case either a FIN or RST packet is seen, infer the current TCP state
> based on the TCP packet flags before setting the _TEARDOWN flag:
> 
> - FIN packets result in TCP_CONNTRACK_FIN_WAIT, which uses a default
>   timeout of 2 minutes.
> - RST packets lead to TCP_CONNTRACK_CLOSE, which uses a default timeout
>   of 10 seconds.
> 
> Therefore, the TCP established state with a low timeout is no longer
> used when handing over the flow to the classic conntrack path;
> otherwise, a FIN packet coming in the reply direction could re-offload
> this flow again.
> 
> If flow teardown is required for other reasons, e.g. no traffic seen
> after NF_FLOW_TIMEOUT, then use the TCP established state but set the
> timeout to that of the TCP_CONNTRACK_UNACK state, which conntrack uses
> to pick up connections from the middle (default is 5 minutes).
> 
> Fixes: e5eaac2beb54 ("netfilter: flowtable: fix TCP flow teardown")
> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
> ---
> Compile-tested only.

Thanks Pablo, I will test both patches on a production system this week
and give you feedback.

Both patches look good to me, though, and should fix the timeout problem.

> 
>  include/net/netfilter/nf_flow_table.h |  1 +
>  net/netfilter/nf_flow_table_core.c    | 45 +++++++++++++++++++++------
>  net/netfilter/nf_flow_table_ip.c      |  2 +-
>  3 files changed, 37 insertions(+), 11 deletions(-)
> 
> diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h
> index a763dd327c6e..924f3720143f 100644
> --- a/include/net/netfilter/nf_flow_table.h
> +++ b/include/net/netfilter/nf_flow_table.h
> @@ -293,6 +293,7 @@ int nf_flow_table_init(struct nf_flowtable *flow_table);
>  void nf_flow_table_free(struct nf_flowtable *flow_table);
>  
>  void flow_offload_teardown(struct flow_offload *flow);
> +void flow_offload_teardown_tcp(struct flow_offload *flow, bool fin);
>  
>  void nf_flow_snat_port(const struct flow_offload *flow,
>  		       struct sk_buff *skb, unsigned int thoff,
> diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
> index a0571339239c..481fe3d96bbc 100644
> --- a/net/netfilter/nf_flow_table_core.c
> +++ b/net/netfilter/nf_flow_table_core.c
> @@ -165,10 +165,22 @@ void flow_offload_route_init(struct flow_offload *flow,
>  }
>  EXPORT_SYMBOL_GPL(flow_offload_route_init);
>  
> -static void flow_offload_fixup_tcp(struct ip_ct_tcp *tcp)
> +static s32 flow_offload_fixup_tcp(struct net *net, struct nf_conn *ct,
> +				  enum tcp_conntrack tcp_state)
>  {
> -	tcp->seen[0].td_maxwin = 0;
> -	tcp->seen[1].td_maxwin = 0;
> +	struct nf_tcp_net *tn = nf_tcp_pernet(net);
> +
> +	ct->proto.tcp.state = tcp_state;
> +	ct->proto.tcp.seen[0].td_maxwin = 0;
> +	ct->proto.tcp.seen[1].td_maxwin = 0;
> +
> +	/* Similar to mid-connection pickup with loose=1.
> +	 * Avoid large ESTABLISHED timeout.
> +	 */
> +	if (tcp_state == TCP_CONNTRACK_ESTABLISHED)
> +		return tn->timeouts[TCP_CONNTRACK_UNACK];
> +
> +	return tn->timeouts[tcp_state];
>  }
>  
>  static void flow_offload_fixup_ct(struct nf_conn *ct)
> @@ -178,12 +190,8 @@ static void flow_offload_fixup_ct(struct nf_conn *ct)
>  	s32 timeout;
>  
>  	if (l4num == IPPROTO_TCP) {
> -		struct nf_tcp_net *tn = nf_tcp_pernet(net);
> -
> -		flow_offload_fixup_tcp(&ct->proto.tcp);
> -
> -		timeout = tn->timeouts[ct->proto.tcp.state];
> -		timeout -= tn->offload_timeout;
> +		timeout = flow_offload_fixup_tcp(net, ct,
> +						 TCP_CONNTRACK_ESTABLISHED);
>  	} else if (l4num == IPPROTO_UDP) {
>  		struct nf_udp_net *tn = nf_udp_pernet(net);
>  		enum udp_conntrack state =
> @@ -346,12 +354,29 @@ static void flow_offload_del(struct nf_flowtable *flow_table,
>  
>  void flow_offload_teardown(struct flow_offload *flow)
>  {
> +	flow_offload_fixup_ct(flow->ct);
> +	smp_mb__before_atomic();
>  	clear_bit(IPS_OFFLOAD_BIT, &flow->ct->status);
>  	set_bit(NF_FLOW_TEARDOWN, &flow->flags);
> -	flow_offload_fixup_ct(flow->ct);
>  }
>  EXPORT_SYMBOL_GPL(flow_offload_teardown);
>  
> +void flow_offload_teardown_tcp(struct flow_offload *flow, bool fin)
> +{
> +	enum tcp_conntrack tcp_state;
> +
> +	if (fin)
> +		tcp_state = TCP_CONNTRACK_FIN_WAIT;
> +	else /* rst */
> +		tcp_state = TCP_CONNTRACK_CLOSE;
> +
> +	flow_offload_fixup_tcp(nf_ct_net(flow->ct), flow->ct, tcp_state);
> +	smp_mb__before_atomic();
> +	clear_bit(IPS_OFFLOAD_BIT, &flow->ct->status);
> +	set_bit(NF_FLOW_TEARDOWN, &flow->flags);
> +}
> +EXPORT_SYMBOL_GPL(flow_offload_teardown_tcp);
> +
>  struct flow_offload_tuple_rhash *
>  flow_offload_lookup(struct nf_flowtable *flow_table,
>  		    struct flow_offload_tuple *tuple)
> diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c
> index e45fade76409..13b6c453d8bc 100644
> --- a/net/netfilter/nf_flow_table_ip.c
> +++ b/net/netfilter/nf_flow_table_ip.c
> @@ -29,7 +29,7 @@ static int nf_flow_state_check(struct flow_offload *flow, int proto,
>  
>  	tcph = (void *)(skb_network_header(skb) + thoff);
>  	if (unlikely(tcph->fin || tcph->rst)) {
> -		flow_offload_teardown(flow);
> +		flow_offload_teardown_tcp(flow, tcph->fin);
>  		return -1;
>  	}
>  
> -- 
> 2.30.2
> 

* Re: [PATCH nf] netfilter: flowtable: infer TCP state and timeout before flow teardown
  2024-03-18  9:39 [PATCH nf] netfilter: flowtable: infer TCP state and timeout before flow teardown Pablo Neira Ayuso
  2024-03-18 10:05 ` Sven Auhagen
@ 2024-03-20  8:39 ` Sven Auhagen
  2024-03-20  8:45   ` Pablo Neira Ayuso
  1 sibling, 1 reply; 27+ messages in thread
From: Sven Auhagen @ 2024-03-20  8:39 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, cratiu, ozsh, vladbu, gal, fw

On Mon, Mar 18, 2024 at 10:39:15AM +0100, Pablo Neira Ayuso wrote:
> In case either a FIN or RST packet is seen, infer the current TCP state
> based on the TCP packet flags before setting the _TEARDOWN flag:
> 
> - FIN packets result in TCP_CONNTRACK_FIN_WAIT, which uses a default
>   timeout of 2 minutes.
> - RST packets lead to TCP_CONNTRACK_CLOSE, which uses a default timeout
>   of 10 seconds.
> 
> Therefore, the TCP established state with a low timeout is no longer
> used when handing over the flow to the classic conntrack path;
> otherwise, a FIN packet coming in the reply direction could re-offload
> this flow again.
> 
> If flow teardown is required for other reasons, e.g. no traffic seen
> after NF_FLOW_TIMEOUT, then use the TCP established state but set the
> timeout to that of the TCP_CONNTRACK_UNACK state, which conntrack uses
> to pick up connections from the middle (default is 5 minutes).
> 
> Fixes: e5eaac2beb54 ("netfilter: flowtable: fix TCP flow teardown")
> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
> ---
> Compile-tested only.
> 
>  include/net/netfilter/nf_flow_table.h |  1 +
>  net/netfilter/nf_flow_table_core.c    | 45 +++++++++++++++++++++------
>  net/netfilter/nf_flow_table_ip.c      |  2 +-
>  3 files changed, 37 insertions(+), 11 deletions(-)
> 
> diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h
> index a763dd327c6e..924f3720143f 100644
> --- a/include/net/netfilter/nf_flow_table.h
> +++ b/include/net/netfilter/nf_flow_table.h
> @@ -293,6 +293,7 @@ int nf_flow_table_init(struct nf_flowtable *flow_table);
>  void nf_flow_table_free(struct nf_flowtable *flow_table);
>  
>  void flow_offload_teardown(struct flow_offload *flow);
> +void flow_offload_teardown_tcp(struct flow_offload *flow, bool fin);
>  
>  void nf_flow_snat_port(const struct flow_offload *flow,
>  		       struct sk_buff *skb, unsigned int thoff,
> diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
> index a0571339239c..481fe3d96bbc 100644
> --- a/net/netfilter/nf_flow_table_core.c
> +++ b/net/netfilter/nf_flow_table_core.c
> @@ -165,10 +165,22 @@ void flow_offload_route_init(struct flow_offload *flow,
>  }
>  EXPORT_SYMBOL_GPL(flow_offload_route_init);
>  
> -static void flow_offload_fixup_tcp(struct ip_ct_tcp *tcp)
> +static s32 flow_offload_fixup_tcp(struct net *net, struct nf_conn *ct,
> +				  enum tcp_conntrack tcp_state)
>  {
> -	tcp->seen[0].td_maxwin = 0;
> -	tcp->seen[1].td_maxwin = 0;
> +	struct nf_tcp_net *tn = nf_tcp_pernet(net);
> +
> +	ct->proto.tcp.state = tcp_state;
> +	ct->proto.tcp.seen[0].td_maxwin = 0;
> +	ct->proto.tcp.seen[1].td_maxwin = 0;
> +
> +	/* Similar to mid-connection pickup with loose=1.
> +	 * Avoid large ESTABLISHED timeout.
> +	 */
> +	if (tcp_state == TCP_CONNTRACK_ESTABLISHED)
> +		return tn->timeouts[TCP_CONNTRACK_UNACK];

Hi Pablo,

I tested the patch but the part that sets the timeout to UNACK is not
very practical.
For example, my long-running SSH connections get killed off by the
firewall regularly now while being ESTABLISHED:

[NEW] tcp      6 120 SYN_SENT src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 [UNREPLIED] src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
[UPDATE] tcp      6 60 SYN_RECV src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
[UPDATE] tcp      6 86400 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 [OFFLOAD] mark=16777216

[DESTROY] tcp      6 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=54941 dport=22 packets=133 bytes=13033 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=54941 packets=95 bytes=15004 [ASSURED] mark=16777216 delta-time=1036

I would remove the if case here.
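
Untested, but what I have in mind is simply dropping that special case
from the helper, e.g. (sketch only):

	static s32 flow_offload_fixup_tcp(struct net *net, struct nf_conn *ct,
					  enum tcp_conntrack tcp_state)
	{
		struct nf_tcp_net *tn = nf_tcp_pernet(net);

		ct->proto.tcp.state = tcp_state;
		ct->proto.tcp.seen[0].td_maxwin = 0;
		ct->proto.tcp.seen[1].td_maxwin = 0;

		/* always return the per-state timeout, without the
		 * ESTABLISHED special case
		 */
		return tn->timeouts[tcp_state];
	}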

> +
> +	return tn->timeouts[tcp_state];
>  }
>  
>  static void flow_offload_fixup_ct(struct nf_conn *ct)
> @@ -178,12 +190,8 @@ static void flow_offload_fixup_ct(struct nf_conn *ct)
>  	s32 timeout;
>  
>  	if (l4num == IPPROTO_TCP) {
> -		struct nf_tcp_net *tn = nf_tcp_pernet(net);
> -
> -		flow_offload_fixup_tcp(&ct->proto.tcp);
> -
> -		timeout = tn->timeouts[ct->proto.tcp.state];
> -		timeout -= tn->offload_timeout;
> +		timeout = flow_offload_fixup_tcp(net, ct,
> +						 TCP_CONNTRACK_ESTABLISHED);
>  	} else if (l4num == IPPROTO_UDP) {
>  		struct nf_udp_net *tn = nf_udp_pernet(net);
>  		enum udp_conntrack state =
> @@ -346,12 +354,29 @@ static void flow_offload_del(struct nf_flowtable *flow_table,
>  
>  void flow_offload_teardown(struct flow_offload *flow)
>  {
> +	flow_offload_fixup_ct(flow->ct);
> +	smp_mb__before_atomic();
>  	clear_bit(IPS_OFFLOAD_BIT, &flow->ct->status);
>  	set_bit(NF_FLOW_TEARDOWN, &flow->flags);
> -	flow_offload_fixup_ct(flow->ct);
>  }
>  EXPORT_SYMBOL_GPL(flow_offload_teardown);
>  
> +void flow_offload_teardown_tcp(struct flow_offload *flow, bool fin)
> +{
> +	enum tcp_conntrack tcp_state;
> +
> +	if (fin)
> +		tcp_state = TCP_CONNTRACK_FIN_WAIT;
> +	else /* rst */
> +		tcp_state = TCP_CONNTRACK_CLOSE;
> +
> +	flow_offload_fixup_tcp(nf_ct_net(flow->ct), flow->ct, tcp_state);
> +	smp_mb__before_atomic();
> +	clear_bit(IPS_OFFLOAD_BIT, &flow->ct->status);
> +	set_bit(NF_FLOW_TEARDOWN, &flow->flags);
> +}
> +EXPORT_SYMBOL_GPL(flow_offload_teardown_tcp);
> +
>  struct flow_offload_tuple_rhash *
>  flow_offload_lookup(struct nf_flowtable *flow_table,
>  		    struct flow_offload_tuple *tuple)
> diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c
> index e45fade76409..13b6c453d8bc 100644
> --- a/net/netfilter/nf_flow_table_ip.c
> +++ b/net/netfilter/nf_flow_table_ip.c
> @@ -29,7 +29,7 @@ static int nf_flow_state_check(struct flow_offload *flow, int proto,
>  
>  	tcph = (void *)(skb_network_header(skb) + thoff);
>  	if (unlikely(tcph->fin || tcph->rst)) {
> -		flow_offload_teardown(flow);
> +		flow_offload_teardown_tcp(flow, tcph->fin);
>  		return -1;
>  	}
>  
> -- 
> 2.30.2
> 

* Re: [PATCH nf] netfilter: flowtable: infer TCP state and timeout before flow teardown
  2024-03-20  8:39 ` Sven Auhagen
@ 2024-03-20  8:45   ` Pablo Neira Ayuso
  2024-03-20  8:49     ` Sven Auhagen
  0 siblings, 1 reply; 27+ messages in thread
From: Pablo Neira Ayuso @ 2024-03-20  8:45 UTC (permalink / raw)
  To: Sven Auhagen; +Cc: netfilter-devel, cratiu, ozsh, vladbu, gal, fw

Hi Sven,

On Wed, Mar 20, 2024 at 09:39:16AM +0100, Sven Auhagen wrote:
> On Mon, Mar 18, 2024 at 10:39:15AM +0100, Pablo Neira Ayuso wrote:
[...]
> > diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
> > index a0571339239c..481fe3d96bbc 100644
> > --- a/net/netfilter/nf_flow_table_core.c
> > +++ b/net/netfilter/nf_flow_table_core.c
> > @@ -165,10 +165,22 @@ void flow_offload_route_init(struct flow_offload *flow,
> >  }
> >  EXPORT_SYMBOL_GPL(flow_offload_route_init);
> >  
> > -static void flow_offload_fixup_tcp(struct ip_ct_tcp *tcp)
> > +static s32 flow_offload_fixup_tcp(struct net *net, struct nf_conn *ct,
> > +				  enum tcp_conntrack tcp_state)
> >  {
> > -	tcp->seen[0].td_maxwin = 0;
> > -	tcp->seen[1].td_maxwin = 0;
> > +	struct nf_tcp_net *tn = nf_tcp_pernet(net);
> > +
> > +	ct->proto.tcp.state = tcp_state;
> > +	ct->proto.tcp.seen[0].td_maxwin = 0;
> > +	ct->proto.tcp.seen[1].td_maxwin = 0;
> > +
> > +	/* Similar to mid-connection pickup with loose=1.
> > +	 * Avoid large ESTABLISHED timeout.
> > +	 */
> > +	if (tcp_state == TCP_CONNTRACK_ESTABLISHED)
> > +		return tn->timeouts[TCP_CONNTRACK_UNACK];
> 
> Hi Pablo,
> 
> I tested the patch but the part that sets the timeout to UNACK is not
> very practical.
> For example, my long-running SSH connections get killed off by the
> firewall regularly now while being ESTABLISHED:
> 
> [NEW] tcp      6 120 SYN_SENT src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 [UNREPLIED] src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
> [UPDATE] tcp      6 60 SYN_RECV src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
> [UPDATE] tcp      6 86400 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 [OFFLOAD] mark=16777216
> 
> [DESTROY] tcp      6 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=54941 dport=22 packets=133 bytes=13033 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=54941 packets=95 bytes=15004 [ASSURED] mark=16777216 delta-time=1036
> 
> I would remove the if case here.

OK, I remove it and post a v2. Thanks!

> > +
> > +	return tn->timeouts[tcp_state];
> >  }
> >  
> >  static void flow_offload_fixup_ct(struct nf_conn *ct)

* Re: [PATCH nf] netfilter: flowtable: infer TCP state and timeout before flow teardown
  2024-03-20  8:45   ` Pablo Neira Ayuso
@ 2024-03-20  8:49     ` Sven Auhagen
  2024-03-20  9:07       ` Pablo Neira Ayuso
  0 siblings, 1 reply; 27+ messages in thread
From: Sven Auhagen @ 2024-03-20  8:49 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, cratiu, ozsh, vladbu, gal, fw

On Wed, Mar 20, 2024 at 09:45:16AM +0100, Pablo Neira Ayuso wrote:
> Hi Sven,
> 
> On Wed, Mar 20, 2024 at 09:39:16AM +0100, Sven Auhagen wrote:
> > On Mon, Mar 18, 2024 at 10:39:15AM +0100, Pablo Neira Ayuso wrote:
> [...]
> > > diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
> > > index a0571339239c..481fe3d96bbc 100644
> > > --- a/net/netfilter/nf_flow_table_core.c
> > > +++ b/net/netfilter/nf_flow_table_core.c
> > > @@ -165,10 +165,22 @@ void flow_offload_route_init(struct flow_offload *flow,
> > >  }
> > >  EXPORT_SYMBOL_GPL(flow_offload_route_init);
> > >  
> > > -static void flow_offload_fixup_tcp(struct ip_ct_tcp *tcp)
> > > +static s32 flow_offload_fixup_tcp(struct net *net, struct nf_conn *ct,
> > > +				  enum tcp_conntrack tcp_state)
> > >  {
> > > -	tcp->seen[0].td_maxwin = 0;
> > > -	tcp->seen[1].td_maxwin = 0;
> > > +	struct nf_tcp_net *tn = nf_tcp_pernet(net);
> > > +
> > > +	ct->proto.tcp.state = tcp_state;
> > > +	ct->proto.tcp.seen[0].td_maxwin = 0;
> > > +	ct->proto.tcp.seen[1].td_maxwin = 0;
> > > +
> > > +	/* Similar to mid-connection pickup with loose=1.
> > > +	 * Avoid large ESTABLISHED timeout.
> > > +	 */
> > > +	if (tcp_state == TCP_CONNTRACK_ESTABLISHED)
> > > +		return tn->timeouts[TCP_CONNTRACK_UNACK];
> > 
> > Hi Pablo,
> > 
> > I tested the patch but the part that sets the timeout to UNACK is not
> > very practical.
> > For example, my long-running SSH connections get killed off by the
> > firewall regularly now while being ESTABLISHED:
> > 
> > [NEW] tcp      6 120 SYN_SENT src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 [UNREPLIED] src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
> > [UPDATE] tcp      6 60 SYN_RECV src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
> > [UPDATE] tcp      6 86400 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 [OFFLOAD] mark=16777216
> > 
> > [DESTROY] tcp      6 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=54941 dport=22 packets=133 bytes=13033 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=54941 packets=95 bytes=15004 [ASSURED] mark=16777216 delta-time=1036
> > 
> > I would remove the if case here.
> 
> OK, I remove it and post a v2. Thanks!

Thanks and also the hardcoded TCP_CONNTRACK_ESTABLISHED in flow_offload_fixup_ct
should be reverted to the real tcp state ct->proto.tcp.state.
This way we always set the current TCP timeout.
I am doing more tests with that now.
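
I.e., something like this in flow_offload_fixup_ct (untested sketch on
top of your patch):

	if (l4num == IPPROTO_TCP) {
		/* pass the state conntrack last saw instead of
		 * hardcoding TCP_CONNTRACK_ESTABLISHED
		 */
		timeout = flow_offload_fixup_tcp(net, ct,
						 ct->proto.tcp.state);
	}

(the UDP branch stays as it is)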

> 
> > > +
> > > +	return tn->timeouts[tcp_state];
> > >  }
> > >  
> > >  static void flow_offload_fixup_ct(struct nf_conn *ct)

* Re: [PATCH nf] netfilter: flowtable: infer TCP state and timeout before flow teardown
  2024-03-20  8:49     ` Sven Auhagen
@ 2024-03-20  9:07       ` Pablo Neira Ayuso
  2024-03-20  9:20         ` Sven Auhagen
  0 siblings, 1 reply; 27+ messages in thread
From: Pablo Neira Ayuso @ 2024-03-20  9:07 UTC (permalink / raw)
  To: Sven Auhagen; +Cc: netfilter-devel, cratiu, ozsh, vladbu, gal, fw

On Wed, Mar 20, 2024 at 09:49:49AM +0100, Sven Auhagen wrote:
> On Wed, Mar 20, 2024 at 09:45:16AM +0100, Pablo Neira Ayuso wrote:
> > Hi Sven,
> > 
> > On Wed, Mar 20, 2024 at 09:39:16AM +0100, Sven Auhagen wrote:
> > > On Mon, Mar 18, 2024 at 10:39:15AM +0100, Pablo Neira Ayuso wrote:
> > [...]
> > > > diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
> > > > index a0571339239c..481fe3d96bbc 100644
> > > > --- a/net/netfilter/nf_flow_table_core.c
> > > > +++ b/net/netfilter/nf_flow_table_core.c
> > > > @@ -165,10 +165,22 @@ void flow_offload_route_init(struct flow_offload *flow,
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(flow_offload_route_init);
> > > >  
> > > > -static void flow_offload_fixup_tcp(struct ip_ct_tcp *tcp)
> > > > +static s32 flow_offload_fixup_tcp(struct net *net, struct nf_conn *ct,
> > > > +				  enum tcp_conntrack tcp_state)
> > > >  {
> > > > -	tcp->seen[0].td_maxwin = 0;
> > > > -	tcp->seen[1].td_maxwin = 0;
> > > > +	struct nf_tcp_net *tn = nf_tcp_pernet(net);
> > > > +
> > > > +	ct->proto.tcp.state = tcp_state;
> > > > +	ct->proto.tcp.seen[0].td_maxwin = 0;
> > > > +	ct->proto.tcp.seen[1].td_maxwin = 0;
> > > > +
> > > > +	/* Similar to mid-connection pickup with loose=1.
> > > > +	 * Avoid large ESTABLISHED timeout.
> > > > +	 */
> > > > +	if (tcp_state == TCP_CONNTRACK_ESTABLISHED)
> > > > +		return tn->timeouts[TCP_CONNTRACK_UNACK];
> > > 
> > > Hi Pablo,
> > > 
> > > I tested the patch but the part that sets the timeout to UNACK is not
> > > very practical.
> > > For example, my long-running SSH connections get killed off by the
> > > firewall regularly now while being ESTABLISHED:
> > > 
> > > [NEW] tcp      6 120 SYN_SENT src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 [UNREPLIED] src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
> > > [UPDATE] tcp      6 60 SYN_RECV src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
> > > [UPDATE] tcp      6 86400 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 [OFFLOAD] mark=16777216
> > > 
> > > [DESTROY] tcp      6 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=54941 dport=22 packets=133 bytes=13033 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=54941 packets=95 bytes=15004 [ASSURED] mark=16777216 delta-time=1036
> > > 
> > > I would remove the if case here.
> > 
> > OK, I remove it and post a v2. Thanks!
> 
> Thanks and also the hardcoded TCP_CONNTRACK_ESTABLISHED in flow_offload_fixup_ct
> should be reverted to the real tcp state ct->proto.tcp.state.

ct->proto.tcp.state contains the state that was observed before
handing over this flow to the flowtable; in most cases, this should
be TCP_CONNTRACK_ESTABLISHED.

> This way we always set the current TCP timeout.

I can keep it to ct->proto.tcp.state but I wonder if it is better to
use a well known state such as TCP_CONNTRACK_ESTABLISHED to pick up from.

> I am doing more tests with that now.

Thanks!

* Re: [PATCH nf] netfilter: flowtable: infer TCP state and timeout before flow teardown
  2024-03-20  9:07       ` Pablo Neira Ayuso
@ 2024-03-20  9:20         ` Sven Auhagen
  2024-03-20  9:27           ` Pablo Neira Ayuso
  0 siblings, 1 reply; 27+ messages in thread
From: Sven Auhagen @ 2024-03-20  9:20 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, cratiu, ozsh, vladbu, gal, fw

On Wed, Mar 20, 2024 at 10:07:24AM +0100, Pablo Neira Ayuso wrote:
> On Wed, Mar 20, 2024 at 09:49:49AM +0100, Sven Auhagen wrote:
> > On Wed, Mar 20, 2024 at 09:45:16AM +0100, Pablo Neira Ayuso wrote:
> > > Hi Sven,
> > > 
> > > On Wed, Mar 20, 2024 at 09:39:16AM +0100, Sven Auhagen wrote:
> > > > On Mon, Mar 18, 2024 at 10:39:15AM +0100, Pablo Neira Ayuso wrote:
> > > [...]
> > > > > diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
> > > > > index a0571339239c..481fe3d96bbc 100644
> > > > > --- a/net/netfilter/nf_flow_table_core.c
> > > > > +++ b/net/netfilter/nf_flow_table_core.c
> > > > > @@ -165,10 +165,22 @@ void flow_offload_route_init(struct flow_offload *flow,
> > > > >  }
> > > > >  EXPORT_SYMBOL_GPL(flow_offload_route_init);
> > > > >  
> > > > > -static void flow_offload_fixup_tcp(struct ip_ct_tcp *tcp)
> > > > > +static s32 flow_offload_fixup_tcp(struct net *net, struct nf_conn *ct,
> > > > > +				  enum tcp_conntrack tcp_state)
> > > > >  {
> > > > > -	tcp->seen[0].td_maxwin = 0;
> > > > > -	tcp->seen[1].td_maxwin = 0;
> > > > > +	struct nf_tcp_net *tn = nf_tcp_pernet(net);
> > > > > +
> > > > > +	ct->proto.tcp.state = tcp_state;
> > > > > +	ct->proto.tcp.seen[0].td_maxwin = 0;
> > > > > +	ct->proto.tcp.seen[1].td_maxwin = 0;
> > > > > +
> > > > > +	/* Similar to mid-connection pickup with loose=1.
> > > > > +	 * Avoid large ESTABLISHED timeout.
> > > > > +	 */
> > > > > +	if (tcp_state == TCP_CONNTRACK_ESTABLISHED)
> > > > > +		return tn->timeouts[TCP_CONNTRACK_UNACK];
> > > > 
> > > > Hi Pablo,
> > > > 
> > > > I tested the patch but the part that sets the timeout to UNACK is not
> > > > very practical.
> > > > For example, my long-running SSH connections get killed off by the
> > > > firewall regularly now while being ESTABLISHED:
> > > > 
> > > > [NEW] tcp      6 120 SYN_SENT src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 [UNREPLIED] src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
> > > > [UPDATE] tcp      6 60 SYN_RECV src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
> > > > [UPDATE] tcp      6 86400 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 [OFFLOAD] mark=16777216
> > > > 
> > > > [DESTROY] tcp      6 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=54941 dport=22 packets=133 bytes=13033 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=54941 packets=95 bytes=15004 [ASSURED] mark=16777216 delta-time=1036
> > > > 
> > > > I would remove the if case here.
> > > 
> > > OK, I remove it and post a v2. Thanks!
> > 
> > Thanks and also the hardcoded TCP_CONNTRACK_ESTABLISHED in flow_offload_fixup_ct
> > should be reverted to the real tcp state ct->proto.tcp.state.
> 
> ct->proto.tcp.state contains the state that was observed before
> handing over this flow to the flowtable; in most cases, this should
> be TCP_CONNTRACK_ESTABLISHED.
> 
> > This way we always set the current TCP timeout.
> 
> I can keep it to ct->proto.tcp.state but I wonder if it is better to
> use a well known state such as TCP_CONNTRACK_ESTABLISHED to pick up from.

In case of a race condition, or if something is off like my TCP FIN
that is being offloaded again, hard-coding the state to ESTABLISHED
will turn e.g. a FIN or CLOSE into a very long-lived state.
It is not guaranteed that we are still in ESTABLISHED when this code
runs; for example, we could have seen both FINs already before the
flowtable gc runs.
> 
> > I am doing more tests with that now.
> 
> Thanks!

* Re: [PATCH nf] netfilter: flowtable: infer TCP state and timeout before flow teardown
  2024-03-20  9:20         ` Sven Auhagen
@ 2024-03-20  9:27           ` Pablo Neira Ayuso
  2024-03-20  9:31             ` Sven Auhagen
  0 siblings, 1 reply; 27+ messages in thread
From: Pablo Neira Ayuso @ 2024-03-20  9:27 UTC (permalink / raw)
  To: Sven Auhagen; +Cc: netfilter-devel, cratiu, ozsh, vladbu, gal, fw

On Wed, Mar 20, 2024 at 10:20:29AM +0100, Sven Auhagen wrote:
> On Wed, Mar 20, 2024 at 10:07:24AM +0100, Pablo Neira Ayuso wrote:
> > On Wed, Mar 20, 2024 at 09:49:49AM +0100, Sven Auhagen wrote:
> > > On Wed, Mar 20, 2024 at 09:45:16AM +0100, Pablo Neira Ayuso wrote:
> > > > Hi Sven,
> > > > 
> > > > On Wed, Mar 20, 2024 at 09:39:16AM +0100, Sven Auhagen wrote:
> > > > > On Mon, Mar 18, 2024 at 10:39:15AM +0100, Pablo Neira Ayuso wrote:
> > > > [...]
> > > > > > diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
> > > > > > index a0571339239c..481fe3d96bbc 100644
> > > > > > --- a/net/netfilter/nf_flow_table_core.c
> > > > > > +++ b/net/netfilter/nf_flow_table_core.c
> > > > > > @@ -165,10 +165,22 @@ void flow_offload_route_init(struct flow_offload *flow,
> > > > > >  }
> > > > > >  EXPORT_SYMBOL_GPL(flow_offload_route_init);
> > > > > >  
> > > > > > -static void flow_offload_fixup_tcp(struct ip_ct_tcp *tcp)
> > > > > > +static s32 flow_offload_fixup_tcp(struct net *net, struct nf_conn *ct,
> > > > > > +				  enum tcp_conntrack tcp_state)
> > > > > >  {
> > > > > > -	tcp->seen[0].td_maxwin = 0;
> > > > > > -	tcp->seen[1].td_maxwin = 0;
> > > > > > +	struct nf_tcp_net *tn = nf_tcp_pernet(net);
> > > > > > +
> > > > > > +	ct->proto.tcp.state = tcp_state;
> > > > > > +	ct->proto.tcp.seen[0].td_maxwin = 0;
> > > > > > +	ct->proto.tcp.seen[1].td_maxwin = 0;
> > > > > > +
> > > > > > +	/* Similar to mid-connection pickup with loose=1.
> > > > > > +	 * Avoid large ESTABLISHED timeout.
> > > > > > +	 */
> > > > > > +	if (tcp_state == TCP_CONNTRACK_ESTABLISHED)
> > > > > > +		return tn->timeouts[TCP_CONNTRACK_UNACK];
> > > > > 
> > > > > Hi Pablo,
> > > > > 
> > > > > I tested the patch but the part that sets the timeout to UNACK is not
> > > > > very practical.
> > > > > For example, my long-running SSH connections get killed off by the
> > > > > firewall regularly now while being ESTABLISHED:
> > > > > 
> > > > > [NEW] tcp      6 120 SYN_SENT src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 [UNREPLIED] src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
> > > > > [UPDATE] tcp      6 60 SYN_RECV src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
> > > > > [UPDATE] tcp      6 86400 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 [OFFLOAD] mark=16777216
> > > > > 
> > > > > [DESTROY] tcp      6 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=54941 dport=22 packets=133 bytes=13033 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=54941 packets=95 bytes=15004 [ASSURED] mark=16777216 delta-time=1036
> > > > > 
> > > > > I would remove the if case here.
> > > > 
> > > > OK, I remove it and post a v2. Thanks!
> > > 
> > > Thanks and also the hardcoded TCP_CONNTRACK_ESTABLISHED in flow_offload_fixup_ct
> > > should be reverted to the real tcp state ct->proto.tcp.state.
> > 
> > ct->proto.tcp.state contains the state that was observed before
> > handing over this flow to the flowtable; in most cases, this should
> > be TCP_CONNTRACK_ESTABLISHED.
> > 
> > > This way we always set the current TCP timeout.
> > 
> > I can keep it to ct->proto.tcp.state but I wonder if it is better to
> > use a well known state such as TCP_CONNTRACK_ESTABLISHED to pick up from.
> 
> In case of a race condition, or if something is off like my TCP FIN
> that is being offloaded again, hard-coding the state to ESTABLISHED
> will turn e.g. a FIN or CLOSE into a very long-lived state.
> It is not guaranteed that we are still in ESTABLISHED when this code
> runs; for example, we could have seen both FINs already before the
> flowtable gc runs.

OK, I just posted a v2, leave things as is. I agree it is better to
only address the issue you are observing at this time, it is possible
to revisit later.

Thanks!

* Re: [PATCH nf] netfilter: flowtable: infer TCP state and timeout before flow teardown
  2024-03-20  9:27           ` Pablo Neira Ayuso
@ 2024-03-20  9:31             ` Sven Auhagen
  2024-03-20  9:51               ` Pablo Neira Ayuso
  0 siblings, 1 reply; 27+ messages in thread
From: Sven Auhagen @ 2024-03-20  9:31 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, cratiu, ozsh, vladbu, gal, fw

On Wed, Mar 20, 2024 at 10:27:30AM +0100, Pablo Neira Ayuso wrote:
> On Wed, Mar 20, 2024 at 10:20:29AM +0100, Sven Auhagen wrote:
> > On Wed, Mar 20, 2024 at 10:07:24AM +0100, Pablo Neira Ayuso wrote:
> > > On Wed, Mar 20, 2024 at 09:49:49AM +0100, Sven Auhagen wrote:
> > > > On Wed, Mar 20, 2024 at 09:45:16AM +0100, Pablo Neira Ayuso wrote:
> > > > > Hi Sven,
> > > > > 
> > > > > On Wed, Mar 20, 2024 at 09:39:16AM +0100, Sven Auhagen wrote:
> > > > > > On Mon, Mar 18, 2024 at 10:39:15AM +0100, Pablo Neira Ayuso wrote:
> > > > > [...]
> > > > > > > diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
> > > > > > > index a0571339239c..481fe3d96bbc 100644
> > > > > > > --- a/net/netfilter/nf_flow_table_core.c
> > > > > > > +++ b/net/netfilter/nf_flow_table_core.c
> > > > > > > @@ -165,10 +165,22 @@ void flow_offload_route_init(struct flow_offload *flow,
> > > > > > >  }
> > > > > > >  EXPORT_SYMBOL_GPL(flow_offload_route_init);
> > > > > > >  
> > > > > > > -static void flow_offload_fixup_tcp(struct ip_ct_tcp *tcp)
> > > > > > > +static s32 flow_offload_fixup_tcp(struct net *net, struct nf_conn *ct,
> > > > > > > +				  enum tcp_conntrack tcp_state)
> > > > > > >  {
> > > > > > > -	tcp->seen[0].td_maxwin = 0;
> > > > > > > -	tcp->seen[1].td_maxwin = 0;
> > > > > > > +	struct nf_tcp_net *tn = nf_tcp_pernet(net);
> > > > > > > +
> > > > > > > +	ct->proto.tcp.state = tcp_state;
> > > > > > > +	ct->proto.tcp.seen[0].td_maxwin = 0;
> > > > > > > +	ct->proto.tcp.seen[1].td_maxwin = 0;
> > > > > > > +
> > > > > > > +	/* Similar to mid-connection pickup with loose=1.
> > > > > > > +	 * Avoid large ESTABLISHED timeout.
> > > > > > > +	 */
> > > > > > > +	if (tcp_state == TCP_CONNTRACK_ESTABLISHED)
> > > > > > > +		return tn->timeouts[TCP_CONNTRACK_UNACK];
> > > > > > 
> > > > > > Hi Pablo,
> > > > > > 
> > > > > > I tested the patch but the part that sets the timeout to UNACK is not
> > > > > > very practical.
> > > > > > For example, my long-running SSH connections get killed off by the
> > > > > > firewall regularly now while being ESTABLISHED:
> > > > > > 
> > > > > > [NEW] tcp      6 120 SYN_SENT src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 [UNREPLIED] src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
> > > > > > [UPDATE] tcp      6 60 SYN_RECV src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
> > > > > > [UPDATE] tcp      6 86400 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 [OFFLOAD] mark=16777216
> > > > > > 
> > > > > > [DESTROY] tcp      6 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=54941 dport=22 packets=133 bytes=13033 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=54941 packets=95 bytes=15004 [ASSURED] mark=16777216 delta-time=1036
> > > > > > 
> > > > > > I would remove the if case here.
> > > > > 
> > > > > OK, I remove it and post a v2. Thanks!
> > > > 
> > > > Thanks and also the hardcoded TCP_CONNTRACK_ESTABLISHED in flow_offload_fixup_ct
> > > > should be reverted to the real tcp state ct->proto.tcp.state.
> > > 
> > > ct->proto.tcp.state contains the state that was observed before
> > > handing over this flow to the flowtable; in most cases, this should
> > > be TCP_CONNTRACK_ESTABLISHED.
> > > 
> > > > This way we always set the current TCP timeout.
> > > 
> > > I can keep it to ct->proto.tcp.state but I wonder if it is better to
> > > use a well known state such as TCP_CONNTRACK_ESTABLISHED to pick up from.
> > 
> > In case of a race condition, or if something is off like my TCP FIN
> > that is being offloaded again, hard-coding the state to ESTABLISHED
> > will turn e.g. a FIN or CLOSE into a very long-lived state.
> > It is not guaranteed that we are still in ESTABLISHED when this code
> > runs; for example, we could have seen both FINs already before the
> > flowtable gc runs.
> 
> OK, I just posted a v2, leave things as is. I agree it is better to
> only address the issue you are observing at this time, it is possible
> to revisit later.
> 
> Thanks!

Thanks, I will give it another try.
I think for it to be foolproof we need to migrate the TCP state to FIN
or CLOSE in flow_offload_teardown_tcp as well.
The way things are right now, we are open to a race condition from the
reverse side of the connection that re-offloads the connection while a
FIN or RST has not yet been processed by the netfilter code running
after the flowtable code.
The connection is still in TCP ESTABLISHED during that window, and
another packet can just push it back to the flowtable while the FIN or
RST is not processed yet.
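
To spell out the window I am worried about (a hypothetical sequence,
assuming the reverse direction keeps sending):

1. The flowtable path sees the FIN and tears the flow down.
2. Conntrack has not processed that FIN yet, so the entry still shows
   TCP ESTABLISHED.
3. An ACK from the reverse direction hits the flow offload rule; since
   nf_conntrack_tcp_established(ct) is still true, the connection is
   offloaded again.
4. Only afterwards does the FIN reach the classic conntrack path.

Migrating the TCP state to FIN or CLOSE at teardown time closes that
window.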


* Re: [PATCH nf] netfilter: flowtable: infer TCP state and timeout before flow teardown
  2024-03-20  9:31             ` Sven Auhagen
@ 2024-03-20  9:51               ` Pablo Neira Ayuso
  2024-03-20 10:13                 ` Sven Auhagen
  2024-03-20 10:29                 ` Sven Auhagen
  0 siblings, 2 replies; 27+ messages in thread
From: Pablo Neira Ayuso @ 2024-03-20  9:51 UTC (permalink / raw)
  To: Sven Auhagen; +Cc: netfilter-devel, cratiu, ozsh, vladbu, gal, fw

On Wed, Mar 20, 2024 at 10:31:00AM +0100, Sven Auhagen wrote:
> On Wed, Mar 20, 2024 at 10:27:30AM +0100, Pablo Neira Ayuso wrote:
> > On Wed, Mar 20, 2024 at 10:20:29AM +0100, Sven Auhagen wrote:
> > > On Wed, Mar 20, 2024 at 10:07:24AM +0100, Pablo Neira Ayuso wrote:
> > > > On Wed, Mar 20, 2024 at 09:49:49AM +0100, Sven Auhagen wrote:
> > > > > On Wed, Mar 20, 2024 at 09:45:16AM +0100, Pablo Neira Ayuso wrote:
> > > > > > Hi Sven,
> > > > > > 
> > > > > > On Wed, Mar 20, 2024 at 09:39:16AM +0100, Sven Auhagen wrote:
> > > > > > > On Mon, Mar 18, 2024 at 10:39:15AM +0100, Pablo Neira Ayuso wrote:
> > > > > > [...]
> > > > > > > > diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
> > > > > > > > index a0571339239c..481fe3d96bbc 100644
> > > > > > > > --- a/net/netfilter/nf_flow_table_core.c
> > > > > > > > +++ b/net/netfilter/nf_flow_table_core.c
> > > > > > > > @@ -165,10 +165,22 @@ void flow_offload_route_init(struct flow_offload *flow,
> > > > > > > >  }
> > > > > > > >  EXPORT_SYMBOL_GPL(flow_offload_route_init);
> > > > > > > >  
> > > > > > > > -static void flow_offload_fixup_tcp(struct ip_ct_tcp *tcp)
> > > > > > > > +static s32 flow_offload_fixup_tcp(struct net *net, struct nf_conn *ct,
> > > > > > > > +				  enum tcp_conntrack tcp_state)
> > > > > > > >  {
> > > > > > > > -	tcp->seen[0].td_maxwin = 0;
> > > > > > > > -	tcp->seen[1].td_maxwin = 0;
> > > > > > > > +	struct nf_tcp_net *tn = nf_tcp_pernet(net);
> > > > > > > > +
> > > > > > > > +	ct->proto.tcp.state = tcp_state;
> > > > > > > > +	ct->proto.tcp.seen[0].td_maxwin = 0;
> > > > > > > > +	ct->proto.tcp.seen[1].td_maxwin = 0;
> > > > > > > > +
> > > > > > > > +	/* Similar to mid-connection pickup with loose=1.
> > > > > > > > +	 * Avoid large ESTABLISHED timeout.
> > > > > > > > +	 */
> > > > > > > > +	if (tcp_state == TCP_CONNTRACK_ESTABLISHED)
> > > > > > > > +		return tn->timeouts[TCP_CONNTRACK_UNACK];
> > > > > > > 
> > > > > > > Hi Pablo,
> > > > > > > 
> > > > > > > I tested the patch but the part that sets the timeout to UNACK is not
> > > > > > > very practical.
> > > > > > > For example, my long-running SSH connections get killed off by the
> > > > > > > firewall regularly now while being ESTABLISHED:
> > > > > > > 
> > > > > > > [NEW] tcp      6 120 SYN_SENT src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 [UNREPLIED] src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
> > > > > > > [UPDATE] tcp      6 60 SYN_RECV src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
> > > > > > > [UPDATE] tcp      6 86400 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 [OFFLOAD] mark=16777216
> > > > > > > 
> > > > > > > [DESTROY] tcp      6 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=54941 dport=22 packets=133 bytes=13033 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=54941 packets=95 bytes=15004 [ASSURED] mark=16777216 delta-time=1036
> > > > > > > 
> > > > > > > I would remove the if case here.
> > > > > > 
> > > > > > OK, I remove it and post a v2. Thanks!
> > > > > 
> > > > > Thanks and also the hardcoded TCP_CONNTRACK_ESTABLISHED in flow_offload_fixup_ct
> > > > > should be reverted to the real tcp state ct->proto.tcp.state.
> > > > 
> > > > ct->proto.tcp.state contains the state that was observed before
> > > > handing over this flow to the flowtable; in most cases, this should
> > > > be TCP_CONNTRACK_ESTABLISHED.
> > > > 
> > > > > This way we always set the current TCP timeout.
> > > > 
> > > > I can keep it to ct->proto.tcp.state but I wonder if it is better to
> > > > use a well known state such as TCP_CONNTRACK_ESTABLISHED to pick up from.
> > > 
> > > In case of a race condition, or if something is off like my TCP FIN
> > > that is being offloaded again, hard-coding the state to ESTABLISHED
> > > will turn e.g. a FIN or CLOSE into a very long-lived state.
> > > It is not guaranteed that we are still in ESTABLISHED when this code
> > > runs; for example, we could have seen both FINs already before the
> > > flowtable gc runs.
> > 
> > OK, I just posted a v2, leave things as is. I agree it is better to
> > only address the issue you are observing at this time, it is possible
> > to revisit later.
> > 
> > Thanks!
> 
> Thanks, I will give it another try.
> I think for it to be foolproof we need to migrate the TCP state to FIN
> or CLOSE in flow_offload_teardown_tcp as well.

My patch already does it:

+void flow_offload_teardown_tcp(struct flow_offload *flow, bool fin)
+{
+       enum tcp_conntrack tcp_state;
+
+       if (fin)
+               tcp_state = TCP_CONNTRACK_FIN_WAIT;
+       else /* rst */
+               tcp_state = TCP_CONNTRACK_CLOSE;
+
+       flow_offload_fixup_tcp(nf_ct_net(flow->ct), flow->ct, tcp_state);

flow_offload_fixup_tcp() updates the TCP state to FIN / CLOSE state.

> The way things are right now, we are open to a race condition from the
> reverse side of the connection that re-offloads the connection while a
> FIN or RST has not yet been processed by the netfilter code running
> after the flowtable code.
> The connection is still in TCP ESTABLISHED during that window, and
> another packet can just push it back to the flowtable while the FIN or
> RST is not processed yet.

I don't see how:

static void nft_flow_offload_eval(const struct nft_expr *expr,
                                  ...

        switch (ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.protonum) {
        case IPPROTO_TCP:
                tcph = skb_header_pointer(pkt->skb, nft_thoff(pkt),
                                          sizeof(_tcph), &_tcph);
                if (unlikely(!tcph || tcph->fin || tcph->rst ||
                             !nf_conntrack_tcp_established(ct)))
                        goto out;

this would now be in either FIN or CLOSE state.

A FIN or RST packet does not trigger re-offload, and an ACK packet
would be caught by the !nf_conntrack_tcp_established(ct) check.

What path could trigger re-offload after my latest patch?

For the flow timeout case (where the flow entry is released and packets
go back to the classic conntrack path) re-offload is possible, but that
is a feature.

* Re: [PATCH nf] netfilter: flowtable: infer TCP state and timeout before flow teardown
  2024-03-20  9:51               ` Pablo Neira Ayuso
@ 2024-03-20 10:13                 ` Sven Auhagen
  2024-03-20 10:36                   ` Pablo Neira Ayuso
  2024-03-20 10:29                 ` Sven Auhagen
  1 sibling, 1 reply; 27+ messages in thread
From: Sven Auhagen @ 2024-03-20 10:13 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, cratiu, ozsh, vladbu, gal, fw

On Wed, Mar 20, 2024 at 10:51:39AM +0100, Pablo Neira Ayuso wrote:
> On Wed, Mar 20, 2024 at 10:31:00AM +0100, Sven Auhagen wrote:
> > On Wed, Mar 20, 2024 at 10:27:30AM +0100, Pablo Neira Ayuso wrote:
> > > On Wed, Mar 20, 2024 at 10:20:29AM +0100, Sven Auhagen wrote:
> > > > On Wed, Mar 20, 2024 at 10:07:24AM +0100, Pablo Neira Ayuso wrote:
> > > > > On Wed, Mar 20, 2024 at 09:49:49AM +0100, Sven Auhagen wrote:
> > > > > > On Wed, Mar 20, 2024 at 09:45:16AM +0100, Pablo Neira Ayuso wrote:
> > > > > > > Hi Sven,
> > > > > > > 
> > > > > > > On Wed, Mar 20, 2024 at 09:39:16AM +0100, Sven Auhagen wrote:
> > > > > > > > On Mon, Mar 18, 2024 at 10:39:15AM +0100, Pablo Neira Ayuso wrote:
> > > > > > > [...]
> > > > > > > > > diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
> > > > > > > > > index a0571339239c..481fe3d96bbc 100644
> > > > > > > > > --- a/net/netfilter/nf_flow_table_core.c
> > > > > > > > > +++ b/net/netfilter/nf_flow_table_core.c
> > > > > > > > > @@ -165,10 +165,22 @@ void flow_offload_route_init(struct flow_offload *flow,
> > > > > > > > >  }
> > > > > > > > >  EXPORT_SYMBOL_GPL(flow_offload_route_init);
> > > > > > > > >  
> > > > > > > > > -static void flow_offload_fixup_tcp(struct ip_ct_tcp *tcp)
> > > > > > > > > +static s32 flow_offload_fixup_tcp(struct net *net, struct nf_conn *ct,
> > > > > > > > > +				  enum tcp_conntrack tcp_state)
> > > > > > > > >  {
> > > > > > > > > -	tcp->seen[0].td_maxwin = 0;
> > > > > > > > > -	tcp->seen[1].td_maxwin = 0;
> > > > > > > > > +	struct nf_tcp_net *tn = nf_tcp_pernet(net);
> > > > > > > > > +
> > > > > > > > > +	ct->proto.tcp.state = tcp_state;
> > > > > > > > > +	ct->proto.tcp.seen[0].td_maxwin = 0;
> > > > > > > > > +	ct->proto.tcp.seen[1].td_maxwin = 0;
> > > > > > > > > +
> > > > > > > > > +	/* Similar to mid-connection pickup with loose=1.
> > > > > > > > > +	 * Avoid large ESTABLISHED timeout.
> > > > > > > > > +	 */
> > > > > > > > > +	if (tcp_state == TCP_CONNTRACK_ESTABLISHED)
> > > > > > > > > +		return tn->timeouts[TCP_CONNTRACK_UNACK];
> > > > > > > > 
> > > > > > > > Hi Pablo,
> > > > > > > > 
> > > > > > > > I tested the patch but the part that sets the timeout to UNACK is not
> > > > > > > > very practical.
> > > > > > > > For example, my long-running SSH connections get killed off by the
> > > > > > > > firewall regularly now while being ESTABLISHED:
> > > > > > > > 
> > > > > > > > [NEW] tcp      6 120 SYN_SENT src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 [UNREPLIED] src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
> > > > > > > > [UPDATE] tcp      6 60 SYN_RECV src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
> > > > > > > > [UPDATE] tcp      6 86400 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 [OFFLOAD] mark=16777216
> > > > > > > > 
> > > > > > > > [DESTROY] tcp      6 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=54941 dport=22 packets=133 bytes=13033 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=54941 packets=95 bytes=15004 [ASSURED] mark=16777216 delta-time=1036
> > > > > > > > 
> > > > > > > > I would remove the if case here.
> > > > > > > 
> > > > > > > OK, I remove it and post a v2. Thanks!
> > > > > > 
> > > > > > Thanks and also the hardcoded TCP_CONNTRACK_ESTABLISHED in flow_offload_fixup_ct
> > > > > > should be reverted to the real tcp state ct->proto.tcp.state.
> > > > > 
> > > > > ct->proto.tcp.state contains the state that was observed before
> > > > > handing over this flow to the flowtable; in most cases, this should
> > > > > be TCP_CONNTRACK_ESTABLISHED.
> > > > > 
> > > > > > This way we always set the current TCP timeout.
> > > > > 
> > > > > I can keep it to ct->proto.tcp.state but I wonder if it is better to
> > > > > use a well known state such as TCP_CONNTRACK_ESTABLISHED to pick up from.
> > > > 
> > > > In case of a race condition, or if something is off like my TCP FIN
> > > > that is being offloaded again, hard-coding the state to ESTABLISHED
> > > > will turn e.g. a FIN or CLOSE into a very long-lived state.
> > > > It is not guaranteed that we are still in ESTABLISHED when this code
> > > > runs; for example, we could have seen both FINs already before the
> > > > flowtable gc runs.
> > > 
> > > OK, I just posted a v2, leave things as is. I agree it is better to
> > > only address the issue you are observing at this time, it is possible
> > > to revisit later.
> > > 
> > > Thanks!
> > 
> > Thanks, I will give it another try.
> > I think for it to be foolproof we need to migrate the TCP state to FIN
> > or CLOSE in flow_offload_teardown_tcp as well.
> 
> My patch already does it:
> 
> +void flow_offload_teardown_tcp(struct flow_offload *flow, bool fin)
> +{
> +       enum tcp_conntrack tcp_state;
> +
> +       if (fin)
> +               tcp_state = TCP_CONNTRACK_FIN_WAIT;
> +       else /* rst */
> +               tcp_state = TCP_CONNTRACK_CLOSE;
> +
> +       flow_offload_fixup_tcp(nf_ct_net(flow->ct), flow->ct, tcp_state);
> 
> flow_offload_fixup_tcp() updates the TCP state to FIN / CLOSE state.

Ah, you are correct.
Nevertheless, I can tell you that I still see this problem with the
patch applied:

[UPDATE] tcp      6 120 FIN_WAIT src=192.168.7.105 dst=17.253.57.219 sport=49574 dport=443 src=17.253.57.219 dst=87.138.198.79 sport=443 dport=5078 [OFFLOAD] mark=25165825
[UPDATE] tcp      6 30 LAST_ACK src=192.168.7.105 dst=17.253.57.219 sport=49574 dport=443 src=17.253.57.219 dst=87.138.198.79 sport=443 dport=5078 [ASSURED] mark=25165825
[UPDATE] tcp      6 10 CLOSE src=192.168.7.105 dst=17.253.57.219 sport=49574 dport=443 src=17.253.57.219 dst=87.138.198.79 sport=443 dport=5078 [ASSURED] mark=25165825
[DESTROY] tcp      6 CLOSE src=192.168.7.105 dst=17.253.57.219 sport=49574 dport=443 packets=15 bytes=2688 src=17.253.57.219 dst=87.138.198.79 sport=443 dport=5078 packets=18 bytes=7172 [ASSURED] mark=25165825 delta-time=126

> 
> > The way things are right now, we are open to a race condition from the
> > reverse side of the connection that re-offloads the connection while a
> > FIN or RST has not yet been processed by the netfilter code running
> > after the flowtable code.
> > The connection is still in TCP ESTABLISHED during that window, and
> > another packet can just push it back to the flowtable while the FIN or
> > RST is not processed yet.
> 
> I don't see how:
> 
> static void nft_flow_offload_eval(const struct nft_expr *expr,
>                                   ...
> 
>         switch (ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.protonum) {
>         case IPPROTO_TCP:
>                 tcph = skb_header_pointer(pkt->skb, nft_thoff(pkt),
>                                           sizeof(_tcph), &_tcph);
>                 if (unlikely(!tcph || tcph->fin || tcph->rst ||
>                              !nf_conntrack_tcp_established(ct)))
>                         goto out;
> 
> this would now be in either FIN or CLOSE state.
> 
> A FIN or RST packet does not trigger re-offload, and an ACK packet
> would be caught by the !nf_conntrack_tcp_established(ct) check.
> 
> What path could trigger re-offload after my latest patch?
> 
> For the flow timeout case (where the flow entry is released and packets
> go back to the classic conntrack path) re-offload is possible, but that
> is a feature.

* Re: [PATCH nf] netfilter: flowtable: infer TCP state and timeout before flow teardown
  2024-03-20  9:51               ` Pablo Neira Ayuso
  2024-03-20 10:13                 ` Sven Auhagen
@ 2024-03-20 10:29                 ` Sven Auhagen
  2024-03-20 10:47                   ` Pablo Neira Ayuso
  1 sibling, 1 reply; 27+ messages in thread
From: Sven Auhagen @ 2024-03-20 10:29 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, cratiu, ozsh, vladbu, gal, fw

On Wed, Mar 20, 2024 at 10:51:39AM +0100, Pablo Neira Ayuso wrote:
> On Wed, Mar 20, 2024 at 10:31:00AM +0100, Sven Auhagen wrote:
> > On Wed, Mar 20, 2024 at 10:27:30AM +0100, Pablo Neira Ayuso wrote:
> > > On Wed, Mar 20, 2024 at 10:20:29AM +0100, Sven Auhagen wrote:
> > > > On Wed, Mar 20, 2024 at 10:07:24AM +0100, Pablo Neira Ayuso wrote:
> > > > > On Wed, Mar 20, 2024 at 09:49:49AM +0100, Sven Auhagen wrote:
> > > > > > On Wed, Mar 20, 2024 at 09:45:16AM +0100, Pablo Neira Ayuso wrote:
> > > > > > > Hi Sven,
> > > > > > > 
> > > > > > > On Wed, Mar 20, 2024 at 09:39:16AM +0100, Sven Auhagen wrote:
> > > > > > > > On Mon, Mar 18, 2024 at 10:39:15AM +0100, Pablo Neira Ayuso wrote:
> > > > > > > [...]
> > > > > > > > > diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
> > > > > > > > > index a0571339239c..481fe3d96bbc 100644
> > > > > > > > > --- a/net/netfilter/nf_flow_table_core.c
> > > > > > > > > +++ b/net/netfilter/nf_flow_table_core.c
> > > > > > > > > @@ -165,10 +165,22 @@ void flow_offload_route_init(struct flow_offload *flow,
> > > > > > > > >  }
> > > > > > > > >  EXPORT_SYMBOL_GPL(flow_offload_route_init);
> > > > > > > > >  
> > > > > > > > > -static void flow_offload_fixup_tcp(struct ip_ct_tcp *tcp)
> > > > > > > > > +static s32 flow_offload_fixup_tcp(struct net *net, struct nf_conn *ct,
> > > > > > > > > +				  enum tcp_conntrack tcp_state)
> > > > > > > > >  {
> > > > > > > > > -	tcp->seen[0].td_maxwin = 0;
> > > > > > > > > -	tcp->seen[1].td_maxwin = 0;
> > > > > > > > > +	struct nf_tcp_net *tn = nf_tcp_pernet(net);
> > > > > > > > > +
> > > > > > > > > +	ct->proto.tcp.state = tcp_state;
> > > > > > > > > +	ct->proto.tcp.seen[0].td_maxwin = 0;
> > > > > > > > > +	ct->proto.tcp.seen[1].td_maxwin = 0;
> > > > > > > > > +
> > > > > > > > > +	/* Similar to mid-connection pickup with loose=1.
> > > > > > > > > +	 * Avoid large ESTABLISHED timeout.
> > > > > > > > > +	 */
> > > > > > > > > +	if (tcp_state == TCP_CONNTRACK_ESTABLISHED)
> > > > > > > > > +		return tn->timeouts[TCP_CONNTRACK_UNACK];
> > > > > > > > 
> > > > > > > > Hi Pablo,
> > > > > > > > 
> > > > > > > > I tested the patch, but the part that sets the timeout to UNACK is not
> > > > > > > > very practical.
> > > > > > > > For example, my long-running SSH connections get killed off by the firewall
> > > > > > > > regularly now while being ESTABLISHED:
> > > > > > > > 
> > > > > > > > [NEW] tcp      6 120 SYN_SENT src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 [UNREPLIED] src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
> > > > > > > > [UPDATE] tcp      6 60 SYN_RECV src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
> > > > > > > > [UPDATE] tcp      6 86400 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 [OFFLOAD] mark=16777216
> > > > > > > > 
> > > > > > > > [DESTROY] tcp      6 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=54941 dport=22 packets=133 bytes=13033 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=54941 packets=95 bytes=15004 [ASSURED] mark=16777216 delta-time=1036
> > > > > > > > 
> > > > > > > > I would remove the if case here.
> > > > > > > 
> > > > > > > OK, I remove it and post a v2. Thanks!
> > > > > > 
> > > > > > Thanks and also the hardcoded TCP_CONNTRACK_ESTABLISHED in flow_offload_fixup_ct
> > > > > > should be reverted to the real tcp state ct->proto.tcp.state.
> > > > > 
> > > > > ct->proto.tcp.state contains the state that was observed before
> > > > > handing over this flow to the flowtable; in most cases, this should
> > > > > be TCP_CONNTRACK_ESTABLISHED.
> > > > > 
> > > > > > This way we always set the current TCP timeout.
> > > > > 
> > > > > I can keep it to ct->proto.tcp.state but I wonder if it is better to
> > > > > use a well known state such as TCP_CONNTRACK_ESTABLISHED to pick up from.
> > > > 
> > > > In case of a race condition, or if something is off like my TCP_FIN
> > > > that is being offloaded again, setting it to ESTABLISHED hard-coded
> > > > will make e.g. the FIN or CLOSE state last very long.
> > > > It is not guaranteed that we are still in ESTABLISHED when this code
> > > > runs. Also, for example, we could have seen both FINs already before the
> > > > flowtable gc runs.
> > > 
> > > OK, I just posted a v2 leaving things as is. I agree it is better to
> > > only address the issue you are observing at this time, it is possible
> > > to revisit later.
> > > 
> > > Thanks!
> > 
> > Thanks, I will give it another try.
> > I think for it to be foolproof we need
> > to migrate the TCP state as well in flow_offload_teardown_tcp to FIN or CLOSE.
> 
> My patch already does it:
> 
> +void flow_offload_teardown_tcp(struct flow_offload *flow, bool fin)
> +{
> +       enum tcp_conntrack tcp_state;
> +
> +       if (fin)
> +               tcp_state = TCP_CONNTRACK_FIN_WAIT;
> +       else /* rst */
> +               tcp_state = TCP_CONNTRACK_CLOSE;
> +
> +       flow_offload_fixup_tcp(nf_ct_net(flow->ct), flow->ct, tcp_state);
> 
> flow_offload_fixup_tcp() updates the TCP state to FIN / CLOSE state.
> 
> > The way things are right now, we are open to a race condition from the reverse side of the
> > connection re-offloading the connection while a FIN or RST is not processed by the netfilter code
> > running after the flowtable code.
> > The connection is still in TCP established state during that window, and another packet can just
> > push it back to the flowtable while the FIN or RST is not processed yet.
> 
> I don't see how:
> 
> static void nft_flow_offload_eval(const struct nft_expr *expr,
>                                   ...
> 
>         switch (ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.protonum) {
>         case IPPROTO_TCP:
>                 tcph = skb_header_pointer(pkt->skb, nft_thoff(pkt),
>                                           sizeof(_tcph), &_tcph);
>                 if (unlikely(!tcph || tcph->fin || tcph->rst ||
>                              !nf_conntrack_tcp_established(ct)))
>                         goto out;
> 
> this would now be either in FIN/CLOSE state.
> 
> FIN, RST packet does not trigger re-offload. And ACK packet would find
> the entry in !nf_conntrack_tcp_established(ct).
> 
> What path could trigger re-offload after my latest patch?

From looking through the nf conntrack tcp code you need to spin_lock
the TCP state change to avoid a race with another packet.
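
For reference, the conntrack TCP tracker takes the per-conntrack
spinlock around state transitions. A minimal sketch of that pattern
(simplified; tcp_state_update_sketch() is a made-up name, not an
upstream function):

/* Kernel-side sketch only: transition the tracked TCP state under
 * ct->lock, the way nf_conntrack_proto_tcp.c does, so a packet on
 * the classic path cannot race with a half-done update.
 */
static void tcp_state_update_sketch(struct nf_conn *ct,
				    enum tcp_conntrack new_state,
				    u32 timeout)
{
	spin_lock_bh(&ct->lock);
	ct->proto.tcp.state = new_state;
	ct->proto.tcp.seen[0].td_maxwin = 0;	/* relax window tracking */
	ct->proto.tcp.seen[1].td_maxwin = 0;
	spin_unlock_bh(&ct->lock);

	/* refresh the conntrack timeout, jiffies-based */
	WRITE_ONCE(ct->timeout, nfct_time_stamp + timeout);
}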

> 
> For the flow timeout case (where the flow entry is released and packets
> go back to the classic conntrack path) re-offload is possible, but that
> is a feature.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH nf] netfilter: flowtable: infer TCP state and timeout before flow teardown
  2024-03-20 10:13                 ` Sven Auhagen
@ 2024-03-20 10:36                   ` Pablo Neira Ayuso
  2024-03-20 10:38                     ` Sven Auhagen
  0 siblings, 1 reply; 27+ messages in thread
From: Pablo Neira Ayuso @ 2024-03-20 10:36 UTC (permalink / raw)
  To: Sven Auhagen; +Cc: netfilter-devel, cratiu, ozsh, vladbu, gal, fw

On Wed, Mar 20, 2024 at 11:13:23AM +0100, Sven Auhagen wrote:
> On Wed, Mar 20, 2024 at 10:51:39AM +0100, Pablo Neira Ayuso wrote:
> > On Wed, Mar 20, 2024 at 10:31:00AM +0100, Sven Auhagen wrote:
> > > On Wed, Mar 20, 2024 at 10:27:30AM +0100, Pablo Neira Ayuso wrote:
> > > > On Wed, Mar 20, 2024 at 10:20:29AM +0100, Sven Auhagen wrote:
[...]
> > > I think for it to be foolproof we need
> > > to migrate the TCP state as well in flow_offload_teardown_tcp to FIN or CLOSE.
> > 
> > My patch already does it:
> > 
> > +void flow_offload_teardown_tcp(struct flow_offload *flow, bool fin)
> > +{
> > +       enum tcp_conntrack tcp_state;
> > +
> > +       if (fin)
> > +               tcp_state = TCP_CONNTRACK_FIN_WAIT;
> > +       else /* rst */
> > +               tcp_state = TCP_CONNTRACK_CLOSE;
> > +
> > +       flow_offload_fixup_tcp(nf_ct_net(flow->ct), flow->ct, tcp_state);
> > 
> > flow_offload_fixup_tcp() updates the TCP state to FIN / CLOSE state.
> 
> Ah, you are correct.
> Nevertheless, I can tell you that I still see this problem with the patch applied:
>
>  [UPDATE] tcp      6 120 FIN_WAIT src=192.168.7.105 dst=17.253.57.219 sport=49574 dport=443 src=17.253.57.219 dst=87.138.198.79 sport=443 dport=5078 [OFFLOAD] mark=25165825
>   [UPDATE] tcp      6 30 LAST_ACK src=192.168.7.105 dst=17.253.57.219 sport=49574 dport=443 src=17.253.57.219 dst=87.138.198.79 sport=443 dport=5078 [ASSURED] mark=25165825
>    [UPDATE] tcp      6 10 CLOSE src=192.168.7.105 dst=17.253.57.219 sport=49574 dport=443 src=17.253.57.219 dst=87.138.198.79 sport=443 dport=5078 [ASSURED] mark=25165825
>    [DESTROY] tcp      6 CLOSE src=192.168.7.105 dst=17.253.57.219 sport=49574 dport=443 packets=15 bytes=2688 src=17.253.57.219 dst=87.138.198.79 sport=443 dport=5078 packets=18 bytes=7172 [ASSURED] mark=25165825 delta-time=126

Just to make sure, are you testing with these two patches?

https://patchwork.ozlabs.org/project/netfilter-devel/patch/20240320092638.798076-1-pablo@netfilter.org/
https://patchwork.ozlabs.org/project/netfilter-devel/patch/20240320092638.798076-2-pablo@netfilter.org/

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH nf] netfilter: flowtable: infer TCP state and timeout before flow teardown
  2024-03-20 10:36                   ` Pablo Neira Ayuso
@ 2024-03-20 10:38                     ` Sven Auhagen
  0 siblings, 0 replies; 27+ messages in thread
From: Sven Auhagen @ 2024-03-20 10:38 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, cratiu, ozsh, vladbu, gal, fw

On Wed, Mar 20, 2024 at 11:36:05AM +0100, Pablo Neira Ayuso wrote:
> On Wed, Mar 20, 2024 at 11:13:23AM +0100, Sven Auhagen wrote:
> > On Wed, Mar 20, 2024 at 10:51:39AM +0100, Pablo Neira Ayuso wrote:
> > > On Wed, Mar 20, 2024 at 10:31:00AM +0100, Sven Auhagen wrote:
> > > > On Wed, Mar 20, 2024 at 10:27:30AM +0100, Pablo Neira Ayuso wrote:
> > > > > On Wed, Mar 20, 2024 at 10:20:29AM +0100, Sven Auhagen wrote:
> [...]
> > > > I think for it to be foolproof we need
> > > > to migrate the TCP state as well in flow_offload_teardown_tcp to FIN or CLOSE.
> > > 
> > > My patch already does it:
> > > 
> > > +void flow_offload_teardown_tcp(struct flow_offload *flow, bool fin)
> > > +{
> > > +       enum tcp_conntrack tcp_state;
> > > +
> > > +       if (fin)
> > > +               tcp_state = TCP_CONNTRACK_FIN_WAIT;
> > > +       else /* rst */
> > > +               tcp_state = TCP_CONNTRACK_CLOSE;
> > > +
> > > +       flow_offload_fixup_tcp(nf_ct_net(flow->ct), flow->ct, tcp_state);
> > > 
> > > flow_offload_fixup_tcp() updates the TCP state to FIN / CLOSE state.
> > 
> > Ah, you are correct.
> > Nevertheless, I can tell you that I still see this problem with the patch applied:
> >
> >  [UPDATE] tcp      6 120 FIN_WAIT src=192.168.7.105 dst=17.253.57.219 sport=49574 dport=443 src=17.253.57.219 dst=87.138.198.79 sport=443 dport=5078 [OFFLOAD] mark=25165825
> >   [UPDATE] tcp      6 30 LAST_ACK src=192.168.7.105 dst=17.253.57.219 sport=49574 dport=443 src=17.253.57.219 dst=87.138.198.79 sport=443 dport=5078 [ASSURED] mark=25165825
> >    [UPDATE] tcp      6 10 CLOSE src=192.168.7.105 dst=17.253.57.219 sport=49574 dport=443 src=17.253.57.219 dst=87.138.198.79 sport=443 dport=5078 [ASSURED] mark=25165825
> >    [DESTROY] tcp      6 CLOSE src=192.168.7.105 dst=17.253.57.219 sport=49574 dport=443 packets=15 bytes=2688 src=17.253.57.219 dst=87.138.198.79 sport=443 dport=5078 packets=18 bytes=7172 [ASSURED] mark=25165825 delta-time=126
> 
> Just to make sure, are you testing with these two patches?
> 
> https://patchwork.ozlabs.org/project/netfilter-devel/patch/20240320092638.798076-1-pablo@netfilter.org/
> https://patchwork.ozlabs.org/project/netfilter-devel/patch/20240320092638.798076-2-pablo@netfilter.org/

I am currently testing with v1 of these patches.
I need to wait until later today to reboot the production system for v2.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH nf] netfilter: flowtable: infer TCP state and timeout before flow teardown
  2024-03-20 10:29                 ` Sven Auhagen
@ 2024-03-20 10:47                   ` Pablo Neira Ayuso
  2024-03-20 11:15                     ` Sven Auhagen
  0 siblings, 1 reply; 27+ messages in thread
From: Pablo Neira Ayuso @ 2024-03-20 10:47 UTC (permalink / raw)
  To: Sven Auhagen; +Cc: netfilter-devel, cratiu, ozsh, vladbu, gal, fw

On Wed, Mar 20, 2024 at 11:29:05AM +0100, Sven Auhagen wrote:
> On Wed, Mar 20, 2024 at 10:51:39AM +0100, Pablo Neira Ayuso wrote:
> > On Wed, Mar 20, 2024 at 10:31:00AM +0100, Sven Auhagen wrote:
> > > On Wed, Mar 20, 2024 at 10:27:30AM +0100, Pablo Neira Ayuso wrote:
> > > > On Wed, Mar 20, 2024 at 10:20:29AM +0100, Sven Auhagen wrote:
> > > > > On Wed, Mar 20, 2024 at 10:07:24AM +0100, Pablo Neira Ayuso wrote:
> > > > > > On Wed, Mar 20, 2024 at 09:49:49AM +0100, Sven Auhagen wrote:
> > > > > > > On Wed, Mar 20, 2024 at 09:45:16AM +0100, Pablo Neira Ayuso wrote:
> > > > > > > > Hi Sven,
> > > > > > > > 
> > > > > > > > On Wed, Mar 20, 2024 at 09:39:16AM +0100, Sven Auhagen wrote:
> > > > > > > > > On Mon, Mar 18, 2024 at 10:39:15AM +0100, Pablo Neira Ayuso wrote:
> > > > > > > > [...]
> > > > > > > > > > diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
> > > > > > > > > > index a0571339239c..481fe3d96bbc 100644
> > > > > > > > > > --- a/net/netfilter/nf_flow_table_core.c
> > > > > > > > > > +++ b/net/netfilter/nf_flow_table_core.c
> > > > > > > > > > @@ -165,10 +165,22 @@ void flow_offload_route_init(struct flow_offload *flow,
> > > > > > > > > >  }
> > > > > > > > > >  EXPORT_SYMBOL_GPL(flow_offload_route_init);
> > > > > > > > > >  
> > > > > > > > > > -static void flow_offload_fixup_tcp(struct ip_ct_tcp *tcp)
> > > > > > > > > > +static s32 flow_offload_fixup_tcp(struct net *net, struct nf_conn *ct,
> > > > > > > > > > +				  enum tcp_conntrack tcp_state)
> > > > > > > > > >  {
> > > > > > > > > > -	tcp->seen[0].td_maxwin = 0;
> > > > > > > > > > -	tcp->seen[1].td_maxwin = 0;
> > > > > > > > > > +	struct nf_tcp_net *tn = nf_tcp_pernet(net);
> > > > > > > > > > +
> > > > > > > > > > +	ct->proto.tcp.state = tcp_state;
> > > > > > > > > > +	ct->proto.tcp.seen[0].td_maxwin = 0;
> > > > > > > > > > +	ct->proto.tcp.seen[1].td_maxwin = 0;
> > > > > > > > > > +
> > > > > > > > > > +	/* Similar to mid-connection pickup with loose=1.
> > > > > > > > > > +	 * Avoid large ESTABLISHED timeout.
> > > > > > > > > > +	 */
> > > > > > > > > > +	if (tcp_state == TCP_CONNTRACK_ESTABLISHED)
> > > > > > > > > > +		return tn->timeouts[TCP_CONNTRACK_UNACK];
> > > > > > > > > 
> > > > > > > > > Hi Pablo,
> > > > > > > > > 
> > > > > > > > > I tested the patch, but the part that sets the timeout to UNACK is not
> > > > > > > > > very practical.
> > > > > > > > > For example, my long-running SSH connections get killed off by the firewall
> > > > > > > > > regularly now while being ESTABLISHED:
> > > > > > > > > 
> > > > > > > > > [NEW] tcp      6 120 SYN_SENT src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 [UNREPLIED] src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
> > > > > > > > > [UPDATE] tcp      6 60 SYN_RECV src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
> > > > > > > > > [UPDATE] tcp      6 86400 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 [OFFLOAD] mark=16777216
> > > > > > > > > 
> > > > > > > > > [DESTROY] tcp      6 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=54941 dport=22 packets=133 bytes=13033 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=54941 packets=95 bytes=15004 [ASSURED] mark=16777216 delta-time=1036
> > > > > > > > > 
> > > > > > > > > I would remove the if case here.
> > > > > > > > 
> > > > > > > > OK, I remove it and post a v2. Thanks!
> > > > > > > 
> > > > > > > Thanks and also the hardcoded TCP_CONNTRACK_ESTABLISHED in flow_offload_fixup_ct
> > > > > > > should be reverted to the real tcp state ct->proto.tcp.state.
> > > > > > 
> > > > > > ct->proto.tcp.state contains the state that was observed before
> > > > > > handing over this flow to the flowtable; in most cases, this should
> > > > > > be TCP_CONNTRACK_ESTABLISHED.
> > > > > > 
> > > > > > > This way we always set the current TCP timeout.
> > > > > > 
> > > > > > I can keep it to ct->proto.tcp.state but I wonder if it is better to
> > > > > > use a well known state such as TCP_CONNTRACK_ESTABLISHED to pick up from.
> > > > > 
> > > > > In case of a race condition, or if something is off like my TCP_FIN
> > > > > that is being offloaded again, setting it to ESTABLISHED hard-coded
> > > > > will make e.g. the FIN or CLOSE state last very long.
> > > > > It is not guaranteed that we are still in ESTABLISHED when this code
> > > > > runs. Also, for example, we could have seen both FINs already before the
> > > > > flowtable gc runs.
> > > > 
> > > > OK, I just posted a v2 leaving things as is. I agree it is better to
> > > > only address the issue you are observing at this time, it is possible
> > > > to revisit later.
> > > > 
> > > > Thanks!
> > > 
> > > Thanks, I will give it another try.
> > > I think for it to be foolproof we need
> > > to migrate the TCP state as well in flow_offload_teardown_tcp to FIN or CLOSE.
> > 
> > My patch already does it:
> > 
> > +void flow_offload_teardown_tcp(struct flow_offload *flow, bool fin)
> > +{
> > +       enum tcp_conntrack tcp_state;
> > +
> > +       if (fin)
> > +               tcp_state = TCP_CONNTRACK_FIN_WAIT;
> > +       else /* rst */
> > +               tcp_state = TCP_CONNTRACK_CLOSE;
> > +
> > +       flow_offload_fixup_tcp(nf_ct_net(flow->ct), flow->ct, tcp_state);
> > 
> > flow_offload_fixup_tcp() updates the TCP state to FIN / CLOSE state.
> > 
> > > The way things are right now, we are open to a race condition from the reverse side of the
> > > connection re-offloading the connection while a FIN or RST is not processed by the netfilter code
> > > running after the flowtable code.
> > > The connection is still in TCP established state during that window, and another packet can just
> > > push it back to the flowtable while the FIN or RST is not processed yet.
> > 
> > I don't see how:
> > 
> > static void nft_flow_offload_eval(const struct nft_expr *expr,
> >                                   ...
> > 
> >         switch (ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.protonum) {
> >         case IPPROTO_TCP:
> >                 tcph = skb_header_pointer(pkt->skb, nft_thoff(pkt),
> >                                           sizeof(_tcph), &_tcph);
> >                 if (unlikely(!tcph || tcph->fin || tcph->rst ||
> >                              !nf_conntrack_tcp_established(ct)))
> >                         goto out;
> > 
> > this would now be either in FIN/CLOSE state.
> > 
> > FIN, RST packet does not trigger re-offload. And ACK packet would find
> > the entry in !nf_conntrack_tcp_established(ct).
> > 
> > What path could trigger re-offload after my latest patch?
> 
> From looking through the nf conntrack tcp code you need to spin_lock
> the TCP state change to avoid a race with another packet.

The flowtable owns the flow; packets belonging to the flow cannot update
the TCP state while the flow is offloaded to the flowtable.

Once the _TEARDOWN flag is set, packets get back to the classic
conntrack path.

I don't see how such a race can happen.
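
For illustration, the datapath check that hands FIN/RST packets back
is roughly the following (a sketch of nf_flow_state_check() in
nf_flow_table_ip.c with the teardown call from this patch; simplified,
not the exact upstream code):

static bool nf_flow_state_check(struct flow_offload *flow, int proto,
				struct sk_buff *skb, unsigned int thoff)
{
	struct tcphdr *tcph;

	if (proto != IPPROTO_TCP)
		return false;

	tcph = (void *)(skb_network_header(skb) + thoff);
	if (unlikely(tcph->fin || tcph->rst)) {
		/* infer FIN_WAIT / CLOSE, set _TEARDOWN and let the
		 * packet continue on the classic conntrack path
		 */
		flow_offload_teardown_tcp(flow, tcph->fin);
		return true;
	}

	return false;
}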

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH nf] netfilter: flowtable: infer TCP state and timeout before flow teardown
  2024-03-20 10:47                   ` Pablo Neira Ayuso
@ 2024-03-20 11:15                     ` Sven Auhagen
  2024-03-20 12:37                       ` Pablo Neira Ayuso
  0 siblings, 1 reply; 27+ messages in thread
From: Sven Auhagen @ 2024-03-20 11:15 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, cratiu, ozsh, vladbu, gal, fw

On Wed, Mar 20, 2024 at 11:47:50AM +0100, Pablo Neira Ayuso wrote:
> On Wed, Mar 20, 2024 at 11:29:05AM +0100, Sven Auhagen wrote:
> > On Wed, Mar 20, 2024 at 10:51:39AM +0100, Pablo Neira Ayuso wrote:
> > > On Wed, Mar 20, 2024 at 10:31:00AM +0100, Sven Auhagen wrote:
> > > > On Wed, Mar 20, 2024 at 10:27:30AM +0100, Pablo Neira Ayuso wrote:
> > > > > On Wed, Mar 20, 2024 at 10:20:29AM +0100, Sven Auhagen wrote:
> > > > > > On Wed, Mar 20, 2024 at 10:07:24AM +0100, Pablo Neira Ayuso wrote:
> > > > > > > On Wed, Mar 20, 2024 at 09:49:49AM +0100, Sven Auhagen wrote:
> > > > > > > > On Wed, Mar 20, 2024 at 09:45:16AM +0100, Pablo Neira Ayuso wrote:
> > > > > > > > > Hi Sven,
> > > > > > > > > 
> > > > > > > > > On Wed, Mar 20, 2024 at 09:39:16AM +0100, Sven Auhagen wrote:
> > > > > > > > > > On Mon, Mar 18, 2024 at 10:39:15AM +0100, Pablo Neira Ayuso wrote:
> > > > > > > > > [...]
> > > > > > > > > > > diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
> > > > > > > > > > > index a0571339239c..481fe3d96bbc 100644
> > > > > > > > > > > --- a/net/netfilter/nf_flow_table_core.c
> > > > > > > > > > > +++ b/net/netfilter/nf_flow_table_core.c
> > > > > > > > > > > @@ -165,10 +165,22 @@ void flow_offload_route_init(struct flow_offload *flow,
> > > > > > > > > > >  }
> > > > > > > > > > >  EXPORT_SYMBOL_GPL(flow_offload_route_init);
> > > > > > > > > > >  
> > > > > > > > > > > -static void flow_offload_fixup_tcp(struct ip_ct_tcp *tcp)
> > > > > > > > > > > +static s32 flow_offload_fixup_tcp(struct net *net, struct nf_conn *ct,
> > > > > > > > > > > +				  enum tcp_conntrack tcp_state)
> > > > > > > > > > >  {
> > > > > > > > > > > -	tcp->seen[0].td_maxwin = 0;
> > > > > > > > > > > -	tcp->seen[1].td_maxwin = 0;
> > > > > > > > > > > +	struct nf_tcp_net *tn = nf_tcp_pernet(net);
> > > > > > > > > > > +
> > > > > > > > > > > +	ct->proto.tcp.state = tcp_state;
> > > > > > > > > > > +	ct->proto.tcp.seen[0].td_maxwin = 0;
> > > > > > > > > > > +	ct->proto.tcp.seen[1].td_maxwin = 0;
> > > > > > > > > > > +
> > > > > > > > > > > +	/* Similar to mid-connection pickup with loose=1.
> > > > > > > > > > > +	 * Avoid large ESTABLISHED timeout.
> > > > > > > > > > > +	 */
> > > > > > > > > > > +	if (tcp_state == TCP_CONNTRACK_ESTABLISHED)
> > > > > > > > > > > +		return tn->timeouts[TCP_CONNTRACK_UNACK];
> > > > > > > > > > 
> > > > > > > > > > Hi Pablo,
> > > > > > > > > > 
> > > > > > > > > > I tested the patch, but the part that sets the timeout to UNACK is not
> > > > > > > > > > very practical.
> > > > > > > > > > For example, my long-running SSH connections get killed off by the firewall
> > > > > > > > > > regularly now while being ESTABLISHED:
> > > > > > > > > > 
> > > > > > > > > > [NEW] tcp      6 120 SYN_SENT src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 [UNREPLIED] src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
> > > > > > > > > > [UPDATE] tcp      6 60 SYN_RECV src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
> > > > > > > > > > [UPDATE] tcp      6 86400 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 [OFFLOAD] mark=16777216
> > > > > > > > > > 
> > > > > > > > > > [DESTROY] tcp      6 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=54941 dport=22 packets=133 bytes=13033 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=54941 packets=95 bytes=15004 [ASSURED] mark=16777216 delta-time=1036
> > > > > > > > > > 
> > > > > > > > > > I would remove the if case here.
> > > > > > > > > 
> > > > > > > > > OK, I remove it and post a v2. Thanks!
> > > > > > > > 
> > > > > > > > Thanks and also the hardcoded TCP_CONNTRACK_ESTABLISHED in flow_offload_fixup_ct
> > > > > > > > should be reverted to the real tcp state ct->proto.tcp.state.
> > > > > > > 
> > > > > > > ct->proto.tcp.state contains the state that was observed before
> > > > > > > handing over this flow to the flowtable; in most cases, this should
> > > > > > > be TCP_CONNTRACK_ESTABLISHED.
> > > > > > > 
> > > > > > > > This way we always set the current TCP timeout.
> > > > > > > 
> > > > > > > I can keep it to ct->proto.tcp.state but I wonder if it is better to
> > > > > > > use a well known state such as TCP_CONNTRACK_ESTABLISHED to pick up from.
> > > > > > 
> > > > > > In case of a race condition, or if something is off like my TCP_FIN
> > > > > > that is being offloaded again, setting it to ESTABLISHED hard-coded
> > > > > > will make e.g. the FIN or CLOSE state last very long.
> > > > > > It is not guaranteed that we are still in ESTABLISHED when this code
> > > > > > runs. Also, for example, we could have seen both FINs already before the
> > > > > > flowtable gc runs.
> > > > > 
> > > > > OK, I just posted a v2 leaving things as is. I agree it is better to
> > > > > only address the issue you are observing at this time, it is possible
> > > > > to revisit later.
> > > > > 
> > > > > Thanks!
> > > > 
> > > > Thanks, I will give it another try.
> > > > I think for it to be foolproof we need
> > > > to migrate the TCP state as well in flow_offload_teardown_tcp to FIN or CLOSE.
> > > 
> > > My patch already does it:
> > > 
> > > +void flow_offload_teardown_tcp(struct flow_offload *flow, bool fin)
> > > +{
> > > +       enum tcp_conntrack tcp_state;
> > > +
> > > +       if (fin)
> > > +               tcp_state = TCP_CONNTRACK_FIN_WAIT;
> > > +       else /* rst */
> > > +               tcp_state = TCP_CONNTRACK_CLOSE;
> > > +
> > > +       flow_offload_fixup_tcp(nf_ct_net(flow->ct), flow->ct, tcp_state);
> > > 
> > > flow_offload_fixup_tcp() updates the TCP state to FIN / CLOSE state.
> > > 
> > > > The way things are right now, we are open to a race condition from the reverse side of the
> > > > connection re-offloading the connection while a FIN or RST is not processed by the netfilter code
> > > > running after the flowtable code.
> > > > The connection is still in TCP established state during that window, and another packet can just
> > > > push it back to the flowtable while the FIN or RST is not processed yet.
> > > 
> > > I don't see how:
> > > 
> > > static void nft_flow_offload_eval(const struct nft_expr *expr,
> > >                                   ...
> > > 
> > >         switch (ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.protonum) {
> > >         case IPPROTO_TCP:
> > >                 tcph = skb_header_pointer(pkt->skb, nft_thoff(pkt),
> > >                                           sizeof(_tcph), &_tcph);
> > >                 if (unlikely(!tcph || tcph->fin || tcph->rst ||
> > >                              !nf_conntrack_tcp_established(ct)))
> > >                         goto out;
> > > 
> > > this would now be either in FIN/CLOSE state.
> > > 
> > > FIN, RST packet does not trigger re-offload. And ACK packet would find
> > > the entry in !nf_conntrack_tcp_established(ct).
> > > 
> > > What path could trigger re-offload after my latest patch?
> > 
> > From looking through the nf conntrack tcp code you need to spin_lock
> > the TCP state change to avoid a race with another packet.
> 
> The flowtable owns the flow; packets belonging to the flow cannot update
> the TCP state while the flow is offloaded to the flowtable.
> 
> Once the _TEARDOWN flag is set, packets get back to the classic
> conntrack path.

Hmm alright, something is going wrong somewhere and it looks a lot like
a race condition :)

I mean just in theory it is not guaranteed that both directions send all
packets through the flowtable just because it is offloaded.
A variety of error checks might send a packet back to the slow path.
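
The MTU check is one such bounce point; a sketch along the lines of
nf_flow_exceeds_mtu() in nf_flow_table_ip.c (simplified):

static bool nf_flow_exceeds_mtu(const struct sk_buff *skb, unsigned int mtu)
{
	if (skb->len <= mtu)
		return false;

	/* GSO packets are fine if their segments fit the MTU */
	if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu))
		return false;

	/* oversized: the hook returns the packet to the slow path */
	return true;
}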

> 
> I don't see how such a race can happen.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH nf] netfilter: flowtable: infer TCP state and timeout before flow teardown
  2024-03-20 11:15                     ` Sven Auhagen
@ 2024-03-20 12:37                       ` Pablo Neira Ayuso
  2024-03-20 13:37                         ` Sven Auhagen
  2024-04-08  5:24                         ` Sven Auhagen
  0 siblings, 2 replies; 27+ messages in thread
From: Pablo Neira Ayuso @ 2024-03-20 12:37 UTC (permalink / raw)
  To: Sven Auhagen; +Cc: netfilter-devel, cratiu, ozsh, vladbu, gal, fw

[-- Attachment #1: Type: text/plain, Size: 9278 bytes --]

On Wed, Mar 20, 2024 at 12:15:51PM +0100, Sven Auhagen wrote:
> On Wed, Mar 20, 2024 at 11:47:50AM +0100, Pablo Neira Ayuso wrote:
> > On Wed, Mar 20, 2024 at 11:29:05AM +0100, Sven Auhagen wrote:
> > > On Wed, Mar 20, 2024 at 10:51:39AM +0100, Pablo Neira Ayuso wrote:
> > > > On Wed, Mar 20, 2024 at 10:31:00AM +0100, Sven Auhagen wrote:
> > > > > On Wed, Mar 20, 2024 at 10:27:30AM +0100, Pablo Neira Ayuso wrote:
> > > > > > On Wed, Mar 20, 2024 at 10:20:29AM +0100, Sven Auhagen wrote:
> > > > > > > On Wed, Mar 20, 2024 at 10:07:24AM +0100, Pablo Neira Ayuso wrote:
> > > > > > > > On Wed, Mar 20, 2024 at 09:49:49AM +0100, Sven Auhagen wrote:
> > > > > > > > > On Wed, Mar 20, 2024 at 09:45:16AM +0100, Pablo Neira Ayuso wrote:
> > > > > > > > > > Hi Sven,
> > > > > > > > > > 
> > > > > > > > > > On Wed, Mar 20, 2024 at 09:39:16AM +0100, Sven Auhagen wrote:
> > > > > > > > > > > On Mon, Mar 18, 2024 at 10:39:15AM +0100, Pablo Neira Ayuso wrote:
> > > > > > > > > > [...]
> > > > > > > > > > > > diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
> > > > > > > > > > > > index a0571339239c..481fe3d96bbc 100644
> > > > > > > > > > > > --- a/net/netfilter/nf_flow_table_core.c
> > > > > > > > > > > > +++ b/net/netfilter/nf_flow_table_core.c
> > > > > > > > > > > > @@ -165,10 +165,22 @@ void flow_offload_route_init(struct flow_offload *flow,
> > > > > > > > > > > >  }
> > > > > > > > > > > >  EXPORT_SYMBOL_GPL(flow_offload_route_init);
> > > > > > > > > > > >  
> > > > > > > > > > > > -static void flow_offload_fixup_tcp(struct ip_ct_tcp *tcp)
> > > > > > > > > > > > +static s32 flow_offload_fixup_tcp(struct net *net, struct nf_conn *ct,
> > > > > > > > > > > > +				  enum tcp_conntrack tcp_state)
> > > > > > > > > > > >  {
> > > > > > > > > > > > -	tcp->seen[0].td_maxwin = 0;
> > > > > > > > > > > > -	tcp->seen[1].td_maxwin = 0;
> > > > > > > > > > > > +	struct nf_tcp_net *tn = nf_tcp_pernet(net);
> > > > > > > > > > > > +
> > > > > > > > > > > > +	ct->proto.tcp.state = tcp_state;
> > > > > > > > > > > > +	ct->proto.tcp.seen[0].td_maxwin = 0;
> > > > > > > > > > > > +	ct->proto.tcp.seen[1].td_maxwin = 0;
> > > > > > > > > > > > +
> > > > > > > > > > > > +	/* Similar to mid-connection pickup with loose=1.
> > > > > > > > > > > > +	 * Avoid large ESTABLISHED timeout.
> > > > > > > > > > > > +	 */
> > > > > > > > > > > > +	if (tcp_state == TCP_CONNTRACK_ESTABLISHED)
> > > > > > > > > > > > +		return tn->timeouts[TCP_CONNTRACK_UNACK];
> > > > > > > > > > > 
> > > > > > > > > > > Hi Pablo,
> > > > > > > > > > > 
> > > > > > > > > > > I tested the patch, but the part that sets the timeout to UNACK is not
> > > > > > > > > > > very practical.
> > > > > > > > > > > For example, my long-running SSH connections get killed off by the firewall
> > > > > > > > > > > regularly now while being ESTABLISHED:
> > > > > > > > > > > 
> > > > > > > > > > > [NEW] tcp      6 120 SYN_SENT src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 [UNREPLIED] src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
> > > > > > > > > > > [UPDATE] tcp      6 60 SYN_RECV src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
> > > > > > > > > > > [UPDATE] tcp      6 86400 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 [OFFLOAD] mark=16777216
> > > > > > > > > > > 
> > > > > > > > > > > [DESTROY] tcp      6 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=54941 dport=22 packets=133 bytes=13033 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=54941 packets=95 bytes=15004 [ASSURED] mark=16777216 delta-time=1036
> > > > > > > > > > > 
> > > > > > > > > > > I would remove the if case here.
> > > > > > > > > > 
> > > > > > > > > > OK, I remove it and post a v2. Thanks!
> > > > > > > > > 
> > > > > > > > > Thanks and also the hardcoded TCP_CONNTRACK_ESTABLISHED in flow_offload_fixup_ct
> > > > > > > > > should be reverted to the real tcp state ct->proto.tcp.state.
> > > > > > > > 
> > > > > > > > ct->proto.tcp.state contains the state that was observed before
> > > > > > > > handing over this flow to the flowtable; in most cases, this should
> > > > > > > > be TCP_CONNTRACK_ESTABLISHED.
> > > > > > > > 
> > > > > > > > > This way we always set the current TCP timeout.
> > > > > > > > 
> > > > > > > > I can keep it to ct->proto.tcp.state but I wonder if it is better to
> > > > > > > > use a well known state such as TCP_CONNTRACK_ESTABLISHED to pick up from.
> > > > > > > 
> > > > > > > In case of a race condition, or if something is off like my TCP_FIN
> > > > > > > that is being offloaded again, setting it to ESTABLISHED hard-coded
> > > > > > > will make e.g. the FIN or CLOSE state last very long.
> > > > > > > It is not guaranteed that we are still in ESTABLISHED when this code
> > > > > > > runs. Also, for example, we could have seen both FINs already before the
> > > > > > > flowtable gc runs.
> > > > > > 
> > > > > > OK, I just posted a v2 leaving things as is. I agree it is better to
> > > > > > only address the issue you are observing at this time, it is possible
> > > > > > to revisit later.
> > > > > > 
> > > > > > Thanks!
> > > > > 
> > > > > Thanks, I will give it another try.
> > > > > I think for it to be foolproof we need
> > > > > to migrate the TCP state as well in flow_offload_teardown_tcp to FIN or CLOSE.
> > > > 
> > > > My patch already does it:
> > > > 
> > > > +void flow_offload_teardown_tcp(struct flow_offload *flow, bool fin)
> > > > +{
> > > > +       enum tcp_conntrack tcp_state;
> > > > +
> > > > +       if (fin)
> > > > +               tcp_state = TCP_CONNTRACK_FIN_WAIT;
> > > > +       else /* rst */
> > > > +               tcp_state = TCP_CONNTRACK_CLOSE;
> > > > +
> > > > +       flow_offload_fixup_tcp(nf_ct_net(flow->ct), flow->ct, tcp_state);
> > > > 
> > > > flow_offload_fixup_tcp() updates the TCP state to FIN / CLOSE state.
> > > > 
> > > > > The way things are right now, we are open to a race condition from the reverse side of the
> > > > > connection re-offloading the connection while a FIN or RST is not processed by the netfilter code
> > > > > running after the flowtable code.
> > > > > The connection is still in TCP established state during that window, and another packet can just
> > > > > push it back to the flowtable while the FIN or RST is not processed yet.
> > > > 
> > > > I don't see how:
> > > > 
> > > > static void nft_flow_offload_eval(const struct nft_expr *expr,
> > > >                                   ...
> > > > 
> > > >         switch (ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.protonum) {
> > > >         case IPPROTO_TCP:
> > > >                 tcph = skb_header_pointer(pkt->skb, nft_thoff(pkt),
> > > >                                           sizeof(_tcph), &_tcph);
> > > >                 if (unlikely(!tcph || tcph->fin || tcph->rst ||
> > > >                              !nf_conntrack_tcp_established(ct)))
> > > >                         goto out;
> > > > 
> > > > this would now be either in FIN/CLOSE state.
> > > > 
> > > > FIN, RST packet does not trigger re-offload. And ACK packet would find
> > > > the entry in !nf_conntrack_tcp_established(ct).
> > > > 
> > > > What path could trigger re-offload after my latest patch?
> > > 
> > > From looking through the nf conntrack tcp code you need to spin_lock
> > > the TCP state change to avoid a race with another packet.
> > 
> > The flowtable owns the flow; packets belonging to the flow cannot update
> > the TCP state while the flow is offloaded to the flowtable.
> > 
> > Once the _TEARDOWN flag is set, packets get back to the classic
> > conntrack path.
> 
> Hmm alright, something is going wrong somewhere and it looks a lot like
> a race condition :)

This check is racy: another packet could alter the ct state right
after it evaluates true.

         switch (ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.protonum) {
         case IPPROTO_TCP:
                 tcph = skb_header_pointer(pkt->skb, nft_thoff(pkt),
                                           sizeof(_tcph), &_tcph);
                 if (unlikely(!tcph || tcph->fin || tcph->rst ||
                              !nf_conntrack_tcp_established(ct))) <-------
                         goto out;

Sequence would be:

1) flow expires (after 30 seconds), goes back to conntrack in established state
2) packet re-offloads the flow and nf_conntrack_tcp_established(ct) evaluates true.
3) FIN packet races to update conntrack getting to FIN_WAIT while re-offloading
   the flow.

Then you see FIN_WAIT and [OFFLOAD]; it could happen with an expired
flow that goes back to conntrack.

But I am not sure yet if this is the case you're observing there.
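
In code terms, the window sits between the established check and the
flow insertion in nft_flow_offload_eval(); a simplified sketch with
the race annotated in comments (not the exact upstream code):

	if (unlikely(!tcph || tcph->fin || tcph->rst ||
		     !nf_conntrack_tcp_established(ct)))	/* step 2 passes */
		goto out;

	/* step 3: a FIN in the reply direction can be processed by
	 * conntrack right here, moving the entry to FIN_WAIT while
	 * this function keeps going
	 */
	flow = flow_offload_alloc(ct);
	if (!flow)
		goto out;

	flow_offload_route_init(flow, &route);
	/* ... */
	flow_offload_add(flowtable, flow);	/* re-offloaded in FIN_WAIT */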

> I mean just in theory it is not guaranteed that both directions send all
> packets through the flowtable just because it is offloaded.
> A variety of error checks might send a packet back to the slow path.

There is the MTU check that is lacking the teardown, but that should
only affect UDP traffic. A patch from Felix decided to cache the MTU
in the flow entry. That is also probably convenient to have, but it
looks like a different issue; I will also post a patch for it.

[-- Attachment #2: x.patch --]
[-- Type: text/x-diff, Size: 1136 bytes --]

diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c
index 13b6c453d8bc..627268c32aaa 100644
--- a/net/netfilter/nf_flow_table_ip.c
+++ b/net/netfilter/nf_flow_table_ip.c
@@ -372,8 +372,10 @@ static int nf_flow_offload_forward(struct nf_flowtable_ctx *ctx,
 	flow = container_of(tuplehash, struct flow_offload, tuplehash[dir]);
 
 	mtu = flow->tuplehash[dir].tuple.mtu + ctx->offset;
-	if (unlikely(nf_flow_exceeds_mtu(skb, mtu)))
+	if (unlikely(nf_flow_exceeds_mtu(skb, mtu))) {
+		flow_offload_teardown(flow);
 		return 0;
+	}
 
 	iph = (struct iphdr *)(skb_network_header(skb) + ctx->offset);
 	thoff = (iph->ihl * 4) + ctx->offset;
@@ -651,8 +653,10 @@ static int nf_flow_offload_ipv6_forward(struct nf_flowtable_ctx *ctx,
 	flow = container_of(tuplehash, struct flow_offload, tuplehash[dir]);
 
 	mtu = flow->tuplehash[dir].tuple.mtu + ctx->offset;
-	if (unlikely(nf_flow_exceeds_mtu(skb, mtu)))
+	if (unlikely(nf_flow_exceeds_mtu(skb, mtu))) {
+		flow_offload_teardown(flow);
 		return 0;
+	}
 
 	ip6h = (struct ipv6hdr *)(skb_network_header(skb) + ctx->offset);
 	thoff = sizeof(*ip6h) + ctx->offset;
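
For context, flow_offload_teardown() itself only flags the flow and
fixes up the conntrack entry, roughly (a sketch of the helper in
net/netfilter/nf_flow_table_core.c; simplified):

void flow_offload_teardown(struct flow_offload *flow)
{
	/* drop the offload bit so conntrack owns the entry again */
	clear_bit(IPS_OFFLOAD_BIT, &flow->ct->status);
	/* mark the flow for removal by the flowtable gc */
	set_bit(NF_FLOW_TEARDOWN, &flow->flags);
	/* adjust conntrack state and timeout for the classic path */
	flow_offload_fixup_ct(flow->ct);
}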

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH nf] netfilter: flowtable: infer TCP state and timeout before flow teardown
  2024-03-20 12:37                       ` Pablo Neira Ayuso
@ 2024-03-20 13:37                         ` Sven Auhagen
  2024-04-08  5:24                         ` Sven Auhagen
  1 sibling, 0 replies; 27+ messages in thread
From: Sven Auhagen @ 2024-03-20 13:37 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, cratiu, ozsh, vladbu, gal, fw

On Wed, Mar 20, 2024 at 01:37:58PM +0100, Pablo Neira Ayuso wrote:
> On Wed, Mar 20, 2024 at 12:15:51PM +0100, Sven Auhagen wrote:
> > On Wed, Mar 20, 2024 at 11:47:50AM +0100, Pablo Neira Ayuso wrote:
> > > On Wed, Mar 20, 2024 at 11:29:05AM +0100, Sven Auhagen wrote:
> > > > On Wed, Mar 20, 2024 at 10:51:39AM +0100, Pablo Neira Ayuso wrote:
> > > > > On Wed, Mar 20, 2024 at 10:31:00AM +0100, Sven Auhagen wrote:
> > > > > > On Wed, Mar 20, 2024 at 10:27:30AM +0100, Pablo Neira Ayuso wrote:
> > > > > > > On Wed, Mar 20, 2024 at 10:20:29AM +0100, Sven Auhagen wrote:
> > > > > > > > On Wed, Mar 20, 2024 at 10:07:24AM +0100, Pablo Neira Ayuso wrote:
> > > > > > > > > On Wed, Mar 20, 2024 at 09:49:49AM +0100, Sven Auhagen wrote:
> > > > > > > > > > On Wed, Mar 20, 2024 at 09:45:16AM +0100, Pablo Neira Ayuso wrote:
> > > > > > > > > > > Hi Sven,
> > > > > > > > > > > 
> > > > > > > > > > > On Wed, Mar 20, 2024 at 09:39:16AM +0100, Sven Auhagen wrote:
> > > > > > > > > > > > On Mon, Mar 18, 2024 at 10:39:15AM +0100, Pablo Neira Ayuso wrote:
> > > > > > > > > > > [...]
> > > > > > > > > > > > > diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
> > > > > > > > > > > > > index a0571339239c..481fe3d96bbc 100644
> > > > > > > > > > > > > --- a/net/netfilter/nf_flow_table_core.c
> > > > > > > > > > > > > +++ b/net/netfilter/nf_flow_table_core.c
> > > > > > > > > > > > > @@ -165,10 +165,22 @@ void flow_offload_route_init(struct flow_offload *flow,
> > > > > > > > > > > > >  }
> > > > > > > > > > > > >  EXPORT_SYMBOL_GPL(flow_offload_route_init);
> > > > > > > > > > > > >  
> > > > > > > > > > > > > -static void flow_offload_fixup_tcp(struct ip_ct_tcp *tcp)
> > > > > > > > > > > > > +static s32 flow_offload_fixup_tcp(struct net *net, struct nf_conn *ct,
> > > > > > > > > > > > > +				  enum tcp_conntrack tcp_state)
> > > > > > > > > > > > >  {
> > > > > > > > > > > > > -	tcp->seen[0].td_maxwin = 0;
> > > > > > > > > > > > > -	tcp->seen[1].td_maxwin = 0;
> > > > > > > > > > > > > +	struct nf_tcp_net *tn = nf_tcp_pernet(net);
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +	ct->proto.tcp.state = tcp_state;
> > > > > > > > > > > > > +	ct->proto.tcp.seen[0].td_maxwin = 0;
> > > > > > > > > > > > > +	ct->proto.tcp.seen[1].td_maxwin = 0;
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +	/* Similar to mid-connection pickup with loose=1.
> > > > > > > > > > > > > +	 * Avoid large ESTABLISHED timeout.
> > > > > > > > > > > > > +	 */
> > > > > > > > > > > > > +	if (tcp_state == TCP_CONNTRACK_ESTABLISHED)
> > > > > > > > > > > > > +		return tn->timeouts[TCP_CONNTRACK_UNACK];
> > > > > > > > > > > > 
> > > > > > > > > > > > Hi Pablo,
> > > > > > > > > > > > 
> > > > > > > > > > > > I tested the patch, but the part that sets the timeout to UNACK is not
> > > > > > > > > > > > very practical.
> > > > > > > > > > > > For example, my long-running SSH connections get killed off by the firewall
> > > > > > > > > > > > regularly now while being ESTABLISHED:
> > > > > > > > > > > > 
> > > > > > > > > > > > [NEW] tcp      6 120 SYN_SENT src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 [UNREPLIED] src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
> > > > > > > > > > > > [UPDATE] tcp      6 60 SYN_RECV src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
> > > > > > > > > > > > [UPDATE] tcp      6 86400 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 [OFFLOAD] mark=16777216
> > > > > > > > > > > > 
> > > > > > > > > > > > [DESTROY] tcp      6 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=54941 dport=22 packets=133 bytes=13033 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=54941 packets=95 bytes=15004 [ASSURED] mark=16777216 delta-time=1036
> > > > > > > > > > > > 
> > > > > > > > > > > > I would remove the if case here.
> > > > > > > > > > > 
> > > > > > > > > > > OK, I remove it and post a v2. Thanks!
> > > > > > > > > > 
> > > > > > > > > > Thanks and also the hardcoded TCP_CONNTRACK_ESTABLISHED in flow_offload_fixup_ct
> > > > > > > > > > should be reverted to the real tcp state ct->proto.tcp.state.
> > > > > > > > > 
> > > > > > > > > ct->proto.tcp.state contains the state that was observed before
> > > > > > > > > handing over this flow to the flowtable; in most cases, this should
> > > > > > > > > be TCP_CONNTRACK_ESTABLISHED.
> > > > > > > > > 
> > > > > > > > > > This way we always set the current TCP timeout.
> > > > > > > > > 
> > > > > > > > > I can keep it to ct->proto.tcp.state but I wonder if it is better to
> > > > > > > > > use a well known state such as TCP_CONNTRACK_ESTABLISHED to pick up from.
> > > > > > > > 
> > > > > > > > In case of a race condition, or if something is off like my TCP_FIN
> > > > > > > > that is being offloaded again, setting it to ESTABLISHED hard-coded
> > > > > > > > will make e.g. the FIN or CLOSE state last very long.
> > > > > > > > It is not guaranteed that we are still in ESTABLISHED when this code
> > > > > > > > runs. Also, for example, we could have seen both FINs already before the
> > > > > > > > flowtable gc runs.
> > > > > > > 
> > > > > > > OK, I just posted a v2 leaving things as is. I agree it is better to
> > > > > > > only address the issue you are observing at this time, it is possible
> > > > > > > to revisit later.
> > > > > > > 
> > > > > > > Thanks!
> > > > > > 
> > > > > > Thanks, I will give it another try.
> > > > > > I think for it to be foolproof we need
> > > > > > to migrate the TCP state as well in flow_offload_teardown_tcp to FIN or CLOSE.
> > > > > 
> > > > > My patch already does it:
> > > > > 
> > > > > +void flow_offload_teardown_tcp(struct flow_offload *flow, bool fin)
> > > > > +{
> > > > > +       enum tcp_conntrack tcp_state;
> > > > > +
> > > > > +       if (fin)
> > > > > +               tcp_state = TCP_CONNTRACK_FIN_WAIT;
> > > > > +       else /* rst */
> > > > > +               tcp_state = TCP_CONNTRACK_CLOSE;
> > > > > +
> > > > > +       flow_offload_fixup_tcp(nf_ct_net(flow->ct), flow->ct, tcp_state);
> > > > > 
> > > > > flow_offload_fixup_tcp() updates the TCP state to FIN / CLOSE state.
> > > > > 
> > > > > > The way things are right now, we are open to a race condition from the reverse side of the
> > > > > > connection re-offloading the connection while a FIN or RST is not processed by the netfilter code
> > > > > > running after the flowtable code.
> > > > > > The connection is still in TCP established state during that window, and another packet can just
> > > > > > push it back to the flowtable while the FIN or RST is not processed yet.
> > > > > 
> > > > > I don't see how:
> > > > > 
> > > > > static void nft_flow_offload_eval(const struct nft_expr *expr,
> > > > >                                   ...
> > > > > 
> > > > >         switch (ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.protonum) {
> > > > >         case IPPROTO_TCP:
> > > > >                 tcph = skb_header_pointer(pkt->skb, nft_thoff(pkt),
> > > > >                                           sizeof(_tcph), &_tcph);
> > > > >                 if (unlikely(!tcph || tcph->fin || tcph->rst ||
> > > > >                              !nf_conntrack_tcp_established(ct)))
> > > > >                         goto out;
> > > > > 
> > > > > this would now be either in FIN/CLOSE state.
> > > > > 
> > > > > FIN, RST packet does not trigger re-offload. And ACK packet would find
> > > > > the entry in !nf_conntrack_tcp_established(ct).
> > > > > 
> > > > > What path could trigger re-offload after my latest patch?
> > > > 
> > > > From looking through the nf conntrack tcp code you need to spin_lock
> > > > the TCP state change to avoid a race with another packet.
> > > 
> > > The flowtable owns the flow; packets belonging to the flow cannot update
> > > the TCP state while the flow is offloaded to the flowtable.
> > > 
> > > Once the _TEARDOWN flag is set, packets get back to the classic
> > > conntrack path.
> > 
> > Hmm alright, something is going wrong somewhere and it looks a lot like
> > a race condition :)
> 
> This check is racy: another packet could alter the ct state right
> after it evaluates true.
> 
>          switch (ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.protonum) {
>          case IPPROTO_TCP:
>                  tcph = skb_header_pointer(pkt->skb, nft_thoff(pkt),
>                                            sizeof(_tcph), &_tcph);
>                  if (unlikely(!tcph || tcph->fin || tcph->rst ||
>                               !nf_conntrack_tcp_established(ct))) <-------
>                          goto out;
> 
> Sequence would be:
> 
> 1) flow expires (after 30 seconds), goes back to conntrack in established state
> 2) packet re-offloads the flow and nf_conntrack_tcp_established(ct) evaluates true.
> 3) FIN packet races to update conntrack getting to FIN_WAIT while re-offloading
>    the flow.
> 
> Then you see FIN_WAIT and [OFFLOAD]; it could happen with an expired
> flow that goes back to conntrack.
> 
> But I am not sure yet if this is the case you're observing there.
> 
> > I mean just in theory it is not guaranteed that both directions send all
> > packets through the flowtable just because it is offloaded.
> > A variety of error checks might send a packet back to the slow path.
> 
> There is the MTU check that is lacking the teardown, but that should
> only affect UDP traffic. A patch from Felix decided to cache the MTU
> in the flow entry. That is also probably convenient to have, but it
> looks like a different issue; I will also post a patch for it.

I would be surprised if the MTU is an issue.
I will also add a counter/printk to the two checks from your patch just to make sure.
I will report back with my testing of v2 of the patches.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH nf] netfilter: flowtable: infer TCP state and timeout before flow teardown
  2024-03-20 12:37                       ` Pablo Neira Ayuso
  2024-03-20 13:37                         ` Sven Auhagen
@ 2024-04-08  5:24                         ` Sven Auhagen
  2024-04-09 11:11                           ` Pablo Neira Ayuso
  1 sibling, 1 reply; 27+ messages in thread
From: Sven Auhagen @ 2024-04-08  5:24 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, cratiu, ozsh, vladbu, gal, fw

On Wed, Mar 20, 2024 at 01:37:58PM +0100, Pablo Neira Ayuso wrote:
> On Wed, Mar 20, 2024 at 12:15:51PM +0100, Sven Auhagen wrote:
> > On Wed, Mar 20, 2024 at 11:47:50AM +0100, Pablo Neira Ayuso wrote:
> > > On Wed, Mar 20, 2024 at 11:29:05AM +0100, Sven Auhagen wrote:
> > > > On Wed, Mar 20, 2024 at 10:51:39AM +0100, Pablo Neira Ayuso wrote:
> > > > > On Wed, Mar 20, 2024 at 10:31:00AM +0100, Sven Auhagen wrote:
> > > > > > On Wed, Mar 20, 2024 at 10:27:30AM +0100, Pablo Neira Ayuso wrote:
> > > > > > > On Wed, Mar 20, 2024 at 10:20:29AM +0100, Sven Auhagen wrote:
> > > > > > > > On Wed, Mar 20, 2024 at 10:07:24AM +0100, Pablo Neira Ayuso wrote:
> > > > > > > > > On Wed, Mar 20, 2024 at 09:49:49AM +0100, Sven Auhagen wrote:
> > > > > > > > > > On Wed, Mar 20, 2024 at 09:45:16AM +0100, Pablo Neira Ayuso wrote:
> > > > > > > > > > > Hi Sven,
> > > > > > > > > > > 
> > > > > > > > > > > On Wed, Mar 20, 2024 at 09:39:16AM +0100, Sven Auhagen wrote:
> > > > > > > > > > > > On Mon, Mar 18, 2024 at 10:39:15AM +0100, Pablo Neira Ayuso wrote:
> > > > > > > > > > > [...]
> > > > > > > > > > > > > diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
> > > > > > > > > > > > > index a0571339239c..481fe3d96bbc 100644
> > > > > > > > > > > > > --- a/net/netfilter/nf_flow_table_core.c
> > > > > > > > > > > > > +++ b/net/netfilter/nf_flow_table_core.c
> > > > > > > > > > > > > @@ -165,10 +165,22 @@ void flow_offload_route_init(struct flow_offload *flow,
> > > > > > > > > > > > >  }
> > > > > > > > > > > > >  EXPORT_SYMBOL_GPL(flow_offload_route_init);
> > > > > > > > > > > > >  
> > > > > > > > > > > > > -static void flow_offload_fixup_tcp(struct ip_ct_tcp *tcp)
> > > > > > > > > > > > > +static s32 flow_offload_fixup_tcp(struct net *net, struct nf_conn *ct,
> > > > > > > > > > > > > +				  enum tcp_conntrack tcp_state)
> > > > > > > > > > > > >  {
> > > > > > > > > > > > > -	tcp->seen[0].td_maxwin = 0;
> > > > > > > > > > > > > -	tcp->seen[1].td_maxwin = 0;
> > > > > > > > > > > > > +	struct nf_tcp_net *tn = nf_tcp_pernet(net);
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +	ct->proto.tcp.state = tcp_state;
> > > > > > > > > > > > > +	ct->proto.tcp.seen[0].td_maxwin = 0;
> > > > > > > > > > > > > +	ct->proto.tcp.seen[1].td_maxwin = 0;
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +	/* Similar to mid-connection pickup with loose=1.
> > > > > > > > > > > > > +	 * Avoid large ESTABLISHED timeout.
> > > > > > > > > > > > > +	 */
> > > > > > > > > > > > > +	if (tcp_state == TCP_CONNTRACK_ESTABLISHED)
> > > > > > > > > > > > > +		return tn->timeouts[TCP_CONNTRACK_UNACK];
> > > > > > > > > > > > 
> > > > > > > > > > > > Hi Pablo,
> > > > > > > > > > > > 
> > > > > > > > > > > > I tested the patch, but the part that sets the timeout to UNACK is not
> > > > > > > > > > > > very practical.
> > > > > > > > > > > > For example, my long-running SSH connections get killed off by the firewall
> > > > > > > > > > > > regularly now while being ESTABLISHED:
> > > > > > > > > > > > 
> > > > > > > > > > > > [NEW] tcp      6 120 SYN_SENT src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 [UNREPLIED] src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
> > > > > > > > > > > > [UPDATE] tcp      6 60 SYN_RECV src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
> > > > > > > > > > > > [UPDATE] tcp      6 86400 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 [OFFLOAD] mark=16777216
> > > > > > > > > > > > 
> > > > > > > > > > > > [DESTROY] tcp      6 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=54941 dport=22 packets=133 bytes=13033 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=54941 packets=95 bytes=15004 [ASSURED] mark=16777216 delta-time=1036
> > > > > > > > > > > > 
> > > > > > > > > > > > I would remove the if case here.
> > > > > > > > > > > 
> > > > > > > > > > > OK, I remove it and post a v2. Thanks!
> > > > > > > > > > 
> > > > > > > > > > Thanks and also the hardcoded TCP_CONNTRACK_ESTABLISHED in flow_offload_fixup_ct
> > > > > > > > > > should be reverted to the real tcp state ct->proto.tcp.state.
> > > > > > > > > 
> > > > > > > > > ct->proto.tcp.state contains the state that was observed before
> > > > > > > > > handing over this flow to the flowtable; in most cases, this should
> > > > > > > > > be TCP_CONNTRACK_ESTABLISHED.
> > > > > > > > > 
> > > > > > > > > > This way we always set the current TCP timeout.
> > > > > > > > > 
> > > > > > > > > I can keep it to ct->proto.tcp.state but I wonder if it is better to
> > > > > > > > > use a well known state such as TCP_CONNTRACK_ESTABLISHED to pick up from.
> > > > > > > > 
> > > > > > > > In case of a race condition, or if something is off like my TCP_FIN
> > > > > > > > that is being offloaded again, setting it to ESTABLISHED hard-coded
> > > > > > > > will make e.g. the FIN or CLOSE state last very long.
> > > > > > > > It is not guaranteed that we are still in ESTABLISHED when this code
> > > > > > > > runs. Also, for example, we could have seen both FINs already before the
> > > > > > > > flowtable gc runs.
> > > > > > > 
> > > > > > > OK, I just posted a v2 leaving things as is. I agree it is better to
> > > > > > > only address the issue you are observing at this time, it is possible
> > > > > > > to revisit later.
> > > > > > > 
> > > > > > > Thanks!
> > > > > > 
> > > > > > Thanks, I will give it another try.
> > > > > > I think for it to be foolproof we need
> > > > > > to migrate the TCP state as well in flow_offload_teardown_tcp to FIN or CLOSE.
> > > > > 
> > > > > My patch already does it:
> > > > > 
> > > > > +void flow_offload_teardown_tcp(struct flow_offload *flow, bool fin)
> > > > > +{
> > > > > +       enum tcp_conntrack tcp_state;
> > > > > +
> > > > > +       if (fin)
> > > > > +               tcp_state = TCP_CONNTRACK_FIN_WAIT;
> > > > > +       else /* rst */
> > > > > +               tcp_state = TCP_CONNTRACK_CLOSE;
> > > > > +
> > > > > +       flow_offload_fixup_tcp(nf_ct_net(flow->ct), flow->ct, tcp_state);
> > > > > 
> > > > > flow_offload_fixup_tcp() updates the TCP state to FIN / CLOSE state.
> > > > > 
> > > > > > The way things are right now, we are open to a race condition from the reverse side of the
> > > > > > connection re-offloading the connection while a FIN or RST is not processed by the netfilter code
> > > > > > running after the flowtable code.
> > > > > > The connection is still in TCP established state during that window, and another packet can just
> > > > > > push it back to the flowtable while the FIN or RST is not processed yet.
> > > > > 
> > > > > I don't see how:
> > > > > 
> > > > > static void nft_flow_offload_eval(const struct nft_expr *expr,
> > > > >                                   ...
> > > > > 
> > > > >         switch (ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.protonum) {
> > > > >         case IPPROTO_TCP:
> > > > >                 tcph = skb_header_pointer(pkt->skb, nft_thoff(pkt),
> > > > >                                           sizeof(_tcph), &_tcph);
> > > > >                 if (unlikely(!tcph || tcph->fin || tcph->rst ||
> > > > >                              !nf_conntrack_tcp_established(ct)))
> > > > >                         goto out;
> > > > > 
> > > > > this would now be either in FIN/CLOSE state.
> > > > > 
> > > > > FIN, RST packet does not trigger re-offload. And ACK packet would find
> > > > > the entry in !nf_conntrack_tcp_established(ct).
> > > > > 
> > > > > What path could trigger re-offload after my latest patch?
> > > > 
> > > > From looking through the nf conntrack tcp code you need to spin_lock
> > > > the TCP state change to avoid a race with another packet.
> > > 
> > > The flowtable owns the flow; packets belonging to the flow cannot update
> > > the TCP state while the flow is offloaded to the flowtable.
> > > 
> > > Once the _TEARDOWN flag is set, packets go back to the classic
> > > conntrack path.
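> > > 
> > > The datapath lookup already refuses flows in teardown, roughly like
> > > this (paraphrasing flow_offload_lookup() in nf_flow_table_core.c from
> > > memory):
> > > 
> > >         tuplehash = rhashtable_lookup(&flow_table->rhashtable, tuple,
> > >                                       nf_flow_offload_rhash_params);
> > >         if (!tuplehash)
> > >                 return NULL;
> > > 
> > >         dir = tuplehash->tuple.dir;
> > >         flow = container_of(tuplehash, struct flow_offload,
> > >                             tuplehash[dir]);
> > >         /* flows in teardown no longer take the fastpath */
> > >         if (test_bit(NF_FLOW_TEARDOWN, &flow->flags))
> > >                 return NULL;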
> > 
> > Hmm, alright, something is going wrong somewhere and it looks a lot
> > like a race condition :)
> 
> This check is racy; another packet could alter the ct state right
> after it evaluates true.
> 
>          switch (ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.protonum) {
>          case IPPROTO_TCP:
>                  tcph = skb_header_pointer(pkt->skb, nft_thoff(pkt),
>                                            sizeof(_tcph), &_tcph);
>                  if (unlikely(!tcph || tcph->fin || tcph->rst ||
>                               !nf_conntrack_tcp_established(ct))) <-------
>                          goto out;
> 
> Sequence would be:
> 
> 1) flow expires (after 30 seconds), goes back to conntrack in established state
> 2) packet re-offloads the flow and nf_conntrack_tcp_established(ct) evaluates true.
> 3) a FIN packet races to update conntrack, reaching FIN_WAIT, while the
>    flow is being re-offloaded.
> 
> Then you see FIN_WAIT while offloaded; this could happen with an
> expired flow that goes back to conntrack.
> 
> But I am not sure yet if this is the case you're observing there.
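> 
> Not tested, but one way to narrow that window could be to re-check the
> conntrack state once the flow entry has been added; the re-check after
> flow_offload_add() below is the hypothetical part:
> 
>         ret = flow_offload_add(flowtable, flow);
>         if (ret < 0)
>                 goto err_flow_add;
> 
>         /* a FIN/RST may have raced in between the first check and
>          * flow_offload_add(); if so, undo the offload right away.
>          */
>         if (unlikely(!nf_conntrack_tcp_established(ct)))
>                 flow_offload_teardown(flow);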
> 
> > I mean, just in theory, it is not guaranteed that both directions send
> > all packets through the flowtable just because it is offloaded.
> > A variety of error checks might send a packet back to the slow path.
> 
> There is the mtu check that is lacking the teardown, but that should
> only affect UDP traffic. A patch from Felix cached the mtu in the flow
> entry; that is probably convenient to have, but it looks like a
> different issue. I will post a patch for this one as well.
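> 
> Roughly something like this in the forward path (untested sketch, the
> exact context around nf_flow_exceeds_mtu() may differ):
> 
>         if (unlikely(nf_flow_exceeds_mtu(skb, mtu))) {
>                 flow_offload_teardown(flow);
>                 return 0;       /* packet takes the classic path */
>         }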

Hi Pablo,

after some testing, the problem only happens very rarely now.
I suspect it happens only on connections that are at some point
one-way only or in some other way no longer in a correct state.
Nevertheless, your latest patches are very good and reduce the problem
of FIN_WAIT being offloaded to an absolute minimum, and the timeout
is correct now.

Here is one example of a flow that is still in FIN_WAIT:

[NEW] tcp      6 120 SYN_SENT src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 [UNREPLIED] src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 mark=16777216
[UPDATE] tcp      6 60 SYN_RECV src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 mark=16777216
[UPDATE] tcp      6 86400 ESTABLISHED src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 [OFFLOAD] mark=16777216
[UPDATE] tcp      6 120 FIN_WAIT src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 [OFFLOAD] mark=16777216
[UPDATE] tcp      6 30 LAST_ACK src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 [ASSURED] mark=16777216
 [UPDATE] tcp      6 120 TIME_WAIT src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 [ASSURED] mark=16777216
 [DESTROY] tcp      6 TIME_WAIT src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 packets=15 bytes=1750 src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 packets=13 bytes=6905 [ASSURED] mark=16777216 delta-time=120

Best
Sven


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH nf] netfilter: flowtable: infer TCP state and timeout before flow teardown
  2024-04-08  5:24                         ` Sven Auhagen
@ 2024-04-09 11:11                           ` Pablo Neira Ayuso
  2024-04-09 11:35                             ` Sven Auhagen
  0 siblings, 1 reply; 27+ messages in thread
From: Pablo Neira Ayuso @ 2024-04-09 11:11 UTC (permalink / raw)
  To: Sven Auhagen; +Cc: netfilter-devel, cratiu, ozsh, vladbu, gal, fw

Hi Sven,

On Mon, Apr 08, 2024 at 07:24:43AM +0200, Sven Auhagen wrote:
> Hi Pablo,
> 
> after some testing, the problem only happens very rarely now.
> I suspect it happens only on connections that are at some point
> one-way only or in some other way no longer in a correct state.
> Nevertheless, your latest patches are very good and reduce the problem
> of FIN_WAIT being offloaded to an absolute minimum, and the timeout
> is correct now.

Thanks for testing, I am going to submit this patch.

If you have a bit more cycles, I still would like to know what corner
case scenario is still triggering this so...

> Here is one example of a flow that is still in FIN_WAIT:
> 
> [NEW] tcp      6 120 SYN_SENT src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 [UNREPLIED] src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 mark=16777216
> [UPDATE] tcp      6 60 SYN_RECV src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 mark=16777216
> [UPDATE] tcp      6 86400 ESTABLISHED src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 [OFFLOAD] mark=16777216
> [UPDATE] tcp      6 120 FIN_WAIT src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 [OFFLOAD] mark=16777216
> [UPDATE] tcp      6 30 LAST_ACK src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 [ASSURED] mark=16777216
>  [UPDATE] tcp      6 120 TIME_WAIT src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 [ASSURED] mark=16777216
>  [DESTROY] tcp      6 TIME_WAIT src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 packets=15 bytes=1750 src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 packets=13 bytes=6905 [ASSURED] mark=16777216 delta-time=120

... could you run conntrack -E -o timestamp? I'd like to know if this is
a flow that is handed back to the classic path after 30 seconds and then
placed in the flowtable again.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH nf] netfilter: flowtable: infer TCP state and timeout before flow teardown
  2024-04-09 11:11                           ` Pablo Neira Ayuso
@ 2024-04-09 11:35                             ` Sven Auhagen
  2024-04-11  9:27                               ` Pablo Neira Ayuso
  0 siblings, 1 reply; 27+ messages in thread
From: Sven Auhagen @ 2024-04-09 11:35 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, cratiu, ozsh, vladbu, gal, fw

On Tue, Apr 09, 2024 at 01:11:43PM +0200, Pablo Neira Ayuso wrote:
> Hi Sven,
> 
> On Mon, Apr 08, 2024 at 07:24:43AM +0200, Sven Auhagen wrote:
> > Hi Pablo,
> > 
> > after some testing, the problem only happens very rarely now.
> > I suspect it happens only on connections that are at some point
> > one-way only or in some other way no longer in a correct state.
> > Nevertheless, your latest patches are very good and reduce the problem
> > of FIN_WAIT being offloaded to an absolute minimum, and the timeout
> > is correct now.
> 
> Thanks for testing, I am going to submit this patch.
> 
> If you have a bit more cycles, I still would like to know what corner
> case scenario is still triggering this so...
> 
> > Here is one example of a flow that is still in FIN_WAIT:
> > 
> > [NEW] tcp      6 120 SYN_SENT src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 [UNREPLIED] src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 mark=16777216
> > [UPDATE] tcp      6 60 SYN_RECV src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 mark=16777216
> > [UPDATE] tcp      6 86400 ESTABLISHED src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 [OFFLOAD] mark=16777216
> > [UPDATE] tcp      6 120 FIN_WAIT src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 [OFFLOAD] mark=16777216
> > [UPDATE] tcp      6 30 LAST_ACK src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 [ASSURED] mark=16777216
> >  [UPDATE] tcp      6 120 TIME_WAIT src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 [ASSURED] mark=16777216
> >  [DESTROY] tcp      6 TIME_WAIT src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 packets=15 bytes=1750 src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 packets=13 bytes=6905 [ASSURED] mark=16777216 delta-time=120
> 
> ... could you run conntrack -E -o timestamp? I'd like to know if this is
> a flow that is handed back to the classic path after 30 seconds and then
> placed in the flowtable again.

Sure, here is a fresh output:

[1712662404.573225]	    [NEW] tcp      6 120 SYN_SENT src=192.168.7.101 dst=157.240.251.61 sport=52717 dport=5222 [UNREPLIED] src=157.240.251.61 dst=87.138.198.79 sport=5222 dport=26886 mark=25165825
[1712662404.588094]	 [UPDATE] tcp      6 60 SYN_RECV src=192.168.7.101 dst=157.240.251.61 sport=52717 dport=5222 src=157.240.251.61 dst=87.138.198.79 sport=5222 dport=26886 mark=25165825
[1712662404.591802]	 [UPDATE] tcp      6 86400 ESTABLISHED src=192.168.7.101 dst=157.240.251.61 sport=52717 dport=5222 src=157.240.251.61 dst=87.138.198.79 sport=5222 dport=26886 [OFFLOAD] mark=25165825
[1712662405.682563]	 [UPDATE] tcp      6 120 FIN_WAIT src=192.168.7.101 dst=157.240.251.61 sport=52717 dport=5222 src=157.240.251.61 dst=87.138.198.79 sport=5222 dport=26886 [OFFLOAD] mark=25165825
[1712662405.689501]	 [UPDATE] tcp      6 30 LAST_ACK src=192.168.7.101 dst=157.240.251.61 sport=52717 dport=5222 src=157.240.251.61 dst=87.138.198.79 sport=5222 dport=26886 [ASSURED] mark=25165825
[1712662405.704370]	 [UPDATE] tcp      6 120 TIME_WAIT src=192.168.7.101 dst=157.240.251.61 sport=52717 dport=5222 src=157.240.251.61 dst=87.138.198.79 sport=5222 dport=26886 [ASSURED] mark=25165825
[1712662451.967906]	[DESTROY] tcp      6 ESTABLISHED src=192.168.6.122 dst=52.98.243.2 sport=52717 dport=443 packets=14 bytes=4134 src=52.98.243.2 dst=37.24.174.42 sport=443 dport=20116 packets=17 bytes=13712 [ASSURED] mark=16777216 delta-time=140




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH nf] netfilter: flowtable: infer TCP state and timeout before flow teardown
  2024-04-09 11:35                             ` Sven Auhagen
@ 2024-04-11  9:27                               ` Pablo Neira Ayuso
  2024-04-11 11:05                                 ` Florian Westphal
  0 siblings, 1 reply; 27+ messages in thread
From: Pablo Neira Ayuso @ 2024-04-11  9:27 UTC (permalink / raw)
  To: Sven Auhagen; +Cc: netfilter-devel, cratiu, ozsh, vladbu, gal, fw

[-- Attachment #1: Type: text/plain, Size: 4800 bytes --]

On Tue, Apr 09, 2024 at 01:35:15PM +0200, Sven Auhagen wrote:
> On Tue, Apr 09, 2024 at 01:11:43PM +0200, Pablo Neira Ayuso wrote:
> > Hi Sven,
> > 
> > On Mon, Apr 08, 2024 at 07:24:43AM +0200, Sven Auhagen wrote:
> > > Hi Pablo,
> > > 
> > > after some testing, the problem only happens very rarely now.
> > > I suspect it happens only on connections that are at some point
> > > one-way only or in some other way no longer in a correct state.
> > > Nevertheless, your latest patches are very good and reduce the problem
> > > of FIN_WAIT being offloaded to an absolute minimum, and the timeout
> > > is correct now.
> > 
> > Thanks for testing, I am going to submit this patch.
> > 
> > If you have a bit more cycles, I still would like to know what corner
> > case scenario is still triggering this so...
> > 
> > > Here is one example of a flow that is still in FIN_WAIT:
> > > 
> > > [NEW] tcp      6 120 SYN_SENT src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 [UNREPLIED] src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 mark=16777216
> > > [UPDATE] tcp      6 60 SYN_RECV src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 mark=16777216
> > > [UPDATE] tcp      6 86400 ESTABLISHED src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 [OFFLOAD] mark=16777216
> > > [UPDATE] tcp      6 120 FIN_WAIT src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 [OFFLOAD] mark=16777216
> > > [UPDATE] tcp      6 30 LAST_ACK src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 [ASSURED] mark=16777216
> > >  [UPDATE] tcp      6 120 TIME_WAIT src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 [ASSURED] mark=16777216
> > >  [DESTROY] tcp      6 TIME_WAIT src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 packets=15 bytes=1750 src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 packets=13 bytes=6905 [ASSURED] mark=16777216 delta-time=120
> > 
> > ... could you run conntrack -E -o timestamp? I'd like to know if this is
> > a flow that is handed back to the classic path after 30 seconds and then
> > placed in the flowtable again.
> 
> Sure, here is a fresh output:
> 
> [1712662404.573225]	    [NEW] tcp      6 120 SYN_SENT src=192.168.7.101 dst=157.240.251.61 sport=52717 dport=5222 [UNREPLIED] src=157.240.251.61 dst=87.138.198.79 sport=5222 dport=26886 mark=25165825
> [1712662404.588094]	 [UPDATE] tcp      6 60 SYN_RECV src=192.168.7.101 dst=157.240.251.61 sport=52717 dport=5222 src=157.240.251.61 dst=87.138.198.79 sport=5222 dport=26886 mark=25165825
> [1712662404.591802]	 [UPDATE] tcp      6 86400 ESTABLISHED src=192.168.7.101 dst=157.240.251.61 sport=52717 dport=5222 src=157.240.251.61 dst=87.138.198.79 sport=5222 dport=26886 [OFFLOAD] mark=25165825
> [1712662405.682563]	 [UPDATE] tcp      6 120 FIN_WAIT src=192.168.7.101 dst=157.240.251.61 sport=52717 dport=5222 src=157.240.251.61 dst=87.138.198.79 sport=5222 dport=26886 [OFFLOAD] mark=25165825

It is happening right away, a bit more than 1 second later.

I can also see that IP_CT_TCP_FLAG_CLOSE_INIT is not set when ct->state
is adjusted to the _FIN_WAIT state in the fixup routine.

There are also packets that might trigger transitions to sIG
(ignored) in TCP tracking. conntrack -E is probably misleading here,
because it does not let us see what actual packets come after the
fin.

Can you give this untested patch a try? It applies on top of the
two patches you already have for TCP and UDP.

> [1712662405.689501]	 [UPDATE] tcp      6 30 LAST_ACK src=192.168.7.101 dst=157.240.251.61 sport=52717 dport=5222 src=157.240.251.61 dst=87.138.198.79 sport=5222 dport=26886 [ASSURED] mark=25165825
> [1712662405.704370]	 [UPDATE] tcp      6 120 TIME_WAIT src=192.168.7.101 dst=157.240.251.61 sport=52717 dport=5222 src=157.240.251.61 dst=87.138.198.79 sport=5222 dport=26886 [ASSURED] mark=25165825

Can you also enable -o id,timestamp? The id should tell us whether we
are always talking about the same object.

> [1712662451.967906]	[DESTROY] tcp      6 ESTABLISHED src=192.168.6.122 dst=52.98.243.2 sport=52717 dport=443 packets=14 bytes=4134 src=52.98.243.2 dst=37.24.174.42 sport=443 dport=20116 packets=17 bytes=13712 [ASSURED] mark=16777216 delta-time=140

[-- Attachment #2: adjust-ct-tcp-flags.patch --]
[-- Type: text/x-diff, Size: 3283 bytes --]

diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h
index 8507deed6b1f..2388c4fe99a0 100644
--- a/include/net/netfilter/nf_flow_table.h
+++ b/include/net/netfilter/nf_flow_table.h
@@ -293,7 +293,7 @@ int nf_flow_table_init(struct nf_flowtable *flow_table);
 void nf_flow_table_free(struct nf_flowtable *flow_table);
 
 void flow_offload_teardown(struct flow_offload *flow);
-void flow_offload_teardown_tcp(struct flow_offload *flow, bool fin);
+void flow_offload_teardown_tcp(struct flow_offload *flow, bool fin, int dir);
 
 void nf_flow_snat_port(const struct flow_offload *flow,
 		       struct sk_buff *skb, unsigned int thoff,
diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
index 16068ef04490..e3a7d8eff727 100644
--- a/net/netfilter/nf_flow_table_core.c
+++ b/net/netfilter/nf_flow_table_core.c
@@ -350,18 +350,21 @@ void flow_offload_teardown(struct flow_offload *flow)
 }
 EXPORT_SYMBOL_GPL(flow_offload_teardown);
 
-void flow_offload_teardown_tcp(struct flow_offload *flow, bool fin)
+void flow_offload_teardown_tcp(struct flow_offload *flow, bool fin, int dir)
 {
+	struct nf_conn *ct = flow->ct;
 	enum tcp_conntrack tcp_state;
 
-	if (fin)
+	if (fin) {
 		tcp_state = TCP_CONNTRACK_FIN_WAIT;
-	else /* rst */
+		ct->proto.tcp.seen[dir].flags |= IP_CT_TCP_FLAG_CLOSE_INIT;
+	} else { /* rst */
 		tcp_state = TCP_CONNTRACK_CLOSE;
+	}
 
-	flow_offload_fixup_tcp(nf_ct_net(flow->ct), flow->ct, tcp_state);
+	flow_offload_fixup_tcp(nf_ct_net(ct), ct, tcp_state);
 	smp_mb__before_atomic();
-	clear_bit(IPS_OFFLOAD_BIT, &flow->ct->status);
+	clear_bit(IPS_OFFLOAD_BIT, &ct->status);
 	set_bit(NF_FLOW_TEARDOWN, &flow->flags);
 }
 EXPORT_SYMBOL_GPL(flow_offload_teardown_tcp);
diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c
index 9840ab5e3ae6..04fabfa9ff7e 100644
--- a/net/netfilter/nf_flow_table_ip.c
+++ b/net/netfilter/nf_flow_table_ip.c
@@ -20,7 +20,7 @@
 #include <linux/udp.h>
 
 static int nf_flow_state_check(struct flow_offload *flow, int proto,
-			       struct sk_buff *skb, unsigned int thoff)
+			       struct sk_buff *skb, unsigned int thoff, int dir)
 {
 	struct tcphdr *tcph;
 
@@ -29,7 +29,7 @@ static int nf_flow_state_check(struct flow_offload *flow, int proto,
 
 	tcph = (void *)(skb_network_header(skb) + thoff);
 	if (unlikely(tcph->fin || tcph->rst)) {
-		flow_offload_teardown_tcp(flow, tcph->fin);
+		flow_offload_teardown_tcp(flow, tcph->fin, dir);
 		return -1;
 	}
 
@@ -379,7 +379,7 @@ static int nf_flow_offload_forward(struct nf_flowtable_ctx *ctx,
 
 	iph = (struct iphdr *)(skb_network_header(skb) + ctx->offset);
 	thoff = (iph->ihl * 4) + ctx->offset;
-	if (nf_flow_state_check(flow, iph->protocol, skb, thoff))
+	if (nf_flow_state_check(flow, iph->protocol, skb, thoff, dir))
 		return 0;
 
 	if (!nf_flow_dst_check(&tuplehash->tuple)) {
@@ -658,7 +658,7 @@ static int nf_flow_offload_ipv6_forward(struct nf_flowtable_ctx *ctx,
 
 	ip6h = (struct ipv6hdr *)(skb_network_header(skb) + ctx->offset);
 	thoff = sizeof(*ip6h) + ctx->offset;
-	if (nf_flow_state_check(flow, ip6h->nexthdr, skb, thoff))
+	if (nf_flow_state_check(flow, ip6h->nexthdr, skb, thoff, dir))
 		return 0;
 
 	if (!nf_flow_dst_check(&tuplehash->tuple)) {

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH nf] netfilter: flowtable: infer TCP state and timeout before flow teardown
  2024-04-11  9:27                               ` Pablo Neira Ayuso
@ 2024-04-11 11:05                                 ` Florian Westphal
  2024-04-11 11:40                                   ` Pablo Neira Ayuso
  0 siblings, 1 reply; 27+ messages in thread
From: Florian Westphal @ 2024-04-11 11:05 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Sven Auhagen, netfilter-devel, cratiu, ozsh, vladbu, gal, fw

Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> I can also see that IP_CT_TCP_FLAG_CLOSE_INIT is not set when ct->state
> is adjusted to the _FIN_WAIT state in the fixup routine.

Unrelated to this patch, but I think that there is an increasing and
disturbing amount of code that attempts to 'fix' the ct state.

I don't think it's right, and I intend to remove all of these "fixups"
of the conntrack state from flowtable infra.

I see no reason whatsoever why we need to do this, fin can be passed up
to conntrack and conntrack can and should handle this without any extra
mucking with the nf_conn state fields from flowtable infra.

The only cases where I see why we need to take action from
flowtable layer are:

1. timeout extensions of nf_conn from gc worker to prevent eviction
2. removal of the flowtable entry on RST reception. Don't see why that
   needs state fixup of nf_conn (see the sketch after this list).
3. removal of the flowtable entry on hard failure of
   output routines, e.g. because route is stale.
   Don't see why that needs any nf_conn changes either.
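
Something like this should be all that 2. and 3. need, without any
nf_conn fixup (hypothetical helper name, untested sketch that mirrors
the existing flow_offload_teardown_tcp() minus the fixup call):

static void flow_offload_teardown_nofixup(struct flow_offload *flow)
{
	/* hand the flow back to conntrack without touching
	 * ct->proto.tcp; conntrack re-learns the state from the
	 * packets it now sees on the classic path.
	 */
	smp_mb__before_atomic();
	clear_bit(IPS_OFFLOAD_BIT, &flow->ct->status);
	set_bit(NF_FLOW_TEARDOWN, &flow->flags);
}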

My impression is that all these conditionals paper over some other
bugs; for example, the gc_worker extending the timeout races with the
datapath, and this needs to be fixed first.

I plan to work on this after the selftest fixups.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH nf] netfilter: flowtable: infer TCP state and timeout before flow teardown
  2024-04-11 11:05                                 ` Florian Westphal
@ 2024-04-11 11:40                                   ` Pablo Neira Ayuso
  2024-04-11 12:13                                     ` Florian Westphal
  0 siblings, 1 reply; 27+ messages in thread
From: Pablo Neira Ayuso @ 2024-04-11 11:40 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Sven Auhagen, netfilter-devel, cratiu, ozsh, vladbu, gal

On Thu, Apr 11, 2024 at 01:05:04PM +0200, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > I can also see that IP_CT_TCP_FLAG_CLOSE_INIT is not set when ct->state
> > is adjusted to the _FIN_WAIT state in the fixup routine.
> 
> Unrelated to this patch, but I think that there is an increasing and
> disturbing amount of code that attempts to 'fix' the ct state.

I originally started this infrastructure without the fixup, and there
was a bit of fuss about it, because people complained that the TCP
connection somewhat becomes stateless and would also get stuck after
a fin packet is seen.

There is also connlimit, which would be distorted after the TCP flow
becomes stateless when handing it over to the flowtable.

Another reason is that hardware offload folks need this; otherwise the
entry remains too long in the hardware flowtable.

> I don't think it's right, and I intend to remove all of these "fixups"
> of the conntrack state from flowtable infra.
> 
> I see no reason whatsoever why we need to do this, fin can be passed up
> to conntrack and conntrack can and should handle this without any extra
> mucking with the nf_conn state fields from flowtable infra.

You mean, just let the fin packet go without tearing down the flow
entry?

> The only cases where I see why we need to take action from
> flowtable layer are:
> 
> 1. timeout extensions of nf_conn from gc worker to prevent eviction
> 2. removal of the flowtable entry on RST reception. Don't see why that
>    needs state fixup of nf_conn.

Removing it right away is what you propose, then?

> 3. removal of the flowtable entry on hard failure of
>    output routines, e.g. because route is stale.
>    Don't see why that needs any nf_conn changes either.

if dst is stale, packet needs to go back to classic path to get a
fresh route.

> My impression is that all these conditionals paper over some other
> bugs; for example, the gc_worker extending the timeout races with the
> datapath, and this needs to be fixed first.

That sounds good. But I am afraid some folks will not be happy if TCP
flow becomes stateless again.

> I plan to work on this after the selftest fixups.

Thanks.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH nf] netfilter: flowtable: infer TCP state and timeout before flow teardown
  2024-04-11 11:40                                   ` Pablo Neira Ayuso
@ 2024-04-11 12:13                                     ` Florian Westphal
  2024-04-11 15:50                                       ` Pablo Neira Ayuso
  0 siblings, 1 reply; 27+ messages in thread
From: Florian Westphal @ 2024-04-11 12:13 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Florian Westphal, Sven Auhagen, netfilter-devel, cratiu, ozsh,
	vladbu, gal

Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > I see no reason whatsoever why we need to do this, fin can be passed up
> > to conntrack and conntrack can and should handle this without any extra
> > mucking with the nf_conn state fields from flowtable infra.
> 
> You mean, just let the fin packet go without tearing down the flow
> entry?

Yes, that's what I meant.  RST would still remove the flow entry.

> > The only cases where I see why we need to take action from
> > flowtable layer are:
> > 
> > 1. timeout extensions of nf_conn from gc worker to prevent eviction
> > 2. removal of the flowtable entry on RST reception. Don't see why that
> >    needs state fixup of nf_conn.
> 
> Removing it right away is what you propose, then?

Isn't that what is happening right now?  We set the TEARDOWN bit and
remove the OFFLOAD bit from nf_conn.

I think we should NOT do it for FIN and let conntrack handle this,
but we should still do it for RST.

Technically I think we could also skip it for RST, but I don't see
a big advantage.  For FIN it will keep offloading in place, which is
a win for connections where one end still sends more data.
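
In nf_flow_state_check() that would look roughly like this (untested
sketch of the idea):

	tcph = (void *)(skb_network_header(skb) + thoff);
	if (unlikely(tcph->rst)) {
		/* rst: drop the flow entry, no ct fixup */
		flow_offload_teardown(flow);
		return -1;
	}
	/* fin: keep the flow entry in place, but push this packet to
	 * the classic path so conntrack tracks the shutdown itself.
	 */
	if (unlikely(tcph->fin))
		return -1;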

> > 3. removal of the flowtable entry on hard failure of
> >    output routines, e.g. because route is stale.
> >    Don't see why that needs any nf_conn changes either.
> 
> if dst is stale, packet needs to go back to classic path to get a
> fresh route.

Yes, sure.  But I would keep the teardown that we have now in place;
I meant to say that the current code makes sense to me, i.e.:

if (!nf_flow_dst_check(&tuplehash->tuple)) {
	flow_offload_teardown(flow);
	return 0;
}

> > My impression is that all these conditionals paper over some other
> > bugs; for example, the gc_worker extending the timeout races with the
> > datapath, and this needs to be fixed first.
> 
> That sounds good. But I am afraid some folks will not be happy if TCP
> flow becomes stateless again.

I do not know what that means.  There can never be a flowtable entry
without a backing nf_conn, so I don't know what stateless means in this
context.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH nf] netfilter: flowtable: infer TCP state and timeout before flow teardown
  2024-04-11 12:13                                     ` Florian Westphal
@ 2024-04-11 15:50                                       ` Pablo Neira Ayuso
  2024-04-19  7:47                                         ` Sven Auhagen
  0 siblings, 1 reply; 27+ messages in thread
From: Pablo Neira Ayuso @ 2024-04-11 15:50 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Sven Auhagen, netfilter-devel, cratiu, ozsh, vladbu, gal

On Thu, Apr 11, 2024 at 02:13:20PM +0200, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > I see no reason whatsoever why we need to do this, fin can be passed up
> > > to conntrack and conntrack can and should handle this without any extra
> > > mucking with the nf_conn state fields from flowtable infra.
> > 
> > You mean, just let the fin packet go without tearing down the flow
> > entry?
> 
> Yes, that's what I meant.  RST would still remove the flow entry.

So the flow entry would remain in place if a fin packet is seen. Then,
an ack packet will follow the fastpath while a fin packet in the other
direction will follow the classic path.

> > > The only cases where I see why we need to take action from
> > > flowtable layer are:
> > > 
> > > 1. timeout extensions of nf_conn from gc worker to prevent eviction
> > > 2. removal of the flowtable entry on RST reception. Don't see why that
> > >    needs state fixup of nf_conn.
> > 
> > Removing it right away is what you propose, then?
> 
> Isn't that what is happening right now?  We set the TEARDOWN bit and
> remove the OFFLOAD bit from nf_conn.
> 
> I think we should NOT do it for FIN and let conntrack handle this,
> but we should still do it for RST.

Conntrack will then have to deal with an entry that is completely
out of sync; will that work? At least a fixup to disable sequence
tracking will be required?

> Technically I think we could also skip it for RST, but I don't see
> a big advantage.  For FIN it will keep offloading in place, which is
> a win for connections where one end still sends more data.

I see.

> > > 3. removal of the flowtable entry on hard failure of
> > >    output routines, e.g. because route is stale.
> > >    Don't see why that needs any nf_conn changes either.
> > 
> > if dst is stale, packet needs to go back to classic path to get a
> > fresh route.
> 
> Yes, sure.  But I would keep the teardown that we have now in place;
> I meant to say that the current code makes sense to me, i.e.:
> 
> if (!nf_flow_dst_check(&tuplehash->tuple)) {
> 	flow_offload_teardown(flow);
> 	return 0;
> }

I see, I misunderstood.

> > > My impression is that all these conditionals paper over some other
> > > bugs; for example, the gc_worker extending the timeout races with the
> > > datapath, and this needs to be fixed first.
> > 
> > That sounds good. But I am afraid some folks will not be happy if TCP
> > flow becomes stateless again.
> 
> I do not know what that means.  There can never be a flowtable entry
> without a backing nf_conn, so I don't know what stateless means in this
> context.

If fin does not tear down the flow entry, then the flow entry remains
in place and holds a reference to the conntrack object, which will
not be released until 30 seconds of no traffic activity, right?
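
For reference, the gc side currently does roughly this (paraphrased
from memory from nf_flow_table_core.c, NF_FLOW_TIMEOUT being 30
seconds; the ct reference is only dropped once the flow is freed):

	if (nf_flow_has_expired(flow) || nf_ct_is_dying(flow->ct))
		flow_offload_teardown(flow);

	if (test_bit(NF_FLOW_TEARDOWN, &flow->flags))
		flow_offload_del(flow_table, flow);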

Maybe I am just missing part of what you have in mind; I still don't
see the big picture.

Would you make a quick summary of the proposed new logic?

Thanks!

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH nf] netfilter: flowtable: infer TCP state and timeout before flow teardown
  2024-04-11 15:50                                       ` Pablo Neira Ayuso
@ 2024-04-19  7:47                                         ` Sven Auhagen
  0 siblings, 0 replies; 27+ messages in thread
From: Sven Auhagen @ 2024-04-19  7:47 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Florian Westphal, netfilter-devel, cratiu, ozsh, vladbu, gal

On Thu, Apr 11, 2024 at 05:50:30PM +0200, Pablo Neira Ayuso wrote:
> On Thu, Apr 11, 2024 at 02:13:20PM +0200, Florian Westphal wrote:
> > Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > > I see no reason whatsoever why we need to do this, fin can be passed up
> > > > to conntrack and conntrack can and should handle this without any extra
> > > > mucking with the nf_conn state fields from flowtable infra.
> > > 
> > > You mean, just let the fin packet go without tearing down the flow
> > > entry?
> > 
> > Yes, that's what I meant.  RST would still remove the flow entry.
> 
> So the flow entry would remain in place if a fin packet is seen. Then,
> an ack packet will follow the fastpath while a fin packet in the other
> direction will follow the classic path.
> 
> > > > The only cases where I see why we need to take action from
> > > > flowtable layer are:
> > > > 
> > > > 1. timeout extensions of nf_conn from gc worker to prevent eviction
> > > > 2. removal of the flowtable entry on RST reception. Don't see why that
> > > >    needs state fixup of nf_conn.
> > > 
> > > Removing it right away is what you propose, then?
> > 
> > Isn't that what is happening right now?  We set the TEARDOWN bit and
> > remove the OFFLOAD bit from nf_conn.
> > 
> > I think we should NOT do it for FIN and let conntrack handle this,
> > but we should still do it for RST.
> 
> Conntrack will then have to deal with an entry that is completely
> out of sync; will that work? At least a fixup to disable sequence
> tracking will be required?
> 
> > Technically I think we could also skip it for RST, but I don't see
> > a big advantage.  For FIN it will keep offloading in place, which is
> > a win for connections where one end still sends more data.
> 
> I see.
> 
> > > > 3. removal of the flowtable entry on hard failure of
> > > >    output routines, e.g. because route is stale.
> > > >    Don't see why that needs any nf_conn changes either.
> > > 
> > > if dst is stale, packet needs to go back to classic path to get a
> > > fresh route.
> > 
> > Yes, sure.  But I would keep the teardown that we have now in place;
> > I meant to say that the current code makes sense to me, i.e.:
> > 
> > if (!nf_flow_dst_check(&tuplehash->tuple)) {
> > 	flow_offload_teardown(flow);
> > 	return 0;
> > }
> 
> I see, I misunderstood.
> 
> > > > My impression is that all these conditionals paper over some other
> > > > bugs; for example, the gc_worker extending the timeout races with the
> > > > datapath, and this needs to be fixed first.
> > > 
> > > That sounds good. But I am afraid some folks will not be happy if TCP
> > > flow becomes stateless again.
> > 
> > I do not know what that means.  There can never be a flowtable entry
> > without a backing nf_conn, so I don't know what stateless means in this
> > context.
> 
> If fin does not tear down the flow entry, then the flow entry remains
> in place and holds a reference to the conntrack object, which will
> not be released until 30 seconds of no traffic activity, right?
> 
> Maybe I am just missing part of what you have in mind; I still don't
> see the big picture.
> 
> Would you make a quick summary of the proposed new logic?
> 
> Thanks!

Hi Pablo,

I tested your last patch, but it makes no difference:

    [NEW] tcp      6 120 SYN_SENT src=192.168.7.126 dst=157.240.223.63 sport=63461 dport=443 [UNREPLIED] src=157.240.223.63 dst=87.138.198.79 sport=443 dport=13354 mark=25165825
 [UPDATE] tcp      6 60 SYN_RECV src=192.168.7.126 dst=157.240.223.63 sport=63461 dport=443 src=157.240.223.63 dst=87.138.198.79 sport=443 dport=13354 mark=25165825
 [UPDATE] tcp      6 86400 ESTABLISHED src=192.168.7.126 dst=157.240.223.63 sport=63461 dport=443 src=157.240.223.63 dst=87.138.198.79 sport=443 dport=13354 [OFFLOAD] mark=25165825
 [UPDATE] tcp      6 120 FIN_WAIT src=192.168.7.126 dst=157.240.223.63 sport=63461 dport=443 src=157.240.223.63 dst=87.138.198.79 sport=443 dport=13354 [OFFLOAD] mark=25165825
 [UPDATE] tcp      6 30 LAST_ACK src=192.168.7.126 dst=157.240.223.63 sport=63461 dport=443 src=157.240.223.63 dst=87.138.198.79 sport=443 dport=13354 [OFFLOAD] mark=25165825
[DESTROY] tcp      6 LAST_ACK src=192.168.7.126 dst=157.240.223.63 sport=63461 dport=443 packets=10 bytes=790 src=157.240.223.63 dst=87.138.198.79 sport=443 dport=13354 packets=19 bytes=5061 [ASSURED] mark=25165825 delta-time=138

Best
Sven


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2024-04-19  7:48 UTC | newest]

Thread overview: 27+ messages
2024-03-18  9:39 [PATCH nf] netfilter: flowtable: infer TCP state and timeout before flow teardown Pablo Neira Ayuso
2024-03-18 10:05 ` Sven Auhagen
2024-03-20  8:39 ` Sven Auhagen
2024-03-20  8:45   ` Pablo Neira Ayuso
2024-03-20  8:49     ` Sven Auhagen
2024-03-20  9:07       ` Pablo Neira Ayuso
2024-03-20  9:20         ` Sven Auhagen
2024-03-20  9:27           ` Pablo Neira Ayuso
2024-03-20  9:31             ` Sven Auhagen
2024-03-20  9:51               ` Pablo Neira Ayuso
2024-03-20 10:13                 ` Sven Auhagen
2024-03-20 10:36                   ` Pablo Neira Ayuso
2024-03-20 10:38                     ` Sven Auhagen
2024-03-20 10:29                 ` Sven Auhagen
2024-03-20 10:47                   ` Pablo Neira Ayuso
2024-03-20 11:15                     ` Sven Auhagen
2024-03-20 12:37                       ` Pablo Neira Ayuso
2024-03-20 13:37                         ` Sven Auhagen
2024-04-08  5:24                         ` Sven Auhagen
2024-04-09 11:11                           ` Pablo Neira Ayuso
2024-04-09 11:35                             ` Sven Auhagen
2024-04-11  9:27                               ` Pablo Neira Ayuso
2024-04-11 11:05                                 ` Florian Westphal
2024-04-11 11:40                                   ` Pablo Neira Ayuso
2024-04-11 12:13                                     ` Florian Westphal
2024-04-11 15:50                                       ` Pablo Neira Ayuso
2024-04-19  7:47                                         ` Sven Auhagen
