netdev.vger.kernel.org archive mirror
* [PATCH net] ipv6: gro: flush instead of assuming different flows on hop_limit mismatch
@ 2022-01-21  1:19 Jakub Kicinski
  2022-01-21  8:55 ` Eric Dumazet
  2022-01-24 19:23 ` kernel test robot
  0 siblings, 2 replies; 6+ messages in thread
From: Jakub Kicinski @ 2022-01-21  1:19 UTC (permalink / raw)
  To: davem, edumazet; +Cc: dsahern, pabeni, herbert, netdev, Jakub Kicinski

IPv6 GRO considers packets to belong to different flows when their
hop_limit is different. This seems counter-intuitive; the flow is
the same. hop_limit may vary because of various bugs or hacks, but
that doesn't mean it's okay for GRO to reorder packets.

The practical impact of this problem on overall TCP performance
is unclear, but TCP itself detects the reordering and bumps
TCPSACKReorder, resulting in user complaints.

Note that the code plays an easy-to-miss trick: by casting next_hdr
to a u16 pointer it compares next_hdr and hop_limit in one go.
Coalesce the flush setting to reduce the instruction count a touch.

Fixes: 787e92083601 ("ipv6: Add GRO support")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 net/ipv6/ip6_offload.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index b29e9ba5e113..570071a87e71 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -249,7 +249,7 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
 		 if ((first_word & htonl(0xF00FFFFF)) ||
 		     !ipv6_addr_equal(&iph->saddr, &iph2->saddr) ||
 		     !ipv6_addr_equal(&iph->daddr, &iph2->daddr) ||
-		     *(u16 *)&iph->nexthdr != *(u16 *)&iph2->nexthdr) {
+		     iph->nexthdr != iph2->nexthdr) {
 not_same_flow:
 			NAPI_GRO_CB(p)->same_flow = 0;
 			continue;
@@ -260,8 +260,9 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
 				goto not_same_flow;
 		}
 		/* flush if Traffic Class fields are different */
-		NAPI_GRO_CB(p)->flush |= !!(first_word & htonl(0x0FF00000));
-		NAPI_GRO_CB(p)->flush |= flush;
+		NAPI_GRO_CB(p)->flush |= flush |
+					 !!((first_word & htonl(0x0FF00000)) |
+					    (iph->hop_limit ^ iph2->hop_limit));
 
 		/* If the previous IP ID value was based on an atomic
 		 * datagram we can overwrite the value and ignore it.
-- 
2.31.1



* Re: [PATCH net] ipv6: gro: flush instead of assuming different flows on hop_limit mismatch
  2022-01-21  1:19 [PATCH net] ipv6: gro: flush instead of assuming different flows on hop_limit mismatch Jakub Kicinski
@ 2022-01-21  8:55 ` Eric Dumazet
  2022-01-21 15:15   ` Jakub Kicinski
  2022-01-24 19:23 ` kernel test robot
  1 sibling, 1 reply; 6+ messages in thread
From: Eric Dumazet @ 2022-01-21  8:55 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: David Miller, David Ahern, Paolo Abeni, Herbert Xu, netdev

On Thu, Jan 20, 2022 at 5:19 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> IPv6 GRO considers packets to belong to different flows when their
> hop_limit is different. This seems counter-intuitive, the flow is
> the same. hop_limit may vary because of various bugs or hacks but
> that doesn't mean it's okay for GRO to reorder packets.
>
> Practical impact of this problem on overall TCP performance
> is unclear, but TCP itself detects this reordering and bumps
> TCPSACKReorder resulting in user complaints.
>
> Note that the code plays an easy to miss trick by upcasting next_hdr
> to a u16 pointer and compares next_hdr and hop_limit in one go.
> Coalesce the flush setting to reduce the instruction count a touch.
>

There are downsides to this change.

We had an internal discussion at Google years ago about this
difference in behavior between IPv6 and IPv4.

We came to the conclusion that the IPv6 behavior was better for our
needs (and we do not care much about IPv4 GRO, since Google DC
traffic is 99.99% IPv6).

In our case, we wanted to keep this 'ipv6 feature' because we were
experimenting with the idea of sending TSO packets with different
flowlabels, to use different paths in the network, to increase
nominal throughput for WAN flows (one flow would use multiple
fiber links).

The issue with the 'ipv4 gro style' of handling a ttl mismatch was
that, because of small differences in RTT on each path, a receiver
could very well receive mixed packets.

Even without playing with ECMP hashes, this scenario can happen if
the sender uses a bonding device in balance-rr mode.

After your change, GRO would be defeated and deliver one MSS at a
time to the TCP stack.

We implemented SACK compression in the TCP stack to avoid extra
SACKs being sent by the receiver.

We have an extension of this SACK compression for TCP flows
terminated by Google servers, since modern TCP stacks do not need
the old rule of TCP_FASTRETRANS_THRESH DUPACKs to start retransmits.

Something like this pseudo code:

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index dc49a3d551eb919baf5ad812ef21698c5c7b9679..d72554ab70fd2e16ed60dc78a905f4aa1414f8c9
100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5494,7 +5494,8 @@ static void __tcp_ack_snd_check(struct sock *sk,
int ofo_possible)
        }
        if (tp->dup_ack_counter < TCP_FASTRETRANS_THRESH) {
                tp->dup_ack_counter++;
-               goto send_now;
+               if (peer_is_using_old_rule_about_fastretrans(tp))
+                       goto send_now;
        }
        tp->compressed_ack++;
        if (hrtimer_is_queued(&tp->compressed_ack_timer))



> Fixes: 787e92083601 ("ipv6: Add GRO support")
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
>  net/ipv6/ip6_offload.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
> index b29e9ba5e113..570071a87e71 100644
> --- a/net/ipv6/ip6_offload.c
> +++ b/net/ipv6/ip6_offload.c
> @@ -249,7 +249,7 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
>                  if ((first_word & htonl(0xF00FFFFF)) ||
>                      !ipv6_addr_equal(&iph->saddr, &iph2->saddr) ||
>                      !ipv6_addr_equal(&iph->daddr, &iph2->daddr) ||
> -                    *(u16 *)&iph->nexthdr != *(u16 *)&iph2->nexthdr) {
> +                    iph->nexthdr != iph2->nexthdr) {
>  not_same_flow:
>                         NAPI_GRO_CB(p)->same_flow = 0;
>                         continue;
> @@ -260,8 +260,9 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
>                                 goto not_same_flow;
>                 }
>                 /* flush if Traffic Class fields are different */
> -               NAPI_GRO_CB(p)->flush |= !!(first_word & htonl(0x0FF00000));
> -               NAPI_GRO_CB(p)->flush |= flush;
> +               NAPI_GRO_CB(p)->flush |= flush |
> +                                        !!((first_word & htonl(0x0FF00000)) |
> +                                           (iph->hop_limit ^ iph2->hop_limit));
>
>                 /* If the previous IP ID value was based on an atomic
>                  * datagram we can overwrite the value and ignore it.
> --
> 2.31.1
>


* Re: [PATCH net] ipv6: gro: flush instead of assuming different flows on hop_limit mismatch
  2022-01-21  8:55 ` Eric Dumazet
@ 2022-01-21 15:15   ` Jakub Kicinski
  2022-01-21 16:37     ` Eric Dumazet
  0 siblings, 1 reply; 6+ messages in thread
From: Jakub Kicinski @ 2022-01-21 15:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, David Ahern, Paolo Abeni, Herbert Xu, netdev,
	ycheng, ncardwell

On Fri, 21 Jan 2022 00:55:08 -0800 Eric Dumazet wrote:
> On Thu, Jan 20, 2022 at 5:19 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > IPv6 GRO considers packets to belong to different flows when their
> > hop_limit is different. This seems counter-intuitive, the flow is
> > the same. hop_limit may vary because of various bugs or hacks but
> > that doesn't mean it's okay for GRO to reorder packets.
> >
> > Practical impact of this problem on overall TCP performance
> > is unclear, but TCP itself detects this reordering and bumps
> > TCPSACKReorder resulting in user complaints.
> >
> > Note that the code plays an easy to miss trick by upcasting next_hdr
> > to a u16 pointer and compares next_hdr and hop_limit in one go.
> > Coalesce the flush setting to reduce the instruction count a touch.
> 
> There are downsides to this change.
> 
> We had an internal discussion at Google years ago about this
> difference in behavior of IPv6/IPv4
> 
> We came to the conclusion the IPv6 behavior was better for our needs
> (and we do not care
> much about IPv4 GRO, since Google DC traffic is 99.99% IPv6)
> 
> In our case, we wanted to keep this 'ipv6 feature' because we were
> experimenting with the idea of sending
> TSO packets with different flowlabels, to use different paths in the
> network, to increase nominal
> throughput for WAN flows (one flow would use multiple fiber links)
> 
> The issue with 'ipv4 gro style about ttl mismatch' was that because of
> small differences in RTT for each path,
>  a receiver could very well receive mixed packets.
> 
> Even without playing with ECMP hashes, this scenario can happen if the sender
> uses a bonding device in balance-rr mode.
> 
> After your change, GRO would be defeated and deliver one MSS at a time
> to TCP stack.

Indeed. Sounds like we're trading correctness for an optimization of a
questionable practical application, but our motivation isn't 100% pure
either [1] so whatever way we can fix this is fine by me :)

[1] We have some shenanigans that bump TTL to indicate re-transmitted
packets so we can identify them in the network.

> We implemented SACK compress in TCP stack to avoid extra SACK being
> sent by the receiver
> 
> We have an extension of this SACK compression for TCP flows terminated
> by Google servers,
> since modern TCP stacks do not need the old rule of TCP_FASTRETRANS_THRESH
> DUPACK to start retransmits.
> 
> Something like this pseudo code:
> 
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index dc49a3d551eb919baf5ad812ef21698c5c7b9679..d72554ab70fd2e16ed60dc78a905f4aa1414f8c9
> 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -5494,7 +5494,8 @@ static void __tcp_ack_snd_check(struct sock *sk,
> int ofo_possible)
>         }
>         if (tp->dup_ack_counter < TCP_FASTRETRANS_THRESH) {
>                 tp->dup_ack_counter++;
> -               goto send_now;
> +               if (peer_is_using_old_rule_about_fastretrans(tp))
> +                       goto send_now;
>         }
>         tp->compressed_ack++;
>         if (hrtimer_is_queued(&tp->compressed_ack_timer))
> 

Is this something we could upstream / test? peer_is_using.. does not
exist upstream.


Coincidentally, speaking of sending SACKs, my initial testing was on
5.12 kernels and there I saw what appeared to a lay person (me) like
missing ACKs. Receiver would receive segments:

_AB_C_D_E

where _ indicates loss. It'd SACK A, then generate the next SACK after E
(SACKing C D E), sender would rexmit A which makes receiver ACK all 
the way to the end of B. Now sender thinks B arrived after CDE because
it was never sacked.

Perhaps it was fixed by commit a29cb6914681 ("net: tcp better handling
of reordering then loss cases").. or it's a result of some out-of-tree 
hack. I thought I'd mention it tho in case it immediately rings a bell
for anyone.


* Re: [PATCH net] ipv6: gro: flush instead of assuming different flows on hop_limit mismatch
  2022-01-21 15:15   ` Jakub Kicinski
@ 2022-01-21 16:37     ` Eric Dumazet
  2022-01-25  0:02       ` Jakub Kicinski
  0 siblings, 1 reply; 6+ messages in thread
From: Eric Dumazet @ 2022-01-21 16:37 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David Miller, David Ahern, Paolo Abeni, Herbert Xu, netdev,
	Yuchung Cheng, Neal Cardwell

On Fri, Jan 21, 2022 at 7:15 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 21 Jan 2022 00:55:08 -0800 Eric Dumazet wrote:
> > On Thu, Jan 20, 2022 at 5:19 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > IPv6 GRO considers packets to belong to different flows when their
> > > hop_limit is different. This seems counter-intuitive, the flow is
> > > the same. hop_limit may vary because of various bugs or hacks but
> > > that doesn't mean it's okay for GRO to reorder packets.
> > >
> > > Practical impact of this problem on overall TCP performance
> > > is unclear, but TCP itself detects this reordering and bumps
> > > TCPSACKReorder resulting in user complaints.
> > >
> > > Note that the code plays an easy to miss trick by upcasting next_hdr
> > > to a u16 pointer and compares next_hdr and hop_limit in one go.
> > > Coalesce the flush setting to reduce the instruction count a touch.
> >
> > There are downsides to this change.
> >
> > We had an internal discussion at Google years ago about this
> > difference in behavior of IPv6/IPv4
> >
> > We came to the conclusion the IPv6 behavior was better for our needs
> > (and we do not care
> > much about IPv4 GRO, since Google DC traffic is 99.99% IPv6)
> >
> > In our case, we wanted to keep this 'ipv6 feature' because we were
> > experimenting with the idea of sending
> > TSO packets with different flowlabels, to use different paths in the
> > network, to increase nominal
> > throughput for WAN flows (one flow would use multiple fiber links)
> >
> > The issue with 'ipv4 gro style about ttl mismatch' was that because of
> > small differences in RTT for each path,
> >  a receiver could very well receive mixed packets.
> >
> > Even without playing with ECMP hashes, this scenario can happen if the sender
> > uses a bonding device in balance-rr mode.
> >
> > After your change, GRO would be defeated and deliver one MSS at a time
> > to TCP stack.
>
> Indeed. Sounds like we're trading correctness for an optimization of a
> questionable practical application, but our motivation isn't 100% pure
> either [1] so whatever way we can fix this is fine by me :)
>
> [1] We have some shenanigans that bump TTL to indicate re-transmitted
> packets so we can identify them in the network.
>
> > We implemented SACK compress in TCP stack to avoid extra SACK being
> > sent by the receiver
> >
> > We have an extension of this SACK compression for TCP flows terminated
> > by Google servers,
> > since modern TCP stacks do not need the old rule of TCP_FASTRETRANS_THRESH
> > DUPACK to start retransmits.
> >
> > Something like this pseudo code:
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index dc49a3d551eb919baf5ad812ef21698c5c7b9679..d72554ab70fd2e16ed60dc78a905f4aa1414f8c9
> > 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -5494,7 +5494,8 @@ static void __tcp_ack_snd_check(struct sock *sk,
> > int ofo_possible)
> >         }
> >         if (tp->dup_ack_counter < TCP_FASTRETRANS_THRESH) {
> >                 tp->dup_ack_counter++;
> > -               goto send_now;
> > +               if (peer_is_using_old_rule_about_fastretrans(tp))
> > +                       goto send_now;
> >         }
> >         tp->compressed_ack++;
> >         if (hrtimer_is_queued(&tp->compressed_ack_timer))
> >
>
> Is this something we could upstream / test? peer_is_using.. does not
> exist upstream.

Sure, because we do not have a standardized way (at SYN/SYNACK time)
to advertise that the stack is not 10 years old.

This could be a per net-ns sysctl, or a per socket flag, or a per cgroup flag.

In our case, we do negotiate special TCP options, and allow these options
only from internal communications.

(So we store this private bit in the socket itself)

>
>
> Coincidentally, speaking of sending SACKs, my initial testing was on
> 5.12 kernels and there I saw what appeared to a lay person (me) like
> missing ACKs. Receiver would receive segments:
>
> _AB_C_D_E
>
> where _ indicates loss. It'd SACK A, then generate the next SACK after E
> (SACKing C D E), sender would rexmit A which makes receiver ACK all
> the way to the end of B. Now sender thinks B arrived after CDE because
> it was never sacked.
>
> Perhaps it was fixed by commit a29cb6914681 ("net: tcp better handling
> of reordering then loss cases").. or it's a result of some out-of-tree
> hack. I thought I'd mention it tho in case it immediately rings a bell
> for anyone.

Could all the missing SACKs have been lost?

Writing a packetdrill test for this scenario should not be too hard.


* Re: [PATCH net] ipv6: gro: flush instead of assuming different flows on hop_limit mismatch
  2022-01-21  1:19 [PATCH net] ipv6: gro: flush instead of assuming different flows on hop_limit mismatch Jakub Kicinski
  2022-01-21  8:55 ` Eric Dumazet
@ 2022-01-24 19:23 ` kernel test robot
  1 sibling, 0 replies; 6+ messages in thread
From: kernel test robot @ 2022-01-24 19:23 UTC (permalink / raw)
  To: Jakub Kicinski, davem, edumazet
  Cc: kbuild-all, dsahern, pabeni, herbert, netdev, Jakub Kicinski

Hi Jakub,

I love your patch! Perhaps something to improve:

[auto build test WARNING on net/master]

url:    https://github.com/0day-ci/linux/commits/Jakub-Kicinski/ipv6-gro-flush-instead-of-assuming-different-flows-on-hop_limit-mismatch/20220121-092033
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 57afdc0aab094b4c811b3fe030b2567812a495f3
config: x86_64-randconfig-s022 (https://download.01.org/0day-ci/archive/20220125/202201250210.roaIok2H-lkp@intel.com/config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.4-dirty
        # https://github.com/0day-ci/linux/commit/6f8f3e541288381a67df8b670068d5add231d082
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Jakub-Kicinski/ipv6-gro-flush-instead-of-assuming-different-flows-on-hop_limit-mismatch/20220121-092033
        git checkout 6f8f3e541288381a67df8b670068d5add231d082
        # save the config file to linux build tree
        mkdir build_dir
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=x86_64 SHELL=/bin/bash net/ipv6/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)
>> net/ipv6/ip6_offload.c:264:57: sparse: sparse: restricted __be32 degrades to integer
>> net/ipv6/ip6_offload.c:263:48: sparse: sparse: dubious: x | !y

vim +264 net/ipv6/ip6_offload.c

   182	
   183	INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
   184								 struct sk_buff *skb)
   185	{
   186		const struct net_offload *ops;
   187		struct sk_buff *pp = NULL;
   188		struct sk_buff *p;
   189		struct ipv6hdr *iph;
   190		unsigned int nlen;
   191		unsigned int hlen;
   192		unsigned int off;
   193		u16 flush = 1;
   194		int proto;
   195	
   196		off = skb_gro_offset(skb);
   197		hlen = off + sizeof(*iph);
   198		iph = skb_gro_header_fast(skb, off);
   199		if (skb_gro_header_hard(skb, hlen)) {
   200			iph = skb_gro_header_slow(skb, hlen, off);
   201			if (unlikely(!iph))
   202				goto out;
   203		}
   204	
   205		skb_set_network_header(skb, off);
   206		skb_gro_pull(skb, sizeof(*iph));
   207		skb_set_transport_header(skb, skb_gro_offset(skb));
   208	
   209		flush += ntohs(iph->payload_len) != skb_gro_len(skb);
   210	
   211		proto = iph->nexthdr;
   212		ops = rcu_dereference(inet6_offloads[proto]);
   213		if (!ops || !ops->callbacks.gro_receive) {
   214			__pskb_pull(skb, skb_gro_offset(skb));
   215			skb_gro_frag0_invalidate(skb);
   216			proto = ipv6_gso_pull_exthdrs(skb, proto);
   217			skb_gro_pull(skb, -skb_transport_offset(skb));
   218			skb_reset_transport_header(skb);
   219			__skb_push(skb, skb_gro_offset(skb));
   220	
   221			ops = rcu_dereference(inet6_offloads[proto]);
   222			if (!ops || !ops->callbacks.gro_receive)
   223				goto out;
   224	
   225			iph = ipv6_hdr(skb);
   226		}
   227	
   228		NAPI_GRO_CB(skb)->proto = proto;
   229	
   230		flush--;
   231		nlen = skb_network_header_len(skb);
   232	
   233		list_for_each_entry(p, head, list) {
   234			const struct ipv6hdr *iph2;
   235			__be32 first_word; /* <Version:4><Traffic_Class:8><Flow_Label:20> */
   236	
   237			if (!NAPI_GRO_CB(p)->same_flow)
   238				continue;
   239	
   240			iph2 = (struct ipv6hdr *)(p->data + off);
   241			first_word = *(__be32 *)iph ^ *(__be32 *)iph2;
   242	
   243			/* All fields must match except length and Traffic Class.
   244			 * XXX skbs on the gro_list have all been parsed and pulled
   245			 * already so we don't need to compare nlen
   246			 * (nlen != (sizeof(*iph2) + ipv6_exthdrs_len(iph2, &ops)))
   247			 * memcmp() alone below is sufficient, right?
   248			 */
   249			 if ((first_word & htonl(0xF00FFFFF)) ||
   250			     !ipv6_addr_equal(&iph->saddr, &iph2->saddr) ||
   251			     !ipv6_addr_equal(&iph->daddr, &iph2->daddr) ||
   252			     iph->nexthdr != iph2->nexthdr) {
   253	not_same_flow:
   254				NAPI_GRO_CB(p)->same_flow = 0;
   255				continue;
   256			}
   257			if (unlikely(nlen > sizeof(struct ipv6hdr))) {
   258				if (memcmp(iph + 1, iph2 + 1,
   259					   nlen - sizeof(struct ipv6hdr)))
   260					goto not_same_flow;
   261			}
   262			/* flush if Traffic Class fields are different */
 > 263			NAPI_GRO_CB(p)->flush |= flush |
 > 264						 !!((first_word & htonl(0x0FF00000)) |
   265						    (iph->hop_limit ^ iph2->hop_limit));
   266	
   267			/* If the previous IP ID value was based on an atomic
   268			 * datagram we can overwrite the value and ignore it.
   269			 */
   270			if (NAPI_GRO_CB(skb)->is_atomic)
   271				NAPI_GRO_CB(p)->flush_id = 0;
   272		}
   273	
   274		NAPI_GRO_CB(skb)->is_atomic = true;
   275		NAPI_GRO_CB(skb)->flush |= flush;
   276	
   277		skb_gro_postpull_rcsum(skb, iph, nlen);
   278	
   279		pp = indirect_call_gro_receive_l4(tcp6_gro_receive, udp6_gro_receive,
   280						 ops->callbacks.gro_receive, head, skb);
   281	
   282	out:
   283		skb_gro_flush_final(skb, pp, flush);
   284	
   285		return pp;
   286	}
   287	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org


* Re: [PATCH net] ipv6: gro: flush instead of assuming different flows on hop_limit mismatch
  2022-01-21 16:37     ` Eric Dumazet
@ 2022-01-25  0:02       ` Jakub Kicinski
  0 siblings, 0 replies; 6+ messages in thread
From: Jakub Kicinski @ 2022-01-25  0:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, David Ahern, Paolo Abeni, Herbert Xu, netdev,
	Yuchung Cheng, Neal Cardwell

Sorry for the delay, I had to do some homework and more tests.

On Fri, 21 Jan 2022 08:37:12 -0800 Eric Dumazet wrote:
> On Fri, Jan 21, 2022 at 7:15 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > > We implemented SACK compress in TCP stack to avoid extra SACK being
> > > sent by the receiver
> > >
> > > We have an extension of this SACK compression for TCP flows terminated
> > > by Google servers,
> > > since modern TCP stacks do not need the old rule of TCP_FASTRETRANS_THRESH
> > > DUPACK to start retransmits.
> > >
> > > Something like this pseudo code:
> > >
> > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > > index dc49a3d551eb919baf5ad812ef21698c5c7b9679..d72554ab70fd2e16ed60dc78a905f4aa1414f8c9
> > > 100644
> > > --- a/net/ipv4/tcp_input.c
> > > +++ b/net/ipv4/tcp_input.c
> > > @@ -5494,7 +5494,8 @@ static void __tcp_ack_snd_check(struct sock *sk,
> > > int ofo_possible)
> > >         }
> > >         if (tp->dup_ack_counter < TCP_FASTRETRANS_THRESH) {
> > >                 tp->dup_ack_counter++;
> > > -               goto send_now;
> > > +               if (peer_is_using_old_rule_about_fastretrans(tp))
> > > +                       goto send_now;
> > >         }
> > >         tp->compressed_ack++;
> > >         if (hrtimer_is_queued(&tp->compressed_ack_timer))
> > >  
> >
> > Is this something we could upstream / test? peer_is_using.. does not
> > exist upstream.  
> 
> Sure, because we do not have a standardized way (at SYN SYNACK time)
> to advertise
> that the stack is not 10 years old.
> 
> This could be a per net-ns sysctl, or a per socket flag, or a per cgroup flag.
> 
> In our case, we do negotiate special TCP options, and allow these options
> only from internal communications.
> 
> (So we store this private bit in the socket itself)

This does not fix the problem, unfortunately. I still see TCP detecting
reordering based on SACK if re-transmits have higher TTL.

> > Coincidentally, speaking of sending SACKs, my initial testing was on
> > 5.12 kernels and there I saw what appeared to a lay person (me) like
> > missing ACKs. Receiver would receive segments:
> >
> > _AB_C_D_E
> >
> > where _ indicates loss. It'd SACK A, then generate the next SACK after E
> > (SACKing C D E), sender would rexmit A which makes receiver ACK all
> > the way to the end of B. Now sender thinks B arrived after CDE because
> > it was never sacked.
> >
> > Perhaps it was fixed by commit a29cb6914681 ("net: tcp better handling
> > of reordering then loss cases").. or it's a result of some out-of-tree
> > hack. I thought I'd mention it tho in case it immediately rings a bell
> > for anyone.  
> 
> Could all the missing SACK have been lost ?

I had tcpdump on both ends, but I can't repro any more with the GRO fix
applied. Maybe it was also related to that. Somehow.

> Writing a packetdrill test for this scenario should not be too hard.

