* [PATCH net] ipv6: gro: flush instead of assuming different flows on hop_limit mismatch
@ 2022-01-21 1:19 Jakub Kicinski
2022-01-21 8:55 ` Eric Dumazet
2022-01-24 19:23 ` kernel test robot
From: Jakub Kicinski @ 2022-01-21 1:19 UTC (permalink / raw)
To: davem, edumazet; +Cc: dsahern, pabeni, herbert, netdev, Jakub Kicinski
IPv6 GRO considers packets to belong to different flows when their
hop_limit differs. This seems counter-intuitive; the flow is the
same. hop_limit may vary because of various bugs or hacks, but that
doesn't make it okay for GRO to reorder packets.

The practical impact of this problem on overall TCP performance
is unclear, but TCP itself detects the reordering and bumps
TCPSACKReorder, resulting in user complaints.

Note that the old code plays an easy-to-miss trick: by upcasting
nexthdr to a u16 pointer it compares nexthdr and hop_limit in one go.
Coalesce the flush setting to reduce the instruction count a touch.
Fixes: 787e92083601 ("ipv6: Add GRO support")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
net/ipv6/ip6_offload.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index b29e9ba5e113..570071a87e71 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -249,7 +249,7 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
if ((first_word & htonl(0xF00FFFFF)) ||
!ipv6_addr_equal(&iph->saddr, &iph2->saddr) ||
!ipv6_addr_equal(&iph->daddr, &iph2->daddr) ||
- *(u16 *)&iph->nexthdr != *(u16 *)&iph2->nexthdr) {
+ iph->nexthdr != iph2->nexthdr) {
not_same_flow:
NAPI_GRO_CB(p)->same_flow = 0;
continue;
@@ -260,8 +260,9 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
goto not_same_flow;
}
/* flush if Traffic Class fields are different */
- NAPI_GRO_CB(p)->flush |= !!(first_word & htonl(0x0FF00000));
- NAPI_GRO_CB(p)->flush |= flush;
+ NAPI_GRO_CB(p)->flush |= flush |
+ !!((first_word & htonl(0x0FF00000)) |
+ (iph->hop_limit ^ iph2->hop_limit));
/* If the previous IP ID value was based on an atomic
* datagram we can overwrite the value and ignore it.
--
2.31.1
* Re: [PATCH net] ipv6: gro: flush instead of assuming different flows on hop_limit mismatch
2022-01-21 1:19 [PATCH net] ipv6: gro: flush instead of assuming different flows on hop_limit mismatch Jakub Kicinski
@ 2022-01-21 8:55 ` Eric Dumazet
2022-01-21 15:15 ` Jakub Kicinski
2022-01-24 19:23 ` kernel test robot
From: Eric Dumazet @ 2022-01-21 8:55 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: David Miller, David Ahern, Paolo Abeni, Herbert Xu, netdev
On Thu, Jan 20, 2022 at 5:19 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> IPv6 GRO considers packets to belong to different flows when their
> hop_limit is different. This seems counter-intuitive, the flow is
> the same. hop_limit may vary because of various bugs or hacks but
> that doesn't mean it's okay for GRO to reorder packets.
>
> Practical impact of this problem on overall TCP performance
> is unclear, but TCP itself detects this reordering and bumps
> TCPSACKReorder resulting in user complaints.
>
> Note that the code plays an easy to miss trick by upcasting next_hdr
> to a u16 pointer and compares next_hdr and hop_limit in one go.
> Coalesce the flush setting to reduce the instruction count a touch.
>
There are downsides to this change.

We had an internal discussion at Google years ago about this
difference in behavior between IPv6 and IPv4. We came to the
conclusion that the IPv6 behavior was better for our needs (and we do
not care much about IPv4 GRO, since Google DC traffic is 99.99% IPv6).

In our case, we wanted to keep this 'ipv6 feature' because we were
experimenting with the idea of sending TSO packets with different
flowlabels, to use different paths in the network and increase nominal
throughput for WAN flows (one flow would use multiple fiber links).

The issue with the 'ipv4 gro style' handling of the ttl mismatch was
that, because of small differences in RTT for each path, a receiver
could very well receive mixed packets. Even without playing with ECMP
hashes, this scenario can happen if the sender uses a bonding device
in balance-rr mode.

After your change, GRO would be defeated and deliver one MSS at a time
to the TCP stack.

We implemented SACK compression in the TCP stack to avoid extra SACKs
being sent by the receiver. We have an extension of this SACK
compression for TCP flows terminated by Google servers, since modern
TCP stacks do not need the old rule of TCP_FASTRETRANS_THRESH DUPACKs
to start retransmits.

Something like this pseudo code:
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index dc49a3d551eb919baf5ad812ef21698c5c7b9679..d72554ab70fd2e16ed60dc78a905f4aa1414f8c9
100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5494,7 +5494,8 @@ static void __tcp_ack_snd_check(struct sock *sk,
int ofo_possible)
}
if (tp->dup_ack_counter < TCP_FASTRETRANS_THRESH) {
tp->dup_ack_counter++;
- goto send_now;
+ if (peer_is_using_old_rule_about_fastretrans(tp))
+ goto send_now;
}
tp->compressed_ack++;
if (hrtimer_is_queued(&tp->compressed_ack_timer))
> Fixes: 787e92083601 ("ipv6: Add GRO support")
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> net/ipv6/ip6_offload.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
> index b29e9ba5e113..570071a87e71 100644
> --- a/net/ipv6/ip6_offload.c
> +++ b/net/ipv6/ip6_offload.c
> @@ -249,7 +249,7 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
> if ((first_word & htonl(0xF00FFFFF)) ||
> !ipv6_addr_equal(&iph->saddr, &iph2->saddr) ||
> !ipv6_addr_equal(&iph->daddr, &iph2->daddr) ||
> - *(u16 *)&iph->nexthdr != *(u16 *)&iph2->nexthdr) {
> + iph->nexthdr != iph2->nexthdr) {
> not_same_flow:
> NAPI_GRO_CB(p)->same_flow = 0;
> continue;
> @@ -260,8 +260,9 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
> goto not_same_flow;
> }
> /* flush if Traffic Class fields are different */
> - NAPI_GRO_CB(p)->flush |= !!(first_word & htonl(0x0FF00000));
> - NAPI_GRO_CB(p)->flush |= flush;
> + NAPI_GRO_CB(p)->flush |= flush |
> + !!((first_word & htonl(0x0FF00000)) |
> + (iph->hop_limit ^ iph2->hop_limit));
>
> /* If the previous IP ID value was based on an atomic
> * datagram we can overwrite the value and ignore it.
> --
> 2.31.1
>
* Re: [PATCH net] ipv6: gro: flush instead of assuming different flows on hop_limit mismatch
2022-01-21 8:55 ` Eric Dumazet
@ 2022-01-21 15:15 ` Jakub Kicinski
2022-01-21 16:37 ` Eric Dumazet
From: Jakub Kicinski @ 2022-01-21 15:15 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, David Ahern, Paolo Abeni, Herbert Xu, netdev,
ycheng, ncardwell
On Fri, 21 Jan 2022 00:55:08 -0800 Eric Dumazet wrote:
> On Thu, Jan 20, 2022 at 5:19 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > IPv6 GRO considers packets to belong to different flows when their
> > hop_limit is different. This seems counter-intuitive, the flow is
> > the same. hop_limit may vary because of various bugs or hacks but
> > that doesn't mean it's okay for GRO to reorder packets.
> >
> > Practical impact of this problem on overall TCP performance
> > is unclear, but TCP itself detects this reordering and bumps
> > TCPSACKReorder resulting in user complaints.
> >
> > Note that the code plays an easy to miss trick by upcasting next_hdr
> > to a u16 pointer and compares next_hdr and hop_limit in one go.
> > Coalesce the flush setting to reduce the instruction count a touch.
>
> There are downsides to this change.
>
> We had an internal discussion at Google years ago about this
> difference in behavior of IPv6/IPv4
>
> We came to the conclusion the IPv6 behavior was better for our needs
> (and we do not care
> much about IPv4 GRO, since Google DC traffic is 99.99% IPv6)
>
> In our case, we wanted to keep this 'ipv6 feature' because we were
> experimenting with the idea of sending
> TSO packets with different flowlabels, to use different paths in the
> network, to increase nominal
> throughput for WAN flows (one flow would use multiple fiber links)
>
> The issue with 'ipv4 gro style about ttl mismatch' was that because of
> small differences in RTT for each path,
> a receiver could very well receive mixed packets.
>
> Even without playing with ECMP hashes, this scenario can happen if the sender
> uses a bonding device in balance-rr mode.
>
> After your change, GRO would be defeated and deliver one MSS at a time
> to TCP stack.
Indeed. Sounds like we're trading correctness for an optimization of a
questionable practical application, but our motivation isn't 100% pure
either [1] so whatever way we can fix this is fine by me :)
[1] We have some shenanigans that bump TTL to indicate re-transmitted
packets so we can identify them in the network.
> We implemented SACK compress in TCP stack to avoid extra SACK being
> sent by the receiver
>
> We have an extension of this SACK compression for TCP flows terminated
> by Google servers,
> since modern TCP stacks do not need the old rule of TCP_FASTRETRANS_THRESH
> DUPACK to start retransmits.
>
> Something like this pseudo code:
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index dc49a3d551eb919baf5ad812ef21698c5c7b9679..d72554ab70fd2e16ed60dc78a905f4aa1414f8c9
> 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -5494,7 +5494,8 @@ static void __tcp_ack_snd_check(struct sock *sk,
> int ofo_possible)
> }
> if (tp->dup_ack_counter < TCP_FASTRETRANS_THRESH) {
> tp->dup_ack_counter++;
> - goto send_now;
> + if (peer_is_using_old_rule_about_fastretrans(tp))
> + goto send_now;
> }
> tp->compressed_ack++;
> if (hrtimer_is_queued(&tp->compressed_ack_timer))
>
Is this something we could upstream / test? peer_is_using.. does not
exist upstream.
Coincidentally, speaking of sending SACKs: my initial testing was on
5.12 kernels, and there I saw what appeared to a layperson (me) to be
missing ACKs. The receiver would receive segments:

_AB_C_D_E

where _ indicates loss. It would SACK A, then generate the next SACK
only after E (SACKing C, D and E); the sender would rexmit A, which
makes the receiver ACK all the way to the end of B. Now the sender
thinks B arrived after C, D and E because it was never SACKed.

Perhaps it was fixed by commit a29cb6914681 ("net: tcp better handling
of reordering then loss cases"), or it's a result of some out-of-tree
hack. I thought I'd mention it, though, in case it immediately rings a
bell for anyone.
* Re: [PATCH net] ipv6: gro: flush instead of assuming different flows on hop_limit mismatch
2022-01-21 15:15 ` Jakub Kicinski
@ 2022-01-21 16:37 ` Eric Dumazet
2022-01-25 0:02 ` Jakub Kicinski
From: Eric Dumazet @ 2022-01-21 16:37 UTC (permalink / raw)
To: Jakub Kicinski
Cc: David Miller, David Ahern, Paolo Abeni, Herbert Xu, netdev,
Yuchung Cheng, Neal Cardwell
On Fri, Jan 21, 2022 at 7:15 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 21 Jan 2022 00:55:08 -0800 Eric Dumazet wrote:
> > On Thu, Jan 20, 2022 at 5:19 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > IPv6 GRO considers packets to belong to different flows when their
> > > hop_limit is different. This seems counter-intuitive, the flow is
> > > the same. hop_limit may vary because of various bugs or hacks but
> > > that doesn't mean it's okay for GRO to reorder packets.
> > >
> > > Practical impact of this problem on overall TCP performance
> > > is unclear, but TCP itself detects this reordering and bumps
> > > TCPSACKReorder resulting in user complaints.
> > >
> > > Note that the code plays an easy to miss trick by upcasting next_hdr
> > > to a u16 pointer and compares next_hdr and hop_limit in one go.
> > > Coalesce the flush setting to reduce the instruction count a touch.
> >
> > There are downsides to this change.
> >
> > We had an internal discussion at Google years ago about this
> > difference in behavior of IPv6/IPv4
> >
> > We came to the conclusion the IPv6 behavior was better for our needs
> > (and we do not care
> > much about IPv4 GRO, since Google DC traffic is 99.99% IPv6)
> >
> > In our case, we wanted to keep this 'ipv6 feature' because we were
> > experimenting with the idea of sending
> > TSO packets with different flowlabels, to use different paths in the
> > network, to increase nominal
> > throughput for WAN flows (one flow would use multiple fiber links)
> >
> > The issue with 'ipv4 gro style about ttl mismatch' was that because of
> > small differences in RTT for each path,
> > a receiver could very well receive mixed packets.
> >
> > Even without playing with ECMP hashes, this scenario can happen if the sender
> > uses a bonding device in balance-rr mode.
> >
> > After your change, GRO would be defeated and deliver one MSS at a time
> > to TCP stack.
>
> Indeed. Sounds like we're trading correctness for an optimization of a
> questionable practical application, but our motivation isn't 100% pure
> either [1] so whatever way we can fix this is fine by me :)
>
> [1] We have some shenanigans that bump TTL to indicate re-transmitted
> packets so we can identify them in the network.
>
> > We implemented SACK compress in TCP stack to avoid extra SACK being
> > sent by the receiver
> >
> > We have an extension of this SACK compression for TCP flows terminated
> > by Google servers,
> > since modern TCP stacks do not need the old rule of TCP_FASTRETRANS_THRESH
> > DUPACK to start retransmits.
> >
> > Something like this pseudo code:
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index dc49a3d551eb919baf5ad812ef21698c5c7b9679..d72554ab70fd2e16ed60dc78a905f4aa1414f8c9
> > 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -5494,7 +5494,8 @@ static void __tcp_ack_snd_check(struct sock *sk,
> > int ofo_possible)
> > }
> > if (tp->dup_ack_counter < TCP_FASTRETRANS_THRESH) {
> > tp->dup_ack_counter++;
> > - goto send_now;
> > + if (peer_is_using_old_rule_about_fastretrans(tp))
> > + goto send_now;
> > }
> > tp->compressed_ack++;
> > if (hrtimer_is_queued(&tp->compressed_ack_timer))
> >
>
> Is this something we could upstream / test? peer_is_using.. does not
> exist upstream.
Sure, because we do not have a standardized way (at SYN/SYNACK time)
to advertise that the stack is not 10 years old.

This could be a per-netns sysctl, a per-socket flag, or a per-cgroup
flag.

In our case, we negotiate special TCP options, and allow these options
only for internal communications. (So we store this private bit in the
socket itself.)
>
>
> Coincidentally, speaking of sending SACKs, my initial testing was on
> 5.12 kernels and there I saw what appeared to a lay person (me) like
> missing ACKs. Receiver would receive segments:
>
> _AB_C_D_E
>
> where _ indicates loss. It'd SACK A, then generate the next SACK after E
> (SACKing C D E), sender would rexmit A which makes receiver ACK all
> the way to the end of B. Now sender thinks B arrived after CDE because
> it was never sacked.
>
> Perhaps it was fixed by commit a29cb6914681 ("net: tcp better handling
> of reordering then loss cases").. or it's a result of some out-of-tree
> hack. I thought I'd mention it tho in case it immediately rings a bell
> for anyone.
Could all the missing SACKs have been lost?
Writing a packetdrill test for this scenario should not be too hard.
* Re: [PATCH net] ipv6: gro: flush instead of assuming different flows on hop_limit mismatch
2022-01-21 1:19 [PATCH net] ipv6: gro: flush instead of assuming different flows on hop_limit mismatch Jakub Kicinski
2022-01-21 8:55 ` Eric Dumazet
@ 2022-01-24 19:23 ` kernel test robot
From: kernel test robot @ 2022-01-24 19:23 UTC (permalink / raw)
To: Jakub Kicinski, davem, edumazet
Cc: kbuild-all, dsahern, pabeni, herbert, netdev, Jakub Kicinski
Hi Jakub,
I love your patch! Perhaps something to improve:
[auto build test WARNING on net/master]
url: https://github.com/0day-ci/linux/commits/Jakub-Kicinski/ipv6-gro-flush-instead-of-assuming-different-flows-on-hop_limit-mismatch/20220121-092033
base: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 57afdc0aab094b4c811b3fe030b2567812a495f3
config: x86_64-randconfig-s022 (https://download.01.org/0day-ci/archive/20220125/202201250210.roaIok2H-lkp@intel.com/config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce:
# apt-get install sparse
# sparse version: v0.6.4-dirty
# https://github.com/0day-ci/linux/commit/6f8f3e541288381a67df8b670068d5add231d082
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Jakub-Kicinski/ipv6-gro-flush-instead-of-assuming-different-flows-on-hop_limit-mismatch/20220121-092033
git checkout 6f8f3e541288381a67df8b670068d5add231d082
# save the config file to linux build tree
mkdir build_dir
make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=x86_64 SHELL=/bin/bash net/ipv6/
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
sparse warnings: (new ones prefixed by >>)
>> net/ipv6/ip6_offload.c:264:57: sparse: sparse: restricted __be32 degrades to integer
>> net/ipv6/ip6_offload.c:263:48: sparse: sparse: dubious: x | !y
vim +264 net/ipv6/ip6_offload.c
182
183 INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
184 struct sk_buff *skb)
185 {
186 const struct net_offload *ops;
187 struct sk_buff *pp = NULL;
188 struct sk_buff *p;
189 struct ipv6hdr *iph;
190 unsigned int nlen;
191 unsigned int hlen;
192 unsigned int off;
193 u16 flush = 1;
194 int proto;
195
196 off = skb_gro_offset(skb);
197 hlen = off + sizeof(*iph);
198 iph = skb_gro_header_fast(skb, off);
199 if (skb_gro_header_hard(skb, hlen)) {
200 iph = skb_gro_header_slow(skb, hlen, off);
201 if (unlikely(!iph))
202 goto out;
203 }
204
205 skb_set_network_header(skb, off);
206 skb_gro_pull(skb, sizeof(*iph));
207 skb_set_transport_header(skb, skb_gro_offset(skb));
208
209 flush += ntohs(iph->payload_len) != skb_gro_len(skb);
210
211 proto = iph->nexthdr;
212 ops = rcu_dereference(inet6_offloads[proto]);
213 if (!ops || !ops->callbacks.gro_receive) {
214 __pskb_pull(skb, skb_gro_offset(skb));
215 skb_gro_frag0_invalidate(skb);
216 proto = ipv6_gso_pull_exthdrs(skb, proto);
217 skb_gro_pull(skb, -skb_transport_offset(skb));
218 skb_reset_transport_header(skb);
219 __skb_push(skb, skb_gro_offset(skb));
220
221 ops = rcu_dereference(inet6_offloads[proto]);
222 if (!ops || !ops->callbacks.gro_receive)
223 goto out;
224
225 iph = ipv6_hdr(skb);
226 }
227
228 NAPI_GRO_CB(skb)->proto = proto;
229
230 flush--;
231 nlen = skb_network_header_len(skb);
232
233 list_for_each_entry(p, head, list) {
234 const struct ipv6hdr *iph2;
235 __be32 first_word; /* <Version:4><Traffic_Class:8><Flow_Label:20> */
236
237 if (!NAPI_GRO_CB(p)->same_flow)
238 continue;
239
240 iph2 = (struct ipv6hdr *)(p->data + off);
241 first_word = *(__be32 *)iph ^ *(__be32 *)iph2;
242
243 /* All fields must match except length and Traffic Class.
244 * XXX skbs on the gro_list have all been parsed and pulled
245 * already so we don't need to compare nlen
246 * (nlen != (sizeof(*iph2) + ipv6_exthdrs_len(iph2, &ops)))
247 * memcmp() alone below is sufficient, right?
248 */
249 if ((first_word & htonl(0xF00FFFFF)) ||
250 !ipv6_addr_equal(&iph->saddr, &iph2->saddr) ||
251 !ipv6_addr_equal(&iph->daddr, &iph2->daddr) ||
252 iph->nexthdr != iph2->nexthdr) {
253 not_same_flow:
254 NAPI_GRO_CB(p)->same_flow = 0;
255 continue;
256 }
257 if (unlikely(nlen > sizeof(struct ipv6hdr))) {
258 if (memcmp(iph + 1, iph2 + 1,
259 nlen - sizeof(struct ipv6hdr)))
260 goto not_same_flow;
261 }
262 /* flush if Traffic Class fields are different */
> 263 NAPI_GRO_CB(p)->flush |= flush |
> 264 !!((first_word & htonl(0x0FF00000)) |
265 (iph->hop_limit ^ iph2->hop_limit));
266
267 /* If the previous IP ID value was based on an atomic
268 * datagram we can overwrite the value and ignore it.
269 */
270 if (NAPI_GRO_CB(skb)->is_atomic)
271 NAPI_GRO_CB(p)->flush_id = 0;
272 }
273
274 NAPI_GRO_CB(skb)->is_atomic = true;
275 NAPI_GRO_CB(skb)->flush |= flush;
276
277 skb_gro_postpull_rcsum(skb, iph, nlen);
278
279 pp = indirect_call_gro_receive_l4(tcp6_gro_receive, udp6_gro_receive,
280 ops->callbacks.gro_receive, head, skb);
281
282 out:
283 skb_gro_flush_final(skb, pp, flush);
284
285 return pp;
286 }
287
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
* Re: [PATCH net] ipv6: gro: flush instead of assuming different flows on hop_limit mismatch
2022-01-21 16:37 ` Eric Dumazet
@ 2022-01-25 0:02 ` Jakub Kicinski
From: Jakub Kicinski @ 2022-01-25 0:02 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, David Ahern, Paolo Abeni, Herbert Xu, netdev,
Yuchung Cheng, Neal Cardwell
Sorry for the delay; I had to do some homework and run more tests.
On Fri, 21 Jan 2022 08:37:12 -0800 Eric Dumazet wrote:
> On Fri, Jan 21, 2022 at 7:15 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > > We implemented SACK compress in TCP stack to avoid extra SACK being
> > > sent by the receiver
> > >
> > > We have an extension of this SACK compression for TCP flows terminated
> > > by Google servers,
> > > since modern TCP stacks do not need the old rule of TCP_FASTRETRANS_THRESH
> > > DUPACK to start retransmits.
> > >
> > > Something like this pseudo code:
> > >
> > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > > index dc49a3d551eb919baf5ad812ef21698c5c7b9679..d72554ab70fd2e16ed60dc78a905f4aa1414f8c9
> > > 100644
> > > --- a/net/ipv4/tcp_input.c
> > > +++ b/net/ipv4/tcp_input.c
> > > @@ -5494,7 +5494,8 @@ static void __tcp_ack_snd_check(struct sock *sk,
> > > int ofo_possible)
> > > }
> > > if (tp->dup_ack_counter < TCP_FASTRETRANS_THRESH) {
> > > tp->dup_ack_counter++;
> > > - goto send_now;
> > > + if (peer_is_using_old_rule_about_fastretrans(tp))
> > > + goto send_now;
> > > }
> > > tp->compressed_ack++;
> > > if (hrtimer_is_queued(&tp->compressed_ack_timer))
> > >
> >
> > Is this something we could upstream / test? peer_is_using.. does not
> > exist upstream.
>
> Sure, because we do not have a standardized way (at SYN SYNACK time)
> to advertise
> that the stack is not 10 years old.
>
> This could be a per net-ns sysctl, or a per socket flag, or a per cgroup flag.
>
> In our case, we do negotiate special TCP options, and allow these options
> only from internal communications.
>
> (So we store this private bit in the socket itself)
This does not fix the problem, unfortunately. I still see TCP
detecting reordering based on SACKs when retransmits have a higher
TTL.
> > Coincidentally, speaking of sending SACKs, my initial testing was on
> > 5.12 kernels and there I saw what appeared to a lay person (me) like
> > missing ACKs. Receiver would receive segments:
> >
> > _AB_C_D_E
> >
> > where _ indicates loss. It'd SACK A, then generate the next SACK after E
> > (SACKing C D E), sender would rexmit A which makes receiver ACK all
> > the way to the end of B. Now sender thinks B arrived after CDE because
> > it was never sacked.
> >
> > Perhaps it was fixed by commit a29cb6914681 ("net: tcp better handling
> > of reordering then loss cases").. or it's a result of some out-of-tree
> > hack. I thought I'd mention it tho in case it immediately rings a bell
> > for anyone.
>
> Could all the missing SACK have been lost ?
I had tcpdump on both ends, but I can't repro anymore with the GRO fix
applied. Maybe it was also related to that, somehow.
> Writing a packetdrill test for this scenario should not be too hard.