All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v1 0/2] tcp bugs for window shrinking, OoO ACKs
@ 2022-10-20 18:22 Kamaljit Singh
  2022-10-20 18:22 ` [PATCH v1 1/2] tcp: Fix for stale host ACK when tgt window shrunk Kamaljit Singh
  2022-10-20 18:22 ` [PATCH v1 2/2] tcp: Ignore OOO handling for TCP ACKs Kamaljit Singh
  0 siblings, 2 replies; 11+ messages in thread
From: Kamaljit Singh @ 2022-10-20 18:22 UTC (permalink / raw)
  To: edumazet, davem, yoshfuji, dsahern, kuba, pabeni
  Cc: netdev, Niklas.Cassel, Damien.LeMoal, kamaljit.singh1

We have found significant performance impacts for 4K Writes to an NVMe/TCP
target under heavy congestion conditions due to improper SEQ-NUM number 
provided by the host TCP layer. In this case the target shrunk the window,
which seemed to cause the host to send a stale SEQ-NUM for a subsequent
ACK.

It was also observed that on occasions that a host sends ACKs out-of-order
it caused data retransmissions thus affecting performance for NVMe/TCP
Writes.

A network trace is available and can be provided for further reference.

Kamaljit Singh (2):
  tcp: Fix for stale host ACK when tgt window shrunk
  tcp: Ignore OOO handling for TCP ACKs

 net/ipv4/tcp_output.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v1 1/2] tcp: Fix for stale host ACK when tgt window shrunk
  2022-10-20 18:22 [PATCH v1 0/2] tcp bugs for window shrinking, OoO ACKs Kamaljit Singh
@ 2022-10-20 18:22 ` Kamaljit Singh
  2022-10-20 20:45   ` Eric Dumazet
  2022-10-20 18:22 ` [PATCH v1 2/2] tcp: Ignore OOO handling for TCP ACKs Kamaljit Singh
  1 sibling, 1 reply; 11+ messages in thread
From: Kamaljit Singh @ 2022-10-20 18:22 UTC (permalink / raw)
  To: edumazet, davem, yoshfuji, dsahern, kuba, pabeni
  Cc: netdev, Niklas.Cassel, Damien.LeMoal, kamaljit.singh1

Under certain congestion conditions, an NVMe/TCP target may be configured
to shrink the TCP window in an effort to slow the sender down prior to
issuing a more drastic L2 pause or PFC indication.  Although the TCP
standard discourages implementations from shrinking the TCP window, it also
states that TCP implementations must be robust to this occurring. The
current Linux TCP layer (in conjunction with the NVMe/TCP host driver) has
an issue when the TCP window is shrunk by a target, which causes ACK frames
to be transmitted with a “stale” SEQ_NUM or for data frames to be
retransmitted by the host. It has been observed that processing of these
“stale” ACKs or data retransmissions impacts NVMe/TCP Write IOPs
performance.

Network traffic analysis revealed that SEQ-NUM being used by the host to
ACK the frame that resized the TCP window had an older SEQ-NUM and not a
value corresponding to the next SEQ-NUM expected on that connection.

In such a case, the kernel was using the seq number calculated by
tcp_wnd_end() as per the code segment below. Since, in this case
tp->snd_wnd=0, tcp_wnd_end(tp) returns tp->snd_una, which is incorrect for
the scenario.  The correct seq number that needs to be returned is
tp->snd_nxt. This fix seems to have fixed the stale SEQ-NUM issue along
with its performance impact.

  1271 static inline u32 tcp_wnd_end(const struct tcp_sock *tp)
  1272 {
  1273   return tp->snd_una + tp->snd_wnd;
  1274 }

Signed-off-by: Kamaljit Singh <kamaljit.singh1@wdc.com>
---
 net/ipv4/tcp_output.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 11aa0ab10bba..322e061edb72 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -100,6 +100,9 @@ static inline __u32 tcp_acceptable_seq(const struct sock *sk)
 	    (tp->rx_opt.wscale_ok &&
 	     ((tp->snd_nxt - tcp_wnd_end(tp)) < (1 << tp->rx_opt.rcv_wscale))))
 		return tp->snd_nxt;
+	else if (!tp->snd_wnd && !sock_flag(sk, SOCK_DEAD) &&
+		 !((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)))
+		return tp->snd_nxt;
 	else
 		return tcp_wnd_end(tp);
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v1 2/2] tcp: Ignore OOO handling for TCP ACKs
  2022-10-20 18:22 [PATCH v1 0/2] tcp bugs for window shrinking, OoO ACKs Kamaljit Singh
  2022-10-20 18:22 ` [PATCH v1 1/2] tcp: Fix for stale host ACK when tgt window shrunk Kamaljit Singh
@ 2022-10-20 18:22 ` Kamaljit Singh
  2022-10-20 18:57   ` Eric Dumazet
  1 sibling, 1 reply; 11+ messages in thread
From: Kamaljit Singh @ 2022-10-20 18:22 UTC (permalink / raw)
  To: edumazet, davem, yoshfuji, dsahern, kuba, pabeni
  Cc: netdev, Niklas.Cassel, Damien.LeMoal, kamaljit.singh1

Even with the TCP window fix to tcp_acceptable_seq(), occasional
out-of-order host ACKs were still seen under heavy write workloads thus
Impacting performance.  By removing the OoO optionality for ACKs in
__tcp_transmit_skb() that issue seems to be fixed as well.

Signed-off-by: Kamaljit Singh <kamaljit.singh1@wdc.com>
---
 net/ipv4/tcp_output.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 322e061edb72..1cd77493f32c 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1307,7 +1307,10 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
 	 * TODO: Ideally, in-flight pure ACK packets should not matter here.
 	 * One way to get this would be to set skb->truesize = 2 on them.
 	 */
-	skb->ooo_okay = sk_wmem_alloc_get(sk) < SKB_TRUESIZE(1);
+	if (likely(tcb->tcp_flags & TCPHDR_ACK))
+		skb->ooo_okay = 0;
+	else
+		skb->ooo_okay = sk_wmem_alloc_get(sk) < SKB_TRUESIZE(1);
 
 	/* If we had to use memory reserve to allocate this skb,
 	 * this might cause drops if packet is looped back :
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v1 2/2] tcp: Ignore OOO handling for TCP ACKs
  2022-10-20 18:22 ` [PATCH v1 2/2] tcp: Ignore OOO handling for TCP ACKs Kamaljit Singh
@ 2022-10-20 18:57   ` Eric Dumazet
  2022-10-20 21:25     ` Kamaljit Singh
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2022-10-20 18:57 UTC (permalink / raw)
  To: Kamaljit Singh
  Cc: davem, yoshfuji, dsahern, kuba, pabeni, netdev, Niklas.Cassel,
	Damien.LeMoal

On Thu, Oct 20, 2022 at 11:22 AM Kamaljit Singh <kamaljit.singh1@wdc.com> wrote:
>
> Even with the TCP window fix to tcp_acceptable_seq(), occasional
> out-of-order host ACKs were still seen under heavy write workloads thus
> Impacting performance.  By removing the OoO optionality for ACKs in
> __tcp_transmit_skb() that issue seems to be fixed as well.

This is highly suspect/bogus.

 Please give which driver is used here.

>
> Signed-off-by: Kamaljit Singh <kamaljit.singh1@wdc.com>
> ---
>  net/ipv4/tcp_output.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 322e061edb72..1cd77493f32c 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1307,7 +1307,10 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
>          * TODO: Ideally, in-flight pure ACK packets should not matter here.
>          * One way to get this would be to set skb->truesize = 2 on them.
>          */
> -       skb->ooo_okay = sk_wmem_alloc_get(sk) < SKB_TRUESIZE(1);
> +       if (likely(tcb->tcp_flags & TCPHDR_ACK))
> +               skb->ooo_okay = 0;
> +       else
> +               skb->ooo_okay = sk_wmem_alloc_get(sk) < SKB_TRUESIZE(1);
>

This is absolutely wrong and would impact performance quite a lot.

You are basically removing all possibilities for ackets of a TCP flow
to be directed to a new queue, say if use thread has migrated to
another cpu.

After 3WHS, all packets get ACK set.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v1 1/2] tcp: Fix for stale host ACK when tgt window shrunk
  2022-10-20 18:22 ` [PATCH v1 1/2] tcp: Fix for stale host ACK when tgt window shrunk Kamaljit Singh
@ 2022-10-20 20:45   ` Eric Dumazet
  2022-10-21  1:01     ` Kamaljit Singh
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2022-10-20 20:45 UTC (permalink / raw)
  To: Kamaljit Singh
  Cc: davem, yoshfuji, dsahern, kuba, pabeni, netdev, Niklas.Cassel,
	Damien.LeMoal

On Thu, Oct 20, 2022 at 11:22 AM Kamaljit Singh <kamaljit.singh1@wdc.com> wrote:
>
> Under certain congestion conditions, an NVMe/TCP target may be configured
> to shrink the TCP window in an effort to slow the sender down prior to
> issuing a more drastic L2 pause or PFC indication.  Although the TCP
> standard discourages implementations from shrinking the TCP window, it also
> states that TCP implementations must be robust to this occurring. The
> current Linux TCP layer (in conjunction with the NVMe/TCP host driver) has
> an issue when the TCP window is shrunk by a target, which causes ACK frames
> to be transmitted with a “stale” SEQ_NUM or for data frames to be
> retransmitted by the host.

Linux sends ACK packets, with a legal SEQ number.

The issue is the receiver of such packets, right ?

Because as you said receivers should be relaxed about this, especially
if _they_ decided
to not respect the TCP standards.

You are proposing to send old ACK, that might be dropped by other stacks.

It has been observed that processing of these
> “stale” ACKs or data retransmissions impacts NVMe/TCP Write IOPs
> performance.
>
> Network traffic analysis revealed that SEQ-NUM being used by the host to
> ACK the frame that resized the TCP window had an older SEQ-NUM and not a
> value corresponding to the next SEQ-NUM expected on that connection.
>
> In such a case, the kernel was using the seq number calculated by
> tcp_wnd_end() as per the code segment below. Since, in this case
> tp->snd_wnd=0, tcp_wnd_end(tp) returns tp->snd_una, which is incorrect for
> the scenario.  The correct seq number that needs to be returned is
> tp->snd_nxt. This fix seems to have fixed the stale SEQ-NUM issue along
> with its performance impact.
>
>   1271 static inline u32 tcp_wnd_end(const struct tcp_sock *tp)
>   1272 {
>   1273   return tp->snd_una + tp->snd_wnd;
>   1274 }
>
> Signed-off-by: Kamaljit Singh <kamaljit.singh1@wdc.com>
> ---
>  net/ipv4/tcp_output.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 11aa0ab10bba..322e061edb72 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -100,6 +100,9 @@ static inline __u32 tcp_acceptable_seq(const struct sock *sk)
>             (tp->rx_opt.wscale_ok &&
>              ((tp->snd_nxt - tcp_wnd_end(tp)) < (1 << tp->rx_opt.rcv_wscale))))
>                 return tp->snd_nxt;
> +       else if (!tp->snd_wnd && !sock_flag(sk, SOCK_DEAD) &&
> +                !((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)))
> +               return tp->snd_nxt;
>         else
>                 return tcp_wnd_end(tp);
>  }
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v1 2/2] tcp: Ignore OOO handling for TCP ACKs
  2022-10-20 18:57   ` Eric Dumazet
@ 2022-10-20 21:25     ` Kamaljit Singh
  2022-10-20 21:47       ` Eric Dumazet
  0 siblings, 1 reply; 11+ messages in thread
From: Kamaljit Singh @ 2022-10-20 21:25 UTC (permalink / raw)
  To: edumazet
  Cc: Niklas Cassel, davem, Damien Le Moal, dsahern, yoshfuji, kuba,
	pabeni, netdev

On Thu, 2022-10-20 at 11:57 -0700, Eric Dumazet wrote:
> CAUTION: This email originated from outside of Western Digital. Do not click
> on links or open attachments unless you recognize the sender and know that the
> content is safe.
> 
> 
> On Thu, Oct 20, 2022 at 11:22 AM Kamaljit Singh <kamaljit.singh1@wdc.com>
> wrote:
> > Even with the TCP window fix to tcp_acceptable_seq(), occasional
> > out-of-order host ACKs were still seen under heavy write workloads thus
> > Impacting performance.  By removing the OoO optionality for ACKs in
> > __tcp_transmit_skb() that issue seems to be fixed as well.
> 
> This is highly suspect/bogus.
> 
>  Please give which driver is used here.
The NVMe/TCP Host driver (also mentioned in the cover letter).


> 
> > Signed-off-by: Kamaljit Singh <kamaljit.singh1@wdc.com>
> > ---
> >  net/ipv4/tcp_output.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index 322e061edb72..1cd77493f32c 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -1307,7 +1307,10 @@ static int __tcp_transmit_skb(struct sock *sk, struct
> > sk_buff *skb,
> >          * TODO: Ideally, in-flight pure ACK packets should not matter here.
> >          * One way to get this would be to set skb->truesize = 2 on them.
> >          */
> > -       skb->ooo_okay = sk_wmem_alloc_get(sk) < SKB_TRUESIZE(1);
> > +       if (likely(tcb->tcp_flags & TCPHDR_ACK))
> > +               skb->ooo_okay = 0;
> > +       else
> > +               skb->ooo_okay = sk_wmem_alloc_get(sk) < SKB_TRUESIZE(1);
> > 
> 
> This is absolutely wrong and would impact performance quite a lot.
> 
> You are basically removing all possibilities for ackets of a TCP flow
> to be directed to a new queue, say if use thread has migrated to
> another cpu.
Are you suggesting that the proposed change not be done at all or done in a
different way? We did see an observed performance improvement in NVMe/TCP
traffic with this fix. If you have an alternative idea I'd be happy to try &
test it out.


> 
> After 3WHS, all packets get ACK set.
-- 
Thanks,
Kamaljit Singh <kamaljit.singh1@wdc.com>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v1 2/2] tcp: Ignore OOO handling for TCP ACKs
  2022-10-20 21:25     ` Kamaljit Singh
@ 2022-10-20 21:47       ` Eric Dumazet
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2022-10-20 21:47 UTC (permalink / raw)
  To: Kamaljit Singh
  Cc: Niklas Cassel, davem, Damien Le Moal, dsahern, yoshfuji, kuba,
	pabeni, netdev

On Thu, Oct 20, 2022 at 2:25 PM Kamaljit Singh <Kamaljit.Singh1@wdc.com> wrote:
>
> On Thu, 2022-10-20 at 11:57 -0700, Eric Dumazet wrote:
> > CAUTION: This email originated from outside of Western Digital. Do not click
> > on links or open attachments unless you recognize the sender and know that the
> > content is safe.
> >
> >
> > On Thu, Oct 20, 2022 at 11:22 AM Kamaljit Singh <kamaljit.singh1@wdc.com>
> > wrote:
> > > Even with the TCP window fix to tcp_acceptable_seq(), occasional
> > > out-of-order host ACKs were still seen under heavy write workloads thus
> > > Impacting performance.  By removing the OoO optionality for ACKs in
> > > __tcp_transmit_skb() that issue seems to be fixed as well.
> >
> > This is highly suspect/bogus.
> >
> >  Please give which driver is used here.
> The NVMe/TCP Host driver (also mentioned in the cover letter).
>

This is code located on the same linux host ?

So... this is loopback interface ?

Which ndo_start_xmit() is called ?

>
> >
> > > Signed-off-by: Kamaljit Singh <kamaljit.singh1@wdc.com>
> > > ---
> > >  net/ipv4/tcp_output.c | 5 ++++-
> > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > index 322e061edb72..1cd77493f32c 100644
> > > --- a/net/ipv4/tcp_output.c
> > > +++ b/net/ipv4/tcp_output.c
> > > @@ -1307,7 +1307,10 @@ static int __tcp_transmit_skb(struct sock *sk, struct
> > > sk_buff *skb,
> > >          * TODO: Ideally, in-flight pure ACK packets should not matter here.
> > >          * One way to get this would be to set skb->truesize = 2 on them.
> > >          */
> > > -       skb->ooo_okay = sk_wmem_alloc_get(sk) < SKB_TRUESIZE(1);
> > > +       if (likely(tcb->tcp_flags & TCPHDR_ACK))
> > > +               skb->ooo_okay = 0;
> > > +       else
> > > +               skb->ooo_okay = sk_wmem_alloc_get(sk) < SKB_TRUESIZE(1);
> > >
> >
> > This is absolutely wrong and would impact performance quite a lot.
> >
> > You are basically removing all possibilities for ackets of a TCP flow
> > to be directed to a new queue, say if use thread has migrated to
> > another cpu.
> Are you suggesting that the proposed change not be done at all or done in a
> different way? We did see an observed performance improvement in NVMe/TCP
> traffic with this fix. If you have an alternative idea I'd be happy to try &
> test it out.

Well, you disable a very important feature, just to work around some
problem in another layer.

Let's investigate what can be done in this other layer, once it is identified.


>
>
> >
> > After 3WHS, all packets get ACK set.
> --
> Thanks,
> Kamaljit Singh <kamaljit.singh1@wdc.com>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v1 1/2] tcp: Fix for stale host ACK when tgt window shrunk
  2022-10-20 20:45   ` Eric Dumazet
@ 2022-10-21  1:01     ` Kamaljit Singh
  2022-10-21  2:48       ` Eric Dumazet
  0 siblings, 1 reply; 11+ messages in thread
From: Kamaljit Singh @ 2022-10-21  1:01 UTC (permalink / raw)
  To: edumazet
  Cc: Niklas Cassel, davem, Damien Le Moal, dsahern, yoshfuji, kuba,
	pabeni, netdev

[-- Attachment #1: Type: text/plain, Size: 4235 bytes --]

On Thu, 2022-10-20 at 13:45 -0700, Eric Dumazet wrote:
> CAUTION: This email originated from outside of Western Digital. Do not click
> on links or open attachments unless you recognize the sender and know that the
> content is safe.
> 
> 
> On Thu, Oct 20, 2022 at 11:22 AM Kamaljit Singh <kamaljit.singh1@wdc.com>
> wrote:
> > Under certain congestion conditions, an NVMe/TCP target may be configured
> > to shrink the TCP window in an effort to slow the sender down prior to
> > issuing a more drastic L2 pause or PFC indication.  Although the TCP
> > standard discourages implementations from shrinking the TCP window, it also
> > states that TCP implementations must be robust to this occurring. The
> > current Linux TCP layer (in conjunction with the NVMe/TCP host driver) has
> > an issue when the TCP window is shrunk by a target, which causes ACK frames
> > to be transmitted with a “stale” SEQ_NUM or for data frames to be
> > retransmitted by the host.
> 
> Linux sends ACK packets, with a legal SEQ number.
> 
> The issue is the receiver of such packets, right ?
Not exactly. In certain conditions the ACK pkt being sent by the NVMe/TCP
initiator has an incorrect SEQ-NUM. 

I've attached a .pcapng Network trace for Wireshark. This captures a small
snippet of 4K Writes from 10.10.11.151 to a target at 10.10.11.12 (using fio).
As you see pkt #2 shows a SEQ-NUM 4097, which is repeated in ACK pkt #12 from
the initiator. This happens right after the target closes the TCP window (pkts
#7, #8). Pkt #12 should've used a SEQ-NUM of 13033 in continuation from pkt #11.

This patch addresses the above scenario (tp->snd_wnd=0) and returns the correct
SEQ-NUM that is based on tp->snd_nxt. Without this patch the last else path was
returning tcp_wnd_end(tp), which sent the stale SEQ-NUM.

Initiator Environment:
- NVMe-oF Initiator: drivers/nvme/host/tcp.c
- NIC driver: mlx5_core (Mellanox, 100G), IP addr 10.10.11.151
- Ubuntu 20.04 LTS, Kernel 5.19.0-rc7 (with above patches 1 & 2 only)


> 
> Because as you said receivers should be relaxed about this, especially
> if _they_ decided
> to not respect the TCP standards.
> 
> You are proposing to send old ACK, that might be dropped by other stacks.
On the contrary, I'm proposing to use the expected/correct ACK based on tp-
>snd_nxt.


> 
> It has been observed that processing of these
> > “stale” ACKs or data retransmissions impacts NVMe/TCP Write IOPs
> > performance.
> > 
> > Network traffic analysis revealed that SEQ-NUM being used by the host to
> > ACK the frame that resized the TCP window had an older SEQ-NUM and not a
> > value corresponding to the next SEQ-NUM expected on that connection.
> > 
> > In such a case, the kernel was using the seq number calculated by
> > tcp_wnd_end() as per the code segment below. Since, in this case
> > tp->snd_wnd=0, tcp_wnd_end(tp) returns tp->snd_una, which is incorrect for
> > the scenario.  The correct seq number that needs to be returned is
> > tp->snd_nxt. This fix seems to have fixed the stale SEQ-NUM issue along
> > with its performance impact.
> > 
> >   1271 static inline u32 tcp_wnd_end(const struct tcp_sock *tp)
> >   1272 {
> >   1273   return tp->snd_una + tp->snd_wnd;
> >   1274 }
> > 
> > Signed-off-by: Kamaljit Singh <kamaljit.singh1@wdc.com>
> > ---
> >  net/ipv4/tcp_output.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index 11aa0ab10bba..322e061edb72 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -100,6 +100,9 @@ static inline __u32 tcp_acceptable_seq(const struct sock
> > *sk)
> >             (tp->rx_opt.wscale_ok &&
> >              ((tp->snd_nxt - tcp_wnd_end(tp)) < (1 << tp-
> > >rx_opt.rcv_wscale))))
> >                 return tp->snd_nxt;
> > +       else if (!tp->snd_wnd && !sock_flag(sk, SOCK_DEAD) &&
> > +                !((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)))
> > +               return tp->snd_nxt;
> >         else
> >                 return tcp_wnd_end(tp);
> >  }
> > --
> > 2.25.1
> > 
-- 
Thanks,
Kamaljit Singh <kamaljit.singh1@wdc.com>

[-- Attachment #2: AckWithStaleSeqNum.pcapng --]
[-- Type: application/x-pcapng, Size: 23028 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v1 1/2] tcp: Fix for stale host ACK when tgt window shrunk
  2022-10-21  1:01     ` Kamaljit Singh
@ 2022-10-21  2:48       ` Eric Dumazet
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2022-10-21  2:48 UTC (permalink / raw)
  To: Kamaljit Singh
  Cc: Niklas Cassel, davem, Damien Le Moal, dsahern, yoshfuji, kuba,
	pabeni, netdev

On Thu, Oct 20, 2022 at 6:01 PM Kamaljit Singh <Kamaljit.Singh1@wdc.com> wrote:
>
> On Thu, 2022-10-20 at 13:45 -0700, Eric Dumazet wrote:
> > CAUTION: This email originated from outside of Western Digital. Do not click
> > on links or open attachments unless you recognize the sender and know that the
> > content is safe.
> >
> >
> > On Thu, Oct 20, 2022 at 11:22 AM Kamaljit Singh <kamaljit.singh1@wdc.com>
> > wrote:
> > > Under certain congestion conditions, an NVMe/TCP target may be configured
> > > to shrink the TCP window in an effort to slow the sender down prior to
> > > issuing a more drastic L2 pause or PFC indication.  Although the TCP
> > > standard discourages implementations from shrinking the TCP window, it also
> > > states that TCP implementations must be robust to this occurring. The
> > > current Linux TCP layer (in conjunction with the NVMe/TCP host driver) has
> > > an issue when the TCP window is shrunk by a target, which causes ACK frames
> > > to be transmitted with a “stale” SEQ_NUM or for data frames to be
> > > retransmitted by the host.
> >
> > Linux sends ACK packets, with a legal SEQ number.
> >
> > The issue is the receiver of such packets, right ?
> Not exactly. In certain conditions the ACK pkt being sent by the NVMe/TCP
> initiator has an incorrect SEQ-NUM.
>
> I've attached a .pcapng Network trace for Wireshark. This captures a small
> snippet of 4K Writes from 10.10.11.151 to a target at 10.10.11.12 (using fio).
> As you see pkt #2 shows a SEQ-NUM 4097, which is repeated in ACK pkt #12 from
> the initiator. This happens right after the target closes the TCP window (pkts
> #7, #8). Pkt #12 should've used a SEQ-NUM of 13033 in continuation from pkt #11.
>
> This patch addresses the above scenario (tp->snd_wnd=0) and returns the correct
> SEQ-NUM that is based on tp->snd_nxt. Without this patch the last else path was
> returning tcp_wnd_end(tp), which sent the stale SEQ-NUM.
>
> Initiator Environment:
> - NVMe-oF Initiator: drivers/nvme/host/tcp.c
> - NIC driver: mlx5_core (Mellanox, 100G), IP addr 10.10.11.151
> - Ubuntu 20.04 LTS, Kernel 5.19.0-rc7 (with above patches 1 & 2 only)
>
>
> >
> > Because as you said receivers should be relaxed about this, especially
> > if _they_ decided
> > to not respect the TCP standards.
> >
> > You are proposing to send old ACK, that might be dropped by other stacks.
> On the contrary, I'm proposing to use the expected/correct ACK based on tp-
> >snd_nxt.

Please take a look at the very lengthy comment at the front of the function.

Basically we are in a mode where a value needs to be chosen, and we do
not really know which one
will be accepted by the buggy peer.

You are changing something that has been there forever, risking
breaking many other stacks, and/or middleboxes.

It seems the remote TCP stack is quite buggy, I do not think we want
to change something which has never been an issue until today in 2022.

Also not that packet #13, sent immediately after the ACK is carrying
whatever needed values.
I do not see why the prior packet (#12) would matter.

Please elaborate.



>
>
> >
> > It has been observed that processing of these
> > > “stale” ACKs or data retransmissions impacts NVMe/TCP Write IOPs
> > > performance.
> > >
> > > Network traffic analysis revealed that SEQ-NUM being used by the host to
> > > ACK the frame that resized the TCP window had an older SEQ-NUM and not a
> > > value corresponding to the next SEQ-NUM expected on that connection.
> > >
> > > In such a case, the kernel was using the seq number calculated by
> > > tcp_wnd_end() as per the code segment below. Since, in this case
> > > tp->snd_wnd=0, tcp_wnd_end(tp) returns tp->snd_una, which is incorrect for
> > > the scenario.  The correct seq number that needs to be returned is
> > > tp->snd_nxt. This fix seems to have fixed the stale SEQ-NUM issue along
> > > with its performance impact.
> > >
> > >   1271 static inline u32 tcp_wnd_end(const struct tcp_sock *tp)
> > >   1272 {
> > >   1273   return tp->snd_una + tp->snd_wnd;
> > >   1274 }
> > >
> > > Signed-off-by: Kamaljit Singh <kamaljit.singh1@wdc.com>
> > > ---
> > >  net/ipv4/tcp_output.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> > >
> > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > index 11aa0ab10bba..322e061edb72 100644
> > > --- a/net/ipv4/tcp_output.c
> > > +++ b/net/ipv4/tcp_output.c
> > > @@ -100,6 +100,9 @@ static inline __u32 tcp_acceptable_seq(const struct sock
> > > *sk)
> > >             (tp->rx_opt.wscale_ok &&
> > >              ((tp->snd_nxt - tcp_wnd_end(tp)) < (1 << tp-
> > > >rx_opt.rcv_wscale))))
> > >                 return tp->snd_nxt;
> > > +       else if (!tp->snd_wnd && !sock_flag(sk, SOCK_DEAD) &&
> > > +                !((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)))
> > > +               return tp->snd_nxt;
> > >         else
> > >                 return tcp_wnd_end(tp);
> > >  }
> > > --
> > > 2.25.1
> > >
> --
> Thanks,
> Kamaljit Singh <kamaljit.singh1@wdc.com>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v1 1/2] tcp: Fix for stale host ACK when tgt window shrunk
  2022-10-24 22:07 ` [PATCH v1 1/2] tcp: Fix for stale host ACK when tgt window shrunk Kamaljit Singh
@ 2022-10-25  0:21   ` Eric Dumazet
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2022-10-25  0:21 UTC (permalink / raw)
  To: Kamaljit Singh
  Cc: yoshfuji, Niklas Cassel, Damien Le Moal, davem, kuba, pabeni, netdev

On Mon, Oct 24, 2022 at 3:07 PM Kamaljit Singh <Kamaljit.Singh1@wdc.com> wrote:
>
> Hi Eric,
>
> Please find my inline responses below.
>
> Thanks,
> Kamaljit
>
>
> On Thu, 2022-10-20 at 19:48 -0700, Eric Dumazet wrote:
> > CAUTION: This email originated from outside of Western Digital. Do not click
> > on links or open attachments unless you recognize the sender and know that the
> > content is safe.
> >
> >
> > On Thu, Oct 20, 2022 at 6:01 PM Kamaljit Singh <Kamaljit.Singh1@wdc.com>
> > wrote:
> > > On Thu, 2022-10-20 at 13:45 -0700, Eric Dumazet wrote:
> > > > CAUTION: This email originated from outside of Western Digital. Do not
> > > > click
> > > > on links or open attachments unless you recognize the sender and know that
> > > > the
> > > > content is safe.
> > > >
> > > >
> > > > On Thu, Oct 20, 2022 at 11:22 AM Kamaljit Singh <kamaljit.singh1@wdc.com>
> > > > wrote:
> > > > > Under certain congestion conditions, an NVMe/TCP target may be
> > > > > configured
> > > > > to shrink the TCP window in an effort to slow the sender down prior to
> > > > > issuing a more drastic L2 pause or PFC indication.  Although the TCP
> > > > > standard discourages implementations from shrinking the TCP window, it
> > > > > also
> > > > > states that TCP implementations must be robust to this occurring. The
> > > > > current Linux TCP layer (in conjunction with the NVMe/TCP host driver)
> > > > > has
> > > > > an issue when the TCP window is shrunk by a target, which causes ACK
> > > > > frames
> > > > > to be transmitted with a “stale” SEQ_NUM or for data frames to be
> > > > > retransmitted by the host.
> > > >
> > > > Linux sends ACK packets, with a legal SEQ number.
> > > >
> > > > The issue is the receiver of such packets, right ?
> > > Not exactly. In certain conditions the ACK pkt being sent by the NVMe/TCP
> > > initiator has an incorrect SEQ-NUM.
> > >
> > > I've attached a .pcapng Network trace for Wireshark. This captures a small
> > > snippet of 4K Writes from 10.10.11.151 to a target at 10.10.11.12 (using
> > > fio).
> > > As you see pkt #2 shows a SEQ-NUM 4097, which is repeated in ACK pkt #12
> > > from
> > > the initiator. This happens right after the target closes the TCP window
> > > (pkts
> > > #7, #8). Pkt #12 should've used a SEQ-NUM of 13033 in continuation from pkt
> > > #11.
> > >
> > > This patch addresses the above scenario (tp->snd_wnd=0) and returns the
> > > correct
> > > SEQ-NUM that is based on tp->snd_nxt. Without this patch the last else path
> > > was
> > > returning tcp_wnd_end(tp), which sent the stale SEQ-NUM.
> > >
> > > Initiator Environment:
> > > - NVMe-oF Initiator: drivers/nvme/host/tcp.c
> > > - NIC driver: mlx5_core (Mellanox, 100G), IP addr 10.10.11.151
> > > - Ubuntu 20.04 LTS, Kernel 5.19.0-rc7 (with above patches 1 & 2 only)
> > >
> > >
> > > > Because as you said receivers should be relaxed about this, especially
> > > > if _they_ decided
> > > > to not respect the TCP standards.
> > > >
> > > > You are proposing to send old ACK, that might be dropped by other stacks.
> > > On the contrary, I'm proposing to use the expected/correct ACK based on tp-
> > > > snd_nxt.
> >
> > Please take a look at the very lengthy comment at the front of the function.
> >
> > Basically we are in a mode where a value needs to be chosen, and we do
> > not really know which one
> > will be accepted by the buggy peer.
> >
> I'm pasting the source code comment you're referring to here. You're right that
> the comment is very relevant in this case as the TCP window is being shrunk,
> although, I'd politely argue that its a design choice rather than a bug in our
> hardware target implementation.
>
> /* SND.NXT, if window was not shrunk or the amount of shrunk was less than one
>  * window scaling factor due to loss of precision.
>  * If window has been shrunk, what should we make? It is not clear at all.
>  * Using SND.UNA we will fail to open window, SND.NXT is out of window. :-(
>  * Anything in between SND.UNA...SND.UNA+SND.WND also can be already
>  * invalid. OK, let's make this for now:
>  */
>
> Below, I'm also pasting a plain-text version of the .pcapng, provided earlier as
> an email attachment. Hopefully this makes it easier to refer to the packets as
> you read through my comments. I had to massage the formatting to fit it in this
> email. Data remains the same except for AckNum for pkt#3 which referred to an
> older packet and threw off the formatting.
>
> Initiator = 10.10.11.151 (aka NVMe/TCP host)
> Target = 10.10.11.12
>
> No. Time        Src IP          Proto    Len    SeqNum  AckNum  WinSize
> 1   0.000000000 10.10.11.151    TCP      4154   1       1       25
> 2   0.000000668 10.10.11.151    TCP      4154   4097    1       25
> 3   0.000039250 10.10.11.12     TCP      64     1       x       16384
> 4   0.000040064 10.10.11.12     TCP      64     1       1       16384
> 5   0.000040951 10.10.11.12     NVMe/TCP 82     1       1       16384
> 6   0.000041009 10.10.11.12     NVMe/TCP 82     25      1       16384
> 7   0.000059422 10.10.11.12     TCP      64     49      4097    0
> 8   0.000060059 10.10.11.12     TCP      64     49      8193    0
> 9   0.000072519 10.10.11.12     TCP      64     49      8193    16384
> 10  0.000074756 10.10.11.151    TCP      4154   8193    1       25
> 11  0.000075089 10.10.11.151    TCP      802    12289   1       25
> 12  0.000089454 10.10.11.151    TCP      64     4097    49      25
> 13  0.000102225 10.10.11.151    TCP      4154   13033   49      25
> 14  0.000102567 10.10.11.151    TCP      4154   17129   49      25
> 15  0.000140273 10.10.11.12     TCP      64     49      13033   16384
> 16  0.000157344 10.10.11.151    TCP      106    21225   49      25
> 17  0.000158580 10.10.11.12     TCP      64     49      13033   0
>
> Packets #7 and #8: Target shrinks window to zero for congestion control
> Packet #9: ~12us later Target expands window back to 16384
>
> [Packet #12] is an ACK packet from the Initiator. Since it does not send data,
> window shrinking should not affect its SEQ-NUM here. Hence, its probably safe to
> send SND.NXT, as in my patch. In other words, TCP window should be relevant to
> data packets and not to ACK packets. Would you agree?

No, I do not agree.

See at the end of this email a packetdrill test demonstrating your
pacth would add extra work
(a challenge ACK)

>
> [Referring to the SND pointers diagram at this URL
> http://www.tcpipguide.com/free/t_TCPSlidingWindowDataTransferandAcknowledgementMech-2.htm]
>
> This unexpected behavior by the Initiator causes our hardware offloaded Target
> to hand-off control to Firmware slow-path.

This is the bug.

This packet is perfectly normal and should not cause a problem with an offload.

Please contact the vendor to fix this issue.

 If we can keep everything in the
> hardware (fast) path and not invoke Firmware that's when we have the best chance
> of optimal performance.
>
> Running heavy workloads at 100G link rate there can be a million instances of
> such behavior as packet #12 is exhibiting and is very disruptive to fast-path.
>
>
> > You are changing something that has been there forever, risking
> > breaking many other stacks, and/or middleboxes.
> >
> Regardless of how we handle this patch, the fact remains that for any other
> hardware based TCP offloads existing elsewhere they will have the same
> susceptibility as a result of this Linux TCP behavior, even if their congestion
> control mechanism does not match this scenario.
>
> Being fully aware of how ubiquitous TCP layer is we tried ways to avoid changing
> it. Early on, in drivers/nvme/host/tcp.c we had even tried
> tcp_sock_set_quickack() instead of PATCH 2/2 but it did not help. If you can
> suggest a better way that could de-risk existing deployments, I'd be more than
> happy to discuss this further and try other solutions. An example could be a TCP
> socket option that would stay disabled by default. We could then use it in the NVMe/TCP driver or a userspace accessible method.
>
> > It seems the remote TCP stack is quite buggy, I do not think we want
> > to change something which has never been an issue until today in 2022.
> >
> IMHO, I wouldn't quite characterize it that way. It was a design choice and
> provides one of multiple ways of handling network congestion. It may also be
> possible that there are other implementations affected by this issue.
>
>
> > Also not that packet #13, sent immediately after the ACK is carrying
> > whatever needed values.
> > I do not see why the prior packet (#12) would matter.
> >
> > Please elaborate.
> Packet #13, however, is a data packet sent by the Initiator. This is in direct
> reaction to packet #9 from the Target that expanded the window size back to 16K.
> Even though it correctly uses SND.NXT it does not affect the handling for packet
> #12 in our hardware Target.
>
>

Here is a packetdrill test showing the problem you are adding, since
it seems it is not clear to you.

    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [16384], 4) = 0

   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 2920 <mss 4096,sackOK,nop,nop>
   +0 > S. 0:0(0) ack 1 <mss 4096,nop,nop,sackOK>
  +.1 < . 1:1(0) ack 1 win 16384
    +0 accept(3, ..., ...) = 4

// Note: linux will not shrink the window...
// I think this would require some patch, to emulate a buggy stack
   +0 setsockopt(4, SOL_SOCKET, SO_RCVBUF, [1000], 4) = 0

   +0 < P. 1:16385(16384) ack 1 win 150
   +0 write(4, ..., 100) = 100
   +0 > P. 1:101(100) ack 16385 win 0

// OK, what if the ACK carries a sequence in the future ?
// This could happen if the peer sent 18000 bytes while our window was
> 16384 and
// if kamaljit.singh1@wdc.com  patch would be accepted...

   +.1 < . 18001:18001(0) ack 101 win 1000

// Too bad, prior ACK has a sequence in the future
// We send a challenge ACK in an attempt to fix the synchronization issue.
// This would be avoided completely if prior ack was "16385:16385(0)
ack 101 win 1000"
  +0  > . 101:101(0) ack 16385 win 0

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v1 1/2] tcp: Fix for stale host ACK when tgt window shrunk
       [not found] <3a24f8979afad17bc33de485a6a75bfc51102b91.camel@wdc.com>
@ 2022-10-24 22:07 ` Kamaljit Singh
  2022-10-25  0:21   ` Eric Dumazet
  0 siblings, 1 reply; 11+ messages in thread
From: Kamaljit Singh @ 2022-10-24 22:07 UTC (permalink / raw)
  To: edumazet
  Cc: yoshfuji, Niklas Cassel, Damien Le Moal, davem, kuba, pabeni, netdev

Hi Eric,

Please find my inline responses below.

Thanks,
Kamaljit


On Thu, 2022-10-20 at 19:48 -0700, Eric Dumazet wrote:
> CAUTION: This email originated from outside of Western Digital. Do not click
> on links or open attachments unless you recognize the sender and know that the
> content is safe.
> 
> 
> On Thu, Oct 20, 2022 at 6:01 PM Kamaljit Singh <Kamaljit.Singh1@wdc.com>
> wrote:
> > On Thu, 2022-10-20 at 13:45 -0700, Eric Dumazet wrote:
> > > CAUTION: This email originated from outside of Western Digital. Do not
> > > click
> > > on links or open attachments unless you recognize the sender and know that
> > > the
> > > content is safe.
> > > 
> > > 
> > > On Thu, Oct 20, 2022 at 11:22 AM Kamaljit Singh <kamaljit.singh1@wdc.com>
> > > wrote:
> > > > Under certain congestion conditions, an NVMe/TCP target may be
> > > > configured
> > > > to shrink the TCP window in an effort to slow the sender down prior to
> > > > issuing a more drastic L2 pause or PFC indication.  Although the TCP
> > > > standard discourages implementations from shrinking the TCP window, it
> > > > also
> > > > states that TCP implementations must be robust to this occurring. The
> > > > current Linux TCP layer (in conjunction with the NVMe/TCP host driver)
> > > > has
> > > > an issue when the TCP window is shrunk by a target, which causes ACK
> > > > frames
> > > > to be transmitted with a “stale” SEQ_NUM or for data frames to be
> > > > retransmitted by the host.
> > > 
> > > Linux sends ACK packets, with a legal SEQ number.
> > > 
> > > The issue is the receiver of such packets, right ?
> > Not exactly. In certain conditions the ACK pkt being sent by the NVMe/TCP
> > initiator has an incorrect SEQ-NUM.
> > 
> > I've attached a .pcapng Network trace for Wireshark. This captures a small
> > snippet of 4K Writes from 10.10.11.151 to a target at 10.10.11.12 (using
> > fio).
> > As you see pkt #2 shows a SEQ-NUM 4097, which is repeated in ACK pkt #12
> > from
> > the initiator. This happens right after the target closes the TCP window
> > (pkts
> > #7, #8). Pkt #12 should've used a SEQ-NUM of 13033 in continuation from pkt
> > #11.
> > 
> > This patch addresses the above scenario (tp->snd_wnd=0) and returns the
> > correct
> > SEQ-NUM that is based on tp->snd_nxt. Without this patch the last else path
> > was
> > returning tcp_wnd_end(tp), which sent the stale SEQ-NUM.
> > 
> > Initiator Environment:
> > - NVMe-oF Initiator: drivers/nvme/host/tcp.c
> > - NIC driver: mlx5_core (Mellanox, 100G), IP addr 10.10.11.151
> > - Ubuntu 20.04 LTS, Kernel 5.19.0-rc7 (with above patches 1 & 2 only)
> > 
> > 
> > > Because as you said receivers should be relaxed about this, especially
> > > if _they_ decided
> > > to not respect the TCP standards.
> > > 
> > > You are proposing to send old ACK, that might be dropped by other stacks.
> > On the contrary, I'm proposing to use the expected/correct ACK based on tp-
> > > snd_nxt.
> 
> Please take a look at the very lengthy comment at the front of the function.
> 
> Basically we are in a mode where a value needs to be chosen, and we do
> not really know which one
> will be accepted by the buggy peer.
> 
I'm pasting the source code comment you're referring to here. You're right that
the comment is very relevant in this case as the TCP window is being shrunk,
although, I'd politely argue that its a design choice rather than a bug in our
hardware target implementation.

/* SND.NXT, if window was not shrunk or the amount of shrunk was less than one
 * window scaling factor due to loss of precision.
 * If window has been shrunk, what should we make? It is not clear at all.
 * Using SND.UNA we will fail to open window, SND.NXT is out of window. :-(
 * Anything in between SND.UNA...SND.UNA+SND.WND also can be already
 * invalid. OK, let's make this for now:
 */

Below, I'm also pasting a plain-text version of the .pcapng, provided earlier as
an email attachment. Hopefully this makes it easier to refer to the packets as
you read through my comments. I had to massage the formatting to fit it in this
email. Data remains the same except for AckNum for pkt#3 which referred to an
older packet and threw off the formatting.

Initiator = 10.10.11.151 (aka NVMe/TCP host)
Target = 10.10.11.12

No. Time        Src IP          Proto    Len    SeqNum  AckNum  WinSize
1   0.000000000 10.10.11.151    TCP      4154   1       1       25
2   0.000000668 10.10.11.151    TCP      4154   4097    1       25
3   0.000039250 10.10.11.12     TCP      64     1       x       16384   
4   0.000040064 10.10.11.12     TCP      64     1       1       16384
5   0.000040951 10.10.11.12     NVMe/TCP 82     1       1       16384
6   0.000041009 10.10.11.12     NVMe/TCP 82     25      1       16384
7   0.000059422 10.10.11.12     TCP      64     49      4097    0
8   0.000060059 10.10.11.12     TCP      64     49      8193    0 
9   0.000072519 10.10.11.12     TCP      64     49      8193    16384 
10  0.000074756 10.10.11.151    TCP      4154   8193    1       25 
11  0.000075089 10.10.11.151    TCP      802    12289   1       25 
12  0.000089454 10.10.11.151    TCP      64     4097    49      25 
13  0.000102225 10.10.11.151    TCP      4154   13033   49      25 
14  0.000102567 10.10.11.151    TCP      4154   17129   49      25 
15  0.000140273 10.10.11.12     TCP      64     49      13033   16384 
16  0.000157344 10.10.11.151    TCP      106    21225   49      25 
17  0.000158580 10.10.11.12     TCP      64     49      13033   0

Packets #7 and #8: Target shrinks window to zero for congestion control
Packet #9: ~12us later Target expands window back to 16384

[Packet #12] is an ACK packet from the Initiator. Since it does not send data,
window shrinking should not affect its SEQ-NUM here. Hence, its probably safe to
send SND.NXT, as in my patch. In other words, TCP window should be relevant to
data packets and not to ACK packets. Would you agree?

[Referring to the SND pointers diagram at this URL 
http://www.tcpipguide.com/free/t_TCPSlidingWindowDataTransferandAcknowledgementMech-2.htm]

This unexpected behavior by the Initiator causes our hardware offloaded Target
to hand-off control to Firmware slow-path. If we can keep everything in the
hardware (fast) path and not invoke Firmware that's when we have the best chance
of optimal performance.

Running heavy workloads at 100G link rate there can be a million instances of
such behavior as packet #12 is exhibiting and is very disruptive to fast-path.


> You are changing something that has been there forever, risking
> breaking many other stacks, and/or middleboxes.
> 
Regardless of how we handle this patch, the fact remains that for any other
hardware based TCP offloads existing elsewhere they will have the same
susceptibility as a result of this Linux TCP behavior, even if their congestion
control mechanism does not match this scenario.

Being fully aware of how ubiquitous TCP layer is we tried ways to avoid changing
it. Early on, in drivers/nvme/host/tcp.c we had even tried
tcp_sock_set_quickack() instead of PATCH 2/2 but it did not help. If you can
suggest a better way that could de-risk existing deployments, I'd be more than
happy to discuss this further and try other solutions. An example could be a TCP
socket option that would stay disabled by default. We could then use it in the NVMe/TCP driver or a userspace accessible method.

> It seems the remote TCP stack is quite buggy, I do not think we want
> to change something which has never been an issue until today in 2022.
> 
IMHO, I wouldn't quite characterize it that way. It was a design choice and
provides one of multiple ways of handling network congestion. It may also be
possible that there are other implementations affected by this issue.


> Also not that packet #13, sent immediately after the ACK is carrying
> whatever needed values.
> I do not see why the prior packet (#12) would matter.
> 
> Please elaborate.
Packet #13, however, is a data packet sent by the Initiator. This is in direct
reaction to packet #9 from the Target that expanded the window size back to 16K.
Even though it correctly uses SND.NXT it does not affect the handling for packet
#12 in our hardware Target.


> 
> 
> > > It has been observed that processing of these
> > > > “stale” ACKs or data retransmissions impacts NVMe/TCP Write IOPs
> > > > performance.
> > > > 
> > > > Network traffic analysis revealed that SEQ-NUM being used by the host to
> > > > ACK the frame that resized the TCP window had an older SEQ-NUM and not a
> > > > value corresponding to the next SEQ-NUM expected on that connection.
> > > > 
> > > > In such a case, the kernel was using the seq number calculated by
> > > > tcp_wnd_end() as per the code segment below. Since, in this case
> > > > tp->snd_wnd=0, tcp_wnd_end(tp) returns tp->snd_una, which is incorrect
> > > > for
> > > > the scenario.  The correct seq number that needs to be returned is
> > > > tp->snd_nxt. This fix seems to have fixed the stale SEQ-NUM issue along
> > > > with its performance impact.
> > > > 
> > > >   1271 static inline u32 tcp_wnd_end(const struct tcp_sock *tp)
> > > >   1272 {
> > > >   1273   return tp->snd_una + tp->snd_wnd;
> > > >   1274 }
> > > > 
> > > > Signed-off-by: Kamaljit Singh <kamaljit.singh1@wdc.com>
> > > > ---
> > > >  net/ipv4/tcp_output.c | 3 +++
> > > >  1 file changed, 3 insertions(+)
> > > > 
> > > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > > index 11aa0ab10bba..322e061edb72 100644
> > > > --- a/net/ipv4/tcp_output.c
> > > > +++ b/net/ipv4/tcp_output.c
> > > > @@ -100,6 +100,9 @@ static inline __u32 tcp_acceptable_seq(const struct
> > > > sock
> > > > *sk)
> > > >             (tp->rx_opt.wscale_ok &&
> > > >              ((tp->snd_nxt - tcp_wnd_end(tp)) < (1 << tp-
> > > > > rx_opt.rcv_wscale))))
> > > >                 return tp->snd_nxt;
> > > > +       else if (!tp->snd_wnd && !sock_flag(sk, SOCK_DEAD) &&
> > > > +                !((1 << sk->sk_state) & (TCPF_SYN_SENT |
> > > > TCPF_SYN_RECV)))
> > > > +               return tp->snd_nxt;
> > > >         else
> > > >                 return tcp_wnd_end(tp);
> > > >  }
> > > > --
> > > > 2.25.1
> > > > 
> > --
> > Thanks,
> > Kamaljit Singh <kamaljit.singh1@wdc.com>



^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-10-25  1:12 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-20 18:22 [PATCH v1 0/2] tcp bugs for window shrinking, OoO ACKs Kamaljit Singh
2022-10-20 18:22 ` [PATCH v1 1/2] tcp: Fix for stale host ACK when tgt window shrunk Kamaljit Singh
2022-10-20 20:45   ` Eric Dumazet
2022-10-21  1:01     ` Kamaljit Singh
2022-10-21  2:48       ` Eric Dumazet
2022-10-20 18:22 ` [PATCH v1 2/2] tcp: Ignore OOO handling for TCP ACKs Kamaljit Singh
2022-10-20 18:57   ` Eric Dumazet
2022-10-20 21:25     ` Kamaljit Singh
2022-10-20 21:47       ` Eric Dumazet
     [not found] <3a24f8979afad17bc33de485a6a75bfc51102b91.camel@wdc.com>
2022-10-24 22:07 ` [PATCH v1 1/2] tcp: Fix for stale host ACK when tgt window shrunk Kamaljit Singh
2022-10-25  0:21   ` Eric Dumazet

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.