* nvme-tls and TCP window full
@ 2023-07-06 18:42 Hannes Reinecke
  2023-07-06 23:18 ` Sagi Grimberg
  0 siblings, 1 reply; 11+ messages in thread
From: Hannes Reinecke @ 2023-07-06 18:42 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme

Hi Sagi,

I'm currently debugging my nvme-tls patches; with the rebase to the
latest Linus head, things started to behave erratically: occasionally a
CQE was never received, triggering a reset.

Originally I thought my read_sock() implementation was to blame, but
then I fixed wireshark to do proper frame dissection (wireshark merge
req #11359) and found that it's rather an issue with the TCP window
becoming full.
While this is arguably an issue with the sender (which is trying to send
32k worth of data in one packet), the connection never recovers from the
window-full state; the command is dropped, never to reappear.

I would have thought/hoped that we (from the NVMe side) would
be able to handle it better; in particular I'm surprised that we can 
send large chunks of data at all. And that the packet is dropped due to 
a protocol error without us being notified.
So the question really is: do we check for the TCP window size somewhere?
If so, where? Or is it something the lower layers have to do for us?

Full packet dissection available on request.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman



* Re: nvme-tls and TCP window full
  2023-07-06 18:42 nvme-tls and TCP window full Hannes Reinecke
@ 2023-07-06 23:18 ` Sagi Grimberg
  2023-07-11  9:28   ` Sagi Grimberg
  0 siblings, 1 reply; 11+ messages in thread
From: Sagi Grimberg @ 2023-07-06 23:18 UTC (permalink / raw)
  To: Hannes Reinecke, linux-nvme


> Hi Sagi,

Hey Hannes,

> I'm currently debugging my nvme-tls patches; with the rebase to the
> latest Linus head, things started to behave erratically: occasionally a
> CQE was never received, triggering a reset.

Bummer.

> Originally I thought my read_sock() implementation was to blame, but
> then I fixed wireshark to do proper frame dissection (wireshark merge
> req #11359) and found that it's rather an issue with the TCP window
> becoming full.
> While this is arguably an issue with the sender (which is trying to send 
> 32k worth of data in one packet)

This is before TSO; it is not actually 32K in a single packet. If you
capture on the switch, you will see that it is packetized to MTU size
properly.

> the connection never recovers from the 
> window-full state; the command is dropped, never to reappear.

This is a bug. And sounds like a new issue to me. Does this happen
without TLS patches applied? And without SPLICE patches from david?

> I would have thought/hoped that we (from the NVMe side) would
> be able to handle it better; in particular I'm surprised that we can 
> send large chunks of data at all.

What do you mean? If we get an R2T of a given size (within MDTS), we just
send it all... What do you mean by handling it better? What would you
expect the driver to do?

> And that the packet is dropped due to 
> a protocol error without us being notified.

TCP needs to handle it. It does not notify the consumer unless a TCP
reset is sent, which breaks the connection. But I am not familiar
with a scenario where a TCP segment is dropped and not retried. TCP is
full of retry logic based on dropped packets (with RTT measurements)
and acks/data arriving out of order...

> So the question really is: do we check for the TCP window size somewhere?
> If so, where? Or is it something the lower layers have to do for us?

We don't, and we can't. The application is not and cannot be exposed to
the TCP window size because it belongs to the very low-level congestion
control machinery buried deep down in TCP; that's why socket buffers
exist. The application sends until the socket buffer is full (and evicts
sent buffers once they are acknowledged by the ack'd sequence number).

> Full packet dissection available on request.

This does not sound like an nvme-tcp problem to me. Sounds like
a breakage. It is possible that the sendpage conversion missed
a flag or something, or that the stack behaves slightly differently. Or
something in your TLS patches and its interaction with the SPLICE
patches.

Can you send me your updated code (tls fixes plus nvme-tls)? I suggest
starting to bisect.



* Re: nvme-tls and TCP window full
  2023-07-06 23:18 ` Sagi Grimberg
@ 2023-07-11  9:28   ` Sagi Grimberg
  2023-07-11 10:31     ` Hannes Reinecke
  0 siblings, 1 reply; 11+ messages in thread
From: Sagi Grimberg @ 2023-07-11  9:28 UTC (permalink / raw)
  To: Hannes Reinecke, linux-nvme

Hey Hannes,

Any progress on this one?

On 7/7/23 02:18, Sagi Grimberg wrote:
> 
>> Hi Sagi,
> 
> Hey Hannes,
> 
>> I'm currently debugging my nvme-tls patches; with the rebase to the
>> latest Linus head, things started to behave erratically: occasionally a
>> CQE was never received, triggering a reset.
> 
> Bummer.
> 
>> Originally I thought my read_sock() implementation was to blame, but
>> then I fixed wireshark to do proper frame dissection (wireshark merge
>> req #11359) and found that it's rather an issue with the TCP window
>> becoming full.
>> While this is arguably an issue with the sender (which is trying to 
>> send 32k worth of data in one packet)
> 
> This is before TSO; it is not actually 32K in a single packet. If you
> capture on the switch, you will see that it is packetized to MTU size
> properly.
> 
>> the connection never recovers from the window-full state; the command 
>> is dropped, never to reappear.
> 
> This is a bug. And sounds like a new issue to me. Does this happen
> without TLS patches applied? And without SPLICE patches from david?
> 
>> I would have thought/hoped that we (from the NVMe side) would
>> be able to handle it better; in particular I'm surprised that we can 
>> send large chunks of data at all.
> 
> What do you mean? If we get an R2T of a given size (within MDTS), we just
> send it all... What do you mean by handling it better? What would you
> expect the driver to do?
> 
>> And that the packet is dropped due to a protocol error without us 
>> being notified.
> 
> TCP needs to handle it. It does not notify the consumer unless a TCP
> reset is sent, which breaks the connection. But I am not familiar
> with a scenario where a TCP segment is dropped and not retried. TCP is
> full of retry logic based on dropped packets (with RTT measurements)
> and acks/data arriving out of order...
> 
>> So the question really is: do we check for the TCP window size somewhere?
>> If so, where? Or is it something the lower layers have to do for us?
> 
> We don't, and we can't. The application is not and cannot be exposed to
> the TCP window size because it belongs to the very low-level congestion
> control machinery buried deep down in TCP; that's why socket buffers
> exist. The application sends until the socket buffer is full (and evicts
> sent buffers once they are acknowledged by the ack'd sequence number).
> 
>> Full packet dissection available on request.
> 
> This does not sound like an nvme-tcp problem to me. Sounds like
> a breakage. It is possible that the sendpage conversion missed
> a flag or something, or that the stack behaves slightly differently. Or
> something in your TLS patches and its interaction with the SPLICE
> patches.
> 
> Can you send me your updated code (tls fixes plus nvme-tls)? I suggest
> starting to bisect.



* Re: nvme-tls and TCP window full
  2023-07-11  9:28   ` Sagi Grimberg
@ 2023-07-11 10:31     ` Hannes Reinecke
  2023-07-11 12:05       ` Sagi Grimberg
  0 siblings, 1 reply; 11+ messages in thread
From: Hannes Reinecke @ 2023-07-11 10:31 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme

On 7/11/23 11:28, Sagi Grimberg wrote:
> Hey Hannes,
> 
> Any progress on this one?
> 
Oh well; slow going.
After some lengthy debugging I found that the read_sock() implementation 
wasn't exactly up to scratch (it would read each byte individually, 
instead of each skb individually ... D'oh).
But even with that fixed I'm still seeing I/O stalls due to window full:

[85639.669120] nvme nvme0: queue 2: C2H data cid 0x4d offset 0 len 4096
[85639.669885] nvme nvme0: queue 2: PDU Data cid 0x4d offset 29 len 4120 
rem 4096
[85639.670809] nvme nvme0: queue 2: RSP cid 0x4d status 0
[85639.671449] nvme nvme0: queue 2: pdu 24 data 0 consumed 4144 result 0
[85639.674928] nvme nvme0: queue 2: skb 00000000bee9b76d len 8262 offset 
5 len 8240
[85639.676474] nvme nvme0: queue 2: C2H data cid 0x4e offset 0 len 8192
[85639.678013] nvme nvme0: queue 2: PDU Data cid 0x4e offset 29 len 8216 
rem 8192
[85639.679810] nvme nvme0: queue 2: RSP cid 0x4e status 0
[85639.681247] nvme nvme0: queue 2: pdu 24 data 0 consumed 8240 result 0
[85639.690985] nvme nvme0: queue 2: skb 00000000bee9b76d len 16406 
offset 5 len 16384
[85639.692321] nvme nvme0: queue 2: C2H data cid 0x4f offset 0 len 126976
[85639.694012] nvme nvme0: queue 2: PDU Data cid 0x4f offset 29 len 
16360 rem 126976
[85639.695882] nvme nvme0: queue 2: pdu 0 data 110616 consumed 16384 
result 0
[85639.697667] nvme nvme0: queue 2: skb 00000000bee9b76d len 16406 
offset 5 len 16384
[85639.699339] nvme nvme0: queue 2: PDU Data cid 0x4f offset 5 len 16384 
rem 110616
[85639.701122] nvme nvme0: queue 2: pdu 0 data 94232 consumed 16384 result 0
[85639.702831] nvme nvme0: queue 2: skb 00000000bee9b76d len 16406 
offset 5 len 16384
[85639.704918] nvme nvme0: queue 2: PDU Data cid 0x4f offset 5 len 16384 
rem 94232
[85639.706759] nvme nvme0: queue 2: pdu 0 data 77848 consumed 16384 result 0
[85639.708367] nvme nvme0: queue 2: skb 00000000bee9b76d len 16406 
offset 5 len 16384
[85639.711219] nvme nvme0: queue 2: PDU Data cid 0x4f offset 5 len 16384 
rem 77848
[85639.712108] nvme nvme0: queue 2: pdu 0 data 61464 consumed 16384 result 0
[85639.714182] nvme nvme0: queue 2: skb 00000000bee9b76d len 16406 
offset 5 len 16384
[85639.715107] nvme nvme0: queue 2: PDU Data cid 0x4f offset 5 len 16384 
rem 61464
[85639.715989] nvme nvme0: queue 2: pdu 0 data 45080 consumed 16384 result 0
[85671.510572] nvme nvme0: queue 2: timeout cid 0x4f type 4 opcode 0x2 
(Read)

(These are just client-side logs).
Which looks to me as if we're waiting for the server to continue sending
PDU data frames, only these never arrive.
So really I wonder whether it's not rather an issue on the server side. 
Maybe the server doesn't retire skbs (or not all of them), causing the 
TCP window to shrink.
That, of course, is wild guessing, as I have no idea if and how calls to 
'consume_skb' reflect back to the TCP window size.

Investigation continues.

Cheers,

Hannes




* Re: nvme-tls and TCP window full
  2023-07-11 10:31     ` Hannes Reinecke
@ 2023-07-11 12:05       ` Sagi Grimberg
  2023-07-13  9:48         ` Hannes Reinecke
  0 siblings, 1 reply; 11+ messages in thread
From: Sagi Grimberg @ 2023-07-11 12:05 UTC (permalink / raw)
  To: Hannes Reinecke, linux-nvme


>> Hey Hannes,
>>
>> Any progress on this one?
>>
> Oh well; slow going.
> After some lengthy debugging I found that the read_sock() implementation 
> wasn't exactly up to scratch (it would read each byte individually, 
> instead of each skb individually ... D'oh).
> But even with that fixed I'm still seeing I/O stalls due to window full:

What does your current read_sock look like?

One thing that looks incorrect is that you should be calling
tls_rx_rec_wait() in a loop as long as it returns 1, no? That way
you will continue to consume as long as you have something in the
sk ready for processing...

Something like this (completely untested):
do {
	if (!skb) {
		if (!skb_queue_empty(&ctx->rx_list)) {
			skb = __skb_dequeue(&ctx->rx_list);
		} else {
			struct tls_decrypt_arg darg;

			err = tls_rx_rec_wait(sk, NULL, true, true);
			if (err <= 0) {
				tls_rx_reader_release(sk, ctx);
				return err;
			}

			memset(&darg.inargs, 0, sizeof(darg.inargs));

			err = tls_rx_one_record(sk, NULL, &darg);
			if (err < 0) {
				tls_err_abort(sk, -EBADMSG);
				tls_rx_reader_release(sk, ctx);
				return err;
			}

			tls_rx_rec_done(ctx);
			skb = darg.skb;
		}
	}

	rxm = strp_msg(skb);
	tlm = tls_msg(skb);

	/* read_sock does not support reading control messages */
	if (tlm->control != TLS_RECORD_TYPE_DATA) {
		err = -EINVAL;
		goto read_sock_requeue;
	}

	used = read_actor(desc, skb, rxm->offset, rxm->full_len);
	if (used <= 0) {
		err = used;
		goto read_sock_end;
	}

	copied += used;
	if (used < rxm->full_len) {
		rxm->offset += used;
		rxm->full_len -= used;
		if (!desc->count)
			goto read_sock_requeue;
	} else {
		consume_skb(skb);
		if (desc->count && !skb_queue_empty(&ctx->rx_list))
			skb = __skb_dequeue(&ctx->rx_list);
		else
			skb = NULL;
	}
} while (skb);

> 
> [85639.669120] nvme nvme0: queue 2: C2H data cid 0x4d offset 0 len 4096
> [85639.669885] nvme nvme0: queue 2: PDU Data cid 0x4d offset 29 len 4120 
> rem 4096

What is rem? Remaining? And what is it derived from, the PDU or the skb?

> [85639.670809] nvme nvme0: queue 2: RSP cid 0x4d status 0
> [85639.671449] nvme nvme0: queue 2: pdu 24 data 0 consumed 4144 result 0
> [85639.674928] nvme nvme0: queue 2: skb 00000000bee9b76d len 8262 offset 
> 5 len 8240

What is offset? skb offset?

> [85639.676474] nvme nvme0: queue 2: C2H data cid 0x4e offset 0 len 8192
> [85639.678013] nvme nvme0: queue 2: PDU Data cid 0x4e offset 29 len 8216 
> rem 8192

Offset is 29? Is that the PDU offset? Should it be 24? Does it somehow
combine the offset 5 from above?

It is not very clear what these prints mean...

> [85639.679810] nvme nvme0: queue 2: RSP cid 0x4e status 0
> [85639.681247] nvme nvme0: queue 2: pdu 24 data 0 consumed 8240 result 0
> [85639.690985] nvme nvme0: queue 2: skb 00000000bee9b76d len 16406 
> offset 5 len 16384
> [85639.692321] nvme nvme0: queue 2: C2H data cid 0x4f offset 0 len 126976
> [85639.694012] nvme nvme0: queue 2: PDU Data cid 0x4f offset 29 len 
> 16360 rem 126976
> [85639.695882] nvme nvme0: queue 2: pdu 0 data 110616 consumed 16384 
> result 0
> [85639.697667] nvme nvme0: queue 2: skb 00000000bee9b76d len 16406 
> offset 5 len 16384
> [85639.699339] nvme nvme0: queue 2: PDU Data cid 0x4f offset 5 len 16384 
> rem 110616
> [85639.701122] nvme nvme0: queue 2: pdu 0 data 94232 consumed 16384 
> result 0
> [85639.702831] nvme nvme0: queue 2: skb 00000000bee9b76d len 16406 
> offset 5 len 16384
> [85639.704918] nvme nvme0: queue 2: PDU Data cid 0x4f offset 5 len 16384 
> rem 94232
> [85639.706759] nvme nvme0: queue 2: pdu 0 data 77848 consumed 16384 
> result 0
> [85639.708367] nvme nvme0: queue 2: skb 00000000bee9b76d len 16406 
> offset 5 len 16384
> [85639.711219] nvme nvme0: queue 2: PDU Data cid 0x4f offset 5 len 16384 
> rem 77848
> [85639.712108] nvme nvme0: queue 2: pdu 0 data 61464 consumed 16384 
> result 0
> [85639.714182] nvme nvme0: queue 2: skb 00000000bee9b76d len 16406 
> offset 5 len 16384
> [85639.715107] nvme nvme0: queue 2: PDU Data cid 0x4f offset 5 len 16384 
> rem 61464
> [85639.715989] nvme nvme0: queue 2: pdu 0 data 45080 consumed 16384 
> result 0
> [85671.510572] nvme nvme0: queue 2: timeout cid 0x4f type 4 opcode 0x2 
> (Read)
> 
> (These are just client-side logs).
> Which looks to me as if we're waiting for the server to continue sending
> PDU data frames, only these never arrive.

I think it is more likely that you are losing sync on the stream...
Maybe the server is done sending data, but if you have a bug processing
it, the client thinks that more is coming...

What is the workload you are testing? Is this workload-dependent? Small block sizes?

> So really I wonder whether it's not rather an issue on the server side.

The variable you added here is the read_sock, so I would suspect that
this would be your offender.

> Maybe the server doesn't retire skbs (or not all of them), causing the 
> TCP window to shrink.
> That, of course, is wild guessing, as I have no idea if and how calls to 
> 'consume_skb' reflect back to the TCP window size.

skbs are unrelated to the TCP window. They relate to the socket send
buffer. skbs left dangling would cause the server side to run out of
memory, not the TCP window to close. The two are completely unrelated.



* Re: nvme-tls and TCP window full
  2023-07-11 12:05       ` Sagi Grimberg
@ 2023-07-13  9:48         ` Hannes Reinecke
  2023-07-13 10:11           ` Sagi Grimberg
  0 siblings, 1 reply; 11+ messages in thread
From: Hannes Reinecke @ 2023-07-13  9:48 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme, Jakub Kicinski,
	open list:NETWORKING [GENERAL]

On 7/11/23 14:05, Sagi Grimberg wrote:
> 
>>> Hey Hannes,
>>>
>>> Any progress on this one?
>>>
>> Oh well; slow going.
[ .. ]
>> Maybe the server doesn't retire skbs (or not all of them), causing the 
>> TCP window to shrink.
>> That, of course, is wild guessing, as I have no idea if and how calls 
>> to 'consume_skb' reflect back to the TCP window size.
> 
> skbs are unrelated to the TCP window. They relate to the socket send
> buffer. skbs left dangling would cause the server side to run out of
> memory, not the TCP window to close. The two are completely unrelated.

Ouch.
Wasn't me, in the end:

diff --git a/net/tls/tls_strp.c b/net/tls/tls_strp.c
index f37f4a0fcd3c..ca1e0e198ceb 100644
--- a/net/tls/tls_strp.c
+++ b/net/tls/tls_strp.c
@@ -369,7 +369,6 @@ static int tls_strp_copyin(read_descriptor_t *desc, 
struct sk_buff *in_skb,

  static int tls_strp_read_copyin(struct tls_strparser *strp)
  {
-       struct socket *sock = strp->sk->sk_socket;
         read_descriptor_t desc;

         desc.arg.data = strp;
@@ -377,7 +376,7 @@ static int tls_strp_read_copyin(struct tls_strparser 
*strp)
         desc.count = 1; /* give more than one skb per call */

         /* sk should be locked here, so okay to do read_sock */
-       sock->ops->read_sock(strp->sk, &desc, tls_strp_copyin);
+       tcp_read_sock(strp->sk, &desc, tls_strp_copyin);

         return desc.error;
  }

Otherwise we'd enter a recursion calling ->read_sock(), which will 
redirect to tls_sw_read_sock(), calling tls_strp_check_rcv(), calling 
->read_sock() ...
It got covered up with the tls_rx_reader_lock() Jakub put in, so I 
really only noticed it when instrumenting that one.

And my reading is that the current in-kernel TLS implementation 
assumes TCP as the underlying transport anyway, so no harm done.
Jakub?

Cheers,

Hannes



* Re: nvme-tls and TCP window full
  2023-07-13  9:48         ` Hannes Reinecke
@ 2023-07-13 10:11           ` Sagi Grimberg
  2023-07-13 10:16             ` Hannes Reinecke
  0 siblings, 1 reply; 11+ messages in thread
From: Sagi Grimberg @ 2023-07-13 10:11 UTC (permalink / raw)
  To: Hannes Reinecke, linux-nvme, Jakub Kicinski,
	open list:NETWORKING [GENERAL]


>> skbs are unrelated to the TCP window. They relate to the socket send
>> buffer. skbs left dangling would cause the server side to run out of
>> memory, not the TCP window to close. The two are completely unrelated.
> 
> Ouch.
> Wasn't me, in the end:
> 
> diff --git a/net/tls/tls_strp.c b/net/tls/tls_strp.c
> index f37f4a0fcd3c..ca1e0e198ceb 100644
> --- a/net/tls/tls_strp.c
> +++ b/net/tls/tls_strp.c
> @@ -369,7 +369,6 @@ static int tls_strp_copyin(read_descriptor_t *desc, 
> struct sk_buff *in_skb,
> 
>   static int tls_strp_read_copyin(struct tls_strparser *strp)
>   {
> -       struct socket *sock = strp->sk->sk_socket;
>          read_descriptor_t desc;
> 
>          desc.arg.data = strp;
> @@ -377,7 +376,7 @@ static int tls_strp_read_copyin(struct tls_strparser 
> *strp)
>          desc.count = 1; /* give more than one skb per call */
> 
>          /* sk should be locked here, so okay to do read_sock */
> -       sock->ops->read_sock(strp->sk, &desc, tls_strp_copyin);
> +       tcp_read_sock(strp->sk, &desc, tls_strp_copyin);
> 
>          return desc.error;
>   }
> 
> Otherwise we'd enter a recursion calling ->read_sock(), which will 
> redirect to tls_sw_read_sock(), calling tls_strp_check_rcv(), calling 
> ->read_sock() ...

Is this new? How did this pop up just now?

> It got covered up with the tls_rx_reader_lock() Jakub put in, so I 
> really only noticed it when instrumenting that one.

So without it, you get two contexts reading from the socket?
Not sure how this works, but obviously wrong...

> And my reading is that the current in-kernel TLS implementation 
> assumes TCP as the underlying transport anyway, so no harm done.
> Jakub?

While it is correct that the assumption is TCP-only, I think the
right thing to do would be to store the original read_sock and call
that...


* Re: nvme-tls and TCP window full
  2023-07-13 10:11           ` Sagi Grimberg
@ 2023-07-13 10:16             ` Hannes Reinecke
  2023-07-18 18:59               ` Jakub Kicinski
  0 siblings, 1 reply; 11+ messages in thread
From: Hannes Reinecke @ 2023-07-13 10:16 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme, Jakub Kicinski,
	open list:NETWORKING [GENERAL]

On 7/13/23 12:11, Sagi Grimberg wrote:
> 
>>> skbs are unrelated to the TCP window. They relate to the socket send
>>> buffer. skbs left dangling would cause the server side to run out of
>>> memory, not the TCP window to close. The two are completely unrelated.
>>
>> Ouch.
>> Wasn't me, in the end:
>>
>> diff --git a/net/tls/tls_strp.c b/net/tls/tls_strp.c
>> index f37f4a0fcd3c..ca1e0e198ceb 100644
>> --- a/net/tls/tls_strp.c
>> +++ b/net/tls/tls_strp.c
>> @@ -369,7 +369,6 @@ static int tls_strp_copyin(read_descriptor_t 
>> *desc, struct sk_buff *in_skb,
>>
>>   static int tls_strp_read_copyin(struct tls_strparser *strp)
>>   {
>> -       struct socket *sock = strp->sk->sk_socket;
>>          read_descriptor_t desc;
>>
>>          desc.arg.data = strp;
>> @@ -377,7 +376,7 @@ static int tls_strp_read_copyin(struct 
>> tls_strparser *strp)
>>          desc.count = 1; /* give more than one skb per call */
>>
>>          /* sk should be locked here, so okay to do read_sock */
>> -       sock->ops->read_sock(strp->sk, &desc, tls_strp_copyin);
>> +       tcp_read_sock(strp->sk, &desc, tls_strp_copyin);
>>
>>          return desc.error;
>>   }
>>
>> Otherwise we'd enter a recursion calling ->read_sock(), which will 
>> redirect to tls_sw_read_sock(), calling tls_strp_check_rcv(), calling 
>> ->read_sock() ...
> 
> Is this new? How did this pop up just now?
> 
It's not new; this has been in there since time immemorial.
It just got uncovered as yours truly was brave enough to implement 
->read_sock() for TLS ...

>> It got covered up with the tls_rx_reader_lock() Jakub put in, so I 
>> really only noticed it when instrumenting that one.
> 
> So without it, you get two contexts reading from the socket?
> Not sure how this works, but obviously wrong...
> 
Oh, no. Without it you get a loop, eventually resulting in a stack overflow.

>> And my reading is that the current in-kernel TLS implementation 
>> assumes TCP as the underlying transport anyway, so no harm done.
>> Jakub?
> 
> While it is correct that the assumption is TCP-only, I think the
> right thing to do would be to store the original read_sock and call
> that...

Ah, sure. Or that.

Cheers,

Hannes



* Re: nvme-tls and TCP window full
  2023-07-13 10:16             ` Hannes Reinecke
@ 2023-07-18 18:59               ` Jakub Kicinski
  2023-07-19  7:27                 ` Hannes Reinecke
  0 siblings, 1 reply; 11+ messages in thread
From: Jakub Kicinski @ 2023-07-18 18:59 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Sagi Grimberg, linux-nvme, open list:NETWORKING [GENERAL]

On Thu, 13 Jul 2023 12:16:13 +0200 Hannes Reinecke wrote:
> >> And my reading is that the current in-kernel TLS implementation 
> >> assumes TCP as the underlying transport anyway, so no harm done.
> >> Jakub?  
> > 
> > While it is correct that the assumption is TCP-only, I think the
> > right thing to do would be to store the original read_sock and call
> > that...  
> 
> Ah, sure. Or that.

Yup, sorry for the late reply; read_sock could also be replaced by BPF 
or some other thing, even if it's always TCP "at the bottom".


* Re: nvme-tls and TCP window full
  2023-07-18 18:59               ` Jakub Kicinski
@ 2023-07-19  7:27                 ` Hannes Reinecke
  2023-07-19 11:54                   ` Sagi Grimberg
  0 siblings, 1 reply; 11+ messages in thread
From: Hannes Reinecke @ 2023-07-19  7:27 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Sagi Grimberg, linux-nvme, open list:NETWORKING [GENERAL]

On 7/18/23 20:59, Jakub Kicinski wrote:
> On Thu, 13 Jul 2023 12:16:13 +0200 Hannes Reinecke wrote:
>>>> And my reading is that the current in-kernel TLS implementation
>>>> assumes TCP as the underlying transport anyway, so no harm done.
>>>> Jakub?
>>>
>>> While it is correct that the assumption is TCP-only, I think the
>>> right thing to do would be to store the original read_sock and call
>>> that...
>>
>> Ah, sure. Or that.
> 
> Yup, sorry for the late reply; read_sock could also be replaced by BPF
> or some other thing, even if it's always TCP "at the bottom".

Hmm. So what do you suggest?
Remember, the current patch does this:

@@ -377,7 +376,7 @@ static int tls_strp_read_copyin(struct tls_strparser 
*strp)
         desc.count = 1; /* give more than one skb per call */

         /* sk should be locked here, so okay to do read_sock */
-       sock->ops->read_sock(strp->sk, &desc, tls_strp_copyin);
+       tcp_read_sock(strp->sk, &desc, tls_strp_copyin);

         return desc.error;
  }

precisely because ->read_sock() gets redirected when TLS engages.
And also remember TLS does _not_ use the normal redirection by 
intercepting the callbacks from 'struct sock', but rather replaces the 
->ops callback in struct socket.

So I'm slightly at a loss on how to implement a new callback without 
having to redo the entire TLS handover.
Hence I vastly prefer just the simple patch by using tcp_read_sock() 
directly.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman



* Re: nvme-tls and TCP window full
  2023-07-19  7:27                 ` Hannes Reinecke
@ 2023-07-19 11:54                   ` Sagi Grimberg
  0 siblings, 0 replies; 11+ messages in thread
From: Sagi Grimberg @ 2023-07-19 11:54 UTC (permalink / raw)
  To: Hannes Reinecke, Jakub Kicinski
  Cc: linux-nvme, open list:NETWORKING [GENERAL]


>>>>> And my reading is that the current in-kernel TLS implementation
>>>>> assumes TCP as the underlying transport anyway, so no harm done.
>>>>> Jakub?
>>>>
>>>> While it is correct that the assumption is TCP-only, I think the
>>>> right thing to do would be to store the original read_sock and call
>>>> that...
>>>
>>> Ah, sure. Or that.
>>
>> Yup, sorry for the late reply; read_sock could also be replaced by BPF
>> or some other thing, even if it's always TCP "at the bottom".
> 
> Hmm. So what do you suggest?
> Remember, the current patch does this:
> 
> @@ -377,7 +376,7 @@ static int tls_strp_read_copyin(struct tls_strparser 
> *strp)
>          desc.count = 1; /* give more than one skb per call */
> 
>          /* sk should be locked here, so okay to do read_sock */
> -       sock->ops->read_sock(strp->sk, &desc, tls_strp_copyin);
> +       tcp_read_sock(strp->sk, &desc, tls_strp_copyin);
> 
>          return desc.error;
>   }
> 
> precisely because ->read_sock() gets redirected when TLS engages.
> And also remember TLS does _not_ use the normal redirection by 
> intercepting the callbacks from 'struct sock', but rather replaces the 
> ->ops callback in struct socket.
> 
> So I'm slightly at a loss on how to implement a new callback without 
> having to redo the entire TLS handover.
> Hence I vastly prefer just the simple patch by using tcp_read_sock() 
> directly.

I think this is fine. The tls parser is tied to the bottom socket
being a tcp socket anyway; until Hannes's patch 6/6, read_sock() was
by definition always tcp_read_sock. So this is a valid replacement IMO.
I don't think it is worth the effort to "prepare" for generalizing
the tls parser.

