linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] nvme-tcp: Check if request has started before processing it
@ 2021-02-12 18:17 Daniel Wagner
  2021-02-12 20:58 ` Sagi Grimberg
  2021-02-13  8:42 ` Hannes Reinecke
  0 siblings, 2 replies; 22+ messages in thread
From: Daniel Wagner @ 2021-02-12 18:17 UTC (permalink / raw)
  To: linux-nvme
  Cc: linux-kernel, Sagi Grimberg, Christoph Hellwig, Jens Axboe,
	Keith Busch, Hannes Reinecke, Daniel Wagner

blk_mq_tag_to_rq() will always return a request if the command_id is
in the valid range. Check if the request has been started. If we
blindly process the request we might double complete a request which
can be fatal.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
---

This patch is against nvme-5.12.

There is one blk_mq_tag_to_rq() in nvme_tcp_recv_ddgst() which I
didn't update as I am not sure if it's also needed.

py-crash> bt
#0  0xffffffffa76a33de in arch_atomic_try_cmpxchg (new=<optimized out>, old=<optimized out>, v=<optimized out>)
    at ../arch/x86/include/asm/atomic.h:200
#1  atomic_try_cmpxchg (new=<optimized out>, old=<optimized out>, v=<optimized out>) at ../include/asm-generic/atomic-instrumented.h:695
#2  queued_spin_lock (lock=<optimized out>) at ../include/asm-generic/qspinlock.h:78
#3  do_raw_spin_lock_flags (flags=<optimized out>, lock=<optimized out>) at ../include/linux/spinlock.h:193
#4  __raw_spin_lock_irqsave (lock=<optimized out>) at ../include/linux/spinlock_api_smp.h:119
#5  _raw_spin_lock_irqsave (lock=0x8 <__UNIQUE_ID_license257+8>) at ../kernel/locking/spinlock.c:159
#6  0xffffffffa6eea418 in complete (x=0x0 <__UNIQUE_ID_license257>) at ../kernel/sched/completion.c:32
#7  0xffffffffa721f99c in blk_mq_force_complete_rq (rq=0x8 <__UNIQUE_ID_license257+8>) at ../block/blk-mq.c:634
#8  0xffffffffa721fa0a in blk_mq_complete_request (rq=<optimized out>) at ../block/blk-mq.c:672
#9  0xffffffffc0b092ef in nvme_end_request (result=..., status=<optimized out>, req=<optimized out>) at ../drivers/nvme/host/nvme.h:477
#10 nvme_tcp_process_nvme_cqe (cqe=<optimized out>, queue=<optimized out>) at ../drivers/nvme/host/tcp.c:485
        rq = 0xffff948b840d0000
        hdr = <optimized out>
        ret = 0
        queue = 0xffff949501dd8110
        result = 0
#11 nvme_tcp_handle_comp (pdu=<optimized out>, queue=<optimized out>) at ../drivers/nvme/host/tcp.c:542
#12 nvme_tcp_recv_pdu (len=<optimized out>, offset=<optimized out>, skb=<optimized out>, queue=<optimized out>) at ../drivers/nvme/host/tcp.c:660
#13 nvme_tcp_recv_skb (desc=<optimized out>, skb=<optimized out>, offset=24, len=0) at ../drivers/nvme/host/tcp.c:805
#14 0xffffffffa7598af5 in tcp_read_sock
    (sk=0x8 <__UNIQUE_ID_license257+8>, desc=0xa <__UNIQUE_ID_license257+10>, recv_actor=0x1 <__UNIQUE_ID_license257+1>) at ../net/ipv4/tcp.c:1645
#15 0xffffffffc0b075b8 in nvme_tcp_try_recv (queue=0xffff949501dd8110) at ../drivers/nvme/host/tcp.c:1102
#16 0xffffffffc0b08fc7 in nvme_tcp_io_work (w=0xffff949501dd8118) at ../drivers/nvme/host/tcp.c:1126
#17 0xffffffffa6eba4e4 in process_one_work (worker=0xffff948d1b633ec0, work=0xffff949501dd8118) at ../kernel/workqueue.c:2273
#18 0xffffffffa6eba6fd in worker_thread (__worker=0xffff948d1b633ec0) at ../kernel/workqueue.c:2419
#19 0xffffffffa6ec0a3d in kthread (_create=0xffff948d1b618ec0) at ../kernel/kthread.c:268
#20 0xffffffffa7800215 in ret_from_fork () at ../arch/x86/entry/entry_64.S:351

py-crash>  p /x ((struct request*)0xffff948b840d0000)->state
$2 = 0x2

 drivers/nvme/host/tcp.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 69f59d2c5799..4bec705ce8e6 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -485,7 +485,7 @@ static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
 	struct request *rq;
 
 	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), cqe->command_id);
-	if (!rq) {
+	if (!rq || !blk_mq_request_started(rq)) {
 		dev_err(queue->ctrl->ctrl.device,
 			"queue %d tag 0x%x not found\n",
 			nvme_tcp_queue_id(queue), cqe->command_id);
@@ -506,7 +506,7 @@ static int nvme_tcp_handle_c2h_data(struct nvme_tcp_queue *queue,
 	struct request *rq;
 
 	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
-	if (!rq) {
+	if (!rq || !blk_mq_request_started(rq)) {
 		dev_err(queue->ctrl->ctrl.device,
 			"queue %d tag %#x not found\n",
 			nvme_tcp_queue_id(queue), pdu->command_id);
@@ -610,7 +610,7 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
 	int ret;
 
 	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
-	if (!rq) {
+	if (!rq || !blk_mq_request_started(rq)) {
 		dev_err(queue->ctrl->ctrl.device,
 			"queue %d tag %#x not found\n",
 			nvme_tcp_queue_id(queue), pdu->command_id);
@@ -696,7 +696,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
 	struct request *rq;
 
 	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
-	if (!rq) {
+	if (!rq || !blk_mq_request_started(rq)) {
 		dev_err(queue->ctrl->ctrl.device,
 			"queue %d tag %#x not found\n",
 			nvme_tcp_queue_id(queue), pdu->command_id);
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] nvme-tcp: Check if request has started before processing it
  2021-02-12 18:17 [PATCH] nvme-tcp: Check if request has started before processing it Daniel Wagner
@ 2021-02-12 20:58 ` Sagi Grimberg
  2021-02-12 21:09   ` Keith Busch
  2021-02-13  8:42 ` Hannes Reinecke
  1 sibling, 1 reply; 22+ messages in thread
From: Sagi Grimberg @ 2021-02-12 20:58 UTC (permalink / raw)
  To: Daniel Wagner, linux-nvme
  Cc: linux-kernel, Christoph Hellwig, Jens Axboe, Keith Busch,
	Hannes Reinecke

> blk_mq_tag_to_rq() will always return a request if the command_id is
> in the valid range. Check if the request has been started. If we
> blindly process the request we might double complete a request which
> can be fatal.

How did you get to this one? did the controller send a completion for
a completed/bogus request?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] nvme-tcp: Check if request has started before processing it
  2021-02-12 20:58 ` Sagi Grimberg
@ 2021-02-12 21:09   ` Keith Busch
  2021-02-12 21:49     ` Sagi Grimberg
  0 siblings, 1 reply; 22+ messages in thread
From: Keith Busch @ 2021-02-12 21:09 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Daniel Wagner, linux-nvme, linux-kernel, Christoph Hellwig,
	Jens Axboe, Hannes Reinecke

On Fri, Feb 12, 2021 at 12:58:27PM -0800, Sagi Grimberg wrote:
> > blk_mq_tag_to_rq() will always return a request if the command_id is
> > in the valid range. Check if the request has been started. If we
> > blindly process the request we might double complete a request which
> > can be fatal.
> 
> How did you get to this one? did the controller send a completion for
> a completed/bogus request?

If that is the case, then that must mean it's possible the driver could
have started the command id just before the bogus completion check. Data
iorruption, right?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] nvme-tcp: Check if request has started before processing it
  2021-02-12 21:09   ` Keith Busch
@ 2021-02-12 21:49     ` Sagi Grimberg
  2021-02-13  8:46       ` Hannes Reinecke
  0 siblings, 1 reply; 22+ messages in thread
From: Sagi Grimberg @ 2021-02-12 21:49 UTC (permalink / raw)
  To: Keith Busch
  Cc: Daniel Wagner, linux-nvme, linux-kernel, Christoph Hellwig,
	Jens Axboe, Hannes Reinecke


>>> blk_mq_tag_to_rq() will always return a request if the command_id is
>>> in the valid range. Check if the request has been started. If we
>>> blindly process the request we might double complete a request which
>>> can be fatal.
>>
>> How did you get to this one? did the controller send a completion for
>> a completed/bogus request?
> 
> If that is the case, then that must mean it's possible the driver could
> have started the command id just before the bogus completion check. Data
> iorruption, right?

Yes, which is why I don't think this check is very useful..

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] nvme-tcp: Check if request has started before processing it
  2021-02-12 18:17 [PATCH] nvme-tcp: Check if request has started before processing it Daniel Wagner
  2021-02-12 20:58 ` Sagi Grimberg
@ 2021-02-13  8:42 ` Hannes Reinecke
  1 sibling, 0 replies; 22+ messages in thread
From: Hannes Reinecke @ 2021-02-13  8:42 UTC (permalink / raw)
  To: Daniel Wagner, linux-nvme
  Cc: linux-kernel, Sagi Grimberg, Christoph Hellwig, Jens Axboe, Keith Busch

On 2/12/21 7:17 PM, Daniel Wagner wrote:
> blk_mq_tag_to_rq() will always return a request if the command_id is
> in the valid range. Check if the request has been started. If we
> blindly process the request we might double complete a request which
> can be fatal.
> 
> Signed-off-by: Daniel Wagner <dwagner@suse.de>
> ---
> 
> This patch is against nvme-5.12.
> 
> There is one blk_mq_tag_to_rq() in nvme_tcp_recv_ddgst() which I
> didn't update as I am not sure if it's also needed.
> 
I guess it is; this patch is essentially a protection against invalid 
frames, and as such affects all places.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] nvme-tcp: Check if request has started before processing it
  2021-02-12 21:49     ` Sagi Grimberg
@ 2021-02-13  8:46       ` Hannes Reinecke
  2021-02-15 10:40         ` Daniel Wagner
  2021-02-15 21:23         ` Sagi Grimberg
  0 siblings, 2 replies; 22+ messages in thread
From: Hannes Reinecke @ 2021-02-13  8:46 UTC (permalink / raw)
  To: Sagi Grimberg, Keith Busch
  Cc: Daniel Wagner, linux-nvme, linux-kernel, Christoph Hellwig, Jens Axboe

On 2/12/21 10:49 PM, Sagi Grimberg wrote:
> 
>>>> blk_mq_tag_to_rq() will always return a request if the command_id is
>>>> in the valid range. Check if the request has been started. If we
>>>> blindly process the request we might double complete a request which
>>>> can be fatal.
>>>
>>> How did you get to this one? did the controller send a completion for
>>> a completed/bogus request?
>>
>> If that is the case, then that must mean it's possible the driver could
>> have started the command id just before the bogus completion check. Data
>> iorruption, right?
> 
> Yes, which is why I don't think this check is very useful..

I actually view that as a valid protection against spoofed frames.
Without it it's easy to crash the machine by injecting fake completions 
with random command ids.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] nvme-tcp: Check if request has started before processing it
  2021-02-13  8:46       ` Hannes Reinecke
@ 2021-02-15 10:40         ` Daniel Wagner
  2021-02-15 21:29           ` Sagi Grimberg
  2021-02-15 21:23         ` Sagi Grimberg
  1 sibling, 1 reply; 22+ messages in thread
From: Daniel Wagner @ 2021-02-15 10:40 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Sagi Grimberg, Keith Busch, Jens Axboe, Christoph Hellwig,
	linux-nvme, linux-kernel

On Sat, Feb 13, 2021 at 09:46:41AM +0100, Hannes Reinecke wrote:
> On 2/12/21 10:49 PM, Sagi Grimberg wrote:
> > 
> > > > > blk_mq_tag_to_rq() will always return a request if the command_id is
> > > > > in the valid range. Check if the request has been started. If we
> > > > > blindly process the request we might double complete a request which
> > > > > can be fatal.
> > > > 
> > > > How did you get to this one? did the controller send a completion for
> > > > a completed/bogus request?
> > > 
> > > If that is the case, then that must mean it's possible the driver could
> > > have started the command id just before the bogus completion check. Data
> > > iorruption, right?

'during TCP LIF toggles and aggr relocates' testing the host
crashes. TBH, I do not really know what is happening or what the test
does. Still trying to figure out what's going on. I was just very
surprised how much the code trusts the other side to behave correctly.

> > Yes, which is why I don't think this check is very useful..
> 
> I actually view that as a valid protection against spoofed frames.
> Without it it's easy to crash the machine by injecting fake completions with
> random command ids.

In this test scenario it's not even a spoofed frame; maybe just a
confused controller.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] nvme-tcp: Check if request has started before processing it
  2021-02-13  8:46       ` Hannes Reinecke
  2021-02-15 10:40         ` Daniel Wagner
@ 2021-02-15 21:23         ` Sagi Grimberg
  2021-02-16  8:51           ` Hannes Reinecke
  1 sibling, 1 reply; 22+ messages in thread
From: Sagi Grimberg @ 2021-02-15 21:23 UTC (permalink / raw)
  To: Hannes Reinecke, Keith Busch
  Cc: Jens Axboe, Christoph Hellwig, linux-nvme, Daniel Wagner, linux-kernel


>>>>> blk_mq_tag_to_rq() will always return a request if the command_id is
>>>>> in the valid range. Check if the request has been started. If we
>>>>> blindly process the request we might double complete a request which
>>>>> can be fatal.
>>>>
>>>> How did you get to this one? did the controller send a completion for
>>>> a completed/bogus request?
>>>
>>> If that is the case, then that must mean it's possible the driver could
>>> have started the command id just before the bogus completion check. Data
>>> iorruption, right?
>>
>> Yes, which is why I don't think this check is very useful..
> 
> I actually view that as a valid protection against spoofed frames.
> Without it it's easy to crash the machine by injecting fake completions 
> with random command ids.

And this doesn't help because the command can have been easily reused
and started... What is this protecting against? Note that none of the
other transports checks that, why should tcp?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] nvme-tcp: Check if request has started before processing it
  2021-02-15 10:40         ` Daniel Wagner
@ 2021-02-15 21:29           ` Sagi Grimberg
  2021-02-26 12:35             ` Daniel Wagner
  0 siblings, 1 reply; 22+ messages in thread
From: Sagi Grimberg @ 2021-02-15 21:29 UTC (permalink / raw)
  To: Daniel Wagner, Hannes Reinecke
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, linux-nvme, linux-kernel


>>>>>> blk_mq_tag_to_rq() will always return a request if the command_id is
>>>>>> in the valid range. Check if the request has been started. If we
>>>>>> blindly process the request we might double complete a request which
>>>>>> can be fatal.
>>>>>
>>>>> How did you get to this one? did the controller send a completion for
>>>>> a completed/bogus request?
>>>>
>>>> If that is the case, then that must mean it's possible the driver could
>>>> have started the command id just before the bogus completion check. Data
>>>> iorruption, right?
> 
> 'during TCP LIF toggles and aggr relocates' testing the host
> crashes. TBH, I do not really know what is happening or what the test
> does. Still trying to figure out what's going on.

Well, I think we should probably figure out why that is happening first.

> I was just very
> surprised how much the code trusts the other side to behave correctly./

What does pci/rdma/fc does differently? What does scsi do here?

>>> Yes, which is why I don't think this check is very useful..
>>
>> I actually view that as a valid protection against spoofed frames.
>> Without it it's easy to crash the machine by injecting fake completions with
>> random command ids.
> 
> In this test scenario it's not even a spoofed frame; maybe just a
> confused controller.

Maybe... I am still not sure how this patch helps here though...

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] nvme-tcp: Check if request has started before processing it
  2021-02-15 21:23         ` Sagi Grimberg
@ 2021-02-16  8:51           ` Hannes Reinecke
  0 siblings, 0 replies; 22+ messages in thread
From: Hannes Reinecke @ 2021-02-16  8:51 UTC (permalink / raw)
  To: Sagi Grimberg, Keith Busch
  Cc: Jens Axboe, Christoph Hellwig, linux-nvme, Daniel Wagner, linux-kernel

On 2/15/21 10:23 PM, Sagi Grimberg wrote:
> 
>>>>>> blk_mq_tag_to_rq() will always return a request if the command_id is
>>>>>> in the valid range. Check if the request has been started. If we
>>>>>> blindly process the request we might double complete a request which
>>>>>> can be fatal.
>>>>>
>>>>> How did you get to this one? did the controller send a completion for
>>>>> a completed/bogus request?
>>>>
>>>> If that is the case, then that must mean it's possible the driver could
>>>> have started the command id just before the bogus completion check. 
>>>> Data
>>>> iorruption, right?
>>>
>>> Yes, which is why I don't think this check is very useful..
>>
>> I actually view that as a valid protection against spoofed frames.
>> Without it it's easy to crash the machine by injecting fake 
>> completions with random command ids.
> 
> And this doesn't help because the command can have been easily reused
> and started... What is this protecting against? Note that none of the
> other transports checks that, why should tcp?

Because it's particularly easy to spoof packets on tcp.
All other nvme-of transports are layered on top of other transports 
which do some sanity checks already, so it becomes really hard to inject 
invalid NVMe-oF frames for those.
NVMe-TCP has none of these protections, making it really easy to inject 
faulty frames (or, heaven forbid, running a packet fuzzer).

And crashing the machine on invalid frames is always a bad idea; I would 
have expected NVMe-TCP to drop them.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] nvme-tcp: Check if request has started before processing it
  2021-02-15 21:29           ` Sagi Grimberg
@ 2021-02-26 12:35             ` Daniel Wagner
  2021-02-26 12:54               ` Hannes Reinecke
  0 siblings, 1 reply; 22+ messages in thread
From: Daniel Wagner @ 2021-02-26 12:35 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Hannes Reinecke, Keith Busch, Jens Axboe, Christoph Hellwig,
	linux-nvme, linux-kernel

On Mon, Feb 15, 2021 at 01:29:45PM -0800, Sagi Grimberg wrote:
> Well, I think we should probably figure out why that is happening first.

I got my hands on a tcpdump trace. I've trimmed it to this:


No.     Time           Source                Destination           Protocol Length Info
      1 0.000000       10.228.194.30         10.228.38.214         NVMe     138    NVMe Write
      2 0.000285       10.228.38.214         10.228.194.30         NVMe/TCP 90     Ready To Transfer
      3 0.000591       10.228.194.30         10.228.38.214         NVMe     4186   NVMe Write: Data
      4 0.000673       10.228.38.214         10.228.194.30         TCP      66     4420 → 58535 [ACK] Seq=25 Ack=4193 Win=241 Len=0 TSval=2655324576 TSecr=1497295579
      5 0.002140       10.228.38.214         10.228.194.30         NVMe     90     NVMe Write: Response
      6 0.002511       10.228.194.30         10.228.38.175         NVMe     138    NVMe Write
      7 0.002812       10.228.38.175         10.228.194.30         NVMe/TCP 90     Ready To Transfer
      8 0.003006       10.228.194.30         10.228.38.175         NVMe     4186   NVMe Write: Data
      9 0.003098       10.228.38.175         10.228.194.30         TCP      66     4420 → 51241 [ACK] Seq=25 Ack=4193 Win=241 Len=0 TSval=2183410196 TSecr=3601034207
     10 0.004420       10.228.38.175         10.228.194.30         NVMe     90     NVMe Write: Response
     11 0.004890       10.228.38.214         10.228.194.30         NVMe/TCP 90     
     12 0.004969       10.228.38.175         10.228.194.30         NVMe/TCP 90  


The last few seconds contain only normal writes and suddenly the host
receives two invalid packets. From what I see the host doesn't misbehave
at all. I wonder if it would be possible to detect the invalid packet by
locking at the PDU header only. If this would be possible we could
discard it early and do not try to use the invalid command id.

Here the last two packets with details:


No.     Time           Source                Destination           Protocol Length Info
     11 0.004890       10.228.38.214         10.228.194.30         NVMe/TCP 90     

Frame 11: 90 bytes on wire (720 bits), 90 bytes captured (720 bits)
    Encapsulation type: Ethernet (1)
    Arrival Time: Feb 23, 2021 18:16:08.780574000 CET
    [Time shift for this packet: 0.000000000 seconds]
    Epoch Time: 1614100568.780574000 seconds
    [Time delta from previous captured frame: 0.000470000 seconds]
    [Time delta from previous displayed frame: 0.000470000 seconds]
    [Time since reference or first frame: 0.004890000 seconds]
    Frame Number: 11
    Frame Length: 90 bytes (720 bits)
    Capture Length: 90 bytes (720 bits)
    [Frame is marked: False]
    [Frame is ignored: False]
    [Protocols in frame: eth:ethertype:ip:tcp:nvme-tcp]
    [Coloring Rule Name: TCP]
    [Coloring Rule String: tcp]
Ethernet II, Src: IntelCor_41:16:c0 (b4:96:91:41:16:c0), Dst: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
    Destination: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
        Address: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
    Source: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
        Address: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
    Type: IPv4 (0x0800)
Internet Protocol Version 4, Src: 10.228.38.214, Dst: 10.228.194.30
    0100 .... = Version: 4
    .... 0101 = Header Length: 20 bytes (5)
    Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
        0000 00.. = Differentiated Services Codepoint: Default (0)
        .... ..00 = Explicit Congestion Notification: Not ECN-Capable Transport (0)
    Total Length: 76
    Identification: 0x0000 (0)
    Flags: 0x40, Don't fragment
        0... .... = Reserved bit: Not set
        .1.. .... = Don't fragment: Set
        ..0. .... = More fragments: Not set
    Fragment Offset: 0
    Time to Live: 64
    Protocol: TCP (6)
    Header Checksum: 0x0000 [validation disabled]
    [Header checksum status: Unverified]
    Source Address: 10.228.38.214
    Destination Address: 10.228.194.30
Transmission Control Protocol, Src Port: 4420, Dst Port: 46909, Seq: 1, Ack: 1, Len: 24
    Source Port: 4420
    Destination Port: 46909
    [Stream index: 2]
    [TCP Segment Len: 24]
    Sequence Number: 1    (relative sequence number)
    Sequence Number (raw): 4175488220
    [Next Sequence Number: 25    (relative sequence number)]
    Acknowledgment Number: 1    (relative ack number)
    Acknowledgment number (raw): 2966903626
    1000 .... = Header Length: 32 bytes (8)
    Flags: 0x018 (PSH, ACK)
        000. .... .... = Reserved: Not set
        ...0 .... .... = Nonce: Not set
        .... 0... .... = Congestion Window Reduced (CWR): Not set
        .... .0.. .... = ECN-Echo: Not set
        .... ..0. .... = Urgent: Not set
        .... ...1 .... = Acknowledgment: Set
        .... .... 1... = Push: Set
        .... .... .0.. = Reset: Not set
        .... .... ..0. = Syn: Not set
        .... .... ...0 = Fin: Not set
        [TCP Flags: ·······AP···]
    Window: 257
    [Calculated window size: 257]
    [Window size scaling factor: -1 (unknown)]
    Checksum: 0xfefa [unverified]
    [Checksum Status: Unverified]
    Urgent Pointer: 0
    Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
        TCP Option - No-Operation (NOP)
            Kind: No-Operation (1)
        TCP Option - No-Operation (NOP)
            Kind: No-Operation (1)
        TCP Option - Timestamps: TSval 4211986351, TSecr 1497291787
            Kind: Time Stamp Option (8)
            Length: 10
            Timestamp value: 4211986351
            Timestamp echo reply: 1497291787
    [SEQ/ACK analysis]
        [Bytes in flight: 24]
        [Bytes sent since last PSH flag: 24]
    [Timestamps]
        [Time since first frame in this TCP stream: 0.000000000 seconds]
        [Time since previous frame in this TCP stream: 0.000000000 seconds]
    TCP payload (24 bytes)
    [PDU Size: 24]
NVM Express Fabrics TCP
    Pdu Type: CapsuleResponse (5)
    Pdu Specific Flags: 0x00
        .... ...0 = PDU Header Digest: Not set
        .... ..0. = PDU Data Digest: Not set
        .... .0.. = PDU Data Last: Not set
        .... 0... = PDU Data Success: Not set
    Pdu Header Length: 24
    Pdu Data Offset: 0
    Packet Length: 24
    Unknown Data: 02000400000000001c0000001f000000

0000  00 00 0c 9f f5 a8 b4 96 91 41 16 c0 08 00 45 00   .........A....E.
0010  00 4c 00 00 40 00 40 06 00 00 0a e4 26 d6 0a e4   .L..@.@.....&...
0020  c2 1e 11 44 b7 3d f8 e0 e4 dc b0 d7 5b 4a 80 18   ...D.=......[J..
0030  01 01 fe fa 00 00 01 01 08 0a fb 0d cf af 59 3e   ..............Y>
0040  dc 0b 05 00 18 00 18 00 00 00 02 00 04 00 00 00   ................
0050  00 00 1c 00 00 00 1f 00 00 00                     ..........

No.     Time           Source                Destination           Protocol Length Info
     12 0.004969       10.228.38.175         10.228.194.30         NVMe/TCP 90     

Frame 12: 90 bytes on wire (720 bits), 90 bytes captured (720 bits)
    Encapsulation type: Ethernet (1)
    Arrival Time: Feb 23, 2021 18:16:08.780653000 CET
    [Time shift for this packet: 0.000000000 seconds]
    Epoch Time: 1614100568.780653000 seconds
    [Time delta from previous captured frame: 0.000079000 seconds]
    [Time delta from previous displayed frame: 0.000079000 seconds]
    [Time since reference or first frame: 0.004969000 seconds]
    Frame Number: 12
    Frame Length: 90 bytes (720 bits)
    Capture Length: 90 bytes (720 bits)
    [Frame is marked: False]
    [Frame is ignored: False]
    [Protocols in frame: eth:ethertype:ip:tcp:nvme-tcp]
    [Coloring Rule Name: TCP]
    [Coloring Rule String: tcp]
Ethernet II, Src: IntelCor_41:16:c0 (b4:96:91:41:16:c0), Dst: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
    Destination: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
        Address: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
    Source: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
        Address: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
    Type: IPv4 (0x0800)
Internet Protocol Version 4, Src: 10.228.38.175, Dst: 10.228.194.30
    0100 .... = Version: 4
    .... 0101 = Header Length: 20 bytes (5)
    Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
        0000 00.. = Differentiated Services Codepoint: Default (0)
        .... ..00 = Explicit Congestion Notification: Not ECN-Capable Transport (0)
    Total Length: 76
    Identification: 0x0000 (0)
    Flags: 0x40, Don't fragment
        0... .... = Reserved bit: Not set
        .1.. .... = Don't fragment: Set
        ..0. .... = More fragments: Not set
    Fragment Offset: 0
    Time to Live: 64
    Protocol: TCP (6)
    Header Checksum: 0x0000 [validation disabled]
    [Header checksum status: Unverified]
    Source Address: 10.228.38.175
    Destination Address: 10.228.194.30
Transmission Control Protocol, Src Port: 4420, Dst Port: 34895, Seq: 1, Ack: 1, Len: 24
    Source Port: 4420
    Destination Port: 34895
    [Stream index: 3]
    [TCP Segment Len: 24]
    Sequence Number: 1    (relative sequence number)
    Sequence Number (raw): 3092812012
    [Next Sequence Number: 25    (relative sequence number)]
    Acknowledgment Number: 1    (relative ack number)
    Acknowledgment number (raw): 2384147181
    1000 .... = Header Length: 32 bytes (8)
    Flags: 0x018 (PSH, ACK)
        000. .... .... = Reserved: Not set
        ...0 .... .... = Nonce: Not set
        .... 0... .... = Congestion Window Reduced (CWR): Not set
        .... .0.. .... = ECN-Echo: Not set
        .... ..0. .... = Urgent: Not set
        .... ...1 .... = Acknowledgment: Set
        .... .... 1... = Push: Set
        .... .... .0.. = Reset: Not set
        .... .... ..0. = Syn: Not set
        .... .... ...0 = Fin: Not set
        [TCP Flags: ·······AP···]
    Window: 257
    [Calculated window size: 257]
    [Window size scaling factor: -1 (unknown)]
    Checksum: 0xfed3 [unverified]
    [Checksum Status: Unverified]
    Urgent Pointer: 0
    Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
        TCP Option - No-Operation (NOP)
            Kind: No-Operation (1)
        TCP Option - No-Operation (NOP)
            Kind: No-Operation (1)
        TCP Option - Timestamps: TSval 3874335934, TSecr 3601030412
            Kind: Time Stamp Option (8)
            Length: 10
            Timestamp value: 3874335934
            Timestamp echo reply: 3601030412
    [SEQ/ACK analysis]
        [Bytes in flight: 24]
        [Bytes sent since last PSH flag: 24]
    [Timestamps]
        [Time since first frame in this TCP stream: 0.000000000 seconds]
        [Time since previous frame in this TCP stream: 0.000000000 seconds]
    TCP payload (24 bytes)
    [PDU Size: 24]
NVM Express Fabrics TCP
    Pdu Type: CapsuleResponse (5)
    Pdu Specific Flags: 0x00
        .... ...0 = PDU Header Digest: Not set
        .... ..0. = PDU Data Digest: Not set
        .... .0.. = PDU Data Last: Not set
        .... 0... = PDU Data Success: Not set
    Pdu Header Length: 24
    Pdu Data Offset: 0
    Packet Length: 24
    Unknown Data: 02000400000000001b0000001f000000

0000  00 00 0c 9f f5 a8 b4 96 91 41 16 c0 08 00 45 00   .........A....E.
0010  00 4c 00 00 40 00 40 06 00 00 0a e4 26 af 0a e4   .L..@.@.....&...
0020  c2 1e 11 44 88 4f b8 58 90 ec 8e 1b 32 ed 80 18   ...D.O.X....2...
0030  01 01 fe d3 00 00 01 01 08 0a e6 ed ac be d6 a3   ................
0040  5d 0c 05 00 18 00 18 00 00 00 02 00 04 00 00 00   ]...............
0050  00 00 1b 00 00 00 1f 00 00 00                     ..........

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] nvme-tcp: Check if request has started before processing it
  2021-02-26 12:35             ` Daniel Wagner
@ 2021-02-26 12:54               ` Hannes Reinecke
  2021-02-26 16:13                 ` Keith Busch
  0 siblings, 1 reply; 22+ messages in thread
From: Hannes Reinecke @ 2021-02-26 12:54 UTC (permalink / raw)
  To: Daniel Wagner, Sagi Grimberg
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, linux-nvme, linux-kernel

On 2/26/21 1:35 PM, Daniel Wagner wrote:
> On Mon, Feb 15, 2021 at 01:29:45PM -0800, Sagi Grimberg wrote:
>> Well, I think we should probably figure out why that is happening first.
> 
> I got my hands on a tcpdump trace. I've trimmed it to this:
> 
> 
> No.     Time           Source                Destination           Protocol Length Info
>        1 0.000000       10.228.194.30         10.228.38.214         NVMe     138    NVMe Write
>        2 0.000285       10.228.38.214         10.228.194.30         NVMe/TCP 90     Ready To Transfer
>        3 0.000591       10.228.194.30         10.228.38.214         NVMe     4186   NVMe Write: Data
>        4 0.000673       10.228.38.214         10.228.194.30         TCP      66     4420 → 58535 [ACK] Seq=25 Ack=4193 Win=241 Len=0 TSval=2655324576 TSecr=1497295579
>        5 0.002140       10.228.38.214         10.228.194.30         NVMe     90     NVMe Write: Response
>        6 0.002511       10.228.194.30         10.228.38.175         NVMe     138    NVMe Write
>        7 0.002812       10.228.38.175         10.228.194.30         NVMe/TCP 90     Ready To Transfer
>        8 0.003006       10.228.194.30         10.228.38.175         NVMe     4186   NVMe Write: Data
>        9 0.003098       10.228.38.175         10.228.194.30         TCP      66     4420 → 51241 [ACK] Seq=25 Ack=4193 Win=241 Len=0 TSval=2183410196 TSecr=3601034207
>       10 0.004420       10.228.38.175         10.228.194.30         NVMe     90     NVMe Write: Response
>       11 0.004890       10.228.38.214         10.228.194.30         NVMe/TCP 90
>       12 0.004969       10.228.38.175         10.228.194.30         NVMe/TCP 90
> 
> 
> The last few seconds contain only normal writes and suddenly the host
> receives two invalid packets. From what I see the host doesn't misbehave
> at all. I wonder if it would be possible to detect the invalid packet by
> locking at the PDU header only. If this would be possible we could
> discard it early and do not try to use the invalid command id.
> 
> Here the last two packets with details:
> 
> 
> No.     Time           Source                Destination           Protocol Length Info
>       11 0.004890       10.228.38.214         10.228.194.30         NVMe/TCP 90
> 
> Frame 11: 90 bytes on wire (720 bits), 90 bytes captured (720 bits)
>      Encapsulation type: Ethernet (1)
>      Arrival Time: Feb 23, 2021 18:16:08.780574000 CET
>      [Time shift for this packet: 0.000000000 seconds]
>      Epoch Time: 1614100568.780574000 seconds
>      [Time delta from previous captured frame: 0.000470000 seconds]
>      [Time delta from previous displayed frame: 0.000470000 seconds]
>      [Time since reference or first frame: 0.004890000 seconds]
>      Frame Number: 11
>      Frame Length: 90 bytes (720 bits)
>      Capture Length: 90 bytes (720 bits)
>      [Frame is marked: False]
>      [Frame is ignored: False]
>      [Protocols in frame: eth:ethertype:ip:tcp:nvme-tcp]
>      [Coloring Rule Name: TCP]
>      [Coloring Rule String: tcp]
> Ethernet II, Src: IntelCor_41:16:c0 (b4:96:91:41:16:c0), Dst: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
>      Destination: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
>          Address: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
>          .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
>          .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
>      Source: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
>          Address: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
>          .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
>          .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
>      Type: IPv4 (0x0800)
> Internet Protocol Version 4, Src: 10.228.38.214, Dst: 10.228.194.30
>      0100 .... = Version: 4
>      .... 0101 = Header Length: 20 bytes (5)
>      Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
>          0000 00.. = Differentiated Services Codepoint: Default (0)
>          .... ..00 = Explicit Congestion Notification: Not ECN-Capable Transport (0)
>      Total Length: 76
>      Identification: 0x0000 (0)
>      Flags: 0x40, Don't fragment
>          0... .... = Reserved bit: Not set
>          .1.. .... = Don't fragment: Set
>          ..0. .... = More fragments: Not set
>      Fragment Offset: 0
>      Time to Live: 64
>      Protocol: TCP (6)
>      Header Checksum: 0x0000 [validation disabled]
>      [Header checksum status: Unverified]
>      Source Address: 10.228.38.214
>      Destination Address: 10.228.194.30
> Transmission Control Protocol, Src Port: 4420, Dst Port: 46909, Seq: 1, Ack: 1, Len: 24
>      Source Port: 4420
>      Destination Port: 46909
>      [Stream index: 2]
>      [TCP Segment Len: 24]
>      Sequence Number: 1    (relative sequence number)
>      Sequence Number (raw): 4175488220
>      [Next Sequence Number: 25    (relative sequence number)]
>      Acknowledgment Number: 1    (relative ack number)
>      Acknowledgment number (raw): 2966903626
>      1000 .... = Header Length: 32 bytes (8)
>      Flags: 0x018 (PSH, ACK)
>          000. .... .... = Reserved: Not set
>          ...0 .... .... = Nonce: Not set
>          .... 0... .... = Congestion Window Reduced (CWR): Not set
>          .... .0.. .... = ECN-Echo: Not set
>          .... ..0. .... = Urgent: Not set
>          .... ...1 .... = Acknowledgment: Set
>          .... .... 1... = Push: Set
>          .... .... .0.. = Reset: Not set
>          .... .... ..0. = Syn: Not set
>          .... .... ...0 = Fin: Not set
>          [TCP Flags: ·······AP···]
>      Window: 257
>      [Calculated window size: 257]
>      [Window size scaling factor: -1 (unknown)]
>      Checksum: 0xfefa [unverified]
>      [Checksum Status: Unverified]
>      Urgent Pointer: 0
>      Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
>          TCP Option - No-Operation (NOP)
>              Kind: No-Operation (1)
>          TCP Option - No-Operation (NOP)
>              Kind: No-Operation (1)
>          TCP Option - Timestamps: TSval 4211986351, TSecr 1497291787
>              Kind: Time Stamp Option (8)
>              Length: 10
>              Timestamp value: 4211986351
>              Timestamp echo reply: 1497291787
>      [SEQ/ACK analysis]
>          [Bytes in flight: 24]
>          [Bytes sent since last PSH flag: 24]
>      [Timestamps]
>          [Time since first frame in this TCP stream: 0.000000000 seconds]
>          [Time since previous frame in this TCP stream: 0.000000000 seconds]
>      TCP payload (24 bytes)
>      [PDU Size: 24]
> NVM Express Fabrics TCP
>      Pdu Type: CapsuleResponse (5)
>      Pdu Specific Flags: 0x00
>          .... ...0 = PDU Header Digest: Not set
>          .... ..0. = PDU Data Digest: Not set
>          .... .0.. = PDU Data Last: Not set
>          .... 0... = PDU Data Success: Not set
>      Pdu Header Length: 24
>      Pdu Data Offset: 0
>      Packet Length: 24
>      Unknown Data: 02000400000000001c0000001f000000
> 
> 0000  00 00 0c 9f f5 a8 b4 96 91 41 16 c0 08 00 45 00   .........A....E.
> 0010  00 4c 00 00 40 00 40 06 00 00 0a e4 26 d6 0a e4   .L..@.@.....&...
> 0020  c2 1e 11 44 b7 3d f8 e0 e4 dc b0 d7 5b 4a 80 18   ...D.=......[J..
> 0030  01 01 fe fa 00 00 01 01 08 0a fb 0d cf af 59 3e   ..............Y>
> 0040  dc 0b 05 00 18 00 18 00 00 00 02 00 04 00 00 00   ................
> 0050  00 00 1c 00 00 00 1f 00 00 00                     ..........
> 
> No.     Time           Source                Destination           Protocol Length Info
>       12 0.004969       10.228.38.175         10.228.194.30         NVMe/TCP 90
> 
> Frame 12: 90 bytes on wire (720 bits), 90 bytes captured (720 bits)
>      Encapsulation type: Ethernet (1)
>      Arrival Time: Feb 23, 2021 18:16:08.780653000 CET
>      [Time shift for this packet: 0.000000000 seconds]
>      Epoch Time: 1614100568.780653000 seconds
>      [Time delta from previous captured frame: 0.000079000 seconds]
>      [Time delta from previous displayed frame: 0.000079000 seconds]
>      [Time since reference or first frame: 0.004969000 seconds]
>      Frame Number: 12
>      Frame Length: 90 bytes (720 bits)
>      Capture Length: 90 bytes (720 bits)
>      [Frame is marked: False]
>      [Frame is ignored: False]
>      [Protocols in frame: eth:ethertype:ip:tcp:nvme-tcp]
>      [Coloring Rule Name: TCP]
>      [Coloring Rule String: tcp]
> Ethernet II, Src: IntelCor_41:16:c0 (b4:96:91:41:16:c0), Dst: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
>      Destination: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
>          Address: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
>          .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
>          .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
>      Source: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
>          Address: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
>          .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
>          .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
>      Type: IPv4 (0x0800)
> Internet Protocol Version 4, Src: 10.228.38.175, Dst: 10.228.194.30
>      0100 .... = Version: 4
>      .... 0101 = Header Length: 20 bytes (5)
>      Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
>          0000 00.. = Differentiated Services Codepoint: Default (0)
>          .... ..00 = Explicit Congestion Notification: Not ECN-Capable Transport (0)
>      Total Length: 76
>      Identification: 0x0000 (0)
>      Flags: 0x40, Don't fragment
>          0... .... = Reserved bit: Not set
>          .1.. .... = Don't fragment: Set
>          ..0. .... = More fragments: Not set
>      Fragment Offset: 0
>      Time to Live: 64
>      Protocol: TCP (6)
>      Header Checksum: 0x0000 [validation disabled]
>      [Header checksum status: Unverified]
>      Source Address: 10.228.38.175
>      Destination Address: 10.228.194.30
> Transmission Control Protocol, Src Port: 4420, Dst Port: 34895, Seq: 1, Ack: 1, Len: 24
>      Source Port: 4420
>      Destination Port: 34895
>      [Stream index: 3]
>      [TCP Segment Len: 24]
>      Sequence Number: 1    (relative sequence number)
>      Sequence Number (raw): 3092812012
>      [Next Sequence Number: 25    (relative sequence number)]
>      Acknowledgment Number: 1    (relative ack number)
>      Acknowledgment number (raw): 2384147181
>      1000 .... = Header Length: 32 bytes (8)
>      Flags: 0x018 (PSH, ACK)
>          000. .... .... = Reserved: Not set
>          ...0 .... .... = Nonce: Not set
>          .... 0... .... = Congestion Window Reduced (CWR): Not set
>          .... .0.. .... = ECN-Echo: Not set
>          .... ..0. .... = Urgent: Not set
>          .... ...1 .... = Acknowledgment: Set
>          .... .... 1... = Push: Set
>          .... .... .0.. = Reset: Not set
>          .... .... ..0. = Syn: Not set
>          .... .... ...0 = Fin: Not set
>          [TCP Flags: ·······AP···]
>      Window: 257
>      [Calculated window size: 257]
>      [Window size scaling factor: -1 (unknown)]
>      Checksum: 0xfed3 [unverified]
>      [Checksum Status: Unverified]
>      Urgent Pointer: 0
>      Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
>          TCP Option - No-Operation (NOP)
>              Kind: No-Operation (1)
>          TCP Option - No-Operation (NOP)
>              Kind: No-Operation (1)
>          TCP Option - Timestamps: TSval 3874335934, TSecr 3601030412
>              Kind: Time Stamp Option (8)
>              Length: 10
>              Timestamp value: 3874335934
>              Timestamp echo reply: 3601030412
>      [SEQ/ACK analysis]
>          [Bytes in flight: 24]
>          [Bytes sent since last PSH flag: 24]
>      [Timestamps]
>          [Time since first frame in this TCP stream: 0.000000000 seconds]
>          [Time since previous frame in this TCP stream: 0.000000000 seconds]
>      TCP payload (24 bytes)
>      [PDU Size: 24]
> NVM Express Fabrics TCP
>      Pdu Type: CapsuleResponse (5)
>      Pdu Specific Flags: 0x00
>          .... ...0 = PDU Header Digest: Not set
>          .... ..0. = PDU Data Digest: Not set
>          .... .0.. = PDU Data Last: Not set
>          .... 0... = PDU Data Success: Not set
>      Pdu Header Length: 24
>      Pdu Data Offset: 0
>      Packet Length: 24
>      Unknown Data: 02000400000000001b0000001f000000
> 
> 0000  00 00 0c 9f f5 a8 b4 96 91 41 16 c0 08 00 45 00   .........A....E.
> 0010  00 4c 00 00 40 00 40 06 00 00 0a e4 26 af 0a e4   .L..@.@.....&...
> 0020  c2 1e 11 44 88 4f b8 58 90 ec 8e 1b 32 ed 80 18   ...D.O.X....2...
> 0030  01 01 fe d3 00 00 01 01 08 0a e6 ed ac be d6 a3   ................
> 0040  5d 0c 05 00 18 00 18 00 00 00 02 00 04 00 00 00   ]...............
> 0050  00 00 1b 00 00 00 1f 00 00 00                     ..........
> 
As I suspected, we did receive an invalid frame.
Data digest would have saved us, but then it's not enabled.

So we do need to check if the request is valid before processing it.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] nvme-tcp: Check if request has started before processing it
  2021-02-26 12:54               ` Hannes Reinecke
@ 2021-02-26 16:13                 ` Keith Busch
  2021-02-26 16:42                   ` Hannes Reinecke
  0 siblings, 1 reply; 22+ messages in thread
From: Keith Busch @ 2021-02-26 16:13 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Daniel Wagner, Sagi Grimberg, Jens Axboe, Christoph Hellwig,
	linux-nvme, linux-kernel

On Fri, Feb 26, 2021 at 01:54:00PM +0100, Hannes Reinecke wrote:
> On 2/26/21 1:35 PM, Daniel Wagner wrote:
> > On Mon, Feb 15, 2021 at 01:29:45PM -0800, Sagi Grimberg wrote:
> > > Well, I think we should probably figure out why that is happening first.
> > 
> > I got my hands on a tcpdump trace. I've trimmed it to this:
> > 
> > 
> > No.     Time           Source                Destination           Protocol Length Info
> >        1 0.000000       10.228.194.30         10.228.38.214         NVMe     138    NVMe Write
> >        2 0.000285       10.228.38.214         10.228.194.30         NVMe/TCP 90     Ready To Transfer
> >        3 0.000591       10.228.194.30         10.228.38.214         NVMe     4186   NVMe Write: Data
> >        4 0.000673       10.228.38.214         10.228.194.30         TCP      66     4420 → 58535 [ACK] Seq=25 Ack=4193 Win=241 Len=0 TSval=2655324576 TSecr=1497295579
> >        5 0.002140       10.228.38.214         10.228.194.30         NVMe     90     NVMe Write: Response
> >        6 0.002511       10.228.194.30         10.228.38.175         NVMe     138    NVMe Write
> >        7 0.002812       10.228.38.175         10.228.194.30         NVMe/TCP 90     Ready To Transfer
> >        8 0.003006       10.228.194.30         10.228.38.175         NVMe     4186   NVMe Write: Data
> >        9 0.003098       10.228.38.175         10.228.194.30         TCP      66     4420 → 51241 [ACK] Seq=25 Ack=4193 Win=241 Len=0 TSval=2183410196 TSecr=3601034207
> >       10 0.004420       10.228.38.175         10.228.194.30         NVMe     90     NVMe Write: Response
> >       11 0.004890       10.228.38.214         10.228.194.30         NVMe/TCP 90
> >       12 0.004969       10.228.38.175         10.228.194.30         NVMe/TCP 90
> > 
> > 
> > The last few seconds contain only normal writes and suddenly the host
> > receives two invalid packets. From what I see the host doesn't misbehave
> > at all. I wonder if it would be possible to detect the invalid packet by
> > locking at the PDU header only. If this would be possible we could
> > discard it early and do not try to use the invalid command id.
> > 
> > Here the last two packets with details:
> > 
> > 
> > No.     Time           Source                Destination           Protocol Length Info
> >       11 0.004890       10.228.38.214         10.228.194.30         NVMe/TCP 90
> > 
> > Frame 11: 90 bytes on wire (720 bits), 90 bytes captured (720 bits)
> >      Encapsulation type: Ethernet (1)
> >      Arrival Time: Feb 23, 2021 18:16:08.780574000 CET
> >      [Time shift for this packet: 0.000000000 seconds]
> >      Epoch Time: 1614100568.780574000 seconds
> >      [Time delta from previous captured frame: 0.000470000 seconds]
> >      [Time delta from previous displayed frame: 0.000470000 seconds]
> >      [Time since reference or first frame: 0.004890000 seconds]
> >      Frame Number: 11
> >      Frame Length: 90 bytes (720 bits)
> >      Capture Length: 90 bytes (720 bits)
> >      [Frame is marked: False]
> >      [Frame is ignored: False]
> >      [Protocols in frame: eth:ethertype:ip:tcp:nvme-tcp]
> >      [Coloring Rule Name: TCP]
> >      [Coloring Rule String: tcp]
> > Ethernet II, Src: IntelCor_41:16:c0 (b4:96:91:41:16:c0), Dst: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
> >      Destination: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
> >          Address: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
> >          .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
> >          .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
> >      Source: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
> >          Address: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
> >          .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
> >          .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
> >      Type: IPv4 (0x0800)
> > Internet Protocol Version 4, Src: 10.228.38.214, Dst: 10.228.194.30
> >      0100 .... = Version: 4
> >      .... 0101 = Header Length: 20 bytes (5)
> >      Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
> >          0000 00.. = Differentiated Services Codepoint: Default (0)
> >          .... ..00 = Explicit Congestion Notification: Not ECN-Capable Transport (0)
> >      Total Length: 76
> >      Identification: 0x0000 (0)
> >      Flags: 0x40, Don't fragment
> >          0... .... = Reserved bit: Not set
> >          .1.. .... = Don't fragment: Set
> >          ..0. .... = More fragments: Not set
> >      Fragment Offset: 0
> >      Time to Live: 64
> >      Protocol: TCP (6)
> >      Header Checksum: 0x0000 [validation disabled]
> >      [Header checksum status: Unverified]
> >      Source Address: 10.228.38.214
> >      Destination Address: 10.228.194.30
> > Transmission Control Protocol, Src Port: 4420, Dst Port: 46909, Seq: 1, Ack: 1, Len: 24
> >      Source Port: 4420
> >      Destination Port: 46909
> >      [Stream index: 2]
> >      [TCP Segment Len: 24]
> >      Sequence Number: 1    (relative sequence number)
> >      Sequence Number (raw): 4175488220
> >      [Next Sequence Number: 25    (relative sequence number)]
> >      Acknowledgment Number: 1    (relative ack number)
> >      Acknowledgment number (raw): 2966903626
> >      1000 .... = Header Length: 32 bytes (8)
> >      Flags: 0x018 (PSH, ACK)
> >          000. .... .... = Reserved: Not set
> >          ...0 .... .... = Nonce: Not set
> >          .... 0... .... = Congestion Window Reduced (CWR): Not set
> >          .... .0.. .... = ECN-Echo: Not set
> >          .... ..0. .... = Urgent: Not set
> >          .... ...1 .... = Acknowledgment: Set
> >          .... .... 1... = Push: Set
> >          .... .... .0.. = Reset: Not set
> >          .... .... ..0. = Syn: Not set
> >          .... .... ...0 = Fin: Not set
> >          [TCP Flags: ·······AP···]
> >      Window: 257
> >      [Calculated window size: 257]
> >      [Window size scaling factor: -1 (unknown)]
> >      Checksum: 0xfefa [unverified]
> >      [Checksum Status: Unverified]
> >      Urgent Pointer: 0
> >      Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
> >          TCP Option - No-Operation (NOP)
> >              Kind: No-Operation (1)
> >          TCP Option - No-Operation (NOP)
> >              Kind: No-Operation (1)
> >          TCP Option - Timestamps: TSval 4211986351, TSecr 1497291787
> >              Kind: Time Stamp Option (8)
> >              Length: 10
> >              Timestamp value: 4211986351
> >              Timestamp echo reply: 1497291787
> >      [SEQ/ACK analysis]
> >          [Bytes in flight: 24]
> >          [Bytes sent since last PSH flag: 24]
> >      [Timestamps]
> >          [Time since first frame in this TCP stream: 0.000000000 seconds]
> >          [Time since previous frame in this TCP stream: 0.000000000 seconds]
> >      TCP payload (24 bytes)
> >      [PDU Size: 24]
> > NVM Express Fabrics TCP
> >      Pdu Type: CapsuleResponse (5)
> >      Pdu Specific Flags: 0x00
> >          .... ...0 = PDU Header Digest: Not set
> >          .... ..0. = PDU Data Digest: Not set
> >          .... .0.. = PDU Data Last: Not set
> >          .... 0... = PDU Data Success: Not set
> >      Pdu Header Length: 24
> >      Pdu Data Offset: 0
> >      Packet Length: 24
> >      Unknown Data: 02000400000000001c0000001f000000
> > 
> > 0000  00 00 0c 9f f5 a8 b4 96 91 41 16 c0 08 00 45 00   .........A....E.
> > 0010  00 4c 00 00 40 00 40 06 00 00 0a e4 26 d6 0a e4   .L..@.@.....&...
> > 0020  c2 1e 11 44 b7 3d f8 e0 e4 dc b0 d7 5b 4a 80 18   ...D.=......[J..
> > 0030  01 01 fe fa 00 00 01 01 08 0a fb 0d cf af 59 3e   ..............Y>
> > 0040  dc 0b 05 00 18 00 18 00 00 00 02 00 04 00 00 00   ................
> > 0050  00 00 1c 00 00 00 1f 00 00 00                     ..........
> > 
> > No.     Time           Source                Destination           Protocol Length Info
> >       12 0.004969       10.228.38.175         10.228.194.30         NVMe/TCP 90
> > 
> > Frame 12: 90 bytes on wire (720 bits), 90 bytes captured (720 bits)
> >      Encapsulation type: Ethernet (1)
> >      Arrival Time: Feb 23, 2021 18:16:08.780653000 CET
> >      [Time shift for this packet: 0.000000000 seconds]
> >      Epoch Time: 1614100568.780653000 seconds
> >      [Time delta from previous captured frame: 0.000079000 seconds]
> >      [Time delta from previous displayed frame: 0.000079000 seconds]
> >      [Time since reference or first frame: 0.004969000 seconds]
> >      Frame Number: 12
> >      Frame Length: 90 bytes (720 bits)
> >      Capture Length: 90 bytes (720 bits)
> >      [Frame is marked: False]
> >      [Frame is ignored: False]
> >      [Protocols in frame: eth:ethertype:ip:tcp:nvme-tcp]
> >      [Coloring Rule Name: TCP]
> >      [Coloring Rule String: tcp]
> > Ethernet II, Src: IntelCor_41:16:c0 (b4:96:91:41:16:c0), Dst: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
> >      Destination: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
> >          Address: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
> >          .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
> >          .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
> >      Source: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
> >          Address: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
> >          .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
> >          .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
> >      Type: IPv4 (0x0800)
> > Internet Protocol Version 4, Src: 10.228.38.175, Dst: 10.228.194.30
> >      0100 .... = Version: 4
> >      .... 0101 = Header Length: 20 bytes (5)
> >      Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
> >          0000 00.. = Differentiated Services Codepoint: Default (0)
> >          .... ..00 = Explicit Congestion Notification: Not ECN-Capable Transport (0)
> >      Total Length: 76
> >      Identification: 0x0000 (0)
> >      Flags: 0x40, Don't fragment
> >          0... .... = Reserved bit: Not set
> >          .1.. .... = Don't fragment: Set
> >          ..0. .... = More fragments: Not set
> >      Fragment Offset: 0
> >      Time to Live: 64
> >      Protocol: TCP (6)
> >      Header Checksum: 0x0000 [validation disabled]
> >      [Header checksum status: Unverified]
> >      Source Address: 10.228.38.175
> >      Destination Address: 10.228.194.30
> > Transmission Control Protocol, Src Port: 4420, Dst Port: 34895, Seq: 1, Ack: 1, Len: 24
> >      Source Port: 4420
> >      Destination Port: 34895
> >      [Stream index: 3]
> >      [TCP Segment Len: 24]
> >      Sequence Number: 1    (relative sequence number)
> >      Sequence Number (raw): 3092812012
> >      [Next Sequence Number: 25    (relative sequence number)]
> >      Acknowledgment Number: 1    (relative ack number)
> >      Acknowledgment number (raw): 2384147181
> >      1000 .... = Header Length: 32 bytes (8)
> >      Flags: 0x018 (PSH, ACK)
> >          000. .... .... = Reserved: Not set
> >          ...0 .... .... = Nonce: Not set
> >          .... 0... .... = Congestion Window Reduced (CWR): Not set
> >          .... .0.. .... = ECN-Echo: Not set
> >          .... ..0. .... = Urgent: Not set
> >          .... ...1 .... = Acknowledgment: Set
> >          .... .... 1... = Push: Set
> >          .... .... .0.. = Reset: Not set
> >          .... .... ..0. = Syn: Not set
> >          .... .... ...0 = Fin: Not set
> >          [TCP Flags: ·······AP···]
> >      Window: 257
> >      [Calculated window size: 257]
> >      [Window size scaling factor: -1 (unknown)]
> >      Checksum: 0xfed3 [unverified]
> >      [Checksum Status: Unverified]
> >      Urgent Pointer: 0
> >      Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
> >          TCP Option - No-Operation (NOP)
> >              Kind: No-Operation (1)
> >          TCP Option - No-Operation (NOP)
> >              Kind: No-Operation (1)
> >          TCP Option - Timestamps: TSval 3874335934, TSecr 3601030412
> >              Kind: Time Stamp Option (8)
> >              Length: 10
> >              Timestamp value: 3874335934
> >              Timestamp echo reply: 3601030412
> >      [SEQ/ACK analysis]
> >          [Bytes in flight: 24]
> >          [Bytes sent since last PSH flag: 24]
> >      [Timestamps]
> >          [Time since first frame in this TCP stream: 0.000000000 seconds]
> >          [Time since previous frame in this TCP stream: 0.000000000 seconds]
> >      TCP payload (24 bytes)
> >      [PDU Size: 24]
> > NVM Express Fabrics TCP
> >      Pdu Type: CapsuleResponse (5)
> >      Pdu Specific Flags: 0x00
> >          .... ...0 = PDU Header Digest: Not set
> >          .... ..0. = PDU Data Digest: Not set
> >          .... .0.. = PDU Data Last: Not set
> >          .... 0... = PDU Data Success: Not set
> >      Pdu Header Length: 24
> >      Pdu Data Offset: 0
> >      Packet Length: 24
> >      Unknown Data: 02000400000000001b0000001f000000
> > 
> > 0000  00 00 0c 9f f5 a8 b4 96 91 41 16 c0 08 00 45 00   .........A....E.
> > 0010  00 4c 00 00 40 00 40 06 00 00 0a e4 26 af 0a e4   .L..@.@.....&...
> > 0020  c2 1e 11 44 88 4f b8 58 90 ec 8e 1b 32 ed 80 18   ...D.O.X....2...
> > 0030  01 01 fe d3 00 00 01 01 08 0a e6 ed ac be d6 a3   ................
> > 0040  5d 0c 05 00 18 00 18 00 00 00 02 00 04 00 00 00   ]...............
> > 0050  00 00 1b 00 00 00 1f 00 00 00                     ..........
> > 
> As I suspected, we did receive an invalid frame.
> Data digest would have saved us, but then it's not enabled.
> 
> So we do need to check if the request is valid before processing it.

That's just addressing a symptom. You can't fully verify the request is
valid this way because the host could have started the same command ID
the very moment before the code checks it, incorrectly completing an
in-flight command and getting data corruption.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] nvme-tcp: Check if request has started before processing it
  2021-02-26 16:13                 ` Keith Busch
@ 2021-02-26 16:42                   ` Hannes Reinecke
  2021-02-26 17:19                     ` Keith Busch
  0 siblings, 1 reply; 22+ messages in thread
From: Hannes Reinecke @ 2021-02-26 16:42 UTC (permalink / raw)
  To: Keith Busch
  Cc: Daniel Wagner, Sagi Grimberg, Jens Axboe, Christoph Hellwig,
	linux-nvme, linux-kernel

On 2/26/21 5:13 PM, Keith Busch wrote:
> On Fri, Feb 26, 2021 at 01:54:00PM +0100, Hannes Reinecke wrote:
>> On 2/26/21 1:35 PM, Daniel Wagner wrote:
>>> On Mon, Feb 15, 2021 at 01:29:45PM -0800, Sagi Grimberg wrote:
>>>> Well, I think we should probably figure out why that is happening first.
>>>
>>> I got my hands on a tcpdump trace. I've trimmed it to this:
>>>
[ .. ]
>>> NVM Express Fabrics TCP
>>>       Pdu Type: CapsuleResponse (5)
>>>       Pdu Specific Flags: 0x00
>>>           .... ...0 = PDU Header Digest: Not set
>>>           .... ..0. = PDU Data Digest: Not set
>>>           .... .0.. = PDU Data Last: Not set
>>>           .... 0... = PDU Data Success: Not set
>>>       Pdu Header Length: 24
>>>       Pdu Data Offset: 0
>>>       Packet Length: 24
>>>       Unknown Data: 02000400000000001b0000001f000000
>>>
>>> 0000  00 00 0c 9f f5 a8 b4 96 91 41 16 c0 08 00 45 00   .........A....E.
>>> 0010  00 4c 00 00 40 00 40 06 00 00 0a e4 26 af 0a e4   .L..@.@.....&...
>>> 0020  c2 1e 11 44 88 4f b8 58 90 ec 8e 1b 32 ed 80 18   ...D.O.X....2...
>>> 0030  01 01 fe d3 00 00 01 01 08 0a e6 ed ac be d6 a3   ................
>>> 0040  5d 0c 05 00 18 00 18 00 00 00 02 00 04 00 00 00   ]...............
>>> 0050  00 00 1b 00 00 00 1f 00 00 00                     ..........
>>>
>> As I suspected, we did receive an invalid frame.
>> Data digest would have saved us, but then it's not enabled.
>>
>> So we do need to check if the request is valid before processing it.
> 
> That's just addressing a symptom. You can't fully verify the request is
> valid this way because the host could have started the same command ID
> the very moment before the code checks it, incorrectly completing an
> in-flight command and getting data corruption.
> 
Oh, I am fully aware.

Bad frames are just that, bad frames.
We can only fully validate that when digests are enabled, but I gather 
that controllers sending out bad frames wouldn't want to enable digests, 
either. So relying on that is possibly not an option.

So really what I'm trying to avoid is the host crashing on a bad frame.
That kind of thing always resonates bad with customers.
And tripping over an uninitialized command is just too stupid IMO.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] nvme-tcp: Check if request has started before processing it
  2021-02-26 16:42                   ` Hannes Reinecke
@ 2021-02-26 17:19                     ` Keith Busch
  2021-03-01 13:26                       ` Daniel Wagner
  0 siblings, 1 reply; 22+ messages in thread
From: Keith Busch @ 2021-02-26 17:19 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Daniel Wagner, Sagi Grimberg, Jens Axboe, Christoph Hellwig,
	linux-nvme, linux-kernel

On Fri, Feb 26, 2021 at 05:42:46PM +0100, Hannes Reinecke wrote:
> On 2/26/21 5:13 PM, Keith Busch wrote:
> > 
> > That's just addressing a symptom. You can't fully verify the request is
> > valid this way because the host could have started the same command ID
> > the very moment before the code checks it, incorrectly completing an
> > in-flight command and getting data corruption.
> > 
> Oh, I am fully aware.
> 
> Bad frames are just that, bad frames.
> We can only fully validate that when digests are enabled, but I gather that
> controllers sending out bad frames wouldn't want to enable digests, either.
> So relying on that is possibly not an option.
> 
> So really what I'm trying to avoid is the host crashing on a bad frame.
> That kind of thing always resonates bad with customers.
> And tripping over an uninitialized command is just too stupid IMO.

Crashing is bad, silent data corruption is worse. Is there truly no
defense against that? If not, why should anyone rely on this?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] nvme-tcp: Check if request has started before processing it
  2021-02-26 17:19                     ` Keith Busch
@ 2021-03-01 13:26                       ` Daniel Wagner
  2021-03-01 13:55                         ` Hannes Reinecke
  0 siblings, 1 reply; 22+ messages in thread
From: Daniel Wagner @ 2021-03-01 13:26 UTC (permalink / raw)
  To: Keith Busch
  Cc: Hannes Reinecke, Sagi Grimberg, Jens Axboe, Christoph Hellwig,
	linux-nvme, linux-kernel

On Sat, Feb 27, 2021 at 02:19:01AM +0900, Keith Busch wrote:
> Crashing is bad, silent data corruption is worse. Is there truly no
> defense against that? If not, why should anyone rely on this?

If we receive an response for which we don't have a started request, we
know that something is wrong. Couldn't we in just reset the connection
in this case? We don't have to pretend nothing has happened and
continuing normally. This would avoid a host crash and would not create
(more) data corruption. Or I am just too naive?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] nvme-tcp: Check if request has started before processing it
  2021-03-01 13:26                       ` Daniel Wagner
@ 2021-03-01 13:55                         ` Hannes Reinecke
  2021-03-01 16:05                           ` Keith Busch
  0 siblings, 1 reply; 22+ messages in thread
From: Hannes Reinecke @ 2021-03-01 13:55 UTC (permalink / raw)
  To: Daniel Wagner, Keith Busch
  Cc: Sagi Grimberg, Jens Axboe, Christoph Hellwig, linux-nvme, linux-kernel

On 3/1/21 2:26 PM, Daniel Wagner wrote:
> On Sat, Feb 27, 2021 at 02:19:01AM +0900, Keith Busch wrote:
>> Crashing is bad, silent data corruption is worse. Is there truly no
>> defense against that? If not, why should anyone rely on this?
> 
> If we receive an response for which we don't have a started request, we
> know that something is wrong. Couldn't we in just reset the connection
> in this case? We don't have to pretend nothing has happened and
> continuing normally. This would avoid a host crash and would not create
> (more) data corruption. Or I am just too naive?
> 
This is actually a sensible solution.
Please send a patch for that.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] nvme-tcp: Check if request has started before processing it
  2021-03-01 13:55                         ` Hannes Reinecke
@ 2021-03-01 16:05                           ` Keith Busch
  2021-03-01 16:53                             ` Hannes Reinecke
  0 siblings, 1 reply; 22+ messages in thread
From: Keith Busch @ 2021-03-01 16:05 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Daniel Wagner, Sagi Grimberg, Jens Axboe, Christoph Hellwig,
	linux-nvme, linux-kernel

On Mon, Mar 01, 2021 at 02:55:30PM +0100, Hannes Reinecke wrote:
> On 3/1/21 2:26 PM, Daniel Wagner wrote:
> > On Sat, Feb 27, 2021 at 02:19:01AM +0900, Keith Busch wrote:
> >> Crashing is bad, silent data corruption is worse. Is there truly no
> >> defense against that? If not, why should anyone rely on this?
> > 
> > If we receive an response for which we don't have a started request, we
> > know that something is wrong. Couldn't we in just reset the connection
> > in this case? We don't have to pretend nothing has happened and
> > continuing normally. This would avoid a host crash and would not create
> > (more) data corruption. Or I am just too naive?
> > 
> This is actually a sensible solution.
> Please send a patch for that.

Is a bad frame a problem that can be resolved with a reset?

Even if so, the reset doesn't indicate to the user if previous commands
completed with bad data, so it still seems unreliable.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] nvme-tcp: Check if request has started before processing it
  2021-03-01 16:05                           ` Keith Busch
@ 2021-03-01 16:53                             ` Hannes Reinecke
  2021-03-01 20:59                               ` Keith Busch
  0 siblings, 1 reply; 22+ messages in thread
From: Hannes Reinecke @ 2021-03-01 16:53 UTC (permalink / raw)
  To: Keith Busch
  Cc: Daniel Wagner, Sagi Grimberg, Jens Axboe, Christoph Hellwig,
	linux-nvme, linux-kernel

On 3/1/21 5:05 PM, Keith Busch wrote:
> On Mon, Mar 01, 2021 at 02:55:30PM +0100, Hannes Reinecke wrote:
>> On 3/1/21 2:26 PM, Daniel Wagner wrote:
>>> On Sat, Feb 27, 2021 at 02:19:01AM +0900, Keith Busch wrote:
>>>> Crashing is bad, silent data corruption is worse. Is there truly no
>>>> defense against that? If not, why should anyone rely on this?
>>>
>>> If we receive an response for which we don't have a started request, we
>>> know that something is wrong. Couldn't we in just reset the connection
>>> in this case? We don't have to pretend nothing has happened and
>>> continuing normally. This would avoid a host crash and would not create
>>> (more) data corruption. Or I am just too naive?
>>>
>> This is actually a sensible solution.
>> Please send a patch for that.
> 
> Is a bad frame a problem that can be resolved with a reset?
> 
> Even if so, the reset doesn't indicate to the user if previous commands
> completed with bad data, so it still seems unreliable.
> 
We need to distinguish two cases here.
The one is use receiving a frame with an invalid tag, leading to a 
crash. This can be easily resolved by issuing a reset, as clearly the 
command was garbage and we need to invoke error handling (which is reset).

The other case is us receiving a frame with a _duplicate_ tag, ie a tag 
which is _currently_ valid. This is a case which will fail _even now_, 
as we have simply no way of detecting this.

So what again do we miss by fixing the first case?
Apart from a system which does _not_ crash?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] nvme-tcp: Check if request has started before processing it
  2021-03-01 16:53                             ` Hannes Reinecke
@ 2021-03-01 20:59                               ` Keith Busch
  2021-03-02  7:18                                 ` Hannes Reinecke
  0 siblings, 1 reply; 22+ messages in thread
From: Keith Busch @ 2021-03-01 20:59 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Daniel Wagner, Sagi Grimberg, Jens Axboe, Christoph Hellwig,
	linux-nvme, linux-kernel

On Mon, Mar 01, 2021 at 05:53:25PM +0100, Hannes Reinecke wrote:
> On 3/1/21 5:05 PM, Keith Busch wrote:
> > On Mon, Mar 01, 2021 at 02:55:30PM +0100, Hannes Reinecke wrote:
> > > On 3/1/21 2:26 PM, Daniel Wagner wrote:
> > > > On Sat, Feb 27, 2021 at 02:19:01AM +0900, Keith Busch wrote:
> > > > > Crashing is bad, silent data corruption is worse. Is there truly no
> > > > > defense against that? If not, why should anyone rely on this?
> > > > 
> > > > If we receive an response for which we don't have a started request, we
> > > > know that something is wrong. Couldn't we in just reset the connection
> > > > in this case? We don't have to pretend nothing has happened and
> > > > continuing normally. This would avoid a host crash and would not create
> > > > (more) data corruption. Or I am just too naive?
> > > > 
> > > This is actually a sensible solution.
> > > Please send a patch for that.
> > 
> > Is a bad frame a problem that can be resolved with a reset?
> > 
> > Even if so, the reset doesn't indicate to the user if previous commands
> > completed with bad data, so it still seems unreliable.
> > 
> We need to distinguish two cases here.
> The one is use receiving a frame with an invalid tag, leading to a crash.
> This can be easily resolved by issuing a reset, as clearly the command was
> garbage and we need to invoke error handling (which is reset).
> 
> The other case is us receiving a frame with a _duplicate_ tag, ie a tag
> which is _currently_ valid. This is a case which will fail _even now_, as we
> have simply no way of detecting this.
> 
> So what again do we miss by fixing the first case?
> Apart from a system which does _not_ crash?

I'm just saying each case is a symptom of the same problem. The only
difference from observing one vs the other is a race with the host's
dispatch. And since you're proposing this patch, it sounds like this
condition does happen on tcp compared to other transports where we don't
observe it. I just thought the implication that data corruption happens
is a alarming.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] nvme-tcp: Check if request has started before processing it
  2021-03-01 20:59                               ` Keith Busch
@ 2021-03-02  7:18                                 ` Hannes Reinecke
  2021-03-02 18:45                                   ` Keith Busch
  0 siblings, 1 reply; 22+ messages in thread
From: Hannes Reinecke @ 2021-03-02  7:18 UTC (permalink / raw)
  To: Keith Busch
  Cc: Daniel Wagner, Sagi Grimberg, Jens Axboe, Christoph Hellwig,
	linux-nvme, linux-kernel

On 3/1/21 9:59 PM, Keith Busch wrote:
> On Mon, Mar 01, 2021 at 05:53:25PM +0100, Hannes Reinecke wrote:
>> On 3/1/21 5:05 PM, Keith Busch wrote:
>>> On Mon, Mar 01, 2021 at 02:55:30PM +0100, Hannes Reinecke wrote:
>>>> On 3/1/21 2:26 PM, Daniel Wagner wrote:
>>>>> On Sat, Feb 27, 2021 at 02:19:01AM +0900, Keith Busch wrote:
>>>>>> Crashing is bad, silent data corruption is worse. Is there truly no
>>>>>> defense against that? If not, why should anyone rely on this?
>>>>>
>>>>> If we receive an response for which we don't have a started request, we
>>>>> know that something is wrong. Couldn't we in just reset the connection
>>>>> in this case? We don't have to pretend nothing has happened and
>>>>> continuing normally. This would avoid a host crash and would not create
>>>>> (more) data corruption. Or I am just too naive?
>>>>>
>>>> This is actually a sensible solution.
>>>> Please send a patch for that.
>>>
>>> Is a bad frame a problem that can be resolved with a reset?
>>>
>>> Even if so, the reset doesn't indicate to the user if previous commands
>>> completed with bad data, so it still seems unreliable.
>>>
>> We need to distinguish two cases here.
>> The one is use receiving a frame with an invalid tag, leading to a crash.
>> This can be easily resolved by issuing a reset, as clearly the command was
>> garbage and we need to invoke error handling (which is reset).
>>
>> The other case is us receiving a frame with a _duplicate_ tag, ie a tag
>> which is _currently_ valid. This is a case which will fail _even now_, as we
>> have simply no way of detecting this.
>>
>> So what again do we miss by fixing the first case?
>> Apart from a system which does _not_ crash?
> 
> I'm just saying each case is a symptom of the same problem. The only
> difference from observing one vs the other is a race with the host's
> dispatch. And since you're proposing this patch, it sounds like this
> condition does happen on tcp compared to other transports where we don't
> observe it. I just thought the implication that data corruption happens
> is a alarming.
> 
Oh yes, it is.
But sadly TCP inherently suffers from this, as literally anyone can
spoof frames on the network.
Other transports like RDMA or FC do not suffer to that extend as
spoofing frames there is far more elaborate, and not really possible
without dedicated hardware equipment.

That's why there is header and data digest; that will protect you
against accidental frame corruption (as this case clearly is; the
remainder of the frame is filled with zeroes).
It would not protect you against deliberate frame corruption; that's why
there is TPAR 8010 (TLS encryption for NVMe-TCP).

Be it as it may, none of these methods are in use here, and none of
these methods can be made mandatory. So we need to deal with the case at
hand.

And in my opinion crashing is the _worst_ options of all.
Tear the connection down, reset the thing, whatever.

But do not crash.
Customers tend to have a very dim view on crashing machines, and have a
very limited capacity for being susceptible to our reasoning in these cases.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] nvme-tcp: Check if request has started before processing it
  2021-03-02  7:18                                 ` Hannes Reinecke
@ 2021-03-02 18:45                                   ` Keith Busch
  0 siblings, 0 replies; 22+ messages in thread
From: Keith Busch @ 2021-03-02 18:45 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Daniel Wagner, Sagi Grimberg, Jens Axboe, Christoph Hellwig,
	linux-nvme, linux-kernel

On Tue, Mar 02, 2021 at 08:18:40AM +0100, Hannes Reinecke wrote:
> On 3/1/21 9:59 PM, Keith Busch wrote:
> > On Mon, Mar 01, 2021 at 05:53:25PM +0100, Hannes Reinecke wrote:
> >> On 3/1/21 5:05 PM, Keith Busch wrote:
> >>> On Mon, Mar 01, 2021 at 02:55:30PM +0100, Hannes Reinecke wrote:
> >>>> On 3/1/21 2:26 PM, Daniel Wagner wrote:
> >>>>> On Sat, Feb 27, 2021 at 02:19:01AM +0900, Keith Busch wrote:
> >>>>>> Crashing is bad, silent data corruption is worse. Is there truly no
> >>>>>> defense against that? If not, why should anyone rely on this?
> >>>>>
> >>>>> If we receive an response for which we don't have a started request, we
> >>>>> know that something is wrong. Couldn't we in just reset the connection
> >>>>> in this case? We don't have to pretend nothing has happened and
> >>>>> continuing normally. This would avoid a host crash and would not create
> >>>>> (more) data corruption. Or I am just too naive?
> >>>>>
> >>>> This is actually a sensible solution.
> >>>> Please send a patch for that.
> >>>
> >>> Is a bad frame a problem that can be resolved with a reset?
> >>>
> >>> Even if so, the reset doesn't indicate to the user if previous commands
> >>> completed with bad data, so it still seems unreliable.
> >>>
> >> We need to distinguish two cases here.
> >> The one is use receiving a frame with an invalid tag, leading to a crash.
> >> This can be easily resolved by issuing a reset, as clearly the command was
> >> garbage and we need to invoke error handling (which is reset).
> >>
> >> The other case is us receiving a frame with a _duplicate_ tag, ie a tag
> >> which is _currently_ valid. This is a case which will fail _even now_, as we
> >> have simply no way of detecting this.
> >>
> >> So what again do we miss by fixing the first case?
> >> Apart from a system which does _not_ crash?
> > 
> > I'm just saying each case is a symptom of the same problem. The only
> > difference from observing one vs the other is a race with the host's
> > dispatch. And since you're proposing this patch, it sounds like this
> > condition does happen on tcp compared to other transports where we don't
> > observe it. I just thought the implication that data corruption happens
> > is a alarming.
> > 
> Oh yes, it is.
> But sadly TCP inherently suffers from this, as literally anyone can
> spoof frames on the network.
> Other transports like RDMA or FC do not suffer to that extend as
> spoofing frames there is far more elaborate, and not really possible
> without dedicated hardware equipment.
> 
> That's why there is header and data digest; that will protect you
> against accidental frame corruption (as this case clearly is; the
> remainder of the frame is filled with zeroes).
> It would not protect you against deliberate frame corruption; that's why
> there is TPAR 8010 (TLS encryption for NVMe-TCP).
> 
> Be it as it may, none of these methods are in use here, and none of
> these methods can be made mandatory. So we need to deal with the case at
> hand.
> 
> And in my opinion crashing is the _worst_ options of all.
> Tear the connection down, reset the thing, whatever.
> 
> But do not crash.
> Customers tend to have a very dim view on crashing machines, and have a
> very limited capacity for being susceptible to our reasoning in these cases.

I was pushing the data corruption angle because fixing that should
address both cases. If there's really nothing you can do about
corruption, your approach here makes sense, and I defer to Sagi and
Christoph for inclusion.

I still wouldn't trust my data to storage behaving this way, though. :)

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2021-03-02 22:07 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-12 18:17 [PATCH] nvme-tcp: Check if request has started before processing it Daniel Wagner
2021-02-12 20:58 ` Sagi Grimberg
2021-02-12 21:09   ` Keith Busch
2021-02-12 21:49     ` Sagi Grimberg
2021-02-13  8:46       ` Hannes Reinecke
2021-02-15 10:40         ` Daniel Wagner
2021-02-15 21:29           ` Sagi Grimberg
2021-02-26 12:35             ` Daniel Wagner
2021-02-26 12:54               ` Hannes Reinecke
2021-02-26 16:13                 ` Keith Busch
2021-02-26 16:42                   ` Hannes Reinecke
2021-02-26 17:19                     ` Keith Busch
2021-03-01 13:26                       ` Daniel Wagner
2021-03-01 13:55                         ` Hannes Reinecke
2021-03-01 16:05                           ` Keith Busch
2021-03-01 16:53                             ` Hannes Reinecke
2021-03-01 20:59                               ` Keith Busch
2021-03-02  7:18                                 ` Hannes Reinecke
2021-03-02 18:45                                   ` Keith Busch
2021-02-15 21:23         ` Sagi Grimberg
2021-02-16  8:51           ` Hannes Reinecke
2021-02-13  8:42 ` Hannes Reinecke

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).