linux-nvme.lists.infradead.org archive mirror
* nvme tcp receive errors
@ 2021-03-31 16:18 Keith Busch
  2021-03-31 19:10 ` Sagi Grimberg
  0 siblings, 1 reply; 50+ messages in thread
From: Keith Busch @ 2021-03-31 16:18 UTC (permalink / raw)
  To: linux-nvme, sagi, hch

Hi,

While running a read-write mixed workload, we are observing errors like:

  nvme nvme4: queue 2 no space in request 0x1

Based on tcpdump, all data for this queue is expected to satisfy the
command request. I'm not familiar enough with the tcp interfaces, so
could anyone provide pointers on how to debug this further?

Thanks,
Keith


* Re: nvme tcp receive errors
  2021-03-31 16:18 nvme tcp receive errors Keith Busch
@ 2021-03-31 19:10 ` Sagi Grimberg
  2021-03-31 20:49   ` Keith Busch
  0 siblings, 1 reply; 50+ messages in thread
From: Sagi Grimberg @ 2021-03-31 19:10 UTC (permalink / raw)
  To: Keith Busch, linux-nvme, hch


> Hi,

Hey Keith,

> While running a read-write mixed workload, we are observing errors like:
> 
>    nvme nvme4: queue 2 no space in request 0x1

This means that we get a data payload from a read request and
we don't have a bio/bvec space to store it, which means we
are probably not tracking the request iterator correctly if
tcpdump shows that we are getting the right data length.
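
(For reference, the message comes from the receive path in
nvme_tcp_recv_data(); paraphrasing the v5.12-era code from memory, so
details may differ slightly, the host errors out when payload keeps
arriving after the request's bvec iterator and bio chain are exhausted:

	if (!iov_iter_count(&req->iter)) {
		req->curr_bio = req->curr_bio->bi_next;
		/* controller sent more data than the request has room for */
		if (!req->curr_bio) {
			dev_err(queue->ctrl->ctrl.device,
				"queue %d no space in request %#x",
				nvme_tcp_queue_id(queue), rq->tag);
			nvme_tcp_init_recv_ctx(queue);
			return -EIO;
		}
		nvme_tcp_init_iter(req, READ);
	}

so a correct wire trace plus this error points at the host-side
iterator/bio accounting rather than at the target.)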

> Based on tcpdump, all data for this queue is expected to satisfy the
> command request. I'm not familiar enough with the tcp interfaces, so
> could anyone provide pointers on how to debug this further?

What was the size of the I/O that you were using? Is this easily
reproducible?

Do you have the below applied:
ca1ff67d0fb1 ("nvme-tcp: fix possible data corruption with bio merges")
0dc9edaf80ea ("nvme-tcp: pass multipage bvec to request iov_iter")

I'm assuming yes if you are using the latest nvme tree...

Does the issue still happen when you revert 0dc9edaf80ea?


* Re: nvme tcp receive errors
  2021-03-31 19:10 ` Sagi Grimberg
@ 2021-03-31 20:49   ` Keith Busch
  2021-03-31 22:16     ` Sagi Grimberg
  2021-04-02 17:11     ` Keith Busch
  0 siblings, 2 replies; 50+ messages in thread
From: Keith Busch @ 2021-03-31 20:49 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, hch

On Wed, Mar 31, 2021 at 12:10:55PM -0700, Sagi Grimberg wrote:
> Hey Keith,
> 
> > While running a read-write mixed workload, we are observing errors like:
> > 
> >    nvme nvme4: queue 2 no space in request 0x1
> 
> This means that we get a data payload from a read request and
> we don't have a bio/bvec space to store it, which means we
> are probably not tracking the request iterator correctly if
> tcpdump shows that we are getting the right data length.
> 
> > Based on tcpdump, all data for this queue is expected to satisfy the
> > command request. I'm not familiar enough with the tcp interfaces, so
> > could anyone provide pointers on how to debug this further?
> 
> What was the size of the I/O that you were using? Is this easily
> reproducible?
> 
> Do you have the below applied:
> ca1ff67d0fb1 ("nvme-tcp: fix possible data corruption with bio merges")
> 0dc9edaf80ea ("nvme-tcp: pass multipage bvec to request iov_iter")
> 
> I'm assuming yes if you are using the latest nvme tree...
> 
> Does the issue still happen when you revert 0dc9edaf80ea?

Thanks for the reply.

This was observed on the recent 5.12-rc4, so it has all the latest tcp
fixes. I'll check with reverting 0dc9edaf80ea and see if that makes a
difference. It is currently reproducible, though it can take over an
hour right now.


* Re: nvme tcp receive errors
  2021-03-31 20:49   ` Keith Busch
@ 2021-03-31 22:16     ` Sagi Grimberg
  2021-03-31 22:26       ` Keith Busch
  2021-04-02 17:11     ` Keith Busch
  1 sibling, 1 reply; 50+ messages in thread
From: Sagi Grimberg @ 2021-03-31 22:16 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, hch


>> Hey Keith,
>>
>>> While running a read-write mixed workload, we are observing errors like:
>>>
>>>     nvme nvme4: queue 2 no space in request 0x1
>>
>> This means that we get a data payload from a read request and
>> we don't have a bio/bvec space to store it, which means we
>> are probably not tracking the request iterator correctly if
>> tcpdump shows that we are getting the right data length.
>>
>>> Based on tcpdump, all data for this queue is expected to satisfy the
>>> command request. I'm not familiar enough with the tcp interfaces, so
>>> could anyone provide pointers on how to debug this further?
>>
>> What was the size of the I/O that you were using? Is this easily
>> reproducible?
>>
>> Do you have the below applied:
>> ca1ff67d0fb1 ("nvme-tcp: fix possible data corruption with bio merges")
>> 0dc9edaf80ea ("nvme-tcp: pass multipage bvec to request iov_iter")
>>
>> I'm assuming yes if you are using the latest nvme tree...
>>
>> Does the issue still happen when you revert 0dc9edaf80ea?
> 
> Thanks for the reply.
> 
> This was observed on the recent 5.12-rc4, so it has all the latest tcp
> fixes. I'll check with reverting 0dc9edaf80ea and see if that makes a
> difference. It is currently reproducible, though it can take over an
> hour right now.

What is the workload you are running? have an fio job file?
Is this I/O to a raw block device? or with fs or iosched?

Also, I'm assuming that you are using Linux nvmet as the target
device?


* Re: nvme tcp receive errors
  2021-03-31 22:16     ` Sagi Grimberg
@ 2021-03-31 22:26       ` Keith Busch
  2021-03-31 22:45         ` Sagi Grimberg
  0 siblings, 1 reply; 50+ messages in thread
From: Keith Busch @ 2021-03-31 22:26 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, hch

On Wed, Mar 31, 2021 at 03:16:19PM -0700, Sagi Grimberg wrote:
> 
> > > Hey Keith,
> > > 
> > > > While running a read-write mixed workload, we are observing errors like:
> > > > 
> > > >     nvme nvme4: queue 2 no space in request 0x1
> > > 
> > > This means that we get a data payload from a read request and
> > > we don't have a bio/bvec space to store it, which means we
> > > are probably not tracking the request iterator correctly if
> > > tcpdump shows that we are getting the right data length.
> > > 
> > > > Based on tcpdump, all data for this queue is expected to satisfy the
> > > > command request. I'm not familiar enough with the tcp interfaces, so
> > > > could anyone provide pointers on how to debug this further?
> > > 
> > > What was the size of the I/O that you were using? Is this easily
> > > reproducible?
> > > 
> > > Do you have the below applied:
> > > ca1ff67d0fb1 ("nvme-tcp: fix possible data corruption with bio merges")
> > > 0dc9edaf80ea ("nvme-tcp: pass multipage bvec to request iov_iter")
> > > 
> > > I'm assuming yes if you are using the latest nvme tree...
> > > 
> > > Does the issue still happen when you revert 0dc9edaf80ea?
> > 
> > Thanks for the reply.
> > 
> > This was observed on the recent 5.12-rc4, so it has all the latest tcp
> > fixes. I'll check with reverting 0dc9edaf80ea and see if that makes a
> > difference. It is currently reproducible, though it can take over an
> > hour right now.
> 
> What is the workload you are running? have an fio job file?
> Is this I/O to a raw block device? or with fs or iosched?

It's O_DIRECT to raw block device using libaio engine. No fs, page
cache, or io scheduler are used.

The fio job is generated by a script that cycles through various sizes,
rw mixes, and io depth. It is not always consistent on which particular
set of parameters is running when the error message is observed,
though. I can get more details if this will be helpful.
 
> Also, I'm assuming that you are using Linux nvmet as the target
> device?

Not this time. The target is implemented in a hardware device.


* Re: nvme tcp receive errors
  2021-03-31 22:26       ` Keith Busch
@ 2021-03-31 22:45         ` Sagi Grimberg
  0 siblings, 0 replies; 50+ messages in thread
From: Sagi Grimberg @ 2021-03-31 22:45 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, hch


>> What is the workload you are running? have an fio job file?
>> Is this I/O to a raw block device? or with fs or iosched?
> 
> It's O_DIRECT to raw block device using libaio engine. No fs, page
> cache, or io scheduler are used.

I see.

> 
> The fio job is generated by a script that cycles through various sizes,
> rw mixes, and io depth. It is not always consistent on which particular
> set of parameters is running when the error message is observed,
> though. I can get more details if this will be helpful.

Try out a debug patch [1], and when this happens we can get some more
info on the request itself.

>> Also, I'm assuming that you are using Linux nvmet as the target
>> device?
> 
> Not this time. The target is implemented in a hardware device.

Ha, cool...

[1]:
---
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 69f59d2c5799..b218a41ac088 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -37,6 +37,14 @@ enum nvme_tcp_send_state {
         NVME_TCP_SEND_DDGST,
  };

+enum nvme_tcp_cmd_state {
+       NVME_TCP_CMD_QUEUED = 0,
+       NVME_TCP_CMD_SEND_PDU,
+       NVME_TCP_CMD_PENDING_DATA,
+       NVME_TCP_CMD_DATA_DONE,
+       NVME_TCP_CMD_DONE,
+};
+
  struct nvme_tcp_request {
         struct nvme_request     req;
         void                    *pdu;
@@ -56,6 +64,7 @@ struct nvme_tcp_request {
         size_t                  offset;
         size_t                  data_sent;
         enum nvme_tcp_send_state state;
+       enum nvme_tcp_cmd_state cmd_state;
  };

  enum nvme_tcp_queue_flags {
@@ -482,6 +491,7 @@ static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
  static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
                 struct nvme_completion *cqe)
  {
+       struct nvme_tcp_request *req;
         struct request *rq;

         rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), cqe->command_id);
@@ -493,6 +503,8 @@ static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
                 return -EINVAL;
         }

+       req = blk_mq_rq_to_pdu(rq);
+       req->cmd_state = NVME_TCP_CMD_DONE;
         if (!nvme_try_complete_req(rq, cqe->status, cqe->result))
                 nvme_complete_rq(rq);
         queue->nr_cqe++;
@@ -503,6 +515,7 @@ static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
  static int nvme_tcp_handle_c2h_data(struct nvme_tcp_queue *queue,
                 struct nvme_tcp_data_pdu *pdu)
  {
+       struct nvme_tcp_request *req;
         struct request *rq;

         rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
@@ -512,11 +525,12 @@ static int nvme_tcp_handle_c2h_data(struct nvme_tcp_queue *queue,
                         nvme_tcp_queue_id(queue), pdu->command_id);
                 return -ENOENT;
         }
+       req = blk_mq_rq_to_pdu(rq);

         if (!blk_rq_payload_bytes(rq)) {
                 dev_err(queue->ctrl->ctrl.device,
-                       "queue %d tag %#x unexpected data\n",
-                       nvme_tcp_queue_id(queue), rq->tag);
+                       "queue %d tag %#x unexpected data cmd_state %d\n",
+                       nvme_tcp_queue_id(queue), rq->tag, req->cmd_state);
                 return -EIO;
         }

@@ -755,7 +769,9 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
                        nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);
                         queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH;
                 } else {
+                       req->cmd_state = NVME_TCP_CMD_DATA_DONE;
                         if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
+                               req->cmd_state = NVME_TCP_CMD_DONE;
                                 nvme_tcp_end_request(rq, NVME_SC_SUCCESS);
                                 queue->nr_cqe++;
                         }
@@ -796,7 +812,10 @@ static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
         if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
                struct request *rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue),
                                                pdu->command_id);
+               struct nvme_tcp_request *req;

+               req = blk_mq_rq_to_pdu(rq);
+               req->cmd_state = NVME_TCP_CMD_DONE;
                 nvme_tcp_end_request(rq, NVME_SC_SUCCESS);
                 queue->nr_cqe++;
         }
@@ -944,6 +963,7 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
                                 nvme_tcp_ddgst_final(queue->snd_hash,
                                         &req->ddgst);
                                 req->state = NVME_TCP_SEND_DDGST;
+                               req->cmd_state = NVME_TCP_CMD_DATA_DONE;
                                 req->offset = 0;
                         } else {
                                 nvme_tcp_done_send_req(queue);
@@ -979,6 +999,7 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)

         len -= ret;
         if (!len) {
+               req->cmd_state = req->data_len ? NVME_TCP_CMD_PENDING_DATA : NVME_TCP_CMD_DATA_DONE;
                 if (inline_data) {
                         req->state = NVME_TCP_SEND_DATA;
                         if (queue->data_digest)
@@ -2329,6 +2350,7 @@ static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,

         blk_mq_start_request(rq);

+       req->cmd_state = NVME_TCP_CMD_QUEUED;
         nvme_tcp_queue_request(req, true, bd->last);

         return BLK_STS_OK;
---


* Re: nvme tcp receive errors
  2021-03-31 20:49   ` Keith Busch
  2021-03-31 22:16     ` Sagi Grimberg
@ 2021-04-02 17:11     ` Keith Busch
  2021-04-02 17:27       ` Sagi Grimberg
  1 sibling, 1 reply; 50+ messages in thread
From: Keith Busch @ 2021-04-02 17:11 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, hch

On Thu, Apr 01, 2021 at 05:49:58AM +0900, Keith Busch wrote:
> On Wed, Mar 31, 2021 at 12:10:55PM -0700, Sagi Grimberg wrote:
> > Hey Keith,
> > 
> > > While running a read-write mixed workload, we are observing errors like:
> > > 
> > >    nvme nvme4: queue 2 no space in request 0x1
> > 
> > This means that we get a data payload from a read request and
> > we don't have a bio/bvec space to store it, which means we
> > are probably not tracking the request iterator correctly if
> > tcpdump shows that we are getting the right data length.
> > 
> > > Based on tcpdump, all data for this queue is expected to satisfy the
> > > command request. I'm not familiar enough with the tcp interfaces, so
> > > could anyone provide pointers on how to debug this further?
> > 
> > What was the size of the I/O that you were using? Is this easily
> > reproducible?
> > 
> > Do you have the below applied:
> > ca1ff67d0fb1 ("nvme-tcp: fix possible data corruption with bio merges")
> > 0dc9edaf80ea ("nvme-tcp: pass multipage bvec to request iov_iter")
> > 
> > I'm assuming yes if you are using the latest nvme tree...
> > 
> > Does the issue still happen when you revert 0dc9edaf80ea?
> 
> Thanks for the reply.
> 
> This was observed on the recent 5.12-rc4, so it has all the latest tcp
> fixes. I'll check with reverting 0dc9edaf80ea and see if that makes a
> difference. It is currently reproducible, though it can take over an
> hour right now.

After reverting 0dc9edaf80ea, we are observing a kernel panic (below). We'll
try adding it back, plus adding your debug patch.

  BUG: kernel NULL pointer dereference, address: 000000000000002c
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0 
  Oops: 0000 [#1] SMP PTI
  CPU: 15 PID: 49319 Comm: fio Tainted: G           OE     5.12.0-051200rc4-generic #202103212230
  Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0a 08/02/2016
  RIP: 0010:nvme_tcp_init_iter+0x55/0xf0 [nvme_tcp]
  Code: ff ff b9 01 00 00 00 45 31 e4 48 8d 7b 68 4c 89 da 44 89 d6 e8 ec e4 16 c8 4c 89 63 70 5b 41 5c 41 5d 41 5e 5d c3 48 8b 57 60 <8b> 42 2c 4c 8b 6a 78 44 8b 42 28 44 8b 62 30 49 89 c3 48 89 c7 49
  RSP: 0018:ffffaafccd0eb920 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: ffff9ad834bfcf28 RCX: 0000000000002000
  RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9ad834bfcf28
  RBP: ffffaafccd0eb940 R08: 0000000000002000 R09: 0000000000000000
  R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000002000
  R13: fffff67f07c1ce00 R14: ffff9ad78422c5f0 R15: ffff9ad834bfcf28
  FS:  00007f010087d740(0000) GS:ffff9adedfdc0000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000000000000002c CR3: 000000027b11e005 CR4: 00000000003706e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   nvme_tcp_try_send_data+0x282/0x2e0 [nvme_tcp]
   ? nvme_tcp_init_iter+0x44/0xf0 [nvme_tcp]
   nvme_tcp_try_send+0x12f/0x1b0 [nvme_tcp]
   ? blk_mq_start_request+0x3a/0x100
   nvme_tcp_queue_rq+0x159/0x180 [nvme_tcp]
   __blk_mq_try_issue_directly+0x116/0x1e0
   blk_mq_try_issue_list_directly+0x174/0x2b0
   blk_mq_sched_insert_requests+0xa5/0xf0
   blk_mq_flush_plug_list+0x106/0x1b0
   blk_flush_plug_list+0xdd/0x100
   blk_finish_plug+0x29/0x40
   __blkdev_direct_IO+0x2ef/0x480
   ? aio_fsync_work+0xf0/0xf0
   blkdev_direct_IO+0x56/0x80
   generic_file_read_iter+0x9c/0x140
   blkdev_read_iter+0x35/0x40
   aio_read+0xe0/0x1a0
   ? __cond_resched+0x35/0x50
   ? slab_pre_alloc_hook.constprop.0+0x96/0xe0
   __io_submit_one.constprop.0+0x107/0x1f0
   io_submit_one+0xe3/0x3a0
   __x64_sys_io_submit+0x84/0x180
   do_syscall_64+0x38/0x90
   entry_SYSCALL_64_after_hwframe+0x44/0xae


* Re: nvme tcp receive errors
  2021-04-02 17:11     ` Keith Busch
@ 2021-04-02 17:27       ` Sagi Grimberg
  2021-04-05 14:37         ` Keith Busch
  0 siblings, 1 reply; 50+ messages in thread
From: Sagi Grimberg @ 2021-04-02 17:27 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, hch


>> Thanks for the reply.
>>
>> This was observed on the recent 5.12-rc4, so it has all the latest tcp
>> fixes. I'll check with reverting 0dc9edaf80ea and see if that makes a
>> difference. It is currently reproducible, though it can take over an
>> hour right now.
> 
> After reverting 0dc9edaf80ea, we are observing a kernel panic (below).

Ah, that's probably because WRITE_ZEROS are not set with RQF_SPECIAL..
This patch is actually needed.


> We'll try adding it back, plus adding your debug patch.

Yes, that would give us more info about what is the state the
request is in when getting these errors


* Re: nvme tcp receive errors
  2021-04-02 17:27       ` Sagi Grimberg
@ 2021-04-05 14:37         ` Keith Busch
  2021-04-07 19:53           ` Keith Busch
  2021-04-09 18:04           ` Sagi Grimberg
  0 siblings, 2 replies; 50+ messages in thread
From: Keith Busch @ 2021-04-05 14:37 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, hch

On Fri, Apr 02, 2021 at 10:27:11AM -0700, Sagi Grimberg wrote:
> 
> > > Thanks for the reply.
> > > 
> > > This was observed on the recent 5.12-rc4, so it has all the latest tcp
> > > fixes. I'll check with reverting 0dc9edaf80ea and see if that makes a
> > > difference. It is currently reproducible, though it can take over an
> > > hour right now.
> > 
> > After reverting 0dc9edaf80ea, we are observing a kernel panic (below).
> 
> Ah, that's probably because WRITE_ZEROS are not set with RQF_SPECIAL..
> This patch is actually needed.
> 
> 
> > We'll try adding it back, plus adding your debug patch.
> 
> Yes, that would give us more info about what is the state the
> request is in when getting these errors

We have recreated with your debug patch:

  nvme nvme4: queue 6 no space in request 0x1 no space cmd_state 3

State 3 corresponds to the "NVME_TCP_CMD_DATA_DONE".

The summary from the test that I received:

  We have an Ethernet trace for this failure. I filtered the trace for the
  connection that maps to "queue 6 of nvme4" and tracked the state of the IO
  command with Command ID 0x1 ("Tag 0x1"). The sequence for this command per
  the Ethernet trace is:

   1. The target receives this Command in an Ethernet frame that has  9 Command
      capsules and a partial H2CDATA PDU. The Command with ID 0x1 is a Read
      operation for 16K IO size
   2. The target sends 11 frames of C2HDATA PDU's each with 1416 bytes and one
      C2HDATA PDU with 832 bytes to complete the 16K transfer. LAS flag is set
      in the last PDU.
   3. The target sends a Response for this Command.
   4. About 1.3 ms later, the Host logs this msg and closes the connection.

Please let us know if you need any additional information.


* Re: nvme tcp receive errors
  2021-04-05 14:37         ` Keith Busch
@ 2021-04-07 19:53           ` Keith Busch
  2021-04-09 21:38             ` Sagi Grimberg
  2021-04-09 18:04           ` Sagi Grimberg
  1 sibling, 1 reply; 50+ messages in thread
From: Keith Busch @ 2021-04-07 19:53 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, hch

On Mon, Apr 05, 2021 at 11:37:02PM +0900, Keith Busch wrote:
> On Fri, Apr 02, 2021 at 10:27:11AM -0700, Sagi Grimberg wrote:
> > 
> > > > Thanks for the reply.
> > > > 
> > > > This was observed on the recent 5.12-rc4, so it has all the latest tcp
> > > > fixes. I'll check with reverting 0dc9edaf80ea and see if that makes a
> > > > difference. It is currently reproducible, though it can take over an
> > > > hour right now.
> > > 
> > > After reverting 0dc9edaf80ea, we are observing a kernel panic (below).
> > 
> > Ah, that's probably because WRITE_ZEROS are not set with RQF_SPECIAL..
> > This patch is actually needed.
> > 
> > 
> > > We'll try adding it back, plus adding your debug patch.
> > 
> > Yes, that would give us more info about what is the state the
> > request is in when getting these errors
> 
> We have recreated with your debug patch:
> 
>   nvme nvme4: queue 6 no space in request 0x1 no space cmd_state 3
> 
> State 3 corresponds to the "NVME_TCP_CMD_DATA_DONE".
> 
> The summary from the test that I received:
> 
>   We have an Ethernet trace for this failure. I filtered the trace for the
>   connection that maps to "queue 6 of nvme4" and tracked the state of the IO
>   command with Command ID 0x1 ("Tag 0x1"). The sequence for this command per
>   the Ethernet trace is:
> 
>    1. The target receives this Command in an Ethernet frame that has  9 Command
>       capsules and a partial H2CDATA PDU. The Command with ID 0x1 is a Read
>       operation for 16K IO size
>    2. The target sends 11 frames of C2HDATA PDU's each with 1416 bytes and one
>       C2HDATA PDU with 832 bytes to complete the 16K transfer. LAS flag is set
>       in the last PDU.
>    3. The target sends a Response for this Command.
>    4. About 1.3 ms later, the Host logs this msg and closes the connection.
> 
> Please let us know if you need any additional information.

I'm not sure if this is just a different symptom of the same problem,
but with the debug patch, we're occasionally hitting messages like:

  nvme nvme5: req 8 r2t len 16384 exceeded data len 16384 (8192 sent) cmd_state 2
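
(For reference, this message comes from the r2t validation in
nvme_tcp_setup_h2c_data_pdu(); the relevant check, roughly, is:

	req->pdu_len = le32_to_cpu(pdu->r2t_length);

	if (unlikely(req->data_sent + req->pdu_len > req->data_len)) {
		dev_err(queue->ctrl->ctrl.device,
			"req %d r2t len %u exceeded data len %u (%zu sent)\n",
			rq->tag, req->pdu_len, req->data_len,
			req->data_sent);
		return -EPROTO;
	}

i.e. the complaint is that the solicited length plus what the host
thinks it has already sent would overrun the request's total payload.)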


* Re: nvme tcp receive errors
  2021-04-05 14:37         ` Keith Busch
  2021-04-07 19:53           ` Keith Busch
@ 2021-04-09 18:04           ` Sagi Grimberg
  2021-04-14  0:29             ` Keith Busch
  1 sibling, 1 reply; 50+ messages in thread
From: Sagi Grimberg @ 2021-04-09 18:04 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, hch


>>>> Thanks for the reply.
>>>>
>>>> This was observed on the recent 5.12-rc4, so it has all the latest tcp
>>>> fixes. I'll check with reverting 0dc9edaf80ea and see if that makes a
>>>> difference. It is currently reproducible, though it can take over an
>>>> hour right now.
>>>
>>> After reverting 0dc9edaf80ea, we are observing a kernel panic (below).
>>
>> Ah, that's probably because WRITE_ZEROS are not set with RQF_SPECIAL..
>> This patch is actually needed.
>>
>>
>>> We'll try adding it back, plus adding your debug patch.
>>
>> Yes, that would give us more info about what is the state the
>> request is in when getting these errors
> 
> We have recreated with your debug patch:
> 
>    nvme nvme4: queue 6 no space in request 0x1 no space cmd_state 3
> 
> State 3 corresponds to the "NVME_TCP_CMD_DATA_DONE".
> 
> The summary from the test that I received:
> 
>    We have an Ethernet trace for this failure. I filtered the trace for the
>    connection that maps to "queue 6 of nvme4" and tracked the state of the IO
>    command with Command ID 0x1 ("Tag 0x1"). The sequence for this command per
>    the Ethernet trace is:
> 
>     1. The target receives this Command in an Ethernet frame that has  9 Command
>        capsules and a partial H2CDATA PDU. The Command with ID 0x1 is a Read
>        operation for 16K IO size
>     2. The target sends 11 frames of C2HDATA PDU's each with 1416 bytes and one
>        C2HDATA PDU with 832 bytes to complete the 16K transfer. LAS flag is set
>        in the last PDU.

Are the c2hdata pdus have data_length of 1416? and the last has 
data_length = 832?

1416 * 11 + 832 = 16408 > 16384

Can you share for each of the c2hdata PDUs what is:
- hlen
- plen
- data_length
- data_offset


* Re: nvme tcp receive errors
  2021-04-07 19:53           ` Keith Busch
@ 2021-04-09 21:38             ` Sagi Grimberg
  2021-04-27 23:39               ` Keith Busch
  0 siblings, 1 reply; 50+ messages in thread
From: Sagi Grimberg @ 2021-04-09 21:38 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, hch


>>>>> This was observed on the recent 5.12-rc4, so it has all the latest tcp
>>>>> fixes. I'll check with reverting 0dc9edaf80ea and see if that makes a
>>>>> difference. It is currently reproducible, though it can take over an
>>>>> hour right now.
>>>>
>>>> After reverting 0dc9edaf80ea, we are observing a kernel panic (below).
>>>
>>> Ah, that's probably because WRITE_ZEROS are not set with RQF_SPECIAL..
>>> This patch is actually needed.
>>>
>>>
>>>> We'll try adding it back, plus adding your debug patch.
>>>
>>> Yes, that would give us more info about what is the state the
>>> request is in when getting these errors
>>
>> We have recreated with your debug patch:
>>
>>    nvme nvme4: queue 6 no space in request 0x1 no space cmd_state 3
>>
>> State 3 corresponds to the "NVME_TCP_CMD_DATA_DONE".
>>
>> The summary from the test that I received:
>>
>>    We have an Ethernet trace for this failure. I filtered the trace for the
>>    connection that maps to "queue 6 of nvme4" and tracked the state of the IO
>>    command with Command ID 0x1 ("Tag 0x1"). The sequence for this command per
>>    the Ethernet trace is:
>>
>>     1. The target receives this Command in an Ethernet frame that has  9 Command
>>        capsules and a partial H2CDATA PDU. The Command with ID 0x1 is a Read
>>        operation for 16K IO size
>>     2. The target sends 11 frames of C2HDATA PDU's each with 1416 bytes and one
>>        C2HDATA PDU with 832 bytes to complete the 16K transfer. LAS flag is set
>>        in the last PDU.
>>     3. The target sends a Response for this Command.
>>     4. About 1.3 ms later, the Host logs this msg and closes the connection.
>>
>> Please let us know if you need any additional information.
> 
> I'm not sure if this is just a different symptom of the same problem,
> but with the debug patch, we're occasionally hitting messages like:
> 
>    nvme nvme5: req 8 r2t len 16384 exceeded data len 16384 (8192 sent) cmd_state 2

According to this message, this means the host got an r2t for 16384
bytes after it already sent 8192 (which can only happen if it previously
got an r2t soliciting 8192 bytes or more that accumulate to that). Can
you share for each r2t pdu in this sequence:
r2t_length
r2t_offset


* Re: nvme tcp receive errors
  2021-04-09 18:04           ` Sagi Grimberg
@ 2021-04-14  0:29             ` Keith Busch
  2021-04-21  5:33               ` Sagi Grimberg
  0 siblings, 1 reply; 50+ messages in thread
From: Keith Busch @ 2021-04-14  0:29 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, hch

On Fri, Apr 09, 2021 at 11:04:43AM -0700, Sagi Grimberg wrote:
> 
> > > > > Thanks for the reply.
> > > > > 
> > > > > This was observed on the recent 5.12-rc4, so it has all the latest tcp
> > > > > fixes. I'll check with reverting 0dc9edaf80ea and see if that makes a
> > > > > difference. It is currently reproducible, though it can take over an
> > > > > hour right now.
> > > > 
> > > > After reverting 0dc9edaf80ea, we are observing a kernel panic (below).
> > > 
> > > Ah, that's probably because WRITE_ZEROS are not set with RQF_SPECIAL..
> > > This patch is actually needed.
> > > 
> > > 
> > > > We'll try adding it back, plus adding your debug patch.
> > > 
> > > Yes, that would give us more info about what is the state the
> > > request is in when getting these errors
> > 
> > We have recreated with your debug patch:
> > 
> >    nvme nvme4: queue 6 no space in request 0x1 no space cmd_state 3
> > 
> > State 3 corresponds to the "NVME_TCP_CMD_DATA_DONE".
> > 
> > The summary from the test that I received:
> > 
> >    We have an Ethernet trace for this failure. I filtered the trace for the
> >    connection that maps to "queue 6 of nvme4" and tracked the state of the IO
> >    command with Command ID 0x1 ("Tag 0x1"). The sequence for this command per
> >    the Ethernet trace is:
> > 
> >     1. The target receives this Command in an Ethernet frame that has  9 Command
> >        capsules and a partial H2CDATA PDU. The Command with ID 0x1 is a Read
> >        operation for 16K IO size
> >     2. The target sends 11 frames of C2HDATA PDU's each with 1416 bytes and one
> >        C2HDATA PDU with 832 bytes to complete the 16K transfer. LAS flag is set
> >        in the last PDU.
> 
> Are the c2hdata pdus have data_length of 1416? and the last has data_length
> = 832?
> 
> 1416 * 11 + 832 = 16408 > 16384

Sorry, this was a mistake in the reporting. The last one's data length
was only 808; 832 was the packet length.

> Can you share for each of the c2hdata PDUs what is:
> - hlen

24 for all of them

> - plen

11 transfers at 1440, 832 for the last one

> - data_length

11 transfers at 1416, 808 for the last one

> - data_offset

0, 1416, 2832, 4248, 5564, 7080, 8496, 9912, 11328, 12744, 14160, 15567
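
(With that correction the C2HDATA data lengths add up exactly:
11 * 1416 + 808 = 15576 + 808 = 16384, i.e. the full 16K transfer.)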


* Re: nvme tcp receive errors
  2021-04-14  0:29             ` Keith Busch
@ 2021-04-21  5:33               ` Sagi Grimberg
  2021-04-21 14:28                 ` Keith Busch
  2021-04-26 15:31                 ` Keith Busch
  0 siblings, 2 replies; 50+ messages in thread
From: Sagi Grimberg @ 2021-04-21  5:33 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, hch

Hey Keith, sorry for the late response, been a bit under water
lately...

> Sorry, this was a mistake in the reporting. The last one's data length
> was only 808; 832 was the packet length.
> 
>> Can you share for each of the c2hdata PDUs what is:
>> - hlen
> 
> 24 for all of them
> 
>> - plen
> 
> 11 transfers at 1440, 832 for the last one
> 
>> - data_length
> 
> 11 transfers at 1416, 808 for the last one
> 
>> - data_offset
> 
> 0, 1416, 2832, 4248, 5564, 7080, 8496, 9912, 11328, 12744, 14160, 15567
> 

Can you retry with the following applied on top of what I sent you?
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index c60c1dcfb587..ff39d37e9793 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -63,6 +63,7 @@ struct nvme_tcp_request {
         /* send state */
         size_t                  offset;
         size_t                  data_sent;
+       size_t                  data_recvd;
         enum nvme_tcp_send_state state;
         enum nvme_tcp_cmd_state cmd_state;
  };
@@ -769,6 +770,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
                 *len -= recv_len;
                 *offset += recv_len;
                 queue->data_remaining -= recv_len;
+               req->data_recvd += recv_len;
         }

         if (!queue->data_remaining) {
@@ -776,6 +778,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
                        nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);
                         queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH;
                 } else {
+                       BUG_ON(req->data_recvd != req->data_len);
                         req->cmd_state = NVME_TCP_CMD_DATA_DONE;
                         if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
                                 req->cmd_state = NVME_TCP_CMD_DONE;
--

There might be a hidden assumption here that may cause this if multiple
c2hdata pdus will come per request...

If that is the case, you can try the following (on top):
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index ff39d37e9793..aabec8e6810a 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -773,19 +773,20 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
                 req->data_recvd += recv_len;
         }

-       if (!queue->data_remaining) {
+       if (!queue->data_remaining)
+               nvme_tcp_init_recv_ctx(queue);
+
+       if (req->data_recvd == req->data_len) {
                 if (queue->data_digest) {
                        nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);
                         queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH;
                 } else {
-                       BUG_ON(req->data_recvd != req->data_len);
                         req->cmd_state = NVME_TCP_CMD_DATA_DONE;
                         if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
                                 req->cmd_state = NVME_TCP_CMD_DONE;
                                 nvme_tcp_end_request(rq, NVME_SC_SUCCESS);
                                 queue->nr_cqe++;
                         }
-                       nvme_tcp_init_recv_ctx(queue);
                 }
         }
--
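
To spell out the assumption being relaxed here: queue->data_remaining
counts down within a single c2hdata PDU, while the new req->data_recvd
accumulates across all PDUs of the request, so "this PDU is fully
consumed" and "all of the request's data has arrived" only coincide
when the target sends a single c2hdata PDU per request. A toy
user-space sketch of that distinction (not driver code; the 8K+8K
split is just an example):

	#include <stddef.h>
	#include <stdio.h>

	struct req { size_t data_len, data_recvd; };

	/* Account for one c2hdata PDU worth of payload; returns 1 once the
	 * whole request has been received, 0 if more PDUs are expected. */
	static int recv_c2h_pdu(struct req *r, size_t pdu_data_len)
	{
		r->data_recvd += pdu_data_len;	/* per-request progress */
		printf("PDU consumed: %zu/%zu bytes received\n",
		       r->data_recvd, r->data_len);
		/* request completion is a per-request question, not per-PDU */
		return r->data_recvd == r->data_len;
	}

	int main(void)
	{
		struct req r = { .data_len = 16384, .data_recvd = 0 };
		int done;

		done = recv_c2h_pdu(&r, 8192);	/* first PDU: not done yet */
		done = recv_c2h_pdu(&r, 8192);	/* second PDU completes it */
		return done ? 0 : 1;
	}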


* Re: nvme tcp receive errors
  2021-04-21  5:33               ` Sagi Grimberg
@ 2021-04-21 14:28                 ` Keith Busch
  2021-04-21 16:59                   ` Sagi Grimberg
  2021-04-26 15:31                 ` Keith Busch
  1 sibling, 1 reply; 50+ messages in thread
From: Keith Busch @ 2021-04-21 14:28 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, hch

On Tue, Apr 20, 2021 at 10:33:30PM -0700, Sagi Grimberg wrote:
> Can you retry with the following applied on top of what I sent you?

Thanks, we'll give this a try. Just a quick question on the first patch:

> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index c60c1dcfb587..ff39d37e9793 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -63,6 +63,7 @@ struct nvme_tcp_request {
>         /* send state */
>         size_t                  offset;
>         size_t                  data_sent;
> +       size_t                  data_recvd;
>         enum nvme_tcp_send_state state;
>         enum nvme_tcp_cmd_state cmd_state;
>  };
> @@ -769,6 +770,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>                 *len -= recv_len;
>                 *offset += recv_len;
>                 queue->data_remaining -= recv_len;
> +               req->data_recvd += recv_len;

Does this need to get reset to 0 during the initial setup? It looks like
a "req->data_recvd = 0" in nvme_tcp_setup_cmd_pdu() should happen, no?

>         }
> 
>         if (!queue->data_remaining) {
> @@ -776,6 +778,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>                         nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);
>                         queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH;
>                 } else {
> +                       BUG_ON(req->data_recvd != req->data_len);
>                         req->cmd_state = NVME_TCP_CMD_DATA_DONE;
>                         if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
>                                 req->cmd_state = NVME_TCP_CMD_DONE;
> --
> 
> There might be a hidden assumption here that may cause this if multiple
> c2hdata pdus will come per request...
> 
> If that is the case, you can try the following (on top):
> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index ff39d37e9793..aabec8e6810a 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -773,19 +773,20 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>                 req->data_recvd += recv_len;
>         }
> 
> -       if (!queue->data_remaining) {
> +       if (!queue->data_remaining)
> +               nvme_tcp_init_recv_ctx(queue);
> +
> +       if (req->data_recvd == req->data_len) {
>                 if (queue->data_digest) {
>                         nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);
>                         queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH;
>                 } else {
> -                       BUG_ON(req->data_recvd != req->data_len);
>                         req->cmd_state = NVME_TCP_CMD_DATA_DONE;
>                         if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
>                                 req->cmd_state = NVME_TCP_CMD_DONE;
>                                 nvme_tcp_end_request(rq, NVME_SC_SUCCESS);
>                                 queue->nr_cqe++;
>                         }
> -                       nvme_tcp_init_recv_ctx(queue);
>                 }
>         }
> --


* Re: nvme tcp receive errors
  2021-04-21 14:28                 ` Keith Busch
@ 2021-04-21 16:59                   ` Sagi Grimberg
  0 siblings, 0 replies; 50+ messages in thread
From: Sagi Grimberg @ 2021-04-21 16:59 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, hch


>> Can you retry with the following applied on top of what I sent you?
> 
> Thanks, we'll give this a try. Just a quick question on the first patch:
> 
>> --
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index c60c1dcfb587..ff39d37e9793 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -63,6 +63,7 @@ struct nvme_tcp_request {
>>          /* send state */
>>          size_t                  offset;
>>          size_t                  data_sent;
>> +       size_t                  data_recvd;
>>          enum nvme_tcp_send_state state;
>>          enum nvme_tcp_cmd_state cmd_state;
>>   };
>> @@ -769,6 +770,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>>                  *len -= recv_len;
>>                  *offset += recv_len;
>>                  queue->data_remaining -= recv_len;
>> +               req->data_recvd += recv_len;
> 
> Does this need to get reset to 0 during the initial setup? It looks like
> a "req->data_recvd = 0" in nvme_tcp_setup_cmd_pdu() should happen, no?

Yes, this hunk is needed as well:
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 8e55d8bc0c50..226404239b49 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2274,6 +2274,7 @@ static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
         req->state = NVME_TCP_SEND_CMD_PDU;
         req->offset = 0;
         req->data_sent = 0;
+       req->data_recvd = 0;
         req->pdu_len = 0;
         req->pdu_sent = 0;
         req->data_len = blk_rq_nr_phys_segments(rq) ?
--


* Re: nvme tcp receive errors
  2021-04-21  5:33               ` Sagi Grimberg
  2021-04-21 14:28                 ` Keith Busch
@ 2021-04-26 15:31                 ` Keith Busch
  2021-04-27  3:10                   ` Sagi Grimberg
  1 sibling, 1 reply; 50+ messages in thread
From: Keith Busch @ 2021-04-26 15:31 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, hch

On Tue, Apr 20, 2021 at 10:33:30PM -0700, Sagi Grimberg wrote:
> There might be a hidden assumption here that may cause this if multiple
> c2hdata pdus will come per request...
> 
> If that is the case, you can try the following (on top):

This patch is successful for clearing up the "no space" issues observed
during read tests.

There are still some issues with write tests that look a bit different.
I'll get more details on that for you today, but it's probably okay if
you want to make a formal patch for the receive data side.

> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index ff39d37e9793..aabec8e6810a 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -773,19 +773,20 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>                 req->data_recvd += recv_len;
>         }
> 
> -       if (!queue->data_remaining) {
> +       if (!queue->data_remaining)
> +               nvme_tcp_init_recv_ctx(queue);
> +
> +       if (req->data_recvd == req->data_len) {
>                 if (queue->data_digest) {
>                         nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);
>                         queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH;
>                 } else {
> -                       BUG_ON(req->data_recvd != req->data_len);
>                         req->cmd_state = NVME_TCP_CMD_DATA_DONE;
>                         if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
>                                 req->cmd_state = NVME_TCP_CMD_DONE;
>                                 nvme_tcp_end_request(rq, NVME_SC_SUCCESS);
>                                 queue->nr_cqe++;
>                         }
> -                       nvme_tcp_init_recv_ctx(queue);
>                 }
>         }
> --


* Re: nvme tcp receive errors
  2021-04-26 15:31                 ` Keith Busch
@ 2021-04-27  3:10                   ` Sagi Grimberg
  2021-04-27 18:12                     ` Keith Busch
  0 siblings, 1 reply; 50+ messages in thread
From: Sagi Grimberg @ 2021-04-27  3:10 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, hch


> This patch is successful for clearing up the "no space" issues observed
> during read tests.
> 
> There are still some issues with write tests that look a bit different.
> I'll get more details on that for you today,

sure.

> but it's probably okay if
> you want to make a formal patch for the receive data side.

What should I put on the Reported-by: and Tested-by: tags?


* Re: nvme tcp receive errors
  2021-04-27  3:10                   ` Sagi Grimberg
@ 2021-04-27 18:12                     ` Keith Busch
  2021-04-27 23:58                       ` Sagi Grimberg
  0 siblings, 1 reply; 50+ messages in thread
From: Keith Busch @ 2021-04-27 18:12 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, hch

On Mon, Apr 26, 2021 at 08:10:38PM -0700, Sagi Grimberg wrote:
> 
> > This patch is successful for clearing up the "no space" issues observed
> > during read tests.
> > 
> > There are still some issues with write tests that look a bit different.
> > I'll get more details on that for you today,
> 
> sure.
> 
> > but it's probably okay if
> > you want to make a formal patch for the receive data side.
> 
> What should I put on the Reported-by: and Tested-by: tags?

This report and testing was done courtesy of

  Narayan Ayalasomayajula <Narayan.Ayalasomayajula@wdc.com>

Before you submit a patch, though, we did additional testing with data
digest enabled and observe a regression with the following error:

  nvme nvme0: queue 0: data digest flag is cleared

From looking at the patch, the following part looks a bit suspicious:

> @@ -776,19 +776,20 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
> 		req->data_recvd += recv_len;
> 	}
> 
> -	if (!queue->data_remaining) {
> +	if (!queue->data_remaining)
> +		nvme_tcp_init_recv_ctx(queue);

The code had previously called nvme_tcp_init_recv_ctx() only if
queue->data_digest wasn't set, but now it's called all the time. I see
that calling this function clears ddgst_remaining, so does that explain
the new errors?
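
(For reference, nvme_tcp_init_recv_ctx() resets the per-queue receive
state so that the next bytes on the wire are parsed as a fresh PDU
header; roughly, from my reading of the driver, it does:

	queue->pdu_remaining = sizeof(struct nvme_tcp_rsp_pdu) +
				nvme_tcp_hdgst_len(queue);
	queue->pdu_offset = 0;
	queue->data_remaining = -1;
	queue->ddgst_remaining = 0;

which includes the ddgst_remaining clear mentioned above.)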

> +	if (req->data_recvd == req->data_len) {
>  		if (queue->data_digest) {
>  			nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);
>  			queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH;
>  		} else {
> -			BUG_ON(req->data_recvd != req->data_len);
>  			req->cmd_state = NVME_TCP_CMD_DATA_DONE;
>  			if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
>  				req->cmd_state = NVME_TCP_CMD_DONE;
>  				nvme_tcp_end_request(rq, NVME_SC_SUCCESS);
>  				queue->nr_cqe++;
>  			}
> -			nvme_tcp_init_recv_ctx(queue);
>  		}
>  	}


* Re: nvme tcp receive errors
  2021-04-09 21:38             ` Sagi Grimberg
@ 2021-04-27 23:39               ` Keith Busch
  2021-04-27 23:55                 ` Sagi Grimberg
  0 siblings, 1 reply; 50+ messages in thread
From: Keith Busch @ 2021-04-27 23:39 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, hch

On Fri, Apr 09, 2021 at 02:38:38PM -0700, Sagi Grimberg wrote:
> 
> > > > > > This was observed on the recent 5.12-rc4, so it has all the latest tcp
> > > > > > fixes. I'll check with reverting 0dc9edaf80ea and see if that makes a
> > > > > > difference. It is currently reproducible, though it can take over an
> > > > > > hour right now.
> > > > > 
> > > > > After reverting 0dc9edaf80ea, we are observing a kernel panic (below).
> > > > 
> > > > Ah, that's probably because WRITE_ZEROS are not set with RQF_SPECIAL..
> > > > This patch is actually needed.
> > > > 
> > > > 
> > > > > We'll try adding it back, plus adding your debug patch.
> > > > 
> > > > Yes, that would give us more info about what is the state the
> > > > request is in when getting these errors
> > > 
> > > We have recreated with your debug patch:
> > > 
> > >    nvme nvme4: queue 6 no space in request 0x1 no space cmd_state 3
> > > 
> > > State 3 corresponds to the "NVME_TCP_CMD_DATA_DONE".
> > > 
> > > The summary from the test that I received:
> > > 
> > >    We have an Ethernet trace for this failure. I filtered the trace for the
> > >    connection that maps to "queue 6 of nvme4" and tracked the state of the IO
> > >    command with Command ID 0x1 ("Tag 0x1"). The sequence for this command per
> > >    the Ethernet trace is:
> > > 
> > >     1. The target receives this Command in an Ethernet frame that has  9 Command
> > >        capsules and a partial H2CDATA PDU. The Command with ID 0x1 is a Read
> > >        operation for 16K IO size
> > >     2. The target sends 11 frames of C2HDATA PDU's each with 1416 bytes and one
> > >        C2HDATA PDU with 832 bytes to complete the 16K transfer. LAS flag is set
> > >        in the last PDU.
> > >     3. The target sends a Response for this Command.
> > >     4. About 1.3 ms later, the Host logs this msg and closes the connection.
> > > 
> > > Please let us know if you need any additional information.
> > 
> > I'm not sure if this is just a different symptom of the same problem,
> > but with the debug patch, we're occasionally hitting messages like:
> > 
> >    nvme nvme5: req 8 r2t len 16384 exceeded data len 16384 (8192 sent) cmd_state 2
> 
> According to this message, this means the host got an r2t for 16384
> bytes after it already sent 8192 (which can only happen if it previously
> got an r2t soliciting 8192 bytes or more that accumulate to that). Can
> you share for each r2t pdu in this sequence:
> r2t_length
> r2t_offset

This one took a bit to go through. The traces all only have a single r2t
pdu with 0 offset for the full length of the requested transfer. I had
to add trace events to see what the heck the driver is thinking, and
the result is even more confusing.

The kernel message error:

    nvme5: req 5 op 1 r2t len 16384 exceeded data len 16384 (4096 sent)

And all the new trace events for this request are:

    fio-25086   [011] ....  9630.542669: nvme_tcp_queue_rq: nvme5: qid=4 tag=5 op=1 data_len=16384
    fio-25093   [007] ....  9630.542854: nvme_tcp_cmd_pdu: nvme5: qid=4 tag=5 op=1 page_offset=3664 send_len=72
    <...>-22670 [003] ....  9630.544377: nvme_tcp_r2t: nvme5: qid=4 tag=5 op=1 r2t_len=16384 r2t_offset=0 data_sent=4096 data_len=16384

The fact "data_sent" is non-zero on the very first r2t makes no sense to me.
I so far cannot find any sequence where that could happen.

Here's the tracepoints patch (I think something like this is worthy of being
included in the driver):

---
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 9c1f3d273fcf..931197fa4ef4 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -141,6 +141,9 @@ struct nvme_tcp_ctrl {
 	u32			io_queues[HCTX_MAX_TYPES];
 };
 
+#define CREATE_TRACE_POINTS
+#include "trace-tcp.h"
+
 static LIST_HEAD(nvme_tcp_ctrl_list);
 static DEFINE_MUTEX(nvme_tcp_ctrl_mutex);
 static struct workqueue_struct *nvme_tcp_wq;
@@ -582,10 +585,13 @@ static int nvme_tcp_setup_h2c_data_pdu(struct nvme_tcp_request *req,
 	struct request *rq = blk_mq_rq_from_pdu(req);
 	u8 hdgst = nvme_tcp_hdgst_len(queue);
 	u8 ddgst = nvme_tcp_ddgst_len(queue);
+	unsigned r2t_offset = le32_to_cpu(pdu->r2t_offset);
 
 	req->pdu_len = le32_to_cpu(pdu->r2t_length);
 	req->pdu_sent = 0;
 
+	trace_nvme_tcp_r2t(rq, req, r2t_offset);
+
 	if (unlikely(!req->pdu_len)) {
 		dev_err(queue->ctrl->ctrl.device,
 			"req %d r2t len is %u, probably a bug...\n",
@@ -595,17 +601,16 @@ static int nvme_tcp_setup_h2c_data_pdu(struct nvme_tcp_request *req,
 
 	if (unlikely(req->data_sent + req->pdu_len > req->data_len)) {
 		dev_err(queue->ctrl->ctrl.device,
-			"req %d r2t len %u exceeded data len %u (%zu sent)\n",
-			rq->tag, req->pdu_len, req->data_len,
-			req->data_sent);
+			"req %d op %d r2t len %u exceeded data len %u (%zu sent) pdu offset %u\n",
+			rq->tag, req_op(rq), req->pdu_len, req->data_len,
+			req->data_sent, r2t_offset);
 		return -EPROTO;
 	}
 
-	if (unlikely(le32_to_cpu(pdu->r2t_offset) < req->data_sent)) {
+	if (unlikely(r2t_offset < req->data_sent)) {
 		dev_err(queue->ctrl->ctrl.device,
 			"req %d unexpected r2t offset %u (expected %zu)\n",
-			rq->tag, le32_to_cpu(pdu->r2t_offset),
-			req->data_sent);
+			rq->tag, r2t_offset, req->data_sent);
 		return -EPROTO;
 	}
 
@@ -943,6 +948,7 @@ static void nvme_tcp_fail_request(struct nvme_tcp_request *req)
 static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
 {
 	struct nvme_tcp_queue *queue = req->queue;
+	struct request *rq = blk_mq_rq_from_pdu(req);
 
 	while (true) {
 		struct page *page = nvme_tcp_req_cur_page(req);
@@ -956,6 +962,7 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
 		else
 			flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
 
+		trace_nvme_tcp_data(rq, req, offset, len, last);
 		if (sendpage_ok(page)) {
 			ret = kernel_sendpage(queue->sock, page, offset, len,
 					flags);
@@ -977,9 +984,9 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
 				nvme_tcp_ddgst_final(queue->snd_hash,
 					&req->ddgst);
 				req->state = NVME_TCP_SEND_DDGST;
-				req->cmd_state = NVME_TCP_CMD_DATA_DONE;
 				req->offset = 0;
 			} else {
+				req->cmd_state = NVME_TCP_CMD_DATA_DONE;
 				nvme_tcp_done_send_req(queue);
 			}
 			return 1;
@@ -996,7 +1003,8 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
 	u8 hdgst = nvme_tcp_hdgst_len(queue);
 	int len = sizeof(*pdu) + hdgst - req->offset;
 	int flags = MSG_DONTWAIT;
-	int ret;
+	struct request *rq = blk_mq_rq_from_pdu(req);
+	int offset, ret;
 
 	if (inline_data || nvme_tcp_queue_more(queue))
 		flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
@@ -1006,8 +1014,10 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
 	if (queue->hdr_digest && !req->offset)
 		nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
 
-	ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
-			offset_in_page(pdu) + req->offset, len,  flags);
+	offset = offset_in_page(pdu) + req->offset;
+	trace_nvme_tcp_cmd_pdu(rq, req, offset, len);
+	ret = kernel_sendpage(queue->sock, virt_to_page(pdu), offset,
+			len,  flags);
 	if (unlikely(ret <= 0))
 		return ret;
 
@@ -1034,14 +1044,16 @@ static int nvme_tcp_try_send_data_pdu(struct nvme_tcp_request *req)
 	struct nvme_tcp_data_pdu *pdu = req->pdu;
 	u8 hdgst = nvme_tcp_hdgst_len(queue);
 	int len = sizeof(*pdu) - req->offset + hdgst;
-	int ret;
+	struct request *rq = blk_mq_rq_from_pdu(req);
+	int offset, ret;
 
 	if (queue->hdr_digest && !req->offset)
 		nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
 
-	ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
-			offset_in_page(pdu) + req->offset, len,
-			MSG_DONTWAIT | MSG_MORE | MSG_SENDPAGE_NOTLAST);
+	offset = offset_in_page(pdu) + req->offset;
+	trace_nvme_tcp_h2c_pdu(rq, req, offset, len);
+	ret = kernel_sendpage(queue->sock, virt_to_page(pdu), offset,
+			len, MSG_DONTWAIT | MSG_MORE | MSG_SENDPAGE_NOTLAST);
 	if (unlikely(ret <= 0))
 		return ret;
 
@@ -2367,6 +2379,7 @@ static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
 	if (!nvmf_check_ready(&queue->ctrl->ctrl, rq, queue_ready))
 		return nvmf_fail_nonready_command(&queue->ctrl->ctrl, rq);
 
+	trace_nvme_tcp_queue_rq(rq);
 	ret = nvme_tcp_setup_cmd_pdu(ns, rq);
 	if (unlikely(ret))
 		return ret;
diff --git a/drivers/nvme/host/trace-tcp.h b/drivers/nvme/host/trace-tcp.h
new file mode 100644
index 000000000000..0fc8f4c51ee2
--- /dev/null
+++ b/drivers/nvme/host/trace-tcp.h
@@ -0,0 +1,154 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM nvme_tcp
+
+#if !defined(_TRACE_NVME_TCP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_NVME_TCP_H
+
+#include <linux/nvme.h>
+#include <linux/tracepoint.h>
+#include <linux/trace_seq.h>
+
+#include "nvme.h"
+
+TRACE_EVENT(nvme_tcp_queue_rq,
+	TP_PROTO(struct request *rq),
+	TP_ARGS(rq),
+	TP_STRUCT__entry(
+		__field(unsigned, ctrl_id)
+		__field(unsigned, qid)
+		__field(unsigned, op)
+		__field(unsigned, tag)
+		__field(unsigned, len)
+	),
+	TP_fast_assign(
+		__entry->ctrl_id = nvme_req(rq)->ctrl->instance;
+		__entry->qid = nvme_req_qid(rq);
+		__entry->op = req_op(rq);
+		__entry->tag = rq->tag;
+		__entry->len = blk_rq_payload_bytes(rq);
+	),
+	TP_printk("nvme%u: qid=%u tag=%x op=%x data_len=%u",
+		__entry->ctrl_id, __entry->qid, __entry->tag, __entry->op,
+		__entry->len
+	)
+);
+
+TRACE_EVENT(nvme_tcp_cmd_pdu,
+	TP_PROTO(struct request *rq, struct nvme_tcp_request *req, unsigned offset, unsigned len),
+	TP_ARGS(rq, req, offset, len),
+	TP_STRUCT__entry(
+		__field(unsigned, ctrl_id)
+		__field(unsigned, qid)
+		__field(unsigned, op)
+		__field(unsigned, tag)
+		__field(unsigned, page_offset)
+		__field(unsigned, len)
+	),
+	TP_fast_assign(
+		__entry->ctrl_id = nvme_req(rq)->ctrl->instance;
+		__entry->qid = nvme_req_qid(rq);
+		__entry->op = req_op(rq);
+		__entry->tag = rq->tag;
+		__entry->page_offset = offset;
+		__entry->len = len;
+	),
+	TP_printk("nvme%u: qid=%u tag=%x op=%x page_offset=%u send_len=%u",
+		__entry->ctrl_id, __entry->qid, __entry->tag, __entry->op,
+		__entry->page_offset, __entry->len
+	)
+);
+
+TRACE_EVENT(nvme_tcp_h2c_pdu,
+	TP_PROTO(struct request *rq, struct nvme_tcp_request *req, unsigned offset, unsigned len),
+	TP_ARGS(rq, req, offset, len),
+	TP_STRUCT__entry(
+		__field(unsigned, ctrl_id)
+		__field(unsigned, qid)
+		__field(unsigned, op)
+		__field(unsigned, tag)
+		__field(unsigned, page_offset)
+		__field(unsigned, len)
+	),
+	TP_fast_assign(
+		__entry->ctrl_id = nvme_req(rq)->ctrl->instance;
+		__entry->qid = nvme_req_qid(rq);
+		__entry->op = req_op(rq);
+		__entry->tag = rq->tag;
+		__entry->page_offset = offset;
+		__entry->len = len;
+	),
+	TP_printk("nvme%u: qid=%u tag=%x op=%x page_offset=%u send_len=%u",
+		__entry->ctrl_id, __entry->qid, __entry->tag, __entry->op,
+		__entry->page_offset, __entry->len
+	)
+);
+
+TRACE_EVENT(nvme_tcp_data,
+	TP_PROTO(struct request *rq, struct nvme_tcp_request *req, unsigned offset, unsigned len, bool last),
+	TP_ARGS(rq, req, offset, len, last),
+	TP_STRUCT__entry(
+		__field(unsigned, ctrl_id)
+		__field(unsigned, qid)
+		__field(unsigned, op)
+		__field(unsigned, tag)
+		__field(unsigned, page_offset)
+		__field(unsigned, len)
+		__field(unsigned, sent)
+		__field(bool, last)
+	),
+	TP_fast_assign(
+		__entry->ctrl_id = nvme_req(rq)->ctrl->instance;
+		__entry->qid = nvme_req_qid(rq);
+		__entry->op = req_op(rq);
+		__entry->tag = rq->tag;
+		__entry->page_offset = offset;
+		__entry->len = len;
+		__entry->sent = req->pdu_sent;
+		__entry->last = last;
+	),
+	TP_printk("nvme%u: qid=%u tag=%x op=%x page_offset=%u send_len=%u pdu_sent=%u last=%u",
+		__entry->ctrl_id, __entry->qid, __entry->tag, __entry->op,
+		__entry->page_offset, __entry->len, __entry->sent, __entry->last
+	)
+);
+
+TRACE_EVENT(nvme_tcp_r2t,
+	TP_PROTO(struct request *rq, struct nvme_tcp_request *req, unsigned offset),
+	TP_ARGS(rq, req, offset),
+	TP_STRUCT__entry(
+		__field(unsigned, ctrl_id)
+		__field(unsigned, qid)
+		__field(unsigned, op)
+		__field(unsigned, tag)
+		__field(unsigned, len)
+		__field(unsigned, offset)
+		__field(unsigned, sent)
+		__field(unsigned, data_len)
+	),
+	TP_fast_assign(
+		__entry->ctrl_id = nvme_req(rq)->ctrl->instance;
+		__entry->qid = nvme_req_qid(rq);
+		__entry->op = req_op(rq);
+		__entry->tag = rq->tag;
+		__entry->len = req->pdu_len;
+		__entry->offset = offset;
+		__entry->sent = req->data_sent;
+		__entry->data_len = req->data_len;
+	),
+	TP_printk("nvme%u: qid=%u tag=%x op=%x r2t_len=%u r2t_offset=%u data_sent=%u data_len=%u",
+		__entry->ctrl_id, __entry->qid, __entry->tag, __entry->op,
+		__entry->len, __entry->offset, __entry->sent, __entry->data_len
+	)
+);
+
+#endif /* _TRACE_NVME_TCP_H */
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH .
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE trace-tcp
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
-- 

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: nvme tcp receive errors
  2021-04-27 23:39               ` Keith Busch
@ 2021-04-27 23:55                 ` Sagi Grimberg
  2021-04-28 15:58                   ` Keith Busch
  0 siblings, 1 reply; 50+ messages in thread
From: Sagi Grimberg @ 2021-04-27 23:55 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, hch


>>>>>>> This was observed on the recent 5.12-rc4, so it has all the latest tcp
>>>>>>> fixes. I'll check with reverting 0dc9edaf80ea and see if that makes a
>>>>>>> difference. It is currently reproducible, though it can take over an
>>>>>>> hour right now.
>>>>>>
>>>>>> After reverting 0dc9edaf80ea, we are observing a kernel panic (below).
>>>>>
>>>>> Ah, that's probably because WRITE_ZEROS are not set with RQF_SPECIAL..
>>>>> This patch is actually needed.
>>>>>
>>>>>
>>>>>> We'll try adding it back, plust adding your debug patch.
>>>>>
>>>>> Yes, that would give us more info about what is the state the
>>>>> request is in when getting these errors
>>>>
>>>> We have recreated with your debug patch:
>>>>
>>>>     nvme nvme4: queue 6 no space in request 0x1 no space cmd_state 3
>>>>
>>>> State 3 corresponds to the "NVME_TCP_CMD_DATA_DONE".
>>>>
>>>> The summary from the test that I received:
>>>>
>>>>     We have an Ethernet trace for this failure. I filtered the trace for the
>>>>     connection that maps to "queue 6 of nvme4" and tracked the state of the IO
>>>>     command with Command ID 0x1 ("Tag 0x1"). The sequence for this command per
>>>>     the Ethernet trace is:
>>>>
>>>>      1. The target receives this Command in an Ethernet frame that has  9 Command
>>>>         capsules and a partial H2CDATA PDU. The Command with ID 0x1 is a Read
>>>>         operation for 16K IO size
>>>>      2. The target sends 11 frames of C2HDATA PDU's each with 1416 bytes and one
>>>>         C2HDATA PDU with 832 bytes to complete the 16K transfer. LAS flag is set
>>>>         in the last PDU.
>>>>      3. The target sends a Response for this Command.
>>>>      4. About 1.3 ms later, the Host logs this msg and closes the connection.
>>>>
>>>> Please let us know if you need any additional information.
>>>
>>> I'm not sure if this is just a different symptom of the same problem,
>>> but with the debug patch, we're occasionally hitting messages like:
>>>
>>>     nvme nvme5: req 8 r2t len 16384 exceeded data len 16384 (8192 sent) cmd_state 2
>>
>> According to this message, this means the host got an r2t for 16384
>> bytes after it already sent 8192 (which can only happen if it previously
>> got an r2t soliciting 8192 bytes or more that accumulate to that). Can
>> you share for each r2t pdus in this sequence:
>> r2t_length
>> r2t_offset
> 
> This one took a bit to go through. The traces all only have a single r2t
> pdu with 0 offset for the full length of the requested transfer. I had
> to add trace events to see what the heck the driver is thinking, and
> the result is even more confusing.
> 
> The kernel message error:
> 
>      nvme5: req 5 op 1 r2t len 16384 exceeded data len 16384 (4096 sent)
> 
> And all the new trace events for this request are:
> 
>      fio-25086   [011] ....  9630.542669: nvme_tcp_queue_rq: nvme5: qid=4 tag=5 op=1 data_len=16384
>      fio-25093   [007] ....  9630.542854: nvme_tcp_cmd_pdu: nvme5: qid=4 tag=5 op=1 page_offset=3664 send_len=72
>      <...>-22670 [003] ....  9630.544377: nvme_tcp_r2t: nvme5: qid=4 tag=5 op=1 r2t_len=16384 r2t_offset=0 data_sent=4096 data_len=16384
> 
> The fact "data_sent" is non-zero on the very first r2t makes no sense to me.

Yep, not supposed to happen... I like the traces though! Very useful
and absolutely worth having.

> I so far can not find any sequence where that could happen.

The only way that data_sent can increment is either by getting an r2t
solicitation or by sending in-capsule data.

What is the ioccsz the controller is exposing?
in nvme_tcp_try_send_cmd_pdu we have:
--
                 if (inline_data) {
                         req->state = NVME_TCP_SEND_DATA;
                         if (queue->data_digest)
                                 crypto_ahash_init(queue->snd_hash);
                 } else {
                         nvme_tcp_done_send_req(queue);
                 }
--

Where inline_data flag is:
	bool inline_data = nvme_tcp_has_inline_data(req);
which essentially boils down to:
	req->data_len <= queue->cmnd_capsule_len - sizeof(struct nvme_command);
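
Or, spelled out (a rough sketch from memory, the exact guards in the
driver may differ slightly):
--
/* sketch only, not a verbatim copy of the driver */
static inline size_t nvme_tcp_inline_data_size(struct nvme_tcp_queue *queue)
{
	/* capsule room left for data after the SQE itself */
	return queue->cmnd_capsule_len - sizeof(struct nvme_command);
}

static inline bool nvme_tcp_has_inline_data(struct nvme_tcp_request *req)
{
	struct request *rq = blk_mq_rq_from_pdu(req);

	/* only a write whose payload fits in the capsule goes in-capsule */
	return rq_data_dir(rq) == WRITE && req->data_len &&
		req->data_len <= nvme_tcp_inline_data_size(req->queue);
}
--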

I wonder what the nvme command sgl looks like? Is it an offset
sgl?

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: nvme tcp receive errors
  2021-04-27 18:12                     ` Keith Busch
@ 2021-04-27 23:58                       ` Sagi Grimberg
  2021-04-30 23:42                         ` Sagi Grimberg
  0 siblings, 1 reply; 50+ messages in thread
From: Sagi Grimberg @ 2021-04-27 23:58 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, hch


>> What should I put on the Reported-by: and Tested-by: tags?
> 
> This report and testing was done courtesy of
> 
>    Narayan Ayalasomayajula <Narayan.Ayalasomayajula@wdc.com>
> 
> Before you submit a patch, though, we did additional testing with data
> digest enabled and observe a regression with the following error:
> 
>    nvme nvme0: queue 0: data digest flag is cleared
> 
>  From looking at the patch, the following part looks a bit suspicious:
> 
>> @@ -776,19 +776,20 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>> 		req->data_recvd += recv_len;
>> 	}
>>
>> -	if (!queue->data_remaining) {
>> +	if (!queue->data_remaining)
>> +		nvme_tcp_init_recv_ctx(queue);
> 
> The code had previously called nvme_tcp_init_recv_ctx() only if
> queue->data_digest wasn't set, but now it's called all the time. I see
> that calling this function clears ddgst_remaining, so does that explain
> the new errors?
> 
>> +	if (req->data_recvd == req->data_len) {
>>   		if (queue->data_digest) {
>>   			nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);
>>   			queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH;
>>   		} else {
>> -			BUG_ON(req->data_recvd != req->data_len);
>>   			req->cmd_state = NVME_TCP_CMD_DATA_DONE;
>>   			if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
>>   				req->cmd_state = NVME_TCP_CMD_DONE;
>>   				nvme_tcp_end_request(rq, NVME_SC_SUCCESS);
>>   				queue->nr_cqe++;
>>   			}
>> -			nvme_tcp_init_recv_ctx(queue);
>>   		}
>>   	}

You are exactly right, I think this should do the trick:
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 8e55d8bc0c50..eb1feaacd11a 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -55,6 +55,7 @@ struct nvme_tcp_request {
         /* send state */
         size_t                  offset;
         size_t                  data_sent;
+       size_t                  data_received;
         enum nvme_tcp_send_state state;
  };

@@ -751,9 +752,12 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
                 *len -= recv_len;
                 *offset += recv_len;
                 queue->data_remaining -= recv_len;
+               req->data_received += recv_len;
         }

         if (!queue->data_remaining) {
+               if (req->data_received < req->data_len)
+                       return 0;
                 if (queue->data_digest) {
                         nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);
                         queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH;
@@ -2274,6 +2278,7 @@ static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
         req->state = NVME_TCP_SEND_CMD_PDU;
         req->offset = 0;
         req->data_sent = 0;
+       req->data_received = 0;
         req->pdu_len = 0;
         req->pdu_sent = 0;
         req->data_len = blk_rq_nr_phys_segments(rq) ?
--

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: nvme tcp receive errors
  2021-04-27 23:55                 ` Sagi Grimberg
@ 2021-04-28 15:58                   ` Keith Busch
  2021-04-28 17:42                     ` Sagi Grimberg
  0 siblings, 1 reply; 50+ messages in thread
From: Keith Busch @ 2021-04-28 15:58 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, hch

On Tue, Apr 27, 2021 at 04:55:43PM -0700, Sagi Grimberg wrote:
> 
> > > > > > > > This was observed on the recent 5.12-rc4, so it has all the latest tcp
> > > > > > > > fixes. I'll check with reverting 0dc9edaf80ea and see if that makes a
> > > > > > > > difference. It is currently reproducible, though it can take over an
> > > > > > > > hour right now.
> > > > > > > 
> > > > > > > After reverting 0dc9edaf80ea, we are observing a kernel panic (below).
> > > > > > 
> > > > > > Ah, that's probably because WRITE_ZEROS are not set with RQF_SPECIAL..
> > > > > > This patch is actually needed.
> > > > > > 
> > > > > > 
> > > > > > > We'll try adding it back, plust adding your debug patch.
> > > > > > 
> > > > > > Yes, that would give us more info about what is the state the
> > > > > > request is in when getting these errors
> > > > > 
> > > > > We have recreated with your debug patch:
> > > > > 
> > > > >     nvme nvme4: queue 6 no space in request 0x1 no space cmd_state 3
> > > > > 
> > > > > State 3 corresponds to the "NVME_TCP_CMD_DATA_DONE".
> > > > > 
> > > > > The summary from the test that I received:
> > > > > 
> > > > >     We have an Ethernet trace for this failure. I filtered the trace for the
> > > > >     connection that maps to "queue 6 of nvme4" and tracked the state of the IO
> > > > >     command with Command ID 0x1 ("Tag 0x1"). The sequence for this command per
> > > > >     the Ethernet trace is:
> > > > > 
> > > > >      1. The target receives this Command in an Ethernet frame that has  9 Command
> > > > >         capsules and a partial H2CDATA PDU. The Command with ID 0x1 is a Read
> > > > >         operation for 16K IO size
> > > > >      2. The target sends 11 frames of C2HDATA PDU's each with 1416 bytes and one
> > > > >         C2HDATA PDU with 832 bytes to complete the 16K transfer. LAS flag is set
> > > > >         in the last PDU.
> > > > >      3. The target sends a Response for this Command.
> > > > >      4. About 1.3 ms later, the Host logs this msg and closes the connection.
> > > > > 
> > > > > Please let us know if you need any additional information.
> > > > 
> > > > I'm not sure if this is just a different symptom of the same problem,
> > > > but with the debug patch, we're occasionally hitting messages like:
> > > > 
> > > >     nvme nvme5: req 8 r2t len 16384 exceeded data len 16384 (8192 sent) cmd_state 2
> > > 
> > > According to this message, this means the host got an r2t for 16384
> > > bytes after it already sent 8192 (which can only happen if it previously
> > > got an r2t soliciting 8192 bytes or more that accumulate to that). Can
> > > you share for each r2t pdus in this sequence:
> > > r2t_length
> > > r2t_offset
> > 
> > This one took a bit to go through. The traces all only have a single r2t
> > pdu with 0 offset for the full length of the requested transfer. I had
> > to add trace events to see what the heck the driver is thinking, and
> > the result is even more confusing.
> > 
> > The kernel message error:
> > 
> >      nvme5: req 5 op 1 r2t len 16384 exceeded data len 16384 (4096 sent)
> > 
> > And all the new trace events for this request are:
> > 
> >      fio-25086   [011] ....  9630.542669: nvme_tcp_queue_rq: nvme5: qid=4 tag=5 op=1 data_len=16384
> >      fio-25093   [007] ....  9630.542854: nvme_tcp_cmd_pdu: nvme5: qid=4 tag=5 op=1 page_offset=3664 send_len=72
> >      <...>-22670 [003] ....  9630.544377: nvme_tcp_r2t: nvme5: qid=4 tag=5 op=1 r2t_len=16384 r2t_offset=0 data_sent=4096 data_len=16384
> > 
> > The fact "data_sent" is non-zero on the very first r2t makes no sense to me.
> 
> Yep, not supposed to happen... Like the traces though! very useful and
> absolutely worth having.
> 
> > I so far can not find any sequence where that could happen.
> 
> The only way that data_sent can increment is either by getting an r2t
> solicitation or by sending incapsule data.
> 
> What is the ioccsz the controller is exposing?

In-capsule data is not supported by this target, so ioccsz is 4.

> in nvme_tcp_try_send_cmd_pdu we have:
> --
>                 if (inline_data) {
>                         req->state = NVME_TCP_SEND_DATA;
>                         if (queue->data_digest)
>                                 crypto_ahash_init(queue->snd_hash);
>                 } else {
>                         nvme_tcp_done_send_req(queue);
>                 }
> --
> 
> Where inline_data flag is:
> 	bool inline_data = nvme_tcp_has_inline_data(req);
> which essentially boils down to:
> 	req->data_len <= queue->cmnd_capsule_len - sizeof(struct nvme_command);
> 
> I wonder how does the nvme command sgl look like? is it an offset
> sgl?

Just a single data block descriptor SGL. The target supports only 1 and
reports that through ID_CTRL.MSDBD.

What do you mean by "offset" SGL?

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: nvme tcp receive errors
  2021-04-28 15:58                   ` Keith Busch
@ 2021-04-28 17:42                     ` Sagi Grimberg
  2021-04-28 18:01                       ` Keith Busch
  0 siblings, 1 reply; 50+ messages in thread
From: Sagi Grimberg @ 2021-04-28 17:42 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, hch


>>> This one took a bit to go through. The traces all only have a single r2t
>>> pdu with 0 offset for the full length of the requested transfer. I had
>>> to add trace events to see what the heck the driver is thinking, and
>>> the result is even more confusing.
>>>
>>> The kernel message error:
>>>
>>>       nvme5: req 5 op 1 r2t len 16384 exceeded data len 16384 (4096 sent)
>>>
>>> And all the new trace events for this request are:
>>>
>>>       fio-25086   [011] ....  9630.542669: nvme_tcp_queue_rq: nvme5: qid=4 tag=5 op=1 data_len=16384
>>>       fio-25093   [007] ....  9630.542854: nvme_tcp_cmd_pdu: nvme5: qid=4 tag=5 op=1 page_offset=3664 send_len=72
>>>       <...>-22670 [003] ....  9630.544377: nvme_tcp_r2t: nvme5: qid=4 tag=5 op=1 r2t_len=16384 r2t_offset=0 data_sent=4096 data_len=16384
>>>
>>> The fact "data_sent" is non-zero on the very first r2t makes no sense to me.
>>
>> Yep, not supposed to happen... Like the traces though! very useful and
>> absolutely worth having.
>>
>>> I so far can not find any sequence where that could happen.
>>
>> The only way that data_sent can increment is either by getting an r2t
>> solicitation or by sending incapsule data.
>>
>> What is the ioccsz the controller is exposing?
> 
> In capsule data is not supported by this target, so ioccsz is 4.
> 
>> in nvme_tcp_try_send_cmd_pdu we have:
>> --
>>                  if (inline_data) {
>>                          req->state = NVME_TCP_SEND_DATA;
>>                          if (queue->data_digest)
>>                                  crypto_ahash_init(queue->snd_hash);
>>                  } else {
>>                          nvme_tcp_done_send_req(queue);
>>                  }
>> --
>>
>> Where inline_data flag is:
>> 	bool inline_data = nvme_tcp_has_inline_data(req);
>> which essentially boils down to:
>> 	req->data_len <= queue->cmnd_capsule_len - sizeof(struct nvme_command);
>>
>> I wonder how does the nvme command sgl look like? is it an offset
>> sgl?
> 
> Just a single data block descriptor SGL. The target supports only 1 and
> reports that through ID_CTRL.MSDBD.
> 
> What do you mean by "offset" SGL?

In tcp.c:
--
static void nvme_tcp_set_sg_inline(struct nvme_tcp_queue *queue,
                 struct nvme_command *c, u32 data_len)
{
         struct nvme_sgl_desc *sg = &c->common.dptr.sgl;

         sg->addr = cpu_to_le64(queue->ctrl->ctrl.icdoff);
         sg->length = cpu_to_le32(data_len);
         sg->type = (NVME_SGL_FMT_DATA_DESC << 4) | NVME_SGL_FMT_OFFSET;
}

static void nvme_tcp_set_sg_host_data(struct nvme_command *c,
                 u32 data_len)
{
         struct nvme_sgl_desc *sg = &c->common.dptr.sgl;

         sg->addr = 0;
         sg->length = cpu_to_le32(data_len);
         sg->type = (NVME_TRANSPORT_SGL_DATA_DESC << 4) |
                         NVME_SGL_FMT_TRANSPORT_A;
}
--

What is the sgl type you see in the traces? transport specific sgl
(host-data i.e. non-incapsule) or inline?

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: nvme tcp receive errors
  2021-04-28 17:42                     ` Sagi Grimberg
@ 2021-04-28 18:01                       ` Keith Busch
  2021-04-28 23:06                         ` Sagi Grimberg
  0 siblings, 1 reply; 50+ messages in thread
From: Keith Busch @ 2021-04-28 18:01 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, hch

On Wed, Apr 28, 2021 at 10:42:13AM -0700, Sagi Grimberg wrote:
> In tcp.c:
> --
> static void nvme_tcp_set_sg_inline(struct nvme_tcp_queue *queue,
>                 struct nvme_command *c, u32 data_len)
> {
>         struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
> 
>         sg->addr = cpu_to_le64(queue->ctrl->ctrl.icdoff);
>         sg->length = cpu_to_le32(data_len);
>         sg->type = (NVME_SGL_FMT_DATA_DESC << 4) | NVME_SGL_FMT_OFFSET;
> }
> 
> static void nvme_tcp_set_sg_host_data(struct nvme_command *c,
>                 u32 data_len)
> {
>         struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
> 
>         sg->addr = 0;
>         sg->length = cpu_to_le32(data_len);
>         sg->type = (NVME_TRANSPORT_SGL_DATA_DESC << 4) |
>                         NVME_SGL_FMT_TRANSPORT_A;
> }
> --
> 
> What is the sgl type you see in the traces? transport specific sgl
> (host-data i.e. non-incapsule) or inline?

The Sub Type is 0xA, Transport Specific.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: nvme tcp receive errors
  2021-04-28 18:01                       ` Keith Busch
@ 2021-04-28 23:06                         ` Sagi Grimberg
  2021-04-29  3:33                           ` Keith Busch
  0 siblings, 1 reply; 50+ messages in thread
From: Sagi Grimberg @ 2021-04-28 23:06 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, hch


>> In tcp.c:
>> --
>> static void nvme_tcp_set_sg_inline(struct nvme_tcp_queue *queue,
>>                  struct nvme_command *c, u32 data_len)
>> {
>>          struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
>>
>>          sg->addr = cpu_to_le64(queue->ctrl->ctrl.icdoff);
>>          sg->length = cpu_to_le32(data_len);
>>          sg->type = (NVME_SGL_FMT_DATA_DESC << 4) | NVME_SGL_FMT_OFFSET;
>> }
>>
>> static void nvme_tcp_set_sg_host_data(struct nvme_command *c,
>>                  u32 data_len)
>> {
>>          struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
>>
>>          sg->addr = 0;
>>          sg->length = cpu_to_le32(data_len);
>>          sg->type = (NVME_TRANSPORT_SGL_DATA_DESC << 4) |
>>                          NVME_SGL_FMT_TRANSPORT_A;
>> }
>> --
>>
>> What is the sgl type you see in the traces? transport specific sgl
>> (host-data i.e. non-incapsule) or inline?
> 
> The Sub Type is 0xA, Transport Specific.

Interesting, I don't see how the host is going to send data
without it being in-capsule, and before receiving an r2t...

Maybe add this one for a sanity check:
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index eb1feaacd11a..6cb0e13024e5 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -987,6 +987,8 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
         len -= ret;
         if (!len) {
                 if (inline_data) {
+                       pr_err("no way... data_len %d queue_max_inline %ld\n",
+                               req->data_len, nvme_tcp_inline_data_size(req->queue));
                         req->state = NVME_TCP_SEND_DATA;
                         if (queue->data_digest)
                                 crypto_ahash_init(queue->snd_hash);
--

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: nvme tcp receive errors
  2021-04-28 23:06                         ` Sagi Grimberg
@ 2021-04-29  3:33                           ` Keith Busch
  2021-04-29  4:52                             ` Sagi Grimberg
  0 siblings, 1 reply; 50+ messages in thread
From: Keith Busch @ 2021-04-29  3:33 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, hch

On Wed, Apr 28, 2021 at 04:06:12PM -0700, Sagi Grimberg wrote:
> 
> > > In tcp.c:
> > > --
> > > static void nvme_tcp_set_sg_inline(struct nvme_tcp_queue *queue,
> > >                  struct nvme_command *c, u32 data_len)
> > > {
> > >          struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
> > > 
> > >          sg->addr = cpu_to_le64(queue->ctrl->ctrl.icdoff);
> > >          sg->length = cpu_to_le32(data_len);
> > >          sg->type = (NVME_SGL_FMT_DATA_DESC << 4) | NVME_SGL_FMT_OFFSET;
> > > }
> > > 
> > > static void nvme_tcp_set_sg_host_data(struct nvme_command *c,
> > >                  u32 data_len)
> > > {
> > >          struct nvme_sgl_desc *sg = &c->common.dptr.sgl;
> > > 
> > >          sg->addr = 0;
> > >          sg->length = cpu_to_le32(data_len);
> > >          sg->type = (NVME_TRANSPORT_SGL_DATA_DESC << 4) |
> > >                          NVME_SGL_FMT_TRANSPORT_A;
> > > }
> > > --
> > > 
> > > What is the sgl type you see in the traces? transport specific sgl
> > > (host-data i.e. non-incapsule) or inline?
> > 
> > The Sub Type is 0xA, Transport Specific.
> 
> Interesting, I don't see how the host is going to send data
> without it being in-capsule, and before receiving an r2t...

The driver tracepoints captured millions of IO's where everything
happened as expected, so I really think something got confused and
mucked with the wrong request. I've added more trace points to increase
visibility because I frankly didn't find how that could happen just from
code inspection. We will also incorporate your patch below for the next
recreate.
 
> Maybe add this one for a sanity check:
> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index eb1feaacd11a..6cb0e13024e5 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -987,6 +987,8 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
>         len -= ret;
>         if (!len) {
>                 if (inline_data) {
> +                       pr_err("no way... data_len %d queue_max_inline %ld\n",
> +                               req->data_len, nvme_tcp_inline_data_size(req->queue));
>                         req->state = NVME_TCP_SEND_DATA;
>                         if (queue->data_digest)
>                                 crypto_ahash_init(queue->snd_hash);
> --

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: nvme tcp receive errors
  2021-04-29  3:33                           ` Keith Busch
@ 2021-04-29  4:52                             ` Sagi Grimberg
  2021-05-03 18:51                               ` Keith Busch
  0 siblings, 1 reply; 50+ messages in thread
From: Sagi Grimberg @ 2021-04-29  4:52 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, hch


> The driver tracepoints captured millions of IO's where everything
> happened as expected, so I really think something got confused and
> mucked with the wrong request. I've added more trace points to increase
> visibility because I frankly didn't find how that could happen just from
> code inspection. We will also incorporate your patch below for the next
> recreate.

Keith, does the issue still happen with eliminating the network send
from .queue_rq() ?

--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index eb1feaacd11a..b3fafa536345 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -288,7 +288,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
          * directly, otherwise queue io_work. Also, only do that if we
          * are on the same cpu, so we don't introduce contention.
          */
-       if (queue->io_cpu == __smp_processor_id() &&
+       if (0 && queue->io_cpu == __smp_processor_id() &&
             sync && empty && mutex_trylock(&queue->send_mutex)) {
                 queue->more_requests = !last;
                 nvme_tcp_send_all(queue);
--

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: nvme tcp receive errors
  2021-04-27 23:58                       ` Sagi Grimberg
@ 2021-04-30 23:42                         ` Sagi Grimberg
  2021-05-03 14:28                           ` Keith Busch
  0 siblings, 1 reply; 50+ messages in thread
From: Sagi Grimberg @ 2021-04-30 23:42 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, hch


> You are exactly right, I think this should do the trick:

Hey Keith,

Did this resolve the issues?

> -- 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 8e55d8bc0c50..eb1feaacd11a 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -55,6 +55,7 @@ struct nvme_tcp_request {
>          /* send state */
>          size_t                  offset;
>          size_t                  data_sent;
> +       size_t                  data_received;
>          enum nvme_tcp_send_state state;
>   };
> 
> @@ -751,9 +752,12 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>                  *len -= recv_len;
>                  *offset += recv_len;
>                  queue->data_remaining -= recv_len;
> +               req->data_received += recv_len;
>          }
> 
>          if (!queue->data_remaining) {
> +               if (req->data_received < req->data_len)
> +                       return 0;
>                  if (queue->data_digest) {
>                          nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);
>                          queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH;
> @@ -2274,6 +2278,7 @@ static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
>          req->state = NVME_TCP_SEND_CMD_PDU;
>          req->offset = 0;
>          req->data_sent = 0;
> +       req->data_received = 0;
>          req->pdu_len = 0;
>          req->pdu_sent = 0;
>          req->data_len = blk_rq_nr_phys_segments(rq) ?
> -- 

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: nvme tcp receive errors
  2021-04-30 23:42                         ` Sagi Grimberg
@ 2021-05-03 14:28                           ` Keith Busch
  2021-05-03 19:36                             ` Sagi Grimberg
  0 siblings, 1 reply; 50+ messages in thread
From: Keith Busch @ 2021-05-03 14:28 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, hch

On Fri, Apr 30, 2021 at 04:42:14PM -0700, Sagi Grimberg wrote:
> 
> > You are exactly right, I think this should do the trick:
> 
> Hey Keith,
> 
> Did this resolve the issues?

We're unfortunately still observing data digest issues even with this.
Most of the testing has shifted to the r2t error, so I don't have any
additional details on the data digest problem.

On the r2t issue, I'm just waiting for the weekend test results to see
what happened with running your most recent test suggestion.
 
> > -- 
> > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> > index 8e55d8bc0c50..eb1feaacd11a 100644
> > --- a/drivers/nvme/host/tcp.c
> > +++ b/drivers/nvme/host/tcp.c
> > @@ -55,6 +55,7 @@ struct nvme_tcp_request {
> >          /* send state */
> >          size_t                  offset;
> >          size_t                  data_sent;
> > +       size_t                  data_received;
> >          enum nvme_tcp_send_state state;
> >   };
> > 
> > @@ -751,9 +752,12 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
> >                  *len -= recv_len;
> >                  *offset += recv_len;
> >                  queue->data_remaining -= recv_len;
> > +               req->data_received += recv_len;
> >          }
> > 
> >          if (!queue->data_remaining) {
> > +               if (req->data_received < req->data_len)
> > +                       return 0;
> >                  if (queue->data_digest) {
> >                          nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);
> >                          queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH;
> > @@ -2274,6 +2278,7 @@ static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
> >          req->state = NVME_TCP_SEND_CMD_PDU;
> >          req->offset = 0;
> >          req->data_sent = 0;
> > +       req->data_received = 0;
> >          req->pdu_len = 0;
> >          req->pdu_sent = 0;
> >          req->data_len = blk_rq_nr_phys_segments(rq) ?
> > -- 

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: nvme tcp receive errors
  2021-04-29  4:52                             ` Sagi Grimberg
@ 2021-05-03 18:51                               ` Keith Busch
  2021-05-03 19:58                                 ` Sagi Grimberg
  0 siblings, 1 reply; 50+ messages in thread
From: Keith Busch @ 2021-05-03 18:51 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, hch

On Wed, Apr 28, 2021 at 09:52:37PM -0700, Sagi Grimberg wrote:
> 
> > The driver tracepoints captured millions of IO's where everything
> > happened as expected, so I really think something got confused and
> > mucked with the wrong request. I've added more trace points to increase
> > visibility because I frankly didn't find how that could happen just from
> > code inspection. We will also incorporate your patch below for the next
> > recreate.
> 
> Keith, does the issue still happen with eliminating the network send
> from .queue_rq() ?

This patch resolved the observed r2t issues over the weekend test run,
which ran much longer than it previously could. I'm happy we're
narrowing this down, but I'm not seeing how this addresses the problem.
It looks like the mutex single-threads the critical parts, but maybe
I'm missing something. Any ideas?
 
> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index eb1feaacd11a..b3fafa536345 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -288,7 +288,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>          * directly, otherwise queue io_work. Also, only do that if we
>          * are on the same cpu, so we don't introduce contention.
>          */
> -       if (queue->io_cpu == __smp_processor_id() &&
> +       if (0 && queue->io_cpu == __smp_processor_id() &&
>             sync && empty && mutex_trylock(&queue->send_mutex)) {
>                 queue->more_requests = !last;
>                 nvme_tcp_send_all(queue);
> --

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: nvme tcp receive errors
  2021-05-03 14:28                           ` Keith Busch
@ 2021-05-03 19:36                             ` Sagi Grimberg
  2021-05-03 19:38                               ` Sagi Grimberg
  0 siblings, 1 reply; 50+ messages in thread
From: Sagi Grimberg @ 2021-05-03 19:36 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, hch


>> Hey Keith,
>>
>> Did this resolve the issues?
> 
> We're unfortunately still observing data digest issues even with this.
> Most of the testing has shifted to the r2t error, so I don't have any
> additional details on the data digest problem.

I've looked again at the code, and I'm not convinced that the patch
is needed at all anymore; I'm now surprised that it actually changed
anything (disregarding data digest).

The driver does not track the received bytes by definition; it relies
on the controller to either send it a completion or set the success
flag in the _last_ c2hdata pdu. Does your target set
NVME_TCP_F_DATA_SUCCESS on any of the c2hdata pdus?

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: nvme tcp receive errors
  2021-05-03 19:36                             ` Sagi Grimberg
@ 2021-05-03 19:38                               ` Sagi Grimberg
  2021-05-03 19:44                                 ` Keith Busch
  0 siblings, 1 reply; 50+ messages in thread
From: Sagi Grimberg @ 2021-05-03 19:38 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, hch


>>> Hey Keith,
>>>
>>> Did this resolve the issues?
>>
>> We're unfortunately still observing data digest issues even with this.
>> Most of the testing has shifted to the r2t error, so I don't have any
>> additional details on the data digest problem.
> 
> I've looked again at the code, and I'm not convinced that the patch
> is needed at all anymore, I'm now surprised that it actually changed
> anything (disregarding data digest).
> 
> The driver does not track the received bytes by definition, it relies
> on the controller to send it a completion, or set the success flag in
> the _last_ c2hdata pdu. Does your target set
> NVME_TCP_F_DATA_SUCCESS on any of the c2hdata pdus?

Perhaps you can also run this patch instead?
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 6bd5b281c818..a848b5b7f77b 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -55,6 +55,7 @@ struct nvme_tcp_request {
         /* send state */
         size_t                  offset;
         size_t                  data_sent;
+       size_t                  data_received;
         enum nvme_tcp_send_state state;
  };

@@ -751,6 +752,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
                 *len -= recv_len;
                 *offset += recv_len;
                 queue->data_remaining -= recv_len;
+               req->data_received += recv_len;
         }

         if (!queue->data_remaining) {
@@ -759,6 +761,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
                         queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH;
                 } else {
                         if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
+                               BUG_ON(req->data_received != req->data_len);
                                 nvme_tcp_end_request(rq, NVME_SC_SUCCESS);
                                 queue->nr_cqe++;
                         }
@@ -2274,6 +2277,7 @@ static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
         req->state = NVME_TCP_SEND_CMD_PDU;
         req->offset = 0;
         req->data_sent = 0;
+       req->data_received = 0;
         req->pdu_len = 0;
         req->pdu_sent = 0;
         req->data_len = blk_rq_nr_phys_segments(rq) ?
--

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: nvme tcp receive errors
  2021-05-03 19:38                               ` Sagi Grimberg
@ 2021-05-03 19:44                                 ` Keith Busch
  2021-05-03 20:00                                   ` Sagi Grimberg
  0 siblings, 1 reply; 50+ messages in thread
From: Keith Busch @ 2021-05-03 19:44 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, hch

On Mon, May 03, 2021 at 12:38:39PM -0700, Sagi Grimberg wrote:
> 
> > > > Hey Keith,
> > > > 
> > > > Did this resolve the issues?
> > > 
> > > We're unfortunately still observing data digest issues even with this.
> > > Most of the testing has shifted to the r2t error, so I don't have any
> > > additional details on the data digest problem.
> > 
> > I've looked again at the code, and I'm not convinced that the patch
> > is needed at all anymore, I'm now surprised that it actually changed
> > anything (disregarding data digest).
> > 
> > The driver does not track the received bytes by definition, it relies
> > on the controller to send it a completion, or set the success flag in
> > the _last_ c2hdata pdu. Does your target set
> > NVME_TCP_F_DATA_SUCCESS on any of the c2hdata pdus?
> 
> Perhaps you can also run this patch instead?

Thanks, will give this a shot.

> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 6bd5b281c818..a848b5b7f77b 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -55,6 +55,7 @@ struct nvme_tcp_request {
>         /* send state */
>         size_t                  offset;
>         size_t                  data_sent;
> +       size_t                  data_received;
>         enum nvme_tcp_send_state state;
>  };
> 
> @@ -751,6 +752,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>                 *len -= recv_len;
>                 *offset += recv_len;
>                 queue->data_remaining -= recv_len;
> +               req->data_received += recv_len;
>         }
> 
>         if (!queue->data_remaining) {
> @@ -759,6 +761,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>                         queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH;
>                 } else {
>                         if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
> +                               BUG_ON(req->data_received != req->data_len);
>                                 nvme_tcp_end_request(rq, NVME_SC_SUCCESS);
>                                 queue->nr_cqe++;
>                         }
> @@ -2274,6 +2277,7 @@ static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
>         req->state = NVME_TCP_SEND_CMD_PDU;
>         req->offset = 0;
>         req->data_sent = 0;
> +       req->data_received = 0;
>         req->pdu_len = 0;
>         req->pdu_sent = 0;
>         req->data_len = blk_rq_nr_phys_segments(rq) ?
> --

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: nvme tcp receive errors
  2021-05-03 18:51                               ` Keith Busch
@ 2021-05-03 19:58                                 ` Sagi Grimberg
  2021-05-03 20:25                                   ` Keith Busch
  0 siblings, 1 reply; 50+ messages in thread
From: Sagi Grimberg @ 2021-05-03 19:58 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, hch


>>> The driver tracepoints captured millions of IO's where everything
>>> happened as expected, so I really think something got confused and
>>> mucked with the wrong request. I've added more trace points to increase
>>> visibility because I frankly didn't find how that could happen just from
>>> code inspection. We will also incorporate your patch below for the next
>>> recreate.
>>
>> Keith, does the issue still happen with eliminating the network send
>> from .queue_rq() ?
> 
> This patch is successful at resolving the observed r2t issues after the
> weekend test run, which is much longer than it could have run
> previously. I'm happy we're narrowing this down, but I'm not seeing how
> this addresses the problem. It looks like the mutex single threads the
> critical parts, but maybe I'm missing something. Any ideas?

Not yet, but note that while the send path is mutually exclusive, the
receive context is where we handle the r2t, validate length/offset,
and (re)queue the request to send an h2cdata pdu back to the
controller.

The network send was an optimization for latency, and I then modified
the queueing in the driver so that a request first goes onto an llist
and the sending context (either io_work or .queue_rq) later reaps it
into a local send_list. This gives the driver a better picture of what
is inflight so it can set the network msg flags (EOR/MORE) more
accurately.
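
Roughly, the scheme looks like this (sketch only, the helper names and
struct fields here are illustrative rather than the exact driver code):
--
static void sketch_queue_request(struct nvme_tcp_queue *queue,
		struct nvme_tcp_request *req)
{
	/* lock-free producer side, safe from any .queue_rq() context */
	llist_add(&req->lentry, &queue->req_list);
}

static void sketch_reap_requests(struct nvme_tcp_queue *queue)
{
	/* single consumer, runs with send_mutex held */
	struct llist_node *node = llist_del_all(&queue->req_list);
	struct nvme_tcp_request *req, *next;

	/* llist pops LIFO, so restore submission order first */
	node = llist_reverse_order(node);
	llist_for_each_entry_safe(req, next, node, lentry)
		list_add_tail(&req->entry, &queue->send_list);
}
--
The point is that an empty llist/send_list tells the sender it is
transmitting the last queued request, so it can finish with EOR
instead of MORE.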

My assumption is that maybe somehow we send the initial command
pdu to the controller from queue_rq, receive the r2t back before the
.queue_rq context has completed, and something may not be coherent.
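
Something like this interleaving (illustration of the suspicion only,
using the direct-send path from nvme_tcp_queue_request shown earlier
in this thread):
--
/*
 *  .queue_rq() context                     receive context (io_work)
 *  -------------------                     -------------------------
 *  nvme_tcp_setup_cmd_pdu()
 *  nvme_tcp_queue_request()
 *    mutex_trylock(&queue->send_mutex)
 *    nvme_tcp_send_all()   <-- cmd PDU hits the wire
 *                                          r2t for this request arrives
 *                                          r2t handling reads/updates
 *                                          the same nvme_tcp_request
 *    mutex_unlock(&queue->send_mutex)
 *  .queue_rq() returns
 */
--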

Side question: are you running with a fully preemptible kernel? Or
fewer NVMe queues than CPUs?

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: nvme tcp receive errors
  2021-05-03 19:44                                 ` Keith Busch
@ 2021-05-03 20:00                                   ` Sagi Grimberg
  2021-05-04 14:36                                     ` Keith Busch
  0 siblings, 1 reply; 50+ messages in thread
From: Sagi Grimberg @ 2021-05-03 20:00 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, hch

>>>>> Hey Keith,
>>>>>
>>>>> Did this resolve the issues?
>>>>
>>>> We're unfortunately still observing data digest issues even with this.
>>>> Most of the testing has shifted to the r2t error, so I don't have any
>>>> additional details on the data digest problem.
>>>
>>> I've looked again at the code, and I'm not convinced that the patch
>>> is needed at all anymore, I'm now surprised that it actually changed
>>> anything (disregarding data digest).
>>>
>>> The driver does not track the received bytes by definition, it relies
>>> on the controller to send it a completion, or set the success flag in
>>> the _last_ c2hdata pdu. Does your target set
>>> NVME_TCP_F_DATA_SUCCESS on any of the c2hdata pdus?
>>
>> Perhaps you can also run this patch instead?
> 
> Thanks, will give this a shot.

It would still be beneficial to look at the traces and check whether
the success flag happens to be set. If this flag is set, the
driver _will_ complete the request without checking the bytes
received thus far (similar to how pci and rdma don't and can't
check the dma byte count).

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: nvme tcp receive errors
  2021-05-03 19:58                                 ` Sagi Grimberg
@ 2021-05-03 20:25                                   ` Keith Busch
  2021-05-04 19:29                                     ` Sagi Grimberg
  0 siblings, 1 reply; 50+ messages in thread
From: Keith Busch @ 2021-05-03 20:25 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, hch

On Mon, May 03, 2021 at 12:58:23PM -0700, Sagi Grimberg wrote:
> 
> > > > The driver tracepoints captured millions of IO's where everything
> > > > happened as expected, so I really think something got confused and
> > > > mucked with the wrong request. I've added more trace points to increase
> > > > visibility because I frankly didn't find how that could happen just from
> > > > code inspection. We will also incorporate your patch below for the next
> > > > recreate.
> > > 
> > > Keith, does the issue still happen with eliminating the network send
> > > from .queue_rq() ?
> > 
> > This patch is successful at resolving the observed r2t issues after the
> > weekend test run, which is much longer than it could have run
> > previously. I'm happy we're narrowing this down, but I'm not seeing how
> > this addresses the problem. It looks like the mutex single threads the
> > critical parts, but maybe I'm missing something. Any ideas?
> 
> Not yet, but note that the send part is mutually exclusive but the
> receive context is where we handle the r2t, validate length/offset
> and (re)queue the request for sending a h2cdata pdu back to the
> controller.
>
> The network send was an optimization for latency, and then I modified
> the queueing in the driver such that a request would first go to llist
> and then the sending context (either io_work or .queue_rq) would reap it
> to a local send_list. This helps the driver get better understanding of
> what is inflight such that it better set network msg flags for EOR/MORE.
> 
> My assumption is that maybe somehow we send the the initial command
> pdu to the controller from queue_rq, receive the r2t back before the
> .queue_rq context has completed and something may not be coherent.

Interesting. The network traces look correct, so my thoughts jumped to
possibly incorrect usage of PCIe relaxed ordering, but that appears to
be disabled. I'll keep looking for other possibilities.

> Side question, are you running with a fully preemptible kernel? or
> less NVMe queues than cpus?

Voluntary preempt. This test is using the kernel config from Ubuntu
20.04.

There are 16 CPUs in this set up with just 7 IO queues.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: nvme tcp receive errors
  2021-05-03 20:00                                   ` Sagi Grimberg
@ 2021-05-04 14:36                                     ` Keith Busch
  2021-05-04 18:15                                       ` Sagi Grimberg
  0 siblings, 1 reply; 50+ messages in thread
From: Keith Busch @ 2021-05-04 14:36 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, hch

On Mon, May 03, 2021 at 01:00:05PM -0700, Sagi Grimberg wrote:
> > > > > > Hey Keith,
> > > > > > 
> > > > > > Did this resolve the issues?
> > > > > 
> > > > > We're unfortunately still observing data digest issues even with this.
> > > > > Most of the testing has shifted to the r2t error, so I don't have any
> > > > > additional details on the data digest problem.
> > > > 
> > > > I've looked again at the code, and I'm not convinced that the patch
> > > > is needed at all anymore, I'm now surprised that it actually changed
> > > > anything (disregarding data digest).
> > > > 
> > > > The driver does not track the received bytes by definition, it relies
> > > > on the controller to send it a completion, or set the success flag in
> > > > the _last_ c2hdata pdu. Does your target set
> > > > NVME_TCP_F_DATA_SUCCESS on any of the c2hdata pdus?
> > > 
> > > Perhaps you can also run this patch instead?
> > 
> > Thanks, will give this a shot.
> 
> Still would be beneficial to look at the traces and check if
> the success flag happens to be set. If this flag is set, the
> driver _will_ complete the request without checking the bytes
> received thus far (similar to how pci and rdma don't and can't
> check dma byte count).

I realized this patch is the same as one you'd sent earlier. We hit the
BUG_ON(), and then proceeded to use your follow-up patch, which appeared
to fix the data receive problem, but introduced data digest problems.

So, are you saying that hitting this BUG_ON means that the driver
observed the completion out of order with respect to the expected data?

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: nvme tcp receive errors
  2021-05-04 14:36                                     ` Keith Busch
@ 2021-05-04 18:15                                       ` Sagi Grimberg
  2021-05-04 19:14                                         ` Keith Busch
  0 siblings, 1 reply; 50+ messages in thread
From: Sagi Grimberg @ 2021-05-04 18:15 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, hch


>>>>>>> Hey Keith,
>>>>>>>
>>>>>>> Did this resolve the issues?
>>>>>>
>>>>>> We're unfortunately still observing data digest issues even with this.
>>>>>> Most of the testing has shifted to the r2t error, so I don't have any
>>>>>> additional details on the data digest problem.
>>>>>
>>>>> I've looked again at the code, and I'm not convinced that the patch
>>>>> is needed at all anymore, I'm now surprised that it actually changed
>>>>> anything (disregarding data digest).
>>>>>
>>>>> The driver does not track the received bytes by definition, it relies
>>>>> on the controller to send it a completion, or set the success flag in
>>>>> the _last_ c2hdata pdu. Does your target set
>>>>> NVME_TCP_F_DATA_SUCCESS on any of the c2hdata pdus?
>>>>
>>>> Perhaps you can also run this patch instead?
>>>
>>> Thanks, will give this a shot.
>>
>> Still would be beneficial to look at the traces and check if
>> the success flag happens to be set. If this flag is set, the
>> driver _will_ complete the request without checking the bytes
>> received thus far (similar to how pci and rdma don't and can't
>> check dma byte count).
> 
> I realized this patch is the same as one you'd sent earlier. We hit the
> BUG_ON(), and then proceeded to use your follow-up patch, which appeared
> to fix the data receive problem, but introduced data digest problems.
> 
> So, are you saying that hitting this BUG_ON means that the driver has
> observed the completion out-of-order from the expected data?

If you hit the BUG_ON it means that the host spotted a c2hdata
PDU that has the success flag set before all the request data
was received:
--
@@ -759,6 +761,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
                         queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH;
                 } else {
                         if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
+                               BUG_ON(req->data_received != req->data_len);
                                 nvme_tcp_end_request(rq, NVME_SC_SUCCESS);
                                 queue->nr_cqe++;
                         }
--

Which means that the host completes the request immediately, relying
on the controller having sent all the required data and knowing that a
completion response capsule will not be sent.

From the spec:
C2HData PDUs contain a LAST_PDU flag that is set to ‘1’ in the last PDU
of a command data transfer and is cleared to ‘0’ in all other C2HData
PDUs associated with the command. C2HData PDUs also contain a SUCCESS
flag that may be set to ‘1’ in the last C2HData PDU of a command data
transfer to indicate that the command has completed successfully. In
this case, no Response Capsule is sent by the controller for the command 
and the host synthesizes a completion queue entry for the command with
the Command Specific field and the Status Field both cleared to 0h. If
the SUCCESS flag is cleared to ‘0’ in the last C2HData PDU of a command,
then the controller shall send a Response Capsule for the command to the
host. The SUCCESS flag shall be cleared to ‘0’ in all C2HData PDUs that
are not the last C2HData PDU for a command. The SUCCESS flag may be set
to ‘1’ in the last C2HData PDU only if the controller supports disabling
submission queue head pointer updates.
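
Condensing that spec language, the rule could be sketched roughly as
follows (illustrative only, not code from any tree; only the flag names
follow the Linux nvme-tcp definitions):
--
/*
 * Illustrative sketch only.  SUCCESS is meaningful solely on the final
 * C2HData PDU of a transfer; when it is set there, the host synthesizes
 * a CQE with status 0 and must not wait for a response capsule.
 */
static bool c2h_success_flag_valid(u8 flags)
{
	if (!(flags & NVME_TCP_F_DATA_SUCCESS))
		return true;	/* nothing special to check */
	/* SUCCESS on a PDU that is not the last one is a protocol violation */
	return flags & NVME_TCP_F_DATA_LAST;
}
--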

Hence my question, does the controller set NVME_TCP_F_DATA_SUCCESS in
any of the c2hdata PDUs which is not the last one? does it set it in the
last one and omitting the cqe response capsule as expected by the host?

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: nvme tcp receive errors
  2021-05-04 18:15                                       ` Sagi Grimberg
@ 2021-05-04 19:14                                         ` Keith Busch
  2021-05-10 18:06                                           ` Keith Busch
  0 siblings, 1 reply; 50+ messages in thread
From: Keith Busch @ 2021-05-04 19:14 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, hch

On Tue, May 04, 2021 at 11:15:28AM -0700, Sagi Grimberg wrote:
> > > > > > I've looked again at the code, and I'm not convinced that the patch
> > > > > > is needed at all anymore, I'm now surprised that it actually changed
> > > > > > anything (disregarding data digest).
> > > > > > 
> > > > > > The driver does not track the received bytes by definition, it relies
> > > > > > on the controller to send it a completion, or set the success flag in
> > > > > > the _last_ c2hdata pdu. Does your target set
> > > > > > NVME_TCP_F_DATA_SUCCESS on any of the c2hdata pdus?
> > > > > 
> > > > > Perhaps you can also run this patch instead?
> > > > 
> > > > Thanks, will give this a shot.
> > > 
> > > Still would be beneficial to look at the traces and check if
> > > the success flag happens to be set. If this flag is set, the
> > > driver _will_ complete the request without checking the bytes
> > > received thus far (similar to how pci and rdma don't and can't
> > > check dma byte count).
> > 
> > I realized this patch is the same as one you'd sent earlier. We hit the
> > BUG_ON(), and then proceeded to use your follow-up patch, which appeared
> > to fix the data receive problem, but introduced data digest problems.
> > 
> > So, are you saying that hitting this BUG_ON means that the driver has
> > observed the completion out-of-order from the expected data?
> 
> If you hit the BUG_ON it means that the host spotted a c2hdata
> PDU that has the success flag set before all the request data
> was received:
> --
> @@ -759,6 +761,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>                         queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH;
>                 } else {
>                         if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
> +                               BUG_ON(req->data_received != req->data_len);
>                                 nvme_tcp_end_request(rq, NVME_SC_SUCCESS);
>                                 queue->nr_cqe++;
>                         }
> --

I apologize for the confusion. There is a subtle difference in your most
recent patch request vs. the previous one: the BUG_ON() is within the
DATA_SUCCESS section, and we hadn't actually run with that. We did hit
the BUG_ON() in the first version, and looking at it now, I suspect you
intended to put it in this new location.

We'll retest, but I don't think we'll hit the BUG: none of the headers
have the DATA_SUCCESS flag set in the tcp dumps I've seen.

And also I see your point about how the original patch shouldn't be
needed at all, and I also don't see why it could have changed the
observation without data digest.

Thank you for your patience on this issue. I will get back to you with
more info after circling back with the test group.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: nvme tcp receive errors
  2021-05-03 20:25                                   ` Keith Busch
@ 2021-05-04 19:29                                     ` Sagi Grimberg
  0 siblings, 0 replies; 50+ messages in thread
From: Sagi Grimberg @ 2021-05-04 19:29 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, hch


>>>>> The driver tracepoints captured millions of IO's where everything
>>>>> happened as expected, so I really think something got confused and
>>>>> mucked with the wrong request. I've added more trace points to increase
>>>>> visibility because I frankly didn't find how that could happen just from
>>>>> code inspection. We will also incorporate your patch below for the next
>>>>> recreate.
>>>>
>>>> Keith, does the issue still happen with eliminating the network send
>>>> from .queue_rq() ?
>>>
>>> This patch is successful at resolving the observed r2t issues after the
>>> weekend test run, which is much longer than it could have run
>>> previously. I'm happy we're narrowing this down, but I'm not seeing how
>>> this addresses the problem. It looks like the mutex single threads the
>>> critical parts, but maybe I'm missing something. Any ideas?
>>
>> Not yet, but note that the send part is mutually exclusive but the
>> receive context is where we handle the r2t, validate length/offset
>> and (re)queue the request for sending a h2cdata pdu back to the
>> controller.
>>
>> The network send was an optimization for latency, and then I modified
>> the queueing in the driver such that a request would first go to llist
>> and then the sending context (either io_work or .queue_rq) would reap it
>> to a local send_list. This helps the driver get better understanding of
>> what is inflight such that it better set network msg flags for EOR/MORE.
>>
>> My assumption is that maybe somehow we send the initial command
>> pdu to the controller from queue_rq, receive the r2t back before the
>> .queue_rq context has completed and something may not be coherent.
> 
> Interesting. The network traces look correct, so my thoughts jumped to
> possibly incorrect usage of PCIe relaxed ordering, but that appears to
> be disabled.. I'll keep looking for other possibilities.
> 
>> Side question, are you running with a fully preemptible kernel? or
>> less NVMe queues than cpus?
> 
> Voluntary preempt. This test is using the kernel config from Ubuntu
> 20.04.
> 
> There are 16 CPUs in this set up with just 7 IO queues.

Keith,

Maybe this can help to add some more information (compile tested only):
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index a848b5b7f77b..9ce20b26c600 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -57,6 +57,7 @@ struct nvme_tcp_request {
         size_t                  data_sent;
         size_t                  data_received;
         enum nvme_tcp_send_state state;
+       bool                    got_r2t;
  };

  enum nvme_tcp_queue_flags {
@@ -622,10 +623,18 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
         }
         req = blk_mq_rq_to_pdu(rq);

+       if (req->state != NVME_TCP_SEND_CMD_PDU) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "queue %d tag %#x req unexpected state %d got_r2t %d\n",
+                       nvme_tcp_queue_id(queue), rq->tag, req->state,
+                       req->got_r2t);
+       }
+
         ret = nvme_tcp_setup_h2c_data_pdu(req, pdu);
         if (unlikely(ret))
                 return ret;

+       req->got_r2t = true;
         req->state = NVME_TCP_SEND_H2C_PDU;
         req->offset = 0;

@@ -1083,6 +1092,8 @@ static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
         }

         if (req->state == NVME_TCP_SEND_DATA) {
+               /* sending data not inline AND unsolicited? */
+               WARN_ON_ONCE(!nvme_tcp_has_inline_data(req) && !req->got_r2t);
                 ret = nvme_tcp_try_send_data(req);
                 if (ret <= 0)
                         goto done;
@@ -2275,6 +2286,7 @@ static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
                 return ret;

         req->state = NVME_TCP_SEND_CMD_PDU;
+       req->got_r2t = false;
         req->offset = 0;
         req->data_sent = 0;
         req->data_received = 0;
--
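
If the unexpected-state branch fires, the added dev_err would print a
line along these lines (the queue, tag, state and got_r2t values below
are placeholders, not from an actual run), which should make it easy to
grep for:
--
nvme nvme0: queue 3 tag 0x41 req unexpected state 2 got_r2t 0
--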

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: nvme tcp receive errors
  2021-05-04 19:14                                         ` Keith Busch
@ 2021-05-10 18:06                                           ` Keith Busch
  2021-05-10 18:18                                             ` Sagi Grimberg
  0 siblings, 1 reply; 50+ messages in thread
From: Keith Busch @ 2021-05-10 18:06 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, hch

Sagi,

Just wanted to give you an update on where we're at with this.

All tests run with your earlier patch removing the inline dispatch from
nvme_tcp_queue_request() are successful. At this point, I am leaning to
remove that optimization from mainline.

I added additional tracing to see what is going on, but we eventually
hit a memory issue after some hours of runtime. I've never seen an issue
like this before; it triggers in nvme_tcp_advance_req() when tracing the
rq->tag and req->data_sent:

  WARNING: CPU: 1 PID: 3428 at arch/x86/include/asm/kfence.h:44 kfence_protect_page+0x33/0xa0

I think the above is a distraction, but I can provide the full stack
trace and the patch adding the tracepoint if you think it's helpful.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: nvme tcp receive errors
  2021-05-10 18:06                                           ` Keith Busch
@ 2021-05-10 18:18                                             ` Sagi Grimberg
  2021-05-10 18:30                                               ` Keith Busch
  0 siblings, 1 reply; 50+ messages in thread
From: Sagi Grimberg @ 2021-05-10 18:18 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, hch


> Sagi,
> 
> Just wanted to give you an update on where we're at with this.
> 
> All tests run with your earlier patch removing the inline dispatch from
> nvme_tcp_queue_request() are successful. At this point, I am leaning to
> remove that optimization from mainline.

Thanks Keith,

Did you run it with the extra information debug patch I sent you? What
I'm concerned about is that given that you have the only environment
where this reproduces, and this is removed it will be very difficult
to add it back in.

Also, what about the read issue? that one is still unresolved from
my PoV.

> I added additional tracing to see what is going on, but we eventually
> hit a memory issue after some hours of runtime. I've never seen an issue
> like this before; it triggers in nvme_tcp_advance_req() when tracing the
> rq->tag and req->data_sent:
> 
>    WARNING: CPU: 1 PID: 3428 at arch/x86/include/asm/kfence.h:44 kfence_protect_page+0x33/0xa0
> 
> I think the above is a distraction, but I can provide the full stack
> trace and the patch adding the tracepoint if you think it's helpful.

That is... odd..

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: nvme tcp receive errors
  2021-05-10 18:18                                             ` Sagi Grimberg
@ 2021-05-10 18:30                                               ` Keith Busch
  2021-05-10 21:07                                                 ` Sagi Grimberg
  0 siblings, 1 reply; 50+ messages in thread
From: Keith Busch @ 2021-05-10 18:30 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, hch

On Mon, May 10, 2021 at 11:18:24AM -0700, Sagi Grimberg wrote:
> 
> > Sagi,
> > 
> > Just wanted to give you an update on where we're at with this.
> > 
> > All tests run with your earlier patch removing the inline dispatch from
> > nvme_tcp_queue_request() are successful. At this point, I am leaning to
> > remove that optimization from mainline.
> 
> Thanks Keith,
> 
> Did you run it with the extra information debug patch I sent you? What
> I'm concerned about is that given that you have the only environment
> where this reproduces, and this is removed it will be very difficult
> to add it back in.
> 
> Also, what about the read issue? that one is still unresolved from
> my PoV.

The test team reports no issues for both read and write tests with the
inline dispatch removed. That single patch appears to resolve all
problems. Sorry, I should have clarified that initially.

We will go back to the extra debug from last week on top of a pristine
mainline kernel. There's been some confusion about which patches to apply
on top of others in the mix, but I'm getting better at coordinating
that.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: nvme tcp receive errors
  2021-05-10 18:30                                               ` Keith Busch
@ 2021-05-10 21:07                                                 ` Sagi Grimberg
  2021-05-11  3:00                                                   ` Keith Busch
  0 siblings, 1 reply; 50+ messages in thread
From: Sagi Grimberg @ 2021-05-10 21:07 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, hch


>>> Sagi,
>>>
>>> Just wanted to give you an update on where we're at with this.
>>>
>>> All tests run with your earlier patch removing the inline dispatch from
>>> nvme_tcp_queue_request() are successful. At this point, I am leaning to
>>> remove that optimization from mainline.
>>
>> Thanks Keith,
>>
>> Did you run it with the extra information debug patch I sent you? What
>> I'm concerned about is that given that you have the only environment
>> where this reproduces, and this is removed it will be very difficult
>> to add it back in.
>>
>> Also, what about the read issue? that one is still unresolved from
>> my PoV.
> 
> The test team reports no issues for both read and write tests with the
> inline dispatch removed. That single patch appears to resolve all
> problems. Sorry, I should have clarified that initially.
> 
> We will go back to the extra debug from last week on top of a pristine
> mainline kernel. There's been some confusion of which patches to apply
> on top of others in the mix, but I'm getting better at coordinating
> that.

Keith,

I may have a theory to this issue. I think that the problem is in
cases where we send commands with data to the controller and then in
nvme_tcp_send_data between the last successful kernel_sendpage
and before nvme_tcp_advance_req, the controller sends back a successful
completion.

If that is the case, then the completion path could be triggered,
the tag would be reused, triggering a new .queue_rq, setting again
the req.iter with the new bio params (all is not taken by the
send_mutex) and then the send context would call nvme_tcp_advance_req
progressing the req.iter with the former sent bytes... And given that
the req.iter is used for reads/writes, it is possible that it can
explain both issues.

While this is not easy to trigger, there is nothing I think that
can prevent that. The driver used to have a single context that
would do both send and recv so this could not have happened, but
now that we added the .queue_rq send context, I guess this can
indeed confuse the driver.
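
To make the suspected ordering concrete, it can be laid out roughly
like this (just an illustration of the description above, not code from
the driver):
--
/*
 *   send context (.queue_rq / io_work)     completion + new submission
 *   ----------------------------------     ---------------------------
 *   kernel_sendpage() pushes the last
 *   data page of request X
 *                                           controller completes X
 *                                           nvme_tcp_end_request(X)
 *                                           tag reused: a new .queue_rq
 *                                           sets up the request again and
 *                                           resets req->iter for the new bio
 *   nvme_tcp_advance_req(req, ret)
 *   advances the freshly reset req->iter
 *   by the bytes that belonged to X
 */
--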

There are 3 possible options to resolve this:
1. Never touch the command after the last send (only advance the
request iter after we check if we are done):
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index c51b70aec6dd..5c576eda5ba1 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -933,7 +933,6 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
                 if (ret <= 0)
                         return ret;

-               nvme_tcp_advance_req(req, ret);
                 if (queue->data_digest)
                         nvme_tcp_ddgst_update(queue->snd_hash, page,
                                         offset, ret);
@@ -950,6 +949,7 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
                         }
                         return 1;
                 }
+               nvme_tcp_advance_req(req, ret);
         }
         return -EAGAIN;
  }
--

2. Only initialize the request/cmd/pdu (nvme_tcp_setup_cmd_pdu)
when we have taken the send_mutex (say in nvme_tcp_fetch_request
or something), which is more cumbersome as this stuff may be
called multiple times in the life of a request.

3. Add refcounting so that an I/O is never completed before both send
and receive have fully completed for the command. Besides adding some
overhead on the datapath, there may also be some challenges with moving
the request state machine to be based on the refcounting; a rough
sketch follows below.
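
For completeness, (3) could look roughly like the sketch below; the
refcount field and the put helper are hypothetical, and the real
interaction with the send/recv state machine would need more care:
--
/* Hypothetical sketch of option 3, not a real patch: one reference for
 * the send path and one for the receive/completion path, complete the
 * request only when the last reference drops. */
struct nvme_tcp_request {
	/* ... existing fields ... */
	refcount_t		ref;	/* hypothetical field */
};

static void nvme_tcp_req_put(struct request *rq, u16 status)
{
	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);

	if (refcount_dec_and_test(&req->ref))
		nvme_tcp_end_request(rq, status);
}
--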

I suggest that you guys give (1) a shot and see if the theory holds...

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: nvme tcp receive errors
  2021-05-10 21:07                                                 ` Sagi Grimberg
@ 2021-05-11  3:00                                                   ` Keith Busch
  2021-05-11 17:17                                                     ` Sagi Grimberg
  0 siblings, 1 reply; 50+ messages in thread
From: Keith Busch @ 2021-05-11  3:00 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, hch

On Mon, May 10, 2021 at 02:07:47PM -0700, Sagi Grimberg wrote:
> I may have a theory to this issue. I think that the problem is in
> cases where we send commands with data to the controller and then in
> nvme_tcp_send_data between the last successful kernel_sendpage
> and before nvme_tcp_advance_req, the controller sends back a successful
> completion.
>
> If that is the case, then the completion path could be triggered,
> the tag would be reused, triggering a new .queue_rq, setting again
> the req.iter with the new bio params (all is not taken by the
> send_mutex) and then the send context would call nvme_tcp_advance_req
> progressing the req.iter with the former sent bytes... And given that
> the req.iter is used for reads/writes, it is possible that it can
> explain both issues.
> 
> While this is not easy to trigger, there is nothing I think that
> can prevent that. The driver used to have a single context that
> would do both send and recv so this could not have happened, but
> now that we added the .queue_rq send context, I guess this can
> indeed confuse the driver.

Awesome, this is exactly the type of sequence I've been trying to
capture, but couldn't quite get there. Now that you've described it,
that flow can certainly explain the observations, including the
corrupted debug trace event I was trying to add.

The sequence looks unlikely to happen, which agrees with the difficulty
in reproducing it. I am betting right now that you got it, but a little
surprised no one else is reporting a similar problem yet.

Your option "1" looks like the best one, IMO. I've requested dropping
all debug and test patches and using just this one on the current nvme
baseline for the next test cycle.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: nvme tcp receive errors
  2021-05-11  3:00                                                   ` Keith Busch
@ 2021-05-11 17:17                                                     ` Sagi Grimberg
  2021-05-13 15:48                                                       ` Keith Busch
  0 siblings, 1 reply; 50+ messages in thread
From: Sagi Grimberg @ 2021-05-11 17:17 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, hch


>> I may have a theory to this issue. I think that the problem is in
>> cases where we send commands with data to the controller and then in
>> nvme_tcp_send_data between the last successful kernel_sendpage
>> and before nvme_tcp_advance_req, the controller sends back a successful
>> completion.
>>
>> If that is the case, then the completion path could be triggered,
>> the tag would be reused, triggering a new .queue_rq, setting again
>> the req.iter with the new bio params (all is not taken by the
>> send_mutex) and then the send context would call nvme_tcp_advance_req
>> progressing the req.iter with the former sent bytes... And given that
>> the req.iter is used for reads/writes, it is possible that it can
>> explain both issues.
>>
>> While this is not easy to trigger, there is nothing I think that
>> can prevent that. The driver used to have a single context that
>> would do both send and recv so this could not have happened, but
>> now that we added the .queue_rq send context, I guess this can
>> indeed confuse the driver.
> 
> Awesome, this is exactly the type of sequence I've been trying to
> capture, but couldn't quite get there. Now that you've described it,
> that flow can certainly explain the observations, including the
> corrupted debug trace event I was trying to add.
> 
> The sequence looks unlikely to happen, which agrees with the difficulty
> in reproducing it. I am betting right now that you got it, but a little
> surprised no one else is reporting a similar problem yet.

We had at least one report from Potnuri that I think may have been
triggered by this; it ended up fixed (or rather worked around
with 5c11f7d9f843).

> Your option "1" looks like the best one, IMO. I've requested dropping
> all debug and test patches and using just this one on the current nvme
> baseline for the next test cycle.

Cool, waiting to hear back...

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: nvme tcp receive errors
  2021-05-11 17:17                                                     ` Sagi Grimberg
@ 2021-05-13 15:48                                                       ` Keith Busch
  2021-05-13 19:53                                                         ` Sagi Grimberg
  0 siblings, 1 reply; 50+ messages in thread
From: Keith Busch @ 2021-05-13 15:48 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, hch

On Tue, May 11, 2021 at 10:17:09AM -0700, Sagi Grimberg wrote:
> 
> > > I may have a theory to this issue. I think that the problem is in
> > > cases where we send commands with data to the controller and then in
> > > nvme_tcp_send_data between the last successful kernel_sendpage
> > > and before nvme_tcp_advance_req, the controller sends back a successful
> > > completion.
> > > 
> > > If that is the case, then the completion path could be triggered,
> > > the tag would be reused, triggering a new .queue_rq, setting again
> > > the req.iter with the new bio params (all is not taken by the
> > > send_mutex) and then the send context would call nvme_tcp_advance_req
> > > progressing the req.iter with the former sent bytes... And given that
> > > the req.iter is used for reads/writes, it is possible that it can
> > > explain both issues.
> > > 
> > > While this is not easy to trigger, there is nothing I think that
> > > can prevent that. The driver used to have a single context that
> > > would do both send and recv so this could not have happened, but
> > > now that we added the .queue_rq send context, I guess this can
> > > indeed confuse the driver.
> > 
> > Awesome, this is exactly the type of sequence I've been trying to
> > capture, but couldn't quite get there. Now that you've described it,
> > that flow can certainly explain the observations, including the
> > corrupted debug trace event I was trying to add.
> > 
> > The sequence looks unlikely to happen, which agrees with the difficulty
> > in reproducing it. I am betting right now that you got it, but a little
> > surprised no one else is reporting a similar problem yet.
> 
> We had at least one report from Potnuri that I think may have been
> triggered by this, this ended up fixed (or rather worked-around
> with 5c11f7d9f843).
> 
> > Your option "1" looks like the best one, IMO. I've requested dropping
> > all debug and test patches and using just this one on the current nvme
> > baseline for the next test cycle.
> 
> Cool, waiting to hear back...

This patch has been tested successfully on the initial workloads. There
are several more that need to be validated, but each one runs for many
hours, so it may be a couple more days before they are completed. Just
wanted to let you know: so far, so good.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: nvme tcp receive errors
  2021-05-13 15:48                                                       ` Keith Busch
@ 2021-05-13 19:53                                                         ` Sagi Grimberg
  2021-05-17 20:48                                                           ` Keith Busch
  0 siblings, 1 reply; 50+ messages in thread
From: Sagi Grimberg @ 2021-05-13 19:53 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, hch



On 5/13/21 8:48 AM, Keith Busch wrote:
> On Tue, May 11, 2021 at 10:17:09AM -0700, Sagi Grimberg wrote:
>>
>>>> I may have a theory to this issue. I think that the problem is in
>>>> cases where we send commands with data to the controller and then in
>>>> nvme_tcp_send_data between the last successful kernel_sendpage
>>>> and before nvme_tcp_advance_req, the controller sends back a successful
>>>> completion.
>>>>
>>>> If that is the case, then the completion path could be triggered,
>>>> the tag would be reused, triggering a new .queue_rq, setting again
>>>> the req.iter with the new bio params (all is not taken by the
>>>> send_mutex) and then the send context would call nvme_tcp_advance_req
>>>> progressing the req.iter with the former sent bytes... And given that
>>>> the req.iter is used for reads/writes, it is possible that it can
>>>> explain both issues.
>>>>
>>>> While this is not easy to trigger, there is nothing I think that
>>>> can prevent that. The driver used to have a single context that
>>>> would do both send and recv so this could not have happened, but
>>>> now that we added the .queue_rq send context, I guess this can
>>>> indeed confuse the driver.
>>>
>>> Awesome, this is exactly the type of sequence I've been trying to
>>> capture, but couldn't quite get there. Now that you've described it,
>>> that flow can certainly explain the observations, including the
>>> corrupted debug trace event I was trying to add.
>>>
>>> The sequence looks unlikely to happen, which agrees with the difficulty
>>> in reproducing it. I am betting right now that you got it, but a little
>>> surprised no one else is reporting a similar problem yet.
>>
>> We had at least one report from Potnuri that I think may have been
>> triggered by this, this ended up fixed (or rather worked-around
>> with 5c11f7d9f843).
>>
>>> Your option "1" looks like the best one, IMO. I've requested dropping
>>> all debug and test patches and using just this one on the current nvme
>>> baseline for the next test cycle.
>>
>> Cool, waiting to hear back...
> 
> This patch has been tested successfully on the initial workloads. There
> are several more that need to be validated, but each one runs for many
> hours, so it may be a couple more days before they are completed. Just
> wanted to let you know: so far, so good.

Encouraging... I'll send a patch for that as soon as you give me the
final verdict. I'm assuming Narayan would be the reporter and the
tester?

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: nvme tcp receive errors
  2021-05-13 19:53                                                         ` Sagi Grimberg
@ 2021-05-17 20:48                                                           ` Keith Busch
  0 siblings, 0 replies; 50+ messages in thread
From: Keith Busch @ 2021-05-17 20:48 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, hch

On Thu, May 13, 2021 at 12:53:54PM -0700, Sagi Grimberg wrote:
> On 5/13/21 8:48 AM, Keith Busch wrote:
> > On Tue, May 11, 2021 at 10:17:09AM -0700, Sagi Grimberg wrote:
> > > 
> > > > > I may have a theory to this issue. I think that the problem is in
> > > > > cases where we send commands with data to the controller and then in
> > > > > nvme_tcp_send_data between the last successful kernel_sendpage
> > > > > and before nvme_tcp_advance_req, the controller sends back a successful
> > > > > completion.
> > > > > 
> > > > > If that is the case, then the completion path could be triggered,
> > > > > the tag would be reused, triggering a new .queue_rq, setting again
> > > > > the req.iter with the new bio params (all is not taken by the
> > > > > send_mutex) and then the send context would call nvme_tcp_advance_req
> > > > > progressing the req.iter with the former sent bytes... And given that
> > > > > the req.iter is used for reads/writes, it is possible that it can
> > > > > explain both issues.
> > > > > 
> > > > > While this is not easy to trigger, there is nothing I think that
> > > > > can prevent that. The driver used to have a single context that
> > > > > would do both send and recv so this could not have happened, but
> > > > > now that we added the .queue_rq send context, I guess this can
> > > > > indeed confuse the driver.
> > > > 
> > > > Awesome, this is exactly the type of sequence I've been trying to
> > > > capture, but couldn't quite get there. Now that you've described it,
> > > > that flow can certainly explain the observations, including the
> > > > corrupted debug trace event I was trying to add.
> > > > 
> > > > The sequence looks unlikely to happen, which agrees with the difficulty
> > > > in reproducing it. I am betting right now that you got it, but a little
> > > > surprised no one else is reporting a similar problem yet.
> > > 
> > > We had at least one report from Potnuri that I think may have been
> > > triggered by this, this ended up fixed (or rather worked-around
> > > with 5c11f7d9f843).
> > > 
> > > > Your option "1" looks like the best one, IMO. I've requested dropping
> > > > all debug and test patches and using just this one on the current nvme
> > > > baseline for the next test cycle.
> > > 
> > > Cool, waiting to hear back...
> > 
> > This patch has been tested successfully on the initial workloads. There
> > are several more that need to be validated, but each one runs for many
> > hours, so it may be a couple more days before they are completed. Just
> > wanted to let you know: so far, so good.
> 
> Encouraging... I'll send a patch for that as soon as you give me the
> final verdict. I'm assuming Narayan would be the reporter and the
> tester?

This tested successfully. There was one timeout issue observed in all the
testing, but it does not appear related to the reported problems here or
to your fix, so I will start a new thread on that if I can get more
information on it.

You may use the following tags for the commit log:

Reported-by: Narayan Ayalasomayajula <narayan.ayalasomayajula@wdc.com>
Tested-by: Anil Mishra <anil.mishra@wdc.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2021-05-17 20:50 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-31 16:18 nvme tcp receive errors Keith Busch
2021-03-31 19:10 ` Sagi Grimberg
2021-03-31 20:49   ` Keith Busch
2021-03-31 22:16     ` Sagi Grimberg
2021-03-31 22:26       ` Keith Busch
2021-03-31 22:45         ` Sagi Grimberg
2021-04-02 17:11     ` Keith Busch
2021-04-02 17:27       ` Sagi Grimberg
2021-04-05 14:37         ` Keith Busch
2021-04-07 19:53           ` Keith Busch
2021-04-09 21:38             ` Sagi Grimberg
2021-04-27 23:39               ` Keith Busch
2021-04-27 23:55                 ` Sagi Grimberg
2021-04-28 15:58                   ` Keith Busch
2021-04-28 17:42                     ` Sagi Grimberg
2021-04-28 18:01                       ` Keith Busch
2021-04-28 23:06                         ` Sagi Grimberg
2021-04-29  3:33                           ` Keith Busch
2021-04-29  4:52                             ` Sagi Grimberg
2021-05-03 18:51                               ` Keith Busch
2021-05-03 19:58                                 ` Sagi Grimberg
2021-05-03 20:25                                   ` Keith Busch
2021-05-04 19:29                                     ` Sagi Grimberg
2021-04-09 18:04           ` Sagi Grimberg
2021-04-14  0:29             ` Keith Busch
2021-04-21  5:33               ` Sagi Grimberg
2021-04-21 14:28                 ` Keith Busch
2021-04-21 16:59                   ` Sagi Grimberg
2021-04-26 15:31                 ` Keith Busch
2021-04-27  3:10                   ` Sagi Grimberg
2021-04-27 18:12                     ` Keith Busch
2021-04-27 23:58                       ` Sagi Grimberg
2021-04-30 23:42                         ` Sagi Grimberg
2021-05-03 14:28                           ` Keith Busch
2021-05-03 19:36                             ` Sagi Grimberg
2021-05-03 19:38                               ` Sagi Grimberg
2021-05-03 19:44                                 ` Keith Busch
2021-05-03 20:00                                   ` Sagi Grimberg
2021-05-04 14:36                                     ` Keith Busch
2021-05-04 18:15                                       ` Sagi Grimberg
2021-05-04 19:14                                         ` Keith Busch
2021-05-10 18:06                                           ` Keith Busch
2021-05-10 18:18                                             ` Sagi Grimberg
2021-05-10 18:30                                               ` Keith Busch
2021-05-10 21:07                                                 ` Sagi Grimberg
2021-05-11  3:00                                                   ` Keith Busch
2021-05-11 17:17                                                     ` Sagi Grimberg
2021-05-13 15:48                                                       ` Keith Busch
2021-05-13 19:53                                                         ` Sagi Grimberg
2021-05-17 20:48                                                           ` Keith Busch

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).