* nvme tcp receive errors
@ 2021-03-31 16:18 Keith Busch
  2021-03-31 19:10 ` Sagi Grimberg
  0 siblings, 1 reply; 50+ messages in thread
From: Keith Busch @ 2021-03-31 16:18 UTC (permalink / raw)
To: linux-nvme, sagi, hch

Hi,

While running a read-write mixed workload, we are observing errors like:

  nvme nvme4: queue 2 no space in request 0x1

Based on tcpdump, all data for this queue is expected to satisfy the
command request. I'm not familiar enough with the tcp interfaces, so
could anyone provide pointers on how to debug this further?

Thanks,
Keith

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors
  2021-03-31 16:18 nvme tcp receive errors Keith Busch
@ 2021-03-31 19:10 ` Sagi Grimberg
  2021-03-31 20:49   ` Keith Busch
  0 siblings, 1 reply; 50+ messages in thread
From: Sagi Grimberg @ 2021-03-31 19:10 UTC (permalink / raw)
To: Keith Busch, linux-nvme, hch

> Hi,

Hey Keith,

> While running a read-write mixed workload, we are observing errors like:
>
> nvme nvme4: queue 2 no space in request 0x1

This means that we get a data payload from a read request and
we don't have bio/bvec space to store it, which means we
are probably not tracking the request iterator correctly if
tcpdump shows that we are getting the right data length.

> Based on tcpdump, all data for this queue is expected to satisfy the
> command request. I'm not familiar enough with the tcp interfaces, so
> could anyone provide pointers on how to debug this further?

What was the size of the I/O that you were using? Is this easily
reproducible?

Do you have the below applied:
ca1ff67d0fb1 ("nvme-tcp: fix possible data corruption with bio merges")
0dc9edaf80ea ("nvme-tcp: pass multipage bvec to request iov_iter")

I'm assuming yes if you are using the latest nvme tree...

Does the issue still happen when you revert 0dc9edaf80ea?
* Re: nvme tcp receive errors
  2021-03-31 19:10 ` Sagi Grimberg
@ 2021-03-31 20:49   ` Keith Busch
  2021-03-31 22:16     ` Sagi Grimberg
  2021-04-02 17:11     ` Keith Busch
  0 siblings, 2 replies; 50+ messages in thread
From: Keith Busch @ 2021-03-31 20:49 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: linux-nvme, hch

On Wed, Mar 31, 2021 at 12:10:55PM -0700, Sagi Grimberg wrote:
> Hey Keith,
>
> > While running a read-write mixed workload, we are observing errors like:
> >
> > nvme nvme4: queue 2 no space in request 0x1
>
> This means that we get a data payload from a read request and
> we don't have bio/bvec space to store it, which means we
> are probably not tracking the request iterator correctly if
> tcpdump shows that we are getting the right data length.
>
> > Based on tcpdump, all data for this queue is expected to satisfy the
> > command request. I'm not familiar enough with the tcp interfaces, so
> > could anyone provide pointers on how to debug this further?
>
> What was the size of the I/O that you were using? Is this easily
> reproducible?
>
> Do you have the below applied:
> ca1ff67d0fb1 ("nvme-tcp: fix possible data corruption with bio merges")
> 0dc9edaf80ea ("nvme-tcp: pass multipage bvec to request iov_iter")
>
> I'm assuming yes if you are using the latest nvme tree...
>
> Does the issue still happen when you revert 0dc9edaf80ea?

Thanks for the reply.

This was observed on the recent 5.12-rc4, so it has all the latest tcp
fixes. I'll check with reverting 0dc9edaf80ea and see if that makes a
difference. It is currently reproducible, though it can take over an
hour right now.
* Re: nvme tcp receive errors
  2021-03-31 20:49 ` Keith Busch
@ 2021-03-31 22:16   ` Sagi Grimberg
  2021-03-31 22:26     ` Keith Busch
  1 sibling, 1 reply; 50+ messages in thread
From: Sagi Grimberg @ 2021-03-31 22:16 UTC (permalink / raw)
To: Keith Busch; +Cc: linux-nvme, hch

>> Hey Keith,
>>
>>> While running a read-write mixed workload, we are observing errors like:
>>>
>>> nvme nvme4: queue 2 no space in request 0x1
>>
>> This means that we get a data payload from a read request and
>> we don't have bio/bvec space to store it, which means we
>> are probably not tracking the request iterator correctly if
>> tcpdump shows that we are getting the right data length.
>>
>>> Based on tcpdump, all data for this queue is expected to satisfy the
>>> command request. I'm not familiar enough with the tcp interfaces, so
>>> could anyone provide pointers on how to debug this further?
>>
>> What was the size of the I/O that you were using? Is this easily
>> reproducible?
>>
>> Do you have the below applied:
>> ca1ff67d0fb1 ("nvme-tcp: fix possible data corruption with bio merges")
>> 0dc9edaf80ea ("nvme-tcp: pass multipage bvec to request iov_iter")
>>
>> I'm assuming yes if you are using the latest nvme tree...
>>
>> Does the issue still happen when you revert 0dc9edaf80ea?
>
> Thanks for the reply.
>
> This was observed on the recent 5.12-rc4, so it has all the latest tcp
> fixes. I'll check with reverting 0dc9edaf80ea and see if that makes a
> difference. It is currently reproducible, though it can take over an
> hour right now.

What is the workload you are running? Do you have an fio job file?
Is this I/O to a raw block device, or through an fs or iosched?

Also, I'm assuming that you are using Linux nvmet as the target
device?
* Re: nvme tcp receive errors
  2021-03-31 22:16 ` Sagi Grimberg
@ 2021-03-31 22:26   ` Keith Busch
  2021-03-31 22:45     ` Sagi Grimberg
  0 siblings, 1 reply; 50+ messages in thread
From: Keith Busch @ 2021-03-31 22:26 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: linux-nvme, hch

On Wed, Mar 31, 2021 at 03:16:19PM -0700, Sagi Grimberg wrote:
>
> > > Hey Keith,
> > >
> > > > While running a read-write mixed workload, we are observing errors like:
> > > >
> > > > nvme nvme4: queue 2 no space in request 0x1
> > >
> > > This means that we get a data payload from a read request and
> > > we don't have bio/bvec space to store it, which means we
> > > are probably not tracking the request iterator correctly if
> > > tcpdump shows that we are getting the right data length.
> > >
> > > > Based on tcpdump, all data for this queue is expected to satisfy the
> > > > command request. I'm not familiar enough with the tcp interfaces, so
> > > > could anyone provide pointers on how to debug this further?
> > >
> > > What was the size of the I/O that you were using? Is this easily
> > > reproducible?
> > >
> > > Do you have the below applied:
> > > ca1ff67d0fb1 ("nvme-tcp: fix possible data corruption with bio merges")
> > > 0dc9edaf80ea ("nvme-tcp: pass multipage bvec to request iov_iter")
> > >
> > > I'm assuming yes if you are using the latest nvme tree...
> > >
> > > Does the issue still happen when you revert 0dc9edaf80ea?
> >
> > Thanks for the reply.
> >
> > This was observed on the recent 5.12-rc4, so it has all the latest tcp
> > fixes. I'll check with reverting 0dc9edaf80ea and see if that makes a
> > difference. It is currently reproducible, though it can take over an
> > hour right now.
>
> What is the workload you are running? Do you have an fio job file?
> Is this I/O to a raw block device, or through an fs or iosched?

It's O_DIRECT to a raw block device using the libaio engine. No fs, page
cache, or io scheduler is used.

The fio job is generated by a script that cycles through various sizes,
rw mixes, and io depths. It is not always consistent which particular
set of parameters is running when the error message is observed, though.
I can get more details if this will be helpful.

> Also, I'm assuming that you are using Linux nvmet as the target
> device?

Not this time. The target is implemented in a hardware device.
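For reference, a single parameter combination of the kind the script cycles through could be expressed as an fio job like the one below. The device path and every value here are illustrative placeholders, not the actual script's output:

```ini
; Hypothetical fio job matching the described workload shape.
; /dev/nvme4n1 and all numeric values are placeholders.
[global]
ioengine=libaio
direct=1            ; O_DIRECT to the raw block device, bypassing page cache
filename=/dev/nvme4n1
time_based=1
runtime=60

[rw-mix]
rw=randrw
rwmixread=70        ; read-write mixed workload
bs=16k              ; e.g. one of the cycled block sizes
iodepth=32
```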
* Re: nvme tcp receive errors
  2021-03-31 22:26 ` Keith Busch
@ 2021-03-31 22:45   ` Sagi Grimberg
  0 siblings, 0 replies; 50+ messages in thread
From: Sagi Grimberg @ 2021-03-31 22:45 UTC (permalink / raw)
To: Keith Busch; +Cc: linux-nvme, hch

>> What is the workload you are running? Do you have an fio job file?
>> Is this I/O to a raw block device, or through an fs or iosched?
>
> It's O_DIRECT to a raw block device using the libaio engine. No fs, page
> cache, or io scheduler is used.

I see.

> The fio job is generated by a script that cycles through various sizes,
> rw mixes, and io depths. It is not always consistent which particular
> set of parameters is running when the error message is observed, though.
> I can get more details if this will be helpful.

Try out the debug patch below [1]; when this happens again we can get
some more info on the request itself.

>> Also, I'm assuming that you are using Linux nvmet as the target
>> device?
>
> Not this time. The target is implemented in a hardware device.

Ha, cool...
[1]:
---
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 69f59d2c5799..b218a41ac088 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -37,6 +37,14 @@ enum nvme_tcp_send_state {
 	NVME_TCP_SEND_DDGST,
 };
 
+enum nvme_tcp_cmd_state {
+	NVME_TCP_CMD_QUEUED = 0,
+	NVME_TCP_CMD_SEND_PDU,
+	NVME_TCP_CMD_PENDING_DATA,
+	NVME_TCP_CMD_DATA_DONE,
+	NVME_TCP_CMD_DONE,
+};
+
 struct nvme_tcp_request {
 	struct nvme_request	req;
 	void			*pdu;
@@ -56,6 +64,7 @@ struct nvme_tcp_request {
 	size_t			offset;
 	size_t			data_sent;
 	enum nvme_tcp_send_state state;
+	enum nvme_tcp_cmd_state cmd_state;
 };
 
 enum nvme_tcp_queue_flags {
@@ -482,6 +491,7 @@ static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
 static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
 		struct nvme_completion *cqe)
 {
+	struct nvme_tcp_request *req;
 	struct request *rq;
 
 	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), cqe->command_id);
@@ -493,6 +503,8 @@ static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
 		return -EINVAL;
 	}
 
+	req = blk_mq_rq_to_pdu(rq);
+	req->cmd_state = NVME_TCP_CMD_DONE;
 	if (!nvme_try_complete_req(rq, cqe->status, cqe->result))
 		nvme_complete_rq(rq);
 	queue->nr_cqe++;
@@ -503,6 +515,7 @@ static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
 static int nvme_tcp_handle_c2h_data(struct nvme_tcp_queue *queue,
 		struct nvme_tcp_data_pdu *pdu)
 {
+	struct nvme_tcp_request *req;
 	struct request *rq;
 
 	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
@@ -512,11 +525,12 @@ static int nvme_tcp_handle_c2h_data(struct nvme_tcp_queue *queue,
 			nvme_tcp_queue_id(queue), pdu->command_id);
 		return -ENOENT;
 	}
+	req = blk_mq_rq_to_pdu(rq);
 
 	if (!blk_rq_payload_bytes(rq)) {
 		dev_err(queue->ctrl->ctrl.device,
-			"queue %d tag %#x unexpected data\n",
-			nvme_tcp_queue_id(queue), rq->tag);
+			"queue %d tag %#x unexpected data cmd_state %d\n",
+			nvme_tcp_queue_id(queue), rq->tag, req->cmd_state);
 		return -EIO;
 	}
 
@@ -755,7 +769,9 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
 			nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);
 			queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH;
 		} else {
+			req->cmd_state = NVME_TCP_CMD_DATA_DONE;
 			if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
+				req->cmd_state = NVME_TCP_CMD_DONE;
 				nvme_tcp_end_request(rq, NVME_SC_SUCCESS);
 				queue->nr_cqe++;
 			}
@@ -796,7 +812,10 @@ static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
 	if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
 		struct request *rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue),
 						pdu->command_id);
+		struct nvme_tcp_request *req;
 
+		req = blk_mq_rq_to_pdu(rq);
+		req->cmd_state = NVME_TCP_CMD_DONE;
 		nvme_tcp_end_request(rq, NVME_SC_SUCCESS);
 		queue->nr_cqe++;
 	}
@@ -944,6 +963,7 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
 			nvme_tcp_ddgst_final(queue->snd_hash, &req->ddgst);
 			req->state = NVME_TCP_SEND_DDGST;
+			req->cmd_state = NVME_TCP_CMD_DATA_DONE;
 			req->offset = 0;
 		} else {
 			nvme_tcp_done_send_req(queue);
@@ -979,6 +999,7 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
 	len -= ret;
 	if (!len) {
+		req->cmd_state = req->data_len ? NVME_TCP_CMD_PENDING_DATA : NVME_TCP_CMD_DATA_DONE;
 		if (inline_data) {
 			req->state = NVME_TCP_SEND_DATA;
 			if (queue->data_digest)
@@ -2329,6 +2350,7 @@ static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
 
 	blk_mq_start_request(rq);
 
+	req->cmd_state = NVME_TCP_CMD_QUEUED;
 	nvme_tcp_queue_request(req, true, bd->last);
 
 	return BLK_STS_OK;
---
* Re: nvme tcp receive errors
  2021-03-31 20:49 ` Keith Busch
  2021-03-31 22:16 ` Sagi Grimberg
@ 2021-04-02 17:11   ` Keith Busch
  2021-04-02 17:27     ` Sagi Grimberg
  1 sibling, 1 reply; 50+ messages in thread
From: Keith Busch @ 2021-04-02 17:11 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: linux-nvme, hch

On Thu, Apr 01, 2021 at 05:49:58AM +0900, Keith Busch wrote:
> On Wed, Mar 31, 2021 at 12:10:55PM -0700, Sagi Grimberg wrote:
> > Hey Keith,
> >
> > > While running a read-write mixed workload, we are observing errors like:
> > >
> > > nvme nvme4: queue 2 no space in request 0x1
> >
> > This means that we get a data payload from a read request and
> > we don't have bio/bvec space to store it, which means we
> > are probably not tracking the request iterator correctly if
> > tcpdump shows that we are getting the right data length.
> >
> > > Based on tcpdump, all data for this queue is expected to satisfy the
> > > command request. I'm not familiar enough with the tcp interfaces, so
> > > could anyone provide pointers on how to debug this further?
> >
> > What was the size of the I/O that you were using? Is this easily
> > reproducible?
> >
> > Do you have the below applied:
> > ca1ff67d0fb1 ("nvme-tcp: fix possible data corruption with bio merges")
> > 0dc9edaf80ea ("nvme-tcp: pass multipage bvec to request iov_iter")
> >
> > I'm assuming yes if you are using the latest nvme tree...
> >
> > Does the issue still happen when you revert 0dc9edaf80ea?
>
> Thanks for the reply.
>
> This was observed on the recent 5.12-rc4, so it has all the latest tcp
> fixes. I'll check with reverting 0dc9edaf80ea and see if that makes a
> difference. It is currently reproducible, though it can take over an
> hour right now.

After reverting 0dc9edaf80ea, we are observing a kernel panic (below).
We'll try adding it back, plus adding your debug patch.
BUG: kernel NULL pointer dereference, address: 000000000000002c
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 15 PID: 49319 Comm: fio Tainted: G OE 5.12.0-051200rc4-generic #202103212230
Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0a 08/02/2016
RIP: 0010:nvme_tcp_init_iter+0x55/0xf0 [nvme_tcp]
Code: ff ff b9 01 00 00 00 45 31 e4 48 8d 7b 68 4c 89 da 44 89 d6 e8 ec e4 16 c8 4c 89 63 70 5b 41 5c 41 5d 41 5e 5d c3 48 8b 57 60 <8b> 42 2c 4c 8b 6a 78 44 8b 42 28 44 8b 62 30 49 89 c3 48 89 c7 49
RSP: 0018:ffffaafccd0eb920 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff9ad834bfcf28 RCX: 0000000000002000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9ad834bfcf28
RBP: ffffaafccd0eb940 R08: 0000000000002000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000002000
R13: fffff67f07c1ce00 R14: ffff9ad78422c5f0 R15: ffff9ad834bfcf28
FS:  00007f010087d740(0000) GS:ffff9adedfdc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000000002c CR3: 000000027b11e005 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 nvme_tcp_try_send_data+0x282/0x2e0 [nvme_tcp]
 ? nvme_tcp_init_iter+0x44/0xf0 [nvme_tcp]
 nvme_tcp_try_send+0x12f/0x1b0 [nvme_tcp]
 ? blk_mq_start_request+0x3a/0x100
 nvme_tcp_queue_rq+0x159/0x180 [nvme_tcp]
 __blk_mq_try_issue_directly+0x116/0x1e0
 blk_mq_try_issue_list_directly+0x174/0x2b0
 blk_mq_sched_insert_requests+0xa5/0xf0
 blk_mq_flush_plug_list+0x106/0x1b0
 blk_flush_plug_list+0xdd/0x100
 blk_finish_plug+0x29/0x40
 __blkdev_direct_IO+0x2ef/0x480
 ? aio_fsync_work+0xf0/0xf0
 blkdev_direct_IO+0x56/0x80
 generic_file_read_iter+0x9c/0x140
 blkdev_read_iter+0x35/0x40
 aio_read+0xe0/0x1a0
 ? __cond_resched+0x35/0x50
 ? slab_pre_alloc_hook.constprop.0+0x96/0xe0
 __io_submit_one.constprop.0+0x107/0x1f0
 io_submit_one+0xe3/0x3a0
 __x64_sys_io_submit+0x84/0x180
 do_syscall_64+0x38/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae
* Re: nvme tcp receive errors
  2021-04-02 17:11 ` Keith Busch
@ 2021-04-02 17:27   ` Sagi Grimberg
  2021-04-05 14:37     ` Keith Busch
  0 siblings, 1 reply; 50+ messages in thread
From: Sagi Grimberg @ 2021-04-02 17:27 UTC (permalink / raw)
To: Keith Busch; +Cc: linux-nvme, hch

>> Thanks for the reply.
>>
>> This was observed on the recent 5.12-rc4, so it has all the latest tcp
>> fixes. I'll check with reverting 0dc9edaf80ea and see if that makes a
>> difference. It is currently reproducible, though it can take over an
>> hour right now.
>
> After reverting 0dc9edaf80ea, we are observing a kernel panic (below).

Ah, that's probably because WRITE_ZEROES requests are not set with
RQF_SPECIAL.. This patch is actually needed.

> We'll try adding it back, plus adding your debug patch.

Yes, that would give us more info about the state the request is in
when getting these errors.
* Re: nvme tcp receive errors
  2021-04-02 17:27 ` Sagi Grimberg
@ 2021-04-05 14:37   ` Keith Busch
  2021-04-07 19:53     ` Keith Busch
  2021-04-09 18:04     ` Sagi Grimberg
  0 siblings, 2 replies; 50+ messages in thread
From: Keith Busch @ 2021-04-05 14:37 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: linux-nvme, hch

On Fri, Apr 02, 2021 at 10:27:11AM -0700, Sagi Grimberg wrote:
> > > Thanks for the reply.
> > >
> > > This was observed on the recent 5.12-rc4, so it has all the latest tcp
> > > fixes. I'll check with reverting 0dc9edaf80ea and see if that makes a
> > > difference. It is currently reproducible, though it can take over an
> > > hour right now.
> >
> > After reverting 0dc9edaf80ea, we are observing a kernel panic (below).
>
> Ah, that's probably because WRITE_ZEROES requests are not set with
> RQF_SPECIAL.. This patch is actually needed.
>
> > We'll try adding it back, plus adding your debug patch.
>
> Yes, that would give us more info about the state the request is in
> when getting these errors.

We have recreated with your debug patch:

  nvme nvme4: queue 6 no space in request 0x1 no space cmd_state 3

State 3 corresponds to NVME_TCP_CMD_DATA_DONE.

The summary from the test that I received:

We have an Ethernet trace for this failure. I filtered the trace for the
connection that maps to "queue 6 of nvme4" and tracked the state of the IO
command with Command ID 0x1 ("Tag 0x1"). The sequence for this command per
the Ethernet trace is:

1. The target receives this Command in an Ethernet frame that has 9 Command
   capsules and a partial H2CDATA PDU. The Command with ID 0x1 is a Read
   operation for 16K IO size.
2. The target sends 11 frames of C2HDATA PDUs, each with 1416 bytes, and one
   C2HDATA PDU with 832 bytes to complete the 16K transfer. The LAS flag is
   set in the last PDU.
3. The target sends a Response for this Command.
4. About 1.3 ms later, the Host logs this msg and closes the connection.

Please let us know if you need any additional information.
* Re: nvme tcp receive errors
  2021-04-05 14:37 ` Keith Busch
@ 2021-04-07 19:53   ` Keith Busch
  2021-04-09 21:38     ` Sagi Grimberg
  0 siblings, 1 reply; 50+ messages in thread
From: Keith Busch @ 2021-04-07 19:53 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: linux-nvme, hch

On Mon, Apr 05, 2021 at 11:37:02PM +0900, Keith Busch wrote:
> On Fri, Apr 02, 2021 at 10:27:11AM -0700, Sagi Grimberg wrote:
> > > > Thanks for the reply.
> > > >
> > > > This was observed on the recent 5.12-rc4, so it has all the latest tcp
> > > > fixes. I'll check with reverting 0dc9edaf80ea and see if that makes a
> > > > difference. It is currently reproducible, though it can take over an
> > > > hour right now.
> > >
> > > After reverting 0dc9edaf80ea, we are observing a kernel panic (below).
> >
> > Ah, that's probably because WRITE_ZEROES requests are not set with
> > RQF_SPECIAL.. This patch is actually needed.
> >
> > > We'll try adding it back, plus adding your debug patch.
> >
> > Yes, that would give us more info about the state the request is in
> > when getting these errors.
>
> We have recreated with your debug patch:
>
> nvme nvme4: queue 6 no space in request 0x1 no space cmd_state 3
>
> State 3 corresponds to NVME_TCP_CMD_DATA_DONE.
>
> The summary from the test that I received:
>
> We have an Ethernet trace for this failure. I filtered the trace for the
> connection that maps to "queue 6 of nvme4" and tracked the state of the IO
> command with Command ID 0x1 ("Tag 0x1"). The sequence for this command per
> the Ethernet trace is:
>
> 1. The target receives this Command in an Ethernet frame that has 9 Command
>    capsules and a partial H2CDATA PDU. The Command with ID 0x1 is a Read
>    operation for 16K IO size.
> 2. The target sends 11 frames of C2HDATA PDUs, each with 1416 bytes, and one
>    C2HDATA PDU with 832 bytes to complete the 16K transfer. The LAS flag is
>    set in the last PDU.
> 3. The target sends a Response for this Command.
> 4. About 1.3 ms later, the Host logs this msg and closes the connection.
>
> Please let us know if you need any additional information.

I'm not sure if this is just a different symptom of the same problem,
but with the debug patch, we're occasionally hitting messages like:

  nvme nvme5: req 8 r2t len 16384 exceeded data len 16384 (8192 sent) cmd_state 2
* Re: nvme tcp receive errors
  2021-04-07 19:53 ` Keith Busch
@ 2021-04-09 21:38   ` Sagi Grimberg
  2021-04-27 23:39     ` Keith Busch
  0 siblings, 1 reply; 50+ messages in thread
From: Sagi Grimberg @ 2021-04-09 21:38 UTC (permalink / raw)
To: Keith Busch; +Cc: linux-nvme, hch

>>>>> This was observed on the recent 5.12-rc4, so it has all the latest tcp
>>>>> fixes. I'll check with reverting 0dc9edaf80ea and see if that makes a
>>>>> difference. It is currently reproducible, though it can take over an
>>>>> hour right now.
>>>>
>>>> After reverting 0dc9edaf80ea, we are observing a kernel panic (below).
>>>
>>> Ah, that's probably because WRITE_ZEROES requests are not set with
>>> RQF_SPECIAL.. This patch is actually needed.
>>>
>>>> We'll try adding it back, plus adding your debug patch.
>>>
>>> Yes, that would give us more info about the state the request is in
>>> when getting these errors.
>>
>> We have recreated with your debug patch:
>>
>> nvme nvme4: queue 6 no space in request 0x1 no space cmd_state 3
>>
>> State 3 corresponds to NVME_TCP_CMD_DATA_DONE.
>>
>> The summary from the test that I received:
>>
>> We have an Ethernet trace for this failure. I filtered the trace for the
>> connection that maps to "queue 6 of nvme4" and tracked the state of the IO
>> command with Command ID 0x1 ("Tag 0x1"). The sequence for this command per
>> the Ethernet trace is:
>>
>> 1. The target receives this Command in an Ethernet frame that has 9 Command
>>    capsules and a partial H2CDATA PDU. The Command with ID 0x1 is a Read
>>    operation for 16K IO size.
>> 2. The target sends 11 frames of C2HDATA PDUs, each with 1416 bytes, and one
>>    C2HDATA PDU with 832 bytes to complete the 16K transfer. The LAS flag is
>>    set in the last PDU.
>> 3. The target sends a Response for this Command.
>> 4. About 1.3 ms later, the Host logs this msg and closes the connection.
>>
>> Please let us know if you need any additional information.
>
> I'm not sure if this is just a different symptom of the same problem,
> but with the debug patch, we're occasionally hitting messages like:
>
> nvme nvme5: req 8 r2t len 16384 exceeded data len 16384 (8192 sent) cmd_state 2

According to this message, the host got an r2t for 16384 bytes after it
already sent 8192 (which can only happen if it previously got an r2t
soliciting 8192 bytes or more that accumulate to that). Can you share,
for each r2t pdu in this sequence:

  r2t_length
  r2t_offset
* Re: nvme tcp receive errors
  2021-04-09 21:38 ` Sagi Grimberg
@ 2021-04-27 23:39   ` Keith Busch
  2021-04-27 23:55     ` Sagi Grimberg
  0 siblings, 1 reply; 50+ messages in thread
From: Keith Busch @ 2021-04-27 23:39 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: linux-nvme, hch

On Fri, Apr 09, 2021 at 02:38:38PM -0700, Sagi Grimberg wrote:
> > > > > > This was observed on the recent 5.12-rc4, so it has all the latest tcp
> > > > > > fixes. I'll check with reverting 0dc9edaf80ea and see if that makes a
> > > > > > difference. It is currently reproducible, though it can take over an
> > > > > > hour right now.
> > > > >
> > > > > After reverting 0dc9edaf80ea, we are observing a kernel panic (below).
> > > >
> > > > Ah, that's probably because WRITE_ZEROES requests are not set with
> > > > RQF_SPECIAL.. This patch is actually needed.
> > > >
> > > > > We'll try adding it back, plus adding your debug patch.
> > > >
> > > > Yes, that would give us more info about the state the request is in
> > > > when getting these errors.
> > >
> > > We have recreated with your debug patch:
> > >
> > > nvme nvme4: queue 6 no space in request 0x1 no space cmd_state 3
> > >
> > > State 3 corresponds to NVME_TCP_CMD_DATA_DONE.
> > >
> > > The summary from the test that I received:
> > >
> > > We have an Ethernet trace for this failure. I filtered the trace for the
> > > connection that maps to "queue 6 of nvme4" and tracked the state of the IO
> > > command with Command ID 0x1 ("Tag 0x1"). The sequence for this command per
> > > the Ethernet trace is:
> > >
> > > 1. The target receives this Command in an Ethernet frame that has 9 Command
> > >    capsules and a partial H2CDATA PDU. The Command with ID 0x1 is a Read
> > >    operation for 16K IO size.
> > > 2. The target sends 11 frames of C2HDATA PDUs, each with 1416 bytes, and one
> > >    C2HDATA PDU with 832 bytes to complete the 16K transfer. The LAS flag is
> > >    set in the last PDU.
> > > 3. The target sends a Response for this Command.
> > > 4. About 1.3 ms later, the Host logs this msg and closes the connection.
> > >
> > > Please let us know if you need any additional information.
> >
> > I'm not sure if this is just a different symptom of the same problem,
> > but with the debug patch, we're occasionally hitting messages like:
> >
> > nvme nvme5: req 8 r2t len 16384 exceeded data len 16384 (8192 sent) cmd_state 2
>
> According to this message, the host got an r2t for 16384 bytes after it
> already sent 8192 (which can only happen if it previously got an r2t
> soliciting 8192 bytes or more that accumulate to that). Can you share,
> for each r2t pdu in this sequence:
>
>   r2t_length
>   r2t_offset

This one took a bit to go through. The traces all only have a single r2t
pdu with 0 offset for the full length of the requested transfer. I had
to add trace events to see what the heck the driver is thinking, and the
result is even more confusing.

The kernel error message:

  nvme5: req 5 op 1 r2t len 16384 exceeded data len 16384 (4096 sent)

And all the new trace events for this request are:

  fio-25086   [011] ....  9630.542669: nvme_tcp_queue_rq: nvme5: qid=4 tag=5 op=1 data_len=16384
  fio-25093   [007] ....  9630.542854: nvme_tcp_cmd_pdu: nvme5: qid=4 tag=5 op=1 page_offset=3664 send_len=72
  <...>-22670 [003] ....  9630.544377: nvme_tcp_r2t: nvme5: qid=4 tag=5 op=1 r2t_len=16384 r2t_offset=0 data_sent=4096 data_len=16384

The fact that "data_sent" is non-zero on the very first r2t makes no
sense to me. I so far cannot find any sequence where that could happen.
Here's the tracepoints patch (I think something like this is worthy of being included in the driver): --- diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index 9c1f3d273fcf..931197fa4ef4 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -141,6 +141,9 @@ struct nvme_tcp_ctrl { u32 io_queues[HCTX_MAX_TYPES]; }; +#define CREATE_TRACE_POINTS +#include "trace-tcp.h" + static LIST_HEAD(nvme_tcp_ctrl_list); static DEFINE_MUTEX(nvme_tcp_ctrl_mutex); static struct workqueue_struct *nvme_tcp_wq; @@ -582,10 +585,13 @@ static int nvme_tcp_setup_h2c_data_pdu(struct nvme_tcp_request *req, struct request *rq = blk_mq_rq_from_pdu(req); u8 hdgst = nvme_tcp_hdgst_len(queue); u8 ddgst = nvme_tcp_ddgst_len(queue); + unsigned r2t_offset = le32_to_cpu(pdu->r2t_offset); req->pdu_len = le32_to_cpu(pdu->r2t_length); req->pdu_sent = 0; + trace_nvme_tcp_r2t(rq, req, r2t_offset); + if (unlikely(!req->pdu_len)) { dev_err(queue->ctrl->ctrl.device, "req %d r2t len is %u, probably a bug...\n", @@ -595,17 +601,16 @@ static int nvme_tcp_setup_h2c_data_pdu(struct nvme_tcp_request *req, if (unlikely(req->data_sent + req->pdu_len > req->data_len)) { dev_err(queue->ctrl->ctrl.device, - "req %d r2t len %u exceeded data len %u (%zu sent)\n", - rq->tag, req->pdu_len, req->data_len, - req->data_sent); + "req %d op %d r2t len %u exceeded data len %u (%zu sent) pdu offset %u\n", + rq->tag, req_op(rq), req->pdu_len, req->data_len, + req->data_sent, r2t_offset); return -EPROTO; } - if (unlikely(le32_to_cpu(pdu->r2t_offset) < req->data_sent)) { + if (unlikely(r2t_offset < req->data_sent)) { dev_err(queue->ctrl->ctrl.device, "req %d unexpected r2t offset %u (expected %zu)\n", - rq->tag, le32_to_cpu(pdu->r2t_offset), - req->data_sent); + rq->tag, r2t_offset, req->data_sent); return -EPROTO; } @@ -943,6 +948,7 @@ static void nvme_tcp_fail_request(struct nvme_tcp_request *req) static int nvme_tcp_try_send_data(struct nvme_tcp_request *req) { struct nvme_tcp_queue *queue = 
req->queue; + struct request *rq = blk_mq_rq_from_pdu(req); while (true) { struct page *page = nvme_tcp_req_cur_page(req); @@ -956,6 +962,7 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req) else flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST; + trace_nvme_tcp_data(rq, req, offset, len, last); if (sendpage_ok(page)) { ret = kernel_sendpage(queue->sock, page, offset, len, flags); @@ -977,9 +984,9 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req) nvme_tcp_ddgst_final(queue->snd_hash, &req->ddgst); req->state = NVME_TCP_SEND_DDGST; - req->cmd_state = NVME_TCP_CMD_DATA_DONE; req->offset = 0; } else { + req->cmd_state = NVME_TCP_CMD_DATA_DONE; nvme_tcp_done_send_req(queue); } return 1; @@ -996,7 +1003,8 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req) u8 hdgst = nvme_tcp_hdgst_len(queue); int len = sizeof(*pdu) + hdgst - req->offset; int flags = MSG_DONTWAIT; - int ret; + struct request *rq = blk_mq_rq_from_pdu(req); + int offset, ret; if (inline_data || nvme_tcp_queue_more(queue)) flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST; @@ -1006,8 +1014,10 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req) if (queue->hdr_digest && !req->offset) nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu)); - ret = kernel_sendpage(queue->sock, virt_to_page(pdu), - offset_in_page(pdu) + req->offset, len, flags); + offset = offset_in_page(pdu) + req->offset; + trace_nvme_tcp_cmd_pdu(rq, req, offset, len); + ret = kernel_sendpage(queue->sock, virt_to_page(pdu), offset, + len, flags); if (unlikely(ret <= 0)) return ret; @@ -1034,14 +1044,16 @@ static int nvme_tcp_try_send_data_pdu(struct nvme_tcp_request *req) struct nvme_tcp_data_pdu *pdu = req->pdu; u8 hdgst = nvme_tcp_hdgst_len(queue); int len = sizeof(*pdu) - req->offset + hdgst; - int ret; + struct request *rq = blk_mq_rq_from_pdu(req); + int offset, ret; if (queue->hdr_digest && !req->offset) nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu)); - ret = 
kernel_sendpage(queue->sock, virt_to_page(pdu), - offset_in_page(pdu) + req->offset, len, - MSG_DONTWAIT | MSG_MORE | MSG_SENDPAGE_NOTLAST); + offset = offset_in_page(pdu) + req->offset; + trace_nvme_tcp_h2c_pdu(rq, req, offset, len); + ret = kernel_sendpage(queue->sock, virt_to_page(pdu), offset, + len, MSG_DONTWAIT | MSG_MORE | MSG_SENDPAGE_NOTLAST); if (unlikely(ret <= 0)) return ret; @@ -2367,6 +2379,7 @@ static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx, if (!nvmf_check_ready(&queue->ctrl->ctrl, rq, queue_ready)) return nvmf_fail_nonready_command(&queue->ctrl->ctrl, rq); + trace_nvme_tcp_queue_rq(rq); ret = nvme_tcp_setup_cmd_pdu(ns, rq); if (unlikely(ret)) return ret; diff --git a/drivers/nvme/host/trace-tcp.h b/drivers/nvme/host/trace-tcp.h new file mode 100644 index 000000000000..0fc8f4c51ee2 --- /dev/null +++ b/drivers/nvme/host/trace-tcp.h @@ -0,0 +1,154 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#undef TRACE_SYSTEM +#define TRACE_SYSTEM nvme_tcp + +#if !defined(_TRACE_NVME_TCP_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_NVME_TCP_H + +#include <linux/nvme.h> +#include <linux/tracepoint.h> +#include <linux/trace_seq.h> + +#include "nvme.h" + +TRACE_EVENT(nvme_tcp_queue_rq, + TP_PROTO(struct request *rq), + TP_ARGS(rq), + TP_STRUCT__entry( + __field(unsigned, ctrl_id) + __field(unsigned, qid) + __field(unsigned, op) + __field(unsigned, tag) + __field(unsigned, len) + ), + TP_fast_assign( + __entry->ctrl_id = nvme_req(rq)->ctrl->instance; + __entry->qid = nvme_req_qid(rq); + __entry->op = req_op(rq); + __entry->tag = rq->tag; + __entry->len = blk_rq_payload_bytes(rq); + ), + TP_printk("nvme%u: qid=%u tag=%x op=%x data_len=%u", + __entry->ctrl_id, __entry->qid, __entry->tag, __entry->op, + __entry->len + ) +); + +TRACE_EVENT(nvme_tcp_cmd_pdu, + TP_PROTO(struct request *rq, struct nvme_tcp_request *req, unsigned offset, unsigned len), + TP_ARGS(rq, req, offset, len), + TP_STRUCT__entry( + __field(unsigned, ctrl_id) + 
__field(unsigned, qid) + __field(unsigned, op) + __field(unsigned, tag) + __field(unsigned, page_offset) + __field(unsigned, len) + ), + TP_fast_assign( + __entry->ctrl_id = nvme_req(rq)->ctrl->instance; + __entry->qid = nvme_req_qid(rq); + __entry->op = req_op(rq); + __entry->tag = rq->tag; + __entry->page_offset = offset; + __entry->len = len; + ), + TP_printk("nvme%u: qid=%u tag=%x op=%x page_offset=%u send_len=%u", + __entry->ctrl_id, __entry->qid, __entry->tag, __entry->op, + __entry->page_offset, __entry->len + ) +); + +TRACE_EVENT(nvme_tcp_h2c_pdu, + TP_PROTO(struct request *rq, struct nvme_tcp_request *req, unsigned offset, unsigned len), + TP_ARGS(rq, req, offset, len), + TP_STRUCT__entry( + __field(unsigned, ctrl_id) + __field(unsigned, qid) + __field(unsigned, op) + __field(unsigned, tag) + __field(unsigned, page_offset) + __field(unsigned, len) + ), + TP_fast_assign( + __entry->ctrl_id = nvme_req(rq)->ctrl->instance; + __entry->qid = nvme_req_qid(rq); + __entry->op = req_op(rq); + __entry->tag = rq->tag; + __entry->page_offset = offset; + __entry->len = len; + ), + TP_printk("nvme%u: qid=%u tag=%x op=%x page_offset=%u send_len=%u", + __entry->ctrl_id, __entry->qid, __entry->tag, __entry->op, + __entry->page_offset, __entry->len + ) +); + +TRACE_EVENT(nvme_tcp_data, + TP_PROTO(struct request *rq, struct nvme_tcp_request *req, unsigned offset, unsigned len, bool last), + TP_ARGS(rq, req, offset, len, last), + TP_STRUCT__entry( + __field(unsigned, ctrl_id) + __field(unsigned, qid) + __field(unsigned, op) + __field(unsigned, tag) + __field(unsigned, page_offset) + __field(unsigned, len) + __field(unsigned, sent) + __field(bool, last) + ), + TP_fast_assign( + __entry->ctrl_id = nvme_req(rq)->ctrl->instance; + __entry->qid = nvme_req_qid(rq); + __entry->op = req_op(rq); + __entry->tag = rq->tag; + __entry->page_offset = offset; + __entry->len = len; + __entry->sent = req->pdu_sent; + __entry->last = last; + ), + TP_printk("nvme%u: qid=%u tag=%x op=%x 
page_offset=%u send_len=%u pdu_sent=%u last=%u", + __entry->ctrl_id, __entry->qid, __entry->tag, __entry->op, + __entry->page_offset, __entry->len, __entry->sent, __entry->last + ) +); + +TRACE_EVENT(nvme_tcp_r2t, + TP_PROTO(struct request *rq, struct nvme_tcp_request *req, unsigned offset), + TP_ARGS(rq, req, offset), + TP_STRUCT__entry( + __field(unsigned, ctrl_id) + __field(unsigned, qid) + __field(unsigned, op) + __field(unsigned, tag) + __field(unsigned, len) + __field(unsigned, offset) + __field(unsigned, sent) + __field(unsigned, data_len) + ), + TP_fast_assign( + __entry->ctrl_id = nvme_req(rq)->ctrl->instance; + __entry->qid = nvme_req_qid(rq); + __entry->op = req_op(rq); + __entry->tag = rq->tag; + __entry->len = req->pdu_len; + __entry->offset = offset; + __entry->sent = req->data_sent; + __entry->data_len = req->data_len; + ), + TP_printk("nvme%u: qid=%u tag=%x op=%x r2t_len=%u r2t_offset=%u data_sent=%u data_len=%u", + __entry->ctrl_id, __entry->qid, __entry->tag, __entry->op, + __entry->len, __entry->offset, __entry->sent, __entry->data_len + ) +); + +#endif /* _TRACE_NVME_TCP_H */ + +#undef TRACE_INCLUDE_PATH +#define TRACE_INCLUDE_PATH . +#undef TRACE_INCLUDE_FILE +#define TRACE_INCLUDE_FILE trace-tcp + +/* This part must be outside protection */ +#include <trace/define_trace.h> -- _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-04-27 23:39 ` Keith Busch @ 2021-04-27 23:55 ` Sagi Grimberg 2021-04-28 15:58 ` Keith Busch 0 siblings, 1 reply; 50+ messages in thread From: Sagi Grimberg @ 2021-04-27 23:55 UTC (permalink / raw) To: Keith Busch; +Cc: linux-nvme, hch >>>>>>> This was observed on the recent 5.12-rc4, so it has all the latest tcp >>>>>>> fixes. I'll check with reverting 0dc9edaf80ea and see if that makes a >>>>>>> difference. It is currently reproducible, though it can take over an >>>>>>> hour right now. >>>>>> >>>>>> After reverting 0dc9edaf80ea, we are observing a kernel panic (below). >>>>> >>>>> Ah, that's probably because WRITE_ZEROS are not set with RQF_SPECIAL.. >>>>> This patch is actually needed. >>>>> >>>>> >>>>>> We'll try adding it back, plust adding your debug patch. >>>>> >>>>> Yes, that would give us more info about what is the state the >>>>> request is in when getting these errors >>>> >>>> We have recreated with your debug patch: >>>> >>>> nvme nvme4: queue 6 no space in request 0x1 no space cmd_state 3 >>>> >>>> State 3 corresponds to the "NVME_TCP_CMD_DATA_DONE". >>>> >>>> The summary from the test that I received: >>>> >>>> We have an Ethernet trace for this failure. I filtered the trace for the >>>> connection that maps to "queue 6 of nvme4" and tracked the state of the IO >>>> command with Command ID 0x1 ("Tag 0x1"). The sequence for this command per >>>> the Ethernet trace is: >>>> >>>> 1. The target receives this Command in an Ethernet frame that has 9 Command >>>> capsules and a partial H2CDATA PDU. The Command with ID 0x1 is a Read >>>> operation for 16K IO size >>>> 2. The target sends 11 frames of C2HDATA PDU's each with 1416 bytes and one >>>> C2HDATA PDU with 832 bytes to complete the 16K transfer. LAS flag is set >>>> in the last PDU. >>>> 3. The target sends a Response for this Command. >>>> 4. About 1.3 ms later, the Host logs this msg and closes the connection. 
>>>> >>>> Please let us know if you need any additional information. >>> >>> I'm not sure if this is just a different symptom of the same problem, >>> but with the debug patch, we're occasionally hitting messages like: >>> >>> nvme nvme5: req 8 r2t len 16384 exceeded data len 16384 (8192 sent) cmd_state 2 >> >> According to this message, this means the host got an r2t for 16384 >> bytes after it already sent 8192 (which can only happen if it previously >> got an r2t soliciting 8192 bytes or more that accumulate to that). Can >> you share for each r2t pdus in this sequence: >> r2t_length >> r2t_offset > > This one took a bit to go through. The traces all only have a single r2t > pdu with 0 offset for the full length of the requested transfer. I had > to add trace events to see what the heck the driver is thinking, and > the result is even more confusing. > > The kernel message error: > > nvme5: req 5 op 1 r2t len 16384 exceeded data len 16384 (4096 sent) > > And all the new trace events for this request are: > > fio-25086 [011] .... 9630.542669: nvme_tcp_queue_rq: nvme5: qid=4 tag=5 op=1 data_len=16384 > fio-25093 [007] .... 9630.542854: nvme_tcp_cmd_pdu: nvme5: qid=4 tag=5 op=1 page_offset=3664 send_len=72 > <...>-22670 [003] .... 9630.544377: nvme_tcp_r2t: nvme5: qid=4 tag=5 op=1 r2t_len=16384 r2t_offset=0 data_sent=4096 data_len=16384 > > The fact "data_sent" is non-zero on the very first r2t makes no sense to me. Yep, not supposed to happen... Like the traces though! very useful and absolutely worth having. > I so far can not find any sequence where that could happen. The only way that data_sent can increment is either by getting an r2t solicitation or by sending incapsule data. What is the ioccsz the controller is exposing? 
in nvme_tcp_send_cmd_pdu we have: -- if (inline_data) { req->state = NVME_TCP_SEND_DATA; if (queue->data_digest) crypto_ahash_init(queue->snd_hash); } else { nvme_tcp_done_send_req(queue); } -- Where the inline_data flag is: bool inline_data = nvme_tcp_has_inline_data(req); which essentially boils down to: req->data_len <= queue->cmnd_capsule_len - sizeof(struct nvme_command); I wonder what the nvme command sgl looks like. Is it an offset sgl? _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-04-27 23:55 ` Sagi Grimberg @ 2021-04-28 15:58 ` Keith Busch 2021-04-28 17:42 ` Sagi Grimberg 0 siblings, 1 reply; 50+ messages in thread From: Keith Busch @ 2021-04-28 15:58 UTC (permalink / raw) To: Sagi Grimberg; +Cc: linux-nvme, hch On Tue, Apr 27, 2021 at 04:55:43PM -0700, Sagi Grimberg wrote: > > > > > > > > > This was observed on the recent 5.12-rc4, so it has all the latest tcp > > > > > > > > fixes. I'll check with reverting 0dc9edaf80ea and see if that makes a > > > > > > > > difference. It is currently reproducible, though it can take over an > > > > > > > > hour right now. > > > > > > > > > > > > > > After reverting 0dc9edaf80ea, we are observing a kernel panic (below). > > > > > > > > > > > > Ah, that's probably because WRITE_ZEROS are not set with RQF_SPECIAL.. > > > > > > This patch is actually needed. > > > > > > > > > > > > > > > > > > > We'll try adding it back, plust adding your debug patch. > > > > > > > > > > > > Yes, that would give us more info about what is the state the > > > > > > request is in when getting these errors > > > > > > > > > > We have recreated with your debug patch: > > > > > > > > > > nvme nvme4: queue 6 no space in request 0x1 no space cmd_state 3 > > > > > > > > > > State 3 corresponds to the "NVME_TCP_CMD_DATA_DONE". > > > > > > > > > > The summary from the test that I received: > > > > > > > > > > We have an Ethernet trace for this failure. I filtered the trace for the > > > > > connection that maps to "queue 6 of nvme4" and tracked the state of the IO > > > > > command with Command ID 0x1 ("Tag 0x1"). The sequence for this command per > > > > > the Ethernet trace is: > > > > > > > > > > 1. The target receives this Command in an Ethernet frame that has 9 Command > > > > > capsules and a partial H2CDATA PDU. The Command with ID 0x1 is a Read > > > > > operation for 16K IO size > > > > > 2. 
The target sends 11 frames of C2HDATA PDU's each with 1416 bytes and one > > > > > C2HDATA PDU with 832 bytes to complete the 16K transfer. LAS flag is set > > > > > in the last PDU. > > > > > 3. The target sends a Response for this Command. > > > > > 4. About 1.3 ms later, the Host logs this msg and closes the connection. > > > > > > > > > > Please let us know if you need any additional information. > > > > > > > > I'm not sure if this is just a different symptom of the same problem, > > > > but with the debug patch, we're occasionally hitting messages like: > > > > > > > > nvme nvme5: req 8 r2t len 16384 exceeded data len 16384 (8192 sent) cmd_state 2 > > > > > > According to this message, this means the host got an r2t for 16384 > > > bytes after it already sent 8192 (which can only happen if it previously > > > got an r2t soliciting 8192 bytes or more that accumulate to that). Can > > > you share for each r2t pdus in this sequence: > > > r2t_length > > > r2t_offset > > > > This one took a bit to go through. The traces all only have a single r2t > > pdu with 0 offset for the full length of the requested transfer. I had > > to add trace events to see what the heck the driver is thinking, and > > the result is even more confusing. > > > > The kernel message error: > > > > nvme5: req 5 op 1 r2t len 16384 exceeded data len 16384 (4096 sent) > > > > And all the new trace events for this request are: > > > > fio-25086 [011] .... 9630.542669: nvme_tcp_queue_rq: nvme5: qid=4 tag=5 op=1 data_len=16384 > > fio-25093 [007] .... 9630.542854: nvme_tcp_cmd_pdu: nvme5: qid=4 tag=5 op=1 page_offset=3664 send_len=72 > > <...>-22670 [003] .... 9630.544377: nvme_tcp_r2t: nvme5: qid=4 tag=5 op=1 r2t_len=16384 r2t_offset=0 data_sent=4096 data_len=16384 > > > > The fact "data_sent" is non-zero on the very first r2t makes no sense to me. > > Yep, not supposed to happen... Like the traces though! very useful and > absolutely worth having. 
> > > I so far can not find any sequence where that could happen. > > The only way that data_sent can increment is either by getting an r2t > solicitation or by sending incapsule data. > > What is the ioccsz the controller is exposing? In capsule data is not supported by this target, so ioccsz is 4. > in nvme_tcp_sned_cmd_pdu we have: > -- > if (inline_data) { > req->state = NVME_TCP_SEND_DATA; > if (queue->data_digest) > crypto_ahash_init(queue->snd_hash); > } else { > nvme_tcp_done_send_req(queue); > } > -- > > Where inline_data flag is: > bool inline_data = nvme_tcp_has_inline_data(req); > which essentially boils down to: > req->data_len <= queue->cmnd_capsule_len - sizeof(struct nvme_command); > > I wonder how does the nvme command sgl look like? is it an offset > sgl? Just a single data block descriptor SGL. The target supports only 1 and reports that through ID_CTRL.MSDBD. What do you mean by "offset" SGL? _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-04-28 15:58 ` Keith Busch @ 2021-04-28 17:42 ` Sagi Grimberg 2021-04-28 18:01 ` Keith Busch 0 siblings, 1 reply; 50+ messages in thread From: Sagi Grimberg @ 2021-04-28 17:42 UTC (permalink / raw) To: Keith Busch; +Cc: linux-nvme, hch >>> This one took a bit to go through. The traces all only have a single r2t >>> pdu with 0 offset for the full length of the requested transfer. I had >>> to add trace events to see what the heck the driver is thinking, and >>> the result is even more confusing. >>> >>> The kernel message error: >>> >>> nvme5: req 5 op 1 r2t len 16384 exceeded data len 16384 (4096 sent) >>> >>> And all the new trace events for this request are: >>> >>> fio-25086 [011] .... 9630.542669: nvme_tcp_queue_rq: nvme5: qid=4 tag=5 op=1 data_len=16384 >>> fio-25093 [007] .... 9630.542854: nvme_tcp_cmd_pdu: nvme5: qid=4 tag=5 op=1 page_offset=3664 send_len=72 >>> <...>-22670 [003] .... 9630.544377: nvme_tcp_r2t: nvme5: qid=4 tag=5 op=1 r2t_len=16384 r2t_offset=0 data_sent=4096 data_len=16384 >>> >>> The fact "data_sent" is non-zero on the very first r2t makes no sense to me. >> >> Yep, not supposed to happen... Like the traces though! very useful and >> absolutely worth having. >> >>> I so far can not find any sequence where that could happen. >> >> The only way that data_sent can increment is either by getting an r2t >> solicitation or by sending incapsule data. >> >> What is the ioccsz the controller is exposing? > > In capsule data is not supported by this target, so ioccsz is 4. 
> >> in nvme_tcp_sned_cmd_pdu we have: >> -- >> if (inline_data) { >> req->state = NVME_TCP_SEND_DATA; >> if (queue->data_digest) >> crypto_ahash_init(queue->snd_hash); >> } else { >> nvme_tcp_done_send_req(queue); >> } >> -- >> >> Where inline_data flag is: >> bool inline_data = nvme_tcp_has_inline_data(req); >> which essentially boils down to: >> req->data_len <= queue->cmnd_capsule_len - sizeof(struct nvme_command); >> >> I wonder how does the nvme command sgl look like? is it an offset >> sgl? > > Just a single data block descriptor SGL. The target supports only 1 and > reports that through ID_CTRL.MSDBD. > > What do you mean by "offset" SGL? In tcp.c: -- static void nvme_tcp_set_sg_inline(struct nvme_tcp_queue *queue, struct nvme_command *c, u32 data_len) { struct nvme_sgl_desc *sg = &c->common.dptr.sgl; sg->addr = cpu_to_le64(queue->ctrl->ctrl.icdoff); sg->length = cpu_to_le32(data_len); sg->type = (NVME_SGL_FMT_DATA_DESC << 4) | NVME_SGL_FMT_OFFSET; } static void nvme_tcp_set_sg_host_data(struct nvme_command *c, u32 data_len) { struct nvme_sgl_desc *sg = &c->common.dptr.sgl; sg->addr = 0; sg->length = cpu_to_le32(data_len); sg->type = (NVME_TRANSPORT_SGL_DATA_DESC << 4) | NVME_SGL_FMT_TRANSPORT_A; } -- What is the sgl type you see in the traces? transport specific sgl (host-data i.e. non-incapsule) or inline? _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-04-28 17:42 ` Sagi Grimberg @ 2021-04-28 18:01 ` Keith Busch 2021-04-28 23:06 ` Sagi Grimberg 0 siblings, 1 reply; 50+ messages in thread From: Keith Busch @ 2021-04-28 18:01 UTC (permalink / raw) To: Sagi Grimberg; +Cc: linux-nvme, hch On Wed, Apr 28, 2021 at 10:42:13AM -0700, Sagi Grimberg wrote: > In tcp.c: > -- > static void nvme_tcp_set_sg_inline(struct nvme_tcp_queue *queue, > struct nvme_command *c, u32 data_len) > { > struct nvme_sgl_desc *sg = &c->common.dptr.sgl; > > sg->addr = cpu_to_le64(queue->ctrl->ctrl.icdoff); > sg->length = cpu_to_le32(data_len); > sg->type = (NVME_SGL_FMT_DATA_DESC << 4) | NVME_SGL_FMT_OFFSET; > } > > static void nvme_tcp_set_sg_host_data(struct nvme_command *c, > u32 data_len) > { > struct nvme_sgl_desc *sg = &c->common.dptr.sgl; > > sg->addr = 0; > sg->length = cpu_to_le32(data_len); > sg->type = (NVME_TRANSPORT_SGL_DATA_DESC << 4) | > NVME_SGL_FMT_TRANSPORT_A; > } > -- > > What is the sgl type you see in the traces? transport specific sgl > (host-data i.e. non-incapsule) or inline? The Sub Type is 0xA, Transport Specific. _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-04-28 18:01 ` Keith Busch @ 2021-04-28 23:06 ` Sagi Grimberg 2021-04-29 3:33 ` Keith Busch 0 siblings, 1 reply; 50+ messages in thread From: Sagi Grimberg @ 2021-04-28 23:06 UTC (permalink / raw) To: Keith Busch; +Cc: linux-nvme, hch >> In tcp.c: >> -- >> static void nvme_tcp_set_sg_inline(struct nvme_tcp_queue *queue, >> struct nvme_command *c, u32 data_len) >> { >> struct nvme_sgl_desc *sg = &c->common.dptr.sgl; >> >> sg->addr = cpu_to_le64(queue->ctrl->ctrl.icdoff); >> sg->length = cpu_to_le32(data_len); >> sg->type = (NVME_SGL_FMT_DATA_DESC << 4) | NVME_SGL_FMT_OFFSET; >> } >> >> static void nvme_tcp_set_sg_host_data(struct nvme_command *c, >> u32 data_len) >> { >> struct nvme_sgl_desc *sg = &c->common.dptr.sgl; >> >> sg->addr = 0; >> sg->length = cpu_to_le32(data_len); >> sg->type = (NVME_TRANSPORT_SGL_DATA_DESC << 4) | >> NVME_SGL_FMT_TRANSPORT_A; >> } >> -- >> >> What is the sgl type you see in the traces? transport specific sgl >> (host-data i.e. non-incapsule) or inline? > > The Sub Type is 0xA, Transport Specific. Interesting, I don't see how the host is going to send data without it being in-capsule, and before receiving an r2t... Maybe add this one for a sanity check: -- diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index eb1feaacd11a..6cb0e13024e5 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -987,6 +987,8 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req) len -= ret; if (!len) { if (inline_data) { + pr_err("no way... data_len %d queue_max_inline %ld\n", + req->data_len, nvme_tcp_inline_data_size(req->queue)); req->state = NVME_TCP_SEND_DATA; if (queue->data_digest) crypto_ahash_init(queue->snd_hash); -- _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-04-28 23:06 ` Sagi Grimberg @ 2021-04-29 3:33 ` Keith Busch 2021-04-29 4:52 ` Sagi Grimberg 0 siblings, 1 reply; 50+ messages in thread From: Keith Busch @ 2021-04-29 3:33 UTC (permalink / raw) To: Sagi Grimberg; +Cc: linux-nvme, hch On Wed, Apr 28, 2021 at 04:06:12PM -0700, Sagi Grimberg wrote: > > > > In tcp.c: > > > -- > > > static void nvme_tcp_set_sg_inline(struct nvme_tcp_queue *queue, > > > struct nvme_command *c, u32 data_len) > > > { > > > struct nvme_sgl_desc *sg = &c->common.dptr.sgl; > > > > > > sg->addr = cpu_to_le64(queue->ctrl->ctrl.icdoff); > > > sg->length = cpu_to_le32(data_len); > > > sg->type = (NVME_SGL_FMT_DATA_DESC << 4) | NVME_SGL_FMT_OFFSET; > > > } > > > > > > static void nvme_tcp_set_sg_host_data(struct nvme_command *c, > > > u32 data_len) > > > { > > > struct nvme_sgl_desc *sg = &c->common.dptr.sgl; > > > > > > sg->addr = 0; > > > sg->length = cpu_to_le32(data_len); > > > sg->type = (NVME_TRANSPORT_SGL_DATA_DESC << 4) | > > > NVME_SGL_FMT_TRANSPORT_A; > > > } > > > -- > > > > > > What is the sgl type you see in the traces? transport specific sgl > > > (host-data i.e. non-incapsule) or inline? > > > > The Sub Type is 0xA, Transport Specific. > > Interesting, I don't see how the host is going to send data > without it being in-capsule, and before receiving an r2t... The driver tracepoints captured millions of IO's where everything happened as expected, so I really think something got confused and mucked with the wrong request. I've added more trace points to increase visibility because I frankly didn't find how that could happen just from code inspection. We will also incorporate your patch below for the next recreate. 
> Maybe add this one for a sanity check: > -- > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c > index eb1feaacd11a..6cb0e13024e5 100644 > --- a/drivers/nvme/host/tcp.c > +++ b/drivers/nvme/host/tcp.c > @@ -987,6 +987,8 @@ static int nvme_tcp_try_send_cmd_pdu(struct > nvme_tcp_request *req) > len -= ret; > if (!len) { > if (inline_data) { > + pr_err("no way... data_len %d queue_max_inline > %ld\n", > + req->data_len, > nvme_tcp_inline_data_size(req->queue)); > req->state = NVME_TCP_SEND_DATA; > if (queue->data_digest) > crypto_ahash_init(queue->snd_hash); > -- _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-04-29 3:33 ` Keith Busch @ 2021-04-29 4:52 ` Sagi Grimberg 2021-05-03 18:51 ` Keith Busch 0 siblings, 1 reply; 50+ messages in thread From: Sagi Grimberg @ 2021-04-29 4:52 UTC (permalink / raw) To: Keith Busch; +Cc: linux-nvme, hch > The driver tracepoints captured millions of IO's where everything > happened as expected, so I really think something got confused and > mucked with the wrong request. I've added more trace points to increase > visibility because I frankly didn't find how that could happen just from > code inspection. We will also incorporate your patch below for the next > recreate. Keith, does the issue still happen with eliminating the network send from .queue_rq() ? -- diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index eb1feaacd11a..b3fafa536345 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -288,7 +288,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req, * directly, otherwise queue io_work. Also, only do that if we * are on the same cpu, so we don't introduce contention. */ - if (queue->io_cpu == __smp_processor_id() && + if (0 && queue->io_cpu == __smp_processor_id() && sync && empty && mutex_trylock(&queue->send_mutex)) { queue->more_requests = !last; nvme_tcp_send_all(queue); -- _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-04-29 4:52 ` Sagi Grimberg @ 2021-05-03 18:51 ` Keith Busch 2021-05-03 19:58 ` Sagi Grimberg 0 siblings, 1 reply; 50+ messages in thread From: Keith Busch @ 2021-05-03 18:51 UTC (permalink / raw) To: Sagi Grimberg; +Cc: linux-nvme, hch On Wed, Apr 28, 2021 at 09:52:37PM -0700, Sagi Grimberg wrote: > > > The driver tracepoints captured millions of IO's where everything > > happened as expected, so I really think something got confused and > > mucked with the wrong request. I've added more trace points to increase > > visibility because I frankly didn't find how that could happen just from > > code inspection. We will also incorporate your patch below for the next > > recreate. > > Keith, does the issue still happen with eliminating the network send > from .queue_rq() ? This patch is successful at resolving the observed r2t issues after the weekend test run, which is much longer than it could have run previously. I'm happy we're narrowing this down, but I'm not seeing how this addresses the problem. It looks like the mutex single threads the critical parts, but maybe I'm missing something. Any ideas? > -- > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c > index eb1feaacd11a..b3fafa536345 100644 > --- a/drivers/nvme/host/tcp.c > +++ b/drivers/nvme/host/tcp.c > @@ -288,7 +288,7 @@ static inline void nvme_tcp_queue_request(struct > nvme_tcp_request *req, > * directly, otherwise queue io_work. Also, only do that if we > * are on the same cpu, so we don't introduce contention. > */ > - if (queue->io_cpu == __smp_processor_id() && > + if (0 && queue->io_cpu == __smp_processor_id() && > sync && empty && mutex_trylock(&queue->send_mutex)) { > queue->more_requests = !last; > nvme_tcp_send_all(queue); > -- _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-05-03 18:51 ` Keith Busch @ 2021-05-03 19:58 ` Sagi Grimberg 2021-05-03 20:25 ` Keith Busch 0 siblings, 1 reply; 50+ messages in thread From: Sagi Grimberg @ 2021-05-03 19:58 UTC (permalink / raw) To: Keith Busch; +Cc: linux-nvme, hch >>> The driver tracepoints captured millions of IO's where everything >>> happened as expected, so I really think something got confused and >>> mucked with the wrong request. I've added more trace points to increase >>> visibility because I frankly didn't find how that could happen just from >>> code inspection. We will also incorporate your patch below for the next >>> recreate. >> >> Keith, does the issue still happen with eliminating the network send >> from .queue_rq() ? > > This patch is successful at resolving the observed r2t issues after the > weekend test run, which is much longer than it could have run > previously. I'm happy we're narrowing this down, but I'm not seeing how > this addresses the problem. It looks like the mutex single threads the > critical parts, but maybe I'm missing something. Any ideas? Not yet, but note that the send part is mutually exclusive but the receive context is where we handle the r2t, validate length/offset and (re)queue the request for sending a h2cdata pdu back to the controller. The network send was an optimization for latency, and then I modified the queueing in the driver such that a request would first go to llist and then the sending context (either io_work or .queue_rq) would reap it to a local send_list. This helps the driver get a better understanding of what is inflight so that it can better set network msg flags for EOR/MORE. My assumption is that maybe somehow we send the initial command pdu to the controller from queue_rq, receive the r2t back before the .queue_rq context has completed and something may not be coherent. Side question: are you running with a fully preemptible kernel, or with fewer NVMe queues than CPUs?
_______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-05-03 19:58 ` Sagi Grimberg @ 2021-05-03 20:25 ` Keith Busch 2021-05-04 19:29 ` Sagi Grimberg 0 siblings, 1 reply; 50+ messages in thread From: Keith Busch @ 2021-05-03 20:25 UTC (permalink / raw) To: Sagi Grimberg; +Cc: linux-nvme, hch On Mon, May 03, 2021 at 12:58:23PM -0700, Sagi Grimberg wrote: > > > > > The driver tracepoints captured millions of IO's where everything > > > > happened as expected, so I really think something got confused and > > > > mucked with the wrong request. I've added more trace points to increase > > > > visibility because I frankly didn't find how that could happen just from > > > > code inspection. We will also incorporate your patch below for the next > > > > recreate. > > > > > > Keith, does the issue still happen with eliminating the network send > > > from .queue_rq() ? > > > > This patch is successful at resolving the observed r2t issues after the > > weekend test run, which is much longer than it could have run > > previously. I'm happy we're narrowing this down, but I'm not seeing how > > this addresses the problem. It looks like the mutex single threads the > > critical parts, but maybe I'm missing something. Any ideas? > > Not yet, but note that the send part is mutually exclusive but the > receive context is where we handle the r2t, validate length/offset > and (re)queue the request for sending a h2cdata pdu back to the > controller. > > The network send was an optimization for latency, and then I modified > the queueing in the driver such that a request would first go to llist > and then the sending context (either io_work or .queue_rq) would reap it > to a local send_list. This helps the driver get better understanding of > what is inflight such that it better set network msg flags for EOR/MORE. 
> > My assumption is that maybe somehow we send the the initial command > pdu to the controller from queue_rq, receive the r2t back before the > .queue_rq context has completed and something may not be coherent. Interesting. The network traces look correct, so my thoughts jumped to possibly incorrect usage of PCIe relaxed ordering, but that appears to be disabled. I'll keep looking for other possibilities. > Side question, are you running with a fully preemptible kernel? or > less NVMe queues than cpus? Voluntary preempt. This test is using the kernel config from Ubuntu 20.04. There are 16 CPUs in this setup with just 7 IO queues. _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-05-03 20:25 ` Keith Busch @ 2021-05-04 19:29 ` Sagi Grimberg 0 siblings, 0 replies; 50+ messages in thread From: Sagi Grimberg @ 2021-05-04 19:29 UTC (permalink / raw) To: Keith Busch; +Cc: linux-nvme, hch >>>>> The driver tracepoints captured millions of IO's where everything >>>>> happened as expected, so I really think something got confused and >>>>> mucked with the wrong request. I've added more trace points to increase >>>>> visibility because I frankly didn't find how that could happen just from >>>>> code inspection. We will also incorporate your patch below for the next >>>>> recreate. >>>> >>>> Keith, does the issue still happen with eliminating the network send >>>> from .queue_rq() ? >>> >>> This patch is successful at resolving the observed r2t issues after the >>> weekend test run, which is much longer than it could have run >>> previously. I'm happy we're narrowing this down, but I'm not seeing how >>> this addresses the problem. It looks like the mutex single threads the >>> critical parts, but maybe I'm missing something. Any ideas? >> >> Not yet, but note that the send part is mutually exclusive but the >> receive context is where we handle the r2t, validate length/offset >> and (re)queue the request for sending a h2cdata pdu back to the >> controller. >> >> The network send was an optimization for latency, and then I modified >> the queueing in the driver such that a request would first go to llist >> and then the sending context (either io_work or .queue_rq) would reap it >> to a local send_list. This helps the driver get better understanding of >> what is inflight such that it better set network msg flags for EOR/MORE. >> >> My assumption is that maybe somehow we send the initial command >> pdu to the controller from queue_rq, receive the r2t back before the >> .queue_rq context has completed and something may not be coherent. > > Interesting. 
The network traces look correct, so my thoughts jumped to > possibly incorrect usage of PCIe relaxed ordering, but that appears to > be disabled.. I'll keep looking for other possibilities. > >> Side question, are you running with a fully preemptible kernel? or >> less NVMe queues than cpus? > > Voluntary preempt. This test is using the kernel config from Ubuntu > 20.04. > > There are 16 CPUs in this set up with just 7 IO queues. Keith, Maybe this can help to add some more information (compile tested only): -- diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index a848b5b7f77b..9ce20b26c600 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -57,6 +57,7 @@ struct nvme_tcp_request { size_t data_sent; size_t data_received; enum nvme_tcp_send_state state; + bool got_r2t; }; enum nvme_tcp_queue_flags { @@ -622,10 +623,18 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue, } req = blk_mq_rq_to_pdu(rq); + if (req->state != NVME_TCP_SEND_CMD_PDU) { + dev_err(queue->ctrl->ctrl.device, + "queue %d tag %#x req unexpected state %d got_r2t %d\n", + nvme_tcp_queue_id(queue), rq->tag, req->state, + req->got_r2t); + } + ret = nvme_tcp_setup_h2c_data_pdu(req, pdu); if (unlikely(ret)) return ret; + req->got_r2t = true; req->state = NVME_TCP_SEND_H2C_PDU; req->offset = 0; @@ -1083,6 +1092,8 @@ static int nvme_tcp_try_send(struct nvme_tcp_queue *queue) } if (req->state == NVME_TCP_SEND_DATA) { + /* sending data not inline AND unsolicited? 
*/ + WARN_ON_ONCE(!nvme_tcp_has_inline_data(req) && !req->got_r2t); ret = nvme_tcp_try_send_data(req); if (ret <= 0) goto done; @@ -2275,6 +2286,7 @@ static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns, return ret; req->state = NVME_TCP_SEND_CMD_PDU; + req->got_r2t = false; req->offset = 0; req->data_sent = 0; req->data_received = 0; -- _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-04-05 14:37 ` Keith Busch 2021-04-07 19:53 ` Keith Busch @ 2021-04-09 18:04 ` Sagi Grimberg 2021-04-14 0:29 ` Keith Busch 1 sibling, 1 reply; 50+ messages in thread From: Sagi Grimberg @ 2021-04-09 18:04 UTC (permalink / raw) To: Keith Busch; +Cc: linux-nvme, hch >>>> Thanks for the reply. >>>> >>>> This was observed on the recent 5.12-rc4, so it has all the latest tcp >>>> fixes. I'll check with reverting 0dc9edaf80ea and see if that makes a >>>> difference. It is currently reproducible, though it can take over an >>>> hour right now. >>> >>> After reverting 0dc9edaf80ea, we are observing a kernel panic (below). >> >> Ah, that's probably because WRITE_ZEROS are not set with RQF_SPECIAL.. >> This patch is actually needed. >> >> >>> We'll try adding it back, plus adding your debug patch. >> >> Yes, that would give us more info about what is the state the >> request is in when getting these errors > > We have recreated with your debug patch: > > nvme nvme4: queue 6 no space in request 0x1 no space cmd_state 3 > > State 3 corresponds to the "NVME_TCP_CMD_DATA_DONE". > > The summary from the test that I received: > > We have an Ethernet trace for this failure. I filtered the trace for the > connection that maps to "queue 6 of nvme4" and tracked the state of the IO > command with Command ID 0x1 ("Tag 0x1"). The sequence for this command per > the Ethernet trace is: > > 1. The target receives this Command in an Ethernet frame that has 9 Command > capsules and a partial H2CDATA PDU. The Command with ID 0x1 is a Read > operation for 16K IO size > 2. The target sends 11 frames of C2HDATA PDU's each with 1416 bytes and one > C2HDATA PDU with 832 bytes to complete the 16K transfer. LAS flag is set > in the last PDU. Do the c2hdata pdus have data_length of 1416? 
1416 * 11 + 832 = 16408 > 16384 Can you share for each of the c2hdata PDUs what is: - hlen - plen - data_length - data_offset _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-04-09 18:04 ` Sagi Grimberg @ 2021-04-14 0:29 ` Keith Busch 2021-04-21 5:33 ` Sagi Grimberg 0 siblings, 1 reply; 50+ messages in thread From: Keith Busch @ 2021-04-14 0:29 UTC (permalink / raw) To: Sagi Grimberg; +Cc: linux-nvme, hch On Fri, Apr 09, 2021 at 11:04:43AM -0700, Sagi Grimberg wrote: > > > > > > Thanks for the reply. > > > > > > > > > > This was observed on the recent 5.12-rc4, so it has all the latest tcp > > > > > fixes. I'll check with reverting 0dc9edaf80ea and see if that makes a > > > > > difference. It is currently reproducible, though it can take over an > > > > > hour right now. > > > > > > > > After reverting 0dc9edaf80ea, we are observing a kernel panic (below). > > > > > > Ah, that's probably because WRITE_ZEROS are not set with RQF_SPECIAL.. > > > This patch is actually needed. > > > > > > > > > > We'll try adding it back, plust adding your debug patch. > > > > > > Yes, that would give us more info about what is the state the > > > request is in when getting these errors > > > > We have recreated with your debug patch: > > > > nvme nvme4: queue 6 no space in request 0x1 no space cmd_state 3 > > > > State 3 corresponds to the "NVME_TCP_CMD_DATA_DONE". > > > > The summary from the test that I received: > > > > We have an Ethernet trace for this failure. I filtered the trace for the > > connection that maps to "queue 6 of nvme4" and tracked the state of the IO > > command with Command ID 0x1 ("Tag 0x1"). The sequence for this command per > > the Ethernet trace is: > > > > 1. The target receives this Command in an Ethernet frame that has 9 Command > > capsules and a partial H2CDATA PDU. The Command with ID 0x1 is a Read > > operation for 16K IO size > > 2. The target sends 11 frames of C2HDATA PDU's each with 1416 bytes and one > > C2HDATA PDU with 832 bytes to complete the 16K transfer. LAS flag is set > > in the last PDU. > > Are the c2hdata pdus have data_length of 1416? 
and the last has data_length > = 832? > > 1416 * 11 + 832 = 16408 > 16384 Sorry, this was a mistake in the reporting. The last one's data length was only 808; 832 was the packet length. > Can you share for each of the c2hdata PDUs what is: > - hlen 24 for all of them > - plen 11 transfers at 1440, 832 for the last one > - data_length 11 transfers at 1416, 808 for the last one > - data_offset 0, 1416, 2832, 4248, 5664, 7080, 8496, 9912, 11328, 12744, 14160, 15576 _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-04-14 0:29 ` Keith Busch @ 2021-04-21 5:33 ` Sagi Grimberg 2021-04-21 14:28 ` Keith Busch 2021-04-26 15:31 ` Keith Busch 0 siblings, 2 replies; 50+ messages in thread From: Sagi Grimberg @ 2021-04-21 5:33 UTC (permalink / raw) To: Keith Busch; +Cc: linux-nvme, hch Hey Keith, sorry for the late response, been a bit under water lately... > Sorry, this was a mistake in the reporting. The last one's data length > was only 808; 832 was the packet length. > >> Can you share for each of the c2hdata PDUs what is: >> - hlen > > 24 for all of them > >> - plen > > 11 transfers at 1440, 832 for the last one > >> - data_length > > 11 transfers at 1416, 808 for the last one > >> - data_offset > > 0, 1416, 2832, 4248, 5564, 7080, 8496, 9912, 11328, 12744, 14160, 15567 > Can you retry with the following applied on top of what I sent you? -- diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index c60c1dcfb587..ff39d37e9793 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -63,6 +63,7 @@ struct nvme_tcp_request { /* send state */ size_t offset; size_t data_sent; + size_t data_recvd; enum nvme_tcp_send_state state; enum nvme_tcp_cmd_state cmd_state; }; @@ -769,6 +770,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb, *len -= recv_len; *offset += recv_len; queue->data_remaining -= recv_len; + req->data_recvd += recv_len; } if (!queue->data_remaining) { @@ -776,6 +778,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb, nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst); queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH; } else { + BUG_ON(req->data_recvd != req->data_len); req->cmd_state = NVME_TCP_CMD_DATA_DONE; if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) { req->cmd_state = NVME_TCP_CMD_DONE; -- There might be a hidden assumption here that may cause this if multiple c2hdata pdus will come per request... 
If that is the case, you can try the following (on top): -- diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index ff39d37e9793..aabec8e6810a 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -773,19 +773,20 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb, req->data_recvd += recv_len; } - if (!queue->data_remaining) { + if (!queue->data_remaining) + nvme_tcp_init_recv_ctx(queue); + + if (req->data_recvd == req->data_len) { if (queue->data_digest) { nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst); queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH; } else { - BUG_ON(req->data_recvd != req->data_len); req->cmd_state = NVME_TCP_CMD_DATA_DONE; if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) { req->cmd_state = NVME_TCP_CMD_DONE; nvme_tcp_end_request(rq, NVME_SC_SUCCESS); queue->nr_cqe++; } - nvme_tcp_init_recv_ctx(queue); } } -- _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-04-21 5:33 ` Sagi Grimberg @ 2021-04-21 14:28 ` Keith Busch 2021-04-21 16:59 ` Sagi Grimberg 2021-04-26 15:31 ` Keith Busch 1 sibling, 1 reply; 50+ messages in thread From: Keith Busch @ 2021-04-21 14:28 UTC (permalink / raw) To: Sagi Grimberg; +Cc: linux-nvme, hch On Tue, Apr 20, 2021 at 10:33:30PM -0700, Sagi Grimberg wrote: > Can you retry with the following applied on top of what I sent you? Thanks, we'll give this a try. Just a quick question on the first patch: > -- > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c > index c60c1dcfb587..ff39d37e9793 100644 > --- a/drivers/nvme/host/tcp.c > +++ b/drivers/nvme/host/tcp.c > @@ -63,6 +63,7 @@ struct nvme_tcp_request { > /* send state */ > size_t offset; > size_t data_sent; > + size_t data_recvd; > enum nvme_tcp_send_state state; > enum nvme_tcp_cmd_state cmd_state; > }; > @@ -769,6 +770,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue > *queue, struct sk_buff *skb, > *len -= recv_len; > *offset += recv_len; > queue->data_remaining -= recv_len; > + req->data_recvd += recv_len; Does this need to get reset to 0 during the initial setup? It looks like a "req->data_recvd = 0" in nvme_tcp_setup_cmd_pdu() should happen, no? > } > > if (!queue->data_remaining) { > @@ -776,6 +778,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue > *queue, struct sk_buff *skb, > nvme_tcp_ddgst_final(queue->rcv_hash, > &queue->exp_ddgst); > queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH; > } else { > + BUG_ON(req->data_recvd != req->data_len); > req->cmd_state = NVME_TCP_CMD_DATA_DONE; > if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) { > req->cmd_state = NVME_TCP_CMD_DONE; > -- > > There might be a hidden assumption here that may cause this if multiple > c2hdata pdus will come per request... 
> > If that is the case, you can try the following (on top): > -- > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c > index ff39d37e9793..aabec8e6810a 100644 > --- a/drivers/nvme/host/tcp.c > +++ b/drivers/nvme/host/tcp.c > @@ -773,19 +773,20 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue > *queue, struct sk_buff *skb, > req->data_recvd += recv_len; > } > > - if (!queue->data_remaining) { > + if (!queue->data_remaining) > + nvme_tcp_init_recv_ctx(queue); > + > + if (req->data_recvd == req->data_len) { > if (queue->data_digest) { > nvme_tcp_ddgst_final(queue->rcv_hash, > &queue->exp_ddgst); > queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH; > } else { > - BUG_ON(req->data_recvd != req->data_len); > req->cmd_state = NVME_TCP_CMD_DATA_DONE; > if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) { > req->cmd_state = NVME_TCP_CMD_DONE; > nvme_tcp_end_request(rq, NVME_SC_SUCCESS); > queue->nr_cqe++; > } > - nvme_tcp_init_recv_ctx(queue); > } > } > -- _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-04-21 14:28 ` Keith Busch @ 2021-04-21 16:59 ` Sagi Grimberg 0 siblings, 0 replies; 50+ messages in thread From: Sagi Grimberg @ 2021-04-21 16:59 UTC (permalink / raw) To: Keith Busch; +Cc: linux-nvme, hch >> Can you retry with the following applied on top of what I sent you? > > Thanks, we'll give this a try. Just a quick question on the first patch: > >> -- >> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c >> index c60c1dcfb587..ff39d37e9793 100644 >> --- a/drivers/nvme/host/tcp.c >> +++ b/drivers/nvme/host/tcp.c >> @@ -63,6 +63,7 @@ struct nvme_tcp_request { >> /* send state */ >> size_t offset; >> size_t data_sent; >> + size_t data_recvd; >> enum nvme_tcp_send_state state; >> enum nvme_tcp_cmd_state cmd_state; >> }; >> @@ -769,6 +770,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue >> *queue, struct sk_buff *skb, >> *len -= recv_len; >> *offset += recv_len; >> queue->data_remaining -= recv_len; >> + req->data_recvd += recv_len; > > Does this need to get reset to 0 during the initial setup? It looks like > a "req->data_recvd = 0" in nvme_tcp_setup_cmd_pdu() should happen, no? Yes, this hunk is needed as well: -- diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index 8e55d8bc0c50..226404239b49 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -2274,6 +2274,7 @@ static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns, req->state = NVME_TCP_SEND_CMD_PDU; req->offset = 0; req->data_sent = 0; + req->data_recvd = 0; req->pdu_len = 0; req->pdu_sent = 0; req->data_len = blk_rq_nr_phys_segments(rq) ? -- _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-04-21 5:33 ` Sagi Grimberg 2021-04-21 14:28 ` Keith Busch @ 2021-04-26 15:31 ` Keith Busch 2021-04-27 3:10 ` Sagi Grimberg 1 sibling, 1 reply; 50+ messages in thread From: Keith Busch @ 2021-04-26 15:31 UTC (permalink / raw) To: Sagi Grimberg; +Cc: linux-nvme, hch On Tue, Apr 20, 2021 at 10:33:30PM -0700, Sagi Grimberg wrote: > There might be a hidden assumption here that may cause this if multiple > c2hdata pdus will come per request... > > If that is the case, you can try the following (on top): This patch is successful for clearing up the observed "no space" issues observed during read tests. There are still some issues with write tests that look a bit different. I'll get more details on that for you today, but it's probably okay if you want to make a formal patch for the receive data side. > -- > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c > index ff39d37e9793..aabec8e6810a 100644 > --- a/drivers/nvme/host/tcp.c > +++ b/drivers/nvme/host/tcp.c > @@ -773,19 +773,20 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue > *queue, struct sk_buff *skb, > req->data_recvd += recv_len; > } > > - if (!queue->data_remaining) { > + if (!queue->data_remaining) > + nvme_tcp_init_recv_ctx(queue); > + > + if (req->data_recvd == req->data_len) { > if (queue->data_digest) { > nvme_tcp_ddgst_final(queue->rcv_hash, > &queue->exp_ddgst); > queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH; > } else { > - BUG_ON(req->data_recvd != req->data_len); > req->cmd_state = NVME_TCP_CMD_DATA_DONE; > if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) { > req->cmd_state = NVME_TCP_CMD_DONE; > nvme_tcp_end_request(rq, NVME_SC_SUCCESS); > queue->nr_cqe++; > } > - nvme_tcp_init_recv_ctx(queue); > } > } > -- _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-04-26 15:31 ` Keith Busch @ 2021-04-27 3:10 ` Sagi Grimberg 2021-04-27 18:12 ` Keith Busch 0 siblings, 1 reply; 50+ messages in thread From: Sagi Grimberg @ 2021-04-27 3:10 UTC (permalink / raw) To: Keith Busch; +Cc: linux-nvme, hch > This patch is successful for clearing up the observed "no space" issues > observed during read tests. > > There are still some issues with write tests that look a bit different. > I'll get more details on that for you today, sure. > but it's probably okay if > you want to make a formal patch for the receive data side. What should I put on the Reported-by: and Tested-by: tags? _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-04-27 3:10 ` Sagi Grimberg @ 2021-04-27 18:12 ` Keith Busch 2021-04-27 23:58 ` Sagi Grimberg 0 siblings, 1 reply; 50+ messages in thread From: Keith Busch @ 2021-04-27 18:12 UTC (permalink / raw) To: Sagi Grimberg; +Cc: linux-nvme, hch On Mon, Apr 26, 2021 at 08:10:38PM -0700, Sagi Grimberg wrote: > > > This patch is successful for clearing up the observed "no space" issues > > observed during read tests. > > > > There are still some issues with write tests that look a bit different. > > I'll get more details on that for you today, > > sure. > > > but it's probably okay if > > you want to make a formal patch for the receive data side. > > What should I put on the Reported-by: and Tested-by: tags? This report and testing was done courtesy of Narayan Ayalasomayajula <Narayan.Ayalasomayajula@wdc.com> Before you submit a patch, though, we did additional testing with data digest enabled and observe a regression with the following error: nvme nvme0: queue 0: data digest flag is cleared From looking at the patch, the following part looks a bit suspicious: > @@ -776,19 +776,20 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb, > req->data_recvd += recv_len; > } > > - if (!queue->data_remaining) { > + if (!queue->data_remaining) > + nvme_tcp_init_recv_ctx(queue); The code had previously called nvme_tcp_init_recv_ctx() only if queue->data_digest wasn't set, but now it's called all the time. I see that calling this function clears ddgst_remaining, so does that explain the new errors? 
> + if (req->data_recvd == req->data_len) { > if (queue->data_digest) { > nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst); > queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH; > } else { > - BUG_ON(req->data_recvd != req->data_len); > req->cmd_state = NVME_TCP_CMD_DATA_DONE; > if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) { > req->cmd_state = NVME_TCP_CMD_DONE; > nvme_tcp_end_request(rq, NVME_SC_SUCCESS); > queue->nr_cqe++; > } > - nvme_tcp_init_recv_ctx(queue); > } > } _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-04-27 18:12 ` Keith Busch @ 2021-04-27 23:58 ` Sagi Grimberg 2021-04-30 23:42 ` Sagi Grimberg 0 siblings, 1 reply; 50+ messages in thread From: Sagi Grimberg @ 2021-04-27 23:58 UTC (permalink / raw) To: Keith Busch; +Cc: linux-nvme, hch >> What should I put on the Reported-by: and Tested-by: tags? > > This report and testing was done courtesy of > > Narayan Ayalasomayajula <Narayan.Ayalasomayajula@wdc.com> > > Before you submit a patch, though, we did additional testing with data > digest enabled and observe a regression with the following error: > > nvme nvme0: queue 0: data digest flag is cleared > > From looking at the patch, the following part looks a bit suspicious: > >> @@ -776,19 +776,20 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb, >> req->data_recvd += recv_len; >> } >> >> - if (!queue->data_remaining) { >> + if (!queue->data_remaining) >> + nvme_tcp_init_recv_ctx(queue); > > The code had previously called nvme_tcp_init_recv_ctx() only if > queue->data_digest wasn't set, but now it's called all the time. I see > that calling this function clears ddgst_remaining, so does that explain > the new errors? 
> >> + if (req->data_recvd == req->data_len) { >> if (queue->data_digest) { >> nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst); >> queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH; >> } else { >> - BUG_ON(req->data_recvd != req->data_len); >> req->cmd_state = NVME_TCP_CMD_DATA_DONE; >> if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) { >> req->cmd_state = NVME_TCP_CMD_DONE; >> nvme_tcp_end_request(rq, NVME_SC_SUCCESS); >> queue->nr_cqe++; >> } >> - nvme_tcp_init_recv_ctx(queue); >> } >> } You are exactly right, I think this should do the trick: -- diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index 8e55d8bc0c50..eb1feaacd11a 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -55,6 +55,7 @@ struct nvme_tcp_request { /* send state */ size_t offset; size_t data_sent; + size_t data_received; enum nvme_tcp_send_state state; }; @@ -751,9 +752,12 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb, *len -= recv_len; *offset += recv_len; queue->data_remaining -= recv_len; + req->data_received += recv_len; } if (!queue->data_remaining) { + if (req->data_received < req->data_len) + return 0; if (queue->data_digest) { nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst); queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH; @@ -2274,6 +2278,7 @@ static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns, req->state = NVME_TCP_SEND_CMD_PDU; req->offset = 0; req->data_sent = 0; + req->data_received = 0; req->pdu_len = 0; req->pdu_sent = 0; req->data_len = blk_rq_nr_phys_segments(rq) ? -- _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-04-27 23:58 ` Sagi Grimberg @ 2021-04-30 23:42 ` Sagi Grimberg 2021-05-03 14:28 ` Keith Busch 0 siblings, 1 reply; 50+ messages in thread From: Sagi Grimberg @ 2021-04-30 23:42 UTC (permalink / raw) To: Keith Busch; +Cc: linux-nvme, hch > You are exactly right, I think this should do the trick: Hey Keith, Did this resolve the issues? > -- > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c > index 8e55d8bc0c50..eb1feaacd11a 100644 > --- a/drivers/nvme/host/tcp.c > +++ b/drivers/nvme/host/tcp.c > @@ -55,6 +55,7 @@ struct nvme_tcp_request { > /* send state */ > size_t offset; > size_t data_sent; > + size_t data_received; > enum nvme_tcp_send_state state; > }; > > @@ -751,9 +752,12 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue > *queue, struct sk_buff *skb, > *len -= recv_len; > *offset += recv_len; > queue->data_remaining -= recv_len; > + req->data_received += recv_len; > } > > if (!queue->data_remaining) { > + if (req->data_received < req->data_len) > + return 0; > if (queue->data_digest) { > nvme_tcp_ddgst_final(queue->rcv_hash, > &queue->exp_ddgst); > queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH; > @@ -2274,6 +2278,7 @@ static blk_status_t nvme_tcp_setup_cmd_pdu(struct > nvme_ns *ns, > req->state = NVME_TCP_SEND_CMD_PDU; > req->offset = 0; > req->data_sent = 0; > + req->data_received = 0; > req->pdu_len = 0; > req->pdu_sent = 0; > req->data_len = blk_rq_nr_phys_segments(rq) ? > -- _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-04-30 23:42 ` Sagi Grimberg @ 2021-05-03 14:28 ` Keith Busch 2021-05-03 19:36 ` Sagi Grimberg 0 siblings, 1 reply; 50+ messages in thread From: Keith Busch @ 2021-05-03 14:28 UTC (permalink / raw) To: Sagi Grimberg; +Cc: linux-nvme, hch On Fri, Apr 30, 2021 at 04:42:14PM -0700, Sagi Grimberg wrote: > > > You are exactly right, I think this should do the trick: > > Hey Keith, > > Did this resolve the issues? We're unfortunately still observing data digest issues even with this. Most of the testing has shifted to the r2t error, so I don't have any additional details on the data digest problem. On the r2t issue, I'm just waiting for the weekend test results to see what happened with running your most recent test suggestion. > > -- > > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c > > index 8e55d8bc0c50..eb1feaacd11a 100644 > > --- a/drivers/nvme/host/tcp.c > > +++ b/drivers/nvme/host/tcp.c > > @@ -55,6 +55,7 @@ struct nvme_tcp_request { > > /* send state */ > > size_t offset; > > size_t data_sent; > > + size_t data_received; > > enum nvme_tcp_send_state state; > > }; > > > > @@ -751,9 +752,12 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue > > *queue, struct sk_buff *skb, > > *len -= recv_len; > > *offset += recv_len; > > queue->data_remaining -= recv_len; > > + req->data_received += recv_len; > > } > > > > if (!queue->data_remaining) { > > + if (req->data_received < req->data_len) > > + return 0; > > if (queue->data_digest) { > > nvme_tcp_ddgst_final(queue->rcv_hash, > > &queue->exp_ddgst); > > queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH; > > @@ -2274,6 +2278,7 @@ static blk_status_t nvme_tcp_setup_cmd_pdu(struct > > nvme_ns *ns, > > req->state = NVME_TCP_SEND_CMD_PDU; > > req->offset = 0; > > req->data_sent = 0; > > + req->data_received = 0; > > req->pdu_len = 0; > > req->pdu_sent = 0; > > req->data_len = blk_rq_nr_phys_segments(rq) ? 
> > -- _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-05-03 14:28 ` Keith Busch @ 2021-05-03 19:36 ` Sagi Grimberg 2021-05-03 19:38 ` Sagi Grimberg 0 siblings, 1 reply; 50+ messages in thread From: Sagi Grimberg @ 2021-05-03 19:36 UTC (permalink / raw) To: Keith Busch; +Cc: linux-nvme, hch >> Hey Keith, >> >> Did this resolve the issues? > > We're unfortunately still observing data digest issues even with this. > Most of the testing has shifted to the r2t error, so I don't have any > additional details on the data digest problem. I've looked again at the code, and I'm not convinced that the patch is needed at all anymore, I'm now surprised that it actually changed anything (disregarding data digest). The driver does not track the received bytes by definition, it relies on the controller to send it a completion, or set the success flag in the _last_ c2hdata pdu. Does your target set NVME_TCP_F_DATA_SUCCESS on any of the c2hdata pdus? _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-05-03 19:36 ` Sagi Grimberg @ 2021-05-03 19:38 ` Sagi Grimberg 2021-05-03 19:44 ` Keith Busch 0 siblings, 1 reply; 50+ messages in thread From: Sagi Grimberg @ 2021-05-03 19:38 UTC (permalink / raw) To: Keith Busch; +Cc: linux-nvme, hch >>> Hey Keith, >>> >>> Did this resolve the issues? >> >> We're unfortunately still observing data digest issues even with this. >> Most of the testing has shifted to the r2t error, so I don't have any >> additional details on the data digest problem. > > I've looked again at the code, and I'm not convinced that the patch > is needed at all anymore, I'm now surprised that it actually changed > anything (disregarding data digest). > > The driver does not track the received bytes by definition, it relies > on the controller to send it a completion, or set the success flag in > the _last_ c2hdata pdu. Does your target set > NVME_TCP_F_DATA_SUCCESS on any of the c2hdata pdus? Perhaps you can also run this patch instead? -- diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index 6bd5b281c818..a848b5b7f77b 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -55,6 +55,7 @@ struct nvme_tcp_request { /* send state */ size_t offset; size_t data_sent; + size_t data_received; enum nvme_tcp_send_state state; }; @@ -751,6 +752,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb, *len -= recv_len; *offset += recv_len; queue->data_remaining -= recv_len; + req->data_received += recv_len; } if (!queue->data_remaining) { @@ -759,6 +761,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb, queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH; } else { if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) { + BUG_ON(req->data_received != req->data_len); nvme_tcp_end_request(rq, NVME_SC_SUCCESS); queue->nr_cqe++; } @@ -2274,6 +2277,7 @@ static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns, req->state = NVME_TCP_SEND_CMD_PDU; 
req->offset = 0; req->data_sent = 0; + req->data_received = 0; req->pdu_len = 0; req->pdu_sent = 0; req->data_len = blk_rq_nr_phys_segments(rq) ? -- _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-05-03 19:38 ` Sagi Grimberg @ 2021-05-03 19:44 ` Keith Busch 2021-05-03 20:00 ` Sagi Grimberg 0 siblings, 1 reply; 50+ messages in thread From: Keith Busch @ 2021-05-03 19:44 UTC (permalink / raw) To: Sagi Grimberg; +Cc: linux-nvme, hch On Mon, May 03, 2021 at 12:38:39PM -0700, Sagi Grimberg wrote: > > > > > Hey Keith, > > > > > > > > Did this resolve the issues? > > > > > > We're unfortunately still observing data digest issues even with this. > > > Most of the testing has shifted to the r2t error, so I don't have any > > > additional details on the data digest problem. > > > > I've looked again at the code, and I'm not convinced that the patch > > is needed at all anymore, I'm now surprised that it actually changed > > anything (disregarding data digest). > > > > The driver does not track the received bytes by definition, it relies > > on the controller to send it a completion, or set the success flag in > > the _last_ c2hdata pdu. Does your target set > > NVME_TCP_F_DATA_SUCCESS on any of the c2hdata pdus? > > Perhaps you can also run this patch instead? Thanks, will give this a shot. 
> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 6bd5b281c818..a848b5b7f77b 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -55,6 +55,7 @@ struct nvme_tcp_request {
>  	/* send state */
>  	size_t			offset;
>  	size_t			data_sent;
> +	size_t			data_received;
>  	enum nvme_tcp_send_state state;
>  };
>
> @@ -751,6 +752,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>  		*len -= recv_len;
>  		*offset += recv_len;
>  		queue->data_remaining -= recv_len;
> +		req->data_received += recv_len;
>  	}
>
>  	if (!queue->data_remaining) {
> @@ -759,6 +761,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>  		queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH;
>  	} else {
>  		if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
> +			BUG_ON(req->data_received != req->data_len);
>  			nvme_tcp_end_request(rq, NVME_SC_SUCCESS);
>  			queue->nr_cqe++;
>  		}
> @@ -2274,6 +2277,7 @@ static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
>  	req->state = NVME_TCP_SEND_CMD_PDU;
>  	req->offset = 0;
>  	req->data_sent = 0;
> +	req->data_received = 0;
>  	req->pdu_len = 0;
>  	req->pdu_sent = 0;
>  	req->data_len = blk_rq_nr_phys_segments(rq) ?
> --

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-05-03 19:44 ` Keith Busch @ 2021-05-03 20:00 ` Sagi Grimberg 2021-05-04 14:36 ` Keith Busch 0 siblings, 1 reply; 50+ messages in thread From: Sagi Grimberg @ 2021-05-03 20:00 UTC (permalink / raw) To: Keith Busch; +Cc: linux-nvme, hch >>>>> Hey Keith, >>>>> >>>>> Did this resolve the issues? >>>> >>>> We're unfortunately still observing data digest issues even with this. >>>> Most of the testing has shifted to the r2t error, so I don't have any >>>> additional details on the data digest problem. >>> >>> I've looked again at the code, and I'm not convinced that the patch >>> is needed at all anymore, I'm now surprised that it actually changed >>> anything (disregarding data digest). >>> >>> The driver does not track the received bytes by definition, it relies >>> on the controller to send it a completion, or set the success flag in >>> the _last_ c2hdata pdu. Does your target set >>> NVME_TCP_F_DATA_SUCCESS on any of the c2hdata pdus? >> >> Perhaps you can also run this patch instead? > > Thanks, will give this a shot. Still would be beneficial to look at the traces and check if the success flag happens to be set. If this flag is set, the driver _will_ complete the request without checking the bytes received thus far (similar to how pci and rdma don't and can't check dma byte count). _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-05-03 20:00 ` Sagi Grimberg @ 2021-05-04 14:36 ` Keith Busch 2021-05-04 18:15 ` Sagi Grimberg 0 siblings, 1 reply; 50+ messages in thread From: Keith Busch @ 2021-05-04 14:36 UTC (permalink / raw) To: Sagi Grimberg; +Cc: linux-nvme, hch On Mon, May 03, 2021 at 01:00:05PM -0700, Sagi Grimberg wrote: > > > > > > Hey Keith, > > > > > > > > > > > > Did this resolve the issues? > > > > > > > > > > We're unfortunately still observing data digest issues even with this. > > > > > Most of the testing has shifted to the r2t error, so I don't have any > > > > > additional details on the data digest problem. > > > > > > > > I've looked again at the code, and I'm not convinced that the patch > > > > is needed at all anymore, I'm now surprised that it actually changed > > > > anything (disregarding data digest). > > > > > > > > The driver does not track the received bytes by definition, it relies > > > > on the controller to send it a completion, or set the success flag in > > > > the _last_ c2hdata pdu. Does your target set > > > > NVME_TCP_F_DATA_SUCCESS on any of the c2hdata pdus? > > > > > > Perhaps you can also run this patch instead? > > > > Thanks, will give this a shot. > > Still would be beneficial to look at the traces and check if > the success flag happens to be set. If this flag is set, the > driver _will_ complete the request without checking the bytes > received thus far (similar to how pci and rdma don't and can't > check dma byte count). I realized this patch is the same as one you'd sent earlier. We hit the BUG_ON(), and then proceeded to use your follow-up patch, which appeared to fix the data receive problem, but introduced data digest problems. So, are you saying that hitting this BUG_ON means that the driver has observed the completion out-of-order from the expected data? 
_______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-05-04 14:36 ` Keith Busch @ 2021-05-04 18:15 ` Sagi Grimberg 2021-05-04 19:14 ` Keith Busch 0 siblings, 1 reply; 50+ messages in thread From: Sagi Grimberg @ 2021-05-04 18:15 UTC (permalink / raw) To: Keith Busch; +Cc: linux-nvme, hch >>>>>>> Hey Keith, >>>>>>> >>>>>>> Did this resolve the issues? >>>>>> >>>>>> We're unfortunately still observing data digest issues even with this. >>>>>> Most of the testing has shifted to the r2t error, so I don't have any >>>>>> additional details on the data digest problem. >>>>> >>>>> I've looked again at the code, and I'm not convinced that the patch >>>>> is needed at all anymore, I'm now surprised that it actually changed >>>>> anything (disregarding data digest). >>>>> >>>>> The driver does not track the received bytes by definition, it relies >>>>> on the controller to send it a completion, or set the success flag in >>>>> the _last_ c2hdata pdu. Does your target set >>>>> NVME_TCP_F_DATA_SUCCESS on any of the c2hdata pdus? >>>> >>>> Perhaps you can also run this patch instead? >>> >>> Thanks, will give this a shot. >> >> Still would be beneficial to look at the traces and check if >> the success flag happens to be set. If this flag is set, the >> driver _will_ complete the request without checking the bytes >> received thus far (similar to how pci and rdma don't and can't >> check dma byte count). > > I realized this patch is the same as one you'd sent earlier. We hit the > BUG_ON(), and then proceeded to use your follow-up patch, which appeared > to fix the data receive problem, but introduced data digest problems. > > So, are you saying that hitting this BUG_ON means that the driver has > observed the completion out-of-order from the expected data? 
If you hit the BUG_ON it means that the host spotted a c2hdata PDU that has the success flag set before all the request data was received:
--
@@ -759,6 +761,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
 		queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH;
 	} else {
 		if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
+			BUG_ON(req->data_received != req->data_len);
 			nvme_tcp_end_request(rq, NVME_SC_SUCCESS);
 			queue->nr_cqe++;
 		}
--

Which means that the host is completing the request immediately, relying on the controller having sent all the required data and knowing that a completion response capsule will not be sent. From the spec:

  C2HData PDUs contain a LAST_PDU flag that is set to ‘1’ in the last PDU of a command data transfer and is cleared to ‘0’ in all other C2HData PDUs associated with the command. C2HData PDUs also contain a SUCCESS flag that may be set to ‘1’ in the last C2HData PDU of a command data transfer to indicate that the command has completed successfully. In this case, no Response Capsule is sent by the controller for the command and the host synthesizes a completion queue entry for the command with the Command Specific field and the Status Field both cleared to 0h. If the SUCCESS flag is cleared to ‘0’ in the last C2HData PDU of a command, then the controller shall send a Response Capsule for the command to the host. The SUCCESS flag shall be cleared to ‘0’ in all C2HData PDUs that are not the last C2HData PDU for a command. The SUCCESS flag may be set to ‘1’ in the last C2HData PDU only if the controller supports disabling submission queue head pointer updates.

Hence my questions: does the controller set NVME_TCP_F_DATA_SUCCESS in any of the c2hdata PDUs which is not the last one? Does it set it in the last one while omitting the cqe response capsule, as expected by the host?
_______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
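The spec rules quoted above reduce to two predicates: SUCCESS is only legal on the last C2HData PDU of a transfer, and when it is set the host synthesizes the completion itself instead of waiting for a response capsule. A minimal sketch, assuming invented flag bit values (the real definitions live in the kernel's nvme-tcp header and may differ):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits for the model only; not the real
 * NVMe/TCP header values. */
#define F_LAST_PDU     (1u << 0)
#define F_DATA_SUCCESS (1u << 1)

/* "The SUCCESS flag shall be cleared to '0' in all C2HData PDUs that
 * are not the last C2HData PDU for a command." */
static bool pdu_flags_valid(unsigned int flags)
{
    return !(flags & F_DATA_SUCCESS) || (flags & F_LAST_PDU);
}

/* When the last PDU carries SUCCESS, no response capsule follows and
 * the host fabricates a zeroed completion queue entry itself. */
static bool host_synthesizes_cqe(unsigned int flags)
{
    return (flags & F_LAST_PDU) && (flags & F_DATA_SUCCESS);
}
```

A target setting SUCCESS on a non-last PDU (the first assertion's case) is the spec violation Sagi is asking Keith to look for in the traces.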
* Re: nvme tcp receive errors 2021-05-04 18:15 ` Sagi Grimberg @ 2021-05-04 19:14 ` Keith Busch 2021-05-10 18:06 ` Keith Busch 0 siblings, 1 reply; 50+ messages in thread From: Keith Busch @ 2021-05-04 19:14 UTC (permalink / raw) To: Sagi Grimberg; +Cc: linux-nvme, hch On Tue, May 04, 2021 at 11:15:28AM -0700, Sagi Grimberg wrote: > > > > > > I've looked again at the code, and I'm not convinced that the patch > > > > > > is needed at all anymore, I'm now surprised that it actually changed > > > > > > anything (disregarding data digest). > > > > > > > > > > > > The driver does not track the received bytes by definition, it relies > > > > > > on the controller to send it a completion, or set the success flag in > > > > > > the _last_ c2hdata pdu. Does your target set > > > > > > NVME_TCP_F_DATA_SUCCESS on any of the c2hdata pdus? > > > > > > > > > > Perhaps you can also run this patch instead? > > > > > > > > Thanks, will give this a shot. > > > > > > Still would be beneficial to look at the traces and check if > > > the success flag happens to be set. If this flag is set, the > > > driver _will_ complete the request without checking the bytes > > > received thus far (similar to how pci and rdma don't and can't > > > check dma byte count). > > > > I realized this patch is the same as one you'd sent earlier. We hit the > > BUG_ON(), and then proceeded to use your follow-up patch, which appeared > > to fix the data receive problem, but introduced data digest problems. > > > > So, are you saying that hitting this BUG_ON means that the driver has > > observed the completion out-of-order from the expected data? 
>
> If you hit the BUG_ON it means that the host spotted a c2hdata
> PDU that has the success flag set before all the request data
> was received:
> --
> @@ -759,6 +761,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>  		queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH;
>  	} else {
>  		if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
> +			BUG_ON(req->data_received != req->data_len);
>  			nvme_tcp_end_request(rq, NVME_SC_SUCCESS);
>  			queue->nr_cqe++;
>  		}
> --

I apologize for the confusion. There is a subtle difference in your most recent patch vs. the previous one: the BUG_ON() is within the DATA_SUCCESS section, and we hadn't actually run with that. We did hit the BUG_ON() in the first version, and looking at it now, I suspect you intended to put it in this new location.

We'll retest, but I don't think we'll hit the BUG: none of the headers have the DATA_SUCCESS flag set in the tcpdumps I've seen. And I also see your point about how the original patch shouldn't be needed at all; I likewise don't see why it could have changed the observation without data digest.

Thank you for your patience on this issue. I will get back to you with more info after circling back with the test group.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-05-04 19:14 ` Keith Busch @ 2021-05-10 18:06 ` Keith Busch 2021-05-10 18:18 ` Sagi Grimberg 0 siblings, 1 reply; 50+ messages in thread From: Keith Busch @ 2021-05-10 18:06 UTC (permalink / raw) To: Sagi Grimberg; +Cc: linux-nvme, hch

Sagi,

Just wanted to give you an update on where we're at with this.

All tests run with your earlier patch removing the inline dispatch from nvme_tcp_queue_request() are successful. At this point, I am leaning toward removing that optimization from mainline.

I added additional tracing to see what is going on, but we eventually hit a memory issue after some hours of runtime. I've never seen an issue like this before. It triggers in nvme_tcp_advance_req() when tracing the rq->tag and req->data_sent:

  WARNING: CPU: 1 PID: 3428 at arch/x86/include/asm/kfence.h:44 kfence_protect_page+0x33/0xa0

I think the above is a distraction, but I can provide the full stack trace and the patch adding the tracepoint if you think it's helpful.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-05-10 18:06 ` Keith Busch @ 2021-05-10 18:18 ` Sagi Grimberg 2021-05-10 18:30 ` Keith Busch 0 siblings, 1 reply; 50+ messages in thread From: Sagi Grimberg @ 2021-05-10 18:18 UTC (permalink / raw) To: Keith Busch; +Cc: linux-nvme, hch

> Sagi,
>
> Just wanted to give you an update on where we're at with this.
>
> All tests run with your earlier patch removing the inline dispatch from
> nvme_tcp_queue_request() are successful. At this point, I am leaning to
> remove that optimization from mainline.

Thanks Keith,

Did you run it with the extra information debug patch I sent you? What I'm concerned about is that you have the only environment where this reproduces, and once this is removed it will be very difficult to add it back in.

Also, what about the read issue? That one is still unresolved from my PoV.

> I added additional tracing to see what is going on, but we eventually
> hit a memory issue after some hours of runtime. I've never seen an issue
> like this before. It triggers in nvme_tcp_advance_req() when tracing the
> rq->tag and req->data_sent:
>
> WARNING: CPU: 1 PID: 3428 at arch/x86/include/asm/kfence.h:44 kfence_protect_page+0x33/0xa0
>
> I think the above is a distraction, but I can provide the full stack
> trace and patch adding the tracepoint if you think it's helpful.

That is... odd.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-05-10 18:18 ` Sagi Grimberg @ 2021-05-10 18:30 ` Keith Busch 2021-05-10 21:07 ` Sagi Grimberg 0 siblings, 1 reply; 50+ messages in thread From: Keith Busch @ 2021-05-10 18:30 UTC (permalink / raw) To: Sagi Grimberg; +Cc: linux-nvme, hch

On Mon, May 10, 2021 at 11:18:24AM -0700, Sagi Grimberg wrote:
>
> > Sagi,
> >
> > Just wanted to give you an update on where we're at with this.
> >
> > All tests run with your earlier patch removing the inline dispatch from
> > nvme_tcp_queue_request() are successful. At this point, I am leaning to
> > remove that optimization from mainline.
>
> Thanks Keith,
>
> Did you run it with the extra information debug patch I sent you? What
> I'm concerned about is that you have the only environment where this
> reproduces, and once this is removed it will be very difficult to add
> it back in.
>
> Also, what about the read issue? That one is still unresolved from
> my PoV.

The test team reports no issues for both read and write tests with the inline dispatch removed. That single patch appears to resolve all problems. Sorry, I should have clarified that initially.

We will go back to the extra debug from last week on top of a pristine mainline kernel. There's been some confusion about which patches to apply on top of others in the mix, but I'm getting better at coordinating that.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-05-10 18:30 ` Keith Busch @ 2021-05-10 21:07 ` Sagi Grimberg 2021-05-11 3:00 ` Keith Busch 0 siblings, 1 reply; 50+ messages in thread From: Sagi Grimberg @ 2021-05-10 21:07 UTC (permalink / raw) To: Keith Busch; +Cc: linux-nvme, hch >>> Sagi, >>> >>> Just wanted to give you an update on where we're at with this. >>> >>> All tests run with your earlier patch removing the inline dispatch from >>> nvme_tcp_queue_request() are successful. At this point, I am leaning to >>> remove that optimization from mainline. >> >> Thanks Keith, >> >> Did you run it with the extra information debug patch I sent you? What >> I'm concerned about is that given that you have the only environment >> where this reproduces, and this is removed it will be very difficult >> to add it back in. >> >> Also, what about the read issue? that one is still unresolved from >> my PoV. > > The test team reports no issues for both read and write tests with the > inline dispatch removed. That single patch appears to resolve all > problems. Sorry, I should have clarified that initially. > > We will go back to the extra debug from last week on top of a pristine > mainline kernel. There's been some confusion of which patches to apply > on top of others in the mix, but I'm getting better at coordinating > that. Keith, I may have a theory to this issue. I think that the problem is in cases where we send commands with data to the controller and then in nvme_tcp_send_data between the last successful kernel_sendpage and before nvme_tcp_advance_req, the controller sends back a successful completion. If that is the case, then the completion path could be triggered, the tag would be reused, triggering a new .queue_rq, setting again the req.iter with the new bio params (all is not taken by the send_mutex) and then the send context would call nvme_tcp_advance_req progressing the req.iter with the former sent bytes... 
And given that the req.iter is used for reads/writes, it is possible that it can explain both issues.

While this is not easy to trigger, there is nothing I think that can prevent it. The driver used to have a single context that would do both send and recv, so this could not have happened, but now that we added the .queue_rq send context, I guess this can indeed confuse the driver.

There are 3 possible options to resolve this:

1. Never touch the command after the last send (only advance the request iter after we check if we are done):
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index c51b70aec6dd..5c576eda5ba1 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -933,7 +933,6 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
 		if (ret <= 0)
 			return ret;

-		nvme_tcp_advance_req(req, ret);
 		if (queue->data_digest)
 			nvme_tcp_ddgst_update(queue->snd_hash, page,
 					offset, ret);
@@ -950,6 +949,7 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
 			}
 			return 1;
 		}
+		nvme_tcp_advance_req(req, ret);
 	}
 	return -EAGAIN;
 }
--

2. Only initialize the request/cmd/pdu (nvme_tcp_setup_cmd_pdu) when we have taken the send_mutex (say in nvme_tcp_fetch_request or something), which is more cumbersome as this stuff may be called multiple times in the life of a request.

3. Add refcounting to never complete an I/O before both send and receive have fully completed on this command. Besides adding some more overhead on the datapath, there may also be some challenges with moving the request state machine to be based on the refcounting.

I suggest that you guys give (1) a shot and see if the theory holds...

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
^ permalink raw reply related [flat|nested] 50+ messages in thread
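Option (1) above amounts to an ordering rule: once the last byte of a request is on the wire, the completion path may end and reuse the tag, so the send side must never touch the request again. The following is a simplified single-threaded userspace model of the two orderings, assuming invented names and structure (it is not the driver's code; "completion fires" is modeled as a flag set at the moment of the last send):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_req {
    size_t data_len;
    size_t sent;
    bool completed;                /* controller completes the command the
                                    * instant the last byte hits the wire */
    bool touched_after_completion; /* the bug being avoided: the request
                                    * may already belong to a new command */
};

static void advance(struct model_req *req, size_t n)
{
    if (req->completed)
        req->touched_after_completion = true; /* corrupts a reused tag */
    req->sent += n;
}

/* Fixed ordering (option 1): decide "was that the last chunk?" first,
 * and only advance the iterator when it was not. */
static void send_all_fixed(struct model_req *req, size_t chunk)
{
    for (;;) {
        size_t n = req->data_len - req->sent;
        if (n > chunk)
            n = chunk;
        /* ...kernel_sendpage(n) would go here... */
        if (req->sent + n == req->data_len) {
            req->completed = true; /* completion may now race in */
            return;                /* never touch req after the last send */
        }
        advance(req, n);
    }
}

/* Buggy ordering: advance unconditionally after every send, as the
 * pre-fix loop effectively did. */
static void send_all_buggy(struct model_req *req, size_t chunk)
{
    while (req->sent < req->data_len) {
        size_t n = req->data_len - req->sent;
        if (n > chunk)
            n = chunk;
        if (req->sent + n == req->data_len)
            req->completed = true; /* completion fires after sendpage */
        advance(req, n);           /* bug: touches req post-completion */
    }
}
```

In the buggy ordering the final `advance()` lands on a request that may already have been recycled by a new .queue_rq, which is exactly how the stale iterator advance in the theory corrupts the new command's req.iter.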
* Re: nvme tcp receive errors 2021-05-10 21:07 ` Sagi Grimberg @ 2021-05-11 3:00 ` Keith Busch 2021-05-11 17:17 ` Sagi Grimberg 0 siblings, 1 reply; 50+ messages in thread From: Keith Busch @ 2021-05-11 3:00 UTC (permalink / raw) To: Sagi Grimberg; +Cc: linux-nvme, hch On Mon, May 10, 2021 at 02:07:47PM -0700, Sagi Grimberg wrote: > I may have a theory to this issue. I think that the problem is in > cases where we send commands with data to the controller and then in > nvme_tcp_send_data between the last successful kernel_sendpage > and before nvme_tcp_advance_req, the controller sends back a successful > completion. > > If that is the case, then the completion path could be triggered, > the tag would be reused, triggering a new .queue_rq, setting again > the req.iter with the new bio params (all is not taken by the > send_mutex) and then the send context would call nvme_tcp_advance_req > progressing the req.iter with the former sent bytes... And given that > the req.iter is used for reads/writes, it is possible that it can > explain both issues. > > While this is not easy to trigger, there is nothing I think that > can prevent that. The driver used to have a single context that > would do both send and recv so this could not have happened, but > now that we added the .queue_rq send context, I guess this can > indeed confuse the driver. Awesome, this is exactly the type of sequence I've been trying to capture, but couldn't quite get there. Now that you've described it, that flow can certainly explain the observations, including the corrupted debug trace event I was trying to add. The sequence looks unlikely to happen, which agrees with the difficulty in reproducing it. I am betting right now that you got it, but a little surprised no one else is reporting a similar problem yet. Your option "1" looks like the best one, IMO. I've requested dropping all debug and test patches and using just this one on the current nvme baseline for the next test cycle. 
_______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-05-11 3:00 ` Keith Busch @ 2021-05-11 17:17 ` Sagi Grimberg 2021-05-13 15:48 ` Keith Busch 0 siblings, 1 reply; 50+ messages in thread From: Sagi Grimberg @ 2021-05-11 17:17 UTC (permalink / raw) To: Keith Busch; +Cc: linux-nvme, hch >> I may have a theory to this issue. I think that the problem is in >> cases where we send commands with data to the controller and then in >> nvme_tcp_send_data between the last successful kernel_sendpage >> and before nvme_tcp_advance_req, the controller sends back a successful >> completion. >> >> If that is the case, then the completion path could be triggered, >> the tag would be reused, triggering a new .queue_rq, setting again >> the req.iter with the new bio params (all is not taken by the >> send_mutex) and then the send context would call nvme_tcp_advance_req >> progressing the req.iter with the former sent bytes... And given that >> the req.iter is used for reads/writes, it is possible that it can >> explain both issues. >> >> While this is not easy to trigger, there is nothing I think that >> can prevent that. The driver used to have a single context that >> would do both send and recv so this could not have happened, but >> now that we added the .queue_rq send context, I guess this can >> indeed confuse the driver. > > Awesome, this is exactly the type of sequence I've been trying to > capture, but couldn't quite get there. Now that you've described it, > that flow can certainly explain the observations, including the > corrupted debug trace event I was trying to add. > > The sequence looks unlikely to happen, which agrees with the difficulty > in reproducing it. I am betting right now that you got it, but a little > surprised no one else is reporting a similar problem yet. We had at least one report from Potnuri that I think may have been triggered by this, this ended up fixed (or rather worked-around with 5c11f7d9f843). > Your option "1" looks like the best one, IMO. 
I've requested dropping > all debug and test patches and using just this one on the current nvme > baseline for the next test cycle. Cool, waiting to hear back... _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-05-11 17:17 ` Sagi Grimberg @ 2021-05-13 15:48 ` Keith Busch 2021-05-13 19:53 ` Sagi Grimberg 0 siblings, 1 reply; 50+ messages in thread From: Keith Busch @ 2021-05-13 15:48 UTC (permalink / raw) To: Sagi Grimberg; +Cc: linux-nvme, hch On Tue, May 11, 2021 at 10:17:09AM -0700, Sagi Grimberg wrote: > > > > I may have a theory to this issue. I think that the problem is in > > > cases where we send commands with data to the controller and then in > > > nvme_tcp_send_data between the last successful kernel_sendpage > > > and before nvme_tcp_advance_req, the controller sends back a successful > > > completion. > > > > > > If that is the case, then the completion path could be triggered, > > > the tag would be reused, triggering a new .queue_rq, setting again > > > the req.iter with the new bio params (all is not taken by the > > > send_mutex) and then the send context would call nvme_tcp_advance_req > > > progressing the req.iter with the former sent bytes... And given that > > > the req.iter is used for reads/writes, it is possible that it can > > > explain both issues. > > > > > > While this is not easy to trigger, there is nothing I think that > > > can prevent that. The driver used to have a single context that > > > would do both send and recv so this could not have happened, but > > > now that we added the .queue_rq send context, I guess this can > > > indeed confuse the driver. > > > > Awesome, this is exactly the type of sequence I've been trying to > > capture, but couldn't quite get there. Now that you've described it, > > that flow can certainly explain the observations, including the > > corrupted debug trace event I was trying to add. > > > > The sequence looks unlikely to happen, which agrees with the difficulty > > in reproducing it. I am betting right now that you got it, but a little > > surprised no one else is reporting a similar problem yet. 
>
> We had at least one report from Potnuri that I think may have been
> triggered by this; this ended up fixed (or rather worked around
> with 5c11f7d9f843).
>
> > Your option "1" looks like the best one, IMO. I've requested dropping
> > all debug and test patches and using just this one on the current nvme
> > baseline for the next test cycle.
>
> Cool, waiting to hear back...

This patch has been tested successfully on the initial workloads. There are several more that need to be validated, but each one runs for many hours, so it may be a couple more days before they're completed. Just wanted to let you know: so far, so good.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors 2021-05-13 15:48 ` Keith Busch @ 2021-05-13 19:53 ` Sagi Grimberg 2021-05-17 20:48 ` Keith Busch 0 siblings, 1 reply; 50+ messages in thread From: Sagi Grimberg @ 2021-05-13 19:53 UTC (permalink / raw) To: Keith Busch; +Cc: linux-nvme, hch On 5/13/21 8:48 AM, Keith Busch wrote: > On Tue, May 11, 2021 at 10:17:09AM -0700, Sagi Grimberg wrote: >> >>>> I may have a theory to this issue. I think that the problem is in >>>> cases where we send commands with data to the controller and then in >>>> nvme_tcp_send_data between the last successful kernel_sendpage >>>> and before nvme_tcp_advance_req, the controller sends back a successful >>>> completion. >>>> >>>> If that is the case, then the completion path could be triggered, >>>> the tag would be reused, triggering a new .queue_rq, setting again >>>> the req.iter with the new bio params (all is not taken by the >>>> send_mutex) and then the send context would call nvme_tcp_advance_req >>>> progressing the req.iter with the former sent bytes... And given that >>>> the req.iter is used for reads/writes, it is possible that it can >>>> explain both issues. >>>> >>>> While this is not easy to trigger, there is nothing I think that >>>> can prevent that. The driver used to have a single context that >>>> would do both send and recv so this could not have happened, but >>>> now that we added the .queue_rq send context, I guess this can >>>> indeed confuse the driver. >>> >>> Awesome, this is exactly the type of sequence I've been trying to >>> capture, but couldn't quite get there. Now that you've described it, >>> that flow can certainly explain the observations, including the >>> corrupted debug trace event I was trying to add. >>> >>> The sequence looks unlikely to happen, which agrees with the difficulty >>> in reproducing it. I am betting right now that you got it, but a little >>> surprised no one else is reporting a similar problem yet. 
>> >> We had at least one report from Potnuri that I think may have been >> triggered by this, this ended up fixed (or rather worked-around >> with 5c11f7d9f843). >> >>> Your option "1" looks like the best one, IMO. I've requested dropping >>> all debug and test patches and using just this one on the current nvme >>> baseline for the next test cycle. >> >> Cool, waiting to hear back... > > This patch has been tested successfully on the initial workloads. There > are several more that need to be validated, but each one runs for many > hours, so it may be a couple more days before completed. Just wanted to > leat you know: so far, so good. Encouraging... I'll send a patch for that as soon as you give me the final verdict. I'm assuming Narayan would be the reporter and the tester? _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: nvme tcp receive errors
  2021-05-13 19:53 ` Sagi Grimberg
@ 2021-05-17 20:48 ` Keith Busch
  0 siblings, 0 replies; 50+ messages in thread
From: Keith Busch @ 2021-05-17 20:48 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: linux-nvme, hch

On Thu, May 13, 2021 at 12:53:54PM -0700, Sagi Grimberg wrote:
> On 5/13/21 8:48 AM, Keith Busch wrote:
> > On Tue, May 11, 2021 at 10:17:09AM -0700, Sagi Grimberg wrote:
> > > > > I may have a theory about this issue. I think that the problem is
> > > > > in cases where we send commands with data to the controller, and
> > > > > then, in nvme_tcp_send_data, between the last successful
> > > > > kernel_sendpage and before nvme_tcp_advance_req, the controller
> > > > > sends back a successful completion.
> > > > >
> > > > > If that is the case, then the completion path could be triggered,
> > > > > the tag would be reused, triggering a new .queue_rq, setting the
> > > > > req.iter again with the new bio params (none of this is protected
> > > > > by the send_mutex), and then the send context would call
> > > > > nvme_tcp_advance_req, progressing the req.iter by the formerly
> > > > > sent bytes... And given that the req.iter is used for both reads
> > > > > and writes, it is possible that this explains both issues.
> > > > >
> > > > > While this is not easy to trigger, I think there is nothing that
> > > > > can prevent it. The driver used to have a single context that
> > > > > would do both send and recv, so this could not have happened, but
> > > > > now that we added the .queue_rq send context, I guess this can
> > > > > indeed confuse the driver.
> > > >
> > > > Awesome, this is exactly the type of sequence I've been trying to
> > > > capture, but couldn't quite get there. Now that you've described it,
> > > > that flow can certainly explain the observations, including the
> > > > corrupted debug trace event I was trying to add.
> > > >
> > > > The sequence looks unlikely to happen, which agrees with the
> > > > difficulty in reproducing it. I am betting right now that you got
> > > > it, but am a little surprised no one else has reported a similar
> > > > problem yet.
> > >
> > > We had at least one report from Potnuri that I think may have been
> > > triggered by this; it ended up fixed (or rather worked around with
> > > 5c11f7d9f843).
> > >
> > > > Your option "1" looks like the best one, IMO. I've requested
> > > > dropping all debug and test patches and using just this one on the
> > > > current nvme baseline for the next test cycle.
> > >
> > > Cool, waiting to hear back...
> >
> > This patch has been tested successfully on the initial workloads. There
> > are several more that need to be validated, but each one runs for many
> > hours, so it may be a couple more days before they are completed. Just
> > wanted to let you know: so far, so good.
>
> Encouraging... I'll send a patch for that as soon as you give me the
> final verdict. I'm assuming Narayan would be the reporter and the
> tester?

This tests successfully. There was one timeout issue observed across all
the testing, but it does not appear related to the problems reported here,
or to your fix, so I will start a new thread on that if I can get more
information on it.

You may use the following tags for the commit log:

Reported-by: Narayan Ayalasomayajula <narayan.ayalasomayajula@wdc.com>
Tested-by: Anil Mishra <anil.mishra@wdc.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
end of thread, other threads: [~2021-05-17 20:50 UTC | newest]

Thread overview: 50+ messages -- links below jump to the message on this page --
2021-03-31 16:18 nvme tcp receive errors Keith Busch
2021-03-31 19:10 ` Sagi Grimberg
2021-03-31 20:49 ` Keith Busch
2021-03-31 22:16 ` Sagi Grimberg
2021-03-31 22:26 ` Keith Busch
2021-03-31 22:45 ` Sagi Grimberg
2021-04-02 17:11 ` Keith Busch
2021-04-02 17:27 ` Sagi Grimberg
2021-04-05 14:37 ` Keith Busch
2021-04-07 19:53 ` Keith Busch
2021-04-09 21:38 ` Sagi Grimberg
2021-04-27 23:39 ` Keith Busch
2021-04-27 23:55 ` Sagi Grimberg
2021-04-28 15:58 ` Keith Busch
2021-04-28 17:42 ` Sagi Grimberg
2021-04-28 18:01 ` Keith Busch
2021-04-28 23:06 ` Sagi Grimberg
2021-04-29  3:33 ` Keith Busch
2021-04-29  4:52 ` Sagi Grimberg
2021-05-03 18:51 ` Keith Busch
2021-05-03 19:58 ` Sagi Grimberg
2021-05-03 20:25 ` Keith Busch
2021-05-04 19:29 ` Sagi Grimberg
2021-04-09 18:04 ` Sagi Grimberg
2021-04-14  0:29 ` Keith Busch
2021-04-21  5:33 ` Sagi Grimberg
2021-04-21 14:28 ` Keith Busch
2021-04-21 16:59 ` Sagi Grimberg
2021-04-26 15:31 ` Keith Busch
2021-04-27  3:10 ` Sagi Grimberg
2021-04-27 18:12 ` Keith Busch
2021-04-27 23:58 ` Sagi Grimberg
2021-04-30 23:42 ` Sagi Grimberg
2021-05-03 14:28 ` Keith Busch
2021-05-03 19:36 ` Sagi Grimberg
2021-05-03 19:38 ` Sagi Grimberg
2021-05-03 19:44 ` Keith Busch
2021-05-03 20:00 ` Sagi Grimberg
2021-05-04 14:36 ` Keith Busch
2021-05-04 18:15 ` Sagi Grimberg
2021-05-04 19:14 ` Keith Busch
2021-05-10 18:06 ` Keith Busch
2021-05-10 18:18 ` Sagi Grimberg
2021-05-10 18:30 ` Keith Busch
2021-05-10 21:07 ` Sagi Grimberg
2021-05-11  3:00 ` Keith Busch
2021-05-11 17:17 ` Sagi Grimberg
2021-05-13 15:48 ` Keith Busch
2021-05-13 19:53 ` Sagi Grimberg
2021-05-17 20:48 ` Keith Busch