* nvme-tcp: i/o errors and stalled host from failure to send command pdus
@ 2022-08-30 23:10 Jonathan Nicklin
  2022-08-31  9:02 ` Sagi Grimberg
  0 siblings, 1 reply; 21+ messages in thread
From: Jonathan Nicklin @ 2022-08-30 23:10 UTC (permalink / raw)
  To: linux-nvme

Hello,

We've seen sporadic I/O errors with nvme-tcp in our lab when testing
with kernel versions 5.15 through 5.19. Using a combination of tracing
events and network traces, we can see nvme_setup_cmd() trace events
that do not result in transmitted command PDUs.

Here's a snippet of the relevant portion of the nvme trace. The
following events are the last of tens of thousands of successfully
executed reads...

tar-9004    [004] ..... 22457.614117: nvme_setup_cmd:
  nvme1: disk=nvme1c1n1, qid=5, cmdid=53272, nsid=1, flags=0x0, meta=0x0,
  cmd=(nvme_cmd_read slba=15760448, len=55, ctrl=0x8000, dsmgmt=7, reftag=0)
tar-9004    [004] ..... 22457.614118: nvme_setup_cmd:
  nvme1: disk=nvme1c1n1, qid=5, cmdid=61463, nsid=1, flags=0x0, meta=0x0,
  cmd=(nvme_cmd_read slba=15760384, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
tar-9004    [004] ..... 22457.614119: nvme_setup_cmd:
  nvme1: disk=nvme1c1n1, qid=5, cmdid=16406, nsid=1, flags=0x0, meta=0x0,
  cmd=(nvme_cmd_read slba=15760376, len=7, ctrl=0x8000, dsmgmt=7, reftag=0)
tar-9004    [004] ..... 22457.614119: nvme_setup_cmd:
  nvme1: disk=nvme1c1n1, qid=5, cmdid=28693, nsid=1, flags=0x0, meta=0x0,
  cmd=(nvme_cmd_read slba=15760312, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
tar-9004    [004] ..... 22457.614119: nvme_setup_cmd:
  nvme1: disk=nvme1c1n1, qid=5, cmdid=40980, nsid=1, flags=0x0, meta=0x0,
  cmd=(nvme_cmd_read slba=15760248, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)

The following 5 commands are issued to nvme, but never make it to the
wire. The wire state was verified manually using tcpdump on the host and
trace events on the target. Perhaps the "need-resched" marker in the
trace and the context switch to the kernel worker are relevant
to the failure?

tar-9004    [004] .n... 22457.614171: nvme_setup_cmd:
  nvme1: disk=nvme1c1n1, qid=5, cmdid=61469, nsid=1, flags=0x0, meta=0x0,
  cmd=(nvme_cmd_read slba=15760704, len=55, ctrl=0x8000, dsmgmt=7, reftag=0)
kworker/4:1H-8668    [004] ..... 22457.614183: nvme_setup_cmd:
  nvme1: disk=nvme1c1n1, qid=5, cmdid=45084, nsid=1, flags=0x0, meta=0x0,
  cmd=(nvme_cmd_read slba=15760640, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
kworker/4:1H-8668    [004] ..... 22457.614184: nvme_setup_cmd:
  nvme1: disk=nvme1c1n1, qid=5, cmdid=28699, nsid=1, flags=0x0, meta=0x0,
  cmd=(nvme_cmd_read slba=15760632, len=7, ctrl=0x8000, dsmgmt=7, reftag=0)
kworker/4:1H-8668    [004] ..... 22457.614184: nvme_setup_cmd:
  nvme1: disk=nvme1c1n1, qid=5, cmdid=40986, nsid=1, flags=0x0, meta=0x0,
  cmd=(nvme_cmd_read slba=15760568, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
kworker/4:1H-8668    [004] ..... 22457.614184: nvme_setup_cmd:
  nvme1: disk=nvme1c1n1, qid=5, cmdid=16409, nsid=1, flags=0x0, meta=0x0,
  cmd=(nvme_cmd_read slba=15760504, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)

The reads above eventually time out and recover after reset. Here are
the related messages.

[22486.420606] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760568, 64 blocks, I/O
Error (sct 0x3 / sc 0x71)
[22486.429326] I/O error, dev nvme1c1n1, sector 15760568 op 0x0:(READ)
flags 0x4084700 phys_seg 1 prio class 0
[22486.439262] nvme nvme1: queue 5: timeout request 0x1b type 4
[22486.461788] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760632, 8 blocks, I/O
Error (sct 0x3 / sc 0x71)
[22486.470440] I/O error, dev nvme1c1n1, sector 15760632 op 0x0:(READ)
flags 0x4084700 phys_seg 1 prio class 0
[22486.480386] nvme nvme1: queue 5: timeout request 0x1c type 4
[22486.486256] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760640, 64 blocks, I/O
Error (sct 0x3 / sc 0x71)
[22486.494994] I/O error, dev nvme1c1n1, sector 15760640 op 0x0:(READ)
flags 0x4084700 phys_seg 2 prio class 0
[22486.504946] nvme nvme1: queue 5: timeout request 0x1d type 4
[22486.510822] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760704, 56 blocks, I/O
Error (sct 0x3 / sc 0x71)
[22486.519563] I/O error, dev nvme1c1n1, sector 15760704 op 0x0:(READ)
flags 0x4080700 phys_seg 2 prio class 0
[22486.529723] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760504, 64 blocks, I/O
Error (sct 0x3 / sc 0x71)
[22486.538610] I/O error, dev nvme1c1n1, sector 15760504 op 0x0:(READ)
flags 0x4084700 phys_seg 2 prio class 0

To further aid in debugging, I resurrected Keith's nvme_tcp tracing event
patch from a ways back. If my understanding is correct,
nvme_tcp_cmd_pdu() is being called for the commands in question.

I've isolated the failure to a scenario we can reproduce at will
in about a minute of runtime. Is there any additional debug that we can
provide to help locate the bug?

Thanks,
-Jonathan



* Re: nvme-tcp: i/o errors and stalled host from failure to send command pdus
  2022-08-30 23:10 nvme-tcp: i/o errors and stalled host from failure to send command pdus Jonathan Nicklin
@ 2022-08-31  9:02 ` Sagi Grimberg
  2022-08-31 16:04   ` Jonathan Nicklin
  0 siblings, 1 reply; 21+ messages in thread
From: Sagi Grimberg @ 2022-08-31  9:02 UTC (permalink / raw)
  To: Jonathan Nicklin, linux-nvme


> Hello,

Hey Jonathan,

> We've seen sporadic I/O errors with nvme-tcp in our lab when testing
> with kernel versions 5.15 through 5.19. Using a combination of tracing
> events and network traces, we can see nvme_setup_cmd() trace events
> that do not result in transmitted commands PDUs.
> 
> Here's a snippet of the relevant portion of the nvme trace. The
> following events are the last of tens of thousands of successully
> executed reads...
> 
> tar-9004    [004] ..... 22457.614117: nvme_setup_cmd:
>    nvme1: disk=nvme1c1n1, qid=5, cmdid=53272, nsid=1, flags=0x0, meta=0x0,
>    cmd=(nvme_cmd_read slba=15760448, len=55, ctrl=0x8000, dsmgmt=7, reftag=0)
> tar-9004    [004] ..... 22457.614118: nvme_setup_cmd:
>    nvme1: disk=nvme1c1n1, qid=5, cmdid=61463, nsid=1, flags=0x0, meta=0x0,
>    cmd=(nvme_cmd_read slba=15760384, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> tar-9004    [004] ..... 22457.614119: nvme_setup_cmd:
>    nvme1: disk=nvme1c1n1, qid=5, cmdid=16406, nsid=1, flags=0x0, meta=0x0,
>    cmd=(nvme_cmd_read slba=15760376, len=7, ctrl=0x8000, dsmgmt=7, reftag=0)
> tar-9004    [004] ..... 22457.614119: nvme_setup_cmd:
>    nvme1: disk=nvme1c1n1, qid=5, cmdid=28693, nsid=1, flags=0x0, meta=0x0,
>    cmd=(nvme_cmd_read slba=15760312, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> tar-9004    [004] ..... 22457.614119: nvme_setup_cmd:
>    nvme1: disk=nvme1c1n1, qid=5, cmdid=40980, nsid=1, flags=0x0, meta=0x0,
>    cmd=(nvme_cmd_read slba=15760248, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> 
> The following 5 commands are issued to nvme, but never make it to the
> wire. The wire state was verified manually using tcpdump on the host and
> trace events on the target. Perhaps the "need-resched" marker in the
> trace and the context switch to the kernel worker are relevant
> to the failure?
> 
> tar-9004    [004] .n... 22457.614171: nvme_setup_cmd:
>    nvme1: disk=nvme1c1n1, qid=5, cmdid=61469, nsid=1, flags=0x0, meta=0x0,
>    cmd=(nvme_cmd_read slba=15760704, len=55, ctrl=0x8000, dsmgmt=7, reftag=0)
> kworker/4:1H-8668    [004] ..... 22457.614183: nvme_setup_cmd:
>    nvme1: disk=nvme1c1n1, qid=5, cmdid=45084, nsid=1, flags=0x0, meta=0x0,
>    cmd=(nvme_cmd_read slba=15760640, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> kworker/4:1H-8668    [004] ..... 22457.614184: nvme_setup_cmd:
>    nvme1: disk=nvme1c1n1, qid=5, cmdid=28699, nsid=1, flags=0x0, meta=0x0,
>    cmd=(nvme_cmd_read slba=15760632, len=7, ctrl=0x8000, dsmgmt=7, reftag=0)
> kworker/4:1H-8668    [004] ..... 22457.614184: nvme_setup_cmd:
>    nvme1: disk=nvme1c1n1, qid=5, cmdid=40986, nsid=1, flags=0x0, meta=0x0,
>    cmd=(nvme_cmd_read slba=15760568, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> kworker/4:1H-8668    [004] ..... 22457.614184: nvme_setup_cmd:
>    nvme1: disk=nvme1c1n1, qid=5, cmdid=16409, nsid=1, flags=0x0, meta=0x0,
>    cmd=(nvme_cmd_read slba=15760504, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> 
> The reads above eventually timeout and recover after reset. Here are
> the related reset messages.
> 
> [22486.420606] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760568, 64 blocks, I/O
> Error (sct 0x3 / sc 0x71)
> [22486.429326] I/O error, dev nvme1c1n1, sector 15760568 op 0x0:(READ)
> flags 0x4084700 phys_seg 1 prio class 0
> [22486.439262] nvme nvme1: queue 5: timeout request 0x1b type 4
> [22486.461788] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760632, 8 blocks, I/O
> Error (sct 0x3 / sc 0x71)
> [22486.470440] I/O error, dev nvme1c1n1, sector 15760632 op 0x0:(READ)
> flags 0x4084700 phys_seg 1 prio class 0
> [22486.480386] nvme nvme1: queue 5: timeout request 0x1c type 4
> [22486.486256] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760640, 64 blocks, I/O
> Error (sct 0x3 / sc 0x71)
> [22486.494994] I/O error, dev nvme1c1n1, sector 15760640 op 0x0:(READ)
> flags 0x4084700 phys_seg 2 prio class 0
> [22486.504946] nvme nvme1: queue 5: timeout request 0x1d type 4
> [22486.510822] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760704, 56 blocks, I/O
> Error (sct 0x3 / sc 0x71)
> [22486.519563] I/O error, dev nvme1c1n1, sector 15760704 op 0x0:(READ)
> flags 0x4080700 phys_seg 2 prio class 0
> [22486.529723] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760504, 64 blocks, I/O
> Error (sct 0x3 / sc 0x71)
> [22486.538610] I/O error, dev nvme1c1n1, sector 15760504 op 0x0:(READ)
> flags 0x4084700 phys_seg 2 prio class 0
> 
> To further aid in debugging, I resurrected Keith's nvme_tcp tracing event
> patch from a ways back. If my understanding is correct,
> nvme_tcp_cmd_pdu() is being called for the commands in question.
> 
> I've isolated the failure to a scenario we can reproduce at will
> in about a minute of runtime. Is there any additional debug that we can
> provide to help locate the bug?

It would be interesting to understand whether nvme_tcp_queue_request()
scheduled io_work for these requests (this can be marked on the command
and printed when the timeout expires). If not, it would be interesting
to understand why.
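
Something along these lines would do. This is only an untested sketch;
the dbg_* fields are hypothetical debug-only additions to struct
nvme_tcp_request, just for this experiment:

    /* in nvme_tcp_queue_request(), record what the queueing path decided */
    req->dbg_io_cpu = queue->io_cpu;
    req->dbg_cpu    = raw_smp_processor_id();
    req->dbg_sync   = sync;
    req->dbg_empty  = empty;
    req->dbg_last   = last;
    req->dbg_queued_io_work = false; /* set true in the queue_work_on() branch */

    /* in the timeout handler, next to the existing "timeout request" print */
    pr_warn("nvme_tcp: tag %#x io_cpu %d cpu %d sync %d empty %d last %d io_work %d\n",
            rq->tag, req->dbg_io_cpu, req->dbg_cpu, req->dbg_sync,
            req->dbg_empty, req->dbg_last, req->dbg_queued_io_work);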



* Re: nvme-tcp: i/o errors and stalled host from failure to send command pdus
  2022-08-31  9:02 ` Sagi Grimberg
@ 2022-08-31 16:04   ` Jonathan Nicklin
  2022-08-31 18:50     ` Sagi Grimberg
  0 siblings, 1 reply; 21+ messages in thread
From: Jonathan Nicklin @ 2022-08-31 16:04 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme

On Wed, Aug 31, 2022 at 5:02 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
> > Hello,
>
> Hey Jonathan,
>
> > We've seen sporadic I/O errors with nvme-tcp in our lab when testing
> > with kernel versions 5.15 through 5.19. Using a combination of tracing
> > events and network traces, we can see nvme_setup_cmd() trace events
> > that do not result in transmitted commands PDUs.
> >
> > Here's a snippet of the relevant portion of the nvme trace. The
> > following events are the last of tens of thousands of successully
> > executed reads...
> >
> > tar-9004    [004] ..... 22457.614117: nvme_setup_cmd:
> >    nvme1: disk=nvme1c1n1, qid=5, cmdid=53272, nsid=1, flags=0x0, meta=0x0,
> >    cmd=(nvme_cmd_read slba=15760448, len=55, ctrl=0x8000, dsmgmt=7, reftag=0)
> > tar-9004    [004] ..... 22457.614118: nvme_setup_cmd:
> >    nvme1: disk=nvme1c1n1, qid=5, cmdid=61463, nsid=1, flags=0x0, meta=0x0,
> >    cmd=(nvme_cmd_read slba=15760384, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > tar-9004    [004] ..... 22457.614119: nvme_setup_cmd:
> >    nvme1: disk=nvme1c1n1, qid=5, cmdid=16406, nsid=1, flags=0x0, meta=0x0,
> >    cmd=(nvme_cmd_read slba=15760376, len=7, ctrl=0x8000, dsmgmt=7, reftag=0)
> > tar-9004    [004] ..... 22457.614119: nvme_setup_cmd:
> >    nvme1: disk=nvme1c1n1, qid=5, cmdid=28693, nsid=1, flags=0x0, meta=0x0,
> >    cmd=(nvme_cmd_read slba=15760312, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > tar-9004    [004] ..... 22457.614119: nvme_setup_cmd:
> >    nvme1: disk=nvme1c1n1, qid=5, cmdid=40980, nsid=1, flags=0x0, meta=0x0,
> >    cmd=(nvme_cmd_read slba=15760248, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> >
> > The following 5 commands are issued to nvme, but never make it to the
> > wire. The wire state was verified manually using tcpdump on the host and
> > trace events on the target. Perhaps the "need-resched" marker in the
> > trace and the context switch to the kernel worker are relevant
> > to the failure?
> >
> > tar-9004    [004] .n... 22457.614171: nvme_setup_cmd:
> >    nvme1: disk=nvme1c1n1, qid=5, cmdid=61469, nsid=1, flags=0x0, meta=0x0,
> >    cmd=(nvme_cmd_read slba=15760704, len=55, ctrl=0x8000, dsmgmt=7, reftag=0)
> > kworker/4:1H-8668    [004] ..... 22457.614183: nvme_setup_cmd:
> >    nvme1: disk=nvme1c1n1, qid=5, cmdid=45084, nsid=1, flags=0x0, meta=0x0,
> >    cmd=(nvme_cmd_read slba=15760640, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > kworker/4:1H-8668    [004] ..... 22457.614184: nvme_setup_cmd:
> >    nvme1: disk=nvme1c1n1, qid=5, cmdid=28699, nsid=1, flags=0x0, meta=0x0,
> >    cmd=(nvme_cmd_read slba=15760632, len=7, ctrl=0x8000, dsmgmt=7, reftag=0)
> > kworker/4:1H-8668    [004] ..... 22457.614184: nvme_setup_cmd:
> >    nvme1: disk=nvme1c1n1, qid=5, cmdid=40986, nsid=1, flags=0x0, meta=0x0,
> >    cmd=(nvme_cmd_read slba=15760568, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > kworker/4:1H-8668    [004] ..... 22457.614184: nvme_setup_cmd:
> >    nvme1: disk=nvme1c1n1, qid=5, cmdid=16409, nsid=1, flags=0x0, meta=0x0,
> >    cmd=(nvme_cmd_read slba=15760504, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> >
> > The reads above eventually timeout and recover after reset. Here are
> > the related reset messages.
> >
> > [22486.420606] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760568, 64 blocks, I/O
> > Error (sct 0x3 / sc 0x71)
> > [22486.429326] I/O error, dev nvme1c1n1, sector 15760568 op 0x0:(READ)
> > flags 0x4084700 phys_seg 1 prio class 0
> > [22486.439262] nvme nvme1: queue 5: timeout request 0x1b type 4
> > [22486.461788] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760632, 8 blocks, I/O
> > Error (sct 0x3 / sc 0x71)
> > [22486.470440] I/O error, dev nvme1c1n1, sector 15760632 op 0x0:(READ)
> > flags 0x4084700 phys_seg 1 prio class 0
> > [22486.480386] nvme nvme1: queue 5: timeout request 0x1c type 4
> > [22486.486256] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760640, 64 blocks, I/O
> > Error (sct 0x3 / sc 0x71)
> > [22486.494994] I/O error, dev nvme1c1n1, sector 15760640 op 0x0:(READ)
> > flags 0x4084700 phys_seg 2 prio class 0
> > [22486.504946] nvme nvme1: queue 5: timeout request 0x1d type 4
> > [22486.510822] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760704, 56 blocks, I/O
> > Error (sct 0x3 / sc 0x71)
> > [22486.519563] I/O error, dev nvme1c1n1, sector 15760704 op 0x0:(READ)
> > flags 0x4080700 phys_seg 2 prio class 0
> > [22486.529723] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760504, 64 blocks, I/O
> > Error (sct 0x3 / sc 0x71)
> > [22486.538610] I/O error, dev nvme1c1n1, sector 15760504 op 0x0:(READ)
> > flags 0x4084700 phys_seg 2 prio class 0
> >
> > To further aid in debugging, I resurrected Keith's nvme_tcp tracing event
> > patch from a ways back. If my understanding is correct,
> > nvme_tcp_cmd_pdu() is being called for the commands in question.
> >
> > I've isolated the failure to a scenario we can reproduce at will
> > in about a minute of runtime. Is there any additional debug that we can
> > provide to help locate the bug?
>
> Would be interesting to understand if for these requests,
> nvme_tcp_queue_request scheduled io_work or not (can be marked on the
> command and printed during the timeout expiration). If not, would be
> interesting to understand why.

Okay, here's a repro with a bit more context. I recorded the state
from nvme_tcp_queue_request() in the request and printed it out as you
suggested. I also added trace events for nvme_tcp_queue_rq() to make
it easier to match the state.

- lock is the return value of mutex_trylock() (if executed)
- more is the return value of nvme_tcp_queue_more() (if executed)
- this_cpu is the value of raw_smp_processor_id()

Here's the trace output for the failed requests with the associated
completion messages (from dmesg) stitched in.

  tar-2153 [001] .....   536.383588: nvme_tcp_queue_rq:
    nvme1: qid=2 tag=5 op=0 data_len=32768
  tar-2153 [001] .....   536.383588: nvme_setup_cmd:
    nvme1: disk=nvme1c1n1, qid=2, cmdid=8197, nsid=1, flags=0x0, meta=0x0,
      cmd=(nvme_cmd_read slba=9950336, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
  dmesg.out           [  565.574700] nvme_tcp_queue_request()
    state: tag 0x5 io_cpu 1 this_cpu 1 lock 1 sync 1 empty 1 last 0 more 1

 tar-2153  [001] .N...   536.383594: nvme_tcp_queue_rq:
   nvme1: qid=2 tag=4 op=0 data_len=32768
 tar-2153  [001] .N...   536.383594: nvme_setup_cmd:
   nvme1: disk=nvme1c1n1, qid=2, cmdid=16388, nsid=1, flags=0x0, meta=0x0,
     cmd=(nvme_cmd_read slba=9950272, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
 dmesg.out            [  565.540965] nvme_tcp_queue_request()
   state: tag 0x4 io_cpu 1 this_cpu 5 lock 1 sync 1 empty 1 last 0 more 1

kworker/1:0H-23  [001] .....   536.383619: nvme_tcp_queue_rq:
  nvme1: qid=2 tag=3 op=0 data_len=32768
kworker/1:0H-23  [001] .....   536.383619: nvme_setup_cmd:
  nvme1: disk=nvme1c1n1, qid=2, cmdid=24579, nsid=1, flags=0x0, meta=0x0,
    cmd=(nvme_cmd_read slba=9950208, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
dmesg.out              [  565.490597] nvme_tcp_queue_request()
  state: tag 0x3 io_cpu 1 this_cpu 1 lock 0 sync 1 empty 0 last 0 more 1

kworker/1:0H-23  [001] .....   536.383620: nvme_tcp_queue_rq:
  nvme1: qid=2 tag=2 op=0 data_len=32768
kworker/1:0H-23  [001] .....   536.383620: nvme_setup_cmd:
  nvme1: disk=nvme1c1n1, qid=2, cmdid=40962, nsid=1, flags=0x0, meta=0x0,
    cmd=(nvme_cmd_read slba=9950144, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
dmesg.out                   [  565.470687] nvme_tcp_queue_request()
  state: tag 0x2 io_cpu 1 this_cpu 1 lock 0 sync 1 empty 0 last 1 more 1

Does this shed any light on the issue?



* Re: nvme-tcp: i/o errors and stalled host from failure to send command pdus
  2022-08-31 16:04   ` Jonathan Nicklin
@ 2022-08-31 18:50     ` Sagi Grimberg
  2022-08-31 19:06       ` Jonathan Nicklin
  0 siblings, 1 reply; 21+ messages in thread
From: Sagi Grimberg @ 2022-08-31 18:50 UTC (permalink / raw)
  To: Jonathan Nicklin; +Cc: linux-nvme



On 8/31/22 19:04, Jonathan Nicklin wrote:
> On Wed, Aug 31, 2022 at 5:02 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>>
>>
>>> Hello,
>>
>> Hey Jonathan,
>>
>>> We've seen sporadic I/O errors with nvme-tcp in our lab when testing
>>> with kernel versions 5.15 through 5.19. Using a combination of tracing
>>> events and network traces, we can see nvme_setup_cmd() trace events
>>> that do not result in transmitted commands PDUs.
>>>
>>> Here's a snippet of the relevant portion of the nvme trace. The
>>> following events are the last of tens of thousands of successully
>>> executed reads...
>>>
>>> tar-9004    [004] ..... 22457.614117: nvme_setup_cmd:
>>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=53272, nsid=1, flags=0x0, meta=0x0,
>>>     cmd=(nvme_cmd_read slba=15760448, len=55, ctrl=0x8000, dsmgmt=7, reftag=0)
>>> tar-9004    [004] ..... 22457.614118: nvme_setup_cmd:
>>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=61463, nsid=1, flags=0x0, meta=0x0,
>>>     cmd=(nvme_cmd_read slba=15760384, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
>>> tar-9004    [004] ..... 22457.614119: nvme_setup_cmd:
>>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=16406, nsid=1, flags=0x0, meta=0x0,
>>>     cmd=(nvme_cmd_read slba=15760376, len=7, ctrl=0x8000, dsmgmt=7, reftag=0)
>>> tar-9004    [004] ..... 22457.614119: nvme_setup_cmd:
>>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=28693, nsid=1, flags=0x0, meta=0x0,
>>>     cmd=(nvme_cmd_read slba=15760312, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
>>> tar-9004    [004] ..... 22457.614119: nvme_setup_cmd:
>>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=40980, nsid=1, flags=0x0, meta=0x0,
>>>     cmd=(nvme_cmd_read slba=15760248, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
>>>
>>> The following 5 commands are issued to nvme, but never make it to the
>>> wire. The wire state was verified manually using tcpdump on the host and
>>> trace events on the target. Perhaps the "need-resched" marker in the
>>> trace and the context switch to the kernel worker are relevant
>>> to the failure?
>>>
>>> tar-9004    [004] .n... 22457.614171: nvme_setup_cmd:
>>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=61469, nsid=1, flags=0x0, meta=0x0,
>>>     cmd=(nvme_cmd_read slba=15760704, len=55, ctrl=0x8000, dsmgmt=7, reftag=0)
>>> kworker/4:1H-8668    [004] ..... 22457.614183: nvme_setup_cmd:
>>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=45084, nsid=1, flags=0x0, meta=0x0,
>>>     cmd=(nvme_cmd_read slba=15760640, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
>>> kworker/4:1H-8668    [004] ..... 22457.614184: nvme_setup_cmd:
>>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=28699, nsid=1, flags=0x0, meta=0x0,
>>>     cmd=(nvme_cmd_read slba=15760632, len=7, ctrl=0x8000, dsmgmt=7, reftag=0)
>>> kworker/4:1H-8668    [004] ..... 22457.614184: nvme_setup_cmd:
>>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=40986, nsid=1, flags=0x0, meta=0x0,
>>>     cmd=(nvme_cmd_read slba=15760568, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
>>> kworker/4:1H-8668    [004] ..... 22457.614184: nvme_setup_cmd:
>>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=16409, nsid=1, flags=0x0, meta=0x0,
>>>     cmd=(nvme_cmd_read slba=15760504, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
>>>
>>> The reads above eventually timeout and recover after reset. Here are
>>> the related reset messages.
>>>
>>> [22486.420606] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760568, 64 blocks, I/O
>>> Error (sct 0x3 / sc 0x71)
>>> [22486.429326] I/O error, dev nvme1c1n1, sector 15760568 op 0x0:(READ)
>>> flags 0x4084700 phys_seg 1 prio class 0
>>> [22486.439262] nvme nvme1: queue 5: timeout request 0x1b type 4
>>> [22486.461788] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760632, 8 blocks, I/O
>>> Error (sct 0x3 / sc 0x71)
>>> [22486.470440] I/O error, dev nvme1c1n1, sector 15760632 op 0x0:(READ)
>>> flags 0x4084700 phys_seg 1 prio class 0
>>> [22486.480386] nvme nvme1: queue 5: timeout request 0x1c type 4
>>> [22486.486256] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760640, 64 blocks, I/O
>>> Error (sct 0x3 / sc 0x71)
>>> [22486.494994] I/O error, dev nvme1c1n1, sector 15760640 op 0x0:(READ)
>>> flags 0x4084700 phys_seg 2 prio class 0
>>> [22486.504946] nvme nvme1: queue 5: timeout request 0x1d type 4
>>> [22486.510822] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760704, 56 blocks, I/O
>>> Error (sct 0x3 / sc 0x71)
>>> [22486.519563] I/O error, dev nvme1c1n1, sector 15760704 op 0x0:(READ)
>>> flags 0x4080700 phys_seg 2 prio class 0
>>> [22486.529723] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760504, 64 blocks, I/O
>>> Error (sct 0x3 / sc 0x71)
>>> [22486.538610] I/O error, dev nvme1c1n1, sector 15760504 op 0x0:(READ)
>>> flags 0x4084700 phys_seg 2 prio class 0
>>>
>>> To further aid in debugging, I resurrected Keith's nvme_tcp tracing event
>>> patch from a ways back. If my understanding is correct,
>>> nvme_tcp_cmd_pdu() is being called for the commands in question.
>>>
>>> I've isolated the failure to a scenario we can reproduce at will
>>> in about a minute of runtime. Is there any additional debug that we can
>>> provide to help locate the bug?
>>
>> Would be interesting to understand if for these requests,
>> nvme_tcp_queue_request scheduled io_work or not (can be marked on the
>> command and printed during the timeout expiration). If not, would be
>> interesting to understand why.
> 
> Okay, here's a repro with a bit more context. I recorded the state
> from nvme_tcp_queue_request() in the request and printed it out as you
> suggested. I also added trace events for nvme_tcp_queue_rq() to make
> it easier to match the state.
> 
> - lock is the return value of trylock() (if executed)
> - more is the return value of nvme_tcp_queue_more(if executed)
> - this_cpu is raw_smp_processor_id();
> 
> Here's the trace output for the failed requests with the associated
> completion messages (from dmesg) stitched in.
> 
>    tar-2153 [001] .....   536.383588: nvme_tcp_queue_rq:
>      nvme1: qid=2 tag=5 op=0 data_len=32768
>    tar-2153 [001] .....   536.383588: nvme_setup_cmd:
>      nvme1: disk=nvme1c1n1, qid=2, cmdid=8197, nsid=1, flags=0x0, meta=0x0,
>        cmd=(nvme_cmd_read slba=9950336, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
>    dmesg.out           [  565.574700] nvme_tcp_queue_request()
>      state: tag 0x5 io_cpu 1 this_cpu 1 lock 1 sync 1 empty 1 last 0 more 1

This is strange...
The CPU matches and the lock was taken, which means the request should
have been sent to the wire immediately...

However, last=0 marks the send with MSG_SENDPAGE_NOTLAST, which tells
the network stack not to push it out right away...
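
For reference, the cmd PDU send path picks its socket flags roughly like
this (paraphrased, not copied verbatim from any particular tree, so the
exact conditions may differ slightly):

    static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue)
    {
            /* "more" == another request is queued or a batch is still open */
            return !list_empty(&queue->send_list) ||
                    !llist_empty(&queue->req_list) || queue->more_requests;
    }

    /* in nvme_tcp_try_send_cmd_pdu() */
    int flags = MSG_DONTWAIT;

    if (nvme_tcp_has_inline_data(req) || nvme_tcp_queue_more(queue))
            /* more to come: let the stack coalesce, don't push yet */
            flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
    else
            /* nothing else pending: push this PDU out now */
            flags |= MSG_EOR;
    /* ... kernel_sendpage(queue->sock, ..., flags) follows ... */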

> 
>   tar-2153  [001] .N...   536.383594: nvme_tcp_queue_rq:
>     nvme1: qid=2 tag=4 op=0 data_len=32768
>   tar-2153  [001] .N...   536.383594: nvme_setup_cmd:
>     nvme1: disk=nvme1c1n1, qid=2, cmdid=16388, nsid=1, flags=0x0, meta=0x0,
>       cmd=(nvme_cmd_read slba=9950272, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
>   dmesg.out            [  565.540965] nvme_tcp_queue_request()
>     state: tag 0x4 io_cpu 1 this_cpu 5 lock 1 sync 1 empty 1 last 0 more 1

Same here... sent down the network stack, but with MSG_SENDPAGE_NOTLAST.

> kworker/1:0H-23  [001] .....   536.383619: nvme_tcp_queue_rq:
>    nvme1: qid=2 tag=3 op=0 data_len=32768
> kworker/1:0H-23  [001] .....   536.383619: nvme_setup_cmd:
>    nvme1: disk=nvme1c1n1, qid=2, cmdid=24579, nsid=1, flags=0x0, meta=0x0,
>      cmd=(nvme_cmd_read slba=9950208, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> dmesg.out              [  565.490597] nvme_tcp_queue_request()
>    state: tag 0x3 io_cpu 1 this_cpu 1 lock 0 sync 1 empty 0 last 0 more 1

Here, the queue is not empty, so we do not send right away, and it is
also not the last request in the batch, so we do not queue io_work
either; we wait for the last request in the batch to schedule io_work,
which sends the batch to the network.

> kworker/1:0H-23  [001] .....   536.383620: nvme_tcp_queue_rq:
>    nvme1: qid=2 tag=2 op=0 data_len=32768
> kworker/1:0H-23  [001] .....   536.383620: nvme_setup_cmd:
>    nvme1: disk=nvme1c1n1, qid=2, cmdid=40962, nsid=1, flags=0x0, meta=0x0,
>      cmd=(nvme_cmd_read slba=9950144, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> dmesg.out                   [  565.470687] nvme_tcp_queue_request()
>    state: tag 0x2 io_cpu 1 this_cpu 1 lock 0 sync 1 empty 0 last 1 more 1

Here again, the queue is not empty, but this is the last request in the
batch, so we should have scheduled io_work to push the queued requests
in the send_list along.

These are not necessarily requests that were issued one after the other;
however, the last request should have flushed all the prior ones...
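
For context, the decision path in nvme_tcp_queue_request() looks roughly
like the following (paraphrased, not copied verbatim from any particular
tree, so the exact conditions may differ slightly):

    static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
                    bool sync, bool last)
    {
            struct nvme_tcp_queue *queue = req->queue;
            bool empty;

            /* empty == nothing else is currently pending on this queue */
            empty = llist_add(&req->lentry, &queue->req_list) &&
                    list_empty(&queue->send_list) && !queue->request;

            if (queue->io_cpu == raw_smp_processor_id() &&
                sync && empty && mutex_trylock(&queue->send_mutex)) {
                    /* same CPU, nothing pending: send from this context */
                    queue->more_requests = !last;
                    nvme_tcp_send_all(queue);
                    queue->more_requests = false;
                    mutex_unlock(&queue->send_mutex);
            } else if (last && !llist_empty(&queue->req_list)) {
                    /* end of a batch with queued requests: io_work flushes it */
                    queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
            }
    }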

Can you add an indication that a command was actually sent to the wire
(i.e., that nvme_tcp_try_send_cmd_pdu() sent the entire cmd PDU)?
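
For example, something like this (untested; dbg_pdu_sent is a
hypothetical debug-only field on struct nvme_tcp_request):

    /* in nvme_tcp_try_send_cmd_pdu(), at the point where the whole command
     * PDU (header + digest) has been handed to kernel_sendpage() */
    req->dbg_pdu_sent = true;

    /* and in the timeout handler, alongside the rest of the state dump */
    pr_warn("nvme_tcp: tag %#x cmd pdu fully sent to socket: %d\n",
            rq->tag, req->dbg_pdu_sent);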



* Re: nvme-tcp: i/o errors and stalled host from failure to send command pdus
  2022-08-31 18:50     ` Sagi Grimberg
@ 2022-08-31 19:06       ` Jonathan Nicklin
  2022-08-31 20:04         ` Jonathan Nicklin
  0 siblings, 1 reply; 21+ messages in thread
From: Jonathan Nicklin @ 2022-08-31 19:06 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme

Something about the logged state didn't make sense to me. Sure enough,
I had a bug in the state tracking. Here's another attempt.

I also added "smp_id2", which is raw_smp_processor_id() recorded
immediately after the direct-send case. It is interesting to point out
that the CPU appears to change for the first command that goes missing.

Here's the arrangement of smp_id and smp_id2.

    smp_id = raw_smp_processor_id();
    if (queue->io_cpu == smp_id && sync && empty) {
        lock = mutex_trylock(&queue->send_mutex);
        if (lock) {
            queue->more_requests = !last;
            nvme_tcp_send_all(queue);
            queue->more_requests = false;
            mutex_unlock(&queue->send_mutex);
        }
    }
    smp_id2 = raw_smp_processor_id();

Here's the new trace with dmesg outputs:

tar-1825    [005] .....   430.572846: nvme_tcp_queue_rq:
 nvme1: qid=6 tag=2f op=0 data_len=32768
tar-1825    [005] .N...   430.572849: nvme_setup_cmd:
 nvme1: disk=nvme1c1n1, qid=6, cmdid=45103, nsid=1, flags=0x0, meta=0x0,
  cmd=(nvme_cmd_read slba=9917888, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)

dmesg.out [  459.626849] nvme nvme1:
 state: tag 0x2f io_cpu 5 smp_id 5 smp_id2 2 sync 1 empty 1 lock 1 last 0 more 0
                 ^^^^^^^^ ^^^^^^^^ ^^^^^^^^^

kworker/5:2H-1827    [005] .....   430.572913: nvme_tcp_queue_rq:
 nvme1: qid=6 tag=2e op=0 data_len=32768
kworker/5:2H-1827    [005] .....   430.572913: nvme_setup_cmd:
 nvme1: disk=nvme1c1n1, qid=6, cmdid=16430, nsid=1, flags=0x0, meta=0x0,
  cmd=(nvme_cmd_read slba=9917824, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)

dmesg.out [  459.593681] nvme nvme1:
 state: tag 0x2e io_cpu 5 smp_id 5 smp_id2 5 sync 1 empty 0 lock 0 last 0 more 1


kworker/5:2H-1827    [005] .....   430.572914: nvme_tcp_queue_rq:
 nvme1: qid=6 tag=2d op=0 data_len=32768
kworker/5:2H-1827    [005] .....   430.572914: nvme_setup_cmd:
 nvme1: disk=nvme1c1n1, qid=6, cmdid=28717, nsid=1, flags=0x0, meta=0x0,
  cmd=(nvme_cmd_read slba=9917760, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)

dmesg.out [  459.560506] nvme nvme1:
 state: tag 0x2d io_cpu 5 smp_id 5 smp_id2 5 sync 1 empty 0 lock 0 last 0 more 1


kworker/5:2H-1827    [005] .....   430.572914: nvme_tcp_queue_rq:
 nvme1: qid=6 tag=29 op=0 data_len=32768
kworker/5:2H-1827    [005] .....   430.572914: nvme_setup_cmd:
 nvme1: disk=nvme1c1n1, qid=6, cmdid=24617, nsid=1, flags=0x0, meta=0x0,
  cmd=(nvme_cmd_read slba=9917696, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
dmesg.out [  459.540904] nvme nvme1:
 state: tag 0x29 io_cpu 5 smp_id 5 smp_id2 5 sync 1 empty 0 lock 0 last 1 more 1


On Wed, Aug 31, 2022 at 2:50 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
>
> On 8/31/22 19:04, Jonathan Nicklin wrote:
> > On Wed, Aug 31, 2022 at 5:02 AM Sagi Grimberg <sagi@grimberg.me> wrote:
> >>
> >>
> >>> Hello,
> >>
> >> Hey Jonathan,
> >>
> >>> We've seen sporadic I/O errors with nvme-tcp in our lab when testing
> >>> with kernel versions 5.15 through 5.19. Using a combination of tracing
> >>> events and network traces, we can see nvme_setup_cmd() trace events
> >>> that do not result in transmitted commands PDUs.
> >>>
> >>> Here's a snippet of the relevant portion of the nvme trace. The
> >>> following events are the last of tens of thousands of successully
> >>> executed reads...
> >>>
> >>> tar-9004    [004] ..... 22457.614117: nvme_setup_cmd:
> >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=53272, nsid=1, flags=0x0, meta=0x0,
> >>>     cmd=(nvme_cmd_read slba=15760448, len=55, ctrl=0x8000, dsmgmt=7, reftag=0)
> >>> tar-9004    [004] ..... 22457.614118: nvme_setup_cmd:
> >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=61463, nsid=1, flags=0x0, meta=0x0,
> >>>     cmd=(nvme_cmd_read slba=15760384, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> >>> tar-9004    [004] ..... 22457.614119: nvme_setup_cmd:
> >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=16406, nsid=1, flags=0x0, meta=0x0,
> >>>     cmd=(nvme_cmd_read slba=15760376, len=7, ctrl=0x8000, dsmgmt=7, reftag=0)
> >>> tar-9004    [004] ..... 22457.614119: nvme_setup_cmd:
> >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=28693, nsid=1, flags=0x0, meta=0x0,
> >>>     cmd=(nvme_cmd_read slba=15760312, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> >>> tar-9004    [004] ..... 22457.614119: nvme_setup_cmd:
> >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=40980, nsid=1, flags=0x0, meta=0x0,
> >>>     cmd=(nvme_cmd_read slba=15760248, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> >>>
> >>> The following 5 commands are issued to nvme, but never make it to the
> >>> wire. The wire state was verified manually using tcpdump on the host and
> >>> trace events on the target. Perhaps the "need-resched" marker in the
> >>> trace and the context switch to the kernel worker are relevant
> >>> to the failure?
> >>>
> >>> tar-9004    [004] .n... 22457.614171: nvme_setup_cmd:
> >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=61469, nsid=1, flags=0x0, meta=0x0,
> >>>     cmd=(nvme_cmd_read slba=15760704, len=55, ctrl=0x8000, dsmgmt=7, reftag=0)
> >>> kworker/4:1H-8668    [004] ..... 22457.614183: nvme_setup_cmd:
> >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=45084, nsid=1, flags=0x0, meta=0x0,
> >>>     cmd=(nvme_cmd_read slba=15760640, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> >>> kworker/4:1H-8668    [004] ..... 22457.614184: nvme_setup_cmd:
> >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=28699, nsid=1, flags=0x0, meta=0x0,
> >>>     cmd=(nvme_cmd_read slba=15760632, len=7, ctrl=0x8000, dsmgmt=7, reftag=0)
> >>> kworker/4:1H-8668    [004] ..... 22457.614184: nvme_setup_cmd:
> >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=40986, nsid=1, flags=0x0, meta=0x0,
> >>>     cmd=(nvme_cmd_read slba=15760568, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> >>> kworker/4:1H-8668    [004] ..... 22457.614184: nvme_setup_cmd:
> >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=16409, nsid=1, flags=0x0, meta=0x0,
> >>>     cmd=(nvme_cmd_read slba=15760504, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> >>>
> >>> The reads above eventually timeout and recover after reset. Here are
> >>> the related reset messages.
> >>>
> >>> [22486.420606] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760568, 64 blocks, I/O
> >>> Error (sct 0x3 / sc 0x71)
> >>> [22486.429326] I/O error, dev nvme1c1n1, sector 15760568 op 0x0:(READ)
> >>> flags 0x4084700 phys_seg 1 prio class 0
> >>> [22486.439262] nvme nvme1: queue 5: timeout request 0x1b type 4
> >>> [22486.461788] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760632, 8 blocks, I/O
> >>> Error (sct 0x3 / sc 0x71)
> >>> [22486.470440] I/O error, dev nvme1c1n1, sector 15760632 op 0x0:(READ)
> >>> flags 0x4084700 phys_seg 1 prio class 0
> >>> [22486.480386] nvme nvme1: queue 5: timeout request 0x1c type 4
> >>> [22486.486256] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760640, 64 blocks, I/O
> >>> Error (sct 0x3 / sc 0x71)
> >>> [22486.494994] I/O error, dev nvme1c1n1, sector 15760640 op 0x0:(READ)
> >>> flags 0x4084700 phys_seg 2 prio class 0
> >>> [22486.504946] nvme nvme1: queue 5: timeout request 0x1d type 4
> >>> [22486.510822] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760704, 56 blocks, I/O
> >>> Error (sct 0x3 / sc 0x71)
> >>> [22486.519563] I/O error, dev nvme1c1n1, sector 15760704 op 0x0:(READ)
> >>> flags 0x4080700 phys_seg 2 prio class 0
> >>> [22486.529723] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760504, 64 blocks, I/O
> >>> Error (sct 0x3 / sc 0x71)
> >>> [22486.538610] I/O error, dev nvme1c1n1, sector 15760504 op 0x0:(READ)
> >>> flags 0x4084700 phys_seg 2 prio class 0
> >>>
> >>> To further aid in debugging, I resurrected Keith's nvme_tcp tracing event
> >>> patch from a ways back. If my understanding is correct,
> >>> nvme_tcp_cmd_pdu() is being called for the commands in question.
> >>>
> >>> I've isolated the failure to a scenario we can reproduce at will
> >>> in about a minute of runtime. Is there any additional debug that we can
> >>> provide to help locate the bug?
> >>
> >> Would be interesting to understand if for these requests,
> >> nvme_tcp_queue_request scheduled io_work or not (can be marked on the
> >> command and printed during the timeout expiration). If not, would be
> >> interesting to understand why.
> >
> > Okay, here's a repro with a bit more context. I recorded the state
> > from nvme_tcp_queue_request() in the request and printed it out as you
> > suggested. I also added trace events for nvme_tcp_queue_rq() to make
> > it easier to match the state.
> >
> > - lock is the return value of trylock() (if executed)
> > - more is the return value of nvme_tcp_queue_more(if executed)
> > - this_cpu is raw_smp_processor_id();
> >
> > Here's the trace output for the failed requests with the associated
> > completion messages (from dmesg) stitched in.
> >
> >    tar-2153 [001] .....   536.383588: nvme_tcp_queue_rq:
> >      nvme1: qid=2 tag=5 op=0 data_len=32768
> >    tar-2153 [001] .....   536.383588: nvme_setup_cmd:
> >      nvme1: disk=nvme1c1n1, qid=2, cmdid=8197, nsid=1, flags=0x0, meta=0x0,
> >        cmd=(nvme_cmd_read slba=9950336, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> >    dmesg.out           [  565.574700] nvme_tcp_queue_request()
> >      state: tag 0x5 io_cpu 1 this_cpu 1 lock 1 sync 1 empty 1 last 0 more 1
>
> This is strange...
> cpu matches, and the lock is taken, this means that the request should
> have been sent immediately to the wire...
>
> However last=0, this will make the send marked with MSG_SENDPAGE_NOTLAST
> which means the network stack will not send it right away...
>
> >
> >   tar-2153  [001] .N...   536.383594: nvme_tcp_queue_rq:
> >     nvme1: qid=2 tag=4 op=0 data_len=32768
> >   tar-2153  [001] .N...   536.383594: nvme_setup_cmd:
> >     nvme1: disk=nvme1c1n1, qid=2, cmdid=16388, nsid=1, flags=0x0, meta=0x0,
> >       cmd=(nvme_cmd_read slba=9950272, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> >   dmesg.out            [  565.540965] nvme_tcp_queue_request()
> >     state: tag 0x4 io_cpu 1 this_cpu 5 lock 1 sync 1 empty 1 last 0 more 1
>
> Same here... sent down the network stack, but with MSG_SENDPAGE_NOTLAST.
>
> > kworker/1:0H-23  [001] .....   536.383619: nvme_tcp_queue_rq:
> >    nvme1: qid=2 tag=3 op=0 data_len=32768
> > kworker/1:0H-23  [001] .....   536.383619: nvme_setup_cmd:
> >    nvme1: disk=nvme1c1n1, qid=2, cmdid=24579, nsid=1, flags=0x0, meta=0x0,
> >      cmd=(nvme_cmd_read slba=9950208, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > dmesg.out              [  565.490597] nvme_tcp_queue_request()
> >    state: tag 0x3 io_cpu 1 this_cpu 1 lock 0 sync 1 empty 0 last 0 more 1
>
> Here, the queue is not empty, so we are not sending right away, and its
> also not the last in the batch, so we are also not queueing io_work, we
> will wait for the last request in the batch to schedule io_work, to send
> the batch to the network.
>
> > kworker/1:0H-23  [001] .....   536.383620: nvme_tcp_queue_rq:
> >    nvme1: qid=2 tag=2 op=0 data_len=32768
> > kworker/1:0H-23  [001] .....   536.383620: nvme_setup_cmd:
> >    nvme1: disk=nvme1c1n1, qid=2, cmdid=40962, nsid=1, flags=0x0, meta=0x0,
> >      cmd=(nvme_cmd_read slba=9950144, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > dmesg.out                   [  565.470687] nvme_tcp_queue_request()
> >    state: tag 0x2 io_cpu 1 this_cpu 1 lock 0 sync 1 empty 0 last 1 more 1
>
> Here again, the queue is not empty, and its the last request in the
> batch, so we should have scheduled io_work to send the queued requests
> in the send_list along.
>
> These are not necessarily requests that were one after the other,
> however the last request, should have flushed all the prior ones...
>
> Can you add an indication that a command was actually sent to the wire
> (i.e. nvme_tcp_try_send_cmd_pdu() sends the entire cmd pdu).


* Re: nvme-tcp: i/o errors and stalled host from failure to send command pdus
  2022-08-31 19:06       ` Jonathan Nicklin
@ 2022-08-31 20:04         ` Jonathan Nicklin
  2022-08-31 23:42           ` Sagi Grimberg
  0 siblings, 1 reply; 21+ messages in thread
From: Jonathan Nicklin @ 2022-08-31 20:04 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme

On Wed, Aug 31, 2022 at 3:06 PM Jonathan Nicklin
<jnicklin@blockbridge.com> wrote:
>
> Something about the logged state didn't make sense to me. Sure enough,
> I had a bug in the state tracking. Here's another attempt.
>
> I also added in"smp_id2", which is raw_smp_processor_id() recorded
> immediately after the direct send case. Is it interesting to point out that the
> CPU appears to change for the first command to go missing.
>
> Here's the arrangement of smp_id and smp_id2.
>
>     smp_id = raw_smp_processor_id();
> if (queue->io_cpu == smp_id && sync && empty) {
>         lock = mutex_trylock(&queue->send_mutex);
>         if (lock) {
>             queue->more_requests = !last;
>             nvme_tcp_send_all(queue);
>             queue->more_requests = false;
>             mutex_unlock(&queue->send_mutex);
>         }
> }
>     smp_id2 = raw_smp_processor_id();
>
> Here's the new trace with dmesg outputs:
>
> tar-1825    [005] .....   430.572846: nvme_tcp_queue_rq:
>  nvme1: qid=6 tag=2f op=0 data_len=32768
> tar-1825    [005] .N...   430.572849: nvme_setup_cmd:
>  nvme1: disk=nvme1c1n1, qid=6, cmdid=45103, nsid=1, flags=0x0, meta=0x0,
>   cmd=(nvme_cmd_read slba=9917888, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
>
> dmesg.out [  459.626849] nvme nvme1:
>  state: tag 0x2f io_cpu 5 smp_id 5 smp_id2 2 sync 1 empty 1 lock 1 last 0 more 0
>                  ^^^^^^^^ ^^^^^^^^ ^^^^^^^^^
>
> kworker/5:2H-1827    [005] .....   430.572913: nvme_tcp_queue_rq:
>  nvme1: qid=6 tag=2e op=0 data_len=32768
> kworker/5:2H-1827    [005] .....   430.572913: nvme_setup_cmd:
>  nvme1: disk=nvme1c1n1, qid=6, cmdid=16430, nsid=1, flags=0x0, meta=0x0,
>   cmd=(nvme_cmd_read slba=9917824, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
>
> dmesg.out [  459.593681] nvme nvme1:
>  state: tag 0x2e io_cpu 5 smp_id 5 smp_id2 5 sync 1 empty 0 lock 0 last 0 more 1
>
>
> kworker/5:2H-1827    [005] .....   430.572914: nvme_tcp_queue_rq:
>  nvme1: qid=6 tag=2d op=0 data_len=32768
> kworker/5:2H-1827    [005] .....   430.572914: nvme_setup_cmd:
>  nvme1: disk=nvme1c1n1, qid=6, cmdid=28717, nsid=1, flags=0x0, meta=0x0,
>   cmd=(nvme_cmd_read slba=9917760, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
>
> dmest.out [  459.560506] nvme nvme1:
>  state: tag 0x2d io_cpu 5 smp_id 5 smp_id2 5 sync 1 empty 0 lock 0 last 0 more 1
>
>
> kworker/5:2H-1827    [005] .....   430.572914: nvme_tcp_queue_rq:
>  nvme1: qid=6 tag=29 op=0 data_len=32768
> kworker/5:2H-1827    [005] .....   430.572914: nvme_setup_cmd:
>  nvme1: disk=nvme1c1n1, qid=6, cmdid=24617, nsid=1, flags=0x0, meta=0x0,
>   cmd=(nvme_cmd_read slba=9917696, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> dmest.out [  459.540904] nvme nvme1:
>  state: tag 0x29 io_cpu 5 smp_id 5 smp_id2 5 sync 1 empty 0 lock 0 last 1 more 1
>
>
> On Wed, Aug 31, 2022 at 2:50 PM Sagi Grimberg <sagi@grimberg.me> wrote:
> >
> >
> >
> > On 8/31/22 19:04, Jonathan Nicklin wrote:
> > > On Wed, Aug 31, 2022 at 5:02 AM Sagi Grimberg <sagi@grimberg.me> wrote:
> > >>
> > >>
> > >>> Hello,
> > >>
> > >> Hey Jonathan,
> > >>
> > >>> We've seen sporadic I/O errors with nvme-tcp in our lab when testing
> > >>> with kernel versions 5.15 through 5.19. Using a combination of tracing
> > >>> events and network traces, we can see nvme_setup_cmd() trace events
> > >>> that do not result in transmitted commands PDUs.
> > >>>
> > >>> Here's a snippet of the relevant portion of the nvme trace. The
> > >>> following events are the last of tens of thousands of successully
> > >>> executed reads...
> > >>>
> > >>> tar-9004    [004] ..... 22457.614117: nvme_setup_cmd:
> > >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=53272, nsid=1, flags=0x0, meta=0x0,
> > >>>     cmd=(nvme_cmd_read slba=15760448, len=55, ctrl=0x8000, dsmgmt=7, reftag=0)
> > >>> tar-9004    [004] ..... 22457.614118: nvme_setup_cmd:
> > >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=61463, nsid=1, flags=0x0, meta=0x0,
> > >>>     cmd=(nvme_cmd_read slba=15760384, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > >>> tar-9004    [004] ..... 22457.614119: nvme_setup_cmd:
> > >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=16406, nsid=1, flags=0x0, meta=0x0,
> > >>>     cmd=(nvme_cmd_read slba=15760376, len=7, ctrl=0x8000, dsmgmt=7, reftag=0)
> > >>> tar-9004    [004] ..... 22457.614119: nvme_setup_cmd:
> > >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=28693, nsid=1, flags=0x0, meta=0x0,
> > >>>     cmd=(nvme_cmd_read slba=15760312, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > >>> tar-9004    [004] ..... 22457.614119: nvme_setup_cmd:
> > >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=40980, nsid=1, flags=0x0, meta=0x0,
> > >>>     cmd=(nvme_cmd_read slba=15760248, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > >>>
> > >>> The following 5 commands are issued to nvme, but never make it to the
> > >>> wire. The wire state was verified manually using tcpdump on the host and
> > >>> trace events on the target. Perhaps the "need-resched" marker in the
> > >>> trace and the context switch to the kernel worker are relevant
> > >>> to the failure?
> > >>>
> > >>> tar-9004    [004] .n... 22457.614171: nvme_setup_cmd:
> > >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=61469, nsid=1, flags=0x0, meta=0x0,
> > >>>     cmd=(nvme_cmd_read slba=15760704, len=55, ctrl=0x8000, dsmgmt=7, reftag=0)
> > >>> kworker/4:1H-8668    [004] ..... 22457.614183: nvme_setup_cmd:
> > >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=45084, nsid=1, flags=0x0, meta=0x0,
> > >>>     cmd=(nvme_cmd_read slba=15760640, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > >>> kworker/4:1H-8668    [004] ..... 22457.614184: nvme_setup_cmd:
> > >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=28699, nsid=1, flags=0x0, meta=0x0,
> > >>>     cmd=(nvme_cmd_read slba=15760632, len=7, ctrl=0x8000, dsmgmt=7, reftag=0)
> > >>> kworker/4:1H-8668    [004] ..... 22457.614184: nvme_setup_cmd:
> > >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=40986, nsid=1, flags=0x0, meta=0x0,
> > >>>     cmd=(nvme_cmd_read slba=15760568, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > >>> kworker/4:1H-8668    [004] ..... 22457.614184: nvme_setup_cmd:
> > >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=16409, nsid=1, flags=0x0, meta=0x0,
> > >>>     cmd=(nvme_cmd_read slba=15760504, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > >>>
> > >>> The reads above eventually timeout and recover after reset. Here are
> > >>> the related reset messages.
> > >>>
> > >>> [22486.420606] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760568, 64 blocks, I/O
> > >>> Error (sct 0x3 / sc 0x71)
> > >>> [22486.429326] I/O error, dev nvme1c1n1, sector 15760568 op 0x0:(READ)
> > >>> flags 0x4084700 phys_seg 1 prio class 0
> > >>> [22486.439262] nvme nvme1: queue 5: timeout request 0x1b type 4
> > >>> [22486.461788] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760632, 8 blocks, I/O
> > >>> Error (sct 0x3 / sc 0x71)
> > >>> [22486.470440] I/O error, dev nvme1c1n1, sector 15760632 op 0x0:(READ)
> > >>> flags 0x4084700 phys_seg 1 prio class 0
> > >>> [22486.480386] nvme nvme1: queue 5: timeout request 0x1c type 4
> > >>> [22486.486256] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760640, 64 blocks, I/O
> > >>> Error (sct 0x3 / sc 0x71)
> > >>> [22486.494994] I/O error, dev nvme1c1n1, sector 15760640 op 0x0:(READ)
> > >>> flags 0x4084700 phys_seg 2 prio class 0
> > >>> [22486.504946] nvme nvme1: queue 5: timeout request 0x1d type 4
> > >>> [22486.510822] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760704, 56 blocks, I/O
> > >>> Error (sct 0x3 / sc 0x71)
> > >>> [22486.519563] I/O error, dev nvme1c1n1, sector 15760704 op 0x0:(READ)
> > >>> flags 0x4080700 phys_seg 2 prio class 0
> > >>> [22486.529723] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760504, 64 blocks, I/O
> > >>> Error (sct 0x3 / sc 0x71)
> > >>> [22486.538610] I/O error, dev nvme1c1n1, sector 15760504 op 0x0:(READ)
> > >>> flags 0x4084700 phys_seg 2 prio class 0
> > >>>
> > >>> To further aid in debugging, I resurrected Keith's nvme_tcp tracing event
> > >>> patch from a ways back. If my understanding is correct,
> > >>> nvme_tcp_cmd_pdu() is being called for the commands in question.
> > >>>
> > >>> I've isolated the failure to a scenario we can reproduce at will
> > >>> in about a minute of runtime. Is there any additional debug that we can
> > >>> provide to help locate the bug?
> > >>
> > >> Would be interesting to understand if for these requests,
> > >> nvme_tcp_queue_request scheduled io_work or not (can be marked on the
> > >> command and printed during the timeout expiration). If not, would be
> > >> interesting to understand why.
> > >
> > > Okay, here's a repro with a bit more context. I recorded the state
> > > from nvme_tcp_queue_request() in the request and printed it out as you
> > > suggested. I also added trace events for nvme_tcp_queue_rq() to make
> > > it easier to match the state.
> > >
> > > - lock is the return value of trylock() (if executed)
> > > - more is the return value of nvme_tcp_queue_more(if executed)
> > > - this_cpu is raw_smp_processor_id();
> > >
> > > Here's the trace output for the failed requests with the associated
> > > completion messages (from dmesg) stitched in.
> > >
> > >    tar-2153 [001] .....   536.383588: nvme_tcp_queue_rq:
> > >      nvme1: qid=2 tag=5 op=0 data_len=32768
> > >    tar-2153 [001] .....   536.383588: nvme_setup_cmd:
> > >      nvme1: disk=nvme1c1n1, qid=2, cmdid=8197, nsid=1, flags=0x0, meta=0x0,
> > >        cmd=(nvme_cmd_read slba=9950336, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > >    dmesg.out           [  565.574700] nvme_tcp_queue_request()
> > >      state: tag 0x5 io_cpu 1 this_cpu 1 lock 1 sync 1 empty 1 last 0 more 1
> >
> > This is strange...
> > cpu matches, and the lock is taken, this means that the request should
> > have been sent immediately to the wire...
> >
> > However last=0, this will make the send marked with MSG_SENDPAGE_NOTLAST
> > which means the network stack will not send it right away...
> >
> > >
> > >   tar-2153  [001] .N...   536.383594: nvme_tcp_queue_rq:
> > >     nvme1: qid=2 tag=4 op=0 data_len=32768
> > >   tar-2153  [001] .N...   536.383594: nvme_setup_cmd:
> > >     nvme1: disk=nvme1c1n1, qid=2, cmdid=16388, nsid=1, flags=0x0, meta=0x0,
> > >       cmd=(nvme_cmd_read slba=9950272, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > >   dmesg.out            [  565.540965] nvme_tcp_queue_request()
> > >     state: tag 0x4 io_cpu 1 this_cpu 5 lock 1 sync 1 empty 1 last 0 more 1
> >
> > Same here... sent down the network stack, but with MSG_SENDPAGE_NOTLAST.
> >
> > > kworker/1:0H-23  [001] .....   536.383619: nvme_tcp_queue_rq:
> > >    nvme1: qid=2 tag=3 op=0 data_len=32768
> > > kworker/1:0H-23  [001] .....   536.383619: nvme_setup_cmd:
> > >    nvme1: disk=nvme1c1n1, qid=2, cmdid=24579, nsid=1, flags=0x0, meta=0x0,
> > >      cmd=(nvme_cmd_read slba=9950208, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > > dmesg.out              [  565.490597] nvme_tcp_queue_request()
> > >    state: tag 0x3 io_cpu 1 this_cpu 1 lock 0 sync 1 empty 0 last 0 more 1
> >
> > Here, the queue is not empty, so we are not sending right away, and its
> > also not the last in the batch, so we are also not queueing io_work, we
> > will wait for the last request in the batch to schedule io_work, to send
> > the batch to the network.
> >
> > > kworker/1:0H-23  [001] .....   536.383620: nvme_tcp_queue_rq:
> > >    nvme1: qid=2 tag=2 op=0 data_len=32768
> > > kworker/1:0H-23  [001] .....   536.383620: nvme_setup_cmd:
> > >    nvme1: disk=nvme1c1n1, qid=2, cmdid=40962, nsid=1, flags=0x0, meta=0x0,
> > >      cmd=(nvme_cmd_read slba=9950144, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > > dmesg.out                   [  565.470687] nvme_tcp_queue_request()
> > >    state: tag 0x2 io_cpu 1 this_cpu 1 lock 0 sync 1 empty 0 last 1 more 1
> >
> > Here again, the queue is not empty, and its the last request in the
> > batch, so we should have scheduled io_work to send the queued requests
> > in the send_list along.
> >
> > These are not necessarily requests that were one after the other,
> > however the last request, should have flushed all the prior ones...
> >
> > Can you add an indication that a command was actually sent to the wire
> > (i.e. nvme_tcp_try_send_cmd_pdu() sends the entire cmd pdu).
>
> Something about the logged state didn't make sense to me. Sure enough,
> I had a bug in the state tracking. Here's another attempt.
>
> I also added in"smp_id2", which is raw_smp_processor_id() recorded
> immediately after the direct send case. Is it interesting to point out that the
> CPU appears to change for the first command to go missing.
>
> Here's the arrangement of smp_id and smp_id2.
>
>     smp_id = raw_smp_processor_id();
> if (queue->io_cpu == smp_id && sync && empty) {
>         lock = mutex_trylock(&queue->send_mutex);
>         if (lock) {
>             queue->more_requests = !last;
>             nvme_tcp_send_all(queue);
>             queue->more_requests = false;
>             mutex_unlock(&queue->send_mutex);
>         }
> }
>     smp_id2 = raw_smp_processor_id();
>
> Here's the new trace with dmesg outputs:
>
> tar-1825    [005] .....   430.572846: nvme_tcp_queue_rq:
>  nvme1: qid=6 tag=2f op=0 data_len=32768
> tar-1825    [005] .N...   430.572849: nvme_setup_cmd:
>  nvme1: disk=nvme1c1n1, qid=6, cmdid=45103, nsid=1, flags=0x0, meta=0x0,
>   cmd=(nvme_cmd_read slba=9917888, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
>
> dmesg.out [  459.626849] nvme nvme1:
>  state: tag 0x2f io_cpu 5 smp_id 5 smp_id2 2 sync 1 empty 1 lock 1 last 0 more 0
>                  ^^^^^^^^ ^^^^^^^^ ^^^^^^^^^
>
> kworker/5:2H-1827    [005] .....   430.572913: nvme_tcp_queue_rq:
>  nvme1: qid=6 tag=2e op=0 data_len=32768
> kworker/5:2H-1827    [005] .....   430.572913: nvme_setup_cmd:
>  nvme1: disk=nvme1c1n1, qid=6, cmdid=16430, nsid=1, flags=0x0, meta=0x0,
>   cmd=(nvme_cmd_read slba=9917824, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
>
> dmesg.out [  459.593681] nvme nvme1:
>  state: tag 0x2e io_cpu 5 smp_id 5 smp_id2 5 sync 1 empty 0 lock 0 last 0 more 1
>
>
> kworker/5:2H-1827    [005] .....   430.572914: nvme_tcp_queue_rq:
>  nvme1: qid=6 tag=2d op=0 data_len=32768
> kworker/5:2H-1827    [005] .....   430.572914: nvme_setup_cmd:
>  nvme1: disk=nvme1c1n1, qid=6, cmdid=28717, nsid=1, flags=0x0, meta=0x0,
>   cmd=(nvme_cmd_read slba=9917760, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
>
> dmest.out [  459.560506] nvme nvme1:
>  state: tag 0x2d io_cpu 5 smp_id 5 smp_id2 5 sync 1 empty 0 lock 0 last 0 more 1
>
>
> kworker/5:2H-1827    [005] .....   430.572914: nvme_tcp_queue_rq:
>  nvme1: qid=6 tag=29 op=0 data_len=32768
> kworker/5:2H-1827    [005] .....   430.572914: nvme_setup_cmd:
>  nvme1: disk=nvme1c1n1, qid=6, cmdid=24617, nsid=1, flags=0x0, meta=0x0,
>   cmd=(nvme_cmd_read slba=9917696, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> dmest.out [  459.540904] nvme nvme1:
>  state: tag 0x29 io_cpu 5 smp_id 5 smp_id2 5 sync 1 empty 0 lock 0 last 1 more 1
>
>
> On Wed, Aug 31, 2022 at 2:50 PM Sagi Grimberg <sagi@grimberg.me> wrote:
> >
> >
> >
> > On 8/31/22 19:04, Jonathan Nicklin wrote:
> > > On Wed, Aug 31, 2022 at 5:02 AM Sagi Grimberg <sagi@grimberg.me> wrote:
> > >>
> > >>
> > >>> Hello,
> > >>
> > >> Hey Jonathan,
> > >>
> > >>> We've seen sporadic I/O errors with nvme-tcp in our lab when testing
> > >>> with kernel versions 5.15 through 5.19. Using a combination of tracing
> > >>> events and network traces, we can see nvme_setup_cmd() trace events
> > >>> that do not result in transmitted commands PDUs.
> > >>>
> > >>> Here's a snippet of the relevant portion of the nvme trace. The
> > >>> following events are the last of tens of thousands of successully
> > >>> executed reads...
> > >>>
> > >>> tar-9004    [004] ..... 22457.614117: nvme_setup_cmd:
> > >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=53272, nsid=1, flags=0x0, meta=0x0,
> > >>>     cmd=(nvme_cmd_read slba=15760448, len=55, ctrl=0x8000, dsmgmt=7, reftag=0)
> > >>> tar-9004    [004] ..... 22457.614118: nvme_setup_cmd:
> > >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=61463, nsid=1, flags=0x0, meta=0x0,
> > >>>     cmd=(nvme_cmd_read slba=15760384, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > >>> tar-9004    [004] ..... 22457.614119: nvme_setup_cmd:
> > >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=16406, nsid=1, flags=0x0, meta=0x0,
> > >>>     cmd=(nvme_cmd_read slba=15760376, len=7, ctrl=0x8000, dsmgmt=7, reftag=0)
> > >>> tar-9004    [004] ..... 22457.614119: nvme_setup_cmd:
> > >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=28693, nsid=1, flags=0x0, meta=0x0,
> > >>>     cmd=(nvme_cmd_read slba=15760312, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > >>> tar-9004    [004] ..... 22457.614119: nvme_setup_cmd:
> > >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=40980, nsid=1, flags=0x0, meta=0x0,
> > >>>     cmd=(nvme_cmd_read slba=15760248, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > >>>
> > >>> The following 5 commands are issued to nvme, but never make it to the
> > >>> wire. The wire state was verified manually using tcpdump on the host and
> > >>> trace events on the target. Perhaps the "need-resched" marker in the
> > >>> trace and the context switch to the kernel worker are relevant
> > >>> to the failure?
> > >>>
> > >>> tar-9004    [004] .n... 22457.614171: nvme_setup_cmd:
> > >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=61469, nsid=1, flags=0x0, meta=0x0,
> > >>>     cmd=(nvme_cmd_read slba=15760704, len=55, ctrl=0x8000, dsmgmt=7, reftag=0)
> > >>> kworker/4:1H-8668    [004] ..... 22457.614183: nvme_setup_cmd:
> > >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=45084, nsid=1, flags=0x0, meta=0x0,
> > >>>     cmd=(nvme_cmd_read slba=15760640, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > >>> kworker/4:1H-8668    [004] ..... 22457.614184: nvme_setup_cmd:
> > >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=28699, nsid=1, flags=0x0, meta=0x0,
> > >>>     cmd=(nvme_cmd_read slba=15760632, len=7, ctrl=0x8000, dsmgmt=7, reftag=0)
> > >>> kworker/4:1H-8668    [004] ..... 22457.614184: nvme_setup_cmd:
> > >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=40986, nsid=1, flags=0x0, meta=0x0,
> > >>>     cmd=(nvme_cmd_read slba=15760568, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > >>> kworker/4:1H-8668    [004] ..... 22457.614184: nvme_setup_cmd:
> > >>>     nvme1: disk=nvme1c1n1, qid=5, cmdid=16409, nsid=1, flags=0x0, meta=0x0,
> > >>>     cmd=(nvme_cmd_read slba=15760504, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > >>>
> > >>> The reads above eventually timeout and recover after reset. Here are
> > >>> the related reset messages.
> > >>>
> > >>> [22486.420606] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760568, 64 blocks, I/O
> > >>> Error (sct 0x3 / sc 0x71)
> > >>> [22486.429326] I/O error, dev nvme1c1n1, sector 15760568 op 0x0:(READ)
> > >>> flags 0x4084700 phys_seg 1 prio class 0
> > >>> [22486.439262] nvme nvme1: queue 5: timeout request 0x1b type 4
> > >>> [22486.461788] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760632, 8 blocks, I/O
> > >>> Error (sct 0x3 / sc 0x71)
> > >>> [22486.470440] I/O error, dev nvme1c1n1, sector 15760632 op 0x0:(READ)
> > >>> flags 0x4084700 phys_seg 1 prio class 0
> > >>> [22486.480386] nvme nvme1: queue 5: timeout request 0x1c type 4
> > >>> [22486.486256] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760640, 64 blocks, I/O
> > >>> Error (sct 0x3 / sc 0x71)
> > >>> [22486.494994] I/O error, dev nvme1c1n1, sector 15760640 op 0x0:(READ)
> > >>> flags 0x4084700 phys_seg 2 prio class 0
> > >>> [22486.504946] nvme nvme1: queue 5: timeout request 0x1d type 4
> > >>> [22486.510822] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760704, 56 blocks, I/O
> > >>> Error (sct 0x3 / sc 0x71)
> > >>> [22486.519563] I/O error, dev nvme1c1n1, sector 15760704 op 0x0:(READ)
> > >>> flags 0x4080700 phys_seg 2 prio class 0
> > >>> [22486.529723] nvme1c1n1: I/O Cmd(0x2) @ LBA 15760504, 64 blocks, I/O
> > >>> Error (sct 0x3 / sc 0x71)
> > >>> [22486.538610] I/O error, dev nvme1c1n1, sector 15760504 op 0x0:(READ)
> > >>> flags 0x4084700 phys_seg 2 prio class 0
> > >>>
> > >>> To further aid in debugging, I resurrected Keith's nvme_tcp tracing event
> > >>> patch from a ways back. If my understanding is correct,
> > >>> nvme_tcp_cmd_pdu() is being called for the commands in question.
> > >>>
> > >>> I've isolated the failure to a scenario we can reproduce at will
> > >>> in about a minute of runtime. Is there any additional debug that we can
> > >>> provide to help locate the bug?
> > >>
> > >> Would be interesting to understand if for these requests,
> > >> nvme_tcp_queue_request scheduled io_work or not (can be marked on the
> > >> command and printed during the timeout expiration). If not, would be
> > >> interesting to understand why.
> > >
> > > Okay, here's a repro with a bit more context. I recorded the state
> > > from nvme_tcp_queue_request() in the request and printed it out as you
> > > suggested. I also added trace events for nvme_tcp_queue_rq() to make
> > > it easier to match the state.
> > >
> > > - lock is the return value of trylock() (if executed)
> > > - more is the return value of nvme_tcp_queue_more(if executed)
> > > - this_cpu is raw_smp_processor_id();
> > >
> > > Here's the trace output for the failed requests with the associated
> > > completion messages (from dmesg) stitched in.
> > >
> > >    tar-2153 [001] .....   536.383588: nvme_tcp_queue_rq:
> > >      nvme1: qid=2 tag=5 op=0 data_len=32768
> > >    tar-2153 [001] .....   536.383588: nvme_setup_cmd:
> > >      nvme1: disk=nvme1c1n1, qid=2, cmdid=8197, nsid=1, flags=0x0, meta=0x0,
> > >        cmd=(nvme_cmd_read slba=9950336, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > >    dmesg.out           [  565.574700] nvme_tcp_queue_request()
> > >      state: tag 0x5 io_cpu 1 this_cpu 1 lock 1 sync 1 empty 1 last 0 more 1
> >
> > This is strange...
> > cpu matches, and the lock is taken, this means that the request should
> > have been sent immediately to the wire...
> >
> > However last=0, this will make the send marked with MSG_SENDPAGE_NOTLAST
> > which means the network stack will not send it right away...
> >
> > >
> > >   tar-2153  [001] .N...   536.383594: nvme_tcp_queue_rq:
> > >     nvme1: qid=2 tag=4 op=0 data_len=32768
> > >   tar-2153  [001] .N...   536.383594: nvme_setup_cmd:
> > >     nvme1: disk=nvme1c1n1, qid=2, cmdid=16388, nsid=1, flags=0x0, meta=0x0,
> > >       cmd=(nvme_cmd_read slba=9950272, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > >   dmesg.out            [  565.540965] nvme_tcp_queue_request()
> > >     state: tag 0x4 io_cpu 1 this_cpu 5 lock 1 sync 1 empty 1 last 0 more 1
> >
> > Same here... sent down the network stack, but with MSG_SENDPAGE_NOTLAST.
> >
> > > kworker/1:0H-23  [001] .....   536.383619: nvme_tcp_queue_rq:
> > >    nvme1: qid=2 tag=3 op=0 data_len=32768
> > > kworker/1:0H-23  [001] .....   536.383619: nvme_setup_cmd:
> > >    nvme1: disk=nvme1c1n1, qid=2, cmdid=24579, nsid=1, flags=0x0, meta=0x0,
> > >      cmd=(nvme_cmd_read slba=9950208, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > > dmesg.out              [  565.490597] nvme_tcp_queue_request()
> > >    state: tag 0x3 io_cpu 1 this_cpu 1 lock 0 sync 1 empty 0 last 0 more 1
> >
> > Here, the queue is not empty, so we are not sending right away, and its
> > also not the last in the batch, so we are also not queueing io_work, we
> > will wait for the last request in the batch to schedule io_work, to send
> > the batch to the network.
> >
> > > kworker/1:0H-23  [001] .....   536.383620: nvme_tcp_queue_rq:
> > >    nvme1: qid=2 tag=2 op=0 data_len=32768
> > > kworker/1:0H-23  [001] .....   536.383620: nvme_setup_cmd:
> > >    nvme1: disk=nvme1c1n1, qid=2, cmdid=40962, nsid=1, flags=0x0, meta=0x0,
> > >      cmd=(nvme_cmd_read slba=9950144, len=63, ctrl=0x8000, dsmgmt=7, reftag=0)
> > > dmesg.out                   [  565.470687] nvme_tcp_queue_request()
> > >    state: tag 0x2 io_cpu 1 this_cpu 1 lock 0 sync 1 empty 0 last 1 more 1
> >
> > Here again, the queue is not empty, and its the last request in the
> > batch, so we should have scheduled io_work to send the queued requests
> > in the send_list along.
> >
> > These are not necessarily requests that were one after the other,
> > however the last request, should have flushed all the prior ones...
> >
> > Can you add an indication that a command was actually sent to the wire
> > (i.e. nvme_tcp_try_send_cmd_pdu() sends the entire cmd pdu).

Here's a trace including state from nvme_tcp_try_send_cmd_pdu(), like so.

req->dbg.send = 1;
ret = kernel_sendpage(queue->sock, virt_to_page(pdu), offset, len, flags);
req->dbg.sendpage = ret;
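
For context, the per-request debug state referenced above comes from a
local patch; roughly, it is shaped like this (field names are ours, not
anything upstream):

struct nvme_tcp_request_dbg {
	int  io_cpu;    /* queue->io_cpu at queue time */
	int  smp_id;    /* CPU on entry to nvme_tcp_queue_request() */
	int  smp_id2;   /* CPU right after the direct-send attempt */
	bool sync;
	bool empty;
	bool lock;      /* mutex_trylock() result, if attempted */
	bool last;
	bool more;      /* nvme_tcp_queue_more() result, if evaluated */
	bool send;      /* nvme_tcp_try_send_cmd_pdu() was reached */
	int  sendpage;  /* kernel_sendpage() return value, -1 if not called */
};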


nvme-trace.out

tar-1785    [005] .....   232.768892: nvme_tcp_queue_rq:
 nvme1: qid=6 tag=18 op=0 data_len=20480
tar-1785    [005] .....   232.768894: nvme_tcp_queue_rq:
 nvme1: qid=6 tag=17 op=0 data_len=32768
tar-1785    [005] .N...   232.768895: nvme_tcp_queue_rq:
 nvme1: qid=6 tag=14 op=0 data_len=32768
kworker/5:1H-475     [005] .....   232.768923: nvme_tcp_queue_rq:
 nvme1: qid=6 tag=13 op=0 data_len=32768
kworker/5:1H-475     [005] .....   232.768924: nvme_tcp_queue_rq:
 nvme1: qid=6 tag=12 op=0 data_len=12288
tar-1785    [007] .....   232.769141: nvme_tcp_queue_rq:
 nvme1: qid=8 tag=69 op=0 data_len=20480


dmesg.out (reordered to match nvme-trace.out)

[  262.889536] nvme nvme1: state: tag 0x18 io_cpu 5 smp_id 5 smp_id2 5 sync 1
 empty 1 lock 1 last 0 more 0 send 0 sendpage -1
[  262.889337] nvme nvme1: state: tag 0x17 io_cpu 5 smp_id 5 smp_id2 5 sync 1
 empty 1 lock 1 last 0 more 0 send 0 sendpage -1
[  262.889110] nvme nvme1: state: tag 0x14 io_cpu 5 smp_id 5 smp_id2 7 sync 1
 empty 1 lock 1 last 0 more 0 send 0 sendpage -1
[  262.888864] nvme nvme1: state: tag 0x13 io_cpu 5 smp_id 5 smp_id2 5 sync 1
 empty 0 lock 0 last 0 more 1 send 1 sendpage 72
[  262.888727] nvme nvme1: state: tag 0x12 io_cpu 5 smp_id 5 smp_id2 5 sync 1
 empty 0 lock 0 last 1 more 1 send 1 sendpage 72


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: nvme-tcp: i/o errors and stalled host from failure to send command pdus
  2022-08-31 20:04         ` Jonathan Nicklin
@ 2022-08-31 23:42           ` Sagi Grimberg
  2022-09-01 12:37             ` Jonathan Nicklin
  0 siblings, 1 reply; 21+ messages in thread
From: Sagi Grimberg @ 2022-08-31 23:42 UTC (permalink / raw)
  To: Jonathan Nicklin; +Cc: linux-nvme


> 
> Here's a trace including state from nvme_tcp_try_send_cmd_pdu(), like so.
> 
> req->dbg.send = 1;
> ret = kernel_sendpage(queue->sock, virt_to_page(pdu), offset, len, flags);
> req->dbg.sendpage = ret;
> 
> 
> nvme-trace.out
> 
> tar-1785    [005] .....   232.768892:nvme_tcp_queue_rq:
>   nvme1: qid=6 tag=18 op=0 data_len=20480
> tar-1785    [005] .....   232.768894: nvme_tcp_queue_rq:
>   nvme1: qid=6 tag=17 op=0 data_len=32768
> tar-1785    [005] .N...   232.768895: nvme_tcp_queue_rq:
>   nvme1: qid=6 tag=14 op=0 data_len=32768
> kworker/5:1H-475     [005] .....   232.768923: nvme_tcp_queue_rq:
>   nvme1: qid=6 tag=13 op=0 data_len=32768
> kworker/5:1H-475     [005] .....   232.768924: nvme_tcp_queue_rq:
>   nvme1: qid=6 tag=12 op=0 data_len=12288
> tar-1785    [007] .....   232.769141: nvme_tcp_queue_rq:
>   nvme1: qid=8 tag=69 op=0 data_len=20480
> 
> 
> dmesg.out (reordered to match nvme-trace.out)
> 
> [  262.889536] nvme nvme1: state: tag 0x18 io_cpu 5 smp_id 5 smp_id2 5 sync 1
>   empty 1 lock 1 last 0 more 0 send 0 sendpage -1

Something is not adding up...
lock=1 means that we get into nvme_tcp_send_all(), but somehow we are 
not able to send it to the wire?

I'd look into this and add an indication that it is pulled from the
req_list, and in turn from the send_list. It looks like it is somehow
picked out and silently dropped, because as long as it is on the list(s)
another context should have picked it up at some point...
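
Something along these lines in nvme_tcp_fetch_request() would show it
(rough sketch of the current code, not a verbatim copy; "fetched" is
just an illustrative marker for your dbg patch):

static inline struct nvme_tcp_request *
nvme_tcp_fetch_request(struct nvme_tcp_queue *queue)
{
	struct nvme_tcp_request *req;

	req = list_first_entry_or_null(&queue->send_list,
			struct nvme_tcp_request, entry);
	if (!req) {
		/* splice the lockless req_list over to the send_list */
		nvme_tcp_process_req_list(queue);
		req = list_first_entry_or_null(&queue->send_list,
				struct nvme_tcp_request, entry);
		if (unlikely(!req))
			return NULL;
	}

	/* the request is now owned by the send path */
	req->dbg.fetched = 1;
	list_del(&req->entry);
	return req;
}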

> [  262.889337] nvme nvme1: state: tag 0x17 io_cpu 5 smp_id 5 smp_id2 5 sync 1
>   empty 1 lock 1 last 0 more 0 send 0 sendpage -1
> [  262.889110] nvme nvme1: state: tag 0x14 io_cpu 5 smp_id 5 smp_id2 7 sync 1
>   empty 1 lock 1 last 0 more 0 send 0 sendpage -1
> [  262.888864] nvme nvme1: state: tag 0x13 io_cpu 5 smp_id 5 smp_id2 5 sync 1
>   empty 0 lock 0 last 0 more 1 send 1 sendpage 72
> [  262.888727] nvme nvme1: state: tag 0x12 io_cpu 5 smp_id 5 smp_id2 5 sync 1
>   empty 0 lock 0 last 1 more 1 send 1 sendpage 72


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: nvme-tcp: i/o errors and stalled host from failure to send command pdus
  2022-08-31 23:42           ` Sagi Grimberg
@ 2022-09-01 12:37             ` Jonathan Nicklin
  2022-09-01 13:22               ` Jonathan Nicklin
  2022-09-01 17:00               ` Sagi Grimberg
  0 siblings, 2 replies; 21+ messages in thread
From: Jonathan Nicklin @ 2022-09-01 12:37 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme

On Wed, Aug 31, 2022 at 7:42 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
> >
> > Here's a trace including state from nvme_tcp_try_send_cmd_pdu(), like so.
> >
> > req->dbg.send = 1;
> > ret = kernel_sendpage(queue->sock, virt_to_page(pdu), offset, len, flags);
> > req->dbg.sendpage = ret;
> >
> >
> > nvme-trace.out
> >
> > tar-1785    [005] .....   232.768892:nvme_tcp_queue_rq:
> >   nvme1: qid=6 tag=18 op=0 data_len=20480
> > tar-1785    [005] .....   232.768894: nvme_tcp_queue_rq:
> >   nvme1: qid=6 tag=17 op=0 data_len=32768
> > tar-1785    [005] .N...   232.768895: nvme_tcp_queue_rq:
> >   nvme1: qid=6 tag=14 op=0 data_len=32768
> > kworker/5:1H-475     [005] .....   232.768923: nvme_tcp_queue_rq:
> >   nvme1: qid=6 tag=13 op=0 data_len=32768
> > kworker/5:1H-475     [005] .....   232.768924: nvme_tcp_queue_rq:
> >   nvme1: qid=6 tag=12 op=0 data_len=12288
> > tar-1785    [007] .....   232.769141: nvme_tcp_queue_rq:
> >   nvme1: qid=8 tag=69 op=0 data_len=20480
> >
> >
> > dmesg.out (reordered to match nvme-trace.out)
> >
> > [  262.889536] nvme nvme1: state: tag 0x18 io_cpu 5 smp_id 5 smp_id2 5 sync 1
> >   empty 1 lock 1 last 0 more 0 send 0 sendpage -1
>
> Something is not adding up...
> lock=1, means that we get into nvme_tcp_send_all(), but somehow we are
> not able to send it to the wire?
>
> I'd look into this and add indication that it is pulled from the
> req_list, and in turn from the send_list. It looks like it is somehow
> picked out and silteny dropped, because as long as it is on the list(s)
> another context should have picked it up at some point...
>
> > [  262.889337] nvme nvme1: state: tag 0x17 io_cpu 5 smp_id 5 smp_id2 5 sync 1
> >   empty 1 lock 1 last 0 more 0 send 0 sendpage -1
> > [  262.889110] nvme nvme1: state: tag 0x14 io_cpu 5 smp_id 5 smp_id2 7 sync 1
> >   empty 1 lock 1 last 0 more 0 send 0 sendpage -1
> > [  262.888864] nvme nvme1: state: tag 0x13 io_cpu 5 smp_id 5 smp_id2 5 sync 1
> >   empty 0 lock 0 last 0 more 1 send 1 sendpage 72
> > [  262.888727] nvme nvme1: state: tag 0x12 io_cpu 5 smp_id 5 smp_id2 5 sync 1
> >   empty 0 lock 0 last 1 more 1 send 1 sendpage 72

Ugh. The previous output was misleading because the send state tracking
was re-initialized after the direct-send case. Here's a better trace with
more information. It appears as though sendpage() is always called with
the MSG_SENDPAGE_NOTLAST flag set. I suppose this could mean that there
is some confusion or a race around nvme_tcp_queue_more()?

tar-1872          [004] ..... 389.559784: nvme_tcp_queue_rq:
 nvme1: qid=5 tag=28 op=0 data_len=16384
tar-1872          [004] .N... 389.559786: nvme_tcp_queue_rq:
 nvme1: qid=5 tag=27 op=0 data_len=16384
kworker/4:2H-1873 [004] ..... 389.559798: nvme_tcp_queue_rq:
 nvme1: qid=5 tag=26 op=0 data_len=32768
kworker/4:2H-1873 [004] ..... 389.559798: nvme_tcp_queue_rq:
 nvme1: qid=5 tag=25 op=0 data_len=32768
kworker/4:2H-1873 [004] ..... 389.559799: nvme_tcp_queue_rq:
 nvme1: qid=5 tag=24 op=0 data_len=32768


[  419.790505] nvme nvme1: state: tag 0x28 io_cpu 4 smp_id 4 smp_id2 4 sync 1
 empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
[  419.790299] nvme nvme1: state: tag 0x27 io_cpu 4 smp_id 4 smp_id2 0 sync 1
 empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
[  419.789809] nvme nvme1: state: tag 0x25 io_cpu 4 smp_id 4 smp_id2 4 sync 1
 empty 0 lock 0 last 0 more 1 send 1 send_ret 72 send_flags 0x28040
[  419.790065] nvme nvme1: state: tag 0x26 io_cpu 4 smp_id 4 smp_id2 4 sync 1
 empty 0 lock 0 last 0 more 1 send 1 send_ret 72 send_flags 0x28040
[  419.789663] nvme nvme1: state: tag 0x24 io_cpu 4 smp_id 4 smp_id2 4 sync 1
 empty 0 lock 0 last 1 more 1 send 1 send_ret 72 send_flags 0x28040
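
For what it's worth, send_flags 0x28040 decodes to
MSG_DONTWAIT | MSG_MORE | MSG_SENDPAGE_NOTLAST. If I'm reading
nvme_tcp_try_send_cmd_pdu() right, the flags are derived roughly like
this (paraphrased, not verbatim):

	int flags = MSG_DONTWAIT;

	if (inline_data || nvme_tcp_queue_more(queue))
		flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
	else
		flags |= MSG_EOR;

	ret = kernel_sendpage(queue->sock, virt_to_page(pdu), offset, len, flags);

so every send in the trace above took the nvme_tcp_queue_more() branch.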


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: nvme-tcp: i/o errors and stalled host from failure to send command pdus
  2022-09-01 12:37             ` Jonathan Nicklin
@ 2022-09-01 13:22               ` Jonathan Nicklin
  2022-09-01 15:37                 ` Jonathan Nicklin
  2022-09-01 17:00               ` Sagi Grimberg
  1 sibling, 1 reply; 21+ messages in thread
From: Jonathan Nicklin @ 2022-09-01 13:22 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme

On Thu, Sep 1, 2022 at 8:37 AM Jonathan Nicklin
<jnicklin@blockbridge.com> wrote:
>
> On Wed, Aug 31, 2022 at 7:42 PM Sagi Grimberg <sagi@grimberg.me> wrote:
> >
> >
> > >
> > > Here's a trace including state from nvme_tcp_try_send_cmd_pdu(), like so.
> > >
> > > req->dbg.send = 1;
> > > ret = kernel_sendpage(queue->sock, virt_to_page(pdu), offset, len, flags);
> > > req->dbg.sendpage = ret;
> > >
> > >
> > > nvme-trace.out
> > >
> > > tar-1785    [005] .....   232.768892:nvme_tcp_queue_rq:
> > >   nvme1: qid=6 tag=18 op=0 data_len=20480
> > > tar-1785    [005] .....   232.768894: nvme_tcp_queue_rq:
> > >   nvme1: qid=6 tag=17 op=0 data_len=32768
> > > tar-1785    [005] .N...   232.768895: nvme_tcp_queue_rq:
> > >   nvme1: qid=6 tag=14 op=0 data_len=32768
> > > kworker/5:1H-475     [005] .....   232.768923: nvme_tcp_queue_rq:
> > >   nvme1: qid=6 tag=13 op=0 data_len=32768
> > > kworker/5:1H-475     [005] .....   232.768924: nvme_tcp_queue_rq:
> > >   nvme1: qid=6 tag=12 op=0 data_len=12288
> > > tar-1785    [007] .....   232.769141: nvme_tcp_queue_rq:
> > >   nvme1: qid=8 tag=69 op=0 data_len=20480
> > >
> > >
> > > dmesg.out (reordered to match nvme-trace.out)
> > >
> > > [  262.889536] nvme nvme1: state: tag 0x18 io_cpu 5 smp_id 5 smp_id2 5 sync 1
> > >   empty 1 lock 1 last 0 more 0 send 0 sendpage -1
> >
> > Something is not adding up...
> > lock=1, means that we get into nvme_tcp_send_all(), but somehow we are
> > not able to send it to the wire?
> >
> > I'd look into this and add indication that it is pulled from the
> > req_list, and in turn from the send_list. It looks like it is somehow
> > picked out and silteny dropped, because as long as it is on the list(s)
> > another context should have picked it up at some point...
> >
> > > [  262.889337] nvme nvme1: state: tag 0x17 io_cpu 5 smp_id 5 smp_id2 5 sync 1
> > >   empty 1 lock 1 last 0 more 0 send 0 sendpage -1
> > > [  262.889110] nvme nvme1: state: tag 0x14 io_cpu 5 smp_id 5 smp_id2 7 sync 1
> > >   empty 1 lock 1 last 0 more 0 send 0 sendpage -1
> > > [  262.888864] nvme nvme1: state: tag 0x13 io_cpu 5 smp_id 5 smp_id2 5 sync 1
> > >   empty 0 lock 0 last 0 more 1 send 1 sendpage 72
> > > [  262.888727] nvme nvme1: state: tag 0x12 io_cpu 5 smp_id 5 smp_id2 5 sync 1
> > >   empty 0 lock 0 last 1 more 1 send 1 sendpage 72
>
> Ugh. It was misleading because the send state tracking was
> re-initialized after the direct send case. Here's a better trace with
> more information. It appears as though sendpage() is always called
> with the NOTLAST flag set. I suppose this could mean that there is
> some confusion or race with nvme_tcp_queue_more()?
>
> tar-1872          [004] ..... 389.559784: nvme_tcp_queue_rq:
>  nvme1: qid=5 tag=28 op=0 data_len=16384
> tar-1872          [004] .N... 389.559786: nvme_tcp_queue_rq:
>  nvme1: qid=5 tag=27 op=0 data_len=16384
> kworker/4:2H-1873 [004] ..... 389.559798: nvme_tcp_queue_rq:
>  nvme1: qid=5 tag=26 op=0 data_len=32768
> kworker/4:2H-1873 [004] ..... 389.559798: nvme_tcp_queue_rq:
>  nvme1: qid=5 tag=25 op=0 data_len=32768
> kworker/4:2H-1873 [004] ..... 389.559799: nvme_tcp_queue_rq:
>  nvme1: qid=5 tag=24 op=0 data_len=32768
>
>
> [  419.790505] nvme nvme1: state: tag 0x28 io_cpu 4 smp_id 4 smp_id2 4 sync 1
>  empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> [  419.790299] nvme nvme1: state: tag 0x27 io_cpu 4 smp_id 4 smp_id2 0 sync 1
>  empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> [  419.789809] nvme nvme1: state: tag 0x25 io_cpu 4 smp_id 4 smp_id2 4 sync 1
>  empty 0 lock 0 last 0 more 1 send 1 send_ret 72 send_flags 0x28040
> [  419.790065] nvme nvme1: state: tag 0x26 io_cpu 4 smp_id 4 smp_id2 4 sync 1
>  empty 0 lock 0 last 0 more 1 send 1 send_ret 72 send_flags 0x28040
> [  419.789663] nvme nvme1: state: tag 0x24 io_cpu 4 smp_id 4 smp_id2 4 sync 1
>  empty 0 lock 0 last 1 more 1 send 1 send_ret 72 send_flags 0x28040

Additional tracing suggests it could be a race or issue with
queue->more_requests.

[  675.762509] nvme nvme1: state: tag 0x5f io_cpu 1 smp_id 1 smp_id2 1 sync 1
  empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
  q_send_list 0 q_req_list 0 q_more_requests 1
[  675.720866] nvme nvme1: state: tag 0x5e io_cpu 1 smp_id 1 smp_id2 5 sync 1
  empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
  q_send_list 0 q_req_list 0 q_more_requests 1
[  675.679256] nvme nvme1: state: tag 0x5d io_cpu 1 smp_id 1 smp_id2 1 sync 1
  empty 0 lock 0 last 0 more 1 send 1 send_ret 72 send_flags 0x28040
  q_send_list 1 q_req_list 0 q_more_requests 1
[  675.621054] nvme nvme1: state: tag 0x5c io_cpu 1 smp_id 1 smp_id2 1 sync 1
  empty 0 lock 0 last 0 more 1 send 1 send_ret 72 send_flags 0x28040
  q_send_list 1 q_req_list 0 q_more_requests 1
[  675.593414] nvme nvme1: state: tag 0x5a io_cpu 1 smp_id 1 smp_id2 1 sync 1
  empty 0 lock 0 last 1 more 1 send 1 send_ret 72 send_flags 0x28040
  q_send_list 0 q_req_list 0 q_more_requests 1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: nvme-tcp: i/o errors and stalled host from failure to send command pdus
  2022-09-01 13:22               ` Jonathan Nicklin
@ 2022-09-01 15:37                 ` Jonathan Nicklin
  2022-09-01 17:02                   ` Sagi Grimberg
  0 siblings, 1 reply; 21+ messages in thread
From: Jonathan Nicklin @ 2022-09-01 15:37 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme

On Thu, Sep 1, 2022 at 9:22 AM Jonathan Nicklin
<jnicklin@blockbridge.com> wrote:
>
> On Thu, Sep 1, 2022 at 8:37 AM Jonathan Nicklin
> <jnicklin@blockbridge.com> wrote:
> >
> > On Wed, Aug 31, 2022 at 7:42 PM Sagi Grimberg <sagi@grimberg.me> wrote:
> > >
> > >
> > > >
> > > > Here's a trace including state from nvme_tcp_try_send_cmd_pdu(), like so.
> > > >
> > > > req->dbg.send = 1;
> > > > ret = kernel_sendpage(queue->sock, virt_to_page(pdu), offset, len, flags);
> > > > req->dbg.sendpage = ret;
> > > >
> > > >
> > > > nvme-trace.out
> > > >
> > > > tar-1785    [005] .....   232.768892:nvme_tcp_queue_rq:
> > > >   nvme1: qid=6 tag=18 op=0 data_len=20480
> > > > tar-1785    [005] .....   232.768894: nvme_tcp_queue_rq:
> > > >   nvme1: qid=6 tag=17 op=0 data_len=32768
> > > > tar-1785    [005] .N...   232.768895: nvme_tcp_queue_rq:
> > > >   nvme1: qid=6 tag=14 op=0 data_len=32768
> > > > kworker/5:1H-475     [005] .....   232.768923: nvme_tcp_queue_rq:
> > > >   nvme1: qid=6 tag=13 op=0 data_len=32768
> > > > kworker/5:1H-475     [005] .....   232.768924: nvme_tcp_queue_rq:
> > > >   nvme1: qid=6 tag=12 op=0 data_len=12288
> > > > tar-1785    [007] .....   232.769141: nvme_tcp_queue_rq:
> > > >   nvme1: qid=8 tag=69 op=0 data_len=20480
> > > >
> > > >
> > > > dmesg.out (reordered to match nvme-trace.out)
> > > >
> > > > [  262.889536] nvme nvme1: state: tag 0x18 io_cpu 5 smp_id 5 smp_id2 5 sync 1
> > > >   empty 1 lock 1 last 0 more 0 send 0 sendpage -1
> > >
> > > Something is not adding up...
> > > lock=1, means that we get into nvme_tcp_send_all(), but somehow we are
> > > not able to send it to the wire?
> > >
> > > I'd look into this and add indication that it is pulled from the
> > > req_list, and in turn from the send_list. It looks like it is somehow
> > > picked out and silteny dropped, because as long as it is on the list(s)
> > > another context should have picked it up at some point...
> > >
> > > > [  262.889337] nvme nvme1: state: tag 0x17 io_cpu 5 smp_id 5 smp_id2 5 sync 1
> > > >   empty 1 lock 1 last 0 more 0 send 0 sendpage -1
> > > > [  262.889110] nvme nvme1: state: tag 0x14 io_cpu 5 smp_id 5 smp_id2 7 sync 1
> > > >   empty 1 lock 1 last 0 more 0 send 0 sendpage -1
> > > > [  262.888864] nvme nvme1: state: tag 0x13 io_cpu 5 smp_id 5 smp_id2 5 sync 1
> > > >   empty 0 lock 0 last 0 more 1 send 1 sendpage 72
> > > > [  262.888727] nvme nvme1: state: tag 0x12 io_cpu 5 smp_id 5 smp_id2 5 sync 1
> > > >   empty 0 lock 0 last 1 more 1 send 1 sendpage 72
> >
> > Ugh. It was misleading because the send state tracking was
> > re-initialized after the direct send case. Here's a better trace with
> > more information. It appears as though sendpage() is always called
> > with the NOTLAST flag set. I suppose this could mean that there is
> > some confusion or race with nvme_tcp_queue_more()?
> >
> > tar-1872          [004] ..... 389.559784: nvme_tcp_queue_rq:
> >  nvme1: qid=5 tag=28 op=0 data_len=16384
> > tar-1872          [004] .N... 389.559786: nvme_tcp_queue_rq:
> >  nvme1: qid=5 tag=27 op=0 data_len=16384
> > kworker/4:2H-1873 [004] ..... 389.559798: nvme_tcp_queue_rq:
> >  nvme1: qid=5 tag=26 op=0 data_len=32768
> > kworker/4:2H-1873 [004] ..... 389.559798: nvme_tcp_queue_rq:
> >  nvme1: qid=5 tag=25 op=0 data_len=32768
> > kworker/4:2H-1873 [004] ..... 389.559799: nvme_tcp_queue_rq:
> >  nvme1: qid=5 tag=24 op=0 data_len=32768
> >
> >
> > [  419.790505] nvme nvme1: state: tag 0x28 io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >  empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> > [  419.790299] nvme nvme1: state: tag 0x27 io_cpu 4 smp_id 4 smp_id2 0 sync 1
> >  empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> > [  419.789809] nvme nvme1: state: tag 0x25 io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >  empty 0 lock 0 last 0 more 1 send 1 send_ret 72 send_flags 0x28040
> > [  419.790065] nvme nvme1: state: tag 0x26 io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >  empty 0 lock 0 last 0 more 1 send 1 send_ret 72 send_flags 0x28040
> > [  419.789663] nvme nvme1: state: tag 0x24 io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >  empty 0 lock 0 last 1 more 1 send 1 send_ret 72 send_flags 0x28040
>
> Additional tracing suggests it could be a race or issue with
> queue->more_requests.
>
> [  675.762509] nvme nvme1: state: tag 0x5f io_cpu 1 smp_id 1 smp_id2 1 sync 1
>   empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
>   q_send_list 0 q_req_list 0 q_more_requests 1
> [  675.720866] nvme nvme1: state: tag 0x5e io_cpu 1 smp_id 1 smp_id2 5 sync 1
>   empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
>   q_send_list 0 q_req_list 0 q_more_requests 1
> [  675.679256] nvme nvme1: state: tag 0x5d io_cpu 1 smp_id 1 smp_id2 1 sync 1
>   empty 0 lock 0 last 0 more 1 send 1 send_ret 72 send_flags 0x28040
>   q_send_list 1 q_req_list 0 q_more_requests 1
> [  675.621054] nvme nvme1: state: tag 0x5c io_cpu 1 smp_id 1 smp_id2 1 sync 1
>   empty 0 lock 0 last 0 more 1 send 1 send_ret 72 send_flags 0x28040
>  q_send_list 1 q_req_list 0 q_more_requests 1
> [  675.593414] nvme nvme1: state: tag 0x5a io_cpu 1 smp_id 1 smp_id2 1 sync 1
>   empty 0 lock 0 last 1 more 1 send 1 send_ret 72 send_flags 0x28040
>   q_send_list 0 q_req_list 0 q_more_requests 1

Question: queue->more_requests is updated under a lock. However, not
all readers of queue->more_requests acquire the lock. The smp_id/smp_id2
values seem to show that the same queue was being processed on
different CPUs. Is this a race between the kworker and inline
submission?
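
To make the question concrete, these are roughly the spots I mean
(paraphrasing the 5.19 code from memory, not a verbatim copy):

static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue)
{
	return !list_empty(&queue->send_list) ||
		!llist_empty(&queue->req_list) ||
		queue->more_requests;
}

	/*
	 * tail of nvme_tcp_queue_request(), after the direct-send attempt;
	 * send_mutex is not held when more_requests is read here
	 */
	if (last && nvme_tcp_queue_more(queue))
		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);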


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: nvme-tcp: i/o errors and stalled host from failure to send command pdus
  2022-09-01 12:37             ` Jonathan Nicklin
  2022-09-01 13:22               ` Jonathan Nicklin
@ 2022-09-01 17:00               ` Sagi Grimberg
  2022-09-01 17:39                 ` Jonathan Nicklin
  1 sibling, 1 reply; 21+ messages in thread
From: Sagi Grimberg @ 2022-09-01 17:00 UTC (permalink / raw)
  To: Jonathan Nicklin; +Cc: linux-nvme


>>> Here's a trace including state from nvme_tcp_try_send_cmd_pdu(), like so.
>>>
>>> req->dbg.send = 1;
>>> ret = kernel_sendpage(queue->sock, virt_to_page(pdu), offset, len, flags);
>>> req->dbg.sendpage = ret;
>>>
>>>
>>> nvme-trace.out
>>>
>>> tar-1785    [005] .....   232.768892:nvme_tcp_queue_rq:
>>>    nvme1: qid=6 tag=18 op=0 data_len=20480
>>> tar-1785    [005] .....   232.768894: nvme_tcp_queue_rq:
>>>    nvme1: qid=6 tag=17 op=0 data_len=32768
>>> tar-1785    [005] .N...   232.768895: nvme_tcp_queue_rq:
>>>    nvme1: qid=6 tag=14 op=0 data_len=32768
>>> kworker/5:1H-475     [005] .....   232.768923: nvme_tcp_queue_rq:
>>>    nvme1: qid=6 tag=13 op=0 data_len=32768
>>> kworker/5:1H-475     [005] .....   232.768924: nvme_tcp_queue_rq:
>>>    nvme1: qid=6 tag=12 op=0 data_len=12288
>>> tar-1785    [007] .....   232.769141: nvme_tcp_queue_rq:
>>>    nvme1: qid=8 tag=69 op=0 data_len=20480
>>>
>>>
>>> dmesg.out (reordered to match nvme-trace.out)
>>>
>>> [  262.889536] nvme nvme1: state: tag 0x18 io_cpu 5 smp_id 5 smp_id2 5 sync 1
>>>    empty 1 lock 1 last 0 more 0 send 0 sendpage -1
>>
>> Something is not adding up...
>> lock=1, means that we get into nvme_tcp_send_all(), but somehow we are
>> not able to send it to the wire?
>>
>> I'd look into this and add indication that it is pulled from the
>> req_list, and in turn from the send_list. It looks like it is somehow
>> picked out and silteny dropped, because as long as it is on the list(s)
>> another context should have picked it up at some point...
>>
>>> [  262.889337] nvme nvme1: state: tag 0x17 io_cpu 5 smp_id 5 smp_id2 5 sync 1
>>>    empty 1 lock 1 last 0 more 0 send 0 sendpage -1
>>> [  262.889110] nvme nvme1: state: tag 0x14 io_cpu 5 smp_id 5 smp_id2 7 sync 1
>>>    empty 1 lock 1 last 0 more 0 send 0 sendpage -1
>>> [  262.888864] nvme nvme1: state: tag 0x13 io_cpu 5 smp_id 5 smp_id2 5 sync 1
>>>    empty 0 lock 0 last 0 more 1 send 1 sendpage 72
>>> [  262.888727] nvme nvme1: state: tag 0x12 io_cpu 5 smp_id 5 smp_id2 5 sync 1
>>>    empty 0 lock 0 last 1 more 1 send 1 sendpage 72
> 
> Ugh. It was misleading because the send state tracking was
> re-initialized after the direct send case. Here's a better trace with
> more information. It appears as though sendpage() is always called
> with the NOTLAST flag set. I suppose this could mean that there is
> some confusion or race with nvme_tcp_queue_more()?

I don't know yet...

> tar-1872          [004] ..... 389.559784: nvme_tcp_queue_rq:
>   nvme1: qid=5 tag=28 op=0 data_len=16384
> tar-1872          [004] .N... 389.559786: nvme_tcp_queue_rq:
>   nvme1: qid=5 tag=27 op=0 data_len=16384
> kworker/4:2H-1873 [004] ..... 389.559798: nvme_tcp_queue_rq:
>   nvme1: qid=5 tag=26 op=0 data_len=32768
> kworker/4:2H-1873 [004] ..... 389.559798: nvme_tcp_queue_rq:
>   nvme1: qid=5 tag=25 op=0 data_len=32768
> kworker/4:2H-1873 [004] ..... 389.559799: nvme_tcp_queue_rq:
>   nvme1: qid=5 tag=24 op=0 data_len=32768
> 
> 
> [  419.790505] nvme nvme1: state: tag 0x28 io_cpu 4 smp_id 4 smp_id2 4 sync 1
>   empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> [  419.790299] nvme nvme1: state: tag 0x27 io_cpu 4 smp_id 4 smp_id2 0 sync 1
>   empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040

Here lock=1, which means that we send inline; however last=0, which
means that more requests are coming, and that corresponds with the
MSG_SENDPAGE_NOTLAST flag.

> [  419.789809] nvme nvme1: state: tag 0x25 io_cpu 4 smp_id 4 smp_id2 4 sync 1
>   empty 0 lock 0 last 0 more 1 send 1 send_ret 72 send_flags 0x28040
> [  419.790065] nvme nvme1: state: tag 0x26 io_cpu 4 smp_id 4 smp_id2 4 sync 1
>   empty 0 lock 0 last 0 more 1 send 1 send_ret 72 send_flags 0x28040

Here lock=0, but last=0 as well, so we add the request to the req_list
and do _not_ trigger io_work, again because more requests are coming.

> [  419.789663] nvme nvme1: state: tag 0x24 io_cpu 4 smp_id 4 smp_id2 4 sync 1
>   empty 0 lock 0 last 1 more 1 send 1 send_ret 72 send_flags 0x28040

Here lock=0, last=1, and we have requests that are queued (more=1), so
we add to the req_list and schedule io_work. Now, sending the last
3 commands happens asynchronously, so by the time we get to send
to the wire, it is possible that more requests got queued, which
may explain why the MSG_SENDPAGE_NOTLAST flag is set...

So we cannot know for sure... You could also record what
nvme_tcp_queue_more() returns at the time the flag is determined, but
according to the code, it is set if more requests are queued in the
req_list/send_list.

At some point, we'd expect to send a request that has the
MSG_SENDPAGE_NOTLAST flag cleared. It would be interesting to make sure
that we are getting there.

BTW, is this a read-only workload? Which filesystem? Can you share
more about the workload?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: nvme-tcp: i/o errors and stalled host from failure to send command pdus
  2022-09-01 15:37                 ` Jonathan Nicklin
@ 2022-09-01 17:02                   ` Sagi Grimberg
  2022-09-01 17:46                     ` Jonathan Nicklin
  0 siblings, 1 reply; 21+ messages in thread
From: Sagi Grimberg @ 2022-09-01 17:02 UTC (permalink / raw)
  To: Jonathan Nicklin; +Cc: linux-nvme


> Question: queue->more_requests is updated under a lock. However, not
> all readers of queue->more_requests acquire the lock. The smp_id/smp_id2
> values seem to show that the same queue was being processed on
> different CPUs. Is this a race between the kworker and inline
> submission?

The entire send path in io_work is wrapped with the send_mutex. Which
part is not protected with the lock?
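
i.e. it has roughly this shape (abridged from memory, not verbatim):

static void nvme_tcp_io_work(struct work_struct *w)
{
	struct nvme_tcp_queue *queue =
		container_of(w, struct nvme_tcp_queue, io_work);
	unsigned long deadline = jiffies + msecs_to_jiffies(1);

	do {
		bool pending = false;

		if (mutex_trylock(&queue->send_mutex)) {
			/* all sending from io_work is under send_mutex */
			if (nvme_tcp_try_send(queue) > 0)
				pending = true;
			mutex_unlock(&queue->send_mutex);
		}

		if (nvme_tcp_try_recv(queue) > 0)
			pending = true;

		if (!pending)
			return;
	} while (!time_after(jiffies, deadline));

	/* requeue on the queue's io_cpu and keep going */
	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}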


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: nvme-tcp: i/o errors and stalled host from failure to send command pdus
  2022-09-01 17:00               ` Sagi Grimberg
@ 2022-09-01 17:39                 ` Jonathan Nicklin
  0 siblings, 0 replies; 21+ messages in thread
From: Jonathan Nicklin @ 2022-09-01 17:39 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme

Of course! The workload is a test from a suite of data-consistency tests
we run. The test walks a directory and generates a checksum of the
contents, so it is pretty much read-only. The underlying filesystem
is XFS. The kernel is 5.19.5. To reproduce, I'm just running the
following in a loop:

umount /mnt/nvme
mount /dev/nvme1n1 /mnt/nvme
find /mnt/nvme/linux-5.18.3 | LC_ALL=C sort | sudo tar -Pcf - -T -
--no-recursion | md5sum

This is on a 4-core Intel CPU with hyperthreading (4+4) and a 100G
Mellanox NIC (all interrupts bound to core 0).


On Thu, Sep 1, 2022 at 1:00 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
> >>> Here's a trace including state from nvme_tcp_try_send_cmd_pdu(), like so.
> >>>
> >>> req->dbg.send = 1;
> >>> ret = kernel_sendpage(queue->sock, virt_to_page(pdu), offset, len, flags);
> >>> req->dbg.sendpage = ret;
> >>>
> >>>
> >>> nvme-trace.out
> >>>
> >>> tar-1785    [005] .....   232.768892:nvme_tcp_queue_rq:
> >>>    nvme1: qid=6 tag=18 op=0 data_len=20480
> >>> tar-1785    [005] .....   232.768894: nvme_tcp_queue_rq:
> >>>    nvme1: qid=6 tag=17 op=0 data_len=32768
> >>> tar-1785    [005] .N...   232.768895: nvme_tcp_queue_rq:
> >>>    nvme1: qid=6 tag=14 op=0 data_len=32768
> >>> kworker/5:1H-475     [005] .....   232.768923: nvme_tcp_queue_rq:
> >>>    nvme1: qid=6 tag=13 op=0 data_len=32768
> >>> kworker/5:1H-475     [005] .....   232.768924: nvme_tcp_queue_rq:
> >>>    nvme1: qid=6 tag=12 op=0 data_len=12288
> >>> tar-1785    [007] .....   232.769141: nvme_tcp_queue_rq:
> >>>    nvme1: qid=8 tag=69 op=0 data_len=20480
> >>>
> >>>
> >>> dmesg.out (reordered to match nvme-trace.out)
> >>>
> >>> [  262.889536] nvme nvme1: state: tag 0x18 io_cpu 5 smp_id 5 smp_id2 5 sync 1
> >>>    empty 1 lock 1 last 0 more 0 send 0 sendpage -1
> >>
> >> Something is not adding up...
> >> lock=1, means that we get into nvme_tcp_send_all(), but somehow we are
> >> not able to send it to the wire?
> >>
> >> I'd look into this and add indication that it is pulled from the
> >> req_list, and in turn from the send_list. It looks like it is somehow
> >> picked out and silteny dropped, because as long as it is on the list(s)
> >> another context should have picked it up at some point...
> >>
> >>> [  262.889337] nvme nvme1: state: tag 0x17 io_cpu 5 smp_id 5 smp_id2 5 sync 1
> >>>    empty 1 lock 1 last 0 more 0 send 0 sendpage -1
> >>> [  262.889110] nvme nvme1: state: tag 0x14 io_cpu 5 smp_id 5 smp_id2 7 sync 1
> >>>    empty 1 lock 1 last 0 more 0 send 0 sendpage -1
> >>> [  262.888864] nvme nvme1: state: tag 0x13 io_cpu 5 smp_id 5 smp_id2 5 sync 1
> >>>    empty 0 lock 0 last 0 more 1 send 1 sendpage 72
> >>> [  262.888727] nvme nvme1: state: tag 0x12 io_cpu 5 smp_id 5 smp_id2 5 sync 1
> >>>    empty 0 lock 0 last 1 more 1 send 1 sendpage 72
> >
> > Ugh. It was misleading because the send state tracking was
> > re-initialized after the direct send case. Here's a better trace with
> > more information. It appears as though sendpage() is always called
> > with the NOTLAST flag set. I suppose this could mean that there is
> > some confusion or race with nvme_tcp_queue_more()?
>
> I don't know yet...
>
> > tar-1872          [004] ..... 389.559784: nvme_tcp_queue_rq:
> >   nvme1: qid=5 tag=28 op=0 data_len=16384
> > tar-1872          [004] .N... 389.559786: nvme_tcp_queue_rq:
> >   nvme1: qid=5 tag=27 op=0 data_len=16384
> > kworker/4:2H-1873 [004] ..... 389.559798: nvme_tcp_queue_rq:
> >   nvme1: qid=5 tag=26 op=0 data_len=32768
> > kworker/4:2H-1873 [004] ..... 389.559798: nvme_tcp_queue_rq:
> >   nvme1: qid=5 tag=25 op=0 data_len=32768
> > kworker/4:2H-1873 [004] ..... 389.559799: nvme_tcp_queue_rq:
> >   nvme1: qid=5 tag=24 op=0 data_len=32768
> >
> >
> > [  419.790505] nvme nvme1: state: tag 0x28 io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >   empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> > [  419.790299] nvme nvme1: state: tag 0x27 io_cpu 4 smp_id 4 smp_id2 0 sync 1
> >   empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
>
> Here lock=1, which means that we send inline, however last=0, which
> means that more requests are coming and corresponds with NOLAST flag.
>
> > [  419.789809] nvme nvme1: state: tag 0x25 io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >   empty 0 lock 0 last 0 more 1 send 1 send_ret 72 send_flags 0x28040
> > [  419.790065] nvme nvme1: state: tag 0x26 io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >   empty 0 lock 0 last 0 more 1 send 1 send_ret 72 send_flags 0x28040
>
> here lock=0, but last=0 as well, so we add the request to req_list and
> _not_ trigger io_work, again because more requests are coming.
>
> > [  419.789663] nvme nvme1: state: tag 0x24 io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >   empty 0 lock 0 last 1 more 1 send 1 send_ret 72 send_flags 0x28040
>
> Here lock=0, last=1, and we have requests that are queued (more=1), so
> we add the the req_list, and schedule io_work. Now sending the last
> 3 commands happens asynchronously, so by the time that we get to send
> to the wire, it is possible that more requests got queued, which
> may explain why NOLAST flag is set...
>
> So we cannot know for sure... You can add what is nvme_tcp_queue_more
> at the time the flag is determined, but according to the code, it is
> set if more requests are queued in req_list/send_list.
>
> At some point, we'd expect to send with a request that has NOLAST
> flag cleared. Would be interesting to make sure that we are getting
> there.
>
> BTW, is this a read-only workload? filesystem? can you share more
> about the workload?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: nvme-tcp: i/o errors and stalled host from failure to send command pdus
  2022-09-01 17:02                   ` Sagi Grimberg
@ 2022-09-01 17:46                     ` Jonathan Nicklin
  2022-09-01 22:56                       ` Jonathan Nicklin
  0 siblings, 1 reply; 21+ messages in thread
From: Jonathan Nicklin @ 2022-09-01 17:46 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme

nvme_tcp_queue_more() reads queue->more_requests. So, I suppose
nvme_tcp_queue_request() is an example where queue->more_requests gets
read without the send_mutex? The call to queue_work_on() is conditional
on it.

Apologies in advance if I'm misunderstanding the code.

On Thu, Sep 1, 2022 at 1:02 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
> > Question: queue->more_requests is updated under a lock. However, not
> > all readers of queue->more_requests acquire the lock. The smp_id/smp_id2
> > values seem to show that the same queue was being processed on
> > different CPUs. Is this a race between the kworker and inline
> > submission?
>
> The entire send path in io_work is wrapped with the send_mutex. Which
> part is not protected with the lock?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: nvme-tcp: i/o errors and stalled host from failure to send command pdus
  2022-09-01 17:46                     ` Jonathan Nicklin
@ 2022-09-01 22:56                       ` Jonathan Nicklin
  2022-09-02 13:47                         ` Jonathan Nicklin
  0 siblings, 1 reply; 21+ messages in thread
From: Jonathan Nicklin @ 2022-09-01 22:56 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme

On Thu, Sep 1, 2022 at 1:46 PM Jonathan Nicklin
<jnicklin@blockbridge.com> wrote:
>
> nvme_tcp_queue_more() reads queue->mode_requests. So, I supposed
> nvme_tcp_queue_request() is an example where queue->more_requests gets
> read without the send_mutex? queue_work_on() is conditionalized on this.
>
> Apologies in advance if I'm misunderstanding the code.
>
> On Thu, Sep 1, 2022 at 1:02 PM Sagi Grimberg <sagi@grimberg.me> wrote:
> >
> >
> > > Question: queue->more_requests is updated under a lock. However, not
> > > all readers of queue->more_requests acquire the lock. The smp_id/smp_id2
> > > values seem to show that the same queue was being processed on
> > > different CPUs. Is this a race between the kworker and inline
> > > submission?
> >
> > The entire send path in io_work is wrapped with the send_mutex. Which
> > part is not protected with the lock?

Okay, maybe some progress?! What appears to be happening is that
the request marked "last" is being picked up via nvme_tcp_send_all() when
processing the previous command. Consider another trace:

                 CPU
tar-2263    [004] .....   680.671184: nvme_tcp_queue_rq: nvme1: qid=5
tag=5e op=0 data_len=16384
tar-2263    [004] .....   680.671185: nvme_tcp_cmd_pdu: nvme1: qid=5
tag=5e op=0 page_offset=1352 send_len=72
tar-2263    [004] .....   680.671186: nvme_tcp_queue_rq: nvme1: qid=5
tag=5d op=0 data_len=32768
tar-2263    [004] .....   680.671186: nvme_tcp_cmd_pdu: nvme1: qid=5
tag=5d op=0 page_offset=1424 send_len=72
tar-2263    [004] .....   680.671186: nvme_tcp_queue_rq: nvme1: qid=5
tag=5c op=0 data_len=32768
tar-2263    [004] .....   680.671187: nvme_tcp_cmd_pdu: nvme1: qid=5
tag=5c op=0 page_offset=1496 send_len=72
tar-2263    [004] .N...   680.671188: nvme_tcp_queue_rq: nvme1: qid=5
tag=5b op=0 data_len=32768
tar-2263    [004] .N...   680.671189: nvme_tcp_cmd_pdu: nvme1: qid=5
tag=5b op=0 page_offset=1568 send_len=72
kworker/4:0H-38 [004] .....   680.671198: nvme_tcp_queue_rq: nvme1:
qid=5 tag=5a op=0 data_len=16384
tar-2263        [000] .....   680.671199: nvme_tcp_cmd_pdu: nvme1:
qid=5 tag=5a op=0 page_offset=1640 send_len=72

I've added some additional state to the request, printed out on
timeout. First, queue->more_requests now records the tag of the command
that asserted it. Second, I recorded the number of commands issued via
nvme_tcp_send_all(). As you can see, when tag 0x5b was issued, it ended
up sending 0x5a as well. more_requests was still asserted by 0x5b when
nvme_tcp_try_send_cmd_pdu() processed 0x5a, so we never actually pushed
it out onto the wire.

[  709.457051] nvme nvme1: state: tag 0x5e io_cpu 4 smp_id 4 smp_id2 4 sync 1
 empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
q_send_list 0
 q_req_list 0 q_more_requests 0x5e req_sent 1
[  709.414327] nvme nvme1: state: tag 0x5d io_cpu 4 smp_id 4 smp_id2 4 sync 1
 empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
q_send_list 0
 q_req_list 0 q_more_requests 0x5d req_sent 1
[  709.371665] nvme nvme1: state: tag 0x5c io_cpu 4 smp_id 4 smp_id2 4 sync 1
 empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
q_send_list 0
 q_req_list 0 q_more_requests 0x5c req_sent 1
[  709.312298] nvme nvme1: state: tag 0x5b io_cpu 4 smp_id 4 smp_id2 0 sync 1
 empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
q_send_list 0
 q_req_list 0 q_more_requests 0x5b req_sent 2 !!!
[  709.283534] nvme nvme1: state: tag 0x5a io_cpu 4 smp_id 4 smp_id2 4 sync 1
 empty 0 lock 0 last 1 more 1 send 1 send_ret 72 send_flags 0x28040
q_send_list 0
 q_req_list 0 q_more_requests 0x5b !!! req_sent 0
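
That is consistent with nvme_tcp_send_all() draining everything that is
queued, including requests queued by other contexts in the meantime
(roughly, for reference):

static void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
{
	int ret;

	/* keep sending until the send_list and req_list are both drained */
	do {
		ret = nvme_tcp_try_send(queue);
	} while (ret > 0);
}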


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: nvme-tcp: i/o errors and stalled host from failure to send command pdus
  2022-09-01 22:56                       ` Jonathan Nicklin
@ 2022-09-02 13:47                         ` Jonathan Nicklin
  2022-09-05  8:49                           ` Sagi Grimberg
  0 siblings, 1 reply; 21+ messages in thread
From: Jonathan Nicklin @ 2022-09-02 13:47 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme

On Thu, Sep 1, 2022 at 6:56 PM Jonathan Nicklin
<jnicklin@blockbridge.com> wrote:
>
> On Thu, Sep 1, 2022 at 1:46 PM Jonathan Nicklin
> <jnicklin@blockbridge.com> wrote:
> >
> > nvme_tcp_queue_more() reads queue->mode_requests. So, I supposed
> > nvme_tcp_queue_request() is an example where queue->more_requests gets
> > read without the send_mutex? queue_work_on() is conditionalized on this.
> >
> > Apologies in advance if I'm misunderstanding the code.
> >
> > On Thu, Sep 1, 2022 at 1:02 PM Sagi Grimberg <sagi@grimberg.me> wrote:
> > >
> > >
> > > > Question: queue->more_requests is updated under a lock. However, not
> > > > all readers of queue->more_requests acquire the lock. The smp_id/smp_id2
> > > > values seem to show that the same queue was being processed on
> > > > different CPUs. Is this a race between the kworker and inline
> > > > submission?
> > >
> > > The entire send path in io_work is wrapped with the send_mutex. Which
> > > part is not protected with the lock?
>
> Okay, maybe some progress?! What appears to be happening is that
> the request marked "last" is being picked up via nvme_tcp_send_all() when
> processing the previous command. Consider another trace:
>
>                  CPU
> tar-2263    [004] .....   680.671184: nvme_tcp_queue_rq: nvme1: qid=5
> tag=5e op=0 data_len=16384
> tar-2263    [004] .....   680.671185: nvme_tcp_cmd_pdu: nvme1: qid=5
> tag=5e op=0 page_offset=1352 send_len=72
> tar-2263    [004] .....   680.671186: nvme_tcp_queue_rq: nvme1: qid=5
> tag=5d op=0 data_len=32768
> tar-2263    [004] .....   680.671186: nvme_tcp_cmd_pdu: nvme1: qid=5
> tag=5d op=0 page_offset=1424 send_len=72
> tar-2263    [004] .....   680.671186: nvme_tcp_queue_rq: nvme1: qid=5
> tag=5c op=0 data_len=32768
> tar-2263    [004] .....   680.671187: nvme_tcp_cmd_pdu: nvme1: qid=5
> tag=5c op=0 page_offset=1496 send_len=72
> tar-2263    [004] .N...   680.671188: nvme_tcp_queue_rq: nvme1: qid=5
> tag=5b op=0 data_len=32768
> tar-2263    [004] .N...   680.671189: nvme_tcp_cmd_pdu: nvme1: qid=5
> tag=5b op=0 page_offset=1568 send_len=72
> kworker/4:0H-38 [004] .....   680.671198: nvme_tcp_queue_rq: nvme1:
> qid=5 tag=5a op=0 data_len=16384
> tar-2263        [000] .....   680.671199: nvme_tcp_cmd_pdu: nvme1:
> qid=5 tag=5a op=0 page_offset=1640 send_len=72
>
> I've added some additional state to the request to print out on
> timeout. First, queue->more_requests now tracks the command tag
> that asserted more_requests. Second, I recorded the number of commands
> issued via nvme_tcp_send_all(). As you can see, when tag 0x5b was
> issued, it ended up sending 0x5a as well. More_requests was asserted
> by 0x5b when nvme_tcp_try_send_cmd_pdu() processed 0x5a, so we never
> actually pushed it out onto the wire.
>
> [  709.457051] nvme nvme1: state: tag 0x5e io_cpu 4 smp_id 4 smp_id2 4 sync 1
>  empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> q_send_list 0
>  q_req_list 0 q_more_requests 0x5e req_sent 1
> [  709.414327] nvme nvme1: state: tag 0x5d io_cpu 4 smp_id 4 smp_id2 4 sync 1
>  empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> q_send_list 0
>  q_req_list 0 q_more_requests 0x5d req_sent 1
> [  709.371665] nvme nvme1: state: tag 0x5c io_cpu 4 smp_id 4 smp_id2 4 sync 1
>  empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> q_send_list 0
>  q_req_list 0 q_more_requests 0x5c req_sent 1
> [  709.312298] nvme nvme1: state: tag 0x5b io_cpu 4 smp_id 4 smp_id2 0 sync 1
>  empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> q_send_list 0
>  q_req_list 0 q_more_requests 0x5b req_sent 2 !!!
> [  709.283534] nvme nvme1: state: tag 0x5a io_cpu 4 smp_id 4 smp_id2 4 sync 1
>  empty 0 lock 0 last 1 more 1 send 1 send_ret 72 send_flags 0x28040
> q_send_list 0
>  q_req_list 0 q_more_requests 0x5b !!! req_sent 0

I'm not quite sure of all of the mechanics of command submission... but it
looks like, if we are preempted in nvme_tcp_queue_request(), subsequent commands
pile up on the req_list. When we are rescheduled, we drain the req_list with
more_requests still set, which prevents us from pushing the bits onto the wire.

Note the "N" (need-resched) flag in the traces, and the fact that the
starting CPU "smp_id" differs from the ending CPU "smp_id2" for
the associated tag.
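
To make the suspected interleaving concrete, here is the direct-send path
in nvme_tcp_queue_request() as it stands in these kernels (the code matches
the driver; the annotations are my interpretation of the 0x5b/0x5a case,
not proof):
--
	if (queue->io_cpu == raw_smp_processor_id() &&
	    sync && empty && mutex_trylock(&queue->send_mutex)) {
		queue->more_requests = !last;	/* 0x5b: last=0, so this stays true */
		nvme_tcp_send_all(queue);	/* preempted in here; 0x5a is queued to
						 * req_list and drained by this same call,
						 * but nvme_tcp_queue_more() still sees
						 * more_requests, so 0x5a's PDU is sent with
						 * MSG_MORE|MSG_SENDPAGE_NOTLAST (together
						 * with MSG_DONTWAIT this is the send_flags
						 * 0x28040 in the dump) and is left sitting
						 * in the socket */
		queue->more_requests = false;
		mutex_unlock(&queue->send_mutex);
	}
--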


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: nvme-tcp: i/o errors and stalled host from failure to send command pdus
  2022-09-02 13:47                         ` Jonathan Nicklin
@ 2022-09-05  8:49                           ` Sagi Grimberg
  2022-09-05 14:54                             ` Jonathan Nicklin
  0 siblings, 1 reply; 21+ messages in thread
From: Sagi Grimberg @ 2022-09-05  8:49 UTC (permalink / raw)
  To: Jonathan Nicklin; +Cc: linux-nvme



On 9/2/22 16:47, Jonathan Nicklin wrote:
> On Thu, Sep 1, 2022 at 6:56 PM Jonathan Nicklin
> <jnicklin@blockbridge.com> wrote:
>>
>> On Thu, Sep 1, 2022 at 1:46 PM Jonathan Nicklin
>> <jnicklin@blockbridge.com> wrote:
>>>
>>> nvme_tcp_queue_more() reads queue->more_requests. So, I suppose
>>> nvme_tcp_queue_request() is an example where queue->more_requests gets
>>> read without the send_mutex? queue_work_on() is conditionalized on this.
>>>
>>> Apologies in advance if I'm misunderstanding the code.
>>>
>>> On Thu, Sep 1, 2022 at 1:02 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>>>>
>>>>
>>>>> Question: queue->more_requests is updated under a lock. However, not
>>>>> all readers of queue->more_requests acquire the lock. The smp_id/smp_id2
>>>>> values seem to show that the same queue was being processed on
>>>>> different CPUs. Is this a race between the kworker and inline
>>>>> submission?
>>>>
>>>> The entire send path in io_work is wrapped with the send_mutex. Which
>>>> part is not protected with the lock?
>>
>> Okay, maybe some progress?! What appears to be happening is that
>> the request marked "last" is being picked up via nvme_tcp_send_all() when
>> processing the previous command. Consider another trace:
>>
>>                   CPU
>> tar-2263    [004] .....   680.671184: nvme_tcp_queue_rq: nvme1: qid=5
>> tag=5e op=0 data_len=16384
>> tar-2263    [004] .....   680.671185: nvme_tcp_cmd_pdu: nvme1: qid=5
>> tag=5e op=0 page_offset=1352 send_len=72
>> tar-2263    [004] .....   680.671186: nvme_tcp_queue_rq: nvme1: qid=5
>> tag=5d op=0 data_len=32768
>> tar-2263    [004] .....   680.671186: nvme_tcp_cmd_pdu: nvme1: qid=5
>> tag=5d op=0 page_offset=1424 send_len=72
>> tar-2263    [004] .....   680.671186: nvme_tcp_queue_rq: nvme1: qid=5
>> tag=5c op=0 data_len=32768
>> tar-2263    [004] .....   680.671187: nvme_tcp_cmd_pdu: nvme1: qid=5
>> tag=5c op=0 page_offset=1496 send_len=72
>> tar-2263    [004] .N...   680.671188: nvme_tcp_queue_rq: nvme1: qid=5
>> tag=5b op=0 data_len=32768
>> tar-2263    [004] .N...   680.671189: nvme_tcp_cmd_pdu: nvme1: qid=5
>> tag=5b op=0 page_offset=1568 send_len=72
>> kworker/4:0H-38 [004] .....   680.671198: nvme_tcp_queue_rq: nvme1:
>> qid=5 tag=5a op=0 data_len=16384
>> tar-2263        [000] .....   680.671199: nvme_tcp_cmd_pdu: nvme1:
>> qid=5 tag=5a op=0 page_offset=1640 send_len=72
>>
>> I've added some additional state to the request to print out on
>> timeout. First, queue->more_requests now tracks the command tag
>> that asserted more_requests. Second, I recorded the number of commands
>> issued via nvme_tcp_send_all(). As you can see, when tag 0x5b was
>> issued, it ended up sending 0x5a as well. More_requests was asserted
>> by 0x5b when nvme_tcp_try_send_cmd_pdu() processed 0x5a, so we never
>> actually pushed it out onto the wire.
>>
>> [  709.457051] nvme nvme1: state: tag 0x5e io_cpu 4 smp_id 4 smp_id2 4 sync 1
>>   empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
>> q_send_list 0

What is the difference between smp_id and smp_id2? Where is each
taken?

>>   q_req_list 0 q_more_requests 0x5e req_sent 1
>> [  709.414327] nvme nvme1: state: tag 0x5d io_cpu 4 smp_id 4 smp_id2 4 sync 1
>>   empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
>> q_send_list 0
>>   q_req_list 0 q_more_requests 0x5d req_sent 1
>> [  709.371665] nvme nvme1: state: tag 0x5c io_cpu 4 smp_id 4 smp_id2 4 sync 1
>>   empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
>> q_send_list 0
>>   q_req_list 0 q_more_requests 0x5c req_sent 1
>> [  709.312298] nvme nvme1: state: tag 0x5b io_cpu 4 smp_id 4 smp_id2 0 sync 1
>>   empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
>> q_send_list 0
>>   q_req_list 0 q_more_requests 0x5b req_sent 2 !!!
>> [  709.283534] nvme nvme1: state: tag 0x5a io_cpu 4 smp_id 4 smp_id2 4 sync 1
>>   empty 0 lock 0 last 1 more 1 send 1 send_ret 72 send_flags 0x28040
>> q_send_list 0
>>   q_req_list 0 q_more_requests 0x5b !!! req_sent 0
> 
> I'm not quite sure of all of the mechanics of command submission... but it
> looks like if we are preempted in nvme_tcp_queue_request(), subsequent commands
> pile up on the req_list. When rescheduled, we drain the req_list with
> more_requests set, which prevents us from pushing the bits onto the wire.
> 
> Take notice of the "N" flag in the traces and the fact that the
> starting cpu "smp_id" is different from the ending cpu "smp_id2" for
> the associated tag.

According to the trace, 0x5a came after 0x5b, and 0x5a had last=1, which
means it should have cleared NOLAST and pushed everything down the wire;
however, NOLAST was still set.

What happens here is that 0x5b sets queue->more_requests=true for the
lifetime of its execution, and that is wrong.

To prove the theory, can you try this patch?
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 3e3ebde4eff5..644def671df4 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -121,7 +121,6 @@ struct nvme_tcp_queue {
         struct mutex            send_mutex;
         struct llist_head       req_list;
         struct list_head        send_list;
-       bool                    more_requests;

         /* recv state */
         void                    *pdu;
@@ -320,7 +319,7 @@ static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
  static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue)
  {
         return !list_empty(&queue->send_list) ||
-               !llist_empty(&queue->req_list) || queue->more_requests;
+               !llist_empty(&queue->req_list);
  }

  static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
@@ -339,9 +338,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
          */
         if (queue->io_cpu == raw_smp_processor_id() &&
             sync && empty && mutex_trylock(&queue->send_mutex)) {
-               queue->more_requests = !last;
                 nvme_tcp_send_all(queue);
-               queue->more_requests = false;
                 mutex_unlock(&queue->send_mutex);
         }
--

The downside here is that commands that are queued in a batch will no
longer have NOLAST set, which means that they will be sent directly
to the wire...

I'll need to think about the correct way to optimize for this case;
however, there is an argument that they might as well just be sent
asap (i.e. give up batching).

Alternatively, perhaps remember this for the lifetime of the command;
however, this needs careful attention because network sends are also
driven by R2T PDUs sent by the controller...
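
Just to illustrate the direction (not a proposal): the per-command variant
could look roughly like the sketch below, where req->more is a hypothetical
field recorded at queue time; it ignores the inline-data case and the
R2T-driven sends mentioned above, so treat it strictly as a sketch:
--
/* sketch only: derive cmd PDU flags from a hypothetical per-request hint */
static int nvme_tcp_cmd_pdu_flags(struct nvme_tcp_queue *queue,
				  struct nvme_tcp_request *req)
{
	int flags = MSG_DONTWAIT;

	if (req->more ||			/* hypothetical: set to !last at queue time */
	    !llist_empty(&queue->req_list) ||
	    !list_empty(&queue->send_list))
		flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
	else
		flags |= MSG_EOR;

	return flags;
}
--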


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: nvme-tcp: i/o errors and stalled host from failure to send command pdus
  2022-09-05  8:49                           ` Sagi Grimberg
@ 2022-09-05 14:54                             ` Jonathan Nicklin
  2022-09-05 15:06                               ` Sagi Grimberg
  0 siblings, 1 reply; 21+ messages in thread
From: Jonathan Nicklin @ 2022-09-05 14:54 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme

On Mon, Sep 5, 2022 at 4:50 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
>
> On 9/2/22 16:47, Jonathan Nicklin wrote:
> > On Thu, Sep 1, 2022 at 6:56 PM Jonathan Nicklin
> > <jnicklin@blockbridge.com> wrote:
> >>
> >> On Thu, Sep 1, 2022 at 1:46 PM Jonathan Nicklin
> >> <jnicklin@blockbridge.com> wrote:
> >>>
> >>> nvme_tcp_queue_more() reads queue->more_requests. So, I suppose
> >>> nvme_tcp_queue_request() is an example where queue->more_requests gets
> >>> read without the send_mutex? queue_work_on() is conditionalized on this.
> >>>
> >>> Apologies in advance if I'm misunderstanding the code.
> >>>
> >>> On Thu, Sep 1, 2022 at 1:02 PM Sagi Grimberg <sagi@grimberg.me> wrote:
> >>>>
> >>>>
> >>>>> Question: queue->more_requests is updated under a lock. However, not
> >>>>> all readers of queue->more_requests acquire the lock. The smp_id/smp_id2
> >>>>> values seem to show that the same queue was being processed on
> >>>>> different CPUs. Is this a race between the kworker and inline
> >>>>> submission?
> >>>>
> >>>> The entire send path in io_work is wrapped with the send_mutex. Which
> >>>> part is not protected with the lock?
> >>
> >> Okay, maybe some progress?! What appears to be happening is that
> >> the request marked "last" is being picked up via nvme_tcp_send_all() when
> >> processing the previous command. Consider another trace:
> >>
> >>                   CPU
> >> tar-2263    [004] .....   680.671184: nvme_tcp_queue_rq: nvme1: qid=5
> >> tag=5e op=0 data_len=16384
> >> tar-2263    [004] .....   680.671185: nvme_tcp_cmd_pdu: nvme1: qid=5
> >> tag=5e op=0 page_offset=1352 send_len=72
> >> tar-2263    [004] .....   680.671186: nvme_tcp_queue_rq: nvme1: qid=5
> >> tag=5d op=0 data_len=32768
> >> tar-2263    [004] .....   680.671186: nvme_tcp_cmd_pdu: nvme1: qid=5
> >> tag=5d op=0 page_offset=1424 send_len=72
> >> tar-2263    [004] .....   680.671186: nvme_tcp_queue_rq: nvme1: qid=5
> >> tag=5c op=0 data_len=32768
> >> tar-2263    [004] .....   680.671187: nvme_tcp_cmd_pdu: nvme1: qid=5
> >> tag=5c op=0 page_offset=1496 send_len=72
> >> tar-2263    [004] .N...   680.671188: nvme_tcp_queue_rq: nvme1: qid=5
> >> tag=5b op=0 data_len=32768
> >> tar-2263    [004] .N...   680.671189: nvme_tcp_cmd_pdu: nvme1: qid=5
> >> tag=5b op=0 page_offset=1568 send_len=72
> >> kworker/4:0H-38 [004] .....   680.671198: nvme_tcp_queue_rq: nvme1:
> >> qid=5 tag=5a op=0 data_len=16384
> >> tar-2263        [000] .....   680.671199: nvme_tcp_cmd_pdu: nvme1:
> >> qid=5 tag=5a op=0 page_offset=1640 send_len=72
> >>
> >> I've added some additional state to the request to print out on
> >> timeout. First, queue->more_requests now tracks the command tag
> >> that asserted more_requests. Second, I recorded the number of commands
> >> issued via nvme_tcp_send_all(). As you can see, when tag 0x5b was
> >> issued, it ended up sending 0x5a as well. More_requests was asserted
> >> by 0x5b when nvme_tcp_try_send_cmd_pdu() processed 0x5a, so we never
> >> actually pushed it out onto the wire.
> >>
> >> [  709.457051] nvme nvme1: state: tag 0x5e io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >>   empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> >> q_send_list 0
>
> What is the difference between smp_id and smp_id2? where is each
> taken?

smp_id2 is taken immediately after the direct send case (after unlock). It just
confirms that we were preempted and rescheduled.

smp_id = raw_smp_processor_id();                /* CPU on entry */
if (queue->io_cpu == smp_id && sync && empty) {
        lock = mutex_trylock(&queue->send_mutex);
        if (lock) {
                nvme_tcp_send_all(queue, &sent);  /* instrumented to count sends */
                mutex_unlock(&queue->send_mutex);
        }
}
smp_id2 = raw_smp_processor_id();               /* CPU after the inline send attempt */

>
> >>   q_req_list 0 q_more_requests 0x5e req_sent 1
> >> [  709.414327] nvme nvme1: state: tag 0x5d io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >>   empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> >> q_send_list 0
> >>   q_req_list 0 q_more_requests 0x5d req_sent 1
> >> [  709.371665] nvme nvme1: state: tag 0x5c io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >>   empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> >> q_send_list 0
> >>   q_req_list 0 q_more_requests 0x5c req_sent 1
> >> [  709.312298] nvme nvme1: state: tag 0x5b io_cpu 4 smp_id 4 smp_id2 0 sync 1
> >>   empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> >> q_send_list 0
> >>   q_req_list 0 q_more_requests 0x5b req_sent 2 !!!
> >> [  709.283534] nvme nvme1: state: tag 0x5a io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >>   empty 0 lock 0 last 1 more 1 send 1 send_ret 72 send_flags 0x28040
> >> q_send_list 0
> >>   q_req_list 0 q_more_requests 0x5b !!! req_sent 0
> >
> > I'm not quite sure of all of the mechanics of command submission... but it
> > looks like if we are preempted in nvme_tcp_queue_request(), subsequent commands
> > pile up on the req_list. When rescheduled, we drain the req_list with
> > more_requests set, which prevents us from pushing the bits onto the wire.
> >
> > Take notice of the "N" flag in the traces and the fact that the
> > starting cpu "smp_id" is different from the ending cpu "smp_id2" for
> > the associated tag.
>
> According to the trace, 0x5a came after 0x5b, and 0x5a was last=1, which
> means it should have cleared NOLAST and push it all down the wire,
> however it had it set.
>
> What happens here is that 0x5b execution sets queue->more_requests=true
> for the lifetime of its execution and that is wrong.
>
> To prove the theory can you try this patch?
> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 3e3ebde4eff5..644def671df4 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -121,7 +121,6 @@ struct nvme_tcp_queue {
>          struct mutex            send_mutex;
>          struct llist_head       req_list;
>          struct list_head        send_list;
> -       bool                    more_requests;
>
>          /* recv state */
>          void                    *pdu;
> @@ -320,7 +319,7 @@ static inline void nvme_tcp_send_all(struct
> nvme_tcp_queue *queue)
>   static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue)
>   {
>          return !list_empty(&queue->send_list) ||
> -               !llist_empty(&queue->req_list) || queue->more_requests;
> +               !llist_empty(&queue->req_list);
>   }
>
>   static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
> @@ -339,9 +338,7 @@ static inline void nvme_tcp_queue_request(struct
> nvme_tcp_request *req,
>           */
>          if (queue->io_cpu == raw_smp_processor_id() &&
>              sync && empty && mutex_trylock(&queue->send_mutex)) {
> -               queue->more_requests = !last;
>                  nvme_tcp_send_all(queue);
> -               queue->more_requests = false;
>                  mutex_unlock(&queue->send_mutex);
>          }
> --
>
> The downside here, is that commands now that are queued in a batch,
> will not have NOLAST set, which means that they will be sent directly
> to the wire...
>
> I'll need to think about the correct way to optimize for this case,
> however there is an argument that they might as well just be sent
> asap (i.e. give up batching).
>
> Alternatively, perhaps remember this in the lifetime of the command
> however this needs careful attention because network sends are also
> driven from r2t sent by the controller...

I can confirm that ignoring queue->more_requests resolves the issue. I've run
about 300 iterations without a failure. Pre-fix, the problem reproduced almost
immediately (within 1-3 runs).


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: nvme-tcp: i/o errors and stalled host from failure to send command pdus
  2022-09-05 14:54                             ` Jonathan Nicklin
@ 2022-09-05 15:06                               ` Sagi Grimberg
  2022-09-06 13:36                                 ` Jonathan Nicklin
  0 siblings, 1 reply; 21+ messages in thread
From: Sagi Grimberg @ 2022-09-05 15:06 UTC (permalink / raw)
  To: Jonathan Nicklin; +Cc: linux-nvme



On 9/5/22 17:54, Jonathan Nicklin wrote:
> On Mon, Sep 5, 2022 at 4:50 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>>
>>
>>
>> On 9/2/22 16:47, Jonathan Nicklin wrote:
>>> On Thu, Sep 1, 2022 at 6:56 PM Jonathan Nicklin
>>> <jnicklin@blockbridge.com> wrote:
>>>>
>>>> On Thu, Sep 1, 2022 at 1:46 PM Jonathan Nicklin
>>>> <jnicklin@blockbridge.com> wrote:
>>>>>
>>>>> nvme_tcp_queue_more() reads queue->more_requests. So, I suppose
>>>>> nvme_tcp_queue_request() is an example where queue->more_requests gets
>>>>> read without the send_mutex? queue_work_on() is conditionalized on this.
>>>>>
>>>>> Apologies in advance if I'm misunderstanding the code.
>>>>>
>>>>> On Thu, Sep 1, 2022 at 1:02 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>>>>>>
>>>>>>
>>>>>>> Question: queue->more_requests is updated under a lock. However, not
>>>>>>> all readers of queue->more_requests acquire the lock. The smp_id/smp_id2
>>>>>>> values seem to show that the same queue was being processed on
>>>>>>> different CPUs. Is this a race between the kworker and inline
>>>>>>> submission?
>>>>>>
>>>>>> The entire send path in io_work is wrapped with the send_mutex. Which
>>>>>> part is not protected with the lock?
>>>>
>>>> Okay, maybe some progress?! What appears to be happening is that
>>>> the request marked "last" is being picked up via nvme_tcp_send_all() when
>>>> processing the previous command. Consider another trace:
>>>>
>>>>                    CPU
>>>> tar-2263    [004] .....   680.671184: nvme_tcp_queue_rq: nvme1: qid=5
>>>> tag=5e op=0 data_len=16384
>>>> tar-2263    [004] .....   680.671185: nvme_tcp_cmd_pdu: nvme1: qid=5
>>>> tag=5e op=0 page_offset=1352 send_len=72
>>>> tar-2263    [004] .....   680.671186: nvme_tcp_queue_rq: nvme1: qid=5
>>>> tag=5d op=0 data_len=32768
>>>> tar-2263    [004] .....   680.671186: nvme_tcp_cmd_pdu: nvme1: qid=5
>>>> tag=5d op=0 page_offset=1424 send_len=72
>>>> tar-2263    [004] .....   680.671186: nvme_tcp_queue_rq: nvme1: qid=5
>>>> tag=5c op=0 data_len=32768
>>>> tar-2263    [004] .....   680.671187: nvme_tcp_cmd_pdu: nvme1: qid=5
>>>> tag=5c op=0 page_offset=1496 send_len=72
>>>> tar-2263    [004] .N...   680.671188: nvme_tcp_queue_rq: nvme1: qid=5
>>>> tag=5b op=0 data_len=32768
>>>> tar-2263    [004] .N...   680.671189: nvme_tcp_cmd_pdu: nvme1: qid=5
>>>> tag=5b op=0 page_offset=1568 send_len=72
>>>> kworker/4:0H-38 [004] .....   680.671198: nvme_tcp_queue_rq: nvme1:
>>>> qid=5 tag=5a op=0 data_len=16384
>>>> tar-2263        [000] .....   680.671199: nvme_tcp_cmd_pdu: nvme1:
>>>> qid=5 tag=5a op=0 page_offset=1640 send_len=72
>>>>
>>>> I've added some additional state to the request to print out on
>>>> timeout. First, queue->more_requests now tracks the command tag
>>>> that asserted more_requests. Second, I recorded the number of commands
>>>> issued via nvme_tcp_send_all(). As you can see, when tag 0x5b was
>>>> issued, it ended up sending 0x5a as well. More_requests was asserted
>>>> by 0x5b when nvme_tcp_try_send_cmd_pdu() processed 0x5a, so we never
>>>> actually pushed it out onto the wire.
>>>>
>>>> [  709.457051] nvme nvme1: state: tag 0x5e io_cpu 4 smp_id 4 smp_id2 4 sync 1
>>>>    empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
>>>> q_send_list 0
>>
>> What is the difference between smp_id and smp_id2? where is each
>> taken?
> 
> smp_id2 is taken immediately after the direct send case (after unlock). It just
> confirms that we were preempted and rescheduled.
> 
> smp_id = raw_smp_processor_id();
> if (queue->io_cpu == smp_id && sync && empty) {
>          lock = mutex_trylock(&queue->send_mutex);
>          if (lock) {
>              nvme_tcp_send_all(queue, &sent);
>              mutex_unlock(&queue->send_mutex);
>          }
> }
> smp_id2 = raw_smp_processor_id();

Got it.

> 
>>
>>>>    q_req_list 0 q_more_requests 0x5e req_sent 1
>>>> [  709.414327] nvme nvme1: state: tag 0x5d io_cpu 4 smp_id 4 smp_id2 4 sync 1
>>>>    empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
>>>> q_send_list 0
>>>>    q_req_list 0 q_more_requests 0x5d req_sent 1
>>>> [  709.371665] nvme nvme1: state: tag 0x5c io_cpu 4 smp_id 4 smp_id2 4 sync 1
>>>>    empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
>>>> q_send_list 0
>>>>    q_req_list 0 q_more_requests 0x5c req_sent 1
>>>> [  709.312298] nvme nvme1: state: tag 0x5b io_cpu 4 smp_id 4 smp_id2 0 sync 1
>>>>    empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
>>>> q_send_list 0
>>>>    q_req_list 0 q_more_requests 0x5b req_sent 2 !!!
>>>> [  709.283534] nvme nvme1: state: tag 0x5a io_cpu 4 smp_id 4 smp_id2 4 sync 1
>>>>    empty 0 lock 0 last 1 more 1 send 1 send_ret 72 send_flags 0x28040
>>>> q_send_list 0
>>>>    q_req_list 0 q_more_requests 0x5b !!! req_sent 0
>>>
>>> I'm not quite sure of all of the mechanics of command submission... but it
>>> looks like if we are preempted in nvme_tcp_queue_request(), subsequent commands
>>> pile up on the req_list. When rescheduled, we drain the req_list with
>>> more_requests set, which prevents us from pushing the bits onto the wire.
>>>
>>> Take notice of the "N" flag in the traces and the fact that the
>>> starting cpu "smp_id" is different from the ending cpu "smp_id2" for
>>> the associated tag.
>>
>> According to the trace, 0x5a came after 0x5b, and 0x5a was last=1, which
>> means it should have cleared NOLAST and push it all down the wire,
>> however it had it set.
>>
>> What happens here is that 0x5b execution sets queue->more_requests=true
>> for the lifetime of its execution and that is wrong.
>>
>> To prove the theory can you try this patch?
>> --
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 3e3ebde4eff5..644def671df4 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -121,7 +121,6 @@ struct nvme_tcp_queue {
>>           struct mutex            send_mutex;
>>           struct llist_head       req_list;
>>           struct list_head        send_list;
>> -       bool                    more_requests;
>>
>>           /* recv state */
>>           void                    *pdu;
>> @@ -320,7 +319,7 @@ static inline void nvme_tcp_send_all(struct
>> nvme_tcp_queue *queue)
>>    static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue)
>>    {
>>           return !list_empty(&queue->send_list) ||
>> -               !llist_empty(&queue->req_list) || queue->more_requests;
>> +               !llist_empty(&queue->req_list);
>>    }
>>
>>    static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>> @@ -339,9 +338,7 @@ static inline void nvme_tcp_queue_request(struct
>> nvme_tcp_request *req,
>>            */
>>           if (queue->io_cpu == raw_smp_processor_id() &&
>>               sync && empty && mutex_trylock(&queue->send_mutex)) {
>> -               queue->more_requests = !last;
>>                   nvme_tcp_send_all(queue);
>> -               queue->more_requests = false;
>>                   mutex_unlock(&queue->send_mutex);
>>           }
>> --
>>
>> The downside here, is that commands now that are queued in a batch,
>> will not have NOLAST set, which means that they will be sent directly
>> to the wire...
>>
>> I'll need to think about the correct way to optimize for this case,
>> however there is an argument that they might as well just be sent
>> asap (i.e. give up batching).
>>
>> Alternatively, perhaps remember this in the lifetime of the command
>> however this needs careful attention because network sends are also
>> driven from r2t sent by the controller...
> 
> I can confirm that ignoring queue->more_requests resolves the issue. I've run
> about 300 iterations without issue. Pre-fix, the problem reproduced almost
> immediately (1-3 runs).

Thanks a lot, Jonathan, for reporting this issue and helping to resolve
it. I appreciate the effort.

I'll send a patch; can you reply with your Tested-by tag?
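
That is, a reply with a line like the one below (using your address from
this thread) that I can fold into the patch:

  Tested-by: Jonathan Nicklin <jnicklin@blockbridge.com>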


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: nvme-tcp: i/o errors and stalled host from failure to send command pdus
  2022-09-05 15:06                               ` Sagi Grimberg
@ 2022-09-06 13:36                                 ` Jonathan Nicklin
  2022-09-07  7:55                                   ` Sagi Grimberg
  0 siblings, 1 reply; 21+ messages in thread
From: Jonathan Nicklin @ 2022-09-06 13:36 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme

Not a problem. Is it possible to get this fix backported to the last few
longterm kernels? It seems like anyone running nvme/tcp could benefit from
it.

On Mon, Sep 5, 2022 at 11:06 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
>
> On 9/5/22 17:54, Jonathan Nicklin wrote:
> > On Mon, Sep 5, 2022 at 4:50 AM Sagi Grimberg <sagi@grimberg.me> wrote:
> >>
> >>
> >>
> >> On 9/2/22 16:47, Jonathan Nicklin wrote:
> >>> On Thu, Sep 1, 2022 at 6:56 PM Jonathan Nicklin
> >>> <jnicklin@blockbridge.com> wrote:
> >>>>
> >>>> On Thu, Sep 1, 2022 at 1:46 PM Jonathan Nicklin
> >>>> <jnicklin@blockbridge.com> wrote:
> >>>>>
> >>>>> nvme_tcp_queue_more() reads queue->more_requests. So, I suppose
> >>>>> nvme_tcp_queue_request() is an example where queue->more_requests gets
> >>>>> read without the send_mutex? queue_work_on() is conditionalized on this.
> >>>>>
> >>>>> Apologies in advance if I'm misunderstanding the code.
> >>>>>
> >>>>> On Thu, Sep 1, 2022 at 1:02 PM Sagi Grimberg <sagi@grimberg.me> wrote:
> >>>>>>
> >>>>>>
> >>>>>>> Question: queue->more_requests is updated under a lock. However, not
> >>>>>>> all readers of queue->more_requests acquire the lock. The smp_id/smp_id2
> >>>>>>> values seem to show that the same queue was being processed on
> >>>>>>> different CPUs. Is this a race between the kworker and inline
> >>>>>>> submission?
> >>>>>>
> >>>>>> The entire send path in io_work is wrapped with the send_mutex. Which
> >>>>>> part is not protected with the lock?
> >>>>
> >>>> Okay, maybe some progress?! What appears to be happening is that
> >>>> the request marked "last" is being picked up via nvme_tcp_send_all() when
> >>>> processing the previous command. Consider another trace:
> >>>>
> >>>>                    CPU
> >>>> tar-2263    [004] .....   680.671184: nvme_tcp_queue_rq: nvme1: qid=5
> >>>> tag=5e op=0 data_len=16384
> >>>> tar-2263    [004] .....   680.671185: nvme_tcp_cmd_pdu: nvme1: qid=5
> >>>> tag=5e op=0 page_offset=1352 send_len=72
> >>>> tar-2263    [004] .....   680.671186: nvme_tcp_queue_rq: nvme1: qid=5
> >>>> tag=5d op=0 data_len=32768
> >>>> tar-2263    [004] .....   680.671186: nvme_tcp_cmd_pdu: nvme1: qid=5
> >>>> tag=5d op=0 page_offset=1424 send_len=72
> >>>> tar-2263    [004] .....   680.671186: nvme_tcp_queue_rq: nvme1: qid=5
> >>>> tag=5c op=0 data_len=32768
> >>>> tar-2263    [004] .....   680.671187: nvme_tcp_cmd_pdu: nvme1: qid=5
> >>>> tag=5c op=0 page_offset=1496 send_len=72
> >>>> tar-2263    [004] .N...   680.671188: nvme_tcp_queue_rq: nvme1: qid=5
> >>>> tag=5b op=0 data_len=32768
> >>>> tar-2263    [004] .N...   680.671189: nvme_tcp_cmd_pdu: nvme1: qid=5
> >>>> tag=5b op=0 page_offset=1568 send_len=72
> >>>> kworker/4:0H-38 [004] .....   680.671198: nvme_tcp_queue_rq: nvme1:
> >>>> qid=5 tag=5a op=0 data_len=16384
> >>>> tar-2263        [000] .....   680.671199: nvme_tcp_cmd_pdu: nvme1:
> >>>> qid=5 tag=5a op=0 page_offset=1640 send_len=72
> >>>>
> >>>> I've added some additional state to the request to print out on
> >>>> timeout. First, queue->more_requests now tracks the command tag
> >>>> that asserted more_requests. Second, I recorded the number of commands
> >>>> issued via nvme_tcp_send_all(). As you can see, when tag 0x5b was
> >>>> issued, it ended up sending 0x5a as well. More_requests was asserted
> >>>> by 0x5b when nvme_tcp_try_send_cmd_pdu() processed 0x5a, so we never
> >>>> actually pushed it out onto the wire.
> >>>>
> >>>> [  709.457051] nvme nvme1: state: tag 0x5e io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >>>>    empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> >>>> q_send_list 0
> >>
> >> What is the difference between smp_id and smp_id2? where is each
> >> taken?
> >
> > smp_id2 is taken immediately after the direct send case (after unlock). It just
> > confirms that we were preempted and rescheduled.
> >
> > smp_id = raw_smp_processor_id();
> > if (queue->io_cpu == smp_id && sync && empty) {
> >          lock = mutex_trylock(&queue->send_mutex);
> >          if (lock) {
> >              nvme_tcp_send_all(queue, &sent);
> >              mutex_unlock(&queue->send_mutex);
> >          }
> > }
> > smp_id2 = raw_smp_processor_id();
>
> Got it.
>
> >
> >>
> >>>>    q_req_list 0 q_more_requests 0x5e req_sent 1
> >>>> [  709.414327] nvme nvme1: state: tag 0x5d io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >>>>    empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> >>>> q_send_list 0
> >>>>    q_req_list 0 q_more_requests 0x5d req_sent 1
> >>>> [  709.371665] nvme nvme1: state: tag 0x5c io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >>>>    empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> >>>> q_send_list 0
> >>>>    q_req_list 0 q_more_requests 0x5c req_sent 1
> >>>> [  709.312298] nvme nvme1: state: tag 0x5b io_cpu 4 smp_id 4 smp_id2 0 sync 1
> >>>>    empty 1 lock 1 last 0 more 0 send 1 send_ret 72 send_flags 0x28040
> >>>> q_send_list 0
> >>>>    q_req_list 0 q_more_requests 0x5b req_sent 2 !!!
> >>>> [  709.283534] nvme nvme1: state: tag 0x5a io_cpu 4 smp_id 4 smp_id2 4 sync 1
> >>>>    empty 0 lock 0 last 1 more 1 send 1 send_ret 72 send_flags 0x28040
> >>>> q_send_list 0
> >>>>    q_req_list 0 q_more_requests 0x5b !!! req_sent 0
> >>>
> >>> I'm not quite sure of all of the mechanics of command submission... but it
> >>> looks like if we are preempted in nvme_tcp_queue_request(), subsequent commands
> >>> pile up on the req_list. When rescheduled, we drain the req_list with
> >>> more_requests set, which prevents us from pushing the bits onto the wire.
> >>>
> >>> Take notice of the "N" flag in the traces and the fact that the
> >>> starting cpu "smp_id" is different from the ending cpu "smp_id2" for
> >>> the associated tag.
> >>
> >> According to the trace, 0x5a came after 0x5b, and 0x5a was last=1, which
> >> means it should have cleared NOLAST and push it all down the wire,
> >> however it had it set.
> >>
> >> What happens here is that 0x5b execution sets queue->more_requests=true
> >> for the lifetime of its execution and that is wrong.
> >>
> >> To prove the theory can you try this patch?
> >> --
> >> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> >> index 3e3ebde4eff5..644def671df4 100644
> >> --- a/drivers/nvme/host/tcp.c
> >> +++ b/drivers/nvme/host/tcp.c
> >> @@ -121,7 +121,6 @@ struct nvme_tcp_queue {
> >>           struct mutex            send_mutex;
> >>           struct llist_head       req_list;
> >>           struct list_head        send_list;
> >> -       bool                    more_requests;
> >>
> >>           /* recv state */
> >>           void                    *pdu;
> >> @@ -320,7 +319,7 @@ static inline void nvme_tcp_send_all(struct
> >> nvme_tcp_queue *queue)
> >>    static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue)
> >>    {
> >>           return !list_empty(&queue->send_list) ||
> >> -               !llist_empty(&queue->req_list) || queue->more_requests;
> >> +               !llist_empty(&queue->req_list);
> >>    }
> >>
> >>    static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
> >> @@ -339,9 +338,7 @@ static inline void nvme_tcp_queue_request(struct
> >> nvme_tcp_request *req,
> >>            */
> >>           if (queue->io_cpu == raw_smp_processor_id() &&
> >>               sync && empty && mutex_trylock(&queue->send_mutex)) {
> >> -               queue->more_requests = !last;
> >>                   nvme_tcp_send_all(queue);
> >> -               queue->more_requests = false;
> >>                   mutex_unlock(&queue->send_mutex);
> >>           }
> >> --
> >>
> >> The downside here, is that commands now that are queued in a batch,
> >> will not have NOLAST set, which means that they will be sent directly
> >> to the wire...
> >>
> >> I'll need to think about the correct way to optimize for this case,
> >> however there is an argument that they might as well just be sent
> >> asap (i.e. give up batching).
> >>
> >> Alternatively, perhaps remember this in the lifetime of the command
> >> however this needs careful attention because network sends are also
> >> driven from r2t sent by the controller...
> >
> > I can confirm that ignoring queue->more_requests resolves the issue. I've run
> > about 300 iterations without issue. Pre-fix, the problem reproduced almost
> > immediately (1-3 runs).
>
> Thanks a lot Jonathan for reporting this issue and helping to resolve
> it. Appreciate the effort.
>
> I'll send a patch, can reply with your Tested-by tag?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: nvme-tcp: i/o errors and stalled host from failure to send command pdus
  2022-09-06 13:36                                 ` Jonathan Nicklin
@ 2022-09-07  7:55                                   ` Sagi Grimberg
  0 siblings, 0 replies; 21+ messages in thread
From: Sagi Grimberg @ 2022-09-07  7:55 UTC (permalink / raw)
  To: Jonathan Nicklin; +Cc: linux-nvme


> Not a problem. Is it possible to get this fix committed back into the last few
> longterm kernels? It seems like anyone running nvme/tcp could benefit from
> this fix.

The patch has a Fixes tag, and the title clarifies that it fixes a
regression, so it should be picked up for stable kernels...
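
For reference, the stable flow keys off a trailer of this form in the patch
(the hash and subject here are placeholders, not the actual commit); an
explicit "Cc: stable@vger.kernel.org" tag can also be added to request the
backport directly:

  Fixes: 123456789abc ("nvme-tcp: commit that introduced the regression")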


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2022-09-07  8:07 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
2022-08-30 23:10 nvme-tcp: i/o errors and stalled host from failure to send command pdus Jonathan Nicklin
2022-08-31  9:02 ` Sagi Grimberg
2022-08-31 16:04   ` Jonathan Nicklin
2022-08-31 18:50     ` Sagi Grimberg
2022-08-31 19:06       ` Jonathan Nicklin
2022-08-31 20:04         ` Jonathan Nicklin
2022-08-31 23:42           ` Sagi Grimberg
2022-09-01 12:37             ` Jonathan Nicklin
2022-09-01 13:22               ` Jonathan Nicklin
2022-09-01 15:37                 ` Jonathan Nicklin
2022-09-01 17:02                   ` Sagi Grimberg
2022-09-01 17:46                     ` Jonathan Nicklin
2022-09-01 22:56                       ` Jonathan Nicklin
2022-09-02 13:47                         ` Jonathan Nicklin
2022-09-05  8:49                           ` Sagi Grimberg
2022-09-05 14:54                             ` Jonathan Nicklin
2022-09-05 15:06                               ` Sagi Grimberg
2022-09-06 13:36                                 ` Jonathan Nicklin
2022-09-07  7:55                                   ` Sagi Grimberg
2022-09-01 17:00               ` Sagi Grimberg
2022-09-01 17:39                 ` Jonathan Nicklin
