qemu-devel.nongnu.org archive mirror
* completion timeouts with pin-based interrupts in QEMU hw/nvme
@ 2023-01-12 13:10 Klaus Jensen
  2023-01-12 14:26 ` Guenter Roeck
                   ` (5 more replies)
  0 siblings, 6 replies; 35+ messages in thread
From: Klaus Jensen @ 2023-01-12 13:10 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme
  Cc: qemu-block, qemu-devel, Guenter Roeck


Hi all (linux-nvme, qemu-devel, maintainers),

On QEMU riscv64, which does not use MSI/MSI-X and thus relies on
pin-based interrupts, I'm seeing occasional completion timeouts, i.e.

  nvme nvme0: I/O 333 QID 1 timeout, completion polled
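
For reference, this is with an invocation along these lines (paraphrased
from my setup; the exact arguments should not matter much, but note that
the nvme device uses drive id 'd0', which shows up in the trace below):

	qemu-system-riscv64 -M virt -m 1G \
	    -drive file=rootfs.img,if=none,id=d0 \
	    -device nvme,drive=d0,serial=deadbeef \
	    ...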

To rule out issues with shadow doorbells (which have been a source of
frustration in the past), those are disabled. FWIW I'm also seeing the
issue with shadow doorbells.

	diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
	index f25cc2c235e9..28d8e7f4b56c 100644
	--- a/hw/nvme/ctrl.c
	+++ b/hw/nvme/ctrl.c
	@@ -7407,7 +7407,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
	     id->mdts = n->params.mdts;
	     id->ver = cpu_to_le32(NVME_SPEC_VER);
	     id->oacs =
	-        cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT | NVME_OACS_DBBUF);
	+        cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT);
	     id->cntrltype = 0x1;

	     /*


I captured a trace from QEMU when this happens:

pci_nvme_mmio_write addr 0x1008 data 0x4e size 4
pci_nvme_mmio_doorbell_sq sqid 1 new_tail 78
pci_nvme_io_cmd cid 4428 nsid 0x1 sqid 1 opc 0x2 opname 'NVME_NVM_CMD_READ'
pci_nvme_read cid 4428 nsid 1 nlb 32 count 16384 lba 0x1324
pci_nvme_map_prp trans_len 4096 len 16384 prp1 0x80aca000 prp2 0x82474100 num_prps 5
pci_nvme_map_addr addr 0x80aca000 len 4096
pci_nvme_map_addr addr 0x80ac9000 len 4096
pci_nvme_map_addr addr 0x80ac8000 len 4096
pci_nvme_map_addr addr 0x80ac7000 len 4096
pci_nvme_io_cmd cid 4429 nsid 0x1 sqid 1 opc 0x2 opname 'NVME_NVM_CMD_READ'
pci_nvme_read cid 4429 nsid 1 nlb 224 count 114688 lba 0x1242
pci_nvme_map_prp trans_len 4096 len 114688 prp1 0x80ae6000 prp2 0x82474000 num_prps 29
pci_nvme_map_addr addr 0x80ae6000 len 4096
pci_nvme_map_addr addr 0x80ae5000 len 4096
pci_nvme_map_addr addr 0x80ae4000 len 4096
pci_nvme_map_addr addr 0x80ae3000 len 4096
pci_nvme_map_addr addr 0x80ae2000 len 4096
pci_nvme_map_addr addr 0x80ae1000 len 4096
pci_nvme_map_addr addr 0x80ae0000 len 4096
pci_nvme_map_addr addr 0x80adf000 len 4096
pci_nvme_map_addr addr 0x80ade000 len 4096
pci_nvme_map_addr addr 0x80add000 len 4096
pci_nvme_map_addr addr 0x80adc000 len 4096
pci_nvme_map_addr addr 0x80adb000 len 4096
pci_nvme_map_addr addr 0x80ada000 len 4096
pci_nvme_map_addr addr 0x80ad9000 len 4096
pci_nvme_map_addr addr 0x80ad8000 len 4096
pci_nvme_map_addr addr 0x80ad7000 len 4096
pci_nvme_map_addr addr 0x80ad6000 len 4096
pci_nvme_map_addr addr 0x80ad5000 len 4096
pci_nvme_map_addr addr 0x80ad4000 len 4096
pci_nvme_map_addr addr 0x80ad3000 len 4096
pci_nvme_map_addr addr 0x80ad2000 len 4096
pci_nvme_map_addr addr 0x80ad1000 len 4096
pci_nvme_map_addr addr 0x80ad0000 len 4096
pci_nvme_map_addr addr 0x80acf000 len 4096
pci_nvme_map_addr addr 0x80ace000 len 4096
pci_nvme_map_addr addr 0x80acd000 len 4096
pci_nvme_map_addr addr 0x80acc000 len 4096
pci_nvme_map_addr addr 0x80acb000 len 4096
pci_nvme_rw_cb cid 4428 blk 'd0'
pci_nvme_rw_complete_cb cid 4428 blk 'd0'
pci_nvme_enqueue_req_completion cid 4428 cqid 1 dw0 0x0 dw1 0x0 status 0x0
[1]: pci_nvme_irq_pin pulsing IRQ pin
pci_nvme_rw_cb cid 4429 blk 'd0'
pci_nvme_rw_complete_cb cid 4429 blk 'd0'
pci_nvme_enqueue_req_completion cid 4429 cqid 1 dw0 0x0 dw1 0x0 status 0x0
[2]: pci_nvme_irq_pin pulsing IRQ pin
[3]: pci_nvme_mmio_write addr 0x100c data 0x4d size 4
[4]: pci_nvme_mmio_doorbell_cq cqid 1 new_head 77
---- TIMEOUT HERE (30s) ---
[5]: pci_nvme_mmio_read addr 0x1c size 4
[6]: pci_nvme_mmio_write addr 0x100c data 0x4e size 4
[7]: pci_nvme_mmio_doorbell_cq cqid 1 new_head 78
--- Interrupt deasserted (cq->tail == cq->head)
[   31.757821] nvme nvme0: I/O 333 QID 1 timeout, completion polled

Following the timeout, everything returns to "normal" and the device
and driver happily continue.

The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
am wondering if there is something going on with the kernel driver (but
I certainly do not rule out that hw/nvme is at fault here, since
pin-based interrupts have also been a source of several issues in the
past).

What I'm thinking is that following the interrupt in [1], the driver
picks up the completion for cid 4428 but does not find cid 4429 in the
queue since it has not been posted yet. Before getting a cq head
doorbell write (which would cause the pin to be deasserted), the device
posts the completion for cid 4429, which just keeps the interrupt
asserted in [2]. The trace then shows the cq head doorbell update in
[3,4] for cid 4428, and then we hit the timeout since the driver is not
aware that cid 4429 has been posted in the meantime (why is it not aware
of this?). Timing out, the driver then polls the queue, notices cid
4429, and updates the cq head doorbell in [5-7], causing the device to
deassert the interrupt, and we are "back in shape".
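
Spelling that out against the trace (my reading of it; the cq head/tail
values are inferred from the doorbell writes, and the device only
deasserts the pin once cq->head catches up with cq->tail):

	event                               cq->tail  cq->head  pin
	before [1]                             76        76     deasserted
	[1] CQE for cid 4428 posted            77        76     asserted
	[2] CQE for cid 4429 posted            78        76     asserted (no new edge)
	[3,4] driver writes cq head 77         78        77     still asserted
	[5-7] timeout poll, cq head 78         78        78     deasserted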

I'm observing this on 6.0 kernels and v6.2-rc3 (have not tested <6.0).
Tested on QEMU v7.0.0 (to rule out all the shadow doorbell
optimizations) as well as QEMU nvme-next (infradead). In other words,
it's not a recent regression in either project and potentially it has
always been like this. I've not tested other platforms for now, but I
would assume others using pin-based interrupts would observe the same.

Any ideas on how to shed any light on this issue from the kernel side of
things?



* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-12 13:10 completion timeouts with pin-based interrupts in QEMU hw/nvme Klaus Jensen
@ 2023-01-12 14:26 ` Guenter Roeck
  2023-01-12 16:34 ` Keith Busch
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 35+ messages in thread
From: Guenter Roeck @ 2023-01-12 14:26 UTC (permalink / raw)
  To: Klaus Jensen, Keith Busch, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, linux-nvme
  Cc: qemu-block, qemu-devel

On 1/12/23 05:10, Klaus Jensen wrote:
> Hi all (linux-nvme, qemu-devel, maintainers),
> 
> On QEMU riscv64, which does not use MSI/MSI-X and thus relies on
> pin-based interrupts, I'm seeing occasional completion timeouts, i.e.
> 
>    nvme nvme0: I/O 333 QID 1 timeout, completion polled
> 
> To rule out issues with shadow doorbells (which have been a source of
> frustration in the past), those are disabled. FWIW I'm also seeing the
> issue with shadow doorbells.
> 
> 	diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> 	index f25cc2c235e9..28d8e7f4b56c 100644
> 	--- a/hw/nvme/ctrl.c
> 	+++ b/hw/nvme/ctrl.c
> 	@@ -7407,7 +7407,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
> 	     id->mdts = n->params.mdts;
> 	     id->ver = cpu_to_le32(NVME_SPEC_VER);
> 	     id->oacs =
> 	-        cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT | NVME_OACS_DBBUF);
> 	+        cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT);
> 	     id->cntrltype = 0x1;
> 
> 	     /*
> 
> 
> I captured a trace from QEMU when this happens:
> 
> pci_nvme_mmio_write addr 0x1008 data 0x4e size 4
> pci_nvme_mmio_doorbell_sq sqid 1 new_tail 78
> pci_nvme_io_cmd cid 4428 nsid 0x1 sqid 1 opc 0x2 opname 'NVME_NVM_CMD_READ'
> pci_nvme_read cid 4428 nsid 1 nlb 32 count 16384 lba 0x1324
> pci_nvme_map_prp trans_len 4096 len 16384 prp1 0x80aca000 prp2 0x82474100 num_prps 5
> pci_nvme_map_addr addr 0x80aca000 len 4096
> pci_nvme_map_addr addr 0x80ac9000 len 4096
> pci_nvme_map_addr addr 0x80ac8000 len 4096
> pci_nvme_map_addr addr 0x80ac7000 len 4096
> pci_nvme_io_cmd cid 4429 nsid 0x1 sqid 1 opc 0x2 opname 'NVME_NVM_CMD_READ'
> pci_nvme_read cid 4429 nsid 1 nlb 224 count 114688 lba 0x1242
> pci_nvme_map_prp trans_len 4096 len 114688 prp1 0x80ae6000 prp2 0x82474000 num_prps 29
> pci_nvme_map_addr addr 0x80ae6000 len 4096
> pci_nvme_map_addr addr 0x80ae5000 len 4096
> pci_nvme_map_addr addr 0x80ae4000 len 4096
> pci_nvme_map_addr addr 0x80ae3000 len 4096
> pci_nvme_map_addr addr 0x80ae2000 len 4096
> pci_nvme_map_addr addr 0x80ae1000 len 4096
> pci_nvme_map_addr addr 0x80ae0000 len 4096
> pci_nvme_map_addr addr 0x80adf000 len 4096
> pci_nvme_map_addr addr 0x80ade000 len 4096
> pci_nvme_map_addr addr 0x80add000 len 4096
> pci_nvme_map_addr addr 0x80adc000 len 4096
> pci_nvme_map_addr addr 0x80adb000 len 4096
> pci_nvme_map_addr addr 0x80ada000 len 4096
> pci_nvme_map_addr addr 0x80ad9000 len 4096
> pci_nvme_map_addr addr 0x80ad8000 len 4096
> pci_nvme_map_addr addr 0x80ad7000 len 4096
> pci_nvme_map_addr addr 0x80ad6000 len 4096
> pci_nvme_map_addr addr 0x80ad5000 len 4096
> pci_nvme_map_addr addr 0x80ad4000 len 4096
> pci_nvme_map_addr addr 0x80ad3000 len 4096
> pci_nvme_map_addr addr 0x80ad2000 len 4096
> pci_nvme_map_addr addr 0x80ad1000 len 4096
> pci_nvme_map_addr addr 0x80ad0000 len 4096
> pci_nvme_map_addr addr 0x80acf000 len 4096
> pci_nvme_map_addr addr 0x80ace000 len 4096
> pci_nvme_map_addr addr 0x80acd000 len 4096
> pci_nvme_map_addr addr 0x80acc000 len 4096
> pci_nvme_map_addr addr 0x80acb000 len 4096
> pci_nvme_rw_cb cid 4428 blk 'd0'
> pci_nvme_rw_complete_cb cid 4428 blk 'd0'
> pci_nvme_enqueue_req_completion cid 4428 cqid 1 dw0 0x0 dw1 0x0 status 0x0
> [1]: pci_nvme_irq_pin pulsing IRQ pin
> pci_nvme_rw_cb cid 4429 blk 'd0'
> pci_nvme_rw_complete_cb cid 4429 blk 'd0'
> pci_nvme_enqueue_req_completion cid 4429 cqid 1 dw0 0x0 dw1 0x0 status 0x0
> [2]: pci_nvme_irq_pin pulsing IRQ pin
> [3]: pci_nvme_mmio_write addr 0x100c data 0x4d size 4
> [4]: pci_nvme_mmio_doorbell_cq cqid 1 new_head 77
> ---- TIMEOUT HERE (30s) ---
> [5]: pci_nvme_mmio_read addr 0x1c size 4
> [6]: pci_nvme_mmio_write addr 0x100c data 0x4e size 4
> [7]: pci_nvme_mmio_doorbell_cq cqid 1 new_head 78
> --- Interrupt deasserted (cq->tail == cq->head)
> [   31.757821] nvme nvme0: I/O 333 QID 1 timeout, completion polled
> 
> Following the timeout, everything returns to "normal" and the device
> and driver happily continue.
> 
> The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
> am wondering if there is something going on with the kernel driver (but
> I certainly do not rule out that hw/nvme is at fault here, since
> pin-based interrupts have also been a source of several issues in the
> past).
> 
> What I'm thinking is that following the interrupt in [1], the driver
> picks up the completion for cid 4428 but does not find cid 4429 in the
> queue since it has not been posted yet. Before getting a cq head
> doorbell write (which would cause the pin to be deasserted), the device
> posts the completion for cid 4429, which just keeps the interrupt
> asserted in [2]. The trace then shows the cq head doorbell update in
> [3,4] for cid 4428, and then we hit the timeout since the driver is not
> aware that cid 4429 has been posted in the meantime (why is it not aware
> of this?). Timing out, the driver then polls the queue, notices cid
> 4429, and updates the cq head doorbell in [5-7], causing the device to
> deassert the interrupt, and we are "back in shape".
> 
> I'm observing this on 6.0 kernels and v6.2-rc3 (have not tested <6.0).
> Tested on QEMU v7.0.0 (to rule out all the shadow doorbell
> optimizations) as well as QEMU nvme-next (infradead). In other words,
> it's not a recent regression in either project and potentially it has
> always been like this. I've not tested other platforms for now, but I
> would assume others using pin-based interrupts would observe the same.
> 
> Any ideas on how to shed any light on this issue from the kernel side of
> things?


I have no idea what causes the problem, but I have seen this forever in
my testing, and with various architectures. Some architectures are
affected more than others; for example, I don't test nvme boots on mips
platforms because the problem is quite notorious there.

Guenter




* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-12 13:10 completion timeouts with pin-based interrupts in QEMU hw/nvme Klaus Jensen
  2023-01-12 14:26 ` Guenter Roeck
@ 2023-01-12 16:34 ` Keith Busch
  2023-01-12 17:45   ` Klaus Jensen
  2023-01-13  8:54 ` Klaus Jensen
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 35+ messages in thread
From: Keith Busch @ 2023-01-12 16:34 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	qemu-block, qemu-devel, Guenter Roeck

On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:
> 
> The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
> am wondering if there is something going on with the kernel driver (but
> I certainly do not rule out that hw/nvme is at fault here, since
> pin-based interrupts have also been a source of several issues in the
> past).

Does it work if you change the pci_irq_assert() back to pci_irq_pulse()?
While probably not the "correct" thing to do, it has better results in
my testing.



* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-12 16:34 ` Keith Busch
@ 2023-01-12 17:45   ` Klaus Jensen
  2023-01-12 17:51     ` Keith Busch
                       ` (3 more replies)
  0 siblings, 4 replies; 35+ messages in thread
From: Klaus Jensen @ 2023-01-12 17:45 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	qemu-block, qemu-devel, Guenter Roeck


On Jan 12 09:34, Keith Busch wrote:
> On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:
> > 
> > The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
> > am wondering if there is something going on with the kernel driver (but
> > I certainly do not rule out that hw/nvme is at fault here, since
> > pin-based interrupts have also been a source of several issues in the
> > past).
> 
> Does it work if you change the pci_irq_assert() back to pci_irq_pulse()?
> While probably not the "correct" thing to do, it has better results in
> my testing.
> 

A simple s/pci_irq_assert/pci_irq_pulse/ broke the device. However,

	diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c
	index 03760ddeae8c..0fc46dcb9ec4 100644
	--- i/hw/nvme/ctrl.c
	+++ w/hw/nvme/ctrl.c
	@@ -477,6 +477,7 @@ static void nvme_irq_check(NvmeCtrl *n)
		 return;
	     }
	     if (~intms & n->irq_status) {
	+        pci_irq_deassert(&n->parent_obj);
		 pci_irq_assert(&n->parent_obj);
	     } else {
		 pci_irq_deassert(&n->parent_obj);


seems to do the trick (pulse is the other way around, assert, then
deassert).
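
For reference, the helper in include/hw/pci/pci.h is just a trivial
inline (quoting from memory, so modulo whitespace):

	static inline void pci_irq_pulse(PCIDevice *pci_dev)
	{
	    pci_irq_assert(pci_dev);
	    pci_irq_deassert(pci_dev);
	}

i.e. a pulse leaves the line deasserted, whereas the diff above leaves
it asserted.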

Probably not the "correct" thing to do, but I'll take it since it seems
to fix it. On a simple boot loop I hit the timeout in about 1 out of 5
runs; I'm ~20 runs in now and have not encountered it.

I'll see if I can set up a mips rootfs and test that. Guenter, what MIPS
machine/board(s) are you testing?



* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-12 17:45   ` Klaus Jensen
@ 2023-01-12 17:51     ` Keith Busch
  2023-01-12 19:14     ` Guenter Roeck
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 35+ messages in thread
From: Keith Busch @ 2023-01-12 17:51 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	qemu-block, qemu-devel, Guenter Roeck

On Thu, Jan 12, 2023 at 06:45:55PM +0100, Klaus Jensen wrote:
> On Jan 12 09:34, Keith Busch wrote:
> > On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:
> > > 
> > > The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
> > > am wondering if there is something going on with the kernel driver (but
> > > I certainly do not rule out that hw/nvme is at fault here, since
> > > pin-based interrupts have also been a source of several issues in the
> > > past).
> > 
> > Does it work if you change the pci_irq_assert() back to pci_irq_pulse()?
> > While probably not the "correct" thing to do, it has better results in
> > my testing.
> > 
> 
> A simple s/pci_irq_assert/pci_irq_pulse/ broke the device. However,
> 
> 	diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c
> 	index 03760ddeae8c..0fc46dcb9ec4 100644
> 	--- i/hw/nvme/ctrl.c
> 	+++ w/hw/nvme/ctrl.c
> 	@@ -477,6 +477,7 @@ static void nvme_irq_check(NvmeCtrl *n)
> 		 return;
> 	     }
> 	     if (~intms & n->irq_status) {
> 	+        pci_irq_deassert(&n->parent_obj);
> 		 pci_irq_assert(&n->parent_obj);
> 	     } else {
> 		 pci_irq_deassert(&n->parent_obj);
> 
> 
> seems to do the trick (pulse is the other way around, assert, then
> deassert).
> 
> Probably not the "correct" thing to do, but I'll take it since it seems
> to fix it. On a simple boot loop I hit the timeout in about 1 out of 5
> runs; I'm ~20 runs in now and have not encountered it.

Yep, that looks good. It's sounding like something with the CPU irq
handling is treating the level interrupt as edge triggered.



* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-12 17:45   ` Klaus Jensen
  2023-01-12 17:51     ` Keith Busch
@ 2023-01-12 19:14     ` Guenter Roeck
  2023-01-12 19:27     ` Keith Busch
  2023-01-12 23:57     ` Guenter Roeck
  3 siblings, 0 replies; 35+ messages in thread
From: Guenter Roeck @ 2023-01-12 19:14 UTC (permalink / raw)
  To: Klaus Jensen, Keith Busch
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	qemu-block, qemu-devel

On 1/12/23 09:45, Klaus Jensen wrote:
> On Jan 12 09:34, Keith Busch wrote:
>> On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:
>>>
>>> The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
>>> am wondering if there is something going on with the kernel driver (but
>>> I certainly do not rule out that hw/nvme is at fault here, since
>>> pin-based interrupts have also been a source of several issues in the
>>> past).
>>
>> Does it work if you change the pci_irq_assert() back to pci_irq_pulse()?
>> While probably not the "correct" thing to do, it has better results in
>> my testing.
>>
> 
> A simple s/pci_irq_assert/pci_irq_pulse/ broke the device. However,
> 
> 	diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c
> 	index 03760ddeae8c..0fc46dcb9ec4 100644
> 	--- i/hw/nvme/ctrl.c
> 	+++ w/hw/nvme/ctrl.c
> 	@@ -477,6 +477,7 @@ static void nvme_irq_check(NvmeCtrl *n)
> 		 return;
> 	     }
> 	     if (~intms & n->irq_status) {
> 	+        pci_irq_deassert(&n->parent_obj);
> 		 pci_irq_assert(&n->parent_obj);
> 	     } else {
> 		 pci_irq_deassert(&n->parent_obj);
> 
> 
> seems to do the trick (pulse is the other way around, assert, then
> deassert).
> 
> Probably not the "correct" thing to do, but I'll take it since it seems
> to fix it. On a simple boot loop I hit the timeout in about 1 out of 5
> runs; I'm ~20 runs in now and have not encountered it.
> 
> I'll see if I can set up a mips rootfs and test that. Guenter, what MIPS
> machine/board(s) are you testing?

See
https://github.com/groeck/linux-build-test/blob/master/rootfs/mips/run-qemu-mips.sh
and
https://github.com/groeck/linux-build-test/blob/master/rootfs/mipsel/run-qemu-mipsel.sh

I'll apply the change suggested above to my version of qemu (7.2.0)
and give it a try.

Guenter




* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-12 17:45   ` Klaus Jensen
  2023-01-12 17:51     ` Keith Busch
  2023-01-12 19:14     ` Guenter Roeck
@ 2023-01-12 19:27     ` Keith Busch
  2023-01-13  0:27       ` Guenter Roeck
  2023-01-12 23:57     ` Guenter Roeck
  3 siblings, 1 reply; 35+ messages in thread
From: Keith Busch @ 2023-01-12 19:27 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	qemu-block, qemu-devel, Guenter Roeck

On Thu, Jan 12, 2023 at 06:45:55PM +0100, Klaus Jensen wrote:
> On Jan 12 09:34, Keith Busch wrote:
> > On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:
> > > 
> > > The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
> > > am wondering if there is something going on with the kernel driver (but
> > > I certainly do not rule out that hw/nvme is at fault here, since
> > > pin-based interrupts have also been a source of several issues in the
> > > past).
> > 
> > Does it work if you change the pci_irq_assert() back to pci_irq_pulse()?
> > While probably not the "correct" thing to do, it has better results in
> > my testing.
> > 
> 
> A simple s/pci_irq_assert/pci_irq_pulse/ broke the device. However,
> 
> 	diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c
> 	index 03760ddeae8c..0fc46dcb9ec4 100644
> 	--- i/hw/nvme/ctrl.c
> 	+++ w/hw/nvme/ctrl.c
> 	@@ -477,6 +477,7 @@ static void nvme_irq_check(NvmeCtrl *n)
> 		 return;
> 	     }
> 	     if (~intms & n->irq_status) {
> 	+        pci_irq_deassert(&n->parent_obj);
> 		 pci_irq_assert(&n->parent_obj);
> 	     } else {
> 		 pci_irq_deassert(&n->parent_obj);
> 
> 
> seems to do the trick (pulse is the other way around, assert, then
> deassert).
> 
> Probably not the "correct" thing to do, but I'll take it since it seems
> to fix it. On a simple boot loop I hit the timeout in about 1 out of 5
> runs; I'm ~20 runs in now and have not encountered it.
> 
> I'll see if I can set up a mips rootfs and test that. Guenter, what MIPS
> machine/board(s) are you testing?

Could you try the below?

---
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 2c85de4700..521c3c80c1 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -558,6 +558,20 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue *cq)
     }
 }
 
+static void nvme_irq_pulse(NvmeCtrl *n, NvmeCQueue *cq)
+{
+    if (!cq->irq_enabled) {
+        return;
+    }
+
+    if (msix_enabled(&(n->parent_obj))) {
+        msix_notify(&(n->parent_obj), cq->vector);
+        return;
+    }
+
+    pci_irq_pulse(&n->parent_obj);
+}
+
 static void nvme_req_clear(NvmeRequest *req)
 {
     req->ns = NULL;
@@ -6917,6 +6931,12 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
             }
 
             nvme_irq_deassert(n, cq);
+        } else {
+            /*
+             * Retrigger the irq just to make sure the host has no excuse for
+             * not knowing there's more work to complete on this CQ.
+             */
+            nvme_irq_pulse(n, cq);
         }
     } else {
         /* Submission queue doorbell write */
--



* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-12 17:45   ` Klaus Jensen
                       ` (2 preceding siblings ...)
  2023-01-12 19:27     ` Keith Busch
@ 2023-01-12 23:57     ` Guenter Roeck
  3 siblings, 0 replies; 35+ messages in thread
From: Guenter Roeck @ 2023-01-12 23:57 UTC (permalink / raw)
  To: Klaus Jensen, Keith Busch
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	qemu-block, qemu-devel

On 1/12/23 09:45, Klaus Jensen wrote:
> On Jan 12 09:34, Keith Busch wrote:
>> On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:
>>>
>>> The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
>>> am wondering if there is something going on with the kernel driver (but
>>> I certainly do not rule out that hw/nvme is at fault here, since
>>> pin-based interrupts have also been a source of several issues in the
>>> past).
>>
>> Does it work if you change the pci_irq_assert() back to pci_irq_pulse()?
>> While probably not the "correct" thing to do, it has better results in
>> my testing.
>>
> 
> A simple s/pci_irq_assert/pci_irq_pulse/ broke the device. However,
> 
> 	diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c
> 	index 03760ddeae8c..0fc46dcb9ec4 100644
> 	--- i/hw/nvme/ctrl.c
> 	+++ w/hw/nvme/ctrl.c
> 	@@ -477,6 +477,7 @@ static void nvme_irq_check(NvmeCtrl *n)
> 		 return;
> 	     }
> 	     if (~intms & n->irq_status) {
> 	+        pci_irq_deassert(&n->parent_obj);
> 		 pci_irq_assert(&n->parent_obj);
> 	     } else {
> 		 pci_irq_deassert(&n->parent_obj);
> 
> 
> seems to do the trick (pulse is the other way around, assert, then
> deassert).
> 
> Probably not the "correct" thing to do, but I'll take it since it seems
> to fix it. On a simple boot loop I hit the timeout in about 1 out of 5
> runs; I'm ~20 runs in now and have not encountered it.
> 
> I'll see if I can set up a mips rootfs and test that. Guenter, what MIPS
> machine/board(s) are you testing?

So, for mipsel, two sets of results for the above:

First, qemu v7.2 is already much better than qemu v7.1. With qemu v7.1,
the boot test fails roughly every other run. The failure rate with qemu
v7.2 is much lower.

Second, my nvme boot test with qemu 7.2 fails after ~5-10 iterations.
After the above change, I did not see a single failure in 50 boot tests.

I'll test the other suggested change next.

Guenter




* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-12 19:27     ` Keith Busch
@ 2023-01-13  0:27       ` Guenter Roeck
  0 siblings, 0 replies; 35+ messages in thread
From: Guenter Roeck @ 2023-01-13  0:27 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	qemu-block, qemu-devel

On 1/12/23 11:27, Keith Busch wrote:
> On Thu, Jan 12, 2023 at 06:45:55PM +0100, Klaus Jensen wrote:
>> On Jan 12 09:34, Keith Busch wrote:
>>> On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:
>>>>
>>>> The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
>>>> am wondering if there is something going on with the kernel driver (but
>>>> I certainly do not rule out that hw/nvme is at fault here, since
>>>> pin-based interrupts have also been a source of several issues in the
>>>> past).
>>>
>>> Does it work if you change the pci_irq_assert() back to pci_irq_pulse()?
>>> While probably not the "correct" thing to do, it has better results in
>>> my testing.
>>>
>>
>> A simple s/pci_irq_assert/pci_irq_pulse/ broke the device. However,
>>
>> 	diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c
>> 	index 03760ddeae8c..0fc46dcb9ec4 100644
>> 	--- i/hw/nvme/ctrl.c
>> 	+++ w/hw/nvme/ctrl.c
>> 	@@ -477,6 +477,7 @@ static void nvme_irq_check(NvmeCtrl *n)
>> 		 return;
>> 	     }
>> 	     if (~intms & n->irq_status) {
>> 	+        pci_irq_deassert(&n->parent_obj);
>> 		 pci_irq_assert(&n->parent_obj);
>> 	     } else {
>> 		 pci_irq_deassert(&n->parent_obj);
>>
>>
>> seems to do the trick (pulse is the other way around, assert, then
>> deassert).
>>
>> Probably not the "correct" thing to do, but I'll take it since it seems
>> to fix it. On a simple boot loop I hit the timeout in about 1 out of 5
>> runs; I'm ~20 runs in now and have not encountered it.
>>
>> I'll see if I can set up a mips rootfs and test that. Guenter, what MIPS
>> machine/board(s) are you testing?
> 
> Could you try the below?
> 

This works as well: 50 iterations with no failures after applying the patch
below on top of qemu v7.2.0. Tested with qemu-system-mipsel.

Guenter

> ---
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 2c85de4700..521c3c80c1 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -558,6 +558,20 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue *cq)
>       }
>   }
>   
> +static void nvme_irq_pulse(NvmeCtrl *n, NvmeCQueue *cq)
> +{
> +    if (!cq->irq_enabled) {
> +        return;
> +    }
> +
> +    if (msix_enabled(&(n->parent_obj))) {
> +        msix_notify(&(n->parent_obj), cq->vector);
> +        return;
> +    }
> +
> +    pci_irq_pulse(&n->parent_obj);
> +}
> +
>   static void nvme_req_clear(NvmeRequest *req)
>   {
>       req->ns = NULL;
> @@ -6917,6 +6931,12 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
>               }
>   
>               nvme_irq_deassert(n, cq);
> +        } else {
> +            /*
> +             * Retrigger the irq just to make sure the host has no excuse for
> +             * not knowing there's more work to complete on this CQ.
> +             */
> +            nvme_irq_pulse(n, cq);
>           }
>       } else {
>           /* Submission queue doorbell write */
> --




* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-12 13:10 completion timeouts with pin-based interrupts in QEMU hw/nvme Klaus Jensen
  2023-01-12 14:26 ` Guenter Roeck
  2023-01-12 16:34 ` Keith Busch
@ 2023-01-13  8:54 ` Klaus Jensen
  2023-01-13 12:32   ` Peter Maydell
  2023-01-16 21:14 ` Klaus Jensen
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 35+ messages in thread
From: Klaus Jensen @ 2023-01-13  8:54 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme
  Cc: qemu-block, qemu-devel, Guenter Roeck, Michael S. Tsirkin,
	Marcel Apfelbaum


+CC qemu pci maintainers

Michael, Marcel,

Do you have any comments on this thread? As you can see, one solution
is to simply deassert prior to asserting; the other is to reintroduce
pci_irq_pulse(). Both seem to solve the issue.

On Jan 12 14:10, Klaus Jensen wrote:
> Hi all (linux-nvme, qemu-devel, maintainers),
> 
> On QEMU riscv64, which does not use MSI/MSI-X and thus relies on
> pin-based interrupts, I'm seeing occasional completion timeouts, i.e.
> 
>   nvme nvme0: I/O 333 QID 1 timeout, completion polled
> 
> To rule out issues with shadow doorbells (which have been a source of
> frustration in the past), those are disabled. FWIW I'm also seeing the
> issue with shadow doorbells.
> 
> 	diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> 	index f25cc2c235e9..28d8e7f4b56c 100644
> 	--- a/hw/nvme/ctrl.c
> 	+++ b/hw/nvme/ctrl.c
> 	@@ -7407,7 +7407,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
> 	     id->mdts = n->params.mdts;
> 	     id->ver = cpu_to_le32(NVME_SPEC_VER);
> 	     id->oacs =
> 	-        cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT | NVME_OACS_DBBUF);
> 	+        cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT);
> 	     id->cntrltype = 0x1;
> 
> 	     /*
> 
> 
> I captured a trace from QEMU when this happens:
> 
> pci_nvme_mmio_write addr 0x1008 data 0x4e size 4
> pci_nvme_mmio_doorbell_sq sqid 1 new_tail 78
> pci_nvme_io_cmd cid 4428 nsid 0x1 sqid 1 opc 0x2 opname 'NVME_NVM_CMD_READ'
> pci_nvme_read cid 4428 nsid 1 nlb 32 count 16384 lba 0x1324
> pci_nvme_map_prp trans_len 4096 len 16384 prp1 0x80aca000 prp2 0x82474100 num_prps 5
> pci_nvme_map_addr addr 0x80aca000 len 4096
> pci_nvme_map_addr addr 0x80ac9000 len 4096
> pci_nvme_map_addr addr 0x80ac8000 len 4096
> pci_nvme_map_addr addr 0x80ac7000 len 4096
> pci_nvme_io_cmd cid 4429 nsid 0x1 sqid 1 opc 0x2 opname 'NVME_NVM_CMD_READ'
> pci_nvme_read cid 4429 nsid 1 nlb 224 count 114688 lba 0x1242
> pci_nvme_map_prp trans_len 4096 len 114688 prp1 0x80ae6000 prp2 0x82474000 num_prps 29
> pci_nvme_map_addr addr 0x80ae6000 len 4096
> pci_nvme_map_addr addr 0x80ae5000 len 4096
> pci_nvme_map_addr addr 0x80ae4000 len 4096
> pci_nvme_map_addr addr 0x80ae3000 len 4096
> pci_nvme_map_addr addr 0x80ae2000 len 4096
> pci_nvme_map_addr addr 0x80ae1000 len 4096
> pci_nvme_map_addr addr 0x80ae0000 len 4096
> pci_nvme_map_addr addr 0x80adf000 len 4096
> pci_nvme_map_addr addr 0x80ade000 len 4096
> pci_nvme_map_addr addr 0x80add000 len 4096
> pci_nvme_map_addr addr 0x80adc000 len 4096
> pci_nvme_map_addr addr 0x80adb000 len 4096
> pci_nvme_map_addr addr 0x80ada000 len 4096
> pci_nvme_map_addr addr 0x80ad9000 len 4096
> pci_nvme_map_addr addr 0x80ad8000 len 4096
> pci_nvme_map_addr addr 0x80ad7000 len 4096
> pci_nvme_map_addr addr 0x80ad6000 len 4096
> pci_nvme_map_addr addr 0x80ad5000 len 4096
> pci_nvme_map_addr addr 0x80ad4000 len 4096
> pci_nvme_map_addr addr 0x80ad3000 len 4096
> pci_nvme_map_addr addr 0x80ad2000 len 4096
> pci_nvme_map_addr addr 0x80ad1000 len 4096
> pci_nvme_map_addr addr 0x80ad0000 len 4096
> pci_nvme_map_addr addr 0x80acf000 len 4096
> pci_nvme_map_addr addr 0x80ace000 len 4096
> pci_nvme_map_addr addr 0x80acd000 len 4096
> pci_nvme_map_addr addr 0x80acc000 len 4096
> pci_nvme_map_addr addr 0x80acb000 len 4096
> pci_nvme_rw_cb cid 4428 blk 'd0'
> pci_nvme_rw_complete_cb cid 4428 blk 'd0'
> pci_nvme_enqueue_req_completion cid 4428 cqid 1 dw0 0x0 dw1 0x0 status 0x0
> [1]: pci_nvme_irq_pin pulsing IRQ pin
> pci_nvme_rw_cb cid 4429 blk 'd0'
> pci_nvme_rw_complete_cb cid 4429 blk 'd0'
> pci_nvme_enqueue_req_completion cid 4429 cqid 1 dw0 0x0 dw1 0x0 status 0x0
> [2]: pci_nvme_irq_pin pulsing IRQ pin
> [3]: pci_nvme_mmio_write addr 0x100c data 0x4d size 4
> [4]: pci_nvme_mmio_doorbell_cq cqid 1 new_head 77
> ---- TIMEOUT HERE (30s) ---
> [5]: pci_nvme_mmio_read addr 0x1c size 4
> [6]: pci_nvme_mmio_write addr 0x100c data 0x4e size 4
> [7]: pci_nvme_mmio_doorbell_cq cqid 1 new_head 78
> --- Interrupt deasserted (cq->tail == cq->head)
> [   31.757821] nvme nvme0: I/O 333 QID 1 timeout, completion polled
> 
> Following the timeout, everything returns to "normal" and the device
> and driver happily continue.
> 
> The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
> am wondering if there is something going on with the kernel driver (but
> I certainly do not rule out that hw/nvme is at fault here, since
> pin-based interrupts have also been a source of several issues in the
> past).
> 
> What I'm thinking is that following the interrupt in [1], the driver
> picks up the completion for cid 4428 but does not find cid 4429 in the
> queue since it has not been posted yet. Before getting a cq head
> doorbell write (which would cause the pin to be deasserted), the device
> posts the completion for cid 4429, which just keeps the interrupt
> asserted in [2]. The trace then shows the cq head doorbell update in
> [3,4] for cid 4428, and then we hit the timeout since the driver is not
> aware that cid 4429 has been posted in the meantime (why is it not aware
> of this?). Timing out, the driver then polls the queue, notices cid
> 4429, and updates the cq head doorbell in [5-7], causing the device to
> deassert the interrupt, and we are "back in shape".
> 
> I'm observing this on 6.0 kernels and v6.2-rc3 (have not tested <6.0).
> Tested on QEMU v7.0.0 (to rule out all the shadow doorbell
> optimizations) as well as QEMU nvme-next (infradead). In other words,
> it's not a recent regression in either project and potentially it has
> always been like this. I've not tested other platforms for now, but I
> would assume others using pin-based interrupts would observe the same.
> 
> Any ideas on how to shed any light on this issue from the kernel side of
> things?



-- 
One of us - No more doubt, silence or taboo about mental illness.



* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-13  8:54 ` Klaus Jensen
@ 2023-01-13 12:32   ` Peter Maydell
  2023-01-13 12:37     ` Klaus Jensen
  2023-01-13 16:52     ` Keith Busch
  0 siblings, 2 replies; 35+ messages in thread
From: Peter Maydell @ 2023-01-13 12:32 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	linux-nvme, qemu-block, qemu-devel, Guenter Roeck,
	Michael S. Tsirkin, Marcel Apfelbaum

On Fri, 13 Jan 2023 at 08:55, Klaus Jensen <its@irrelevant.dk> wrote:
>
> +CC qemu pci maintainers
>
> Michael, Marcel,
>
> Do you have any comments on this thread? As you can see, one solution
> is to simply deassert prior to asserting; the other is to reintroduce
> pci_irq_pulse(). Both seem to solve the issue.

Both seem to be missing any analysis of "this is what is
happening, this is where we differ from hardware, this
is why this is the correct fix". We shouldn't put in
random "this seems to happen to cause the guest to boot"
fixes, please.

thanks
-- PMM



* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-13 12:32   ` Peter Maydell
@ 2023-01-13 12:37     ` Klaus Jensen
  2023-01-13 12:42       ` Peter Maydell
  2023-01-13 16:52     ` Keith Busch
  1 sibling, 1 reply; 35+ messages in thread
From: Klaus Jensen @ 2023-01-13 12:37 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	linux-nvme, qemu-block, qemu-devel, Guenter Roeck,
	Michael S. Tsirkin, Marcel Apfelbaum


On Jan 13 12:32, Peter Maydell wrote:
> On Fri, 13 Jan 2023 at 08:55, Klaus Jensen <its@irrelevant.dk> wrote:
> >
> > +CC qemu pci maintainers
> >
> > Michael, Marcel,
> >
> > Do you have any comments on this thread? As you can see, one solution
> > is to simply deassert prior to asserting; the other is to reintroduce
> > pci_irq_pulse(). Both seem to solve the issue.
> 
> Both seem to be missing any analysis of "this is what is
> happening, this is where we differ from hardware, this
> is why this is the correct fix". We shouldn't put in
> random "this seems to happen to cause the guest to boot"
> fixes, please.
> 

No, I'd like to get to the bottom of this, which is why I'm reaching out
to the pci maintainers to get an idea of whether this is a general/known
issue with pin-based interrupts on pci.

There are a fair number of uses of pci_irq_pulse() still left in the
tree.



* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-13 12:37     ` Klaus Jensen
@ 2023-01-13 12:42       ` Peter Maydell
  2023-01-13 12:46         ` Klaus Jensen
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Maydell @ 2023-01-13 12:42 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	linux-nvme, qemu-block, qemu-devel, Guenter Roeck,
	Michael S. Tsirkin, Marcel Apfelbaum

On Fri, 13 Jan 2023 at 12:37, Klaus Jensen <its@irrelevant.dk> wrote:
> There are a fair number of uses of pci_irq_pulse() still left in the
> tree.

Are there? I feel like I'm missing something here:
$ git grep pci_irq_pulse
include/hw/pci/pci.h:static inline void pci_irq_pulse(PCIDevice *pci_dev)
$

...looks at first sight like an unused function we could delete.

thanks
-- PMM



* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-13 12:42       ` Peter Maydell
@ 2023-01-13 12:46         ` Klaus Jensen
  0 siblings, 0 replies; 35+ messages in thread
From: Klaus Jensen @ 2023-01-13 12:46 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	linux-nvme, qemu-block, qemu-devel, Guenter Roeck,
	Michael S. Tsirkin, Marcel Apfelbaum


On Jan 13 12:42, Peter Maydell wrote:
> On Fri, 13 Jan 2023 at 12:37, Klaus Jensen <its@irrelevant.dk> wrote:
> > There are a fair number of uses of pci_irq_pulse() still left in the
> > tree.
> 
> Are there? I feel like I'm missing something here:
> $ git grep pci_irq_pulse
> include/hw/pci/pci.h:static inline void pci_irq_pulse(PCIDevice *pci_dev)
> $
> 
> ...looks at first sight like an unused function we could delete.
> 

Oh! My mistake! I grepped for irq_pulse which matched qemu_irq_pulse(),
doh.



* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-13 12:32   ` Peter Maydell
  2023-01-13 12:37     ` Klaus Jensen
@ 2023-01-13 16:52     ` Keith Busch
  1 sibling, 0 replies; 35+ messages in thread
From: Keith Busch @ 2023-01-13 16:52 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Klaus Jensen, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	linux-nvme, qemu-block, qemu-devel, Guenter Roeck,
	Michael S. Tsirkin, Marcel Apfelbaum

On Fri, Jan 13, 2023 at 12:32:29PM +0000, Peter Maydell wrote:
> On Fri, 13 Jan 2023 at 08:55, Klaus Jensen <its@irrelevant.dk> wrote:
> >
> > +CC qemu pci maintainers
> >
> > Michael, Marcel,
> >
> > Do you have any comments on this thread? As you can see, one solution
> > is to simply deassert prior to asserting; the other is to reintroduce
> > pci_irq_pulse(). Both seem to solve the issue.
> 
> Both seem to be missing any analysis of "this is what is
> happening, this is where we differ from hardware, this
> is why this is the correct fix". We shouldn't put in
> random "this seems to happen to cause the guest to boot"
> fixes, please.

It looks like these are being treated as edge- instead of
level-triggered interrupts, so the workaround is to create more edges.
definitely prefer a real fix, whether that's in the kernel or CPU
emulation or somewhere else, but I'm not sure I'll have time to see it
to completion.

FWIW, x86 seems to handle legacy IRQs with NVMe as expected. It's
actually easy enough for the DEASSERT to take so long that the kernel
reports the irq as "spurious" because it's spinning too often on the
level.



* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-12 13:10 completion timeouts with pin-based interrupts in QEMU hw/nvme Klaus Jensen
                   ` (2 preceding siblings ...)
  2023-01-13  8:54 ` Klaus Jensen
@ 2023-01-16 21:14 ` Klaus Jensen
  2023-01-17  4:58   ` Keith Busch
  2023-01-17 16:05   ` Guenter Roeck
  2023-01-18  4:17 ` Keith Busch
  2023-01-18 22:26 ` Keith Busch
  5 siblings, 2 replies; 35+ messages in thread
From: Klaus Jensen @ 2023-01-16 21:14 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme
  Cc: qemu-block, qemu-devel, Guenter Roeck


On Jan 12 14:10, Klaus Jensen wrote:
> Hi all (linux-nvme, qemu-devel, maintainers),
> 
> On QEMU riscv64, which does not use MSI/MSI-X and thus relies on
> pin-based interrupts, I'm seeing occasional completion timeouts, i.e.
> 
>   nvme nvme0: I/O 333 QID 1 timeout, completion polled
> 
> To rule out issues with shadow doorbells (which have been a source of
> frustration in the past), those are disabled. FWIW I'm also seeing the
> issue with shadow doorbells.
> 
> 	diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> 	index f25cc2c235e9..28d8e7f4b56c 100644
> 	--- a/hw/nvme/ctrl.c
> 	+++ b/hw/nvme/ctrl.c
> 	@@ -7407,7 +7407,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
> 	     id->mdts = n->params.mdts;
> 	     id->ver = cpu_to_le32(NVME_SPEC_VER);
> 	     id->oacs =
> 	-        cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT | NVME_OACS_DBBUF);
> 	+        cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT);
> 	     id->cntrltype = 0x1;
> 
> 	     /*
> 
> 
> I captured a trace from QEMU when this happens:
> 
> pci_nvme_mmio_write addr 0x1008 data 0x4e size 4
> pci_nvme_mmio_doorbell_sq sqid 1 new_tail 78
> pci_nvme_io_cmd cid 4428 nsid 0x1 sqid 1 opc 0x2 opname 'NVME_NVM_CMD_READ'
> pci_nvme_read cid 4428 nsid 1 nlb 32 count 16384 lba 0x1324
> pci_nvme_map_prp trans_len 4096 len 16384 prp1 0x80aca000 prp2 0x82474100 num_prps 5
> pci_nvme_map_addr addr 0x80aca000 len 4096
> pci_nvme_map_addr addr 0x80ac9000 len 4096
> pci_nvme_map_addr addr 0x80ac8000 len 4096
> pci_nvme_map_addr addr 0x80ac7000 len 4096
> pci_nvme_io_cmd cid 4429 nsid 0x1 sqid 1 opc 0x2 opname 'NVME_NVM_CMD_READ'
> pci_nvme_read cid 4429 nsid 1 nlb 224 count 114688 lba 0x1242
> pci_nvme_map_prp trans_len 4096 len 114688 prp1 0x80ae6000 prp2 0x82474000 num_prps 29
> pci_nvme_map_addr addr 0x80ae6000 len 4096
> pci_nvme_map_addr addr 0x80ae5000 len 4096
> pci_nvme_map_addr addr 0x80ae4000 len 4096
> pci_nvme_map_addr addr 0x80ae3000 len 4096
> pci_nvme_map_addr addr 0x80ae2000 len 4096
> pci_nvme_map_addr addr 0x80ae1000 len 4096
> pci_nvme_map_addr addr 0x80ae0000 len 4096
> pci_nvme_map_addr addr 0x80adf000 len 4096
> pci_nvme_map_addr addr 0x80ade000 len 4096
> pci_nvme_map_addr addr 0x80add000 len 4096
> pci_nvme_map_addr addr 0x80adc000 len 4096
> pci_nvme_map_addr addr 0x80adb000 len 4096
> pci_nvme_map_addr addr 0x80ada000 len 4096
> pci_nvme_map_addr addr 0x80ad9000 len 4096
> pci_nvme_map_addr addr 0x80ad8000 len 4096
> pci_nvme_map_addr addr 0x80ad7000 len 4096
> pci_nvme_map_addr addr 0x80ad6000 len 4096
> pci_nvme_map_addr addr 0x80ad5000 len 4096
> pci_nvme_map_addr addr 0x80ad4000 len 4096
> pci_nvme_map_addr addr 0x80ad3000 len 4096
> pci_nvme_map_addr addr 0x80ad2000 len 4096
> pci_nvme_map_addr addr 0x80ad1000 len 4096
> pci_nvme_map_addr addr 0x80ad0000 len 4096
> pci_nvme_map_addr addr 0x80acf000 len 4096
> pci_nvme_map_addr addr 0x80ace000 len 4096
> pci_nvme_map_addr addr 0x80acd000 len 4096
> pci_nvme_map_addr addr 0x80acc000 len 4096
> pci_nvme_map_addr addr 0x80acb000 len 4096
> pci_nvme_rw_cb cid 4428 blk 'd0'
> pci_nvme_rw_complete_cb cid 4428 blk 'd0'
> pci_nvme_enqueue_req_completion cid 4428 cqid 1 dw0 0x0 dw1 0x0 status 0x0
> [1]: pci_nvme_irq_pin pulsing IRQ pin
> pci_nvme_rw_cb cid 4429 blk 'd0'
> pci_nvme_rw_complete_cb cid 4429 blk 'd0'
> pci_nvme_enqueue_req_completion cid 4429 cqid 1 dw0 0x0 dw1 0x0 status 0x0
> [2]: pci_nvme_irq_pin pulsing IRQ pin
> [3]: pci_nvme_mmio_write addr 0x100c data 0x4d size 4
> [4]: pci_nvme_mmio_doorbell_cq cqid 1 new_head 77
> ---- TIMEOUT HERE (30s) ---
> [5]: pci_nvme_mmio_read addr 0x1c size 4
> [6]: pci_nvme_mmio_write addr 0x100c data 0x4e size 4
> [7]: pci_nvme_mmio_doorbell_cq cqid 1 new_head 78
> --- Interrupt deasserted (cq->tail == cq->head)
> [   31.757821] nvme nvme0: I/O 333 QID 1 timeout, completion polled
> 
> Following the timeout, everything returns to "normal" and the device
> and driver happily continue.
> 
> The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
> am wondering if there is something going on with the kernel driver (but
> I certainly do not rule out that hw/nvme is at fault here, since
> pin-based interrupts have also been a source of several issues in the
> past).
> 
> What I'm thinking is that following the interrupt in [1], the driver
> picks up the completion for cid 4428 but does not find cid 4429 in the
> queue since it has not been posted yet. Before getting a cq head
> doorbell write (which would cause the pin to be deasserted), the device
> posts the completion for cid 4429, which just keeps the interrupt
> asserted in [2]. The trace then shows the cq head doorbell update in
> [3,4] for cid 4428, and then we hit the timeout since the driver is not
> aware that cid 4429 has been posted in the meantime (why is it not aware
> of this?). Timing out, the driver then polls the queue, notices cid
> 4429, and updates the cq head doorbell in [5-7], causing the device to
> deassert the interrupt, and we are "back in shape".
> 
> I'm observing this on 6.0 kernels and v6.2-rc3 (have not tested <6.0).
> Tested on QEMU v7.0.0 (to rule out all the shadow doorbell
> optimizations) as well as QEMU nvme-next (infradead). In other words,
> it's not a recent regression in either project and potentially it has
> always been like this. I've not tested other platforms for now, but I
> would assume others using pin-based interrupts would observe the same.
> 
> Any ideas on how to shed any light on this issue from the kernel side of
> things?

I noticed that the Linux driver does not use the INTMS/INTMC registers
to mask interrupts on the controller while processing CQEs. While not
required by the spec, it is *recommended* in setups not using MSI-X to
reduce the risk of spurious and/or missed interrupts.

With the patch below, running 100 boot iterations, no timeouts were
observed on QEMU emulated riscv64 or mips64.

No changes are required in the QEMU hw/nvme interrupt logic.


diff --git i/drivers/nvme/host/pci.c w/drivers/nvme/host/pci.c
index b13baccedb4a..75f6b87c4c3f 100644
--- i/drivers/nvme/host/pci.c
+++ w/drivers/nvme/host/pci.c
@@ -1128,6 +1128,27 @@ static inline int nvme_poll_cq(struct nvme_queue *nvmeq,
 }

 static irqreturn_t nvme_irq(int irq, void *data)
+{
+       struct nvme_queue *nvmeq = data;
+       struct nvme_dev *dev = nvmeq->dev;
+       u32 mask = 1 << nvmeq->cq_vector;
+       irqreturn_t ret = IRQ_NONE;
+       DEFINE_IO_COMP_BATCH(iob);
+
+       writel(mask, dev->bar + NVME_REG_INTMS);
+
+       if (nvme_poll_cq(nvmeq, &iob)) {
+               if (!rq_list_empty(iob.req_list))
+                       nvme_pci_complete_batch(&iob);
+               ret = IRQ_HANDLED;
+       }
+
+       writel(mask, dev->bar + NVME_REG_INTMC);
+
+       return ret;
+}
+
+static irqreturn_t nvme_irq_msix(int irq, void *data)
 {
        struct nvme_queue *nvmeq = data;
        DEFINE_IO_COMP_BATCH(iob);
@@ -1602,12 +1623,13 @@ static int queue_request_irq(struct nvme_queue *nvmeq)
 {
        struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
        int nr = nvmeq->dev->ctrl.instance;
+       irq_handler_t handler = pdev->msix_enabled ? nvme_irq_msix : nvme_irq;

        if (use_threaded_interrupts) {
                return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq_check,
-                               nvme_irq, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
+                               handler, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
        } else {
-               return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq,
+               return pci_request_irq(pdev, nvmeq->cq_vector, handler,
                                NULL, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
        }
 }





* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-16 21:14 ` Klaus Jensen
@ 2023-01-17  4:58   ` Keith Busch
  2023-01-17 16:09     ` Guenter Roeck
  2023-01-17 16:05   ` Guenter Roeck
  1 sibling, 1 reply; 35+ messages in thread
From: Keith Busch @ 2023-01-17  4:58 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	qemu-block, qemu-devel, Guenter Roeck

On Mon, Jan 16, 2023 at 10:14:07PM +0100, Klaus Jensen wrote:
> I noticed that the Linux driver does not use the INTMS/INTMC registers
> to mask interrupts on the controller while processing CQEs. While not
> required by the spec, it is *recommended* in setups not using MSI-X to
> reduce the risk of spurious and/or missed interrupts.

That's assuming completions are deferred to a bottom half. We don't do
that by default in Linux nvme, though you can ask the driver to do that
if you want.
 
> With the patch below, running 100 boot iterations, no timeouts were
> observed on QEMU emulated riscv64 or mips64.
>
> No changes are required in the QEMU hw/nvme interrupt logic.

Yeah, I can see why: it forces the irq line to deassert and then assert,
just like we forced to happen with the device-side patches. Still, none
of that is supposed to be necessary, but this idea of using these
registers is probably fine.

>  static irqreturn_t nvme_irq(int irq, void *data)
> +{
> +       struct nvme_queue *nvmeq = data;
> +       struct nvme_dev *dev = nvmeq->dev;
> +       u32 mask = 1 << nvmeq->cq_vector;
> +       irqreturn_t ret = IRQ_NONE;
> +       DEFINE_IO_COMP_BATCH(iob);
> +
> +       writel(mask, dev->bar + NVME_REG_INTMS);
> +
> +       if (nvme_poll_cq(nvmeq, &iob)) {
> +               if (!rq_list_empty(iob.req_list))
> +                       nvme_pci_complete_batch(&iob);
> +               ret = IRQ_HANDLED;
> +       }
> +
> +       writel(mask, dev->bar + NVME_REG_INTMC);
> +
> +       return ret;
> +}

If threaded interrupts are used, you'll want to do the masking in
nvme_irq_check(), then clear it in the threaded handler instead of doing
both in the same callback.
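
Something like this for the check handler (untested sketch on top of
your patch; nvme_cqe_pending() and the register defines are what the
driver already uses):

	static irqreturn_t nvme_irq_check(int irq, void *data)
	{
		struct nvme_queue *nvmeq = data;
		struct nvme_dev *dev = nvmeq->dev;

		if (!nvme_cqe_pending(nvmeq))
			return IRQ_NONE;

		/* mask the vector until the threaded handler drains the CQ */
		writel(1 << nvmeq->cq_vector, dev->bar + NVME_REG_INTMS);
		return IRQ_WAKE_THREAD;
	}

with the threaded handler then doing the INTMC write last, after it has
finished polling, as in your nvme_irq() above.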

> +static irqreturn_t nvme_irq_msix(int irq, void *data)
>  {
>         struct nvme_queue *nvmeq = data;
>         DEFINE_IO_COMP_BATCH(iob);
> @@ -1602,12 +1623,13 @@ static int queue_request_irq(struct nvme_queue *nvmeq)
>  {
>         struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
>         int nr = nvmeq->dev->ctrl.instance;
> +       irq_handler_t handler = pdev->msix_enabled ? nvme_irq_msix : nvme_irq;
> 
>         if (use_threaded_interrupts) {
>                 return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq_check,
> -                               nvme_irq, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
> +                               handler, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
>         } else {
> -               return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq,
> +               return pci_request_irq(pdev, nvmeq->cq_vector, handler,
>                                 NULL, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
>         }
>  }
> 
> 





* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-16 21:14 ` Klaus Jensen
  2023-01-17  4:58   ` Keith Busch
@ 2023-01-17 16:05   ` Guenter Roeck
  1 sibling, 0 replies; 35+ messages in thread
From: Guenter Roeck @ 2023-01-17 16:05 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	linux-nvme, qemu-block, qemu-devel

On Mon, Jan 16, 2023 at 10:14:07PM +0100, Klaus Jensen wrote:
[ ... ]
> 
> I noticed that the Linux driver does not use the INTMS/INTMC registers
> to mask interrupts on the controller while processing CQEs. While not
> required by the spec, it is *recommended* in setups not using MSI-X to
> reduce the risk of spurious and/or missed interrupts.
> 
> With the patch below, running 100 boot iterations, no timeouts were
> observed on QEMU emulated riscv64 or mips64.
> 
> No changes are required in the QEMU hw/nvme interrupt logic.
> 

Yes, but isn't that technically similar to dropping the
interrupt request and raising it again, or in other words
pulsing the interrupt?

I still don't understand why the (level-triggered) interrupt
pin would require pulsing in the first place.

Thanks,
Guenter

> 
> diff --git i/drivers/nvme/host/pci.c w/drivers/nvme/host/pci.c
> index b13baccedb4a..75f6b87c4c3f 100644
> --- i/drivers/nvme/host/pci.c
> +++ w/drivers/nvme/host/pci.c
> @@ -1128,6 +1128,27 @@ static inline int nvme_poll_cq(struct nvme_queue *nvmeq,
>  }
> 
>  static irqreturn_t nvme_irq(int irq, void *data)
> +{
> +       struct nvme_queue *nvmeq = data;
> +       struct nvme_dev *dev = nvmeq->dev;
> +       u32 mask = 1 << nvmeq->cq_vector;
> +       irqreturn_t ret = IRQ_NONE;
> +       DEFINE_IO_COMP_BATCH(iob);
> +
> +       writel(mask, dev->bar + NVME_REG_INTMS);
> +
> +       if (nvme_poll_cq(nvmeq, &iob)) {
> +               if (!rq_list_empty(iob.req_list))
> +                       nvme_pci_complete_batch(&iob);
> +               ret = IRQ_HANDLED;
> +       }
> +
> +       writel(mask, dev->bar + NVME_REG_INTMC);
> +
> +       return ret;
> +}
> +
> +static irqreturn_t nvme_irq_msix(int irq, void *data)
>  {
>         struct nvme_queue *nvmeq = data;
>         DEFINE_IO_COMP_BATCH(iob);
> @@ -1602,12 +1623,13 @@ static int queue_request_irq(struct nvme_queue *nvmeq)
>  {
>         struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
>         int nr = nvmeq->dev->ctrl.instance;
> +       irq_handler_t handler = pdev->msix_enabled ? nvme_irq_msix : nvme_irq;
> 
>         if (use_threaded_interrupts) {
>                 return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq_check,
> -                               nvme_irq, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
> +                               handler, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
>         } else {
> -               return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq,
> +               return pci_request_irq(pdev, nvmeq->cq_vector, handler,
>                                 NULL, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
>         }
>  }
> 
> 





* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-17  4:58   ` Keith Busch
@ 2023-01-17 16:09     ` Guenter Roeck
  2023-01-17 16:18       ` Peter Maydell
  0 siblings, 1 reply; 35+ messages in thread
From: Guenter Roeck @ 2023-01-17 16:09 UTC (permalink / raw)
  To: Keith Busch
  Cc: Klaus Jensen, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	linux-nvme, qemu-block, qemu-devel

On Mon, Jan 16, 2023 at 09:58:13PM -0700, Keith Busch wrote:
> On Mon, Jan 16, 2023 at 10:14:07PM +0100, Klaus Jensen wrote:
> > I noticed that the Linux driver does not use the INTMS/INTMC registers
> > to mask interrupts on the controller while processing CQEs. While not
> > required by the spec, it is *recommended* in setups not using MSI-X to
> > reduce the risk of spurious and/or missed interrupts.
> 
> That's assuming completions are deferred to a bottom half. We don't do
> that by default in Linux nvme, though you can ask the driver to do that
> if you want.
>  
> > With the patch below, running 100 boot iterations, no timeouts were
> > observed on QEMU emulated riscv64 or mips64.
> >
> > No changes are required in the QEMU hw/nvme interrupt logic.
> 
> Yeah, I can see why: it forces the irq line to deassert then assert,
> just like we had forced to happen within the device side patches. Still,
> none of that is supposed to be necessary, but this idea of using these
> registers is probably fine.

There is still no answer why this would be necessary in the first place,
on either side. In my opinion, unless someone can confirm that the problem
is seen with real hardware, we should assume that it happens on the qemu
side and address it there.

Guenter

> 
> >  static irqreturn_t nvme_irq(int irq, void *data)
> > +{
> > +       struct nvme_queue *nvmeq = data;
> > +       struct nvme_dev *dev = nvmeq->dev;
> > +       u32 mask = 1 << nvmeq->cq_vector;
> > +       irqreturn_t ret = IRQ_NONE;
> > +       DEFINE_IO_COMP_BATCH(iob);
> > +
> > +       writel(mask, dev->bar + NVME_REG_INTMS);
> > +
> > +       if (nvme_poll_cq(nvmeq, &iob)) {
> > +               if (!rq_list_empty(iob.req_list))
> > +                       nvme_pci_complete_batch(&iob);
> > +               ret = IRQ_HANDLED;
> > +       }
> > +
> > +       writel(mask, dev->bar + NVME_REG_INTMC);
> > +
> > +       return ret;
> > +}
> 
> If threaded interrupts are used, you'll want to do the masking in
> nvme_irq_check(), then clear it in the threaded handler instead of doing
> both in the same callback.
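
A sketch of that split could look like the following, reusing
nvme_cqe_pending() from the existing driver; nvme_irq_thread() is a made-up
name and this is an untested sketch, not a proposed patch:

	static irqreturn_t nvme_irq_check(int irq, void *data)
	{
		struct nvme_queue *nvmeq = data;
		struct nvme_dev *dev = nvmeq->dev;

		if (nvme_cqe_pending(nvmeq)) {
			/* mask in the hard handler, before waking the thread */
			writel(1 << nvmeq->cq_vector, dev->bar + NVME_REG_INTMS);
			return IRQ_WAKE_THREAD;
		}

		return IRQ_NONE;
	}

	static irqreturn_t nvme_irq_thread(int irq, void *data)
	{
		struct nvme_queue *nvmeq = data;
		struct nvme_dev *dev = nvmeq->dev;
		irqreturn_t ret = IRQ_NONE;
		DEFINE_IO_COMP_BATCH(iob);

		if (nvme_poll_cq(nvmeq, &iob)) {
			if (!rq_list_empty(iob.req_list))
				nvme_pci_complete_batch(&iob);
			ret = IRQ_HANDLED;
		}

		/* unmask only once the CQ has been drained */
		writel(1 << nvmeq->cq_vector, dev->bar + NVME_REG_INTMC);

		return ret;
	}
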
> 
> > +static irqreturn_t nvme_irq_msix(int irq, void *data)
> >  {
> >         struct nvme_queue *nvmeq = data;
> >         DEFINE_IO_COMP_BATCH(iob);
> > @@ -1602,12 +1623,13 @@ static int queue_request_irq(struct nvme_queue *nvmeq)
> >  {
> >         struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
> >         int nr = nvmeq->dev->ctrl.instance;
> > +       irq_handler_t handler = pdev->msix_enabled ? nvme_irq_msix : nvme_irq;
> > 
> >         if (use_threaded_interrupts) {
> >                 return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq_check,
> > -                               nvme_irq, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
> > +                               handler, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
> >         } else {
> > -               return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq,
> > +               return pci_request_irq(pdev, nvmeq->cq_vector, handler,
> >                                 NULL, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
> >         }
> >  }
> > 
> > 
> 
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-17 16:09     ` Guenter Roeck
@ 2023-01-17 16:18       ` Peter Maydell
  2023-01-17 19:21         ` Guenter Roeck
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Maydell @ 2023-01-17 16:18 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Keith Busch, Klaus Jensen, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, linux-nvme, qemu-block, qemu-devel

On Tue, 17 Jan 2023 at 16:10, Guenter Roeck <linux@roeck-us.net> wrote:
>
> On Mon, Jan 16, 2023 at 09:58:13PM -0700, Keith Busch wrote:
> > On Mon, Jan 16, 2023 at 10:14:07PM +0100, Klaus Jensen wrote:
> > > I noticed that the Linux driver does not use the INTMS/INTMC registers
> > > to mask interrupts on the controller while processing CQEs. While not
> > > required by the spec, it is *recommended* in setups not using MSI-X to
> > > reduce the risk of spurious and/or missed interrupts.
> >
> > That's assuming completions are deferred to a bottom half. We don't do
> > that by default in Linux nvme, though you can ask the driver to do that
> > if you want.
> >
> > > With the patch below, running 100 boot iterations, no timeouts were
> > > observed on QEMU emulated riscv64 or mips64.
> > >
> > > No changes are required in the QEMU hw/nvme interrupt logic.
> >
> > Yeah, I can see why: it forces the irq line to deassert then assert,
> > just like we had forced to happen within the device side patches. Still,
> > none of that is supposed to be necessary, but this idea of using these
> > registers is probably fine.
>
> There is still no answer why this would be necessary in the first place,
> on either side. In my opinion, unless someone can confirm that the problem
> is seen with real hardware, we should assume that it happens on the qemu
> side and address it there.

Sure, but that means identifying what the divergence
between QEMU's implementation and the hardware is first. I don't
want a fudged fix in QEMU's code any more than you want one in
the kernel's driver code :-)

thanks
-- PMM


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-17 16:18       ` Peter Maydell
@ 2023-01-17 19:21         ` Guenter Roeck
  2023-01-18 15:04           ` Peter Maydell
  0 siblings, 1 reply; 35+ messages in thread
From: Guenter Roeck @ 2023-01-17 19:21 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Keith Busch, Klaus Jensen, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, linux-nvme, qemu-block, qemu-devel

On Tue, Jan 17, 2023 at 04:18:14PM +0000, Peter Maydell wrote:
> On Tue, 17 Jan 2023 at 16:10, Guenter Roeck <linux@roeck-us.net> wrote:
> >
> > On Mon, Jan 16, 2023 at 09:58:13PM -0700, Keith Busch wrote:
> > > On Mon, Jan 16, 2023 at 10:14:07PM +0100, Klaus Jensen wrote:
> > > > I noticed that the Linux driver does not use the INTMS/INTMC registers
> > > > to mask interrupts on the controller while processing CQEs. While not
> > > > required by the spec, it is *recommended* in setups not using MSI-X to
> > > > reduce the risk of spurious and/or missed interrupts.
> > >
> > > That's assuming completions are deferred to a bottom half. We don't do
> > > that by default in Linux nvme, though you can ask the driver to do that
> > > if you want.
> > >
> > > > With the patch below, running 100 boot iterations, no timeouts were
> > > > observed on QEMU emulated riscv64 or mips64.
> > > >
> > > > No changes are required in the QEMU hw/nvme interrupt logic.
> > >
> > > Yeah, I can see why: it forces the irq line to deassert then assert,
> > > just like we had forced to happen within the device side patches. Still,
> > > none of that is supposed to be necessary, but this idea of using these
> > > registers is probably fine.
> >
> > There is still no answer why this would be necessary in the first place,
> > on either side. In my opinion, unless someone can confirm that the problem
> > is seen with real hardware, we should assume that it happens on the qemu
> > side and address it there.
> 
> Sure, but that means identifying what the divergence
> between QEMU's implementation and the hardware is first. I don't
> want a fudged fix in QEMU's code any more than you want one in
> the kernel's driver code :-)
> 

I actually prefer it in qemu because that means I can test nvme support
on all active LTS releases of the Linux kernel, but that is my POV and
secondary. This has been broken ever since I started testing nvme
support with qemu, so it doesn't make much of a difference if fixing
the problem for good takes a bit longer. Plus, I run my own version
of qemu anyway, so carrying the fix (hack) there doesn't make much
of a difference for me.

Anyway - any idea what to do to help figure out what is happening?
Add tracing support to PCI interrupt handling, maybe?

Guenter


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-12 13:10 completion timeouts with pin-based interrupts in QEMU hw/nvme Klaus Jensen
                   ` (3 preceding siblings ...)
  2023-01-16 21:14 ` Klaus Jensen
@ 2023-01-18  4:17 ` Keith Busch
  2023-01-18 22:26 ` Keith Busch
  5 siblings, 0 replies; 35+ messages in thread
From: Keith Busch @ 2023-01-18  4:17 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	qemu-block, qemu-devel, Guenter Roeck

On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:
> Hi all (linux-nvme, qemu-devel, maintainers),
> 
> On QEMU riscv64, which does not use MSI/MSI-X and thus relies on
> pin-based interrupts, I'm seeing occasional completion timeouts, i.e.
> 
>   nvme nvme0: I/O 333 QID 1 timeout, completion polled

I finally got a riscv setup recreating this, and I am not sure how
you're getting a "completion polled" here. I find that the polling from
here progresses the request state to IDLE, so the timeout handler moves
on to aborting because it's expecting COMPLETED. The driver should not
be doing that, so I'll fix that up while also investigating why the
interrupt isn't retriggering as expected.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-17 19:21         ` Guenter Roeck
@ 2023-01-18 15:04           ` Peter Maydell
  2023-01-18 16:33             ` Keith Busch
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Maydell @ 2023-01-18 15:04 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Keith Busch, Klaus Jensen, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, linux-nvme, qemu-block, qemu-devel

On Tue, 17 Jan 2023 at 19:21, Guenter Roeck <linux@roeck-us.net> wrote:
> Anyway - any idea what to do to help figure out what is happening?
> Add tracing support to PCI interrupt handling, maybe?

For intermittent bugs, I like recording the QEMU session under
rr (using its chaos mode to provoke the failure if necessary) to
get a recording that I can debug and re-debug at leisure. Usually
you want to turn on/add tracing to help with this, and if the
failure doesn't hit early in bootup then you might need to
do a QEMU snapshot just before point-of-failure so you can
run rr only on the short snapshot-to-failure segment.

https://translatedcode.wordpress.com/2015/05/30/tricks-for-debugging-qemu-rr/
https://translatedcode.wordpress.com/2015/07/06/tricks-for-debugging-qemu-savevm-snapshots/

This gives you a debugging session from the QEMU side's perspective,
of course -- assuming you know what the hardware is supposed to do,
you hopefully wind up with either "the guest software did X,Y,Z
and we incorrectly did A" or else "the guest software did X,Y,Z,
the spec says A is the right/a permitted thing but the guest got confused".
If it's the latter then you have to look at the guest as a separate
code analysis/debug problem.

thanks
-- PMM


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-18 15:04           ` Peter Maydell
@ 2023-01-18 16:33             ` Keith Busch
  2023-01-18 23:06               ` Keith Busch
  0 siblings, 1 reply; 35+ messages in thread
From: Keith Busch @ 2023-01-18 16:33 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Guenter Roeck, Klaus Jensen, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, linux-nvme, qemu-block, qemu-devel

On Wed, Jan 18, 2023 at 03:04:06PM +0000, Peter Maydell wrote:
> On Tue, 17 Jan 2023 at 19:21, Guenter Roeck <linux@roeck-us.net> wrote:
> > Anyway - any idea what to do to help figure out what is happening?
> > Add tracing support to PCI interrupt handling, maybe?
> 
> For intermittent bugs, I like recording the QEMU session under
> rr (using its chaos mode to provoke the failure if necessary) to
> get a recording that I can debug and re-debug at leisure. Usually
> you want to turn on/add tracing to help with this, and if the
> failure doesn't hit early in bootup then you might need to
> do a QEMU snapshot just before point-of-failure so you can
> run rr only on the short snapshot-to-failure segment.
> 
> https://translatedcode.wordpress.com/2015/05/30/tricks-for-debugging-qemu-rr/
> https://translatedcode.wordpress.com/2015/07/06/tricks-for-debugging-qemu-savevm-snapshots/
> 
> This gives you a debugging session from the QEMU side's perspective,
> of course -- assuming you know what the hardware is supposed to do,
> you hopefully wind up with either "the guest software did X,Y,Z
> and we incorrectly did A" or else "the guest software did X,Y,Z,
> the spec says A is the right/a permitted thing but the guest got confused".
> If it's the latter then you have to look at the guest as a separate
> code analysis/debug problem.

Here's what I got, though I'm way out of my depth here.

It looks like the Linux kernel's fasteoi for RISC-V's PLIC claims the
interrupt after its first handling, which I think is expected. After
claiming, QEMU masks the pending interrupt, lowering the level, though
the device that raised it never deasserted.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-12 13:10 completion timeouts with pin-based interrupts in QEMU hw/nvme Klaus Jensen
                   ` (4 preceding siblings ...)
  2023-01-18  4:17 ` Keith Busch
@ 2023-01-18 22:26 ` Keith Busch
  2023-01-19  7:06   ` Klaus Jensen
  5 siblings, 1 reply; 35+ messages in thread
From: Keith Busch @ 2023-01-18 22:26 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	qemu-block, qemu-devel, Guenter Roeck

Klaus,

This isn't going to help your issue, but there are at least two legacy
irq bugs in the nvme qemu implementation.

1. The admin queue breaks if we start with legacy and later initialize
MSI-X.

2. The legacy vector is shared among all queues, but it's being
deasserted when the first queue's doorbell makes it empty. It should
remain enabled if any cq is non-empty.
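
For the second one, the aggregate condition could be derived along these
lines; nvme_update_intx() is a hypothetical helper sketched here for
illustration, not the actual patches:

	/*
	 * Drive the shared INTx pin from the state of all completion
	 * queues: deassert only when every CQ is empty.
	 */
	static void nvme_update_intx(NvmeCtrl *n)
	{
	    bool pending = false;
	    int i;

	    for (i = 0; i < n->params.max_ioqpairs + 1; i++) {
	        NvmeCQueue *cq = n->cq[i];

	        if (cq && cq->head != cq->tail) {
	            pending = true;
	            break;
	        }
	    }

	    pci_set_irq(PCI_DEVICE(n), pending);
	}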

I'll send you some patches for those later. Still working on the real
problem.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-18 16:33             ` Keith Busch
@ 2023-01-18 23:06               ` Keith Busch
  2023-01-19  0:41                 ` Alistair Francis
  0 siblings, 1 reply; 35+ messages in thread
From: Keith Busch @ 2023-01-18 23:06 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Guenter Roeck, Klaus Jensen, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, linux-nvme, qemu-block, qemu-devel

On Wed, Jan 18, 2023 at 09:33:05AM -0700, Keith Busch wrote:
> On Wed, Jan 18, 2023 at 03:04:06PM +0000, Peter Maydell wrote:
> > On Tue, 17 Jan 2023 at 19:21, Guenter Roeck <linux@roeck-us.net> wrote:
> > > Anyway - any idea what to do to help figure out what is happening?
> > > Add tracing support to PCI interrupt handling, maybe?
> > 
> > For intermittent bugs, I like recording the QEMU session under
> > rr (using its chaos mode to provoke the failure if necessary) to
> > get a recording that I can debug and re-debug at leisure. Usually
> > you want to turn on/add tracing to help with this, and if the
> > failure doesn't hit early in bootup then you might need to
> > do a QEMU snapshot just before point-of-failure so you can
> > run rr only on the short snapshot-to-failure segment.
> > 
> > https://translatedcode.wordpress.com/2015/05/30/tricks-for-debugging-qemu-rr/
> > https://translatedcode.wordpress.com/2015/07/06/tricks-for-debugging-qemu-savevm-snapshots/
> > 
> > This gives you a debugging session from the QEMU side's perspective,
> > of course -- assuming you know what the hardware is supposed to do,
> > you hopefully wind up with either "the guest software did X,Y,Z
> > and we incorrectly did A" or else "the guest software did X,Y,Z,
> > the spec says A is the right/a permitted thing but the guest got confused".
> > If it's the latter then you have to look at the guest as a separate
> > code analysis/debug problem.
> 
> Here's what I got, though I'm way out of my depth here.
> 
> It looks like the Linux kernel's fasteoi for RISC-V's PLIC claims the
> interrupt after its first handling, which I think is expected. After
> claiming, QEMU masks the pending interrupt, lowering the level, though
> the device that raised it never deasserted.

I'm not sure if this is correct, but this is what I came up with, and it
appears to fix the problem on my setup. The hardware that sets the
pending interrupt is going to clear it, so I don't see why the interrupt
controller is automatically clearing it when the host claims it.

---
diff --git a/hw/intc/sifive_plic.c b/hw/intc/sifive_plic.c
index c2dfacf028..f8f7af08dc 100644
--- a/hw/intc/sifive_plic.c
+++ b/hw/intc/sifive_plic.c
@@ -157,7 +157,6 @@ static uint64_t sifive_plic_read(void *opaque, hwaddr addr, unsigned size)
             uint32_t max_irq = sifive_plic_claimed(plic, addrid);
 
             if (max_irq) {
-                sifive_plic_set_pending(plic, max_irq, false);
                 sifive_plic_set_claimed(plic, max_irq, true);
             }
 
--


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-18 23:06               ` Keith Busch
@ 2023-01-19  0:41                 ` Alistair Francis
  2023-01-19  2:44                   ` Keith Busch
  0 siblings, 1 reply; 35+ messages in thread
From: Alistair Francis @ 2023-01-19  0:41 UTC (permalink / raw)
  To: Keith Busch
  Cc: Peter Maydell, Guenter Roeck, Klaus Jensen, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, qemu-block,
	qemu-devel

On Thu, Jan 19, 2023 at 9:07 AM Keith Busch <kbusch@kernel.org> wrote:
>
> On Wed, Jan 18, 2023 at 09:33:05AM -0700, Keith Busch wrote:
> > On Wed, Jan 18, 2023 at 03:04:06PM +0000, Peter Maydell wrote:
> > > On Tue, 17 Jan 2023 at 19:21, Guenter Roeck <linux@roeck-us.net> wrote:
> > > > Anyway - any idea what to do to help figure out what is happening?
> > > > Add tracing support to PCI interrupt handling, maybe?
> > >
> > > For intermittent bugs, I like recording the QEMU session under
> > > rr (using its chaos mode to provoke the failure if necessary) to
> > > get a recording that I can debug and re-debug at leisure. Usually
> > > you want to turn on/add tracing to help with this, and if the
> > > failure doesn't hit early in bootup then you might need to
> > > do a QEMU snapshot just before point-of-failure so you can
> > > run rr only on the short snapshot-to-failure segment.
> > >
> > > https://translatedcode.wordpress.com/2015/05/30/tricks-for-debugging-qemu-rr/
> > > https://translatedcode.wordpress.com/2015/07/06/tricks-for-debugging-qemu-savevm-snapshots/
> > >
> > > This gives you a debugging session from the QEMU side's perspective,
> > > of course -- assuming you know what the hardware is supposed to do,
> > > you hopefully wind up with either "the guest software did X,Y,Z
> > > and we incorrectly did A" or else "the guest software did X,Y,Z,
> > > the spec says A is the right/a permitted thing but the guest got confused".
> > > If it's the latter then you have to look at the guest as a separate
> > > code analysis/debug problem.
> >
> > Here's what I got, though I'm way out of my depth here.
> >
> > It looks like the Linux kernel's fasteoi for RISC-V's PLIC claims the
> > interrupt after its first handling, which I think is expected. After
> > claiming, QEMU masks the pending interrupt, lowering the level, though
> > the device that raised it never deasserted.
>
> I'm not sure if this is correct, but this is what I came up with, and it
> appears to fix the problem on my setup. The hardware that sets the
> pending interrupt is going to clear it, so I don't see why the interrupt
> controller is automatically clearing it when the host claims it.
>
> ---
> diff --git a/hw/intc/sifive_plic.c b/hw/intc/sifive_plic.c
> index c2dfacf028..f8f7af08dc 100644
> --- a/hw/intc/sifive_plic.c
> +++ b/hw/intc/sifive_plic.c
> @@ -157,7 +157,6 @@ static uint64_t sifive_plic_read(void *opaque, hwaddr addr, unsigned size)
>              uint32_t max_irq = sifive_plic_claimed(plic, addrid);
>
>              if (max_irq) {
> -                sifive_plic_set_pending(plic, max_irq, false);
>                  sifive_plic_set_claimed(plic, max_irq, true);
>              }
>

This change isn't correct. The PLIC spec
(https://github.com/riscv/riscv-plic-spec/releases/download/1.0.0_rc5/riscv-plic-1.0.0_rc5.pdf)
states:

"""
On receiving a claim message, the PLIC core will atomically determine
the ID of the highest-priority pending interrupt for the target and
then clear down the corresponding source’s IP bit
"""

which is what we are doing here. We are clearing the pending interrupt
inside the PLIC.

Alistair


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-19  0:41                 ` Alistair Francis
@ 2023-01-19  2:44                   ` Keith Busch
  2023-01-19  3:10                     ` Alistair Francis
  0 siblings, 1 reply; 35+ messages in thread
From: Keith Busch @ 2023-01-19  2:44 UTC (permalink / raw)
  To: Alistair Francis
  Cc: Peter Maydell, Guenter Roeck, Klaus Jensen, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, qemu-block,
	qemu-devel

On Thu, Jan 19, 2023 at 10:41:42AM +1000, Alistair Francis wrote:
> On Thu, Jan 19, 2023 at 9:07 AM Keith Busch <kbusch@kernel.org> wrote:
> > ---
> > diff --git a/hw/intc/sifive_plic.c b/hw/intc/sifive_plic.c
> > index c2dfacf028..f8f7af08dc 100644
> > --- a/hw/intc/sifive_plic.c
> > +++ b/hw/intc/sifive_plic.c
> > @@ -157,7 +157,6 @@ static uint64_t sifive_plic_read(void *opaque, hwaddr addr, unsigned size)
> >              uint32_t max_irq = sifive_plic_claimed(plic, addrid);
> >
> >              if (max_irq) {
> > -                sifive_plic_set_pending(plic, max_irq, false);
> >                  sifive_plic_set_claimed(plic, max_irq, true);
> >              }
> >
> 
> This change isn't correct. The PLIC spec
> (https://github.com/riscv/riscv-plic-spec/releases/download/1.0.0_rc5/riscv-plic-1.0.0_rc5.pdf)
> states:
> 
> """
> On receiving a claim message, the PLIC core will atomically determine
> the ID of the highest-priority pending interrupt for the target and
> then clear down the corresponding source’s IP bit
> """
> 
> which is what we are doing here. We are clearing the pending interrupt
> inside the PLIC.

Thanks for the link. That's very helpful.

If you're familiar with this area, could you possibly clear up this part
of that spec?

"
On receiving an interrupt completion message, if the interrupt is
level-triggered and the interrupt is still asserted, a new interrupt
request will be forwarded to the PLIC core.
"

Further up, it says the "interrupt gateway" is responsible for
forwarding new interrupt requests while the level remains asserted, but
it doesn't look like anything is handling that, which essentially turns
this into an edge interrupt. Am I missing something, or is this really
not being handled?


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-19  2:44                   ` Keith Busch
@ 2023-01-19  3:10                     ` Alistair Francis
  2023-01-19  4:03                       ` Keith Busch
  0 siblings, 1 reply; 35+ messages in thread
From: Alistair Francis @ 2023-01-19  3:10 UTC (permalink / raw)
  To: Keith Busch
  Cc: Peter Maydell, Guenter Roeck, Klaus Jensen, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, qemu-block,
	qemu-devel

On Thu, Jan 19, 2023 at 12:44 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Thu, Jan 19, 2023 at 10:41:42AM +1000, Alistair Francis wrote:
> > On Thu, Jan 19, 2023 at 9:07 AM Keith Busch <kbusch@kernel.org> wrote:
> > > ---
> > > diff --git a/hw/intc/sifive_plic.c b/hw/intc/sifive_plic.c
> > > index c2dfacf028..f8f7af08dc 100644
> > > --- a/hw/intc/sifive_plic.c
> > > +++ b/hw/intc/sifive_plic.c
> > > @@ -157,7 +157,6 @@ static uint64_t sifive_plic_read(void *opaque, hwaddr addr, unsigned size)
> > >              uint32_t max_irq = sifive_plic_claimed(plic, addrid);
> > >
> > >              if (max_irq) {
> > > -                sifive_plic_set_pending(plic, max_irq, false);
> > >                  sifive_plic_set_claimed(plic, max_irq, true);
> > >              }
> > >
> >
> > This change isn't correct. The PLIC spec
> > (https://github.com/riscv/riscv-plic-spec/releases/download/1.0.0_rc5/riscv-plic-1.0.0_rc5.pdf)
> > states:
> >
> > """
> > On receiving a claim message, the PLIC core will atomically determine
> > the ID of the highest-priority pending interrupt for the target and
> > then clear down the corresponding source’s IP bit
> > """
> >
> > which is what we are doing here. We are clearing the pending interrupt
> > inside the PLIC.
>
> Thanks for the link. That's very helpful.
>
> If you're familiar with this area, could you possibly clear up this part
> of that spec?
>
> "
> On receiving an interrupt completion message, if the interrupt is
> level-triggered and the interrupt is still asserted, a new interrupt
> request will be forwarded to the PLIC core.
> "
>
> Further up, it says the "interrupt gateway" is responsible for
> forwarding new interrupt requests while the level remains asserted, but
> it doesn't look like anything is handling that, which essentially turns
> this into an edge interrupt. Am I missing something, or is this really
> not being handled?

Yeah, that wouldn't be handled. In QEMU the PLIC relies on QEMU's
internal GPIO lines to trigger an interrupt. So with the current setup
we only support edge-triggered interrupts.

It looks like we also drop the pending bit if the original interrupt
de-asserts, which appears to be incorrect as well.
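
For illustration, a level-aware variant of the GPIO entry point would have
to latch the raw input level instead of folding it straight into the
pending bit; sifive_plic_set_level() is a hypothetical helper, so treat
this as a sketch of the idea only:

	static void sifive_plic_irq_request(void *opaque, int irq, int level)
	{
	    SiFivePLICState *s = opaque;

	    /* remember the raw input level for the claim/complete path */
	    sifive_plic_set_level(s, irq, level > 0);

	    if (level > 0) {
	        /* assertion sets IP; deassertion alone must not clear it */
	        sifive_plic_set_pending(s, irq, true);
	    }

	    sifive_plic_update(s);
	}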

Alistair


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-19  3:10                     ` Alistair Francis
@ 2023-01-19  4:03                       ` Keith Busch
  2023-01-19  7:28                         ` Klaus Jensen
  2023-01-19 10:35                         ` Peter Maydell
  0 siblings, 2 replies; 35+ messages in thread
From: Keith Busch @ 2023-01-19  4:03 UTC (permalink / raw)
  To: Alistair Francis
  Cc: Peter Maydell, Guenter Roeck, Klaus Jensen, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, qemu-block,
	qemu-devel

On Thu, Jan 19, 2023 at 01:10:57PM +1000, Alistair Francis wrote:
> On Thu, Jan 19, 2023 at 12:44 PM Keith Busch <kbusch@kernel.org> wrote:
> >
> > Further up, it says the "interrupt gateway" is responsible for
> > forwarding new interrupt requests while the level remains asserted, but
> > it doesn't look like anything is handling that, which essentially turns
> > this into an edge interrupt. Am I missing something, or is this really
> > not being handled?
> 
> Yeah, that wouldn't be handled. In QEMU the PLIC relies on QEMU's
> internal GPIO lines to trigger an interrupt. So with the current setup
> we only support edge-triggered interrupts.

Thanks for confirming!

Klaus,
I think we can justify introducing a work-around in the emulated device
now. My previous proposal with pci_irq_pulse() is no good since it does
assert+deassert, but it needs to be the other way around, so please
don't consider that one.

Also, we ought to revisit the intms/intmc usage in the linux driver for
threaded interrupts.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-18 22:26 ` Keith Busch
@ 2023-01-19  7:06   ` Klaus Jensen
  0 siblings, 0 replies; 35+ messages in thread
From: Klaus Jensen @ 2023-01-19  7:06 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	qemu-block, qemu-devel, Guenter Roeck

[-- Attachment #1: Type: text/plain, Size: 725 bytes --]

On Jan 18 15:26, Keith Busch wrote:
> Klaus,
> 
> This isn't going to help your issue, but there are at least two legacy
> irq bugs in the nvme qemu implementation.
> 
> 1. The admin queue breaks if we start with legacy and later initialize
> MSI-X.
> 

Hmm. Interesting that we have not encountered this before - is this
because the kernel will enable MSI-X early and use it for the admin
queue immediately?

> 2. The legacy vector is shared among all queues, but it's being
> deasserted when the first queue's doorbell makes it empty. It should
> remain enabled if any cq is non-empty.

I was certain that we fixed this already in commit 83d7ed5c570
("hw/nvme: fix pin-based interrupt behavior (again)")...

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-19  4:03                       ` Keith Busch
@ 2023-01-19  7:28                         ` Klaus Jensen
  2023-01-19  8:02                           ` Klaus Jensen
  2023-01-19 10:35                         ` Peter Maydell
  1 sibling, 1 reply; 35+ messages in thread
From: Klaus Jensen @ 2023-01-19  7:28 UTC (permalink / raw)
  To: Keith Busch
  Cc: Alistair Francis, Peter Maydell, Guenter Roeck, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, qemu-block,
	qemu-devel, qemu-riscv

[-- Attachment #1: Type: text/plain, Size: 1412 bytes --]

On Jan 18 21:03, Keith Busch wrote:
> On Thu, Jan 19, 2023 at 01:10:57PM +1000, Alistair Francis wrote:
> > On Thu, Jan 19, 2023 at 12:44 PM Keith Busch <kbusch@kernel.org> wrote:
> > >
> > > Further up, it says the "interrupt gateway" is responsible for
> > > forwarding new interrupt requests while the level remains asserted, but
> > > it doesn't look like anything is handling that, which essentially turns
> > > this into an edge interrupt. Am I missing something, or is this really
> > > not being handled?
> > 
> > Yeah, that wouldn't be handled. In QEMU the PLIC relies on QEMU's
> > internal GPIO lines to trigger an interrupt. So with the current setup
> > we only support edge-triggered interrupts.
> 
> Thanks for confirming!
> 
> Klaus,
> I think we can justify introducing a work-around in the emulated device
> now. My previous proposal with pci_irq_pulse() is no good since it does
> assert+deassert, but it needs to be the other way around, so please
> don't consider that one.
> 
> Also, we ought to revisit the intms/intmc usage in the linux driver for
> threaded interrupts.

+CC: qemu-riscv

Keith,

Thanks for digging into this!

Yeah, you are probably right that we should only use the intms/intmc
changes in the use_threaded_interrupts case, not in general. While my
RFC patch does seem to "fix" this, it is just a workaround as your
analysis indicates.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-19  7:28                         ` Klaus Jensen
@ 2023-01-19  8:02                           ` Klaus Jensen
  2023-01-19 14:32                             ` Guenter Roeck
  0 siblings, 1 reply; 35+ messages in thread
From: Klaus Jensen @ 2023-01-19  8:02 UTC (permalink / raw)
  To: Keith Busch
  Cc: Alistair Francis, Peter Maydell, Guenter Roeck, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, qemu-block,
	qemu-devel, qemu-riscv, Philippe Mathieu-Daudé

[-- Attachment #1: Type: text/plain, Size: 1656 bytes --]

On Jan 19 08:28, Klaus Jensen wrote:
> On Jan 18 21:03, Keith Busch wrote:
> > On Thu, Jan 19, 2023 at 01:10:57PM +1000, Alistair Francis wrote:
> > > On Thu, Jan 19, 2023 at 12:44 PM Keith Busch <kbusch@kernel.org> wrote:
> > > >
> > > > Further up, it says the "interrupt gateway" is responsible for
> > > > forwarding new interrupt requests while the level remains asserted, but
> > > > it doesn't look like anything is handling that, which essentially turns
> > > > this into an edge interrupt. Am I missing something, or is this really
> > > > not being handled?
> > > 
> > > Yeah, that wouldn't be handled. In QEMU the PLIC relies on QEMU's
> > > internal GPIO lines to trigger an interrupt. So with the current setup
> > > we only support edge-triggered interrupts.
> > 
> > Thanks for confirming!
> > 
> > Klaus,
> > I think we can justify introducing a work-around in the emulated device
> > now. My previous proposal with pci_irq_pulse() is no good since it does
> > assert+deassert, but it needs to be the other way around, so please
> > don't consider that one.
> > 
> > Also, we ought to revisit the intms/intmc usage in the linux driver for
> > threaded interrupts.
> 
> +CC: qemu-riscv
> 
> Keith,
> 
> Thanks for digging into this!
> 
> Yeah, you are probably right that we should only use the intms/intmc
> changes in the use_threaded_interrupts case, not in general. While my
> RFC patch does seem to "fix" this, it is just a workaround as your
> analysis indicates.

+CC: Philippe,

I am observing these timeouts/aborts on mips as well, so I guess that
emulation could suffer from the same issue?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-19  4:03                       ` Keith Busch
  2023-01-19  7:28                         ` Klaus Jensen
@ 2023-01-19 10:35                         ` Peter Maydell
  1 sibling, 0 replies; 35+ messages in thread
From: Peter Maydell @ 2023-01-19 10:35 UTC (permalink / raw)
  To: Keith Busch
  Cc: Alistair Francis, Guenter Roeck, Klaus Jensen, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, qemu-block,
	qemu-devel

On Thu, 19 Jan 2023 at 04:03, Keith Busch <kbusch@kernel.org> wrote:
>
> On Thu, Jan 19, 2023 at 01:10:57PM +1000, Alistair Francis wrote:
> > On Thu, Jan 19, 2023 at 12:44 PM Keith Busch <kbusch@kernel.org> wrote:
> > >
> > > Further up, it says the "interrupt gateway" is responsible for
> > > forwarding new interrupt requests while the level remains asserted, but
> > > it doesn't look like anything is handling that, which essentially turns
> > > this into an edge interrupt. Am I missing something, or is this really
> > > not being handled?
> >
> > Yeah, that wouldn't be handled. In QEMU the PLIC relies on QEMU's
> > internal GPIO lines to trigger an interrupt. So with the current setup
> > we only support edge-triggered interrupts.
>
> Thanks for confirming!
>
> Klaus,
> I think we can justify introducing a work-around in the emulated device
> now. My previous proposal with pci_irq_pulse() is no good since it does
> assert+deassert, but it needs to be the other way around, so please
> don't considert that one.

No, please don't. The bug is in the risc-v interrupt controller,
so fix the risc-v interrupt controller. It shouldn't be too difficult
(you probably have to do something like what the Arm GIC implementation
does, where when the guest dismisses the interrupt you look at the level
to see if it needs to be re-pended.)
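
In sifive_plic.c terms that would mean something like the following on the
claim-complete write path; sifive_plic_irq_complete() and
sifive_plic_level_asserted() are hypothetical names for a latched-level
variant, a sketch of the approach rather than the eventual fix:

	static void sifive_plic_irq_complete(SiFivePLICState *s, uint32_t irq)
	{
	    sifive_plic_set_claimed(s, irq, false);

	    /*
	     * If the level-triggered source is still asserted, forward a
	     * new request to the core, as the spec's gateway would.
	     */
	    if (sifive_plic_level_asserted(s, irq)) {
	        sifive_plic_set_pending(s, irq, true);
	    }

	    sifive_plic_update(s);
	}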

Once "workarounds" go into QEMU device emulation that make it
deviate from hardware behaviour, it's hard to get rid of them
again, because nobody knows whether deployed guests now accidentally
rely on the wrong behaviour. So the correct thing is to never
put in the workaround in the first place.

thanks
-- PMM


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
  2023-01-19  8:02                           ` Klaus Jensen
@ 2023-01-19 14:32                             ` Guenter Roeck
  0 siblings, 0 replies; 35+ messages in thread
From: Guenter Roeck @ 2023-01-19 14:32 UTC (permalink / raw)
  To: Klaus Jensen, Keith Busch
  Cc: Alistair Francis, Peter Maydell, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, linux-nvme, qemu-block, qemu-devel, qemu-riscv,
	Philippe Mathieu-Daudé

On 1/19/23 00:02, Klaus Jensen wrote:
> On Jan 19 08:28, Klaus Jensen wrote:
>> On Jan 18 21:03, Keith Busch wrote:
>>> On Thu, Jan 19, 2023 at 01:10:57PM +1000, Alistair Francis wrote:
>>>> On Thu, Jan 19, 2023 at 12:44 PM Keith Busch <kbusch@kernel.org> wrote:
>>>>>
>>>>> Further up, it says the "interrupt gateway" is responsible for
>>>>> forwarding new interrupt requests while the level remains asserted, but
>>>>> it doesn't look like anything is handling that, which essentially turns
>>>>> this into an edge interrupt. Am I missing something, or is this really
>>>>> not being handled?
>>>>
>>>> Yeah, that wouldn't be handled. In QEMU the PLIC relies on QEMU's
>>>> internal GPIO lines to trigger an interrupt. So with the current setup
>>>> we only support edge-triggered interrupts.
>>>
>>> Thanks for confirming!
>>>
>>> Klaus,
>>> I think we can justify introducing a work-around in the emulated device
>>> now. My previous proposal with pci_irq_pulse() is no good since it does
>>> assert+deassert, but it needs to be the other way around, so please
>>> don't consider that one.
>>>
>>> Also, we ought to revisit the intms/intmc usage in the linux driver for
>>> threaded interrupts.
>>
>> +CC: qemu-riscv
>>
>> Keith,
>>
>> Thanks for digging into this!
>>
>> Yeah, you are probably right that we should only use the intms/intmc
>> changes in the use_threaded_interrupts case, not in general. While my
>> RFC patch does seem to "fix" this, it is just a workaround as your
>> analysis indicates.
>
> +CC: Philippe,
> 
> I am observing these timeouts/aborts on mips as well, so I guess that
> emulation could suffer from the same issue?

I suspect it does. In my case, I have an Ethernet controller on the same
interrupt line as the NVME controller, and it looks like one of the
interrupts gets lost if both controllers raise an interrupt at the same
time. The suggested workarounds "fix" the problem, but that doesn't seem
to be the correct solution.

Guenter



^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2023-01-19 14:33 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-12 13:10 completion timeouts with pin-based interrupts in QEMU hw/nvme Klaus Jensen
2023-01-12 14:26 ` Guenter Roeck
2023-01-12 16:34 ` Keith Busch
2023-01-12 17:45   ` Klaus Jensen
2023-01-12 17:51     ` Keith Busch
2023-01-12 19:14     ` Guenter Roeck
2023-01-12 19:27     ` Keith Busch
2023-01-13  0:27       ` Guenter Roeck
2023-01-12 23:57     ` Guenter Roeck
2023-01-13  8:54 ` Klaus Jensen
2023-01-13 12:32   ` Peter Maydell
2023-01-13 12:37     ` Klaus Jensen
2023-01-13 12:42       ` Peter Maydell
2023-01-13 12:46         ` Klaus Jensen
2023-01-13 16:52     ` Keith Busch
2023-01-16 21:14 ` Klaus Jensen
2023-01-17  4:58   ` Keith Busch
2023-01-17 16:09     ` Guenter Roeck
2023-01-17 16:18       ` Peter Maydell
2023-01-17 19:21         ` Guenter Roeck
2023-01-18 15:04           ` Peter Maydell
2023-01-18 16:33             ` Keith Busch
2023-01-18 23:06               ` Keith Busch
2023-01-19  0:41                 ` Alistair Francis
2023-01-19  2:44                   ` Keith Busch
2023-01-19  3:10                     ` Alistair Francis
2023-01-19  4:03                       ` Keith Busch
2023-01-19  7:28                         ` Klaus Jensen
2023-01-19  8:02                           ` Klaus Jensen
2023-01-19 14:32                             ` Guenter Roeck
2023-01-19 10:35                         ` Peter Maydell
2023-01-17 16:05   ` Guenter Roeck
2023-01-18  4:17 ` Keith Busch
2023-01-18 22:26 ` Keith Busch
2023-01-19  7:06   ` Klaus Jensen
