[Qemu-devel] [PATCH 1/2] esp-pci: Fix status register write erase control

All of lore.kernel.org
 help / color / mirror / Atom feed

* [Qemu-devel] [PATCH 1/2] esp-pci: Fix status register write erase control
@ 2018-11-28 21:56 Guenter Roeck
  2018-11-28 21:56 ` [Qemu-devel] [PATCH 2/2] scsi: esp: Improve consistency of RSTAT, RSEQ, and RINTR Guenter Roeck
  2018-11-29  9:58 ` [Qemu-devel] [PATCH 1/2] esp-pci: Fix status register write erase control Paolo Bonzini
  0 siblings, 2 replies; 16+ messages in thread
From: Guenter Roeck @ 2018-11-28 21:56 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Fam Zheng, qemu-devel, Guenter Roeck

Per AM53C974 datasheet, definition of "SCSI Bus and Control (SBAC)"
register:

Bit 24 – STATUS – Write Erase Control

This bit controls the Write Erase feature on bits 3:1 and bit 6 of the DMA
Status Register ((B)+54h). When this bit is programmed to ‘1’, the state
of bits 3:1 are preserved when read. Bits 3:1 are only cleared when a ‘1’
is written to the corresponding bit location. For example, to clear bit 1,
the value of ‘0000_0010b’ should be written to the register. When the DMA
Status Preserve bit is ‘0’, bits 3:1 are cleared when read.

The status register is currently defined to bit 12, not bit 24.
Also, its implementation is reversed: The status is auto-cleared if
the bit is set to 1, and must be cleared explicitly when the bit is
set to 0. This results in spurious interrupts reported by the Linux
kernel, and in some cases even results in stalled SCSI operations.

Set SBAC_STATUS to bit 24 and reverse the logic to fix the problem.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
---
 hw/scsi/esp-pci.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/hw/scsi/esp-pci.c b/hw/scsi/esp-pci.c
index 419fc66..d956909 100644
--- a/hw/scsi/esp-pci.c
+++ b/hw/scsi/esp-pci.c
@@ -59,7 +59,7 @@
 #define DMA_STAT_SCSIINT 0x10
 #define DMA_STAT_BCMBLT  0x20

-#define SBAC_STATUS 0x1000
+#define SBAC_STATUS (1 << 24)

 typedef struct PCIESPState {
     /*< private >*/
@@ -136,7 +136,7 @@ static void esp_pci_dma_write(PCIESPState *pci, uint32_t saddr, uint32_t val)
         pci->dma_regs[saddr] = val;
         break;
     case DMA_STAT:
-        if (!(pci->sbac & SBAC_STATUS)) {
+        if (pci->sbac & SBAC_STATUS) {
             /* clear some bits on write */
             uint32_t mask = DMA_STAT_ERROR | DMA_STAT_ABORT | DMA_STAT_DONE;
             pci->dma_regs[DMA_STAT] &= ~(val & mask);
@@ -157,7 +157,7 @@ static uint32_t esp_pci_dma_read(PCIESPState *pci, uint32_t saddr)
         if (pci->esp.rregs[ESP_RSTAT] & STAT_INT) {
             val |= DMA_STAT_SCSIINT;
         }
-        if (pci->sbac & SBAC_STATUS) {
+        if (!(pci->sbac & SBAC_STATUS)) {
             pci->dma_regs[DMA_STAT] &= ~(DMA_STAT_ERROR | DMA_STAT_ABORT |
                                          DMA_STAT_DONE);
         }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [Qemu-devel] [PATCH 2/2] scsi: esp: Improve consistency of RSTAT, RSEQ, and RINTR
  2018-11-28 21:56 [Qemu-devel] [PATCH 1/2] esp-pci: Fix status register write erase control Guenter Roeck
@ 2018-11-28 21:56 ` Guenter Roeck
  2018-11-29  9:58   ` Paolo Bonzini
  2018-11-29  9:58 ` [Qemu-devel] [PATCH 1/2] esp-pci: Fix status register write erase control Paolo Bonzini
  1 sibling, 1 reply; 16+ messages in thread
From: Guenter Roeck @ 2018-11-28 21:56 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Fam Zheng, qemu-devel, Guenter Roeck

The guest OS reads RSTAT, RSEQ, and RINTR, and expects those registers
to reflect a consistent state. However, it is possible that the registers
can change after RSTAT was read, but before RINTR is read.

Guest OS		qemu
--------		----
Read RSTAT
			esp_command_complete()
			 RSTAT = STAT_ST
			 esp_dma_done()
			  RSTAT |= STAT_TC
			  RSEQ = 0
			  RINTR = INTR_BS

Read RSEQ
Read RINTR		RINTR = 0
			RSTAT &= ~STAT_TC
			RSEQ = SEQ_CD

The guest OS would then try to handle INTR_BS combined with an old
value of RSTAT. This sometimes resulted in lost events, spurious
interrupts, guest OS confusion, and stalled SCSI operations.
A typical guest error log (observed with various versions of Linux)
looks as follows.

scsi host1: Spurious irq, sreg=13.
...
scsi host1: Aborting command [84531f10:2a]
scsi host1: Current command [f882eea8:35]
scsi host1: Queued command [84531f10:2a]
scsi host1:  Active command [f882eea8:35]
scsi host1: Dumping command log
scsi host1: ent[15] CMD val[44] sreg[90] seqreg[00] sreg2[00] ireg[20] ss[00] event[0c]
scsi host1: ent[16] CMD val[01] sreg[90] seqreg[00] sreg2[00] ireg[20] ss[02] event[0c]
scsi host1: ent[17] CMD val[43] sreg[90] seqreg[00] sreg2[00] ireg[20] ss[02] event[0c]
scsi host1: ent[18] EVENT val[0d] sreg[92] seqreg[04] sreg2[00] ireg[18] ss[00] event[0c]
...

Adress the situation by caching RINTR and RSEQ when RSTAT is read and
using the cached values when the respective registers are read.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
---
I am not too happy with this solution (it looks kludgy to me), but it does
work. With this series applied, I have not seen a single spurious interrupt
after hundreds of boots, and no stalls. Prior to that, spurious interrupts
were seen with pretty much every boot, and the stall occurred on a regular
basis (roughly every other boot with qemu-system-hppa, less with others).

If anyone has a better idea, please let me know, and I'll be happy to
test it.

 hw/scsi/esp.c         | 40 ++++++++++++++++++++++++++++++++++++++++
 include/hw/scsi/esp.h |  2 ++
 2 files changed, 42 insertions(+)

diff --git a/hw/scsi/esp.c b/hw/scsi/esp.c
index 630d923..6af74bc 100644
--- a/hw/scsi/esp.c
+++ b/hw/scsi/esp.c
@@ -415,6 +415,16 @@ uint64_t esp_reg_read(ESPState *s, uint32_t saddr)
         }
         break;
     case ESP_RINTR:
+        if (s->rintr_saved) {
+            old_val = s->rintr_saved;
+            s->rintr_saved = 0;
+            if (!(s->rregs[ESP_RINTR])) {
+                s->rregs[ESP_RSTAT] &= ~STAT_TC;
+                s->rregs[ESP_RSEQ] = SEQ_CD;
+                esp_lower_irq(s);
+            }
+            return old_val;
+        }
         /* Clear sequence step, interrupt register and all status bits
            except TC */
         old_val = s->rregs[ESP_RINTR];
@@ -429,6 +439,34 @@ uint64_t esp_reg_read(ESPState *s, uint32_t saddr)
         if (!s->tchi_written) {
             return s->chip_id;
         }
+    case ESP_RSTAT:
+        /*
+         * After receiving an interrupt, the guest OS reads
+         * RSTAT, RSEQ, and RINTR. When reading RINTR,
+         * RSTAT and RSEQ are updated. It was observed that
+         * esp_command_complete() and with it esp_dma_done()
+         * was called after the guest OS reads RSTAT, but
+         * before it was able to read RINTR. In other words,
+         * the host would read RINTR associated with the
+         * old value of RSTAT, not the new value. Since RSTAT
+         * was supposed to reflect STAT_ST in this situation,
+         * the host OS got confused, and the operation stalled.
+         * Remedy the situation by caching both ESP_RINTR and
+         * ESP_RSEQ. Return those values until ESP_RINTR is read.
+         * Only do this if an interrupt is pending to limit its
+         * impact.
+         */
+        if (s->rregs[ESP_RSTAT] & STAT_INT) {
+            s->rintr_saved = s->rregs[ESP_RINTR];
+            s->rseq_saved =  s->rregs[ESP_RSEQ];
+            s->rregs[ESP_RINTR] = 0;
+        }
+        break;
+    case ESP_RSEQ:
+        if (s->rintr_saved) {
+            return s->rseq_saved;
+        }
+        break;
     default:
         break;
     }
@@ -577,6 +615,8 @@ const VMStateDescription vmstate_esp = {
     .fields = (VMStateField[]) {
         VMSTATE_BUFFER(rregs, ESPState),
         VMSTATE_BUFFER(wregs, ESPState),
+        VMSTATE_UINT8(rintr_saved, ESPState),
+        VMSTATE_UINT8(rseq_saved, ESPState),
         VMSTATE_INT32(ti_size, ESPState),
         VMSTATE_UINT32(ti_rptr, ESPState),
         VMSTATE_UINT32(ti_wptr, ESPState),
diff --git a/include/hw/scsi/esp.h b/include/hw/scsi/esp.h
index 682a0d2..342f607 100644
--- a/include/hw/scsi/esp.h
+++ b/include/hw/scsi/esp.h
@@ -17,6 +17,8 @@ typedef struct ESPState ESPState;
 struct ESPState {
     uint8_t rregs[ESP_REGS];
     uint8_t wregs[ESP_REGS];
+    uint8_t rintr_saved;
+    uint8_t rseq_saved;
     qemu_irq irq;
     uint8_t chip_id;
     bool tchi_written;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] scsi: esp: Improve consistency of RSTAT, RSEQ, and RINTR
  2018-11-28 21:56 ` [Qemu-devel] [PATCH 2/2] scsi: esp: Improve consistency of RSTAT, RSEQ, and RINTR Guenter Roeck
@ 2018-11-29  9:58   ` Paolo Bonzini
  2018-11-29 11:56     ` Mark Cave-Ayland
  2018-11-29 14:18     ` Guenter Roeck
  0 siblings, 2 replies; 16+ messages in thread
From: Paolo Bonzini @ 2018-11-29  9:58 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: Fam Zheng, qemu-devel

On 28/11/18 22:56, Guenter Roeck wrote:
> The guest OS reads RSTAT, RSEQ, and RINTR, and expects those registers
> to reflect a consistent state. However, it is possible that the registers
> can change after RSTAT was read, but before RINTR is read.
> 
> Guest OS		qemu
> --------		----
> Read RSTAT
> 			esp_command_complete()
> 			 RSTAT = STAT_ST
> 			 esp_dma_done()
> 			  RSTAT |= STAT_TC
> 			  RSEQ = 0
> 			  RINTR = INTR_BS
> 
> Read RSEQ
> Read RINTR		RINTR = 0
> 			RSTAT &= ~STAT_TC
> 			RSEQ = SEQ_CD
> 
> The guest OS would then try to handle INTR_BS combined with an old
> value of RSTAT. This sometimes resulted in lost events, spurious
> interrupts, guest OS confusion, and stalled SCSI operations.

The question is, why was the guest running the interrupt routine before
STAT_INT was set in RSTAT?  The code in esp_raise_irq seems good:

    if (!(s->rregs[ESP_RSTAT] & STAT_INT)) {
        s->rregs[ESP_RSTAT] |= STAT_INT;
        qemu_irq_raise(s->irq);
        trace_esp_raise_irq();
    }

Paolo

> A typical guest error log (observed with various versions of Linux)
> looks as follows.
> 
> scsi host1: Spurious irq, sreg=13.
> ...
> scsi host1: Aborting command [84531f10:2a]
> scsi host1: Current command [f882eea8:35]
> scsi host1: Queued command [84531f10:2a]
> scsi host1:  Active command [f882eea8:35]
> scsi host1: Dumping command log
> scsi host1: ent[15] CMD val[44] sreg[90] seqreg[00] sreg2[00] ireg[20] ss[00] event[0c]
> scsi host1: ent[16] CMD val[01] sreg[90] seqreg[00] sreg2[00] ireg[20] ss[02] event[0c]
> scsi host1: ent[17] CMD val[43] sreg[90] seqreg[00] sreg2[00] ireg[20] ss[02] event[0c]
> scsi host1: ent[18] EVENT val[0d] sreg[92] seqreg[04] sreg2[00] ireg[18] ss[00] event[0c]
> ...

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH 1/2] esp-pci: Fix status register write erase control
  2018-11-28 21:56 [Qemu-devel] [PATCH 1/2] esp-pci: Fix status register write erase control Guenter Roeck
  2018-11-28 21:56 ` [Qemu-devel] [PATCH 2/2] scsi: esp: Improve consistency of RSTAT, RSEQ, and RINTR Guenter Roeck
@ 2018-11-29  9:58 ` Paolo Bonzini
  1 sibling, 0 replies; 16+ messages in thread
From: Paolo Bonzini @ 2018-11-29  9:58 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: Fam Zheng, qemu-devel

On 28/11/18 22:56, Guenter Roeck wrote:
> Per AM53C974 datasheet, definition of "SCSI Bus and Control (SBAC)"
> register:
> 
> Bit 24 – STATUS – Write Erase Control
> 
> This bit controls the Write Erase feature on bits 3:1 and bit 6 of the DMA
> Status Register ((B)+54h). When this bit is programmed to ‘1’, the state
> of bits 3:1 are preserved when read. Bits 3:1 are only cleared when a ‘1’
> is written to the corresponding bit location. For example, to clear bit 1,
> the value of ‘0000_0010b’ should be written to the register. When the DMA
> Status Preserve bit is ‘0’, bits 3:1 are cleared when read.
> 
> The status register is currently defined to bit 12, not bit 24.
> Also, its implementation is reversed: The status is auto-cleared if
> the bit is set to 1, and must be cleared explicitly when the bit is
> set to 0. This results in spurious interrupts reported by the Linux
> kernel, and in some cases even results in stalled SCSI operations.
> 
> Set SBAC_STATUS to bit 24 and reverse the logic to fix the problem.
> 
> Signed-off-by: Guenter Roeck <linux@roeck-us.net>
> ---
>  hw/scsi/esp-pci.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/scsi/esp-pci.c b/hw/scsi/esp-pci.c
> index 419fc66..d956909 100644
> --- a/hw/scsi/esp-pci.c
> +++ b/hw/scsi/esp-pci.c
> @@ -59,7 +59,7 @@
>  #define DMA_STAT_SCSIINT 0x10
>  #define DMA_STAT_BCMBLT  0x20
>  
> -#define SBAC_STATUS 0x1000
> +#define SBAC_STATUS (1 << 24)
>  
>  typedef struct PCIESPState {
>      /*< private >*/
> @@ -136,7 +136,7 @@ static void esp_pci_dma_write(PCIESPState *pci, uint32_t saddr, uint32_t val)
>          pci->dma_regs[saddr] = val;
>          break;
>      case DMA_STAT:
> -        if (!(pci->sbac & SBAC_STATUS)) {
> +        if (pci->sbac & SBAC_STATUS) {
>              /* clear some bits on write */
>              uint32_t mask = DMA_STAT_ERROR | DMA_STAT_ABORT | DMA_STAT_DONE;
>              pci->dma_regs[DMA_STAT] &= ~(val & mask);
> @@ -157,7 +157,7 @@ static uint32_t esp_pci_dma_read(PCIESPState *pci, uint32_t saddr)
>          if (pci->esp.rregs[ESP_RSTAT] & STAT_INT) {
>              val |= DMA_STAT_SCSIINT;
>          }
> -        if (pci->sbac & SBAC_STATUS) {
> +        if (!(pci->sbac & SBAC_STATUS)) {
>              pci->dma_regs[DMA_STAT] &= ~(DMA_STAT_ERROR | DMA_STAT_ABORT |
>                                           DMA_STAT_DONE);
>          }
> 

Queued this one only, for now at least.

Paolo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] scsi: esp: Improve consistency of RSTAT, RSEQ, and RINTR
  2018-11-29  9:58   ` Paolo Bonzini
@ 2018-11-29 11:56     ` Mark Cave-Ayland
  2018-11-29 15:42       ` Guenter Roeck
  2018-11-29 17:38       ` Guenter Roeck
  2018-11-29 14:18     ` Guenter Roeck
  1 sibling, 2 replies; 16+ messages in thread
From: Mark Cave-Ayland @ 2018-11-29 11:56 UTC (permalink / raw)
  To: Paolo Bonzini, Guenter Roeck; +Cc: Fam Zheng, qemu-devel

On 29/11/2018 09:58, Paolo Bonzini wrote:

> On 28/11/18 22:56, Guenter Roeck wrote:
>> The guest OS reads RSTAT, RSEQ, and RINTR, and expects those registers
>> to reflect a consistent state. However, it is possible that the registers
>> can change after RSTAT was read, but before RINTR is read.
>>
>> Guest OS		qemu
>> --------		----
>> Read RSTAT
>> 			esp_command_complete()
>> 			 RSTAT = STAT_ST
>> 			 esp_dma_done()
>> 			  RSTAT |= STAT_TC
>> 			  RSEQ = 0
>> 			  RINTR = INTR_BS
>>
>> Read RSEQ
>> Read RINTR		RINTR = 0
>> 			RSTAT &= ~STAT_TC
>> 			RSEQ = SEQ_CD
>>
>> The guest OS would then try to handle INTR_BS combined with an old
>> value of RSTAT. This sometimes resulted in lost events, spurious
>> interrupts, guest OS confusion, and stalled SCSI operations.
> 
> The question is, why was the guest running the interrupt routine before
> STAT_INT was set in RSTAT?  The code in esp_raise_irq seems good:
> 
>     if (!(s->rregs[ESP_RSTAT] & STAT_INT)) {
>         s->rregs[ESP_RSTAT] |= STAT_INT;
>         qemu_irq_raise(s->irq);
>         trace_esp_raise_irq();
>     }
> 
> Paolo

This patch is very interesting, as I have a long-running regression trying to boot
NextSTEP 3.3 on qemu-system-sparc which I eventually bisected down to the commit that
turned on iothread by default in QEMU.

The symptom is that ESP SCSI requests hang/timeout before the kernel is able to get
to the userspace installer: however if you launch QEMU with "taskset –cpu-list 1
qemu-system-sparc ..." then it works and you can complete the installation.

So certainly this suggests that there is a race condition still present in ESP
somewhere. I've given this patch a spin, and in a few quick tests here I was able to
consistently get further in kernel boot, but it still doesn't completely solve issue
for me :/

ATB,

Mark.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] scsi: esp: Improve consistency of RSTAT, RSEQ, and RINTR
  2018-11-29  9:58   ` Paolo Bonzini
  2018-11-29 11:56     ` Mark Cave-Ayland
@ 2018-11-29 14:18     ` Guenter Roeck
  1 sibling, 0 replies; 16+ messages in thread
From: Guenter Roeck @ 2018-11-29 14:18 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Fam Zheng, qemu-devel

On 11/29/18 1:58 AM, Paolo Bonzini wrote:
> On 28/11/18 22:56, Guenter Roeck wrote:
>> The guest OS reads RSTAT, RSEQ, and RINTR, and expects those registers
>> to reflect a consistent state. However, it is possible that the registers
>> can change after RSTAT was read, but before RINTR is read.
>>
>> Guest OS		qemu
>> --------		----
>> Read RSTAT
>> 			esp_command_complete()
>> 			 RSTAT = STAT_ST
>> 			 esp_dma_done()
>> 			  RSTAT |= STAT_TC
>> 			  RSEQ = 0
>> 			  RINTR = INTR_BS
>>
>> Read RSEQ
>> Read RINTR		RINTR = 0
>> 			RSTAT &= ~STAT_TC
>> 			RSEQ = SEQ_CD
>>
>> The guest OS would then try to handle INTR_BS combined with an old
>> value of RSTAT. This sometimes resulted in lost events, spurious
>> interrupts, guest OS confusion, and stalled SCSI operations.
> 
> The question is, why was the guest running the interrupt routine before
> STAT_INT was set in RSTAT?  The code in esp_raise_irq seems good:
> 

It doesn't. It polls for interrupts, but only enters the interrupt handler
if one was observed. It uses the DMA status (see first patch) for the polling,
presumably to not disturb RSTAT. Qemu produces two interrupts in a row,
the second while the first one is still handled.

Also, it works fine with a real controller (yes, I did buy one, and an
old SCSI drive, to test).

Guenter

>      if (!(s->rregs[ESP_RSTAT] & STAT_INT)) {
>          s->rregs[ESP_RSTAT] |= STAT_INT;
>          qemu_irq_raise(s->irq);
>          trace_esp_raise_irq();
>      }
> 
> Paolo
> 
>> A typical guest error log (observed with various versions of Linux)
>> looks as follows.
>>
>> scsi host1: Spurious irq, sreg=13.
>> ...
>> scsi host1: Aborting command [84531f10:2a]
>> scsi host1: Current command [f882eea8:35]
>> scsi host1: Queued command [84531f10:2a]
>> scsi host1:  Active command [f882eea8:35]
>> scsi host1: Dumping command log
>> scsi host1: ent[15] CMD val[44] sreg[90] seqreg[00] sreg2[00] ireg[20] ss[00] event[0c]
>> scsi host1: ent[16] CMD val[01] sreg[90] seqreg[00] sreg2[00] ireg[20] ss[02] event[0c]
>> scsi host1: ent[17] CMD val[43] sreg[90] seqreg[00] sreg2[00] ireg[20] ss[02] event[0c]
>> scsi host1: ent[18] EVENT val[0d] sreg[92] seqreg[04] sreg2[00] ireg[18] ss[00] event[0c]
>> ...
> 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] scsi: esp: Improve consistency of RSTAT, RSEQ, and RINTR
  2018-11-29 11:56     ` Mark Cave-Ayland
@ 2018-11-29 15:42       ` Guenter Roeck
  2018-11-29 17:38       ` Guenter Roeck
  1 sibling, 0 replies; 16+ messages in thread
From: Guenter Roeck @ 2018-11-29 15:42 UTC (permalink / raw)
  To: Mark Cave-Ayland; +Cc: Paolo Bonzini, Fam Zheng, qemu-devel

On Thu, Nov 29, 2018 at 11:56:39AM +0000, Mark Cave-Ayland wrote:
> On 29/11/2018 09:58, Paolo Bonzini wrote:
> 
> > On 28/11/18 22:56, Guenter Roeck wrote:
> >> The guest OS reads RSTAT, RSEQ, and RINTR, and expects those registers
> >> to reflect a consistent state. However, it is possible that the registers
> >> can change after RSTAT was read, but before RINTR is read.
> >>
> >> Guest OS		qemu
> >> --------		----
> >> Read RSTAT
> >> 			esp_command_complete()
> >> 			 RSTAT = STAT_ST
> >> 			 esp_dma_done()
> >> 			  RSTAT |= STAT_TC
> >> 			  RSEQ = 0
> >> 			  RINTR = INTR_BS
> >>
> >> Read RSEQ
> >> Read RINTR		RINTR = 0
> >> 			RSTAT &= ~STAT_TC
> >> 			RSEQ = SEQ_CD
> >>
> >> The guest OS would then try to handle INTR_BS combined with an old
> >> value of RSTAT. This sometimes resulted in lost events, spurious
> >> interrupts, guest OS confusion, and stalled SCSI operations.
> > 
> > The question is, why was the guest running the interrupt routine before
> > STAT_INT was set in RSTAT?  The code in esp_raise_irq seems good:
> > 
> >     if (!(s->rregs[ESP_RSTAT] & STAT_INT)) {
> >         s->rregs[ESP_RSTAT] |= STAT_INT;
> >         qemu_irq_raise(s->irq);
> >         trace_esp_raise_irq();
> >     }
> > 
> > Paolo
> 
> This patch is very interesting, as I have a long-running regression trying to boot
> NextSTEP 3.3 on qemu-system-sparc which I eventually bisected down to the commit that
> turned on iothread by default in QEMU.
> 
Makes me wonder: Is there a means to hold up processing esp_command_complete()
until a pending interrupt has been handled ? That might be a cleaner solution.
I could try to implement that in the driver, but maybe there is a means to do
that using the infrastructure.

Thanks,
Guenter

> The symptom is that ESP SCSI requests hang/timeout before the kernel is able to get
> to the userspace installer: however if you launch QEMU with "taskset –cpu-list 1
> qemu-system-sparc ..." then it works and you can complete the installation.
> 
> So certainly this suggests that there is a race condition still present in ESP
> somewhere. I've given this patch a spin, and in a few quick tests here I was able to
> consistently get further in kernel boot, but it still doesn't completely solve issue
> for me :/
> 
> 
> ATB,
> 
> Mark.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] scsi: esp: Improve consistency of RSTAT, RSEQ, and RINTR
  2018-11-29 11:56     ` Mark Cave-Ayland
  2018-11-29 15:42       ` Guenter Roeck
@ 2018-11-29 17:38       ` Guenter Roeck
  2018-11-29 17:53         ` Paolo Bonzini
                           ` (2 more replies)
  1 sibling, 3 replies; 16+ messages in thread
From: Guenter Roeck @ 2018-11-29 17:38 UTC (permalink / raw)
  To: Mark Cave-Ayland; +Cc: Paolo Bonzini, Fam Zheng, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 2486 bytes --]

Hi Mark,

On Thu, Nov 29, 2018 at 11:56:39AM +0000, Mark Cave-Ayland wrote:
> On 29/11/2018 09:58, Paolo Bonzini wrote:
> 
> > On 28/11/18 22:56, Guenter Roeck wrote:
> >> The guest OS reads RSTAT, RSEQ, and RINTR, and expects those registers
> >> to reflect a consistent state. However, it is possible that the registers
> >> can change after RSTAT was read, but before RINTR is read.
> >>
> >> Guest OS		qemu
> >> --------		----
> >> Read RSTAT
> >> 			esp_command_complete()
> >> 			 RSTAT = STAT_ST
> >> 			 esp_dma_done()
> >> 			  RSTAT |= STAT_TC
> >> 			  RSEQ = 0
> >> 			  RINTR = INTR_BS
> >>
> >> Read RSEQ
> >> Read RINTR		RINTR = 0
> >> 			RSTAT &= ~STAT_TC
> >> 			RSEQ = SEQ_CD
> >>
> >> The guest OS would then try to handle INTR_BS combined with an old
> >> value of RSTAT. This sometimes resulted in lost events, spurious
> >> interrupts, guest OS confusion, and stalled SCSI operations.
> > 
> > The question is, why was the guest running the interrupt routine before
> > STAT_INT was set in RSTAT?  The code in esp_raise_irq seems good:
> > 
> >     if (!(s->rregs[ESP_RSTAT] & STAT_INT)) {
> >         s->rregs[ESP_RSTAT] |= STAT_INT;
> >         qemu_irq_raise(s->irq);
> >         trace_esp_raise_irq();
> >     }
> > 
> > Paolo
> 
> This patch is very interesting, as I have a long-running regression trying to boot
> NextSTEP 3.3 on qemu-system-sparc which I eventually bisected down to the commit that
> turned on iothread by default in QEMU.
> 
> The symptom is that ESP SCSI requests hang/timeout before the kernel is able to get
> to the userspace installer: however if you launch QEMU with "taskset –cpu-list 1
> qemu-system-sparc ..." then it works and you can complete the installation.
> 
> So certainly this suggests that there is a race condition still present in ESP
> somewhere. I've given this patch a spin, and in a few quick tests here I was able to
> consistently get further in kernel boot, but it still doesn't completely solve issue
> for me :/
> 

Can you try the attached patch ? It is a bit cleaner than the first version,
and works for me as well.

Note that this isn't perfect. Specifically, I see differences in handling
STAT_TC. The controller specification is a bit ambiguous in that regard,
but comparing the qemu code with real controller behavior shows that the
real controller does not reset STAT_TC when reading the interrupt status
register. That doesn't seem to matter for Linux, but it may influence
other guests.

Guenter

[-- Attachment #2: 0001-scsi-esp-Defer-command-completion-until-previous-int.patch --]
[-- Type: text/x-diff, Size: 5406 bytes --]

>From 751edd383f3c70d8a9195345b45aa3eb1ada7553 Mon Sep 17 00:00:00 2001
From: Guenter Roeck <linux@roeck-us.net>
Date: Thu, 29 Nov 2018 09:17:42 -0800
Subject: [PATCH] scsi: esp: Defer command completion until previous interrupts
 have been handled

The guest OS reads RSTAT, RSEQ, and RINTR, and expects those registers
to reflect a consistent state. However, it is possible that the registers
can change after RSTAT was read, but before RINTR is read, when
esp_command_complete() is called.

Guest OS		qemu
--------		----
[handle interrupt]
Read RSTAT
			esp_command_complete()
			 RSTAT = STAT_ST
			 esp_dma_done()
			  RSTAT |= STAT_TC
			  RSEQ = 0
			  RINTR = INTR_BS

Read RSEQ
Read RINTR		RINTR = 0
			RSTAT &= ~STAT_TC
			RSEQ = SEQ_CD

The guest OS would then try to handle INTR_BS combined with an old
value of RSTAT. This sometimes resulted in lost events, spurious
interrupts, guest OS confusion, and stalled SCSI operations.
A typical guest error log (observed with various versions of Linux)
looks as follows.

scsi host1: Spurious irq, sreg=13.
...
scsi host1: Aborting command [84531f10:2a]
scsi host1: Current command [f882eea8:35]
scsi host1: Queued command [84531f10:2a]
scsi host1:  Active command [f882eea8:35]
scsi host1: Dumping command log
scsi host1: ent[15] CMD val[44] sreg[90] seqreg[00] sreg2[00] ireg[20] ss[00] event[0c]
scsi host1: ent[16] CMD val[01] sreg[90] seqreg[00] sreg2[00] ireg[20] ss[02] event[0c]
scsi host1: ent[17] CMD val[43] sreg[90] seqreg[00] sreg2[00] ireg[20] ss[02] event[0c]
scsi host1: ent[18] EVENT val[0d] sreg[92] seqreg[04] sreg2[00] ireg[18] ss[00] event[0c]
...

Defer handling command completion until previous interrupts have been
handled to fix the problem.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
---
This looks much cleaner than the previous patch and at the same time
identifies the problem. Enabling the new trace shows that the condition
is seen on a regular basis (on average 2-3 times per boot test with
qemu-system-hppa if the test does nothing but shut down immediately).

 hw/scsi/esp.c         | 29 ++++++++++++++++++++++++-----
 hw/scsi/trace-events  |  1 +
 include/hw/scsi/esp.h |  2 ++
 3 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/hw/scsi/esp.c b/hw/scsi/esp.c
index 630d923..e3d6cc5 100644
--- a/hw/scsi/esp.c
+++ b/hw/scsi/esp.c
@@ -286,11 +286,8 @@ static void esp_do_dma(ESPState *s)
     esp_dma_done(s);
 }
 
-void esp_command_complete(SCSIRequest *req, uint32_t status,
-                                 size_t resid)
+static void _esp_command_complete(ESPState *s, uint32_t status)
 {
-    ESPState *s = req->hba_private;
-
     trace_esp_command_complete();
     if (s->ti_size != 0) {
         trace_esp_command_complete_unexpected();
@@ -311,6 +308,23 @@ void esp_command_complete(SCSIRequest *req, uint32_t status,
     }
 }
 
+void esp_command_complete(SCSIRequest *req, uint32_t status,
+                          size_t resid)
+{
+    ESPState *s = req->hba_private;
+
+    if (s->rregs[ESP_RSTAT] & STAT_INT) {
+        /* Defer handling command complete until the previous
+         * interrupt has been handled.
+         */
+        trace_esp_command_complete_deferred();
+        s->deferred_status = status;
+        s->deferred_complete = true;
+        return;
+    }
+    _esp_command_complete(s, status);
+}
+
 void esp_transfer_data(SCSIRequest *req, uint32_t len)
 {
     ESPState *s = req->hba_private;
@@ -422,7 +436,10 @@ uint64_t esp_reg_read(ESPState *s, uint32_t saddr)
         s->rregs[ESP_RSTAT] &= ~STAT_TC;
         s->rregs[ESP_RSEQ] = SEQ_CD;
         esp_lower_irq(s);
-
+        if (s->deferred_complete) {
+            _esp_command_complete(s, s->deferred_status);
+            s->deferred_complete = false;
+        }
         return old_val;
     case ESP_TCHI:
         /* Return the unique id if the value has never been written */
@@ -582,6 +599,8 @@ const VMStateDescription vmstate_esp = {
         VMSTATE_UINT32(ti_wptr, ESPState),
         VMSTATE_BUFFER(ti_buf, ESPState),
         VMSTATE_UINT32(status, ESPState),
+        VMSTATE_UINT32(deferred_status, ESPState),
+        VMSTATE_BOOL(deferred_complete, ESPState),
         VMSTATE_UINT32(dma, ESPState),
         VMSTATE_PARTIAL_BUFFER(cmdbuf, ESPState, 16),
         VMSTATE_BUFFER_START_MIDDLE_V(cmdbuf, ESPState, 16, 4),
diff --git a/hw/scsi/trace-events b/hw/scsi/trace-events
index 0fb6a99..2fe8a7c 100644
--- a/hw/scsi/trace-events
+++ b/hw/scsi/trace-events
@@ -167,6 +167,7 @@ esp_handle_satn_stop(uint32_t cmdlen) "cmdlen %d"
 esp_write_response(uint32_t status) "Transfer status (status=%d)"
 esp_do_dma(uint32_t cmdlen, uint32_t len) "command len %d + %d"
 esp_command_complete(void) "SCSI Command complete"
+esp_command_complete_deferred(void) "SCSI Command complete deferred"
 esp_command_complete_unexpected(void) "SCSI command completed unexpectedly"
 esp_command_complete_fail(void) "Command failed"
 esp_transfer_data(uint32_t dma_left, int32_t ti_size) "transfer %d/%d"
diff --git a/include/hw/scsi/esp.h b/include/hw/scsi/esp.h
index 682a0d2..adab63d 100644
--- a/include/hw/scsi/esp.h
+++ b/include/hw/scsi/esp.h
@@ -23,6 +23,8 @@ struct ESPState {
     int32_t ti_size;
     uint32_t ti_rptr, ti_wptr;
     uint32_t status;
+    uint32_t deferred_status;
+    bool deferred_complete;
     uint32_t dma;
     uint8_t ti_buf[TI_BUFSZ];
     SCSIBus bus;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] scsi: esp: Improve consistency of RSTAT, RSEQ, and RINTR
  2018-11-29 17:38       ` Guenter Roeck
@ 2018-11-29 17:53         ` Paolo Bonzini
  2018-11-29 18:07         ` Mark Cave-Ayland
  2018-11-29 18:34         ` Mark Cave-Ayland
  2 siblings, 0 replies; 16+ messages in thread
From: Paolo Bonzini @ 2018-11-29 17:53 UTC (permalink / raw)
  To: Guenter Roeck, Mark Cave-Ayland; +Cc: Fam Zheng, qemu-devel

On 29/11/18 18:38, Guenter Roeck wrote:
> This looks much cleaner than the previous patch and at the same time
> identifies the problem. Enabling the new trace shows that the condition
> is seen on a regular basis (on average 2-3 times per boot test with
> qemu-system-hppa if the test does nothing but shut down immediately).

Yes, this looks very nice.  I'll queue it if Mark reports that it works,
with just s/_esp_command_complete/esp_report_command_complete/.

Paolo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] scsi: esp: Improve consistency of RSTAT, RSEQ, and RINTR
  2018-11-29 17:38       ` Guenter Roeck
  2018-11-29 17:53         ` Paolo Bonzini
@ 2018-11-29 18:07         ` Mark Cave-Ayland
  2018-11-29 19:00           ` Guenter Roeck
  2018-11-29 18:34         ` Mark Cave-Ayland
  2 siblings, 1 reply; 16+ messages in thread
From: Mark Cave-Ayland @ 2018-11-29 18:07 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: Paolo Bonzini, Fam Zheng, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1706 bytes --]

On 29/11/2018 17:38, Guenter Roeck wrote:

>> This patch is very interesting, as I have a long-running regression trying to boot
>> NextSTEP 3.3 on qemu-system-sparc which I eventually bisected down to the commit that
>> turned on iothread by default in QEMU.
>>
>> The symptom is that ESP SCSI requests hang/timeout before the kernel is able to get
>> to the userspace installer: however if you launch QEMU with "taskset –cpu-list 1
>> qemu-system-sparc ..." then it works and you can complete the installation.
>>
>> So certainly this suggests that there is a race condition still present in ESP
>> somewhere. I've given this patch a spin, and in a few quick tests here I was able to
>> consistently get further in kernel boot, but it still doesn't completely solve issue
>> for me :/
>>
> 
> Can you try the attached patch ? It is a bit cleaner than the first version,
> and works for me as well.
> 
> Note that this isn't perfect. Specifically, I see differences in handling
> STAT_TC. The controller specification is a bit ambiguous in that regard,
> but comparing the qemu code with real controller behavior shows that the
> real controller does not reset STAT_TC when reading the interrupt status
> register. That doesn't seem to matter for Linux, but it may influence
> other guests.

Hi Guenter,

Thanks for the patch. I just gave it a quick test, and unfortunately my NextSTEP ISO
still hangs in the same place on boot :(

Not sure if it helps, but attached is a simple trace backend log from "-trace 'esp*'"
from startup all the way to the point where the kernel hangs on boot whilst
enumerating the SCSI bus (it does seem to hang at random points in the bus
enumeration process).


ATB,

Mark.

[-- Attachment #2: trace-16358.txt.xz --]
[-- Type: application/x-xz, Size: 149840 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] scsi: esp: Improve consistency of RSTAT, RSEQ, and RINTR
  2018-11-29 17:38       ` Guenter Roeck
  2018-11-29 17:53         ` Paolo Bonzini
  2018-11-29 18:07         ` Mark Cave-Ayland
@ 2018-11-29 18:34         ` Mark Cave-Ayland
  2018-11-29 19:07           ` Guenter Roeck
  2 siblings, 1 reply; 16+ messages in thread
From: Mark Cave-Ayland @ 2018-11-29 18:34 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: Paolo Bonzini, Fam Zheng, qemu-devel

On 29/11/2018 17:38, Guenter Roeck wrote:

> Can you try the attached patch ? It is a bit cleaner than the first version,
> and works for me as well.
> 
> Note that this isn't perfect. Specifically, I see differences in handling
> STAT_TC. The controller specification is a bit ambiguous in that regard,
> but comparing the qemu code with real controller behavior shows that the
> real controller does not reset STAT_TC when reading the interrupt status
> register. That doesn't seem to matter for Linux, but it may influence
> other guests.

I've now completed a boot test of all my SPARC32 OpenBIOS CDROM images with this
patch, and whilst it doesn't solve my NextSTEP issue, I don't see any obvious
regressions.

Note that NetBSD SPARC32 tends to spit out the occasional "!TC on data xfer" message
to the console during periods of disk access, however that is something that has
always happened and isn't something new introduced by this patch.


ATB,

Mark.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] scsi: esp: Improve consistency of RSTAT, RSEQ, and RINTR
  2018-11-29 18:07         ` Mark Cave-Ayland
@ 2018-11-29 19:00           ` Guenter Roeck
  2018-11-29 19:33             ` Mark Cave-Ayland
  0 siblings, 1 reply; 16+ messages in thread
From: Guenter Roeck @ 2018-11-29 19:00 UTC (permalink / raw)
  To: Mark Cave-Ayland; +Cc: Paolo Bonzini, Fam Zheng, qemu-devel

On Thu, Nov 29, 2018 at 06:07:05PM +0000, Mark Cave-Ayland wrote:
> On 29/11/2018 17:38, Guenter Roeck wrote:
> 
> >> This patch is very interesting, as I have a long-running regression trying to boot
> >> NextSTEP 3.3 on qemu-system-sparc which I eventually bisected down to the commit that
> >> turned on iothread by default in QEMU.
> >>
> >> The symptom is that ESP SCSI requests hang/timeout before the kernel is able to get
> >> to the userspace installer: however if you launch QEMU with "taskset –cpu-list 1
> >> qemu-system-sparc ..." then it works and you can complete the installation.
> >>
> >> So certainly this suggests that there is a race condition still present in ESP
> >> somewhere. I've given this patch a spin, and in a few quick tests here I was able to
> >> consistently get further in kernel boot, but it still doesn't completely solve issue
> >> for me :/
> >>
> > 
> > Can you try the attached patch ? It is a bit cleaner than the first version,
> > and works for me as well.
> > 
> > Note that this isn't perfect. Specifically, I see differences in handling
> > STAT_TC. The controller specification is a bit ambiguous in that regard,
> > but comparing the qemu code with real controller behavior shows that the
> > real controller does not reset STAT_TC when reading the interrupt status
> > register. That doesn't seem to matter for Linux, but it may influence
> > other guests.
> 
> Hi Guenter,
> 
> Thanks for the patch. I just gave it a quick test, and unfortunately my NextSTEP ISO
> still hangs in the same place on boot :(
> 
Too bad. Is it "same place" as with the first version of the patch, or
"same place" as in upstream qemu ? That might be important, as the two patch
versions behave differently (one caches RSTAT/RINTR/RSEQ, one defers command
complete handling).

> Not sure if it helps, but attached is a simple trace backend log from "-trace 'esp*'"
> from startup all the way to the point where the kernel hangs on boot whilst
> enumerating the SCSI bus (it does seem to hang at random points in the bus
> enumeration process).
> 
This is interesting; yours seems to be a different problem. I don't see any
command_complete_deferred traces in your log. I also don't see any suspicious
activity between esp_raise_irq and esp_lower_irq.

Can you try tracing in singlethreaded mode ? Maybe that can help us finding
the difference.

Thanks,
Guenter

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] scsi: esp: Improve consistency of RSTAT, RSEQ, and RINTR
  2018-11-29 18:34         ` Mark Cave-Ayland
@ 2018-11-29 19:07           ` Guenter Roeck
  2018-11-29 19:38             ` Mark Cave-Ayland
  0 siblings, 1 reply; 16+ messages in thread
From: Guenter Roeck @ 2018-11-29 19:07 UTC (permalink / raw)
  To: Mark Cave-Ayland; +Cc: Paolo Bonzini, Fam Zheng, qemu-devel

On Thu, Nov 29, 2018 at 06:34:54PM +0000, Mark Cave-Ayland wrote:
> On 29/11/2018 17:38, Guenter Roeck wrote:
> 
> > Can you try the attached patch ? It is a bit cleaner than the first version,
> > and works for me as well.
> > 
> > Note that this isn't perfect. Specifically, I see differences in handling
> > STAT_TC. The controller specification is a bit ambiguous in that regard,
> > but comparing the qemu code with real controller behavior shows that the
> > real controller does not reset STAT_TC when reading the interrupt status
> > register. That doesn't seem to matter for Linux, but it may influence
> > other guests.
> 
> I've now completed a boot test of all my SPARC32 OpenBIOS CDROM images with this
> patch, and whilst it doesn't solve my NextSTEP issue, I don't see any obvious
> regressions.
> 
> Note that NetBSD SPARC32 tends to spit out the occasional "!TC on data xfer" message
> to the console during periods of disk access, however that is something that has
> always happened and isn't something new introduced by this patch.
> 

That may be because reading the interrupt status resets the TC bit.
As mentioned above, I think it shouldn't do that. Just a wild guess, but
it might be worth a try. Can you remove "s->rregs[ESP_RSTAT] &= ~STAT_TC;"
from the ESP_RINTR case in esp_reg_read() and see what happens ?

[That may expose situations where STAT_TC _should_ be cleared but isn't,
 so we may hit other problems when doing that.]

Thanks,
Guenter

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] scsi: esp: Improve consistency of RSTAT, RSEQ, and RINTR
  2018-11-29 19:00           ` Guenter Roeck
@ 2018-11-29 19:33             ` Mark Cave-Ayland
  2018-11-29 21:26               ` Guenter Roeck
  0 siblings, 1 reply; 16+ messages in thread
From: Mark Cave-Ayland @ 2018-11-29 19:33 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: Paolo Bonzini, Fam Zheng, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1341 bytes --]

On 29/11/2018 19:00, Guenter Roeck wrote:

>> Thanks for the patch. I just gave it a quick test, and unfortunately my NextSTEP ISO
>> still hangs in the same place on boot :(
>>
> Too bad. Is it "same place" as with the first version of the patch, or
> "same place" as in upstream qemu ? That might be important, as the two patch
> versions behave differently (one caches RSTAT/RINTR/RSEQ, one defers command
> complete handling).

I'd say same place as in upstream QEMU, although again the exact location does vary
so I have to keep running it several times to get a feel for whether or not it is in
improvement.

>> Not sure if it helps, but attached is a simple trace backend log from "-trace 'esp*'"
>> from startup all the way to the point where the kernel hangs on boot whilst
>> enumerating the SCSI bus (it does seem to hang at random points in the bus
>> enumeration process).
>>
> This is interesting; yours seems to be a different problem. I don't see any
> command_complete_deferred traces in your log. I also don't see any suspicious
> activity between esp_raise_irq and esp_lower_irq.
> 
> Can you try tracing in singlethreaded mode ? Maybe that can help us finding
> the difference.

Attached are the "-trace 'esp*' -trace 'scsi*'" outputs for a bad (normal) boot and
good (single threaded) boot for comparison.


ATB,

Mark.

[-- Attachment #2: trace-bad.txt.xz --]
[-- Type: application/x-xz, Size: 201704 bytes --]

[-- Attachment #3: trace-good.txt.xz --]
[-- Type: application/x-xz, Size: 231548 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] scsi: esp: Improve consistency of RSTAT, RSEQ, and RINTR
  2018-11-29 19:07           ` Guenter Roeck
@ 2018-11-29 19:38             ` Mark Cave-Ayland
  0 siblings, 0 replies; 16+ messages in thread
From: Mark Cave-Ayland @ 2018-11-29 19:38 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: Paolo Bonzini, Fam Zheng, qemu-devel

On 29/11/2018 19:07, Guenter Roeck wrote:

>> I've now completed a boot test of all my SPARC32 OpenBIOS CDROM images with this
>> patch, and whilst it doesn't solve my NextSTEP issue, I don't see any obvious
>> regressions.
>>
>> Note that NetBSD SPARC32 tends to spit out the occasional "!TC on data xfer" message
>> to the console during periods of disk access, however that is something that has
>> always happened and isn't something new introduced by this patch.
>>
> 
> That may be because reading the interrupt status resets the TC bit.
> As mentioned above, I think it shouldn't do that. Just a wild guess, but
> it might be worth a try. Can you remove "s->rregs[ESP_RSTAT] &= ~STAT_TC;"
> from the ESP_RINTR case in esp_reg_read() and see what happens ?
> 
> [That may expose situations where STAT_TC _should_ be cleared but isn't,
>  so we may hit other problems when doing that.]

I've tried that, and over a number of boots it does seem to allow the boot to get
further: with this change I'd say around 95% of the time NextSTEP now gets past the
SCSI bus enumeration and hangs at the point where it tries to mount the root filesystem.

Interestingly enough even if I comment out that line I still see the "!TC on data
xfer" messages appearing on the NetBSD console...


ATB,

Mark.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] scsi: esp: Improve consistency of RSTAT, RSEQ, and RINTR
  2018-11-29 19:33             ` Mark Cave-Ayland
@ 2018-11-29 21:26               ` Guenter Roeck
  0 siblings, 0 replies; 16+ messages in thread
From: Guenter Roeck @ 2018-11-29 21:26 UTC (permalink / raw)
  To: Mark Cave-Ayland; +Cc: Paolo Bonzini, Fam Zheng, qemu-devel

On Thu, Nov 29, 2018 at 07:33:15PM +0000, Mark Cave-Ayland wrote:
> On 29/11/2018 19:00, Guenter Roeck wrote:
> 
> >> Thanks for the patch. I just gave it a quick test, and unfortunately my NextSTEP ISO
> >> still hangs in the same place on boot :(
> >>
> > Too bad. Is it "same place" as with the first version of the patch, or
> > "same place" as in upstream qemu ? That might be important, as the two patch
> > versions behave differently (one caches RSTAT/RINTR/RSEQ, one defers command
> > complete handling).
> 
> I'd say same place as in upstream QEMU, although again the exact location does vary
> so I have to keep running it several times to get a feel for whether or not it is in
> improvement.
> 
> >> Not sure if it helps, but attached is a simple trace backend log from "-trace 'esp*'"
> >> from startup all the way to the point where the kernel hangs on boot whilst
> >> enumerating the SCSI bus (it does seem to hang at random points in the bus
> >> enumeration process).
> >>
> > This is interesting; yours seems to be a different problem. I don't see any
> > command_complete_deferred traces in your log. I also don't see any suspicious
> > activity between esp_raise_irq and esp_lower_irq.
> > 
> > Can you try tracing in singlethreaded mode ? Maybe that can help us finding
> > the difference.
> 
> Attached are the "-trace 'esp*' -trace 'scsi*'" outputs for a bad (normal) boot and
> good (single threaded) boot for comparison.
> 

This is really weird. The "bad" log just stops. Up to that point, the "good"
log is virtually identical (liteally up to line 83602 in both logs). Is it
possible that there is some qemu-internal race condition that has nothing
to do with esp itself, one that causes qemu to lock up ?

Thanks,
Guenter

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2018-11-29 21:27 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-28 21:56 [Qemu-devel] [PATCH 1/2] esp-pci: Fix status register write erase control Guenter Roeck
2018-11-28 21:56 ` [Qemu-devel] [PATCH 2/2] scsi: esp: Improve consistency of RSTAT, RSEQ, and RINTR Guenter Roeck
2018-11-29  9:58   ` Paolo Bonzini
2018-11-29 11:56     ` Mark Cave-Ayland
2018-11-29 15:42       ` Guenter Roeck
2018-11-29 17:38       ` Guenter Roeck
2018-11-29 17:53         ` Paolo Bonzini
2018-11-29 18:07         ` Mark Cave-Ayland
2018-11-29 19:00           ` Guenter Roeck
2018-11-29 19:33             ` Mark Cave-Ayland
2018-11-29 21:26               ` Guenter Roeck
2018-11-29 18:34         ` Mark Cave-Ayland
2018-11-29 19:07           ` Guenter Roeck
2018-11-29 19:38             ` Mark Cave-Ayland
2018-11-29 14:18     ` Guenter Roeck
2018-11-29  9:58 ` [Qemu-devel] [PATCH 1/2] esp-pci: Fix status register write erase control Paolo Bonzini

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.