All of lore.kernel.org
 help / color / mirror / Atom feed
* DMA Faults with XHCI driver
@ 2022-06-24 14:38 Jörg Rödel
  2022-06-27 10:52 ` Mathias Nyman
  0 siblings, 1 reply; 5+ messages in thread
From: Jörg Rödel @ 2022-06-24 14:38 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: linux-usb, Robin Murphy, Jon Grimm, Suthikulpanit, Suravee

Hi Mathias,

here is a report about something strange happening on my system after a
recent IOMMU change. I am starting to see this message at boot:

	xhci_hcd 0000:02:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0xff00ffffffefe000 flags=0x0000]

It means that the XHCI device tried a DMA access at address
0xff00ffffffefe000, which was not mapped in the IOMMU page table.

Devices attached to that XHCI controller will not work after that
message.

There is a related change in the IOMMU code which uncovered this, the
change basically lets the IOMMU dma-allocator not allocate below 4GB by
default, but use the whole space covered by the DMA mask.

To better track this down I limited the DMA-space to 48 bits, and the
message still shows up.

I think this might be a problem in the XHCI driver, e.g. it might mangle
an allocated DMA address somehow if it is bigger than 32 bit.

The device behind 0000:02:00.0 is a

	02:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] Device 43d0 (rev 01)

Please let me know what I can do to help tracking this down.

Regards,

-- 
Jörg Rödel
jroedel@suse.de

SUSE Software Solutions Germany GmbH
Frankenstraße 146
90461 Nürnberg
Germany

(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: DMA Faults with XHCI driver
  2022-06-24 14:38 DMA Faults with XHCI driver Jörg Rödel
@ 2022-06-27 10:52 ` Mathias Nyman
  2022-06-28  7:51   ` Jörg Rödel
  0 siblings, 1 reply; 5+ messages in thread
From: Mathias Nyman @ 2022-06-27 10:52 UTC (permalink / raw)
  To: Jörg Rödel, Mathias Nyman
  Cc: linux-usb, Robin Murphy, Jon Grimm, Suthikulpanit, Suravee

Hi

On 24.6.2022 17.38, Jörg Rödel wrote:
> Hi Mathias,
> 
> here is a report about something strange happening on my system after a
> recent IOMMU change. I am starting to see this message at boot:
> 
> 	xhci_hcd 0000:02:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0xff00ffffffefe000 flags=0x0000]
> 
> It means that the XHCI device tried a DMA access at address
> 0xff00ffffffefe000, which was not mapped in the IOMMU page table.
> 
> Devices attached to that XHCI controller will not work after that
> message.
> 
> There is a related change in the IOMMU code which uncovered this, the
> change basically lets the IOMMU dma-allocator not allocate below 4GB by
> default, but use the whole space covered by the DMA mask.
> 
> To better track this down I limited the DMA-space to 48 bits, and the
> message still shows up.
> 
> I think this might be a problem in the XHCI driver, e.g. it might mangle
> an allocated DMA address somehow if it is bigger than 32 bit.
> 
> The device behind 0000:02:00.0 is a
> 
> 	02:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] Device 43d0 (rev 01)
> 
> Please let me know what I can do to help tracking this down.

Thanks for reporting this.

Can you boot with xhci dynamic debug and tracing enabled? could help pinpoint
when the controller is trying to access the unmapped DMA address.

Add to kernel cmdline:
xhci_hcd.dyndbg=+p trace_event=xhci-hcd trace_buf_size=80M
<boot>
mount -t debugfs none /sys/kernel/debug
Send output of dmesg
Send content of /sys/kernel/debug/tracing/trace

Also if you could dump the content of following registers:
cat /sys/kernel/debug/usb/xhci/<pci address>/reg-op
cat /sys/kernel/debug/usb/xhci/<pci address>/reg-runtime

xhci driver writes dma addresses it allocated for the host into
some of those registers

Thanks
Mathias

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: DMA Faults with XHCI driver
  2022-06-27 10:52 ` Mathias Nyman
@ 2022-06-28  7:51   ` Jörg Rödel
  2022-06-29 14:39     ` Mathias Nyman
  0 siblings, 1 reply; 5+ messages in thread
From: Jörg Rödel @ 2022-06-28  7:51 UTC (permalink / raw)
  To: Mathias Nyman
  Cc: Mathias Nyman, linux-usb, Robin Murphy, Jon Grimm, Suthikulpanit,
	Suravee

[-- Attachment #1: Type: text/plain, Size: 993 bytes --]

Hi Mathias,

thanks a lot for your reply.

On Mon, Jun 27, 2022 at 01:52:44PM +0300, Mathias Nyman wrote:
> Add to kernel cmdline:
> xhci_hcd.dyndbg=+p trace_event=xhci-hcd trace_buf_size=80M
> <boot>
> mount -t debugfs none /sys/kernel/debug
> Send output of dmesg
> Send content of /sys/kernel/debug/tracing/trace
> 
> Also if you could dump the content of following registers:
> cat /sys/kernel/debug/usb/xhci/<pci address>/reg-op
> cat /sys/kernel/debug/usb/xhci/<pci address>/reg-runtime
> 
> xhci driver writes dma addresses it allocated for the host into
> some of those registers

The data you requested is in the attached archive. The trace was
actually empty. My kernel has tracing enabled in general, do I need to
enable more config options to get the trace?

Regards,

-- 
Jörg Rödel
jroedel@suse.de

SUSE Software Solutions Germany GmbH
Frankenstraße 146
90461 Nürnberg
Germany

(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman


[-- Attachment #2: xhci.tar.xz --]
[-- Type: application/x-xz, Size: 29220 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: DMA Faults with XHCI driver
  2022-06-28  7:51   ` Jörg Rödel
@ 2022-06-29 14:39     ` Mathias Nyman
  2022-07-04 15:54       ` Jörg Rödel
  0 siblings, 1 reply; 5+ messages in thread
From: Mathias Nyman @ 2022-06-29 14:39 UTC (permalink / raw)
  To: Jörg Rödel
  Cc: Mathias Nyman, linux-usb, Robin Murphy, Jon Grimm, Suthikulpanit,
	Suravee

On 28.6.2022 10.51, Jörg Rödel wrote:
> Hi Mathias,
> 
> thanks a lot for your reply.
> 
> On Mon, Jun 27, 2022 at 01:52:44PM +0300, Mathias Nyman wrote:
>> Add to kernel cmdline:
>> xhci_hcd.dyndbg=+p trace_event=xhci-hcd trace_buf_size=80M
>> <boot>
>> mount -t debugfs none /sys/kernel/debug
>> Send output of dmesg
>> Send content of /sys/kernel/debug/tracing/trace
>>
>> Also if you could dump the content of following registers:
>> cat /sys/kernel/debug/usb/xhci/<pci address>/reg-op
>> cat /sys/kernel/debug/usb/xhci/<pci address>/reg-runtime
>>
>> xhci driver writes dma addresses it allocated for the host into
>> some of those registers
> 
> The data you requested is in the attached archive. The trace was
> actually empty. My kernel has tracing enabled in general, do I need to
> enable more config options to get the trace?
> 

Thanks, not sure about which tracing option is missing.

You might be right about mangling DMA addresses above 32bit.
At least that address looks a lot like the command ring DMA address with a
higher byte zeroed.

Driver allocates and maps memory for the command ring at DMA 0xffffffffffefe000:

[    6.698051] xhci_hcd 0000:02:00.0: First segment DMA is 0xffffffffffefe000
[    6.698053] xhci_hcd 0000:02:00.0: // Setting command ring address to 0xffffffffffefe001

That address is stored in a 64 bit CRCR register (mmio at xhci->op_regs->cmd_ring)
by writing low 32 bits first, then high 32 bits.

The AMD-Vi IO_PAGE_FAULT for accessing 0xff00ffffffefe000 comes right after driver
asked xHC to process commands from the command ring (the Ding dong! message)
xHC device reads the commands from the DMA address written to the CRCR register.
The faulty address has the same lower 32 bits as the command ring we allocated.

[    6.880028] xhci_hcd 0000:02:00.0: // Ding dong!
[    6.880045] xhci_hcd 0000:02:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0xff00ffffffefe000 flags=0x0000]

Can't find a reason for zeroed bits in the upper 32 bit part.
Looks like we always write all 64 bits to CRCR register.

Maybe a race where command ring is started while writing high 32 bits to CRCR?
Writing address bits to CRCR register are ignored if command ring is running.

Also looks like roothub might be runtime suspending just before the issue.

Anyway, maybe flushing the CRCR register by reading it back after writing it would help:

---

diff --git a/drivers/usb/host/xhci-mem.c b/drivers/usb/host/xhci-mem.c
index 8c19e151a945..e54829898a2c 100644
--- a/drivers/usb/host/xhci-mem.c
+++ b/drivers/usb/host/xhci-mem.c
@@ -2473,6 +2473,8 @@ int xhci_mem_init(struct xhci_hcd *xhci, gfp_t flags)
                         "// Setting command ring address to 0x%016llx", val_64);
         xhci_write_64(xhci, val_64, &xhci->op_regs->cmd_ring);
  
+       xhci_read_64(xhci, &xhci->op_regs->cmd_ring);
+
         /* Reserve one command ring TRB for disabling LPM.
          * Since the USB core grabs the shared usb_bus bandwidth mutex before
          * disabling LPM, we only need to reserve one TRB for all devices.
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 46d0b9ad6f74..721715ba5085 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -390,6 +390,9 @@ static int xhci_abort_cmd_ring(struct xhci_hcd *xhci, unsigned long flags)
         crcr = xhci_trb_virt_to_dma(new_seg, new_deq);
         xhci_write_64(xhci, crcr | CMD_RING_ABORT, &xhci->op_regs->cmd_ring);
  
+       xhci_read_64(xhci, &xhci->op_regs->cmd_ring);
+
         /* Section 4.6.1.2 of xHCI 1.0 spec says software should also time the
          * completion of the Command Abort operation. If CRR is not negated in 5
          * seconds then driver handles it as if host died (-ENODEV).
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 9ac56e9ffc64..6180a721693d 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -854,6 +854,8 @@ static void xhci_set_cmd_ring_deq(struct xhci_hcd *xhci)
                         "// Setting command ring address to 0x%llx",
                         (long unsigned long) val_64);
         xhci_write_64(xhci, val_64, &xhci->op_regs->cmd_ring);
+
+       xhci_read_64(xhci, &xhci->op_regs->cmd_ring);
  }
  
  /*


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: DMA Faults with XHCI driver
  2022-06-29 14:39     ` Mathias Nyman
@ 2022-07-04 15:54       ` Jörg Rödel
  0 siblings, 0 replies; 5+ messages in thread
From: Jörg Rödel @ 2022-07-04 15:54 UTC (permalink / raw)
  To: Mathias Nyman
  Cc: Mathias Nyman, linux-usb, Robin Murphy, Jon Grimm, Suthikulpanit,
	Suravee

Hi Mathias,

On Wed, Jun 29, 2022 at 05:39:49PM +0300, Mathias Nyman wrote:
> Anyway, maybe flushing the CRCR register by reading it back after writing it would help:

Thanks for your help and sorry for the delay. I tried the patch but it
didn't help, the io fault still appears and the device does not work.

I added some more debugging, actually printing the cmd_ring address read
by the calls you patch adds. Unfortunately they all returned 0, looks
like the register is write-only RAZ on this hardware.

Is it possible that the device actually can't handle a 64 bit DMA mask
and needs a smaller one, say 40 bits?

Regards,

-- 
Jörg Rödel
jroedel@suse.de

SUSE Software Solutions Germany GmbH
Frankenstraße 146
90461 Nürnberg
Germany

(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2022-07-04 15:54 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-24 14:38 DMA Faults with XHCI driver Jörg Rödel
2022-06-27 10:52 ` Mathias Nyman
2022-06-28  7:51   ` Jörg Rödel
2022-06-29 14:39     ` Mathias Nyman
2022-07-04 15:54       ` Jörg Rödel

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.