* Serious AMD-Vi issue
From: Elliott Mitchell @ 2024-01-25 20:24 UTC (permalink / raw)
  To: xen-devel; +Cc: Jan Beulich, Andrew Cooper

Apparently this was first noticed with 4.14, but more recently I've been
able to reproduce the issue:

https://bugs.debian.org/988477

The original observation features MD-RAID1 using a pair of Samsung
SATA-attached flash devices.  The main line shows up in `xl dmesg`:

(XEN) AMD-Vi: IO_PAGE_FAULT: DDDD:bb:dd.f d0 addr ffffff???????000 flags 0x8 I

Where the device points at the SATA controller.  I've ended up
reproducing this with some noticeable differences.

A major goal of RAID is to have different devices fail at different
times.  Hence my initial run had a Samsung device plus a device from
another reputable flash manufacturer.

I initially noticed this due to messages in domain 0's dmesg about
errors from the SATA device.  It wasn't until rather later that I noticed
the IOMMU warnings in Xen's dmesg (perhaps post-domain 0 messages should
be duplicated into domain 0's dmesg?).

All of the failures consistently pointed at the Samsung device.  Due to
the expectation it would fail first (lower quality offering with
lesser guarantees), I proceeded to replace it with an NVMe device.

With some monitoring I discovered the NVMe device was now triggering
IOMMU errors, though not nearly as many as the Samsung SATA device did.
As such, it looks like AMD-Vi plus MD-RAID1 is exposing some sort of
IOMMU issue with Xen.
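
For anyone wanting to compare fault counts per device, a minimal sketch
of the kind of tally used for monitoring follows.  It assumes the
IO_PAGE_FAULT line format quoted earlier; the field widths in the regular
expression and the name `count_faults` are assumptions for illustration,
not anything provided by Xen.

```python
import re
from collections import Counter

# Matches the line format quoted from `xl dmesg` above.  The exact field
# widths (4-digit segment, 2-digit bus and device) are assumptions.
FAULT_RE = re.compile(
    r"\(XEN\) AMD-Vi: IO_PAGE_FAULT: "
    r"(?P<sbdf>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]) "
    r"d(?P<domain>\d+) addr (?P<addr>[0-9a-fA-F]+) "
    r"flags (?P<flags>0x[0-9a-fA-F]+)"
)


def count_faults(log_text):
    """Return a Counter mapping PCI device (seg:bus:dev.fn) to fault count."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            counts[match.group("sbdf")] += 1
    return counts
```

Feeding it the saved output of `xl dmesg` and printing
`count_faults(text).most_common()` lists the devices with the most faults
first, which makes it easy to see whether the SATA controller or the NVMe
device dominates.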


All I can do is offer speculation about the underlying cause.  There
does seem to be a pattern of higher-performance flash storage devices
being more severely affected.

I was speculating about the issue being the MD-RAID1 driver abusing
Linux's DMA infrastructure in some fashion.

Upon further consideration, I'm wondering if this is perhaps a latency
issue.  I imagine there is some sort of flush after the IOMMU tables are
modified.  Perhaps the Samsung SATA (and all NVMe) devices are trying to
execute commands before reloading of the IOMMU tables is complete.


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445



