From: Elliott Mitchell <ehem+xen@m5p.com>
To: Jan Beulich <jbeulich@suse.com>
Cc: xen-devel@lists.xenproject.org,
	"Andrew Cooper" <andrew.cooper3@citrix.com>,
	"Roger Pau Monné" <roger.pau@citrix.com>, "Wei Liu" <wl@xen.org>,
	"Kelly Choi" <kelly.choi@cloud.com>
Subject: Re: Serious AMD-Vi(?) issue
Date: Mon, 25 Mar 2024 14:43:44 -0700	[thread overview]
Message-ID: <ZgHwEGCsCLHiYU5J@mattapan.m5p.com> (raw)
In-Reply-To: <e9b1c9c4-523b-481b-946e-37c7c18ea1d2@suse.com>

On Mon, Mar 25, 2024 at 08:55:56AM +0100, Jan Beulich wrote:
> On 22.03.2024 20:22, Elliott Mitchell wrote:
> > On Fri, Mar 22, 2024 at 04:41:45PM +0000, Kelly Choi wrote:
> >>
> >> I can see you've recently engaged with our community with some issues you'd
> >> like help with.
> >> We love the fact you are participating in our project, however, our
> >> developers aren't able to help if you do not provide the specific details.
> > 
> > Please point to specific details which have been omitted.  Fairly little
> > data has been provided as fairly little data is available.  The primary
> > observation is large numbers of:
> > 
> > (XEN) AMD-Vi: IO_PAGE_FAULT: DDDD:bb:dd.f d0 addr ffffff???????000 flags 0x8 I
> > 
> > Lines in Xen's ring buffer.
> 
> Yet this is (part of) the problem: By providing only the messages that appear
> relevant to you, you imply that you know that no other message is in any way
> relevant. That's judgement you'd better leave to people actually trying to
> investigate. Unless of course you were proposing an actual code change, with
> suitable justification.

Honestly, I forgot about the very small number of messages from the SATA
subsystem.  The question of whether the current mitigation actions are
effective right now was a bigger issue.  As such, monitoring `xl dmesg`
took priority over looking at SATA messages, which failed to reliably
indicate status.
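
To give a concrete idea, the monitoring amounts to something along these
lines (just a sketch; the interval and output format are arbitrary):

  # Report a timestamped count of AMD-Vi faults in Xen's ring buffer
  # once a minute.
  while sleep 60; do
      printf '%s %s\n' "$(date -u +%FT%TZ)" \
          "$(xl dmesg | grep -c 'AMD-Vi: IO_PAGE_FAULT')"
  done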

I *thought* I would be able to retrieve those via other slow means, but a
different and possibly overlapping issue has shown up.  Unfortunately
this means those are no longer retrievable.   :-(

> In fact when running into trouble, the usual course of action would be to
> increase verbosity in both hypervisor and kernel, just to make sure no
> potentially relevant message is missed.

More/better information might have been obtained if I'd been engaged
earlier.

> > The most overt sign was telling the Linux kernel to scan for
> > inconsistencies and the kernel finding some.  The domain didn't otherwise
> > appear to notice trouble.
> > 
> > This is from memory, it would take some time to discover whether any
> > messages were missed.  Present mitigation action is inhibiting the
> > messages, but the trouble is certainly still lurking.
> 
> Iirc you were considering whether any of this might be a timing issue. Yet
> beyond voicing that suspicion, you didn't provide any technical details as
> to why you think so. Such technical details would include taking into
> account how IOMMU mappings and associated IOMMU TLB flushing are carried
> out. Right now, to me at least, your speculation in this regard fails
> basic sanity checking. Therefore the scenario that you're thinking of
> would need better describing, imo.

True.  Mostly I'm analyzing the known information and considering what
the patterns suggest.

Presently I'm aware of two reports (Imre Szőllősi's and mine).

Both of these feature machines with AMD processors.  It could be that
people with AMD processors are less trustful of flash storage, or it
could be an AMD-only IOMMU issue.  Ideally someone would test and
confirm there is no issue with Linux software RAID1 on flash storage on
an Intel machine.

Both reports feature two flash storage devices being run through Linux
MD RAID1.  It could be that the MD RAID1 subsystem is abusing the DMA
interface in some fashion.  While Imre Szőllősi reported this not
occurring with a single device, the report does not explicitly state
whether that was a degenerate RAID1 versus non-RAID.  I'm unaware of any
testing with three devices in RAID1.
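
For anyone wanting to reproduce, the inconsistency scan mentioned above
is MD's "check" action; assuming the array is /dev/md0:

  # Trigger an MD consistency check, then report how many sectors
  # mismatched once sync_action reads "idle" again.
  echo check > /sys/block/md0/md/sync_action
  cat /sys/block/md0/md/mismatch_cnt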


Both reports feature Samsung SATA flash devices.  My case also includes
a Crucial NVMe device, as well as a Crucial SATA flash device for which
the problem did NOT occur.  So the question becomes: why did the problem
not occur for this Crucial SATA device?

According to the specifications, the Crucial SATA device is roughly on
par with the Samsung SATA devices in terms of read/write speeds.  The
NVMe device's specifications are massively better.

What comes to mind is that the Crucial SATA device might have higher
latency before executing commands.  The specifications don't mention
command execution latency, so it isn't possible to know whether this is
the issue.
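
If anyone wants to measure, something like the following should expose
per-command latency differences between devices (a sketch; the device
path is a placeholder, the parameters are arbitrary, and the run
destroys the device's contents):

  # Single-outstanding-command random write test; fio reports
  # completion latency ("clat") per device.  WARNING: destructive.
  fio --name=latprobe --filename=/dev/sdX --direct=1 --ioengine=libaio \
      --rw=randwrite --bs=4k --iodepth=1 --runtime=30 --time_based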


Yes, latency/timing is speculation.  Does seem a good fit for the pattern
though.

This could be a Linux MD RAID1 bug or a Xen bug.

Unfortunately data loss is a very serious type of bug, so I'm highly
reluctant to let go of mitigations without hope for progress.


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445




Thread overview: 27+ messages
2024-01-25 20:24 Serious AMD-Vi issue Elliott Mitchell
2024-02-12 23:23 ` Elliott Mitchell
2024-03-04 19:56   ` Elliott Mitchell
2024-03-18 19:41   ` Serious AMD-Vi(?) issue Elliott Mitchell
2024-03-22 16:41     ` Kelly Choi
2024-03-22 19:22       ` Elliott Mitchell
2024-03-25  7:55         ` Jan Beulich
2024-03-25 21:43           ` Elliott Mitchell [this message]
2024-03-27 17:27             ` Elliott Mitchell
2024-03-28  6:25               ` Jan Beulich
2024-03-28 15:22                 ` Elliott Mitchell
2024-03-28 16:17                   ` Elliott Mitchell
2024-04-11  2:41                 ` Elliott Mitchell
2024-04-17 12:40                   ` Jan Beulich
2024-04-18  6:45                     ` Elliott Mitchell
2024-04-18  7:09                       ` Jan Beulich
2024-04-19  4:33                         ` Elliott Mitchell
2024-05-11  4:09                           ` Elliott Mitchell
2024-05-13  8:44                             ` Roger Pau Monné
2024-05-13 20:11                               ` Elliott Mitchell
2024-05-14  8:22                                 ` Jan Beulich
2024-05-14 20:51                                   ` Elliott Mitchell
2024-05-15 13:40                                     ` Kelly Choi
2024-05-16  5:21                                       ` Elliott Mitchell
2024-05-14  8:20                               ` Jan Beulich
2024-03-04 23:55 ` AMD-Vi issue Andrew Cooper
2024-03-05  0:34   ` Elliott Mitchell
