From: Jan Beulich <jbeulich@suse.com>
To: Elliott Mitchell <ehem+xen@m5p.com>
Cc: xen-devel@lists.xenproject.org,
	"Andrew Cooper" <andrew.cooper3@citrix.com>,
	"Roger Pau Monné" <roger.pau@citrix.com>, "Wei Liu" <wl@xen.org>,
	"Kelly Choi" <kelly.choi@cloud.com>
Subject: Re: Serious AMD-Vi(?) issue
Date: Thu, 18 Apr 2024 09:09:51 +0200
Message-ID: <f0bdb386-0870-4468-846c-6c8a91eaf806@suse.com>
In-Reply-To: <ZiDBc3ye2wqmBAfq@mattapan.m5p.com>

On 18.04.2024 08:45, Elliott Mitchell wrote:
> On Wed, Apr 17, 2024 at 02:40:09PM +0200, Jan Beulich wrote:
>> On 11.04.2024 04:41, Elliott Mitchell wrote:
>>> On Thu, Mar 28, 2024 at 07:25:02AM +0100, Jan Beulich wrote:
>>>> On 27.03.2024 18:27, Elliott Mitchell wrote:
>>>>> On Mon, Mar 25, 2024 at 02:43:44PM -0700, Elliott Mitchell wrote:
>>>>>> On Mon, Mar 25, 2024 at 08:55:56AM +0100, Jan Beulich wrote:
>>>>>>>
>>>>>>> In fact when running into trouble, the usual course of action would be to
>>>>>>> increase verbosity in both hypervisor and kernel, just to make sure no
>>>>>>> potentially relevant message is missed.
>>>>>>
>>>>>> More/better information might have been obtained if I'd been engaged
>>>>>> earlier.
>>>>>
>>>>> This is still true; things are in full mitigation mode and I'll be
>>>>> quite unhappy to go back to experimenting at this point.
>>>>
>>>> Well, it very likely won't work without further experimenting by someone
>>>> able to observe the bad behavior. Recall we're on xen-devel here; it is
>>>> kind of expected that without clear (and practical) repro instructions
>>>> experimenting as well as info collection will remain with the reporter.
>>>
>>> After looking at the situation and considering the issues, I /may/ be
>>> able to set up for doing more testing.  I guess I should confirm: which
>>> of those criteria do you think the currently provided information fails?
>>>
>>> AMD-IOMMU + Linux MD RAID1 + dual Samsung SATA (or various NVMe) +
>>> dbench; seems a pretty specific setup.
>>
>> Indeed. If that's the only way to observe the issue, it suggests to me
>> that it'll need to be mainly you to do further testing, and perhaps even
>> debugging. Which isn't to say we're not available to help, but from all
>> I have gathered so far we're pretty much in the dark even as to which
>> component(s) may be to blame. As can still be seen at the top in reply
>> context, some suggestions were given as to obtaining possible further
>> information (or confirming the absence thereof).
> 
> There may be other ways which haven't yet been found.
> 
> I've been left with the suspicion that AMD was to some degree sponsoring
> work to ensure Xen works on their hardware.  Given the severity of this
> problem I would kind of expect them not to want to gain a reputation for
> having data loss issues.  Assuming a suitable pair of devices wasn't
> already on hand, I would kind of expect acquiring one to be well within
> their budget.

You've got to talk to AMD then. Plus I assume it's clear to you that
even if the (presumably) necessary hardware were available, it would
still require the corresponding setup, leaving open whether the issue
could then indeed be reproduced.

>> I'd also like to come back to the vague theory you did voice, in that
>> you're suspecting flushes to take too long. I continue to have trouble
>> with this, and I would therefore like to ask that you put this down in
>> more technical terms, making connections to actual actions taken by
>> software / hardware.
> 
> I'm trying to figure out a pattern.
> 
> Nominally all the devices are roughly on par (only a very cheap flash
> device will be unable to overwhelm SATA's bandwidth).  Yet why did the
> Crucial SATA device /seem/ not to have the issue?  Why did a Crucial NVMe
> device demonstrate the issue?
> 
> My guess is the flash controllers Samsung uses may be able to start
> executing commands faster than the ones Crucial uses.  Meanwhile NVMe
> has lower overhead and latency than SATA (SATA's overhead isn't an issue
> for actual spinning disks).  Perhaps the IOMMU is still flushing its TLB,
> or hasn't loaded the new tables yet.

Which would be an IOMMU issue then, that software at best may be able to
work around.
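
To make the theory concrete: on AMD-Vi, IOTLB invalidations are queued
into the IOMMU's command buffer and only take effect once the IOMMU has
processed them, which software confirms with a COMPLETION_WAIT command.
A hypothetical sketch of that ordering (illustrative names only, not
Xen's actual code; queue_command()/completion_seen() are made-up helpers
standing in for the real command-buffer plumbing):

/* Conceptual unmap-side flush for a range of device addresses. */
static void flush_iotlb_pages(struct amd_iommu *iommu, uint16_t domid,
                              uint64_t dfn, unsigned int order)
{
    /* Ask the IOMMU to drop cached translations for the range. */
    queue_command(iommu, CMD_INVALIDATE_IOMMU_PAGES, domid, dfn, order);

    /* Ask the IOMMU to signal once all prior commands have completed. */
    queue_command(iommu, CMD_COMPLETION_WAIT, 0, 0, 0);

    /* If the unmap path returned before this poll, a device issuing
     * DMA quickly enough could still hit the stale translation -
     * exactly the window the theory above depends on. */
    while ( !completion_seen(iommu) )
        cpu_relax();
}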

Jan

> I suspect when MD-RAID1 issues block requests to a pair of devices,
> it likely sends the block to one device and then reuses most/all of the
> structures for the second device.  As a result the second request would
> likely get a command to the device rather faster than the first request.
> 
> Perhaps look into which structures the MD-RAID1 subsystem reuses.
> Then see whether doing early setup of those structures triggers the
> issue?
> 
> (okay I'm deep into speculation here, but this seems the simplest
> explanation for what could be occurring)
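
For reference, the fan-out the above speculation describes looks
conceptually like the following (a simplified sketch only loosely
modelled on drivers/md/raid1.c; the r1conf/mirrors fields here mimic
the real structures loosely, and the real code also handles barriers,
bad blocks, write-behind, etc.).  Each clone shares the master bio's
data pages, and the second submission follows the first with minimal
delay, so the two commands can reach the devices nearly back to back:

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Hypothetical, stripped-down RAID1 write fan-out. */
static void raid1_write_fanout(struct r1conf *conf, struct bio *master)
{
    int i;

    for (i = 0; i < conf->raid_disks; i++) {
        /* The clone shares the master's data pages; only per-device
         * fields (target bdev, sector) differ. */
        struct bio *clone =
            bio_alloc_clone(conf->mirrors[i].rdev->bdev, master,
                            GFP_NOIO, &conf->bio_set);

        /* Second iteration submits almost immediately after the
         * first, with most structures already set up. */
        submit_bio_noacct(clone);
    }
}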
