Re: [PATCH] x86/HVM: correct hvmemul_map_linear_addr() for multi-page case

From: Jan Beulich <jbeulich@suse.com>
To: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: xen-devel <xen-devel@lists.xenproject.org>,
	"Paul Durrant" <paul.durrant@citrix.com>,
	"Roger Pau Monné" <roger.pau@citrix.com>
Subject: Re: [PATCH] x86/HVM: correct hvmemul_map_linear_addr() for multi-page case
Date: Thu, 31 Aug 2023 09:03:18 +0200
Message-ID: <5b28f42f-be2d-b826-2bfe-434b0c1742e2@suse.com>
In-Reply-To: <93b3e5c4-19b7-9809-e322-f0973924eef8@citrix.com>

On 30.08.2023 20:09, Andrew Cooper wrote:
> On 30/08/2023 3:30 pm, Roger Pau Monné wrote:
>> On Wed, Sep 12, 2018 at 03:09:35AM -0600, Jan Beulich wrote:
>>> The function does two translations in one go for a single guest access.
>>> Any failure of the first translation step (guest linear -> guest
>>> physical), resulting in #PF, ought to take precedence over any failure
>>> of the second step (guest physical -> host physical).
> 
> Erm... No?
> 
> There are up to 25 translation steps, assuming a memory operand
> contained entirely within a cache-line.
> 
> They intermix between gla->gpa and gpa->spa in a strict order.

But we're talking about an access crossing a page boundary here.

> There is not a point where the error is ambiguous, nor is there ever a
> point where a pagewalk continues beyond a faulting condition.
> 
> Hardware certainly isn't wasting transistors to hold state just to see
> whether it could progress further in order to hand back a different error...
> 
> 
> When the pipeline needs to split an access, it has to generate multiple
> adjacent memory accesses, because the unit of memory access is a cache line.
> 
> There is a total order of accesses in the memory queue, so any faults
> from first byte of the access will be delivered before any fault from
> the first byte to move into the next cache line.

Looks like we're fundamentally disagreeing on what we try to emulate in
Xen. My view is that the goal ought to be to match, as closely as
possible, how code would behave on bare metal. IOW no considerations of
the GPA -> MA translation steps. Of course in a fully virtualized
environment these necessarily have to occur for the page table accesses
themselves, before the actual memory access can be carried out. But
that's different for the leaf access itself. (In fact I'm not even sure
the architecture guarantees that the two split accesses, or their
associated page walks, always occur in [address] order.)

I'd also like to expand on the "we're": Considering the two R-b I got
already back at the time, both apparently agreed with my way of looking
at things. With Roger's reply that you've responded to here, I'm
getting the impression that he also shares that view.

Of course that still doesn't mean we're right and you're wrong, but if
you think that's the case, it'll take you actually supplying arguments
supporting your view. And since we're talking of an abstract concept
here, resorting to how CPUs actually deal with the same situation
isn't enough. It wouldn't be the first time that they got things
wrong. Plus it may also require you potentially accepting that
different views are possible, without either being strictly wrong and
the other strictly right.

> I'm not necessarily saying that Xen's behaviour in
> hvmemul_map_linear_addr() is correct in all cases, but it looks a hell
> of a lot more correct in its current form than what this patch presents.
> 
> Or do you have a concrete example where you think
> hvmemul_map_linear_addr() behaves incorrectly?

I may not have observed one (the patch has been pending for too long
now for me to still recall in the context of what unrelated work I
noticed there being an issue here; certainly it was a case where I was
at least suspecting this being the possible cause, and I do recall it
was related to some specific observation with Windows guests), but the
description makes clear enough that any split access crossing a (guest
view) non-faulting/faulting boundary is going to be affected, if the
former access would instead cause some 2nd-stage translation issue on
the leaf access.
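
To make the affected case concrete, here is a minimal sketch in C. It is
a toy model, not Xen code: every name in it (struct page_model, enum
outcome, map_interleaved(), map_two_pass()) is hypothetical, and each
translation stage is reduced to a single success/failure flag per page.
It only contrasts the two orderings under discussion: translating each
page completely before moving to the next, versus doing all first-stage
(guest linear -> guest physical) translations before any second-stage
one, so that a guest-visible #PF takes precedence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome codes; these are not Xen's return values. */
enum outcome { OK, GUEST_PF, STAGE2_FAIL };

/* Hypothetical per-page model: does each translation stage succeed? */
struct page_model {
    bool stage1_ok; /* guest linear -> guest physical */
    bool stage2_ok; /* guest physical -> host physical */
};

/* Interleaved order: fully translate page i before looking at page i+1. */
enum outcome map_interleaved(const struct page_model *pg, size_t n)
{
    assert(pg != NULL && n > 0);

    for (size_t i = 0; i < n; i++) {
        if (!pg[i].stage1_ok)
            return GUEST_PF;
        if (!pg[i].stage2_ok)
            return STAGE2_FAIL;
    }
    return OK;
}

/* Two-pass order: a first-stage failure on any page wins over any
 * second-stage failure, regardless of which page it occurs on. */
enum outcome map_two_pass(const struct page_model *pg, size_t n)
{
    assert(pg != NULL && n > 0);

    for (size_t i = 0; i < n; i++)
        if (!pg[i].stage1_ok)
            return GUEST_PF;

    for (size_t i = 0; i < n; i++)
        if (!pg[i].stage2_ok)
            return STAGE2_FAIL;

    return OK;
}
```

For a two-page access where the first page has only a second-stage
problem and the second page would fault in the guest's own page tables,
the interleaved order in this model reports the second-stage failure,
while the two-pass order reports the #PF that the guest kernel could
act on.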

In fact I think one can even see a guest security aspect here (not an
active issue, but defense-in-depth like): If there's any chance to have
the guest kernel take corrective action, that should be preferred over
Xen potentially taking fatal (to the guest) action (because of whatever
2nd stage translation issue on the lower part of the access). From
that angle the change may even not go far enough, yet (thinking e.g.
of a PoD out-of-memory condition on the first part of the access; in
such an event hvm_translate_get_page() unconditionally using
P2M_UNSHARE, and hence implicitly P2M_ALLOC, is also getting in the
way).

Jan