From: Julien Grall <julien.grall@linaro.org>
To: George Dunlap <george.dunlap@citrix.com>,
	xen-devel <xen-devel@lists.xenproject.org>,
	Jan Beulich <JBeulich@suse.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	George Dunlap <george.dunlap@eu.citrix.com>,
	Stefano Stabellini <sstabellini@kernel.org>,
	Andre Przywara <andre.przywara@arm.com>, Tim Deegan <tim@xen.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Subject: Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
Date: Thu, 7 Dec 2017 13:52:24 +0000	[thread overview]
Message-ID: <cdd590a4-ff6c-af5a-8b5b-ca981c218ffc@linaro.org> (raw)
In-Reply-To: <688fe6e3-3410-aed2-0182-6589d0d8268a@citrix.com>

(+ Marc)

Hi,

@Marc: My Arm cache knowledge is somewhat limited. Feel free to correct 
me if I am wrong.

Before answering the rest of the e-mail, let me reinforce what I said 
in my first e-mail: Set/Way operations are very complex to emulate, and 
an OS using them should never expect good performance in a 
virtualization context. The difficulty is clearly spelled out in the 
Arm Arm.

So the main goal here is to work around such software.

On 06/12/17 17:49, George Dunlap wrote:
> On 12/06/2017 12:58 PM, Julien Grall wrote:
>> Hi George,
>>
>> On 12/06/2017 12:28 PM, George Dunlap wrote:
>>> On 12/05/2017 06:39 PM, Julien Grall wrote:
>>>> Hi all,
>>>>
>>>> Even though it is an Arm failure, I have CCed x86 folks to get feedback
>>>> on the approach. I have a WIP branch I could share if that interests
>>>> people.
>>>>
>>>> A few months ago, we noticed a heisenbug in jobs run by osstest on the
>>>> cubietrucks (see [1]). From the logs, we figured out that guest vCPU
>>>> 0 was in data/prefetch abort state at early boot. I have been able to
>>>> reproduce it reliably, although from the little information I have, I
>>>> think it is related to a cache issue, because we don't trap cache
>>>> maintenance instructions by set/way.
>>>>
>>>> This is a set of 3 instructions (clean, clean & invalidate, invalidate)
>>>> working on a given cache level by S/W. Because the OS is not allowed to
>>>> infer the S/W to PA mapping, it can only use S/W to nuke the whole
>>>> cache. "The expected usage of the cache maintenance that operate by
>>>> set/way is associated with powerdown and powerup of caches, if this is
>>>> required by the implementation" (see D3-2020 ARM DDI 0487B.b).
>>>>
>>>> Those instructions target the local processor and are usually issued
>>>> in a batch to nuke the whole cache. This means that if the vCPU is
>>>> migrated to another pCPU in the middle of the sequence, the cache may
>>>> not be cleaned. This would result in data corruption and a potential
>>>> crash of the OS.
>>>
>>> I don't quite understand the failure mode here: Why does vCPU migration
>>> cause cache inconsistency in the middle of one of these "cleans", but
>>> not under normal operation?
>>
>> Because they target a specific cache level by S/W, whereas the other
>> cache operations work on VAs.
>>
>> To make it short, the other VA cache instructions work to the Point of
>> Coherency/Point of Unification and guarantee that the caches will be
>> consistent. For more details see B2.2.6 in ARM DDI 0406C.c.
> 
> I skimmed that section, and I'm not much the wiser.
> 
> Just to be clear, this is my question.
> 
> Suppose we have the following sequence of events (where vN[pM] means
> vcpu N running on pcpu M):
> 
> Start with A == 0
> 
> 1. v0[p1] Read A
>    p1 has 'A==0' in the cache
> 2. scheduler migrates v0 to p0
> 3. v0[p0] A=2
>    p0 has 'A==2' in the cache
> 4. scheduler migrates v0 to p1
> 5. v0[p1] Read A
> 
> Now, I presume that with the guest not doing anything, the Read of A at
> #5 will end up as '2'; i.e., behind the scenes somewhere, either by Xen
> or by the hardware, between #1 and #5, p0's version of A gets "cleaned"
> and p1's version of A gets "invalidated" (to use the terminology from
> the section mentioned above).

Caches on Arm are coherent and are controlled by the attributes in the 
page-tables. Imagine the region is Normal Cacheable and Inner Shareable: 
a data synchronization barrier at #4 will ensure the visibility of A 
to p1. So A will be read as 2.

> 
> So my question is, how does *adding* cache flushing of any sort end up
> violating the integrity in a situation like the above?

Because the integrity is based on the memory attributes in the 
page-tables. S/W instructions work directly on the cache and will break 
the coherency. Marc pointed me to his talk [1], which explains caches on 
Arm and also the set/way problem (see from slide 8).

> 
>>>> For those worried about the performance impact, I have looked at the
>>>> current use of S/W instructions:
>>>>       - Linux Arm64: the last use in the kernel was removed at the
>>>> beginning of 2015
>>>>       - Linux Arm32: still uses S/W for boot and secondary CPU
>>>> bring-up. No plan to change.
>>>>       - UEFI: a couple of uses in UEFI, but I have heard they plan to
>>>> remove them (needs confirmation).
>>>>
>>>> I haven't looked at all the OSes. However, given that the Arm Arm
>>>> clearly states S/W instructions are not easily virtualizable, I would
>>>> expect guest OS developers to try their best to limit the use of these
>>>> instructions.
>>>>
>>>> To limit the performance impact, we could introduce a guest option to
>>>> tell whether the guest will use S/W. If it does plan to use S/W, PoD
>>>> will be disabled.
>>>>
>>>> Now regarding the hardware domain: at the moment, it has its RAM direct
>>>> mapped. Supporting direct mapping in PoD would be quite a pain for
>>>> limited benefit (see why above). In that case I would suggest imposing
>>>> vCPU pinning on the hardware domain if the S/W instructions are
>>>> expected to be used. Again, a command line option could be introduced
>>>> here.
>>>>
>>>> Any feedback on the approach is welcome.
>>>
>>> I still don't entirely understand the underlying failure mode, but there
>>> are a couple of things we could consider:
>>>
>>> 1. Automatically disabling 'vcpu migration' when caching is turned off.
>>> This wouldn't prevent a vcpu from being preempted, just from being run
>>> somewhere else.
>>
>> This suggests the guest will directly perform S/W, right? So you leave
>> the guest the possibility to flush all caches its vCPU can access.
>> This is an easy way for the guest to evict cache entries of other guests.
>>
>> I think this could enable some data attacks.
> 
> Well, it's the equivalent of your "imposing vcpu pinning" solution
> above, but only temporary.  Was that suggestion meant to allow the
> hardware domain to directly perform S/W?

Yes, for the hardware domain only, because it is more trusted IMHO. I 
thought you meant for every guest. The problem I can see here is that 
you would need to trap cache-toggling. To trap that, you have to trap 
all the virtual memory control registers. This means:

Non-secure EL1 using AArch64: SCTLR_EL1, TTBR0_EL1, TTBR1_EL1, TCR_EL1,
ESR_EL1, FAR_EL1, AFSR0_EL1, AFSR1_EL1, MAIR_EL1, AMAIR_EL1,
CONTEXTIDR_EL1.

Non-secure EL1 using AArch32: SCTLR, TTBR0, TTBR1, TTBCR, TTBCR2, DACR,
DFSR, IFSR, DFAR, IFAR, ADFSR, AIFSR, PRRR, NMRR, MAIR0, MAIR1, AMAIR0,
AMAIR1, CONTEXTIDR.

Those registers are accessed very often, so you will have a performance 
impact for the whole life of the guest.

However, looking at Marc's slides, this would not work when booting a 
32-bit hardware domain on ARMv8, because system caches might be present.

> 
>>> 2. It sounds like rather than using PoD, you could use the
>>> "misconfigured p2m table" technique that x86 uses: set bits in the p2m
>>> entry which cause a specific kind of HAP fault when accessed.  The fault
>>> handler then looks in the p2m entry, and if it finds an otherwise valid
>>> entry, it just fixes the "misconfigured" bits and continues.
>>
>> I thought about this. But when do you set the entry to misconfigured?
>>
>> Take the example of 32-bit Linux: there are a couple of full cache
>> cleans during the boot of a uni-processor system. So you would need to
>> go through the p2m multiple times and reset the access bits.
> 
> Do you want to reset the p2m multiple times?  I thought the goal was
> simply to keep the amount of p2m space you need to flush to a minimum;
> if you expect the memory which has been faulted in by the *last* flush
> to be relatively small, you could just always flush all memory that had
> been touched to that point.
> 
> If you *do* need to go through the p2m multiple times, then
> misconfiguration is a much better option than PoD.  In PoD, once a page
> has data on it, it can't be removed from the p2m anymore.  For the
> misconfiguration technique, you can go through and misconfigure the
> entries in the top-level p2m table as many times as you want.  The whole
> reason for doing it on x86 is that it's a relatively lightweight
> operation: we use it to modify MMIO mappings, to enable or disable
> logdirty for migrate, &c.

Does this also work when you share the page-tables with the IOMMU? It 
just occurred to me that with both PoD and "misconfigured bits" we would 
get into trouble because the page-tables are shared with the IOMMU.

But I guess it would be acceptable to say "you use S/W instructions in 
your OS, so you have to pay a worse performance price unless you fix 
your OS".

> 
> (This of course depends on being able to effectively misconfigure
> top-level entries of the p2m on ARM.)

More in my answer to Jan's e-mail.

Cheers,

[1] 
https://events.linuxfoundation.org/sites/events/files/slides/slides_10.pdf

-- 
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
