Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way

From: Julien Grall <julien.grall@linaro.org>
To: George Dunlap <george.dunlap@citrix.com>,
	xen-devel <xen-devel@lists.xenproject.org>,
	Jan Beulich <JBeulich@suse.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	George Dunlap <george.dunlap@eu.citrix.com>,
	Stefano Stabellini <sstabellini@kernel.org>,
	Andre Przywara <andre.przywara@arm.com>, Tim Deegan <tim@xen.org>
Subject: Re: [RFC] xen/arm: Handling cache maintenance instructions by set/way
Date: Wed, 6 Dec 2017 12:58:10 +0000
Message-ID: <e92ce303-1945-ceac-8a5d-93e95c687ac1@linaro.org>
In-Reply-To: <b103ed02-fe83-2901-801b-8fcf9a3c0a74@citrix.com>

Hi George,

On 12/06/2017 12:28 PM, George Dunlap wrote:
> On 12/05/2017 06:39 PM, Julien Grall wrote:
>> Hi all,
>>
>> Even though it is an Arm failure, I have CCed x86 folks to get feedback
>> on the approach. I have a WIP branch I could share if that interests people.
>>
>> A few months ago, we noticed a heisenbug on jobs run by osstest on the
>> cubietrucks (see [1]). From the log, we figured out that the guest vCPU
>> 0 is in data/prefetch abort state at early boot. I have been able to
>> reproduce it reliably, although from the little information I have I
>> think it is related to a cache issue because we don't trap cache
>> maintenance instructions by set/way.
>>
>> This is a set of 3 instructions (clean, clean & invalidate, invalidate)
>> working on a given cache level by S/W. Because the OS is not allowed to
>> infer the S/W to PA mapping, it can only use S/W to nuke the whole
>> cache. "The expected usage of the cache maintenance that operate by
>> set/way is associated with powerdown and powerup of caches, if this is
>> required by the implementation" (see D3-2020 ARM DDI 0487B.b).
>>
>> These instructions target the local processor and are usually issued in
>> a batch to nuke the cache. This means that if the vCPU is migrated to
>> another pCPU in the middle of the process, the cache may not be cleaned.
>> This would result in data corruption and a potential crash of the OS.
> 
> I don't quite understand the failure mode here: Why does vCPU migration
> cause cache inconsistency in the middle of one of these "cleans", but
> not under normal operation?

Because they target a specific cache level by S/W, whereas the other 
cache operations work by VA.

In short, the VA cache instructions operate to the Point of 
Coherency/Point of Unification and guarantee that the caches will be 
consistent. For more details see B2.2.6 in ARM DDI 0406C.c.
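
To illustrate the failure mode, here is a toy model in Python. It is only
a sketch under simplifying assumptions, not Xen code and not a faithful
cache simulator: each pCPU has a private cache with one line per set, and
all names (PCpu, guest_clean_all, run) are made up for this example. The
guest loops over every set, and each clean only touches the cache of the
pCPU the vCPU is currently running on.

```python
# Toy model only: not Xen code and not a faithful cache simulator.
NUM_SETS = 4


class PCpu:
    """A physical CPU with a private cache: set index -> dirty line."""

    def __init__(self, name):
        self.name = name
        self.dirty = {}  # set index -> (address, value) not yet written back

    def clean_set(self, index, memory):
        """Clean by set/way: write back one set of *this* pCPU's cache only."""
        line = self.dirty.pop(index, None)
        if line is not None:
            address, value = line
            memory[address] = value


def guest_clean_all(pcpus, memory, migrate_at=None):
    """Guest loop over every set; optionally migrate the vCPU mid-loop."""
    current = pcpus[0]
    for index in range(NUM_SETS):
        if migrate_at is not None and index == migrate_at:
            current = pcpus[1]  # the scheduler moves the vCPU to another pCPU
        current.clean_set(index, memory)


def run(migrate_at):
    """Dirty every set on pCPU0, run the guest loop, return memory contents."""
    memory = {address: "stale" for address in range(NUM_SETS)}
    pcpus = [PCpu("pCPU0"), PCpu("pCPU1")]
    for index in range(NUM_SETS):
        pcpus[0].dirty[index] = (index, "new")
    guest_clean_all(pcpus, memory, migrate_at)
    return memory
```

With run(None) every address ends up holding "new". With run(2) the vCPU
moves after two sets, so addresses 2 and 3 still hold "stale": the rest of
the loop cleaned pCPU1's cache while the dirty lines stayed on pCPU0.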

> 
>> For those worried about the performance impact, I have looked at the
>> current use of S/W instructions:
>>      - Linux Arm64: The last use in the kernel was at the beginning of 2015
>>      - Linux Arm32: Still uses S/W for boot and secondary CPU bring-up. No
>> plan to change.
>>      - UEFI: A couple of uses in UEFI, but I have heard they plan to
>> remove them (need confirmation).
>>
>> I haven't looked at all the OSes. However, given that the Arm Arm clearly
>> states S/W instructions are not easily virtualizable, I would expect
>> guest OS developers to try their best to limit the use of the
>> instructions.
>>
>> To limit the performance impact, we could introduce a guest option to
>> tell whether the guest will use S/W. If it does plan to use S/W, PoD
>> will be disabled.
>>
>> Now regarding the hardware domain. At the moment, it has its RAM direct
>> mapped. Supporting direct mapping in PoD would be quite a pain for
>> limited benefit (see why above). In that case I would suggest imposing
>> vCPU pinning for the hardware domain if S/W instructions are expected to
>> be used.
>> Again, a command line option could be introduced here.
>>
>> Any feedback on the approach is welcome.
> 
> I still don't entirely understand the underlying failure mode, but there
> are a couple of things we could consider:
> 
> 1. Automatically disabling 'vcpu migration' when caching is turned off.
> This wouldn't prevent a vcpu from being preempted, just from being run
> somewhere else.

This suggests the guest would perform S/W directly, right? That leaves 
the guest able to flush all the caches the vCPU can access. 
It is an easy way for the guest to affect the cache entries of other guests.

I think this could facilitate some potential data attacks.

> 
> 2. It sounds like rather than using PoD, you could use the
> "misconfigured p2m table" technique that x86 uses: set bits in the p2m
> entry which cause a specific kind of HAP fault when accessed.  The fault
> handler then looks in the p2m entry, and if it finds an otherwise valid
> entry, it just fixes the "misconfigured" bits and continues.

I thought about this. But when do you set the entry to misconfigured?

Take the example of 32-bit Linux: there are a couple of full cache 
cleans during uniprocessor boot. So you would need to go through the 
p2m multiple times and reset the access bits.
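
A rough way to see the cost is the toy Python model below. It is only a
sketch under assumptions of my own: the class and method names are
hypothetical, not Xen's p2m API, and "cleaning a page" is reduced to a
counter. Each full cache clean by the guest marks every entry as
misconfigured, and the fault handler fixes an entry the next time the
page is touched.

```python
# Toy model only: the names below are hypothetical, not Xen's p2m API.
class P2M:
    def __init__(self, num_pages):
        # gfn -> True when the entry is marked "misconfigured"
        self.misconfigured = {gfn: False for gfn in range(num_pages)}
        self.entries_walked = 0  # cost of marking entries
        self.pages_cleaned = 0   # cost of fixing entries on fault

    def on_full_clean(self):
        """Guest did a full clean by S/W: mark every entry misconfigured."""
        for gfn in self.misconfigured:
            self.misconfigured[gfn] = True
            self.entries_walked += 1

    def on_access(self, gfn):
        """Guest touched a page: on fault, clean it and fix the entry."""
        if self.misconfigured[gfn]:
            self.pages_cleaned += 1  # stand-in for cleaning the page
            self.misconfigured[gfn] = False


def boot(num_pages, full_cleans, pages_touched):
    """Simulate a boot with several full cleans, touching pages after each."""
    p2m = P2M(num_pages)
    for _ in range(full_cleans):
        p2m.on_full_clean()
        for gfn in range(pages_touched):
            p2m.on_access(gfn)
    return p2m
```

For example, boot(1024, 3, 10) walks 3072 entries to handle only 30
faults: the marking cost scales with the size of the p2m for every full
clean, regardless of how few pages the guest touches afterwards.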

Cheers,

-- 
Julien Grall
