Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)

From: Liran Alon <liran.alon@oracle.com>
To: David Woodhouse <dwmw2@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
	juerg.haefliger@hpe.com, deepa.srinivasan@oracle.com,
	Jim Mattson <jmattson@google.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Boris Ostrovsky <boris.ostrovsky@oracle.com>,
	linux-mm <linux-mm@kvack.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	joao.m.martins@oracle.com, pradeep.vincent@oracle.com,
	Andi Kleen <ak@linux.intel.com>,
	Khalid Aziz <khalid.aziz@oracle.com>,
	kanth.ghatraju@oracle.com, Kees Cook <keescook@google.com>,
	jsteckli@os.inf.tu-dresden.de,
	Kernel Hardening <kernel-hardening@lists.openwall.com>,
	chris.hyser@oracle.com, Tyler Hicks <tyhicks@canonical.com>,
	John Haxby <john.haxby@oracle.com>, Jon Masters <jcm@redhat.com>
Subject: Re: Redoing eXclusive Page Frame Ownership (XPFO) with isolated CPUs in mind (for KVM to isolate its guests per CPU)
Date: Tue, 21 Aug 2018 17:01:57 +0300	[thread overview]
Message-ID: <ED24D811-C740-417F-A443-B7A249F4FF4C@oracle.com> (raw)
In-Reply-To: <1534845423.10027.44.camel@infradead.org>

> On 21 Aug 2018, at 12:57, David Woodhouse <dwmw2@infradead.org> wrote:
> 
> Another alternative... I'm told POWER8 does an interesting thing with
> hyperthreading and gang scheduling for KVM. The host kernel doesn't
> actually *see* the hyperthreads at all, and KVM just launches the full
> set of siblings when it enters a guest, and gathers them again when any
> of them exits. That's definitely worth investigating as an option for
> x86, too.

I actually think that such scheduling mechanism which prevents leaking cache entries to sibling hyperthreads should co-exist together with the KVM address space isolation to fully mitigate L1TF and other similar vulnerabilities. The address space isolation should prevent VMExit handlers code gadgets from loading arbitrary host memory to the cache. Once VMExit code path switches to full host address space, then we should also make sure that no other sibling hyprethread is running in the guest.

Focusing on the scheduling mechanism, we must make sure that when a logical processor runs guest code, all siblings logical processors must run code which do not populate L1D cache with information unrelated to this VM. This includes forbidding one logical processor to run guest code while sibling is running a host task such as a NIC interrupt handler.
Thus, when a vCPU thread exits the guest into the host and VMExit handler reaches code flow which could populate L1D cache with this information, we should force an exit from the guest of the siblings logical processors, such that they will be allowed to resume only on a core which we can promise that the L1D cache is free from information unrelated to this VM.

At first, I have created a patch series which attempts to implement such mechanism in KVM. However, it became clear to me that this may need to be implemented in the scheduler itself. This is because:
1. It is difficult to handle all new scheduling contrains only in KVM.
2. This mechanism should be relevant for any Type-2 hypervisor which runs inside Linux besides KVM (Such as VMware Workstation or VirtualBox).
3. This mechanism could also be used to prevent future “core-cache-leaking” vulnerabilities to be exploited between processes of different security domains which run as siblings on the same core.

The main idea is a mechanism which is very similar to Microsoft's "core scheduler" which they implemented to mitigate this vulnerability. The mechanism should work as follows:
1. Each CPU core will now be tagged with a "security domain id".
2. The scheduler will provide a mechanism to tag a task with a security domain id.
3. Tasks will inherit their security domain id from their parent task.
    3.1. First task in system will have security domain id of 0. Thus, if nothing special is done, all tasks will be assigned with security domain id of 0.
4. Tasks will be able to allocate a new security domain id from the scheduler and assign it to another task dynamically.
5. Linux scheduler will prevent scheduling tasks on a core with a different security domain id:
    5.0. CPU core security domain id will be set to the security domain id of the tasks which currently run on it.
    5.1. The scheduler will attempt to first schedule a task on a core with required security domain id if such exists.
    5.2. Otherwise, will need to decide if it wishes to kick all tasks running on some core to run the task with a different security domain id on that core.

The above mechanism can be used to mitigate the L1TF HT variant by just assigning vCPU tasks with a security domain id which is unique per VM and also different than the security domain id of the host which is 0.

I would be glad to hear feedback on the above suggestion.
If this should better be discussed on a separate email thread, please say so and I will open a new thread.

Thanks,
-Liran