From: George Dunlap <george.dunlap@citrix.com>
To: Wei Liu <wei.liu2@citrix.com>, Andrew Cooper <andrew.cooper3@citrix.com>
Cc: "Martin Pohlack" <mpohlack@amazon.de>,
	"Julien Grall" <julien.grall@arm.com>,
	"Jan Beulich" <JBeulich@suse.com>,
	"Joao Martins" <joao.m.martins@oracle.com>,
	"Stefano Stabellini" <sstabellini@kernel.org>,
	"Daniel Kiper" <daniel.kiper@oracle.com>,
	"Marek Marczykowski" <marmarek@invisiblethingslab.com>,
	"Anthony Liguori" <aliguori@amazon.com>,
	"Dannowski, Uwe" <uwed@amazon.de>,
	"Lars Kurth" <lars.kurth@citrix.com>,
	"Konrad Wilk" <konrad.wilk@oracle.com>,
	"Ross Philipson" <ross.philipson@oracle.com>,
	"Dario Faggioli" <dfaggioli@suse.com>,
	"Matt Wilson" <msw@amazon.com>,
	"Boris Ostrovsky" <boris.ostrovsky@oracle.com>,
	"Juergen Gross" <JGross@suse.com>,
	"Sergey Dyasli" <sergey.dyasli@citrix.com>,
	"George Dunlap" <george.dunlap@eu.citrix.com>,
	"Xen-devel List" <xen-devel@lists.xen.org>,
	"Mihai Donțu" <mdontu@bitdefender.com>,
	"Woodhouse, David" <dwmw@amazon.co.uk>,
	"Roger Pau Monne" <roger.pau@citri>
Subject: Re: Ongoing/future speculative mitigation work
Date: Mon, 10 Dec 2018 12:12:34 +0000	[thread overview]
Message-ID: <3e7b96cf-fe5f-604f-75f8-4919737d7e63@citrix.com> (raw)
In-Reply-To: <20181207184051.l6owpsjvecog6zhx@zion.uk.xensource.com>

On 12/7/18 6:40 PM, Wei Liu wrote:
> On Thu, Oct 18, 2018 at 06:46:22PM +0100, Andrew Cooper wrote:
>> Hello,
>>
>> This is an accumulation and summary of various tasks which have been
>> discussed since the revelation of the speculative security issues in
>> January, and also an invitation to discuss alternative ideas.  They are
>> x86 specific, but a lot of the principles are architecture-agnostic.
>>
>> 1) A secrets-free hypervisor.
>>
>> Basically every hypercall can be (ab)used by a guest, and used as an
>> arbitrary cache-load gadget.  Logically, this is the first half of a
>> Spectre SP1 gadget, and is usually the first stepping stone to
>> exploiting one of the speculative sidechannels.
>>
>> Short of compiling Xen with LLVM's Speculative Load Hardening (which is
>> still experimental, and comes with a ~30% perf hit in the common case),
>> this is unavoidable.  Furthermore, throwing a few array_index_nospec()
>> into the code isn't a viable solution to the problem.
>>
>> An alternative option is to have less data mapped into Xen's virtual
>> address space - if a piece of memory isn't mapped, it can't be loaded
>> into the cache.
>>
>> An easy first step here is to remove Xen's directmap, which will mean
>> that guests' general RAM isn't mapped by default into Xen's address
>> space.  This will come with some performance hit, as the
>> map_domain_page() infrastructure will now have to actually
>> create/destroy mappings, but removing the directmap will cause an
>> improvement for non-speculative security as well (No possibility of
>> ret2dir as an exploit technique).
>>
>> Beyond the directmap, there are plenty of other interesting secrets in
>> the Xen heap and other mappings, such as the stacks of the other pcpus. 
>> Fixing this requires moving Xen to having a non-uniform memory layout,
>> and this is much harder to change.  I already experimented with this as
>> a meltdown mitigation around about a year ago, and posted the resulting
>> series on Jan 4th,
>> https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg00274.html,
>> some trivial bits of which have already found their way upstream.
>>
>> To have a non-uniform memory layout, Xen may not share L4 pagetables. 
>> i.e. Xen must never have two pcpus which reference the same pagetable in
>> %cr3.
>>
>> This property already holds for 32bit PV guests, and all HVM guests, but
>> 64bit PV guests are the sticking point.  Because Linux has a flat memory
>> layout, when a 64bit PV guest schedules two threads from the same
>> process on separate vcpus, those two vcpus have the same virtual %cr3,
>> and currently, Xen programs the same real %cr3 into hardware.
>>
>> If we want Xen to have a non-uniform layout, our two options are:
>> * Fix Linux to have the same non-uniform layout that Xen wants
>> (Backwards compatibility for older 64bit PV guests can be achieved with
>> xen-shim).
>> * Make use of the XPTI algorithm (specifically, the pagetable sync/copy
>> part) permanently, for all future hardware.
>>
>> Option 2 isn't great (especially for perf on fixed hardware), but does
>> keep all the necessary changes in Xen.  Option 1 looks to be the better
>> option longterm.
>>
>> As an interesting point to note.  The 32bit PV ABI prohibits sharing of
>> L3 pagetables, because back in the 32bit hypervisor days, we used to
>> have linear mappings in the Xen virtual range.  This check is stale
>> (from a functionality point of view), but still present in Xen.  A
>> consequence of this is that 32bit PV guests definitely don't share
>> top-level pagetables across vcpus.
> 
> Correction: the 32bit PV ABI prohibits sharing of L2 pagetables, but L3
> pagetables can be shared. So guests will schedule the same top-level
> pagetables across vcpus.
>
> But, 64bit Xen creates a monitor table for 32bit PAE guests and puts the
> CR3 provided by the guest into the first slot, so pcpus don't share the same
> L4 pagetables. The property we want still holds.

Ah, right -- but Xen can get away with this because in PAE mode, "L3" is
just 4 entries that are loaded on CR3-switch and not automatically kept
in sync by the hardware; i.e., the OS already needs to do its own
"manual syncing" if it updates any of the L3 entries; so it's the same
for Xen.

>> Juergen/Boris: Do you have any idea if/how easy this infrastructure
>> would be to implement for 64bit PV guests as well?  If a PV guest can
>> advertise via Elfnote that it won't share top-level pagetables, then we
>> can audit this trivially in Xen.
>>
> 
> After reading the Linux kernel code, I think it is not going to be trivial,
> as threads in Linux currently share one pagetable (as they should).
> 
> In order to make each thread have its own pagetable while still maintaining
> the illusion of one address space, there needs to be synchronisation
> under the hood.
> 
> There is code in Linux to synchronise vmalloc, but that's only for the
> kernel portion. The infrastructure to synchronise userspace portion is
> missing.
> 
> One idea is to follow the same model as vmalloc -- maintain a reference
> pagetable in struct mm and a list of pagetables for threads, then
> synchronise the pagetables in the page fault handler. But this is
> probably a bit hard to sell to Linux maintainers because it will touch a
> lot of the non-Xen code, increase complexity and decrease performance.

Sorry -- what do you mean "synchronize vmalloc"?  If every thread has a
different view of the kernel's vmalloc area, then every thread must have
a different L4 table, right?  And if every thread has a different L4
table, then we've already got the main thing we need from Linux, don't we?

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Thread overview: 63+ messages
2018-10-18 17:46 Ongoing/future speculative mitigation work Andrew Cooper
2018-10-19  8:09 ` Dario Faggioli
2018-10-19 12:17   ` Andrew Cooper
2018-10-22  9:32     ` Mihai Donțu
2018-10-22 14:55 ` Wei Liu
2018-10-22 15:09   ` Woodhouse, David
2018-10-22 15:14     ` Andrew Cooper
2018-10-25 14:50   ` Jan Beulich
2018-10-25 14:56     ` George Dunlap
2018-10-25 15:02       ` Jan Beulich
2018-10-25 16:29         ` Andrew Cooper
2018-10-25 16:43           ` George Dunlap
2018-10-25 16:50             ` Andrew Cooper
2018-10-25 17:07               ` George Dunlap
2018-10-26  9:16           ` Jan Beulich
2018-10-26  9:28             ` Wei Liu
2018-10-26  9:56               ` Jan Beulich
2018-10-26 10:51                 ` George Dunlap
2018-10-26 11:20                   ` Jan Beulich
2018-10-26 11:24                     ` George Dunlap
2018-10-26 11:33                       ` Jan Beulich
2018-10-26 11:43                         ` George Dunlap
2018-10-26 11:45                           ` Jan Beulich
2018-12-11 18:05                     ` Wei Liu
     [not found]                       ` <FB70ABC00200007CA293CED3@prv1-mh.provo.novell.com>
2018-12-12  8:32                         ` Jan Beulich
2018-10-24 15:24 ` Tamas K Lengyel
2018-10-25 16:01   ` Dario Faggioli
2018-10-25 16:25     ` Tamas K Lengyel
2018-10-25 17:23       ` Dario Faggioli
2018-10-25 17:29         ` Tamas K Lengyel
2018-10-26  7:31           ` Dario Faggioli
2018-10-25 16:55   ` Andrew Cooper
2018-10-25 17:01     ` George Dunlap
2018-10-25 17:35       ` Tamas K Lengyel
2018-10-25 17:43         ` Andrew Cooper
2018-10-25 17:58           ` Tamas K Lengyel
2018-10-25 18:13             ` Andrew Cooper
2018-10-25 18:35               ` Tamas K Lengyel
2018-10-25 18:39                 ` Andrew Cooper
2018-10-26  7:49                 ` Dario Faggioli
2018-10-26 12:01                   ` Tamas K Lengyel
2018-10-26 14:17                     ` Dario Faggioli
2018-10-26 10:11               ` George Dunlap
2018-12-07 18:40 ` Wei Liu
2018-12-10 12:12   ` George Dunlap [this message]
2018-12-10 12:19     ` George Dunlap
2019-01-24 11:44 ` Reducing or removing direct map from xen (was Re: Ongoing/future speculative mitigation work) Wei Liu
2019-01-24 16:00   ` George Dunlap
2019-02-07 16:50   ` Wei Liu
2019-02-20 12:29   ` Wei Liu
2019-02-20 13:00     ` Roger Pau Monné
2019-02-20 13:09       ` Wei Liu
2019-02-20 17:08         ` Wei Liu
2019-02-21  9:59           ` Roger Pau Monné
2019-02-21 17:51             ` Wei Liu
2019-02-22 11:48           ` Jan Beulich
2019-02-22 11:50             ` Wei Liu
2019-02-22 12:06               ` Jan Beulich
2019-02-22 12:11                 ` Wei Liu
2019-02-22 12:47                   ` Jan Beulich
2019-02-22 13:19                     ` Wei Liu
     [not found]                       ` <158783E402000088A293CED3@prv1-mh.provo.novell.com>
2019-02-22 13:24                         ` Jan Beulich
2019-02-22 13:27                           ` Jan Beulich
