Re: [RFC PATCH v7 00/16] Add support for eXclusive Page Frame Ownership

From: Khalid Aziz <khalid.aziz@oracle.com>
To: Dave Hansen <dave.hansen@intel.com>,
	juergh@gmail.com, tycho@tycho.ws, jsteckli@amazon.de,
	ak@linux.intel.com, torvalds@linux-foundation.org,
	liran.alon@oracle.com, keescook@google.com,
	konrad.wilk@oracle.com
Cc: deepa.srinivasan@oracle.com, chris.hyser@oracle.com,
	tyhicks@canonical.com, dwmw@amazon.co.uk,
	andrew.cooper3@citrix.com, jcm@redhat.com,
	boris.ostrovsky@oracle.com, kanth.ghatraju@oracle.com,
	joao.m.martins@oracle.com, jmattson@google.com,
	pradeep.vincent@oracle.com, john.haxby@oracle.com,
	tglx@linutronix.de, kirill.shutemov@linux.intel.com, hch@lst.de,
	steven.sistare@oracle.com, kernel-hardening@lists.openwall.com,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Andy Lutomirski <luto@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: [RFC PATCH v7 00/16] Add support for eXclusive Page Frame Ownership
Date: Fri, 11 Jan 2019 11:21:04 -0700	[thread overview]
Message-ID: <7e3b2c4b-51ff-2027-3a53-8c798c2ca588@oracle.com> (raw)
In-Reply-To: <31fe7522-0a59-94c8-663e-049e9ad2bff6@intel.com>

[-- Attachment #1: Type: text/plain, Size: 4657 bytes --]

Hi Dave,

Thanks for looking at this and providing feedback.

On 1/10/19 4:40 PM, Dave Hansen wrote:
> First of all, thanks for picking this back up.  It looks to be going in
> a very positive direction!
> 
> On 1/10/19 1:09 PM, Khalid Aziz wrote:
>> I implemented a solution to reduce performance penalty and
>> that has had large impact. When XPFO code flushes stale TLB entries,
>> it does so for all CPUs on the system which may include CPUs that
>> may not have any matching TLB entries or may never be scheduled to
>> run the userspace task causing TLB flush.
> ...
>> A rogue process can launch a ret2dir attack only from a CPU that has 
>> dual mapping for its pages in physmap in its TLB. We can hence defer 
>> TLB flush on a CPU until a process that would have caused a TLB
>> flush is scheduled on that CPU.
> 
> This logic is a bit suspect to me.  Imagine a situation where we have
> two attacker processes: one which is causing page to go from
> kernel->user (and be unmapped from the kernel) and a second process that
> *was* accessing that page.
> 
> The second process could easily have the page's old TLB entry.  It could
> abuse that entry as long as that CPU doesn't context switch
> (switch_mm_irqs_off()) or otherwise flush the TLB entry.

That is an interesting scenario. Working through this scenario, physmap
TLB entry for a page is flushed on the local processor when the page is
allocated to userspace, in xpfo_alloc_pages(). When the userspace passes
page back into kernel, that page is mapped into kernel space using a va
from kmap pool in xpfo_kmap() which can be different for each new
mapping of the same page. The physical page is unmapped from kernel on
the way back from kernel to userspace by xpfo_kunmap(). So two processes
on different CPUs sharing same physical page might not be seeing the
same virtual address for that page while they are in the kernel, as long
as it is an address from kmap pool. ret2dir attack relies upon being
able to craft a predictable virtual address in the kernel physmap for a
physical page and redirect execution to that address. Does that sound right?

Now what happens if only one of these cooperating processes allocates
the page, places malicious payload on that page and passes the address
of this page to the other process which can deduce physmap for the page
through /proc and exploit the physmap entry for the page on its CPU.
That must be the scenario you are referring to.

> 
> As for where to flush the TLB...  As you know, using synchronous IPIs is
> obviously the most bulletproof from a mitigation perspective.  If you
> can batch the IPIs, you can get the overhead down, but you need to do
> the flushes for a bunch of pages at once, which I think is what you were
> exploring but haven't gotten working yet.
> 
> Anything else you do will have *some* reduced mitigation value, which
> isn't a deal-breaker (to me at least).  Some ideas:

Even without batched IPIs working reliably, I was able to measure the
performance impact of this partially working solution. With just batched
IPIs and no delayed TLB flushes, performance improved by a factor of 2.
The 26x system time went down to 12x-13x but it was still too high and a
non-starter. Combining batched IPI with delayed TLB flushes improved
performance to about 1.1x as opposed to 1.33x with delayed TLB flush
alone. Those numbers are very rough since the batching implementation is
incomplete.

> 
> Take a look at the SWITCH_TO_KERNEL_CR3 in head_64.S.  Every time that
> gets called, we've (potentially) just done a user->kernel transition and
> might benefit from flushing the TLB.  We're always doing a CR3 write (on
> Meltdown-vulnerable hardware) and it can do a full TLB flush based on if
> X86_CR3_PCID_NOFLUSH_BIT is set.  So, when you need a TLB flush, you
> would set a bit that ADJUST_KERNEL_CR3 would see on the next
> user->kernel transition on *each* CPU.  Potentially, multiple TLB
> flushes could be coalesced this way.  The downside of this is that
> you're exposed to the old TLB entries if a flush is needed while you are
> already *in* the kernel.
> 
> You could also potentially do this from C code, like in the syscall
> entry code, or in sensitive places, like when you're returning from a
> guest after a VMEXIT in the kvm code.
> 

Good suggestions. Thanks.

I think benefit will be highest from batching TLB flushes. I see a lot
of time consumed by full TLB flushes on other processors when local
processor did only a limited TLB flush. I will continue to debug the
batch TLB updates.

--
Khalid

[-- Attachment #2: pEpkey.asc --]
[-- Type: application/pgp-keys, Size: 2501 bytes --]