linux-kernel.vger.kernel.org archive mirror
From: Nadav Amit <namit@vmware.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@intel.com>,
	Ingo Molnar <mingo@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	the arch/x86 maintainers <x86@kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Jann Horn <jannh@google.com>
Subject: Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
Date: Fri, 31 May 2019 18:29:33 +0000	[thread overview]
Message-ID: <82791CFA-3748-4881-9F01-53F677108FC3@vmware.com> (raw)
In-Reply-To: <20190531105758.GO2623@hirez.programming.kicks-ass.net>

[ +Jann Horn ]

> On May 31, 2019, at 3:57 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, May 30, 2019 at 11:36:44PM -0700, Nadav Amit wrote:
>> When we flush userspace mappings, we can defer the TLB flushes, as long
>> as the following conditions are met:
>> 
>> 1. No tables are freed, since otherwise speculative page walks might
>>   cause machine-checks.
>> 
>> 2. No one would access userspace before flush takes place. Specifically,
>>   NMI handlers and kprobes would avoid accessing userspace.
>> 
>> Use the new SMP support to execute remote function calls with inlined
>> data for this purpose. The remote TLB flushing function would be
>> executed asynchronously, and the local CPU would continue execution as
>> soon as the IPI was delivered, before the function was actually
>> executed. Since tlb_flush_info is copied, there is no risk it would
>> change before the TLB flush is actually executed.
>> 
>> Change nmi_uaccess_okay() to check whether a remote TLB flush is
>> currently in progress on this CPU by checking whether the asynchronously
>> called function is the remote TLB flushing function. The current
>> implementation disallows access in such cases, but it is also possible
>> to flush the entire TLB in such case and allow access.
> 
> ARGGH, brain hurt. I'm not sure I fully understand this one. How is it
> different from today, where the NMI can hit in the middle of the TLB
> invalidation?

Let’s assume that a page mapping is removed so a PTE is changed from present
to non-present. Today, the page would only be recycled after all the cores
flushed their TLB. So NMI handlers accessing the page are safe - they might
either access the previously mapped page (which still has not been reused)
or fault, but they cannot access the wrong data.

Following this change it might happen that one core would still have stale
mappings for a page that has already been recycled for a new use (e.g., it is
allocated to a different process). The core that has the stale mapping would
be inside the interrupt handler at that point, and we must ensure that an
NMI handler interrupting it does not access this page.

> Also; since we're not waiting on the IPI, what prevents us from freeing
> the user pages before the remote CPU is 'done' with them? Currently the
> synchronous IPI is like a sync point where we *know* the remote CPU is
> completely done accessing the page.

We might free them before the remote TLB flush is done, but only after the
IPI has been acknowledged and right before the actual TLB flush is about to
start. See flush_smp_call_function_queue().

                if (csd->flags & CSD_FLAG_SYNCHRONOUS) {
                        func(info);
                        csd_unlock(csd);
                } else {
                        this_cpu_write(async_func_in_progress, csd->func);
                        csd_unlock(csd);        /* (*) */
                        func(info);
                }

We do wait for the IPI to be received and the entry of the specific TLB
invalidation to be processed. Following the instruction that is marked with
(*) the initiator to the TLB shootdown would be able to continue, but not
before. We just do not wait for the actual flush (called by func()) to take
place.

> Where getting an IPI stops speculation, speculation again restarts
> inside the interrupt handler, and until we've passed the INVLPG/MOV CR3,
> speculation can happen on that TLB entry, even though we've already
> freed and re-used the user-page.

Let’s consider what the impact of such speculation might be. Virtually
indexed caches might have the wrong data. But this data should not be used
before the flush, since we are already in the interrupt handler, and we
should be able to prevent accesses to userspace.

A new vector for Spectre-type attacks might be opened, but I don’t see how.

A #MC might be caused. I tried to avoid it by not allowing page-tables to be
freed in such a way. Did I miss something else? Some interaction with
MTRR changes? I’ll think about it some more, but I don’t see how.


> Also, what happens if the TLB invalidation IPI is stuck behind another
> smp_function_call IPI that is doing user-access?

It’s not exactly the IPI but the call_single_data that matters. In such a
case, as shown above, csd_unlock() is only called once the CPU is about to
start handling the specific TLB flush.

> 
> As said,.. brain hurts.

I understand. Mine too. The textbook says that we need to flush all the TLBs
before we can recycle the page, be guaranteed that a write-protected page will
not change, and so on. I think that since we should be able to guarantee
that the userspace addresses which are invalidated would never be used by
the kernel before the flush, we should be safe. After the flush, all the
artifacts of speculative caching should disappear.

I might be wrong, and I do question myself, but I don’t see how.


