Subject: Re: Potential race in TLB flush batching?
From: Nadav Amit
In-Reply-To: <20170711092935.bogdb4oja6v7kilq@suse.de>
Date: Tue, 11 Jul 2017 03:40:02 -0700
To: Mel Gorman
Cc: Andy Lutomirski, "open list:MEMORY MANAGEMENT"

Mel Gorman wrote:

> On Tue, Jul 11, 2017 at 12:30:28AM -0700, Nadav Amit wrote:
>> Mel Gorman wrote:
>>
>>> On Mon, Jul 10, 2017 at 05:52:25PM -0700, Nadav Amit wrote:
>>>> Something bothers me about the TLB flush batching mechanism that Linux
>>>> uses on x86, and I would appreciate your opinion regarding it.
>>>>
>>>> As you know, try_to_unmap_one() can batch TLB invalidations. While
>>>> doing so, however, the page-table lock(s) are not held, and I see no
>>>> indication of the pending flush saved (and regarded) in the relevant
>>>> mm-structs.
>>>>
>>>> So, my question: what prevents, at least in theory, the following
>>>> scenario:
>>>>
>>>>   CPU0                               CPU1
>>>>   ----                               ----
>>>>                                      user accesses memory using RW PTE
>>>>                                      [PTE now cached in TLB]
>>>>   try_to_unmap_one()
>>>>   ==> ptep_get_and_clear()
>>>>   ==> set_tlb_ubc_flush_pending()
>>>>                                      mprotect(addr, PROT_READ)
>>>>                                      ==> change_pte_range()
>>>>                                      ==> [ PTE non-present - no flush ]
>>>>
>>>>                                      user writes using cached RW PTE
>>>>   ...
>>>>
>>>>   try_to_unmap_flush()
>>>>
>>>> As you can see, CPU1's write should have failed, but may succeed.
>>>>
>>>> Now, I don't have a PoC, since in practice it seems hard to create such
>>>> a scenario: try_to_unmap_one() is likely to find the PTE accessed, and
>>>> the PTE would not be reclaimed.
>>>
>>> That is the same as a race whereby there is no batching mechanism and
>>> the racing operation happens between a pte clear and a flush, as
>>> ptep_clear_flush is not atomic. All that differs is that the race window
>>> is a different size. The application on CPU1 is buggy in that it may or
>>> may not succeed the write, but it is buggy regardless of whether a
>>> batching mechanism is used or not.
>>
>> Thanks for your quick and detailed response, but I fail to see how it can
>> happen without batching. Indeed, the PTE clear and flush are not
>> "atomic", but without batching they are both performed under the
>> page-table lock (which is acquired in page_vma_mapped_walk and released
>> in page_vma_mapped_walk_done). Since the lock is taken, other cores
>> should not be able to inspect/modify the PTE. Relevant functions, e.g.,
>> zap_pte_range and change_pte_range, acquire the lock before accessing
>> the PTEs.
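
To make the difference concrete, here is a rough sketch of the two
orderings (schematic only, not the exact kernel code; the point is where
the flush sits relative to the page-table lock):

	/* Unbatched: the clear and the flush happen under the same PTL. */
	spin_lock(ptl);
	pte = ptep_get_and_clear(mm, addr, ptep);
	flush_tlb_page(vma, addr);	/* stale entry gone before unlock */
	spin_unlock(ptl);

	/* Batched (try_to_unmap_one): the flush is deferred. */
	spin_lock(ptl);
	pte = ptep_get_and_clear(mm, addr, ptep);
	set_tlb_ubc_flush_pending(mm, pte_dirty(pte));
	spin_unlock(ptl);
	/* ... window in which another CPU can still use a stale TLB entry,
	 * and mprotect() can take the PTL and see a non-present PTE ... */
	try_to_unmap_flush();		/* much later, outside the PTL */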
>
> I was primarily thinking in terms of memory corruption or data loss.
> However, we are still protected, although it's not particularly obvious
> why.
>
> On the reclaim side, we are either reclaiming clean pages (which ignore
> the accessed bit) or doing normal reclaim. If it's clean pages, then any
> parallel write must update the dirty bit at minimum. If it's normal
> reclaim, then the accessed bit is checked and, if cleared in
> try_to_unmap_one, it uses a ptep_clear_flush_young_notify so the TLB gets
> flushed. We don't reclaim the page in either case as part of
> page_referenced or try_to_unmap_one, but clearing the accessed bit
> flushes the TLB.

Wait. Are you looking at the x86 arch function? The TLB is not flushed when
the access bit is cleared:

int ptep_clear_flush_young(struct vm_area_struct *vma,
			   unsigned long address, pte_t *ptep)
{
	/*
	 * On x86 CPUs, clearing the accessed bit without a TLB flush
	 * doesn't cause data corruption. [ It could cause incorrect
	 * page aging and the (mistaken) reclaim of hot pages, but the
	 * chance of that should be relatively low. ]
	 *
	 * So as a performance optimization don't flush the TLB when
	 * clearing the accessed bit, it will eventually be flushed by
	 * a context switch or a VM operation anyway. [ In the rare
	 * event of it not getting flushed for a long time the delay
	 * shouldn't really matter because there's no real memory
	 * pressure for swapout to react to. ]
	 */
	return ptep_test_and_clear_young(vma, address, ptep);
}

> On the mprotect side then, as the page was first accessed, clearing the
> accessed bit incurs a TLB flush on the reclaim side before the second
> write. That means any TLB entry that exists cannot have the accessed bit
> set, so a second write needs to update it.
>
> While it's not clearly documented, I checked with hardware engineers at
> the time that an update of the accessed or dirty bit, even with a TLB
> entry, will check the underlying page tables and trap if the PTE is not
> present, and the subsequent fault will then fail on sigsegv if the VMA
> protections no longer allow the write.
>
> So, on one side, if the accessed bit is ignored during reclaim, the pages
> are clean, so any access will set the dirty bit and trap if unmapped in
> parallel. On the other side, if the accessed bit was set, clearing it
> flushed the TLB; and if it was not set, the hardware needs to update it
> and again will trap if unmapped in parallel.

Yet, even regardless of the TLB flush, it seems there is still a possible
race:

	CPU0					CPU1
	----					----
	ptep_clear_flush_young_notify
	==> PTE.A==0
						access PTE
						==> PTE.A=1
	ptep_get_and_clear
	change mapping (and PTE)
						use stale TLB entry

> If this guarantee from hardware was ever shown to be wrong, or another
> architecture wanted to add batching without the same guarantee, then
> mprotect would need to do a local_flush_tlb if no pages were updated by
> the mprotect, but right now this should not be necessary.
>
>> Can you please explain why you consider the application to be buggy?
>
> I considered it a bit dumb to mprotect for READ/NONE and then try writing
> the same mapping. However, it will behave as expected.

I don't think that this is the only scenario. For example, the application
may create a new memory mapping of a different file using mmap at the same
memory address that was used before, just as that memory is reclaimed. The
application can (inadvertently) cause such a scenario by using MAP_FIXED.
But even without MAP_FIXED, running mmap->munmap->mmap can reuse the same
virtual address.
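
As a trivial (untested) illustration of the address reuse, under the usual
Linux mmap semantics:

	#include <stdio.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 4096;
		void *a = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		munmap(a, len);
		/* Nothing stops the kernel from handing back the same
		 * address; MAP_FIXED at 'a' would force the reuse. */
		void *b = mmap(a, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		printf("a=%p b=%p\n", a, b);
		return 0;
	}

If reclaim of the old mapping is still in flight when the new mapping is
created at the same address, a stale TLB entry for the old pages could be
used against the new ones.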
>> AFAIU
>> an application can wish to trap certain memory accesses using userfaultfd
>> or SIGSEGV. For example, it may do it for garbage collection or
>> sandboxing. To do so, it can use mprotect with PROT_NONE and expect to be
>> able to trap future accesses to that memory. This use-case is described
>> in the userfaultfd documentation.
>
> Such applications are safe due to how the accessed bit is handled by the
> software (flushes TLB if clearing young) and hardware (traps if updating
> the accessed or dirty bit and the underlying PTE was unmapped even if
> there is a TLB entry).

I don't think it is so. And I also think there are many additional
potentially problematic scenarios.

Thanks for your patience,

Nadav
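
P.S. For concreteness, the kind of pattern such applications rely on looks
roughly like this (a simplified sketch, not taken from any particular GC or
sandbox):

	#include <signal.h>
	#include <stdint.h>
	#include <sys/mman.h>

	#define PAGE_SIZE 4096UL

	static void handler(int sig, siginfo_t *si, void *uc)
	{
		/* A GC/sandbox would record the access here, then
		 * re-enable the page so the faulting access can retry. */
		void *page = (void *)((uintptr_t)si->si_addr &
				      ~(PAGE_SIZE - 1));
		mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);
	}

	int main(void)
	{
		struct sigaction sa = {
			.sa_sigaction = handler,
			.sa_flags = SA_SIGINFO,
		};
		sigaction(SIGSEGV, &sa, NULL);

		char *p = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		p[0] = 1;	/* populate; PTE now cached in the TLB */
		mprotect(p, PAGE_SIZE, PROT_NONE);
		p[0] = 2;	/* expected to trap -- the question is
				   whether a stale TLB entry can let it
				   through */
		return 0;
	}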