Subject: Re: Potential race in TLB flush batching?
From: Nadav Amit
In-Reply-To: <20170711092935.bogdb4oja6v7kilq@suse.de>
Date: Tue, 11 Jul 2017 03:40:02 -0700
To: Mel Gorman
Cc: Andy Lutomirski, "open list:MEMORY MANAGEMENT"

Mel Gorman wrote:

> On Tue, Jul 11, 2017 at 12:30:28AM -0700, Nadav Amit wrote:
>> Mel Gorman wrote:
>>
>>> On Mon, Jul 10, 2017 at 05:52:25PM -0700, Nadav Amit wrote:
>>>> Something bothers me about the TLB flush batching mechanism that Linux
>>>> uses on x86, and I would appreciate your opinion regarding it.
>>>>
>>>> As you know, try_to_unmap_one() can batch TLB invalidations. While
>>>> doing so, however, the page-table lock(s) are not held, and I see no
>>>> indication of the pending flush saved (and regarded) in the relevant
>>>> mm-structs.
>>>>
>>>> So, my question: what prevents, at least in theory, the following
>>>> scenario:
>>>>
>>>>   CPU0                               CPU1
>>>>   ----                               ----
>>>>                                      user accesses memory using RW PTE
>>>>                                      [PTE now cached in TLB]
>>>>   try_to_unmap_one()
>>>>   ==> ptep_get_and_clear()
>>>>   ==> set_tlb_ubc_flush_pending()
>>>>                                      mprotect(addr, PROT_READ)
>>>>                                      ==> change_pte_range()
>>>>                                      ==> [ PTE non-present - no flush ]
>>>>
>>>>                                      user writes using cached RW PTE
>>>>   ...
>>>>
>>>>   try_to_unmap_flush()
>>>>
>>>> As you can see, CPU1's write should have failed, but may succeed.
>>>>
>>>> Now, I don't have a PoC, since in practice it seems hard to create such
>>>> a scenario: try_to_unmap_one() is likely to find the PTE accessed, and
>>>> the PTE would not be reclaimed.
>>>
>>> That is the same as a race whereby there is no batching mechanism and
>>> the racing operation happens between a pte clear and a flush, as
>>> ptep_clear_flush is not atomic. All that differs is that the race window
>>> is a different size. The application on CPU1 is buggy in that it may or
>>> may not succeed the write, but it is buggy regardless of whether a
>>> batching mechanism is used or not.
>>
>> Thanks for your quick and detailed response, but I fail to see how it can
>> happen without batching. Indeed, the PTE clear and flush are not
>> "atomic", but without batching they are both performed under the
>> page-table lock (which is acquired in page_vma_mapped_walk and released
>> in page_vma_mapped_walk_done). Since the lock is taken, other cores
>> should not be able to inspect/modify the PTE. Relevant functions, e.g.,
>> zap_pte_range and change_pte_range, acquire the lock before accessing
>> the PTEs.
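
To make the difference concrete, here is a rough sketch of the two
orderings (schematic only, not the exact kernel code; the point is where
the flush sits relative to the page-table lock):

	/* Unbatched: the clear and the flush happen under the same PTL. */
	spin_lock(ptl);
	pte = ptep_get_and_clear(mm, addr, ptep);
	flush_tlb_page(vma, addr);	/* stale entry gone before unlock */
	spin_unlock(ptl);

	/* Batched (try_to_unmap_one): the flush is deferred. */
	spin_lock(ptl);
	pte = ptep_get_and_clear(mm, addr, ptep);
	set_tlb_ubc_flush_pending(mm, pte_dirty(pte));
	spin_unlock(ptl);
	/* ... window in which another CPU can still use a stale TLB entry,
	 * and mprotect() can take the PTL and see a non-present PTE ... */
	try_to_unmap_flush();		/* much later, outside the PTL */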
>
> I was primarily thinking in terms of memory corruption or data loss.
> However, we are still protected, although it's not particularly obvious
> why.
>
> On the reclaim side, we are either reclaiming clean pages (which ignore
> the accessed bit) or doing normal reclaim. If it's clean pages, then any
> parallel write must update the dirty bit at minimum. If it's normal
> reclaim, then the accessed bit is checked and, if cleared in
> try_to_unmap_one, it uses a ptep_clear_flush_young_notify so the TLB gets
> flushed. We don't reclaim the page in either case as part of
> page_referenced or try_to_unmap_one, but clearing the accessed bit
> flushes the TLB.

Wait. Are you looking at the x86 arch function? The TLB is not flushed when
the access bit is cleared:

int ptep_clear_flush_young(struct vm_area_struct *vma,
			   unsigned long address, pte_t *ptep)
{
	/*
	 * On x86 CPUs, clearing the accessed bit without a TLB flush
	 * doesn't cause data corruption. [ It could cause incorrect
	 * page aging and the (mistaken) reclaim of hot pages, but the
	 * chance of that should be relatively low. ]
	 *
	 * So as a performance optimization don't flush the TLB when
	 * clearing the accessed bit, it will eventually be flushed by
	 * a context switch or a VM operation anyway. [ In the rare
	 * event of it not getting flushed for a long time the delay
	 * shouldn't really matter because there's no real memory
	 * pressure for swapout to react to. ]
	 */
	return ptep_test_and_clear_young(vma, address, ptep);
}

> On the mprotect side then, as the page was first accessed, clearing the
> accessed bit incurs a TLB flush on the reclaim side before the second
> write. That means any TLB entry that exists cannot have the accessed bit
> set, so a second write needs to update it.
>
> While it's not clearly documented, I checked with hardware engineers at
> the time that an update of the accessed or dirty bit, even with a TLB
> entry, will check the underlying page tables and trap if the PTE is not
> present, and the subsequent fault will then fail on sigsegv if the VMA
> protections no longer allow the write.
>
> So, on one side, if the accessed bit is ignored during reclaim, the pages
> are clean, so any access will set the dirty bit and trap if unmapped in
> parallel. On the other side, if the accessed bit was set, clearing it
> flushed the TLB; and if it was not set, the hardware needs to update it
> and again will trap if unmapped in parallel.

Yet, even regardless of the TLB flush, it seems there is still a possible
race:

	CPU0					CPU1
	----					----
	ptep_clear_flush_young_notify
	==> PTE.A==0
						access PTE
						==> PTE.A=1
	ptep_get_and_clear
	change mapping (and PTE)
						use stale TLB entry

> If this guarantee from hardware was ever shown to be wrong, or another
> architecture wanted to add batching without the same guarantee, then
> mprotect would need to do a local_flush_tlb if no pages were updated by
> the mprotect, but right now this should not be necessary.
>
>> Can you please explain why you consider the application to be buggy?
>
> I considered it a bit dumb to mprotect for READ/NONE and then try writing
> the same mapping. However, it will behave as expected.

I don't think that this is the only scenario. For example, the application
may create a new memory mapping of a different file using mmap at the same
memory address that was used before, just as that memory is reclaimed. The
application can (inadvertently) cause such a scenario by using MAP_FIXED.
But even without MAP_FIXED, running mmap->munmap->mmap can reuse the same
virtual address.
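
As a trivial (untested) illustration of the address reuse, under the usual
Linux mmap semantics:

	#include <stdio.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 4096;
		void *a = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		munmap(a, len);
		/* Nothing stops the kernel from handing back the same
		 * address; MAP_FIXED at 'a' would force the reuse. */
		void *b = mmap(a, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		printf("a=%p b=%p\n", a, b);
		return 0;
	}

If reclaim of the old mapping is still in flight when the new mapping is
created at the same address, a stale TLB entry for the old pages could be
used against the new ones.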
>> AFAIU
>> an application can wish to trap certain memory accesses using userfaultfd
>> or SIGSEGV. For example, it may do it for garbage collection or
>> sandboxing. To do so, it can use mprotect with PROT_NONE and expect to be
>> able to trap future accesses to that memory. This use-case is described
>> in the userfaultfd documentation.
>
> Such applications are safe due to how the accessed bit is handled by the
> software (flushes TLB if clearing young) and hardware (traps if updating
> the accessed or dirty bit and the underlying PTE was unmapped even if
> there is a TLB entry).

I don't think it is so. And I also think there are many additional
potentially problematic scenarios.

Thanks for your patience,

Nadav
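
P.S. For concreteness, the kind of pattern such applications rely on looks
roughly like this (a simplified sketch, not taken from any particular GC or
sandbox):

	#include <signal.h>
	#include <stdint.h>
	#include <sys/mman.h>

	#define PAGE_SIZE 4096UL

	static void handler(int sig, siginfo_t *si, void *uc)
	{
		/* A GC/sandbox would record the access here, then
		 * re-enable the page so the faulting access can retry. */
		void *page = (void *)((uintptr_t)si->si_addr &
				      ~(PAGE_SIZE - 1));
		mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);
	}

	int main(void)
	{
		struct sigaction sa = {
			.sa_sigaction = handler,
			.sa_flags = SA_SIGINFO,
		};
		sigaction(SIGSEGV, &sa, NULL);

		char *p = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		p[0] = 1;	/* populate; PTE now cached in the TLB */
		mprotect(p, PAGE_SIZE, PROT_NONE);
		p[0] = 2;	/* expected to trap -- the question is
				   whether a stale TLB entry can let it
				   through */
		return 0;
	}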