Subject: Re: Potential race in TLB flush batching?
From: Nadav Amit
Date: Tue, 11 Jul 2017 09:22:47 -0700
In-Reply-To: <20170711132023.wdfpjxwtbqpi3wp2@suse.de>
To: Mel Gorman
Cc: Andy Lutomirski, "open list:MEMORY MANAGEMENT"

Mel Gorman wrote:

> On Tue, Jul 11, 2017 at 03:40:02AM -0700, Nadav Amit wrote:
>> Mel Gorman wrote:
>>
>>>>> That is the same as a race whereby there is no batching mechanism and
>>>>> the racing operation happens between a pte clear and a flush, as
>>>>> ptep_clear_flush is not atomic. All that differs is that the race
>>>>> window is a different size. The application on CPU1 is buggy in that
>>>>> it may or may not succeed the write, but it is buggy regardless of
>>>>> whether a batching mechanism is used or not.
>>>>
>>>> Thanks for your quick and detailed response, but I fail to see how it
>>>> can happen without batching. Indeed, the PTE clear and flush are not
>>>> "atomic", but without batching they are both performed under the page
>>>> table lock (which is acquired in page_vma_mapped_walk and released in
>>>> page_vma_mapped_walk_done). Since the lock is taken, other cores should
>>>> not be able to inspect/modify the PTE. Relevant functions, e.g.,
>>>> zap_pte_range and change_pte_range, acquire the lock before accessing
>>>> the PTEs.
>>>
>>> I was primarily thinking in terms of memory corruption or data loss.
>>> However, we are still protected although it's not particularly obvious why.
>>>
>>> On the reclaim side, we are either reclaiming clean pages (which ignore
>>> the accessed bit) or doing normal reclaim. If it's clean pages then any
>>> parallel write must update the dirty bit at minimum. If it's normal
>>> reclaim then the accessed bit is checked and, if cleared in
>>> try_to_unmap_one, it uses ptep_clear_flush_young_notify so the TLB gets
>>> flushed. We don't reclaim the page in either case as part of
>>> page_referenced or try_to_unmap_one, but clearing the accessed bit
>>> flushes the TLB.
>>
>> Wait. Are you looking at the x86 arch function? The TLB is not flushed
>> when the access bit is cleared:
>>
>> int ptep_clear_flush_young(struct vm_area_struct *vma,
>>                            unsigned long address, pte_t *ptep)
>> {
>>         /*
>>          * On x86 CPUs, clearing the accessed bit without a TLB flush
>>          * doesn't cause data corruption. [ It could cause incorrect
>>          * page aging and the (mistaken) reclaim of hot pages, but the
>>          * chance of that should be relatively low. ]
>>          *
>>          * So as a performance optimization don't flush the TLB when
>>          * clearing the accessed bit, it will eventually be flushed by
>>          * a context switch or a VM operation anyway. [ In the rare
>>          * event of it not getting flushed for a long time the delay
>>          * shouldn't really matter because there's no real memory
>>          * pressure for swapout to react to. ]
>>          */
>>         return ptep_test_and_clear_young(vma, address, ptep);
>> }
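For reference, the generic fallback in mm/pgtable-generic.c is the variant
that does flush when the bit was actually cleared; roughly (paraphrased from
memory, so details may differ between kernel versions):

int ptep_clear_flush_young(struct vm_area_struct *vma,
                           unsigned long address, pte_t *ptep)
{
        int young;

        /* Clear the accessed bit ... */
        young = ptep_test_and_clear_young(vma, address, ptep);
        /* ... and only flush this page's TLB entry if the bit was set. */
        if (young)
                flush_tlb_page(vma, address);

        return young;
}

x86 overrides this with the non-flushing version quoted above.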
>
> I forgot this detail, thanks for correcting me.
>
>>> On the mprotect side then, as the page was first accessed, clearing the
>>> accessed bit incurs a TLB flush on the reclaim side before the second
>>> write. That means any TLB entry that exists cannot have the accessed bit
>>> set, so a second write needs to update it.
>>>
>>> While it's not clearly documented, I checked with hardware engineers
>>> at the time that an update of the accessed or dirty bit even with a TLB
>>> entry will check the underlying page tables and trap if it's not present,
>>> and the subsequent fault will then fail on sigsegv if the VMA protections
>>> no longer allow the write.
>>>
>>> So, on one side, if ignoring the accessed bit during reclaim, the pages
>>> are clean, so any access will set the dirty bit and trap if unmapped in
>>> parallel. On the other side, the accessed bit, if set, cleared the TLB,
>>> and if not set, the hardware needs to update it and again will trap if
>>> unmapped in parallel.
>>
>> Yet, even regardless of the TLB flush, it seems there is still a possible
>> race:
>>
>>   CPU0                                CPU1
>>   ----                                ----
>>   ptep_clear_flush_young_notify
>>   ==> PTE.A==0
>>                                       access PTE
>>                                       ==> PTE.A=1
>>   ptep_get_and_clear
>>   change mapping (and PTE)
>>                                       Use stale TLB entry
>
> So I think you're right and this is a potential race. The first access can
> be a read or a write as it's a problem if the mprotect call restricts
> access.
>
>>> If this guarantee from hardware was ever shown to be wrong, or another
>>> architecture wanted to add batching without the same guarantee, then
>>> mprotect would need to do a local_flush_tlb if no pages were updated by
>>> the mprotect, but right now this should not be necessary.
>>>
>>>> Can you please explain why you consider the application to be buggy?
>>>
>>> I considered it a bit dumb to mprotect for READ/NONE and then try writing
>>> the same mapping. However, it will behave as expected.
>>
>> I don't think that this is the only scenario. For example, the application
>> may create a new memory mapping of a different file using mmap at the same
>> memory address that was used before, just as that memory is reclaimed.
>
> That requires the existing mapping to be unmapped which will flush the
> TLB and parallel mmap/munmap serialises on mmap_sem. The race appears to
> be specific to mprotect which avoids the TLB flush if no pages were updated.

Why? As far as I can see, the chain of calls during munmap is somewhat like:

do_munmap
=> unmap_region
==> tlb_gather_mmu
===> unmap_vmas
====> unmap_page_range
...
=====> zap_pte_range  - this one batches only present PTEs
===> free_pgtables    - this one runs only if page tables are removed
===> pte_free_tlb
==> tlb_finish_mmu
===> tlb_flush_mmu
====> tlb_flush_mmu_tlbonly

zap_pte_range checks whether each PTE is pte_none and may find that it is,
if a concurrent try_to_unmap_one already cleared the PTE. In this case it
will not update the range of the mmu_gather and will not indicate that a
flush of that PTE is needed. Then, tlb_flush_mmu_tlbonly will find that no
PTE was cleared (tlb->end == 0) and avoid the flush, or may just flush fewer
PTEs than actually needed.

Due to this behavior, there is a concern that in other cases as well, when
mmu_gather is used, a PTE flush may be missed.
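To make this concrete, the relevant logic (heavily simplified and paraphrased
from mm/memory.c of that era, so treat it as a sketch rather than the exact
code) looks roughly like:

/* zap_pte_range(): only present PTEs extend the mmu_gather flush range */
do {
        pte_t ptent = *pte;

        if (pte_none(ptent))
                continue;       /* concurrently cleared PTE: nothing recorded */

        if (pte_present(ptent)) {
                ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
                tlb_remove_tlb_entry(tlb, pte, addr);  /* grows tlb->start/end */
                ...
        }
        ...
} while (pte++, addr += PAGE_SIZE, addr != end);

/* tlb_flush_mmu_tlbonly(): no recorded range means no flush at all */
static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
{
        if (!tlb->end)
                return;

        tlb_flush(tlb);
        mmu_notifier_invalidate_range(tlb->mm, tlb->start, tlb->end);
        __tlb_reset_range(tlb);
}

So if try_to_unmap_one cleared a PTE (and deferred its flush) just before
zap_pte_range walks it, that entry contributes nothing to tlb->start/end, and
the batched flush at munmap time can be narrower than what was actually
mapped, or skipped entirely.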
>> The application can (inadvertently) cause such a scenario by using
>> MAP_FIXED. But even without MAP_FIXED, running mmap->munmap->mmap can
>> reuse the same virtual address.
>
> With flushes in between.
>
>>> Such applications are safe due to how the accessed bit is handled by the
>>> software (flushes TLB if clearing young) and hardware (traps if updating
>>> the accessed or dirty bit and the underlying PTE was unmapped even if
>>> there is a TLB entry).
>>
>> I don't think it is so. And I also think there are many additional
>> potentially problematic scenarios.
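To spell out the user-visible contract at stake, here is a minimal user-space
sketch (my own illustration, not taken from this thread, and not a reproducer,
since actually hitting the race also requires reclaim to batch a flush at just
the wrong moment): once mprotect(PROT_READ) returns, the second write below
must fault, whereas a stale RW TLB entry would let it silently succeed.

#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static void on_segv(int sig)
{
        (void)sig;
        /* Expected outcome: the post-mprotect write faults. */
        write(STDOUT_FILENO, "write faulted as expected\n", 26);
        _exit(0);
}

int main(void)
{
        size_t len = sysconf(_SC_PAGESIZE);
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return 1;

        signal(SIGSEGV, on_segv);

        p[0] = 1;                       /* populate a writable TLB entry */
        if (mprotect(p, len, PROT_READ))
                return 1;
        p[0] = 2;                       /* must fault; a stale entry would let it succeed */

        printf("write went through after mprotect(PROT_READ)\n");
        return 0;
}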
>
> I believe it's specific to mprotect but can be handled by flushing the
> local TLB when mprotect updates no pages. Something like this;
>
> ---8<---
> mm, mprotect: Flush the local TLB if mprotect potentially raced with a parallel reclaim
>
> Nadav Amit identified a theoretical race between page reclaim and mprotect
> due to TLB flushes being batched outside of the PTL being held. He described
> the race as follows
>
>         CPU0                            CPU1
>         ----                            ----
>                                         user accesses memory using RW PTE
>                                         [PTE now cached in TLB]
>         try_to_unmap_one()
>         ==> ptep_get_and_clear()
>         ==> set_tlb_ubc_flush_pending()
>                                         mprotect(addr, PROT_READ)
>                                         ==> change_pte_range()
>                                         ==> [ PTE non-present - no flush ]
>
>                                         user writes using cached RW PTE
>         ...
>
>         try_to_unmap_flush()
>
> The same type of race exists for reads when protecting for PROT_NONE.
> This is not a data integrity issue as the TLB is always flushed before any
> IO is queued or a page is freed, but it is a correctness issue as a process
> restricting access with mprotect() may still be able to access the data
> after the syscall returns due to a stale TLB entry. Handle this issue by
> flushing the local TLB if reclaim is potentially batching TLB flushes and
> mprotect altered no pages.
>
> Signed-off-by: Mel Gorman
> Cc: stable@vger.kernel.org # v4.4+
> ---
>  mm/internal.h |  5 ++++-
>  mm/mprotect.c | 12 ++++++++++--
>  mm/rmap.c     | 20 ++++++++++++++++++++
>  3 files changed, 34 insertions(+), 3 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 0e4f558412fb..9b7d1a597816 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -498,6 +498,7 @@ extern struct workqueue_struct *mm_percpu_wq;
>  #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
>  void try_to_unmap_flush(void);
>  void try_to_unmap_flush_dirty(void);
> +void batched_unmap_protection_update(void);
>  #else
>  static inline void try_to_unmap_flush(void)
>  {
> @@ -505,7 +506,9 @@ static inline void try_to_unmap_flush(void)
>  static inline void try_to_unmap_flush_dirty(void)
>  {
>  }
> -
> +static inline void batched_unmap_protection_update()
> +{
> +}
>  #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
>
>  extern const struct trace_print_flags pageflag_names[];
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 8edd0d576254..3de353d4b5fb 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -254,9 +254,17 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
>  				 dirty_accountable, prot_numa);
>  	} while (pgd++, addr = next, addr != end);
>
> -	/* Only flush the TLB if we actually modified any entries: */
> -	if (pages)
> +	/*
> +	 * Only flush all TLBs if we actually modified any entries. If no
> +	 * pages are modified, then call batched_unmap_protection_update
> +	 * if the context is a mprotect() syscall.
> +	 */
> +	if (pages) {
>  		flush_tlb_range(vma, start, end);
> +	} else {
> +		if (!prot_numa)
> +			batched_unmap_protection_update();
> +	}
>  	clear_tlb_flush_pending(mm);
>
>  	return pages;
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d405f0e0ee96..02cb035e4ce6 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -643,6 +643,26 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
>
>  	return should_defer;
>  }
> +
> +/*
> + * This is called after an mprotect update that altered no pages. Batched
> + * unmap releases the PTL before a flush occurs, leaving a window where a
> + * task can still access a page via a stale TLB entry after an mprotect
> + * that reduces access rights has returned. Avoid this possibility by
> + * flushing the local TLB if mprotect updates no pages so that the caller of
> + * mprotect always gets expected behaviour. It's overkill and unnecessary to
> + * flush all TLBs as a separate thread accessing the data that raced with
> + * both reclaim and mprotect as there is no risk of data corruption and
> + * the exact timing of a parallel thread seeing a protection update without
> + * any serialisation on the application side is always uncertain.
> + */
> +void batched_unmap_protection_update(void)
> +{
> +	count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
> +	local_flush_tlb();
> +	trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
> +}
> +
>  #else
>  static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
>  {

I don't think this solution is enough. I am sorry for not providing a
solution, but I don't see an easy one.

Thanks,
Nadav