From: Nadav Amit <nadav.amit@gmail.com>
To: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>,
"open list:MEMORY MANAGEMENT" <linux-mm@kvack.org>
Subject: Re: Potential race in TLB flush batching?
Date: Tue, 11 Jul 2017 09:22:47 -0700
Message-ID: <D9CAACCF-F929-4894-AA81-EB612F51B49A@gmail.com>
In-Reply-To: <20170711132023.wdfpjxwtbqpi3wp2@suse.de>
Mel Gorman <mgorman@suse.de> wrote:
> On Tue, Jul 11, 2017 at 03:40:02AM -0700, Nadav Amit wrote:
>> Mel Gorman <mgorman@suse.de> wrote:
>>
>>>>> That is the same as a race whereby there is no batching mechanism and the
>>>>> racing operation happens between a pte clear and a flush as ptep_clear_flush
>>>>> is not atomic. All that differs is that the race window is a different size.
>>>>> The application on CPU1 is buggy in that it may or may not succeed the write
>>>>> but it is buggy regardless of whether a batching mechanism is used or not.
>>>>
>>>> Thanks for your quick and detailed response, but I fail to see how it can
>> happen without batching. Indeed, the PTE clear and flush are not "atomic",
>>>> but without batching they are both performed under the page table lock
>>>> (which is acquired in page_vma_mapped_walk and released in
>>>> page_vma_mapped_walk_done). Since the lock is taken, other cores should not
>>>> be able to inspect/modify the PTE. Relevant functions, e.g., zap_pte_range
>>>> and change_pte_range, acquire the lock before accessing the PTEs.
>>>
>>> I was primarily thinking in terms of memory corruption or data loss.
>>> However, we are still protected although it's not particularly obvious why.
>>>
>>> On the reclaim side, we are either reclaiming clean pages (which ignore
>>> the accessed bit) or normal reclaim. If it's clean pages then any parallel
>>> write must update the dirty bit at minimum. If it's normal reclaim then
>>> the accessed bit is checked and if cleared in try_to_unmap_one, it uses a
>>> ptep_clear_flush_young_notify so the TLB gets flushed. We don't reclaim
>>> the page in either case as part of page_referenced or try_to_unmap_one, but
>>> clearing the accessed bit flushes the TLB.
>>
>> Wait. Are you looking at the x86 arch function? The TLB is not flushed when
>> the access bit is cleared:
>>
>> int ptep_clear_flush_young(struct vm_area_struct *vma,
>>                            unsigned long address, pte_t *ptep)
>> {
>>         /*
>>          * On x86 CPUs, clearing the accessed bit without a TLB flush
>>          * doesn't cause data corruption. [ It could cause incorrect
>>          * page aging and the (mistaken) reclaim of hot pages, but the
>>          * chance of that should be relatively low. ]
>>          *
>>          * So as a performance optimization don't flush the TLB when
>>          * clearing the accessed bit, it will eventually be flushed by
>>          * a context switch or a VM operation anyway. [ In the rare
>>          * event of it not getting flushed for a long time the delay
>>          * shouldn't really matter because there's no real memory
>>          * pressure for swapout to react to. ]
>>          */
>>         return ptep_test_and_clear_young(vma, address, ptep);
>> }
>
> I forgot this detail, thanks for correcting me.
>
>>> On the mprotect side then, as the page was first accessed, clearing the
>>> accessed bit incurs a TLB flush on the reclaim side before the second write.
>>> That means any TLB entry that exists cannot have the accessed bit set so
>>> a second write needs to update it.
>>>
>>> While it's not clearly documented, I checked with hardware engineers
>>> at the time that an update of the accessed or dirty bit even with a TLB
>>> entry will check the underlying page tables and trap if it's not present
>>> and the subsequent fault will then fail on sigsegv if the VMA protections
>>> no longer allow the write.
>>>
>>> So, on one side, if the accessed bit is ignored during reclaim, the pages
>>> are clean, so any access will set the dirty bit and trap if the page was
>>> unmapped in parallel. On the other side, if the accessed bit was set,
>>> clearing it flushed the TLB; if it was not set, the hardware needs to
>>> update it and again will trap if the page was unmapped in parallel.
>>
>>
>> Yet, even regardless to the TLB flush it seems there is still a possible
>> race:
>>
>> CPU0                          CPU1
>> ----                          ----
>>                               ptep_clear_flush_young_notify
>>                               ==> PTE.A==0
>> access PTE
>> ==> PTE.A=1
>>                               ptep_get_and_clear
>>                               change mapping (and PTE)
>> Use stale TLB entry
>
> So I think you're right and this is a potential race. The first access can
> be a read or a write as it's a problem if the mprotect call restricts
> access.
>
>>> If this guarantee from hardware was ever shown to be wrong or another
>>> architecture wanted to add batching without the same guarantee then mprotect
>>> would need to do a local_flush_tlb if no pages were updated by the mprotect
>>> but right now, this should not be necessary.
>>>
>>>> Can you please explain why you consider the application to be buggy?
>>>
>>> I considered it a bit dumb to mprotect for READ/NONE and then try writing
>>> the same mapping. However, it will behave as expected.
>>
>> I don't think that this is the only scenario. For example, the application
>> may create a new memory mapping of a different file using mmap at the same
>> memory address that was used before, just as that memory is reclaimed.
>
> That requires the existing mapping to be unmapped which will flush the
> TLB and parallel mmap/munmap serialises on mmap_sem. The race appears to
> be specific to mprotect which avoids the TLB flush if no pages were updated.
Why? As far as I can see, the chain of calls during munmap is roughly:
do_munmap
=>unmap_region
==>tlb_gather_mmu
===>unmap_vmas
====>unmap_page_range
...
=====>zap_pte_range - this one batches only present PTEs
===>free_pgtables - this one is only if page-tables are removed
===>pte_free_tlb
==>tlb_finish_mmu
===>tlb_flush_mmu
====>tlb_flush_mmu_tlbonly
zap_pte_range will check pte_none() and may find that it is indeed none - if
a concurrent try_to_unmap_one already cleared the PTE. In that case it will
not update the range of the mmu_gather and will not indicate that a flush of
the PTE is needed. Then, tlb_flush_mmu_tlbonly will find that no PTE was
cleared (tlb->end == 0) and avoid the flush, or may flush fewer PTEs than
actually needed.
This behavior raises the concern that a PTE flush may be missed in other
cases that use mmu_gather as well.
>> The
>> application can (inadvertently) cause such a scenario by using MAP_FIXED.
>> But even without MAP_FIXED, running mmap->munmap->mmap can reuse the same
>> virtual address.
>
> With flushes in between.
>
>>> Such applications are safe due to how the accessed bit is handled by the
>>> software (flushes TLB if clearing young) and hardware (traps if updating
>>> the accessed or dirty bit and the underlying PTE was unmapped even if
>>> there is a TLB entry).
>>
>> I don't think it is so. And I also think there are many additional
>> potentially problematic scenarios.
>
> I believe it's specific to mprotect but can be handled by flushing the
> local TLB when mprotect updates no pages. Something like this:
>
> ---8<---
> mm, mprotect: Flush the local TLB if mprotect potentially raced with a parallel reclaim
>
> Nadav Amit identified a theoretical race between page reclaim and mprotect
> due to TLB flushes being batched outside of the PTL being held. He described
> the race as follows:
>
> CPU0                          CPU1
> ----                          ----
>                               user accesses memory using RW PTE
>                               [PTE now cached in TLB]
> try_to_unmap_one()
> ==> ptep_get_and_clear()
> ==> set_tlb_ubc_flush_pending()
>                               mprotect(addr, PROT_READ)
>                               ==> change_pte_range()
>                               ==> [ PTE non-present - no flush ]
>
>                               user writes using cached RW PTE
> ...
>
> try_to_unmap_flush()
>
> The same type of race exists for reads when protecting for PROT_NONE.
> This is not a data integrity issue as the TLB is always flushed before any
> IO is queued or a page is freed, but it is a correctness issue as a process
> restricting access with mprotect() may still be able to access the data
> after the syscall returns due to a stale TLB entry. Handle this issue by
> flushing the local TLB if reclaim is potentially batching TLB flushes and
> mprotect altered no pages.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Cc: stable@vger.kernel.org # v4.4+
> ---
> mm/internal.h |  5 ++++-
> mm/mprotect.c | 12 ++++++++++--
> mm/rmap.c     | 20 ++++++++++++++++++++
> 3 files changed, 34 insertions(+), 3 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 0e4f558412fb..9b7d1a597816 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -498,6 +498,7 @@ extern struct workqueue_struct *mm_percpu_wq;
> #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> void try_to_unmap_flush(void);
> void try_to_unmap_flush_dirty(void);
> +void batched_unmap_protection_update(void);
> #else
> static inline void try_to_unmap_flush(void)
> {
> @@ -505,7 +506,9 @@ static inline void try_to_unmap_flush(void)
> static inline void try_to_unmap_flush_dirty(void)
> {
> }
> -
> +static inline void batched_unmap_protection_update(void)
> +{
> +}
> #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
>
> extern const struct trace_print_flags pageflag_names[];
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 8edd0d576254..3de353d4b5fb 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -254,9 +254,17 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
>                                  dirty_accountable, prot_numa);
>         } while (pgd++, addr = next, addr != end);
> 
> -       /* Only flush the TLB if we actually modified any entries: */
> -       if (pages)
> +       /*
> +        * Only flush all TLBs if we actually modified any entries. If no
> +        * pages are modified, then call batched_unmap_protection_update
> +        * if the context is a mprotect() syscall.
> +        */
> +       if (pages) {
>                 flush_tlb_range(vma, start, end);
> +       } else {
> +               if (!prot_numa)
> +                       batched_unmap_protection_update();
> +       }
>         clear_tlb_flush_pending(mm);
> 
>         return pages;
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d405f0e0ee96..02cb035e4ce6 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -643,6 +643,26 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
> 
>         return should_defer;
> }
> +
> +/*
> + * This is called after an mprotect update that altered no pages. Batched
> + * unmap releases the PTL before a flush occurs leaving a window where
> + * an mprotect that reduces access rights can still access the page after
> + * mprotect returns via a stale TLB entry. Avoid this possibility by flushing
> + * the local TLB if mprotect updates no pages so that the caller of
> + * mprotect always gets expected behaviour. It's overkill and unnecessary to
> + * flush all TLBs as a separate thread accessing the data that raced with
> + * both reclaim and mprotect as there is no risk of data corruption and
> + * the exact timing of a parallel thread seeing a protection update without
> + * any serialisation on the application side is always uncertain.
> + */
> +void batched_unmap_protection_update(void)
> +{
> +       count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
> +       local_flush_tlb();
> +       trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
> +}
> +
> #else
> static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
> {
I don’t think this solution is enough. I am sorry for not providing a
solution, but I don’t see an easy one.
Thanks,
Nadav
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .