linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Nadav Amit <nadav.amit@gmail.com>
To: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>,
	"open list:MEMORY MANAGEMENT" <linux-mm@kvack.org>
Subject: Re: Potential race in TLB flush batching?
Date: Tue, 11 Jul 2017 13:06:48 -0700	[thread overview]
Message-ID: <3373F577-F289-4028-B6F6-777D029A7B07@gmail.com> (raw)
In-Reply-To: <20170711191823.qthrmdgqcd3rygjk@suse.de>

Mel Gorman <mgorman@suse.de> wrote:

> On Tue, Jul 11, 2017 at 10:23:50AM -0700, Andrew Lutomirski wrote:
>> On Tue, Jul 11, 2017 at 8:53 AM, Mel Gorman <mgorman@suse.de> wrote:
>>> On Tue, Jul 11, 2017 at 07:58:04AM -0700, Andrew Lutomirski wrote:
>>>> On Tue, Jul 11, 2017 at 6:20 AM, Mel Gorman <mgorman@suse.de> wrote:
>>>>> +
>>>>> +/*
>>>>> + * This is called after an mprotect update that altered no pages. Batched
>>>>> + * unmap releases the PTL before a flush occurs leaving a window where
>>>>> + * an mprotect that reduces access rights can still access the page after
>>>>> + * mprotect returns via a stale TLB entry. Avoid this possibility by flushing
>>>>> + * the local TLB if mprotect updates no pages so that the the caller of
>>>>> + * mprotect always gets expected behaviour. It's overkill and unnecessary to
>>>>> + * flush all TLBs as a separate thread accessing the data that raced with
>>>>> + * both reclaim and mprotect as there is no risk of data corruption and
>>>>> + * the exact timing of a parallel thread seeing a protection update without
>>>>> + * any serialisation on the application side is always uncertain.
>>>>> + */
>>>>> +void batched_unmap_protection_update(void)
>>>>> +{
>>>>> +       count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
>>>>> +       local_flush_tlb();
>>>>> +       trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
>>>>> +}
>>>>> +
>>>> 
>>>> What about remote CPUs?  You could get migrated right after mprotect()
>>>> or the inconsistency could be observed on another CPU.
>>> 
>>> If it's migrated then it has also context switched so the TLB entry will
>>> be read for the first time.
>> 
>> I don't think this is true.  On current kernels, if the other CPU is
>> running a thread in the same process, then there won't be a flush if
>> we migrate there.
> 
> True although that would also be covered if a flush happening unconditionally
> on mprotect (and arguably munmap) if a batched TLB flush took place in the
> past. It's heavier than it needs to be but it would be trivial to track
> and only incur a cost if reclaim touched any pages belonging to the process
> in the past so a relatively rare operation in the normal case. It could be
> forced by continually keeping a system under memory pressure while looping
> around mprotect but the worst-case would be similar costs to never batching
> the flushing at all.
> 
>> In -tip, slated for 4.13, if the other CPU is lazy
>> and is using the current process's page tables, it won't flush if we
>> migrate there and it's not stale (as determined by the real flush
>> APIs, not local_tlb_flush()).  With PCID, the kernel will aggressively
>> try to avoid the flush no matter what.
> 
> I agree that PCID means that flushing needs to be more agressive and there
> is not much point working on two solutions and assume PCID is merged.
> 
>>> If the entry is inconsistent for another CPU
>>> accessing the data then it'll potentially successfully access a page that
>>> was just mprotected but this is similar to simply racing with the call
>>> to mprotect itself. The timing isn't exact, nor does it need to be.
>> 
>> Thread A:
>> mprotect(..., PROT_READ);
>> pthread_mutex_unlock();
>> 
>> Thread B:
>> pthread_mutex_lock();
>> write to the mprotected address;
>> 
>> I think it's unlikely that this exact scenario will affect a
>> conventional C program, but I can see various GC systems and sandboxes
>> being very surprised.
> 
> Maybe. The window is massively wide as the mprotect, unlock, remote wakeup
> and write all need to complete between the unmap releasing the PTL and
> the flush taking place. Still, it is theoritically possible.

Consider also virtual machines. A VCPU may be preempted by the hypervisor
right after a PTE change and before the flush - so the time between the two
can be rather large.

>>>> I also really
>>>> don't like bypassing arch code like this.  The implementation of
>>>> flush_tlb_mm_range() in tip:x86/mm (and slated for this merge window!)
>>>> is *very* different from what's there now, and it is not written in
>>>> the expectation that some generic code might call local_tlb_flush()
>>>> and expect any kind of coherency at all.
>>> 
>>> Assuming that gets merged first then the most straight-forward approach
>>> would be to setup a arch_tlbflush_unmap_batch with just the local CPU set
>>> in the mask or something similar.
>> 
>> With what semantics?
> 
> I'm dropping this idea because the more I think about it, the more I think
> that a more general flush is needed if TLB batching was used in the past.
> We could keep active track of mm's with flushes pending but it would be
> fairly complex, cost in terms of keeping track of mm's needing flushing
> and ultimately might be more expensive than just flushing immediately.
> 
> If it's actually unfixable then, even though it's theoritical given the
> massive amount of activity that has to happen in a very short window, there
> would be no choice but to remove the TLB batching entirely which would be
> very unfortunate given that IPIs during reclaim will be very high once again.
> 
>>>> Would a better fix perhaps be to find a way to figure out whether a
>>>> batched flush is pending on the mm in question and flush it out if you
>>>> do any optimizations based on assuming that the TLB is in any respect
>>>> consistent with the page tables?  With the changes in -tip, x86 could,
>>>> in principle, supply a function to sync up its TLB state.  That would
>>>> require cross-CPU poking at state or an inconditional IPI (that might
>>>> end up not flushing anything), but either is doable.
>>> 
>>> It's potentially doable if a field like tlb_flush_pending was added
>>> to mm_struct that is set when batching starts. I don't think there is
>>> a logical place where it can be cleared as when the TLB gets flushed by
>>> reclaim, it can't rmap again to clear the flag. What would happen is that
>>> the first mprotect after any batching happened at any point in the past
>>> would have to unconditionally flush the TLB and then clear the flag. That
>>> would be a relatively minor hit and cover all the possibilities and should
>>> work unmodified with or without your series applied.
>>> 
>>> Would that be preferable to you?
>> 
>> I'm not sure I understand it well enough to know whether I like it.
>> I'm imagining an API that says "I'm about to rely on TLBs being
>> coherent for this mm -- make it so".
> 
> I don't think we should be particularly clever about this and instead just
> flush the full mm if there is a risk of a parallel batching of flushing is
> in progress resulting in a stale TLB entry being used. I think tracking mms
> that are currently batching would end up being costly in terms of memory,
> fairly complex, or both. Something like this?
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 45cdb27791a3..ab8f7e11c160 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -495,6 +495,10 @@ struct mm_struct {
> 	 */
> 	bool tlb_flush_pending;
> #endif
> +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> +	/* See flush_tlb_batched_pending() */
> +	bool tlb_flush_batched;
> +#endif
> 	struct uprobes_state uprobes_state;
> #ifdef CONFIG_HUGETLB_PAGE
> 	atomic_long_t hugetlb_usage;
> diff --git a/mm/internal.h b/mm/internal.h
> index 0e4f558412fb..bf835a5a9854 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -498,6 +498,7 @@ extern struct workqueue_struct *mm_percpu_wq;
> #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> void try_to_unmap_flush(void);
> void try_to_unmap_flush_dirty(void);
> +void flush_tlb_batched_pending(struct mm_struct *mm);
> #else
> static inline void try_to_unmap_flush(void)
> {
> @@ -505,7 +506,9 @@ static inline void try_to_unmap_flush(void)
> static inline void try_to_unmap_flush_dirty(void)
> {
> }
> -
> +static inline void mm_tlb_flush_batched(struct mm_struct *mm)
> +{
> +}
> #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
> 
> extern const struct trace_print_flags pageflag_names[];
> diff --git a/mm/memory.c b/mm/memory.c
> index bb11c474857e..b0c3d1556a94 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1197,6 +1197,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> 	init_rss_vec(rss);
> 	start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> 	pte = start_pte;
> +	flush_tlb_batched_pending(mm);
> 	arch_enter_lazy_mmu_mode();
> 	do {
> 		pte_t ptent = *pte;
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 8edd0d576254..27135b91a4b4 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -61,6 +61,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> 	if (!pte)
> 		return 0;
> 
> +	/* Guard against parallel reclaim batching a TLB flush without PTL */
> +	flush_tlb_batched_pending(vma->vm_mm);
> +
> 	/* Get target node for single threaded private VMAs */
> 	if (prot_numa && !(vma->vm_flags & VM_SHARED) &&
> 	    atomic_read(&vma->vm_mm->mm_users) == 1)
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d405f0e0ee96..52633a124a4e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -637,12 +637,34 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
> 		return false;
> 
> 	/* If remote CPUs need to be flushed then defer batch the flush */
> -	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
> +	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) {
> 		should_defer = true;
> +		mm->tlb_flush_batched = true;
> +	}
> 	put_cpu();
> 
> 	return should_defer;
> }
> +
> +/*
> + * Reclaim batches unmaps pages under the PTL but does not flush the TLB
> + * TLB prior to releasing the PTL. It's possible a parallel mprotect or
> + * munmap can race between reclaim unmapping the page and flushing the
> + * page. If this race occurs, it potentially allows access to data via
> + * a stale TLB entry. Tracking all mm's that have TLB batching pending
> + * would be expensive during reclaim so instead track whether TLB batching
> + * occured in the past and if so then do a full mm flush here. This will
> + * cost one additional flush per reclaim cycle paid by the first munmap or
> + * mprotect. This assumes it's called under the PTL to synchronise access
> + * to mm->tlb_flush_batched.
> + */
> +void flush_tlb_batched_pending(struct mm_struct *mm)
> +{
> +	if (mm->tlb_flush_batched) {
> +		flush_tlb_mm(mm);
> +		mm->tlb_flush_batched = false;
> +	}
> +}
> #else
> static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
> {

I don’t know what is exactly the invariant that is kept, so it is hard for
me to figure out all sort of questions:

Should pte_accessible return true if mm->tlb_flush_batch==true ?

Does madvise_free_pte_range need to be modified as well?

How will future code not break anything?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2017-07-11 20:06 UTC|newest]

Thread overview: 70+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-07-11  0:52 Potential race in TLB flush batching? Nadav Amit
2017-07-11  6:41 ` Mel Gorman
2017-07-11  7:30   ` Nadav Amit
2017-07-11  9:29     ` Mel Gorman
2017-07-11 10:40       ` Nadav Amit
2017-07-11 13:20         ` Mel Gorman
2017-07-11 14:58           ` Andy Lutomirski
2017-07-11 15:53             ` Mel Gorman
2017-07-11 17:23               ` Andy Lutomirski
2017-07-11 19:18                 ` Mel Gorman
2017-07-11 20:06                   ` Nadav Amit [this message]
2017-07-11 21:09                     ` Mel Gorman
2017-07-11 20:09                   ` Mel Gorman
2017-07-11 21:52                     ` Mel Gorman
2017-07-11 22:27                       ` Nadav Amit
2017-07-11 22:34                         ` Nadav Amit
2017-07-12  8:27                         ` Mel Gorman
2017-07-12 23:27                           ` Nadav Amit
2017-07-12 23:36                             ` Andy Lutomirski
2017-07-12 23:42                               ` Nadav Amit
2017-07-13  5:38                                 ` Andy Lutomirski
2017-07-13 16:05                                   ` Nadav Amit
2017-07-13 16:06                                     ` Andy Lutomirski
2017-07-13  6:07                             ` Mel Gorman
2017-07-13 16:08                               ` Andy Lutomirski
2017-07-13 17:07                                 ` Mel Gorman
2017-07-13 17:15                                   ` Andy Lutomirski
2017-07-13 18:23                                     ` Mel Gorman
2017-07-14 23:16                               ` Nadav Amit
2017-07-15 15:55                                 ` Mel Gorman
2017-07-15 16:41                                   ` Andy Lutomirski
2017-07-17  7:49                                     ` Mel Gorman
2017-07-18 21:28                                   ` Nadav Amit
2017-07-19  7:41                                     ` Mel Gorman
2017-07-19 19:41                                       ` Nadav Amit
2017-07-19 19:58                                         ` Mel Gorman
2017-07-19 20:20                                           ` Nadav Amit
2017-07-19 21:47                                             ` Mel Gorman
2017-07-19 22:19                                               ` Nadav Amit
2017-07-19 22:59                                                 ` Mel Gorman
2017-07-19 23:39                                                   ` Nadav Amit
2017-07-20  7:43                                                     ` Mel Gorman
2017-07-22  1:19                                                       ` Nadav Amit
2017-07-24  9:58                                                         ` Mel Gorman
2017-07-24 19:46                                                           ` Nadav Amit
2017-07-25  7:37                                                           ` Minchan Kim
2017-07-25  8:51                                                             ` Mel Gorman
2017-07-25  9:11                                                               ` Minchan Kim
2017-07-25 10:10                                                                 ` Mel Gorman
2017-07-26  5:43                                                                   ` Minchan Kim
2017-07-26  9:22                                                                     ` Mel Gorman
2017-07-26 19:18                                                                       ` Nadav Amit
2017-07-26 23:40                                                                         ` Minchan Kim
2017-07-27  0:09                                                                           ` Nadav Amit
2017-07-27  0:34                                                                             ` Minchan Kim
2017-07-27  0:48                                                                               ` Nadav Amit
2017-07-27  1:13                                                                                 ` Nadav Amit
2017-07-27  7:04                                                                                   ` Minchan Kim
2017-07-27  7:21                                                                                     ` Mel Gorman
2017-07-27 16:04                                                                                       ` Nadav Amit
2017-07-27 17:36                                                                                         ` Mel Gorman
2017-07-26 23:44                                                                       ` Minchan Kim
2017-07-11 22:07                   ` Andy Lutomirski
2017-07-11 22:33                     ` Mel Gorman
2017-07-14  7:00                     ` Benjamin Herrenschmidt
2017-07-14  8:31                       ` Mel Gorman
2017-07-14  9:02                         ` Benjamin Herrenschmidt
2017-07-14  9:27                           ` Mel Gorman
2017-07-14 22:21                             ` Andy Lutomirski
2017-07-11 16:22           ` Nadav Amit

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3373F577-F289-4028-B6F6-777D029A7B07@gmail.com \
    --to=nadav.amit@gmail.com \
    --cc=linux-mm@kvack.org \
    --cc=luto@kernel.org \
    --cc=mgorman@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).