All of lore.kernel.org
 help / color / mirror / Atom feed
From: Nadav Amit <nadav.amit@gmail.com>
To: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>,
	"open list:MEMORY MANAGEMENT" <linux-mm@kvack.org>
Subject: Re: Potential race in TLB flush batching?
Date: Tue, 11 Jul 2017 13:06:48 -0700	[thread overview]
Message-ID: <3373F577-F289-4028-B6F6-777D029A7B07@gmail.com> (raw)
In-Reply-To: <20170711191823.qthrmdgqcd3rygjk@suse.de>

Mel Gorman <mgorman@suse.de> wrote:

> On Tue, Jul 11, 2017 at 10:23:50AM -0700, Andrew Lutomirski wrote:
>> On Tue, Jul 11, 2017 at 8:53 AM, Mel Gorman <mgorman@suse.de> wrote:
>>> On Tue, Jul 11, 2017 at 07:58:04AM -0700, Andrew Lutomirski wrote:
>>>> On Tue, Jul 11, 2017 at 6:20 AM, Mel Gorman <mgorman@suse.de> wrote:
>>>>> +
>>>>> +/*
>>>>> + * This is called after an mprotect update that altered no pages. Batched
>>>>> + * unmap releases the PTL before a flush occurs leaving a window where
>>>>> + * an mprotect that reduces access rights can still access the page after
>>>>> + * mprotect returns via a stale TLB entry. Avoid this possibility by flushing
>>>>> + * the local TLB if mprotect updates no pages so that the the caller of
>>>>> + * mprotect always gets expected behaviour. It's overkill and unnecessary to
>>>>> + * flush all TLBs as a separate thread accessing the data that raced with
>>>>> + * both reclaim and mprotect as there is no risk of data corruption and
>>>>> + * the exact timing of a parallel thread seeing a protection update without
>>>>> + * any serialisation on the application side is always uncertain.
>>>>> + */
>>>>> +void batched_unmap_protection_update(void)
>>>>> +{
>>>>> +       count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
>>>>> +       local_flush_tlb();
>>>>> +       trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
>>>>> +}
>>>>> +
>>>> 
>>>> What about remote CPUs?  You could get migrated right after mprotect()
>>>> or the inconsistency could be observed on another CPU.
>>> 
>>> If it's migrated then it has also context switched so the TLB entry will
>>> be read for the first time.
>> 
>> I don't think this is true.  On current kernels, if the other CPU is
>> running a thread in the same process, then there won't be a flush if
>> we migrate there.
> 
> True although that would also be covered if a flush happening unconditionally
> on mprotect (and arguably munmap) if a batched TLB flush took place in the
> past. It's heavier than it needs to be but it would be trivial to track
> and only incur a cost if reclaim touched any pages belonging to the process
> in the past so a relatively rare operation in the normal case. It could be
> forced by continually keeping a system under memory pressure while looping
> around mprotect but the worst-case would be similar costs to never batching
> the flushing at all.
> 
>> In -tip, slated for 4.13, if the other CPU is lazy
>> and is using the current process's page tables, it won't flush if we
>> migrate there and it's not stale (as determined by the real flush
>> APIs, not local_tlb_flush()).  With PCID, the kernel will aggressively
>> try to avoid the flush no matter what.
> 
> I agree that PCID means that flushing needs to be more agressive and there
> is not much point working on two solutions and assume PCID is merged.
> 
>>> If the entry is inconsistent for another CPU
>>> accessing the data then it'll potentially successfully access a page that
>>> was just mprotected but this is similar to simply racing with the call
>>> to mprotect itself. The timing isn't exact, nor does it need to be.
>> 
>> Thread A:
>> mprotect(..., PROT_READ);
>> pthread_mutex_unlock();
>> 
>> Thread B:
>> pthread_mutex_lock();
>> write to the mprotected address;
>> 
>> I think it's unlikely that this exact scenario will affect a
>> conventional C program, but I can see various GC systems and sandboxes
>> being very surprised.
> 
> Maybe. The window is massively wide as the mprotect, unlock, remote wakeup
> and write all need to complete between the unmap releasing the PTL and
> the flush taking place. Still, it is theoritically possible.

Consider also virtual machines. A VCPU may be preempted by the hypervisor
right after a PTE change and before the flush - so the time between the two
can be rather large.

>>>> I also really
>>>> don't like bypassing arch code like this.  The implementation of
>>>> flush_tlb_mm_range() in tip:x86/mm (and slated for this merge window!)
>>>> is *very* different from what's there now, and it is not written in
>>>> the expectation that some generic code might call local_tlb_flush()
>>>> and expect any kind of coherency at all.
>>> 
>>> Assuming that gets merged first then the most straight-forward approach
>>> would be to setup a arch_tlbflush_unmap_batch with just the local CPU set
>>> in the mask or something similar.
>> 
>> With what semantics?
> 
> I'm dropping this idea because the more I think about it, the more I think
> that a more general flush is needed if TLB batching was used in the past.
> We could keep active track of mm's with flushes pending but it would be
> fairly complex, cost in terms of keeping track of mm's needing flushing
> and ultimately might be more expensive than just flushing immediately.
> 
> If it's actually unfixable then, even though it's theoritical given the
> massive amount of activity that has to happen in a very short window, there
> would be no choice but to remove the TLB batching entirely which would be
> very unfortunate given that IPIs during reclaim will be very high once again.
> 
>>>> Would a better fix perhaps be to find a way to figure out whether a
>>>> batched flush is pending on the mm in question and flush it out if you
>>>> do any optimizations based on assuming that the TLB is in any respect
>>>> consistent with the page tables?  With the changes in -tip, x86 could,
>>>> in principle, supply a function to sync up its TLB state.  That would
>>>> require cross-CPU poking at state or an inconditional IPI (that might
>>>> end up not flushing anything), but either is doable.
>>> 
>>> It's potentially doable if a field like tlb_flush_pending was added
>>> to mm_struct that is set when batching starts. I don't think there is
>>> a logical place where it can be cleared as when the TLB gets flushed by
>>> reclaim, it can't rmap again to clear the flag. What would happen is that
>>> the first mprotect after any batching happened at any point in the past
>>> would have to unconditionally flush the TLB and then clear the flag. That
>>> would be a relatively minor hit and cover all the possibilities and should
>>> work unmodified with or without your series applied.
>>> 
>>> Would that be preferable to you?
>> 
>> I'm not sure I understand it well enough to know whether I like it.
>> I'm imagining an API that says "I'm about to rely on TLBs being
>> coherent for this mm -- make it so".
> 
> I don't think we should be particularly clever about this and instead just
> flush the full mm if there is a risk of a parallel batching of flushing is
> in progress resulting in a stale TLB entry being used. I think tracking mms
> that are currently batching would end up being costly in terms of memory,
> fairly complex, or both. Something like this?
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 45cdb27791a3..ab8f7e11c160 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -495,6 +495,10 @@ struct mm_struct {
> 	 */
> 	bool tlb_flush_pending;
> #endif
> +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> +	/* See flush_tlb_batched_pending() */
> +	bool tlb_flush_batched;
> +#endif
> 	struct uprobes_state uprobes_state;
> #ifdef CONFIG_HUGETLB_PAGE
> 	atomic_long_t hugetlb_usage;
> diff --git a/mm/internal.h b/mm/internal.h
> index 0e4f558412fb..bf835a5a9854 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -498,6 +498,7 @@ extern struct workqueue_struct *mm_percpu_wq;
> #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> void try_to_unmap_flush(void);
> void try_to_unmap_flush_dirty(void);
> +void flush_tlb_batched_pending(struct mm_struct *mm);
> #else
> static inline void try_to_unmap_flush(void)
> {
> @@ -505,7 +506,9 @@ static inline void try_to_unmap_flush(void)
> static inline void try_to_unmap_flush_dirty(void)
> {
> }
> -
> +static inline void mm_tlb_flush_batched(struct mm_struct *mm)
> +{
> +}
> #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
> 
> extern const struct trace_print_flags pageflag_names[];
> diff --git a/mm/memory.c b/mm/memory.c
> index bb11c474857e..b0c3d1556a94 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1197,6 +1197,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> 	init_rss_vec(rss);
> 	start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> 	pte = start_pte;
> +	flush_tlb_batched_pending(mm);
> 	arch_enter_lazy_mmu_mode();
> 	do {
> 		pte_t ptent = *pte;
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 8edd0d576254..27135b91a4b4 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -61,6 +61,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> 	if (!pte)
> 		return 0;
> 
> +	/* Guard against parallel reclaim batching a TLB flush without PTL */
> +	flush_tlb_batched_pending(vma->vm_mm);
> +
> 	/* Get target node for single threaded private VMAs */
> 	if (prot_numa && !(vma->vm_flags & VM_SHARED) &&
> 	    atomic_read(&vma->vm_mm->mm_users) == 1)
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d405f0e0ee96..52633a124a4e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -637,12 +637,34 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
> 		return false;
> 
> 	/* If remote CPUs need to be flushed then defer batch the flush */
> -	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
> +	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) {
> 		should_defer = true;
> +		mm->tlb_flush_batched = true;
> +	}
> 	put_cpu();
> 
> 	return should_defer;
> }
> +
> +/*
> + * Reclaim batches unmaps pages under the PTL but does not flush the TLB
> + * TLB prior to releasing the PTL. It's possible a parallel mprotect or
> + * munmap can race between reclaim unmapping the page and flushing the
> + * page. If this race occurs, it potentially allows access to data via
> + * a stale TLB entry. Tracking all mm's that have TLB batching pending
> + * would be expensive during reclaim so instead track whether TLB batching
> + * occured in the past and if so then do a full mm flush here. This will
> + * cost one additional flush per reclaim cycle paid by the first munmap or
> + * mprotect. This assumes it's called under the PTL to synchronise access
> + * to mm->tlb_flush_batched.
> + */
> +void flush_tlb_batched_pending(struct mm_struct *mm)
> +{
> +	if (mm->tlb_flush_batched) {
> +		flush_tlb_mm(mm);
> +		mm->tlb_flush_batched = false;
> +	}
> +}
> #else
> static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
> {

I don’t know what is exactly the invariant that is kept, so it is hard for
me to figure out all sort of questions:

Should pte_accessible return true if mm->tlb_flush_batch==true ?

Does madvise_free_pte_range need to be modified as well?

How will future code not break anything?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2017-07-11 20:06 UTC|newest]

Thread overview: 70+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-07-11  0:52 Potential race in TLB flush batching? Nadav Amit
2017-07-11  6:41 ` Mel Gorman
2017-07-11  7:30   ` Nadav Amit
2017-07-11  9:29     ` Mel Gorman
2017-07-11 10:40       ` Nadav Amit
2017-07-11 13:20         ` Mel Gorman
2017-07-11 14:58           ` Andy Lutomirski
2017-07-11 15:53             ` Mel Gorman
2017-07-11 17:23               ` Andy Lutomirski
2017-07-11 19:18                 ` Mel Gorman
2017-07-11 20:06                   ` Nadav Amit [this message]
2017-07-11 21:09                     ` Mel Gorman
2017-07-11 20:09                   ` Mel Gorman
2017-07-11 21:52                     ` Mel Gorman
2017-07-11 22:27                       ` Nadav Amit
2017-07-11 22:34                         ` Nadav Amit
2017-07-12  8:27                         ` Mel Gorman
2017-07-12 23:27                           ` Nadav Amit
2017-07-12 23:36                             ` Andy Lutomirski
2017-07-12 23:42                               ` Nadav Amit
2017-07-13  5:38                                 ` Andy Lutomirski
2017-07-13 16:05                                   ` Nadav Amit
2017-07-13 16:06                                     ` Andy Lutomirski
2017-07-13  6:07                             ` Mel Gorman
2017-07-13 16:08                               ` Andy Lutomirski
2017-07-13 17:07                                 ` Mel Gorman
2017-07-13 17:15                                   ` Andy Lutomirski
2017-07-13 18:23                                     ` Mel Gorman
2017-07-14 23:16                               ` Nadav Amit
2017-07-15 15:55                                 ` Mel Gorman
2017-07-15 16:41                                   ` Andy Lutomirski
2017-07-17  7:49                                     ` Mel Gorman
2017-07-18 21:28                                   ` Nadav Amit
2017-07-19  7:41                                     ` Mel Gorman
2017-07-19 19:41                                       ` Nadav Amit
2017-07-19 19:58                                         ` Mel Gorman
2017-07-19 20:20                                           ` Nadav Amit
2017-07-19 21:47                                             ` Mel Gorman
2017-07-19 22:19                                               ` Nadav Amit
2017-07-19 22:59                                                 ` Mel Gorman
2017-07-19 23:39                                                   ` Nadav Amit
2017-07-20  7:43                                                     ` Mel Gorman
2017-07-22  1:19                                                       ` Nadav Amit
2017-07-24  9:58                                                         ` Mel Gorman
2017-07-24 19:46                                                           ` Nadav Amit
2017-07-25  7:37                                                           ` Minchan Kim
2017-07-25  8:51                                                             ` Mel Gorman
2017-07-25  9:11                                                               ` Minchan Kim
2017-07-25 10:10                                                                 ` Mel Gorman
2017-07-26  5:43                                                                   ` Minchan Kim
2017-07-26  9:22                                                                     ` Mel Gorman
2017-07-26 19:18                                                                       ` Nadav Amit
2017-07-26 23:40                                                                         ` Minchan Kim
2017-07-27  0:09                                                                           ` Nadav Amit
2017-07-27  0:34                                                                             ` Minchan Kim
2017-07-27  0:48                                                                               ` Nadav Amit
2017-07-27  1:13                                                                                 ` Nadav Amit
2017-07-27  7:04                                                                                   ` Minchan Kim
2017-07-27  7:21                                                                                     ` Mel Gorman
2017-07-27 16:04                                                                                       ` Nadav Amit
2017-07-27 17:36                                                                                         ` Mel Gorman
2017-07-26 23:44                                                                       ` Minchan Kim
2017-07-11 22:07                   ` Andy Lutomirski
2017-07-11 22:33                     ` Mel Gorman
2017-07-14  7:00                     ` Benjamin Herrenschmidt
2017-07-14  8:31                       ` Mel Gorman
2017-07-14  9:02                         ` Benjamin Herrenschmidt
2017-07-14  9:27                           ` Mel Gorman
2017-07-14 22:21                             ` Andy Lutomirski
2017-07-11 16:22           ` Nadav Amit

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3373F577-F289-4028-B6F6-777D029A7B07@gmail.com \
    --to=nadav.amit@gmail.com \
    --cc=linux-mm@kvack.org \
    --cc=luto@kernel.org \
    --cc=mgorman@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.