From: Mel Gorman <mgorman@suse.de>
To: Nadav Amit <nadav.amit@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>,
	"open list:MEMORY MANAGEMENT" <linux-mm@kvack.org>
Subject: Re: Potential race in TLB flush batching?
Date: Tue, 11 Jul 2017 14:20:23 +0100
Message-ID: <20170711132023.wdfpjxwtbqpi3wp2@suse.de>
In-Reply-To: <E37E0D40-821A-4C82-B924-F1CE6DF97719@gmail.com>

On Tue, Jul 11, 2017 at 03:40:02AM -0700, Nadav Amit wrote:
> Mel Gorman <mgorman@suse.de> wrote:
> 
> >>> That is the same as a race whereby there is no batching mechanism and the
> >>> racing operation happens between a pte clear and a flush, as ptep_clear_flush
> >>> is not atomic. All that differs is that the race window is a different size.
> >>> The application on CPU1 is buggy in that the write may or may not succeed,
> >>> but it is buggy regardless of whether a batching mechanism is used or not.
> >> 
> >> Thanks for your quick and detailed response, but I fail to see how it can
> >> happen without batching. Indeed, the PTE clear and flush are not 'atomic',
> >> but without batching they are both performed under the page table lock
> >> (which is acquired in page_vma_mapped_walk and released in
> >> page_vma_mapped_walk_done). Since the lock is taken, other cores should not
> >> be able to inspect/modify the PTE. Relevant functions, e.g., zap_pte_range
> >> and change_pte_range, acquire the lock before accessing the PTEs.
> > 
> > I was primarily thinking in terms of memory corruption or data loss.
> > However, we are still protected although it's not particularly obvious why.
> > 
> > On the reclaim side, we are either reclaiming clean pages (which ignore
> > the accessed bit) or normal reclaim. If it's clean pages then any parallel
> > write must update the dirty bit at minimum. If it's normal reclaim then
> > the accessed bit is checked and if cleared in try_to_unmap_one, it uses a
> > ptep_clear_flush_young_notify so the TLB gets flushed. We don't reclaim
> > the page in either case as part of page_referenced or try_to_unmap_one,
> > but clearing the accessed bit flushes the TLB.
> 
> Wait. Are you looking at the x86 arch function? The TLB is not flushed when
> the access bit is cleared:
> 
> int ptep_clear_flush_young(struct vm_area_struct *vma,
>                            unsigned long address, pte_t *ptep)
> {
>         /*
>          * On x86 CPUs, clearing the accessed bit without a TLB flush
>          * doesn't cause data corruption. [ It could cause incorrect
>          * page aging and the (mistaken) reclaim of hot pages, but the
>          * chance of that should be relatively low. ]
>          *                 
>          * So as a performance optimization don't flush the TLB when
>          * clearing the accessed bit, it will eventually be flushed by
>          * a context switch or a VM operation anyway. [ In the rare
>          * event of it not getting flushed for a long time the delay
>          * shouldn't really matter because there's no real memory
>          * pressure for swapout to react to. ]
>          */
>         return ptep_test_and_clear_young(vma, address, ptep);
> }
> 

I forgot this detail, thanks for correcting me.

> > 
> > On the mprotect side then, as the page was first accessed, clearing the
> > accessed bit incurs a TLB flush on the reclaim side before the second write.
> > That means any TLB entry that exists cannot have the accessed bit set so
> > a second write needs to update it.
> > 
> > While it's not clearly documented, I checked with hardware engineers
> > at the time that an update of the accessed or dirty bit even with a TLB
> > entry will check the underlying page tables and trap if it's not present
> > and the subsequent fault will then fail with SIGSEGV if the VMA protections
> > no longer allow the write.
> > 
> > So, on one side, if the accessed bit is ignored during reclaim, the pages
> > are clean so any access will set the dirty bit and trap if the page was
> > unmapped in parallel. On the other side, if the accessed bit was set then
> > clearing it flushed the TLB, and if it was not set then the hardware must
> > update it and will again trap if the page was unmapped in parallel.
> 
> 
> Yet, even regardless to the TLB flush it seems there is still a possible
> race:
> 
> CPU0				CPU1
> ----				----
> ptep_clear_flush_young_notify
> ==> PTE.A==0
> 				access PTE
> 				==> PTE.A=1
> ptep_get_and_clear
> 				change mapping (and PTE)
> 				Use stale TLB entry

So I think you're right and this is a potential race. The first access can
be a read or a write as it's a problem if the mprotect call restricts
access.
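
To make the window concrete, the pattern in question looks something like
the userspace sketch below. This is illustrative only, not a reliable
reproducer, as hitting it depends on reclaim clearing the PTE and deferring
the flush at exactly the wrong time:

	#include <pthread.h>
	#include <string.h>
	#include <sys/mman.h>

	static char *buf;

	static void *writer(void *arg)
	{
		/* Populate the PTE and cache a writable TLB entry */
		memset(buf, 1, 4096);

		/*
		 * If reclaim cleared the PTE with the flush still pending
		 * and the mprotect() below found nothing to update, this
		 * write can succeed via the stale writable TLB entry even
		 * after mprotect() has returned.
		 */
		buf[0] = 2;
		return NULL;
	}

	int main(void)
	{
		pthread_t t;

		buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		pthread_create(&t, NULL, writer, NULL);
		mprotect(buf, 4096, PROT_READ);
		pthread_join(t, NULL);
		return 0;
	}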

> > If this guarantee from hardware was ever shown to be wrong, or another
> > architecture wanted to add batching without the same guarantee, then
> > mprotect would need to do a local_flush_tlb if no pages were updated by
> > the mprotect, but right now this should not be necessary.
> > 
> >> Can you please explain why you consider the application to be buggy?
> > 
> > I considered it a bit dumb to mprotect for READ/NONE and then try writing
> > the same mapping. However, it will behave as expected.
> 
> I don't think that this is the only scenario. For example, the application
> may create a new memory mapping of a different file using mmap at the same
> memory address that was used before, just as that memory is reclaimed.

That requires the existing mapping to be unmapped, which will flush the
TLB, and parallel mmap/munmap serialises on mmap_sem. The race appears to be
specific to mprotect, which avoids the TLB flush if no pages were updated.

> The
> application can (inadvertently) cause such a scenario by using MAP_FIXED.
> But even without MAP_FIXED, running mmap->munmap->mmap can reuse the same
> virtual address.
> 

With flushes in between.
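
For reference, the munmap path provides those flushes before the range can
be reused. Roughly (a sketch of the call chain, not exact):

	do_munmap()
	  -> unmap_region()
	       -> unmap_vmas()       /* clears the PTEs under the PTL */
	       -> free_pgtables()
	       -> tlb_finish_mmu()   /* TLB flushed before munmap returns */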

> > Such applications are safe due to how the accessed bit is handled by the
> > software (flushes TLB if clearing young) and hardware (traps if updating
> > the accessed or dirty bit and the underlying PTE was unmapped even if
> > there is a TLB entry).
> 
> I don't think it is so. And I also think there are many additional
> potentially problematic scenarios.
> 

I believe it's specific to mprotect but can be handled by flushing the
local TLB when mprotect updates no pages. Something like this:

---8<---
mm, mprotect: Flush the local TLB if mprotect potentially raced with a parallel reclaim

Nadav Amit identified a theoretical race between page reclaim and mprotect
due to TLB flushes being batched after the PTL has been released. He
described the race as follows:

        CPU0                            CPU1
        ----                            ----
                                        user accesses memory using RW PTE
                                        [PTE now cached in TLB]
        try_to_unmap_one()
        ==> ptep_get_and_clear()
        ==> set_tlb_ubc_flush_pending()
                                        mprotect(addr, PROT_READ)
                                        ==> change_pte_range()
                                        ==> [ PTE non-present - no flush ]

                                        user writes using cached RW PTE
        ...

        try_to_unmap_flush()

The same type of race exists for reads when protecting for PROT_NONE.
This is not a data integrity issue as the TLB is always flushed before
any IO is queued or a page is freed, but it is a correctness issue: a
process restricting access with mprotect() may still be able to access
the data after the syscall returns due to a stale TLB entry. Handle this
by flushing the local TLB if reclaim is potentially batching TLB flushes
and mprotect altered no pages.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: stable@vger.kernel.org # v4.4+
---
 mm/internal.h |  5 ++++-
 mm/mprotect.c | 12 ++++++++++--
 mm/rmap.c     | 20 ++++++++++++++++++++
 3 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 0e4f558412fb..9b7d1a597816 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -498,6 +498,7 @@ extern struct workqueue_struct *mm_percpu_wq;
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 void try_to_unmap_flush(void);
 void try_to_unmap_flush_dirty(void);
+void batched_unmap_protection_update(void);
 #else
 static inline void try_to_unmap_flush(void)
 {
@@ -505,7 +506,9 @@ static inline void try_to_unmap_flush(void)
 static inline void try_to_unmap_flush_dirty(void)
 {
 }
-
+static inline void batched_unmap_protection_update(void)
+{
+}
 #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
 
 extern const struct trace_print_flags pageflag_names[];
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8edd0d576254..3de353d4b5fb 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -254,9 +254,17 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 				 dirty_accountable, prot_numa);
 	} while (pgd++, addr = next, addr != end);
 
-	/* Only flush the TLB if we actually modified any entries: */
-	if (pages)
+	/*
+	 * Only flush all TLBs if we actually modified any entries. If no
+	 * pages are modified, then call batched_unmap_protection_update
+	 * if the context is an mprotect() syscall.
+	 */
+	if (pages) {
 		flush_tlb_range(vma, start, end);
+	} else {
+		if (!prot_numa)
+			batched_unmap_protection_update();
+	}
 	clear_tlb_flush_pending(mm);
 
 	return pages;
diff --git a/mm/rmap.c b/mm/rmap.c
index d405f0e0ee96..02cb035e4ce6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -643,6 +643,26 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
 
 	return should_defer;
 }
+
+/*
+ * This is called after an mprotect update that altered no pages. Batched
+ * unmap releases the PTL before a flush occurs leaving a window where
+ * an mprotect that reduces access rights can still access the page after
+ * mprotect returns via a stale TLB entry. Avoid this possibility by flushing
+ * the local TLB if mprotect updates no pages so that the caller of
+ * mprotect always gets the expected behaviour. It's overkill to flush
+ * all TLBs for the sake of a separate thread that raced with both reclaim
+ * and mprotect as there is no risk of data corruption and the exact
+ * timing of a parallel thread seeing a protection update without any
+ * serialisation on the application side is always uncertain.
+ */
+void batched_unmap_protection_update(void)
+{
+	count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
+	local_flush_tlb();
+	trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
+}
+
 #else
 static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
 {

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

Thread overview: 70+ messages
2017-07-11  0:52 Potential race in TLB flush batching? Nadav Amit
2017-07-11  6:41 ` Mel Gorman
2017-07-11  7:30   ` Nadav Amit
2017-07-11  9:29     ` Mel Gorman
2017-07-11 10:40       ` Nadav Amit
2017-07-11 13:20         ` Mel Gorman [this message]
2017-07-11 14:58           ` Andy Lutomirski
2017-07-11 15:53             ` Mel Gorman
2017-07-11 17:23               ` Andy Lutomirski
2017-07-11 19:18                 ` Mel Gorman
2017-07-11 20:06                   ` Nadav Amit
2017-07-11 21:09                     ` Mel Gorman
2017-07-11 20:09                   ` Mel Gorman
2017-07-11 21:52                     ` Mel Gorman
2017-07-11 22:27                       ` Nadav Amit
2017-07-11 22:34                         ` Nadav Amit
2017-07-12  8:27                         ` Mel Gorman
2017-07-12 23:27                           ` Nadav Amit
2017-07-12 23:36                             ` Andy Lutomirski
2017-07-12 23:42                               ` Nadav Amit
2017-07-13  5:38                                 ` Andy Lutomirski
2017-07-13 16:05                                   ` Nadav Amit
2017-07-13 16:06                                     ` Andy Lutomirski
2017-07-13  6:07                             ` Mel Gorman
2017-07-13 16:08                               ` Andy Lutomirski
2017-07-13 17:07                                 ` Mel Gorman
2017-07-13 17:15                                   ` Andy Lutomirski
2017-07-13 18:23                                     ` Mel Gorman
2017-07-14 23:16                               ` Nadav Amit
2017-07-15 15:55                                 ` Mel Gorman
2017-07-15 16:41                                   ` Andy Lutomirski
2017-07-17  7:49                                     ` Mel Gorman
2017-07-18 21:28                                   ` Nadav Amit
2017-07-19  7:41                                     ` Mel Gorman
2017-07-19 19:41                                       ` Nadav Amit
2017-07-19 19:58                                         ` Mel Gorman
2017-07-19 20:20                                           ` Nadav Amit
2017-07-19 21:47                                             ` Mel Gorman
2017-07-19 22:19                                               ` Nadav Amit
2017-07-19 22:59                                                 ` Mel Gorman
2017-07-19 23:39                                                   ` Nadav Amit
2017-07-20  7:43                                                     ` Mel Gorman
2017-07-22  1:19                                                       ` Nadav Amit
2017-07-24  9:58                                                         ` Mel Gorman
2017-07-24 19:46                                                           ` Nadav Amit
2017-07-25  7:37                                                           ` Minchan Kim
2017-07-25  8:51                                                             ` Mel Gorman
2017-07-25  9:11                                                               ` Minchan Kim
2017-07-25 10:10                                                                 ` Mel Gorman
2017-07-26  5:43                                                                   ` Minchan Kim
2017-07-26  9:22                                                                     ` Mel Gorman
2017-07-26 19:18                                                                       ` Nadav Amit
2017-07-26 23:40                                                                         ` Minchan Kim
2017-07-27  0:09                                                                           ` Nadav Amit
2017-07-27  0:34                                                                             ` Minchan Kim
2017-07-27  0:48                                                                               ` Nadav Amit
2017-07-27  1:13                                                                                 ` Nadav Amit
2017-07-27  7:04                                                                                   ` Minchan Kim
2017-07-27  7:21                                                                                     ` Mel Gorman
2017-07-27 16:04                                                                                       ` Nadav Amit
2017-07-27 17:36                                                                                         ` Mel Gorman
2017-07-26 23:44                                                                       ` Minchan Kim
2017-07-11 22:07                   ` Andy Lutomirski
2017-07-11 22:33                     ` Mel Gorman
2017-07-14  7:00                     ` Benjamin Herrenschmidt
2017-07-14  8:31                       ` Mel Gorman
2017-07-14  9:02                         ` Benjamin Herrenschmidt
2017-07-14  9:27                           ` Mel Gorman
2017-07-14 22:21                             ` Andy Lutomirski
2017-07-11 16:22           ` Nadav Amit
