Linux-Next Archive on lore.kernel.org
 help / color / Atom feed
From: Peter Zijlstra <peterz@infradead.org>
To: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@elte.hu>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Linux-Next Mailing List <linux-next@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Nadav Amit <namit@vmware.com>,
	Linus <torvalds@linux-foundation.org>
Subject: Re: linux-next: manual merge of the akpm-current tree with the tip tree
Date: Fri, 11 Aug 2017 11:34:49 +0200
Message-ID: <20170811093449.w5wttpulmwfykjzm@hirez.programming.kicks-ass.net> (raw)
In-Reply-To: <20170811175326.36d546dc@canb.auug.org.au>

[-- Attachment #1: Type: text/plain, Size: 1108 bytes --]

On Fri, Aug 11, 2017 at 05:53:26PM +1000, Stephen Rothwell wrote:
> Hi all,
> 
> Today's linux-next merge of the akpm-current tree got conflicts in:
> 
>   include/linux/mm_types.h
>   mm/huge_memory.c
> 
> between commit:
> 
>   8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")
> 
> from the tip tree and commits:
> 
>   16af97dc5a89 ("mm: migrate: prevent racy access to tlb_flush_pending")
>   a9b802500ebb ("Revert "mm: numa: defer TLB flush for THP migration as long as possible"")
> 
> from the akpm-current tree.
> 
> The latter 2 are now in Linus' tree as well (but were not when I started
> the day).
> 
> The only way forward I could see was to revert
> 
>   8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")
> 
> and the three following commits
> 
>   ff7a5fb0f1d5 ("overlayfs, locking: Remove smp_mb__before_spinlock() usage")
>   d89e588ca408 ("locking: Introduce smp_mb__after_spinlock()")
>   a9668cd6ee28 ("locking: Remove smp_mb__before_spinlock()")
> 
> before merging the akpm-current tree again.

Here's two patches that apply on top of tip.


[-- Attachment #2: nadav_amit-mm-migrate__prevent_racy_access_to_tlb_flush_pending.patch --]
[-- Type: text/x-diff, Size: 4923 bytes --]

Subject: mm: migrate: prevent racy access to tlb_flush_pending
From: Nadav Amit <nadav.amit@gmail.com>
Date: Tue, 1 Aug 2017 17:08:12 -0700

Setting and clearing mm->tlb_flush_pending can be performed by multiple
threads, since mmap_sem may only be acquired for read in
task_numa_work(). If this happens, tlb_flush_pending might be cleared
while one of the threads still changes PTEs and batches TLB flushes.

This can lead to the same race between migration and
change_protection_range() that led to the introduction of
tlb_flush_pending. The result of this race was data corruption, which
means that this patch also addresses a theoretically possible data
corruption.

An actual data corruption was not observed, yet the race was
was confirmed by adding assertion to check tlb_flush_pending is not set
by two threads, adding artificial latency in change_protection_range()
and using sysctl to reduce kernel.numa_balancing_scan_delay_ms.

Fixes: 20841405940e ("mm: fix TLB flush race between migration, and
change_protection_range")


Cc: <akpm@linux-foundation.org>
Cc: CC:     <nadav.amit@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170802000818.4760-2-namit@vmware.com
---
 include/linux/mm_types.h |   29 +++++++++++++++++++++--------
 kernel/fork.c            |    2 +-
 mm/debug.c               |    2 +-
 mm/mprotect.c            |    4 ++--
 4 files changed, 25 insertions(+), 12 deletions(-)

--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -493,7 +493,7 @@ struct mm_struct {
 	 * can move process memory needs to flush the TLB when moving a
 	 * PROT_NONE or PROT_NUMA mapped page.
 	 */
-	bool tlb_flush_pending;
+	atomic_t tlb_flush_pending;
 #endif
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	/* See flush_tlb_batched_pending() */
@@ -535,11 +535,17 @@ static inline bool mm_tlb_flush_pending(
 	 * Must be called with PTL held; such that our PTL acquire will have
 	 * observed the store from set_tlb_flush_pending().
 	 */
-	return mm->tlb_flush_pending;
+	return atomic_read(&mm->tlb_flush_pending);
 }
-static inline void set_tlb_flush_pending(struct mm_struct *mm)
+
+static inline void init_tlb_flush_pending(struct mm_struct *mm)
 {
-	mm->tlb_flush_pending = true;
+	atomic_set(&mm->tlb_flush_pending, 0);
+}
+
+static inline void inc_tlb_flush_pending(struct mm_struct *mm)
+{
+	atomic_inc(&mm->tlb_flush_pending);
 	/*
 	 * The only time this value is relevant is when there are indeed pages
 	 * to flush. And we'll only flush pages after changing them, which
@@ -565,21 +571,28 @@ static inline void set_tlb_flush_pending
 	 * store is constrained by the TLB invalidate.
 	 */
 }
+
 /* Clearing is done after a TLB flush, which also provides a barrier. */
-static inline void clear_tlb_flush_pending(struct mm_struct *mm)
+static inline void dec_tlb_flush_pending(struct mm_struct *mm)
 {
 	/* see set_tlb_flush_pending */
-	mm->tlb_flush_pending = false;
+	atomic_dec(&mm->tlb_flush_pending);
 }
 #else
 static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
 {
 	return false;
 }
-static inline void set_tlb_flush_pending(struct mm_struct *mm)
+
+static inline void init_tlb_flush_pending(struct mm_struct *mm)
 {
 }
-static inline void clear_tlb_flush_pending(struct mm_struct *mm)
+
+static inline void inc_tlb_flush_pending(struct mm_struct *mm)
+{
+}
+
+static inline void dec_tlb_flush_pending(struct mm_struct *mm)
 {
 }
 #endif
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -809,7 +809,7 @@ static struct mm_struct *mm_init(struct
 	mm_init_aio(mm);
 	mm_init_owner(mm, p);
 	mmu_notifier_mm_init(mm);
-	clear_tlb_flush_pending(mm);
+	init_tlb_flush_pending(mm);
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	mm->pmd_huge_pte = NULL;
 #endif
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -159,7 +159,7 @@ void dump_mm(const struct mm_struct *mm)
 		mm->numa_next_scan, mm->numa_scan_offset, mm->numa_scan_seq,
 #endif
 #if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
-		mm->tlb_flush_pending,
+		atomic_read(&mm->tlb_flush_pending),
 #endif
 		mm->def_flags, &mm->def_flags
 	);
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -244,7 +244,7 @@ static unsigned long change_protection_r
 	BUG_ON(addr >= end);
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
-	set_tlb_flush_pending(mm);
+	inc_tlb_flush_pending(mm);
 	do {
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
@@ -256,7 +256,7 @@ static unsigned long change_protection_r
 	/* Only flush the TLB if we actually modified any entries: */
 	if (pages)
 		flush_tlb_range(vma, start, end);
-	clear_tlb_flush_pending(mm);
+	dec_tlb_flush_pending(mm);
 
 	return pages;
 }

[-- Attachment #3: nadav_amit-revert__mm-numa__defer_tlb_flush_for_thp_migration_as_long_as_possible_.patch --]
[-- Type: text/x-diff, Size: 2576 bytes --]

Subject: Revert "mm: numa: defer TLB flush for THP migration as long as possible"
From:   Nadav Amit <namit@vmware.com>
Date: Tue, 1 Aug 2017 17:08:14 -0700

While deferring TLB flushes is a good practice, the reverted patch
caused pending TLB flushes to be checked while the page-table lock is
not taken. As a result, in architectures with weak memory model (PPC),
Linux may miss a memory-barrier, miss the fact TLB flushes are pending,
and cause (in theory) a memory corruption.

Since the alternative of using smp_mb__after_unlock_lock() was
considered a bit open-coded, and the performance impact is expected to
be small, the previous patch is reverted.

This reverts commit b0943d61b8fa420180f92f64ef67662b4f6cc493.

Cc: <akpm@linux-foundation.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: CC:     <nadav.amit@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Suggested-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170802000818.4760-4-namit@vmware.com
---
 mm/huge_memory.c |   13 ++++---------
 1 file changed, 4 insertions(+), 9 deletions(-)

--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1410,7 +1410,6 @@ int do_huge_pmd_numa_page(struct vm_faul
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
 	int page_nid = -1, this_nid = numa_node_id();
 	int target_nid, last_cpupid = -1;
-	bool need_flush = false;
 	bool page_locked;
 	bool migrated = false;
 	bool was_writable;
@@ -1503,9 +1502,12 @@ int do_huge_pmd_numa_page(struct vm_faul
 	 *
 	 * Must be done under PTL such that we'll observe the relevant
 	 * set_tlb_flush_pending().
+	 *
+	 * We are not sure a pending tlb flush here is for a huge page
+	 * mapping or not. Hence use the tlb range variant
 	 */
 	if (mm_tlb_flush_pending(vma->vm_mm))
-		need_flush = true;
+		flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
 
 	/*
 	 * Migrate the THP to the requested node, returns with page unlocked
@@ -1513,13 +1515,6 @@ int do_huge_pmd_numa_page(struct vm_faul
 	 */
 	spin_unlock(vmf->ptl);
 
-	/*
-	 * We are not sure a pending tlb flush here is for a huge page
-	 * mapping or not. Hence use the tlb range variant
-	 */
-	if (need_flush)
-		flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
-
 	migrated = migrate_misplaced_transhuge_page(vma->vm_mm, vma,
 				vmf->pmd, pmd, vmf->address, page, target_nid);
 	if (migrated) {

  reply index

Thread overview: 87+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-08-11  7:53 Stephen Rothwell
2017-08-11  9:34 ` Peter Zijlstra [this message]
2017-08-11 10:48   ` Peter Zijlstra
2017-08-11 11:45   ` Stephen Rothwell
2017-08-11 11:56     ` Ingo Molnar
2017-08-11 12:17       ` Peter Zijlstra
2017-08-11 12:44         ` Ingo Molnar
2017-08-11 13:49           ` Stephen Rothwell
2017-08-11 14:04       ` Peter Zijlstra
2017-08-13  6:06         ` Nadav Amit
2017-08-13 12:50           ` Peter Zijlstra
2017-08-14  3:16             ` Minchan Kim
2017-08-14  5:07               ` Nadav Amit
2017-08-14  5:23                 ` Minchan Kim
2017-08-14  8:38                 ` Minchan Kim
2017-08-14 19:57                   ` Peter Zijlstra
2017-08-16  4:14                     ` Minchan Kim
2017-08-14 19:38                 ` Peter Zijlstra
2017-08-15  7:51                   ` Nadav Amit
2017-08-14  3:09         ` Minchan Kim
2017-08-14 18:54           ` Peter Zijlstra
  -- strict thread matches above, loose matches on Subject: below --
2019-10-31  5:43 Stephen Rothwell
2019-06-24 10:24 Stephen Rothwell
2019-05-01 11:10 Stephen Rothwell
2019-01-31  4:31 Stephen Rothwell
2018-08-20  4:32 Stephen Rothwell
2018-08-20 19:52 ` Andrew Morton
2018-03-23  5:59 Stephen Rothwell
2017-12-18  5:04 Stephen Rothwell
2017-11-10  4:33 Stephen Rothwell
2017-11-02  7:19 Stephen Rothwell
2017-08-22  6:57 Stephen Rothwell
2017-08-23  6:39 ` Vlastimil Babka
2017-04-12  6:46 Stephen Rothwell
2017-04-12 20:53 ` Vlastimil Babka
2017-04-20  2:17   ` NeilBrown
2017-03-24  5:25 Stephen Rothwell
2017-02-17  4:40 Stephen Rothwell
2016-11-14  6:08 Stephen Rothwell
2016-07-29  4:14 Stephen Rothwell
2016-06-15  5:23 Stephen Rothwell
2016-06-18 19:39 ` Manfred Spraul
2016-04-29  6:12 Stephen Rothwell
2016-04-29  6:26 ` Ingo Molnar
2016-03-02  5:40 Stephen Rothwell
2016-02-26  5:07 Stephen Rothwell
2016-02-26 21:35 ` Andrew Morton
2016-02-19  4:09 Stephen Rothwell
2016-02-19 15:26 ` Ard Biesheuvel
2015-12-07  8:06 Stephen Rothwell
2015-10-02  4:21 Stephen Rothwell
2015-07-28  6:00 Stephen Rothwell
2015-07-29 17:12 ` Andrea Arcangeli
2015-07-29 17:47   ` Andy Lutomirski
2015-07-29 18:46     ` Thomas Gleixner
2015-07-30 15:38       ` Andrea Arcangeli
2015-07-29 23:06   ` Stephen Rothwell
2015-07-29 23:07     ` Thomas Gleixner
2015-09-07 23:35   ` Stephen Rothwell
2015-09-08 18:11     ` Linus Torvalds
2015-09-08 22:56       ` Stephen Rothwell
2015-09-08 23:03         ` Linus Torvalds
2015-09-08 23:21           ` Andrew Morton
2015-09-16  6:58             ` Geert Uytterhoeven
2015-06-04 12:07 Stephen Rothwell
2015-04-08  8:28 Stephen Rothwell
2015-04-08  8:25 Stephen Rothwell
2014-03-17  9:31 Stephen Rothwell
2014-03-17  9:36 ` Peter Zijlstra
2014-03-19 23:27   ` Andrew Morton
2014-01-14  4:53 Stephen Rothwell
2014-01-14  5:04 ` Davidlohr Bueso
2014-01-14 12:51 ` Peter Zijlstra
2014-01-14 13:17   ` Geert Uytterhoeven
2014-01-14 13:33     ` Peter Zijlstra
2014-01-14 16:19     ` H. Peter Anvin
2014-01-14 15:15   ` H. Peter Anvin
2014-01-14 15:20     ` Geert Uytterhoeven
2014-01-14 15:41       ` Peter Zijlstra
2014-01-14 15:48         ` H. Peter Anvin
2014-01-07  6:00 Stephen Rothwell
2014-01-07  6:34 ` Tang Chen
2013-11-08  7:48 Stephen Rothwell
2013-11-08 18:58 ` Josh Triplett
2013-11-08 23:20   ` Stephen Rothwell
2013-11-09  0:19     ` Josh Triplett
2013-10-30  6:40 Stephen Rothwell

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170811093449.w5wttpulmwfykjzm@hirez.programming.kicks-ass.net \
    --to=peterz@infradead.org \
    --cc=akpm@linux-foundation.org \
    --cc=hpa@zytor.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-next@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=namit@vmware.com \
    --cc=sfr@canb.auug.org.au \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-Next Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-next/0 linux-next/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-next linux-next/ https://lore.kernel.org/linux-next \
		linux-next@vger.kernel.org
	public-inbox-index linux-next

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-next


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git