* Re: [PATCH] mremap: fix race between mremap() and page cleanning
       [not found] <026b73f6-ca1d-e7bb-766c-4aaeb7071ce6@intel.com>
@ 2016-11-17  7:45 ` Aaron Lu
       [not found] ` <CA+55aFzHfpZckv8ck19fZSFK+3TmR5eF=BsDzhwVGKrbyEBjEw@mail.gmail.com>
  1 sibling, 0 replies; 11+ messages in thread
From: Aaron Lu @ 2016-11-17  7:45 UTC (permalink / raw)
  To: Linux Memory Management List, LKML
  Cc: Dave Hansen, Linus Torvalds, Andrew Morton,
	'Kirill A. Shutemov',
	Huang Ying

[-- Attachment #1: Type: text/plain, Size: 10089 bytes --]

+LKML.

Also attached are the kernel patch that enlarges the race window and the
user-space test code raceremap.c, which you can put in will-it-scale's
tests directory and run as:
# ./raceremap_threads -t 2 -s 1

Make sure "cpu0 runs" appears in the first line.

If you get SEGFAULT:
[aaron@aaronlu will-it-scale]$ sudo ./raceremap_threads -t 2 -s 1
cpu0 runs
cpu1 runs
cpu0: going to remap
testcase:mremap
warmup
cpu1: going to clean the page
cpu1: going to write 2
min:0 max:0 total:0
min:0 max:0 total:0
cpu0: remap done
Segmentation fault

That means the race did not occur.

If you get "*cpu1 wrote 2 gets lost":
[aaron@aaronlu will-it-scale]$ sudo ./raceremap_threads -t 2 -s 1
cpu1 runs
testcase:mremap
warmup
cpu0 runs
cpu0: going to remap
cpu1: going to clean the page
cpu1: going to write 2
cpu1 wrote 2
min:0 max:0 total:0
min:0 max:0 total:0
cpu0: remap done
*cpu1 wrote 2 gets lost

That means the race occurred.

Thanks,
Aaron

On Thu, Nov 10, 2016 at 05:16:33PM +0800, Aaron Lu wrote:
> Prior to 3.15, there was a race between zap_pte_range() and
> page_mkclean() where writes to a page could be lost.  Dave Hansen
> discovered by inspection that there is a similar race between
> move_ptes() and page_mkclean().
> 
> We've been able to reproduce the issue by enlarging the race window with
> a msleep(), but have not been able to hit it without modifying the code.
> So, we think it's a real issue, but it is difficult or impossible to hit
> in practice.
> 
> The zap_pte_range() issue is fixed by commit 1cf35d47712d ("mm: split
> 'tlb_flush_mmu()' into tlb flushing and memory freeing parts"). This
> patch fixes the race between page_mkclean() and mremap().
> 
> Here is one possible way to hit the race: suppose a process mmaps a
> file with PROT_READ|PROT_WRITE and MAP_SHARED and has two threads bound
> to two different CPUs, e.g. CPU1 and CPU2. mmap returns X; thread 1
> then writes to addr X so that CPU1 now has a writable TLB entry for
> addr X. Thread 2 starts mremapping from addr X to Y while thread 1
> cleans the page and then writes to the old addr X again. The second
> write from thread 1 can succeed, but the value will be lost.
> 
>         thread 1                           thread 2
>      (bound to CPU1)                    (bound to CPU2)
> 
>   1: write 1 to addr X to get a
>      writeable TLB on this CPU
> 
>                                         2: mremap starts
> 
>                                         3: move_ptes emptied PTE for addr X
>                                            and setup new PTE for addr Y and
>                                            then dropped PTL for X and Y
> 
>   4: page laundering for pageframe N
>      (the frame mapped at addr X) by doing
>      fadvise FADV_DONTNEED. When done,
>      pageframe N is deemed clean.
> 
>   5: *write 2 to addr X
> 
>                                         6: tlb flush for addr X
> 
>   7: munmap (Y, pagesize) to make the
>      page unmapped
> 
>   8: fadvise with FADV_DONTNEED again
>      to kick the page off the pagecache
> 
>   9: pread the page from file to verify
>      the value. If 1 is there, it means
>      we have lost the written 2.
> 
>   *the write may or may not cause a segmentation fault; it depends on
>   whether the stale TLB entry is still present on that CPU.
> 
> Please note that this is only one specific way the race could occur;
> it does not mean the race can only occur in exactly the above
> configuration. For example, more than 2 threads could be involved and
> fadvise() could be done in another thread, etc.
> 
> Anonymous pages can hit a similar race between mremap() and page
> reclaim. THP: a huge PMD is moved by mremap to a new huge PMD, then the
> new huge PMD gets unmapped/split/paged out before the TLB flush for the
> old huge PMD happens in move_page_tables(), and we can still write data
> through the stale TLB entry. Normal anonymous pages are in a similar
> situation.
> 
> To fix this, check for any dirty PTE in move_ptes()/move_huge_pmd() and,
> if one is found, do the flush before dropping the PTL. If we did the
> flush for every move_ptes()/move_huge_pmd() call, then we would not need
> the whole-range flush in move_page_tables(). But if we did not, we still
> need to do the whole-range flush.
> 
> Alternatively, we could track which parts of the range were flushed in
> move_ptes()/move_huge_pmd() and which were not, to avoid flushing the
> whole range in move_page_tables(). But that would require multiple TLB
> flushes for the different sub-ranges and should be less efficient than
> a single whole-range flush.
> 
> A kernel build test on my Sandy Bridge desktop doesn't show any
> noticeable change.
> v4.9-rc4:
> real    5m14.048s
> user    32m19.800s
> sys     4m50.320s
> 
> With this commit:
> real    5m13.888s
> user    32m19.330s
> sys     4m51.200s
> 
> Reported-by: Dave Hansen <dave.hansen@intel.com>
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> ---
>  include/linux/huge_mm.h |  2 +-
>  mm/huge_memory.c        |  9 ++++++++-
>  mm/mremap.c             | 30 +++++++++++++++++++++---------
>  3 files changed, 30 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 9b9f65d99873..e35e6de633b9 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -22,7 +22,7 @@ extern int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  			unsigned char *vec);
>  extern bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>  			 unsigned long new_addr, unsigned long old_end,
> -			 pmd_t *old_pmd, pmd_t *new_pmd);
> +			 pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush);
>  extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  			unsigned long addr, pgprot_t newprot,
>  			int prot_numa);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index cdcd25cb30fe..eff3de359d50 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1426,11 +1426,12 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  
>  bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>  		  unsigned long new_addr, unsigned long old_end,
> -		  pmd_t *old_pmd, pmd_t *new_pmd)
> +		  pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
>  {
>  	spinlock_t *old_ptl, *new_ptl;
>  	pmd_t pmd;
>  	struct mm_struct *mm = vma->vm_mm;
> +	bool force_flush = false;
>  
>  	if ((old_addr & ~HPAGE_PMD_MASK) ||
>  	    (new_addr & ~HPAGE_PMD_MASK) ||
> @@ -1455,6 +1456,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>  		new_ptl = pmd_lockptr(mm, new_pmd);
>  		if (new_ptl != old_ptl)
>  			spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> +		if (pmd_present(*old_pmd) && pmd_dirty(*old_pmd))
> +			force_flush = true;
>  		pmd = pmdp_huge_get_and_clear(mm, old_addr, old_pmd);
>  		VM_BUG_ON(!pmd_none(*new_pmd));
>  
> @@ -1467,6 +1470,10 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>  		set_pmd_at(mm, new_addr, new_pmd, pmd_mksoft_dirty(pmd));
>  		if (new_ptl != old_ptl)
>  			spin_unlock(new_ptl);
> +		if (force_flush)
> +			flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE);
> +		else
> +			*need_flush = true;
>  		spin_unlock(old_ptl);
>  		return true;
>  	}
> diff --git a/mm/mremap.c b/mm/mremap.c
> index da22ad2a5678..6ccecc03f56a 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -104,11 +104,13 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>  static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
>  		unsigned long old_addr, unsigned long old_end,
>  		struct vm_area_struct *new_vma, pmd_t *new_pmd,
> -		unsigned long new_addr, bool need_rmap_locks)
> +		unsigned long new_addr, bool need_rmap_locks, bool *need_flush)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	pte_t *old_pte, *new_pte, pte;
>  	spinlock_t *old_ptl, *new_ptl;
> +	bool force_flush = false;
> +	unsigned long len = old_end - old_addr;
>  
>  	/*
>  	 * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
> @@ -146,6 +148,14 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
>  				   new_pte++, new_addr += PAGE_SIZE) {
>  		if (pte_none(*old_pte))
>  			continue;
> +
> +		/*
> +		 * We are remapping a dirty PTE, make sure to
> +		 * flush TLB before we drop the PTL for the
> +		 * old PTE or we may race with page_mkclean().
> +		 */
> +		if (pte_present(*old_pte) && pte_dirty(*old_pte))
> +			force_flush = true;
>  		pte = ptep_get_and_clear(mm, old_addr, old_pte);
>  		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
>  		pte = move_soft_dirty_pte(pte);
> @@ -156,6 +166,10 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
>  	if (new_ptl != old_ptl)
>  		spin_unlock(new_ptl);
>  	pte_unmap(new_pte - 1);
> +	if (force_flush)
> +		flush_tlb_range(vma, old_end - len, old_end);
> +	else
> +		*need_flush = true;
>  	pte_unmap_unlock(old_pte - 1, old_ptl);
>  	if (need_rmap_locks)
>  		drop_rmap_locks(vma);
> @@ -201,13 +215,12 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  				if (need_rmap_locks)
>  					take_rmap_locks(vma);
>  				moved = move_huge_pmd(vma, old_addr, new_addr,
> -						    old_end, old_pmd, new_pmd);
> +						    old_end, old_pmd, new_pmd,
> +						    &need_flush);
>  				if (need_rmap_locks)
>  					drop_rmap_locks(vma);
> -				if (moved) {
> -					need_flush = true;
> +				if (moved)
>  					continue;
> -				}
>  			}
>  			split_huge_pmd(vma, old_pmd, old_addr);
>  			if (pmd_trans_unstable(old_pmd))
> @@ -220,11 +233,10 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  			extent = next - new_addr;
>  		if (extent > LATENCY_LIMIT)
>  			extent = LATENCY_LIMIT;
> -		move_ptes(vma, old_pmd, old_addr, old_addr + extent,
> -			  new_vma, new_pmd, new_addr, need_rmap_locks);
> -		need_flush = true;
> +		move_ptes(vma, old_pmd, old_addr, old_addr + extent, new_vma,
> +			  new_pmd, new_addr, need_rmap_locks, &need_flush);
>  	}
> -	if (likely(need_flush))
> +	if (need_flush)
>  		flush_tlb_range(vma, old_end-len, old_addr);
>  
>  	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
> -- 
> 2.5.5
> 

[-- Attachment #2: 0001-mremap-add-a-2s-delay-for-MAP_FIXED-case.patch --]
[-- Type: text/plain, Size: 4339 bytes --]

From c529dfa6bdfc643a9c3debb4b61b9b0c13b0862e Mon Sep 17 00:00:00 2001
From: Aaron Lu <aaron.lu@intel.com>
Date: Thu, 17 Nov 2016 15:11:08 +0800
Subject: [PATCH] mremap: add a 2s delay for MAP_FIXED case

Add a 2s delay for the MAP_FIXED case to enlarge the race window so that
we can hit the race from user space.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 fs/exec.c          |  2 +-
 include/linux/mm.h |  2 +-
 mm/mremap.c        | 19 ++++++++++++-------
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 4e497b9ee71e..1e49ce9a23bd 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -619,7 +619,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
 	 * process cleanup to remove whatever mess we made.
 	 */
 	if (length != move_page_tables(vma, old_start,
-				       vma, new_start, length, false))
+				       vma, new_start, length, false, false))
 		return -ENOMEM;
 
 	lru_add_drain();
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a92c8d73aeaf..5e35fe3d914a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1392,7 +1392,7 @@ int vma_is_stack_for_current(struct vm_area_struct *vma);
 extern unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len,
-		bool need_rmap_locks);
+		bool need_rmap_locks, bool delay);
 extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 			      unsigned long end, pgprot_t newprot,
 			      int dirty_accountable, int prot_numa);
diff --git a/mm/mremap.c b/mm/mremap.c
index da22ad2a5678..8e35279ca622 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -22,6 +22,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/uaccess.h>
 #include <linux/mm-arch-hooks.h>
+#include <linux/delay.h>
 
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
@@ -166,7 +167,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len,
-		bool need_rmap_locks)
+		bool need_rmap_locks, bool delay)
 {
 	unsigned long extent, next, old_end;
 	pmd_t *old_pmd, *new_pmd;
@@ -224,8 +225,11 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 			  new_vma, new_pmd, new_addr, need_rmap_locks);
 		need_flush = true;
 	}
-	if (likely(need_flush))
+	if (likely(need_flush)) {
+		if (delay)
+			msleep(2000);
 		flush_tlb_range(vma, old_end-len, old_addr);
+	}
 
 	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
 
@@ -234,7 +238,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 
 static unsigned long move_vma(struct vm_area_struct *vma,
 		unsigned long old_addr, unsigned long old_len,
-		unsigned long new_len, unsigned long new_addr, bool *locked)
+		unsigned long new_len, unsigned long new_addr,
+		bool *locked, bool delay)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct vm_area_struct *new_vma;
@@ -273,7 +278,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 		return -ENOMEM;
 
 	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len,
-				     need_rmap_locks);
+				     need_rmap_locks, delay);
 	if (moved_len < old_len) {
 		err = -ENOMEM;
 	} else if (vma->vm_ops && vma->vm_ops->mremap) {
@@ -287,7 +292,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 		 * and then proceed to unmap new area instead of old.
 		 */
 		move_page_tables(new_vma, new_addr, vma, old_addr, moved_len,
-				 true);
+				 true, delay);
 		vma = new_vma;
 		old_len = new_len;
 		old_addr = new_addr;
@@ -442,7 +447,7 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
 	if (offset_in_page(ret))
 		goto out1;
 
-	ret = move_vma(vma, addr, old_len, new_len, new_addr, locked);
+	ret = move_vma(vma, addr, old_len, new_len, new_addr, locked, true);
 	if (!(offset_in_page(ret)))
 		goto out;
 out1:
@@ -576,7 +581,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 			goto out;
 		}
 
-		ret = move_vma(vma, addr, old_len, new_len, new_addr, &locked);
+		ret = move_vma(vma, addr, old_len, new_len, new_addr, &locked, false);
 	}
 out:
 	if (offset_in_page(ret)) {
-- 
2.5.5


[-- Attachment #3: raceremap.c --]
[-- Type: text/plain, Size: 2717 bytes --]

#define _GNU_SOURCE
#define _XOPEN_SOURCE 500
#include <sched.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <assert.h>
#include <sys/io.h>

#define BUFLEN 4096

static char wistmpfile[] = "/mnt/willitscale.XXXXXX";

char *testcase_description = "mremap";

char *buf;
char *newbuf = (char *)0x700000000000;
#define FILE_SIZE (4096*128)

static void mdelay(int ms)
{
	int i;

	// gain io permission for the delay
	assert(ioperm(0x80, 8, 1) == 0);

	for (i = 0; i < ms; i++)
		inb(0x80);
}

void testcase_prepare(void)
{
	int fd = mkstemp(wistmpfile);

	assert(fd >= 0);
	assert(pwrite(fd, "X", 1, FILE_SIZE-1) == 1);
	buf = mmap(NULL, FILE_SIZE, PROT_READ|PROT_WRITE,
			MAP_SHARED, fd, 0);
	assert(buf != (void *)-1);
	close(fd);
}

static volatile int step = 0;

void testcase(unsigned long long *iterations)
{
	int cpu = sched_getcpu();
	int fd = open(wistmpfile, O_RDWR);
	off_t offset = sched_getcpu() * BUFLEN;
	long counterread = 0;
	long *counterbuf = (void *)&buf[offset];
	assert(fd >= 0);

	printf("cpu%d runs\n", cpu);

	while (1) {
		int ret;

		if (cpu == 0) {
			void *tmpbuf;

			// wait for step 1 done
			while (step < 1);

			// step 2: start mremap to have the old PTE emptied
			printf("cpu%d: going to remap\n", cpu);
			step = 2;
			tmpbuf = mremap(buf, FILE_SIZE, FILE_SIZE,
					MREMAP_FIXED | MREMAP_MAYMOVE,
					newbuf);
			assert(tmpbuf == newbuf);
			printf("cpu%d: remap done\n", cpu);
			pause();
		}

		// step 1: dirty the old PTE
		*counterbuf = 1;

		step = 1;
		while (step < 2);

		// step 3: clean this page
		// delay a little while to give mremap some time
		// to empty the old PTE and setup new PTE
		mdelay(1000);
		printf("cpu%d: going to clean the page\n", cpu);
		posix_fadvise(fd, offset, BUFLEN, POSIX_FADV_DONTNEED);

		// step 4: now the page is cleaned, its new PTE is
		// write protected but since mremap didn't flush tlb
		// for the old PTE yet, we could still access the old
		// addr and that will not dirty anything
		printf("cpu%d: going to write 2\n", cpu);
		*counterbuf = 2;
		printf("cpu%d wrote 2\n", cpu);

		// step 5: drop this page from page cache and then
		// read it back to verify if the last write gets lost
		// munmap the page first, or the FADV_DONTNEED won't
		// kick the page out of page cache
		munmap(newbuf + offset, BUFLEN);
		posix_fadvise(fd, offset, BUFLEN, POSIX_FADV_DONTNEED);
		ret = pread(fd, &counterread, sizeof(counterread), offset);
		assert(ret == sizeof(counterread));

		if (counterread != 2) {
			printf("*cpu%d wrote 2 gets lost\n", cpu);
			fflush(stdout);
		}
		exit(0);
	}
}

void testcase_cleanup(void)
{
	unlink(wistmpfile);
}


* [PATCH 0/2] use mmu gather logic for tlb flush in mremap
       [not found]   ` <c160bc18-7c1b-2d54-8af1-7c5bfcbcefe8@intel.com>
@ 2016-11-28  8:37     ` Aaron Lu
  2016-11-28  8:39       ` [PATCH 1/2] tlb: export tlb_flush_mmu_tlbonly Aaron Lu
  2016-11-28  8:40       ` [PATCH 2/2] mremap: use mmu gather logic for tlb flush in mremap Aaron Lu
  0 siblings, 2 replies; 11+ messages in thread
From: Aaron Lu @ 2016-11-28  8:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Memory Management List, Dave Hansen, Andrew Morton,
	Kirill A. Shutemov, Huang Ying, linux-kernel

On Fri, Nov 18, 2016 at 10:48:20AM +0800, Aaron Lu wrote:
> On 11/18/2016 01:53 AM, Linus Torvalds wrote:
> > I'm not entirely happy with the force_flush vs need_flush games, and I
> > really think this code should be updated to use the same "struct
> > mmu_gather" that we use for the other TLB flushing cases (no need for
> > the page freeing batching, but the tlb_flush_mmu_tlbonly() logic
> > should be the same).
> 
> I see.
> 
> > 
> > But I guess that's a bigger change, so that wouldn't be appropriate for
> > rc5 or stable back-porting anyway. But it would be lovely if somebody
> > could look at that. Hint hint.
> 
> I'll work on it, thanks for the suggestion.

So here it is. I'm not quite sure if I've done the right thing in patch
2/2, i.e. whether I should just use tlb_flush_mmu() or export
tlb_flush_mmu_tlbonly() and then use it in mremap.c. Please take a look
and let me know what you think, thanks!

Regards,
Aaron


* [PATCH 1/2] tlb: export tlb_flush_mmu_tlbonly
  2016-11-28  8:37     ` [PATCH 0/2] use mmu gather logic for tlb flush in mremap Aaron Lu
@ 2016-11-28  8:39       ` Aaron Lu
  2016-11-28  8:40       ` [PATCH 2/2] mremap: use mmu gather logic for tlb flush in mremap Aaron Lu
  1 sibling, 0 replies; 11+ messages in thread
From: Aaron Lu @ 2016-11-28  8:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Memory Management List, Dave Hansen, Andrew Morton,
	Kirill A. Shutemov, Huang Ying, linux-kernel

The mmu_gather logic for TLB flushing will be used in the mremap case, so
export this function.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 include/asm-generic/tlb.h | 1 +
 mm/memory.c               | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index c6d667187608..0f1861db935c 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -119,6 +119,7 @@ struct mmu_gather {
 
 void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long start, unsigned long end);
 void tlb_flush_mmu(struct mmu_gather *tlb);
+void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb);
 void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start,
 							unsigned long end);
 extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
diff --git a/mm/memory.c b/mm/memory.c
index e18c57bdc75c..130d82f7d8a2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -238,7 +238,7 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long
 	__tlb_reset_range(tlb);
 }
 
-static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
+void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 {
 	if (!tlb->end)
 		return;
-- 
2.5.5


* [PATCH 2/2] mremap: use mmu gather logic for tlb flush in mremap
  2016-11-28  8:37     ` [PATCH 0/2] use mmu gather logic for tlb flush in mremap Aaron Lu
  2016-11-28  8:39       ` [PATCH 1/2] tlb: export tlb_flush_mmu_tlbonly Aaron Lu
@ 2016-11-28  8:40       ` Aaron Lu
  2016-11-28 17:15         ` Linus Torvalds
  2016-11-28 17:32         ` [PATCH 2/2] mremap: use mmu gather logic for tlb flush in mremap Dave Hansen
  1 sibling, 2 replies; 11+ messages in thread
From: Aaron Lu @ 2016-11-28  8:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Memory Management List, Dave Hansen, Andrew Morton,
	Kirill A. Shutemov, Huang Ying, linux-kernel

As suggested by Linus, the same mmu_gather logic can be used for the TLB
flush in mremap, and this patch does that.

Note that since no pages are added to "struct mmu_gather" for freeing
during mremap, when the TLB needs to be flushed we can use either
tlb_flush_mmu or tlb_flush_mmu_tlbonly. Using tlb_flush_mmu could also
avoid exporting tlb_flush_mmu_tlbonly, but tlb_flush_mmu_tlbonly *looks*
clearer and more straightforward, so I ended up using it.
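
(For context: tlb_flush_mmu() in the mm/memory.c of this era is
essentially the TLB-only flush followed by the page-freeing pass --
a from-memory sketch, not a verbatim copy:

	void tlb_flush_mmu(struct mmu_gather *tlb)
	{
		tlb_flush_mmu_tlbonly(tlb);	/* flush the gathered TLB range */
		tlb_flush_mmu_free(tlb);	/* free batched pages */
	}

so with no pages batched during mremap the second step is a no-op and
the choice between the two calls is purely about readability.)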

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 include/linux/huge_mm.h |  6 +++---
 mm/huge_memory.c        | 13 +++++++------
 mm/mremap.c             | 30 +++++++++++++++---------------
 3 files changed, 25 insertions(+), 24 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e35e6de633b9..bbf64e05a49a 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -20,9 +20,9 @@ extern int zap_huge_pmd(struct mmu_gather *tlb,
 extern int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			unsigned long addr, unsigned long end,
 			unsigned char *vec);
-extern bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
-			 unsigned long new_addr, unsigned long old_end,
-			 pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush);
+extern bool move_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+			 unsigned long old_addr, unsigned long new_addr,
+			 unsigned long old_end, pmd_t *old_pmd, pmd_t *new_pmd);
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			unsigned long addr, pgprot_t newprot,
 			int prot_numa);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index eff3de359d50..33909f16a9ad 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1424,9 +1424,9 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	return 1;
 }
 
-bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
-		  unsigned long new_addr, unsigned long old_end,
-		  pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
+bool move_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		   unsigned long old_addr, unsigned long new_addr,
+		   unsigned long old_end, pmd_t *old_pmd, pmd_t *new_pmd)
 {
 	spinlock_t *old_ptl, *new_ptl;
 	pmd_t pmd;
@@ -1456,8 +1456,11 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 		new_ptl = pmd_lockptr(mm, new_pmd);
 		if (new_ptl != old_ptl)
 			spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
+
+		tlb_remove_pmd_tlb_entry(tlb, old_pmd, old_addr);
 		if (pmd_present(*old_pmd) && pmd_dirty(*old_pmd))
 			force_flush = true;
+
 		pmd = pmdp_huge_get_and_clear(mm, old_addr, old_pmd);
 		VM_BUG_ON(!pmd_none(*new_pmd));
 
@@ -1471,9 +1474,7 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 		if (new_ptl != old_ptl)
 			spin_unlock(new_ptl);
 		if (force_flush)
-			flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE);
-		else
-			*need_flush = true;
+			tlb_flush_mmu_tlbonly(tlb);
 		spin_unlock(old_ptl);
 		return true;
 	}
diff --git a/mm/mremap.c b/mm/mremap.c
index 6ccecc03f56a..dfac96ec7294 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -25,6 +25,7 @@
 
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
+#include <asm/tlb.h>
 
 #include "internal.h"
 
@@ -101,16 +102,15 @@ static pte_t move_soft_dirty_pte(pte_t pte)
 	return pte;
 }
 
-static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
-		unsigned long old_addr, unsigned long old_end,
+static void move_ptes(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		pmd_t *old_pmd, unsigned long old_addr, unsigned long old_end,
 		struct vm_area_struct *new_vma, pmd_t *new_pmd,
-		unsigned long new_addr, bool need_rmap_locks, bool *need_flush)
+		unsigned long new_addr, bool need_rmap_locks)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *old_pte, *new_pte, pte;
 	spinlock_t *old_ptl, *new_ptl;
 	bool force_flush = false;
-	unsigned long len = old_end - old_addr;
 
 	/*
 	 * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
@@ -149,6 +149,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 		if (pte_none(*old_pte))
 			continue;
 
+		tlb_remove_tlb_entry(tlb, old_pte, old_addr);
 		/*
 		 * We are remapping a dirty PTE, make sure to
 		 * flush TLB before we drop the PTL for the
@@ -166,10 +167,9 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 	if (new_ptl != old_ptl)
 		spin_unlock(new_ptl);
 	pte_unmap(new_pte - 1);
+
 	if (force_flush)
-		flush_tlb_range(vma, old_end - len, old_end);
-	else
-		*need_flush = true;
+		tlb_flush_mmu_tlbonly(tlb);
 	pte_unmap_unlock(old_pte - 1, old_ptl);
 	if (need_rmap_locks)
 		drop_rmap_locks(vma);
@@ -184,15 +184,16 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 {
 	unsigned long extent, next, old_end;
 	pmd_t *old_pmd, *new_pmd;
-	bool need_flush = false;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_gather tlb;
 
 	old_end = old_addr + len;
 	flush_cache_range(vma, old_addr, old_end);
 
 	mmun_start = old_addr;
 	mmun_end   = old_end;
+	tlb_gather_mmu(&tlb, vma->vm_mm, mmun_start, mmun_end);
 	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);
 
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
@@ -214,9 +215,9 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 				/* See comment in move_ptes() */
 				if (need_rmap_locks)
 					take_rmap_locks(vma);
-				moved = move_huge_pmd(vma, old_addr, new_addr,
-						    old_end, old_pmd, new_pmd,
-						    &need_flush);
+				moved = move_huge_pmd(&tlb, vma, old_addr,
+						      new_addr, old_end,
+						      old_pmd, new_pmd);
 				if (need_rmap_locks)
 					drop_rmap_locks(vma);
 				if (moved)
@@ -233,13 +234,12 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 			extent = next - new_addr;
 		if (extent > LATENCY_LIMIT)
 			extent = LATENCY_LIMIT;
-		move_ptes(vma, old_pmd, old_addr, old_addr + extent, new_vma,
-			  new_pmd, new_addr, need_rmap_locks, &need_flush);
+		move_ptes(&tlb, vma, old_pmd, old_addr, old_addr + extent,
+			  new_vma, new_pmd, new_addr, need_rmap_locks);
 	}
-	if (need_flush)
-		flush_tlb_range(vma, old_end-len, old_addr);
 
 	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
+	tlb_finish_mmu(&tlb, mmun_start, mmun_end);
 
 	return len + old_addr - old_end;	/* how much done */
 }
-- 
2.5.5


* Re: [PATCH 2/2] mremap: use mmu gather logic for tlb flush in mremap
  2016-11-28  8:40       ` [PATCH 2/2] mremap: use mmu gather logic for tlb flush in mremap Aaron Lu
@ 2016-11-28 17:15         ` Linus Torvalds
  2016-11-29  2:57           ` [PATCH] mremap: move_ptes: check pte dirty after its removal Aaron Lu
  2016-11-28 17:32         ` [PATCH 2/2] mremap: use mmu gather logic for tlb flush in mremap Dave Hansen
  1 sibling, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2016-11-28 17:15 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Linux Memory Management List, Dave Hansen, Andrew Morton,
	Kirill A. Shutemov, Huang Ying, Linux Kernel Mailing List

On Mon, Nov 28, 2016 at 12:40 AM, Aaron Lu <aaron.lu@intel.com> wrote:
> As suggested by Linus, the same mmu_gather logic can be used for the TLB
> flush in mremap, and this patch does that.

Ok, looking at this patch, I still think it looks like the right thing
to do, but I'm admittedly rather less certain of it.

The main advantage of the mmu_gather thing is that it automatically
takes care of the TLB flush ranges for us, and that's a big deal
during munmap() (where the actual unmapped page range can be _very_
different from the total range), but now that I notice that this
doesn't actually remove any other code (in fact, it adds a line), I'm
wondering if it's worth it. mremap() is already "dense" in the vma
space, unlike munmap (ie you can't move multiple vma's with a single
mremap), so the fancy range optimizations that make a difference on
some architectures aren't much of an issue.
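
(The range bookkeeping being referred to is, roughly, the generic helper
in asm-generic/tlb.h of this era; a sketch, details vary by architecture:

	static inline void __tlb_adjust_range(struct mmu_gather *tlb,
					      unsigned long address)
	{
		tlb->start = min(tlb->start, address);
		tlb->end = max(tlb->end, address + PAGE_SIZE);
	}

tlb_remove_tlb_entry() calls this for every PTE it clears, so
tlb_flush_mmu_tlbonly() ends up flushing only the accumulated
[start, end) range.)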

So I guess the MM people should take a look at this and say whether
they think the current state is fine or whether we should do the
mmu_gather thing. People?

However, I also independently think I found an actual bug while
looking at the code as part of looking at the patch.

This part looks racy:

                /*
                 * We are remapping a dirty PTE, make sure to
                 * flush TLB before we drop the PTL for the
                 * old PTE or we may race with page_mkclean().
                 */
                if (pte_present(*old_pte) && pte_dirty(*old_pte))
                        force_flush = true;
                pte = ptep_get_and_clear(mm, old_addr, old_pte);

where the issue is that another thread might make the pte be dirty (in
the hardware walker, so no locking of ours make any difference)
*after* we checked whether it was dirty, but *before* we removed it
from the page tables.

So I think the "check for force-flush" needs to come *after*, and we should do

                pte = ptep_get_and_clear(mm, old_addr, old_pte);
                if (pte_present(pte) && pte_dirty(pte))
                        force_flush = true;

instead.

This happens for the pmd case too.
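
The pmd side would mirror that, i.e. test the value returned by
pmdp_huge_get_and_clear() instead of *old_pmd (which is what the
follow-up patch later in this thread ends up doing):

                pmd = pmdp_huge_get_and_clear(mm, old_addr, old_pmd);
                if (pmd_present(pmd) && pmd_dirty(pmd))
                        force_flush = true;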

So now I'm not sure the mmu_gather thing is worth it, but I'm pretty
sure that there remains a (very very) small race that wasn't fixed by
the original fix in commit 5d1904204c99 ("mremap: fix race between
mremap() and page cleanning").

Aaron, sorry for waffling about this, and asking you to look at a
completely different issue instead.

                 Linus


* Re: [PATCH 2/2] mremap: use mmu gather logic for tlb flush in mremap
  2016-11-28  8:40       ` [PATCH 2/2] mremap: use mmu gather logic for tlb flush in mremap Aaron Lu
  2016-11-28 17:15         ` Linus Torvalds
@ 2016-11-28 17:32         ` Dave Hansen
  2016-11-28 17:42           ` Linus Torvalds
  1 sibling, 1 reply; 11+ messages in thread
From: Dave Hansen @ 2016-11-28 17:32 UTC (permalink / raw)
  To: Aaron Lu, Linus Torvalds
  Cc: Linux Memory Management List, Andrew Morton, Kirill A. Shutemov,
	Huang Ying, linux-kernel

On 11/28/2016 12:40 AM, Aaron Lu wrote:
> As suggested by Linus, the same mmu_gather logic can be used for the TLB
> flush in mremap, and this patch does that.
> 
> Note that since no pages are added to "struct mmu_gather" for freeing
> during mremap, when the TLB needs to be flushed we can use either
> tlb_flush_mmu or tlb_flush_mmu_tlbonly. Using tlb_flush_mmu could also
> avoid exporting tlb_flush_mmu_tlbonly, but tlb_flush_mmu_tlbonly *looks*
> clearer and more straightforward, so I ended up using it.

OK, so the code before this patch was passing around a pointer to
'need_flush', and we basically just pass around an mmu_gather instead.
It doesn't really simplify the code _too_ much, although it does make it
less confusing than when we saw 'need_flush' mixed with 'force_flush' in
the code.

tlb_flush_mmu_tlbonly() has exactly one other use: zap_pte_range() for
flushing the TLB before pte_unmap_unlock():

        if (force_flush)
                tlb_flush_mmu_tlbonly(tlb);
        pte_unmap_unlock(start_pte, ptl);

But, both call-sites are still keeping 'force_flush' to store the
information about whether we ever saw a dirty pte.  If we moved _that_
logic into the x86 mmu_gather code, we could get rid of all the
'force_flush' tracking in both call sites.  It also makes us a bit more
future-proof against these page_mkclean() races if we ever grow a third
site for clearing ptes.

Instead of exporting and calling tlb_flush_mmu_tlbonly(), we'd need
something like tlb_flush_mmu_before_ptl_release() (but with a better
name, of course :).
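
A rough sketch of what such a helper could look like -- the helper names
and the extra mmu_gather field here are hypothetical, not existing
kernel API:

	/*
	 * Hypothetical: let the mmu_gather itself remember that a dirty
	 * PTE was cleared, so individual call sites no longer need to
	 * track force_flush by hand.
	 */
	static inline void tlb_note_dirty_pte(struct mmu_gather *tlb, pte_t pte)
	{
		if (pte_present(pte) && pte_dirty(pte))
			tlb->flush_before_unlock = true;	/* assumed new field */
	}

	static inline void tlb_flush_mmu_before_ptl_release(struct mmu_gather *tlb)
	{
		if (tlb->flush_before_unlock) {
			tlb_flush_mmu_tlbonly(tlb);
			tlb->flush_before_unlock = false;
		}
	}

Call sites like zap_pte_range() and move_ptes() would then mark dirty
PTEs as they clear them and call tlb_flush_mmu_before_ptl_release()
right before pte_unmap_unlock().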


* Re: [PATCH 2/2] mremap: use mmu gather logic for tlb flush in mremap
  2016-11-28 17:32         ` [PATCH 2/2] mremap: use mmu gather logic for tlb flush in mremap Dave Hansen
@ 2016-11-28 17:42           ` Linus Torvalds
  0 siblings, 0 replies; 11+ messages in thread
From: Linus Torvalds @ 2016-11-28 17:42 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Aaron Lu, Linux Memory Management List, Andrew Morton,
	Kirill A. Shutemov, Huang Ying, Linux Kernel Mailing List

On Mon, Nov 28, 2016 at 9:32 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>
> But, both call-sites are still keeping 'force_flush' to store the
> information about whether we ever saw a dirty pte.  If we moved _that_
> logic into the x86 mmu_gather code, we could get rid of all the
> 'force_flush' tracking in both call sites.  It also makes us a bit more
> future-proof against these page_mkclean() races if we ever grow a third
> site for clearing ptes.

Yeah, that sounds like a nice cleanup and would put all the real state
into that mmu_gather structure.

            Linus


* [PATCH] mremap: move_ptes: check pte dirty after its removal
  2016-11-28 17:15         ` Linus Torvalds
@ 2016-11-29  2:57           ` Aaron Lu
  2016-11-29  3:06             ` Linus Torvalds
  0 siblings, 1 reply; 11+ messages in thread
From: Aaron Lu @ 2016-11-29  2:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Memory Management List, Dave Hansen, Andrew Morton,
	Kirill A. Shutemov, Huang Ying, Linux Kernel Mailing List

On 11/29/2016 01:15 AM, Linus Torvalds wrote:
> However, I also independently think I found an actual bug while
> looking at the code as part of looking at the patch.
> 
> This part looks racy:
> 
>                 /*
>                  * We are remapping a dirty PTE, make sure to
>                  * flush TLB before we drop the PTL for the
>                  * old PTE or we may race with page_mkclean().
>                  */
>                 if (pte_present(*old_pte) && pte_dirty(*old_pte))
>                         force_flush = true;
>                 pte = ptep_get_and_clear(mm, old_addr, old_pte);
> 
> where the issue is that another thread might make the pte be dirty (in
> the hardware walker, so no locking of ours make any difference)
> *after* we checked whether it was dirty, but *before* we removed it
> from the page tables.

Ah, very right. Thanks for the catch!

> 
> So I think the "check for force-flush" needs to come *after*, and we should do
> 
>                 pte = ptep_get_and_clear(mm, old_addr, old_pte);
>                 if (pte_present(pte) && pte_dirty(pte))
>                         force_flush = true;
> 
> instead.
> 
> This happens for the pmd case too.

Here is a fix patch, sorry for the trouble.

From c0dc52fd3d3be93afb5b97804937a1b1b7ef136e Mon Sep 17 00:00:00 2001
From: Aaron Lu <aaron.lu@intel.com>
Date: Tue, 29 Nov 2016 10:33:37 +0800
Subject: [PATCH] mremap: move_ptes: check pte dirty after its removal

Linus found there still is a race in mremap after commit 5d1904204c99
("mremap: fix race between mremap() and page cleanning").

As described by Linus:
the issue is that another thread might make the pte be dirty (in
the hardware walker, so no locking of ours make any difference)
*after* we checked whether it was dirty, but *before* we removed it
from the page tables.

Fix it by moving the check after we removed it from the page table.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 mm/huge_memory.c | 2 +-
 mm/mremap.c      | 6 +++++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index eff3de359d50..a3e466c489a9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1456,9 +1456,9 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 		new_ptl = pmd_lockptr(mm, new_pmd);
 		if (new_ptl != old_ptl)
 			spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
+		pmd = pmdp_huge_get_and_clear(mm, old_addr, old_pmd);
 		if (pmd_present(*old_pmd) && pmd_dirty(*old_pmd))
 			force_flush = true;
-		pmd = pmdp_huge_get_and_clear(mm, old_addr, old_pmd);
 		VM_BUG_ON(!pmd_none(*new_pmd));
 
 		if (pmd_move_must_withdraw(new_ptl, old_ptl) &&
diff --git a/mm/mremap.c b/mm/mremap.c
index 6ccecc03f56a..4b39dd0974e5 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -149,14 +149,18 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 		if (pte_none(*old_pte))
 			continue;
 
+		pte = ptep_get_and_clear(mm, old_addr, old_pte);
 		/*
 		 * We are remapping a dirty PTE, make sure to
 		 * flush TLB before we drop the PTL for the
 		 * old PTE or we may race with page_mkclean().
+		 *
+		 * This check has to be done after we removed the
+		 * old PTE from page tables or another thread may
+		 * dirty it after the check and before the removal.
 		 */
 		if (pte_present(*old_pte) && pte_dirty(*old_pte))
 			force_flush = true;
-		pte = ptep_get_and_clear(mm, old_addr, old_pte);
 		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
 		pte = move_soft_dirty_pte(pte);
 		set_pte_at(mm, new_addr, new_pte, pte);
-- 
2.5.5


* Re: [PATCH] mremap: move_ptes: check pte dirty after its removal
  2016-11-29  2:57           ` [PATCH] mremap: move_ptes: check pte dirty after its removal Aaron Lu
@ 2016-11-29  3:06             ` Linus Torvalds
  2016-11-29  3:22               ` Aaron Lu
  2016-11-29  5:27               ` [PATCH update] " Aaron Lu
  0 siblings, 2 replies; 11+ messages in thread
From: Linus Torvalds @ 2016-11-29  3:06 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Linux Memory Management List, Dave Hansen, Andrew Morton,
	Kirill A. Shutemov, Huang Ying, Linux Kernel Mailing List

On Mon, Nov 28, 2016 at 6:57 PM, Aaron Lu <aaron.lu@intel.com> wrote:
>
> Here is a fix patch, sorry for the trouble.

I don't think you tested this one.. You've now essentially reverted
5d1904204c99 entirely by making the new force_flush logic a no-op.

> +               pmd = pmdp_huge_get_and_clear(mm, old_addr, old_pmd);
>                 if (pmd_present(*old_pmd) && pmd_dirty(*old_pmd))
>                         force_flush = true;

You need to be testing "pmd", not "*old_pmd".

Because now "*old_pmd" will be zeroes.

>                 if (pte_present(*old_pte) && pte_dirty(*old_pte))
>                         force_flush = true;

Similarly here. You need to check "pte", not "*old_pte".

            Linus


* Re: [PATCH] mremap: move_ptes: check pte dirty after its removal
  2016-11-29  3:06             ` Linus Torvalds
@ 2016-11-29  3:22               ` Aaron Lu
  2016-11-29  5:27               ` [PATCH update] " Aaron Lu
  1 sibling, 0 replies; 11+ messages in thread
From: Aaron Lu @ 2016-11-29  3:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Memory Management List, Dave Hansen, Andrew Morton,
	Kirill A. Shutemov, Huang Ying, Linux Kernel Mailing List

On Mon, Nov 28, 2016 at 07:06:39PM -0800, Linus Torvalds wrote:
> On Mon, Nov 28, 2016 at 6:57 PM, Aaron Lu <aaron.lu@intel.com> wrote:
> >
> > Here is a fix patch, sorry for the trouble.
> 
> I don't think you tested this one.. You've now essentially reverted
> 5d1904204c99 entirely by making the new force_flush logic a no-op.

Right, I just did a build test.
Now I'm doing more tests, sorry for being careless.

Regards,
Aaron

> 
> > +               pmd = pmdp_huge_get_and_clear(mm, old_addr, old_pmd);
> >                 if (pmd_present(*old_pmd) && pmd_dirty(*old_pmd))
> >                         force_flush = true;
> 
> You need to be testing "pmd", not "*old_pmd".
> 
> Because now "*old_pmd" will be zeroes.
> 
> >                 if (pte_present(*old_pte) && pte_dirty(*old_pte))
> >                         force_flush = true;
> 
> Similarly here. You need to check "pte", not "*old_pte".
> 
>             Linus
> 


* [PATCH update] mremap: move_ptes: check pte dirty after its removal
  2016-11-29  3:06             ` Linus Torvalds
  2016-11-29  3:22               ` Aaron Lu
@ 2016-11-29  5:27               ` Aaron Lu
  1 sibling, 0 replies; 11+ messages in thread
From: Aaron Lu @ 2016-11-29  5:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Memory Management List, Dave Hansen, Andrew Morton,
	Kirill A. Shutemov, Huang Ying, Linux Kernel Mailing List

Linus found there still is a race in mremap after commit 5d1904204c99
("mremap: fix race between mremap() and page cleanning").

As described by Linus:
the issue is that another thread might make the pte be dirty (in
the hardware walker, so no locking of ours make any difference)
*after* we checked whether it was dirty, but *before* we removed it
from the page tables.

Fix it by moving the check after we removed it from the page table.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 mm/huge_memory.c | 4 ++--
 mm/mremap.c      | 8 ++++++--
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index eff3de359d50..d4a6e4001512 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1456,9 +1456,9 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 		new_ptl = pmd_lockptr(mm, new_pmd);
 		if (new_ptl != old_ptl)
 			spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
-		if (pmd_present(*old_pmd) && pmd_dirty(*old_pmd))
-			force_flush = true;
 		pmd = pmdp_huge_get_and_clear(mm, old_addr, old_pmd);
+		if (pmd_present(pmd) && pmd_dirty(pmd))
+			force_flush = true;
 		VM_BUG_ON(!pmd_none(*new_pmd));
 
 		if (pmd_move_must_withdraw(new_ptl, old_ptl) &&
diff --git a/mm/mremap.c b/mm/mremap.c
index 6ccecc03f56a..53df7ec8d2ba 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -149,14 +149,18 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 		if (pte_none(*old_pte))
 			continue;
 
+		pte = ptep_get_and_clear(mm, old_addr, old_pte);
 		/*
 		 * We are remapping a dirty PTE, make sure to
 		 * flush TLB before we drop the PTL for the
 		 * old PTE or we may race with page_mkclean().
+		 *
+		 * This check has to be done after we removed the
+		 * old PTE from page tables or another thread may
+		 * dirty it after the check and before the removal.
 		 */
-		if (pte_present(*old_pte) && pte_dirty(*old_pte))
+		if (pte_present(pte) && pte_dirty(pte))
 			force_flush = true;
-		pte = ptep_get_and_clear(mm, old_addr, old_pte);
 		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
 		pte = move_soft_dirty_pte(pte);
 		set_pte_at(mm, new_addr, new_pte, pte);
-- 
2.5.5

