All of lore.kernel.org
 help / color / mirror / Atom feed
From: Nadav Amit <nadav.amit@gmail.com>
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Nadav Amit <namit@vmware.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Yu Zhao <yuzhao@google.com>, Andy Lutomirski <luto@kernel.org>,
	Peter Xu <peterx@redhat.com>, Pavel Emelyanov <xemul@openvz.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	Minchan Kim <minchan@kernel.org>, Will Deacon <will@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>
Subject: [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup
Date: Fri, 25 Dec 2020 01:25:29 -0800	[thread overview]
Message-ID: <20201225092529.3228466-3-namit@vmware.com> (raw)
In-Reply-To: <20201225092529.3228466-1-namit@vmware.com>

From: Nadav Amit <namit@vmware.com>

Clearing soft-dirty through /proc/[pid]/clear_refs can cause memory
corruption as it clears the dirty-bit without acquiring the mmap_lock
for write and defers TLB flushes.

As a result of this behavior, it is possible that one of the CPUs would
have the stale PTE cached in its TLB and keep updating the page while
another thread triggers a page-fault, and the page-fault handler would
copy the old page into a new one.

Since the copying is performed without holding the page-table lock, it
is possible that after the copying, and before the PTE is actually
flushed, the CPU that cached the stale PTE in the TLB would keep
changing the page. These changes would be lost and memory corruption
would occur.

As Yu Zhao pointed, this race became more apparent since commit
09854ba94c6a ("mm: do_wp_page() simplification") which made
wp_page_copy() more likely to take place, specifically if
page_count(page) > 1.

The following test produces the failure quite well on 5.10 and my
machine. Note that the test is tailored for recent kernels behavior in
which wp_page_copy() is called when page_count(page) != 1, but the fact
the test does not fail on older kernels does not mean they are not
affected.

  #define _GNU_SOURCE
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <sys/mman.h>
  #include <unistd.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <assert.h>
  #include <fcntl.h>
  #include <string.h>
  #include <threads.h>
  #include <stdatomic.h>

  #define PAGE_SIZE	(4096)
  #define TLB_SIZE	(2000)
  #define N_PAGES	(300000)
  #define ITERATIONS	(2000)
  #define N_THREADS	(2)

  static int stop;
  static char *m;

  static int writer(void *argp)
  {
  	unsigned long t_idx = (unsigned long)argp;
  	int i, cnt = 0;

  	while (!atomic_load(&stop)) {
  		cnt++;
  		atomic_fetch_add((atomic_int *)m, 1);

  		/*
  		 * First thread only accesses the page to have it cached in the
  		 * TLB.
  		 */
  		if (t_idx == 0)
  			continue;

  		/*
  		 * Other threads access enough entries to cause eviction from
  		 * the TLB and trigger #PF upon the next access (before the TLB
  		 * flush of clear_ref actually takes place).
  		 */
  		for (i = 1; i < TLB_SIZE; i++) {
  			if (atomic_load((atomic_int *)(m + PAGE_SIZE * i))) {
  				fprintf(stderr, "unexpected error\n");
  				exit(1);
  			}
  		}
  	}
  	return cnt;
  }

  /*
   * Runs mlock/munlock in the background to raise the page-count of the
   * page and force copying instead of reusing the page. Raising the
   * page-count is possible in better ways, e.g., registering io_uring
   * buffers.
   */
  static int do_mlock(void *argp)
  {
  	while (!atomic_load(&stop)) {
  		if (mlock(m, PAGE_SIZE) || munlock(m, PAGE_SIZE)) {
  			perror("mlock/munlock");
  			exit(1);
  		}
  	}
  	return 0;
  }

  int main(void)
  {
  	int r, cnt, fd, total = 0;
  	long i;
  	thrd_t thr[N_THREADS];
  	thrd_t mlock_thr;

  	fd = open("/proc/self/clear_refs", O_WRONLY, 0666);
  	if (fd < 0) {
  		perror("open");
  		exit(1);
  	}

  	/*
  	 * Have large memory for clear_ref, so there would be some time between
  	 * the unmap and the actual deferred flush.
  	 */
  	m = mmap(NULL, PAGE_SIZE * N_PAGES, PROT_READ|PROT_WRITE,
  			MAP_PRIVATE|MAP_ANONYMOUS|MAP_POPULATE, -1, 0);
  	if (m == MAP_FAILED) {
  		perror("mmap");
  		exit(1);
  	}

  	for (i = 0; i < N_THREADS; i++) {
  		r = thrd_create(&thr[i], writer, (void *)i);
  		assert(r == thrd_success);
  	}

  	r = thrd_create(&mlock_thr, do_mlock, (void *)i);
  	assert(r == thrd_success);

  	for (i = 0; i < ITERATIONS; i++) {
  		r = pwrite(fd, "4", 1, 0);
  		if (r < 0) {
  			perror("pwrite");
  			exit(1);
  		}
  	}

  	atomic_store(&stop, 1);

  	r = thrd_join(mlock_thr, NULL);
  	assert(r == thrd_success);

  	for (i = 0; i < N_THREADS; i++) {
  		r = thrd_join(thr[i], &cnt);
  		assert(r == thrd_success);
  		total += cnt;
  	}

  	r = atomic_load((atomic_int *)(m));
  	if (r != total) {
  		fprintf(stderr, "failed: expected=%d actual=%d\n", total, r);
  		exit(-1);
  	}

  	fprintf(stderr, "ok\n");
  	return 0;
  }

Fix it by taking mmap_lock for write when clearing soft-dirty.

Note that the test keeps failing without the pending fix of the missing
TLB flushes in clear_refs_write() [1].

[1] https://lore.kernel.org/patchwork/patch/1351776/

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Fixes: 0f8975ec4db2 ("mm: soft-dirty bits for user memory changes tracking")
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 fs/proc/task_mmu.c | 27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 217aa2705d5d..39b2bd27af79 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1189,6 +1189,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 	struct mm_struct *mm;
 	struct vm_area_struct *vma;
 	enum clear_refs_types type;
+	bool write_lock = false;
 	struct mmu_gather tlb;
 	int itype;
 	int rv;
@@ -1236,21 +1237,16 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 		}
 		tlb_gather_mmu(&tlb, mm, 0, -1);
 		if (type == CLEAR_REFS_SOFT_DIRTY) {
+			mmap_read_unlock(mm);
+			if (mmap_write_lock_killable(mm)) {
+				count = -EINTR;
+				goto out_mm;
+			}
 			for (vma = mm->mmap; vma; vma = vma->vm_next) {
-				if (!(vma->vm_flags & VM_SOFTDIRTY))
-					continue;
-				mmap_read_unlock(mm);
-				if (mmap_write_lock_killable(mm)) {
-					count = -EINTR;
-					goto out_mm;
-				}
-				for (vma = mm->mmap; vma; vma = vma->vm_next) {
-					vma->vm_flags &= ~VM_SOFTDIRTY;
-					vma_set_page_prot(vma);
-				}
-				mmap_write_downgrade(mm);
-				break;
+				vma->vm_flags &= ~VM_SOFTDIRTY;
+				vma_set_page_prot(vma);
 			}
+			write_lock = true;
 
 			mmu_notifier_range_init(&range, MMU_NOTIFY_SOFT_DIRTY,
 						0, NULL, mm, 0, -1UL);
@@ -1261,7 +1257,10 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 		if (type == CLEAR_REFS_SOFT_DIRTY)
 			mmu_notifier_invalidate_range_end(&range);
 		tlb_finish_mmu(&tlb, 0, -1);
-		mmap_read_unlock(mm);
+		if (write_lock)
+			mmap_write_unlock(mm);
+		else
+			mmap_read_unlock(mm);
 out_mm:
 		mmput(mm);
 	}
-- 
2.25.1


  parent reply	other threads:[~2020-12-25  9:30 UTC|newest]

Thread overview: 119+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-25  9:25 [RFC PATCH v2 0/2] mm: fix races due to deferred TLB flushes Nadav Amit
2020-12-25  9:25 ` [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect Nadav Amit
2021-01-04 12:22   ` Peter Zijlstra
2021-01-04 19:24     ` Andrea Arcangeli
2021-01-04 19:35       ` Nadav Amit
2021-01-04 20:19         ` Andrea Arcangeli
2021-01-04 20:39           ` Nadav Amit
2021-01-04 21:01             ` Andrea Arcangeli
2021-01-04 21:26               ` Nadav Amit
2021-01-05 18:45                 ` Andrea Arcangeli
2021-01-05 19:05                   ` Nadav Amit
2021-01-05 19:45                     ` Andrea Arcangeli
2021-01-05 20:06                       ` Nadav Amit
2021-01-05 21:06                         ` Andrea Arcangeli
2021-01-05 21:43                           ` Peter Xu
2021-01-05  8:13       ` Peter Zijlstra
2021-01-05  8:52         ` Nadav Amit
2021-01-05 14:26           ` Peter Zijlstra
2021-01-05  8:58       ` Peter Zijlstra
2021-01-05  9:22         ` Nadav Amit
2021-01-05 17:58         ` Andrea Arcangeli
2021-01-05 15:08   ` Peter Xu
2021-01-05 18:08     ` Andrea Arcangeli
2021-01-05 18:41       ` Peter Xu
2021-01-05 18:55         ` Andrea Arcangeli
2021-01-05 19:07     ` Nadav Amit
2021-01-05 19:43       ` Peter Xu
2020-12-25  9:25 ` Nadav Amit [this message]
2021-01-05 15:08   ` [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup Will Deacon
2021-01-05 18:20   ` Andrea Arcangeli
2021-01-05 19:26     ` Nadav Amit
2021-01-05 20:39       ` Andrea Arcangeli
2021-01-05 21:20         ` Yu Zhao
2021-01-05 21:22         ` Nadav Amit
2021-01-05 22:16           ` Will Deacon
2021-01-06  0:29             ` Andrea Arcangeli
2021-01-06  0:02           ` Andrea Arcangeli
2021-01-07 20:04           ` [PATCH 0/2] page_count can't be used to decide when wp_page_copy Andrea Arcangeli
2021-01-07 20:04             ` [PATCH 1/2] mm: proc: Invalidate TLB after clearing soft-dirty page state Andrea Arcangeli
2021-01-07 20:04             ` [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending Andrea Arcangeli
2021-01-07 20:17               ` Linus Torvalds
2021-01-07 20:17                 ` Linus Torvalds
2021-01-07 20:25                 ` Linus Torvalds
2021-01-07 20:25                   ` Linus Torvalds
2021-01-07 20:58                 ` Andrea Arcangeli
2021-01-07 21:29                   ` Linus Torvalds
2021-01-07 21:29                     ` Linus Torvalds
2021-01-07 21:53                     ` John Hubbard
2021-01-07 22:00                       ` Linus Torvalds
2021-01-07 22:00                         ` Linus Torvalds
2021-01-07 22:14                         ` John Hubbard
2021-01-07 22:20                           ` Linus Torvalds
2021-01-07 22:20                             ` Linus Torvalds
2021-01-07 22:24                             ` Linus Torvalds
2021-01-07 22:24                               ` Linus Torvalds
2021-01-07 22:37                               ` John Hubbard
2021-01-15 11:27                       ` Jan Kara
2021-01-07 22:31                     ` Andrea Arcangeli
2021-01-07 22:42                       ` Linus Torvalds
2021-01-07 22:42                         ` Linus Torvalds
2021-01-07 22:51                         ` Linus Torvalds
2021-01-07 22:51                           ` Linus Torvalds
2021-01-07 23:48                           ` Andrea Arcangeli
2021-01-08  0:25                             ` Linus Torvalds
2021-01-08  0:25                               ` Linus Torvalds
2021-01-08 12:48                               ` Will Deacon
2021-01-08 16:14                                 ` Andrea Arcangeli
2021-01-08 17:39                                   ` Linus Torvalds
2021-01-08 17:39                                     ` Linus Torvalds
2021-01-08 17:53                                     ` Andrea Arcangeli
2021-01-08 19:25                                       ` Linus Torvalds
2021-01-08 19:25                                         ` Linus Torvalds
2021-01-09  0:12                                         ` Andrea Arcangeli
2021-01-08 17:30                                 ` Linus Torvalds
2021-01-08 17:30                                   ` Linus Torvalds
2021-01-07 23:28                         ` Andrea Arcangeli
2021-01-07 21:36               ` kernel test robot
2021-01-07 21:36                 ` kernel test robot
2021-01-07 20:25             ` [PATCH 0/2] page_count can't be used to decide when wp_page_copy Jason Gunthorpe
2021-01-07 20:32               ` Linus Torvalds
2021-01-07 20:32                 ` Linus Torvalds
2021-01-07 21:05                 ` Linus Torvalds
2021-01-07 21:05                   ` Linus Torvalds
2021-01-07 22:02                   ` Andrea Arcangeli
2021-01-07 22:17                     ` Linus Torvalds
2021-01-07 22:17                       ` Linus Torvalds
2021-01-07 22:56                       ` Andrea Arcangeli
2021-01-09 19:32                   ` Matthew Wilcox
2021-01-09 19:46                     ` Linus Torvalds
2021-01-09 19:46                       ` Linus Torvalds
2021-01-15 14:30                       ` Jan Kara
2021-01-07 21:54                 ` Andrea Arcangeli
2021-01-07 21:45               ` Andrea Arcangeli
2021-01-08 13:36                 ` Jason Gunthorpe
2021-01-08 17:00                   ` Andrea Arcangeli
2021-01-08 18:19                     ` Jason Gunthorpe
2021-01-08 18:31                       ` Andy Lutomirski
2021-01-08 18:31                         ` Andy Lutomirski
2021-01-08 18:38                         ` Linus Torvalds
2021-01-08 18:38                           ` Linus Torvalds
2021-01-08 23:34                         ` Andrea Arcangeli
2021-01-09 19:03                           ` Andy Lutomirski
2021-01-09 19:03                             ` Andy Lutomirski
2021-01-09 19:15                             ` Linus Torvalds
2021-01-09 19:15                               ` Linus Torvalds
2021-01-08 18:59                       ` Linus Torvalds
2021-01-08 18:59                         ` Linus Torvalds
2021-01-08 22:43                       ` Andrea Arcangeli
2021-01-09  0:42                         ` Jason Gunthorpe
2021-01-09  2:50                           ` Andrea Arcangeli
2021-01-11 14:30                             ` Jason Gunthorpe
2021-01-13 21:56                           ` Jerome Glisse
2021-01-13 23:39                             ` Jason Gunthorpe
2021-01-14  2:35                               ` Jerome Glisse
2021-01-09  3:49                       ` Hillf Danton
2021-01-11 14:39                         ` Jason Gunthorpe
2021-01-05 21:55         ` [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup Peter Xu
2021-03-02 22:13 ` [RFC PATCH v2 0/2] mm: fix races due to deferred TLB flushes Peter Xu
2021-03-02 22:14   ` Nadav Amit

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201225092529.3228466-3-namit@vmware.com \
    --to=nadav.amit@gmail.com \
    --cc=aarcange@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@kernel.org \
    --cc=mike.kravetz@oracle.com \
    --cc=minchan@kernel.org \
    --cc=namit@vmware.com \
    --cc=peterx@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rppt@linux.vnet.ibm.com \
    --cc=will@kernel.org \
    --cc=xemul@openvz.org \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.