* [PATCH 1/2] mm: avoid spurious 'bad pmd' warning messages
From: Ross Zwisler @ 2017-05-17 17:16 UTC
  To: Andrew Morton, linux-kernel
  Cc: Ross Zwisler, Darrick J. Wong, Alexander Viro, Christoph Hellwig,
	Dan Williams, Dave Hansen, Jan Kara, Matthew Wilcox,
	linux-fsdevel, linux-mm, linux-nvdimm, Kirill A . Shutemov,
	Pawel Lebioda, Dave Jiang, Xiong Zhou, Eryu Guan, stable

When the pmd_devmap() checks were added by:

commit 5c7fb56e5e3f ("mm, dax: dax-pmd vs thp-pmd vs hugetlbfs-pmd")

to add better support for DAX huge pages, they were all added to the end of
if() statements after existing pmd_trans_huge() checks.  So, things like:

-       if (pmd_trans_huge(*pmd))
+       if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))

When further checks were added after pmd_trans_unstable() checks by:

commit 7267ec008b5c ("mm: postpone page table allocation until we have page
to map")

they were also added at the end of the conditional:

+       if (pmd_trans_unstable(fe->pmd) || pmd_devmap(*fe->pmd))

This ordering is fine for pmd_trans_huge(), but doesn't work for
pmd_trans_unstable().  This is because DAX huge pages trip the pmd_bad()
check inside of pmd_none_or_trans_huge_or_clear_bad() (called by
pmd_trans_unstable()), which prints out a warning and returns 1.  So, we do
end up doing the right thing, but only after spamming dmesg with
suspicious-looking messages:

mm/pgtable-generic.c:39: bad pmd ffff8808daa49b88(84000001006000a5)

Reorder these checks so that pmd_devmap() is checked first, avoiding the
error messages.
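
For reference, the check that fires lives in
pmd_none_or_trans_huge_or_clear_bad().  A simplified sketch of the generic
helper (barriers and comments omitted):

	static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
	{
		pmd_t pmdval = pmd_read_atomic(pmd);

		if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
			return 1;
		if (unlikely(pmd_bad(pmdval))) {
			pmd_clear_bad(pmd);	/* prints the "bad pmd" line */
			return 1;
		}
		return 0;
	}

On x86, pmd_trans_huge() deliberately excludes pmd_devmap() entries, so a
huge DAX PMD falls through to the pmd_bad() check and gets reported.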

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Fixes: 7267ec008b5c ("mm: postpone page table allocation until we have page to map")
Cc: stable@vger.kernel.org
---
 mm/memory.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 6ff5d72..1ee269d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3061,7 +3061,7 @@ static int pte_alloc_one_map(struct vm_fault *vmf)
 	 * through an atomic read in C, which is what pmd_trans_unstable()
 	 * provides.
 	 */
-	if (pmd_trans_unstable(vmf->pmd) || pmd_devmap(*vmf->pmd))
+	if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
 		return VM_FAULT_NOPAGE;
 
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
@@ -3690,7 +3690,7 @@ static int handle_pte_fault(struct vm_fault *vmf)
 		vmf->pte = NULL;
 	} else {
 		/* See comment in pte_alloc_one_map() */
-		if (pmd_trans_unstable(vmf->pmd) || pmd_devmap(*vmf->pmd))
+		if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
 			return 0;
 		/*
 		 * A regular pmd is established and it can't morph into a huge
-- 
2.9.4

* [PATCH 2/2] dax: Fix race between colliding PMD & PTE entries
From: Ross Zwisler @ 2017-05-17 17:16 UTC
  To: Andrew Morton, linux-kernel
  Cc: Ross Zwisler, Darrick J. Wong, Alexander Viro, Christoph Hellwig,
	Dan Williams, Dave Hansen, Jan Kara, Matthew Wilcox,
	linux-fsdevel, linux-mm, linux-nvdimm, Kirill A . Shutemov,
	Pawel Lebioda, Dave Jiang, Xiong Zhou, Eryu Guan, stable

We currently have two related PMD vs PTE races in the DAX code.  These can
both be easily triggered by having two threads reading and writing
simultaneously to the same private mapping, with the key being that private
mapping reads can be handled with PMDs but private mapping writes are
always handled with PTEs so that we can COW.
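
The write-side fallback comes from the huge fault handler itself; roughly,
near the top of dax_iomap_pmd_fault() we have:

	/* Fall back to PTEs if we're going to COW */
	if (write && !(vma->vm_flags & VM_SHARED))
		goto fallback;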

Here is the first race:

CPU 0					CPU 1

(private mapping write)
__handle_mm_fault()
  create_huge_pmd() - FALLBACK
  handle_pte_fault()
    passes check for pmd_devmap()

					(private mapping read)
					__handle_mm_fault()
					  create_huge_pmd()
					    dax_iomap_pmd_fault() inserts PMD

    dax_iomap_pte_fault() does a PTE fault, but we already have a DAX PMD
    			  installed in our page tables at this spot.

Here's the second race:

CPU 0					CPU 1

(private mapping write)
__handle_mm_fault()
  create_huge_pmd() - FALLBACK
					(private mapping read)
					__handle_mm_fault()
					  passes check for pmd_none()
					  create_huge_pmd()

  handle_pte_fault()
    dax_iomap_pte_fault() inserts PTE
					    dax_iomap_pmd_fault() inserts PMD,
					       but we already have a PTE at
					       this spot.

The core of the issue is that while there is isolation between faults to
the same range in the DAX fault handlers via our DAX entry locking, there
is no isolation between faults in the code in mm/memory.c.  This means for
instance that this code in __handle_mm_fault() can run:

	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
		ret = create_huge_pmd(&vmf);

But by the time we actually get to run the fault handler called by
create_huge_pmd(), the PMD is no longer pmd_none() because a racing PTE
fault has installed a normal PMD here as a parent.  This is the cause of
the 2nd race.  The first race is similar - there is the following check in
handle_pte_fault():

	} else {
		/* See comment in pte_alloc_one_map() */
		if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
			return 0;

So if a pmd_devmap() PMD (a DAX PMD) has been installed at vmf->pmd, we
will bail and retry the fault.  This is correct, but there is nothing
preventing the PMD from being installed after this check but before we
actually get to the DAX PTE fault handlers.

In my testing these races result in the following types of errors:

 BUG: Bad rss-counter state mm:ffff8800a817d280 idx:1 val:1
 BUG: non-zero nr_ptes on freeing mm: 15

Fix this issue by having the DAX fault handlers verify that it is safe to
continue their fault after they have taken an entry lock to block other
racing faults.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Pawel Lebioda <pawel.lebioda@intel.com>
Cc: stable@vger.kernel.org

---

I've written a new xfstest for this race, which I will send in response to
this patch series.  This series has also survived an xfstest run without
any new issues.

---
 fs/dax.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index c22eaf1..3cc02d1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1155,6 +1155,15 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 	}
 
 	/*
+	 * It is possible, particularly with mixed reads & writes to private
+	 * mappings, that we have raced with a PMD fault that overlaps with
+	 * the PTE we need to set up.  Now that we have a locked mapping entry
+	 * we can safely unmap the huge PMD so that we can install our PTE in
+	 * our page tables.
+	 */
+	split_huge_pmd(vmf->vma, vmf->pmd, vmf->address);
+
+	/*
 	 * Note that we don't bother to use iomap_apply here: DAX required
 	 * the file system block size to be equal the page size, which means
 	 * that we never have to deal with more than a single extent here.
@@ -1398,6 +1407,15 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
 		goto fallback;
 
 	/*
+	 * It is possible, particularly with mixed reads & writes to private
+	 * mappings, that we have raced with a PTE fault that overlaps with
+	 * the PMD we need to set up.  If so we just fall back to a PTE fault
+	 * ourselves.
+	 */
+	if (!pmd_none(*vmf->pmd))
+		goto unlock_entry;
+
+	/*
 	 * Note that we don't use iomap_apply here.  We aren't doing I/O, only
 	 * setting up a mapping, so really we're using iomap_begin() as a way
 	 * to look up our filesystem block.
-- 
2.9.4

* [PATCH] generic: add regression test for DAX PTE/PMD races
From: Ross Zwisler @ 2017-05-17 17:17 UTC
  To: fstests, Xiong Zhou, Eryu Guan
  Cc: Ross Zwisler, Darrick J. Wong, Alexander Viro, Christoph Hellwig,
	Dan Williams, Dave Hansen, Jan Kara, Matthew Wilcox,
	linux-fsdevel, linux-mm, linux-nvdimm, Andrew Morton,
	linux-kernel, Kirill A . Shutemov, Pawel Lebioda, Dave Jiang

This adds a regression test for the following kernel patches:

  mm: avoid spurious 'bad pmd' warning messages
  dax: Fix race between colliding PMD & PTE entries

The above patches fix two related PMD vs PTE races in the DAX code.  These
can both be easily triggered by having two threads reading and writing
simultaneously to the same private mapping, with the key being that private
mapping reads can be handled with PMDs but private mapping writes are
always handled with PTEs so that we can COW.

Without this 2-patch kernel series, the newly added test will result in the
following errors:

  run fstests generic/435 at 2017-05-16 16:53:43
  mm/pgtable-generic.c:39: bad pmd ffff8808daa49b88(84000001006000a5)
  	... a bunch of the bad pmd messages ...
  BUG: Bad rss-counter state mm:ffff8800a8c1b700 idx:1 val:1
  BUG: non-zero nr_ptes on freeing mm: 38
  XFS (pmem0p1): Unmounting Filesystem

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 .gitignore            |   1 +
 src/Makefile          |   3 +-
 src/t_mmap_cow_race.c | 106 ++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/generic/435     |  61 +++++++++++++++++++++++++++++
 tests/generic/435.out |   2 +
 tests/generic/group   |   1 +
 6 files changed, 173 insertions(+), 1 deletion(-)
 create mode 100644 src/t_mmap_cow_race.c
 create mode 100755 tests/generic/435
 create mode 100644 tests/generic/435.out

diff --git a/.gitignore b/.gitignore
index 38f3a00..a6ac25e 100644
--- a/.gitignore
+++ b/.gitignore
@@ -149,6 +149,7 @@
 /src/t_rename_overwrite
 /src/t_mmap_dio
 /src/t_mmap_stale_pmd
+/src/t_mmap_cow_race
 
 # dmapi/ binaries
 /dmapi/src/common/cmd/read_invis
diff --git a/src/Makefile b/src/Makefile
index e5042c9..b505b42 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -12,7 +12,8 @@ TARGETS = dirstress fill fill2 getpagesize holes lstat64 \
 	godown resvtest writemod makeextents itrash rename \
 	multi_open_unlink dmiperf unwritten_sync genhashnames t_holes \
 	t_mmap_writev t_truncate_cmtime dirhash_collide t_rename_overwrite \
-	holetest t_truncate_self t_mmap_dio af_unix t_mmap_stale_pmd
+	holetest t_truncate_self t_mmap_dio af_unix t_mmap_stale_pmd \
+	t_mmap_cow_race
 
 LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
 	preallo_rw_pattern_writer ftrunc trunc fs_perms testx looptest \
diff --git a/src/t_mmap_cow_race.c b/src/t_mmap_cow_race.c
new file mode 100644
index 0000000..207ba42
--- /dev/null
+++ b/src/t_mmap_cow_race.c
@@ -0,0 +1,106 @@
+#include <errno.h>
+#include <fcntl.h>
+#include <libgen.h>
+#include <pthread.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#define MiB(a) ((a)*1024*1024)
+#define NUM_THREADS 2
+
+void err_exit(char *op)
+{
+	fprintf(stderr, "%s: %s\n", op, strerror(errno));
+	exit(1);
+}
+
+void *worker_fn(void *ptr)
+{
+	char *data = (char *)ptr;
+	volatile int a;
+	int i, err;
+
+	for (i = 0; i < 10; i++) {
+		a = data[0];	/* read fault - may install a huge DAX PMD */
+		data[0] = a;	/* write fault - COW forces a PTE */
+
+		err = madvise(data, MiB(2), MADV_DONTNEED);
+		if (err < 0)
+			err_exit("madvise");
+
+		/* Mix up the thread timings to encourage the race. */
+		err = usleep(rand() % 100);
+		if (err < 0)
+			err_exit("usleep");
+	}
+	return NULL;
+}
+
+int main(int argc, char *argv[])
+{
+	pthread_t thread[NUM_THREADS];
+	int i, j, fd, err;
+	char *data;
+
+	if (argc < 2) {
+		printf("Usage: %s <file>\n", basename(argv[0]));
+		exit(0);
+	}
+
+	fd = open(argv[1], O_RDWR|O_CREAT, S_IRUSR|S_IWUSR);
+	if (fd < 0)
+		err_exit("fd");
+
+	/* This allows us to map a huge page. */
+	ftruncate(fd, 0);
+	ftruncate(fd, MiB(2));
+
+	/*
+	 * First we set up a shared mapping.  Our write will (hopefully) get
+	 * the filesystem to give us a 2MiB huge page DAX mapping.  We will
+	 * then use this 2MiB page for our private mapping race.
+	 */
+	data = mmap(NULL, MiB(2), PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
+	if (data == MAP_FAILED)
+		err_exit("shared mmap");
+
+	data[0] = 1;
+
+	err = munmap(data, MiB(2));
+	if (err < 0)
+		err_exit("shared munmap");
+
+	for (i = 0; i < 500; i++) {
+		data = mmap(NULL, MiB(2), PROT_READ|PROT_WRITE, MAP_PRIVATE,
+				fd, 0);
+		if (data == MAP_FAILED)
+			err_exit("private mmap");
+
+		for (j = 0; j < NUM_THREADS; j++) {
+			err = pthread_create(&thread[j], NULL, worker_fn, data);
+			if (err)
+				err_exit("pthread_create");
+		}
+
+		for (j = 0; j < NUM_THREADS; j++) {
+			err = pthread_join(thread[j], NULL);
+			if (err)
+				err_exit("pthread_join");
+		}
+
+		err = munmap(data, MiB(2));
+		if (err < 0)
+			err_exit("private munmap");
+	}
+
+	err = close(fd);
+	if (err < 0)
+		err_exit("close");
+
+	return 0;
+}
diff --git a/tests/generic/435 b/tests/generic/435
new file mode 100755
index 0000000..f1413f1
--- /dev/null
+++ b/tests/generic/435
@@ -0,0 +1,61 @@
+#! /bin/bash
+# FS QA Test 435
+#
+# This is a regression test for kernel patches:
+#   mm: avoid spurious 'bad pmd' warning messages
+#   dax: Fix race between colliding PMD & PTE entries
+# created by Ross Zwisler <ross.zwisler@linux.intel.com>
+#
+#-----------------------------------------------------------------------
+# Copyright (c) 2017 Intel Corporation.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# Modify as appropriate.
+_supported_fs generic
+_supported_os Linux
+_require_test
+_require_test_program "t_mmap_cow_race"
+
+# real QA test starts here
+src/t_mmap_cow_race $TEST_DIR/testfile
+
+# success, all done
+echo "Silence is golden"
+status=0
+exit
diff --git a/tests/generic/435.out b/tests/generic/435.out
new file mode 100644
index 0000000..6a175d3
--- /dev/null
+++ b/tests/generic/435.out
@@ -0,0 +1,2 @@
+QA output created by 435
+Silence is golden
diff --git a/tests/generic/group b/tests/generic/group
index c4911b8..ac43f42 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -437,3 +437,4 @@
 432 auto quick copy
 433 auto quick copy
 434 auto quick copy
+435 auto quick
-- 
2.9.4

* Re: [PATCH 1/2] mm: avoid spurious 'bad pmd' warning messages
From: Dave Hansen @ 2017-05-17 17:33 UTC
  To: Ross Zwisler, Andrew Morton, linux-kernel
  Cc: Darrick J. Wong, Alexander Viro, Christoph Hellwig, Dan Williams,
	Jan Kara, Matthew Wilcox, linux-fsdevel, linux-mm, linux-nvdimm,
	Kirill A . Shutemov, Pawel Lebioda, Dave Jiang, Xiong Zhou,
	Eryu Guan, stable

On 05/17/2017 10:16 AM, Ross Zwisler wrote:
> @@ -3061,7 +3061,7 @@ static int pte_alloc_one_map(struct vm_fault *vmf)
>  	 * through an atomic read in C, which is what pmd_trans_unstable()
>  	 * provides.
>  	 */
> -	if (pmd_trans_unstable(vmf->pmd) || pmd_devmap(*vmf->pmd))
> +	if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
>  		return VM_FAULT_NOPAGE;

I'm worried we are very unlikely to get this right in the future.  It's
totally not obvious what the ordering requirement is here.

Could we move pmd_devmap() and pmd_trans_unstable() into a helper that
gets the ordering right and also spells out the ordering requirement?
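
Something like this, perhaps (an untested sketch, and the helper name is
invented):

	/*
	 * pmd_devmap() has to be checked first, because calling
	 * pmd_trans_unstable() on a huge DAX PMD trips the pmd_bad()
	 * warning in pmd_none_or_trans_huge_or_clear_bad().
	 */
	static inline bool pmd_devmap_trans_unstable(pmd_t *pmd)
	{
		return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
	}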

* Re: [PATCH 1/2] mm: avoid spurious 'bad pmd' warning messages
From: Ross Zwisler @ 2017-05-17 18:23 UTC
  To: Dave Hansen
  Cc: Ross Zwisler, Andrew Morton, linux-kernel, Darrick J. Wong,
	Alexander Viro, Christoph Hellwig, Dan Williams, Jan Kara,
	Matthew Wilcox, linux-fsdevel, linux-mm, linux-nvdimm,
	Kirill A . Shutemov, Pawel Lebioda, Dave Jiang, Xiong Zhou,
	Eryu Guan, stable

On Wed, May 17, 2017 at 10:33:58AM -0700, Dave Hansen wrote:
> On 05/17/2017 10:16 AM, Ross Zwisler wrote:
> > @@ -3061,7 +3061,7 @@ static int pte_alloc_one_map(struct vm_fault *vmf)
> >  	 * through an atomic read in C, which is what pmd_trans_unstable()
> >  	 * provides.
> >  	 */
> > -	if (pmd_trans_unstable(vmf->pmd) || pmd_devmap(*vmf->pmd))
> > +	if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
> >  		return VM_FAULT_NOPAGE;
> 
> I'm worried we are very unlikely to get this right in the future.  It's
> totally not obvious what the ordering requirement is here.
> 
> Could we move pmd_devmap() and pmd_trans_unstable() into a helper that
> gets the ordering right and also spells out the ordering requirement?

Sure, I'll fix this for v2.

Thanks for the review.

* Re: [PATCH 2/2] dax: Fix race between colliding PMD & PTE entries
From: Jan Kara @ 2017-05-18  7:50 UTC
  To: Ross Zwisler
  Cc: Andrew Morton, linux-kernel, Darrick J. Wong, Alexander Viro,
	Christoph Hellwig, Dan Williams, Dave Hansen, Jan Kara,
	Matthew Wilcox, linux-fsdevel, linux-mm, linux-nvdimm,
	Kirill A . Shutemov, Pawel Lebioda, Dave Jiang, Xiong Zhou,
	Eryu Guan, stable

On Wed 17-05-17 11:16:39, Ross Zwisler wrote:
> We currently have two related PMD vs PTE races in the DAX code.  These can
> both be easily triggered by having two threads reading and writing
> simultaneously to the same private mapping, with the key being that private
> mapping reads can be handled with PMDs but private mapping writes are
> always handled with PTEs so that we can COW.
> 
> Here is the first race:
> 
> CPU 0					CPU 1
> 
> (private mapping write)
> __handle_mm_fault()
>   create_huge_pmd() - FALLBACK
>   handle_pte_fault()
>     passes check for pmd_devmap()
> 
> 					(private mapping read)
> 					__handle_mm_fault()
> 					  create_huge_pmd()
> 					    dax_iomap_pmd_fault() inserts PMD
> 
>     dax_iomap_pte_fault() does a PTE fault, but we already have a DAX PMD
>     			  installed in our page tables at this spot.
>
> 
> Here's the second race:
> 
> CPU 0					CPU 1
> 
> (private mapping write)
> __handle_mm_fault()
>   create_huge_pmd() - FALLBACK
> 					(private mapping read)
> 					__handle_mm_fault()
> 					  passes check for pmd_none()
> 					  create_huge_pmd()
> 
>   handle_pte_fault()
>     dax_iomap_pte_fault() inserts PTE
> 					    dax_iomap_pmd_fault() inserts PMD,
> 					       but we already have a PTE at
> 					       this spot.

So I don't see how this second scenario can happen. dax_iomap_pmd_fault()
will call grab_mapping_entry(). That will either find PTE entry in the
radix tree -> EEXIST and we retry the fault. Or we will not find PTE entry
-> try to insert PMD entry which collides with the PTE entry -> EEXIST and
we retry the fault. Am I missing something?

The first scenario seems to be possible. dax_iomap_pmd_fault() will create
PMD entry in the radix tree. Then dax_iomap_pte_fault() will come, do
grab_mapping_entry(), there it sees entry is PMD but we are doing PTE fault
so I'd think that pmd_downgrade = true... But actually the condition there
doesn't trigger in this case. And that's a catch that although we asked
grab_mapping_entry() for PTE, we've got PMD back and that screws us later.

Actually I'm not convinced your patch quite fixes this because
dax_load_hole() or dax_insert_mapping_entry() will modify the passed entry
with the assumption that it's PTE entry and so they will likely corrupt the
entry in the radix tree.

So I think to fix the first case we should rather modify
grab_mapping_entry() to properly go through the pmd_downgrade path once we
find PMD entry and we do PTE fault.

What do you think?

								Honza


> 
> The core of the issue is that while there is isolation between faults to
> the same range in the DAX fault handlers via our DAX entry locking, there
> is no isolation between faults in the code in mm/memory.c.  This means for
> instance that this code in __handle_mm_fault() can run:
> 
> 	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
> 		ret = create_huge_pmd(&vmf);
> 
> But by the time we actually get to run the fault handler called by
> create_huge_pmd(), the PMD is no longer pmd_none() because a racing PTE
> fault has installed a normal PMD here as a parent.  This is the cause of
> the 2nd race.  The first race is similar - there is the following check in
> handle_pte_fault():
> 
> 	} else {
> 		/* See comment in pte_alloc_one_map() */
> 		if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
> 			return 0;
> 
> So if a pmd_devmap() PMD (a DAX PMD) has been installed at vmf->pmd, we
> will bail and retry the fault.  This is correct, but there is nothing
> preventing the PMD from being installed after this check but before we
> actually get to the DAX PTE fault handlers.
> 
> In my testing these races result in the following types of errors:
> 
>  BUG: Bad rss-counter state mm:ffff8800a817d280 idx:1 val:1
>  BUG: non-zero nr_ptes on freeing mm: 15
> 
> Fix this issue by having the DAX fault handlers verify that it is safe to
> continue their fault after they have taken an entry lock to block other
> racing faults.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Reported-by: Pawel Lebioda <pawel.lebioda@intel.com>
> Cc: stable@vger.kernel.org
> 
> ---
> 
> I've written a new xfstest for this race, which I will send in response to
> this patch series.  This series has also survived an xfstest run without
> any new issues.
> 
> ---
>  fs/dax.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index c22eaf1..3cc02d1 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1155,6 +1155,15 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
>  	}
>  
>  	/*
> +	 * It is possible, particularly with mixed reads & writes to private
> +	 * mappings, that we have raced with a PMD fault that overlaps with
> +	 * the PTE we need to set up.  Now that we have a locked mapping entry
> +	 * we can safely unmap the huge PMD so that we can install our PTE in
> +	 * our page tables.
> +	 */
> +	split_huge_pmd(vmf->vma, vmf->pmd, vmf->address);
> +
> +	/*
>  	 * Note that we don't bother to use iomap_apply here: DAX required
>  	 * the file system block size to be equal the page size, which means
>  	 * that we never have to deal with more than a single extent here.
> @@ -1398,6 +1407,15 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
>  		goto fallback;
>  
>  	/*
> +	 * It is possible, particularly with mixed reads & writes to private
> +	 * mappings, that we have raced with a PTE fault that overlaps with
> +	 * the PMD we need to set up.  If so we just fall back to a PTE fault
> +	 * ourselves.
> +	 */
> +	if (!pmd_none(*vmf->pmd))
> +		goto unlock_entry;
> +
> +	/*
>  	 * Note that we don't use iomap_apply here.  We aren't doing I/O, only
>  	 * setting up a mapping, so really we're using iomap_begin() as a way
>  	 * to look up our filesystem block.
> -- 
> 2.9.4
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 2/2] dax: Fix race between colliding PMD & PTE entries
From: Ross Zwisler @ 2017-05-18 21:29 UTC
  To: Jan Kara
  Cc: Ross Zwisler, Andrew Morton, linux-kernel, Darrick J. Wong,
	Alexander Viro, Christoph Hellwig, Dan Williams, Dave Hansen,
	Matthew Wilcox, linux-fsdevel, linux-mm, linux-nvdimm,
	Kirill A . Shutemov, Pawel Lebioda, Dave Jiang, Xiong Zhou,
	Eryu Guan, stable

On Thu, May 18, 2017 at 09:50:37AM +0200, Jan Kara wrote:
> On Wed 17-05-17 11:16:39, Ross Zwisler wrote:
> > We currently have two related PMD vs PTE races in the DAX code.  These can
> > both be easily triggered by having two threads reading and writing
> > simultaneously to the same private mapping, with the key being that private
> > mapping reads can be handled with PMDs but private mapping writes are
> > always handled with PTEs so that we can COW.
> > 
> > Here is the first race:
> > 
> > CPU 0					CPU 1
> > 
> > (private mapping write)
> > __handle_mm_fault()
> >   create_huge_pmd() - FALLBACK
> >   handle_pte_fault()
> >     passes check for pmd_devmap()
> > 
> > 					(private mapping read)
> > 					__handle_mm_fault()
> > 					  create_huge_pmd()
> > 					    dax_iomap_pmd_fault() inserts PMD
> > 
> >     dax_iomap_pte_fault() does a PTE fault, but we already have a DAX PMD
> >     			  installed in our page tables at this spot.
> >
> > 
> > Here's the second race:
> > 
> > CPU 0					CPU 1
> > 
> > (private mapping write)
> > __handle_mm_fault()
> >   create_huge_pmd() - FALLBACK
> > 					(private mapping read)
> > 					__handle_mm_fault()
> > 					  passes check for pmd_none()
> > 					  create_huge_pmd()
> > 
> >   handle_pte_fault()
> >     dax_iomap_pte_fault() inserts PTE
> > 					    dax_iomap_pmd_fault() inserts PMD,
> > 					       but we already have a PTE at
> > 					       this spot.
> 
> So I don't see how this second scenario can happen. dax_iomap_pmd_fault()
> will call grab_mapping_entry(). That will either find PTE entry in the
> radix tree -> EEXIST and we retry the fault. Or we will not find PTE entry
> -> try to insert PMD entry which collides with the PTE entry -> EEXIST and
> we retry the fault. Am I missing something?

Yep, sorry, I guess I needed a few extra steps in my flow (the initial private
mapping read by CPU 0):


CPU 0					CPU 1

(private mapping read)
__handle_mm_fault()
  passes check for pmd_none()
  create_huge_pmd()
    dax_iomap_pmd_fault() inserts PMD

(private mapping write)
__handle_mm_fault()
  create_huge_pmd() - FALLBACK
					(private mapping read)
					__handle_mm_fault()
					  passes check for pmd_none()
					  create_huge_pmd()

  handle_pte_fault()
    dax_iomap_pte_fault() inserts PTE
					    dax_iomap_pmd_fault() inserts PMD,
					       but we already have a PTE at
					       this spot.

So what happens is that CPU 0 inserts a DAX PMD into the radix tree that has
real storage backing, and all PTE and PMD faults just use that same PMD radix
tree entry for locking and dirty tracking.

> The first scenario seems to be possible. dax_iomap_pmd_fault() will create
> PMD entry in the radix tree. Then dax_iomap_pte_fault() will come, do
> grab_mapping_entry(), there it sees entry is PMD but we are doing PTE fault
> so I'd think that pmd_downgrade = true... But actually the condition there
> doesn't trigger in this case. And that's a catch that although we asked
> grab_mapping_entry() for PTE, we've got PMD back and that screws us later.

Yep, it was a conscious decision when implementing the PMD support to allow one
thread to use PMDs and another to use PTEs in the same range, as long as the
thread faulting in PMDs is the first to insert into the radix tree.  A PMD
radix tree entry will be inserted and used for locking and dirty tracking, and
each thread or process can fault in either PTEs or PMDs into its own address
space as needed.

We can revisit this, if you think it is incorrect.  The option you outline
below would basically mean that if any thread were to fault in a PTE in a
range, all threads and processes would be forced to use PTEs because we would
use PTEs in the radix tree.

This is cleaner...I'm not sure if the use case of having two threads accessing
the same area, one with PTEs and one with PMDs, is actually prevalent.  It's
also maybe a bit weird that the current behavior varies based on which thread
faulted first - if the PTE thread faults first, it'll insert a PTE into the
radix tree and everyone will just use PTEs.

> Actually I'm not convinced your patch quite fixes this because
> dax_load_hole() or dax_insert_mapping_entry() will modify the passed entry
> with the assumption that it's PTE entry and so they will likely corrupt the
> entry in the radix tree.

I don't think we can ever call dax_load_hole() if we have a DAX PMD entry in
the radix tree, because we have a block mapping from the filesystem.

For dax_insert_mapping_entry(), we do the right thing.  From the comments
above the function:

 * If we happen to be trying to insert a PTE and there is a PMD
 * already in the tree, we will skip the insertion and just dirty the PMD as
 * appropriate.

> So I think to fix the first case we should rather modify
> grab_mapping_entry() to properly go through the pmd_downgrade path once we
> find PMD entry and we do PTE fault.
> 
> What do you think?

That could also work, though I do think the fix as submitted is correct.
I think it comes down to whether we want to keep the behavior where a thread
faulting in a PTEs will use an existing PMD entry in the radix tree, instead
of making all other threads fall back to PTEs.

I think either way solves this issue for the DAX case...but do you understand
how this is solved for other fault handlers?  They don't have any isolation
between faults either in the mm/memory.c code, and are susceptible to the same
races.  How do they deal with the fact that by the time they get to their PTE
fault handler, a racing PMD fault handler in another thread could have
inserted a PMD into their page tables, and vice versa?

* Re: [PATCH 2/2] dax: Fix race between colliding PMD & PTE entries
From: Jan Kara @ 2017-05-22 14:37 UTC
  To: Ross Zwisler
  Cc: Jan Kara, Andrew Morton, linux-kernel, Darrick J. Wong,
	Alexander Viro, Christoph Hellwig, Dan Williams, Dave Hansen,
	Matthew Wilcox, linux-fsdevel, linux-mm, linux-nvdimm,
	Kirill A . Shutemov, Pawel Lebioda, Dave Jiang, Xiong Zhou,
	Eryu Guan, stable

On Thu 18-05-17 15:29:39, Ross Zwisler wrote:
> On Thu, May 18, 2017 at 09:50:37AM +0200, Jan Kara wrote:
> > On Wed 17-05-17 11:16:39, Ross Zwisler wrote:
> > > We currently have two related PMD vs PTE races in the DAX code.  These can
> > > both be easily triggered by having two threads reading and writing
> > > simultaneously to the same private mapping, with the key being that private
> > > mapping reads can be handled with PMDs but private mapping writes are
> > > always handled with PTEs so that we can COW.
> > > 
> > > Here is the first race:
> > > 
> > > CPU 0					CPU 1
> > > 
> > > (private mapping write)
> > > __handle_mm_fault()
> > >   create_huge_pmd() - FALLBACK
> > >   handle_pte_fault()
> > >     passes check for pmd_devmap()
> > > 
> > > 					(private mapping read)
> > > 					__handle_mm_fault()
> > > 					  create_huge_pmd()
> > > 					    dax_iomap_pmd_fault() inserts PMD
> > > 
> > >     dax_iomap_pte_fault() does a PTE fault, but we already have a DAX PMD
> > >     			  installed in our page tables at this spot.
> > >
> > > 
> > > Here's the second race:
> > > 
> > > CPU 0					CPU 1
> > > 
> > > (private mapping write)
> > > __handle_mm_fault()
> > >   create_huge_pmd() - FALLBACK
> > > 					(private mapping read)
> > > 					__handle_mm_fault()
> > > 					  passes check for pmd_none()
> > > 					  create_huge_pmd()
> > > 
> > >   handle_pte_fault()
> > >     dax_iomap_pte_fault() inserts PTE
> > > 					    dax_iomap_pmd_fault() inserts PMD,
> > > 					       but we already have a PTE at
> > > 					       this spot.
> > 
> > So I don't see how this second scenario can happen. dax_iomap_pmd_fault()
> > will call grab_mapping_entry(). That will either find PTE entry in the
> > radix tree -> EEXIST and we retry the fault. Or we will not find PTE entry
> > -> try to insert PMD entry which collides with the PTE entry -> EEXIST and
> > we retry the fault. Am I missing something?
> 
> Yep, sorry, I guess I needed a few extra steps in my flow (the initial private
> mapping read by CPU 0):
> 
> 
> CPU 0					CPU 1
> 
> (private mapping read)
> __handle_mm_fault()
>   passes check for pmd_none()
>   create_huge_pmd()
>     dax_iomap_pmd_fault() inserts PMD
> 
> (private mapping write)
> __handle_mm_fault()
>   create_huge_pmd() - FALLBACK
> 					(private mapping read)
> 					__handle_mm_fault()
> 					  passes check for pmd_none()
> 					  create_huge_pmd()
> 
>   handle_pte_fault()
>     dax_iomap_pte_fault() inserts PTE
> 					    dax_iomap_pmd_fault() inserts PMD,
> 					       but we already have a PTE at
> 					       this spot.
> 
> So what happens is that CPU 0 inserts a DAX PMD into the radix tree that has
> real storage backing, and all PTE and PMD faults just use that same PMD radix
> tree entry for locking and dirty tracking.

OK, I see now. So essentially it's the same catch as the other case -
grab_mapping_entry() returns PMD entry on CPU0 although we asked for PTE
entry.

> > The first scenario seems to be possible. dax_iomap_pmd_fault() will create
> > PMD entry in the radix tree. Then dax_iomap_pte_fault() will come, do
> > grab_mapping_entry(), there it sees entry is PMD but we are doing PTE fault
> > so I'd think that pmd_downgrade = true... But actually the condition there
> > doesn't trigger in this case. And that's a catch that although we asked
> > grab_mapping_entry() for PTE, we've got PMD back and that screws us later.
> 
> Yep, it was a concious decision when implementing the PMD support to allow one
> thread to use PMDs and another to use PTEs in the same range, as long as the
> thread faulting in PMDs is the first to insert into the radix tree.  A PMD
> radix tree entry will be inserted and used for locking and dirty tracking, and
> each thread or process can fault in either PTEs or PMDs into its own address
> space as needed.

Well, for *threads* it doesn't really make good sense to mix PMDs and PTEs
as they share page tables. However for *processes* it makes some sense to
allow one process to use PTEs and another process to use PMDs. And I
remember we were discussing this in the past.

> We can revisit this, if you think it is incorrect.  The option you outline
> below would basically mean that if any thread were to fault in a PTE in a
> range, all threads and processes would be forced to use PTEs because we would
> use PTEs in the radix tree.

Well, I don't think it is necessarily incorrect. I just think it is more
difficult to get it right (as current bugs show) so I'm just considering
whether the complexity is worth it.

> This is cleaner...I'm not sure if the use case of having two threads accessing
> the same area, one with PTEs and one with PMDs, is actually prevalent.  It's
> also maybe a bit weird that the current behavior varies based on which thread
> faulted first - if the PTE thread faults first, it'll insert a PTE into the
> radix tree and everyone will just use PTEs.

So for two *threads*, I don't think that is a sensible use-case. We just
have to get it right. For two *processes* it makes sense - your DB might
want to use PMDs while your backup program may just use PTEs. So thinking
more about it I guess it is worth the effort to make the mixed case work
efficiently.

> > Actually I'm not convinced your patch quite fixes this because
> > dax_load_hole() or dax_insert_mapping_entry() will modify the passed entry
> > with the assumption that it's PTE entry and so they will likely corrupt the
> > entry in the radix tree.
> 
> I don't think we can ever call dax_load_hole() if we have a DAX PMD entry in
> the radix tree, because we have a block mapping from the filesystem.
> 
> For dax_insert_mapping_entry(), we do the right thing.  From the comments
> above the function:
> 
>  * If we happen to be trying to insert a PTE and there is a PMD
>  * already in the tree, we will skip the insertion and just dirty the PMD as
>  * appropriate.

Yeah, on the first reading I missed that we won't modify the radix tree in
that particular case. Frankly, I think we should somewhat clean up that
code to make things more obvious but let's leave that for a bit later. For
now the code looks correct.

> > So I think to fix the first case we should rather modify
> > grab_mapping_entry() to properly go through the pmd_downgrade path once we
> > find PMD entry and we do PTE fault.
> > 
> > What do you think?
> 
> That could also work, though I do think the fix as submitted is correct.
> I think it comes down to whether we want to keep the behavior where a thread
> faulting in a PTEs will use an existing PMD entry in the radix tree, instead
> of making all other threads fall back to PTEs.
> 
> I think either way solves this issue for the DAX case...but do you understand
> how this is solved for other fault handlers?  They don't have any isolation
> between faults either in the mm/memory.c code, and are susceptible to the same
> races.  How do they deal with the fact that by the time they get to their PTE
> fault handler, a racing PMD fault handler in another thread could have
> inserted a PMD into their page tables, and vice versa?

So the normal fault path uses alloc_set_pte() for installing a new PTE.  And
that uses pte_alloc_one_map(), which checks whether the PMD is still suitable
for inserting a PTE.  If not, we return VM_FAULT_NOPAGE.  Probably it would
be cleanest to factor out the common parts of PTE and PMD insertion so that
we can use these functions both from DAX and the generic fault paths.
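
Condensed, that recheck in pte_alloc_one_map() (mm/memory.c, with your patch
1 applied) looks like:

	static int pte_alloc_one_map(struct vm_fault *vmf)
	{
		...
		/* A racing huge fault may have filled the PMD under us. */
		if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
			return VM_FAULT_NOPAGE;	/* caller retries the fault */
		...
	}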

Anyway, I'll have a look at your fixes with fresh eyes as they could be the
right way to go as a quick fix. Refactoring and cleanups can come later.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 1/2] mm: avoid spurious 'bad pmd' warning messages
From: Jan Kara @ 2017-05-22 14:40 UTC
  To: Ross Zwisler
  Cc: Andrew Morton, linux-kernel, Darrick J. Wong, Alexander Viro,
	Christoph Hellwig, Dan Williams, Dave Hansen, Jan Kara,
	Matthew Wilcox, linux-fsdevel, linux-mm, linux-nvdimm,
	Kirill A . Shutemov, Pawel Lebioda, Dave Jiang, Xiong Zhou,
	Eryu Guan, stable

On Wed 17-05-17 11:16:38, Ross Zwisler wrote:
> When the pmd_devmap() checks were added by:
> 
> commit 5c7fb56e5e3f ("mm, dax: dax-pmd vs thp-pmd vs hugetlbfs-pmd")
> 
> to add better support for DAX huge pages, they were all added to the end of
> if() statements after existing pmd_trans_huge() checks.  So, things like:
> 
> -       if (pmd_trans_huge(*pmd))
> +       if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
> 
> When further checks were added after pmd_trans_unstable() checks by:
> 
> commit 7267ec008b5c ("mm: postpone page table allocation until we have page
> to map")
> 
> they were also added at the end of the conditional:
> 
> +       if (pmd_trans_unstable(fe->pmd) || pmd_devmap(*fe->pmd))
> 
> This ordering is fine for pmd_trans_huge(), but doesn't work for
> pmd_trans_unstable().  This is because DAX huge pages trip the pmd_bad()
> check inside of pmd_none_or_trans_huge_or_clear_bad() (called by
> pmd_trans_unstable()), which prints out a warning and returns 1.  So, we do
> end up doing the right thing, but only after spamming dmesg with
> suspicious-looking messages:
> 
> mm/pgtable-generic.c:39: bad pmd ffff8808daa49b88(84000001006000a5)
> 
> Reorder these checks so that pmd_devmap() is checked first, avoiding the
> error messages.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Fixes: 7267ec008b5c ("mm: postpone page table allocation until we have page to map")
> Cc: stable@vger.kernel.org

With the change requested by Dave this looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  mm/memory.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 6ff5d72..1ee269d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3061,7 +3061,7 @@ static int pte_alloc_one_map(struct vm_fault *vmf)
>  	 * through an atomic read in C, which is what pmd_trans_unstable()
>  	 * provides.
>  	 */
> -	if (pmd_trans_unstable(vmf->pmd) || pmd_devmap(*vmf->pmd))
> +	if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
>  		return VM_FAULT_NOPAGE;
>  
>  	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> @@ -3690,7 +3690,7 @@ static int handle_pte_fault(struct vm_fault *vmf)
>  		vmf->pte = NULL;
>  	} else {
>  		/* See comment in pte_alloc_one_map() */
> -		if (pmd_trans_unstable(vmf->pmd) || pmd_devmap(*vmf->pmd))
> +		if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
>  			return 0;
>  		/*
>  		 * A regular pmd is established and it can't morph into a huge
> -- 
> 2.9.4
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 2/2] dax: Fix race between colliding PMD & PTE entries
From: Jan Kara @ 2017-05-22 14:44 UTC
  To: Ross Zwisler
  Cc: Andrew Morton, linux-kernel, Darrick J. Wong, Alexander Viro,
	Christoph Hellwig, Dan Williams, Dave Hansen, Jan Kara,
	Matthew Wilcox, linux-fsdevel, linux-mm, linux-nvdimm,
	Kirill A . Shutemov, Pawel Lebioda, Dave Jiang, Xiong Zhou,
	Eryu Guan, stable

On Wed 17-05-17 11:16:39, Ross Zwisler wrote:
> We currently have two related PMD vs PTE races in the DAX code.  These can
> both be easily triggered by having two threads reading and writing
> simultaneously to the same private mapping, with the key being that private
> mapping reads can be handled with PMDs but private mapping writes are
> always handled with PTEs so that we can COW.
> 
> Here is the first race:
> 
> CPU 0					CPU 1
> 
> (private mapping write)
> __handle_mm_fault()
>   create_huge_pmd() - FALLBACK
>   handle_pte_fault()
>     passes check for pmd_devmap()
> 
> 					(private mapping read)
> 					__handle_mm_fault()
> 					  create_huge_pmd()
> 					    dax_iomap_pmd_fault() inserts PMD
> 
>     dax_iomap_pte_fault() does a PTE fault, but we already have a DAX PMD
>     			  installed in our page tables at this spot.
> 
> Here's the second race:
> 
> CPU 0					CPU 1
> 
> (private mapping write)
> __handle_mm_fault()
>   create_huge_pmd() - FALLBACK
> 					(private mapping read)
> 					__handle_mm_fault()
> 					  passes check for pmd_none()
> 					  create_huge_pmd()
> 
>   handle_pte_fault()
>     dax_iomap_pte_fault() inserts PTE
> 					    dax_iomap_pmd_fault() inserts PMD,
> 					       but we already have a PTE at
> 					       this spot.
> 
> The core of the issue is that while there is isolation between faults to
> the same range in the DAX fault handlers via our DAX entry locking, there
> is no isolation between faults in the code in mm/memory.c.  This means for
> instance that this code in __handle_mm_fault() can run:
> 
> 	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
> 		ret = create_huge_pmd(&vmf);
> 
> But by the time we actually get to run the fault handler called by
> create_huge_pmd(), the PMD is no longer pmd_none() because a racing PTE
> fault has installed a normal PMD here as a parent.  This is the cause of
> the 2nd race.  The first race is similar - there is the following check in
> handle_pte_fault():
> 
> 	} else {
> 		/* See comment in pte_alloc_one_map() */
> 		if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
> 			return 0;
> 
> So if a pmd_devmap() PMD (a DAX PMD) has been installed at vmf->pmd, we
> will bail and retry the fault.  This is correct, but there is nothing
> preventing the PMD from being installed after this check but before we
> actually get to the DAX PTE fault handlers.
> 
> In my testing these races result in the following types of errors:
> 
>  BUG: Bad rss-counter state mm:ffff8800a817d280 idx:1 val:1
>  BUG: non-zero nr_ptes on freeing mm: 15
> 
> Fix this issue by having the DAX fault handlers verify that it is safe to
> continue their fault after they have taken an entry lock to block other
> racing faults.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Reported-by: Pawel Lebioda <pawel.lebioda@intel.com>
> Cc: stable@vger.kernel.org
> 
> ---
> 
> I've written a new xfstest for this race, which I will send in response to
> this patch series.  This series has also survived an xfstest run without
> any new issues.
> 
> ---
>  fs/dax.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index c22eaf1..3cc02d1 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1155,6 +1155,15 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
>  	}
>  
>  	/*
> +	 * It is possible, particularly with mixed reads & writes to private
> +	 * mappings, that we have raced with a PMD fault that overlaps with
> +	 * the PTE we need to set up.  Now that we have a locked mapping entry
> +	 * we can safely unmap the huge PMD so that we can install our PTE in
> +	 * our page tables.
> +	 */
> +	split_huge_pmd(vmf->vma, vmf->pmd, vmf->address);
> +

Can we just check the PMD and, if it isn't as we want it, bail out and retry
the fault?  IMHO it will be more obvious that way (and also more in line with
how these races are handled for classical THP).  Otherwise the patch looks
good to me.
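
To be concrete, I mean something like the following in dax_iomap_pte_fault(),
after the entry lock is taken (completely untested, and the variable / label
names are only approximate):

	/*
	 * A racing PMD fault may have installed a huge entry in our page
	 * tables after the pmd_devmap() check in handle_pte_fault().  Bail
	 * out and let the fault be retried instead of splitting the PMD.
	 */
	if (pmd_trans_huge(*vmf->pmd) || pmd_devmap(*vmf->pmd)) {
		ret = 0;	/* fault will be retried */
		goto unlock_entry;
	}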

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 2/2] dax: Fix race between colliding PMD & PTE entries
From: Ross Zwisler @ 2017-05-22 19:43 UTC
  To: Jan Kara
  Cc: Ross Zwisler, Andrew Morton, linux-kernel, Darrick J. Wong,
	Alexander Viro, Christoph Hellwig, Dan Williams, Dave Hansen,
	Matthew Wilcox, linux-fsdevel, linux-mm, linux-nvdimm,
	Kirill A . Shutemov, Pawel Lebioda, Dave Jiang, Xiong Zhou,
	Eryu Guan, stable

On Mon, May 22, 2017 at 04:44:57PM +0200, Jan Kara wrote:
> On Wed 17-05-17 11:16:39, Ross Zwisler wrote:
> > We currently have two related PMD vs PTE races in the DAX code.  These can
> > both be easily triggered by having two threads reading and writing
> > simultaneously to the same private mapping, with the key being that private
> > mapping reads can be handled with PMDs but private mapping writes are
> > always handled with PTEs so that we can COW.
> > 
> > Here is the first race:
> > 
> > CPU 0					CPU 1
> > 
> > (private mapping write)
> > __handle_mm_fault()
> >   create_huge_pmd() - FALLBACK
> >   handle_pte_fault()
> >     passes check for pmd_devmap()
> > 
> > 					(private mapping read)
> > 					__handle_mm_fault()
> > 					  create_huge_pmd()
> > 					    dax_iomap_pmd_fault() inserts PMD
> > 
> >     dax_iomap_pte_fault() does a PTE fault, but we already have a DAX PMD
> >     			  installed in our page tables at this spot.
> > 
> > Here's the second race:
> > 
> > CPU 0					CPU 1
> > 
> > (private mapping write)
> > __handle_mm_fault()
> >   create_huge_pmd() - FALLBACK
> > 					(private mapping read)
> > 					__handle_mm_fault()
> > 					  passes check for pmd_none()
> > 					  create_huge_pmd()
> > 
> >   handle_pte_fault()
> >     dax_iomap_pte_fault() inserts PTE
> > 					    dax_iomap_pmd_fault() inserts PMD,
> > 					       but we already have a PTE at
> > 					       this spot.
> > 
> > The core of the issue is that while there is isolation between faults to
> > the same range in the DAX fault handlers via our DAX entry locking, there
> > is no isolation between faults in the code in mm/memory.c.  This means for
> > instance that this code in __handle_mm_fault() can run:
> > 
> > 	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
> > 		ret = create_huge_pmd(&vmf);
> > 
> > But by the time we actually get to run the fault handler called by
> > create_huge_pmd(), the PMD is no longer pmd_none() because a racing PTE
> > fault has installed a normal PMD here as a parent.  This is the cause of
> > the 2nd race.  The first race is similar - there is the following check in
> > handle_pte_fault():
> > 
> > 	} else {
> > 		/* See comment in pte_alloc_one_map() */
> > 		if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
> > 			return 0;
> > 
> > So if a pmd_devmap() PMD (a DAX PMD) has been installed at vmf->pmd, we
> > will bail and retry the fault.  This is correct, but there is nothing
> > preventing the PMD from being installed after this check but before we
> > actually get to the DAX PTE fault handlers.
> > 
> > In my testing these races result in the following types of errors:
> > 
> >  BUG: Bad rss-counter state mm:ffff8800a817d280 idx:1 val:1
> >  BUG: non-zero nr_ptes on freeing mm: 15
> > 
> > Fix this issue by having the DAX fault handlers verify that it is safe to
> > continue their fault after they have taken an entry lock to block other
> > racing faults.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > Reported-by: Pawel Lebioda <pawel.lebioda@intel.com>
> > Cc: stable@vger.kernel.org
> > 
> > ---
> > 
> > I've written a new xfstest for this race, which I will send in response to
> > this patch series.  This series has also survived an xfstest run without
> > any new issues.
> > 
> > ---
> >  fs/dax.c | 18 ++++++++++++++++++
> >  1 file changed, 18 insertions(+)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index c22eaf1..3cc02d1 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -1155,6 +1155,15 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
> >  	}
> >  
> >  	/*
> > +	 * It is possible, particularly with mixed reads & writes to private
> > +	 * mappings, that we have raced with a PMD fault that overlaps with
> > +	 * the PTE we need to set up.  Now that we have a locked mapping entry
> > +	 * we can safely unmap the huge PMD so that we can install our PTE in
> > +	 * our page tables.
> > +	 */
> > +	split_huge_pmd(vmf->vma, vmf->pmd, vmf->address);
> > +
> 
> Can we just check the PMD and, if it isn't as we want it, bail out and retry
> the fault?  IMHO it will be more obvious that way (and also more in line with
> how these races are handled for classical THP).  Otherwise the patch looks
> good to me.

Yep, that works as well.  I'll do this for v2.

Thanks for the review.

* Re: [PATCH 2/2] dax: Fix race between colliding PMD & PTE entries
From: Ross Zwisler @ 2017-05-22 19:44 UTC
  To: Jan Kara
  Cc: Ross Zwisler, Andrew Morton, linux-kernel, Darrick J. Wong,
	Alexander Viro, Christoph Hellwig, Dan Williams, Dave Hansen,
	Matthew Wilcox, linux-fsdevel, linux-mm, linux-nvdimm,
	Kirill A . Shutemov, Pawel Lebioda, Dave Jiang, Xiong Zhou,
	Eryu Guan, stable

On Mon, May 22, 2017 at 04:37:48PM +0200, Jan Kara wrote:
> On Thu 18-05-17 15:29:39, Ross Zwisler wrote:
> > On Thu, May 18, 2017 at 09:50:37AM +0200, Jan Kara wrote:
> > > On Wed 17-05-17 11:16:39, Ross Zwisler wrote:
<>
> > > The first scenario seems to be possible. dax_iomap_pmd_fault() will create
> > > PMD entry in the radix tree. Then dax_iomap_pte_fault() will come, do
> > > grab_mapping_entry(), there it sees entry is PMD but we are doing PTE fault
> > > so I'd think that pmd_downgrade = true... But actually the condition there
> > > doesn't trigger in this case. And that's a catch that although we asked
> > > grab_mapping_entry() for PTE, we've got PMD back and that screws us later.
> > 
> > Yep, it was a conscious decision when implementing the PMD support to allow one
> > thread to use PMDs and another to use PTEs in the same range, as long as the
> > thread faulting in PMDs is the first to insert into the radix tree.  A PMD
> > radix tree entry will be inserted and used for locking and dirty tracking, and
> > each thread or process can fault in either PTEs or PMDs into its own address
> > space as needed.
> 
> Well, for *threads* it doesn't really make good sense to mix PMDs and PTEs
> as they share page tables. However for *processes* it makes some sense to
> allow one process to use PTEs and another process to use PMDs. And I
> remember we were discussing this in the past.

Ugh, I was super sloppy with my use of "thread" and "process" in my previous
email.  Sorry, and thanks for the clarifications.  I think we're on the same
page, even if I had trouble articulating it. :)

> So the normal fault path uses alloc_set_pte() for installing a new PTE.  And
> that uses pte_alloc_one_map(), which checks whether the PMD is still suitable
> for inserting a PTE.  If not, we return VM_FAULT_NOPAGE.  Probably it would
> be cleanest to factor out the common parts of PTE and PMD insertion so that
> we can use these functions both from DAX and the generic fault paths.

Makes sense, thanks.
