linux-mm.kvack.org archive mirror
* [PATCH v2 1/2] mm: avoid spurious 'bad pmd' warning messages
@ 2017-05-22 21:57 Ross Zwisler
  2017-05-22 21:57 ` [PATCH v2 2/2] dax: Fix race between colliding PMD & PTE entries Ross Zwisler
  0 siblings, 1 reply; 5+ messages in thread
From: Ross Zwisler @ 2017-05-22 21:57 UTC
  To: Andrew Morton, linux-kernel
  Cc: Ross Zwisler, Darrick J. Wong, Alexander Viro, Christoph Hellwig,
	Dan Williams, Dave Hansen, Jan Kara, Matthew Wilcox,
	linux-fsdevel, linux-mm, linux-nvdimm, Kirill A. Shutemov,
	Pawel Lebioda, Dave Jiang, Xiong Zhou, Eryu Guan, stable

When the pmd_devmap() checks were added by:

commit 5c7fb56e5e3f ("mm, dax: dax-pmd vs thp-pmd vs hugetlbfs-pmd")

to add better support for DAX huge pages, they were all added to the end of
if() statements after existing pmd_trans_huge() checks.  So, things like:

-       if (pmd_trans_huge(*pmd))
+       if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))

When further checks were added after pmd_trans_unstable() checks by:

commit 7267ec008b5c ("mm: postpone page table allocation until we have page
to map")

they were also added at the end of the conditional:

+       if (pmd_trans_unstable(fe->pmd) || pmd_devmap(*fe->pmd))

This ordering is fine for pmd_trans_huge(), but doesn't work for
pmd_trans_unstable().  This is because DAX huge pages trip the pmd_bad()
check inside of pmd_none_or_trans_huge_or_clear_bad() (called by
pmd_trans_unstable()), which prints a warning and returns 1.  So we do end
up doing the right thing, but only after spamming dmesg with
suspicious-looking messages:

mm/pgtable-generic.c:39: bad pmd ffff8808daa49b88(84000001006000a5)
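
For reference, the relevant logic lives in
pmd_none_or_trans_huge_or_clear_bad() in include/asm-generic/pgtable.h.
A simplified sketch (SMP barrier details omitted) of why a DAX huge page
ends up in the pmd_clear_bad() path:

	static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
	{
		pmd_t pmdval = pmd_read_atomic(pmd);

		if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
			return 1;
		if (unlikely(pmd_bad(pmdval))) {
			/*
			 * On x86 a DAX huge pmd has _PAGE_DEVMAP set, so it
			 * fails the pmd_trans_huge() check above and lands
			 * here: pmd_clear_bad() prints the "bad pmd" message.
			 */
			pmd_clear_bad(pmd);
			return 1;
		}
		return 0;
	}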

Reorder these checks in a helper so that pmd_devmap() is checked first,
avoiding the error messages, and add a comment explaining why the ordering
is important.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Fixes: 7267ec008b5c ("mm: postpone page table allocation until we have page to map")
Cc: stable@vger.kernel.org
---

Changes from v1:
 - Break the checks out into the new pmd_devmap_trans_unstable() helper and
   add a comment about the ordering (Dave).  I ended up keeping this helper
   in mm/memory.c because I didn't see an obvious header where it would
   live happily.  pmd_devmap() is either defined in
   arch/x86/include/asm/pgtable.h or in include/linux/mm.h depending on
   __HAVE_ARCH_PTE_DEVMAP and CONFIG_TRANSPARENT_HUGEPAGE, and
   pmd_trans_unstable() is defined in include/asm-generic/pgtable.h.

 - Add a comment explaining why pte_alloc_one_map() doesn't suffer from races.
   This was the result of a conversation with Dave Hansen.
---
 mm/memory.c | 40 ++++++++++++++++++++++++++++++----------
 1 file changed, 30 insertions(+), 10 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 6ff5d72..2e65df1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3029,6 +3029,17 @@ static int __do_fault(struct vm_fault *vmf)
 	return ret;
 }
 
+/*
+ * The ordering of these checks is important for pmds with _PAGE_DEVMAP set.
+ * If we check pmd_trans_unstable() first we will trip the pmd_bad() check
+ * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly
+ * returning 1 but not before it spams dmesg with the pmd_clear_bad() output.
+ */
+static int pmd_devmap_trans_unstable(pmd_t *pmd)
+{
+	return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
+}
+
 static int pte_alloc_one_map(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
@@ -3052,18 +3063,27 @@ static int pte_alloc_one_map(struct vm_fault *vmf)
 map_pte:
 	/*
 	 * If a huge pmd materialized under us just retry later.  Use
-	 * pmd_trans_unstable() instead of pmd_trans_huge() to ensure the pmd
-	 * didn't become pmd_trans_huge under us and then back to pmd_none, as
-	 * a result of MADV_DONTNEED running immediately after a huge pmd fault
-	 * in a different thread of this mm, in turn leading to a misleading
-	 * pmd_trans_huge() retval.  All we have to ensure is that it is a
-	 * regular pmd that we can walk with pte_offset_map() and we can do that
-	 * through an atomic read in C, which is what pmd_trans_unstable()
-	 * provides.
+	 * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead of
+	 * pmd_trans_huge() to ensure the pmd didn't become pmd_trans_huge
+	 * under us and then back to pmd_none, as a result of MADV_DONTNEED
+	 * running immediately after a huge pmd fault in a different thread of
+	 * this mm, in turn leading to a misleading pmd_trans_huge() retval.
+	 * All we have to ensure is that it is a regular pmd that we can walk
+	 * with pte_offset_map() and we can do that through an atomic read in
+	 * C, which is what pmd_trans_unstable() provides.
 	 */
-	if (pmd_trans_unstable(vmf->pmd) || pmd_devmap(*vmf->pmd))
+	if (pmd_devmap_trans_unstable(vmf->pmd))
 		return VM_FAULT_NOPAGE;
 
+	/*
+	 * At this point we know that our vmf->pmd points to a page of ptes
+	 * and it cannot become pmd_none(), pmd_devmap() or pmd_trans_huge()
+	 * for the duration of the fault.  If a racing MADV_DONTNEED runs and
+	 * we zap the ptes pointed to by our vmf->pmd, the vmf->ptl will still
+	 * be valid and we will re-check to make sure the vmf->pte isn't
+	 * pte_none() under vmf->ptl protection when we return to
+	 * alloc_set_pte().
+	 */
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
 			&vmf->ptl);
 	return 0;
@@ -3690,7 +3710,7 @@ static int handle_pte_fault(struct vm_fault *vmf)
 		vmf->pte = NULL;
 	} else {
 		/* See comment in pte_alloc_one_map() */
-		if (pmd_trans_unstable(vmf->pmd) || pmd_devmap(*vmf->pmd))
+		if (pmd_devmap_trans_unstable(vmf->pmd))
 			return 0;
 		/*
 		 * A regular pmd is established and it can't morph into a huge
-- 
2.9.4


* [PATCH v2 2/2] dax: Fix race between colliding PMD & PTE entries
  2017-05-22 21:57 [PATCH v2 1/2] mm: avoid spurious 'bad pmd' warning messages Ross Zwisler
@ 2017-05-22 21:57 ` Ross Zwisler
  2017-05-23  9:59   ` Jan Kara
  2017-05-26 19:59   ` [PATCH] dax: improve fix for " Ross Zwisler
  0 siblings, 2 replies; 5+ messages in thread
From: Ross Zwisler @ 2017-05-22 21:57 UTC
  To: Andrew Morton, linux-kernel
  Cc: Ross Zwisler, Darrick J. Wong, Alexander Viro, Christoph Hellwig,
	Dan Williams, Dave Hansen, Jan Kara, Matthew Wilcox,
	linux-fsdevel, linux-mm, linux-nvdimm, Kirill A. Shutemov,
	Pawel Lebioda, Dave Jiang, Xiong Zhou, Eryu Guan, stable

We currently have two related PMD vs PTE races in the DAX code.  These can
both be easily triggered by having two threads reading and writing
simultaneously to the same private mapping, with the key being that private
mapping reads can be handled with PMDs but private mapping writes are
always handled with PTEs so that we can COW.
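
For reference, the PTE fallback for private mapping writes comes from a
check near the top of dax_iomap_pmd_fault(), roughly:

	/* Fall back to PTEs if we're going to COW */
	if (write && !(vmf->vma->vm_flags & VM_SHARED))
		goto fallback;

so a write fault can never install a DAX PMD over a private mapping.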

Here is the first race:

CPU 0					CPU 1

(private mapping write)
__handle_mm_fault()
  create_huge_pmd() - FALLBACK
  handle_pte_fault()
    passes check for pmd_devmap()

					(private mapping read)
					__handle_mm_fault()
					  create_huge_pmd()
					    dax_iomap_pmd_fault() inserts PMD

    dax_iomap_pte_fault() does a PTE fault, but we already have a DAX PMD
    			  installed in our page tables at this spot.

Here's the second race:

CPU 0					CPU 1

(private mapping read)
__handle_mm_fault()
  passes check for pmd_none()
  create_huge_pmd()
    dax_iomap_pmd_fault() inserts PMD

(private mapping write)
__handle_mm_fault()
  create_huge_pmd() - FALLBACK
					(private mapping read)
					__handle_mm_fault()
					  passes check for pmd_none()
					  create_huge_pmd()

  handle_pte_fault()
    dax_iomap_pte_fault() inserts PTE
					    dax_iomap_pmd_fault() inserts PMD,
					       but we already have a PTE at
					       this spot.

The core of the issue is that while there is isolation between faults to
the same range in the DAX fault handlers via our DAX entry locking, there
is no isolation between faults in the code in mm/memory.c.  This means for
instance that this code in __handle_mm_fault() can run:

	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
		ret = create_huge_pmd(&vmf);

But by the time we actually get to run the fault handler called by
create_huge_pmd(), the PMD is no longer pmd_none() because a racing PTE
fault has installed a normal PMD here as a parent.  This is the cause of
the 2nd race.  The first race is similar - there is the following check in
handle_pte_fault():

	} else {
		/* See comment in pte_alloc_one_map() */
		if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
			return 0;

So if a pmd_devmap() PMD (a DAX PMD) has been installed at vmf->pmd, we
will bail and retry the fault.  This is correct, but there is nothing
preventing the PMD from being installed after this check but before we
actually get to the DAX PTE fault handlers.

In my testing these races result in the following types of errors:

 BUG: Bad rss-counter state mm:ffff8800a817d280 idx:1 val:1
 BUG: non-zero nr_ptes on freeing mm: 15

Fix this issue by having the DAX fault handlers verify that it is safe to
continue their fault after they have taken an entry lock to block other
racing faults.
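
Schematically, the fix is a re-check under the lock; a sketch combining
the mm/memory.c side with the DAX side (error handling omitted, not the
literal diff):

	/* mm/memory.c: unsynchronized check, the PMD can change right after */
	if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
		return 0;			/* bail, fault is retried */

	/* fs/dax.c: the entry lock now blocks colliding DAX faults */
	entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
	...
	if (pmd_devmap(*vmf->pmd)) {		/* re-check under the lock */
		vmf_ret = VM_FAULT_NOPAGE;	/* collided, retry the fault */
		goto unlock_entry;
	}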

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Pawel Lebioda <pawel.lebioda@intel.com>
Cc: stable@vger.kernel.org
---

Changes from v1:
 - Handle the failure case in dax_iomap_pte_fault() by retrying the fault
   (Jan).

This series has survived my new xfstest (generic/437) and full xfstest
regression testing runs.
---
 fs/dax.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index c22eaf1..fc62f36 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1155,6 +1155,17 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 	}
 
 	/*
+	 * It is possible, particularly with mixed reads & writes to private
+	 * mappings, that we have raced with a PMD fault that overlaps with
+	 * the PTE we need to set up.  If so just return and the fault will be
+	 * retried.
+	 */
+	if (pmd_devmap(*vmf->pmd)) {
+		vmf_ret = VM_FAULT_NOPAGE;
+		goto unlock_entry;
+	}
+
+	/*
 	 * Note that we don't bother to use iomap_apply here: DAX required
 	 * the file system block size to be equal the page size, which means
 	 * that we never have to deal with more than a single extent here.
@@ -1398,6 +1409,15 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
 		goto fallback;
 
 	/*
+	 * It is possible, particularly with mixed reads & writes to private
+	 * mappings, that we have raced with a PTE fault that overlaps with
+	 * the PMD we need to set up.  If so we just fall back to a PTE fault
+	 * ourselves.
+	 */
+	if (!pmd_none(*vmf->pmd))
+		goto unlock_entry;
+
+	/*
 	 * Note that we don't use iomap_apply here.  We aren't doing I/O, only
 	 * setting up a mapping, so really we're using iomap_begin() as a way
 	 * to look up our filesystem block.
-- 
2.9.4


* Re: [PATCH v2 2/2] dax: Fix race between colliding PMD & PTE entries
  2017-05-22 21:57 ` [PATCH v2 2/2] dax: Fix race between colliding PMD & PTE entries Ross Zwisler
@ 2017-05-23  9:59   ` Jan Kara
  2017-05-26 19:59   ` [PATCH] dax: improve fix for " Ross Zwisler
  1 sibling, 0 replies; 5+ messages in thread
From: Jan Kara @ 2017-05-23  9:59 UTC
  To: Ross Zwisler
  Cc: Andrew Morton, linux-kernel, Darrick J. Wong, Alexander Viro,
	Christoph Hellwig, Dan Williams, Dave Hansen, Jan Kara,
	Matthew Wilcox, linux-fsdevel, linux-mm, linux-nvdimm,
	Kirill A. Shutemov, Pawel Lebioda, Dave Jiang, Xiong Zhou,
	Eryu Guan, stable

On Mon 22-05-17 15:57:49, Ross Zwisler wrote:
> We currently have two related PMD vs PTE races in the DAX code.  These can
> both be easily triggered by having two threads reading and writing
> simultaneously to the same private mapping, with the key being that private
> mapping reads can be handled with PMDs but private mapping writes are
> always handled with PTEs so that we can COW.
> 
> Here is the first race:
> 
> CPU 0					CPU 1
> 
> (private mapping write)
> __handle_mm_fault()
>   create_huge_pmd() - FALLBACK
>   handle_pte_fault()
>     passes check for pmd_devmap()
> 
> 					(private mapping read)
> 					__handle_mm_fault()
> 					  create_huge_pmd()
> 					    dax_iomap_pmd_fault() inserts PMD
> 
>     dax_iomap_pte_fault() does a PTE fault, but we already have a DAX PMD
>     			  installed in our page tables at this spot.
> 
> Here's the second race:
> 
> CPU 0					CPU 1
> 
> (private mapping read)
> __handle_mm_fault()
>   passes check for pmd_none()
>   create_huge_pmd()
>     dax_iomap_pmd_fault() inserts PMD
> 
> (private mapping write)
> __handle_mm_fault()
>   create_huge_pmd() - FALLBACK
> 					(private mapping read)
> 					__handle_mm_fault()
> 					  passes check for pmd_none()
> 					  create_huge_pmd()
> 
>   handle_pte_fault()
>     dax_iomap_pte_fault() inserts PTE
> 					    dax_iomap_pmd_fault() inserts PMD,
> 					       but we already have a PTE at
> 					       this spot.
> 
> The core of the issue is that while there is isolation between faults to
> the same range in the DAX fault handlers via our DAX entry locking, there
> is no isolation between faults in the code in mm/memory.c.  This means for
> instance that this code in __handle_mm_fault() can run:
> 
> 	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
> 		ret = create_huge_pmd(&vmf);
> 
> But by the time we actually get to run the fault handler called by
> create_huge_pmd(), the PMD is no longer pmd_none() because a racing PTE
> fault has installed a normal PMD here as a parent.  This is the cause of
> the 2nd race.  The first race is similar - there is the following check in
> handle_pte_fault():
> 
> 	} else {
> 		/* See comment in pte_alloc_one_map() */
> 		if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
> 			return 0;
> 
> So if a pmd_devmap() PMD (a DAX PMD) has been installed at vmf->pmd, we
> will bail and retry the fault.  This is correct, but there is nothing
> preventing the PMD from being installed after this check but before we
> actually get to the DAX PTE fault handlers.
> 
> In my testing these races result in the following types of errors:
> 
>  BUG: Bad rss-counter state mm:ffff8800a817d280 idx:1 val:1
>  BUG: non-zero nr_ptes on freeing mm: 15
> 
> Fix this issue by having the DAX fault handlers verify that it is safe to
> continue their fault after they have taken an entry lock to block other
> racing faults.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Reported-by: Pawel Lebioda <pawel.lebioda@intel.com>
> Cc: stable@vger.kernel.org

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza


> ---
> 
> Changes from v1:
>  - Handle the failure case in dax_iomap_pte_fault() by retrying the fault
>    (Jan).
> 
> This series has survived my new xfstest (generic/437) and full xfstest
> regression testing runs.
> ---
>  fs/dax.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index c22eaf1..fc62f36 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1155,6 +1155,17 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
>  	}
>  
>  	/*
> +	 * It is possible, particularly with mixed reads & writes to private
> +	 * mappings, that we have raced with a PMD fault that overlaps with
> +	 * the PTE we need to set up.  If so just return and the fault will be
> +	 * retried.
> +	 */
> +	if (pmd_devmap(*vmf->pmd)) {
> +		vmf_ret = VM_FAULT_NOPAGE;
> +		goto unlock_entry;
> +	}
> +
> +	/*
>  	 * Note that we don't bother to use iomap_apply here: DAX required
>  	 * the file system block size to be equal the page size, which means
>  	 * that we never have to deal with more than a single extent here.
> @@ -1398,6 +1409,15 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
>  		goto fallback;
>  
>  	/*
> +	 * It is possible, particularly with mixed reads & writes to private
> +	 * mappings, that we have raced with a PTE fault that overlaps with
> +	 * the PMD we need to set up.  If so we just fall back to a PTE fault
> +	 * ourselves.
> +	 */
> +	if (!pmd_none(*vmf->pmd))
> +		goto unlock_entry;
> +
> +	/*
>  	 * Note that we don't use iomap_apply here.  We aren't doing I/O, only
>  	 * setting up a mapping, so really we're using iomap_begin() as a way
>  	 * to look up our filesystem block.
> -- 
> 2.9.4
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* [PATCH] dax: improve fix for colliding PMD & PTE entries
  2017-05-22 21:57 ` [PATCH v2 2/2] dax: Fix race between colliding PMD & PTE entries Ross Zwisler
  2017-05-23  9:59   ` Jan Kara
@ 2017-05-26 19:59   ` Ross Zwisler
  2017-05-29 12:17     ` Jan Kara
  1 sibling, 1 reply; 5+ messages in thread
From: Ross Zwisler @ 2017-05-26 19:59 UTC
  To: Andrew Morton, linux-kernel
  Cc: Ross Zwisler, Darrick J. Wong, Alexander Viro, Christoph Hellwig,
	Dan Williams, Dave Hansen, Jan Kara, Matthew Wilcox,
	linux-fsdevel, linux-mm, linux-nvdimm, Kirill A. Shutemov,
	Pawel Lebioda, Dave Jiang, Xiong Zhou, Eryu Guan, stable

This commit, which has not yet made it upstream but is in the -mm tree:

    dax: Fix race between colliding PMD & PTE entries

fixed a pair of race conditions where racing DAX PTE and PMD faults could
corrupt page tables.  This fix had two shortcomings which are addressed by
this patch:

1) In the PTE fault handler we only checked for a collision using
pmd_devmap().  The pmd_devmap() check will trigger when we have raced with
a PMD that has real DAX storage, but to account for the case where we
collide with a huge zero page entry we also need to check for
pmd_trans_huge().

2) In the PMD fault handler we only continued with the fault if no PMD at
all was present (pmd_none()).  This is the case when we are faulting in a
PMD for the first time, but there are two other cases to consider.  The
first is that we are servicing a write fault over a PMD huge zero page,
which we detect with pmd_trans_huge().  The second is that we are servicing
a write fault over a DAX PMD with real storage, which we address with
pmd_devmap().

Fix both of these, and instead of manually triggering a fallback in the PMD
collision case, be consistent with the other collision detection code in
the fault handlers and just retry.
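
The new PMD-side condition, annotated (same logic as the hunk below):

	if (!pmd_none(*vmf->pmd) &&		/* something is mapped here... */
	    !pmd_trans_huge(*vmf->pmd) &&	/* ...not a huge zero page     */
	    !pmd_devmap(*vmf->pmd)) {		/* ...not a DAX PMD            */
		/* it must be a PTE page table: collide, return 0 and retry */
		result = 0;
		goto unlock_entry;
	}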

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: stable@vger.kernel.org
---

For both the -mm tree and for stable, feel free to squash this with the
original commit if you think that is appropriate.

This has passed targeted testing and an xfstests run.
---
 fs/dax.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index fc62f36..2a6889b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1160,7 +1160,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 	 * the PTE we need to set up.  If so just return and the fault will be
 	 * retried.
 	 */
-	if (pmd_devmap(*vmf->pmd)) {
+	if (pmd_trans_huge(*vmf->pmd) || pmd_devmap(*vmf->pmd)) {
 		vmf_ret = VM_FAULT_NOPAGE;
 		goto unlock_entry;
 	}
@@ -1411,11 +1411,14 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
 	/*
 	 * It is possible, particularly with mixed reads & writes to private
 	 * mappings, that we have raced with a PTE fault that overlaps with
-	 * the PMD we need to set up.  If so we just fall back to a PTE fault
-	 * ourselves.
+	 * the PMD we need to set up.  If so just return and the fault will be
+	 * retried.
 	 */
-	if (!pmd_none(*vmf->pmd))
+	if (!pmd_none(*vmf->pmd) && !pmd_trans_huge(*vmf->pmd) &&
+			!pmd_devmap(*vmf->pmd)) {
+		result = 0;
 		goto unlock_entry;
+	}
 
 	/*
 	 * Note that we don't use iomap_apply here.  We aren't doing I/O, only
-- 
2.9.4


* Re: [PATCH] dax: improve fix for colliding PMD & PTE entries
  2017-05-26 19:59   ` [PATCH] dax: improve fix for " Ross Zwisler
@ 2017-05-29 12:17     ` Jan Kara
  0 siblings, 0 replies; 5+ messages in thread
From: Jan Kara @ 2017-05-29 12:17 UTC
  To: Ross Zwisler
  Cc: Andrew Morton, linux-kernel, Darrick J. Wong, Alexander Viro,
	Christoph Hellwig, Dan Williams, Dave Hansen, Jan Kara,
	Matthew Wilcox, linux-fsdevel, linux-mm, linux-nvdimm,
	Kirill A. Shutemov, Pawel Lebioda, Dave Jiang, Xiong Zhou,
	Eryu Guan, stable

On Fri 26-05-17 13:59:32, Ross Zwisler wrote:
> This commit, which has not yet made it upstream but is in the -mm tree:
> 
>     dax: Fix race between colliding PMD & PTE entries
> 
> fixed a pair of race conditions where racing DAX PTE and PMD faults could
> corrupt page tables.  This fix had two shortcomings which are addressed by
> this patch:
> 
> 1) In the PTE fault handler we only checked for a collision using
> pmd_devmap().  The pmd_devmap() check will trigger when we have raced with
> a PMD that has real DAX storage, but to account for the case where we
> collide with a huge zero page entry we also need to check for
> pmd_trans_huge().
> 
> 2) In the PMD fault handler we only continued with the fault if no PMD at
> all was present (pmd_none()).  This is the case when we are faulting in a
> PMD for the first time, but there are two other cases to consider.  The
> first is that we are servicing a write fault over a PMD huge zero page,
> which we detect with pmd_trans_huge().  The second is that we are servicing
> a write fault over a DAX PMD with real storage, which we address with
> pmd_devmap().
> 
> Fix both of these, and instead of manually triggering a fallback in the PMD
> collision case, be consistent with the other collision detection code in
> the fault handlers and just retry.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: stable@vger.kernel.org

Ugh, right. I forgot that the zero page in the page tables will not pass
the devmap check... You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
> 
> For both the -mm tree and for stable, feel free to squash this with the
> original commit if you think that is appropriate.
> 
> This has passed targeted testing and an xfstests run.
> ---
>  fs/dax.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index fc62f36..2a6889b 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1160,7 +1160,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
>  	 * the PTE we need to set up.  If so just return and the fault will be
>  	 * retried.
>  	 */
> -	if (pmd_devmap(*vmf->pmd)) {
> +	if (pmd_trans_huge(*vmf->pmd) || pmd_devmap(*vmf->pmd)) {
>  		vmf_ret = VM_FAULT_NOPAGE;
>  		goto unlock_entry;
>  	}
> @@ -1411,11 +1411,14 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
>  	/*
>  	 * It is possible, particularly with mixed reads & writes to private
>  	 * mappings, that we have raced with a PTE fault that overlaps with
> -	 * the PMD we need to set up.  If so we just fall back to a PTE fault
> -	 * ourselves.
> +	 * the PMD we need to set up.  If so just return and the fault will be
> +	 * retried.
>  	 */
> -	if (!pmd_none(*vmf->pmd))
> +	if (!pmd_none(*vmf->pmd) && !pmd_trans_huge(*vmf->pmd) &&
> +			!pmd_devmap(*vmf->pmd)) {
> +		result = 0;
>  		goto unlock_entry;
> +	}
>  
>  	/*
>  	 * Note that we don't use iomap_apply here.  We aren't doing I/O, only
> -- 
> 2.9.4
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
