* [RFC] [PATCH 0/12] DAX page fault locking
@ 2016-03-10 19:18 ` Jan Kara
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Kara @ 2016-03-10 19:18 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Jan Kara, linux-nvdimm, NeilBrown, Wilcox, Matthew R

Hello,

this is my first attempt at the DAX page fault locking rewrite. It is mostly
a dump of the current state of my git tree so that people can see where I'm
going before the weekend. It should be complete, but so far it is only
compile tested.

The basic idea is that we use a bit in an exceptional radix tree entry as
a lock bit and use it similarly to how the page lock is used for normal
faults. That way we fix races between hole instantiation and read faults of
the same index. For now I have disabled PMD faults since the page fault
locking issues are even worse there. I think we will need something like
Matthew's multi-order radix tree to fix the races for PMD faults, but I
want to look at those once locking for normal pages is working.
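
A rough sketch of the idea (illustration only -- the helper name is made up
and the real locking lives in fs/dax.c in the later patches; DAX_ENTRY_LOCK
is the bit introduced in patch 09):

	/* Sketch: lock an exceptional radix tree entry, much like lock_page() */
	static void *sketch_lock_dax_entry(struct address_space *mapping,
					   pgoff_t index)
	{
		void **slot, *entry, *locked;

		spin_lock_irq(&mapping->tree_lock);
		slot = radix_tree_lookup_slot(&mapping->page_tree, index);
		entry = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
		/* A real implementation would wait here while DAX_ENTRY_LOCK is set */
		locked = (void *)((unsigned long)entry | DAX_ENTRY_LOCK);
		radix_tree_replace_slot(slot, locked);
		spin_unlock_irq(&mapping->tree_lock);
		return entry;
	}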

In the end I have decided to implement the bit locking directly in the DAX
code. Originally I was thinking we could provide something generic directly
in the radix tree code, but the functions DAX needs are rather specific.
Maybe someone else will have a good idea of how to distill some generally
useful functions out of what I've implemented for DAX, but for now I
haven't bothered with that.

								Honza

* [PATCH 01/12] DAX: move RADIX_DAX_ definitions to dax.c
  2016-03-10 19:18 ` Jan Kara
@ 2016-03-10 19:18   ` Jan Kara
  -1 siblings, 0 replies; 64+ messages in thread
From: Jan Kara @ 2016-03-10 19:18 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Jan Kara, linux-nvdimm, NeilBrown, Wilcox, Matthew R

From: NeilBrown <neilb@suse.com>

These don't belong in radix-tree.h any more than PAGECACHE_TAG_* do.
Let's try to maintain the idea that radix-tree simply implements an
abstract data type.
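
For reference (not part of the patch), the macros being moved pack a sector
and an entry type into a single value, roughly like this:

	/* Illustration only: encoding and decoding a PTE-sized DAX entry */
	sector_t sector = 1234;
	void *entry = RADIX_DAX_ENTRY(sector, false);	/* sector << 4 | RADIX_DAX_PTE */

	BUG_ON(RADIX_DAX_TYPE(entry) != RADIX_DAX_PTE);
	BUG_ON(RADIX_DAX_SECTOR(entry) != sector);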

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c                   | 9 +++++++++
 include/linux/radix-tree.h | 9 ---------
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 711172450da6..9c4d697fb6fc 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -32,6 +32,15 @@
 #include <linux/pfn_t.h>
 #include <linux/sizes.h>
 
+#define RADIX_DAX_MASK	0xf
+#define RADIX_DAX_SHIFT	4
+#define RADIX_DAX_PTE  (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_PMD  (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_MASK)
+#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
+#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
+		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
+
 static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
 {
 	struct request_queue *q = bdev->bd_queue;
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index f54be7082207..968150ab8a1c 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -51,15 +51,6 @@
 #define RADIX_TREE_EXCEPTIONAL_ENTRY	2
 #define RADIX_TREE_EXCEPTIONAL_SHIFT	2
 
-#define RADIX_DAX_MASK	0xf
-#define RADIX_DAX_SHIFT	4
-#define RADIX_DAX_PTE  (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
-#define RADIX_DAX_PMD  (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
-#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_MASK)
-#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
-#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
-		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
-
 static inline int radix_tree_is_indirect_ptr(void *ptr)
 {
 	return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR);
-- 
2.6.2


* [PATCH 02/12] radix-tree: make 'indirect' bit available to exception entries.
  2016-03-10 19:18 ` Jan Kara
@ 2016-03-10 19:18   ` Jan Kara
  -1 siblings, 0 replies; 64+ messages in thread
From: Jan Kara @ 2016-03-10 19:18 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Jan Kara, linux-nvdimm, NeilBrown, Wilcox, Matthew R

From: NeilBrown <neilb@suse.com>

A pointer to a radix_tree_node will always have the 'exception'
bit cleared, so if the exception bit is set the value cannot
be an indirect pointer.  Thus it is safe to make the 'indirect bit'
available to store extra information in exception entries.

This patch adds RADIX_TREE_INDIRECT_MASK, and a value is only treated as
an indirect (pointer) entry if the two least-significant bits are '01'.

The change in radix-tree.c ensures the stored value still looks like an
indirect pointer, and saves a load as well.

We could swap the two bits and so keep all the exceptional bits contiguous,
but I have other plans for that bit....
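
To illustrate the new rule (an example, not part of the patch): an
exceptional entry always has bit 1 set, so even if it also uses bit 0 its
low two bits are '11' and it can never be mistaken for an indirect pointer,
whose low two bits are '01'.

	/* Illustration: only the '01' pattern means "indirect pointer" */
	void *exceptional = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
				     RADIX_TREE_INDIRECT_PTR);	/* low bits 11 */

	BUG_ON(radix_tree_is_indirect_ptr(exceptional));	/* not indirect */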

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/radix-tree.h | 11 +++++++++--
 lib/radix-tree.c           |  2 +-
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 968150ab8a1c..450c12b546b7 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -40,8 +40,13 @@
  * Indirect pointer in fact is also used to tag the last pointer of a node
  * when it is shrunk, before we rcu free the node. See shrink code for
  * details.
+ *
+ * To allow an exception entry to only lose one bit, we ignore
+ * the INDIRECT bit when the exception bit is set.  So an entry is
+ * indirect if the least significant 2 bits are 01.
  */
 #define RADIX_TREE_INDIRECT_PTR		1
+#define RADIX_TREE_INDIRECT_MASK	3
 /*
  * A common use of the radix tree is to store pointers to struct pages;
  * but shmem/tmpfs needs also to store swap entries in the same tree:
@@ -53,7 +58,8 @@
 
 static inline int radix_tree_is_indirect_ptr(void *ptr)
 {
-	return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR);
+	return ((unsigned long)ptr & RADIX_TREE_INDIRECT_MASK)
+		== RADIX_TREE_INDIRECT_PTR;
 }
 
 /*** radix-tree API starts here ***/
@@ -221,7 +227,8 @@ static inline void *radix_tree_deref_slot_protected(void **pslot,
  */
 static inline int radix_tree_deref_retry(void *arg)
 {
-	return unlikely((unsigned long)arg & RADIX_TREE_INDIRECT_PTR);
+	return unlikely(((unsigned long)arg & RADIX_TREE_INDIRECT_MASK)
+			== RADIX_TREE_INDIRECT_PTR);
 }
 
 /**
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 6b79e9026e24..37d4643ab5c0 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -1305,7 +1305,7 @@ static inline void radix_tree_shrink(struct radix_tree_root *root)
 		 * to force callers to retry.
 		 */
 		if (root->height == 0)
-			*((unsigned long *)&to_free->slots[0]) |=
+			*((unsigned long *)&to_free->slots[0]) =
 						RADIX_TREE_INDIRECT_PTR;
 
 		radix_tree_node_free(to_free);
-- 
2.6.2


* [PATCH 03/12] mm: Remove VM_FAULT_MINOR
  2016-03-10 19:18 ` Jan Kara
@ 2016-03-10 19:18   ` Jan Kara
  -1 siblings, 0 replies; 64+ messages in thread
From: Jan Kara @ 2016-03-10 19:18 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Jan Kara, linux-nvdimm, NeilBrown, Wilcox, Matthew R

The define has a comment from Nick Piggin from 2007:

/* For backwards compat. Remove me quickly. */

I guess nine years is not too hurried a sense of 'quickly', even by kernel
standards.
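
One note on the xtensa hunk below (my observation, not from the original
message): since VM_FAULT_MINOR is defined as 0, the old test could never be
true, so the minor-fault perf counter was never bumped there; dropping the
test down to a plain 'else' also fixes that.

	/* VM_FAULT_MINOR == 0, so this branch was dead code: */
	else if (flags & VM_FAULT_MINOR)	/* flags & 0 is always false */
		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);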

Signed-off-by: Jan Kara <jack@suse.cz>
---
 arch/arm/mm/fault.c       | 2 +-
 arch/arm64/mm/fault.c     | 2 +-
 arch/unicore32/mm/fault.c | 2 +-
 arch/xtensa/mm/fault.c    | 2 +-
 include/linux/mm.h        | 2 --
 5 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index daafcf121ce0..ad5841856007 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -346,7 +346,7 @@ retry:
 	up_read(&mm->mmap_sem);
 
 	/*
-	 * Handle the "normal" case first - VM_FAULT_MAJOR / VM_FAULT_MINOR
+	 * Handle the "normal" case first - VM_FAULT_MAJOR
 	 */
 	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP | VM_FAULT_BADACCESS))))
 		return 0;
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index abe2a9542b3a..97135b61b32a 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -295,7 +295,7 @@ retry:
 	up_read(&mm->mmap_sem);
 
 	/*
-	 * Handle the "normal" case first - VM_FAULT_MAJOR / VM_FAULT_MINOR
+	 * Handle the "normal" case first - VM_FAULT_MAJOR
 	 */
 	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP |
 			      VM_FAULT_BADACCESS))))
diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index afccef5529cc..2ec3d3adcefc 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -276,7 +276,7 @@ retry:
 	up_read(&mm->mmap_sem);
 
 	/*
-	 * Handle the "normal" case first - VM_FAULT_MAJOR / VM_FAULT_MINOR
+	 * Handle the "normal" case first - VM_FAULT_MAJOR
 	 */
 	if (likely(!(fault &
 	       (VM_FAULT_ERROR | VM_FAULT_BADMAP | VM_FAULT_BADACCESS))))
diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index c9784c1b18d8..7f4a1fdb1502 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -146,7 +146,7 @@ good_area:
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 	if (flags & VM_FAULT_MAJOR)
 		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
-	else if (flags & VM_FAULT_MINOR)
+	else
 		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
 
 	return;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 516e14944339..37f319d7636e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1051,8 +1051,6 @@ static inline void clear_page_pfmemalloc(struct page *page)
  * just gets major/minor fault counters bumped up.
  */
 
-#define VM_FAULT_MINOR	0 /* For backwards compat. Remove me quickly. */
-
 #define VM_FAULT_OOM	0x0001
 #define VM_FAULT_SIGBUS	0x0002
 #define VM_FAULT_MAJOR	0x0004
-- 
2.6.2


* [PATCH 04/12] ocfs2: Fix return value from ocfs2_page_mkwrite()
  2016-03-10 19:18 ` Jan Kara
@ 2016-03-10 19:18   ` Jan Kara
  -1 siblings, 0 replies; 64+ messages in thread
From: Jan Kara @ 2016-03-10 19:18 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Jan Kara, linux-nvdimm, NeilBrown, Wilcox, Matthew R

ocfs2_page_mkwrite() could mistakenly return an error code instead of a
VM_FAULT_* status value. Fix it.
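
For context, ->page_mkwrite handlers must return VM_FAULT_* bits, not a
negative errno. A sketch of the mapping the fix applies inline (the helper
name is made up for illustration):

	static int sketch_mkwrite_status(int err)
	{
		if (err == -ENOMEM)
			return VM_FAULT_OOM;
		return VM_FAULT_SIGBUS;	/* any other error */
	}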

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ocfs2/mmap.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/ocfs2/mmap.c b/fs/ocfs2/mmap.c
index 9581d190f6e1..77ebc2bc1cca 100644
--- a/fs/ocfs2/mmap.c
+++ b/fs/ocfs2/mmap.c
@@ -147,6 +147,10 @@ static int ocfs2_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	ret = ocfs2_inode_lock(inode, &di_bh, 1);
 	if (ret < 0) {
 		mlog_errno(ret);
+		if (ret == -ENOMEM)
+			ret = VM_FAULT_OOM;
+		else
+			ret = VM_FAULT_SIGBUS;
 		goto out;
 	}
 
-- 
2.6.2


* [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
  2016-03-10 19:18 ` Jan Kara
@ 2016-03-10 19:18 ` Jan Kara
  2016-03-10 19:55     ` Wilcox, Matthew R
  -1 siblings, 1 reply; 64+ messages in thread
From: Jan Kara @ 2016-03-10 19:18 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Wilcox, Matthew R, Ross Zwisler, Dan Williams, linux-nvdimm,
	NeilBrown, Jan Kara

At one point DAX used i_mmap_lock to synchronize page faults with page
table invalidation during truncate. However, these days DAX uses
filesystem-specific rw_semaphores to protect against these races (dax_sem
in the ext2 case, i_mmap_sem in the ext4 case, XFS_MMAPLOCK in the XFS
case). So remove the unnecessary locking.
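
A condensed sketch of the per-filesystem serialization this relies on,
based on the ext2 code visible in patch 06 (using the post-patch-06
__dax_fault() signature; error handling and sb_start_pagefault() omitted):

	static int ext2_dax_fault_sketch(struct vm_area_struct *vma,
					 struct vm_fault *vmf)
	{
		struct ext2_inode_info *ei = EXT2_I(file_inode(vma->vm_file));
		int ret;

		down_read(&ei->dax_sem);	/* fault side takes the rwsem shared */
		ret = __dax_fault(vma, vmf, ext2_get_block);
		up_read(&ei->dax_sem);
		return ret;
	}

	/*
	 * The truncate path takes down_write(&ei->dax_sem) before invalidating
	 * page tables, which is what makes i_mmap_lock unnecessary here.
	 */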

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c    | 19 -------------------
 mm/memory.c | 14 --------------
 2 files changed, 33 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 9c4d697fb6fc..e409e8fc13b7 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -563,8 +563,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	pgoff_t size;
 	int error;
 
-	i_mmap_lock_read(mapping);
-
 	/*
 	 * Check truncate didn't happen while we were allocating a block.
 	 * If it did, this block may or may not be still allocated to the
@@ -597,8 +595,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	error = vm_insert_mixed(vma, vaddr, dax.pfn);
 
  out:
-	i_mmap_unlock_read(mapping);
-
 	return error;
 }
 
@@ -695,17 +691,6 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		if (error)
 			goto unlock_page;
 		vmf->page = page;
-		if (!page) {
-			i_mmap_lock_read(mapping);
-			/* Check we didn't race with truncate */
-			size = (i_size_read(inode) + PAGE_SIZE - 1) >>
-								PAGE_SHIFT;
-			if (vmf->pgoff >= size) {
-				i_mmap_unlock_read(mapping);
-				error = -EIO;
-				goto out;
-			}
-		}
 		return VM_FAULT_LOCKED;
 	}
 
@@ -895,8 +880,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		truncate_pagecache_range(inode, lstart, lend);
 	}
 
-	i_mmap_lock_read(mapping);
-
 	/*
 	 * If a truncate happened while we were allocating blocks, we may
 	 * leave blocks allocated to the file that are beyond EOF.  We can't
@@ -1013,8 +996,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	}
 
  out:
-	i_mmap_unlock_read(mapping);
-
 	if (buffer_unwritten(&bh))
 		complete_unwritten(&bh, !(result & VM_FAULT_ERROR));
 
diff --git a/mm/memory.c b/mm/memory.c
index 8132787ae4d5..13f76eb08f33 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2430,8 +2430,6 @@ void unmap_mapping_range(struct address_space *mapping,
 	if (details.last_index < details.first_index)
 		details.last_index = ULONG_MAX;
 
-
-	/* DAX uses i_mmap_lock to serialise file truncate vs page fault */
 	i_mmap_lock_write(mapping);
 	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
 		unmap_mapping_range_tree(&mapping->i_mmap, &details);
@@ -3019,12 +3017,6 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (fault_page) {
 			unlock_page(fault_page);
 			page_cache_release(fault_page);
-		} else {
-			/*
-			 * The fault handler has no page to lock, so it holds
-			 * i_mmap_lock for read to protect against truncate.
-			 */
-			i_mmap_unlock_read(vma->vm_file->f_mapping);
 		}
 		goto uncharge_out;
 	}
@@ -3035,12 +3027,6 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (fault_page) {
 		unlock_page(fault_page);
 		page_cache_release(fault_page);
-	} else {
-		/*
-		 * The fault handler has no page to lock, so it holds
-		 * i_mmap_lock for read to protect against truncate.
-		 */
-		i_mmap_unlock_read(vma->vm_file->f_mapping);
 	}
 	return ret;
 uncharge_out:
-- 
2.6.2



* [PATCH 06/12] dax: Remove complete_unwritten argument
  2016-03-10 19:18 ` Jan Kara
@ 2016-03-10 19:18   ` Jan Kara
  -1 siblings, 0 replies; 64+ messages in thread
From: Jan Kara @ 2016-03-10 19:18 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Jan Kara, linux-nvdimm, NeilBrown, Wilcox, Matthew R

Fault handlers currently take a complete_unwritten argument to convert
unwritten extents after PTEs are updated. However, no filesystem uses this
anymore because the code is racy. Remove the unused argument.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c      |  4 ++--
 fs/dax.c            | 43 +++++++++----------------------------------
 fs/ext2/file.c      |  4 ++--
 fs/ext4/file.c      |  4 ++--
 fs/xfs/xfs_file.c   |  7 +++----
 include/linux/dax.h | 17 +++++++----------
 include/linux/fs.h  |  1 -
 7 files changed, 25 insertions(+), 55 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 826b164a4b5b..d7c6b12b5591 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1742,7 +1742,7 @@ static const struct address_space_operations def_blk_aops = {
  */
 static int blkdev_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	return __dax_fault(vma, vmf, blkdev_get_block, NULL);
+	return __dax_fault(vma, vmf, blkdev_get_block);
 }
 
 static int blkdev_dax_pfn_mkwrite(struct vm_area_struct *vma,
@@ -1754,7 +1754,7 @@ static int blkdev_dax_pfn_mkwrite(struct vm_area_struct *vma,
 static int blkdev_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 		pmd_t *pmd, unsigned int flags)
 {
-	return __dax_pmd_fault(vma, addr, pmd, flags, blkdev_get_block, NULL);
+	return __dax_pmd_fault(vma, addr, pmd, flags, blkdev_get_block);
 }
 
 static const struct vm_operations_struct blkdev_dax_vm_ops = {
diff --git a/fs/dax.c b/fs/dax.c
index e409e8fc13b7..f53b13d9644b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -603,19 +603,13 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
  * @vma: The virtual memory area where the fault occurred
  * @vmf: The description of the fault
  * @get_block: The filesystem method used to translate file offsets to blocks
- * @complete_unwritten: The filesystem method used to convert unwritten blocks
- *	to written so the data written to them is exposed. This is required for
- *	required by write faults for filesystems that will return unwritten
- *	extent mappings from @get_block, but it is optional for reads as
- *	dax_insert_mapping() will always zero unwritten blocks. If the fs does
- *	not support unwritten extents, the it should pass NULL.
  *
  * When a page fault occurs, filesystems may call this helper in their
  * fault handler for DAX files. __dax_fault() assumes the caller has done all
  * the necessary locking for the page fault to proceed successfully.
  */
 int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
-			get_block_t get_block, dax_iodone_t complete_unwritten)
+			get_block_t get_block)
 {
 	struct file *file = vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
@@ -707,23 +701,9 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		page = NULL;
 	}
 
-	/*
-	 * If we successfully insert the new mapping over an unwritten extent,
-	 * we need to ensure we convert the unwritten extent. If there is an
-	 * error inserting the mapping, the filesystem needs to leave it as
-	 * unwritten to prevent exposure of the stale underlying data to
-	 * userspace, but we still need to call the completion function so
-	 * the private resources on the mapping buffer can be released. We
-	 * indicate what the callback should do via the uptodate variable, same
-	 * as for normal BH based IO completions.
-	 */
+	/* Filesystem should not return unwritten buffers to us! */
+	WARN_ON_ONCE(buffer_unwritten(&bh));
 	error = dax_insert_mapping(inode, &bh, vma, vmf);
-	if (buffer_unwritten(&bh)) {
-		if (complete_unwritten)
-			complete_unwritten(&bh, !error);
-		else
-			WARN_ON_ONCE(!(vmf->flags & FAULT_FLAG_WRITE));
-	}
 
  out:
 	if (error == -ENOMEM)
@@ -752,7 +732,7 @@ EXPORT_SYMBOL(__dax_fault);
  * fault handler for DAX files.
  */
 int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
-	      get_block_t get_block, dax_iodone_t complete_unwritten)
+	      get_block_t get_block)
 {
 	int result;
 	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
@@ -761,7 +741,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		sb_start_pagefault(sb);
 		file_update_time(vma->vm_file);
 	}
-	result = __dax_fault(vma, vmf, get_block, complete_unwritten);
+	result = __dax_fault(vma, vmf, get_block);
 	if (vmf->flags & FAULT_FLAG_WRITE)
 		sb_end_pagefault(sb);
 
@@ -795,8 +775,7 @@ static void __dax_dbg(struct buffer_head *bh, unsigned long address,
 #define dax_pmd_dbg(bh, address, reason)	__dax_dbg(bh, address, reason, "dax_pmd")
 
 int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
-		pmd_t *pmd, unsigned int flags, get_block_t get_block,
-		dax_iodone_t complete_unwritten)
+		pmd_t *pmd, unsigned int flags, get_block_t get_block)
 {
 	struct file *file = vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
@@ -855,6 +834,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		if (get_block(inode, block, &bh, 1) != 0)
 			return VM_FAULT_SIGBUS;
 		alloc = true;
+		WARN_ON_ONCE(buffer_unwritten(&bh));
 	}
 
 	bdev = bh.b_bdev;
@@ -996,9 +976,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	}
 
  out:
-	if (buffer_unwritten(&bh))
-		complete_unwritten(&bh, !(result & VM_FAULT_ERROR));
-
 	return result;
 
  fallback:
@@ -1018,8 +995,7 @@ EXPORT_SYMBOL_GPL(__dax_pmd_fault);
  * pmd_fault handler for DAX files.
  */
 int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
-			pmd_t *pmd, unsigned int flags, get_block_t get_block,
-			dax_iodone_t complete_unwritten)
+			pmd_t *pmd, unsigned int flags, get_block_t get_block)
 {
 	int result;
 	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
@@ -1028,8 +1004,7 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		sb_start_pagefault(sb);
 		file_update_time(vma->vm_file);
 	}
-	result = __dax_pmd_fault(vma, address, pmd, flags, get_block,
-				complete_unwritten);
+	result = __dax_pmd_fault(vma, address, pmd, flags, get_block);
 	if (flags & FAULT_FLAG_WRITE)
 		sb_end_pagefault(sb);
 
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index c1400b109805..868c02317b05 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -51,7 +51,7 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	}
 	down_read(&ei->dax_sem);
 
-	ret = __dax_fault(vma, vmf, ext2_get_block, NULL);
+	ret = __dax_fault(vma, vmf, ext2_get_block);
 
 	up_read(&ei->dax_sem);
 	if (vmf->flags & FAULT_FLAG_WRITE)
@@ -72,7 +72,7 @@ static int ext2_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 	}
 	down_read(&ei->dax_sem);
 
-	ret = __dax_pmd_fault(vma, addr, pmd, flags, ext2_get_block, NULL);
+	ret = __dax_pmd_fault(vma, addr, pmd, flags, ext2_get_block);
 
 	up_read(&ei->dax_sem);
 	if (flags & FAULT_FLAG_WRITE)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 4cd318f31cbe..3606dc96f96f 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -213,7 +213,7 @@ static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	if (IS_ERR(handle))
 		result = VM_FAULT_SIGBUS;
 	else
-		result = __dax_fault(vma, vmf, ext4_dax_mmap_get_block, NULL);
+		result = __dax_fault(vma, vmf, ext4_dax_mmap_get_block);
 
 	if (write) {
 		if (!IS_ERR(handle))
@@ -249,7 +249,7 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 		result = VM_FAULT_SIGBUS;
 	else
 		result = __dax_pmd_fault(vma, addr, pmd, flags,
-				ext4_dax_mmap_get_block, NULL);
+				ext4_dax_mmap_get_block);
 
 	if (write) {
 		if (!IS_ERR(handle))
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 52883ac3cf84..2ecdb39d2424 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1526,7 +1526,7 @@ xfs_filemap_page_mkwrite(
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
 	if (IS_DAX(inode)) {
-		ret = __dax_mkwrite(vma, vmf, xfs_get_blocks_dax_fault, NULL);
+		ret = __dax_mkwrite(vma, vmf, xfs_get_blocks_dax_fault);
 	} else {
 		ret = block_page_mkwrite(vma, vmf, xfs_get_blocks);
 		ret = block_page_mkwrite_return(ret);
@@ -1560,7 +1560,7 @@ xfs_filemap_fault(
 		 * changes to xfs_get_blocks_direct() to map unwritten extent
 		 * ioend for conversion on read-only mappings.
 		 */
-		ret = __dax_fault(vma, vmf, xfs_get_blocks_dax_fault, NULL);
+		ret = __dax_fault(vma, vmf, xfs_get_blocks_dax_fault);
 	} else
 		ret = filemap_fault(vma, vmf);
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
@@ -1597,8 +1597,7 @@ xfs_filemap_pmd_fault(
 	}
 
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
-	ret = __dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault,
-			      NULL);
+	ret = __dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault);
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
 	if (flags & FAULT_FLAG_WRITE)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 636dd59ab505..7c45ac7ea1d1 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -10,10 +10,8 @@ ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t,
 int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size);
 int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
-int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
-		dax_iodone_t);
-int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
-		dax_iodone_t);
+int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
+int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 
 #ifdef CONFIG_FS_DAX
 struct page *read_dax_sector(struct block_device *bdev, sector_t n);
@@ -27,21 +25,20 @@ static inline struct page *read_dax_sector(struct block_device *bdev,
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
-				unsigned int flags, get_block_t, dax_iodone_t);
+				unsigned int flags, get_block_t);
 int __dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
-				unsigned int flags, get_block_t, dax_iodone_t);
+				unsigned int flags, get_block_t);
 #else
 static inline int dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
-				pmd_t *pmd, unsigned int flags, get_block_t gb,
-				dax_iodone_t di)
+				pmd_t *pmd, unsigned int flags, get_block_t gb)
 {
 	return VM_FAULT_FALLBACK;
 }
 #define __dax_pmd_fault dax_pmd_fault
 #endif
 int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
-#define dax_mkwrite(vma, vmf, gb, iod)		dax_fault(vma, vmf, gb, iod)
-#define __dax_mkwrite(vma, vmf, gb, iod)	__dax_fault(vma, vmf, gb, iod)
+#define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
+#define __dax_mkwrite(vma, vmf, gb)	__dax_fault(vma, vmf, gb)
 
 static inline bool vma_is_dax(struct vm_area_struct *vma)
 {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ae681002100a..a8c3c9c6b4d9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -72,7 +72,6 @@ typedef int (get_block_t)(struct inode *inode, sector_t iblock,
 			struct buffer_head *bh_result, int create);
 typedef void (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 			ssize_t bytes, void *private);
-typedef void (dax_iodone_t)(struct buffer_head *bh_map, int uptodate);
 
 #define MAY_EXEC		0x00000001
 #define MAY_WRITE		0x00000002
-- 
2.6.2


* [PATCH 07/12] dax: Fix data corruption for written and mmapped files
  2016-03-10 19:18 ` Jan Kara
@ 2016-03-10 19:18   ` Jan Kara
  -1 siblings, 0 replies; 64+ messages in thread
From: Jan Kara @ 2016-03-10 19:18 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Jan Kara, linux-nvdimm, NeilBrown, Wilcox, Matthew R

When a fault to a hole races with a write filling the hole, block zeroing
in __dax_fault() can overwrite the data copied by the write. Since the
filesystem is supposed to provide pre-zeroed blocks for faults anyway, just
remove the racy zeroing from the DAX code. The only catch is read faults
over an unwritten block, where __dax_fault() previously mapped the block
into the page tables anyway. For that case we now have to fall back to
using a hole page.
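
A possible interleaving (my illustration, not from the original message):

	/*
	 *   write() path                      read fault on the same hole
	 *   ------------                      ---------------------------
	 *   get_block() allocates the block
	 *   copies user data into the block
	 *                                     __dax_fault() sees a new /
	 *                                     unwritten buffer and zeroes it
	 *                                     -> the just-written data is lost
	 */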

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index f53b13d9644b..4aa440f9305b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -580,11 +580,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 		error = PTR_ERR(dax.addr);
 		goto out;
 	}
-
-	if (buffer_unwritten(bh) || buffer_new(bh)) {
-		clear_pmem(dax.addr, PAGE_SIZE);
-		wmb_pmem();
-	}
 	dax_unmap_atomic(bdev, &dax);
 
 	error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
@@ -661,7 +656,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	if (error)
 		goto unlock_page;
 
-	if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page) {
+	if (!buffer_mapped(&bh) && !vmf->cow_page) {
 		if (vmf->flags & FAULT_FLAG_WRITE) {
 			error = get_block(inode, block, &bh, 1);
 			count_vm_event(PGMAJFAULT);
@@ -933,8 +928,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		}
 
 		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
-			clear_pmem(dax.addr, PMD_SIZE);
-			wmb_pmem();
 			count_vm_event(PGMAJFAULT);
 			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
 			result |= VM_FAULT_MAJOR;
-- 
2.6.2


* [PATCH 08/12] dax: Fix bogus fault return value on cow faults
  2016-03-10 19:18 ` Jan Kara
@ 2016-03-10 19:18   ` Jan Kara
  -1 siblings, 0 replies; 64+ messages in thread
From: Jan Kara @ 2016-03-10 19:18 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Jan Kara, linux-nvdimm, NeilBrown, Wilcox, Matthew R

In case of a COW fault when there is no page mapped at the COW location,
we return VM_FAULT_LOCKED even though we don't return any page from the
fault. The code in do_cow_fault() carefully handles the case when no page
is actually returned, so no harm is done, but this is still a bug waiting
to happen.
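
The point is the fault-handler contract (a sketch of the caller side,
assuming the usual convention; see also the do_cow_fault() hunks in patch
05): VM_FAULT_LOCKED promises that vmf->page is a valid, locked page.

	ret = vma->vm_ops->fault(vma, vmf);
	if (ret & VM_FAULT_LOCKED)
		unlock_page(vmf->page);	/* must not be NULL here */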

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 4aa440f9305b..bf39f16e5390 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -680,7 +680,9 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		if (error)
 			goto unlock_page;
 		vmf->page = page;
-		return VM_FAULT_LOCKED;
+		if (page)
+			return VM_FAULT_LOCKED;
+		return 0;
 	}
 
 	/* Check we didn't race with a read fault installing a new page */
-- 
2.6.2


* [PATCH 09/12] dax: Allow DAX code to replace exceptional entries
  2016-03-10 19:18 ` Jan Kara
@ 2016-03-10 19:18   ` Jan Kara
  -1 siblings, 0 replies; 64+ messages in thread
From: Jan Kara @ 2016-03-10 19:18 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Jan Kara, linux-nvdimm, NeilBrown, Wilcox, Matthew R

Currently we forbid page_cache_tree_insert() to replace exceptional radix
tree entries for DAX inodes. However, to make DAX faults race free we will
lock radix tree entries, and when a hole is created we need to replace
such a locked radix tree entry with a hole page. So modify
page_cache_tree_insert() to allow that.
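
For illustration only (not part of the patch): the empty locked entry that
page_cache_tree_insert() now tolerates is simply the exceptional marker
with the new lock bit set. A minimal sketch, assuming the DAX_ENTRY_LOCK
definition from this series and the flags from <linux/radix-tree.h>; the
helper name is made up for the sketch:

	/*
	 * DAX_ENTRY_LOCK reuses the indirect-pointer bit, which exceptional
	 * entries never use, so the bit is free to act as a per-entry lock.
	 */
	#define DAX_ENTRY_LOCK	RADIX_TREE_INDIRECT_PTR

	static inline bool dax_is_empty_locked_entry(void *entry)
	{
		return entry == (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
					 DAX_ENTRY_LOCK);
	}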

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/dax.h |  6 ++++++
 mm/filemap.c        | 18 +++++++++++-------
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/include/linux/dax.h b/include/linux/dax.h
index 7c45ac7ea1d1..4b63923e1f8d 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -3,8 +3,14 @@
 
 #include <linux/fs.h>
 #include <linux/mm.h>
+#include <linux/radix-tree.h>
 #include <asm/pgtable.h>
 
+/*
+ * Since exceptional entries do not use indirect bit, we reuse it as a lock bit
+ */
+#define DAX_ENTRY_LOCK RADIX_TREE_INDIRECT_PTR
+
 ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t,
 		  get_block_t, dio_iodone_t, int flags);
 int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size);
diff --git a/mm/filemap.c b/mm/filemap.c
index 3461d97ecb30..262e6eff0b66 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -583,14 +583,18 @@ static int page_cache_tree_insert(struct address_space *mapping,
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
 
-		if (WARN_ON(dax_mapping(mapping)))
-			return -EINVAL;
-
-		if (shadowp)
-			*shadowp = p;
 		mapping->nrexceptional--;
-		if (node)
-			workingset_node_shadows_dec(node);
+		if (!dax_mapping(mapping)) {
+			if (shadowp)
+				*shadowp = p;
+			if (node)
+				workingset_node_shadows_dec(node);
+		} else {
+			/* DAX can replace empty locked entry with a hole */
+			WARN_ON_ONCE(p !=
+				(void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
+					 DAX_ENTRY_LOCK));
+		}
 	}
 	radix_tree_replace_slot(slot, page);
 	mapping->nrpages++;
-- 
2.6.2

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 09/12] dax: Allow DAX code to replace exceptional entries
@ 2016-03-10 19:18   ` Jan Kara
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Kara @ 2016-03-10 19:18 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Wilcox, Matthew R, Ross Zwisler, Dan Williams, linux-nvdimm,
	NeilBrown, Jan Kara

Currently we forbid page_cache_tree_insert() to replace exceptional radix
tree entries for DAX inodes. However, to make DAX faults race free we will
lock radix tree entries, and when a hole is created we need to replace
such a locked radix tree entry with a hole page. So modify
page_cache_tree_insert() to allow that.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/dax.h |  6 ++++++
 mm/filemap.c        | 18 +++++++++++-------
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/include/linux/dax.h b/include/linux/dax.h
index 7c45ac7ea1d1..4b63923e1f8d 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -3,8 +3,14 @@
 
 #include <linux/fs.h>
 #include <linux/mm.h>
+#include <linux/radix-tree.h>
 #include <asm/pgtable.h>
 
+/*
+ * Since exceptional entries do not use indirect bit, we reuse it as a lock bit
+ */
+#define DAX_ENTRY_LOCK RADIX_TREE_INDIRECT_PTR
+
 ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t,
 		  get_block_t, dio_iodone_t, int flags);
 int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size);
diff --git a/mm/filemap.c b/mm/filemap.c
index 3461d97ecb30..262e6eff0b66 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -583,14 +583,18 @@ static int page_cache_tree_insert(struct address_space *mapping,
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
 
-		if (WARN_ON(dax_mapping(mapping)))
-			return -EINVAL;
-
-		if (shadowp)
-			*shadowp = p;
 		mapping->nrexceptional--;
-		if (node)
-			workingset_node_shadows_dec(node);
+		if (!dax_mapping(mapping)) {
+			if (shadowp)
+				*shadowp = p;
+			if (node)
+				workingset_node_shadows_dec(node);
+		} else {
+			/* DAX can replace empty locked entry with a hole */
+			WARN_ON_ONCE(p !=
+				(void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
+					 DAX_ENTRY_LOCK));
+		}
 	}
 	radix_tree_replace_slot(slot, page);
 	mapping->nrpages++;
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 10/12] dax: Remove redundant inode size checks
  2016-03-10 19:18 ` Jan Kara
@ 2016-03-10 19:18   ` Jan Kara
  -1 siblings, 0 replies; 64+ messages in thread
From: Jan Kara @ 2016-03-10 19:18 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Jan Kara, linux-nvdimm, NeilBrown, Wilcox, Matthew R

Callers of DAX fault handlers must make sure these calls cannot race
with truncate. Thus it is enough to check the inode size when entering the
function and we don't have to recheck it later in the handler. Note that
the inode size itself can be decreased while the fault handler runs, but
filesystem locking protects against any radix tree or block mapping
changes resulting from the truncate, and that is what we really care
about.
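
As an illustrative aside (not from this patch), the only check that
remains is the one done once at fault entry, something along these lines
(matching the existing code at the top of __dax_fault()):

	/*
	 * Done once on entry; the caller's locks keep truncate from
	 * changing the radix tree or block mapping underneath us.
	 */
	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
	if (vmf->pgoff >= size)
		return VM_FAULT_SIGBUS;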

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 48 ------------------------------------------------
 1 file changed, 48 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index bf39f16e5390..3951237ff248 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -316,20 +316,12 @@ EXPORT_SYMBOL_GPL(dax_do_io);
 static int dax_load_hole(struct address_space *mapping, struct page *page,
 							struct vm_fault *vmf)
 {
-	unsigned long size;
 	struct inode *inode = mapping->host;
 	if (!page)
 		page = find_or_create_page(mapping, vmf->pgoff,
 						GFP_KERNEL | __GFP_ZERO);
 	if (!page)
 		return VM_FAULT_OOM;
-	/* Recheck i_size under page lock to avoid truncate race */
-	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	if (vmf->pgoff >= size) {
-		unlock_page(page);
-		page_cache_release(page);
-		return VM_FAULT_SIGBUS;
-	}
 
 	vmf->page = page;
 	return VM_FAULT_LOCKED;
@@ -560,22 +552,8 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 		.sector = to_sector(bh, inode),
 		.size = bh->b_size,
 	};
-	pgoff_t size;
 	int error;
 
-	/*
-	 * Check truncate didn't happen while we were allocating a block.
-	 * If it did, this block may or may not be still allocated to the
-	 * file.  We can't tell the filesystem to free it because we can't
-	 * take i_mutex here.  In the worst case, the file still has blocks
-	 * allocated past the end of the file.
-	 */
-	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	if (unlikely(vmf->pgoff >= size)) {
-		error = -EIO;
-		goto out;
-	}
-
 	if (dax_map_atomic(bdev, &dax) < 0) {
 		error = PTR_ERR(dax.addr);
 		goto out;
@@ -639,15 +617,6 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			page_cache_release(page);
 			goto repeat;
 		}
-		size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
-		if (unlikely(vmf->pgoff >= size)) {
-			/*
-			 * We have a struct page covering a hole in the file
-			 * from a read fault and we've raced with a truncate
-			 */
-			error = -EIO;
-			goto unlock_page;
-		}
 	}
 
 	error = get_block(inode, block, &bh, 0);
@@ -857,23 +826,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		truncate_pagecache_range(inode, lstart, lend);
 	}
 
-	/*
-	 * If a truncate happened while we were allocating blocks, we may
-	 * leave blocks allocated to the file that are beyond EOF.  We can't
-	 * take i_mutex here, so just leave them hanging; they'll be freed
-	 * when the file is deleted.
-	 */
-	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	if (pgoff >= size) {
-		result = VM_FAULT_SIGBUS;
-		goto out;
-	}
-	if ((pgoff | PG_PMD_COLOUR) >= size) {
-		dax_pmd_dbg(&bh, address,
-				"offset + huge page size > file size");
-		goto fallback;
-	}
-
 	if (!write && !buffer_mapped(&bh) && buffer_uptodate(&bh)) {
 		spinlock_t *ptl;
 		pmd_t entry;
-- 
2.6.2

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 10/12] dax: Remove redundant inode size checks
@ 2016-03-10 19:18   ` Jan Kara
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Kara @ 2016-03-10 19:18 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Wilcox, Matthew R, Ross Zwisler, Dan Williams, linux-nvdimm,
	NeilBrown, Jan Kara

Callers of DAX fault handlers must make sure these calls cannot race
with truncate. Thus it is enough to check the inode size when entering the
function and we don't have to recheck it later in the handler. Note that
the inode size itself can be decreased while the fault handler runs, but
filesystem locking protects against any radix tree or block mapping
changes resulting from the truncate, and that is what we really care
about.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 48 ------------------------------------------------
 1 file changed, 48 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index bf39f16e5390..3951237ff248 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -316,20 +316,12 @@ EXPORT_SYMBOL_GPL(dax_do_io);
 static int dax_load_hole(struct address_space *mapping, struct page *page,
 							struct vm_fault *vmf)
 {
-	unsigned long size;
 	struct inode *inode = mapping->host;
 	if (!page)
 		page = find_or_create_page(mapping, vmf->pgoff,
 						GFP_KERNEL | __GFP_ZERO);
 	if (!page)
 		return VM_FAULT_OOM;
-	/* Recheck i_size under page lock to avoid truncate race */
-	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	if (vmf->pgoff >= size) {
-		unlock_page(page);
-		page_cache_release(page);
-		return VM_FAULT_SIGBUS;
-	}
 
 	vmf->page = page;
 	return VM_FAULT_LOCKED;
@@ -560,22 +552,8 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 		.sector = to_sector(bh, inode),
 		.size = bh->b_size,
 	};
-	pgoff_t size;
 	int error;
 
-	/*
-	 * Check truncate didn't happen while we were allocating a block.
-	 * If it did, this block may or may not be still allocated to the
-	 * file.  We can't tell the filesystem to free it because we can't
-	 * take i_mutex here.  In the worst case, the file still has blocks
-	 * allocated past the end of the file.
-	 */
-	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	if (unlikely(vmf->pgoff >= size)) {
-		error = -EIO;
-		goto out;
-	}
-
 	if (dax_map_atomic(bdev, &dax) < 0) {
 		error = PTR_ERR(dax.addr);
 		goto out;
@@ -639,15 +617,6 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			page_cache_release(page);
 			goto repeat;
 		}
-		size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
-		if (unlikely(vmf->pgoff >= size)) {
-			/*
-			 * We have a struct page covering a hole in the file
-			 * from a read fault and we've raced with a truncate
-			 */
-			error = -EIO;
-			goto unlock_page;
-		}
 	}
 
 	error = get_block(inode, block, &bh, 0);
@@ -857,23 +826,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		truncate_pagecache_range(inode, lstart, lend);
 	}
 
-	/*
-	 * If a truncate happened while we were allocating blocks, we may
-	 * leave blocks allocated to the file that are beyond EOF.  We can't
-	 * take i_mutex here, so just leave them hanging; they'll be freed
-	 * when the file is deleted.
-	 */
-	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	if (pgoff >= size) {
-		result = VM_FAULT_SIGBUS;
-		goto out;
-	}
-	if ((pgoff | PG_PMD_COLOUR) >= size) {
-		dax_pmd_dbg(&bh, address,
-				"offset + huge page size > file size");
-		goto fallback;
-	}
-
 	if (!write && !buffer_mapped(&bh) && buffer_uptodate(&bh)) {
 		spinlock_t *ptl;
 		pmd_t entry;
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 11/12] dax: Disable huge page handling
  2016-03-10 19:18 ` Jan Kara
                   ` (10 preceding siblings ...)
  (?)
@ 2016-03-10 19:18 ` Jan Kara
  2016-03-10 19:34     ` Dan Williams
  -1 siblings, 1 reply; 64+ messages in thread
From: Jan Kara @ 2016-03-10 19:18 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Wilcox, Matthew R, Ross Zwisler, Dan Williams, linux-nvdimm,
	NeilBrown, Jan Kara

Currently the handling of huge pages for DAX is racy. For example, the
following can happen:

CPU0 (THP write fault)			CPU1 (normal read fault)

__dax_pmd_fault()			__dax_fault()
  get_block(inode, block, &bh, 0) -> not mapped
					get_block(inode, block, &bh, 0)
					  -> not mapped
  if (!buffer_mapped(&bh) && write)
    get_block(inode, block, &bh, 1) -> allocates blocks
  truncate_pagecache_range(inode, lstart, lend);
					dax_load_hole();

This results in data corruption since the process on CPU1 won't see the
changes CPU0 made to the file.

The race can happen even when two normal faults race; however, with THP
the situation is even worse because the two faults don't operate on the
same radix tree entries, and we want to use these entries for
serialization. So disable THP support in the DAX code for now.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c            | 2 +-
 include/linux/dax.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 3951237ff248..7148fcdb2c92 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -715,7 +715,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 }
 EXPORT_SYMBOL_GPL(dax_fault);
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if 0
 /*
  * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
  * more often than one might expect in the below function.
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 4b63923e1f8d..fd28d824254b 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -29,7 +29,7 @@ static inline struct page *read_dax_sector(struct block_device *bdev,
 }
 #endif
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if 0
 int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
 				unsigned int flags, get_block_t);
 int __dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 12/12] dax: New fault locking
  2016-03-10 19:18 ` Jan Kara
@ 2016-03-10 19:18   ` Jan Kara
  -1 siblings, 0 replies; 64+ messages in thread
From: Jan Kara @ 2016-03-10 19:18 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Jan Kara, linux-nvdimm, NeilBrown, Wilcox, Matthew R

Currently DAX page fault locking is racy.

CPU0 (write fault)		CPU1 (read fault)

__dax_fault()			__dax_fault()
  get_block(inode, block, &bh, 0) -> not mapped
				  get_block(inode, block, &bh, 0)
				    -> not mapped
  if (!buffer_mapped(&bh))
    if (vmf->flags & FAULT_FLAG_WRITE)
      get_block(inode, block, &bh, 1) -> allocates blocks
  if (page) -> no
				  if (!buffer_mapped(&bh))
				    if (vmf->flags & FAULT_FLAG_WRITE) {
				    } else {
				      dax_load_hole();
				    }
  dax_insert_mapping()

And we are in a situation where we fail in dax_radix_entry() with -EIO.

Another problem with the current DAX page fault locking is that there is
no race-free way to clear the dirty tag in the radix tree. We can always
end up with a clean radix tree and dirty data in the CPU cache.

We fix the first problem by introducing locking of exceptional radix
tree entries in DAX mappings, which acts very similarly to the page lock
and thus properly synchronizes faults against the same mapping index.
The same lock can later be used to avoid races when clearing the radix
tree dirty tag.
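
For orientation only (not part of the patch text), the fault path below
boils down to this locking pattern, using the grab_mapping_entry() and
put_mapping_entry() helpers introduced in this patch; error handling and
the actual mapping work are elided in the sketch:

	/* Per-index serialization in __dax_fault(), sketched */
	entry = grab_mapping_entry(mapping, vmf->pgoff);	/* locks the slot */
	if (IS_ERR(entry))
		return VM_FAULT_SIGBUS;		/* simplified; the patch maps errors */
	/* ... get_block() + dax_insert_mapping() or dax_load_hole() ... */
	put_mapping_entry(mapping, vmf->pgoff, entry);	/* unlock and wake waiters */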

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c            | 470 ++++++++++++++++++++++++++++++++++++++--------------
 include/linux/dax.h |   1 +
 mm/truncate.c       |  62 ++++---
 3 files changed, 374 insertions(+), 159 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 7148fcdb2c92..7d2e34f90654 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -41,6 +41,30 @@
 #define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
 		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
 
+/* We choose 4096 entries - same as per-zone page wait tables */
+#define DAX_WAIT_TABLE_BITS 12
+#define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
+
+wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
+
+static int __init init_dax_wait_table(void)
+{
+	int i;
+
+	for (i = 0; i < DAX_WAIT_TABLE_ENTRIES; i++)
+		init_waitqueue_head(wait_table + i);
+	return 0;
+}
+fs_initcall(init_dax_wait_table);
+
+static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
+					      pgoff_t index)
+{
+	unsigned long hash = hash_long((unsigned long)mapping ^ index,
+				       DAX_WAIT_TABLE_BITS);
+	return wait_table + hash;
+}
+
 static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
 {
 	struct request_queue *q = bdev->bd_queue;
@@ -306,6 +330,214 @@ ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
 EXPORT_SYMBOL_GPL(dax_do_io);
 
 /*
+ * DAX radix tree locking
+ */
+struct exceptional_entry_key {
+	struct radix_tree_root *root;
+	unsigned long index;
+};
+
+struct wait_exceptional_entry_queue {
+	wait_queue_t wait;
+	struct exceptional_entry_key key;
+};
+
+static int wake_exceptional_entry_func(wait_queue_t *wait, unsigned mode,
+				       int sync, void *keyp)
+{
+	struct exceptional_entry_key *key = keyp;
+	struct wait_exceptional_entry_queue *ewait =
+		container_of(wait, struct wait_exceptional_entry_queue, wait);
+
+	if (key->root != ewait->key.root || key->index != ewait->key.index)
+		return 0;
+	return autoremove_wake_function(wait, mode, sync, NULL);
+}
+
+static inline int slot_locked(void **v)
+{
+	unsigned long l = *(unsigned long *)v;
+	return l & DAX_ENTRY_LOCK;
+}
+
+static inline void *lock_slot(void **v)
+{
+	unsigned long *l = (unsigned long *)v;
+	return (void*)(*l |= DAX_ENTRY_LOCK);
+}
+
+static inline void *unlock_slot(void **v)
+{
+	unsigned long *l = (unsigned long *)v;
+	return (void*)(*l &= ~(unsigned long)DAX_ENTRY_LOCK);
+}
+
+/*
+ * Lookup entry in radix tree, wait for it to become unlocked if it is
+ * exceptional entry and return.
+ *
+ * The function must be called with mapping->tree_lock held.
+ */
+static void *lookup_unlocked_mapping_entry(struct address_space *mapping,
+					   pgoff_t index, void ***slotp)
+{
+	void *ret, **slot;
+	struct wait_exceptional_entry_queue wait;
+	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+
+	init_wait(&wait.wait);
+	wait.wait.func = wake_exceptional_entry_func;
+	wait.key.root = &mapping->page_tree;
+	wait.key.index = index;
+
+	for (;;) {
+		ret = __radix_tree_lookup(&mapping->page_tree, index, NULL,
+					  &slot);
+		if (!ret || !radix_tree_exceptional_entry(ret) ||
+		    !slot_locked(slot)) {
+			if (slotp)
+				*slotp = slot;
+			return ret;
+		}
+		prepare_to_wait_exclusive(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
+		spin_unlock_irq(&mapping->tree_lock);
+		schedule();
+		finish_wait(wq, &wait.wait);
+		spin_lock_irq(&mapping->tree_lock);
+	}
+}
+
+/*
+ * Find radix tree entry at given index. If it points to a page, return with
+ * the page locked. If it points to the exceptional entry, return with the
+ * radix tree entry locked. If the radix tree doesn't contain given index,
+ * create empty exceptional entry for the index and return with it locked.
+ *
+ * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
+ * persistent memory the benefit is doubtful. We can add that later if we can
+ * show it helps.
+ */
+static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
+{
+	void *ret, **slot;
+
+restart:
+	spin_lock_irq(&mapping->tree_lock);
+	ret = lookup_unlocked_mapping_entry(mapping, index, &slot);
+	/* No entry for given index? Make sure radix tree is big enough. */
+	if (!ret) {
+		int err;
+
+		spin_unlock_irq(&mapping->tree_lock);
+		err = radix_tree_preload(mapping_gfp_mask(mapping));
+		if (err)
+			return ERR_PTR(err);
+		ret = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | DAX_ENTRY_LOCK);
+		spin_lock_irq(&mapping->tree_lock);
+		err = radix_tree_insert(&mapping->page_tree, index, ret);
+		radix_tree_preload_end();
+		if (err) {
+			spin_unlock_irq(&mapping->tree_lock);
+			/* Someone already created the entry? */
+			if (err == -EEXIST)
+				goto restart;
+			return ERR_PTR(err);
+		}
+		/* Good, we have inserted empty locked entry into the tree. */
+		mapping->nrexceptional++;
+		spin_unlock_irq(&mapping->tree_lock);
+		return ret;
+	}
+	/* Normal page in radix tree? */
+	if (!radix_tree_exceptional_entry(ret)) {
+		struct page *page = ret;
+
+		page_cache_get(page);
+		spin_unlock_irq(&mapping->tree_lock);
+		lock_page(page);
+		/* Page got truncated? Retry... */
+		if (unlikely(page->mapping != mapping)) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto restart;
+		}
+		return page;
+	}
+	ret = lock_slot(slot);
+	spin_unlock_irq(&mapping->tree_lock);
+	return ret;
+}
+
+static void unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
+{
+	void *ret, **slot;
+	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+
+	spin_lock_irq(&mapping->tree_lock);
+	ret = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
+	if (WARN_ON_ONCE(!ret || !radix_tree_exceptional_entry(ret))) {
+		spin_unlock_irq(&mapping->tree_lock);
+		return;
+	}
+	if (WARN_ON_ONCE(!slot_locked(slot))) {
+		spin_unlock_irq(&mapping->tree_lock);
+		return;
+	}
+	unlock_slot(slot);
+	spin_unlock_irq(&mapping->tree_lock);
+	if (waitqueue_active(wq)) {
+		struct exceptional_entry_key key;
+
+		key.root = &mapping->page_tree;
+		key.index = index;
+		__wake_up(wq, TASK_NORMAL, 1, &key);
+	}
+}
+
+static void put_mapping_entry(struct address_space *mapping, pgoff_t index,
+			      void *entry)
+{
+	if (!radix_tree_exceptional_entry(entry)) {
+		unlock_page(entry);
+		page_cache_release(entry);
+	} else {
+		unlock_mapping_entry(mapping, index);
+	}
+}
+
+/*
+ * Delete exceptional DAX entry at @index from @mapping. Wait for radix tree
+ * entry to get unlocked before deleting it.
+ */
+int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
+{
+	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+	void *entry;
+
+	spin_lock_irq(&mapping->tree_lock);
+	entry = lookup_unlocked_mapping_entry(mapping, index, NULL);
+	/*
+	 * Caller should make sure radix tree modifications don't race and
+	 * we have seen exceptional entry here before.
+	 */
+	if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry))) {
+		spin_unlock_irq(&mapping->tree_lock);
+		return 0;
+	}
+	radix_tree_delete(&mapping->page_tree, index);
+	mapping->nrexceptional--;
+	spin_unlock_irq(&mapping->tree_lock);
+	if (waitqueue_active(wq)) {
+		struct exceptional_entry_key key;
+
+		key.root = &mapping->page_tree;
+		key.index = index;
+		__wake_up(wq, TASK_NORMAL, 1, &key);
+	}
+	return 1;
+}
+
+/*
  * The user has performed a load from a hole in the file.  Allocating
  * a new page in the file would cause excessive storage usage for
  * workloads with sparse files.  We allocate a page cache page instead.
@@ -313,16 +545,24 @@ EXPORT_SYMBOL_GPL(dax_do_io);
  * otherwise it will simply fall out of the page cache under memory
  * pressure without ever having been dirtied.
  */
-static int dax_load_hole(struct address_space *mapping, struct page *page,
-							struct vm_fault *vmf)
+static int dax_load_hole(struct address_space *mapping, void *entry,
+			 struct vm_fault *vmf)
 {
-	struct inode *inode = mapping->host;
-	if (!page)
-		page = find_or_create_page(mapping, vmf->pgoff,
-						GFP_KERNEL | __GFP_ZERO);
-	if (!page)
-		return VM_FAULT_OOM;
+	struct page *page;
 
+	/* Hole page already exists? Return it...  */
+	if (!radix_tree_exceptional_entry(entry)) {
+		vmf->page = entry;
+		return VM_FAULT_LOCKED;
+	}
+
+	/* This will replace locked radix tree entry with a hole page */
+	page = find_or_create_page(mapping, vmf->pgoff,
+				   vmf->gfp_mask | __GFP_ZERO);
+	if (!page) {
+		put_mapping_entry(mapping, vmf->pgoff, entry);
+		return VM_FAULT_OOM;
+	}
 	vmf->page = page;
 	return VM_FAULT_LOCKED;
 }
@@ -346,77 +586,58 @@ static int copy_user_bh(struct page *to, struct inode *inode,
 	return 0;
 }
 
-#define NO_SECTOR -1
 #define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_CACHE_SHIFT))
 
-static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
-		sector_t sector, bool pmd_entry, bool dirty)
+static void *dax_mapping_entry(struct address_space *mapping, pgoff_t index,
+			       void *entry, sector_t sector, bool dirty,
+			       gfp_t gfp_mask)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
-	pgoff_t pmd_index = DAX_PMD_INDEX(index);
-	int type, error = 0;
-	void *entry;
+	int error = 0;
+	bool hole_fill = false;
+	struct mem_cgroup *memcg;
+	void *ret;
 
-	WARN_ON_ONCE(pmd_entry && !dirty);
 	if (dirty)
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
-	spin_lock_irq(&mapping->tree_lock);
-
-	entry = radix_tree_lookup(page_tree, pmd_index);
-	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD) {
-		index = pmd_index;
-		goto dirty;
+	/* Replacing hole page with block mapping? */
+	if (!radix_tree_exceptional_entry(entry)) {
+		hole_fill = true;
+		error = radix_tree_preload(gfp_mask);
+		if (error)
+			return ERR_PTR(error);
+		memcg = mem_cgroup_begin_page_stat(entry);
 	}
 
-	entry = radix_tree_lookup(page_tree, index);
-	if (entry) {
-		type = RADIX_DAX_TYPE(entry);
-		if (WARN_ON_ONCE(type != RADIX_DAX_PTE &&
-					type != RADIX_DAX_PMD)) {
-			error = -EIO;
+	spin_lock_irq(&mapping->tree_lock);
+	ret = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
+		       DAX_ENTRY_LOCK);
+	if (hole_fill) {
+		__delete_from_page_cache(entry, NULL, memcg);
+		error = radix_tree_insert(page_tree, index, ret);
+		if (error) {
+			ret = ERR_PTR(error);
 			goto unlock;
 		}
+		mapping->nrexceptional++;
+	} else {
+		void **slot;
+		void *ret2;
 
-		if (!pmd_entry || type == RADIX_DAX_PMD)
-			goto dirty;
-
-		/*
-		 * We only insert dirty PMD entries into the radix tree.  This
-		 * means we don't need to worry about removing a dirty PTE
-		 * entry and inserting a clean PMD entry, thus reducing the
-		 * range we would flush with a follow-up fsync/msync call.
-		 */
-		radix_tree_delete(&mapping->page_tree, index);
-		mapping->nrexceptional--;
+		ret2 = __radix_tree_lookup(page_tree, index, NULL, &slot);
+		WARN_ON_ONCE(ret2 != entry);
+		radix_tree_replace_slot(slot, ret);
 	}
-
-	if (sector == NO_SECTOR) {
-		/*
-		 * This can happen during correct operation if our pfn_mkwrite
-		 * fault raced against a hole punch operation.  If this
-		 * happens the pte that was hole punched will have been
-		 * unmapped and the radix tree entry will have been removed by
-		 * the time we are called, but the call will still happen.  We
-		 * will return all the way up to wp_pfn_shared(), where the
-		 * pte_same() check will fail, eventually causing page fault
-		 * to be retried by the CPU.
-		 */
-		goto unlock;
-	}
-
-	error = radix_tree_insert(page_tree, index,
-			RADIX_DAX_ENTRY(sector, pmd_entry));
-	if (error)
-		goto unlock;
-
-	mapping->nrexceptional++;
- dirty:
 	if (dirty)
 		radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);
  unlock:
 	spin_unlock_irq(&mapping->tree_lock);
-	return error;
+	if (hole_fill) {
+		mem_cgroup_end_page_stat(memcg);
+		radix_tree_preload_end();
+	}
+	return ret;
 }
 
 static int dax_writeback_one(struct block_device *bdev,
@@ -542,17 +763,18 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 }
 EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
 
-static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
+static int dax_insert_mapping(struct address_space *mapping,
+			struct buffer_head *bh, void *entry,
 			struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	unsigned long vaddr = (unsigned long)vmf->virtual_address;
-	struct address_space *mapping = inode->i_mapping;
 	struct block_device *bdev = bh->b_bdev;
 	struct blk_dax_ctl dax = {
-		.sector = to_sector(bh, inode),
+		.sector = to_sector(bh, mapping->host),
 		.size = bh->b_size,
 	};
 	int error;
+	void *ret;
 
 	if (dax_map_atomic(bdev, &dax) < 0) {
 		error = PTR_ERR(dax.addr);
@@ -560,14 +782,25 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	}
 	dax_unmap_atomic(bdev, &dax);
 
-	error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
-			vmf->flags & FAULT_FLAG_WRITE);
-	if (error)
+	ret = dax_mapping_entry(mapping, vmf->pgoff, entry, dax.sector,
+			        vmf->flags & FAULT_FLAG_WRITE,
+			        vmf->gfp_mask & ~__GFP_HIGHMEM);
+	if (IS_ERR(ret)) {
+		error = PTR_ERR(ret);
 		goto out;
+	}
+	/* Have we replaced hole page? Unmap and free it. */
+	if (!radix_tree_exceptional_entry(entry)) {
+		unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
+				    PAGE_CACHE_SIZE, 0);
+		unlock_page(entry);
+		page_cache_release(entry);
+	}
+	entry = ret;
 
 	error = vm_insert_mixed(vma, vaddr, dax.pfn);
-
  out:
+	put_mapping_entry(mapping, vmf->pgoff, entry);
 	return error;
 }
 
@@ -587,7 +820,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	struct file *file = vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
 	struct inode *inode = mapping->host;
-	struct page *page;
+	void *entry;
 	struct buffer_head bh;
 	unsigned long vaddr = (unsigned long)vmf->virtual_address;
 	unsigned blkbits = inode->i_blkbits;
@@ -596,6 +829,11 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	int error;
 	int major = 0;
 
+	/*
+	 * Check whether offset isn't beyond end of file now. Caller is supposed
+	 * to hold locks serializing us with truncate / punch hole so this is
+	 * a reliable test.
+	 */
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	if (vmf->pgoff >= size)
 		return VM_FAULT_SIGBUS;
@@ -605,40 +843,17 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	bh.b_bdev = inode->i_sb->s_bdev;
 	bh.b_size = PAGE_SIZE;
 
- repeat:
-	page = find_get_page(mapping, vmf->pgoff);
-	if (page) {
-		if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
-			page_cache_release(page);
-			return VM_FAULT_RETRY;
-		}
-		if (unlikely(page->mapping != mapping)) {
-			unlock_page(page);
-			page_cache_release(page);
-			goto repeat;
-		}
+	entry = grab_mapping_entry(mapping, vmf->pgoff);
+	if (IS_ERR(entry)) {
+		error = PTR_ERR(entry);
+		goto out;
 	}
 
 	error = get_block(inode, block, &bh, 0);
 	if (!error && (bh.b_size < PAGE_SIZE))
 		error = -EIO;		/* fs corruption? */
 	if (error)
-		goto unlock_page;
-
-	if (!buffer_mapped(&bh) && !vmf->cow_page) {
-		if (vmf->flags & FAULT_FLAG_WRITE) {
-			error = get_block(inode, block, &bh, 1);
-			count_vm_event(PGMAJFAULT);
-			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
-			major = VM_FAULT_MAJOR;
-			if (!error && (bh.b_size < PAGE_SIZE))
-				error = -EIO;
-			if (error)
-				goto unlock_page;
-		} else {
-			return dax_load_hole(mapping, page, vmf);
-		}
-	}
+		goto unlock_entry;
 
 	if (vmf->cow_page) {
 		struct page *new_page = vmf->cow_page;
@@ -647,30 +862,33 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		else
 			clear_user_highpage(new_page, vaddr);
 		if (error)
-			goto unlock_page;
-		vmf->page = page;
-		if (page)
+			goto unlock_entry;
+		if (!radix_tree_exceptional_entry(entry)) {
+			vmf->page = entry;
 			return VM_FAULT_LOCKED;
+		}
+		unlock_mapping_entry(mapping, vmf->pgoff);
 		return 0;
 	}
 
-	/* Check we didn't race with a read fault installing a new page */
-	if (!page && major)
-		page = find_lock_page(mapping, vmf->pgoff);
-
-	if (page) {
-		unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
-							PAGE_CACHE_SIZE, 0);
-		delete_from_page_cache(page);
-		unlock_page(page);
-		page_cache_release(page);
-		page = NULL;
+	if (!buffer_mapped(&bh)) {
+		if (vmf->flags & FAULT_FLAG_WRITE) {
+			error = get_block(inode, block, &bh, 1);
+			count_vm_event(PGMAJFAULT);
+			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
+			major = VM_FAULT_MAJOR;
+			if (!error && (bh.b_size < PAGE_SIZE))
+				error = -EIO;
+			if (error)
+				goto unlock_entry;
+		} else {
+			return dax_load_hole(mapping, entry, vmf);
+		}
 	}
 
 	/* Filesystem should not return unwritten buffers to us! */
 	WARN_ON_ONCE(buffer_unwritten(&bh));
-	error = dax_insert_mapping(inode, &bh, vma, vmf);
-
+	error = dax_insert_mapping(mapping, &bh, entry, vma, vmf);
  out:
 	if (error == -ENOMEM)
 		return VM_FAULT_OOM | major;
@@ -679,11 +897,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		return VM_FAULT_SIGBUS | major;
 	return VM_FAULT_NOPAGE | major;
 
- unlock_page:
-	if (page) {
-		unlock_page(page);
-		page_cache_release(page);
-	}
+ unlock_entry:
+	put_mapping_entry(mapping, vmf->pgoff, entry);
 	goto out;
 }
 EXPORT_SYMBOL(__dax_fault);
@@ -968,16 +1183,17 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
 int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct file *file = vma->vm_file;
+	struct address_space *mapping = file->f_mapping;
+	void *entry;
+	pgoff_t index = vmf->pgoff;
 
-	/*
-	 * We pass NO_SECTOR to dax_radix_entry() because we expect that a
-	 * RADIX_DAX_PTE entry already exists in the radix tree from a
-	 * previous call to __dax_fault().  We just want to look up that PTE
-	 * entry using vmf->pgoff and make sure the dirty tag is set.  This
-	 * saves us from having to make a call to get_block() here to look
-	 * up the sector.
-	 */
-	dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
+	spin_lock_irq(&mapping->tree_lock);
+	entry = lookup_unlocked_mapping_entry(mapping, index, NULL);
+	if (!entry || !radix_tree_exceptional_entry(entry))
+		goto out;
+	radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
+out:
+	spin_unlock_irq(&mapping->tree_lock);
 	return VM_FAULT_NOPAGE;
 }
 EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index fd28d824254b..da2416d916e6 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -18,6 +18,7 @@ int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
+int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 
 #ifdef CONFIG_FS_DAX
 struct page *read_dax_sector(struct block_device *bdev, sector_t n);
diff --git a/mm/truncate.c b/mm/truncate.c
index e3ee0e27cd17..2dd33df71e42 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -34,40 +34,38 @@ static void clear_exceptional_entry(struct address_space *mapping,
 	if (shmem_mapping(mapping))
 		return;
 
-	spin_lock_irq(&mapping->tree_lock);
-
 	if (dax_mapping(mapping)) {
-		if (radix_tree_delete_item(&mapping->page_tree, index, entry))
-			mapping->nrexceptional--;
-	} else {
-		/*
-		 * Regular page slots are stabilized by the page lock even
-		 * without the tree itself locked.  These unlocked entries
-		 * need verification under the tree lock.
-		 */
-		if (!__radix_tree_lookup(&mapping->page_tree, index, &node,
-					&slot))
-			goto unlock;
-		if (*slot != entry)
-			goto unlock;
-		radix_tree_replace_slot(slot, NULL);
-		mapping->nrexceptional--;
-		if (!node)
-			goto unlock;
-		workingset_node_shadows_dec(node);
-		/*
-		 * Don't track node without shadow entries.
-		 *
-		 * Avoid acquiring the list_lru lock if already untracked.
-		 * The list_empty() test is safe as node->private_list is
-		 * protected by mapping->tree_lock.
-		 */
-		if (!workingset_node_shadows(node) &&
-		    !list_empty(&node->private_list))
-			list_lru_del(&workingset_shadow_nodes,
-					&node->private_list);
-		__radix_tree_delete_node(&mapping->page_tree, node);
+		dax_delete_mapping_entry(mapping, index);
+		return;
 	}
+	spin_lock_irq(&mapping->tree_lock);
+	/*
+	 * Regular page slots are stabilized by the page lock even
+	 * without the tree itself locked.  These unlocked entries
+	 * need verification under the tree lock.
+	 */
+	if (!__radix_tree_lookup(&mapping->page_tree, index, &node,
+				&slot))
+		goto unlock;
+	if (*slot != entry)
+		goto unlock;
+	radix_tree_replace_slot(slot, NULL);
+	mapping->nrexceptional--;
+	if (!node)
+		goto unlock;
+	workingset_node_shadows_dec(node);
+	/*
+	 * Don't track node without shadow entries.
+	 *
+	 * Avoid acquiring the list_lru lock if already untracked.
+	 * The list_empty() test is safe as node->private_list is
+	 * protected by mapping->tree_lock.
+	 */
+	if (!workingset_node_shadows(node) &&
+	    !list_empty(&node->private_list))
+		list_lru_del(&workingset_shadow_nodes,
+				&node->private_list);
+	__radix_tree_delete_node(&mapping->page_tree, node);
 unlock:
 	spin_unlock_irq(&mapping->tree_lock);
 }
-- 
2.6.2

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 12/12] dax: New fault locking
@ 2016-03-10 19:18   ` Jan Kara
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Kara @ 2016-03-10 19:18 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Wilcox, Matthew R, Ross Zwisler, Dan Williams, linux-nvdimm,
	NeilBrown, Jan Kara

Currently DAX page fault locking is racy.

CPU0 (write fault)		CPU1 (read fault)

__dax_fault()			__dax_fault()
  get_block(inode, block, &bh, 0) -> not mapped
				  get_block(inode, block, &bh, 0)
				    -> not mapped
  if (!buffer_mapped(&bh))
    if (vmf->flags & FAULT_FLAG_WRITE)
      get_block(inode, block, &bh, 1) -> allocates blocks
  if (page) -> no
				  if (!buffer_mapped(&bh))
				    if (vmf->flags & FAULT_FLAG_WRITE) {
				    } else {
				      dax_load_hole();
				    }
  dax_insert_mapping()

And we are in a situation where we fail in dax_radix_entry() with -EIO.

Another problem with the current DAX page fault locking is that there is
no race-free way to clear the dirty tag in the radix tree. We can always
end up with a clean radix tree and dirty data in the CPU cache.

We fix the first problem by introducing locking of exceptional radix
tree entries in DAX mappings, which acts very similarly to the page lock
and thus properly synchronizes faults against the same mapping index.
The same lock can later be used to avoid races when clearing the radix
tree dirty tag.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c            | 470 ++++++++++++++++++++++++++++++++++++++--------------
 include/linux/dax.h |   1 +
 mm/truncate.c       |  62 ++++---
 3 files changed, 374 insertions(+), 159 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 7148fcdb2c92..7d2e34f90654 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -41,6 +41,30 @@
 #define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
 		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
 
+/* We choose 4096 entries - same as per-zone page wait tables */
+#define DAX_WAIT_TABLE_BITS 12
+#define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
+
+wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
+
+static int __init init_dax_wait_table(void)
+{
+	int i;
+
+	for (i = 0; i < DAX_WAIT_TABLE_ENTRIES; i++)
+		init_waitqueue_head(wait_table + i);
+	return 0;
+}
+fs_initcall(init_dax_wait_table);
+
+static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
+					      pgoff_t index)
+{
+	unsigned long hash = hash_long((unsigned long)mapping ^ index,
+				       DAX_WAIT_TABLE_BITS);
+	return wait_table + hash;
+}
+
 static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
 {
 	struct request_queue *q = bdev->bd_queue;
@@ -306,6 +330,214 @@ ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
 EXPORT_SYMBOL_GPL(dax_do_io);
 
 /*
+ * DAX radix tree locking
+ */
+struct exceptional_entry_key {
+	struct radix_tree_root *root;
+	unsigned long index;
+};
+
+struct wait_exceptional_entry_queue {
+	wait_queue_t wait;
+	struct exceptional_entry_key key;
+};
+
+static int wake_exceptional_entry_func(wait_queue_t *wait, unsigned mode,
+				       int sync, void *keyp)
+{
+	struct exceptional_entry_key *key = keyp;
+	struct wait_exceptional_entry_queue *ewait =
+		container_of(wait, struct wait_exceptional_entry_queue, wait);
+
+	if (key->root != ewait->key.root || key->index != ewait->key.index)
+		return 0;
+	return autoremove_wake_function(wait, mode, sync, NULL);
+}
+
+static inline int slot_locked(void **v)
+{
+	unsigned long l = *(unsigned long *)v;
+	return l & DAX_ENTRY_LOCK;
+}
+
+static inline void *lock_slot(void **v)
+{
+	unsigned long *l = (unsigned long *)v;
+	return (void*)(*l |= DAX_ENTRY_LOCK);
+}
+
+static inline void *unlock_slot(void **v)
+{
+	unsigned long *l = (unsigned long *)v;
+	return (void*)(*l &= ~(unsigned long)DAX_ENTRY_LOCK);
+}
+
+/*
+ * Lookup entry in radix tree, wait for it to become unlocked if it is
+ * exceptional entry and return.
+ *
+ * The function must be called with mapping->tree_lock held.
+ */
+static void *lookup_unlocked_mapping_entry(struct address_space *mapping,
+					   pgoff_t index, void ***slotp)
+{
+	void *ret, **slot;
+	struct wait_exceptional_entry_queue wait;
+	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+
+	init_wait(&wait.wait);
+	wait.wait.func = wake_exceptional_entry_func;
+	wait.key.root = &mapping->page_tree;
+	wait.key.index = index;
+
+	for (;;) {
+		ret = __radix_tree_lookup(&mapping->page_tree, index, NULL,
+					  &slot);
+		if (!ret || !radix_tree_exceptional_entry(ret) ||
+		    !slot_locked(slot)) {
+			if (slotp)
+				*slotp = slot;
+			return ret;
+		}
+		prepare_to_wait_exclusive(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
+		spin_unlock_irq(&mapping->tree_lock);
+		schedule();
+		finish_wait(wq, &wait.wait);
+		spin_lock_irq(&mapping->tree_lock);
+	}
+}
+
+/*
+ * Find radix tree entry at given index. If it points to a page, return with
+ * the page locked. If it points to the exceptional entry, return with the
+ * radix tree entry locked. If the radix tree doesn't contain given index,
+ * create empty exceptional entry for the index and return with it locked.
+ *
+ * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
+ * persistent memory the benefit is doubtful. We can add that later if we can
+ * show it helps.
+ */
+static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
+{
+	void *ret, **slot;
+
+restart:
+	spin_lock_irq(&mapping->tree_lock);
+	ret = lookup_unlocked_mapping_entry(mapping, index, &slot);
+	/* No entry for given index? Make sure radix tree is big enough. */
+	if (!ret) {
+		int err;
+
+		spin_unlock_irq(&mapping->tree_lock);
+		err = radix_tree_preload(mapping_gfp_mask(mapping));
+		if (err)
+			return ERR_PTR(err);
+		ret = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | DAX_ENTRY_LOCK);
+		spin_lock_irq(&mapping->tree_lock);
+		err = radix_tree_insert(&mapping->page_tree, index, ret);
+		radix_tree_preload_end();
+		if (err) {
+			spin_unlock_irq(&mapping->tree_lock);
+			/* Someone already created the entry? */
+			if (err == -EEXIST)
+				goto restart;
+			return ERR_PTR(err);
+		}
+		/* Good, we have inserted empty locked entry into the tree. */
+		mapping->nrexceptional++;
+		spin_unlock_irq(&mapping->tree_lock);
+		return ret;
+	}
+	/* Normal page in radix tree? */
+	if (!radix_tree_exceptional_entry(ret)) {
+		struct page *page = ret;
+
+		page_cache_get(page);
+		spin_unlock_irq(&mapping->tree_lock);
+		lock_page(page);
+		/* Page got truncated? Retry... */
+		if (unlikely(page->mapping != mapping)) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto restart;
+		}
+		return page;
+	}
+	ret = lock_slot(slot);
+	spin_unlock_irq(&mapping->tree_lock);
+	return ret;
+}
+
+static void unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
+{
+	void *ret, **slot;
+	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+
+	spin_lock_irq(&mapping->tree_lock);
+	ret = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
+	if (WARN_ON_ONCE(!ret || !radix_tree_exceptional_entry(ret))) {
+		spin_unlock_irq(&mapping->tree_lock);
+		return;
+	}
+	if (WARN_ON_ONCE(!slot_locked(slot))) {
+		spin_unlock_irq(&mapping->tree_lock);
+		return;
+	}
+	unlock_slot(slot);
+	spin_unlock_irq(&mapping->tree_lock);
+	if (waitqueue_active(wq)) {
+		struct exceptional_entry_key key;
+
+		key.root = &mapping->page_tree;
+		key.index = index;
+		__wake_up(wq, TASK_NORMAL, 1, &key);
+	}
+}
+
+static void put_mapping_entry(struct address_space *mapping, pgoff_t index,
+			      void *entry)
+{
+	if (!radix_tree_exceptional_entry(entry)) {
+		unlock_page(entry);
+		page_cache_release(entry);
+	} else {
+		unlock_mapping_entry(mapping, index);
+	}
+}
+
+/*
+ * Delete exceptional DAX entry at @index from @mapping. Wait for radix tree
+ * entry to get unlocked before deleting it.
+ */
+int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
+{
+	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+	void *entry;
+
+	spin_lock_irq(&mapping->tree_lock);
+	entry = lookup_unlocked_mapping_entry(mapping, index, NULL);
+	/*
+	 * Caller should make sure radix tree modifications don't race and
+	 * we have seen exceptional entry here before.
+	 */
+	if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry))) {
+		spin_unlock_irq(&mapping->tree_lock);
+		return 0;
+	}
+	radix_tree_delete(&mapping->page_tree, index);
+	mapping->nrexceptional--;
+	spin_unlock_irq(&mapping->tree_lock);
+	if (waitqueue_active(wq)) {
+		struct exceptional_entry_key key;
+
+		key.root = &mapping->page_tree;
+		key.index = index;
+		__wake_up(wq, TASK_NORMAL, 1, &key);
+	}
+	return 1;
+}
+
+/*
  * The user has performed a load from a hole in the file.  Allocating
  * a new page in the file would cause excessive storage usage for
  * workloads with sparse files.  We allocate a page cache page instead.
@@ -313,16 +545,24 @@ EXPORT_SYMBOL_GPL(dax_do_io);
  * otherwise it will simply fall out of the page cache under memory
  * pressure without ever having been dirtied.
  */
-static int dax_load_hole(struct address_space *mapping, struct page *page,
-							struct vm_fault *vmf)
+static int dax_load_hole(struct address_space *mapping, void *entry,
+			 struct vm_fault *vmf)
 {
-	struct inode *inode = mapping->host;
-	if (!page)
-		page = find_or_create_page(mapping, vmf->pgoff,
-						GFP_KERNEL | __GFP_ZERO);
-	if (!page)
-		return VM_FAULT_OOM;
+	struct page *page;
 
+	/* Hole page already exists? Return it...  */
+	if (!radix_tree_exceptional_entry(entry)) {
+		vmf->page = entry;
+		return VM_FAULT_LOCKED;
+	}
+
+	/* This will replace locked radix tree entry with a hole page */
+	page = find_or_create_page(mapping, vmf->pgoff,
+				   vmf->gfp_mask | __GFP_ZERO);
+	if (!page) {
+		put_mapping_entry(mapping, vmf->pgoff, entry);
+		return VM_FAULT_OOM;
+	}
 	vmf->page = page;
 	return VM_FAULT_LOCKED;
 }
@@ -346,77 +586,58 @@ static int copy_user_bh(struct page *to, struct inode *inode,
 	return 0;
 }
 
-#define NO_SECTOR -1
 #define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_CACHE_SHIFT))
 
-static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
-		sector_t sector, bool pmd_entry, bool dirty)
+static void *dax_mapping_entry(struct address_space *mapping, pgoff_t index,
+			       void *entry, sector_t sector, bool dirty,
+			       gfp_t gfp_mask)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
-	pgoff_t pmd_index = DAX_PMD_INDEX(index);
-	int type, error = 0;
-	void *entry;
+	int error = 0;
+	bool hole_fill = false;
+	struct mem_cgroup *memcg;
+	void *ret;
 
-	WARN_ON_ONCE(pmd_entry && !dirty);
 	if (dirty)
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
-	spin_lock_irq(&mapping->tree_lock);
-
-	entry = radix_tree_lookup(page_tree, pmd_index);
-	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD) {
-		index = pmd_index;
-		goto dirty;
+	/* Replacing hole page with block mapping? */
+	if (!radix_tree_exceptional_entry(entry)) {
+		hole_fill = true;
+		error = radix_tree_preload(gfp_mask);
+		if (error)
+			return ERR_PTR(error);
+		memcg = mem_cgroup_begin_page_stat(entry);
 	}
 
-	entry = radix_tree_lookup(page_tree, index);
-	if (entry) {
-		type = RADIX_DAX_TYPE(entry);
-		if (WARN_ON_ONCE(type != RADIX_DAX_PTE &&
-					type != RADIX_DAX_PMD)) {
-			error = -EIO;
+	spin_lock_irq(&mapping->tree_lock);
+	ret = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
+		       DAX_ENTRY_LOCK);
+	if (hole_fill) {
+		__delete_from_page_cache(entry, NULL, memcg);
+		error = radix_tree_insert(page_tree, index, ret);
+		if (error) {
+			ret = ERR_PTR(error);
 			goto unlock;
 		}
+		mapping->nrexceptional++;
+	} else {
+		void **slot;
+		void *ret2;
 
-		if (!pmd_entry || type == RADIX_DAX_PMD)
-			goto dirty;
-
-		/*
-		 * We only insert dirty PMD entries into the radix tree.  This
-		 * means we don't need to worry about removing a dirty PTE
-		 * entry and inserting a clean PMD entry, thus reducing the
-		 * range we would flush with a follow-up fsync/msync call.
-		 */
-		radix_tree_delete(&mapping->page_tree, index);
-		mapping->nrexceptional--;
+		ret2 = __radix_tree_lookup(page_tree, index, NULL, &slot);
+		WARN_ON_ONCE(ret2 != entry);
+		radix_tree_replace_slot(slot, ret);
 	}
-
-	if (sector == NO_SECTOR) {
-		/*
-		 * This can happen during correct operation if our pfn_mkwrite
-		 * fault raced against a hole punch operation.  If this
-		 * happens the pte that was hole punched will have been
-		 * unmapped and the radix tree entry will have been removed by
-		 * the time we are called, but the call will still happen.  We
-		 * will return all the way up to wp_pfn_shared(), where the
-		 * pte_same() check will fail, eventually causing page fault
-		 * to be retried by the CPU.
-		 */
-		goto unlock;
-	}
-
-	error = radix_tree_insert(page_tree, index,
-			RADIX_DAX_ENTRY(sector, pmd_entry));
-	if (error)
-		goto unlock;
-
-	mapping->nrexceptional++;
- dirty:
 	if (dirty)
 		radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);
  unlock:
 	spin_unlock_irq(&mapping->tree_lock);
-	return error;
+	if (hole_fill) {
+		mem_cgroup_end_page_stat(memcg);
+		radix_tree_preload_end();
+	}
+	return ret;
 }
 
 static int dax_writeback_one(struct block_device *bdev,
@@ -542,17 +763,18 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 }
 EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
 
-static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
+static int dax_insert_mapping(struct address_space *mapping,
+			struct buffer_head *bh, void *entry,
 			struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	unsigned long vaddr = (unsigned long)vmf->virtual_address;
-	struct address_space *mapping = inode->i_mapping;
 	struct block_device *bdev = bh->b_bdev;
 	struct blk_dax_ctl dax = {
-		.sector = to_sector(bh, inode),
+		.sector = to_sector(bh, mapping->host),
 		.size = bh->b_size,
 	};
 	int error;
+	void *ret;
 
 	if (dax_map_atomic(bdev, &dax) < 0) {
 		error = PTR_ERR(dax.addr);
@@ -560,14 +782,25 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	}
 	dax_unmap_atomic(bdev, &dax);
 
-	error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
-			vmf->flags & FAULT_FLAG_WRITE);
-	if (error)
+	ret = dax_mapping_entry(mapping, vmf->pgoff, entry, dax.sector,
+			        vmf->flags & FAULT_FLAG_WRITE,
+			        vmf->gfp_mask & ~__GFP_HIGHMEM);
+	if (IS_ERR(ret)) {
+		error = PTR_ERR(ret);
 		goto out;
+	}
+	/* Have we replaced hole page? Unmap and free it. */
+	if (!radix_tree_exceptional_entry(entry)) {
+		unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
+				    PAGE_CACHE_SIZE, 0);
+		unlock_page(entry);
+		page_cache_release(entry);
+	}
+	entry = ret;
 
 	error = vm_insert_mixed(vma, vaddr, dax.pfn);
-
  out:
+	put_mapping_entry(mapping, vmf->pgoff, entry);
 	return error;
 }
 
@@ -587,7 +820,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	struct file *file = vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
 	struct inode *inode = mapping->host;
-	struct page *page;
+	void *entry;
 	struct buffer_head bh;
 	unsigned long vaddr = (unsigned long)vmf->virtual_address;
 	unsigned blkbits = inode->i_blkbits;
@@ -596,6 +829,11 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	int error;
 	int major = 0;
 
+	/*
+	 * Check whether offset isn't beyond end of file now. Caller is supposed
+	 * to hold locks serializing us with truncate / punch hole so this is
+	 * a reliable test.
+	 */
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	if (vmf->pgoff >= size)
 		return VM_FAULT_SIGBUS;
@@ -605,40 +843,17 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	bh.b_bdev = inode->i_sb->s_bdev;
 	bh.b_size = PAGE_SIZE;
 
- repeat:
-	page = find_get_page(mapping, vmf->pgoff);
-	if (page) {
-		if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
-			page_cache_release(page);
-			return VM_FAULT_RETRY;
-		}
-		if (unlikely(page->mapping != mapping)) {
-			unlock_page(page);
-			page_cache_release(page);
-			goto repeat;
-		}
+	entry = grab_mapping_entry(mapping, vmf->pgoff);
+	if (IS_ERR(entry)) {
+		error = PTR_ERR(entry);
+		goto out;
 	}
 
 	error = get_block(inode, block, &bh, 0);
 	if (!error && (bh.b_size < PAGE_SIZE))
 		error = -EIO;		/* fs corruption? */
 	if (error)
-		goto unlock_page;
-
-	if (!buffer_mapped(&bh) && !vmf->cow_page) {
-		if (vmf->flags & FAULT_FLAG_WRITE) {
-			error = get_block(inode, block, &bh, 1);
-			count_vm_event(PGMAJFAULT);
-			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
-			major = VM_FAULT_MAJOR;
-			if (!error && (bh.b_size < PAGE_SIZE))
-				error = -EIO;
-			if (error)
-				goto unlock_page;
-		} else {
-			return dax_load_hole(mapping, page, vmf);
-		}
-	}
+		goto unlock_entry;
 
 	if (vmf->cow_page) {
 		struct page *new_page = vmf->cow_page;
@@ -647,30 +862,33 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		else
 			clear_user_highpage(new_page, vaddr);
 		if (error)
-			goto unlock_page;
-		vmf->page = page;
-		if (page)
+			goto unlock_entry;
+		if (!radix_tree_exceptional_entry(entry)) {
+			vmf->page = entry;
 			return VM_FAULT_LOCKED;
+		}
+		unlock_mapping_entry(mapping, vmf->pgoff);
 		return 0;
 	}
 
-	/* Check we didn't race with a read fault installing a new page */
-	if (!page && major)
-		page = find_lock_page(mapping, vmf->pgoff);
-
-	if (page) {
-		unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
-							PAGE_CACHE_SIZE, 0);
-		delete_from_page_cache(page);
-		unlock_page(page);
-		page_cache_release(page);
-		page = NULL;
+	if (!buffer_mapped(&bh)) {
+		if (vmf->flags & FAULT_FLAG_WRITE) {
+			error = get_block(inode, block, &bh, 1);
+			count_vm_event(PGMAJFAULT);
+			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
+			major = VM_FAULT_MAJOR;
+			if (!error && (bh.b_size < PAGE_SIZE))
+				error = -EIO;
+			if (error)
+				goto unlock_entry;
+		} else {
+			return dax_load_hole(mapping, entry, vmf);
+		}
 	}
 
 	/* Filesystem should not return unwritten buffers to us! */
 	WARN_ON_ONCE(buffer_unwritten(&bh));
-	error = dax_insert_mapping(inode, &bh, vma, vmf);
-
+	error = dax_insert_mapping(mapping, &bh, entry, vma, vmf);
  out:
 	if (error == -ENOMEM)
 		return VM_FAULT_OOM | major;
@@ -679,11 +897,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		return VM_FAULT_SIGBUS | major;
 	return VM_FAULT_NOPAGE | major;
 
- unlock_page:
-	if (page) {
-		unlock_page(page);
-		page_cache_release(page);
-	}
+ unlock_entry:
+	put_mapping_entry(mapping, vmf->pgoff, entry);
 	goto out;
 }
 EXPORT_SYMBOL(__dax_fault);
@@ -968,16 +1183,17 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
 int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct file *file = vma->vm_file;
+	struct address_space *mapping = file->f_mapping;
+	void *entry;
+	pgoff_t index = vmf->pgoff;
 
-	/*
-	 * We pass NO_SECTOR to dax_radix_entry() because we expect that a
-	 * RADIX_DAX_PTE entry already exists in the radix tree from a
-	 * previous call to __dax_fault().  We just want to look up that PTE
-	 * entry using vmf->pgoff and make sure the dirty tag is set.  This
-	 * saves us from having to make a call to get_block() here to look
-	 * up the sector.
-	 */
-	dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
+	spin_lock_irq(&mapping->tree_lock);
+	entry = lookup_unlocked_mapping_entry(mapping, index, NULL);
+	if (!entry || !radix_tree_exceptional_entry(entry))
+		goto out;
+	radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
+out:
+	spin_unlock_irq(&mapping->tree_lock);
 	return VM_FAULT_NOPAGE;
 }
 EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index fd28d824254b..da2416d916e6 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -18,6 +18,7 @@ int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
+int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 
 #ifdef CONFIG_FS_DAX
 struct page *read_dax_sector(struct block_device *bdev, sector_t n);
diff --git a/mm/truncate.c b/mm/truncate.c
index e3ee0e27cd17..2dd33df71e42 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -34,40 +34,38 @@ static void clear_exceptional_entry(struct address_space *mapping,
 	if (shmem_mapping(mapping))
 		return;
 
-	spin_lock_irq(&mapping->tree_lock);
-
 	if (dax_mapping(mapping)) {
-		if (radix_tree_delete_item(&mapping->page_tree, index, entry))
-			mapping->nrexceptional--;
-	} else {
-		/*
-		 * Regular page slots are stabilized by the page lock even
-		 * without the tree itself locked.  These unlocked entries
-		 * need verification under the tree lock.
-		 */
-		if (!__radix_tree_lookup(&mapping->page_tree, index, &node,
-					&slot))
-			goto unlock;
-		if (*slot != entry)
-			goto unlock;
-		radix_tree_replace_slot(slot, NULL);
-		mapping->nrexceptional--;
-		if (!node)
-			goto unlock;
-		workingset_node_shadows_dec(node);
-		/*
-		 * Don't track node without shadow entries.
-		 *
-		 * Avoid acquiring the list_lru lock if already untracked.
-		 * The list_empty() test is safe as node->private_list is
-		 * protected by mapping->tree_lock.
-		 */
-		if (!workingset_node_shadows(node) &&
-		    !list_empty(&node->private_list))
-			list_lru_del(&workingset_shadow_nodes,
-					&node->private_list);
-		__radix_tree_delete_node(&mapping->page_tree, node);
+		dax_delete_mapping_entry(mapping, index);
+		return;
 	}
+	spin_lock_irq(&mapping->tree_lock);
+	/*
+	 * Regular page slots are stabilized by the page lock even
+	 * without the tree itself locked.  These unlocked entries
+	 * need verification under the tree lock.
+	 */
+	if (!__radix_tree_lookup(&mapping->page_tree, index, &node,
+				&slot))
+		goto unlock;
+	if (*slot != entry)
+		goto unlock;
+	radix_tree_replace_slot(slot, NULL);
+	mapping->nrexceptional--;
+	if (!node)
+		goto unlock;
+	workingset_node_shadows_dec(node);
+	/*
+	 * Don't track node without shadow entries.
+	 *
+	 * Avoid acquiring the list_lru lock if already untracked.
+	 * The list_empty() test is safe as node->private_list is
+	 * protected by mapping->tree_lock.
+	 */
+	if (!workingset_node_shadows(node) &&
+	    !list_empty(&node->private_list))
+		list_lru_del(&workingset_shadow_nodes,
+				&node->private_list);
+	__radix_tree_delete_node(&mapping->page_tree, node);
 unlock:
 	spin_unlock_irq(&mapping->tree_lock);
 }
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH 11/12] dax: Disable huge page handling
  2016-03-10 19:18 ` [PATCH 11/12] dax: Disable huge page handling Jan Kara
@ 2016-03-10 19:34     ` Dan Williams
  0 siblings, 0 replies; 64+ messages in thread
From: Dan Williams @ 2016-03-10 19:34 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Wilcox,
	Matthew R  <matthew.r.wilcox@intel.com>,
	NeilBrown <neilb@suse.com>,
	linux-nvdimm@lists.01.org

On Thu, Mar 10, 2016 at 11:18 AM, Jan Kara <jack@suse.cz> wrote:
> Currently the handling of huge pages for DAX is racy. For example the
> following can happen:
>
> CPU0 (THP write fault)                  CPU1 (normal read fault)
>
> __dax_pmd_fault()                       __dax_fault()
>   get_block(inode, block, &bh, 0) -> not mapped
>                                         get_block(inode, block, &bh, 0)
>                                           -> not mapped
>   if (!buffer_mapped(&bh) && write)
>     get_block(inode, block, &bh, 1) -> allocates blocks
>   truncate_pagecache_range(inode, lstart, lend);
>                                         dax_load_hole();
>
> This results in data corruption since the process on CPU1 won't see the
> changes made to the file by CPU0.
>
> The race can happen even if two normal faults race; however, with THP the
> situation is even worse because the two faults don't operate on the same
> entries in the radix tree and we want to use these entries for
> serialization. So disable THP support in DAX code for now.
>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/dax.c            | 2 +-
>  include/linux/dax.h | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 3951237ff248..7148fcdb2c92 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -715,7 +715,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  }
>  EXPORT_SYMBOL_GPL(dax_fault);
>
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#if 0
>  /*
>   * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
>   * more often than one might expect in the below function.
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 4b63923e1f8d..fd28d824254b 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -29,7 +29,7 @@ static inline struct page *read_dax_sector(struct block_device *bdev,
>  }
>  #endif
>
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#if 0
>  int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
>                                 unsigned int flags, get_block_t);
>  int __dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
> --
> 2.6.2
>

Maybe switch to marking FS_DAX_PMD as "depends on BROKEN" again?  That
way we re-use the same mechanism as the check for the presence of
ZONE_DEVICE / struct page for the given pfn.
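
For reference, a rough sketch of what that would look like in fs/Kconfig.
Only FS_DAX_PMD and the BROKEN gate come from this discussion; the rest of
the dependency list below is an assumption for illustration, not a copy of
the tree:

config FS_DAX_PMD
	bool
	default FS_DAX
	depends on FS_DAX
	depends on ZONE_DEVICE
	depends on TRANSPARENT_HUGEPAGE
	# unselectable until the PMD fault locking is sorted out
	depends on BROKEN

That keeps the PMD code compiling for anyone who wants to work on it while
making sure it cannot be enabled in a normal build.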

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 11/12] dax: Disable huge page handling
@ 2016-03-10 19:34     ` Dan Williams
  0 siblings, 0 replies; 64+ messages in thread
From: Dan Williams @ 2016-03-10 19:34 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Wilcox, Matthew R, Ross Zwisler, linux-nvdimm, NeilBrown

On Thu, Mar 10, 2016 at 11:18 AM, Jan Kara <jack@suse.cz> wrote:
> Currently the handling of huge pages for DAX is racy. For example the
> following can happen:
>
> CPU0 (THP write fault)                  CPU1 (normal read fault)
>
> __dax_pmd_fault()                       __dax_fault()
>   get_block(inode, block, &bh, 0) -> not mapped
>                                         get_block(inode, block, &bh, 0)
>                                           -> not mapped
>   if (!buffer_mapped(&bh) && write)
>     get_block(inode, block, &bh, 1) -> allocates blocks
>   truncate_pagecache_range(inode, lstart, lend);
>                                         dax_load_hole();
>
> This results in data corruption since the process on CPU1 won't see the
> changes made to the file by CPU0.
>
> The race can happen even if two normal faults race; however, with THP the
> situation is even worse because the two faults don't operate on the same
> entries in the radix tree and we want to use these entries for
> serialization. So disable THP support in DAX code for now.
>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/dax.c            | 2 +-
>  include/linux/dax.h | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 3951237ff248..7148fcdb2c92 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -715,7 +715,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  }
>  EXPORT_SYMBOL_GPL(dax_fault);
>
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#if 0
>  /*
>   * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
>   * more often than one might expect in the below function.
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 4b63923e1f8d..fd28d824254b 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -29,7 +29,7 @@ static inline struct page *read_dax_sector(struct block_device *bdev,
>  }
>  #endif
>
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#if 0
>  int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
>                                 unsigned int flags, get_block_t);
>  int __dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
> --
> 2.6.2
>

Maybe switch to marking FS_DAX_PMD as "depends on BROKEN" again?  That
way we re-use the same mechanism as the check for the presence of
ZONE_DEVICE / struct page for the given pfn.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: [PATCH 03/12] mm: Remove VM_FAULT_MINOR
  2016-03-10 19:18   ` Jan Kara
@ 2016-03-10 19:38     ` Wilcox, Matthew R
  -1 siblings, 0 replies; 64+ messages in thread
From: Wilcox, Matthew R @ 2016-03-10 19:38 UTC (permalink / raw)
  To: Jan Kara, linux-fsdevel, linux-xtensa; +Cc: NeilBrown, linux-nvdimm

You've clearly fixed a bug in xtensa here ;-)  Wonder why nobody noticed that counter was always 0?

+++ b/arch/xtensa/mm/fault.c
@@ -146,7 +146,7 @@ good_area:
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 	if (flags & VM_FAULT_MAJOR)
 		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
-	else if (flags & VM_FAULT_MINOR)
+	else
 		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
 
 	return;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 516e14944339..37f319d7636e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1051,8 +1051,6 @@ static inline void clear_page_pfmemalloc(struct page *page)
  * just gets major/minor fault counters bumped up.
  */
 
-#define VM_FAULT_MINOR	0 /* For backwards compat. Remove me quickly. */
-
 #define VM_FAULT_OOM	0x0001
 #define VM_FAULT_SIGBUS	0x0002
 #define VM_FAULT_MAJOR	0x0004
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* RE: [PATCH 03/12] mm: Remove VM_FAULT_MINOR
@ 2016-03-10 19:38     ` Wilcox, Matthew R
  0 siblings, 0 replies; 64+ messages in thread
From: Wilcox, Matthew R @ 2016-03-10 19:38 UTC (permalink / raw)
  To: Jan Kara, linux-fsdevel, linux-xtensa
  Cc: Ross Zwisler, Williams, Dan J, linux-nvdimm, NeilBrown

You've clearly fixed a bug in xtensa here ;-)  Wonder why nobody noticed that counter was always 0?

+++ b/arch/xtensa/mm/fault.c
@@ -146,7 +146,7 @@ good_area:
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 	if (flags & VM_FAULT_MAJOR)
 		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
-	else if (flags & VM_FAULT_MINOR)
+	else
 		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
 
 	return;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 516e14944339..37f319d7636e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1051,8 +1051,6 @@ static inline void clear_page_pfmemalloc(struct page *page)
  * just gets major/minor fault counters bumped up.
  */
 
-#define VM_FAULT_MINOR	0 /* For backwards compat. Remove me quickly. */
-
 #define VM_FAULT_OOM	0x0001
 #define VM_FAULT_SIGBUS	0x0002
 #define VM_FAULT_MAJOR	0x0004
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH 03/12] mm: Remove VM_FAULT_MINOR
  2016-03-10 19:38     ` Wilcox, Matthew R
@ 2016-03-10 19:48       ` Jan Kara
  -1 siblings, 0 replies; 64+ messages in thread
From: Jan Kara @ 2016-03-10 19:48 UTC (permalink / raw)
  To: Wilcox, Matthew R
  Cc: linux-xtensa, Jan Kara, linux-nvdimm, NeilBrown, linux-fsdevel

On Thu 10-03-16 19:38:56, Wilcox, Matthew R wrote:
> You've clearly fixed a bug in xtensa here ;-)  Wonder why nobody noticed
> that counter was always 0?

Likely that architecture is not used too much ;). BTW, I forgot to mention
that patch 4 has already been merged into Linus' tree today and this
patch is sitting in the mm tree. They are unrelated and I was just too lazy
to remove them from the series at this point.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 03/12] mm: Remove VM_FAULT_MINOR
@ 2016-03-10 19:48       ` Jan Kara
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Kara @ 2016-03-10 19:48 UTC (permalink / raw)
  To: Wilcox, Matthew R
  Cc: Jan Kara, linux-fsdevel, linux-xtensa, Ross Zwisler, Williams,
	Dan J, linux-nvdimm, NeilBrown

On Thu 10-03-16 19:38:56, Wilcox, Matthew R wrote:
> You've clearly fixed a bug in xtensa here ;-)  Wonder why nobody noticed
> that counter was always 0?

Likely that architecture is not used too much ;). BTW, I forgot to mention
that patch 4 has already been merged into Linus' tree today and this
patch is sitting in the mm tree. They are unrelated and I was just too lazy
to remove them from the series at this point.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 11/12] dax: Disable huge page handling
  2016-03-10 19:34     ` Dan Williams
@ 2016-03-10 19:52       ` Jan Kara
  -1 siblings, 0 replies; 64+ messages in thread
From: Jan Kara @ 2016-03-10 19:52 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, NeilBrown, Wilcox, Matthew R, linux-fsdevel

On Thu 10-03-16 11:34:39, Dan Williams wrote:
> On Thu, Mar 10, 2016 at 11:18 AM, Jan Kara <jack@suse.cz> wrote:
> > Currently the handling of huge pages for DAX is racy. For example the
> > following can happen:
> >
> > CPU0 (THP write fault)                  CPU1 (normal read fault)
> >
> > __dax_pmd_fault()                       __dax_fault()
> >   get_block(inode, block, &bh, 0) -> not mapped
> >                                         get_block(inode, block, &bh, 0)
> >                                           -> not mapped
> >   if (!buffer_mapped(&bh) && write)
> >     get_block(inode, block, &bh, 1) -> allocates blocks
> >   truncate_pagecache_range(inode, lstart, lend);
> >                                         dax_load_hole();
> >
> > This results in data corruption since the process on CPU1 won't see the
> > changes made to the file by CPU0.
> >
> > The race can happen even if two normal faults race; however, with THP the
> > situation is even worse because the two faults don't operate on the same
> > entries in the radix tree and we want to use these entries for
> > serialization. So disable THP support in DAX code for now.
> >
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/dax.c            | 2 +-
> >  include/linux/dax.h | 2 +-
> >  2 files changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 3951237ff248..7148fcdb2c92 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -715,7 +715,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  }
> >  EXPORT_SYMBOL_GPL(dax_fault);
> >
> > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +#if 0
> >  /*
> >   * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
> >   * more often than one might expect in the below function.
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index 4b63923e1f8d..fd28d824254b 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -29,7 +29,7 @@ static inline struct page *read_dax_sector(struct block_device *bdev,
> >  }
> >  #endif
> >
> > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +#if 0
> >  int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
> >                                 unsigned int flags, get_block_t);
> >  int __dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
> > --
> > 2.6.2
> >
> 
> Maybe switch to marking FS_DAX_PMD as "depends on BROKEN" again?  That
> way we re-use the same mechanism as the check for the presence of
> ZONE_DEVICE / struct page for the given pfn.

Yeah, maybe I could do that. At this point the PMD fault handler would not
even compile, but I could possibly massage it so that it will work with the
new locking unless you try mixing PMD and PTE faults...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 11/12] dax: Disable huge page handling
@ 2016-03-10 19:52       ` Jan Kara
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Kara @ 2016-03-10 19:52 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-fsdevel, Wilcox, Matthew R, Ross Zwisler,
	linux-nvdimm, NeilBrown

On Thu 10-03-16 11:34:39, Dan Williams wrote:
> On Thu, Mar 10, 2016 at 11:18 AM, Jan Kara <jack@suse.cz> wrote:
> > Currently the handling of huge pages for DAX is racy. For example the
> > following can happen:
> >
> > CPU0 (THP write fault)                  CPU1 (normal read fault)
> >
> > __dax_pmd_fault()                       __dax_fault()
> >   get_block(inode, block, &bh, 0) -> not mapped
> >                                         get_block(inode, block, &bh, 0)
> >                                           -> not mapped
> >   if (!buffer_mapped(&bh) && write)
> >     get_block(inode, block, &bh, 1) -> allocates blocks
> >   truncate_pagecache_range(inode, lstart, lend);
> >                                         dax_load_hole();
> >
> > This results in data corruption since the process on CPU1 won't see the
> > changes made to the file by CPU0.
> >
> > The race can happen even if two normal faults race; however, with THP the
> > situation is even worse because the two faults don't operate on the same
> > entries in the radix tree and we want to use these entries for
> > serialization. So disable THP support in DAX code for now.
> >
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/dax.c            | 2 +-
> >  include/linux/dax.h | 2 +-
> >  2 files changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 3951237ff248..7148fcdb2c92 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -715,7 +715,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  }
> >  EXPORT_SYMBOL_GPL(dax_fault);
> >
> > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +#if 0
> >  /*
> >   * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
> >   * more often than one might expect in the below function.
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index 4b63923e1f8d..fd28d824254b 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -29,7 +29,7 @@ static inline struct page *read_dax_sector(struct block_device *bdev,
> >  }
> >  #endif
> >
> > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +#if 0
> >  int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
> >                                 unsigned int flags, get_block_t);
> >  int __dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
> > --
> > 2.6.2
> >
> 
> Maybe switch to marking FS_DAX_PMD as "depends on BROKEN" again?  That
> way we re-use the same mechanism as the check for the presence of
> ZONE_DEVICE / struct page for the given pfn.

Yeah, maybe I could do that. At this point the PMD fault handler would not
even compile, but I could possibly massage it so that it will work with the
new locking unless you try mixing PMD and PTE faults...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
  2016-03-10 19:18 ` [PATCH 05/12] dax: Remove synchronization using i_mmap_lock Jan Kara
@ 2016-03-10 19:55     ` Wilcox, Matthew R
  0 siblings, 0 replies; 64+ messages in thread
From: Wilcox, Matthew R @ 2016-03-10 19:55 UTC (permalink / raw)
  To: Jan Kara, linux-fsdevel; +Cc: NeilBrown, linux-nvdimm

This locking's still necessary.  i_mmap_sem has already been released by the time we're back in do_cow_fault(), so it doesn't protect that page, and truncate can have whizzed past, thinking there's nothing to unmap.  So a task can have a MAP_PRIVATE page still in its address space after it's supposed to have been unmapped.

We need a test suite for this ;-)

-----Original Message-----
From: Jan Kara [mailto:jack@suse.cz] 
Sent: Thursday, March 10, 2016 11:19 AM
To: linux-fsdevel@vger.kernel.org
Cc: Wilcox, Matthew R; Ross Zwisler; Williams, Dan J; linux-nvdimm@lists.01.org; NeilBrown; Jan Kara
Subject: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock

At one point DAX used i_mmap_lock to synchronize page faults with page
table invalidation during truncate. However, these days DAX uses
filesystem-specific RW semaphores to protect against these races
(i_mmap_sem in the ext2 & ext4 cases, XFS_MMAPLOCK in the xfs case). So
remove the unnecessary locking.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c    | 19 -------------------
 mm/memory.c | 14 --------------
 2 files changed, 33 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 9c4d697fb6fc..e409e8fc13b7 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -563,8 +563,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	pgoff_t size;
 	int error;
 
-	i_mmap_lock_read(mapping);
-
 	/*
 	 * Check truncate didn't happen while we were allocating a block.
 	 * If it did, this block may or may not be still allocated to the
@@ -597,8 +595,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	error = vm_insert_mixed(vma, vaddr, dax.pfn);
 
  out:
-	i_mmap_unlock_read(mapping);
-
 	return error;
 }
 
@@ -695,17 +691,6 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		if (error)
 			goto unlock_page;
 		vmf->page = page;
-		if (!page) {
-			i_mmap_lock_read(mapping);
-			/* Check we didn't race with truncate */
-			size = (i_size_read(inode) + PAGE_SIZE - 1) >>
-								PAGE_SHIFT;
-			if (vmf->pgoff >= size) {
-				i_mmap_unlock_read(mapping);
-				error = -EIO;
-				goto out;
-			}
-		}
 		return VM_FAULT_LOCKED;
 	}
 
@@ -895,8 +880,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		truncate_pagecache_range(inode, lstart, lend);
 	}
 
-	i_mmap_lock_read(mapping);
-
 	/*
 	 * If a truncate happened while we were allocating blocks, we may
 	 * leave blocks allocated to the file that are beyond EOF.  We can't
@@ -1013,8 +996,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	}
 
  out:
-	i_mmap_unlock_read(mapping);
-
 	if (buffer_unwritten(&bh))
 		complete_unwritten(&bh, !(result & VM_FAULT_ERROR));
 
diff --git a/mm/memory.c b/mm/memory.c
index 8132787ae4d5..13f76eb08f33 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2430,8 +2430,6 @@ void unmap_mapping_range(struct address_space *mapping,
 	if (details.last_index < details.first_index)
 		details.last_index = ULONG_MAX;
 
-
-	/* DAX uses i_mmap_lock to serialise file truncate vs page fault */
 	i_mmap_lock_write(mapping);
 	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
 		unmap_mapping_range_tree(&mapping->i_mmap, &details);
@@ -3019,12 +3017,6 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (fault_page) {
 			unlock_page(fault_page);
 			page_cache_release(fault_page);
-		} else {
-			/*
-			 * The fault handler has no page to lock, so it holds
-			 * i_mmap_lock for read to protect against truncate.
-			 */
-			i_mmap_unlock_read(vma->vm_file->f_mapping);
 		}
 		goto uncharge_out;
 	}
@@ -3035,12 +3027,6 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (fault_page) {
 		unlock_page(fault_page);
 		page_cache_release(fault_page);
-	} else {
-		/*
-		 * The fault handler has no page to lock, so it holds
-		 * i_mmap_lock for read to protect against truncate.
-		 */
-		i_mmap_unlock_read(vma->vm_file->f_mapping);
 	}
 	return ret;
 uncharge_out:
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* RE: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
@ 2016-03-10 19:55     ` Wilcox, Matthew R
  0 siblings, 0 replies; 64+ messages in thread
From: Wilcox, Matthew R @ 2016-03-10 19:55 UTC (permalink / raw)
  To: Jan Kara, linux-fsdevel
  Cc: Ross Zwisler, Williams, Dan J, linux-nvdimm, NeilBrown

This locking's still necessary.  i_mmap_sem has already been released by the time we're back in do_cow_fault(), so it doesn't protect that page, and truncate can have whizzed past, thinking there's nothing to unmap.  So a task can have a MAP_PRIVATE page still in its address space after it's supposed to have been unmapped.

We need a test suite for this ;-)

-----Original Message-----
From: Jan Kara [mailto:jack@suse.cz] 
Sent: Thursday, March 10, 2016 11:19 AM
To: linux-fsdevel@vger.kernel.org
Cc: Wilcox, Matthew R; Ross Zwisler; Williams, Dan J; linux-nvdimm@lists.01.org; NeilBrown; Jan Kara
Subject: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock

At one point DAX used i_mmap_lock to synchronize page faults with page
table invalidation during truncate. However, these days DAX uses
filesystem-specific RW semaphores to protect against these races
(i_mmap_sem in the ext2 & ext4 cases, XFS_MMAPLOCK in the xfs case). So
remove the unnecessary locking.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c    | 19 -------------------
 mm/memory.c | 14 --------------
 2 files changed, 33 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 9c4d697fb6fc..e409e8fc13b7 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -563,8 +563,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	pgoff_t size;
 	int error;
 
-	i_mmap_lock_read(mapping);
-
 	/*
 	 * Check truncate didn't happen while we were allocating a block.
 	 * If it did, this block may or may not be still allocated to the
@@ -597,8 +595,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	error = vm_insert_mixed(vma, vaddr, dax.pfn);
 
  out:
-	i_mmap_unlock_read(mapping);
-
 	return error;
 }
 
@@ -695,17 +691,6 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		if (error)
 			goto unlock_page;
 		vmf->page = page;
-		if (!page) {
-			i_mmap_lock_read(mapping);
-			/* Check we didn't race with truncate */
-			size = (i_size_read(inode) + PAGE_SIZE - 1) >>
-								PAGE_SHIFT;
-			if (vmf->pgoff >= size) {
-				i_mmap_unlock_read(mapping);
-				error = -EIO;
-				goto out;
-			}
-		}
 		return VM_FAULT_LOCKED;
 	}
 
@@ -895,8 +880,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		truncate_pagecache_range(inode, lstart, lend);
 	}
 
-	i_mmap_lock_read(mapping);
-
 	/*
 	 * If a truncate happened while we were allocating blocks, we may
 	 * leave blocks allocated to the file that are beyond EOF.  We can't
@@ -1013,8 +996,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	}
 
  out:
-	i_mmap_unlock_read(mapping);
-
 	if (buffer_unwritten(&bh))
 		complete_unwritten(&bh, !(result & VM_FAULT_ERROR));
 
diff --git a/mm/memory.c b/mm/memory.c
index 8132787ae4d5..13f76eb08f33 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2430,8 +2430,6 @@ void unmap_mapping_range(struct address_space *mapping,
 	if (details.last_index < details.first_index)
 		details.last_index = ULONG_MAX;
 
-
-	/* DAX uses i_mmap_lock to serialise file truncate vs page fault */
 	i_mmap_lock_write(mapping);
 	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
 		unmap_mapping_range_tree(&mapping->i_mmap, &details);
@@ -3019,12 +3017,6 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (fault_page) {
 			unlock_page(fault_page);
 			page_cache_release(fault_page);
-		} else {
-			/*
-			 * The fault handler has no page to lock, so it holds
-			 * i_mmap_lock for read to protect against truncate.
-			 */
-			i_mmap_unlock_read(vma->vm_file->f_mapping);
 		}
 		goto uncharge_out;
 	}
@@ -3035,12 +3027,6 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (fault_page) {
 		unlock_page(fault_page);
 		page_cache_release(fault_page);
-	} else {
-		/*
-		 * The fault handler has no page to lock, so it holds
-		 * i_mmap_lock for read to protect against truncate.
-		 */
-		i_mmap_unlock_read(vma->vm_file->f_mapping);
 	}
 	return ret;
 uncharge_out:
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
  2016-03-10 19:55     ` Wilcox, Matthew R
@ 2016-03-10 20:05       ` Jan Kara
  -1 siblings, 0 replies; 64+ messages in thread
From: Jan Kara @ 2016-03-10 20:05 UTC (permalink / raw)
  To: Wilcox, Matthew R; +Cc: Jan Kara, linux-nvdimm, NeilBrown, linux-fsdevel

On Thu 10-03-16 19:55:21, Wilcox, Matthew R wrote:
> This locking's still necessary.  i_mmap_sem has already been released by
> the time we're back in do_cow_fault(), so it doesn't protect that page,
> and truncate can have whizzed past and thinks there's nothing to unmap.
> So a task can have a MAP_PRIVATE page still in its address space after
> it's supposed to have been unmapped.

I don't think this is possible. The filesystem holds its inode->i_mmap_sem
for reading when handling the fault. That synchronizes against truncate...
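
To make that concrete, the filesystem-side pattern looks roughly like the
sketch below (modelled loosely on ext4; the get_block callback name and the
omitted write-fault bookkeeping such as sb_start_pagefault() are assumptions,
so this illustrates the locking rather than the actual tree code):

static int fs_dax_fault_sketch(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct inode *inode = file_inode(vma->vm_file);
	int ret;

	/* Excludes truncate / punch hole, which take i_mmap_sem for write. */
	down_read(&EXT4_I(inode)->i_mmap_sem);
	ret = __dax_fault(vma, vmf, ext4_dax_get_block); /* callback name illustrative */
	up_read(&EXT4_I(inode)->i_mmap_sem);

	return ret;
}

Because the rwsem is held for read across all of __dax_fault(), a truncate
that wants to shrink i_size and unmap pages has to wait until the fault has
finished (or failed) before it can proceed.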

								Honza

> -----Original Message-----
> From: Jan Kara [mailto:jack@suse.cz] 
> Sent: Thursday, March 10, 2016 11:19 AM
> To: linux-fsdevel@vger.kernel.org
> Cc: Wilcox, Matthew R; Ross Zwisler; Williams, Dan J; linux-nvdimm@lists.01.org; NeilBrown; Jan Kara
> Subject: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
> 
> At one point DAX used i_mmap_lock to synchronize page faults with page
> table invalidation during truncate. However these days DAX uses
> filesystem specific RW semaphores to protect against these races
> (i_mmap_sem in ext2 & ext4 cases, XFS_MMAPLOCK in xfs case). So remove
> the unnecessary locking.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/dax.c    | 19 -------------------
>  mm/memory.c | 14 --------------
>  2 files changed, 33 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 9c4d697fb6fc..e409e8fc13b7 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -563,8 +563,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>  	pgoff_t size;
>  	int error;
>  
> -	i_mmap_lock_read(mapping);
> -
>  	/*
>  	 * Check truncate didn't happen while we were allocating a block.
>  	 * If it did, this block may or may not be still allocated to the
> @@ -597,8 +595,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>  	error = vm_insert_mixed(vma, vaddr, dax.pfn);
>  
>   out:
> -	i_mmap_unlock_read(mapping);
> -
>  	return error;
>  }
>  
> @@ -695,17 +691,6 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		if (error)
>  			goto unlock_page;
>  		vmf->page = page;
> -		if (!page) {
> -			i_mmap_lock_read(mapping);
> -			/* Check we didn't race with truncate */
> -			size = (i_size_read(inode) + PAGE_SIZE - 1) >>
> -								PAGE_SHIFT;
> -			if (vmf->pgoff >= size) {
> -				i_mmap_unlock_read(mapping);
> -				error = -EIO;
> -				goto out;
> -			}
> -		}
>  		return VM_FAULT_LOCKED;
>  	}
>  
> @@ -895,8 +880,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  		truncate_pagecache_range(inode, lstart, lend);
>  	}
>  
> -	i_mmap_lock_read(mapping);
> -
>  	/*
>  	 * If a truncate happened while we were allocating blocks, we may
>  	 * leave blocks allocated to the file that are beyond EOF.  We can't
> @@ -1013,8 +996,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  	}
>  
>   out:
> -	i_mmap_unlock_read(mapping);
> -
>  	if (buffer_unwritten(&bh))
>  		complete_unwritten(&bh, !(result & VM_FAULT_ERROR));
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index 8132787ae4d5..13f76eb08f33 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2430,8 +2430,6 @@ void unmap_mapping_range(struct address_space *mapping,
>  	if (details.last_index < details.first_index)
>  		details.last_index = ULONG_MAX;
>  
> -
> -	/* DAX uses i_mmap_lock to serialise file truncate vs page fault */
>  	i_mmap_lock_write(mapping);
>  	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
>  		unmap_mapping_range_tree(&mapping->i_mmap, &details);
> @@ -3019,12 +3017,6 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  		if (fault_page) {
>  			unlock_page(fault_page);
>  			page_cache_release(fault_page);
> -		} else {
> -			/*
> -			 * The fault handler has no page to lock, so it holds
> -			 * i_mmap_lock for read to protect against truncate.
> -			 */
> -			i_mmap_unlock_read(vma->vm_file->f_mapping);
>  		}
>  		goto uncharge_out;
>  	}
> @@ -3035,12 +3027,6 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (fault_page) {
>  		unlock_page(fault_page);
>  		page_cache_release(fault_page);
> -	} else {
> -		/*
> -		 * The fault handler has no page to lock, so it holds
> -		 * i_mmap_lock for read to protect against truncate.
> -		 */
> -		i_mmap_unlock_read(vma->vm_file->f_mapping);
>  	}
>  	return ret;
>  uncharge_out:
> -- 
> 2.6.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
@ 2016-03-10 20:05       ` Jan Kara
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Kara @ 2016-03-10 20:05 UTC (permalink / raw)
  To: Wilcox, Matthew R
  Cc: Jan Kara, linux-fsdevel, Ross Zwisler, Williams, Dan J,
	linux-nvdimm, NeilBrown

On Thu 10-03-16 19:55:21, Wilcox, Matthew R wrote:
> This locking's still necessary.  i_mmap_sem has already been released by
> the time we're back in do_cow_fault(), so it doesn't protect that page,
> and truncate can have whizzed past and thinks there's nothing to unmap.
> So a task can have a MAP_PRIVATE page still in its address space after
> it's supposed to have been unmapped.

I don't think this is possible. The filesystem holds its inode->i_mmap_sem
for reading when handling the fault. That synchronizes against truncate...

								Honza

> -----Original Message-----
> From: Jan Kara [mailto:jack@suse.cz] 
> Sent: Thursday, March 10, 2016 11:19 AM
> To: linux-fsdevel@vger.kernel.org
> Cc: Wilcox, Matthew R; Ross Zwisler; Williams, Dan J; linux-nvdimm@lists.01.org; NeilBrown; Jan Kara
> Subject: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
> 
> At one point DAX used i_mmap_lock to synchronize page faults with page
> table invalidation during truncate. However these days DAX uses
> filesystem specific RW semaphores to protect against these races
> (i_mmap_sem in ext2 & ext4 cases, XFS_MMAPLOCK in xfs case). So remove
> the unnecessary locking.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/dax.c    | 19 -------------------
>  mm/memory.c | 14 --------------
>  2 files changed, 33 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 9c4d697fb6fc..e409e8fc13b7 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -563,8 +563,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>  	pgoff_t size;
>  	int error;
>  
> -	i_mmap_lock_read(mapping);
> -
>  	/*
>  	 * Check truncate didn't happen while we were allocating a block.
>  	 * If it did, this block may or may not be still allocated to the
> @@ -597,8 +595,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>  	error = vm_insert_mixed(vma, vaddr, dax.pfn);
>  
>   out:
> -	i_mmap_unlock_read(mapping);
> -
>  	return error;
>  }
>  
> @@ -695,17 +691,6 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		if (error)
>  			goto unlock_page;
>  		vmf->page = page;
> -		if (!page) {
> -			i_mmap_lock_read(mapping);
> -			/* Check we didn't race with truncate */
> -			size = (i_size_read(inode) + PAGE_SIZE - 1) >>
> -								PAGE_SHIFT;
> -			if (vmf->pgoff >= size) {
> -				i_mmap_unlock_read(mapping);
> -				error = -EIO;
> -				goto out;
> -			}
> -		}
>  		return VM_FAULT_LOCKED;
>  	}
>  
> @@ -895,8 +880,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  		truncate_pagecache_range(inode, lstart, lend);
>  	}
>  
> -	i_mmap_lock_read(mapping);
> -
>  	/*
>  	 * If a truncate happened while we were allocating blocks, we may
>  	 * leave blocks allocated to the file that are beyond EOF.  We can't
> @@ -1013,8 +996,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  	}
>  
>   out:
> -	i_mmap_unlock_read(mapping);
> -
>  	if (buffer_unwritten(&bh))
>  		complete_unwritten(&bh, !(result & VM_FAULT_ERROR));
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index 8132787ae4d5..13f76eb08f33 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2430,8 +2430,6 @@ void unmap_mapping_range(struct address_space *mapping,
>  	if (details.last_index < details.first_index)
>  		details.last_index = ULONG_MAX;
>  
> -
> -	/* DAX uses i_mmap_lock to serialise file truncate vs page fault */
>  	i_mmap_lock_write(mapping);
>  	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
>  		unmap_mapping_range_tree(&mapping->i_mmap, &details);
> @@ -3019,12 +3017,6 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  		if (fault_page) {
>  			unlock_page(fault_page);
>  			page_cache_release(fault_page);
> -		} else {
> -			/*
> -			 * The fault handler has no page to lock, so it holds
> -			 * i_mmap_lock for read to protect against truncate.
> -			 */
> -			i_mmap_unlock_read(vma->vm_file->f_mapping);
>  		}
>  		goto uncharge_out;
>  	}
> @@ -3035,12 +3027,6 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (fault_page) {
>  		unlock_page(fault_page);
>  		page_cache_release(fault_page);
> -	} else {
> -		/*
> -		 * The fault handler has no page to lock, so it holds
> -		 * i_mmap_lock for read to protect against truncate.
> -		 */
> -		i_mmap_unlock_read(vma->vm_file->f_mapping);
>  	}
>  	return ret;
>  uncharge_out:
> -- 
> 2.6.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
  2016-03-10 20:05       ` Jan Kara
@ 2016-03-10 20:10         ` Wilcox, Matthew R
  -1 siblings, 0 replies; 64+ messages in thread
From: Wilcox, Matthew R @ 2016-03-10 20:10 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, NeilBrown, linux-nvdimm

Here's the race:

CPU 0				CPU 1
do_cow_fault()
__do_fault()
takes sem
dax_fault()
releases sem
				truncate()
				unmap_mapping_range()
				i_mmap_lock_write()
				unmap_mapping_range_tree()
				i_mmap_unlock_write()
do_set_pte()

Holding i_mmap_lock_read() from inside __do_fault() prevents the truncate from proceeding until the page is inserted with do_set_pte().
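
Spelling that ordering out as a sketch (the helper below is a placeholder
and the do_set_pte() arguments are deliberately elided -- this shows the
shape of the fix, it is not mm/memory.c code):

static int do_cow_fault_sketch(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct address_space *mapping = vma->vm_file->f_mapping;
	int ret;

	/*
	 * Truncate's unmap_mapping_range() takes i_mmap_lock_write(), so it
	 * cannot scan this VMA until the read lock is dropped below.
	 */
	i_mmap_lock_read(mapping);

	ret = dax_fault_and_copy_to_cow_page(vma, vmf);	/* placeholder helper */
	if (!(ret & VM_FAULT_ERROR)) {
		/*
		 * do_set_pte(...): the private PTE becomes visible here,
		 * still under i_mmap_lock_read(), so a later truncate will
		 * find it and unmap it.
		 */
	}

	i_mmap_unlock_read(mapping);
	return ret;
}

With the patch, nothing is held across the return to do_cow_fault(), which
opens exactly the window shown in the diagram above.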


-----Original Message-----
From: Jan Kara [mailto:jack@suse.cz] 
Sent: Thursday, March 10, 2016 12:05 PM
To: Wilcox, Matthew R
Cc: Jan Kara; linux-fsdevel@vger.kernel.org; Ross Zwisler; Williams, Dan J; linux-nvdimm@lists.01.org; NeilBrown
Subject: Re: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock

On Thu 10-03-16 19:55:21, Wilcox, Matthew R wrote:
> This locking's still necessary.  i_mmap_sem has already been released by
> the time we're back in do_cow_fault(), so it doesn't protect that page,
> and truncate can have whizzed past and thinks there's nothing to unmap.
> So a task can have a MAP_PRIVATE page still in its address space after
> it's supposed to have been unmapped.

I don't think this is possible. Filesystem holds its inode->i_mmap_sem for
reading when handling the fault. That synchronizes against truncate...

								Honza

> -----Original Message-----
> From: Jan Kara [mailto:jack@suse.cz] 
> Sent: Thursday, March 10, 2016 11:19 AM
> To: linux-fsdevel@vger.kernel.org
> Cc: Wilcox, Matthew R; Ross Zwisler; Williams, Dan J; linux-nvdimm@lists.01.org; NeilBrown; Jan Kara
> Subject: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
> 
> At one point DAX used i_mmap_lock to synchronize page faults with page
> table invalidation during truncate. However these days DAX uses
> filesystem specific RW semaphores to protect against these races
> (i_mmap_sem in ext2 & ext4 cases, XFS_MMAPLOCK in xfs case). So remove
> the unnecessary locking.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/dax.c    | 19 -------------------
>  mm/memory.c | 14 --------------
>  2 files changed, 33 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 9c4d697fb6fc..e409e8fc13b7 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -563,8 +563,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>  	pgoff_t size;
>  	int error;
>  
> -	i_mmap_lock_read(mapping);
> -
>  	/*
>  	 * Check truncate didn't happen while we were allocating a block.
>  	 * If it did, this block may or may not be still allocated to the
> @@ -597,8 +595,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>  	error = vm_insert_mixed(vma, vaddr, dax.pfn);
>  
>   out:
> -	i_mmap_unlock_read(mapping);
> -
>  	return error;
>  }
>  
> @@ -695,17 +691,6 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		if (error)
>  			goto unlock_page;
>  		vmf->page = page;
> -		if (!page) {
> -			i_mmap_lock_read(mapping);
> -			/* Check we didn't race with truncate */
> -			size = (i_size_read(inode) + PAGE_SIZE - 1) >>
> -								PAGE_SHIFT;
> -			if (vmf->pgoff >= size) {
> -				i_mmap_unlock_read(mapping);
> -				error = -EIO;
> -				goto out;
> -			}
> -		}
>  		return VM_FAULT_LOCKED;
>  	}
>  
> @@ -895,8 +880,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  		truncate_pagecache_range(inode, lstart, lend);
>  	}
>  
> -	i_mmap_lock_read(mapping);
> -
>  	/*
>  	 * If a truncate happened while we were allocating blocks, we may
>  	 * leave blocks allocated to the file that are beyond EOF.  We can't
> @@ -1013,8 +996,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  	}
>  
>   out:
> -	i_mmap_unlock_read(mapping);
> -
>  	if (buffer_unwritten(&bh))
>  		complete_unwritten(&bh, !(result & VM_FAULT_ERROR));
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index 8132787ae4d5..13f76eb08f33 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2430,8 +2430,6 @@ void unmap_mapping_range(struct address_space *mapping,
>  	if (details.last_index < details.first_index)
>  		details.last_index = ULONG_MAX;
>  
> -
> -	/* DAX uses i_mmap_lock to serialise file truncate vs page fault */
>  	i_mmap_lock_write(mapping);
>  	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
>  		unmap_mapping_range_tree(&mapping->i_mmap, &details);
> @@ -3019,12 +3017,6 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  		if (fault_page) {
>  			unlock_page(fault_page);
>  			page_cache_release(fault_page);
> -		} else {
> -			/*
> -			 * The fault handler has no page to lock, so it holds
> -			 * i_mmap_lock for read to protect against truncate.
> -			 */
> -			i_mmap_unlock_read(vma->vm_file->f_mapping);
>  		}
>  		goto uncharge_out;
>  	}
> @@ -3035,12 +3027,6 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (fault_page) {
>  		unlock_page(fault_page);
>  		page_cache_release(fault_page);
> -	} else {
> -		/*
> -		 * The fault handler has no page to lock, so it holds
> -		 * i_mmap_lock for read to protect against truncate.
> -		 */
> -		i_mmap_unlock_read(vma->vm_file->f_mapping);
>  	}
>  	return ret;
>  uncharge_out:
> -- 
> 2.6.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
@ 2016-03-10 20:10         ` Wilcox, Matthew R
  0 siblings, 0 replies; 64+ messages in thread
From: Wilcox, Matthew R @ 2016-03-10 20:10 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Ross Zwisler, Williams, Dan J, linux-nvdimm, NeilBrown

Here's the race:

CPU 0				CPU 1
do_cow_fault()
__do_fault()
takes sem
dax_fault()
releases sem
				truncate()
				unmap_mapping_range()
				i_mmap_lock_write()
				unmap_mapping_range_tree()
				i_mmap_unlock_write()
do_set_pte()

Holding i_mmap_lock_read() from inside __do_fault() prevents the truncate from proceeding until the page is inserted with do_set_pte().


-----Original Message-----
From: Jan Kara [mailto:jack@suse.cz] 
Sent: Thursday, March 10, 2016 12:05 PM
To: Wilcox, Matthew R
Cc: Jan Kara; linux-fsdevel@vger.kernel.org; Ross Zwisler; Williams, Dan J; linux-nvdimm@lists.01.org; NeilBrown
Subject: Re: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock

On Thu 10-03-16 19:55:21, Wilcox, Matthew R wrote:
> This locking's still necessary.  i_mmap_sem has already been released by
> the time we're back in do_cow_fault(), so it doesn't protect that page,
> and truncate can have whizzed past and thinks there's nothing to unmap.
> So a task can have a MAP_PRIVATE page still in its address space after
> it's supposed to have been unmapped.

I don't think this is possible. Filesystem holds its inode->i_mmap_sem for
reading when handling the fault. That synchronizes against truncate...

								Honza

> -----Original Message-----
> From: Jan Kara [mailto:jack@suse.cz] 
> Sent: Thursday, March 10, 2016 11:19 AM
> To: linux-fsdevel@vger.kernel.org
> Cc: Wilcox, Matthew R; Ross Zwisler; Williams, Dan J; linux-nvdimm@lists.01.org; NeilBrown; Jan Kara
> Subject: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
> 
> At one point DAX used i_mmap_lock to synchronize page faults with page
> table invalidation during truncate. However these days DAX uses
> filesystem specific RW semaphores to protect against these races
> (i_mmap_sem in ext2 & ext4 cases, XFS_MMAPLOCK in xfs case). So remove
> the unnecessary locking.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/dax.c    | 19 -------------------
>  mm/memory.c | 14 --------------
>  2 files changed, 33 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 9c4d697fb6fc..e409e8fc13b7 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -563,8 +563,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>  	pgoff_t size;
>  	int error;
>  
> -	i_mmap_lock_read(mapping);
> -
>  	/*
>  	 * Check truncate didn't happen while we were allocating a block.
>  	 * If it did, this block may or may not be still allocated to the
> @@ -597,8 +595,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>  	error = vm_insert_mixed(vma, vaddr, dax.pfn);
>  
>   out:
> -	i_mmap_unlock_read(mapping);
> -
>  	return error;
>  }
>  
> @@ -695,17 +691,6 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		if (error)
>  			goto unlock_page;
>  		vmf->page = page;
> -		if (!page) {
> -			i_mmap_lock_read(mapping);
> -			/* Check we didn't race with truncate */
> -			size = (i_size_read(inode) + PAGE_SIZE - 1) >>
> -								PAGE_SHIFT;
> -			if (vmf->pgoff >= size) {
> -				i_mmap_unlock_read(mapping);
> -				error = -EIO;
> -				goto out;
> -			}
> -		}
>  		return VM_FAULT_LOCKED;
>  	}
>  
> @@ -895,8 +880,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  		truncate_pagecache_range(inode, lstart, lend);
>  	}
>  
> -	i_mmap_lock_read(mapping);
> -
>  	/*
>  	 * If a truncate happened while we were allocating blocks, we may
>  	 * leave blocks allocated to the file that are beyond EOF.  We can't
> @@ -1013,8 +996,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  	}
>  
>   out:
> -	i_mmap_unlock_read(mapping);
> -
>  	if (buffer_unwritten(&bh))
>  		complete_unwritten(&bh, !(result & VM_FAULT_ERROR));
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index 8132787ae4d5..13f76eb08f33 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2430,8 +2430,6 @@ void unmap_mapping_range(struct address_space *mapping,
>  	if (details.last_index < details.first_index)
>  		details.last_index = ULONG_MAX;
>  
> -
> -	/* DAX uses i_mmap_lock to serialise file truncate vs page fault */
>  	i_mmap_lock_write(mapping);
>  	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
>  		unmap_mapping_range_tree(&mapping->i_mmap, &details);
> @@ -3019,12 +3017,6 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  		if (fault_page) {
>  			unlock_page(fault_page);
>  			page_cache_release(fault_page);
> -		} else {
> -			/*
> -			 * The fault handler has no page to lock, so it holds
> -			 * i_mmap_lock for read to protect against truncate.
> -			 */
> -			i_mmap_unlock_read(vma->vm_file->f_mapping);
>  		}
>  		goto uncharge_out;
>  	}
> @@ -3035,12 +3027,6 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (fault_page) {
>  		unlock_page(fault_page);
>  		page_cache_release(fault_page);
> -	} else {
> -		/*
> -		 * The fault handler has no page to lock, so it holds
> -		 * i_mmap_lock for read to protect against truncate.
> -		 */
> -		i_mmap_unlock_read(vma->vm_file->f_mapping);
>  	}
>  	return ret;
>  uncharge_out:
> -- 
> 2.6.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 12/12] dax: New fault locking
  2016-03-10 19:18   ` Jan Kara
@ 2016-03-10 23:54     ` NeilBrown
  -1 siblings, 0 replies; 64+ messages in thread
From: NeilBrown @ 2016-03-10 23:54 UTC (permalink / raw)
  To: Jan Kara, linux-fsdevel
  Cc: Wilcox, Matthew R, Ross Zwisler, Dan Williams, linux-nvdimm


On Fri, Mar 11 2016, Jan Kara wrote:

> Currently DAX page fault locking is racy.
>
> CPU0 (write fault)		CPU1 (read fault)
>
> __dax_fault()			__dax_fault()
>   get_block(inode, block, &bh, 0) -> not mapped
> 				  get_block(inode, block, &bh, 0)
> 				    -> not mapped
>   if (!buffer_mapped(&bh))
>     if (vmf->flags & FAULT_FLAG_WRITE)
>       get_block(inode, block, &bh, 1) -> allocates blocks
>   if (page) -> no
> 				  if (!buffer_mapped(&bh))
> 				    if (vmf->flags & FAULT_FLAG_WRITE) {
> 				    } else {
> 				      dax_load_hole();
> 				    }
>   dax_insert_mapping()
>
> And we are in a situation where we fail in dax_radix_entry() with -EIO.
>
> Another problem with the current DAX page fault locking is that there is
> no race-free way to clear the dirty tag in the radix tree. We can always
> end up with a clean radix tree and dirty data in the CPU cache.
>
> We fix the first problem by introducing locking of exceptional radix
> tree entries in DAX mappings, acting very similarly to the page lock and
> thus properly synchronizing faults against the same mapping index. The same
> lock can later be used to avoid races when clearing the radix tree dirty
> tag.

Hi,
 I think the exception locking bits look good - I cannot comment on the
 rest.
 It looks like it was a good idea to bring the locking into dax.c instead
 of trying to make it generic.
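
 For readers following along, one way to realize the "locking of exceptional
 radix tree entries" quoted above is to dedicate a bit of the entry value as
 a lock bit, manipulated under mapping->tree_lock.  A stripped-down sketch
 follows; the name and choice of bit are assumptions (the series rearranges
 the entry layout to make room for a real lock bit), and the real fs/dax.c
 code also has to sleep and wake waiters rather than just fail:

#define DAX_ENTRY_LOCK_SKETCH	0x10UL	/* assumed spare bit */

/*
 * Try to lock the exceptional entry at @index.  Returns the old entry on
 * success, NULL if there is no exceptional entry or it is already locked.
 */
static void *dax_trylock_entry_sketch(struct address_space *mapping,
				      pgoff_t index)
{
	void **slot, *entry;

	spin_lock_irq(&mapping->tree_lock);
	entry = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
	if (!entry || !radix_tree_exceptional_entry(entry) ||
	    ((unsigned long)entry & DAX_ENTRY_LOCK_SKETCH)) {
		entry = NULL;
	} else {
		/* Mark the entry locked in place; the tree keeps holding it. */
		radix_tree_replace_slot(slot,
			(void *)((unsigned long)entry | DAX_ENTRY_LOCK_SKETCH));
	}
	spin_unlock_irq(&mapping->tree_lock);
	return entry;
}

 Unlocking is the mirror image: look the slot up under tree_lock, clear the
 bit with radix_tree_replace_slot(), and wake anyone waiting on that index.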

Thanks,
NeilBrown


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 12/12] dax: New fault locking
@ 2016-03-10 23:54     ` NeilBrown
  0 siblings, 0 replies; 64+ messages in thread
From: NeilBrown @ 2016-03-10 23:54 UTC (permalink / raw)
  To: Jan Kara, linux-fsdevel
  Cc: Wilcox, Matthew R, Ross Zwisler, Dan Williams, linux-nvdimm, Jan Kara


On Fri, Mar 11 2016, Jan Kara wrote:

> Currently DAX page fault locking is racy.
>
> CPU0 (write fault)		CPU1 (read fault)
>
> __dax_fault()			__dax_fault()
>   get_block(inode, block, &bh, 0) -> not mapped
> 				  get_block(inode, block, &bh, 0)
> 				    -> not mapped
>   if (!buffer_mapped(&bh))
>     if (vmf->flags & FAULT_FLAG_WRITE)
>       get_block(inode, block, &bh, 1) -> allocates blocks
>   if (page) -> no
> 				  if (!buffer_mapped(&bh))
> 				    if (vmf->flags & FAULT_FLAG_WRITE) {
> 				    } else {
> 				      dax_load_hole();
> 				    }
>   dax_insert_mapping()
>
> And we are in a situation where we fail in dax_radix_entry() with -EIO.
>
> Another problem with the current DAX page fault locking is that there is
> no race-free way to clear the dirty tag in the radix tree. We can always
> end up with a clean radix tree and dirty data in the CPU cache.
>
> We fix the first problem by introducing locking of exceptional radix
> tree entries in DAX mappings, acting very similarly to the page lock and
> thus properly synchronizing faults against the same mapping index. The same
> lock can later be used to avoid races when clearing the radix tree dirty
> tag.

Hi,
 I think the exception locking bits look good - I cannot comment on the
 rest.
 It looks like it was a good idea to bring the locking into dax.c instead
 of trying to make it generic.

Thanks,
NeilBrown


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 01/12] DAX: move RADIX_DAX_ definitions to dax.c
  2016-03-10 19:18   ` Jan Kara
@ 2016-03-11 22:54     ` Ross Zwisler
  -1 siblings, 0 replies; 64+ messages in thread
From: Ross Zwisler @ 2016-03-11 22:54 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-nvdimm, NeilBrown, Wilcox

On Thu, Mar 10, 2016 at 08:18:44PM +0100, Jan Kara wrote:
> From: NeilBrown <neilb@suse.com>
> 
> These don't belong in radix-tree.c any more than PAGECACHE_TAG_* do.
> Let's try to maintain the idea that radix-tree simply implements an
> abstract data type.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Jan Kara <jack@suse.cz>

Acked-by: Ross Zwisler <ross.zwisler@linux.intel.com>

> ---
>  fs/dax.c                   | 9 +++++++++
>  include/linux/radix-tree.h | 9 ---------
>  2 files changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 711172450da6..9c4d697fb6fc 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -32,6 +32,15 @@
>  #include <linux/pfn_t.h>
>  #include <linux/sizes.h>
>  
> +#define RADIX_DAX_MASK	0xf
> +#define RADIX_DAX_SHIFT	4
> +#define RADIX_DAX_PTE  (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
> +#define RADIX_DAX_PMD  (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
> +#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_MASK)
> +#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
> +#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
> +		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
> +
>  static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
>  {
>  	struct request_queue *q = bdev->bd_queue;
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index f54be7082207..968150ab8a1c 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -51,15 +51,6 @@
>  #define RADIX_TREE_EXCEPTIONAL_ENTRY	2
>  #define RADIX_TREE_EXCEPTIONAL_SHIFT	2
>  
> -#define RADIX_DAX_MASK	0xf
> -#define RADIX_DAX_SHIFT	4
> -#define RADIX_DAX_PTE  (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
> -#define RADIX_DAX_PMD  (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
> -#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_MASK)
> -#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
> -#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
> -		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
> -
>  static inline int radix_tree_is_indirect_ptr(void *ptr)
>  {
>  	return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR);
> -- 
> 2.6.2
> 
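
For reference, the macros being moved simply pack a sector number and an
entry size into the low bits of an exceptional entry.  Illustrative usage
only (not part of the patch):

	sector_t sector = 12345;
	void *entry = RADIX_DAX_ENTRY(sector, false);	/* PTE-sized entry */

	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PTE)
		sector = RADIX_DAX_SECTOR(entry);	/* recovers 12345 */
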
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 01/12] DAX: move RADIX_DAX_ definitions to dax.c
@ 2016-03-11 22:54     ` Ross Zwisler
  0 siblings, 0 replies; 64+ messages in thread
From: Ross Zwisler @ 2016-03-11 22:54 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Wilcox, Matthew R, Ross Zwisler, Dan Williams,
	linux-nvdimm, NeilBrown

On Thu, Mar 10, 2016 at 08:18:44PM +0100, Jan Kara wrote:
> From: NeilBrown <neilb@suse.com>
> 
> These don't belong in radix-tree.c any more than PAGECACHE_TAG_* do.
> Let's try to maintain the idea that radix-tree simply implements an
> abstract data type.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Jan Kara <jack@suse.cz>

Acked-by: Ross Zwisler <ross.zwisler@linux.intel.com>

> ---
>  fs/dax.c                   | 9 +++++++++
>  include/linux/radix-tree.h | 9 ---------
>  2 files changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 711172450da6..9c4d697fb6fc 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -32,6 +32,15 @@
>  #include <linux/pfn_t.h>
>  #include <linux/sizes.h>
>  
> +#define RADIX_DAX_MASK	0xf
> +#define RADIX_DAX_SHIFT	4
> +#define RADIX_DAX_PTE  (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
> +#define RADIX_DAX_PMD  (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
> +#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_MASK)
> +#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
> +#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
> +		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
> +
>  static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
>  {
>  	struct request_queue *q = bdev->bd_queue;
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index f54be7082207..968150ab8a1c 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -51,15 +51,6 @@
>  #define RADIX_TREE_EXCEPTIONAL_ENTRY	2
>  #define RADIX_TREE_EXCEPTIONAL_SHIFT	2
>  
> -#define RADIX_DAX_MASK	0xf
> -#define RADIX_DAX_SHIFT	4
> -#define RADIX_DAX_PTE  (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
> -#define RADIX_DAX_PMD  (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
> -#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_MASK)
> -#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
> -#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
> -		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
> -
>  static inline int radix_tree_is_indirect_ptr(void *ptr)
>  {
>  	return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR);
> -- 
> 2.6.2
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
  2016-03-10 20:10         ` Wilcox, Matthew R
@ 2016-03-14 10:01           ` Jan Kara
  -1 siblings, 0 replies; 64+ messages in thread
From: Jan Kara @ 2016-03-14 10:01 UTC (permalink / raw)
  To: Wilcox, Matthew R; +Cc: Jan Kara, linux-nvdimm, NeilBrown, linux-fsdevel

On Thu 10-03-16 20:10:09, Wilcox, Matthew R wrote:
> Here's the race:
> 
> CPU 0				CPU 1
> do_cow_fault()
> __do_fault()
> takes sem
> dax_fault()
> releases sem
> 				truncate()
> 				unmap_mapping_range()
> 				i_mmap_lock_write()
> 				unmap_mapping_range_tree()
> 				i_mmap_unlock_write()
> do_set_pte()
> 
> Holding i_mmap_lock_read() from inside __do_fault() prevents the truncate
> from proceeding until the page is inserted with do_set_pte().

Ah, right. Thanks for reminding me. I was hoping to get rid of this
i_mmap_lock abuse in DAX code but obviously it needs more work :).
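
For reference, the filesystem-level protection in question looks roughly
like the sketch below; the function names are hypothetical, not the exact
ext4/XFS code.  Note that the rwsem only covers the fault handler itself,
which is exactly why the window between its return and do_set_pte() is the
problem for the COW case:

	/* Sketch only: per-inode rwsem serializing faults against truncate */
	static int example_dax_fault(struct vm_area_struct *vma,
				     struct vm_fault *vmf)
	{
		struct inode *inode = file_inode(vma->vm_file);
		int ret;

		down_read(&EXT4_I(inode)->i_mmap_sem); /* truncate takes it for write */
		ret = __dax_fault(vma, vmf, example_get_block, NULL);
		up_read(&EXT4_I(inode)->i_mmap_sem);   /* protection ends here */

		return ret;
	}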

								Honza

> -----Original Message-----
> From: Jan Kara [mailto:jack@suse.cz] 
> Sent: Thursday, March 10, 2016 12:05 PM
> To: Wilcox, Matthew R
> Cc: Jan Kara; linux-fsdevel@vger.kernel.org; Ross Zwisler; Williams, Dan J; linux-nvdimm@lists.01.org; NeilBrown
> Subject: Re: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
> 
> On Thu 10-03-16 19:55:21, Wilcox, Matthew R wrote:
> > This locking's still necessary.  i_mmap_sem has already been released by
> > the time we're back in do_cow_fault(), so it doesn't protect that page,
> > and truncate can have whizzed past and thinks there's nothing to unmap.
> > So a task can have a MAP_PRIVATE page still in its address space after
> > it's supposed to have been unmapped.
> 
> I don't think this is possible. Filesystem holds its inode->i_mmap_sem for
> reading when handling the fault. That synchronizes against truncate...
> 
> 								Honza
> 
> > -----Original Message-----
> > From: Jan Kara [mailto:jack@suse.cz] 
> > Sent: Thursday, March 10, 2016 11:19 AM
> > To: linux-fsdevel@vger.kernel.org
> > Cc: Wilcox, Matthew R; Ross Zwisler; Williams, Dan J; linux-nvdimm@lists.01.org; NeilBrown; Jan Kara
> > Subject: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
> > 
> > At one point DAX used i_mmap_lock to synchronize page faults with page
> > table invalidation during truncate. However these days DAX uses
> > filesystem specific RW semaphores to protect against these races
> > (i_mmap_sem in ext2 & ext4 cases, XFS_MMAPLOCK in xfs case). So remove
> > the unnecessary locking.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/dax.c    | 19 -------------------
> >  mm/memory.c | 14 --------------
> >  2 files changed, 33 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 9c4d697fb6fc..e409e8fc13b7 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -563,8 +563,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> >  	pgoff_t size;
> >  	int error;
> >  
> > -	i_mmap_lock_read(mapping);
> > -
> >  	/*
> >  	 * Check truncate didn't happen while we were allocating a block.
> >  	 * If it did, this block may or may not be still allocated to the
> > @@ -597,8 +595,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> >  	error = vm_insert_mixed(vma, vaddr, dax.pfn);
> >  
> >   out:
> > -	i_mmap_unlock_read(mapping);
> > -
> >  	return error;
> >  }
> >  
> > @@ -695,17 +691,6 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  		if (error)
> >  			goto unlock_page;
> >  		vmf->page = page;
> > -		if (!page) {
> > -			i_mmap_lock_read(mapping);
> > -			/* Check we didn't race with truncate */
> > -			size = (i_size_read(inode) + PAGE_SIZE - 1) >>
> > -								PAGE_SHIFT;
> > -			if (vmf->pgoff >= size) {
> > -				i_mmap_unlock_read(mapping);
> > -				error = -EIO;
> > -				goto out;
> > -			}
> > -		}
> >  		return VM_FAULT_LOCKED;
> >  	}
> >  
> > @@ -895,8 +880,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >  		truncate_pagecache_range(inode, lstart, lend);
> >  	}
> >  
> > -	i_mmap_lock_read(mapping);
> > -
> >  	/*
> >  	 * If a truncate happened while we were allocating blocks, we may
> >  	 * leave blocks allocated to the file that are beyond EOF.  We can't
> > @@ -1013,8 +996,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >  	}
> >  
> >   out:
> > -	i_mmap_unlock_read(mapping);
> > -
> >  	if (buffer_unwritten(&bh))
> >  		complete_unwritten(&bh, !(result & VM_FAULT_ERROR));
> >  
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 8132787ae4d5..13f76eb08f33 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2430,8 +2430,6 @@ void unmap_mapping_range(struct address_space *mapping,
> >  	if (details.last_index < details.first_index)
> >  		details.last_index = ULONG_MAX;
> >  
> > -
> > -	/* DAX uses i_mmap_lock to serialise file truncate vs page fault */
> >  	i_mmap_lock_write(mapping);
> >  	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
> >  		unmap_mapping_range_tree(&mapping->i_mmap, &details);
> > @@ -3019,12 +3017,6 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  		if (fault_page) {
> >  			unlock_page(fault_page);
> >  			page_cache_release(fault_page);
> > -		} else {
> > -			/*
> > -			 * The fault handler has no page to lock, so it holds
> > -			 * i_mmap_lock for read to protect against truncate.
> > -			 */
> > -			i_mmap_unlock_read(vma->vm_file->f_mapping);
> >  		}
> >  		goto uncharge_out;
> >  	}
> > @@ -3035,12 +3027,6 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  	if (fault_page) {
> >  		unlock_page(fault_page);
> >  		page_cache_release(fault_page);
> > -	} else {
> > -		/*
> > -		 * The fault handler has no page to lock, so it holds
> > -		 * i_mmap_lock for read to protect against truncate.
> > -		 */
> > -		i_mmap_unlock_read(vma->vm_file->f_mapping);
> >  	}
> >  	return ret;
> >  uncharge_out:
> > -- 
> > 2.6.2
> > 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
@ 2016-03-14 10:01           ` Jan Kara
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Kara @ 2016-03-14 10:01 UTC (permalink / raw)
  To: Wilcox, Matthew R
  Cc: Jan Kara, linux-fsdevel, Ross Zwisler, Williams, Dan J,
	linux-nvdimm, NeilBrown

On Thu 10-03-16 20:10:09, Wilcox, Matthew R wrote:
> Here's the race:
> 
> CPU 0				CPU 1
> do_cow_fault()
> __do_fault()
> takes sem
> dax_fault()
> releases sem
> 				truncate()
> 				unmap_mapping_range()
> 				i_mmap_lock_write()
> 				unmap_mapping_range_tree()
> 				i_mmap_unlock_write()
> do_set_pte()
> 
> Holding i_mmap_lock_read() from inside __do_fault() prevents the truncate
> from proceeding until the page is inserted with do_set_pte().

Ah, right. Thanks for reminding me. I was hoping to get rid of this
i_mmap_lock abuse in DAX code but obviously it needs more work :).

								Honza

> -----Original Message-----
> From: Jan Kara [mailto:jack@suse.cz] 
> Sent: Thursday, March 10, 2016 12:05 PM
> To: Wilcox, Matthew R
> Cc: Jan Kara; linux-fsdevel@vger.kernel.org; Ross Zwisler; Williams, Dan J; linux-nvdimm@lists.01.org; NeilBrown
> Subject: Re: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
> 
> On Thu 10-03-16 19:55:21, Wilcox, Matthew R wrote:
> > This locking's still necessary.  i_mmap_sem has already been released by
> > the time we're back in do_cow_fault(), so it doesn't protect that page,
> > and truncate can have whizzed past and thinks there's nothing to unmap.
> > So a task can have a MAP_PRIVATE page still in its address space after
> > it's supposed to have been unmapped.
> 
> I don't think this is possible. Filesystem holds its inode->i_mmap_sem for
> reading when handling the fault. That synchronizes against truncate...
> 
> 								Honza
> 
> > -----Original Message-----
> > From: Jan Kara [mailto:jack@suse.cz] 
> > Sent: Thursday, March 10, 2016 11:19 AM
> > To: linux-fsdevel@vger.kernel.org
> > Cc: Wilcox, Matthew R; Ross Zwisler; Williams, Dan J; linux-nvdimm@lists.01.org; NeilBrown; Jan Kara
> > Subject: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
> > 
> > At one point DAX used i_mmap_lock to synchronize page faults with page
> > table invalidation during truncate. However these days DAX uses
> > filesystem specific RW semaphores to protect against these races
> > (i_mmap_sem in ext2 & ext4 cases, XFS_MMAPLOCK in xfs case). So remove
> > the unnecessary locking.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/dax.c    | 19 -------------------
> >  mm/memory.c | 14 --------------
> >  2 files changed, 33 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 9c4d697fb6fc..e409e8fc13b7 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -563,8 +563,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> >  	pgoff_t size;
> >  	int error;
> >  
> > -	i_mmap_lock_read(mapping);
> > -
> >  	/*
> >  	 * Check truncate didn't happen while we were allocating a block.
> >  	 * If it did, this block may or may not be still allocated to the
> > @@ -597,8 +595,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> >  	error = vm_insert_mixed(vma, vaddr, dax.pfn);
> >  
> >   out:
> > -	i_mmap_unlock_read(mapping);
> > -
> >  	return error;
> >  }
> >  
> > @@ -695,17 +691,6 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  		if (error)
> >  			goto unlock_page;
> >  		vmf->page = page;
> > -		if (!page) {
> > -			i_mmap_lock_read(mapping);
> > -			/* Check we didn't race with truncate */
> > -			size = (i_size_read(inode) + PAGE_SIZE - 1) >>
> > -								PAGE_SHIFT;
> > -			if (vmf->pgoff >= size) {
> > -				i_mmap_unlock_read(mapping);
> > -				error = -EIO;
> > -				goto out;
> > -			}
> > -		}
> >  		return VM_FAULT_LOCKED;
> >  	}
> >  
> > @@ -895,8 +880,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >  		truncate_pagecache_range(inode, lstart, lend);
> >  	}
> >  
> > -	i_mmap_lock_read(mapping);
> > -
> >  	/*
> >  	 * If a truncate happened while we were allocating blocks, we may
> >  	 * leave blocks allocated to the file that are beyond EOF.  We can't
> > @@ -1013,8 +996,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >  	}
> >  
> >   out:
> > -	i_mmap_unlock_read(mapping);
> > -
> >  	if (buffer_unwritten(&bh))
> >  		complete_unwritten(&bh, !(result & VM_FAULT_ERROR));
> >  
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 8132787ae4d5..13f76eb08f33 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2430,8 +2430,6 @@ void unmap_mapping_range(struct address_space *mapping,
> >  	if (details.last_index < details.first_index)
> >  		details.last_index = ULONG_MAX;
> >  
> > -
> > -	/* DAX uses i_mmap_lock to serialise file truncate vs page fault */
> >  	i_mmap_lock_write(mapping);
> >  	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
> >  		unmap_mapping_range_tree(&mapping->i_mmap, &details);
> > @@ -3019,12 +3017,6 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  		if (fault_page) {
> >  			unlock_page(fault_page);
> >  			page_cache_release(fault_page);
> > -		} else {
> > -			/*
> > -			 * The fault handler has no page to lock, so it holds
> > -			 * i_mmap_lock for read to protect against truncate.
> > -			 */
> > -			i_mmap_unlock_read(vma->vm_file->f_mapping);
> >  		}
> >  		goto uncharge_out;
> >  	}
> > @@ -3035,12 +3027,6 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  	if (fault_page) {
> >  		unlock_page(fault_page);
> >  		page_cache_release(fault_page);
> > -	} else {
> > -		/*
> > -		 * The fault handler has no page to lock, so it holds
> > -		 * i_mmap_lock for read to protect against truncate.
> > -		 */
> > -		i_mmap_unlock_read(vma->vm_file->f_mapping);
> >  	}
> >  	return ret;
> >  uncharge_out:
> > -- 
> > 2.6.2
> > 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
  2016-03-14 10:01           ` Jan Kara
@ 2016-03-14 14:51             ` Wilcox, Matthew R
  -1 siblings, 0 replies; 64+ messages in thread
From: Wilcox, Matthew R @ 2016-03-14 14:51 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, NeilBrown, linux-nvdimm

I think the ultimate goal here has to be to have the truncate code lock the DAX entry in the radix tree and delete it.  Then we can have do_cow_fault() unlock the radix tree entry instead of the i_mmap_lock.  So we'll need another element in struct vm_fault where we can pass back a pointer into the radix tree instead of a pointer to struct page (or add another bit to VM_FAULT_ that indicates that 'page' is not actually a page, but a pointer to an exceptional entry ... or have the MM code understand the exceptional bit ... there's a few ways we can go here).
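
One possible shape for that interface, purely as an illustration (the field
and flag names below are made up for the sketch, not necessarily what the
series will use):

	/* Let the fault handler hand back a locked DAX radix tree entry
	 * instead of a struct page.  Names and the flag value are
	 * hypothetical. */
	#define VM_FAULT_DAX_LOCKED	0x1000	/* 'entry' is valid, 'page' is not */

	struct vm_fault {
		/* ... existing fields ... */
		struct page *page;	/* the normal page case */
		void *entry;		/* locked radix tree entry in the DAX case */
	};

do_cow_fault() would then check the flag and call a DAX helper to unlock the
entry at the point where it currently releases the page (or, in the old code,
dropped i_mmap_lock).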

-----Original Message-----
From: Jan Kara [mailto:jack@suse.cz] 
Sent: Monday, March 14, 2016 3:01 AM
To: Wilcox, Matthew R
Cc: Jan Kara; linux-fsdevel@vger.kernel.org; Ross Zwisler; Williams, Dan J; linux-nvdimm@lists.01.org; NeilBrown
Subject: Re: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock

On Thu 10-03-16 20:10:09, Wilcox, Matthew R wrote:
> Here's the race:
> 
> CPU 0				CPU 1
> do_cow_fault()
> __do_fault()
> takes sem
> dax_fault()
> releases sem
> 				truncate()
> 				unmap_mapping_range()
> 				i_mmap_lock_write()
> 				unmap_mapping_range_tree()
> 				i_mmap_unlock_write()
> do_set_pte()
> 
> Holding i_mmap_lock_read() from inside __do_fault() prevents the truncate
> from proceeding until the page is inserted with do_set_pte().

Ah, right. Thanks for reminding me. I was hoping to get rid of this
i_mmap_lock abuse in DAX code but obviously it needs more work :).

								Honza

> -----Original Message-----
> From: Jan Kara [mailto:jack@suse.cz] 
> Sent: Thursday, March 10, 2016 12:05 PM
> To: Wilcox, Matthew R
> Cc: Jan Kara; linux-fsdevel@vger.kernel.org; Ross Zwisler; Williams, Dan J; linux-nvdimm@lists.01.org; NeilBrown
> Subject: Re: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
> 
> On Thu 10-03-16 19:55:21, Wilcox, Matthew R wrote:
> > This locking's still necessary.  i_mmap_sem has already been released by
> > the time we're back in do_cow_fault(), so it doesn't protect that page,
> > and truncate can have whizzed past and thinks there's nothing to unmap.
> > So a task can have a MAP_PRIVATE page still in its address space after
> > it's supposed to have been unmapped.
> 
> I don't think this is possible. Filesystem holds its inode->i_mmap_sem for
> reading when handling the fault. That synchronizes against truncate...
> 
> 								Honza
> 
> > -----Original Message-----
> > From: Jan Kara [mailto:jack@suse.cz] 
> > Sent: Thursday, March 10, 2016 11:19 AM
> > To: linux-fsdevel@vger.kernel.org
> > Cc: Wilcox, Matthew R; Ross Zwisler; Williams, Dan J; linux-nvdimm@lists.01.org; NeilBrown; Jan Kara
> > Subject: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
> > 
> > At one point DAX used i_mmap_lock to synchronize page faults with page
> > table invalidation during truncate. However these days DAX uses
> > filesystem specific RW semaphores to protect against these races
> > (i_mmap_sem in ext2 & ext4 cases, XFS_MMAPLOCK in xfs case). So remove
> > the unnecessary locking.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/dax.c    | 19 -------------------
> >  mm/memory.c | 14 --------------
> >  2 files changed, 33 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 9c4d697fb6fc..e409e8fc13b7 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -563,8 +563,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> >  	pgoff_t size;
> >  	int error;
> >  
> > -	i_mmap_lock_read(mapping);
> > -
> >  	/*
> >  	 * Check truncate didn't happen while we were allocating a block.
> >  	 * If it did, this block may or may not be still allocated to the
> > @@ -597,8 +595,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> >  	error = vm_insert_mixed(vma, vaddr, dax.pfn);
> >  
> >   out:
> > -	i_mmap_unlock_read(mapping);
> > -
> >  	return error;
> >  }
> >  
> > @@ -695,17 +691,6 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  		if (error)
> >  			goto unlock_page;
> >  		vmf->page = page;
> > -		if (!page) {
> > -			i_mmap_lock_read(mapping);
> > -			/* Check we didn't race with truncate */
> > -			size = (i_size_read(inode) + PAGE_SIZE - 1) >>
> > -								PAGE_SHIFT;
> > -			if (vmf->pgoff >= size) {
> > -				i_mmap_unlock_read(mapping);
> > -				error = -EIO;
> > -				goto out;
> > -			}
> > -		}
> >  		return VM_FAULT_LOCKED;
> >  	}
> >  
> > @@ -895,8 +880,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >  		truncate_pagecache_range(inode, lstart, lend);
> >  	}
> >  
> > -	i_mmap_lock_read(mapping);
> > -
> >  	/*
> >  	 * If a truncate happened while we were allocating blocks, we may
> >  	 * leave blocks allocated to the file that are beyond EOF.  We can't
> > @@ -1013,8 +996,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >  	}
> >  
> >   out:
> > -	i_mmap_unlock_read(mapping);
> > -
> >  	if (buffer_unwritten(&bh))
> >  		complete_unwritten(&bh, !(result & VM_FAULT_ERROR));
> >  
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 8132787ae4d5..13f76eb08f33 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2430,8 +2430,6 @@ void unmap_mapping_range(struct address_space *mapping,
> >  	if (details.last_index < details.first_index)
> >  		details.last_index = ULONG_MAX;
> >  
> > -
> > -	/* DAX uses i_mmap_lock to serialise file truncate vs page fault */
> >  	i_mmap_lock_write(mapping);
> >  	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
> >  		unmap_mapping_range_tree(&mapping->i_mmap, &details);
> > @@ -3019,12 +3017,6 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  		if (fault_page) {
> >  			unlock_page(fault_page);
> >  			page_cache_release(fault_page);
> > -		} else {
> > -			/*
> > -			 * The fault handler has no page to lock, so it holds
> > -			 * i_mmap_lock for read to protect against truncate.
> > -			 */
> > -			i_mmap_unlock_read(vma->vm_file->f_mapping);
> >  		}
> >  		goto uncharge_out;
> >  	}
> > @@ -3035,12 +3027,6 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  	if (fault_page) {
> >  		unlock_page(fault_page);
> >  		page_cache_release(fault_page);
> > -	} else {
> > -		/*
> > -		 * The fault handler has no page to lock, so it holds
> > -		 * i_mmap_lock for read to protect against truncate.
> > -		 */
> > -		i_mmap_unlock_read(vma->vm_file->f_mapping);
> >  	}
> >  	return ret;
> >  uncharge_out:
> > -- 
> > 2.6.2
> > 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
@ 2016-03-14 14:51             ` Wilcox, Matthew R
  0 siblings, 0 replies; 64+ messages in thread
From: Wilcox, Matthew R @ 2016-03-14 14:51 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Ross Zwisler, Williams, Dan J, linux-nvdimm, NeilBrown

I think the ultimate goal here has to be to have the truncate code lock the DAX entry in the radix tree and delete it.  Then we can have do_cow_fault() unlock the radix tree entry instead of the i_mmap_lock.  So we'll need another element in struct vm_fault where we can pass back a pointer into the radix tree instead of a pointer to struct page (or add another bit to VM_FAULT_ that indicates that 'page' is not actually a page, but a pointer to an exceptional entry ... or have the MM code understand the exceptional bit ... there's a few ways we can go here).

-----Original Message-----
From: Jan Kara [mailto:jack@suse.cz] 
Sent: Monday, March 14, 2016 3:01 AM
To: Wilcox, Matthew R
Cc: Jan Kara; linux-fsdevel@vger.kernel.org; Ross Zwisler; Williams, Dan J; linux-nvdimm@lists.01.org; NeilBrown
Subject: Re: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock

On Thu 10-03-16 20:10:09, Wilcox, Matthew R wrote:
> Here's the race:
> 
> CPU 0				CPU 1
> do_cow_fault()
> __do_fault()
> takes sem
> dax_fault()
> releases sem
> 				truncate()
> 				unmap_mapping_range()
> 				i_mmap_lock_write()
> 				unmap_mapping_range_tree()
> 				i_mmap_unlock_write()
> do_set_pte()
> 
> Holding i_mmap_lock_read() from inside __do_fault() prevents the truncate
> from proceeding until the page is inserted with do_set_pte().

Ah, right. Thanks for reminding me. I was hoping to get rid of this
i_mmap_lock abuse in DAX code but obviously it needs more work :).

								Honza

> -----Original Message-----
> From: Jan Kara [mailto:jack@suse.cz] 
> Sent: Thursday, March 10, 2016 12:05 PM
> To: Wilcox, Matthew R
> Cc: Jan Kara; linux-fsdevel@vger.kernel.org; Ross Zwisler; Williams, Dan J; linux-nvdimm@lists.01.org; NeilBrown
> Subject: Re: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
> 
> On Thu 10-03-16 19:55:21, Wilcox, Matthew R wrote:
> > This locking's still necessary.  i_mmap_sem has already been released by
> > the time we're back in do_cow_fault(), so it doesn't protect that page,
> > and truncate can have whizzed past and thinks there's nothing to unmap.
> > So a task can have a MAP_PRIVATE page still in its address space after
> > it's supposed to have been unmapped.
> 
> I don't think this is possible. Filesystem holds its inode->i_mmap_sem for
> reading when handling the fault. That synchronizes against truncate...
> 
> 								Honza
> 
> > -----Original Message-----
> > From: Jan Kara [mailto:jack@suse.cz] 
> > Sent: Thursday, March 10, 2016 11:19 AM
> > To: linux-fsdevel@vger.kernel.org
> > Cc: Wilcox, Matthew R; Ross Zwisler; Williams, Dan J; linux-nvdimm@lists.01.org; NeilBrown; Jan Kara
> > Subject: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
> > 
> > At one point DAX used i_mmap_lock to synchronize page faults with page
> > table invalidation during truncate. However these days DAX uses
> > filesystem specific RW semaphores to protect against these races
> > (i_mmap_sem in ext2 & ext4 cases, XFS_MMAPLOCK in xfs case). So remove
> > the unnecessary locking.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/dax.c    | 19 -------------------
> >  mm/memory.c | 14 --------------
> >  2 files changed, 33 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 9c4d697fb6fc..e409e8fc13b7 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -563,8 +563,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> >  	pgoff_t size;
> >  	int error;
> >  
> > -	i_mmap_lock_read(mapping);
> > -
> >  	/*
> >  	 * Check truncate didn't happen while we were allocating a block.
> >  	 * If it did, this block may or may not be still allocated to the
> > @@ -597,8 +595,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> >  	error = vm_insert_mixed(vma, vaddr, dax.pfn);
> >  
> >   out:
> > -	i_mmap_unlock_read(mapping);
> > -
> >  	return error;
> >  }
> >  
> > @@ -695,17 +691,6 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  		if (error)
> >  			goto unlock_page;
> >  		vmf->page = page;
> > -		if (!page) {
> > -			i_mmap_lock_read(mapping);
> > -			/* Check we didn't race with truncate */
> > -			size = (i_size_read(inode) + PAGE_SIZE - 1) >>
> > -								PAGE_SHIFT;
> > -			if (vmf->pgoff >= size) {
> > -				i_mmap_unlock_read(mapping);
> > -				error = -EIO;
> > -				goto out;
> > -			}
> > -		}
> >  		return VM_FAULT_LOCKED;
> >  	}
> >  
> > @@ -895,8 +880,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >  		truncate_pagecache_range(inode, lstart, lend);
> >  	}
> >  
> > -	i_mmap_lock_read(mapping);
> > -
> >  	/*
> >  	 * If a truncate happened while we were allocating blocks, we may
> >  	 * leave blocks allocated to the file that are beyond EOF.  We can't
> > @@ -1013,8 +996,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >  	}
> >  
> >   out:
> > -	i_mmap_unlock_read(mapping);
> > -
> >  	if (buffer_unwritten(&bh))
> >  		complete_unwritten(&bh, !(result & VM_FAULT_ERROR));
> >  
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 8132787ae4d5..13f76eb08f33 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2430,8 +2430,6 @@ void unmap_mapping_range(struct address_space *mapping,
> >  	if (details.last_index < details.first_index)
> >  		details.last_index = ULONG_MAX;
> >  
> > -
> > -	/* DAX uses i_mmap_lock to serialise file truncate vs page fault */
> >  	i_mmap_lock_write(mapping);
> >  	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
> >  		unmap_mapping_range_tree(&mapping->i_mmap, &details);
> > @@ -3019,12 +3017,6 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  		if (fault_page) {
> >  			unlock_page(fault_page);
> >  			page_cache_release(fault_page);
> > -		} else {
> > -			/*
> > -			 * The fault handler has no page to lock, so it holds
> > -			 * i_mmap_lock for read to protect against truncate.
> > -			 */
> > -			i_mmap_unlock_read(vma->vm_file->f_mapping);
> >  		}
> >  		goto uncharge_out;
> >  	}
> > @@ -3035,12 +3027,6 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  	if (fault_page) {
> >  		unlock_page(fault_page);
> >  		page_cache_release(fault_page);
> > -	} else {
> > -		/*
> > -		 * The fault handler has no page to lock, so it holds
> > -		 * i_mmap_lock for read to protect against truncate.
> > -		 */
> > -		i_mmap_unlock_read(vma->vm_file->f_mapping);
> >  	}
> >  	return ret;
> >  uncharge_out:
> > -- 
> > 2.6.2
> > 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
  2016-03-14 14:51             ` Wilcox, Matthew R
@ 2016-03-15  9:50               ` Jan Kara
  -1 siblings, 0 replies; 64+ messages in thread
From: Jan Kara @ 2016-03-15  9:50 UTC (permalink / raw)
  To: Wilcox, Matthew R; +Cc: Jan Kara, linux-nvdimm, NeilBrown, linux-fsdevel

On Mon 14-03-16 14:51:26, Wilcox, Matthew R wrote:
> I think the ultimate goal here has to be to have the truncate code lock
> the DAX entry in the radix tree and delete it.  Then we can have
> do_cow_fault() unlock the radix tree entry instead of the i_mmap_lock.
> So we'll need another element in struct vm_fault where we can pass back a
> pointer into the radix tree instead of a pointer to struct page (or add
> another bit to VM_FAULT_ that indicates that 'page' is not actually a
> page, but a pointer to an exceptional entry ... or have the MM code
> understand the exceptional bit ... there's a few ways we can go here).

Yes, with my last patch truncate already waits for the lock in the radix
tree. And I was looking into various ways to cleanly handle cow faults using
the radix tree lock instead of i_mmap_lock. So far it's a bit hacky for my
taste but I agree we want to fix the race that way.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 05/12] dax: Remove synchronization using i_mmap_lock
@ 2016-03-15  9:50               ` Jan Kara
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Kara @ 2016-03-15  9:50 UTC (permalink / raw)
  To: Wilcox, Matthew R
  Cc: Jan Kara, linux-fsdevel, Ross Zwisler, Williams, Dan J,
	linux-nvdimm, NeilBrown

On Mon 14-03-16 14:51:26, Wilcox, Matthew R wrote:
> I think the ultimate goal here has to be to have the truncate code lock
> the DAX entry in the radix tree and delete it.  Then we can have
> do_cow_fault() unlock the radix tree entry instead of the i_mmap_lock.
> So we'll need another element in struct vm_fault where we can pass back a
> pointer into the radix tree instead of a pointer to struct page (or add
> another bit to VM_FAULT_ that indicates that 'page' is not actually a
> page, but a pointer to an exceptional entry ... or have the MM code
> understand the exceptional bit ... there's a few ways we can go here).

Yes, with my last patch truncate already waits for the lock in the radix
tree. And I was looking into various ways to cleanly handle cow faults using
the radix tree lock instead of i_mmap_lock. So far it's a bit hacky for my
taste but I agree we want to fix the race that way.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 12/12] dax: New fault locking
  2016-03-10 23:54     ` NeilBrown
@ 2016-03-15 21:34       ` NeilBrown
  -1 siblings, 0 replies; 64+ messages in thread
From: NeilBrown @ 2016-03-15 21:34 UTC (permalink / raw)
  To: Jan Kara, linux-fsdevel
  Cc: Wilcox, Matthew R, Ross Zwisler, Dan Williams, linux-nvdimm

[-- Attachment #1: Type: text/plain, Size: 2346 bytes --]

On Fri, Mar 11 2016, NeilBrown wrote:

> On Fri, Mar 11 2016, Jan Kara wrote:
>
>> Currently DAX page fault locking is racy.
>>
>> CPU0 (write fault)		CPU1 (read fault)
>>
>> __dax_fault()			__dax_fault()
>>   get_block(inode, block, &bh, 0) -> not mapped
>> 				  get_block(inode, block, &bh, 0)
>> 				    -> not mapped
>>   if (!buffer_mapped(&bh))
>>     if (vmf->flags & FAULT_FLAG_WRITE)
>>       get_block(inode, block, &bh, 1) -> allocates blocks
>>   if (page) -> no
>> 				  if (!buffer_mapped(&bh))
>> 				    if (vmf->flags & FAULT_FLAG_WRITE) {
>> 				    } else {
>> 				      dax_load_hole();
>> 				    }
>>   dax_insert_mapping()
>>
>> And we are in a situation where we fail in dax_radix_entry() with -EIO.
>>
>> Another problem with the current DAX page fault locking is that there is
>> no race-free way to clear dirty tag in the radix tree. We can always
>> end up with clean radix tree and dirty data in CPU cache.
>>
>> We fix the first problem by introducing locking of exceptional radix
>> tree entries in DAX mappings acting very similarly to page lock and thus
>> synchronizing properly faults against the same mapping index. The same
>> lock can later be used to avoid races when clearing radix tree dirty
>> tag.
>
> Hi,
>  I think the exception locking bits look good - I cannot comment on the
>  rest.
>  It looks like it was a good idea to bring the locking into dax.c instead
>  of trying to make it generic.
>

Actually ... I'm still bothered by the exclusive waiting.  If an entry
is locked and there are two threads in dax_pfn_mkwrite() then one would
be woken up when the entry is unlocked and it will just set the TAG_DIRTY
flag and then continue without ever waking the next waiter on the
wait queue.

I *think* that any thread which gets an exclusive wakeup is responsible
for performing another wakeup.  In this case it must either lock the
slot, or call __wakeup.
That means:
  grab_mapping_entry needs to call wakeup:
     if radix_tree_preload() fails
     if radix_tree_insert fails other than with -EEXIST
     if a valid page was found
  dax_delete_mapping_entry needs to call wakeup
     in the fail case, though as that isn't expected (WARN_ON_ONCE)
        it shouldn't be a problem not to wake up here
  dax_pfn_mkwrite needs to call wakeup unconditionally
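
A sketch of that last point, for illustration only (this is not the actual
patch; lookup_unlocked_mapping_entry() is the helper from the series, and
wakeup_mapping_entry_waiter() is a hypothetical name for the wakeup call):

	/* dax_pfn_mkwrite() never locks the entry, so if it consumed an
	 * exclusive wakeup it must pass one on before returning. */
	int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
	{
		struct address_space *mapping = vma->vm_file->f_mapping;
		void *entry;

		spin_lock_irq(&mapping->tree_lock);
		entry = lookup_unlocked_mapping_entry(mapping, vmf->pgoff, NULL);
		if (entry)
			radix_tree_tag_set(&mapping->page_tree, vmf->pgoff,
					   PAGECACHE_TAG_DIRTY);
		spin_unlock_irq(&mapping->tree_lock);
		wakeup_mapping_entry_waiter(mapping, vmf->pgoff); /* unconditional */
		return VM_FAULT_NOPAGE;
	}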


Am I missing something?

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 12/12] dax: New fault locking
@ 2016-03-15 21:34       ` NeilBrown
  0 siblings, 0 replies; 64+ messages in thread
From: NeilBrown @ 2016-03-15 21:34 UTC (permalink / raw)
  To: Jan Kara, linux-fsdevel
  Cc: Wilcox, Matthew R, Ross Zwisler, Dan Williams, linux-nvdimm, Jan Kara

[-- Attachment #1: Type: text/plain, Size: 2346 bytes --]

On Fri, Mar 11 2016, NeilBrown wrote:

> On Fri, Mar 11 2016, Jan Kara wrote:
>
>> Currently DAX page fault locking is racy.
>>
>> CPU0 (write fault)		CPU1 (read fault)
>>
>> __dax_fault()			__dax_fault()
>>   get_block(inode, block, &bh, 0) -> not mapped
>> 				  get_block(inode, block, &bh, 0)
>> 				    -> not mapped
>>   if (!buffer_mapped(&bh))
>>     if (vmf->flags & FAULT_FLAG_WRITE)
>>       get_block(inode, block, &bh, 1) -> allocates blocks
>>   if (page) -> no
>> 				  if (!buffer_mapped(&bh))
>> 				    if (vmf->flags & FAULT_FLAG_WRITE) {
>> 				    } else {
>> 				      dax_load_hole();
>> 				    }
>>   dax_insert_mapping()
>>
>> And we are in a situation where we fail in dax_radix_entry() with -EIO.
>>
>> Another problem with the current DAX page fault locking is that there is
>> no race-free way to clear dirty tag in the radix tree. We can always
>> end up with clean radix tree and dirty data in CPU cache.
>>
>> We fix the first problem by introducing locking of exceptional radix
>> tree entries in DAX mappings acting very similarly to page lock and thus
>> synchronizing properly faults against the same mapping index. The same
>> lock can later be used to avoid races when clearing radix tree dirty
>> tag.
>
> Hi,
>  I think the exception locking bits look good - I cannot comment on the
>  rest.
>  It looks like it was a good idea to bring the locking into dax.c instead
>  of trying to make it generic.
>

Actually ... I'm still bothered by the exclusive waiting.  If an entry
is locked and there are two threads in dax_pfn_mkwrite() then one would
be woken up when the entry is unlocked and it will just set the TAG_DIRTY
flag and then continue without ever waking the next waiter on the
wait queue.

I *think* that any thread which gets an exclusive wakeup is responsible
for performing another wakeup.  In this case it must either lock the
slot, or call __wakeup.
That means:
  grab_mapping_entry needs to call wakeup:
     if radix_tree_preload() fails
     if radix_tree_insert fails other than with -EEXIST
     if a valid page was found
  dax_delete_mapping_entry needs to call wakeup
     in the fail case, though as that isn't expected (WARN_ON_ONCE)
        it shouldn't be a problem not to wake up here
  dax_pfn_mkwrite needs to call wakeup unconditionally


Am I missing something?

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 12/12] dax: New fault locking
  2016-03-15 21:34       ` NeilBrown
@ 2016-03-18 14:16         ` Jan Kara
  -1 siblings, 0 replies; 64+ messages in thread
From: Jan Kara @ 2016-03-18 14:16 UTC (permalink / raw)
  To: NeilBrown; +Cc: Jan Kara, linux-nvdimm, Wilcox

On Wed 16-03-16 08:34:28, NeilBrown wrote:
> On Fri, Mar 11 2016, NeilBrown wrote:
> 
> > On Fri, Mar 11 2016, Jan Kara wrote:
> >
> >> Currently DAX page fault locking is racy.
> >>
> >> CPU0 (write fault)		CPU1 (read fault)
> >>
> >> __dax_fault()			__dax_fault()
> >>   get_block(inode, block, &bh, 0) -> not mapped
> >> 				  get_block(inode, block, &bh, 0)
> >> 				    -> not mapped
> >>   if (!buffer_mapped(&bh))
> >>     if (vmf->flags & FAULT_FLAG_WRITE)
> >>       get_block(inode, block, &bh, 1) -> allocates blocks
> >>   if (page) -> no
> >> 				  if (!buffer_mapped(&bh))
> >> 				    if (vmf->flags & FAULT_FLAG_WRITE) {
> >> 				    } else {
> >> 				      dax_load_hole();
> >> 				    }
> >>   dax_insert_mapping()
> >>
> >> And we are in a situation where we fail in dax_radix_entry() with -EIO.
> >>
> >> Another problem with the current DAX page fault locking is that there is
> >> no race-free way to clear dirty tag in the radix tree. We can always
> >> end up with clean radix tree and dirty data in CPU cache.
> >>
> >> We fix the first problem by introducing locking of exceptional radix
> >> tree entries in DAX mappings acting very similarly to page lock and thus
> >> synchronizing properly faults against the same mapping index. The same
> >> lock can later be used to avoid races when clearing radix tree dirty
> >> tag.
> >
> > Hi,
> >  I think the exception locking bits look good - I cannot comment on the
> >  rest.
> >  It looks like it was a good idea to bring the locking into dax.c instead
> >  of trying to make it generic.
> >
> 
> Actually ... I'm still bothered by the exclusive waiting.  If an entry
> is locked and there are two threads in dax_pfn_mkwrite() then one would
> be woken up when the entry is unlocked and it will just set the TAG_DIRTY
> flag and then continue without ever waking the next waiter on the
> wait queue.
> 
> I *think* that any thread which gets an exclusive wakeup is responsible
> for performing another wakeup.  In this case it must either lock the
> slot, or call __wakeup.
> That means:
>   grab_mapping_entry needs to call wakeup:
>      if radix_tree_preload() fails
>      if radix_tree_insert fails other than with -EEXIST
>      if a valid page was found

Why would we need to call wake up when a valid page was found? In that case
there should not be any process waiting for the radix tree entry lock.
Otherwise I agree with you. Thanks for pointing this out, you've likely
saved me quite some debugging ;).


>   dax_delete_mapping_entry needs to call wakeup
>      in the fail case, though as that isn't expected (WARN_ON_ONCE)
>         it shouldn't be a problem not to wake up here
>   dax_pfn_mkwrite needs to call wakeup unconditionally

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 12/12] dax: New fault locking
@ 2016-03-18 14:16         ` Jan Kara
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Kara @ 2016-03-18 14:16 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jan Kara, linux-fsdevel, Wilcox, Matthew R, Ross Zwisler,
	Dan Williams, linux-nvdimm

On Wed 16-03-16 08:34:28, NeilBrown wrote:
> On Fri, Mar 11 2016, NeilBrown wrote:
> 
> > On Fri, Mar 11 2016, Jan Kara wrote:
> >
> >> Currently DAX page fault locking is racy.
> >>
> >> CPU0 (write fault)		CPU1 (read fault)
> >>
> >> __dax_fault()			__dax_fault()
> >>   get_block(inode, block, &bh, 0) -> not mapped
> >> 				  get_block(inode, block, &bh, 0)
> >> 				    -> not mapped
> >>   if (!buffer_mapped(&bh))
> >>     if (vmf->flags & FAULT_FLAG_WRITE)
> >>       get_block(inode, block, &bh, 1) -> allocates blocks
> >>   if (page) -> no
> >> 				  if (!buffer_mapped(&bh))
> >> 				    if (vmf->flags & FAULT_FLAG_WRITE) {
> >> 				    } else {
> >> 				      dax_load_hole();
> >> 				    }
> >>   dax_insert_mapping()
> >>
> >> And we are in a situation where we fail in dax_radix_entry() with -EIO.
> >>
> >> Another problem with the current DAX page fault locking is that there is
> >> no race-free way to clear dirty tag in the radix tree. We can always
> >> end up with clean radix tree and dirty data in CPU cache.
> >>
> >> We fix the first problem by introducing locking of exceptional radix
> >> tree entries in DAX mappings acting very similarly to page lock and thus
> >> synchronizing properly faults against the same mapping index. The same
> >> lock can later be used to avoid races when clearing radix tree dirty
> >> tag.
> >
> > Hi,
> >  I think the exception locking bits look good - I cannot comment on the
> >  rest.
> >  It looks like it was a good idea to bring the locking into dax.c instead
> >  of trying to make it generic.
> >
> 
> Actually ... I'm still bothered by the exclusive waiting.  If an entry
> is locked and there are two threads in dax_pfn_mkwrite() then one would
> be woken up when the entry is unlocked and it will just set the TAG_DIRTY
> flag and then continue without ever waking the next waiter on the
> wait queue.
> 
> I *think* that any thread which gets an exclusive wakeup is responsible
> for performing another wakeup.  In this case it must either lock the
> slot, or call __wakeup.
> That means:
>   grab_mapping_entry needs to call wakeup:
>      if radix_tree_preload() fails
>      if radix_tree_insert fails other than with -EEXIST
>      if a valid page was found

Why would we need to call wake up when a valid page was found? In that case
there should not be any process waiting for the radix tree entry lock.
Otherwise I agree with you. Thanks for pointing this out, you've likely
saved me quite some debugging ;).


>   dax_delete_mapping_entry needs to call wakeup
>      in the fail case, though as that isn't expected (WARN_ON_ONCE)
>         it shouldn't be a problem not to wake up here
>   dax_pfn_mkwrite needs to call wakeup unconditionally

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 12/12] dax: New fault locking
  2016-03-18 14:16         ` Jan Kara
@ 2016-03-18 15:39           ` Jan Kara
  -1 siblings, 0 replies; 64+ messages in thread
From: Jan Kara @ 2016-03-18 15:39 UTC (permalink / raw)
  To: NeilBrown; +Cc: Jan Kara, linux-nvdimm, Wilcox

On Fri 18-03-16 15:16:18, Jan Kara wrote:
> On Wed 16-03-16 08:34:28, NeilBrown wrote:
> > On Fri, Mar 11 2016, NeilBrown wrote:
> > 
> > > On Fri, Mar 11 2016, Jan Kara wrote:
> > >
> > >> Currently DAX page fault locking is racy.
> > >>
> > >> CPU0 (write fault)		CPU1 (read fault)
> > >>
> > >> __dax_fault()			__dax_fault()
> > >>   get_block(inode, block, &bh, 0) -> not mapped
> > >> 				  get_block(inode, block, &bh, 0)
> > >> 				    -> not mapped
> > >>   if (!buffer_mapped(&bh))
> > >>     if (vmf->flags & FAULT_FLAG_WRITE)
> > >>       get_block(inode, block, &bh, 1) -> allocates blocks
> > >>   if (page) -> no
> > >> 				  if (!buffer_mapped(&bh))
> > >> 				    if (vmf->flags & FAULT_FLAG_WRITE) {
> > >> 				    } else {
> > >> 				      dax_load_hole();
> > >> 				    }
> > >>   dax_insert_mapping()
> > >>
> > >> And we are in a situation where we fail in dax_radix_entry() with -EIO.
> > >>
> > >> Another problem with the current DAX page fault locking is that there is
> > >> no race-free way to clear dirty tag in the radix tree. We can always
> > >> end up with clean radix tree and dirty data in CPU cache.
> > >>
> > >> We fix the first problem by introducing locking of exceptional radix
> > >> tree entries in DAX mappings acting very similarly to page lock and thus
> > >> synchronizing properly faults against the same mapping index. The same
> > >> lock can later be used to avoid races when clearing radix tree dirty
> > >> tag.
> > >
> > > Hi,
> > >  I think the exception locking bits look good - I cannot comment on the
> > >  rest.
> > >  It looks like it was a good idea to bring the locking into dax.c instead
> > >  of trying to make it generic.
> > >
> > 
> > Actually ... I'm still bothered by the exclusive waiting.  If an entry
> > is locked and there are two threads in dax_pfn_mkwrite() then one would
> > be woken up when the entry is unlocked and it will just set the TAG_DIRTY
> > flag and then continue without ever waking the next waiter on the
> > wait queue.
> > 
> > I *think* that any thread which gets an exclusive wakeup is responsible
> > for performing another wakeup.  In this case it must either lock the
> > slot, or call __wakeup.
> > That means:
> >   grab_mapping_entry needs to call wakeup:
> >      if radix_tree_preload() fails
> >      if radix_tree_insert fails other than with -EEXIST
> >      if a valid page was found
> 
> Why would we need to call wake up when a valid page was found? In that case
> there should not be any process waiting for the radix tree entry lock.
> Otherwise I agree with you. Thanks for pointing this out, you've likely
> saved me quite some debugging ;).
> 
> 
> >   dax_delete_mapping_entry needs to call wakeup
> >      in the fail case, though as that isn't expected (WARN_ON_ONCE)
> >         it shouldn't be a problem not to wake up here
> >   dax_pfn_mkwrite needs to call wakeup unconditionally

Actually, after some thought I don't think the wakeup is needed except for
dax_pfn_mkwrite(). In the other cases we know there is no radix tree
exceptional entry and thus there can be no waiters for its lock...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 12/12] dax: New fault locking
@ 2016-03-18 15:39           ` Jan Kara
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Kara @ 2016-03-18 15:39 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jan Kara, linux-fsdevel, Wilcox, Matthew R, Ross Zwisler,
	Dan Williams, linux-nvdimm

On Fri 18-03-16 15:16:18, Jan Kara wrote:
> On Wed 16-03-16 08:34:28, NeilBrown wrote:
> > On Fri, Mar 11 2016, NeilBrown wrote:
> > 
> > > On Fri, Mar 11 2016, Jan Kara wrote:
> > >
> > >> Currently DAX page fault locking is racy.
> > >>
> > >> CPU0 (write fault)		CPU1 (read fault)
> > >>
> > >> __dax_fault()			__dax_fault()
> > >>   get_block(inode, block, &bh, 0) -> not mapped
> > >> 				  get_block(inode, block, &bh, 0)
> > >> 				    -> not mapped
> > >>   if (!buffer_mapped(&bh))
> > >>     if (vmf->flags & FAULT_FLAG_WRITE)
> > >>       get_block(inode, block, &bh, 1) -> allocates blocks
> > >>   if (page) -> no
> > >> 				  if (!buffer_mapped(&bh))
> > >> 				    if (vmf->flags & FAULT_FLAG_WRITE) {
> > >> 				    } else {
> > >> 				      dax_load_hole();
> > >> 				    }
> > >>   dax_insert_mapping()
> > >>
> > >> And we are in a situation where we fail in dax_radix_entry() with -EIO.
> > >>
> > >> Another problem with the current DAX page fault locking is that there is
> > >> no race-free way to clear dirty tag in the radix tree. We can always
> > >> end up with clean radix tree and dirty data in CPU cache.
> > >>
> > >> We fix the first problem by introducing locking of exceptional radix
> > >> tree entries in DAX mappings acting very similarly to page lock and thus
> > >> synchronizing properly faults against the same mapping index. The same
> > >> lock can later be used to avoid races when clearing radix tree dirty
> > >> tag.
> > >
> > > Hi,
> > >  I think the exception locking bits look good - I cannot comment on the
> > >  rest.
> > >  It looks like it was a good idea to bring the locking into dax.c instead
> > >  of trying to make it generic.
> > >
> > 
> > Actually ... I'm still bothered by the exclusive waiting.  If an entry
> > is locked and there are two threads in dax_pfn_mkwrite() then one would
> > be woken up when the entry is unlocked and it will just set the TAG_DIRTY
> > flag and then continue without ever waking the next waiter on the
> > wait queue.
> > 
> > I *think* that any thread which gets an exclusive wakeup is responsible
> > for performing another wakeup.  In this case it must either lock the
> > slot, or call __wakeup.
> > That means:
> >   grab_mapping_entry needs to call wakeup:
> >      if radix_tree_preload() fails
> >      if radix_tree_insert fails other than with -EEXIST
> >      if a valid page was found
> 
> Why would we need to call wake up when a valid page was found? In that case
> there should not be any process waiting for the radix tree entry lock.
> Otherwise I agree with you. Thanks for pointing this out, you've likely
> saved me quite some debugging ;).
> 
> 
> >   dax_delete_mapping_entry needs to call wakeup
> >      in the fail case, though as that isn't expected (WARN_ON_ONCE)
> >         it shouldn't be a problem not to wake up here
> >   dax_pfn_mkwrite needs to call wakeup unconditionally

Actually, after some thought I don't think the wakeup is needed except for
dax_pfn_mkwrite(). In the other cases we know there is no radix tree
exceptional entry and thus there can be no waiters for its lock...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 12/12] dax: New fault locking
  2016-03-18 15:39           ` Jan Kara
@ 2016-03-22 21:10             ` NeilBrown
  -1 siblings, 0 replies; 64+ messages in thread
From: NeilBrown @ 2016-03-22 21:10 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Wilcox, Matthew R, Ross Zwisler, Dan Williams,
	linux-nvdimm

[-- Attachment #1: Type: text/plain, Size: 1032 bytes --]

On Sat, Mar 19 2016, Jan Kara wrote:
>
> Actually, after some thought I don't think the wakeup is needed except for
> dax_pfn_mkwrite(). In the other cases we know there is no radix tree
> exceptional entry and thus there can be no waiters for its lock...
>

I think that is fragile logic - though it may be correct at present.

A radix tree slot can transition from "Locked exception" to "unlocked
exception" to "deleted" to "struct page".

So it is absolutely certain that a thread cannot go to sleep after
finding a "locked exception" and wake up to find a "struct page" ??

How about a much simpler change.
 - new local variable "slept" in lookup_unlocked_mapping_entry() which
   is set if prepare_to_wait_exclusive() gets called.
 - if after __radix_tree_lookup() returns:
        (ret==NULL || !radix_tree_exceptional_entry(ret)) && slept
   then it calls wakeup immediately - because if it was waiting,
   something else might be too.

That would cover all vaguely possible cases except dax_pfn_mkwrite()
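
Sketched out, the idea would look something like this (illustrative only;
slot_locked(), wait_for_entry_unlock() and dax_entry_waitqueue() are
hypothetical helpers, and mapping->tree_lock is assumed to be held, with
the wait helper dropping and retaking it):

	static void *lookup_unlocked_mapping_entry(struct address_space *mapping,
						   pgoff_t index, void ***slotp)
	{
		void *ret, **slot;
		bool slept = false;

		for (;;) {
			ret = __radix_tree_lookup(&mapping->page_tree, index,
						  NULL, &slot);
			if (!ret || !radix_tree_exceptional_entry(ret) ||
			    !slot_locked(slot))
				break;
			slept = true;
			wait_for_entry_unlock(mapping, index);
		}
		/* If we waited exclusively but there is no longer an exceptional
		 * entry to lock, pass the wakeup on to the next waiter. */
		if (slept && (!ret || !radix_tree_exceptional_entry(ret)))
			wake_up(dax_entry_waitqueue(mapping, index));
		if (slotp)
			*slotp = slot;
		return ret;
	}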


Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 12/12] dax: New fault locking
@ 2016-03-22 21:10             ` NeilBrown
  0 siblings, 0 replies; 64+ messages in thread
From: NeilBrown @ 2016-03-22 21:10 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jan Kara, linux-fsdevel, Wilcox, Matthew R, Ross Zwisler,
	Dan Williams, linux-nvdimm

[-- Attachment #1: Type: text/plain, Size: 1032 bytes --]

On Sat, Mar 19 2016, Jan Kara wrote:
>
> Actually, after some thought I don't think the wakeup is needed except for
> dax_pfn_mkwrite(). In the other cases we know there is no radix tree
> exceptional entry and thus there can be no waiters for its lock...
>

I think that is fragile logic - though it may be correct at present.

A radix tree slot can transition from "Locked exception" to "unlocked
exception" to "deleted" to "struct page".

So it is absolutely certain that a thread cannot go to sleep after
finding a "locked exception" and wake up to find a "struct page" ??

How about a much simpler change.
 - new local variable "slept" in lookup_unlocked_mapping_entry() which
   is set if prepare_to_wait_exclusive() gets called.
 - if after __radix_tree_lookup() returns:
        (ret==NULL || !radix_tree_exceptional_entry(ret)) && slept
   then it calls wakeup immediately - because if it was waiting,
   something else might be too.

That would cover all vaguely possible cases except dax_pfn_mkwrite()


Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 12/12] dax: New fault locking
  2016-03-22 21:10             ` NeilBrown
@ 2016-03-23 11:00               ` Jan Kara
  -1 siblings, 0 replies; 64+ messages in thread
From: Jan Kara @ 2016-03-23 11:00 UTC (permalink / raw)
  To: NeilBrown; +Cc: Jan Kara, linux-nvdimm, Wilcox

On Wed 23-03-16 08:10:42, NeilBrown wrote:
> On Sat, Mar 19 2016, Jan Kara wrote:
> >
> > Actually, after some thought I don't think the wakeup is needed except for
> > dax_pfn_mkwrite(). In the other cases we know there is no radix tree
> > exceptional entry and thus there can be no waiters for its lock...
> >
> 
> I think that is fragile logic - though it may be correct at present.
> 
> A radix tree slot can transition from "Locked exception" to "unlocked
> exception" to "deleted" to "struct page".

Yes.
 
> So it is absolutely certain that a thread cannot go to sleep after
> finding a "locked exception" and wake up to find a "struct page" ??

With the current implementation this should not happen but I agree the entry
locking code should not rely on this.

> How about a much simpler change.
>  - new local variable "slept" in lookup_unlocked_mapping_entry() which
>    is set if prepare_to_wait_exclusive() gets called.
>  - if after __radix_tree_lookup() returns:
>         (ret==NULL || !radix_tree_exceptional_entry(ret)) && slept
>    then it calls wakeup immediately - because if it was waiting,
>    something else might be too.
> 
> That would cover all vaguely possible cases except dax_pfn_mkwrite()

But how does this really help? If lookup_unlocked_mapping_entry() finds
there is no entry (and it was there before), the process deleting the entry
(or replacing it with something else) is responsible for waking up
everybody. So your change would only duplicate what
dax_delete_mapping_entry() does. The potential for breakage is that callers
of lookup_unlocked_mapping_entry() are responsible for waking up other
waiters *even if* they do not lock or delete the entry in the end. Maybe
I'll rename lookup_unlocked_mapping_entry() to get_unlocked_mapping_entry()
so that it is clearer that one must call either put_unlocked_mapping_entry()
or put_locked_mapping_entry() on it.
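
To make that contract concrete, a caller would then look roughly like
this (just a sketch - the bail-out condition, lock_slot() and the exact
signatures are illustrative, not the actual fault path):

	entry = get_unlocked_mapping_entry(mapping, index, &slot);
	if (need_to_bail_out) {		/* hypothetical error path */
		/*
		 * We are not locking or deleting the entry after all, but
		 * we may have been woken as the exclusive waiter - so wake
		 * the next waiter before returning.
		 */
		put_unlocked_mapping_entry(mapping, index, entry);
		return ret;
	}
	entry = lock_slot(mapping, slot);	/* take the entry lock */
	/* ... do the real work on the locked entry ... */
	put_locked_mapping_entry(mapping, index, entry);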

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 12/12] dax: New fault locking
  2016-03-23 11:00               ` Jan Kara
@ 2016-03-31  4:20                 ` NeilBrown
  -1 siblings, 0 replies; 64+ messages in thread
From: NeilBrown @ 2016-03-31  4:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Wilcox, Matthew R, Ross Zwisler, Dan Williams,
	linux-nvdimm

[-- Attachment #1: Type: text/plain, Size: 2376 bytes --]

On Wed, Mar 23 2016, Jan Kara wrote:

> On Wed 23-03-16 08:10:42, NeilBrown wrote:
>> On Sat, Mar 19 2016, Jan Kara wrote:
>> >
>> > Actually, after some thought I don't think the wakeup is needed except for
>> > dax_pfn_mkwrite(). In the other cases we know there is no radix tree
>> > exceptional entry and thus there can be no waiters for its lock...
>> >
>> 
>> I think that is fragile logic - though it may be correct at present.
>> 
>> A radix tree slot can transition from "Locked exception" to "unlocked
>> exception" to "deleted" to "struct page".
>
> Yes.
>  
>> So it is absolutely certain that a thread cannot go to sleep after
>> finding a "locked exception" and wake up to find a "struct page" ??
>
> With current implementation this should not happen but I agree entry
> locking code should not rely on this.
>
>> How about a much simpler change.
>>  - new local variable "slept" in lookup_unlocked_mapping_entry() which
>>    is set if prepare_to_wait_exclusive() gets called.
>>  - if after __radix_tree_lookup() returns:
>>         (ret==NULL || !radix_tree_exceptional_entry(ret)) && slept
>>    then it calls wakeup immediately - because if it was waiting,
>>    something else might be to.
>> 
>> That would cover all vaguely possible cases except dax_pfn_mkwrite()
>
> But how does this really help? If lookup_unlocked_mapping_entry() finds
> there is no entry (and it was there before), the process deleting the entry
> (or replacing it with something else) is responsible for waking up
> everybody.

"everybody" - yes.  But it doesn't wake everybody does it?  It just
wakes one.

+		__wake_up(wq, TASK_NORMAL, 1, &key);
                                           ^one!

Or am I misunderstanding how exclusive waiting works?

Thanks,
NeilBrown

>             So your change would only duplicate what
> dax_delete_mapping_entry() does. The potential for breakage is that callers
> of lookup_unlocked_mapping_entry() are responsible for waking up other
> waiters *even if* they do not lock or delete the entry in the end. Maybe
> I'll rename lookup_unlocked_mapping_entry() to get_unlocked_mapping_entry()
> so that it is clearer that one must call either put_unlocked_mapping_entry()
> or put_locked_mapping_entry() on it.
>
> 								Honza
>
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 12/12] dax: New fault locking
  2016-03-31  4:20                 ` NeilBrown
@ 2016-03-31  8:54                   ` Jan Kara
  -1 siblings, 0 replies; 64+ messages in thread
From: Jan Kara @ 2016-03-31  8:54 UTC (permalink / raw)
  To: NeilBrown; +Cc: Jan Kara, linux-nvdimm, Wilcox

On Thu 31-03-16 15:20:00, NeilBrown wrote:
> On Wed, Mar 23 2016, Jan Kara wrote:
> 
> > On Wed 23-03-16 08:10:42, NeilBrown wrote:
> >> On Sat, Mar 19 2016, Jan Kara wrote:
> >> >
> >> > Actually, after some thought I don't think the wakeup is needed except for
> >> > dax_pfn_mkwrite(). In the other cases we know there is no radix tree
> >> > exceptional entry and thus there can be no waiters for its lock...
> >> >
> >> 
> >> I think that is fragile logic - though it may be correct at present.
> >> 
> >> A radix tree slot can transition from "Locked exception" to "unlocked
> >> exception" to "deleted" to "struct page".
> >
> > Yes.
> >  
> >> So it is absolutely certain that a thread cannot go to sleep after
> >> finding a "locked exception" and wake up to find a "struct page" ??
> >
> > With current implementation this should not happen but I agree entry
> > locking code should not rely on this.
> >
> >> How about a much simpler change.
> >>  - new local variable "slept" in lookup_unlocked_mapping_entry() which
> >>    is set if prepare_to_wait_exclusive() gets called.
> >>  - if after __radix_tree_lookup() returns:
> >>         (ret==NULL || !radix_tree_exceptional_entry(ret)) && slept
> >>    then it calls wakeup immediately - because if it was waiting,
> >>    something else might be to.
> >> 
> >> That would cover all vaguely possible cases except dax_pfn_mkwrite()
> >
> > But how does this really help? If lookup_unlocked_mapping_entry() finds
> > there is no entry (and it was there before), the process deleting the entry
> > (or replacing it with something else) is responsible for waking up
> > everybody.
> 
> "everybody" - yes.  But it doesn't wake everybody does it?  It just
> wakes one.
> 
> +		__wake_up(wq, TASK_NORMAL, 1, &key);
>                                            ^one!
> 
> Or am I misunderstanding how exclusive waiting works?

Ah, OK. I have already fixed that in my local version of the patches so
that we wake up everybody after deleting the entry, but forgot to tell you.
So I have there now:

		__wake_up(wq, TASK_NORMAL, 0, &key);

Are you OK with the code now?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 12/12] dax: New fault locking
  2016-03-31  8:54                   ` Jan Kara
@ 2016-04-01  0:34                     ` NeilBrown
  -1 siblings, 0 replies; 64+ messages in thread
From: NeilBrown @ 2016-04-01  0:34 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Wilcox, Matthew R, Ross Zwisler, Dan Williams,
	linux-nvdimm

[-- Attachment #1: Type: text/plain, Size: 1260 bytes --]

On Thu, Mar 31 2016, Jan Kara wrote:

> On Thu 31-03-16 15:20:00, NeilBrown wrote:
>> On Wed, Mar 23 2016, Jan Kara wrote:
>> 

>> > But how does this really help? If lookup_unlocked_mapping_entry() finds
>> > there is no entry (and it was there before), the process deleting the entry
>> > (or replacing it with something else) is responsible for waking up
>> > everybody.
>> 
>> "everybody" - yes.  But it doesn't wake everybody does it?  It just
>> wakes one.
>> 
>> +		__wake_up(wq, TASK_NORMAL, 1, &key);
>>                                            ^one!
>> 
>> Or am I misunderstanding how exclusive waiting works?
>
> Ah, OK. I have already fixed that in my local version of the patches so
> that we wake-up everybody after deleting the entry but forgot to tell you.
> So I have there now:
>
> 		__wake_up(wq, TASK_NORMAL, 0, &key);

"0" meaning "MAX_INT+1" (or something).  I didn't realize you could do
that, but it makes perfect sense.
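
(For reference, the exclusive-wakeup loop in __wake_up_common() is, from
memory, roughly:

	list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
		unsigned flags = curr->flags;

		if (curr->func(curr, mode, wake_flags, key) &&
		    (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
			break;
	}

so nr_exclusive == 1 stops after the first exclusive waiter the wake
function accepts, while 0 just goes negative and never terminates the
loop - every waiter matching the key gets woken.)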

>
> Are you OK with the code now?

Certainly less unhappy.  Assume that I am OK but I'll try to have
another look when you post again and make sure any lingering doubts are
gone.

Thanks,
NeilBrown


>
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

end of thread, other threads:[~2016-04-01  0:35 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-10 19:18 [RFC] [PATCH 0/12] DAX page fault locking Jan Kara
2016-03-10 19:18 ` Jan Kara
2016-03-10 19:18 ` [PATCH 01/12] DAX: move RADIX_DAX_ definitions to dax.c Jan Kara
2016-03-10 19:18   ` Jan Kara
2016-03-11 22:54   ` Ross Zwisler
2016-03-11 22:54     ` Ross Zwisler
2016-03-10 19:18 ` [PATCH 02/12] radix-tree: make 'indirect' bit available to exception entries Jan Kara
2016-03-10 19:18   ` Jan Kara
2016-03-10 19:18 ` [PATCH 03/12] mm: Remove VM_FAULT_MINOR Jan Kara
2016-03-10 19:18   ` Jan Kara
2016-03-10 19:38   ` Wilcox, Matthew R
2016-03-10 19:38     ` Wilcox, Matthew R
2016-03-10 19:48     ` Jan Kara
2016-03-10 19:48       ` Jan Kara
2016-03-10 19:18 ` [PATCH 04/12] ocfs2: Fix return value from ocfs2_page_mkwrite() Jan Kara
2016-03-10 19:18   ` Jan Kara
2016-03-10 19:18 ` [PATCH 05/12] dax: Remove synchronization using i_mmap_lock Jan Kara
2016-03-10 19:55   ` Wilcox, Matthew R
2016-03-10 19:55     ` Wilcox, Matthew R
2016-03-10 20:05     ` Jan Kara
2016-03-10 20:05       ` Jan Kara
2016-03-10 20:10       ` Wilcox, Matthew R
2016-03-10 20:10         ` Wilcox, Matthew R
2016-03-14 10:01         ` Jan Kara
2016-03-14 10:01           ` Jan Kara
2016-03-14 14:51           ` Wilcox, Matthew R
2016-03-14 14:51             ` Wilcox, Matthew R
2016-03-15  9:50             ` Jan Kara
2016-03-15  9:50               ` Jan Kara
2016-03-10 19:18 ` [PATCH 06/12] dax: Remove complete_unwritten argument Jan Kara
2016-03-10 19:18   ` Jan Kara
2016-03-10 19:18 ` [PATCH 07/12] dax: Fix data corruption for written and mmapped files Jan Kara
2016-03-10 19:18   ` Jan Kara
2016-03-10 19:18 ` [PATCH 08/12] dax: Fix bogus fault return value on cow faults Jan Kara
2016-03-10 19:18   ` Jan Kara
2016-03-10 19:18 ` [PATCH 09/12] dax: Allow DAX code to replace exceptional entries Jan Kara
2016-03-10 19:18   ` Jan Kara
2016-03-10 19:18 ` [PATCH 10/12] dax: Remove redundant inode size checks Jan Kara
2016-03-10 19:18   ` Jan Kara
2016-03-10 19:18 ` [PATCH 11/12] dax: Disable huge page handling Jan Kara
2016-03-10 19:34   ` Dan Williams
2016-03-10 19:34     ` Dan Williams
2016-03-10 19:52     ` Jan Kara
2016-03-10 19:52       ` Jan Kara
2016-03-10 19:18 ` [PATCH 12/12] dax: New fault locking Jan Kara
2016-03-10 19:18   ` Jan Kara
2016-03-10 23:54   ` NeilBrown
2016-03-10 23:54     ` NeilBrown
2016-03-15 21:34     ` NeilBrown
2016-03-15 21:34       ` NeilBrown
2016-03-18 14:16       ` Jan Kara
2016-03-18 14:16         ` Jan Kara
2016-03-18 15:39         ` Jan Kara
2016-03-18 15:39           ` Jan Kara
2016-03-22 21:10           ` NeilBrown
2016-03-22 21:10             ` NeilBrown
2016-03-23 11:00             ` Jan Kara
2016-03-23 11:00               ` Jan Kara
2016-03-31  4:20               ` NeilBrown
2016-03-31  4:20                 ` NeilBrown
2016-03-31  8:54                 ` Jan Kara
2016-03-31  8:54                   ` Jan Kara
2016-04-01  0:34                   ` NeilBrown
2016-04-01  0:34                     ` NeilBrown
