* [RFC v2] [PATCH 0/10] DAX page fault locking
@ 2016-03-21 13:22 ` Jan Kara
  0 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-21 13:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Jan Kara, linux-nvdimm, NeilBrown, Wilcox, Matthew R

[Sorry for repost but I accidentally sent initial email without patches]

Hello,

this is my second attempt at a DAX page fault locking rewrite. Things now
work reasonably well; the series has survived a full xfstests run on ext4.
I guess I need to do more mmap-targeted tests to unveil issues. What do you
all use for DAX testing?

Changes since v1:
- handle wakeups of exclusive waiters properly
- fix cow fault races
- other minor stuff

General description

The basic idea is that we use a bit in an exceptional radix tree entry as
a lock bit and use it similarly to how the page lock is used for normal
faults. That way we fix races between hole instantiation and read faults of
the same index. For now I have disabled PMD faults since the issues with page
fault locking are even worse there. Now that Matthew's multi-order radix tree
has landed, I can look into using that for proper locking of PMD faults, but
first I want normal pages sorted out.
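
To illustrate the idea, here is a minimal sketch of locking an already
present exceptional entry. This is my illustration only; the helper name and
the waiting logic are placeholders, not the code in this series (DAX_ENTRY_LOCK
is the lock bit introduced in patch 05):

  /* Lock the DAX entry at @index, analogously to lock_page(). Sketch only. */
  static void *lock_dax_entry(struct address_space *mapping, pgoff_t index)
  {
          void **slot;
          void *entry;

          for (;;) {
                  spin_lock_irq(&mapping->tree_lock);
                  entry = __radix_tree_lookup(&mapping->page_tree, index,
                                              NULL, &slot);
                  if (radix_tree_exceptional_entry(entry) &&
                      !((unsigned long)entry & DAX_ENTRY_LOCK)) {
                          /* Set the lock bit while holding tree_lock */
                          entry = (void *)((unsigned long)entry | DAX_ENTRY_LOCK);
                          radix_tree_replace_slot(slot, entry);
                          spin_unlock_irq(&mapping->tree_lock);
                          return entry;
                  }
                  spin_unlock_irq(&mapping->tree_lock);
                  /*
                   * Entry absent or already locked: wait for the holder.
                   * The series handles this with a waitqueue (see "wakeups
                   * of exclusive waiters" above); a plain sleep keeps the
                   * sketch short.
                   */
                  schedule_timeout_uninterruptible(1);
          }
  }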

In the end I have decided to implement the bit locking directly in the DAX
code. Originally I was thinking we could provide something generic directly
in the radix tree code, but the functions DAX needs are rather specific.
Maybe someone else will have a good idea how to distill some generally useful
helpers out of what I've implemented for DAX, but for now I haven't bothered
with that.

								Honza

* [PATCH 01/10] DAX: move RADIX_DAX_ definitions to dax.c
  2016-03-21 13:22 ` Jan Kara
@ 2016-03-21 13:22   ` Jan Kara
  -1 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-21 13:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Jan Kara, linux-nvdimm, NeilBrown, Wilcox, Matthew R

From: NeilBrown <neilb@suse.com>

These don't belong in radix-tree.h any more than PAGECACHE_TAG_* do.
Let's try to maintain the idea that radix-tree simply implements an
abstract data type.
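
For reference, the macros moved here pack a sector number plus a PTE/PMD tag
into an exceptional entry; usage is roughly this (illustrative snippet, not
part of the patch):

  	sector_t sector = 1234;		/* example block device sector */
  	sector_t s;
  	void *entry;

  	/* Encode a PTE-sized DAX entry (false: PTE, not PMD) */
  	entry = RADIX_DAX_ENTRY(sector, false);

  	/* Later, decode the type and recover the sector */
  	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PTE)
  		s = RADIX_DAX_SECTOR(entry);	/* s == 1234 again */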

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c                   | 9 +++++++++
 include/linux/radix-tree.h | 9 ---------
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index bbb2ad783770..b32e1b5eb8d4 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -32,6 +32,15 @@
 #include <linux/pfn_t.h>
 #include <linux/sizes.h>
 
+#define RADIX_DAX_MASK	0xf
+#define RADIX_DAX_SHIFT	4
+#define RADIX_DAX_PTE  (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_PMD  (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_MASK)
+#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
+#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
+		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
+
 static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
 {
 	struct request_queue *q = bdev->bd_queue;
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 51a97ac8bfbf..d08d6ec3bf53 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -52,15 +52,6 @@
 #define RADIX_TREE_EXCEPTIONAL_ENTRY	2
 #define RADIX_TREE_EXCEPTIONAL_SHIFT	2
 
-#define RADIX_DAX_MASK	0xf
-#define RADIX_DAX_SHIFT	4
-#define RADIX_DAX_PTE  (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
-#define RADIX_DAX_PMD  (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
-#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_MASK)
-#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
-#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
-		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
-
 static inline int radix_tree_is_indirect_ptr(void *ptr)
 {
 	return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR);
-- 
2.6.2


* [PATCH 02/10] radix-tree: make 'indirect' bit available to exception entries.
  2016-03-21 13:22 ` Jan Kara
@ 2016-03-21 13:22   ` Jan Kara
  -1 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-21 13:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Jan Kara, linux-nvdimm, NeilBrown, Wilcox, Matthew R

From: NeilBrown <neilb@suse.com>

A pointer to a radix_tree_node will always have the 'exception'
bit cleared, so if the exception bit is set the value cannot
be an indirect pointer.  Thus it is safe to make the 'indirect bit'
available to store extra information in exception entries.

This patch adds RADIX_TREE_INDIRECT_MASK, and a value is only treated as
an indirect (pointer) entry if its two least-significant bits are '01'.

The change in radix-tree.c ensures the stored value still looks like an
indirect pointer, and saves a load as well.

We could swap the two bits and so keep all the exceptional bits contiguous,
but I have other plans for that bit...
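
Spelled out, the low-bit encodings after this change are (my summary, not
part of the patch):

  	low two bits 00 -> pointer to struct page (or NULL)
  	low two bits 01 -> indirect pointer to a radix_tree_node
  	low two bits 10 -> exceptional entry
  	low two bits 11 -> exceptional entry with the freed-up bit set
  	                   (reused as a lock bit by DAX later in this series)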

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/radix-tree.h | 11 +++++++++--
 lib/radix-tree.c           |  2 +-
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index d08d6ec3bf53..2bc8c5829441 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -41,8 +41,13 @@
  * Indirect pointer in fact is also used to tag the last pointer of a node
  * when it is shrunk, before we rcu free the node. See shrink code for
  * details.
+ *
+ * To allow an exception entry to only lose one bit, we ignore
+ * the INDIRECT bit when the exception bit is set.  So an entry is
+ * indirect if the least significant 2 bits are 01.
  */
 #define RADIX_TREE_INDIRECT_PTR		1
+#define RADIX_TREE_INDIRECT_MASK	3
 /*
  * A common use of the radix tree is to store pointers to struct pages;
  * but shmem/tmpfs needs also to store swap entries in the same tree:
@@ -54,7 +59,8 @@
 
 static inline int radix_tree_is_indirect_ptr(void *ptr)
 {
-	return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR);
+	return ((unsigned long)ptr & RADIX_TREE_INDIRECT_MASK)
+		== RADIX_TREE_INDIRECT_PTR;
 }
 
 /*** radix-tree API starts here ***/
@@ -222,7 +228,8 @@ static inline void *radix_tree_deref_slot_protected(void **pslot,
  */
 static inline int radix_tree_deref_retry(void *arg)
 {
-	return unlikely((unsigned long)arg & RADIX_TREE_INDIRECT_PTR);
+	return unlikely(((unsigned long)arg & RADIX_TREE_INDIRECT_MASK)
+			== RADIX_TREE_INDIRECT_PTR);
 }
 
 /**
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 1624c4117961..c6af1a445b67 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -1412,7 +1412,7 @@ static inline void radix_tree_shrink(struct radix_tree_root *root)
 		 * to force callers to retry.
 		 */
 		if (root->height == 0)
-			*((unsigned long *)&to_free->slots[0]) |=
+			*((unsigned long *)&to_free->slots[0]) =
 						RADIX_TREE_INDIRECT_PTR;
 
 		radix_tree_node_free(to_free);
-- 
2.6.2


* [PATCH 03/10] dax: Remove complete_unwritten argument
  2016-03-21 13:22 ` Jan Kara
@ 2016-03-21 13:22   ` Jan Kara
  -1 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-21 13:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Jan Kara, linux-nvdimm, NeilBrown, Wilcox, Matthew R

Fault handlers currently take a complete_unwritten argument used to convert
unwritten extents after PTEs are updated. However, no filesystem uses this
anymore because that mechanism is racy. Remove the unused argument.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c      |  4 ++--
 fs/dax.c            | 43 +++++++++----------------------------------
 fs/ext2/file.c      |  4 ++--
 fs/ext4/file.c      |  4 ++--
 fs/xfs/xfs_file.c   |  7 +++----
 include/linux/dax.h | 17 +++++++----------
 include/linux/fs.h  |  1 -
 7 files changed, 25 insertions(+), 55 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 3172c4e2f502..a59f155f9aaf 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1746,7 +1746,7 @@ static const struct address_space_operations def_blk_aops = {
  */
 static int blkdev_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	return __dax_fault(vma, vmf, blkdev_get_block, NULL);
+	return __dax_fault(vma, vmf, blkdev_get_block);
 }
 
 static int blkdev_dax_pfn_mkwrite(struct vm_area_struct *vma,
@@ -1758,7 +1758,7 @@ static int blkdev_dax_pfn_mkwrite(struct vm_area_struct *vma,
 static int blkdev_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 		pmd_t *pmd, unsigned int flags)
 {
-	return __dax_pmd_fault(vma, addr, pmd, flags, blkdev_get_block, NULL);
+	return __dax_pmd_fault(vma, addr, pmd, flags, blkdev_get_block);
 }
 
 static const struct vm_operations_struct blkdev_dax_vm_ops = {
diff --git a/fs/dax.c b/fs/dax.c
index b32e1b5eb8d4..d496466652cd 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -607,19 +607,13 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
  * @vma: The virtual memory area where the fault occurred
  * @vmf: The description of the fault
  * @get_block: The filesystem method used to translate file offsets to blocks
- * @complete_unwritten: The filesystem method used to convert unwritten blocks
- *	to written so the data written to them is exposed. This is required for
- *	required by write faults for filesystems that will return unwritten
- *	extent mappings from @get_block, but it is optional for reads as
- *	dax_insert_mapping() will always zero unwritten blocks. If the fs does
- *	not support unwritten extents, the it should pass NULL.
  *
  * When a page fault occurs, filesystems may call this helper in their
  * fault handler for DAX files. __dax_fault() assumes the caller has done all
  * the necessary locking for the page fault to proceed successfully.
  */
 int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
-			get_block_t get_block, dax_iodone_t complete_unwritten)
+			get_block_t get_block)
 {
 	struct file *file = vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
@@ -722,23 +716,9 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		page = NULL;
 	}
 
-	/*
-	 * If we successfully insert the new mapping over an unwritten extent,
-	 * we need to ensure we convert the unwritten extent. If there is an
-	 * error inserting the mapping, the filesystem needs to leave it as
-	 * unwritten to prevent exposure of the stale underlying data to
-	 * userspace, but we still need to call the completion function so
-	 * the private resources on the mapping buffer can be released. We
-	 * indicate what the callback should do via the uptodate variable, same
-	 * as for normal BH based IO completions.
-	 */
+	/* Filesystem should not return unwritten buffers to us! */
+	WARN_ON_ONCE(buffer_unwritten(&bh));
 	error = dax_insert_mapping(inode, &bh, vma, vmf);
-	if (buffer_unwritten(&bh)) {
-		if (complete_unwritten)
-			complete_unwritten(&bh, !error);
-		else
-			WARN_ON_ONCE(!(vmf->flags & FAULT_FLAG_WRITE));
-	}
 
  out:
 	if (error == -ENOMEM)
@@ -767,7 +747,7 @@ EXPORT_SYMBOL(__dax_fault);
  * fault handler for DAX files.
  */
 int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
-	      get_block_t get_block, dax_iodone_t complete_unwritten)
+	      get_block_t get_block)
 {
 	int result;
 	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
@@ -776,7 +756,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		sb_start_pagefault(sb);
 		file_update_time(vma->vm_file);
 	}
-	result = __dax_fault(vma, vmf, get_block, complete_unwritten);
+	result = __dax_fault(vma, vmf, get_block);
 	if (vmf->flags & FAULT_FLAG_WRITE)
 		sb_end_pagefault(sb);
 
@@ -810,8 +790,7 @@ static void __dax_dbg(struct buffer_head *bh, unsigned long address,
 #define dax_pmd_dbg(bh, address, reason)	__dax_dbg(bh, address, reason, "dax_pmd")
 
 int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
-		pmd_t *pmd, unsigned int flags, get_block_t get_block,
-		dax_iodone_t complete_unwritten)
+		pmd_t *pmd, unsigned int flags, get_block_t get_block)
 {
 	struct file *file = vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
@@ -870,6 +849,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		if (get_block(inode, block, &bh, 1) != 0)
 			return VM_FAULT_SIGBUS;
 		alloc = true;
+		WARN_ON_ONCE(buffer_unwritten(&bh));
 	}
 
 	bdev = bh.b_bdev;
@@ -1015,9 +995,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
  out:
 	i_mmap_unlock_read(mapping);
 
-	if (buffer_unwritten(&bh))
-		complete_unwritten(&bh, !(result & VM_FAULT_ERROR));
-
 	return result;
 
  fallback:
@@ -1037,8 +1014,7 @@ EXPORT_SYMBOL_GPL(__dax_pmd_fault);
  * pmd_fault handler for DAX files.
  */
 int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
-			pmd_t *pmd, unsigned int flags, get_block_t get_block,
-			dax_iodone_t complete_unwritten)
+			pmd_t *pmd, unsigned int flags, get_block_t get_block)
 {
 	int result;
 	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
@@ -1047,8 +1023,7 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		sb_start_pagefault(sb);
 		file_update_time(vma->vm_file);
 	}
-	result = __dax_pmd_fault(vma, address, pmd, flags, get_block,
-				complete_unwritten);
+	result = __dax_pmd_fault(vma, address, pmd, flags, get_block);
 	if (flags & FAULT_FLAG_WRITE)
 		sb_end_pagefault(sb);
 
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index c1400b109805..868c02317b05 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -51,7 +51,7 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	}
 	down_read(&ei->dax_sem);
 
-	ret = __dax_fault(vma, vmf, ext2_get_block, NULL);
+	ret = __dax_fault(vma, vmf, ext2_get_block);
 
 	up_read(&ei->dax_sem);
 	if (vmf->flags & FAULT_FLAG_WRITE)
@@ -72,7 +72,7 @@ static int ext2_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 	}
 	down_read(&ei->dax_sem);
 
-	ret = __dax_pmd_fault(vma, addr, pmd, flags, ext2_get_block, NULL);
+	ret = __dax_pmd_fault(vma, addr, pmd, flags, ext2_get_block);
 
 	up_read(&ei->dax_sem);
 	if (flags & FAULT_FLAG_WRITE)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 6659e216385e..cf20040a1a49 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -207,7 +207,7 @@ static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	if (IS_ERR(handle))
 		result = VM_FAULT_SIGBUS;
 	else
-		result = __dax_fault(vma, vmf, ext4_dax_mmap_get_block, NULL);
+		result = __dax_fault(vma, vmf, ext4_dax_mmap_get_block);
 
 	if (write) {
 		if (!IS_ERR(handle))
@@ -243,7 +243,7 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 		result = VM_FAULT_SIGBUS;
 	else
 		result = __dax_pmd_fault(vma, addr, pmd, flags,
-				ext4_dax_mmap_get_block, NULL);
+				ext4_dax_mmap_get_block);
 
 	if (write) {
 		if (!IS_ERR(handle))
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 52883ac3cf84..2ecdb39d2424 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1526,7 +1526,7 @@ xfs_filemap_page_mkwrite(
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
 	if (IS_DAX(inode)) {
-		ret = __dax_mkwrite(vma, vmf, xfs_get_blocks_dax_fault, NULL);
+		ret = __dax_mkwrite(vma, vmf, xfs_get_blocks_dax_fault);
 	} else {
 		ret = block_page_mkwrite(vma, vmf, xfs_get_blocks);
 		ret = block_page_mkwrite_return(ret);
@@ -1560,7 +1560,7 @@ xfs_filemap_fault(
 		 * changes to xfs_get_blocks_direct() to map unwritten extent
 		 * ioend for conversion on read-only mappings.
 		 */
-		ret = __dax_fault(vma, vmf, xfs_get_blocks_dax_fault, NULL);
+		ret = __dax_fault(vma, vmf, xfs_get_blocks_dax_fault);
 	} else
 		ret = filemap_fault(vma, vmf);
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
@@ -1597,8 +1597,7 @@ xfs_filemap_pmd_fault(
 	}
 
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
-	ret = __dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault,
-			      NULL);
+	ret = __dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault);
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
 	if (flags & FAULT_FLAG_WRITE)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 636dd59ab505..7c45ac7ea1d1 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -10,10 +10,8 @@ ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t,
 int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size);
 int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
-int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
-		dax_iodone_t);
-int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
-		dax_iodone_t);
+int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
+int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 
 #ifdef CONFIG_FS_DAX
 struct page *read_dax_sector(struct block_device *bdev, sector_t n);
@@ -27,21 +25,20 @@ static inline struct page *read_dax_sector(struct block_device *bdev,
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
-				unsigned int flags, get_block_t, dax_iodone_t);
+				unsigned int flags, get_block_t);
 int __dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
-				unsigned int flags, get_block_t, dax_iodone_t);
+				unsigned int flags, get_block_t);
 #else
 static inline int dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
-				pmd_t *pmd, unsigned int flags, get_block_t gb,
-				dax_iodone_t di)
+				pmd_t *pmd, unsigned int flags, get_block_t gb)
 {
 	return VM_FAULT_FALLBACK;
 }
 #define __dax_pmd_fault dax_pmd_fault
 #endif
 int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
-#define dax_mkwrite(vma, vmf, gb, iod)		dax_fault(vma, vmf, gb, iod)
-#define __dax_mkwrite(vma, vmf, gb, iod)	__dax_fault(vma, vmf, gb, iod)
+#define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
+#define __dax_mkwrite(vma, vmf, gb)	__dax_fault(vma, vmf, gb)
 
 static inline bool vma_is_dax(struct vm_area_struct *vma)
 {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bb703ef728d1..960fa5e0f7c3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -72,7 +72,6 @@ typedef int (get_block_t)(struct inode *inode, sector_t iblock,
 			struct buffer_head *bh_result, int create);
 typedef void (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 			ssize_t bytes, void *private);
-typedef void (dax_iodone_t)(struct buffer_head *bh_map, int uptodate);
 
 #define MAY_EXEC		0x00000001
 #define MAY_WRITE		0x00000002
-- 
2.6.2


* [PATCH 04/10] dax: Fix data corruption for written and mmapped files
  2016-03-21 13:22 ` Jan Kara
@ 2016-03-21 13:22   ` Jan Kara
  -1 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-21 13:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Jan Kara, linux-nvdimm, NeilBrown, Wilcox, Matthew R

When a fault to a hole races with a write filling the hole, it can happen
that the block zeroing in __dax_fault() overwrites the data copied in by the
write. Since the filesystem is supposed to provide pre-zeroed blocks for the
fault anyway, just remove the racy zeroing from the DAX code. The only catch
is with read faults over an unwritten block, where __dax_fault() used to map
the block into the page tables anyway. For that case we now have to fall back
to using a hole page.
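
For clarity, the race being removed looks roughly like this (my
reconstruction in the style of the diagram in patch 07; the exact write path
differs per filesystem):

CPU0 (write filling a hole)             CPU1 (fault on the same index)

write(2)
  get_block() -> allocates block
  data copied into the new block        __dax_fault()
                                          get_block() -> returns the new
                                            (unwritten/new) block
                                          dax_insert_mapping()
                                            clear_pmem() zeroes the block,
                                            destroying the data CPU0 wrote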

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index d496466652cd..50d81172438b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -582,11 +582,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 		error = PTR_ERR(dax.addr);
 		goto out;
 	}
-
-	if (buffer_unwritten(bh) || buffer_new(bh)) {
-		clear_pmem(dax.addr, PAGE_SIZE);
-		wmb_pmem();
-	}
 	dax_unmap_atomic(bdev, &dax);
 
 	error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
@@ -665,7 +660,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	if (error)
 		goto unlock_page;
 
-	if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page) {
+	if (!buffer_mapped(&bh) && !vmf->cow_page) {
 		if (vmf->flags & FAULT_FLAG_WRITE) {
 			error = get_block(inode, block, &bh, 1);
 			count_vm_event(PGMAJFAULT);
@@ -950,8 +945,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		}
 
 		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
-			clear_pmem(dax.addr, PMD_SIZE);
-			wmb_pmem();
 			count_vm_event(PGMAJFAULT);
 			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
 			result |= VM_FAULT_MAJOR;
-- 
2.6.2


* [PATCH 05/10] dax: Allow DAX code to replace exceptional entries
  2016-03-21 13:22 ` Jan Kara
@ 2016-03-21 13:22   ` Jan Kara
  -1 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-21 13:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Jan Kara, linux-nvdimm, NeilBrown, Wilcox, Matthew R

Currently we forbid page_cache_tree_insert() from replacing exceptional radix
tree entries for DAX inodes. However, to make DAX faults race free we will
lock radix tree entries, and when a hole is created we need to replace such a
locked radix tree entry with a hole page. So modify page_cache_tree_insert()
to allow that.
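
The only entry DAX is expected to replace this way is the empty, locked one,
i.e. (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | DAX_ENTRY_LOCK), which is what
the WARN_ON_ONCE() below checks for. The intended flow, sketched with
illustrative steps (not the series' actual function names):

  	/*
  	 * DAX fault path (sketch):
  	 *	the entry-locking helper inserts
  	 *	    (RADIX_TREE_EXCEPTIONAL_ENTRY | DAX_ENTRY_LOCK)
  	 *	    at the faulting index when nothing is present yet;
  	 *	if the block turns out to be a hole:
  	 *		dax_load_hole()
  	 *		  -> find_or_create_page()
  	 *		     -> page_cache_tree_insert() replaces the locked
  	 *		        empty entry with the freshly allocated page
  	 */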

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/dax.h |  6 ++++++
 mm/filemap.c        | 18 +++++++++++-------
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/include/linux/dax.h b/include/linux/dax.h
index 7c45ac7ea1d1..4b63923e1f8d 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -3,8 +3,14 @@
 
 #include <linux/fs.h>
 #include <linux/mm.h>
+#include <linux/radix-tree.h>
 #include <asm/pgtable.h>
 
+/*
+ * Since exceptional entries do not use indirect bit, we reuse it as a lock bit
+ */
+#define DAX_ENTRY_LOCK RADIX_TREE_INDIRECT_PTR
+
 ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t,
 		  get_block_t, dio_iodone_t, int flags);
 int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size);
diff --git a/mm/filemap.c b/mm/filemap.c
index 7c00f105845e..fbebedaf719e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -597,14 +597,18 @@ static int page_cache_tree_insert(struct address_space *mapping,
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
 
-		if (WARN_ON(dax_mapping(mapping)))
-			return -EINVAL;
-
-		if (shadowp)
-			*shadowp = p;
 		mapping->nrexceptional--;
-		if (node)
-			workingset_node_shadows_dec(node);
+		if (!dax_mapping(mapping)) {
+			if (shadowp)
+				*shadowp = p;
+			if (node)
+				workingset_node_shadows_dec(node);
+		} else {
+			/* DAX can replace empty locked entry with a hole */
+			WARN_ON_ONCE(p !=
+				(void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
+					 DAX_ENTRY_LOCK));
+		}
 	}
 	radix_tree_replace_slot(slot, page);
 	mapping->nrpages++;
-- 
2.6.2


* [PATCH 06/10] dax: Remove redundant inode size checks
  2016-03-21 13:22 ` Jan Kara
@ 2016-03-21 13:22   ` Jan Kara
  -1 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-21 13:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Jan Kara, linux-nvdimm, NeilBrown, Wilcox, Matthew R

Callers of the DAX fault handlers must make sure these calls cannot race
with truncate. Thus it is enough to check the inode size when entering the
function, and we don't have to recheck it later in the handler.
Note that the inode size itself can be decreased while the fault handler
runs, but filesystem locking prevents any radix tree or block mapping
information changes resulting from the truncate, and that is what we really
care about.
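
That guarantee comes from the callers: the filesystem fault handlers hold a
per-inode lock across the call which truncate takes exclusively, e.g. the
ext2 path visible in patch 03 (abridged; write-fault bookkeeping omitted):

  	down_read(&ei->dax_sem);	/* ext2 truncate takes dax_sem for write */
  	ret = __dax_fault(vma, vmf, ext2_get_block);
  	up_read(&ei->dax_sem);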

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 59 +----------------------------------------------------------
 1 file changed, 1 insertion(+), 58 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 50d81172438b..0329ec0bee2e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -316,20 +316,12 @@ EXPORT_SYMBOL_GPL(dax_do_io);
 static int dax_load_hole(struct address_space *mapping, struct page *page,
 							struct vm_fault *vmf)
 {
-	unsigned long size;
 	struct inode *inode = mapping->host;
 	if (!page)
 		page = find_or_create_page(mapping, vmf->pgoff,
 						GFP_KERNEL | __GFP_ZERO);
 	if (!page)
 		return VM_FAULT_OOM;
-	/* Recheck i_size under page lock to avoid truncate race */
-	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	if (vmf->pgoff >= size) {
-		unlock_page(page);
-		page_cache_release(page);
-		return VM_FAULT_SIGBUS;
-	}
 
 	vmf->page = page;
 	return VM_FAULT_LOCKED;
@@ -560,24 +552,10 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 		.sector = to_sector(bh, inode),
 		.size = bh->b_size,
 	};
-	pgoff_t size;
 	int error;
 
 	i_mmap_lock_read(mapping);
 
-	/*
-	 * Check truncate didn't happen while we were allocating a block.
-	 * If it did, this block may or may not be still allocated to the
-	 * file.  We can't tell the filesystem to free it because we can't
-	 * take i_mutex here.  In the worst case, the file still has blocks
-	 * allocated past the end of the file.
-	 */
-	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	if (unlikely(vmf->pgoff >= size)) {
-		error = -EIO;
-		goto out;
-	}
-
 	if (dax_map_atomic(bdev, &dax) < 0) {
 		error = PTR_ERR(dax.addr);
 		goto out;
@@ -643,15 +621,6 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			page_cache_release(page);
 			goto repeat;
 		}
-		size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
-		if (unlikely(vmf->pgoff >= size)) {
-			/*
-			 * We have a struct page covering a hole in the file
-			 * from a read fault and we've raced with a truncate
-			 */
-			error = -EIO;
-			goto unlock_page;
-		}
 	}
 
 	error = get_block(inode, block, &bh, 0);
@@ -684,17 +653,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		if (error)
 			goto unlock_page;
 		vmf->page = page;
-		if (!page) {
+		if (!page)
 			i_mmap_lock_read(mapping);
-			/* Check we didn't race with truncate */
-			size = (i_size_read(inode) + PAGE_SIZE - 1) >>
-								PAGE_SHIFT;
-			if (vmf->pgoff >= size) {
-				i_mmap_unlock_read(mapping);
-				error = -EIO;
-				goto out;
-			}
-		}
 		return VM_FAULT_LOCKED;
 	}
 
@@ -872,23 +832,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 
 	i_mmap_lock_read(mapping);
 
-	/*
-	 * If a truncate happened while we were allocating blocks, we may
-	 * leave blocks allocated to the file that are beyond EOF.  We can't
-	 * take i_mutex here, so just leave them hanging; they'll be freed
-	 * when the file is deleted.
-	 */
-	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	if (pgoff >= size) {
-		result = VM_FAULT_SIGBUS;
-		goto out;
-	}
-	if ((pgoff | PG_PMD_COLOUR) >= size) {
-		dax_pmd_dbg(&bh, address,
-				"offset + huge page size > file size");
-		goto fallback;
-	}
-
 	if (!write && !buffer_mapped(&bh) && buffer_uptodate(&bh)) {
 		spinlock_t *ptl;
 		pmd_t entry;
-- 
2.6.2


* [PATCH 07/10] dax: Disable huge page handling
  2016-03-21 13:22 ` Jan Kara
@ 2016-03-21 13:22   ` Jan Kara
  -1 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-21 13:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Jan Kara, linux-nvdimm, NeilBrown, Wilcox, Matthew R

Currently the handling of huge pages for DAX is racy. For example the
following can happen:

CPU0 (THP write fault)			CPU1 (normal read fault)

__dax_pmd_fault()			__dax_fault()
  get_block(inode, block, &bh, 0) -> not mapped
					get_block(inode, block, &bh, 0)
					  -> not mapped
  if (!buffer_mapped(&bh) && write)
    get_block(inode, block, &bh, 1) -> allocates blocks
  truncate_pagecache_range(inode, lstart, lend);
					dax_load_hole();

This results in data corruption since the process on CPU1 won't see the
changes CPU0 made to the file.

The race can happen even when two normal faults race; however, with THP
the situation is even worse because the two faults don't operate on the
same radix tree entries, and we want to use these entries for
serialization. So disable THP support in the DAX code for now.
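
To make the index mismatch concrete, here is a small standalone sketch
(not part of the patch; the PAGE_SHIFT / PMD_SHIFT values are the usual
x86-64 ones and the helper only mirrors the DAX_PMD_INDEX() idea): a PTE
fault keys its radix tree entry at the faulting page index, while a PMD
fault keys it at the PMD-aligned index, so the two never contend on a
single entry.

/*
 * Illustrative userspace sketch only, not code from this patch: shows
 * that a PTE fault and a PMD fault covering the same file offset end up
 * looking at different radix tree indices.
 */
#include <stdio.h>

#define PAGE_SHIFT	12
#define PMD_SHIFT	21
#define PMD_MASK	(~((1UL << PMD_SHIFT) - 1))

static unsigned long dax_pmd_index(unsigned long page_index)
{
	return page_index & (PMD_MASK >> PAGE_SHIFT);
}

int main(void)
{
	unsigned long pgoff = 0x213;	/* arbitrary page offset in a file */

	printf("PTE entry index: %#lx\n", pgoff);		/* 0x213 */
	printf("PMD entry index: %#lx\n", dax_pmd_index(pgoff));	/* 0x200 */
	return 0;
}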

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c            | 2 +-
 include/linux/dax.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 0329ec0bee2e..444e9dd079ca 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -719,7 +719,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 }
 EXPORT_SYMBOL_GPL(dax_fault);
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if 0
 /*
  * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
  * more often than one might expect in the below function.
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 4b63923e1f8d..fd28d824254b 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -29,7 +29,7 @@ static inline struct page *read_dax_sector(struct block_device *bdev,
 }
 #endif
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if 0
 int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
 				unsigned int flags, get_block_t);
 int __dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH 08/10] dax: New fault locking
  2016-03-21 13:22 ` Jan Kara
@ 2016-03-21 13:22   ` Jan Kara
  -1 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-21 13:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Jan Kara, linux-nvdimm, NeilBrown, Wilcox, Matthew R

Currently DAX page fault locking is racy.

CPU0 (write fault)		CPU1 (read fault)

__dax_fault()			__dax_fault()
  get_block(inode, block, &bh, 0) -> not mapped
				  get_block(inode, block, &bh, 0)
				    -> not mapped
  if (!buffer_mapped(&bh))
    if (vmf->flags & FAULT_FLAG_WRITE)
      get_block(inode, block, &bh, 1) -> allocates blocks
  if (page) -> no
				  if (!buffer_mapped(&bh))
				    if (vmf->flags & FAULT_FLAG_WRITE) {
				    } else {
				      dax_load_hole();
				    }
  dax_insert_mapping()

We then end up failing in dax_radix_entry() with -EIO.

Another problem with the current DAX page fault locking is that there is
no race-free way to clear the dirty tag in the radix tree. We can always
end up with a clean radix tree and dirty data in the CPU cache.

We fix the first problem by introducing locking of exceptional radix
tree entries in DAX mappings. It acts very similarly to the page lock
and thus properly synchronizes faults against the same mapping index.
The same lock can later be used to avoid races when clearing the radix
tree dirty tag.
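
In outline, the fault path below then follows a lock / operate / unlock
pattern analogous to the page lock. A much simplified sketch of that
flow (error paths and the actual block mapping work omitted; see the
real __dax_fault() in the diff):

/* Simplified sketch of the usage pattern introduced below. */
static int dax_fault_sketch(struct address_space *mapping,
			    struct vm_fault *vmf)
{
	void *entry;

	/* Returns a locked hole page or a locked exceptional entry */
	entry = grab_mapping_entry(mapping, vmf->pgoff);
	if (IS_ERR(entry))
		return PTR_ERR(entry);

	/*
	 * Block lookup / allocation and PTE insertion happen here while
	 * the entry is locked; a racing fault on the same index waits on
	 * the per-entry wait queue instead of seeing a half-installed
	 * mapping.
	 */

	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
	return 0;
}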

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c            | 500 ++++++++++++++++++++++++++++++++++++++--------------
 include/linux/dax.h |   1 +
 mm/truncate.c       |  62 ++++---
 3 files changed, 396 insertions(+), 167 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 444e9dd079ca..4fcac59b6dcb 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -41,6 +41,30 @@
 #define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
 		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
 
+/* We choose 4096 entries - same as per-zone page wait tables */
+#define DAX_WAIT_TABLE_BITS 12
+#define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
+
+wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
+
+static int __init init_dax_wait_table(void)
+{
+	int i;
+
+	for (i = 0; i < DAX_WAIT_TABLE_ENTRIES; i++)
+		init_waitqueue_head(wait_table + i);
+	return 0;
+}
+fs_initcall(init_dax_wait_table);
+
+static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
+					      pgoff_t index)
+{
+	unsigned long hash = hash_long((unsigned long)mapping ^ index,
+				       DAX_WAIT_TABLE_BITS);
+	return wait_table + hash;
+}
+
 static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
 {
 	struct request_queue *q = bdev->bd_queue;
@@ -306,6 +330,237 @@ ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
 EXPORT_SYMBOL_GPL(dax_do_io);
 
 /*
+ * DAX radix tree locking
+ */
+struct exceptional_entry_key {
+	struct radix_tree_root *root;
+	unsigned long index;
+};
+
+struct wait_exceptional_entry_queue {
+	wait_queue_t wait;
+	struct exceptional_entry_key key;
+};
+
+static int wake_exceptional_entry_func(wait_queue_t *wait, unsigned mode,
+				       int sync, void *keyp)
+{
+	struct exceptional_entry_key *key = keyp;
+	struct wait_exceptional_entry_queue *ewait =
+		container_of(wait, struct wait_exceptional_entry_queue, wait);
+
+	if (key->root != ewait->key.root || key->index != ewait->key.index)
+		return 0;
+	return autoremove_wake_function(wait, mode, sync, NULL);
+}
+
+static inline int slot_locked(void **v)
+{
+	unsigned long l = *(unsigned long *)v;
+	return l & DAX_ENTRY_LOCK;
+}
+
+static inline void *lock_slot(void **v)
+{
+	unsigned long *l = (unsigned long *)v;
+	return (void*)(*l |= DAX_ENTRY_LOCK);
+}
+
+static inline void *unlock_slot(void **v)
+{
+	unsigned long *l = (unsigned long *)v;
+	return (void*)(*l &= ~(unsigned long)DAX_ENTRY_LOCK);
+}
+
+/*
+ * Lookup entry in radix tree, wait for it to become unlocked if it is
+ * exceptional entry and return.
+ *
+ * The function must be called with mapping->tree_lock held.
+ */
+static void *lookup_unlocked_mapping_entry(struct address_space *mapping,
+					   pgoff_t index, void ***slotp)
+{
+	void *ret, **slot;
+	struct wait_exceptional_entry_queue wait;
+	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+
+	init_wait(&wait.wait);
+	wait.wait.func = wake_exceptional_entry_func;
+	wait.key.root = &mapping->page_tree;
+	wait.key.index = index;
+
+	for (;;) {
+		ret = __radix_tree_lookup(&mapping->page_tree, index, NULL,
+					  &slot);
+		if (!ret || !radix_tree_exceptional_entry(ret) ||
+		    !slot_locked(slot)) {
+			if (slotp)
+				*slotp = slot;
+			return ret;
+		}
+		prepare_to_wait_exclusive(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
+		spin_unlock_irq(&mapping->tree_lock);
+		schedule();
+		finish_wait(wq, &wait.wait);
+		spin_lock_irq(&mapping->tree_lock);
+	}
+}
+
+/*
+ * Find radix tree entry at given index. If it points to a page, return with
+ * the page locked. If it points to the exceptional entry, return with the
+ * radix tree entry locked. If the radix tree doesn't contain given index,
+ * create empty exceptional entry for the index and return with it locked.
+ *
+ * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
+ * persistent memory the benefit is doubtful. We can add that later if we can
+ * show it helps.
+ */
+static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
+{
+	void *ret, **slot;
+
+restart:
+	spin_lock_irq(&mapping->tree_lock);
+	ret = lookup_unlocked_mapping_entry(mapping, index, &slot);
+	/* No entry for given index? Make sure radix tree is big enough. */
+	if (!ret) {
+		int err;
+
+		spin_unlock_irq(&mapping->tree_lock);
+		err = radix_tree_preload(
+				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
+		if (err)
+			return ERR_PTR(err);
+		ret = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | DAX_ENTRY_LOCK);
+		spin_lock_irq(&mapping->tree_lock);
+		err = radix_tree_insert(&mapping->page_tree, index, ret);
+		radix_tree_preload_end();
+		if (err) {
+			spin_unlock_irq(&mapping->tree_lock);
+			/* Someone already created the entry? */
+			if (err == -EEXIST)
+				goto restart;
+			return ERR_PTR(err);
+		}
+		/* Good, we have inserted empty locked entry into the tree. */
+		mapping->nrexceptional++;
+		spin_unlock_irq(&mapping->tree_lock);
+		return ret;
+	}
+	/* Normal page in radix tree? */
+	if (!radix_tree_exceptional_entry(ret)) {
+		struct page *page = ret;
+
+		page_cache_get(page);
+		spin_unlock_irq(&mapping->tree_lock);
+		lock_page(page);
+		/* Page got truncated? Retry... */
+		if (unlikely(page->mapping != mapping)) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto restart;
+		}
+		return page;
+	}
+	ret = lock_slot(slot);
+	spin_unlock_irq(&mapping->tree_lock);
+	return ret;
+}
+
+static void unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
+{
+	void *ret, **slot;
+	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+
+	spin_lock_irq(&mapping->tree_lock);
+	ret = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
+	if (WARN_ON_ONCE(!ret || !radix_tree_exceptional_entry(ret))) {
+		spin_unlock_irq(&mapping->tree_lock);
+		return;
+	}
+	if (WARN_ON_ONCE(!slot_locked(slot))) {
+		spin_unlock_irq(&mapping->tree_lock);
+		return;
+	}
+	unlock_slot(slot);
+	spin_unlock_irq(&mapping->tree_lock);
+	if (waitqueue_active(wq)) {
+		struct exceptional_entry_key key;
+
+		key.root = &mapping->page_tree;
+		key.index = index;
+		__wake_up(wq, TASK_NORMAL, 1, &key);
+	}
+}
+
+static void put_locked_mapping_entry(struct address_space *mapping,
+				     pgoff_t index, void *entry)
+{
+	if (!radix_tree_exceptional_entry(entry)) {
+		unlock_page(entry);
+		page_cache_release(entry);
+	} else {
+		unlock_mapping_entry(mapping, index);
+	}
+}
+
+/*
+ * Called when we are done with radix tree entry we looked up via
+ * lookup_unlocked_mapping_entry() and which we didn't lock in the end.
+ */
+static void put_unlocked_mapping_entry(struct address_space *mapping,
+				       pgoff_t index, void *entry)
+{
+	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+
+	if(!radix_tree_exceptional_entry(entry))
+		return;
+
+	/* We have to wake up next waiter for the radix tree entry lock */
+	if (waitqueue_active(wq)) {
+		struct exceptional_entry_key key;
+
+		key.root = &mapping->page_tree;
+		key.index = index;
+		__wake_up(wq, TASK_NORMAL, 1, &key);
+	}
+}
+
+/*
+ * Delete exceptional DAX entry at @index from @mapping. Wait for radix tree
+ * entry to get unlocked before deleting it.
+ */
+int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
+{
+	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+	void *entry;
+
+	spin_lock_irq(&mapping->tree_lock);
+	entry = lookup_unlocked_mapping_entry(mapping, index, NULL);
+	/*
+	 * Caller should make sure radix tree modifications don't race and
+	 * we have seen exceptional entry here before.
+	 */
+	if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry))) {
+		spin_unlock_irq(&mapping->tree_lock);
+		return 0;
+	}
+	radix_tree_delete(&mapping->page_tree, index);
+	mapping->nrexceptional--;
+	spin_unlock_irq(&mapping->tree_lock);
+	if (waitqueue_active(wq)) {
+		struct exceptional_entry_key key;
+
+		key.root = &mapping->page_tree;
+		key.index = index;
+		__wake_up(wq, TASK_NORMAL, 0, &key);
+	}
+	return 1;
+}
+
+/*
  * The user has performed a load from a hole in the file.  Allocating
  * a new page in the file would cause excessive storage usage for
  * workloads with sparse files.  We allocate a page cache page instead.
@@ -313,16 +568,24 @@ EXPORT_SYMBOL_GPL(dax_do_io);
  * otherwise it will simply fall out of the page cache under memory
  * pressure without ever having been dirtied.
  */
-static int dax_load_hole(struct address_space *mapping, struct page *page,
-							struct vm_fault *vmf)
+static int dax_load_hole(struct address_space *mapping, void *entry,
+			 struct vm_fault *vmf)
 {
-	struct inode *inode = mapping->host;
-	if (!page)
-		page = find_or_create_page(mapping, vmf->pgoff,
-						GFP_KERNEL | __GFP_ZERO);
-	if (!page)
-		return VM_FAULT_OOM;
+	struct page *page;
+
+	/* Hole page already exists? Return it...  */
+	if (!radix_tree_exceptional_entry(entry)) {
+		vmf->page = entry;
+		return VM_FAULT_LOCKED;
+	}
 
+	/* This will replace locked radix tree entry with a hole page */
+	page = find_or_create_page(mapping, vmf->pgoff,
+				   vmf->gfp_mask | __GFP_ZERO);
+	if (!page) {
+		put_locked_mapping_entry(mapping, vmf->pgoff, entry);
+		return VM_FAULT_OOM;
+	}
 	vmf->page = page;
 	return VM_FAULT_LOCKED;
 }
@@ -346,77 +609,54 @@ static int copy_user_bh(struct page *to, struct inode *inode,
 	return 0;
 }
 
-#define NO_SECTOR -1
 #define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_CACHE_SHIFT))
 
-static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
-		sector_t sector, bool pmd_entry, bool dirty)
+static void *dax_mapping_entry(struct address_space *mapping, pgoff_t index,
+			       void *entry, sector_t sector, bool dirty,
+			       gfp_t gfp_mask)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
-	pgoff_t pmd_index = DAX_PMD_INDEX(index);
-	int type, error = 0;
-	void *entry;
+	int error = 0;
+	bool hole_fill = false;
+	void *ret;
 
-	WARN_ON_ONCE(pmd_entry && !dirty);
 	if (dirty)
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
-	spin_lock_irq(&mapping->tree_lock);
-
-	entry = radix_tree_lookup(page_tree, pmd_index);
-	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD) {
-		index = pmd_index;
-		goto dirty;
+	/* Replacing hole page with block mapping? */
+	if (!radix_tree_exceptional_entry(entry)) {
+		hole_fill = true;
+		error = radix_tree_preload(gfp_mask);
+		if (error)
+			return ERR_PTR(error);
 	}
 
-	entry = radix_tree_lookup(page_tree, index);
-	if (entry) {
-		type = RADIX_DAX_TYPE(entry);
-		if (WARN_ON_ONCE(type != RADIX_DAX_PTE &&
-					type != RADIX_DAX_PMD)) {
-			error = -EIO;
+	spin_lock_irq(&mapping->tree_lock);
+	ret = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
+		       DAX_ENTRY_LOCK);
+	if (hole_fill) {
+		__delete_from_page_cache(entry, NULL);
+		error = radix_tree_insert(page_tree, index, ret);
+		if (error) {
+			ret = ERR_PTR(error);
 			goto unlock;
 		}
+		mapping->nrexceptional++;
+	} else {
+		void **slot;
+		void *ret2;
 
-		if (!pmd_entry || type == RADIX_DAX_PMD)
-			goto dirty;
-
-		/*
-		 * We only insert dirty PMD entries into the radix tree.  This
-		 * means we don't need to worry about removing a dirty PTE
-		 * entry and inserting a clean PMD entry, thus reducing the
-		 * range we would flush with a follow-up fsync/msync call.
-		 */
-		radix_tree_delete(&mapping->page_tree, index);
-		mapping->nrexceptional--;
-	}
-
-	if (sector == NO_SECTOR) {
-		/*
-		 * This can happen during correct operation if our pfn_mkwrite
-		 * fault raced against a hole punch operation.  If this
-		 * happens the pte that was hole punched will have been
-		 * unmapped and the radix tree entry will have been removed by
-		 * the time we are called, but the call will still happen.  We
-		 * will return all the way up to wp_pfn_shared(), where the
-		 * pte_same() check will fail, eventually causing page fault
-		 * to be retried by the CPU.
-		 */
-		goto unlock;
+		ret2 = __radix_tree_lookup(page_tree, index, NULL, &slot);
+		WARN_ON_ONCE(ret2 != entry);
+		radix_tree_replace_slot(slot, ret);
 	}
-
-	error = radix_tree_insert(page_tree, index,
-			RADIX_DAX_ENTRY(sector, pmd_entry));
-	if (error)
-		goto unlock;
-
-	mapping->nrexceptional++;
- dirty:
 	if (dirty)
 		radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);
  unlock:
 	spin_unlock_irq(&mapping->tree_lock);
-	return error;
+	if (hole_fill)
+		radix_tree_preload_end();
+	return ret;
 }
 
 static int dax_writeback_one(struct block_device *bdev,
@@ -542,17 +782,18 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 }
 EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
 
-static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
+static int dax_insert_mapping(struct address_space *mapping,
+			struct buffer_head *bh, void *entry,
 			struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	unsigned long vaddr = (unsigned long)vmf->virtual_address;
-	struct address_space *mapping = inode->i_mapping;
 	struct block_device *bdev = bh->b_bdev;
 	struct blk_dax_ctl dax = {
-		.sector = to_sector(bh, inode),
+		.sector = to_sector(bh, mapping->host),
 		.size = bh->b_size,
 	};
 	int error;
+	void *ret;
 
 	i_mmap_lock_read(mapping);
 
@@ -562,16 +803,26 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	}
 	dax_unmap_atomic(bdev, &dax);
 
-	error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
-			vmf->flags & FAULT_FLAG_WRITE);
-	if (error)
+	ret = dax_mapping_entry(mapping, vmf->pgoff, entry, dax.sector,
+			        vmf->flags & FAULT_FLAG_WRITE,
+			        vmf->gfp_mask & ~__GFP_HIGHMEM);
+	if (IS_ERR(ret)) {
+		error = PTR_ERR(ret);
 		goto out;
+	}
+	/* Have we replaced hole page? Unmap and free it. */
+	if (!radix_tree_exceptional_entry(entry)) {
+		unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
+				    PAGE_CACHE_SIZE, 0);
+		unlock_page(entry);
+		page_cache_release(entry);
+	}
+	entry = ret;
 
 	error = vm_insert_mixed(vma, vaddr, dax.pfn);
-
  out:
 	i_mmap_unlock_read(mapping);
-
+	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
 	return error;
 }
 
@@ -591,7 +842,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	struct file *file = vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
 	struct inode *inode = mapping->host;
-	struct page *page;
+	void *entry;
 	struct buffer_head bh;
 	unsigned long vaddr = (unsigned long)vmf->virtual_address;
 	unsigned blkbits = inode->i_blkbits;
@@ -600,6 +851,11 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	int error;
 	int major = 0;
 
+	/*
+	 * Check whether offset isn't beyond end of file now. Caller is supposed
+	 * to hold locks serializing us with truncate / punch hole so this is
+	 * a reliable test.
+	 */
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	if (vmf->pgoff >= size)
 		return VM_FAULT_SIGBUS;
@@ -609,40 +865,17 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	bh.b_bdev = inode->i_sb->s_bdev;
 	bh.b_size = PAGE_SIZE;
 
- repeat:
-	page = find_get_page(mapping, vmf->pgoff);
-	if (page) {
-		if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
-			page_cache_release(page);
-			return VM_FAULT_RETRY;
-		}
-		if (unlikely(page->mapping != mapping)) {
-			unlock_page(page);
-			page_cache_release(page);
-			goto repeat;
-		}
+	entry = grab_mapping_entry(mapping, vmf->pgoff);
+	if (IS_ERR(entry)) {
+		error = PTR_ERR(entry);
+		goto out;
 	}
 
 	error = get_block(inode, block, &bh, 0);
 	if (!error && (bh.b_size < PAGE_SIZE))
 		error = -EIO;		/* fs corruption? */
 	if (error)
-		goto unlock_page;
-
-	if (!buffer_mapped(&bh) && !vmf->cow_page) {
-		if (vmf->flags & FAULT_FLAG_WRITE) {
-			error = get_block(inode, block, &bh, 1);
-			count_vm_event(PGMAJFAULT);
-			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
-			major = VM_FAULT_MAJOR;
-			if (!error && (bh.b_size < PAGE_SIZE))
-				error = -EIO;
-			if (error)
-				goto unlock_page;
-		} else {
-			return dax_load_hole(mapping, page, vmf);
-		}
-	}
+		goto unlock_entry;
 
 	if (vmf->cow_page) {
 		struct page *new_page = vmf->cow_page;
@@ -651,30 +884,35 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		else
 			clear_user_highpage(new_page, vaddr);
 		if (error)
-			goto unlock_page;
-		vmf->page = page;
-		if (!page)
+			goto unlock_entry;
+		if (!radix_tree_exceptional_entry(entry)) {
+			vmf->page = entry;
+		} else {
+			unlock_mapping_entry(mapping, vmf->pgoff);
 			i_mmap_lock_read(mapping);
+			vmf->page = NULL;
+		}
 		return VM_FAULT_LOCKED;
 	}
 
-	/* Check we didn't race with a read fault installing a new page */
-	if (!page && major)
-		page = find_lock_page(mapping, vmf->pgoff);
-
-	if (page) {
-		unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
-							PAGE_CACHE_SIZE, 0);
-		delete_from_page_cache(page);
-		unlock_page(page);
-		page_cache_release(page);
-		page = NULL;
+	if (!buffer_mapped(&bh)) {
+		if (vmf->flags & FAULT_FLAG_WRITE) {
+			error = get_block(inode, block, &bh, 1);
+			count_vm_event(PGMAJFAULT);
+			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
+			major = VM_FAULT_MAJOR;
+			if (!error && (bh.b_size < PAGE_SIZE))
+				error = -EIO;
+			if (error)
+				goto unlock_entry;
+		} else {
+			return dax_load_hole(mapping, entry, vmf);
+		}
 	}
 
 	/* Filesystem should not return unwritten buffers to us! */
 	WARN_ON_ONCE(buffer_unwritten(&bh));
-	error = dax_insert_mapping(inode, &bh, vma, vmf);
-
+	error = dax_insert_mapping(mapping, &bh, entry, vma, vmf);
  out:
 	if (error == -ENOMEM)
 		return VM_FAULT_OOM | major;
@@ -683,11 +921,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		return VM_FAULT_SIGBUS | major;
 	return VM_FAULT_NOPAGE | major;
 
- unlock_page:
-	if (page) {
-		unlock_page(page);
-		page_cache_release(page);
-	}
+ unlock_entry:
+	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
 	goto out;
 }
 EXPORT_SYMBOL(__dax_fault);
@@ -976,23 +1211,18 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
 int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct file *file = vma->vm_file;
-	int error;
-
-	/*
-	 * We pass NO_SECTOR to dax_radix_entry() because we expect that a
-	 * RADIX_DAX_PTE entry already exists in the radix tree from a
-	 * previous call to __dax_fault().  We just want to look up that PTE
-	 * entry using vmf->pgoff and make sure the dirty tag is set.  This
-	 * saves us from having to make a call to get_block() here to look
-	 * up the sector.
-	 */
-	error = dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false,
-			true);
+	struct address_space *mapping = file->f_mapping;
+	void *entry;
+	pgoff_t index = vmf->pgoff;
 
-	if (error == -ENOMEM)
-		return VM_FAULT_OOM;
-	if (error)
-		return VM_FAULT_SIGBUS;
+	spin_lock_irq(&mapping->tree_lock);
+	entry = lookup_unlocked_mapping_entry(mapping, index, NULL);
+	if (!entry || !radix_tree_exceptional_entry(entry))
+		goto out;
+	radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
+	put_unlocked_mapping_entry(mapping, index, entry);
+out:
+	spin_unlock_irq(&mapping->tree_lock);
 	return VM_FAULT_NOPAGE;
 }
 EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index fd28d824254b..da2416d916e6 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -18,6 +18,7 @@ int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
+int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 
 #ifdef CONFIG_FS_DAX
 struct page *read_dax_sector(struct block_device *bdev, sector_t n);
diff --git a/mm/truncate.c b/mm/truncate.c
index 7598b552ae03..a38d87688012 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -34,40 +34,38 @@ static void clear_exceptional_entry(struct address_space *mapping,
 	if (shmem_mapping(mapping))
 		return;
 
-	spin_lock_irq(&mapping->tree_lock);
-
 	if (dax_mapping(mapping)) {
-		if (radix_tree_delete_item(&mapping->page_tree, index, entry))
-			mapping->nrexceptional--;
-	} else {
-		/*
-		 * Regular page slots are stabilized by the page lock even
-		 * without the tree itself locked.  These unlocked entries
-		 * need verification under the tree lock.
-		 */
-		if (!__radix_tree_lookup(&mapping->page_tree, index, &node,
-					&slot))
-			goto unlock;
-		if (*slot != entry)
-			goto unlock;
-		radix_tree_replace_slot(slot, NULL);
-		mapping->nrexceptional--;
-		if (!node)
-			goto unlock;
-		workingset_node_shadows_dec(node);
-		/*
-		 * Don't track node without shadow entries.
-		 *
-		 * Avoid acquiring the list_lru lock if already untracked.
-		 * The list_empty() test is safe as node->private_list is
-		 * protected by mapping->tree_lock.
-		 */
-		if (!workingset_node_shadows(node) &&
-		    !list_empty(&node->private_list))
-			list_lru_del(&workingset_shadow_nodes,
-					&node->private_list);
-		__radix_tree_delete_node(&mapping->page_tree, node);
+		dax_delete_mapping_entry(mapping, index);
+		return;
 	}
+	spin_lock_irq(&mapping->tree_lock);
+	/*
+	 * Regular page slots are stabilized by the page lock even
+	 * without the tree itself locked.  These unlocked entries
+	 * need verification under the tree lock.
+	 */
+	if (!__radix_tree_lookup(&mapping->page_tree, index, &node,
+				&slot))
+		goto unlock;
+	if (*slot != entry)
+		goto unlock;
+	radix_tree_replace_slot(slot, NULL);
+	mapping->nrexceptional--;
+	if (!node)
+		goto unlock;
+	workingset_node_shadows_dec(node);
+	/*
+	 * Don't track node without shadow entries.
+	 *
+	 * Avoid acquiring the list_lru lock if already untracked.
+	 * The list_empty() test is safe as node->private_list is
+	 * protected by mapping->tree_lock.
+	 */
+	if (!workingset_node_shadows(node) &&
+	    !list_empty(&node->private_list))
+		list_lru_del(&workingset_shadow_nodes,
+				&node->private_list);
+	__radix_tree_delete_node(&mapping->page_tree, node);
 unlock:
 	spin_unlock_irq(&mapping->tree_lock);
 }
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH 09/10] dax: Use radix tree entry lock to protect cow faults
  2016-03-21 13:22 ` Jan Kara
@ 2016-03-21 13:22   ` Jan Kara
  -1 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-21 13:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Jan Kara, linux-nvdimm, NeilBrown, Wilcox, Matthew R

When handling cow faults, we cannot directly fill in the PTE as we do for
other faults, because we rely on generic code to do proper accounting of
the cowed page. We also have no page to lock to protect against races
with truncate the way other faults do, and the protection must extend
until the moment the generic code inserts the cowed page into the PTE,
at which point the fs-specific i_mmap_sem is no longer held. So far we
relied on i_mmap_lock for this protection; however, that is completely
special to cow faults. To make fault locking more uniform, use the DAX
entry lock instead.
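
In outline the handoff works as sketched below: the DAX fault handler
leaves the radix tree entry locked, returns VM_FAULT_DAX_LOCKED with
vmf->entry set, and the generic cow path drops that lock once it is done
with the PTE. A much simplified sketch of the release side (see the real
do_cow_fault() changes in the diff):

/* Simplified sketch of how do_cow_fault() releases what ->fault locked. */
static void cow_fault_release_sketch(struct vm_area_struct *vma,
				     struct page *fault_page,
				     pgoff_t pgoff, int ret)
{
	if (!(ret & VM_FAULT_DAX_LOCKED)) {
		/* Ordinary page cache page: drop its page lock */
		unlock_page(fault_page);
		page_cache_release(fault_page);
	} else {
		/* DAX handed us a locked radix tree entry instead */
		dax_unlock_mapping_entry(vma->vm_file->f_mapping, pgoff);
	}
}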

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c            | 12 +++++-------
 include/linux/dax.h |  1 +
 include/linux/mm.h  |  7 +++++++
 mm/memory.c         | 38 ++++++++++++++++++--------------------
 4 files changed, 31 insertions(+), 27 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 4fcac59b6dcb..2fcf4e8a17c5 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -469,7 +469,7 @@ restart:
 	return ret;
 }
 
-static void unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
+void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
 	void *ret, **slot;
 	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
@@ -502,7 +502,7 @@ static void put_locked_mapping_entry(struct address_space *mapping,
 		unlock_page(entry);
 		page_cache_release(entry);
 	} else {
-		unlock_mapping_entry(mapping, index);
+		dax_unlock_mapping_entry(mapping, index);
 	}
 }
 
@@ -887,12 +887,10 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			goto unlock_entry;
 		if (!radix_tree_exceptional_entry(entry)) {
 			vmf->page = entry;
-		} else {
-			unlock_mapping_entry(mapping, vmf->pgoff);
-			i_mmap_lock_read(mapping);
-			vmf->page = NULL;
+			return VM_FAULT_LOCKED;
 		}
-		return VM_FAULT_LOCKED;
+		vmf->entry = entry;
+		return VM_FAULT_DAX_LOCKED;
 	}
 
 	if (!buffer_mapped(&bh)) {
diff --git a/include/linux/dax.h b/include/linux/dax.h
index da2416d916e6..29a83a767ea3 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -60,4 +60,5 @@ static inline bool dax_mapping(struct address_space *mapping)
 struct writeback_control;
 int dax_writeback_mapping_range(struct address_space *mapping,
 		struct block_device *bdev, struct writeback_control *wbc);
+void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index);
 #endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 450fc977ed02..1c64039dc505 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -299,6 +299,12 @@ struct vm_fault {
 					 * is set (which is also implied by
 					 * VM_FAULT_ERROR).
 					 */
+	void *entry;			/* ->fault handler can alternatively
+					 * return locked DAX entry. In that
+					 * case handler should return
+					 * VM_FAULT_DAX_LOCKED and fill in
+					 * entry here.
+					 */
 	/* for ->map_pages() only */
 	pgoff_t max_pgoff;		/* map pages for offset from pgoff till
 					 * max_pgoff inclusive */
@@ -1084,6 +1090,7 @@ static inline void clear_page_pfmemalloc(struct page *page)
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
 #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
 #define VM_FAULT_FALLBACK 0x0800	/* huge page fault failed, fall back to small */
+#define VM_FAULT_DAX_LOCKED 0x1000	/* ->fault has locked DAX entry */
 
 #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */
 
diff --git a/mm/memory.c b/mm/memory.c
index 81dca0083fcd..7a704d3cd3b5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -63,6 +63,7 @@
 #include <linux/dma-debug.h>
 #include <linux/debugfs.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/dax.h>
 
 #include <asm/io.h>
 #include <asm/mmu_context.h>
@@ -2783,7 +2784,8 @@ oom:
  */
 static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 			pgoff_t pgoff, unsigned int flags,
-			struct page *cow_page, struct page **page)
+			struct page *cow_page, struct page **page,
+			void **entry)
 {
 	struct vm_fault vmf;
 	int ret;
@@ -2798,8 +2800,10 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 	ret = vma->vm_ops->fault(vma, &vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
-	if (!vmf.page)
-		goto out;
+	if (ret & VM_FAULT_DAX_LOCKED) {
+		*entry = vmf.entry;
+		return ret;
+	}
 
 	if (unlikely(PageHWPoison(vmf.page))) {
 		if (ret & VM_FAULT_LOCKED)
@@ -2813,7 +2817,6 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 	else
 		VM_BUG_ON_PAGE(!PageLocked(vmf.page), vmf.page);
 
- out:
 	*page = vmf.page;
 	return ret;
 }
@@ -2985,7 +2988,7 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte_unmap_unlock(pte, ptl);
 	}
 
-	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
+	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page, NULL);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
@@ -3008,6 +3011,7 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
 {
 	struct page *fault_page, *new_page;
+	void *fault_entry;
 	struct mem_cgroup *memcg;
 	spinlock_t *ptl;
 	pte_t *pte;
@@ -3025,26 +3029,24 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		return VM_FAULT_OOM;
 	}
 
-	ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page);
+	ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page,
+			 &fault_entry);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		goto uncharge_out;
 
-	if (fault_page)
+	if (!(ret & VM_FAULT_DAX_LOCKED))
 		copy_user_highpage(new_page, fault_page, address, vma);
 	__SetPageUptodate(new_page);
 
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (unlikely(!pte_same(*pte, orig_pte))) {
 		pte_unmap_unlock(pte, ptl);
-		if (fault_page) {
+		if (!(ret & VM_FAULT_DAX_LOCKED)) {
 			unlock_page(fault_page);
 			page_cache_release(fault_page);
 		} else {
-			/*
-			 * The fault handler has no page to lock, so it holds
-			 * i_mmap_lock for read to protect against truncate.
-			 */
-			i_mmap_unlock_read(vma->vm_file->f_mapping);
+			dax_unlock_mapping_entry(vma->vm_file->f_mapping,
+						 pgoff);
 		}
 		goto uncharge_out;
 	}
@@ -3052,15 +3054,11 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	mem_cgroup_commit_charge(new_page, memcg, false, false);
 	lru_cache_add_active_or_unevictable(new_page, vma);
 	pte_unmap_unlock(pte, ptl);
-	if (fault_page) {
+	if (!(ret & VM_FAULT_DAX_LOCKED)) {
 		unlock_page(fault_page);
 		page_cache_release(fault_page);
 	} else {
-		/*
-		 * The fault handler has no page to lock, so it holds
-		 * i_mmap_lock for read to protect against truncate.
-		 */
-		i_mmap_unlock_read(vma->vm_file->f_mapping);
+		dax_unlock_mapping_entry(vma->vm_file->f_mapping, pgoff);
 	}
 	return ret;
 uncharge_out:
@@ -3080,7 +3078,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	int dirtied = 0;
 	int ret, tmp;
 
-	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
+	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page, NULL);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
-- 
2.6.2

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH 09/10] dax: Use radix tree entry lock to protect cow faults
@ 2016-03-21 13:22   ` Jan Kara
  0 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-21 13:22 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Wilcox, Matthew R, Ross Zwisler, Dan Williams, linux-nvdimm,
	NeilBrown, Jan Kara

When doing cow faults, we cannot directly fill in the PTE as we do for other
faults because we rely on the generic code to do proper accounting of the
cowed page. We also have no page to lock to protect against races with
truncate as other faults do, and the protection has to extend until the
moment the generic code inserts the cowed page into the PTE; by then the
fs-specific i_mmap_sem no longer protects us. So far we relied on taking
i_mmap_lock for that protection, but it is completely special to cow faults.
To make fault locking more uniform, use the DAX entry lock instead.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c            | 12 +++++-------
 include/linux/dax.h |  1 +
 include/linux/mm.h  |  7 +++++++
 mm/memory.c         | 38 ++++++++++++++++++--------------------
 4 files changed, 31 insertions(+), 27 deletions(-)
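
A condensed sketch of the resulting cow fault flow (error paths and memcg
charging are omitted; the lookup that produces the locked 'entry' happens
earlier in __dax_fault() and is not part of this hunk):

	/* fs/dax.c, __dax_fault(): 'entry' for vmf->pgoff is already locked here */
	if (!radix_tree_exceptional_entry(entry)) {
		vmf->page = entry;		/* hole page, handled like a normal fault */
		return VM_FAULT_LOCKED;
	}
	vmf->entry = entry;
	return VM_FAULT_DAX_LOCKED;		/* entry stays locked for the caller */

	/* mm/memory.c, do_cow_fault(): the entry lock is held until the PTE is set */
	ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page,
			 &fault_entry);
	if (!(ret & VM_FAULT_DAX_LOCKED))
		copy_user_highpage(new_page, fault_page, address, vma);
	/* ... charge new_page and install it into the PTE under the PTE lock ... */
	if (!(ret & VM_FAULT_DAX_LOCKED)) {
		unlock_page(fault_page);
		page_cache_release(fault_page);
	} else {
		dax_unlock_mapping_entry(vma->vm_file->f_mapping, pgoff);
	}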

diff --git a/fs/dax.c b/fs/dax.c
index 4fcac59b6dcb..2fcf4e8a17c5 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -469,7 +469,7 @@ restart:
 	return ret;
 }
 
-static void unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
+void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
 	void *ret, **slot;
 	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
@@ -502,7 +502,7 @@ static void put_locked_mapping_entry(struct address_space *mapping,
 		unlock_page(entry);
 		page_cache_release(entry);
 	} else {
-		unlock_mapping_entry(mapping, index);
+		dax_unlock_mapping_entry(mapping, index);
 	}
 }
 
@@ -887,12 +887,10 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			goto unlock_entry;
 		if (!radix_tree_exceptional_entry(entry)) {
 			vmf->page = entry;
-		} else {
-			unlock_mapping_entry(mapping, vmf->pgoff);
-			i_mmap_lock_read(mapping);
-			vmf->page = NULL;
+			return VM_FAULT_LOCKED;
 		}
-		return VM_FAULT_LOCKED;
+		vmf->entry = entry;
+		return VM_FAULT_DAX_LOCKED;
 	}
 
 	if (!buffer_mapped(&bh)) {
diff --git a/include/linux/dax.h b/include/linux/dax.h
index da2416d916e6..29a83a767ea3 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -60,4 +60,5 @@ static inline bool dax_mapping(struct address_space *mapping)
 struct writeback_control;
 int dax_writeback_mapping_range(struct address_space *mapping,
 		struct block_device *bdev, struct writeback_control *wbc);
+void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index);
 #endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 450fc977ed02..1c64039dc505 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -299,6 +299,12 @@ struct vm_fault {
 					 * is set (which is also implied by
 					 * VM_FAULT_ERROR).
 					 */
+	void *entry;			/* ->fault handler can alternatively
+					 * return locked DAX entry. In that
+					 * case handler should return
+					 * VM_FAULT_DAX_LOCKED and fill in
+					 * entry here.
+					 */
 	/* for ->map_pages() only */
 	pgoff_t max_pgoff;		/* map pages for offset from pgoff till
 					 * max_pgoff inclusive */
@@ -1084,6 +1090,7 @@ static inline void clear_page_pfmemalloc(struct page *page)
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
 #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
 #define VM_FAULT_FALLBACK 0x0800	/* huge page fault failed, fall back to small */
+#define VM_FAULT_DAX_LOCKED 0x1000	/* ->fault has locked DAX entry */
 
 #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */
 
diff --git a/mm/memory.c b/mm/memory.c
index 81dca0083fcd..7a704d3cd3b5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -63,6 +63,7 @@
 #include <linux/dma-debug.h>
 #include <linux/debugfs.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/dax.h>
 
 #include <asm/io.h>
 #include <asm/mmu_context.h>
@@ -2783,7 +2784,8 @@ oom:
  */
 static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 			pgoff_t pgoff, unsigned int flags,
-			struct page *cow_page, struct page **page)
+			struct page *cow_page, struct page **page,
+			void **entry)
 {
 	struct vm_fault vmf;
 	int ret;
@@ -2798,8 +2800,10 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 	ret = vma->vm_ops->fault(vma, &vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
-	if (!vmf.page)
-		goto out;
+	if (ret & VM_FAULT_DAX_LOCKED) {
+		*entry = vmf.entry;
+		return ret;
+	}
 
 	if (unlikely(PageHWPoison(vmf.page))) {
 		if (ret & VM_FAULT_LOCKED)
@@ -2813,7 +2817,6 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 	else
 		VM_BUG_ON_PAGE(!PageLocked(vmf.page), vmf.page);
 
- out:
 	*page = vmf.page;
 	return ret;
 }
@@ -2985,7 +2988,7 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte_unmap_unlock(pte, ptl);
 	}
 
-	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
+	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page, NULL);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
@@ -3008,6 +3011,7 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
 {
 	struct page *fault_page, *new_page;
+	void *fault_entry;
 	struct mem_cgroup *memcg;
 	spinlock_t *ptl;
 	pte_t *pte;
@@ -3025,26 +3029,24 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		return VM_FAULT_OOM;
 	}
 
-	ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page);
+	ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page,
+			 &fault_entry);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		goto uncharge_out;
 
-	if (fault_page)
+	if (!(ret & VM_FAULT_DAX_LOCKED))
 		copy_user_highpage(new_page, fault_page, address, vma);
 	__SetPageUptodate(new_page);
 
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (unlikely(!pte_same(*pte, orig_pte))) {
 		pte_unmap_unlock(pte, ptl);
-		if (fault_page) {
+		if (!(ret & VM_FAULT_DAX_LOCKED)) {
 			unlock_page(fault_page);
 			page_cache_release(fault_page);
 		} else {
-			/*
-			 * The fault handler has no page to lock, so it holds
-			 * i_mmap_lock for read to protect against truncate.
-			 */
-			i_mmap_unlock_read(vma->vm_file->f_mapping);
+			dax_unlock_mapping_entry(vma->vm_file->f_mapping,
+						 pgoff);
 		}
 		goto uncharge_out;
 	}
@@ -3052,15 +3054,11 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	mem_cgroup_commit_charge(new_page, memcg, false, false);
 	lru_cache_add_active_or_unevictable(new_page, vma);
 	pte_unmap_unlock(pte, ptl);
-	if (fault_page) {
+	if (!(ret & VM_FAULT_DAX_LOCKED)) {
 		unlock_page(fault_page);
 		page_cache_release(fault_page);
 	} else {
-		/*
-		 * The fault handler has no page to lock, so it holds
-		 * i_mmap_lock for read to protect against truncate.
-		 */
-		i_mmap_unlock_read(vma->vm_file->f_mapping);
+		dax_unlock_mapping_entry(vma->vm_file->f_mapping, pgoff);
 	}
 	return ret;
 uncharge_out:
@@ -3080,7 +3078,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	int dirtied = 0;
 	int ret, tmp;
 
-	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
+	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page, NULL);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH 10/10] dax: Remove i_mmap_lock protection
  2016-03-21 13:22 ` Jan Kara
@ 2016-03-21 13:22   ` Jan Kara
  -1 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-21 13:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Jan Kara, linux-nvdimm, NeilBrown, Wilcox, Matthew R

Currently faults are protected against truncate by the filesystem-specific
i_mmap_sem and, in the case of a hole page, by the page lock. Cow faults are
protected by DAX radix tree entry locking. So there is no need for
i_mmap_lock in the DAX code. Remove it.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 7 -------
 1 file changed, 7 deletions(-)
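
To restate the changelog, the protection scheme after this series is:

	/*
	 * Truncate vs. page fault serialisation after this series:
	 *
	 *   faults in general - filesystem-specific i_mmap_sem
	 *   hole page faults  - additionally the hole page's page lock
	 *   cow faults        - the locked DAX radix tree entry, held until
	 *                       the generic code installs the cowed page
	 *
	 * hence the i_mmap_lock read locking in fs/dax.c can go.
	 */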

diff --git a/fs/dax.c b/fs/dax.c
index 2fcf4e8a17c5..a2a370db59b7 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -795,8 +795,6 @@ static int dax_insert_mapping(struct address_space *mapping,
 	int error;
 	void *ret;
 
-	i_mmap_lock_read(mapping);
-
 	if (dax_map_atomic(bdev, &dax) < 0) {
 		error = PTR_ERR(dax.addr);
 		goto out;
@@ -821,7 +819,6 @@ static int dax_insert_mapping(struct address_space *mapping,
 
 	error = vm_insert_mixed(vma, vaddr, dax.pfn);
  out:
-	i_mmap_unlock_read(mapping);
 	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
 	return error;
 }
@@ -1063,8 +1060,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		truncate_pagecache_range(inode, lstart, lend);
 	}
 
-	i_mmap_lock_read(mapping);
-
 	if (!write && !buffer_mapped(&bh) && buffer_uptodate(&bh)) {
 		spinlock_t *ptl;
 		pmd_t entry;
@@ -1162,8 +1157,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	}
 
  out:
-	i_mmap_unlock_read(mapping);
-
 	return result;
 
  fallback:
-- 
2.6.2

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH 10/10] dax: Remove i_mmap_lock protection
@ 2016-03-21 13:22   ` Jan Kara
  0 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-21 13:22 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Wilcox, Matthew R, Ross Zwisler, Dan Williams, linux-nvdimm,
	NeilBrown, Jan Kara

Currently faults are protected against truncate by the filesystem-specific
i_mmap_sem and, in the case of a hole page, by the page lock. Cow faults are
protected by DAX radix tree entry locking. So there is no need for
i_mmap_lock in the DAX code. Remove it.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 2fcf4e8a17c5..a2a370db59b7 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -795,8 +795,6 @@ static int dax_insert_mapping(struct address_space *mapping,
 	int error;
 	void *ret;
 
-	i_mmap_lock_read(mapping);
-
 	if (dax_map_atomic(bdev, &dax) < 0) {
 		error = PTR_ERR(dax.addr);
 		goto out;
@@ -821,7 +819,6 @@ static int dax_insert_mapping(struct address_space *mapping,
 
 	error = vm_insert_mixed(vma, vaddr, dax.pfn);
  out:
-	i_mmap_unlock_read(mapping);
 	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
 	return error;
 }
@@ -1063,8 +1060,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		truncate_pagecache_range(inode, lstart, lend);
 	}
 
-	i_mmap_lock_read(mapping);
-
 	if (!write && !buffer_mapped(&bh) && buffer_uptodate(&bh)) {
 		spinlock_t *ptl;
 		pmd_t entry;
@@ -1162,8 +1157,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	}
 
  out:
-	i_mmap_unlock_read(mapping);
-
 	return result;
 
  fallback:
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [PATCH 01/10] DAX: move RADIX_DAX_ definitions to dax.c
  2016-03-21 13:22   ` Jan Kara
@ 2016-03-21 17:25     ` Matthew Wilcox
  -1 siblings, 0 replies; 88+ messages in thread
From: Matthew Wilcox @ 2016-03-21 17:25 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Wilcox,

On Mon, Mar 21, 2016 at 02:22:46PM +0100, Jan Kara wrote:
> These don't belong in radix-tree.c any more than PAGECACHE_TAG_* do.
> Let's try to maintain the idea that radix-tree simply implements an
> abstract data type.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Jan Kara <jack@suse.cz>

Reviewed-by: Matthew Wilcox <willy@linux.intel.com>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 01/10] DAX: move RADIX_DAX_ definitions to dax.c
@ 2016-03-21 17:25     ` Matthew Wilcox
  0 siblings, 0 replies; 88+ messages in thread
From: Matthew Wilcox @ 2016-03-21 17:25 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-nvdimm, NeilBrown, Wilcox, Matthew R

On Mon, Mar 21, 2016 at 02:22:46PM +0100, Jan Kara wrote:
> These don't belong in radix-tree.c any more than PAGECACHE_TAG_* do.
> Let's try to maintain the idea that radix-tree simply implements an
> abstract data type.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Jan Kara <jack@suse.cz>

Reviewed-by: Matthew Wilcox <willy@linux.intel.com>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 02/10] radix-tree: make 'indirect' bit available to exception entries.
  2016-03-21 13:22   ` Jan Kara
@ 2016-03-21 17:34     ` Matthew Wilcox
  -1 siblings, 0 replies; 88+ messages in thread
From: Matthew Wilcox @ 2016-03-21 17:34 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Wilcox,

On Mon, Mar 21, 2016 at 02:22:47PM +0100, Jan Kara wrote:
> A pointer to a radix_tree_node will always have the 'exception'
> bit cleared, so if the exception bit is set the value cannot
> be an indirect pointer.  Thus it is safe to make the 'indirect bit'
> available to store extra information in exception entries.
> 
> This patch adds a 'PTR_MASK' and a value is only treated as
> an indirect (pointer) entry the 2 ls-bits are '01'.

Nitpick: it's called INDIRECT_MASK, not PTR_MASK.

> The change in radix-tree.c ensures the stored value still looks like an
> indirect pointer, and saves a load as well.
> 
> We could swap the two bits and so keep all the exectional bits contigious.

typo "exceptional"

> But I have other plans for that bit....
> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  include/linux/radix-tree.h | 11 +++++++++--
>  lib/radix-tree.c           |  2 +-
>  2 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index d08d6ec3bf53..2bc8c5829441 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -41,8 +41,13 @@
>   * Indirect pointer in fact is also used to tag the last pointer of a node
>   * when it is shrunk, before we rcu free the node. See shrink code for
>   * details.
> + *
> + * To allow an exception entry to only lose one bit, we ignore
> + * the INDIRECT bit when the exception bit is set.  So an entry is
> + * indirect if the least significant 2 bits are 01.
>   */
>  #define RADIX_TREE_INDIRECT_PTR		1
> +#define RADIX_TREE_INDIRECT_MASK	3
>  /*
>   * A common use of the radix tree is to store pointers to struct pages;
>   * but shmem/tmpfs needs also to store swap entries in the same tree:
> @@ -54,7 +59,8 @@
>  
>  static inline int radix_tree_is_indirect_ptr(void *ptr)
>  {
> -	return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR);
> +	return ((unsigned long)ptr & RADIX_TREE_INDIRECT_MASK)
> +		== RADIX_TREE_INDIRECT_PTR;
>  }
>  
>  /*** radix-tree API starts here ***/
> @@ -222,7 +228,8 @@ static inline void *radix_tree_deref_slot_protected(void **pslot,
>   */
>  static inline int radix_tree_deref_retry(void *arg)
>  {
> -	return unlikely((unsigned long)arg & RADIX_TREE_INDIRECT_PTR);
> +	return unlikely(((unsigned long)arg & RADIX_TREE_INDIRECT_MASK)
> +			== RADIX_TREE_INDIRECT_PTR);
>  }
>  
>  /**
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index 1624c4117961..c6af1a445b67 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -1412,7 +1412,7 @@ static inline void radix_tree_shrink(struct radix_tree_root *root)
>  		 * to force callers to retry.
>  		 */
>  		if (root->height == 0)
> -			*((unsigned long *)&to_free->slots[0]) |=
> +			*((unsigned long *)&to_free->slots[0]) =
>  						RADIX_TREE_INDIRECT_PTR;

I have a patch currently in my tree which has the same effect, but looks
a little neater:

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index b77c31c..06dfed5 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -70,6 +70,8 @@ struct radix_tree_preload {
 };
 static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
 
+#define RADIX_TREE_RETRY       ((void *)1)
+
 static inline void *ptr_to_indirect(void *ptr)
 {
        return (void *)((unsigned long)ptr | RADIX_TREE_INDIRECT_PTR);
@@ -934,7 +936,7 @@ restart:
                }
 
                slot = rcu_dereference_raw(node->slots[offset]);
-               if (slot == NULL)
+               if ((slot == NULL) || (slot == RADIX_TREE_RETRY))
                        goto restart;
                offset = follow_sibling(node, &slot, offset);
                if (!radix_tree_is_indirect_ptr(slot))
@@ -1443,8 +1455,7 @@ static inline void radix_tree_shrink(struct radix_tree_root *root)
                 * to force callers to retry.
                 */
                if (!radix_tree_is_indirect_ptr(slot))
-                       *((unsigned long *)&to_free->slots[0]) |=
-                                               RADIX_TREE_INDIRECT_PTR;
+                       to_free->slots[0] = RADIX_TREE_RETRY;
 
                radix_tree_node_free(to_free);
        }

What do you think to doing it this way?

It might be slightly neater to replace the first hunk with this:

#define RADIX_TREE_RETRY       ((void *)RADIX_TREE_INDIRECT_PTR)

I also considered putting that define in radix-tree.h instead of
radix-tree.c, but on the whole I don't think it'll be useful outside
radix-tree.c.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [PATCH 02/10] radix-tree: make 'indirect' bit available to exception entries.
@ 2016-03-21 17:34     ` Matthew Wilcox
  0 siblings, 0 replies; 88+ messages in thread
From: Matthew Wilcox @ 2016-03-21 17:34 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-nvdimm, NeilBrown, Wilcox, Matthew R

On Mon, Mar 21, 2016 at 02:22:47PM +0100, Jan Kara wrote:
> A pointer to a radix_tree_node will always have the 'exception'
> bit cleared, so if the exception bit is set the value cannot
> be an indirect pointer.  Thus it is safe to make the 'indirect bit'
> available to store extra information in exception entries.
> 
> This patch adds a 'PTR_MASK' and a value is only treated as
> an indirect (pointer) entry the 2 ls-bits are '01'.

Nitpick: it's called INDIRECT_MASK, not PTR_MASK.

> The change in radix-tree.c ensures the stored value still looks like an
> indirect pointer, and saves a load as well.
> 
> We could swap the two bits and so keep all the exectional bits contigious.

typo "exceptional"

> But I have other plans for that bit....
> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  include/linux/radix-tree.h | 11 +++++++++--
>  lib/radix-tree.c           |  2 +-
>  2 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index d08d6ec3bf53..2bc8c5829441 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -41,8 +41,13 @@
>   * Indirect pointer in fact is also used to tag the last pointer of a node
>   * when it is shrunk, before we rcu free the node. See shrink code for
>   * details.
> + *
> + * To allow an exception entry to only lose one bit, we ignore
> + * the INDIRECT bit when the exception bit is set.  So an entry is
> + * indirect if the least significant 2 bits are 01.
>   */
>  #define RADIX_TREE_INDIRECT_PTR		1
> +#define RADIX_TREE_INDIRECT_MASK	3
>  /*
>   * A common use of the radix tree is to store pointers to struct pages;
>   * but shmem/tmpfs needs also to store swap entries in the same tree:
> @@ -54,7 +59,8 @@
>  
>  static inline int radix_tree_is_indirect_ptr(void *ptr)
>  {
> -	return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR);
> +	return ((unsigned long)ptr & RADIX_TREE_INDIRECT_MASK)
> +		== RADIX_TREE_INDIRECT_PTR;
>  }
>  
>  /*** radix-tree API starts here ***/
> @@ -222,7 +228,8 @@ static inline void *radix_tree_deref_slot_protected(void **pslot,
>   */
>  static inline int radix_tree_deref_retry(void *arg)
>  {
> -	return unlikely((unsigned long)arg & RADIX_TREE_INDIRECT_PTR);
> +	return unlikely(((unsigned long)arg & RADIX_TREE_INDIRECT_MASK)
> +			== RADIX_TREE_INDIRECT_PTR);
>  }
>  
>  /**
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index 1624c4117961..c6af1a445b67 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -1412,7 +1412,7 @@ static inline void radix_tree_shrink(struct radix_tree_root *root)
>  		 * to force callers to retry.
>  		 */
>  		if (root->height == 0)
> -			*((unsigned long *)&to_free->slots[0]) |=
> +			*((unsigned long *)&to_free->slots[0]) =
>  						RADIX_TREE_INDIRECT_PTR;

I have a patch currently in my tree which has the same effect, but looks
a little neater:

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index b77c31c..06dfed5 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -70,6 +70,8 @@ struct radix_tree_preload {
 };
 static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
 
+#define RADIX_TREE_RETRY       ((void *)1)
+
 static inline void *ptr_to_indirect(void *ptr)
 {
        return (void *)((unsigned long)ptr | RADIX_TREE_INDIRECT_PTR);
@@ -934,7 +936,7 @@ restart:
                }
 
                slot = rcu_dereference_raw(node->slots[offset]);
-               if (slot == NULL)
+               if ((slot == NULL) || (slot == RADIX_TREE_RETRY))
                        goto restart;
                offset = follow_sibling(node, &slot, offset);
                if (!radix_tree_is_indirect_ptr(slot))
@@ -1443,8 +1455,7 @@ static inline void radix_tree_shrink(struct radix_tree_root *root)
                 * to force callers to retry.
                 */
                if (!radix_tree_is_indirect_ptr(slot))
-                       *((unsigned long *)&to_free->slots[0]) |=
-                                               RADIX_TREE_INDIRECT_PTR;
+                       to_free->slots[0] = RADIX_TREE_RETRY;
 
                radix_tree_node_free(to_free);
        }

What do you think to doing it this way?

It might be slightly neater to replace the first hunk with this:

#define RADIX_TREE_RETRY       ((void *)RADIX_TREE_INDIRECT_PTR)

I also considered putting that define in radix-tree.h instead of
radix-tree.c, but on the whole I don't think it'll be useful outside
radix-tree.c.

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [RFC v2] [PATCH 0/10] DAX page fault locking
  2016-03-21 13:22 ` Jan Kara
@ 2016-03-21 17:41   ` Matthew Wilcox
  -1 siblings, 0 replies; 88+ messages in thread
From: Matthew Wilcox @ 2016-03-21 17:41 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Wilcox,
	Matthew R  <matthew.r.wilcox@intel.com>,
	NeilBrown <neilb@suse.com>,
	Kirill A. Shutemov, linux-nvdimm

On Mon, Mar 21, 2016 at 02:22:45PM +0100, Jan Kara wrote:
> The basic idea is that we use a bit in an exceptional radix tree entry as
> a lock bit and use it similarly to how page lock is used for normal faults.
> That way we fix races between hole instantiation and read faults of the
> same index. For now I have disabled PMD faults since there the issues with
> page fault locking are even worse. Now that Matthew's multi-order radix tree
> has landed, I can have a look into using that for proper locking of PMD faults
> but first I want normal pages sorted out.

FYI, the multi-order radix tree code that landed is unusably buggy.
Ross and I have been working like madmen for the past three weeks to fix
all of the bugs we've found and not introduce new ones.  The radix tree
test suite has been enormously helpful in this regard, but we're still
finding corner cases (thanks, RCU! ;-)

Our current best effort can be found hiding in
http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/radix-fixes-2016-03-15
but it's for sure not ready for review yet.  I just don't want other
people trying to use the facility and wasting their time.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [RFC v2] [PATCH 0/10] DAX page fault locking
@ 2016-03-21 17:41   ` Matthew Wilcox
  0 siblings, 0 replies; 88+ messages in thread
From: Matthew Wilcox @ 2016-03-21 17:41 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, linux-nvdimm, NeilBrown, Wilcox, Matthew R,
	Kirill A. Shutemov

On Mon, Mar 21, 2016 at 02:22:45PM +0100, Jan Kara wrote:
> The basic idea is that we use a bit in an exceptional radix tree entry as
> a lock bit and use it similarly to how page lock is used for normal faults.
> That way we fix races between hole instantiation and read faults of the
> same index. For now I have disabled PMD faults since there the issues with
> page fault locking are even worse. Now that Matthew's multi-order radix tree
> has landed, I can have a look into using that for proper locking of PMD faults
> but first I want normal pages sorted out.

FYI, the multi-order radix tree code that landed is unusably buggy.
Ross and I have been working like madmen for the past three weeks to fix
all of the bugs we've found and not introduce new ones.  The radix tree
test suite has been enormously helpful in this regard, but we're still
finding corner cases (thanks, RCU! ;-)

Our current best effort can be found hiding in
http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/radix-fixes-2016-03-15
but it's for sure not ready for review yet.  I just don't want other
people trying to use the facility and wasting their time.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 09/10] dax: Use radix tree entry lock to protect cow faults
  2016-03-21 13:22   ` Jan Kara
@ 2016-03-21 19:11     ` Matthew Wilcox
  -1 siblings, 0 replies; 88+ messages in thread
From: Matthew Wilcox @ 2016-03-21 19:11 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Wilcox,
	Matthew R  <matthew.r.wilcox@intel.com>,
	NeilBrown <neilb@suse.com>,
	Kirill A. Shutemov, linux-nvdimm

On Mon, Mar 21, 2016 at 02:22:54PM +0100, Jan Kara wrote:
> When doing cow faults, we cannot directly fill in PTE as we do for other
> faults as we rely on generic code to do proper accounting of the cowed page.
> We also have no page to lock to protect against races with truncate as
> other faults have and we need the protection to extend until the moment
> generic code inserts cowed page into PTE thus at that point we have no
> protection of fs-specific i_mmap_sem. So far we relied on using
> i_mmap_lock for the protection however that is completely special to cow
> faults. To make fault locking more uniform use DAX entry lock instead.

You can also (I believe) delete this lock in mm/memory.c:


        /* DAX uses i_mmap_lock to serialise file truncate vs page fault */
        i_mmap_lock_write(mapping);
        if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
                unmap_mapping_range_tree(&mapping->i_mmap, &details);
        i_mmap_unlock_write(mapping);
}
EXPORT_SYMBOL(unmap_mapping_range);


> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/dax.c            | 12 +++++-------
>  include/linux/dax.h |  1 +
>  include/linux/mm.h  |  7 +++++++
>  mm/memory.c         | 38 ++++++++++++++++++--------------------
>  4 files changed, 31 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 4fcac59b6dcb..2fcf4e8a17c5 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -469,7 +469,7 @@ restart:
>  	return ret;
>  }
>  
> -static void unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
> +void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
>  {
>  	void *ret, **slot;
>  	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> @@ -502,7 +502,7 @@ static void put_locked_mapping_entry(struct address_space *mapping,
>  		unlock_page(entry);
>  		page_cache_release(entry);
>  	} else {
> -		unlock_mapping_entry(mapping, index);
> +		dax_unlock_mapping_entry(mapping, index);
>  	}
>  }
>  
> @@ -887,12 +887,10 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  			goto unlock_entry;
>  		if (!radix_tree_exceptional_entry(entry)) {
>  			vmf->page = entry;
> -		} else {
> -			unlock_mapping_entry(mapping, vmf->pgoff);
> -			i_mmap_lock_read(mapping);
> -			vmf->page = NULL;
> +			return VM_FAULT_LOCKED;
>  		}
> -		return VM_FAULT_LOCKED;
> +		vmf->entry = entry;
> +		return VM_FAULT_DAX_LOCKED;
>  	}
>  
>  	if (!buffer_mapped(&bh)) {
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index da2416d916e6..29a83a767ea3 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -60,4 +60,5 @@ static inline bool dax_mapping(struct address_space *mapping)
>  struct writeback_control;
>  int dax_writeback_mapping_range(struct address_space *mapping,
>  		struct block_device *bdev, struct writeback_control *wbc);
> +void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index);
>  #endif
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 450fc977ed02..1c64039dc505 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -299,6 +299,12 @@ struct vm_fault {
>  					 * is set (which is also implied by
>  					 * VM_FAULT_ERROR).
>  					 */
> +	void *entry;			/* ->fault handler can alternatively
> +					 * return locked DAX entry. In that
> +					 * case handler should return
> +					 * VM_FAULT_DAX_LOCKED and fill in
> +					 * entry here.
> +					 */
>  	/* for ->map_pages() only */
>  	pgoff_t max_pgoff;		/* map pages for offset from pgoff till
>  					 * max_pgoff inclusive */
> @@ -1084,6 +1090,7 @@ static inline void clear_page_pfmemalloc(struct page *page)
>  #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
>  #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
>  #define VM_FAULT_FALLBACK 0x0800	/* huge page fault failed, fall back to small */
> +#define VM_FAULT_DAX_LOCKED 0x1000	/* ->fault has locked DAX entry */
>  
>  #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index 81dca0083fcd..7a704d3cd3b5 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -63,6 +63,7 @@
>  #include <linux/dma-debug.h>
>  #include <linux/debugfs.h>
>  #include <linux/userfaultfd_k.h>
> +#include <linux/dax.h>
>  
>  #include <asm/io.h>
>  #include <asm/mmu_context.h>
> @@ -2783,7 +2784,8 @@ oom:
>   */
>  static int __do_fault(struct vm_area_struct *vma, unsigned long address,
>  			pgoff_t pgoff, unsigned int flags,
> -			struct page *cow_page, struct page **page)
> +			struct page *cow_page, struct page **page,
> +			void **entry)
>  {
>  	struct vm_fault vmf;
>  	int ret;
> @@ -2798,8 +2800,10 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
>  	ret = vma->vm_ops->fault(vma, &vmf);
>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>  		return ret;
> -	if (!vmf.page)
> -		goto out;
> +	if (ret & VM_FAULT_DAX_LOCKED) {
> +		*entry = vmf.entry;
> +		return ret;
> +	}
>  
>  	if (unlikely(PageHWPoison(vmf.page))) {
>  		if (ret & VM_FAULT_LOCKED)
> @@ -2813,7 +2817,6 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
>  	else
>  		VM_BUG_ON_PAGE(!PageLocked(vmf.page), vmf.page);
>  
> - out:
>  	*page = vmf.page;
>  	return ret;
>  }
> @@ -2985,7 +2988,7 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  		pte_unmap_unlock(pte, ptl);
>  	}
>  
> -	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
> +	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page, NULL);
>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>  		return ret;
>  
> @@ -3008,6 +3011,7 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
>  {
>  	struct page *fault_page, *new_page;
> +	void *fault_entry;
>  	struct mem_cgroup *memcg;
>  	spinlock_t *ptl;
>  	pte_t *pte;
> @@ -3025,26 +3029,24 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  		return VM_FAULT_OOM;
>  	}
>  
> -	ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page);
> +	ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page,
> +			 &fault_entry);
>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>  		goto uncharge_out;
>  
> -	if (fault_page)
> +	if (!(ret & VM_FAULT_DAX_LOCKED))
>  		copy_user_highpage(new_page, fault_page, address, vma);
>  	__SetPageUptodate(new_page);
>  
>  	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
>  	if (unlikely(!pte_same(*pte, orig_pte))) {
>  		pte_unmap_unlock(pte, ptl);
> -		if (fault_page) {
> +		if (!(ret & VM_FAULT_DAX_LOCKED)) {
>  			unlock_page(fault_page);
>  			page_cache_release(fault_page);
>  		} else {
> -			/*
> -			 * The fault handler has no page to lock, so it holds
> -			 * i_mmap_lock for read to protect against truncate.
> -			 */
> -			i_mmap_unlock_read(vma->vm_file->f_mapping);
> +			dax_unlock_mapping_entry(vma->vm_file->f_mapping,
> +						 pgoff);
>  		}
>  		goto uncharge_out;
>  	}
> @@ -3052,15 +3054,11 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	mem_cgroup_commit_charge(new_page, memcg, false, false);
>  	lru_cache_add_active_or_unevictable(new_page, vma);
>  	pte_unmap_unlock(pte, ptl);
> -	if (fault_page) {
> +	if (!(ret & VM_FAULT_DAX_LOCKED)) {
>  		unlock_page(fault_page);
>  		page_cache_release(fault_page);
>  	} else {
> -		/*
> -		 * The fault handler has no page to lock, so it holds
> -		 * i_mmap_lock for read to protect against truncate.
> -		 */
> -		i_mmap_unlock_read(vma->vm_file->f_mapping);
> +		dax_unlock_mapping_entry(vma->vm_file->f_mapping, pgoff);
>  	}
>  	return ret;
>  uncharge_out:
> @@ -3080,7 +3078,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	int dirtied = 0;
>  	int ret, tmp;
>  
> -	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
> +	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page, NULL);
>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>  		return ret;
>  
> -- 
> 2.6.2
> 
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 09/10] dax: Use radix tree entry lock to protect cow faults
@ 2016-03-21 19:11     ` Matthew Wilcox
  0 siblings, 0 replies; 88+ messages in thread
From: Matthew Wilcox @ 2016-03-21 19:11 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, linux-nvdimm, NeilBrown, Wilcox, Matthew R,
	Kirill A. Shutemov

On Mon, Mar 21, 2016 at 02:22:54PM +0100, Jan Kara wrote:
> When doing cow faults, we cannot directly fill in PTE as we do for other
> faults as we rely on generic code to do proper accounting of the cowed page.
> We also have no page to lock to protect against races with truncate as
> other faults have and we need the protection to extend until the moment
> generic code inserts cowed page into PTE thus at that point we have no
> protection of fs-specific i_mmap_sem. So far we relied on using
> i_mmap_lock for the protection however that is completely special to cow
> faults. To make fault locking more uniform use DAX entry lock instead.

You can also (I believe) delete this lock in mm/memory.c:


        /* DAX uses i_mmap_lock to serialise file truncate vs page fault */
        i_mmap_lock_write(mapping);
        if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
                unmap_mapping_range_tree(&mapping->i_mmap, &details);
        i_mmap_unlock_write(mapping);
}
EXPORT_SYMBOL(unmap_mapping_range);


> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/dax.c            | 12 +++++-------
>  include/linux/dax.h |  1 +
>  include/linux/mm.h  |  7 +++++++
>  mm/memory.c         | 38 ++++++++++++++++++--------------------
>  4 files changed, 31 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 4fcac59b6dcb..2fcf4e8a17c5 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -469,7 +469,7 @@ restart:
>  	return ret;
>  }
>  
> -static void unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
> +void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
>  {
>  	void *ret, **slot;
>  	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> @@ -502,7 +502,7 @@ static void put_locked_mapping_entry(struct address_space *mapping,
>  		unlock_page(entry);
>  		page_cache_release(entry);
>  	} else {
> -		unlock_mapping_entry(mapping, index);
> +		dax_unlock_mapping_entry(mapping, index);
>  	}
>  }
>  
> @@ -887,12 +887,10 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  			goto unlock_entry;
>  		if (!radix_tree_exceptional_entry(entry)) {
>  			vmf->page = entry;
> -		} else {
> -			unlock_mapping_entry(mapping, vmf->pgoff);
> -			i_mmap_lock_read(mapping);
> -			vmf->page = NULL;
> +			return VM_FAULT_LOCKED;
>  		}
> -		return VM_FAULT_LOCKED;
> +		vmf->entry = entry;
> +		return VM_FAULT_DAX_LOCKED;
>  	}
>  
>  	if (!buffer_mapped(&bh)) {
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index da2416d916e6..29a83a767ea3 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -60,4 +60,5 @@ static inline bool dax_mapping(struct address_space *mapping)
>  struct writeback_control;
>  int dax_writeback_mapping_range(struct address_space *mapping,
>  		struct block_device *bdev, struct writeback_control *wbc);
> +void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index);
>  #endif
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 450fc977ed02..1c64039dc505 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -299,6 +299,12 @@ struct vm_fault {
>  					 * is set (which is also implied by
>  					 * VM_FAULT_ERROR).
>  					 */
> +	void *entry;			/* ->fault handler can alternatively
> +					 * return locked DAX entry. In that
> +					 * case handler should return
> +					 * VM_FAULT_DAX_LOCKED and fill in
> +					 * entry here.
> +					 */
>  	/* for ->map_pages() only */
>  	pgoff_t max_pgoff;		/* map pages for offset from pgoff till
>  					 * max_pgoff inclusive */
> @@ -1084,6 +1090,7 @@ static inline void clear_page_pfmemalloc(struct page *page)
>  #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
>  #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
>  #define VM_FAULT_FALLBACK 0x0800	/* huge page fault failed, fall back to small */
> +#define VM_FAULT_DAX_LOCKED 0x1000	/* ->fault has locked DAX entry */
>  
>  #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index 81dca0083fcd..7a704d3cd3b5 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -63,6 +63,7 @@
>  #include <linux/dma-debug.h>
>  #include <linux/debugfs.h>
>  #include <linux/userfaultfd_k.h>
> +#include <linux/dax.h>
>  
>  #include <asm/io.h>
>  #include <asm/mmu_context.h>
> @@ -2783,7 +2784,8 @@ oom:
>   */
>  static int __do_fault(struct vm_area_struct *vma, unsigned long address,
>  			pgoff_t pgoff, unsigned int flags,
> -			struct page *cow_page, struct page **page)
> +			struct page *cow_page, struct page **page,
> +			void **entry)
>  {
>  	struct vm_fault vmf;
>  	int ret;
> @@ -2798,8 +2800,10 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
>  	ret = vma->vm_ops->fault(vma, &vmf);
>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>  		return ret;
> -	if (!vmf.page)
> -		goto out;
> +	if (ret & VM_FAULT_DAX_LOCKED) {
> +		*entry = vmf.entry;
> +		return ret;
> +	}
>  
>  	if (unlikely(PageHWPoison(vmf.page))) {
>  		if (ret & VM_FAULT_LOCKED)
> @@ -2813,7 +2817,6 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
>  	else
>  		VM_BUG_ON_PAGE(!PageLocked(vmf.page), vmf.page);
>  
> - out:
>  	*page = vmf.page;
>  	return ret;
>  }
> @@ -2985,7 +2988,7 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  		pte_unmap_unlock(pte, ptl);
>  	}
>  
> -	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
> +	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page, NULL);
>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>  		return ret;
>  
> @@ -3008,6 +3011,7 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
>  {
>  	struct page *fault_page, *new_page;
> +	void *fault_entry;
>  	struct mem_cgroup *memcg;
>  	spinlock_t *ptl;
>  	pte_t *pte;
> @@ -3025,26 +3029,24 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  		return VM_FAULT_OOM;
>  	}
>  
> -	ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page);
> +	ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page,
> +			 &fault_entry);
>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>  		goto uncharge_out;
>  
> -	if (fault_page)
> +	if (!(ret & VM_FAULT_DAX_LOCKED))
>  		copy_user_highpage(new_page, fault_page, address, vma);
>  	__SetPageUptodate(new_page);
>  
>  	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
>  	if (unlikely(!pte_same(*pte, orig_pte))) {
>  		pte_unmap_unlock(pte, ptl);
> -		if (fault_page) {
> +		if (!(ret & VM_FAULT_DAX_LOCKED)) {
>  			unlock_page(fault_page);
>  			page_cache_release(fault_page);
>  		} else {
> -			/*
> -			 * The fault handler has no page to lock, so it holds
> -			 * i_mmap_lock for read to protect against truncate.
> -			 */
> -			i_mmap_unlock_read(vma->vm_file->f_mapping);
> +			dax_unlock_mapping_entry(vma->vm_file->f_mapping,
> +						 pgoff);
>  		}
>  		goto uncharge_out;
>  	}
> @@ -3052,15 +3054,11 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	mem_cgroup_commit_charge(new_page, memcg, false, false);
>  	lru_cache_add_active_or_unevictable(new_page, vma);
>  	pte_unmap_unlock(pte, ptl);
> -	if (fault_page) {
> +	if (!(ret & VM_FAULT_DAX_LOCKED)) {
>  		unlock_page(fault_page);
>  		page_cache_release(fault_page);
>  	} else {
> -		/*
> -		 * The fault handler has no page to lock, so it holds
> -		 * i_mmap_lock for read to protect against truncate.
> -		 */
> -		i_mmap_unlock_read(vma->vm_file->f_mapping);
> +		dax_unlock_mapping_entry(vma->vm_file->f_mapping, pgoff);
>  	}
>  	return ret;
>  uncharge_out:
> @@ -3080,7 +3078,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	int dirtied = 0;
>  	int ret, tmp;
>  
> -	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
> +	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page, NULL);
>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>  		return ret;
>  
> -- 
> 2.6.2
> 
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 09/10] dax: Use radix tree entry lock to protect cow faults
  2016-03-21 19:11     ` Matthew Wilcox
@ 2016-03-22  7:03       ` Jan Kara
  -1 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-22  7:03 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jan Kara, linux-nvdimm, NeilBrown, Wilcox, Matthew R,
	linux-fsdevel, Kirill A. Shutemov

On Mon 21-03-16 15:11:06, Matthew Wilcox wrote:
> On Mon, Mar 21, 2016 at 02:22:54PM +0100, Jan Kara wrote:
> > When doing cow faults, we cannot directly fill in PTE as we do for other
> > faults as we rely on generic code to do proper accounting of the cowed page.
> > We also have no page to lock to protect against races with truncate as
> > other faults have and we need the protection to extend until the moment
> > generic code inserts cowed page into PTE thus at that point we have no
> > protection of fs-specific i_mmap_sem. So far we relied on using
> > i_mmap_lock for the protection however that is completely special to cow
> > faults. To make fault locking more uniform use DAX entry lock instead.
> 
> You can also (I believe) delete this lock in mm/memory.c:
> 
> 
>         /* DAX uses i_mmap_lock to serialise file truncate vs page fault */
>         i_mmap_lock_write(mapping);
>         if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
>                 unmap_mapping_range_tree(&mapping->i_mmap, &details);
>         i_mmap_unlock_write(mapping);
> }
> EXPORT_SYMBOL(unmap_mapping_range);

I don't think we can. The i_mmap_lock protection has been there since
pre-DAX times and guards changes of the reverse mapping tree AFAIU. But I
should certainly drop the comment.
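
Something along these lines, i.e. just dropping the stale comment while
keeping the locking itself (sketch only, hunk context abbreviated):

--- a/mm/memory.c
+++ b/mm/memory.c
@@ unmap_mapping_range():
-	/* DAX uses i_mmap_lock to serialise file truncate vs page fault */
 	i_mmap_lock_write(mapping);
 	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
 		unmap_mapping_range_tree(&mapping->i_mmap, &details);
 	i_mmap_unlock_write(mapping);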

								Honza

> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/dax.c            | 12 +++++-------
> >  include/linux/dax.h |  1 +
> >  include/linux/mm.h  |  7 +++++++
> >  mm/memory.c         | 38 ++++++++++++++++++--------------------
> >  4 files changed, 31 insertions(+), 27 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 4fcac59b6dcb..2fcf4e8a17c5 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -469,7 +469,7 @@ restart:
> >  	return ret;
> >  }
> >  
> > -static void unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
> > +void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
> >  {
> >  	void *ret, **slot;
> >  	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> > @@ -502,7 +502,7 @@ static void put_locked_mapping_entry(struct address_space *mapping,
> >  		unlock_page(entry);
> >  		page_cache_release(entry);
> >  	} else {
> > -		unlock_mapping_entry(mapping, index);
> > +		dax_unlock_mapping_entry(mapping, index);
> >  	}
> >  }
> >  
> > @@ -887,12 +887,10 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  			goto unlock_entry;
> >  		if (!radix_tree_exceptional_entry(entry)) {
> >  			vmf->page = entry;
> > -		} else {
> > -			unlock_mapping_entry(mapping, vmf->pgoff);
> > -			i_mmap_lock_read(mapping);
> > -			vmf->page = NULL;
> > +			return VM_FAULT_LOCKED;
> >  		}
> > -		return VM_FAULT_LOCKED;
> > +		vmf->entry = entry;
> > +		return VM_FAULT_DAX_LOCKED;
> >  	}
> >  
> >  	if (!buffer_mapped(&bh)) {
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index da2416d916e6..29a83a767ea3 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -60,4 +60,5 @@ static inline bool dax_mapping(struct address_space *mapping)
> >  struct writeback_control;
> >  int dax_writeback_mapping_range(struct address_space *mapping,
> >  		struct block_device *bdev, struct writeback_control *wbc);
> > +void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index);
> >  #endif
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 450fc977ed02..1c64039dc505 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -299,6 +299,12 @@ struct vm_fault {
> >  					 * is set (which is also implied by
> >  					 * VM_FAULT_ERROR).
> >  					 */
> > +	void *entry;			/* ->fault handler can alternatively
> > +					 * return locked DAX entry. In that
> > +					 * case handler should return
> > +					 * VM_FAULT_DAX_LOCKED and fill in
> > +					 * entry here.
> > +					 */
> >  	/* for ->map_pages() only */
> >  	pgoff_t max_pgoff;		/* map pages for offset from pgoff till
> >  					 * max_pgoff inclusive */
> > @@ -1084,6 +1090,7 @@ static inline void clear_page_pfmemalloc(struct page *page)
> >  #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
> >  #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
> >  #define VM_FAULT_FALLBACK 0x0800	/* huge page fault failed, fall back to small */
> > +#define VM_FAULT_DAX_LOCKED 0x1000	/* ->fault has locked DAX entry */
> >  
> >  #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */
> >  
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 81dca0083fcd..7a704d3cd3b5 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -63,6 +63,7 @@
> >  #include <linux/dma-debug.h>
> >  #include <linux/debugfs.h>
> >  #include <linux/userfaultfd_k.h>
> > +#include <linux/dax.h>
> >  
> >  #include <asm/io.h>
> >  #include <asm/mmu_context.h>
> > @@ -2783,7 +2784,8 @@ oom:
> >   */
> >  static int __do_fault(struct vm_area_struct *vma, unsigned long address,
> >  			pgoff_t pgoff, unsigned int flags,
> > -			struct page *cow_page, struct page **page)
> > +			struct page *cow_page, struct page **page,
> > +			void **entry)
> >  {
> >  	struct vm_fault vmf;
> >  	int ret;
> > @@ -2798,8 +2800,10 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
> >  	ret = vma->vm_ops->fault(vma, &vmf);
> >  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
> >  		return ret;
> > -	if (!vmf.page)
> > -		goto out;
> > +	if (ret & VM_FAULT_DAX_LOCKED) {
> > +		*entry = vmf.entry;
> > +		return ret;
> > +	}
> >  
> >  	if (unlikely(PageHWPoison(vmf.page))) {
> >  		if (ret & VM_FAULT_LOCKED)
> > @@ -2813,7 +2817,6 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
> >  	else
> >  		VM_BUG_ON_PAGE(!PageLocked(vmf.page), vmf.page);
> >  
> > - out:
> >  	*page = vmf.page;
> >  	return ret;
> >  }
> > @@ -2985,7 +2988,7 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  		pte_unmap_unlock(pte, ptl);
> >  	}
> >  
> > -	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
> > +	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page, NULL);
> >  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
> >  		return ret;
> >  
> > @@ -3008,6 +3011,7 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
> >  {
> >  	struct page *fault_page, *new_page;
> > +	void *fault_entry;
> >  	struct mem_cgroup *memcg;
> >  	spinlock_t *ptl;
> >  	pte_t *pte;
> > @@ -3025,26 +3029,24 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  		return VM_FAULT_OOM;
> >  	}
> >  
> > -	ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page);
> > +	ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page,
> > +			 &fault_entry);
> >  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
> >  		goto uncharge_out;
> >  
> > -	if (fault_page)
> > +	if (!(ret & VM_FAULT_DAX_LOCKED))
> >  		copy_user_highpage(new_page, fault_page, address, vma);
> >  	__SetPageUptodate(new_page);
> >  
> >  	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> >  	if (unlikely(!pte_same(*pte, orig_pte))) {
> >  		pte_unmap_unlock(pte, ptl);
> > -		if (fault_page) {
> > +		if (!(ret & VM_FAULT_DAX_LOCKED)) {
> >  			unlock_page(fault_page);
> >  			page_cache_release(fault_page);
> >  		} else {
> > -			/*
> > -			 * The fault handler has no page to lock, so it holds
> > -			 * i_mmap_lock for read to protect against truncate.
> > -			 */
> > -			i_mmap_unlock_read(vma->vm_file->f_mapping);
> > +			dax_unlock_mapping_entry(vma->vm_file->f_mapping,
> > +						 pgoff);
> >  		}
> >  		goto uncharge_out;
> >  	}
> > @@ -3052,15 +3054,11 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  	mem_cgroup_commit_charge(new_page, memcg, false, false);
> >  	lru_cache_add_active_or_unevictable(new_page, vma);
> >  	pte_unmap_unlock(pte, ptl);
> > -	if (fault_page) {
> > +	if (!(ret & VM_FAULT_DAX_LOCKED)) {
> >  		unlock_page(fault_page);
> >  		page_cache_release(fault_page);
> >  	} else {
> > -		/*
> > -		 * The fault handler has no page to lock, so it holds
> > -		 * i_mmap_lock for read to protect against truncate.
> > -		 */
> > -		i_mmap_unlock_read(vma->vm_file->f_mapping);
> > +		dax_unlock_mapping_entry(vma->vm_file->f_mapping, pgoff);
> >  	}
> >  	return ret;
> >  uncharge_out:
> > @@ -3080,7 +3078,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  	int dirtied = 0;
> >  	int ret, tmp;
> >  
> > -	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
> > +	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page, NULL);
> >  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
> >  		return ret;
> >  
> > -- 
> > 2.6.2
> > 
> > _______________________________________________
> > Linux-nvdimm mailing list
> > Linux-nvdimm@lists.01.org
> > https://lists.01.org/mailman/listinfo/linux-nvdimm
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 09/10] dax: Use radix tree entry lock to protect cow faults
@ 2016-03-22  7:03       ` Jan Kara
  0 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-22  7:03 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jan Kara, linux-fsdevel, linux-nvdimm, NeilBrown, Wilcox,
	Matthew R, Kirill A. Shutemov

On Mon 21-03-16 15:11:06, Matthew Wilcox wrote:
> On Mon, Mar 21, 2016 at 02:22:54PM +0100, Jan Kara wrote:
> > When doing cow faults, we cannot directly fill in PTE as we do for other
> > faults as we rely on generic code to do proper accounting of the cowed page.
> > We also have no page to lock to protect against races with truncate as
> > other faults have and we need the protection to extend until the moment
> > generic code inserts cowed page into PTE thus at that point we have no
> > protection of fs-specific i_mmap_sem. So far we relied on using
> > i_mmap_lock for the protection however that is completely special to cow
> > faults. To make fault locking more uniform use DAX entry lock instead.
> 
> You can also (I believe) delete this lock in mm/memory.c:
> 
> 
>         /* DAX uses i_mmap_lock to serialise file truncate vs page fault */
>         i_mmap_lock_write(mapping);
>         if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
>                 unmap_mapping_range_tree(&mapping->i_mmap, &details);
>         i_mmap_unlock_write(mapping);
> }
> EXPORT_SYMBOL(unmap_mapping_range);

I don't think we can. The i_mmap_lock protection there certainly predates
DAX and, AFAIU, guards changes of the reverse mapping tree. But I should
drop the comment.

								Honza

> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/dax.c            | 12 +++++-------
> >  include/linux/dax.h |  1 +
> >  include/linux/mm.h  |  7 +++++++
> >  mm/memory.c         | 38 ++++++++++++++++++--------------------
> >  4 files changed, 31 insertions(+), 27 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 4fcac59b6dcb..2fcf4e8a17c5 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -469,7 +469,7 @@ restart:
> >  	return ret;
> >  }
> >  
> > -static void unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
> > +void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
> >  {
> >  	void *ret, **slot;
> >  	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> > @@ -502,7 +502,7 @@ static void put_locked_mapping_entry(struct address_space *mapping,
> >  		unlock_page(entry);
> >  		page_cache_release(entry);
> >  	} else {
> > -		unlock_mapping_entry(mapping, index);
> > +		dax_unlock_mapping_entry(mapping, index);
> >  	}
> >  }
> >  
> > @@ -887,12 +887,10 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  			goto unlock_entry;
> >  		if (!radix_tree_exceptional_entry(entry)) {
> >  			vmf->page = entry;
> > -		} else {
> > -			unlock_mapping_entry(mapping, vmf->pgoff);
> > -			i_mmap_lock_read(mapping);
> > -			vmf->page = NULL;
> > +			return VM_FAULT_LOCKED;
> >  		}
> > -		return VM_FAULT_LOCKED;
> > +		vmf->entry = entry;
> > +		return VM_FAULT_DAX_LOCKED;
> >  	}
> >  
> >  	if (!buffer_mapped(&bh)) {
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index da2416d916e6..29a83a767ea3 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -60,4 +60,5 @@ static inline bool dax_mapping(struct address_space *mapping)
> >  struct writeback_control;
> >  int dax_writeback_mapping_range(struct address_space *mapping,
> >  		struct block_device *bdev, struct writeback_control *wbc);
> > +void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index);
> >  #endif
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 450fc977ed02..1c64039dc505 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -299,6 +299,12 @@ struct vm_fault {
> >  					 * is set (which is also implied by
> >  					 * VM_FAULT_ERROR).
> >  					 */
> > +	void *entry;			/* ->fault handler can alternatively
> > +					 * return locked DAX entry. In that
> > +					 * case handler should return
> > +					 * VM_FAULT_DAX_LOCKED and fill in
> > +					 * entry here.
> > +					 */
> >  	/* for ->map_pages() only */
> >  	pgoff_t max_pgoff;		/* map pages for offset from pgoff till
> >  					 * max_pgoff inclusive */
> > @@ -1084,6 +1090,7 @@ static inline void clear_page_pfmemalloc(struct page *page)
> >  #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
> >  #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
> >  #define VM_FAULT_FALLBACK 0x0800	/* huge page fault failed, fall back to small */
> > +#define VM_FAULT_DAX_LOCKED 0x1000	/* ->fault has locked DAX entry */
> >  
> >  #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */
> >  
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 81dca0083fcd..7a704d3cd3b5 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -63,6 +63,7 @@
> >  #include <linux/dma-debug.h>
> >  #include <linux/debugfs.h>
> >  #include <linux/userfaultfd_k.h>
> > +#include <linux/dax.h>
> >  
> >  #include <asm/io.h>
> >  #include <asm/mmu_context.h>
> > @@ -2783,7 +2784,8 @@ oom:
> >   */
> >  static int __do_fault(struct vm_area_struct *vma, unsigned long address,
> >  			pgoff_t pgoff, unsigned int flags,
> > -			struct page *cow_page, struct page **page)
> > +			struct page *cow_page, struct page **page,
> > +			void **entry)
> >  {
> >  	struct vm_fault vmf;
> >  	int ret;
> > @@ -2798,8 +2800,10 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
> >  	ret = vma->vm_ops->fault(vma, &vmf);
> >  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
> >  		return ret;
> > -	if (!vmf.page)
> > -		goto out;
> > +	if (ret & VM_FAULT_DAX_LOCKED) {
> > +		*entry = vmf.entry;
> > +		return ret;
> > +	}
> >  
> >  	if (unlikely(PageHWPoison(vmf.page))) {
> >  		if (ret & VM_FAULT_LOCKED)
> > @@ -2813,7 +2817,6 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
> >  	else
> >  		VM_BUG_ON_PAGE(!PageLocked(vmf.page), vmf.page);
> >  
> > - out:
> >  	*page = vmf.page;
> >  	return ret;
> >  }
> > @@ -2985,7 +2988,7 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  		pte_unmap_unlock(pte, ptl);
> >  	}
> >  
> > -	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
> > +	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page, NULL);
> >  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
> >  		return ret;
> >  
> > @@ -3008,6 +3011,7 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
> >  {
> >  	struct page *fault_page, *new_page;
> > +	void *fault_entry;
> >  	struct mem_cgroup *memcg;
> >  	spinlock_t *ptl;
> >  	pte_t *pte;
> > @@ -3025,26 +3029,24 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  		return VM_FAULT_OOM;
> >  	}
> >  
> > -	ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page);
> > +	ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page,
> > +			 &fault_entry);
> >  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
> >  		goto uncharge_out;
> >  
> > -	if (fault_page)
> > +	if (!(ret & VM_FAULT_DAX_LOCKED))
> >  		copy_user_highpage(new_page, fault_page, address, vma);
> >  	__SetPageUptodate(new_page);
> >  
> >  	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> >  	if (unlikely(!pte_same(*pte, orig_pte))) {
> >  		pte_unmap_unlock(pte, ptl);
> > -		if (fault_page) {
> > +		if (!(ret & VM_FAULT_DAX_LOCKED)) {
> >  			unlock_page(fault_page);
> >  			page_cache_release(fault_page);
> >  		} else {
> > -			/*
> > -			 * The fault handler has no page to lock, so it holds
> > -			 * i_mmap_lock for read to protect against truncate.
> > -			 */
> > -			i_mmap_unlock_read(vma->vm_file->f_mapping);
> > +			dax_unlock_mapping_entry(vma->vm_file->f_mapping,
> > +						 pgoff);
> >  		}
> >  		goto uncharge_out;
> >  	}
> > @@ -3052,15 +3054,11 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  	mem_cgroup_commit_charge(new_page, memcg, false, false);
> >  	lru_cache_add_active_or_unevictable(new_page, vma);
> >  	pte_unmap_unlock(pte, ptl);
> > -	if (fault_page) {
> > +	if (!(ret & VM_FAULT_DAX_LOCKED)) {
> >  		unlock_page(fault_page);
> >  		page_cache_release(fault_page);
> >  	} else {
> > -		/*
> > -		 * The fault handler has no page to lock, so it holds
> > -		 * i_mmap_lock for read to protect against truncate.
> > -		 */
> > -		i_mmap_unlock_read(vma->vm_file->f_mapping);
> > +		dax_unlock_mapping_entry(vma->vm_file->f_mapping, pgoff);
> >  	}
> >  	return ret;
> >  uncharge_out:
> > @@ -3080,7 +3078,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  	int dirtied = 0;
> >  	int ret, tmp;
> >  
> > -	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
> > +	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page, NULL);
> >  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
> >  		return ret;
> >  
> > -- 
> > 2.6.2
> > 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 02/10] radix-tree: make 'indirect' bit available to exception entries.
  2016-03-21 17:34     ` Matthew Wilcox
@ 2016-03-22  9:12       ` Jan Kara
  -1 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-22  9:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, Wilcox,
	Matthew R  <matthew.r.wilcox@intel.com>,
	Jan Kara <jack@suse.cz>,
	NeilBrown, linux-nvdimm

On Mon 21-03-16 13:34:58, Matthew Wilcox wrote:
> I have a patch currently in my tree which has the same effect, but looks
> a little neater:
> 
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index b77c31c..06dfed5 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -70,6 +70,8 @@ struct radix_tree_preload {
>  };
>  static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
>  
> +#define RADIX_TREE_RETRY       ((void *)1)
> +
>  static inline void *ptr_to_indirect(void *ptr)
>  {
>         return (void *)((unsigned long)ptr | RADIX_TREE_INDIRECT_PTR);
> @@ -934,7 +936,7 @@ restart:
>                 }
>  
>                 slot = rcu_dereference_raw(node->slots[offset]);
> -               if (slot == NULL)
> +               if ((slot == NULL) || (slot == RADIX_TREE_RETRY))
>                         goto restart;
>                 offset = follow_sibling(node, &slot, offset);
>                 if (!radix_tree_is_indirect_ptr(slot))
> @@ -1443,8 +1455,7 @@ static inline void radix_tree_shrink(struct radix_tree_root *root)
>                  * to force callers to retry.
>                  */
>                 if (!radix_tree_is_indirect_ptr(slot))
> -                       *((unsigned long *)&to_free->slots[0]) |=
> -                                               RADIX_TREE_INDIRECT_PTR;
> +                       to_free->slots[0] = RADIX_TREE_RETRY;
>  
>                 radix_tree_node_free(to_free);
>         }
> 
> What do you think to doing it this way?
> 
> It might be slightly neater to replace the first hunk with this:
> 
> #define RADIX_TREE_RETRY       ((void *)RADIX_TREE_INDIRECT_PTR)
> 
> I also considered putting that define in radix-tree.h instead of
> radix-tree.c, but on the whole I don't think it'll be useful outside
> radix-tree.h.

So after spending over an hour reading the radix tree code back and forth
(and also digging into historical versions where stuff is easier to
understand) I think I can finally fully appreciate the subtlety of the retry
logic ;). And actually now I think that Neil's variant is buggy because in
his case radix_tree_lookup() could return NULL if it raced with
radix_tree_shrink() for index 0, although there is a valid entry at index 0.
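
To spell out the interleaving I have in mind (a sketch of one possible race
only - I'm assuming here that Neil's variant clears the slot instead of
leaving a retry marker in it):

	CPU0: radix_tree_lookup(root, 0)        CPU1: radix_tree_shrink(root)
	--------------------------------        -----------------------------
	node = rcu_dereference(root->rnode);
	/* node is the height-1 node */
	                                        root->rnode = node->slots[0];
	                                        node->slots[0] = NULL; /* assumed */
	                                        radix_tree_node_free(node);
	slot = node->slots[0];   /* sees NULL */
	return NULL;             /* although index 0 still has a valid entry */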

Your variant actually doesn't make things much better. See e.g.
mm/filemap.c: find_get_entry()

	pagep = radix_tree_lookup_slot(&mapping->page_tree, offset);
	// pagep points to node that is under RCU freeing
	if (pagep) {
		page = radix_tree_deref_slot(pagep);
		if (unlikely(!page))	// False since
					// RADIX_TREE_INDIRECT_PTR is set
		if (radix_tree_exception(page))	// False - no exceptional bit
		if (!page_cache_get_speculative(page)) // oops...

What we need to do is either make all radix_tree_deref_slot() callers check
the return value immediately with something like radix_tree_deref_retry()
(but that is still prone to hard-to-debug bugs when someone forgets to call
radix_tree_deref_retry() somewhere) or just give up on the idea of using the
INDIRECT bit in exceptional entries.
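
The caller-side check I mean would look roughly like this (just a sketch of
the pattern, not a tested patch):

	rcu_read_lock();
repeat:
	pagep = radix_tree_lookup_slot(&mapping->page_tree, offset);
	if (pagep) {
		page = radix_tree_deref_slot(pagep);
		if (unlikely(!page))
			goto out;
		if (radix_tree_deref_retry(page))
			/* slot belongs to a node being shrunk / freed, retry */
			goto repeat;
		if (!page_cache_get_speculative(page))
			goto repeat;
		...
	}
out:
	rcu_read_unlock();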

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 02/10] radix-tree: make 'indirect' bit available to exception entries.
@ 2016-03-22  9:12       ` Jan Kara
  0 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-22  9:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jan Kara, linux-fsdevel, linux-nvdimm, NeilBrown, Wilcox, Matthew R

On Mon 21-03-16 13:34:58, Matthew Wilcox wrote:
> I have a patch currently in my tree which has the same effect, but looks
> a little neater:
> 
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index b77c31c..06dfed5 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -70,6 +70,8 @@ struct radix_tree_preload {
>  };
>  static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
>  
> +#define RADIX_TREE_RETRY       ((void *)1)
> +
>  static inline void *ptr_to_indirect(void *ptr)
>  {
>         return (void *)((unsigned long)ptr | RADIX_TREE_INDIRECT_PTR);
> @@ -934,7 +936,7 @@ restart:
>                 }
>  
>                 slot = rcu_dereference_raw(node->slots[offset]);
> -               if (slot == NULL)
> +               if ((slot == NULL) || (slot == RADIX_TREE_RETRY))
>                         goto restart;
>                 offset = follow_sibling(node, &slot, offset);
>                 if (!radix_tree_is_indirect_ptr(slot))
> @@ -1443,8 +1455,7 @@ static inline void radix_tree_shrink(struct radix_tree_root *root)
>                  * to force callers to retry.
>                  */
>                 if (!radix_tree_is_indirect_ptr(slot))
> -                       *((unsigned long *)&to_free->slots[0]) |=
> -                                               RADIX_TREE_INDIRECT_PTR;
> +                       to_free->slots[0] = RADIX_TREE_RETRY;
>  
>                 radix_tree_node_free(to_free);
>         }
> 
> What do you think to doing it this way?
> 
> It might be slightly neater to replace the first hunk with this:
> 
> #define RADIX_TREE_RETRY       ((void *)RADIX_TREE_INDIRECT_PTR)
> 
> I also considered putting that define in radix-tree.h instead of
> radix-tree.c, but on the whole I don't think it'll be useful outside
> radix-tree.h.

So after spending over an hour reading the radix tree code back and forth
(and also digging into historical versions where stuff is easier to
understand) I think I can finally fully appreciate the subtlety of the retry
logic ;). And actually now I think that Neil's variant is buggy because in
his case radix_tree_lookup() could return NULL if it raced with
radix_tree_shrink() for index 0, although there is a valid entry at index 0.

Your variant actually doesn't make things much better. See e.g.
mm/filemap.c: find_get_entry()

	pagep = radix_tree_lookup_slot(&mapping->page_tree, offset);
	// pagep points to node that is under RCU freeing
	if (pagep) {
		page = radix_tree_deref_slot(pagep);
		if (unlikely(!page))	// False since
					// RADIX_TREE_INDIRECT_PTR is set
		if (radix_tree_exception(page))	// False - no exceptional bit
		if (!page_cache_get_speculative(page)) // oops...

What we need to do is either make all radix_tree_deref_slot() callers check
the return value immediately with something like radix_tree_deref_retry()
(but that is still prone to hard-to-debug bugs when someone forgets to call
radix_tree_deref_retry() somewhere) or just give up on the idea of using the
INDIRECT bit in exceptional entries.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 02/10] radix-tree: make 'indirect' bit available to exception entries.
  2016-03-22  9:12       ` Jan Kara
@ 2016-03-22  9:27         ` Matthew Wilcox
  -1 siblings, 0 replies; 88+ messages in thread
From: Matthew Wilcox @ 2016-03-22  9:27 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Wilcox,

On Tue, Mar 22, 2016 at 10:12:32AM +0100, Jan Kara wrote:
> 		if (unlikely(!page))	// False since
> 					// RADIX_TREE_INDIRECT_PTR is set
> 		if (radix_tree_exception(page))	// False - no exceptional bit

Oops, you got confused:

static inline int radix_tree_exception(void *arg)
{
        return unlikely((unsigned long)arg &
                (RADIX_TREE_INDIRECT_PTR | RADIX_TREE_EXCEPTIONAL_ENTRY));
}


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 02/10] radix-tree: make 'indirect' bit available to exception entries.
@ 2016-03-22  9:27         ` Matthew Wilcox
  0 siblings, 0 replies; 88+ messages in thread
From: Matthew Wilcox @ 2016-03-22  9:27 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-nvdimm, NeilBrown, Wilcox, Matthew R

On Tue, Mar 22, 2016 at 10:12:32AM +0100, Jan Kara wrote:
> 		if (unlikely(!page))	// False since
> 					// RADIX_TREE_INDIRECT_PTR is set
> 		if (radix_tree_exception(page))	// False - no exceptional bit

Oops, you got confused:

static inline int radix_tree_exception(void *arg)
{
        return unlikely((unsigned long)arg &
                (RADIX_TREE_INDIRECT_PTR | RADIX_TREE_EXCEPTIONAL_ENTRY));
}


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 02/10] radix-tree: make 'indirect' bit available to exception entries.
  2016-03-22  9:27         ` Matthew Wilcox
@ 2016-03-22 10:37           ` Jan Kara
  -1 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-22 10:37 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, Wilcox,
	Matthew R  <matthew.r.wilcox@intel.com>,
	Jan Kara <jack@suse.cz>,
	NeilBrown, linux-nvdimm

On Tue 22-03-16 05:27:08, Matthew Wilcox wrote:
> On Tue, Mar 22, 2016 at 10:12:32AM +0100, Jan Kara wrote:
> > 		if (unlikely(!page))	// False since
> > 					// RADIX_TREE_INDIRECT_PTR is set
> > 		if (radix_tree_exception(page))	// False - no exceptional bit
> 
> Oops, you got confused:
> 
> static inline int radix_tree_exception(void *arg)
> {
>         return unlikely((unsigned long)arg &
>                 (RADIX_TREE_INDIRECT_PTR | RADIX_TREE_EXCEPTIONAL_ENTRY));
> }

Ah, I've confused radix_tree_exception() and
radix_tree_exceptional_entry(). OK, so your code works AFAICT. But using
RADIX_TREE_RETRY still doesn't make things clearer to me - you still need
to check for INDIRECT bit in the retry logic to catch the
radix_tree_extend() race as well...

As a side note I think we should do away with radix_tree_exception() - it
isn't very useful (doesn't simplify any of its callers) and only creates
possibility for confusion.
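
For anyone else who trips over the same thing, the two helpers differ only
in which bits they test (quoting from memory, so double-check the header):

static inline int radix_tree_exceptional_entry(void *arg)
{
	/* tests only the EXCEPTIONAL bit - true for shmem swap / DAX entries */
	return (unsigned long)arg & RADIX_TREE_EXCEPTIONAL_ENTRY;
}

static inline int radix_tree_exception(void *arg)
{
	/* also catches indirect pointers, i.e. nodes and retry markers */
	return unlikely((unsigned long)arg &
		(RADIX_TREE_INDIRECT_PTR | RADIX_TREE_EXCEPTIONAL_ENTRY));
}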

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 02/10] radix-tree: make 'indirect' bit available to exception entries.
@ 2016-03-22 10:37           ` Jan Kara
  0 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-22 10:37 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jan Kara, linux-fsdevel, linux-nvdimm, NeilBrown, Wilcox, Matthew R

On Tue 22-03-16 05:27:08, Matthew Wilcox wrote:
> On Tue, Mar 22, 2016 at 10:12:32AM +0100, Jan Kara wrote:
> > 		if (unlikely(!page))	// False since
> > 					// RADIX_TREE_INDIRECT_PTR is set
> > 		if (radix_tree_exception(page))	// False - no exceptional bit
> 
> Oops, you got confused:
> 
> static inline int radix_tree_exception(void *arg)
> {
>         return unlikely((unsigned long)arg &
>                 (RADIX_TREE_INDIRECT_PTR | RADIX_TREE_EXCEPTIONAL_ENTRY));
> }

Ah, I've confused radix_tree_exception() and
radix_tree_exceptional_entry(). OK, so your code works AFAICT. But using
RADIX_TREE_RETRY still doesn't make things clearer to me - you still need
to check for INDIRECT bit in the retry logic to catch the
radix_tree_extend() race as well...

As a side note I think we should do away with radix_tree_exception() - it
isn't very useful (doesn't simplify any of its callers) and only creates
possibility for confusion.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [RFC v2] [PATCH 0/10] DAX page fault locking
  2016-03-21 13:22 ` Jan Kara
@ 2016-03-22 19:32   ` Ross Zwisler
  -1 siblings, 0 replies; 88+ messages in thread
From: Ross Zwisler @ 2016-03-22 19:32 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-nvdimm, NeilBrown, Wilcox

On Mon, Mar 21, 2016 at 02:22:45PM +0100, Jan Kara wrote:
> [Sorry for repost but I accidentally sent initial email without patches]
> 
> Hello,
> 
> this is my second attempt at DAX page fault locking rewrite. Things now
> work reasonably well, it has survived full xfstests run on ext4. I guess
> I need to do more mmap targetted tests to unveil issues. Guys what do you
> used for DAX testing?

I typically use xfstests for regression testing.  If we can come up with new
generally useful regression tests, especially ones concerning mmap races, that
would be awesome.  I guess it's just a choice between adding them somewhere in
xfstests or somewhere else like with the unit tests in ndctl.

> Changes since v1:
> - handle wakeups of exclusive waiters properly
> - fix cow fault races
> - other minor stuff
> 
> General description
> 
> The basic idea is that we use a bit in an exceptional radix tree entry as
> a lock bit and use it similarly to how page lock is used for normal faults.
> That way we fix races between hole instantiation and read faults of the
> same index. For now I have disabled PMD faults since there the issues with
> page fault locking are even worse. Now that Matthew's multi-order radix tree
> has landed, I can have a look into using that for proper locking of PMD faults
> but first I want normal pages sorted out.
> 
> In the end I have decided to implement the bit locking directly in the DAX
> code. Originally I was thinking we could provide something generic directly
> in the radix tree code but the functions DAX needs are rather specific.
> Maybe someone else will have a good idea how to distill some generally useful
> functions out of what I've implemented for DAX but for now I didn't bother
> with that.
> 
> 								Honza

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [RFC v2] [PATCH 0/10] DAX page fault locking
@ 2016-03-22 19:32   ` Ross Zwisler
  0 siblings, 0 replies; 88+ messages in thread
From: Ross Zwisler @ 2016-03-22 19:32 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Wilcox, Matthew R, Ross Zwisler, Dan Williams,
	linux-nvdimm, NeilBrown, Jeff Moyer

On Mon, Mar 21, 2016 at 02:22:45PM +0100, Jan Kara wrote:
> [Sorry for repost but I accidentally sent initial email without patches]
> 
> Hello,
> 
> this is my second attempt at DAX page fault locking rewrite. Things now
> work reasonably well, it has survived full xfstests run on ext4. I guess
> I need to do more mmap targetted tests to unveil issues. Guys what do you
> used for DAX testing?

I typically use xfstests for regression testing.  If we can come up with new
generally useful regression tests, especially ones concerning mmap races, that
would be awesome.  I guess it's just a choice between adding them somewhere in
xfstests or somewhere else like with the unit tests in ndctl.

> Changes since v1:
> - handle wakeups of exclusive waiters properly
> - fix cow fault races
> - other minor stuff
> 
> General description
> 
> The basic idea is that we use a bit in an exceptional radix tree entry as
> a lock bit and use it similarly to how page lock is used for normal faults.
> That way we fix races between hole instantiation and read faults of the
> same index. For now I have disabled PMD faults since there the issues with
> page fault locking are even worse. Now that Matthew's multi-order radix tree
> has landed, I can have a look into using that for proper locking of PMD faults
> but first I want normal pages sorted out.
> 
> In the end I have decided to implement the bit locking directly in the DAX
> code. Originally I was thinking we could provide something generic directly
> in the radix tree code but the functions DAX needs are rather specific.
> Maybe someone else will have a good idea how to distill some generally useful
> functions out of what I've implemented for DAX but for now I didn't bother
> with that.
> 
> 								Honza

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [RFC v2] [PATCH 0/10] DAX page fault locking
  2016-03-22 19:32   ` Ross Zwisler
@ 2016-03-22 21:07     ` Toshi Kani
  -1 siblings, 0 replies; 88+ messages in thread
From: Toshi Kani @ 2016-03-22 21:07 UTC (permalink / raw)
  To: Ross Zwisler, Jan Kara; +Cc: linux-fsdevel, brian.boylston, Wilcox, Matthew

On Tue, 2016-03-22 at 13:32 -0600, Ross Zwisler wrote:
> On Mon, Mar 21, 2016 at 02:22:45PM +0100, Jan Kara wrote:
> > [Sorry for repost but I accidentally sent initial email without
> > patches]
> > 
> > Hello,
> > 
> > this is my second attempt at DAX page fault locking rewrite. Things now
> > work reasonably well, it has survived full xfstests run on ext4. I
> > guess I need to do more mmap targetted tests to unveil issues. Guys
> > what do you used for DAX testing?
> 
> I typically use xfstests for regression testing.  If we can come up with
> new generally useful regression tests, especially ones concerning mmap
> races, that would be awesome.  I guess it's just a choice between adding
> them somewhere in xfstests or somewhere else like with the unit tests in
> ndctl.

Brian Boylston wrote a test for mmap race conditions and posted it a while
back. This test was very useful for fixing the data corruption issue we had
before.
http://www.spinics.net/lists/linux-ext4/msg49876.html

Is there anything we can do to make this test useful as a regression test?

Thanks,
-Toshi


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [RFC v2] [PATCH 0/10] DAX page fault locking
@ 2016-03-22 21:07     ` Toshi Kani
  0 siblings, 0 replies; 88+ messages in thread
From: Toshi Kani @ 2016-03-22 21:07 UTC (permalink / raw)
  To: Ross Zwisler, Jan Kara
  Cc: linux-nvdimm, NeilBrown, Wilcox, Matthew R, linux-fsdevel,
	brian.boylston

On Tue, 2016-03-22 at 13:32 -0600, Ross Zwisler wrote:
> On Mon, Mar 21, 2016 at 02:22:45PM +0100, Jan Kara wrote:
> > [Sorry for repost but I accidentally sent initial email without
> > patches]
> > 
> > Hello,
> > 
> > this is my second attempt at DAX page fault locking rewrite. Things now
> > work reasonably well, it has survived full xfstests run on ext4. I
> > guess I need to do more mmap targetted tests to unveil issues. Guys
> > what do you used for DAX testing?
> 
> I typically use xfstests for regression testing.  If we can come up with
> new generally useful regression tests, especially ones concerning mmap
> races, that would be awesome.  I guess it's just a choice between adding
> them somewhere in xfstests or somewhere else like with the unit tests in
> ndctl.

Brian Boylston wrote a test for mmap race conditions and posted it a while
back. This test was very useful for fixing the data corruption issue we had
before.
http://www.spinics.net/lists/linux-ext4/msg49876.html

Is there anything we can do to make this test useful as a regression test?

Thanks,
-Toshi


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [RFC v2] [PATCH 0/10] DAX page fault locking
  2016-03-22 21:07     ` Toshi Kani
@ 2016-03-22 21:15       ` Dave Chinner
  -1 siblings, 0 replies; 88+ messages in thread
From: Dave Chinner @ 2016-03-22 21:15 UTC (permalink / raw)
  To: Toshi Kani; +Cc: Jan Kara, linux-nvdimm, NeilBrown, brian.boylston, Wilcox

On Tue, Mar 22, 2016 at 03:07:33PM -0600, Toshi Kani wrote:
> On Tue, 2016-03-22 at 13:32 -0600, Ross Zwisler wrote:
> > On Mon, Mar 21, 2016 at 02:22:45PM +0100, Jan Kara wrote:
> > > [Sorry for repost but I accidentally sent initial email without
> > > patches]
> > > 
> > > Hello,
> > > 
> > > this is my second attempt at DAX page fault locking rewrite. Things now
> > > work reasonably well, it has survived full xfstests run on ext4. I
> > > guess I need to do more mmap targetted tests to unveil issues. Guys
> > > what do you used for DAX testing?
> > 
> > I typically use xfstests for regression testing.  If we can come up with
> > new generally useful regression tests, especially ones concerning mmap
> > races, that would be awesome.  I guess it's just a choice between adding
> > them somewhere in xfstests or somewhere else like with the unit tests in
> > ndctl.
> 
> Brian Boylston wrote a test for mmap race conditions and posted it before.
>  This test was very useful to fix the data corruption issue we had before.
> http://www.spinics.net/lists/linux-ext4/msg49876.html
> 
> If there is anything we can do to make this test useful as regression
> tests?

It's gplv2, so wrap it in a xfstests harness script and send it to
fstests@vger.kernel.org.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [RFC v2] [PATCH 0/10] DAX page fault locking
@ 2016-03-22 21:15       ` Dave Chinner
  0 siblings, 0 replies; 88+ messages in thread
From: Dave Chinner @ 2016-03-22 21:15 UTC (permalink / raw)
  To: Toshi Kani
  Cc: Ross Zwisler, Jan Kara, linux-nvdimm, NeilBrown, Wilcox,
	Matthew R, linux-fsdevel, brian.boylston

On Tue, Mar 22, 2016 at 03:07:33PM -0600, Toshi Kani wrote:
> On Tue, 2016-03-22 at 13:32 -0600, Ross Zwisler wrote:
> > On Mon, Mar 21, 2016 at 02:22:45PM +0100, Jan Kara wrote:
> > > [Sorry for repost but I accidentally sent initial email without
> > > patches]
> > > 
> > > Hello,
> > > 
> > > this is my second attempt at DAX page fault locking rewrite. Things now
> > > work reasonably well, it has survived full xfstests run on ext4. I
> > > guess I need to do more mmap targetted tests to unveil issues. Guys
> > > what do you used for DAX testing?
> > 
> > I typically use xfstests for regression testing.  If we can come up with
> > new generally useful regression tests, especially ones concerning mmap
> > races, that would be awesome.  I guess it's just a choice between adding
> > them somewhere in xfstests or somewhere else like with the unit tests in
> > ndctl.
> 
> Brian Boylston wrote a test for mmap race conditions and posted it before.
>  This test was very useful to fix the data corruption issue we had before.
> http://www.spinics.net/lists/linux-ext4/msg49876.html
> 
> If there is anything we can do to make this test useful as regression
> tests?

It's gplv2, so wrap it in a xfstests harness script and send it to
fstests@vger.kernel.org.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [RFC v2] [PATCH 0/10] DAX page fault locking
  2016-03-22 21:07     ` Toshi Kani
@ 2016-03-23  9:45       ` Jan Kara
  -1 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-23  9:45 UTC (permalink / raw)
  To: Toshi Kani; +Cc: Jan Kara, linux-nvdimm, NeilBrown, brian.boylston, Wilcox

On Tue 22-03-16 15:07:33, Toshi Kani wrote:
> On Tue, 2016-03-22 at 13:32 -0600, Ross Zwisler wrote:
> > On Mon, Mar 21, 2016 at 02:22:45PM +0100, Jan Kara wrote:
> > > [Sorry for repost but I accidentally sent initial email without
> > > patches]
> > > 
> > > Hello,
> > > 
> > > this is my second attempt at DAX page fault locking rewrite. Things now
> > > work reasonably well, it has survived full xfstests run on ext4. I
> > > guess I need to do more mmap targetted tests to unveil issues. Guys
> > > what do you used for DAX testing?
> > 
> > I typically use xfstests for regression testing.  If we can come up with
> > new generally useful regression tests, especially ones concerning mmap
> > races, that would be awesome.  I guess it's just a choice between adding
> > them somewhere in xfstests or somewhere else like with the unit tests in
> > ndctl.
> 
> Brian Boylston wrote a test for mmap race conditions and posted it before.
>  This test was very useful to fix the data corruption issue we had before.
> http://www.spinics.net/lists/linux-ext4/msg49876.html
> 
> If there is anything we can do to make this test useful as regression
> tests?

Thanks for the pointer, I forgot about this useful test. As Dave said the
best way for this to not get lost is to include it in xfstests. Since I
want to run the test regularly anyway, I can integrate it myself.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [RFC v2] [PATCH 0/10] DAX page fault locking
@ 2016-03-23  9:45       ` Jan Kara
  0 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-23  9:45 UTC (permalink / raw)
  To: Toshi Kani
  Cc: Ross Zwisler, Jan Kara, linux-nvdimm, NeilBrown, Wilcox,
	Matthew R, linux-fsdevel, brian.boylston

On Tue 22-03-16 15:07:33, Toshi Kani wrote:
> On Tue, 2016-03-22 at 13:32 -0600, Ross Zwisler wrote:
> > On Mon, Mar 21, 2016 at 02:22:45PM +0100, Jan Kara wrote:
> > > [Sorry for repost but I accidentally sent initial email without
> > > patches]
> > > 
> > > Hello,
> > > 
> > > this is my second attempt at DAX page fault locking rewrite. Things now
> > > work reasonably well, it has survived full xfstests run on ext4. I
> > > guess I need to do more mmap targetted tests to unveil issues. Guys
> > > what do you used for DAX testing?
> > 
> > I typically use xfstests for regression testing.  If we can come up with
> > new generally useful regression tests, especially ones concerning mmap
> > races, that would be awesome.  I guess it's just a choice between adding
> > them somewhere in xfstests or somewhere else like with the unit tests in
> > ndctl.
> 
> Brian Boylston wrote a test for mmap race conditions and posted it before.
>  This test was very useful to fix the data corruption issue we had before.
> http://www.spinics.net/lists/linux-ext4/msg49876.html
> 
> If there is anything we can do to make this test useful as regression
> tests?

Thanks for the pointer, I forgot about this useful test. As Dave said the
best way for this to not get lost is to include it in xfstests. Since I
want to run the test regularly anyway, I can integrate it myself.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [RFC v2] [PATCH 0/10] DAX page fault locking
  2016-03-21 17:41   ` Matthew Wilcox
@ 2016-03-23 15:09     ` Jan Kara
  -1 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-23 15:09 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jan Kara, linux-nvdimm, NeilBrown, Wilcox, Matthew R,
	linux-fsdevel, Kirill A. Shutemov

On Mon 21-03-16 13:41:03, Matthew Wilcox wrote:
> On Mon, Mar 21, 2016 at 02:22:45PM +0100, Jan Kara wrote:
> > The basic idea is that we use a bit in an exceptional radix tree entry as
> > a lock bit and use it similarly to how page lock is used for normal faults.
> > That way we fix races between hole instantiation and read faults of the
> > same index. For now I have disabled PMD faults since there the issues with
> > page fault locking are even worse. Now that Matthew's multi-order radix tree
> > has landed, I can have a look into using that for proper locking of PMD faults
> > but first I want normal pages sorted out.
> 
> FYI, the multi-order radix tree code that landed is unusably buggy.
> Ross and I have been working like madmen for the past three weeks to fix
> all of the bugs we've found and not introduce new ones.  The radix tree
> test suite has been enormously helpful in this regard, but we're still
> finding corner cases (thanks, RCU! ;-)
> 
> Our current best effort can be found hiding in
> http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/radix-fixes-2016-03-15
> but it's for sure not ready for review yet.  I just don't want other
> people trying to use the facility and wasting their time.

So when looking through the fixes I was wondering: are sibling entries
really worth it? Won't the result be simpler if we just used
RADIX_TREE_MAP_SHIFT == 9? We would need to move the slot pointers out of
the radix_tree_node structure (there'd be a full page worth of them) but
that's easy. More complications probably come from the fact that we don't
want that unconditionally, since the radix tree for small files would
consume considerably more memory and that could be an issue for some
systems. For DAX as such we don't really care, I think, at least for now,
but for the normal page cache we do. So we would have to make
RADIX_TREE_MAP_SHIFT a per-radix-tree property. What do you think? I can
try to write some patches if you think it's worth it...
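
To make the idea a bit more concrete, I'm thinking of roughly this shape
(all names below are made up for illustration, not a worked-out interface):

struct radix_tree_root {
	unsigned int		height;
	gfp_t			gfp_mask;
	unsigned char		map_shift;	/* 6 as today, 9 for DAX-style trees */
	struct radix_tree_node	__rcu *rnode;
};

struct radix_tree_node {
	unsigned int	path;
	unsigned int	count;
	struct radix_tree_node __rcu *parent;
	void __rcu	**slots;	/* separate allocation, a full page for shift == 9 */
	/* tag bitmaps would have to be sized for the larger fanout as well */
};

static inline unsigned long radix_tree_map_size(struct radix_tree_root *root)
{
	return 1UL << root->map_shift;
}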

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [RFC v2] [PATCH 0/10] DAX page fault locking
@ 2016-03-23 15:09     ` Jan Kara
  0 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-23 15:09 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jan Kara, linux-fsdevel, linux-nvdimm, NeilBrown, Wilcox,
	Matthew R, Kirill A. Shutemov

On Mon 21-03-16 13:41:03, Matthew Wilcox wrote:
> On Mon, Mar 21, 2016 at 02:22:45PM +0100, Jan Kara wrote:
> > The basic idea is that we use a bit in an exceptional radix tree entry as
> > a lock bit and use it similarly to how page lock is used for normal faults.
> > That way we fix races between hole instantiation and read faults of the
> > same index. For now I have disabled PMD faults since there the issues with
> > page fault locking are even worse. Now that Matthew's multi-order radix tree
> > has landed, I can have a look into using that for proper locking of PMD faults
> > but first I want normal pages sorted out.
> 
> FYI, the multi-order radix tree code that landed is unusably buggy.
> Ross and I have been working like madmen for the past three weeks to fix
> all of the bugs we've found and not introduce new ones.  The radix tree
> test suite has been enormously helpful in this regard, but we're still
> finding corner cases (thanks, RCU! ;-)
> 
> Our current best effort can be found hiding in
> http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/radix-fixes-2016-03-15
> but it's for sure not ready for review yet.  I just don't want other
> people trying to use the facility and wasting their time.

So when looking through the fixes I was wondering: are sibling entries
really worth it? Won't the result be simpler if we just used
RADIX_TREE_MAP_SHIFT == 9? We would need to move the slot pointers out of
the radix_tree_node structure (there'd be a full page worth of them) but
that's easy. More complications probably come from the fact that we don't
want that unconditionally, since the radix tree for small files would
consume considerably more memory and that could be an issue for some
systems. For DAX as such we don't really care, I think, at least for now,
but for the normal page cache we do. So we would have to make
RADIX_TREE_MAP_SHIFT a per-radix-tree property. What do you think? I can
try to write some patches if you think it's worth it...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [RFC v2] [PATCH 0/10] DAX page fault locking
  2016-03-23  9:45       ` Jan Kara
@ 2016-03-23 15:11         ` Toshi Kani
  -1 siblings, 0 replies; 88+ messages in thread
From: Toshi Kani @ 2016-03-23 15:11 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-nvdimm, NeilBrown, brian.boylston, Wilcox, Matthew R,
	linux-fsdevel

On Wed, 2016-03-23 at 10:45 +0100, Jan Kara wrote:
> On Tue 22-03-16 15:07:33, Toshi Kani wrote:
> > On Tue, 2016-03-22 at 13:32 -0600, Ross Zwisler wrote:
> > > On Mon, Mar 21, 2016 at 02:22:45PM +0100, Jan Kara wrote:
> > > > [Sorry for repost but I accidentally sent initial email without
> > > > patches]
> > > > 
> > > > Hello,
> > > > 
> > > > this is my second attempt at DAX page fault locking rewrite. Things
> > > > now work reasonably well, it has survived full xfstests run on
> > > > ext4. I guess I need to do more mmap targetted tests to unveil
> > > > issues. Guys what do you used for DAX testing?
> > > 
> > > I typically use xfstests for regression testing.  If we can come up
> > > with new generally useful regression tests, especially ones
> > > concerning mmap races, that would be awesome.  I guess it's just a
> > > choice between adding them somewhere in xfstests or somewhere else
> > > like with the unit tests in ndctl.
> > 
> > Brian Boylston wrote a test for mmap race conditions and posted it
> > before.  This test was very useful to fix the data corruption issue we
> > had before.
> > http://www.spinics.net/lists/linux-ext4/msg49876.html
> > 
> > If there is anything we can do to make this test useful as regression
> > tests?
> 
> Thanks for the pointer, I forgot about this useful test. As Dave said the
> best way for this to not get lost is to include it in xfstests. Since I
> want to run the test regularly anyway, I can integrate it myself.

Great! Thanks Jan!
-Toshi

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [RFC v2] [PATCH 0/10] DAX page fault locking
@ 2016-03-23 15:11         ` Toshi Kani
  0 siblings, 0 replies; 88+ messages in thread
From: Toshi Kani @ 2016-03-23 15:11 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-nvdimm, NeilBrown, Wilcox, Matthew R,
	linux-fsdevel, brian.boylston

On Wed, 2016-03-23 at 10:45 +0100, Jan Kara wrote:
> On Tue 22-03-16 15:07:33, Toshi Kani wrote:
> > On Tue, 2016-03-22 at 13:32 -0600, Ross Zwisler wrote:
> > > On Mon, Mar 21, 2016 at 02:22:45PM +0100, Jan Kara wrote:
> > > > [Sorry for repost but I accidentally sent initial email without
> > > > patches]
> > > > 
> > > > Hello,
> > > > 
> > > > this is my second attempt at DAX page fault locking rewrite. Things
> > > > now work reasonably well, it has survived full xfstests run on
> > > > ext4. I guess I need to do more mmap targetted tests to unveil
> > > > issues. Guys what do you used for DAX testing?
> > > 
> > > I typically use xfstests for regression testing.  If we can come up
> > > with new generally useful regression tests, especially ones
> > > concerning mmap races, that would be awesome.  I guess it's just a
> > > choice between adding them somewhere in xfstests or somewhere else
> > > like with the unit tests in ndctl.
> > 
> > Brian Boylston wrote a test for mmap race conditions and posted it
> > before.  This test was very useful to fix the data corruption issue we
> > had before.
> > http://www.spinics.net/lists/linux-ext4/msg49876.html
> > 
> > If there is anything we can do to make this test useful as regression
> > tests?
> 
> Thanks for the pointer, I forgot about this useful test. As Dave said the
> best way for this to not get lost is to include it in xfstests. Since I
> want to run the test regularly anyway, I can integrate it myself.

Great! Thanks Jan!
-Toshi

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 02/10] radix-tree: make 'indirect' bit available to exception entries.
  2016-03-22 10:37           ` Jan Kara
@ 2016-03-23 16:41             ` Ross Zwisler
  -1 siblings, 0 replies; 88+ messages in thread
From: Ross Zwisler @ 2016-03-23 16:41 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-nvdimm, Wilcox

On Tue, Mar 22, 2016 at 11:37:54AM +0100, Jan Kara wrote:
> On Tue 22-03-16 05:27:08, Matthew Wilcox wrote:
> > On Tue, Mar 22, 2016 at 10:12:32AM +0100, Jan Kara wrote:
> > > 		if (unlikely(!page))	// False since
> > > 					// RADIX_TREE_INDIRECT_PTR is set
> > > 		if (radix_tree_exception(page))	// False - no exceptional bit
> > 
> > Oops, you got confused:
> > 
> > static inline int radix_tree_exception(void *arg)
> > {
> >         return unlikely((unsigned long)arg &
> >                 (RADIX_TREE_INDIRECT_PTR | RADIX_TREE_EXCEPTIONAL_ENTRY));
> > }
> 
> Ah, I've confused radix_tree_exception() and
> radix_tree_exceptional_entry(). OK, so your code works AFAICT. But using
> RADIX_TREE_RETRY still doesn't make things clearer to me - you still need
> to check for INDIRECT bit in the retry logic to catch the
> radix_tree_extend() race as well...
> 
> As a side note I think we should do away with radix_tree_exception() - it
> isn't very useful (doesn't simplify any of its callers) and only creates
> possibility for confusion.

Perhaps it would be clearer if we explicitly enumerated the four radix tree
entry types?

#define RADIX_TREE_TYPE_MASK		3

#define	RADIX_TREE_TYPE_DATA		0
#define RADIX_TREE_TYPE_INDIRECT	1
#define	RADIX_TREE_TYPE_EXCEPTIONAL	2
#define RADIX_TREE_TYPE_LOCKED_EXC	3

This would make radix_tree_exception (which we could rename so it doesn't
get confused with "exceptional" entries):

static inline int radix_tree_non_data(void *arg)
{
        return unlikely((unsigned long)arg & RADIX_TREE_TYPE_MASK);
}

Etc?  I guess we'd have to code it up and see if the result was simpler, but
it seems like it might be.
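
For example, a lookup path could then dispatch on the low bits in one place
(again just a sketch reusing the names above):

	void *entry = rcu_dereference_raw(*slot);

	switch ((unsigned long)entry & RADIX_TREE_TYPE_MASK) {
	case RADIX_TREE_TYPE_DATA:
		/* plain pointer, e.g. a struct page in the page cache */
		break;
	case RADIX_TREE_TYPE_INDIRECT:
		/* interior node, or the shrink/extend retry marker */
		goto restart;
	case RADIX_TREE_TYPE_EXCEPTIONAL:
	case RADIX_TREE_TYPE_LOCKED_EXC:
		/* shmem swap entry or a (possibly locked) DAX entry */
		break;
	}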

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 02/10] radix-tree: make 'indirect' bit available to exception entries.
@ 2016-03-23 16:41             ` Ross Zwisler
  0 siblings, 0 replies; 88+ messages in thread
From: Ross Zwisler @ 2016-03-23 16:41 UTC (permalink / raw)
  To: Jan Kara
  Cc: Matthew Wilcox, linux-fsdevel, Wilcox, Matthew R, NeilBrown,
	linux-nvdimm

On Tue, Mar 22, 2016 at 11:37:54AM +0100, Jan Kara wrote:
> On Tue 22-03-16 05:27:08, Matthew Wilcox wrote:
> > On Tue, Mar 22, 2016 at 10:12:32AM +0100, Jan Kara wrote:
> > > 		if (unlikely(!page))	// False since
> > > 					// RADIX_TREE_INDIRECT_PTR is set
> > > 		if (radix_tree_exception(page))	// False - no exceptional bit
> > 
> > Oops, you got confused:
> > 
> > static inline int radix_tree_exception(void *arg)
> > {
> >         return unlikely((unsigned long)arg &
> >                 (RADIX_TREE_INDIRECT_PTR | RADIX_TREE_EXCEPTIONAL_ENTRY));
> > }
> 
> Ah, I've confused radix_tree_exception() and
> radix_tree_exceptional_entry(). OK, so your code works AFAICT. But using
> RADIX_TREE_RETRY still doesn't make things clearer to me - you still need
> to check for INDIRECT bit in the retry logic to catch the
> radix_tree_extend() race as well...
> 
> As a side note I think we should do away with radix_tree_exception() - it
> isn't very useful (doesn't simplify any of its callers) and only creates
> possibility for confusion.

Perhaps it would be clearer if we explicitly enumerated the four radix tree
entry types?

#define RADIX_TREE_TYPE_MASK		3

#define	RADIX_TREE_TYPE_DATA		0
#define RADIX_TREE_TYPE_INDIRECT	1
#define	RADIX_TREE_TYPE_EXCEPTIONAL	2
#define RADIX_TREE_TYPE_LOCKED_EXC	3

This would make radix_tree_exception (which we could rename so it doesn't
get confused with "exceptional" entries):

static inline int radix_tree_non_data(void *arg)
{
        return unlikely((unsigned long)arg & RADIX_TREE_TYPE_MASK);
}

Etc?  I guess we'd have to code it up and see if the result was simpler, but
it seems like it might be.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 03/10] dax: Remove complete_unwritten argument
  2016-03-21 13:22   ` Jan Kara
@ 2016-03-23 17:12     ` Ross Zwisler
  -1 siblings, 0 replies; 88+ messages in thread
From: Ross Zwisler @ 2016-03-23 17:12 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-nvdimm, NeilBrown, Wilcox

On Mon, Mar 21, 2016 at 02:22:48PM +0100, Jan Kara wrote:
> Fault handlers currently take complete_unwritten argument to convert
> unwritten extents after PTEs are updated. However no filesystem uses
> this anymore as the code is racy. Remove the unused argument.

This looks good.  Looking at this reminded me that at some point it may be
good to clean up our buffer head flags checks and make sure we don't have
checks that don't make sense - we still check for buffer_unwritten() in
buffer_written(), for instance. 
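
For reference, the helper I mean is, if I'm reading fs/dax.c right, just:

static bool buffer_written(struct buffer_head *bh)
{
	return buffer_mapped(bh) && !buffer_unwritten(bh);
}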

The handling of BH_New isn't consistent among filesystems, either - XFS & ext4
go out of their way to make sure BH_New is not set, while ext2 sets BH_New.

But those are separate from this patch, I think.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/block_dev.c      |  4 ++--
>  fs/dax.c            | 43 +++++++++----------------------------------
>  fs/ext2/file.c      |  4 ++--
>  fs/ext4/file.c      |  4 ++--
>  fs/xfs/xfs_file.c   |  7 +++----
>  include/linux/dax.h | 17 +++++++----------
>  include/linux/fs.h  |  1 -
>  7 files changed, 25 insertions(+), 55 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 3172c4e2f502..a59f155f9aaf 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1746,7 +1746,7 @@ static const struct address_space_operations def_blk_aops = {
>   */
>  static int blkdev_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
> -	return __dax_fault(vma, vmf, blkdev_get_block, NULL);
> +	return __dax_fault(vma, vmf, blkdev_get_block);
>  }
>  
>  static int blkdev_dax_pfn_mkwrite(struct vm_area_struct *vma,
> @@ -1758,7 +1758,7 @@ static int blkdev_dax_pfn_mkwrite(struct vm_area_struct *vma,
>  static int blkdev_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
>  		pmd_t *pmd, unsigned int flags)
>  {
> -	return __dax_pmd_fault(vma, addr, pmd, flags, blkdev_get_block, NULL);
> +	return __dax_pmd_fault(vma, addr, pmd, flags, blkdev_get_block);
>  }
>  
>  static const struct vm_operations_struct blkdev_dax_vm_ops = {
> diff --git a/fs/dax.c b/fs/dax.c
> index b32e1b5eb8d4..d496466652cd 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -607,19 +607,13 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>   * @vma: The virtual memory area where the fault occurred
>   * @vmf: The description of the fault
>   * @get_block: The filesystem method used to translate file offsets to blocks
> - * @complete_unwritten: The filesystem method used to convert unwritten blocks
> - *	to written so the data written to them is exposed. This is required for
> - *	required by write faults for filesystems that will return unwritten
> - *	extent mappings from @get_block, but it is optional for reads as
> - *	dax_insert_mapping() will always zero unwritten blocks. If the fs does
> - *	not support unwritten extents, the it should pass NULL.
>   *
>   * When a page fault occurs, filesystems may call this helper in their
>   * fault handler for DAX files. __dax_fault() assumes the caller has done all
>   * the necessary locking for the page fault to proceed successfully.
>   */
>  int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> -			get_block_t get_block, dax_iodone_t complete_unwritten)
> +			get_block_t get_block)
>  {
>  	struct file *file = vma->vm_file;
>  	struct address_space *mapping = file->f_mapping;
> @@ -722,23 +716,9 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		page = NULL;
>  	}
>  
> -	/*
> -	 * If we successfully insert the new mapping over an unwritten extent,
> -	 * we need to ensure we convert the unwritten extent. If there is an
> -	 * error inserting the mapping, the filesystem needs to leave it as
> -	 * unwritten to prevent exposure of the stale underlying data to
> -	 * userspace, but we still need to call the completion function so
> -	 * the private resources on the mapping buffer can be released. We
> -	 * indicate what the callback should do via the uptodate variable, same
> -	 * as for normal BH based IO completions.
> -	 */
> +	/* Filesystem should not return unwritten buffers to us! */
> +	WARN_ON_ONCE(buffer_unwritten(&bh));
>  	error = dax_insert_mapping(inode, &bh, vma, vmf);
> -	if (buffer_unwritten(&bh)) {
> -		if (complete_unwritten)
> -			complete_unwritten(&bh, !error);
> -		else
> -			WARN_ON_ONCE(!(vmf->flags & FAULT_FLAG_WRITE));
> -	}
>  
>   out:
>  	if (error == -ENOMEM)
> @@ -767,7 +747,7 @@ EXPORT_SYMBOL(__dax_fault);
>   * fault handler for DAX files.
>   */
>  int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> -	      get_block_t get_block, dax_iodone_t complete_unwritten)
> +	      get_block_t get_block)
>  {
>  	int result;
>  	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> @@ -776,7 +756,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		sb_start_pagefault(sb);
>  		file_update_time(vma->vm_file);
>  	}
> -	result = __dax_fault(vma, vmf, get_block, complete_unwritten);
> +	result = __dax_fault(vma, vmf, get_block);
>  	if (vmf->flags & FAULT_FLAG_WRITE)
>  		sb_end_pagefault(sb);
>  
> @@ -810,8 +790,7 @@ static void __dax_dbg(struct buffer_head *bh, unsigned long address,
>  #define dax_pmd_dbg(bh, address, reason)	__dax_dbg(bh, address, reason, "dax_pmd")
>  
>  int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> -		pmd_t *pmd, unsigned int flags, get_block_t get_block,
> -		dax_iodone_t complete_unwritten)
> +		pmd_t *pmd, unsigned int flags, get_block_t get_block)
>  {
>  	struct file *file = vma->vm_file;
>  	struct address_space *mapping = file->f_mapping;
> @@ -870,6 +849,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  		if (get_block(inode, block, &bh, 1) != 0)
>  			return VM_FAULT_SIGBUS;
>  		alloc = true;
> +		WARN_ON_ONCE(buffer_unwritten(&bh));
>  	}
>  
>  	bdev = bh.b_bdev;
> @@ -1015,9 +995,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>   out:
>  	i_mmap_unlock_read(mapping);
>  
> -	if (buffer_unwritten(&bh))
> -		complete_unwritten(&bh, !(result & VM_FAULT_ERROR));
> -
>  	return result;
>  
>   fallback:
> @@ -1037,8 +1014,7 @@ EXPORT_SYMBOL_GPL(__dax_pmd_fault);
>   * pmd_fault handler for DAX files.
>   */
>  int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> -			pmd_t *pmd, unsigned int flags, get_block_t get_block,
> -			dax_iodone_t complete_unwritten)
> +			pmd_t *pmd, unsigned int flags, get_block_t get_block)
>  {
>  	int result;
>  	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> @@ -1047,8 +1023,7 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  		sb_start_pagefault(sb);
>  		file_update_time(vma->vm_file);
>  	}
> -	result = __dax_pmd_fault(vma, address, pmd, flags, get_block,
> -				complete_unwritten);
> +	result = __dax_pmd_fault(vma, address, pmd, flags, get_block);
>  	if (flags & FAULT_FLAG_WRITE)
>  		sb_end_pagefault(sb);
>  
> diff --git a/fs/ext2/file.c b/fs/ext2/file.c
> index c1400b109805..868c02317b05 100644
> --- a/fs/ext2/file.c
> +++ b/fs/ext2/file.c
> @@ -51,7 +51,7 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>  	}
>  	down_read(&ei->dax_sem);
>  
> -	ret = __dax_fault(vma, vmf, ext2_get_block, NULL);
> +	ret = __dax_fault(vma, vmf, ext2_get_block);
>  
>  	up_read(&ei->dax_sem);
>  	if (vmf->flags & FAULT_FLAG_WRITE)
> @@ -72,7 +72,7 @@ static int ext2_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
>  	}
>  	down_read(&ei->dax_sem);
>  
> -	ret = __dax_pmd_fault(vma, addr, pmd, flags, ext2_get_block, NULL);
> +	ret = __dax_pmd_fault(vma, addr, pmd, flags, ext2_get_block);
>  
>  	up_read(&ei->dax_sem);
>  	if (flags & FAULT_FLAG_WRITE)
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 6659e216385e..cf20040a1a49 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -207,7 +207,7 @@ static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>  	if (IS_ERR(handle))
>  		result = VM_FAULT_SIGBUS;
>  	else
> -		result = __dax_fault(vma, vmf, ext4_dax_mmap_get_block, NULL);
> +		result = __dax_fault(vma, vmf, ext4_dax_mmap_get_block);
>  
>  	if (write) {
>  		if (!IS_ERR(handle))
> @@ -243,7 +243,7 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
>  		result = VM_FAULT_SIGBUS;
>  	else
>  		result = __dax_pmd_fault(vma, addr, pmd, flags,
> -				ext4_dax_mmap_get_block, NULL);
> +				ext4_dax_mmap_get_block);
>  
>  	if (write) {
>  		if (!IS_ERR(handle))
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 52883ac3cf84..2ecdb39d2424 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1526,7 +1526,7 @@ xfs_filemap_page_mkwrite(
>  	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>  
>  	if (IS_DAX(inode)) {
> -		ret = __dax_mkwrite(vma, vmf, xfs_get_blocks_dax_fault, NULL);
> +		ret = __dax_mkwrite(vma, vmf, xfs_get_blocks_dax_fault);
>  	} else {
>  		ret = block_page_mkwrite(vma, vmf, xfs_get_blocks);
>  		ret = block_page_mkwrite_return(ret);
> @@ -1560,7 +1560,7 @@ xfs_filemap_fault(
>  		 * changes to xfs_get_blocks_direct() to map unwritten extent
>  		 * ioend for conversion on read-only mappings.
>  		 */
> -		ret = __dax_fault(vma, vmf, xfs_get_blocks_dax_fault, NULL);
> +		ret = __dax_fault(vma, vmf, xfs_get_blocks_dax_fault);
>  	} else
>  		ret = filemap_fault(vma, vmf);
>  	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> @@ -1597,8 +1597,7 @@ xfs_filemap_pmd_fault(
>  	}
>  
>  	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> -	ret = __dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault,
> -			      NULL);
> +	ret = __dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault);
>  	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>  
>  	if (flags & FAULT_FLAG_WRITE)
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 636dd59ab505..7c45ac7ea1d1 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -10,10 +10,8 @@ ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t,
>  int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size);
>  int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
>  int dax_truncate_page(struct inode *, loff_t from, get_block_t);
> -int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
> -		dax_iodone_t);
> -int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
> -		dax_iodone_t);
> +int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
> +int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
>  
>  #ifdef CONFIG_FS_DAX
>  struct page *read_dax_sector(struct block_device *bdev, sector_t n);
> @@ -27,21 +25,20 @@ static inline struct page *read_dax_sector(struct block_device *bdev,
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
> -				unsigned int flags, get_block_t, dax_iodone_t);
> +				unsigned int flags, get_block_t);
>  int __dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
> -				unsigned int flags, get_block_t, dax_iodone_t);
> +				unsigned int flags, get_block_t);
>  #else
>  static inline int dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
> -				pmd_t *pmd, unsigned int flags, get_block_t gb,
> -				dax_iodone_t di)
> +				pmd_t *pmd, unsigned int flags, get_block_t gb)
>  {
>  	return VM_FAULT_FALLBACK;
>  }
>  #define __dax_pmd_fault dax_pmd_fault
>  #endif
>  int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
> -#define dax_mkwrite(vma, vmf, gb, iod)		dax_fault(vma, vmf, gb, iod)
> -#define __dax_mkwrite(vma, vmf, gb, iod)	__dax_fault(vma, vmf, gb, iod)
> +#define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
> +#define __dax_mkwrite(vma, vmf, gb)	__dax_fault(vma, vmf, gb)
>  
>  static inline bool vma_is_dax(struct vm_area_struct *vma)
>  {
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index bb703ef728d1..960fa5e0f7c3 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -72,7 +72,6 @@ typedef int (get_block_t)(struct inode *inode, sector_t iblock,
>  			struct buffer_head *bh_result, int create);
>  typedef void (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
>  			ssize_t bytes, void *private);
> -typedef void (dax_iodone_t)(struct buffer_head *bh_map, int uptodate);
>  
>  #define MAY_EXEC		0x00000001
>  #define MAY_WRITE		0x00000002
> -- 
> 2.6.2
> 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 03/10] dax: Remove complete_unwritten argument
@ 2016-03-23 17:12     ` Ross Zwisler
  0 siblings, 0 replies; 88+ messages in thread
From: Ross Zwisler @ 2016-03-23 17:12 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Wilcox, Matthew R, Ross Zwisler, Dan Williams,
	linux-nvdimm, NeilBrown

On Mon, Mar 21, 2016 at 02:22:48PM +0100, Jan Kara wrote:
> Fault handlers currently take complete_unwritten argument to convert
> unwritten extents after PTEs are updated. However no filesystem uses
> this anymore as the code is racy. Remove the unused argument.

This looks good.  Looking at this reminded me that at some point it may be
good to clean up our buffer head flags checks and make sure we don't have
checks that don't make sense - we still check for buffer_unwritten() in
buffer_written(), for instance. 

The handling of BH_New isn't consistent among filesystems, either - XFS & ext4
go out of their way to make sure BH_New is not set, while ext2 sets BH_New.

But those are separate from this patch, I think.
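
For context, the fs/dax.c helper being referred to looked roughly like this at
the time -- paraphrased from memory, so treat it as a sketch rather than the
exact code:

static bool buffer_written(struct buffer_head *bh)
{
	/* still consults buffer_unwritten() even though, after this series,
	 * no filesystem should hand DAX an unwritten buffer anymore */
	return buffer_mapped(bh) && !buffer_unwritten(bh);
}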

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/block_dev.c      |  4 ++--
>  fs/dax.c            | 43 +++++++++----------------------------------
>  fs/ext2/file.c      |  4 ++--
>  fs/ext4/file.c      |  4 ++--
>  fs/xfs/xfs_file.c   |  7 +++----
>  include/linux/dax.h | 17 +++++++----------
>  include/linux/fs.h  |  1 -
>  7 files changed, 25 insertions(+), 55 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 3172c4e2f502..a59f155f9aaf 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1746,7 +1746,7 @@ static const struct address_space_operations def_blk_aops = {
>   */
>  static int blkdev_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
> -	return __dax_fault(vma, vmf, blkdev_get_block, NULL);
> +	return __dax_fault(vma, vmf, blkdev_get_block);
>  }
>  
>  static int blkdev_dax_pfn_mkwrite(struct vm_area_struct *vma,
> @@ -1758,7 +1758,7 @@ static int blkdev_dax_pfn_mkwrite(struct vm_area_struct *vma,
>  static int blkdev_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
>  		pmd_t *pmd, unsigned int flags)
>  {
> -	return __dax_pmd_fault(vma, addr, pmd, flags, blkdev_get_block, NULL);
> +	return __dax_pmd_fault(vma, addr, pmd, flags, blkdev_get_block);
>  }
>  
>  static const struct vm_operations_struct blkdev_dax_vm_ops = {
> diff --git a/fs/dax.c b/fs/dax.c
> index b32e1b5eb8d4..d496466652cd 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -607,19 +607,13 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>   * @vma: The virtual memory area where the fault occurred
>   * @vmf: The description of the fault
>   * @get_block: The filesystem method used to translate file offsets to blocks
> - * @complete_unwritten: The filesystem method used to convert unwritten blocks
> - *	to written so the data written to them is exposed. This is required for
> - *	required by write faults for filesystems that will return unwritten
> - *	extent mappings from @get_block, but it is optional for reads as
> - *	dax_insert_mapping() will always zero unwritten blocks. If the fs does
> - *	not support unwritten extents, the it should pass NULL.
>   *
>   * When a page fault occurs, filesystems may call this helper in their
>   * fault handler for DAX files. __dax_fault() assumes the caller has done all
>   * the necessary locking for the page fault to proceed successfully.
>   */
>  int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> -			get_block_t get_block, dax_iodone_t complete_unwritten)
> +			get_block_t get_block)
>  {
>  	struct file *file = vma->vm_file;
>  	struct address_space *mapping = file->f_mapping;
> @@ -722,23 +716,9 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		page = NULL;
>  	}
>  
> -	/*
> -	 * If we successfully insert the new mapping over an unwritten extent,
> -	 * we need to ensure we convert the unwritten extent. If there is an
> -	 * error inserting the mapping, the filesystem needs to leave it as
> -	 * unwritten to prevent exposure of the stale underlying data to
> -	 * userspace, but we still need to call the completion function so
> -	 * the private resources on the mapping buffer can be released. We
> -	 * indicate what the callback should do via the uptodate variable, same
> -	 * as for normal BH based IO completions.
> -	 */
> +	/* Filesystem should not return unwritten buffers to us! */
> +	WARN_ON_ONCE(buffer_unwritten(&bh));
>  	error = dax_insert_mapping(inode, &bh, vma, vmf);
> -	if (buffer_unwritten(&bh)) {
> -		if (complete_unwritten)
> -			complete_unwritten(&bh, !error);
> -		else
> -			WARN_ON_ONCE(!(vmf->flags & FAULT_FLAG_WRITE));
> -	}
>  
>   out:
>  	if (error == -ENOMEM)
> @@ -767,7 +747,7 @@ EXPORT_SYMBOL(__dax_fault);
>   * fault handler for DAX files.
>   */
>  int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> -	      get_block_t get_block, dax_iodone_t complete_unwritten)
> +	      get_block_t get_block)
>  {
>  	int result;
>  	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> @@ -776,7 +756,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		sb_start_pagefault(sb);
>  		file_update_time(vma->vm_file);
>  	}
> -	result = __dax_fault(vma, vmf, get_block, complete_unwritten);
> +	result = __dax_fault(vma, vmf, get_block);
>  	if (vmf->flags & FAULT_FLAG_WRITE)
>  		sb_end_pagefault(sb);
>  
> @@ -810,8 +790,7 @@ static void __dax_dbg(struct buffer_head *bh, unsigned long address,
>  #define dax_pmd_dbg(bh, address, reason)	__dax_dbg(bh, address, reason, "dax_pmd")
>  
>  int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> -		pmd_t *pmd, unsigned int flags, get_block_t get_block,
> -		dax_iodone_t complete_unwritten)
> +		pmd_t *pmd, unsigned int flags, get_block_t get_block)
>  {
>  	struct file *file = vma->vm_file;
>  	struct address_space *mapping = file->f_mapping;
> @@ -870,6 +849,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  		if (get_block(inode, block, &bh, 1) != 0)
>  			return VM_FAULT_SIGBUS;
>  		alloc = true;
> +		WARN_ON_ONCE(buffer_unwritten(&bh));
>  	}
>  
>  	bdev = bh.b_bdev;
> @@ -1015,9 +995,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>   out:
>  	i_mmap_unlock_read(mapping);
>  
> -	if (buffer_unwritten(&bh))
> -		complete_unwritten(&bh, !(result & VM_FAULT_ERROR));
> -
>  	return result;
>  
>   fallback:
> @@ -1037,8 +1014,7 @@ EXPORT_SYMBOL_GPL(__dax_pmd_fault);
>   * pmd_fault handler for DAX files.
>   */
>  int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> -			pmd_t *pmd, unsigned int flags, get_block_t get_block,
> -			dax_iodone_t complete_unwritten)
> +			pmd_t *pmd, unsigned int flags, get_block_t get_block)
>  {
>  	int result;
>  	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> @@ -1047,8 +1023,7 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  		sb_start_pagefault(sb);
>  		file_update_time(vma->vm_file);
>  	}
> -	result = __dax_pmd_fault(vma, address, pmd, flags, get_block,
> -				complete_unwritten);
> +	result = __dax_pmd_fault(vma, address, pmd, flags, get_block);
>  	if (flags & FAULT_FLAG_WRITE)
>  		sb_end_pagefault(sb);
>  
> diff --git a/fs/ext2/file.c b/fs/ext2/file.c
> index c1400b109805..868c02317b05 100644
> --- a/fs/ext2/file.c
> +++ b/fs/ext2/file.c
> @@ -51,7 +51,7 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>  	}
>  	down_read(&ei->dax_sem);
>  
> -	ret = __dax_fault(vma, vmf, ext2_get_block, NULL);
> +	ret = __dax_fault(vma, vmf, ext2_get_block);
>  
>  	up_read(&ei->dax_sem);
>  	if (vmf->flags & FAULT_FLAG_WRITE)
> @@ -72,7 +72,7 @@ static int ext2_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
>  	}
>  	down_read(&ei->dax_sem);
>  
> -	ret = __dax_pmd_fault(vma, addr, pmd, flags, ext2_get_block, NULL);
> +	ret = __dax_pmd_fault(vma, addr, pmd, flags, ext2_get_block);
>  
>  	up_read(&ei->dax_sem);
>  	if (flags & FAULT_FLAG_WRITE)
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 6659e216385e..cf20040a1a49 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -207,7 +207,7 @@ static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>  	if (IS_ERR(handle))
>  		result = VM_FAULT_SIGBUS;
>  	else
> -		result = __dax_fault(vma, vmf, ext4_dax_mmap_get_block, NULL);
> +		result = __dax_fault(vma, vmf, ext4_dax_mmap_get_block);
>  
>  	if (write) {
>  		if (!IS_ERR(handle))
> @@ -243,7 +243,7 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
>  		result = VM_FAULT_SIGBUS;
>  	else
>  		result = __dax_pmd_fault(vma, addr, pmd, flags,
> -				ext4_dax_mmap_get_block, NULL);
> +				ext4_dax_mmap_get_block);
>  
>  	if (write) {
>  		if (!IS_ERR(handle))
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 52883ac3cf84..2ecdb39d2424 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1526,7 +1526,7 @@ xfs_filemap_page_mkwrite(
>  	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>  
>  	if (IS_DAX(inode)) {
> -		ret = __dax_mkwrite(vma, vmf, xfs_get_blocks_dax_fault, NULL);
> +		ret = __dax_mkwrite(vma, vmf, xfs_get_blocks_dax_fault);
>  	} else {
>  		ret = block_page_mkwrite(vma, vmf, xfs_get_blocks);
>  		ret = block_page_mkwrite_return(ret);
> @@ -1560,7 +1560,7 @@ xfs_filemap_fault(
>  		 * changes to xfs_get_blocks_direct() to map unwritten extent
>  		 * ioend for conversion on read-only mappings.
>  		 */
> -		ret = __dax_fault(vma, vmf, xfs_get_blocks_dax_fault, NULL);
> +		ret = __dax_fault(vma, vmf, xfs_get_blocks_dax_fault);
>  	} else
>  		ret = filemap_fault(vma, vmf);
>  	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> @@ -1597,8 +1597,7 @@ xfs_filemap_pmd_fault(
>  	}
>  
>  	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> -	ret = __dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault,
> -			      NULL);
> +	ret = __dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault);
>  	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>  
>  	if (flags & FAULT_FLAG_WRITE)
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 636dd59ab505..7c45ac7ea1d1 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -10,10 +10,8 @@ ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t,
>  int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size);
>  int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
>  int dax_truncate_page(struct inode *, loff_t from, get_block_t);
> -int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
> -		dax_iodone_t);
> -int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
> -		dax_iodone_t);
> +int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
> +int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
>  
>  #ifdef CONFIG_FS_DAX
>  struct page *read_dax_sector(struct block_device *bdev, sector_t n);
> @@ -27,21 +25,20 @@ static inline struct page *read_dax_sector(struct block_device *bdev,
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
> -				unsigned int flags, get_block_t, dax_iodone_t);
> +				unsigned int flags, get_block_t);
>  int __dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
> -				unsigned int flags, get_block_t, dax_iodone_t);
> +				unsigned int flags, get_block_t);
>  #else
>  static inline int dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
> -				pmd_t *pmd, unsigned int flags, get_block_t gb,
> -				dax_iodone_t di)
> +				pmd_t *pmd, unsigned int flags, get_block_t gb)
>  {
>  	return VM_FAULT_FALLBACK;
>  }
>  #define __dax_pmd_fault dax_pmd_fault
>  #endif
>  int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
> -#define dax_mkwrite(vma, vmf, gb, iod)		dax_fault(vma, vmf, gb, iod)
> -#define __dax_mkwrite(vma, vmf, gb, iod)	__dax_fault(vma, vmf, gb, iod)
> +#define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
> +#define __dax_mkwrite(vma, vmf, gb)	__dax_fault(vma, vmf, gb)
>  
>  static inline bool vma_is_dax(struct vm_area_struct *vma)
>  {
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index bb703ef728d1..960fa5e0f7c3 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -72,7 +72,6 @@ typedef int (get_block_t)(struct inode *inode, sector_t iblock,
>  			struct buffer_head *bh_result, int create);
>  typedef void (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
>  			ssize_t bytes, void *private);
> -typedef void (dax_iodone_t)(struct buffer_head *bh_map, int uptodate);
>  
>  #define MAY_EXEC		0x00000001
>  #define MAY_WRITE		0x00000002
> -- 
> 2.6.2
> 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 04/10] dax: Fix data corruption for written and mmapped files
  2016-03-21 13:22   ` Jan Kara
@ 2016-03-23 17:39     ` Ross Zwisler
  -1 siblings, 0 replies; 88+ messages in thread
From: Ross Zwisler @ 2016-03-23 17:39 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-nvdimm, NeilBrown, Wilcox

On Mon, Mar 21, 2016 at 02:22:49PM +0100, Jan Kara wrote:
> When a fault to a hole races with write filling the hole, it can happen
> that block zeroing in __dax_fault() overwrites the data copied by write.
> Since filesystem is supposed to provide pre-zeroed blocks for fault
> anyway, just remove the racy zeroing from dax code. The only catch is
> with read-faults over unwritten block where __dax_fault() filled in the
> block into page tables anyway. For that case we have to fall back to
> using hole page now.
>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/dax.c | 9 +--------
>  1 file changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index d496466652cd..50d81172438b 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -582,11 +582,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>  		error = PTR_ERR(dax.addr);
>  		goto out;
>  	}
> -
> -	if (buffer_unwritten(bh) || buffer_new(bh)) {
> -		clear_pmem(dax.addr, PAGE_SIZE);
> -		wmb_pmem();
> -	}

I agree that we should be dropping these bits of code, but I think they are
just dead code that could never be executed?  I don't see how we could have
hit a race?

For the above, dax_insert_mapping() is only called if we actually have a block
mapping (holes go through dax_load_hole()), so for ext4 and XFS I think
buffer_unwritten() and buffer_new() are always false, so this code could never
be executed, right?

I suppose that maybe we could get into here via ext2 if BH_New was set?  Is
that the race?

>  	dax_unmap_atomic(bdev, &dax);
>  
>  	error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
> @@ -665,7 +660,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  	if (error)
>  		goto unlock_page;
>  
> -	if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page) {
> +	if (!buffer_mapped(&bh) && !vmf->cow_page) {

Sure.

>  		if (vmf->flags & FAULT_FLAG_WRITE) {
>  			error = get_block(inode, block, &bh, 1);
>  			count_vm_event(PGMAJFAULT);
> @@ -950,8 +945,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  		}
>  
>  		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
> -			clear_pmem(dax.addr, PMD_SIZE);
> -			wmb_pmem();
>  			count_vm_event(PGMAJFAULT);
>  			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
>  			result |= VM_FAULT_MAJOR;

I think this whole block is just dead code, right?  Can we ever get into here?

Same argument applies as from dax_insert_mapping() - if we get this far then
we have a mapped buffer, and in the PMD case we know we're on ext4 or XFS
since ext2 doesn't do huge page mappings.

So, buffer_unwritten() and buffer_new() both always return false, right?

Yea...we really need to clean up our buffer flag handling. :)

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 04/10] dax: Fix data corruption for written and mmapped files
@ 2016-03-23 17:39     ` Ross Zwisler
  0 siblings, 0 replies; 88+ messages in thread
From: Ross Zwisler @ 2016-03-23 17:39 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Wilcox, Matthew R, Ross Zwisler, Dan Williams,
	linux-nvdimm, NeilBrown

On Mon, Mar 21, 2016 at 02:22:49PM +0100, Jan Kara wrote:
> When a fault to a hole races with write filling the hole, it can happen
> that block zeroing in __dax_fault() overwrites the data copied by write.
> Since filesystem is supposed to provide pre-zeroed blocks for fault
> anyway, just remove the racy zeroing from dax code. The only catch is
> with read-faults over unwritten block where __dax_fault() filled in the
> block into page tables anyway. For that case we have to fall back to
> using hole page now.
>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/dax.c | 9 +--------
>  1 file changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index d496466652cd..50d81172438b 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -582,11 +582,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>  		error = PTR_ERR(dax.addr);
>  		goto out;
>  	}
> -
> -	if (buffer_unwritten(bh) || buffer_new(bh)) {
> -		clear_pmem(dax.addr, PAGE_SIZE);
> -		wmb_pmem();
> -	}

I agree that we should be dropping these bits of code, but I think they are
just dead code that could never be executed?  I don't see how we could have
hit a race?

For the above, dax_insert_mapping() is only called if we actually have a block
mapping (holes go through dax_load_hole()), so for ext4 and XFS I think
buffer_unwritten() and buffer_new() are always false, so this code could never
be executed, right?

I suppose that maybe we could get into here via ext2 if BH_New was set?  Is
that the race?

>  	dax_unmap_atomic(bdev, &dax);
>  
>  	error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
> @@ -665,7 +660,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  	if (error)
>  		goto unlock_page;
>  
> -	if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page) {
> +	if (!buffer_mapped(&bh) && !vmf->cow_page) {

Sure.

>  		if (vmf->flags & FAULT_FLAG_WRITE) {
>  			error = get_block(inode, block, &bh, 1);
>  			count_vm_event(PGMAJFAULT);
> @@ -950,8 +945,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  		}
>  
>  		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
> -			clear_pmem(dax.addr, PMD_SIZE);
> -			wmb_pmem();
>  			count_vm_event(PGMAJFAULT);
>  			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
>  			result |= VM_FAULT_MAJOR;

I think this whole block is just dead code, right?  Can we ever get into here?

Same argument applies as from dax_insert_mapping() - if we get this far then
we have a mapped buffer, and in the PMD case we know we're on ext4 or XFS
since ext2 doesn't do huge page mappings.

So, buffer_unwritten() and buffer_new() both always return false, right?

Yea...we really need to clean up our buffer flag handling. :)

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 05/10] dax: Allow DAX code to replace exceptional entries
  2016-03-21 13:22   ` Jan Kara
@ 2016-03-23 17:52     ` Ross Zwisler
  -1 siblings, 0 replies; 88+ messages in thread
From: Ross Zwisler @ 2016-03-23 17:52 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-nvdimm, NeilBrown, Wilcox

On Mon, Mar 21, 2016 at 02:22:50PM +0100, Jan Kara wrote:
> Currently we forbid page_cache_tree_insert() to replace exceptional radix
> tree entries for DAX inodes. However to make DAX faults race free we will
> lock radix tree entries and when hole is created, we need to replace
> such locked radix tree entry with a hole page. So modify
> page_cache_tree_insert() to allow that.

Perhaps this is addressed later in the series, but at first glance this seems
unsafe to me - what happens if we had tasks waiting on the locked entry?  Are
they woken up when we replace the locked entry with a page cache entry?

If they are woken up, will they be alright with the fact that they were
sleeping on a lock for an exceptional entry, and now that lock no longer
exists?  The radix tree entry will now be filled with a zero page hole, and so
they can't acquire the lock they were sleeping for - instead if they wanted to
lock that page they'd need to take the zero page's page lock, right?
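
To make that concrete, here is a rough sketch of what such a woken waiter would
need to do (purely illustrative - the flow and the dax_wait_entry_unlocked()
helper are hypothetical, not taken from this series):

/*
 * Hypothetical: re-check the slot after wakeup, because the locked
 * exceptional entry may have been replaced by a hole page meanwhile.
 */
static void *get_unlocked_entry(struct address_space *mapping, pgoff_t index)
{
	void *entry;

	for (;;) {
		spin_lock_irq(&mapping->tree_lock);
		entry = radix_tree_lookup(&mapping->page_tree, index);
		if (!entry || !radix_tree_exceptional_entry(entry) ||
		    !((unsigned long)entry & DAX_ENTRY_LOCK))
			return entry;	/* tree_lock still held; caller decides */
		spin_unlock_irq(&mapping->tree_lock);
		dax_wait_entry_unlocked(mapping, index);	/* hypothetical helper */
	}
}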

> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  include/linux/dax.h |  6 ++++++
>  mm/filemap.c        | 18 +++++++++++-------
>  2 files changed, 17 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 7c45ac7ea1d1..4b63923e1f8d 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -3,8 +3,14 @@
>  
>  #include <linux/fs.h>
>  #include <linux/mm.h>
> +#include <linux/radix-tree.h>
>  #include <asm/pgtable.h>
>  
> +/*
> + * Since exceptional entries do not use indirect bit, we reuse it as a lock bit
> + */
> +#define DAX_ENTRY_LOCK RADIX_TREE_INDIRECT_PTR
> +
>  ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t,
>  		  get_block_t, dio_iodone_t, int flags);
>  int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 7c00f105845e..fbebedaf719e 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -597,14 +597,18 @@ static int page_cache_tree_insert(struct address_space *mapping,
>  		if (!radix_tree_exceptional_entry(p))
>  			return -EEXIST;
>  
> -		if (WARN_ON(dax_mapping(mapping)))
> -			return -EINVAL;
> -
> -		if (shadowp)
> -			*shadowp = p;
>  		mapping->nrexceptional--;
> -		if (node)
> -			workingset_node_shadows_dec(node);
> +		if (!dax_mapping(mapping)) {
> +			if (shadowp)
> +				*shadowp = p;
> +			if (node)
> +				workingset_node_shadows_dec(node);
> +		} else {
> +			/* DAX can replace empty locked entry with a hole */
> +			WARN_ON_ONCE(p !=
> +				(void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
> +					 DAX_ENTRY_LOCK));
> +		}
>  	}
>  	radix_tree_replace_slot(slot, page);
>  	mapping->nrpages++;
> -- 
> 2.6.2
> 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 05/10] dax: Allow DAX code to replace exceptional entries
@ 2016-03-23 17:52     ` Ross Zwisler
  0 siblings, 0 replies; 88+ messages in thread
From: Ross Zwisler @ 2016-03-23 17:52 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Wilcox, Matthew R, Ross Zwisler, Dan Williams,
	linux-nvdimm, NeilBrown

On Mon, Mar 21, 2016 at 02:22:50PM +0100, Jan Kara wrote:
> Currently we forbid page_cache_tree_insert() to replace exceptional radix
> tree entries for DAX inodes. However to make DAX faults race free we will
> lock radix tree entries and when hole is created, we need to replace
> such locked radix tree entry with a hole page. So modify
> page_cache_tree_insert() to allow that.

Perhaps this is addressed later in the series, but at first glance this seems
unsafe to me - what happens if we had tasks waiting on the locked entry?  Are
they woken up when we replace the locked entry with a page cache entry?

If they are woken up, will they be alright with the fact that they were
sleeping on a lock for an exceptional entry, and now that lock no longer
exists?  The radix tree entry will now be filled with a zero page hole, and so
they can't acquire the lock they were sleeping for - instead if they wanted to
lock that page they'd need to take the zero page's page lock, right?

> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  include/linux/dax.h |  6 ++++++
>  mm/filemap.c        | 18 +++++++++++-------
>  2 files changed, 17 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 7c45ac7ea1d1..4b63923e1f8d 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -3,8 +3,14 @@
>  
>  #include <linux/fs.h>
>  #include <linux/mm.h>
> +#include <linux/radix-tree.h>
>  #include <asm/pgtable.h>
>  
> +/*
> + * Since exceptional entries do not use indirect bit, we reuse it as a lock bit
> + */
> +#define DAX_ENTRY_LOCK RADIX_TREE_INDIRECT_PTR
> +
>  ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t,
>  		  get_block_t, dio_iodone_t, int flags);
>  int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 7c00f105845e..fbebedaf719e 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -597,14 +597,18 @@ static int page_cache_tree_insert(struct address_space *mapping,
>  		if (!radix_tree_exceptional_entry(p))
>  			return -EEXIST;
>  
> -		if (WARN_ON(dax_mapping(mapping)))
> -			return -EINVAL;
> -
> -		if (shadowp)
> -			*shadowp = p;
>  		mapping->nrexceptional--;
> -		if (node)
> -			workingset_node_shadows_dec(node);
> +		if (!dax_mapping(mapping)) {
> +			if (shadowp)
> +				*shadowp = p;
> +			if (node)
> +				workingset_node_shadows_dec(node);
> +		} else {
> +			/* DAX can replace empty locked entry with a hole */
> +			WARN_ON_ONCE(p !=
> +				(void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
> +					 DAX_ENTRY_LOCK));
> +		}
>  	}
>  	radix_tree_replace_slot(slot, page);
>  	mapping->nrpages++;
> -- 
> 2.6.2
> 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 07/10] dax: Disable huge page handling
  2016-03-21 13:22   ` Jan Kara
@ 2016-03-23 20:50     ` Ross Zwisler
  -1 siblings, 0 replies; 88+ messages in thread
From: Ross Zwisler @ 2016-03-23 20:50 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-nvdimm, NeilBrown, Wilcox

On Mon, Mar 21, 2016 at 02:22:52PM +0100, Jan Kara wrote:
> Currently the handling of huge pages for DAX is racy. For example the
> following can happen:
> 
> CPU0 (THP write fault)			CPU1 (normal read fault)
> 
> __dax_pmd_fault()			__dax_fault()
>   get_block(inode, block, &bh, 0) -> not mapped
> 					get_block(inode, block, &bh, 0)
> 					  -> not mapped
>   if (!buffer_mapped(&bh) && write)
>     get_block(inode, block, &bh, 1) -> allocates blocks
>   truncate_pagecache_range(inode, lstart, lend);
> 					dax_load_hole();
> 
> This results in data corruption since the process on CPU1 won't see the
> changes made to the file by CPU0.
> 
> The race can happen even if two normal faults race however with THP the
> situation is even worse because the two faults don't operate on the same
> entries in the radix tree and we want to use these entries for
> serialization. So disable THP support in DAX code for now.

Yep, I agree that we should disable PMD faults until we get the multi-order
radix tree work finished and integrated with this locking.

I do agree with Dan though that it would be preferable to disable PMD faults
by having CONFIG_FS_DAX_PMD depend on BROKEN.  That seems like a smaller change
and makes it easier to switch PMD faults back on for testing.

> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/dax.c            | 2 +-
>  include/linux/dax.h | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 0329ec0bee2e..444e9dd079ca 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -719,7 +719,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  }
>  EXPORT_SYMBOL_GPL(dax_fault);
>  
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#if 0
>  /*
>   * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
>   * more often than one might expect in the below function.
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 4b63923e1f8d..fd28d824254b 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -29,7 +29,7 @@ static inline struct page *read_dax_sector(struct block_device *bdev,
>  }
>  #endif
>  
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#if 0
>  int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
>  				unsigned int flags, get_block_t);
>  int __dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
> -- 
> 2.6.2
> 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 07/10] dax: Disable huge page handling
@ 2016-03-23 20:50     ` Ross Zwisler
  0 siblings, 0 replies; 88+ messages in thread
From: Ross Zwisler @ 2016-03-23 20:50 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Wilcox, Matthew R, Ross Zwisler, Dan Williams,
	linux-nvdimm, NeilBrown

On Mon, Mar 21, 2016 at 02:22:52PM +0100, Jan Kara wrote:
> Currently the handling of huge pages for DAX is racy. For example the
> following can happen:
> 
> CPU0 (THP write fault)			CPU1 (normal read fault)
> 
> __dax_pmd_fault()			__dax_fault()
>   get_block(inode, block, &bh, 0) -> not mapped
> 					get_block(inode, block, &bh, 0)
> 					  -> not mapped
>   if (!buffer_mapped(&bh) && write)
>     get_block(inode, block, &bh, 1) -> allocates blocks
>   truncate_pagecache_range(inode, lstart, lend);
> 					dax_load_hole();
> 
> This results in data corruption since the process on CPU1 won't see the
> changes made to the file by CPU0.
> 
> The race can happen even if two normal faults race however with THP the
> situation is even worse because the two faults don't operate on the same
> entries in the radix tree and we want to use these entries for
> serialization. So disable THP support in DAX code for now.

Yep, I agree that we should disable PMD faults until we get the multi-order
radix tree work finished and integrated with this locking.

I do agree with Dan though that it would be preferable to disable PMD faults
by having CONFIG_FS_DAX_PMD depend on BROKEN.  That seems like a smaller change
and makes it easier to switch PMD faults back on for testing.

> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/dax.c            | 2 +-
>  include/linux/dax.h | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 0329ec0bee2e..444e9dd079ca 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -719,7 +719,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  }
>  EXPORT_SYMBOL_GPL(dax_fault);
>  
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#if 0
>  /*
>   * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
>   * more often than one might expect in the below function.
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 4b63923e1f8d..fd28d824254b 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -29,7 +29,7 @@ static inline struct page *read_dax_sector(struct block_device *bdev,
>  }
>  #endif
>  
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#if 0
>  int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
>  				unsigned int flags, get_block_t);
>  int __dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
> -- 
> 2.6.2
> 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [RFC v2] [PATCH 0/10] DAX page fault locking
  2016-03-23 15:09     ` Jan Kara
@ 2016-03-23 20:50       ` Matthew Wilcox
  -1 siblings, 0 replies; 88+ messages in thread
From: Matthew Wilcox @ 2016-03-23 20:50 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Wilcox,
	Matthew R  <matthew.r.wilcox@intel.com>,
	NeilBrown <neilb@suse.com>,
	Kirill A. Shutemov, linux-nvdimm

On Wed, Mar 23, 2016 at 04:09:39PM +0100, Jan Kara wrote:
> On Mon 21-03-16 13:41:03, Matthew Wilcox wrote:
> > On Mon, Mar 21, 2016 at 02:22:45PM +0100, Jan Kara wrote:
> > > The basic idea is that we use a bit in an exceptional radix tree entry as
> > > a lock bit and use it similarly to how page lock is used for normal faults.
> > > That way we fix races between hole instantiation and read faults of the
> > > same index. For now I have disabled PMD faults since there the issues with
> > > page fault locking are even worse. Now that Matthew's multi-order radix tree
> > > has landed, I can have a look into using that for proper locking of PMD faults
> > > but first I want normal pages sorted out.
> > 
> > FYI, the multi-order radix tree code that landed is unusably buggy.
> > Ross and I have been working like madmen for the past three weeks to fix
> > all of the bugs we've found and not introduce new ones.  The radix tree
> > test suite has been enormously helpful in this regard, but we're still
> > finding corner cases (thanks, RCU! ;-)
> > 
> > Our current best effort can be found hiding in
> > http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/radix-fixes-2016-03-15
> > but it's for sure not ready for review yet.  I just don't want other
> > people trying to use the facility and wasting their time.
> 
> So when looking through the fixes I was wondering: Are really sibling
> entries worth it? Won't the result be simpler if we just used
> RADIX_TREE_MAP_SHIFT == 9? We would need to put slot pointers out of
> radix_tree_node structure (there'd be full page worth of them) but that's
> easy. More complications probably come from the fact that we don't want
> that unconditionally since radix tree for small files would consume
> considerably more memory and that could be an issue for some systems. For
> DAX as such we don't really care I think, at least for now, but for normal
> page cache we do. So we would have to make RADIX_TREE_MAP_SHIFT
> per-radix-tree property. What do you think? I can try to write some patches
> if you'd consider it's worth it...

I haven't tried it yet.  I think one of the problems is that there may be
architectures which have PMD_SHIFT-PAGE_SHIFT != PUD_SHIFT-PMD_SHIFT.
I have started evolving the radix tree code towards something
that can support variable height nodes (check the latest head of
radix-fixes-2016-03-15), but I didn't consider splitting the slot array
out of the radix_tree_node.

It'd absolutely be possible to mix different order nodes within the same
tree, but the problem becomes deciding when to use which shift at which
level.  If the first insertion is an order-9 entry, then that's easy, but
if you already have a few order-0 entries in a few places in an order-6
based tree then converting that tree to be order-9 based could be tricky.

Do we really want to introduce another pointer follow operation at each
level of the radix tree?  It'd be partially compensated for by having
fewer levels.  Eg: a 1TB file (with 4k pages) would have 28
bits used for index.  With the current 6-bit MAP_SHIFT, that's 5 levels.
With a 9-bit MAP_SHIFT, that's 4 levels, or 8 indirections.  I seem to
have picked the worst possible case out of thin air there ;-)  A 512GB
file would also use 5 levels with a 6-bit MAP_SHIFT and only 3 with a
9-bit MAP_SHIFT (which would be 6 indirections).
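
For reference, the depth arithmetic above is just a ceiling division
(illustrative sketch only, not code from this thread):

#include <linux/kernel.h>	/* DIV_ROUND_UP() */

/* Levels needed to cover index_bits of file offset with a given MAP_SHIFT. */
static inline unsigned int rt_levels(unsigned int index_bits, unsigned int map_shift)
{
	return DIV_ROUND_UP(index_bits, map_shift);
}

/*
 * 1TB file, 4k pages -> 28 index bits: rt_levels(28, 6) == 5, rt_levels(28, 9) == 4
 * 512GB file         -> 27 index bits: rt_levels(27, 6) == 5, rt_levels(27, 9) == 3
 * With the slots array split out of radix_tree_node, each level would cost two
 * pointer chases, hence the "8 indirections" and "6 indirections" above.
 */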

Another way we could go here is removing all the metadata from the
tree, so that each level is only a page.  We could have a metadata tree
that shadows its structure and contains the parent, shift, tags, etc.
That way the lookup would be fast and the less common operations would
be slower.

I'm going to keep going with the sibling entries, but feel free to try
other ways of organising the radix tree!  May the best one win (and may
we all contribute to the test suite ...)


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [RFC v2] [PATCH 0/10] DAX page fault locking
@ 2016-03-23 20:50       ` Matthew Wilcox
  0 siblings, 0 replies; 88+ messages in thread
From: Matthew Wilcox @ 2016-03-23 20:50 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, linux-nvdimm, NeilBrown, Wilcox, Matthew R,
	Kirill A. Shutemov

On Wed, Mar 23, 2016 at 04:09:39PM +0100, Jan Kara wrote:
> On Mon 21-03-16 13:41:03, Matthew Wilcox wrote:
> > On Mon, Mar 21, 2016 at 02:22:45PM +0100, Jan Kara wrote:
> > > The basic idea is that we use a bit in an exceptional radix tree entry as
> > > a lock bit and use it similarly to how page lock is used for normal faults.
> > > That way we fix races between hole instantiation and read faults of the
> > > same index. For now I have disabled PMD faults since there the issues with
> > > page fault locking are even worse. Now that Matthew's multi-order radix tree
> > > has landed, I can have a look into using that for proper locking of PMD faults
> > > but first I want normal pages sorted out.
> > 
> > FYI, the multi-order radix tree code that landed is unusably buggy.
> > Ross and I have been working like madmen for the past three weeks to fix
> > all of the bugs we've found and not introduce new ones.  The radix tree
> > test suite has been enormously helpful in this regard, but we're still
> > finding corner cases (thanks, RCU! ;-)
> > 
> > Our current best effort can be found hiding in
> > http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/radix-fixes-2016-03-15
> > but it's for sure not ready for review yet.  I just don't want other
> > people trying to use the facility and wasting their time.
> 
> So when looking through the fixes I was wondering: Are really sibling
> entries worth it? Won't the result be simpler if we just used
> RADIX_TREE_MAP_SHIFT == 9? We would need to put slot pointers out of
> radix_tree_node structure (there'd be full page worth of them) but that's
> easy. More complications probably come from the fact that we don't want
> that unconditionally since radix tree for small files would consume
> considerably more memory and that could be an issue for some systems. For
> DAX as such we don't really care I think, at least for now, but for normal
> page cache we do. So we would have to make RADIX_TREE_MAP_SHIFT
> per-radix-tree property. What do you think? I can try to write some patches
> if you'd consider it's worth it...

I haven't tried it yet.  I think one of the problems is that there may be
architectures which have PMD_SHIFT-PAGE_SHIFT != PUD_SHIFT-PMD_SHIFT.
I have started evolving the radix tree code towards something
that can support variable height nodes (check the latest head of
radix-fixes-2016-03-15), but I didn't consider splitting the slot array
out of the radix_tree_node.

It'd absolutely be possible to mix different order nodes within the same
tree, but the problem becomes deciding when to use which shift at which
level.  If the first insertion is an order-9 entry, then that's easy, but
if you already have a few order-0 entries in a few places in an order-6
based tree then converting that tree to be order-9 based could be tricky.

Do we really want to introduce another pointer follow operation at each
level of the radix tree?  It'd be partially compensated for by having
fewer levels.  Eg: a 1TB file (with 4k pages) would have 28
bits used for index.  With the current 6-bit MAP_SHIFT, that's 5 levels.
With a 9-bit MAP_SHIFT, that's 4 levels, or 8 indirections.  I seem to
have picked the worst possible case out of thin air there ;-)  A 512GB
file would also use 5 levels with a 6-bit MAP_SHIFT and only 3 with a
9-bit MAP_SHIFT (which would be 6 indirections).

Another way we could go here is removing all the metadata from the
tree, so that each level is only a page.  We could have a metadata tree
that shadows its structure and contains the parent, shift, tags, etc.
That way the lookup would be fast and the less common operations would
be slower.

I'm going to keep going with the sibling entries, but feel free to try
other ways of organising the radix tree!  May the best one win (and may
we all contribute to the test suite ...)


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 06/10] dax: Remove redundant inode size checks
  2016-03-21 13:22   ` Jan Kara
@ 2016-03-23 21:08     ` Ross Zwisler
  -1 siblings, 0 replies; 88+ messages in thread
From: Ross Zwisler @ 2016-03-23 21:08 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-nvdimm, NeilBrown, Wilcox

On Mon, Mar 21, 2016 at 02:22:51PM +0100, Jan Kara wrote:
> Callers of dax fault handlers must make sure these calls cannot race
> with truncate. Thus it is enough to check inode size when entering the
> function and we don't have to recheck it again later in the handler.
> Note that inode size itself can be decreased while the fault handler
> runs but filesystem locking protects against any radix tree or block
> mapping information changes resulting from the truncate and that is what
> we really care about.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/dax.c | 59 +----------------------------------------------------------
>  1 file changed, 1 insertion(+), 58 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 50d81172438b..0329ec0bee2e 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -316,20 +316,12 @@ EXPORT_SYMBOL_GPL(dax_do_io);
>  static int dax_load_hole(struct address_space *mapping, struct page *page,
>  							struct vm_fault *vmf)
>  {
> -	unsigned long size;
>  	struct inode *inode = mapping->host;

'inode' is also unused after this patch, and can be removed.

Otherwise the rest of this patch looks good to me, though Matthew might see
more as he was the original author of this code.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 06/10] dax: Remove redundant inode size checks
@ 2016-03-23 21:08     ` Ross Zwisler
  0 siblings, 0 replies; 88+ messages in thread
From: Ross Zwisler @ 2016-03-23 21:08 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Wilcox, Matthew R, Ross Zwisler, Dan Williams,
	linux-nvdimm, NeilBrown

On Mon, Mar 21, 2016 at 02:22:51PM +0100, Jan Kara wrote:
> Callers of dax fault handlers must make sure these calls cannot race
> with truncate. Thus it is enough to check inode size when entering the
> function and we don't have to recheck it again later in the handler.
> Note that inode size itself can be decreased while the fault handler
> runs but filesystem locking protects against any radix tree or block
> mapping information changes resulting from the truncate and that is what
> we really care about.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/dax.c | 59 +----------------------------------------------------------
>  1 file changed, 1 insertion(+), 58 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 50d81172438b..0329ec0bee2e 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -316,20 +316,12 @@ EXPORT_SYMBOL_GPL(dax_do_io);
>  static int dax_load_hole(struct address_space *mapping, struct page *page,
>  							struct vm_fault *vmf)
>  {
> -	unsigned long size;
>  	struct inode *inode = mapping->host;

'inode' is also unused after this patch, and can be removed.

Otherwise the rest of this patch looks good to me, though Matthew might see
more as he was the original author of this code.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [RFC v2] [PATCH 0/10] DAX page fault locking
  2016-03-23 15:09     ` Jan Kara
@ 2016-03-24 10:00       ` Matthew Wilcox
  -1 siblings, 0 replies; 88+ messages in thread
From: Matthew Wilcox @ 2016-03-24 10:00 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Wilcox,
	Matthew R  <matthew.r.wilcox@intel.com>,
	NeilBrown <neilb@suse.com>,
	Kirill A. Shutemov, linux-nvdimm

On Wed, Mar 23, 2016 at 04:09:39PM +0100, Jan Kara wrote:
> So when looking through the fixes I was wondering: Are really sibling
> entries worth it? Won't the result be simpler if we just used

I realised we could slightly simplify the pattern in each walker so they
don't have to explicitly care about sibling entries.  It ends up saving
64 bytes of text on my .config, so I think it's worth it:

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index f0f2f49..779025f 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -99,20 +99,19 @@ static inline unsigned get_sibling_offset(struct radix_tree_node *parent,
        return ptr - parent->slots;
 }
 
-static unsigned follow_sibling(struct radix_tree_node *parent,
+static unsigned rt_next_level(struct radix_tree_node *parent,
 				struct radix_tree_node **slot, unsigned offset)
 {
-	struct radix_tree_node *node = *slot;
-
-	if (!radix_tree_is_indirect_ptr(node))
-		return offset;
-
-	node = indirect_to_ptr(node);
-	if (!is_sibling_entry(parent, node))
-		return offset;
+	void **entry = rcu_dereference_raw(parent->slots[offset]);
+	if (radix_tree_is_indirect_ptr(entry)) {
+		uintptr_t siboff = entry - parent->slots;
+		if (siboff < RADIX_TREE_MAP_SIZE) {
+			offset = siboff;
+			entry = rcu_dereference_raw(parent->slots[offset]);
+		}
+	}
 
-	offset = (void **)node - parent->slots;
-	*slot = *(void **)node;
+	*slot = (void *)entry;
 	return offset;
 }
 
@@ -663,6 +662,8 @@ void *__radix_tree_lookup(struct radix_tree_root *root, unsigned long index,
 	shift = height * RADIX_TREE_MAP_SHIFT;
 
 	for (;;) {
+		unsigned offset;
+
 		if (!node)
 			return NULL;
 		if (node == RADIX_TREE_RETRY)
@@ -670,18 +671,14 @@ void *__radix_tree_lookup(struct radix_tree_root *root, unsigned long index,
 		if (!radix_tree_is_indirect_ptr(node))
 			break;
 		node = indirect_to_ptr(node);
-		if (is_sibling_entry(parent, node)) {
-			slot = (void **)node;
-			node = rcu_dereference_raw(*slot);
-			break;
-		}
 
 		BUG_ON(shift == 0);
 		shift -= RADIX_TREE_MAP_SHIFT;
 		BUG_ON(node->shift != shift);
 		parent = node;
-		slot = node->slots + ((index >> shift) & RADIX_TREE_MAP_MASK);
-		node = rcu_dereference_raw(*slot);
+		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
+		offset = rt_next_level(parent, &node, offset);
+		slot = parent->slots + offset;
 	}
 
 	if (nodep)
@@ -764,10 +761,9 @@ void *radix_tree_tag_set(struct radix_tree_root *root,
 		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
 
 		slot = indirect_to_ptr(slot);
-		next = slot->slots[offset];
+		offset = rt_next_level(slot, &next, offset);
 		BUG_ON(!next);
 
-		offset = follow_sibling(slot, &next, offset);
 		if (!tag_get(slot, tag, offset))
 			tag_set(slot, tag, offset);
 		slot = next;
@@ -819,10 +815,9 @@ void *radix_tree_tag_clear(struct radix_tree_root *root,
 		BUG_ON(shift != slot->shift);
 		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
 		node = slot;
-		slot = slot->slots[offset];
+		offset = rt_next_level(node, &slot, offset);
 		if (slot == NULL)
 			goto out;
-		offset = follow_sibling(node, &slot, offset);
 	}
 
 	if (slot == NULL)
@@ -892,11 +887,11 @@ int radix_tree_tag_get(struct radix_tree_root *root,
 
 		node = indirect_to_ptr(node);
 		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
-		next = rcu_dereference_raw(node->slots[offset]);
+		offset = rt_next_level(node, &next, offset);
+
 		if (!next)
 			return 0;
-
-		offset = follow_sibling(node, &next, offset);
+		/* RADIX_TREE_RETRY is OK here; the tag is still valid */
 		if (!tag_get(node, tag, offset))
 			return 0;
 		if (!radix_tree_is_indirect_ptr(next))
@@ -988,10 +983,9 @@ restart:
 				goto restart;
 		}
 
-		slot = rcu_dereference_raw(node->slots[offset]);
+		offset = rt_next_level(node, &slot, offset);
 		if ((slot == NULL) || (slot == RADIX_TREE_RETRY))
 			goto restart;
-		offset = follow_sibling(node, &slot, offset);
 		if (!radix_tree_is_indirect_ptr(slot))
 			break;
 		if (shift == 0)

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [RFC v2] [PATCH 0/10] DAX page fault locking
@ 2016-03-24 10:00       ` Matthew Wilcox
  0 siblings, 0 replies; 88+ messages in thread
From: Matthew Wilcox @ 2016-03-24 10:00 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, linux-nvdimm, NeilBrown, Wilcox, Matthew R,
	Kirill A. Shutemov

On Wed, Mar 23, 2016 at 04:09:39PM +0100, Jan Kara wrote:
> So when looking through the fixes I was wondering: Are really sibling
> entries worth it? Won't the result be simpler if we just used

I realised we could slightly simplify the pattern in each walker so they
don't have to explicitly care about sibling entries.  It ends up saving
64 bytes of text on my .config, so I think it's worth it:

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index f0f2f49..779025f 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -99,20 +99,19 @@ static inline unsigned get_sibling_offset(struct radix_tree_node *parent,
        return ptr - parent->slots;
 }
 
-static unsigned follow_sibling(struct radix_tree_node *parent,
+static unsigned rt_next_level(struct radix_tree_node *parent,
 				struct radix_tree_node **slot, unsigned offset)
 {
-	struct radix_tree_node *node = *slot;
-
-	if (!radix_tree_is_indirect_ptr(node))
-		return offset;
-
-	node = indirect_to_ptr(node);
-	if (!is_sibling_entry(parent, node))
-		return offset;
+	void **entry = rcu_dereference_raw(parent->slots[offset]);
+	if (radix_tree_is_indirect_ptr(entry)) {
+		uintptr_t siboff = entry - parent->slots;
+		if (siboff < RADIX_TREE_MAP_SIZE) {
+			offset = siboff;
+			entry = rcu_dereference_raw(parent->slots[offset]);
+		}
+	}
 
-	offset = (void **)node - parent->slots;
-	*slot = *(void **)node;
+	*slot = (void *)entry;
 	return offset;
 }
 
@@ -663,6 +662,8 @@ void *__radix_tree_lookup(struct radix_tree_root *root, unsigned long index,
 	shift = height * RADIX_TREE_MAP_SHIFT;
 
 	for (;;) {
+		unsigned offset;
+
 		if (!node)
 			return NULL;
 		if (node == RADIX_TREE_RETRY)
@@ -670,18 +671,14 @@ void *__radix_tree_lookup(struct radix_tree_root *root, unsigned long index,
 		if (!radix_tree_is_indirect_ptr(node))
 			break;
 		node = indirect_to_ptr(node);
-		if (is_sibling_entry(parent, node)) {
-			slot = (void **)node;
-			node = rcu_dereference_raw(*slot);
-			break;
-		}
 
 		BUG_ON(shift == 0);
 		shift -= RADIX_TREE_MAP_SHIFT;
 		BUG_ON(node->shift != shift);
 		parent = node;
-		slot = node->slots + ((index >> shift) & RADIX_TREE_MAP_MASK);
-		node = rcu_dereference_raw(*slot);
+		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
+		offset = rt_next_level(parent, &node, offset);
+		slot = parent->slots + offset;
 	}
 
 	if (nodep)
@@ -764,10 +761,9 @@ void *radix_tree_tag_set(struct radix_tree_root *root,
 		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
 
 		slot = indirect_to_ptr(slot);
-		next = slot->slots[offset];
+		offset = rt_next_level(slot, &next, offset);
 		BUG_ON(!next);
 
-		offset = follow_sibling(slot, &next, offset);
 		if (!tag_get(slot, tag, offset))
 			tag_set(slot, tag, offset);
 		slot = next;
@@ -819,10 +815,9 @@ void *radix_tree_tag_clear(struct radix_tree_root *root,
 		BUG_ON(shift != slot->shift);
 		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
 		node = slot;
-		slot = slot->slots[offset];
+		offset = rt_next_level(node, &slot, offset);
 		if (slot == NULL)
 			goto out;
-		offset = follow_sibling(node, &slot, offset);
 	}
 
 	if (slot == NULL)
@@ -892,11 +887,11 @@ int radix_tree_tag_get(struct radix_tree_root *root,
 
 		node = indirect_to_ptr(node);
 		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
-		next = rcu_dereference_raw(node->slots[offset]);
+		offset = rt_next_level(node, &next, offset);
+
 		if (!next)
 			return 0;
-
-		offset = follow_sibling(node, &next, offset);
+		/* RADIX_TREE_RETRY is OK here; the tag is still valid */
 		if (!tag_get(node, tag, offset))
 			return 0;
 		if (!radix_tree_is_indirect_ptr(next))
@@ -988,10 +983,9 @@ restart:
 				goto restart;
 		}
 
-		slot = rcu_dereference_raw(node->slots[offset]);
+		offset = rt_next_level(node, &slot, offset);
 		if ((slot == NULL) || (slot == RADIX_TREE_RETRY))
 			goto restart;
-		offset = follow_sibling(node, &slot, offset);
 		if (!radix_tree_is_indirect_ptr(slot))
 			break;
 		if (shift == 0)

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [PATCH 05/10] dax: Allow DAX code to replace exceptional entries
  2016-03-23 17:52     ` Ross Zwisler
@ 2016-03-24 10:42       ` Jan Kara
  -1 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-24 10:42 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-nvdimm, NeilBrown, Wilcox, Matthew R, linux-fsdevel

On Wed 23-03-16 11:52:58, Ross Zwisler wrote:
> On Mon, Mar 21, 2016 at 02:22:50PM +0100, Jan Kara wrote:
> > Currently we forbid page_cache_tree_insert() to replace exceptional radix
> > tree entries for DAX inodes. However to make DAX faults race free we will
> > lock radix tree entries and when hole is created, we need to replace
> > such locked radix tree entry with a hole page. So modify
> > page_cache_tree_insert() to allow that.
> 
> Perhaps this is addressed later in the series, but at first glance this seems
> unsafe to me - what happens of we had tasks waiting on the locked entry?  Are
> they woken up when we replace the locked entry with a page cache entry?
> 
> If they are woken up, will they be alright with the fact that they were
> sleeping on a lock for an exceptional entry, and now that lock no longer
> exists?  The radix tree entry will now be filled with a zero page hole, and so
> they can't acquire the lock they were sleeping for - instead if they wanted to
> lock that page they'd need to lock the zero page page lock, right?

Correct. What you describe can happen and the DAX entry locking code
accounts for that. I guess I could comment on that in the changelog.
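
For illustration, a condensed sketch of the retry a waiter does in that
situation, boiled down from the grab_mapping_entry() code in patch 08 later
in this thread (names reused from that patch; this is a sketch, not a
drop-in function):

static void *grab_entry_sketch(struct address_space *mapping, pgoff_t index)
{
	void *entry, **slot;

restart:
	spin_lock_irq(&mapping->tree_lock);
	/* sleeps (dropping tree_lock) while a locked exceptional entry exists */
	entry = lookup_unlocked_mapping_entry(mapping, index, &slot);
	if (entry && !radix_tree_exceptional_entry(entry)) {
		/* the locked entry we slept on was replaced by a hole page */
		struct page *page = entry;

		page_cache_get(page);
		spin_unlock_irq(&mapping->tree_lock);
		lock_page(page);
		/* page got truncated meanwhile? retry the whole lookup */
		if (page->mapping != mapping) {
			unlock_page(page);
			page_cache_release(page);
			goto restart;
		}
		return page;
	}
	/* still an exceptional entry (or none): lock or create it as in patch 08 */
	if (entry)
		entry = lock_slot(slot);
	spin_unlock_irq(&mapping->tree_lock);
	return entry;
}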

								Honza
> 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  include/linux/dax.h |  6 ++++++
> >  mm/filemap.c        | 18 +++++++++++-------
> >  2 files changed, 17 insertions(+), 7 deletions(-)
> > 
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index 7c45ac7ea1d1..4b63923e1f8d 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -3,8 +3,14 @@
> >  
> >  #include <linux/fs.h>
> >  #include <linux/mm.h>
> > +#include <linux/radix-tree.h>
> >  #include <asm/pgtable.h>
> >  
> > +/*
> > + * Since exceptional entries do not use indirect bit, we reuse it as a lock bit
> > + */
> > +#define DAX_ENTRY_LOCK RADIX_TREE_INDIRECT_PTR
> > +
> >  ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t,
> >  		  get_block_t, dio_iodone_t, int flags);
> >  int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size);
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 7c00f105845e..fbebedaf719e 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -597,14 +597,18 @@ static int page_cache_tree_insert(struct address_space *mapping,
> >  		if (!radix_tree_exceptional_entry(p))
> >  			return -EEXIST;
> >  
> > -		if (WARN_ON(dax_mapping(mapping)))
> > -			return -EINVAL;
> > -
> > -		if (shadowp)
> > -			*shadowp = p;
> >  		mapping->nrexceptional--;
> > -		if (node)
> > -			workingset_node_shadows_dec(node);
> > +		if (!dax_mapping(mapping)) {
> > +			if (shadowp)
> > +				*shadowp = p;
> > +			if (node)
> > +				workingset_node_shadows_dec(node);
> > +		} else {
> > +			/* DAX can replace empty locked entry with a hole */
> > +			WARN_ON_ONCE(p !=
> > +				(void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
> > +					 DAX_ENTRY_LOCK));
> > +		}
> >  	}
> >  	radix_tree_replace_slot(slot, page);
> >  	mapping->nrpages++;
> > -- 
> > 2.6.2
> > 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 05/10] dax: Allow DAX code to replace exceptional entries
@ 2016-03-24 10:42       ` Jan Kara
  0 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-24 10:42 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-fsdevel, Wilcox, Matthew R, Dan Williams,
	linux-nvdimm, NeilBrown

On Wed 23-03-16 11:52:58, Ross Zwisler wrote:
> On Mon, Mar 21, 2016 at 02:22:50PM +0100, Jan Kara wrote:
> > Currently we forbid page_cache_tree_insert() to replace exceptional radix
> > tree entries for DAX inodes. However to make DAX faults race free we will
> > lock radix tree entries and when hole is created, we need to replace
> > such locked radix tree entry with a hole page. So modify
> > page_cache_tree_insert() to allow that.
> 
> Perhaps this is addressed later in the series, but at first glance this seems
> unsafe to me - what happens of we had tasks waiting on the locked entry?  Are
> they woken up when we replace the locked entry with a page cache entry?
> 
> If they are woken up, will they be alright with the fact that they were
> sleeping on a lock for an exceptional entry, and now that lock no longer
> exists?  The radix tree entry will now be filled with a zero page hole, and so
> they can't acquire the lock they were sleeping for - instead if they wanted to
> lock that page they'd need to lock the zero page page lock, right?

Correct. What you describe can happen and the DAX entry locking code
accounts for that. I guess I could comment on that in the changelog.

								Honza
> 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  include/linux/dax.h |  6 ++++++
> >  mm/filemap.c        | 18 +++++++++++-------
> >  2 files changed, 17 insertions(+), 7 deletions(-)
> > 
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index 7c45ac7ea1d1..4b63923e1f8d 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -3,8 +3,14 @@
> >  
> >  #include <linux/fs.h>
> >  #include <linux/mm.h>
> > +#include <linux/radix-tree.h>
> >  #include <asm/pgtable.h>
> >  
> > +/*
> > + * Since exceptional entries do not use indirect bit, we reuse it as a lock bit
> > + */
> > +#define DAX_ENTRY_LOCK RADIX_TREE_INDIRECT_PTR
> > +
> >  ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t,
> >  		  get_block_t, dio_iodone_t, int flags);
> >  int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size);
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 7c00f105845e..fbebedaf719e 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -597,14 +597,18 @@ static int page_cache_tree_insert(struct address_space *mapping,
> >  		if (!radix_tree_exceptional_entry(p))
> >  			return -EEXIST;
> >  
> > -		if (WARN_ON(dax_mapping(mapping)))
> > -			return -EINVAL;
> > -
> > -		if (shadowp)
> > -			*shadowp = p;
> >  		mapping->nrexceptional--;
> > -		if (node)
> > -			workingset_node_shadows_dec(node);
> > +		if (!dax_mapping(mapping)) {
> > +			if (shadowp)
> > +				*shadowp = p;
> > +			if (node)
> > +				workingset_node_shadows_dec(node);
> > +		} else {
> > +			/* DAX can replace empty locked entry with a hole */
> > +			WARN_ON_ONCE(p !=
> > +				(void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
> > +					 DAX_ENTRY_LOCK));
> > +		}
> >  	}
> >  	radix_tree_replace_slot(slot, page);
> >  	mapping->nrpages++;
> > -- 
> > 2.6.2
> > 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 02/10] radix-tree: make 'indirect' bit available to exception entries.
  2016-03-23 16:41             ` Ross Zwisler
@ 2016-03-24 12:31               ` Jan Kara
  -1 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-24 12:31 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-nvdimm, NeilBrown, Wilcox, Matthew R, linux-fsdevel

On Wed 23-03-16 10:41:44, Ross Zwisler wrote:
> On Tue, Mar 22, 2016 at 11:37:54AM +0100, Jan Kara wrote:
> > On Tue 22-03-16 05:27:08, Matthew Wilcox wrote:
> > > On Tue, Mar 22, 2016 at 10:12:32AM +0100, Jan Kara wrote:
> > > > 		if (unlikely(!page))	// False since
> > > > 					// RADIX_TREE_INDIRECT_PTR is set
> > > > 		if (radix_tree_exception(page))	// False - no exeptional bit
> > > 
> > > Oops, you got confused:
> > > 
> > > static inline int radix_tree_exception(void *arg)
> > > {
> > >         return unlikely((unsigned long)arg &
> > >                 (RADIX_TREE_INDIRECT_PTR | RADIX_TREE_EXCEPTIONAL_ENTRY));
> > > }
> > 
> > Ah, I've confused radix_tree_exception() and
> > radix_tree_exceptional_entry(). OK, so your code works AFAICT. But using
> > RADIX_TREE_RETRY still doesn't make things clearer to me - you still need
> > to check for INDIRECT bit in the retry logic to catch the
> > radix_tree_extend() race as well...
> > 
> > As a side note I think we should do away with radix_tree_exception() - it
> > isn't very useful (doesn't simplify any of its callers) and only creates
> > possibility for confusion.
> 
> Perhaps it would be clearer if we explicitly enumerated the four radix tree
> entry types?
> 
> #define RADIX_TREE_TYPE_MASK		3
> 
> #define	RADIX_TREE_TYPE_DATA		0
> #define RADIX_TREE_TYPE_INDIRECT	1
> #define	RADIX_TREE_TYPE_EXCEPTIONAL	2
> #define RADIX_TREE_TYPE_LOCKED_EXC	3
> 
> This would make radix_tree_exception (which we could rename so it doesn't
> get confused with "exceptional" entries):
> 
> static inline int radix_tree_non_data(void *arg)
> {
>         return unlikely((unsigned long)arg & RADIX_TREE_TYPE_MASK);
> }
> 
> Etc?  I guess we'd have to code it up and see if the result was simpler, but
> it seems like it might be.

Well, for now I have decided to postpone the tricks for saving bits in
exceptional entries and just use bit 2 as the lock bit for DAX exceptional
entries, because the retry logic in the RCU walking code got rather
convoluted with that approach. If we ever feel we are running out of bits in
the entry, we can always look again at compressing the contents more.
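
As a rough sketch of what that plan could look like (hypothetical; the names
and helper below are illustrative, not taken from a posted patch), the
exceptional entry keeps the low radix-tree type bits and gains a separate
lock bit above them:

/* hypothetical sketch only -- not from a posted patch */
#define RADIX_DAX_ENTRY_LOCK	(1UL << 2)	/* bit 2 of the entry value */

static inline bool radix_dax_entry_locked(void *entry)
{
	return (unsigned long)entry & RADIX_DAX_ENTRY_LOCK;
}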

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 02/10] radix-tree: make 'indirect' bit available to exception entries.
@ 2016-03-24 12:31               ` Jan Kara
  0 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-24 12:31 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, Matthew Wilcox, linux-fsdevel, Wilcox, Matthew R,
	NeilBrown, linux-nvdimm

On Wed 23-03-16 10:41:44, Ross Zwisler wrote:
> On Tue, Mar 22, 2016 at 11:37:54AM +0100, Jan Kara wrote:
> > On Tue 22-03-16 05:27:08, Matthew Wilcox wrote:
> > > On Tue, Mar 22, 2016 at 10:12:32AM +0100, Jan Kara wrote:
> > > > 		if (unlikely(!page))	// False since
> > > > 					// RADIX_TREE_INDIRECT_PTR is set
> > > > 		if (radix_tree_exception(page))	// False - no exeptional bit
> > > 
> > > Oops, you got confused:
> > > 
> > > static inline int radix_tree_exception(void *arg)
> > > {
> > >         return unlikely((unsigned long)arg &
> > >                 (RADIX_TREE_INDIRECT_PTR | RADIX_TREE_EXCEPTIONAL_ENTRY));
> > > }
> > 
> > Ah, I've confused radix_tree_exception() and
> > radix_tree_exceptional_entry(). OK, so your code works AFAICT. But using
> > RADIX_TREE_RETRY still doesn't make things clearer to me - you still need
> > to check for INDIRECT bit in the retry logic to catch the
> > radix_tree_extend() race as well...
> > 
> > As a side note I think we should do away with radix_tree_exception() - it
> > isn't very useful (doesn't simplify any of its callers) and only creates
> > possibility for confusion.
> 
> Perhaps it would be clearer if we explicitly enumerated the four radix tree
> entry types?
> 
> #define RADIX_TREE_TYPE_MASK		3
> 
> #define	RADIX_TREE_TYPE_DATA		0
> #define RADIX_TREE_TYPE_INDIRECT	1
> #define	RADIX_TREE_TYPE_EXCEPTIONAL	2
> #define RADIX_TREE_TYPE_LOCKED_EXC	3
> 
> This would make radix_tree_exception (which we could rename so it doesn't
> get confused with "exceptional" entries):
> 
> static inline int radix_tree_non_data(void *arg)
> {
>         return unlikely((unsigned long)arg & RADIX_TREE_TYPE_MASK);
> }
> 
> Etc?  I guess we'd have to code it up and see if the result was simpler, but
> it seems like it might be.

Well, for now I have decided to postpone the tricks for saving bits in
exceptional entries and just use bit 2 as the lock bit for DAX exceptional
entries, because the retry logic in the RCU walking code got rather
convoluted with that approach. If we ever feel we are running out of bits in
the entry, we can always look again at compressing the contents more.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 03/10] dax: Remove complete_unwritten argument
  2016-03-23 17:12     ` Ross Zwisler
@ 2016-03-24 12:32       ` Jan Kara
  -1 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-24 12:32 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-nvdimm, NeilBrown, Wilcox, Matthew R, linux-fsdevel

On Wed 23-03-16 11:12:26, Ross Zwisler wrote:
> On Mon, Mar 21, 2016 at 02:22:48PM +0100, Jan Kara wrote:
> > Fault handlers currently take complete_unwritten argument to convert
> > unwritten extents after PTEs are updated. However no filesystem uses
> > this anymore as the code is racy. Remove the unused argument.
> 
> This looks good.  Looking at this reminded me that at some point it may be
> good to clean up our buffer head flags checks and make sure we don't have
> checks that don't make sense - we still check for buffer_unwritten() in
> buffer_written(), for instance. 
> 
> The handling of BH_New isn't consistent among filesystems, either - XFS & ext4
> go out of their way to make sure BH_New is not set, while ext2 sets BH_New.
> 
> But I think those are separate from this patch, I think.

Yes, that's a separate thing and I think I've handled most of it in the
following patches. Thanks for the review!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 03/10] dax: Remove complete_unwritten argument
@ 2016-03-24 12:32       ` Jan Kara
  0 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-24 12:32 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-fsdevel, Wilcox, Matthew R, Dan Williams,
	linux-nvdimm, NeilBrown

On Wed 23-03-16 11:12:26, Ross Zwisler wrote:
> On Mon, Mar 21, 2016 at 02:22:48PM +0100, Jan Kara wrote:
> > Fault handlers currently take complete_unwritten argument to convert
> > unwritten extents after PTEs are updated. However no filesystem uses
> > this anymore as the code is racy. Remove the unused argument.
> 
> This looks good.  Looking at this reminded me that at some point it may be
> good to clean up our buffer head flags checks and make sure we don't have
> checks that don't make sense - we still check for buffer_unwritten() in
> buffer_written(), for instance. 
> 
> The handling of BH_New isn't consistent among filesystems, either - XFS & ext4
> go out of their way to make sure BH_New is not set, while ext2 sets BH_New.
> 
> But I think those are separate from this patch, I think.

Yes, that's a separate thing and I think I've handled most of it in the
following patches. Thanks for the review!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 04/10] dax: Fix data corruption for written and mmapped files
  2016-03-23 17:39     ` Ross Zwisler
@ 2016-03-24 12:51       ` Jan Kara
  -1 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-24 12:51 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-nvdimm, NeilBrown, Wilcox, Matthew R, linux-fsdevel

On Wed 23-03-16 11:39:45, Ross Zwisler wrote:
> On Mon, Mar 21, 2016 at 02:22:49PM +0100, Jan Kara wrote:
> > When a fault to a hole races with write filling the hole, it can happen
> > that block zeroing in __dax_fault() overwrites the data copied by write.
> > Since filesystem is supposed to provide pre-zeroed blocks for fault
> > anyway, just remove the racy zeroing from dax code. The only catch is
> > with read-faults over unwritten block where __dax_fault() filled in the
> > block into page tables anyway. For that case we have to fall back to
> > using hole page now.
> >
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/dax.c | 9 +--------
> >  1 file changed, 1 insertion(+), 8 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index d496466652cd..50d81172438b 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -582,11 +582,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> >  		error = PTR_ERR(dax.addr);
> >  		goto out;
> >  	}
> > -
> > -	if (buffer_unwritten(bh) || buffer_new(bh)) {
> > -		clear_pmem(dax.addr, PAGE_SIZE);
> > -		wmb_pmem();
> > -	}
> 
> I agree that we should be dropping these bits of code, but I think they are
> just dead code that could never be executed?  I don't see how we could have
> hit a race?
> 
> For the above, dax_insert_mapping() is only called if we actually have a block
> mapping (holes go through dax_load_hole()), so for ext4 and XFS I think
> buffer_unwritten() and buffer_new() are always false, so this code could never
> be executed, right?
> 
> I suppose that maybe we could get into here via ext2 if BH_New was set?  Is
> that the race?

Yeah, you are right that only ext2 is prone to the race I have described;
for the rest this should just be dead code. I'll update the changelog
accordingly.

> >  		if (vmf->flags & FAULT_FLAG_WRITE) {
> >  			error = get_block(inode, block, &bh, 1);
> >  			count_vm_event(PGMAJFAULT);
> > @@ -950,8 +945,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >  		}
> >  
> >  		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
> > -			clear_pmem(dax.addr, PMD_SIZE);
> > -			wmb_pmem();
> >  			count_vm_event(PGMAJFAULT);
> >  			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
> >  			result |= VM_FAULT_MAJOR;
> 
> I think this whole block is just dead code, right?  Can we ever get into here?
> 
> Same argument applies as from dax_insert_mapping() - if we get this far then
> we have a mapped buffer, and in the PMD case we know we're on ext4 of XFS
> since ext2 doesn't do huge page mappings.
> 
> So, buffer_unwritten() and buffer_new() both always return false, right?
> 
> Yea...we really need to clean up our buffer flag handling. :)

Hum, looking at the code now I'm somewhat confused. __dax_pmd_fault does:

if (!write && !buffer_mapped(&bh) && buffer_uptodate(&bh)) {
	... install zero page ...
}

but what is the buffer_uptodate() check about? That will never be true,
right? So we will fall back to the second branch and there we can actually
hit the

if (buffer_unwritten(&bh) || buffer_new(&bh)) {

because for a read fault we can get an unwritten buffer. But I guess that is a
mistake in the first branch. After fixing that we can just remove the
second if as you say. Unless you object, I'll update the patch
accordingly.
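
As a sketch of the fix to the first branch being described here (an
assumption about the follow-up patch, written in the same shorthand as the
fragment above, not quoted from committed code):

if (!write && (!buffer_mapped(&bh) || buffer_unwritten(&bh))) {
	/* ... install zero page, also for read faults over unwritten extents ... */
}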

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 04/10] dax: Fix data corruption for written and mmapped files
@ 2016-03-24 12:51       ` Jan Kara
  0 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-24 12:51 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-fsdevel, Wilcox, Matthew R, Dan Williams,
	linux-nvdimm, NeilBrown

On Wed 23-03-16 11:39:45, Ross Zwisler wrote:
> On Mon, Mar 21, 2016 at 02:22:49PM +0100, Jan Kara wrote:
> > When a fault to a hole races with write filling the hole, it can happen
> > that block zeroing in __dax_fault() overwrites the data copied by write.
> > Since filesystem is supposed to provide pre-zeroed blocks for fault
> > anyway, just remove the racy zeroing from dax code. The only catch is
> > with read-faults over unwritten block where __dax_fault() filled in the
> > block into page tables anyway. For that case we have to fall back to
> > using hole page now.
> >
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/dax.c | 9 +--------
> >  1 file changed, 1 insertion(+), 8 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index d496466652cd..50d81172438b 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -582,11 +582,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> >  		error = PTR_ERR(dax.addr);
> >  		goto out;
> >  	}
> > -
> > -	if (buffer_unwritten(bh) || buffer_new(bh)) {
> > -		clear_pmem(dax.addr, PAGE_SIZE);
> > -		wmb_pmem();
> > -	}
> 
> I agree that we should be dropping these bits of code, but I think they are
> just dead code that could never be executed?  I don't see how we could have
> hit a race?
> 
> For the above, dax_insert_mapping() is only called if we actually have a block
> mapping (holes go through dax_load_hole()), so for ext4 and XFS I think
> buffer_unwritten() and buffer_new() are always false, so this code could never
> be executed, right?
> 
> I suppose that maybe we could get into here via ext2 if BH_New was set?  Is
> that the race?

Yeah, you are right that only ext2 is prone to the race I have described;
for the rest this should just be dead code. I'll update the changelog
accordingly.

> >  		if (vmf->flags & FAULT_FLAG_WRITE) {
> >  			error = get_block(inode, block, &bh, 1);
> >  			count_vm_event(PGMAJFAULT);
> > @@ -950,8 +945,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >  		}
> >  
> >  		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
> > -			clear_pmem(dax.addr, PMD_SIZE);
> > -			wmb_pmem();
> >  			count_vm_event(PGMAJFAULT);
> >  			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
> >  			result |= VM_FAULT_MAJOR;
> 
> I think this whole block is just dead code, right?  Can we ever get into here?
> 
> Same argument applies as from dax_insert_mapping() - if we get this far then
> we have a mapped buffer, and in the PMD case we know we're on ext4 of XFS
> since ext2 doesn't do huge page mappings.
> 
> So, buffer_unwritten() and buffer_new() both always return false, right?
> 
> Yea...we really need to clean up our buffer flag handling. :)

Hum, looking at the code now I'm somewhat confused. __dax_pmd_fault does:

if (!write && !buffer_mapped(&bh) && buffer_uptodate(&bh)) {
	... install zero page ...
}

but what is the buffer_uptodate() check about? That will never be true,
right? So we will fall back to the second branch and there we can actually
hit the

if (buffer_unwritten(&bh) || buffer_new(&bh)) {

because for a read fault we can get an unwritten buffer. But I guess that is a
mistake in the first branch. After fixing that we can just remove the
second if as you say. Unless you object, I'll update the patch
accordingly.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 07/10] dax: Disable huge page handling
  2016-03-23 20:50     ` Ross Zwisler
@ 2016-03-24 12:56       ` Jan Kara
  -1 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-24 12:56 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-nvdimm, NeilBrown, Wilcox, Matthew R, linux-fsdevel

On Wed 23-03-16 14:50:00, Ross Zwisler wrote:
> On Mon, Mar 21, 2016 at 02:22:52PM +0100, Jan Kara wrote:
> > Currently the handling of huge pages for DAX is racy. For example the
> > following can happen:
> > 
> > CPU0 (THP write fault)			CPU1 (normal read fault)
> > 
> > __dax_pmd_fault()			__dax_fault()
> >   get_block(inode, block, &bh, 0) -> not mapped
> > 					get_block(inode, block, &bh, 0)
> > 					  -> not mapped
> >   if (!buffer_mapped(&bh) && write)
> >     get_block(inode, block, &bh, 1) -> allocates blocks
> >   truncate_pagecache_range(inode, lstart, lend);
> > 					dax_load_hole();
> > 
> > This results in data corruption since process on CPU1 won't see changes
> > into the file done by CPU0.
> > 
> > The race can happen even if two normal faults race however with THP the
> > situation is even worse because the two faults don't operate on the same
> > entries in the radix tree and we want to use these entries for
> > serialization. So disable THP support in DAX code for now.
> 
> Yep, I agree that we should disable PMD faults until we get the multi-order
> radix tree work finished and integrated with this locking.
> 
> I do agree with Dan though that it would be preferable to disable PMD faults
> by having CONFIG_FS_DAX_PMD depend on BROKEN.  That seems smaller and easier
> to switch PMD faults back on for testing.

I did it this way because I wasn't sure the PMD fault code even compiles and I
didn't want to break CONFIG_BROKEN builds. But OK, I'll make the code at
least compile and do what you say.

								Honza

> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/dax.c            | 2 +-
> >  include/linux/dax.h | 2 +-
> >  2 files changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 0329ec0bee2e..444e9dd079ca 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -719,7 +719,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  }
> >  EXPORT_SYMBOL_GPL(dax_fault);
> >  
> > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +#if 0
> >  /*
> >   * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
> >   * more often than one might expect in the below function.
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index 4b63923e1f8d..fd28d824254b 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -29,7 +29,7 @@ static inline struct page *read_dax_sector(struct block_device *bdev,
> >  }
> >  #endif
> >  
> > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +#if 0
> >  int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
> >  				unsigned int flags, get_block_t);
> >  int __dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
> > -- 
> > 2.6.2
> > 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 07/10] dax: Disable huge page handling
@ 2016-03-24 12:56       ` Jan Kara
  0 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-24 12:56 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-fsdevel, Wilcox, Matthew R, Dan Williams,
	linux-nvdimm, NeilBrown

On Wed 23-03-16 14:50:00, Ross Zwisler wrote:
> On Mon, Mar 21, 2016 at 02:22:52PM +0100, Jan Kara wrote:
> > Currently the handling of huge pages for DAX is racy. For example the
> > following can happen:
> > 
> > CPU0 (THP write fault)			CPU1 (normal read fault)
> > 
> > __dax_pmd_fault()			__dax_fault()
> >   get_block(inode, block, &bh, 0) -> not mapped
> > 					get_block(inode, block, &bh, 0)
> > 					  -> not mapped
> >   if (!buffer_mapped(&bh) && write)
> >     get_block(inode, block, &bh, 1) -> allocates blocks
> >   truncate_pagecache_range(inode, lstart, lend);
> > 					dax_load_hole();
> > 
> > This results in data corruption since process on CPU1 won't see changes
> > into the file done by CPU0.
> > 
> > The race can happen even if two normal faults race however with THP the
> > situation is even worse because the two faults don't operate on the same
> > entries in the radix tree and we want to use these entries for
> > serialization. So disable THP support in DAX code for now.
> 
> Yep, I agree that we should disable PMD faults until we get the multi-order
> radix tree work finished and integrated with this locking.
> 
> I do agree with Dan though that it would be preferable to disable PMD faults
> by having CONFIG_FS_DAX_PMD depend on BROKEN.  That seems smaller and easier
> to switch PMD faults back on for testing.

I did it this way because I wasn't sure the PMD fault code even compiles and I
didn't want to break CONFIG_BROKEN builds. But OK, I'll make the code at
least compile and do what you say.

								Honza

> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/dax.c            | 2 +-
> >  include/linux/dax.h | 2 +-
> >  2 files changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 0329ec0bee2e..444e9dd079ca 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -719,7 +719,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  }
> >  EXPORT_SYMBOL_GPL(dax_fault);
> >  
> > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +#if 0
> >  /*
> >   * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
> >   * more often than one might expect in the below function.
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index 4b63923e1f8d..fd28d824254b 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -29,7 +29,7 @@ static inline struct page *read_dax_sector(struct block_device *bdev,
> >  }
> >  #endif
> >  
> > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +#if 0
> >  int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
> >  				unsigned int flags, get_block_t);
> >  int __dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
> > -- 
> > 2.6.2
> > 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 04/10] dax: Fix data corruption for written and mmapped files
  2016-03-24 12:51       ` Jan Kara
@ 2016-03-29 15:17         ` Ross Zwisler
  -1 siblings, 0 replies; 88+ messages in thread
From: Ross Zwisler @ 2016-03-29 15:17 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-nvdimm, NeilBrown, Wilcox

On Thu, Mar 24, 2016 at 01:51:12PM +0100, Jan Kara wrote:
> On Wed 23-03-16 11:39:45, Ross Zwisler wrote:
> > On Mon, Mar 21, 2016 at 02:22:49PM +0100, Jan Kara wrote:
> > > When a fault to a hole races with write filling the hole, it can happen
> > > that block zeroing in __dax_fault() overwrites the data copied by write.
> > > Since filesystem is supposed to provide pre-zeroed blocks for fault
> > > anyway, just remove the racy zeroing from dax code. The only catch is
> > > with read-faults over unwritten block where __dax_fault() filled in the
> > > block into page tables anyway. For that case we have to fall back to
> > > using hole page now.
> > >
> > > Signed-off-by: Jan Kara <jack@suse.cz>
> > > ---
> > >  fs/dax.c | 9 +--------
> > >  1 file changed, 1 insertion(+), 8 deletions(-)
> > > 
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index d496466652cd..50d81172438b 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -582,11 +582,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> > >  		error = PTR_ERR(dax.addr);
> > >  		goto out;
> > >  	}
> > > -
> > > -	if (buffer_unwritten(bh) || buffer_new(bh)) {
> > > -		clear_pmem(dax.addr, PAGE_SIZE);
> > > -		wmb_pmem();
> > > -	}
> > 
> > I agree that we should be dropping these bits of code, but I think they are
> > just dead code that could never be executed?  I don't see how we could have
> > hit a race?
> > 
> > For the above, dax_insert_mapping() is only called if we actually have a block
> > mapping (holes go through dax_load_hole()), so for ext4 and XFS I think
> > buffer_unwritten() and buffer_new() are always false, so this code could never
> > be executed, right?
> > 
> > I suppose that maybe we could get into here via ext2 if BH_New was set?  Is
> > that the race?
> 
> Yeah, you are right that only ext2 is prone to the race I have described
> since for the rest this should be just a dead code. I'll update the changelog
> in this sense.

What do you think about updating ext2 so that, like ext4 and XFS, it never
returns BH_New?  AFAICT ext2 doesn't rely on DAX to clear the sectors it
returns - it does that in ext2_get_blocks() via dax_clear_sectors(), right?

Or, really, I guess we could just leave ext2 alone and let it return BH_New,
and just make sure that DAX doesn't do anything with it.

> > >  		if (vmf->flags & FAULT_FLAG_WRITE) {
> > >  			error = get_block(inode, block, &bh, 1);
> > >  			count_vm_event(PGMAJFAULT);
> > > @@ -950,8 +945,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> > >  		}
> > >  
> > >  		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
> > > -			clear_pmem(dax.addr, PMD_SIZE);
> > > -			wmb_pmem();
> > >  			count_vm_event(PGMAJFAULT);
> > >  			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
> > >  			result |= VM_FAULT_MAJOR;
> > 
> > I think this whole block is just dead code, right?  Can we ever get into here?
> > 
> > Same argument applies as from dax_insert_mapping() - if we get this far then
> > we have a mapped buffer, and in the PMD case we know we're on ext4 of XFS
> > since ext2 doesn't do huge page mappings.
> > 
> > So, buffer_unwritten() and buffer_new() both always return false, right?
> > 
> > Yea...we really need to clean up our buffer flag handling. :)
> 
> Hum, looking at the code now I'm somewhat confused. __dax_pmd_fault does:
> 
> if (!write && !buffer_mapped(&bh) && buffer_uptodate(&bh)) {
> 	... install zero page ...
> }
> 
> but what the buffer_update() check is about? That will never be true,
> right? So we will fall back to the second branch and there we can actually
> hit the
> 
> if (buffer_unwritten(&bh) || buffer_new(&bh)) {
> 
> because for read fault we can get unwritten buffer. But I guess that is a
> mistake in the first branch. After fixing that we can just remove the
> second if as you say. Unless you object, I'll update the patch in this
> sense.

I can't remember if I've ever seen this code get executed - I *think* that
when we hit a hole we always drop back and do 4k zero pages via this code:

	/*
	 * If the filesystem isn't willing to tell us the length of a hole,
	 * just fall back to PTEs.  Calling get_block 512 times in a loop
	 * would be silly.
	 */
	if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE) {
		dax_pmd_dbg(&bh, address, "allocated block too small");
		return VM_FAULT_FALLBACK;
	}

I agree that this could probably use some cleanup and additional testing.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 04/10] dax: Fix data corruption for written and mmapped files
@ 2016-03-29 15:17         ` Ross Zwisler
  0 siblings, 0 replies; 88+ messages in thread
From: Ross Zwisler @ 2016-03-29 15:17 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-fsdevel, Wilcox, Matthew R, Dan Williams,
	linux-nvdimm, NeilBrown

On Thu, Mar 24, 2016 at 01:51:12PM +0100, Jan Kara wrote:
> On Wed 23-03-16 11:39:45, Ross Zwisler wrote:
> > On Mon, Mar 21, 2016 at 02:22:49PM +0100, Jan Kara wrote:
> > > When a fault to a hole races with write filling the hole, it can happen
> > > that block zeroing in __dax_fault() overwrites the data copied by write.
> > > Since filesystem is supposed to provide pre-zeroed blocks for fault
> > > anyway, just remove the racy zeroing from dax code. The only catch is
> > > with read-faults over unwritten block where __dax_fault() filled in the
> > > block into page tables anyway. For that case we have to fall back to
> > > using hole page now.
> > >
> > > Signed-off-by: Jan Kara <jack@suse.cz>
> > > ---
> > >  fs/dax.c | 9 +--------
> > >  1 file changed, 1 insertion(+), 8 deletions(-)
> > > 
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index d496466652cd..50d81172438b 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -582,11 +582,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> > >  		error = PTR_ERR(dax.addr);
> > >  		goto out;
> > >  	}
> > > -
> > > -	if (buffer_unwritten(bh) || buffer_new(bh)) {
> > > -		clear_pmem(dax.addr, PAGE_SIZE);
> > > -		wmb_pmem();
> > > -	}
> > 
> > I agree that we should be dropping these bits of code, but I think they are
> > just dead code that could never be executed?  I don't see how we could have
> > hit a race?
> > 
> > For the above, dax_insert_mapping() is only called if we actually have a block
> > mapping (holes go through dax_load_hole()), so for ext4 and XFS I think
> > buffer_unwritten() and buffer_new() are always false, so this code could never
> > be executed, right?
> > 
> > I suppose that maybe we could get into here via ext2 if BH_New was set?  Is
> > that the race?
> 
> Yeah, you are right that only ext2 is prone to the race I have described
> since for the rest this should be just a dead code. I'll update the changelog
> in this sense.

What do you think about updating ext2 so that, like ext4 and XFS, it never
returns BH_New?  AFAICT ext2 doesn't rely on DAX to clear the sectors it
returns - it does that in ext2_get_blocks() via dax_clear_sectors(), right?

Or, really, I guess we could just leave ext2 alone and let it return BH_New,
and just make sure that DAX doesn't do anything with it.

> > >  		if (vmf->flags & FAULT_FLAG_WRITE) {
> > >  			error = get_block(inode, block, &bh, 1);
> > >  			count_vm_event(PGMAJFAULT);
> > > @@ -950,8 +945,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> > >  		}
> > >  
> > >  		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
> > > -			clear_pmem(dax.addr, PMD_SIZE);
> > > -			wmb_pmem();
> > >  			count_vm_event(PGMAJFAULT);
> > >  			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
> > >  			result |= VM_FAULT_MAJOR;
> > 
> > I think this whole block is just dead code, right?  Can we ever get into here?
> > 
> > Same argument applies as from dax_insert_mapping() - if we get this far then
> > we have a mapped buffer, and in the PMD case we know we're on ext4 of XFS
> > since ext2 doesn't do huge page mappings.
> > 
> > So, buffer_unwritten() and buffer_new() both always return false, right?
> > 
> > Yea...we really need to clean up our buffer flag handling. :)
> 
> Hum, looking at the code now I'm somewhat confused. __dax_pmd_fault does:
> 
> if (!write && !buffer_mapped(&bh) && buffer_uptodate(&bh)) {
> 	... install zero page ...
> }
> 
> but what the buffer_update() check is about? That will never be true,
> right? So we will fall back to the second branch and there we can actually
> hit the
> 
> if (buffer_unwritten(&bh) || buffer_new(&bh)) {
> 
> because for read fault we can get unwritten buffer. But I guess that is a
> mistake in the first branch. After fixing that we can just remove the
> second if as you say. Unless you object, I'll update the patch in this
> sense.

I can't remember if I've ever seen this code get executed - I *think* that
when we hit a hole we always drop back and do 4k zero pages via this code:

	/*
	 * If the filesystem isn't willing to tell us the length of a hole,
	 * just fall back to PTEs.  Calling get_block 512 times in a loop
	 * would be silly.
	 */
	if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE) {
		dax_pmd_dbg(&bh, address, "allocated block too small");
		return VM_FAULT_FALLBACK;
	}

I agree that this could probably use some cleanup and additional testing.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/10] dax: New fault locking
  2016-03-21 13:22   ` Jan Kara
@ 2016-03-29 21:57     ` Ross Zwisler
  -1 siblings, 0 replies; 88+ messages in thread
From: Ross Zwisler @ 2016-03-29 21:57 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-nvdimm, NeilBrown, Wilcox

On Mon, Mar 21, 2016 at 02:22:53PM +0100, Jan Kara wrote:
> Currently DAX page fault locking is racy.
> 
> CPU0 (write fault)		CPU1 (read fault)
> 
> __dax_fault()			__dax_fault()
>   get_block(inode, block, &bh, 0) -> not mapped
> 				  get_block(inode, block, &bh, 0)
> 				    -> not mapped
>   if (!buffer_mapped(&bh))
>     if (vmf->flags & FAULT_FLAG_WRITE)
>       get_block(inode, block, &bh, 1) -> allocates blocks
>   if (page) -> no
> 				  if (!buffer_mapped(&bh))
> 				    if (vmf->flags & FAULT_FLAG_WRITE) {
> 				    } else {
> 				      dax_load_hole();
> 				    }
>   dax_insert_mapping()
> 
> And we are in a situation where we fail in dax_radix_entry() with -EIO.
> 
> Another problem with the current DAX page fault locking is that there is
> no race-free way to clear dirty tag in the radix tree. We can always
> end up with clean radix tree and dirty data in CPU cache.
> 
> We fix the first problem by introducing locking of exceptional radix
> tree entries in DAX mappings acting very similarly to page lock and thus
> synchronizing properly faults against the same mapping index. The same
> lock can later be used to avoid races when clearing radix tree dirty
> tag.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

I've got lots of little comments, but from my point of view this seems like
it is looking pretty good.  I agree with the choice to put this in dax.c as
opposed to radix-tree.c or something - this seems very DAX specific for now.

> ---
>  fs/dax.c            | 500 ++++++++++++++++++++++++++++++++++++++--------------
>  include/linux/dax.h |   1 +
>  mm/truncate.c       |  62 ++++---
>  3 files changed, 396 insertions(+), 167 deletions(-)
<>
> +static inline int slot_locked(void **v)
> +{
> +	unsigned long l = *(unsigned long *)v;
> +	return l & DAX_ENTRY_LOCK;
> +}
> +
> +static inline void *lock_slot(void **v)
> +{
> +	unsigned long *l = (unsigned long *)v;
> +	return (void*)(*l |= DAX_ENTRY_LOCK);
> +}
> +
> +static inline void *unlock_slot(void **v)
> +{
> +	unsigned long *l = (unsigned long *)v;
> +	return (void*)(*l &= ~(unsigned long)DAX_ENTRY_LOCK);
> +}

For the above three helpers I think we could do with better parameter and
variable naming so it's clearer what's going on.  s/v/slot/ and s/l/entry/ ?

Also, for many of these new functions we need to be holding
mapping->tree_lock - can we quickly document that with comments?
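
A quick sketch of what the rename plus the locking comment could look like
(illustrative only, not the final code):

/* Caller must hold mapping->tree_lock. */
static inline int slot_locked(void **slot)
{
	unsigned long entry = *(unsigned long *)slot;

	return entry & DAX_ENTRY_LOCK;
}

/* Caller must hold mapping->tree_lock. */
static inline void *lock_slot(void **slot)
{
	unsigned long *entry = (unsigned long *)slot;

	return (void *)(*entry |= DAX_ENTRY_LOCK);
}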

> +/*
> + * Lookup entry in radix tree, wait for it to become unlocked if it is
> + * exceptional entry and return.
> + *
> + * The function must be called with mapping->tree_lock held.
> + */
> +static void *lookup_unlocked_mapping_entry(struct address_space *mapping,
> +					   pgoff_t index, void ***slotp)
> +{
> +	void *ret, **slot;
> +	struct wait_exceptional_entry_queue wait;

This should probably be named 'ewait' to be consistent with
wake_exceptional_entry_func(), and so we have a different and consistent
naming between our struct wait_exceptional_entry_queue and wait_queue_t
variables.

> +	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> +
> +	init_wait(&wait.wait);
> +	wait.wait.func = wake_exceptional_entry_func;
> +	wait.key.root = &mapping->page_tree;
> +	wait.key.index = index;
> +
> +	for (;;) {
> +		ret = __radix_tree_lookup(&mapping->page_tree, index, NULL,
> +					  &slot);
> +		if (!ret || !radix_tree_exceptional_entry(ret) ||
> +		    !slot_locked(slot)) {
> +			if (slotp)
> +				*slotp = slot;
> +			return ret;
> +		}
> +		prepare_to_wait_exclusive(wq, &wait.wait, TASK_UNINTERRUPTIBLE);

Should we make this TASK_INTERRUPTIBLE so we don't end up with an unkillable
zombie?

> +		spin_unlock_irq(&mapping->tree_lock);
> +		schedule();
> +		finish_wait(wq, &wait.wait);
> +		spin_lock_irq(&mapping->tree_lock);
> +	}
> +}
> +
> +/*
> + * Find radix tree entry at given index. If it points to a page, return with
> + * the page locked. If it points to the exceptional entry, return with the
> + * radix tree entry locked. If the radix tree doesn't contain given index,
> + * create empty exceptional entry for the index and return with it locked.
> + *
> + * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
> + * persistent memory the benefit is doubtful. We can add that later if we can
> + * show it helps.
> + */
> +static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
> +{
> +	void *ret, **slot;
> +
> +restart:
> +	spin_lock_irq(&mapping->tree_lock);
> +	ret = lookup_unlocked_mapping_entry(mapping, index, &slot);
> +	/* No entry for given index? Make sure radix tree is big enough. */
> +	if (!ret) {
> +		int err;
> +
> +		spin_unlock_irq(&mapping->tree_lock);
> +		err = radix_tree_preload(
> +				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);

What is the benefit to preloading the radix tree?  It looks like we have to
drop the mapping->tree_lock, deal with an error, regrab the lock and then deal
with a possible collision with an entry that was inserted while we didn't hold
the lock.

Can we just try and insert it, then if it fails with -ENOMEM we just do our
normal error path, dropping the tree_lock and returning the error?
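
For concreteness, a sketch of that simpler path as a fragment of
grab_mapping_entry() (illustrative; it reuses names from the patch and
leaves open whether the allocation context under tree_lock actually permits
this -- which is exactly the question above):

	entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | DAX_ENTRY_LOCK);
	err = radix_tree_insert(&mapping->page_tree, index, entry);
	if (err) {
		spin_unlock_irq(&mapping->tree_lock);
		if (err == -EEXIST)	/* raced with another insert */
			goto restart;
		return ERR_PTR(err);	/* -ENOMEM handled like any other error */
	}
	mapping->nrexceptional++;
	spin_unlock_irq(&mapping->tree_lock);
	return entry;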

> +		if (err)
> +			return ERR_PTR(err);
> +		ret = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | DAX_ENTRY_LOCK);
> +		spin_lock_irq(&mapping->tree_lock);
> +		err = radix_tree_insert(&mapping->page_tree, index, ret);
> +		radix_tree_preload_end();
> +		if (err) {
> +			spin_unlock_irq(&mapping->tree_lock);
> +			/* Someone already created the entry? */
> +			if (err == -EEXIST)
> +				goto restart;
> +			return ERR_PTR(err);
> +		}
> +		/* Good, we have inserted empty locked entry into the tree. */
> +		mapping->nrexceptional++;
> +		spin_unlock_irq(&mapping->tree_lock);
> +		return ret;
> +	}
> +	/* Normal page in radix tree? */
> +	if (!radix_tree_exceptional_entry(ret)) {
> +		struct page *page = ret;
> +
> +		page_cache_get(page);
> +		spin_unlock_irq(&mapping->tree_lock);
> +		lock_page(page);
> +		/* Page got truncated? Retry... */
> +		if (unlikely(page->mapping != mapping)) {
> +			unlock_page(page);
> +			page_cache_release(page);
> +			goto restart;
> +		}
> +		return page;
> +	}
> +	ret = lock_slot(slot);
> +	spin_unlock_irq(&mapping->tree_lock);
> +	return ret;
> +}
> +
> +static void unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
> +{
> +	void *ret, **slot;
> +	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> +
> +	spin_lock_irq(&mapping->tree_lock);
> +	ret = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
> +	if (WARN_ON_ONCE(!ret || !radix_tree_exceptional_entry(ret))) {
> +		spin_unlock_irq(&mapping->tree_lock);
> +		return;
> +	}
> +	if (WARN_ON_ONCE(!slot_locked(slot))) {
> +		spin_unlock_irq(&mapping->tree_lock);
> +		return;
> +	}

It may be worth combining these two WARN_ON_ONCE() error cases for brevity,
since they are both insanity conditions.
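
E.g. something like (sketch of the combined check, reusing the code above):

	ret = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
	if (WARN_ON_ONCE(!ret || !radix_tree_exceptional_entry(ret) ||
			 !slot_locked(slot))) {
		spin_unlock_irq(&mapping->tree_lock);
		return;
	}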

> +	unlock_slot(slot);
> +	spin_unlock_irq(&mapping->tree_lock);
> +	if (waitqueue_active(wq)) {
> +		struct exceptional_entry_key key;
> +
> +		key.root = &mapping->page_tree;
> +		key.index = index;
> +		__wake_up(wq, TASK_NORMAL, 1, &key);
> +	}

The above if() block is repeated 3 times in the next few functions with small
variations (the third argument to __wake_up()).  Perhaps it should be pulled
out into a helper?
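
Something along these lines, perhaps (sketch only; the helper name and the
nr_exclusive parameter are illustrative):

static void dax_wake_mapping_entry_waiter(struct address_space *mapping,
					  pgoff_t index, int nr_exclusive)
{
	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);

	if (waitqueue_active(wq)) {
		struct exceptional_entry_key key;

		key.root = &mapping->page_tree;
		key.index = index;
		__wake_up(wq, TASK_NORMAL, nr_exclusive, &key);
	}
}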

> +static void *dax_mapping_entry(struct address_space *mapping, pgoff_t index,
> +			       void *entry, sector_t sector, bool dirty,
> +			       gfp_t gfp_mask)

This argument list is getting pretty long, and our one caller gets lots of
these guys out of the VMF.  Perhaps we could just pass in the VMF and extract
the bits ourselves?
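
i.e. a signature along the lines of (hypothetical):

static void *dax_mapping_entry(struct address_space *mapping, void *entry,
			       struct vm_fault *vmf, sector_t sector);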

>  {
>  	struct radix_tree_root *page_tree = &mapping->page_tree;
> -	pgoff_t pmd_index = DAX_PMD_INDEX(index);
> -	int type, error = 0;
> -	void *entry;
> +	int error = 0;
> +	bool hole_fill = false;
> +	void *ret;

Just a nit, but I find the use of 'ret' a bit confusing, since it's not a
return value that we got from anywhere, it's an entry that we set up, insert
and then return to our caller.  We use 'error' to capture return values from
calls this function makes.  Maybe this would be clearer as "new_entry" or
something?

> -	WARN_ON_ONCE(pmd_entry && !dirty);
>  	if (dirty)
>  		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>  
> -	spin_lock_irq(&mapping->tree_lock);
> -
> -	entry = radix_tree_lookup(page_tree, pmd_index);
> -	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD) {
> -		index = pmd_index;
> -		goto dirty;
> +	/* Replacing hole page with block mapping? */
> +	if (!radix_tree_exceptional_entry(entry)) {
> +		hole_fill = true;
> +		error = radix_tree_preload(gfp_mask);
> +		if (error)
> +			return ERR_PTR(error);
>  	}
>  
> -	entry = radix_tree_lookup(page_tree, index);
> -	if (entry) {
> -		type = RADIX_DAX_TYPE(entry);
> -		if (WARN_ON_ONCE(type != RADIX_DAX_PTE &&
> -					type != RADIX_DAX_PMD)) {
> -			error = -EIO;
> +	spin_lock_irq(&mapping->tree_lock);
> +	ret = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
> +		       DAX_ENTRY_LOCK);
> +	if (hole_fill) {
> +		__delete_from_page_cache(entry, NULL);
> +		error = radix_tree_insert(page_tree, index, ret);
> +		if (error) {
> +			ret = ERR_PTR(error);
>  			goto unlock;
>  		}
> +		mapping->nrexceptional++;
> +	} else {
> +		void **slot;
> +		void *ret2;
>  
> -		if (!pmd_entry || type == RADIX_DAX_PMD)
> -			goto dirty;
> -
> -		/*
> -		 * We only insert dirty PMD entries into the radix tree.  This
> -		 * means we don't need to worry about removing a dirty PTE
> -		 * entry and inserting a clean PMD entry, thus reducing the
> -		 * range we would flush with a follow-up fsync/msync call.
> -		 */
> -		radix_tree_delete(&mapping->page_tree, index);
> -		mapping->nrexceptional--;
> -	}
> -
> -	if (sector == NO_SECTOR) {
> -		/*
> -		 * This can happen during correct operation if our pfn_mkwrite
> -		 * fault raced against a hole punch operation.  If this
> -		 * happens the pte that was hole punched will have been
> -		 * unmapped and the radix tree entry will have been removed by
> -		 * the time we are called, but the call will still happen.  We
> -		 * will return all the way up to wp_pfn_shared(), where the
> -		 * pte_same() check will fail, eventually causing page fault
> -		 * to be retried by the CPU.
> -		 */
> -		goto unlock;
> +		ret2 = __radix_tree_lookup(page_tree, index, NULL, &slot);

You don't need ret2.  You can just compare 'entry' with '*slot' - see
dax_writeback_one() for an example.
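
i.e. roughly (sketch, following the dax_writeback_one() pattern mentioned):

		void **slot;

		__radix_tree_lookup(page_tree, index, NULL, &slot);
		WARN_ON_ONCE(*slot != entry);
		radix_tree_replace_slot(slot, ret);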

> +		WARN_ON_ONCE(ret2 != entry);
> +		radix_tree_replace_slot(slot, ret);
>  	}
> -
> -	error = radix_tree_insert(page_tree, index,
> -			RADIX_DAX_ENTRY(sector, pmd_entry));
> -	if (error)
> -		goto unlock;
> -
> -	mapping->nrexceptional++;
> - dirty:
>  	if (dirty)
>  		radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);
>   unlock:
>  	spin_unlock_irq(&mapping->tree_lock);
> -	return error;
> +	if (hole_fill)
> +		radix_tree_preload_end();
> +	return ret;
>  }
>  
>  static int dax_writeback_one(struct block_device *bdev,
> @@ -542,17 +782,18 @@ int dax_writeback_mapping_range(struct address_space *mapping,
>  }
>  EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
>  
> -static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> +static int dax_insert_mapping(struct address_space *mapping,
> +			struct buffer_head *bh, void *entry,
>  			struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
>  	unsigned long vaddr = (unsigned long)vmf->virtual_address;
> -	struct address_space *mapping = inode->i_mapping;
>  	struct block_device *bdev = bh->b_bdev;
>  	struct blk_dax_ctl dax = {
> -		.sector = to_sector(bh, inode),
> +		.sector = to_sector(bh, mapping->host),
>  		.size = bh->b_size,
>  	};
>  	int error;
> +	void *ret;
>  
>  	i_mmap_lock_read(mapping);
>  
> @@ -562,16 +803,26 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>  	}
>  	dax_unmap_atomic(bdev, &dax);
>  
> -	error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
> -			vmf->flags & FAULT_FLAG_WRITE);
> -	if (error)
> +	ret = dax_mapping_entry(mapping, vmf->pgoff, entry, dax.sector,
> +			        vmf->flags & FAULT_FLAG_WRITE,
> +			        vmf->gfp_mask & ~__GFP_HIGHMEM);

The spacing before the parameters to dax_mapping_entry() is messed up & makes
checkpatch grumpy:

ERROR: code indent should use tabs where possible
#488: FILE: fs/dax.c:812:
+^I^I^I        vmf->flags & FAULT_FLAG_WRITE,$

ERROR: code indent should use tabs where possible
#489: FILE: fs/dax.c:813:
+^I^I^I        vmf->gfp_mask & ~__GFP_HIGHMEM);$

There are a few other checkpatch warnings as well that should probably be
addressed.

> +	if (IS_ERR(ret)) {
> +		error = PTR_ERR(ret);
>  		goto out;
> +	}
> +	/* Have we replaced hole page? Unmap and free it. */
> +	if (!radix_tree_exceptional_entry(entry)) {
> +		unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
> +				    PAGE_CACHE_SIZE, 0);
> +		unlock_page(entry);
> +		page_cache_release(entry);
> +	}
> +	entry = ret;
>  
>  	error = vm_insert_mixed(vma, vaddr, dax.pfn);
> -
>   out:
>  	i_mmap_unlock_read(mapping);
> -
> +	put_locked_mapping_entry(mapping, vmf->pgoff, entry);

Hmm....this entry was locked by our parent (__dax_fault()), and is released by
our parent in error cases that go through 'unlock_entry:'.  For symmetry it's
probably better to move this call up to our parent as well.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/10] dax: New fault locking
@ 2016-03-29 21:57     ` Ross Zwisler
  0 siblings, 0 replies; 88+ messages in thread
From: Ross Zwisler @ 2016-03-29 21:57 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Wilcox, Matthew R, Ross Zwisler, Dan Williams,
	linux-nvdimm, NeilBrown

On Mon, Mar 21, 2016 at 02:22:53PM +0100, Jan Kara wrote:
> Currently DAX page fault locking is racy.
> 
> CPU0 (write fault)		CPU1 (read fault)
> 
> __dax_fault()			__dax_fault()
>   get_block(inode, block, &bh, 0) -> not mapped
> 				  get_block(inode, block, &bh, 0)
> 				    -> not mapped
>   if (!buffer_mapped(&bh))
>     if (vmf->flags & FAULT_FLAG_WRITE)
>       get_block(inode, block, &bh, 1) -> allocates blocks
>   if (page) -> no
> 				  if (!buffer_mapped(&bh))
> 				    if (vmf->flags & FAULT_FLAG_WRITE) {
> 				    } else {
> 				      dax_load_hole();
> 				    }
>   dax_insert_mapping()
> 
> And we are in a situation where we fail in dax_radix_entry() with -EIO.
> 
> Another problem with the current DAX page fault locking is that there is
> no race-free way to clear dirty tag in the radix tree. We can always
> end up with clean radix tree and dirty data in CPU cache.
> 
> We fix the first problem by introducing locking of exceptional radix
> tree entries in DAX mappings acting very similarly to page lock and thus
> synchronizing properly faults against the same mapping index. The same
> lock can later be used to avoid races when clearing radix tree dirty
> tag.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

I've got lots of little comments, but from my point of view this seems like
it is looking pretty good.  I agree with the choice to put this in dax.c as
opposed to radix-tree.c or something - this seems very DAX specific for now.

> ---
>  fs/dax.c            | 500 ++++++++++++++++++++++++++++++++++++++--------------
>  include/linux/dax.h |   1 +
>  mm/truncate.c       |  62 ++++---
>  3 files changed, 396 insertions(+), 167 deletions(-)
<>
> +static inline int slot_locked(void **v)
> +{
> +	unsigned long l = *(unsigned long *)v;
> +	return l & DAX_ENTRY_LOCK;
> +}
> +
> +static inline void *lock_slot(void **v)
> +{
> +	unsigned long *l = (unsigned long *)v;
> +	return (void*)(*l |= DAX_ENTRY_LOCK);
> +}
> +
> +static inline void *unlock_slot(void **v)
> +{
> +	unsigned long *l = (unsigned long *)v;
> +	return (void*)(*l &= ~(unsigned long)DAX_ENTRY_LOCK);
> +}

For the above three helpers I think we could do with better parameter and
variable naming so it's clearer what's going on.  s/v/slot/ and s/l/entry/ ?

Also, for many of these new functions we need to be holding
mapping->tree_lock - can we quickly document that with comments?

> +/*
> + * Lookup entry in radix tree, wait for it to become unlocked if it is
> + * exceptional entry and return.
> + *
> + * The function must be called with mapping->tree_lock held.
> + */
> +static void *lookup_unlocked_mapping_entry(struct address_space *mapping,
> +					   pgoff_t index, void ***slotp)
> +{
> +	void *ret, **slot;
> +	struct wait_exceptional_entry_queue wait;

This should probably be named 'ewait' to be consistent with
wake_exceptional_entry_func(), and so we have a different and consistent
naming between our struct wait_exceptional_entry_queue and wait_queue_t
variables.

> +	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> +
> +	init_wait(&wait.wait);
> +	wait.wait.func = wake_exceptional_entry_func;
> +	wait.key.root = &mapping->page_tree;
> +	wait.key.index = index;
> +
> +	for (;;) {
> +		ret = __radix_tree_lookup(&mapping->page_tree, index, NULL,
> +					  &slot);
> +		if (!ret || !radix_tree_exceptional_entry(ret) ||
> +		    !slot_locked(slot)) {
> +			if (slotp)
> +				*slotp = slot;
> +			return ret;
> +		}
> +		prepare_to_wait_exclusive(wq, &wait.wait, TASK_UNINTERRUPTIBLE);

Should we make this TASK_INTERRUPTIBLE so we don't end up with an unkillable
zombie?

> +		spin_unlock_irq(&mapping->tree_lock);
> +		schedule();
> +		finish_wait(wq, &wait.wait);
> +		spin_lock_irq(&mapping->tree_lock);
> +	}
> +}
> +
> +/*
> + * Find radix tree entry at given index. If it points to a page, return with
> + * the page locked. If it points to the exceptional entry, return with the
> + * radix tree entry locked. If the radix tree doesn't contain given index,
> + * create empty exceptional entry for the index and return with it locked.
> + *
> + * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
> + * persistent memory the benefit is doubtful. We can add that later if we can
> + * show it helps.
> + */
> +static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
> +{
> +	void *ret, **slot;
> +
> +restart:
> +	spin_lock_irq(&mapping->tree_lock);
> +	ret = lookup_unlocked_mapping_entry(mapping, index, &slot);
> +	/* No entry for given index? Make sure radix tree is big enough. */
> +	if (!ret) {
> +		int err;
> +
> +		spin_unlock_irq(&mapping->tree_lock);
> +		err = radix_tree_preload(
> +				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);

What is the benefit to preloading the radix tree?  It looks like we have to
drop the mapping->tree_lock, deal with an error, regrab the lock and then deal
with a possible collision with an entry that was inserted while we didn't hold
the lock.

Can we just try and insert it, then if it fails with -ENOMEM we just do our
normal error path, dropping the tree_lock and returning the error?

> +		if (err)
> +			return ERR_PTR(err);
> +		ret = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | DAX_ENTRY_LOCK);
> +		spin_lock_irq(&mapping->tree_lock);
> +		err = radix_tree_insert(&mapping->page_tree, index, ret);
> +		radix_tree_preload_end();
> +		if (err) {
> +			spin_unlock_irq(&mapping->tree_lock);
> +			/* Someone already created the entry? */
> +			if (err == -EEXIST)
> +				goto restart;
> +			return ERR_PTR(err);
> +		}
> +		/* Good, we have inserted empty locked entry into the tree. */
> +		mapping->nrexceptional++;
> +		spin_unlock_irq(&mapping->tree_lock);
> +		return ret;
> +	}
> +	/* Normal page in radix tree? */
> +	if (!radix_tree_exceptional_entry(ret)) {
> +		struct page *page = ret;
> +
> +		page_cache_get(page);
> +		spin_unlock_irq(&mapping->tree_lock);
> +		lock_page(page);
> +		/* Page got truncated? Retry... */
> +		if (unlikely(page->mapping != mapping)) {
> +			unlock_page(page);
> +			page_cache_release(page);
> +			goto restart;
> +		}
> +		return page;
> +	}
> +	ret = lock_slot(slot);
> +	spin_unlock_irq(&mapping->tree_lock);
> +	return ret;
> +}
> +
> +static void unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
> +{
> +	void *ret, **slot;
> +	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> +
> +	spin_lock_irq(&mapping->tree_lock);
> +	ret = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
> +	if (WARN_ON_ONCE(!ret || !radix_tree_exceptional_entry(ret))) {
> +		spin_unlock_irq(&mapping->tree_lock);
> +		return;
> +	}
> +	if (WARN_ON_ONCE(!slot_locked(slot))) {
> +		spin_unlock_irq(&mapping->tree_lock);
> +		return;
> +	}

It may be worth combining these two WARN_ON_ONCE() error cases for brevity,
since they are both insanity conditions.
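
E.g. something like this (rough sketch only, relying on || short-circuiting
so slot_locked() is not called when the lookup failed):

	ret = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
	if (WARN_ON_ONCE(!ret || !radix_tree_exceptional_entry(ret) ||
			 !slot_locked(slot))) {
		spin_unlock_irq(&mapping->tree_lock);
		return;
	}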

> +	unlock_slot(slot);
> +	spin_unlock_irq(&mapping->tree_lock);
> +	if (waitqueue_active(wq)) {
> +		struct exceptional_entry_key key;
> +
> +		key.root = &mapping->page_tree;
> +		key.index = index;
> +		__wake_up(wq, TASK_NORMAL, 1, &key);
> +	}

The above if() block is repeated 3 times in the next few functions with small
variations (the third argument to __wake_up()).  Perhaps it should be pulled
out into a helper?
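
Rough sketch of the helper I have in mind (the name
dax_wake_mapping_entry_waiter is made up):

	static void dax_wake_mapping_entry_waiter(struct address_space *mapping,
						  pgoff_t index, int nr_exclusive)
	{
		wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);

		if (waitqueue_active(wq)) {
			struct exceptional_entry_key key;

			key.root = &mapping->page_tree;
			key.index = index;
			__wake_up(wq, TASK_NORMAL, nr_exclusive, &key);
		}
	}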

> +static void *dax_mapping_entry(struct address_space *mapping, pgoff_t index,
> +			       void *entry, sector_t sector, bool dirty,
> +			       gfp_t gfp_mask)

This argument list is getting pretty long, and our one caller gets lots of
these guys out of the VMF.  Perhaps we could just pass in the VMF and extract
the bits ourselves?
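
Maybe just something like (prototype sketch only):

	static void *dax_mapping_entry(struct address_space *mapping,
				       struct vm_fault *vmf, void *entry,
				       sector_t sector);

with the dirty flag and gfp_mask derived from vmf->flags & FAULT_FLAG_WRITE
and vmf->gfp_mask inside the function.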

>  {
>  	struct radix_tree_root *page_tree = &mapping->page_tree;
> -	pgoff_t pmd_index = DAX_PMD_INDEX(index);
> -	int type, error = 0;
> -	void *entry;
> +	int error = 0;
> +	bool hole_fill = false;
> +	void *ret;

Just a nit, but I find the use of 'ret' a bit confusing, since it's not a
return value that we got from anywhere, it's an entry that we set up, insert
and then return to our caller.  We use 'error' to capture return values from
calls this function makes.  Maybe this would be clearer as "new_entry" or
something?

> -	WARN_ON_ONCE(pmd_entry && !dirty);
>  	if (dirty)
>  		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>  
> -	spin_lock_irq(&mapping->tree_lock);
> -
> -	entry = radix_tree_lookup(page_tree, pmd_index);
> -	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD) {
> -		index = pmd_index;
> -		goto dirty;
> +	/* Replacing hole page with block mapping? */
> +	if (!radix_tree_exceptional_entry(entry)) {
> +		hole_fill = true;
> +		error = radix_tree_preload(gfp_mask);
> +		if (error)
> +			return ERR_PTR(error);
>  	}
>  
> -	entry = radix_tree_lookup(page_tree, index);
> -	if (entry) {
> -		type = RADIX_DAX_TYPE(entry);
> -		if (WARN_ON_ONCE(type != RADIX_DAX_PTE &&
> -					type != RADIX_DAX_PMD)) {
> -			error = -EIO;
> +	spin_lock_irq(&mapping->tree_lock);
> +	ret = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
> +		       DAX_ENTRY_LOCK);
> +	if (hole_fill) {
> +		__delete_from_page_cache(entry, NULL);
> +		error = radix_tree_insert(page_tree, index, ret);
> +		if (error) {
> +			ret = ERR_PTR(error);
>  			goto unlock;
>  		}
> +		mapping->nrexceptional++;
> +	} else {
> +		void **slot;
> +		void *ret2;
>  
> -		if (!pmd_entry || type == RADIX_DAX_PMD)
> -			goto dirty;
> -
> -		/*
> -		 * We only insert dirty PMD entries into the radix tree.  This
> -		 * means we don't need to worry about removing a dirty PTE
> -		 * entry and inserting a clean PMD entry, thus reducing the
> -		 * range we would flush with a follow-up fsync/msync call.
> -		 */
> -		radix_tree_delete(&mapping->page_tree, index);
> -		mapping->nrexceptional--;
> -	}
> -
> -	if (sector == NO_SECTOR) {
> -		/*
> -		 * This can happen during correct operation if our pfn_mkwrite
> -		 * fault raced against a hole punch operation.  If this
> -		 * happens the pte that was hole punched will have been
> -		 * unmapped and the radix tree entry will have been removed by
> -		 * the time we are called, but the call will still happen.  We
> -		 * will return all the way up to wp_pfn_shared(), where the
> -		 * pte_same() check will fail, eventually causing page fault
> -		 * to be retried by the CPU.
> -		 */
> -		goto unlock;
> +		ret2 = __radix_tree_lookup(page_tree, index, NULL, &slot);

You don't need ret2.  You can just compare 'entry' with '*slot' - see
dax_writeback_one() for an example.

> +		WARN_ON_ONCE(ret2 != entry);
> +		radix_tree_replace_slot(slot, ret);
>  	}
> -
> -	error = radix_tree_insert(page_tree, index,
> -			RADIX_DAX_ENTRY(sector, pmd_entry));
> -	if (error)
> -		goto unlock;
> -
> -	mapping->nrexceptional++;
> - dirty:
>  	if (dirty)
>  		radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);
>   unlock:
>  	spin_unlock_irq(&mapping->tree_lock);
> -	return error;
> +	if (hole_fill)
> +		radix_tree_preload_end();
> +	return ret;
>  }
>  
>  static int dax_writeback_one(struct block_device *bdev,
> @@ -542,17 +782,18 @@ int dax_writeback_mapping_range(struct address_space *mapping,
>  }
>  EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
>  
> -static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> +static int dax_insert_mapping(struct address_space *mapping,
> +			struct buffer_head *bh, void *entry,
>  			struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
>  	unsigned long vaddr = (unsigned long)vmf->virtual_address;
> -	struct address_space *mapping = inode->i_mapping;
>  	struct block_device *bdev = bh->b_bdev;
>  	struct blk_dax_ctl dax = {
> -		.sector = to_sector(bh, inode),
> +		.sector = to_sector(bh, mapping->host),
>  		.size = bh->b_size,
>  	};
>  	int error;
> +	void *ret;
>  
>  	i_mmap_lock_read(mapping);
>  
> @@ -562,16 +803,26 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>  	}
>  	dax_unmap_atomic(bdev, &dax);
>  
> -	error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
> -			vmf->flags & FAULT_FLAG_WRITE);
> -	if (error)
> +	ret = dax_mapping_entry(mapping, vmf->pgoff, entry, dax.sector,
> +			        vmf->flags & FAULT_FLAG_WRITE,
> +			        vmf->gfp_mask & ~__GFP_HIGHMEM);

The spacing before the parameters to dax_mapping_entry() is messed up & makes
checkpatch grumpy:

ERROR: code indent should use tabs where possible
#488: FILE: fs/dax.c:812:
+^I^I^I        vmf->flags & FAULT_FLAG_WRITE,$

ERROR: code indent should use tabs where possible
#489: FILE: fs/dax.c:813:
+^I^I^I        vmf->gfp_mask & ~__GFP_HIGHMEM);$

There are a few other checkpatch warnings as well that should probably be
addressed.

> +	if (IS_ERR(ret)) {
> +		error = PTR_ERR(ret);
>  		goto out;
> +	}
> +	/* Have we replaced hole page? Unmap and free it. */
> +	if (!radix_tree_exceptional_entry(entry)) {
> +		unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
> +				    PAGE_CACHE_SIZE, 0);
> +		unlock_page(entry);
> +		page_cache_release(entry);
> +	}
> +	entry = ret;
>  
>  	error = vm_insert_mixed(vma, vaddr, dax.pfn);
> -
>   out:
>  	i_mmap_unlock_read(mapping);
> -
> +	put_locked_mapping_entry(mapping, vmf->pgoff, entry);

Hmm....this entry was locked by our parent (__dax_fault()), and is released by
our parent in error cases that go through 'unlock_entry:'.  For symmetry it's
probably better to move this call up to our parent as well.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 10/10] dax: Remove i_mmap_lock protection
  2016-03-21 13:22   ` Jan Kara
@ 2016-03-29 22:17     ` Ross Zwisler
  -1 siblings, 0 replies; 88+ messages in thread
From: Ross Zwisler @ 2016-03-29 22:17 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-nvdimm, NeilBrown, Wilcox

On Mon, Mar 21, 2016 at 02:22:55PM +0100, Jan Kara wrote:
> Currently faults are protected against truncate by the filesystem-specific
> i_mmap_sem and by the page lock in case of a hole page. Cow faults are
> protected by DAX radix tree entry locking. So there's no need for
> i_mmap_lock in DAX code. Remove it.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

> ---
>  fs/dax.c | 7 -------
>  1 file changed, 7 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 2fcf4e8a17c5..a2a370db59b7 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -795,8 +795,6 @@ static int dax_insert_mapping(struct address_space *mapping,
>  	int error;
>  	void *ret;
>  
> -	i_mmap_lock_read(mapping);
> -
>  	if (dax_map_atomic(bdev, &dax) < 0) {
>  		error = PTR_ERR(dax.addr);
>  		goto out;
> @@ -821,7 +819,6 @@ static int dax_insert_mapping(struct address_space *mapping,
>  
>  	error = vm_insert_mixed(vma, vaddr, dax.pfn);
>   out:
> -	i_mmap_unlock_read(mapping);
>  	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
>  	return error;
>  }
> @@ -1063,8 +1060,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  		truncate_pagecache_range(inode, lstart, lend);
>  	}
>  
> -	i_mmap_lock_read(mapping);
> -
>  	if (!write && !buffer_mapped(&bh) && buffer_uptodate(&bh)) {
>  		spinlock_t *ptl;
>  		pmd_t entry;
> @@ -1162,8 +1157,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  	}
>  
>   out:
> -	i_mmap_unlock_read(mapping);
> -
>  	return result;
>  
>   fallback:
> -- 
> 2.6.2
> 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 09/10] dax: Use radix tree entry lock to protect cow faults
  2016-03-21 13:22   ` Jan Kara
@ 2016-03-29 22:18     ` Ross Zwisler
  -1 siblings, 0 replies; 88+ messages in thread
From: Ross Zwisler @ 2016-03-29 22:18 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-nvdimm, NeilBrown, Wilcox

On Mon, Mar 21, 2016 at 02:22:54PM +0100, Jan Kara wrote:
> When doing cow faults, we cannot directly fill in the PTE as we do for
> other faults, because we rely on generic code to do proper accounting of
> the cowed page. We also have no page to lock to protect against races
> with truncate as other faults have, and we need the protection to extend
> until the moment the generic code inserts the cowed page into the PTE; at
> that point the fs-specific i_mmap_sem no longer protects us. So far we
> relied on i_mmap_lock for this protection, however that is completely
> special to cow faults. To make fault locking more uniform, use the DAX
> entry lock instead.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

> ---
>  fs/dax.c            | 12 +++++-------
>  include/linux/dax.h |  1 +
>  include/linux/mm.h  |  7 +++++++
>  mm/memory.c         | 38 ++++++++++++++++++--------------------
>  4 files changed, 31 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 4fcac59b6dcb..2fcf4e8a17c5 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -469,7 +469,7 @@ restart:
>  	return ret;
>  }
>  
> -static void unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
> +void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
>  {
>  	void *ret, **slot;
>  	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> @@ -502,7 +502,7 @@ static void put_locked_mapping_entry(struct address_space *mapping,
>  		unlock_page(entry);
>  		page_cache_release(entry);
>  	} else {
> -		unlock_mapping_entry(mapping, index);
> +		dax_unlock_mapping_entry(mapping, index);
>  	}
>  }
>  
> @@ -887,12 +887,10 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  			goto unlock_entry;
>  		if (!radix_tree_exceptional_entry(entry)) {
>  			vmf->page = entry;
> -		} else {
> -			unlock_mapping_entry(mapping, vmf->pgoff);
> -			i_mmap_lock_read(mapping);
> -			vmf->page = NULL;
> +			return VM_FAULT_LOCKED;
>  		}
> -		return VM_FAULT_LOCKED;
> +		vmf->entry = entry;
> +		return VM_FAULT_DAX_LOCKED;
>  	}
>  
>  	if (!buffer_mapped(&bh)) {
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index da2416d916e6..29a83a767ea3 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -60,4 +60,5 @@ static inline bool dax_mapping(struct address_space *mapping)
>  struct writeback_control;
>  int dax_writeback_mapping_range(struct address_space *mapping,
>  		struct block_device *bdev, struct writeback_control *wbc);
> +void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index);
>  #endif
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 450fc977ed02..1c64039dc505 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -299,6 +299,12 @@ struct vm_fault {
>  					 * is set (which is also implied by
>  					 * VM_FAULT_ERROR).
>  					 */
> +	void *entry;			/* ->fault handler can alternatively
> +					 * return locked DAX entry. In that
> +					 * case handler should return
> +					 * VM_FAULT_DAX_LOCKED and fill in
> +					 * entry here.
> +					 */
>  	/* for ->map_pages() only */
>  	pgoff_t max_pgoff;		/* map pages for offset from pgoff till
>  					 * max_pgoff inclusive */
> @@ -1084,6 +1090,7 @@ static inline void clear_page_pfmemalloc(struct page *page)
>  #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
>  #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
>  #define VM_FAULT_FALLBACK 0x0800	/* huge page fault failed, fall back to small */
> +#define VM_FAULT_DAX_LOCKED 0x1000	/* ->fault has locked DAX entry */
>  
>  #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index 81dca0083fcd..7a704d3cd3b5 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -63,6 +63,7 @@
>  #include <linux/dma-debug.h>
>  #include <linux/debugfs.h>
>  #include <linux/userfaultfd_k.h>
> +#include <linux/dax.h>
>  
>  #include <asm/io.h>
>  #include <asm/mmu_context.h>
> @@ -2783,7 +2784,8 @@ oom:
>   */
>  static int __do_fault(struct vm_area_struct *vma, unsigned long address,
>  			pgoff_t pgoff, unsigned int flags,
> -			struct page *cow_page, struct page **page)
> +			struct page *cow_page, struct page **page,
> +			void **entry)
>  {
>  	struct vm_fault vmf;
>  	int ret;
> @@ -2798,8 +2800,10 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
>  	ret = vma->vm_ops->fault(vma, &vmf);
>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>  		return ret;
> -	if (!vmf.page)
> -		goto out;
> +	if (ret & VM_FAULT_DAX_LOCKED) {
> +		*entry = vmf.entry;
> +		return ret;
> +	}
>  
>  	if (unlikely(PageHWPoison(vmf.page))) {
>  		if (ret & VM_FAULT_LOCKED)
> @@ -2813,7 +2817,6 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
>  	else
>  		VM_BUG_ON_PAGE(!PageLocked(vmf.page), vmf.page);
>  
> - out:
>  	*page = vmf.page;
>  	return ret;
>  }
> @@ -2985,7 +2988,7 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  		pte_unmap_unlock(pte, ptl);
>  	}
>  
> -	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
> +	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page, NULL);
>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>  		return ret;
>  
> @@ -3008,6 +3011,7 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
>  {
>  	struct page *fault_page, *new_page;
> +	void *fault_entry;
>  	struct mem_cgroup *memcg;
>  	spinlock_t *ptl;
>  	pte_t *pte;
> @@ -3025,26 +3029,24 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  		return VM_FAULT_OOM;
>  	}
>  
> -	ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page);
> +	ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page,
> +			 &fault_entry);
>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>  		goto uncharge_out;
>  
> -	if (fault_page)
> +	if (!(ret & VM_FAULT_DAX_LOCKED))
>  		copy_user_highpage(new_page, fault_page, address, vma);
>  	__SetPageUptodate(new_page);
>  
>  	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
>  	if (unlikely(!pte_same(*pte, orig_pte))) {
>  		pte_unmap_unlock(pte, ptl);
> -		if (fault_page) {
> +		if (!(ret & VM_FAULT_DAX_LOCKED)) {
>  			unlock_page(fault_page);
>  			page_cache_release(fault_page);
>  		} else {
> -			/*
> -			 * The fault handler has no page to lock, so it holds
> -			 * i_mmap_lock for read to protect against truncate.
> -			 */
> -			i_mmap_unlock_read(vma->vm_file->f_mapping);
> +			dax_unlock_mapping_entry(vma->vm_file->f_mapping,
> +						 pgoff);
>  		}
>  		goto uncharge_out;
>  	}
> @@ -3052,15 +3054,11 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	mem_cgroup_commit_charge(new_page, memcg, false, false);
>  	lru_cache_add_active_or_unevictable(new_page, vma);
>  	pte_unmap_unlock(pte, ptl);
> -	if (fault_page) {
> +	if (!(ret & VM_FAULT_DAX_LOCKED)) {
>  		unlock_page(fault_page);
>  		page_cache_release(fault_page);
>  	} else {
> -		/*
> -		 * The fault handler has no page to lock, so it holds
> -		 * i_mmap_lock for read to protect against truncate.
> -		 */
> -		i_mmap_unlock_read(vma->vm_file->f_mapping);
> +		dax_unlock_mapping_entry(vma->vm_file->f_mapping, pgoff);
>  	}
>  	return ret;
>  uncharge_out:
> @@ -3080,7 +3078,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	int dirtied = 0;
>  	int ret, tmp;
>  
> -	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
> +	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page, NULL);
>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>  		return ret;
>  
> -- 
> 2.6.2
> 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/10] dax: New fault locking
  2016-03-29 21:57     ` Ross Zwisler
@ 2016-03-31 16:27       ` Jan Kara
  -1 siblings, 0 replies; 88+ messages in thread
From: Jan Kara @ 2016-03-31 16:27 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-nvdimm, NeilBrown, Wilcox, Matthew R, linux-fsdevel

Thanks for the review, Ross! I have implemented your comments unless I state
otherwise below.

On Tue 29-03-16 15:57:32, Ross Zwisler wrote:
> On Mon, Mar 21, 2016 at 02:22:53PM +0100, Jan Kara wrote:
> > +	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> > +
> > +	init_wait(&wait.wait);
> > +	wait.wait.func = wake_exceptional_entry_func;
> > +	wait.key.root = &mapping->page_tree;
> > +	wait.key.index = index;
> > +
> > +	for (;;) {
> > +		ret = __radix_tree_lookup(&mapping->page_tree, index, NULL,
> > +					  &slot);
> > +		if (!ret || !radix_tree_exceptional_entry(ret) ||
> > +		    !slot_locked(slot)) {
> > +			if (slotp)
> > +				*slotp = slot;
> > +			return ret;
> > +		}
> > +		prepare_to_wait_exclusive(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
> 
> Should we make this TASK_INTERRUPTIBLE so we don't end up with an unkillable
> zombie?

Well, do you want to deal with signal handling all the way up the stack? The
wait should be pretty short given the nature of pmem, so I didn't see much
point in bothering with signal handling...

> > +		spin_unlock_irq(&mapping->tree_lock);
> > +		schedule();
> > +		finish_wait(wq, &wait.wait);
> > +		spin_lock_irq(&mapping->tree_lock);
> > +	}
> > +}
> > +
> > +/*
> > + * Find radix tree entry at given index. If it points to a page, return with
> > + * the page locked. If it points to the exceptional entry, return with the
> > + * radix tree entry locked. If the radix tree doesn't contain given index,
> > + * create empty exceptional entry for the index and return with it locked.
> > + *
> > + * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
> > + * persistent memory the benefit is doubtful. We can add that later if we can
> > + * show it helps.
> > + */
> > +static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
> > +{
> > +	void *ret, **slot;
> > +
> > +restart:
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	ret = lookup_unlocked_mapping_entry(mapping, index, &slot);
> > +	/* No entry for given index? Make sure radix tree is big enough. */
> > +	if (!ret) {
> > +		int err;
> > +
> > +		spin_unlock_irq(&mapping->tree_lock);
> > +		err = radix_tree_preload(
> > +				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
> 
> What is the benefit to preloading the radix tree?  It looks like we have
> to drop the mapping->tree_lock, deal with an error, regrab the lock and
> then deal with a possible collision with an entry that was inserted while
> we didn't hold the lock.
> 
> Can we just try and insert it, then if it fails with -ENOMEM we just do
> our normal error path, dropping the tree_lock and returning the error?

If we don't preload, the allocations will happen with GFP_ATOMIC. That
should be avoided if possible since atomic allocations are pretty
restricted. So basically all of the pagecache code first allocates the nodes
it may need before acquiring locks and then uses those nodes later, and I
have mirrored that behavior. Note that we take the hit of dropping the lock
only if we really need to allocate a new radix tree node, so about once per
64 new entries. So it is not too bad.
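
For reference, this is the usual pattern the patch follows (sketch):

	/* Allocate radix tree nodes outside of the spinlock... */
	err = radix_tree_preload(mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
	if (err)
		return ERR_PTR(err);
	spin_lock_irq(&mapping->tree_lock);
	/* ...so the insertion itself does not need GFP_ATOMIC allocations */
	err = radix_tree_insert(&mapping->page_tree, index, entry);
	radix_tree_preload_end();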

> > -	WARN_ON_ONCE(pmd_entry && !dirty);
> >  	if (dirty)
> >  		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> >  
> > -	spin_lock_irq(&mapping->tree_lock);
> > -
> > -	entry = radix_tree_lookup(page_tree, pmd_index);
> > -	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD) {
> > -		index = pmd_index;
> > -		goto dirty;
> > +	/* Replacing hole page with block mapping? */
> > +	if (!radix_tree_exceptional_entry(entry)) {
> > +		hole_fill = true;
> > +		error = radix_tree_preload(gfp_mask);
> > +		if (error)
> > +			return ERR_PTR(error);
> >  	}
> >  
> > -	entry = radix_tree_lookup(page_tree, index);
> > -	if (entry) {
> > -		type = RADIX_DAX_TYPE(entry);
> > -		if (WARN_ON_ONCE(type != RADIX_DAX_PTE &&
> > -					type != RADIX_DAX_PMD)) {
> > -			error = -EIO;
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	ret = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
> > +		       DAX_ENTRY_LOCK);
> > +	if (hole_fill) {
> > +		__delete_from_page_cache(entry, NULL);
> > +		error = radix_tree_insert(page_tree, index, ret);
> > +		if (error) {
> > +			ret = ERR_PTR(error);
> >  			goto unlock;
> >  		}
> > +		mapping->nrexceptional++;
> > +	} else {
> > +		void **slot;
> > +		void *ret2;
> >  
> > -		if (!pmd_entry || type == RADIX_DAX_PMD)
> > -			goto dirty;
> > -
> > -		/*
> > -		 * We only insert dirty PMD entries into the radix tree.  This
> > -		 * means we don't need to worry about removing a dirty PTE
> > -		 * entry and inserting a clean PMD entry, thus reducing the
> > -		 * range we would flush with a follow-up fsync/msync call.
> > -		 */
> > -		radix_tree_delete(&mapping->page_tree, index);
> > -		mapping->nrexceptional--;
> > -	}
> > -
> > -	if (sector == NO_SECTOR) {
> > -		/*
> > -		 * This can happen during correct operation if our pfn_mkwrite
> > -		 * fault raced against a hole punch operation.  If this
> > -		 * happens the pte that was hole punched will have been
> > -		 * unmapped and the radix tree entry will have been removed by
> > -		 * the time we are called, but the call will still happen.  We
> > -		 * will return all the way up to wp_pfn_shared(), where the
> > -		 * pte_same() check will fail, eventually causing page fault
> > -		 * to be retried by the CPU.
> > -		 */
> > -		goto unlock;
> > +		ret2 = __radix_tree_lookup(page_tree, index, NULL, &slot);
> 
> You don't need ret2.  You can just compare 'entry' with '*slot' - see
> dax_writeback_one() for an example.

Hmm, but if we want to do this cleanly (and get all the lockdep
verification), we should use

radix_tree_deref_slot_protected(slot, &mapping->tree_lock)

instead of *slot. And at that point my fingers hurt so much that I just
create a new variable for caching the result ;). BTW, this has prompted me
to also fix lock_slot, unlock_slot, and slot_locked to use proper RCU
primitives for modifying slot contents.
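
Roughly like this (untested sketch of what I have in mind):

	static inline void *lock_slot(struct address_space *mapping, void **slot)
	{
		unsigned long entry = (unsigned long)
			radix_tree_deref_slot_protected(slot, &mapping->tree_lock);

		entry |= DAX_ENTRY_LOCK;
		radix_tree_replace_slot(slot, (void *)entry);
		return (void *)entry;
	}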

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 88+ messages in thread

end of thread, other threads:[~2016-04-04  8:34 UTC | newest]

Thread overview: 88+ messages
2016-03-21 13:22 [RFC v2] [PATCH 0/10] DAX page fault locking Jan Kara
2016-03-21 13:22 ` Jan Kara
2016-03-21 13:22 ` [PATCH 01/10] DAX: move RADIX_DAX_ definitions to dax.c Jan Kara
2016-03-21 13:22   ` Jan Kara
2016-03-21 17:25   ` Matthew Wilcox
2016-03-21 17:25     ` Matthew Wilcox
2016-03-21 13:22 ` [PATCH 02/10] radix-tree: make 'indirect' bit available to exception entries Jan Kara
2016-03-21 13:22   ` Jan Kara
2016-03-21 17:34   ` Matthew Wilcox
2016-03-21 17:34     ` Matthew Wilcox
2016-03-22  9:12     ` Jan Kara
2016-03-22  9:12       ` Jan Kara
2016-03-22  9:27       ` Matthew Wilcox
2016-03-22  9:27         ` Matthew Wilcox
2016-03-22 10:37         ` Jan Kara
2016-03-22 10:37           ` Jan Kara
2016-03-23 16:41           ` Ross Zwisler
2016-03-23 16:41             ` Ross Zwisler
2016-03-24 12:31             ` Jan Kara
2016-03-24 12:31               ` Jan Kara
2016-03-21 13:22 ` [PATCH 03/10] dax: Remove complete_unwritten argument Jan Kara
2016-03-21 13:22   ` Jan Kara
2016-03-23 17:12   ` Ross Zwisler
2016-03-23 17:12     ` Ross Zwisler
2016-03-24 12:32     ` Jan Kara
2016-03-24 12:32       ` Jan Kara
2016-03-21 13:22 ` [PATCH 04/10] dax: Fix data corruption for written and mmapped files Jan Kara
2016-03-21 13:22   ` Jan Kara
2016-03-23 17:39   ` Ross Zwisler
2016-03-23 17:39     ` Ross Zwisler
2016-03-24 12:51     ` Jan Kara
2016-03-24 12:51       ` Jan Kara
2016-03-29 15:17       ` Ross Zwisler
2016-03-29 15:17         ` Ross Zwisler
2016-03-21 13:22 ` [PATCH 05/10] dax: Allow DAX code to replace exceptional entries Jan Kara
2016-03-21 13:22   ` Jan Kara
2016-03-23 17:52   ` Ross Zwisler
2016-03-23 17:52     ` Ross Zwisler
2016-03-24 10:42     ` Jan Kara
2016-03-24 10:42       ` Jan Kara
2016-03-21 13:22 ` [PATCH 06/10] dax: Remove redundant inode size checks Jan Kara
2016-03-21 13:22   ` Jan Kara
2016-03-23 21:08   ` Ross Zwisler
2016-03-23 21:08     ` Ross Zwisler
2016-03-21 13:22 ` [PATCH 07/10] dax: Disable huge page handling Jan Kara
2016-03-21 13:22   ` Jan Kara
2016-03-23 20:50   ` Ross Zwisler
2016-03-23 20:50     ` Ross Zwisler
2016-03-24 12:56     ` Jan Kara
2016-03-24 12:56       ` Jan Kara
2016-03-21 13:22 ` [PATCH 08/10] dax: New fault locking Jan Kara
2016-03-21 13:22   ` Jan Kara
2016-03-29 21:57   ` Ross Zwisler
2016-03-29 21:57     ` Ross Zwisler
2016-03-31 16:27     ` Jan Kara
2016-03-31 16:27       ` Jan Kara
2016-03-21 13:22 ` [PATCH 09/10] dax: Use radix tree entry lock to protect cow faults Jan Kara
2016-03-21 13:22   ` Jan Kara
2016-03-21 19:11   ` Matthew Wilcox
2016-03-21 19:11     ` Matthew Wilcox
2016-03-22  7:03     ` Jan Kara
2016-03-22  7:03       ` Jan Kara
2016-03-29 22:18   ` Ross Zwisler
2016-03-29 22:18     ` Ross Zwisler
2016-03-21 13:22 ` [PATCH 10/10] dax: Remove i_mmap_lock protection Jan Kara
2016-03-21 13:22   ` Jan Kara
2016-03-29 22:17   ` Ross Zwisler
2016-03-29 22:17     ` Ross Zwisler
2016-03-21 17:41 ` [RFC v2] [PATCH 0/10] DAX page fault locking Matthew Wilcox
2016-03-21 17:41   ` Matthew Wilcox
2016-03-23 15:09   ` Jan Kara
2016-03-23 15:09     ` Jan Kara
2016-03-23 20:50     ` Matthew Wilcox
2016-03-23 20:50       ` Matthew Wilcox
2016-03-24 10:00     ` Matthew Wilcox
2016-03-24 10:00       ` Matthew Wilcox
2016-03-22 19:32 ` Ross Zwisler
2016-03-22 19:32   ` Ross Zwisler
2016-03-22 21:07   ` Toshi Kani
2016-03-22 21:07     ` Toshi Kani
2016-03-22 21:15     ` Dave Chinner
2016-03-22 21:15       ` Dave Chinner
2016-03-23  9:45     ` Jan Kara
2016-03-23  9:45       ` Jan Kara
2016-03-23 15:11       ` Toshi Kani
2016-03-23 15:11         ` Toshi Kani
  -- strict thread matches above, loose matches on Subject: below --
2016-03-21 13:21 Jan Kara
2016-03-21 13:21 ` Jan Kara
