linux-kernel.vger.kernel.org archive mirror
* [PATCH v4 00/12] re-enable DAX PMD support
@ 2016-09-29 22:49 Ross Zwisler
  2016-09-29 22:49 ` [PATCH v4 01/12] ext4: allow DAX writeback for hole punch Ross Zwisler
                   ` (13 more replies)
  0 siblings, 14 replies; 46+ messages in thread
From: Ross Zwisler @ 2016-09-29 22:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking.  This series allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled.

Ted, can you please take the ext2 + ext4 patches through your tree?  Dave,
can you please take the rest through the XFS tree?

Changes since v3:
 - Corrected dax iomap code namespace for functions defined in fs/dax.c.
   (Dave Chinner)
 - Added leading "dax" namespace to new static functions in fs/dax.c.
   (Dave Chinner)
 - Made all DAX PMD code in fs/dax.c conditionally compiled based on
   CONFIG_FS_DAX_PMD.  Otherwise a stub in include/linux/dax.h that just
   returns VM_FAULT_FALLBACK will be used.  (Dave Chinner)
 - Removed dynamic debugging messages from DAX PMD fault path.  Debugging
   tracepoints will be added later to both the PTE and PMD paths via a
   later patch set. (Dave Chinner)
 - Added a comment to ext2_dax_vm_ops explaining why we don't support DAX
   PMD faults in ext2. (Dave Chinner)

This was built upon xfs/for-next with PMD performance fixes from Toshi Kani
and Dan Williams.  Dan's patch has already been merged for v4.8, and
Toshi's patches are currently queued in Andrew Morton's mm tree for v4.9
inclusion.

Here is a tree containing my changes and all the fixes that I've been testing:
https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=dax_pmd_v4

This tree has passed xfstests for ext2, ext4 and XFS both with and without DAX,
and has passed targeted testing where I inserted, removed and flushed DAX PTEs
and PMDs in every combination I could think of.

Previously reported performance numbers:

In some simple mmap I/O testing with FIO the use of PMD faults more than
doubles I/O performance as compared with PTE faults.  Here is the FIO script I
used for my testing:

  [global]
  bs=4k
  size=2G
  directory=/mnt/pmem0
  ioengine=mmap
  [randrw]
  rw=randrw

Here are the performance results with XFS using only pte faults:
   READ: io=1022.7MB, aggrb=557610KB/s, minb=557610KB/s, maxb=557610KB/s, mint=1878msec, maxt=1878msec
  WRITE: io=1025.4MB, aggrb=559084KB/s, minb=559084KB/s, maxb=559084KB/s, mint=1878msec, maxt=1878msec

Here are performance numbers for that same test using PMD faults:
   READ: io=1022.7MB, aggrb=1406.7MB/s, minb=1406.7MB/s, maxb=1406.7MB/s, mint=727msec, maxt=727msec
  WRITE: io=1025.4MB, aggrb=1410.4MB/s, minb=1410.4MB/s, maxb=1410.4MB/s, mint=727msec, maxt=727msec

This was done on a random lab machine with a PMEM device made from memmap'd
RAM.  To get XFS to use PMD faults, I did the following:

  mkfs.xfs -f -d su=2m,sw=1 /dev/pmem0
  mount -o dax /dev/pmem0 /mnt/pmem0
  xfs_io -c "extsize 2m" /mnt/pmem0

Ross Zwisler (12):
  ext4: allow DAX writeback for hole punch
  ext4: tell DAX the size of allocation holes
  dax: remove buffer_size_valid()
  ext2: remove support for DAX PMD faults
  dax: make 'wait_table' global variable static
  dax: consistent variable naming for DAX entries
  dax: coordinate locking for offsets in PMD range
  dax: remove dax_pmd_fault()
  dax: correct dax iomap code namespace
  dax: add struct iomap based DAX PMD support
  xfs: use struct iomap based DAX PMD fault path
  dax: remove "depends on BROKEN" from FS_DAX_PMD

 fs/Kconfig          |   1 -
 fs/dax.c            | 650 +++++++++++++++++++++++++++-------------------------
 fs/ext2/file.c      |  35 +--
 fs/ext4/inode.c     |   7 +-
 fs/xfs/xfs_aops.c   |  25 +-
 fs/xfs/xfs_aops.h   |   3 -
 fs/xfs/xfs_file.c   |  10 +-
 include/linux/dax.h |  48 +++-
 mm/filemap.c        |   6 +-
 9 files changed, 402 insertions(+), 383 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH v4 01/12] ext4: allow DAX writeback for hole punch
  2016-09-29 22:49 [PATCH v4 00/12] re-enable DAX PMD support Ross Zwisler
@ 2016-09-29 22:49 ` Ross Zwisler
  2016-09-29 22:49 ` [PATCH v4 02/12] ext4: tell DAX the size of allocation holes Ross Zwisler
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 46+ messages in thread
From: Ross Zwisler @ 2016-09-29 22:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs, stable

Currently when doing a DAX hole punch with ext4 we fail to do a writeback.
This is because the logic around filemap_write_and_wait_range() in
ext4_punch_hole() only looks for dirty page cache pages in the radix tree,
not for dirty DAX exceptional entries.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>
---
 fs/ext4/inode.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3131747..0900cb4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3890,7 +3890,7 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
 }
 
 /*
- * ext4_punch_hole: punches a hole in a file by releaseing the blocks
+ * ext4_punch_hole: punches a hole in a file by releasing the blocks
  * associated with the given offset and length
  *
  * @inode:  File inode
@@ -3919,7 +3919,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 	 * Write out all dirty pages to avoid race conditions
 	 * Then release them.
 	 */
-	if (mapping->nrpages && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 		ret = filemap_write_and_wait_range(mapping, offset,
 						   offset + length - 1);
 		if (ret)
-- 
2.7.4

* [PATCH v4 02/12] ext4: tell DAX the size of allocation holes
  2016-09-29 22:49 [PATCH v4 00/12] re-enable DAX PMD support Ross Zwisler
  2016-09-29 22:49 ` [PATCH v4 01/12] ext4: allow DAX writeback for hole punch Ross Zwisler
@ 2016-09-29 22:49 ` Ross Zwisler
  2016-09-29 22:49 ` [PATCH v4 03/12] dax: remove buffer_size_valid() Ross Zwisler
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 46+ messages in thread
From: Ross Zwisler @ 2016-09-29 22:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

When DAX calls _ext4_get_block() and the file offset points to a hole we
currently don't set bh->b_size.  This is currently worked around via
buffer_size_valid() in fs/dax.c.

_ext4_get_block() has the hole size information from ext4_map_blocks(), so
populate bh->b_size so we can remove buffer_size_valid() in a later patch.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0900cb4..9075fac 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -759,6 +759,9 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
 		ext4_update_bh_state(bh, map.m_flags);
 		bh->b_size = inode->i_sb->s_blocksize * map.m_len;
 		ret = 0;
+	} else if (ret == 0) {
+		/* hole case, need to fill in bh->b_size */
+		bh->b_size = inode->i_sb->s_blocksize * map.m_len;
 	}
 	return ret;
 }
-- 
2.7.4

* [PATCH v4 03/12] dax: remove buffer_size_valid()
  2016-09-29 22:49 [PATCH v4 00/12] re-enable DAX PMD support Ross Zwisler
  2016-09-29 22:49 ` [PATCH v4 01/12] ext4: allow DAX writeback for hole punch Ross Zwisler
  2016-09-29 22:49 ` [PATCH v4 02/12] ext4: tell DAX the size of allocation holes Ross Zwisler
@ 2016-09-29 22:49 ` Ross Zwisler
  2016-09-30  8:49   ` Christoph Hellwig
  2016-09-29 22:49 ` [PATCH v4 04/12] ext2: remove support for DAX PMD faults Ross Zwisler
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 46+ messages in thread
From: Ross Zwisler @ 2016-09-29 22:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Now that ext4 properly sets bh.b_size when we call get_block() for a hole,
rely on that value and remove the buffer_size_valid() sanity check.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 22 +---------------------
 1 file changed, 1 insertion(+), 21 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index cc025f8..9b9be8a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -123,19 +123,6 @@ static bool buffer_written(struct buffer_head *bh)
 	return buffer_mapped(bh) && !buffer_unwritten(bh);
 }
 
-/*
- * When ext4 encounters a hole, it returns without modifying the buffer_head
- * which means that we can't trust b_size.  To cope with this, we set b_state
- * to 0 before calling get_block and, if any bit is set, we know we can trust
- * b_size.  Unfortunate, really, since ext4 knows precisely how long a hole is
- * and would save us time calling get_block repeatedly.
- */
-static bool buffer_size_valid(struct buffer_head *bh)
-{
-	return bh->b_state != 0;
-}
-
-
 static sector_t to_sector(const struct buffer_head *bh,
 		const struct inode *inode)
 {
@@ -177,8 +164,6 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 				rc = get_block(inode, block, bh, rw == WRITE);
 				if (rc)
 					break;
-				if (!buffer_size_valid(bh))
-					bh->b_size = 1 << blkbits;
 				bh_max = pos - first + bh->b_size;
 				bdev = bh->b_bdev;
 				/*
@@ -1012,12 +997,7 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 
 	bdev = bh.b_bdev;
 
-	/*
-	 * If the filesystem isn't willing to tell us the length of a hole,
-	 * just fall back to PTEs.  Calling get_block 512 times in a loop
-	 * would be silly.
-	 */
-	if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE) {
+	if (bh.b_size < PMD_SIZE) {
 		dax_pmd_dbg(&bh, address, "allocated block too small");
 		return VM_FAULT_FALLBACK;
 	}
-- 
2.7.4

* [PATCH v4 04/12] ext2: remove support for DAX PMD faults
  2016-09-29 22:49 [PATCH v4 00/12] re-enable DAX PMD support Ross Zwisler
                   ` (2 preceding siblings ...)
  2016-09-29 22:49 ` [PATCH v4 03/12] dax: remove buffer_size_valid() Ross Zwisler
@ 2016-09-29 22:49 ` Ross Zwisler
  2016-09-30  8:49   ` Christoph Hellwig
  2016-10-03  9:35   ` Jan Kara
  2016-09-29 22:49 ` [PATCH v4 05/12] dax: make 'wait_table' global variable static Ross Zwisler
                   ` (9 subsequent siblings)
  13 siblings, 2 replies; 46+ messages in thread
From: Ross Zwisler @ 2016-09-29 22:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

DAX PMD support was added via the following commit:

commit e7b1ea2ad658 ("ext2: huge page fault support")

I believe this path to be untested as ext2 doesn't reliably provide block
allocations that are aligned to 2MiB.  In my testing I've been unable to
get ext2 to actually fault in a PMD.  It always fails with a "pfn
unaligned" message because the sector returned by ext2_get_block() isn't
aligned.

I've tried various settings for the "stride" and "stripe_width" extended
options to mkfs.ext2, without any luck.

Since we can't reliably get PMDs, remove support so that we don't have an
untested code path that we may someday traverse when we happen to get an
aligned block allocation.  This should also make 4k DAX faults in ext2 a
bit faster since they will no longer have to call the PMD fault handler
only to get a response of VM_FAULT_FALLBACK.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/ext2/file.c | 29 ++++++-----------------------
 1 file changed, 6 insertions(+), 23 deletions(-)

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 0ca363d..0f257f8 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -107,27 +107,6 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	return ret;
 }
 
-static int ext2_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
-						pmd_t *pmd, unsigned int flags)
-{
-	struct inode *inode = file_inode(vma->vm_file);
-	struct ext2_inode_info *ei = EXT2_I(inode);
-	int ret;
-
-	if (flags & FAULT_FLAG_WRITE) {
-		sb_start_pagefault(inode->i_sb);
-		file_update_time(vma->vm_file);
-	}
-	down_read(&ei->dax_sem);
-
-	ret = dax_pmd_fault(vma, addr, pmd, flags, ext2_get_block);
-
-	up_read(&ei->dax_sem);
-	if (flags & FAULT_FLAG_WRITE)
-		sb_end_pagefault(inode->i_sb);
-	return ret;
-}
-
 static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
 		struct vm_fault *vmf)
 {
@@ -154,7 +133,11 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
 
 static const struct vm_operations_struct ext2_dax_vm_ops = {
 	.fault		= ext2_dax_fault,
-	.pmd_fault	= ext2_dax_pmd_fault,
+	/*
+	 * .pmd_fault is not supported for DAX because allocation in ext2
+	 * cannot be reliably aligned to huge page sizes and so pmd faults
+	 * will always fail and fall back to regular faults.
+	 */
 	.page_mkwrite	= ext2_dax_fault,
 	.pfn_mkwrite	= ext2_dax_pfn_mkwrite,
 };
@@ -166,7 +149,7 @@ static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
 
 	file_accessed(file);
 	vma->vm_ops = &ext2_dax_vm_ops;
-	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+	vma->vm_flags |= VM_MIXEDMAP;
 	return 0;
 }
 #else
-- 
2.7.4

* [PATCH v4 05/12] dax: make 'wait_table' global variable static
  2016-09-29 22:49 [PATCH v4 00/12] re-enable DAX PMD support Ross Zwisler
                   ` (3 preceding siblings ...)
  2016-09-29 22:49 ` [PATCH v4 04/12] ext2: remove support for DAX PMD faults Ross Zwisler
@ 2016-09-29 22:49 ` Ross Zwisler
  2016-09-30  8:50   ` Christoph Hellwig
  2016-10-03  9:36   ` Jan Kara
  2016-09-29 22:49 ` [PATCH v4 06/12] dax: consistent variable naming for DAX entries Ross Zwisler
                   ` (8 subsequent siblings)
  13 siblings, 2 replies; 46+ messages in thread
From: Ross Zwisler @ 2016-09-29 22:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

The global 'wait_table' variable is only used within fs/dax.c, and
generates the following sparse warning:

fs/dax.c:39:19: warning: symbol 'wait_table' was not declared. Should it be static?

Make it static so that its scope is local to fs/dax.c, which also silences
the sparse warning.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 9b9be8a..ac28cdf 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -52,7 +52,7 @@
 #define DAX_WAIT_TABLE_BITS 12
 #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
 
-wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
+static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
 
 static int __init init_dax_wait_table(void)
 {
-- 
2.7.4

* [PATCH v4 06/12] dax: consistent variable naming for DAX entries
  2016-09-29 22:49 [PATCH v4 00/12] re-enable DAX PMD support Ross Zwisler
                   ` (4 preceding siblings ...)
  2016-09-29 22:49 ` [PATCH v4 05/12] dax: make 'wait_table' global variable static Ross Zwisler
@ 2016-09-29 22:49 ` Ross Zwisler
  2016-09-30  8:50   ` Christoph Hellwig
  2016-10-03  9:37   ` Jan Kara
  2016-09-29 22:49 ` [PATCH v4 07/12] dax: coordinate locking for offsets in PMD range Ross Zwisler
                   ` (7 subsequent siblings)
  13 siblings, 2 replies; 46+ messages in thread
From: Ross Zwisler @ 2016-09-29 22:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

No functional change.

Consistently use the variable name 'entry' instead of 'ret' for DAX radix
tree entries.  This was already happening in most of the code, so update
get_unlocked_mapping_entry(), grab_mapping_entry() and
dax_unlock_mapping_entry().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c | 34 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index ac28cdf..baef586 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -357,7 +357,7 @@ static inline void *unlock_slot(struct address_space *mapping, void **slot)
 static void *get_unlocked_mapping_entry(struct address_space *mapping,
 					pgoff_t index, void ***slotp)
 {
-	void *ret, **slot;
+	void *entry, **slot;
 	struct wait_exceptional_entry_queue ewait;
 	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
 
@@ -367,13 +367,13 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 	ewait.key.index = index;
 
 	for (;;) {
-		ret = __radix_tree_lookup(&mapping->page_tree, index, NULL,
+		entry = __radix_tree_lookup(&mapping->page_tree, index, NULL,
 					  &slot);
-		if (!ret || !radix_tree_exceptional_entry(ret) ||
+		if (!entry || !radix_tree_exceptional_entry(entry) ||
 		    !slot_locked(mapping, slot)) {
 			if (slotp)
 				*slotp = slot;
-			return ret;
+			return entry;
 		}
 		prepare_to_wait_exclusive(wq, &ewait.wait,
 					  TASK_UNINTERRUPTIBLE);
@@ -396,13 +396,13 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
  */
 static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
-	void *ret, **slot;
+	void *entry, **slot;
 
 restart:
 	spin_lock_irq(&mapping->tree_lock);
-	ret = get_unlocked_mapping_entry(mapping, index, &slot);
+	entry = get_unlocked_mapping_entry(mapping, index, &slot);
 	/* No entry for given index? Make sure radix tree is big enough. */
-	if (!ret) {
+	if (!entry) {
 		int err;
 
 		spin_unlock_irq(&mapping->tree_lock);
@@ -410,10 +410,10 @@ restart:
 				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
 		if (err)
 			return ERR_PTR(err);
-		ret = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
+		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
 			       RADIX_DAX_ENTRY_LOCK);
 		spin_lock_irq(&mapping->tree_lock);
-		err = radix_tree_insert(&mapping->page_tree, index, ret);
+		err = radix_tree_insert(&mapping->page_tree, index, entry);
 		radix_tree_preload_end();
 		if (err) {
 			spin_unlock_irq(&mapping->tree_lock);
@@ -425,11 +425,11 @@ restart:
 		/* Good, we have inserted empty locked entry into the tree. */
 		mapping->nrexceptional++;
 		spin_unlock_irq(&mapping->tree_lock);
-		return ret;
+		return entry;
 	}
 	/* Normal page in radix tree? */
-	if (!radix_tree_exceptional_entry(ret)) {
-		struct page *page = ret;
+	if (!radix_tree_exceptional_entry(entry)) {
+		struct page *page = entry;
 
 		get_page(page);
 		spin_unlock_irq(&mapping->tree_lock);
@@ -442,9 +442,9 @@ restart:
 		}
 		return page;
 	}
-	ret = lock_slot(mapping, slot);
+	entry = lock_slot(mapping, slot);
 	spin_unlock_irq(&mapping->tree_lock);
-	return ret;
+	return entry;
 }
 
 void dax_wake_mapping_entry_waiter(struct address_space *mapping,
@@ -469,11 +469,11 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
 
 void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
-	void *ret, **slot;
+	void *entry, **slot;
 
 	spin_lock_irq(&mapping->tree_lock);
-	ret = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
-	if (WARN_ON_ONCE(!ret || !radix_tree_exceptional_entry(ret) ||
+	entry = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
+	if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry) ||
 			 !slot_locked(mapping, slot))) {
 		spin_unlock_irq(&mapping->tree_lock);
 		return;
-- 
2.7.4

* [PATCH v4 07/12] dax: coordinate locking for offsets in PMD range
  2016-09-29 22:49 [PATCH v4 00/12] re-enable DAX PMD support Ross Zwisler
                   ` (5 preceding siblings ...)
  2016-09-29 22:49 ` [PATCH v4 06/12] dax: consistent variable naming for DAX entries Ross Zwisler
@ 2016-09-29 22:49 ` Ross Zwisler
  2016-09-30  9:44   ` Christoph Hellwig
  2016-10-03  9:55   ` Jan Kara
  2016-09-29 22:49 ` [PATCH v4 08/12] dax: remove dax_pmd_fault() Ross Zwisler
                   ` (6 subsequent siblings)
  13 siblings, 2 replies; 46+ messages in thread
From: Ross Zwisler @ 2016-09-29 22:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

DAX radix tree locking currently locks entries based on the unique
combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
This works for PTEs, but as we move to PMDs we need all offsets within the
range covered by a PMD to map to the same bit lock.
To accomplish this, for ranges covered by a PMD entry we will instead lock
based on the page offset of the beginning of the PMD entry.  The 'mapping'
pointer is still used in the same way.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 37 ++++++++++++++++++++++++-------------
 include/linux/dax.h |  2 +-
 mm/filemap.c        |  2 +-
 3 files changed, 26 insertions(+), 15 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index baef586..406feea 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -64,10 +64,17 @@ static int __init init_dax_wait_table(void)
 }
 fs_initcall(init_dax_wait_table);
 
+static pgoff_t dax_entry_start(pgoff_t index, void *entry)
+{
+	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
+		index &= (PMD_MASK >> PAGE_SHIFT);
+	return index;
+}
+
 static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
-					      pgoff_t index)
+					      pgoff_t entry_start)
 {
-	unsigned long hash = hash_long((unsigned long)mapping ^ index,
+	unsigned long hash = hash_long((unsigned long)mapping ^ entry_start,
 				       DAX_WAIT_TABLE_BITS);
 	return wait_table + hash;
 }
@@ -285,7 +292,7 @@ EXPORT_SYMBOL_GPL(dax_do_io);
  */
 struct exceptional_entry_key {
 	struct address_space *mapping;
-	unsigned long index;
+	pgoff_t entry_start;
 };
 
 struct wait_exceptional_entry_queue {
@@ -301,7 +308,7 @@ static int wake_exceptional_entry_func(wait_queue_t *wait, unsigned int mode,
 		container_of(wait, struct wait_exceptional_entry_queue, wait);
 
 	if (key->mapping != ewait->key.mapping ||
-	    key->index != ewait->key.index)
+	    key->entry_start != ewait->key.entry_start)
 		return 0;
 	return autoremove_wake_function(wait, mode, sync, NULL);
 }
@@ -359,12 +366,10 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 {
 	void *entry, **slot;
 	struct wait_exceptional_entry_queue ewait;
-	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+	wait_queue_head_t *wq;
 
 	init_wait(&ewait.wait);
 	ewait.wait.func = wake_exceptional_entry_func;
-	ewait.key.mapping = mapping;
-	ewait.key.index = index;
 
 	for (;;) {
 		entry = __radix_tree_lookup(&mapping->page_tree, index, NULL,
@@ -375,6 +380,11 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 				*slotp = slot;
 			return entry;
 		}
+
+		wq = dax_entry_waitqueue(mapping,
+				dax_entry_start(index, entry));
+		ewait.key.mapping = mapping;
+		ewait.key.entry_start = dax_entry_start(index, entry);
 		prepare_to_wait_exclusive(wq, &ewait.wait,
 					  TASK_UNINTERRUPTIBLE);
 		spin_unlock_irq(&mapping->tree_lock);
@@ -447,10 +457,11 @@ restart:
 	return entry;
 }
 
-void dax_wake_mapping_entry_waiter(struct address_space *mapping,
+void dax_wake_mapping_entry_waiter(void *entry, struct address_space *mapping,
 				   pgoff_t index, bool wake_all)
 {
-	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+	wait_queue_head_t *wq = dax_entry_waitqueue(mapping,
+						dax_entry_start(index, entry));
 
 	/*
 	 * Checking for locked entry and prepare_to_wait_exclusive() happens
@@ -462,7 +473,7 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
 		struct exceptional_entry_key key;
 
 		key.mapping = mapping;
-		key.index = index;
+		key.entry_start = dax_entry_start(index, entry);
 		__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key);
 	}
 }
@@ -480,7 +491,7 @@ void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
 	}
 	unlock_slot(mapping, slot);
 	spin_unlock_irq(&mapping->tree_lock);
-	dax_wake_mapping_entry_waiter(mapping, index, false);
+	dax_wake_mapping_entry_waiter(entry, mapping, index, false);
 }
 
 static void put_locked_mapping_entry(struct address_space *mapping,
@@ -505,7 +516,7 @@ static void put_unlocked_mapping_entry(struct address_space *mapping,
 		return;
 
 	/* We have to wake up next waiter for the radix tree entry lock */
-	dax_wake_mapping_entry_waiter(mapping, index, false);
+	dax_wake_mapping_entry_waiter(entry, mapping, index, false);
 }
 
 /*
@@ -532,7 +543,7 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
 	radix_tree_delete(&mapping->page_tree, index);
 	mapping->nrexceptional--;
 	spin_unlock_irq(&mapping->tree_lock);
-	dax_wake_mapping_entry_waiter(mapping, index, true);
+	dax_wake_mapping_entry_waiter(entry, mapping, index, true);
 
 	return 1;
 }
diff --git a/include/linux/dax.h b/include/linux/dax.h
index add6c4b..4065601 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -21,7 +21,7 @@ int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			struct iomap_ops *ops);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
-void dax_wake_mapping_entry_waiter(struct address_space *mapping,
+void dax_wake_mapping_entry_waiter(void *entry, struct address_space *mapping,
 				   pgoff_t index, bool wake_all);
 
 #ifdef CONFIG_FS_DAX
diff --git a/mm/filemap.c b/mm/filemap.c
index 8a287df..35e880d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -617,7 +617,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
 			if (node)
 				workingset_node_pages_dec(node);
 			/* Wakeup waiters for exceptional entry lock */
-			dax_wake_mapping_entry_waiter(mapping, page->index,
+			dax_wake_mapping_entry_waiter(p, mapping, page->index,
 						      false);
 		}
 	}
-- 
2.7.4

* [PATCH v4 08/12] dax: remove dax_pmd_fault()
  2016-09-29 22:49 [PATCH v4 00/12] re-enable DAX PMD support Ross Zwisler
                   ` (6 preceding siblings ...)
  2016-09-29 22:49 ` [PATCH v4 07/12] dax: coordinate locking for offsets in PMD range Ross Zwisler
@ 2016-09-29 22:49 ` Ross Zwisler
  2016-09-30  8:50   ` Christoph Hellwig
  2016-10-03  9:56   ` Jan Kara
  2016-09-29 22:49 ` [PATCH v4 09/12] dax: correct dax iomap code namespace Ross Zwisler
                   ` (5 subsequent siblings)
  13 siblings, 2 replies; 46+ messages in thread
From: Ross Zwisler @ 2016-09-29 22:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

dax_pmd_fault() is the old struct buffer_head + get_block_t based 2 MiB DAX
fault handler.  This fault handler has been disabled for several kernel
releases, and support for PMDs will be reintroduced using the struct iomap
interface instead.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 213 ----------------------------------------------------
 include/linux/dax.h |   6 +-
 2 files changed, 1 insertion(+), 218 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 406feea..b5e7b13 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -909,219 +909,6 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 }
 EXPORT_SYMBOL_GPL(dax_fault);
 
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
-/*
- * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
- * more often than one might expect in the below function.
- */
-#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
-
-static void __dax_dbg(struct buffer_head *bh, unsigned long address,
-		const char *reason, const char *fn)
-{
-	if (bh) {
-		char bname[BDEVNAME_SIZE];
-		bdevname(bh->b_bdev, bname);
-		pr_debug("%s: %s addr: %lx dev %s state %lx start %lld "
-			"length %zd fallback: %s\n", fn, current->comm,
-			address, bname, bh->b_state, (u64)bh->b_blocknr,
-			bh->b_size, reason);
-	} else {
-		pr_debug("%s: %s addr: %lx fallback: %s\n", fn,
-			current->comm, address, reason);
-	}
-}
-
-#define dax_pmd_dbg(bh, address, reason)	__dax_dbg(bh, address, reason, "dax_pmd")
-
-/**
- * dax_pmd_fault - handle a PMD fault on a DAX file
- * @vma: The virtual memory area where the fault occurred
- * @vmf: The description of the fault
- * @get_block: The filesystem method used to translate file offsets to blocks
- *
- * When a page fault occurs, filesystems may call this helper in their
- * pmd_fault handler for DAX files.
- */
-int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
-		pmd_t *pmd, unsigned int flags, get_block_t get_block)
-{
-	struct file *file = vma->vm_file;
-	struct address_space *mapping = file->f_mapping;
-	struct inode *inode = mapping->host;
-	struct buffer_head bh;
-	unsigned blkbits = inode->i_blkbits;
-	unsigned long pmd_addr = address & PMD_MASK;
-	bool write = flags & FAULT_FLAG_WRITE;
-	struct block_device *bdev;
-	pgoff_t size, pgoff;
-	sector_t block;
-	int result = 0;
-	bool alloc = false;
-
-	/* dax pmd mappings require pfn_t_devmap() */
-	if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
-		return VM_FAULT_FALLBACK;
-
-	/* Fall back to PTEs if we're going to COW */
-	if (write && !(vma->vm_flags & VM_SHARED)) {
-		split_huge_pmd(vma, pmd, address);
-		dax_pmd_dbg(NULL, address, "cow write");
-		return VM_FAULT_FALLBACK;
-	}
-	/* If the PMD would extend outside the VMA */
-	if (pmd_addr < vma->vm_start) {
-		dax_pmd_dbg(NULL, address, "vma start unaligned");
-		return VM_FAULT_FALLBACK;
-	}
-	if ((pmd_addr + PMD_SIZE) > vma->vm_end) {
-		dax_pmd_dbg(NULL, address, "vma end unaligned");
-		return VM_FAULT_FALLBACK;
-	}
-
-	pgoff = linear_page_index(vma, pmd_addr);
-	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	if (pgoff >= size)
-		return VM_FAULT_SIGBUS;
-	/* If the PMD would cover blocks out of the file */
-	if ((pgoff | PG_PMD_COLOUR) >= size) {
-		dax_pmd_dbg(NULL, address,
-				"offset + huge page size > file size");
-		return VM_FAULT_FALLBACK;
-	}
-
-	memset(&bh, 0, sizeof(bh));
-	bh.b_bdev = inode->i_sb->s_bdev;
-	block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
-
-	bh.b_size = PMD_SIZE;
-
-	if (get_block(inode, block, &bh, 0) != 0)
-		return VM_FAULT_SIGBUS;
-
-	if (!buffer_mapped(&bh) && write) {
-		if (get_block(inode, block, &bh, 1) != 0)
-			return VM_FAULT_SIGBUS;
-		alloc = true;
-		WARN_ON_ONCE(buffer_unwritten(&bh) || buffer_new(&bh));
-	}
-
-	bdev = bh.b_bdev;
-
-	if (bh.b_size < PMD_SIZE) {
-		dax_pmd_dbg(&bh, address, "allocated block too small");
-		return VM_FAULT_FALLBACK;
-	}
-
-	/*
-	 * If we allocated new storage, make sure no process has any
-	 * zero pages covering this hole
-	 */
-	if (alloc) {
-		loff_t lstart = pgoff << PAGE_SHIFT;
-		loff_t lend = lstart + PMD_SIZE - 1; /* inclusive */
-
-		truncate_pagecache_range(inode, lstart, lend);
-	}
-
-	if (!write && !buffer_mapped(&bh)) {
-		spinlock_t *ptl;
-		pmd_t entry;
-		struct page *zero_page = get_huge_zero_page();
-
-		if (unlikely(!zero_page)) {
-			dax_pmd_dbg(&bh, address, "no zero page");
-			goto fallback;
-		}
-
-		ptl = pmd_lock(vma->vm_mm, pmd);
-		if (!pmd_none(*pmd)) {
-			spin_unlock(ptl);
-			dax_pmd_dbg(&bh, address, "pmd already present");
-			goto fallback;
-		}
-
-		dev_dbg(part_to_dev(bdev->bd_part),
-				"%s: %s addr: %lx pfn: <zero> sect: %llx\n",
-				__func__, current->comm, address,
-				(unsigned long long) to_sector(&bh, inode));
-
-		entry = mk_pmd(zero_page, vma->vm_page_prot);
-		entry = pmd_mkhuge(entry);
-		set_pmd_at(vma->vm_mm, pmd_addr, pmd, entry);
-		result = VM_FAULT_NOPAGE;
-		spin_unlock(ptl);
-	} else {
-		struct blk_dax_ctl dax = {
-			.sector = to_sector(&bh, inode),
-			.size = PMD_SIZE,
-		};
-		long length = dax_map_atomic(bdev, &dax);
-
-		if (length < 0) {
-			dax_pmd_dbg(&bh, address, "dax-error fallback");
-			goto fallback;
-		}
-		if (length < PMD_SIZE) {
-			dax_pmd_dbg(&bh, address, "dax-length too small");
-			dax_unmap_atomic(bdev, &dax);
-			goto fallback;
-		}
-		if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR) {
-			dax_pmd_dbg(&bh, address, "pfn unaligned");
-			dax_unmap_atomic(bdev, &dax);
-			goto fallback;
-		}
-
-		if (!pfn_t_devmap(dax.pfn)) {
-			dax_unmap_atomic(bdev, &dax);
-			dax_pmd_dbg(&bh, address, "pfn not in memmap");
-			goto fallback;
-		}
-		dax_unmap_atomic(bdev, &dax);
-
-		/*
-		 * For PTE faults we insert a radix tree entry for reads, and
-		 * leave it clean.  Then on the first write we dirty the radix
-		 * tree entry via the dax_pfn_mkwrite() path.  This sequence
-		 * allows the dax_pfn_mkwrite() call to be simpler and avoid a
-		 * call into get_block() to translate the pgoff to a sector in
-		 * order to be able to create a new radix tree entry.
-		 *
-		 * The PMD path doesn't have an equivalent to
-		 * dax_pfn_mkwrite(), though, so for a read followed by a
-		 * write we traverse all the way through dax_pmd_fault()
-		 * twice.  This means we can just skip inserting a radix tree
-		 * entry completely on the initial read and just wait until
-		 * the write to insert a dirty entry.
-		 */
-		if (write) {
-			/*
-			 * We should insert radix-tree entry and dirty it here.
-			 * For now this is broken...
-			 */
-		}
-
-		dev_dbg(part_to_dev(bdev->bd_part),
-				"%s: %s addr: %lx pfn: %lx sect: %llx\n",
-				__func__, current->comm, address,
-				pfn_t_to_pfn(dax.pfn),
-				(unsigned long long) dax.sector);
-		result |= vmf_insert_pfn_pmd(vma, address, pmd,
-				dax.pfn, write);
-	}
-
- out:
-	return result;
-
- fallback:
-	count_vm_event(THP_FAULT_FALLBACK);
-	result = VM_FAULT_FALLBACK;
-	goto out;
-}
-EXPORT_SYMBOL_GPL(dax_pmd_fault);
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-
 /**
  * dax_pfn_mkwrite - handle first write to DAX page
  * @vma: The virtual memory area where the fault occurred
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 4065601..d9a8350 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -48,16 +48,12 @@ static inline int __dax_zero_page_range(struct block_device *bdev,
 }
 #endif
 
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
-int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
-				unsigned int flags, get_block_t);
-#else
 static inline int dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 				pmd_t *pmd, unsigned int flags, get_block_t gb)
 {
 	return VM_FAULT_FALLBACK;
 }
-#endif
+
 int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
 #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v4 09/12] dax: correct dax iomap code namespace
  2016-09-29 22:49 [PATCH v4 00/12] re-enable DAX PMD support Ross Zwisler
                   ` (7 preceding siblings ...)
  2016-09-29 22:49 ` [PATCH v4 08/12] dax: remove dax_pmd_fault() Ross Zwisler
@ 2016-09-29 22:49 ` Ross Zwisler
  2016-09-30  8:51   ` Christoph Hellwig
  2016-10-03  9:57   ` Jan Kara
  2016-09-29 22:49 ` [PATCH v4 10/12] dax: add struct iomap based DAX PMD support Ross Zwisler
                   ` (4 subsequent siblings)
  13 siblings, 2 replies; 46+ messages in thread
From: Ross Zwisler @ 2016-09-29 22:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

The recently added DAX functions that use the new struct iomap data
structure were named iomap_dax_rw(), iomap_dax_fault() and
iomap_dax_actor().  These functions are defined in fs/dax.c, however, so
they should be part of the "dax" namespace rather than the "iomap" namespace.
Rename them to dax_iomap_rw(), dax_iomap_fault() and dax_iomap_actor()
respectively.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
---
 fs/dax.c            | 16 ++++++++--------
 fs/ext2/file.c      |  6 +++---
 fs/xfs/xfs_file.c   |  8 ++++----
 include/linux/dax.h |  4 ++--
 4 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index b5e7b13..6977e5e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1024,7 +1024,7 @@ EXPORT_SYMBOL_GPL(dax_truncate_page);
 
 #ifdef CONFIG_FS_IOMAP
 static loff_t
-iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
+dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct iomap *iomap)
 {
 	struct iov_iter *iter = data;
@@ -1081,7 +1081,7 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 }
 
 /**
- * iomap_dax_rw - Perform I/O to a DAX file
+ * dax_iomap_rw - Perform I/O to a DAX file
  * @iocb:	The control block for this I/O
  * @iter:	The addresses to do I/O from or to
  * @ops:	iomap ops passed from the file system
@@ -1091,7 +1091,7 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
  * and evicting any page cache pages in the region under I/O.
  */
 ssize_t
-iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
+dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 		struct iomap_ops *ops)
 {
 	struct address_space *mapping = iocb->ki_filp->f_mapping;
@@ -1121,7 +1121,7 @@ iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
 
 	while (iov_iter_count(iter)) {
 		ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
-				iter, iomap_dax_actor);
+				iter, dax_iomap_actor);
 		if (ret <= 0)
 			break;
 		pos += ret;
@@ -1131,10 +1131,10 @@ iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
 	iocb->ki_pos += done;
 	return done ? done : ret;
 }
-EXPORT_SYMBOL_GPL(iomap_dax_rw);
+EXPORT_SYMBOL_GPL(dax_iomap_rw);
 
 /**
- * iomap_dax_fault - handle a page fault on a DAX file
+ * dax_iomap_fault - handle a page fault on a DAX file
  * @vma: The virtual memory area where the fault occurred
  * @vmf: The description of the fault
  * @ops: iomap ops passed from the file system
@@ -1143,7 +1143,7 @@ EXPORT_SYMBOL_GPL(iomap_dax_rw);
  * or mkwrite handler for DAX files. Assumes the caller has done all the
  * necessary locking for the page fault to proceed successfully.
  */
-int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			struct iomap_ops *ops)
 {
 	struct address_space *mapping = vma->vm_file->f_mapping;
@@ -1245,5 +1245,5 @@ int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		return VM_FAULT_SIGBUS | major;
 	return VM_FAULT_NOPAGE | major;
 }
-EXPORT_SYMBOL_GPL(iomap_dax_fault);
+EXPORT_SYMBOL_GPL(dax_iomap_fault);
 #endif /* CONFIG_FS_IOMAP */
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 0f257f8..32a4913 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -38,7 +38,7 @@ static ssize_t ext2_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		return 0; /* skip atime */
 
 	inode_lock_shared(inode);
-	ret = iomap_dax_rw(iocb, to, &ext2_iomap_ops);
+	ret = dax_iomap_rw(iocb, to, &ext2_iomap_ops);
 	inode_unlock_shared(inode);
 
 	file_accessed(iocb->ki_filp);
@@ -62,7 +62,7 @@ static ssize_t ext2_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (ret)
 		goto out_unlock;
 
-	ret = iomap_dax_rw(iocb, from, &ext2_iomap_ops);
+	ret = dax_iomap_rw(iocb, from, &ext2_iomap_ops);
 	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
 		i_size_write(inode, iocb->ki_pos);
 		mark_inode_dirty(inode);
@@ -99,7 +99,7 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	}
 	down_read(&ei->dax_sem);
 
-	ret = iomap_dax_fault(vma, vmf, &ext2_iomap_ops);
+	ret = dax_iomap_fault(vma, vmf, &ext2_iomap_ops);
 
 	up_read(&ei->dax_sem);
 	if (vmf->flags & FAULT_FLAG_WRITE)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 882f264..00293d2 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -342,7 +342,7 @@ xfs_file_dax_read(
 		return 0; /* skip atime */
 
 	xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
-	ret = iomap_dax_rw(iocb, to, &xfs_iomap_ops);
+	ret = dax_iomap_rw(iocb, to, &xfs_iomap_ops);
 	xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
 
 	file_accessed(iocb->ki_filp);
@@ -721,7 +721,7 @@ xfs_file_dax_write(
 
 	trace_xfs_file_dax_write(ip, count, pos);
 
-	ret = iomap_dax_rw(iocb, from, &xfs_iomap_ops);
+	ret = dax_iomap_rw(iocb, from, &xfs_iomap_ops);
 	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
 		i_size_write(inode, iocb->ki_pos);
 		error = xfs_setfilesize(ip, pos, ret);
@@ -1468,7 +1468,7 @@ xfs_filemap_page_mkwrite(
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
 	if (IS_DAX(inode)) {
-		ret = iomap_dax_fault(vma, vmf, &xfs_iomap_ops);
+		ret = dax_iomap_fault(vma, vmf, &xfs_iomap_ops);
 	} else {
 		ret = iomap_page_mkwrite(vma, vmf, &xfs_iomap_ops);
 		ret = block_page_mkwrite_return(ret);
@@ -1502,7 +1502,7 @@ xfs_filemap_fault(
 		 * changes to xfs_get_blocks_direct() to map unwritten extent
 		 * ioend for conversion on read-only mappings.
 		 */
-		ret = iomap_dax_fault(vma, vmf, &xfs_iomap_ops);
+		ret = dax_iomap_fault(vma, vmf, &xfs_iomap_ops);
 	} else
 		ret = filemap_fault(vma, vmf);
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d9a8350..c4a51bb 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -11,13 +11,13 @@ struct iomap_ops;
 /* We use lowest available exceptional entry bit for locking */
 #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
 
-ssize_t iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
+ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 		struct iomap_ops *ops);
 ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *,
 		  get_block_t, dio_iodone_t, int flags);
 int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
-int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			struct iomap_ops *ops);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
-- 
2.7.4


* [PATCH v4 10/12] dax: add struct iomap based DAX PMD support
  2016-09-29 22:49 [PATCH v4 00/12] re-enable DAX PMD support Ross Zwisler
                   ` (8 preceding siblings ...)
  2016-09-29 22:49 ` [PATCH v4 09/12] dax: correct dax iomap code namespace Ross Zwisler
@ 2016-09-29 22:49 ` Ross Zwisler
  2016-09-30  9:56   ` Christoph Hellwig
  2016-10-03 10:59   ` Jan Kara
  2016-09-29 22:49 ` [PATCH v4 11/12] xfs: use struct iomap based DAX PMD fault path Ross Zwisler
                   ` (3 subsequent siblings)
  13 siblings, 2 replies; 46+ messages in thread
From: Ross Zwisler @ 2016-09-29 22:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking.  This patch allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled using the new struct
iomap based fault handlers.

There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
mappings that have an associated block allocation, and 4k DAX empty
entries.  The empty entries exist to provide locking for the duration of a
given page fault.

This patch adds three equivalent 2 MiB DAX entries: Huge Zero Page (HZP)
entries, PMD DAX entries that have associated block allocations, and 2 MiB
DAX empty entries.

Unlike the 4k case, where we insert a struct page * into the radix tree
for 4k zero pages, for HZP we insert a DAX exceptional entry with the new
RADIX_DAX_HZP flag set.  This is because the same 2 MiB zero page backs
every 2 MiB hole mapping, and it doesn't make sense for that one struct
page * to have entries in multiple radix trees.  Doing so would cause
contention on the single page lock for the one Huge Zero Page, and it
would break the page->index and page->mapping associations that many
other places in the kernel assume to be valid.

One difficult case is when one thread is trying to use 4k entries in the
radix tree for a given offset while another thread is using 2 MiB entries
for that same offset.  The current code handles this by making the 2 MiB
user fall back to 4k entries in most cases.  This is the simplest
solution, and the use of 2 MiB pages is already opportunistic.

If we were to try to upgrade from 4k pages to 2 MiB pages for a given
range, we would run into the problem of locking out 4k page faults for the
entire 2 MiB range while we clean out the radix tree so we can insert the
2 MiB entry.  We can solve this problem if we need to, but I think the
cases where both 2 MiB entries and 4k entries are used for the same range
will be rare enough, and the gain small enough, that it probably won't be
worth the complexity.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 334 +++++++++++++++++++++++++++++++++++++++++++++-------
 include/linux/dax.h |  38 +++++-
 mm/filemap.c        |   4 +-
 3 files changed, 327 insertions(+), 49 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 6977e5e..93a818d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -34,20 +34,6 @@
 #include <linux/iomap.h>
 #include "internal.h"
 
-/*
- * We use lowest available bit in exceptional entry for locking, other two
- * bits to determine entry type. In total 3 special bits.
- */
-#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
-#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
-#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
-#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
-#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
-#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
-#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
-		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
-		RADIX_TREE_EXCEPTIONAL_ENTRY))
-
 /* We choose 4096 entries - same as per-zone page wait tables */
 #define DAX_WAIT_TABLE_BITS 12
 #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
@@ -400,19 +386,52 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
  * radix tree entry locked. If the radix tree doesn't contain given index,
  * create empty exceptional entry for the index and return with it locked.
  *
+ * When requesting an entry with type RADIX_DAX_PMD, grab_mapping_entry() will
+ * either return that locked entry or an error.  The error occurs if
+ * there are any 4k entries (either zero pages or DAX entries) within the
+ * 2MiB range that we are requesting.
+ *
+ * We always favor 4k entries over 2MiB entries. There isn't a flow where we
+ * evict 4k entries in order to 'upgrade' them to a 2MiB entry.  Also, a 2MiB
+ * insertion will fail if it finds any 4k entries already in the tree, and a
+ * 4k insertion will cause an existing 2MiB entry to be unmapped and
+ * downgraded to 4k entries.  This happens for both 2MiB huge zero
+ * pages and 2MiB empty entries.
+ *
+ * The exception to this downgrade path is for 2MiB DAX PMD entries that have
+ * real storage backing them.  We will leave these real 2MiB DAX entries in
+ * the tree, and PTE writes will simply dirty the entire 2MiB DAX entry.
+ *
  * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
  * persistent memory the benefit is doubtful. We can add that later if we can
  * show it helps.
  */
-static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
+static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
+		unsigned long new_type)
 {
+	bool pmd_downgrade = false; /* splitting 2MiB entry into 4k entries? */
 	void *entry, **slot;
 
 restart:
 	spin_lock_irq(&mapping->tree_lock);
 	entry = get_unlocked_mapping_entry(mapping, index, &slot);
+
+	if (entry && new_type == RADIX_DAX_PMD) {
+		if (!radix_tree_exceptional_entry(entry) ||
+				RADIX_DAX_TYPE(entry) == RADIX_DAX_PTE) {
+			spin_unlock_irq(&mapping->tree_lock);
+			return ERR_PTR(-EEXIST);
+		}
+	} else if (entry && new_type == RADIX_DAX_PTE) {
+		if (radix_tree_exceptional_entry(entry) &&
+		    RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD &&
+		    (unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
+			pmd_downgrade = true;
+		}
+	}
+
 	/* No entry for given index? Make sure radix tree is big enough. */
-	if (!entry) {
+	if (!entry || pmd_downgrade) {
 		int err;
 
 		spin_unlock_irq(&mapping->tree_lock);
@@ -420,15 +439,39 @@ restart:
 				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
 		if (err)
 			return ERR_PTR(err);
-		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
-			       RADIX_DAX_ENTRY_LOCK);
+
+		/*
+		 * Besides huge zero pages, the only other entries that
+		 * get downgraded are empty entries, which don't need to
+		 * be unmapped.
+		 */
+		if (pmd_downgrade && ((unsigned long)entry & RADIX_DAX_HZP))
+			unmap_mapping_range(mapping,
+				(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
+
 		spin_lock_irq(&mapping->tree_lock);
-		err = radix_tree_insert(&mapping->page_tree, index, entry);
+
+		if (pmd_downgrade) {
+			radix_tree_delete(&mapping->page_tree, index);
+			mapping->nrexceptional--;
+			dax_wake_mapping_entry_waiter(entry, mapping, index,
+					false);
+		}
+
+		entry = RADIX_DAX_EMPTY_ENTRY(new_type);
+
+		err = __radix_tree_insert(&mapping->page_tree, index,
+				RADIX_DAX_ORDER(new_type), entry);
 		radix_tree_preload_end();
 		if (err) {
 			spin_unlock_irq(&mapping->tree_lock);
-			/* Someone already created the entry? */
-			if (err == -EEXIST)
+			/*
+			 * Someone already created the entry?  This is a
+			 * normal failure when inserting PMDs in a range
+			 * that already contains PTEs.  In that case we want
+			 * to return -EEXIST immediately.
+			 */
+			if (err == -EEXIST && new_type == RADIX_DAX_PTE)
 				goto restart;
 			return ERR_PTR(err);
 		}
@@ -596,11 +639,17 @@ static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size
 	return 0;
 }
 
-#define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_SHIFT))
-
+/*
+ * By this point grab_mapping_entry() has ensured that we have a locked entry
+ * of the appropriate size so we don't have to worry about downgrading PMDs to
+ * PTEs.  If we happen to be trying to insert a PTE and there is a PMD
+ * already in the tree, we will skip the insertion and just dirty the PMD as
+ * appropriate.
+ */
 static void *dax_insert_mapping_entry(struct address_space *mapping,
 				      struct vm_fault *vmf,
-				      void *entry, sector_t sector)
+				      void *entry, sector_t sector,
+				      unsigned long new_type, bool hzp)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
 	int error = 0;
@@ -623,22 +672,30 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 		error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
 		if (error)
 			return ERR_PTR(error);
+	} else if ((unsigned long)entry & RADIX_DAX_HZP && !hzp) {
+		/* replacing huge zero page with PMD block mapping */
+		unmap_mapping_range(mapping,
+			(vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
 	}
 
 	spin_lock_irq(&mapping->tree_lock);
-	new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
-		       RADIX_DAX_ENTRY_LOCK);
+	if (hzp)
+		new_entry = RADIX_DAX_HZP_ENTRY();
+	else
+		new_entry = RADIX_DAX_ENTRY(sector, new_type);
+
 	if (hole_fill) {
 		__delete_from_page_cache(entry, NULL);
 		/* Drop pagecache reference */
 		put_page(entry);
-		error = radix_tree_insert(page_tree, index, new_entry);
+		error = __radix_tree_insert(page_tree, index,
+				RADIX_DAX_ORDER(new_type), new_entry);
 		if (error) {
 			new_entry = ERR_PTR(error);
 			goto unlock;
 		}
 		mapping->nrexceptional++;
-	} else {
+	} else if ((unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
 		void **slot;
 		void *ret;
 
@@ -694,6 +751,18 @@ static int dax_writeback_one(struct block_device *bdev,
 		goto unlock;
 	}
 
+	if (WARN_ON_ONCE((unsigned long)entry & RADIX_DAX_EMPTY)) {
+		ret = -EIO;
+		goto unlock;
+	}
+
+	/*
+	 * Even if dax_writeback_mapping_range() was given a wbc->range_start
+	 * in the middle of a PMD, the 'index' we are given will be aligned to
+	 * the start index of the PMD, as will the sector we pull from
+	 * 'entry'.  This allows us to flush for PMD_SIZE and not have to
+	 * worry about partial PMD writebacks.
+	 */
 	dax.sector = RADIX_DAX_SECTOR(entry);
 	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
 	spin_unlock_irq(&mapping->tree_lock);
@@ -734,12 +803,11 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 		struct block_device *bdev, struct writeback_control *wbc)
 {
 	struct inode *inode = mapping->host;
-	pgoff_t start_index, end_index, pmd_index;
+	pgoff_t start_index, end_index;
 	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
 	bool done = false;
 	int i, ret = 0;
-	void *entry;
 
 	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
 		return -EIO;
@@ -749,15 +817,6 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 
 	start_index = wbc->range_start >> PAGE_SHIFT;
 	end_index = wbc->range_end >> PAGE_SHIFT;
-	pmd_index = DAX_PMD_INDEX(start_index);
-
-	rcu_read_lock();
-	entry = radix_tree_lookup(&mapping->page_tree, pmd_index);
-	rcu_read_unlock();
-
-	/* see if the start of our range is covered by a PMD entry */
-	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
-		start_index = pmd_index;
 
 	tag_pages_for_writeback(mapping, start_index, end_index);
 
@@ -802,7 +861,8 @@ static int dax_insert_mapping(struct address_space *mapping,
 		return PTR_ERR(dax.addr);
 	dax_unmap_atomic(bdev, &dax);
 
-	ret = dax_insert_mapping_entry(mapping, vmf, entry, dax.sector);
+	ret = dax_insert_mapping_entry(mapping, vmf, entry, dax.sector,
+			RADIX_DAX_PTE, false);
 	if (IS_ERR(ret))
 		return PTR_ERR(ret);
 	*entryp = ret;
@@ -849,7 +909,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	bh.b_bdev = inode->i_sb->s_bdev;
 	bh.b_size = PAGE_SIZE;
 
-	entry = grab_mapping_entry(mapping, vmf->pgoff);
+	entry = grab_mapping_entry(mapping, vmf->pgoff, RADIX_DAX_PTE);
 	if (IS_ERR(entry)) {
 		error = PTR_ERR(entry);
 		goto out;
@@ -1023,6 +1083,11 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
 EXPORT_SYMBOL_GPL(dax_truncate_page);
 
 #ifdef CONFIG_FS_IOMAP
+static inline sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
+{
+	return iomap->blkno + (((pos & PAGE_MASK) - iomap->offset) >> 9);
+}
+
 static loff_t
 dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct iomap *iomap)
@@ -1048,8 +1113,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct blk_dax_ctl dax = { 0 };
 		ssize_t map_len;
 
-		dax.sector = iomap->blkno +
-			(((pos & PAGE_MASK) - iomap->offset) >> 9);
+		dax.sector = dax_iomap_sector(iomap, pos);
 		dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
 		map_len = dax_map_atomic(iomap->bdev, &dax);
 		if (map_len < 0) {
@@ -1164,7 +1228,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	if (pos >= i_size_read(inode))
 		return VM_FAULT_SIGBUS;
 
-	entry = grab_mapping_entry(mapping, vmf->pgoff);
+	entry = grab_mapping_entry(mapping, vmf->pgoff, RADIX_DAX_PTE);
 	if (IS_ERR(entry)) {
 		error = PTR_ERR(entry);
 		goto out;
@@ -1186,7 +1250,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		goto unlock_entry;
 	}
 
-	sector = iomap.blkno + (((pos & PAGE_MASK) - iomap.offset) >> 9);
+	sector = dax_iomap_sector(&iomap, pos);
 
 	if (vmf->cow_page) {
 		switch (iomap.type) {
@@ -1246,4 +1310,184 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	return VM_FAULT_NOPAGE | major;
 }
 EXPORT_SYMBOL_GPL(dax_iomap_fault);
+
+#if defined(CONFIG_FS_DAX_PMD)
+/*
+ * The 'colour' (i.e. the low bits) of a page offset within a PMD.  This
+ * comes up more often than one might expect in the functions below.
+ */
+#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
+
+static int dax_pmd_insert_mapping(struct vm_area_struct *vma, pmd_t *pmd,
+		struct vm_fault *vmf, unsigned long address,
+		struct iomap *iomap, loff_t pos, bool write, void **entryp)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct block_device *bdev = iomap->bdev;
+	struct blk_dax_ctl dax = {
+		.sector = dax_iomap_sector(iomap, pos),
+		.size = PMD_SIZE,
+	};
+	long length = dax_map_atomic(bdev, &dax);
+	void *ret;
+
+	if (length < 0) /* dax_map_atomic() failed */
+		return VM_FAULT_FALLBACK;
+	if (length < PMD_SIZE)
+		goto unmap_fallback;
+	if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR)
+		goto unmap_fallback;
+	if (!pfn_t_devmap(dax.pfn))
+		goto unmap_fallback;
+
+	dax_unmap_atomic(bdev, &dax);
+
+	ret = dax_insert_mapping_entry(mapping, vmf, *entryp,
+			dax.sector, RADIX_DAX_PMD, false);
+	if (IS_ERR(ret))
+		return VM_FAULT_FALLBACK;
+	*entryp = ret;
+
+	return vmf_insert_pfn_pmd(vma, address, pmd, dax.pfn, write);
+
+ unmap_fallback:
+	dax_unmap_atomic(bdev, &dax);
+	return VM_FAULT_FALLBACK;
+}
+
+static int dax_pmd_load_hole(struct vm_area_struct *vma, pmd_t *pmd,
+		struct vm_fault *vmf, unsigned long address,
+		struct iomap *iomap, void **entryp)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	unsigned long pmd_addr = address & PMD_MASK;
+	struct page *zero_page;
+	spinlock_t *ptl;
+	pmd_t pmd_entry;
+	void *ret;
+
+	zero_page = get_huge_zero_page();
+
+	if (unlikely(!zero_page))
+		return VM_FAULT_FALLBACK;
+
+	ret = dax_insert_mapping_entry(mapping, vmf, *entryp,
+			0, RADIX_DAX_PMD, true);
+	if (IS_ERR(ret))
+		return VM_FAULT_FALLBACK;
+	*entryp = ret;
+
+	ptl = pmd_lock(vma->vm_mm, pmd);
+	if (!pmd_none(*pmd)) {
+		spin_unlock(ptl);
+		return VM_FAULT_FALLBACK;
+	}
+
+	pmd_entry = mk_pmd(zero_page, vma->vm_page_prot);
+	pmd_entry = pmd_mkhuge(pmd_entry);
+	set_pmd_at(vma->vm_mm, pmd_addr, pmd, pmd_entry);
+	spin_unlock(ptl);
+	return VM_FAULT_NOPAGE;
+}
+
+int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
+		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	unsigned long pmd_addr = address & PMD_MASK;
+	bool write = flags & FAULT_FLAG_WRITE;
+	struct inode *inode = mapping->host;
+	struct iomap iomap = { 0 };
+	int error, result = 0;
+	pgoff_t size, pgoff;
+	struct vm_fault vmf;
+	void *entry;
+	loff_t pos;
+
+	/* Fall back to PTEs if we're going to COW */
+	if (write && !(vma->vm_flags & VM_SHARED)) {
+		split_huge_pmd(vma, pmd, address);
+		return VM_FAULT_FALLBACK;
+	}
+
+	/* If the PMD would extend outside the VMA */
+	if (pmd_addr < vma->vm_start)
+		return VM_FAULT_FALLBACK;
+	if ((pmd_addr + PMD_SIZE) > vma->vm_end)
+		return VM_FAULT_FALLBACK;
+
+	/*
+	 * Check that the offset isn't beyond the end of the file now.  The
+	 * caller is supposed to hold locks serializing us with truncate /
+	 * punch hole, so this is a reliable test.
+	 */
+	pgoff = linear_page_index(vma, pmd_addr);
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+
+	if (pgoff >= size)
+		return VM_FAULT_SIGBUS;
+
+	/* If the PMD would extend beyond the file size */
+	if ((pgoff | PG_PMD_COLOUR) >= size)
+		return VM_FAULT_FALLBACK;
+
+	/*
+	 * grab_mapping_entry() will make sure we get a 2MiB empty entry, a
+	 * DAX PMD entry or a HZP entry.  If it can't (because a 4k page is
+	 * already in the tree, for instance), it will return -EEXIST and we
+	 * just fall back to 4k entries.
+	 */
+	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
+	if (IS_ERR(entry))
+		return VM_FAULT_FALLBACK;
+
+	/*
+	 * Note that we don't use iomap_apply here.  We aren't doing I/O, only
+	 * setting up a mapping, so really we're using iomap_begin() as a way
+	 * to look up our filesystem block.
+	 */
+	pos = (loff_t)pgoff << PAGE_SHIFT;
+	error = ops->iomap_begin(inode, pos, PMD_SIZE, write ? IOMAP_WRITE : 0,
+			&iomap);
+	if (error)
+		goto fallback;
+	if (iomap.offset + iomap.length < pos + PMD_SIZE)
+		goto fallback;
+
+	vmf.pgoff = pgoff;
+	vmf.flags = flags;
+	vmf.gfp_mask = mapping_gfp_mask(mapping) | __GFP_FS | __GFP_IO;
+
+	switch (iomap.type) {
+	case IOMAP_MAPPED:
+		result = dax_pmd_insert_mapping(vma, pmd, &vmf, address,
+				&iomap, pos, write, &entry);
+		break;
+	case IOMAP_UNWRITTEN:
+	case IOMAP_HOLE:
+		if (WARN_ON_ONCE(write))
+			goto fallback;
+		result = dax_pmd_load_hole(vma, pmd, &vmf, address, &iomap,
+				&entry);
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		result = VM_FAULT_FALLBACK;
+		break;
+	}
+
+	if (result == VM_FAULT_FALLBACK)
+		count_vm_event(THP_FAULT_FALLBACK);
+
+ unlock_entry:
+	put_locked_mapping_entry(mapping, pgoff, entry);
+	return result;
+
+ fallback:
+	count_vm_event(THP_FAULT_FALLBACK);
+	result = VM_FAULT_FALLBACK;
+	goto unlock_entry;
+}
+EXPORT_SYMBOL_GPL(dax_iomap_pmd_fault);
+#endif /* CONFIG_FS_DAX_PMD */
 #endif /* CONFIG_FS_IOMAP */
diff --git a/include/linux/dax.h b/include/linux/dax.h
index c4a51bb..cacff9e 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -8,8 +8,33 @@
 
 struct iomap_ops;
 
-/* We use lowest available exceptional entry bit for locking */
+/*
+ * We use lowest available bit in exceptional entry for locking, two bits for
+ * the entry type (PMD & PTE), and two more for flags (HZP and empty).  In
+ * total five special bits.
+ */
+#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 5)
 #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
+/* PTE and PMD types */
+#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
+#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
+/* huge zero page and empty entry flags */
+#define RADIX_DAX_HZP (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 3))
+#define RADIX_DAX_EMPTY (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 4))
+
+#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
+#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
+#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
+
+/* entries begin locked */
+#define RADIX_DAX_ENTRY(sector, type) ((void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |\
+	type | (unsigned long)sector << RADIX_DAX_SHIFT | RADIX_DAX_ENTRY_LOCK))
+#define RADIX_DAX_HZP_ENTRY() ((void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | \
+	RADIX_DAX_PMD | RADIX_DAX_HZP | RADIX_DAX_EMPTY | RADIX_DAX_ENTRY_LOCK))
+#define RADIX_DAX_EMPTY_ENTRY(type) ((void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | \
+		type | RADIX_DAX_EMPTY | RADIX_DAX_ENTRY_LOCK))
+
+#define RADIX_DAX_ORDER(type) (type == RADIX_DAX_PMD ? PMD_SHIFT-PAGE_SHIFT : 0)
 
 ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 		struct iomap_ops *ops);
@@ -54,6 +79,17 @@ static inline int dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 	return VM_FAULT_FALLBACK;
 }
 
+#if defined(CONFIG_FS_DAX_PMD)
+int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
+		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops);
+#else
+static inline int dax_iomap_pmd_fault(struct vm_area_struct *vma,
+		unsigned long address, pmd_t *pmd, unsigned int flags,
+		struct iomap_ops *ops)
+{
+	return VM_FAULT_FALLBACK;
+}
+#endif
 int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
 #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 35e880d..d9dd97e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -610,9 +610,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
 				workingset_node_shadows_dec(node);
 		} else {
 			/* DAX can replace empty locked entry with a hole */
-			WARN_ON_ONCE(p !=
-				(void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
-					 RADIX_DAX_ENTRY_LOCK));
+			WARN_ON_ONCE(p != RADIX_DAX_EMPTY_ENTRY(RADIX_DAX_PTE));
 			/* DAX accounts exceptional entries as normal pages */
 			if (node)
 				workingset_node_pages_dec(node);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v4 11/12] xfs: use struct iomap based DAX PMD fault path
  2016-09-29 22:49 [PATCH v4 00/12] re-enable DAX PMD support Ross Zwisler
                   ` (9 preceding siblings ...)
  2016-09-29 22:49 ` [PATCH v4 10/12] dax: add struct iomap based DAX PMD support Ross Zwisler
@ 2016-09-29 22:49 ` Ross Zwisler
  2016-09-29 22:49 ` [PATCH v4 12/12] dax: remove "depends on BROKEN" from FS_DAX_PMD Ross Zwisler
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 46+ messages in thread
From: Ross Zwisler @ 2016-09-29 22:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Switch xfs_filemap_pmd_fault() from using dax_pmd_fault() to the new and
improved dax_iomap_pmd_fault().  Also, now that it has no more users,
remove xfs_get_blocks_dax_fault().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/xfs/xfs_aops.c | 25 +++++--------------------
 fs/xfs/xfs_aops.h |  3 ---
 fs/xfs/xfs_file.c |  2 +-
 3 files changed, 6 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 4a28fa9..39c754f 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1170,8 +1170,7 @@ __xfs_get_blocks(
 	sector_t		iblock,
 	struct buffer_head	*bh_result,
 	int			create,
-	bool			direct,
-	bool			dax_fault)
+	bool			direct)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
@@ -1265,12 +1264,8 @@ __xfs_get_blocks(
 		if (ISUNWRITTEN(&imap))
 			set_buffer_unwritten(bh_result);
 		/* direct IO needs special help */
-		if (create) {
-			if (dax_fault)
-				ASSERT(!ISUNWRITTEN(&imap));
-			else
-				xfs_map_direct(inode, bh_result, &imap, offset);
-		}
+		if (create)
+			xfs_map_direct(inode, bh_result, &imap, offset);
 	}
 
 	/*
@@ -1310,7 +1305,7 @@ xfs_get_blocks(
 	struct buffer_head	*bh_result,
 	int			create)
 {
-	return __xfs_get_blocks(inode, iblock, bh_result, create, false, false);
+	return __xfs_get_blocks(inode, iblock, bh_result, create, false);
 }
 
 int
@@ -1320,17 +1315,7 @@ xfs_get_blocks_direct(
 	struct buffer_head	*bh_result,
 	int			create)
 {
-	return __xfs_get_blocks(inode, iblock, bh_result, create, true, false);
-}
-
-int
-xfs_get_blocks_dax_fault(
-	struct inode		*inode,
-	sector_t		iblock,
-	struct buffer_head	*bh_result,
-	int			create)
-{
-	return __xfs_get_blocks(inode, iblock, bh_result, create, true, true);
+	return __xfs_get_blocks(inode, iblock, bh_result, create, true);
 }
 
 /*
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index 1950e3b..6779e9d 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -57,9 +57,6 @@ int	xfs_get_blocks(struct inode *inode, sector_t offset,
 		       struct buffer_head *map_bh, int create);
 int	xfs_get_blocks_direct(struct inode *inode, sector_t offset,
 			      struct buffer_head *map_bh, int create);
-int	xfs_get_blocks_dax_fault(struct inode *inode, sector_t offset,
-			         struct buffer_head *map_bh, int create);
-
 int	xfs_end_io_direct_write(struct kiocb *iocb, loff_t offset,
 		ssize_t size, void *private);
 int	xfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 00293d2..ab8f652 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1539,7 +1539,7 @@ xfs_filemap_pmd_fault(
 	}
 
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
-	ret = dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault);
+	ret = dax_iomap_pmd_fault(vma, addr, pmd, flags, &xfs_iomap_ops);
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
 	if (flags & FAULT_FLAG_WRITE)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v4 12/12] dax: remove "depends on BROKEN" from FS_DAX_PMD
  2016-09-29 22:49 [PATCH v4 00/12] re-enable DAX PMD support Ross Zwisler
                   ` (10 preceding siblings ...)
  2016-09-29 22:49 ` [PATCH v4 11/12] xfs: use struct iomap based DAX PMD fault path Ross Zwisler
@ 2016-09-29 22:49 ` Ross Zwisler
  2016-09-29 23:43 ` [PATCH v4 00/12] re-enable DAX PMD support Dave Chinner
  2016-09-30 11:46 ` Christoph Hellwig
  13 siblings, 0 replies; 46+ messages in thread
From: Ross Zwisler @ 2016-09-29 22:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Now that DAX PMD faults are once again working and are now participating in
DAX's radix tree locking scheme, allow their config option to be enabled.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 2bc7ad7..b6f0fce 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -55,7 +55,6 @@ config FS_DAX_PMD
 	depends on FS_DAX
 	depends on ZONE_DEVICE
 	depends on TRANSPARENT_HUGEPAGE
-	depends on BROKEN
 
 endif # BLOCK
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 00/12] re-enable DAX PMD support
  2016-09-29 22:49 [PATCH v4 00/12] re-enable DAX PMD support Ross Zwisler
                   ` (11 preceding siblings ...)
  2016-09-29 22:49 ` [PATCH v4 12/12] dax: remove "depends on BROKEN" from FS_DAX_PMD Ross Zwisler
@ 2016-09-29 23:43 ` Dave Chinner
  2016-09-30  3:03   ` Ross Zwisler
  2016-10-03 23:05   ` Ross Zwisler
  2016-09-30 11:46 ` Christoph Hellwig
  13 siblings, 2 replies; 46+ messages in thread
From: Dave Chinner @ 2016-09-29 23:43 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Jan Kara,
	Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Thu, Sep 29, 2016 at 04:49:18PM -0600, Ross Zwisler wrote:
> DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
> locking.  This series allows DAX PMDs to participate in the DAX radix tree
> based locking scheme so that they can be re-enabled.
> 
> Ted, can you please take the ext2 + ext4 patches through your tree?  Dave,
> can you please take the rest through the XFS tree?

That will make merging this difficult, because later patches in the
series are dependent on the ext2/ext4 patches early in the series.
I'd much prefer they all go through the one tree to avoid issues
like this.

> 
> Changes since v3:
>  - Corrected dax iomap code namespace for functions defined in fs/dax.c.
>    (Dave Chinner)
>  - Added leading "dax" namespace to new static functions in fs/dax.c.
>    (Dave Chinner)
>  - Made all DAX PMD code in fs/dax.c conditionally compiled based on
>    CONFIG_FS_DAX_PMD.  Otherwise a stub in include/linux/dax.h that just
>    returns VM_FAULT_FALLBACK will be used.  (Dave Chinner)
>  - Removed dynamic debugging messages from DAX PMD fault path.  Debugging
>    tracepoints will be added later to both the PTE and PMD paths via a
>    later patch set. (Dave Chinner)
>  - Added a comment to ext2_dax_vm_ops explaining why we don't support DAX
>    PMD faults in ext2. (Dave Chinner)
> 
> This was built upon xfs/for-next with PMD performance fixes from Toshi Kani
> and Dan Williams.  Dan's patch has already been merged for v4.8, and
> Toshi's patches are currently queued in Andrew Morton's mm tree for v4.9
> inclusion.

the xfs/for-next branch is not a stable branch - it can rebase at
any time just like linux-next can. The topic branches that I merge
into the for-next branch, OTOH, are usually stable. i.e. The topic
branch you should be basing this on is "iomap-4.9-dax".

And then you'll also see that there are already ext2 patches in this
topic branch to convert it to iomap for DAX. So I'm quite happy to
take the ext2/4 patches in this series in the same way.

The patches from Dan and Toshi: is your patch series dependent on
them? Do I need to take them as well?

Finally: none of the patches in your tree have reviewed-by tags.
That says to me that none of this code has been reviewed yet.
Reviewed-by tags are non-negotiable requirement for anything going
through my trees. I don't have time right now to review this code,
so you're going to need to chase up other reviewers before merging.

And, really, this is getting very late in the cycle to be merging
new code - we're less than one working day away from the merge
window opening and we've missed the last linux-next build. I'd
suggest that we might be best served by slipping this PMD
support code to the next cycle when there's no time pressure for
review and we can get a decent linux-next soak on the code.

> This tree has passed xfstests for ext2, ext4 and XFS both with and without DAX,
> and has passed targeted testing where I inserted, removed and flushed DAX PTEs
> and PMDs in every combination I could think of.
> 
> Previously reported performance numbers:
> 
>   [global]
>   bs=4k
>   size=2G
>   directory=/mnt/pmem0
>   ioengine=mmap
>   [randrw]
>   rw=randrw
> 
> Here are the performance results with XFS using only pte faults:
>    READ: io=1022.7MB, aggrb=557610KB/s, minb=557610KB/s, maxb=557610KB/s, mint=1878msec, maxt=1878msec
>   WRITE: io=1025.4MB, aggrb=559084KB/s, minb=559084KB/s, maxb=559084KB/s, mint=1878msec, maxt=1878msec
> 
> Here are performance numbers for that same test using PMD faults:
>    READ: io=1022.7MB, aggrb=1406.7MB/s, minb=1406.7MB/s, maxb=1406.7MB/s, mint=727msec, maxt=727msec
>   WRITE: io=1025.4MB, aggrb=1410.4MB/s, minb=1410.4MB/s, maxb=1410.4MB/s, mint=727msec, maxt=727msec

The numbers look good - how much of that is from lower filesystem
allocation overhead and how much of it is from fewer page faults?
You can probably determine this by creating the test file with
write() to ensure it is fully allocated and so all the filesystem
is doing on both the read and write paths is mapping allocated
regions....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 00/12] re-enable DAX PMD support
  2016-09-29 23:43 ` [PATCH v4 00/12] re-enable DAX PMD support Dave Chinner
@ 2016-09-30  3:03   ` Ross Zwisler
  2016-09-30  4:00     ` Darrick J. Wong
  2016-09-30  6:48     ` Dave Chinner
  2016-10-03 23:05   ` Ross Zwisler
  1 sibling, 2 replies; 46+ messages in thread
From: Ross Zwisler @ 2016-09-30  3:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Christoph Hellwig, Dan Williams,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri, Sep 30, 2016 at 09:43:45AM +1000, Dave Chinner wrote:
> Finally: none of the patches in your tree have reviewed-by tags.
> That says to me that none of this code has been reviewed yet.
> Reviewed-by tags are non-negotiable requirement for anything going
> through my trees. I don't have time right now to review this code,
> so you're going to need to chase up other reviewers before merging.
> 
> And, really, this is getting very late in the cycle to be merging
> new code - we're less than one working day away from the merge
> window opening and we've missed the last linux-next build. I'd
> suggest that we might be best served by slipping this PMD
> support code to the next cycle when there's no time pressure for
> review and we can get a decent linux-next soak on the code.

I absolutely support your policy of only sending code to Linux that has passed
peer review.

However, I do feel compelled to point out that this is not new code.  I didn't
just spring it on everyone in the hours before the v4.9 merge window.  I
posted the first version of this patch set on August 15th, *seven weeks ago*:

https://lkml.org/lkml/2016/8/15/613

This was the day after v4.8-rc2 was released.

Since then I have responded promptly to the little review feedback that I've
received.  I've also reviewed and tested other DAX changes, like the struct
iomap changes from Christoph.  Those changes were first posted to the mailing
list on September 9th, four weeks after mine.  Nevertheless, I was happy to
rebase my changes on top of his, which meant a full rewrite of the DAX PMD
fault handler so it would be based on struct iomap.  His changes are going to
be merged for v4.9, and mine are not.

Please, help me understand what I can do to get my code reviewed.  Do I need
to more aggressively ping my patch series, asking people by name for reviews?
Do we need to rework our code flow to Linus so that the DAX changes go through
a filesystem tree like XFS or ext4, and ask the developers of that filesystem
to help with reviews?  Something else?

I'm honestly very frustrated by this because I've done my best to be open to
constructive criticism and I've tried to respond promptly to the feedback that
I've received.  In the end, though, a system where it's a requirement that all
upstreamed code be peer reviewed but in which I can't get any feedback is
essentially a system where I'm not allowed to contribute.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 00/12] re-enable DAX PMD support
  2016-09-30  3:03   ` Ross Zwisler
@ 2016-09-30  4:00     ` Darrick J. Wong
  2016-10-03 18:54       ` Ross Zwisler
  2016-09-30  6:48     ` Dave Chinner
  1 sibling, 1 reply; 46+ messages in thread
From: Darrick J. Wong @ 2016-09-30  4:00 UTC (permalink / raw)
  To: Ross Zwisler, Dave Chinner, linux-kernel, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Christoph Hellwig,
	Dan Williams, Jan Kara, Matthew Wilcox, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, linux-xfs

On Thu, Sep 29, 2016 at 09:03:43PM -0600, Ross Zwisler wrote:
> On Fri, Sep 30, 2016 at 09:43:45AM +1000, Dave Chinner wrote:
> > Finally: none of the patches in your tree have reviewed-by tags.
> > That says to me that none of this code has been reviewed yet.
> > Reviewed-by tags are non-negotiable requirement for anything going
> > through my trees. I don't have time right now to review this code,
> > so you're going to need to chase up other reviewers before merging.
> > 
> > And, really, this is getting very late in the cycle to be merging
> > new code - we're less than one working day away from the merge
> > window opening and we've missed the last linux-next build. I'd
> > suggest that we might be best served by slipping this PMD
> > support code to the next cycle when there's no time pressure for
> > review and we can get a decent linux-next soak on the code.
> 
> I absolutely support your policy of only sending code to Linux that has passed
> peer review.
> 
> However, I do feel compelled to point out that this is not new code.  I didn't
> just spring it on everyone in the hours before the v4.9 merge window.  I
> posted the first version of this patch set on August 15th, *seven weeks ago*:
> 
> https://lkml.org/lkml/2016/8/15/613
> 
> This was the day after v4.8-rc2 was released.
> 
> Since then I have responded promptly to the little review feedback
> that I've received.  I've also reviewed and tested other DAX changes,
> like the struct iomap changes from Christoph.  Those changes were
> first posted to the mailing list on September 9th, four weeks after
> mine.  Nevertheless, I was happy to rebase my changes on top of his,
> which meant a full rewrite of the DAX PMD fault handler so it would be
> based on struct iomap.  His changes are going to be merged for v4.9,
> and mine are not.

I'm not knocking the iomap migration, but it did cause a fair amount of
churn in the XFS reflink patchset -- and that's for a filesystem that
already /had/ iomap implemented.  It'd be neat to have all(?) the DAX
filesystems (ext[24], XFS) move over to iomap so that you wouldn't have
to support multiple ways of talking to FSes.  AFAICT ext4 hasn't gotten
iomap, which complicates things.  But that's my opinion, maybe you're
fine with supporting iomap and not-iomap.

The thing that (personally) makes it harder to review these
multi-subsystem patches is that I'm not a domain expert in some of those
subsystems -- memory in this case.  I get to the point where I'm
thinking "Uh, this looks ok, and it seems to work on my test VM, but is
that enough to stick my neck out and Reviewed-by?" and then get stuck.
It's hard to get unstuck with a complex piece of machinery.

> Please, help me understand what I can do to get my code reviewed.  Do
> I need to more aggressively ping my patch series, asking people by
> name for reviews?  Do we need to rework our code flow to Linus so that
> the DAX changes go through a filesystem tree like XFS or ext4, and ask
> the developers of that filesystem to help with reviews?  Something
> else?

FWIW, I /think/ it looks fine, though I'm afraid enough of the memory
manager that I haven't said anything yet.  I'll look it over more
tomorrow when my brain is fresher.  If reflink for XFS lands in 4.9 I'll
start looking again at pagecache sharing and/or dax+reflink.

> I'm honestly very frustrated by this because I've done my best to be
> open to constructive criticism and I've tried to respond promptly to
> the feedback that I've received.  In the end, though, a system where
> it's a requirement that all upstreamed code be peer reviewed but in
> which I can't get any feedback is essentially a system where I'm not
> allowed to contribute.

I have the same frustrations with getting non-XFS/non-ext4 patches
reviewed and upstreamed by whomever the maintainer is.  I wish we had a
broader range of people who knew both FS and MM, but wow is that a long
onboarding process. :/

--D

> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 00/12] re-enable DAX PMD support
  2016-09-30  3:03   ` Ross Zwisler
  2016-09-30  4:00     ` Darrick J. Wong
@ 2016-09-30  6:48     ` Dave Chinner
  2016-10-03 21:11       ` Ross Zwisler
  1 sibling, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2016-09-30  6:48 UTC (permalink / raw)
  To: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Christoph Hellwig, Dan Williams,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Thu, Sep 29, 2016 at 09:03:43PM -0600, Ross Zwisler wrote:
> On Fri, Sep 30, 2016 at 09:43:45AM +1000, Dave Chinner wrote:
> > Finally: none of the patches in your tree have reviewed-by tags.
> > That says to me that none of this code has been reviewed yet.
> > Reviewed-by tags are non-negotiable requirement for anything going
> > through my trees. I don't have time right now to review this code,
> > so you're going to need to chase up other reviewers before merging.
> > 
> > And, really, this is getting very late in the cycle to be merging
> > new code - we're less than one working day away from the merge
> > window opening and we've missed the last linux-next build. I'd
> > suggest that we might be best served by slipping this PMD
> > support code to the next cycle when there's no time pressure for
> > review and we can get a decent linux-next soak on the code.
> 
> I absolutely support your policy of only sending code to Linux that has passed
> peer review.
> 
> However, I do feel compelled to point out that this is not new code.  I didn't
> just spring it on everyone in the hours before the v4.9 merge window.  I
> posted the first version of this patch set on August 15th, *seven weeks ago*:
> 
> https://lkml.org/lkml/2016/8/15/613
>
> This was the day after v4.8-rc2 was released.

I understand, Ross. We've all been there - welcome to the club. :/

The problem is there are lots of people writing code and very few
people who spend time reviewing it. And for many of those reviewers,
they also have to spend time on other code bases, whether it be the
distro's they maintain, the userspace that the kernel code needs to
function, or even the test code needed to make this all work
properly.

Stuff that spans multiple subsystems such as DAX (i.e. vm,
filesystems and block devices) are particularly troublesome in this
respect because there are very few people with expertise in all
three aspects and hence able to review a change that spans all three
subsystems. Those people also tend to be the busiest and have to
prioritise what they do.

Consider the XFS side of review in recent times: in the past 4
months, there's been ~30,000 lines of code change between kernel and
userspace.  And there's already another 15,000+ lines of code in the
backlog for the next 2-3 months.  That review load is falling on 3-4
people, who all have other work to do as well. This is for work that
was started well over a year ago, and that picked up from work that
was originally done 3 years ago. We're swamped on pure XFS review
right now, let alone all the other infrastructure stuff we need to
get reviewed at the same time...

Having someone say "lets get sorted after the merge window" is far
better than having your patches ignored - it tells you someone wants
your code and is actually planning to review it in the near term!
Have patience. Keep the patches up to date, keep building what you
need to build on top of them. Missing a merge window is not the end
of the world.

> Since then I have responded promptly to the little review feedback that I've
> received.  I've also reviewed and tested other DAX changes, like the struct
> iomap changes from Christoph.

And I'm grateful for your doing that - it sped the process up a lot
because they then weren't blocked waiting for me to get to them. As
a result, I owe you some review time in return but unfortunately I
can't fully return the favour immediately. If more people treated
review as a selfless act that should be returned in kind, then we
wouldn't have a review bottleneck like we do...

> Those changes were first posted to the mailing
> list on September 9th, four weeks after mine.  Nevertheless, I was happy to
> rebase my changes on top of his, which meant a full rewrite of the DAX PMD
> fault handler so it would be based on struct iomap.  His changes are going to
> be merged for v4.9, and mine are not.

Yes, this can happen, too - core infrastructure changes can appear
suddenly and be implemented very quickly, but that does not mean
there hasn't been a lot of background work and effort put into the
code. The iomap code goes way back. Back to early 2010, in fact:

http://lkml.iu.edu/hypermail/linux/kernel/1005.1/02720.html

At that time I implemented a working multipage write IO path for
XFS, and Christoph integrated that int various OEM products shortly
afterwards. Yes, there have been iomap based XFS implementations out
there in production for over 5 years now, but that code was not
clean enough to even consider merging.

Another reference in 2013, when someone proposed a hack for embedded
systems to optimise the write path:

https://lkml.org/lkml/2013/7/23/809

Then Christoph introduced the struct iomap for pNFS and the XFS
block layout driver in late 2013, and when DAX first came along I
really wanted iomaps to be used up front rather than buffer heads
for block mapping.

Now we've finally got iomaps in the IO path and that's rapidly
cascading through all the XFS IO interfaces and into other
filesystems. This is exactly what we first talked about 6 years ago.

So while it might look like the iomap infrastructure has come out of
nowhere, it's really been a long, long road to get to this point. We
work to a different time scale over here - it's not uncommon to be
planning 5 years ahead for new XFS features. We know how long it
takes to develop, test, review and stabilise significant new
features, so while it might look like something appears and is
committed quickly, that's because you haven't seen the work that has
been done leading up to patches being presented for review and
merge.

Hopefully this will give you some more perspective on why I think
slipping a single release isn't something to get worried about. :)

> Please, help me understand what I can do to get my code reviewed.  Do I need
> to more aggressively ping my patch series, asking people by name for reviews?

On the XFS and fstests lists, if nobody has responded within a few
days (usually a week) then it's OK to ping it and see if anyone has
time to review it. In general, a single ping is enough if the
patchset is in good shape. Fix it all, repost, ping in a week if
there's no followup. Repeat until merge.

> Do we need to rework our code flow to Linus so that the DAX changes go through
> a filesystem tree like XFS or ext4, and ask the developers of that filesystem
> to help with reviews?  Something else?

The question we have to ask here is whether a lack of development
support from a filesystem should stop us from driving the DAX
implementation forward.  I've said it before, and I'll say it again:
I'm happy to drive DAX on XFS forwards at the rate at which we can
sustain review via the XFS tree and I don't care if we break support
on ext2 or ext4. If we keep having to wait for ext4 to fix stuff to
catch up with what we want/need to do then progress will continue to
be sporadic and frustrating.  Perhaps it's time to stop waiting for
ext4 to play catchup every time we take a step forwards....

> I'm honestly very frustrated by this because I've done my best to be open to
> constructive criticism and I've tried to respond promptly to the feedback that
> I've received.  In the end, though, a system where it's a requirement that all
> upstreamed code be peer reviewed but in which I can't get any feedback is
> essentially a system where I'm not allowed to contribute.

There's always been a review bottleneck, and everyone ends up on the
end of that frustration from time to time. Delays will happen - it's
just part of the process we all need to deal with. I used to get
frustrated, too, but now I just accept it, roll with it and we've
made it an acceptable part of the process to ping patches when it
looks like they have been forgotten...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 03/12] dax: remove buffer_size_valid()
  2016-09-29 22:49 ` [PATCH v4 03/12] dax: remove buffer_size_valid() Ross Zwisler
@ 2016-09-30  8:49   ` Christoph Hellwig
  0 siblings, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2016-09-30  8:49 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 04/12] ext2: remove support for DAX PMD faults
  2016-09-29 22:49 ` [PATCH v4 04/12] ext2: remove support for DAX PMD faults Ross Zwisler
@ 2016-09-30  8:49   ` Christoph Hellwig
  2016-10-03  9:35   ` Jan Kara
  1 sibling, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2016-09-30  8:49 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 05/12] dax: make 'wait_table' global variable static
  2016-09-29 22:49 ` [PATCH v4 05/12] dax: make 'wait_table' global variable static Ross Zwisler
@ 2016-09-30  8:50   ` Christoph Hellwig
  2016-10-03  9:36   ` Jan Kara
  1 sibling, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2016-09-30  8:50 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 06/12] dax: consistent variable naming for DAX entries
  2016-09-29 22:49 ` [PATCH v4 06/12] dax: consistent variable naming for DAX entries Ross Zwisler
@ 2016-09-30  8:50   ` Christoph Hellwig
  2016-10-03  9:37   ` Jan Kara
  1 sibling, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2016-09-30  8:50 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 08/12] dax: remove dax_pmd_fault()
  2016-09-29 22:49 ` [PATCH v4 08/12] dax: remove dax_pmd_fault() Ross Zwisler
@ 2016-09-30  8:50   ` Christoph Hellwig
  2016-10-03  9:56   ` Jan Kara
  1 sibling, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2016-09-30  8:50 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH v4 09/12] dax: correct dax iomap code namespace
  2016-09-29 22:49 ` [PATCH v4 09/12] dax: correct dax iomap code namespace Ross Zwisler
@ 2016-09-30  8:51   ` Christoph Hellwig
  2016-10-03  9:57   ` Jan Kara
  1 sibling, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2016-09-30  8:51 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Thu, Sep 29, 2016 at 04:49:27PM -0600, Ross Zwisler wrote:
> The recently added DAX functions that use the new struct iomap data
> structure were named iomap_dax_rw(), iomap_dax_fault() and
> iomap_dax_actor().  These are actually defined in fs/dax.c, though, so
> should be part of the "dax" namespace and not the "iomap" namespace.
> Rename them to dax_iomap_rw(), dax_iomap_fault() and dax_iomap_actor()
> respectively.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Suggested-by: Dave Chinner <david@fromorbit.com>

I don't really care either way, but this is trivial enough not to
introduce a bug, so:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH v4 07/12] dax: coordinate locking for offsets in PMD range
  2016-09-29 22:49 ` [PATCH v4 07/12] dax: coordinate locking for offsets in PMD range Ross Zwisler
@ 2016-09-30  9:44   ` Christoph Hellwig
  2016-10-03  9:55   ` Jan Kara
  1 sibling, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2016-09-30  9:44 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

> +static pgoff_t dax_entry_start(pgoff_t index, void *entry)
> +{
> +	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> +		index &= (PMD_MASK >> PAGE_SHIFT);
> +	return index;
> +}
> +
>  static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
> -					      pgoff_t index)
> +					      pgoff_t entry_start)
>  {
> -	unsigned long hash = hash_long((unsigned long)mapping ^ index,
> +	unsigned long hash = hash_long((unsigned long)mapping ^ entry_start,
>  				       DAX_WAIT_TABLE_BITS);
>  	return wait_table + hash;
>  }

All callers of dax_entry_waitqueue() need to calculate entry_start
using this new dax_entry_start() helper.  Either move the call to
dax_entry_start() into dax_entry_waitqueue() itself, or at least use
local variables in the callers, as both of them also fill out a
wait_exceptional_entry_queue structure with the result.  Or do both by
letting dax_entry_waitqueue() fill out that structure as well.
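A userspace sketch of that consolidated shape — dax_entry_waitqueue()
deriving the PMD-aligned start itself and filling the wait key in one
place.  All types, constants, and the hash here are simplified
stand-ins for the kernel's, chosen only to illustrate the structure:

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT		12	/* illustrative x86-64 values */
#define PMD_SHIFT		21
#define RADIX_DAX_PMD		(1UL << 3)	/* stand-in flag bit */
#define DAX_WAIT_TABLE_BITS	12
#define DAX_WAIT_TABLE_ENTRIES	(1 << DAX_WAIT_TABLE_BITS)

typedef unsigned long pgoff_t;
struct address_space { int dummy; };

/* Simplified stand-in for the wait key embedded in
 * wait_exceptional_entry_queue. */
struct exceptional_entry_key {
	struct address_space *mapping;
	pgoff_t entry_start;
};

static int wait_table[DAX_WAIT_TABLE_ENTRIES];	/* wait_queue_head_t stand-in */

static pgoff_t dax_entry_start(pgoff_t index, void *entry)
{
	/* PMD entries cover 2^(PMD_SHIFT - PAGE_SHIFT) pages; align down. */
	if ((unsigned long)entry & RADIX_DAX_PMD)
		index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
	return index;
}

/*
 * One helper that both picks the waitqueue and fills the key, so
 * callers no longer repeat the dax_entry_start() call themselves.
 */
static int *dax_entry_waitqueue(struct address_space *mapping, pgoff_t index,
				void *entry, struct exceptional_entry_key *key)
{
	unsigned long hash;

	key->mapping = mapping;
	key->entry_start = dax_entry_start(index, entry);
	/* hash_long() stand-in */
	hash = ((unsigned long)mapping ^ key->entry_start) &
		(DAX_WAIT_TABLE_ENTRIES - 1);
	return wait_table + hash;
}
```

With this shape, any two offsets inside the same 2MiB range hash to the
same waitqueue, which is exactly the property the PMD locking needs.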

* Re: [PATCH v4 10/12] dax: add struct iomap based DAX PMD support
  2016-09-29 22:49 ` [PATCH v4 10/12] dax: add struct iomap based DAX PMD support Ross Zwisler
@ 2016-09-30  9:56   ` Christoph Hellwig
  2016-10-03 21:16     ` Ross Zwisler
  2016-10-03 10:59   ` Jan Kara
  1 sibling, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2016-09-30  9:56 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

> -/*
> - * We use lowest available bit in exceptional entry for locking, other two
> - * bits to determine entry type. In total 3 special bits.
> - */
> -#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
> -#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
> -#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
> -#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
> -#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
> -#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
> -#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
> -		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
> -		RADIX_TREE_EXCEPTIONAL_ENTRY))
> -

Please split the move of these constants into a separate patch.

> -static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
> +static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
> +		unsigned long new_type)
>  {
> +	bool pmd_downgrade = false; /* splitting 2MiB entry into 4k entries? */
>  	void *entry, **slot;
>  
>  restart:
>  	spin_lock_irq(&mapping->tree_lock);
>  	entry = get_unlocked_mapping_entry(mapping, index, &slot);
> +
> +	if (entry && new_type == RADIX_DAX_PMD) {
> +		if (!radix_tree_exceptional_entry(entry) ||
> +				RADIX_DAX_TYPE(entry) == RADIX_DAX_PTE) {
> +			spin_unlock_irq(&mapping->tree_lock);
> +			return ERR_PTR(-EEXIST);
> +		}
> +	} else if (entry && new_type == RADIX_DAX_PTE) {
> +		if (radix_tree_exceptional_entry(entry) &&
> +		    RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD &&
> +		    (unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
> +			pmd_downgrade = true;
> +		}
> +	}

	Would be nice to use a switch on the type here:

	old_type = RADIX_DAX_TYPE(entry);

	if (entry) {
		switch (new_type) {
		case RADIX_DAX_PMD:
			if (!radix_tree_exceptional_entry(entry) ||
			    old_type == RADIX_DAX_PTE) {
				entry = ERR_PTR(-EEXIST);
				goto out_unlock;
			}
			break;
		case RADIX_DAX_PTE:
			if (radix_tree_exceptional_entry(entry) &&
			    old_type == RADIX_DAX_PMD &&
			    (unsigned long)entry &
			      (RADIX_DAX_HZP|RADIX_DAX_EMPTY))
			      	..

Btw, why are only RADIX_DAX_PTE and RADIX_DAX_PMD in the type mask,
and not RADIX_DAX_HZP and RADIX_DAX_EMPTY?  With that we could use the
above old_type local variable throughout this function and make it a
lot less of a mess.

>  static void *dax_insert_mapping_entry(struct address_space *mapping,
>  				      struct vm_fault *vmf,
> -				      void *entry, sector_t sector)
> +				      void *entry, sector_t sector,
> +				      unsigned long new_type, bool hzp)

And then we could also drop the hzp argument here..

>  #ifdef CONFIG_FS_IOMAP
> +static inline sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
> +{
> +	return iomap->blkno + (((pos & PAGE_MASK) - iomap->offset) >> 9);
> +}

Please split adding this new helper into a separate patch.

> +#if defined(CONFIG_FS_DAX_PMD)

Please use #ifdef here.

> +#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
> +#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
> +
> +/* entries begin locked */
> +#define RADIX_DAX_ENTRY(sector, type) ((void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |\
> +	type | (unsigned long)sector << RADIX_DAX_SHIFT | RADIX_DAX_ENTRY_LOCK))
> +#define RADIX_DAX_HZP_ENTRY() ((void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | \
> +	RADIX_DAX_PMD | RADIX_DAX_HZP | RADIX_DAX_EMPTY | RADIX_DAX_ENTRY_LOCK))
> +#define RADIX_DAX_EMPTY_ENTRY(type) ((void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | \
> +		type | RADIX_DAX_EMPTY | RADIX_DAX_ENTRY_LOCK))
> +
> +#define RADIX_DAX_ORDER(type) (type == RADIX_DAX_PMD ? PMD_SHIFT-PAGE_SHIFT : 0)

None of these macros properly brace their arguments.  I think
you'd make your life a lot easier by making them inline functions.
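A sketch of what the inline-function variants might look like.  The
flag bits and shift values below are illustrative stand-ins, not the
kernel's actual definitions, and the lowercase names are hypothetical:

```c
#include <assert.h>

/* Illustrative stand-ins; the real values live in fs/dax.c. */
#define RADIX_TREE_EXCEPTIONAL_ENTRY	(1UL << 0)
#define RADIX_DAX_ENTRY_LOCK		(1UL << 1)
#define RADIX_DAX_PTE			(1UL << 2)
#define RADIX_DAX_PMD			(1UL << 3)
#define RADIX_DAX_SHIFT			6
#define RADIX_DAX_TYPE_MASK		(RADIX_DAX_PTE | RADIX_DAX_PMD)
#define PAGE_SHIFT			12
#define PMD_SHIFT			21

typedef unsigned long sector_t;

/* Inline functions evaluate each argument exactly once and get type
 * checking for free, so there are no bracing pitfalls. */
static inline unsigned long radix_dax_type(void *entry)
{
	return (unsigned long)entry & RADIX_DAX_TYPE_MASK;
}

static inline sector_t radix_dax_sector(void *entry)
{
	return (unsigned long)entry >> RADIX_DAX_SHIFT;
}

static inline void *radix_dax_entry(sector_t sector, unsigned long type)
{
	/* entries begin locked, as in the original macro */
	return (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | type |
			((unsigned long)sector << RADIX_DAX_SHIFT) |
			RADIX_DAX_ENTRY_LOCK);
}

static inline unsigned int radix_dax_order(unsigned long type)
{
	return type == RADIX_DAX_PMD ? PMD_SHIFT - PAGE_SHIFT : 0;
}
```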

> +#if defined(CONFIG_FS_DAX_PMD)

#ifdef, please

* Re: [PATCH v4 00/12] re-enable DAX PMD support
  2016-09-29 22:49 [PATCH v4 00/12] re-enable DAX PMD support Ross Zwisler
                   ` (12 preceding siblings ...)
  2016-09-29 23:43 ` [PATCH v4 00/12] re-enable DAX PMD support Dave Chinner
@ 2016-09-30 11:46 ` Christoph Hellwig
  13 siblings, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2016-09-30 11:46 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Thu, Sep 29, 2016 at 04:49:18PM -0600, Ross Zwisler wrote:
> DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
> locking.  This series allows DAX PMDs to participate in the DAX radix tree
> based locking scheme so that they can be re-enabled.

This seems to pass xfstests fine, so with the various cleanups noted
in the individual patches this looks fine to me.

* Re: [PATCH v4 04/12] ext2: remove support for DAX PMD faults
  2016-09-29 22:49 ` [PATCH v4 04/12] ext2: remove support for DAX PMD faults Ross Zwisler
  2016-09-30  8:49   ` Christoph Hellwig
@ 2016-10-03  9:35   ` Jan Kara
  1 sibling, 0 replies; 46+ messages in thread
From: Jan Kara @ 2016-10-03  9:35 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Thu 29-09-16 16:49:22, Ross Zwisler wrote:
> DAX PMD support was added via the following commit:
> 
> commit e7b1ea2ad658 ("ext2: huge page fault support")
> 
> I believe this path to be untested as ext2 doesn't reliably provide block
> allocations that are aligned to 2MiB.  In my testing I've been unable to
> get ext2 to actually fault in a PMD.  It always fails with a "pfn
> unaligned" message because the sector returned by ext2_get_block() isn't
> aligned.
> 
> I've tried various settings for the "stride" and "stripe_width" extended
> options to mkfs.ext2, without any luck.
> 
> Since we can't reliably get PMDs, remove support so that we don't have an
> untested code path that we may someday traverse when we happen to get an
> aligned block allocation.  This should also make 4k DAX faults in ext2 a
> bit faster since they will no longer have to call the PMD fault handler
> only to get a response of VM_FAULT_FALLBACK.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext2/file.c | 29 ++++++-----------------------
>  1 file changed, 6 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/ext2/file.c b/fs/ext2/file.c
> index 0ca363d..0f257f8 100644
> --- a/fs/ext2/file.c
> +++ b/fs/ext2/file.c
> @@ -107,27 +107,6 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>  	return ret;
>  }
>  
> -static int ext2_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
> -						pmd_t *pmd, unsigned int flags)
> -{
> -	struct inode *inode = file_inode(vma->vm_file);
> -	struct ext2_inode_info *ei = EXT2_I(inode);
> -	int ret;
> -
> -	if (flags & FAULT_FLAG_WRITE) {
> -		sb_start_pagefault(inode->i_sb);
> -		file_update_time(vma->vm_file);
> -	}
> -	down_read(&ei->dax_sem);
> -
> -	ret = dax_pmd_fault(vma, addr, pmd, flags, ext2_get_block);
> -
> -	up_read(&ei->dax_sem);
> -	if (flags & FAULT_FLAG_WRITE)
> -		sb_end_pagefault(inode->i_sb);
> -	return ret;
> -}
> -
>  static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
>  		struct vm_fault *vmf)
>  {
> @@ -154,7 +133,11 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
>  
>  static const struct vm_operations_struct ext2_dax_vm_ops = {
>  	.fault		= ext2_dax_fault,
> -	.pmd_fault	= ext2_dax_pmd_fault,
> +	/*
> +	 * .pmd_fault is not supported for DAX because allocation in ext2
> +	 * cannot be reliably aligned to huge page sizes and so pmd faults
> +	 * will always fail and fall back to regular faults.
> +	 */
>  	.page_mkwrite	= ext2_dax_fault,
>  	.pfn_mkwrite	= ext2_dax_pfn_mkwrite,
>  };
> @@ -166,7 +149,7 @@ static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
>  
>  	file_accessed(file);
>  	vma->vm_ops = &ext2_dax_vm_ops;
> -	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
> +	vma->vm_flags |= VM_MIXEDMAP;
>  	return 0;
>  }
>  #else
> -- 
> 2.7.4
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH v4 05/12] dax: make 'wait_table' global variable static
  2016-09-29 22:49 ` [PATCH v4 05/12] dax: make 'wait_table' global variable static Ross Zwisler
  2016-09-30  8:50   ` Christoph Hellwig
@ 2016-10-03  9:36   ` Jan Kara
  1 sibling, 0 replies; 46+ messages in thread
From: Jan Kara @ 2016-10-03  9:36 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Thu 29-09-16 16:49:23, Ross Zwisler wrote:
> The global 'wait_table' variable is only used within fs/dax.c, and
> generates the following sparse warning:
> 
> fs/dax.c:39:19: warning: symbol 'wait_table' was not declared. Should it be static?
> 
> Make it static so it has scope local to fs/dax.c, and to make sparse happy.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Looks fine. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/dax.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 9b9be8a..ac28cdf 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -52,7 +52,7 @@
>  #define DAX_WAIT_TABLE_BITS 12
>  #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
>  
> -wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
> +static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
>  
>  static int __init init_dax_wait_table(void)
>  {
> -- 
> 2.7.4
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH v4 06/12] dax: consistent variable naming for DAX entries
  2016-09-29 22:49 ` [PATCH v4 06/12] dax: consistent variable naming for DAX entries Ross Zwisler
  2016-09-30  8:50   ` Christoph Hellwig
@ 2016-10-03  9:37   ` Jan Kara
  1 sibling, 0 replies; 46+ messages in thread
From: Jan Kara @ 2016-10-03  9:37 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Thu 29-09-16 16:49:24, Ross Zwisler wrote:
> No functional change.
> 
> Consistently use the variable name 'entry' instead of 'ret' for DAX radix
> tree entries.  This was already happening in most of the code, so update
> get_unlocked_mapping_entry(), grab_mapping_entry() and
> dax_unlock_mapping_entry().
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/dax.c | 34 +++++++++++++++++-----------------
>  1 file changed, 17 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index ac28cdf..baef586 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -357,7 +357,7 @@ static inline void *unlock_slot(struct address_space *mapping, void **slot)
>  static void *get_unlocked_mapping_entry(struct address_space *mapping,
>  					pgoff_t index, void ***slotp)
>  {
> -	void *ret, **slot;
> +	void *entry, **slot;
>  	struct wait_exceptional_entry_queue ewait;
>  	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
>  
> @@ -367,13 +367,13 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
>  	ewait.key.index = index;
>  
>  	for (;;) {
> -		ret = __radix_tree_lookup(&mapping->page_tree, index, NULL,
> +		entry = __radix_tree_lookup(&mapping->page_tree, index, NULL,
>  					  &slot);
> -		if (!ret || !radix_tree_exceptional_entry(ret) ||
> +		if (!entry || !radix_tree_exceptional_entry(entry) ||
>  		    !slot_locked(mapping, slot)) {
>  			if (slotp)
>  				*slotp = slot;
> -			return ret;
> +			return entry;
>  		}
>  		prepare_to_wait_exclusive(wq, &ewait.wait,
>  					  TASK_UNINTERRUPTIBLE);
> @@ -396,13 +396,13 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
>   */
>  static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
>  {
> -	void *ret, **slot;
> +	void *entry, **slot;
>  
>  restart:
>  	spin_lock_irq(&mapping->tree_lock);
> -	ret = get_unlocked_mapping_entry(mapping, index, &slot);
> +	entry = get_unlocked_mapping_entry(mapping, index, &slot);
>  	/* No entry for given index? Make sure radix tree is big enough. */
> -	if (!ret) {
> +	if (!entry) {
>  		int err;
>  
>  		spin_unlock_irq(&mapping->tree_lock);
> @@ -410,10 +410,10 @@ restart:
>  				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
>  		if (err)
>  			return ERR_PTR(err);
> -		ret = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
> +		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
>  			       RADIX_DAX_ENTRY_LOCK);
>  		spin_lock_irq(&mapping->tree_lock);
> -		err = radix_tree_insert(&mapping->page_tree, index, ret);
> +		err = radix_tree_insert(&mapping->page_tree, index, entry);
>  		radix_tree_preload_end();
>  		if (err) {
>  			spin_unlock_irq(&mapping->tree_lock);
> @@ -425,11 +425,11 @@ restart:
>  		/* Good, we have inserted empty locked entry into the tree. */
>  		mapping->nrexceptional++;
>  		spin_unlock_irq(&mapping->tree_lock);
> -		return ret;
> +		return entry;
>  	}
>  	/* Normal page in radix tree? */
> -	if (!radix_tree_exceptional_entry(ret)) {
> -		struct page *page = ret;
> +	if (!radix_tree_exceptional_entry(entry)) {
> +		struct page *page = entry;
>  
>  		get_page(page);
>  		spin_unlock_irq(&mapping->tree_lock);
> @@ -442,9 +442,9 @@ restart:
>  		}
>  		return page;
>  	}
> -	ret = lock_slot(mapping, slot);
> +	entry = lock_slot(mapping, slot);
>  	spin_unlock_irq(&mapping->tree_lock);
> -	return ret;
> +	return entry;
>  }
>  
>  void dax_wake_mapping_entry_waiter(struct address_space *mapping,
> @@ -469,11 +469,11 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
>  
>  void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
>  {
> -	void *ret, **slot;
> +	void *entry, **slot;
>  
>  	spin_lock_irq(&mapping->tree_lock);
> -	ret = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
> -	if (WARN_ON_ONCE(!ret || !radix_tree_exceptional_entry(ret) ||
> +	entry = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
> +	if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry) ||
>  			 !slot_locked(mapping, slot))) {
>  		spin_unlock_irq(&mapping->tree_lock);
>  		return;
> -- 
> 2.7.4
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH v4 07/12] dax: coordinate locking for offsets in PMD range
  2016-09-29 22:49 ` [PATCH v4 07/12] dax: coordinate locking for offsets in PMD range Ross Zwisler
  2016-09-30  9:44   ` Christoph Hellwig
@ 2016-10-03  9:55   ` Jan Kara
  2016-10-03 18:40     ` Ross Zwisler
  1 sibling, 1 reply; 46+ messages in thread
From: Jan Kara @ 2016-10-03  9:55 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Thu 29-09-16 16:49:25, Ross Zwisler wrote:
> DAX radix tree locking currently locks entries based on the unique
> combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
> This works for PTEs, but as we move to PMDs we will need to have all the
> offsets within the range covered by the PMD to map to the same bit lock.
> To accomplish this, for ranges covered by a PMD entry we will instead lock
> based on the page offset of the beginning of the PMD entry.  The 'mapping'
> pointer is still used in the same way.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  fs/dax.c            | 37 ++++++++++++++++++++++++-------------
>  include/linux/dax.h |  2 +-
>  mm/filemap.c        |  2 +-
>  3 files changed, 26 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index baef586..406feea 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -64,10 +64,17 @@ static int __init init_dax_wait_table(void)
>  }
>  fs_initcall(init_dax_wait_table);
>  
> +static pgoff_t dax_entry_start(pgoff_t index, void *entry)
> +{
> +	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> +		index &= (PMD_MASK >> PAGE_SHIFT);

Hmm, but if we shift right, the top bits of PMD_MASK become zero - not
something we want, I guess... You rather want to mask with something like:
	~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1)
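A quick userspace check of the difference.  The PAGE_SHIFT/PMD_SHIFT
values are the usual x86-64 ones, used purely for illustration, and the
helper names are hypothetical:

```c
#include <assert.h>

#define PAGE_SHIFT	12	/* usual x86-64 values, illustration only */
#define PMD_SHIFT	21
#define PMD_SIZE	(1UL << PMD_SHIFT)
#define PMD_MASK	(~(PMD_SIZE - 1))

/* The patch's expression: the right shift pulls zeroes into the top
 * PAGE_SHIFT bits of PMD_MASK, so those bits of the index are cleared
 * along with the low bits. */
static unsigned long align_shifted_mask(unsigned long index)
{
	return index & (PMD_MASK >> PAGE_SHIFT);
}

/* The suggested expression: build the mask at page granularity, so only
 * the low (PMD_SHIFT - PAGE_SHIFT) bits are cleared. */
static unsigned long align_page_mask(unsigned long index)
{
	return index & ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
}
```

For small page offsets the two agree, which is why the bug is easy to
miss; they diverge once the index has bits set above bit 51.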

> @@ -447,10 +457,11 @@ restart:
>  	return entry;
>  }
>  
> -void dax_wake_mapping_entry_waiter(struct address_space *mapping,
> +void dax_wake_mapping_entry_waiter(void *entry, struct address_space *mapping,
>  				   pgoff_t index, bool wake_all)

Nitpick: Ordering of arguments would look more logical to me like:

dax_wake_mapping_entry_waiter(mapping, index, entry, wake_all)

Other than that the patch looks good to me.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH v4 08/12] dax: remove dax_pmd_fault()
  2016-09-29 22:49 ` [PATCH v4 08/12] dax: remove dax_pmd_fault() Ross Zwisler
  2016-09-30  8:50   ` Christoph Hellwig
@ 2016-10-03  9:56   ` Jan Kara
  1 sibling, 0 replies; 46+ messages in thread
From: Jan Kara @ 2016-10-03  9:56 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Thu 29-09-16 16:49:26, Ross Zwisler wrote:
> dax_pmd_fault() is the old struct buffer_head + get_block_t based 2 MiB DAX
> fault handler.  This fault handler has been disabled for several kernel
> releases, and support for PMDs will be reintroduced using the struct iomap
> interface instead.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/dax.c            | 213 ----------------------------------------------------
>  include/linux/dax.h |   6 +-
>  2 files changed, 1 insertion(+), 218 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 406feea..b5e7b13 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -909,219 +909,6 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  }
>  EXPORT_SYMBOL_GPL(dax_fault);
>  
> -#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
> -/*
> - * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
> - * more often than one might expect in the below function.
> - */
> -#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
> -
> -static void __dax_dbg(struct buffer_head *bh, unsigned long address,
> -		const char *reason, const char *fn)
> -{
> -	if (bh) {
> -		char bname[BDEVNAME_SIZE];
> -		bdevname(bh->b_bdev, bname);
> -		pr_debug("%s: %s addr: %lx dev %s state %lx start %lld "
> -			"length %zd fallback: %s\n", fn, current->comm,
> -			address, bname, bh->b_state, (u64)bh->b_blocknr,
> -			bh->b_size, reason);
> -	} else {
> -		pr_debug("%s: %s addr: %lx fallback: %s\n", fn,
> -			current->comm, address, reason);
> -	}
> -}
> -
> -#define dax_pmd_dbg(bh, address, reason)	__dax_dbg(bh, address, reason, "dax_pmd")
> -
> -/**
> - * dax_pmd_fault - handle a PMD fault on a DAX file
> - * @vma: The virtual memory area where the fault occurred
> - * @vmf: The description of the fault
> - * @get_block: The filesystem method used to translate file offsets to blocks
> - *
> - * When a page fault occurs, filesystems may call this helper in their
> - * pmd_fault handler for DAX files.
> - */
> -int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> -		pmd_t *pmd, unsigned int flags, get_block_t get_block)
> -{
> -	struct file *file = vma->vm_file;
> -	struct address_space *mapping = file->f_mapping;
> -	struct inode *inode = mapping->host;
> -	struct buffer_head bh;
> -	unsigned blkbits = inode->i_blkbits;
> -	unsigned long pmd_addr = address & PMD_MASK;
> -	bool write = flags & FAULT_FLAG_WRITE;
> -	struct block_device *bdev;
> -	pgoff_t size, pgoff;
> -	sector_t block;
> -	int result = 0;
> -	bool alloc = false;
> -
> -	/* dax pmd mappings require pfn_t_devmap() */
> -	if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
> -		return VM_FAULT_FALLBACK;
> -
> -	/* Fall back to PTEs if we're going to COW */
> -	if (write && !(vma->vm_flags & VM_SHARED)) {
> -		split_huge_pmd(vma, pmd, address);
> -		dax_pmd_dbg(NULL, address, "cow write");
> -		return VM_FAULT_FALLBACK;
> -	}
> -	/* If the PMD would extend outside the VMA */
> -	if (pmd_addr < vma->vm_start) {
> -		dax_pmd_dbg(NULL, address, "vma start unaligned");
> -		return VM_FAULT_FALLBACK;
> -	}
> -	if ((pmd_addr + PMD_SIZE) > vma->vm_end) {
> -		dax_pmd_dbg(NULL, address, "vma end unaligned");
> -		return VM_FAULT_FALLBACK;
> -	}
> -
> -	pgoff = linear_page_index(vma, pmd_addr);
> -	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> -	if (pgoff >= size)
> -		return VM_FAULT_SIGBUS;
> -	/* If the PMD would cover blocks out of the file */
> -	if ((pgoff | PG_PMD_COLOUR) >= size) {
> -		dax_pmd_dbg(NULL, address,
> -				"offset + huge page size > file size");
> -		return VM_FAULT_FALLBACK;
> -	}
> -
> -	memset(&bh, 0, sizeof(bh));
> -	bh.b_bdev = inode->i_sb->s_bdev;
> -	block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
> -
> -	bh.b_size = PMD_SIZE;
> -
> -	if (get_block(inode, block, &bh, 0) != 0)
> -		return VM_FAULT_SIGBUS;
> -
> -	if (!buffer_mapped(&bh) && write) {
> -		if (get_block(inode, block, &bh, 1) != 0)
> -			return VM_FAULT_SIGBUS;
> -		alloc = true;
> -		WARN_ON_ONCE(buffer_unwritten(&bh) || buffer_new(&bh));
> -	}
> -
> -	bdev = bh.b_bdev;
> -
> -	if (bh.b_size < PMD_SIZE) {
> -		dax_pmd_dbg(&bh, address, "allocated block too small");
> -		return VM_FAULT_FALLBACK;
> -	}
> -
> -	/*
> -	 * If we allocated new storage, make sure no process has any
> -	 * zero pages covering this hole
> -	 */
> -	if (alloc) {
> -		loff_t lstart = pgoff << PAGE_SHIFT;
> -		loff_t lend = lstart + PMD_SIZE - 1; /* inclusive */
> -
> -		truncate_pagecache_range(inode, lstart, lend);
> -	}
> -
> -	if (!write && !buffer_mapped(&bh)) {
> -		spinlock_t *ptl;
> -		pmd_t entry;
> -		struct page *zero_page = get_huge_zero_page();
> -
> -		if (unlikely(!zero_page)) {
> -			dax_pmd_dbg(&bh, address, "no zero page");
> -			goto fallback;
> -		}
> -
> -		ptl = pmd_lock(vma->vm_mm, pmd);
> -		if (!pmd_none(*pmd)) {
> -			spin_unlock(ptl);
> -			dax_pmd_dbg(&bh, address, "pmd already present");
> -			goto fallback;
> -		}
> -
> -		dev_dbg(part_to_dev(bdev->bd_part),
> -				"%s: %s addr: %lx pfn: <zero> sect: %llx\n",
> -				__func__, current->comm, address,
> -				(unsigned long long) to_sector(&bh, inode));
> -
> -		entry = mk_pmd(zero_page, vma->vm_page_prot);
> -		entry = pmd_mkhuge(entry);
> -		set_pmd_at(vma->vm_mm, pmd_addr, pmd, entry);
> -		result = VM_FAULT_NOPAGE;
> -		spin_unlock(ptl);
> -	} else {
> -		struct blk_dax_ctl dax = {
> -			.sector = to_sector(&bh, inode),
> -			.size = PMD_SIZE,
> -		};
> -		long length = dax_map_atomic(bdev, &dax);
> -
> -		if (length < 0) {
> -			dax_pmd_dbg(&bh, address, "dax-error fallback");
> -			goto fallback;
> -		}
> -		if (length < PMD_SIZE) {
> -			dax_pmd_dbg(&bh, address, "dax-length too small");
> -			dax_unmap_atomic(bdev, &dax);
> -			goto fallback;
> -		}
> -		if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR) {
> -			dax_pmd_dbg(&bh, address, "pfn unaligned");
> -			dax_unmap_atomic(bdev, &dax);
> -			goto fallback;
> -		}
> -
> -		if (!pfn_t_devmap(dax.pfn)) {
> -			dax_unmap_atomic(bdev, &dax);
> -			dax_pmd_dbg(&bh, address, "pfn not in memmap");
> -			goto fallback;
> -		}
> -		dax_unmap_atomic(bdev, &dax);
> -
> -		/*
> -		 * For PTE faults we insert a radix tree entry for reads, and
> -		 * leave it clean.  Then on the first write we dirty the radix
> -		 * tree entry via the dax_pfn_mkwrite() path.  This sequence
> -		 * allows the dax_pfn_mkwrite() call to be simpler and avoid a
> -		 * call into get_block() to translate the pgoff to a sector in
> -		 * order to be able to create a new radix tree entry.
> -		 *
> -		 * The PMD path doesn't have an equivalent to
> -		 * dax_pfn_mkwrite(), though, so for a read followed by a
> -		 * write we traverse all the way through dax_pmd_fault()
> -		 * twice.  This means we can just skip inserting a radix tree
> -		 * entry completely on the initial read and just wait until
> -		 * the write to insert a dirty entry.
> -		 */
> -		if (write) {
> -			/*
> -			 * We should insert radix-tree entry and dirty it here.
> -			 * For now this is broken...
> -			 */
> -		}
> -
> -		dev_dbg(part_to_dev(bdev->bd_part),
> -				"%s: %s addr: %lx pfn: %lx sect: %llx\n",
> -				__func__, current->comm, address,
> -				pfn_t_to_pfn(dax.pfn),
> -				(unsigned long long) dax.sector);
> -		result |= vmf_insert_pfn_pmd(vma, address, pmd,
> -				dax.pfn, write);
> -	}
> -
> - out:
> -	return result;
> -
> - fallback:
> -	count_vm_event(THP_FAULT_FALLBACK);
> -	result = VM_FAULT_FALLBACK;
> -	goto out;
> -}
> -EXPORT_SYMBOL_GPL(dax_pmd_fault);
> -#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> -
>  /**
>   * dax_pfn_mkwrite - handle first write to DAX page
>   * @vma: The virtual memory area where the fault occurred
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 4065601..d9a8350 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -48,16 +48,12 @@ static inline int __dax_zero_page_range(struct block_device *bdev,
>  }
>  #endif
>  
> -#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
> -int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
> -				unsigned int flags, get_block_t);
> -#else
>  static inline int dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
>  				pmd_t *pmd, unsigned int flags, get_block_t gb)
>  {
>  	return VM_FAULT_FALLBACK;
>  }
> -#endif
> +
>  int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
>  #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
>  
> -- 
> 2.7.4
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH v4 09/12] dax: correct dax iomap code namespace
  2016-09-29 22:49 ` [PATCH v4 09/12] dax: correct dax iomap code namespace Ross Zwisler
  2016-09-30  8:51   ` Christoph Hellwig
@ 2016-10-03  9:57   ` Jan Kara
  1 sibling, 0 replies; 46+ messages in thread
From: Jan Kara @ 2016-10-03  9:57 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Thu 29-09-16 16:49:27, Ross Zwisler wrote:
> The recently added DAX functions that use the new struct iomap data
> structure were named iomap_dax_rw(), iomap_dax_fault() and
> iomap_dax_actor().  These are actually defined in fs/dax.c, though, so
> should be part of the "dax" namespace and not the "iomap" namespace.
> Rename them to dax_iomap_rw(), dax_iomap_fault() and dax_iomap_actor()
> respectively.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Suggested-by: Dave Chinner <david@fromorbit.com>

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/dax.c            | 16 ++++++++--------
>  fs/ext2/file.c      |  6 +++---
>  fs/xfs/xfs_file.c   |  8 ++++----
>  include/linux/dax.h |  4 ++--
>  4 files changed, 17 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index b5e7b13..6977e5e 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1024,7 +1024,7 @@ EXPORT_SYMBOL_GPL(dax_truncate_page);
>  
>  #ifdef CONFIG_FS_IOMAP
>  static loff_t
> -iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> +dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  		struct iomap *iomap)
>  {
>  	struct iov_iter *iter = data;
> @@ -1081,7 +1081,7 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  }
>  
>  /**
> - * iomap_dax_rw - Perform I/O to a DAX file
> + * dax_iomap_rw - Perform I/O to a DAX file
>   * @iocb:	The control block for this I/O
>   * @iter:	The addresses to do I/O from or to
>   * @ops:	iomap ops passed from the file system
> @@ -1091,7 +1091,7 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>   * and evicting any page cache pages in the region under I/O.
>   */
>  ssize_t
> -iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
> +dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		struct iomap_ops *ops)
>  {
>  	struct address_space *mapping = iocb->ki_filp->f_mapping;
> @@ -1121,7 +1121,7 @@ iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
>  
>  	while (iov_iter_count(iter)) {
>  		ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
> -				iter, iomap_dax_actor);
> +				iter, dax_iomap_actor);
>  		if (ret <= 0)
>  			break;
>  		pos += ret;
> @@ -1131,10 +1131,10 @@ iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	iocb->ki_pos += done;
>  	return done ? done : ret;
>  }
> -EXPORT_SYMBOL_GPL(iomap_dax_rw);
> +EXPORT_SYMBOL_GPL(dax_iomap_rw);
>  
>  /**
> - * iomap_dax_fault - handle a page fault on a DAX file
> + * dax_iomap_fault - handle a page fault on a DAX file
>   * @vma: The virtual memory area where the fault occurred
>   * @vmf: The description of the fault
>   * @ops: iomap ops passed from the file system
> @@ -1143,7 +1143,7 @@ EXPORT_SYMBOL_GPL(iomap_dax_rw);
>   * or mkwrite handler for DAX files. Assumes the caller has done all the
>   * necessary locking for the page fault to proceed successfully.
>   */
> -int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> +int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  			struct iomap_ops *ops)
>  {
>  	struct address_space *mapping = vma->vm_file->f_mapping;
> @@ -1245,5 +1245,5 @@ int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		return VM_FAULT_SIGBUS | major;
>  	return VM_FAULT_NOPAGE | major;
>  }
> -EXPORT_SYMBOL_GPL(iomap_dax_fault);
> +EXPORT_SYMBOL_GPL(dax_iomap_fault);
>  #endif /* CONFIG_FS_IOMAP */
> diff --git a/fs/ext2/file.c b/fs/ext2/file.c
> index 0f257f8..32a4913 100644
> --- a/fs/ext2/file.c
> +++ b/fs/ext2/file.c
> @@ -38,7 +38,7 @@ static ssize_t ext2_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  		return 0; /* skip atime */
>  
>  	inode_lock_shared(inode);
> -	ret = iomap_dax_rw(iocb, to, &ext2_iomap_ops);
> +	ret = dax_iomap_rw(iocb, to, &ext2_iomap_ops);
>  	inode_unlock_shared(inode);
>  
>  	file_accessed(iocb->ki_filp);
> @@ -62,7 +62,7 @@ static ssize_t ext2_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  	if (ret)
>  		goto out_unlock;
>  
> -	ret = iomap_dax_rw(iocb, from, &ext2_iomap_ops);
> +	ret = dax_iomap_rw(iocb, from, &ext2_iomap_ops);
>  	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
>  		i_size_write(inode, iocb->ki_pos);
>  		mark_inode_dirty(inode);
> @@ -99,7 +99,7 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>  	}
>  	down_read(&ei->dax_sem);
>  
> -	ret = iomap_dax_fault(vma, vmf, &ext2_iomap_ops);
> +	ret = dax_iomap_fault(vma, vmf, &ext2_iomap_ops);
>  
>  	up_read(&ei->dax_sem);
>  	if (vmf->flags & FAULT_FLAG_WRITE)
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 882f264..00293d2 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -342,7 +342,7 @@ xfs_file_dax_read(
>  		return 0; /* skip atime */
>  
>  	xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
> -	ret = iomap_dax_rw(iocb, to, &xfs_iomap_ops);
> +	ret = dax_iomap_rw(iocb, to, &xfs_iomap_ops);
>  	xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
>  
>  	file_accessed(iocb->ki_filp);
> @@ -721,7 +721,7 @@ xfs_file_dax_write(
>  
>  	trace_xfs_file_dax_write(ip, count, pos);
>  
> -	ret = iomap_dax_rw(iocb, from, &xfs_iomap_ops);
> +	ret = dax_iomap_rw(iocb, from, &xfs_iomap_ops);
>  	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
>  		i_size_write(inode, iocb->ki_pos);
>  		error = xfs_setfilesize(ip, pos, ret);
> @@ -1468,7 +1468,7 @@ xfs_filemap_page_mkwrite(
>  	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>  
>  	if (IS_DAX(inode)) {
> -		ret = iomap_dax_fault(vma, vmf, &xfs_iomap_ops);
> +		ret = dax_iomap_fault(vma, vmf, &xfs_iomap_ops);
>  	} else {
>  		ret = iomap_page_mkwrite(vma, vmf, &xfs_iomap_ops);
>  		ret = block_page_mkwrite_return(ret);
> @@ -1502,7 +1502,7 @@ xfs_filemap_fault(
>  		 * changes to xfs_get_blocks_direct() to map unwritten extent
>  		 * ioend for conversion on read-only mappings.
>  		 */
> -		ret = iomap_dax_fault(vma, vmf, &xfs_iomap_ops);
> +		ret = dax_iomap_fault(vma, vmf, &xfs_iomap_ops);
>  	} else
>  		ret = filemap_fault(vma, vmf);
>  	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index d9a8350..c4a51bb 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -11,13 +11,13 @@ struct iomap_ops;
>  /* We use lowest available exceptional entry bit for locking */
>  #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
>  
> -ssize_t iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
> +ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		struct iomap_ops *ops);
>  ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *,
>  		  get_block_t, dio_iodone_t, int flags);
>  int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
>  int dax_truncate_page(struct inode *, loff_t from, get_block_t);
> -int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> +int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  			struct iomap_ops *ops);
>  int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
>  int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
> -- 
> 2.7.4
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v4 10/12] dax: add struct iomap based DAX PMD support
  2016-09-29 22:49 ` [PATCH v4 10/12] dax: add struct iomap based DAX PMD support Ross Zwisler
  2016-09-30  9:56   ` Christoph Hellwig
@ 2016-10-03 10:59   ` Jan Kara
  2016-10-03 16:37     ` Christoph Hellwig
  2016-10-03 21:05     ` Ross Zwisler
  1 sibling, 2 replies; 46+ messages in thread
From: Jan Kara @ 2016-10-03 10:59 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Thu 29-09-16 16:49:28, Ross Zwisler wrote:
> DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
> locking.  This patch allows DAX PMDs to participate in the DAX radix tree
> based locking scheme so that they can be re-enabled using the new struct
> iomap based fault handlers.
> 
> There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
> mappings that have an associated block allocation, and 4k DAX empty
> entries.  The empty entries exist to provide locking for the duration of a
> given page fault.
> 
> This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP)
> entries, PMD DAX entries that have associated block allocations, and 2 MiB
> DAX empty entries.
> 
> Unlike the 4k case where we insert a struct page* into the radix tree for
> 4k zero pages, for HZP we insert a DAX exceptional entry with the new
> RADIX_DAX_HZP flag set.  This is because we use a single 2 MiB zero page in
> every 2MiB hole mapping, and it doesn't make sense to have that same struct
> page* with multiple entries in multiple trees.  This would cause contention
> on the single page lock for the one Huge Zero Page, and it would break the
> page->index and page->mapping associations that are assumed to be valid in
> many other places in the kernel.
> 
> One difficult use case is when one thread is trying to use 4k entries in
> radix tree for a given offset, and another thread is using 2 MiB entries
> for that same offset.  The current code handles this by making the 2 MiB
> user fall back to 4k entries for most cases.  This was done because it is
> the simplest solution, and because the use of 2MiB pages is already
> opportunistic.
> 
> If we were to try to upgrade from 4k pages to 2MiB pages for a given range,
> we run into the problem of how we lock out 4k page faults for the entire
> 2MiB range while we clean out the radix tree so we can insert the 2MiB
> entry.  We can solve this problem if we need to, but I think that the cases
> where both 2MiB entries and 4K entries are being used for the same range
> will be rare enough and the gain small enough that it probably won't be
> worth the complexity.
...
>  restart:
>  	spin_lock_irq(&mapping->tree_lock);
>  	entry = get_unlocked_mapping_entry(mapping, index, &slot);
> +
> +	if (entry && new_type == RADIX_DAX_PMD) {
> +		if (!radix_tree_exceptional_entry(entry) ||
> +				RADIX_DAX_TYPE(entry) == RADIX_DAX_PTE) {
> +			spin_unlock_irq(&mapping->tree_lock);
> +			return ERR_PTR(-EEXIST);
> +		}
> +	} else if (entry && new_type == RADIX_DAX_PTE) {
> +		if (radix_tree_exceptional_entry(entry) &&
> +		    RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD &&
> +		    (unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
> +			pmd_downgrade = true;
> +		}
> +	}
> +
>  	/* No entry for given index? Make sure radix tree is big enough. */
> -	if (!entry) {
> +	if (!entry || pmd_downgrade) {
>  		int err;
>  
>  		spin_unlock_irq(&mapping->tree_lock);
> @@ -420,15 +439,39 @@ restart:
>  				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
>  		if (err)
>  			return ERR_PTR(err);
> -		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
> -			       RADIX_DAX_ENTRY_LOCK);
> +
> +		/*
> +		 * Besides huge zero pages the only other thing that gets
> +		 * downgraded are empty entries which don't need to be
> +		 * unmapped.
> +		 */
> +		if (pmd_downgrade && ((unsigned long)entry & RADIX_DAX_HZP))
> +			unmap_mapping_range(mapping,
> +				(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> +
>  		spin_lock_irq(&mapping->tree_lock);
> -		err = radix_tree_insert(&mapping->page_tree, index, entry);
> +
> +		if (pmd_downgrade) {
> +			radix_tree_delete(&mapping->page_tree, index);
> +			mapping->nrexceptional--;
> +			dax_wake_mapping_entry_waiter(entry, mapping, index,
> +					false);
> +		}

Hum, this looks really problematic. Once you have dropped tree_lock,
anything could have happened with the radix tree - in particular the entry
you've got from get_unlocked_mapping_entry() can be different by now. Also
there's no guarantee that someone does not map the huge entry again just
after your call to unmap_mapping_range() finished.

So it seems you need to lock the entry (if you have one) before releasing
tree_lock to stabilize it. That is enough also to block other mappings of
that entry. Then once you reacquire the tree_lock, you can safely delete it
and replace it with a different entry.

> +
> +		entry = RADIX_DAX_EMPTY_ENTRY(new_type);
> +
> +		err = __radix_tree_insert(&mapping->page_tree, index,
> +				RADIX_DAX_ORDER(new_type), entry);
>  		radix_tree_preload_end();
>  		if (err) {
>  			spin_unlock_irq(&mapping->tree_lock);
> -			/* Someone already created the entry? */
> -			if (err == -EEXIST)
> +			/*
> +			 * Someone already created the entry?  This is a
> +			 * normal failure when inserting PMDs in a range
> +			 * that already contains PTEs.  In that case we want
> +			 * to return -EEXIST immediately.
> +			 */
> +			if (err == -EEXIST && new_type == RADIX_DAX_PTE)
>  				goto restart;
>  			return ERR_PTR(err);
>  		}
> @@ -596,11 +639,17 @@ static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size
>  	return 0;
>  }
>  
> -#define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_SHIFT))
> -
> +/*
> + * By this point grab_mapping_entry() has ensured that we have a locked entry
> + * of the appropriate size so we don't have to worry about downgrading PMDs to
> + * PTEs.  If we happen to be trying to insert a PTE and there is a PMD
> + * already in the tree, we will skip the insertion and just dirty the PMD as
> + * appropriate.
> + */
>  static void *dax_insert_mapping_entry(struct address_space *mapping,
>  				      struct vm_fault *vmf,
> -				      void *entry, sector_t sector)
> +				      void *entry, sector_t sector,
> +				      unsigned long new_type, bool hzp)
>  {
>  	struct radix_tree_root *page_tree = &mapping->page_tree;
>  	int error = 0;
> @@ -623,22 +672,30 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
>  		error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
>  		if (error)
>  			return ERR_PTR(error);
> +	} else if ((unsigned long)entry & RADIX_DAX_HZP && !hzp) {
> +		/* replacing huge zero page with PMD block mapping */
> +		unmap_mapping_range(mapping,
> +			(vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
>  	}
>  
>  	spin_lock_irq(&mapping->tree_lock);
> -	new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
> -		       RADIX_DAX_ENTRY_LOCK);
> +	if (hzp)
> +		new_entry = RADIX_DAX_HZP_ENTRY();
> +	else
> +		new_entry = RADIX_DAX_ENTRY(sector, new_type);
> +
>  	if (hole_fill) {
>  		__delete_from_page_cache(entry, NULL);
>  		/* Drop pagecache reference */
>  		put_page(entry);
> -		error = radix_tree_insert(page_tree, index, new_entry);
> +		error = __radix_tree_insert(page_tree, index,
> +				RADIX_DAX_ORDER(new_type), new_entry);
>  		if (error) {
>  			new_entry = ERR_PTR(error);
>  			goto unlock;
>  		}
>  		mapping->nrexceptional++;
> -	} else {
> +	} else if ((unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
>  		void **slot;
>  		void *ret;

Hum, I somewhat dislike how PTE and PMD paths differ here. But it's OK for
now I guess. Long term we might be better off to do away with zero pages
for PTEs as well and use exceptional entry and a single zero page like you
do for PMD. Because the special cases these zero pages cause are a
headache.

>  
> @@ -694,6 +751,18 @@ static int dax_writeback_one(struct block_device *bdev,
>  		goto unlock;
>  	}
>  
> +	if (WARN_ON_ONCE((unsigned long)entry & RADIX_DAX_EMPTY)) {
> +		ret = -EIO;
> +		goto unlock;
> +	}
> +
> +	/*
> +	 * Even if dax_writeback_mapping_range() was given a wbc->range_start
> +	 * in the middle of a PMD, the 'index' we are given will be aligned to
> +	 * the start index of the PMD, as will the sector we pull from
> +	 * 'entry'.  This allows us to flush for PMD_SIZE and not have to
> +	 * worry about partial PMD writebacks.
> +	 */
>  	dax.sector = RADIX_DAX_SECTOR(entry);
>  	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
>  	spin_unlock_irq(&mapping->tree_lock);

<snip>

> +int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> +		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops)
> +{
> +	struct address_space *mapping = vma->vm_file->f_mapping;
> +	unsigned long pmd_addr = address & PMD_MASK;
> +	bool write = flags & FAULT_FLAG_WRITE;
> +	struct inode *inode = mapping->host;
> +	struct iomap iomap = { 0 };
> +	int error, result = 0;
> +	pgoff_t size, pgoff;
> +	struct vm_fault vmf;
> +	void *entry;
> +	loff_t pos;
> +
> +	/* Fall back to PTEs if we're going to COW */
> +	if (write && !(vma->vm_flags & VM_SHARED)) {
> +		split_huge_pmd(vma, pmd, address);
> +		return VM_FAULT_FALLBACK;
> +	}
> +
> +	/* If the PMD would extend outside the VMA */
> +	if (pmd_addr < vma->vm_start)
> +		return VM_FAULT_FALLBACK;
> +	if ((pmd_addr + PMD_SIZE) > vma->vm_end)
> +		return VM_FAULT_FALLBACK;
> +
> +	/*
> +	 * Check whether offset isn't beyond end of file now. Caller is
> +	 * supposed to hold locks serializing us with truncate / punch hole so
> +	 * this is a reliable test.
> +	 */
> +	pgoff = linear_page_index(vma, pmd_addr);
> +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +
> +	if (pgoff >= size)
> +		return VM_FAULT_SIGBUS;
> +
> +	/* If the PMD would extend beyond the file size */
> +	if ((pgoff | PG_PMD_COLOUR) >= size)
> +		return VM_FAULT_FALLBACK;
> +
> +	/*
> +	 * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
> +	 * PMD or a HZP entry.  If it can't (because a 4k page is already in
> +	 * the tree, for instance), it will return -EEXIST and we just fall
> +	 * back to 4k entries.
> +	 */
> +	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
> +	if (IS_ERR(entry))
> +		return VM_FAULT_FALLBACK;
> +
> +	/*
> +	 * Note that we don't use iomap_apply here.  We aren't doing I/O, only
> +	 * setting up a mapping, so really we're using iomap_begin() as a way
> +	 * to look up our filesystem block.
> +	 */
> +	pos = (loff_t)pgoff << PAGE_SHIFT;
> +	error = ops->iomap_begin(inode, pos, PMD_SIZE, write ? IOMAP_WRITE : 0,
> +			&iomap);

I'm not quite sure if it is OK to call ->iomap_begin() without ever calling
->iomap_end. Specifically the comment before iomap_apply() says:

"It is assumed that the filesystems will lock whatever resources they
require in the iomap_begin call, and release them in the iomap_end call."

so what you do could result in unbalanced allocations / locks / whatever.
Christoph?

> +	if (error)
> +		goto fallback;
> +	if (iomap.offset + iomap.length < pos + PMD_SIZE)
> +		goto fallback;
> +
> +	vmf.pgoff = pgoff;
> +	vmf.flags = flags;
> +	vmf.gfp_mask = mapping_gfp_mask(mapping) | __GFP_FS | __GFP_IO;

I don't think you want __GFP_FS here - we have already gone through the
filesystem's pmd_fault() handler which called dax_iomap_pmd_fault() and
thus we hold various fs locks, freeze protection, ...

> +
> +	switch (iomap.type) {
> +	case IOMAP_MAPPED:
> +		result = dax_pmd_insert_mapping(vma, pmd, &vmf, address,
> +				&iomap, pos, write, &entry);
> +		break;
> +	case IOMAP_UNWRITTEN:
> +	case IOMAP_HOLE:
> +		if (WARN_ON_ONCE(write))
> +			goto fallback;
> +		result = dax_pmd_load_hole(vma, pmd, &vmf, address, &iomap,
> +				&entry);
> +		break;
> +	default:
> +		WARN_ON_ONCE(1);
> +		result = VM_FAULT_FALLBACK;
> +		break;
> +	}
> +
> +	if (result == VM_FAULT_FALLBACK)
> +		count_vm_event(THP_FAULT_FALLBACK);
> +
> + unlock_entry:
> +	put_locked_mapping_entry(mapping, pgoff, entry);
> +	return result;
> +
> + fallback:
> +	count_vm_event(THP_FAULT_FALLBACK);
> +	result = VM_FAULT_FALLBACK;
> +	goto unlock_entry;
> +}
> +EXPORT_SYMBOL_GPL(dax_iomap_pmd_fault);
> +#endif /* CONFIG_FS_DAX_PMD */
>  #endif /* CONFIG_FS_IOMAP */
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index c4a51bb..cacff9e 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -8,8 +8,33 @@
>  
>  struct iomap_ops;
>  
> -/* We use lowest available exceptional entry bit for locking */
> +/*
> + * We use lowest available bit in exceptional entry for locking, two bits for
> + * the entry type (PMD & PTE), and two more for flags (HZP and empty).  In
> + * total five special bits.
> + */
> +#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 5)
>  #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
> +/* PTE and PMD types */
> +#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
> +#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
> +/* huge zero page and empty entry flags */
> +#define RADIX_DAX_HZP (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 3))
> +#define RADIX_DAX_EMPTY (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 4))

I think we can do with just 2 bits for type instead of 4 but for now this
is OK I guess.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v4 10/12] dax: add struct iomap based DAX PMD support
  2016-10-03 10:59   ` Jan Kara
@ 2016-10-03 16:37     ` Christoph Hellwig
  2016-10-03 21:05     ` Ross Zwisler
  1 sibling, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2016-10-03 16:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Christoph Hellwig, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, linux-xfs

On Mon, Oct 03, 2016 at 12:59:49PM +0200, Jan Kara wrote:
> I'm not quite sure if it is OK to call ->iomap_begin() without ever calling
> ->iomap_end. Specifically the comment before iomap_apply() says:
> 
> "It is assumed that the filesystems will lock whatever resources they
> require in the iomap_begin call, and release them in the iomap_end call."
> 
> so what you do could result in unbalanced allocations / locks / whatever.
> Christoph?

Indeed.  For XFS we only rely on iomap_end for error handling at the
moment, but it is intended to be paired for locking, as cluster file
systems like gfs2 requested this.


* Re: [PATCH v4 07/12] dax: coordinate locking for offsets in PMD range
  2016-10-03  9:55   ` Jan Kara
@ 2016-10-03 18:40     ` Ross Zwisler
  0 siblings, 0 replies; 46+ messages in thread
From: Ross Zwisler @ 2016-10-03 18:40 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Christoph Hellwig, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, linux-xfs

On Mon, Oct 03, 2016 at 11:55:18AM +0200, Jan Kara wrote:
> On Thu 29-09-16 16:49:25, Ross Zwisler wrote:
> > DAX radix tree locking currently locks entries based on the unique
> > combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
> > This works for PTEs, but as we move to PMDs we will need to have all the
> > offsets within the range covered by the PMD to map to the same bit lock.
> > To accomplish this, for ranges covered by a PMD entry we will instead lock
> > based on the page offset of the beginning of the PMD entry.  The 'mapping'
> > pointer is still used in the same way.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > ---
> >  fs/dax.c            | 37 ++++++++++++++++++++++++-------------
> >  include/linux/dax.h |  2 +-
> >  mm/filemap.c        |  2 +-
> >  3 files changed, 26 insertions(+), 15 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index baef586..406feea 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -64,10 +64,17 @@ static int __init init_dax_wait_table(void)
> >  }
> >  fs_initcall(init_dax_wait_table);
> >  
> > +static pgoff_t dax_entry_start(pgoff_t index, void *entry)
> > +{
> > +	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> > +		index &= (PMD_MASK >> PAGE_SHIFT);
> 
> Hum, but if we shift right, top bits of PMD_MASK will become zero - not
> something we want I guess... You rather want to mask with something like:
> 	~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1)

Great catch, thanks.  Fixed for v5.

> > @@ -447,10 +457,11 @@ restart:
> >  	return entry;
> >  }
> >  
> > -void dax_wake_mapping_entry_waiter(struct address_space *mapping,
> > +void dax_wake_mapping_entry_waiter(void *entry, struct address_space *mapping,
> >  				   pgoff_t index, bool wake_all)
> 
> Nitpick: Ordering of arguments would look more logical to me like:
> 
> dax_wake_mapping_entry_waiter(mapping, index, entry, wake_all)

Cool, I'll reorder them for v5.

Thanks for the review!


* Re: [PATCH v4 00/12] re-enable DAX PMD support
  2016-09-30  4:00     ` Darrick J. Wong
@ 2016-10-03 18:54       ` Ross Zwisler
  0 siblings, 0 replies; 46+ messages in thread
From: Ross Zwisler @ 2016-10-03 18:54 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Ross Zwisler, Dave Chinner, linux-kernel, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Christoph Hellwig,
	Dan Williams, Jan Kara, Matthew Wilcox, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, linux-xfs

On Thu, Sep 29, 2016 at 09:00:55PM -0700, Darrick J. Wong wrote:
> On Thu, Sep 29, 2016 at 09:03:43PM -0600, Ross Zwisler wrote:
> > On Fri, Sep 30, 2016 at 09:43:45AM +1000, Dave Chinner wrote:
> > > Finally: none of the patches in your tree have reviewed-by tags.
> > > That says to me that none of this code has been reviewed yet.
> > > Reviewed-by tags are non-negotiable requirement for anything going
> > > through my trees. I don't have time right now to review this code,
> > > so you're going to need to chase up other reviewers before merging.
> > > 
> > > And, really, this is getting very late in the cycle to be merging
> > > new code - we're less than one working day away from the merge
> > > window opening and we've missed the last linux-next build. I'd
> > > suggest that we might be best served by slipping the PMD
> > > support code to the next cycle when there's no time pressure for
> > > review and we can get a decent linux-next soak on the code.
> > 
> > I absolutely support your policy of only sending code to Linux that has passed
> > peer review.
> > 
> > However, I do feel compelled to point out that this is not new code.  I didn't
> > just spring it on everyone in the hours before the v4.8 merge window.  I
> > posted the first version of this patch set on August 15th, *seven weeks ago*:
> > 
> > https://lkml.org/lkml/2016/8/15/613
> > 
> > This was the day after v4.7-rc2 was released.
> > 
> > Since then I have responded promptly to the little review feedback
> > that I've received.  I've also reviewed and tested other DAX changes,
> > like the struct iomap changes from Christoph.  Those changes were
> > first posted to the mailing list on September 9th, four weeks after
> > mine.  Nevertheless, I was happy to rebase my changes on top of his,
> > which meant a full rewrite of the DAX PMD fault handler so it would be
> > based on struct iomap.  His changes are going to be merged for v4.9,
> > and mine are not.
> 
> I'm not knocking the iomap migration, but it did cause a fair amount of
> churn in the XFS reflink patchset -- and that's for a filesystem that
> already /had/ iomap implemented.  It'd be neat to have all(?) the DAX
> filesystems (ext[24], XFS) move over to iomap so that you wouldn't have
> to support multiple ways of talking to FSes.  AFAICT ext4 hasn't gotten
> iomap, which complicates things.  But that's my opinion, maybe you're
> fine with supporting iomap and not-iomap.

I agree that we want to move everything over to be iomap based.  I think
Christoph is already working on moving ext4 over, but as of this set, PMD
support explicitly depends on the iomap interface, and I'm itching to remove
the struct buffer_head + get_block_t PTE path and I/O path as well.


* Re: [PATCH v4 10/12] dax: add struct iomap based DAX PMD support
  2016-10-03 10:59   ` Jan Kara
  2016-10-03 16:37     ` Christoph Hellwig
@ 2016-10-03 21:05     ` Ross Zwisler
  2016-10-04  5:55       ` Jan Kara
  2016-10-06 21:34       ` Ross Zwisler
  1 sibling, 2 replies; 46+ messages in thread
From: Ross Zwisler @ 2016-10-03 21:05 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Christoph Hellwig, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, linux-xfs

On Mon, Oct 03, 2016 at 12:59:49PM +0200, Jan Kara wrote:
> On Thu 29-09-16 16:49:28, Ross Zwisler wrote:
> > @@ -420,15 +439,39 @@ restart:
> >  				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
> >  		if (err)
> >  			return ERR_PTR(err);
> > -		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
> > -			       RADIX_DAX_ENTRY_LOCK);
> > +
> > +		/*
> > +		 * Besides huge zero pages the only other thing that gets
> > +		 * downgraded are empty entries which don't need to be
> > +		 * unmapped.
> > +		 */
> > +		if (pmd_downgrade && ((unsigned long)entry & RADIX_DAX_HZP))
> > +			unmap_mapping_range(mapping,
> > +				(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> > +
> >  		spin_lock_irq(&mapping->tree_lock);
> > -		err = radix_tree_insert(&mapping->page_tree, index, entry);
> > +
> > +		if (pmd_downgrade) {
> > +			radix_tree_delete(&mapping->page_tree, index);
> > +			mapping->nrexceptional--;
> > +			dax_wake_mapping_entry_waiter(entry, mapping, index,
> > +					false);
> > +		}
> 
> Hum, this looks really problematic. Once you have dropped tree_lock,
> anything could have happened with the radix tree - in particular the entry
> you've got from get_unlocked_mapping_entry() can be different by now. Also
> there's no guarantee that someone does not map the huge entry again just
> after your call to unmap_mapping_range() finished.
> 
> So it seems you need to lock the entry (if you have one) before releasing
> tree_lock to stabilize it. That is enough also to block other mappings of
> that entry. Then once you reacquire the tree_lock, you can safely delete it
> and replace it with a different entry.

Yep, great catch.  I'll lock the entry before I drop tree_lock.

> > @@ -623,22 +672,30 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
> >  		error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
> >  		if (error)
> >  			return ERR_PTR(error);
> > +	} else if ((unsigned long)entry & RADIX_DAX_HZP && !hzp) {
> > +		/* replacing huge zero page with PMD block mapping */
> > +		unmap_mapping_range(mapping,
> > +			(vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> >  	}
> >  
> >  	spin_lock_irq(&mapping->tree_lock);
> > -	new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
> > -		       RADIX_DAX_ENTRY_LOCK);
> > +	if (hzp)
> > +		new_entry = RADIX_DAX_HZP_ENTRY();
> > +	else
> > +		new_entry = RADIX_DAX_ENTRY(sector, new_type);
> > +
> >  	if (hole_fill) {
> >  		__delete_from_page_cache(entry, NULL);
> >  		/* Drop pagecache reference */
> >  		put_page(entry);
> > -		error = radix_tree_insert(page_tree, index, new_entry);
> > +		error = __radix_tree_insert(page_tree, index,
> > +				RADIX_DAX_ORDER(new_type), new_entry);
> >  		if (error) {
> >  			new_entry = ERR_PTR(error);
> >  			goto unlock;
> >  		}
> >  		mapping->nrexceptional++;
> > -	} else {
> > +	} else if ((unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
> >  		void **slot;
> >  		void *ret;
> 
> Hum, I somewhat dislike how PTE and PMD paths differ here. But it's OK for
> now I guess. Long term we might be better off to do away with zero pages
> for PTEs as well and use exceptional entry and a single zero page like you
> do for PMD. Because the special cases these zero pages cause are a
> headache.

I've been thinking about this as well, and I do think we'd be better off with
a single zero page for PTEs, as we have with PMDs.  It'd reduce the special
casing in the DAX code, and it'd also ensure that we don't waste a bunch of
time and memory creating read-only zero pages to service reads from holes.

I'll look into adding this for v5.

> > +int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> > +		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops)
> > +{
> > +	struct address_space *mapping = vma->vm_file->f_mapping;
> > +	unsigned long pmd_addr = address & PMD_MASK;
> > +	bool write = flags & FAULT_FLAG_WRITE;
> > +	struct inode *inode = mapping->host;
> > +	struct iomap iomap = { 0 };
> > +	int error, result = 0;
> > +	pgoff_t size, pgoff;
> > +	struct vm_fault vmf;
> > +	void *entry;
> > +	loff_t pos;
> > +
> > +	/* Fall back to PTEs if we're going to COW */
> > +	if (write && !(vma->vm_flags & VM_SHARED)) {
> > +		split_huge_pmd(vma, pmd, address);
> > +		return VM_FAULT_FALLBACK;
> > +	}
> > +
> > +	/* If the PMD would extend outside the VMA */
> > +	if (pmd_addr < vma->vm_start)
> > +		return VM_FAULT_FALLBACK;
> > +	if ((pmd_addr + PMD_SIZE) > vma->vm_end)
> > +		return VM_FAULT_FALLBACK;
> > +
> > +	/*
> > +	 * Check whether offset isn't beyond end of file now. Caller is
> > +	 * supposed to hold locks serializing us with truncate / punch hole so
> > +	 * this is a reliable test.
> > +	 */
> > +	pgoff = linear_page_index(vma, pmd_addr);
> > +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > +
> > +	if (pgoff >= size)
> > +		return VM_FAULT_SIGBUS;
> > +
> > +	/* If the PMD would extend beyond the file size */
> > +	if ((pgoff | PG_PMD_COLOUR) >= size)
> > +		return VM_FAULT_FALLBACK;
> > +
> > +	/*
> > +	 * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
> > +	 * PMD or a HZP entry.  If it can't (because a 4k page is already in
> > +	 * the tree, for instance), it will return -EEXIST and we just fall
> > +	 * back to 4k entries.
> > +	 */
> > +	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
> > +	if (IS_ERR(entry))
> > +		return VM_FAULT_FALLBACK;
> > +
> > +	/*
> > +	 * Note that we don't use iomap_apply here.  We aren't doing I/O, only
> > +	 * setting up a mapping, so really we're using iomap_begin() as a way
> > +	 * to look up our filesystem block.
> > +	 */
> > +	pos = (loff_t)pgoff << PAGE_SHIFT;
> > +	error = ops->iomap_begin(inode, pos, PMD_SIZE, write ? IOMAP_WRITE : 0,
> > +			&iomap);
> 
> I'm not quite sure if it is OK to call ->iomap_begin() without ever calling
> ->iomap_end. Specifically the comment before iomap_apply() says:
> 
> "It is assumed that the filesystems will lock whatever resources they
> require in the iomap_begin call, and release them in the iomap_end call."
> 
> so what you do could result in unbalanced allocations / locks / whatever.
> Christoph?

I'll add the iomap_end() calls to both the PTE and PMD iomap fault handlers.
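
To make the intended shape concrete, here is a minimal user-space sketch of the balanced begin/end pattern the iomap_apply() comment requires — every exit path taken after a successful iomap_begin() must pass through iomap_end().  The struct and function names mirror the kernel's but everything below is a mock for illustration, not the real API:

```c
#include <assert.h>

struct iomap { long long offset, length; };

struct iomap_ops {
	int (*iomap_begin)(struct iomap *iomap);   /* may take fs resources */
	void (*iomap_end)(struct iomap *iomap);    /* must release them */
};

static int faults_begun, faults_ended;

static int mock_begin(struct iomap *iomap)
{
	faults_begun++;
	iomap->offset = 0;
	iomap->length = 1 << 21;	/* pretend a 2MiB extent was found */
	return 0;
}

static void mock_end(struct iomap *iomap)
{
	(void)iomap;
	faults_ended++;
}

static const struct iomap_ops mock_ops = { mock_begin, mock_end };

/* Both the success path and the fallback path unwind through iomap_end() */
static int mock_pmd_fault(const struct iomap_ops *ops, int force_fallback)
{
	struct iomap iomap = { 0 };
	int ret = 0;

	if (ops->iomap_begin(&iomap))
		return -1;		/* begin failed: nothing to unwind */

	if (force_fallback)
		ret = -1;		/* fallback still unwinds below */

	ops->iomap_end(&iomap);
	return ret;
}
```

Whichever path the fault takes, the begin and end counts stay balanced, so any resources the filesystem pinned in iomap_begin() are released.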

> > +	if (error)
> > +		goto fallback;
> > +	if (iomap.offset + iomap.length < pos + PMD_SIZE)
> > +		goto fallback;
> > +
> > +	vmf.pgoff = pgoff;
> > +	vmf.flags = flags;
> > +	vmf.gfp_mask = mapping_gfp_mask(mapping) | __GFP_FS | __GFP_IO;
> 
> I don't think you want __GFP_FS here - we have already gone through the
> filesystem's pmd_fault() handler which called dax_iomap_pmd_fault() and
> thus we hold various fs locks, freeze protection, ...

I copied this from __get_fault_gfp_mask() in mm/memory.c.  That function is
used by do_page_mkwrite() and __do_fault(), and we eventually get this
vmf->gfp_mask in the PTE fault code.  With the code as it is we get the same
vmf->gfp_mask in both dax_iomap_fault() and dax_iomap_pmd_fault().  It seems
like they should remain consistent - is it wrong to have __GFP_FS in
dax_iomap_fault()?

> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index c4a51bb..cacff9e 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -8,8 +8,33 @@
> >  
> >  struct iomap_ops;
> >  
> > -/* We use lowest available exceptional entry bit for locking */
> > +/*
> > + * We use lowest available bit in exceptional entry for locking, two bits for
> > + * the entry type (PMD & PTE), and two more for flags (HZP and empty).  In
> > + * total five special bits.
> > + */
> > +#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 5)
> >  #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
> > +/* PTE and PMD types */
> > +#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
> > +#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
> > +/* huge zero page and empty entry flags */
> > +#define RADIX_DAX_HZP (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 3))
> > +#define RADIX_DAX_EMPTY (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 4))
> 
> I think we can do with just 2 bits for type instead of 4 but for now this
> is OK I guess.

I guess we could combine the PMD/PTE choice into the same bit (0=PTE, 1=PMD),
but we have three cases for the other types (zero page, empty entry just for
locking, real DAX based entry with storage), so we need at least 2 bits for
those.

Christoph also suggested some reworks to the "type" logic - I'll look at
simplifying the way the flags are used for DAX entries.

Thank you for the review!

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 00/12] re-enable DAX PMD support
  2016-09-30  6:48     ` Dave Chinner
@ 2016-10-03 21:11       ` Ross Zwisler
  0 siblings, 0 replies; 46+ messages in thread
From: Ross Zwisler @ 2016-10-03 21:11 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Christoph Hellwig, Dan Williams,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri, Sep 30, 2016 at 04:48:58PM +1000, Dave Chinner wrote:
> On Thu, Sep 29, 2016 at 09:03:43PM -0600, Ross Zwisler wrote:
> > On Fri, Sep 30, 2016 at 09:43:45AM +1000, Dave Chinner wrote:
> > > Finally: none of the patches in your tree have reviewed-by tags.
> > > That says to me that none of this code has been reviewed yet.
> > > Reviewed-by tags are non-negotiable requirement for anything going
> > > through my trees. I don't have time right now to review this code,
> > > so you're going to need to chase up other reviewers before merging.
> > > 
> > > And, really, this is getting very late in the cycle to be merging
> > > new code - we're less than one working day away from the merge
> > > window opening and we've missed the last linux-next build. I'd
> > > suggest that we might be best served by slipping the PMD support
> > > code to the next cycle when there's no time pressure for
> > > review and we can get a decent linux-next soak on the code.
> > 
> > I absolutely support your policy of only sending code to Linux that has passed
> > peer review.
> > 
> > However, I do feel compelled to point out that this is not new code.  I didn't
> > just spring it on everyone in the hours before the v4.8 merge window.  I
> > posted the first version of this patch set on August 15th, *seven weeks ago*:
> > 
> > https://lkml.org/lkml/2016/8/15/613
> >
> > This was the day after v4.7-rc2 was released.
> 
> I understand, Ross. We've all been there - welcome to the club. :/
> 
> The problem is there are lots of people writing code and very few
> people who spent time reviewing it. And for many of those reviewers,
> they also have to spend time on other code bases, whether it be the
> distro's they maintain, the userspace that the kernel code needs to
> function, or even the test code needed to make this all work
> properly.
> 
> Stuff that spans multiple subsystems such as DAX (i.e. vm,
> filesystems and block devices) are particularly troublesome in this
> respect because there are very few people with expertise in all
> three aspects and hence able to review a change that spans all three
> subsystems. Those people also tend to be the busiest and have to
> prioritise what they do.
> 
> Consider the XFS side of review in recent times: in the past 4
> months, there's been ~30,000 lines of code change between kernel and
> userspace.  And there's already another 15,000+ lines of code in the
> backlog for the next 2-3 months.  That review load is falling on 3-4
> people, who all have other work to do as well. This is for work that
> was started well over a year ago, and that picked up from work that
> was originally done 3 years ago. We're swamped on pure XFS review
> right now, let alone all the other infrastructure stuff we need to
> get reviewed at the same time...
> 
> Having someone say "let's get sorted after the merge window" is far
> better than having your patches ignored - it tells you someone wants
> your code and is actually planning to review it in the near term!
> Have patience. Keep the patches up to date, keep building what you
> need to build on top of them. Missing a merge window is not the end
> of the world.
> 
> > Since then I have responded promptly to the little review feedback that I've
> > received.  I've also reviewed and tested other DAX changes, like the struct
> > iomap changes from Christoph.
> 
> And I'm grateful for your doing that - it sped the process up a lot
> because they then weren't blocked waiting for me to get to them. As
> a result, I owe you some review time in return but unfortunately I
> can't fully return the favour immediately. If more people treated
> review as a selfless act that should be returned in kind, then we
> wouldn't have a review bottleneck like we do...
> 
> > Those changes were first posted to the mailing
> > list on September 9th, four weeks after mine.  Nevertheless, I was happy to
> > rebase my changes on top of his, which meant a full rewrite of the DAX PMD
> > fault handler so it would be based on struct iomap.  His changes are going to
> > be merged for v4.9, and mine are not.
> 
> Yes, this can happen, too - core infrastructure changes can appear
> suddenly and be implemented very quickly, but that does not mean
> there hasn't been a lot of background work and effort put into the
> code. The iomap code goes way back. Back to early 2010, in fact:
> 
> http://lkml.iu.edu/hypermail/linux/kernel/1005.1/02720.html
> 
> At that time I implemented a working multipage write IO path for
> XFS, and Christoph integrated that into various OEM products shortly
> afterwards. Yes, there have been iomap based XFS implementations out
> there in production for over 5 years now, but that code was not
> clean enough to even consider merging.
> 
> Another reference in 2013, when someone proposed a hack for embedded
> systems to optimise the write path:
> 
> https://lkml.org/lkml/2013/7/23/809
> 
> Then Christoph introduced the struct iomap for pNFS and the XFS
> block layout driver in late 2013, and when DAX first came along I
> really wanted iomaps to be used up front rather than buffer heads
> for block mapping.
> 
> Now we've finally got iomaps in the IO path and that's rapidly
> cascading through all the XFS IO interfaces and into other
> filesystems. This is exactly what we first talked about 6 years ago.
> 
> So while it might look like the iomap infrastructure has come out of
> nowhere, it's really been a long, long road to get to this point. We
> work to a different time scale over here - it's not uncommon to be
> planning 5 years ahead for new XFS features. We know how long it
> takes to develop, test, review and stabilise significant new
> features, so while it might look like something appears and is
> committed quickly, that's because you haven't seen the work that has
> been done leading up to patches being presented for review and
> merge.
> 
> Hopefully this will give you some more perspective on why I think
> slipping a single release isn't something to get worried about. :)
> 
> > Please, help me understand what I can do to get my code reviewed.  Do I need
> > to more aggressively ping my patch series, asking people by name for reviews?
> 
> On the XFS and fstests lists, if nobody has responded within a few
> days (usually a week) then it's OK to ping it and see if anyone has
> time to review it. In general, a single ping is enough if the
> patchset is in good shape. Fix it all, repost, ping in a week if
> there's no followup. Repeat until merge.
> 
> > Do we need to rework our code flow to Linus so that the DAX changes go through
> > a filesystem tree like XFS or ext4, and ask the developers of that filesystem
> > to help with reviews?  Something else?
> 
> The question we have to ask here is whether a lack of development
> support from a filesystem should stop us from driving the DAX
> implementation forward?  I've said it before, and I'll say it again:
> I'm happy to drive DAX on XFS forwards at the rate at which we can
> sustain review via the XFS tree and I don't care if we break support
> on ext2 or ext4. If we keep having to wait for ext4 to fix stuff to
> catch up with what we want/need to do then progress will continue to
> be sporadic and frustrating.  Perhaps it's time to stop waiting for
> ext4 to play catchup every time we take a step forwards....
> 
> > I'm honestly very frustrated by this because I've done my best to be open to
> > constructive criticism and I've tried to respond promptly to the feedback that
> > I've received.  In the end, though, a system where it's a requirement that all
> > upstreamed code be peer reviewed but in which I can't get any feedback is
> > essentially a system where I'm not allowed to contribute.
> 
> There's always been a review bottleneck, and everyone ends up on the
> end of that frustration from time to time. Delays will happen - it's
> just part of the process we all need to deal with. I used to get
> frustrated, too, but now I just accept it, roll with it and we've
> made it an acceptable part of the process to ping patches when it
> looks like they have been forgotten...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Thank you for the advice and perspective, Dave.  :)

I've gotten a bunch of review feedback from Christoph and Jan (thank you!), so
I'm busily addressing that.  Since we're now targeting v4.10 I'll also
incorporate tracepoints so we can observe what's going on in our various fault
paths.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 10/12] dax: add struct iomap based DAX PMD support
  2016-09-30  9:56   ` Christoph Hellwig
@ 2016-10-03 21:16     ` Ross Zwisler
  0 siblings, 0 replies; 46+ messages in thread
From: Ross Zwisler @ 2016-10-03 21:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Christoph Hellwig, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, linux-xfs

On Fri, Sep 30, 2016 at 02:56:27AM -0700, Christoph Hellwig wrote:
> > -/*
> > - * We use lowest available bit in exceptional entry for locking, other two
> > - * bits to determine entry type. In total 3 special bits.
> > - */
> > -#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
> > -#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
> > -#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
> > -#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
> > -#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
> > -#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
> > -#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
> > -		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
> > -		RADIX_TREE_EXCEPTIONAL_ENTRY))
> > -
> 
> Please split the move of these constants into a separate patch.

Will do for v5.

> > -static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
> > +static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
> > +		unsigned long new_type)
> >  {
> > +	bool pmd_downgrade = false; /* splitting 2MiB entry into 4k entries? */
> >  	void *entry, **slot;
> >  
> >  restart:
> >  	spin_lock_irq(&mapping->tree_lock);
> >  	entry = get_unlocked_mapping_entry(mapping, index, &slot);
> > +
> > +	if (entry && new_type == RADIX_DAX_PMD) {
> > +		if (!radix_tree_exceptional_entry(entry) ||
> > +				RADIX_DAX_TYPE(entry) == RADIX_DAX_PTE) {
> > +			spin_unlock_irq(&mapping->tree_lock);
> > +			return ERR_PTR(-EEXIST);
> > +		}
> > +	} else if (entry && new_type == RADIX_DAX_PTE) {
> > +		if (radix_tree_exceptional_entry(entry) &&
> > +		    RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD &&
> > +		    (unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
> > +			pmd_downgrade = true;
> > +		}
> > +	}
> 
> 	Would be nice to use switch on the type here:
> 
> 	old_type = RADIX_DAX_TYPE(entry);
> 
> 	if (entry) {
> 		switch (new_type) {
> 		case RADIX_DAX_PMD:
> 			if (!radix_tree_exceptional_entry(entry) ||
> 			    old_type == RADIX_DAX_PTE) {
> 			    	entry = ERR_PTR(-EEXIST);
> 				goto out_unlock;
> 			}
> 			break;
> 		case RADIX_DAX_PTE:
> 			if (radix_tree_exceptional_entry(entry) &&
> 			    old_type == RADIX_DAX_PMD &&
> 			    (unsigned long)entry & 
> 			      (RADIX_DAX_HZP|RADIX_DAX_EMPTY))
> 			      	..

Will do.  This definitely makes things less dense and more readable.
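
For reference, a user-space sketch of what that switch-based check might look like once factored out.  The MOCK_DAX_* values and the helper name are made-up stand-ins, not the kernel's actual encoding:

```c
#include <assert.h>
#include <errno.h>

#define MOCK_DAX_PTE	0x1ul
#define MOCK_DAX_PMD	0x2ul
#define MOCK_DAX_HZP	0x4ul
#define MOCK_DAX_EMPTY	0x8ul

/*
 * Returns 0 if the existing entry is usable for the requested fault size,
 * -EEXIST to make the caller fall back to 4k entries, and sets
 * *pmd_downgrade when a huge zero page / empty PMD entry must be split.
 */
static int mock_check_entry(unsigned long entry, unsigned long new_type,
			    int *pmd_downgrade)
{
	*pmd_downgrade = 0;
	if (!entry)
		return 0;	/* no existing entry: nothing to check */

	switch (new_type) {
	case MOCK_DAX_PMD:
		/* a 4k entry already in the tree blocks a PMD insert */
		if (entry & MOCK_DAX_PTE)
			return -EEXIST;
		break;
	case MOCK_DAX_PTE:
		/* only zero-page / empty PMD entries may be downgraded */
		if ((entry & MOCK_DAX_PMD) &&
		    (entry & (MOCK_DAX_HZP | MOCK_DAX_EMPTY)))
			*pmd_downgrade = 1;
		break;
	}
	return 0;
}
```

The switch makes the two directions (PMD fault hitting a PTE entry, PTE fault hitting a huge entry) read as parallel cases rather than nested conditionals.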

> Btw, why are only RADIX_DAX_PTE and RADIX_DAX_PMD in the type mask,
> and not RADIX_DAX_HZP and RADIX_DAX_EMPTY?  With that we could use the
> above old_type local variable over this function and make it a lot less
> of a mess.

So essentially in this revision of the series 'type' just means PMD or PTE
(perhaps this should have been 'size' or something), and the Zero Page / Empty
Entry / Regular DAX choice was something separate.  Jan also commented on
this, and I agree there's room for improvement.  Let me try and rework it a
bit for v5.

> >  static void *dax_insert_mapping_entry(struct address_space *mapping,
> >  				      struct vm_fault *vmf,
> > -				      void *entry, sector_t sector)
> > +				      void *entry, sector_t sector,
> > +				      unsigned long new_type, bool hzp)
> 
> And then we could also drop the hzp argument here..

Yep, that would be good.

> >  #ifdef CONFIG_FS_IOMAP
> > +static inline sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
> > +{
> > +	return iomap->blkno + (((pos & PAGE_MASK) - iomap->offset) >> 9);
> > +}
> 
> Please split adding this new helper into a separate patch.

Done for v5.

> > +#if defined(CONFIG_FS_DAX_PMD)
> 
> Please use #ifdef here.

Will do.

> > +#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
> > +#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
> > +
> > +/* entries begin locked */
> > +#define RADIX_DAX_ENTRY(sector, type) ((void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |\
> > +	type | (unsigned long)sector << RADIX_DAX_SHIFT | RADIX_DAX_ENTRY_LOCK))
> > +#define RADIX_DAX_HZP_ENTRY() ((void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | \
> > +	RADIX_DAX_PMD | RADIX_DAX_HZP | RADIX_DAX_EMPTY | RADIX_DAX_ENTRY_LOCK))
> > +#define RADIX_DAX_EMPTY_ENTRY(type) ((void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | \
> > +		type | RADIX_DAX_EMPTY | RADIX_DAX_ENTRY_LOCK))
> > +
> > +#define RADIX_DAX_ORDER(type) (type == RADIX_DAX_PMD ? PMD_SHIFT-PAGE_SHIFT : 0)
> 
> All these macros don't properly brace their arguments.  I think
> you'd make your life a lot easier by making them inline functions.

Oh, great point.  Will do, although depending on how the 'type' thing works
out we may end up with just a dax_radix_entry(sector, type) inline function,
where 'type' can be (RADIX_DAX_PMD | RADIX_DAX_EMPTY), etc.  That would
prevent the need for a bunch of different ways of making a new entry.

> > +#if defined(CONFIG_FS_DAX_PMD)
> 
> #ifdef, please

Will do.

Thank you for the review, Christoph.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 00/12] re-enable DAX PMD support
  2016-09-29 23:43 ` [PATCH v4 00/12] re-enable DAX PMD support Dave Chinner
  2016-09-30  3:03   ` Ross Zwisler
@ 2016-10-03 23:05   ` Ross Zwisler
  1 sibling, 0 replies; 46+ messages in thread
From: Ross Zwisler @ 2016-10-03 23:05 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Christoph Hellwig, Dan Williams,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Fri, Sep 30, 2016 at 09:43:45AM +1000, Dave Chinner wrote:
> On Thu, Sep 29, 2016 at 04:49:18PM -0600, Ross Zwisler wrote:
> > DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
> > locking.  This series allows DAX PMDs to participate in the DAX radix tree
> > based locking scheme so that they can be re-enabled.
> > 
> > Ted, can you please take the ext2 + ext4 patches through your tree?  Dave,
> > can you please take the rest through the XFS tree?
> 
> That will make merging this difficult, because later patches in the
> series are dependent on the ext2/ext4 patches early in the series.
> I'd much prefer they all go through the one tree to avoid issues
> like this.

I think that at least some of the ext2 patches will be merged in v4.9
through Ted's tree because they are just bugfixes.  For whatever is left I'm
happy to have it merged to v4.10 through the XFS tree, thank you.

> > 
> > Changes since v3:
> >  - Corrected dax iomap code namespace for functions defined in fs/dax.c.
> >    (Dave Chinner)
> >  - Added leading "dax" namespace to new static functions in fs/dax.c.
> >    (Dave Chinner)
> >  - Made all DAX PMD code in fs/dax.c conditionally compiled based on
> >    CONFIG_FS_DAX_PMD.  Otherwise a stub in include/linux/dax.h that just
> >    returns VM_FAULT_FALLBACK will be used.  (Dave Chinner)
> >  - Removed dynamic debugging messages from DAX PMD fault path.  Debugging
> >    tracepoints will be added later to both the PTE and PMD paths via a
> >    later patch set. (Dave Chinner)
> >  - Added a comment to ext2_dax_vm_ops explaining why we don't support DAX
> >    PMD faults in ext2. (Dave Chinner)
> > 
> > This was built upon xfs/for-next with PMD performance fixes from Toshi Kani
> > and Dan Williams.  Dan's patch has already been merged for v4.8, and
> > Toshi's patches are currently queued in Andrew Morton's mm tree for v4.9
> > inclusion.
> 
> the xfs/for-next branch is not a stable branch - it can rebase at
> any time just like linux-next can. The topic branches that I merge
> into the for-next branch, OTOH, are usually stable. i.e. The topic
> branch you should be basing this on is "iomap-4.9-dax".

Thanks, I didn't realize that.  I'll rebase onto iomap-4.9-dax, then on a
v4.9-rc when things become more stable.

> And then you'll also see that there are already ext2 patches in this
> topic branch to convert it to iomap for DAX. So I'm quite happy to
> take the ext2/4 patches in this series in the same way.
> 
> The patches from Dan and Toshi: is you patch series dependent on
> them? Do I need to take them as well?

Nope, Dan and Toshi's patches are just performance fixes, and should be merged
through other trees for v4.9.

> > Previously reported performance numbers:
> > 
> >   [global]
> >   bs=4k
> >   size=2G
> >   directory=/mnt/pmem0
> >   ioengine=mmap
> >   [randrw]
> >   rw=randrw
> > 
> > Here are the performance results with XFS using only pte faults:
> >    READ: io=1022.7MB, aggrb=557610KB/s, minb=557610KB/s, maxb=557610KB/s, mint=1878msec, maxt=1878msec
> >   WRITE: io=1025.4MB, aggrb=559084KB/s, minb=559084KB/s, maxb=559084KB/s, mint=1878msec, maxt=1878msec
> > 
> > Here are performance numbers for that same test using PMD faults:
> >    READ: io=1022.7MB, aggrb=1406.7MB/s, minb=1406.7MB/s, maxb=1406.7MB/s, mint=727msec, maxt=727msec
> >   WRITE: io=1025.4MB, aggrb=1410.4MB/s, minb=1410.4MB/s, maxb=1410.4MB/s, mint=727msec, maxt=727msec
> 
> The numbers look good - how much of that is from lower filesystem
> allocation overhead and how much of it is from fewer page faults?
> You can probably determine this by creating the test file with
> write() to ensure it is fully allocated and so all the filesystem
> is doing on both the read and write paths is mapping allocated
> regions....

Perhaps I'm doing something wrong, but I think that the first time I run the
test I get all allocating writes because that's when the file is first
created.  The file stays around between runs, though, so every run thereafter
should already have a fully allocated file, so the only overhead difference
should be page faults?  Does this sound correct?

If that's true, the performance gain seems to be almost entirely due to fewer
page faults.  For both the 4k case and the 2M case there isn't a noticeable
performance difference between the unallocated vs fully allocated cases.
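
For completeness, one way to force the fully-allocated case explicitly — rather than relying on the file surviving between runs — is to have fio lay the whole file out with buffered writes before the timed mmap section.  A hypothetical variant of the job file quoted above (the filename is made up; `stonewall` serializes the two jobs):

```ini
[global]
bs=4k
size=2G
directory=/mnt/pmem0
filename=dax-test.file

[prealloc]
ioengine=sync
rw=write

[randrw]
stonewall
ioengine=mmap
rw=randrw
```

fio then reports the randrw job separately, and that job only ever faults on already-allocated blocks.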

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 10/12] dax: add struct iomap based DAX PMD support
  2016-10-03 21:05     ` Ross Zwisler
@ 2016-10-04  5:55       ` Jan Kara
  2016-10-04 15:39         ` Ross Zwisler
  2016-10-06 21:34       ` Ross Zwisler
  1 sibling, 1 reply; 46+ messages in thread
From: Jan Kara @ 2016-10-04  5:55 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Christoph Hellwig, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, linux-xfs

On Mon 03-10-16 15:05:57, Ross Zwisler wrote:
> > > @@ -623,22 +672,30 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
> > >  		error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
> > >  		if (error)
> > >  			return ERR_PTR(error);
> > > +	} else if ((unsigned long)entry & RADIX_DAX_HZP && !hzp) {
> > > +		/* replacing huge zero page with PMD block mapping */
> > > +		unmap_mapping_range(mapping,
> > > +			(vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> > >  	}
> > >  
> > >  	spin_lock_irq(&mapping->tree_lock);
> > > -	new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
> > > -		       RADIX_DAX_ENTRY_LOCK);
> > > +	if (hzp)
> > > +		new_entry = RADIX_DAX_HZP_ENTRY();
> > > +	else
> > > +		new_entry = RADIX_DAX_ENTRY(sector, new_type);
> > > +
> > >  	if (hole_fill) {
> > >  		__delete_from_page_cache(entry, NULL);
> > >  		/* Drop pagecache reference */
> > >  		put_page(entry);
> > > -		error = radix_tree_insert(page_tree, index, new_entry);
> > > +		error = __radix_tree_insert(page_tree, index,
> > > +				RADIX_DAX_ORDER(new_type), new_entry);
> > >  		if (error) {
> > >  			new_entry = ERR_PTR(error);
> > >  			goto unlock;
> > >  		}
> > >  		mapping->nrexceptional++;
> > > -	} else {
> > > +	} else if ((unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
> > >  		void **slot;
> > >  		void *ret;
> > 
> > Hum, I somewhat dislike how PTE and PMD paths differ here. But it's OK for
> > now I guess. Long term we might be better off to do away with zero pages
> > for PTEs as well and use exceptional entry and a single zero page like you
> > do for PMD. Because the special cases these zero pages cause are a
> > headache.
> 
> I've been thinking about this as well, and I do think we'd be better off with
> a single zero page for PTEs, as we have with PMDs.  It'd reduce the special
> casing in the DAX code, and it'd also ensure that we don't waste a bunch of
> time and memory creating read-only zero pages to service reads from holes.
> 
> I'll look into adding this for v5.

Well, this would clash with the dirty bit cleaning series I have. So I'd
prefer to put this on a todo list and address it once existing series are
integrated...

> > > +	if (error)
> > > +		goto fallback;
> > > +	if (iomap.offset + iomap.length < pos + PMD_SIZE)
> > > +		goto fallback;
> > > +
> > > +	vmf.pgoff = pgoff;
> > > +	vmf.flags = flags;
> > > +	vmf.gfp_mask = mapping_gfp_mask(mapping) | __GFP_FS | __GFP_IO;
> > 
> > I don't think you want __GFP_FS here - we have already gone through the
> > filesystem's pmd_fault() handler which called dax_iomap_pmd_fault() and
> > thus we hold various fs locks, freeze protection, ...
> 
> I copied this from __get_fault_gfp_mask() in mm/memory.c.  That function is
> used by do_page_mkwrite() and __do_fault(), and we eventually get this
> vmf->gfp_mask in the PTE fault code.  With the code as it is we get the same
> vmf->gfp_mask in both dax_iomap_fault() and dax_iomap_pmd_fault().  It seems
> like they should remain consistent - is it wrong to have __GFP_FS in
> dax_iomap_fault()?

The gfp_mask that propagates from __do_fault() or do_page_mkwrite() is fine
because at that point it is correct. But once we grab filesystem locks
which are not reclaim safe, we should update vmf->gfp_mask we pass further
down into DAX code to not contain __GFP_FS (that's a bug we apparently have
there). And inside DAX code, we definitely are not generally safe to add
__GFP_FS to mapping_gfp_mask(). Maybe we'd be better off propagating struct
vm_fault into this function, using passed gfp_mask there and make sure
callers update gfp_mask as appropriate.
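
A user-space sketch of the masking described above: once the filesystem's fault handler holds locks that are not reclaim-safe, __GFP_FS must be cleared from the mask handed down to DAX.  The GFP values and helper names below are made-up stand-ins for illustration:

```c
#include <assert.h>

#define MOCK_GFP_IO		0x1u
#define MOCK_GFP_FS		0x2u
#define MOCK_GFP_HIGHMEM	0x4u

struct mock_vm_fault {
	unsigned int gfp_mask;
};

/* What the generic fault path sets up (cf. __get_fault_gfp_mask()) */
static inline unsigned int mock_fault_gfp_mask(unsigned int mapping_mask)
{
	return mapping_mask | MOCK_GFP_FS | MOCK_GFP_IO;
}

/*
 * What an fs ->pmd_fault() handler would do before calling into DAX:
 * an allocation under fs locks must not recurse into the filesystem.
 */
static inline void mock_enter_fs_locked_section(struct mock_vm_fault *vmf)
{
	vmf->gfp_mask &= ~MOCK_GFP_FS;
}
```

The mask arriving from the generic fault code is correct at that point; it only becomes wrong to keep __GFP_FS after the filesystem has taken its own locks, which is why the update belongs in the caller rather than in DAX itself.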

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 10/12] dax: add struct iomap based DAX PMD support
  2016-10-04  5:55       ` Jan Kara
@ 2016-10-04 15:39         ` Ross Zwisler
  2016-10-05  5:50           ` Jan Kara
  0 siblings, 1 reply; 46+ messages in thread
From: Ross Zwisler @ 2016-10-04 15:39 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Christoph Hellwig, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, linux-xfs

On Tue, Oct 04, 2016 at 07:55:57AM +0200, Jan Kara wrote:
> On Mon 03-10-16 15:05:57, Ross Zwisler wrote:
> > > > @@ -623,22 +672,30 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
> > > >  		error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
> > > >  		if (error)
> > > >  			return ERR_PTR(error);
> > > > +	} else if ((unsigned long)entry & RADIX_DAX_HZP && !hzp) {
> > > > +		/* replacing huge zero page with PMD block mapping */
> > > > +		unmap_mapping_range(mapping,
> > > > +			(vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> > > >  	}
> > > >  
> > > >  	spin_lock_irq(&mapping->tree_lock);
> > > > -	new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
> > > > -		       RADIX_DAX_ENTRY_LOCK);
> > > > +	if (hzp)
> > > > +		new_entry = RADIX_DAX_HZP_ENTRY();
> > > > +	else
> > > > +		new_entry = RADIX_DAX_ENTRY(sector, new_type);
> > > > +
> > > >  	if (hole_fill) {
> > > >  		__delete_from_page_cache(entry, NULL);
> > > >  		/* Drop pagecache reference */
> > > >  		put_page(entry);
> > > > -		error = radix_tree_insert(page_tree, index, new_entry);
> > > > +		error = __radix_tree_insert(page_tree, index,
> > > > +				RADIX_DAX_ORDER(new_type), new_entry);
> > > >  		if (error) {
> > > >  			new_entry = ERR_PTR(error);
> > > >  			goto unlock;
> > > >  		}
> > > >  		mapping->nrexceptional++;
> > > > -	} else {
> > > > +	} else if ((unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
> > > >  		void **slot;
> > > >  		void *ret;
> > > 
> > > Hum, I somewhat dislike how PTE and PMD paths differ here. But it's OK for
> > > now I guess. Long term we might be better off to do away with zero pages
> > > for PTEs as well and use exceptional entry and a single zero page like you
> > > do for PMD. Because the special cases these zero pages cause are a
> > > headache.
> > 
> > I've been thinking about this as well, and I do think we'd be better off with
> > a single zero page for PTEs, as we have with PMDs.  It'd reduce the special
> > casing in the DAX code, and it'd also ensure that we don't waste a bunch of
> > time and memory creating read-only zero pages to service reads from holes.
> > 
> > I'll look into adding this for v5.
> 
> Well, this would clash with the dirty bit cleaning series I have. So I'd
> prefer to put this on a todo list and address it once existing series are
> integrated...

Sure, that works.

> > > > +	if (error)
> > > > +		goto fallback;
> > > > +	if (iomap.offset + iomap.length < pos + PMD_SIZE)
> > > > +		goto fallback;
> > > > +
> > > > +	vmf.pgoff = pgoff;
> > > > +	vmf.flags = flags;
> > > > +	vmf.gfp_mask = mapping_gfp_mask(mapping) | __GFP_FS | __GFP_IO;
> > > 
> > > I don't think you want __GFP_FS here - we have already gone through the
> > > filesystem's pmd_fault() handler which called dax_iomap_pmd_fault() and
> > > thus we hold various fs locks, freeze protection, ...
> > 
> > I copied this from __get_fault_gfp_mask() in mm/memory.c.  That function is
> > used by do_page_mkwrite() and __do_fault(), and we eventually get this
> > vmf->gfp_mask in the PTE fault code.  With the code as it is we get the same
> > vmf->gfp_mask in both dax_iomap_fault() and dax_iomap_pmd_fault().  It seems
> > like they should remain consistent - is it wrong to have __GFP_FS in
> > dax_iomap_fault()?
> 
> The gfp_mask that propagates from __do_fault() or do_page_mkwrite() is fine
> because at that point it is correct. But once we grab filesystem locks
> which are not reclaim safe, we should update vmf->gfp_mask we pass further
> down into DAX code to not contain __GFP_FS (that's a bug we apparently have
> there). And inside DAX code, we definitely are not generally safe to add
> __GFP_FS to mapping_gfp_mask(). Maybe we'd be better off propagating struct
> vm_fault into this function, using passed gfp_mask there and make sure
> callers update gfp_mask as appropriate.

Yep, that makes sense to me.  In reviewing your set it also occurred to me that
we might want to stick a struct vm_area_struct *vma pointer in the vmf, since
you always need a vma when you are using a vmf, but we pass them as a pair
everywhere.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v4 10/12] dax: add struct iomap based DAX PMD support
  2016-10-04 15:39         ` Ross Zwisler
@ 2016-10-05  5:50           ` Jan Kara
  0 siblings, 0 replies; 46+ messages in thread
From: Jan Kara @ 2016-10-05  5:50 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Christoph Hellwig, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, linux-xfs

On Tue 04-10-16 09:39:48, Ross Zwisler wrote:
> > The gfp_mask that propagates from __do_fault() or do_page_mkwrite() is fine
> > because at that point it is correct. But once we grab filesystem locks
> > which are not reclaim safe, we should update vmf->gfp_mask we pass further
> > down into DAX code to not contain __GFP_FS (that's a bug we apparently have
> > there). And inside DAX code, we definitely are not generally safe to add
> > __GFP_FS to mapping_gfp_mask(). Maybe we'd be better off propagating struct
> > vm_fault into this function, using passed gfp_mask there and make sure
> > callers update gfp_mask as appropriate.
> 
> Yep, that makes sense to me.  In reviewing your set it also occurred to me that
> we might want to stick a struct vm_area_struct *vma pointer in the vmf, since
> you always need a vma when you are using a vmf, but we pass them as a pair
> everywhere.

Actually, vma pointer will be in struct vm_fault after my DAX
write-protection series. So once that lands, we can clean up whatever
duplicate function parameters...

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v4 10/12] dax: add struct iomap based DAX PMD support
  2016-10-03 21:05     ` Ross Zwisler
  2016-10-04  5:55       ` Jan Kara
@ 2016-10-06 21:34       ` Ross Zwisler
  2016-10-07  2:58         ` Ross Zwisler
  1 sibling, 1 reply; 46+ messages in thread
From: Ross Zwisler @ 2016-10-06 21:34 UTC (permalink / raw)
  To: Ross Zwisler, Jan Kara, linux-kernel, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Christoph Hellwig,
	Dan Williams, Dave Chinner, Jan Kara, Matthew Wilcox, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, linux-xfs

On Mon, Oct 03, 2016 at 03:05:57PM -0600, Ross Zwisler wrote:
> On Mon, Oct 03, 2016 at 12:59:49PM +0200, Jan Kara wrote:
> > On Thu 29-09-16 16:49:28, Ross Zwisler wrote:
<>
> > > +int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> > > +		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops)
> > > +{
> > > +	struct address_space *mapping = vma->vm_file->f_mapping;
> > > +	unsigned long pmd_addr = address & PMD_MASK;
> > > +	bool write = flags & FAULT_FLAG_WRITE;
> > > +	struct inode *inode = mapping->host;
> > > +	struct iomap iomap = { 0 };
> > > +	int error, result = 0;
> > > +	pgoff_t size, pgoff;
> > > +	struct vm_fault vmf;
> > > +	void *entry;
> > > +	loff_t pos;
> > > +
> > > +	/* Fall back to PTEs if we're going to COW */
> > > +	if (write && !(vma->vm_flags & VM_SHARED)) {
> > > +		split_huge_pmd(vma, pmd, address);
> > > +		return VM_FAULT_FALLBACK;
> > > +	}
> > > +
> > > +	/* If the PMD would extend outside the VMA */
> > > +	if (pmd_addr < vma->vm_start)
> > > +		return VM_FAULT_FALLBACK;
> > > +	if ((pmd_addr + PMD_SIZE) > vma->vm_end)
> > > +		return VM_FAULT_FALLBACK;
> > > +
> > > +	/*
> > > +	 * Check whether offset isn't beyond end of file now. Caller is
> > > +	 * supposed to hold locks serializing us with truncate / punch hole so
> > > +	 * this is a reliable test.
> > > +	 */
> > > +	pgoff = linear_page_index(vma, pmd_addr);
> > > +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > > +
> > > +	if (pgoff >= size)
> > > +		return VM_FAULT_SIGBUS;
> > > +
> > > +	/* If the PMD would extend beyond the file size */
> > > +	if ((pgoff | PG_PMD_COLOUR) >= size)
> > > +		return VM_FAULT_FALLBACK;
> > > +
> > > +	/*
> > > +	 * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
> > > +	 * PMD or a HZP entry.  If it can't (because a 4k page is already in
> > > +	 * the tree, for instance), it will return -EEXIST and we just fall
> > > +	 * back to 4k entries.
> > > +	 */
> > > +	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
> > > +	if (IS_ERR(entry))
> > > +		return VM_FAULT_FALLBACK;
> > > +
> > > +	/*
> > > +	 * Note that we don't use iomap_apply here.  We aren't doing I/O, only
> > > +	 * setting up a mapping, so really we're using iomap_begin() as a way
> > > +	 * to look up our filesystem block.
> > > +	 */
> > > +	pos = (loff_t)pgoff << PAGE_SHIFT;
> > > +	error = ops->iomap_begin(inode, pos, PMD_SIZE, write ? IOMAP_WRITE : 0,
> > > +			&iomap);
> > 
> > I'm not quite sure if it is OK to call ->iomap_begin() without ever calling
> > ->iomap_end. Specifically the comment before iomap_apply() says:
> > 
> > "It is assumed that the filesystems will lock whatever resources they
> > require in the iomap_begin call, and release them in the iomap_end call."
> > 
> > so what you do could result in unbalanced allocations / locks / whatever.
> > Christoph?
> 
> I'll add the iomap_end() calls to both the PTE and PMD iomap fault handlers.

Interesting - adding iomap_end() calls to the DAX PTE fault handler causes an
AA deadlock because we try and retake ei->dax_sem.  We take dax_sem in
ext2_dax_fault() before calling into the DAX code, then if we end up going
through the error path in ext2_iomap_end(), we call 
  ext2_write_failed()
    ext2_truncate_blocks()
      dax_sem_down_write()

Where we try and take dax_sem again.  This error path is really only valid for
I/O operations, but we happen to call it for page faults because 'written' in
ext2_iomap_end() is just 0.

So...how should we handle this?  A few ideas:

1) Just continue to omit the calls to iomap_end() in the DAX page fault
handlers for now, and add them when there is useful work to be done in one of
the filesystems.

2) Add an IOMAP_FAULT flag to the flags passed into iomap_begin() and
iomap_end() to make it explicit that we are calling as part of a fault handler
and not an I/O operation, and use this to adjust the error handling in
ext2_iomap_end().

3) Just work around the existing error handling in ext2_iomap_end() by either
unsetting IOMAP_WRITE or by setting 'written' to the size of the fault.

For #2 or #3, probably add a comment explaining the deadlock and why we need
to never call ext2_write_failed() while handling a page fault.

Thoughts?


* Re: [PATCH v4 10/12] dax: add struct iomap based DAX PMD support
  2016-10-06 21:34       ` Ross Zwisler
@ 2016-10-07  2:58         ` Ross Zwisler
  2016-10-07  7:24           ` Jan Kara
  0 siblings, 1 reply; 46+ messages in thread
From: Ross Zwisler @ 2016-10-07  2:58 UTC (permalink / raw)
  To: Ross Zwisler, Jan Kara, linux-kernel, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Christoph Hellwig,
	Dan Williams, Dave Chinner, Jan Kara, Matthew Wilcox, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, linux-xfs

On Thu, Oct 06, 2016 at 03:34:24PM -0600, Ross Zwisler wrote:
> Interesting - adding iomap_end() calls to the DAX PTE fault handler causes an
> AA deadlock because we try and retake ei->dax_sem.  We take dax_sem in
> ext2_dax_fault() before calling into the DAX code, then if we end up going
> through the error path in ext2_iomap_end(), we call 
>   ext2_write_failed()
>     ext2_truncate_blocks()
>       dax_sem_down_write()
> 
> Where we try and take dax_sem again.  This error path is really only valid for
> I/O operations, but we happen to call it for page faults because 'written' in
> ext2_iomap_end() is just 0.
> 
> So...how should we handle this?  A few ideas:
> 
> 1) Just continue to omit the calls to iomap_end() in the DAX page fault
> handlers for now, and add them when there is useful work to be done in one of
> the filesystems.
> 
> 2) Add an IOMAP_FAULT flag to the flags passed into iomap_begin() and
> iomap_end() to make it explicit that we are calling as part of a fault handler
> and not an I/O operation, and use this to adjust the error handling in
> ext2_iomap_end().
> 
> 3) Just work around the existing error handling in ext2_iomap_end() by either
> unsetting IOMAP_WRITE or by setting 'written' to the size of the fault.
> 
> For #2 or #3, probably add a comment explaining the deadlock and why we need
> to never call ext2_write_failed() while handling a page fault.
> 
> Thoughts?

Never mind, #3 it is.  I think it was just a plain bug to call iomap_end()
with 'length' != 'written'.


* Re: [PATCH v4 10/12] dax: add struct iomap based DAX PMD support
  2016-10-07  2:58         ` Ross Zwisler
@ 2016-10-07  7:24           ` Jan Kara
  0 siblings, 0 replies; 46+ messages in thread
From: Jan Kara @ 2016-10-07  7:24 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Christoph Hellwig, Dan Williams,
	Dave Chinner, Jan Kara, Matthew Wilcox, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, linux-xfs

On Thu 06-10-16 20:58:33, Ross Zwisler wrote:
> On Thu, Oct 06, 2016 at 03:34:24PM -0600, Ross Zwisler wrote:
> > Interesting - adding iomap_end() calls to the DAX PTE fault handler causes an
> > AA deadlock because we try and retake ei->dax_sem.  We take dax_sem in
> > ext2_dax_fault() before calling into the DAX code, then if we end up going
> > through the error path in ext2_iomap_end(), we call 
> >   ext2_write_failed()
> >     ext2_truncate_blocks()
> >       dax_sem_down_write()
> > 
> > Where we try and take dax_sem again.  This error path is really only valid for
> > I/O operations, but we happen to call it for page faults because 'written' in
> > ext2_iomap_end() is just 0.
> > 
> > So...how should we handle this?  A few ideas:
> > 
> > 1) Just continue to omit the calls to iomap_end() in the DAX page fault
> > handlers for now, and add them when there is useful work to be done in one of
> > the filesystems.
> > 
> > 2) Add an IOMAP_FAULT flag to the flags passed into iomap_begin() and
> > iomap_end() to make it explicit that we are calling as part of a fault handler
> > and not an I/O operation, and use this to adjust the error handling in
> > ext2_iomap_end().
> > 
> > 3) Just work around the existing error handling in ext2_iomap_end() by either
> > unsetting IOMAP_WRITE or by setting 'written' to the size of the fault.
> > 
> > For #2 or #3, probably add a comment explaining the deadlock and why we need
> > to never call ext2_write_failed() while handling a page fault.
> > 
> > Thoughts?
> 
> Never mind, #3 it is, I think it was just a plain bug to call iomap_end() with
> 'length' != 'written'.

Yup, that's what I'd think as well.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


end of thread, other threads:[~2016-10-07  7:44 UTC | newest]

Thread overview: 46+ messages
2016-09-29 22:49 [PATCH v4 00/12] re-enable DAX PMD support Ross Zwisler
2016-09-29 22:49 ` [PATCH v4 01/12] ext4: allow DAX writeback for hole punch Ross Zwisler
2016-09-29 22:49 ` [PATCH v4 02/12] ext4: tell DAX the size of allocation holes Ross Zwisler
2016-09-29 22:49 ` [PATCH v4 03/12] dax: remove buffer_size_valid() Ross Zwisler
2016-09-30  8:49   ` Christoph Hellwig
2016-09-29 22:49 ` [PATCH v4 04/12] ext2: remove support for DAX PMD faults Ross Zwisler
2016-09-30  8:49   ` Christoph Hellwig
2016-10-03  9:35   ` Jan Kara
2016-09-29 22:49 ` [PATCH v4 05/12] dax: make 'wait_table' global variable static Ross Zwisler
2016-09-30  8:50   ` Christoph Hellwig
2016-10-03  9:36   ` Jan Kara
2016-09-29 22:49 ` [PATCH v4 06/12] dax: consistent variable naming for DAX entries Ross Zwisler
2016-09-30  8:50   ` Christoph Hellwig
2016-10-03  9:37   ` Jan Kara
2016-09-29 22:49 ` [PATCH v4 07/12] dax: coordinate locking for offsets in PMD range Ross Zwisler
2016-09-30  9:44   ` Christoph Hellwig
2016-10-03  9:55   ` Jan Kara
2016-10-03 18:40     ` Ross Zwisler
2016-09-29 22:49 ` [PATCH v4 08/12] dax: remove dax_pmd_fault() Ross Zwisler
2016-09-30  8:50   ` Christoph Hellwig
2016-10-03  9:56   ` Jan Kara
2016-09-29 22:49 ` [PATCH v4 09/12] dax: correct dax iomap code namespace Ross Zwisler
2016-09-30  8:51   ` Christoph Hellwig
2016-10-03  9:57   ` Jan Kara
2016-09-29 22:49 ` [PATCH v4 10/12] dax: add struct iomap based DAX PMD support Ross Zwisler
2016-09-30  9:56   ` Christoph Hellwig
2016-10-03 21:16     ` Ross Zwisler
2016-10-03 10:59   ` Jan Kara
2016-10-03 16:37     ` Christoph Hellwig
2016-10-03 21:05     ` Ross Zwisler
2016-10-04  5:55       ` Jan Kara
2016-10-04 15:39         ` Ross Zwisler
2016-10-05  5:50           ` Jan Kara
2016-10-06 21:34       ` Ross Zwisler
2016-10-07  2:58         ` Ross Zwisler
2016-10-07  7:24           ` Jan Kara
2016-09-29 22:49 ` [PATCH v4 11/12] xfs: use struct iomap based DAX PMD fault path Ross Zwisler
2016-09-29 22:49 ` [PATCH v4 12/12] dax: remove "depends on BROKEN" from FS_DAX_PMD Ross Zwisler
2016-09-29 23:43 ` [PATCH v4 00/12] re-enable DAX PMD support Dave Chinner
2016-09-30  3:03   ` Ross Zwisler
2016-09-30  4:00     ` Darrick J. Wong
2016-10-03 18:54       ` Ross Zwisler
2016-09-30  6:48     ` Dave Chinner
2016-10-03 21:11       ` Ross Zwisler
2016-10-03 23:05   ` Ross Zwisler
2016-09-30 11:46 ` Christoph Hellwig
