linux-mm.kvack.org archive mirror
* [PATCH v3 00/11] re-enable DAX PMD support
@ 2016-09-27 20:47 Ross Zwisler
  2016-09-27 20:47 ` [PATCH v3 01/11] ext4: allow DAX writeback for hole punch Ross Zwisler
                   ` (11 more replies)
  0 siblings, 12 replies; 19+ messages in thread
From: Ross Zwisler @ 2016-09-27 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking.  This series allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled.

Jan and Christoph, can you please help review these changes?

Andrew, when the time is right can you please push these changes to Linus via
the -mm tree?

In some simple mmap I/O testing with FIO, the use of PMD faults more than
doubles I/O performance compared with PTE faults.  Here is the FIO script I
used for my testing:

  [global]
  bs=4k
  size=2G
  directory=/mnt/pmem0
  ioengine=mmap
  [randrw]
  rw=randrw

Here are the performance results with XFS using only PTE faults:
   READ: io=1022.7MB, aggrb=557610KB/s, minb=557610KB/s, maxb=557610KB/s, mint=1878msec, maxt=1878msec
  WRITE: io=1025.4MB, aggrb=559084KB/s, minb=559084KB/s, maxb=559084KB/s, mint=1878msec, maxt=1878msec

Here are performance numbers for that same test using PMD faults:
   READ: io=1022.7MB, aggrb=1406.7MB/s, minb=1406.7MB/s, maxb=1406.7MB/s, mint=727msec, maxt=727msec
  WRITE: io=1025.4MB, aggrb=1410.4MB/s, minb=1410.4MB/s, maxb=1410.4MB/s, mint=727msec, maxt=727msec

This was done on a random lab machine with a PMEM device made from memmap'd
RAM.  To get XFS to use PMD faults, I did the following:

  mkfs.xfs -f -d su=2m,sw=1 /dev/pmem0
  mount -o dax /dev/pmem0 /mnt/pmem0
  xfs_io -c "extsize 2m" /mnt/pmem0
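
To sanity check that PMD faults are actually being taken (rather than
silently falling back to PTEs), one option is to enable the pr_debug()
messages in fs/dax.c and watch the THP fallback counter, since the fault
handlers bump THP_FAULT_FALLBACK whenever they bail out to 4k faults.  A
rough sketch, assuming a kernel with CONFIG_DYNAMIC_DEBUG and
CONFIG_TRANSPARENT_HUGEPAGE:

  echo 'file fs/dax.c +p' > /sys/kernel/debug/dynamic_debug/control
  grep thp_fault_fallback /proc/vmstat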

Changes since v2:
- Removed the struct buffer_head + get_block_t based dax_pmd_fault() handler.
  All DAX PMD faults will now happen via the new struct iomap based
  iomap_dax_pmd_fault().
- Added a new struct iomap based PMD path which is now used by XFS.
- Now that it is using struct iomap, ext2 no longer needs to be modified so
  that ext2_get_block() will give us the size of a hole.
- Removed support for DAX PMD faults from ext2.  I can't get them to
  reliably happen in my testing.
- Removed the unused xfs_get_blocks_dax_fault() wrapper.
- Added a bunch of comments around my changes in dax.c.

This was built upon xfs/for-next with PMD performance fixes from Toshi Kani and
Dan Williams.  Dan's patch has already been merged for v4.8, and Toshi's
patches are currently queued in Andrew Morton's mm tree for v4.9 inclusion.

Here is a tree containing my changes and all the fixes that I've been testing:
https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=dax_pmd_v3

Ross Zwisler (11):
  ext4: allow DAX writeback for hole punch
  ext4: tell DAX the size of allocation holes
  dax: remove buffer_size_valid()
  ext2: remove support for DAX PMD faults
  dax: make 'wait_table' global variable static
  dax: consistent variable naming for DAX entries
  dax: coordinate locking for offsets in PMD range
  dax: remove dax_pmd_fault()
  dax: add struct iomap based DAX PMD support
  xfs: use struct iomap based DAX PMD fault path
  dax: remove "depends on BROKEN" from FS_DAX_PMD

 fs/Kconfig          |   1 -
 fs/dax.c            | 696 +++++++++++++++++++++++++++++-----------------------
 fs/ext2/file.c      |  24 +-
 fs/ext4/inode.c     |   7 +-
 fs/xfs/xfs_aops.c   |  25 +-
 fs/xfs/xfs_aops.h   |   3 -
 fs/xfs/xfs_file.c   |   2 +-
 include/linux/dax.h |  37 ++-
 mm/filemap.c        |   6 +-
 9 files changed, 434 insertions(+), 367 deletions(-)

-- 
2.7.4


* [PATCH v3 01/11] ext4: allow DAX writeback for hole punch
  2016-09-27 20:47 [PATCH v3 00/11] re-enable DAX PMD support Ross Zwisler
@ 2016-09-27 20:47 ` Ross Zwisler
  2016-09-27 20:47 ` [PATCH v3 02/11] ext4: tell DAX the size of allocation holes Ross Zwisler
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Ross Zwisler @ 2016-09-27 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs, stable

Currently when doing a DAX hole punch with ext4 we fail to do a writeback.
This is because the logic around filemap_write_and_wait_range() in
ext4_punch_hole() only looks for dirty page cache pages in the radix tree,
not for dirty DAX exceptional entries.  Since DAX I/O bypasses the page
cache, mapping->nrpages can be zero even when the radix tree holds dirty
DAX exceptional entries, so the writeback was skipped entirely.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>
---
 fs/ext4/inode.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3131747..0900cb4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3890,7 +3890,7 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
 }
 
 /*
- * ext4_punch_hole: punches a hole in a file by releaseing the blocks
+ * ext4_punch_hole: punches a hole in a file by releasing the blocks
  * associated with the given offset and length
  *
  * @inode:  File inode
@@ -3919,7 +3919,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 	 * Write out all dirty pages to avoid race conditions
 	 * Then release them.
 	 */
-	if (mapping->nrpages && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 		ret = filemap_write_and_wait_range(mapping, offset,
 						   offset + length - 1);
 		if (ret)
-- 
2.7.4


* [PATCH v3 02/11] ext4: tell DAX the size of allocation holes
  2016-09-27 20:47 [PATCH v3 00/11] re-enable DAX PMD support Ross Zwisler
  2016-09-27 20:47 ` [PATCH v3 01/11] ext4: allow DAX writeback for hole punch Ross Zwisler
@ 2016-09-27 20:47 ` Ross Zwisler
  2016-09-27 20:47 ` [PATCH v3 03/11] dax: remove buffer_size_valid() Ross Zwisler
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Ross Zwisler @ 2016-09-27 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

When DAX calls _ext4_get_block() and the file offset points to a hole we
currently don't set bh->b_size.  This is currently worked around via
buffer_size_valid() in fs/dax.c.

_ext4_get_block() has the hole size information from ext4_map_blocks(), so
populate bh->b_size so we can remove buffer_size_valid() in a later patch.
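
As a worked example: if ext4_map_blocks() reports a 512-block hole on a
filesystem with 4k blocks (map.m_len == 512), bh->b_size becomes
4096 * 512 == 2 MiB, which is exactly what the DAX PMD path needs to see.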

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0900cb4..9075fac 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -759,6 +759,9 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
 		ext4_update_bh_state(bh, map.m_flags);
 		bh->b_size = inode->i_sb->s_blocksize * map.m_len;
 		ret = 0;
+	} else if (ret == 0) {
+		/* hole case, need to fill in bh->b_size */
+		bh->b_size = inode->i_sb->s_blocksize * map.m_len;
 	}
 	return ret;
 }
-- 
2.7.4


* [PATCH v3 03/11] dax: remove buffer_size_valid()
  2016-09-27 20:47 [PATCH v3 00/11] re-enable DAX PMD support Ross Zwisler
  2016-09-27 20:47 ` [PATCH v3 01/11] ext4: allow DAX writeback for hole punch Ross Zwisler
  2016-09-27 20:47 ` [PATCH v3 02/11] ext4: tell DAX the size of allocation holes Ross Zwisler
@ 2016-09-27 20:47 ` Ross Zwisler
  2016-09-27 20:47 ` [PATCH v3 04/11] ext2: remove support for DAX PMD faults Ross Zwisler
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Ross Zwisler @ 2016-09-27 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Now that ext4 properly sets bh.b_size when we call get_block() for a hole,
rely on that value and remove the buffer_size_valid() sanity check.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 22 +---------------------
 1 file changed, 1 insertion(+), 21 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index cc025f8..9b9be8a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -123,19 +123,6 @@ static bool buffer_written(struct buffer_head *bh)
 	return buffer_mapped(bh) && !buffer_unwritten(bh);
 }
 
-/*
- * When ext4 encounters a hole, it returns without modifying the buffer_head
- * which means that we can't trust b_size.  To cope with this, we set b_state
- * to 0 before calling get_block and, if any bit is set, we know we can trust
- * b_size.  Unfortunate, really, since ext4 knows precisely how long a hole is
- * and would save us time calling get_block repeatedly.
- */
-static bool buffer_size_valid(struct buffer_head *bh)
-{
-	return bh->b_state != 0;
-}
-
-
 static sector_t to_sector(const struct buffer_head *bh,
 		const struct inode *inode)
 {
@@ -177,8 +164,6 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 				rc = get_block(inode, block, bh, rw == WRITE);
 				if (rc)
 					break;
-				if (!buffer_size_valid(bh))
-					bh->b_size = 1 << blkbits;
 				bh_max = pos - first + bh->b_size;
 				bdev = bh->b_bdev;
 				/*
@@ -1012,12 +997,7 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 
 	bdev = bh.b_bdev;
 
-	/*
-	 * If the filesystem isn't willing to tell us the length of a hole,
-	 * just fall back to PTEs.  Calling get_block 512 times in a loop
-	 * would be silly.
-	 */
-	if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE) {
+	if (bh.b_size < PMD_SIZE) {
 		dax_pmd_dbg(&bh, address, "allocated block too small");
 		return VM_FAULT_FALLBACK;
 	}
-- 
2.7.4


* [PATCH v3 04/11] ext2: remove support for DAX PMD faults
  2016-09-27 20:47 [PATCH v3 00/11] re-enable DAX PMD support Ross Zwisler
                   ` (2 preceding siblings ...)
  2016-09-27 20:47 ` [PATCH v3 03/11] dax: remove buffer_size_valid() Ross Zwisler
@ 2016-09-27 20:47 ` Ross Zwisler
  2016-09-27 21:47   ` Dave Chinner
  2016-09-27 20:47 ` [PATCH v3 05/11] dax: make 'wait_table' global variable static Ross Zwisler
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 19+ messages in thread
From: Ross Zwisler @ 2016-09-27 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

DAX PMD support was added via the following commit:

commit e7b1ea2ad658 ("ext2: huge page fault support")

I believe this path to be untested as ext2 doesn't reliably provide block
allocations that are aligned to 2MiB.  In my testing I've been unable to
get ext2 to actually fault in a PMD.  It always fails with a "pfn
unaligned" message because the sector returned by ext2_get_block() isn't
aligned.

I've tried various settings for the "stride" and "stripe_width" extended
options to mkfs.ext2, without any luck.
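
(For illustration only, with hypothetical values: the attempts were along
the lines of "mkfs.ext2 -E stride=512,stripe_width=512 /dev/pmem0", i.e.
512 4k blocks per 2 MiB stripe.)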

Since we can't reliably get PMDs, remove support so that we don't have an
untested code path that we may someday traverse when we happen to get an
aligned block allocation.  This should also make 4k DAX faults in ext2 a
bit faster since they will no longer have to call the PMD fault handler
only to get a response of VM_FAULT_FALLBACK.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/ext2/file.c | 24 +-----------------------
 1 file changed, 1 insertion(+), 23 deletions(-)

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 0ca363d..d5af6d2 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -107,27 +107,6 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	return ret;
 }
 
-static int ext2_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
-						pmd_t *pmd, unsigned int flags)
-{
-	struct inode *inode = file_inode(vma->vm_file);
-	struct ext2_inode_info *ei = EXT2_I(inode);
-	int ret;
-
-	if (flags & FAULT_FLAG_WRITE) {
-		sb_start_pagefault(inode->i_sb);
-		file_update_time(vma->vm_file);
-	}
-	down_read(&ei->dax_sem);
-
-	ret = dax_pmd_fault(vma, addr, pmd, flags, ext2_get_block);
-
-	up_read(&ei->dax_sem);
-	if (flags & FAULT_FLAG_WRITE)
-		sb_end_pagefault(inode->i_sb);
-	return ret;
-}
-
 static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
 		struct vm_fault *vmf)
 {
@@ -154,7 +133,6 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
 
 static const struct vm_operations_struct ext2_dax_vm_ops = {
 	.fault		= ext2_dax_fault,
-	.pmd_fault	= ext2_dax_pmd_fault,
 	.page_mkwrite	= ext2_dax_fault,
 	.pfn_mkwrite	= ext2_dax_pfn_mkwrite,
 };
@@ -166,7 +144,7 @@ static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
 
 	file_accessed(file);
 	vma->vm_ops = &ext2_dax_vm_ops;
-	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+	vma->vm_flags |= VM_MIXEDMAP;
 	return 0;
 }
 #else
-- 
2.7.4


* [PATCH v3 05/11] dax: make 'wait_table' global variable static
  2016-09-27 20:47 [PATCH v3 00/11] re-enable DAX PMD support Ross Zwisler
                   ` (3 preceding siblings ...)
  2016-09-27 20:47 ` [PATCH v3 04/11] ext2: remove support for DAX PMD faults Ross Zwisler
@ 2016-09-27 20:47 ` Ross Zwisler
  2016-09-27 20:47 ` [PATCH v3 06/11] dax: consistent variable naming for DAX entries Ross Zwisler
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Ross Zwisler @ 2016-09-27 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

The global 'wait_table' variable is only used within fs/dax.c, and
generates the following sparse warning:

fs/dax.c:39:19: warning: symbol 'wait_table' was not declared. Should it be static?

Make it static so its scope is local to fs/dax.c, which also silences the
sparse warning.
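
For reference, the warning can be reproduced by running the kernel's
static checker hook over this file, e.g.:

  make C=2 fs/dax.o

(C=1 only checks files that are being recompiled; C=2 forces the check.)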

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 9b9be8a..ac28cdf 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -52,7 +52,7 @@
 #define DAX_WAIT_TABLE_BITS 12
 #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
 
-wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
+static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
 
 static int __init init_dax_wait_table(void)
 {
-- 
2.7.4


* [PATCH v3 06/11] dax: consistent variable naming for DAX entries
  2016-09-27 20:47 [PATCH v3 00/11] re-enable DAX PMD support Ross Zwisler
                   ` (4 preceding siblings ...)
  2016-09-27 20:47 ` [PATCH v3 05/11] dax: make 'wait_table' global variable static Ross Zwisler
@ 2016-09-27 20:47 ` Ross Zwisler
  2016-09-27 20:47 ` [PATCH v3 07/11] dax: coordinate locking for offsets in PMD range Ross Zwisler
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Ross Zwisler @ 2016-09-27 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

No functional change.

Consistently use the variable name 'entry' instead of 'ret' for DAX radix
tree entries.  This was already happening in most of the code, so update
get_unlocked_mapping_entry(), grab_mapping_entry() and
dax_unlock_mapping_entry().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c | 34 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index ac28cdf..baef586 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -357,7 +357,7 @@ static inline void *unlock_slot(struct address_space *mapping, void **slot)
 static void *get_unlocked_mapping_entry(struct address_space *mapping,
 					pgoff_t index, void ***slotp)
 {
-	void *ret, **slot;
+	void *entry, **slot;
 	struct wait_exceptional_entry_queue ewait;
 	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
 
@@ -367,13 +367,13 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 	ewait.key.index = index;
 
 	for (;;) {
-		ret = __radix_tree_lookup(&mapping->page_tree, index, NULL,
+		entry = __radix_tree_lookup(&mapping->page_tree, index, NULL,
 					  &slot);
-		if (!ret || !radix_tree_exceptional_entry(ret) ||
+		if (!entry || !radix_tree_exceptional_entry(entry) ||
 		    !slot_locked(mapping, slot)) {
 			if (slotp)
 				*slotp = slot;
-			return ret;
+			return entry;
 		}
 		prepare_to_wait_exclusive(wq, &ewait.wait,
 					  TASK_UNINTERRUPTIBLE);
@@ -396,13 +396,13 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
  */
 static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
-	void *ret, **slot;
+	void *entry, **slot;
 
 restart:
 	spin_lock_irq(&mapping->tree_lock);
-	ret = get_unlocked_mapping_entry(mapping, index, &slot);
+	entry = get_unlocked_mapping_entry(mapping, index, &slot);
 	/* No entry for given index? Make sure radix tree is big enough. */
-	if (!ret) {
+	if (!entry) {
 		int err;
 
 		spin_unlock_irq(&mapping->tree_lock);
@@ -410,10 +410,10 @@ restart:
 				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
 		if (err)
 			return ERR_PTR(err);
-		ret = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
+		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
 			       RADIX_DAX_ENTRY_LOCK);
 		spin_lock_irq(&mapping->tree_lock);
-		err = radix_tree_insert(&mapping->page_tree, index, ret);
+		err = radix_tree_insert(&mapping->page_tree, index, entry);
 		radix_tree_preload_end();
 		if (err) {
 			spin_unlock_irq(&mapping->tree_lock);
@@ -425,11 +425,11 @@ restart:
 		/* Good, we have inserted empty locked entry into the tree. */
 		mapping->nrexceptional++;
 		spin_unlock_irq(&mapping->tree_lock);
-		return ret;
+		return entry;
 	}
 	/* Normal page in radix tree? */
-	if (!radix_tree_exceptional_entry(ret)) {
-		struct page *page = ret;
+	if (!radix_tree_exceptional_entry(entry)) {
+		struct page *page = entry;
 
 		get_page(page);
 		spin_unlock_irq(&mapping->tree_lock);
@@ -442,9 +442,9 @@ restart:
 		}
 		return page;
 	}
-	ret = lock_slot(mapping, slot);
+	entry = lock_slot(mapping, slot);
 	spin_unlock_irq(&mapping->tree_lock);
-	return ret;
+	return entry;
 }
 
 void dax_wake_mapping_entry_waiter(struct address_space *mapping,
@@ -469,11 +469,11 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
 
 void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
-	void *ret, **slot;
+	void *entry, **slot;
 
 	spin_lock_irq(&mapping->tree_lock);
-	ret = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
-	if (WARN_ON_ONCE(!ret || !radix_tree_exceptional_entry(ret) ||
+	entry = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
+	if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry) ||
 			 !slot_locked(mapping, slot))) {
 		spin_unlock_irq(&mapping->tree_lock);
 		return;
-- 
2.7.4


* [PATCH v3 07/11] dax: coordinate locking for offsets in PMD range
  2016-09-27 20:47 [PATCH v3 00/11] re-enable DAX PMD support Ross Zwisler
                   ` (5 preceding siblings ...)
  2016-09-27 20:47 ` [PATCH v3 06/11] dax: consistent variable naming for DAX entries Ross Zwisler
@ 2016-09-27 20:47 ` Ross Zwisler
  2016-09-27 20:47 ` [PATCH v3 08/11] dax: remove dax_pmd_fault() Ross Zwisler
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Ross Zwisler @ 2016-09-27 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

DAX radix tree locking currently locks entries based on the unique
combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
This works for PTEs, but as we move to PMDs we will need all the offsets
within the range covered by the PMD to map to the same bit lock.  To
accomplish this, for ranges covered by a PMD entry we will instead lock
based on the page offset of the beginning of the PMD entry.  The 'mapping'
pointer is still used in the same way.
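
As a concrete example with 4k pages: PMD_MASK >> PAGE_SHIFT clears the low
9 bits of the index (a PMD covers 2 MiB / 4k == 512 pages), so faults at
page offsets 512 through 1023 of a PMD-backed range all contend on the bit
lock for index 512.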

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 37 ++++++++++++++++++++++++-------------
 include/linux/dax.h |  2 +-
 mm/filemap.c        |  2 +-
 3 files changed, 26 insertions(+), 15 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index baef586..406feea 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -64,10 +64,17 @@ static int __init init_dax_wait_table(void)
 }
 fs_initcall(init_dax_wait_table);
 
+static pgoff_t dax_entry_start(pgoff_t index, void *entry)
+{
+	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
+		index &= (PMD_MASK >> PAGE_SHIFT);
+	return index;
+}
+
 static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
-					      pgoff_t index)
+					      pgoff_t entry_start)
 {
-	unsigned long hash = hash_long((unsigned long)mapping ^ index,
+	unsigned long hash = hash_long((unsigned long)mapping ^ entry_start,
 				       DAX_WAIT_TABLE_BITS);
 	return wait_table + hash;
 }
@@ -285,7 +292,7 @@ EXPORT_SYMBOL_GPL(dax_do_io);
  */
 struct exceptional_entry_key {
 	struct address_space *mapping;
-	unsigned long index;
+	pgoff_t entry_start;
 };
 
 struct wait_exceptional_entry_queue {
@@ -301,7 +308,7 @@ static int wake_exceptional_entry_func(wait_queue_t *wait, unsigned int mode,
 		container_of(wait, struct wait_exceptional_entry_queue, wait);
 
 	if (key->mapping != ewait->key.mapping ||
-	    key->index != ewait->key.index)
+	    key->entry_start != ewait->key.entry_start)
 		return 0;
 	return autoremove_wake_function(wait, mode, sync, NULL);
 }
@@ -359,12 +366,10 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 {
 	void *entry, **slot;
 	struct wait_exceptional_entry_queue ewait;
-	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+	wait_queue_head_t *wq;
 
 	init_wait(&ewait.wait);
 	ewait.wait.func = wake_exceptional_entry_func;
-	ewait.key.mapping = mapping;
-	ewait.key.index = index;
 
 	for (;;) {
 		entry = __radix_tree_lookup(&mapping->page_tree, index, NULL,
@@ -375,6 +380,11 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 				*slotp = slot;
 			return entry;
 		}
+
+		wq = dax_entry_waitqueue(mapping,
+				dax_entry_start(index, entry));
+		ewait.key.mapping = mapping;
+		ewait.key.entry_start = dax_entry_start(index, entry);
 		prepare_to_wait_exclusive(wq, &ewait.wait,
 					  TASK_UNINTERRUPTIBLE);
 		spin_unlock_irq(&mapping->tree_lock);
@@ -447,10 +457,11 @@ restart:
 	return entry;
 }
 
-void dax_wake_mapping_entry_waiter(struct address_space *mapping,
+void dax_wake_mapping_entry_waiter(void *entry, struct address_space *mapping,
 				   pgoff_t index, bool wake_all)
 {
-	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+	wait_queue_head_t *wq = dax_entry_waitqueue(mapping,
+						dax_entry_start(index, entry));
 
 	/*
 	 * Checking for locked entry and prepare_to_wait_exclusive() happens
@@ -462,7 +473,7 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
 		struct exceptional_entry_key key;
 
 		key.mapping = mapping;
-		key.index = index;
+		key.entry_start = dax_entry_start(index, entry);
 		__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key);
 	}
 }
@@ -480,7 +491,7 @@ void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
 	}
 	unlock_slot(mapping, slot);
 	spin_unlock_irq(&mapping->tree_lock);
-	dax_wake_mapping_entry_waiter(mapping, index, false);
+	dax_wake_mapping_entry_waiter(entry, mapping, index, false);
 }
 
 static void put_locked_mapping_entry(struct address_space *mapping,
@@ -505,7 +516,7 @@ static void put_unlocked_mapping_entry(struct address_space *mapping,
 		return;
 
 	/* We have to wake up next waiter for the radix tree entry lock */
-	dax_wake_mapping_entry_waiter(mapping, index, false);
+	dax_wake_mapping_entry_waiter(entry, mapping, index, false);
 }
 
 /*
@@ -532,7 +543,7 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
 	radix_tree_delete(&mapping->page_tree, index);
 	mapping->nrexceptional--;
 	spin_unlock_irq(&mapping->tree_lock);
-	dax_wake_mapping_entry_waiter(mapping, index, true);
+	dax_wake_mapping_entry_waiter(entry, mapping, index, true);
 
 	return 1;
 }
diff --git a/include/linux/dax.h b/include/linux/dax.h
index add6c4b..4065601 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -21,7 +21,7 @@ int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			struct iomap_ops *ops);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
-void dax_wake_mapping_entry_waiter(struct address_space *mapping,
+void dax_wake_mapping_entry_waiter(void *entry, struct address_space *mapping,
 				   pgoff_t index, bool wake_all);
 
 #ifdef CONFIG_FS_DAX
diff --git a/mm/filemap.c b/mm/filemap.c
index 8a287df..35e880d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -617,7 +617,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
 			if (node)
 				workingset_node_pages_dec(node);
 			/* Wakeup waiters for exceptional entry lock */
-			dax_wake_mapping_entry_waiter(mapping, page->index,
+			dax_wake_mapping_entry_waiter(p, mapping, page->index,
 						      false);
 		}
 	}
-- 
2.7.4


* [PATCH v3 08/11] dax: remove dax_pmd_fault()
  2016-09-27 20:47 [PATCH v3 00/11] re-enable DAX PMD support Ross Zwisler
                   ` (6 preceding siblings ...)
  2016-09-27 20:47 ` [PATCH v3 07/11] dax: coordinate locking for offsets in PMD range Ross Zwisler
@ 2016-09-27 20:47 ` Ross Zwisler
  2016-09-27 20:48 ` [PATCH v3 09/11] dax: add struct iomap based DAX PMD support Ross Zwisler
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Ross Zwisler @ 2016-09-27 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

dax_pmd_fault() is the old struct buffer_head + get_block_t based 2 MiB DAX
fault handler.  This fault handler has been disabled for several kernel
releases, and support for PMDs will be reintroduced using the struct iomap
interface instead.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 213 ----------------------------------------------------
 include/linux/dax.h |   6 +-
 2 files changed, 1 insertion(+), 218 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 406feea..b5e7b13 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -909,219 +909,6 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 }
 EXPORT_SYMBOL_GPL(dax_fault);
 
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
-/*
- * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
- * more often than one might expect in the below function.
- */
-#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
-
-static void __dax_dbg(struct buffer_head *bh, unsigned long address,
-		const char *reason, const char *fn)
-{
-	if (bh) {
-		char bname[BDEVNAME_SIZE];
-		bdevname(bh->b_bdev, bname);
-		pr_debug("%s: %s addr: %lx dev %s state %lx start %lld "
-			"length %zd fallback: %s\n", fn, current->comm,
-			address, bname, bh->b_state, (u64)bh->b_blocknr,
-			bh->b_size, reason);
-	} else {
-		pr_debug("%s: %s addr: %lx fallback: %s\n", fn,
-			current->comm, address, reason);
-	}
-}
-
-#define dax_pmd_dbg(bh, address, reason)	__dax_dbg(bh, address, reason, "dax_pmd")
-
-/**
- * dax_pmd_fault - handle a PMD fault on a DAX file
- * @vma: The virtual memory area where the fault occurred
- * @vmf: The description of the fault
- * @get_block: The filesystem method used to translate file offsets to blocks
- *
- * When a page fault occurs, filesystems may call this helper in their
- * pmd_fault handler for DAX files.
- */
-int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
-		pmd_t *pmd, unsigned int flags, get_block_t get_block)
-{
-	struct file *file = vma->vm_file;
-	struct address_space *mapping = file->f_mapping;
-	struct inode *inode = mapping->host;
-	struct buffer_head bh;
-	unsigned blkbits = inode->i_blkbits;
-	unsigned long pmd_addr = address & PMD_MASK;
-	bool write = flags & FAULT_FLAG_WRITE;
-	struct block_device *bdev;
-	pgoff_t size, pgoff;
-	sector_t block;
-	int result = 0;
-	bool alloc = false;
-
-	/* dax pmd mappings require pfn_t_devmap() */
-	if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
-		return VM_FAULT_FALLBACK;
-
-	/* Fall back to PTEs if we're going to COW */
-	if (write && !(vma->vm_flags & VM_SHARED)) {
-		split_huge_pmd(vma, pmd, address);
-		dax_pmd_dbg(NULL, address, "cow write");
-		return VM_FAULT_FALLBACK;
-	}
-	/* If the PMD would extend outside the VMA */
-	if (pmd_addr < vma->vm_start) {
-		dax_pmd_dbg(NULL, address, "vma start unaligned");
-		return VM_FAULT_FALLBACK;
-	}
-	if ((pmd_addr + PMD_SIZE) > vma->vm_end) {
-		dax_pmd_dbg(NULL, address, "vma end unaligned");
-		return VM_FAULT_FALLBACK;
-	}
-
-	pgoff = linear_page_index(vma, pmd_addr);
-	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	if (pgoff >= size)
-		return VM_FAULT_SIGBUS;
-	/* If the PMD would cover blocks out of the file */
-	if ((pgoff | PG_PMD_COLOUR) >= size) {
-		dax_pmd_dbg(NULL, address,
-				"offset + huge page size > file size");
-		return VM_FAULT_FALLBACK;
-	}
-
-	memset(&bh, 0, sizeof(bh));
-	bh.b_bdev = inode->i_sb->s_bdev;
-	block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
-
-	bh.b_size = PMD_SIZE;
-
-	if (get_block(inode, block, &bh, 0) != 0)
-		return VM_FAULT_SIGBUS;
-
-	if (!buffer_mapped(&bh) && write) {
-		if (get_block(inode, block, &bh, 1) != 0)
-			return VM_FAULT_SIGBUS;
-		alloc = true;
-		WARN_ON_ONCE(buffer_unwritten(&bh) || buffer_new(&bh));
-	}
-
-	bdev = bh.b_bdev;
-
-	if (bh.b_size < PMD_SIZE) {
-		dax_pmd_dbg(&bh, address, "allocated block too small");
-		return VM_FAULT_FALLBACK;
-	}
-
-	/*
-	 * If we allocated new storage, make sure no process has any
-	 * zero pages covering this hole
-	 */
-	if (alloc) {
-		loff_t lstart = pgoff << PAGE_SHIFT;
-		loff_t lend = lstart + PMD_SIZE - 1; /* inclusive */
-
-		truncate_pagecache_range(inode, lstart, lend);
-	}
-
-	if (!write && !buffer_mapped(&bh)) {
-		spinlock_t *ptl;
-		pmd_t entry;
-		struct page *zero_page = get_huge_zero_page();
-
-		if (unlikely(!zero_page)) {
-			dax_pmd_dbg(&bh, address, "no zero page");
-			goto fallback;
-		}
-
-		ptl = pmd_lock(vma->vm_mm, pmd);
-		if (!pmd_none(*pmd)) {
-			spin_unlock(ptl);
-			dax_pmd_dbg(&bh, address, "pmd already present");
-			goto fallback;
-		}
-
-		dev_dbg(part_to_dev(bdev->bd_part),
-				"%s: %s addr: %lx pfn: <zero> sect: %llx\n",
-				__func__, current->comm, address,
-				(unsigned long long) to_sector(&bh, inode));
-
-		entry = mk_pmd(zero_page, vma->vm_page_prot);
-		entry = pmd_mkhuge(entry);
-		set_pmd_at(vma->vm_mm, pmd_addr, pmd, entry);
-		result = VM_FAULT_NOPAGE;
-		spin_unlock(ptl);
-	} else {
-		struct blk_dax_ctl dax = {
-			.sector = to_sector(&bh, inode),
-			.size = PMD_SIZE,
-		};
-		long length = dax_map_atomic(bdev, &dax);
-
-		if (length < 0) {
-			dax_pmd_dbg(&bh, address, "dax-error fallback");
-			goto fallback;
-		}
-		if (length < PMD_SIZE) {
-			dax_pmd_dbg(&bh, address, "dax-length too small");
-			dax_unmap_atomic(bdev, &dax);
-			goto fallback;
-		}
-		if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR) {
-			dax_pmd_dbg(&bh, address, "pfn unaligned");
-			dax_unmap_atomic(bdev, &dax);
-			goto fallback;
-		}
-
-		if (!pfn_t_devmap(dax.pfn)) {
-			dax_unmap_atomic(bdev, &dax);
-			dax_pmd_dbg(&bh, address, "pfn not in memmap");
-			goto fallback;
-		}
-		dax_unmap_atomic(bdev, &dax);
-
-		/*
-		 * For PTE faults we insert a radix tree entry for reads, and
-		 * leave it clean.  Then on the first write we dirty the radix
-		 * tree entry via the dax_pfn_mkwrite() path.  This sequence
-		 * allows the dax_pfn_mkwrite() call to be simpler and avoid a
-		 * call into get_block() to translate the pgoff to a sector in
-		 * order to be able to create a new radix tree entry.
-		 *
-		 * The PMD path doesn't have an equivalent to
-		 * dax_pfn_mkwrite(), though, so for a read followed by a
-		 * write we traverse all the way through dax_pmd_fault()
-		 * twice.  This means we can just skip inserting a radix tree
-		 * entry completely on the initial read and just wait until
-		 * the write to insert a dirty entry.
-		 */
-		if (write) {
-			/*
-			 * We should insert radix-tree entry and dirty it here.
-			 * For now this is broken...
-			 */
-		}
-
-		dev_dbg(part_to_dev(bdev->bd_part),
-				"%s: %s addr: %lx pfn: %lx sect: %llx\n",
-				__func__, current->comm, address,
-				pfn_t_to_pfn(dax.pfn),
-				(unsigned long long) dax.sector);
-		result |= vmf_insert_pfn_pmd(vma, address, pmd,
-				dax.pfn, write);
-	}
-
- out:
-	return result;
-
- fallback:
-	count_vm_event(THP_FAULT_FALLBACK);
-	result = VM_FAULT_FALLBACK;
-	goto out;
-}
-EXPORT_SYMBOL_GPL(dax_pmd_fault);
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-
 /**
  * dax_pfn_mkwrite - handle first write to DAX page
  * @vma: The virtual memory area where the fault occurred
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 4065601..d9a8350 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -48,16 +48,12 @@ static inline int __dax_zero_page_range(struct block_device *bdev,
 }
 #endif
 
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
-int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
-				unsigned int flags, get_block_t);
-#else
 static inline int dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 				pmd_t *pmd, unsigned int flags, get_block_t gb)
 {
 	return VM_FAULT_FALLBACK;
 }
-#endif
+
 int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
 #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
 
-- 
2.7.4


* [PATCH v3 09/11] dax: add struct iomap based DAX PMD support
  2016-09-27 20:47 [PATCH v3 00/11] re-enable DAX PMD support Ross Zwisler
                   ` (7 preceding siblings ...)
  2016-09-27 20:47 ` [PATCH v3 08/11] dax: remove dax_pmd_fault() Ross Zwisler
@ 2016-09-27 20:48 ` Ross Zwisler
  2016-09-27 22:14   ` Dave Chinner
  2016-09-27 20:48 ` [PATCH v3 10/11] xfs: use struct iomap based DAX PMD fault path Ross Zwisler
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 19+ messages in thread
From: Ross Zwisler @ 2016-09-27 20:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking.  This patch allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled using the new struct
iomap based fault handlers.

There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
mappings that have an associated block allocation, and 4k DAX empty
entries.  The empty entries exist to provide locking for the duration of a
given page fault.

This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP)
entries, PMD DAX entries that have associated block allocations, and 2 MiB
DAX empty entries.

Unlike the 4k case where we insert a struct page* into the radix tree for
4k zero pages, for HZP we insert a DAX exceptional entry with the new
RADIX_DAX_HZP flag set.  This is because we use a single 2 MiB zero page in
every 2MiB hole mapping, and it doesn't make sense to have that same struct
page* with multiple entries in multiple trees.  This would cause contention
on the single page lock for the one Huge Zero Page, and it would break the
page->index and page->mapping associations that are assumed to be valid in
many other places in the kernel.

One difficult use case is when one thread is trying to use 4k entries in
the radix tree for a given offset, and another thread is using 2 MiB
entries for that same offset.  The current code handles this by making the
2 MiB user fall back to 4k entries in most cases.  This was done because it
is the simplest solution, and because the use of 2MiB pages is already
opportunistic.

If we were to try to upgrade from 4k pages to 2MiB pages for a given range,
we would run into the problem of how to lock out 4k page faults for the
entire 2MiB range while we clean out the radix tree so we can insert the
2MiB entry.  We can solve this problem if we need to, but I think that the
cases where both 2MiB entries and 4k entries are being used for the same
range will be rare enough, and the gain small enough, that it probably
won't be worth the complexity.
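
To make the new entry encoding concrete (see the include/linux/dax.h hunk
below): with five special bits above RADIX_TREE_EXCEPTIONAL_SHIFT, a
freshly inserted PMD block mapping for sector 'sec' is built as

  RADIX_DAX_ENTRY(sec, RADIX_DAX_PMD)
    == RADIX_TREE_EXCEPTIONAL_ENTRY | RADIX_DAX_PMD |
       (sec << RADIX_DAX_SHIFT) | RADIX_DAX_ENTRY_LOCK

i.e. entries are born locked, and a HZP entry additionally carries both the
RADIX_DAX_HZP and RADIX_DAX_EMPTY flags.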

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 396 ++++++++++++++++++++++++++++++++++++++++++++++------
 include/linux/dax.h |  29 +++-
 mm/filemap.c        |   4 +-
 3 files changed, 380 insertions(+), 49 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index b5e7b13..13934d7 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -34,20 +34,6 @@
 #include <linux/iomap.h>
 #include "internal.h"
 
-/*
- * We use lowest available bit in exceptional entry for locking, other two
- * bits to determine entry type. In total 3 special bits.
- */
-#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
-#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
-#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
-#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
-#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
-#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
-#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
-		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
-		RADIX_TREE_EXCEPTIONAL_ENTRY))
-
 /* We choose 4096 entries - same as per-zone page wait tables */
 #define DAX_WAIT_TABLE_BITS 12
 #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
@@ -400,19 +386,52 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
  * radix tree entry locked. If the radix tree doesn't contain given index,
  * create empty exceptional entry for the index and return with it locked.
  *
+ * When requesting an entry with type RADIX_DAX_PMD, grab_mapping_entry() will
+ * either return that locked entry or will return an error.  This error will
+ * happen if there are any 4k entries (either zero pages or DAX entries)
+ * within the 2MiB range that we are requesting.
+ *
+ * We always favor 4k entries over 2MiB entries. There isn't a flow where we
+ * evict 4k entries in order to 'upgrade' them to a 2MiB entry.  Also, a 2MiB
+ * insertion will fail if it finds any 4k entries already in the tree, and a
+ * 4k insertion will cause an existing 2MiB entry to be unmapped and
+ * downgraded to 4k entries.  This happens for both 2MiB huge zero pages as
+ * well as 2MiB empty entries.
+ *
+ * The exception to this downgrade path is for 2MiB DAX PMD entries that have
+ * real storage backing them.  We will leave these real 2MiB DAX entries in
+ * the tree, and PTE writes will simply dirty the entire 2MiB DAX entry.
+ *
  * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
  * persistent memory the benefit is doubtful. We can add that later if we can
  * show it helps.
  */
-static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
+static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
+		unsigned long new_type)
 {
+	bool pmd_downgrade = false; /* splitting 2MiB entry into 4k entries? */
 	void *entry, **slot;
 
 restart:
 	spin_lock_irq(&mapping->tree_lock);
 	entry = get_unlocked_mapping_entry(mapping, index, &slot);
+
+	if (entry && new_type == RADIX_DAX_PMD) {
+		if (!radix_tree_exceptional_entry(entry) ||
+				RADIX_DAX_TYPE(entry) == RADIX_DAX_PTE) {
+			spin_unlock_irq(&mapping->tree_lock);
+			return ERR_PTR(-EEXIST);
+		}
+	} else if (entry && new_type == RADIX_DAX_PTE) {
+		if (radix_tree_exceptional_entry(entry) &&
+		    RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD &&
+		    (unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
+			pmd_downgrade = true;
+		}
+	}
+
 	/* No entry for given index? Make sure radix tree is big enough. */
-	if (!entry) {
+	if (!entry || pmd_downgrade) {
 		int err;
 
 		spin_unlock_irq(&mapping->tree_lock);
@@ -420,15 +439,39 @@ restart:
 				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
 		if (err)
 			return ERR_PTR(err);
-		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
-			       RADIX_DAX_ENTRY_LOCK);
+
+		/*
+		 * Besides huge zero pages the only other thing that gets
+		 * downgraded are empty entries which don't need to be
+		 * unmapped.
+		 */
+		if (pmd_downgrade && ((unsigned long)entry & RADIX_DAX_HZP))
+			unmap_mapping_range(mapping,
+				(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
+
 		spin_lock_irq(&mapping->tree_lock);
-		err = radix_tree_insert(&mapping->page_tree, index, entry);
+
+		if (pmd_downgrade) {
+			radix_tree_delete(&mapping->page_tree, index);
+			mapping->nrexceptional--;
+			dax_wake_mapping_entry_waiter(entry, mapping, index,
+					false);
+		}
+
+		entry = RADIX_DAX_EMPTY_ENTRY(new_type);
+
+		err = __radix_tree_insert(&mapping->page_tree, index,
+				RADIX_DAX_ORDER(new_type), entry);
 		radix_tree_preload_end();
 		if (err) {
 			spin_unlock_irq(&mapping->tree_lock);
-			/* Someone already created the entry? */
-			if (err == -EEXIST)
+			/*
+			 * Someone already created the entry?  This is a
+			 * normal failure when inserting PMDs in a range
+			 * that already contains PTEs.  In that case we want
+			 * to return -EEXIST immediately.
+			 */
+			if (err == -EEXIST && new_type == RADIX_DAX_PTE)
 				goto restart;
 			return ERR_PTR(err);
 		}
@@ -596,11 +639,17 @@ static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size
 	return 0;
 }
 
-#define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_SHIFT))
-
+/*
+ * By this point grab_mapping_entry() has ensured that we have a locked entry
+ * of the appropriate size so we don't have to worry about downgrading PMDs to
+ * PTEs.  If we happen to be trying to insert a PTE and there is a PMD
+ * already in the tree, we will skip the insertion and just dirty the PMD as
+ * appropriate.
+ */
 static void *dax_insert_mapping_entry(struct address_space *mapping,
 				      struct vm_fault *vmf,
-				      void *entry, sector_t sector)
+				      void *entry, sector_t sector,
+				      unsigned long new_type, bool hzp)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
 	int error = 0;
@@ -623,22 +672,30 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 		error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
 		if (error)
 			return ERR_PTR(error);
+	} else if ((unsigned long)entry & RADIX_DAX_HZP && !hzp) {
+		/* replacing huge zero page with PMD block mapping */
+		unmap_mapping_range(mapping,
+			(vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
 	}
 
 	spin_lock_irq(&mapping->tree_lock);
-	new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
-		       RADIX_DAX_ENTRY_LOCK);
+	if (hzp)
+		new_entry = RADIX_DAX_HZP_ENTRY();
+	else
+		new_entry = RADIX_DAX_ENTRY(sector, new_type);
+
 	if (hole_fill) {
 		__delete_from_page_cache(entry, NULL);
 		/* Drop pagecache reference */
 		put_page(entry);
-		error = radix_tree_insert(page_tree, index, new_entry);
+		error = __radix_tree_insert(page_tree, index,
+				RADIX_DAX_ORDER(new_type), new_entry);
 		if (error) {
 			new_entry = ERR_PTR(error);
 			goto unlock;
 		}
 		mapping->nrexceptional++;
-	} else {
+	} else if ((unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
 		void **slot;
 		void *ret;
 
@@ -694,6 +751,18 @@ static int dax_writeback_one(struct block_device *bdev,
 		goto unlock;
 	}
 
+	if (WARN_ON_ONCE((unsigned long)entry & RADIX_DAX_EMPTY)) {
+		ret = -EIO;
+		goto unlock;
+	}
+
+	/*
+	 * Even if dax_writeback_mapping_range() was given a wbc->range_start
+	 * in the middle of a PMD, the 'index' we are given will be aligned to
+	 * the start index of the PMD, as will the sector we pull from
+	 * 'entry'.  This allows us to flush for PMD_SIZE and not have to
+	 * worry about partial PMD writebacks.
+	 */
 	dax.sector = RADIX_DAX_SECTOR(entry);
 	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
 	spin_unlock_irq(&mapping->tree_lock);
@@ -734,12 +803,11 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 		struct block_device *bdev, struct writeback_control *wbc)
 {
 	struct inode *inode = mapping->host;
-	pgoff_t start_index, end_index, pmd_index;
+	pgoff_t start_index, end_index;
 	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
 	bool done = false;
 	int i, ret = 0;
-	void *entry;
 
 	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
 		return -EIO;
@@ -749,15 +817,6 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 
 	start_index = wbc->range_start >> PAGE_SHIFT;
 	end_index = wbc->range_end >> PAGE_SHIFT;
-	pmd_index = DAX_PMD_INDEX(start_index);
-
-	rcu_read_lock();
-	entry = radix_tree_lookup(&mapping->page_tree, pmd_index);
-	rcu_read_unlock();
-
-	/* see if the start of our range is covered by a PMD entry */
-	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
-		start_index = pmd_index;
 
 	tag_pages_for_writeback(mapping, start_index, end_index);
 
@@ -802,7 +861,8 @@ static int dax_insert_mapping(struct address_space *mapping,
 		return PTR_ERR(dax.addr);
 	dax_unmap_atomic(bdev, &dax);
 
-	ret = dax_insert_mapping_entry(mapping, vmf, entry, dax.sector);
+	ret = dax_insert_mapping_entry(mapping, vmf, entry, dax.sector,
+			RADIX_DAX_PTE, false);
 	if (IS_ERR(ret))
 		return PTR_ERR(ret);
 	*entryp = ret;
@@ -849,7 +909,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	bh.b_bdev = inode->i_sb->s_bdev;
 	bh.b_size = PAGE_SIZE;
 
-	entry = grab_mapping_entry(mapping, vmf->pgoff);
+	entry = grab_mapping_entry(mapping, vmf->pgoff, RADIX_DAX_PTE);
 	if (IS_ERR(entry)) {
 		error = PTR_ERR(entry);
 		goto out;
@@ -1023,6 +1083,11 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
 EXPORT_SYMBOL_GPL(dax_truncate_page);
 
 #ifdef CONFIG_FS_IOMAP
+static inline sector_t iomap_dax_sector(struct iomap *iomap, loff_t pos)
+{
+	return iomap->blkno + (((pos & PAGE_MASK) - iomap->offset) >> 9);
+}
+
 static loff_t
 iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct iomap *iomap)
@@ -1048,8 +1113,7 @@ iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct blk_dax_ctl dax = { 0 };
 		ssize_t map_len;
 
-		dax.sector = iomap->blkno +
-			(((pos & PAGE_MASK) - iomap->offset) >> 9);
+		dax.sector = iomap_dax_sector(iomap, pos);
 		dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
 		map_len = dax_map_atomic(iomap->bdev, &dax);
 		if (map_len < 0) {
@@ -1164,7 +1228,7 @@ int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	if (pos >= i_size_read(inode))
 		return VM_FAULT_SIGBUS;
 
-	entry = grab_mapping_entry(mapping, vmf->pgoff);
+	entry = grab_mapping_entry(mapping, vmf->pgoff, RADIX_DAX_PTE);
 	if (IS_ERR(entry)) {
 		error = PTR_ERR(entry);
 		goto out;
@@ -1186,7 +1250,7 @@ int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		goto unlock_entry;
 	}
 
-	sector = iomap.blkno + (((pos & PAGE_MASK) - iomap.offset) >> 9);
+	sector = iomap_dax_sector(&iomap, pos);
 
 	if (vmf->cow_page) {
 		switch (iomap.type) {
@@ -1246,4 +1310,246 @@ int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	return VM_FAULT_NOPAGE | major;
 }
 EXPORT_SYMBOL_GPL(iomap_dax_fault);
+
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
+/*
+ * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
+ * more often than one might expect in the below functions.
+ */
+#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
+
+static void __dax_pmd_dbg(struct iomap *iomap, unsigned long address,
+		const char *reason, const char *fn)
+{
+	if (iomap) {
+		char bname[BDEVNAME_SIZE];
+
+		bdevname(iomap->bdev, bname);
+		pr_debug("%s: %s addr %lx dev %s type %#x blkno %ld "
+			"offset %lld length %lld fallback: %s\n", fn,
+			current->comm, address, bname, iomap->type,
+			iomap->blkno, iomap->offset, iomap->length, reason);
+	} else {
+		pr_debug("%s: %s addr: %lx fallback: %s\n", fn,
+			current->comm, address, reason);
+	}
+}
+
+#define dax_pmd_dbg(bh, address, reason) \
+	__dax_pmd_dbg(bh, address, reason, __func__)
+
+static int iomap_pmd_insert_mapping(struct vm_area_struct *vma, pmd_t *pmd,
+		struct vm_fault *vmf, unsigned long address,
+		struct iomap *iomap, loff_t pos, bool write, void **entryp)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct block_device *bdev = iomap->bdev;
+	struct blk_dax_ctl dax = {
+		.sector = iomap_dax_sector(iomap, pos),
+		.size = PMD_SIZE,
+	};
+	long length = dax_map_atomic(bdev, &dax);
+	void *ret;
+
+	if (length < 0) {
+		dax_pmd_dbg(iomap, address, "dax-error fallback");
+		return VM_FAULT_FALLBACK;
+	}
+	if (length < PMD_SIZE) {
+		dax_pmd_dbg(iomap, address, "dax-length too small");
+		dax_unmap_atomic(bdev, &dax);
+		return VM_FAULT_FALLBACK;
+	}
+	if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR) {
+		dax_pmd_dbg(iomap, address, "pfn unaligned");
+		dax_unmap_atomic(bdev, &dax);
+		return VM_FAULT_FALLBACK;
+	}
+	if (!pfn_t_devmap(dax.pfn)) {
+		dax_pmd_dbg(iomap, address, "pfn not in memmap");
+		dax_unmap_atomic(bdev, &dax);
+		return VM_FAULT_FALLBACK;
+	}
+	dax_unmap_atomic(bdev, &dax);
+
+	ret = dax_insert_mapping_entry(mapping, vmf, *entryp,
+			dax.sector, RADIX_DAX_PMD, false);
+	if (IS_ERR(ret)) {
+		dax_pmd_dbg(iomap, address,
+				"PMD radix insertion failed");
+		return VM_FAULT_FALLBACK;
+	}
+	*entryp = ret;
+
+	dev_dbg(part_to_dev(bdev->bd_part),
+			"%s: %s addr: %lx pfn: %lx sect: %llx\n",
+			__func__, current->comm, address,
+			pfn_t_to_pfn(dax.pfn),
+			(unsigned long long) dax.sector);
+
+	return vmf_insert_pfn_pmd(vma, address, pmd, dax.pfn, write);
+}
+
+static int iomap_pmd_load_hole(struct vm_area_struct *vma, pmd_t *pmd,
+		struct vm_fault *vmf, unsigned long address,
+		struct iomap *iomap, void **entryp)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	unsigned long pmd_addr = address & PMD_MASK;
+	struct page *zero_page;
+	spinlock_t *ptl;
+	pmd_t pmd_entry;
+	void *ret;
+
+	zero_page = get_huge_zero_page();
+
+	if (unlikely(!zero_page)) {
+		dax_pmd_dbg(iomap, address, "no zero page");
+		return VM_FAULT_FALLBACK;
+	}
+
+	ret = dax_insert_mapping_entry(mapping, vmf, *entryp,
+			0, RADIX_DAX_PMD, true);
+	if (IS_ERR(ret)) {
+		dax_pmd_dbg(iomap, address,
+				"PMD radix insertion failed");
+		return VM_FAULT_FALLBACK;
+	}
+	*entryp = ret;
+
+	ptl = pmd_lock(vma->vm_mm, pmd);
+	if (!pmd_none(*pmd)) {
+		spin_unlock(ptl);
+		dax_pmd_dbg(iomap, address, "pmd already present");
+		return VM_FAULT_FALLBACK;
+	}
+
+	dev_dbg(part_to_dev(iomap->bdev->bd_part),
+			"%s: %s addr: %lx pfn: <zero> sect: %lx\n",
+			__func__, current->comm, address, iomap->blkno);
+
+	pmd_entry = mk_pmd(zero_page, vma->vm_page_prot);
+	pmd_entry = pmd_mkhuge(pmd_entry);
+	set_pmd_at(vma->vm_mm, pmd_addr, pmd, pmd_entry);
+	spin_unlock(ptl);
+	return VM_FAULT_NOPAGE;
+}
+
+int iomap_dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
+		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	unsigned long pmd_addr = address & PMD_MASK;
+	bool write = flags & FAULT_FLAG_WRITE;
+	struct inode *inode = mapping->host;
+	struct iomap iomap = { 0 };
+	int error, result = 0;
+	pgoff_t size, pgoff;
+	struct vm_fault vmf;
+	void *entry;
+	loff_t pos;
+
+	/* dax pmd mappings require pfn_t_devmap() */
+	if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
+		return VM_FAULT_FALLBACK;
+
+	/* Fall back to PTEs if we're going to COW */
+	if (write && !(vma->vm_flags & VM_SHARED)) {
+		split_huge_pmd(vma, pmd, address);
+		dax_pmd_dbg(NULL, address, "cow write");
+		return VM_FAULT_FALLBACK;
+	}
+
+	/* If the PMD would extend outside the VMA */
+	if (pmd_addr < vma->vm_start) {
+		dax_pmd_dbg(NULL, address, "vma start unaligned");
+		return VM_FAULT_FALLBACK;
+	} else if ((pmd_addr + PMD_SIZE) > vma->vm_end) {
+		dax_pmd_dbg(NULL, address, "vma end unaligned");
+		return VM_FAULT_FALLBACK;
+	}
+
+	/*
+	 * Check whether offset isn't beyond end of file now. Caller is
+	 * supposed to hold locks serializing us with truncate / punch hole so
+	 * this is a reliable test.
+	 */
+	pgoff = linear_page_index(vma, pmd_addr);
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+
+	if (pgoff >= size)
+		return VM_FAULT_SIGBUS;
+
+	/* If the PMD would extend beyond the file size */
+	if ((pgoff | PG_PMD_COLOUR) >= size) {
+		dax_pmd_dbg(NULL, address,
+				"offset + huge page size > file size");
+		return VM_FAULT_FALLBACK;
+	}
+
+	/*
+	 * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
+	 * PMD or a HZP entry.  If it can't (because a 4k page is already in
+	 * the tree, for instance), it will return -EEXIST and we just fall
+	 * back to 4k entries.
+	 */
+	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
+	if (IS_ERR(entry)) {
+		dax_pmd_dbg(NULL, address, "failed to grab mapping entry");
+		return VM_FAULT_FALLBACK;
+	}
+
+	/*
+	 * Note that we don't use iomap_apply here.  We aren't doing I/O, only
+	 * setting up a mapping, so really we're using iomap_begin() as a way
+	 * to look up our filesystem block.
+	 */
+	pos = (loff_t)pgoff << PAGE_SHIFT;
+	error = ops->iomap_begin(inode, pos, PMD_SIZE, write ? IOMAP_WRITE : 0,
+			&iomap);
+	if (error) {
+		dax_pmd_dbg(NULL, address, "iomap_begin() failure");
+		goto fallback;
+	}
+	if (iomap.offset + iomap.length < pos + PMD_SIZE) {
+		dax_pmd_dbg(&iomap, address, "allocated block too small");
+		goto fallback;
+	}
+
+	vmf.pgoff = pgoff;
+	vmf.flags = flags;
+	vmf.gfp_mask = mapping_gfp_mask(mapping) | __GFP_FS | __GFP_IO;
+
+	switch (iomap.type) {
+	case IOMAP_MAPPED:
+		result = iomap_pmd_insert_mapping(vma, pmd, &vmf, address,
+				&iomap, pos, write, &entry);
+		break;
+	case IOMAP_UNWRITTEN:
+	case IOMAP_HOLE:
+		if (WARN_ON_ONCE(write))
+			goto fallback;
+		result = iomap_pmd_load_hole(vma, pmd, &vmf, address, &iomap,
+				&entry);
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		result = VM_FAULT_FALLBACK;
+		break;
+	}
+
+	if (result == VM_FAULT_FALLBACK)
+		count_vm_event(THP_FAULT_FALLBACK);
+
+ unlock_entry:
+	put_locked_mapping_entry(mapping, pgoff, entry);
+	return result;
+
+ fallback:
+	count_vm_event(THP_FAULT_FALLBACK);
+	result = VM_FAULT_FALLBACK;
+	goto unlock_entry;
+}
+EXPORT_SYMBOL_GPL(iomap_dax_pmd_fault);
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif /* CONFIG_FS_IOMAP */
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d9a8350..4877937 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -8,8 +8,33 @@
 
 struct iomap_ops;
 
-/* We use lowest available exceptional entry bit for locking */
+/*
+ * We use lowest available bit in exceptional entry for locking, two bits for
+ * the entry type (PMD & PTE), and two more for flags (HZP and empty).  In
+ * total five special bits.
+ */
+#define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 5)
 #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
+/* PTE and PMD types */
+#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
+#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
+/* huge zero page and empty entry flags */
+#define RADIX_DAX_HZP (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 3))
+#define RADIX_DAX_EMPTY (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 4))
+
+#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
+#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
+#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
+
+/* entries begin locked */
+#define RADIX_DAX_ENTRY(sector, type) ((void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |\
+	type | (unsigned long)sector << RADIX_DAX_SHIFT | RADIX_DAX_ENTRY_LOCK))
+#define RADIX_DAX_HZP_ENTRY() ((void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | \
+	RADIX_DAX_PMD | RADIX_DAX_HZP | RADIX_DAX_EMPTY | RADIX_DAX_ENTRY_LOCK))
+#define RADIX_DAX_EMPTY_ENTRY(type) ((void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | \
+		type | RADIX_DAX_EMPTY | RADIX_DAX_ENTRY_LOCK))
+
+#define RADIX_DAX_ORDER(type) (type == RADIX_DAX_PMD ? PMD_SHIFT-PAGE_SHIFT : 0)
 
 ssize_t iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
 		struct iomap_ops *ops);
@@ -19,6 +44,8 @@ int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			struct iomap_ops *ops);
+int iomap_dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
+		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 void dax_wake_mapping_entry_waiter(void *entry, struct address_space *mapping,
diff --git a/mm/filemap.c b/mm/filemap.c
index 35e880d..d9dd97e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -610,9 +610,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
 				workingset_node_shadows_dec(node);
 		} else {
 			/* DAX can replace empty locked entry with a hole */
-			WARN_ON_ONCE(p !=
-				(void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
-					 RADIX_DAX_ENTRY_LOCK));
+			WARN_ON_ONCE(p != RADIX_DAX_EMPTY_ENTRY(RADIX_DAX_PTE));
 			/* DAX accounts exceptional entries as normal pages */
 			if (node)
 				workingset_node_pages_dec(node);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v3 10/11] xfs: use struct iomap based DAX PMD fault path
  2016-09-27 20:47 [PATCH v3 00/11] re-enable DAX PMD support Ross Zwisler
                   ` (8 preceding siblings ...)
  2016-09-27 20:48 ` [PATCH v3 09/11] dax: add struct iomap based DAX PMD support Ross Zwisler
@ 2016-09-27 20:48 ` Ross Zwisler
  2016-09-27 20:48 ` [PATCH v3 11/11] dax: remove "depends on BROKEN" from FS_DAX_PMD Ross Zwisler
  2016-09-28  2:08 ` [PATCH v3 00/11] re-enable DAX PMD support Christoph Hellwig
  11 siblings, 0 replies; 19+ messages in thread
From: Ross Zwisler @ 2016-09-27 20:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Switch xfs_filemap_pmd_fault() from using dax_pmd_fault() to the new and
improved iomap_dax_pmd_fault().  Also, now that it has no more users,
remove xfs_get_blocks_dax_fault().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/xfs/xfs_aops.c | 25 +++++--------------------
 fs/xfs/xfs_aops.h |  3 ---
 fs/xfs/xfs_file.c |  2 +-
 3 files changed, 6 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 4a28fa9..39c754f 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1170,8 +1170,7 @@ __xfs_get_blocks(
 	sector_t		iblock,
 	struct buffer_head	*bh_result,
 	int			create,
-	bool			direct,
-	bool			dax_fault)
+	bool			direct)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
@@ -1265,12 +1264,8 @@ __xfs_get_blocks(
 		if (ISUNWRITTEN(&imap))
 			set_buffer_unwritten(bh_result);
 		/* direct IO needs special help */
-		if (create) {
-			if (dax_fault)
-				ASSERT(!ISUNWRITTEN(&imap));
-			else
-				xfs_map_direct(inode, bh_result, &imap, offset);
-		}
+		if (create)
+			xfs_map_direct(inode, bh_result, &imap, offset);
 	}
 
 	/*
@@ -1310,7 +1305,7 @@ xfs_get_blocks(
 	struct buffer_head	*bh_result,
 	int			create)
 {
-	return __xfs_get_blocks(inode, iblock, bh_result, create, false, false);
+	return __xfs_get_blocks(inode, iblock, bh_result, create, false);
 }
 
 int
@@ -1320,17 +1315,7 @@ xfs_get_blocks_direct(
 	struct buffer_head	*bh_result,
 	int			create)
 {
-	return __xfs_get_blocks(inode, iblock, bh_result, create, true, false);
-}
-
-int
-xfs_get_blocks_dax_fault(
-	struct inode		*inode,
-	sector_t		iblock,
-	struct buffer_head	*bh_result,
-	int			create)
-{
-	return __xfs_get_blocks(inode, iblock, bh_result, create, true, true);
+	return __xfs_get_blocks(inode, iblock, bh_result, create, true);
 }
 
 /*
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index 1950e3b..6779e9d 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -57,9 +57,6 @@ int	xfs_get_blocks(struct inode *inode, sector_t offset,
 		       struct buffer_head *map_bh, int create);
 int	xfs_get_blocks_direct(struct inode *inode, sector_t offset,
 			      struct buffer_head *map_bh, int create);
-int	xfs_get_blocks_dax_fault(struct inode *inode, sector_t offset,
-			         struct buffer_head *map_bh, int create);
-
 int	xfs_end_io_direct_write(struct kiocb *iocb, loff_t offset,
 		ssize_t size, void *private);
 int	xfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 882f264..e86b2be 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1539,7 +1539,7 @@ xfs_filemap_pmd_fault(
 	}
 
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
-	ret = dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault);
+	ret = iomap_dax_pmd_fault(vma, addr, pmd, flags, &xfs_iomap_ops);
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
 	if (flags & FAULT_FLAG_WRITE)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v3 11/11] dax: remove "depends on BROKEN" from FS_DAX_PMD
  2016-09-27 20:47 [PATCH v3 00/11] re-enable DAX PMD support Ross Zwisler
                   ` (9 preceding siblings ...)
  2016-09-27 20:48 ` [PATCH v3 10/11] xfs: use struct iomap based DAX PMD fault path Ross Zwisler
@ 2016-09-27 20:48 ` Ross Zwisler
  2016-09-28  2:08 ` [PATCH v3 00/11] re-enable DAX PMD support Christoph Hellwig
  11 siblings, 0 replies; 19+ messages in thread
From: Ross Zwisler @ 2016-09-27 20:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

Now that DAX PMD faults are working again and participate in DAX's radix
tree locking scheme, allow their config option to be enabled.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 2bc7ad7..b6f0fce 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -55,7 +55,6 @@ config FS_DAX_PMD
 	depends on FS_DAX
 	depends on ZONE_DEVICE
 	depends on TRANSPARENT_HUGEPAGE
-	depends on BROKEN
 
 endif # BLOCK
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 04/11] ext2: remove support for DAX PMD faults
  2016-09-27 20:47 ` [PATCH v3 04/11] ext2: remove support for DAX PMD faults Ross Zwisler
@ 2016-09-27 21:47   ` Dave Chinner
  2016-09-28 18:46     ` Ross Zwisler
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2016-09-27 21:47 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Jan Kara,
	Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Tue, Sep 27, 2016 at 02:47:55PM -0600, Ross Zwisler wrote:
> DAX PMD support was added via the following commit:
> 
> commit e7b1ea2ad658 ("ext2: huge page fault support")
> 
> I believe this path to be untested as ext2 doesn't reliably provide block
> allocations that are aligned to 2MiB.  In my testing I've been unable to
> get ext2 to actually fault in a PMD.  It always fails with a "pfn
> unaligned" message because the sector returned by ext2_get_block() isn't
> aligned.
> 
> I've tried various settings for the "stride" and "stripe_width" extended
> options to mkfs.ext2, without any luck.
> 
> Since we can't reliably get PMDs, remove support so that we don't have an
> untested code path that we may someday traverse when we happen to get an
> aligned block allocation.  This should also make 4k DAX faults in ext2 a
> bit faster since they will no longer have to call the PMD fault handler
> only to get a response of VM_FAULT_FALLBACK.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
....
> @@ -154,7 +133,6 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
>  
>  static const struct vm_operations_struct ext2_dax_vm_ops = {
>  	.fault		= ext2_dax_fault,
> -	.pmd_fault	= ext2_dax_pmd_fault,
>  	.page_mkwrite	= ext2_dax_fault,
>  	.pfn_mkwrite	= ext2_dax_pfn_mkwrite,
>  };

Would it be better to put a comment mentioning this here? So as the
years go by, this reminds people not to bother trying to implement
it?

/*
 * .pmd_fault is not supported for DAX because allocation in ext2
 * cannot be reliably aligned to huge page sizes and so pmd faults
 * will always fail and fall back to regular faults.
 */

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 09/11] dax: add struct iomap based DAX PMD support
  2016-09-27 20:48 ` [PATCH v3 09/11] dax: add struct iomap based DAX PMD support Ross Zwisler
@ 2016-09-27 22:14   ` Dave Chinner
  2016-09-29 18:20     ` Ross Zwisler
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2016-09-27 22:14 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Jan Kara,
	Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Tue, Sep 27, 2016 at 02:48:00PM -0600, Ross Zwisler wrote:
> DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
> locking.  This patch allows DAX PMDs to participate in the DAX radix tree
> based locking scheme so that they can be re-enabled using the new struct
> iomap based fault handlers.
> 
> There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
> mappings that have an associated block allocation, and 4k DAX empty
> entries.  The empty entries exist to provide locking for the duration of a
> given page fault.
> 
> This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP)
> entries, PMD DAX entries that have associated block allocations, and 2 MiB
> DAX empty entries.
> 
> Unlike the 4k case where we insert a struct page* into the radix tree for
> 4k zero pages, for HZP we insert a DAX exceptional entry with the new
> RADIX_DAX_HZP flag set.  This is because we use a single 2 MiB zero page in
> every 2MiB hole mapping, and it doesn't make sense to have that same struct
> page* with multiple entries in multiple trees.  This would cause contention
> on the single page lock for the one Huge Zero Page, and it would break the
> page->index and page->mapping associations that are assumed to be valid in
> many other places in the kernel.
> 
> One difficult use case is when one thread is trying to use 4k entries in
> radix tree for a given offset, and another thread is using 2 MiB entries
> for that same offset.  The current code handles this by making the 2 MiB
> user fall back to 4k entries for most cases.  This was done because it is
> the simplest solution, and because the use of 2MiB pages is already
> opportunistic.
> 
> If we were to try to upgrade from 4k pages to 2MiB pages for a given range,
> we run into the problem of how we lock out 4k page faults for the entire
> 2MiB range while we clean out the radix tree so we can insert the 2MiB
> entry.  We can solve this problem if we need to, but I think that the cases
> where both 2MiB entries and 4K entries are being used for the same range
> will be rare enough and the gain small enough that it probably won't be
> worth the complexity.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
....
> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
> +/*
> + * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
> + * more often than one might expect in the below functions.
> + */
> +#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
> +
> +static void __dax_pmd_dbg(struct iomap *iomap, unsigned long address,
> +		const char *reason, const char *fn)
> +{
> +	if (iomap) {
> +		char bname[BDEVNAME_SIZE];
> +
> +		bdevname(iomap->bdev, bname);
> +		pr_debug("%s: %s addr %lx dev %s type %#x blkno %ld "
> +			"offset %lld length %lld fallback: %s\n", fn,
> +			current->comm, address, bname, iomap->type,
> +			iomap->blkno, iomap->offset, iomap->length, reason);
> +	} else {
> +		pr_debug("%s: %s addr: %lx fallback: %s\n", fn,
> +			current->comm, address, reason);
> +	}
> +}

Yuck! Tracepoints for debugging information like this, please, not
printk awfulness.

> +
> +#define dax_pmd_dbg(bh, address, reason) \
> +	__dax_pmd_dbg(bh, address, reason, __func__)
> +
> +static int iomap_pmd_insert_mapping(struct vm_area_struct *vma, pmd_t *pmd,
> +		struct vm_fault *vmf, unsigned long address,
> +		struct iomap *iomap, loff_t pos, bool write, void **entryp)

Please put a "dax" in the function name. grepping, cscope, etc are
much easier when static function names are namespaced properly.

> +{
> +	struct address_space *mapping = vma->vm_file->f_mapping;
> +	struct block_device *bdev = iomap->bdev;
> +	struct blk_dax_ctl dax = {
> +		.sector = iomap_dax_sector(iomap, pos),
> +		.size = PMD_SIZE,
> +	};
> +	long length = dax_map_atomic(bdev, &dax);
> +	void *ret;
> +
> +	if (length < 0) {
> +		dax_pmd_dbg(iomap, address, "dax-error fallback");
> +		return VM_FAULT_FALLBACK;
> +	}

Fails to unmap. Please use a goto based error stack. And
tracepoints make this much neater:

	trace_dax_pmd_insert_mapping(iomap, address, &dax, length);
	if (length < 0)
		goto unmap_fallback;
	if (length < PMD_SIZE)
		goto unmap_fallback;
	.....

	trace_dax_pmd_insert_mapping_done(iomap, address, &dax, length);
	return vmf_insert_pfn_pmd(vma, address, pmd, dax.pfn, write);

unmap_fallback:
	dax_unmap_atomic(bdev, &dax);
fallback:
	trace_dax_pmd_insert_fallback(iomap, address, &dax, length);
	return VM_FAULT_FALLBACK;
}

i.e. we don't need all those debug printks to tell us what
failed - the first tracepoint tells us everything about the context
we are about to check, and the last tracepoint tells us whether we
are falling back or about to try mapping a PMD.

If you really need custom printk output for debugging, then use
trace_printk() so that it shows up in the trace output along with
all the trace points....

Same goes for all the other pr_debug() calls in this code - they need
to go and be replaced with tracepoints.
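
For illustration, a minimal tracepoint definition along these lines
(the event name and fields here are hypothetical, and the usual
TRACE_SYSTEM/header boilerplate is omitted) would live in a new
trace header:

TRACE_EVENT(dax_pmd_insert_mapping,
	TP_PROTO(struct inode *inode, unsigned long address,
		 sector_t sector, long length),
	TP_ARGS(inode, address, sector, length),
	TP_STRUCT__entry(
		__field(unsigned long, ino)
		__field(unsigned long, address)
		__field(sector_t, sector)
		__field(long, length)
	),
	TP_fast_assign(
		__entry->ino = inode->i_ino;
		__entry->address = address;
		__entry->sector = sector;
		__entry->length = length;
	),
	TP_printk("ino %#lx addr %#lx sect %llx len %ld",
		__entry->ino, __entry->address,
		(unsigned long long)__entry->sector, __entry->length)
);

The events can then be switched on at runtime through
/sys/kernel/debug/tracing/events/ rather than rebuilding with
pr_debug() output compiled in.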

> +int iomap_dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> +		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops)

dax_iomap_pmd_fault() - dax_ is the namespace prefix for the code in
fs/dax.c, not iomap_...

> +{
> +	struct address_space *mapping = vma->vm_file->f_mapping;
> +	unsigned long pmd_addr = address & PMD_MASK;
> +	bool write = flags & FAULT_FLAG_WRITE;
> +	struct inode *inode = mapping->host;
> +	struct iomap iomap = { 0 };
> +	int error, result = 0;
> +	pgoff_t size, pgoff;
> +	struct vm_fault vmf;
> +	void *entry;
> +	loff_t pos;
> +
> +	/* dax pmd mappings require pfn_t_devmap() */
> +	if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
> +		return VM_FAULT_FALLBACK;

So we build all this stuff in, even if CONFIG_FS_DAX_PMD=n?
Shouldn't we just have a simple function that returns
VM_FAULT_FALLBACK when CONFIG_FS_DAX_PMD=n?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 00/11] re-enable DAX PMD support
  2016-09-27 20:47 [PATCH v3 00/11] re-enable DAX PMD support Ross Zwisler
                   ` (10 preceding siblings ...)
  2016-09-27 20:48 ` [PATCH v3 11/11] dax: remove "depends on BROKEN" from FS_DAX_PMD Ross Zwisler
@ 2016-09-28  2:08 ` Christoph Hellwig
  2016-09-28  4:55   ` Dave Chinner
  11 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2016-09-28  2:08 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Tue, Sep 27, 2016 at 02:47:51PM -0600, Ross Zwisler wrote:
> DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
> locking.  This series allows DAX PMDs to participate in the DAX radix tree
> based locking scheme so that they can be re-enabled.
> 
> Jan and Christoph, can you please help review these changes?

About to get on a plane, so it might take a bit to do a real review.
In general this looks fine, but I guess the first two ext4 patches
should just go straight to Ted independent of the rest?

Also Jan just posted a giant DAX patchbomb, we'll need to find a way
to integrate all that work, and maybe prioritize things if we want
to get bits into 4.9 still.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 00/11] re-enable DAX PMD support
  2016-09-28  2:08 ` [PATCH v3 00/11] re-enable DAX PMD support Christoph Hellwig
@ 2016-09-28  4:55   ` Dave Chinner
  2016-09-29 18:23     ` Ross Zwisler
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2016-09-28  4:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Christoph Hellwig, Dan Williams,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Tue, Sep 27, 2016 at 07:08:42PM -0700, Christoph Hellwig wrote:
> On Tue, Sep 27, 2016 at 02:47:51PM -0600, Ross Zwisler wrote:
> > DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
> > locking.  This series allows DAX PMDs to participate in the DAX radix tree
> > based locking scheme so that they can be re-enabled.
> > 
> > Jan and Christoph, can you please help review these changes?
> 
> About to get on a plane, so it might take a bit to do a real review.
> In general this looks fine, but I guess the first two ext4 patches
> should just go straight to Ted independent of the rest?
> 
> Also Jan just posted a giant DAX patchbomb, we'll need to find a way
> to integrate all that work, and maybe prioritize things if we want
> to get bits into 4.9 still.

I'm not going to have time to do much review or testing of the DAX
changes (apart from the cursory comments I've already made) because
of the huge pile of XFS reflink changes I've got to get through
first. However, I've already got the iomap dax bits in the XFS tree
so I can pull everything through there if review and testing is
covered otherwise......

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 04/11] ext2: remove support for DAX PMD faults
  2016-09-27 21:47   ` Dave Chinner
@ 2016-09-28 18:46     ` Ross Zwisler
  0 siblings, 0 replies; 19+ messages in thread
From: Ross Zwisler @ 2016-09-28 18:46 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Christoph Hellwig, Dan Williams,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Wed, Sep 28, 2016 at 07:47:20AM +1000, Dave Chinner wrote:
> On Tue, Sep 27, 2016 at 02:47:55PM -0600, Ross Zwisler wrote:
> > DAX PMD support was added via the following commit:
> > 
> > commit e7b1ea2ad658 ("ext2: huge page fault support")
> > 
> > I believe this path to be untested as ext2 doesn't reliably provide block
> > allocations that are aligned to 2MiB.  In my testing I've been unable to
> > get ext2 to actually fault in a PMD.  It always fails with a "pfn
> > unaligned" message because the sector returned by ext2_get_block() isn't
> > aligned.
> > 
> > I've tried various settings for the "stride" and "stripe_width" extended
> > options to mkfs.ext2, without any luck.
> > 
> > Since we can't reliably get PMDs, remove support so that we don't have an
> > untested code path that we may someday traverse when we happen to get an
> > aligned block allocation.  This should also make 4k DAX faults in ext2 a
> > bit faster since they will no longer have to call the PMD fault handler
> > only to get a response of VM_FAULT_FALLBACK.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ....
> > @@ -154,7 +133,6 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
> >  
> >  static const struct vm_operations_struct ext2_dax_vm_ops = {
> >  	.fault		= ext2_dax_fault,
> > -	.pmd_fault	= ext2_dax_pmd_fault,
> >  	.page_mkwrite	= ext2_dax_fault,
> >  	.pfn_mkwrite	= ext2_dax_pfn_mkwrite,
> >  };
> 
> Would it be better to put a comment mentioning this here? So as the
> years go by, this reminds people not to bother trying to implement
> it?
> 
> /*
>  * .pmd_fault is not supported for DAX because allocation in ext2
>  * cannot be reliably aligned to huge page sizes and so pmd faults
>  * will always fail and fall back to regular faults.
>  */

Sure, this seems like a good idea.  I'll add it, thanks.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 09/11] dax: add struct iomap based DAX PMD support
  2016-09-27 22:14   ` Dave Chinner
@ 2016-09-29 18:20     ` Ross Zwisler
  0 siblings, 0 replies; 19+ messages in thread
From: Ross Zwisler @ 2016-09-29 18:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Christoph Hellwig, Dan Williams,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, linux-xfs

On Wed, Sep 28, 2016 at 08:14:24AM +1000, Dave Chinner wrote:
> On Tue, Sep 27, 2016 at 02:48:00PM -0600, Ross Zwisler wrote:
> > DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
> > locking.  This patch allows DAX PMDs to participate in the DAX radix tree
> > based locking scheme so that they can be re-enabled using the new struct
> > iomap based fault handlers.
> > 
> > There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
> > mappings that have an associated block allocation, and 4k DAX empty
> > entries.  The empty entries exist to provide locking for the duration of a
> > given page fault.
> > 
> > This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP)
> > entries, PMD DAX entries that have associated block allocations, and 2 MiB
> > DAX empty entries.
> > 
> > Unlike the 4k case where we insert a struct page* into the radix tree for
> > 4k zero pages, for HZP we insert a DAX exceptional entry with the new
> > RADIX_DAX_HZP flag set.  This is because we use a single 2 MiB zero page in
> > every 2MiB hole mapping, and it doesn't make sense to have that same struct
> > page* with multiple entries in multiple trees.  This would cause contention
> > on the single page lock for the one Huge Zero Page, and it would break the
> > page->index and page->mapping associations that are assumed to be valid in
> > many other places in the kernel.
> > 
> > One difficult use case is when one thread is trying to use 4k entries in
> > radix tree for a given offset, and another thread is using 2 MiB entries
> > for that same offset.  The current code handles this by making the 2 MiB
> > user fall back to 4k entries for most cases.  This was done because it is
> > the simplest solution, and because the use of 2MiB pages is already
> > opportunistic.
> > 
> > If we were to try to upgrade from 4k pages to 2MiB pages for a given range,
> > we run into the problem of how we lock out 4k page faults for the entire
> > 2MiB range while we clean out the radix tree so we can insert the 2MiB
> > entry.  We can solve this problem if we need to, but I think that the cases
> > where both 2MiB entries and 4K entries are being used for the same range
> > will be rare enough and the gain small enough that it probably won't be
> > worth the complexity.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ....
> > +#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
> > +/*
> > + * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
> > + * more often than one might expect in the below functions.
> > + */
> > +#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
> > +
> > +static void __dax_pmd_dbg(struct iomap *iomap, unsigned long address,
> > +		const char *reason, const char *fn)
> > +{
> > +	if (iomap) {
> > +		char bname[BDEVNAME_SIZE];
> > +
> > +		bdevname(iomap->bdev, bname);
> > +		pr_debug("%s: %s addr %lx dev %s type %#x blkno %ld "
> > +			"offset %lld length %lld fallback: %s\n", fn,
> > +			current->comm, address, bname, iomap->type,
> > +			iomap->blkno, iomap->offset, iomap->length, reason);
> > +	} else {
> > +		pr_debug("%s: %s addr: %lx fallback: %s\n", fn,
> > +			current->comm, address, reason);
> > +	}
> > +}
> 
> Yuck! Tracepoints for debugging information like this, please, not
> printk awfulness.

I was just recreating the debugging scheme used in the old PMD code.
I'll check out tracepoints.

> > +
> > +#define dax_pmd_dbg(bh, address, reason) \
> > +	__dax_pmd_dbg(bh, address, reason, __func__)
> > +
> > +static int iomap_pmd_insert_mapping(struct vm_area_struct *vma, pmd_t *pmd,
> > +		struct vm_fault *vmf, unsigned long address,
> > +		struct iomap *iomap, loff_t pos, bool write, void **entryp)
> 
> Please put a "dax" in the function name. grepping, cscope, etc are
> much easier when static function names are namespaced properly.

Yea, namespacing for static functions is a bit hit and miss, especially in the
dax code.  (see buffer_written(), to_sector(), slot_locked(), etc.)  Poking
around in the XFS code, though, it looks like everything starts with "xfs_".
I'll add the leading "dax_".

> > +{
> > +	struct address_space *mapping = vma->vm_file->f_mapping;
> > +	struct block_device *bdev = iomap->bdev;
> > +	struct blk_dax_ctl dax = {
> > +		.sector = iomap_dax_sector(iomap, pos),
> > +		.size = PMD_SIZE,
> > +	};
> > +	long length = dax_map_atomic(bdev, &dax);
> > +	void *ret;
> > +
> > +	if (length < 0) {
> > +		dax_pmd_dbg(iomap, address, "dax-error fallback");
> > +		return VM_FAULT_FALLBACK;
> > +	}
> 
> Fails to unmap. 

This is the case where dax_map_atomic() itself failed, so there is no
mapping to unmap at this point.

> Please use a goto based error stack. And
> tracepoints make this much neater:
> 
> 	trace_dax_pmd_insert_mapping(iomap, address, &dax, length);
> 	if (length < 0)
> 		goto unmap_fallback;
> 	if (length < PMD_SIZE)
> 		goto unmap_fallback;
> 	.....
> 
> 	trace_dax_pmd_insert_mapping_done(iomap, address, &dax, length);
> 	return vmf_insert_pfn_pmd(vma, address, pmd, dax.pfn, write);
> 
> unmap_fallback:
> 	dax_unmap_atomic(bdev, &dax);
> fallback:
> 	trace_dax_pmd_insert_fallback(iomap, address, &dax, length);
> 	return VM_FAULT_FALLBACK;
> }
> 
> i.e. we don't need all those debug printks to tell us what
> failed - the first tracepoint tells us everything about the context
> we are about to check, and the last tracepoint tells us whether we
> are falling back or about to try mapping a PMD.
> 
> If you really need custom printk output for debugging, then use
> trace_printk() so that it shows up in the trace output along with
> all the trace points....
> 
> Same goes for all the other pr_debug() calls in this code - they need
> to go and be replaced with tracepoints.

Cool, I'll look into making this simpler.

> > +int iomap_dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> > +		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops)
> 
> dax_iomap_pmd_fault() - dax_ is the namespace prefix for the code in
> fs/dax.c, not iomap_...

I was just trying to be consistent with Christoph's dax iomap code.  :)  I'll
change both his and my functions to be properly namespaced as 'dax_iomap_'.

> > +{
> > +	struct address_space *mapping = vma->vm_file->f_mapping;
> > +	unsigned long pmd_addr = address & PMD_MASK;
> > +	bool write = flags & FAULT_FLAG_WRITE;
> > +	struct inode *inode = mapping->host;
> > +	struct iomap iomap = { 0 };
> > +	int error, result = 0;
> > +	pgoff_t size, pgoff;
> > +	struct vm_fault vmf;
> > +	void *entry;
> > +	loff_t pos;
> > +
> > +	/* dax pmd mappings require pfn_t_devmap() */
> > +	if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
> > +		return VM_FAULT_FALLBACK;
> 
> So we build all this stuff in, even if CONFIG_FS_DAX_PMD=n?
> Shouldn't we just have a simple function that returns
> VM_FAULT_FALLBACK when CONFIG_FS_DAX_PMD=n?

Well, not really.  If CONFIG_FS_DAX_PMD isn't defined, the compiler notices
that we have an unconditional return and optimizes out the rest of the
function.  It effectively becomes a stub that does an unconditional "return
VM_FAULT_FALLBACK;".

Here is the generated code for iomap_dax_pmd_fault() when CONFIG_FS_DAX_PMD
isn't defined:

0000000000000000 <iomap_dax_pmd_fault>:
       0:       e8 00 00 00 00          callq  5 <iomap_dax_pmd_fault+0x5>
       5:       55                      push   %rbp
       6:       b8 00 08 00 00          mov    $0x800,%eax
       b:       48 89 e5                mov    %rsp,%rbp
       e:       5d                      pop    %rbp
       f:       c3                      retq

Where the 0x800 in there is VM_FAULT_FALLBACK.

However, I already need to make a stub for the PMD fault handler in dax.h for
configs where CONFIG_TRANSPARENT_HUGEPAGE isn't defined.  This stub is just:

#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
int iomap_dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops);
#else
static inline int iomap_dax_pmd_fault(struct vm_area_struct *vma,
		unsigned long address, pmd_t *pmd, unsigned int flags,
		struct iomap_ops *ops)
{
	return VM_FAULT_FALLBACK;
}
#endif

It's probably more readable if we just use this stub when CONFIG_FS_DAX_PMD
isn't defined.  I'll fix this for v4.
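
Roughly along these lines (a sketch of the direction, not the final v4
code):

#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_FS_DAX_PMD)
int iomap_dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops);
#else
static inline int iomap_dax_pmd_fault(struct vm_area_struct *vma,
		unsigned long address, pmd_t *pmd, unsigned int flags,
		struct iomap_ops *ops)
{
	/* without CONFIG_FS_DAX_PMD, always fall back to PTE faults */
	return VM_FAULT_FALLBACK;
}
#endif

That way the IS_ENABLED() check and the dead PMD code disappear from
fs/dax.c entirely.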

Thank you for the review!

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 00/11] re-enable DAX PMD support
  2016-09-28  4:55   ` Dave Chinner
@ 2016-09-29 18:23     ` Ross Zwisler
  0 siblings, 0 replies; 19+ messages in thread
From: Ross Zwisler @ 2016-09-29 18:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Ross Zwisler, linux-kernel, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Christoph Hellwig,
	Dan Williams, Jan Kara, Matthew Wilcox, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, linux-xfs

On Wed, Sep 28, 2016 at 02:55:50PM +1000, Dave Chinner wrote:
> On Tue, Sep 27, 2016 at 07:08:42PM -0700, Christoph Hellwig wrote:
> > On Tue, Sep 27, 2016 at 02:47:51PM -0600, Ross Zwisler wrote:
> > > DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
> > > locking.  This series allows DAX PMDs to participate in the DAX radix tree
> > > based locking scheme so that they can be re-enabled.
> > > 
> > > Jan and Christoph, can you please help review these changes?
> > 
> > About to get on a plane, so it might take a bit to do a real review.
> > In general this looks fine, but I guess the first two ext4 patches
> > should just go straight to Ted independent of the rest?
> > 
> > Also Jan just posted a giant DAX patchbomb, we'll need to find a way
> > to integrate all that work, and maybe prioritize things if we want
> > to get bits into 4.9 still.
> 
> I'm not going to have time to do much review or testing of the DAX
> changes (apart from the cursory comments I've already made) because
> of the huge pile of XFS reflink changes I've got to get through
> first. However, I've already got the iomap dax bits in the XFS tree
> so I can pull everything through there if review and testing is
> covered otherwise......

Frankly I'd love it if my changes could make it into v4.9 through the XFS
tree.  They've passed xfstests both with and without DAX, and they've passed
all the targeted testing I've been able to throw at them.  If that works, we
can integrate Jan's changes on top of them during the v4.9 cycle and merge for
v4.10.

I'll work on incorporating changes for your feedback, Dave, and hopefully have
a v4 out by the end of the day.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2016-09-29 18:23 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-27 20:47 [PATCH v3 00/11] re-enable DAX PMD support Ross Zwisler
2016-09-27 20:47 ` [PATCH v3 01/11] ext4: allow DAX writeback for hole punch Ross Zwisler
2016-09-27 20:47 ` [PATCH v3 02/11] ext4: tell DAX the size of allocation holes Ross Zwisler
2016-09-27 20:47 ` [PATCH v3 03/11] dax: remove buffer_size_valid() Ross Zwisler
2016-09-27 20:47 ` [PATCH v3 04/11] ext2: remove support for DAX PMD faults Ross Zwisler
2016-09-27 21:47   ` Dave Chinner
2016-09-28 18:46     ` Ross Zwisler
2016-09-27 20:47 ` [PATCH v3 05/11] dax: make 'wait_table' global variable static Ross Zwisler
2016-09-27 20:47 ` [PATCH v3 06/11] dax: consistent variable naming for DAX entries Ross Zwisler
2016-09-27 20:47 ` [PATCH v3 07/11] dax: coordinate locking for offsets in PMD range Ross Zwisler
2016-09-27 20:47 ` [PATCH v3 08/11] dax: remove dax_pmd_fault() Ross Zwisler
2016-09-27 20:48 ` [PATCH v3 09/11] dax: add struct iomap based DAX PMD support Ross Zwisler
2016-09-27 22:14   ` Dave Chinner
2016-09-29 18:20     ` Ross Zwisler
2016-09-27 20:48 ` [PATCH v3 10/11] xfs: use struct iomap based DAX PMD fault path Ross Zwisler
2016-09-27 20:48 ` [PATCH v3 11/11] dax: remove "depends on BROKEN" from FS_DAX_PMD Ross Zwisler
2016-09-28  2:08 ` [PATCH v3 00/11] re-enable DAX PMD support Christoph Hellwig
2016-09-28  4:55   ` Dave Chinner
2016-09-29 18:23     ` Ross Zwisler
