* [PATCH 0/2] Fix dax races between page faults RFC only
@ 2016-01-27  4:08 Matthew Wilcox
  2016-01-27  4:08 ` [PATCH 1/2] mm,fs,dax: Change ->pmd_fault to ->huge_fault Matthew Wilcox
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Matthew Wilcox @ 2016-01-27  4:08 UTC (permalink / raw)
  To: Jan Kara, Dave Chinner, Ross Zwisler; +Cc: Matthew Wilcox, linux-fsdevel

From: Matthew Wilcox <willy@linux.intel.com>

I haven't had time to boot these patches yet, much less split them apart
into properly reviewable chunks.  I just wanted to get the current state
of my tree out before Dave & Jan wake up.

The first patch you've seen before; it was patch 2/8 in the PUD patch
series I posted on January 6th.  It seemed like a good place to start
since it unifies a lot of fault handling.

The second patch is everything else we've talked about rolled into one.
It looks pretty good to me; after I make sure that it boots and passes
some smoke tests, I'll add the optimisation Jan & I discussed today
about trying to reduce the number of times we have to take the allocation
lock in exclusive/write mode.

If I understand the current state of the code correctly, truncate can't
race with the fault handler, so the re-checks we do of i_size are now
dead code that can be deleted, right?

Matthew Wilcox (2):
  mm,fs,dax: Change ->pmd_fault to ->huge_fault
  dax: Giant hack

 Documentation/filesystems/Locking |   8 -
 Documentation/filesystems/dax.txt |  11 +-
 fs/block_dev.c                    |  12 +-
 fs/dax.c                          | 516 +++++++++++++++++++++++---------------
 fs/ext2/file.c                    |  58 +----
 fs/ext4/file.c                    | 116 +++------
 fs/xfs/xfs_file.c                 | 102 ++------
 fs/xfs/xfs_trace.h                |   2 -
 include/linux/dax.h               |  19 +-
 include/linux/mm.h                |  22 +-
 mm/memory.c                       |  63 ++---
 mm/mmap.c                         |   2 +-
 12 files changed, 411 insertions(+), 520 deletions(-)

-- 
2.7.0.rc3



* [PATCH 1/2] mm,fs,dax: Change ->pmd_fault to ->huge_fault
  2016-01-27  4:08 [PATCH 0/2] Fix dax races between page faults RFC only Matthew Wilcox
@ 2016-01-27  4:08 ` Matthew Wilcox
  2016-01-27  5:48   ` kbuild test robot
                     ` (2 more replies)
  2016-01-27  4:08 ` [PATCH 2/2] dax: Giant hack Matthew Wilcox
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 13+ messages in thread
From: Matthew Wilcox @ 2016-01-27  4:08 UTC (permalink / raw)
  To: Jan Kara, Dave Chinner, Ross Zwisler; +Cc: Matthew Wilcox, linux-fsdevel

From: Matthew Wilcox <willy@linux.intel.com>

In preparation for adding the ability to handle PUD pages, convert
->pmd_fault to ->huge_fault.  huge_fault() takes a vm_fault structure
instead of separate (address, pmd, flags) parameters.  The vm_fault
structure is extended to include a union of the different page table
pointers that may be needed, and three flag bits are reserved to indicate
which type of pointer is in the union.

The DAX fault handlers are unified into one entry point, meaning that
the filesystems can be largely unconcerned with what size of fault they
are handling.  ext4 needs to know in order to reserve enough blocks in
the journal, but ext2 and xfs are oblivious.
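
For illustration only (this sketch is not part of the patch): with the
unified entry point, a filesystem handler can route every fault size
through the same dax_fault() call.  example_get_block and example_dax_sem
are hypothetical stand-ins for the filesystem's own get_block callback
and DAX allocation lock.

static int example_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct inode *inode = file_inode(vma->vm_file);
	int ret;

	if (vmf->flags & FAULT_FLAG_WRITE) {
		sb_start_pagefault(inode->i_sb);
		file_update_time(vma->vm_file);
	}

	down_read(&example_dax_sem);
	/* dax_fault() dispatches on vmf->flags & FAULT_FLAG_SIZE_MASK */
	ret = dax_fault(vma, vmf, example_get_block, NULL);
	up_read(&example_dax_sem);

	if (vmf->flags & FAULT_FLAG_WRITE)
		sb_end_pagefault(inode->i_sb);
	return ret;
}

static const struct vm_operations_struct example_dax_vm_ops = {
	.fault		= example_dax_fault,
	.huge_fault	= example_dax_fault,	/* same handler for PTE and PMD faults */
};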

The existing dax_fault and dax_mkwrite had no callers, so rename
__dax_fault and __dax_mkwrite to lose the initial underscores.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 Documentation/filesystems/dax.txt | 12 +++--
 fs/block_dev.c                    | 10 +---
 fs/dax.c                          | 97 +++++++++++++--------------------------
 fs/ext2/file.c                    | 27 ++---------
 fs/ext4/file.c                    | 56 +++++++---------------
 fs/xfs/xfs_file.c                 | 25 +++++-----
 fs/xfs/xfs_trace.h                |  2 +-
 include/linux/dax.h               | 17 -------
 include/linux/mm.h                | 20 ++++++--
 mm/memory.c                       | 20 ++++++--
 10 files changed, 102 insertions(+), 184 deletions(-)

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index 7bde640..2fe9e74 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -49,6 +49,7 @@ These block devices may be used for inspiration:
 - axonram: Axon DDR2 device driver
 - brd: RAM backed block device driver
 - dcssblk: s390 dcss block device driver
+- pmem: NV-DIMM Persistent Memory driver
 
 
 Implementation Tips for Filesystem Writers
@@ -61,9 +62,9 @@ Filesystem support consists of
   dax_do_io() instead of blockdev_direct_IO() if S_DAX is set
 - implementing an mmap file operation for DAX files which sets the
   VM_MIXEDMAP and VM_HUGEPAGE flags on the VMA, and setting the vm_ops to
-  include handlers for fault, pmd_fault and page_mkwrite (which should
-  probably call dax_fault(), dax_pmd_fault() and dax_mkwrite(), passing the
-  appropriate get_block() callback)
+  include handlers for fault, huge_fault and page_mkwrite (which should
+  probably call dax_fault() and dax_mkwrite(), passing the appropriate
+  get_block() callback)
 - calling dax_truncate_page() instead of block_truncate_page() for DAX files
 - calling dax_zero_page_range() instead of zero_user() for DAX files
 - ensuring that there is sufficient locking between reads, writes,
@@ -75,8 +76,9 @@ calls to get_block() (for example by a page-fault racing with a read()
 or a write()) work correctly.
 
 These filesystems may be used for inspiration:
-- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
-- ext4: the fourth extended filesystem, see Documentation/filesystems/ext4.txt
+- ext2: see Documentation/filesystems/ext2.txt
+- ext4: see Documentation/filesystems/ext4.txt
+- xfs: see Documentation/filesystems/xfs.txt
 
 
 Shortcomings
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 60895e5..a9474ac 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1727,13 +1727,7 @@ static const struct address_space_operations def_blk_aops = {
  */
 static int blkdev_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	return __dax_fault(vma, vmf, blkdev_get_block, NULL);
-}
-
-static int blkdev_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
-		pmd_t *pmd, unsigned int flags)
-{
-	return __dax_pmd_fault(vma, addr, pmd, flags, blkdev_get_block, NULL);
+	return dax_fault(vma, vmf, blkdev_get_block, NULL);
 }
 
 static void blkdev_vm_open(struct vm_area_struct *vma)
@@ -1760,7 +1754,7 @@ static const struct vm_operations_struct blkdev_dax_vm_ops = {
 	.open		= blkdev_vm_open,
 	.close		= blkdev_vm_close,
 	.fault		= blkdev_dax_fault,
-	.pmd_fault	= blkdev_dax_pmd_fault,
+	.huge_fault	= blkdev_dax_fault,
 	.pfn_mkwrite	= blkdev_dax_fault,
 };
 
diff --git a/fs/dax.c b/fs/dax.c
index 206650f..dbaf62c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -571,23 +571,7 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	return error;
 }
 
-/**
- * __dax_fault - handle a page fault on a DAX file
- * @vma: The virtual memory area where the fault occurred
- * @vmf: The description of the fault
- * @get_block: The filesystem method used to translate file offsets to blocks
- * @complete_unwritten: The filesystem method used to convert unwritten blocks
- *	to written so the data written to them is exposed. This is required for
- *	required by write faults for filesystems that will return unwritten
- *	extent mappings from @get_block, but it is optional for reads as
- *	dax_insert_mapping() will always zero unwritten blocks. If the fs does
- *	not support unwritten extents, the it should pass NULL.
- *
- * When a page fault occurs, filesystems may call this helper in their
- * fault handler for DAX files. __dax_fault() assumes the caller has done all
- * the necessary locking for the page fault to proceed successfully.
- */
-int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+static int dax_pte_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			get_block_t get_block, dax_iodone_t complete_unwritten)
 {
 	struct file *file = vma->vm_file;
@@ -724,34 +708,6 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	}
 	goto out;
 }
-EXPORT_SYMBOL(__dax_fault);
-
-/**
- * dax_fault - handle a page fault on a DAX file
- * @vma: The virtual memory area where the fault occurred
- * @vmf: The description of the fault
- * @get_block: The filesystem method used to translate file offsets to blocks
- *
- * When a page fault occurs, filesystems may call this helper in their
- * fault handler for DAX files.
- */
-int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
-	      get_block_t get_block, dax_iodone_t complete_unwritten)
-{
-	int result;
-	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
-
-	if (vmf->flags & FAULT_FLAG_WRITE) {
-		sb_start_pagefault(sb);
-		file_update_time(vma->vm_file);
-	}
-	result = __dax_fault(vma, vmf, get_block, complete_unwritten);
-	if (vmf->flags & FAULT_FLAG_WRITE)
-		sb_end_pagefault(sb);
-
-	return result;
-}
-EXPORT_SYMBOL_GPL(dax_fault);
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 /*
@@ -778,7 +734,7 @@ static void __dax_dbg(struct buffer_head *bh, unsigned long address,
 
 #define dax_pmd_dbg(bh, address, reason)	__dax_dbg(bh, address, reason, "dax_pmd")
 
-int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
+static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		pmd_t *pmd, unsigned int flags, get_block_t get_block,
 		dax_iodone_t complete_unwritten)
 {
@@ -994,37 +950,46 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	result = VM_FAULT_FALLBACK;
 	goto out;
 }
-EXPORT_SYMBOL_GPL(__dax_pmd_fault);
+#else /* !CONFIG_TRANSPARENT_HUGEPAGE */
+static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
+		pmd_t *pmd, unsigned int flags, get_block_t get_block,
+		dax_iodone_t complete_unwritten)
+{
+	return VM_FAULT_FALLBACK;
+}
+#endif /* !CONFIG_TRANSPARENT_HUGEPAGE */
 
 /**
- * dax_pmd_fault - handle a PMD fault on a DAX file
+ * dax_fault - handle a page fault on a DAX file
  * @vma: The virtual memory area where the fault occurred
  * @vmf: The description of the fault
  * @get_block: The filesystem method used to translate file offsets to blocks
+ * @iodone: The filesystem method used to convert unwritten blocks
+ *	to written so the data written to them is exposed.  This is required
+ *	by write faults for filesystems that will return unwritten extent
+ *	mappings from @get_block, but it is optional for reads as
+ *	dax_insert_mapping() will always zero unwritten blocks.  If the fs
+ *	does not support unwritten extents, then it should pass NULL.
  *
  * When a page fault occurs, filesystems may call this helper in their
- * pmd_fault handler for DAX files.
+ * fault handler for DAX files.  dax_fault() assumes the caller has done all
+ * the necessary locking for the page fault to proceed successfully.
  */
-int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
-			pmd_t *pmd, unsigned int flags, get_block_t get_block,
-			dax_iodone_t complete_unwritten)
+int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+		get_block_t get_block, dax_iodone_t iodone)
 {
-	int result;
-	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
-
-	if (flags & FAULT_FLAG_WRITE) {
-		sb_start_pagefault(sb);
-		file_update_time(vma->vm_file);
+	unsigned long address = (unsigned long)vmf->virtual_address;
+	switch (vmf->flags & FAULT_FLAG_SIZE_MASK) {
+	case FAULT_FLAG_SIZE_PTE:
+		return dax_pte_fault(vma, vmf, get_block, iodone);
+	case FAULT_FLAG_SIZE_PMD:
+		return dax_pmd_fault(vma, address, vmf->pmd, vmf->flags,
+						get_block, iodone);
+	default:
+		return VM_FAULT_FALLBACK;
 	}
-	result = __dax_pmd_fault(vma, address, pmd, flags, get_block,
-				complete_unwritten);
-	if (flags & FAULT_FLAG_WRITE)
-		sb_end_pagefault(sb);
-
-	return result;
 }
-EXPORT_SYMBOL_GPL(dax_pmd_fault);
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+EXPORT_SYMBOL_GPL(dax_fault);
 
 /**
  * dax_pfn_mkwrite - handle first write to DAX page
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 2c88d68..cf6f78c 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -51,7 +51,7 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	}
 	down_read(&ei->dax_sem);
 
-	ret = __dax_fault(vma, vmf, ext2_get_block, NULL);
+	ret = dax_fault(vma, vmf, ext2_get_block, NULL);
 
 	up_read(&ei->dax_sem);
 	if (vmf->flags & FAULT_FLAG_WRITE)
@@ -59,27 +59,6 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	return ret;
 }
 
-static int ext2_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
-						pmd_t *pmd, unsigned int flags)
-{
-	struct inode *inode = file_inode(vma->vm_file);
-	struct ext2_inode_info *ei = EXT2_I(inode);
-	int ret;
-
-	if (flags & FAULT_FLAG_WRITE) {
-		sb_start_pagefault(inode->i_sb);
-		file_update_time(vma->vm_file);
-	}
-	down_read(&ei->dax_sem);
-
-	ret = __dax_pmd_fault(vma, addr, pmd, flags, ext2_get_block, NULL);
-
-	up_read(&ei->dax_sem);
-	if (flags & FAULT_FLAG_WRITE)
-		sb_end_pagefault(inode->i_sb);
-	return ret;
-}
-
 static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct inode *inode = file_inode(vma->vm_file);
@@ -90,7 +69,7 @@ static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	file_update_time(vma->vm_file);
 	down_read(&ei->dax_sem);
 
-	ret = __dax_mkwrite(vma, vmf, ext2_get_block, NULL);
+	ret = dax_mkwrite(vma, vmf, ext2_get_block, NULL);
 
 	up_read(&ei->dax_sem);
 	sb_end_pagefault(inode->i_sb);
@@ -123,7 +102,7 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
 
 static const struct vm_operations_struct ext2_dax_vm_ops = {
 	.fault		= ext2_dax_fault,
-	.pmd_fault	= ext2_dax_pmd_fault,
+	.huge_fault	= ext2_dax_fault,
 	.page_mkwrite	= ext2_dax_mkwrite,
 	.pfn_mkwrite	= ext2_dax_pfn_mkwrite,
 };
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 8c8965c..71859ed 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -202,54 +202,30 @@ static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
 
 	if (write) {
-		sb_start_pagefault(sb);
-		file_update_time(vma->vm_file);
-		down_read(&EXT4_I(inode)->i_mmap_sem);
-		handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
-						EXT4_DATA_TRANS_BLOCKS(sb));
-	} else
-		down_read(&EXT4_I(inode)->i_mmap_sem);
-
-	if (IS_ERR(handle))
-		result = VM_FAULT_SIGBUS;
-	else
-		result = __dax_fault(vma, vmf, ext4_dax_mmap_get_block, NULL);
-
-	if (write) {
-		if (!IS_ERR(handle))
-			ext4_journal_stop(handle);
-		up_read(&EXT4_I(inode)->i_mmap_sem);
-		sb_end_pagefault(sb);
-	} else
-		up_read(&EXT4_I(inode)->i_mmap_sem);
-
-	return result;
-}
-
-static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
-						pmd_t *pmd, unsigned int flags)
-{
-	int result;
-	handle_t *handle = NULL;
-	struct inode *inode = file_inode(vma->vm_file);
-	struct super_block *sb = inode->i_sb;
-	bool write = flags & FAULT_FLAG_WRITE;
+		unsigned nblocks;
+		switch (vmf->flags & FAULT_FLAG_SIZE_MASK) {
+		case FAULT_FLAG_SIZE_PTE:
+			nblocks = EXT4_DATA_TRANS_BLOCKS(sb);
+			break;
+		case FAULT_FLAG_SIZE_PMD:
+			nblocks = ext4_chunk_trans_blocks(inode,
+						PMD_SIZE / PAGE_SIZE);
+			break;
+		default:
+			return VM_FAULT_FALLBACK;
+		}
 
-	if (write) {
 		sb_start_pagefault(sb);
 		file_update_time(vma->vm_file);
 		down_read(&EXT4_I(inode)->i_mmap_sem);
-		handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
-				ext4_chunk_trans_blocks(inode,
-							PMD_SIZE / PAGE_SIZE));
+		handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE, nblocks);
 	} else
 		down_read(&EXT4_I(inode)->i_mmap_sem);
 
 	if (IS_ERR(handle))
 		result = VM_FAULT_SIGBUS;
 	else
-		result = __dax_pmd_fault(vma, addr, pmd, flags,
-				ext4_dax_mmap_get_block, NULL);
+		result = dax_fault(vma, vmf, ext4_dax_mmap_get_block, NULL);
 
 	if (write) {
 		if (!IS_ERR(handle))
@@ -270,7 +246,7 @@ static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	sb_start_pagefault(inode->i_sb);
 	file_update_time(vma->vm_file);
 	down_read(&EXT4_I(inode)->i_mmap_sem);
-	err = __dax_mkwrite(vma, vmf, ext4_dax_mmap_get_block, NULL);
+	err = dax_mkwrite(vma, vmf, ext4_dax_mmap_get_block, NULL);
 	up_read(&EXT4_I(inode)->i_mmap_sem);
 	sb_end_pagefault(inode->i_sb);
 
@@ -310,7 +286,7 @@ static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma,
 
 static const struct vm_operations_struct ext4_dax_vm_ops = {
 	.fault		= ext4_dax_fault,
-	.pmd_fault	= ext4_dax_pmd_fault,
+	.huge_fault	= ext4_dax_fault,
 	.page_mkwrite	= ext4_dax_mkwrite,
 	.pfn_mkwrite	= ext4_dax_pfn_mkwrite,
 };
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 55e16e2..6db703b 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1526,7 +1526,7 @@ xfs_filemap_page_mkwrite(
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
 	if (IS_DAX(inode)) {
-		ret = __dax_mkwrite(vma, vmf, xfs_get_blocks_dax_fault, NULL);
+		ret = dax_mkwrite(vma, vmf, xfs_get_blocks_dax_fault, NULL);
 	} else {
 		ret = block_page_mkwrite(vma, vmf, xfs_get_blocks);
 		ret = block_page_mkwrite_return(ret);
@@ -1560,7 +1560,7 @@ xfs_filemap_fault(
 		 * changes to xfs_get_blocks_direct() to map unwritten extent
 		 * ioend for conversion on read-only mappings.
 		 */
-		ret = __dax_fault(vma, vmf, xfs_get_blocks_dax_fault, NULL);
+		ret = dax_fault(vma, vmf, xfs_get_blocks_dax_fault, NULL);
 	} else
 		ret = filemap_fault(vma, vmf);
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
@@ -1571,16 +1571,14 @@ xfs_filemap_fault(
 /*
  * Similar to xfs_filemap_fault(), the DAX fault path can call into here on
  * both read and write faults. Hence we need to handle both cases. There is no
- * ->pmd_mkwrite callout for huge pages, so we have a single function here to
+ * ->huge_mkwrite callout for huge pages, so we have a single function here to
  * handle both cases here. @flags carries the information on the type of fault
  * occuring.
  */
 STATIC int
-xfs_filemap_pmd_fault(
+xfs_filemap_huge_fault(
 	struct vm_area_struct	*vma,
-	unsigned long		addr,
-	pmd_t			*pmd,
-	unsigned int		flags)
+	struct vm_fault		*vmf)
 {
 	struct inode		*inode = file_inode(vma->vm_file);
 	struct xfs_inode	*ip = XFS_I(inode);
@@ -1589,26 +1587,25 @@ xfs_filemap_pmd_fault(
 	if (!IS_DAX(inode))
 		return VM_FAULT_FALLBACK;
 
-	trace_xfs_filemap_pmd_fault(ip);
+	trace_xfs_filemap_huge_fault(ip);
 
-	if (flags & FAULT_FLAG_WRITE) {
+	if (vmf->flags & FAULT_FLAG_WRITE) {
 		sb_start_pagefault(inode->i_sb);
 		file_update_time(vma->vm_file);
 	}
 
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
-	ret = __dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault,
-			      NULL);
+	ret = dax_fault(vma, vmf, xfs_get_blocks_dax_fault, NULL);
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
-	if (flags & FAULT_FLAG_WRITE)
+	if (vmf->flags & FAULT_FLAG_WRITE)
 		sb_end_pagefault(inode->i_sb);
 
 	return ret;
 }
 
 /*
- * pfn_mkwrite was originally inteneded to ensure we capture time stamp
+ * pfn_mkwrite was originally intended to ensure we capture time stamp
  * updates on write faults. In reality, it's need to serialise against
  * truncate similar to page_mkwrite. Hence we cycle the XFS_MMAPLOCK_SHARED
  * to ensure we serialise the fault barrier in place.
@@ -1644,7 +1641,7 @@ xfs_filemap_pfn_mkwrite(
 
 static const struct vm_operations_struct xfs_file_vm_ops = {
 	.fault		= xfs_filemap_fault,
-	.pmd_fault	= xfs_filemap_pmd_fault,
+	.huge_fault	= xfs_filemap_huge_fault,
 	.map_pages	= filemap_map_pages,
 	.page_mkwrite	= xfs_filemap_page_mkwrite,
 	.pfn_mkwrite	= xfs_filemap_pfn_mkwrite,
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 391d797..fb1f3e1 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -687,7 +687,7 @@ DEFINE_INODE_EVENT(xfs_inode_clear_eofblocks_tag);
 DEFINE_INODE_EVENT(xfs_inode_free_eofblocks_invalid);
 
 DEFINE_INODE_EVENT(xfs_filemap_fault);
-DEFINE_INODE_EVENT(xfs_filemap_pmd_fault);
+DEFINE_INODE_EVENT(xfs_filemap_huge_fault);
 DEFINE_INODE_EVENT(xfs_filemap_page_mkwrite);
 DEFINE_INODE_EVENT(xfs_filemap_pfn_mkwrite);
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 8204c3d..8e58c36 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -12,25 +12,8 @@ int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
 		dax_iodone_t);
-int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
-		dax_iodone_t);
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
-				unsigned int flags, get_block_t, dax_iodone_t);
-int __dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
-				unsigned int flags, get_block_t, dax_iodone_t);
-#else
-static inline int dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
-				pmd_t *pmd, unsigned int flags, get_block_t gb,
-				dax_iodone_t di)
-{
-	return VM_FAULT_FALLBACK;
-}
-#define __dax_pmd_fault dax_pmd_fault
-#endif
 int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
 #define dax_mkwrite(vma, vmf, gb, iod)		dax_fault(vma, vmf, gb, iod)
-#define __dax_mkwrite(vma, vmf, gb, iod)	__dax_fault(vma, vmf, gb, iod)
 
 static inline bool vma_is_dax(struct vm_area_struct *vma)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa6da9a..b9d0979 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -232,15 +232,21 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_TRIED	0x20	/* Second try */
 #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
 
+#define FAULT_FLAG_SIZE_MASK	0x700	/* Support up to 8-level page tables */
+#define FAULT_FLAG_SIZE_PTE	0x000	/* First level (eg 4k) */
+#define FAULT_FLAG_SIZE_PMD	0x100	/* Second level (eg 2MB) */
+#define FAULT_FLAG_SIZE_PUD	0x200	/* Third level (eg 1GB) */
+#define FAULT_FLAG_SIZE_PGD	0x300	/* Fourth level (eg 512GB) */
+
 /*
- * vm_fault is filled by the the pagefault handler and passed to the vma's
+ * vm_fault is filled in by the pagefault handler and passed to the vma's
  * ->fault function. The vma's ->fault is responsible for returning a bitmask
  * of VM_FAULT_xxx flags that give details about how the fault was handled.
  *
  * MM layer fills up gfp_mask for page allocations but fault handler might
  * alter it if its implementation requires a different allocation context.
  *
- * pgoff should be used in favour of virtual_address, if possible.
+ * pgoff should be used instead of virtual_address, if possible.
  */
 struct vm_fault {
 	unsigned int flags;		/* FAULT_FLAG_xxx flags */
@@ -257,7 +263,12 @@ struct vm_fault {
 	/* for ->map_pages() only */
 	pgoff_t max_pgoff;		/* map pages for offset from pgoff till
 					 * max_pgoff inclusive */
-	pte_t *pte;			/* pte entry associated with ->pgoff */
+	union {
+		pte_t *pte;		/* pte entry associated with ->pgoff */
+		pmd_t *pmd;
+		pud_t *pud;
+		pgd_t *pgd;
+	};
 };
 
 /*
@@ -270,8 +281,7 @@ struct vm_operations_struct {
 	void (*close)(struct vm_area_struct * area);
 	int (*mremap)(struct vm_area_struct * area);
 	int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
-	int (*pmd_fault)(struct vm_area_struct *, unsigned long address,
-						pmd_t *, unsigned int flags);
+	int (*huge_fault)(struct vm_area_struct *, struct vm_fault *vmf);
 	void (*map_pages)(struct vm_area_struct *vma, struct vm_fault *vmf);
 
 	/* notification that a previously read-only page is about to become
diff --git a/mm/memory.c b/mm/memory.c
index a2eaeef..03d49eb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3274,10 +3274,16 @@ out:
 static int create_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, pmd_t *pmd, unsigned int flags)
 {
+	struct vm_fault vmf = {
+		.virtual_address = (void __user *)address,
+		.flags = flags | FAULT_FLAG_SIZE_PMD,
+		.pmd = pmd,
+	};
+
 	if (vma_is_anonymous(vma))
 		return do_huge_pmd_anonymous_page(mm, vma, address, pmd, flags);
-	if (vma->vm_ops->pmd_fault)
-		return vma->vm_ops->pmd_fault(vma, address, pmd, flags);
+	if (vma->vm_ops->huge_fault)
+		return vma->vm_ops->huge_fault(vma, &vmf);
 	return VM_FAULT_FALLBACK;
 }
 
@@ -3285,10 +3291,16 @@ static int wp_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, pmd_t *pmd, pmd_t orig_pmd,
 			unsigned int flags)
 {
+	struct vm_fault vmf = {
+		.virtual_address = (void __user *)address,
+		.flags = flags | FAULT_FLAG_SIZE_PMD,
+		.pmd = pmd,
+	};
+
 	if (vma_is_anonymous(vma))
 		return do_huge_pmd_wp_page(mm, vma, address, pmd, orig_pmd);
-	if (vma->vm_ops->pmd_fault)
-		return vma->vm_ops->pmd_fault(vma, address, pmd, flags);
+	if (vma->vm_ops->huge_fault)
+		return vma->vm_ops->huge_fault(vma, &vmf);
 	return VM_FAULT_FALLBACK;
 }
 
-- 
2.7.0.rc3



* [PATCH 2/2] dax: Giant hack
  2016-01-27  4:08 [PATCH 0/2] Fix dax races between page faults RFC only Matthew Wilcox
  2016-01-27  4:08 ` [PATCH 1/2] mm,fs,dax: Change ->pmd_fault to ->huge_fault Matthew Wilcox
@ 2016-01-27  4:08 ` Matthew Wilcox
  2016-01-28 13:10   ` Jan Kara
                     ` (2 more replies)
       [not found] ` <CAOxpaSU_JgkeS=u61zxWTdP5hXymBkUsvkjkwNzm6XVig9y8RQ@mail.gmail.com>
  2016-01-28 12:48 ` Jan Kara
  3 siblings, 3 replies; 13+ messages in thread
From: Matthew Wilcox @ 2016-01-27  4:08 UTC (permalink / raw)
  To: Jan Kara, Dave Chinner, Ross Zwisler; +Cc: Matthew Wilcox, linux-fsdevel

From: Matthew Wilcox <willy@linux.intel.com>

This glop of impossible-to-review code implements a number of ideas that
need to be separated out.

 - Eliminate vm_ops->huge_fault.  The core calls ->fault instead and callers
   who set VM_HUGEPAGE should be prepared to deal with FAULT_FLAG_SIZE_PMD
   (and larger).
 - Switch back to calling ->page_mkwrite instead of ->pfn_mkwrite.  DAX now
   always has a page to lock, and no other implementations of ->pfn_mkwrite
   exist.
 - dax_mkwrite splits out from dax_fault.  dax_fault will now never call
   get_block() to allocate a block; only to see if a block has been allocated.
   dax_mkwrite will always attempt to allocate a block.
 - Filesystems now take their DAX allocation mutex in exclusive/write mode
   when calling dax_mkwrite (a filesystem-side sketch follows this list).
 - Split out dax_insert_pmd_mapping() from dax_pmd_fault and share it with
   the new dax_pmd_mkwrite.
 - Change dax_pmd_fault to take a vm_fault argument like the rest of the
   family of functions.
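
For illustration only (not part of the diff below): a minimal sketch of
the filesystem-side pattern the mkwrite split implies, modelled on the
ext2 change in this patch.  example_get_block and example_dax_sem are
hypothetical stand-ins for the filesystem's own get_block callback and
DAX allocation lock; read faults keep taking the lock shared, while
->page_mkwrite takes it exclusive so dax_mkwrite() can allocate.

static int example_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct inode *inode = file_inode(vma->vm_file);
	int ret;

	sb_start_pagefault(inode->i_sb);
	file_update_time(vma->vm_file);

	/* exclusive: dax_mkwrite() may allocate via get_block(..., create=1) */
	down_write(&example_dax_sem);
	ret = dax_mkwrite(vma, vmf, example_get_block, NULL);
	up_write(&example_dax_sem);

	sb_end_pagefault(inode->i_sb);
	return ret;
}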

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 Documentation/filesystems/Locking |   8 -
 Documentation/filesystems/dax.txt |   5 +-
 fs/block_dev.c                    |  10 +-
 fs/dax.c                          | 433 +++++++++++++++++++++++++-------------
 fs/ext2/file.c                    |  35 +--
 fs/ext4/file.c                    |  96 +++------
 fs/xfs/xfs_file.c                 |  95 ++-------
 fs/xfs/xfs_trace.h                |   2 -
 include/linux/dax.h               |   4 +-
 include/linux/mm.h                |   4 -
 mm/memory.c                       |  51 +----
 mm/mmap.c                         |   2 +-
 12 files changed, 359 insertions(+), 386 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 619af9b..1be09e7 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -522,7 +522,6 @@ prototypes:
 	void (*close)(struct vm_area_struct*);
 	int (*fault)(struct vm_area_struct*, struct vm_fault *);
 	int (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *);
-	int (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *);
 	int (*access)(struct vm_area_struct *, unsigned long, void*, int, int);
 
 locking rules:
@@ -532,7 +531,6 @@ close:		yes
 fault:		yes		can return with page locked
 map_pages:	yes
 page_mkwrite:	yes		can return with page locked
-pfn_mkwrite:	yes
 access:		yes
 
 	->fault() is called when a previously not present pte is about
@@ -559,12 +557,6 @@ the page has been truncated, the filesystem should not look up a new page
 like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
 will cause the VM to retry the fault.
 
-	->pfn_mkwrite() is the same as page_mkwrite but when the pte is
-VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is
-VM_FAULT_NOPAGE. Or one of the VM_FAULT_ERROR types. The default behavior
-after this call is to make the pte read-write, unless pfn_mkwrite returns
-an error.
-
 	->access() is called when get_user_pages() fails in
 access_process_vm(), typically used to debug a process through
 /proc/pid/mem or ptrace.  This function is needed only for
diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index 2fe9e74..ff62feb 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -62,9 +62,8 @@ Filesystem support consists of
   dax_do_io() instead of blockdev_direct_IO() if S_DAX is set
 - implementing an mmap file operation for DAX files which sets the
   VM_MIXEDMAP and VM_HUGEPAGE flags on the VMA, and setting the vm_ops to
-  include handlers for fault, huge_fault and page_mkwrite (which should
-  probably call dax_fault() and dax_mkwrite(), passing the appropriate
-  get_block() callback)
+  include handlers for fault and page_mkwrite (which should probably call
+  dax_fault() and dax_mkwrite(), passing the appropriate get_block() callback)
 - calling dax_truncate_page() instead of block_truncate_page() for DAX files
 - calling dax_zero_page_range() instead of zero_user() for DAX files
 - ensuring that there is sufficient locking between reads, writes,
diff --git a/fs/block_dev.c b/fs/block_dev.c
index a9474ac..78697fe 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1722,7 +1722,7 @@ static const struct address_space_operations def_blk_aops = {
  *
  * Finally, unlike the filemap_page_mkwrite() case there is no
  * filesystem superblock to sync against freezing.  We still include a
- * pfn_mkwrite callback for dax drivers to receive write fault
+ * page_mkwrite callback for dax drivers to receive write fault
  * notifications.
  */
 static int blkdev_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
@@ -1730,6 +1730,11 @@ static int blkdev_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	return dax_fault(vma, vmf, blkdev_get_block, NULL);
 }
 
+static int blkdev_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return dax_mkwrite(vma, vmf, blkdev_get_block, NULL);
+}
+
 static void blkdev_vm_open(struct vm_area_struct *vma)
 {
 	struct inode *bd_inode = bdev_file_inode(vma->vm_file);
@@ -1754,8 +1759,7 @@ static const struct vm_operations_struct blkdev_dax_vm_ops = {
 	.open		= blkdev_vm_open,
 	.close		= blkdev_vm_close,
 	.fault		= blkdev_dax_fault,
-	.huge_fault	= blkdev_dax_fault,
-	.pfn_mkwrite	= blkdev_dax_fault,
+	.page_mkwrite	= blkdev_dax_mkwrite,
 };
 
 static const struct vm_operations_struct blkdev_default_vm_ops = {
diff --git a/fs/dax.c b/fs/dax.c
index dbaf62c..952c2c2 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -372,7 +372,7 @@ static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
 
 	if (sector == NO_SECTOR) {
 		/*
-		 * This can happen during correct operation if our pfn_mkwrite
+		 * This can happen during correct operation if our page_mkwrite
 		 * fault raced against a hole punch operation.  If this
 		 * happens the pte that was hole punched will have been
 		 * unmapped and the radix tree entry will have been removed by
@@ -584,7 +584,6 @@ static int dax_pte_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	sector_t block;
 	pgoff_t size;
 	int error;
-	int major = 0;
 
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	if (vmf->pgoff >= size)
@@ -624,20 +623,8 @@ static int dax_pte_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	if (error)
 		goto unlock_page;
 
-	if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page) {
-		if (vmf->flags & FAULT_FLAG_WRITE) {
-			error = get_block(inode, block, &bh, 1);
-			count_vm_event(PGMAJFAULT);
-			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
-			major = VM_FAULT_MAJOR;
-			if (!error && (bh.b_size < PAGE_SIZE))
-				error = -EIO;
-			if (error)
-				goto unlock_page;
-		} else {
-			return dax_load_hole(mapping, page, vmf);
-		}
-	}
+	if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page)
+		return dax_load_hole(mapping, page, vmf);
 
 	if (vmf->cow_page) {
 		struct page *new_page = vmf->cow_page;
@@ -655,16 +642,101 @@ static int dax_pte_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 								PAGE_SHIFT;
 			if (vmf->pgoff >= size) {
 				i_mmap_unlock_read(mapping);
-				error = -EIO;
-				goto out;
+				return VM_FAULT_SIGBUS;
 			}
 		}
 		return VM_FAULT_LOCKED;
 	}
 
-	/* Check we didn't race with a read fault installing a new page */
-	if (!page && major)
-		page = find_lock_page(mapping, vmf->pgoff);
+	/*
+	 * If we successfully insert a mapping to an unwritten extent, we
+	 * need to convert the unwritten extent.  If there is an error
+	 * inserting the mapping, the filesystem needs to leave it as
+	 * unwritten to prevent exposure of the stale underlying data to
+	 * userspace, but we still need to call the completion function so
+	 * the private resources on the mapping buffer can be released. We
+	 * indicate what the callback should do via the uptodate variable,
+	 * same as for normal BH based IO completions.
+	 */
+	error = dax_insert_mapping(inode, &bh, vma, vmf);
+	if (buffer_unwritten(&bh)) {
+		if (complete_unwritten)
+			complete_unwritten(&bh, !error);
+		else
+			WARN_ON_ONCE(!(vmf->flags & FAULT_FLAG_WRITE));
+	}
+
+ out:
+	if (error == -ENOMEM)
+		return VM_FAULT_OOM;
+	/* -EBUSY is fine, somebody else faulted on the same PTE */
+	if ((error < 0) && (error != -EBUSY))
+		return VM_FAULT_SIGBUS;
+	return VM_FAULT_NOPAGE;
+
+ unlock_page:
+	if (page) {
+		unlock_page(page);
+		page_cache_release(page);
+	}
+	goto out;
+}
+
+static int dax_pte_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
+			get_block_t get_block, dax_iodone_t complete_unwritten)
+{
+	struct file *file = vma->vm_file;
+	struct address_space *mapping = file->f_mapping;
+	struct inode *inode = mapping->host;
+	struct page *page;
+	struct buffer_head bh;
+	unsigned blkbits = inode->i_blkbits;
+	sector_t block;
+	pgoff_t size;
+	int error;
+	int major = 0;
+
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (vmf->pgoff >= size)
+		return VM_FAULT_SIGBUS;
+
+	memset(&bh, 0, sizeof(bh));
+	block = (sector_t)vmf->pgoff << (PAGE_SHIFT - blkbits);
+	bh.b_bdev = inode->i_sb->s_bdev;
+	bh.b_size = PAGE_SIZE;
+
+ repeat:
+	page = find_get_page(mapping, vmf->pgoff);
+	if (page) {
+		if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
+			page_cache_release(page);
+			return VM_FAULT_RETRY;
+		}
+		if (unlikely(page->mapping != mapping)) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto repeat;
+		}
+		size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+		if (unlikely(vmf->pgoff >= size)) {
+			/*
+			 * We have a struct page covering a hole in the file
+			 * from a read fault and we've raced with a truncate
+			 */
+			error = -EIO;
+			goto unlock_page;
+		}
+	}
+
+	error = get_block(inode, block, &bh, 1);
+	if (!error && (bh.b_size < PAGE_SIZE))
+		error = -EIO;		/* fs corruption? */
+	if (error)
+		goto unlock_page;
+
+	count_vm_event(PGMAJFAULT);
+	mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
+	major = VM_FAULT_MAJOR;
 
 	if (page) {
 		unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
@@ -675,16 +747,6 @@ static int dax_pte_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		page = NULL;
 	}
 
-	/*
-	 * If we successfully insert the new mapping over an unwritten extent,
-	 * we need to ensure we convert the unwritten extent. If there is an
-	 * error inserting the mapping, the filesystem needs to leave it as
-	 * unwritten to prevent exposure of the stale underlying data to
-	 * userspace, but we still need to call the completion function so
-	 * the private resources on the mapping buffer can be released. We
-	 * indicate what the callback should do via the uptodate variable, same
-	 * as for normal BH based IO completions.
-	 */
 	error = dax_insert_mapping(inode, &bh, vma, vmf);
 	if (buffer_unwritten(&bh)) {
 		if (complete_unwritten)
@@ -734,22 +796,101 @@ static void __dax_dbg(struct buffer_head *bh, unsigned long address,
 
 #define dax_pmd_dbg(bh, address, reason)	__dax_dbg(bh, address, reason, "dax_pmd")
 
-static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
-		pmd_t *pmd, unsigned int flags, get_block_t get_block,
-		dax_iodone_t complete_unwritten)
+static int dax_insert_pmd_mapping(struct inode *inode, struct buffer_head *bh,
+			struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct blk_dax_ctl dax = {
+		.sector = to_sector(bh, inode),
+		.size = PMD_SIZE,
+	};
+	struct block_device *bdev = bh->b_bdev;
+	bool write = vmf->flags & FAULT_FLAG_WRITE;
+	unsigned long address = (unsigned long)vmf->virtual_address;
+	pgoff_t pgoff = linear_page_index(vma, address & PMD_MASK);
+	int major;
+	long length;
+
+	length = dax_map_atomic(bdev, &dax);
+	if (length < 0)
+		return VM_FAULT_SIGBUS;
+	if (length < PMD_SIZE) {
+		dax_pmd_dbg(bh, address, "dax-length too small");
+		goto unmap;
+	}
+
+	if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR) {
+		dax_pmd_dbg(bh, address, "pfn unaligned");
+		goto unmap;
+	}
+
+	if (!pfn_t_devmap(dax.pfn)) {
+		dax_pmd_dbg(bh, address, "pfn not in memmap");
+		goto unmap;
+	}
+
+	if (buffer_unwritten(bh) || buffer_new(bh)) {
+		clear_pmem(dax.addr, PMD_SIZE);
+		wmb_pmem();
+		count_vm_event(PGMAJFAULT);
+		mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
+		major = VM_FAULT_MAJOR;
+	} else
+		major = 0;
+
+	dax_unmap_atomic(bdev, &dax);
+
+	/*
+	 * For PTE faults we insert a radix tree entry for reads, and
+	 * leave it clean.  Then on the first write we dirty the radix
+	 * tree entry via the dax_mkwrite() path.  This sequence
+	 * allows the dax_mkwrite() call to be simpler and avoid a
+	 * call into get_block() to translate the pgoff to a sector in
+	 * order to be able to create a new radix tree entry.
+	 *
+	 * The PMD path doesn't have an equivalent to
+	 * dax_mkwrite(), though, so for a read followed by a
+	 * write we traverse all the way through __dax_pmd_fault()
+	 * twice.  This means we can just skip inserting a radix tree
+	 * entry completely on the initial read and just wait until
+	 * the write to insert a dirty entry.
+	 */
+	if (write) {
+		int error = dax_radix_entry(vma->vm_file->f_mapping, pgoff,
+						dax.sector, true, true);
+		if (error) {
+			dax_pmd_dbg(bh, address, "PMD radix insertion failed");
+			goto fallback;
+		}
+	}
+
+	dev_dbg(part_to_dev(bdev->bd_part),
+			"%s: %s addr: %lx pfn: %lx sect: %llx\n",
+			__func__, current->comm, address,
+			pfn_t_to_pfn(dax.pfn),
+			(unsigned long long) dax.sector);
+	return major | vmf_insert_pfn_pmd(vma, address, vmf->pmd, dax.pfn,
+						write);
+ 
+ unmap:
+	dax_unmap_atomic(bdev, &dax);
+ fallback:
+	return VM_FAULT_FALLBACK;
+}
+
+static int dax_pmd_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+		get_block_t get_block, dax_iodone_t complete_unwritten)
 {
 	struct file *file = vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
 	struct inode *inode = mapping->host;
 	struct buffer_head bh;
 	unsigned blkbits = inode->i_blkbits;
+	unsigned long address = (unsigned long)vmf->virtual_address;
 	unsigned long pmd_addr = address & PMD_MASK;
-	bool write = flags & FAULT_FLAG_WRITE;
-	struct block_device *bdev;
+	bool write = vmf->flags & FAULT_FLAG_WRITE;
 	pgoff_t size, pgoff;
 	sector_t block;
-	int error, result = 0;
-	bool alloc = false;
+	int result = 0;
 
 	/* dax pmd mappings require pfn_t_devmap() */
 	if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
@@ -757,7 +898,7 @@ static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 
 	/* Fall back to PTEs if we're going to COW */
 	if (write && !(vma->vm_flags & VM_SHARED)) {
-		split_huge_pmd(vma, pmd, address);
+		split_huge_pmd(vma, vmf->pmd, address);
 		dax_pmd_dbg(NULL, address, "cow write");
 		return VM_FAULT_FALLBACK;
 	}
@@ -791,14 +932,6 @@ static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	if (get_block(inode, block, &bh, 0) != 0)
 		return VM_FAULT_SIGBUS;
 
-	if (!buffer_mapped(&bh) && write) {
-		if (get_block(inode, block, &bh, 1) != 0)
-			return VM_FAULT_SIGBUS;
-		alloc = true;
-	}
-
-	bdev = bh.b_bdev;
-
 	/*
 	 * If the filesystem isn't willing to tell us the length of a hole,
 	 * just fall back to PTEs.  Calling get_block 512 times in a loop
@@ -809,17 +942,6 @@ static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		return VM_FAULT_FALLBACK;
 	}
 
-	/*
-	 * If we allocated new storage, make sure no process has any
-	 * zero pages covering this hole
-	 */
-	if (alloc) {
-		loff_t lstart = pgoff << PAGE_SHIFT;
-		loff_t lend = lstart + PMD_SIZE - 1; /* inclusive */
-
-		truncate_pagecache_range(inode, lstart, lend);
-	}
-
 	i_mmap_lock_read(mapping);
 
 	/*
@@ -839,9 +961,9 @@ static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		goto fallback;
 	}
 
-	if (!write && !buffer_mapped(&bh) && buffer_uptodate(&bh)) {
+	if (!buffer_mapped(&bh) && buffer_uptodate(&bh)) {
 		spinlock_t *ptl;
-		pmd_t entry;
+		pmd_t entry, *pmd = vmf->pmd;
 		struct page *zero_page = get_huge_zero_page();
 
 		if (unlikely(!zero_page)) {
@@ -856,7 +978,7 @@ static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 			goto fallback;
 		}
 
-		dev_dbg(part_to_dev(bdev->bd_part),
+		dev_dbg(part_to_dev(bh.b_bdev->bd_part),
 				"%s: %s addr: %lx pfn: <zero> sect: %llx\n",
 				__func__, current->comm, address,
 				(unsigned long long) to_sector(&bh, inode));
@@ -867,75 +989,90 @@ static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		result = VM_FAULT_NOPAGE;
 		spin_unlock(ptl);
 	} else {
-		struct blk_dax_ctl dax = {
-			.sector = to_sector(&bh, inode),
-			.size = PMD_SIZE,
-		};
-		long length = dax_map_atomic(bdev, &dax);
+		result |= dax_insert_pmd_mapping(inode, &bh, vma, vmf);
+	}
 
-		if (length < 0) {
-			result = VM_FAULT_SIGBUS;
-			goto out;
-		}
-		if (length < PMD_SIZE) {
-			dax_pmd_dbg(&bh, address, "dax-length too small");
-			dax_unmap_atomic(bdev, &dax);
-			goto fallback;
-		}
-		if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR) {
-			dax_pmd_dbg(&bh, address, "pfn unaligned");
-			dax_unmap_atomic(bdev, &dax);
-			goto fallback;
-		}
+ out:
+	i_mmap_unlock_read(mapping);
 
-		if (!pfn_t_devmap(dax.pfn)) {
-			dax_unmap_atomic(bdev, &dax);
-			dax_pmd_dbg(&bh, address, "pfn not in memmap");
-			goto fallback;
-		}
+	if (buffer_unwritten(&bh))
+		complete_unwritten(&bh, !(result & VM_FAULT_ERROR));
 
-		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
-			clear_pmem(dax.addr, PMD_SIZE);
-			wmb_pmem();
-			count_vm_event(PGMAJFAULT);
-			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
-			result |= VM_FAULT_MAJOR;
-		}
-		dax_unmap_atomic(bdev, &dax);
+	return result;
 
-		/*
-		 * For PTE faults we insert a radix tree entry for reads, and
-		 * leave it clean.  Then on the first write we dirty the radix
-		 * tree entry via the dax_pfn_mkwrite() path.  This sequence
-		 * allows the dax_pfn_mkwrite() call to be simpler and avoid a
-		 * call into get_block() to translate the pgoff to a sector in
-		 * order to be able to create a new radix tree entry.
-		 *
-		 * The PMD path doesn't have an equivalent to
-		 * dax_pfn_mkwrite(), though, so for a read followed by a
-		 * write we traverse all the way through __dax_pmd_fault()
-		 * twice.  This means we can just skip inserting a radix tree
-		 * entry completely on the initial read and just wait until
-		 * the write to insert a dirty entry.
-		 */
-		if (write) {
-			error = dax_radix_entry(mapping, pgoff, dax.sector,
-					true, true);
-			if (error) {
-				dax_pmd_dbg(&bh, address,
-						"PMD radix insertion failed");
-				goto fallback;
-			}
-		}
+ fallback:
+	count_vm_event(THP_FAULT_FALLBACK);
+	result = VM_FAULT_FALLBACK;
+	goto out;
+}
 
-		dev_dbg(part_to_dev(bdev->bd_part),
-				"%s: %s addr: %lx pfn: %lx sect: %llx\n",
-				__func__, current->comm, address,
-				pfn_t_to_pfn(dax.pfn),
-				(unsigned long long) dax.sector);
-		result |= vmf_insert_pfn_pmd(vma, address, pmd,
-				dax.pfn, write);
+static int dax_pmd_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
+		get_block_t get_block, dax_iodone_t complete_unwritten)
+{
+	struct file *file = vma->vm_file;
+	struct address_space *mapping = file->f_mapping;
+	struct inode *inode = mapping->host;
+	unsigned long address = (unsigned long)vmf->virtual_address;
+	struct buffer_head bh;
+	unsigned blkbits = inode->i_blkbits;
+	unsigned long pmd_addr = address & PMD_MASK;
+	struct block_device *bdev;
+	pgoff_t size, pgoff;
+	loff_t lstart, lend;
+	sector_t block;
+	int result = 0;
+
+	pgoff = linear_page_index(vma, pmd_addr);
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (pgoff >= size)
+		return VM_FAULT_SIGBUS;
+
+	memset(&bh, 0, sizeof(bh));
+	bh.b_bdev = inode->i_sb->s_bdev;
+	block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
+
+	bh.b_size = PMD_SIZE;
+
+	if (get_block(inode, block, &bh, 1) != 0)
+		return VM_FAULT_SIGBUS;
+
+	bdev = bh.b_bdev;
+
+	/*
+	 * If the filesystem isn't willing to tell us the length of a hole,
+	 * just fall back to PTEs.  Calling get_block 512 times in a loop
+	 * would be silly.
+	 */
+	if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE) {
+		dax_pmd_dbg(&bh, address, "allocated block too small");
+		return VM_FAULT_FALLBACK;
+	}
+
+	/* Make sure no process has any zero pages covering this hole */
+	lstart = pgoff << PAGE_SHIFT;
+	lend = lstart + PMD_SIZE - 1; /* inclusive */
+	truncate_pagecache_range(inode, lstart, lend);
+
+	i_mmap_lock_read(mapping);
+
+	/*
+	 * If a truncate happened while we were allocating blocks, we may
+	 * leave blocks allocated to the file that are beyond EOF.  We can't
+	 * take i_mutex here, so just leave them hanging; they'll be freed
+	 * when the file is deleted.
+	 */
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (pgoff >= size) {
+		result = VM_FAULT_SIGBUS;
+		goto out;
 	}
+	if ((pgoff | PG_PMD_COLOUR) >= size) {
+		dax_pmd_dbg(&bh, address,
+				"offset + huge page size > file size");
+		goto fallback;
+	}
+
+	result |= dax_insert_pmd_mapping(inode, &bh, vma, vmf);
 
  out:
 	i_mmap_unlock_read(mapping);
@@ -951,9 +1088,13 @@ static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	goto out;
 }
 #else /* !CONFIG_TRANSPARENT_HUGEPAGE */
-static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
-		pmd_t *pmd, unsigned int flags, get_block_t get_block,
-		dax_iodone_t complete_unwritten)
+static int dax_pmd_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+		get_block_t get_block, dax_iodone_t complete_unwritten)
+{
+	return VM_FAULT_FALLBACK;
+}
+static int dax_pmd_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
+			get_block_t get_block, dax_iodone_t complete_unwritten)
 {
 	return VM_FAULT_FALLBACK;
 }
@@ -978,13 +1119,11 @@ static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		get_block_t get_block, dax_iodone_t iodone)
 {
-	unsigned long address = (unsigned long)vmf->virtual_address;
 	switch (vmf->flags & FAULT_FLAG_SIZE_MASK) {
 	case FAULT_FLAG_SIZE_PTE:
 		return dax_pte_fault(vma, vmf, get_block, iodone);
 	case FAULT_FLAG_SIZE_PMD:
-		return dax_pmd_fault(vma, address, vmf->pmd, vmf->flags,
-						get_block, iodone);
+		return dax_pmd_fault(vma, vmf, get_block, iodone);
 	default:
 		return VM_FAULT_FALLBACK;
 	}
@@ -992,26 +1131,30 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 EXPORT_SYMBOL_GPL(dax_fault);
 
 /**
- * dax_pfn_mkwrite - handle first write to DAX page
+ * dax_mkwrite - handle first write to a DAX page
  * @vma: The virtual memory area where the fault occurred
  * @vmf: The description of the fault
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ * @iodone: The filesystem method used to convert unwritten blocks
+ *	to written so the data written to them is exposed.  This is required
+ *	by write faults for filesystems that will return unwritten extent
+ *	mappings from @get_block, but it is optional for reads as
+ *	dax_insert_mapping() will always zero unwritten blocks.  If the fs
+ *	does not support unwritten extents, then it should pass NULL.
  */
-int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+int dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
+		get_block_t get_block, dax_iodone_t iodone)
 {
-	struct file *file = vma->vm_file;
-
-	/*
-	 * We pass NO_SECTOR to dax_radix_entry() because we expect that a
-	 * RADIX_DAX_PTE entry already exists in the radix tree from a
-	 * previous call to __dax_fault().  We just want to look up that PTE
-	 * entry using vmf->pgoff and make sure the dirty tag is set.  This
-	 * saves us from having to make a call to get_block() here to look
-	 * up the sector.
-	 */
-	dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
-	return VM_FAULT_NOPAGE;
+	switch (vmf->flags & FAULT_FLAG_SIZE_MASK) {
+	case FAULT_FLAG_SIZE_PTE:
+		return dax_pte_mkwrite(vma, vmf, get_block, iodone);
+	case FAULT_FLAG_SIZE_PMD:
+		return dax_pmd_mkwrite(vma, vmf, get_block, iodone);
+	default:
+		return VM_FAULT_FALLBACK;
+	}
 }
-EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
+EXPORT_SYMBOL_GPL(dax_mkwrite);
 
 /**
  * dax_zero_page_range - zero a range within a page of a DAX file
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index cf6f78c..6028c63 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -49,13 +49,14 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 		sb_start_pagefault(inode->i_sb);
 		file_update_time(vma->vm_file);
 	}
-	down_read(&ei->dax_sem);
 
+	down_read(&ei->dax_sem);
 	ret = dax_fault(vma, vmf, ext2_get_block, NULL);
-
 	up_read(&ei->dax_sem);
+
 	if (vmf->flags & FAULT_FLAG_WRITE)
 		sb_end_pagefault(inode->i_sb);
+
 	return ret;
 }
 
@@ -67,44 +68,18 @@ static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 
 	sb_start_pagefault(inode->i_sb);
 	file_update_time(vma->vm_file);
-	down_read(&ei->dax_sem);
 
+	down_write(&ei->dax_sem);
 	ret = dax_mkwrite(vma, vmf, ext2_get_block, NULL);
+	up_write(&ei->dax_sem);
 
-	up_read(&ei->dax_sem);
-	sb_end_pagefault(inode->i_sb);
-	return ret;
-}
-
-static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
-		struct vm_fault *vmf)
-{
-	struct inode *inode = file_inode(vma->vm_file);
-	struct ext2_inode_info *ei = EXT2_I(inode);
-	loff_t size;
-	int ret;
-
-	sb_start_pagefault(inode->i_sb);
-	file_update_time(vma->vm_file);
-	down_read(&ei->dax_sem);
-
-	/* check that the faulting page hasn't raced with truncate */
-	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	if (vmf->pgoff >= size)
-		ret = VM_FAULT_SIGBUS;
-	else
-		ret = dax_pfn_mkwrite(vma, vmf);
-
-	up_read(&ei->dax_sem);
 	sb_end_pagefault(inode->i_sb);
 	return ret;
 }
 
 static const struct vm_operations_struct ext2_dax_vm_ops = {
 	.fault		= ext2_dax_fault,
-	.huge_fault	= ext2_dax_fault,
 	.page_mkwrite	= ext2_dax_mkwrite,
-	.pfn_mkwrite	= ext2_dax_pfn_mkwrite,
 };
 
 static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 71859ed..72dcece 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -196,99 +196,65 @@ out:
 static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	int result;
-	handle_t *handle = NULL;
 	struct inode *inode = file_inode(vma->vm_file);
 	struct super_block *sb = inode->i_sb;
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
 
 	if (write) {
-		unsigned nblocks;
-		switch (vmf->flags & FAULT_FLAG_SIZE_MASK) {
-		case FAULT_FLAG_SIZE_PTE:
-			nblocks = EXT4_DATA_TRANS_BLOCKS(sb);
-			break;
-		case FAULT_FLAG_SIZE_PMD:
-			nblocks = ext4_chunk_trans_blocks(inode,
-						PMD_SIZE / PAGE_SIZE);
-			break;
-		default:
-			return VM_FAULT_FALLBACK;
-		}
-
 		sb_start_pagefault(sb);
 		file_update_time(vma->vm_file);
-		down_read(&EXT4_I(inode)->i_mmap_sem);
-		handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE, nblocks);
-	} else
-		down_read(&EXT4_I(inode)->i_mmap_sem);
+	}
 
-	if (IS_ERR(handle))
-		result = VM_FAULT_SIGBUS;
-	else
-		result = dax_fault(vma, vmf, ext4_dax_mmap_get_block, NULL);
+	down_read(&EXT4_I(inode)->i_mmap_sem);
+	result = dax_fault(vma, vmf, ext4_dax_mmap_get_block, NULL);
+	up_read(&EXT4_I(inode)->i_mmap_sem);
 
-	if (write) {
-		if (!IS_ERR(handle))
-			ext4_journal_stop(handle);
-		up_read(&EXT4_I(inode)->i_mmap_sem);
+	if (write)
 		sb_end_pagefault(sb);
-	} else
-		up_read(&EXT4_I(inode)->i_mmap_sem);
 
 	return result;
 }
 
 static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	int err;
+	int result;
 	struct inode *inode = file_inode(vma->vm_file);
+	struct super_block *sb = inode->i_sb;
+	handle_t *handle;
+	unsigned nblocks;
+
+	switch (vmf->flags & FAULT_FLAG_SIZE_MASK) {
+	case FAULT_FLAG_SIZE_PTE:
+		nblocks = EXT4_DATA_TRANS_BLOCKS(sb);
+		break;
+	case FAULT_FLAG_SIZE_PMD:
+		nblocks = ext4_chunk_trans_blocks(inode, PMD_SIZE / PAGE_SIZE);
+		break;
+	default:
+		return VM_FAULT_FALLBACK;
+	}
 
 	sb_start_pagefault(inode->i_sb);
 	file_update_time(vma->vm_file);
-	down_read(&EXT4_I(inode)->i_mmap_sem);
-	err = dax_mkwrite(vma, vmf, ext4_dax_mmap_get_block, NULL);
-	up_read(&EXT4_I(inode)->i_mmap_sem);
-	sb_end_pagefault(inode->i_sb);
-
-	return err;
-}
 
-/*
- * Handle write fault for VM_MIXEDMAP mappings. Similarly to ext4_dax_mkwrite()
- * handler we check for races agaist truncate. Note that since we cycle through
- * i_mmap_sem, we are sure that also any hole punching that began before we
- * were called is finished by now and so if it included part of the file we
- * are working on, our pte will get unmapped and the check for pte_same() in
- * wp_pfn_shared() fails. Thus fault gets retried and things work out as
- * desired.
- */
-static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma,
-				struct vm_fault *vmf)
-{
-	struct inode *inode = file_inode(vma->vm_file);
-	struct super_block *sb = inode->i_sb;
-	loff_t size;
-	int ret;
+	handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE, nblocks);
+	if (IS_ERR(handle)) {
+		result = VM_FAULT_SIGBUS;
+	} else {
+		down_write(&EXT4_I(inode)->i_mmap_sem);
+		result = dax_mkwrite(vma, vmf, ext4_dax_mmap_get_block, NULL);
+		up_write(&EXT4_I(inode)->i_mmap_sem);
+		ext4_journal_stop(handle);
+	}
 
-	sb_start_pagefault(sb);
-	file_update_time(vma->vm_file);
-	down_read(&EXT4_I(inode)->i_mmap_sem);
-	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	if (vmf->pgoff >= size)
-		ret = VM_FAULT_SIGBUS;
-	else
-		ret = dax_pfn_mkwrite(vma, vmf);
-	up_read(&EXT4_I(inode)->i_mmap_sem);
-	sb_end_pagefault(sb);
+	sb_end_pagefault(inode->i_sb);
 
-	return ret;
+	return result;
 }
 
 static const struct vm_operations_struct ext4_dax_vm_ops = {
 	.fault		= ext4_dax_fault,
-	.huge_fault	= ext4_dax_fault,
 	.page_mkwrite	= ext4_dax_mkwrite,
-	.pfn_mkwrite	= ext4_dax_pfn_mkwrite,
 };
 #else
 #define ext4_dax_vm_ops	ext4_file_vm_ops
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 6db703b..f51f09a 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1517,22 +1517,25 @@ xfs_filemap_page_mkwrite(
 	struct vm_fault		*vmf)
 {
 	struct inode		*inode = file_inode(vma->vm_file);
+	struct xfs_inode	*ip = XFS_I(inode);
 	int			ret;
 
-	trace_xfs_filemap_page_mkwrite(XFS_I(inode));
+	trace_xfs_filemap_page_mkwrite(ip);
 
 	sb_start_pagefault(inode->i_sb);
 	file_update_time(vma->vm_file);
-	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
 	if (IS_DAX(inode)) {
+		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
 		ret = dax_mkwrite(vma, vmf, xfs_get_blocks_dax_fault, NULL);
+		xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
 	} else {
+		xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
 		ret = block_page_mkwrite(vma, vmf, xfs_get_blocks);
 		ret = block_page_mkwrite_return(ret);
+		xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
 	}
 
-	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 	sb_end_pagefault(inode->i_sb);
 
 	return ret;
@@ -1544,15 +1547,17 @@ xfs_filemap_fault(
 	struct vm_fault		*vmf)
 {
 	struct inode		*inode = file_inode(vma->vm_file);
+	struct xfs_inode	*ip = XFS_I(inode);
 	int			ret;
 
-	trace_xfs_filemap_fault(XFS_I(inode));
+	trace_xfs_filemap_fault(ip);
 
-	/* DAX can shortcut the normal fault path on write faults! */
-	if ((vmf->flags & FAULT_FLAG_WRITE) && IS_DAX(inode))
-		return xfs_filemap_page_mkwrite(vma, vmf);
+	if (IS_DAX(inode) && vmf->flags & FAULT_FLAG_WRITE) {
+		sb_start_pagefault(inode->i_sb);
+		file_update_time(vma->vm_file);
+	}
 
-	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
+	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
 	if (IS_DAX(inode)) {
 		/*
 		 * we do not want to trigger unwritten extent conversion on read
@@ -1563,88 +1568,18 @@ xfs_filemap_fault(
 		ret = dax_fault(vma, vmf, xfs_get_blocks_dax_fault, NULL);
 	} else
 		ret = filemap_fault(vma, vmf);
-	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
-
-	return ret;
-}
-
-/*
- * Similar to xfs_filemap_fault(), the DAX fault path can call into here on
- * both read and write faults. Hence we need to handle both cases. There is no
- * ->huge_mkwrite callout for huge pages, so we have a single function here to
- * handle both cases here. @flags carries the information on the type of fault
- * occuring.
- */
-STATIC int
-xfs_filemap_huge_fault(
-	struct vm_area_struct	*vma,
-	struct vm_fault		*vmf)
-{
-	struct inode		*inode = file_inode(vma->vm_file);
-	struct xfs_inode	*ip = XFS_I(inode);
-	int			ret;
-
-	if (!IS_DAX(inode))
-		return VM_FAULT_FALLBACK;
-
-	trace_xfs_filemap_huge_fault(ip);
-
-	if (vmf->flags & FAULT_FLAG_WRITE) {
-		sb_start_pagefault(inode->i_sb);
-		file_update_time(vma->vm_file);
-	}
-
-	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
-	ret = dax_fault(vma, vmf, xfs_get_blocks_dax_fault, NULL);
-	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
+	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
 
-	if (vmf->flags & FAULT_FLAG_WRITE)
+	if (IS_DAX(inode) && vmf->flags & FAULT_FLAG_WRITE)
 		sb_end_pagefault(inode->i_sb);
 
 	return ret;
 }
 
-/*
- * pfn_mkwrite was originally intended to ensure we capture time stamp
- * updates on write faults. In reality, it's need to serialise against
- * truncate similar to page_mkwrite. Hence we cycle the XFS_MMAPLOCK_SHARED
- * to ensure we serialise the fault barrier in place.
- */
-static int
-xfs_filemap_pfn_mkwrite(
-	struct vm_area_struct	*vma,
-	struct vm_fault		*vmf)
-{
-
-	struct inode		*inode = file_inode(vma->vm_file);
-	struct xfs_inode	*ip = XFS_I(inode);
-	int			ret = VM_FAULT_NOPAGE;
-	loff_t			size;
-
-	trace_xfs_filemap_pfn_mkwrite(ip);
-
-	sb_start_pagefault(inode->i_sb);
-	file_update_time(vma->vm_file);
-
-	/* check if the faulting page hasn't raced with truncate */
-	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
-	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	if (vmf->pgoff >= size)
-		ret = VM_FAULT_SIGBUS;
-	else if (IS_DAX(inode))
-		ret = dax_pfn_mkwrite(vma, vmf);
-	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
-	sb_end_pagefault(inode->i_sb);
-	return ret;
-
-}
-
 static const struct vm_operations_struct xfs_file_vm_ops = {
 	.fault		= xfs_filemap_fault,
-	.huge_fault	= xfs_filemap_huge_fault,
 	.map_pages	= filemap_map_pages,
 	.page_mkwrite	= xfs_filemap_page_mkwrite,
-	.pfn_mkwrite	= xfs_filemap_pfn_mkwrite,
 };
 
 STATIC int
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index fb1f3e1..3f1515f 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -687,9 +687,7 @@ DEFINE_INODE_EVENT(xfs_inode_clear_eofblocks_tag);
 DEFINE_INODE_EVENT(xfs_inode_free_eofblocks_invalid);
 
 DEFINE_INODE_EVENT(xfs_filemap_fault);
-DEFINE_INODE_EVENT(xfs_filemap_huge_fault);
 DEFINE_INODE_EVENT(xfs_filemap_page_mkwrite);
-DEFINE_INODE_EVENT(xfs_filemap_pfn_mkwrite);
 
 DECLARE_EVENT_CLASS(xfs_iref_class,
 	TP_PROTO(struct xfs_inode *ip, unsigned long caller_ip),
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 8e58c36..b9e745f 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -12,8 +12,8 @@ int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
 		dax_iodone_t);
-int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
-#define dax_mkwrite(vma, vmf, gb, iod)		dax_fault(vma, vmf, gb, iod)
+int dax_mkwrite(struct vm_area_struct *, struct vm_fault *, get_block_t,
+		dax_iodone_t);
 
 static inline bool vma_is_dax(struct vm_area_struct *vma)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b9d0979..eac1aeb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -281,16 +281,12 @@ struct vm_operations_struct {
 	void (*close)(struct vm_area_struct * area);
 	int (*mremap)(struct vm_area_struct * area);
 	int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
-	int (*huge_fault)(struct vm_area_struct *, struct vm_fault *vmf);
 	void (*map_pages)(struct vm_area_struct *vma, struct vm_fault *vmf);
 
 	/* notification that a previously read-only page is about to become
 	 * writable, if an error is returned it will cause a SIGBUS */
 	int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
 
-	/* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */
-	int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
-
 	/* called by access_process_vm when get_user_pages() fails, typically
 	 * for use by special VMAs that can switch between memory and hardware
 	 */
diff --git a/mm/memory.c b/mm/memory.c
index 03d49eb..0af34e2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2210,42 +2210,6 @@ oom:
 	return VM_FAULT_OOM;
 }
 
-/*
- * Handle write page faults for VM_MIXEDMAP or VM_PFNMAP for a VM_SHARED
- * mapping
- */
-static int wp_pfn_shared(struct mm_struct *mm,
-			struct vm_area_struct *vma, unsigned long address,
-			pte_t *page_table, spinlock_t *ptl, pte_t orig_pte,
-			pmd_t *pmd)
-{
-	if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
-		struct vm_fault vmf = {
-			.page = NULL,
-			.pgoff = linear_page_index(vma, address),
-			.virtual_address = (void __user *)(address & PAGE_MASK),
-			.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
-		};
-		int ret;
-
-		pte_unmap_unlock(page_table, ptl);
-		ret = vma->vm_ops->pfn_mkwrite(vma, &vmf);
-		if (ret & VM_FAULT_ERROR)
-			return ret;
-		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
-		/*
-		 * We might have raced with another page fault while we
-		 * released the pte_offset_map_lock.
-		 */
-		if (!pte_same(*page_table, orig_pte)) {
-			pte_unmap_unlock(page_table, ptl);
-			return 0;
-		}
-	}
-	return wp_page_reuse(mm, vma, address, page_table, ptl, orig_pte,
-			     NULL, 0, 0);
-}
-
 static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
 			  unsigned long address, pte_t *page_table,
 			  pmd_t *pmd, spinlock_t *ptl, pte_t orig_pte,
@@ -2324,12 +2288,13 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * VM_PFNMAP VMA.
 		 *
 		 * We should not cow pages in a shared writeable mapping.
-		 * Just mark the pages writable and/or call ops->pfn_mkwrite.
+		 * Just mark the pages writable as we can't do any dirty
+		 * accounting on raw pfn maps.
 		 */
 		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
 				     (VM_WRITE|VM_SHARED))
-			return wp_pfn_shared(mm, vma, address, page_table, ptl,
-					     orig_pte, pmd);
+			return wp_page_reuse(mm, vma, address, page_table, ptl,
+					     orig_pte, old_page, 0, 0);
 
 		pte_unmap_unlock(page_table, ptl);
 		return wp_page_copy(mm, vma, address, page_table, pmd,
@@ -3282,8 +3247,8 @@ static int create_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	if (vma_is_anonymous(vma))
 		return do_huge_pmd_anonymous_page(mm, vma, address, pmd, flags);
-	if (vma->vm_ops->huge_fault)
-		return vma->vm_ops->huge_fault(vma, &vmf);
+	if (vma->vm_ops->fault)
+		return vma->vm_ops->fault(vma, &vmf);
 	return VM_FAULT_FALLBACK;
 }
 
@@ -3299,8 +3264,8 @@ static int wp_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	if (vma_is_anonymous(vma))
 		return do_huge_pmd_wp_page(mm, vma, address, pmd, orig_pmd);
-	if (vma->vm_ops->huge_fault)
-		return vma->vm_ops->huge_fault(vma, &vmf);
+	if (vma->vm_ops->page_mkwrite)
+		return vma->vm_ops->page_mkwrite(vma, &vmf);
 	return VM_FAULT_FALLBACK;
 }
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 407ab43..0d851cb 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1490,7 +1490,7 @@ int vma_wants_writenotify(struct vm_area_struct *vma)
 		return 0;
 
 	/* The backer wishes to know when pages are first written to? */
-	if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite))
+	if (vm_ops && vm_ops->page_mkwrite)
 		return 1;
 
 	/* The open routine did something to the protections that pgprot_modify
-- 
2.7.0.rc3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] mm,fs,dax: Change ->pmd_fault to ->huge_fault
  2016-01-27  4:08 ` [PATCH 1/2] mm,fs,dax: Change ->pmd_fault to ->huge_fault Matthew Wilcox
@ 2016-01-27  5:48   ` kbuild test robot
  2016-01-27 17:47   ` Ross Zwisler
  2016-01-28 12:17   ` Jan Kara
  2 siblings, 0 replies; 13+ messages in thread
From: kbuild test robot @ 2016-01-27  5:48 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: kbuild-all, Jan Kara, Dave Chinner, Ross Zwisler, Matthew Wilcox,
	linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 1675 bytes --]

Hi Matthew,

[auto build test ERROR on v4.5-rc1]
[also build test ERROR on next-20160125]
[cannot apply to xfs/for-next]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]

url:    https://github.com/0day-ci/linux/commits/Matthew-Wilcox/mm-fs-dax-Change-pmd_fault-to-huge_fault/20160127-121155
config: openrisc-or1ksim_defconfig (attached as .config)
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=openrisc 

All error/warnings (new ones prefixed by >>):

   mm/memory.c: In function 'create_huge_pmd':
>> mm/memory.c:3254:3: error: unknown field 'pmd' specified in initializer
>> mm/memory.c:3254:3: warning: initialization makes integer from pointer without a cast
   mm/memory.c: In function 'wp_huge_pmd':
   mm/memory.c:3271:3: error: unknown field 'pmd' specified in initializer
   mm/memory.c:3271:3: warning: initialization makes integer from pointer without a cast

vim +/pmd +3254 mm/memory.c

  3248	static int create_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
  3249				unsigned long address, pmd_t *pmd, unsigned int flags)
  3250	{
  3251		struct vm_fault vmf = {
  3252			.virtual_address = (void __user *)address,
  3253			.flags = flags | FAULT_FLAG_SIZE_PMD,
> 3254			.pmd = pmd,
  3255		};
  3256	
  3257		if (vma_is_anonymous(vma))

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 7068 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/2] Fix dax races between page faults RFC only
       [not found] ` <CAOxpaSU_JgkeS=u61zxWTdP5hXymBkUsvkjkwNzm6XVig9y8RQ@mail.gmail.com>
@ 2016-01-27  6:18   ` Matthew Wilcox
  2016-02-01 21:25     ` Ross Zwisler
  0 siblings, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2016-01-27  6:18 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Matthew Wilcox, Jan Kara, Dave Chinner, Ross Zwisler, linux-fsdevel

On Tue, Jan 26, 2016 at 10:50:59PM -0700, Ross Zwisler wrote:
> On Tuesday, January 26, 2016, Matthew Wilcox <matthew.r.wilcox@intel.com>
> wrote:
> > If I understand the current state of the code correctly, truncate can't
> > race with the fault handler, so the re-checks we do of i_size are now
> > dead code, which can be deleted. right?
> 
> Yep, I think so.  I think we might be able to delete all the i_mmap locking
> in dax.c as well, now that the isolation from truncate all happens at the
> filesystem level.

No; we need to keep holding the lock against truncate all the way back
down into the MM code.  Consider the cow_page case: if truncate comes in
after we drop the filesystem lock and before the COWed page is inserted
into the page table, truncate won't see the page and so can't remove it.
And truncate is supposed to remove COWed pages as well as the pages of
the underlying file.

I'm not sure what purpose it serves in dax_insert_mapping though.  Nor
the PMD fault handler.
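
Roughly the interleaving I'm worried about (hand-wavy sketch, not the
exact call chains):

	fault (cow_page case)                 truncate
	---------------------                 --------
	take fs lock
	copy the block into vmf->cow_page
	drop fs lock
	                                      update i_size
	                                      unmap_mapping_range()
	                                      (COW page not in the page
	                                       table yet, nothing to unmap)
	insert vmf->cow_page into page table
	  -> a stale COW of truncated data survives

which is why the lock against truncate has to stay held until the page
table insertion has actually happened.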

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] mm,fs,dax: Change ->pmd_fault to ->huge_fault
  2016-01-27  4:08 ` [PATCH 1/2] mm,fs,dax: Change ->pmd_fault to ->huge_fault Matthew Wilcox
  2016-01-27  5:48   ` kbuild test robot
@ 2016-01-27 17:47   ` Ross Zwisler
  2016-01-28 12:17   ` Jan Kara
  2 siblings, 0 replies; 13+ messages in thread
From: Ross Zwisler @ 2016-01-27 17:47 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jan Kara, Dave Chinner, Ross Zwisler, Matthew Wilcox, linux-fsdevel

On Tue, Jan 26, 2016 at 11:08:27PM -0500, Matthew Wilcox wrote:
> From: Matthew Wilcox <willy@linux.intel.com>
> 
> In preparation for adding the ability to handle PUD pages, convert
> ->pmd_fault to ->huge_fault.  huge_fault() takes a vm_fault structure
> instead of separate (address, pmd, flags) parameters.  The vm_fault
> structure is extended to include a union of the different page table
> pointers that may be needed, and three flag bits are reserved to indicate
> which type of pointer is in the union.
> 
> The DAX fault handlers are unified into one entry point, meaning that
> the filesystems can be largely unconcerned with what size of fault they
> are handling.  ext4 needs to know in order to reserve enough blocks in
> the journal, but ext2 and xfs are oblivious.
> 
> The existing dax_fault and dax_mkwrite had no callers, so rename the
> __dax_fault and __dax_mkwrite to lose the initial underscores.
> 
> Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
> ---
>  Documentation/filesystems/dax.txt | 12 +++--
>  fs/block_dev.c                    | 10 +---
>  fs/dax.c                          | 97 +++++++++++++--------------------------
>  fs/ext2/file.c                    | 27 ++---------
>  fs/ext4/file.c                    | 56 +++++++---------------
>  fs/xfs/xfs_file.c                 | 25 +++++-----
>  fs/xfs/xfs_trace.h                |  2 +-
>  include/linux/dax.h               | 17 -------
>  include/linux/mm.h                | 20 ++++++--
>  mm/memory.c                       | 20 ++++++--
>  10 files changed, 102 insertions(+), 184 deletions(-)
> 
> diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
> index 7bde640..2fe9e74 100644
> --- a/Documentation/filesystems/dax.txt
> +++ b/Documentation/filesystems/dax.txt
> @@ -49,6 +49,7 @@ These block devices may be used for inspiration:
>  - axonram: Axon DDR2 device driver
>  - brd: RAM backed block device driver
>  - dcssblk: s390 dcss block device driver
> +- pmem: NV-DIMM Persistent Memory driver
>  
>  
>  Implementation Tips for Filesystem Writers
> @@ -61,9 +62,9 @@ Filesystem support consists of
>    dax_do_io() instead of blockdev_direct_IO() if S_DAX is set
>  - implementing an mmap file operation for DAX files which sets the
>    VM_MIXEDMAP and VM_HUGEPAGE flags on the VMA, and setting the vm_ops to
> -  include handlers for fault, pmd_fault and page_mkwrite (which should
> -  probably call dax_fault(), dax_pmd_fault() and dax_mkwrite(), passing the
> -  appropriate get_block() callback)
> +  include handlers for fault, huge_fault and page_mkwrite (which should
> +  probably call dax_fault() and dax_mkwrite(), passing the appropriate
> +  get_block() callback)

You might also want to mention that filesystems need to add a handler for
dax_pfn_mkwrite(), which is now necessary so that we can properly track dirty
radix tree entries due to the DAX fsync/msync changes.

I also noticed that there is still a mention of __dax_fault() in the comment
block in dax_pfn_mkwrite(), which should probably just
s/__dax_fault()/dax_fault()/.

Otherwise this looks great - thanks for getting rid of the dead wrappers!

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] mm,fs,dax: Change ->pmd_fault to ->huge_fault
  2016-01-27  4:08 ` [PATCH 1/2] mm,fs,dax: Change ->pmd_fault to ->huge_fault Matthew Wilcox
  2016-01-27  5:48   ` kbuild test robot
  2016-01-27 17:47   ` Ross Zwisler
@ 2016-01-28 12:17   ` Jan Kara
  2016-01-29 14:31     ` Matthew Wilcox
  2 siblings, 1 reply; 13+ messages in thread
From: Jan Kara @ 2016-01-28 12:17 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jan Kara, Dave Chinner, Ross Zwisler, Matthew Wilcox, linux-fsdevel

On Tue 26-01-16 23:08:27, Matthew Wilcox wrote:
> From: Matthew Wilcox <willy@linux.intel.com>
> 
> In preparation for adding the ability to handle PUD pages, convert
> ->pmd_fault to ->huge_fault.  huge_fault() takes a vm_fault structure
> instead of separate (address, pmd, flags) parameters.  The vm_fault
> structure is extended to include a union of the different page table
> pointers that may be needed, and three flag bits are reserved to indicate
> which type of pointer is in the union.
> 
> The DAX fault handlers are unified into one entry point, meaning that
> the filesystems can be largely unconcerned with what size of fault they
> are handling.  ext4 needs to know in order to reserve enough blocks in
> the journal, but ext2 and xfs are oblivious.
> 
> The existing dax_fault and dax_mkwrite had no callers, so rename the
> __dax_fault and __dax_mkwrite to lose the initial underscores.
> 
> Signed-off-by: Matthew Wilcox <willy@linux.intel.com>

Thanks for these simplifications. I like them! Just one nit below:

> diff --git a/mm/memory.c b/mm/memory.c
> index a2eaeef..03d49eb 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3274,10 +3274,16 @@ out:
>  static int create_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
>  			unsigned long address, pmd_t *pmd, unsigned int flags)
>  {
> +	struct vm_fault vmf = {
> +		.virtual_address = (void __user *)address,
> +		.flags = flags | FAULT_FLAG_SIZE_PMD,
> +		.pmd = pmd,
> +	};
> +

I think we should also fill in the vmf.gfp_mask and vmf.pgoff fields. I
know they are not currently used, but it would be a nasty surprise if
someone tried to use them in the future...  Something like the sketch
below.
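
Untested sketch, just to illustrate (assuming we can reuse the
__get_fault_gfp_mask() helper the PTE fault path already uses, and that
the plain, unaligned index is what we want in ->pgoff):

	struct vm_fault vmf = {
		.virtual_address = (void __user *)address,
		.pgoff = linear_page_index(vma, address),
		.gfp_mask = __get_fault_gfp_mask(vma),
		.flags = flags | FAULT_FLAG_SIZE_PMD,
		.pmd = pmd,
	};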

Otherwise feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/2] Fix dax races between page faults RFC only
  2016-01-27  4:08 [PATCH 0/2] Fix dax races between page faults RFC only Matthew Wilcox
                   ` (2 preceding siblings ...)
       [not found] ` <CAOxpaSU_JgkeS=u61zxWTdP5hXymBkUsvkjkwNzm6XVig9y8RQ@mail.gmail.com>
@ 2016-01-28 12:48 ` Jan Kara
  3 siblings, 0 replies; 13+ messages in thread
From: Jan Kara @ 2016-01-28 12:48 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jan Kara, Dave Chinner, Ross Zwisler, Matthew Wilcox, linux-fsdevel

On Tue 26-01-16 23:08:26, Matthew Wilcox wrote:
> From: Matthew Wilcox <willy@linux.intel.com>
> 
> I haven't had time to boot these patches yet, much less split them apart
> into properly reviewable chunks.  I just wanted to get the current state
> of my tree out before Dave & Jan wake up.
> 
> The first patch you've seen before; it was patch 2/8 in the PUD patch
> series I posted on January 6th.  It seemed like a good place to start
> since it unifies a lot of fault handling.
> 
> The second patch is everything else we've talked about rolled into one.
> It looks pretty good to me; after I make sure that it boots and passes
> some smoke tests, I'll add the optimisation Jan & I discussed today
> about trying to reduce the times we have to take the allocation lock in
> exclusive/write mode.
> 
> If I understand the current state of the code correctly, truncate can't
> race with the fault handler, so the re-checks we do of i_size are now
> dead code, which can be deleted. right?

Those i_size checks are still needed. You are right that the locking
prevents DAX code from running in parallel with truncate, but truncate may
happen after we look at the VMA & PTE and before we grab the appropriate
fs lock in the fault handler. So if we didn't re-check i_size under the
lock, we would risk e.g. asking the filesystem to allocate blocks beyond
EOF, and those blocks would never be truncated. The fault itself will be
fine since once we return from the fault handler, we'll notice the PTE
changed and retry the fault, but those dangling FS blocks would be bad.
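
Schematically, the pattern we rely on is (this is basically what the
dax_*_mkwrite() paths in your patch do, with the fs lock held by the
caller):

	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
	if (vmf->pgoff >= size)
		return VM_FAULT_SIGBUS;	/* lost the race with truncate */

	error = get_block(inode, block, &bh, 1);	/* only now allocate */

Once we hold the fs lock, truncate cannot run, so the i_size we read
there is stable and get_block() can never be asked to allocate blocks
beyond the new EOF.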

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] dax: Giant hack
  2016-01-27  4:08 ` [PATCH 2/2] dax: Giant hack Matthew Wilcox
@ 2016-01-28 13:10   ` Jan Kara
  2016-01-28 21:23   ` Dave Chinner
  2016-01-29 22:29   ` Jared Hulbert
  2 siblings, 0 replies; 13+ messages in thread
From: Jan Kara @ 2016-01-28 13:10 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jan Kara, Dave Chinner, Ross Zwisler, Matthew Wilcox, linux-fsdevel

On Tue 26-01-16 23:08:28, Matthew Wilcox wrote:
> From: Matthew Wilcox <willy@linux.intel.com>
> 
> This glop of impossible-to-review code implements a number of ideas that
> need to be separated out.
> 
>  - Eliminate vm_ops->huge_fault.  The core calls ->fault instead and callers
>    who set VM_HUGEPAGE should be prepared to deal with FAULT_FLAG_SIZE_PMD
>    (and larger)
>  - Switch back to calling ->page_mkwrite instead of ->pfn_mkwrite.  DAX now
>    always has a page to lock, and no other implementations of ->pfn_mkwrite
>    exist.
>  - dax_mkwrite splits out from dax_fault.  dax_fault will now never call
>    get_block() to allocate a block; only to see if a block has been allocated.
>    dax_mkwrite will always attempt to allocate a block.
>  - Filesystems now take their DAX allocation mutex in exclusive/write mode
>    when calling dax_mkwrite.
>  - Split out dax_insert_pmd_mapping() from dax_pmd_fault and share it with
>    the new dax_pmd_mkwrite
>  - Change dax_pmd_fault() to take a vm_fault argument like the rest of the
>    family of functions.

I didn't check this in great detail, but it looks sound to me and I like
how the DAX path is now very similar to the non-DAX one and how the PMD
and PTE faults are symmetric. Two things I've noticed:

1) There's no need for freeze protection & file_update_time() calls in the
fault handler since we don't do any write operations there anymore. It is
enough to do this in the page_mkwrite() handlers (see the ext2 sketch
below).

2) Nobody uses the complete_unwritten handlers anymore so you can remove
them as a cleanup.
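
For 1), ext2 for example could then shrink to something like this
(untested sketch):

	static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
	{
		struct ext2_inode_info *ei = EXT2_I(file_inode(vma->vm_file));
		int ret;

		down_read(&ei->dax_sem);
		ret = dax_fault(vma, vmf, ext2_get_block, NULL);
		up_read(&ei->dax_sem);

		return ret;
	}

with the sb_start_pagefault() / file_update_time() pair done only in
ext2_dax_mkwrite().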

								Honza


> 
> Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
> ---
>  Documentation/filesystems/Locking |   8 -
>  Documentation/filesystems/dax.txt |   5 +-
>  fs/block_dev.c                    |  10 +-
>  fs/dax.c                          | 433 +++++++++++++++++++++++++-------------
>  fs/ext2/file.c                    |  35 +--
>  fs/ext4/file.c                    |  96 +++------
>  fs/xfs/xfs_file.c                 |  95 ++-------
>  fs/xfs/xfs_trace.h                |   2 -
>  include/linux/dax.h               |   4 +-
>  include/linux/mm.h                |   4 -
>  mm/memory.c                       |  51 +----
>  mm/mmap.c                         |   2 +-
>  12 files changed, 359 insertions(+), 386 deletions(-)
> 
> diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
> index 619af9b..1be09e7 100644
> --- a/Documentation/filesystems/Locking
> +++ b/Documentation/filesystems/Locking
> @@ -522,7 +522,6 @@ prototypes:
>  	void (*close)(struct vm_area_struct*);
>  	int (*fault)(struct vm_area_struct*, struct vm_fault *);
>  	int (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *);
> -	int (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *);
>  	int (*access)(struct vm_area_struct *, unsigned long, void*, int, int);
>  
>  locking rules:
> @@ -532,7 +531,6 @@ close:		yes
>  fault:		yes		can return with page locked
>  map_pages:	yes
>  page_mkwrite:	yes		can return with page locked
> -pfn_mkwrite:	yes
>  access:		yes
>  
>  	->fault() is called when a previously not present pte is about
> @@ -559,12 +557,6 @@ the page has been truncated, the filesystem should not look up a new page
>  like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
>  will cause the VM to retry the fault.
>  
> -	->pfn_mkwrite() is the same as page_mkwrite but when the pte is
> -VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is
> -VM_FAULT_NOPAGE. Or one of the VM_FAULT_ERROR types. The default behavior
> -after this call is to make the pte read-write, unless pfn_mkwrite returns
> -an error.
> -
>  	->access() is called when get_user_pages() fails in
>  access_process_vm(), typically used to debug a process through
>  /proc/pid/mem or ptrace.  This function is needed only for
> diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
> index 2fe9e74..ff62feb 100644
> --- a/Documentation/filesystems/dax.txt
> +++ b/Documentation/filesystems/dax.txt
> @@ -62,9 +62,8 @@ Filesystem support consists of
>    dax_do_io() instead of blockdev_direct_IO() if S_DAX is set
>  - implementing an mmap file operation for DAX files which sets the
>    VM_MIXEDMAP and VM_HUGEPAGE flags on the VMA, and setting the vm_ops to
> -  include handlers for fault, huge_fault and page_mkwrite (which should
> -  probably call dax_fault() and dax_mkwrite(), passing the appropriate
> -  get_block() callback)
> +  include handlers for fault and page_mkwrite (which should probably call
> +  dax_fault() and dax_mkwrite(), passing the appropriate get_block() callback)
>  - calling dax_truncate_page() instead of block_truncate_page() for DAX files
>  - calling dax_zero_page_range() instead of zero_user() for DAX files
>  - ensuring that there is sufficient locking between reads, writes,
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index a9474ac..78697fe 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1722,7 +1722,7 @@ static const struct address_space_operations def_blk_aops = {
>   *
>   * Finally, unlike the filemap_page_mkwrite() case there is no
>   * filesystem superblock to sync against freezing.  We still include a
> - * pfn_mkwrite callback for dax drivers to receive write fault
> + * page_mkwrite callback for dax drivers to receive write fault
>   * notifications.
>   */
>  static int blkdev_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> @@ -1730,6 +1730,11 @@ static int blkdev_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>  	return dax_fault(vma, vmf, blkdev_get_block, NULL);
>  }
>  
> +static int blkdev_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> +	return dax_mkwrite(vma, vmf, blkdev_get_block, NULL);
> +}
> +
>  static void blkdev_vm_open(struct vm_area_struct *vma)
>  {
>  	struct inode *bd_inode = bdev_file_inode(vma->vm_file);
> @@ -1754,8 +1759,7 @@ static const struct vm_operations_struct blkdev_dax_vm_ops = {
>  	.open		= blkdev_vm_open,
>  	.close		= blkdev_vm_close,
>  	.fault		= blkdev_dax_fault,
> -	.huge_fault	= blkdev_dax_fault,
> -	.pfn_mkwrite	= blkdev_dax_fault,
> +	.page_mkwrite	= blkdev_dax_mkwrite,
>  };
>  
>  static const struct vm_operations_struct blkdev_default_vm_ops = {
> diff --git a/fs/dax.c b/fs/dax.c
> index dbaf62c..952c2c2 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -372,7 +372,7 @@ static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
>  
>  	if (sector == NO_SECTOR) {
>  		/*
> -		 * This can happen during correct operation if our pfn_mkwrite
> +		 * This can happen during correct operation if our page_mkwrite
>  		 * fault raced against a hole punch operation.  If this
>  		 * happens the pte that was hole punched will have been
>  		 * unmapped and the radix tree entry will have been removed by
> @@ -584,7 +584,6 @@ static int dax_pte_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  	sector_t block;
>  	pgoff_t size;
>  	int error;
> -	int major = 0;
>  
>  	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
>  	if (vmf->pgoff >= size)
> @@ -624,20 +623,8 @@ static int dax_pte_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  	if (error)
>  		goto unlock_page;
>  
> -	if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page) {
> -		if (vmf->flags & FAULT_FLAG_WRITE) {
> -			error = get_block(inode, block, &bh, 1);
> -			count_vm_event(PGMAJFAULT);
> -			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
> -			major = VM_FAULT_MAJOR;
> -			if (!error && (bh.b_size < PAGE_SIZE))
> -				error = -EIO;
> -			if (error)
> -				goto unlock_page;
> -		} else {
> -			return dax_load_hole(mapping, page, vmf);
> -		}
> -	}
> +	if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page)
> +		return dax_load_hole(mapping, page, vmf);
>  
>  	if (vmf->cow_page) {
>  		struct page *new_page = vmf->cow_page;
> @@ -655,16 +642,101 @@ static int dax_pte_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  								PAGE_SHIFT;
>  			if (vmf->pgoff >= size) {
>  				i_mmap_unlock_read(mapping);
> -				error = -EIO;
> -				goto out;
> +				return VM_FAULT_SIGBUS;
>  			}
>  		}
>  		return VM_FAULT_LOCKED;
>  	}
>  
> -	/* Check we didn't race with a read fault installing a new page */
> -	if (!page && major)
> -		page = find_lock_page(mapping, vmf->pgoff);
> +	/*
> +	 * If we successfully insert a mapping to an unwritten extent, we
> +	 * need to convert the unwritten extent.  If there is an error
> +	 * inserting the mapping, the filesystem needs to leave it as
> +	 * unwritten to prevent exposure of the stale underlying data to
> +	 * userspace, but we still need to call the completion function so
> +	 * the private resources on the mapping buffer can be released. We
> +	 * indicate what the callback should do via the uptodate variable,
> +	 * same as for normal BH based IO completions.
> +	 */
> +	error = dax_insert_mapping(inode, &bh, vma, vmf);
> +	if (buffer_unwritten(&bh)) {
> +		if (complete_unwritten)
> +			complete_unwritten(&bh, !error);
> +		else
> +			WARN_ON_ONCE(!(vmf->flags & FAULT_FLAG_WRITE));
> +	}
> +
> + out:
> +	if (error == -ENOMEM)
> +		return VM_FAULT_OOM;
> +	/* -EBUSY is fine, somebody else faulted on the same PTE */
> +	if ((error < 0) && (error != -EBUSY))
> +		return VM_FAULT_SIGBUS;
> +	return VM_FAULT_NOPAGE;
> +
> + unlock_page:
> +	if (page) {
> +		unlock_page(page);
> +		page_cache_release(page);
> +	}
> +	goto out;
> +}
> +
> +static int dax_pte_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> +			get_block_t get_block, dax_iodone_t complete_unwritten)
> +{
> +	struct file *file = vma->vm_file;
> +	struct address_space *mapping = file->f_mapping;
> +	struct inode *inode = mapping->host;
> +	struct page *page;
> +	struct buffer_head bh;
> +	unsigned blkbits = inode->i_blkbits;
> +	sector_t block;
> +	pgoff_t size;
> +	int error;
> +	int major = 0;
> +
> +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +	if (vmf->pgoff >= size)
> +		return VM_FAULT_SIGBUS;
> +
> +	memset(&bh, 0, sizeof(bh));
> +	block = (sector_t)vmf->pgoff << (PAGE_SHIFT - blkbits);
> +	bh.b_bdev = inode->i_sb->s_bdev;
> +	bh.b_size = PAGE_SIZE;
> +
> + repeat:
> +	page = find_get_page(mapping, vmf->pgoff);
> +	if (page) {
> +		if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
> +			page_cache_release(page);
> +			return VM_FAULT_RETRY;
> +		}
> +		if (unlikely(page->mapping != mapping)) {
> +			unlock_page(page);
> +			page_cache_release(page);
> +			goto repeat;
> +		}
> +		size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +		if (unlikely(vmf->pgoff >= size)) {
> +			/*
> +			 * We have a struct page covering a hole in the file
> +			 * from a read fault and we've raced with a truncate
> +			 */
> +			error = -EIO;
> +			goto unlock_page;
> +		}
> +	}
> +
> +	error = get_block(inode, block, &bh, 1);
> +	if (!error && (bh.b_size < PAGE_SIZE))
> +		error = -EIO;		/* fs corruption? */
> +	if (error)
> +		goto unlock_page;
> +
> +	count_vm_event(PGMAJFAULT);
> +	mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
> +	major = VM_FAULT_MAJOR;
>  
>  	if (page) {
>  		unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
> @@ -675,16 +747,6 @@ static int dax_pte_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		page = NULL;
>  	}
>  
> -	/*
> -	 * If we successfully insert the new mapping over an unwritten extent,
> -	 * we need to ensure we convert the unwritten extent. If there is an
> -	 * error inserting the mapping, the filesystem needs to leave it as
> -	 * unwritten to prevent exposure of the stale underlying data to
> -	 * userspace, but we still need to call the completion function so
> -	 * the private resources on the mapping buffer can be released. We
> -	 * indicate what the callback should do via the uptodate variable, same
> -	 * as for normal BH based IO completions.
> -	 */
>  	error = dax_insert_mapping(inode, &bh, vma, vmf);
>  	if (buffer_unwritten(&bh)) {
>  		if (complete_unwritten)
> @@ -734,22 +796,101 @@ static void __dax_dbg(struct buffer_head *bh, unsigned long address,
>  
>  #define dax_pmd_dbg(bh, address, reason)	__dax_dbg(bh, address, reason, "dax_pmd")
>  
> -static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> -		pmd_t *pmd, unsigned int flags, get_block_t get_block,
> -		dax_iodone_t complete_unwritten)
> +static int dax_insert_pmd_mapping(struct inode *inode, struct buffer_head *bh,
> +			struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> +	struct blk_dax_ctl dax = {
> +		.sector = to_sector(bh, inode),
> +		.size = PMD_SIZE,
> +	};
> +	struct block_device *bdev = bh->b_bdev;
> +	bool write = vmf->flags & FAULT_FLAG_WRITE;
> +	unsigned long address = (unsigned long)vmf->virtual_address;
> +	pgoff_t pgoff = linear_page_index(vma, address & PMD_MASK);
> +	int major;
> +	long length;
> +
> +	length = dax_map_atomic(bdev, &dax);
> +	if (length < 0)
> +		return VM_FAULT_SIGBUS;
> +	if (length < PMD_SIZE) {
> +		dax_pmd_dbg(bh, address, "dax-length too small");
> +		goto unmap;
> +	}
> +
> +	if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR) {
> +		dax_pmd_dbg(bh, address, "pfn unaligned");
> +		goto unmap;
> +	}
> +
> +	if (!pfn_t_devmap(dax.pfn)) {
> +		dax_pmd_dbg(bh, address, "pfn not in memmap");
> +		goto unmap;
> +	}
> +
> +	if (buffer_unwritten(bh) || buffer_new(bh)) {
> +		clear_pmem(dax.addr, PMD_SIZE);
> +		wmb_pmem();
> +		count_vm_event(PGMAJFAULT);
> +		mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
> +		major = VM_FAULT_MAJOR;
> +	} else
> +		major = 0;
> +
> +	dax_unmap_atomic(bdev, &dax);
> +
> +	/*
> +	 * For PTE faults we insert a radix tree entry for reads, and
> +	 * leave it clean.  Then on the first write we dirty the radix
> +	 * tree entry via the dax_mkwrite() path.  This sequence
> +	 * allows the dax_mkwrite() call to be simpler and avoid a
> +	 * call into get_block() to translate the pgoff to a sector in
> +	 * order to be able to create a new radix tree entry.
> +	 *
> +	 * The PMD path doesn't have an equivalent to
> +	 * dax_mkwrite(), though, so for a read followed by a
> +	 * write we traverse all the way through __dax_pmd_fault()
> +	 * twice.  This means we can just skip inserting a radix tree
> +	 * entry completely on the initial read and just wait until
> +	 * the write to insert a dirty entry.
> +	 */
> +	if (write) {
> +		int error = dax_radix_entry(vma->vm_file->f_mapping, pgoff,
> +						dax.sector, true, true);
> +		if (error) {
> +			dax_pmd_dbg(bh, address, "PMD radix insertion failed");
> +			goto fallback;
> +		}
> +	}
> +
> +	dev_dbg(part_to_dev(bdev->bd_part),
> +			"%s: %s addr: %lx pfn: %lx sect: %llx\n",
> +			__func__, current->comm, address,
> +			pfn_t_to_pfn(dax.pfn),
> +			(unsigned long long) dax.sector);
> +	return major | vmf_insert_pfn_pmd(vma, address, vmf->pmd, dax.pfn,
> +						write);
> + 
> + unmap:
> +	dax_unmap_atomic(bdev, &dax);
> + fallback:
> +	return VM_FAULT_FALLBACK;
> +}
> +
> +static int dax_pmd_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> +		get_block_t get_block, dax_iodone_t complete_unwritten)
>  {
>  	struct file *file = vma->vm_file;
>  	struct address_space *mapping = file->f_mapping;
>  	struct inode *inode = mapping->host;
>  	struct buffer_head bh;
>  	unsigned blkbits = inode->i_blkbits;
> +	unsigned long address = (unsigned long)vmf->virtual_address;
>  	unsigned long pmd_addr = address & PMD_MASK;
> -	bool write = flags & FAULT_FLAG_WRITE;
> -	struct block_device *bdev;
> +	bool write = vmf->flags & FAULT_FLAG_WRITE;
>  	pgoff_t size, pgoff;
>  	sector_t block;
> -	int error, result = 0;
> -	bool alloc = false;
> +	int result = 0;
>  
>  	/* dax pmd mappings require pfn_t_devmap() */
>  	if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
> @@ -757,7 +898,7 @@ static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  
>  	/* Fall back to PTEs if we're going to COW */
>  	if (write && !(vma->vm_flags & VM_SHARED)) {
> -		split_huge_pmd(vma, pmd, address);
> +		split_huge_pmd(vma, vmf->pmd, address);
>  		dax_pmd_dbg(NULL, address, "cow write");
>  		return VM_FAULT_FALLBACK;
>  	}
> @@ -791,14 +932,6 @@ static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  	if (get_block(inode, block, &bh, 0) != 0)
>  		return VM_FAULT_SIGBUS;
>  
> -	if (!buffer_mapped(&bh) && write) {
> -		if (get_block(inode, block, &bh, 1) != 0)
> -			return VM_FAULT_SIGBUS;
> -		alloc = true;
> -	}
> -
> -	bdev = bh.b_bdev;
> -
>  	/*
>  	 * If the filesystem isn't willing to tell us the length of a hole,
>  	 * just fall back to PTEs.  Calling get_block 512 times in a loop
> @@ -809,17 +942,6 @@ static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  		return VM_FAULT_FALLBACK;
>  	}
>  
> -	/*
> -	 * If we allocated new storage, make sure no process has any
> -	 * zero pages covering this hole
> -	 */
> -	if (alloc) {
> -		loff_t lstart = pgoff << PAGE_SHIFT;
> -		loff_t lend = lstart + PMD_SIZE - 1; /* inclusive */
> -
> -		truncate_pagecache_range(inode, lstart, lend);
> -	}
> -
>  	i_mmap_lock_read(mapping);
>  
>  	/*
> @@ -839,9 +961,9 @@ static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  		goto fallback;
>  	}
>  
> -	if (!write && !buffer_mapped(&bh) && buffer_uptodate(&bh)) {
> +	if (!buffer_mapped(&bh) && buffer_uptodate(&bh)) {
>  		spinlock_t *ptl;
> -		pmd_t entry;
> +		pmd_t entry, *pmd = vmf->pmd;
>  		struct page *zero_page = get_huge_zero_page();
>  
>  		if (unlikely(!zero_page)) {
> @@ -856,7 +978,7 @@ static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  			goto fallback;
>  		}
>  
> -		dev_dbg(part_to_dev(bdev->bd_part),
> +		dev_dbg(part_to_dev(bh.b_bdev->bd_part),
>  				"%s: %s addr: %lx pfn: <zero> sect: %llx\n",
>  				__func__, current->comm, address,
>  				(unsigned long long) to_sector(&bh, inode));
> @@ -867,75 +989,90 @@ static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  		result = VM_FAULT_NOPAGE;
>  		spin_unlock(ptl);
>  	} else {
> -		struct blk_dax_ctl dax = {
> -			.sector = to_sector(&bh, inode),
> -			.size = PMD_SIZE,
> -		};
> -		long length = dax_map_atomic(bdev, &dax);
> +		result |= dax_insert_pmd_mapping(inode, &bh, vma, vmf);
> +	}
>  
> -		if (length < 0) {
> -			result = VM_FAULT_SIGBUS;
> -			goto out;
> -		}
> -		if (length < PMD_SIZE) {
> -			dax_pmd_dbg(&bh, address, "dax-length too small");
> -			dax_unmap_atomic(bdev, &dax);
> -			goto fallback;
> -		}
> -		if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR) {
> -			dax_pmd_dbg(&bh, address, "pfn unaligned");
> -			dax_unmap_atomic(bdev, &dax);
> -			goto fallback;
> -		}
> + out:
> +	i_mmap_unlock_read(mapping);
>  
> -		if (!pfn_t_devmap(dax.pfn)) {
> -			dax_unmap_atomic(bdev, &dax);
> -			dax_pmd_dbg(&bh, address, "pfn not in memmap");
> -			goto fallback;
> -		}
> +	if (buffer_unwritten(&bh))
> +		complete_unwritten(&bh, !(result & VM_FAULT_ERROR));
>  
> -		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
> -			clear_pmem(dax.addr, PMD_SIZE);
> -			wmb_pmem();
> -			count_vm_event(PGMAJFAULT);
> -			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
> -			result |= VM_FAULT_MAJOR;
> -		}
> -		dax_unmap_atomic(bdev, &dax);
> +	return result;
>  
> -		/*
> -		 * For PTE faults we insert a radix tree entry for reads, and
> -		 * leave it clean.  Then on the first write we dirty the radix
> -		 * tree entry via the dax_pfn_mkwrite() path.  This sequence
> -		 * allows the dax_pfn_mkwrite() call to be simpler and avoid a
> -		 * call into get_block() to translate the pgoff to a sector in
> -		 * order to be able to create a new radix tree entry.
> -		 *
> -		 * The PMD path doesn't have an equivalent to
> -		 * dax_pfn_mkwrite(), though, so for a read followed by a
> -		 * write we traverse all the way through __dax_pmd_fault()
> -		 * twice.  This means we can just skip inserting a radix tree
> -		 * entry completely on the initial read and just wait until
> -		 * the write to insert a dirty entry.
> -		 */
> -		if (write) {
> -			error = dax_radix_entry(mapping, pgoff, dax.sector,
> -					true, true);
> -			if (error) {
> -				dax_pmd_dbg(&bh, address,
> -						"PMD radix insertion failed");
> -				goto fallback;
> -			}
> -		}
> + fallback:
> +	count_vm_event(THP_FAULT_FALLBACK);
> +	result = VM_FAULT_FALLBACK;
> +	goto out;
> +}
>  
> -		dev_dbg(part_to_dev(bdev->bd_part),
> -				"%s: %s addr: %lx pfn: %lx sect: %llx\n",
> -				__func__, current->comm, address,
> -				pfn_t_to_pfn(dax.pfn),
> -				(unsigned long long) dax.sector);
> -		result |= vmf_insert_pfn_pmd(vma, address, pmd,
> -				dax.pfn, write);
> +static int dax_pmd_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> +		get_block_t get_block, dax_iodone_t complete_unwritten)
> +{
> +	struct file *file = vma->vm_file;
> +	struct address_space *mapping = file->f_mapping;
> +	struct inode *inode = mapping->host;
> +	unsigned long address = (unsigned long)vmf->virtual_address;
> +	struct buffer_head bh;
> +	unsigned blkbits = inode->i_blkbits;
> +	unsigned long pmd_addr = address & PMD_MASK;
> +	struct block_device *bdev;
> +	pgoff_t size, pgoff;
> +	loff_t lstart, lend;
> +	sector_t block;
> +	int result = 0;
> +
> +	pgoff = linear_page_index(vma, pmd_addr);
> +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +	if (pgoff >= size)
> +		return VM_FAULT_SIGBUS;
> +
> +	memset(&bh, 0, sizeof(bh));
> +	bh.b_bdev = inode->i_sb->s_bdev;
> +	block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
> +
> +	bh.b_size = PMD_SIZE;
> +
> +	if (get_block(inode, block, &bh, 1) != 0)
> +		return VM_FAULT_SIGBUS;
> +
> +	bdev = bh.b_bdev;
> +
> +	/*
> +	 * If the filesystem isn't willing to tell us the length of a hole,
> +	 * just fall back to PTEs.  Calling get_block 512 times in a loop
> +	 * would be silly.
> +	 */
> +	if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE) {
> +		dax_pmd_dbg(&bh, address, "allocated block too small");
> +		return VM_FAULT_FALLBACK;
> +	}
> +
> +	/* Make sure no process has any zero pages covering this hole */
> +	lstart = pgoff << PAGE_SHIFT;
> +	lend = lstart + PMD_SIZE - 1; /* inclusive */
> +	truncate_pagecache_range(inode, lstart, lend);
> +
> +	i_mmap_lock_read(mapping);
> +
> +	/*
> +	 * If a truncate happened while we were allocating blocks, we may
> +	 * leave blocks allocated to the file that are beyond EOF.  We can't
> +	 * take i_mutex here, so just leave them hanging; they'll be freed
> +	 * when the file is deleted.
> +	 */
> +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +	if (pgoff >= size) {
> +		result = VM_FAULT_SIGBUS;
> +		goto out;
>  	}
> +	if ((pgoff | PG_PMD_COLOUR) >= size) {
> +		dax_pmd_dbg(&bh, address,
> +				"offset + huge page size > file size");
> +		goto fallback;
> +	}
> +
> +	result |= dax_insert_pmd_mapping(inode, &bh, vma, vmf);
>  
>   out:
>  	i_mmap_unlock_read(mapping);
> @@ -951,9 +1088,13 @@ static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  	goto out;
>  }
>  #else /* !CONFIG_TRANSPARENT_HUGEPAGE */
> -static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> -		pmd_t *pmd, unsigned int flags, get_block_t get_block,
> -		dax_iodone_t complete_unwritten)
> +static int dax_pmd_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> +		get_block_t get_block, dax_iodone_t complete_unwritten)
> +{
> +	return VM_FAULT_FALLBACK;
> +}
> +static int dax_pmd_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> +			get_block_t get_block, dax_iodone_t complete_unwritten)
>  {
>  	return VM_FAULT_FALLBACK;
>  }
> @@ -978,13 +1119,11 @@ static int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		get_block_t get_block, dax_iodone_t iodone)
>  {
> -	unsigned long address = (unsigned long)vmf->virtual_address;
>  	switch (vmf->flags & FAULT_FLAG_SIZE_MASK) {
>  	case FAULT_FLAG_SIZE_PTE:
>  		return dax_pte_fault(vma, vmf, get_block, iodone);
>  	case FAULT_FLAG_SIZE_PMD:
> -		return dax_pmd_fault(vma, address, vmf->pmd, vmf->flags,
> -						get_block, iodone);
> +		return dax_pmd_fault(vma, vmf, get_block, iodone);
>  	default:
>  		return VM_FAULT_FALLBACK;
>  	}
> @@ -992,26 +1131,30 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  EXPORT_SYMBOL_GPL(dax_fault);
>  
>  /**
> - * dax_pfn_mkwrite - handle first write to DAX page
> + * dax_mkwrite - handle first write to a DAX page
>   * @vma: The virtual memory area where the fault occurred
>   * @vmf: The description of the fault
> + * @get_block: The filesystem method used to translate file offsets to blocks
> + * @iodone: The filesystem method used to convert unwritten blocks
> + *	to written so the data written to them is exposed.  This is required
> + *	by write faults for filesystems that will return unwritten extent
> + *	mappings from @get_block, but it is optional for reads as
> + *	dax_insert_mapping() will always zero unwritten blocks.  If the fs
> + *	does not support unwritten extents, then it should pass NULL.
>   */
> -int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> +int dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> +		get_block_t get_block, dax_iodone_t iodone)
>  {
> -	struct file *file = vma->vm_file;
> -
> -	/*
> -	 * We pass NO_SECTOR to dax_radix_entry() because we expect that a
> -	 * RADIX_DAX_PTE entry already exists in the radix tree from a
> -	 * previous call to __dax_fault().  We just want to look up that PTE
> -	 * entry using vmf->pgoff and make sure the dirty tag is set.  This
> -	 * saves us from having to make a call to get_block() here to look
> -	 * up the sector.
> -	 */
> -	dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
> -	return VM_FAULT_NOPAGE;
> +	switch (vmf->flags & FAULT_FLAG_SIZE_MASK) {
> +	case FAULT_FLAG_SIZE_PTE:
> +		return dax_pte_mkwrite(vma, vmf, get_block, iodone);
> +	case FAULT_FLAG_SIZE_PMD:
> +		return dax_pmd_mkwrite(vma, vmf, get_block, iodone);
> +	default:
> +		return VM_FAULT_FALLBACK;
> +	}
>  }
> -EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
> +EXPORT_SYMBOL_GPL(dax_mkwrite);
>  
>  /**
>   * dax_zero_page_range - zero a range within a page of a DAX file
> diff --git a/fs/ext2/file.c b/fs/ext2/file.c
> index cf6f78c..6028c63 100644
> --- a/fs/ext2/file.c
> +++ b/fs/ext2/file.c
> @@ -49,13 +49,14 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>  		sb_start_pagefault(inode->i_sb);
>  		file_update_time(vma->vm_file);
>  	}
> -	down_read(&ei->dax_sem);
>  
> +	down_read(&ei->dax_sem);
>  	ret = dax_fault(vma, vmf, ext2_get_block, NULL);
> -
>  	up_read(&ei->dax_sem);
> +
>  	if (vmf->flags & FAULT_FLAG_WRITE)
>  		sb_end_pagefault(inode->i_sb);
> +
>  	return ret;
>  }
>  
> @@ -67,44 +68,18 @@ static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  
>  	sb_start_pagefault(inode->i_sb);
>  	file_update_time(vma->vm_file);
> -	down_read(&ei->dax_sem);
>  
> +	down_write(&ei->dax_sem);
>  	ret = dax_mkwrite(vma, vmf, ext2_get_block, NULL);
> +	up_write(&ei->dax_sem);
>  
> -	up_read(&ei->dax_sem);
> -	sb_end_pagefault(inode->i_sb);
> -	return ret;
> -}
> -
> -static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
> -		struct vm_fault *vmf)
> -{
> -	struct inode *inode = file_inode(vma->vm_file);
> -	struct ext2_inode_info *ei = EXT2_I(inode);
> -	loff_t size;
> -	int ret;
> -
> -	sb_start_pagefault(inode->i_sb);
> -	file_update_time(vma->vm_file);
> -	down_read(&ei->dax_sem);
> -
> -	/* check that the faulting page hasn't raced with truncate */
> -	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> -	if (vmf->pgoff >= size)
> -		ret = VM_FAULT_SIGBUS;
> -	else
> -		ret = dax_pfn_mkwrite(vma, vmf);
> -
> -	up_read(&ei->dax_sem);
>  	sb_end_pagefault(inode->i_sb);
>  	return ret;
>  }
>  
>  static const struct vm_operations_struct ext2_dax_vm_ops = {
>  	.fault		= ext2_dax_fault,
> -	.huge_fault	= ext2_dax_fault,
>  	.page_mkwrite	= ext2_dax_mkwrite,
> -	.pfn_mkwrite	= ext2_dax_pfn_mkwrite,
>  };
>  
>  static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 71859ed..72dcece 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -196,99 +196,65 @@ out:
>  static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
>  	int result;
> -	handle_t *handle = NULL;
>  	struct inode *inode = file_inode(vma->vm_file);
>  	struct super_block *sb = inode->i_sb;
>  	bool write = vmf->flags & FAULT_FLAG_WRITE;
>  
>  	if (write) {
> -		unsigned nblocks;
> -		switch (vmf->flags & FAULT_FLAG_SIZE_MASK) {
> -		case FAULT_FLAG_SIZE_PTE:
> -			nblocks = EXT4_DATA_TRANS_BLOCKS(sb);
> -			break;
> -		case FAULT_FLAG_SIZE_PMD:
> -			nblocks = ext4_chunk_trans_blocks(inode,
> -						PMD_SIZE / PAGE_SIZE);
> -			break;
> -		default:
> -			return VM_FAULT_FALLBACK;
> -		}
> -
>  		sb_start_pagefault(sb);
>  		file_update_time(vma->vm_file);
> -		down_read(&EXT4_I(inode)->i_mmap_sem);
> -		handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE, nblocks);
> -	} else
> -		down_read(&EXT4_I(inode)->i_mmap_sem);
> +	}
>  
> -	if (IS_ERR(handle))
> -		result = VM_FAULT_SIGBUS;
> -	else
> -		result = dax_fault(vma, vmf, ext4_dax_mmap_get_block, NULL);
> +	down_read(&EXT4_I(inode)->i_mmap_sem);
> +	result = dax_fault(vma, vmf, ext4_dax_mmap_get_block, NULL);
> +	up_read(&EXT4_I(inode)->i_mmap_sem);
>  
> -	if (write) {
> -		if (!IS_ERR(handle))
> -			ext4_journal_stop(handle);
> -		up_read(&EXT4_I(inode)->i_mmap_sem);
> +	if (write)
>  		sb_end_pagefault(sb);
> -	} else
> -		up_read(&EXT4_I(inode)->i_mmap_sem);
>  
>  	return result;
>  }
>  
>  static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
> -	int err;
> +	int result;
>  	struct inode *inode = file_inode(vma->vm_file);
> +	struct super_block *sb = inode->i_sb;
> +	handle_t *handle;
> +	unsigned nblocks;
> +
> +	switch (vmf->flags & FAULT_FLAG_SIZE_MASK) {
> +	case FAULT_FLAG_SIZE_PTE:
> +		nblocks = EXT4_DATA_TRANS_BLOCKS(sb);
> +		break;
> +	case FAULT_FLAG_SIZE_PMD:
> +		nblocks = ext4_chunk_trans_blocks(inode, PMD_SIZE / PAGE_SIZE);
> +		break;
> +	default:
> +		return VM_FAULT_FALLBACK;
> +	}
>  
>  	sb_start_pagefault(inode->i_sb);
>  	file_update_time(vma->vm_file);
> -	down_read(&EXT4_I(inode)->i_mmap_sem);
> -	err = dax_mkwrite(vma, vmf, ext4_dax_mmap_get_block, NULL);
> -	up_read(&EXT4_I(inode)->i_mmap_sem);
> -	sb_end_pagefault(inode->i_sb);
> -
> -	return err;
> -}
>  
> -/*
> - * Handle write fault for VM_MIXEDMAP mappings. Similarly to ext4_dax_mkwrite()
> - * handler we check for races agaist truncate. Note that since we cycle through
> - * i_mmap_sem, we are sure that also any hole punching that began before we
> - * were called is finished by now and so if it included part of the file we
> - * are working on, our pte will get unmapped and the check for pte_same() in
> - * wp_pfn_shared() fails. Thus fault gets retried and things work out as
> - * desired.
> - */
> -static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma,
> -				struct vm_fault *vmf)
> -{
> -	struct inode *inode = file_inode(vma->vm_file);
> -	struct super_block *sb = inode->i_sb;
> -	loff_t size;
> -	int ret;
> +	handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE, nblocks);
> +	if (IS_ERR(handle)) {
> +		result = VM_FAULT_SIGBUS;
> +	} else {
> +		down_write(&EXT4_I(inode)->i_mmap_sem);
> +		result = dax_mkwrite(vma, vmf, ext4_dax_mmap_get_block, NULL);
> +		up_write(&EXT4_I(inode)->i_mmap_sem);
> +		ext4_journal_stop(handle);
> +	}
>  
> -	sb_start_pagefault(sb);
> -	file_update_time(vma->vm_file);
> -	down_read(&EXT4_I(inode)->i_mmap_sem);
> -	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> -	if (vmf->pgoff >= size)
> -		ret = VM_FAULT_SIGBUS;
> -	else
> -		ret = dax_pfn_mkwrite(vma, vmf);
> -	up_read(&EXT4_I(inode)->i_mmap_sem);
> -	sb_end_pagefault(sb);
> +	sb_end_pagefault(inode->i_sb);
>  
> -	return ret;
> +	return result;
>  }
>  
>  static const struct vm_operations_struct ext4_dax_vm_ops = {
>  	.fault		= ext4_dax_fault,
> -	.huge_fault	= ext4_dax_fault,
>  	.page_mkwrite	= ext4_dax_mkwrite,
> -	.pfn_mkwrite	= ext4_dax_pfn_mkwrite,
>  };
>  #else
>  #define ext4_dax_vm_ops	ext4_file_vm_ops
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 6db703b..f51f09a 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1517,22 +1517,25 @@ xfs_filemap_page_mkwrite(
>  	struct vm_fault		*vmf)
>  {
>  	struct inode		*inode = file_inode(vma->vm_file);
> +	struct xfs_inode	*ip = XFS_I(inode);
>  	int			ret;
>  
> -	trace_xfs_filemap_page_mkwrite(XFS_I(inode));
> +	trace_xfs_filemap_page_mkwrite(ip);
>  
>  	sb_start_pagefault(inode->i_sb);
>  	file_update_time(vma->vm_file);
> -	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>  
>  	if (IS_DAX(inode)) {
> +		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
>  		ret = dax_mkwrite(vma, vmf, xfs_get_blocks_dax_fault, NULL);
> +		xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
>  	} else {
> +		xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
>  		ret = block_page_mkwrite(vma, vmf, xfs_get_blocks);
>  		ret = block_page_mkwrite_return(ret);
> +		xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
>  	}
>  
> -	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>  	sb_end_pagefault(inode->i_sb);
>  
>  	return ret;
> @@ -1544,15 +1547,17 @@ xfs_filemap_fault(
>  	struct vm_fault		*vmf)
>  {
>  	struct inode		*inode = file_inode(vma->vm_file);
> +	struct xfs_inode	*ip = XFS_I(inode);
>  	int			ret;
>  
> -	trace_xfs_filemap_fault(XFS_I(inode));
> +	trace_xfs_filemap_fault(ip);
>  
> -	/* DAX can shortcut the normal fault path on write faults! */
> -	if ((vmf->flags & FAULT_FLAG_WRITE) && IS_DAX(inode))
> -		return xfs_filemap_page_mkwrite(vma, vmf);
> +	if (IS_DAX(inode) && vmf->flags & FAULT_FLAG_WRITE) {
> +		sb_start_pagefault(inode->i_sb);
> +		file_update_time(vma->vm_file);
> +	}
>  
> -	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> +	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
>  	if (IS_DAX(inode)) {
>  		/*
>  		 * we do not want to trigger unwritten extent conversion on read
> @@ -1563,88 +1568,18 @@ xfs_filemap_fault(
>  		ret = dax_fault(vma, vmf, xfs_get_blocks_dax_fault, NULL);
>  	} else
>  		ret = filemap_fault(vma, vmf);
> -	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> -
> -	return ret;
> -}
> -
> -/*
> - * Similar to xfs_filemap_fault(), the DAX fault path can call into here on
> - * both read and write faults. Hence we need to handle both cases. There is no
> - * ->huge_mkwrite callout for huge pages, so we have a single function here to
> - * handle both cases here. @flags carries the information on the type of fault
> - * occuring.
> - */
> -STATIC int
> -xfs_filemap_huge_fault(
> -	struct vm_area_struct	*vma,
> -	struct vm_fault		*vmf)
> -{
> -	struct inode		*inode = file_inode(vma->vm_file);
> -	struct xfs_inode	*ip = XFS_I(inode);
> -	int			ret;
> -
> -	if (!IS_DAX(inode))
> -		return VM_FAULT_FALLBACK;
> -
> -	trace_xfs_filemap_huge_fault(ip);
> -
> -	if (vmf->flags & FAULT_FLAG_WRITE) {
> -		sb_start_pagefault(inode->i_sb);
> -		file_update_time(vma->vm_file);
> -	}
> -
> -	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> -	ret = dax_fault(vma, vmf, xfs_get_blocks_dax_fault, NULL);
> -	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> +	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
>  
> -	if (vmf->flags & FAULT_FLAG_WRITE)
> +	if (IS_DAX(inode) && vmf->flags & FAULT_FLAG_WRITE)
>  		sb_end_pagefault(inode->i_sb);
>  
>  	return ret;
>  }
>  
> -/*
> - * pfn_mkwrite was originally intended to ensure we capture time stamp
> - * updates on write faults. In reality, it's need to serialise against
> - * truncate similar to page_mkwrite. Hence we cycle the XFS_MMAPLOCK_SHARED
> - * to ensure we serialise the fault barrier in place.
> - */
> -static int
> -xfs_filemap_pfn_mkwrite(
> -	struct vm_area_struct	*vma,
> -	struct vm_fault		*vmf)
> -{
> -
> -	struct inode		*inode = file_inode(vma->vm_file);
> -	struct xfs_inode	*ip = XFS_I(inode);
> -	int			ret = VM_FAULT_NOPAGE;
> -	loff_t			size;
> -
> -	trace_xfs_filemap_pfn_mkwrite(ip);
> -
> -	sb_start_pagefault(inode->i_sb);
> -	file_update_time(vma->vm_file);
> -
> -	/* check if the faulting page hasn't raced with truncate */
> -	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
> -	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> -	if (vmf->pgoff >= size)
> -		ret = VM_FAULT_SIGBUS;
> -	else if (IS_DAX(inode))
> -		ret = dax_pfn_mkwrite(vma, vmf);
> -	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
> -	sb_end_pagefault(inode->i_sb);
> -	return ret;
> -
> -}
> -
>  static const struct vm_operations_struct xfs_file_vm_ops = {
>  	.fault		= xfs_filemap_fault,
> -	.huge_fault	= xfs_filemap_huge_fault,
>  	.map_pages	= filemap_map_pages,
>  	.page_mkwrite	= xfs_filemap_page_mkwrite,
> -	.pfn_mkwrite	= xfs_filemap_pfn_mkwrite,
>  };
>  
>  STATIC int
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index fb1f3e1..3f1515f 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -687,9 +687,7 @@ DEFINE_INODE_EVENT(xfs_inode_clear_eofblocks_tag);
>  DEFINE_INODE_EVENT(xfs_inode_free_eofblocks_invalid);
>  
>  DEFINE_INODE_EVENT(xfs_filemap_fault);
> -DEFINE_INODE_EVENT(xfs_filemap_huge_fault);
>  DEFINE_INODE_EVENT(xfs_filemap_page_mkwrite);
> -DEFINE_INODE_EVENT(xfs_filemap_pfn_mkwrite);
>  
>  DECLARE_EVENT_CLASS(xfs_iref_class,
>  	TP_PROTO(struct xfs_inode *ip, unsigned long caller_ip),
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 8e58c36..b9e745f 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -12,8 +12,8 @@ int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
>  int dax_truncate_page(struct inode *, loff_t from, get_block_t);
>  int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
>  		dax_iodone_t);
> -int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
> -#define dax_mkwrite(vma, vmf, gb, iod)		dax_fault(vma, vmf, gb, iod)
> +int dax_mkwrite(struct vm_area_struct *, struct vm_fault *, get_block_t,
> +		dax_iodone_t);
>  
>  static inline bool vma_is_dax(struct vm_area_struct *vma)
>  {
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b9d0979..eac1aeb 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -281,16 +281,12 @@ struct vm_operations_struct {
>  	void (*close)(struct vm_area_struct * area);
>  	int (*mremap)(struct vm_area_struct * area);
>  	int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
> -	int (*huge_fault)(struct vm_area_struct *, struct vm_fault *vmf);
>  	void (*map_pages)(struct vm_area_struct *vma, struct vm_fault *vmf);
>  
>  	/* notification that a previously read-only page is about to become
>  	 * writable, if an error is returned it will cause a SIGBUS */
>  	int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
>  
> -	/* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */
> -	int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
> -
>  	/* called by access_process_vm when get_user_pages() fails, typically
>  	 * for use by special VMAs that can switch between memory and hardware
>  	 */
> diff --git a/mm/memory.c b/mm/memory.c
> index 03d49eb..0af34e2 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2210,42 +2210,6 @@ oom:
>  	return VM_FAULT_OOM;
>  }
>  
> -/*
> - * Handle write page faults for VM_MIXEDMAP or VM_PFNMAP for a VM_SHARED
> - * mapping
> - */
> -static int wp_pfn_shared(struct mm_struct *mm,
> -			struct vm_area_struct *vma, unsigned long address,
> -			pte_t *page_table, spinlock_t *ptl, pte_t orig_pte,
> -			pmd_t *pmd)
> -{
> -	if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
> -		struct vm_fault vmf = {
> -			.page = NULL,
> -			.pgoff = linear_page_index(vma, address),
> -			.virtual_address = (void __user *)(address & PAGE_MASK),
> -			.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
> -		};
> -		int ret;
> -
> -		pte_unmap_unlock(page_table, ptl);
> -		ret = vma->vm_ops->pfn_mkwrite(vma, &vmf);
> -		if (ret & VM_FAULT_ERROR)
> -			return ret;
> -		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> -		/*
> -		 * We might have raced with another page fault while we
> -		 * released the pte_offset_map_lock.
> -		 */
> -		if (!pte_same(*page_table, orig_pte)) {
> -			pte_unmap_unlock(page_table, ptl);
> -			return 0;
> -		}
> -	}
> -	return wp_page_reuse(mm, vma, address, page_table, ptl, orig_pte,
> -			     NULL, 0, 0);
> -}
> -
>  static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
>  			  unsigned long address, pte_t *page_table,
>  			  pmd_t *pmd, spinlock_t *ptl, pte_t orig_pte,
> @@ -2324,12 +2288,13 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		 * VM_PFNMAP VMA.
>  		 *
>  		 * We should not cow pages in a shared writeable mapping.
> -		 * Just mark the pages writable and/or call ops->pfn_mkwrite.
> +		 * Just mark the pages writable as we can't do any dirty
> +		 * accounting on raw pfn maps.
>  		 */
>  		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
>  				     (VM_WRITE|VM_SHARED))
> -			return wp_pfn_shared(mm, vma, address, page_table, ptl,
> -					     orig_pte, pmd);
> +			return wp_page_reuse(mm, vma, address, page_table, ptl,
> +					     orig_pte, old_page, 0, 0);
>  
>  		pte_unmap_unlock(page_table, ptl);
>  		return wp_page_copy(mm, vma, address, page_table, pmd,
> @@ -3282,8 +3247,8 @@ static int create_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
>  
>  	if (vma_is_anonymous(vma))
>  		return do_huge_pmd_anonymous_page(mm, vma, address, pmd, flags);
> -	if (vma->vm_ops->huge_fault)
> -		return vma->vm_ops->huge_fault(vma, &vmf);
> +	if (vma->vm_ops->fault)
> +		return vma->vm_ops->fault(vma, &vmf);
>  	return VM_FAULT_FALLBACK;
>  }
>  
> @@ -3299,8 +3264,8 @@ static int wp_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
>  
>  	if (vma_is_anonymous(vma))
>  		return do_huge_pmd_wp_page(mm, vma, address, pmd, orig_pmd);
> -	if (vma->vm_ops->huge_fault)
> -		return vma->vm_ops->huge_fault(vma, &vmf);
> +	if (vma->vm_ops->page_mkwrite)
> +		return vma->vm_ops->page_mkwrite(vma, &vmf);
>  	return VM_FAULT_FALLBACK;
>  }
>  
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 407ab43..0d851cb 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1490,7 +1490,7 @@ int vma_wants_writenotify(struct vm_area_struct *vma)
>  		return 0;
>  
>  	/* The backer wishes to know when pages are first written to? */
> -	if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite))
> +	if (vm_ops && vm_ops->page_mkwrite)
>  		return 1;
>  
>  	/* The open routine did something to the protections that pgprot_modify
> -- 
> 2.7.0.rc3
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] dax: Giant hack
  2016-01-27  4:08 ` [PATCH 2/2] dax: Giant hack Matthew Wilcox
  2016-01-28 13:10   ` Jan Kara
@ 2016-01-28 21:23   ` Dave Chinner
  2016-01-29 22:29   ` Jared Hulbert
  2 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2016-01-28 21:23 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Jan Kara, Ross Zwisler, Matthew Wilcox, linux-fsdevel

On Tue, Jan 26, 2016 at 11:08:28PM -0500, Matthew Wilcox wrote:
> From: Matthew Wilcox <willy@linux.intel.com>
> 
> This glop of impossible-to-review code implements a number of ideas that
> need to be separated out.
> 
>  - Eliminate vm_ops->huge_fault.  The core calls ->fault instead and callers
>    who set VM_HUGEPAGE should be prepared to deal with FAULT_FLAG_SIZE_PMD
>    (and larger)
>  - Switch back to calling ->page_mkwrite instead of ->pfn_mkwrite.  DAX now
>    always has a page to lock, and no other implementations of ->pfn_mkwrite
>    exist.
>  - dax_mkwrite splits out from dax_fault.  dax_fault will now never call
>    get_block() to allocate a block; only to see if a block has been allocated.
>    dax_mkwrite will always attempt to allocate a block.
>  - Filesystems now take their DAX allocation mutex in exclusive/write mode
>    when calling dax_mkwrite.
>  - Split out dax_insert_pmd_mapping() from dax_pmd_fault and share it with
>    the new dax_pmd_mkwrite
>  - Change dax_pmd_write to take a vm_fault argument like the rest of the
>    family of functions.
> 
> Signed-off-by: Matthew Wilcox <willy@linux.intel.com>

I haven't looked really closely at the patch itself, but I like the
direction you're taking here.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
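
A minimal sketch (not from the posted patches) of the calling convention
described above: a filesystem ->page_mkwrite handler takes its DAX
allocation lock in exclusive mode, since dax_mkwrite() is now the only
path that allocates blocks, mirroring the xfs_filemap_page_mkwrite()
hunk quoted earlier in the thread.  example_get_blocks(), struct
example_sb_info and the dax_alloc_sem field are placeholders, not real
symbols from any tree.

/* Sketch only: a generic ->page_mkwrite handler under the new scheme. */
static int example_dax_mkwrite(struct vm_area_struct *vma,
			       struct vm_fault *vmf)
{
	struct inode *inode = file_inode(vma->vm_file);
	struct example_sb_info *sbi = inode->i_sb->s_fs_info;
	int ret;

	sb_start_pagefault(inode->i_sb);
	file_update_time(vma->vm_file);

	/* Exclusive, because dax_mkwrite() may allocate blocks. */
	down_write(&sbi->dax_alloc_sem);
	ret = dax_mkwrite(vma, vmf, example_get_blocks, NULL);
	up_write(&sbi->dax_alloc_sem);

	sb_end_pagefault(inode->i_sb);
	return ret;
}

The plain (read) fault path would take the same lock in shared mode,
as in the xfs_filemap_fault() hunk above.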

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] mm,fs,dax: Change ->pmd_fault to ->huge_fault
  2016-01-28 12:17   ` Jan Kara
@ 2016-01-29 14:31     ` Matthew Wilcox
  0 siblings, 0 replies; 13+ messages in thread
From: Matthew Wilcox @ 2016-01-29 14:31 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Wilcox, Dave Chinner, Ross Zwisler, linux-fsdevel

On Thu, Jan 28, 2016 at 01:17:10PM +0100, Jan Kara wrote:
> >  {
> > +	struct vm_fault vmf = {
> > +		.virtual_address = (void __user *)address,
> > +		.flags = flags | FAULT_FLAG_SIZE_PMD,
> > +		.pmd = pmd,
> > +	};
> > +
> 
> I think we should also fill in the vmf.gfp_mask and vmf.pgoff fields. I
> know they are not currently used, but it would be a nasty surprise if
> someone tried to use them in the future...

Hmm.  We probably should try to use gfp_mask and pgoff.  I'll note those for
a future patch series, and I'll initialise them now.  Thanks!
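
For concreteness, the quoted initializer with both fields filled in might
look something like the sketch below; deriving gfp_mask from the file
mapping (with a GFP_KERNEL fallback for anonymous VMAs) is an assumption
made here for illustration, not something settled in this thread.

	struct vm_fault vmf = {
		.virtual_address = (void __user *)address,
		.pgoff		= linear_page_index(vma, address),
		/* gfp choice is an assumption, not from the posted patch */
		.gfp_mask	= vma->vm_file ?
				  mapping_gfp_mask(vma->vm_file->f_mapping) :
				  GFP_KERNEL,
		.flags		= flags | FAULT_FLAG_SIZE_PMD,
		.pmd		= pmd,
	};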

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] dax: Giant hack
  2016-01-27  4:08 ` [PATCH 2/2] dax: Giant hack Matthew Wilcox
  2016-01-28 13:10   ` Jan Kara
  2016-01-28 21:23   ` Dave Chinner
@ 2016-01-29 22:29   ` Jared Hulbert
  2 siblings, 0 replies; 13+ messages in thread
From: Jared Hulbert @ 2016-01-29 22:29 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jan Kara, Dave Chinner, Ross Zwisler, Matthew Wilcox, Linux FS Devel

>  - Switch back to calling ->page_mkwrite instead of ->pfn_mkwrite.  DAX now
>    always has a page to lock,

Can you expound on this a little?

If I understood pfn_mkwrite() correctly, it was there to enable DAX to track
writes to mmapped addresses that only had a pfn and therefore no struct
page.  So where does the page come from?  Are we just skipping the pfn
case now?  What are the implications?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/2] Fix dax races between page faults RFC only
  2016-01-27  6:18   ` [PATCH 0/2] Fix dax races between page faults RFC only Matthew Wilcox
@ 2016-02-01 21:25     ` Ross Zwisler
  0 siblings, 0 replies; 13+ messages in thread
From: Ross Zwisler @ 2016-02-01 21:25 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ross Zwisler, Matthew Wilcox, Jan Kara, Dave Chinner,
	Ross Zwisler, linux-fsdevel

On Wed, Jan 27, 2016 at 01:18:30AM -0500, Matthew Wilcox wrote:
> On Tue, Jan 26, 2016 at 10:50:59PM -0700, Ross Zwisler wrote:
> > On Tuesday, January 26, 2016, Matthew Wilcox <matthew.r.wilcox@intel.com>
> > wrote:
> > > If I understand the current state of the code correctly, truncate can't
> > > race with the fault handler, so the re-checks we do of i_size are now
> > > dead code, which can be deleted. right?
> > 
> > Yep, I think so.  I think we might be able to delete all the i_mmap locking
> > in dax.c as well, now that the isolation from truncate all happens at the
> > filesystem level.
> 
> No; we need to preserve the lock against truncate back down into the
> MM code.  Consider the cow_page case; if truncate comes in after we drop
> the filesystem lock and before the COWed page is inserted into the page
> table, truncate won't see the page in order to remove it.  And truncate
> is supposed to remove COWs as well as the underlying file.

Yep, agreed.

> I'm not sure what purpose it serves in dax_insert_mapping though.  Nor
> the PMD fault handler.

Yep, AFAIK these two can be removed since they are now synchronized with
truncate via the filesystem level locking.
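
To make the window in the cow_page case concrete, here is an illustrative
timeline (editorial, not from any patch) of what can happen if nothing
serialises the faulting task against truncate between dropping the
filesystem lock and inserting the COWed page into the page table:

	faulting task                        truncate
	-------------                        --------
	take fs allocation lock (shared)
	look up block, copy data to COW page
	drop fs allocation lock
	                                     take fs allocation lock (excl)
	                                     shrink i_size
	                                     unmap_mapping_range()
	                                       (the COW page is not in the
	                                        page table yet, so it is
	                                        not found)
	insert COW page into the page table
	the mapping now covers data beyond the new EOF and is never torn down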

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2016-02-01 21:25 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-27  4:08 [PATCH 0/2] Fix dax races between page faults RFC only Matthew Wilcox
2016-01-27  4:08 ` [PATCH 1/2] mm,fs,dax: Change ->pmd_fault to ->huge_fault Matthew Wilcox
2016-01-27  5:48   ` kbuild test robot
2016-01-27 17:47   ` Ross Zwisler
2016-01-28 12:17   ` Jan Kara
2016-01-29 14:31     ` Matthew Wilcox
2016-01-27  4:08 ` [PATCH 2/2] dax: Giant hack Matthew Wilcox
2016-01-28 13:10   ` Jan Kara
2016-01-28 21:23   ` Dave Chinner
2016-01-29 22:29   ` Jared Hulbert
     [not found] ` <CAOxpaSU_JgkeS=u61zxWTdP5hXymBkUsvkjkwNzm6XVig9y8RQ@mail.gmail.com>
2016-01-27  6:18   ` [PATCH 0/2] Fix dax races between page faults RFC only Matthew Wilcox
2016-02-01 21:25     ` Ross Zwisler
2016-01-28 12:48 ` Jan Kara
