* [PATCH v2 0/2] Add updated DAX locking to ext2
@ 2015-10-13 22:25 ` Ross Zwisler
  0 siblings, 0 replies; 22+ messages in thread
From: Ross Zwisler @ 2015-10-13 22:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Alexander Viro, Jan Kara, Matthew Wilcox,
	linux-ext4, linux-fsdevel, Andrew Morton, Dan Williams,
	Dave Chinner, Kirill A. Shutemov, linux-nvdimm, Matthew Wilcox

The first patch in this series is a somewhat related bug fix.  The second patch
adds new locking to ext2 to isolate DAX faults (page faults, PMD faults, page
mkwrite and pfn mkwrite) from ext2 operations that modify a given inode's data
block allocations.

I've tested this using xfstests with DAX enabled and disabled and verified that
it compiles without errors with and without CONFIG_FS_DAX.

Changes from v1:
 - 'dax_sem' is now only present in struct ext2_inode_info if CONFIG_FS_DAX is
   defined.  This was done to help save space on low-memory machines that are a
   target audience of ext2. (Jan)

 - The locking order between ext2_inode_info->dax_sem and sb_start_pagefault
   was swapped to match the locking order in XFS.  This resulted in more
   open-coding of the various DAX fault routines, and once this is done in ext4
   we should be able to get rid of many (all?) of the DAX fault handler wrappers
   that do locking for the filesystems. (Dave)

 - Warn if 'dax_sem' isn't properly held in __ext2_truncate_blocks. (Dan)

Ross Zwisler (2):
  dax: dax_pfn_mkwrite() truncate race check
  ext2: Add locking for DAX faults

 fs/dax.c        | 13 +++++++--
 fs/ext2/ext2.h  | 11 ++++++++
 fs/ext2/file.c  | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 fs/ext2/inode.c | 10 +++++++
 fs/ext2/super.c |  3 +++
 5 files changed, 115 insertions(+), 6 deletions(-)

-- 
2.1.0


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2 1/2] dax: dax_pfn_mkwrite() truncate race check
  2015-10-13 22:25 ` Ross Zwisler
@ 2015-10-13 22:25   ` Ross Zwisler
  -1 siblings, 0 replies; 22+ messages in thread
From: Ross Zwisler @ 2015-10-13 22:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Alexander Viro, Jan Kara, Matthew Wilcox,
	linux-ext4, linux-fsdevel, Andrew Morton, Dan Williams,
	Dave Chinner, Kirill A. Shutemov, linux-nvdimm, Matthew Wilcox

Update dax_pfn_mkwrite() so that it validates i_size before returning.
This is necessary to ensure that the page fault has not raced with truncate
and is now pointing to a region beyond the end of the current file.

This change is based on a similar outstanding patch for XFS from Dave
Chinner entitled "xfs: add ->pfn_mkwrite support for DAX".

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
---
 fs/dax.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 131fd35a..82be6e4 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -693,12 +693,21 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
  */
 int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+	struct inode *inode = file_inode(vma->vm_file);
+	struct super_block *sb = inode->i_sb;
+	int ret = VM_FAULT_NOPAGE;
+	loff_t size;
 
 	sb_start_pagefault(sb);
 	file_update_time(vma->vm_file);
+
+	/* check that the faulting page hasn't raced with truncate */
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (vmf->pgoff >= size)
+		ret = VM_FAULT_SIGBUS;
+
 	sb_end_pagefault(sb);
-	return VM_FAULT_NOPAGE;
+	return ret;
 }
 EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
 
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 2/2] ext2: Add locking for DAX faults
  2015-10-13 22:25 ` Ross Zwisler
@ 2015-10-13 22:25   ` Ross Zwisler
  -1 siblings, 0 replies; 22+ messages in thread
From: Ross Zwisler @ 2015-10-13 22:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Alexander Viro, Jan Kara, Matthew Wilcox,
	linux-ext4, linux-fsdevel, Andrew Morton, Dan Williams,
	Dave Chinner, Kirill A. Shutemov, linux-nvdimm, Matthew Wilcox

Add locking to ensure that DAX faults are isolated from ext2 operations
that modify the data blocks allocation for an inode.  This is intended to
be analogous to the work being done in XFS by Dave Chinner:

http://www.spinics.net/lists/linux-fsdevel/msg90260.html

Compared with XFS the ext2 case is greatly simplified by the fact that ext2
already allocates and zeros new blocks before they are returned as part of
ext2_get_block(), so DAX doesn't need to worry about getting unmapped or
unwritten buffer heads.

This means that the only work we need to do in ext2 is to isolate the DAX
faults from inode block allocation changes.  I believe this just means that
we need to isolate the DAX faults from truncate operations.

The newly introduced dax_sem is intended to replicate the protection
offered by i_mmaplock in XFS.  In addition to truncate, the i_mmaplock also
protects XFS operations like hole punching, fallocate down, extent
manipulation ioctls like xfs_ioc_space(), and extent swapping.  Truncate is
the only one of these operations supported by ext2.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/ext2/ext2.h  | 11 ++++++++
 fs/ext2/file.c  | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 fs/ext2/inode.c | 10 +++++++
 fs/ext2/super.c |  3 +++
 4 files changed, 104 insertions(+), 4 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 8d15feb..4c69c94 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -684,6 +684,9 @@ struct ext2_inode_info {
 	struct rw_semaphore xattr_sem;
 #endif
 	rwlock_t i_meta_lock;
+#ifdef CONFIG_FS_DAX
+	struct rw_semaphore dax_sem;
+#endif
 
 	/*
 	 * truncate_mutex is for serialising ext2_truncate() against
@@ -699,6 +702,14 @@ struct ext2_inode_info {
 #endif
 };
 
+#ifdef CONFIG_FS_DAX
+#define dax_sem_down_write(ext2_inode)	down_write(&(ext2_inode)->dax_sem)
+#define dax_sem_up_write(ext2_inode)	up_write(&(ext2_inode)->dax_sem)
+#else
+#define dax_sem_down_write(ext2_inode)
+#define dax_sem_up_write(ext2_inode)
+#endif
+
 /*
  * Inode dynamic state flags
  */
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 1982c3f..11a42c5 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -27,27 +27,103 @@
 #include "acl.h"
 
 #ifdef CONFIG_FS_DAX
+/*
+ * The lock ordering for ext2 DAX fault paths is:
+ *
+ * mmap_sem (MM)
+ *   sb_start_pagefault (vfs, freeze)
+ *     ext2_inode_info->dax_sem
+ *       address_space->i_mmap_rwsem or page_lock (mutually exclusive in DAX)
+ *         ext2_inode_info->truncate_mutex
+ *
+ * The default page_lock and i_size verification done by non-DAX fault paths
+ * is sufficient because ext2 doesn't support hole punching.
+ */
 static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	return dax_fault(vma, vmf, ext2_get_block, NULL);
+	struct inode *inode = file_inode(vma->vm_file);
+	struct ext2_inode_info *ei = EXT2_I(inode);
+	int ret;
+
+	if (vmf->flags & FAULT_FLAG_WRITE) {
+		sb_start_pagefault(inode->i_sb);
+		file_update_time(vma->vm_file);
+	}
+	down_read(&ei->dax_sem);
+
+	ret = __dax_fault(vma, vmf, ext2_get_block, NULL);
+
+	up_read(&ei->dax_sem);
+	if (vmf->flags & FAULT_FLAG_WRITE)
+		sb_end_pagefault(inode->i_sb);
+	return ret;
 }
 
 static int ext2_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
 						pmd_t *pmd, unsigned int flags)
 {
-	return dax_pmd_fault(vma, addr, pmd, flags, ext2_get_block, NULL);
+	struct inode *inode = file_inode(vma->vm_file);
+	struct ext2_inode_info *ei = EXT2_I(inode);
+	int ret;
+
+	if (flags & FAULT_FLAG_WRITE) {
+		sb_start_pagefault(inode->i_sb);
+		file_update_time(vma->vm_file);
+	}
+	down_read(&ei->dax_sem);
+
+	ret = __dax_pmd_fault(vma, addr, pmd, flags, ext2_get_block, NULL);
+
+	up_read(&ei->dax_sem);
+	if (flags & FAULT_FLAG_WRITE)
+		sb_end_pagefault(inode->i_sb);
+	return ret;
 }
 
 static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	return dax_mkwrite(vma, vmf, ext2_get_block, NULL);
+	struct inode *inode = file_inode(vma->vm_file);
+	struct ext2_inode_info *ei = EXT2_I(inode);
+	int ret;
+
+	sb_start_pagefault(inode->i_sb);
+	file_update_time(vma->vm_file);
+	down_read(&ei->dax_sem);
+
+	ret = __dax_mkwrite(vma, vmf, ext2_get_block, NULL);
+
+	up_read(&ei->dax_sem);
+	sb_end_pagefault(inode->i_sb);
+	return ret;
+}
+
+static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
+		struct vm_fault *vmf)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+	struct ext2_inode_info *ei = EXT2_I(inode);
+	int ret = VM_FAULT_NOPAGE;
+	loff_t size;
+
+	sb_start_pagefault(inode->i_sb);
+	file_update_time(vma->vm_file);
+	down_read(&ei->dax_sem);
+
+	/* check that the faulting page hasn't raced with truncate */
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (vmf->pgoff >= size)
+		ret = VM_FAULT_SIGBUS;
+
+	up_read(&ei->dax_sem);
+	sb_end_pagefault(inode->i_sb);
+	return ret;
 }
 
 static const struct vm_operations_struct ext2_dax_vm_ops = {
 	.fault		= ext2_dax_fault,
 	.pmd_fault	= ext2_dax_pmd_fault,
 	.page_mkwrite	= ext2_dax_mkwrite,
-	.pfn_mkwrite	= dax_pfn_mkwrite,
+	.pfn_mkwrite	= ext2_dax_pfn_mkwrite,
 };
 
 static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index c60a248..0aa9bf6 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1085,6 +1085,7 @@ static void ext2_free_branches(struct inode *inode, __le32 *p, __le32 *q, int de
 		ext2_free_data(inode, p, q);
 }
 
+/* dax_sem must be held when calling this function */
 static void __ext2_truncate_blocks(struct inode *inode, loff_t offset)
 {
 	__le32 *i_data = EXT2_I(inode)->i_data;
@@ -1100,6 +1101,10 @@ static void __ext2_truncate_blocks(struct inode *inode, loff_t offset)
 	blocksize = inode->i_sb->s_blocksize;
 	iblock = (offset + blocksize-1) >> EXT2_BLOCK_SIZE_BITS(inode->i_sb);
 
+#ifdef CONFIG_FS_DAX
+	WARN_ON(!rwsem_is_locked(&ei->dax_sem));
+#endif
+
 	n = ext2_block_to_path(inode, iblock, offsets, NULL);
 	if (n == 0)
 		return;
@@ -1185,7 +1190,10 @@ static void ext2_truncate_blocks(struct inode *inode, loff_t offset)
 		return;
 	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
 		return;
+
+	dax_sem_down_write(EXT2_I(inode));
 	__ext2_truncate_blocks(inode, offset);
+	dax_sem_up_write(EXT2_I(inode));
 }
 
 static int ext2_setsize(struct inode *inode, loff_t newsize)
@@ -1213,8 +1221,10 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
 	if (error)
 		return error;
 
+	dax_sem_down_write(EXT2_I(inode));
 	truncate_setsize(inode, newsize);
 	__ext2_truncate_blocks(inode, newsize);
+	dax_sem_up_write(EXT2_I(inode));
 
 	inode->i_mtime = inode->i_ctime = CURRENT_TIME_SEC;
 	if (inode_needs_sync(inode)) {
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 900e19c..3a71cea 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -192,6 +192,9 @@ static void init_once(void *foo)
 	init_rwsem(&ei->xattr_sem);
 #endif
 	mutex_init(&ei->truncate_mutex);
+#ifdef CONFIG_FS_DAX
+	init_rwsem(&ei->dax_sem);
+#endif
 	inode_init_once(&ei->vfs_inode);
 }
 
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 1/2] dax: dax_pfn_mkwrite() truncate race check
  2015-10-13 22:25   ` Ross Zwisler
@ 2015-10-14  5:25     ` Dave Chinner
  -1 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2015-10-14  5:25 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Alexander Viro, Jan Kara, Matthew Wilcox,
	linux-ext4, linux-fsdevel, Andrew Morton, Dan Williams,
	Kirill A. Shutemov, linux-nvdimm, Matthew Wilcox

On Tue, Oct 13, 2015 at 04:25:36PM -0600, Ross Zwisler wrote:
> Update dax_pfn_mkwrite() so that it validates i_size before returning.
> This is necessary to ensure that the page fault has not raced with truncate
> and is now pointing to a region beyond the end of the current file.
> 
> This change is based on a similar outstanding patch for XFS from Dave
> Chinner entitled "xfs: add ->pfn_mkwrite support for DAX".
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Dave Chinner <david@fromorbit.com>
> ---
>  fs/dax.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 131fd35a..82be6e4 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -693,12 +693,21 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
>   */
>  int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
> -	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> +	struct inode *inode = file_inode(vma->vm_file);
> +	struct super_block *sb = inode->i_sb;
> +	int ret = VM_FAULT_NOPAGE;
> +	loff_t size;
>  
>  	sb_start_pagefault(sb);
>  	file_update_time(vma->vm_file);
> +
> +	/* check that the faulting page hasn't raced with truncate */
> +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +	if (vmf->pgoff >= size)
> +		ret = VM_FAULT_SIGBUS;
> +
>  	sb_end_pagefault(sb);
> -	return VM_FAULT_NOPAGE;
> +	return ret;

This is still racy, as the read of the inode size is not serialised
against or ordered by any locks held by truncate. The check in XFS
is serialised against truncate by the XFS_MMAPLOCK and the generic
DAX code does not have such a mechanism to rely on. Hence I'd
suggest that the correct thing to do here is remove
dax_pfn_mkwrite() and force filesystems to implement their own
truncate-safe variants of ->pfn_mkwrite.

And on further thought, I'm not sure that what I did with XFS is
at all correct when considering hole punching. That is, to get a PFN
based fault, we've already had to guarantee the file offset has
blocks allocated and mapped them. Then:

process 1				process 2
pfn_mkwrite @ X				punch hole @ X
 xfs_filemap_pfn_mkwrite		XFS_MMAPLOCK_EXCL
  XFS_MMAPLOCK_SHARED
    <blocks>
					invalidate mapping @ X
					remove blocks @ X
					....
					unlock XFS_MMAPLOCK_EXCL
   checks file size
  unlock XFS_MMAPLOCK_SHARED
  return VM_FAULT_NOPAGE

And so process 1 continues with an invalid page mapping over the
hole process 2 just punched in the file. That's a data
exposure/corruption problem once the underlying blocks get
reallocated to some other file.

Unless I'm missing something here, this gets ugly real fast.

If we treat ->pfn_mkwrite like ->page_mkwrite() and allocate blocks
under the pfn that was invalidated and punched out by the hole punch
operation, then *the physical pfn that maps to the file offset that
the page fault occurred for changes*.

So, we can't allocate new blocks during ->pfn_mkwrite. All we can
do is check the pfn mapping is still valid when we have serialised
against hole punch/truncate, and if it isn't we need to return
VM_FAULT_RETRY so that the page fault is restarted to find the new
mapping which can then have ->page_mkwrite/->pfn_mkwrite called
on it.

The biggest problem here is that VM_FAULT_RETRY will only allow one
retry of the page fault - if the page fault races twice in a row
with a hole punch (need 3 processes, two doing write faults at the
same offset, and the other repeatedly hole punching the same offset)
then we'll SIGBUS the process on the second VM_FAULT_RETRY in a row
from ->pfn_mkwrite....

And because I don't have all day to work out all the twisty paths of
the page fault code, where is a pfn-based read fault validated
against racing truncate/holepunch operations freeing the
underlying storage?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 1/2] dax: dax_pfn_mkwrite() truncate race check
  2015-10-14  5:25     ` Dave Chinner
@ 2015-10-14  8:40       ` Jan Kara
  -1 siblings, 0 replies; 22+ messages in thread
From: Jan Kara @ 2015-10-14  8:40 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ross Zwisler, linux-kernel, Alexander Viro, Jan Kara,
	Matthew Wilcox, linux-ext4, linux-fsdevel, Andrew Morton,
	Dan Williams, Kirill A. Shutemov, linux-nvdimm, Matthew Wilcox

On Wed 14-10-15 16:25:50, Dave Chinner wrote:
> On Tue, Oct 13, 2015 at 04:25:36PM -0600, Ross Zwisler wrote:
> > Update dax_pfn_mkwrite() so that it validates i_size before returning.
> > This is necessary to ensure that the page fault has not raced with truncate
> > and is now pointing to a region beyond the end of the current file.
> > 
> > This change is based on a similar outstanding patch for XFS from Dave
> > Chinner entitled "xfs: add ->pfn_mkwrite support for DAX".
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > Cc: Dave Chinner <david@fromorbit.com>
> > ---
> >  fs/dax.c | 13 +++++++++++--
> >  1 file changed, 11 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 131fd35a..82be6e4 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -693,12 +693,21 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
> >   */
> >  int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> >  {
> > -	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> > +	struct inode *inode = file_inode(vma->vm_file);
> > +	struct super_block *sb = inode->i_sb;
> > +	int ret = VM_FAULT_NOPAGE;
> > +	loff_t size;
> >  
> >  	sb_start_pagefault(sb);
> >  	file_update_time(vma->vm_file);
> > +
> > +	/* check that the faulting page hasn't raced with truncate */
> > +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > +	if (vmf->pgoff >= size)
> > +		ret = VM_FAULT_SIGBUS;
> > +
> >  	sb_end_pagefault(sb);
> > -	return VM_FAULT_NOPAGE;
> > +	return ret;
> 
> This is still racy, as the read of the inode size is not serialised
> against or ordered by any locks held by truncate. The check in XFS
> is serialised against truncate by the XFS_MMAPLOCK and the generic
> DAX code does not have such a mechanism to rely on. Hence I'd
> suggest that the correct thing to do here is remove
> dax_pfn_mkwrite() and force filesystems to implement their own
> truncate-safe variants of ->pfn_mkwrite.

Well, the i_size check is just an optimization anyway. See the discussion
of hole punching below.

> And on further thought, I'm not sure that what I did with XFS is
> at all correct when considering hole punching. That is, to get a PFN
> based fault, we've already had to guarantee the file offset has
> blocks allocated and mapped them. Then:
> 
> process 1				process 2
> pfn_mkwrite @ X				punch hole @ X
>  xfs_filemap_pfn_mkwrite		XFS_MMAPLOCK_EXCL
>   XFS_MMAPLOCK_SHARED
>     <blocks>
> 					invalidate mapping @ X
> 					remove blocks @ X
> 					....
> 					unlock XFS_MMAPLOCK_EXCL
>    checks file size
>   unlock XFS_MMAPLOCK_SHARED
>   return VM_FAULT_NOPAGE
> 
> And so process 1 continues with an invalid page mapping over the
> hole process 2 just punched in the file. That's a data
> exposure/corruption problem if the underlying blocks get reallocated
> to some other file.
> 
> Unless I'm missing something here, this gets ugly real fast.

Here is how I convinced myself that this is not a problem:

When process 2 invalidates the mapping, it will also modify the page tables
to unmap the freed pages. Then the pte_same() check in
mm/memory.c:wp_pfn_shared() will fail, we bail out, and the CPU will retry
the memory access, which faults again and this time takes the right path.
So ->pfn_mkwrite() has to be prepared for the pfn under it to no longer be
there; current implementations don't really care, so we are fine.

Trying to generalize when we are safe against such races: unless we modify
page tables or inode allocation information, we don't care about races with
truncate() / punch_hole(), so even the original dax_pfn_mkwrite() code was
actually correct. But it's really subtle...

I have this written down in a comment before the ext4_dax_pfn_mkwrite()
handler so that I don't forget, but it would be good to add the comment to
some generic code as well.

> If we treat ->pfn_mkwrite like ->page_mkwrite() and allocate blocks
> under the pfn that was invalidated and punched out by the hole punch
> operation, then *the physical pfn that maps to the file offset that
> the page fault occurred for changes*.
> 
> So, we can't allocate new blocks during ->pfn_mkwrite. All we can
> do is check the pfn mapping is still valid when we have serialised
> against hole punch/truncate, and if it isn't we need to return
> VM_FAULT_RETRY so that the page fault is restarted to find the new
> mapping which can then have ->page_mkwrite/->pfn_mkwrite called
> on it.
> 
> The biggest problem here is that VM_FAULT_RETRY will only allow one
> retry of the page fault - if the page fault races twice in a row
> with a hole punch (need 3 processes, two doing write faults at the
> same offset, and the other repeatedly hole punching the same offset)
> then we'll SIGBUS the process on the second VM_FAULT_RETRY in a row
> from ->pfn_mkwrite....
> 
> And because I don't have all day to work out all the twisty paths of
> the page fault code, where is a pfn-based read fault validated
> against racing truncate/holepunch operations freeing the
> underlying storage?

Do you mean a fault which ends up in dax_fault()? There we rely on the
fact that:

a) we hold the fs-specific i_mmap_lock in exclusive mode in truncate /
   punch hole over the complete sequence of "invalidate mappings, remove
   blocks from inode".

b) we hold the fs-specific i_mmap_lock in shared mode in fault over the
   sequence "lookup block, install into page tables".

So I don't see how we could end up with page tables referencing freed
blocks - either the fault finds a hole, which it instantiates in the page
tables, or it finds a block and instantiates that in the page tables, and a
subsequent truncate will remove the block from the page tables again. But I
guess you are well aware of this, so maybe I misunderstood your question.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 2/2] ext2: Add locking for DAX faults
  2015-10-13 22:25   ` Ross Zwisler
@ 2015-10-14  8:51     ` Jan Kara
  -1 siblings, 0 replies; 22+ messages in thread
From: Jan Kara @ 2015-10-14  8:51 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Alexander Viro, Jan Kara, Matthew Wilcox,
	linux-ext4, linux-fsdevel, Andrew Morton, Dan Williams,
	Dave Chinner, Kirill A. Shutemov, linux-nvdimm, Matthew Wilcox

On Tue 13-10-15 16:25:37, Ross Zwisler wrote:
> Add locking to ensure that DAX faults are isolated from ext2 operations
> that modify the data blocks allocation for an inode.  This is intended to
> be analogous to the work being done in XFS by Dave Chinner:
> 
> http://www.spinics.net/lists/linux-fsdevel/msg90260.html
> 
> Compared with XFS the ext2 case is greatly simplified by the fact that ext2
> already allocates and zeros new blocks before they are returned as part of
> ext2_get_block(), so DAX doesn't need to worry about getting unmapped or
> unwritten buffer heads.
> 
> This means that the only work we need to do in ext2 is to isolate the DAX
> faults from inode block allocation changes.  I believe this just means that
> we need to isolate the DAX faults from truncate operations.
> 
> The newly introduced dax_sem is intended to replicate the protection
> offered by i_mmaplock in XFS.  In addition to truncate the i_mmaplock also
> protects XFS operations like hole punching, fallocate down, extent
> manipulation IOCTLS like xfs_ioc_space() and extent swapping.  Truncate is
> the only one of these operations supported by ext2.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

The patch looks good to me. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.com>

Or I can push the patch through my tree as it seems to be independent of
any other changes, am I right?

								Honza
> ---
>  fs/ext2/ext2.h  | 11 ++++++++
>  fs/ext2/file.c  | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>  fs/ext2/inode.c | 10 +++++++
>  fs/ext2/super.c |  3 +++
>  4 files changed, 104 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
> index 8d15feb..4c69c94 100644
> --- a/fs/ext2/ext2.h
> +++ b/fs/ext2/ext2.h
> @@ -684,6 +684,9 @@ struct ext2_inode_info {
>  	struct rw_semaphore xattr_sem;
>  #endif
>  	rwlock_t i_meta_lock;
> +#ifdef CONFIG_FS_DAX
> +	struct rw_semaphore dax_sem;
> +#endif
>  
>  	/*
>  	 * truncate_mutex is for serialising ext2_truncate() against
> @@ -699,6 +702,14 @@ struct ext2_inode_info {
>  #endif
>  };
>  
> +#ifdef CONFIG_FS_DAX
> +#define dax_sem_down_write(ext2_inode)	down_write(&(ext2_inode)->dax_sem)
> +#define dax_sem_up_write(ext2_inode)	up_write(&(ext2_inode)->dax_sem)
> +#else
> +#define dax_sem_down_write(ext2_inode)
> +#define dax_sem_up_write(ext2_inode)
> +#endif
> +
>  /*
>   * Inode dynamic state flags
>   */
> diff --git a/fs/ext2/file.c b/fs/ext2/file.c
> index 1982c3f..11a42c5 100644
> --- a/fs/ext2/file.c
> +++ b/fs/ext2/file.c
> @@ -27,27 +27,103 @@
>  #include "acl.h"
>  
>  #ifdef CONFIG_FS_DAX
> +/*
> + * The lock ordering for ext2 DAX fault paths is:
> + *
> + * mmap_sem (MM)
> + *   sb_start_pagefault (vfs, freeze)
> + *     ext2_inode_info->dax_sem
> + *       address_space->i_mmap_rwsem or page_lock (mutually exclusive in DAX)
> + *         ext2_inode_info->truncate_mutex
> + *
> + * The default page_lock and i_size verification done by non-DAX fault paths
> + * is sufficient because ext2 doesn't support hole punching.
> + */
>  static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
> -	return dax_fault(vma, vmf, ext2_get_block, NULL);
> +	struct inode *inode = file_inode(vma->vm_file);
> +	struct ext2_inode_info *ei = EXT2_I(inode);
> +	int ret;
> +
> +	if (vmf->flags & FAULT_FLAG_WRITE) {
> +		sb_start_pagefault(inode->i_sb);
> +		file_update_time(vma->vm_file);
> +	}
> +	down_read(&ei->dax_sem);
> +
> +	ret = __dax_fault(vma, vmf, ext2_get_block, NULL);
> +
> +	up_read(&ei->dax_sem);
> +	if (vmf->flags & FAULT_FLAG_WRITE)
> +		sb_end_pagefault(inode->i_sb);
> +	return ret;
>  }
>  
>  static int ext2_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
>  						pmd_t *pmd, unsigned int flags)
>  {
> -	return dax_pmd_fault(vma, addr, pmd, flags, ext2_get_block, NULL);
> +	struct inode *inode = file_inode(vma->vm_file);
> +	struct ext2_inode_info *ei = EXT2_I(inode);
> +	int ret;
> +
> +	if (flags & FAULT_FLAG_WRITE) {
> +		sb_start_pagefault(inode->i_sb);
> +		file_update_time(vma->vm_file);
> +	}
> +	down_read(&ei->dax_sem);
> +
> +	ret = __dax_pmd_fault(vma, addr, pmd, flags, ext2_get_block, NULL);
> +
> +	up_read(&ei->dax_sem);
> +	if (flags & FAULT_FLAG_WRITE)
> +		sb_end_pagefault(inode->i_sb);
> +	return ret;
>  }
>  
>  static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
> -	return dax_mkwrite(vma, vmf, ext2_get_block, NULL);
> +	struct inode *inode = file_inode(vma->vm_file);
> +	struct ext2_inode_info *ei = EXT2_I(inode);
> +	int ret;
> +
> +	sb_start_pagefault(inode->i_sb);
> +	file_update_time(vma->vm_file);
> +	down_read(&ei->dax_sem);
> +
> +	ret = __dax_mkwrite(vma, vmf, ext2_get_block, NULL);
> +
> +	up_read(&ei->dax_sem);
> +	sb_end_pagefault(inode->i_sb);
> +	return ret;
> +}
> +
> +static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
> +		struct vm_fault *vmf)
> +{
> +	struct inode *inode = file_inode(vma->vm_file);
> +	struct ext2_inode_info *ei = EXT2_I(inode);
> +	int ret = VM_FAULT_NOPAGE;
> +	loff_t size;
> +
> +	sb_start_pagefault(inode->i_sb);
> +	file_update_time(vma->vm_file);
> +	down_read(&ei->dax_sem);
> +
> +	/* check that the faulting page hasn't raced with truncate */
> +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +	if (vmf->pgoff >= size)
> +		ret = VM_FAULT_SIGBUS;
> +
> +	up_read(&ei->dax_sem);
> +	sb_end_pagefault(inode->i_sb);
> +	return ret;
>  }
>  
>  static const struct vm_operations_struct ext2_dax_vm_ops = {
>  	.fault		= ext2_dax_fault,
>  	.pmd_fault	= ext2_dax_pmd_fault,
>  	.page_mkwrite	= ext2_dax_mkwrite,
> -	.pfn_mkwrite	= dax_pfn_mkwrite,
> +	.pfn_mkwrite	= ext2_dax_pfn_mkwrite,
>  };
>  
>  static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index c60a248..0aa9bf6 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -1085,6 +1085,7 @@ static void ext2_free_branches(struct inode *inode, __le32 *p, __le32 *q, int de
>  		ext2_free_data(inode, p, q);
>  }
>  
> +/* dax_sem must be held when calling this function */
>  static void __ext2_truncate_blocks(struct inode *inode, loff_t offset)
>  {
>  	__le32 *i_data = EXT2_I(inode)->i_data;
> @@ -1100,6 +1101,10 @@ static void __ext2_truncate_blocks(struct inode *inode, loff_t offset)
>  	blocksize = inode->i_sb->s_blocksize;
>  	iblock = (offset + blocksize-1) >> EXT2_BLOCK_SIZE_BITS(inode->i_sb);
>  
> +#ifdef CONFIG_FS_DAX
> +	WARN_ON(!rwsem_is_locked(&ei->dax_sem));
> +#endif
> +
>  	n = ext2_block_to_path(inode, iblock, offsets, NULL);
>  	if (n == 0)
>  		return;
> @@ -1185,7 +1190,10 @@ static void ext2_truncate_blocks(struct inode *inode, loff_t offset)
>  		return;
>  	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
>  		return;
> +
> +	dax_sem_down_write(EXT2_I(inode));
>  	__ext2_truncate_blocks(inode, offset);
> +	dax_sem_up_write(EXT2_I(inode));
>  }
>  
>  static int ext2_setsize(struct inode *inode, loff_t newsize)
> @@ -1213,8 +1221,10 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
>  	if (error)
>  		return error;
>  
> +	dax_sem_down_write(EXT2_I(inode));
>  	truncate_setsize(inode, newsize);
>  	__ext2_truncate_blocks(inode, newsize);
> +	dax_sem_up_write(EXT2_I(inode));
>  
>  	inode->i_mtime = inode->i_ctime = CURRENT_TIME_SEC;
>  	if (inode_needs_sync(inode)) {
> diff --git a/fs/ext2/super.c b/fs/ext2/super.c
> index 900e19c..3a71cea 100644
> --- a/fs/ext2/super.c
> +++ b/fs/ext2/super.c
> @@ -192,6 +192,9 @@ static void init_once(void *foo)
>  	init_rwsem(&ei->xattr_sem);
>  #endif
>  	mutex_init(&ei->truncate_mutex);
> +#ifdef CONFIG_FS_DAX
> +	init_rwsem(&ei->dax_sem);
> +#endif
>  	inode_init_once(&ei->vfs_inode);
>  }
>  
> -- 
> 2.1.0
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 2/2] ext2: Add locking for DAX faults
  2015-10-14  8:51     ` Jan Kara
@ 2015-10-14 15:31       ` Ross Zwisler
  -1 siblings, 0 replies; 22+ messages in thread
From: Ross Zwisler @ 2015-10-14 15:31 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, Alexander Viro, Jan Kara,
	Matthew Wilcox, linux-ext4, linux-fsdevel, Andrew Morton,
	Dan Williams, Dave Chinner, Kirill A. Shutemov, linux-nvdimm,
	Matthew Wilcox

On Wed, Oct 14, 2015 at 10:51:19AM +0200, Jan Kara wrote:
> On Tue 13-10-15 16:25:37, Ross Zwisler wrote:
> > Add locking to ensure that DAX faults are isolated from ext2 operations
> > that modify the data blocks allocation for an inode.  This is intended to
> > be analogous to the work being done in XFS by Dave Chinner:
> > 
> > http://www.spinics.net/lists/linux-fsdevel/msg90260.html
> > 
> > Compared with XFS the ext2 case is greatly simplified by the fact that ext2
> > already allocates and zeros new blocks before they are returned as part of
> > ext2_get_block(), so DAX doesn't need to worry about getting unmapped or
> > unwritten buffer heads.
> > 
> > This means that the only work we need to do in ext2 is to isolate the DAX
> > faults from inode block allocation changes.  I believe this just means that
> > we need to isolate the DAX faults from truncate operations.
> > 
> > The newly introduced dax_sem is intended to replicate the protection
> > offered by i_mmaplock in XFS.  In addition to truncate the i_mmaplock also
> > protects XFS operations like hole punching, fallocate down, extent
> > manipulation IOCTLS like xfs_ioc_space() and extent swapping.  Truncate is
> > the only one of these operations supported by ext2.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> 
> The patch looks good to me. Feel free to add:
> 
> Reviewed-by: Jan Kara <jack@suse.com>
> 
> Or I can push the patch through my tree as it seems to be independent of
> any other changes, am I right?
> 
> 								Honza

Yep, it is independent of other patches.  It'd be great if you pushed it up
through your tree, thanks!


* Re: [PATCH v2 1/2] dax: dax_pfn_mkwrite() truncate race check
  2015-10-14  5:25     ` Dave Chinner
@ 2015-10-14 17:26       ` Ross Zwisler
  -1 siblings, 0 replies; 22+ messages in thread
From: Ross Zwisler @ 2015-10-14 17:26 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ross Zwisler, linux-kernel, Alexander Viro, Jan Kara,
	Matthew Wilcox, linux-ext4, linux-fsdevel, Andrew Morton,
	Dan Williams, Kirill A. Shutemov, linux-nvdimm, Matthew Wilcox

On Wed, Oct 14, 2015 at 04:25:50PM +1100, Dave Chinner wrote:
> On Tue, Oct 13, 2015 at 04:25:36PM -0600, Ross Zwisler wrote:
> > Update dax_pfn_mkwrite() so that it validates i_size before returning.
> > This is necessary to ensure that the page fault has not raced with truncate
> > and is now pointing to a region beyond the end of the current file.
> > 
> > This change is based on a similar outstanding patch for XFS from Dave
> > Chinner entitled "xfs: add ->pfn_mkwrite support for DAX".
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > Cc: Dave Chinner <david@fromorbit.com>
> > ---
> >  fs/dax.c | 13 +++++++++++--
> >  1 file changed, 11 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 131fd35a..82be6e4 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -693,12 +693,21 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
> >   */
> >  int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> >  {
> > -	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> > +	struct inode *inode = file_inode(vma->vm_file);
> > +	struct super_block *sb = inode->i_sb;
> > +	int ret = VM_FAULT_NOPAGE;
> > +	loff_t size;
> >  
> >  	sb_start_pagefault(sb);
> >  	file_update_time(vma->vm_file);
> > +
> > +	/* check that the faulting page hasn't raced with truncate */
> > +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > +	if (vmf->pgoff >= size)
> > +		ret = VM_FAULT_SIGBUS;
> > +
> >  	sb_end_pagefault(sb);
> > -	return VM_FAULT_NOPAGE;
> > +	return ret;
> 
> This is still racy, as the read of the inode size is not serialised
> against or ordered by any locks held by truncate. The check in XFS
> is serialised against truncate by the XFS_MMAPLOCK and the generic
> DAX code does not have such a mechanism to rely on. Hence I'd
> suggest that the correct thing to do here is remove
> dax_pfn_mkwrite() and force filesystems to implement their own
> truncate-safe variants of ->pfn_mkwrite.

Agreed, let's just drop this patch and remove dax_pfn_mkwrite().  The last
user of this function was ext4, and that usage goes away with Jan's latest
patch set.


* Re: [PATCH v2 1/2] dax: dax_pfn_mkwrite() truncate race check
  2015-10-14  8:40       ` Jan Kara
@ 2015-10-14 22:53         ` Dave Chinner
  -1 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2015-10-14 22:53 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, Alexander Viro, Jan Kara,
	Matthew Wilcox, linux-ext4, linux-fsdevel, Andrew Morton,
	Dan Williams, Kirill A. Shutemov, linux-nvdimm, Matthew Wilcox

On Wed, Oct 14, 2015 at 10:40:25AM +0200, Jan Kara wrote:
> On Wed 14-10-15 16:25:50, Dave Chinner wrote:
> > On Tue, Oct 13, 2015 at 04:25:36PM -0600, Ross Zwisler wrote:
> > > Update dax_pfn_mkwrite() so that it validates i_size before returning.
> > > This is necessary to ensure that the page fault has not raced with truncate
> > > and is now pointing to a region beyond the end of the current file.
> > > 
> > > This change is based on a similar outstanding patch for XFS from Dave
> > > Chinner entitled "xfs: add ->pfn_mkwrite support for DAX".
> > > 
> > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > Cc: Dave Chinner <david@fromorbit.com>
> > > ---
> > >  fs/dax.c | 13 +++++++++++--
> > >  1 file changed, 11 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index 131fd35a..82be6e4 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -693,12 +693,21 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
> > >   */
> > >  int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > >  {
> > > -	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> > > +	struct inode *inode = file_inode(vma->vm_file);
> > > +	struct super_block *sb = inode->i_sb;
> > > +	int ret = VM_FAULT_NOPAGE;
> > > +	loff_t size;
> > >  
> > >  	sb_start_pagefault(sb);
> > >  	file_update_time(vma->vm_file);
> > > +
> > > +	/* check that the faulting page hasn't raced with truncate */
> > > +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > > +	if (vmf->pgoff >= size)
> > > +		ret = VM_FAULT_SIGBUS;
> > > +
> > >  	sb_end_pagefault(sb);
> > > -	return VM_FAULT_NOPAGE;
> > > +	return ret;
> > 
> > This is still racy, as the read of the inode size is not serialised
> > against or ordered by any locks held by truncate. The check in XFS
> > is serialised against truncate by the XFS_MMAPLOCK and the generic
> > DAX code does not have such a mechanism to rely on. Hence I'd
> > suggest that the correct thing to do here is remove
> > dax_pfn_mkwrite() and force filesystems to implement their own
> > truncate-safe variants of ->pfn_mkwrite.
> 
> Well, the i_size check is just an optimization anyway. See below the
> discussion regarding the hole punch.
> 
> > And on further thought, I'm not sure that what I did with XFS is
> > at all correct when considering hole punching. That is, to get a PFN
> > based fault, we've already had to guarantee the file offset has
> > blocks allocated and mapped them. Then:
> > 
> > process 1				process 2
> > pfn_mkwrite @ X				punch hole @ X
> >  xfs_filemap_pfn_mkwrite		XFS_MMAPLOCK_EXCL
> >   XFS_MMAPLOCK_SHARED
> >     <blocks>
> > 					invalidate mapping @ X
> > 					remove blocks @ X
> > 					....
> > 					unlock XFS_MMAPLOCK_EXCL
> >    checks file size
> >   unlock XFS_MMAPLOCK_SHARED
> >   return VM_FAULT_NOPAGE
> > 
> > And so process 1 continues with an invalid page mapping over the
> > hole process 2 just punched in the file. That's a data
> > exposure/corruption problem if the underlying blocks get reallocated
> > to some other file.
> > 
> > Unless I'm missing something here, this gets ugly real fast.
> 
> So how I convinced myself that this is not a problem is:
> 
> When process 2 invalidates mapping, it will also modify page tables to
> unmap freed pages. Then the pte_same() check in mm/memory.c:wp_pfn_shared()
> will fail, we bail out,

So this comes in through handle_pte_fault()? And if !pte_same(),
we return 0:

__handle_mm_fault
  handle_pte_fault
    do_wp_page
      wp_pfn_shared()
        ->pfn_mkwrite
          xfs_filemap_pfn_mkwrite
	    return VM_FAULT_NOPAGE
        !pte_same
        return 0;
      return 0;
    return 0;
  return 0;
return 0;

and so we return all the way back out to __do_page_fault() with a
return value of zero, which then thinks the page fault succeeded...

> CPU will retry the memory access which faults again
> and now we go the right path. So ->pfn_mkwrite() has to be prepared that
> the pfn under it actually isn't there anymore and current implementations
> don't really care so we are fine.

Ok, that's what I was missing - the CPU refaulting when retrying the
write and that provides the retry path.

IOWs, we're relying on pte_same() to catch the truncate/hole
punch invalidation of the pfn mapping, but that can only work if the
filesystem ->pfn_mkwrite implementation first serialises the page
fault against the pte invalidation that the hole punch/truncation
does.

Right?

My head hurts now.

> Trying to generalize when we are safe against such races: Unless we modify
> page tables or inode allocation information, we don't care about races with
> truncate() / punch_hole() so even the original dax_pfn_mkwrite() code was
> actually correct. But it's really subtle...

I'd call it "completely undocumented", not "subtle". :/

> I have this written down before ext4_dax_pfn_mkwrite() handler so that I
> don't forget but it would be good to add the comment also to some generic
> code.

It needs to go somewhere that people looking to implement
->pfn_mkwrite() will see.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v2 1/2] dax: dax_pfn_mkwrite() truncate race check
  2015-10-14 22:53         ` Dave Chinner
@ 2015-10-16  7:55           ` Jan Kara
  -1 siblings, 0 replies; 22+ messages in thread
From: Jan Kara @ 2015-10-16  7:55 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Ross Zwisler, linux-kernel, Alexander Viro, Jan Kara,
	Matthew Wilcox, linux-ext4, linux-fsdevel, Andrew Morton,
	Dan Williams, Kirill A. Shutemov, linux-nvdimm, Matthew Wilcox

On Thu 15-10-15 09:53:16, Dave Chinner wrote:
> On Wed, Oct 14, 2015 at 10:40:25AM +0200, Jan Kara wrote:
> > On Wed 14-10-15 16:25:50, Dave Chinner wrote:
> > > On Tue, Oct 13, 2015 at 04:25:36PM -0600, Ross Zwisler wrote:
> > > > Update dax_pfn_mkwrite() so that it validates i_size before returning.
> > > > This is necessary to ensure that the page fault has not raced with truncate
> > > > and is now pointing to a region beyond the end of the current file.
> > > > 
> > > > This change is based on a similar outstanding patch for XFS from Dave
> > > > Chinner entitled "xfs: add ->pfn_mkwrite support for DAX".
> > > > 
> > > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > > Cc: Dave Chinner <david@fromorbit.com>
> > > > ---
> > > >  fs/dax.c | 13 +++++++++++--
> > > >  1 file changed, 11 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/fs/dax.c b/fs/dax.c
> > > > index 131fd35a..82be6e4 100644
> > > > --- a/fs/dax.c
> > > > +++ b/fs/dax.c
> > > > @@ -693,12 +693,21 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
> > > >   */
> > > >  int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > >  {
> > > > -	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> > > > +	struct inode *inode = file_inode(vma->vm_file);
> > > > +	struct super_block *sb = inode->i_sb;
> > > > +	int ret = VM_FAULT_NOPAGE;
> > > > +	loff_t size;
> > > >  
> > > >  	sb_start_pagefault(sb);
> > > >  	file_update_time(vma->vm_file);
> > > > +
> > > > +	/* check that the faulting page hasn't raced with truncate */
> > > > +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > > > +	if (vmf->pgoff >= size)
> > > > +		ret = VM_FAULT_SIGBUS;
> > > > +
> > > >  	sb_end_pagefault(sb);
> > > > -	return VM_FAULT_NOPAGE;
> > > > +	return ret;
> > > 
> > > This is still racy, as the read of the inode size is not serialised
> > > against or ordered by any locks held by truncate. The check in XFS
> > > is serialised against truncate by the XFS_MMAPLOCK and the generic
> > > DAX code does not have such a mechanism to rely on. Hence I'd
> > > suggest that the correct thing to do here is remove
> > > dax_pfn_mkwrite() and force filesystems to implement their own
> > > truncate-safe variants of ->pfn_mkwrite.
> > 
> > Well, the i_size check is just an optimization anyway. See below the
> > discussion regarding the hole punch.
> > 
> > > And on further thought, I'm not sure that what I did with XFS is
> > > at all correct when considering hole punching. That is, to get a PFN
> > > based fault, we've already had to guarantee the file offset has
> > > blocks allocated and mapped them. Then:
> > > 
> > > process 1				process 2
> > > pfn_mkwrite @ X				punch hole @ X
> > >  xfs_filemap_pfn_mkwrite		XFS_MMAPLOCK_EXCL
> > >   XFS_MMAPLOCK_SHARED
> > >     <blocks>
> > > 					invalidate mapping @ X
> > > 					remove blocks @ X
> > > 					....
> > > 					unlock XFS_MMAPLOCK_EXCL
> > >    checks file size
> > >   unlock XFS_MMAPLOCK_SHARED
> > >   return VM_FAULT_NOPAGE
> > > 
> > > And so process 1 continues with an invalid page mapping over the
> > > hole process 2 just punched in the file. That's a data
> > > exposure/corruption problem if the underlying blocks get reallocated
> > > to some other file.
> > > 
> > > Unless I'm missing something here, this gets ugly real fast.
> > 
> > So how I convinced myself that this is not a problem is:
> > 
> > When process 2 invalidates mapping, it will also modify page tables to
> > unmap freed pages. Then the pte_same() check in mm/memory.c:wp_pfn_shared()
> > will fail, we bail out,
> 
> So this comes in through handle_pte_fault()? And if !pte_same(),
> we return 0:
> 
> __handle_mm_fault
>   handle_pte_fault
>     do_wp_page
>       wp_pfn_shared()
>         ->pfn_mkwrite
>           xfs_filemap_pfn_mkwrite
> 	    return VM_FAULT_NOPAGE
>         !pte_same
>         return 0;
>       return 0;
>     return 0;
>   return 0;
> return 0;
> 
> and so we return all the way back out to __do_page_fault() with a
> return value of zero, which then thinks the page fault succeeded...

Yes. Except "succeeded" is a somewhat misleading term here. Let's say it was
handled without a fatal error.

> > CPU will retry the memory access which faults again
> > and now we go the right path. So ->pfn_mkwrite() has to be prepared that
> > the pfn under it actually isn't there anymore and current implementations
> > don't really care so we are fine.
> 
> Ok, that's what I was missing - the CPU refaulting when retrying the
> write and that provides the retry path.
> 
> IOWs, we're relying on pte_same() to catch the truncate/hole
> punch invalidation of the pfn mapping, but that can only work if the
> filesystem ->pfn_mkwrite implementation first serialises the page
> fault against the pte invalidation that the hole punch/truncation
> does.
> 
> Right?

Well, faults and invalidation are already serialized by pte lock. The
sequence looks like this:

page fault
	...
	pte_lock
	check page tables, see a valid pfn but we need to make it writable
	pte_unlock
	->pfn_mkwrite()
	pte_lock
	if (page table entry changed from when we last looked)
		bail out
	finish fault
	pte_unlock

truncate
	...
	for each page removed
		for each mapping of that page
			pte_lock
			clear all page table entries for this page
			pte_unlock
  
So the check that the pte entry is still valid after we called ->pfn_mkwrite()
and reacquired the pte lock is all that is needed to make sure we don't map
an invalid pfn (an already-truncated one) into the page tables.

For read faults we have to be more careful: there the filesystem looks up
a pfn which is then inserted into the page tables, so *there* we do need the
truncate vs. fault serialization so that the fs doesn't ask mm to insert a
pfn that got truncated from the file in the meantime.

> My head hurts now.

Yup.

> > Trying to generalize when we are safe against such races: Unless we modify
> > page tables or inode allocation information, we don't care about races with
> > truncate() / punch_hole() so even the original dax_pfn_mkwrite() code was
> > actually correct. But it's really subtle...
> 
> I'd call it "completely undocumented", not "subtle". :/

Yeah, we probably need a writeup about the whole truncate/punch hole vs. fault
vs. munmap serialization, somewhere in Documentation/... I'll try to find time
to write that down sometime next week.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 2/2] ext2: Add locking for DAX faults
  2015-10-14 15:31       ` Ross Zwisler
@ 2015-10-19 12:47         ` Jan Kara
  -1 siblings, 0 replies; 22+ messages in thread
From: Jan Kara @ 2015-10-19 12:47 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-kernel, Alexander Viro, Jan Kara, Matthew Wilcox,
	linux-ext4, linux-fsdevel, Andrew Morton, Dan Williams,
	Dave Chinner, Kirill A. Shutemov, linux-nvdimm, Matthew Wilcox

On Wed 14-10-15 09:31:38, Ross Zwisler wrote:
> On Wed, Oct 14, 2015 at 10:51:19AM +0200, Jan Kara wrote:
> > On Tue 13-10-15 16:25:37, Ross Zwisler wrote:
> > > Add locking to ensure that DAX faults are isolated from ext2 operations
> > > that modify the data blocks allocation for an inode.  This is intended to
> > > be analogous to the work being done in XFS by Dave Chinner:
> > > 
> > > http://www.spinics.net/lists/linux-fsdevel/msg90260.html
> > > 
> > > Compared with XFS the ext2 case is greatly simplified by the fact that ext2
> > > already allocates and zeros new blocks before they are returned as part of
> > > ext2_get_block(), so DAX doesn't need to worry about getting unmapped or
> > > unwritten buffer heads.
> > > 
> > > This means that the only work we need to do in ext2 is to isolate the DAX
> > > faults from inode block allocation changes.  I believe this just means that
> > > we need to isolate the DAX faults from truncate operations.
> > > 
> > > The newly introduced dax_sem is intended to replicate the protection
> > > offered by i_mmaplock in XFS.  In addition to truncate the i_mmaplock also
> > > protects XFS operations like hole punching, fallocate down, extent
> > > manipulation IOCTLS like xfs_ioc_space() and extent swapping.  Truncate is
> > > the only one of these operations supported by ext2.
> > > 
> > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > 
> > The patch looks good to me. Feel free to add:
> > 
> > Reviewed-by: Jan Kara <jack@suse.com>
> > 
> > Or I can push the patch through my tree as it seems to be independent of
> > any other changes, am I right?
> > 
> > 								Honza
> 
> Yep, it is independent of other patches.  It'd be great if you pushed it up
> through your tree, thanks!

I've pulled the patch into my tree and will push it to Linus in the next
merge window.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2015-10-19 12:47 UTC | newest]

Thread overview: 22+ messages
2015-10-13 22:25 [PATCH v2 0/2] Add updated DAX locking to ext2 Ross Zwisler
2015-10-13 22:25 ` [PATCH v2 1/2] dax: dax_pfn_mkwrite() truncate race check Ross Zwisler
2015-10-14  5:25   ` Dave Chinner
2015-10-14  8:40     ` Jan Kara
2015-10-14 22:53       ` Dave Chinner
2015-10-16  7:55         ` Jan Kara
2015-10-14 17:26     ` Ross Zwisler
2015-10-13 22:25 ` [PATCH v2 2/2] ext2: Add locking for DAX faults Ross Zwisler
2015-10-14  8:51   ` Jan Kara
2015-10-14 15:31     ` Ross Zwisler
2015-10-19 12:47       ` Jan Kara
