linux-fsdevel.vger.kernel.org archive mirror
* [PATCH 0/7] xfs, dax: fix the page fault/allocation mess
@ 2015-10-01  7:46 Dave Chinner
  2015-10-01  7:46 ` [PATCH 1/7] Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX" Dave Chinner
                   ` (7 more replies)
  0 siblings, 8 replies; 16+ messages in thread
From: Dave Chinner @ 2015-10-01  7:46 UTC (permalink / raw)
  To: xfs
  Cc: linux-fsdevel, ross.zwisler, willy, dan.j.williams,
	kirill.shutemov, linux-nvdimm, jack, linux-kernel

Hi folks,

As discussed in the recent thread about problems with DAX locking:

http://www.gossamer-threads.com/lists/linux/kernel/2264090?do=post_view_threaded

I said that I'd post the patch set that fixed the problems for XFS
as soon as I had something sane and workable. That's what this
series is.

To start with, it passes the xfstests "auto" group with the only
failures being expected failures, or failures due to unexpected
allocation patterns or attempts to use unsupported block sizes. That
makes it better than any previous version of the XFS/DAX code.

The patchset starts by reverting the two patches that were
introduced in 4.3-rc1 to try to fix the fault vs fault and fault vs
truncate races that caused deadlocks. This fixes the hangs in
generic/075 that these patches introduced.

Patch 3 enables XFS to handle the behaviour of DAX and DIO when
asked to allocate the block at (2^63 - 1FSB), where the offset +
count is technically illegal (larger than sb->s_maxbytes) and
overflows an s64 variable. This is hidden right now by the fact that
all DAX and DIO allocations use unwritten extents, but patch 5
exposes it for DAX.
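
To make the overflow concrete, here's a minimal sketch of the failing
arithmetic (illustrative only, not code from the series; it relies on
the kernel's wrapping signed overflow semantics):

	/* last supported block: offset 2^63 - 4096 for a 4k FSB */
	loff_t	offset = 0x7ffffffffffff000LL;
	size_t	count = 4096;

	/*
	 * offset + count = 0x8000000000000000, which wraps to -2^63
	 * in an s64, so an "offset + count > isize" test never fires.
	 * Patch 3 detects the wrap explicitly:
	 */
	if (offset + (loff_t)count < 0) {
		/* IO runs to the s64 limit - treat as extending */
	}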

Patch 4 introduces the ability for XFS to allocate physically zeroed
data blocks. This is done for each physical extent that is
allocated, deep inside the allocator itself; it is guaranteed to be
atomic with the allocation transaction and hence has no
crash+recovery exposure issues.

This is necessary because the BMAPI layer merges allocated extents
in the BMBT before it returns the mapped extent back to the high
level get_blocks() code. Hence the high level code can be presented
with a single extent that is made up of merged new and existing
extents, and so zeroing can't be done at this layer.
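
To illustrate the merging problem (hypothetical block numbers, purely
for illustration): suppose a fault asks get_blocks() to map blocks
100-199 and blocks 100-149 already contain data. The allocator fills
150-199, the BMBT merges the two records, and get_blocks() is handed
a single extent:

	requested range:     [100 ..................... 199]
	existing extent:     [100 ..... 149]
	new allocation:                     [150 ....... 199]
	merged BMBT record:  [100 ..................... 199]

Zeroing the returned extent would destroy the existing data in blocks
100-149; only the newly allocated piece may be zeroed, and only the
allocator knows where its boundaries are.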

The advantage of driving the zeroing deep into the allocator is that
the functionality is now available to all XFS code. Hence we can
allocate pre-zeroed blocks on any type of storage, and we can
utilise storage-based hardware acceleration (e.g. discard to zero,
WRITE_SAME, etc) to do the zeroing. From this POV, DAX is just
another hardware accelerated physical zeroing mechanism for XFS. :)
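
The dispatch ends up looking like this (condensed from the
xfs_zero_extent() helper that patch 4 adds; the sb_issue_zeroout()
path is where the block layer can apply hardware offloads):

	int
	xfs_zero_extent(struct xfs_inode *ip, xfs_fsblock_t start_fsb,
			xfs_off_t count_fsb)
	{
		struct xfs_mount *mp = ip->i_mount;
		xfs_daddr_t	sector = xfs_fsb_to_db(ip, start_fsb);
		sector_t	block = XFS_BB_TO_FSBT(mp, sector);
		ssize_t		size = XFS_FSB_TO_B(mp, count_fsb);

		/* DAX: zero through the CPU-addressable mapping */
		if (IS_DAX(VFS_I(ip)))
			return dax_clear_blocks(VFS_I(ip), block, size);

		/* otherwise let the block layer pick the fastest method */
		return sb_issue_zeroout(mp->m_super, block, count_fsb,
					GFP_NOFS);
	}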

[ This is an example of the mantra I repeat a lot: solve the problem
  properly the first time and it will make everything simpler! Sure,
  it took me three attempts to work out how to solve it in a sane
  manner, but that's pretty much par for the course with anything
  non-trivial. ]

Patch 5 makes __xfs_get_blocks() aware that it is being called from
the DAX fault path and makes sure it returns zeroed blocks rather
than unwritten extents via XFS_BMAPI_ZERO. It also now sets
XFS_BMAPI_CONVERT, which tells it to convert unwritten extents to
written, zeroed blocks. This is the major change of behaviour.
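
In outline (trimmed from the xfs_iomap_write_direct() change in patch
5), the allocation call now selects its flags like this:

	int	bmapi_flags = XFS_BMAPI_PREALLOC;

	/*
	 * DAX faults must only ever see written, zeroed extents, so
	 * zero during allocation and convert any unwritten extents
	 * found in the mapped range.
	 */
	if (IS_DAX(VFS_I(ip)))
		bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO;

	error = xfs_bmapi_write(tp, ip, offset_fsb, count_fsb,
				bmapi_flags, &firstfsb, 0,
				imap, &nimaps, &free_list);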

Patch 6 removes the IO completion callbacks from the XFS DAX code as
they are no longer necessary after patch 5.

Patch 7 adds pfn_mkwrite support to XFS. This is needed to fix
generic/080, which detects a failure to update the inode timestamp
on a pfn fault. It also adds the same locking as the XFS
implementation of ->fault and ->page_mkwrite, and hence provides
correct serialisation against truncate, hole punching, etc. that
doesn't currently exist.

The next step is to do the same "block zeroing during allocation"
work in ext4; then the block zeroing and complete_unwritten
callbacks can be removed from the DAX API and code. I've had a brief
look at the ext4 code - the block zeroing should be able to be done
by overloading the existing zeroout code that ext4 has in its
unwritten extent allocation path. I'd much prefer that an ext4
expert does this work, and then we can clean up the DAX code...

Cheers,

Dave.



* [PATCH 1/7] Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX"
  2015-10-01  7:46 [PATCH 0/7] xfs, dax: fix the page fault/allocation mess Dave Chinner
@ 2015-10-01  7:46 ` Dave Chinner
  2015-10-01  8:35   ` kbuild test robot
  2015-10-01 20:27   ` Ross Zwisler
  2015-10-01  7:46 ` [PATCH 2/7] Revert "dax: fix race between simultaneous faults" Dave Chinner
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 16+ messages in thread
From: Dave Chinner @ 2015-10-01  7:46 UTC (permalink / raw)
  To: xfs
  Cc: jack, linux-nvdimm, linux-kernel, linux-fsdevel, willy,
	ross.zwisler, dan.j.williams, kirill.shutemov

This reverts commit 46c043ede4711e8d598b9d63c5616c1fedb0605e.
---
 fs/dax.c    | 36 ++++++++++++++++--------------------
 mm/memory.c | 11 +++++++++--
 2 files changed, 25 insertions(+), 22 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 7ae6df7..400fe95 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -569,26 +569,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE)
 		goto fallback;
 
-	if (buffer_unwritten(&bh) || buffer_new(&bh)) {
-		int i;
-		for (i = 0; i < PTRS_PER_PMD; i++)
-			clear_pmem(kaddr + i * PAGE_SIZE, PAGE_SIZE);
-		wmb_pmem();
-		count_vm_event(PGMAJFAULT);
-		mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
-		result |= VM_FAULT_MAJOR;
-	}
-
-	/*
-	 * If we allocated new storage, make sure no process has any
-	 * zero pages covering this hole
-	 */
-	if (buffer_new(&bh)) {
-		i_mmap_unlock_write(mapping);
-		unmap_mapping_range(mapping, pgoff << PAGE_SHIFT, PMD_SIZE, 0);
-		i_mmap_lock_write(mapping);
-	}
-
 	/*
 	 * If a truncate happened while we were allocating blocks, we may
 	 * leave blocks allocated to the file that are beyond EOF.  We can't
@@ -603,6 +583,13 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	if ((pgoff | PG_PMD_COLOUR) >= size)
 		goto fallback;
 
+	/*
+	 * If we allocated new storage, make sure no process has any
+	 * zero pages covering this hole
+	 */
+	if (buffer_new(&bh))
+		unmap_mapping_range(mapping, pgoff << PAGE_SHIFT, PMD_SIZE, 0);
+
 	if (!write && !buffer_mapped(&bh) && buffer_uptodate(&bh)) {
 		spinlock_t *ptl;
 		pmd_t entry;
@@ -633,6 +620,15 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
 			goto fallback;
 
+		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
+			int i;
+			for (i = 0; i < PTRS_PER_PMD; i++)
+				clear_page(kaddr + i * PAGE_SIZE);
+			count_vm_event(PGMAJFAULT);
+			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
+			result |= VM_FAULT_MAJOR;
+		}
+
 		result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write);
 	}
 
diff --git a/mm/memory.c b/mm/memory.c
index 9cb2747..5ec066f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2426,10 +2426,17 @@ void unmap_mapping_range(struct address_space *mapping,
 	if (details.last_index < details.first_index)
 		details.last_index = ULONG_MAX;
 
-	i_mmap_lock_write(mapping);
+
+	/*
+	 * DAX already holds i_mmap_lock to serialise file truncate vs
+	 * page fault and page fault vs page fault.
+	 */
+	if (!IS_DAX(mapping->host))
+		i_mmap_lock_write(mapping);
 	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
 		unmap_mapping_range_tree(&mapping->i_mmap, &details);
-	i_mmap_unlock_write(mapping);
+	if (!IS_DAX(mapping->host))
+		i_mmap_unlock_write(mapping);
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
-- 
2.5.0



* [PATCH 2/7] Revert "dax: fix race between simultaneous faults"
  2015-10-01  7:46 [PATCH 0/7] xfs, dax: fix the page fault/allocation mess Dave Chinner
  2015-10-01  7:46 ` [PATCH 1/7] Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX" Dave Chinner
@ 2015-10-01  7:46 ` Dave Chinner
  2015-10-01  7:46 ` [PATCH 3/7] xfs: fix inode size update overflow in xfs_map_direct() Dave Chinner
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2015-10-01  7:46 UTC (permalink / raw)
  To: xfs
  Cc: linux-fsdevel, ross.zwisler, willy, dan.j.williams,
	kirill.shutemov, linux-nvdimm, jack, linux-kernel

This reverts commit 843172978bb92997310d2f7fbc172ece423cfc02.
---
 fs/dax.c    | 33 ++++++++++++++++-----------------
 mm/memory.c | 11 +++--------
 2 files changed, 19 insertions(+), 25 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 400fe95..3994a2b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -285,6 +285,7 @@ static int copy_user_bh(struct page *to, struct buffer_head *bh,
 static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 			struct vm_area_struct *vma, struct vm_fault *vmf)
 {
+	struct address_space *mapping = inode->i_mapping;
 	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
 	unsigned long vaddr = (unsigned long)vmf->virtual_address;
 	void __pmem *addr;
@@ -292,6 +293,8 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	pgoff_t size;
 	int error;
 
+	i_mmap_lock_read(mapping);
+
 	/*
 	 * Check truncate didn't happen while we were allocating a block.
 	 * If it did, this block may or may not be still allocated to the
@@ -321,6 +324,8 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	error = vm_insert_mixed(vma, vaddr, pfn);
 
  out:
+	i_mmap_unlock_read(mapping);
+
 	return error;
 }
 
@@ -382,17 +387,15 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			 * from a read fault and we've raced with a truncate
 			 */
 			error = -EIO;
-			goto unlock;
+			goto unlock_page;
 		}
-	} else {
-		i_mmap_lock_write(mapping);
 	}
 
 	error = get_block(inode, block, &bh, 0);
 	if (!error && (bh.b_size < PAGE_SIZE))
 		error = -EIO;		/* fs corruption? */
 	if (error)
-		goto unlock;
+		goto unlock_page;
 
 	if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page) {
 		if (vmf->flags & FAULT_FLAG_WRITE) {
@@ -403,9 +406,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			if (!error && (bh.b_size < PAGE_SIZE))
 				error = -EIO;
 			if (error)
-				goto unlock;
+				goto unlock_page;
 		} else {
-			i_mmap_unlock_write(mapping);
 			return dax_load_hole(mapping, page, vmf);
 		}
 	}
@@ -417,15 +419,17 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		else
 			clear_user_highpage(new_page, vaddr);
 		if (error)
-			goto unlock;
+			goto unlock_page;
 		vmf->page = page;
 		if (!page) {
+			i_mmap_lock_read(mapping);
 			/* Check we didn't race with truncate */
 			size = (i_size_read(inode) + PAGE_SIZE - 1) >>
 								PAGE_SHIFT;
 			if (vmf->pgoff >= size) {
+				i_mmap_unlock_read(mapping);
 				error = -EIO;
-				goto unlock;
+				goto out;
 			}
 		}
 		return VM_FAULT_LOCKED;
@@ -461,8 +465,6 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			WARN_ON_ONCE(!(vmf->flags & FAULT_FLAG_WRITE));
 	}
 
-	if (!page)
-		i_mmap_unlock_write(mapping);
  out:
 	if (error == -ENOMEM)
 		return VM_FAULT_OOM | major;
@@ -471,14 +473,11 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		return VM_FAULT_SIGBUS | major;
 	return VM_FAULT_NOPAGE | major;
 
- unlock:
+ unlock_page:
 	if (page) {
 		unlock_page(page);
 		page_cache_release(page);
-	} else {
-		i_mmap_unlock_write(mapping);
 	}
-
 	goto out;
 }
 EXPORT_SYMBOL(__dax_fault);
@@ -556,10 +555,10 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
 
 	bh.b_size = PMD_SIZE;
-	i_mmap_lock_write(mapping);
 	length = get_block(inode, block, &bh, write);
 	if (length)
 		return VM_FAULT_SIGBUS;
+	i_mmap_lock_read(mapping);
 
 	/*
 	 * If the filesystem isn't willing to tell us the length of a hole,
@@ -633,11 +632,11 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	}
 
  out:
+	i_mmap_unlock_read(mapping);
+
 	if (buffer_unwritten(&bh))
 		complete_unwritten(&bh, !(result & VM_FAULT_ERROR));
 
-	i_mmap_unlock_write(mapping);
-
 	return result;
 
  fallback:
diff --git a/mm/memory.c b/mm/memory.c
index 5ec066f..deb679c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2427,16 +2427,11 @@ void unmap_mapping_range(struct address_space *mapping,
 		details.last_index = ULONG_MAX;
 
 
-	/*
-	 * DAX already holds i_mmap_lock to serialise file truncate vs
-	 * page fault and page fault vs page fault.
-	 */
-	if (!IS_DAX(mapping->host))
-		i_mmap_lock_write(mapping);
+	/* DAX uses i_mmap_lock to serialise file truncate vs page fault */
+	i_mmap_lock_write(mapping);
 	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
 		unmap_mapping_range_tree(&mapping->i_mmap, &details);
-	if (!IS_DAX(mapping->host))
-		i_mmap_unlock_write(mapping);
+	i_mmap_unlock_write(mapping);
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
-- 
2.5.0


* [PATCH 3/7] xfs: fix inode size update overflow in xfs_map_direct()
  2015-10-01  7:46 [PATCH 0/7] xfs, dax: fix the page fault/allocation mess Dave Chinner
  2015-10-01  7:46 ` [PATCH 1/7] Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX" Dave Chinner
  2015-10-01  7:46 ` [PATCH 2/7] Revert "dax: fix race between simultaneous faults" Dave Chinner
@ 2015-10-01  7:46 ` Dave Chinner
  2015-10-01  7:46 ` [PATCH 4/7] xfs: introduce BMAPI_ZERO for allocating zeroed extents Dave Chinner
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2015-10-01  7:46 UTC (permalink / raw)
  To: xfs
  Cc: jack, linux-nvdimm, linux-kernel, linux-fsdevel, willy,
	ross.zwisler, dan.j.williams, kirill.shutemov

From: Dave Chinner <dchinner@redhat.com>

Both direct IO and DAX pass an offset and count into get_blocks that
will overflow an s64 variable when an IO goes into the last supported
block in a file (i.e. at offset 2^63 - 1FSB bytes). This can be seen
from the tracing:

xfs_get_blocks_alloc: [...] offset 0x7ffffffffffff000 count 4096
xfs_gbmap_direct:     [...] offset 0x7ffffffffffff000 count 4096
xfs_gbmap_direct_none:[...] offset 0x7ffffffffffff000 count 4096

0x7ffffffffffff000 + 4096 = 0x8000000000000000, which overflows the
s64 offset. Hence we fail to detect the need for a filesize update,
and an ioend is not allocated.

This is *mostly* avoided for direct IO because such extending IOs
occur with full block allocation, and so the "IS_UNWRITTEN()" check
still evaluates as true and we get an ioend that way. However, single
sector extending IOs to this last block expose the problem: after the
first allocating direct IO, the overflow prevents any further file
size updates from occurring.

There is one further complexity: the DAX page fault path also
exposes the same issue in block allocation. However, page faults
cannot extend the file size, so in this case we want to allocate the
block but do not want to allocate an ioend to enable file size
update at IO completion. Hence we now need to distinguish between
direct IO path allocation and DAX fault path allocation to avoid
leaking ioend structures.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c | 50 ++++++++++++++++++++++++++++++++++++++++++++------
 fs/xfs/xfs_aops.h |  2 ++
 fs/xfs/xfs_file.c |  6 +++---
 3 files changed, 49 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 50ab287..e747d6a 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1250,13 +1250,28 @@ xfs_vm_releasepage(
  * the DIO. There is only going to be one reference to the ioend and its life
  * cycle is constrained by the DIO completion code. hence we don't need
  * reference counting here.
+ *
+ * Note that for DIO, an IO to the highest supported file block offset (i.e.
+ * 2^63 - 1FSB bytes) will result in the offset + count overflowing a signed 64
+ * bit variable. Hence if we see this overflow, we have to assume that the IO is
+ * extending the file size. We won't know for sure until IO completion is run
+ * and the actual max write offset is communicated to the IO completion
+ * routine.
+ *
+ * For DAX page faults, we are preparing to never see unwritten extents here,
+ * nor should we ever extend the inode size. Hence we will soon have nothing to
+ * do here for this case, ensuring we don't have to provide an IO completion
+ * callback to free an ioend that we don't actually need for a fault into the
+ * page at offset (2^63 - 1FSB) bytes.
  */
+
 static void
 xfs_map_direct(
 	struct inode		*inode,
 	struct buffer_head	*bh_result,
 	struct xfs_bmbt_irec	*imap,
-	xfs_off_t		offset)
+	xfs_off_t		offset,
+	bool			dax_fault)
 {
 	struct xfs_ioend	*ioend;
 	xfs_off_t		size = bh_result->b_size;
@@ -1269,6 +1284,16 @@ xfs_map_direct(
 
 	trace_xfs_gbmap_direct(XFS_I(inode), offset, size, type, imap);
 
+	/* XXX: preparation for removing unwritten extents in DAX */
+#if 0
+	if (dax_fault) {
+		ASSERT(type == XFS_IO_OVERWRITE);
+		trace_xfs_gbmap_direct_none(XFS_I(inode), offset, size, type,
+					    imap);
+		return;
+	}
+#endif
+
 	if (bh_result->b_private) {
 		ioend = bh_result->b_private;
 		ASSERT(ioend->io_size > 0);
@@ -1283,7 +1308,8 @@ xfs_map_direct(
 					      ioend->io_size, ioend->io_type,
 					      imap);
 	} else if (type == XFS_IO_UNWRITTEN ||
-		   offset + size > i_size_read(inode)) {
+		   offset + size > i_size_read(inode) ||
+		   offset + size < 0) {
 		ioend = xfs_alloc_ioend(inode, type);
 		ioend->io_offset = offset;
 		ioend->io_size = size;
@@ -1345,7 +1371,8 @@ __xfs_get_blocks(
 	sector_t		iblock,
 	struct buffer_head	*bh_result,
 	int			create,
-	bool			direct)
+	bool			direct,
+	bool			dax_fault)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
@@ -1458,7 +1485,8 @@ __xfs_get_blocks(
 			set_buffer_unwritten(bh_result);
 		/* direct IO needs special help */
 		if (create && direct)
-			xfs_map_direct(inode, bh_result, &imap, offset);
+			xfs_map_direct(inode, bh_result, &imap, offset,
+				       dax_fault);
 	}
 
 	/*
@@ -1505,7 +1533,7 @@ xfs_get_blocks(
 	struct buffer_head	*bh_result,
 	int			create)
 {
-	return __xfs_get_blocks(inode, iblock, bh_result, create, false);
+	return __xfs_get_blocks(inode, iblock, bh_result, create, false, false);
 }
 
 int
@@ -1515,7 +1543,17 @@ xfs_get_blocks_direct(
 	struct buffer_head	*bh_result,
 	int			create)
 {
-	return __xfs_get_blocks(inode, iblock, bh_result, create, true);
+	return __xfs_get_blocks(inode, iblock, bh_result, create, true, false);
+}
+
+int
+xfs_get_blocks_dax_fault(
+	struct inode		*inode,
+	sector_t		iblock,
+	struct buffer_head	*bh_result,
+	int			create)
+{
+	return __xfs_get_blocks(inode, iblock, bh_result, create, true, true);
 }
 
 static void
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index 86afd1a..d39ba25 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -58,6 +58,8 @@ int	xfs_get_blocks(struct inode *inode, sector_t offset,
 		       struct buffer_head *map_bh, int create);
 int	xfs_get_blocks_direct(struct inode *inode, sector_t offset,
 			      struct buffer_head *map_bh, int create);
+int	xfs_get_blocks_dax_fault(struct inode *inode, sector_t offset,
+			         struct buffer_head *map_bh, int create);
 void	xfs_end_io_dax_write(struct buffer_head *bh, int uptodate);
 
 extern void xfs_count_page_state(struct page *, int *, int *);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e78feb4..27abe1c 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1503,7 +1503,7 @@ xfs_filemap_page_mkwrite(
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
 	if (IS_DAX(inode)) {
-		ret = __dax_mkwrite(vma, vmf, xfs_get_blocks_direct,
+		ret = __dax_mkwrite(vma, vmf, xfs_get_blocks_dax_fault,
 				    xfs_end_io_dax_write);
 	} else {
 		ret = __block_page_mkwrite(vma, vmf, xfs_get_blocks);
@@ -1538,7 +1538,7 @@ xfs_filemap_fault(
 		 * changes to xfs_get_blocks_direct() to map unwritten extent
 		 * ioend for conversion on read-only mappings.
 		 */
-		ret = __dax_fault(vma, vmf, xfs_get_blocks_direct, NULL);
+		ret = __dax_fault(vma, vmf, xfs_get_blocks_dax_fault, NULL);
 	} else
 		ret = filemap_fault(vma, vmf);
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
@@ -1565,7 +1565,7 @@ xfs_filemap_pmd_fault(
 	sb_start_pagefault(inode->i_sb);
 	file_update_time(vma->vm_file);
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
-	ret = __dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_direct,
+	ret = __dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault,
 				    xfs_end_io_dax_write);
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 	sb_end_pagefault(inode->i_sb);
-- 
2.5.0



* [PATCH 4/7] xfs: introduce BMAPI_ZERO for allocating zeroed extents
  2015-10-01  7:46 [PATCH 0/7] xfs, dax: fix the page fault/allocation mess Dave Chinner
                   ` (2 preceding siblings ...)
  2015-10-01  7:46 ` [PATCH 3/7] xfs: fix inode size update overflow in xfs_map_direct() Dave Chinner
@ 2015-10-01  7:46 ` Dave Chinner
  2015-10-01  7:46 ` [PATCH 5/7] xfs: Don't use unwritten extents for DAX Dave Chinner
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2015-10-01  7:46 UTC (permalink / raw)
  To: xfs
  Cc: linux-fsdevel, ross.zwisler, willy, dan.j.williams,
	kirill.shutemov, linux-nvdimm, jack, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

To enable DAX to do atomic allocation of zeroed extents, we need to
drive the block zeroing deep into the allocator. Because
xfs_bmapi_write() can return merged extents on allocation that were
only partially allocated (i.e. the requested range spans both
allocated and hole regions, and allocation into the hole was
contiguous with the existing extent), we cannot zero the extent
returned from xfs_bmapi_write() as that would overwrite existing
data with zeros.

Hence we have to drive the extent zeroing into the allocation code,
prior to where we merge the extents into the BMBT and return the
resultant map. This means we need to propagate the requirement down
to xfs_alloc_vextent() and issue the block zeroing at that point.

While this functionality is being introduced for DAX, there is no
reason why it is specific to DAX - we can pre-zero blocks during the
allocation transaction on any type of device. It's just slow (and
usually slower than unwritten allocation and conversion) on
traditional block devices, so it doesn't tend to get used there. We
can, however, hook hardware zeroing optimisations (e.g. via
sb_issue_zeroout()) into this operation, so it may be useful in
future and hence the "allocate zeroed blocks" API needs to be
implementation neutral.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_alloc.c | 10 +++++++++-
 fs/xfs/libxfs/xfs_alloc.h |  8 +++++---
 fs/xfs/libxfs/xfs_bmap.c  | 25 +++++++++++++++++++++++--
 fs/xfs/libxfs/xfs_bmap.h  | 13 +++++++++++--
 fs/xfs/xfs_bmap_util.c    | 36 ++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_mount.h        |  3 +++
 6 files changed, 87 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index ffad7f2..4cffc17 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2503,7 +2503,7 @@ xfs_alloc_vextent(
 		 * Try near allocation first, then anywhere-in-ag after
 		 * the first a.g. fails.
 		 */
-		if ((args->userdata  == XFS_ALLOC_INITIAL_USER_DATA) &&
+		if ((args->userdata & XFS_ALLOC_INITIAL_USER_DATA) &&
 		    (mp->m_flags & XFS_MOUNT_32BITINODES)) {
 			args->fsbno = XFS_AGB_TO_FSB(mp,
 					((mp->m_agfrotor / rotorstep) %
@@ -2634,6 +2634,14 @@ xfs_alloc_vextent(
 		XFS_AG_CHECK_DADDR(mp, XFS_FSB_TO_DADDR(mp, args->fsbno),
 			args->len);
 #endif
+
+		/* Zero the extent if we were asked to do so */
+		if (args->userdata & XFS_ALLOC_USERDATA_ZERO) {
+			error = xfs_zero_extent(args->ip, args->fsbno, args->len);
+			if (error)
+				goto error0;
+		}
+
 	}
 	xfs_perag_put(args->pag);
 	return 0;
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index ca1c816..0ecde4d 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -101,6 +101,7 @@ typedef struct xfs_alloc_arg {
 	struct xfs_mount *mp;		/* file system mount point */
 	struct xfs_buf	*agbp;		/* buffer for a.g. freelist header */
 	struct xfs_perag *pag;		/* per-ag struct for this agno */
+	struct xfs_inode *ip;		/* for userdata zeroing method */
 	xfs_fsblock_t	fsbno;		/* file system block number */
 	xfs_agnumber_t	agno;		/* allocation group number */
 	xfs_agblock_t	agbno;		/* allocation group-relative block # */
@@ -120,15 +121,16 @@ typedef struct xfs_alloc_arg {
 	char		wasdel;		/* set if allocation was prev delayed */
 	char		wasfromfl;	/* set if allocation is from freelist */
 	char		isfl;		/* set if is freelist blocks - !acctg */
-	char		userdata;	/* set if this is user data */
+	char		userdata;	/* mask defining userdata treatment */
 	xfs_fsblock_t	firstblock;	/* io first block allocated */
 } xfs_alloc_arg_t;
 
 /*
  * Defines for userdata
  */
-#define XFS_ALLOC_USERDATA		1	/* allocation is for user data*/
-#define XFS_ALLOC_INITIAL_USER_DATA	2	/* special case start of file */
+#define XFS_ALLOC_USERDATA		(1 << 0)/* allocation is for user data*/
+#define XFS_ALLOC_INITIAL_USER_DATA	(1 << 1)/* special case start of file */
+#define XFS_ALLOC_USERDATA_ZERO		(1 << 2)/* zero extent on allocation */
 
 xfs_extlen_t xfs_alloc_longest_free_extent(struct xfs_mount *mp,
 		struct xfs_perag *pag, xfs_extlen_t need);
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 8e2010d..8f607ed 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3800,8 +3800,13 @@ xfs_bmap_btalloc(
 	args.wasdel = ap->wasdel;
 	args.isfl = 0;
 	args.userdata = ap->userdata;
-	if ((error = xfs_alloc_vextent(&args)))
+	if (ap->userdata & XFS_ALLOC_USERDATA_ZERO)
+		args.ip = ap->ip;
+
+	error = xfs_alloc_vextent(&args);
+	if (error)
 		return error;
+
 	if (tryagain && args.fsbno == NULLFSBLOCK) {
 		/*
 		 * Exact allocation failed. Now try with alignment
@@ -4300,11 +4305,14 @@ xfs_bmapi_allocate(
 
 	/*
 	 * Indicate if this is the first user data in the file, or just any
-	 * user data.
+	 * user data. And if it is userdata, indicate whether it needs to
+	 * be initialised to zero during allocation.
 	 */
 	if (!(bma->flags & XFS_BMAPI_METADATA)) {
 		bma->userdata = (bma->offset == 0) ?
 			XFS_ALLOC_INITIAL_USER_DATA : XFS_ALLOC_USERDATA;
+		if (bma->flags & XFS_BMAPI_ZERO)
+			bma->userdata |= XFS_ALLOC_USERDATA_ZERO;
 	}
 
 	bma->minlen = (bma->flags & XFS_BMAPI_CONTIG) ? bma->length : 1;
@@ -4419,6 +4427,17 @@ xfs_bmapi_convert_unwritten(
 	mval->br_state = (mval->br_state == XFS_EXT_UNWRITTEN)
 				? XFS_EXT_NORM : XFS_EXT_UNWRITTEN;
 
+	/*
+	 * Before insertion into the bmbt, zero the range being converted
+	 * if required.
+	 */
+	if (flags & XFS_BMAPI_ZERO) {
+		error = xfs_zero_extent(bma->ip, mval->br_startblock,
+					mval->br_blockcount);
+		if (error)
+			return error;
+	}
+
 	error = xfs_bmap_add_extent_unwritten_real(bma->tp, bma->ip, &bma->idx,
 			&bma->cur, mval, bma->firstblock, bma->flist,
 			&tmp_logflags);
@@ -4511,6 +4530,8 @@ xfs_bmapi_write(
 	ASSERT(len > 0);
 	ASSERT(XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_LOCAL);
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
+	ASSERT((flags & (XFS_BMAPI_METADATA | XFS_BMAPI_ZERO)) !=
+			(XFS_BMAPI_METADATA | XFS_BMAPI_ZERO));
 
 	if (unlikely(XFS_TEST_ERROR(
 	    (XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_EXTENTS &&
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 6aaa0c1..a160f8a 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -52,9 +52,9 @@ struct xfs_bmalloca {
 	xfs_extlen_t		minleft; /* amount must be left after alloc */
 	bool			eof;	/* set if allocating past last extent */
 	bool			wasdel;	/* replacing a delayed allocation */
-	bool			userdata;/* set if is user data */
 	bool			aeof;	/* allocated space at eof */
 	bool			conv;	/* overwriting unwritten extents */
+	char			userdata;/* userdata mask */
 	int			flags;
 };
 
@@ -109,6 +109,14 @@ typedef	struct xfs_bmap_free
  */
 #define XFS_BMAPI_CONVERT	0x040
 
+/*
+ * allocate zeroed extents - this requires all newly allocated user data extents
+ * to be initialised to zero. It will be ignored if XFS_BMAPI_METADATA is set.
+ * Use in conjunction with XFS_BMAPI_CONVERT to convert unwritten extents found
+ * within the allocation range to zeroed written extents.
+ */
+#define XFS_BMAPI_ZERO		0x080
+
 #define XFS_BMAPI_FLAGS \
 	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
 	{ XFS_BMAPI_METADATA,	"METADATA" }, \
@@ -116,7 +124,8 @@ typedef	struct xfs_bmap_free
 	{ XFS_BMAPI_PREALLOC,	"PREALLOC" }, \
 	{ XFS_BMAPI_IGSTATE,	"IGSTATE" }, \
 	{ XFS_BMAPI_CONTIG,	"CONTIG" }, \
-	{ XFS_BMAPI_CONVERT,	"CONVERT" }
+	{ XFS_BMAPI_CONVERT,	"CONVERT" }, \
+	{ XFS_BMAPI_ZERO,	"ZERO" }
 
 
 static inline int xfs_bmapi_aflag(int w)
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 3bf4ad0..3f59698 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -57,6 +57,35 @@ xfs_fsb_to_db(struct xfs_inode *ip, xfs_fsblock_t fsb)
 }
 
 /*
+ * Routine to zero an extent on disk allocated to the specific inode.
+ *
+ * The VFS functions take a linearised filesystem block offset, so we have to
+ * convert the sparse xfs fsb to the right format first.
+ * VFS types are real funky, too.
+ */
+int
+xfs_zero_extent(
+	struct xfs_inode *ip,
+	xfs_fsblock_t	start_fsb,
+	xfs_off_t	count_fsb)
+{
+	struct xfs_mount *mp = ip->i_mount;
+	xfs_daddr_t	sector = xfs_fsb_to_db(ip, start_fsb);
+	sector_t	block = XFS_BB_TO_FSBT(mp, sector);
+	ssize_t		size = XFS_FSB_TO_B(mp, count_fsb);
+
+	if (IS_DAX(VFS_I(ip)))
+		return dax_clear_blocks(VFS_I(ip), block, size);
+
+	/*
+	 * let the block layer decide on the fastest method of
+	 * implementing the zeroing.
+	 */
+	return sb_issue_zeroout(mp->m_super, block, count_fsb, GFP_NOFS);
+
+}
+
+/*
  * Routine to be called at transaction's end by xfs_bmapi, xfs_bunmapi
  * caller.  Frees all the extents that need freeing, which must be done
  * last due to locking considerations.  We never free any extents in
@@ -229,6 +258,13 @@ xfs_bmap_rtalloc(
 		xfs_trans_mod_dquot_byino(ap->tp, ap->ip,
 			ap->wasdel ? XFS_TRANS_DQ_DELRTBCOUNT :
 					XFS_TRANS_DQ_RTBCOUNT, (long) ralen);
+
+		/* Zero the extent if we were asked to do so */
+		if (ap->userdata & XFS_ALLOC_USERDATA_ZERO) {
+			error = xfs_zero_extent(ap->ip, ap->blkno, ap->length);
+			if (error)
+				return error;
+		}
 	} else {
 		ap->length = 0;
 	}
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 7999e91..404bfa5 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -336,4 +336,7 @@ extern int	xfs_dev_is_read_only(struct xfs_mount *, char *);
 
 extern void	xfs_set_low_space_thresholds(struct xfs_mount *);
 
+int	xfs_zero_extent(struct xfs_inode *ip, xfs_fsblock_t start_fsb,
+			xfs_off_t count_fsb);
+
 #endif	/* __XFS_MOUNT_H__ */
-- 
2.5.0


* [PATCH 5/7] xfs: Don't use unwritten extents for DAX
  2015-10-01  7:46 [PATCH 0/7] xfs, dax: fix the page fault/allocation mess Dave Chinner
                   ` (3 preceding siblings ...)
  2015-10-01  7:46 ` [PATCH 4/7] xfs: introduce BMAPI_ZERO for allocating zeroed extents Dave Chinner
@ 2015-10-01  7:46 ` Dave Chinner
  2015-10-01  7:46 ` [PATCH 6/7] xfs: DAX does not use IO completion callbacks Dave Chinner
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2015-10-01  7:46 UTC (permalink / raw)
  To: xfs
  Cc: jack, linux-nvdimm, linux-kernel, linux-fsdevel, willy,
	ross.zwisler, dan.j.williams, kirill.shutemov

From: Dave Chinner <dchinner@redhat.com>

DAX has a page fault serialisation problem with block allocation.
Because it allows concurrent page faults and does not have a page
lock to serialise faults to the same page, two concurrent faults to
the same page can race.

When two read faults race, this isn't a huge problem as the data
underlying the page is not changing and so "detect and drop" works
just fine. The issues are to do with write faults.

When two write faults occur, we serialise block allocation in
get_blocks() so only one fault will allocate the extent. It will,
however, be marked as an unwritten extent, and that is where the
problem lies - the DAX fault code cannot differentiate between a
block that was just allocated and a block that was preallocated and
needs zeroing. The result is that both write faults end up zeroing
the block and attempting to convert it to written.

The problem is that the first fault can zero and convert before the
second fault starts zeroing, resulting in the second fault's zeroing
overwriting the data written via the first fault with zeros. The
second fault then attempts to convert the unwritten extent, which is
a no-op because it is already written. Data loss occurs as a result
of this race.
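
Schematically, the race looks like this (a simplified timeline, not
an actual trace):

	fault A                         fault B
	-------                         -------
	get_blocks(): allocate,
	  extent returned unwritten     get_blocks(): sees unwritten
	zero block
	convert extent to written
	store data through mapping
	                                zero block      <- destroys the
	                                                   data just stored
	                                convert: no-op, already written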

Because there is no sane locking construct in the page fault code
that we can use for serialisation across the page faults, we need to
ensure block allocation and zeroing occur atomically in the
filesystem. This means we can still take concurrent page faults, and
the only time they will serialise is in the filesystem
mapping/allocation callback. The page fault code will always see
written, initialised extents, so we will be able to remove the
unwritten extent handling from the DAX code when all filesystems are
converted.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/dax.c           |  5 +++++
 fs/xfs/xfs_aops.c  | 13 +++++++++----
 fs/xfs/xfs_iomap.c | 17 ++++++++++++++++-
 3 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 3994a2b..b36d6d2 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -29,6 +29,11 @@
 #include <linux/uio.h>
 #include <linux/vmstat.h>
 
+/*
+ * dax_clear_blocks() is called from within transaction context from XFS,
+ * so the stack from this point must follow GFP_NOFS allocation
+ * semantics for all operations.
+ */
 int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 {
 	struct block_device *bdev = inode->i_sb->s_bdev;
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index e747d6a..df3dabd 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1284,15 +1284,12 @@ xfs_map_direct(
 
 	trace_xfs_gbmap_direct(XFS_I(inode), offset, size, type, imap);
 
-	/* XXX: preparation for removing unwritten extents in DAX */
-#if 0
 	if (dax_fault) {
 		ASSERT(type == XFS_IO_OVERWRITE);
 		trace_xfs_gbmap_direct_none(XFS_I(inode), offset, size, type,
 					    imap);
 		return;
 	}
-#endif
 
 	if (bh_result->b_private) {
 		ioend = bh_result->b_private;
@@ -1420,10 +1417,12 @@ __xfs_get_blocks(
 	if (error)
 		goto out_unlock;
 
+	/* for DAX, we convert unwritten extents directly */
 	if (create &&
 	    (!nimaps ||
 	     (imap.br_startblock == HOLESTARTBLOCK ||
-	      imap.br_startblock == DELAYSTARTBLOCK))) {
+	      imap.br_startblock == DELAYSTARTBLOCK) ||
+	     (IS_DAX(inode) && ISUNWRITTEN(&imap)))) {
 		if (direct || xfs_get_extsz_hint(ip)) {
 			/*
 			 * Drop the ilock in preparation for starting the block
@@ -1468,6 +1467,12 @@ __xfs_get_blocks(
 		goto out_unlock;
 	}
 
+	if (IS_DAX(inode) && create) {
+		ASSERT(!ISUNWRITTEN(&imap));
+		/* zeroing is not needed at a higher layer */
+		new = 0;
+	}
+
 	/* trim mapping down to size requested */
 	if (direct || size > (1 << inode->i_blkbits))
 		xfs_map_trim_size(inode, iblock, bh_result,
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 1f86033..8b2b4a9 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -131,6 +131,7 @@ xfs_iomap_write_direct(
 	uint		qblocks, resblks, resrtextents;
 	int		committed;
 	int		error;
+	int		bmapi_flags = XFS_BMAPI_PREALLOC;
 
 	error = xfs_qm_dqattach(ip, 0);
 	if (error)
@@ -196,13 +197,26 @@ xfs_iomap_write_direct(
 	xfs_trans_ijoin(tp, ip, 0);
 
 	/*
+	 * For DAX, we do not allocate unwritten extents, but instead we zero
+	 * the block before we commit the transaction.  Ideally we'd like to do
+	 * this outside the transaction context, but if we commit and then crash
+	 * we may not have zeroed the blocks and this will be exposed on
+	 * recovery of the allocation. Hence we must zero before commit.
+	 * Further, if we are mapping unwritten extents here, we need to zero
+	 * and convert them to written so that we don't need an unwritten extent
+	 * callback for DAX.
+	 */
+	if (IS_DAX(VFS_I(ip)))
+		bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO;
+
+	/*
 	 * From this point onwards we overwrite the imap pointer that the
 	 * caller gave to us.
 	 */
 	xfs_bmap_init(&free_list, &firstfsb);
 	nimaps = 1;
 	error = xfs_bmapi_write(tp, ip, offset_fsb, count_fsb,
-				XFS_BMAPI_PREALLOC, &firstfsb, 0,
+				bmapi_flags, &firstfsb, 0,
 				imap, &nimaps, &free_list);
 	if (error)
 		goto out_bmap_cancel;
@@ -213,6 +227,7 @@ xfs_iomap_write_direct(
 	error = xfs_bmap_finish(&tp, &free_list, &committed);
 	if (error)
 		goto out_bmap_cancel;
+
 	error = xfs_trans_commit(tp);
 	if (error)
 		goto out_unlock;
-- 
2.5.0



* [PATCH 6/7] xfs: DAX does not use IO completion callbacks
  2015-10-01  7:46 [PATCH 0/7] xfs, dax: fix the page fault/allocation mess Dave Chinner
                   ` (4 preceding siblings ...)
  2015-10-01  7:46 ` [PATCH 5/7] xfs: Don't use unwritten extents for DAX Dave Chinner
@ 2015-10-01  7:46 ` Dave Chinner
  2015-10-01  7:46 ` [PATCH 7/7] xfs: add ->pfn_mkwrite support for DAX Dave Chinner
  2015-10-01 20:31 ` [PATCH 0/7] xfs, dax: fix the page fault/allocation mess Ross Zwisler
  7 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2015-10-01  7:46 UTC (permalink / raw)
  To: xfs
  Cc: linux-fsdevel, ross.zwisler, willy, dan.j.williams,
	kirill.shutemov, linux-nvdimm, jack, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

For DAX, we now do block zeroing and file size updates during
allocation. This means we no longer need an IO completion callback
to do these things, so remove the completion callbacks from the
__dax_fault and __dax_mkwrite calls.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c | 39 ---------------------------------------
 fs/xfs/xfs_aops.h |  1 -
 fs/xfs/xfs_file.c |  5 ++---
 3 files changed, 2 insertions(+), 43 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index df3dabd..69c2dbc 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1657,45 +1657,6 @@ xfs_end_io_direct_write(
 	__xfs_end_io_direct_write(inode, ioend, offset, size);
 }
 
-/*
- * For DAX we need a mapping buffer callback for unwritten extent conversion
- * when page faults allocate blocks and then zero them. Note that in this
- * case the mapping indicated by the ioend may extend beyond EOF. We most
- * definitely do not want to extend EOF here, so we trim back the ioend size to
- * EOF.
- */
-#ifdef CONFIG_FS_DAX
-void
-xfs_end_io_dax_write(
-	struct buffer_head	*bh,
-	int			uptodate)
-{
-	struct xfs_ioend	*ioend = bh->b_private;
-	struct inode		*inode = ioend->io_inode;
-	ssize_t			size = ioend->io_size;
-
-	ASSERT(IS_DAX(ioend->io_inode));
-
-	/* if there was an error zeroing, then don't convert it */
-	if (!uptodate)
-		ioend->io_error = -EIO;
-
-	/*
-	 * Trim update to EOF, so we don't extend EOF during unwritten extent
-	 * conversion of partial EOF blocks.
-	 */
-	spin_lock(&XFS_I(inode)->i_flags_lock);
-	if (ioend->io_offset + size > i_size_read(inode))
-		size = i_size_read(inode) - ioend->io_offset;
-	spin_unlock(&XFS_I(inode)->i_flags_lock);
-
-	__xfs_end_io_direct_write(inode, ioend, ioend->io_offset, size);
-
-}
-#else
-void xfs_end_io_dax_write(struct buffer_head *bh, int uptodate) { }
-#endif
-
 static inline ssize_t
 xfs_vm_do_dio(
 	struct inode		*inode,
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index d39ba25..f6ffc9a 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -60,7 +60,6 @@ int	xfs_get_blocks_direct(struct inode *inode, sector_t offset,
 			      struct buffer_head *map_bh, int create);
 int	xfs_get_blocks_dax_fault(struct inode *inode, sector_t offset,
 			         struct buffer_head *map_bh, int create);
-void	xfs_end_io_dax_write(struct buffer_head *bh, int uptodate);
 
 extern void xfs_count_page_state(struct page *, int *, int *);
 
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 27abe1c..9c8eef7 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1503,8 +1503,7 @@ xfs_filemap_page_mkwrite(
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
 	if (IS_DAX(inode)) {
-		ret = __dax_mkwrite(vma, vmf, xfs_get_blocks_dax_fault,
-				    xfs_end_io_dax_write);
+		ret = __dax_mkwrite(vma, vmf, xfs_get_blocks_dax_fault, NULL);
 	} else {
 		ret = __block_page_mkwrite(vma, vmf, xfs_get_blocks);
 		ret = block_page_mkwrite_return(ret);
@@ -1566,7 +1565,7 @@ xfs_filemap_pmd_fault(
 	file_update_time(vma->vm_file);
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 	ret = __dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault,
-				    xfs_end_io_dax_write);
+			      NULL);
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 	sb_end_pagefault(inode->i_sb);
 
-- 
2.5.0


* [PATCH 7/7] xfs: add ->pfn_mkwrite support for DAX
  2015-10-01  7:46 [PATCH 0/7] xfs, dax: fix the page fault/allocation mess Dave Chinner
                   ` (5 preceding siblings ...)
  2015-10-01  7:46 ` [PATCH 6/7] xfs: DAX does not use IO completion callbacks Dave Chinner
@ 2015-10-01  7:46 ` Dave Chinner
  2015-10-01 20:31 ` [PATCH 0/7] xfs, dax: fix the page fault/allocation mess Ross Zwisler
  7 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2015-10-01  7:46 UTC (permalink / raw)
  To: xfs
  Cc: jack, linux-nvdimm, linux-kernel, linux-fsdevel, willy,
	ross.zwisler, dan.j.williams, kirill.shutemov

From: Dave Chinner <dchinner@redhat.com>

->pfn_mkwrite support is needed so that when a page with allocated
backing store takes a write fault we can check that the fault has
not raced with a truncate that would leave it pointing at a region
beyond the current end of file.

This also allows us to update the timestamp on the inode, too, which
fixes a generic/080 failure.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_file.c  | 35 +++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_trace.h |  1 +
 2 files changed, 36 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 9c8eef7..f429662 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1572,11 +1572,46 @@ xfs_filemap_pmd_fault(
 	return ret;
 }
 
+/*
+ * pfn_mkwrite was originally intended to ensure we capture time stamp
+ * updates on write faults. In reality, it's needed to serialise against
+ * truncate, similar to page_mkwrite. Hence we open-code dax_pfn_mkwrite()
+ * here and cycle XFS_MMAPLOCK_SHARED to ensure the fault serialisation
+ * barrier is in place.
+ */
+static int
+xfs_filemap_pfn_mkwrite(
+	struct vm_area_struct	*vma,
+	struct vm_fault		*vmf)
+{
+
+	struct inode		*inode = file_inode(vma->vm_file);
+	struct xfs_inode	*ip = XFS_I(inode);
+	int			ret = VM_FAULT_NOPAGE;
+	loff_t			size;
+
+	trace_xfs_filemap_pfn_mkwrite(ip);
+
+	sb_start_pagefault(inode->i_sb);
+	file_update_time(vma->vm_file);
+
+	/* check if the faulting page hasn't raced with truncate */
+	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (vmf->pgoff >= size)
+		ret = VM_FAULT_SIGBUS;
+	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
+	sb_end_pagefault(inode->i_sb);
+	return ret;
+
+}
+
 static const struct vm_operations_struct xfs_file_vm_ops = {
 	.fault		= xfs_filemap_fault,
 	.pmd_fault	= xfs_filemap_pmd_fault,
 	.map_pages	= filemap_map_pages,
 	.page_mkwrite	= xfs_filemap_page_mkwrite,
+	.pfn_mkwrite	= xfs_filemap_pfn_mkwrite,
 };
 
 STATIC int
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 5ed36b1..c53beda 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -689,6 +689,7 @@ DEFINE_INODE_EVENT(xfs_inode_free_eofblocks_invalid);
 DEFINE_INODE_EVENT(xfs_filemap_fault);
 DEFINE_INODE_EVENT(xfs_filemap_pmd_fault);
 DEFINE_INODE_EVENT(xfs_filemap_page_mkwrite);
+DEFINE_INODE_EVENT(xfs_filemap_pfn_mkwrite);
 
 DECLARE_EVENT_CLASS(xfs_iref_class,
 	TP_PROTO(struct xfs_inode *ip, unsigned long caller_ip),
-- 
2.5.0



* Re: [PATCH 1/7] Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX"
  2015-10-01  7:46 ` [PATCH 1/7] Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX" Dave Chinner
@ 2015-10-01  8:35   ` kbuild test robot
  2015-10-01 20:27   ` Ross Zwisler
  1 sibling, 0 replies; 16+ messages in thread
From: kbuild test robot @ 2015-10-01  8:35 UTC (permalink / raw)
  To: Dave Chinner
  Cc: kbuild-all, xfs, linux-fsdevel, ross.zwisler, willy,
	dan.j.williams, kirill.shutemov, linux-nvdimm, jack,
	linux-kernel

Hi Dave,

[auto build test results on v4.3-rc3 -- if it's inappropriate base, please ignore]

reproduce:
        # apt-get install sparse
        make ARCH=x86_64 allmodconfig
        make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

>> fs/dax.c:626:50: sparse: incorrect type in argument 1 (different address spaces)
   fs/dax.c:626:50:    expected void *page
   fs/dax.c:626:50:    got void [noderef] <asn:5>*

vim +626 fs/dax.c

   610			result = VM_FAULT_NOPAGE;
   611			spin_unlock(ptl);
   612		} else {
   613			sector = bh.b_blocknr << (blkbits - 9);
   614			length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn,
   615							bh.b_size);
   616			if (length < 0) {
   617				result = VM_FAULT_SIGBUS;
   618				goto out;
   619			}
   620			if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
   621				goto fallback;
   622	
   623			if (buffer_unwritten(&bh) || buffer_new(&bh)) {
   624				int i;
   625				for (i = 0; i < PTRS_PER_PMD; i++)
 > 626					clear_page(kaddr + i * PAGE_SIZE);
   627				count_vm_event(PGMAJFAULT);
   628				mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
   629				result |= VM_FAULT_MAJOR;
   630			}
   631	
   632			result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write);
   633		}
   634	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation


* Re: [PATCH 1/7] Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX"
  2015-10-01  7:46 ` [PATCH 1/7] Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX" Dave Chinner
  2015-10-01  8:35   ` kbuild test robot
@ 2015-10-01 20:27   ` Ross Zwisler
  2015-10-01 22:14     ` Williams, Dan J
  2015-10-01 22:32     ` Dave Chinner
  1 sibling, 2 replies; 16+ messages in thread
From: Ross Zwisler @ 2015-10-01 20:27 UTC (permalink / raw)
  To: Dave Chinner
  Cc: xfs, linux-fsdevel, ross.zwisler, willy, dan.j.williams,
	kirill.shutemov, linux-nvdimm, jack, linux-kernel

On Thu, Oct 01, 2015 at 05:46:33PM +1000, Dave Chinner wrote:
> This reverts commit 46c043ede4711e8d598b9d63c5616c1fedb0605e.
> ---
>  fs/dax.c    | 36 ++++++++++++++++--------------------
>  mm/memory.c | 11 +++++++++--
>  2 files changed, 25 insertions(+), 22 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 7ae6df7..400fe95 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -569,26 +569,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  	if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE)
>  		goto fallback;
>  
> -	if (buffer_unwritten(&bh) || buffer_new(&bh)) {
> -		int i;
> -		for (i = 0; i < PTRS_PER_PMD; i++)
> -			clear_pmem(kaddr + i * PAGE_SIZE, PAGE_SIZE);
> -		wmb_pmem();

The above two lines were updated to use the PMEM API with this commit:

commit d77e92e270ed ("dax: update PMD fault handler with PMEM API")

but they aren't updated in the reverted version here: 

> @@ -633,6 +620,15 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
>  			goto fallback;
>  
> +		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
> +			int i;
> +			for (i = 0; i < PTRS_PER_PMD; i++)
> +				clear_page(kaddr + i * PAGE_SIZE);
> +			count_vm_event(PGMAJFAULT);
> +			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
> +			result |= VM_FAULT_MAJOR;
> +		}
> +
>  		result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write);
>  	}

This is the source of the follow-up sparse warning from the kbuild robot.


Also, if I understood your previous mails correctly, you were targeting the
first two revert patches for v4.3 so we get back to v4.2 level locking, and
the rest of the series will target v4.4, correct?  How does this work?  Do the
patches need to be split into two series and tested separately?


* Re: [PATCH 0/7] xfs, dax: fix the page fault/allocation mess
  2015-10-01  7:46 [PATCH 0/7] xfs, dax: fix the page fault/allocation mess Dave Chinner
                   ` (6 preceding siblings ...)
  2015-10-01  7:46 ` [PATCH 7/7] xfs: add ->pfn_mkwrite support for DAX Dave Chinner
@ 2015-10-01 20:31 ` Ross Zwisler
  2015-10-01 22:54   ` Dave Chinner
  7 siblings, 1 reply; 16+ messages in thread
From: Ross Zwisler @ 2015-10-01 20:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: xfs, linux-fsdevel, ross.zwisler, willy, dan.j.williams,
	kirill.shutemov, linux-nvdimm, jack, linux-kernel

On Thu, Oct 01, 2015 at 05:46:32PM +1000, Dave Chinner wrote:
> Hi folks,
> 
> As discussed in the recent thread about problems with DAX locking:
> 
> http://www.gossamer-threads.com/lists/linux/kernel/2264090?do=post_view_threaded
> 
[...]

Thank you for working on this, and for documenting your thinking so clearly.

One thing I noticed is that in my test setup XFS+DAX is now failing
generic/274:

	# diff -u tests/generic/274.out /root/xfstests/results//generic/274.out.bad
	--- tests/generic/274.out	2015-08-24 11:05:41.490926305 -0600
	+++ /root/xfstests/results//generic/274.out.bad	2015-10-01 13:53:50.498354091 -0600
	@@ -2,4 +2,5 @@
	 ------------------------------
	 preallocation test
	 ------------------------------
	-done
	+failed to write to test file
	+(see /root/xfstests/results//generic/274.full for details)

I've verified that the test passes 100% of the time with my baseline
(v4.3-rc3), and with the set applied but without the DAX mount option.  With
the series and with DAX it fails 100% of the time.  I haven't looked into the
details of the failure yet, I just wanted to let you know that it was
happening.


* Re: [PATCH 1/7] Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX"
  2015-10-01 20:27   ` Ross Zwisler
@ 2015-10-01 22:14     ` Williams, Dan J
  2015-10-01 22:45       ` Ross Zwisler
  2015-10-01 22:32     ` Dave Chinner
  1 sibling, 1 reply; 16+ messages in thread
From: Williams, Dan J @ 2015-10-01 22:14 UTC (permalink / raw)
  To: ross.zwisler
  Cc: jack, linux-nvdimm, dave.hansen, linux-kernel, xfs,
	linux-fsdevel, willy, kirill.shutemov

On Thu, 2015-10-01 at 14:27 -0600, Ross Zwisler wrote:
> On Thu, Oct 01, 2015 at 05:46:33PM +1000, Dave Chinner wrote:
> > This reverts commit 46c043ede4711e8d598b9d63c5616c1fedb0605e.
> > ---
> >  fs/dax.c    | 36 ++++++++++++++++--------------------
> >  mm/memory.c | 11 +++++++++--
> >  2 files changed, 25 insertions(+), 22 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 7ae6df7..400fe95 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -569,26 +569,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >  	if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE)
> >  		goto fallback;
> >  
> > -	if (buffer_unwritten(&bh) || buffer_new(&bh)) {
> > -		int i;
> > -		for (i = 0; i < PTRS_PER_PMD; i++)
> > -			clear_pmem(kaddr + i * PAGE_SIZE, PAGE_SIZE);
> > -		wmb_pmem();
> 
> The above two lines were updated to use the PMEM API with this commit:
> 
> commit d77e92e270ed ("dax: update PMD fault handler with PMEM API")
> 
> but they aren't updated in the reverted version here: 
> 
> > @@ -633,6 +620,15 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >  		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
> >  			goto fallback;
> >  
> > +		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
> > +			int i;
> > +			for (i = 0; i < PTRS_PER_PMD; i++)
> > +				clear_page(kaddr + i * PAGE_SIZE);
> > +			count_vm_event(PGMAJFAULT);
> > +			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
> > +			result |= VM_FAULT_MAJOR;
> > +		}
> > +
> >  		result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write);
> >  	}
> 
> This is the source of the follow-up sparse warning from the kbuild robot.
> 

To that end, Dave Hansen had also noticed that PTRS_PER_PMD should not be
used in this context.  Here's an incremental cleanup:

8<---
Subject: pmem, dax: clean up clear_pmem()

From: Dan Williams <dan.j.williams@intel.com>

Both __dax_pmd_fault() and clear_pmem() were taking special steps to
clear memory a page at a time to take advantage of non-temporal
clear_page() implementations.  However, x86_64 does not use
non-temporal instructions for clear_page(), and arch_clear_pmem() was
always incurring the cost of __arch_wb_cache_pmem().

Clean up the assumption that doing clear_pmem() a page at a time is more
performant.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/include/asm/pmem.h |    7 +------
 fs/dax.c                    |    4 +---
 2 files changed, 2 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index d8ce3ec816ab..1544fabcd7f9 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -132,12 +132,7 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
 {
 	void *vaddr = (void __force *)addr;
 
-	/* TODO: implement the zeroing via non-temporal writes */
-	if (size == PAGE_SIZE && ((unsigned long)vaddr & ~PAGE_MASK) == 0)
-		clear_page(vaddr);
-	else
-		memset(vaddr, 0, size);
-
+	memset(vaddr, 0, size);
 	__arch_wb_cache_pmem(vaddr, size);
 }
 
diff --git a/fs/dax.c b/fs/dax.c
index b36d6d2e7f87..3faff9227135 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -625,9 +625,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 			goto fallback;
 
 		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
-			int i;
-			for (i = 0; i < PTRS_PER_PMD; i++)
-				clear_page(kaddr + i * PAGE_SIZE);
+			clear_pmem(kaddr, HPAGE_SIZE);
 			count_vm_event(PGMAJFAULT);
 			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
 			result |= VM_FAULT_MAJOR;


* Re: [PATCH 1/7] Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX"
  2015-10-01 20:27   ` Ross Zwisler
  2015-10-01 22:14     ` Williams, Dan J
@ 2015-10-01 22:32     ` Dave Chinner
  2015-10-01 22:47       ` Ross Zwisler
  1 sibling, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2015-10-01 22:32 UTC (permalink / raw)
  To: Ross Zwisler, xfs, linux-fsdevel, willy, dan.j.williams,
	kirill.shutemov, linux-nvdimm, jack, linux-kernel

On Thu, Oct 01, 2015 at 02:27:29PM -0600, Ross Zwisler wrote:
> On Thu, Oct 01, 2015 at 05:46:33PM +1000, Dave Chinner wrote:
> > This reverts commit 46c043ede4711e8d598b9d63c5616c1fedb0605e.
> > ---
> >  fs/dax.c    | 36 ++++++++++++++++--------------------
> >  mm/memory.c | 11 +++++++++--
> >  2 files changed, 25 insertions(+), 22 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 7ae6df7..400fe95 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -569,26 +569,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >  	if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE)
> >  		goto fallback;
> >  
> > -	if (buffer_unwritten(&bh) || buffer_new(&bh)) {
> > -		int i;
> > -		for (i = 0; i < PTRS_PER_PMD; i++)
> > -			clear_pmem(kaddr + i * PAGE_SIZE, PAGE_SIZE);
> > -		wmb_pmem();
> 
> The above two lines were updated to use the PMEM API with this commit:
> 
> commit d77e92e270ed ("dax: update PMD fault handler with PMEM API")
> 
> but they aren't updated in the reverted version here: 
> 
> > @@ -633,6 +620,15 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >  		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
> >  			goto fallback;
> >  
> > +		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
> > +			int i;
> > +			for (i = 0; i < PTRS_PER_PMD; i++)
> > +				clear_page(kaddr + i * PAGE_SIZE);
> > +			count_vm_event(PGMAJFAULT);
> > +			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
> > +			result |= VM_FAULT_MAJOR;
> > +		}
> > +
> >  		result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write);
> >  	}
> 
> This is the source of the follow-up sparse warning from the kbuild robot.

I couldn't work out what set of commits I needed to revert to get a
clean revert, so I just reverted the commits and hacked out the
revert failures to what looked ok. Feel free to send me a clean set
of reverts, and I'll replace these patches with them... :)

> Also, if I understood your previous mails correctly you were targeting the
> first two revert patches for v4.3 so we get back to v4.2 level locking, and
> the rest of the series will target v4.4, correct?  How does this work?  Do the
> patches need to be split into two series and tested separately?

Test it and push the reverts however you like. I don't care how the
reverts get to 4.3 - I'll be carrying them locally in my trees from
now on, so my development and testing is unaffected by the bugs
that are in the 4.3 code. If you aren't going to push them for 4.3,
then I'd suggest that they go to Linus along with the rest of the
XFS changes in this series.

FWIW, I'm quite happy to host all the pending DAX changes in a
public git tree and ask for it to be included in linux-next. It's
probably a good idea to do this because it makes it much easier to
co-ordinate merges when we are touching multiple subsystems (ext4,
xfs, dax, mm, etc). And it will help prevent the "patches molder on
the list until Andrew hoovers them up" problem and so prevent this
situation from happening in the future...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/7] Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX"
  2015-10-01 22:14     ` Williams, Dan J
@ 2015-10-01 22:45       ` Ross Zwisler
  0 siblings, 0 replies; 16+ messages in thread
From: Ross Zwisler @ 2015-10-01 22:45 UTC (permalink / raw)
  To: Williams, Dan J
  Cc: jack, linux-nvdimm, dave.hansen, linux-kernel, xfs,
	linux-fsdevel, willy, ross.zwisler, kirill.shutemov

On Thu, Oct 01, 2015 at 10:14:22PM +0000, Williams, Dan J wrote:
> Subject: pmem, dax: clean up clear_pmem()
> 
> From: Dan Williams <dan.j.williams@intel.com>
> 
> Both __dax_pmd_fault() and clear_pmem() were taking special steps to
> clear memory a page at a time to take advantage of non-temporal
> clear_page() implementations.  However, x86_64 does not use
> non-temporal instructions for clear_page(), and arch_clear_pmem() was
> always incurring the cost of __arch_wb_cache_pmem().
> 
> Clean up the assumption that doing clear_pmem() a page at a time is more
> performant.
> 
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  arch/x86/include/asm/pmem.h |    7 +------
>  fs/dax.c                    |    4 +---
>  2 files changed, 2 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
> index d8ce3ec816ab..1544fabcd7f9 100644
> --- a/arch/x86/include/asm/pmem.h
> +++ b/arch/x86/include/asm/pmem.h
> @@ -132,12 +132,7 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
>  {
>  	void *vaddr = (void __force *)addr;
>  
> -	/* TODO: implement the zeroing via non-temporal writes */
> -	if (size == PAGE_SIZE && ((unsigned long)vaddr & ~PAGE_MASK) == 0)
> -		clear_page(vaddr);
> -	else
> -		memset(vaddr, 0, size);
> -
> +	memset(vaddr, 0, size);
>  	__arch_wb_cache_pmem(vaddr, size);
>  }
>  
> diff --git a/fs/dax.c b/fs/dax.c
> index b36d6d2e7f87..3faff9227135 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -625,9 +625,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  			goto fallback;
>  
>  		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
> -			int i;
> -			for (i = 0; i < PTRS_PER_PMD; i++)
> -				clear_page(kaddr + i * PAGE_SIZE);
> +			clear_pmem(kaddr, HPAGE_SIZE);
>  			count_vm_event(PGMAJFAULT);
>  			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
>  			result |= VM_FAULT_MAJOR;
> 

This clear_pmem() needs a wmb_pmem() after it.  I'll make a quick series with
the clean revert and this guy at the end and try and get them in v4.3 - sound
good?


* Re: [PATCH 1/7] Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX"
  2015-10-01 22:32     ` Dave Chinner
@ 2015-10-01 22:47       ` Ross Zwisler
  0 siblings, 0 replies; 16+ messages in thread
From: Ross Zwisler @ 2015-10-01 22:47 UTC (permalink / raw)
  To: Dave Chinner
  Cc: jack, linux-nvdimm, linux-kernel, xfs, linux-fsdevel, willy,
	Ross Zwisler, dan.j.williams, kirill.shutemov

On Fri, Oct 02, 2015 at 08:32:40AM +1000, Dave Chinner wrote:
> I couldn't work out what set of commits I needed to revert to get a
> clean revert, so I just reverted the commits and hacked out the
> revert failures to what looked ok. Feel free to send me a clean set
> of reverts, and I'll replace these patches with them... :)

Will do.  I will queue the reverts in my external tree & ask Linus to pull
them into v4.3 so we don't ship with deadlocks.

> > Also, if I understood your previous mails correctly you were targeting the
> > first two revert patches for v4.3 so we get back to v4.2 level locking, and
> > the rest of the series will target v4.4, correct?  How does this work?  Do the
> > patches need to be split into two series and tested separately?
> 
> Test it and push the reverts however you like. I don't care how the
> reverts get to 4.3 - I'll be carrying them locally in my trees from
> now on, so my development and testing is unaffected by the bugs
> that are in the 4.3 code. If you aren't going to push them for 4.3,
> then I'd suggest that they go to Linus along with the rest of the
> XFS changes in this series.
> 
> FWIW, I'm quite happy to host all the pending DAX changes in a
> public git tree and ask for it to be included in linux-next. It's
> probably a good idea to do this because it makes it much easier to
> co-ordinate merges when we are touching multiple subsystems (ext4,
> xfs, dax, mm, etc). And it will help prevent the "patches molder on
> the list until Andrew hoovers them up" problem and so prevent this
> situation from happening in the future...

No objections from me. :)  I agree that it would be nice to have a central
home for all the DAX patches.


* Re: [PATCH 0/7] xfs, dax: fix the page fault/allocation mess
  2015-10-01 20:31 ` [PATCH 0/7] xfs, dax: fix the page fault/allocation mess Ross Zwisler
@ 2015-10-01 22:54   ` Dave Chinner
  0 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2015-10-01 22:54 UTC (permalink / raw)
  To: Ross Zwisler, xfs, linux-fsdevel, willy, dan.j.williams,
	kirill.shutemov, linux-nvdimm, jack, linux-kernel

On Thu, Oct 01, 2015 at 02:31:21PM -0600, Ross Zwisler wrote:
> On Thu, Oct 01, 2015 at 05:46:32PM +1000, Dave Chinner wrote:
> > Hi folks,
> > 
> > As discussed in the recent thread about problems with DAX locking:
> > 
> > http://www.gossamer-threads.com/lists/linux/kernel/2264090?do=post_view_threaded
> > 
> > I said that I'd post the patch set that fixed the problems for XFS
> > as soon as I had something sane and workable. That's what this
> > series is.
> > 
> > To start with, it passes xfstests "auto" group with the only
> > failures being expected failures or failures due to unexpected
> > allocation patterns or trying to use unsupported block sizes. That
> > makes it better than any previous version of the XFS/DAX code.
.....

> Thank you for working on this, and for documenting your thinking so clearly.

To put this in perspective, "patch 0" descriptions like this are a
requirement for any non-trivial XFS modification. They save reviewers
so much time and so many round trips in email and IRC to understand
the changes being proposed that it's a no-brainer.

Lead by example, and all that...

> One thing I noticed is that in my test setup XFS+DAX is now failing
> generic/274:
> 
> 	# diff -u tests/generic/274.out /root/xfstests/results//generic/274.out.bad
> 	--- tests/generic/274.out	2015-08-24 11:05:41.490926305 -0600
> 	+++ /root/xfstests/results//generic/274.out.bad	2015-10-01 13:53:50.498354091 -0600
> 	@@ -2,4 +2,5 @@
> 	 ------------------------------
> 	 preallocation test
> 	 ------------------------------
> 	-done
> 	+failed to write to test file
> 	+(see /root/xfstests/results//generic/274.full for details)
> 
> I've verified that the test passes 100% of the time with my baseline
> (v4.3-rc3), and with the set applied but without the DAX mount option.  With
> the series and with DAX it fails 100% of the time.  I haven't looked into the
> details of the failure yet, I just wanted to let you know that it was
> happening.

See above - I classified this under the "failures due to unexpected
allocation patterns". This is an ENOSPC test, and we've changed the
allocation pattern and the unwritten extent conversion algorithm and
so changed the metadata allocation demand of the test.

I haven't looked any further than this yet, but I suspect the issue
is that the up-front unwritten extent conversion is not being
allowed to dip into the reserve block pool for BMBT allocations when
the extent list grows past a single block. If that's the case, then
it's a couple of lines of code to conditionally add XFS_TRANS_RESERVE
to the transaction handle to allow it access to the reserve pool...
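[ A sketch of the kind of change meant here, modelled on the 4.3-era
  xfs_iomap_write_unwritten() transaction setup - hypothetical and
  untested, with resblks standing in for the usual write reservation: ]

	tp = xfs_trans_alloc(mp, XFS_TRANS_STRAT_WRITE);
	/* let BMBT block allocations dip into the reserve pool at ENOSPC */
	tp->t_flags |= XFS_TRANS_RESERVE;
	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_write, resblks, 0);
	if (error) {
		xfs_trans_cancel(tp);
		return error;
	}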

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


end of thread, other threads:[~2015-10-01 22:54 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-01  7:46 [PATCH 0/7] xfs, dax: fix the page fault/allocation mess Dave Chinner
2015-10-01  7:46 ` [PATCH 1/7] Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX" Dave Chinner
2015-10-01  8:35   ` kbuild test robot
2015-10-01 20:27   ` Ross Zwisler
2015-10-01 22:14     ` Williams, Dan J
2015-10-01 22:45       ` Ross Zwisler
2015-10-01 22:32     ` Dave Chinner
2015-10-01 22:47       ` Ross Zwisler
2015-10-01  7:46 ` [PATCH 2/7] Revert "dax: fix race between simultaneous faults" Dave Chinner
2015-10-01  7:46 ` [PATCH 3/7] xfs: fix inode size update overflow in xfs_map_direct() Dave Chinner
2015-10-01  7:46 ` [PATCH 4/7] xfs: introduce BMAPI_ZERO for allocating zeroed extents Dave Chinner
2015-10-01  7:46 ` [PATCH 5/7] xfs: Don't use unwritten extents for DAX Dave Chinner
2015-10-01  7:46 ` [PATCH 6/7] xfs: DAX does not use IO completion callbacks Dave Chinner
2015-10-01  7:46 ` [PATCH 7/7] xfs: add ->pfn_mkwrite support for DAX Dave Chinner
2015-10-01 20:31 ` [PATCH 0/7] xfs, dax: fix the page fault/allocation mess Ross Zwisler
2015-10-01 22:54   ` Dave Chinner
