* [PATCH 0/5] xfs: fix direct IO completion issues
@ 2015-04-10 13:37 Dave Chinner
  2015-04-10 13:37 ` [PATCH 1/5] xfs: DIO requires an ioend for writes Dave Chinner
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Dave Chinner @ 2015-04-10 13:37 UTC (permalink / raw)
  To: xfs

Hi folks,

This patchset addresses the deeper problems Brian outlined in the
description of this patch:

http://oss.sgi.com/archives/xfs/2015-04/msg00071.html

The basic issue is that DIO completion can run in interrupt context,
and it does things there that it should not do in interrupt context
because Bad Things Will Happen.

Patches 1 and 2 convert the DIO write completion code to use an
ioend and to never run in interrupt context when a transaction or
EOF update may need to be run.

Patches 3 and 4 of this series are effectively the same as the patch
Brian sent, but by checking the ioend status in the completion
routine before taking the spinlock, we can guarantee we never take
the spinlock in interrupt context. Hence we don't need irq-safe
spinlocks and so can re-use an existing innermost spinlock for
serialisation here.

The final patch is a removal of redundant operations - most of
generic_file_direct_write() is already being done in the XFS code,
so most of the generic function is redundant and unnecessary
overhead. Hence it moves the bits that we need into the XFS code
path, and we stop calling the generic code altogether.

This passes xfstests and everything I've thrown at it. There are a
couple of small cleanups that I think still need to be done, but
they are minor. e.g. xfs_end_io_dio_write() can now probably call
xfs_finish_endio_sync() directly, rather than open coding the
calls it makes after updating the incore inode size....

Comments, thoughts?

-Dave

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* [PATCH 1/5] xfs: DIO requires an ioend for writes
  2015-04-10 13:37 [PATCH 0/5] xfs: fix direct IO completion issues Dave Chinner
@ 2015-04-10 13:37 ` Dave Chinner
  2015-04-10 20:21   ` Brian Foster
  2015-04-10 13:37 ` [PATCH 2/5] xfs: direct IO needs to use append ioends Dave Chinner
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2015-04-10 13:37 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Right now unwritten extent conversion information is passed by
making the end IO private data non-null. This does not enable us to
pass any information from submission context to completion context,
which we need in order to use the standard IO completion paths.

Allocate an ioend in block allocation for direct IO and attach it to
the mapping buffer used during direct IO block allocation. Factor
the mapping code to make it obvious that this is happening only for
direct IO writes, and place the mapping info and IO type
directly into the ioend for use in completion context.

The completion is changed to check the ioend type to determine if
unwritten extent completion is necessary or not.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c | 79 ++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 61 insertions(+), 18 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 3a9b7a1..d95a42b 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1233,6 +1233,57 @@ xfs_vm_releasepage(
 	return try_to_free_buffers(page);
 }
 
+static void
+xfs_get_blocks_map_buffer(
+	struct inode		*inode,
+	struct buffer_head	*bh_result,
+	int			create,
+	int			direct,
+	struct xfs_bmbt_irec	*imap,
+	xfs_off_t		offset,
+	ssize_t			size)
+{
+	struct xfs_ioend	*ioend;
+	int			type;
+
+	if (!create) {
+		/*
+		 * Unwritten extents do not report a disk address for
+		 * the read case (treat as if we're reading into a hole).
+		 */
+		if (!ISUNWRITTEN(imap))
+			xfs_map_buffer(inode, bh_result, imap, offset);
+		return;
+	}
+
+	xfs_map_buffer(inode, bh_result, imap, offset);
+
+	if (ISUNWRITTEN(imap))
+		set_buffer_unwritten(bh_result);
+
+	if (!direct)
+		return;
+
+	/*
+	 * Direct IO writes require an ioend to be allocated and
+	 * passed via the returned mapping. This allows the end
+	 * io function to determine the correct course of
+	 * action.
+	 */
+
+	if (ISUNWRITTEN(imap)) {
+		type = XFS_IO_UNWRITTEN;
+		set_buffer_defer_completion(bh_result);
+	} else
+		type = XFS_IO_OVERWRITE;
+	ioend = xfs_alloc_ioend(inode, type);
+	ioend->io_offset = offset;
+	ioend->io_size = size;
+	bh_result->b_private = ioend;
+
+	return;
+}
+
 STATIC int
 __xfs_get_blocks(
 	struct inode		*inode,
@@ -1252,6 +1303,7 @@ __xfs_get_blocks(
 	ssize_t			size;
 	int			new = 0;
 
+
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
 
@@ -1332,21 +1384,9 @@ __xfs_get_blocks(
 	}
 
 	if (imap.br_startblock != HOLESTARTBLOCK &&
-	    imap.br_startblock != DELAYSTARTBLOCK) {
-		/*
-		 * For unwritten extents do not report a disk address on
-		 * the read case (treat as if we're reading into a hole).
-		 */
-		if (create || !ISUNWRITTEN(&imap))
-			xfs_map_buffer(inode, bh_result, &imap, offset);
-		if (create && ISUNWRITTEN(&imap)) {
-			if (direct) {
-				bh_result->b_private = inode;
-				set_buffer_defer_completion(bh_result);
-			}
-			set_buffer_unwritten(bh_result);
-		}
-	}
+	    imap.br_startblock != DELAYSTARTBLOCK)
+		xfs_get_blocks_map_buffer(inode, bh_result, create, direct,
+					  &imap, offset, size);
 
 	/*
 	 * If this is a realtime file, data may be on a different device.
@@ -1455,9 +1495,10 @@ xfs_end_io_direct_write(
 	struct inode		*inode = file_inode(iocb->ki_filp);
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_ioend	*ioend = private;
 
 	if (XFS_FORCED_SHUTDOWN(mp))
-		return;
+		goto out_destroy_ioend;
 
 	/*
 	 * While the generic direct I/O code updates the inode size, it does
@@ -1477,7 +1518,7 @@ xfs_end_io_direct_write(
 	 * we can pass the ioend to the direct IO allocation callbacks and
 	 * avoid nesting that way.
 	 */
-	if (private && size > 0) {
+	if (ioend->io_type == XFS_IO_UNWRITTEN && size > 0) {
 		xfs_iomap_write_unwritten(ip, offset, size);
 	} else if (offset + size > ip->i_d.di_size) {
 		struct xfs_trans	*tp;
@@ -1487,11 +1528,13 @@ xfs_end_io_direct_write(
 		error = xfs_trans_reserve(tp, &M_RES(mp)->tr_fsyncts, 0, 0);
 		if (error) {
 			xfs_trans_cancel(tp, 0);
-			return;
+			goto out_destroy_ioend;
 		}
 
 		xfs_setfilesize(ip, tp, offset, size);
 	}
+out_destroy_ioend:
+	xfs_destroy_ioend(ioend);
 }
 
 STATIC ssize_t
-- 
2.0.0



* [PATCH 2/5] xfs: direct IO needs to use append ioends
  2015-04-10 13:37 [PATCH 0/5] xfs: fix direct IO completion issues Dave Chinner
  2015-04-10 13:37 ` [PATCH 1/5] xfs: DIO requires an ioend for writes Dave Chinner
@ 2015-04-10 13:37 ` Dave Chinner
  2015-04-10 20:21   ` Brian Foster
  2015-04-11 21:15   ` Brian Foster
  2015-04-10 13:37 ` [PATCH 3/5] xfs: DIO write completion size updates race Dave Chinner
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 19+ messages in thread
From: Dave Chinner @ 2015-04-10 13:37 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Now that we have an ioend being passed unconditionally to the direct
IO write completion context, we can pass a preallocated transaction
handle for the on-disk inode size updates that are run in completion.

At this point we really need to be passing the correct block range
that the IO spans through the ioend, so calculate the last block in
the mapping before we map the allocated range and use that instead
of the size desired by the direct IO.

This enables us to keep track of multiple get-blocks calls in the
same direct IO - the ioend will keep coming back to us, and we can
keep extending its range as new allocations and mappings are done.

There are some new trace points added for debugging, and a small
hack to actually make the tracepoints work (enums in tracepoints
that use __print_symbolic don't work correctly) that should be fixed
in the 4.1 merge window. This hack can be removed when the
tracepoint fix is upstream.

There are lots of comments explaining the intricacies of passing the
ioend and append transaction in the code; they are better placed in
the code because we're going to need them to understand why this
code does what it does in a few years' time....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c  | 262 +++++++++++++++++++++++++++++++++++++----------------
 fs/xfs/xfs_trace.h |  10 +-
 2 files changed, 194 insertions(+), 78 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index d95a42b..52c7e46 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -178,6 +178,25 @@ xfs_setfilesize_ioend(
 	return xfs_setfilesize(ip, tp, ioend->io_offset, ioend->io_size);
 }
 
+STATIC void
+xfs_setfilesize_ioend_cancel(
+	struct xfs_ioend	*ioend)
+{
+	struct xfs_trans	*tp = ioend->io_append_trans;
+
+	/*
+	 * The transaction may have been allocated in the I/O submission thread,
+	 * thus we need to mark ourselves as being in a transaction manually.
+	 * Similarly for freeze protection.
+	 */
+	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	rwsem_acquire_read(&ioend->io_inode->i_sb->s_writers.lock_map[SB_FREEZE_FS-1],
+			   0, 1, _THIS_IP_);
+
+	xfs_trans_cancel(tp, 0);
+	ioend->io_append_trans = NULL;
+}
+
 /*
  * Schedule IO completion handling on the final put of an ioend.
  *
@@ -1233,18 +1252,18 @@ xfs_vm_releasepage(
 	return try_to_free_buffers(page);
 }
 
-static void
+static int
 xfs_get_blocks_map_buffer(
 	struct inode		*inode,
 	struct buffer_head	*bh_result,
 	int			create,
 	int			direct,
 	struct xfs_bmbt_irec	*imap,
-	xfs_off_t		offset,
-	ssize_t			size)
+	xfs_off_t		offset)
 {
 	struct xfs_ioend	*ioend;
 	int			type;
+	loff_t			size;
 
 	if (!create) {
 		/*
@@ -1253,7 +1272,7 @@ xfs_get_blocks_map_buffer(
 		 */
 		if (!ISUNWRITTEN(imap))
 			xfs_map_buffer(inode, bh_result, imap, offset);
-		return;
+		return 0;
 	}
 
 	xfs_map_buffer(inode, bh_result, imap, offset);
@@ -1262,26 +1281,93 @@ xfs_get_blocks_map_buffer(
 		set_buffer_unwritten(bh_result);
 
 	if (!direct)
-		return;
+		return 0;
 
 	/*
-	 * Direct IO writes require an ioend to be allocated and
-	 * passed via the returned mapping. This allows the end
-	 * io function to determine the correct course of
-	 * action.
+	 * Direct IO writes require an ioend to be allocated and passed via the
+	 * returned mapping. This allows the end io function to determine the
+	 * correct course of action.
+	 *
+	 * Unwritten extents will need transactions at completion, so are known
+	 * to need deferring to a workqueue. However, for writes into written
+	 * extents, we *may* need a transaction if this IO extends the on-disk
+	 * EOF. Because we can race with other IOs the file may already be
+	 * extended by the time we get to the transaction. IO completion already
+	 * handles that case so all we will have done is incurred the overhead
+	 * of workqueue deferral for completion. This is acceptable overhead for
+	 * the rare case that this occurs.
 	 */
-
 	if (ISUNWRITTEN(imap)) {
 		type = XFS_IO_UNWRITTEN;
 		set_buffer_defer_completion(bh_result);
 	} else
 		type = XFS_IO_OVERWRITE;
-	ioend = xfs_alloc_ioend(inode, type);
-	ioend->io_offset = offset;
-	ioend->io_size = size;
-	bh_result->b_private = ioend;
 
-	return;
+	/*
+	 * The offset that is passed in is the first block the DIO will fall
+	 * into. The size supplied by the DIO layer is what it thinks it needs
+	 * but the mapping may not span this entire range. Hence we use the
+	 * truncated mapping size that's already been stashed in the bh_result
+	 * to calculate the range covered by the ioend.
+	 */
+	size = bh_result->b_size;
+	trace_xfs_gbmap_direct(XFS_I(inode), offset, size, type, imap);
+
+	/*
+	 * If we get multiple mappings in a single IO, we might be mapping
+	 * different types. But because the direct IO can only have a single
+	 * private pointer, we need to ensure that:
+	 *
+	 * a) the ioend spans the entire region of the IO; and
+	 * b) if it contains unwritten extents, it is *permanently* marked as
+	 *    such and we cancel any append transaction attached to the ioend.
+	 *
+	 * We could do this by chaining ioends like buffered IO does, but
+	 * we only actually get one IO completion callback from the direct IO,
+	 * and that spans the entire IO regardless of how many mappings and IOs
+	 * are needed to complete the DIO. There is only going to be one
+	 * reference to the ioend and its life cycle is constrained by the
+	 * DIO completion code. Hence we don't need reference counting here.
+	 */
+	if (bh_result->b_private) {
+		ioend = bh_result->b_private;
+		ASSERT(ioend->io_size > 0);
+		ASSERT(offset >= ioend->io_offset);
+
+		if (offset + size > ioend->io_offset + ioend->io_size)
+			ioend->io_size = offset - ioend->io_offset + size;
+
+		if (type == XFS_IO_UNWRITTEN) {
+			if (ioend->io_append_trans)
+				xfs_setfilesize_ioend_cancel(ioend);
+			ioend->io_type = XFS_IO_UNWRITTEN;
+		}
+		trace_xfs_gbmap_direct_update(XFS_I(inode), ioend->io_offset,
+					      ioend->io_size, ioend->io_type,
+					      imap);
+	} else {
+		ioend = xfs_alloc_ioend(inode, type);
+		ioend->io_offset = offset;
+		ioend->io_size = size;
+		bh_result->b_private = ioend;
+		trace_xfs_gbmap_direct_new(XFS_I(inode), offset, size, type,
+					   imap);
+	}
+
+	/* check if we need an append transaction allocated. */
+	if (ioend->io_type == XFS_IO_OVERWRITE &&
+	    xfs_ioend_is_append(ioend) && !ioend->io_append_trans) {
+		int	error;
+
+		error = xfs_setfilesize_trans_alloc(ioend);
+		ASSERT(!error);
+		if (error) {
+			xfs_destroy_ioend(ioend);
+			return error;
+		}
+		set_buffer_defer_completion(bh_result);
+	}
+	return 0;
 }
 
 STATIC int
@@ -1374,50 +1460,19 @@ __xfs_get_blocks(
 			xfs_iunlock(ip, lockmode);
 		}
 
-		trace_xfs_get_blocks_alloc(ip, offset, size, 0, &imap);
+		trace_xfs_get_blocks_alloc(ip, offset, size,
+				ISUNWRITTEN(&imap) ? XFS_IO_UNWRITTEN
+						   : XFS_IO_DELALLOC, &imap);
 	} else if (nimaps) {
-		trace_xfs_get_blocks_found(ip, offset, size, 0, &imap);
+		trace_xfs_get_blocks_found(ip, offset, size,
+				ISUNWRITTEN(&imap) ? XFS_IO_UNWRITTEN
+						   : XFS_IO_OVERWRITE, &imap);
 		xfs_iunlock(ip, lockmode);
 	} else {
 		trace_xfs_get_blocks_notfound(ip, offset, size);
 		goto out_unlock;
 	}
 
-	if (imap.br_startblock != HOLESTARTBLOCK &&
-	    imap.br_startblock != DELAYSTARTBLOCK)
-		xfs_get_blocks_map_buffer(inode, bh_result, create, direct,
-					  &imap, offset, size);
-
-	/*
-	 * If this is a realtime file, data may be on a different device.
-	 * to that pointed to from the buffer_head b_bdev currently.
-	 */
-	bh_result->b_bdev = xfs_find_bdev_for_inode(inode);
-
-	/*
-	 * If we previously allocated a block out beyond eof and we are now
-	 * coming back to use it then we will need to flag it as new even if it
-	 * has a disk address.
-	 *
-	 * With sub-block writes into unwritten extents we also need to mark
-	 * the buffer as new so that the unwritten parts of the buffer gets
-	 * correctly zeroed.
-	 */
-	if (create &&
-	    ((!buffer_mapped(bh_result) && !buffer_uptodate(bh_result)) ||
-	     (offset >= i_size_read(inode)) ||
-	     (new || ISUNWRITTEN(&imap))))
-		set_buffer_new(bh_result);
-
-	if (imap.br_startblock == DELAYSTARTBLOCK) {
-		BUG_ON(direct);
-		if (create) {
-			set_buffer_uptodate(bh_result);
-			set_buffer_mapped(bh_result);
-			set_buffer_delay(bh_result);
-		}
-	}
-
 	/*
 	 * If this is O_DIRECT or the mpage code calling tell them how large
 	 * the mapping is, so that we can avoid repeated get_blocks calls.
@@ -1451,6 +1506,46 @@ __xfs_get_blocks(
 		bh_result->b_size = mapping_size;
 	}
 
+	if (imap.br_startblock != HOLESTARTBLOCK &&
+	    imap.br_startblock != DELAYSTARTBLOCK) {
+		error = xfs_get_blocks_map_buffer(inode, bh_result, create,
+						  direct, &imap, offset);
+		if (error)
+			return error;
+	}
+	if (create && direct)
+		ASSERT(bh_result->b_private);
+
+	/*
+	 * If this is a realtime file, data may be on a different device
+	 * to that pointed to from the buffer_head b_bdev currently.
+	 */
+	bh_result->b_bdev = xfs_find_bdev_for_inode(inode);
+
+	/*
+	 * If we previously allocated a block out beyond eof and we are now
+	 * coming back to use it then we will need to flag it as new even if it
+	 * has a disk address.
+	 *
+	 * With sub-block writes into unwritten extents we also need to mark
+	 * the buffer as new so that the unwritten parts of the buffer get
+	 * correctly zeroed.
+	 */
+	if (create &&
+	    ((!buffer_mapped(bh_result) && !buffer_uptodate(bh_result)) ||
+	     (offset >= i_size_read(inode)) ||
+	     (new || ISUNWRITTEN(&imap))))
+		set_buffer_new(bh_result);
+
+	if (imap.br_startblock == DELAYSTARTBLOCK) {
+		BUG_ON(direct);
+		if (create) {
+			set_buffer_uptodate(bh_result);
+			set_buffer_mapped(bh_result);
+			set_buffer_delay(bh_result);
+		}
+	}
+
 	return 0;
 
 out_unlock:
@@ -1501,38 +1596,51 @@ xfs_end_io_direct_write(
 		goto out_destroy_ioend;
 
 	/*
-	 * While the generic direct I/O code updates the inode size, it does
-	 * so only after the end_io handler is called, which means our
-	 * end_io handler thinks the on-disk size is outside the in-core
-	 * size.  To prevent this just update it a little bit earlier here.
+	 * dio completion end_io functions are only called on writes if more
+	 * than 0 bytes were written.
 	 */
-	if (offset + size > i_size_read(inode))
-		i_size_write(inode, offset + size);
+	ASSERT(size > 0);
 
 	/*
-	 * For direct I/O we do not know if we need to allocate blocks or not,
-	 * so we can't preallocate an append transaction, as that results in
-	 * nested reservations and log space deadlocks. Hence allocate the
-	 * transaction here. While this is sub-optimal and can block IO
-	 * completion for some time, we're stuck with doing it this way until
-	 * we can pass the ioend to the direct IO allocation callbacks and
-	 * avoid nesting that way.
+	 * The ioend only maps whole blocks, while the IO may be sector aligned.
+	 * Hence the ioend offset/size may not match the IO offset/size exactly,
+	 * but should span it completely. Write the IO sizes into the ioend so
+	 * that completion processing does the right thing.
 	 */
-	if (ioend->io_type == XFS_IO_UNWRITTEN && size > 0) {
-		xfs_iomap_write_unwritten(ip, offset, size);
-	} else if (offset + size > ip->i_d.di_size) {
-		struct xfs_trans	*tp;
-		int			error;
+	ASSERT(size <= ioend->io_size);
+	ASSERT(offset >= ioend->io_offset);
+	ASSERT(offset + size <= ioend->io_offset + ioend->io_size);
+	ioend->io_size = size;
+	ioend->io_offset = offset;
 
-		tp = xfs_trans_alloc(mp, XFS_TRANS_FSYNC_TS);
-		error = xfs_trans_reserve(tp, &M_RES(mp)->tr_fsyncts, 0, 0);
-		if (error) {
-			xfs_trans_cancel(tp, 0);
-			goto out_destroy_ioend;
-		}
+	/*
+	 * The ioend tells us whether we are doing unwritten extent conversion
+	 * or an append transaction that updates the on-disk file size. These
+	 * cases are the only cases where we should *potentially* be needing
+	 * to update the VFS inode size. When the ioend indicates this, we
+	 * are *guaranteed* to be running in non-interrupt context.
+	 *
+	 * We need to update the in-core inode size here so that we don't end up
+	 * with the on-disk inode size being outside the in-core inode size.
+	 * While we can do this in the process context after the IO has
+	 * completed, this does not work for AIO and hence we always update
+	 * the in-core inode size here if necessary.
+	 */
+	if (ioend->io_type == XFS_IO_UNWRITTEN || ioend->io_append_trans) {
+		if (offset + size > i_size_read(inode))
+			i_size_write(inode, offset + size);
+	} else
+		ASSERT(offset + size <= i_size_read(inode));
 
-		xfs_setfilesize(ip, tp, offset, size);
+	/* Ugh. No way to propagate errors, so ignore them. */
+	if (ioend->io_type == XFS_IO_UNWRITTEN) {
+		xfs_iomap_write_unwritten(ip, offset, size);
+	} else if (ioend->io_append_trans) {
+		xfs_setfilesize_ioend(ioend);
+	} else {
+		ASSERT(!xfs_ioend_is_append(ioend));
 	}
+
 out_destroy_ioend:
 	xfs_destroy_ioend(ioend);
 }
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index b2a45cc..a584c27 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1200,13 +1200,18 @@ DECLARE_EVENT_CLASS(xfs_imap_class,
 		__entry->blockcount = irec ? irec->br_blockcount : 0;
 	),
 	TP_printk("dev %d:%d ino 0x%llx size 0x%llx offset 0x%llx count %zd "
-		  "type %s startoff 0x%llx startblock %lld blockcount 0x%llx",
+		  "type %d startoff 0x%llx startblock %lld blockcount 0x%llx",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
 		  __entry->size,
 		  __entry->offset,
 		  __entry->count,
+		  /*
+		   * XXX: __print_symbolic broken for enums, fix coming in 4.1
+		   * cycle from Mr Rostedt. Need to know type now, so...
 		  __print_symbolic(__entry->type, XFS_IO_TYPES),
+		   */
+		  __entry->type,
 		  __entry->startoff,
 		  (__int64_t)__entry->startblock,
 		  __entry->blockcount)
@@ -1221,6 +1226,9 @@ DEFINE_IOMAP_EVENT(xfs_map_blocks_found);
 DEFINE_IOMAP_EVENT(xfs_map_blocks_alloc);
 DEFINE_IOMAP_EVENT(xfs_get_blocks_found);
 DEFINE_IOMAP_EVENT(xfs_get_blocks_alloc);
+DEFINE_IOMAP_EVENT(xfs_gbmap_direct);
+DEFINE_IOMAP_EVENT(xfs_gbmap_direct_new);
+DEFINE_IOMAP_EVENT(xfs_gbmap_direct_update);
 
 DECLARE_EVENT_CLASS(xfs_simple_io_class,
 	TP_PROTO(struct xfs_inode *ip, xfs_off_t offset, ssize_t count),
-- 
2.0.0



* [PATCH 3/5] xfs: DIO write completion size updates race
  2015-04-10 13:37 [PATCH 0/5] xfs: fix direct IO completion issues Dave Chinner
  2015-04-10 13:37 ` [PATCH 1/5] xfs: DIO requires an ioend for writes Dave Chinner
  2015-04-10 13:37 ` [PATCH 2/5] xfs: direct IO needs to use append ioends Dave Chinner
@ 2015-04-10 13:37 ` Dave Chinner
  2015-04-10 20:22   ` Brian Foster
  2015-04-10 13:37 ` [PATCH 4/5] xfs: direct IO EOF zeroing needs to drain AIO Dave Chinner
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2015-04-10 13:37 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

xfs_end_io_direct_write() can race with other IO completions when
updating the in-core inode size. The IO completion processing is not
serialised for direct IO - it is done either under the
IOLOCK_SHARED for non-AIO DIO, or without any IOLOCK held at all
during AIO DIO completion. Hence the non-atomic test-and-set update
of the in-core inode size is racy and can result in the in-core
inode size going backwards if the race is hit just right.

If the inode size goes backwards, this can trigger the EOF zeroing
code to run incorrectly on the next IO, which will then zero data
that has successfully been written to disk by a previous DIO.

To fix this bug, we need to serialise the test/set updates of the
in-core inode size. This first patch introduces locking around the
relevant updates and checks in the DIO path. Because we now have an
ioend in xfs_end_io_direct_write(), we know exactly when we are
doing an IO that requires an in-core EOF update, and we know that
those updates are not running in interrupt context. As such, we do
not need to use irqsave() spinlock variants to protect against
interrupts while the lock is held.

Hence we can use an existing spinlock in the inode to do this
serialisation and so not need to grow the struct xfs_inode just to
work around this problem.

This patch does not address the test/set EOF update in
generic_file_write_direct() for various reasons - that will be done
as a followup with separate explanation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c | 17 ++++++++++++-----
 fs/xfs/xfs_file.c | 13 ++++++++++++-
 2 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 52c7e46..aafd54c 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1616,21 +1616,28 @@ xfs_end_io_direct_write(
 	/*
 	 * The ioend tells us whether we are doing unwritten extent conversion
 	 * or an append transaction that updates the on-disk file size. These
-	 * cases are the only cases where we should *potentially* be needing
-	 * to update the VFS inode size. When the ioend indicates this, we
-	 * are *guaranteed* to be running in non-interrupt context.
+	 * cases are the only cases where we should *potentially* be needing to
+	 * update the VFS inode size. When the ioend indicates this, we are
+	 * *guaranteed* to be running in non-interrupt context.
 	 *
 	 * We need to update the in-core inode size here so that we don't end up
 	 * with the on-disk inode size being outside the in-core inode size.
 	 * While we can do this in the process context after the IO has
-	 * completed, this does not work for AIO and hence we always update
-	 * the in-core inode size here if necessary.
+	 * completed, this does not work for AIO and hence we always update the
+	 * in-core inode size here if necessary.
+	 *
+	 * We need to lock the test/set EOF update as we can be racing with
+	 * other IO completions here to update the EOF. Failing to serialise
+	 * here can result in EOF moving backwards and Bad Things Happen when
+	 * that occurs.
 	 */
+	spin_lock(&ip->i_flags_lock);
 	if (ioend->io_type == XFS_IO_UNWRITTEN || ioend->io_append_trans) {
 		if (offset + size > i_size_read(inode))
 			i_size_write(inode, offset + size);
 	} else
 		ASSERT(offset + size <= i_size_read(inode));
+	spin_unlock(&ip->i_flags_lock);
 
 	/* Ugh. No way to propagate errors, so ignore them. */
 	if (ioend->io_type == XFS_IO_UNWRITTEN) {
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index dc5f609..38ff356 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -569,10 +569,20 @@ restart:
 	 * write.  If zeroing is needed and we are currently holding the
 	 * iolock shared, we need to update it to exclusive which implies
 	 * having to redo all checks before.
+	 *
+	 * We need to serialise against EOF updates that occur in IO
+	 * completions here. We want to make sure that nobody is changing the
+	 * size while we do this check until we have placed an IO barrier (i.e.
+	 * hold the XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.
+	 * The spinlock effectively forms a memory barrier once we have the
+	 * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value
+	 * and hence be able to correctly determine if we need to run zeroing.
 	 */
+	spin_lock(&ip->i_flags_lock);
 	if (*pos > i_size_read(inode)) {
 		bool	zero = false;
 
+		spin_unlock(&ip->i_flags_lock);
 		if (*iolock == XFS_IOLOCK_SHARED) {
 			xfs_rw_iunlock(ip, *iolock);
 			*iolock = XFS_IOLOCK_EXCL;
@@ -582,7 +592,8 @@ restart:
 		error = xfs_zero_eof(ip, *pos, i_size_read(inode), &zero);
 		if (error)
 			return error;
-	}
+	} else
+		spin_unlock(&ip->i_flags_lock);
 
 	/*
 	 * Updating the timestamps will grab the ilock again from
-- 
2.0.0



* [PATCH 4/5] xfs: direct IO EOF zeroing needs to drain AIO
  2015-04-10 13:37 [PATCH 0/5] xfs: fix direct IO completion issues Dave Chinner
                   ` (2 preceding siblings ...)
  2015-04-10 13:37 ` [PATCH 3/5] xfs: DIO write completion size updates race Dave Chinner
@ 2015-04-10 13:37 ` Dave Chinner
  2015-04-10 20:22   ` Brian Foster
  2015-04-10 13:38 ` [PATCH 5/5] xfs: using generic_file_direct_write() is unnecessary Dave Chinner
  2015-04-12 15:09 ` [PATCH 0/5] xfs: fix direct IO completion issues Christoph Hellwig
  5 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2015-04-10 13:37 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

When we are doing AIO DIO writes, the IOLOCK only provides an IO
submission barrier. When we need to do EOF zeroing, we need to ensure
that no other IO is in progress and all pending in-core EOF updates
have been completed. This requires us to wait for all outstanding
AIO DIO writes to the inode to complete and, if necessary, run their
EOF updates.

Once all the EOF updates are complete, we can then restart
xfs_file_aio_write_checks() while holding the IOLOCK_EXCL, knowing
that EOF is up to date and we have exclusive IO access to the file
so we can run EOF block zeroing if we need to without interference.
This gives EOF zeroing the same exclusivity against other IO as we
provide to truncate operations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_file.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 38ff356..7b872f4 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -587,6 +587,16 @@ restart:
 			xfs_rw_iunlock(ip, *iolock);
 			*iolock = XFS_IOLOCK_EXCL;
 			xfs_rw_ilock(ip, *iolock);
+
+			/*
+			 * We now have an IO submission barrier in place, but
+			 * AIO can do EOF updates during IO completion and hence
+			 * we now need to wait for all of them to drain. Non-AIO
+			 * DIO will have drained before we are given the
+			 * XFS_IOLOCK_EXCL, and so for most cases this wait is a
+			 * no-op.
+			 */
+			inode_dio_wait(inode);
 			goto restart;
 		}
 		error = xfs_zero_eof(ip, *pos, i_size_read(inode), &zero);
-- 
2.0.0



* [PATCH 5/5] xfs: using generic_file_direct_write() is unnecessary
  2015-04-10 13:37 [PATCH 0/5] xfs: fix direct IO completion issues Dave Chinner
                   ` (3 preceding siblings ...)
  2015-04-10 13:37 ` [PATCH 4/5] xfs: direct IO EOF zeroing needs to drain AIO Dave Chinner
@ 2015-04-10 13:38 ` Dave Chinner
  2015-04-10 20:22   ` Brian Foster
  2015-04-12 15:09 ` [PATCH 0/5] xfs: fix direct IO completion issues Christoph Hellwig
  5 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2015-04-10 13:38 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

generic_file_direct_write() does all sorts of things to make DIO
work "sorta ok" with mixed buffered IO workloads. We already do
most of this work in xfs_file_aio_dio_write() because of the locking
requirements, so there are only a couple of things it does for us.

The first thing is that it does a page cache invalidation after the
->direct_IO callout. This can easily be added to the XFS code.

The second thing it does is that if data was written, it updates the
iov_iter structure to reflect the data written, and then does EOF
size updates if necessary. For XFS, these EOF size updates are now
unnecessary, as we do them safely and race-free in IO completion
context. That leaves just the iov_iter update, and that's also easily
moved to the XFS code.

The result is that we don't need to call
generic_file_direct_write(), and hence remove a redundant buffered
writeback call and a redundant page cache invalidation call from the
DIO submission path. We also remove a racy EOF size update, and make
the DIO submission code in XFS much easier to follow. Wins all
round, really.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_file.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 7b872f4..7182cd2 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -665,6 +665,8 @@ xfs_file_dio_aio_write(
 	int			iolock;
 	size_t			count = iov_iter_count(from);
 	loff_t			pos = iocb->ki_pos;
+	loff_t			end;
+	struct iov_iter		data;
 	struct xfs_buftarg	*target = XFS_IS_REALTIME_INODE(ip) ?
 					mp->m_rtdev_targp : mp->m_ddev_targp;
 
@@ -704,10 +706,11 @@ xfs_file_dio_aio_write(
 	if (ret)
 		goto out;
 	iov_iter_truncate(from, count);
+	end = pos + count - 1;
 
 	if (mapping->nrpages) {
 		ret = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
-						    pos, pos + count - 1);
+						   pos, end);
 		if (ret)
 			goto out;
 		/*
@@ -717,7 +720,7 @@ xfs_file_dio_aio_write(
 		 */
 		ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
 					pos >> PAGE_CACHE_SHIFT,
-					(pos + count - 1) >> PAGE_CACHE_SHIFT);
+					end >> PAGE_CACHE_SHIFT);
 		WARN_ON_ONCE(ret);
 		ret = 0;
 	}
@@ -734,8 +737,22 @@ xfs_file_dio_aio_write(
 	}
 
 	trace_xfs_file_direct_write(ip, count, iocb->ki_pos, 0);
-	ret = generic_file_direct_write(iocb, from, pos);
 
+	data = *from;
+	ret = mapping->a_ops->direct_IO(WRITE, iocb, &data, pos);
+
+	/* see generic_file_direct_write() for why this is necessary */
+	if (mapping->nrpages) {
+		invalidate_inode_pages2_range(mapping,
+					      pos >> PAGE_CACHE_SHIFT,
+					      end >> PAGE_CACHE_SHIFT);
+	}
+
+	if (ret > 0) {
+		pos += ret;
+		iov_iter_advance(from, ret);
+		iocb->ki_pos = pos;
+	}
 out:
 	xfs_rw_iunlock(ip, iolock);
 
-- 
2.0.0


* Re: [PATCH 1/5] xfs: DIO requires an ioend for writes
  2015-04-10 13:37 ` [PATCH 1/5] xfs: DIO requires an ioend for writes Dave Chinner
@ 2015-04-10 20:21   ` Brian Foster
  2015-04-10 22:24     ` Dave Chinner
  0 siblings, 1 reply; 19+ messages in thread
From: Brian Foster @ 2015-04-10 20:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Apr 10, 2015 at 11:37:56PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Right now unwritten extent conversion information is passed by
> making the end IO private data non-null, which does not enable us to
> pass any information from submission context to completion context,
> which we need in order to use the standard IO completion paths.
> 
> Allocate an ioend in block allocation for direct IO and attach it to
> the mapping buffer used during direct IO block allocation. Factor
> the mapping code to make it obvious that this is happening only for
> direct IO writes, and place the mapping info and IO type
> directly into the ioend for use in completion context.
> 
> The completion is changed to check the ioend type to determine if
> unwritten extent completion is necessary or not.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_aops.c | 79 ++++++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 61 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 3a9b7a1..d95a42b 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1233,6 +1233,57 @@ xfs_vm_releasepage(
>  	return try_to_free_buffers(page);
>  }
>  
> +static void
> +xfs_get_blocks_map_buffer(
> +	struct inode		*inode,
> +	struct buffer_head	*bh_result,
> +	int			create,
> +	int			direct,
> +	struct xfs_bmbt_irec	*imap,
> +	xfs_off_t		offset,
> +	ssize_t			size)
> +{
> +	struct xfs_ioend	*ioend;
> +	int			type;
> +
> +	if (!create) {
> +		/*
> +		 * Unwritten extents do not report a disk address for
> +		 * the read case (treat as if we're reading into a hole).
> +		 */
> +		if (!ISUNWRITTEN(imap))
> +			xfs_map_buffer(inode, bh_result, imap, offset);
> +		return;
> +	}

This logic was kind of ugly to begin with, but I think the refactoring
exposes it further. There's rather twisty logic here just for a case
where we don't do anything. In other words, the fact that we have early
return logic in this function that ultimately jumps us back up through
two levels of scope (e.g., this function, the if branch in the caller)
is a good indication that the logic could be improved.

I think we could lift the (!create && ISUNWRITTEN()) case up into the
top most check, along with the comment as to why, and skip the entire
codepath in that case. That also kills the duplicate xfs_map_buffer()
call above so long as we do something like this:

> +
> +	xfs_map_buffer(inode, bh_result, imap, offset);
	if (!create)
		return;

... but I think we could clean it up even further than that. This
function gets rather busy in the subsequent patch so I think there's
value in isolating it to managing direct I/O and lifting the common
stuff back into xfs_get_blocks() (see below). This means we could also
kill both of the create and direct parameters because they aren't used
at all beyond the !direct check a few lines below.

> +
> +	if (ISUNWRITTEN(imap))
> +		set_buffer_unwritten(bh_result);
> +
> +	if (!direct)
> +		return;
> +
> +	/*
> +	 * Direct IO writes require an ioend to be allocated and
> +	 * passed via the returned mapping. This allows the end
> +	 * io function to determine the correct course of
> +	 * action.
> +	 */
> +
> +	if (ISUNWRITTEN(imap)) {
> +		type = XFS_IO_UNWRITTEN;
> +		set_buffer_defer_completion(bh_result);
> +	} else
> +		type = XFS_IO_OVERWRITE;
> +	ioend = xfs_alloc_ioend(inode, type);
> +	ioend->io_offset = offset;
> +	ioend->io_size = size;
> +	bh_result->b_private = ioend;
> +
> +	return;

Pointless return... not a big deal given this is converted from a void
function in the subsequent patch.

> +}
> +
>  STATIC int
>  __xfs_get_blocks(
>  	struct inode		*inode,
> @@ -1252,6 +1303,7 @@ __xfs_get_blocks(
>  	ssize_t			size;
>  	int			new = 0;
>  
> +
>  	if (XFS_FORCED_SHUTDOWN(mp))
>  		return -EIO;
>  
> @@ -1332,21 +1384,9 @@ __xfs_get_blocks(
>  	}
>  
>  	if (imap.br_startblock != HOLESTARTBLOCK &&
> -	    imap.br_startblock != DELAYSTARTBLOCK) {
> -		/*
> -		 * For unwritten extents do not report a disk address on
> -		 * the read case (treat as if we're reading into a hole).
> -		 */
> -		if (create || !ISUNWRITTEN(&imap))
> -			xfs_map_buffer(inode, bh_result, &imap, offset);
> -		if (create && ISUNWRITTEN(&imap)) {
> -			if (direct) {
> -				bh_result->b_private = inode;
> -				set_buffer_defer_completion(bh_result);
> -			}
> -			set_buffer_unwritten(bh_result);
> -		}
> -	}
> +	    imap.br_startblock != DELAYSTARTBLOCK)
> +		xfs_get_blocks_map_buffer(inode, bh_result, create, direct,
> +					  &imap, offset, size);

So if we pull some of the bits from xfs_get_blocks_map_buffer() back up,
I end up with something like the following here. Compile tested
only, but illustrates the point:

        /*
         * Map the buffer as long as we have physical blocks and this isn't a
         * read of an unwritten extent. Treat reads into unwritten extents as
         * holes and thus do not return a mapping.
         */
        if (imap.br_startblock != HOLESTARTBLOCK &&
            imap.br_startblock != DELAYSTARTBLOCK &&
            (create || !ISUNWRITTEN(&imap))) {
                xfs_map_buffer(inode, bh_result, &imap, offset);
		/* unwritten implies create due to check above */
                if (ISUNWRITTEN(&imap))
                        set_buffer_unwritten(bh_result);
                /* direct writes have a special mapping */
                if (create && direct) {
                        error = xfs_map_direct(inode, bh_result, &imap, offset);
                        if (error)
                                return error;
                }
        }

I renamed the helper to xfs_map_direct(), killed everything therein up
through the !direct check and killed both the create and direct params.
Thoughts?

Brian

>  
>  	/*
>  	 * If this is a realtime file, data may be on a different device.
> @@ -1455,9 +1495,10 @@ xfs_end_io_direct_write(
>  	struct inode		*inode = file_inode(iocb->ki_filp);
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_ioend	*ioend = private;
>  
>  	if (XFS_FORCED_SHUTDOWN(mp))
> -		return;
> +		goto out_destroy_ioend;
>  
>  	/*
>  	 * While the generic direct I/O code updates the inode size, it does
> @@ -1477,7 +1518,7 @@ xfs_end_io_direct_write(
>  	 * we can pass the ioend to the direct IO allocation callbacks and
>  	 * avoid nesting that way.
>  	 */
> -	if (private && size > 0) {
> +	if (ioend->io_type == XFS_IO_UNWRITTEN && size > 0) {
>  		xfs_iomap_write_unwritten(ip, offset, size);
>  	} else if (offset + size > ip->i_d.di_size) {
>  		struct xfs_trans	*tp;
> @@ -1487,11 +1528,13 @@ xfs_end_io_direct_write(
>  		error = xfs_trans_reserve(tp, &M_RES(mp)->tr_fsyncts, 0, 0);
>  		if (error) {
>  			xfs_trans_cancel(tp, 0);
> -			return;
> +			goto out_destroy_ioend;
>  		}
>  
>  		xfs_setfilesize(ip, tp, offset, size);
>  	}
> +out_destroy_ioend:
> +	xfs_destroy_ioend(ioend);
>  }
>  
>  STATIC ssize_t
> -- 
> 2.0.0
> 

* Re: [PATCH 2/5] xfs: direct IO needs to use append ioends
  2015-04-10 13:37 ` [PATCH 2/5] xfs: direct IO needs to use append ioends Dave Chinner
@ 2015-04-10 20:21   ` Brian Foster
  2015-04-10 22:30     ` Dave Chinner
  2015-04-11 21:15   ` Brian Foster
  1 sibling, 1 reply; 19+ messages in thread
From: Brian Foster @ 2015-04-10 20:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Apr 10, 2015 at 11:37:57PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now we have an ioend being passed unconditionally to the direct IO
> write completion context, we can pass a preallocated transaction
> handle for on-disk inode size updates that are run in completion.
> 
> At this point we really need to be passing the correct block range
> that the IO spans through the ioend, so calculate the last block in
> the mapping before we map the allocated range and use that instead
> of the size desired by the direct IO.
> 
> This enables us to keep track of multiple get-blocks calls in the
> same direct IO - the ioend will keep coming back to us, and we can
> keep extending its range as new allocations and mappings are done.
> 
> There are some new trace points added for debugging, and a small
> hack to actually make the tracepoints work (enums in tracepoints
> that use __print_symbolic don't work correctly) that should be fixed
> in the 4.1 merge window. This hack can be removed when the
> tracepoint fix is upstream.
> 
> There are lots of comments explaining the intricacies of passing the
> ioend and append transaction in the code; they are better placed in
> the code because we're going to need them to understand why this
> code does what it does in a few years time....
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

I still need to look at this one (and grok the dio code more)... but an
initial question: is this multiple get_blocks() call aggregation a
requirement for the append ioend mechanism or an optimization? If the
latter, I think a separate patch is more appropriate...

Brian

>  fs/xfs/xfs_aops.c  | 262 +++++++++++++++++++++++++++++++++++++----------------
>  fs/xfs/xfs_trace.h |  10 +-
>  2 files changed, 194 insertions(+), 78 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index d95a42b..52c7e46 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -178,6 +178,25 @@ xfs_setfilesize_ioend(
>  	return xfs_setfilesize(ip, tp, ioend->io_offset, ioend->io_size);
>  }
>  
> +STATIC void
> +xfs_setfilesize_ioend_cancel(
> +	struct xfs_ioend	*ioend)
> +{
> +	struct xfs_trans	*tp = ioend->io_append_trans;
> +
> +	/*
> +	 * The transaction may have been allocated in the I/O submission thread,
> +	 * thus we need to mark ourselves as being in a transaction manually.
> +	 * Similarly for freeze protection.
> +	 */
> +	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +	rwsem_acquire_read(&ioend->io_inode->i_sb->s_writers.lock_map[SB_FREEZE_FS-1],
> +			   0, 1, _THIS_IP_);
> +
> +	xfs_trans_cancel(tp, 0);
> +	ioend->io_append_trans = NULL;
> +}
> +
>  /*
>   * Schedule IO completion handling on the final put of an ioend.
>   *
> @@ -1233,18 +1252,18 @@ xfs_vm_releasepage(
>  	return try_to_free_buffers(page);
>  }
>  
> -static void
> +static int
>  xfs_get_blocks_map_buffer(
>  	struct inode		*inode,
>  	struct buffer_head	*bh_result,
>  	int			create,
>  	int			direct,
>  	struct xfs_bmbt_irec	*imap,
> -	xfs_off_t		offset,
> -	ssize_t			size)
> +	xfs_off_t		offset)
>  {
>  	struct xfs_ioend	*ioend;
>  	int			type;
> +	loff_t			size;
>  
>  	if (!create) {
>  		/*
> @@ -1253,7 +1272,7 @@ xfs_get_blocks_map_buffer(
>  		 */
>  		if (!ISUNWRITTEN(imap))
>  			xfs_map_buffer(inode, bh_result, imap, offset);
> -		return;
> +		return 0;
>  	}
>  
>  	xfs_map_buffer(inode, bh_result, imap, offset);
> @@ -1262,26 +1281,93 @@ xfs_get_blocks_map_buffer(
>  		set_buffer_unwritten(bh_result);
>  
>  	if (!direct)
> -		return;
> +		return 0;
>  
>  	/*
> -	 * Direct IO writes require an ioend to be allocated and
> -	 * passed via the returned mapping. This allows the end
> -	 * io function to determine the correct course of
> -	 * action.
> +	 * Direct IO writes require an ioend to be allocated and passed via the
> +	 * returned mapping. This allows the end io function to determine the
> +	 * correct course of action.
> +	 *
> +	 * Unwritten extents will need transactions at completion, so are known
> +	 * to need deferring to a workqueue. However, for writes into written
> +	 * extents, we *may* need a transaction if this IO extends the on-disk
> +	 * EOF. Because we can race with other IOs the file may already be
> +	 * extended by the time we get to the transaction. IO completion already
> +	 * handles that case so all we will have done is incurred the overhead
> +	 * of workqueue deferral for completion. This is acceptable overhead for
> +	 * the rare case that this occurs.
>  	 */
> -
>  	if (ISUNWRITTEN(imap)) {
>  		type = XFS_IO_UNWRITTEN;
>  		set_buffer_defer_completion(bh_result);
>  	} else
>  		type = XFS_IO_OVERWRITE;
> -	ioend = xfs_alloc_ioend(inode, type);
> -	ioend->io_offset = offset;
> -	ioend->io_size = size;
> -	bh_result->b_private = ioend;
>  
> -	return;
> +	/*
> +	 * The offset that is passed in is the first block the DIO will fall
> +	 * into. The size supplied by the DIO layer is what it thinks it needs
> +	 * but the mapping may not span this entire range. Hence we use the
> +	 * truncated mapping size that's already been stashed in the bh_result
> +	 * to calculate the range covered by the ioend.
> +	 */
> +	size = bh_result->b_size;
> +	trace_xfs_gbmap_direct(XFS_I(inode), offset, size, type, imap);
> +
> +	/*
> +	 * If we get multiple mappings in a single IO, we might be mapping
> +	 * different types. But because the direct IO can only have a single
> +	 * private pointer, we need to ensure that:
> +	 *
> +	 * a) the ioend spans the entire region of the IO; and
> +	 * b) if it contains unwritten extents, it is *permanently* marked as
> +	 *    such and we cancel any append transaction attached to the ioend.
> +	 *
> +	 * We could do this by chaining ioends like buffered IO does, but
> +	 * we only actually get one IO completion callback from the direct IO,
> +	 * and that spans the entire IO regardless of how many mappings and IOs
> +	 * are needed to complete the DIO. There is only going to be one
> +	 * reference to the ioend and its life cycle is constrained by the
> +	 * DIO completion code. Hence we don't need reference counting here.
> +	 */
> +	if (bh_result->b_private) {
> +		ioend = bh_result->b_private;
> +		ASSERT(ioend->io_size > 0);
> +		ASSERT(offset >= ioend->io_offset);
> +
> +		if (offset + size > ioend->io_offset + ioend->io_size)
> +			ioend->io_size = offset - ioend->io_offset + size;
> +
> +		if (type == XFS_IO_UNWRITTEN) {
> +			if (ioend->io_append_trans)
> +				xfs_setfilesize_ioend_cancel(ioend);
> +			ioend->io_type = XFS_IO_UNWRITTEN;
> +		}
> +		trace_xfs_gbmap_direct_update(XFS_I(inode), ioend->io_offset,
> +					      ioend->io_size, ioend->io_type,
> +					      imap);
> +	} else {
> +		ioend = xfs_alloc_ioend(inode, type);
> +		ioend->io_offset = offset;
> +		ioend->io_size = size;
> +		bh_result->b_private = ioend;
> +		trace_xfs_gbmap_direct_new(XFS_I(inode), offset, size, type,
> +					   imap);
> +	}
> +
> +	/* check if we need an append transaction allocated. */
> +	if (ioend->io_type == XFS_IO_OVERWRITE &&
> +	    xfs_ioend_is_append(ioend) && !ioend->io_append_trans) {
> +		int	error;
> +
> +		error = xfs_setfilesize_trans_alloc(ioend);
> +		ASSERT(!error);
> +		if (error) {
> +			xfs_destroy_ioend(ioend);
> +			return error;
> +		}
> +		set_buffer_defer_completion(bh_result);
> +	}
> +	return 0;
>  }
>  
>  STATIC int
> @@ -1374,50 +1460,19 @@ __xfs_get_blocks(
>  			xfs_iunlock(ip, lockmode);
>  		}
>  
> -		trace_xfs_get_blocks_alloc(ip, offset, size, 0, &imap);
> +		trace_xfs_get_blocks_alloc(ip, offset, size,
> +				ISUNWRITTEN(&imap) ? XFS_IO_UNWRITTEN
> +						   : XFS_IO_DELALLOC, &imap);
>  	} else if (nimaps) {
> -		trace_xfs_get_blocks_found(ip, offset, size, 0, &imap);
> +		trace_xfs_get_blocks_found(ip, offset, size,
> +				ISUNWRITTEN(&imap) ? XFS_IO_UNWRITTEN
> +						   : XFS_IO_OVERWRITE, &imap);
>  		xfs_iunlock(ip, lockmode);
>  	} else {
>  		trace_xfs_get_blocks_notfound(ip, offset, size);
>  		goto out_unlock;
>  	}
>  
> -	if (imap.br_startblock != HOLESTARTBLOCK &&
> -	    imap.br_startblock != DELAYSTARTBLOCK)
> -		xfs_get_blocks_map_buffer(inode, bh_result, create, direct,
> -					  &imap, offset, size);
> -
> -	/*
> -	 * If this is a realtime file, data may be on a different device.
> -	 * to that pointed to from the buffer_head b_bdev currently.
> -	 */
> -	bh_result->b_bdev = xfs_find_bdev_for_inode(inode);
> -
> -	/*
> -	 * If we previously allocated a block out beyond eof and we are now
> -	 * coming back to use it then we will need to flag it as new even if it
> -	 * has a disk address.
> -	 *
> -	 * With sub-block writes into unwritten extents we also need to mark
> -	 * the buffer as new so that the unwritten parts of the buffer gets
> -	 * correctly zeroed.
> -	 */
> -	if (create &&
> -	    ((!buffer_mapped(bh_result) && !buffer_uptodate(bh_result)) ||
> -	     (offset >= i_size_read(inode)) ||
> -	     (new || ISUNWRITTEN(&imap))))
> -		set_buffer_new(bh_result);
> -
> -	if (imap.br_startblock == DELAYSTARTBLOCK) {
> -		BUG_ON(direct);
> -		if (create) {
> -			set_buffer_uptodate(bh_result);
> -			set_buffer_mapped(bh_result);
> -			set_buffer_delay(bh_result);
> -		}
> -	}
> -
>  	/*
>  	 * If this is O_DIRECT or the mpage code calling tell them how large
>  	 * the mapping is, so that we can avoid repeated get_blocks calls.
> @@ -1451,6 +1506,46 @@ __xfs_get_blocks(
>  		bh_result->b_size = mapping_size;
>  	}
>  
> +	if (imap.br_startblock != HOLESTARTBLOCK &&
> +	    imap.br_startblock != DELAYSTARTBLOCK) {
> +		error = xfs_get_blocks_map_buffer(inode, bh_result, create,
> +						  direct, &imap, offset);
> +		if (error)
> +			return error;
> +	}
> +	if (create && direct)
> +		ASSERT(bh_result->b_private);
> +
> +	/*
> +	 * If this is a realtime file, data may be on a different device.
> +	 * to that pointed to from the buffer_head b_bdev currently.
> +	 */
> +	bh_result->b_bdev = xfs_find_bdev_for_inode(inode);
> +
> +	/*
> +	 * If we previously allocated a block out beyond eof and we are now
> +	 * coming back to use it then we will need to flag it as new even if it
> +	 * has a disk address.
> +	 *
> +	 * With sub-block writes into unwritten extents we also need to mark
> +	 * the buffer as new so that the unwritten parts of the buffer gets
> +	 * correctly zeroed.
> +	 */
> +	if (create &&
> +	    ((!buffer_mapped(bh_result) && !buffer_uptodate(bh_result)) ||
> +	     (offset >= i_size_read(inode)) ||
> +	     (new || ISUNWRITTEN(&imap))))
> +		set_buffer_new(bh_result);
> +
> +	if (imap.br_startblock == DELAYSTARTBLOCK) {
> +		BUG_ON(direct);
> +		if (create) {
> +			set_buffer_uptodate(bh_result);
> +			set_buffer_mapped(bh_result);
> +			set_buffer_delay(bh_result);
> +		}
> +	}
> +
>  	return 0;
>  
>  out_unlock:
> @@ -1501,38 +1596,51 @@ xfs_end_io_direct_write(
>  		goto out_destroy_ioend;
>  
>  	/*
> -	 * While the generic direct I/O code updates the inode size, it does
> -	 * so only after the end_io handler is called, which means our
> -	 * end_io handler thinks the on-disk size is outside the in-core
> -	 * size.  To prevent this just update it a little bit earlier here.
> +	 * dio completion end_io functions are only called on writes if more
> +	 * than 0 bytes were written.
>  	 */
> -	if (offset + size > i_size_read(inode))
> -		i_size_write(inode, offset + size);
> +	ASSERT(size > 0);
>  
>  	/*
> -	 * For direct I/O we do not know if we need to allocate blocks or not,
> -	 * so we can't preallocate an append transaction, as that results in
> -	 * nested reservations and log space deadlocks. Hence allocate the
> -	 * transaction here. While this is sub-optimal and can block IO
> -	 * completion for some time, we're stuck with doing it this way until
> -	 * we can pass the ioend to the direct IO allocation callbacks and
> -	 * avoid nesting that way.
> +	 * The ioend only maps whole blocks, while the IO may be sector aligned.
> +	 * Hence the ioend offset/size may not match the IO offset/size exactly,
> +	 * but should span it completely. Write the IO sizes into the ioend so
> +	 * that completion processing does the right thing.
>  	 */
> -	if (ioend->io_type == XFS_IO_UNWRITTEN && size > 0) {
> -		xfs_iomap_write_unwritten(ip, offset, size);
> -	} else if (offset + size > ip->i_d.di_size) {
> -		struct xfs_trans	*tp;
> -		int			error;
> +	ASSERT(size <= ioend->io_size);
> +	ASSERT(offset >= ioend->io_offset);
> +	ASSERT(offset + size <= ioend->io_offset + ioend->io_size);
> +	ioend->io_size = size;
> +	ioend->io_offset = offset;
>  
> -		tp = xfs_trans_alloc(mp, XFS_TRANS_FSYNC_TS);
> -		error = xfs_trans_reserve(tp, &M_RES(mp)->tr_fsyncts, 0, 0);
> -		if (error) {
> -			xfs_trans_cancel(tp, 0);
> -			goto out_destroy_ioend;
> -		}
> +	/*
> +	 * The ioend tells us whether we are doing unwritten extent conversion
> +	 * or an append transaction that updates the on-disk file size. These
> +	 * cases are the only cases where we should *potentially* be needing
> +	 * to update the VFS inode size. When the ioend indicates this, we
> +	 * are *guaranteed* to be running in non-interrupt context.
> +	 *
> +	 * We need to update the in-core inode size here so that we don't end up
> +	 * with the on-disk inode size being outside the in-core inode size.
> +	 * While we can do this in the process context after the IO has
> +	 * completed, this does not work for AIO and hence we always update
> +	 * the in-core inode size here if necessary.
> +	 */
> +	if (ioend->io_type == XFS_IO_UNWRITTEN || ioend->io_append_trans) {
> +		if (offset + size > i_size_read(inode))
> +			i_size_write(inode, offset + size);
> +	} else
> +		ASSERT(offset + size <= i_size_read(inode));
>  
> -		xfs_setfilesize(ip, tp, offset, size);
> +	/* Ugh. No way to propagate errors, so ignore them. */
> +	if (ioend->io_type == XFS_IO_UNWRITTEN) {
> +		xfs_iomap_write_unwritten(ip, offset, size);
> +	} else if (ioend->io_append_trans) {
> +		xfs_setfilesize_ioend(ioend);
> +	} else {
> +		ASSERT(!xfs_ioend_is_append(ioend));
>  	}
> +
>  out_destroy_ioend:
>  	xfs_destroy_ioend(ioend);
>  }
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index b2a45cc..a584c27 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -1200,13 +1200,18 @@ DECLARE_EVENT_CLASS(xfs_imap_class,
>  		__entry->blockcount = irec ? irec->br_blockcount : 0;
>  	),
>  	TP_printk("dev %d:%d ino 0x%llx size 0x%llx offset 0x%llx count %zd "
> -		  "type %s startoff 0x%llx startblock %lld blockcount 0x%llx",
> +		  "type %d startoff 0x%llx startblock %lld blockcount 0x%llx",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  __entry->ino,
>  		  __entry->size,
>  		  __entry->offset,
>  		  __entry->count,
> +		  /*
> +		   * XXX: __print_symbolic broken for enums, fix coming in 4.1
> +		   * cycle from Mr Rostedt. Need to know type now, so...
>  		  __print_symbolic(__entry->type, XFS_IO_TYPES),
> +		   */
> +		  __entry->type,
>  		  __entry->startoff,
>  		  (__int64_t)__entry->startblock,
>  		  __entry->blockcount)
> @@ -1221,6 +1226,9 @@ DEFINE_IOMAP_EVENT(xfs_map_blocks_found);
>  DEFINE_IOMAP_EVENT(xfs_map_blocks_alloc);
>  DEFINE_IOMAP_EVENT(xfs_get_blocks_found);
>  DEFINE_IOMAP_EVENT(xfs_get_blocks_alloc);
> +DEFINE_IOMAP_EVENT(xfs_gbmap_direct);
> +DEFINE_IOMAP_EVENT(xfs_gbmap_direct_new);
> +DEFINE_IOMAP_EVENT(xfs_gbmap_direct_update);
>  
>  DECLARE_EVENT_CLASS(xfs_simple_io_class,
>  	TP_PROTO(struct xfs_inode *ip, xfs_off_t offset, ssize_t count),
> -- 
> 2.0.0
> 

* Re: [PATCH 3/5] xfs: DIO write completion size updates race
  2015-04-10 13:37 ` [PATCH 3/5] xfs: DIO write completion size updates race Dave Chinner
@ 2015-04-10 20:22   ` Brian Foster
  0 siblings, 0 replies; 19+ messages in thread
From: Brian Foster @ 2015-04-10 20:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Apr 10, 2015 at 11:37:58PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xfs_end_io_direct_write() can race with other IO completions when
> updating the in-core inode size. The IO completion processing is not
> serialised for direct IO - they are done either under the
> IOLOCK_SHARED for non-AIO DIO, and without any IOLOCK held at all
> during AIO DIO completion. Hence the non-atomic test-and-set update
> of the in-core inode size is racy and can result in the in-core
> inode size going backwards if the race is hit just right.
> 
> If the inode size goes backwards, this can trigger the EOF zeroing
> code to run incorrectly on the next IO, which then will zero data
> that has successfully been written to disk by a previous DIO.
> 
> To fix this bug, we need to serialise the test/set updates of the
> in-core inode size. This first patch introduces locking around the
> relevant updates and checks in the DIO path. Because we now have an
> ioend in xfs_end_io_direct_write(), we know exactly when we are
> doing an IO that requires an in-core EOF update, and we know that
> they are not running in interrupt context. As such, we do not need to
> use irqsave() spinlock variants to protect against interrupts while
> the lock is held.
> 
> Hence we can use an existing spinlock in the inode to do this
> serialisation and so not need to grow the struct xfs_inode just to
> work around this problem.
> 
> This patch does not address the test/set EOF update in
> generic_file_write_direct() for various reasons - that will be done
> as a followup with separate explanation.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_aops.c | 17 ++++++++++++-----
>  fs/xfs/xfs_file.c | 13 ++++++++++++-
>  2 files changed, 24 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 52c7e46..aafd54c 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1616,21 +1616,28 @@ xfs_end_io_direct_write(
>  	/*
>  	 * The ioend tells us whether we are doing unwritten extent conversion
>  	 * or an append transaction that updates the on-disk file size. These
> -	 * cases are the only cases where we should *potentially* be needing
> -	 * to update the VFS inode size. When the ioend indicates this, we
> -	 * are *guaranteed* to be running in non-interrupt context.
> +	 * cases are the only cases where we should *potentially* be needing to
> +	 * update the VFS inode size. When the ioend indicates this, we are
> +	 * *guaranteed* to be running in non-interrupt context.
>  	 *
>  	 * We need to update the in-core inode size here so that we don't end up
>  	 * with the on-disk inode size being outside the in-core inode size.
>  	 * While we can do this in the process context after the IO has
> -	 * completed, this does not work for AIO and hence we always update
> -	 * the in-core inode size here if necessary.
> +	 * completed, this does not work for AIO and hence we always update the
> +	 * in-core inode size here if necessary.
> +	 *
> +	 * We need to lock the test/set EOF update as we can be racing with
> +	 * other IO completions here to update the EOF. Failing to serialise
> +	 * here can result in EOF moving backwards and Bad Things Happen when
> +	 * that occurs.
>  	 */
> +	spin_lock(&ip->i_flags_lock);
>  	if (ioend->io_type == XFS_IO_UNWRITTEN || ioend->io_append_trans) {
>  		if (offset + size > i_size_read(inode))
>  			i_size_write(inode, offset + size);
>  	} else
>  		ASSERT(offset + size <= i_size_read(inode));
> +	spin_unlock(&ip->i_flags_lock);

Looks good to me once we fix the (known) locking problem above of taking
the spinlock before checking the ioend (e.g., having a lock cycle in irq
context):

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  
>  	/* Ugh. No way to propagate errors, so ignore them. */
>  	if (ioend->io_type == XFS_IO_UNWRITTEN) {
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index dc5f609..38ff356 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -569,10 +569,20 @@ restart:
>  	 * write.  If zeroing is needed and we are currently holding the
>  	 * iolock shared, we need to update it to exclusive which implies
>  	 * having to redo all checks before.
> +	 *
> +	 * We need to serialise against EOF updates that occur in IO
> +	 * completions here. We want to make sure that nobody is changing the
> +	 * size while we do this check until we have placed an IO barrier (i.e.
> +	 * hold the XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.
> +	 * The spinlock effectively forms a memory barrier once we have the
> +	 * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value
> +	 * and hence be able to correctly determine if we need to run zeroing.
>  	 */
> +	spin_lock(&ip->i_flags_lock);
>  	if (*pos > i_size_read(inode)) {
>  		bool	zero = false;
>  
> +		spin_unlock(&ip->i_flags_lock);
>  		if (*iolock == XFS_IOLOCK_SHARED) {
>  			xfs_rw_iunlock(ip, *iolock);
>  			*iolock = XFS_IOLOCK_EXCL;
> @@ -582,7 +592,8 @@ restart:
>  		error = xfs_zero_eof(ip, *pos, i_size_read(inode), &zero);
>  		if (error)
>  			return error;
> -	}
> +	} else
> +		spin_unlock(&ip->i_flags_lock);
>  
>  	/*
>  	 * Updating the timestamps will grab the ilock again from
> -- 
> 2.0.0
> 

* Re: [PATCH 4/5] xfs: direct IO EOF zeroing needs to drain AIO
  2015-04-10 13:37 ` [PATCH 4/5] xfs: direct IO EOF zeroing needs to drain AIO Dave Chinner
@ 2015-04-10 20:22   ` Brian Foster
  0 siblings, 0 replies; 19+ messages in thread
From: Brian Foster @ 2015-04-10 20:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Apr 10, 2015 at 11:37:59PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we are doing AIO DIO writes, the IOLOCK only provides an IO
> submission barrier. When we need to do EOF zeroing, we need to ensure
> that no other IO is in progress and all pending in-core EOF updates
> have been completed. This requires us to wait for all outstanding
> AIO DIO writes to the inode to complete and, if necessary, run their
> EOF updates.
> 
> Once all the EOF updates are complete, we can then restart
> xfs_file_aio_write_checks() while holding the IOLOCK_EXCL, knowing
> that EOF is up to date and we have exclusive IO access to the file
> so we can run EOF block zeroing if we need to without interference.
> This gives EOF zeroing the same exclusivity against other IO as we
> provide truncate operations.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Looks good...

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_file.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 38ff356..7b872f4 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -587,6 +587,16 @@ restart:
>  			xfs_rw_iunlock(ip, *iolock);
>  			*iolock = XFS_IOLOCK_EXCL;
>  			xfs_rw_ilock(ip, *iolock);
> +
> +			/*
> +			 * We now have an IO submission barrier in place, but
> +			 * AIO can do EOF updates during IO completion and hence
> +			 * we now need to wait for all of them to drain. Non-AIO
> +			 * DIO will have drained before we are given the
> +			 * XFS_IOLOCK_EXCL, and so for most cases this wait is a
> +			 * no-op.
> +			 */
> +			inode_dio_wait(inode);
>  			goto restart;
>  		}
>  		error = xfs_zero_eof(ip, *pos, i_size_read(inode), &zero);
> -- 
> 2.0.0
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/5] xfs: using generic_file_direct_write() is unnecessary
  2015-04-10 13:38 ` [PATCH 5/5] xfs: using generic_file_direct_write() is unnecessary Dave Chinner
@ 2015-04-10 20:22   ` Brian Foster
  0 siblings, 0 replies; 19+ messages in thread
From: Brian Foster @ 2015-04-10 20:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Apr 10, 2015 at 11:38:00PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> generic_file_direct_write() does all sorts of things to make DIO
> work "sorta ok" with mixed buffered IO workloads. We already do
> most of this work in xfs_file_aio_dio_write() because of the locking
> requirements, so there's only a couple of things it does for us.
> 
> The first thing is that it does a page cache invalidation after the
> ->direct_IO callout. This can easily be added to the XFS code.
> 
> The second thing it does is that if data was written, it updates the
> iov_iter structure to reflect the data written, and then does EOF
> size updates if necessary. For XFS, these EOF size updates are now
> not necessary, as we do them safely and race-free in IO completion
> context. That leaves just the iov_iter update, and that's also easily
> moved to the XFS code.
> 
> The result is that we don't need to call
> generic_file_direct_write(), and hence remove a redundant buffered
> writeback call and a redundant page cache invalidation call from the
> DIO submission path. We also remove a racy EOF size update, and make
> the DIO submission code in XFS much easier to follow. Wins all
> round, really.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Seems fine to me:

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_file.c | 23 ++++++++++++++++++++---
>  1 file changed, 20 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 7b872f4..7182cd2 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -665,6 +665,8 @@ xfs_file_dio_aio_write(
>  	int			iolock;
>  	size_t			count = iov_iter_count(from);
>  	loff_t			pos = iocb->ki_pos;
> +	loff_t			end;
> +	struct iov_iter		data;
>  	struct xfs_buftarg	*target = XFS_IS_REALTIME_INODE(ip) ?
>  					mp->m_rtdev_targp : mp->m_ddev_targp;
>  
> @@ -704,10 +706,11 @@ xfs_file_dio_aio_write(
>  	if (ret)
>  		goto out;
>  	iov_iter_truncate(from, count);
> +	end = pos + count - 1;
>  
>  	if (mapping->nrpages) {
>  		ret = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
> -						    pos, pos + count - 1);
> +						   pos, end);
>  		if (ret)
>  			goto out;
>  		/*
> @@ -717,7 +720,7 @@ xfs_file_dio_aio_write(
>  		 */
>  		ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
>  					pos >> PAGE_CACHE_SHIFT,
> -					(pos + count - 1) >> PAGE_CACHE_SHIFT);
> +					end >> PAGE_CACHE_SHIFT);
>  		WARN_ON_ONCE(ret);
>  		ret = 0;
>  	}
> @@ -734,8 +737,22 @@ xfs_file_dio_aio_write(
>  	}
>  
>  	trace_xfs_file_direct_write(ip, count, iocb->ki_pos, 0);
> -	ret = generic_file_direct_write(iocb, from, pos);
>  
> +	data = *from;
> +	ret = mapping->a_ops->direct_IO(WRITE, iocb, &data, pos);
> +
> +	/* see generic_file_direct_write() for why this is necessary */
> +	if (mapping->nrpages) {
> +		invalidate_inode_pages2_range(mapping,
> +					      pos >> PAGE_CACHE_SHIFT,
> +					      end >> PAGE_CACHE_SHIFT);
> +	}
> +
> +	if (ret > 0) {
> +		pos += ret;
> +		iov_iter_advance(from, ret);
> +		iocb->ki_pos = pos;
> +	}
>  out:
>  	xfs_rw_iunlock(ip, iolock);
>  
> -- 
> 2.0.0
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/5] xfs: DIO requires an ioend for writes
  2015-04-10 20:21   ` Brian Foster
@ 2015-04-10 22:24     ` Dave Chinner
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2015-04-10 22:24 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Fri, Apr 10, 2015 at 04:21:37PM -0400, Brian Foster wrote:
> On Fri, Apr 10, 2015 at 11:37:56PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Right now unwritten extent conversion information is passed by
> > making the end IO private data non-null, which does not enable us to
> > pass any information from submission context to completion context,
> > which we need to use the standard IO completion paths.
> > 
> > Allocate an ioend in block allocation for direct IO and attach it to
> > the mapping buffer used during direct IO block allocation. Factor
> > the mapping code to make it obvious that this is happening only for
> > > direct IO writes, and place the mapping info and IO type
> > directly into the ioend for use in completion context.
> > 
> > The completion is changed to check the ioend type to determine if
> > unwritten extent completion is necessary or not.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_aops.c | 79 ++++++++++++++++++++++++++++++++++++++++++-------------
> >  1 file changed, 61 insertions(+), 18 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index 3a9b7a1..d95a42b 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -1233,6 +1233,57 @@ xfs_vm_releasepage(
> >  	return try_to_free_buffers(page);
> >  }
> >  
> > +static void
> > +xfs_get_blocks_map_buffer(
> > +	struct inode		*inode,
> > +	struct buffer_head	*bh_result,
> > +	int			create,
> > +	int			direct,
> > +	struct xfs_bmbt_irec	*imap,
> > +	xfs_off_t		offset,
> > +	ssize_t			size)
> > +{
> > +	struct xfs_ioend	*ioend;
> > +	int			type;
> > +
> > +	if (!create) {
> > +		/*
> > +		 * Unwritten extents do not report a disk address for
> > +		 * the read case (treat as if we're reading into a hole).
> > +		 */
> > +		if (!ISUNWRITTEN(imap))
> > +			xfs_map_buffer(inode, bh_result, imap, offset);
> > +		return;
> > +	}
> 
> This logic was kind of ugly to begin with, but I think the refactoring
> exposes it further. There's rather twisty logic here just for a case

Yup, I isolated it first to make it easy to change, not necessarily
easier to read ;)

....

> So if we pull some of the bits from xfs_get_blocks_map_buffer() back up,
> I end up with something like the following here. Compile tested
> only, but illustrates the point:
> 
>         /*
>          * Map the buffer as long as we have physical blocks and this isn't a
>          * read of an unwritten extent. Treat reads into unwritten extents as
>          * holes and thus do not return a mapping.
>          */
>         if (imap.br_startblock != HOLESTARTBLOCK &&
>             imap.br_startblock != DELAYSTARTBLOCK &&
>             (create || !ISUNWRITTEN(&imap))) {
>                 xfs_map_buffer(inode, bh_result, &imap, offset);
> 		/* unwritten implies create due to check above */
>                 if (ISUNWRITTEN(&imap))
>                         set_buffer_unwritten(bh_result);
>                 /* direct writes have a special mapping */
>                 if (create && direct) {
>                         error = xfs_map_direct(inode, bh_result, &imap, offset);
>                         if (error)
>                                 return error;
>                 }
>         }
> 
> I renamed the helper to xfs_map_direct(), killed everything therein up
> through the !direct check and killed both the create and direct params.
> Thoughts?

Yeah, that looks neater; I'll split and rework it in a similar
manner to this. Thanks!

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/5] xfs: direct IO needs to use append ioends
  2015-04-10 20:21   ` Brian Foster
@ 2015-04-10 22:30     ` Dave Chinner
  2015-04-11 21:12       ` Brian Foster
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2015-04-10 22:30 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Fri, Apr 10, 2015 at 04:21:47PM -0400, Brian Foster wrote:
> On Fri, Apr 10, 2015 at 11:37:57PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Now we have an ioend being passed unconditionally to the direct IO
> > write completion context, we can pass a preallocated transaction
> > handle for on-disk inode size updates that are run in completion.
> > 
> > At this point we really need to be passing the correct block range
> > that the IO spans through the ioend, so calculate the last block in
> > the mapping before we map the allocated range and use that instead
> > of the size desired by the direct IO.
> > 
> > This enables us to keep track of multiple get-blocks calls in the
> > same direct IO - the ioend will keep coming back to us, and we can
> > keep extending its range as new allocations and mappings are done.
> > 
> > There are some new trace points added for debugging, and a small
> > hack to actually make the tracepoints work (enums in tracepoints
> > that use __print_symbolic don't work correctly) that should be fixed
> > in the 4.1 merge window. This hack can be removed when the
> > tracepoint fix is upstream.
> > 
> > There are lots of comments explaining the intricacies of passing the
> > ioend and append transaction in the code; they are better placed in
> > the code because we're going to need them to understand why this
> > code does what it does in a few years time....
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> 
> I still need to look at this one (and grok the dio code more)... but an
> initial question: is this multiple get_blocks() call aggregation a
> requirement for the append ioend mechanism or an optimization? If the
> latter, I think a separate patch is more appropriate...

Requirement. Direct IO is a twisty maze of passages loaded with
deadly traps. e.g. non AIO path:

->direct_IO
  alloc dio(off, len)
    loop until all IO issued {
      get_blocks
      dio->private = bh_result->b_private
      build bio
      dio->ref++
      submit bio
    }

  dio_await_completion(dio)
  dio_complete(dio)
    dio->ref--  => goes to zero
     dio->end_io(filp, off, len, dio->private)
       xfs_end_io_direct_write(... off, len, ioend)


So, essentially, for as many bios that are mapped and submitted for
the direct IO, there is only one end IO completion call for the
entire IO. The multiple mappings we make need to aggregate the state
of the entire IO, not just the single instance....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/5] xfs: direct IO needs to use append ioends
  2015-04-10 22:30     ` Dave Chinner
@ 2015-04-11 21:12       ` Brian Foster
  0 siblings, 0 replies; 19+ messages in thread
From: Brian Foster @ 2015-04-11 21:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Sat, Apr 11, 2015 at 08:30:40AM +1000, Dave Chinner wrote:
> On Fri, Apr 10, 2015 at 04:21:47PM -0400, Brian Foster wrote:
> > On Fri, Apr 10, 2015 at 11:37:57PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Now we have an ioend being passed unconditionally to the direct IO
> > > write completion context, we can pass a preallocated transaction
> > > handle for on-disk inode size updates that are run in completion.
> > > 
> > > At this point we really need to be passing the correct block range
> > > that the IO spans through the ioend, so calculate the last block in
> > > the mapping before we map the allocated range and use that instead
> > > of the size desired by the direct IO.
> > > 
> > > This enables us to keep track of multiple get-blocks calls in the
> > > same direct IO - the ioend will keep coming back to us, and we can
> > > keep extending its range as new allocations and mappings are done.
> > > 
> > > There are some new trace points added for debugging, and a small
> > > hack to actually make the tracepoints work (enums in tracepoints
> > > that use __print_symbolic don't work correctly) that should be fixed
> > > in the 4.1 merge window. This hack can be removed when the
> > > tracepoint fix is upstream.
> > > 
> > > There are lots of comments explaining the intricacies of passing the
> > > ioend and append transaction in the code; they are better placed in
> > > the code because we're going to need them to understand why this
> > > code does what it does in a few years time....
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > 
> > I still need to look at this one (and grok the dio code more)... but an
> > initial question: is this multiple get_blocks() call aggregation a
> > requirement for the append ioend mechanism or an optimization? If the
> > latter, I think a separate patch is more appropriate...
> 
> Requirement. Direct IO is a twisty maze of passages loaded with
> deadly traps. e.g. non AIO path:
> 
> ->direct_IO
>   alloc dio(off, len)
>     loop until all IO issued {
>       get_blocks
>       dio->private = bh_result->b_private
>       build bio
>       dio->ref++
>       submit bio
>     }
> 
>   dio_await_completion(dio)
>   dio_complete(dio)
>     dio->ref--  => goes to zero
>      dio->end_io(filp, off, len, dio->private)
>        xfs_end_io_direct_write(... off, len, ioend)
> 
> 
> So, essentially, for as many bios that are mapped and submitted for
> the direct IO, there is only one end IO completion call for the
> entire IO. The multiple mappings we make need to aggregate the state
> of the entire IO, not just the single instance....
> 

Ok, thanks for the breakdown. Essentially, we need to track the
highest-precedence I/O type of the overall DIO with respect to the completion
handler. The patch itself is not hard to follow, but the dio path is a
different beast. What I didn't quite catch when first playing with this
is the mapping size optimization earlier in the get_blocks call that
effectively defeats some of this by reducing the need for multiple calls
in many cases. A single DIO write over a range of alternating map types
(e.g., alternating preallocated blocks and holes) is a better way to
trigger the ioend aggregation.

Additional comments to follow...

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/5] xfs: direct IO needs to use append ioends
  2015-04-10 13:37 ` [PATCH 2/5] xfs: direct IO needs to use append ioends Dave Chinner
  2015-04-10 20:21   ` Brian Foster
@ 2015-04-11 21:15   ` Brian Foster
  2015-04-12 23:31     ` Dave Chinner
  1 sibling, 1 reply; 19+ messages in thread
From: Brian Foster @ 2015-04-11 21:15 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Apr 10, 2015 at 11:37:57PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now we have an ioend being passed unconditionally to the direct IO
> write completion context, we can pass a preallocated transaction
> handle for on-disk inode size updates that are run in completion.
> 
> At this point we really need to be passing the correct block range
> that the IO spans through the ioend, so calculate the last block in
> the mapping before we map the allocated range and use that instead
> of the size desired by the direct IO.
> 
> This enables us to keep track of multiple get-blocks calls in the
> same direct IO - the ioend will keep coming back to us, and we can
> keep extending its range as new allocations and mappings are done.
> 
> There are some new trace points added for debugging, and a small
> hack to actually make the tracepoints work (enums in tracepoints
> that use __print_symbolic don't work correctly) that should be fixed
> in the 4.1 merge window. This hack can be removed when the
> tracepoint fix is upstream.
> 
> There are lots of comments explaining the intricacies of passing the
> ioend and append transaction in the code; they are better placed in
> the code because we're going to need them to understand why this
> code does what it does in a few years time....
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_aops.c  | 262 +++++++++++++++++++++++++++++++++++++----------------
>  fs/xfs/xfs_trace.h |  10 +-
>  2 files changed, 194 insertions(+), 78 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index d95a42b..52c7e46 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -178,6 +178,25 @@ xfs_setfilesize_ioend(
>  	return xfs_setfilesize(ip, tp, ioend->io_offset, ioend->io_size);
>  }
>  
> +STATIC void
> +xfs_setfilesize_ioend_cancel(
> +	struct xfs_ioend	*ioend)
> +{
> +	struct xfs_trans	*tp = ioend->io_append_trans;
> +
> +	/*
> +	 * The transaction may have been allocated in the I/O submission thread,
> +	 * thus we need to mark ourselves as being in a transaction manually.
> +	 * Similarly for freeze protection.
> +	 */

This threw me off at first because we can call this from either the
submission or the completion context, unlike the commit case where this
comment is copied from. Could we move the comment above the function and
clarify a bit? E.g., something like the following is a bit more clear to
me:

/*
 * The process transaction and freeze protection state is cleared immediately
 * after setfilesize transaction allocation to support transfer of the tp from
 * submission to completion context. Restore the context appropriately to cancel
 * the transaction.
 */

> +	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +	rwsem_acquire_read(&ioend->io_inode->i_sb->s_writers.lock_map[SB_FREEZE_FS-1],
> +			   0, 1, _THIS_IP_);
> +
> +	xfs_trans_cancel(tp, 0);
> +	ioend->io_append_trans = NULL;
> +}
> +
>  /*
>   * Schedule IO completion handling on the final put of an ioend.
>   *
> @@ -1233,18 +1252,18 @@ xfs_vm_releasepage(
>  	return try_to_free_buffers(page);
>  }
>  
> -static void
> +static int
>  xfs_get_blocks_map_buffer(
>  	struct inode		*inode,
>  	struct buffer_head	*bh_result,
>  	int			create,
>  	int			direct,
>  	struct xfs_bmbt_irec	*imap,
> -	xfs_off_t		offset,
> -	ssize_t			size)
> +	xfs_off_t		offset)
>  {
>  	struct xfs_ioend	*ioend;
>  	int			type;
> +	loff_t			size;
>  
>  	if (!create) {
>  		/*
> @@ -1253,7 +1272,7 @@ xfs_get_blocks_map_buffer(
>  		 */
>  		if (!ISUNWRITTEN(imap))
>  			xfs_map_buffer(inode, bh_result, imap, offset);
> -		return;
> +		return 0;
>  	}
>  
>  	xfs_map_buffer(inode, bh_result, imap, offset);
> @@ -1262,26 +1281,93 @@ xfs_get_blocks_map_buffer(
>  		set_buffer_unwritten(bh_result);
>  
>  	if (!direct)
> -		return;
> +		return 0;
>  
>  	/*
> -	 * Direct IO writes require an ioend to be allocated and
> -	 * passed via the returned mapping. This allows the end
> -	 * io function to determine the correct course of
> -	 * action.
> +	 * Direct IO writes require an ioend to be allocated and passed via the
> +	 * returned mapping. This allows the end io function to determine the
> +	 * correct course of action.
> +	 *
> +	 * Unwritten extents will need transactions at completion, so are known
> +	 * to need deferring to a workqueue. However, for writes into written
> +	 * extents, we *may* need a transaction if this IO extends the on-disk
> +	 * EOF. Because we can race with other IOs the file may already be
> +	 * extended by the time we get to the transaction. IO completion already
> +	 * handles that case so all we will have done is incurred the overhead
> +	 * of workqueue deferral for completion. This is acceptable overhead for
> +	 * the rare case that this occurs.
>  	 */
> -
>  	if (ISUNWRITTEN(imap)) {
>  		type = XFS_IO_UNWRITTEN;
>  		set_buffer_defer_completion(bh_result);
>  	} else
>  		type = XFS_IO_OVERWRITE;
> -	ioend = xfs_alloc_ioend(inode, type);
> -	ioend->io_offset = offset;
> -	ioend->io_size = size;
> -	bh_result->b_private = ioend;
>  
> -	return;
> +	/*
> +	 * The offset that is passed in is the first block the DIO will fall
> +	 * into. The size supplied by the DIO layer is what it thinks it needs
> +	 * but the mapping may not span this entire range. Hence we use the
> +	 * truncated mapping size that's already been stashed in the bh_result
> +	 * to calculate the range covered by the ioend.
> +	 */
> +	size = bh_result->b_size;
> +	trace_xfs_gbmap_direct(XFS_I(inode), offset, size, type, imap);
> +
> +	/*
> +	 * If we get multiple mappings in a single IO, we might be mapping
> +	 * dfferent types. But because the direct IO can only have a single

	   different

> +	 * private pointer, we need to ensure that:
> +	 *
> +	 * a) the ioend spans the entire region of the IO; and
> +	 * b) if it contains unwritten extents, it is *permanently* marked as
> +	 *    such and we cancel any append transaction attached to the ioend.
> +	 *
> +	 * We could do this by chaining ioends like buffered IO does, but
> +	 * we only actually get one IO completion callback from the direct IO,
> +	 * and that spans the entire IO regardless of how many mappings and IOs
> +	 * are needed to complete the DIO. There is only going to be one
> +	 * reference to the ioend and its life cycle is constrained by the
> +	 * DIO completion code. Hence we don't need reference counting here.
> +	 */
> +	if (bh_result->b_private) {
> +		ioend = bh_result->b_private;
> +		ASSERT(ioend->io_size > 0);
> +		ASSERT(offset >= ioend->io_offset);
> +
> +		if (offset + size > ioend->io_offset + ioend->io_size)
> +			ioend->io_size = offset - ioend->io_offset + size;
> +
> +		if (type == XFS_IO_UNWRITTEN) {

		if (type == XFS_IO_UNWRITTEN &&
		    ioend->io_type != XFS_IO_UNWRITTEN)

... makes that a bit easier to follow imo.

> +			if (ioend->io_append_trans)
> +				xfs_setfilesize_ioend_cancel(ioend);
> +			ioend->io_type = XFS_IO_UNWRITTEN;
> +		}
> +		trace_xfs_gbmap_direct_update(XFS_I(inode), ioend->io_offset,
> +					      ioend->io_size, ioend->io_type,
> +					      imap);
> +	} else {
> +		ioend = xfs_alloc_ioend(inode, type);
> +		ioend->io_offset = offset;
> +		ioend->io_size = size;
> +		bh_result->b_private = ioend;
> +		trace_xfs_gbmap_direct_new(XFS_I(inode), offset, size, type,
> +					   imap);
> +	}
> +
> +	/* check if we need an append transaction allocated. */
> +	if (ioend->io_type == XFS_IO_OVERWRITE &&
> +	    xfs_ioend_is_append(ioend) && !ioend->io_append_trans) {
> +		int	error;
> +
> +		error = xfs_setfilesize_trans_alloc(ioend);

I'm not totally convinced this is safe. We previously moved this tp
allocation from before a potential xfs_iomap_direct_write() call to the
completion side to avoid nesting this allocation with unwritten extent
allocation transactions. See the following for reference:

	437a255a xfs: fix direct IO nested transaction deadlock

Now we move it after that point of the codepath, and even then we know
that this is an overwrite if we do the allocation here. If we continue
on and hit a hole, it looks like there's still a sequence to allocate
this transaction and call xfs_iomap_write_direct(), nesting the
associated transaction reservations. Am I missing something?

Brian

> +		ASSERT(!error);
> +		if (error) {
> +			xfs_destroy_ioend(ioend);
> +			return error;
> +		}
> +		set_buffer_defer_completion(bh_result);
> +	}
> +	return 0;
>  }
>  
>  STATIC int
> @@ -1374,50 +1460,19 @@ __xfs_get_blocks(
>  			xfs_iunlock(ip, lockmode);
>  		}
>  
> -		trace_xfs_get_blocks_alloc(ip, offset, size, 0, &imap);
> +		trace_xfs_get_blocks_alloc(ip, offset, size,
> +				ISUNWRITTEN(&imap) ? XFS_IO_UNWRITTEN
> +						   : XFS_IO_DELALLOC, &imap);
>  	} else if (nimaps) {
> -		trace_xfs_get_blocks_found(ip, offset, size, 0, &imap);
> +		trace_xfs_get_blocks_found(ip, offset, size,
> +				ISUNWRITTEN(&imap) ? XFS_IO_UNWRITTEN
> +						   : XFS_IO_OVERWRITE, &imap);
>  		xfs_iunlock(ip, lockmode);
>  	} else {
>  		trace_xfs_get_blocks_notfound(ip, offset, size);
>  		goto out_unlock;
>  	}
>  
> -	if (imap.br_startblock != HOLESTARTBLOCK &&
> -	    imap.br_startblock != DELAYSTARTBLOCK)
> -		xfs_get_blocks_map_buffer(inode, bh_result, create, direct,
> -					  &imap, offset, size);
> -
> -	/*
> -	 * If this is a realtime file, data may be on a different device.
> -	 * to that pointed to from the buffer_head b_bdev currently.
> -	 */
> -	bh_result->b_bdev = xfs_find_bdev_for_inode(inode);
> -
> -	/*
> -	 * If we previously allocated a block out beyond eof and we are now
> -	 * coming back to use it then we will need to flag it as new even if it
> -	 * has a disk address.
> -	 *
> -	 * With sub-block writes into unwritten extents we also need to mark
> -	 * the buffer as new so that the unwritten parts of the buffer gets
> -	 * correctly zeroed.
> -	 */
> -	if (create &&
> -	    ((!buffer_mapped(bh_result) && !buffer_uptodate(bh_result)) ||
> -	     (offset >= i_size_read(inode)) ||
> -	     (new || ISUNWRITTEN(&imap))))
> -		set_buffer_new(bh_result);
> -
> -	if (imap.br_startblock == DELAYSTARTBLOCK) {
> -		BUG_ON(direct);
> -		if (create) {
> -			set_buffer_uptodate(bh_result);
> -			set_buffer_mapped(bh_result);
> -			set_buffer_delay(bh_result);
> -		}
> -	}
> -
>  	/*
>  	 * If this is O_DIRECT or the mpage code calling tell them how large
>  	 * the mapping is, so that we can avoid repeated get_blocks calls.
> @@ -1451,6 +1506,46 @@ __xfs_get_blocks(
>  		bh_result->b_size = mapping_size;
>  	}
>  
> +	if (imap.br_startblock != HOLESTARTBLOCK &&
> +	    imap.br_startblock != DELAYSTARTBLOCK) {
> +		error = xfs_get_blocks_map_buffer(inode, bh_result, create,
> +						  direct, &imap, offset);
> +		if (error)
> +			return error;
> +	}
> +	if (create && direct)
> +		ASSERT(bh_result->b_private);
> +
> +	/*
> +	 * If this is a realtime file, data may be on a different device.
> +	 * to that pointed to from the buffer_head b_bdev currently.
> +	 */
> +	bh_result->b_bdev = xfs_find_bdev_for_inode(inode);
> +
> +	/*
> +	 * If we previously allocated a block out beyond eof and we are now
> +	 * coming back to use it then we will need to flag it as new even if it
> +	 * has a disk address.
> +	 *
> +	 * With sub-block writes into unwritten extents we also need to mark
> +	 * the buffer as new so that the unwritten parts of the buffer gets
> +	 * correctly zeroed.
> +	 */
> +	if (create &&
> +	    ((!buffer_mapped(bh_result) && !buffer_uptodate(bh_result)) ||
> +	     (offset >= i_size_read(inode)) ||
> +	     (new || ISUNWRITTEN(&imap))))
> +		set_buffer_new(bh_result);
> +
> +	if (imap.br_startblock == DELAYSTARTBLOCK) {
> +		BUG_ON(direct);
> +		if (create) {
> +			set_buffer_uptodate(bh_result);
> +			set_buffer_mapped(bh_result);
> +			set_buffer_delay(bh_result);
> +		}
> +	}
> +
>  	return 0;
>  
>  out_unlock:
> @@ -1501,38 +1596,51 @@ xfs_end_io_direct_write(
>  		goto out_destroy_ioend;
>  
>  	/*
> -	 * While the generic direct I/O code updates the inode size, it does
> -	 * so only after the end_io handler is called, which means our
> -	 * end_io handler thinks the on-disk size is outside the in-core
> -	 * size.  To prevent this just update it a little bit earlier here.
> +	 * dio completion end_io functions are only called on writes if more
> +	 * than 0 bytes was written.
>  	 */
> -	if (offset + size > i_size_read(inode))
> -		i_size_write(inode, offset + size);
> +	ASSERT(size > 0);
>  
>  	/*
> -	 * For direct I/O we do not know if we need to allocate blocks or not,
> -	 * so we can't preallocate an append transaction, as that results in
> -	 * nested reservations and log space deadlocks. Hence allocate the
> -	 * transaction here. While this is sub-optimal and can block IO
> -	 * completion for some time, we're stuck with doing it this way until
> -	 * we can pass the ioend to the direct IO allocation callbacks and
> -	 * avoid nesting that way.
> +	 * The ioend only maps whole blocks, while the IO may be sector aligned.
> +	 * Hence the ioend offset/size may not match the IO offset/size exactly,
> +	 * but should span it completely. Write the IO sizes into the ioend so
> +	 * that completion processing does the right thing.
>  	 */
> -	if (ioend->io_type == XFS_IO_UNWRITTEN && size > 0) {
> -		xfs_iomap_write_unwritten(ip, offset, size);
> -	} else if (offset + size > ip->i_d.di_size) {
> -		struct xfs_trans	*tp;
> -		int			error;
> +	ASSERT(size <= ioend->io_size);
> +	ASSERT(offset >= ioend->io_offset);
> +	ASSERT(offset + size <= ioend->io_offset + ioend->io_size);
> +	ioend->io_size = size;
> +	ioend->io_offset = offset;
>  
> -		tp = xfs_trans_alloc(mp, XFS_TRANS_FSYNC_TS);
> -		error = xfs_trans_reserve(tp, &M_RES(mp)->tr_fsyncts, 0, 0);
> -		if (error) {
> -			xfs_trans_cancel(tp, 0);
> -			goto out_destroy_ioend;
> -		}
> +	/*
> +	 * The ioend tells us whether we are doing unwritten extent conversion
> +	 * or an append transaction that updates the on-disk file size. These
> +	 * cases are the only cases where we should *potentially* be needing
> +	 * to update the VFS inode size. When the ioend indicates this, we
> +	 * are *guaranteed* to be running in non-interrupt context.
> +	 *
> +	 * We need to update the in-core inode size here so that we don't end up
> +	 * with the on-disk inode size being outside the in-core inode size.
> +	 * While we can do this in the process context after the IO has
> +	 * completed, this does not work for AIO and hence we always update
> +	 * the in-core inode size here if necessary.
> +	 */
> +	if (ioend->io_type == XFS_IO_UNWRITTEN || ioend->io_append_trans) {
> +		if (offset + size > i_size_read(inode))
> +			i_size_write(inode, offset + size);
> +	} else
> +		ASSERT(offset + size <= i_size_read(inode));
>  
> -		xfs_setfilesize(ip, tp, offset, size);
> +	/* Ugh. No way to propagate errors, so ignore them. */
> +	if (ioend->io_type == XFS_IO_UNWRITTEN) {
> +		xfs_iomap_write_unwritten(ip, offset, size);
> +	} else if (ioend->io_append_trans) {
> +		xfs_setfilesize_ioend(ioend);
> +	} else {
> +		ASSERT(!xfs_ioend_is_append(ioend));
>  	}
> +
>  out_destroy_ioend:
>  	xfs_destroy_ioend(ioend);
>  }
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index b2a45cc..a584c27 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -1200,13 +1200,18 @@ DECLARE_EVENT_CLASS(xfs_imap_class,
>  		__entry->blockcount = irec ? irec->br_blockcount : 0;
>  	),
>  	TP_printk("dev %d:%d ino 0x%llx size 0x%llx offset 0x%llx count %zd "
> -		  "type %s startoff 0x%llx startblock %lld blockcount 0x%llx",
> +		  "type %d startoff 0x%llx startblock %lld blockcount 0x%llx",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  __entry->ino,
>  		  __entry->size,
>  		  __entry->offset,
>  		  __entry->count,
> +		  /*
> +		   * XXX: __print_symbolic broken for enums, fix coming in 4.1
> +		   * cycle from Mr Rostedt. Need to know type now, so...
>  		  __print_symbolic(__entry->type, XFS_IO_TYPES),
> +		   */
> +		  __entry->type,
>  		  __entry->startoff,
>  		  (__int64_t)__entry->startblock,
>  		  __entry->blockcount)
> @@ -1221,6 +1226,9 @@ DEFINE_IOMAP_EVENT(xfs_map_blocks_found);
>  DEFINE_IOMAP_EVENT(xfs_map_blocks_alloc);
>  DEFINE_IOMAP_EVENT(xfs_get_blocks_found);
>  DEFINE_IOMAP_EVENT(xfs_get_blocks_alloc);
> +DEFINE_IOMAP_EVENT(xfs_gbmap_direct);
> +DEFINE_IOMAP_EVENT(xfs_gbmap_direct_new);
> +DEFINE_IOMAP_EVENT(xfs_gbmap_direct_update);
>  
>  DECLARE_EVENT_CLASS(xfs_simple_io_class,
>  	TP_PROTO(struct xfs_inode *ip, xfs_off_t offset, ssize_t count),
> -- 
> 2.0.0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/5] xfs: fix direct IO completion issues
  2015-04-10 13:37 [PATCH 0/5] xfs: fix direct IO completion issues Dave Chinner
                   ` (4 preceding siblings ...)
  2015-04-10 13:38 ` [PATCH 5/5] xfs: using generic_file_direct_write() is unnecessary Dave Chinner
@ 2015-04-12 15:09 ` Christoph Hellwig
  2015-04-12 23:22   ` Dave Chinner
  5 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2015-04-12 15:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Apr 10, 2015 at 11:37:55PM +1000, Dave Chinner wrote:
> Hi folks,
> 
> This patchset addresses the deeper problems Brian outlined in the
> description of this patch:
> 
> http://oss.sgi.com/archives/xfs/2015-04/msg00071.html
> 
> The basic issue is that DIO completion can run in interrupt context
> and it does things it should not do in interrupt context because Bad
> Things Will Happen.

Where do we complete DIO writes from irq context?  Since my direct-io.c
changes from a few years ago that should not be the case.


* Re: [PATCH 0/5] xfs: fix direct IO completion issues
  2015-04-12 15:09 ` [PATCH 0/5] xfs: fix direct IO completion issues Christoph Hellwig
@ 2015-04-12 23:22   ` Dave Chinner
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2015-04-12 23:22 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Sun, Apr 12, 2015 at 08:09:32AM -0700, Christoph Hellwig wrote:
> On Fri, Apr 10, 2015 at 11:37:55PM +1000, Dave Chinner wrote:
> > Hi folks,
> > 
> > This patchset addresses the deeper problems Brian outlined in the
> > description of this patch:
> > 
> > http://oss.sgi.com/archives/xfs/2015-04/msg00071.html
> > 
> > The basic issue is that DIO completion can run in interrupt context
> > and it does things it should not do in interrupt context because Bad
> > Things Will Happen.
> 
> Where do we complete DIO writes from irq context?  Since my direct-io.c
> changes from a few years ago that should not be the case.

Yes, that's what I thought, too. However any AIO direct IO write
that does not call set_buffer_defer_completion() will run completion
in interrupt context. The current code in __xfs_get_blocks() sets
that flag only when:

                if (create && ISUNWRITTEN(&imap)) {
                        if (direct) {
                                bh_result->b_private = inode;
                                set_buffer_defer_completion(bh_result);
                        }
                        set_buffer_unwritten(bh_result);
                }

And hence only writes into unwritten extents will be deferred to the
DIO completion workqueue.  So sub-block writes that extend EOF (the
trace below), or extending writes into blocks beyond EOF allocated by
delalloc speculative preallocation, will run transactions in irq
context to update the on-disk EOF.

This is the stack trace from testing the simple "use a spinlock
around i_size_write()" patches that pointed out how wrong we'd been:

[  375.648323] run fstests generic/036 at 2015-04-08 08:58:45
[  380.661832] BUG: spinlock cpu recursion on CPU#3, aio-dio-fcntl-r/27068
[  380.662898]  lock: 0xffff8800afe88b70, .magic: dead4ead, .owner: <none>/-1, .owner_cpu: 3
[  380.664232] CPU: 3 PID: 27068 Comm: aio-dio-fcntl-r Not tainted 4.0.0-rc4-dgc+ #870
[  380.665393] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[  380.665769]  ffff8800afe88b70 ffff88013fd83bf8 ffffffff81dccb34 0000000000000000
[  380.665769]  0000000000000000 ffff88013fd83c18 ffffffff81dc6754 ffff8800afe88b70
[  380.665769]  ffffffff821ed962 ffff88013fd83c38 ffffffff81dc677f ffff8800afe88b70
[  380.665769] Call Trace:
[  380.665769]  <IRQ>  [<ffffffff81dccb34>] dump_stack+0x4c/0x65
[  380.665769]  [<ffffffff81dc6754>] spin_dump+0x90/0x95
[  380.665769]  [<ffffffff81dc677f>] spin_bug+0x26/0x2b
[  380.665769]  [<ffffffff810e4058>] do_raw_spin_lock+0x128/0x1a0
[  380.665769]  [<ffffffff81dd7dd5>] _raw_spin_lock+0x15/0x20
[  380.665769]  [<ffffffff814f49cc>] xfs_end_io_direct_write+0x5c/0x110
[  380.665769]  [<ffffffff812085c3>] dio_complete+0xf3/0x160
[  380.665769]  [<ffffffff812086a3>] dio_bio_end_aio+0x73/0x100
[  380.665769]  [<ffffffff810c85e2>] ? default_wake_function+0x12/0x20
[  380.665769]  [<ffffffff817c461b>] bio_endio+0x5b/0xa0
[  380.665769]  [<ffffffff817cc3a0>] blk_update_request+0x90/0x370
[  380.665769]  [<ffffffff817d565a>] blk_mq_end_request+0x1a/0x70
[  380.665769]  [<ffffffff81ad393f>] virtblk_request_done+0x3f/0x70
[  380.665769]  [<ffffffff817d5fce>] __blk_mq_complete_request+0x8e/0x120
[  380.665769]  [<ffffffff817d6076>] blk_mq_complete_request+0x16/0x20
[  380.665769]  [<ffffffff81ad347e>] virtblk_done+0x6e/0xf0
[  380.665769]  [<ffffffff81893af5>] vring_interrupt+0x35/0x60
[  380.665769]  [<ffffffff810f27de>] handle_irq_event_percpu+0x3e/0x1c0
[  380.665769]  [<ffffffff810f29a1>] handle_irq_event+0x41/0x70
[  380.665769]  [<ffffffff810f573f>] handle_edge_irq+0x7f/0x120
[  380.665769]  [<ffffffff8104d2f2>] handle_irq+0x22/0x40
[  380.665769]  [<ffffffff81ddaf31>] do_IRQ+0x51/0xf0
[  380.665769]  [<ffffffff81dd8fad>] common_interrupt+0x6d/0x6d
[  380.665769]  <EOI>  [<ffffffff810e4033>] ? do_raw_spin_lock+0x103/0x1a0
[  380.665769]  [<ffffffff81dd7dd5>] _raw_spin_lock+0x15/0x20
[  380.665769]  [<ffffffff81502f28>] xfs_file_aio_write_checks+0x58/0x130
[  380.665769]  [<ffffffff815030ce>] xfs_file_dio_aio_write+0xce/0x410
[  380.665769]  [<ffffffff811d0f08>] ? __sb_start_write+0x58/0x120
[  380.665769]  [<ffffffff815036fe>] xfs_file_write_iter+0x7e/0x120
[  380.665769]  [<ffffffff81503680>] ? xfs_file_buffered_aio_write+0x270/0x270
[  380.665769]  [<ffffffff81218cd3>] aio_run_iocb+0x203/0x3c0
[  380.665769]  [<ffffffff810c107d>] ? __might_sleep+0x4d/0x90
[  380.665769]  [<ffffffff810c107d>] ? __might_sleep+0x4d/0x90
[  380.665769]  [<ffffffff81219b4f>] do_io_submit+0x19f/0x410
[  380.665769]  [<ffffffff81219dd0>] SyS_io_submit+0x10/0x20
[  380.665769]  [<ffffffff81dd84c9>] system_call_fastpath+0x12/0x17

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 2/5] xfs: direct IO needs to use append ioends
  2015-04-11 21:15   ` Brian Foster
@ 2015-04-12 23:31     ` Dave Chinner
  2015-04-13 11:20       ` Brian Foster
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2015-04-12 23:31 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Sat, Apr 11, 2015 at 05:15:18PM -0400, Brian Foster wrote:
> On Fri, Apr 10, 2015 at 11:37:57PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Now we have an ioend being passed unconditionally to the direct IO
> > write completion context, we can pass a preallocated transaction
> > handle for on-disk inode size updates that are run in completion.
.....
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -178,6 +178,25 @@ xfs_setfilesize_ioend(
> >  	return xfs_setfilesize(ip, tp, ioend->io_offset, ioend->io_size);
> >  }
> >  
> > +STATIC void
> > +xfs_setfilesize_ioend_cancel(
> > +	struct xfs_ioend	*ioend)
> > +{
> > +	struct xfs_trans	*tp = ioend->io_append_trans;
> > +
> > +	/*
> > +	 * The transaction may have been allocated in the I/O submission thread,
> > +	 * thus we need to mark ourselves as being in a transaction manually.
> > +	 * Similarly for freeze protection.
> > +	 */
> 
> This threw me off at first because we can call this from either the
> submission or the completion context, unlike the commit case where this
> comment is copied from. Could we move the comment above the function and
> clarify a bit? E.g., something like the following is a bit more clear to
> me:
> 
> /*
>  * The process transaction and freeze protection state is cleared immediately
>  * after setfilesize transaction allocation to support transfer of the tp from
>  * submission to completion context. Restore the context appropriately to cancel
>  * the transaction.
>  */

OK, I can do that, but given your next comments, it might just go
away.

> > +	} else {
> > +		ioend = xfs_alloc_ioend(inode, type);
> > +		ioend->io_offset = offset;
> > +		ioend->io_size = size;
> > +		bh_result->b_private = ioend;
> > +		trace_xfs_gbmap_direct_new(XFS_I(inode), offset, size, type,
> > +					   imap);
> > +	}
> > +
> > +	/* check if we need an append transaction allocated. */
> > +	if (ioend->io_type == XFS_IO_OVERWRITE &&
> > +	    xfs_ioend_is_append(ioend) && !ioend->io_append_trans) {
> > +		int	error;
> > +
> > +		error = xfs_setfilesize_trans_alloc(ioend);
> 
> I'm not totally convinced this is safe. We previously moved this tp
> allocation from before a potential xfs_iomap_write_direct() call to the
> completion side to avoid nesting this allocation with unwritten extent
> allocation transactions. See the following for reference:
> 
> 	437a255a xfs: fix direct IO nested transaction deadlock
> 
> Now we move it after that point of the codepath, and even then we know
> that this is an overwrite if we do the allocation here. If we continue
> on and hit a hole, it looks like there's still a sequence to allocate
> this transaction and call xfs_iomap_write_direct(), nesting the
> associated transaction reservations. Am I missing something?

No, I didn't really think this part through fully. I knew that we'd
get multiple calls, and we'd get multiple allocations, but for some
reason the penny didn't drop.

What it comes down to is that either we jump through lots of hoops
in __xfs_get_blocks() to handle this case (i.e.
cancel/allocate/reserve on repeat calls) or we just allocate it in
IO completion context as we currently are doing.

I'll have a look at what cancel/allocate/reserve looks like - it
might actually simplify the logic - and go from there.

Thanks for catching my silly thinko, Brain!

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 2/5] xfs: direct IO needs to use append ioends
  2015-04-12 23:31     ` Dave Chinner
@ 2015-04-13 11:20       ` Brian Foster
  0 siblings, 0 replies; 19+ messages in thread
From: Brian Foster @ 2015-04-13 11:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Mon, Apr 13, 2015 at 09:31:02AM +1000, Dave Chinner wrote:
> On Sat, Apr 11, 2015 at 05:15:18PM -0400, Brian Foster wrote:
> > On Fri, Apr 10, 2015 at 11:37:57PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Now we have an ioend being passed unconditionally to the direct IO
> > > write completion context, we can pass a preallocated transaction
> > > handle for on-disk inode size updates that are run in completion.
> .....
> > > --- a/fs/xfs/xfs_aops.c
> > > +++ b/fs/xfs/xfs_aops.c
> > > @@ -178,6 +178,25 @@ xfs_setfilesize_ioend(
> > >  	return xfs_setfilesize(ip, tp, ioend->io_offset, ioend->io_size);
> > >  }
> > >  
> > > +STATIC void
> > > +xfs_setfilesize_ioend_cancel(
> > > +	struct xfs_ioend	*ioend)
> > > +{
> > > +	struct xfs_trans	*tp = ioend->io_append_trans;
> > > +
> > > +	/*
> > > +	 * The transaction may have been allocated in the I/O submission thread,
> > > +	 * thus we need to mark ourselves as being in a transaction manually.
> > > +	 * Similarly for freeze protection.
> > > +	 */
> > 
> > This threw me off at first because we can call this from either the
> > submission or the completion context, unlike the commit case where this
> > comment is copied from. Could we move the comment above the function and
> > clarify a bit? E.g., something like the following is a bit more clear to
> > me:
> > 
> > /*
> >  * The process transaction and freeze protection state is cleared immediately
> >  * after setfilesize transaction allocation to support transfer of the tp from
> >  * submission to completion context. Restore the context appropriately to cancel
> >  * the transaction.
> >  */
> 
> OK, I can do that, but given your next comments, it might just go
> away.
> 
> > > +	} else {
> > > +		ioend = xfs_alloc_ioend(inode, type);
> > > +		ioend->io_offset = offset;
> > > +		ioend->io_size = size;
> > > +		bh_result->b_private = ioend;
> > > +		trace_xfs_gbmap_direct_new(XFS_I(inode), offset, size, type,
> > > +					   imap);
> > > +	}
> > > +
> > > +	/* check if we need an append transaction allocated. */
> > > +	if (ioend->io_type == XFS_IO_OVERWRITE &&
> > > +	    xfs_ioend_is_append(ioend) && !ioend->io_append_trans) {
> > > +		int	error;
> > > +
> > > +		error = xfs_setfilesize_trans_alloc(ioend);
> > 
> > I'm not totally convinced this is safe. We previously moved this tp
> > allocation from before a potential xfs_iomap_write_direct() call to the
> > completion side to avoid nesting this allocation with unwritten extent
> > allocation transactions. See the following for reference:
> > 
> > 	437a255a xfs: fix direct IO nested transaction deadlock
> > 
> > Now we move it after that point of the codepath, and even then we know
> > that this is an overwrite if we do the allocation here. If we continue
> > on and hit a hole, it looks like there's still a sequence to allocate
> > this transaction and call xfs_iomap_write_direct(), nesting the
> > associated transaction reservations. Am I missing something?
> 
> No, I didn't really think this part through fully. I knew that we'd
> get multiple calls, and we'd get multiple allocations, but for some
> reason the penny didn't drop.
> 
> What it comes down to is that either we jump through lots of hoops
> in __xfs_get_blocks() to handle this case (i.e.
> cancel/allocate/reserve on repeat calls) or we just allocate it in
> IO completion context as we currently are doing.
> 

Ok, I figured it would be one of those two approaches. The latter seems
cleaner to me just from thinking about it, since the former adds another
spot where we have to consider the direct write case (as opposed to
having it factored cleanly into the new helper). Maybe it can be better
organized than that, splitting up the helper perhaps, so I'll reserve
judgement. :)

Could we get a v2 of the race fix posted with the proper locking and
reviewed-by tags and whatnot? A reply to the patch in this thread is
fine if the broader rework is still in flux. I'd just like to have an
upstream reference for a backport of that one...

Brian

> I'll have a look at what cancel/allocate/reserve looks like - it
> might actually simplify the logic - and go from there.
> 
> Thanks for catching my silly thinko, Brain!
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

end of thread, other threads:[~2015-04-13 11:21 UTC | newest]

Thread overview: 19+ messages
2015-04-10 13:37 [PATCH 0/5] xfs: fix direct IO completion issues Dave Chinner
2015-04-10 13:37 ` [PATCH 1/5] xfs: DIO requires an ioend for writes Dave Chinner
2015-04-10 20:21   ` Brian Foster
2015-04-10 22:24     ` Dave Chinner
2015-04-10 13:37 ` [PATCH 2/5] xfs: direct IO needs to use append ioends Dave Chinner
2015-04-10 20:21   ` Brian Foster
2015-04-10 22:30     ` Dave Chinner
2015-04-11 21:12       ` Brian Foster
2015-04-11 21:15   ` Brian Foster
2015-04-12 23:31     ` Dave Chinner
2015-04-13 11:20       ` Brian Foster
2015-04-10 13:37 ` [PATCH 3/5] xfs: DIO write completion size updates race Dave Chinner
2015-04-10 20:22   ` Brian Foster
2015-04-10 13:37 ` [PATCH 4/5] xfs: direct IO EOF zeroing needs to drain AIO Dave Chinner
2015-04-10 20:22   ` Brian Foster
2015-04-10 13:38 ` [PATCH 5/5] xfs: using generic_file_direct_write() is unnecessary Dave Chinner
2015-04-10 20:22   ` Brian Foster
2015-04-12 15:09 ` [PATCH 0/5] xfs: fix direct IO completion issues Christoph Hellwig
2015-04-12 23:22   ` Dave Chinner
