* [PATCH 0/8 v2] xfs: fix direct IO completion issues
@ 2015-04-14  7:26 Dave Chinner
  2015-04-14  7:26 ` [PATCH 1/8] xfs: factor DIO write mapping from get_blocks Dave Chinner
                   ` (7 more replies)
  0 siblings, 8 replies; 21+ messages in thread
From: Dave Chinner @ 2015-04-14  7:26 UTC (permalink / raw)
  To: xfs

Hi folks,

This is the second version of the series I first posted here:

http://oss.sgi.com/archives/xfs/2015-04/msg00097.html

I've reworked and refactored the first two patches in the series,
breaking them out into simpler, more self-contained patches.

The main change from the previous version is that I dropped the
pre-allocation of the append transaction to avoid the nesting
problems it could cause.

The only other major change is that the ioend is only allocated when a
mapping either covers an unwritten region or spans EOF. Hence, if the
mapping is an overwrite within EOF, we do not allocate an ioend, we do
not defer completion to the DIO workqueue, and we do nothing in the
completion function as there is nothing to do.
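
As a rough illustration of that policy, here is a minimal userspace
sketch (stand-in names and types; nothing here is the actual XFS code):

#include <stdbool.h>
#include <stdio.h>

/*
 * Sketch of the allocation policy described above: an ioend (and hence
 * deferred completion) is only needed if the mapping is unwritten or
 * the write extends beyond the current in-core EOF.
 */
static bool dio_write_needs_ioend(bool unwritten, long long offset,
                                  long long size, long long isize)
{
        return unwritten || offset + size > isize;
}

int main(void)
{
        long long isize = 1 << 20;      /* pretend 1MB file */

        /* overwrite wholly within EOF: no ioend, completion is a no-op */
        printf("%d\n", dio_write_needs_ioend(false, 4096, 4096, isize));
        /* write crossing EOF: needs an ioend for the EOF size update */
        printf("%d\n", dio_write_needs_ioend(false, isize - 512, 4096, isize));
        /* unwritten extent: needs an ioend for extent conversion */
        printf("%d\n", dio_write_needs_ioend(true, 4096, 4096, isize));
        return 0;
}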

Comments, thoughts?

-Dave.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/8] xfs: factor DIO write mapping from get_blocks
  2015-04-14  7:26 [PATCH 0/8 v2] xfs: fix direct IO completion issues Dave Chinner
@ 2015-04-14  7:26 ` Dave Chinner
  2015-04-14 14:23   ` Brian Foster
  2015-04-14  7:26 ` [PATCH 2/8] xfs: move DIO mapping size calculation Dave Chinner
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2015-04-14  7:26 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Clarify and separate the buffer mapping logic so that the direct IO mapping is
not tangled up in propagating the extent status to the mapping buffer. This
makes it easier to extend the direct IO mapping to use an ioend in future.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c | 36 +++++++++++++++++++++++-------------
 1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 3a9b7a1..5f7ddd5 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1233,6 +1233,22 @@ xfs_vm_releasepage(
 	return try_to_free_buffers(page);
 }
 
+/*
+ * do all the direct IO specific mapping buffer manipulation here.
+ */
+static void
+xfs_map_direct(
+	struct inode		*inode,
+	struct buffer_head	*bh_result,
+	struct xfs_bmbt_irec	*imap,
+	xfs_off_t		offset)
+{
+	if (ISUNWRITTEN(imap)) {
+		bh_result->b_private = inode;
+		set_buffer_defer_completion(bh_result);
+	}
+}
+
 STATIC int
 __xfs_get_blocks(
 	struct inode		*inode,
@@ -1332,20 +1348,14 @@ __xfs_get_blocks(
 	}
 
 	if (imap.br_startblock != HOLESTARTBLOCK &&
-	    imap.br_startblock != DELAYSTARTBLOCK) {
-		/*
-		 * For unwritten extents do not report a disk address on
-		 * the read case (treat as if we're reading into a hole).
-		 */
-		if (create || !ISUNWRITTEN(&imap))
-			xfs_map_buffer(inode, bh_result, &imap, offset);
-		if (create && ISUNWRITTEN(&imap)) {
-			if (direct) {
-				bh_result->b_private = inode;
-				set_buffer_defer_completion(bh_result);
-			}
+	    imap.br_startblock != DELAYSTARTBLOCK &&
+	    (create || !ISUNWRITTEN(&imap))) {
+		xfs_map_buffer(inode, bh_result, &imap, offset);
+		if (ISUNWRITTEN(&imap))
 			set_buffer_unwritten(bh_result);
-		}
+		/* direct IO needs special help */
+		if (create && direct)
+			xfs_map_direct(inode, bh_result, &imap, offset);
 	}
 
 	/*
-- 
2.0.0
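
For readers following along, a simplified userspace model of the
restructured mapping logic (illustrative only; the types and helpers
below are stand-ins, not the kernel code):

#include <stdbool.h>
#include <stdio.h>

enum extent_state { HOLE, DELALLOC, WRITTEN, UNWRITTEN };

/*
 * Simplified model of the restructured __xfs_get_blocks() logic: holes
 * and delalloc extents are never mapped; a read over an unwritten
 * extent is treated as a hole; and the direct IO specific work is
 * split into its own helper, called only for direct writes.
 */
static void map_extent(enum extent_state state, bool create, bool direct)
{
        if (state == HOLE || state == DELALLOC)
                return;
        if (!create && state == UNWRITTEN)
                return;         /* read of unwritten: report a hole */

        printf("map buffer to disk address\n");
        if (state == UNWRITTEN)
                printf("mark buffer unwritten\n");
        if (create && direct)
                printf("direct IO specific setup\n");
}

int main(void)
{
        map_extent(UNWRITTEN, true, true);      /* DIO write into unwritten */
        map_extent(UNWRITTEN, false, false);    /* buffered read: hole */
        return 0;
}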

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/8] xfs: move DIO mapping size calculation
  2015-04-14  7:26 [PATCH 0/8 v2] xfs: fix direct IO completion issues Dave Chinner
  2015-04-14  7:26 ` [PATCH 1/8] xfs: factor DIO write mapping from get_blocks Dave Chinner
@ 2015-04-14  7:26 ` Dave Chinner
  2015-04-14 14:24   ` Brian Foster
  2015-04-14  7:26 ` [PATCH 3/8] xfs: DIO needs an ioend for writes Dave Chinner
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2015-04-14  7:26 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

The mapping size calculation is done last in __xfs_get_blocks(), but
we are going to need the actual mapping size we will use to map the
direct IO correctly in xfs_map_direct(). Factor out the calculation
for code clarity, and move the call to be the first operation in
mapping the extent to the returned buffer.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c | 79 ++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 46 insertions(+), 33 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 5f7ddd5..8f63520 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1249,6 +1249,47 @@ xfs_map_direct(
 	}
 }
 
+
+/*
+ * If this is O_DIRECT or the mpage code calling tell them how large the mapping
+ * is, so that we can avoid repeated get_blocks calls.
+ *
+ * If the mapping spans EOF, then we have to break the mapping up as the mapping
+ * for blocks beyond EOF must be marked new so that sub block regions can be
+ * correctly zeroed. We can't do this for mappings within EOF unless the mapping
+ * was just allocated or is unwritten, otherwise the callers would overwrite
+ * existing data with zeros. Hence we have to split the mapping into a range up
+ * to and including EOF, and a second mapping for beyond EOF.
+ */
+static void
+xfs_map_trim_size(
+	struct inode		*inode,
+	sector_t		iblock,
+	struct buffer_head	*bh_result,
+	struct xfs_bmbt_irec	*imap,
+	xfs_off_t		offset,
+	ssize_t			size)
+{
+	xfs_off_t		mapping_size;
+
+	mapping_size = imap->br_startoff + imap->br_blockcount - iblock;
+	mapping_size <<= inode->i_blkbits;
+
+	ASSERT(mapping_size > 0);
+	if (mapping_size > size)
+		mapping_size = size;
+	if (offset < i_size_read(inode) &&
+	    offset + mapping_size >= i_size_read(inode)) {
+		/* limit mapping to block that spans EOF */
+		mapping_size = roundup_64(i_size_read(inode) - offset,
+					  1 << inode->i_blkbits);
+	}
+	if (mapping_size > LONG_MAX)
+		mapping_size = LONG_MAX;
+
+	bh_result->b_size = mapping_size;
+}
+
 STATIC int
 __xfs_get_blocks(
 	struct inode		*inode,
@@ -1347,6 +1388,11 @@ __xfs_get_blocks(
 		goto out_unlock;
 	}
 
+	/* trim mapping down to size requested */
+	if (direct || size > (1 << inode->i_blkbits))
+		xfs_map_trim_size(inode, iblock, bh_result,
+				  &imap, offset, size);
+
 	if (imap.br_startblock != HOLESTARTBLOCK &&
 	    imap.br_startblock != DELAYSTARTBLOCK &&
 	    (create || !ISUNWRITTEN(&imap))) {
@@ -1388,39 +1434,6 @@ __xfs_get_blocks(
 		}
 	}
 
-	/*
-	 * If this is O_DIRECT or the mpage code calling tell them how large
-	 * the mapping is, so that we can avoid repeated get_blocks calls.
-	 *
-	 * If the mapping spans EOF, then we have to break the mapping up as the
-	 * mapping for blocks beyond EOF must be marked new so that sub block
-	 * regions can be correctly zeroed. We can't do this for mappings within
-	 * EOF unless the mapping was just allocated or is unwritten, otherwise
-	 * the callers would overwrite existing data with zeros. Hence we have
-	 * to split the mapping into a range up to and including EOF, and a
-	 * second mapping for beyond EOF.
-	 */
-	if (direct || size > (1 << inode->i_blkbits)) {
-		xfs_off_t		mapping_size;
-
-		mapping_size = imap.br_startoff + imap.br_blockcount - iblock;
-		mapping_size <<= inode->i_blkbits;
-
-		ASSERT(mapping_size > 0);
-		if (mapping_size > size)
-			mapping_size = size;
-		if (offset < i_size_read(inode) &&
-		    offset + mapping_size >= i_size_read(inode)) {
-			/* limit mapping to block that spans EOF */
-			mapping_size = roundup_64(i_size_read(inode) - offset,
-						  1 << inode->i_blkbits);
-		}
-		if (mapping_size > LONG_MAX)
-			mapping_size = LONG_MAX;
-
-		bh_result->b_size = mapping_size;
-	}
-
 	return 0;
 
 out_unlock:
-- 
2.0.0
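
A worked example of the trim calculation, as a small standalone C
program. This is a simplified model of xfs_map_trim_size() assuming 4k
blocks and a block-aligned offset, not the kernel code itself:

#include <stdio.h>
#include <assert.h>

#define BLKSIZE         4096LL          /* assume 4k filesystem blocks */

static long long roundup64(long long x, long long y)
{
        return ((x + y - 1) / y) * y;
}

/*
 * Clamp the mapping to the size requested, and if it crosses EOF trim
 * it back so it ends at the block that spans EOF, leaving blocks
 * beyond EOF for a separate mapping that can be marked new.
 */
static long long trim_mapping(long long extent_end, long long offset,
                              long long size, long long isize)
{
        long long mapping_size = extent_end - offset;

        assert(mapping_size > 0);
        if (mapping_size > size)
                mapping_size = size;
        if (offset < isize && offset + mapping_size >= isize)
                mapping_size = roundup64(isize - offset, BLKSIZE);
        return mapping_size;
}

int main(void)
{
        /*
         * 1MB extent, 64k IO at offset 0, EOF at byte 40000: the mapping
         * is trimmed to 40960 bytes (the block containing EOF), not 64k.
         */
        printf("%lld\n", trim_mapping(1 << 20, 0, 64 << 10, 40000));
        return 0;
}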

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 3/8] xfs: DIO needs an ioend for writes
  2015-04-14  7:26 [PATCH 0/8 v2] xfs: fix direct IO completion issues Dave Chinner
  2015-04-14  7:26 ` [PATCH 1/8] xfs: factor DIO write mapping from get_blocks Dave Chinner
  2015-04-14  7:26 ` [PATCH 2/8] xfs: move DIO mapping size calculation Dave Chinner
@ 2015-04-14  7:26 ` Dave Chinner
  2015-04-14 14:24   ` Brian Foster
  2015-04-14  7:26 ` [PATCH 4/8] xfs: handle DIO overwrite EOF update completion correctly Dave Chinner
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2015-04-14  7:26 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Currently we can only tell DIO completion that an IO requires
unwritten extent completion. This is done by a hacky non-null
private pointer passed to IO completion, but the private pointer
does not actually contain any information that is used.

We also need to pass to IO completion the fact that the IO may be
beyond EOF and so a size update transaction needs to be done. This
is currently determined by checks in the io completion, but we need
to determine if this is necessary at block mapping time as we need
to defer the size update transactions to a completion workqueue,
just like unwritten extent conversion.

To do this, first we need to allocate and pass an ioend to to IO
completion. Add this for unwritten extent conversion; we'll do the
EOF updates in the next commit.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c  | 92 ++++++++++++++++++++++++++++++++++++++++++++++++------
 fs/xfs/xfs_trace.h |  3 ++
 2 files changed, 85 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 8f63520..e1fa926 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1234,7 +1234,23 @@ xfs_vm_releasepage(
 }
 
 /*
- * do all the direct IO specific mapping buffer manipulation here.
+ * When we map a DIO buffer, we need to attach an ioend that describes the type
+ * of write IO we are doing. This passes to the completion function the
+ * operations it needs to perform.
+ *
+ * If we get multiple mappings to in a single IO, we might be mapping dfferent
+ * types. But because the direct IO can only have a single private pointer, we
+ * need to ensure that:
+ *
+ * a) the ioend spans the entire region of the IO; and
+ * b) if it contains unwritten extents, it is *permanently* marked as such
+ *
+ * We could do this by chaining ioends like buffered IO does, but we only
+ * actually get one IO completion callback from the direct IO, and that spans
+ * the entire IO regardless of how many mappings and IOs are needed to complete
+ * the DIO. There is only going to be one reference to the ioend and it's life
+ * cycle is constrained by the DIO completion code. hence we don't need
+ * reference counting here.
  */
 static void
 xfs_map_direct(
@@ -1243,10 +1259,42 @@ xfs_map_direct(
 	struct xfs_bmbt_irec	*imap,
 	xfs_off_t		offset)
 {
-	if (ISUNWRITTEN(imap)) {
-		bh_result->b_private = inode;
-		set_buffer_defer_completion(bh_result);
+	struct xfs_ioend	*ioend;
+	xfs_off_t		size = bh_result->b_size;
+	int			type;
+
+	if (ISUNWRITTEN(imap))
+		type = XFS_IO_UNWRITTEN;
+	else
+		type = XFS_IO_OVERWRITE;
+
+	trace_xfs_gbmap_direct(XFS_I(inode), offset, size, type, imap);
+
+	if (bh_result->b_private) {
+		ioend = bh_result->b_private;
+		ASSERT(ioend->io_size > 0);
+		ASSERT(offset >= ioend->io_offset);
+		if (offset + size > ioend->io_offset + ioend->io_size)
+			ioend->io_size = offset - ioend->io_offset + size;
+
+		if (type == XFS_IO_UNWRITTEN && type != ioend->io_type)
+			ioend->io_type = XFS_IO_UNWRITTEN;
+
+		trace_xfs_gbmap_direct_update(XFS_I(inode), ioend->io_offset,
+					      ioend->io_size, ioend->io_type,
+					      imap);
+	} else {
+		ioend = xfs_alloc_ioend(inode, type);
+		ioend->io_offset = offset;
+		ioend->io_size = size;
+		bh_result->b_private = ioend;
+
+		trace_xfs_gbmap_direct_new(XFS_I(inode), offset, size, type,
+					   imap);
 	}
+
+	if (ioend->io_type == XFS_IO_UNWRITTEN)
+		set_buffer_defer_completion(bh_result);
 }
 
 
@@ -1378,10 +1426,13 @@ __xfs_get_blocks(
 
 			xfs_iunlock(ip, lockmode);
 		}
-
-		trace_xfs_get_blocks_alloc(ip, offset, size, 0, &imap);
+		trace_xfs_get_blocks_alloc(ip, offset, size,
+				ISUNWRITTEN(&imap) ? XFS_IO_UNWRITTEN
+						   : XFS_IO_DELALLOC, &imap);
 	} else if (nimaps) {
-		trace_xfs_get_blocks_found(ip, offset, size, 0, &imap);
+		trace_xfs_get_blocks_found(ip, offset, size,
+				ISUNWRITTEN(&imap) ? XFS_IO_UNWRITTEN
+						   : XFS_IO_OVERWRITE, &imap);
 		xfs_iunlock(ip, lockmode);
 	} else {
 		trace_xfs_get_blocks_notfound(ip, offset, size);
@@ -1478,9 +1529,28 @@ xfs_end_io_direct_write(
 	struct inode		*inode = file_inode(iocb->ki_filp);
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_ioend	*ioend = private;
 
 	if (XFS_FORCED_SHUTDOWN(mp))
-		return;
+		goto out_destroy_ioend;
+
+	/*
+	 * dio completion end_io functions are only called on writes if more
+	 * than 0 bytes was written.
+	 */
+	ASSERT(size > 0);
+
+	/*
+	 * The ioend only maps whole blocks, while the IO may be sector aligned.
+	 * Hence the ioend offset/size may not match the IO offset/size exactly,
+	 * but should span it completely. Write the IO sizes into the ioend so
+	 * that completion processing does the right thing.
+	 */
+	ASSERT(size <= ioend->io_size);
+	ASSERT(offset >= ioend->io_offset);
+	ASSERT(offset + size <= ioend->io_offset + ioend->io_size);
+	ioend->io_size = size;
+	ioend->io_offset = offset;
 
 	/*
 	 * While the generic direct I/O code updates the inode size, it does
@@ -1500,7 +1570,7 @@ xfs_end_io_direct_write(
 	 * we can pass the ioend to the direct IO allocation callbacks and
 	 * avoid nesting that way.
 	 */
-	if (private && size > 0) {
+	if (ioend->io_type == XFS_IO_UNWRITTEN) {
 		xfs_iomap_write_unwritten(ip, offset, size);
 	} else if (offset + size > ip->i_d.di_size) {
 		struct xfs_trans	*tp;
@@ -1510,11 +1580,13 @@ xfs_end_io_direct_write(
 		error = xfs_trans_reserve(tp, &M_RES(mp)->tr_fsyncts, 0, 0);
 		if (error) {
 			xfs_trans_cancel(tp, 0);
-			return;
+			goto out_destroy_ioend;
 		}
 
 		xfs_setfilesize(ip, tp, offset, size);
 	}
+out_destroy_ioend:
+	xfs_destroy_ioend(ioend);
 }
 
 STATIC ssize_t
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index b2a45cc..e78b64e 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1221,6 +1221,9 @@ DEFINE_IOMAP_EVENT(xfs_map_blocks_found);
 DEFINE_IOMAP_EVENT(xfs_map_blocks_alloc);
 DEFINE_IOMAP_EVENT(xfs_get_blocks_found);
 DEFINE_IOMAP_EVENT(xfs_get_blocks_alloc);
+DEFINE_IOMAP_EVENT(xfs_gbmap_direct);
+DEFINE_IOMAP_EVENT(xfs_gbmap_direct_new);
+DEFINE_IOMAP_EVENT(xfs_gbmap_direct_update);
 
 DECLARE_EVENT_CLASS(xfs_simple_io_class,
 	TP_PROTO(struct xfs_inode *ip, xfs_off_t offset, ssize_t count),
-- 
2.0.0
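
To make the "single ioend spanning all mappings" rule concrete, a
minimal userspace sketch (the struct and helper names are made up for
illustration and only model the growth/type rules):

#include <stdio.h>
#include <stdlib.h>

enum io_type { IO_OVERWRITE, IO_UNWRITTEN };

struct dio_ioend {
        long long       offset;
        long long       size;
        enum io_type    type;
};

/*
 * Model of how one ioend hung off the dio private pointer is grown to
 * cover every mapping in the IO: the range is extended to span all the
 * mappings, and once any mapping is unwritten the ioend is permanently
 * marked unwritten so completion runs extent conversion.
 */
static struct dio_ioend *map_direct(struct dio_ioend *ioend, long long offset,
                                    long long size, enum io_type type)
{
        if (!ioend) {
                ioend = malloc(sizeof(*ioend));
                ioend->offset = offset;
                ioend->size = size;
                ioend->type = type;
                return ioend;
        }
        if (offset + size > ioend->offset + ioend->size)
                ioend->size = offset - ioend->offset + size;
        if (type == IO_UNWRITTEN)
                ioend->type = IO_UNWRITTEN;
        return ioend;
}

int main(void)
{
        struct dio_ioend *ioend = NULL;

        ioend = map_direct(ioend, 0, 4096, IO_OVERWRITE);
        ioend = map_direct(ioend, 4096, 8192, IO_UNWRITTEN);
        printf("offset %lld size %lld unwritten %d\n",
               ioend->offset, ioend->size, ioend->type == IO_UNWRITTEN);
        free(ioend);
        return 0;
}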

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 4/8] xfs: handle DIO overwrite EOF update completion correctly
  2015-04-14  7:26 [PATCH 0/8 v2] xfs: fix direct IO completion issues Dave Chinner
                   ` (2 preceding siblings ...)
  2015-04-14  7:26 ` [PATCH 3/8] xfs: DIO needs an ioend for writes Dave Chinner
@ 2015-04-14  7:26 ` Dave Chinner
  2015-04-14 14:35   ` Brian Foster
  2015-04-14  7:26 ` [PATCH 5/8] xfs: DIO writes within EOF don't need an ioend Dave Chinner
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2015-04-14  7:26 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Currently a DIO overwrite that extends the EOF (e.g. sub-block IO or
write into allocated blocks beyond EOF) requires a transaction for
the EOF update. This is done in IO completion context, but we aren't
explicitly handling this situation properly and so it can run in
interrupt context. Ensure that we defer IO that spans EOF correctly
to the DIO completion workqueue, and now that we have an ioend in IO
completion we can use the common ioend completion path to do all the
work.

Note: we do not preallocate the append transaction as we can have
multiple mapping and allocation calls per direct IO. Hence
preallocating can still leave us with nested transactions by
attempting to map and allocate more blocks after we've preallocated
an append transaction.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c  | 61 +++++++++++++++++++++++++++---------------------------
 fs/xfs/xfs_trace.h |  1 +
 2 files changed, 31 insertions(+), 31 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index e1fa926..e3968a3 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1293,7 +1293,7 @@ xfs_map_direct(
 					   imap);
 	}
 
-	if (ioend->io_type == XFS_IO_UNWRITTEN)
+	if (ioend->io_type == XFS_IO_UNWRITTEN || xfs_ioend_is_append(ioend))
 		set_buffer_defer_completion(bh_result);
 }
 
@@ -1531,8 +1531,10 @@ xfs_end_io_direct_write(
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_ioend	*ioend = private;
 
+	trace_xfs_gbmap_direct_endio(ip, offset, size, ioend->io_type, NULL);
+
 	if (XFS_FORCED_SHUTDOWN(mp))
-		goto out_destroy_ioend;
+		goto out_end_io;
 
 	/*
 	 * dio completion end_io functions are only called on writes if more
@@ -1553,40 +1555,37 @@ xfs_end_io_direct_write(
 	ioend->io_offset = offset;
 
 	/*
-	 * While the generic direct I/O code updates the inode size, it does
-	 * so only after the end_io handler is called, which means our
-	 * end_io handler thinks the on-disk size is outside the in-core
-	 * size.  To prevent this just update it a little bit earlier here.
+	 * The ioend tells us whether we are doing unwritten extent conversion
+	 * or an append transaction that updates the on-disk file size. These
+	 * cases are the only cases where we should *potentially* be needing
+	 * to update the VFS inode size. When the ioend indicates this, we
+	 * are *guaranteed* to be running in non-interrupt context.
+	 *
+	 * We need to update the in-core inode size here so that we don't end up
+	 * with the on-disk inode size being outside the in-core inode size.
+	 * While we can do this in the process context after the IO has
+	 * completed, this does not work for AIO and hence we always update
+	 * the in-core inode size here if necessary.
 	 */
-	if (offset + size > i_size_read(inode))
-		i_size_write(inode, offset + size);
+	if (ioend->io_type == XFS_IO_UNWRITTEN || xfs_ioend_is_append(ioend)) {
+		if (offset + size > i_size_read(inode))
+			i_size_write(inode, offset + size);
+	} else
+		ASSERT(offset + size <= i_size_read(inode));
 
 	/*
-	 * For direct I/O we do not know if we need to allocate blocks or not,
-	 * so we can't preallocate an append transaction, as that results in
-	 * nested reservations and log space deadlocks. Hence allocate the
-	 * transaction here. While this is sub-optimal and can block IO
-	 * completion for some time, we're stuck with doing it this way until
-	 * we can pass the ioend to the direct IO allocation callbacks and
-	 * avoid nesting that way.
+	 * If we are doing an append IO that needs to update the EOF on disk,
+	 * do the transaction reserve now so we can use common end io
+	 * processing. Stashing the error (if there is one) in the ioend will
+	 * result in the ioend processing passing on the error if it is
+	 * possible as we can't return it from here.
 	 */
-	if (ioend->io_type == XFS_IO_UNWRITTEN) {
-		xfs_iomap_write_unwritten(ip, offset, size);
-	} else if (offset + size > ip->i_d.di_size) {
-		struct xfs_trans	*tp;
-		int			error;
-
-		tp = xfs_trans_alloc(mp, XFS_TRANS_FSYNC_TS);
-		error = xfs_trans_reserve(tp, &M_RES(mp)->tr_fsyncts, 0, 0);
-		if (error) {
-			xfs_trans_cancel(tp, 0);
-			goto out_destroy_ioend;
-		}
+	if (ioend->io_type == XFS_IO_OVERWRITE && xfs_ioend_is_append(ioend))
+		ioend->io_error = xfs_setfilesize_trans_alloc(ioend);
 
-		xfs_setfilesize(ip, tp, offset, size);
-	}
-out_destroy_ioend:
-	xfs_destroy_ioend(ioend);
+out_end_io:
+	xfs_end_io(&ioend->io_work);
+	return;
 }
 
 STATIC ssize_t
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index e78b64e..967993b 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1224,6 +1224,7 @@ DEFINE_IOMAP_EVENT(xfs_get_blocks_alloc);
 DEFINE_IOMAP_EVENT(xfs_gbmap_direct);
 DEFINE_IOMAP_EVENT(xfs_gbmap_direct_new);
 DEFINE_IOMAP_EVENT(xfs_gbmap_direct_update);
+DEFINE_IOMAP_EVENT(xfs_gbmap_direct_endio);
 
 DECLARE_EVENT_CLASS(xfs_simple_io_class,
 	TP_PROTO(struct xfs_inode *ip, xfs_off_t offset, ssize_t count),
-- 
2.0.0
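
A small standalone sketch of the completion policy this patch
introduces (illustrative only; the type names and the append test are
simplified stand-ins for the ioend fields and xfs_ioend_is_append()):

#include <stdio.h>
#include <assert.h>

enum io_type { IO_OVERWRITE, IO_UNWRITTEN };

static long long in_core_isize = 1 << 20;       /* pretend in-core EOF */

/*
 * Model of the reworked completion: only unwritten conversion and
 * appends (the ioend extends past the on-disk file size) are deferred
 * to the workqueue, so only they may move the in-core EOF; a plain
 * overwrite must already lie inside it.  Appends reserve the on-disk
 * size update transaction and fall into the common ioend completion.
 */
static void end_io_direct_write(enum io_type type, long long offset,
                                long long size, int is_append)
{
        if (type == IO_UNWRITTEN || is_append) {
                if (offset + size > in_core_isize)
                        in_core_isize = offset + size;
        } else {
                assert(offset + size <= in_core_isize);
        }

        if (type == IO_UNWRITTEN)
                printf("queue unwritten extent conversion\n");
        else if (is_append)
                printf("reserve setfilesize transaction, run common end_io\n");
}

int main(void)
{
        end_io_direct_write(IO_OVERWRITE, 1 << 20, 4096, 1);
        printf("in-core EOF is now %lld\n", in_core_isize);
        return 0;
}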

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 5/8] xfs: DIO writes within EOF don't need an ioend
  2015-04-14  7:26 [PATCH 0/8 v2] xfs: fix direct IO completion issues Dave Chinner
                   ` (3 preceding siblings ...)
  2015-04-14  7:26 ` [PATCH 4/8] xfs: handle DIO overwrite EOF update completion correctly Dave Chinner
@ 2015-04-14  7:26 ` Dave Chinner
  2015-04-14 14:35   ` Brian Foster
  2015-04-14  7:26 ` [PATCH 6/8] xfs: DIO write completion size updates race Dave Chinner
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2015-04-14  7:26 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

DIO writes that lie entirely within EOF have nothing to do in IO
completion. In this case, we don't need no steekin' ioend, and so we
can avoid allocating an ioend until we have a mapping that spans
EOF.

This means that IO completion has two contexts - deferred completion
to the dio workqueue that uses an ioend, and interrupt completion
that does nothing because there is nothing that can be done in this
context.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c  | 62 ++++++++++++++++++++++++++++++------------------------
 fs/xfs/xfs_trace.h |  1 +
 2 files changed, 36 insertions(+), 27 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index e3968a3..55356f6 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1234,15 +1234,19 @@ xfs_vm_releasepage(
 }
 
 /*
- * When we map a DIO buffer, we need to attach an ioend that describes the type
+ * When we map a DIO buffer, we may need to attach an ioend that describes the type
  * of write IO we are doing. This passes to the completion function the
- * operations it needs to perform.
+ * operations it needs to perform. If the mapping is for an overwrite wholly
+ * within the EOF then we don't need an ioend and so we don't allocate one. This
+ * avoids the unnecessary overhead of allocating and freeing ioends for
+ * workloads that don't require transactions on IO completion.
  *
  * If we get multiple mappings to in a single IO, we might be mapping dfferent
  * types. But because the direct IO can only have a single private pointer, we
  * need to ensure that:
  *
- * a) the ioend spans the entire region of the IO; and
+ * a) i) the ioend spans the entire region of unwritten mappings; or
+ *    ii) the ioend spans all the mappings that cross or are beyond EOF; and
  * b) if it contains unwritten extents, it is *permanently* marked as such
  *
  * We could do this by chaining ioends like buffered IO does, but we only
@@ -1283,7 +1287,8 @@ xfs_map_direct(
 		trace_xfs_gbmap_direct_update(XFS_I(inode), ioend->io_offset,
 					      ioend->io_size, ioend->io_type,
 					      imap);
-	} else {
+	} else if (type == XFS_IO_UNWRITTEN ||
+		   offset + size > i_size_read(inode)) {
 		ioend = xfs_alloc_ioend(inode, type);
 		ioend->io_offset = offset;
 		ioend->io_size = size;
@@ -1291,10 +1296,13 @@ xfs_map_direct(
 
 		trace_xfs_gbmap_direct_new(XFS_I(inode), offset, size, type,
 					   imap);
+	} else {
+		trace_xfs_gbmap_direct_none(XFS_I(inode), offset, size, type,
+					    imap);
+		return;
 	}
 
-	if (ioend->io_type == XFS_IO_UNWRITTEN || xfs_ioend_is_append(ioend))
-		set_buffer_defer_completion(bh_result);
+	set_buffer_defer_completion(bh_result);
 }
 
 
@@ -1515,9 +1523,11 @@ xfs_get_blocks_direct(
 /*
  * Complete a direct I/O write request.
  *
- * If the private argument is non-NULL __xfs_get_blocks signals us that we
- * need to issue a transaction to convert the range from unwritten to written
- * extents.
+ * The ioend structure is passed from __xfs_get_blocks() to tell us what to do.
+ * If no ioend exists (i.e. @private == NULL) then the write IO is an overwrite
+ * wholly within the EOF and so there is nothing for us to do. Note that in this
+ * case the completion can be called in interrupt context, whereas if we have an
+ * ioend we will always be called in task context (i.e. from a workqueue).
  */
 STATIC void
 xfs_end_io_direct_write(
@@ -1531,7 +1541,10 @@ xfs_end_io_direct_write(
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_ioend	*ioend = private;
 
-	trace_xfs_gbmap_direct_endio(ip, offset, size, ioend->io_type, NULL);
+	trace_xfs_gbmap_direct_endio(ip, offset, size,
+				     ioend ? ioend->io_type : 0, NULL);
+	if (!ioend)
+		return;
 
 	if (XFS_FORCED_SHUTDOWN(mp))
 		goto out_end_io;
@@ -1544,12 +1557,12 @@ xfs_end_io_direct_write(
 
 	/*
 	 * The ioend only maps whole blocks, while the IO may be sector aligned.
-	 * Hence the ioend offset/size may not match the IO offset/size exactly,
-	 * but should span it completely. Write the IO sizes into the ioend so
-	 * that completion processing does the right thing.
+	 * Hence the ioend offset/size may not match the IO offset/size exactly.
+	 * Because we don't map overwrites within EOF into the ioend, the offset
+	 * may not match, but only if the endio spans EOF.  Either way, write
+	 * the IO sizes into the ioend so that completion processing does the
+	 * right thing.
 	 */
-	ASSERT(size <= ioend->io_size);
-	ASSERT(offset >= ioend->io_offset);
 	ASSERT(offset + size <= ioend->io_offset + ioend->io_size);
 	ioend->io_size = size;
 	ioend->io_offset = offset;
@@ -1558,20 +1571,15 @@ xfs_end_io_direct_write(
 	 * The ioend tells us whether we are doing unwritten extent conversion
 	 * or an append transaction that updates the on-disk file size. These
 	 * cases are the only cases where we should *potentially* be needing
-	 * to update the VFS inode size. When the ioend indicates this, we
-	 * are *guaranteed* to be running in non-interrupt context.
+	 * to update the VFS inode size.
 	 *
 	 * We need to update the in-core inode size here so that we don't end up
-	 * with the on-disk inode size being outside the in-core inode size.
-	 * While we can do this in the process context after the IO has
-	 * completed, this does not work for AIO and hence we always update
-	 * the in-core inode size here if necessary.
+	 * with the on-disk inode size being outside the in-core inode size. We
+	 * have no other method of updating EOF for AIO, so always do it here
+	 * if necessary.
 	 */
-	if (ioend->io_type == XFS_IO_UNWRITTEN || xfs_ioend_is_append(ioend)) {
-		if (offset + size > i_size_read(inode))
-			i_size_write(inode, offset + size);
-	} else
-		ASSERT(offset + size <= i_size_read(inode));
+	if (offset + size > i_size_read(inode))
+		i_size_write(inode, offset + size);
 
 	/*
 	 * If we are doing an append IO that needs to update the EOF on disk,
@@ -1580,7 +1588,7 @@ xfs_end_io_direct_write(
 	 * result in the ioend processing passing on the error if it is
 	 * possible as we can't return it from here.
 	 */
-	if (ioend->io_type == XFS_IO_OVERWRITE && xfs_ioend_is_append(ioend))
+	if (ioend->io_type == XFS_IO_OVERWRITE)
 		ioend->io_error = xfs_setfilesize_trans_alloc(ioend);
 
 out_end_io:
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 967993b..615781b 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1224,6 +1224,7 @@ DEFINE_IOMAP_EVENT(xfs_get_blocks_alloc);
 DEFINE_IOMAP_EVENT(xfs_gbmap_direct);
 DEFINE_IOMAP_EVENT(xfs_gbmap_direct_new);
 DEFINE_IOMAP_EVENT(xfs_gbmap_direct_update);
+DEFINE_IOMAP_EVENT(xfs_gbmap_direct_none);
 DEFINE_IOMAP_EVENT(xfs_gbmap_direct_endio);
 
 DECLARE_EVENT_CLASS(xfs_simple_io_class,
-- 
2.0.0

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 6/8] xfs: DIO write completion size updates race
  2015-04-14  7:26 [PATCH 0/8 v2] xfs: fix direct IO completion issues Dave Chinner
                   ` (4 preceding siblings ...)
  2015-04-14  7:26 ` [PATCH 5/8] xfs: DIO writes within EOF don't need an ioend Dave Chinner
@ 2015-04-14  7:26 ` Dave Chinner
  2015-04-14 14:35   ` Brian Foster
  2015-04-14  7:26 ` [PATCH 7/8] xfs: direct IO EOF zeroing needs to drain AIO Dave Chinner
  2015-04-14  7:26 ` [PATCH 8/8] xfs: using generic_file_direct_write() is unnecessary Dave Chinner
  7 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2015-04-14  7:26 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

xfs_end_io_direct_write() can race with other IO completions when
updating the in-core inode size. The IO completion processing is not
serialised for direct IO - completions run either under the
IOLOCK_SHARED for non-AIO DIO, or without any IOLOCK held at all
during AIO DIO completion. Hence the non-atomic test-and-set update
of the in-core inode size is racy and can result in the in-core
inode size going backwards if the race is hit just right.

If the inode size goes backwards, this can trigger the EOF zeroing
code to run incorrectly on the next IO, which then will zero data
that has successfully been written to disk by a previous DIO.

To fix this bug, we need to serialise the test/set updates of the
in-core inode size. This first patch introduces locking around the
relevant updates and checks in the DIO path. Because we now have an
ioend in xfs_end_io_direct_write(), we know exactly when we are
doing an IO that requires an in-core EOF update, and we know that
they are not running in interrupt context. As such, we do not need to
use irqsave() spinlock variants to protect against interrupts while
the lock is held.

Hence we can use an existing spinlock in the inode to do this
serialisation and so not need to grow the struct xfs_inode just to
work around this problem.

This patch does not address the test/set EOF update in
generic_file_write_direct() for various reasons - that will be done
as a followup with separate explanation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c |  7 +++++++
 fs/xfs/xfs_file.c | 13 ++++++++++++-
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 55356f6..cd6b2e0 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1577,9 +1577,16 @@ xfs_end_io_direct_write(
 	 * with the on-disk inode size being outside the in-core inode size. We
 	 * have no other method of updating EOF for AIO, so always do it here
 	 * if necessary.
+	 *
+	 * We need to lock the test/set EOF update as we can be racing with
+	 * other IO completions here to update the EOF. Failing to serialise
+	 * here can result in EOF moving backwards and Bad Things Happen when
+	 * that occurs.
 	 */
+	spin_lock(&ip->i_flags_lock);
 	if (offset + size > i_size_read(inode))
 		i_size_write(inode, offset + size);
+	spin_unlock(&ip->i_flags_lock);
 
 	/*
 	 * If we are doing an append IO that needs to update the EOF on disk,
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c203839..5d5b4ba 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -569,10 +569,20 @@ restart:
 	 * write.  If zeroing is needed and we are currently holding the
 	 * iolock shared, we need to update it to exclusive which implies
 	 * having to redo all checks before.
+	 *
+	 * We need to serialise against EOF updates that occur in IO
+	 * completions here. We want to make sure that nobody is changing the
+	 * size while we do this check until we have placed an IO barrier (i.e.
+	 * hold the XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.
+	 * The spinlock effectively forms a memory barrier once we have the
+	 * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value
+	 * and hence be able to correctly determine if we need to run zeroing.
 	 */
+	spin_lock(&ip->i_flags_lock);
 	if (*pos > i_size_read(inode)) {
 		bool	zero = false;
 
+		spin_unlock(&ip->i_flags_lock);
 		if (*iolock == XFS_IOLOCK_SHARED) {
 			xfs_rw_iunlock(ip, *iolock);
 			*iolock = XFS_IOLOCK_EXCL;
@@ -582,7 +592,8 @@ restart:
 		error = xfs_zero_eof(ip, *pos, i_size_read(inode), &zero);
 		if (error)
 			return error;
-	}
+	} else
+		spin_unlock(&ip->i_flags_lock);
 
 	/*
 	 * Updating the timestamps will grab the ilock again from
-- 
2.0.0
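
To illustrate why the unlocked test-and-set is dangerous, a tiny
userspace analogue using a pthread mutex in place of the inode
spinlock (purely illustrative, not the kernel code):

#include <pthread.h>
#include <stdio.h>

/*
 * Two completions doing an unlocked "if (end > isize) isize = end" can
 * interleave so that the smaller value is stored last and EOF goes
 * backwards.  Taking a lock (a mutex here, standing in for the inode
 * spinlock) around the test-and-set keeps the update monotonic.
 */
static long long isize;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void update_isize(long long end)
{
        pthread_mutex_lock(&lock);
        if (end > isize)
                isize = end;
        pthread_mutex_unlock(&lock);
}

static void *completion(void *arg)
{
        update_isize((long long)(unsigned long)arg);
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, completion, (void *)(unsigned long)8192);
        pthread_create(&b, NULL, completion, (void *)(unsigned long)4096);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("EOF is %lld\n", isize);         /* always 8192 */
        return 0;
}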

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 7/8] xfs: direct IO EOF zeroing needs to drain AIO
  2015-04-14  7:26 [PATCH 0/8 v2] xfs: fix direct IO completion issues Dave Chinner
                   ` (5 preceding siblings ...)
  2015-04-14  7:26 ` [PATCH 6/8] xfs: DIO write completion size updates race Dave Chinner
@ 2015-04-14  7:26 ` Dave Chinner
  2015-04-14 14:35   ` Brian Foster
  2015-04-14  7:26 ` [PATCH 8/8] xfs: using generic_file_direct_write() is unnecessary Dave Chinner
  7 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2015-04-14  7:26 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

When we are doing AIO DIO writes, the IOLOCK only provides an IO
submission barrier. When we need to do EOF zeroing, we need to ensure
that no other IO is in progress and all pending in-core EOF updates
have been completed. This requires us to wait for all outstanding
AIO DIO writes to the inode to complete and, if necessary, run their
EOF updates.

Once all the EOF updates are complete, we can then restart
xfs_file_aio_write_checks() while holding the IOLOCK_EXCL, knowing
that EOF is up to date and we have exclusive IO access to the file
so we can run EOF block zeroing if we need to without interference.
This gives EOF zeroing the same exclusivity against other IO as we
provide to truncate operations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_file.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 5d5b4ba..c398ec7 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -587,6 +587,16 @@ restart:
 			xfs_rw_iunlock(ip, *iolock);
 			*iolock = XFS_IOLOCK_EXCL;
 			xfs_rw_ilock(ip, *iolock);
+
+			/*
+			 * We now have an IO submission barrier in place, but
+			 * AIO can do EOF updates during IO completion and hence
+			 * we now need to wait for all of them to drain. Non-AIO
+			 * DIO will have drained before we are given the
+			 * XFS_IOLOCK_EXCL, and so for most cases this wait is a
+			 * no-op.
+			 */
+			inode_dio_wait(inode);
 			goto restart;
 		}
 		error = xfs_zero_eof(ip, *pos, i_size_read(inode), &zero);
-- 
2.0.0
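
A rough sketch of the resulting control flow, with the locking and DIO
draining reduced to stubs (illustrative only; the names below are made
up and nothing here is the kernel code):

#include <stdio.h>

static int iolock_excl;         /* 0 = shared, 1 = exclusive */
static int dio_in_flight = 1;   /* pretend one AIO DIO is outstanding */

static void wait_for_dio(void)
{
        /* in the kernel, inode_dio_wait() sleeps until this hits zero */
        printf("draining %d in-flight DIO\n", dio_in_flight);
        dio_in_flight = 0;
}

/*
 * Shape of the write-checks path after this patch: if zeroing beyond
 * EOF is needed while holding the iolock shared, upgrade to exclusive,
 * drain in-flight AIO DIO (whose completions may still move EOF), then
 * restart the checks so the EOF we zero against is stable.
 */
static void write_checks(long long pos, long long isize)
{
restart:
        if (pos > isize) {
                if (!iolock_excl) {
                        iolock_excl = 1;        /* upgrade the iolock */
                        wait_for_dio();         /* submission barrier alone
                                                   is not enough for AIO */
                        goto restart;
                }
                printf("zero from EOF (%lld) to pos (%lld)\n", isize, pos);
        }
}

int main(void)
{
        write_checks(8192, 4096);
        return 0;
}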

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 8/8] xfs: using generic_file_direct_write() is unnecessary
  2015-04-14  7:26 [PATCH 0/8 v2] xfs: fix direct IO completion issues Dave Chinner
                   ` (6 preceding siblings ...)
  2015-04-14  7:26 ` [PATCH 7/8] xfs: direct IO EOF zeroing needs to drain AIO Dave Chinner
@ 2015-04-14  7:26 ` Dave Chinner
  2015-04-14 14:35   ` Brian Foster
  7 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2015-04-14  7:26 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

generic_file_direct_write() does all sorts of things to make DIO
work "sorta ok" with mixed buffered IO workloads. We already do
most of this work in xfs_file_dio_aio_write() because of the locking
requirements, so there are only a couple of things it does for us.

The first thing is that it does a page cache invalidation after the
->direct_IO callout. This can easily be added to the XFS code.

The second thing it does is that if data was written, it updates the
iov_iter structure to reflect the data written, and then does EOF
size updates if necessary. For XFS, these EOF size updates are now
not necessary, as we do them safely and race-free in IO completion
context. That leaves just the iov_iter update, and that's also moved
to the XFS code.

Therefore we don't need to call generic_file_direct_write() and in
doing so remove redundant buffered writeback and page cache
invalidation calls from the DIO submission path. We also remove a
racy EOF size update, and make the DIO submission code in XFS much
easier to follow. Wins all round, really.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_file.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c398ec7..3a5d305 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -665,6 +665,8 @@ xfs_file_dio_aio_write(
 	int			iolock;
 	size_t			count = iov_iter_count(from);
 	loff_t			pos = iocb->ki_pos;
+	loff_t			end;
+	struct iov_iter		data;
 	struct xfs_buftarg	*target = XFS_IS_REALTIME_INODE(ip) ?
 					mp->m_rtdev_targp : mp->m_ddev_targp;
 
@@ -704,10 +706,11 @@ xfs_file_dio_aio_write(
 	if (ret)
 		goto out;
 	iov_iter_truncate(from, count);
+	end = pos + count - 1;
 
 	if (mapping->nrpages) {
 		ret = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
-						    pos, pos + count - 1);
+						   pos, end);
 		if (ret)
 			goto out;
 		/*
@@ -717,7 +720,7 @@ xfs_file_dio_aio_write(
 		 */
 		ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
 					pos >> PAGE_CACHE_SHIFT,
-					(pos + count - 1) >> PAGE_CACHE_SHIFT);
+					end >> PAGE_CACHE_SHIFT);
 		WARN_ON_ONCE(ret);
 		ret = 0;
 	}
@@ -734,8 +737,22 @@ xfs_file_dio_aio_write(
 	}
 
 	trace_xfs_file_direct_write(ip, count, iocb->ki_pos, 0);
-	ret = generic_file_direct_write(iocb, from, pos);
 
+	data = *from;
+	ret = mapping->a_ops->direct_IO(WRITE, iocb, &data, pos);
+
+	/* see generic_file_direct_write() for why this is necessary */
+	if (mapping->nrpages) {
+		invalidate_inode_pages2_range(mapping,
+					      pos >> PAGE_CACHE_SHIFT,
+					      end >> PAGE_CACHE_SHIFT);
+	}
+
+	if (ret > 0) {
+		pos += ret;
+		iov_iter_advance(from, ret);
+		iocb->ki_pos = pos;
+	}
 out:
 	xfs_rw_iunlock(ip, iolock);
 
-- 
2.0.0

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/8] xfs: factor DIO write mapping from get_blocks
  2015-04-14  7:26 ` [PATCH 1/8] xfs: factor DIO write mapping from get_blocks Dave Chinner
@ 2015-04-14 14:23   ` Brian Foster
  2015-04-14 20:06     ` Dave Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Brian Foster @ 2015-04-14 14:23 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Apr 14, 2015 at 05:26:44PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Clarify and separate the buffer mapping logic so that the direct IO mapping is
> not tangled up in propagating the extent status to the mapping buffer. This
> makes it easier to extend the direct IO mapping to use an ioend in future.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_aops.c | 36 +++++++++++++++++++++++-------------
>  1 file changed, 23 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 3a9b7a1..5f7ddd5 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1233,6 +1233,22 @@ xfs_vm_releasepage(
>  	return try_to_free_buffers(page);
>  }
>  
> +/*
> + * do all the direct IO specific mapping buffer manipulation here.
> + */
> +static void
> +xfs_map_direct(
> +	struct inode		*inode,
> +	struct buffer_head	*bh_result,
> +	struct xfs_bmbt_irec	*imap,
> +	xfs_off_t		offset)
> +{
> +	if (ISUNWRITTEN(imap)) {
> +		bh_result->b_private = inode;
> +		set_buffer_defer_completion(bh_result);
> +	}
> +}
> +
>  STATIC int
>  __xfs_get_blocks(
>  	struct inode		*inode,
> @@ -1332,20 +1348,14 @@ __xfs_get_blocks(
>  	}
>  
>  	if (imap.br_startblock != HOLESTARTBLOCK &&
> -	    imap.br_startblock != DELAYSTARTBLOCK) {
> -		/*
> -		 * For unwritten extents do not report a disk address on
> -		 * the read case (treat as if we're reading into a hole).
> -		 */

Can we keep this comment and move it before this block?

Brian

> -		if (create || !ISUNWRITTEN(&imap))
> -			xfs_map_buffer(inode, bh_result, &imap, offset);
> -		if (create && ISUNWRITTEN(&imap)) {
> -			if (direct) {
> -				bh_result->b_private = inode;
> -				set_buffer_defer_completion(bh_result);
> -			}
> +	    imap.br_startblock != DELAYSTARTBLOCK &&
> +	    (create || !ISUNWRITTEN(&imap))) {
> +		xfs_map_buffer(inode, bh_result, &imap, offset);
> +		if (ISUNWRITTEN(&imap))
>  			set_buffer_unwritten(bh_result);
> -		}
> +		/* direct IO needs special help */
> +		if (create && direct)
> +			xfs_map_direct(inode, bh_result, &imap, offset);
>  	}
>  
>  	/*
> -- 
> 2.0.0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/8] xfs: move DIO mapping size calculation
  2015-04-14  7:26 ` [PATCH 2/8] xfs: move DIO mapping size calculation Dave Chinner
@ 2015-04-14 14:24   ` Brian Foster
  0 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2015-04-14 14:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Apr 14, 2015 at 05:26:45PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The mapping size calculation is done last in __xfs_get_blocks(), but
> we are going to need the actual mapping size we will use to map the
> direct IO correctly in xfs_map_direct(). Factor out the calculation
> for code clarity, and move the call to be the first operation in
> mapping the extent to the returned buffer.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_aops.c | 79 ++++++++++++++++++++++++++++++++-----------------------
>  1 file changed, 46 insertions(+), 33 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 5f7ddd5..8f63520 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1249,6 +1249,47 @@ xfs_map_direct(
>  	}
>  }
>  
> +
> +/*
> + * If this is O_DIRECT or the mpage code calling tell them how large the mapping
> + * is, so that we can avoid repeated get_blocks calls.
> + *
> + * If the mapping spans EOF, then we have to break the mapping up as the mapping
> + * for blocks beyond EOF must be marked new so that sub block regions can be
> + * correctly zeroed. We can't do this for mappings within EOF unless the mapping
> + * was just allocated or is unwritten, otherwise the callers would overwrite
> + * existing data with zeros. Hence we have to split the mapping into a range up
> + * to and including EOF, and a second mapping for beyond EOF.
> + */
> +static void
> +xfs_map_trim_size(
> +	struct inode		*inode,
> +	sector_t		iblock,
> +	struct buffer_head	*bh_result,
> +	struct xfs_bmbt_irec	*imap,
> +	xfs_off_t		offset,
> +	ssize_t			size)
> +{
> +	xfs_off_t		mapping_size;
> +
> +	mapping_size = imap->br_startoff + imap->br_blockcount - iblock;
> +	mapping_size <<= inode->i_blkbits;
> +
> +	ASSERT(mapping_size > 0);
> +	if (mapping_size > size)
> +		mapping_size = size;
> +	if (offset < i_size_read(inode) &&
> +	    offset + mapping_size >= i_size_read(inode)) {
> +		/* limit mapping to block that spans EOF */
> +		mapping_size = roundup_64(i_size_read(inode) - offset,
> +					  1 << inode->i_blkbits);
> +	}
> +	if (mapping_size > LONG_MAX)
> +		mapping_size = LONG_MAX;
> +
> +	bh_result->b_size = mapping_size;
> +}
> +
>  STATIC int
>  __xfs_get_blocks(
>  	struct inode		*inode,
> @@ -1347,6 +1388,11 @@ __xfs_get_blocks(
>  		goto out_unlock;
>  	}
>  
> +	/* trim mapping down to size requested */
> +	if (direct || size > (1 << inode->i_blkbits))
> +		xfs_map_trim_size(inode, iblock, bh_result,
> +				  &imap, offset, size);
> +
>  	if (imap.br_startblock != HOLESTARTBLOCK &&
>  	    imap.br_startblock != DELAYSTARTBLOCK &&
>  	    (create || !ISUNWRITTEN(&imap))) {
> @@ -1388,39 +1434,6 @@ __xfs_get_blocks(
>  		}
>  	}
>  
> -	/*
> -	 * If this is O_DIRECT or the mpage code calling tell them how large
> -	 * the mapping is, so that we can avoid repeated get_blocks calls.
> -	 *
> -	 * If the mapping spans EOF, then we have to break the mapping up as the
> -	 * mapping for blocks beyond EOF must be marked new so that sub block
> -	 * regions can be correctly zeroed. We can't do this for mappings within
> -	 * EOF unless the mapping was just allocated or is unwritten, otherwise
> -	 * the callers would overwrite existing data with zeros. Hence we have
> -	 * to split the mapping into a range up to and including EOF, and a
> -	 * second mapping for beyond EOF.
> -	 */
> -	if (direct || size > (1 << inode->i_blkbits)) {
> -		xfs_off_t		mapping_size;
> -
> -		mapping_size = imap.br_startoff + imap.br_blockcount - iblock;
> -		mapping_size <<= inode->i_blkbits;
> -
> -		ASSERT(mapping_size > 0);
> -		if (mapping_size > size)
> -			mapping_size = size;
> -		if (offset < i_size_read(inode) &&
> -		    offset + mapping_size >= i_size_read(inode)) {
> -			/* limit mapping to block that spans EOF */
> -			mapping_size = roundup_64(i_size_read(inode) - offset,
> -						  1 << inode->i_blkbits);
> -		}
> -		if (mapping_size > LONG_MAX)
> -			mapping_size = LONG_MAX;
> -
> -		bh_result->b_size = mapping_size;
> -	}
> -
>  	return 0;
>  
>  out_unlock:
> -- 
> 2.0.0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/8] xfs: DIO needs an ioend for writes
  2015-04-14  7:26 ` [PATCH 3/8] xfs: DIO needs an ioend for writes Dave Chinner
@ 2015-04-14 14:24   ` Brian Foster
  0 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2015-04-14 14:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Apr 14, 2015 at 05:26:46PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Currently we can only tell DIO completion that an IO requires
> unwritten extent completion. This is done by a hacky non-null
> private pointer passed to IO completion, but the private pointer
> does not actually contain any information that is used.
> 
> We also need to pass to IO completion the fact that the IO may be
> beyond EOF and so a size update transaction needs to be done. This
> is currently determined by checks in the io completion, but we need
> to determine if this is necessary at block mapping time as we need
> to defer the size update transactions to a completion workqueue,
> just like unwritten extent conversion.
> 
> To do this, first we need to allocate and pass an ioend to to IO

Extra 'to' in there...

> completion. Add this for unwritten extent conversion; we'll do the
> EOF updates in the next commit.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_aops.c  | 92 ++++++++++++++++++++++++++++++++++++++++++++++++------
>  fs/xfs/xfs_trace.h |  3 ++
>  2 files changed, 85 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 8f63520..e1fa926 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1234,7 +1234,23 @@ xfs_vm_releasepage(
>  }
>  
>  /*
> - * do all the direct IO specific mapping buffer manipulation here.
> + * When we map a DIO buffer, we need to attach an ioend that describes the type
> + * of write IO we are doing. This passes to the completion function the
> + * operations it needs to perform.
> + *
> + * If we get multiple mappings to in a single IO, we might be mapping dfferent

s/to//
s/dfferent/different/

> + * types. But because the direct IO can only have a single private pointer, we
> + * need to ensure that:
> + *
> + * a) the ioend spans the entire region of the IO; and
> + * b) if it contains unwritten extents, it is *permanently* marked as such
> + *
> + * We could do this by chaining ioends like buffered IO does, but we only
> + * actually get one IO completion callback from the direct IO, and that spans
> + * the entire IO regardless of how many mappings and IOs are needed to complete
> + * the DIO. There is only going to be one reference to the ioend and it's life

s/it's/its/

Reviewed-by: Brian Foster <bfoster@redhat.com>

> + * cycle is constrained by the DIO completion code. hence we don't need
> + * reference counting here.
>   */
>  static void
>  xfs_map_direct(
> @@ -1243,10 +1259,42 @@ xfs_map_direct(
>  	struct xfs_bmbt_irec	*imap,
>  	xfs_off_t		offset)
>  {
> -	if (ISUNWRITTEN(imap)) {
> -		bh_result->b_private = inode;
> -		set_buffer_defer_completion(bh_result);
> +	struct xfs_ioend	*ioend;
> +	xfs_off_t		size = bh_result->b_size;
> +	int			type;
> +
> +	if (ISUNWRITTEN(imap))
> +		type = XFS_IO_UNWRITTEN;
> +	else
> +		type = XFS_IO_OVERWRITE;
> +
> +	trace_xfs_gbmap_direct(XFS_I(inode), offset, size, type, imap);
> +
> +	if (bh_result->b_private) {
> +		ioend = bh_result->b_private;
> +		ASSERT(ioend->io_size > 0);
> +		ASSERT(offset >= ioend->io_offset);
> +		if (offset + size > ioend->io_offset + ioend->io_size)
> +			ioend->io_size = offset - ioend->io_offset + size;
> +
> +		if (type == XFS_IO_UNWRITTEN && type != ioend->io_type)
> +			ioend->io_type = XFS_IO_UNWRITTEN;
> +
> +		trace_xfs_gbmap_direct_update(XFS_I(inode), ioend->io_offset,
> +					      ioend->io_size, ioend->io_type,
> +					      imap);
> +	} else {
> +		ioend = xfs_alloc_ioend(inode, type);
> +		ioend->io_offset = offset;
> +		ioend->io_size = size;
> +		bh_result->b_private = ioend;
> +
> +		trace_xfs_gbmap_direct_new(XFS_I(inode), offset, size, type,
> +					   imap);
>  	}
> +
> +	if (ioend->io_type == XFS_IO_UNWRITTEN)
> +		set_buffer_defer_completion(bh_result);
>  }
>  
>  
> @@ -1378,10 +1426,13 @@ __xfs_get_blocks(
>  
>  			xfs_iunlock(ip, lockmode);
>  		}
> -
> -		trace_xfs_get_blocks_alloc(ip, offset, size, 0, &imap);
> +		trace_xfs_get_blocks_alloc(ip, offset, size,
> +				ISUNWRITTEN(&imap) ? XFS_IO_UNWRITTEN
> +						   : XFS_IO_DELALLOC, &imap);
>  	} else if (nimaps) {
> -		trace_xfs_get_blocks_found(ip, offset, size, 0, &imap);
> +		trace_xfs_get_blocks_found(ip, offset, size,
> +				ISUNWRITTEN(&imap) ? XFS_IO_UNWRITTEN
> +						   : XFS_IO_OVERWRITE, &imap);
>  		xfs_iunlock(ip, lockmode);
>  	} else {
>  		trace_xfs_get_blocks_notfound(ip, offset, size);
> @@ -1478,9 +1529,28 @@ xfs_end_io_direct_write(
>  	struct inode		*inode = file_inode(iocb->ki_filp);
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_ioend	*ioend = private;
>  
>  	if (XFS_FORCED_SHUTDOWN(mp))
> -		return;
> +		goto out_destroy_ioend;
> +
> +	/*
> +	 * dio completion end_io functions are only called on writes if more
> +	 * than 0 bytes was written.
> +	 */
> +	ASSERT(size > 0);
> +
> +	/*
> +	 * The ioend only maps whole blocks, while the IO may be sector aligned.
> +	 * Hence the ioend offset/size may not match the IO offset/size exactly,
> +	 * but should span it completely. Write the IO sizes into the ioend so
> +	 * that completion processing does the right thing.
> +	 */
> +	ASSERT(size <= ioend->io_size);
> +	ASSERT(offset >= ioend->io_offset);
> +	ASSERT(offset + size <= ioend->io_offset + ioend->io_size);
> +	ioend->io_size = size;
> +	ioend->io_offset = offset;
>  
>  	/*
>  	 * While the generic direct I/O code updates the inode size, it does
> @@ -1500,7 +1570,7 @@ xfs_end_io_direct_write(
>  	 * we can pass the ioend to the direct IO allocation callbacks and
>  	 * avoid nesting that way.
>  	 */
> -	if (private && size > 0) {
> +	if (ioend->io_type == XFS_IO_UNWRITTEN) {
>  		xfs_iomap_write_unwritten(ip, offset, size);
>  	} else if (offset + size > ip->i_d.di_size) {
>  		struct xfs_trans	*tp;
> @@ -1510,11 +1580,13 @@ xfs_end_io_direct_write(
>  		error = xfs_trans_reserve(tp, &M_RES(mp)->tr_fsyncts, 0, 0);
>  		if (error) {
>  			xfs_trans_cancel(tp, 0);
> -			return;
> +			goto out_destroy_ioend;
>  		}
>  
>  		xfs_setfilesize(ip, tp, offset, size);
>  	}
> +out_destroy_ioend:
> +	xfs_destroy_ioend(ioend);
>  }
>  
>  STATIC ssize_t
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index b2a45cc..e78b64e 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -1221,6 +1221,9 @@ DEFINE_IOMAP_EVENT(xfs_map_blocks_found);
>  DEFINE_IOMAP_EVENT(xfs_map_blocks_alloc);
>  DEFINE_IOMAP_EVENT(xfs_get_blocks_found);
>  DEFINE_IOMAP_EVENT(xfs_get_blocks_alloc);
> +DEFINE_IOMAP_EVENT(xfs_gbmap_direct);
> +DEFINE_IOMAP_EVENT(xfs_gbmap_direct_new);
> +DEFINE_IOMAP_EVENT(xfs_gbmap_direct_update);
>  
>  DECLARE_EVENT_CLASS(xfs_simple_io_class,
>  	TP_PROTO(struct xfs_inode *ip, xfs_off_t offset, ssize_t count),
> -- 
> 2.0.0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/8] xfs: handle DIO overwrite EOF update completion correctly
  2015-04-14  7:26 ` [PATCH 4/8] xfs: handle DIO overwrite EOF update completion correctly Dave Chinner
@ 2015-04-14 14:35   ` Brian Foster
  2015-04-14 15:35     ` Brian Foster
  2015-04-14 20:12     ` Dave Chinner
  0 siblings, 2 replies; 21+ messages in thread
From: Brian Foster @ 2015-04-14 14:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Apr 14, 2015 at 05:26:47PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Currently a DIO overwrite that extends the EOF (e.g. sub-block IO or
> write into allocated blocks beyond EOF) requires a transaction for
> the EOF update. This is done in IO completion context, but we aren't
> explicitly handling this situation properly and so it can run in
> interrupt context. Ensure that we defer IO that spans EOF correctly
> to the DIO completion workqueue, and now that we have an ioend in IO
> completion we can use the common ioend completion path to do all the
> work.
> 
> Note: we do not preallocate the append transaction as we can have
> multiple mapping and allocation calls per direct IO. Hence
> preallocating can still leave us with nested transactions by
> attempting to map and allocate more blocks after we've preallocated
> an append transaction.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_aops.c  | 61 +++++++++++++++++++++++++++---------------------------
>  fs/xfs/xfs_trace.h |  1 +
>  2 files changed, 31 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index e1fa926..e3968a3 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1293,7 +1293,7 @@ xfs_map_direct(
>  					   imap);
>  	}
>  
> -	if (ioend->io_type == XFS_IO_UNWRITTEN)
> +	if (ioend->io_type == XFS_IO_UNWRITTEN || xfs_ioend_is_append(ioend))
>  		set_buffer_defer_completion(bh_result);
>  }
>  
> @@ -1531,8 +1531,10 @@ xfs_end_io_direct_write(
>  	struct xfs_mount	*mp = ip->i_mount;
>  	struct xfs_ioend	*ioend = private;
>  
> +	trace_xfs_gbmap_direct_endio(ip, offset, size, ioend->io_type, NULL);
> +
>  	if (XFS_FORCED_SHUTDOWN(mp))
> -		goto out_destroy_ioend;
> +		goto out_end_io;
>  
>  	/*
>  	 * dio completion end_io functions are only called on writes if more
> @@ -1553,40 +1555,37 @@ xfs_end_io_direct_write(
>  	ioend->io_offset = offset;
>  
>  	/*
> -	 * While the generic direct I/O code updates the inode size, it does
> -	 * so only after the end_io handler is called, which means our
> -	 * end_io handler thinks the on-disk size is outside the in-core
> -	 * size.  To prevent this just update it a little bit earlier here.
> +	 * The ioend tells us whether we are doing unwritten extent conversion
> +	 * or an append transaction that updates the on-disk file size. These
> +	 * cases are the only cases where we should *potentially* be needing
> +	 * to update the VFS inode size. When the ioend indicates this, we
> +	 * are *guaranteed* to be running in non-interrupt context.
> +	 *
> +	 * We need to update the in-core inode size here so that we don't end up
> +	 * with the on-disk inode size being outside the in-core inode size.
> +	 * While we can do this in the process context after the IO has
> +	 * completed, this does not work for AIO and hence we always update
> +	 * the in-core inode size here if necessary.
>  	 */
> -	if (offset + size > i_size_read(inode))
> -		i_size_write(inode, offset + size);
> +	if (ioend->io_type == XFS_IO_UNWRITTEN || xfs_ioend_is_append(ioend)) {
> +		if (offset + size > i_size_read(inode))
> +			i_size_write(inode, offset + size);
> +	} else
> +		ASSERT(offset + size <= i_size_read(inode));

The code was obviously incorrect prior to this change, potentially
running some of these transactions in irq context. That said, it occurs
to me that one thing that the previous implementation looked to handle
that this does not is racing of in-flight aio with other operations.
E.g., what happens now if a non-extending, overwrite aio is submitted
and races with a truncate that causes it to be extending by the time we
get here? It looks like it would have been racy regardless, so maybe
that's just a separate problem...

>  
>  	/*
> -	 * For direct I/O we do not know if we need to allocate blocks or not,
> -	 * so we can't preallocate an append transaction, as that results in
> -	 * nested reservations and log space deadlocks. Hence allocate the
> -	 * transaction here. While this is sub-optimal and can block IO
> -	 * completion for some time, we're stuck with doing it this way until
> -	 * we can pass the ioend to the direct IO allocation callbacks and
> -	 * avoid nesting that way.
> +	 * If we are doing an append IO that needs to update the EOF on disk,
> +	 * do the transaction reserve now so we can use common end io
> +	 * processing. Stashing the error (if there is one) in the ioend will
> +	 * result in the ioend processing passing on the error if it is
> +	 * possible as we can't return it from here.
>  	 */
> -	if (ioend->io_type == XFS_IO_UNWRITTEN) {
> -		xfs_iomap_write_unwritten(ip, offset, size);
> -	} else if (offset + size > ip->i_d.di_size) {
> -		struct xfs_trans	*tp;
> -		int			error;
> -
> -		tp = xfs_trans_alloc(mp, XFS_TRANS_FSYNC_TS);
> -		error = xfs_trans_reserve(tp, &M_RES(mp)->tr_fsyncts, 0, 0);
> -		if (error) {
> -			xfs_trans_cancel(tp, 0);
> -			goto out_destroy_ioend;
> -		}
> +	if (ioend->io_type == XFS_IO_OVERWRITE && xfs_ioend_is_append(ioend))
> +		ioend->io_error = xfs_setfilesize_trans_alloc(ioend);

As you mentioned previously, we no longer need the transaction context
manipulation stuff in xfs_setfilesize_trans_alloc() with this approach.
It's still called from the writepage path though, so I guess it needs to
stay.
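
(For reference, the bit I mean is roughly the tail of
xfs_setfilesize_trans_alloc() below -- paraphrasing from memory rather
than quoting the tree, so treat it as a sketch:)

	ioend->io_append_trans = tp;

	/*
	 * We pass freeze protection along with the transaction, so tell
	 * lockdep we released it here...
	 */
	rwsem_release(&ioend->io_inode->i_sb->s_writers.lock_map[SB_FREEZE_FS-1],
		      1, _THIS_IP_);
	/*
	 * ...and hand the transaction context off to the completion thread
	 * by restoring the submitter's PF_FSTRANS state.
	 */
	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);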

Brian

>  
> -		xfs_setfilesize(ip, tp, offset, size);
> -	}
> -out_destroy_ioend:
> -	xfs_destroy_ioend(ioend);
> +out_end_io:
> +	xfs_end_io(&ioend->io_work);
> +	return;
>  }
>  
>  STATIC ssize_t
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index e78b64e..967993b 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -1224,6 +1224,7 @@ DEFINE_IOMAP_EVENT(xfs_get_blocks_alloc);
>  DEFINE_IOMAP_EVENT(xfs_gbmap_direct);
>  DEFINE_IOMAP_EVENT(xfs_gbmap_direct_new);
>  DEFINE_IOMAP_EVENT(xfs_gbmap_direct_update);
> +DEFINE_IOMAP_EVENT(xfs_gbmap_direct_endio);
>  
>  DECLARE_EVENT_CLASS(xfs_simple_io_class,
>  	TP_PROTO(struct xfs_inode *ip, xfs_off_t offset, ssize_t count),
> -- 
> 2.0.0
> 

* Re: [PATCH 5/8] xfs: DIO writes within EOF don't need an ioend
  2015-04-14  7:26 ` [PATCH 5/8] xfs: DIO writes within EOF don't need an ioend Dave Chinner
@ 2015-04-14 14:35   ` Brian Foster
  2015-04-14 20:18     ` Dave Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Brian Foster @ 2015-04-14 14:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Apr 14, 2015 at 05:26:48PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> DIO writes that lie entirely within EOF have nothing to do in IO
> completion. In this case, we don't need no steekin' ioend, and so we
> can avoid allocating an ioend until we have a mapping that spans
> EOF.
> 
> This means that IO completion has two contexts - deferred completion
> to the dio workqueue that uses an ioend, and interrupt completion
> that does nothing because there is nothing that can be done in this
> context.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_aops.c  | 62 ++++++++++++++++++++++++++++++------------------------
>  fs/xfs/xfs_trace.h |  1 +
>  2 files changed, 36 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index e3968a3..55356f6 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1234,15 +1234,19 @@ xfs_vm_releasepage(
>  }
>  
>  /*
> - * When we map a DIO buffer, we need to attach an ioend that describes the type
> + * When we map a DIO buffer, we may need to attach an ioend that describes the type
>   * of write IO we are doing. This passes to the completion function the
> - * operations it needs to perform.
> + * operations it needs to perform. If the mapping is for an overwrite wholly
> + * within the EOF then we don't need an ioend and so we don't allocate one. This
> + * avoids the unnecessary overhead of allocating and freeing ioends for
> + * workloads that don't require transactions on IO completion.
>   *
>   * If we get multiple mappings in a single IO, we might be mapping different
>   * types. But because the direct IO can only have a single private pointer, we
>   * need to ensure that:
>   *
> - * a) the ioend spans the entire region of the IO; and
> + * a) i) the ioend spans the entire region of unwritten mappings; or
> + *    ii) the ioend spans all the mappings that cross or are beyond EOF; and
>   * b) if it contains unwritten extents, it is *permanently* marked as such
>   *
>   * We could do this by chaining ioends like buffered IO does, but we only
> @@ -1283,7 +1287,8 @@ xfs_map_direct(
>  		trace_xfs_gbmap_direct_update(XFS_I(inode), ioend->io_offset,
>  					      ioend->io_size, ioend->io_type,
>  					      imap);
> -	} else {
> +	} else if (type == XFS_IO_UNWRITTEN ||
> +		   offset + size > i_size_read(inode)) {
>  		ioend = xfs_alloc_ioend(inode, type);
>  		ioend->io_offset = offset;
>  		ioend->io_size = size;
> @@ -1291,10 +1296,13 @@ xfs_map_direct(
>  
>  		trace_xfs_gbmap_direct_new(XFS_I(inode), offset, size, type,
>  					   imap);
> +	} else {
> +		trace_xfs_gbmap_direct_none(XFS_I(inode), offset, size, type,
> +					    imap);

Do we really need a tracepoint to indicate none of the other tracepoints
were hit? It stands out to me only because we already have the
unconditional trace_xfs_gbmap_direct() above. I'd say kill one or the
other, but I think we really want the function entry one because it
disambiguates individual get_block instances from the aggregate mapping.

> +		return;
>  	}
>  
> -	if (ioend->io_type == XFS_IO_UNWRITTEN || xfs_ioend_is_append(ioend))
> -		set_buffer_defer_completion(bh_result);
> +	set_buffer_defer_completion(bh_result);

I'd move this up into the block where we allocate an ioend. That's the
only place we need it and doing so eliminates the need for the 'else {
return; }' thing entirely.
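
Something like this, I mean (untested, just to illustrate the shuffle):

	} else if (type == XFS_IO_UNWRITTEN ||
		   offset + size > i_size_read(inode)) {
		ioend = xfs_alloc_ioend(inode, type);
		ioend->io_offset = offset;
		ioend->io_size = size;
		/* ... wire the ioend up to bh_result as the patch already does ... */
		set_buffer_defer_completion(bh_result);

		trace_xfs_gbmap_direct_new(XFS_I(inode), offset, size, type,
					   imap);
	} else {
		trace_xfs_gbmap_direct_none(XFS_I(inode), offset, size, type,
					    imap);
	}
}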

>  }
>  
>  
> @@ -1515,9 +1523,11 @@ xfs_get_blocks_direct(
>  /*
>   * Complete a direct I/O write request.
>   *
> - * If the private argument is non-NULL __xfs_get_blocks signals us that we
> - * need to issue a transaction to convert the range from unwritten to written
> - * extents.
> + * The ioend structure is passed from __xfs_get_blocks() to tell us what to do.
> + * If no ioend exists (i.e. @private == NULL) then the write IO is an overwrite
> + * wholly within the EOF and so there is nothing for us to do. Note that in this
> + * case the completion can be called in interrupt context, whereas if we have an
> + * ioend we will always be called in task context (i.e. from a workqueue).
>   */
>  STATIC void
>  xfs_end_io_direct_write(
> @@ -1531,7 +1541,10 @@ xfs_end_io_direct_write(
>  	struct xfs_mount	*mp = ip->i_mount;
>  	struct xfs_ioend	*ioend = private;
>  
> -	trace_xfs_gbmap_direct_endio(ip, offset, size, ioend->io_type, NULL);
> +	trace_xfs_gbmap_direct_endio(ip, offset, size,
> +				     ioend ? ioend->io_type : 0, NULL);
> +	if (!ioend)
> +		return;

Can we keep the i_size assert we've lost below?

ASSERT(offset + size <= i_size_read(inode));

Brian

>  
>  	if (XFS_FORCED_SHUTDOWN(mp))
>  		goto out_end_io;
> @@ -1544,12 +1557,12 @@ xfs_end_io_direct_write(
>  
>  	/*
>  	 * The ioend only maps whole blocks, while the IO may be sector aligned.
> -	 * Hence the ioend offset/size may not match the IO offset/size exactly,
> -	 * but should span it completely. Write the IO sizes into the ioend so
> -	 * that completion processing does the right thing.
> +	 * Hence the ioend offset/size may not match the IO offset/size exactly.
> +	 * Because we don't map overwrites within EOF into the ioend, the offset
> +	 * may not match, but only if the endio spans EOF.  Either way, write
> +	 * the IO sizes into the ioend so that completion processing does the
> +	 * right thing.
>  	 */
> -	ASSERT(size <= ioend->io_size);
> -	ASSERT(offset >= ioend->io_offset);
>  	ASSERT(offset + size <= ioend->io_offset + ioend->io_size);
>  	ioend->io_size = size;
>  	ioend->io_offset = offset;
> @@ -1558,20 +1571,15 @@ xfs_end_io_direct_write(
>  	 * The ioend tells us whether we are doing unwritten extent conversion
>  	 * or an append transaction that updates the on-disk file size. These
>  	 * cases are the only cases where we should *potentially* be needing
> -	 * to update the VFS inode size. When the ioend indicates this, we
> -	 * are *guaranteed* to be running in non-interrupt context.
> +	 * to update the VFS inode size.
>  	 *
>  	 * We need to update the in-core inode size here so that we don't end up
> -	 * with the on-disk inode size being outside the in-core inode size.
> -	 * While we can do this in the process context after the IO has
> -	 * completed, this does not work for AIO and hence we always update
> -	 * the in-core inode size here if necessary.
> +	 * with the on-disk inode size being outside the in-core inode size. We
> +	 * have no other method of updating EOF for AIO, so always do it here
> +	 * if necessary.
>  	 */
> -	if (ioend->io_type == XFS_IO_UNWRITTEN || xfs_ioend_is_append(ioend)) {
> -		if (offset + size > i_size_read(inode))
> -			i_size_write(inode, offset + size);
> -	} else
> -		ASSERT(offset + size <= i_size_read(inode));
> +	if (offset + size > i_size_read(inode))
> +		i_size_write(inode, offset + size);
>  
>  	/*
>  	 * If we are doing an append IO that needs to update the EOF on disk,
> @@ -1580,7 +1588,7 @@ xfs_end_io_direct_write(
>  	 * result in the ioend processing passing on the error if it is
>  	 * possible as we can't return it from here.
>  	 */
> -	if (ioend->io_type == XFS_IO_OVERWRITE && xfs_ioend_is_append(ioend))
> +	if (ioend->io_type == XFS_IO_OVERWRITE)
>  		ioend->io_error = xfs_setfilesize_trans_alloc(ioend);
>  
>  out_end_io:
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 967993b..615781b 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -1224,6 +1224,7 @@ DEFINE_IOMAP_EVENT(xfs_get_blocks_alloc);
>  DEFINE_IOMAP_EVENT(xfs_gbmap_direct);
>  DEFINE_IOMAP_EVENT(xfs_gbmap_direct_new);
>  DEFINE_IOMAP_EVENT(xfs_gbmap_direct_update);
> +DEFINE_IOMAP_EVENT(xfs_gbmap_direct_none);
>  DEFINE_IOMAP_EVENT(xfs_gbmap_direct_endio);
>  
>  DECLARE_EVENT_CLASS(xfs_simple_io_class,
> -- 
> 2.0.0
> 

* Re: [PATCH 6/8] xfs: DIO write completion size updates race
  2015-04-14  7:26 ` [PATCH 6/8] xfs: DIO write completion size updates race Dave Chinner
@ 2015-04-14 14:35   ` Brian Foster
  0 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2015-04-14 14:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Apr 14, 2015 at 05:26:49PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xfs_end_io_direct_write() can race with other IO completions when
> updating the in-core inode size. The IO completion processing is not
> serialised for direct IO - they are done either under the
> IOLOCK_SHARED for non-AIO DIO, and without any IOLOCK held at all
> during AIO DIO completion. Hence the non-atomic test-and-set update
> of the in-core inode size is racy and can result in the in-core
> inode size going backwards if the race is hit just right.
> 
> If the inode size goes backwards, this can trigger the EOF zeroing
> code to run incorrectly on the next IO, which then will zero data
> that has successfully been written to disk by a previous DIO.
> 
> To fix this bug, we need to serialise the test/set updates of the
> in-core inode size. This first patch introduces locking around the
> relevant updates and checks in the DIO path. Because we now have an
> ioend in xfs_end_io_direct_write(), we know exactly when we are
> doing an IO that requires an in-core EOF update, and we know that
> they are not running in interrupt context. As such, we do not need to
> use irqsave() spinlock variants to protect against interrupts while
> the lock is held.
> 
> Hence we can use an existing spinlock in the inode to do this
> serialisation and so not need to grow the struct xfs_inode just to
> work around this problem.
> 
> This patch does not address the test/set EOF update in
> generic_file_write_direct() for various reasons - that will be done
> as a followup with separate explanation.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>
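
As an aside, for anyone wondering how a size update can go backwards at
all: here is a minimal userspace sketch of the unserialised test-and-set
pattern this patch puts a lock around. Purely illustrative, none of the
names below are kernel code.

	/* size_race.c - illustrative only; build with: gcc -pthread size_race.c */
	#include <pthread.h>
	#include <stdio.h>

	static long isize;			/* stand-in for the in-core inode size */
	static pthread_mutex_t size_lock = PTHREAD_MUTEX_INITIALIZER;

	/* the old pattern: racy if two completions run concurrently */
	static void update_size_racy(long end)
	{
		if (end > isize)
			isize = end;
	}

	/* what the patch effectively does with ip->i_flags_lock */
	static void update_size_locked(long end)
	{
		pthread_mutex_lock(&size_lock);
		if (end > isize)
			isize = end;
		pthread_mutex_unlock(&size_lock);
	}

	static void *completion(void *arg)
	{
		/* swap in update_size_racy() to get the unserialised variant */
		update_size_locked((long)arg);
		return NULL;
	}

	int main(void)
	{
		pthread_t a, b;

		/* two DIO completions for adjacent ranges finishing together */
		pthread_create(&a, NULL, completion, (void *)4096L);
		pthread_create(&b, NULL, completion, (void *)8192L);
		pthread_join(a, NULL);
		pthread_join(b, NULL);
		printf("final size %ld\n", isize);	/* always 8192 with the lock */
		return 0;
	}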

>  fs/xfs/xfs_aops.c |  7 +++++++
>  fs/xfs/xfs_file.c | 13 ++++++++++++-
>  2 files changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 55356f6..cd6b2e0 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1577,9 +1577,16 @@ xfs_end_io_direct_write(
>  	 * with the on-disk inode size being outside the in-core inode size. We
>  	 * have no other method of updating EOF for AIO, so always do it here
>  	 * if necessary.
> +	 *
> +	 * We need to lock the test/set EOF update as we can be racing with
> +	 * other IO completions here to update the EOF. Failing to serialise
> +	 * here can result in EOF moving backwards and Bad Things Happen when
> +	 * that occurs.
>  	 */
> +	spin_lock(&ip->i_flags_lock);
>  	if (offset + size > i_size_read(inode))
>  		i_size_write(inode, offset + size);
> +	spin_unlock(&ip->i_flags_lock);
>  
>  	/*
>  	 * If we are doing an append IO that needs to update the EOF on disk,
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index c203839..5d5b4ba 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -569,10 +569,20 @@ restart:
>  	 * write.  If zeroing is needed and we are currently holding the
>  	 * iolock shared, we need to update it to exclusive which implies
>  	 * having to redo all checks before.
> +	 *
> +	 * We need to serialise against EOF updates that occur in IO
> +	 * completions here. We want to make sure that nobody is changing the
> +	 * size while we do this check until we have placed an IO barrier (i.e.
> +	 * hold the XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.
> +	 * The spinlock effectively forms a memory barrier once we have the
> +	 * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value
> +	 * and hence be able to correctly determine if we need to run zeroing.
>  	 */
> +	spin_lock(&ip->i_flags_lock);
>  	if (*pos > i_size_read(inode)) {
>  		bool	zero = false;
>  
> +		spin_unlock(&ip->i_flags_lock);
>  		if (*iolock == XFS_IOLOCK_SHARED) {
>  			xfs_rw_iunlock(ip, *iolock);
>  			*iolock = XFS_IOLOCK_EXCL;
> @@ -582,7 +592,8 @@ restart:
>  		error = xfs_zero_eof(ip, *pos, i_size_read(inode), &zero);
>  		if (error)
>  			return error;
> -	}
> +	} else
> +		spin_unlock(&ip->i_flags_lock);
>  
>  	/*
>  	 * Updating the timestamps will grab the ilock again from
> -- 
> 2.0.0
> 

* Re: [PATCH 7/8] xfs: direct IO EOF zeroing needs to drain AIO
  2015-04-14  7:26 ` [PATCH 7/8] xfs: direct IO EOF zeroing needs to drain AIO Dave Chinner
@ 2015-04-14 14:35   ` Brian Foster
  0 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2015-04-14 14:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Apr 14, 2015 at 05:26:50PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we are doing AIO DIO writes, the IOLOCK only provides an IO
> submission barrier. When we need to do EOF zeroing, we need to ensure
> that no other IO is in progress and all pending in-core EOF updates
> have been completed. This requires us to wait for all outstanding
> AIO DIO writes to the inode to complete and, if necessary, run their
> EOF updates.
> 
> Once all the EOF updates are complete, we can then restart
> xfs_file_aio_write_checks() while holding the IOLOCK_EXCL, knowing
> that EOF is up to date and we have exclusive IO access to the file
> so we can run EOF block zeroing if we need to without interference.
> This gives EOF zeroing the same exclusivity against other IO as we
> provide truncate operations.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_file.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 5d5b4ba..c398ec7 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -587,6 +587,16 @@ restart:
>  			xfs_rw_iunlock(ip, *iolock);
>  			*iolock = XFS_IOLOCK_EXCL;
>  			xfs_rw_ilock(ip, *iolock);
> +
> +			/*
> +			 * We now have an IO submission barrier in place, but
> +			 * AIO can do EOF updates during IO completion and hence
> +			 * we now need to wait for all of them to drain. Non-AIO
> +			 * DIO will have drained before we are given the
> +			 * XFS_IOLOCK_EXCL, and so for most cases this wait is a
> +			 * no-op.
> +			 */
> +			inode_dio_wait(inode);
>  			goto restart;
>  		}
>  		error = xfs_zero_eof(ip, *pos, i_size_read(inode), &zero);
> -- 
> 2.0.0
> 

* Re: [PATCH 8/8] xfs: using generic_file_direct_write() is unnecessary
  2015-04-14  7:26 ` [PATCH 8/8] xfs: using generic_file_direct_write() is unnecessary Dave Chinner
@ 2015-04-14 14:35   ` Brian Foster
  0 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2015-04-14 14:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Apr 14, 2015 at 05:26:51PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> generic_file_direct_write() does all sorts of things to make DIO
> work "sorta ok" with mixed buffered IO workloads. We already do
> most of this work in xfs_file_dio_aio_write() because of the locking
> requirements, so there's only a couple of things it does for us.
> 
> The first thing is that it does a page cache invalidation after the
> ->direct_IO callout. This can easily be added to the XFS code.
> 
> The second thing it does is that if data was written, it updates the
> iov_iter structure to reflect the data written, and then does EOF
> size updates if necessary. For XFS, these EOF size updates are now
> not necessary, as we do them safely and race-free in IO completion
> context. That leaves just the iov_iter update, and that's also moved
> to the XFS code.
> 
> Therefore we don't need to call generic_file_direct_write() and in
> doing so remove redundant buffered writeback and page cache
> invalidation calls from the DIO submission path. We also remove a
> racy EOF size update, and make the DIO submission code in XFS much
> easier to follow. Wins all round, really.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_file.c | 23 ++++++++++++++++++++---
>  1 file changed, 20 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index c398ec7..3a5d305 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -665,6 +665,8 @@ xfs_file_dio_aio_write(
>  	int			iolock;
>  	size_t			count = iov_iter_count(from);
>  	loff_t			pos = iocb->ki_pos;
> +	loff_t			end;
> +	struct iov_iter		data;
>  	struct xfs_buftarg	*target = XFS_IS_REALTIME_INODE(ip) ?
>  					mp->m_rtdev_targp : mp->m_ddev_targp;
>  
> @@ -704,10 +706,11 @@ xfs_file_dio_aio_write(
>  	if (ret)
>  		goto out;
>  	iov_iter_truncate(from, count);
> +	end = pos + count - 1;
>  
>  	if (mapping->nrpages) {
>  		ret = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
> -						    pos, pos + count - 1);
> +						   pos, end);
>  		if (ret)
>  			goto out;
>  		/*
> @@ -717,7 +720,7 @@ xfs_file_dio_aio_write(
>  		 */
>  		ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
>  					pos >> PAGE_CACHE_SHIFT,
> -					(pos + count - 1) >> PAGE_CACHE_SHIFT);
> +					end >> PAGE_CACHE_SHIFT);
>  		WARN_ON_ONCE(ret);
>  		ret = 0;
>  	}
> @@ -734,8 +737,22 @@ xfs_file_dio_aio_write(
>  	}
>  
>  	trace_xfs_file_direct_write(ip, count, iocb->ki_pos, 0);
> -	ret = generic_file_direct_write(iocb, from, pos);
>  
> +	data = *from;
> +	ret = mapping->a_ops->direct_IO(WRITE, iocb, &data, pos);
> +
> +	/* see generic_file_direct_write() for why this is necessary */
> +	if (mapping->nrpages) {
> +		invalidate_inode_pages2_range(mapping,
> +					      pos >> PAGE_CACHE_SHIFT,
> +					      end >> PAGE_CACHE_SHIFT);
> +	}
> +
> +	if (ret > 0) {
> +		pos += ret;
> +		iov_iter_advance(from, ret);
> +		iocb->ki_pos = pos;
> +	}
>  out:
>  	xfs_rw_iunlock(ip, iolock);
>  
> -- 
> 2.0.0
> 

* Re: [PATCH 4/8] xfs: handle DIO overwrite EOF update completion correctly
  2015-04-14 14:35   ` Brian Foster
@ 2015-04-14 15:35     ` Brian Foster
  2015-04-14 20:12     ` Dave Chinner
  1 sibling, 0 replies; 21+ messages in thread
From: Brian Foster @ 2015-04-14 15:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Apr 14, 2015 at 10:35:02AM -0400, Brian Foster wrote:
> On Tue, Apr 14, 2015 at 05:26:47PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Currently a DIO overwrite that extends the EOF (e.g. sub-block IO or
> > write into allocated blocks beyond EOF) requires a transaction for
> > the EOF update. This is done in IO completion context, but we aren't
> > explicitly handling this situation properly and so it can run in
> > interrupt context. Ensure that we defer IO that spans EOF correctly
> > to the DIO completion workqueue, and now that we have an ioend in IO
> > completion we can use the common ioend completion path to do all the
> > work.
> > 
> > Note: we do not preallocate the append transaction as we can have
> > multiple mapping and allocation calls per direct IO. Hence
> > preallocating can still leave us with nested transactions by
> > attempting to map and allocate more blocks after we've preallocated
> > an append transaction.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_aops.c  | 61 +++++++++++++++++++++++++++---------------------------
> >  fs/xfs/xfs_trace.h |  1 +
> >  2 files changed, 31 insertions(+), 31 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index e1fa926..e3968a3 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -1293,7 +1293,7 @@ xfs_map_direct(
> >  					   imap);
> >  	}
> >  
> > -	if (ioend->io_type == XFS_IO_UNWRITTEN)
> > +	if (ioend->io_type == XFS_IO_UNWRITTEN || xfs_ioend_is_append(ioend))
> >  		set_buffer_defer_completion(bh_result);
> >  }
> >  
> > @@ -1531,8 +1531,10 @@ xfs_end_io_direct_write(
> >  	struct xfs_mount	*mp = ip->i_mount;
> >  	struct xfs_ioend	*ioend = private;
> >  
> > +	trace_xfs_gbmap_direct_endio(ip, offset, size, ioend->io_type, NULL);
> > +
> >  	if (XFS_FORCED_SHUTDOWN(mp))
> > -		goto out_destroy_ioend;
> > +		goto out_end_io;
> >  
> >  	/*
> >  	 * dio completion end_io functions are only called on writes if more
> > @@ -1553,40 +1555,37 @@ xfs_end_io_direct_write(
> >  	ioend->io_offset = offset;
> >  
> >  	/*
> > -	 * While the generic direct I/O code updates the inode size, it does
> > -	 * so only after the end_io handler is called, which means our
> > -	 * end_io handler thinks the on-disk size is outside the in-core
> > -	 * size.  To prevent this just update it a little bit earlier here.
> > +	 * The ioend tells us whether we are doing unwritten extent conversion
> > +	 * or an append transaction that updates the on-disk file size. These
> > +	 * cases are the only cases where we should *potentially* be needing
> > +	 * to update the VFS inode size. When the ioend indicates this, we
> > +	 * are *guaranteed* to be running in non-interrupt context.
> > +	 *
> > +	 * We need to update the in-core inode size here so that we don't end up
> > +	 * with the on-disk inode size being outside the in-core inode size.
> > +	 * While we can do this in the process context after the IO has
> > +	 * completed, this does not work for AIO and hence we always update
> > +	 * the in-core inode size here if necessary.
> >  	 */
> > -	if (offset + size > i_size_read(inode))
> > -		i_size_write(inode, offset + size);
> > +	if (ioend->io_type == XFS_IO_UNWRITTEN || xfs_ioend_is_append(ioend)) {
> > +		if (offset + size > i_size_read(inode))
> > +			i_size_write(inode, offset + size);
> > +	} else
> > +		ASSERT(offset + size <= i_size_read(inode));
> 
> The code was obviously incorrect prior to this change, potentially
> running some of these transactions in irq context. That said, it occurs
> to me that one thing that the previous implementation looked to handle
> that this does not is racing of in-flight aio with other operations.
> E.g., what happens now if a non-extending, overwrite aio is submitted
> and races with a truncate that causes it to be extending by the time we
> get here? It looks like it would have been racy regardless, so maybe
> that's just a separate problem...
> 

Looking further, we actually wait on dio in the truncate path with
IOLOCK_EXCL (e.g., similar to what we now do for extending aio itself),
so this is probably irrelevant...

Brian

> >  
> >  	/*
> > -	 * For direct I/O we do not know if we need to allocate blocks or not,
> > -	 * so we can't preallocate an append transaction, as that results in
> > -	 * nested reservations and log space deadlocks. Hence allocate the
> > -	 * transaction here. While this is sub-optimal and can block IO
> > -	 * completion for some time, we're stuck with doing it this way until
> > -	 * we can pass the ioend to the direct IO allocation callbacks and
> > -	 * avoid nesting that way.
> > +	 * If we are doing an append IO that needs to update the EOF on disk,
> > +	 * do the transaction reserve now so we can use common end io
> > +	 * processing. Stashing the error (if there is one) in the ioend will
> > +	 * result in the ioend processing passing on the error if it is
> > +	 * possible as we can't return it from here.
> >  	 */
> > -	if (ioend->io_type == XFS_IO_UNWRITTEN) {
> > -		xfs_iomap_write_unwritten(ip, offset, size);
> > -	} else if (offset + size > ip->i_d.di_size) {
> > -		struct xfs_trans	*tp;
> > -		int			error;
> > -
> > -		tp = xfs_trans_alloc(mp, XFS_TRANS_FSYNC_TS);
> > -		error = xfs_trans_reserve(tp, &M_RES(mp)->tr_fsyncts, 0, 0);
> > -		if (error) {
> > -			xfs_trans_cancel(tp, 0);
> > -			goto out_destroy_ioend;
> > -		}
> > +	if (ioend->io_type == XFS_IO_OVERWRITE && xfs_ioend_is_append(ioend))
> > +		ioend->io_error = xfs_setfilesize_trans_alloc(ioend);
> 
> As you mentioned previously, we no longer need the transaction context
> manipulation stuff in xfs_setfilesize_trans_alloc() with this approach.
> It's still called from the writepage path though, so I guess it needs to
> stay.
> 
> Brian
> 
> >  
> > -		xfs_setfilesize(ip, tp, offset, size);
> > -	}
> > -out_destroy_ioend:
> > -	xfs_destroy_ioend(ioend);
> > +out_end_io:
> > +	xfs_end_io(&ioend->io_work);
> > +	return;
> >  }
> >  
> >  STATIC ssize_t
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index e78b64e..967993b 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -1224,6 +1224,7 @@ DEFINE_IOMAP_EVENT(xfs_get_blocks_alloc);
> >  DEFINE_IOMAP_EVENT(xfs_gbmap_direct);
> >  DEFINE_IOMAP_EVENT(xfs_gbmap_direct_new);
> >  DEFINE_IOMAP_EVENT(xfs_gbmap_direct_update);
> > +DEFINE_IOMAP_EVENT(xfs_gbmap_direct_endio);
> >  
> >  DECLARE_EVENT_CLASS(xfs_simple_io_class,
> >  	TP_PROTO(struct xfs_inode *ip, xfs_off_t offset, ssize_t count),
> > -- 
> > 2.0.0
> > 

* Re: [PATCH 1/8] xfs: factor DIO write mapping from get_blocks
  2015-04-14 14:23   ` Brian Foster
@ 2015-04-14 20:06     ` Dave Chinner
  0 siblings, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2015-04-14 20:06 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, Apr 14, 2015 at 10:23:46AM -0400, Brian Foster wrote:
> On Tue, Apr 14, 2015 at 05:26:44PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Clarify and separate the buffer mapping logic so that the direct IO mapping is
> > not tangled up in propagating the extent status to the mapping buffer. This
> > makes it easier to extend the direct IO mapping to use an ioend in future.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_aops.c | 36 +++++++++++++++++++++++-------------
> >  1 file changed, 23 insertions(+), 13 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index 3a9b7a1..5f7ddd5 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -1233,6 +1233,22 @@ xfs_vm_releasepage(
> >  	return try_to_free_buffers(page);
> >  }
> >  
> > +/*
> > + * do all the direct IO specific mapping buffer manipulation here.
> > + */
> > +static void
> > +xfs_map_direct(
> > +	struct inode		*inode,
> > +	struct buffer_head	*bh_result,
> > +	struct xfs_bmbt_irec	*imap,
> > +	xfs_off_t		offset)
> > +{
> > +	if (ISUNWRITTEN(imap)) {
> > +		bh_result->b_private = inode;
> > +		set_buffer_defer_completion(bh_result);
> > +	}
> > +}
> > +
> >  STATIC int
> >  __xfs_get_blocks(
> >  	struct inode		*inode,
> > @@ -1332,20 +1348,14 @@ __xfs_get_blocks(
> >  	}
> >  
> >  	if (imap.br_startblock != HOLESTARTBLOCK &&
> > -	    imap.br_startblock != DELAYSTARTBLOCK) {
> > -		/*
> > -		 * For unwritten extents do not report a disk address on
> > -		 * the read case (treat as if we're reading into a hole).
> > -		 */
> 
> Can we keep this comment and move it before this block?

*nod*.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 4/8] xfs: handle DIO overwrite EOF update completion correctly
  2015-04-14 14:35   ` Brian Foster
  2015-04-14 15:35     ` Brian Foster
@ 2015-04-14 20:12     ` Dave Chinner
  1 sibling, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2015-04-14 20:12 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, Apr 14, 2015 at 10:35:02AM -0400, Brian Foster wrote:
> On Tue, Apr 14, 2015 at 05:26:47PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Currently a DIO overwrite that extends the EOF (e.g. sub-block IO or
> > write into allocated blocks beyond EOF) requires a transaction for
> > the EOF update. This is done in IO completion context, but we aren't
> > explicitly handling this situation properly and so it can run in
> > interrupt context. Ensure that we defer IO that spans EOF correctly
> > to the DIO completion workqueue, and now that we have an ioend in IO
> > completion we can use the common ioend completion path to do all the
> > work.
> > 
> > Note: we do not preallocate the append transaction as we can have
> > multiple mapping and allocation calls per direct IO. Hence
> > preallocating can still leave us with nested transactions by
> > attempting to map and allocate more blocks after we've preallocated
> > an append transaction.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
....
> > @@ -1553,40 +1555,37 @@ xfs_end_io_direct_write(
> >  	ioend->io_offset = offset;
> >  
> >  	/*
> > -	 * While the generic direct I/O code updates the inode size, it does
> > -	 * so only after the end_io handler is called, which means our
> > -	 * end_io handler thinks the on-disk size is outside the in-core
> > -	 * size.  To prevent this just update it a little bit earlier here.
> > +	 * The ioend tells us whether we are doing unwritten extent conversion
> > +	 * or an append transaction that updates the on-disk file size. These
> > +	 * cases are the only cases where we should *potentially* be needing
> > +	 * to update the VFS inode size. When the ioend indicates this, we
> > +	 * are *guaranteed* to be running in non-interrupt context.
> > +	 *
> > +	 * We need to update the in-core inode size here so that we don't end up
> > +	 * with the on-disk inode size being outside the in-core inode size.
> > +	 * While we can do this in the process context after the IO has
> > +	 * completed, this does not work for AIO and hence we always update
> > +	 * the in-core inode size here if necessary.
> >  	 */
> > -	if (offset + size > i_size_read(inode))
> > -		i_size_write(inode, offset + size);
> > +	if (ioend->io_type == XFS_IO_UNWRITTEN || xfs_ioend_is_append(ioend)) {
> > +		if (offset + size > i_size_read(inode))
> > +			i_size_write(inode, offset + size);
> > +	} else
> > +		ASSERT(offset + size <= i_size_read(inode));
> 
> The code was obviously incorrect prior to this change, potentially
> running some of these transactions in irq context. That said, it occurs
> to me that one thing that the previous implementation looked to handle
> that this does not is racing of in-flight aio with other operations.
> E.g., what happens now if a non-extending, overwrite aio is submitted
> and races with a truncate that causes it to be extending by the time we
> get here? It looks like it would have been racy regardless, so maybe
> that's just a separate problem...

AIO can't race with truncate, because truncate does inode_dio_wait()
after taking the IOLOCK_EXCL.
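
Roughly this ordering in the truncate path (paraphrasing from memory,
not quoting xfs_setattr_size() and its caller verbatim):

	xfs_ilock(ip, XFS_IOLOCK_EXCL);	/* blocks new DIO submission */
	inode_dio_wait(VFS_I(ip));	/* drains DIO already in flight,
					 * including their AIO completions */
	/* ... only then is the file size actually changed ... */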

As for failing to update if the inode size is extended, we only care
about the in-memory size update if the on-disk inode size is being
extended. If the on-disk size has been extended, then we don't need
to update the in-memory size because it's already been extended by
this code.

> >  	/*
> > -	 * For direct I/O we do not know if we need to allocate blocks or not,
> > -	 * so we can't preallocate an append transaction, as that results in
> > -	 * nested reservations and log space deadlocks. Hence allocate the
> > -	 * transaction here. While this is sub-optimal and can block IO
> > -	 * completion for some time, we're stuck with doing it this way until
> > -	 * we can pass the ioend to the direct IO allocation callbacks and
> > -	 * avoid nesting that way.
> > +	 * If we are doing an append IO that needs to update the EOF on disk,
> > +	 * do the transaction reserve now so we can use common end io
> > +	 * processing. Stashing the error (if there is one) in the ioend will
> > +	 * result in the ioend processing passing on the error if it is
> > +	 * possible as we can't return it from here.
> >  	 */
> > -	if (ioend->io_type == XFS_IO_UNWRITTEN) {
> > -		xfs_iomap_write_unwritten(ip, offset, size);
> > -	} else if (offset + size > ip->i_d.di_size) {
> > -		struct xfs_trans	*tp;
> > -		int			error;
> > -
> > -		tp = xfs_trans_alloc(mp, XFS_TRANS_FSYNC_TS);
> > -		error = xfs_trans_reserve(tp, &M_RES(mp)->tr_fsyncts, 0, 0);
> > -		if (error) {
> > -			xfs_trans_cancel(tp, 0);
> > -			goto out_destroy_ioend;
> > -		}
> > +	if (ioend->io_type == XFS_IO_OVERWRITE && xfs_ioend_is_append(ioend))
> > +		ioend->io_error = xfs_setfilesize_trans_alloc(ioend);
> 
> As you mentioned previously, we no longer need the transaction context
> manipulation stuff in xfs_setfilesize_trans_alloc() with this approach.
> It's still called from the writepage path though, so I guess it needs to
> stay.

Yes, and if we ever get the eventual DIO rewrite that's been coming
for several years, we'll be able to untangle this code further and
use preallocation for DIO.  As it is, I have a few thoughts on how
to do preallocation regardless, just haven't had time to explore
them.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 5/8] xfs: DIO writes within EOF don't need an ioend
  2015-04-14 14:35   ` Brian Foster
@ 2015-04-14 20:18     ` Dave Chinner
  0 siblings, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2015-04-14 20:18 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, Apr 14, 2015 at 10:35:19AM -0400, Brian Foster wrote:
> On Tue, Apr 14, 2015 at 05:26:48PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > DIO writes that lie entirely within EOF have nothing to do in IO
> > completion. In this case, we don't need no steekin' ioend, and so we
> > can avoid allocating an ioend until we have a mapping that spans
> > EOF.
> > 
> > This means that IO completion has two contexts - deferred completion
> > to the dio workqueue that uses an ioend, and interrupt completion
> > that does nothing because there is nothing that can be done in this
> > context.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_aops.c  | 62 ++++++++++++++++++++++++++++++------------------------
> >  fs/xfs/xfs_trace.h |  1 +
> >  2 files changed, 36 insertions(+), 27 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index e3968a3..55356f6 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -1234,15 +1234,19 @@ xfs_vm_releasepage(
> >  }
> >  
> >  /*
> > - * When we map a DIO buffer, we need to attach an ioend that describes the type
> > + * When we map a DIO buffer, we may need to attach an ioend that describes the type
> >   * of write IO we are doing. This passes to the completion function the
> > - * operations it needs to perform.
> > + * operations it needs to perform. If the mapping is for an overwrite wholly
> > + * within the EOF then we don't need an ioend and so we don't allocate one. This
> > + * avoids the unnecessary overhead of allocating and freeing ioends for
> > + * workloads that don't require transactions on IO completion.
> >   *
> >   * If we get multiple mappings in a single IO, we might be mapping different
> >   * types. But because the direct IO can only have a single private pointer, we
> >   * need to ensure that:
> >   *
> > - * a) the ioend spans the entire region of the IO; and
> > + * a) i) the ioend spans the entire region of unwritten mappings; or
> > + *    ii) the ioend spans all the mappings that cross or are beyond EOF; and
> >   * b) if it contains unwritten extents, it is *permanently* marked as such
> >   *
> >   * We could do this by chaining ioends like buffered IO does, but we only
> > @@ -1283,7 +1287,8 @@ xfs_map_direct(
> >  		trace_xfs_gbmap_direct_update(XFS_I(inode), ioend->io_offset,
> >  					      ioend->io_size, ioend->io_type,
> >  					      imap);
> > -	} else {
> > +	} else if (type == XFS_IO_UNWRITTEN ||
> > +		   offset + size > i_size_read(inode)) {
> >  		ioend = xfs_alloc_ioend(inode, type);
> >  		ioend->io_offset = offset;
> >  		ioend->io_size = size;
> > @@ -1291,10 +1296,13 @@ xfs_map_direct(
> >  
> >  		trace_xfs_gbmap_direct_new(XFS_I(inode), offset, size, type,
> >  					   imap);
> > +	} else {
> > +		trace_xfs_gbmap_direct_none(XFS_I(inode), offset, size, type,
> > +					    imap);
> 
> Do we really need a tracepoint to indicate none of the other tracepoints
> were hit? It stands out to me only because we already have the
> unconditional trace_xfs_gbmap_direct() above. I'd say kill one or the
> other, but I think we really want the function entry one because it
> disambiguates individual get_block instances from the aggregate mapping.

I found this incredibly useful in debugging this code, because it
told me exactly what each mapping call was doing, and from that I
could see if it was doing the right thing. Yes, I could infer it
from the entry trace point, but grepping on the entry tracepoint
gets *all* the mapping calls, not just the overwrites wholly within
EOF...

> > +		return;
> >  	}
> >  
> > -	if (ioend->io_type == XFS_IO_UNWRITTEN || xfs_ioend_is_append(ioend))
> > -		set_buffer_defer_completion(bh_result);
> > +	set_buffer_defer_completion(bh_result);
> 
> I'd move this up into the block where we allocate an ioend. That's the
> only place we need it and doing so eliminates the need for the 'else {
> return; }' thing entirely.

Yeah, that would work, too.

> >  STATIC void
> >  xfs_end_io_direct_write(
> > @@ -1531,7 +1541,10 @@ xfs_end_io_direct_write(
> >  	struct xfs_mount	*mp = ip->i_mount;
> >  	struct xfs_ioend	*ioend = private;
> >  
> > -	trace_xfs_gbmap_direct_endio(ip, offset, size, ioend->io_type, NULL);
> > +	trace_xfs_gbmap_direct_endio(ip, offset, size,
> > +				     ioend ? ioend->io_type : 0, NULL);
> > +	if (!ioend)
> > +		return;
> 
> Can we keep the i_size assert we've lost below?
> 
> ASSERT(offset + size <= i_size_read(inode));

Sure, I can add it for that case.
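
i.e. something along these lines for the !ioend case (untested sketch):

	trace_xfs_gbmap_direct_endio(ip, offset, size,
				     ioend ? ioend->io_type : 0, NULL);
	if (!ioend) {
		ASSERT(offset + size <= i_size_read(inode));
		return;
	}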

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

