* [PATCH 0/5] xfs: fix direct IO completion issues
@ 2015-04-10 13:37 Dave Chinner
  2015-04-10 13:37 ` [PATCH 1/5] xfs: DIO requires an ioend for writes Dave Chinner
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Dave Chinner @ 2015-04-10 13:37 UTC
To: xfs

Hi folks,

This patchset addresses the deeper problems Brian outlined in the
description of this patch:

http://oss.sgi.com/archives/xfs/2015-04/msg00071.html

The basic issue is that DIO completion can run in interrupt context,
and it does things it should not do in interrupt context because Bad
Things Will Happen.

Patches 1 and 2 convert the DIO write completion code to use an ioend
and to never run in interrupt context when a transaction or EOF
update may need to be run.

Patches 3 and 4 of this series are effectively the same as the patch
Brian sent, but by checking the ioend status in the completion
routine before taking the spinlock, we can guarantee we never take
the spinlock in interrupt context. Hence we don't need irq-safe
spinlocks and so can re-use an existing innermost spinlock for
serialisation here.

The final patch is a removal of redundant operations - most of
generic_file_direct_write() is already being done in the XFS code, so
most of the generic function is redundant and unnecessary overhead.
Hence it moves the bits that we need into the XFS code path, and we
stop calling the generic code altogether.

This passes xfstests and everything I've thrown at it. There are a
couple of small cleanups that I think still need to be done, but they
are minor. e.g. xfs_end_io_dio_write() can now probably call
xfs_finish_endio_sync() directly, rather than open coding the calls
it makes after updating the incore inode size....

Comments, thoughts?

-Dave

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* [PATCH 1/5] xfs: DIO requires an ioend for writes
  2015-04-10 13:37 [PATCH 0/5] xfs: fix direct IO completion issues Dave Chinner
@ 2015-04-10 13:37 ` Dave Chinner
  2015-04-10 20:21   ` Brian Foster
  2015-04-10 13:37 ` [PATCH 2/5] xfs: direct IO needs to use append ioends Dave Chinner
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2015-04-10 13:37 UTC
To: xfs

From: Dave Chinner <dchinner@redhat.com>

Right now unwritten extent conversion information is passed by
making the end IO private data non-null, which does not enable us to
pass any information from submission context to completion context,
which we need in order to use the standard IO completion paths.

Allocate an ioend in block allocation for direct IO and attach it to
the mapping buffer used during direct IO block allocation. Factor
the mapping code to make it obvious that this is happening only for
direct IO writes, and place the mapping info and IO type directly
into the ioend for use in completion context.

The completion is changed to check the ioend type to determine
whether unwritten extent completion is necessary or not.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c | 79 ++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 61 insertions(+), 18 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 3a9b7a1..d95a42b 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1233,6 +1233,57 @@ xfs_vm_releasepage(
 	return try_to_free_buffers(page);
 }
 
+static void
+xfs_get_blocks_map_buffer(
+	struct inode		*inode,
+	struct buffer_head	*bh_result,
+	int			create,
+	int			direct,
+	struct xfs_bmbt_irec	*imap,
+	xfs_off_t		offset,
+	ssize_t			size)
+{
+	struct xfs_ioend	*ioend;
+	int			type;
+
+	if (!create) {
+		/*
+		 * Unwritten extents do not report a disk address for
+		 * the read case (treat as if we're reading into a hole).
+		 */
+		if (!ISUNWRITTEN(imap))
+			xfs_map_buffer(inode, bh_result, imap, offset);
+		return;
+	}
+
+	xfs_map_buffer(inode, bh_result, imap, offset);
+
+	if (ISUNWRITTEN(imap))
+		set_buffer_unwritten(bh_result);
+
+	if (!direct)
+		return;
+
+	/*
+	 * Direct IO writes require an ioend to be allocated and
+	 * passed via the returned mapping. This allows the end
+	 * io function to determine the correct course of
+	 * action.
+	 */
+
+	if (ISUNWRITTEN(imap)) {
+		type = XFS_IO_UNWRITTEN;
+		set_buffer_defer_completion(bh_result);
+	} else
+		type = XFS_IO_OVERWRITE;
+	ioend = xfs_alloc_ioend(inode, type);
+	ioend->io_offset = offset;
+	ioend->io_size = size;
+	bh_result->b_private = ioend;
+
+	return;
+}
+
 STATIC int
 __xfs_get_blocks(
 	struct inode		*inode,
@@ -1252,6 +1303,7 @@ __xfs_get_blocks(
 	ssize_t			size;
 	int			new = 0;
 
+
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
 
@@ -1332,21 +1384,9 @@ __xfs_get_blocks(
 	}
 
 	if (imap.br_startblock != HOLESTARTBLOCK &&
-	    imap.br_startblock != DELAYSTARTBLOCK) {
-		/*
-		 * For unwritten extents do not report a disk address on
-		 * the read case (treat as if we're reading into a hole).
-		 */
-		if (create || !ISUNWRITTEN(&imap))
-			xfs_map_buffer(inode, bh_result, &imap, offset);
-		if (create && ISUNWRITTEN(&imap)) {
-			if (direct) {
-				bh_result->b_private = inode;
-				set_buffer_defer_completion(bh_result);
-			}
-			set_buffer_unwritten(bh_result);
-		}
-	}
+	    imap.br_startblock != DELAYSTARTBLOCK)
+		xfs_get_blocks_map_buffer(inode, bh_result, create, direct,
+					  &imap, offset, size);
 
 	/*
 	 * If this is a realtime file, data may be on a different device.
@@ -1455,9 +1495,10 @@ xfs_end_io_direct_write(
 	struct inode		*inode = file_inode(iocb->ki_filp);
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_ioend	*ioend = private;
 
 	if (XFS_FORCED_SHUTDOWN(mp))
-		return;
+		goto out_destroy_ioend;
 
 	/*
 	 * While the generic direct I/O code updates the inode size, it does
@@ -1477,7 +1518,7 @@ xfs_end_io_direct_write(
 	 * we can pass the ioend to the direct IO allocation callbacks and
 	 * avoid nesting that way.
 	 */
-	if (private && size > 0) {
+	if (ioend->io_type == XFS_IO_UNWRITTEN && size > 0) {
 		xfs_iomap_write_unwritten(ip, offset, size);
 	} else if (offset + size > ip->i_d.di_size) {
 		struct xfs_trans *tp;
@@ -1487,11 +1528,13 @@ xfs_end_io_direct_write(
 		error = xfs_trans_reserve(tp, &M_RES(mp)->tr_fsyncts, 0, 0);
 		if (error) {
 			xfs_trans_cancel(tp, 0);
-			return;
+			goto out_destroy_ioend;
 		}
 
 		xfs_setfilesize(ip, tp, offset, size);
 	}
+out_destroy_ioend:
+	xfs_destroy_ioend(ioend);
 }
 
 STATIC ssize_t
-- 
2.0.0
* Re: [PATCH 1/5] xfs: DIO requires an ioend for writes
  2015-04-10 13:37 ` [PATCH 1/5] xfs: DIO requires an ioend for writes Dave Chinner
@ 2015-04-10 20:21   ` Brian Foster
  2015-04-10 22:24     ` Dave Chinner
  0 siblings, 1 reply; 19+ messages in thread
From: Brian Foster @ 2015-04-10 20:21 UTC
To: Dave Chinner; +Cc: xfs

On Fri, Apr 10, 2015 at 11:37:56PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Right now unwritten extent conversion information is passed by
> making the end IO private data non-null, which does not enable us to
> pass any information from submission context to completion context,
> which we need in order to use the standard IO completion paths.
>
> Allocate an ioend in block allocation for direct IO and attach it to
> the mapping buffer used during direct IO block allocation. Factor
> the mapping code to make it obvious that this is happening only for
> direct IO writes, and place the mapping info and IO type directly
> into the ioend for use in completion context.
>
> The completion is changed to check the ioend type to determine
> whether unwritten extent completion is necessary or not.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_aops.c | 79 ++++++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 61 insertions(+), 18 deletions(-)
>
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 3a9b7a1..d95a42b 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1233,6 +1233,57 @@ xfs_vm_releasepage(
>  	return try_to_free_buffers(page);
>  }
>
> +static void
> +xfs_get_blocks_map_buffer(
> +	struct inode		*inode,
> +	struct buffer_head	*bh_result,
> +	int			create,
> +	int			direct,
> +	struct xfs_bmbt_irec	*imap,
> +	xfs_off_t		offset,
> +	ssize_t			size)
> +{
> +	struct xfs_ioend	*ioend;
> +	int			type;
> +
> +	if (!create) {
> +		/*
> +		 * Unwritten extents do not report a disk address for
> +		 * the read case (treat as if we're reading into a hole).
> +		 */
> +		if (!ISUNWRITTEN(imap))
> +			xfs_map_buffer(inode, bh_result, imap, offset);
> +		return;
> +	}

This logic was kind of ugly to begin with, but I think the refactoring
exposes it further. There's rather twisty logic here just for a case
where we don't do anything. In other words, the fact that we have
early return logic in this function that ultimately jumps us back up
through two levels of scope (e.g., this function, the if branch in the
caller) is good indication that the logic could be improved.

I think we could lift the (!create && ISUNWRITTEN()) case up into the
top most check, along with the comment as to why, and skip the entire
codepath in that case. That also kills the duplicate xfs_map_buffer()
call above so long as we do something like this:

> +
> +	xfs_map_buffer(inode, bh_result, imap, offset);

	if (!create)
		return;

... but I think we could clean it up even further than that. This
function gets rather busy in the subsequent patch so I think there's
value in isolating it to managing direct I/O and lifting the common
stuff back into xfs_get_blocks() (see below). This means we could also
kill both of the create and direct parameters because they aren't used
at all beyond the !direct check a few lines below.

> +
> +	if (ISUNWRITTEN(imap))
> +		set_buffer_unwritten(bh_result);
> +
> +	if (!direct)
> +		return;
> +
> +	/*
> +	 * Direct IO writes require an ioend to be allocated and
> +	 * passed via the returned mapping. This allows the end
> +	 * io function to determine the correct course of
> +	 * action.
> +	 */
> +
> +	if (ISUNWRITTEN(imap)) {
> +		type = XFS_IO_UNWRITTEN;
> +		set_buffer_defer_completion(bh_result);
> +	} else
> +		type = XFS_IO_OVERWRITE;
> +	ioend = xfs_alloc_ioend(inode, type);
> +	ioend->io_offset = offset;
> +	ioend->io_size = size;
> +	bh_result->b_private = ioend;
> +
> +	return;

Pointless return... not a big deal given this is converted from a void
function in the subsequent patch.

> +}
> +
>  STATIC int
>  __xfs_get_blocks(
>  	struct inode		*inode,
> @@ -1252,6 +1303,7 @@ __xfs_get_blocks(
>  	ssize_t			size;
>  	int			new = 0;
>
> +
>  	if (XFS_FORCED_SHUTDOWN(mp))
>  		return -EIO;
>
> @@ -1332,21 +1384,9 @@ __xfs_get_blocks(
>  	}
>
>  	if (imap.br_startblock != HOLESTARTBLOCK &&
> -	    imap.br_startblock != DELAYSTARTBLOCK) {
> -		/*
> -		 * For unwritten extents do not report a disk address on
> -		 * the read case (treat as if we're reading into a hole).
> -		 */
> -		if (create || !ISUNWRITTEN(&imap))
> -			xfs_map_buffer(inode, bh_result, &imap, offset);
> -		if (create && ISUNWRITTEN(&imap)) {
> -			if (direct) {
> -				bh_result->b_private = inode;
> -				set_buffer_defer_completion(bh_result);
> -			}
> -			set_buffer_unwritten(bh_result);
> -		}
> -	}
> +	    imap.br_startblock != DELAYSTARTBLOCK)
> +		xfs_get_blocks_map_buffer(inode, bh_result, create, direct,
> +					  &imap, offset, size);

So if we pull some of the bits from xfs_get_blocks_map_buffer() back
up, I end up with something like the following here. Compile tested
only, but illustrates the point:

	/*
	 * Map the buffer as long as we have physical blocks and this
	 * isn't a read of an unwritten extent. Treat reads into
	 * unwritten extents as holes and thus do not return a mapping.
	 */
	if (imap.br_startblock != HOLESTARTBLOCK &&
	    imap.br_startblock != DELAYSTARTBLOCK &&
	    (create || !ISUNWRITTEN(&imap))) {
		xfs_map_buffer(inode, bh_result, &imap, offset);

		/* unwritten implies create due to check above */
		if (ISUNWRITTEN(&imap))
			set_buffer_unwritten(bh_result);

		/* direct writes have a special mapping */
		if (create && direct) {
			error = xfs_map_direct(inode, bh_result, &imap,
					       offset);
			if (error)
				return error;
		}
	}

I renamed the helper to xfs_map_direct(), killed everything therein up
through the !direct check and killed both the create and direct
params. Thoughts?

Brian

>
>  	/*
>  	 * If this is a realtime file, data may be on a different device.
> @@ -1455,9 +1495,10 @@ xfs_end_io_direct_write(
>  	struct inode		*inode = file_inode(iocb->ki_filp);
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_ioend	*ioend = private;
>
>  	if (XFS_FORCED_SHUTDOWN(mp))
> -		return;
> +		goto out_destroy_ioend;
>
>  	/*
>  	 * While the generic direct I/O code updates the inode size, it does
> @@ -1477,7 +1518,7 @@ xfs_end_io_direct_write(
>  	 * we can pass the ioend to the direct IO allocation callbacks and
>  	 * avoid nesting that way.
>  	 */
> -	if (private && size > 0) {
> +	if (ioend->io_type == XFS_IO_UNWRITTEN && size > 0) {
>  		xfs_iomap_write_unwritten(ip, offset, size);
>  	} else if (offset + size > ip->i_d.di_size) {
>  		struct xfs_trans *tp;
> @@ -1487,11 +1528,13 @@ xfs_end_io_direct_write(
>  		error = xfs_trans_reserve(tp, &M_RES(mp)->tr_fsyncts, 0, 0);
>  		if (error) {
>  			xfs_trans_cancel(tp, 0);
> -			return;
> +			goto out_destroy_ioend;
>  		}
>
>  		xfs_setfilesize(ip, tp, offset, size);
>  	}
> +out_destroy_ioend:
> +	xfs_destroy_ioend(ioend);
>  }
>
>  STATIC ssize_t
> --
> 2.0.0
>
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
* Re: [PATCH 1/5] xfs: DIO requires an ioend for writes
  2015-04-10 20:21   ` Brian Foster
@ 2015-04-10 22:24     ` Dave Chinner
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2015-04-10 22:24 UTC
To: Brian Foster; +Cc: xfs

On Fri, Apr 10, 2015 at 04:21:37PM -0400, Brian Foster wrote:
> On Fri, Apr 10, 2015 at 11:37:56PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > Right now unwritten extent conversion information is passed by
> > making the end IO private data non-null, which does not enable us to
> > pass any information from submission context to completion context,
> > which we need in order to use the standard IO completion paths.
> >
> > Allocate an ioend in block allocation for direct IO and attach it to
> > the mapping buffer used during direct IO block allocation. Factor
> > the mapping code to make it obvious that this is happening only for
> > direct IO writes, and place the mapping info and IO type directly
> > into the ioend for use in completion context.
> >
> > The completion is changed to check the ioend type to determine
> > whether unwritten extent completion is necessary or not.
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_aops.c | 79 ++++++++++++++++++++++++++++++++++++++++++-------------
> >  1 file changed, 61 insertions(+), 18 deletions(-)
> >
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index 3a9b7a1..d95a42b 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -1233,6 +1233,57 @@ xfs_vm_releasepage(
> >  	return try_to_free_buffers(page);
> >  }
> >
> > +static void
> > +xfs_get_blocks_map_buffer(
> > +	struct inode		*inode,
> > +	struct buffer_head	*bh_result,
> > +	int			create,
> > +	int			direct,
> > +	struct xfs_bmbt_irec	*imap,
> > +	xfs_off_t		offset,
> > +	ssize_t			size)
> > +{
> > +	struct xfs_ioend	*ioend;
> > +	int			type;
> > +
> > +	if (!create) {
> > +		/*
> > +		 * Unwritten extents do not report a disk address for
> > +		 * the read case (treat as if we're reading into a hole).
> > +		 */
> > +		if (!ISUNWRITTEN(imap))
> > +			xfs_map_buffer(inode, bh_result, imap, offset);
> > +		return;
> > +	}
>
> This logic was kind of ugly to begin with, but I think the refactoring
> exposes it further. There's rather twisty logic here just for a case

Yup, I isolated it first to make it easy to change, not necessarily
easier to read ;)

....

> So if we pull some of the bits from xfs_get_blocks_map_buffer() back
> up, I end up with something like the following here. Compile tested
> only, but illustrates the point:
>
> 	/*
> 	 * Map the buffer as long as we have physical blocks and this
> 	 * isn't a read of an unwritten extent. Treat reads into
> 	 * unwritten extents as holes and thus do not return a mapping.
> 	 */
> 	if (imap.br_startblock != HOLESTARTBLOCK &&
> 	    imap.br_startblock != DELAYSTARTBLOCK &&
> 	    (create || !ISUNWRITTEN(&imap))) {
> 		xfs_map_buffer(inode, bh_result, &imap, offset);
>
> 		/* unwritten implies create due to check above */
> 		if (ISUNWRITTEN(&imap))
> 			set_buffer_unwritten(bh_result);
>
> 		/* direct writes have a special mapping */
> 		if (create && direct) {
> 			error = xfs_map_direct(inode, bh_result, &imap,
> 					       offset);
> 			if (error)
> 				return error;
> 		}
> 	}
>
> I renamed the helper to xfs_map_direct(), killed everything therein up
> through the !direct check and killed both the create and direct
> params. Thoughts?

Yeah, that looks neater; I'll split and rework it in a similar manner
to this. Thanks!

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* [PATCH 2/5] xfs: direct IO needs to use append ioends
  2015-04-10 13:37 [PATCH 0/5] xfs: fix direct IO completion issues Dave Chinner
  2015-04-10 13:37 ` [PATCH 1/5] xfs: DIO requires an ioend for writes Dave Chinner
@ 2015-04-10 13:37 ` Dave Chinner
  2015-04-10 20:21   ` Brian Foster
  2015-04-11 21:15   ` Brian Foster
  2015-04-10 13:37 ` [PATCH 3/5] xfs: DIO write completion size updates race Dave Chinner
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 19+ messages in thread
From: Dave Chinner @ 2015-04-10 13:37 UTC
To: xfs

From: Dave Chinner <dchinner@redhat.com>

Now we have an ioend being passed unconditionally to the direct IO
write completion context, we can pass a preallocated transaction
handle for on-disk inode size updates that are run in completion.

At this point we really need to be passing the correct block range
that the IO spans through the ioend, so calculate the last block in
the mapping before we map the allocated range and use that instead
of the size desired by the direct IO.

This enables us to keep track of multiple get-blocks calls in the
same direct IO - the ioend will keep coming back to us, and we can
keep extending its range as new allocations and mappings are done.

There are some new trace points added for debugging, and a small
hack to actually make the tracepoints work (enums in tracepoints
that use __print_symbolic don't work correctly) that should be fixed
in the 4.1 merge window. This hack can be removed when the
tracepoint fix is upstream.

There are lots of comments explaining the intricacies of passing the
ioend and append transaction in the code; they are better placed in
the code because we're going to need them to understand why this
code does what it does in a few years time....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c  | 262 +++++++++++++++++++++++++++++++++++++----------------
 fs/xfs/xfs_trace.h |  10 +-
 2 files changed, 194 insertions(+), 78 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index d95a42b..52c7e46 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -178,6 +178,25 @@ xfs_setfilesize_ioend(
 	return xfs_setfilesize(ip, tp, ioend->io_offset, ioend->io_size);
 }
 
+STATIC void
+xfs_setfilesize_ioend_cancel(
+	struct xfs_ioend	*ioend)
+{
+	struct xfs_trans	*tp = ioend->io_append_trans;
+
+	/*
+	 * The transaction may have been allocated in the I/O submission thread,
+	 * thus we need to mark ourselves as being in a transaction manually.
+	 * Similarly for freeze protection.
+	 */
+	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	rwsem_acquire_read(&ioend->io_inode->i_sb->s_writers.lock_map[SB_FREEZE_FS-1],
+			   0, 1, _THIS_IP_);
+
+	xfs_trans_cancel(tp, 0);
+	ioend->io_append_trans = NULL;
+}
+
 /*
  * Schedule IO completion handling on the final put of an ioend.
  *
@@ -1233,18 +1252,18 @@ xfs_vm_releasepage(
 	return try_to_free_buffers(page);
 }
 
-static void
+static int
 xfs_get_blocks_map_buffer(
 	struct inode		*inode,
 	struct buffer_head	*bh_result,
 	int			create,
 	int			direct,
 	struct xfs_bmbt_irec	*imap,
-	xfs_off_t		offset,
-	ssize_t			size)
+	xfs_off_t		offset)
 {
 	struct xfs_ioend	*ioend;
 	int			type;
+	loff_t			size;
 
 	if (!create) {
 		/*
@@ -1253,7 +1272,7 @@ xfs_get_blocks_map_buffer(
 		 */
 		if (!ISUNWRITTEN(imap))
 			xfs_map_buffer(inode, bh_result, imap, offset);
-		return;
+		return 0;
 	}
 
 	xfs_map_buffer(inode, bh_result, imap, offset);
@@ -1262,26 +1281,93 @@ xfs_get_blocks_map_buffer(
 		set_buffer_unwritten(bh_result);
 
 	if (!direct)
-		return;
+		return 0;
 
 	/*
-	 * Direct IO writes require an ioend to be allocated and
-	 * passed via the returned mapping. This allows the end
-	 * io function to determine the correct course of
-	 * action.
+	 * Direct IO writes require an ioend to be allocated and passed via the
+	 * returned mapping. This allows the end io function to determine the
+	 * correct course of action.
+	 *
+	 * Unwritten extents will need transactions at completion, so are known
+	 * to need deferring to a workqueue. However, for writes into written
+	 * extents, we *may* need a transaction if this IO extends the on-disk
+	 * EOF. Because we can race with other IOs the file may already be
+	 * extended by the time we get to the transaction. IO completion already
+	 * handles that case so all we will have done is incurred the overhead
+	 * of workqueue deferral for completion. This is acceptable overhead for
+	 * the rare case that this occurs.
 	 */
-
 	if (ISUNWRITTEN(imap)) {
 		type = XFS_IO_UNWRITTEN;
 		set_buffer_defer_completion(bh_result);
 	} else
 		type = XFS_IO_OVERWRITE;
-	ioend = xfs_alloc_ioend(inode, type);
-	ioend->io_offset = offset;
-	ioend->io_size = size;
-	bh_result->b_private = ioend;
 
-	return;
+	/*
+	 * The offset that is passed in is the first block the DIO will fall
+	 * into. The size supplied by the DIO layer is what it thinks it needs
+	 * but the mapping may not span this entire range. Hence we use the
+	 * truncated mapping size that's already been stashed in the bh_result
+	 * to calculate the range covered by the ioend.
+	 */
+	size = bh_result->b_size;
+	trace_xfs_gbmap_direct(XFS_I(inode), offset, size, type, imap);
+
+	/*
+	 * If we get multiple mappings in a single IO, we might be mapping
+	 * different types. But because the direct IO can only have a single
+	 * private pointer, we need to ensure that:
+	 *
+	 * a) the ioend spans the entire region of the IO; and
+	 * b) if it contains unwritten extents, it is *permanently* marked as
+	 *    such and we cancel any append transaction attached to the ioend.
+	 *
+	 * We could do this by chaining ioends like buffered IO does, but
+	 * we only actually get one IO completion callback from the direct IO,
+	 * and that spans the entire IO regardless of how many mappings and IOs
+	 * are needed to complete the DIO. There is only going to be one
+	 * reference to the ioend and its life cycle is constrained by the
+	 * DIO completion code. Hence we don't need reference counting here.
+	 */
+	if (bh_result->b_private) {
+		ioend = bh_result->b_private;
+		ASSERT(ioend->io_size > 0);
+		ASSERT(offset >= ioend->io_offset);
+
+		if (offset + size > ioend->io_offset + ioend->io_size)
+			ioend->io_size = offset - ioend->io_offset + size;
+
+		if (type == XFS_IO_UNWRITTEN) {
+			if (ioend->io_append_trans)
+				xfs_setfilesize_ioend_cancel(ioend);
+			ioend->io_type = XFS_IO_UNWRITTEN;
+		}
+		trace_xfs_gbmap_direct_update(XFS_I(inode), ioend->io_offset,
+					      ioend->io_size, ioend->io_type,
+					      imap);
+	} else {
+		ioend = xfs_alloc_ioend(inode, type);
+		ioend->io_offset = offset;
+		ioend->io_size = size;
+		bh_result->b_private = ioend;
+		trace_xfs_gbmap_direct_new(XFS_I(inode), offset, size, type,
+					   imap);
+	}
+
+	/* check if we need an append transaction allocated. */
+	if (ioend->io_type == XFS_IO_OVERWRITE &&
+	    xfs_ioend_is_append(ioend) && !ioend->io_append_trans) {
+		int	error;
+
+		error = xfs_setfilesize_trans_alloc(ioend);
+		ASSERT(!error);
+		if (error) {
+			xfs_destroy_ioend(ioend);
+			return error;
+		}
+		set_buffer_defer_completion(bh_result);
+	}
+	return 0;
 }
 
 STATIC int
@@ -1374,50 +1460,19 @@ __xfs_get_blocks(
 		xfs_iunlock(ip, lockmode);
 	}
 
-		trace_xfs_get_blocks_alloc(ip, offset, size, 0, &imap);
+		trace_xfs_get_blocks_alloc(ip, offset, size,
+				ISUNWRITTEN(&imap) ? XFS_IO_UNWRITTEN
+						   : XFS_IO_DELALLOC, &imap);
 	} else if (nimaps) {
-		trace_xfs_get_blocks_found(ip, offset, size, 0, &imap);
+		trace_xfs_get_blocks_found(ip, offset, size,
+				ISUNWRITTEN(&imap) ? XFS_IO_UNWRITTEN
						   : XFS_IO_OVERWRITE, &imap);
 		xfs_iunlock(ip, lockmode);
 	} else {
 		trace_xfs_get_blocks_notfound(ip, offset, size);
 		goto out_unlock;
 	}
 
-	if (imap.br_startblock != HOLESTARTBLOCK &&
-	    imap.br_startblock != DELAYSTARTBLOCK)
-		xfs_get_blocks_map_buffer(inode, bh_result, create, direct,
-					  &imap, offset, size);
-
-	/*
-	 * If this is a realtime file, data may be on a different device.
-	 * to that pointed to from the buffer_head b_bdev currently.
-	 */
-	bh_result->b_bdev = xfs_find_bdev_for_inode(inode);
-
-	/*
-	 * If we previously allocated a block out beyond eof and we are now
-	 * coming back to use it then we will need to flag it as new even if it
-	 * has a disk address.
-	 *
-	 * With sub-block writes into unwritten extents we also need to mark
-	 * the buffer as new so that the unwritten parts of the buffer gets
-	 * correctly zeroed.
-	 */
-	if (create &&
-	    ((!buffer_mapped(bh_result) && !buffer_uptodate(bh_result)) ||
-	     (offset >= i_size_read(inode)) ||
-	     (new || ISUNWRITTEN(&imap))))
-		set_buffer_new(bh_result);
-
-	if (imap.br_startblock == DELAYSTARTBLOCK) {
-		BUG_ON(direct);
-		if (create) {
-			set_buffer_uptodate(bh_result);
-			set_buffer_mapped(bh_result);
-			set_buffer_delay(bh_result);
-		}
-	}
-
 	/*
 	 * If this is O_DIRECT or the mpage code calling tell them how large
 	 * the mapping is, so that we can avoid repeated get_blocks calls.
@@ -1451,6 +1506,46 @@ __xfs_get_blocks(
 		bh_result->b_size = mapping_size;
 	}
 
+	if (imap.br_startblock != HOLESTARTBLOCK &&
+	    imap.br_startblock != DELAYSTARTBLOCK) {
+		error = xfs_get_blocks_map_buffer(inode, bh_result, create,
+						  direct, &imap, offset);
+		if (error)
+			return error;
+	}
+	if (create && direct)
+		ASSERT(bh_result->b_private);
+
+	/*
+	 * If this is a realtime file, data may be on a different device.
+	 * to that pointed to from the buffer_head b_bdev currently.
+	 */
+	bh_result->b_bdev = xfs_find_bdev_for_inode(inode);
+
+	/*
+	 * If we previously allocated a block out beyond eof and we are now
+	 * coming back to use it then we will need to flag it as new even if it
+	 * has a disk address.
+	 *
+	 * With sub-block writes into unwritten extents we also need to mark
+	 * the buffer as new so that the unwritten parts of the buffer gets
+	 * correctly zeroed.
+	 */
+	if (create &&
+	    ((!buffer_mapped(bh_result) && !buffer_uptodate(bh_result)) ||
+	     (offset >= i_size_read(inode)) ||
+	     (new || ISUNWRITTEN(&imap))))
+		set_buffer_new(bh_result);
+
+	if (imap.br_startblock == DELAYSTARTBLOCK) {
+		BUG_ON(direct);
+		if (create) {
+			set_buffer_uptodate(bh_result);
+			set_buffer_mapped(bh_result);
+			set_buffer_delay(bh_result);
+		}
+	}
+
 	return 0;
 
 out_unlock:
@@ -1501,38 +1596,51 @@ xfs_end_io_direct_write(
 		goto out_destroy_ioend;
 
 	/*
-	 * While the generic direct I/O code updates the inode size, it does
-	 * so only after the end_io handler is called, which means our
-	 * end_io handler thinks the on-disk size is outside the in-core
-	 * size. To prevent this just update it a little bit earlier here.
+	 * dio completion end_io functions are only called on writes if more
+	 * than 0 bytes was written.
 	 */
-	if (offset + size > i_size_read(inode))
-		i_size_write(inode, offset + size);
+	ASSERT(size > 0);
 
 	/*
-	 * For direct I/O we do not know if we need to allocate blocks or not,
-	 * so we can't preallocate an append transaction, as that results in
-	 * nested reservations and log space deadlocks. Hence allocate the
-	 * transaction here. While this is sub-optimal and can block IO
-	 * completion for some time, we're stuck with doing it this way until
-	 * we can pass the ioend to the direct IO allocation callbacks and
-	 * avoid nesting that way.
+	 * The ioend only maps whole blocks, while the IO may be sector aligned.
+	 * Hence the ioend offset/size may not match the IO offset/size exactly,
+	 * but should span it completely. Write the IO sizes into the ioend so
+	 * that completion processing does the right thing.
 	 */
-	if (ioend->io_type == XFS_IO_UNWRITTEN && size > 0) {
-		xfs_iomap_write_unwritten(ip, offset, size);
-	} else if (offset + size > ip->i_d.di_size) {
-		struct xfs_trans *tp;
-		int error;
+	ASSERT(size <= ioend->io_size);
+	ASSERT(offset >= ioend->io_offset);
+	ASSERT(offset + size <= ioend->io_offset + ioend->io_size);
+	ioend->io_size = size;
+	ioend->io_offset = offset;
 
-		tp = xfs_trans_alloc(mp, XFS_TRANS_FSYNC_TS);
-		error = xfs_trans_reserve(tp, &M_RES(mp)->tr_fsyncts, 0, 0);
-		if (error) {
-			xfs_trans_cancel(tp, 0);
-			goto out_destroy_ioend;
-		}
+	/*
+	 * The ioend tells us whether we are doing unwritten extent conversion
+	 * or an append transaction that updates the on-disk file size. These
+	 * cases are the only cases where we should *potentially* be needing
+	 * to update the VFS inode size. When the ioend indicates this, we
+	 * are *guaranteed* to be running in non-interrupt context.
+	 *
+	 * We need to update the in-core inode size here so that we don't end up
+	 * with the on-disk inode size being outside the in-core inode size.
+	 * While we can do this in the process context after the IO has
+	 * completed, this does not work for AIO and hence we always update
+	 * the in-core inode size here if necessary.
+	 */
+	if (ioend->io_type == XFS_IO_UNWRITTEN || ioend->io_append_trans) {
+		if (offset + size > i_size_read(inode))
+			i_size_write(inode, offset + size);
+	} else
+		ASSERT(offset + size <= i_size_read(inode));
 
-		xfs_setfilesize(ip, tp, offset, size);
+	/* Ugh. No way to propagate errors, so ignore them. */
+	if (ioend->io_type == XFS_IO_UNWRITTEN) {
+		xfs_iomap_write_unwritten(ip, offset, size);
+	} else if (ioend->io_append_trans) {
+		xfs_setfilesize_ioend(ioend);
+	} else {
+		ASSERT(!xfs_ioend_is_append(ioend));
 	}
 
 out_destroy_ioend:
 	xfs_destroy_ioend(ioend);
 }
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index b2a45cc..a584c27 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1200,13 +1200,18 @@ DECLARE_EVENT_CLASS(xfs_imap_class,
 		__entry->blockcount = irec ? irec->br_blockcount : 0;
 	),
 	TP_printk("dev %d:%d ino 0x%llx size 0x%llx offset 0x%llx count %zd "
-		  "type %s startoff 0x%llx startblock %lld blockcount 0x%llx",
+		  "type %d startoff 0x%llx startblock %lld blockcount 0x%llx",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
 		  __entry->size,
 		  __entry->offset,
 		  __entry->count,
+		  /*
+		   * XXX: __print_symbolic broken for enums, fix coming in 4.1
+		   * cycle from Mr Rostedt. Need to know type now, so...
 		  __print_symbolic(__entry->type, XFS_IO_TYPES),
+		   */
+		  __entry->type,
 		  __entry->startoff,
 		  (__int64_t)__entry->startblock,
 		  __entry->blockcount)
@@ -1221,6 +1226,9 @@ DEFINE_IOMAP_EVENT(xfs_map_blocks_found);
 DEFINE_IOMAP_EVENT(xfs_map_blocks_alloc);
 DEFINE_IOMAP_EVENT(xfs_get_blocks_found);
 DEFINE_IOMAP_EVENT(xfs_get_blocks_alloc);
+DEFINE_IOMAP_EVENT(xfs_gbmap_direct);
+DEFINE_IOMAP_EVENT(xfs_gbmap_direct_new);
+DEFINE_IOMAP_EVENT(xfs_gbmap_direct_update);
 
 DECLARE_EVENT_CLASS(xfs_simple_io_class,
 	TP_PROTO(struct xfs_inode *ip, xfs_off_t offset, ssize_t count),
-- 
2.0.0
* Re: [PATCH 2/5] xfs: direct IO needs to use append ioends 2015-04-10 13:37 ` [PATCH 2/5] xfs: direct IO needs to use append ioends Dave Chinner @ 2015-04-10 20:21 ` Brian Foster 2015-04-10 22:30 ` Dave Chinner 2015-04-11 21:15 ` Brian Foster 1 sibling, 1 reply; 19+ messages in thread From: Brian Foster @ 2015-04-10 20:21 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs On Fri, Apr 10, 2015 at 11:37:57PM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > Now we have an ioend being passed unconditionally to the direct IO > write completion context, we can pass a preallocated transaction > handle for on-disk inode size updates that are run in completion. > > At this point we really need to be passing the correct block range > that the IO spans through the ioend, so calculate the last block in > the mapping before we map the allocated range and use that instead > of the size desired by the direct IO. > > This enables us to keep track of multiple get-blocks calls in the > same direct IO - the ioend will keep coming back to us, and we can > keep extending its range as new allocations and mappings are done. > > There are some new trace points added for debugging, and a small > hack to actually make the tracepoints work (enums in tracepoints > that use __print_symbolic don't work correctly) that should be fixed > in the 4.1 merge window. This hack can be removed when the > tracepoint fix is upstream. > > There are lots of comments explaining the intricacies of passing the > ioend and append transaction in the code; they are better placed in > the code because we're going to need them to understand why this > code does what it does in a few years' time.... > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- I still need to look at this one (and grok the dio code more)... 
If the latter, I think a separate patch is more appropriate... Brian > fs/xfs/xfs_aops.c | 262 +++++++++++++++++++++++++++++++++++++---------------- > fs/xfs/xfs_trace.h | 10 +- > 2 files changed, 194 insertions(+), 78 deletions(-) > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c > index d95a42b..52c7e46 100644 > --- a/fs/xfs/xfs_aops.c > +++ b/fs/xfs/xfs_aops.c > @@ -178,6 +178,25 @@ xfs_setfilesize_ioend( > return xfs_setfilesize(ip, tp, ioend->io_offset, ioend->io_size); > } > > +STATIC void > +xfs_setfilesize_ioend_cancel( > + struct xfs_ioend *ioend) > +{ > + struct xfs_trans *tp = ioend->io_append_trans; > + > + /* > + * The transaction may have been allocated in the I/O submission thread, > + * thus we need to mark ourselves as being in a transaction manually. > + * Similarly for freeze protection. > + */ > + current_set_flags_nested(&tp->t_pflags, PF_FSTRANS); > + rwsem_acquire_read(&ioend->io_inode->i_sb->s_writers.lock_map[SB_FREEZE_FS-1], > + 0, 1, _THIS_IP_); > + > + xfs_trans_cancel(tp, 0); > + ioend->io_append_trans = NULL; > +} > + > /* > * Schedule IO completion handling on the final put of an ioend. 
> * > @@ -1233,18 +1252,18 @@ xfs_vm_releasepage( > return try_to_free_buffers(page); > } > > -static void > +static int > xfs_get_blocks_map_buffer( > struct inode *inode, > struct buffer_head *bh_result, > int create, > int direct, > struct xfs_bmbt_irec *imap, > - xfs_off_t offset, > - ssize_t size) > + xfs_off_t offset) > { > struct xfs_ioend *ioend; > int type; > + loff_t size; > > if (!create) { > /* > @@ -1253,7 +1272,7 @@ xfs_get_blocks_map_buffer( > */ > if (!ISUNWRITTEN(imap)) > xfs_map_buffer(inode, bh_result, imap, offset); > - return; > + return 0; > } > > xfs_map_buffer(inode, bh_result, imap, offset); > @@ -1262,26 +1281,93 @@ xfs_get_blocks_map_buffer( > set_buffer_unwritten(bh_result); > > if (!direct) > - return; > + return 0; > > /* > - * Direct IO writes require an ioend to be allocated and > - * passed via the returned mapping. This allows the end > - * io function to determine the correct course of > - * action. > + * Direct IO writes require an ioend to be allocated and passed via the > + * returned mapping. This allows the end io function to determine the > + * correct course of action. > + * > + * Unwritten extents will need transactions at completion, so is known > + * to need deferring to a workqueue. However, for writes into written > + * extents, we *may* need a transaction if this IO extends the on-disk > + * EOF. Because we can race with other IOs the file may already be > + * extended by the time we get to the transaction. IO completion already > + * handles that case so all we will have done is incurred the overhead > + * of workqueue deferral for completion. This is acceptable overhead for > + * the rare case that this occurs. 
> */ > - > if (ISUNWRITTEN(imap)) { > type = XFS_IO_UNWRITTEN; > set_buffer_defer_completion(bh_result); > } else > type = XFS_IO_OVERWRITE; > - ioend = xfs_alloc_ioend(inode, type); > - ioend->io_offset = offset; > - ioend->io_size = size; > - bh_result->b_private = ioend; > > - return; > + /* > + * The offset that is passed in is the first block the DIO will fall > + * into. The size supplied by the DIO layer is what it thinks it needs > + * but the mapping may not span this entire range. Hence we use the > + * truncated mapping size that's already been stashed in the bh_result > + * to calculate the range covered by the ioend. > + */ > + size = bh_result->b_size; > + trace_xfs_gbmap_direct(XFS_I(inode), offset, size, type, imap); > + > + /* > + * If we get multiple mappings to in a single IO, we might be mapping > + * dfferent types. But because the direct IO can only have a single > + * private pointer, we need to ensure that: > + * > + * a) the ioend spans the entire region of the IO; and > + * b) if it contains unwritten extents, it is *permanently* marked as > + * such and we cancel any append transaction attached to the ioend. > + * > + * We could do this by chaining ioends like buffered IO does, but > + * we only actually get one IO completion callback from the direct IO, > + * and that spans the entire IO regardless of how many mappings and IOs > + * are needed to complete the DIO. There is only going to be one > + * reference to the ioend and it's life cycle is constrained by the > + * DIO completion code. hence we don't need reference counting here. 
> + */ > + if (bh_result->b_private) { > + ioend = bh_result->b_private; > + ASSERT(ioend->io_size > 0); > + ASSERT(offset >= ioend->io_offset); > + > + if (offset + size > ioend->io_offset + ioend->io_size) > + ioend->io_size = offset - ioend->io_offset + size; > + > + if (type == XFS_IO_UNWRITTEN) { > + if (ioend->io_append_trans) > + xfs_setfilesize_ioend_cancel(ioend); > + ioend->io_type = XFS_IO_UNWRITTEN; > + } > + trace_xfs_gbmap_direct_update(XFS_I(inode), ioend->io_offset, > + ioend->io_size, ioend->io_type, > + imap); > + } else { > + ioend = xfs_alloc_ioend(inode, type); > + ioend->io_offset = offset; > + ioend->io_size = size; > + bh_result->b_private = ioend; > + trace_xfs_gbmap_direct_new(XFS_I(inode), offset, size, type, > + imap); > + } > + > + /* check if we need an append transaction allocated. */ > + if (ioend->io_type == XFS_IO_OVERWRITE && > + xfs_ioend_is_append(ioend) && !ioend->io_append_trans) { > + int error; > + > + error = xfs_setfilesize_trans_alloc(ioend); > + ASSERT(!error); > + if (error) { > + xfs_destroy_ioend(ioend); > + return error; > + } > + set_buffer_defer_completion(bh_result); > + } > + return 0; > } > > STATIC int > @@ -1374,50 +1460,19 @@ __xfs_get_blocks( > xfs_iunlock(ip, lockmode); > } > > - trace_xfs_get_blocks_alloc(ip, offset, size, 0, &imap); > + trace_xfs_get_blocks_alloc(ip, offset, size, > + ISUNWRITTEN(&imap) ? XFS_IO_UNWRITTEN > + : XFS_IO_DELALLOC, &imap); > } else if (nimaps) { > - trace_xfs_get_blocks_found(ip, offset, size, 0, &imap); > + trace_xfs_get_blocks_found(ip, offset, size, > + ISUNWRITTEN(&imap) ? 
XFS_IO_UNWRITTEN > + : XFS_IO_OVERWRITE, &imap); > xfs_iunlock(ip, lockmode); > } else { > trace_xfs_get_blocks_notfound(ip, offset, size); > goto out_unlock; > } > > - if (imap.br_startblock != HOLESTARTBLOCK && > - imap.br_startblock != DELAYSTARTBLOCK) > - xfs_get_blocks_map_buffer(inode, bh_result, create, direct, > - &imap, offset, size); > - > - /* > - * If this is a realtime file, data may be on a different device. > - * to that pointed to from the buffer_head b_bdev currently. > - */ > - bh_result->b_bdev = xfs_find_bdev_for_inode(inode); > - > - /* > - * If we previously allocated a block out beyond eof and we are now > - * coming back to use it then we will need to flag it as new even if it > - * has a disk address. > - * > - * With sub-block writes into unwritten extents we also need to mark > - * the buffer as new so that the unwritten parts of the buffer gets > - * correctly zeroed. > - */ > - if (create && > - ((!buffer_mapped(bh_result) && !buffer_uptodate(bh_result)) || > - (offset >= i_size_read(inode)) || > - (new || ISUNWRITTEN(&imap)))) > - set_buffer_new(bh_result); > - > - if (imap.br_startblock == DELAYSTARTBLOCK) { > - BUG_ON(direct); > - if (create) { > - set_buffer_uptodate(bh_result); > - set_buffer_mapped(bh_result); > - set_buffer_delay(bh_result); > - } > - } > - > /* > * If this is O_DIRECT or the mpage code calling tell them how large > * the mapping is, so that we can avoid repeated get_blocks calls. > @@ -1451,6 +1506,46 @@ __xfs_get_blocks( > bh_result->b_size = mapping_size; > } > > + if (imap.br_startblock != HOLESTARTBLOCK && > + imap.br_startblock != DELAYSTARTBLOCK) { > + error = xfs_get_blocks_map_buffer(inode, bh_result, create, > + direct, &imap, offset); > + if (error) > + return error; > + } > + if (create && direct) > + ASSERT(bh_result->b_private); > + > + /* > + * If this is a realtime file, data may be on a different device. > + * to that pointed to from the buffer_head b_bdev currently. 
> + */ > + bh_result->b_bdev = xfs_find_bdev_for_inode(inode); > + > + /* > + * If we previously allocated a block out beyond eof and we are now > + * coming back to use it then we will need to flag it as new even if it > + * has a disk address. > + * > + * With sub-block writes into unwritten extents we also need to mark > + * the buffer as new so that the unwritten parts of the buffer gets > + * correctly zeroed. > + */ > + if (create && > + ((!buffer_mapped(bh_result) && !buffer_uptodate(bh_result)) || > + (offset >= i_size_read(inode)) || > + (new || ISUNWRITTEN(&imap)))) > + set_buffer_new(bh_result); > + > + if (imap.br_startblock == DELAYSTARTBLOCK) { > + BUG_ON(direct); > + if (create) { > + set_buffer_uptodate(bh_result); > + set_buffer_mapped(bh_result); > + set_buffer_delay(bh_result); > + } > + } > + > return 0; > > out_unlock: > @@ -1501,38 +1596,51 @@ xfs_end_io_direct_write( > goto out_destroy_ioend; > > /* > - * While the generic direct I/O code updates the inode size, it does > - * so only after the end_io handler is called, which means our > - * end_io handler thinks the on-disk size is outside the in-core > - * size. To prevent this just update it a little bit earlier here. > + * dio completion end_io functions are only called on writes if more > + * than 0 bytes was written. > */ > - if (offset + size > i_size_read(inode)) > - i_size_write(inode, offset + size); > + ASSERT(size > 0); > > /* > - * For direct I/O we do not know if we need to allocate blocks or not, > - * so we can't preallocate an append transaction, as that results in > - * nested reservations and log space deadlocks. Hence allocate the > - * transaction here. While this is sub-optimal and can block IO > - * completion for some time, we're stuck with doing it this way until > - * we can pass the ioend to the direct IO allocation callbacks and > - * avoid nesting that way. > + * The ioend only maps whole blocks, while the IO may be sector aligned. 
> + * Hence the ioend offset/size may not match the IO offset/size exactly, > + * but should span it completely. Write the IO sizes into the ioend so > + * that completion processing does the right thing. > */ > - if (ioend->io_type == XFS_IO_UNWRITTEN && size > 0) { > - xfs_iomap_write_unwritten(ip, offset, size); > - } else if (offset + size > ip->i_d.di_size) { > - struct xfs_trans *tp; > - int error; > + ASSERT(size <= ioend->io_size); > + ASSERT(offset >= ioend->io_offset); > + ASSERT(offset + size <= ioend->io_offset + ioend->io_size); > + ioend->io_size = size; > + ioend->io_offset = offset; > > - tp = xfs_trans_alloc(mp, XFS_TRANS_FSYNC_TS); > - error = xfs_trans_reserve(tp, &M_RES(mp)->tr_fsyncts, 0, 0); > - if (error) { > - xfs_trans_cancel(tp, 0); > - goto out_destroy_ioend; > - } > + /* > + * The ioend tells us whether we are doing unwritten extent conversion > + * or an append transaction that updates the on-disk file size. These > + * cases are the only cases where we should *potentially* be needing > + * to update the VFS inode size. When the ioend indicates this, we > + * are *guaranteed* to be running in non-interrupt context. > + * > + * We need to update the in-core inode size here so that we don't end up > + * with the on-disk inode size being outside the in-core inode size. > + * While we can do this in the process context after the IO has > + * completed, this does not work for AIO and hence we always update > + * the in-core inode size here if necessary. > + */ > + if (ioend->io_type == XFS_IO_UNWRITTEN || ioend->io_append_trans) { > + if (offset + size > i_size_read(inode)) > + i_size_write(inode, offset + size); > + } else > + ASSERT(offset + size <= i_size_read(inode)); > > - xfs_setfilesize(ip, tp, offset, size); > + /* Ugh. No way to propagate errors, so ignore them. 
*/ > + if (ioend->io_type == XFS_IO_UNWRITTEN) { > + xfs_iomap_write_unwritten(ip, offset, size); > + } else if (ioend->io_append_trans) { > + xfs_setfilesize_ioend(ioend); > + } else { > + ASSERT(!xfs_ioend_is_append(ioend)); > } > + > out_destroy_ioend: > xfs_destroy_ioend(ioend); > } > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h > index b2a45cc..a584c27 100644 > --- a/fs/xfs/xfs_trace.h > +++ b/fs/xfs/xfs_trace.h > @@ -1200,13 +1200,18 @@ DECLARE_EVENT_CLASS(xfs_imap_class, > __entry->blockcount = irec ? irec->br_blockcount : 0; > ), > TP_printk("dev %d:%d ino 0x%llx size 0x%llx offset 0x%llx count %zd " > - "type %s startoff 0x%llx startblock %lld blockcount 0x%llx", > + "type %d startoff 0x%llx startblock %lld blockcount 0x%llx", > MAJOR(__entry->dev), MINOR(__entry->dev), > __entry->ino, > __entry->size, > __entry->offset, > __entry->count, > + /* > + * XXX: __print_symbolic broken for enums, fix coming in 4.1 > + * cycle from Mr Rostedt. Need to know type now, so... > __print_symbolic(__entry->type, XFS_IO_TYPES), > + */ > + __entry->type, > __entry->startoff, > (__int64_t)__entry->startblock, > __entry->blockcount) > @@ -1221,6 +1226,9 @@ DEFINE_IOMAP_EVENT(xfs_map_blocks_found); > DEFINE_IOMAP_EVENT(xfs_map_blocks_alloc); > DEFINE_IOMAP_EVENT(xfs_get_blocks_found); > DEFINE_IOMAP_EVENT(xfs_get_blocks_alloc); > +DEFINE_IOMAP_EVENT(xfs_gbmap_direct); > +DEFINE_IOMAP_EVENT(xfs_gbmap_direct_new); > +DEFINE_IOMAP_EVENT(xfs_gbmap_direct_update); > > DECLARE_EVENT_CLASS(xfs_simple_io_class, > TP_PROTO(struct xfs_inode *ip, xfs_off_t offset, ssize_t count), > -- > 2.0.0 > > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/5] xfs: direct IO needs to use append ioends 2015-04-10 20:21 ` Brian Foster @ 2015-04-10 22:30 ` Dave Chinner 2015-04-11 21:12 ` Brian Foster 0 siblings, 1 reply; 19+ messages in thread From: Dave Chinner @ 2015-04-10 22:30 UTC (permalink / raw) To: Brian Foster; +Cc: xfs On Fri, Apr 10, 2015 at 04:21:47PM -0400, Brian Foster wrote: > On Fri, Apr 10, 2015 at 11:37:57PM +1000, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > Now we have an ioend being passed unconditionally to the direct IO > > write completion context, we can pass a preallocated transaction > > handle for on-disk inode size updates that are run in completion. > > > > At this point we really need to be passing the correct block range > > that the IO spans through the ioend, so calculate the last block in > > the mapping before we map the allocated range and use that instead > > of the size desired by the direct IO. > > > > This enables us to keep track of multiple get-blocks calls in the > > same direct IO - the ioend will keep coming back to us, and we can > > keep extending it's range as new allocations and mappings are done. > > > > There are some new trace points added for debugging, and a small > > hack to actually make the tracepoints work (enums in tracepoints > > that use __print_symbolic don't work correctly) that should be fixed > > in the 4.1 merge window. THis hack can be removed when the > > tracepoint fix is upstream. > > > > There are lots of comments explaining the intricacies of passing the > > ioend and append transaction in the code; they are better placed in > > the code because we're going to need them to understand why this > > code does what it does in a few years time.... > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > --- > > I still need to look at this one (and grok the dio code more)... 
but an > initial question: is this multiple get_blocks() call aggregation a > requirement for the append ioend mechanism or an optimization? If the > latter, I think a separate patch is more appropriate... Requirement. Direct Io is a twisty maze of passages loaded with deadly traps. e.g. non AIO path: ->direct_IO alloc dio(off, len) loop until all IO issued { get_blocks dio->private = bh_result->b_private build bio dio->ref++ submit bio } dio_await_completion(dio) dio_complete(dio) dio->ref-- => goes to zero dio->end_io(filp, off, len, dio->private) xfs_end_io_direct_write(... off, len, ioend) So, essentially, for as many bios that are mapped and submitted for the direct IO, there is only one end IO completion call for the entire IO. The multiple mappings we make need to aggregate the state of the entire IO, not just the single instance.... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/5] xfs: direct IO needs to use append ioends 2015-04-10 22:30 ` Dave Chinner @ 2015-04-11 21:12 ` Brian Foster 0 siblings, 0 replies; 19+ messages in thread From: Brian Foster @ 2015-04-11 21:12 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs On Sat, Apr 11, 2015 at 08:30:40AM +1000, Dave Chinner wrote: > On Fri, Apr 10, 2015 at 04:21:47PM -0400, Brian Foster wrote: > > On Fri, Apr 10, 2015 at 11:37:57PM +1000, Dave Chinner wrote: > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > Now we have an ioend being passed unconditionally to the direct IO > > > write completion context, we can pass a preallocated transaction > > > handle for on-disk inode size updates that are run in completion. > > > > > > At this point we really need to be passing the correct block range > > > that the IO spans through the ioend, so calculate the last block in > > > the mapping before we map the allocated range and use that instead > > > of the size desired by the direct IO. > > > > > > This enables us to keep track of multiple get-blocks calls in the > > > same direct IO - the ioend will keep coming back to us, and we can > > > keep extending it's range as new allocations and mappings are done. > > > > > > There are some new trace points added for debugging, and a small > > > hack to actually make the tracepoints work (enums in tracepoints > > > that use __print_symbolic don't work correctly) that should be fixed > > > in the 4.1 merge window. THis hack can be removed when the > > > tracepoint fix is upstream. > > > > > > There are lots of comments explaining the intricacies of passing the > > > ioend and append transaction in the code; they are better placed in > > > the code because we're going to need them to understand why this > > > code does what it does in a few years time.... > > > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > > --- > > > > I still need to look at this one (and grok the dio code more)... 
but an > > initial question: is this multiple get_blocks() call aggregation a > > requirement for the append ioend mechanism or an optimization? If the > > latter, I think a separate patch is more appropriate... > > Requirement. Direct Io is a twisty maze of passages loaded with > deadly traps. e.g. non AIO path: > > ->direct_IO > alloc dio(off, len) > loop until all IO issued { > get_blocks > dio->private = bh_result->b_private > build bio > dio->ref++ > submit bio > } > > dio_await_completion(dio) > dio_complete(dio) > dio->ref-- => goes to zero > dio->end_io(filp, off, len, dio->private) > xfs_end_io_direct_write(... off, len, ioend) > > > So, essentially, for as many bios that are mapped and submitted for > the direct IO, there is only one end IO completion call for the > entire IO. The multiple mappings we make need to aggregate the state > of the entire IO, not just the single instance.... > Ok, thanks for the breakdown. Essentially, we need to track the highest precedent I/O type of the overall DIO with respect to the completion handler. The patch itself is not hard to follow, but the dio path is a different beast. What I didn't quite catch when first playing with this is the mapping size optimization earlier in the get_blocks call that effectively defeats some of this by reducing the need for multiple calls in many cases. A single DIO write over a range of alternating map types (e.g., alternating preallocated blocks and holes), for example, is a better way to trigger the ioend aggregation. Additional comments to follow... Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/5] xfs: direct IO needs to use append ioends 2015-04-10 13:37 ` [PATCH 2/5] xfs: direct IO needs to use append ioends Dave Chinner 2015-04-10 20:21 ` Brian Foster @ 2015-04-11 21:15 ` Brian Foster 2015-04-12 23:31 ` Dave Chinner 1 sibling, 1 reply; 19+ messages in thread From: Brian Foster @ 2015-04-11 21:15 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs On Fri, Apr 10, 2015 at 11:37:57PM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > Now we have an ioend being passed unconditionally to the direct IO > write completion context, we can pass a preallocated transaction > handle for on-disk inode size updates that are run in completion. > > At this point we really need to be passing the correct block range > that the IO spans through the ioend, so calculate the last block in > the mapping before we map the allocated range and use that instead > of the size desired by the direct IO. > > This enables us to keep track of multiple get-blocks calls in the > same direct IO - the ioend will keep coming back to us, and we can > keep extending it's range as new allocations and mappings are done. > > There are some new trace points added for debugging, and a small > hack to actually make the tracepoints work (enums in tracepoints > that use __print_symbolic don't work correctly) that should be fixed > in the 4.1 merge window. THis hack can be removed when the > tracepoint fix is upstream. > > There are lots of comments explaining the intricacies of passing the > ioend and append transaction in the code; they are better placed in > the code because we're going to need them to understand why this > code does what it does in a few years time.... 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_aops.c | 262 +++++++++++++++++++++++++++++++++++++---------------- > fs/xfs/xfs_trace.h | 10 +- > 2 files changed, 194 insertions(+), 78 deletions(-) > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c > index d95a42b..52c7e46 100644 > --- a/fs/xfs/xfs_aops.c > +++ b/fs/xfs/xfs_aops.c > @@ -178,6 +178,25 @@ xfs_setfilesize_ioend( > return xfs_setfilesize(ip, tp, ioend->io_offset, ioend->io_size); > } > > +STATIC void > +xfs_setfilesize_ioend_cancel( > + struct xfs_ioend *ioend) > +{ > + struct xfs_trans *tp = ioend->io_append_trans; > + > + /* > + * The transaction may have been allocated in the I/O submission thread, > + * thus we need to mark ourselves as being in a transaction manually. > + * Similarly for freeze protection. > + */ This threw me off at first because we can call this from either the submission or the completion context, unlike the commit case where this comment is copied from. Could we move the comment above the function and clarify a bit? E.g., something like the following is a bit more clear to me: /* * The process transaction and freeze protection state is cleared immediately * after setfilesize transaction allocation to support transfer of the tp from * submission to completion context. Restore the context appropriately to cancel * the transaction. */ > + current_set_flags_nested(&tp->t_pflags, PF_FSTRANS); > + rwsem_acquire_read(&ioend->io_inode->i_sb->s_writers.lock_map[SB_FREEZE_FS-1], > + 0, 1, _THIS_IP_); > + > + xfs_trans_cancel(tp, 0); > + ioend->io_append_trans = NULL; > +} > + > /* > * Schedule IO completion handling on the final put of an ioend. 
> * > @@ -1233,18 +1252,18 @@ xfs_vm_releasepage( > return try_to_free_buffers(page); > } > > -static void > +static int > xfs_get_blocks_map_buffer( > struct inode *inode, > struct buffer_head *bh_result, > int create, > int direct, > struct xfs_bmbt_irec *imap, > - xfs_off_t offset, > - ssize_t size) > + xfs_off_t offset) > { > struct xfs_ioend *ioend; > int type; > + loff_t size; > > if (!create) { > /* > @@ -1253,7 +1272,7 @@ xfs_get_blocks_map_buffer( > */ > if (!ISUNWRITTEN(imap)) > xfs_map_buffer(inode, bh_result, imap, offset); > - return; > + return 0; > } > > xfs_map_buffer(inode, bh_result, imap, offset); > @@ -1262,26 +1281,93 @@ xfs_get_blocks_map_buffer( > set_buffer_unwritten(bh_result); > > if (!direct) > - return; > + return 0; > > /* > - * Direct IO writes require an ioend to be allocated and > - * passed via the returned mapping. This allows the end > - * io function to determine the correct course of > - * action. > + * Direct IO writes require an ioend to be allocated and passed via the > + * returned mapping. This allows the end io function to determine the > + * correct course of action. > + * > + * Unwritten extents will need transactions at completion, so is known > + * to need deferring to a workqueue. However, for writes into written > + * extents, we *may* need a transaction if this IO extends the on-disk > + * EOF. Because we can race with other IOs the file may already be > + * extended by the time we get to the transaction. IO completion already > + * handles that case so all we will have done is incurred the overhead > + * of workqueue deferral for completion. This is acceptable overhead for > + * the rare case that this occurs. 
> */ > - > if (ISUNWRITTEN(imap)) { > type = XFS_IO_UNWRITTEN; > set_buffer_defer_completion(bh_result); > } else > type = XFS_IO_OVERWRITE; > - ioend = xfs_alloc_ioend(inode, type); > - ioend->io_offset = offset; > - ioend->io_size = size; > - bh_result->b_private = ioend; > > - return; > + /* > + * The offset that is passed in is the first block the DIO will fall > + * into. The size supplied by the DIO layer is what it thinks it needs > + * but the mapping may not span this entire range. Hence we use the > + * truncated mapping size that's already been stashed in the bh_result > + * to calculate the range covered by the ioend. > + */ > + size = bh_result->b_size; > + trace_xfs_gbmap_direct(XFS_I(inode), offset, size, type, imap); > + > + /* > + * If we get multiple mappings to in a single IO, we might be mapping > + * dfferent types. But because the direct IO can only have a single different > + * private pointer, we need to ensure that: > + * > + * a) the ioend spans the entire region of the IO; and > + * b) if it contains unwritten extents, it is *permanently* marked as > + * such and we cancel any append transaction attached to the ioend. > + * > + * We could do this by chaining ioends like buffered IO does, but > + * we only actually get one IO completion callback from the direct IO, > + * and that spans the entire IO regardless of how many mappings and IOs > + * are needed to complete the DIO. There is only going to be one > + * reference to the ioend and it's life cycle is constrained by the > + * DIO completion code. hence we don't need reference counting here. > + */ > + if (bh_result->b_private) { > + ioend = bh_result->b_private; > + ASSERT(ioend->io_size > 0); > + ASSERT(offset >= ioend->io_offset); > + > + if (offset + size > ioend->io_offset + ioend->io_size) > + ioend->io_size = offset - ioend->io_offset + size; > + > + if (type == XFS_IO_UNWRITTEN) { if (type == XFS_IO_UNWRITTEN && ioend->io_type != XFS_IO_UNWRITTEN) ... 
makes that a bit easier to follow imo. > + if (ioend->io_append_trans) > + xfs_setfilesize_ioend_cancel(ioend); > + ioend->io_type = XFS_IO_UNWRITTEN; > + } > + trace_xfs_gbmap_direct_update(XFS_I(inode), ioend->io_offset, > + ioend->io_size, ioend->io_type, > + imap); > + } else { > + ioend = xfs_alloc_ioend(inode, type); > + ioend->io_offset = offset; > + ioend->io_size = size; > + bh_result->b_private = ioend; > + trace_xfs_gbmap_direct_new(XFS_I(inode), offset, size, type, > + imap); > + } > + > + /* check if we need an append transaction allocated. */ > + if (ioend->io_type == XFS_IO_OVERWRITE && > + xfs_ioend_is_append(ioend) && !ioend->io_append_trans) { > + int error; > + > + error = xfs_setfilesize_trans_alloc(ioend); I'm not totally convinced this is safe. We previously moved this tp allocation from before a potential xfs_iomap_write_direct() call to the completion side to avoid nesting this allocation with unwritten extent allocation transactions. See the following for reference: 437a255a xfs: fix direct IO nested transaction deadlock Now we move it after that point of the codepath, and even then we know that this is an overwrite if we do the allocation here. If we continue on and hit a hole, it looks like there's still a sequence to allocate this transaction and call xfs_iomap_write_direct(), nesting the associated transaction reservations. Am I missing something? Brian > + ASSERT(!error); > + if (error) { > + xfs_destroy_ioend(ioend); > + return error; > + } > + set_buffer_defer_completion(bh_result); > + } > + return 0; > } > > STATIC int > @@ -1374,50 +1460,19 @@ __xfs_get_blocks( > xfs_iunlock(ip, lockmode); > } > > - trace_xfs_get_blocks_alloc(ip, offset, size, 0, &imap); > + trace_xfs_get_blocks_alloc(ip, offset, size, > + ISUNWRITTEN(&imap) ? XFS_IO_UNWRITTEN > + : XFS_IO_DELALLOC, &imap); > } else if (nimaps) { > - trace_xfs_get_blocks_found(ip, offset, size, 0, &imap); > + trace_xfs_get_blocks_found(ip, offset, size, > + ISUNWRITTEN(&imap) ? 
XFS_IO_UNWRITTEN > + : XFS_IO_OVERWRITE, &imap); > xfs_iunlock(ip, lockmode); > } else { > trace_xfs_get_blocks_notfound(ip, offset, size); > goto out_unlock; > } > > - if (imap.br_startblock != HOLESTARTBLOCK && > - imap.br_startblock != DELAYSTARTBLOCK) > - xfs_get_blocks_map_buffer(inode, bh_result, create, direct, > - &imap, offset, size); > - > - /* > - * If this is a realtime file, data may be on a different device. > - * to that pointed to from the buffer_head b_bdev currently. > - */ > - bh_result->b_bdev = xfs_find_bdev_for_inode(inode); > - > - /* > - * If we previously allocated a block out beyond eof and we are now > - * coming back to use it then we will need to flag it as new even if it > - * has a disk address. > - * > - * With sub-block writes into unwritten extents we also need to mark > - * the buffer as new so that the unwritten parts of the buffer gets > - * correctly zeroed. > - */ > - if (create && > - ((!buffer_mapped(bh_result) && !buffer_uptodate(bh_result)) || > - (offset >= i_size_read(inode)) || > - (new || ISUNWRITTEN(&imap)))) > - set_buffer_new(bh_result); > - > - if (imap.br_startblock == DELAYSTARTBLOCK) { > - BUG_ON(direct); > - if (create) { > - set_buffer_uptodate(bh_result); > - set_buffer_mapped(bh_result); > - set_buffer_delay(bh_result); > - } > - } > - > /* > * If this is O_DIRECT or the mpage code calling tell them how large > * the mapping is, so that we can avoid repeated get_blocks calls. > @@ -1451,6 +1506,46 @@ __xfs_get_blocks( > bh_result->b_size = mapping_size; > } > > + if (imap.br_startblock != HOLESTARTBLOCK && > + imap.br_startblock != DELAYSTARTBLOCK) { > + error = xfs_get_blocks_map_buffer(inode, bh_result, create, > + direct, &imap, offset); > + if (error) > + return error; > + } > + if (create && direct) > + ASSERT(bh_result->b_private); > + > + /* > + * If this is a realtime file, data may be on a different device. > + * to that pointed to from the buffer_head b_bdev currently. 
> + */ > + bh_result->b_bdev = xfs_find_bdev_for_inode(inode); > + > + /* > + * If we previously allocated a block out beyond eof and we are now > + * coming back to use it then we will need to flag it as new even if it > + * has a disk address. > + * > + * With sub-block writes into unwritten extents we also need to mark > + * the buffer as new so that the unwritten parts of the buffer gets > + * correctly zeroed. > + */ > + if (create && > + ((!buffer_mapped(bh_result) && !buffer_uptodate(bh_result)) || > + (offset >= i_size_read(inode)) || > + (new || ISUNWRITTEN(&imap)))) > + set_buffer_new(bh_result); > + > + if (imap.br_startblock == DELAYSTARTBLOCK) { > + BUG_ON(direct); > + if (create) { > + set_buffer_uptodate(bh_result); > + set_buffer_mapped(bh_result); > + set_buffer_delay(bh_result); > + } > + } > + > return 0; > > out_unlock: > @@ -1501,38 +1596,51 @@ xfs_end_io_direct_write( > goto out_destroy_ioend; > > /* > - * While the generic direct I/O code updates the inode size, it does > - * so only after the end_io handler is called, which means our > - * end_io handler thinks the on-disk size is outside the in-core > - * size. To prevent this just update it a little bit earlier here. > + * dio completion end_io functions are only called on writes if more > + * than 0 bytes was written. > */ > - if (offset + size > i_size_read(inode)) > - i_size_write(inode, offset + size); > + ASSERT(size > 0); > > /* > - * For direct I/O we do not know if we need to allocate blocks or not, > - * so we can't preallocate an append transaction, as that results in > - * nested reservations and log space deadlocks. Hence allocate the > - * transaction here. While this is sub-optimal and can block IO > - * completion for some time, we're stuck with doing it this way until > - * we can pass the ioend to the direct IO allocation callbacks and > - * avoid nesting that way. > + * The ioend only maps whole blocks, while the IO may be sector aligned. 
> + * Hence the ioend offset/size may not match the IO offset/size exactly, > + * but should span it completely. Write the IO sizes into the ioend so > + * that completion processing does the right thing. > */ > - if (ioend->io_type == XFS_IO_UNWRITTEN && size > 0) { > - xfs_iomap_write_unwritten(ip, offset, size); > - } else if (offset + size > ip->i_d.di_size) { > - struct xfs_trans *tp; > - int error; > + ASSERT(size <= ioend->io_size); > + ASSERT(offset >= ioend->io_offset); > + ASSERT(offset + size <= ioend->io_offset + ioend->io_size); > + ioend->io_size = size; > + ioend->io_offset = offset; > > - tp = xfs_trans_alloc(mp, XFS_TRANS_FSYNC_TS); > - error = xfs_trans_reserve(tp, &M_RES(mp)->tr_fsyncts, 0, 0); > - if (error) { > - xfs_trans_cancel(tp, 0); > - goto out_destroy_ioend; > - } > + /* > + * The ioend tells us whether we are doing unwritten extent conversion > + * or an append transaction that updates the on-disk file size. These > + * cases are the only cases where we should *potentially* be needing > + * to update the VFS inode size. When the ioend indicates this, we > + * are *guaranteed* to be running in non-interrupt context. > + * > + * We need to update the in-core inode size here so that we don't end up > + * with the on-disk inode size being outside the in-core inode size. > + * While we can do this in the process context after the IO has > + * completed, this does not work for AIO and hence we always update > + * the in-core inode size here if necessary. > + */ > + if (ioend->io_type == XFS_IO_UNWRITTEN || ioend->io_append_trans) { > + if (offset + size > i_size_read(inode)) > + i_size_write(inode, offset + size); > + } else > + ASSERT(offset + size <= i_size_read(inode)); > > - xfs_setfilesize(ip, tp, offset, size); > + /* Ugh. No way to propagate errors, so ignore them. 
*/ > + if (ioend->io_type == XFS_IO_UNWRITTEN) { > + xfs_iomap_write_unwritten(ip, offset, size); > + } else if (ioend->io_append_trans) { > + xfs_setfilesize_ioend(ioend); > + } else { > + ASSERT(!xfs_ioend_is_append(ioend)); > } > + > out_destroy_ioend: > xfs_destroy_ioend(ioend); > } > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h > index b2a45cc..a584c27 100644 > --- a/fs/xfs/xfs_trace.h > +++ b/fs/xfs/xfs_trace.h > @@ -1200,13 +1200,18 @@ DECLARE_EVENT_CLASS(xfs_imap_class, > __entry->blockcount = irec ? irec->br_blockcount : 0; > ), > TP_printk("dev %d:%d ino 0x%llx size 0x%llx offset 0x%llx count %zd " > - "type %s startoff 0x%llx startblock %lld blockcount 0x%llx", > + "type %d startoff 0x%llx startblock %lld blockcount 0x%llx", > MAJOR(__entry->dev), MINOR(__entry->dev), > __entry->ino, > __entry->size, > __entry->offset, > __entry->count, > + /* > + * XXX: __print_symbolic broken for enums, fix coming in 4.1 > + * cycle from Mr Rostedt. Need to know type now, so... > __print_symbolic(__entry->type, XFS_IO_TYPES), > + */ > + __entry->type, > __entry->startoff, > (__int64_t)__entry->startblock, > __entry->blockcount) > @@ -1221,6 +1226,9 @@ DEFINE_IOMAP_EVENT(xfs_map_blocks_found); > DEFINE_IOMAP_EVENT(xfs_map_blocks_alloc); > DEFINE_IOMAP_EVENT(xfs_get_blocks_found); > DEFINE_IOMAP_EVENT(xfs_get_blocks_alloc); > +DEFINE_IOMAP_EVENT(xfs_gbmap_direct); > +DEFINE_IOMAP_EVENT(xfs_gbmap_direct_new); > +DEFINE_IOMAP_EVENT(xfs_gbmap_direct_update); > > DECLARE_EVENT_CLASS(xfs_simple_io_class, > TP_PROTO(struct xfs_inode *ip, xfs_off_t offset, ssize_t count), > -- > 2.0.0 > > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/5] xfs: direct IO needs to use append ioends 2015-04-11 21:15 ` Brian Foster @ 2015-04-12 23:31 ` Dave Chinner 2015-04-13 11:20 ` Brian Foster 0 siblings, 1 reply; 19+ messages in thread From: Dave Chinner @ 2015-04-12 23:31 UTC (permalink / raw) To: Brian Foster; +Cc: xfs On Sat, Apr 11, 2015 at 05:15:18PM -0400, Brian Foster wrote: > On Fri, Apr 10, 2015 at 11:37:57PM +1000, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > Now we have an ioend being passed unconditionally to the direct IO > > write completion context, we can pass a preallocated transaction > > handle for on-disk inode size updates that are run in completion. ..... > > --- a/fs/xfs/xfs_aops.c > > +++ b/fs/xfs/xfs_aops.c > > @@ -178,6 +178,25 @@ xfs_setfilesize_ioend( > > return xfs_setfilesize(ip, tp, ioend->io_offset, ioend->io_size); > > } > > > > +STATIC void > > +xfs_setfilesize_ioend_cancel( > > + struct xfs_ioend *ioend) > > +{ > > + struct xfs_trans *tp = ioend->io_append_trans; > > + > > + /* > > + * The transaction may have been allocated in the I/O submission thread, > > + * thus we need to mark ourselves as being in a transaction manually. > > + * Similarly for freeze protection. > > + */ > > This threw me off at first because we can call this from either the > submission or the completion context, unlike the commit case where this > comment is copied from. Could we move the comment above the function and > clarify a bit? E.g., something like the following is a bit more clear to > me: > > /* > * The process transaction and freeze protection state is cleared immediately > * after setfilesize transaction allocation to support transfer of the tp from > * submission to completion context. Restore the context appropriately to cancel > * the transaction. > */ OK, I can do that, but given your next comments, it might just go away. 
> > + } else { > > + ioend = xfs_alloc_ioend(inode, type); > > + ioend->io_offset = offset; > > + ioend->io_size = size; > > + bh_result->b_private = ioend; > > + trace_xfs_gbmap_direct_new(XFS_I(inode), offset, size, type, > > + imap); > > + } > > + > > + /* check if we need an append transaction allocated. */ > > + if (ioend->io_type == XFS_IO_OVERWRITE && > > + xfs_ioend_is_append(ioend) && !ioend->io_append_trans) { > > + int error; > > + > > + error = xfs_setfilesize_trans_alloc(ioend); > > I'm not totally convinced this is safe. We previously moved this tp > allocation from before a potential xfs_iomap_direct_write() call to the > completion side to avoid nesting this allocation with unwritten extent > allocation transactions. See the following for reference: > > 437a255a xfs: fix direct IO nested transaction deadlock > > Now we move it after that point of the codepath, and even then we know > that this is an overwrite if we do the allocation here. If we continue > on and hit a hole, it looks like there's still a sequence to allocate > this transaction and call xfs_iomap_write_direct(), nesting the > associated transaction reservations. Am I missing something? No, I didn't really think this part through fully. I knew that we'd get multiple calls, and we'd get multiple allocations, but for some reason the penny didn't drop. What it comes down to is that either we jump through lots of hoops in __xfs_get_blocks() to handle this case (i.e. cancel/allocate/reserve on repeat calls) or we just allocate it in IO completion context as we currently are doing. I'll have a look at what cancel/allocate/reserve looks like - it might actually simplify the logic - and go from there. Thanks for catching my silly thinko, Brain! Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/5] xfs: direct IO needs to use append ioends 2015-04-12 23:31 ` Dave Chinner @ 2015-04-13 11:20 ` Brian Foster 0 siblings, 0 replies; 19+ messages in thread From: Brian Foster @ 2015-04-13 11:20 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs On Mon, Apr 13, 2015 at 09:31:02AM +1000, Dave Chinner wrote: > On Sat, Apr 11, 2015 at 05:15:18PM -0400, Brian Foster wrote: > > On Fri, Apr 10, 2015 at 11:37:57PM +1000, Dave Chinner wrote: > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > Now we have an ioend being passed unconditionally to the direct IO > > > write completion context, we can pass a preallocated transaction > > > handle for on-disk inode size updates that are run in completion. > ..... > > > --- a/fs/xfs/xfs_aops.c > > > +++ b/fs/xfs/xfs_aops.c > > > @@ -178,6 +178,25 @@ xfs_setfilesize_ioend( > > > return xfs_setfilesize(ip, tp, ioend->io_offset, ioend->io_size); > > > } > > > > > > +STATIC void > > > +xfs_setfilesize_ioend_cancel( > > > + struct xfs_ioend *ioend) > > > +{ > > > + struct xfs_trans *tp = ioend->io_append_trans; > > > + > > > + /* > > > + * The transaction may have been allocated in the I/O submission thread, > > > + * thus we need to mark ourselves as being in a transaction manually. > > > + * Similarly for freeze protection. > > > + */ > > > > This threw me off at first because we can call this from either the > > submission or the completion context, unlike the commit case where this > > comment is copied from. Could we move the comment above the function and > > clarify a bit? E.g., something like the following is a bit more clear to > > me: > > > > /* > > * The process transaction and freeze protection state is cleared immediately > > * after setfilesize transaction allocation to support transfer of the tp from > > * submission to completion context. Restore the context appropriately to cancel > > * the transaction. > > */ > > OK, I can do that, but given your next comments, it might just go > away. 
> > > > + } else { > > > + ioend = xfs_alloc_ioend(inode, type); > > > + ioend->io_offset = offset; > > > + ioend->io_size = size; > > > + bh_result->b_private = ioend; > > > + trace_xfs_gbmap_direct_new(XFS_I(inode), offset, size, type, > > > + imap); > > > + } > > > + > > > + /* check if we need an append transaction allocated. */ > > > + if (ioend->io_type == XFS_IO_OVERWRITE && > > > + xfs_ioend_is_append(ioend) && !ioend->io_append_trans) { > > > + int error; > > > + > > > + error = xfs_setfilesize_trans_alloc(ioend); > > > > I'm not totally convinced this is safe. We previously moved this tp > > allocation from before a potential xfs_iomap_direct_write() call to the > > completion side to avoid nesting this allocation with unwritten extent > > allocation transactions. See the following for reference: > > > > 437a255a xfs: fix direct IO nested transaction deadlock > > > > Now we move it after that point of the codepath, and even then we know > > that this is an overwrite if we do the allocation here. If we continue > > on and hit a hole, it looks like there's still a sequence to allocate > > this transaction and call xfs_iomap_write_direct(), nesting the > > associated transaction reservations. Am I missing something? > > No, I didn't really think this part through fully. I knew that we'd > get multiple calls, and we'd get multiple allocations, but for some > reason the penny didn't drop. > > What it comes down to is that either we jump through lots of hoops > in __xfs_get_blocks() to handle this case (i.e. > cancel/allocate/reserve on repeat calls) or we just allocate it in > IO completion context as we currently are doing. > Ok, I figured it would be one of those two approaches. The latter seems more clean to me in just thinking about it since it's another spot we have to consider the direct write case (as opposed to having it factored cleanly into the new helper). 
Maybe it can be better organized than that, splitting up the helper perhaps, so I'll reserve judgement. :) Could we get a v2 of the race fix posted with the proper locking and reviewed-by tags and whatnot? A reply to the patch in this thread is fine if the broader rework is still in flux. I'd just like to have an upstream reference for a backport of that one... Brian > I'll have a look at what cancel/allocate/reserve looks like - it > might actually simplify the logic - and go from there. > > Thanks for catching my silly thinko, Brain! > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 3/5] xfs: DIO write completion size updates race 2015-04-10 13:37 [PATCH 0/5] xfs: fix direct IO completion issues Dave Chinner 2015-04-10 13:37 ` [PATCH 1/5] xfs: DIO requires an ioend for writes Dave Chinner 2015-04-10 13:37 ` [PATCH 2/5] xfs: direct IO needs to use append ioends Dave Chinner @ 2015-04-10 13:37 ` Dave Chinner 2015-04-10 20:22 ` Brian Foster 2015-04-10 13:37 ` [PATCH 4/5] xfs: direct IO EOF zeroing needs to drain AIO Dave Chinner ` (2 subsequent siblings) 5 siblings, 1 reply; 19+ messages in thread From: Dave Chinner @ 2015-04-10 13:37 UTC (permalink / raw) To: xfs From: Dave Chinner <dchinner@redhat.com> xfs_end_io_direct_write() can race with other IO completions when updating the in-core inode size. The IO completion processing is not serialised for direct IO - completions are done either under the IOLOCK_SHARED for non-AIO DIO, or without any IOLOCK held at all during AIO DIO completion. Hence the non-atomic test-and-set update of the in-core inode size is racy and can result in the in-core inode size going backwards if the race is hit just right. If the inode size goes backwards, this can trigger the EOF zeroing code to run incorrectly on the next IO, which then will zero data that has successfully been written to disk by a previous DIO. To fix this bug, we need to serialise the test/set updates of the in-core inode size. This first patch introduces locking around the relevant updates and checks in the DIO path. Because we now have an ioend in xfs_end_io_direct_write(), we know exactly when we are doing an IO that requires an in-core EOF update, and we know that they are not running in interrupt context. As such, we do not need to use irqsave() spinlock variants to protect against interrupts while the lock is held. Hence we can use an existing spinlock in the inode to do this serialisation and so do not need to grow the struct xfs_inode just to work around this problem. 
This patch does not address the test/set EOF update in generic_file_direct_write() for various reasons - that will be done as a followup with separate explanation. Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_aops.c | 17 ++++++++++++----- fs/xfs/xfs_file.c | 13 ++++++++++++- 2 files changed, 24 insertions(+), 6 deletions(-) diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c index 52c7e46..aafd54c 100644 --- a/fs/xfs/xfs_aops.c +++ b/fs/xfs/xfs_aops.c @@ -1616,21 +1616,28 @@ xfs_end_io_direct_write( /* * The ioend tells us whether we are doing unwritten extent conversion * or an append transaction that updates the on-disk file size. These - * cases are the only cases where we should *potentially* be needing - * to update the VFS inode size. When the ioend indicates this, we - * are *guaranteed* to be running in non-interrupt context. + * cases are the only cases where we should *potentially* be needing to + * update the VFS inode size. When the ioend indicates this, we are + * *guaranteed* to be running in non-interrupt context. * * We need to update the in-core inode size here so that we don't end up * with the on-disk inode size being outside the in-core inode size. * While we can do this in the process context after the IO has - * completed, this does not work for AIO and hence we always update - * the in-core inode size here if necessary. + * completed, this does not work for AIO and hence we always update the + * in-core inode size here if necessary. + * + * We need to lock the test/set EOF update as we can be racing with + * other IO completions here to update the EOF. Failing to serialise + * here can result in EOF moving backwards and Bad Things Happen when + * that occurs. 
*/ + spin_lock(&ip->i_flags_lock); if (ioend->io_type == XFS_IO_UNWRITTEN || ioend->io_append_trans) { if (offset + size > i_size_read(inode)) i_size_write(inode, offset + size); } else ASSERT(offset + size <= i_size_read(inode)); + spin_unlock(&ip->i_flags_lock); /* Ugh. No way to propagate errors, so ignore them. */ if (ioend->io_type == XFS_IO_UNWRITTEN) { diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index dc5f609..38ff356 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -569,10 +569,20 @@ restart: * write. If zeroing is needed and we are currently holding the * iolock shared, we need to update it to exclusive which implies * having to redo all checks before. + * + * We need to serialise against EOF updates that occur in IO + * completions here. We want to make sure that nobody is changing the + * size while we do this check until we have placed an IO barrier (i.e. + * hold the XFS_IOLOCK_EXCL) that prevents new IO from being dispatched. + * The spinlock effectively forms a memory barrier once we have the + * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value + * and hence be able to correctly determine if we need to run zeroing. */ + spin_lock(&ip->i_flags_lock); if (*pos > i_size_read(inode)) { bool zero = false; + spin_unlock(&ip->i_flags_lock); if (*iolock == XFS_IOLOCK_SHARED) { xfs_rw_iunlock(ip, *iolock); *iolock = XFS_IOLOCK_EXCL; @@ -582,7 +592,8 @@ restart: error = xfs_zero_eof(ip, *pos, i_size_read(inode), &zero); if (error) return error; - } + } else + spin_unlock(&ip->i_flags_lock); /* * Updating the timestamps will grab the ilock again from -- 2.0.0 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 3/5] xfs: DIO write completion size updates race 2015-04-10 13:37 ` [PATCH 3/5] xfs: DIO write completion size updates race Dave Chinner @ 2015-04-10 20:22 ` Brian Foster 0 siblings, 0 replies; 19+ messages in thread From: Brian Foster @ 2015-04-10 20:22 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs On Fri, Apr 10, 2015 at 11:37:58PM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > xfs_end_io_direct_write() can race with other IO completions when > updating the in-core inode size. The IO completion processing is not > serialised for direct IO - they are done either under the > IOLOCK_SHARED for non-AIO DIO, and without any IOLOCK held at all > during AIO DIO completion. Hence the non-atomic test-and-set update > of the in-core inode size is racy and can result in the in-core > inode size going backwards if the race if hit just right. > > If the inod size goes backwards, this can trigger the EOF zeroing > code to run incorrectly on the next IO, which then will zero data > that has successfully been written to disk by a previous DIO. > > To fix this bug, we need to serialise the test/set updates of the > in-core inode size. This first patch introduces locking around the > relevant updates and checks in the DIO path. Because we now have an > ioend in xfs_end_io_direct_write(), we know exactly then we are > doing an IO that requires an in-core EOF update, and we know that > they are not running in interrupt context. As such, we do not need to > use irqsave() spinlock variants to protect against interrupts while > the lock is held. > > Hence we can use an existing spinlock in the inode to do this > serialisation and so not need to grow the struct xfs_inode just to > work around this problem. > > This patch does not address the test/set EOF update in > generic_file_write_direct() for various reasons - that will be done > as a followup with separate explanation. 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_aops.c | 17 ++++++++++++----- > fs/xfs/xfs_file.c | 13 ++++++++++++- > 2 files changed, 24 insertions(+), 6 deletions(-) > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c > index 52c7e46..aafd54c 100644 > --- a/fs/xfs/xfs_aops.c > +++ b/fs/xfs/xfs_aops.c > @@ -1616,21 +1616,28 @@ xfs_end_io_direct_write( > /* > * The ioend tells us whether we are doing unwritten extent conversion > * or an append transaction that updates the on-disk file size. These > - * cases are the only cases where we should *potentially* be needing > - * to update the VFS inode size. When the ioend indicates this, we > - * are *guaranteed* to be running in non-interrupt context. > + * cases are the only cases where we should *potentially* be needing to > + * update the VFS inode size. When the ioend indicates this, we are > + * *guaranteed* to be running in non-interrupt context. > * > * We need to update the in-core inode size here so that we don't end up > * with the on-disk inode size being outside the in-core inode size. > * While we can do this in the process context after the IO has > - * completed, this does not work for AIO and hence we always update > - * the in-core inode size here if necessary. > + * completed, this does not work for AIO and hence we always update the > + * in-core inode size here if necessary. > + * > + * We need to lock the test/set EOF update as we can be racing with > + * other IO completions here to update the EOF. Failing to serialise > + * here can result in EOF moving backwards and Bad Things Happen when > + * that occurs. 
> */ > + spin_lock(&ip->i_flags_lock); > if (ioend->io_type == XFS_IO_UNWRITTEN || ioend->io_append_trans) { > if (offset + size > i_size_read(inode)) > i_size_write(inode, offset + size); > } else > ASSERT(offset + size <= i_size_read(inode)); > + spin_unlock(&ip->i_flags_lock); Looks good to me once we fix the (known) locking problem above of taking the spinlock before checking the ioend (e.g., having a lock cycle in irq context): Reviewed-by: Brian Foster <bfoster@redhat.com> > > /* Ugh. No way to propagate errors, so ignore them. */ > if (ioend->io_type == XFS_IO_UNWRITTEN) { > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index dc5f609..38ff356 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -569,10 +569,20 @@ restart: > * write. If zeroing is needed and we are currently holding the > * iolock shared, we need to update it to exclusive which implies > * having to redo all checks before. > + * > + * We need to serialise against EOF updates that occur in IO > + * completions here. We want to make sure that nobody is changing the > + * size while we do this check until we have placed an IO barrier (i.e. > + * hold the XFS_IOLOCK_EXCL) that prevents new IO from being dispatched. > + * The spinlock effectively forms a memory barrier once we have the > + * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value > + * and hence be able to correctly determine if we need to run zeroing. 
> */ > + spin_lock(&ip->i_flags_lock); > if (*pos > i_size_read(inode)) { > bool zero = false; > > + spin_unlock(&ip->i_flags_lock); > if (*iolock == XFS_IOLOCK_SHARED) { > xfs_rw_iunlock(ip, *iolock); > *iolock = XFS_IOLOCK_EXCL; > @@ -582,7 +592,8 @@ restart: > error = xfs_zero_eof(ip, *pos, i_size_read(inode), &zero); > if (error) > return error; > - } > + } else > + spin_unlock(&ip->i_flags_lock); > > /* > * Updating the timestamps will grab the ilock again from > -- > 2.0.0 > > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 4/5] xfs: direct IO EOF zeroing needs to drain AIO 2015-04-10 13:37 [PATCH 0/5] xfs: fix direct IO completion issues Dave Chinner ` (2 preceding siblings ...) 2015-04-10 13:37 ` [PATCH 3/5] xfs: DIO write completion size updates race Dave Chinner @ 2015-04-10 13:37 ` Dave Chinner 2015-04-10 20:22 ` Brian Foster 2015-04-10 13:38 ` [PATCH 5/5] xfs: using generic_file_direct_write() is unnecessary Dave Chinner 2015-04-12 15:09 ` [PATCH 0/5] xfs: fix direct IO completion issues Christoph Hellwig 5 siblings, 1 reply; 19+ messages in thread From: Dave Chinner @ 2015-04-10 13:37 UTC (permalink / raw) To: xfs From: Dave Chinner <dchinner@redhat.com> When we are doing AIO DIO writes, the IOLOCK only provides an IO submission barrier. When we need to do EOF zeroing, we need to ensure that no other IO is in progress and all pending in-core EOF updates have been completed. This requires us to wait for all outstanding AIO DIO writes to the inode to complete and, if necessary, run their EOF updates. Once all the EOF updates are complete, we can then restart xfs_file_aio_write_checks() while holding the IOLOCK_EXCL, knowing that EOF is up to date and we have exclusive IO access to the file so we can run EOF block zeroing if we need to without interference. This gives EOF zeroing the same exclusivity against other IO as we provide truncate operations. Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_file.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 38ff356..7b872f4 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -587,6 +587,16 @@ restart: xfs_rw_iunlock(ip, *iolock); *iolock = XFS_IOLOCK_EXCL; xfs_rw_ilock(ip, *iolock); + + /* + * We now have an IO submission barrier in place, but + * AIO can do EOF updates during IO completion and hence + * we now need to wait for all of them to drain. 
Non-AIO + * DIO will have drained before we are given the + * XFS_IOLOCK_EXCL, and so for most cases this wait is a + * no-op. + */ + inode_dio_wait(inode); goto restart; } error = xfs_zero_eof(ip, *pos, i_size_read(inode), &zero); -- 2.0.0 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 4/5] xfs: direct IO EOF zeroing needs to drain AIO 2015-04-10 13:37 ` [PATCH 4/5] xfs: direct IO EOF zeroing needs to drain AIO Dave Chinner @ 2015-04-10 20:22 ` Brian Foster 0 siblings, 0 replies; 19+ messages in thread From: Brian Foster @ 2015-04-10 20:22 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs On Fri, Apr 10, 2015 at 11:37:59PM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > When we are doing AIO DIO writes, the IOLOCK only provides an IO > submission barrier. When we need to do EOF zeroing, we need to ensure > that no other IO is in progress and all pending in-core EOF updates > have been completed. This requires us to wait for all outstanding > AIO DIO writes to the inode to complete and, if necessary, run their > EOF updates. > > Once all the EOF updates are complete, we can then restart > xfs_file_aio_write_checks() while holding the IOLOCK_EXCL, knowing > that EOF is up to date and we have exclusive IO access to the file > so we can run EOF block zeroing if we need to without interference. > This gives EOF zeroing the same exclusivity against other IO as we > provide truncate operations. > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- Looks good... Reviewed-by: Brian Foster <bfoster@redhat.com> > fs/xfs/xfs_file.c | 10 ++++++++++ > 1 file changed, 10 insertions(+) > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index 38ff356..7b872f4 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -587,6 +587,16 @@ restart: > xfs_rw_iunlock(ip, *iolock); > *iolock = XFS_IOLOCK_EXCL; > xfs_rw_ilock(ip, *iolock); > + > + /* > + * We now have an IO submission barrier in place, but > + * AIO can do EOF updates during IO completion and hence > + * we now need to wait for all of them to drain. Non-AIO > + * DIO will have drained before we are given the > + * XFS_IOLOCK_EXCL, and so for most cases this wait is a > + * no-op. 
> + */ > + inode_dio_wait(inode); > goto restart; > } > error = xfs_zero_eof(ip, *pos, i_size_read(inode), &zero); > -- > 2.0.0 > > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 5/5] xfs: using generic_file_direct_write() is unnecessary 2015-04-10 13:37 [PATCH 0/5] xfs: fix direct IO completion issues Dave Chinner ` (3 preceding siblings ...) 2015-04-10 13:37 ` [PATCH 4/5] xfs: direct IO EOF zeroing needs to drain AIO Dave Chinner @ 2015-04-10 13:38 ` Dave Chinner 2015-04-10 20:22 ` Brian Foster 2015-04-12 15:09 ` [PATCH 0/5] xfs: fix direct IO completion issues Christoph Hellwig 5 siblings, 1 reply; 19+ messages in thread From: Dave Chinner @ 2015-04-10 13:38 UTC (permalink / raw) To: xfs From: Dave Chinner <dchinner@redhat.com> generic_file_direct_write() does all sorts of things to make DIO work "sorta ok" with mixed buffered IO workloads. We already do most of this work in xfs_file_dio_aio_write() because of the locking requirements, so there's only a couple of things it does for us. The first thing is that it does a page cache invalidation after the ->direct_IO callout. This can easily be added to the XFS code. The second thing it does is that if data was written, it updates the iov_iter structure to reflect the data written, and then does EOF size updates if necessary. For XFS, these EOF size updates are now not necessary, as we do them safely and race-free in IO completion context. That leaves just the iov_iter update, and that's also easily moved to the XFS code. The result is that we don't need to call generic_file_direct_write(), and hence remove a redundant buffered writeback call and a redundant page cache invalidation call from the DIO submission path. We also remove a racy EOF size update, and make the DIO submission code in XFS much easier to follow. Wins all round, really. 
Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_file.c | 23 ++++++++++++++++++++--- 1 file changed, 20 insertions(+), 3 deletions(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 7b872f4..7182cd2 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -665,6 +665,8 @@ xfs_file_dio_aio_write( int iolock; size_t count = iov_iter_count(from); loff_t pos = iocb->ki_pos; + loff_t end; + struct iov_iter data; struct xfs_buftarg *target = XFS_IS_REALTIME_INODE(ip) ? mp->m_rtdev_targp : mp->m_ddev_targp; @@ -704,10 +706,11 @@ xfs_file_dio_aio_write( if (ret) goto out; iov_iter_truncate(from, count); + end = pos + count - 1; if (mapping->nrpages) { ret = filemap_write_and_wait_range(VFS_I(ip)->i_mapping, - pos, pos + count - 1); + pos, end); if (ret) goto out; /* @@ -717,7 +720,7 @@ xfs_file_dio_aio_write( */ ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping, pos >> PAGE_CACHE_SHIFT, - (pos + count - 1) >> PAGE_CACHE_SHIFT); + end >> PAGE_CACHE_SHIFT); WARN_ON_ONCE(ret); ret = 0; } @@ -734,8 +737,22 @@ xfs_file_dio_aio_write( } trace_xfs_file_direct_write(ip, count, iocb->ki_pos, 0); - ret = generic_file_direct_write(iocb, from, pos); + data = *from; + ret = mapping->a_ops->direct_IO(WRITE, iocb, &data, pos); + + /* see generic_file_direct_write() for why this is necessary */ + if (mapping->nrpages) { + invalidate_inode_pages2_range(mapping, + pos >> PAGE_CACHE_SHIFT, + end >> PAGE_CACHE_SHIFT); + } + + if (ret > 0) { + pos += ret; + iov_iter_advance(from, ret); + iocb->ki_pos = pos; + } out: xfs_rw_iunlock(ip, iolock); -- 2.0.0 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 5/5] xfs: using generic_file_direct_write() is unnecessary
  2015-04-10 13:38 ` [PATCH 5/5] xfs: using generic_file_direct_write() is unnecessary Dave Chinner
@ 2015-04-10 20:22   ` Brian Foster
  0 siblings, 0 replies; 19+ messages in thread
From: Brian Foster @ 2015-04-10 20:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Apr 10, 2015 at 11:38:00PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> generic_file_direct_write() does all sorts of things to make DIO
> work "sorta ok" with mixed buffered IO workloads. We already do
> most of this work in xfs_file_aio_dio_write() because of the locking
> requirements, so there's only a couple of things it does for us.
>
> The first thing is that it does a page cache invalidation after the
> ->direct_IO callout. This can easily be added to the XFS code.
>
> The second thing it does is that if data was written, it updates the
> iov_iter structure to reflect the data written, and then does EOF
> size updates if necessary. For XFS, these EOF size updates are now
> not necessary, as we do them safely and race-free in IO completion
> context. That leaves just the iov_iter update, and that's also easily
> moved to the XFS code.
>
> The result is that we don't need to call
> generic_file_direct_write(), and hence remove a redundant buffered
> writeback call and a redundant page cache invalidation call from the
> DIO submission path. We also remove a racy EOF size update, and make
> the DIO submission code in XFS much easier to follow. Wins all
> round, really.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Seems fine to me:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> fs/xfs/xfs_file.c | 23 ++++++++++++++++++++---
> 1 file changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 7b872f4..7182cd2 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -665,6 +665,8 @@ xfs_file_dio_aio_write(
>  	int			iolock;
>  	size_t			count = iov_iter_count(from);
>  	loff_t			pos = iocb->ki_pos;
> +	loff_t			end;
> +	struct iov_iter		data;
>  	struct xfs_buftarg	*target = XFS_IS_REALTIME_INODE(ip) ?
>  					mp->m_rtdev_targp : mp->m_ddev_targp;
> 
> @@ -704,10 +706,11 @@ xfs_file_dio_aio_write(
>  	if (ret)
>  		goto out;
>  	iov_iter_truncate(from, count);
> +	end = pos + count - 1;
> 
>  	if (mapping->nrpages) {
>  		ret = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
> -						   pos, pos + count - 1);
> +						   pos, end);
>  		if (ret)
>  			goto out;
>  		/*
> @@ -717,7 +720,7 @@ xfs_file_dio_aio_write(
>  		 */
>  		ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
>  					pos >> PAGE_CACHE_SHIFT,
> -					(pos + count - 1) >> PAGE_CACHE_SHIFT);
> +					end >> PAGE_CACHE_SHIFT);
>  		WARN_ON_ONCE(ret);
>  		ret = 0;
>  	}
> @@ -734,8 +737,22 @@ xfs_file_dio_aio_write(
>  	}
> 
>  	trace_xfs_file_direct_write(ip, count, iocb->ki_pos, 0);
> -	ret = generic_file_direct_write(iocb, from, pos);
> 
> +	data = *from;
> +	ret = mapping->a_ops->direct_IO(WRITE, iocb, &data, pos);
> +
> +	/* see generic_file_direct_write() for why this is necessary */
> +	if (mapping->nrpages) {
> +		invalidate_inode_pages2_range(mapping,
> +					pos >> PAGE_CACHE_SHIFT,
> +					end >> PAGE_CACHE_SHIFT);
> +	}
> +
> +	if (ret > 0) {
> +		pos += ret;
> +		iov_iter_advance(from, ret);
> +		iocb->ki_pos = pos;
> +	}
>  out:
>  	xfs_rw_iunlock(ip, iolock);
> 
> -- 
> 2.0.0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [PATCH 0/5] xfs: fix direct IO completion issues
  2015-04-10 13:37 [PATCH 0/5] xfs: fix direct IO completion issues Dave Chinner
    ` (4 preceding siblings ...)
  2015-04-10 13:38 ` [PATCH 5/5] xfs: using generic_file_direct_write() is unnecessary Dave Chinner
@ 2015-04-12 15:09 ` Christoph Hellwig
  2015-04-12 23:22   ` Dave Chinner
  5 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2015-04-12 15:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Apr 10, 2015 at 11:37:55PM +1000, Dave Chinner wrote:
> Hi folks,
>
> This patchset addresses the deeper problems Brian outlined in the
> description of this patch:
>
> http://oss.sgi.com/archives/xfs/2015-04/msg00071.html
>
> The basic issue is that DIO completion can run in interrupt context
> and it does things it should not do in interrupt context because Bad
> Things Will Happen.

Where do we complete DIO writes from irq context? Since my direct-io.c
changes from a few years ago that should not be the case.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [PATCH 0/5] xfs: fix direct IO completion issues
  2015-04-12 15:09 ` [PATCH 0/5] xfs: fix direct IO completion issues Christoph Hellwig
@ 2015-04-12 23:22   ` Dave Chinner
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2015-04-12 23:22 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Sun, Apr 12, 2015 at 08:09:32AM -0700, Christoph Hellwig wrote:
> On Fri, Apr 10, 2015 at 11:37:55PM +1000, Dave Chinner wrote:
> > Hi folks,
> >
> > This patchset addresses the deeper problems Brian outlined in the
> > description of this patch:
> >
> > http://oss.sgi.com/archives/xfs/2015-04/msg00071.html
> >
> > The basic issue is that DIO completion can run in interrupt context
> > and it does things it should not do in interrupt context because Bad
> > Things Will Happen.
>
> Where do we complete DIO writes from irq context? Since my direct-io.c
> changes from a few years ago that should not be the case.

Yes, that's what I thought, too. However, any AIO direct IO write that
does not call set_buffer_defer_completion() will run completion in
interrupt context. The current code in __xfs_get_blocks() sets that
flag only when:

	if (create && ISUNWRITTEN(&imap)) {
		if (direct) {
			bh_result->b_private = inode;
			set_buffer_defer_completion(bh_result);
		}
		set_buffer_unwritten(bh_result);
	}

And hence only writes into unwritten extents will be deferred to the
DIO completion workqueue. Hence sub-block writes that extend EOF (the
trace below), or extending writes into blocks beyond EOF allocated by
delalloc speculative prealloc, will run transactions in irq context to
update the on-disk EOF.
This is the stack trace from testing the simple "use a spinlock around
i_size_write()" patches that pointed out how wrong we'd been:

[  375.648323] run fstests generic/036 at 2015-04-08 08:58:45
[  380.661832] BUG: spinlock cpu recursion on CPU#3, aio-dio-fcntl-r/27068
[  380.662898]  lock: 0xffff8800afe88b70, .magic: dead4ead, .owner: <none>/-1, .owner_cpu: 3
[  380.664232] CPU: 3 PID: 27068 Comm: aio-dio-fcntl-r Not tainted 4.0.0-rc4-dgc+ #870
[  380.665393] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[  380.665769]  ffff8800afe88b70 ffff88013fd83bf8 ffffffff81dccb34 0000000000000000
[  380.665769]  0000000000000000 ffff88013fd83c18 ffffffff81dc6754 ffff8800afe88b70
[  380.665769]  ffffffff821ed962 ffff88013fd83c38 ffffffff81dc677f ffff8800afe88b70
[  380.665769] Call Trace:
[  380.665769]  <IRQ>  [<ffffffff81dccb34>] dump_stack+0x4c/0x65
[  380.665769]  [<ffffffff81dc6754>] spin_dump+0x90/0x95
[  380.665769]  [<ffffffff81dc677f>] spin_bug+0x26/0x2b
[  380.665769]  [<ffffffff810e4058>] do_raw_spin_lock+0x128/0x1a0
[  380.665769]  [<ffffffff81dd7dd5>] _raw_spin_lock+0x15/0x20
[  380.665769]  [<ffffffff814f49cc>] xfs_end_io_direct_write+0x5c/0x110
[  380.665769]  [<ffffffff812085c3>] dio_complete+0xf3/0x160
[  380.665769]  [<ffffffff812086a3>] dio_bio_end_aio+0x73/0x100
[  380.665769]  [<ffffffff810c85e2>] ? default_wake_function+0x12/0x20
[  380.665769]  [<ffffffff817c461b>] bio_endio+0x5b/0xa0
[  380.665769]  [<ffffffff817cc3a0>] blk_update_request+0x90/0x370
[  380.665769]  [<ffffffff817d565a>] blk_mq_end_request+0x1a/0x70
[  380.665769]  [<ffffffff81ad393f>] virtblk_request_done+0x3f/0x70
[  380.665769]  [<ffffffff817d5fce>] __blk_mq_complete_request+0x8e/0x120
[  380.665769]  [<ffffffff817d6076>] blk_mq_complete_request+0x16/0x20
[  380.665769]  [<ffffffff81ad347e>] virtblk_done+0x6e/0xf0
[  380.665769]  [<ffffffff81893af5>] vring_interrupt+0x35/0x60
[  380.665769]  [<ffffffff810f27de>] handle_irq_event_percpu+0x3e/0x1c0
[  380.665769]  [<ffffffff810f29a1>] handle_irq_event+0x41/0x70
[  380.665769]  [<ffffffff810f573f>] handle_edge_irq+0x7f/0x120
[  380.665769]  [<ffffffff8104d2f2>] handle_irq+0x22/0x40
[  380.665769]  [<ffffffff81ddaf31>] do_IRQ+0x51/0xf0
[  380.665769]  [<ffffffff81dd8fad>] common_interrupt+0x6d/0x6d
[  380.665769]  <EOI>  [<ffffffff810e4033>] ? do_raw_spin_lock+0x103/0x1a0
[  380.665769]  [<ffffffff81dd7dd5>] _raw_spin_lock+0x15/0x20
[  380.665769]  [<ffffffff81502f28>] xfs_file_aio_write_checks+0x58/0x130
[  380.665769]  [<ffffffff815030ce>] xfs_file_dio_aio_write+0xce/0x410
[  380.665769]  [<ffffffff811d0f08>] ? __sb_start_write+0x58/0x120
[  380.665769]  [<ffffffff815036fe>] xfs_file_write_iter+0x7e/0x120
[  380.665769]  [<ffffffff81503680>] ? xfs_file_buffered_aio_write+0x270/0x270
[  380.665769]  [<ffffffff81218cd3>] aio_run_iocb+0x203/0x3c0
[  380.665769]  [<ffffffff810c107d>] ? __might_sleep+0x4d/0x90
[  380.665769]  [<ffffffff810c107d>] ? __might_sleep+0x4d/0x90
[  380.665769]  [<ffffffff81219b4f>] do_io_submit+0x19f/0x410
[  380.665769]  [<ffffffff81219dd0>] SyS_io_submit+0x10/0x20
[  380.665769]  [<ffffffff81dd84c9>] system_call_fastpath+0x12/0x17

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 19+ messages in thread
end of thread, other threads:[~2015-04-13 11:21 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-10 13:37 [PATCH 0/5] xfs: fix direct IO completion issues Dave Chinner
2015-04-10 13:37 ` [PATCH 1/5] xfs: DIO requires an ioend for writes Dave Chinner
2015-04-10 20:21   ` Brian Foster
2015-04-10 22:24     ` Dave Chinner
2015-04-10 13:37 ` [PATCH 2/5] xfs: direct IO needs to use append ioends Dave Chinner
2015-04-10 20:21   ` Brian Foster
2015-04-10 22:30     ` Dave Chinner
2015-04-11 21:12       ` Brian Foster
2015-04-11 21:15         ` Brian Foster
2015-04-12 23:31       ` Dave Chinner
2015-04-13 11:20         ` Brian Foster
2015-04-10 13:37 ` [PATCH 3/5] xfs: DIO write completion size updates race Dave Chinner
2015-04-10 20:22   ` Brian Foster
2015-04-10 13:37 ` [PATCH 4/5] xfs: direct IO EOF zeroing needs to drain AIO Dave Chinner
2015-04-10 20:22   ` Brian Foster
2015-04-10 13:38 ` [PATCH 5/5] xfs: using generic_file_direct_write() is unnecessary Dave Chinner
2015-04-10 20:22   ` Brian Foster
2015-04-12 15:09 ` [PATCH 0/5] xfs: fix direct IO completion issues Christoph Hellwig
2015-04-12 23:22   ` Dave Chinner