All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>,
	linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com,
	hch@infradead.org, Jeff Moyer <jmoyer@redhat.com>,
	Al Viro <viro@ZenIV.linux.org.uk>,
	linux-ext4@vger.kernel.org, Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH 1/2] direct-io: Implement generic deferred AIO completions
Date: Tue, 16 Jul 2013 23:00:27 +0200	[thread overview]
Message-ID: <20130716210027.GA9595@quack.suse.cz> (raw)
In-Reply-To: <20130712004421.GE3438@dastard>

On Fri 12-07-13 10:44:21, Dave Chinner wrote:
> On Thu, Jul 11, 2013 at 12:02:18AM +0200, Jan Kara wrote:
> > From: Christoph Hellwig <hch@infradead.org>
> > 
> > Add support to the core direct-io code to defer AIO completions to user
> > context using a workqueue.  This replaces opencoded and less efficient
> > code in XFS and ext4 and will be needed to properly support O_(D)SYNC
> > for AIO.
> 
> I don't see how this is any more efficient than the deferral
> that XFS already does for direct IO completion - it just changes
> what queues IO for completions.
  I didn't change the changelog so it's upto Christoph to tell. But I can
remove the 'less efficient' part from the changelog if you want.

> And on that topic:
> 
> > Currently this creates a per-superblock unbound workqueue for these
> > completions, which is taken from an earlier patch by Jan Kara.  I'm
> > not really convinced about this use and would prefer a "normal" global
> > workqueue with a high concurrency limit, but this needs further discussion.
> 
> Unbound workqueues sacrifice locality for immediate dispatch.
> AIO-DIO performance at the high end is determined by IO submission
> and completion locality - we want submission and completion to occur
> on the same CPU as much as possible so cachelines are not bounced
> aroundi needlessly. An unbound workqueue would seem to be the wrong
> choice on this basis alone.
> 
> Further the use of @max_active = 1 means that it is a global, single
> threaded workqueue. While ext4 might require single threaded
> execution of unwritten extent completion, it would be introducing a
> massive bottleneck into XFS which currently uses the default
> concurrency depth of 256 worker threads per CPU per superblock just
> for unwritten extent conversion.
> 
> Hmmmm. I notice that the next patch then does generic_write_sync()
> is the IO completion handler, too. In XFS that causes log forces and
> will block, so that's yet more concurrency required that is required
> for this workqueue. Doing this in a single threaded workqueue and
> marshalling all sync AIO through it is highly unfriendly to
> concurrent IO completion.
> 
> FWIW, in XFS we queue unwritten extent conversion completions on a
> different workqueue to EOF size update completions because the
> latter are small, fast and rarely require IO or get blocked. The
> data IO completion workqueue for EOF updates has the same
> concurrency and depth as the unwritten extent work queue (i.e. 256
> workers per cpu per superblock). So pushing all of this DIO and EOF
> completion work into a single threaded global workqueue that can
> block in every IO completion doesn't seem like a very smart idea to
> me...
  OK, so you'd rather have the workqueue created like:

alloc_workqueue("dio-sync", WQ_MEM_RECLAIM, 0)

  I can certainly try that. ext4 should handle that fine. And I agree with
your arguments for high end devices. I'm just wondering whether it won't do
some harm to a simple SATA drive. Having more threads trying to do extent
conversion in parallel might cause the drive to seek more.

> > -static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret, bool is_async)
> > +static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret,
> > +		bool is_async)
> >  {
> >  	ssize_t transferred = 0;
> >  
> > @@ -258,19 +262,26 @@ static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret, bool is
> >  	if (ret == 0)
> >  		ret = transferred;
> >  
> > -	if (dio->end_io && dio->result) {
> > -		dio->end_io(dio->iocb, offset, transferred,
> > -			    dio->private, ret, is_async);
> > -	} else {
> > -		inode_dio_done(dio->inode);
> > -		if (is_async)
> > -			aio_complete(dio->iocb, ret, 0);
> > -	}
> > +	if (dio->end_io && dio->result)
> > +		dio->end_io(dio->iocb, offset, transferred, dio->private);
> 
> Ok, so by here we are assuming all filesystem IO completion
> processing is completed.
  Yes.

> > +
> > +	inode_dio_done(dio->inode);
> > +	if (is_async)
> > +		aio_complete(dio->iocb, ret, 0);
> >  
> > +	kmem_cache_free(dio_cache, dio);
> >  	return ret;
> 
> Hmmm. I started wonder if this is valid, because XFS is supposed to
> be doing workqueue based IO completion for appending writes and they
> are supposed to be run in a workqueue.
> 
> But, looking at the XFS code, there's actually a bug in the direct
> IO completion code and we are not defering completion to a workqueue
> like we should be for the append case.  This code in
> xfs_finish_ioend() demonstrates the intent:
> 
>                 if (ioend->io_type == XFS_IO_UNWRITTEN)
>                         queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
>                 else if (ioend->io_append_trans ||
> >>>>>>                   (ioend->io_isdirect && xfs_ioend_is_append(ioend)))
>                         queue_work(mp->m_data_workqueue, &ioend->io_work);
> 
> The problem is that we only ever call xfs_finish_ioend() if is_async
> is true, and that will never be true for direct IO beyond the current
> EOF. That's a bug, and is Bad(tm).
> 
> [ Interestingly, it turns out that dio->is_async is never set for
> writes beyond EOF because of filesystems that update file sizes
> before data IO completion occurs (stale data exposure issues). For
> XFS, that can't happen and so dio->is_async could always be set.... ]
> 
> What this means is that there's a condition for work queue deferral
> of DIO IO completion that this patch doesn't handle. Setting deferral
> only on unwritten extents like this:
> 
> > @@ -1292,8 +1281,10 @@ __xfs_get_blocks(
> >  		if (create || !ISUNWRITTEN(&imap))
> >  			xfs_map_buffer(inode, bh_result, &imap, offset);
> >  		if (create && ISUNWRITTEN(&imap)) {
> > -			if (direct)
> > +			if (direct) {
> >  				bh_result->b_private = inode;
> > +				set_buffer_defer_completion(bh_result);
> > +			}
> >  			set_buffer_unwritten(bh_result);
> >  		}
> >  	}
> 
> misses that case. I suspect Christoph's original patch predated the
> changes to XFS that added transactional file size updates to IO
> completion and so didn't take it into account.
  OK, thanks for catching this. Doing the i_size check in
_xfs_get_blocks() would be somewhat cumbersome I presume so I guess we can
handle that case by adding __blockdev_direct_IO() flag to defer dio
completion to a workqueue. XFS can then set the flag when it sees it needs
and i_size update. OK?

> > @@ -1390,9 +1381,7 @@ xfs_end_io_direct_write(
> >  	struct kiocb		*iocb,
> >  	loff_t			offset,
> >  	ssize_t			size,
> > -	void			*private,
> > -	int			ret,
> > -	bool			is_async)
> > +	void			*private)
> >  {
> >  	struct xfs_ioend	*ioend = iocb->private;
> >  
> > @@ -1414,17 +1403,10 @@ xfs_end_io_direct_write(
> >  
> >  	ioend->io_offset = offset;
> >  	ioend->io_size = size;
> > -	ioend->io_iocb = iocb;
> > -	ioend->io_result = ret;
> >  	if (private && size > 0)
> >  		ioend->io_type = XFS_IO_UNWRITTEN;
> >  
> > -	if (is_async) {
> > -		ioend->io_isasync = 1;
> > -		xfs_finish_ioend(ioend);
> > -	} else {
> > -		xfs_finish_ioend_sync(ioend);
> > -	}
> > +	xfs_finish_ioend_sync(ioend);
> >  }
> 
> As i mentioned, the existing code here is buggy. What we should be
> doing here is:
> 
> 	if (is_async) {
> 		ioend->io_isasync = 1;
> 		xfs_finish_ioend(ioend);
> 	} else if (xfs_ioend_is_append(ioend))) {
> 		xfs_finish_ioend(ioend);
> 	} else {
> 		xfs_finish_ioend_sync(ioend);
> 	}
  Umm, I started to wonder why do you actually need to defer the completion
for appending ioend. Because if DIO isn't async, dio_complete() won't be
called from interrupt context but from __blockdev_direct_IO(). So it seems
you can do everything you need directly there without deferring to a
workqueue. But maybe there's some locking subtlety I'm missing.

> Which means that the new code boils down to:
> 
> 	if (xfs_ioend_is_append(ioend)))
> 		xfs_finish_ioend(ioend);
> 	else
> 		xfs_finish_ioend_sync(ioend);
> 
> And now we see the problem with the only defering unwritten IO -
> the new direct IO code will think IO is completed when all we've
> done is queued it to another workqueue.
  I agree. New direct IO completion handlers have to really finish
everything without deferring to a separate workqueue. ext4 had a same bug
but there I caught it and fixed.

> I'm not sure we can handle this in get_blocks - we don't have the
> context to know how to treat allocation beyond the current EOF.
> Indeed, the current block being mapped might not be beyond EOF, but
> some of the IO might be (e.g. lies across an extent boundary),
> so setting the deferred completion in get_blocks doesn't allow the
> entire IO to be treated the same.
> 
> So I think there's a bit of re-thinking needed to aspects of this
> patch to be done.
  Additional flag to __blockdev_direct_IO() should solve this as I wrote
above (if you really need it).

> > +		 * avoids problems with pseudo filesystems which get initialized
> > +		 * before workqueues can be created
> > +		 */
> > +		if (type->fs_flags & FS_REQUIRES_DEV) {
> > +			s->s_dio_done_wq =
> > +				alloc_workqueue("dio-sync", WQ_UNBOUND, 1);
> 
> The workqueue name needs to be combined with the fs_name so we know
> what filesystem is having problems in sysrq-w output.
  Yes, that's reasonable. However that means we'll have to initialize the
workqueue later. Hum.. I think I will return to the on-demand creation of
the workqueue as I originally had in my patch set.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

WARNING: multiple messages have this Message-ID (diff)
From: Jan Kara <jack@suse.cz>
To: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>,
	xfs@oss.sgi.com, hch@infradead.org,
	Jeff Moyer <jmoyer@redhat.com>, Al Viro <viro@ZenIV.linux.org.uk>,
	linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org,
	Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH 1/2] direct-io: Implement generic deferred AIO completions
Date: Tue, 16 Jul 2013 23:00:27 +0200	[thread overview]
Message-ID: <20130716210027.GA9595@quack.suse.cz> (raw)
In-Reply-To: <20130712004421.GE3438@dastard>

On Fri 12-07-13 10:44:21, Dave Chinner wrote:
> On Thu, Jul 11, 2013 at 12:02:18AM +0200, Jan Kara wrote:
> > From: Christoph Hellwig <hch@infradead.org>
> > 
> > Add support to the core direct-io code to defer AIO completions to user
> > context using a workqueue.  This replaces opencoded and less efficient
> > code in XFS and ext4 and will be needed to properly support O_(D)SYNC
> > for AIO.
> 
> I don't see how this is any more efficient than the deferral
> that XFS already does for direct IO completion - it just changes
> what queues IO for completions.
  I didn't change the changelog so it's upto Christoph to tell. But I can
remove the 'less efficient' part from the changelog if you want.

> And on that topic:
> 
> > Currently this creates a per-superblock unbound workqueue for these
> > completions, which is taken from an earlier patch by Jan Kara.  I'm
> > not really convinced about this use and would prefer a "normal" global
> > workqueue with a high concurrency limit, but this needs further discussion.
> 
> Unbound workqueues sacrifice locality for immediate dispatch.
> AIO-DIO performance at the high end is determined by IO submission
> and completion locality - we want submission and completion to occur
> on the same CPU as much as possible so cachelines are not bounced
> aroundi needlessly. An unbound workqueue would seem to be the wrong
> choice on this basis alone.
> 
> Further the use of @max_active = 1 means that it is a global, single
> threaded workqueue. While ext4 might require single threaded
> execution of unwritten extent completion, it would be introducing a
> massive bottleneck into XFS which currently uses the default
> concurrency depth of 256 worker threads per CPU per superblock just
> for unwritten extent conversion.
> 
> Hmmmm. I notice that the next patch then does generic_write_sync()
> is the IO completion handler, too. In XFS that causes log forces and
> will block, so that's yet more concurrency required that is required
> for this workqueue. Doing this in a single threaded workqueue and
> marshalling all sync AIO through it is highly unfriendly to
> concurrent IO completion.
> 
> FWIW, in XFS we queue unwritten extent conversion completions on a
> different workqueue to EOF size update completions because the
> latter are small, fast and rarely require IO or get blocked. The
> data IO completion workqueue for EOF updates has the same
> concurrency and depth as the unwritten extent work queue (i.e. 256
> workers per cpu per superblock). So pushing all of this DIO and EOF
> completion work into a single threaded global workqueue that can
> block in every IO completion doesn't seem like a very smart idea to
> me...
  OK, so you'd rather have the workqueue created like:

alloc_workqueue("dio-sync", WQ_MEM_RECLAIM, 0)

  I can certainly try that. ext4 should handle that fine. And I agree with
your arguments for high end devices. I'm just wondering whether it won't do
some harm to a simple SATA drive. Having more threads trying to do extent
conversion in parallel might cause the drive to seek more.

> > -static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret, bool is_async)
> > +static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret,
> > +		bool is_async)
> >  {
> >  	ssize_t transferred = 0;
> >  
> > @@ -258,19 +262,26 @@ static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret, bool is
> >  	if (ret == 0)
> >  		ret = transferred;
> >  
> > -	if (dio->end_io && dio->result) {
> > -		dio->end_io(dio->iocb, offset, transferred,
> > -			    dio->private, ret, is_async);
> > -	} else {
> > -		inode_dio_done(dio->inode);
> > -		if (is_async)
> > -			aio_complete(dio->iocb, ret, 0);
> > -	}
> > +	if (dio->end_io && dio->result)
> > +		dio->end_io(dio->iocb, offset, transferred, dio->private);
> 
> Ok, so by here we are assuming all filesystem IO completion
> processing is completed.
  Yes.

> > +
> > +	inode_dio_done(dio->inode);
> > +	if (is_async)
> > +		aio_complete(dio->iocb, ret, 0);
> >  
> > +	kmem_cache_free(dio_cache, dio);
> >  	return ret;
> 
> Hmmm. I started wonder if this is valid, because XFS is supposed to
> be doing workqueue based IO completion for appending writes and they
> are supposed to be run in a workqueue.
> 
> But, looking at the XFS code, there's actually a bug in the direct
> IO completion code and we are not defering completion to a workqueue
> like we should be for the append case.  This code in
> xfs_finish_ioend() demonstrates the intent:
> 
>                 if (ioend->io_type == XFS_IO_UNWRITTEN)
>                         queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
>                 else if (ioend->io_append_trans ||
> >>>>>>                   (ioend->io_isdirect && xfs_ioend_is_append(ioend)))
>                         queue_work(mp->m_data_workqueue, &ioend->io_work);
> 
> The problem is that we only ever call xfs_finish_ioend() if is_async
> is true, and that will never be true for direct IO beyond the current
> EOF. That's a bug, and is Bad(tm).
> 
> [ Interestingly, it turns out that dio->is_async is never set for
> writes beyond EOF because of filesystems that update file sizes
> before data IO completion occurs (stale data exposure issues). For
> XFS, that can't happen and so dio->is_async could always be set.... ]
> 
> What this means is that there's a condition for work queue deferral
> of DIO IO completion that this patch doesn't handle. Setting deferral
> only on unwritten extents like this:
> 
> > @@ -1292,8 +1281,10 @@ __xfs_get_blocks(
> >  		if (create || !ISUNWRITTEN(&imap))
> >  			xfs_map_buffer(inode, bh_result, &imap, offset);
> >  		if (create && ISUNWRITTEN(&imap)) {
> > -			if (direct)
> > +			if (direct) {
> >  				bh_result->b_private = inode;
> > +				set_buffer_defer_completion(bh_result);
> > +			}
> >  			set_buffer_unwritten(bh_result);
> >  		}
> >  	}
> 
> misses that case. I suspect Christoph's original patch predated the
> changes to XFS that added transactional file size updates to IO
> completion and so didn't take it into account.
  OK, thanks for catching this. Doing the i_size check in
_xfs_get_blocks() would be somewhat cumbersome I presume so I guess we can
handle that case by adding __blockdev_direct_IO() flag to defer dio
completion to a workqueue. XFS can then set the flag when it sees it needs
and i_size update. OK?

> > @@ -1390,9 +1381,7 @@ xfs_end_io_direct_write(
> >  	struct kiocb		*iocb,
> >  	loff_t			offset,
> >  	ssize_t			size,
> > -	void			*private,
> > -	int			ret,
> > -	bool			is_async)
> > +	void			*private)
> >  {
> >  	struct xfs_ioend	*ioend = iocb->private;
> >  
> > @@ -1414,17 +1403,10 @@ xfs_end_io_direct_write(
> >  
> >  	ioend->io_offset = offset;
> >  	ioend->io_size = size;
> > -	ioend->io_iocb = iocb;
> > -	ioend->io_result = ret;
> >  	if (private && size > 0)
> >  		ioend->io_type = XFS_IO_UNWRITTEN;
> >  
> > -	if (is_async) {
> > -		ioend->io_isasync = 1;
> > -		xfs_finish_ioend(ioend);
> > -	} else {
> > -		xfs_finish_ioend_sync(ioend);
> > -	}
> > +	xfs_finish_ioend_sync(ioend);
> >  }
> 
> As i mentioned, the existing code here is buggy. What we should be
> doing here is:
> 
> 	if (is_async) {
> 		ioend->io_isasync = 1;
> 		xfs_finish_ioend(ioend);
> 	} else if (xfs_ioend_is_append(ioend))) {
> 		xfs_finish_ioend(ioend);
> 	} else {
> 		xfs_finish_ioend_sync(ioend);
> 	}
  Umm, I started to wonder why do you actually need to defer the completion
for appending ioend. Because if DIO isn't async, dio_complete() won't be
called from interrupt context but from __blockdev_direct_IO(). So it seems
you can do everything you need directly there without deferring to a
workqueue. But maybe there's some locking subtlety I'm missing.

> Which means that the new code boils down to:
> 
> 	if (xfs_ioend_is_append(ioend)))
> 		xfs_finish_ioend(ioend);
> 	else
> 		xfs_finish_ioend_sync(ioend);
> 
> And now we see the problem with the only defering unwritten IO -
> the new direct IO code will think IO is completed when all we've
> done is queued it to another workqueue.
  I agree. New direct IO completion handlers have to really finish
everything without deferring to a separate workqueue. ext4 had a same bug
but there I caught it and fixed.

> I'm not sure we can handle this in get_blocks - we don't have the
> context to know how to treat allocation beyond the current EOF.
> Indeed, the current block being mapped might not be beyond EOF, but
> some of the IO might be (e.g. lies across an extent boundary),
> so setting the deferred completion in get_blocks doesn't allow the
> entire IO to be treated the same.
> 
> So I think there's a bit of re-thinking needed to aspects of this
> patch to be done.
  Additional flag to __blockdev_direct_IO() should solve this as I wrote
above (if you really need it).

> > +		 * avoids problems with pseudo filesystems which get initialized
> > +		 * before workqueues can be created
> > +		 */
> > +		if (type->fs_flags & FS_REQUIRES_DEV) {
> > +			s->s_dio_done_wq =
> > +				alloc_workqueue("dio-sync", WQ_UNBOUND, 1);
> 
> The workqueue name needs to be combined with the fs_name so we know
> what filesystem is having problems in sysrq-w output.
  Yes, that's reasonable. However that means we'll have to initialize the
workqueue later. Hum.. I think I will return to the on-demand creation of
the workqueue as I originally had in my patch set.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

  reply	other threads:[~2013-07-16 21:00 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-07-10 22:02 [PATCH 1/2] direct-io: Implement generic deferred AIO completions Jan Kara
2013-07-10 22:02 ` Jan Kara
2013-07-10 22:02 ` [PATCH 2/2] direct-io: Handle O_(D)SYNC AIO Jan Kara
2013-07-10 22:02   ` Jan Kara
2013-07-12  0:44 ` [PATCH 1/2] direct-io: Implement generic deferred AIO completions Dave Chinner
2013-07-16 21:00   ` Jan Kara [this message]
2013-07-16 21:00     ` Jan Kara
2013-08-12 16:14     ` Jan Kara
2013-08-12 16:14       ` Jan Kara
2013-08-13  0:11       ` Dave Chinner
2013-08-13  0:11         ` Dave Chinner
2013-08-01  8:17   ` Christoph Hellwig
2013-08-14  9:10 [PATCH 0/2 v2] Fix O_SYNC AIO DIO Jan Kara
2013-08-14  9:10 ` [PATCH 1/2] direct-io: Implement generic deferred AIO completions Jan Kara
2013-09-04 13:04 [PATCH 0/2 v3] Fix O_SYNC AIO DIO Jan Kara
2013-09-04 13:04 ` [PATCH 1/2] direct-io: Implement generic deferred AIO completions Jan Kara
2013-09-04 13:04   ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130716210027.GA9595@quack.suse.cz \
    --to=jack@suse.cz \
    --cc=david@fromorbit.com \
    --cc=hch@infradead.org \
    --cc=hch@lst.de \
    --cc=jmoyer@redhat.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=viro@ZenIV.linux.org.uk \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.