* XFS slow when project quota exceeded (again)
@ 2013-02-14 13:14 Jan Kara
  2013-02-15  7:16 ` Dave Chinner
  0 siblings, 1 reply; 4+ messages in thread
From: Jan Kara @ 2013-02-14 13:14 UTC (permalink / raw)
  To: xfs; +Cc: dchinner

  Hi,

  this is a follow up on a discussion started here:
http://www.spinics.net/lists/xfs/msg14999.html

To just quickly sum up the issue: 
When a project quota gets exceeded, XFS ends up flushing inodes using
sync_inodes_sb(). I've tested (in 3.8-rc4) that if one writes 200 MB to a
directory with a 100 MB project quota like:
	fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	for (i = 0; i < 50000; i++)
		pwrite(fd, buf, 4096, i*4096);
it takes about 3 s to finish, which is OK. But when there are lots of
inodes cached (I've tried with 10000 inodes cached on the fs), the same
test program runs ~140 s. This is because sync_inodes_sb() iterates over
all inodes in the superblock and waits for IO, and this iteration eats CPU
cycles.
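
For reference, a self-contained version of the snippet above; the includes,
buffer contents and error handling are filled in here as an illustrative
sketch, they were not part of the original test:

#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        char buf[4096];
        int fd, i;

        if (argc < 2)
                return 1;
        memset(buf, 'a', sizeof(buf));
        fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
                return 1;
        /*
         * 50000 x 4 KiB = ~200 MB into a directory with a 100 MB project
         * quota, so roughly the second half of the writes fails and
         * triggers the flush path on every call.
         */
        for (i = 0; i < 50000; i++)
                pwrite(fd, buf, 4096, (off_t)i * 4096);
        close(fd);
        return 0;
}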

One can argue that sync_inodes_sb() should be optimized but it isn't that
easy because it is a data integrity operation and we have to be sure that
even IO which has been submitted before sync_inodes_sb() is called is
finished by the time it returns.
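
To make that concrete, here is a simplified sketch of the two phases inside
sync_inodes_sb() (paraphrased from fs/fs-writeback.c of that era;
queue_and_wait_for_wb_work() is a made-up placeholder for the real work
queueing, wait_sb_inodes() is the real second phase):

void sync_inodes_sb(struct super_block *sb)
{
        /*
         * Phase 1: queue a WB_SYNC_ALL writeback work item covering the
         * whole superblock and wait for the flusher thread to finish
         * submitting it.
         */
        queue_and_wait_for_wb_work(sb, WB_SYNC_ALL);

        /*
         * Phase 2: walk every inode on sb->s_inodes and wait on its dirty
         * pages, so IO submitted before this call is also complete on
         * return.  This walk is the O(cached inodes) CPU cost described
         * above.
         */
        wait_sb_inodes(sb);
}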

So another POV is that it is overkill to call a data integrity operation
in a place which just needs to make sure IO is submitted so that delalloc
uncertainty is reduced. The comment before xfs_flush_inodes() says that
sync_inodes_sb() is used to throttle multiple callers to the rate at which
IO is completing. I wonder - why is that needed? Delalloc blocks are
already allocated at IO submission time, so shouldn't it be enough to wait
for IO submission to happen? And writeback_inodes_sb() does that. The
changelog of the commit changing xfs_flush_inodes() to use sync_inodes_sb()
claims that if writeback_inodes_sb() is used, premature ENOSPC can happen.
But I wonder why that is the case, when writeback_inodes_sb() waits for IO
submission to happen and thus for extent conversion. Isn't sync_inodes_sb()
just papering over some other problem?
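
For reference, the XFS side boils down to something like this (a sketch of
fs/xfs/xfs_super.c in 3.8; details may differ):

void
xfs_flush_inodes(
        struct xfs_mount        *mp)
{
        struct super_block      *sb = mp->m_super;

        if (down_read_trylock(&sb->s_umount)) {
                sync_inodes_sb(sb);     /* submit and wait for completion */
                up_read(&sb->s_umount);
        }
}

and the alternative discussed above would be to only wait for IO submission
(and hence, per the argument above, for delalloc conversion), e.g. something
like:

        writeback_inodes_sb(sb, WB_REASON_FS_FREE_SPACE);

which skips the wait-for-completion pass over all cached inodes.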

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

* Re: XFS slow when project quota exceeded (again)
  2013-02-14 13:14 XFS slow when project quota exceeded (again) Jan Kara
@ 2013-02-15  7:16 ` Dave Chinner
  2013-02-15 10:38   ` Jan Kara
  0 siblings, 1 reply; 4+ messages in thread
From: Dave Chinner @ 2013-02-15  7:16 UTC (permalink / raw)
  To: Jan Kara; +Cc: dchinner, xfs

On Thu, Feb 14, 2013 at 02:14:52PM +0100, Jan Kara wrote:
>   Hi,
> 
>   this is a follow up on a discussion started here:
> http://www.spinics.net/lists/xfs/msg14999.html
> 
> To just quickly sum up the issue: 
> When a project quota gets exceeded, XFS ends up flushing inodes using
> sync_inodes_sb(). I've tested (in 3.8-rc4) that if one writes 200 MB to a
> directory with a 100 MB project quota like:
> 	fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
> 	for (i = 0; i < 50000; i++)
> 		pwrite(fd, buf, 4096, i*4096);
> it takes about 3 s to finish, which is OK. But when there are lots of
> inodes cached (I've tried with 10000 inodes cached on the fs), the same
> test program runs ~140 s.

So, you're testing the overhead of ~25,000 ENOSPC flushes. I could
brush this off and say "stupid application" but I won't....

> This is because sync_inodes_sb() iterates over
> all inodes in the superblock and waits for IO, and this iteration eats CPU
> cycles.

Yup, exactly what I said here:

http://www.spinics.net/lists/xfs/msg15198.html

Iterating inodes takes a lot of CPU.

I think the difference between the old method and the current one is that
we only do one inode cache iteration per write(), not one per
get_blocks() call. Hence we've removed the per-page overhead of
flushing, and now we just have the inode cache iteration overhead.
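
For concreteness, the per-write() retry that triggers the flush looks
roughly like this (an abbreviated sketch of the buffered write path in
xfs_file.c around 3.8; the exact argument list is approximate):

write_retry:
        ret = generic_file_buffered_write(iocb, iovp, nr_segs, pos,
                                          &iocb->ki_pos, count, 0);
        /*
         * On ENOSPC (which is also what an exceeded project quota
         * reports), flush dirty inodes to convert delalloc space and
         * free excess metadata reservations, then retry the write once.
         */
        if (ret == -ENOSPC && !enospc) {
                enospc = 1;
                xfs_flush_inodes(ip->i_mount);  /* the expensive bit */
                goto write_retry;
        }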

The fix to that problem is mentioned here:

http://www.spinics.net/lists/xfs/msg15186.html

Which is to:

	a) throttle speculative allocation as EDQUOT approaches; and
	b) efficiently track speculative preallocation
	   for all the inodes in the given project, and write back
	   and trim those inodes on ENOSPC.

Both of those are still a work in progress. I was hoping that we'd
have a) in 3.9, but that doesn't seem likely now the merge window is
just about upon us....

Essentially, the fix for your problem is b). It will turn a whole
superblock inode iteration into an "inodes with speculative
preallocation" iteration that is project quota aware. It will have
lower per-flush overhead than a superblock scan, but it will still
have measurable overhead.
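
In other words, the end state is conceptually something like the sketch
below; the helper names are made up for illustration, this is not the
actual patch series:

/*
 * ENOSPC flush for an over-quota project: instead of a sync_inodes_sb()
 * walk over every cached inode, visit only the inodes tagged as holding
 * speculative preallocation, skip those belonging to other projects,
 * then write them back and trim the preallocation beyond EOF.
 */
static void example_flush_project_prealloc(struct xfs_mount *mp,
                                           prid_t prid)
{
        struct xfs_inode        *ip;

        for_each_prealloc_tagged_inode(mp, ip) {
                if (inode_project_id(ip) != prid)
                        continue;
                writeback_inode_data(ip);
                trim_post_eof_preallocation(ip);
        }
}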

> One can argue that sync_inodes_sb() should be optimized but it isn't that
> easy because it is a data integrity operation and we have to be sure that
> even IO which has been submitted before sync_inodes_sb() is called is
> finished by the time it returns.

Right, but the IO wait is exactly why I chose it....

> So another POV is that it is overkill to call a data integrity operation
> in a place which just needs to make sure IO is submitted so that delalloc
> uncertainty is reduced.

... and not because of anything to do with how we handle delalloc
metadata reservations.

> The comment before xfs_flush_inodes() says that
> sync_inodes_sb() is used to throttle multiple callers to the rate at which
> IO is completing. I wonder - why is that needed?

It's trying to prevent the filesystem from falling into worst-case IO
patterns at ENOSPC when there are hundreds of threads banging on the
filesystem and essentially locking up the system. i.e. we have to
artificially throttle the IO submission rate from ENOSPC flushing -
we really only need one thread doing the submission work, so we need
to throttle all concurrent callers while we are doing that work.
sync_inodes_sb() does that. And then we need to prevent individual
callers from trying to allocate space too frequently, which we do by
waiting for IO that was submitted in the flush. sync_inodes_sb()
does that, too.

> Delalloc blocks are
> already allocated at IO submission time, so shouldn't it be enough to wait
> for IO submission to happen? And writeback_inodes_sb() does that. The
> changelog of the commit changing xfs_flush_inodes() to use sync_inodes_sb()
> claims that if writeback_inodes_sb() is used, premature ENOSPC can happen.
> But I wonder why that is the case, when writeback_inodes_sb() waits for IO
> submission to happen and thus for extent conversion. Isn't sync_inodes_sb()
> just papering over some other problem?

It was never intended to create or solve EDQUOT issues related to
project quota.  What you are seeing is an unfortunate side effect of
fixing other, more serious architectural problems. We know it's a
problem and we have a fix in the works. Just be patient....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: XFS slow when project quota exceeded (again)
  2013-02-15  7:16 ` Dave Chinner
@ 2013-02-15 10:38   ` Jan Kara
  2013-02-15 13:13     ` Brian Foster
  0 siblings, 1 reply; 4+ messages in thread
From: Jan Kara @ 2013-02-15 10:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: dchinner, Jan Kara, xfs

On Fri 15-02-13 18:16:07, Dave Chinner wrote:
> On Thu, Feb 14, 2013 at 02:14:52PM +0100, Jan Kara wrote:
> >   Hi,
> > 
> >   this is a follow up on a discussion started here:
> > http://www.spinics.net/lists/xfs/msg14999.html
> > 
> > To just quickly sum up the issue: 
> > When a project quota gets exceeded, XFS ends up flushing inodes using
> > sync_inodes_sb(). I've tested (in 3.8-rc4) that if one writes 200 MB to a
> > directory with a 100 MB project quota like:
> > 	fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
> > 	for (i = 0; i < 50000; i++)
> > 		pwrite(fd, buf, 4096, i*4096);
> > it takes about 3 s to finish, which is OK. But when there are lots of
> > inodes cached (I've tried with 10000 inodes cached on the fs), the same
> > test program runs ~140 s.
> 
> So, you're testing the overhead of ~25,000 ENOSPC flushes. I could
> brush this off and say "stupid application" but I won't....
  Yes, stupid... NFS. This is what happens when an NFS client writes to a
directory that is over its project quota. So as much as I agree the
workload is braindamaged, we don't have the option of fixing the client,
and we didn't find a reasonable fix on the NFS server side either. That's
why we ended up with XFS changes.

> > This is because sync_inodes_sb() iterates over
> > all inodes in the superblock and waits for IO, and this iteration eats CPU
> > cycles.
> 
> Yup, exactly what I said here:
> 
> http://www.spinics.net/lists/xfs/msg15198.html
> 
> Iterating inodes takes a lot of CPU.
> 
> I think the difference between the old method and the current one is that
> we only do one inode cache iteration per write(), not one per
> get_blocks() call. Hence we've removed the per-page overhead of
> flushing, and now we just have the inode cache iteration overhead.
> 
> The fix to that problem is mentioned here:
> 
> http://www.spinics.net/lists/xfs/msg15186.html
> 
> Which is to:
> 
> 	a) throttle speculative allocation as EDQUOT approaches; and
> 	b) efficiently track speculative preallocation
> 	   for all the inodes in the given project, and write back
> 	   and trim those inodes on ENOSPC.
> 
> Both of those are still a work in progress. I was hoping that we'd
> have a) in 3.9, but that doesn't seem likely now the merge window is
> just about upon us....
  Yeah, I know someone is working on a better solution. I was mostly
wondering why writeback_inodes_sb() isn't enough - i.e. why you really
need to wait for IO completion. And you explained that below. Thanks!

> It's trying to prevent the filesystem from falling into worst-case IO
> patterns at ENOSPC when there are hundreds of threads banging on the
> filesystem and essentially locking up the system. i.e. we have to
> artificially throttle the IO submission rate from ENOSPC flushing -
> we really only need one thread doing the submission work, so we need
> to throttle all concurrent callers while we are doing that work.
> sync_inodes_sb() does that. And then we need to prevent individual
> callers from trying to allocate space too frequently, which we do by
> waiting for IO that was submitted in the flush. sync_inodes_sb()
> does that, too.
  I see, so the waiting for IO isn't really a correctness thing, but just a
way to slow down processes driving the fs out of space so that they don't
lock up the system. Thanks for the explanation!

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

* Re: XFS slow when project quota exceeded (again)
  2013-02-15 10:38   ` Jan Kara
@ 2013-02-15 13:13     ` Brian Foster
  0 siblings, 0 replies; 4+ messages in thread
From: Brian Foster @ 2013-02-15 13:13 UTC (permalink / raw)
  To: Jan Kara; +Cc: xfs, dchinner

On 02/15/2013 05:38 AM, Jan Kara wrote:
> On Fri 15-02-13 18:16:07, Dave Chinner wrote:
>> On Thu, Feb 14, 2013 at 02:14:52PM +0100, Jan Kara wrote:
...
>> The fix to that problem is mentioned here:
>>
>> http://www.spinics.net/lists/xfs/msg15186.html
>>
>> Which is to:
>>
>> 	a) throttle speculative allocation as EDQUOT approaches; and
>> 	b) efficiently track speculative preallocation
>> 	   for all the inodes in the given project, and write back
>> 	   and trim those inodes on ENOSPC.
>>
>> Both of those are still a work in progress. I was hoping that we'd
>> have a) in 3.9, but that doesn't seem likely now the merge window is
>> just about upon us....
>   Yeah, I know someone is working on a better solution. I was mostly
> wondering why writeback_inodes_sb() isn't enough - i.e. why you really
> need to wait for IO completion. And you explained that below. Thanks!
> 

Sorry, I think the work for a) is pretty close [1]. Since v2 we've had a
bug fix or two in this area, plus the speculative prealloc algorithm update
to handle sparse files, so I have to rebase against those changes and
continue some sanity testing. I'll try to have something available over
the next week or so.

The last bits required for b) (i.e., the pquota ENOSPC trigger) are a bit
further out, but FWIW I have an RFC posted [2] to get some early design
feedback on a solution.

Brian

[1] - v2 - http://oss.sgi.com/archives/xfs/2013-01/msg00006.html
[2] - http://oss.sgi.com/archives/xfs/2012-12/msg00112.html

>> It's trying to prevent the filesystem from falling into worst-case IO
>> patterns at ENOSPC when there are hundreds of threads banging on the
>> filesystem and essentially locking up the system. i.e. we have to
>> artificially throttle the IO submission rate from ENOSPC flushing -
>> we really only need one thread doing the submission work, so we need
>> to throttle all concurrent callers while we are doing that work.
>> sync_inodes_sb() does that. And then we need to prevent individual
>> callers from trying to allocate space too frequently, which we do by
>> waiting for IO that was submitted in the flush. sync_inodes_sb()
>> does that, too.
>   I see, so the waiting for IO isn't really a correctness thing, but just
> a way to slow down processes driving the fs out of space so that they don't
> lock up the system. Thanks for the explanation!
> 
> 								Honza
> 

Thread overview: 4+ messages
2013-02-14 13:14 XFS slow when project quota exceeded (again) Jan Kara
2013-02-15  7:16 ` Dave Chinner
2013-02-15 10:38   ` Jan Kara
2013-02-15 13:13     ` Brian Foster
