From: Nick Piggin <npiggin@suse.de>
To: Dave Chinner <david@fromorbit.com>
Cc: Nick Piggin <npiggin@kernel.dk>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Frank Mayhar <fmayhar@google.com>,
	John Stultz <johnstul@us.ibm.com>
Subject: Re: VFS scalability git tree
Date: Wed, 28 Jul 2010 01:09:08 +1000	[thread overview]
Message-ID: <20100727150908.GA3749@amd> (raw)
In-Reply-To: <20100727131810.GO7362@dastard>

On Tue, Jul 27, 2010 at 11:18:10PM +1000, Dave Chinner wrote:
> On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote:
> > On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> > > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > > > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > > > 
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > > > 
> > > > Branch vfs-scale-working
> > > 
> > > With a production build (i.e. no lockdep, no xfs debug), I'll
> > > run the same fs_mark parallel create/unlink workload to show
> > > scalability as I ran here:
> > > 
> > > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
> > 
> > I've made a similar setup, 2s8c machine, but using a 2GB ramdisk instead
> > of a real disk (I don't have easy access to a good disk setup ATM, but
> > I guess we're more interested in code above the block layer anyway).
> > 
> > Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as
> > yours.
> 
> As a personal preference, I don't like testing filesystem performance
> on ramdisks because it hides problems caused by changes in IO
> latency. I'll come back to this later.

Very true, although it's handy when you don't have fast disks
available, and it can trigger different races than real disks tend to.

So I still want to get to the bottom of the slowdown you saw on
vfs-scale.
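
For reference, the ramdisk filesystem was made along these lines (the
mount point is just a placeholder, and I've left the mount options out
since they were simply copied from your config):

  mkfs.xfs -f -d agcount=16 -l size=64m /dev/ram0
  mount /dev/ram0 /mnt/scratch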


> > I found that performance is a little unstable, so I sync and echo 3 >
> > drop_caches between each run.
> 
> Quite possibly because of the smaller log - that will cause more
> frequent pushing on the log tail and hence I/O patterns will vary a
> bit...

Well... I think the test case (or how I'm running it) is simply a
bit unstable. I mean, there are subtle interactions all the way from
the CPU scheduler to the disk, so when I say unstable I'm not
particularly blaming XFS :)
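
The between-run reset itself is nothing more exotic than:

  sync
  echo 3 > /proc/sys/vm/drop_caches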

 
> Also, keep in mind that delayed logging is shiny and new - it has
> increased XFS metadata performance and parallelism by an order of
> magnitude, so we're really seeing a bunch of brand new issues
> that have never been seen before with this functionality.  As such,
> there are still some interactions I haven't got to the bottom of with
> delayed logging - it's stable enough to use and benchmark and won't
> corrupt anything, but there are still some warts we need to
> solve. The difficulty (as always) is in reliably reproducing the bad
> behaviour.

Sure, and I didn't see any corruption; it seems pretty stable, and
scalability is better than the other filesystems. I'll see if I can
put together a better recipe to reproduce the 'livelock'-ish behaviour.

 
> > I then did 10 runs of -n 20000 but with -L 4 (4 iterations) which did
> > start to fill up memory and cause reclaim during the 2nd and subsequent
> > iterations.
> 
> I haven't used this mode, so I can't really comment on the results
> you are seeing.

It's a bit strange. The help text says it should clear the inodes between
iterations (without the -k flag), but it does not seem to.
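
For reference, those runs boil down to something like the following
(the directories are placeholders and the -s/-S options are from
memory, so treat this as a sketch rather than the exact command line):

  # -n 20000 files per iteration, -L 4 iterations; -s 0 for zero-length
  # files and -S0 for no syncing.  No -k, so fs_mark is supposed to
  # unlink everything at the end of each iteration.
  fs_mark -S0 -s 0 -n 20000 -L 4 \
          -d /mnt/scratch/0 -d /mnt/scratch/1 \
          -d /mnt/scratch/2 -d /mnt/scratch/3

It's the 2nd and subsequent iterations that fill memory and push the
system into reclaim.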

 
> > > enabled. ext4 is using default mkfs and mount parameters except for
> > > barrier=0. All numbers are averages of three runs.
> > > 
> > > 	fs_mark rate (thousands of files/second)
> > >            2.6.35-rc5   2.6.35-rc5-scale
> > > threads    xfs   ext4     xfs    ext4
> > >   1         20    39       20     39
> > >   2         35    55       35     57
> > >   4         60    41       57     42
> > >   8         79     9       75      9
> > > 
> > > ext4 is getting IO bound at more than 2 threads, so apart from
> > > pointing out that XFS is 8-9x faster than ext4 at 8 threads, I'm
> > > going to ignore ext4 for the purposes of testing scalability here.
> > > 
> > > For XFS w/ delayed logging, 2.6.35-rc5 is only getting to about 600%
> > > CPU and with Nick's patches it's about 650% (10% higher) for
> > > slightly lower throughput.  So at this class of machine for this
> > > workload, the changes result in a slight reduction in scalability.
> > 
> > I wonder if these results are stable. It's possible that changes in
> > reclaim behaviour are causing my patches to require more IO for a
> > given unit of work?
> 
> More likely that's the result of using a smaller log size because it
> will require more frequent metadata pushes to make space for new
> transactions.

I was just checking whether your numbers are stable (where you saw
some slowdown with the vfs-scale patches), and what the cause could be.
I agree that running on real disks could change the behaviour
significantly.


> > I was seeing XFS 'livelock' in reclaim more with my patches; it
> > could be due to more parallelism now being allowed from the vfs and
> > reclaim.
> >
> > Based on my above numbers, I don't see that rcu-inodes is causing a
> > problem, and in terms of SMP scalability, there is really no way that
> > vanilla is more scalable, so I'm interested to see where this slowdown
> > is coming from.
> 
> As I said initially, ram disks hide IO latency changes resulting
> from increased numbers of IOs or increases in seek distances.  My
> initial guess is that the change in inode reclaim behaviour is causing
> different IO patterns and more seeks under reclaim, because the
> zone-based reclaim is no longer reclaiming inodes in the order
> they are created (i.e. we are not doing sequential inode reclaim any
> more).

Sounds plausible. I'll do more investigations along those lines.

 
> FWIW, I use PCP monitoring graphs to correlate behavioural changes
> across different subsystems because it is far easier to relate
> information visually than it is by looking at raw numbers or traces.
> I think this graph shows the effect of reclaim on performance
> most clearly:
> 
> http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-2.6.35-rc3-context-only-per-xfs-batch6-16x500-xfs.png

I haven't actually used PCP before; it looks interesting.

 
> It's pretty clear that when the inode/dentry cache shrinkers are
> running, sustained create/unlink performance goes right down. From a
> different tab not in the screen shot (the other "test-4" tab), I
> could see CPU usage also goes down and the disk iops go way up
> whenever the create/unlink performance dropped. This same behaviour
> happens with the vfs-scale patchset, so it's not related to lock
> contention - just aggressive reclaim of still-dirty inodes.
> 
> FYI, the patch under test there was the XFS shrinker ignoring 7 out
> of 8 shrinker calls and then, on the 8th call, doing the work of all
> the previous calls, i.e. emulating SHRINK_BATCH = 1024. Interestingly
> enough, that one change reduced the runtime of the 8m inode
> create/unlink load by ~25% (from ~24min to ~18min).

Hmm, interesting. Well, that's naturally configurable with the
shrinker API changes I'm hoping to have merged. I'll plan to push
those ahead of the vfs-scale patches, of course.


> That is by far the largest improvement I've been able to obtain from
> modifying the shrinker code, and it is from those sorts of
> observations that I think that IO being issued from reclaim is
> currently the most significant performance limiting factor for XFS
> in this sort of workload....

How is the xfs inode reclaim tied to linux inode reclaim? Does the
xfs inode not become reclaimable until some time after the linux inode
is reclaimed? Or what?

Do all or most of the xfs inodes require IO before being reclaimed
during this test? I wonder if you could throttle them a bit, or sort
them somehow, so that they tend to be cleaned by writeout first and
reclaim just comes along afterwards to remove the clean ones, the way
pagecache reclaim is (supposed) to work?

