From: Dave Chinner <david@fromorbit.com>
To: Nick Piggin <npiggin@kernel.dk>
Cc: xfs@oss.sgi.com, linux-fsdevel@vger.kernel.org
Subject: Re: VFS scalability git tree
Date: Wed, 28 Jul 2010 22:57:17 +1000	[thread overview]
Message-ID: <20100728125717.GI655@dastard> (raw)
In-Reply-To: <20100727080632.GA4958@amd>

On Tue, Jul 27, 2010 at 06:06:32PM +1000, Nick Piggin wrote:
> On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote:
> > On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> > > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > > > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > > > 
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > > > 
> > > > Branch vfs-scale-working
> > > 
> > > With a production build (i.e. no lockdep, no xfs debug), I'll
> > > run the same fs_mark parallel create/unlink workload to show
> > > scalability as I ran here:
> > > 
> > > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
> > 
> > I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead
> > of a real disk (I don't have easy access to a good disk setup ATM, but
> > I guess we're more interested in code above the block layer anyway).
> > 
> > Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as
> > yours.
> > 
> > I found that performance is a little unstable, so I sync and echo 3 >
> > drop_caches between each run. When it starts reclaiming memory, things
> > get a bit more erratic (and XFS seemed to be almost livelocking for tens
> > of seconds in inode reclaim).
> 
> So about this XFS livelock type thingy. It looks like this, and happens
> periodically while running the above fs_mark benchmark requiring reclaim
> of inodes:
....

> Nothing much happening except 100% system time for seconds at a time
> (length of time varies). This is on a ramdisk, so it isn't waiting
> for IO.
> 
> During this time, lots of things are contending on the lock:
> 
>     60.37%         fs_mark  [kernel.kallsyms]   [k] __write_lock_failed
>      4.30%         kswapd0  [kernel.kallsyms]   [k] __write_lock_failed
>      3.70%         fs_mark  [kernel.kallsyms]   [k] try_wait_for_completion
>      3.59%         fs_mark  [kernel.kallsyms]   [k] _raw_write_lock
>      3.46%         kswapd1  [kernel.kallsyms]   [k] __write_lock_failed
>                    |
>                    --- __write_lock_failed
>                       |
>                       |--99.92%-- xfs_inode_ag_walk
>                       |          xfs_inode_ag_iterator
>                       |          xfs_reclaim_inode_shrink
>                       |          shrink_slab
>                       |          shrink_zone
>                       |          balance_pgdat
>                       |          kswapd
>                       |          kthread
>                       |          kernel_thread_helper
>                        --0.08%-- [...]
> 
>      3.02%         fs_mark  [kernel.kallsyms]   [k] _raw_spin_lock
>      1.82%         fs_mark  [kernel.kallsyms]   [k] _xfs_buf_find
>      1.16%         fs_mark  [kernel.kallsyms]   [k] memcpy
>      0.86%         fs_mark  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
>      0.75%         fs_mark  [kernel.kallsyms]   [k] xfs_log_commit_cil
>                    |
>                    --- xfs_log_commit_cil
>                        _xfs_trans_commit
>                       |
>                       |--60.00%-- xfs_remove
>                       |          xfs_vn_unlink
>                       |          vfs_unlink
>                       |          do_unlinkat
>                       |          sys_unlink
> 
> I'm not sure if there was a long-running read locker in there causing
> all the write lockers to fail, or if they are just running into one
> another.

The longest hold is in the inode cluster writeback
(xfs_iflush_cluster), but if there is no IO then I don't see how
that would be a problem.

I suspect it might be caused by several CPUs all trying to run the
shrinker at the same time, all starting at the same AG, and
therefore lockstepping and getting nothing done because they are
all scanning the same inodes.

Maybe a start AG rotor for xfs_inode_ag_iterator() is needed to
avoid this lockstepping.  I've attached a patch below to do this
- can you give it a try?

> But anyway, I hacked the following patch which seemed to
> improve that behaviour. I haven't run any throughput numbers on it yet,
> but I could if you're interested (and it's not completely broken!)

Batching is certainly something that I have been considering, but
apart from the excessive scanning bug, the per-ag inode tree lookups
have not featured prominently in any profiling I've done, so it
hasn't been a high priority.
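
For reference, by "batching" I mean something roughly like the sketch
below. It is completely untested, the batch size, function name and
prototype are made up, and a real version would need to take a
reference on each inode before dropping the lock, but it shows the
shape: pull a chunk of tagged inodes out of the per-ag radix tree
under a single pass of the pag_ici_lock instead of relocking the tree
for every inode:

#define XFS_LOOKUP_BATCH	32	/* illustrative only */

STATIC int
xfs_inode_ag_walk_batched(
	struct xfs_mount	*mp,
	struct xfs_perag	*pag,
	int			(*execute)(struct xfs_inode *ip,
					   struct xfs_perag *pag, int flags),
	int			flags,
	int			tag,
	int			*nr_to_scan)
{
	xfs_agino_t		first_index = 0;
	int			last_error = 0;
	int			error;

	do {
		struct xfs_inode	*batch[XFS_LOOKUP_BATCH];
		int			nr_found;
		int			i;

		read_lock(&pag->pag_ici_lock);
		nr_found = radix_tree_gang_lookup_tag(&pag->pag_ici_root,
					(void **)batch, first_index,
					XFS_LOOKUP_BATCH, tag);
		if (!nr_found) {
			read_unlock(&pag->pag_ici_lock);
			break;
		}

		/*
		 * Advance the start index past the last inode we found so
		 * the next lookup continues from there. A real version
		 * would also take a reference on each inode here so it
		 * can't be freed once the lock is dropped.
		 */
		first_index = XFS_INO_TO_AGINO(mp,
					batch[nr_found - 1]->i_ino) + 1;
		read_unlock(&pag->pag_ici_lock);

		/* process the batch without holding the tree lock */
		for (i = 0; i < nr_found; i++) {
			error = execute(batch[i], pag, flags);
			if (error)
				last_error = error;
			if (--(*nr_to_scan) <= 0)
				return last_error;
		}
	} while (*nr_to_scan > 0);

	return last_error;
}

That takes the lock round trips from one per inode to one per batch,
which should stop concurrent reclaim callers from hammering the same
rwlock even when they do end up scanning the same AG.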

Your patch looks like it will work fine, but I think it can be made a
lot cleaner. I'll have a closer look at this once I get to the bottom of
the dbench hang you are seeing....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: add an AG iterator start rotor

From: Dave Chinner <dchinner@redhat.com>

To prevent multiple CPUs from executing the inode cache shrinker on the
same AG at the same time, make each shrinker call start at a different AG.
This will mostly stop concurrent shrinker calls from competing for the
serialising pag_ici_lock and lock-stepping reclaim.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_sync.c |   11 ++++++++++-
 fs/xfs/xfs_mount.h          |    1 +
 2 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index dfcbd98..5322105 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -181,11 +181,14 @@ xfs_inode_ag_iterator(
 	struct xfs_perag	*pag;
 	int			error = 0;
 	int			last_error = 0;
+	xfs_agnumber_t		start_ag;
 	xfs_agnumber_t		ag;
 	int			nr;
+	int			looped = 0;
 
 	nr = nr_to_scan ? *nr_to_scan : INT_MAX;
-	ag = 0;
+	start_ag = atomic_inc_return(&mp->m_agiter_rotor) % mp->m_sb.sb_agcount;
+	ag = start_ag;
 	while ((pag = xfs_inode_ag_iter_next_pag(mp, &ag, tag))) {
 		error = xfs_inode_ag_walk(mp, pag, execute, flags, tag,
 						exclusive, &nr);
@@ -197,6 +200,12 @@ xfs_inode_ag_iterator(
 		}
 		if (nr <= 0)
 			break;
+		if (ag >= mp->m_sb.sb_agcount) {
+			looped = 1;
+			ag = 0;
+		}
+		if (ag >= start_ag && looped)
+			break;
 	}
 	if (nr_to_scan)
 		*nr_to_scan = nr;
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 622da21..fae61bb 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -199,6 +199,7 @@ typedef struct xfs_mount {
 	__int64_t		m_update_flags;	/* sb flags we need to update
 						   on the next remount,rw */
 	struct shrinker		m_inode_shrink;	/* inode reclaim shrinker */
+	atomic_t		m_agiter_rotor; /* ag iterator start rotor */
 } xfs_mount_t;
 
 /*
