From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Nick Piggin <npiggin@kernel.dk>
Cc: Dave Chinner <david@fromorbit.com>,
	Nick Piggin <npiggin@gmail.com>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 00/46] rcu-walk and dcache scaling
Date: Sun, 12 Dec 2010 06:54:07 -0800
Message-ID: <20101212145407.GA1969@linux.vnet.ibm.com>
In-Reply-To: <20101210203257.GI2125@linux.vnet.ibm.com>

On Fri, Dec 10, 2010 at 12:32:57PM -0800, Paul E. McKenney wrote:
> On Wed, Dec 08, 2010 at 06:09:09PM +1100, Nick Piggin wrote:
> > On Wed, Dec 08, 2010 at 03:28:16PM +1100, Dave Chinner wrote:
> > > On Wed, Dec 08, 2010 at 02:32:12PM +1100, Dave Chinner wrote:
> > > > On Wed, Dec 08, 2010 at 12:47:42PM +1100, Nick Piggin wrote:
> > > > > On Wed, Dec 8, 2010 at 8:56 AM, Dave Chinner <david@fromorbit.com> wrote:
> > > > > > On Sat, Nov 27, 2010 at 09:15:58PM +1100, Nick Piggin wrote:
> > > > > >>
> > > > > >> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working
> > > > > >>
> > > > > >> Here is a new set of vfs patches for review, not that there was much interest
> > > > > >> last time they were posted. It is structured like:
> > > > > >>
> > > > > >> * preparation patches
> > > > > >> * introduce new locks to take over dcache_lock, then remove it
> > > > > >> * cleaning up and reworking things for new locks
> > > > > >> * rcu-walk path walking
> > > > > >> * start on some fine grained locking steps
> > > > > >
> > > > > > Stress test doing:
> > > > > >
> > > > > >        single thread 50M inode create
> > > > > >        single thread rm -rf
> > > > > >        2-way 50M inode create
> > > > > >        2-way rm -rf
> > > > > >        4-way 50M inode create
> > > > > >        4-way rm -rf
> > > > > >        8-way 50M inode create
> > > > > >        8-way rm -rf
> > > > > >        8-way 250M inode create
> > > > > >        8-way rm -rf
> > > > > >
> > > > > > Failed about 5 minutes into the "4-way rm -rf" (~3 hours into the test)
> > > > > > with a CPU stuck spinning on here:
> > > > > >
> > > > > > [37372.084012] NMI backtrace for cpu 5
> > > > > > [37372.084012] CPU 5
> > > > > > [37372.084012] Modules linked in:
> > > > > > [37372.084012]
> > > > > > [37372.084012] Pid: 15214, comm: rm Not tainted 2.6.37-rc4-dgc+ #797 /Bochs
> > > > > > [37372.084012] RIP: 0010:[<ffffffff810643c4>]  [<ffffffff810643c4>] __ticket_spin_lock+0x14/0x20
> > > > > > [37372.084012] RSP: 0018:ffff880114643c98  EFLAGS: 00000213
> > > > > > [37372.084012] RAX: 0000000000008801 RBX: ffff8800687be6c0 RCX: ffff8800c4eb2688
> > > > > > [37372.084012] RDX: ffff880114643d38 RSI: ffff8800dfd4ea80 RDI: ffff880114643d14
> > > > > > [37372.084012] RBP: ffff880114643c98 R08: 0000000000000003 R09: 0000000000000000
> > > > > > [37372.084012] R10: 0000000000000000 R11: dead000000200200 R12: ffff880114643d14
> > > > > > [37372.084012] R13: ffff880114643cb8 R14: ffff880114643d38 R15: ffff8800687be71c
> > > > > > [37372.084012] FS:  00007fd6d7c93700(0000) GS:ffff8800dfd40000(0000) knlGS:0000000000000000
> > > > > > [37372.084012] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > > > > [37372.084012] CR2: 0000000000bbd108 CR3: 0000000107146000 CR4: 00000000000006e0
> > > > > > [37372.084012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > > > [37372.084012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > > > > [37372.084012] Process rm (pid: 15214, threadinfo ffff880114642000, task ffff88011b16f890)
> > > > > > [37372.084012] Stack:
> > > > > > [37372.084012]  ffff880114643ca8 ffffffff81ad044e ffff880114643cf8 ffffffff81167ae7
> > > > > > [37372.084012]  0000000000000000 ffff880114643d38 000000000000000e ffff88011901d800
> > > > > > [37372.084012]  ffff8800cdb7cf5c ffff88011901d8e0 0000000000000000 0000000000000000
> > > > > > [37372.084012] Call Trace:
> > > > > > [37372.084012]  [<ffffffff81ad044e>] _raw_spin_lock+0xe/0x20
> > > > > > [37372.084012]  [<ffffffff81167ae7>] shrink_dentry_list+0x47/0x370
> > > > > > [37372.084012]  [<ffffffff81167f5e>] __shrink_dcache_sb+0x14e/0x1e0
> > > > > > [37372.084012]  [<ffffffff81168456>] shrink_dcache_parent+0x276/0x2d0
> > > > > > [37372.084012]  [<ffffffff81ad044e>] ? _raw_spin_lock+0xe/0x20
> > > > > > [37372.084012]  [<ffffffff8115daa2>] dentry_unhash+0x42/0x80
> > > > > > [37372.084012]  [<ffffffff8115db48>] vfs_rmdir+0x68/0x100
> > > > > > [37372.084012]  [<ffffffff8115fd93>] do_rmdir+0x113/0x130
> > > > > > [37372.084012]  [<ffffffff8114f5ad>] ? filp_close+0x5d/0x90
> > > > > > [37372.084012]  [<ffffffff8115fde5>] sys_unlinkat+0x35/0x40
> > > > > > [37372.084012]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b
> > > > > 
> > > > > OK good, with any luck, that's the same bug.
> > > > > 
> > > > > Is this XFS?
> > > > 
> > > > Yes.
> > > > 
> > > > > Is there any concurrent activity happening on the same dentries?
> > > > 
> > > > Not from an application perspective.
> > > > 
> > > > > Ie. are the rm -rf threads running on the same directories,
> > > > 
> > > > No, each thread operating on a different directory.
> > 
> > This is probably fixed by the same patch as the lockdep splat trace.
> > 
> > 
> > > > > or is there any reclaim happening in the background?
> > > > 
> > > > IIRC, kswapd was consuming about 5-10% of a CPU during parallel
> > > > unlink tests. Mainly reclaiming XFS inodes, I think, but there may
> > > > be dentry cache reclaim going on as well.
> > > 
> > > Turns out that the kswapd peaks are upwards of 50% of a CPU for a
> > > few seconds, then idle for 10-15s. Typical perf top output of kswapd
> > > while it is active during unlinks is:
> > > 
> > >              samples  pcnt function                    DSO
> > >              _______ _____ ___________________________ _________________
> > > 
> > >             17168.00 10.2% __call_rcu                  [kernel.kallsyms]
> > >             13223.00  7.8% kmem_cache_free             [kernel.kallsyms]
> > >             12917.00  7.6% down_write                  [kernel.kallsyms]
> > >             12665.00  7.5% xfs_iunlock                 [kernel.kallsyms]
> > >             10493.00  6.2% xfs_reclaim_inode_grab      [kernel.kallsyms]
> > >              9314.00  5.5% __lookup_tag                [kernel.kallsyms]
> > >              9040.00  5.4% radix_tree_delete           [kernel.kallsyms]
> > >              8694.00  5.1% is_bad_inode                [kernel.kallsyms]
> > >              7639.00  4.5% __ticket_spin_lock          [kernel.kallsyms]
> > >              6821.00  4.0% _raw_spin_unlock_irqrestore [kernel.kallsyms]
> > >              5484.00  3.2% __d_drop                    [kernel.kallsyms]
> > >              5114.00  3.0% xfs_reclaim_inode           [kernel.kallsyms]
> > >              4626.00  2.7% __rcu_process_callbacks     [kernel.kallsyms]
> > >              3556.00  2.1% up_write                    [kernel.kallsyms]
> > >              3206.00  1.9% _cond_resched               [kernel.kallsyms]
> > >              3129.00  1.9% xfs_qm_dqdetach             [kernel.kallsyms]
> > >              2327.00  1.4% radix_tree_tag_clear        [kernel.kallsyms]
> > >              2327.00  1.4% call_rcu_sched              [kernel.kallsyms]
> > >              2262.00  1.3% __ticket_spin_unlock        [kernel.kallsyms]
> > >              2215.00  1.3% xfs_ilock                   [kernel.kallsyms]
> > >              2200.00  1.3% radix_tree_gang_lookup_tag  [kernel.kallsyms]
> > >              1982.00  1.2% xfs_reclaim_inodes_ag       [kernel.kallsyms]
> > >              1736.00  1.0% xfs_trans_unlocked_item     [kernel.kallsyms]
> > >              1707.00  1.0% __ticket_spin_trylock       [kernel.kallsyms]
> > >              1688.00  1.0% xfs_perag_get_tag           [kernel.kallsyms]
> > >              1660.00  1.0% flat_send_IPI_mask          [kernel.kallsyms]
> > >              1538.00  0.9% xfs_inode_item_destroy      [kernel.kallsyms]
> > >              1312.00  0.8% __shrink_dcache_sb          [kernel.kallsyms]
> > >               940.00  0.6% xfs_perag_put               [kernel.kallsyms]
> > > 
> > > So there is some dentry cache reclaim going on. 
> > > 
> > > FWIW, it appears there is quite a lot of RCU freeing overhead (~15%
> > > more CPU time) in the work kswapd is doing during these unlinks, too.
> > > I just had a look at kswapd when an 8-way create is running - it's running at
> > > 50-60% of a CPU for seconds at a time. I caught this while it was doing pure
> > > XFS inode cache reclaim (~10s sample, kswapd reclaimed ~1M inodes):
> > > 
> > >              samples  pcnt function                    DSO
> > >              _______ _____ ___________________________ _________________
> > > 
> > >             27171.00  9.0% __call_rcu                  [kernel.kallsyms]
> > >             21491.00  7.1% down_write                  [kernel.kallsyms]
> > >             20916.00  6.9% xfs_reclaim_inode           [kernel.kallsyms]
> > >             20313.00  6.7% radix_tree_delete           [kernel.kallsyms]
> > >             15828.00  5.3% kmem_cache_free             [kernel.kallsyms]
> > >             15819.00  5.2% xfs_idestroy_fork           [kernel.kallsyms]
> > >             14893.00  4.9% is_bad_inode                [kernel.kallsyms]
> > >             14666.00  4.9% _raw_spin_unlock_irqrestore [kernel.kallsyms]
> > >             14191.00  4.7% xfs_reclaim_inode_grab      [kernel.kallsyms]
> > >             14105.00  4.7% xfs_iunlock                 [kernel.kallsyms]
> > >             10916.00  3.6% __ticket_spin_lock          [kernel.kallsyms]
> > >             10125.00  3.4% xfs_iflush_cluster          [kernel.kallsyms]
> > >              8221.00  2.7% xfs_qm_dqdetach             [kernel.kallsyms]
> > >              7639.00  2.5% xfs_trans_unlocked_item     [kernel.kallsyms]
> > >              7028.00  2.3% xfs_synchronize_times       [kernel.kallsyms]
> > >              6974.00  2.3% up_write                    [kernel.kallsyms]
> > >              5870.00  1.9% call_rcu_sched              [kernel.kallsyms]
> > >              5634.00  1.9% _cond_resched               [kernel.kallsyms]
> > > 
> > > Which shows a similar amount of RCU overhead as the unlink above.
> > > And this while it was doing dentry cache reclaim (~10s sample):
> > > 
> > >             35921.00 15.7% __d_drop                      [kernel.kallsyms]
> > >             30056.00 13.1% __ticket_spin_trylock         [kernel.kallsyms]
> > >             29066.00 12.7% __ticket_spin_lock            [kernel.kallsyms]
> > >             19043.00  8.3% __call_rcu                    [kernel.kallsyms]
> > >             10098.00  4.4% iput                          [kernel.kallsyms]
> > >              7013.00  3.1% __shrink_dcache_sb            [kernel.kallsyms]
> > >              6774.00  3.0% __percpu_counter_add          [kernel.kallsyms]
> > >              6708.00  2.9% radix_tree_tag_set            [kernel.kallsyms]
> > >              5362.00  2.3% xfs_inactive                  [kernel.kallsyms]
> > >              5130.00  2.2% __ticket_spin_unlock          [kernel.kallsyms]
> > >              4884.00  2.1% call_rcu_sched                [kernel.kallsyms]
> > >              4621.00  2.0% dentry_lru_del                [kernel.kallsyms]
> > >              3735.00  1.6% bit_waitqueue                 [kernel.kallsyms]
> > >              3727.00  1.6% dentry_iput                   [kernel.kallsyms]
> > >              3473.00  1.5% shrink_icache_memory          [kernel.kallsyms]
> > >              3279.00  1.4% kfree                         [kernel.kallsyms]
> > >              3101.00  1.4% xfs_perag_get                 [kernel.kallsyms]
> > >              2516.00  1.1% kmem_cache_free               [kernel.kallsyms]
> > >              2272.00  1.0% shrink_dentry_list            [kernel.kallsyms]
> > > 
> > > I've never really seen any significant dentry cache reclaim overhead
> > > in profiles of these workloads before, so this was a bit of a
> > > surprise....
> > 
> > call_rcu shouldn't be doing much, except for disabling irqs and linking
> > the object into the list. I have a patch somewhere to reduce the irq
> > disable overhead a bit, but it really shouldn't be doing a lot of work.
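
For context, the pattern in question is the usual RCU deferred-free one: the
object embeds a struct rcu_head, and call_rcu() only disables irqs and links
that head onto the current CPU's callback list; the actual free runs later,
after a grace period. A minimal sketch, with hypothetical struct and cache
names (this is not the actual dcache code):

	#include <linux/kernel.h>
	#include <linux/rcupdate.h>
	#include <linux/slab.h>

	/* Hypothetical RCU-freed object, analogous to a dentry. */
	struct foo {
		int data;
		struct rcu_head rcu;	/* queued on a per-CPU list by call_rcu() */
	};

	static struct kmem_cache *foo_cache;	/* hypothetical cache */

	/* Runs after a grace period, once no CPU can still hold an RCU reference. */
	static void foo_free_rcu(struct rcu_head *head)
	{
		struct foo *f = container_of(head, struct foo, rcu);

		kmem_cache_free(foo_cache, f);
	}

	static void foo_release(struct foo *f)
	{
		/* Cheap on this path: disable irqs, queue the callback, return. */
		call_rcu(&f->rcu, foo_free_rcu);
	}
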
> 
> Could you please enable CONFIG_RCU_TRACE, mount debugfs somewhere, and
> look at rcu/rcudata?  There will be a "ql=" number printed for each
> CPU, and if that number is too large, __call_rcu() does take what it
> considers to be corrective action, which can incur some overhead.
> 
> If this is the problem, then increasing the value of the qhimark module
> parameter might help.
> 
> If this is not the problem, I could make a patch that disables some of
> __call_rcu()'s grace-period acceleration code if you are willing to try
> it out.
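
To make the ql=/qhimark relationship concrete, here is a rough schematic of
the behaviour described above, not the real TREE_RCU source; the struct and
field names are illustrative only. Each CPU keeps a queue of pending
callbacks, and once its length crosses qhimark, the call_rcu() path stops
being a simple enqueue and starts doing grace-period-acceleration work on
behalf of the caller:

	#include <linux/rcupdate.h>

	/* Schematic per-CPU callback queue; field names are illustrative. */
	struct cb_queue {
		struct rcu_head *head;
		struct rcu_head **tail;
		long qlen;		/* what "ql=" in rcu/rcudata reports */
		long qhimark;		/* from the qhimark module parameter */
	};

	static void cb_queue_init(struct cb_queue *q)
	{
		q->head = NULL;
		q->tail = &q->head;
		q->qlen = 0;
	}

	static void enqueue_callback(struct cb_queue *q, struct rcu_head *cb)
	{
		cb->next = NULL;
		*q->tail = cb;
		q->tail = &cb->next;

		if (++q->qlen > q->qhimark) {
			/*
			 * Slow path: the queue is considered too long, so try
			 * to push grace periods along -- this is the extra
			 * work that can show up under __call_rcu in profiles.
			 */
		}
	}
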

Another thing that might help is to reduce the value of CONFIG_RCU_FANOUT
to something like 16.  If this does help, then there is a reasonably
straightforward change I can make to RCU.

							Thanx, Paul

> > Sometimes you find that touching the rcu head field needs to get a
> > cacheline exclusive, so a bit of work gets transferred there....
> > 
> > But it may also be something going a bit wrong in RCU. I blew it up
> > once already, after the files_lock splitup that enabled all CPUs to
> > create and destroy files :)
> 
> I would certainly like the opportunity to fix any bugs that might be
> in RCU...  ;-)
> 
> 							Thanx, Paul

Thread overview: 107+ messages
2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
2010-11-27  9:44 ` [PATCH 02/46] fs: d_validate fixes Nick Piggin
2010-12-08  1:53   ` Dave Chinner
2010-12-08  6:59     ` Nick Piggin
2010-12-09  0:50       ` Dave Chinner
2010-12-09  0:50         ` Dave Chinner
2010-12-09  4:50         ` Nick Piggin
2010-12-09  4:50           ` Nick Piggin
2010-11-27  9:44 ` [PATCH 03/46] kernel: kmem_ptr_validate considered harmful Nick Piggin
2010-11-27  9:44 ` [PATCH 04/46] fs: dcache documentation cleanup Nick Piggin
2010-11-27  9:44 ` [PATCH 05/46] fs: change d_delete semantics Nick Piggin
2010-11-27  9:44 ` [PATCH 06/46] cifs: dont overwrite dentry name in d_revalidate Nick Piggin
2010-11-27  9:44 ` [PATCH 07/46] jfs: " Nick Piggin
2010-11-27  9:44 ` [PATCH 08/46] fs: change d_compare for rcu-walk Nick Piggin
2010-11-27  9:44 ` [PATCH 09/46] fs: change d_hash " Nick Piggin
2010-11-27  9:44 ` [PATCH 10/46] hostfs: simplify locking Nick Piggin
2010-11-27  9:44 ` [PATCH 11/46] fs: dcache scale hash Nick Piggin
2010-12-09  6:09   ` Dave Chinner
2010-12-09  6:28     ` Nick Piggin
2010-12-09  8:17       ` Dave Chinner
2010-12-09 12:53         ` Nick Piggin
2010-12-09 23:42           ` Dave Chinner
2010-12-10  2:35             ` Nick Piggin
2010-12-10  9:01               ` Dave Chinner
2010-12-13  4:48                 ` Nick Piggin
2010-12-13  5:05                 ` Nick Piggin
2010-11-27  9:44 ` [PATCH 12/46] fs: dcache scale lru Nick Piggin
2010-12-09  7:22   ` Dave Chinner
2010-12-09 12:34     ` Nick Piggin
2010-11-27  9:44 ` [PATCH 13/46] fs: dcache scale dentry refcount Nick Piggin
2010-11-27  9:44 ` [PATCH 14/46] fs: dcache scale d_unhashed Nick Piggin
2010-11-27  9:44 ` [PATCH 15/46] fs: dcache scale subdirs Nick Piggin
2010-11-27  9:44 ` [PATCH 16/46] fs: scale inode alias list Nick Piggin
2010-11-27  9:44 ` [PATCH 17/46] fs: Use rename lock and RCU for multi-step operations Nick Piggin
2011-01-18 22:32   ` Yehuda Sadeh Weinraub
2011-01-18 22:42     ` Nick Piggin
2011-01-19 22:27       ` Yehuda Sadeh Weinraub
2011-01-19 22:32         ` Nick Piggin
2011-01-25 22:10           ` Yehuda Sadeh Weinraub
2011-01-27  5:18             ` Nick Piggin
2011-02-07 18:52               ` Jim Schutt
2011-02-07 21:04                 ` Yehuda Sadeh Weinraub
2011-02-07 21:04                   ` Yehuda Sadeh Weinraub
2011-02-07 21:31                   ` Jim Schutt
2011-02-07 21:35                     ` Gregory Farnum
2011-02-07 22:25                   ` Jim Schutt
2011-02-14 17:57               ` Yehuda Sadeh Weinraub
2010-11-27  9:44 ` [PATCH 18/46] fs: increase d_name lock coverage Nick Piggin
2010-11-27  9:44 ` [PATCH 19/46] fs: dcache remove dcache_lock Nick Piggin
2010-11-27  9:44 ` [PATCH 20/46] fs: dcache avoid starvation in dcache multi-step operations Nick Piggin
2010-11-27  9:44 ` [PATCH 21/46] fs: dcache reduce dput locking Nick Piggin
2010-11-27  9:44 ` [PATCH 22/46] fs: dcache reduce locking in d_alloc Nick Piggin
2010-11-27  9:44 ` [PATCH 23/46] fs: dcache reduce dcache_inode_lock Nick Piggin
2010-11-27  9:44 ` [PATCH 24/46] fs: dcache rationalise dget variants Nick Piggin
2010-11-27  9:44 ` [PATCH 25/46] fs: dcache reduce d_parent locking Nick Piggin
2010-11-27  9:44 ` [PATCH 26/46] fs: dcache reduce prune_one_dentry locking Nick Piggin
2010-11-27  9:44 ` [PATCH 27/46] fs: reduce dcache_inode_lock width in lru scanning Nick Piggin
2010-11-27  9:44 ` [PATCH 28/46] fs: use RCU in shrink_dentry_list to reduce lock nesting Nick Piggin
2010-11-27  9:44 ` [PATCH 29/46] fs: consolidate dentry kill sequence Nick Piggin
2010-11-27  9:45 ` [PATCH 30/46] fs: icache RCU free inodes Nick Piggin
2010-11-27  9:45 ` [PATCH 31/46] fs: avoid inode RCU freeing for pseudo fs Nick Piggin
2010-11-27  9:45 ` [PATCH 32/46] kernel: optimise seqlock Nick Piggin
2010-11-27  9:45 ` [PATCH 33/46] fs: rcu-walk for path lookup Nick Piggin
2010-11-27  9:45 ` [PATCH 34/46] fs: fs_struct use seqlock Nick Piggin
2010-11-27  9:45 ` [PATCH 35/46] fs: dcache remove d_mounted Nick Piggin
2010-11-27  9:45 ` [PATCH 36/46] fs: dcache reduce branches in lookup path Nick Piggin
2010-11-27  9:45 ` [PATCH 37/46] fs: cache optimise dentry and inode for rcu-walk Nick Piggin
2010-11-27  9:45 ` [PATCH 38/46] fs: prefetch inode data in dcache lookup Nick Piggin
2010-11-27  9:45 ` [PATCH 39/46] fs: d_revalidate_rcu for rcu-walk Nick Piggin
2010-11-27  9:45 ` [PATCH 40/46] fs: provide rcu-walk aware permission i_ops Nick Piggin
2010-11-27  9:45 ` [PATCH 41/46] fs: provide simple rcu-walk ACL implementation Nick Piggin
2010-11-27  9:45 ` [PATCH 42/46] kernel: add bl_list Nick Piggin
2010-11-27  9:45 ` [PATCH 43/46] bit_spinlock: add required includes Nick Piggin
2010-11-27  9:45 ` [PATCH 44/46] fs: dcache per-bucket dcache hash locking Nick Piggin
2010-11-27  9:45 ` [PATCH 45/46] fs: dcache per-inode inode alias locking Nick Piggin
2010-11-27  9:45 ` [PATCH 46/46] fs: improve scalability of pseudo filesystems Nick Piggin
2010-11-27  9:56 ` [PATCH 01/46] Revert "fs: use RCU read side protection in d_validate" Nick Piggin
2010-12-08  1:16   ` Dave Chinner
2010-12-08  9:38     ` Nick Piggin
2010-12-09  0:44       ` Dave Chinner
2010-12-09  4:38         ` Nick Piggin
2010-12-09  5:16           ` Nick Piggin
2010-11-27 15:04 ` [PATCH 00/46] rcu-walk and dcache scaling Anca Emanuel
2010-11-27 15:04   ` Anca Emanuel
2010-11-28  3:28   ` Nick Piggin
2010-11-28  3:28     ` Nick Piggin
2010-11-28  6:24     ` Sedat Dilek
2010-12-01 18:03 ` David Miller
2010-12-03 16:55   ` Nick Piggin
2010-12-07 11:25 ` Dave Chinner
2010-12-07 15:24   ` Nick Piggin
2010-12-07 15:24     ` Nick Piggin
2010-12-07 15:49     ` Peter Zijlstra
2010-12-07 15:59       ` Nick Piggin
2010-12-07 16:23         ` Peter Zijlstra
2010-12-08  3:28     ` Nick Piggin
2010-12-07 21:56 ` Dave Chinner
2010-12-08  1:47   ` Nick Piggin
2010-12-08  3:32     ` Dave Chinner
2010-12-08  4:28       ` Dave Chinner
2010-12-08  7:09         ` Nick Piggin
2010-12-08  7:09           ` Nick Piggin
2010-12-10 20:32           ` Paul E. McKenney
2010-12-12 14:54             ` Paul E. McKenney [this message]
2010-12-12 14:54               ` Paul E. McKenney
2010-11-27 19:20 Sedat Dilek
2010-11-27 20:53 ` Sedat Dilek
