* Re: [PATCH 00/46] rcu-walk and dcache scaling
@ 2010-11-27 19:20 Sedat Dilek
  2010-11-27 20:53 ` Sedat Dilek
  0 siblings, 1 reply; 26+ messages in thread
From: Sedat Dilek @ 2010-11-27 19:20 UTC (permalink / raw)
  To: Nick Piggin, LKML

[-- Attachment #1: Type: text/plain, Size: 1310 bytes --]

Hi,

I wanted to give your patchset a try (on top of latest linux-next).

Unfortunately, the build breaks here:

/home/sd/src/linux-2.6/linux-2.6.37-rc3/debian/build/source_i386_none/fs/cifs/inode.c:807:
error: ‘dcache_inode_lock’ undeclared (first use in this function)
/home/sd/src/linux-2.6/linux-2.6.37-rc3/debian/build/source_i386_none/fs/cifs/inode.c:807:
error: (Each undeclared identifier is reported only once
/home/sd/src/linux-2.6/linux-2.6.37-rc3/debian/build/source_i386_none/fs/cifs/inode.c:807:
error: for each function it appears in.)

Attached patch "fs-cifs-inode.c-Fix-error-dcache_inode_lock-undeclared.patch"
should fix it.

Kind Regards,
- Sedat -

Compile-tested-by: me

$ grep cifs build_linux-next_next20101126.dileks.4.log
  LD      fs/cifs/built-in.o
  CC [M]  fs/cifs/cifsfs.o
  CC [M]  fs/cifs/cifssmb.o
  CC [M]  fs/cifs/cifs_debug.o
  CC [M]  fs/cifs/connect.o
  CC [M]  fs/cifs/dir.o
  CC [M]  fs/cifs/file.o
  CC [M]  fs/cifs/inode.o
  CC [M]  fs/cifs/link.o
  CC [M]  fs/cifs/misc.o
  CC [M]  fs/cifs/netmisc.o
  CC [M]  fs/cifs/smbdes.o
  CC [M]  fs/cifs/smbencrypt.o
  CC [M]  fs/cifs/transport.o
  CC [M]  fs/cifs/asn1.o
  CC [M]  fs/cifs/md4.o
  CC [M]  fs/cifs/md5.o
  CC [M]  fs/cifs/cifs_unicode.o
  CC [M]  fs/cifs/nterr.o

[-- Attachment #2: fs-cifs-inode.c-Fix-error-dcache_inode_lock-undeclared.patch --]
[-- Type: plain/text, Size: 612 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-11-27 19:20 [PATCH 00/46] rcu-walk and dcache scaling Sedat Dilek
@ 2010-11-27 20:53 ` Sedat Dilek
  0 siblings, 0 replies; 26+ messages in thread
From: Sedat Dilek @ 2010-11-27 20:53 UTC (permalink / raw)
  To: Nick Piggin, LKML

On Sat, Nov 27, 2010 at 8:20 PM, Sedat Dilek <sedat.dilek@googlemail.com> wrote:
> Hi,
>
> I wanted to give your patchset a try (on top of latest linux-next).
>
> Unfortunately, the build breaks here:
>
> /home/sd/src/linux-2.6/linux-2.6.37-rc3/debian/build/source_i386_none/fs/cifs/inode.c:807:
> error: ‘dcache_inode_lock’ undeclared (first use in this function)
> /home/sd/src/linux-2.6/linux-2.6.37-rc3/debian/build/source_i386_none/fs/cifs/inode.c:807:
> error: (Each undeclared identifier is reported only once
> /home/sd/src/linux-2.6/linux-2.6.37-rc3/debian/build/source_i386_none/fs/cifs/inode.c:807:
> error: for each function it appears in.)
>
> Attached patch "fs-cifs-inode.c-Fix-error-dcache_inode_lock-undeclared.patch"
> should fix it.
>
> Kind Regards,
> - Sedat -
>
> Compile-tested-by: me
>
[ ... ]

Next breakage:

/home/sd/src/linux-2.6/linux-2.6.37-rc3/debian/build/source_i386_none/drivers/staging/pohmelfs/inode.c:834:
error: ‘psb’ undeclared (first use in this function)
/home/sd/src/linux-2.6/linux-2.6.37-rc3/debian/build/source_i386_none/drivers/staging/pohmelfs/inode.c:834:
error: (Each undeclared identifier is reported only once
/home/sd/src/linux-2.6/linux-2.6.37-rc3/debian/build/source_i386_none/drivers/staging/pohmelfs/inode.c:834:
error: for each function it appears in.)

- Sedat -

P.S.: Note to myself: Unset CONFIG_POHMELFS for testing-purposes.

[ debian/config/i386/none/config.686 ]
...
##
## file: drivers/staging/pohmelfs/Kconfig
##
# CONFIG_POHMELFS is not set
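For anyone following along, flipping such an option can be scripted; a minimal sketch (demo file path invented, sed pattern illustrative):

```shell
# Turn a Kconfig option off in a .config-style file by rewriting its line.
printf 'CONFIG_POHMELFS=m\nCONFIG_CIFS=m\n' > /tmp/demo.config
sed -i 's/^CONFIG_POHMELFS=.*/# CONFIG_POHMELFS is not set/' /tmp/demo.config
grep POHMELFS /tmp/demo.config    # -> # CONFIG_POHMELFS is not set
```

In an upstream source tree, `scripts/config --disable POHMELFS` should do the same, if that helper script is present in your tree.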


* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-12-10 20:32           ` Paul E. McKenney
@ 2010-12-12 14:54               ` Paul E. McKenney
  0 siblings, 0 replies; 26+ messages in thread
From: Paul E. McKenney @ 2010-12-12 14:54 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Dave Chinner, Nick Piggin, linux-fsdevel, linux-kernel

On Fri, Dec 10, 2010 at 12:32:57PM -0800, Paul E. McKenney wrote:
> On Wed, Dec 08, 2010 at 06:09:09PM +1100, Nick Piggin wrote:
> > On Wed, Dec 08, 2010 at 03:28:16PM +1100, Dave Chinner wrote:
> > > On Wed, Dec 08, 2010 at 02:32:12PM +1100, Dave Chinner wrote:
> > > > On Wed, Dec 08, 2010 at 12:47:42PM +1100, Nick Piggin wrote:
> > > > > On Wed, Dec 8, 2010 at 8:56 AM, Dave Chinner <david@fromorbit.com> wrote:
> > > > > > On Sat, Nov 27, 2010 at 09:15:58PM +1100, Nick Piggin wrote:
> > > > > >>
> > > > > >> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working
> > > > > >>
> > > > > >> Here is a new set of vfs patches for review, not that there was much interest
> > > > > >> last time they were posted. It is structured like:
> > > > > >>
> > > > > >> * preparation patches
> > > > > >> * introduce new locks to take over dcache_lock, then remove it
> > > > > >> * cleaning up and reworking things for new locks
> > > > > >> * rcu-walk path walking
> > > > > >> * start on some fine grained locking steps
> > > > > >
> > > > > > Stress test doing:
> > > > > >
> > > > > >        single thread 50M inode create
> > > > > >        single thread rm -rf
> > > > > >        2-way 50M inode create
> > > > > >        2-way rm -rf
> > > > > >        4-way 50M inode create
> > > > > >        4-way rm -rf
> > > > > >        8-way 50M inode create
> > > > > >        8-way rm -rf
> > > > > >        8-way 250M inode create
> > > > > >        8-way rm -rf
> > > > > >
> > > > > > Failed about 5 minutes into the "4-way rm -rf" (~3 hours into the test)
> > > > > > with a CPU stuck spinning on here:
> > > > > >
> > > > > > [37372.084012] NMI backtrace for cpu 5
> > > > > > [37372.084012] CPU 5
> > > > > > [37372.084012] Modules linked in:
> > > > > > [37372.084012]
> > > > > > [37372.084012] Pid: 15214, comm: rm Not tainted 2.6.37-rc4-dgc+ #797 /Bochs
> > > > > > [37372.084012] RIP: 0010:[<ffffffff810643c4>]  [<ffffffff810643c4>] __ticket_spin_lock+0x14/0x20
> > > > > > [37372.084012] RSP: 0018:ffff880114643c98  EFLAGS: 00000213
> > > > > > [37372.084012] RAX: 0000000000008801 RBX: ffff8800687be6c0 RCX: ffff8800c4eb2688
> > > > > > [37372.084012] RDX: ffff880114643d38 RSI: ffff8800dfd4ea80 RDI: ffff880114643d14
> > > > > > [37372.084012] RBP: ffff880114643c98 R08: 0000000000000003 R09: 0000000000000000
> > > > > > [37372.084012] R10: 0000000000000000 R11: dead000000200200 R12: ffff880114643d14
> > > > > > [37372.084012] R13: ffff880114643cb8 R14: ffff880114643d38 R15: ffff8800687be71c
> > > > > > [37372.084012] FS:  00007fd6d7c93700(0000) GS:ffff8800dfd40000(0000) knlGS:0000000000000000
> > > > > > [37372.084012] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > > > > [37372.084012] CR2: 0000000000bbd108 CR3: 0000000107146000 CR4: 00000000000006e0
> > > > > > [37372.084012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > > > [37372.084012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > > > > [37372.084012] Process rm (pid: 15214, threadinfo ffff880114642000, task ffff88011b16f890)
> > > > > > [37372.084012] Stack:
> > > > > > [37372.084012]  ffff880114643ca8 ffffffff81ad044e ffff880114643cf8 ffffffff81167ae7
> > > > > > [37372.084012]  0000000000000000 ffff880114643d38 000000000000000e ffff88011901d800
> > > > > > [37372.084012]  ffff8800cdb7cf5c ffff88011901d8e0 0000000000000000 0000000000000000
> > > > > > [37372.084012] Call Trace:
> > > > > > [37372.084012]  [<ffffffff81ad044e>] _raw_spin_lock+0xe/0x20
> > > > > > [37372.084012]  [<ffffffff81167ae7>] shrink_dentry_list+0x47/0x370
> > > > > > [37372.084012]  [<ffffffff81167f5e>] __shrink_dcache_sb+0x14e/0x1e0
> > > > > > [37372.084012]  [<ffffffff81168456>] shrink_dcache_parent+0x276/0x2d0
> > > > > > [37372.084012]  [<ffffffff81ad044e>] ? _raw_spin_lock+0xe/0x20
> > > > > > [37372.084012]  [<ffffffff8115daa2>] dentry_unhash+0x42/0x80
> > > > > > [37372.084012]  [<ffffffff8115db48>] vfs_rmdir+0x68/0x100
> > > > > > [37372.084012]  [<ffffffff8115fd93>] do_rmdir+0x113/0x130
> > > > > > [37372.084012]  [<ffffffff8114f5ad>] ? filp_close+0x5d/0x90
> > > > > > [37372.084012]  [<ffffffff8115fde5>] sys_unlinkat+0x35/0x40
> > > > > > [37372.084012]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b
> > > > > 
> > > > > OK good, with any luck, that's the same bug.
> > > > > 
> > > > > Is this XFS?
> > > > 
> > > > Yes.
> > > > 
> > > > > Is there any concurrent activity happening on the same dentries?
> > > > 
> > > > Not from an application perspective.
> > > > 
> > > > > Ie. are the rm -rf threads running on the same directories,
> > > > 
> > > > No, each thread operating on a different directory.
> > 
> > This is probably fixed by the same patch as the lockdep splat trace.
> > 
> > 
> > > > > or is there any reclaim happening in the background?
> > > > 
> > > > IIRC, kswapd was consuming about 5-10% of a CPU during parallel
> > > > unlink tests. Mainly reclaiming XFS inodes, I think, but there may
> > > > be dentry cache reclaim going as well.
> > > 
> > > Turns out that the kswapd peaks are upwards of 50% of a CPU for a
> > > few seconds, then idle for 10-15s. Typical perf top output of kswapd
> > > while it is active during unlinks is:
> > > 
> > >              samples  pcnt function                    DSO
> > >              _______ _____ ___________________________ _________________
> > > 
> > >             17168.00 10.2% __call_rcu                  [kernel.kallsyms]
> > >             13223.00  7.8% kmem_cache_free             [kernel.kallsyms]
> > >             12917.00  7.6% down_write                  [kernel.kallsyms]
> > >             12665.00  7.5% xfs_iunlock                 [kernel.kallsyms]
> > >             10493.00  6.2% xfs_reclaim_inode_grab      [kernel.kallsyms]
> > >              9314.00  5.5% __lookup_tag                [kernel.kallsyms]
> > >              9040.00  5.4% radix_tree_delete           [kernel.kallsyms]
> > >              8694.00  5.1% is_bad_inode                [kernel.kallsyms]
> > >              7639.00  4.5% __ticket_spin_lock          [kernel.kallsyms]
> > >              6821.00  4.0% _raw_spin_unlock_irqrestore [kernel.kallsyms]
> > >              5484.00  3.2% __d_drop                    [kernel.kallsyms]
> > >              5114.00  3.0% xfs_reclaim_inode           [kernel.kallsyms]
> > >              4626.00  2.7% __rcu_process_callbacks     [kernel.kallsyms]
> > >              3556.00  2.1% up_write                    [kernel.kallsyms]
> > >              3206.00  1.9% _cond_resched               [kernel.kallsyms]
> > >              3129.00  1.9% xfs_qm_dqdetach             [kernel.kallsyms]
> > >              2327.00  1.4% radix_tree_tag_clear        [kernel.kallsyms]
> > >              2327.00  1.4% call_rcu_sched              [kernel.kallsyms]
> > >              2262.00  1.3% __ticket_spin_unlock        [kernel.kallsyms]
> > >              2215.00  1.3% xfs_ilock                   [kernel.kallsyms]
> > >              2200.00  1.3% radix_tree_gang_lookup_tag  [kernel.kallsyms]
> > >              1982.00  1.2% xfs_reclaim_inodes_ag       [kernel.kallsyms]
> > >              1736.00  1.0% xfs_trans_unlocked_item     [kernel.kallsyms]
> > >              1707.00  1.0% __ticket_spin_trylock       [kernel.kallsyms]
> > >              1688.00  1.0% xfs_perag_get_tag           [kernel.kallsyms]
> > >              1660.00  1.0% flat_send_IPI_mask          [kernel.kallsyms]
> > >              1538.00  0.9% xfs_inode_item_destroy      [kernel.kallsyms]
> > >              1312.00  0.8% __shrink_dcache_sb          [kernel.kallsyms]
> > >               940.00  0.6% xfs_perag_put               [kernel.kallsyms]
> > > 
> > > So there is some dentry cache reclaim going on. 
> > > 
> > > FWIW, it appears there is quite a lot of RCU freeing overhead (~15%
> > > more CPU time) in the work kswapd is doing during these unlinks, too.
> > > I just had a look at kswapd when an 8-way create is running - it's running at
> > > 50-60% of a cpu for seconds at a time. I caught this while it was doing pure
> > > XFS inode cache reclaim (~10s sample, kswapd reclaimed ~1M inodes):
> > > 
> > >              samples  pcnt function                    DSO
> > >              _______ _____ ___________________________ _________________
> > > 
> > >             27171.00  9.0% __call_rcu                  [kernel.kallsyms]
> > >             21491.00  7.1% down_write                  [kernel.kallsyms]
> > >             20916.00  6.9% xfs_reclaim_inode           [kernel.kallsyms]
> > >             20313.00  6.7% radix_tree_delete           [kernel.kallsyms]
> > >             15828.00  5.3% kmem_cache_free             [kernel.kallsyms]
> > >             15819.00  5.2% xfs_idestroy_fork           [kernel.kallsyms]
> > >             14893.00  4.9% is_bad_inode                [kernel.kallsyms]
> > >             14666.00  4.9% _raw_spin_unlock_irqrestore [kernel.kallsyms]
> > >             14191.00  4.7% xfs_reclaim_inode_grab      [kernel.kallsyms]
> > >             14105.00  4.7% xfs_iunlock                 [kernel.kallsyms]
> > >             10916.00  3.6% __ticket_spin_lock          [kernel.kallsyms]
> > >             10125.00  3.4% xfs_iflush_cluster          [kernel.kallsyms]
> > >              8221.00  2.7% xfs_qm_dqdetach             [kernel.kallsyms]
> > >              7639.00  2.5% xfs_trans_unlocked_item     [kernel.kallsyms]
> > >              7028.00  2.3% xfs_synchronize_times       [kernel.kallsyms]
> > >              6974.00  2.3% up_write                    [kernel.kallsyms]
> > >              5870.00  1.9% call_rcu_sched              [kernel.kallsyms]
> > >              5634.00  1.9% _cond_resched               [kernel.kallsyms]
> > > 
> > > Which is showing a similar amount of RCU overhead as the unlink as above.
> > > And this while it was doing dentry cache reclaim (~10s sample):
> > > 
> > >             35921.00 15.7% __d_drop                      [kernel.kallsyms]
> > >             30056.00 13.1% __ticket_spin_trylock         [kernel.kallsyms]
> > >             29066.00 12.7% __ticket_spin_lock            [kernel.kallsyms]
> > >             19043.00  8.3% __call_rcu                    [kernel.kallsyms]
> > >             10098.00  4.4% iput                          [kernel.kallsyms]
> > >              7013.00  3.1% __shrink_dcache_sb            [kernel.kallsyms]
> > >              6774.00  3.0% __percpu_counter_add          [kernel.kallsyms]
> > >              6708.00  2.9% radix_tree_tag_set            [kernel.kallsyms]
> > >              5362.00  2.3% xfs_inactive                  [kernel.kallsyms]
> > >              5130.00  2.2% __ticket_spin_unlock          [kernel.kallsyms]
> > >              4884.00  2.1% call_rcu_sched                [kernel.kallsyms]
> > >              4621.00  2.0% dentry_lru_del                [kernel.kallsyms]
> > >              3735.00  1.6% bit_waitqueue                 [kernel.kallsyms]
> > >              3727.00  1.6% dentry_iput                   [kernel.kallsyms]
> > >              3473.00  1.5% shrink_icache_memory          [kernel.kallsyms]
> > >              3279.00  1.4% kfree                         [kernel.kallsyms]
> > >              3101.00  1.4% xfs_perag_get                 [kernel.kallsyms]
> > >              2516.00  1.1% kmem_cache_free               [kernel.kallsyms]
> > >              2272.00  1.0% shrink_dentry_list            [kernel.kallsyms]
> > > 
> > > I've never really seen any significant dentry cache reclaim overhead
> > > in profiles of these workloads before, so this was a bit of a
> > > surprise....
> > 
> > call_rcu shouldn't be doing much, except for disabling irqs and linking
> > the object into the list. I have a patch somewhere to reduce the irq
> > disable overhead a bit, but it really shouldn't be doing a lot of work.
> 
> Could you please enable CONFIG_RCU_TRACE, mount debugfs somewhere, and
> look at rcu/rcudata?  There will be a "ql=" number printed for each
> CPU, and if that number is too large, __call_rcu() does take what it
> considers to be corrective action, which can incur some overhead.
> 
> If this is the problem, then increasing the value of the qhimark module
> parameter might help.
> 
> If this is not the problem, I could make a patch that disables some of
> __call_rcu()'s grace-period acceleration code if you are willing to try
> it out.

Another thing that might help is to reduce the value of CONFIG_RCU_FANOUT
to something like 16.  If this does help, then there is a reasonably
straightforward change I can make to RCU.

							Thanx, Paul

> > Sometimes you find that touching the rcu head field needs to get a
> > cacheline exclusive, so a bit of work gets transferred there....
> > 
> > But it may also be something going a bit wrong in RCU. I blew it up
> > once already, after the files_lock splitup that enabled all CPUs to
> > create and destroy files :)
> 
> I would certainly like the opportunity to fix any bugs that might be
> in RCU...  ;-)
> 
> 							Thanx, Paul



* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-12-08  7:09           ` Nick Piggin
  (?)
@ 2010-12-10 20:32           ` Paul E. McKenney
  2010-12-12 14:54               ` Paul E. McKenney
  -1 siblings, 1 reply; 26+ messages in thread
From: Paul E. McKenney @ 2010-12-10 20:32 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Dave Chinner, Nick Piggin, linux-fsdevel, linux-kernel

On Wed, Dec 08, 2010 at 06:09:09PM +1100, Nick Piggin wrote:
> On Wed, Dec 08, 2010 at 03:28:16PM +1100, Dave Chinner wrote:
> > On Wed, Dec 08, 2010 at 02:32:12PM +1100, Dave Chinner wrote:
> > > On Wed, Dec 08, 2010 at 12:47:42PM +1100, Nick Piggin wrote:
> > > > On Wed, Dec 8, 2010 at 8:56 AM, Dave Chinner <david@fromorbit.com> wrote:
> > > > > On Sat, Nov 27, 2010 at 09:15:58PM +1100, Nick Piggin wrote:
> > > > >>
> > > > >> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working
> > > > >>
> > > > >> Here is a new set of vfs patches for review, not that there was much interest
> > > > >> last time they were posted. It is structured like:
> > > > >>
> > > > >> * preparation patches
> > > > >> * introduce new locks to take over dcache_lock, then remove it
> > > > >> * cleaning up and reworking things for new locks
> > > > >> * rcu-walk path walking
> > > > >> * start on some fine grained locking steps
> > > > >
> > > > > Stress test doing:
> > > > >
> > > > >        single thread 50M inode create
> > > > >        single thread rm -rf
> > > > >        2-way 50M inode create
> > > > >        2-way rm -rf
> > > > >        4-way 50M inode create
> > > > >        4-way rm -rf
> > > > >        8-way 50M inode create
> > > > >        8-way rm -rf
> > > > >        8-way 250M inode create
> > > > >        8-way rm -rf
> > > > >
> > > > > Failed about 5 minutes into the "4-way rm -rf" (~3 hours into the test)
> > > > > with a CPU stuck spinning on here:
> > > > >
> > > > > [37372.084012] NMI backtrace for cpu 5
> > > > > [37372.084012] CPU 5
> > > > > [37372.084012] Modules linked in:
> > > > > [37372.084012]
> > > > > [37372.084012] Pid: 15214, comm: rm Not tainted 2.6.37-rc4-dgc+ #797 /Bochs
> > > > > [37372.084012] RIP: 0010:[<ffffffff810643c4>]  [<ffffffff810643c4>] __ticket_spin_lock+0x14/0x20
> > > > > [37372.084012] RSP: 0018:ffff880114643c98  EFLAGS: 00000213
> > > > > [37372.084012] RAX: 0000000000008801 RBX: ffff8800687be6c0 RCX: ffff8800c4eb2688
> > > > > [37372.084012] RDX: ffff880114643d38 RSI: ffff8800dfd4ea80 RDI: ffff880114643d14
> > > > > [37372.084012] RBP: ffff880114643c98 R08: 0000000000000003 R09: 0000000000000000
> > > > > [37372.084012] R10: 0000000000000000 R11: dead000000200200 R12: ffff880114643d14
> > > > > [37372.084012] R13: ffff880114643cb8 R14: ffff880114643d38 R15: ffff8800687be71c
> > > > > [37372.084012] FS:  00007fd6d7c93700(0000) GS:ffff8800dfd40000(0000) knlGS:0000000000000000
> > > > > [37372.084012] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > > > [37372.084012] CR2: 0000000000bbd108 CR3: 0000000107146000 CR4: 00000000000006e0
> > > > > [37372.084012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > > [37372.084012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > > > [37372.084012] Process rm (pid: 15214, threadinfo ffff880114642000, task ffff88011b16f890)
> > > > > [37372.084012] Stack:
> > > > > [37372.084012]  ffff880114643ca8 ffffffff81ad044e ffff880114643cf8 ffffffff81167ae7
> > > > > [37372.084012]  0000000000000000 ffff880114643d38 000000000000000e ffff88011901d800
> > > > > [37372.084012]  ffff8800cdb7cf5c ffff88011901d8e0 0000000000000000 0000000000000000
> > > > > [37372.084012] Call Trace:
> > > > > [37372.084012]  [<ffffffff81ad044e>] _raw_spin_lock+0xe/0x20
> > > > > [37372.084012]  [<ffffffff81167ae7>] shrink_dentry_list+0x47/0x370
> > > > > [37372.084012]  [<ffffffff81167f5e>] __shrink_dcache_sb+0x14e/0x1e0
> > > > > [37372.084012]  [<ffffffff81168456>] shrink_dcache_parent+0x276/0x2d0
> > > > > [37372.084012]  [<ffffffff81ad044e>] ? _raw_spin_lock+0xe/0x20
> > > > > [37372.084012]  [<ffffffff8115daa2>] dentry_unhash+0x42/0x80
> > > > > [37372.084012]  [<ffffffff8115db48>] vfs_rmdir+0x68/0x100
> > > > > [37372.084012]  [<ffffffff8115fd93>] do_rmdir+0x113/0x130
> > > > > [37372.084012]  [<ffffffff8114f5ad>] ? filp_close+0x5d/0x90
> > > > > [37372.084012]  [<ffffffff8115fde5>] sys_unlinkat+0x35/0x40
> > > > > [37372.084012]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b
> > > > 
> > > > OK good, with any luck, that's the same bug.
> > > > 
> > > > Is this XFS?
> > > 
> > > Yes.
> > > 
> > > > Is there any concurrent activity happening on the same dentries?
> > > 
> > > Not from an application perspective.
> > > 
> > > > Ie. are the rm -rf threads running on the same directories,
> > > 
> > > No, each thread operating on a different directory.
> 
> This is probably fixed by the same patch as the lockdep splat trace.
> 
> 
> > > > or is there any reclaim happening in the background?
> > > 
> > > IIRC, kswapd was consuming about 5-10% of a CPU during parallel
> > > unlink tests. Mainly reclaiming XFS inodes, I think, but there may
> > > be dentry cache reclaim going as well.
> > 
> > Turns out that the kswapd peaks are upwards of 50% of a CPU for a
> > few seconds, then idle for 10-15s. Typical perf top output of kswapd
> > while it is active during unlinks is:
> > 
> >              samples  pcnt function                    DSO
> >              _______ _____ ___________________________ _________________
> > 
> >             17168.00 10.2% __call_rcu                  [kernel.kallsyms]
> >             13223.00  7.8% kmem_cache_free             [kernel.kallsyms]
> >             12917.00  7.6% down_write                  [kernel.kallsyms]
> >             12665.00  7.5% xfs_iunlock                 [kernel.kallsyms]
> >             10493.00  6.2% xfs_reclaim_inode_grab      [kernel.kallsyms]
> >              9314.00  5.5% __lookup_tag                [kernel.kallsyms]
> >              9040.00  5.4% radix_tree_delete           [kernel.kallsyms]
> >              8694.00  5.1% is_bad_inode                [kernel.kallsyms]
> >              7639.00  4.5% __ticket_spin_lock          [kernel.kallsyms]
> >              6821.00  4.0% _raw_spin_unlock_irqrestore [kernel.kallsyms]
> >              5484.00  3.2% __d_drop                    [kernel.kallsyms]
> >              5114.00  3.0% xfs_reclaim_inode           [kernel.kallsyms]
> >              4626.00  2.7% __rcu_process_callbacks     [kernel.kallsyms]
> >              3556.00  2.1% up_write                    [kernel.kallsyms]
> >              3206.00  1.9% _cond_resched               [kernel.kallsyms]
> >              3129.00  1.9% xfs_qm_dqdetach             [kernel.kallsyms]
> >              2327.00  1.4% radix_tree_tag_clear        [kernel.kallsyms]
> >              2327.00  1.4% call_rcu_sched              [kernel.kallsyms]
> >              2262.00  1.3% __ticket_spin_unlock        [kernel.kallsyms]
> >              2215.00  1.3% xfs_ilock                   [kernel.kallsyms]
> >              2200.00  1.3% radix_tree_gang_lookup_tag  [kernel.kallsyms]
> >              1982.00  1.2% xfs_reclaim_inodes_ag       [kernel.kallsyms]
> >              1736.00  1.0% xfs_trans_unlocked_item     [kernel.kallsyms]
> >              1707.00  1.0% __ticket_spin_trylock       [kernel.kallsyms]
> >              1688.00  1.0% xfs_perag_get_tag           [kernel.kallsyms]
> >              1660.00  1.0% flat_send_IPI_mask          [kernel.kallsyms]
> >              1538.00  0.9% xfs_inode_item_destroy      [kernel.kallsyms]
> >              1312.00  0.8% __shrink_dcache_sb          [kernel.kallsyms]
> >               940.00  0.6% xfs_perag_put               [kernel.kallsyms]
> > 
> > So there is some dentry cache reclaim going on. 
> > 
> > FWIW, it appears there is quite a lot of RCU freeing overhead (~15%
> > more CPU time) in the work kswapd is doing during these unlinks, too.
> > I just had a look at kswapd when an 8-way create is running - it's running at
> > 50-60% of a CPU for seconds at a time. I caught this while it was doing pure
> > XFS inode cache reclaim (~10s sample, kswapd reclaimed ~1M inodes):
> > 
> >              samples  pcnt function                    DSO
> >              _______ _____ ___________________________ _________________
> > 
> >             27171.00  9.0% __call_rcu                  [kernel.kallsyms]
> >             21491.00  7.1% down_write                  [kernel.kallsyms]
> >             20916.00  6.9% xfs_reclaim_inode           [kernel.kallsyms]
> >             20313.00  6.7% radix_tree_delete           [kernel.kallsyms]
> >             15828.00  5.3% kmem_cache_free             [kernel.kallsyms]
> >             15819.00  5.2% xfs_idestroy_fork           [kernel.kallsyms]
> >             14893.00  4.9% is_bad_inode                [kernel.kallsyms]
> >             14666.00  4.9% _raw_spin_unlock_irqrestore [kernel.kallsyms]
> >             14191.00  4.7% xfs_reclaim_inode_grab      [kernel.kallsyms]
> >             14105.00  4.7% xfs_iunlock                 [kernel.kallsyms]
> >             10916.00  3.6% __ticket_spin_lock          [kernel.kallsyms]
> >             10125.00  3.4% xfs_iflush_cluster          [kernel.kallsyms]
> >              8221.00  2.7% xfs_qm_dqdetach             [kernel.kallsyms]
> >              7639.00  2.5% xfs_trans_unlocked_item     [kernel.kallsyms]
> >              7028.00  2.3% xfs_synchronize_times       [kernel.kallsyms]
> >              6974.00  2.3% up_write                    [kernel.kallsyms]
> >              5870.00  1.9% call_rcu_sched              [kernel.kallsyms]
> >              5634.00  1.9% _cond_resched               [kernel.kallsyms]
> > 
> > Which is showing a similar amount of RCU overhead as the unlink as above.
> > And this while it was doing dentry cache reclaim (~10s sample):
> > 
> >             35921.00 15.7% __d_drop                      [kernel.kallsyms]
> >             30056.00 13.1% __ticket_spin_trylock         [kernel.kallsyms]
> >             29066.00 12.7% __ticket_spin_lock            [kernel.kallsyms]
> >             19043.00  8.3% __call_rcu                    [kernel.kallsyms]
> >             10098.00  4.4% iput                          [kernel.kallsyms]
> >              7013.00  3.1% __shrink_dcache_sb            [kernel.kallsyms]
> >              6774.00  3.0% __percpu_counter_add          [kernel.kallsyms]
> >              6708.00  2.9% radix_tree_tag_set            [kernel.kallsyms]
> >              5362.00  2.3% xfs_inactive                  [kernel.kallsyms]
> >              5130.00  2.2% __ticket_spin_unlock          [kernel.kallsyms]
> >              4884.00  2.1% call_rcu_sched                [kernel.kallsyms]
> >              4621.00  2.0% dentry_lru_del                [kernel.kallsyms]
> >              3735.00  1.6% bit_waitqueue                 [kernel.kallsyms]
> >              3727.00  1.6% dentry_iput                   [kernel.kallsyms]
> >              3473.00  1.5% shrink_icache_memory          [kernel.kallsyms]
> >              3279.00  1.4% kfree                         [kernel.kallsyms]
> >              3101.00  1.4% xfs_perag_get                 [kernel.kallsyms]
> >              2516.00  1.1% kmem_cache_free               [kernel.kallsyms]
> >              2272.00  1.0% shrink_dentry_list            [kernel.kallsyms]
> > 
> > I've never really seen any significant dentry cache reclaim overhead
> > in profiles of these workloads before, so this was a bit of a
> > surprise....
> 
> call_rcu shouldn't be doing much, except for disabling irqs and linking
> the object into the list. I have a patch somewhere to reduce the irq
> disable overhead a bit, but it really shouldn't be doing a lot of work.

Could you please enable CONFIG_RCU_TRACE, mount debugfs somewhere, and
look at rcu/rcudata?  There will be a "ql=" number printed for each
CPU, and if that number is too large, __call_rcu() does take what it
considers to be corrective action, which can incur some overhead.

If this is the problem, then increasing the value of the qhimark module
parameter might help.

If this is not the problem, I could make a patch that disables some of
__call_rcu()'s grace-period acceleration code if you are willing to try
it out.
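
For reference, the per-CPU "ql=" value can be pulled out of the rcudata lines
with standard tools. The sample line below is only illustrative of the
2.6.37-era field layout, not captured from this machine; on a live box you
would read /sys/kernel/debug/rcu/rcudata after mounting debugfs:

```shell
# Illustrative rcu/rcudata line (field layout assumed from 2.6.37-era
# kernels with CONFIG_RCU_TRACE=y); the real file has one line per CPU.
line="5 c=17829 g=17829 pq=1 pqc=17828 qp=0 dt=10951/1 dn=0 df=1101 of=0 ri=36 ql=81 b=10"

# Extract the callback queue length (ql=) to compare against qhimark.
echo "$line" | grep -o 'ql=[0-9]*' | cut -d= -f2
```

If the extracted value is persistently large relative to qhimark, that is the
point where __call_rcu starts taking its corrective action.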

> Sometimes you find that touching the rcu head field needs to get a
> cacheline exclusive, so a bit of work gets transferred there....
> 
> But it may also be something going a bit wrong in RCU. I blew it up
> once already, after the files_lock splitup that enabled all CPUs to
> create and destroy files :)

I would certainly like the opportunity to fix any bugs that might be
in RCU...  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-12-08  4:28       ` Dave Chinner
@ 2010-12-08  7:09           ` Nick Piggin
  0 siblings, 0 replies; 26+ messages in thread
From: Nick Piggin @ 2010-12-08  7:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, Nick Piggin, linux-fsdevel, linux-kernel

On Wed, Dec 08, 2010 at 03:28:16PM +1100, Dave Chinner wrote:
> On Wed, Dec 08, 2010 at 02:32:12PM +1100, Dave Chinner wrote:
> > On Wed, Dec 08, 2010 at 12:47:42PM +1100, Nick Piggin wrote:
> > > On Wed, Dec 8, 2010 at 8:56 AM, Dave Chinner <david@fromorbit.com> wrote:
> > > > On Sat, Nov 27, 2010 at 09:15:58PM +1100, Nick Piggin wrote:
> > > >>
> > > >> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working
> > > >>
> > > >> Here is a new set of vfs patches for review, not that there was much interest
> > > >> last time they were posted. It is structured like:
> > > >>
> > > >> * preparation patches
> > > >> * introduce new locks to take over dcache_lock, then remove it
> > > >> * cleaning up and reworking things for new locks
> > > >> * rcu-walk path walking
> > > >> * start on some fine grained locking steps
> > > >
> > > > Stress test doing:
> > > >
> > > >        single thread 50M inode create
> > > >        single thread rm -rf
> > > >        2-way 50M inode create
> > > >        2-way rm -rf
> > > >        4-way 50M inode create
> > > >        4-way rm -rf
> > > >        8-way 50M inode create
> > > >        8-way rm -rf
> > > >        8-way 250M inode create
> > > >        8-way rm -rf
> > > >
> > > > Failed about 5 minutes into the "4-way rm -rf" (~3 hours into the test)
> > > > with a CPU stuck spinning on here:
> > > >
> > > > [37372.084012] NMI backtrace for cpu 5
> > > > [37372.084012] CPU 5
> > > > [37372.084012] Modules linked in:
> > > > [37372.084012]
> > > > [37372.084012] Pid: 15214, comm: rm Not tainted 2.6.37-rc4-dgc+ #797 /Bochs
> > > > [37372.084012] RIP: 0010:[<ffffffff810643c4>]  [<ffffffff810643c4>] __ticket_spin_lock+0x14/0x20
> > > > [37372.084012] RSP: 0018:ffff880114643c98  EFLAGS: 00000213
> > > > [37372.084012] RAX: 0000000000008801 RBX: ffff8800687be6c0 RCX: ffff8800c4eb2688
> > > > [37372.084012] RDX: ffff880114643d38 RSI: ffff8800dfd4ea80 RDI: ffff880114643d14
> > > > [37372.084012] RBP: ffff880114643c98 R08: 0000000000000003 R09: 0000000000000000
> > > > [37372.084012] R10: 0000000000000000 R11: dead000000200200 R12: ffff880114643d14
> > > > [37372.084012] R13: ffff880114643cb8 R14: ffff880114643d38 R15: ffff8800687be71c
> > > > [37372.084012] FS:  00007fd6d7c93700(0000) GS:ffff8800dfd40000(0000) knlGS:0000000000000000
> > > > [37372.084012] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > > [37372.084012] CR2: 0000000000bbd108 CR3: 0000000107146000 CR4: 00000000000006e0
> > > > [37372.084012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > [37372.084012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > > [37372.084012] Process rm (pid: 15214, threadinfo ffff880114642000, task ffff88011b16f890)
> > > > [37372.084012] Stack:
> > > > [37372.084012]  ffff880114643ca8 ffffffff81ad044e ffff880114643cf8 ffffffff81167ae7
> > > > [37372.084012]  0000000000000000 ffff880114643d38 000000000000000e ffff88011901d800
> > > > [37372.084012]  ffff8800cdb7cf5c ffff88011901d8e0 0000000000000000 0000000000000000
> > > > [37372.084012] Call Trace:
> > > > [37372.084012]  [<ffffffff81ad044e>] _raw_spin_lock+0xe/0x20
> > > > [37372.084012]  [<ffffffff81167ae7>] shrink_dentry_list+0x47/0x370
> > > > [37372.084012]  [<ffffffff81167f5e>] __shrink_dcache_sb+0x14e/0x1e0
> > > > [37372.084012]  [<ffffffff81168456>] shrink_dcache_parent+0x276/0x2d0
> > > > [37372.084012]  [<ffffffff81ad044e>] ? _raw_spin_lock+0xe/0x20
> > > > [37372.084012]  [<ffffffff8115daa2>] dentry_unhash+0x42/0x80
> > > > [37372.084012]  [<ffffffff8115db48>] vfs_rmdir+0x68/0x100
> > > > [37372.084012]  [<ffffffff8115fd93>] do_rmdir+0x113/0x130
> > > > [37372.084012]  [<ffffffff8114f5ad>] ? filp_close+0x5d/0x90
> > > > [37372.084012]  [<ffffffff8115fde5>] sys_unlinkat+0x35/0x40
> > > > [37372.084012]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b
> > > 
> > > OK good, with any luck, that's the same bug.
> > > 
> > > Is this XFS?
> > 
> > Yes.
> > 
> > > Is there any concurrent activity happening on the same dentries?
> > 
> > Not from an application perspective.
> > 
> > > Ie. are the rm -rf threads running on the same directories,
> > 
> > No, each thread operating on a different directory.

This is probably fixed by the same patch as the lockdep splat trace.


> > > or is there any reclaim happening in the background?
> > 
> > IIRC, kswapd was consuming about 5-10% of a CPU during parallel
> > unlink tests. Mainly reclaiming XFS inodes, I think, but there may
> > be dentry cache reclaim going as well.
> 
> Turns out that the kswapd peaks are upwards of 50% of a CPU for a
> few seconds, then idle for 10-15s. Typical perf top output of kswapd
> while it is active during unlinks is:
> 
>              samples  pcnt function                    DSO
>              _______ _____ ___________________________ _________________
> 
>             17168.00 10.2% __call_rcu                  [kernel.kallsyms]
>             13223.00  7.8% kmem_cache_free             [kernel.kallsyms]
>             12917.00  7.6% down_write                  [kernel.kallsyms]
>             12665.00  7.5% xfs_iunlock                 [kernel.kallsyms]
>             10493.00  6.2% xfs_reclaim_inode_grab      [kernel.kallsyms]
>              9314.00  5.5% __lookup_tag                [kernel.kallsyms]
>              9040.00  5.4% radix_tree_delete           [kernel.kallsyms]
>              8694.00  5.1% is_bad_inode                [kernel.kallsyms]
>              7639.00  4.5% __ticket_spin_lock          [kernel.kallsyms]
>              6821.00  4.0% _raw_spin_unlock_irqrestore [kernel.kallsyms]
>              5484.00  3.2% __d_drop                    [kernel.kallsyms]
>              5114.00  3.0% xfs_reclaim_inode           [kernel.kallsyms]
>              4626.00  2.7% __rcu_process_callbacks     [kernel.kallsyms]
>              3556.00  2.1% up_write                    [kernel.kallsyms]
>              3206.00  1.9% _cond_resched               [kernel.kallsyms]
>              3129.00  1.9% xfs_qm_dqdetach             [kernel.kallsyms]
>              2327.00  1.4% radix_tree_tag_clear        [kernel.kallsyms]
>              2327.00  1.4% call_rcu_sched              [kernel.kallsyms]
>              2262.00  1.3% __ticket_spin_unlock        [kernel.kallsyms]
>              2215.00  1.3% xfs_ilock                   [kernel.kallsyms]
>              2200.00  1.3% radix_tree_gang_lookup_tag  [kernel.kallsyms]
>              1982.00  1.2% xfs_reclaim_inodes_ag       [kernel.kallsyms]
>              1736.00  1.0% xfs_trans_unlocked_item     [kernel.kallsyms]
>              1707.00  1.0% __ticket_spin_trylock       [kernel.kallsyms]
>              1688.00  1.0% xfs_perag_get_tag           [kernel.kallsyms]
>              1660.00  1.0% flat_send_IPI_mask          [kernel.kallsyms]
>              1538.00  0.9% xfs_inode_item_destroy      [kernel.kallsyms]
>              1312.00  0.8% __shrink_dcache_sb          [kernel.kallsyms]
>               940.00  0.6% xfs_perag_put               [kernel.kallsyms]
> 
> So there is some dentry cache reclaim going on. 
> 
> FWIW, it appears there is quite a lot of RCU freeing overhead (~15%
> more CPU time) in the work kswapd is doing during these unlinks, too.
> I just had a look at kswapd when an 8-way create is running - it's running at
> 50-60% of a CPU for seconds at a time. I caught this while it was doing pure
> XFS inode cache reclaim (~10s sample, kswapd reclaimed ~1M inodes):
> 
>              samples  pcnt function                    DSO
>              _______ _____ ___________________________ _________________
> 
>             27171.00  9.0% __call_rcu                  [kernel.kallsyms]
>             21491.00  7.1% down_write                  [kernel.kallsyms]
>             20916.00  6.9% xfs_reclaim_inode           [kernel.kallsyms]
>             20313.00  6.7% radix_tree_delete           [kernel.kallsyms]
>             15828.00  5.3% kmem_cache_free             [kernel.kallsyms]
>             15819.00  5.2% xfs_idestroy_fork           [kernel.kallsyms]
>             14893.00  4.9% is_bad_inode                [kernel.kallsyms]
>             14666.00  4.9% _raw_spin_unlock_irqrestore [kernel.kallsyms]
>             14191.00  4.7% xfs_reclaim_inode_grab      [kernel.kallsyms]
>             14105.00  4.7% xfs_iunlock                 [kernel.kallsyms]
>             10916.00  3.6% __ticket_spin_lock          [kernel.kallsyms]
>             10125.00  3.4% xfs_iflush_cluster          [kernel.kallsyms]
>              8221.00  2.7% xfs_qm_dqdetach             [kernel.kallsyms]
>              7639.00  2.5% xfs_trans_unlocked_item     [kernel.kallsyms]
>              7028.00  2.3% xfs_synchronize_times       [kernel.kallsyms]
>              6974.00  2.3% up_write                    [kernel.kallsyms]
>              5870.00  1.9% call_rcu_sched              [kernel.kallsyms]
>              5634.00  1.9% _cond_resched               [kernel.kallsyms]
> 
> Which is showing a similar amount of RCU overhead as the unlink as above.
> And this while it was doing dentry cache reclaim (~10s sample):
> 
>             35921.00 15.7% __d_drop                      [kernel.kallsyms]
>             30056.00 13.1% __ticket_spin_trylock         [kernel.kallsyms]
>             29066.00 12.7% __ticket_spin_lock            [kernel.kallsyms]
>             19043.00  8.3% __call_rcu                    [kernel.kallsyms]
>             10098.00  4.4% iput                          [kernel.kallsyms]
>              7013.00  3.1% __shrink_dcache_sb            [kernel.kallsyms]
>              6774.00  3.0% __percpu_counter_add          [kernel.kallsyms]
>              6708.00  2.9% radix_tree_tag_set            [kernel.kallsyms]
>              5362.00  2.3% xfs_inactive                  [kernel.kallsyms]
>              5130.00  2.2% __ticket_spin_unlock          [kernel.kallsyms]
>              4884.00  2.1% call_rcu_sched                [kernel.kallsyms]
>              4621.00  2.0% dentry_lru_del                [kernel.kallsyms]
>              3735.00  1.6% bit_waitqueue                 [kernel.kallsyms]
>              3727.00  1.6% dentry_iput                   [kernel.kallsyms]
>              3473.00  1.5% shrink_icache_memory          [kernel.kallsyms]
>              3279.00  1.4% kfree                         [kernel.kallsyms]
>              3101.00  1.4% xfs_perag_get                 [kernel.kallsyms]
>              2516.00  1.1% kmem_cache_free               [kernel.kallsyms]
>              2272.00  1.0% shrink_dentry_list            [kernel.kallsyms]
> 
> I've never really seen any significant dentry cache reclaim overhead
> in profiles of these workloads before, so this was a bit of a
> surprise....

call_rcu shouldn't be doing much, except for disabling irqs and linking
the object into the list. I have a patch somewhere to reduce the irq
disable overhead a bit, but it really shouldn't be doing a lot of work.

Sometimes you find that touching the rcu head field needs to get a
cacheline exclusive, so a bit of work gets transferred there....

But it may also be something going a bit wrong in RCU. I blew it up
once already, after the files_lock splitup that enabled all CPUs to
create and destroy files :)


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/46] rcu-walk and dcache scaling
@ 2010-12-08  7:09           ` Nick Piggin
  0 siblings, 0 replies; 26+ messages in thread
From: Nick Piggin @ 2010-12-08  7:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, Nick Piggin, linux-fsdevel, linux-kernel

On Wed, Dec 08, 2010 at 03:28:16PM +1100, Dave Chinner wrote:
> On Wed, Dec 08, 2010 at 02:32:12PM +1100, Dave Chinner wrote:
> > On Wed, Dec 08, 2010 at 12:47:42PM +1100, Nick Piggin wrote:
> > > On Wed, Dec 8, 2010 at 8:56 AM, Dave Chinner <david@fromorbit.com> wrote:
> > > > On Sat, Nov 27, 2010 at 09:15:58PM +1100, Nick Piggin wrote:
> > > >>
> > > >> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working
> > > >>
> > > >> Here is an new set of vfs patches for review, not that there was much interest
> > > >> last time they were posted. It is structured like:
> > > >>
> > > >> * preparation patches
> > > >> * introduce new locks to take over dcache_lock, then remove it
> > > >> * cleaning up and reworking things for new locks
> > > >> * rcu-walk path walking
> > > >> * start on some fine grained locking steps
> > > >
> > > > Stress test doing:
> > > >
> > > >        single thread 50M inode create
> > > >        single thread rm -rf
> > > >        2-way 50M inode create
> > > >        2-way rm -rf
> > > >        4-way 50M inode create
> > > >        4-way rm -rf
> > > >        8-way 50M inode create
> > > >        8-way rm -rf
> > > >        8-way 250M inode create
> > > >        8-way rm -rf
> > > >
> > > > Failed about 5 minutes into the "4-way rm -rf" (~3 hours into the test)
> > > > with a CPU stuck spinning on here:
> > > >
> > > > [37372.084012] NMI backtrace for cpu 5
> > > > [37372.084012] CPU 5
> > > > [37372.084012] Modules linked in:
> > > > [37372.084012]
> > > > [37372.084012] Pid: 15214, comm: rm Not tainted 2.6.37-rc4-dgc+ #797 /Bochs
> > > > [37372.084012] RIP: 0010:[<ffffffff810643c4>]  [<ffffffff810643c4>] __ticket_spin_lock+0x14/0x20
> > > > [37372.084012] RSP: 0018:ffff880114643c98  EFLAGS: 00000213
> > > > [37372.084012] RAX: 0000000000008801 RBX: ffff8800687be6c0 RCX: ffff8800c4eb2688
> > > > [37372.084012] RDX: ffff880114643d38 RSI: ffff8800dfd4ea80 RDI: ffff880114643d14
> > > > [37372.084012] RBP: ffff880114643c98 R08: 0000000000000003 R09: 0000000000000000
> > > > [37372.084012] R10: 0000000000000000 R11: dead000000200200 R12: ffff880114643d14
> > > > [37372.084012] R13: ffff880114643cb8 R14: ffff880114643d38 R15: ffff8800687be71c
> > > > [37372.084012] FS:  00007fd6d7c93700(0000) GS:ffff8800dfd40000(0000) knlGS:0000000000000000
> > > > [37372.084012] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > > [37372.084012] CR2: 0000000000bbd108 CR3: 0000000107146000 CR4: 00000000000006e0
> > > > [37372.084012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > [37372.084012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > > [37372.084012] Process rm (pid: 15214, threadinfo ffff880114642000, task ffff88011b16f890)
> > > > [37372.084012] Stack:
> > > > [37372.084012]  ffff880114643ca8 ffffffff81ad044e ffff880114643cf8 ffffffff81167ae7
> > > > [37372.084012]  0000000000000000 ffff880114643d38 000000000000000e ffff88011901d800
> > > > [37372.084012]  ffff8800cdb7cf5c ffff88011901d8e0 0000000000000000 0000000000000000
> > > > [37372.084012] Call Trace:
> > > > [37372.084012]  [<ffffffff81ad044e>] _raw_spin_lock+0xe/0x20
> > > > [37372.084012]  [<ffffffff81167ae7>] shrink_dentry_list+0x47/0x370
> > > > [37372.084012]  [<ffffffff81167f5e>] __shrink_dcache_sb+0x14e/0x1e0
> > > > [37372.084012]  [<ffffffff81168456>] shrink_dcache_parent+0x276/0x2d0
> > > > [37372.084012]  [<ffffffff81ad044e>] ? _raw_spin_lock+0xe/0x20
> > > > [37372.084012]  [<ffffffff8115daa2>] dentry_unhash+0x42/0x80
> > > > [37372.084012]  [<ffffffff8115db48>] vfs_rmdir+0x68/0x100
> > > > [37372.084012]  [<ffffffff8115fd93>] do_rmdir+0x113/0x130
> > > > [37372.084012]  [<ffffffff8114f5ad>] ? filp_close+0x5d/0x90
> > > > [37372.084012]  [<ffffffff8115fde5>] sys_unlinkat+0x35/0x40
> > > > [37372.084012]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b
> > > 
> > > OK good, with any luck, that's the same bug.
> > > 
> > > Is this XFS?
> > 
> > Yes.
> > 
> > > Is there any concurrent activity happening on the same dentries?
> > 
> > Not from an application perspective.
> > 
> > > Ie. are the rm -rf threads running on the same directories,
> > 
> > No, each thread operating on a different directory.

This is probably fixed by the same patch as the lockdep splat trace.


> > > or is there any reclaim happening in the background?
> > 
> > IIRC, kswapd was consuming about 5-10% of a CPU during parallel
> > unlink tests. Mainly reclaiming XFS inodes, I think, but there may
> > be dentry cache reclaim going as well.
> 
> Turns out that the kswapd peaks are upwards of 50% of a CPU for a
> few seconds, then idle for 10-15s. Typical perf top output of kswapd
> while it is active during unlinks is:
> 
>              samples  pcnt function                    DSO
>              _______ _____ ___________________________ _________________
> 
>             17168.00 10.2% __call_rcu                  [kernel.kallsyms]
>             13223.00  7.8% kmem_cache_free             [kernel.kallsyms]
>             12917.00  7.6% down_write                  [kernel.kallsyms]
>             12665.00  7.5% xfs_iunlock                 [kernel.kallsyms]
>             10493.00  6.2% xfs_reclaim_inode_grab      [kernel.kallsyms]
>              9314.00  5.5% __lookup_tag                [kernel.kallsyms]
>              9040.00  5.4% radix_tree_delete           [kernel.kallsyms]
>              8694.00  5.1% is_bad_inode                [kernel.kallsyms]
>              7639.00  4.5% __ticket_spin_lock          [kernel.kallsyms]
>              6821.00  4.0% _raw_spin_unlock_irqrestore [kernel.kallsyms]
>              5484.00  3.2% __d_drop                    [kernel.kallsyms]
>              5114.00  3.0% xfs_reclaim_inode           [kernel.kallsyms]
>              4626.00  2.7% __rcu_process_callbacks     [kernel.kallsyms]
>              3556.00  2.1% up_write                    [kernel.kallsyms]
>              3206.00  1.9% _cond_resched               [kernel.kallsyms]
>              3129.00  1.9% xfs_qm_dqdetach             [kernel.kallsyms]
>              2327.00  1.4% radix_tree_tag_clear        [kernel.kallsyms]
>              2327.00  1.4% call_rcu_sched              [kernel.kallsyms]
>              2262.00  1.3% __ticket_spin_unlock        [kernel.kallsyms]
>              2215.00  1.3% xfs_ilock                   [kernel.kallsyms]
>              2200.00  1.3% radix_tree_gang_lookup_tag  [kernel.kallsyms]
>              1982.00  1.2% xfs_reclaim_inodes_ag       [kernel.kallsyms]
>              1736.00  1.0% xfs_trans_unlocked_item     [kernel.kallsyms]
>              1707.00  1.0% __ticket_spin_trylock       [kernel.kallsyms]
>              1688.00  1.0% xfs_perag_get_tag           [kernel.kallsyms]
>              1660.00  1.0% flat_send_IPI_mask          [kernel.kallsyms]
>              1538.00  0.9% xfs_inode_item_destroy      [kernel.kallsyms]
>              1312.00  0.8% __shrink_dcache_sb          [kernel.kallsyms]
>               940.00  0.6% xfs_perag_put               [kernel.kallsyms]
> 
> So there is some dentry cache reclaim going on. 
> 
> FWIW, it appears there is quite a lot of RCU freeing overhead (~15%
> more CPU time) in the work kswapd is doing during these unlinks, too.
> I just had a look at kswapd when a 8-way create is running - it's running at
> 50-60% of a cpu for seconds at a time. I caught this while it was doing pure
> XFS inode cache reclaim (~10s sample, kswapd reclaimed ~1M inodes):
> 
>              samples  pcnt function                    DSO
>              _______ _____ ___________________________ _________________
> 
>             27171.00  9.0% __call_rcu                  [kernel.kallsyms]
>             21491.00  7.1% down_write                  [kernel.kallsyms]
>             20916.00  6.9% xfs_reclaim_inode           [kernel.kallsyms]
>             20313.00  6.7% radix_tree_delete           [kernel.kallsyms]
>             15828.00  5.3% kmem_cache_free             [kernel.kallsyms]
>             15819.00  5.2% xfs_idestroy_fork           [kernel.kallsyms]
>             14893.00  4.9% is_bad_inode                [kernel.kallsyms]
>             14666.00  4.9% _raw_spin_unlock_irqrestore [kernel.kallsyms]
>             14191.00  4.7% xfs_reclaim_inode_grab      [kernel.kallsyms]
>             14105.00  4.7% xfs_iunlock                 [kernel.kallsyms]
>             10916.00  3.6% __ticket_spin_lock          [kernel.kallsyms]
>             10125.00  3.4% xfs_iflush_cluster          [kernel.kallsyms]
>              8221.00  2.7% xfs_qm_dqdetach             [kernel.kallsyms]
>              7639.00  2.5% xfs_trans_unlocked_item     [kernel.kallsyms]
>              7028.00  2.3% xfs_synchronize_times       [kernel.kallsyms]
>              6974.00  2.3% up_write                    [kernel.kallsyms]
>              5870.00  1.9% call_rcu_sched              [kernel.kallsyms]
>              5634.00  1.9% _cond_resched               [kernel.kallsyms]
> 
> Which shows a similar amount of RCU overhead to the unlink workload above.
> And this while it was doing dentry cache reclaim (~10s sample):
> 
>             35921.00 15.7% __d_drop                      [kernel.kallsyms]
>             30056.00 13.1% __ticket_spin_trylock         [kernel.kallsyms]
>             29066.00 12.7% __ticket_spin_lock            [kernel.kallsyms]
>             19043.00  8.3% __call_rcu                    [kernel.kallsyms]
>             10098.00  4.4% iput                          [kernel.kallsyms]
>              7013.00  3.1% __shrink_dcache_sb            [kernel.kallsyms]
>              6774.00  3.0% __percpu_counter_add          [kernel.kallsyms]
>              6708.00  2.9% radix_tree_tag_set            [kernel.kallsyms]
>              5362.00  2.3% xfs_inactive                  [kernel.kallsyms]
>              5130.00  2.2% __ticket_spin_unlock          [kernel.kallsyms]
>              4884.00  2.1% call_rcu_sched                [kernel.kallsyms]
>              4621.00  2.0% dentry_lru_del                [kernel.kallsyms]
>              3735.00  1.6% bit_waitqueue                 [kernel.kallsyms]
>              3727.00  1.6% dentry_iput                   [kernel.kallsyms]
>              3473.00  1.5% shrink_icache_memory          [kernel.kallsyms]
>              3279.00  1.4% kfree                         [kernel.kallsyms]
>              3101.00  1.4% xfs_perag_get                 [kernel.kallsyms]
>              2516.00  1.1% kmem_cache_free               [kernel.kallsyms]
>              2272.00  1.0% shrink_dentry_list            [kernel.kallsyms]
> 
> I've never really seen any significant dentry cache reclaim overhead
> in profiles of these workloads before, so this was a bit of a
> surprise....

call_rcu shouldn't be doing much, except for disabling irqs and linking
the object into the per-CPU callback list. I have a patch somewhere to
reduce the irq disable overhead a bit, but it really shouldn't be doing
a lot of work.

Sometimes touching the rcu_head field has to pull the cacheline in
exclusive state, so a bit of work gets transferred there....
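As a rough user-space sketch of that fast path: all call_rcu really has
to do is tail-link the callback into a per-CPU queue (under irq disable
in the kernel, elided here). The structure and function names below are
illustrative stand-ins, not the kernel's actual implementation:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative mirror of the kernel's rcu_head: a callback node. */
struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
};

/* Illustrative per-CPU callback queue with a tail pointer so that
 * enqueue is O(1) and only touches the new head's cacheline plus
 * the queue tail slot. */
struct rcu_cblist {
	struct rcu_head *head;
	struct rcu_head **tail;	/* points at the last ->next slot */
	long len;
};

static void cblist_init(struct rcu_cblist *l)
{
	l->head = NULL;
	l->tail = &l->head;
	l->len = 0;
}

/* In the kernel this runs under local_irq_save(); here that is elided.
 * Note the write through *l->tail dirties the previous callback's
 * cacheline, which is where the "cacheline exclusive" cost shows up. */
static void sketch_call_rcu(struct rcu_cblist *l, struct rcu_head *head,
			    void (*func)(struct rcu_head *))
{
	head->func = func;
	head->next = NULL;
	*l->tail = head;
	l->tail = &head->next;
	l->len++;
}
```

Under these assumptions the enqueue itself is a handful of stores, so a
high __call_rcu percentage in the profiles points more at cacheline
traffic or sheer call volume than at work inside the function.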

But it may also be something going a bit wrong in RCU. I blew it up
once already, after the files_lock splitup that enabled all CPUs to
create and destroy files :)

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-12-08  3:32     ` Dave Chinner
@ 2010-12-08  4:28       ` Dave Chinner
  2010-12-08  7:09           ` Nick Piggin
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2010-12-08  4:28 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Nick Piggin, linux-fsdevel, linux-kernel

On Wed, Dec 08, 2010 at 02:32:12PM +1100, Dave Chinner wrote:
> On Wed, Dec 08, 2010 at 12:47:42PM +1100, Nick Piggin wrote:
> > On Wed, Dec 8, 2010 at 8:56 AM, Dave Chinner <david@fromorbit.com> wrote:
> > > On Sat, Nov 27, 2010 at 09:15:58PM +1100, Nick Piggin wrote:
> > >>
> > >> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working
> > >>
> >> Here is a new set of vfs patches for review, not that there was much interest
> >> last time they were posted. It is structured like:
> > >>
> > >> * preparation patches
> > >> * introduce new locks to take over dcache_lock, then remove it
> > >> * cleaning up and reworking things for new locks
> > >> * rcu-walk path walking
> > >> * start on some fine grained locking steps
> > >
> > > Stress test doing:
> > >
> > >        single thread 50M inode create
> > >        single thread rm -rf
> > >        2-way 50M inode create
> > >        2-way rm -rf
> > >        4-way 50M inode create
> > >        4-way rm -rf
> > >        8-way 50M inode create
> > >        8-way rm -rf
> > >        8-way 250M inode create
> > >        8-way rm -rf
> > >
> > > Failed about 5 minutes into the "4-way rm -rf" (~3 hours into the test)
> > > with a CPU stuck spinning on here:
> > >
> > > [37372.084012] NMI backtrace for cpu 5
> > > [37372.084012] CPU 5
> > > [37372.084012] Modules linked in:
> > > [37372.084012]
> > > [37372.084012] Pid: 15214, comm: rm Not tainted 2.6.37-rc4-dgc+ #797 /Bochs
> > > [37372.084012] RIP: 0010:[<ffffffff810643c4>]  [<ffffffff810643c4>] __ticket_spin_lock+0x14/0x20
> > > [37372.084012] RSP: 0018:ffff880114643c98  EFLAGS: 00000213
> > > [37372.084012] RAX: 0000000000008801 RBX: ffff8800687be6c0 RCX: ffff8800c4eb2688
> > > [37372.084012] RDX: ffff880114643d38 RSI: ffff8800dfd4ea80 RDI: ffff880114643d14
> > > [37372.084012] RBP: ffff880114643c98 R08: 0000000000000003 R09: 0000000000000000
> > > [37372.084012] R10: 0000000000000000 R11: dead000000200200 R12: ffff880114643d14
> > > [37372.084012] R13: ffff880114643cb8 R14: ffff880114643d38 R15: ffff8800687be71c
> > > [37372.084012] FS:  00007fd6d7c93700(0000) GS:ffff8800dfd40000(0000) knlGS:0000000000000000
> > > [37372.084012] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > [37372.084012] CR2: 0000000000bbd108 CR3: 0000000107146000 CR4: 00000000000006e0
> > > [37372.084012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > [37372.084012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > [37372.084012] Process rm (pid: 15214, threadinfo ffff880114642000, task ffff88011b16f890)
> > > [37372.084012] Stack:
> > > [37372.084012]  ffff880114643ca8 ffffffff81ad044e ffff880114643cf8 ffffffff81167ae7
> > > [37372.084012]  0000000000000000 ffff880114643d38 000000000000000e ffff88011901d800
> > > [37372.084012]  ffff8800cdb7cf5c ffff88011901d8e0 0000000000000000 0000000000000000
> > > [37372.084012] Call Trace:
> > > [37372.084012]  [<ffffffff81ad044e>] _raw_spin_lock+0xe/0x20
> > > [37372.084012]  [<ffffffff81167ae7>] shrink_dentry_list+0x47/0x370
> > > [37372.084012]  [<ffffffff81167f5e>] __shrink_dcache_sb+0x14e/0x1e0
> > > [37372.084012]  [<ffffffff81168456>] shrink_dcache_parent+0x276/0x2d0
> > > [37372.084012]  [<ffffffff81ad044e>] ? _raw_spin_lock+0xe/0x20
> > > [37372.084012]  [<ffffffff8115daa2>] dentry_unhash+0x42/0x80
> > > [37372.084012]  [<ffffffff8115db48>] vfs_rmdir+0x68/0x100
> > > [37372.084012]  [<ffffffff8115fd93>] do_rmdir+0x113/0x130
> > > [37372.084012]  [<ffffffff8114f5ad>] ? filp_close+0x5d/0x90
> > > [37372.084012]  [<ffffffff8115fde5>] sys_unlinkat+0x35/0x40
> > > [37372.084012]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b
> > 
> > OK good, with any luck, that's the same bug.
> > 
> > Is this XFS?
> 
> Yes.
> 
> > Is there any concurrent activity happening on the same dentries?
> 
> Not from an application perspective.
> 
> > Ie. are the rm -rf threads running on the same directories,
> 
> > No, each thread is operating on a different directory.
> 
> > or is there any reclaim happening in the background?
> 
> IIRC, kswapd was consuming about 5-10% of a CPU during parallel
> unlink tests. Mainly reclaiming XFS inodes, I think, but there may
> be dentry cache reclaim going on as well.

Turns out that the kswapd peaks are upwards of 50% of a CPU for a
few seconds, then idle for 10-15s. Typical perf top output of kswapd
while it is active during unlinks is:

             samples  pcnt function                    DSO
             _______ _____ ___________________________ _________________

            17168.00 10.2% __call_rcu                  [kernel.kallsyms]
            13223.00  7.8% kmem_cache_free             [kernel.kallsyms]
            12917.00  7.6% down_write                  [kernel.kallsyms]
            12665.00  7.5% xfs_iunlock                 [kernel.kallsyms]
            10493.00  6.2% xfs_reclaim_inode_grab      [kernel.kallsyms]
             9314.00  5.5% __lookup_tag                [kernel.kallsyms]
             9040.00  5.4% radix_tree_delete           [kernel.kallsyms]
             8694.00  5.1% is_bad_inode                [kernel.kallsyms]
             7639.00  4.5% __ticket_spin_lock          [kernel.kallsyms]
             6821.00  4.0% _raw_spin_unlock_irqrestore [kernel.kallsyms]
             5484.00  3.2% __d_drop                    [kernel.kallsyms]
             5114.00  3.0% xfs_reclaim_inode           [kernel.kallsyms]
             4626.00  2.7% __rcu_process_callbacks     [kernel.kallsyms]
             3556.00  2.1% up_write                    [kernel.kallsyms]
             3206.00  1.9% _cond_resched               [kernel.kallsyms]
             3129.00  1.9% xfs_qm_dqdetach             [kernel.kallsyms]
             2327.00  1.4% radix_tree_tag_clear        [kernel.kallsyms]
             2327.00  1.4% call_rcu_sched              [kernel.kallsyms]
             2262.00  1.3% __ticket_spin_unlock        [kernel.kallsyms]
             2215.00  1.3% xfs_ilock                   [kernel.kallsyms]
             2200.00  1.3% radix_tree_gang_lookup_tag  [kernel.kallsyms]
             1982.00  1.2% xfs_reclaim_inodes_ag       [kernel.kallsyms]
             1736.00  1.0% xfs_trans_unlocked_item     [kernel.kallsyms]
             1707.00  1.0% __ticket_spin_trylock       [kernel.kallsyms]
             1688.00  1.0% xfs_perag_get_tag           [kernel.kallsyms]
             1660.00  1.0% flat_send_IPI_mask          [kernel.kallsyms]
             1538.00  0.9% xfs_inode_item_destroy      [kernel.kallsyms]
             1312.00  0.8% __shrink_dcache_sb          [kernel.kallsyms]
              940.00  0.6% xfs_perag_put               [kernel.kallsyms]

So there is some dentry cache reclaim going on. 

FWIW, it appears there is quite a lot of RCU freeing overhead (~15%
more CPU time) in the work kswapd is doing during these unlinks, too.
I just had a look at kswapd when an 8-way create is running - it's
running at 50-60% of a CPU for seconds at a time. I caught this while
it was doing pure XFS inode cache reclaim (~10s sample, kswapd
reclaimed ~1M inodes):

             samples  pcnt function                    DSO
             _______ _____ ___________________________ _________________

            27171.00  9.0% __call_rcu                  [kernel.kallsyms]
            21491.00  7.1% down_write                  [kernel.kallsyms]
            20916.00  6.9% xfs_reclaim_inode           [kernel.kallsyms]
            20313.00  6.7% radix_tree_delete           [kernel.kallsyms]
            15828.00  5.3% kmem_cache_free             [kernel.kallsyms]
            15819.00  5.2% xfs_idestroy_fork           [kernel.kallsyms]
            14893.00  4.9% is_bad_inode                [kernel.kallsyms]
            14666.00  4.9% _raw_spin_unlock_irqrestore [kernel.kallsyms]
            14191.00  4.7% xfs_reclaim_inode_grab      [kernel.kallsyms]
            14105.00  4.7% xfs_iunlock                 [kernel.kallsyms]
            10916.00  3.6% __ticket_spin_lock          [kernel.kallsyms]
            10125.00  3.4% xfs_iflush_cluster          [kernel.kallsyms]
             8221.00  2.7% xfs_qm_dqdetach             [kernel.kallsyms]
             7639.00  2.5% xfs_trans_unlocked_item     [kernel.kallsyms]
             7028.00  2.3% xfs_synchronize_times       [kernel.kallsyms]
             6974.00  2.3% up_write                    [kernel.kallsyms]
             5870.00  1.9% call_rcu_sched              [kernel.kallsyms]
             5634.00  1.9% _cond_resched               [kernel.kallsyms]

Which shows a similar amount of RCU overhead to the unlink workload above.
And this while it was doing dentry cache reclaim (~10s sample):

            35921.00 15.7% __d_drop                      [kernel.kallsyms]
            30056.00 13.1% __ticket_spin_trylock         [kernel.kallsyms]
            29066.00 12.7% __ticket_spin_lock            [kernel.kallsyms]
            19043.00  8.3% __call_rcu                    [kernel.kallsyms]
            10098.00  4.4% iput                          [kernel.kallsyms]
             7013.00  3.1% __shrink_dcache_sb            [kernel.kallsyms]
             6774.00  3.0% __percpu_counter_add          [kernel.kallsyms]
             6708.00  2.9% radix_tree_tag_set            [kernel.kallsyms]
             5362.00  2.3% xfs_inactive                  [kernel.kallsyms]
             5130.00  2.2% __ticket_spin_unlock          [kernel.kallsyms]
             4884.00  2.1% call_rcu_sched                [kernel.kallsyms]
             4621.00  2.0% dentry_lru_del                [kernel.kallsyms]
             3735.00  1.6% bit_waitqueue                 [kernel.kallsyms]
             3727.00  1.6% dentry_iput                   [kernel.kallsyms]
             3473.00  1.5% shrink_icache_memory          [kernel.kallsyms]
             3279.00  1.4% kfree                         [kernel.kallsyms]
             3101.00  1.4% xfs_perag_get                 [kernel.kallsyms]
             2516.00  1.1% kmem_cache_free               [kernel.kallsyms]
             2272.00  1.0% shrink_dentry_list            [kernel.kallsyms]

I've never really seen any significant dentry cache reclaim overhead
in profiles of these workloads before, so this was a bit of a
surprise....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-12-08  1:47   ` Nick Piggin
@ 2010-12-08  3:32     ` Dave Chinner
  2010-12-08  4:28       ` Dave Chinner
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2010-12-08  3:32 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Nick Piggin, linux-fsdevel, linux-kernel

On Wed, Dec 08, 2010 at 12:47:42PM +1100, Nick Piggin wrote:
> On Wed, Dec 8, 2010 at 8:56 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Sat, Nov 27, 2010 at 09:15:58PM +1100, Nick Piggin wrote:
> >>
> >> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working
> >>
> >> Here is a new set of vfs patches for review, not that there was much interest
> >> last time they were posted. It is structured like:
> >>
> >> * preparation patches
> >> * introduce new locks to take over dcache_lock, then remove it
> >> * cleaning up and reworking things for new locks
> >> * rcu-walk path walking
> >> * start on some fine grained locking steps
> >
> > Stress test doing:
> >
> >        single thread 50M inode create
> >        single thread rm -rf
> >        2-way 50M inode create
> >        2-way rm -rf
> >        4-way 50M inode create
> >        4-way rm -rf
> >        8-way 50M inode create
> >        8-way rm -rf
> >        8-way 250M inode create
> >        8-way rm -rf
> >
> > Failed about 5 minutes into the "4-way rm -rf" (~3 hours into the test)
> > with a CPU stuck spinning on here:
> >
> > [37372.084012] NMI backtrace for cpu 5
> > [37372.084012] CPU 5
> > [37372.084012] Modules linked in:
> > [37372.084012]
> > [37372.084012] Pid: 15214, comm: rm Not tainted 2.6.37-rc4-dgc+ #797 /Bochs
> > [37372.084012] RIP: 0010:[<ffffffff810643c4>]  [<ffffffff810643c4>] __ticket_spin_lock+0x14/0x20
> > [37372.084012] RSP: 0018:ffff880114643c98  EFLAGS: 00000213
> > [37372.084012] RAX: 0000000000008801 RBX: ffff8800687be6c0 RCX: ffff8800c4eb2688
> > [37372.084012] RDX: ffff880114643d38 RSI: ffff8800dfd4ea80 RDI: ffff880114643d14
> > [37372.084012] RBP: ffff880114643c98 R08: 0000000000000003 R09: 0000000000000000
> > [37372.084012] R10: 0000000000000000 R11: dead000000200200 R12: ffff880114643d14
> > [37372.084012] R13: ffff880114643cb8 R14: ffff880114643d38 R15: ffff8800687be71c
> > [37372.084012] FS:  00007fd6d7c93700(0000) GS:ffff8800dfd40000(0000) knlGS:0000000000000000
> > [37372.084012] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > [37372.084012] CR2: 0000000000bbd108 CR3: 0000000107146000 CR4: 00000000000006e0
> > [37372.084012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [37372.084012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > [37372.084012] Process rm (pid: 15214, threadinfo ffff880114642000, task ffff88011b16f890)
> > [37372.084012] Stack:
> > [37372.084012]  ffff880114643ca8 ffffffff81ad044e ffff880114643cf8 ffffffff81167ae7
> > [37372.084012]  0000000000000000 ffff880114643d38 000000000000000e ffff88011901d800
> > [37372.084012]  ffff8800cdb7cf5c ffff88011901d8e0 0000000000000000 0000000000000000
> > [37372.084012] Call Trace:
> > [37372.084012]  [<ffffffff81ad044e>] _raw_spin_lock+0xe/0x20
> > [37372.084012]  [<ffffffff81167ae7>] shrink_dentry_list+0x47/0x370
> > [37372.084012]  [<ffffffff81167f5e>] __shrink_dcache_sb+0x14e/0x1e0
> > [37372.084012]  [<ffffffff81168456>] shrink_dcache_parent+0x276/0x2d0
> > [37372.084012]  [<ffffffff81ad044e>] ? _raw_spin_lock+0xe/0x20
> > [37372.084012]  [<ffffffff8115daa2>] dentry_unhash+0x42/0x80
> > [37372.084012]  [<ffffffff8115db48>] vfs_rmdir+0x68/0x100
> > [37372.084012]  [<ffffffff8115fd93>] do_rmdir+0x113/0x130
> > [37372.084012]  [<ffffffff8114f5ad>] ? filp_close+0x5d/0x90
> > [37372.084012]  [<ffffffff8115fde5>] sys_unlinkat+0x35/0x40
> > [37372.084012]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b
> 
> OK good, with any luck, that's the same bug.
> 
> Is this XFS?

Yes.

> Is there any concurrent activity happening on the same dentries?

Not from an application perspective.

> Ie. are the rm -rf threads running on the same directories,

No, each thread is operating on a different directory.

> or is there any reclaim happening in the background?

IIRC, kswapd was consuming about 5-10% of a CPU during parallel
unlink tests. Mainly reclaiming XFS inodes, I think, but there may
be dentry cache reclaim going on as well.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-12-07 15:24     ` Nick Piggin
  (?)
  (?)
@ 2010-12-08  3:28     ` Nick Piggin
  -1 siblings, 0 replies; 26+ messages in thread
From: Nick Piggin @ 2010-12-08  3:28 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Dave Chinner, Nick Piggin, linux-fsdevel, linux-kernel, Peter Zijlstra

On Wed, Dec 08, 2010 at 02:24:23AM +1100, Nick Piggin wrote:
> On Tue, Dec 7, 2010 at 10:25 PM, Dave Chinner <david@fromorbit.com> wrote:
> > [ 5954.061633] BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1081
> > [ 5954.062466] in_atomic(): 0, irqs_disabled(): 1, pid: 2927, name: rm
> > [ 5954.063122] 3 locks held by rm/2927:
> > [ 5954.063476]  #0:  (&sb->s_type->i_mutex_key#12/1){+.+.+.}, at: [<ffffffff8116f5e1>] do_rmdir+0x81/0x130
> > [ 5954.064014]  #1:  (&sb->s_type->i_mutex_key#12){+.+.+.}, at: [<ffffffff8116d3a8>] vfs_rmdir+0x58/0xe0
> > [ 5954.064014]  #2:  (rcu_read_lock){.+.+..}, at: [<ffffffff811779c0>] shrink_dentry_list+0x0/0x430
> > [ 5954.064014] irq event stamp: 1484376719
> > [ 5954.064014] hardirqs last  enabled at (1484376719): [<ffffffff810ebf07>] __call_rcu+0xd7/0x1a0
> > [ 5954.064014] hardirqs last disabled at (1484376718): [<ffffffff810ebe7a>] __call_rcu+0x4a/0x1a0
> > [ 5954.064014] softirqs last  enabled at (1484376586): [<ffffffff8108b911>] __do_softirq+0x161/0x270
> > [ 5954.064014] softirqs last disabled at (1484376581): [<ffffffff8103af1c>] call_softirq+0x1c/0x50
> > [ 5954.064014] Pid: 2927, comm: rm Not tainted 2.6.37-rc4-dgc+ #794
> > [ 5954.064014] Call Trace:
> > [ 5954.064014]  [<ffffffff810b95b0>] ? print_irqtrace_events+0xd0/0xe0
> > [ 5954.064014]  [<ffffffff81076455>] __might_sleep+0xf5/0x130
> > [ 5954.064014]  [<ffffffff81b1e603>] do_page_fault+0x103/0x4f0
> > [ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> > [ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> > [ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> > [ 5954.064014]  [<ffffffff81b19b28>] ? trace_hardirqs_off_thunk+0x3a/0x3c
> > [ 5954.064014]  [<ffffffff810b8e20>] ? usage_match+0x0/0x20
> > [ 5954.064014]  [<ffffffff81b1af25>] page_fault+0x25/0x30
> > [ 5954.064014]  [<ffffffff810b8e20>] ? usage_match+0x0/0x20
> > [ 5954.064014]  [<ffffffff810ba1c8>] ? __bfs+0xc8/0x260
> > [ 5954.064014]  [<ffffffff810ba123>] ? __bfs+0x23/0x260
> > [ 5954.064014]  [<ffffffff810ba4d2>] find_usage_backwards+0x42/0x80
> > [ 5954.064014]  [<ffffffff810bcec4>] check_usage_backwards+0x64/0xf0
> > [ 5954.064014]  [<ffffffff8104796f>] ? save_stack_trace+0x2f/0x50
> > [ 5954.064014]  [<ffffffff810bce60>] ? check_usage_backwards+0x0/0xf0
> > [ 5954.064014]  [<ffffffff810bd9a9>] mark_lock+0x1a9/0x440
> > [ 5954.064014]  [<ffffffff810be989>] __lock_acquire+0x5a9/0x14b0
> > [ 5954.064014]  [<ffffffff810be716>] ? __lock_acquire+0x336/0x14b0
> > [ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> > [ 5954.064014]  [<ffffffff810bf944>] lock_acquire+0xb4/0x140
> > [ 5954.064014]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
> > [ 5954.064014]  [<ffffffff81b19d86>] _raw_spin_lock+0x36/0x70
> > [ 5954.064014]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
> > [ 5954.064014]  [<ffffffff81177a1c>] shrink_dentry_list+0x5c/0x430
> > [ 5954.064014]  [<ffffffff811779c0>] ? shrink_dentry_list+0x0/0x430
> > [ 5954.064014]  [<ffffffff816b9c7e>] ? do_raw_spin_unlock+0x5e/0xb0
> > [ 5954.064014]  [<ffffffff81177f2d>] __shrink_dcache_sb+0x13d/0x1c0
> > [ 5954.064014]  [<ffffffff811784bf>] shrink_dcache_parent+0x32f/0x390
> > [ 5954.064014]  [<ffffffff8116d31d>] dentry_unhash+0x3d/0x70
> > [ 5954.064014]  [<ffffffff8116d3b0>] vfs_rmdir+0x60/0xe0
> > [ 5954.064014]  [<ffffffff8116f673>] do_rmdir+0x113/0x130
> > [ 5954.064014]  [<ffffffff8103a03a>] ? sysret_check+0x2e/0x69
> > [ 5954.064014]  [<ffffffff81b19ae9>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> > [ 5954.064014]  [<ffffffff8116f6c5>] sys_unlinkat+0x35/0x40
> > [ 5954.064014]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b
> 
> Seems that lockdep exploded.
> 
> > [ 5954.092916] BUG: unable to handle kernel NULL pointer dereference at           (null)
> > [ 5954.093806] IP: [<ffffffff810ba1c8>] __bfs+0xc8/0x260
> > [ 5954.094331] PGD 1084e5067 PUD 102368067 PMD 0
> > [ 5954.094830] Oops: 0000 [#1] SMP
> > [ 5954.095194] last sysfs file: /sys/devices/system/cpu/online
> > [ 5954.095760] CPU 6
> > [ 5954.095954] Modules linked in:
> > [ 5954.096319]
> > [ 5954.096483] Pid: 2927, comm: rm Not tainted 2.6.37-rc4-dgc+ #794 /Bochs
> > [ 5954.096665] RIP: 0010:[<ffffffff810ba1c8>]  [<ffffffff810ba1c8>] __bfs+0xc8/0x260
> > [ 5954.096665] RSP: 0018:ffff8801175539a8  EFLAGS: 00010046
> > [ 5954.096665] RAX: ffffffff8267d980 RBX: ffffffff8267d980 RCX: ffff880117553a48
> > [ 5954.096665] RDX: ffff8801175539d0 RSI: 0000000000000000 RDI: ffff880117553a48
> > [ 5954.096665] RBP: ffff880117553a08 R08: 0000000000000000 R09: 0000000000000000
> > [ 5954.096665] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
> > [ 5954.096665] R13: ffffffff810b8e20 R14: ffff880117553a90 R15: 0000000000000000
> > [ 5954.096665] FS:  00007f4594cf3700(0000) GS:ffff8800dfa00000(0000) knlGS:0000000000000000
> > [ 5954.096665] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > [ 5954.096665] CR2: 00007f2f21e89c60 CR3: 0000000110b0f000 CR4: 00000000000006e0
> > [ 5954.096665] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 5954.096665] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > [ 5954.096665] Process rm (pid: 2927, threadinfo ffff880117552000, task ffff88010ff10b00)
> > [ 5954.096665] Stack:
> > [ 5954.096665]  ffffffff8267d868 00007fffa02d1428 ffff8800ffffffff ffff880100000000
> > [ 5954.096665]  000000000000b720 ffff880117553a48 ffffffff8267d868 ffff880117553a48
> > [ 5954.096665]  0000000000000000 ffff88010ff10b00 0000000000000000 ffffffff81dacba0
> > [ 5954.096665] Call Trace:
> > [ 5954.096665]  [<ffffffff810ba4d2>] find_usage_backwards+0x42/0x80
> > [ 5954.096665]  [<ffffffff810bcec4>] check_usage_backwards+0x64/0xf0
> > [ 5954.096665]  [<ffffffff8104796f>] ? save_stack_trace+0x2f/0x50
> > [ 5954.096665]  [<ffffffff810bce60>] ? check_usage_backwards+0x0/0xf0
> > [ 5954.096665]  [<ffffffff810bd9a9>] mark_lock+0x1a9/0x440
> > [ 5954.096665]  [<ffffffff810be989>] __lock_acquire+0x5a9/0x14b0
> > [ 5954.096665]  [<ffffffff810be716>] ? __lock_acquire+0x336/0x14b0
> > [ 5954.096665]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> > [ 5954.096665]  [<ffffffff810bf944>] lock_acquire+0xb4/0x140
> > [ 5954.096665]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
> > [ 5954.096665]  [<ffffffff81b19d86>] _raw_spin_lock+0x36/0x70
> > [ 5954.096665]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
> > [ 5954.096665]  [<ffffffff81177a1c>] shrink_dentry_list+0x5c/0x430
> > [ 5954.096665]  [<ffffffff811779c0>] ? shrink_dentry_list+0x0/0x430
> > [ 5954.096665]  [<ffffffff816b9c7e>] ? do_raw_spin_unlock+0x5e/0xb0
> > [ 5954.096665]  [<ffffffff81177f2d>] __shrink_dcache_sb+0x13d/0x1c0
> > [ 5954.096665]  [<ffffffff811784bf>] shrink_dcache_parent+0x32f/0x390
> > [ 5954.096665]  [<ffffffff8116d31d>] dentry_unhash+0x3d/0x70
> > [ 5954.096665]  [<ffffffff8116d3b0>] vfs_rmdir+0x60/0xe0
> > [ 5954.096665]  [<ffffffff8116f673>] do_rmdir+0x113/0x130
> > [ 5954.096665]  [<ffffffff8103a03a>] ? sysret_check+0x2e/0x69
> > [ 5954.096665]  [<ffffffff81b19ae9>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> > [ 5954.096665]  [<ffffffff8116f6c5>] sys_unlinkat+0x35/0x40
> > [ 5954.096665]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b
> > [ 5954.096665] Code: 0a 89 05 dc 0f a8 01 48 8b 41 10 48 85 c0 0f 84 1f 01 00 00 48 8d 98 70 01 00 00 48 05 80 01 00 00 45 85 c0 48 0f 44 d8 4c 8b 3b <49> 8b 07 49 39
> > [ 5954.096665] RIP  [<ffffffff810ba1c8>] __bfs+0xc8/0x260
> > [ 5954.096665]  RSP <ffff8801175539a8>
> > [ 5954.096665] CR2: 0000000000000000
> > [ 5954.127991] ---[ end trace 85a6727c2d4e3d90 ]---
> 
> So vfs-scale-working branch may not be entirely in the clear, seeing

Ah, it may have been a stupid little bug: the list-empty check was done
first, and then the pointer was reloaded from the list head before use.

What does the asm for shrink_dentry_list look like (before this patch)?

Thanks,
Nick

Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c	2010-12-08 12:41:35.000000000 +1100
+++ linux-2.6/fs/dcache.c	2010-12-08 14:23:46.000000000 +1100
@@ -657,10 +657,10 @@ static void shrink_dentry_list(struct li
 	struct dentry *dentry;
 
 	rcu_read_lock();
-	while (!list_empty(list)) {
-		dentry = list_entry(list->prev, struct dentry, d_lru);
-
-		/* Don't need RCU dereference because we recheck under lock */
+	for (;;) {
+		dentry = list_entry_rcu(list->prev, struct dentry, d_lru);
+		if (&dentry->d_lru == list)
+			break;
 		spin_lock(&dentry->d_lock);
 		if (dentry != list_entry(list->prev, struct dentry, d_lru)) {
 			spin_unlock(&dentry->d_lock);
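The race shape that patch removes can be illustrated in user-space C.
Everything below (`list_head`, the `next_victim_*` helpers, and the
`READ_ONCE_PTR` macro standing in for ACCESS_ONCE/list_entry_rcu) is an
illustrative sketch, not the kernel's real code:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly-linked list head, mirroring the kernel's
 * struct list_head for illustration. */
struct list_head {
	struct list_head *prev, *next;
};

/* Force exactly one load of the pointer, like ACCESS_ONCE(). */
#define READ_ONCE_PTR(p) (*(struct list_head * volatile *)&(p))

/* Racy shape: emptiness is tested first, then list->prev is read
 * again to get the entry.  Between the two reads a concurrent
 * remover can empty the list, so the second read may yield the
 * head itself (or a freed node) instead of a valid entry. */
static struct list_head *next_victim_racy(struct list_head *list)
{
	if (list->prev == list)		/* list_empty() */
		return NULL;
	return list->prev;		/* may be re-read after the check */
}

/* Fixed shape, matching the patch: load ->prev exactly once, then
 * decide from that single loaded value whether it was the head. */
static struct list_head *next_victim_fixed(struct list_head *list)
{
	struct list_head *n = READ_ONCE_PTR(list->prev);

	if (n == list)			/* loaded value was the head: empty */
		return NULL;
	return n;
}
```

The key property is that the fixed form makes its emptiness decision
and its return value from the same load, so no concurrent writer can
change the answer between them.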


* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-12-07 21:56 ` Dave Chinner
@ 2010-12-08  1:47   ` Nick Piggin
  2010-12-08  3:32     ` Dave Chinner
  0 siblings, 1 reply; 26+ messages in thread
From: Nick Piggin @ 2010-12-08  1:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, linux-fsdevel, linux-kernel

On Wed, Dec 8, 2010 at 8:56 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Sat, Nov 27, 2010 at 09:15:58PM +1100, Nick Piggin wrote:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working
>>
>> Here is a new set of vfs patches for review, not that there was much interest
>> last time they were posted. It is structured like:
>>
>> * preparation patches
>> * introduce new locks to take over dcache_lock, then remove it
>> * cleaning up and reworking things for new locks
>> * rcu-walk path walking
>> * start on some fine grained locking steps
>
> Stress test doing:
>
>        single thread 50M inode create
>        single thread rm -rf
>        2-way 50M inode create
>        2-way rm -rf
>        4-way 50M inode create
>        4-way rm -rf
>        8-way 50M inode create
>        8-way rm -rf
>        8-way 250M inode create
>        8-way rm -rf
>
> Failed about 5 minutes into the "4-way rm -rf" (~3 hours into the test)
> with a CPU stuck spinning on here:
>
> [37372.084012] NMI backtrace for cpu 5
> [37372.084012] CPU 5
> [37372.084012] Modules linked in:
> [37372.084012]
> [37372.084012] Pid: 15214, comm: rm Not tainted 2.6.37-rc4-dgc+ #797 /Bochs
> [37372.084012] RIP: 0010:[<ffffffff810643c4>]  [<ffffffff810643c4>] __ticket_spin_lock+0x14/0x20
> [37372.084012] RSP: 0018:ffff880114643c98  EFLAGS: 00000213
> [37372.084012] RAX: 0000000000008801 RBX: ffff8800687be6c0 RCX: ffff8800c4eb2688
> [37372.084012] RDX: ffff880114643d38 RSI: ffff8800dfd4ea80 RDI: ffff880114643d14
> [37372.084012] RBP: ffff880114643c98 R08: 0000000000000003 R09: 0000000000000000
> [37372.084012] R10: 0000000000000000 R11: dead000000200200 R12: ffff880114643d14
> [37372.084012] R13: ffff880114643cb8 R14: ffff880114643d38 R15: ffff8800687be71c
> [37372.084012] FS:  00007fd6d7c93700(0000) GS:ffff8800dfd40000(0000) knlGS:0000000000000000
> [37372.084012] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [37372.084012] CR2: 0000000000bbd108 CR3: 0000000107146000 CR4: 00000000000006e0
> [37372.084012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [37372.084012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [37372.084012] Process rm (pid: 15214, threadinfo ffff880114642000, task ffff88011b16f890)
> [37372.084012] Stack:
> [37372.084012]  ffff880114643ca8 ffffffff81ad044e ffff880114643cf8 ffffffff81167ae7
> [37372.084012]  0000000000000000 ffff880114643d38 000000000000000e ffff88011901d800
> [37372.084012]  ffff8800cdb7cf5c ffff88011901d8e0 0000000000000000 0000000000000000
> [37372.084012] Call Trace:
> [37372.084012]  [<ffffffff81ad044e>] _raw_spin_lock+0xe/0x20
> [37372.084012]  [<ffffffff81167ae7>] shrink_dentry_list+0x47/0x370
> [37372.084012]  [<ffffffff81167f5e>] __shrink_dcache_sb+0x14e/0x1e0
> [37372.084012]  [<ffffffff81168456>] shrink_dcache_parent+0x276/0x2d0
> [37372.084012]  [<ffffffff81ad044e>] ? _raw_spin_lock+0xe/0x20
> [37372.084012]  [<ffffffff8115daa2>] dentry_unhash+0x42/0x80
> [37372.084012]  [<ffffffff8115db48>] vfs_rmdir+0x68/0x100
> [37372.084012]  [<ffffffff8115fd93>] do_rmdir+0x113/0x130
> [37372.084012]  [<ffffffff8114f5ad>] ? filp_close+0x5d/0x90
> [37372.084012]  [<ffffffff8115fde5>] sys_unlinkat+0x35/0x40
> [37372.084012]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b

OK good, with any luck, that's the same bug.

Is this XFS? Is there any concurrent activity happening on the same
dentries? Ie. are the rm -rf threads running on the same directories,
or is there any reclaim happening in the background?

Thanks,
Nick


* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-11-27 10:15 Nick Piggin
                   ` (2 preceding siblings ...)
  2010-12-07 11:25 ` Dave Chinner
@ 2010-12-07 21:56 ` Dave Chinner
  2010-12-08  1:47   ` Nick Piggin
  3 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2010-12-07 21:56 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, linux-kernel

On Sat, Nov 27, 2010 at 09:15:58PM +1100, Nick Piggin wrote:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working
> 
> Here is a new set of vfs patches for review, not that there was much interest
> last time they were posted. It is structured like:
> 
> * preparation patches
> * introduce new locks to take over dcache_lock, then remove it
> * cleaning up and reworking things for new locks
> * rcu-walk path walking
> * start on some fine grained locking steps

Stress test doing:

	single thread 50M inode create
	single thread rm -rf
	2-way 50M inode create
	2-way rm -rf
	4-way 50M inode create
	4-way rm -rf
	8-way 50M inode create
	8-way rm -rf
	8-way 250M inode create
	8-way rm -rf

Failed about 5 minutes into the "4-way rm -rf" (~3 hours into the test)
with a CPU stuck spinning here:

[37372.084012] NMI backtrace for cpu 5
[37372.084012] CPU 5 
[37372.084012] Modules linked in:
[37372.084012] 
[37372.084012] Pid: 15214, comm: rm Not tainted 2.6.37-rc4-dgc+ #797 /Bochs
[37372.084012] RIP: 0010:[<ffffffff810643c4>]  [<ffffffff810643c4>] __ticket_spin_lock+0x14/0x20
[37372.084012] RSP: 0018:ffff880114643c98  EFLAGS: 00000213
[37372.084012] RAX: 0000000000008801 RBX: ffff8800687be6c0 RCX: ffff8800c4eb2688
[37372.084012] RDX: ffff880114643d38 RSI: ffff8800dfd4ea80 RDI: ffff880114643d14
[37372.084012] RBP: ffff880114643c98 R08: 0000000000000003 R09: 0000000000000000
[37372.084012] R10: 0000000000000000 R11: dead000000200200 R12: ffff880114643d14
[37372.084012] R13: ffff880114643cb8 R14: ffff880114643d38 R15: ffff8800687be71c
[37372.084012] FS:  00007fd6d7c93700(0000) GS:ffff8800dfd40000(0000) knlGS:0000000000000000
[37372.084012] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[37372.084012] CR2: 0000000000bbd108 CR3: 0000000107146000 CR4: 00000000000006e0
[37372.084012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[37372.084012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[37372.084012] Process rm (pid: 15214, threadinfo ffff880114642000, task ffff88011b16f890)
[37372.084012] Stack:
[37372.084012]  ffff880114643ca8 ffffffff81ad044e ffff880114643cf8 ffffffff81167ae7
[37372.084012]  0000000000000000 ffff880114643d38 000000000000000e ffff88011901d800
[37372.084012]  ffff8800cdb7cf5c ffff88011901d8e0 0000000000000000 0000000000000000
[37372.084012] Call Trace:
[37372.084012]  [<ffffffff81ad044e>] _raw_spin_lock+0xe/0x20
[37372.084012]  [<ffffffff81167ae7>] shrink_dentry_list+0x47/0x370
[37372.084012]  [<ffffffff81167f5e>] __shrink_dcache_sb+0x14e/0x1e0
[37372.084012]  [<ffffffff81168456>] shrink_dcache_parent+0x276/0x2d0
[37372.084012]  [<ffffffff81ad044e>] ? _raw_spin_lock+0xe/0x20
[37372.084012]  [<ffffffff8115daa2>] dentry_unhash+0x42/0x80
[37372.084012]  [<ffffffff8115db48>] vfs_rmdir+0x68/0x100
[37372.084012]  [<ffffffff8115fd93>] do_rmdir+0x113/0x130
[37372.084012]  [<ffffffff8114f5ad>] ? filp_close+0x5d/0x90
[37372.084012]  [<ffffffff8115fde5>] sys_unlinkat+0x35/0x40
[37372.084012]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b
[37372.084012] Code: c1 c1 41 06 81 e9 dd fe ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 55 b8 00 01 00 00 48 89 e5 f0 66 0f c1 07 38 e0 74 06 f3 90 <8a> 07 eb f6 c9 c3 66 0f 1f 44 00 00 55 48 89 e5 0f b7 07 38 e0 
[37372.084012] Call Trace:
[37372.084012]  [<ffffffff81ad044e>] _raw_spin_lock+0xe/0x20
[37372.084012]  [<ffffffff81167ae7>] shrink_dentry_list+0x47/0x370
[37372.084012]  [<ffffffff81167f5e>] __shrink_dcache_sb+0x14e/0x1e0
[37372.084012]  [<ffffffff81168456>] shrink_dcache_parent+0x276/0x2d0
[37372.084012]  [<ffffffff81ad044e>] ? _raw_spin_lock+0xe/0x20
[37372.084012]  [<ffffffff8115daa2>] dentry_unhash+0x42/0x80
[37372.084012]  [<ffffffff8115db48>] vfs_rmdir+0x68/0x100
[37372.084012]  [<ffffffff8115fd93>] do_rmdir+0x113/0x130
[37372.084012]  [<ffffffff8114f5ad>] ? filp_close+0x5d/0x90
[37372.084012]  [<ffffffff8115fde5>] sys_unlinkat+0x35/0x40
[37372.084012]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-12-07 15:59       ` Nick Piggin
@ 2010-12-07 16:23         ` Peter Zijlstra
  0 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2010-12-07 16:23 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Dave Chinner, Nick Piggin, linux-fsdevel, linux-kernel, MingLei

On Wed, 2010-12-08 at 02:59 +1100, Nick Piggin wrote:
> OK, thanks. My version should not have caused any problems though,
> right? 

I think so, yes, although looking at it again I wonder why you use
spin_acquire(.trylock=1) -- but that too shouldn't cause anything like
the explosion.

> Any idea what might have caused Dave's crash?

Not directly, no. Usually lockdep crashes indicate use after free like
things, where we try to lock a lock that's been scribbled on. But that
usually explodes a bit earlier.

Your faulting in the middle of that breadth-first search does suggest some
data corruption, but I'm not quite sure what kind; I can't remember ever
having seen something like this before.

I've CC'ed Ming Lei, who wrote the BFS search; maybe he's got an idea.


Copy of the splat:
---

> [ 5954.061633] BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1081
> [ 5954.062466] in_atomic(): 0, irqs_disabled(): 1, pid: 2927, name: rm
> [ 5954.063122] 3 locks held by rm/2927:
> [ 5954.063476]  #0:  (&sb->s_type->i_mutex_key#12/1){+.+.+.}, at: [<ffffffff8116f5e1>] do_rmdir+0x81/0x130
> [ 5954.064014]  #1:  (&sb->s_type->i_mutex_key#12){+.+.+.}, at: [<ffffffff8116d3a8>] vfs_rmdir+0x58/0xe0
> [ 5954.064014]  #2:  (rcu_read_lock){.+.+..}, at: [<ffffffff811779c0>] shrink_dentry_list+0x0/0x430
> [ 5954.064014] irq event stamp: 1484376719
> [ 5954.064014] hardirqs last  enabled at (1484376719): [<ffffffff810ebf07>] __call_rcu+0xd7/0x1a0
> [ 5954.064014] hardirqs last disabled at (1484376718): [<ffffffff810ebe7a>] __call_rcu+0x4a/0x1a0
> [ 5954.064014] softirqs last  enabled at (1484376586): [<ffffffff8108b911>] __do_softirq+0x161/0x270
> [ 5954.064014] softirqs last disabled at (1484376581): [<ffffffff8103af1c>] call_softirq+0x1c/0x50
> [ 5954.064014] Pid: 2927, comm: rm Not tainted 2.6.37-rc4-dgc+ #794
> [ 5954.064014] Call Trace:
> [ 5954.064014]  [<ffffffff810b95b0>] ? print_irqtrace_events+0xd0/0xe0
> [ 5954.064014]  [<ffffffff81076455>] __might_sleep+0xf5/0x130
> [ 5954.064014]  [<ffffffff81b1e603>] do_page_fault+0x103/0x4f0
> [ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> [ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> [ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> [ 5954.064014]  [<ffffffff81b19b28>] ? trace_hardirqs_off_thunk+0x3a/0x3c
> [ 5954.064014]  [<ffffffff810b8e20>] ? usage_match+0x0/0x20
> [ 5954.064014]  [<ffffffff81b1af25>] page_fault+0x25/0x30
> [ 5954.064014]  [<ffffffff810b8e20>] ? usage_match+0x0/0x20
> [ 5954.064014]  [<ffffffff810ba1c8>] ? __bfs+0xc8/0x260
> [ 5954.064014]  [<ffffffff810ba123>] ? __bfs+0x23/0x260
> [ 5954.064014]  [<ffffffff810ba4d2>] find_usage_backwards+0x42/0x80
> [ 5954.064014]  [<ffffffff810bcec4>] check_usage_backwards+0x64/0xf0
> [ 5954.064014]  [<ffffffff8104796f>] ? save_stack_trace+0x2f/0x50
> [ 5954.064014]  [<ffffffff810bce60>] ? check_usage_backwards+0x0/0xf0
> [ 5954.064014]  [<ffffffff810bd9a9>] mark_lock+0x1a9/0x440
> [ 5954.064014]  [<ffffffff810be989>] __lock_acquire+0x5a9/0x14b0
> [ 5954.064014]  [<ffffffff810be716>] ? __lock_acquire+0x336/0x14b0
> [ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> [ 5954.064014]  [<ffffffff810bf944>] lock_acquire+0xb4/0x140
> [ 5954.064014]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
> [ 5954.064014]  [<ffffffff81b19d86>] _raw_spin_lock+0x36/0x70
> [ 5954.064014]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
> [ 5954.064014]  [<ffffffff81177a1c>] shrink_dentry_list+0x5c/0x430
> [ 5954.064014]  [<ffffffff811779c0>] ? shrink_dentry_list+0x0/0x430
> [ 5954.064014]  [<ffffffff816b9c7e>] ? do_raw_spin_unlock+0x5e/0xb0
> [ 5954.064014]  [<ffffffff81177f2d>] __shrink_dcache_sb+0x13d/0x1c0
> [ 5954.064014]  [<ffffffff811784bf>] shrink_dcache_parent+0x32f/0x390
> [ 5954.064014]  [<ffffffff8116d31d>] dentry_unhash+0x3d/0x70
> [ 5954.064014]  [<ffffffff8116d3b0>] vfs_rmdir+0x60/0xe0
> [ 5954.064014]  [<ffffffff8116f673>] do_rmdir+0x113/0x130
> [ 5954.064014]  [<ffffffff8103a03a>] ? sysret_check+0x2e/0x69
> [ 5954.064014]  [<ffffffff81b19ae9>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [ 5954.064014]  [<ffffffff8116f6c5>] sys_unlinkat+0x35/0x40
> [ 5954.064014]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-12-07 15:49     ` Peter Zijlstra
@ 2010-12-07 15:59       ` Nick Piggin
  2010-12-07 16:23         ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Nick Piggin @ 2010-12-07 15:59 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Dave Chinner, Nick Piggin, linux-fsdevel, linux-kernel

On Wed, Dec 8, 2010 at 2:49 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Wed, 2010-12-08 at 02:24 +1100, Nick Piggin wrote:
>>
>>  repeat:
>>     spin_lock(&parent->d_lock);
>>     spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
>>     /* do stuff */
>>     spin_unlock(&parent->d_lock);
>>     spin_release(&dentry->d_lock.dep_map, 1, _RET_IP_);
>>     parent = dentry;
>>     spin_acquire(&this_parent->d_lock.dep_map, 0, 1, _RET_IP_);
>>     goto repeat;
>
> shouldn't that be s/this_parent/parent/ ?

Yes, typo in my pseudo code.


> So what you're trying to do is:
>
>  A -> B -> C -> ...
>
> lock A
> lock B, nested
> unlock A
> flip B from nested to top
> lock C, nested
> unlock B
> flip C from nested to top
> lock ...
>
> Anyway, the way to write that is something like:
>
>  lock_set_subclass(&dentry->d_lock.dep_map, 0, _RET_IP_);
>
> Which will reset the subclass of the held lock from DENTRY_D_LOCK_NESTED
> to 0.

OK, thanks. My version should not have caused any problems though,
right? Any idea what might have caused Dave's crash?


> This is also used in double_unlock_balance(), we go into
> double_lock_balance() with this_rq locked and want to lock busiest,
> because of the lock ordering we might need to unlock this_rq and lock
> busiest first, at which point this_rq is nested.
>
> On unlock we thus need to map this_rq back to subclass 0 (which it had
> before double_lock_balance()), because otherwise subsequent lock
> operations will be done against the subclass and confuse things.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-12-07 15:24     ` Nick Piggin
  (?)
@ 2010-12-07 15:49     ` Peter Zijlstra
  2010-12-07 15:59       ` Nick Piggin
  -1 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2010-12-07 15:49 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Dave Chinner, Nick Piggin, linux-fsdevel, linux-kernel

On Wed, 2010-12-08 at 02:24 +1100, Nick Piggin wrote:
> 
>  repeat:
>     spin_lock(&parent->d_lock);
>     spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
>     /* do stuff */
>     spin_unlock(&parent->d_lock);
>     spin_release(&dentry->d_lock.dep_map, 1, _RET_IP_);
>     parent = dentry;
>     spin_acquire(&this_parent->d_lock.dep_map, 0, 1, _RET_IP_);
>     goto repeat; 

shouldn't that be s/this_parent/parent/ ?

So what you're trying to do is:

  A -> B -> C -> ...

lock A
lock B, nested
unlock A
flip B from nested to top
lock C, nested
unlock B
flip C from nested to top
lock ...

Anyway, the way to write that is something like:

  lock_set_subclass(&dentry->d_lock.dep_map, 0, _RET_IP_);

Which will reset the subclass of the held lock from DENTRY_D_LOCK_NESTED
to 0.

This is also used in double_unlock_balance(), we go into
double_lock_balance() with this_rq locked and want to lock busiest,
because of the lock ordering we might need to unlock this_rq and lock
busiest first, at which point this_rq is nested.

On unlock we thus need to map this_rq back to subclass 0 (which it had
before double_lock_balance()), because otherwise subsequent lock
operations will be done against the subclass and confuse things.



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-12-07 11:25 ` Dave Chinner
@ 2010-12-07 15:24     ` Nick Piggin
  0 siblings, 0 replies; 26+ messages in thread
From: Nick Piggin @ 2010-12-07 15:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, linux-fsdevel, linux-kernel, Peter Zijlstra

On Tue, Dec 7, 2010 at 10:25 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Sat, Nov 27, 2010 at 09:15:58PM +1100, Nick Piggin wrote:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working
>>
>> Here is a new set of vfs patches for review, not that there was much interest
>> last time they were posted. It is structured like:
>>
>> * preparation patches
>> * introduce new locks to take over dcache_lock, then remove it
>> * cleaning up and reworking things for new locks
>> * rcu-walk path walking
>> * start on some fine grained locking steps
>
> Just got this set of traces doing an 8-way parallel remove of 50
> million inodes at about 40M inodes unlinked:

Thanks for testing...


> [ 5954.061633] BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1081
> [ 5954.062466] in_atomic(): 0, irqs_disabled(): 1, pid: 2927, name: rm
> [ 5954.063122] 3 locks held by rm/2927:
> [ 5954.063476]  #0:  (&sb->s_type->i_mutex_key#12/1){+.+.+.}, at: [<ffffffff8116f5e1>] do_rmdir+0x81/0x130
> [ 5954.064014]  #1:  (&sb->s_type->i_mutex_key#12){+.+.+.}, at: [<ffffffff8116d3a8>] vfs_rmdir+0x58/0xe0
> [ 5954.064014]  #2:  (rcu_read_lock){.+.+..}, at: [<ffffffff811779c0>] shrink_dentry_list+0x0/0x430
> [ 5954.064014] irq event stamp: 1484376719
> [ 5954.064014] hardirqs last  enabled at (1484376719): [<ffffffff810ebf07>] __call_rcu+0xd7/0x1a0
> [ 5954.064014] hardirqs last disabled at (1484376718): [<ffffffff810ebe7a>] __call_rcu+0x4a/0x1a0
> [ 5954.064014] softirqs last  enabled at (1484376586): [<ffffffff8108b911>] __do_softirq+0x161/0x270
> [ 5954.064014] softirqs last disabled at (1484376581): [<ffffffff8103af1c>] call_softirq+0x1c/0x50
> [ 5954.064014] Pid: 2927, comm: rm Not tainted 2.6.37-rc4-dgc+ #794
> [ 5954.064014] Call Trace:
> [ 5954.064014]  [<ffffffff810b95b0>] ? print_irqtrace_events+0xd0/0xe0
> [ 5954.064014]  [<ffffffff81076455>] __might_sleep+0xf5/0x130
> [ 5954.064014]  [<ffffffff81b1e603>] do_page_fault+0x103/0x4f0
> [ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> [ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> [ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> [ 5954.064014]  [<ffffffff81b19b28>] ? trace_hardirqs_off_thunk+0x3a/0x3c
> [ 5954.064014]  [<ffffffff810b8e20>] ? usage_match+0x0/0x20
> [ 5954.064014]  [<ffffffff81b1af25>] page_fault+0x25/0x30
> [ 5954.064014]  [<ffffffff810b8e20>] ? usage_match+0x0/0x20
> [ 5954.064014]  [<ffffffff810ba1c8>] ? __bfs+0xc8/0x260
> [ 5954.064014]  [<ffffffff810ba123>] ? __bfs+0x23/0x260
> [ 5954.064014]  [<ffffffff810ba4d2>] find_usage_backwards+0x42/0x80
> [ 5954.064014]  [<ffffffff810bcec4>] check_usage_backwards+0x64/0xf0
> [ 5954.064014]  [<ffffffff8104796f>] ? save_stack_trace+0x2f/0x50
> [ 5954.064014]  [<ffffffff810bce60>] ? check_usage_backwards+0x0/0xf0
> [ 5954.064014]  [<ffffffff810bd9a9>] mark_lock+0x1a9/0x440
> [ 5954.064014]  [<ffffffff810be989>] __lock_acquire+0x5a9/0x14b0
> [ 5954.064014]  [<ffffffff810be716>] ? __lock_acquire+0x336/0x14b0
> [ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> [ 5954.064014]  [<ffffffff810bf944>] lock_acquire+0xb4/0x140
> [ 5954.064014]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
> [ 5954.064014]  [<ffffffff81b19d86>] _raw_spin_lock+0x36/0x70
> [ 5954.064014]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
> [ 5954.064014]  [<ffffffff81177a1c>] shrink_dentry_list+0x5c/0x430
> [ 5954.064014]  [<ffffffff811779c0>] ? shrink_dentry_list+0x0/0x430
> [ 5954.064014]  [<ffffffff816b9c7e>] ? do_raw_spin_unlock+0x5e/0xb0
> [ 5954.064014]  [<ffffffff81177f2d>] __shrink_dcache_sb+0x13d/0x1c0
> [ 5954.064014]  [<ffffffff811784bf>] shrink_dcache_parent+0x32f/0x390
> [ 5954.064014]  [<ffffffff8116d31d>] dentry_unhash+0x3d/0x70
> [ 5954.064014]  [<ffffffff8116d3b0>] vfs_rmdir+0x60/0xe0
> [ 5954.064014]  [<ffffffff8116f673>] do_rmdir+0x113/0x130
> [ 5954.064014]  [<ffffffff8103a03a>] ? sysret_check+0x2e/0x69
> [ 5954.064014]  [<ffffffff81b19ae9>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [ 5954.064014]  [<ffffffff8116f6c5>] sys_unlinkat+0x35/0x40
> [ 5954.064014]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b

Seems that lockdep exploded.

> [ 5954.092916] BUG: unable to handle kernel NULL pointer dereference at           (null)
> [ 5954.093806] IP: [<ffffffff810ba1c8>] __bfs+0xc8/0x260
> [ 5954.094331] PGD 1084e5067 PUD 102368067 PMD 0
> [ 5954.094830] Oops: 0000 [#1] SMP
> [ 5954.095194] last sysfs file: /sys/devices/system/cpu/online
> [ 5954.095760] CPU 6
> [ 5954.095954] Modules linked in:
> [ 5954.096319]
> [ 5954.096483] Pid: 2927, comm: rm Not tainted 2.6.37-rc4-dgc+ #794 /Bochs
> [ 5954.096665] RIP: 0010:[<ffffffff810ba1c8>]  [<ffffffff810ba1c8>] __bfs+0xc8/0x260
> [ 5954.096665] RSP: 0018:ffff8801175539a8  EFLAGS: 00010046
> [ 5954.096665] RAX: ffffffff8267d980 RBX: ffffffff8267d980 RCX: ffff880117553a48
> [ 5954.096665] RDX: ffff8801175539d0 RSI: 0000000000000000 RDI: ffff880117553a48
> [ 5954.096665] RBP: ffff880117553a08 R08: 0000000000000000 R09: 0000000000000000
> [ 5954.096665] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
> [ 5954.096665] R13: ffffffff810b8e20 R14: ffff880117553a90 R15: 0000000000000000
> [ 5954.096665] FS:  00007f4594cf3700(0000) GS:ffff8800dfa00000(0000) knlGS:0000000000000000
> [ 5954.096665] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 5954.096665] CR2: 00007f2f21e89c60 CR3: 0000000110b0f000 CR4: 00000000000006e0
> [ 5954.096665] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 5954.096665] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 5954.096665] Process rm (pid: 2927, threadinfo ffff880117552000, task ffff88010ff10b00)
> [ 5954.096665] Stack:
> [ 5954.096665]  ffffffff8267d868 00007fffa02d1428 ffff8800ffffffff ffff880100000000
> [ 5954.096665]  000000000000b720 ffff880117553a48 ffffffff8267d868 ffff880117553a48
> [ 5954.096665]  0000000000000000 ffff88010ff10b00 0000000000000000 ffffffff81dacba0
> [ 5954.096665] Call Trace:
> [ 5954.096665]  [<ffffffff810ba4d2>] find_usage_backwards+0x42/0x80
> [ 5954.096665]  [<ffffffff810bcec4>] check_usage_backwards+0x64/0xf0
> [ 5954.096665]  [<ffffffff8104796f>] ? save_stack_trace+0x2f/0x50
> [ 5954.096665]  [<ffffffff810bce60>] ? check_usage_backwards+0x0/0xf0
> [ 5954.096665]  [<ffffffff810bd9a9>] mark_lock+0x1a9/0x440
> [ 5954.096665]  [<ffffffff810be989>] __lock_acquire+0x5a9/0x14b0
> [ 5954.096665]  [<ffffffff810be716>] ? __lock_acquire+0x336/0x14b0
> [ 5954.096665]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
> [ 5954.096665]  [<ffffffff810bf944>] lock_acquire+0xb4/0x140
> [ 5954.096665]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
> [ 5954.096665]  [<ffffffff81b19d86>] _raw_spin_lock+0x36/0x70
> [ 5954.096665]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
> [ 5954.096665]  [<ffffffff81177a1c>] shrink_dentry_list+0x5c/0x430
> [ 5954.096665]  [<ffffffff811779c0>] ? shrink_dentry_list+0x0/0x430
> [ 5954.096665]  [<ffffffff816b9c7e>] ? do_raw_spin_unlock+0x5e/0xb0
> [ 5954.096665]  [<ffffffff81177f2d>] __shrink_dcache_sb+0x13d/0x1c0
> [ 5954.096665]  [<ffffffff811784bf>] shrink_dcache_parent+0x32f/0x390
> [ 5954.096665]  [<ffffffff8116d31d>] dentry_unhash+0x3d/0x70
> [ 5954.096665]  [<ffffffff8116d3b0>] vfs_rmdir+0x60/0xe0
> [ 5954.096665]  [<ffffffff8116f673>] do_rmdir+0x113/0x130
> [ 5954.096665]  [<ffffffff8103a03a>] ? sysret_check+0x2e/0x69
> [ 5954.096665]  [<ffffffff81b19ae9>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [ 5954.096665]  [<ffffffff8116f6c5>] sys_unlinkat+0x35/0x40
> [ 5954.096665]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b
> [ 5954.096665] Code: 0a 89 05 dc 0f a8 01 48 8b 41 10 48 85 c0 0f 84 1f 01 00 00 48 8d 98 70 01 00 00 48 05 80 01 00 00 45 85 c0 48 0f 44 d8 4c 8b 3b <49> 8b 07 49 39
> [ 5954.096665] RIP  [<ffffffff810ba1c8>] __bfs+0xc8/0x260
> [ 5954.096665]  RSP <ffff8801175539a8>
> [ 5954.096665] CR2: 0000000000000000
> [ 5954.127991] ---[ end trace 85a6727c2d4e3d90 ]---

So the vfs-scale-working branch may not be entirely in the clear, seeing
as it touches the code lower in the call chain. However, I don't know
what could cause lockdep to go off the rails like this.

There is a sequence I used to hack around lockdep nesting restrictions,
following this pattern:

 repeat:
    spin_lock(&parent->d_lock);
    spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
    /* do stuff */
    spin_unlock(&parent->d_lock);
    spin_release(&dentry->d_lock.dep_map, 1, _RET_IP_);
    parent = dentry;
    spin_acquire(&this_parent->d_lock.dep_map, 0, 1, _RET_IP_);
    goto repeat;

It's not directly in this call chain, but I wonder if it could have
caused any problem?

^ permalink raw reply	[flat|nested] 26+ messages in thread


* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-11-27 10:15 Nick Piggin
  2010-11-27 15:04   ` Anca Emanuel
  2010-12-01 18:03 ` David Miller
@ 2010-12-07 11:25 ` Dave Chinner
  2010-12-07 15:24     ` Nick Piggin
  2010-12-07 21:56 ` Dave Chinner
  3 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2010-12-07 11:25 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, linux-kernel

On Sat, Nov 27, 2010 at 09:15:58PM +1100, Nick Piggin wrote:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working
> 
> Here is a new set of vfs patches for review, not that there was much interest
> last time they were posted. It is structured like:
> 
> * preparation patches
> * introduce new locks to take over dcache_lock, then remove it
> * cleaning up and reworking things for new locks
> * rcu-walk path walking
> * start on some fine grained locking steps

Just got this set of traces doing an 8-way parallel remove of 50
million inodes at about 40M inodes unlinked:

[ 5954.061633] BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1081
[ 5954.062466] in_atomic(): 0, irqs_disabled(): 1, pid: 2927, name: rm
[ 5954.063122] 3 locks held by rm/2927:
[ 5954.063476]  #0:  (&sb->s_type->i_mutex_key#12/1){+.+.+.}, at: [<ffffffff8116f5e1>] do_rmdir+0x81/0x130
[ 5954.064014]  #1:  (&sb->s_type->i_mutex_key#12){+.+.+.}, at: [<ffffffff8116d3a8>] vfs_rmdir+0x58/0xe0
[ 5954.064014]  #2:  (rcu_read_lock){.+.+..}, at: [<ffffffff811779c0>] shrink_dentry_list+0x0/0x430
[ 5954.064014] irq event stamp: 1484376719
[ 5954.064014] hardirqs last  enabled at (1484376719): [<ffffffff810ebf07>] __call_rcu+0xd7/0x1a0
[ 5954.064014] hardirqs last disabled at (1484376718): [<ffffffff810ebe7a>] __call_rcu+0x4a/0x1a0
[ 5954.064014] softirqs last  enabled at (1484376586): [<ffffffff8108b911>] __do_softirq+0x161/0x270
[ 5954.064014] softirqs last disabled at (1484376581): [<ffffffff8103af1c>] call_softirq+0x1c/0x50
[ 5954.064014] Pid: 2927, comm: rm Not tainted 2.6.37-rc4-dgc+ #794
[ 5954.064014] Call Trace:
[ 5954.064014]  [<ffffffff810b95b0>] ? print_irqtrace_events+0xd0/0xe0
[ 5954.064014]  [<ffffffff81076455>] __might_sleep+0xf5/0x130
[ 5954.064014]  [<ffffffff81b1e603>] do_page_fault+0x103/0x4f0
[ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
[ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
[ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
[ 5954.064014]  [<ffffffff81b19b28>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 5954.064014]  [<ffffffff810b8e20>] ? usage_match+0x0/0x20
[ 5954.064014]  [<ffffffff81b1af25>] page_fault+0x25/0x30
[ 5954.064014]  [<ffffffff810b8e20>] ? usage_match+0x0/0x20
[ 5954.064014]  [<ffffffff810ba1c8>] ? __bfs+0xc8/0x260
[ 5954.064014]  [<ffffffff810ba123>] ? __bfs+0x23/0x260
[ 5954.064014]  [<ffffffff810ba4d2>] find_usage_backwards+0x42/0x80
[ 5954.064014]  [<ffffffff810bcec4>] check_usage_backwards+0x64/0xf0
[ 5954.064014]  [<ffffffff8104796f>] ? save_stack_trace+0x2f/0x50
[ 5954.064014]  [<ffffffff810bce60>] ? check_usage_backwards+0x0/0xf0
[ 5954.064014]  [<ffffffff810bd9a9>] mark_lock+0x1a9/0x440
[ 5954.064014]  [<ffffffff810be989>] __lock_acquire+0x5a9/0x14b0
[ 5954.064014]  [<ffffffff810be716>] ? __lock_acquire+0x336/0x14b0
[ 5954.064014]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
[ 5954.064014]  [<ffffffff810bf944>] lock_acquire+0xb4/0x140
[ 5954.064014]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
[ 5954.064014]  [<ffffffff81b19d86>] _raw_spin_lock+0x36/0x70
[ 5954.064014]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
[ 5954.064014]  [<ffffffff81177a1c>] shrink_dentry_list+0x5c/0x430
[ 5954.064014]  [<ffffffff811779c0>] ? shrink_dentry_list+0x0/0x430
[ 5954.064014]  [<ffffffff816b9c7e>] ? do_raw_spin_unlock+0x5e/0xb0
[ 5954.064014]  [<ffffffff81177f2d>] __shrink_dcache_sb+0x13d/0x1c0
[ 5954.064014]  [<ffffffff811784bf>] shrink_dcache_parent+0x32f/0x390
[ 5954.064014]  [<ffffffff8116d31d>] dentry_unhash+0x3d/0x70
[ 5954.064014]  [<ffffffff8116d3b0>] vfs_rmdir+0x60/0xe0
[ 5954.064014]  [<ffffffff8116f673>] do_rmdir+0x113/0x130
[ 5954.064014]  [<ffffffff8103a03a>] ? sysret_check+0x2e/0x69
[ 5954.064014]  [<ffffffff81b19ae9>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 5954.064014]  [<ffffffff8116f6c5>] sys_unlinkat+0x35/0x40
[ 5954.064014]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b
[ 5954.092916] BUG: unable to handle kernel NULL pointer dereference at           (null)
[ 5954.093806] IP: [<ffffffff810ba1c8>] __bfs+0xc8/0x260
[ 5954.094331] PGD 1084e5067 PUD 102368067 PMD 0 
[ 5954.094830] Oops: 0000 [#1] SMP 
[ 5954.095194] last sysfs file: /sys/devices/system/cpu/online
[ 5954.095760] CPU 6 
[ 5954.095954] Modules linked in:
[ 5954.096319] 
[ 5954.096483] Pid: 2927, comm: rm Not tainted 2.6.37-rc4-dgc+ #794 /Bochs
[ 5954.096665] RIP: 0010:[<ffffffff810ba1c8>]  [<ffffffff810ba1c8>] __bfs+0xc8/0x260
[ 5954.096665] RSP: 0018:ffff8801175539a8  EFLAGS: 00010046
[ 5954.096665] RAX: ffffffff8267d980 RBX: ffffffff8267d980 RCX: ffff880117553a48
[ 5954.096665] RDX: ffff8801175539d0 RSI: 0000000000000000 RDI: ffff880117553a48
[ 5954.096665] RBP: ffff880117553a08 R08: 0000000000000000 R09: 0000000000000000
[ 5954.096665] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[ 5954.096665] R13: ffffffff810b8e20 R14: ffff880117553a90 R15: 0000000000000000
[ 5954.096665] FS:  00007f4594cf3700(0000) GS:ffff8800dfa00000(0000) knlGS:0000000000000000
[ 5954.096665] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 5954.096665] CR2: 00007f2f21e89c60 CR3: 0000000110b0f000 CR4: 00000000000006e0
[ 5954.096665] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5954.096665] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 5954.096665] Process rm (pid: 2927, threadinfo ffff880117552000, task ffff88010ff10b00)
[ 5954.096665] Stack:
[ 5954.096665]  ffffffff8267d868 00007fffa02d1428 ffff8800ffffffff ffff880100000000
[ 5954.096665]  000000000000b720 ffff880117553a48 ffffffff8267d868 ffff880117553a48
[ 5954.096665]  0000000000000000 ffff88010ff10b00 0000000000000000 ffffffff81dacba0
[ 5954.096665] Call Trace:
[ 5954.096665]  [<ffffffff810ba4d2>] find_usage_backwards+0x42/0x80
[ 5954.096665]  [<ffffffff810bcec4>] check_usage_backwards+0x64/0xf0
[ 5954.096665]  [<ffffffff8104796f>] ? save_stack_trace+0x2f/0x50
[ 5954.096665]  [<ffffffff810bce60>] ? check_usage_backwards+0x0/0xf0
[ 5954.096665]  [<ffffffff810bd9a9>] mark_lock+0x1a9/0x440
[ 5954.096665]  [<ffffffff810be989>] __lock_acquire+0x5a9/0x14b0
[ 5954.096665]  [<ffffffff810be716>] ? __lock_acquire+0x336/0x14b0
[ 5954.096665]  [<ffffffff810645b8>] ? pvclock_clocksource_read+0x58/0xd0
[ 5954.096665]  [<ffffffff810bf944>] lock_acquire+0xb4/0x140
[ 5954.096665]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
[ 5954.096665]  [<ffffffff81b19d86>] _raw_spin_lock+0x36/0x70
[ 5954.096665]  [<ffffffff81177a1c>] ? shrink_dentry_list+0x5c/0x430
[ 5954.096665]  [<ffffffff81177a1c>] shrink_dentry_list+0x5c/0x430
[ 5954.096665]  [<ffffffff811779c0>] ? shrink_dentry_list+0x0/0x430
[ 5954.096665]  [<ffffffff816b9c7e>] ? do_raw_spin_unlock+0x5e/0xb0
[ 5954.096665]  [<ffffffff81177f2d>] __shrink_dcache_sb+0x13d/0x1c0
[ 5954.096665]  [<ffffffff811784bf>] shrink_dcache_parent+0x32f/0x390
[ 5954.096665]  [<ffffffff8116d31d>] dentry_unhash+0x3d/0x70
[ 5954.096665]  [<ffffffff8116d3b0>] vfs_rmdir+0x60/0xe0
[ 5954.096665]  [<ffffffff8116f673>] do_rmdir+0x113/0x130
[ 5954.096665]  [<ffffffff8103a03a>] ? sysret_check+0x2e/0x69
[ 5954.096665]  [<ffffffff81b19ae9>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 5954.096665]  [<ffffffff8116f6c5>] sys_unlinkat+0x35/0x40
[ 5954.096665]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b
[ 5954.096665] Code: 0a 89 05 dc 0f a8 01 48 8b 41 10 48 85 c0 0f 84 1f 01 00 00 48 8d 98 70 01 00 00 48 05 80 01 00 00 45 85 c0 48 0f 44 d8 4c 8b 3b <49> 8b 07 49 39  
[ 5954.096665] RIP  [<ffffffff810ba1c8>] __bfs+0xc8/0x260
[ 5954.096665]  RSP <ffff8801175539a8>
[ 5954.096665] CR2: 0000000000000000
[ 5954.127991] ---[ end trace 85a6727c2d4e3d90 ]---
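
For context, the parallel-remove workload that triggered the traces above can be sketched as follows. This is a scaled-down illustration only: Dave's actual harness, filesystem, paths, and ~50-million-inode scale are not shown in this thread, so the names and sizes below are hypothetical.

```shell
#!/bin/sh
# Scaled-down sketch of an 8-way parallel unlink workload:
# N concurrent "rm -r" processes, each tearing down its own subtree.
# Each rm exercises the do_rmdir() -> vfs_rmdir() -> dentry_unhash()
# -> shrink_dcache_parent() path seen in the backtraces above.
set -e
SCRATCH=$(mktemp -d)
N=8                  # number of concurrent removers
PER_DIR=200          # files per subtree (illustrative, not 50M)

for i in $(seq 1 "$N"); do
    mkdir "$SCRATCH/dir$i"
    for f in $(seq 1 "$PER_DIR"); do
        : > "$SCRATCH/dir$i/f$f"
    done
done

# Remove all subtrees in parallel
for i in $(seq 1 "$N"); do
    rm -r "$SCRATCH/dir$i" &
done
wait

LEFT=$(ls "$SCRATCH" | wc -l)   # 0 once every subtree is gone
rmdir "$SCRATCH"
```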

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-12-01 18:03 ` David Miller
@ 2010-12-03 16:55   ` Nick Piggin
  0 siblings, 0 replies; 26+ messages in thread
From: Nick Piggin @ 2010-12-03 16:55 UTC (permalink / raw)
  To: David Miller; +Cc: npiggin, linux-fsdevel, linux-kernel

On Wed, Dec 01, 2010 at 10:03:30AM -0800, David Miller wrote:
> From: Nick Piggin <npiggin@kernel.dk>
> Date: Sat, 27 Nov 2010 21:15:58 +1100
> 
> > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working
> 
> Just want to say that I've been running this code for the past few days on
> my 128 cpu box and it seems quite sturdy.
> 
> If there are any kinds of vfs benchmarks you want me to run on this
> machine both with and without the scaling changes, just let me know.

Hi Dave, great, thanks!

I am mostly interested in single-thread performance on different ISAs
and different microarchitectures.

Microbenchmarks are easy, I attached a couple of quick ones I use in
an offline mail.

Single-threaded (preloadindex=false), cached and uncached, git diff on
an unmodified tree is slightly more real world.

And then anything else you can think of.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-11-27 10:15 Nick Piggin
  2010-11-27 15:04   ` Anca Emanuel
@ 2010-12-01 18:03 ` David Miller
  2010-12-03 16:55   ` Nick Piggin
  2010-12-07 11:25 ` Dave Chinner
  2010-12-07 21:56 ` Dave Chinner
  3 siblings, 1 reply; 26+ messages in thread
From: David Miller @ 2010-12-01 18:03 UTC (permalink / raw)
  To: npiggin; +Cc: linux-fsdevel, linux-kernel

From: Nick Piggin <npiggin@kernel.dk>
Date: Sat, 27 Nov 2010 21:15:58 +1100

> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working

Just want to say that I've been running this code for the past few days on
my 128 cpu box and it seems quite sturdy.

If there are any kinds of vfs benchmarks you want me to run on this
machine both with and without the scaling changes, just let me know.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-11-28  3:28     ` Nick Piggin
  (?)
@ 2010-11-28  6:24     ` Sedat Dilek
  -1 siblings, 0 replies; 26+ messages in thread
From: Sedat Dilek @ 2010-11-28  6:24 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Anca Emanuel, linux-fsdevel, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1292 bytes --]

On Sun, Nov 28, 2010 at 4:28 AM, Nick Piggin <npiggin@kernel.dk> wrote:
> On Sat, Nov 27, 2010 at 05:04:08PM +0200, Anca Emanuel wrote:
>> On Sat, Nov 27, 2010 at 12:15 PM, Nick Piggin <npiggin@kernel.dk> wrote:
>> >
>> > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working
>>
>> I get:
>>   CC [M]  fs/cifs/inode.o
>> fs/cifs/inode.c: In function ‘inode_has_hashed_dentries’:
>> fs/cifs/inode.c:807: error: ‘dcache_inode_lock’ undeclared (first use
>> in this function)
>> fs/cifs/inode.c:807: error: (Each undeclared identifier is reported only once
>> fs/cifs/inode.c:807: error: for each function it appears in.)
>> make[3]: *** [fs/cifs/inode.o] Error 1
>> make[2]: *** [fs/cifs] Error 2
>> make[1]: *** [fs] Error 2
>>
>> I used the latest mainline.
>
> Sorry, missed a conversion, it just needs to be changed to
> inode->i_lock. Pushed the fix to git.
>

I attached a patch to my posting in [1] but it was somehow "eaten"
(here in my mbox the patch is definitely attached).
Patchwork is also not listing my patch.
Anyway, it's fixed - that's good.

I have to check why I get a Call trace with systemd-v15 but not with
sysvinit package here on Debian.

- Sedat -

[1] http://lkml.org/lkml/2010/11/27/145

[-- Attachment #2: fs-cifs-inode.c-Fix-error-dcache_inode_lock-undeclared.patch --]
[-- Type: plain/text, Size: 612 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-11-27 15:04   ` Anca Emanuel
@ 2010-11-28  3:28     ` Nick Piggin
  -1 siblings, 0 replies; 26+ messages in thread
From: Nick Piggin @ 2010-11-28  3:28 UTC (permalink / raw)
  To: Anca Emanuel, Sedat Dilek; +Cc: Nick Piggin, linux-fsdevel, linux-kernel

On Sat, Nov 27, 2010 at 05:04:08PM +0200, Anca Emanuel wrote:
> On Sat, Nov 27, 2010 at 12:15 PM, Nick Piggin <npiggin@kernel.dk> wrote:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working
> 
> I get:
>   CC [M]  fs/cifs/inode.o
> fs/cifs/inode.c: In function ‘inode_has_hashed_dentries’:
> fs/cifs/inode.c:807: error: ‘dcache_inode_lock’ undeclared (first use
> in this function)
> fs/cifs/inode.c:807: error: (Each undeclared identifier is reported only once
> fs/cifs/inode.c:807: error: for each function it appears in.)
> make[3]: *** [fs/cifs/inode.o] Error 1
> make[2]: *** [fs/cifs] Error 2
> make[1]: *** [fs] Error 2
> 
> I used the latest mainline.

Sorry, missed a conversion, it just needs to be changed to
inode->i_lock. Pushed the fix to git.
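
Given the compile error quoted above ('dcache_inode_lock' undeclared at fs/cifs/inode.c:807), the missed conversion presumably amounts to a lock substitution of this shape. This is a hedged reconstruction from the error message and the sentence above, not the patch actually pushed:

```diff
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ inode_has_hashed_dentries() @@
-	spin_lock(&dcache_inode_lock);
+	spin_lock(&inode->i_lock);
 	/* ... walk the inode's dentry alias list ... */
-	spin_unlock(&dcache_inode_lock);
+	spin_unlock(&inode->i_lock);
```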

Thanks,
Nick


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/46] rcu-walk and dcache scaling
  2010-11-27 10:15 Nick Piggin
@ 2010-11-27 15:04   ` Anca Emanuel
  2010-12-01 18:03 ` David Miller
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 26+ messages in thread
From: Anca Emanuel @ 2010-11-27 15:04 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, linux-kernel

On Sat, Nov 27, 2010 at 12:15 PM, Nick Piggin <npiggin@kernel.dk> wrote:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working

I get:
  CC [M]  fs/cifs/inode.o
fs/cifs/inode.c: In function ‘inode_has_hashed_dentries’:
fs/cifs/inode.c:807: error: ‘dcache_inode_lock’ undeclared (first use
in this function)
fs/cifs/inode.c:807: error: (Each undeclared identifier is reported only once
fs/cifs/inode.c:807: error: for each function it appears in.)
make[3]: *** [fs/cifs/inode.o] Error 1
make[2]: *** [fs/cifs] Error 2
make[1]: *** [fs] Error 2

I used the latest mainline.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 00/46] rcu-walk and dcache scaling
@ 2010-11-27 10:15 Nick Piggin
  2010-11-27 15:04   ` Anca Emanuel
                   ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: Nick Piggin @ 2010-11-27 10:15 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel


git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working

Here is a new set of vfs patches for review, not that there was much interest
last time they were posted. It is structured like:

* preparation patches
* introduce new locks to take over dcache_lock, then remove it
* cleaning up and reworking things for new locks
* rcu-walk path walking
* start on some fine grained locking steps

Thanks,
Nick


Nick Piggin (46):
  Revert "fs: use RCU read side protection in d_validate"
  fs: d_validate fixes
  kernel: kmem_ptr_validate considered harmful
  fs: dcache documentation cleanup
  fs: change d_delete semantics
  cifs: dont overwrite dentry name in d_revalidate
  jfs: dont overwrite dentry name in d_revalidate
  fs: change d_compare for rcu-walk
  fs: change d_hash for rcu-walk
  hostfs: simplify locking
  fs: dcache scale hash
  fs: dcache scale lru
  fs: dcache scale dentry refcount
  fs: dcache scale d_unhashed
  fs: dcache scale subdirs
  fs: scale inode alias list
  fs: Use rename lock and RCU for multi-step operations
  fs: increase d_name lock coverage
  fs: dcache remove dcache_lock
  fs: dcache avoid starvation in dcache multi-step operations
  fs: dcache reduce dput locking
  fs: dcache reduce locking in d_alloc
  fs: dcache reduce dcache_inode_lock
  fs: dcache rationalise dget variants
  fs: dcache reduce d_parent locking
  fs: dcache reduce prune_one_dentry locking
  fs: reduce dcache_inode_lock width in lru scanning
  fs: use RCU in shrink_dentry_list to reduce lock nesting
  fs: consolidate dentry kill sequence
  fs: icache RCU free inodes
  fs: avoid inode RCU freeing for pseudo fs
  kernel: optimise seqlock
  fs: rcu-walk for path lookup
  fs: fs_struct use seqlock
  fs: dcache remove d_mounted
  fs: dcache reduce branches in lookup path
  fs: cache optimise dentry and inode for rcu-walk
  fs: prefetch inode data in dcache lookup
  fs: d_revalidate_rcu for rcu-walk
  fs: provide rcu-walk aware permission i_ops
  fs: provide simple rcu-walk ACL implementation
  kernel: add bl_list
  bit_spinlock: add required includes
  fs: dcache per-bucket dcache hash locking
  fs: dcache per-inode inode alias locking
  fs: improve scalability of pseudo filesystems

 Documentation/filesystems/Locking            |   23 +-
 Documentation/filesystems/dentry-locking.txt |  174 ----
 Documentation/filesystems/path-lookup.txt    |  247 ++++++
 Documentation/filesystems/porting            |   45 +-
 Documentation/filesystems/vfs.txt            |   54 +-
 arch/ia64/kernel/perfmon.c                   |    4 +-
 arch/powerpc/platforms/cell/spufs/inode.c    |   18 +-
 drivers/infiniband/hw/ipath/ipath_fs.c       |    8 +-
 drivers/infiniband/hw/qib/qib_fs.c           |    5 +-
 drivers/staging/autofs/root.c                |    2 +-
 drivers/staging/pohmelfs/inode.c             |   11 +-
 drivers/staging/pohmelfs/path_entry.c        |   17 +-
 drivers/staging/smbfs/cache.c                |   10 +-
 drivers/usb/core/inode.c                     |   12 +-
 fs/9p/vfs_dentry.c                           |    4 +-
 fs/9p/vfs_inode.c                            |   39 +-
 fs/adfs/dir.c                                |   13 +-
 fs/adfs/super.c                              |   11 +-
 fs/affs/amigaffs.c                           |    4 +-
 fs/affs/namei.c                              |   66 +-
 fs/affs/super.c                              |   11 +-
 fs/afs/dir.c                                 |    6 +-
 fs/afs/super.c                               |   10 +-
 fs/anon_inodes.c                             |    4 +-
 fs/autofs4/autofs_i.h                        |   21 +-
 fs/autofs4/expire.c                          |  143 ++--
 fs/autofs4/inode.c                           |    2 +-
 fs/autofs4/root.c                            |   78 +-
 fs/autofs4/waitq.c                           |   23 +-
 fs/befs/linuxvfs.c                           |   10 +-
 fs/bfs/inode.c                               |    9 +-
 fs/block_dev.c                               |    9 +-
 fs/btrfs/acl.c                               |   19 +-
 fs/btrfs/ctree.h                             |    4 +-
 fs/btrfs/export.c                            |    4 +-
 fs/btrfs/inode.c                             |   27 +-
 fs/ceph/dir.c                                |   21 +-
 fs/ceph/inode.c                              |   27 +-
 fs/ceph/mds_client.c                         |    2 +-
 fs/cifs/cifsfs.c                             |    9 +-
 fs/cifs/dir.c                                |   75 +-
 fs/cifs/inode.c                              |   14 +-
 fs/cifs/link.c                               |    4 +-
 fs/cifs/readdir.c                            |    6 +-
 fs/coda/cache.c                              |    4 +-
 fs/coda/dir.c                                |    8 +-
 fs/coda/inode.c                              |    9 +-
 fs/configfs/configfs_internal.h              |    4 +-
 fs/configfs/dir.c                            |   13 +-
 fs/configfs/inode.c                          |    8 +-
 fs/dcache.c                                  | 1224 ++++++++++++++++++--------
 fs/ecryptfs/dentry.c                         |   15 +-
 fs/ecryptfs/inode.c                          |    8 +-
 fs/ecryptfs/main.c                           |    4 +-
 fs/ecryptfs/super.c                          |   12 +-
 fs/efs/super.c                               |    9 +-
 fs/exofs/super.c                             |    9 +-
 fs/exportfs/expfs.c                          |   14 +-
 fs/ext2/acl.c                                |   11 +-
 fs/ext2/acl.h                                |    8 +-
 fs/ext2/file.c                               |    2 +-
 fs/ext2/namei.c                              |    4 +-
 fs/ext2/super.c                              |    9 +-
 fs/ext3/acl.c                                |   11 +-
 fs/ext3/acl.h                                |    8 +-
 fs/ext3/file.c                               |    2 +-
 fs/ext3/namei.c                              |    4 +-
 fs/ext3/super.c                              |    9 +-
 fs/ext4/acl.c                                |   11 +-
 fs/ext4/acl.h                                |    4 +-
 fs/ext4/file.c                               |    2 +-
 fs/ext4/namei.c                              |    4 +-
 fs/ext4/super.c                              |    9 +-
 fs/fat/inode.c                               |   13 +-
 fs/fat/namei_msdos.c                         |   24 +-
 fs/fat/namei_vfat.c                          |   69 +-
 fs/filesystems.c                             |    3 +
 fs/freevxfs/vxfs_inode.c                     |    9 +-
 fs/fs_struct.c                               |   10 +
 fs/fuse/dir.c                                |    9 +-
 fs/fuse/inode.c                              |   13 +-
 fs/generic_acl.c                             |   20 +
 fs/gfs2/dentry.c                             |    5 +-
 fs/gfs2/export.c                             |    4 +-
 fs/gfs2/ops_fstype.c                         |    2 +-
 fs/gfs2/ops_inode.c                          |    2 +-
 fs/gfs2/super.c                              |    9 +-
 fs/hfs/dir.c                                 |    2 +-
 fs/hfs/hfs_fs.h                              |    7 +-
 fs/hfs/string.c                              |   17 +-
 fs/hfs/super.c                               |   11 +-
 fs/hfsplus/dir.c                             |    2 +-
 fs/hfsplus/hfsplus_fs.h                      |    7 +-
 fs/hfsplus/super.c                           |   12 +-
 fs/hfsplus/unicode.c                         |   17 +-
 fs/hostfs/hostfs_kern.c                      |   37 +-
 fs/hpfs/dentry.c                             |   26 +-
 fs/hpfs/super.c                              |    9 +-
 fs/hppfs/hppfs.c                             |    9 +-
 fs/hugetlbfs/inode.c                         |    9 +-
 fs/inode.c                                   |   16 +-
 fs/isofs/inode.c                             |  127 ++--
 fs/isofs/namei.c                             |    5 +-
 fs/jffs2/super.c                             |    9 +-
 fs/jfs/namei.c                               |   60 +-
 fs/jfs/super.c                               |   12 +-
 fs/libfs.c                                   |   63 +-
 fs/locks.c                                   |    2 +-
 fs/logfs/inode.c                             |    9 +-
 fs/minix/inode.c                             |    9 +-
 fs/minix/namei.c                             |    2 +-
 fs/namei.c                                   |  855 +++++++++++++++----
 fs/namespace.c                               |   29 +-
 fs/ncpfs/dir.c                               |   66 +-
 fs/ncpfs/inode.c                             |   11 +-
 fs/ncpfs/ncplib_kernel.h                     |   16 +-
 fs/nfs/dir.c                                 |   17 +-
 fs/nfs/getroot.c                             |   10 +-
 fs/nfs/inode.c                               |    9 +-
 fs/nfs/namespace.c                           |   17 +-
 fs/nfs/unlink.c                              |    2 +-
 fs/nfsd/vfs.c                                |    5 +-
 fs/nilfs2/super.c                            |   12 +-
 fs/notify/fsnotify.c                         |    8 +-
 fs/ntfs/inode.c                              |    9 +-
 fs/ocfs2/dcache.c                            |   10 +-
 fs/ocfs2/dlmfs/dlmfs.c                       |    9 +-
 fs/ocfs2/export.c                            |    4 +-
 fs/ocfs2/namei.c                             |   10 +-
 fs/ocfs2/super.c                             |    9 +-
 fs/openpromfs/inode.c                        |    9 +-
 fs/pipe.c                                    |   10 +-
 fs/proc/base.c                               |   14 +-
 fs/proc/generic.c                            |    4 +-
 fs/proc/inode.c                              |    9 +-
 fs/proc/proc_sysctl.c                        |   18 +-
 fs/qnx4/inode.c                              |    9 +-
 fs/reiserfs/super.c                          |    9 +-
 fs/reiserfs/xattr.c                          |    2 +-
 fs/romfs/super.c                             |    9 +-
 fs/squashfs/super.c                          |    9 +-
 fs/super.c                                   |    3 +-
 fs/sysfs/dir.c                               |   22 +-
 fs/sysv/inode.c                              |    9 +-
 fs/sysv/namei.c                              |    5 +-
 fs/sysv/super.c                              |    2 +-
 fs/ubifs/super.c                             |   10 +-
 fs/udf/super.c                               |    9 +-
 fs/ufs/super.c                               |    9 +-
 fs/xfs/linux-2.6/xfs_acl.c                   |    8 +-
 fs/xfs/linux-2.6/xfs_iops.c                  |    8 +-
 fs/xfs/xfs_acl.h                             |    4 +-
 fs/xfs/xfs_iget.c                            |   13 +-
 include/linux/bit_spinlock.h                 |    4 +
 include/linux/dcache.h                       |  183 ++--
 include/linux/fs.h                           |   61 +-
 include/linux/fs_struct.h                    |    3 +
 include/linux/fsnotify.h                     |    2 -
 include/linux/fsnotify_backend.h             |   11 +-
 include/linux/generic_acl.h                  |    1 +
 include/linux/list_bl.h                      |  141 +++
 include/linux/namei.h                        |   16 +-
 include/linux/ncp_fs.h                       |    4 +-
 include/linux/posix_acl.h                    |   19 +
 include/linux/rculist_bl.h                   |  128 +++
 include/linux/seqlock.h                      |   67 ++-
 include/linux/slab.h                         |    2 -
 ipc/mqueue.c                                 |    9 +-
 kernel/cgroup.c                              |   29 +-
 mm/filemap.c                                 |    3 -
 mm/shmem.c                                   |   15 +-
 mm/slab.c                                    |   32 +-
 mm/slob.c                                    |    5 -
 mm/slub.c                                    |   40 -
 mm/util.c                                    |   21 -
 net/socket.c                                 |    5 +-
 net/sunrpc/rpc_pipe.c                        |   14 +-
 security/selinux/selinuxfs.c                 |   16 +-
 security/tomoyo/realpath.c                   |    1 +
 179 files changed, 3848 insertions(+), 1711 deletions(-)
 delete mode 100644 Documentation/filesystems/dentry-locking.txt
 create mode 100644 Documentation/filesystems/path-lookup.txt
 create mode 100644 include/linux/list_bl.h
 create mode 100644 include/linux/rculist_bl.h


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2010-12-12 14:54 UTC | newest]

Thread overview: 26+ messages
2010-11-27 19:20 [PATCH 00/46] rcu-walk and dcache scaling Sedat Dilek
2010-11-27 20:53 ` Sedat Dilek
  -- strict thread matches above, loose matches on Subject: below --
2010-11-27 10:15 Nick Piggin
2010-11-27 15:04 ` Anca Emanuel
2010-11-28  3:28   ` Nick Piggin
2010-11-28  6:24     ` Sedat Dilek
2010-12-01 18:03 ` David Miller
2010-12-03 16:55   ` Nick Piggin
2010-12-07 11:25 ` Dave Chinner
2010-12-07 15:24   ` Nick Piggin
2010-12-07 15:49     ` Peter Zijlstra
2010-12-07 15:59       ` Nick Piggin
2010-12-07 16:23         ` Peter Zijlstra
2010-12-08  3:28     ` Nick Piggin
2010-12-07 21:56 ` Dave Chinner
2010-12-08  1:47   ` Nick Piggin
2010-12-08  3:32     ` Dave Chinner
2010-12-08  4:28       ` Dave Chinner
2010-12-08  7:09         ` Nick Piggin
2010-12-10 20:32           ` Paul E. McKenney
2010-12-12 14:54             ` Paul E. McKenney
