* tree rcu: call_rcu scalability problem?
@ 2009-09-02  9:48 Nick Piggin
  2009-09-02 12:27 ` Nick Piggin
  0 siblings, 1 reply; 14+ messages in thread
From: Nick Piggin @ 2009-09-02  9:48 UTC
  To: Paul McKenney, Linux Kernel Mailing List

Hi Paul,

I'm testing out scalability of some vfs code paths, and I'm seeing
a problem with call_rcu. This is a 2s8c opteron system, so nothing
crazy.

I'll show you the profile results for 1-8 threads:

1:
 29768 total                                      0.0076
 15550 default_idle                              48.5938
  1340 __d_lookup                                 3.6413
   954 __link_path_walk                           0.2559
   816 system_call_after_swapgs                   8.0792
   680 kmem_cache_alloc                           1.4167
   669 dput                                       1.1946
   591 __call_rcu                                 2.0521

2:
 56733 total                                      0.0145
 20074 default_idle                              62.7313
  3075 __call_rcu                                10.6771
  2650 __d_lookup                                 7.2011
  2019 dput                                       3.6054

4:
 98889 total                                      0.0253
 21759 default_idle                              67.9969
 10994 __call_rcu                                38.1736
  5185 __d_lookup                                14.0897
  4475 dput                                       7.9911

8:
170391 total                                      0.0437
 31815 __call_rcu                               110.4688
 12958 dput                                      23.1393
 10417 __d_lookup                                28.3071

Of course there are other scalability factors involved too, but
__call_rcu is taking 54 times more CPU to do 8 times the amount
of work from 1-8 threads, or a factor of 6.7 slowdown.

This is with tree RCU.

#
# RCU Subsystem
#
# CONFIG_CLASSIC_RCU is not set
CONFIG_TREE_RCU=y
# CONFIG_PREEMPT_RCU is not set
# CONFIG_RCU_TRACE is not set
CONFIG_RCU_FANOUT=64
# CONFIG_RCU_FANOUT_EXACT is not set
# CONFIG_TREE_RCU_TRACE is not set
# CONFIG_PREEMPT_RCU_TRACE is not set

Testing classic RCU showed its call_rcu seemed to scale better, only
getting up to about 10,000 samples at 8 threads.

You'd need my vfs scalability patches to reproduce this exactly, but
the workload is just close(open(fd)), which RCU-frees a lot of file
structs. I can certainly get more detailed profiles or test patches
for you though if you have any ideas.

Thanks,
Nick



* Re: tree rcu: call_rcu scalability problem?
  2009-09-02  9:48 tree rcu: call_rcu scalability problem? Nick Piggin
@ 2009-09-02 12:27 ` Nick Piggin
  2009-09-02 15:19   ` Paul E. McKenney
  2009-09-02 19:17   ` Peter Zijlstra
  0 siblings, 2 replies; 14+ messages in thread
From: Nick Piggin @ 2009-09-02 12:27 UTC
  To: Paul McKenney, Linux Kernel Mailing List

On Wed, Sep 02, 2009 at 11:48:35AM +0200, Nick Piggin wrote:
> Hi Paul,
> 
> I'm testing out scalability of some vfs code paths, and I'm seeing
> a problem with call_rcu. This is a 2s8c opteron system, so nothing
> crazy.
> 
> I'll show you the profile results for 1-8 threads:
> 
> 1:
>  29768 total                                      0.0076
>  15550 default_idle                              48.5938
>   1340 __d_lookup                                 3.6413
>    954 __link_path_walk                           0.2559
>    816 system_call_after_swapgs                   8.0792
>    680 kmem_cache_alloc                           1.4167
>    669 dput                                       1.1946
>    591 __call_rcu                                 2.0521
> 
> 2:
>  56733 total                                      0.0145
>  20074 default_idle                              62.7313
>   3075 __call_rcu                                10.6771
>   2650 __d_lookup                                 7.2011
>   2019 dput                                       3.6054
> 
> 4:
>  98889 total                                      0.0253
>  21759 default_idle                              67.9969
>  10994 __call_rcu                                38.1736
>   5185 __d_lookup                                14.0897
>   4475 dput                                       7.9911
> 
> 8:
> 170391 total                                      0.0437
>  31815 __call_rcu                               110.4688
>  12958 dput                                      23.1393
>  10417 __d_lookup                                28.3071
> 
> Of course there are other scalability factors involved too, but
> __call_rcu is taking 54 times more CPU to do 8 times the amount
> of work from 1-8 threads, or a factor of 6.7 slowdown.
> 
> This is with tree RCU.

It seems like nearly 2/3 of the cost is here:
        /* Add the callback to our list. */
        *rdp->nxttail[RCU_NEXT_TAIL] = head; <<<
        rdp->nxttail[RCU_NEXT_TAIL] = &head->next;

In loading the pointer to the next tail pointer. If I'm reading the profile
correctly. Can't see why that should be a problem though...
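
For reference, a minimal user-space sketch of that tail-pointer append
idiom (simplified; the real rcu_data keeps an array of such tail
pointers, one per grace-period stage):

#include <stddef.h>

struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
};

/*
 * nxttail always points at the slot that should receive the next
 * callback: the address of the list head while the list is empty,
 * otherwise the address of the last element's ->next.  Enqueue is
 * therefore two stores with no list walk.
 */
struct cb_list {
	struct rcu_head *head;
	struct rcu_head **nxttail;
};

static void cb_list_init(struct cb_list *l)
{
	l->head = NULL;
	l->nxttail = &l->head;
}

static void cb_list_enqueue(struct cb_list *l, struct rcu_head *rh)
{
	rh->next = NULL;
	*l->nxttail = rh;	/* store through the saved tail pointer */
	l->nxttail = &rh->next;
}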

ffffffff8107dee0 <__call_rcu>: /* __call_rcu total: 320971 100.000 */
   697  0.2172 :ffffffff8107dee0:       push   %r12
   228  0.0710 :ffffffff8107dee2:       push   %rbp
   133  0.0414 :ffffffff8107dee3:       mov    %rdx,%rbp
   918  0.2860 :ffffffff8107dee6:       push   %rbx
   316  0.0985 :ffffffff8107dee7:       mov    %rsi,0x8(%rdi)
   257  0.0801 :ffffffff8107deeb:       movq   $0x0,(%rdi)
  1660  0.5172 :ffffffff8107def2:       mfence
 27730  8.6394 :ffffffff8107def5:       pushfq
 13153  4.0979 :ffffffff8107def6:       pop    %r12
   903  0.2813 :ffffffff8107def8:       cli
  2562  0.7982 :ffffffff8107def9:       mov    %gs:0xde68,%eax
  1784  0.5558 :ffffffff8107df01:       cltq
               :ffffffff8107df03:       mov    0x60(%rdx,%rax,8),%rbx
               :ffffffff8107df08:       pushfq
  3494  1.0886 :ffffffff8107df09:       pop    %rdx
   896  0.2792 :ffffffff8107df0a:       cli
  2655  0.8272 :ffffffff8107df0b:       mov    0xd0(%rbp),%rcx
  1800  0.5608 :ffffffff8107df12:       cmp    (%rbx),%rcx
    21  0.0065 :ffffffff8107df15:       je     ffffffff8107df32 <__call_rcu+0x52
               :ffffffff8107df17:       mov    0x40(%rbx),%rax
    81  0.0252 :ffffffff8107df1b:       mov    %rcx,(%rbx)
     3 9.3e-04 :ffffffff8107df1e:       mov    %rax,0x38(%rbx)
               :ffffffff8107df22:       mov    0x48(%rbx),%rax
               :ffffffff8107df26:       mov    %rax,0x40(%rbx)
               :ffffffff8107df2a:       mov    0x50(%rbx),%rax
               :ffffffff8107df2e:       mov    %rax,0x48(%rbx)
               :ffffffff8107df32:       push   %rdx
  1194  0.3720 :ffffffff8107df33:       popfq
  9518  2.9654 :ffffffff8107df34:       pushfq
  4179  1.3020 :ffffffff8107df35:       pop    %rdx
  1277  0.3979 :ffffffff8107df36:       cli
  2546  0.7932 :ffffffff8107df37:       mov    0xc8(%rbp),%rax
  1748  0.5446 :ffffffff8107df3e:       cmp    %rax,0x8(%rbx)
     5  0.0016 :ffffffff8107df42:       je     ffffffff8107df57 <__call_rcu+0x77
               :ffffffff8107df44:       movb   $0x1,0x19(%rbx)
     2 6.2e-04 :ffffffff8107df48:       movb   $0x0,0x18(%rbx)
               :ffffffff8107df4c:       mov    0xc8(%rbp),%rax
               :ffffffff8107df53:       mov    %rax,0x8(%rbx)
   921  0.2869 :ffffffff8107df57:       push   %rdx
   151  0.0470 :ffffffff8107df58:       popfq
183507 57.1725 :ffffffff8107df59:       mov    0x50(%rbx),%rax
   995  0.3100 :ffffffff8107df5d:       mov    %rdi,(%rax)
     2 6.2e-04 :ffffffff8107df60:       mov    %rdi,0x50(%rbx)
    18  0.0056 :ffffffff8107df64:       mov    0xd0(%rbp),%rdx
   940  0.2929 :ffffffff8107df6b:       mov    0xc8(%rbp),%rax
    15  0.0047 :ffffffff8107df72:       cmp    %rax,%rdx
     1 3.1e-04 :ffffffff8107df75:       je     ffffffff8107dfb0 <__call_rcu+0xd0
   787  0.2452 :ffffffff8107df77:       mov    0x58(%rbx),%rax
    58  0.0181 :ffffffff8107df7b:       inc    %rax
     2 6.2e-04 :ffffffff8107df7e:       mov    %rax,0x58(%rbx)
  1679  0.5231 :ffffffff8107df82:       movslq 0x4988fb(%rip),%rdx        # ffff
    40  0.0125 :ffffffff8107df89:       cmp    %rdx,%rax
     5  0.0016 :ffffffff8107df8c:       jg     ffffffff8107dfd7 <__call_rcu+0xf7
   588  0.1832 :ffffffff8107df8e:       mov    0xe0(%rbp),%rdx
    84  0.0262 :ffffffff8107df95:       mov    0x51f924(%rip),%rax        # ffff
     5  0.0016 :ffffffff8107df9c:       cmp    %rax,%rdx
   505  0.1573 :ffffffff8107df9f:       js     ffffffff8107dfc8 <__call_rcu+0xe8
 17580  5.4771 :ffffffff8107dfa1:       push   %r12
  1671  0.5206 :ffffffff8107dfa3:       popfq
 24201  7.5399 :ffffffff8107dfa4:       pop    %rbx
  1367  0.4259 :ffffffff8107dfa5:       pop    %rbp
   377  0.1175 :ffffffff8107dfa6:       pop    %r12
               :ffffffff8107dfa8:       retq
               :ffffffff8107dfa9:       nopl   0x0(%rax)
               :ffffffff8107dfb0:       mov    %rbp,%rdi
               :ffffffff8107dfb3:       callq  ffffffff813be930 <_spin_lock_irqs
    12  0.0037 :ffffffff8107dfb8:       mov    %rbp,%rdi
               :ffffffff8107dfbb:       mov    %rax,%rsi
               :ffffffff8107dfbe:       callq  ffffffff8107d8e0 <rcu_start_gp>
               :ffffffff8107dfc3:       jmp    ffffffff8107df77 <__call_rcu+0x97
               :ffffffff8107dfc5:       nopl   (%rax)
               :ffffffff8107dfc8:       mov    $0x1,%esi
    10  0.0031 :ffffffff8107dfcd:       mov    %rbp,%rdi
               :ffffffff8107dfd0:       callq  ffffffff8107dd50 <force_quiescent
     1 3.1e-04 :ffffffff8107dfd5:       jmp    ffffffff8107dfa1 <__call_rcu+0xc1
   451  0.1405 :ffffffff8107dfd7:       mov    $0x7fffffffffffffff,%rdx
   411  0.1280 :ffffffff8107dfe1:       xor    %esi,%esi
               :ffffffff8107dfe3:       mov    %rbp,%rdi
               :ffffffff8107dfe6:       mov    %rdx,0x60(%rbx)
   317  0.0988 :ffffffff8107dfea:       callq  ffffffff8107dd50 <force_quiescent
  4510  1.4051 :ffffffff8107dfef:       jmp    ffffffff8107dfa1 <__call_rcu+0xc1
               :ffffffff8107dff1:       nopw   %cs:0x0(%rax,%rax,1)




* Re: tree rcu: call_rcu scalability problem?
  2009-09-02 12:27 ` Nick Piggin
@ 2009-09-02 15:19   ` Paul E. McKenney
  2009-09-02 16:24     ` Nick Piggin
  2009-09-02 19:17   ` Peter Zijlstra
  1 sibling, 1 reply; 14+ messages in thread
From: Paul E. McKenney @ 2009-09-02 15:19 UTC
  To: Nick Piggin; +Cc: Linux Kernel Mailing List

On Wed, Sep 02, 2009 at 02:27:56PM +0200, Nick Piggin wrote:
> On Wed, Sep 02, 2009 at 11:48:35AM +0200, Nick Piggin wrote:
> > Hi Paul,
> > 
> > I'm testing out scalability of some vfs code paths, and I'm seeing
> > a problem with call_rcu. This is a 2s8c opteron system, so nothing
> > crazy.
> > 
> > I'll show you the profile results for 1-8 threads:
> > 
> > 1:
> >  29768 total                                      0.0076
> >  15550 default_idle                              48.5938
> >   1340 __d_lookup                                 3.6413
> >    954 __link_path_walk                           0.2559
> >    816 system_call_after_swapgs                   8.0792
> >    680 kmem_cache_alloc                           1.4167
> >    669 dput                                       1.1946
> >    591 __call_rcu                                 2.0521
> > 
> > 2:
> >  56733 total                                      0.0145
> >  20074 default_idle                              62.7313
> >   3075 __call_rcu                                10.6771
> >   2650 __d_lookup                                 7.2011
> >   2019 dput                                       3.6054
> > 
> > 4:
> >  98889 total                                      0.0253
> >  21759 default_idle                              67.9969
> >  10994 __call_rcu                                38.1736
> >   5185 __d_lookup                                14.0897
> >   4475 dput                                       7.9911

Four threads run on one socket but 8 threads run on two sockets,
I take it?

> > 8:
> > 170391 total                                      0.0437
> >  31815 __call_rcu                               110.4688
> >  12958 dput                                      23.1393
> >  10417 __d_lookup                                28.3071
> > 
> > Of course there are other scalability factors involved too, but
> > __call_rcu is taking 54 times more CPU to do 8 times the amount
> > of work from 1-8 threads, or a factor of 6.7 slowdown.
> > 
> > This is with tree RCU.
> 
> It seems like nearly 2/3 of the cost is here:
>         /* Add the callback to our list. */
>         *rdp->nxttail[RCU_NEXT_TAIL] = head; <<<
>         rdp->nxttail[RCU_NEXT_TAIL] = &head->next;

Hmmm...  That certainly is not the first line of code in call_rcu() that
would come to mind...

> In loading the pointer to the next tail pointer. If I'm reading the profile
> correctly. Can't see why that should be a problem though...

The usual diagnosis would be false sharing.

Hmmm...  What is the workload?  CPU-bound?  If CONFIG_PREEMPT=n, I might
expect interference from force_quiescent_state(), except that it should
run only every few clock ticks.  So this seems quite unlikely.

Could you please try padding the beginning and end of struct rcu_data
with a few hundred bytes and rerunning?  Just in case there is a shared
per-CPU variable either before or after rcu_data in your memory layout?
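
Roughly the sort of thing I mean -- the struct below is only a stand-in
with far fewer fields than the real rcu_data, and
____cacheline_aligned_in_smp would be the more usual idiom for the
alignment variant:

struct rcu_head;			/* fields not needed for the sketch */

#define RCU_PAD_BYTES 256		/* "a few hundred bytes" */

struct rcu_data_padded {
	char pad_before[RCU_PAD_BYTES];	/* keep the preceding per-CPU object off our lines */

	/* stand-ins for the hot rcu_data fields */
	struct rcu_head *nxtlist;
	struct rcu_head **nxttail[4];
	long qlen;
	long blimit;

	char pad_after[RCU_PAD_BYTES];	/* and the following one */
};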

							Thanx, Paul

> ffffffff8107dee0 <__call_rcu>: /* __call_rcu total: 320971 100.000 */
>    697  0.2172 :ffffffff8107dee0:       push   %r12
>    228  0.0710 :ffffffff8107dee2:       push   %rbp
>    133  0.0414 :ffffffff8107dee3:       mov    %rdx,%rbp
>    918  0.2860 :ffffffff8107dee6:       push   %rbx
>    316  0.0985 :ffffffff8107dee7:       mov    %rsi,0x8(%rdi)
>    257  0.0801 :ffffffff8107deeb:       movq   $0x0,(%rdi)
>   1660  0.5172 :ffffffff8107def2:       mfence
>  27730  8.6394 :ffffffff8107def5:       pushfq
>  13153  4.0979 :ffffffff8107def6:       pop    %r12
>    903  0.2813 :ffffffff8107def8:       cli
>   2562  0.7982 :ffffffff8107def9:       mov    %gs:0xde68,%eax
>   1784  0.5558 :ffffffff8107df01:       cltq
>                :ffffffff8107df03:       mov    0x60(%rdx,%rax,8),%rbx
>                :ffffffff8107df08:       pushfq
>   3494  1.0886 :ffffffff8107df09:       pop    %rdx
>    896  0.2792 :ffffffff8107df0a:       cli
>   2655  0.8272 :ffffffff8107df0b:       mov    0xd0(%rbp),%rcx
>   1800  0.5608 :ffffffff8107df12:       cmp    (%rbx),%rcx
>     21  0.0065 :ffffffff8107df15:       je     ffffffff8107df32 <__call_rcu+0x52
>                :ffffffff8107df17:       mov    0x40(%rbx),%rax
>     81  0.0252 :ffffffff8107df1b:       mov    %rcx,(%rbx)
>      3 9.3e-04 :ffffffff8107df1e:       mov    %rax,0x38(%rbx)
>                :ffffffff8107df22:       mov    0x48(%rbx),%rax
>                :ffffffff8107df26:       mov    %rax,0x40(%rbx)
>                :ffffffff8107df2a:       mov    0x50(%rbx),%rax
>                :ffffffff8107df2e:       mov    %rax,0x48(%rbx)
>                :ffffffff8107df32:       push   %rdx
>   1194  0.3720 :ffffffff8107df33:       popfq
>   9518  2.9654 :ffffffff8107df34:       pushfq
>   4179  1.3020 :ffffffff8107df35:       pop    %rdx
>   1277  0.3979 :ffffffff8107df36:       cli
>   2546  0.7932 :ffffffff8107df37:       mov    0xc8(%rbp),%rax
>   1748  0.5446 :ffffffff8107df3e:       cmp    %rax,0x8(%rbx)
>      5  0.0016 :ffffffff8107df42:       je     ffffffff8107df57 <__call_rcu+0x77
>                :ffffffff8107df44:       movb   $0x1,0x19(%rbx)
>      2 6.2e-04 :ffffffff8107df48:       movb   $0x0,0x18(%rbx)
>                :ffffffff8107df4c:       mov    0xc8(%rbp),%rax
>                :ffffffff8107df53:       mov    %rax,0x8(%rbx)
>    921  0.2869 :ffffffff8107df57:       push   %rdx
>    151  0.0470 :ffffffff8107df58:       popfq
> 183507 57.1725 :ffffffff8107df59:       mov    0x50(%rbx),%rax
>    995  0.3100 :ffffffff8107df5d:       mov    %rdi,(%rax)
>      2 6.2e-04 :ffffffff8107df60:       mov    %rdi,0x50(%rbx)
>     18  0.0056 :ffffffff8107df64:       mov    0xd0(%rbp),%rdx
>    940  0.2929 :ffffffff8107df6b:       mov    0xc8(%rbp),%rax
>     15  0.0047 :ffffffff8107df72:       cmp    %rax,%rdx
>      1 3.1e-04 :ffffffff8107df75:       je     ffffffff8107dfb0 <__call_rcu+0xd0
>    787  0.2452 :ffffffff8107df77:       mov    0x58(%rbx),%rax
>     58  0.0181 :ffffffff8107df7b:       inc    %rax
>      2 6.2e-04 :ffffffff8107df7e:       mov    %rax,0x58(%rbx)
>   1679  0.5231 :ffffffff8107df82:       movslq 0x4988fb(%rip),%rdx        # ffff
>     40  0.0125 :ffffffff8107df89:       cmp    %rdx,%rax
>      5  0.0016 :ffffffff8107df8c:       jg     ffffffff8107dfd7 <__call_rcu+0xf7
>    588  0.1832 :ffffffff8107df8e:       mov    0xe0(%rbp),%rdx
>     84  0.0262 :ffffffff8107df95:       mov    0x51f924(%rip),%rax        # ffff
>      5  0.0016 :ffffffff8107df9c:       cmp    %rax,%rdx
>    505  0.1573 :ffffffff8107df9f:       js     ffffffff8107dfc8 <__call_rcu+0xe8
>  17580  5.4771 :ffffffff8107dfa1:       push   %r12
>   1671  0.5206 :ffffffff8107dfa3:       popfq
>  24201  7.5399 :ffffffff8107dfa4:       pop    %rbx
>   1367  0.4259 :ffffffff8107dfa5:       pop    %rbp
>    377  0.1175 :ffffffff8107dfa6:       pop    %r12
>                :ffffffff8107dfa8:       retq
>                :ffffffff8107dfa9:       nopl   0x0(%rax)
>                :ffffffff8107dfb0:       mov    %rbp,%rdi
>                :ffffffff8107dfb3:       callq  ffffffff813be930 <_spin_lock_irqs
>     12  0.0037 :ffffffff8107dfb8:       mov    %rbp,%rdi
>                :ffffffff8107dfbb:       mov    %rax,%rsi
>                :ffffffff8107dfbe:       callq  ffffffff8107d8e0 <rcu_start_gp>
>                :ffffffff8107dfc3:       jmp    ffffffff8107df77 <__call_rcu+0x97
>                :ffffffff8107dfc5:       nopl   (%rax)
>                :ffffffff8107dfc8:       mov    $0x1,%esi
>     10  0.0031 :ffffffff8107dfcd:       mov    %rbp,%rdi
>                :ffffffff8107dfd0:       callq  ffffffff8107dd50 <force_quiescent
>      1 3.1e-04 :ffffffff8107dfd5:       jmp    ffffffff8107dfa1 <__call_rcu+0xc1
>    451  0.1405 :ffffffff8107dfd7:       mov    $0x7fffffffffffffff,%rdx
>    411  0.1280 :ffffffff8107dfe1:       xor    %esi,%esi
>                :ffffffff8107dfe3:       mov    %rbp,%rdi
>                :ffffffff8107dfe6:       mov    %rdx,0x60(%rbx)
>    317  0.0988 :ffffffff8107dfea:       callq  ffffffff8107dd50 <force_quiescent
>   4510  1.4051 :ffffffff8107dfef:       jmp    ffffffff8107dfa1 <__call_rcu+0xc1
>                :ffffffff8107dff1:       nopw   %cs:0x0(%rax,%rax,1)
> 
> 


* Re: tree rcu: call_rcu scalability problem?
  2009-09-02 15:19   ` Paul E. McKenney
@ 2009-09-02 16:24     ` Nick Piggin
  2009-09-02 16:37       ` Paul E. McKenney
  0 siblings, 1 reply; 14+ messages in thread
From: Nick Piggin @ 2009-09-02 16:24 UTC
  To: Paul E. McKenney; +Cc: Linux Kernel Mailing List

On Wed, Sep 02, 2009 at 08:19:27AM -0700, Paul E. McKenney wrote:
> On Wed, Sep 02, 2009 at 02:27:56PM +0200, Nick Piggin wrote:
> > On Wed, Sep 02, 2009 at 11:48:35AM +0200, Nick Piggin wrote:
> > > Hi Paul,
> > > 
> > > I'm testing out scalability of some vfs code paths, and I'm seeing
> > > a problem with call_rcu. This is a 2s8c opteron system, so nothing
> > > crazy.
> > > 
> > > I'll show you the profile results for 1-8 threads:
> > > 
> > > 1:
> > >  29768 total                                      0.0076
> > >  15550 default_idle                              48.5938
> > >   1340 __d_lookup                                 3.6413
> > >    954 __link_path_walk                           0.2559
> > >    816 system_call_after_swapgs                   8.0792
> > >    680 kmem_cache_alloc                           1.4167
> > >    669 dput                                       1.1946
> > >    591 __call_rcu                                 2.0521
> > > 
> > > 2:
> > >  56733 total                                      0.0145
> > >  20074 default_idle                              62.7313
> > >   3075 __call_rcu                                10.6771
> > >   2650 __d_lookup                                 7.2011
> > >   2019 dput                                       3.6054
> > > 
> > > 4:
> > >  98889 total                                      0.0253
> > >  21759 default_idle                              67.9969
> > >  10994 __call_rcu                                38.1736
> > >   5185 __d_lookup                                14.0897
> > >   4475 dput                                       7.9911
> 
> Four threads run on one socket but 8 threads run on two sockets,
> I take it?

Yes.

 
> > > 8:
> > > 170391 total                                      0.0437
> > >  31815 __call_rcu                               110.4688
> > >  12958 dput                                      23.1393
> > >  10417 __d_lookup                                28.3071
> > > 
> > > Of course there are other scalability factors involved too, but
> > > __call_rcu is taking 54 times more CPU to do 8 times the amount
> > > of work from 1-8 threads, or a factor of 6.7 slowdown.
> > > 
> > > This is with tree RCU.
> > 
> > It seems like nearly 2/3 of the cost is here:
> >         /* Add the callback to our list. */
> >         *rdp->nxttail[RCU_NEXT_TAIL] = head; <<<
> >         rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
> 
> Hmmm...  That certainly is not the first line of code in call_rcu() that
> would come to mind...

It's weird. I *think* I read the asm right, but oprofile maybe is
not attributing the cost to the right instruction.


> > In loading the pointer to the next tail pointer. If I'm reading the profile
> > correctly. Can't see why that should be a problem though...
> 
> The usual diagnosis would be false sharing.

Hmm that's possible yes.


> Hmmm...  What is the workload?  CPU-bound?  If CONFIG_PREEMPT=n, I might
> expect interference from force_quiescent_state(), except that it should
> run only every few clock ticks.  So this seems quite unlikely.

It's CPU bound and preempt=y.

Workload is just 8 processes running a loop of close(open("file$i")) as
I said though you probably won't be able to reproduce it on a vanilla
kernel.
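
Each of the 8 processes is essentially doing something like the sketch
below in its own directory (the flags and file name here are
illustrative, not the actual harness):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "file0";

	for (;;) {
		int fd = open(path, O_RDONLY | O_CREAT, 0644);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		/* the final fput() from close() frees the struct file via call_rcu */
		close(fd);
	}
	return 0;
}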


> Could you please try padding the beginning and end of struct rcu_data
> with a few hundred bytes and rerunning?  Just in case there is a shared
> per-CPU variable either before or after rcu_data in your memory layout?

OK I'll try that.



* Re: tree rcu: call_rcu scalability problem?
  2009-09-02 16:24     ` Nick Piggin
@ 2009-09-02 16:37       ` Paul E. McKenney
  2009-09-02 16:45         ` Nick Piggin
  2009-09-02 17:50         ` Nick Piggin
  0 siblings, 2 replies; 14+ messages in thread
From: Paul E. McKenney @ 2009-09-02 16:37 UTC
  To: Nick Piggin; +Cc: Linux Kernel Mailing List

On Wed, Sep 02, 2009 at 06:24:51PM +0200, Nick Piggin wrote:
> On Wed, Sep 02, 2009 at 08:19:27AM -0700, Paul E. McKenney wrote:
> > On Wed, Sep 02, 2009 at 02:27:56PM +0200, Nick Piggin wrote:
> > > On Wed, Sep 02, 2009 at 11:48:35AM +0200, Nick Piggin wrote:
> > > > Hi Paul,
> > > > 
> > > > I'm testing out scalability of some vfs code paths, and I'm seeing
> > > > a problem with call_rcu. This is a 2s8c opteron system, so nothing
> > > > crazy.
> > > > 
> > > > I'll show you the profile results for 1-8 threads:
> > > > 
> > > > 1:
> > > >  29768 total                                      0.0076
> > > >  15550 default_idle                              48.5938
> > > >   1340 __d_lookup                                 3.6413
> > > >    954 __link_path_walk                           0.2559
> > > >    816 system_call_after_swapgs                   8.0792
> > > >    680 kmem_cache_alloc                           1.4167
> > > >    669 dput                                       1.1946
> > > >    591 __call_rcu                                 2.0521
> > > > 
> > > > 2:
> > > >  56733 total                                      0.0145
> > > >  20074 default_idle                              62.7313
> > > >   3075 __call_rcu                                10.6771
> > > >   2650 __d_lookup                                 7.2011
> > > >   2019 dput                                       3.6054
> > > > 
> > > > 4:
> > > >  98889 total                                      0.0253
> > > >  21759 default_idle                              67.9969
> > > >  10994 __call_rcu                                38.1736
> > > >   5185 __d_lookup                                14.0897
> > > >   4475 dput                                       7.9911
> > 
> > Four threads run on one socket but 8 threads run on two sockets,
> > I take it?
> 
> Yes.
> 
> 
> > > > 8:
> > > > 170391 total                                      0.0437
> > > >  31815 __call_rcu                               110.4688
> > > >  12958 dput                                      23.1393
> > > >  10417 __d_lookup                                28.3071
> > > > 
> > > > Of course there are other scalability factors involved too, but
> > > > __call_rcu is taking 54 times more CPU to do 8 times the amount
> > > > of work from 1-8 threads, or a factor of 6.7 slowdown.
> > > > 
> > > > This is with tree RCU.
> > > 
> > > It seems like nearly 2/3 of the cost is here:
> > >         /* Add the callback to our list. */
> > >         *rdp->nxttail[RCU_NEXT_TAIL] = head; <<<
> > >         rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
> > 
> > Hmmm...  That certainly is not the first line of code in call_rcu() that
> > would come to mind...
> 
> It's weird. I *think* I read the asm right, but oprofile maybe is
> not attributing the cost to the right instruction.
> 
> 
> > > In loading the pointer to the next tail pointer. If I'm reading the profile
> > > correctly. Can't see why that should be a problem though...
> > 
> > The usual diagnosis would be false sharing.
> 
> Hmm that's possible yes.
> 
> 
> > Hmmm...  What is the workload?  CPU-bound?  If CONFIG_PREEMPT=n, I might
> > expect interference from force_quiescent_state(), except that it should
> > run only every few clock ticks.  So this seems quite unlikely.
> 
> It's CPU bound and preempt=y.
> 
> Workload is just 8 processes running a loop of close(open("file$i")) as
> I said though you probably won't be able to reproduce it on a vanilla
> kernel.

OK, so you are executing call_rcu() a -lot-!!!

Could you also please try CONFIG_RCU_TRACE=y, and send me the contents of
the files in the "rcu" subdirectory in debugfs?  Please take a snapshot
of these files, run your test for a fixed time interval (perhaps ten
seconds, but please tell me how long), then take a second snapshot.

							Thanx, Paul

> > Could you please try padding the beginning and end of struct rcu_data
> > with a few hundred bytes and rerunning?  Just in case there is a shared
> > per-CPU variable either before or after rcu_data in your memory layout?
> 
> OK I'll try that.
> 


* Re: tree rcu: call_rcu scalability problem?
  2009-09-02 16:37       ` Paul E. McKenney
@ 2009-09-02 16:45         ` Nick Piggin
  2009-09-02 16:48           ` Paul E. McKenney
  2009-09-02 17:50         ` Nick Piggin
  1 sibling, 1 reply; 14+ messages in thread
From: Nick Piggin @ 2009-09-02 16:45 UTC
  To: Paul E. McKenney; +Cc: Linux Kernel Mailing List

On Wed, Sep 02, 2009 at 09:37:05AM -0700, Paul E. McKenney wrote:
> On Wed, Sep 02, 2009 at 06:24:51PM +0200, Nick Piggin wrote:
> > On Wed, Sep 02, 2009 at 08:19:27AM -0700, Paul E. McKenney wrote:
> > > On Wed, Sep 02, 2009 at 02:27:56PM +0200, Nick Piggin wrote:
> > > > On Wed, Sep 02, 2009 at 11:48:35AM +0200, Nick Piggin wrote:
> > > > > Hi Paul,
> > > > > 
> > > > > I'm testing out scalability of some vfs code paths, and I'm seeing
> > > > > a problem with call_rcu. This is a 2s8c opteron system, so nothing
> > > > > crazy.
> > > > > 
> > > > > I'll show you the profile results for 1-8 threads:
> > > > > 
> > > > > 1:
> > > > >  29768 total                                      0.0076
> > > > >  15550 default_idle                              48.5938
> > > > >   1340 __d_lookup                                 3.6413
> > > > >    954 __link_path_walk                           0.2559
> > > > >    816 system_call_after_swapgs                   8.0792
> > > > >    680 kmem_cache_alloc                           1.4167
> > > > >    669 dput                                       1.1946
> > > > >    591 __call_rcu                                 2.0521
> > > > > 
> > > > > 2:
> > > > >  56733 total                                      0.0145
> > > > >  20074 default_idle                              62.7313
> > > > >   3075 __call_rcu                                10.6771
> > > > >   2650 __d_lookup                                 7.2011
> > > > >   2019 dput                                       3.6054
> > > > > 
> > > > > 4:
> > > > >  98889 total                                      0.0253
> > > > >  21759 default_idle                              67.9969
> > > > >  10994 __call_rcu                                38.1736
> > > > >   5185 __d_lookup                                14.0897
> > > > >   4475 dput                                       7.9911
> > > 
> > > Four threads run on one socket but 8 threads run on two sockets,
> > > I take it?
> > 
> > Yes.
> > 
> > 
> > > > > 8:
> > > > > 170391 total                                      0.0437
> > > > >  31815 __call_rcu                               110.4688
> > > > >  12958 dput                                      23.1393
> > > > >  10417 __d_lookup                                28.3071
> > > > > 
> > > > > Of course there are other scalability factors involved too, but
> > > > > __call_rcu is taking 54 times more CPU to do 8 times the amount
> > > > > of work from 1-8 threads, or a factor of 6.7 slowdown.
> > > > > 
> > > > > This is with tree RCU.
> > > > 
> > > > It seems like nearly 2/3 of the cost is here:
> > > >         /* Add the callback to our list. */
> > > >         *rdp->nxttail[RCU_NEXT_TAIL] = head; <<<
> > > >         rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
> > > 
> > > Hmmm...  That certainly is not the first line of code in call_rcu() that
> > > would come to mind...
> > 
> > It's weird. I *think* I read the asm right, but oprofile maybe is
> > not attributing the cost to the right instruction.
> > 
> > 
> > > > In loading the pointer to the next tail pointer. If I'm reading the profile
> > > > correctly. Can't see why that should be a problem though...
> > > 
> > > The usual diagnosis would be false sharing.
> > 
> > Hmm that's possible yes.
> > 
> > 
> > > Hmmm...  What is the workload?  CPU-bound?  If CONFIG_PREEMPT=n, I might
> > > expect interference from force_quiescent_state(), except that it should
> > > run only every few clock ticks.  So this seems quite unlikely.
> > 
> > It's CPU bound and preempt=y.
> > 
> > Workload is just 8 processes running a loop of close(open("file$i")) as
> > I said though you probably won't be able to reproduce it on a vanilla
> > kernel.
> 
> OK, so you are executing call_rcu() a -lot-!!!

Oh yeah. The combined frequency in the 8 proc case is once per 270ns, so
nearly half a million times per second per core.

It's not *slow* by any means, but it is increasing much faster than other
functions on the profile so I just want to understand what is happening.
 

> Could you also please try CONFIG_RCU_TRACE=y, and send me the contents of
> the files in the "rcu" subdirectory in debugfs?  Please take a snapshot
> of these files, run your test for a fixed time interval (perhaps ten
> seconds, but please tell me how long), then take a second snapshot.

Will do.



* Re: tree rcu: call_rcu scalability problem?
  2009-09-02 16:45         ` Nick Piggin
@ 2009-09-02 16:48           ` Paul E. McKenney
  0 siblings, 0 replies; 14+ messages in thread
From: Paul E. McKenney @ 2009-09-02 16:48 UTC
  To: Nick Piggin; +Cc: Linux Kernel Mailing List

On Wed, Sep 02, 2009 at 06:45:04PM +0200, Nick Piggin wrote:
> On Wed, Sep 02, 2009 at 09:37:05AM -0700, Paul E. McKenney wrote:
> > On Wed, Sep 02, 2009 at 06:24:51PM +0200, Nick Piggin wrote:
> > > On Wed, Sep 02, 2009 at 08:19:27AM -0700, Paul E. McKenney wrote:
> > > > On Wed, Sep 02, 2009 at 02:27:56PM +0200, Nick Piggin wrote:
> > > > > On Wed, Sep 02, 2009 at 11:48:35AM +0200, Nick Piggin wrote:
> > > > > > Hi Paul,
> > > > > > 
> > > > > > I'm testing out scalability of some vfs code paths, and I'm seeing
> > > > > > a problem with call_rcu. This is a 2s8c opteron system, so nothing
> > > > > > crazy.
> > > > > > 
> > > > > > I'll show you the profile results for 1-8 threads:
> > > > > > 
> > > > > > 1:
> > > > > >  29768 total                                      0.0076
> > > > > >  15550 default_idle                              48.5938
> > > > > >   1340 __d_lookup                                 3.6413
> > > > > >    954 __link_path_walk                           0.2559
> > > > > >    816 system_call_after_swapgs                   8.0792
> > > > > >    680 kmem_cache_alloc                           1.4167
> > > > > >    669 dput                                       1.1946
> > > > > >    591 __call_rcu                                 2.0521
> > > > > > 
> > > > > > 2:
> > > > > >  56733 total                                      0.0145
> > > > > >  20074 default_idle                              62.7313
> > > > > >   3075 __call_rcu                                10.6771
> > > > > >   2650 __d_lookup                                 7.2011
> > > > > >   2019 dput                                       3.6054
> > > > > > 
> > > > > > 4:
> > > > > >  98889 total                                      0.0253
> > > > > >  21759 default_idle                              67.9969
> > > > > >  10994 __call_rcu                                38.1736
> > > > > >   5185 __d_lookup                                14.0897
> > > > > >   4475 dput                                       7.9911
> > > > 
> > > > Four threads run on one socket but 8 threads run on two sockets,
> > > > I take it?
> > > 
> > > Yes.
> > > 
> > > 
> > > > > > 8:
> > > > > > 170391 total                                      0.0437
> > > > > >  31815 __call_rcu                               110.4688
> > > > > >  12958 dput                                      23.1393
> > > > > >  10417 __d_lookup                                28.3071
> > > > > > 
> > > > > > Of course there are other scalability factors involved too, but
> > > > > > __call_rcu is taking 54 times more CPU to do 8 times the amount
> > > > > > of work from 1-8 threads, or a factor of 6.7 slowdown.
> > > > > > 
> > > > > > This is with tree RCU.
> > > > > 
> > > > > It seems like nearly 2/3 of the cost is here:
> > > > >         /* Add the callback to our list. */
> > > > >         *rdp->nxttail[RCU_NEXT_TAIL] = head; <<<
> > > > >         rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
> > > > 
> > > > Hmmm...  That certainly is not the first line of code in call_rcu() that
> > > > would come to mind...
> > > 
> > > It's weird. I *think* I read the asm right, but oprofile maybe is
> > > not attributing the cost to the right instruction.
> > > 
> > > 
> > > > > In loading the pointer to the next tail pointer. If I'm reading the profile
> > > > > correctly. Can't see why that should be a problem though...
> > > > 
> > > > The usual diagnosis would be false sharing.
> > > 
> > > Hmm that's possible yes.
> > > 
> > > 
> > > > Hmmm...  What is the workload?  CPU-bound?  If CONFIG_PREEMPT=n, I might
> > > > expect interference from force_quiescent_state(), except that it should
> > > > run only every few clock ticks.  So this seems quite unlikely.
> > > 
> > > It's CPU bound and preempt=y.
> > > 
> > > Workload is just 8 processes running a loop of close(open("file$i")) as
> > > I said though you probably won't be able to reproduce it on a vanilla
> > > kernel.
> > 
> > OK, so you are executing call_rcu() a -lot-!!!
> 
> Oh yeah. The combined frequency in the 8 proc case is once per 270ns, so
> nearly half a million times per second per core.

Woo-hoo!!!  ;-)

> It's not *slow* by any means, but it is increasing much faster than other
> functions on the profile so I just want to understand what is happening.

Understood and agreed.

> > Could you also please try CONFIG_RCU_TRACE=y, and send me the contents of
> > the files in the "rcu" subdirectory in debugfs?  Please take a snapshot
> > of these files, run your test for a fixed time interval (perhaps ten
> > seconds, but please tell me how long), then take a second snapshot.
> 
> Will do.

							Thanx, Paul


* Re: tree rcu: call_rcu scalability problem?
  2009-09-02 16:37       ` Paul E. McKenney
  2009-09-02 16:45         ` Nick Piggin
@ 2009-09-02 17:50         ` Nick Piggin
  1 sibling, 0 replies; 14+ messages in thread
From: Nick Piggin @ 2009-09-02 17:50 UTC
  To: Paul E. McKenney; +Cc: Linux Kernel Mailing List

On Wed, Sep 02, 2009 at 09:37:05AM -0700, Paul E. McKenney wrote:
> On Wed, Sep 02, 2009 at 06:24:51PM +0200, Nick Piggin wrote:
> > > > In loading the pointer to the next tail pointer. If I'm reading the profile
> > > > correctly. Can't see why that should be a problem though...
> > > 
> > > The usual diagnosis would be false sharing.
> > 
> > Hmm that's possible yes.

OK, padding 64 bytes (cacheline size) at the start and end of struct rcu_data
does not help.

I wonder if the cycles aren't being attributed to the right instruction?
Interesting thing is this queueing part seems to be the same in rcuclassic
too, which seems to run faster.

I'll try to run it on a bigger machine and see if it becomes more
pronounced. But I might not get around to that tonight.


> > > Hmmm...  What is the workload?  CPU-bound?  If CONFIG_PREEMPT=n, I might
> > > expect interference from force_quiescent_state(), except that it should
> > > run only every few clock ticks.  So this seems quite unlikely.
> > 
> > It's CPU bound and preempt=y.
> > 
> > Workload is just 8 processes running a loop of close(open("file$i")) as
> > I said though you probably won't be able to reproduce it on a vanilla
> > kernel.
> 
> OK, so you are executing call_rcu() a -lot-!!!
> 
> Could you also please try CONFIG_RCU_TRACE=y, and send me the contents of
> the files in the "rcu" subdirectory in debugfs?  Please take a snapshot
> of these files, run your test for a fixed time interval (perhaps ten
> seconds, but please tell me how long), then take a second snapshot.

Attached, old/* vs new/*. Interval was 22s.



* Re: tree rcu: call_rcu scalability problem?
  2009-09-02 12:27 ` Nick Piggin
  2009-09-02 15:19   ` Paul E. McKenney
@ 2009-09-02 19:17   ` Peter Zijlstra
  2009-09-03  5:14     ` Paul E. McKenney
  2009-09-03  7:14     ` Nick Piggin
  1 sibling, 2 replies; 14+ messages in thread
From: Peter Zijlstra @ 2009-09-02 19:17 UTC
  To: Nick Piggin; +Cc: Paul McKenney, Linux Kernel Mailing List

On Wed, 2009-09-02 at 14:27 +0200, Nick Piggin wrote:

> It seems like nearly 2/3 of the cost is here:
>         /* Add the callback to our list. */
>         *rdp->nxttail[RCU_NEXT_TAIL] = head; <<<
>         rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
> 
> In loading the pointer to the next tail pointer. If I'm reading the profile
> correctly. Can't see why that should be a problem though...
> 
> ffffffff8107dee0 <__call_rcu>: /* __call_rcu total: 320971 100.000 */
>    697  0.2172 :ffffffff8107dee0:       push   %r12

>    921  0.2869 :ffffffff8107df57:       push   %rdx
>    151  0.0470 :ffffffff8107df58:       popfq
> 183507 57.1725 :ffffffff8107df59:       mov    0x50(%rbx),%rax
>    995  0.3100 :ffffffff8107df5d:       mov    %rdi,(%rax)

I'd guess at popfq to be the expensive op here.. skid usually causes the
attribution to be a few ops down the line.



* Re: tree rcu: call_rcu scalability problem?
  2009-09-02 19:17   ` Peter Zijlstra
@ 2009-09-03  5:14     ` Paul E. McKenney
  2009-09-03  7:45       ` Nick Piggin
  2009-09-03  9:01       ` Nick Piggin
  2009-09-03  7:14     ` Nick Piggin
  1 sibling, 2 replies; 14+ messages in thread
From: Paul E. McKenney @ 2009-09-03  5:14 UTC
  To: Peter Zijlstra; +Cc: Nick Piggin, Linux Kernel Mailing List

On Wed, Sep 02, 2009 at 09:17:44PM +0200, Peter Zijlstra wrote:
> On Wed, 2009-09-02 at 14:27 +0200, Nick Piggin wrote:
> 
> > It seems like nearly 2/3 of the cost is here:
> >         /* Add the callback to our list. */
> >         *rdp->nxttail[RCU_NEXT_TAIL] = head; <<<
> >         rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
> > 
> > In loading the pointer to the next tail pointer. If I'm reading the profile
> > correctly. Can't see why that should be a problem though...
> > 
> > ffffffff8107dee0 <__call_rcu>: /* __call_rcu total: 320971 100.000 */
> >    697  0.2172 :ffffffff8107dee0:       push   %r12
> 
> >    921  0.2869 :ffffffff8107df57:       push   %rdx
> >    151  0.0470 :ffffffff8107df58:       popfq
> > 183507 57.1725 :ffffffff8107df59:       mov    0x50(%rbx),%rax
> >    995  0.3100 :ffffffff8107df5d:       mov    %rdi,(%rax)
> 
> I'd guess at popfq to be the expensive op here.. skid usually causes the
> attribution to be a few ops down the line.

I believe that Nick's workload is routinely driving the number of
callbacks queued on a given CPU above 10,000, which would provoke numerous
(and possibly inlined) calls to force_quiescent_state().  Like about
400,000 such calls per second.  Hey, I was naively assuming that no one
would see more than 10,000 callbacks queued on a single CPU unless there
was some sort of major emergency underway, and coded accordingly.  ;-)

I offer the attached experimental (untested, might not even compile) patch.

							Thanx, Paul

------------------------------------------------------------------------

From 0544d2da54bad95556a320e57658e244cb2ae8c6 Mon Sep 17 00:00:00 2001
From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date: Wed, 2 Sep 2009 22:01:50 -0700
Subject: [PATCH] Remove grace-period machinery from rcutree __call_rcu()

The grace-period machinery in __call_rcu() was a failed attempt to avoid
implementing synchronize_rcu_expedited().  But now that this attempt has
failed, try removing the machinery.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcutree.c |   12 ------------
 1 files changed, 0 insertions(+), 12 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index d2a372f..104de9e 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1201,26 +1201,14 @@ __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
 	 */
 	local_irq_save(flags);
 	rdp = rsp->rda[smp_processor_id()];
-	rcu_process_gp_end(rsp, rdp);
-	check_for_new_grace_period(rsp, rdp);
 
 	/* Add the callback to our list. */
 	*rdp->nxttail[RCU_NEXT_TAIL] = head;
 	rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
 
-	/* Start a new grace period if one not already started. */
-	if (ACCESS_ONCE(rsp->completed) == ACCESS_ONCE(rsp->gpnum)) {
-		unsigned long nestflag;
-		struct rcu_node *rnp_root = rcu_get_root(rsp);
-
-		spin_lock_irqsave(&rnp_root->lock, nestflag);
-		rcu_start_gp(rsp, nestflag);  /* releases rnp_root->lock. */
-	}
-
 	/* Force the grace period if too many callbacks or too long waiting. */
 	if (unlikely(++rdp->qlen > qhimark)) {
 		rdp->blimit = LONG_MAX;
-		force_quiescent_state(rsp, 0);
 	} else if ((long)(ACCESS_ONCE(rsp->jiffies_force_qs) - jiffies) < 0)
 		force_quiescent_state(rsp, 1);
 	local_irq_restore(flags);
-- 
1.5.2.5



* Re: tree rcu: call_rcu scalability problem?
  2009-09-02 19:17   ` Peter Zijlstra
  2009-09-03  5:14     ` Paul E. McKenney
@ 2009-09-03  7:14     ` Nick Piggin
  1 sibling, 0 replies; 14+ messages in thread
From: Nick Piggin @ 2009-09-03  7:14 UTC
  To: Peter Zijlstra; +Cc: Paul McKenney, Linux Kernel Mailing List

On Wed, Sep 02, 2009 at 09:17:44PM +0200, Peter Zijlstra wrote:
> On Wed, 2009-09-02 at 14:27 +0200, Nick Piggin wrote:
> 
> > It seems like nearly 2/3 of the cost is here:
> >         /* Add the callback to our list. */
> >         *rdp->nxttail[RCU_NEXT_TAIL] = head; <<<
> >         rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
> > 
> > In loading the pointer to the next tail pointer. If I'm reading the profile
> > correctly. Can't see why that should be a problem though...
> > 
> > ffffffff8107dee0 <__call_rcu>: /* __call_rcu total: 320971 100.000 */
> >    697  0.2172 :ffffffff8107dee0:       push   %r12
> 
> >    921  0.2869 :ffffffff8107df57:       push   %rdx
> >    151  0.0470 :ffffffff8107df58:       popfq
> > 183507 57.1725 :ffffffff8107df59:       mov    0x50(%rbx),%rax
> >    995  0.3100 :ffffffff8107df5d:       mov    %rdi,(%rax)
> 
> I'd guess at popfq to be the expensive op here.. skid usually causes the
> attribution to be a few ops down the line.

Well it doesn't really explain why it is getting more costly per
unit of work as core count increases. The popfq should run completely
out of L1 cache (or even from store forwarding) and it is expensive,
but those costs should remain the same regardless of how many cores are
in the system.

But I am getting some strange looking profiles, so I'm going to get
a bigger system to run it on to see if we can get a really pronounced
scalability problem. Maybe even make a microbenchmark to eliminate
other elements of the workload.



* Re: tree rcu: call_rcu scalability problem?
  2009-09-03  5:14     ` Paul E. McKenney
@ 2009-09-03  7:45       ` Nick Piggin
  2009-09-03  9:01       ` Nick Piggin
  1 sibling, 0 replies; 14+ messages in thread
From: Nick Piggin @ 2009-09-03  7:45 UTC
  To: Paul E. McKenney; +Cc: Peter Zijlstra, Linux Kernel Mailing List

On Wed, Sep 02, 2009 at 10:14:27PM -0700, Paul E. McKenney wrote:
> On Wed, Sep 02, 2009 at 09:17:44PM +0200, Peter Zijlstra wrote:
> > On Wed, 2009-09-02 at 14:27 +0200, Nick Piggin wrote:
> > 
> > > It seems like nearly 2/3 of the cost is here:
> > >         /* Add the callback to our list. */
> > >         *rdp->nxttail[RCU_NEXT_TAIL] = head; <<<
> > >         rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
> > > 
> > > In loading the pointer to the next tail pointer. If I'm reading the profile
> > > correctly. Can't see why that should be a problem though...
> > > 
> > > ffffffff8107dee0 <__call_rcu>: /* __call_rcu total: 320971 100.000 */
> > >    697  0.2172 :ffffffff8107dee0:       push   %r12
> > 
> > >    921  0.2869 :ffffffff8107df57:       push   %rdx
> > >    151  0.0470 :ffffffff8107df58:       popfq
> > > 183507 57.1725 :ffffffff8107df59:       mov    0x50(%rbx),%rax
> > >    995  0.3100 :ffffffff8107df5d:       mov    %rdi,(%rax)
> > 
> > I'd guess at popfq to be the expensive op here.. skid usually causes the
> > attribution to be a few ops down the line.
> 
> I believe that Nick's workload is routinely driving the number of
> callbacks queued on a given CPU above 10,000, which would provoke numerous
> (and possibly inlined) calls to force_quiescent_state().  Like about
> 400,000 such calls per second.  Hey, I was naively assuming that no one
> would see more than 10,000 callbacks queued on a single CPU unless there
> was some sort of major emergency underway, and coded accordingly.  ;-)
> 
> I offer the attached experimental (untested, might not even compile) patch.

Not only does it compile, but __call_rcu is now taking 1/10th the
cycles, and absolute performance is up nearly 20%. Looks like it is
now better than classic RCU.

I'll collect and post some more detailed numbers and profiles. Do
you want some new rcu trace results too?

Thanks,
Nick


* Re: tree rcu: call_rcu scalability problem?
  2009-09-03  5:14     ` Paul E. McKenney
  2009-09-03  7:45       ` Nick Piggin
@ 2009-09-03  9:01       ` Nick Piggin
  2009-09-03 13:28         ` Paul E. McKenney
  1 sibling, 1 reply; 14+ messages in thread
From: Nick Piggin @ 2009-09-03  9:01 UTC
  To: Paul E. McKenney; +Cc: Peter Zijlstra, Linux Kernel Mailing List

On Wed, Sep 02, 2009 at 10:14:27PM -0700, Paul E. McKenney wrote:
> From 0544d2da54bad95556a320e57658e244cb2ae8c6 Mon Sep 17 00:00:00 2001
> From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Date: Wed, 2 Sep 2009 22:01:50 -0700
> Subject: [PATCH] Remove grace-period machinery from rcutree __call_rcu()
> 
> The grace-period machinery in __call_rcu() was a failed attempt to avoid
> implementing synchronize_rcu_expedited().  But now that this attempt has
> failed, try removing the machinery.

OK, the workload is parallel processes performing a close(open()) loop
in a tmpfs filesystem within different cwds (to avoid contention on the
cwd dentry). The kernel is first patched with my vfs scalability patches,
so the comparison is with/without Paul's rcu patch.

System is 2s8c opteron, with processes bound to CPUs (first within the
same socket, then over both sockets as count increases).

procs  tput-base          tput-rcu
1         595238 (x1.00)    645161 (x1.00)
2        1041666 (x1.75)   1136363 (x1.76)
4        1960784 (x3.29)   2298850 (x3.56)
8        3636363 (x6.11)   4545454 (x7.05)

Scalability is improved (from 2-8 way it is now actually linear), and
single thread performance is significantly improved too.

oprofile results collecting clk unhalted samples shows the following
results for __call_rcu symbol:

procs  samples  %        app name                 symbol name
tput-base
1      12153     3.8122  vmlinux                  __call_rcu
2      29253     3.9899  vmlinux                  __call_rcu
4      84503     5.4667  vmlinux                  __call_rcu
8      312816    9.5287  vmlinux                  __call_rcu

tput-rcu
1      8722      2.8770  vmlinux                  __call_rcu
2      17275     2.5804  vmlinux                  __call_rcu
4      33848     2.6015  vmlinux                  __call_rcu
8      67158     2.5561  vmlinux                  __call_rcu

Scaling is clearly much better (it is more important to look at absolute
samples because %age is dependent on other parts of the kernel too).

Feel free to add any of this to your changelog if you think it's important.

Thanks,
Nick

> 
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> ---
>  kernel/rcutree.c |   12 ------------
>  1 files changed, 0 insertions(+), 12 deletions(-)
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index d2a372f..104de9e 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -1201,26 +1201,14 @@ __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
>  	 */
>  	local_irq_save(flags);
>  	rdp = rsp->rda[smp_processor_id()];
> -	rcu_process_gp_end(rsp, rdp);
> -	check_for_new_grace_period(rsp, rdp);
>  
>  	/* Add the callback to our list. */
>  	*rdp->nxttail[RCU_NEXT_TAIL] = head;
>  	rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
>  
> -	/* Start a new grace period if one not already started. */
> -	if (ACCESS_ONCE(rsp->completed) == ACCESS_ONCE(rsp->gpnum)) {
> -		unsigned long nestflag;
> -		struct rcu_node *rnp_root = rcu_get_root(rsp);
> -
> -		spin_lock_irqsave(&rnp_root->lock, nestflag);
> -		rcu_start_gp(rsp, nestflag);  /* releases rnp_root->lock. */
> -	}
> -
>  	/* Force the grace period if too many callbacks or too long waiting. */
>  	if (unlikely(++rdp->qlen > qhimark)) {
>  		rdp->blimit = LONG_MAX;
> -		force_quiescent_state(rsp, 0);
>  	} else if ((long)(ACCESS_ONCE(rsp->jiffies_force_qs) - jiffies) < 0)
>  		force_quiescent_state(rsp, 1);
>  	local_irq_restore(flags);
> -- 
> 1.5.2.5


* Re: tree rcu: call_rcu scalability problem?
  2009-09-03  9:01       ` Nick Piggin
@ 2009-09-03 13:28         ` Paul E. McKenney
  0 siblings, 0 replies; 14+ messages in thread
From: Paul E. McKenney @ 2009-09-03 13:28 UTC
  To: Nick Piggin; +Cc: Peter Zijlstra, Linux Kernel Mailing List

On Thu, Sep 03, 2009 at 11:01:26AM +0200, Nick Piggin wrote:
> On Wed, Sep 02, 2009 at 10:14:27PM -0700, Paul E. McKenney wrote:
> > From 0544d2da54bad95556a320e57658e244cb2ae8c6 Mon Sep 17 00:00:00 2001
> > From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Date: Wed, 2 Sep 2009 22:01:50 -0700
> > Subject: [PATCH] Remove grace-period machinery from rcutree __call_rcu()
> > 
> > The grace-period machinery in __call_rcu() was a failed attempt to avoid
> > implementing synchronize_rcu_expedited().  But now that this attempt has
> > failed, try removing the machinery.
> 
> OK, the workload is parallel processes performing a close(open()) loop
> in a tmpfs filesystem within different cwds (to avoid contention on the
> cwd dentry). The kernel is first patched with my vfs scalability patches,
> so the comparison is with/without Paul's rcu patch.
> 
> System is 2s8c opteron, with processes bound to CPUs (first within the
> same socket, then over both sockets as count increases).
> 
> procs  tput-base          tput-rcu
> 1         595238 (x1.00)    645161 (x1.00)
> 2        1041666 (x1.75)   1136363 (x1.76)
> 4        1960784 (x3.29)   2298850 (x3.56)
> 8        3636363 (x6.11)   4545454 (x7.05)
> 
> Scalability is improved (from 2-8 way it is now actually linear), and
> single thread performance is significantly improved too.
> 
> oprofile results collecting clk unhalted samples shows the following
> results for __call_rcu symbol:
> 
> procs  samples  %        app name                 symbol name
> tput-base
> 1      12153     3.8122  vmlinux                  __call_rcu
> 2      29253     3.9899  vmlinux                  __call_rcu
> 4      84503     5.4667  vmlinux                  __call_rcu
> 8      312816    9.5287  vmlinux                  __call_rcu
> 
> tput-rcu
> 1      8722      2.8770  vmlinux                  __call_rcu
> 2      17275     2.5804  vmlinux                  __call_rcu
> 4      33848     2.6015  vmlinux                  __call_rcu
> 8      67158     2.5561  vmlinux                  __call_rcu
> 
> Scaling is clearly much better (it is more important to look at absolute
> samples because %age is dependent on other parts of the kernel too).
> 
> Feel free to add any of this to your changelog if you think it's important.

Very cool!!!

I got a dissenting view from the people trying to get rid of interrupts
in computational workloads.  But I believe that it is possible to
split the difference, getting you almost all the performance benefits
while still permitting them to turn off the scheduling-clock interrupt.
The reason that I believe it should get you the performance benefits is
that deleting the rcu_process_gp_end() and check_for_new_grace_period()
didn't do much for you.  Their overhead is quite small compared to
hammering the system with a full set of IPIs every ten microseconds
or so.  ;-)

So could you please give the following experimental patch a go?
If it works for you, I will put together a production-ready patch
along these lines.

							Thanx, Paul

------------------------------------------------------------------------

From 57b7f98303a5c5aa50648c71758760006af49bab Mon Sep 17 00:00:00 2001
From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date: Thu, 3 Sep 2009 06:19:45 -0700
Subject: [PATCH] Reduce grace-period-encouragement impact on rcutree __call_rcu()

Remove only the emergency force_quiescent_state() from __call_rcu(),
which should get most of the reduction in overhead while still
allowing the tick to be turned off when non-idle, as proposed in
http://lkml.org/lkml/2009/9/1/229, and which reduced interrupts to
one per ten seconds in a CPU-bound computational workload according to
http://lkml.org/lkml/2009/9/3/7.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcutree.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index d2a372f..4c8e0d2 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1220,7 +1220,6 @@ __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
 	/* Force the grace period if too many callbacks or too long waiting. */
 	if (unlikely(++rdp->qlen > qhimark)) {
 		rdp->blimit = LONG_MAX;
-		force_quiescent_state(rsp, 0);
 	} else if ((long)(ACCESS_ONCE(rsp->jiffies_force_qs) - jiffies) < 0)
 		force_quiescent_state(rsp, 1);
 	local_irq_restore(flags);
-- 
1.5.2.5



Thread overview: 14+ messages
2009-09-02  9:48 tree rcu: call_rcu scalability problem? Nick Piggin
2009-09-02 12:27 ` Nick Piggin
2009-09-02 15:19   ` Paul E. McKenney
2009-09-02 16:24     ` Nick Piggin
2009-09-02 16:37       ` Paul E. McKenney
2009-09-02 16:45         ` Nick Piggin
2009-09-02 16:48           ` Paul E. McKenney
2009-09-02 17:50         ` Nick Piggin
2009-09-02 19:17   ` Peter Zijlstra
2009-09-03  5:14     ` Paul E. McKenney
2009-09-03  7:45       ` Nick Piggin
2009-09-03  9:01       ` Nick Piggin
2009-09-03 13:28         ` Paul E. McKenney
2009-09-03  7:14     ` Nick Piggin
