* tree rcu: call_rcu scalability problem?
@ 2009-09-02 9:48 Nick Piggin
2009-09-02 12:27 ` Nick Piggin
0 siblings, 1 reply; 14+ messages in thread
From: Nick Piggin @ 2009-09-02 9:48 UTC (permalink / raw)
To: Paul McKenney, Linux Kernel Mailing List
Hi Paul,
I'm testing out scalability of some vfs code paths, and I'm seeing
a problem with call_rcu. This is a 2s8c opteron system, so nothing
crazy.
I'll show you the profile results for 1-8 threads:
1:
29768 total 0.0076
15550 default_idle 48.5938
1340 __d_lookup 3.6413
954 __link_path_walk 0.2559
816 system_call_after_swapgs 8.0792
680 kmem_cache_alloc 1.4167
669 dput 1.1946
591 __call_rcu 2.0521
2:
56733 total 0.0145
20074 default_idle 62.7313
3075 __call_rcu 10.6771
2650 __d_lookup 7.2011
2019 dput 3.6054
4:
98889 total 0.0253
21759 default_idle 67.9969
10994 __call_rcu 38.1736
5185 __d_lookup 14.0897
4475 dput 7.9911
8:
170391 total 0.0437
31815 __call_rcu 110.4688
12958 dput 23.1393
10417 __d_lookup 28.3071
Of course there are other scalability factors involved too, but
__call_rcu is taking 54 times more CPU to do 8 times the amount
of work from 1-8 threads, or a factor of 6.7 slowdown.
This is with tree RCU.
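Spelling out that arithmetic from the sample counts above (the helper names below are mine, for illustration only — this is not kernel code):

```c
#include <assert.h>

/* oprofile sample counts for __call_rcu quoted above. */
static const double samples_1thread = 591.0;
static const double samples_8thread = 31815.0;

/* Ratio of total __call_rcu CPU cost, 8 threads vs 1 thread. */
static double cost_ratio(void)
{
	return samples_8thread / samples_1thread;	/* ~53.8x */
}

/* Per-unit-of-work slowdown: the 8-thread run does 8x the work. */
static double per_work_slowdown(void)
{
	return cost_ratio() / 8.0;			/* ~6.7x */
}
```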
#
# RCU Subsystem
#
# CONFIG_CLASSIC_RCU is not set
CONFIG_TREE_RCU=y
# CONFIG_PREEMPT_RCU is not set
# CONFIG_RCU_TRACE is not set
CONFIG_RCU_FANOUT=64
# CONFIG_RCU_FANOUT_EXACT is not set
# CONFIG_TREE_RCU_TRACE is not set
# CONFIG_PREEMPT_RCU_TRACE is not set
Testing classic RCU showed that its call_rcu seemed to scale better, only
getting up to about 10,000 samples at 8 threads.
You'd need my vfs scalability patches to reproduce this exactly, but
the workload is just close(open(fd)), which RCU-frees a lot of file
structs. I can certainly get more detailed profiles or test patches
for you though if you have any ideas.
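For reference, the inner loop of the workload is roughly the following sketch; the path and iteration count are placeholders (the real test used one file per process):

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Minimal sketch of the workload: spin on close(open(...)) so that
 * every iteration allocates a struct file which is later RCU-freed
 * when the final fput() drops the reference. */
static long open_close_loop(const char *path, long iters)
{
	long done;

	for (done = 0; done < iters; done++) {
		int fd = open(path, O_RDONLY);

		if (fd < 0)
			break;
		close(fd);	/* final fput() queues an RCU callback */
	}
	return done;
}
```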
Thanks,
Nick
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: tree rcu: call_rcu scalability problem?
2009-09-02 9:48 tree rcu: call_rcu scalability problem? Nick Piggin
@ 2009-09-02 12:27 ` Nick Piggin
2009-09-02 15:19 ` Paul E. McKenney
2009-09-02 19:17 ` Peter Zijlstra
0 siblings, 2 replies; 14+ messages in thread
From: Nick Piggin @ 2009-09-02 12:27 UTC (permalink / raw)
To: Paul McKenney, Linux Kernel Mailing List
On Wed, Sep 02, 2009 at 11:48:35AM +0200, Nick Piggin wrote:
> Hi Paul,
>
> I'm testing out scalability of some vfs code paths, and I'm seeing
> a problem with call_rcu. This is a 2s8c opteron system, so nothing
> crazy.
>
> I'll show you the profile results for 1-8 threads:
>
> [ profile data snipped ]
>
> Of course there are other scalability factors involved too, but
> __call_rcu is taking 54 times more CPU to do 8 times the amount
> of work from 1-8 threads, or a factor of 6.7 slowdown.
>
> This is with tree RCU.
It seems like nearly 2/3 of the cost is here:
/* Add the callback to our list. */
*rdp->nxttail[RCU_NEXT_TAIL] = head; <<<
rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
In loading the pointer to the next tail pointer, if I'm reading the
profile correctly. I can't see why that should be a problem though...
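For reference, those two lines are the standard tail-pointer append for a singly linked list. A self-contained sketch of the idiom (simplified — the real code indexes a per-CPU nxttail[] array rather than a single tail field):

```c
#include <stddef.h>

struct cb {
	struct cb *next;
};

struct cb_list {
	struct cb *head;
	struct cb **tail;	/* points at head, or at last node's ->next */
};

static void cb_list_init(struct cb_list *l)
{
	l->head = NULL;
	l->tail = &l->head;
}

/* The same two-step append __call_rcu does: store the new node
 * through the tail pointer, then advance the tail pointer to the
 * new node's ->next field.  The load of l->tail is the access the
 * profile attributes the cost to. */
static void cb_enqueue(struct cb_list *l, struct cb *node)
{
	node->next = NULL;
	*l->tail = node;
	l->tail = &node->next;
}
```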
ffffffff8107dee0 <__call_rcu>: /* __call_rcu total: 320971 100.000 */
697 0.2172 :ffffffff8107dee0: push %r12
228 0.0710 :ffffffff8107dee2: push %rbp
133 0.0414 :ffffffff8107dee3: mov %rdx,%rbp
918 0.2860 :ffffffff8107dee6: push %rbx
316 0.0985 :ffffffff8107dee7: mov %rsi,0x8(%rdi)
257 0.0801 :ffffffff8107deeb: movq $0x0,(%rdi)
1660 0.5172 :ffffffff8107def2: mfence
27730 8.6394 :ffffffff8107def5: pushfq
13153 4.0979 :ffffffff8107def6: pop %r12
903 0.2813 :ffffffff8107def8: cli
2562 0.7982 :ffffffff8107def9: mov %gs:0xde68,%eax
1784 0.5558 :ffffffff8107df01: cltq
:ffffffff8107df03: mov 0x60(%rdx,%rax,8),%rbx
:ffffffff8107df08: pushfq
3494 1.0886 :ffffffff8107df09: pop %rdx
896 0.2792 :ffffffff8107df0a: cli
2655 0.8272 :ffffffff8107df0b: mov 0xd0(%rbp),%rcx
1800 0.5608 :ffffffff8107df12: cmp (%rbx),%rcx
21 0.0065 :ffffffff8107df15: je ffffffff8107df32 <__call_rcu+0x52
:ffffffff8107df17: mov 0x40(%rbx),%rax
81 0.0252 :ffffffff8107df1b: mov %rcx,(%rbx)
3 9.3e-04 :ffffffff8107df1e: mov %rax,0x38(%rbx)
:ffffffff8107df22: mov 0x48(%rbx),%rax
:ffffffff8107df26: mov %rax,0x40(%rbx)
:ffffffff8107df2a: mov 0x50(%rbx),%rax
:ffffffff8107df2e: mov %rax,0x48(%rbx)
:ffffffff8107df32: push %rdx
1194 0.3720 :ffffffff8107df33: popfq
9518 2.9654 :ffffffff8107df34: pushfq
4179 1.3020 :ffffffff8107df35: pop %rdx
1277 0.3979 :ffffffff8107df36: cli
2546 0.7932 :ffffffff8107df37: mov 0xc8(%rbp),%rax
1748 0.5446 :ffffffff8107df3e: cmp %rax,0x8(%rbx)
5 0.0016 :ffffffff8107df42: je ffffffff8107df57 <__call_rcu+0x77
:ffffffff8107df44: movb $0x1,0x19(%rbx)
2 6.2e-04 :ffffffff8107df48: movb $0x0,0x18(%rbx)
:ffffffff8107df4c: mov 0xc8(%rbp),%rax
:ffffffff8107df53: mov %rax,0x8(%rbx)
921 0.2869 :ffffffff8107df57: push %rdx
151 0.0470 :ffffffff8107df58: popfq
183507 57.1725 :ffffffff8107df59: mov 0x50(%rbx),%rax
995 0.3100 :ffffffff8107df5d: mov %rdi,(%rax)
2 6.2e-04 :ffffffff8107df60: mov %rdi,0x50(%rbx)
18 0.0056 :ffffffff8107df64: mov 0xd0(%rbp),%rdx
940 0.2929 :ffffffff8107df6b: mov 0xc8(%rbp),%rax
15 0.0047 :ffffffff8107df72: cmp %rax,%rdx
1 3.1e-04 :ffffffff8107df75: je ffffffff8107dfb0 <__call_rcu+0xd0
787 0.2452 :ffffffff8107df77: mov 0x58(%rbx),%rax
58 0.0181 :ffffffff8107df7b: inc %rax
2 6.2e-04 :ffffffff8107df7e: mov %rax,0x58(%rbx)
1679 0.5231 :ffffffff8107df82: movslq 0x4988fb(%rip),%rdx # ffff
40 0.0125 :ffffffff8107df89: cmp %rdx,%rax
5 0.0016 :ffffffff8107df8c: jg ffffffff8107dfd7 <__call_rcu+0xf7
588 0.1832 :ffffffff8107df8e: mov 0xe0(%rbp),%rdx
84 0.0262 :ffffffff8107df95: mov 0x51f924(%rip),%rax # ffff
5 0.0016 :ffffffff8107df9c: cmp %rax,%rdx
505 0.1573 :ffffffff8107df9f: js ffffffff8107dfc8 <__call_rcu+0xe8
17580 5.4771 :ffffffff8107dfa1: push %r12
1671 0.5206 :ffffffff8107dfa3: popfq
24201 7.5399 :ffffffff8107dfa4: pop %rbx
1367 0.4259 :ffffffff8107dfa5: pop %rbp
377 0.1175 :ffffffff8107dfa6: pop %r12
:ffffffff8107dfa8: retq
:ffffffff8107dfa9: nopl 0x0(%rax)
:ffffffff8107dfb0: mov %rbp,%rdi
:ffffffff8107dfb3: callq ffffffff813be930 <_spin_lock_irqs
12 0.0037 :ffffffff8107dfb8: mov %rbp,%rdi
:ffffffff8107dfbb: mov %rax,%rsi
:ffffffff8107dfbe: callq ffffffff8107d8e0 <rcu_start_gp>
:ffffffff8107dfc3: jmp ffffffff8107df77 <__call_rcu+0x97
:ffffffff8107dfc5: nopl (%rax)
:ffffffff8107dfc8: mov $0x1,%esi
10 0.0031 :ffffffff8107dfcd: mov %rbp,%rdi
:ffffffff8107dfd0: callq ffffffff8107dd50 <force_quiescent
1 3.1e-04 :ffffffff8107dfd5: jmp ffffffff8107dfa1 <__call_rcu+0xc1
451 0.1405 :ffffffff8107dfd7: mov $0x7fffffffffffffff,%rdx
411 0.1280 :ffffffff8107dfe1: xor %esi,%esi
:ffffffff8107dfe3: mov %rbp,%rdi
:ffffffff8107dfe6: mov %rdx,0x60(%rbx)
317 0.0988 :ffffffff8107dfea: callq ffffffff8107dd50 <force_quiescent
4510 1.4051 :ffffffff8107dfef: jmp ffffffff8107dfa1 <__call_rcu+0xc1
:ffffffff8107dff1: nopw %cs:0x0(%rax,%rax,1)
* Re: tree rcu: call_rcu scalability problem?
2009-09-02 12:27 ` Nick Piggin
@ 2009-09-02 15:19 ` Paul E. McKenney
2009-09-02 16:24 ` Nick Piggin
2009-09-02 19:17 ` Peter Zijlstra
1 sibling, 1 reply; 14+ messages in thread
From: Paul E. McKenney @ 2009-09-02 15:19 UTC (permalink / raw)
To: Nick Piggin; +Cc: Linux Kernel Mailing List
On Wed, Sep 02, 2009 at 02:27:56PM +0200, Nick Piggin wrote:
> On Wed, Sep 02, 2009 at 11:48:35AM +0200, Nick Piggin wrote:
> > Hi Paul,
> >
> > I'm testing out scalability of some vfs code paths, and I'm seeing
> > a problem with call_rcu. This is a 2s8c opteron system, so nothing
> > crazy.
> >
> > I'll show you the profile results for 1-8 threads:
> >
> > 1:
> > 29768 total 0.0076
> > 15550 default_idle 48.5938
> > 1340 __d_lookup 3.6413
> > 954 __link_path_walk 0.2559
> > 816 system_call_after_swapgs 8.0792
> > 680 kmem_cache_alloc 1.4167
> > 669 dput 1.1946
> > 591 __call_rcu 2.0521
> >
> > 2:
> > 56733 total 0.0145
> > 20074 default_idle 62.7313
> > 3075 __call_rcu 10.6771
> > 2650 __d_lookup 7.2011
> > 2019 dput 3.6054
> >
> > 4:
> > 98889 total 0.0253
> > 21759 default_idle 67.9969
> > 10994 __call_rcu 38.1736
> > 5185 __d_lookup 14.0897
> > 4475 dput 7.9911
Four threads run on one socket but 8 threads run on two sockets,
I take it?
> > 8:
> > 170391 total 0.0437
> > 31815 __call_rcu 110.4688
> > 12958 dput 23.1393
> > 10417 __d_lookup 28.3071
> >
> > Of course there are other scalability factors involved too, but
> > __call_rcu is taking 54 times more CPU to do 8 times the amount
> > of work from 1-8 threads, or a factor of 6.7 slowdown.
> >
> > This is with tree RCU.
>
> It seems like nearly 2/3 of the cost is here:
> /* Add the callback to our list. */
> *rdp->nxttail[RCU_NEXT_TAIL] = head; <<<
> rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
Hmmm... That certainly is not the first line of code in call_rcu() that
would come to mind...
> In loading the pointer to the next tail pointer, if I'm reading the profile
> correctly. I can't see why that should be a problem though...
The usual diagnosis would be false sharing.
Hmmm... What is the workload? CPU-bound? If CONFIG_PREEMPT=n, I might
expect interference from force_quiescent_state(), except that it should
run only every few clock ticks. So this seems quite unlikely.
Could you please try padding the beginning and end of struct rcu_data
with a few hundred bytes and rerunning? Just in case there is a shared
per-CPU variable either before or after rcu_data in your memory layout?
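The suggested experiment would look something like the sketch below; the field layout is a placeholder, not the real struct rcu_data, and the 64-byte line size is an assumption for this Opteron:

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE_BYTES 64	/* assumed line size on this machine */

/* Pad both ends of the per-CPU structure so that no neighbouring
 * per-CPU variable can share a cache line with it. */
struct rcu_data_padded {
	char pad_before[4 * CACHELINE_BYTES];
	long qlen;		/* stand-ins for the real fields */
	void **nxttail;
	char pad_after[4 * CACHELINE_BYTES];
};
```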
Thanx, Paul
> [ __call_rcu disassembly snipped ]
* Re: tree rcu: call_rcu scalability problem?
2009-09-02 15:19 ` Paul E. McKenney
@ 2009-09-02 16:24 ` Nick Piggin
2009-09-02 16:37 ` Paul E. McKenney
0 siblings, 1 reply; 14+ messages in thread
From: Nick Piggin @ 2009-09-02 16:24 UTC (permalink / raw)
To: Paul E. McKenney; +Cc: Linux Kernel Mailing List
On Wed, Sep 02, 2009 at 08:19:27AM -0700, Paul E. McKenney wrote:
> On Wed, Sep 02, 2009 at 02:27:56PM +0200, Nick Piggin wrote:
> > On Wed, Sep 02, 2009 at 11:48:35AM +0200, Nick Piggin wrote:
> > > Hi Paul,
> > >
> > > I'm testing out scalability of some vfs code paths, and I'm seeing
> > > a problem with call_rcu. This is a 2s8c opteron system, so nothing
> > > crazy.
> > >
> > > I'll show you the profile results for 1-8 threads:
> > >
> > > [ profile data snipped ]
>
> Four threads run on one socket but 8 threads run on two sockets,
> I take it?
Yes.
> > > [ profile data snipped ]
> > >
> > > Of course there are other scalability factors involved too, but
> > > __call_rcu is taking 54 times more CPU to do 8 times the amount
> > > of work from 1-8 threads, or a factor of 6.7 slowdown.
> > >
> > > This is with tree RCU.
> >
> > It seems like nearly 2/3 of the cost is here:
> > /* Add the callback to our list. */
> > *rdp->nxttail[RCU_NEXT_TAIL] = head; <<<
> > rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
>
> Hmmm... That certainly is not the first line of code in call_rcu() that
> would come to mind...
It's weird. I *think* I read the asm right, but maybe oprofile is
not attributing the cost to the right instruction.
> > In loading the pointer to the next tail pointer, if I'm reading the profile
> > correctly. I can't see why that should be a problem though...
>
> The usual diagnosis would be false sharing.
Hmm that's possible yes.
> Hmmm... What is the workload? CPU-bound? If CONFIG_PREEMPT=n, I might
> expect interference from force_quiescent_state(), except that it should
> run only every few clock ticks. So this seems quite unlikely.
It's CPU bound and preempt=y.
Workload is just 8 processes running a loop of close(open("file$i")),
as I said, though you probably won't be able to reproduce it on a
vanilla kernel.
> Could you please try padding the beginning and end of struct rcu_data
> with a few hundred bytes and rerunning? Just in case there is a shared
> per-CPU variable either before or after rcu_data in your memory layout?
OK I'll try that.
* Re: tree rcu: call_rcu scalability problem?
2009-09-02 16:24 ` Nick Piggin
@ 2009-09-02 16:37 ` Paul E. McKenney
2009-09-02 16:45 ` Nick Piggin
2009-09-02 17:50 ` Nick Piggin
0 siblings, 2 replies; 14+ messages in thread
From: Paul E. McKenney @ 2009-09-02 16:37 UTC (permalink / raw)
To: Nick Piggin; +Cc: Linux Kernel Mailing List
On Wed, Sep 02, 2009 at 06:24:51PM +0200, Nick Piggin wrote:
> On Wed, Sep 02, 2009 at 08:19:27AM -0700, Paul E. McKenney wrote:
> > On Wed, Sep 02, 2009 at 02:27:56PM +0200, Nick Piggin wrote:
> > > On Wed, Sep 02, 2009 at 11:48:35AM +0200, Nick Piggin wrote:
> [ ... ]
>
> > Hmmm... What is the workload? CPU-bound? If CONFIG_PREEMPT=n, I might
> > expect interference from force_quiescent_state(), except that it should
> > run only every few clock ticks. So this seems quite unlikely.
>
> It's CPU bound and preempt=y.
>
> Workload is just 8 processes running a loop of close(open("file$i")) as
> I said though you probably won't be able to reproduce it on a vanilla
> kernel.
OK, so you are executing call_rcu() a -lot-!!!
Could you also please try CONFIG_RCU_TRACE=y, and send me the contents of
the files in the "rcu" subdirectory in debugfs? Please take a snapshot
of these files, run your test for a fixed time interval (perhaps ten
seconds, but please tell me how long), then take a second snapshot.
Thanx, Paul
> > Could you please try padding the beginning and end of struct rcu_data
> > with a few hundred bytes and rerunning? Just in case there is a shared
> > per-CPU variable either before or after rcu_data in your memory layout?
>
> OK I'll try that.
>
* Re: tree rcu: call_rcu scalability problem?
2009-09-02 16:37 ` Paul E. McKenney
@ 2009-09-02 16:45 ` Nick Piggin
2009-09-02 16:48 ` Paul E. McKenney
2009-09-02 17:50 ` Nick Piggin
1 sibling, 1 reply; 14+ messages in thread
From: Nick Piggin @ 2009-09-02 16:45 UTC (permalink / raw)
To: Paul E. McKenney; +Cc: Linux Kernel Mailing List
On Wed, Sep 02, 2009 at 09:37:05AM -0700, Paul E. McKenney wrote:
> On Wed, Sep 02, 2009 at 06:24:51PM +0200, Nick Piggin wrote:
> > On Wed, Sep 02, 2009 at 08:19:27AM -0700, Paul E. McKenney wrote:
> > > On Wed, Sep 02, 2009 at 02:27:56PM +0200, Nick Piggin wrote:
> > > > On Wed, Sep 02, 2009 at 11:48:35AM +0200, Nick Piggin wrote:
> > [ ... ]
> > > Hmmm... What is the workload? CPU-bound? If CONFIG_PREEMPT=n, I might
> > > expect interference from force_quiescent_state(), except that it should
> > > run only every few clock ticks. So this seems quite unlikely.
> >
> > It's CPU bound and preempt=y.
> >
> > Workload is just 8 processes running a loop of close(open("file$i")) as
> > I said though you probably won't be able to reproduce it on a vanilla
> > kernel.
>
> OK, so you are executing call_rcu() a -lot-!!!
Oh yeah. The combined frequency in the 8 proc case is once per 270ns, so
nearly half a million times per second per core.
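Those numbers check out: one call per 270 ns combined is about 3.7 million calls per second, or roughly 463,000 per second on each of the 8 cores (the helpers below are just for the arithmetic, not kernel code):

```c
#include <assert.h>

/* One call_rcu() every 270 ns summed across all CPUs. */
static double calls_per_sec_total(double ns_between_calls)
{
	return 1e9 / ns_between_calls;		/* ~3.7 million/s */
}

static double calls_per_sec_per_cpu(double ns_between_calls, int ncpus)
{
	return calls_per_sec_total(ns_between_calls) / ncpus;
}
```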
It's not *slow* by any means, but it is increasing much faster than other
functions on the profile, so I just want to understand what is happening.
> Could you also please try CONFIG_RCU_TRACE=y, and send me the contents of
> the files in the "rcu" subdirectory in debugfs? Please take a snapshot
> of these files, run your test for a fixed time interval (perhaps ten
> seconds, but please tell me how long), then take a second snapshot.
Will do.
* Re: tree rcu: call_rcu scalability problem?
2009-09-02 16:45 ` Nick Piggin
@ 2009-09-02 16:48 ` Paul E. McKenney
0 siblings, 0 replies; 14+ messages in thread
From: Paul E. McKenney @ 2009-09-02 16:48 UTC (permalink / raw)
To: Nick Piggin; +Cc: Linux Kernel Mailing List
On Wed, Sep 02, 2009 at 06:45:04PM +0200, Nick Piggin wrote:
> On Wed, Sep 02, 2009 at 09:37:05AM -0700, Paul E. McKenney wrote:
> > On Wed, Sep 02, 2009 at 06:24:51PM +0200, Nick Piggin wrote:
> > > On Wed, Sep 02, 2009 at 08:19:27AM -0700, Paul E. McKenney wrote:
> > > > On Wed, Sep 02, 2009 at 02:27:56PM +0200, Nick Piggin wrote:
> > > > > On Wed, Sep 02, 2009 at 11:48:35AM +0200, Nick Piggin wrote:
> > > [ ... ]
> >
> > OK, so you are executing call_rcu() a -lot-!!!
>
> Oh yeah. The combined frequency in the 8 proc case is once per 270ns, so
> nearly half a million times per second per core.
Woo-hoo!!! ;-)
> It's not *slow* by any means, but it is increasing much faster than other
> functions on the profile so I just want to understand what is happening.
Understood and agreed.
> > Could you also please try CONFIG_RCU_TRACE=y, and send me the contents of
> > the files in the "rcu" subdirectory in debugfs? Please take a snapshot
> > of these files, run your test for a fixed time interval (perhaps ten
> > seconds, but please tell me how long), then take a second snapshot.
>
> Will do.
Thanx, Paul
* Re: tree rcu: call_rcu scalability problem?
2009-09-02 16:37 ` Paul E. McKenney
2009-09-02 16:45 ` Nick Piggin
@ 2009-09-02 17:50 ` Nick Piggin
1 sibling, 0 replies; 14+ messages in thread
From: Nick Piggin @ 2009-09-02 17:50 UTC (permalink / raw)
To: Paul E. McKenney; +Cc: Linux Kernel Mailing List
On Wed, Sep 02, 2009 at 09:37:05AM -0700, Paul E. McKenney wrote:
> On Wed, Sep 02, 2009 at 06:24:51PM +0200, Nick Piggin wrote:
> > > > In loading the pointer to the next tail pointer, if I'm reading the profile
> > > > correctly. I can't see why that should be a problem though...
> > >
> > > The usual diagnosis would be false sharing.
> >
> > Hmm that's possible yes.
OK, padding 64 bytes (cacheline size) at the start and end of struct rcu_data
does not help.
I wonder if the cycles aren't being attributed to the right instruction?
The interesting thing is that this queueing part seems to be the same in
rcuclassic too, which seems to run faster.
I'll try to run it on a bigger machine and see if it becomes more
pronounced. But I might not get around to that tonight.
> > > Hmmm... What is the workload? CPU-bound? If CONFIG_PREEMPT=n, I might
> > > expect interference from force_quiescent_state(), except that it should
> > > run only every few clock ticks. So this seems quite unlikely.
> >
> > It's CPU bound and preempt=y.
> >
> > Workload is just 8 processes running a loop of close(open("file$i")) as
> > I said though you probably won't be able to reproduce it on a vanilla
> > kernel.
>
> OK, so you are executing call_rcu() a -lot-!!!
>
> Could you also please try CONFIG_RCU_TRACE=y, and send me the contents of
> the files in the "rcu" subdirectory in debugfs? Please take a snapshot
> of these files, run your test for a fixed time interval (perhaps ten
> seconds, but please tell me how long), then take a second snapshot.
Attached, old/* vs new/*. Interval was 22s.
* Re: tree rcu: call_rcu scalability problem?
2009-09-02 12:27 ` Nick Piggin
2009-09-02 15:19 ` Paul E. McKenney
@ 2009-09-02 19:17 ` Peter Zijlstra
2009-09-03 5:14 ` Paul E. McKenney
2009-09-03 7:14 ` Nick Piggin
1 sibling, 2 replies; 14+ messages in thread
From: Peter Zijlstra @ 2009-09-02 19:17 UTC (permalink / raw)
To: Nick Piggin; +Cc: Paul McKenney, Linux Kernel Mailing List
On Wed, 2009-09-02 at 14:27 +0200, Nick Piggin wrote:
> It seems like nearly 2/3 of the cost is here:
> /* Add the callback to our list. */
> *rdp->nxttail[RCU_NEXT_TAIL] = head; <<<
> rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
>
> In loading the pointer to the next tail pointer, if I'm reading the profile
> correctly. I can't see why that should be a problem though...
>
> ffffffff8107dee0 <__call_rcu>: /* __call_rcu total: 320971 100.000 */
> 697 0.2172 :ffffffff8107dee0: push %r12
> 921 0.2869 :ffffffff8107df57: push %rdx
> 151 0.0470 :ffffffff8107df58: popfq
> 183507 57.1725 :ffffffff8107df59: mov 0x50(%rbx),%rax
> 995 0.3100 :ffffffff8107df5d: mov %rdi,(%rax)
I'd guess popfq is the expensive op here... skid usually causes the
attribution to land a few ops down the line.
* Re: tree rcu: call_rcu scalability problem?
2009-09-02 19:17 ` Peter Zijlstra
@ 2009-09-03 5:14 ` Paul E. McKenney
2009-09-03 7:45 ` Nick Piggin
2009-09-03 9:01 ` Nick Piggin
2009-09-03 7:14 ` Nick Piggin
1 sibling, 2 replies; 14+ messages in thread
From: Paul E. McKenney @ 2009-09-03 5:14 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Nick Piggin, Linux Kernel Mailing List
On Wed, Sep 02, 2009 at 09:17:44PM +0200, Peter Zijlstra wrote:
> On Wed, 2009-09-02 at 14:27 +0200, Nick Piggin wrote:
>
> > It seems like nearly 2/3 of the cost is here:
> > /* Add the callback to our list. */
> > *rdp->nxttail[RCU_NEXT_TAIL] = head; <<<
> > rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
> >
> > In loading the pointer to the next tail pointer. If I'm reading the profile
> > correctly. Can't see why that should be a problem though...
> >
> > ffffffff8107dee0 <__call_rcu>: /* __call_rcu total: 320971 100.000 */
> > 697 0.2172 :ffffffff8107dee0: push %r12
>
> > 921 0.2869 :ffffffff8107df57: push %rdx
> > 151 0.0470 :ffffffff8107df58: popfq
> > 183507 57.1725 :ffffffff8107df59: mov 0x50(%rbx),%rax
> > 995 0.3100 :ffffffff8107df5d: mov %rdi,(%rax)
>
> I'd guess popfq is the expensive op here; skid usually causes the
> attribution to land a few ops down the line.
I believe that Nick's workload is routinely driving the number of
callbacks queued on a given CPU above 10,000, which would provoke numerous
(and possibly inlined) calls to force_quiescent_state(). Like about
400,000 such calls per second. Hey, I was naively assuming that no one
would see more than 10,000 callbacks queued on a single CPU unless there
was some sort of major emergency underway, and coded accordingly. ;-)
I offer the attached experimental (untested, might not even compile) patch.
Thanx, Paul
------------------------------------------------------------------------
From 0544d2da54bad95556a320e57658e244cb2ae8c6 Mon Sep 17 00:00:00 2001
From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date: Wed, 2 Sep 2009 22:01:50 -0700
Subject: [PATCH] Remove grace-period machinery from rcutree __call_rcu()
The grace-period machinery in __call_rcu() was a failed attempt to avoid
implementing synchronize_rcu_expedited(). But now that this attempt has
failed, try removing the machinery.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
kernel/rcutree.c | 12 ------------
1 files changed, 0 insertions(+), 12 deletions(-)
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index d2a372f..104de9e 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1201,26 +1201,14 @@ __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
*/
local_irq_save(flags);
rdp = rsp->rda[smp_processor_id()];
- rcu_process_gp_end(rsp, rdp);
- check_for_new_grace_period(rsp, rdp);
/* Add the callback to our list. */
*rdp->nxttail[RCU_NEXT_TAIL] = head;
rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
- /* Start a new grace period if one not already started. */
- if (ACCESS_ONCE(rsp->completed) == ACCESS_ONCE(rsp->gpnum)) {
- unsigned long nestflag;
- struct rcu_node *rnp_root = rcu_get_root(rsp);
-
- spin_lock_irqsave(&rnp_root->lock, nestflag);
- rcu_start_gp(rsp, nestflag); /* releases rnp_root->lock. */
- }
-
/* Force the grace period if too many callbacks or too long waiting. */
if (unlikely(++rdp->qlen > qhimark)) {
rdp->blimit = LONG_MAX;
- force_quiescent_state(rsp, 0);
} else if ((long)(ACCESS_ONCE(rsp->jiffies_force_qs) - jiffies) < 0)
force_quiescent_state(rsp, 1);
local_irq_restore(flags);
--
1.5.2.5
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: tree rcu: call_rcu scalability problem?
2009-09-02 19:17 ` Peter Zijlstra
2009-09-03 5:14 ` Paul E. McKenney
@ 2009-09-03 7:14 ` Nick Piggin
1 sibling, 0 replies; 14+ messages in thread
From: Nick Piggin @ 2009-09-03 7:14 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Paul McKenney, Linux Kernel Mailing List
On Wed, Sep 02, 2009 at 09:17:44PM +0200, Peter Zijlstra wrote:
> On Wed, 2009-09-02 at 14:27 +0200, Nick Piggin wrote:
>
> > It seems like nearly 2/3 of the cost is here:
> > /* Add the callback to our list. */
> > *rdp->nxttail[RCU_NEXT_TAIL] = head; <<<
> > rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
> >
> > In loading the pointer to the next tail pointer. If I'm reading the profile
> > correctly. Can't see why that should be a problem though...
> >
> > ffffffff8107dee0 <__call_rcu>: /* __call_rcu total: 320971 100.000 */
> > 697 0.2172 :ffffffff8107dee0: push %r12
>
> > 921 0.2869 :ffffffff8107df57: push %rdx
> > 151 0.0470 :ffffffff8107df58: popfq
> > 183507 57.1725 :ffffffff8107df59: mov 0x50(%rbx),%rax
> > 995 0.3100 :ffffffff8107df5d: mov %rdi,(%rax)
>
> I'd guess popfq is the expensive op here; skid usually causes the
> attribution to land a few ops down the line.
Well it doesn't really explain why it is getting more costly per
unit of work as core count increases. The popfq should run completely
out of L1 cache (or even from store forwarding) and it is expensive,
but those costs should remain the same regardless of how many cores
in the system.
But I am getting some strange-looking profiles, so I'm going to get
a bigger system to run it on and see whether we can get a really
pronounced scalability problem. Maybe I'll even make a microbenchmark
to eliminate other elements of the workload.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: tree rcu: call_rcu scalability problem?
2009-09-03 5:14 ` Paul E. McKenney
@ 2009-09-03 7:45 ` Nick Piggin
2009-09-03 9:01 ` Nick Piggin
1 sibling, 0 replies; 14+ messages in thread
From: Nick Piggin @ 2009-09-03 7:45 UTC (permalink / raw)
To: Paul E. McKenney; +Cc: Peter Zijlstra, Linux Kernel Mailing List
On Wed, Sep 02, 2009 at 10:14:27PM -0700, Paul E. McKenney wrote:
> On Wed, Sep 02, 2009 at 09:17:44PM +0200, Peter Zijlstra wrote:
> > On Wed, 2009-09-02 at 14:27 +0200, Nick Piggin wrote:
> >
> > > It seems like nearly 2/3 of the cost is here:
> > > /* Add the callback to our list. */
> > > *rdp->nxttail[RCU_NEXT_TAIL] = head; <<<
> > > rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
> > >
> > > In loading the pointer to the next tail pointer. If I'm reading the profile
> > > correctly. Can't see why that should be a problem though...
> > >
> > > ffffffff8107dee0 <__call_rcu>: /* __call_rcu total: 320971 100.000 */
> > > 697 0.2172 :ffffffff8107dee0: push %r12
> >
> > > 921 0.2869 :ffffffff8107df57: push %rdx
> > > 151 0.0470 :ffffffff8107df58: popfq
> > > 183507 57.1725 :ffffffff8107df59: mov 0x50(%rbx),%rax
> > > 995 0.3100 :ffffffff8107df5d: mov %rdi,(%rax)
> >
> > I'd guess popfq is the expensive op here; skid usually causes the
> > attribution to land a few ops down the line.
>
> I believe that Nick's workload is routinely driving the number of
> callbacks queued on a given CPU above 10,000, which would provoke numerous
> (and possibly inlined) calls to force_quiescent_state(). Like about
> 400,000 such calls per second. Hey, I was naively assuming that no one
> would see more than 10,000 callbacks queued on a single CPU unless there
> was some sort of major emergency underway, and coded accordingly. ;-)
>
> I offer the attached experimental (untested, might not even compile) patch.
Not only does it compile, but __call_rcu is now taking 1/10th the
cycles and absolute performance up nearly 20%. Looks like it is
now better than classic RCU.
I'll collect and post some more detailed numbers and profiles. Do
you want some new rcu trace results too?
Thanks,
Nick
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: tree rcu: call_rcu scalability problem?
2009-09-03 5:14 ` Paul E. McKenney
2009-09-03 7:45 ` Nick Piggin
@ 2009-09-03 9:01 ` Nick Piggin
2009-09-03 13:28 ` Paul E. McKenney
1 sibling, 1 reply; 14+ messages in thread
From: Nick Piggin @ 2009-09-03 9:01 UTC (permalink / raw)
To: Paul E. McKenney; +Cc: Peter Zijlstra, Linux Kernel Mailing List
On Wed, Sep 02, 2009 at 10:14:27PM -0700, Paul E. McKenney wrote:
> >From 0544d2da54bad95556a320e57658e244cb2ae8c6 Mon Sep 17 00:00:00 2001
> From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Date: Wed, 2 Sep 2009 22:01:50 -0700
> Subject: [PATCH] Remove grace-period machinery from rcutree __call_rcu()
>
> The grace-period machinery in __call_rcu() was a failed attempt to avoid
> implementing synchronize_rcu_expedited(). But now that this attempt has
> failed, try removing the machinery.
OK, the workload is parallel processes performing a close(open()) loop
in a tmpfs filesystem within different cwds (to avoid contention on the
cwd dentry). The kernel is first patched with my vfs scalability patches,
so the comparison is with/without Paul's rcu patch.
System is 2s8c opteron, with processes bound to CPUs (first within the
same socket, then over both sockets as count increases).
procs tput-base tput-rcu
1 595238 (x1.00) 645161 (x1.00)
2 1041666 (x1.75) 1136363 (x1.76)
4 1960784 (x3.29) 2298850 (x3.56)
8 3636363 (x6.11) 4545454 (x7.05)
Scalability is improved (from 2-8 way it is now actually linear), and
single thread performance is significantly improved too.
oprofile results collecting clk-unhalted samples show the following
for the __call_rcu symbol:
procs samples % app name symbol name
tput-base
1 12153 3.8122 vmlinux __call_rcu
2 29253 3.9899 vmlinux __call_rcu
4 84503 5.4667 vmlinux __call_rcu
8 312816 9.5287 vmlinux __call_rcu
tput-rcu
1 8722 2.8770 vmlinux __call_rcu
2 17275 2.5804 vmlinux __call_rcu
4 33848 2.6015 vmlinux __call_rcu
8 67158 2.5561 vmlinux __call_rcu
Scaling is clearly much better (it is more important to look at absolute
samples, because the percentage depends on other parts of the kernel too).
Feel free to add any of this to your changelog if you think it's important.
Thanks,
Nick
>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> ---
> kernel/rcutree.c | 12 ------------
> 1 files changed, 0 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index d2a372f..104de9e 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -1201,26 +1201,14 @@ __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
> */
> local_irq_save(flags);
> rdp = rsp->rda[smp_processor_id()];
> - rcu_process_gp_end(rsp, rdp);
> - check_for_new_grace_period(rsp, rdp);
>
> /* Add the callback to our list. */
> *rdp->nxttail[RCU_NEXT_TAIL] = head;
> rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
>
> - /* Start a new grace period if one not already started. */
> - if (ACCESS_ONCE(rsp->completed) == ACCESS_ONCE(rsp->gpnum)) {
> - unsigned long nestflag;
> - struct rcu_node *rnp_root = rcu_get_root(rsp);
> -
> - spin_lock_irqsave(&rnp_root->lock, nestflag);
> - rcu_start_gp(rsp, nestflag); /* releases rnp_root->lock. */
> - }
> -
> /* Force the grace period if too many callbacks or too long waiting. */
> if (unlikely(++rdp->qlen > qhimark)) {
> rdp->blimit = LONG_MAX;
> - force_quiescent_state(rsp, 0);
> } else if ((long)(ACCESS_ONCE(rsp->jiffies_force_qs) - jiffies) < 0)
> force_quiescent_state(rsp, 1);
> local_irq_restore(flags);
> --
> 1.5.2.5
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: tree rcu: call_rcu scalability problem?
2009-09-03 9:01 ` Nick Piggin
@ 2009-09-03 13:28 ` Paul E. McKenney
0 siblings, 0 replies; 14+ messages in thread
From: Paul E. McKenney @ 2009-09-03 13:28 UTC (permalink / raw)
To: Nick Piggin; +Cc: Peter Zijlstra, Linux Kernel Mailing List
On Thu, Sep 03, 2009 at 11:01:26AM +0200, Nick Piggin wrote:
> On Wed, Sep 02, 2009 at 10:14:27PM -0700, Paul E. McKenney wrote:
> > >From 0544d2da54bad95556a320e57658e244cb2ae8c6 Mon Sep 17 00:00:00 2001
> > From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Date: Wed, 2 Sep 2009 22:01:50 -0700
> > Subject: [PATCH] Remove grace-period machinery from rcutree __call_rcu()
> >
> > The grace-period machinery in __call_rcu() was a failed attempt to avoid
> > implementing synchronize_rcu_expedited(). But now that this attempt has
> > failed, try removing the machinery.
>
> OK, the workload is parallel processes performing a close(open()) loop
> in a tmpfs filesystem within different cwds (to avoid contention on the
> cwd dentry). The kernel is first patched with my vfs scalability patches,
> so the comparison is with/without Paul's rcu patch.
>
> System is 2s8c opteron, with processes bound to CPUs (first within the
> same socket, then over both sockets as count increases).
>
> procs tput-base tput-rcu
> 1 595238 (x1.00) 645161 (x1.00)
> 2 1041666 (x1.75) 1136363 (x1.76)
> 4 1960784 (x3.29) 2298850 (x3.56)
> 8 3636363 (x6.11) 4545454 (x7.05)
>
> Scalability is improved (from 2-8 way it is now actually linear), and
> single thread performance is significantly improved too.
>
> oprofile results collecting clk-unhalted samples show the following
> for the __call_rcu symbol:
>
> procs samples % app name symbol name
> tput-base
> 1 12153 3.8122 vmlinux __call_rcu
> 2 29253 3.9899 vmlinux __call_rcu
> 4 84503 5.4667 vmlinux __call_rcu
> 8 312816 9.5287 vmlinux __call_rcu
>
> tput-rcu
> 1 8722 2.8770 vmlinux __call_rcu
> 2 17275 2.5804 vmlinux __call_rcu
> 4 33848 2.6015 vmlinux __call_rcu
> 8 67158 2.5561 vmlinux __call_rcu
>
> Scaling is clearly much better (it is more important to look at absolute
> samples, because the percentage depends on other parts of the kernel too).
>
> Feel free to add any of this to your changelog if you think it's important.
Very cool!!!
I got a dissenting view from the people trying to get rid of interrupts
in computational workloads. But I believe that it is possible to
split the difference, getting you almost all the performance benefits
while still permitting them to turn off the scheduling-clock interrupt.
The reason I believe it should get you the performance benefits is
that deleting rcu_process_gp_end() and check_for_new_grace_period()
didn't do much for you: their overhead is quite small compared to
hammering the system with a full set of IPIs every ten microseconds
or so. ;-)
So could you please give the following experimental patch a go?
If it works for you, I will put together a production-ready patch
along these lines.
Thanx, Paul
------------------------------------------------------------------------
From 57b7f98303a5c5aa50648c71758760006af49bab Mon Sep 17 00:00:00 2001
From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date: Thu, 3 Sep 2009 06:19:45 -0700
Subject: [PATCH] Reduce grace-period-encouragement impact on rcutree __call_rcu()
Remove only the emergency force_quiescent_state() from __call_rcu(),
which should get most of the reduction in overhead while still
allowing the tick to be turned off when non-idle, as proposed in
http://lkml.org/lkml/2009/9/1/229, and which reduced interrupts to
one per ten seconds in a CPU-bound computational workload according to
http://lkml.org/lkml/2009/9/3/7.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
kernel/rcutree.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index d2a372f..4c8e0d2 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1220,7 +1220,6 @@ __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
/* Force the grace period if too many callbacks or too long waiting. */
if (unlikely(++rdp->qlen > qhimark)) {
rdp->blimit = LONG_MAX;
- force_quiescent_state(rsp, 0);
} else if ((long)(ACCESS_ONCE(rsp->jiffies_force_qs) - jiffies) < 0)
force_quiescent_state(rsp, 1);
local_irq_restore(flags);
--
1.5.2.5
^ permalink raw reply related [flat|nested] 14+ messages in thread
end of thread, other threads:[~2009-09-03 13:28 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-09-02 9:48 tree rcu: call_rcu scalability problem? Nick Piggin
2009-09-02 12:27 ` Nick Piggin
2009-09-02 15:19 ` Paul E. McKenney
2009-09-02 16:24 ` Nick Piggin
2009-09-02 16:37 ` Paul E. McKenney
2009-09-02 16:45 ` Nick Piggin
2009-09-02 16:48 ` Paul E. McKenney
2009-09-02 17:50 ` Nick Piggin
2009-09-02 19:17 ` Peter Zijlstra
2009-09-03 5:14 ` Paul E. McKenney
2009-09-03 7:45 ` Nick Piggin
2009-09-03 9:01 ` Nick Piggin
2009-09-03 13:28 ` Paul E. McKenney
2009-09-03 7:14 ` Nick Piggin