linux-kernel.vger.kernel.org archive mirror
* (no subject)
@ 2011-07-22  0:32 Jason Baron
  2011-07-22  0:57 ` Paul Turner
  0 siblings, 1 reply; 14+ messages in thread
From: Jason Baron @ 2011-07-22  0:32 UTC
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

rth@redhat.com
Bcc: 
Subject: Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead
 when bandwidth control is inactive
Reply-To: 
In-Reply-To: <20110721184758.403388616@google.com>

On Thu, Jul 21, 2011 at 09:43:42AM -0700, Paul Turner wrote:
> So I'm seeing some strange costs associated with jump_labels; while on paper
> the branches and instructions retired improves (as expected) we're taking an
> unexpected hit in IPC.
> 
> [From the initial mail we have workloads:
>   mkdir -p /cgroup/cpu/test
>   echo $$ > /dev/cgroup/cpu/test (only cpu,cpuacct mounted)
>   (W1) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "for ((i=0;i<5;i++)); do $(dirname $0)/pipe-test 20000; done"
>   (W2)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;true"
>   (W3)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;"
> ]
> 
> To make some of the figures more clear:
> 
> Legend:
> !BWC = tip + bwc, BWC compiled out
> BWC = tip + bwc
> BWC_JL = tip + bwc + jump label (this patch)
> 
> 
> Now, comparing under W1 we see:
> W1: BWC vs BWC_JL
>                             instructions            cycles                  branches              elapsed                
> ---------------------------------------------------------------------------------------------------------------------
> clovertown [BWC]            845934117               974222228               152715407             0.419014188 [baseline]
> +unconstrained              857963815 (+1.42)      1007152750 (+3.38)       153140328 (+0.28)     0.433186926 (+3.38)  [rel]
> +10000000000/1000:          876937753 (+2.55)      1033978705 (+5.65)       160038434 (+3.59)     0.443638365 (+5.66)  [rel]
> +10000000000/1000000:       880276838 (+3.08)      1036176245 (+6.13)       160683878 (+4.15)     0.444577244 (+6.14)  [rel]
> 
> barcelona [BWC]             820573353               748178486               148161233             0.342122850 [baseline] 
> +unconstrained              817011602 (-0.43)       759838181 (+1.56)       145951513 (-1.49)     0.347462571 (+1.56)  [rel]
> +10000000000/1000:          830109086 (+0.26)       770451537 (+1.67)       151228902 (+1.08)     0.350824677 (+1.65)  [rel]
> +10000000000/1000000:       830196206 (+0.30)       770704213 (+2.27)       151250413 (+1.12)     0.350962182 (+2.28)  [rel]
> 
> westmere [BWC]              802533191               694415157               146071233             0.194428018 [baseline]
> +unconstrained              799057936 (-0.43)       751384496 (+8.20)       143875513 (-1.50)     0.211182620 (+8.62)  [rel]
> +10000000000/1000:          812033785 (+0.27)       761469084 (+8.51)       149134146 (+1.09)     0.212149229 (+8.28)  [rel]
> +10000000000/1000000:       811912834 (+0.27)       757842988 (+7.45)       149113291 (+1.09)     0.211364804 (+7.30)  [rel]
> e.g. Barcelona issues ~0.43% fewer instructions, for a total of 817011602, in
> the unconstrained case with BWC_JL.
> 
> Where "unconstrained, 10000000000/1000, 10000000000/1000000" are the
> measurements for BWC_JL, with (%) being the relative difference to their
> BWC counterparts.
> 
> W2: BWC vs BWC_JL is very similar.
>                             instructions            cycles                  branches              elapsed
> ---------------------------------------------------------------------------------------------------------------------
> clovertown [BWC]            985732031              1283113452               175621212             1.375905653  
> +unconstrained              979242938 (-0.66)      1288971141 (+0.46)       172122546 (-1.99)     1.389795165 (+1.01)  [rel]
> +10000000000/1000:          999886468 (+0.33)      1296597143 (+1.13)       180554004 (+1.62)     1.392576770 (+1.18)  [rel]
> +10000000000/1000000:       999034223 (+0.11)      1293925500 (+0.57)       180413829 (+1.39)     1.391041338 (+0.94)  [rel]
> 
> barcelona [BWC]             982139920              1078757792               175417574             1.069537049  
> +unconstrained              965443672 (-1.70)      1075377223 (-0.31)       170215844 (-2.97)     1.045595065 (-2.24)  [rel]
> +10000000000/1000:          989104943 (+0.05)      1100836668 (+0.52)       178837754 (+1.22)     1.058730316 (-1.77)  [rel]
> +10000000000/1000000:       987627489 (-0.32)      1095843758 (-0.17)       178567411 (+0.84)     1.056100899 (-2.28)  [rel]
> 
> westmere [BWC]              918633403               896047900               166496917             0.754629182  
> +unconstrained              914740541 (-0.42)       903906801 (+0.88)       163652848 (-1.71)     0.758050332 (+0.45)  [rel]
> +10000000000/1000:          927517377 (-0.41)       952579771 (+5.67)       170173060 (+0.75)     0.771193786 (+2.43)  [rel]
> +10000000000/1000000:       914676985 (-0.89)       936106277 (+3.81)       167683288 (+0.22)     0.764973632 (+1.38)  [rel]
> 
> Now this is rather odd, almost across the board we're seeing the expected
> drops in instructions and branches, yet we appear to be paying a heavy IPC
> price.  The fact that wall-time has scaled equivalently with cycles roughly
> rules out the cycles counter being off.
> 
> We are seeing the expected behavior in the bandwidth enabled case;
> specifically the <jl=jmp><ret><cond><ret> blocks are taking an extra branch
> and instruction which shows up on all the numbers above.
> 
> With respect to compiler mangling the text is essentially unchanged in size.
> One lurking suspicion is whether the inserted nops have perturbed some of the
> jmp/branch alignments?
> 
>     text    data     bss     dec     hex filename
>  7277206 2827256 2125824 12230286         ba9e8e vmlinux.jump_label
>  7276886 2826744 2125824 12229454         ba9b4e vmlinux.no_jump_label
>  
>  I have checked to make sure that the right instructions are being patched in
>  at run-time.  I've also pulled a fully patched jump_label out of the kernel
>  into a userspace test (and benchmarked it directly under perf).  The results
>  here are also exactly as expected.
> 
> e.g.
>  Performance counter stats for './jump_test':
>      1,500,839,002 instructions, 300,147,081 branches 702,468,404 cycles
> Performance counter stats for './jump_test 1':
>      2,001,014,609 instructions, 400,177,192 branches 901,758,219 cycles
> 
> Overall if we can fix the IPC the benefit in the globally unconstrained case
> looks really good.
> 
> Any thoughts Jason?
> 

Do you have CONFIG_CC_OPTIMIZE_FOR_SIZE set? I know that when
CONFIG_CC_OPTIMIZE_FOR_SIZE is not set, the compiler can generate
better-optimized code.

thanks,

-Jason


* Re:
  2011-07-22  0:32 Jason Baron
@ 2011-07-22  0:57 ` Paul Turner
  2011-07-22  1:17   ` [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive Jason Baron
  0 siblings, 1 reply; 14+ messages in thread
From: Paul Turner @ 2011-07-22  0:57 UTC
  To: Jason Baron
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

On Thu, Jul 21, 2011 at 5:32 PM, Jason Baron <jbaron@redhat.com> wrote:
> rth@redhat.com
> Bcc:
> Subject: Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead
>  when bandwidth control is inactive
> Reply-To:
> In-Reply-To: <20110721184758.403388616@google.com>
>
> On Thu, Jul 21, 2011 at 09:43:42AM -0700, Paul Turner wrote:
>> So I'm seeing some strange costs associated with jump_labels; while on paper
>> the branches and instructions retired improves (as expected) we're taking an
>> unexpected hit in IPC.
>>
>> [From the initial mail we have workloads:
>>   mkdir -p /cgroup/cpu/test
>>   echo $$ > /dev/cgroup/cpu/test (only cpu,cpuacct mounted)
>>   (W1) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "for ((i=0;i<5;i++)); do $(dirname $0)/pipe-test 20000; done"
>>   (W2)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;true"
>>   (W3)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;"
>> ]
>>
>> To make some of the figures more clear:
>>
>> Legend:
>> !BWC = tip + bwc, BWC compiled out
>> BWC = tip + bwc
>> BWC_JL = tip + bwc + jump label (this patch)
>>
>>
>> Now, comparing under W1 we see:
>> W1: BWC vs BWC_JL
>>                             instructions            cycles                  branches              elapsed
>> ---------------------------------------------------------------------------------------------------------------------
>> clovertown [BWC]            845934117               974222228               152715407             0.419014188 [baseline]
>> +unconstrained              857963815 (+1.42)      1007152750 (+3.38)       153140328 (+0.28)     0.433186926 (+3.38)  [rel]
>> +10000000000/1000:          876937753 (+2.55)      1033978705 (+5.65)       160038434 (+3.59)     0.443638365 (+5.66)  [rel]
>> +10000000000/1000000:       880276838 (+3.08)      1036176245 (+6.13)       160683878 (+4.15)     0.444577244 (+6.14)  [rel]
>>
>> barcelona [BWC]             820573353               748178486               148161233             0.342122850 [baseline]
>> +unconstrained              817011602 (-0.43)       759838181 (+1.56)       145951513 (-1.49)     0.347462571 (+1.56)  [rel]
>> +10000000000/1000:          830109086 (+0.26)       770451537 (+1.67)       151228902 (+1.08)     0.350824677 (+1.65)  [rel]
>> +10000000000/1000000:       830196206 (+0.30)       770704213 (+2.27)       151250413 (+1.12)     0.350962182 (+2.28)  [rel]
>>
>> westmere [BWC]              802533191               694415157               146071233             0.194428018 [baseline]
>> +unconstrained              799057936 (-0.43)       751384496 (+8.20)       143875513 (-1.50)     0.211182620 (+8.62)  [rel]
>> +10000000000/1000:          812033785 (+0.27)       761469084 (+8.51)       149134146 (+1.09)     0.212149229 (+8.28)  [rel]
>> +10000000000/1000000:       811912834 (+0.27)       757842988 (+7.45)       149113291 (+1.09)     0.211364804 (+7.30)  [rel]
>> e.g. Barcelona issues ~0.43% fewer instructions, for a total of 817011602, in
>> the unconstrained case with BWC_JL.
>>
>> Where "unconstrained, 10000000000/1000, 10000000000/1000000" are the
>> measurements for BWC_JL, with (%) being the relative difference to their
>> BWC counterparts.
>>
>> W2: BWC vs BWC_JL is very similar.
>>                             instructions            cycles                  branches              elapsed
>> ---------------------------------------------------------------------------------------------------------------------
>> clovertown [BWC]            985732031              1283113452               175621212             1.375905653
>> +unconstrained              979242938 (-0.66)      1288971141 (+0.46)       172122546 (-1.99)     1.389795165 (+1.01)  [rel]
>> +10000000000/1000:          999886468 (+0.33)      1296597143 (+1.13)       180554004 (+1.62)     1.392576770 (+1.18)  [rel]
>> +10000000000/1000000:       999034223 (+0.11)      1293925500 (+0.57)       180413829 (+1.39)     1.391041338 (+0.94)  [rel]
>>
>> barcelona [BWC]             982139920              1078757792               175417574             1.069537049
>> +unconstrained              965443672 (-1.70)      1075377223 (-0.31)       170215844 (-2.97)     1.045595065 (-2.24)  [rel]
>> +10000000000/1000:          989104943 (+0.05)      1100836668 (+0.52)       178837754 (+1.22)     1.058730316 (-1.77)  [rel]
>> +10000000000/1000000:       987627489 (-0.32)      1095843758 (-0.17)       178567411 (+0.84)     1.056100899 (-2.28)  [rel]
>>
>> westmere [BWC]              918633403               896047900               166496917             0.754629182
>> +unconstrained              914740541 (-0.42)       903906801 (+0.88)       163652848 (-1.71)     0.758050332 (+0.45)  [rel]
>> +10000000000/1000:          927517377 (-0.41)       952579771 (+5.67)       170173060 (+0.75)     0.771193786 (+2.43)  [rel]
>> +10000000000/1000000:       914676985 (-0.89)       936106277 (+3.81)       167683288 (+0.22)     0.764973632 (+1.38)  [rel]
>>
>> Now this is rather odd, almost across the board we're seeing the expected
>> drops in instructions and branches, yet we appear to be paying a heavy IPC
>> price.  The fact that wall-time has scaled equivalently with cycles roughly
>> rules out the cycles counter being off.
>>
>> We are seeing the expected behavior in the bandwidth enabled case;
>> specifically the <jl=jmp><ret><cond><ret> blocks are taking an extra branch
>> and instruction which shows up on all the numbers above.
>>
>> With respect to compiler mangling the text is essentially unchanged in size.
>> One lurking suspicion is whether the inserted nops have perturbed some of the
>> jmp/branch alignments?
>>
>>     text    data     bss     dec     hex filename
>>  7277206 2827256 2125824 12230286         ba9e8e vmlinux.jump_label
>>  7276886 2826744 2125824 12229454         ba9b4e vmlinux.no_jump_label
>>
>>  I have checked to make sure that the right instructions are being patched in
>>  at run-time.  I've also pulled a fully patched jump_label out of the kernel
>>  into a userspace test (and benchmarked it directly under perf).  The results
>>  here are also exactly as expected.
>>
>> e.g.
>>  Performance counter stats for './jump_test':
>>      1,500,839,002 instructions, 300,147,081 branches 702,468,404 cycles
>> Performance counter stats for './jump_test 1':
>>      2,001,014,609 instructions, 400,177,192 branches 901,758,219 cycles
>>
>> Overall if we can fix the IPC the benefit in the globally unconstrained case
>> looks really good.
>>
>> Any thoughts Jason?
>>
>
> Do you have CONFIG_CC_OPTIMIZE_FOR_SIZE set? I know that when
> CONFIG_CC_OPTIMIZE_FOR_SIZE is not set, the compiler can make the code
> more optimal.
>

Ah I should have mentioned that was one of the holes I stared down:

Builds were -O2 (gcc-4.6.1) and
$  zcat /proc/config.gz | grep CONFIG_CC_OPTIMIZE_FOR_SIZE
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set

Same kernel image across all platforms.






> thanks,
>
> -Jason
>


* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
  2011-07-22  0:57 ` Paul Turner
@ 2011-07-22  1:17   ` Jason Baron
  2011-07-22  1:38     ` Paul Turner
  0 siblings, 1 reply; 14+ messages in thread
From: Jason Baron @ 2011-07-22  1:17 UTC
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
	rth

On Thu, Jul 21, 2011 at 05:57:31PM -0700, Paul Turner wrote:
> On Thu, Jul 21, 2011 at 5:32 PM, Jason Baron <jbaron@redhat.com> wrote:
> > rth@redhat.com
> > Bcc:
> > Subject: Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead
> >  when bandwidth control is inactive
> > Reply-To:
> > In-Reply-To: <20110721184758.403388616@google.com>
> >
> > On Thu, Jul 21, 2011 at 09:43:42AM -0700, Paul Turner wrote:
> >> So I'm seeing some strange costs associated with jump_labels; while on paper
> >> the branches and instructions retired improves (as expected) we're taking an
> >> unexpected hit in IPC.
> >>
> >> [From the initial mail we have workloads:
> >>   mkdir -p /cgroup/cpu/test
> >>   echo $$ > /dev/cgroup/cpu/test (only cpu,cpuacct mounted)
> >>   (W1) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "for ((i=0;i<5;i++)); do $(dirname $0)/pipe-test 20000; done"
> >>   (W2)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;true"
> >>   (W3)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;"
> >> ]
> >>
> >> To make some of the figures more clear:
> >>
> >> Legend:
> >> !BWC = tip + bwc, BWC compiled out
> >> BWC = tip + bwc
> >> BWC_JL = tip + bwc + jump label (this patch)
> >>
> >>
> >> Now, comparing under W1 we see:
> >> W1: BWC vs BWC_JL
> >>                             instructions            cycles                  branches              elapsed
> >> ---------------------------------------------------------------------------------------------------------------------
> >> clovertown [BWC]            845934117               974222228               152715407             0.419014188 [baseline]
> >> +unconstrained              857963815 (+1.42)      1007152750 (+3.38)       153140328 (+0.28)     0.433186926 (+3.38)  [rel]
> >> +10000000000/1000:          876937753 (+2.55)      1033978705 (+5.65)       160038434 (+3.59)     0.443638365 (+5.66)  [rel]
> >> +10000000000/1000000:       880276838 (+3.08)      1036176245 (+6.13)       160683878 (+4.15)     0.444577244 (+6.14)  [rel]
> >>
> >> barcelona [BWC]             820573353               748178486               148161233             0.342122850 [baseline]
> >> +unconstrained              817011602 (-0.43)       759838181 (+1.56)       145951513 (-1.49)     0.347462571 (+1.56)  [rel]
> >> +10000000000/1000:          830109086 (+0.26)       770451537 (+1.67)       151228902 (+1.08)     0.350824677 (+1.65)  [rel]
> >> +10000000000/1000000:       830196206 (+0.30)       770704213 (+2.27)       151250413 (+1.12)     0.350962182 (+2.28)  [rel]
> >>
> >> westmere [BWC]              802533191               694415157               146071233             0.194428018 [baseline]
> >> +unconstrained              799057936 (-0.43)       751384496 (+8.20)       143875513 (-1.50)     0.211182620 (+8.62)  [rel]
> >> +10000000000/1000:          812033785 (+0.27)       761469084 (+8.51)       149134146 (+1.09)     0.212149229 (+8.28)  [rel]
> >> +10000000000/1000000:       811912834 (+0.27)       757842988 (+7.45)       149113291 (+1.09)     0.211364804 (+7.30)  [rel]
> >> e.g. Barcelona issues ~0.43% fewer instructions, for a total of 817011602, in
> >> the unconstrained case with BWC_JL.
> >>
> >> Where "unconstrained, 10000000000/1000, 10000000000/1000000" are the
> >> measurements for BWC_JL, with (%) being the relative difference to their
> >> BWC counterparts.
> >>
> >> W2: BWC vs BWC_JL is very similar.
> >>                             instructions            cycles                  branches              elapsed
> >> ---------------------------------------------------------------------------------------------------------------------
> >> clovertown [BWC]            985732031              1283113452               175621212             1.375905653
> >> +unconstrained              979242938 (-0.66)      1288971141 (+0.46)       172122546 (-1.99)     1.389795165 (+1.01)  [rel]
> >> +10000000000/1000:          999886468 (+0.33)      1296597143 (+1.13)       180554004 (+1.62)     1.392576770 (+1.18)  [rel]
> >> +10000000000/1000000:       999034223 (+0.11)      1293925500 (+0.57)       180413829 (+1.39)     1.391041338 (+0.94)  [rel]
> >>
> >> barcelona [BWC]             982139920              1078757792               175417574             1.069537049
> >> +unconstrained              965443672 (-1.70)      1075377223 (-0.31)       170215844 (-2.97)     1.045595065 (-2.24)  [rel]
> >> +10000000000/1000:          989104943 (+0.05)      1100836668 (+0.52)       178837754 (+1.22)     1.058730316 (-1.77)  [rel]
> >> +10000000000/1000000:       987627489 (-0.32)      1095843758 (-0.17)       178567411 (+0.84)     1.056100899 (-2.28)  [rel]
> >>
> >> westmere [BWC]              918633403               896047900               166496917             0.754629182
> >> +unconstrained              914740541 (-0.42)       903906801 (+0.88)       163652848 (-1.71)     0.758050332 (+0.45)  [rel]
> >> +10000000000/1000:          927517377 (-0.41)       952579771 (+5.67)       170173060 (+0.75)     0.771193786 (+2.43)  [rel]
> >> +10000000000/1000000:       914676985 (-0.89)       936106277 (+3.81)       167683288 (+0.22)     0.764973632 (+1.38)  [rel]
> >>
> >> Now this is rather odd, almost across the board we're seeing the expected
> >> drops in instructions and branches, yet we appear to be paying a heavy IPC
> >> price.  The fact that wall-time has scaled equivalently with cycles roughly
> >> rules out the cycles counter being off.
> >>

If I understand your results, for barcelona you did see an improvement
in cycles and elapsed time with jump labels for unconstrained?

> >> We are seeing the expected behavior in the bandwidth enabled case;
> >> specifically the <jl=jmp><ret><cond><ret> blocks are taking an extra branch
> >> and instruction which shows up on all the numbers above.
> >>
> >> With respect to compiler mangling the text is essentially unchanged in size.
> >> One lurking suspicion is whether the inserted nops have perturbed some of the
> >> jmp/branch alignments?

Hmm, not sure. I'm adding Richard Henderson, who worked on 'asm goto'
in gcc, to the cc list.

> >>
> >>     text    data     bss     dec     hex filename
> >>  7277206 2827256 2125824 12230286         ba9e8e vmlinux.jump_label
> >>  7276886 2826744 2125824 12229454         ba9b4e vmlinux.no_jump_label
> >>

The other thing here is that vmlinux.jump_label includes the extra
kernel/jump_label.o file, so you can subtract the text size of
that file to do a fair comparison.

Also, I would have expected the data section to have increased more with
jump labels enabled. Are tracepoints (a current user of jump labels)
disabled?

> >>  I have checked to make sure that the right instructions are being patched in
> >>  at run-time.  I've also pulled a fully patched jump_label out of the kernel
> >>  into a userspace test (and benchmarked it directly under perf).  The results
> >>  here are also exactly as expected.
> >>
> >> e.g.
> >>  Performance counter stats for './jump_test':
> >>      1,500,839,002 instructions, 300,147,081 branches 702,468,404 cycles
> >> Performance counter stats for './jump_test 1':
> >>      2,001,014,609 instructions, 400,177,192 branches 901,758,219 cycles
> >>

What no-op did you use in userspace? I wouldn't think the no-op choice
would make any difference though. At compile time we use a 'jmp 0', and
then at boot we dynamically patch the 'jmp 0' with the no-op we think works
best.
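
As a rough illustration of the patching described above, the sketch below
models a jump-label site as five bytes in a buffer. It is Python rather than
kernel code, and everything in it is an assumption made for illustration: the
helper names `patch_site` and `jmp_rel32` are hypothetical, and the byte
sequences are the x86 encodings to the best of my knowledge (a `jmp` with zero
displacement, and the 5-byte P6 and K8 nops), so check them against the kernel
source before relying on them.

```python
# Illustrative model of a 5-byte jump-label site; not kernel code.

# Assumed x86 encodings (verify against the kernel's nop tables):
JMP_ZERO = bytes([0xE9, 0x00, 0x00, 0x00, 0x00])  # jmp rel32, displacement 0
P6_NOP5 = bytes([0x0F, 0x1F, 0x44, 0x00, 0x00])   # nopl 0x0(%rax,%rax,1)
K8_NOP5 = bytes([0x66, 0x66, 0x90, 0x66, 0x90])   # 3-byte nop + 2-byte nop

SITE_LEN = 5


def patch_site(text: bytearray, offset: int, insn: bytes) -> None:
    """Overwrite the 5-byte site at `offset` with `insn`."""
    if len(insn) != SITE_LEN:
        raise ValueError("replacement must be exactly 5 bytes")
    if offset < 0 or offset + SITE_LEN > len(text):
        raise ValueError("site lies outside the text buffer")
    text[offset:offset + SITE_LEN] = insn


def jmp_rel32(site: int, target: int) -> bytes:
    """Encode a jmp from the site at `site` to `target` (both buffer offsets)."""
    # rel32 is measured from the end of the 5-byte instruction.
    disp = target - (site + SITE_LEN)
    return bytes([0xE9]) + disp.to_bytes(4, "little", signed=True)


# A toy "function": the site, a ret, then an out-of-line slow path and its ret.
text = bytearray(JMP_ZERO + b"\xc3" + b"\x90\x90\x90\xc3")
site, slow_path = 0, 6

patch_site(text, site, K8_NOP5)                     # key disabled: fall through
disabled = bytes(text[:SITE_LEN])
patch_site(text, site, jmp_rel32(site, slow_path))  # key enabled: take the jmp
enabled = bytes(text[:SITE_LEN])

print(disabled.hex(), enabled.hex())
```

The only point of the model is that both states occupy the same five bytes, so
switching between them never moves the surrounding code.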

thanks,

-Jason

> >> Overall if we can fix the IPC the benefit in the globally unconstrained case
> >> looks really good.
> >>
> >> Any thoughts Jason?
> >>
> >
> > Do you have CONFIG_CC_OPTIMIZE_FOR_SIZE set? I know that when
> > CONFIG_CC_OPTIMIZE_FOR_SIZE is not set, the compiler can make the code
> > more optimal.
> >
> 
> Ah I should have mentioned that was one of the holes I stared down:
> 
> Builds were -O2 (gcc-4.6.1) and
> $  zcat /proc/config.gz | grep CONFIG_CC_OPTIMIZE_FOR_SIZE
> # CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
> 
> Same kernel image across all platforms.
> 
> 
> 
> 
> 
> 
> > thanks,
> >
> > -Jason
> >
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
  2011-07-22  1:17   ` [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive Jason Baron
@ 2011-07-22  1:38     ` Paul Turner
  2011-07-27 21:58       ` Jason Baron
  0 siblings, 1 reply; 14+ messages in thread
From: Paul Turner @ 2011-07-22  1:38 UTC
  To: Jason Baron
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
	rth

On Thu, Jul 21, 2011 at 6:17 PM, Jason Baron <jbaron@redhat.com> wrote:
> On Thu, Jul 21, 2011 at 05:57:31PM -0700, Paul Turner wrote:
>> On Thu, Jul 21, 2011 at 5:32 PM, Jason Baron <jbaron@redhat.com> wrote:
>> > rth@redhat.com
>> > Bcc:
>> > Subject: Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead
>> >  when bandwidth control is inactive
>> > Reply-To:
>> > In-Reply-To: <20110721184758.403388616@google.com>
>> >
>> > On Thu, Jul 21, 2011 at 09:43:42AM -0700, Paul Turner wrote:
>> >> So I'm seeing some strange costs associated with jump_labels; while on paper
>> >> the branches and instructions retired improves (as expected) we're taking an
>> >> unexpected hit in IPC.
>> >>
>> >> [From the initial mail we have workloads:
>> >>   mkdir -p /cgroup/cpu/test
>> >>   echo $$ > /dev/cgroup/cpu/test (only cpu,cpuacct mounted)
>> >>   (W1) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "for ((i=0;i<5;i++)); do $(dirname $0)/pipe-test 20000; done"
>> >>   (W2)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;true"
>> >>   (W3)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;"
>> >> ]
>> >>
>> >> To make some of the figures more clear:
>> >>
>> >> Legend:
>> >> !BWC = tip + bwc, BWC compiled out
>> >> BWC = tip + bwc
>> >> BWC_JL = tip + bwc + jump label (this patch)
>> >>
>> >>
>> >> Now, comparing under W1 we see:
>> >> W1: BWC vs BWC_JL
>> >>                             instructions            cycles                  branches              elapsed
>> >> ---------------------------------------------------------------------------------------------------------------------
>> >> clovertown [BWC]            845934117               974222228               152715407             0.419014188 [baseline]
>> >> +unconstrained              857963815 (+1.42)      1007152750 (+3.38)       153140328 (+0.28)     0.433186926 (+3.38)  [rel]
>> >> +10000000000/1000:          876937753 (+2.55)      1033978705 (+5.65)       160038434 (+3.59)     0.443638365 (+5.66)  [rel]
>> >> +10000000000/1000000:       880276838 (+3.08)      1036176245 (+6.13)       160683878 (+4.15)     0.444577244 (+6.14)  [rel]
>> >>
>> >> barcelona [BWC]             820573353               748178486               148161233             0.342122850 [baseline]
>> >> +unconstrained              817011602 (-0.43)       759838181 (+1.56)       145951513 (-1.49)     0.347462571 (+1.56)  [rel]
>> >> +10000000000/1000:          830109086 (+0.26)       770451537 (+1.67)       151228902 (+1.08)     0.350824677 (+1.65)  [rel]
>> >> +10000000000/1000000:       830196206 (+0.30)       770704213 (+2.27)       151250413 (+1.12)     0.350962182 (+2.28)  [rel]
>> >>
>> >> westmere [BWC]              802533191               694415157               146071233             0.194428018 [baseline]
>> >> +unconstrained              799057936 (-0.43)       751384496 (+8.20)       143875513 (-1.50)     0.211182620 (+8.62)  [rel]
>> >> +10000000000/1000:          812033785 (+0.27)       761469084 (+8.51)       149134146 (+1.09)     0.212149229 (+8.28)  [rel]
>> >> +10000000000/1000000:       811912834 (+0.27)       757842988 (+7.45)       149113291 (+1.09)     0.211364804 (+7.30)  [rel]
>> >> e.g. Barcelona issues ~0.43% fewer instructions, for a total of 817011602, in
>> >> the unconstrained case with BWC_JL.
>> >>
>> >> Where "unconstrained, 10000000000/1000, 10000000000/1000000" are the
>> >> measurements for BWC_JL, with (%) being the relative difference to their
>> >> BWC counterparts.
>> >>
>> >> W2: BWC vs BWC_JL is very similar.
>> >>                             instructions            cycles                  branches              elapsed
>> >> ---------------------------------------------------------------------------------------------------------------------
>> >> clovertown [BWC]            985732031              1283113452               175621212             1.375905653
>> >> +unconstrained              979242938 (-0.66)      1288971141 (+0.46)       172122546 (-1.99)     1.389795165 (+1.01)  [rel]
>> >> +10000000000/1000:          999886468 (+0.33)      1296597143 (+1.13)       180554004 (+1.62)     1.392576770 (+1.18)  [rel]
>> >> +10000000000/1000000:       999034223 (+0.11)      1293925500 (+0.57)       180413829 (+1.39)     1.391041338 (+0.94)  [rel]
>> >>
>> >> barcelona [BWC]             982139920              1078757792               175417574             1.069537049
>> >> +unconstrained              965443672 (-1.70)      1075377223 (-0.31)       170215844 (-2.97)     1.045595065 (-2.24)  [rel]
>> >> +10000000000/1000:          989104943 (+0.05)      1100836668 (+0.52)       178837754 (+1.22)     1.058730316 (-1.77)  [rel]
>> >> +10000000000/1000000:       987627489 (-0.32)      1095843758 (-0.17)       178567411 (+0.84)     1.056100899 (-2.28)  [rel]
>> >>
>> >> westmere [BWC]              918633403               896047900               166496917             0.754629182
>> >> +unconstrained              914740541 (-0.42)       903906801 (+0.88)       163652848 (-1.71)     0.758050332 (+0.45)  [rel]
>> >> +10000000000/1000:          927517377 (-0.41)       952579771 (+5.67)       170173060 (+0.75)     0.771193786 (+2.43)  [rel]
>> >> +10000000000/1000000:       914676985 (-0.89)       936106277 (+3.81)       167683288 (+0.22)     0.764973632 (+1.38)  [rel]
>> >>
>> >> Now this is rather odd, almost across the board we're seeing the expected
>> >> drops in instructions and branches, yet we appear to be paying a heavy IPC
>> >> price.  The fact that wall-time has scaled equivalently with cycles roughly
>> >> rules out the cycles counter being off.
>> >>
>
> If I understand your results, for barcelona you did see an improvement
> in cycles and elapsed time with jump labels for unconstrained?
>

Under W2, yes.
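
For reference, the Barcelona deltas and the resulting IPC can be recomputed
from the quoted tables. This is only a sketch: it assumes the first table is
the W1 run and the second is the W2 run (which the answer above implies), and
that the `[BWC]` row is the baseline for the `+unconstrained` row. The helper
names are made up for illustration.

```python
# (instructions, cycles, elapsed seconds) copied from the quoted tables.
# Assumption: the first table is W1 and the second is W2.
BARCELONA = {
    "W1": {
        "BWC": (820573353, 748178486, 0.342122850),
        "BWC_JL": (817011602, 759838181, 0.347462571),
    },
    "W2": {
        "BWC": (982139920, 1078757792, 1.069537049),
        "BWC_JL": (965443672, 1075377223, 1.045595065),
    },
}


def pct_change(new: float, old: float) -> float:
    """Change from old to new, as a percentage of old."""
    return (new - old) / old * 100.0


def ipc(instructions: int, cycles: int) -> float:
    """Instructions retired per cycle."""
    return instructions / cycles


def summarize(workload: str) -> dict:
    """Compare the unconstrained BWC_JL run against the BWC baseline."""
    base = BARCELONA[workload]["BWC"]
    jl = BARCELONA[workload]["BWC_JL"]
    return {
        "cycles_pct": pct_change(jl[1], base[1]),
        "elapsed_pct": pct_change(jl[2], base[2]),
        "ipc_bwc": ipc(base[0], base[1]),
        "ipc_jl": ipc(jl[0], jl[1]),
    }


for name in ("W1", "W2"):
    s = summarize(name)
    print(f"{name}: cycles {s['cycles_pct']:+.2f}%, "
          f"elapsed {s['elapsed_pct']:+.2f}%, "
          f"IPC {s['ipc_bwc']:.3f} -> {s['ipc_jl']:.3f}")
```

With these inputs, cycles and elapsed time go down under W2 only, while the
computed IPC is lower with jump labels in both workloads.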

>> >> We are seeing the expected behavior in the bandwidth enabled case;
>> >> specifically the <jl=jmp><ret><cond><ret> blocks are taking an extra branch
>> >> and instruction which shows up on all the numbers above.
>> >>
>> >> With respect to compiler mangling the text is essentially unchanged in size.
>> >> One lurking suspicion is whether the inserted nops have perturbed some of the
>> >> jmp/branch alignments?
>
> hmmmm....not sure, I'm adding Richard Henderson, to the 'cc list, who
> worked on the 'asm goto' in gcc.
>
>> >>
>> >>     text    data     bss     dec     hex filename
>> >>  7277206 2827256 2125824 12230286         ba9e8e vmlinux.jump_label
>> >>  7276886 2826744 2125824 12229454         ba9b4e vmlinux.no_jump_label
>> >>
>
> the other thing here is that vmlinux.jump_label includes the extra
> kernel/jump_label.o file, so you can sort of subtract the text size of
> that file to do a fair comparison.

Even without doing that it's only a 0.0044% change in text size (320 bytes,
a ratio of 1.00004).

I was just making the inference that if it's gcc mangling it's likely
in the layout/alignment.
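
The size of that change can be checked directly from the `size` output quoted
above. A minimal sketch, with the section sizes copied from the thread and a
hypothetical helper name:

```python
# Section sizes copied from the `size` output quoted in this thread.
SIZES = {
    "vmlinux.jump_label": {"text": 7277206, "data": 2827256},
    "vmlinux.no_jump_label": {"text": 7276886, "data": 2826744},
}


def section_delta(section: str) -> tuple:
    """Return (bytes added, percent change) for one section with jump labels."""
    new = SIZES["vmlinux.jump_label"][section]
    old = SIZES["vmlinux.no_jump_label"][section]
    return new - old, (new - old) / old * 100.0


for section in ("text", "data"):
    added, percent = section_delta(section)
    print(f"{section}: +{added} bytes ({percent:.4f}%)")
```

Note that the ratio of the two text sizes is about 1.00004, which corresponds
to a change of about 0.0044%.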

>
> Also, I would have expected the data section to have increased more with
> jump labels enabled. Are tracepoints (a current user of jump labels)
> disabled?

Yeah -- Tracing is enabled so the BWC build should have labels
already; this likely accounts for the small increase noted above.

>
>> >>  I have checked to make sure that the right instructions are being patched in
>> >>  at run-time.  I've also pulled a fully patched jump_label out of the kernel
>> >>  into a userspace test (and benchmarked it directly under perf).  The results
>> >>  here are also exactly as expected.
>> >>
>> >> e.g.
>> >>  Performance counter stats for './jump_test':
>> >>      1,500,839,002 instructions, 300,147,081 branches 702,468,404 cycles
>> >> Performance counter stats for './jump_test 1':
>> >>      2,001,014,609 instructions, 400,177,192 branches 901,758,219 cycles
>> >>
>
> what no-op did you use in userspace? I wouldn't think the no-op choice
> would make any difference though...At compile time we use a 'jmp 0', and
> then at boot we dynamically patch the 'jmp 0' with the no-op we think works
> best...
>

Sorry -- what I meant here is I pulled the run-time chosen "best" nop
out of /proc/kcore and tested a tight loop around a <JL><RET><COND><RET>
sequence (e.g. cfs_rq_throttled()), with JL being the nop and the jmp
respectively.

Specifically for Westmere this ends up being K8_NOP5  -- 0x666666D0

> thanks,
>
> -Jason
>
>> >> Overall if we can fix the IPC the benefit in the globally unconstrained case
>> >> looks really good.
>> >>
>> >> Any thoughts Jason?
>> >>
>> >
>> > Do you have CONFIG_CC_OPTIMIZE_FOR_SIZE set? I know that when
>> > CONFIG_CC_OPTIMIZE_FOR_SIZE is not set, the compiler can make the code
>> > more optimal.
>> >
>>
>> Ah I should have mentioned that was one of the holes I stared down:
>>
>> Builds were -O2 (gcc-4.6.1) and
>> $  zcat /proc/config.gz | grep CONFIG_CC_OPTIMIZE_FOR_SIZE
>> # CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
>>
>> Same kernel image across all platforms.
>>
>>
>>
>>
>>
>>
>> > thanks,
>> >
>> > -Jason
>> >
>


* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
  2011-07-22  1:38     ` Paul Turner
@ 2011-07-27 21:58       ` Jason Baron
  2011-08-05  3:53         ` Paul Turner
                           ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Jason Baron @ 2011-07-27 21:58 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
	rth

On Thu, Jul 21, 2011 at 06:38:01PM -0700, Paul Turner wrote:
> On Thu, Jul 21, 2011 at 6:17 PM, Jason Baron <jbaron@redhat.com> wrote:
> > On Thu, Jul 21, 2011 at 05:57:31PM -0700, Paul Turner wrote:
> >> On Thu, Jul 21, 2011 at 5:32 PM, Jason Baron <jbaron@redhat.com> wrote:
> >> > rth@redhat.com
> >> > Bcc:
> >> > Subject: Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead
> >> >  when bandwidth control is inactive
> >> > Reply-To:
> >> > In-Reply-To: <20110721184758.403388616@google.com>
> >> >
> >> > On Thu, Jul 21, 2011 at 09:43:42AM -0700, Paul Turner wrote:
> >> >> So I'm seeing some strange costs associated with jump_labels; while on paper
> >> >> the branches and instructions retired improves (as expected) we're taking an
> >> >> unexpected hit in IPC.
> >> >>
> >> >> [From the initial mail we have workloads:
> >> >>   mkdir -p /cgroup/cpu/test
> >> >>   echo $$ > /dev/cgroup/cpu/test (only cpu,cpuacct mounted)
> >> >>   (W1) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "for ((i=0;i<5;i++)); do $(dirname $0)/pipe-test 20000; done"
> >> >>   (W2)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;true"
> >> >>   (W3)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;"
> >> >> ]
> >> >>
> >> >> To make some of the figures more clear:
> >> >>
> >> >> Legend:
> >> >> !BWC = tip + bwc, BWC compiled out
> >> >> BWC = tip + bwc
> >> >> BWC_JL = tip + bwc + jump label (this patch)
> >> >>
> >> >>
> >> >> Now, comparing under W1 we see:
> >> >> W1: BWC vs BWC_JL
> >> >>                             instructions            cycles                  branches              elapsed
> >> >> ---------------------------------------------------------------------------------------------------------------------
> >> >> clovertown [BWC]            845934117               974222228               152715407             0.419014188 [baseline]
> >> >> +unconstrained              857963815 (+1.42)      1007152750 (+3.38)       153140328 (+0.28)     0.433186926 (+3.38)  [rel]
> >> >> +10000000000/1000:          876937753 (+2.55)      1033978705 (+5.65)       160038434 (+3.59)     0.443638365 (+5.66)  [rel]
> >> >> +10000000000/1000000:       880276838 (+3.08)      1036176245 (+6.13)       160683878 (+4.15)     0.444577244 (+6.14)  [rel]
> >> >>
> >> >> barcelona [BWC]             820573353               748178486               148161233             0.342122850 [baseline]
> >> >> +unconstrained              817011602 (-0.43)       759838181 (+1.56)       145951513 (-1.49)     0.347462571 (+1.56)  [rel]
> >> >> +10000000000/1000:          830109086 (+0.26)       770451537 (+1.67)       151228902 (+1.08)     0.350824677 (+1.65)  [rel]
> >> >> +10000000000/1000000:       830196206 (+0.30)       770704213 (+2.27)       151250413 (+1.12)     0.350962182 (+2.28)  [rel]
> >> >>
> >> >> westmere [BWC]              802533191               694415157               146071233             0.194428018 [baseline]
> >> >> +unconstrained              799057936 (-0.43)       751384496 (+8.20)       143875513 (-1.50)     0.211182620 (+8.62)  [rel]
> >> >> +10000000000/1000:          812033785 (+0.27)       761469084 (+8.51)       149134146 (+1.09)     0.212149229 (+8.28)  [rel]
> >> >> +10000000000/1000000:       811912834 (+0.27)       757842988 (+7.45)       149113291 (+1.09)     0.211364804 (+7.30)  [rel]
> > >> >> e.g. Barcelona issues ~0.43% fewer instructions, for a total of 817011602, in
> > >> >> the unconstrained case with BWC_JL.
> > >> >>
> > >> >>
> > >> >> Where "unconstrained, 10000000000/1000, 10000000000/1000000" are the
> > >> >> measurements for BWC_JL, with (%) being the relative difference to their
> > >> >> BWC counterparts.
> > >> >>
> > >> >> W2: BWC vs BWC_JL is very similar.
> > >> >>       BWC vs BWC_JL
> >> >> clovertown [BWC]            985732031              1283113452               175621212             1.375905653
> >> >> +unconstrained              979242938 (-0.66)      1288971141 (+0.46)       172122546 (-1.99)     1.389795165 (+1.01)  [rel]
> >> >> +10000000000/1000:          999886468 (+0.33)      1296597143 (+1.13)       180554004 (+1.62)     1.392576770 (+1.18)  [rel]
> >> >> +10000000000/1000000:       999034223 (+0.11)      1293925500 (+0.57)       180413829 (+1.39)     1.391041338 (+0.94)  [rel]
> >> >>
> >> >> barcelona [BWC]             982139920              1078757792               175417574             1.069537049
> >> >> +unconstrained              965443672 (-1.70)      1075377223 (-0.31)       170215844 (-2.97)     1.045595065 (-2.24)  [rel]
> >> >> +10000000000/1000:          989104943 (+0.05)      1100836668 (+0.52)       178837754 (+1.22)     1.058730316 (-1.77)  [rel]
> >> >> +10000000000/1000000:       987627489 (-0.32)      1095843758 (-0.17)       178567411 (+0.84)     1.056100899 (-2.28)  [rel]
> >> >>
> >> >> westmere [BWC]              918633403               896047900               166496917             0.754629182
> >> >> +unconstrained              914740541 (-0.42)       903906801 (+0.88)       163652848 (-1.71)     0.758050332 (+0.45)  [rel]
> >> >> +10000000000/1000:          927517377 (-0.41)       952579771 (+5.67)       170173060 (+0.75)     0.771193786 (+2.43)  [rel]
> >> >> +10000000000/1000000:       914676985 (-0.89)       936106277 (+3.81)       167683288 (+0.22)     0.764973632 (+1.38)  [rel]
> >> >>
> >> >> Now this is rather odd, almost across the board we're seeing the expected
> >> >> drops in instructions and branches, yet we appear to be paying a heavy IPC
> >> >> price.  The fact that wall-time has scaled equivalently with cycles roughly
> >> >> rules out the cycles counter being off.
> >> >>
> >
> > if i understand your results, for barcelona you did see an improvement
> > in cycles and elapsed time with jump labels for unconstrained?
> >
> 
> Under W2, yes.
> 
> >> >> We are seeing the expected behavior in the bandwidth enabled case;
> >> >> specifically the <jl=jmp><ret><cond><ret> blocks are taking an extra branch
> >> >> and instruction which shows up on all the numbers above.
> >> >>
> >> >> With respect to compiler mangling the text is essentially unchanged in size.
> >> >> One lurking suspicion is whether the inserted nops have perturbed some of the
> >> >> jmp/branch alignments?
> >
> > hmmmm....not sure, I'm adding Richard Henderson, to the 'cc list, who
> > worked on the 'asm goto' in gcc.
> >
> >> >>
> >> >>     text    data     bss     dec     hex filename
> >> >>  7277206 2827256 2125824 12230286         ba9e8e vmlinux.jump_label
> >> >>  7276886 2826744 2125824 12229454         ba9b4e vmlinux.no_jump_label
> >> >>
> >
> > the other thing here is that vmlinux.jump_label includes the extra
> > kernel/jump_label.o file, so you can sort of subtract the text size of
> > that file to do a fair comparison.
> 
> Even without doing that it's only a ~0.004% change in text size.
> 
> I was just making the inference that if it's gcc mangling it's likely
> in the layout/alignment.
> 
> >
> > Also, I would have expected the data section to have increased more with
> > jump labels enabled. Are tracepoints disabled (a current user of jump
> > labels).
> 
> Yeah -- Tracing is enabled so the BWC build should have labels
> already; this likely accounts for the small increase noted above.
> 
> >
> >> >>  I have checked to make sure that the right instructions are being patched in
> >> >>  at run-time.  I've also pulled a fully patched jump_label out of the kernel
> >> >>  into a userspace test (and benchmarked it directly under perf).  The results
> >> >>  here are also exactly as expected.
> >> >>
> >> >> e.g.
> >> >>  Performance counter stats for './jump_test':
> >> >>      1,500,839,002 instructions, 300,147,081 branches 702,468,404 cycles
> >> >> Performance counter stats for './jump_test 1':
> >> >>      2,001,014,609 instructions, 400,177,192 branches 901,758,219 cycles
> >> >>
> >
> > what no-op did you use in userspace? I wouldn't think the no-op choice
> > would make any difference though...At compile time we use a 'jmp 0', and
> > then at boot we dynamically patch the 'jmp 0' with the no-op we think works
> > best...
> >
> 
> Sorry -- what I meant here is I pulled the run-time chosen "best" nop
> out of /proc/kcore and tested a
> tight loop about a <JL><RET><COND><RET> sequence (e.g.
> cfs_rq_throttled()) with JL being the nop and jmp respectively.
> 
> Specifically for Westmere this ends up being K8_NOP5  -- 0x666666D0
> 
> > thanks,
> >
> > -Jason
> >
> >> >> Overall if we can fix the IPC the benefit in the globally unconstrained case
> >> >> looks really good.
> >> >>
> >> >> Any thoughts Jason?
> >> >>
> >> >
> >> > Do you have CONFIG_CC_OPTIMIZE_FOR_SIZE set? I know that when
> >> > CONFIG_CC_OPTIMIZE_FOR_SIZE is not set, the compiler can make the code
> >> > more optimal.
> >> >
> >>
> >> Ah I should have mentioned that was one of the holes I stared down:
> >>
> >> Builds were -O2 (gcc-4.6.1) and
> >> $  zcat /proc/config.gz | grep CONFIG_CC_OPTIMIZE_FOR_SIZE
> >> # CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
> >>
> >> Same kernel image across all platforms.
> >>
> >>

Hi Paul,

Ok, I think I finally tracked this down. It may seem a bit crazy, but
when we are getting down to cycle counting like this, it seems that the
link order in the kernel/Makefile can make a difference. I had the
jump_label.o listed after the core files, whereas all the code in
jump_label.o is really slow path code (used when toggling branch
values). As follows:


--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -10,7 +10,7 @@ obj-y     = sched.o fork.o exec_domain.o panic.o printk.o \
 	    kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
 	    hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
 	    notifier.o ksysfs.o pm_qos_params.o sched_clock.o cred.o \
-	    async.o range.o jump_label.o
+	    async.o range.o
 obj-y += groups.o
 
 ifdef CONFIG_FUNCTION_TRACER
@@ -107,6 +107,7 @@ obj-$(CONFIG_PERF_EVENTS) += events/
 obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
 obj-$(CONFIG_PADATA) += padata.o
 obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
+obj-$(CONFIG_JUMP_LABEL) += jump_label.o
 
 ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is


I've tested the patch using a single 'static_branch()' in the getppid() path,
basically running tight loops of calls to getppid(). Before the patch, I
was seeing results similar to what you reported; after the patch, things
improved for all metrics. Here are my results for the branch disabled
case:

With jump labels turned on (CONFIG_JUMP_LABEL), branch disabled:

 Performance counter stats for 'bash -c /tmp/timing;true' (50 runs):

     3,969,510,217 instructions             #      0.864 IPC     ( +-0.000% )
     4,592,334,954 cycles                     ( +-   0.046% )
       751,634,470 branches                   ( +-   0.000% )

        1.722635797  seconds time elapsed   ( +-   0.046% )

Jump labels turned off (CONFIG_JUMP_LABEL not set), branch disabled:

 Performance counter stats for 'bash -c /tmp/timing;true' (50 runs):

     4,009,611,846 instructions             #      0.867 IPC     ( +-0.000% )
     4,622,210,580 cycles                     ( +-   0.012% )
       771,662,904 branches                   ( +-   0.000% )

        1.734341454  seconds time elapsed   ( +-   0.022% )


So all of the measured metrics improved in the jump labels case, by
roughly 0.5% to 2.5%.
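
The percentages can be reproduced from the two perf stat outputs above.
The following is a rough sketch only; the dictionary layout and the
helper names ipc and improvement_pct are assumptions made for this
illustration, and the numbers are copied from the results above.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions retired per cycle."""
    return instructions / cycles


def improvement_pct(with_jl: float, without_jl: float) -> float:
    """Percentage by which the jump-label value is lower than the baseline."""
    return (without_jl - with_jl) / without_jl * 100.0


# CONFIG_JUMP_LABEL set, branch disabled
jl = {
    "instructions": 3_969_510_217,
    "cycles": 4_592_334_954,
    "branches": 751_634_470,
    "elapsed": 1.722635797,
}

# CONFIG_JUMP_LABEL not set, branch disabled
no_jl = {
    "instructions": 4_009_611_846,
    "cycles": 4_622_210_580,
    "branches": 771_662_904,
    "elapsed": 1.734341454,
}

improvements = {key: improvement_pct(jl[key], no_jl[key]) for key in jl}

for name, pct in improvements.items():
    print(f"{name:12s} {pct:5.2f}% lower with jump labels")

ipc_jl = ipc(jl["instructions"], jl["cycles"])
ipc_no_jl = ipc(no_jl["instructions"], no_jl["cycles"])
print(f"IPC: {ipc_jl:.3f} (jump labels) vs {ipc_no_jl:.3f} (no jump labels)")
```

Note that IPC itself is marginally lower with jump labels here; the gain
comes from retiring fewer instructions and branches in fewer cycles.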

I'm curious to see what you find with this patch.

Thanks,

-Jason




* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
  2011-07-27 21:58       ` Jason Baron
@ 2011-08-05  3:53         ` Paul Turner
  2011-08-05  7:21           ` Peter Zijlstra
  2011-08-05  3:55         ` Paul Turner
  2011-08-05  8:30         ` Peter Zijlstra
  2 siblings, 1 reply; 14+ messages in thread
From: Paul Turner @ 2011-08-05  3:53 UTC (permalink / raw)
  To: Jason Baron
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
	rth

< snip>

>
> Hi Paul,
>
> Ok, I think I finally tracked this down. It may seem a bit crazy, but
> when we are getting down to cycle counting like this, it seems that the
> link order in the kernel/Makefile can make difference. I had the
> jump_label.o listed after the core files, whereas all the code in
> jump_label.o is really slow path code (used when toggling branch
> values). As follows:
>
>
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -10,7 +10,7 @@ obj-y     = sched.o fork.o exec_domain.o panic.o printk.o \
>            kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
>            hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
>            notifier.o ksysfs.o pm_qos_params.o sched_clock.o cred.o \
> -           async.o range.o jump_label.o
> +           async.o range.o
>  obj-y += groups.o
>
>  ifdef CONFIG_FUNCTION_TRACER
> @@ -107,6 +107,7 @@ obj-$(CONFIG_PERF_EVENTS) += events/
>  obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
>  obj-$(CONFIG_PADATA) += padata.o
>  obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
> +obj-$(CONFIG_JUMP_LABEL) += jump_label.o
>
>  ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
>  # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
>
>
> I've tested the patch using a single 'static_branch()' in the getppid() path,
> and basically running tight loops of calls to getppid(). Before, the
> patch, I was seeing results similar to what you reported, after the
> patch, things improved for all metrics. Here are my results for the
> branch disabled case:
>
> With jump labels turned on (CONFIG_JUMP_LABEL), branch disabled:
>
>  Performance counter stats for 'bash -c /tmp/timing;true' (50 runs):
>
>     3,969,510,217 instructions             #      0.864 IPC     ( +-0.000% )
>     4,592,334,954 cycles                     ( +-   0.046% )
>       751,634,470 branches                   ( +-   0.000% )
>
>        1.722635797  seconds time elapsed   ( +-   0.046% )
>
> Jump labels turned off (CONFIG_JUMP_LABEL not set), branch disabled:
>
>  Performance counter stats for 'bash -c /tmp/timing;true' (50 runs):
>
>     4,009,611,846 instructions             #      0.867 IPC     ( +-0.000% )
>     4,622,210,580 cycles                     ( +-   0.012% )
>       771,662,904 branches                   ( +-   0.000% )
>
>        1.734341454  seconds time elapsed   ( +-   0.022% )
>
>
> So all of the measured metrics improved in the jump labels case b/w
> 0.5% - 2.5%.
>
> I'm curious to see what you find with this patch.
>
> Thanks,
>
> -Jason
>

Hi Jason,

Thanks for taking a look at this.  Sorry for the delay: it took a few
days to benchmark all the permutations, and some issues with internal
proxies interrupted the benchmarking runs.

Results and some analysis follow.

[
Key:

npo_XXX = with CONFIG_JUMP_LABEL, without link order patch (no patched order)
po_XXX = with CONFIG_JUMP_LABEL, with link order patch (patched order)
njl_XXX = without CONFIG_JUMP_LABEL

Where "XXX" is
head: tip (c5bafb3) without patch series
cfs: tip + patch series - jump_label patch
cfs_jl: tip + patch series + jump_label for unconstrained

Test was repeated 3 times, each run was 50 repeats w/ typically ~<0.1
in-test variance on reported output
]

Considering just jump labels in tip, comparing against HEAD w/
!CONFIG_JUMP_LABEL

                            instructions            cycles                  branches              elapsed
---------------------------------------------------------------------------------------------------------------------
	Westmere:
njl_head.1                  798832892               722624737               145375836             0.203218936   [baseline]
njl_head.2                  798888783 (+0.01)       746118188 (+3.25)       145386807 (+0.01)     0.208573683 (-2.18)
njl_head.3                  798864253 (+0.00)       731537139 (+1.23)       145382747 (+0.00)     0.204098175 (-4.28)
npo_head.1                  797033521 (-0.23)       731239359 (+1.19)       144571358 (-0.55)     0.206910496 (-2.96)
npo_head.2                  797166434 (-0.21)       728926020 (+0.87)       144603465 (-0.53)     0.202906392 (-4.84)
npo_head.3                  797165370 (-0.21)       725930458 (+0.46)       144603438 (-0.53)     0.202118274 (-5.21)
po_head.1                   797019904 (-0.23)       699008145 (-3.27)       144567652 (-0.56)     0.197272615 (-7.48)
po_head.2                   797037682 (-0.22)       705732419 (-2.34)       144572115 (-0.55)     0.197101692 (-7.56)
po_head.3                   797079804 (-0.22)       698007668 (-3.41)       144580964 (-0.55)     0.194871253 (-8.61)

	Barcelona:
njl_head.1                  816842028               748362637               147462095             0.341654152
njl_head.2                  816849735 (+0.00)       748480742 (+0.02)       147462652 (+0.00)     0.341450734 (-2.90)
njl_head.3                  816834963 (-0.00)       747083797 (-0.17)       147460200 (-0.00)     0.340802353 (-3.09)
npo_head.1                  815068563 (-0.22)       775012690 (+3.56)       146661357 (-0.54)     0.353797321 (+0.61)
npo_head.2                  815033261 (-0.22)       759613364 (+1.50)       146654106 (-0.55)     0.346462671 (-1.48)
npo_head.3                  815029611 (-0.22)       762660196 (+1.91)       146654169 (-0.55)     0.347565129 (-1.16)
po_head.1                   815026489 (-0.22)       767229109 (+2.52)       146653376 (-0.55)     0.350241833 (-0.40)
po_head.2                   815035127 (-0.22)       770224495 (+2.92)       146654019 (-0.55)     0.351352092 (-0.09)
po_head.3                   815109904 (-0.21)       774954096 (+3.55)       146662020 (-0.54)     0.353505054 (+0.53)



With the patch to fix link-order we're typically faster and it's
probably time to modulate the configs so we get CONFIG_JUMP_LABEL by
default when CC_HAS_ASM_GOTO.
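
Since each configuration was run three times, one way to read the table
is to average the runs before comparing. The sketch below does that for
the Westmere cycle counts only; averaging is an assumption of this
illustration (the percentages in the table are relative to the single
njl_head.1 run), and the helper name delta_vs_baseline_pct is made up.

```python
from statistics import mean

# Westmere cycle counts for the three runs of each configuration,
# copied from the table above
westmere_cycles = {
    "njl_head": [722624737, 746118188, 731537139],
    "npo_head": [731239359, 728926020, 725930458],
    "po_head": [699008145, 705732419, 698007668],
}


def delta_vs_baseline_pct(runs, baseline_runs):
    """Relative difference of the mean of runs vs. the mean of baseline_runs, in %."""
    base = mean(baseline_runs)
    return (mean(runs) - base) / base * 100.0


deltas = {
    name: delta_vs_baseline_pct(runs, westmere_cycles["njl_head"])
    for name, runs in westmere_cycles.items()
}

for name, pct in deltas.items():
    print(f"{name:10s} mean cycles {pct:+.2f}% vs njl_head")
```

The same helper can be pointed at the Barcelona rows or at another
column; run-to-run spread within a configuration is worth checking
before drawing conclusions from differences of a percent or so.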

Considering Bandwidth control, comparing vs HEAD w/ CONFIG_JUMP_LABEL:

                            instructions            cycles                  branches              elapsed
---------------------------------------------------------------------------------------------------------------------
	Westmere:
po_head.1                   797019904               699008145               144567652             0.197272615 [Baseline]
po_head.2                   797037682 (+0.00)       705732419 (+0.96)       144572115 (+0.00)     0.197101692 (-4.91)
po_head.3                   797079804 (+0.01)       698007668 (-0.14)       144580964 (+0.01)     0.194871253 (-5.98)
njl_cfs.1                   802649718 (+0.71)       708143552 (+1.31)       146577437 (+1.39)     0.198770168 (-4.10)
njl_cfs.2                   802679078 (+0.71)       707486608 (+1.21)       146582628 (+1.39)     0.197890812 (-4.53)
njl_cfs.3                   802647500 (+0.71)       704770712 (+0.82)       146578141 (+1.39)     0.196742304 (-5.08)
npo_cfs.1                   800661523 (+0.46)       724068093 (+3.59)       145774786 (+0.83)     0.204632700 (-1.27)
npo_cfs.2                   800646997 (+0.46)       718884486 (+2.84)       145772293 (+0.83)     0.201248482 (-2.91)
npo_cfs.3                   800783171 (+0.47)       725140326 (+3.74)       145804350 (+0.86)     0.203266025 (-1.93)
npo_cfs_jl.1                797304605 (+0.04)       687741762 (-1.61)       143666256 (-0.62)     0.194302293 (-6.26)
npo_cfs_jl.2                797446281 (+0.05)       694066715 (-0.71)       143700065 (-0.60)     0.194212118 (-6.30)
npo_cfs_jl.3                797374495 (+0.04)       697561774 (-0.21)       143682692 (-0.61)     0.194935111 (-5.95)
po_cfs.1                    800631004 (+0.45)       715819643 (+2.41)       145769677 (+0.83)     0.200007036 (-3.51)
po_cfs.2                    800642622 (+0.45)       698569729 (-0.06)       145769973 (+0.83)     0.194625680 (-6.10)
po_cfs.3                    800752778 (+0.47)       707282749 (+1.18)       145798992 (+0.85)     0.197047366 (-4.93)
po_cfs_jl.1                 797306617 (+0.04)       686329256 (-1.81)       143666659 (-0.62)     0.193107369 (-6.83)
po_cfs_jl.2                 797434478 (+0.05)       677865445 (-3.02)       143697712 (-0.60)     0.189314824 (-8.66)
po_cfs_jl.3                 797299055 (+0.04)       686371679 (-1.81)       143665758 (-0.62)     0.191859014 (-7.44)

	Barcelona:
po_head.1                   815026489               767229109               146653376             0.350241833 [Baseline]
po_head.2                   815035127 (+0.00)       770224495 (+0.39)       146654019 (+0.00)     0.351352092 (-2.47)
po_head.3                   815109904 (+0.01)       774954096 (+1.01)       146662020 (+0.01)     0.353505054 (-1.87)
njl_cfs.1                   820647075 (+0.69)       756895773 (-1.35)       148663929 (+1.37)     0.345563962 (-4.07)
njl_cfs.2                   820672501 (+0.69)       761520373 (-0.74)       148667815 (+1.37)     0.347529253 (-3.53)
njl_cfs.3                   820664350 (+0.69)       763400895 (-0.50)       148666126 (+1.37)     0.348337223 (-3.30)
npo_cfs.1                   818629349 (+0.44)       758306455 (-1.16)       147854452 (+0.82)     0.346678486 (-3.77)
npo_cfs.2                   818829256 (+0.47)       768393448 (+0.15)       147891099 (+0.84)     0.350678075 (-2.65)
npo_cfs.3                   818697806 (+0.45)       772218715 (+0.65)       147866720 (+0.83)     0.352333672 (-2.20)
npo_cfs_jl.1                815343935 (+0.04)       760127157 (-0.93)       145753233 (-0.61)     0.347184970 (-3.62)
npo_cfs_jl.2                815415786 (+0.05)       775772068 (+1.11)       145762961 (-0.61)     0.353965833 (-1.74)
npo_cfs_jl.3                815403187 (+0.05)       764048918 (-0.41)       145761012 (-0.61)     0.348619922 (-3.23)
po_cfs.1                    819204964 (+0.51)       767156385 (-0.01)       147959727 (+0.89)     0.350737982 (-2.64)
po_cfs.2                    818665676 (+0.45)       764324366 (-0.38)       147860788 (+0.82)     0.348814489 (-3.17)
po_cfs.3                    818661849 (+0.45)       752288492 (-1.95)       147859717 (+0.82)     0.343294319 (-4.70)
po_cfs_jl.1                 815336908 (+0.04)       765760248 (-0.19)       145755155 (-0.61)     0.349608614 (-2.95)
po_cfs_jl.2                 815322295 (+0.04)       765613685 (-0.21)       145751972 (-0.61)     0.349321663 (-3.03)
po_cfs_jl.3                 815310833 (+0.03)       759647967 (-0.99)       145750118 (-0.62)     0.346607639 (-3.78)

Thanks to the magic of compiler re-organization we now report zero
overhead, in fact a speed-up is realized.

I will re-post v7.3 with:
- rebase to minor changes in tip
- removing RFT from adding jump_labels to CFS
- additional hierarchical period constraint

Thanks for looking into this Jason!

- Paul


* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
  2011-07-27 21:58       ` Jason Baron
  2011-08-05  3:53         ` Paul Turner
@ 2011-08-05  3:55         ` Paul Turner
  2011-08-05 18:28           ` Jason Baron
  2011-08-05  8:30         ` Peter Zijlstra
  2 siblings, 1 reply; 14+ messages in thread
From: Paul Turner @ 2011-08-05  3:55 UTC (permalink / raw)
  To: Jason Baron
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
	rth

> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -10,7 +10,7 @@ obj-y     = sched.o fork.o exec_domain.o panic.o printk.o \
>            kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
>            hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
>            notifier.o ksysfs.o pm_qos_params.o sched_clock.o cred.o \
> -           async.o range.o jump_label.o
> +           async.o range.o
>  obj-y += groups.o
>
>  ifdef CONFIG_FUNCTION_TRACER
> @@ -107,6 +107,7 @@ obj-$(CONFIG_PERF_EVENTS) += events/
>  obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
>  obj-$(CONFIG_PADATA) += padata.o
>  obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
> +obj-$(CONFIG_JUMP_LABEL) += jump_label.o
>
>  ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
>  # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
>

Tested-by: Paul Turner <pjt@google.com>

Let me know if you need any result tables for the actual commit msg.
Same goes for making CONFIG_JUMP_LABEL the default in the
CC_HAS_ASM_GOTO case (at least on x86 anyway).


>
> I've tested the patch using a single 'static_branch()' in the getppid() path,
> and basically running tight loops of calls to getppid(). Before, the
> patch, I was seeing results similar to what you reported, after the
> patch, things improved for all metrics. Here are my results for the
> branch disabled case:
>
> With jump labels turned on (CONFIG_JUMP_LABEL), branch disabled:
>
>  Performance counter stats for 'bash -c /tmp/timing;true' (50 runs):
>
>     3,969,510,217 instructions             #      0.864 IPC     ( +-0.000% )
>     4,592,334,954 cycles                     ( +-   0.046% )
>       751,634,470 branches                   ( +-   0.000% )
>
>        1.722635797  seconds time elapsed   ( +-   0.046% )
>
> Jump labels turned off (CONFIG_JUMP_LABEL not set), branch disabled:
>
>  Performance counter stats for 'bash -c /tmp/timing;true' (50 runs):
>
>     4,009,611,846 instructions             #      0.867 IPC     ( +-0.000% )
>     4,622,210,580 cycles                     ( +-   0.012% )
>       771,662,904 branches                   ( +-   0.000% )
>
>        1.734341454  seconds time elapsed   ( +-   0.022% )
>
>
> So all of the measured metrics improved in the jump labels case b/w
> 0.5% - 2.5%.
>
> I'm curious to see what you find with this patch.
>
> Thanks,
>
> -Jason
>
>
>


* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
  2011-08-05  3:53         ` Paul Turner
@ 2011-08-05  7:21           ` Peter Zijlstra
  0 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2011-08-05  7:21 UTC (permalink / raw)
  To: Paul Turner
  Cc: Jason Baron, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
	rth

On Thu, 2011-08-04 at 20:53 -0700, Paul Turner wrote:
> 
> I will re-post v7.3 with:
> - rebase to minor changes in tip
> - removing RFT from adding jump_labels to CFS
> - additional hierarchical period constraint 

Could you rebase to -tip + my patches? Most of your previous set is
already queued there. The reason it's not in -tip is that the merge
window fallout still has -tip in a somewhat shaky state.


* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
  2011-07-27 21:58       ` Jason Baron
  2011-08-05  3:53         ` Paul Turner
  2011-08-05  3:55         ` Paul Turner
@ 2011-08-05  8:30         ` Peter Zijlstra
  2011-08-05 15:11           ` Richard Henderson
  2 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2011-08-05  8:30 UTC (permalink / raw)
  To: Jason Baron
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
	rth

On Wed, 2011-07-27 at 17:58 -0400, Jason Baron wrote:
> Ok, I think I finally tracked this down. It may seem a bit crazy, but
> when we are getting down to cycle counting like this, it seems that the
> link order in the kernel/Makefile can make difference. I had the
> jump_label.o listed after the core files, whereas all the code in
> jump_label.o is really slow path code (used when toggling branch
> values). As follows:
> 
> 
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -10,7 +10,7 @@ obj-y     = sched.o fork.o exec_domain.o panic.o printk.o \
>             kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
>             hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
>             notifier.o ksysfs.o pm_qos_params.o sched_clock.o cred.o \
> -           async.o range.o jump_label.o
> +           async.o range.o
>  obj-y += groups.o
>  
>  ifdef CONFIG_FUNCTION_TRACER
> @@ -107,6 +107,7 @@ obj-$(CONFIG_PERF_EVENTS) += events/
>  obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
>  obj-$(CONFIG_PADATA) += padata.o
>  obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
> +obj-$(CONFIG_JUMP_LABEL) += jump_label.o 


OK, so _WHY_ does that make a difference and will a next version of
gnu-binutils not mess that up?


* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
  2011-08-05  8:30         ` Peter Zijlstra
@ 2011-08-05 15:11           ` Richard Henderson
  2011-08-05 15:14             ` Peter Zijlstra
  2011-08-05 15:24             ` Jason Baron
  0 siblings, 2 replies; 14+ messages in thread
From: Richard Henderson @ 2011-08-05 15:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jason Baron, Paul Turner, linux-kernel, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Hidetoshi Seto,
	Ingo Molnar, Pavel Emelyanov

On 08/05/2011 01:30 AM, Peter Zijlstra wrote:
> OK, so _WHY_ does that make a difference and will a next version of
> gnu-binutils not mess that up?

The Why is micro-architectural, and I can't answer that.

But ld will never re-order the files as given on the command-line.
There are too many functions and tables that are constructed 
piece-wise from input sections; re-ordering them would change
the semantics of the program.


r~


* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
  2011-08-05 15:11           ` Richard Henderson
@ 2011-08-05 15:14             ` Peter Zijlstra
  2011-08-05 15:24             ` Jason Baron
  1 sibling, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2011-08-05 15:14 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Jason Baron, Paul Turner, linux-kernel, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Hidetoshi Seto,
	Ingo Molnar, Pavel Emelyanov

On Fri, 2011-08-05 at 08:11 -0700, Richard Henderson wrote:
> On 08/05/2011 01:30 AM, Peter Zijlstra wrote:
> > OK, so _WHY_ does that make a difference and will a next version of
> > gnu-binutils not mess that up?
> 
> The Why is micro-architectural, and I can't answer that.
> 
> But ld will never re-order the files as given on the command-line.
> There are too many functions and tables that are constructed 
> piece-wise from input sections; re-ordering them would change
> the semantics of the program.

Right, so I was wondering about things like whole-program-optimization
passes at link time. Since I've no clue why the proposed patch does what
it does, it's hard to say what invariant needs to be kept.




* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
  2011-08-05 15:11           ` Richard Henderson
  2011-08-05 15:14             ` Peter Zijlstra
@ 2011-08-05 15:24             ` Jason Baron
  1 sibling, 0 replies; 14+ messages in thread
From: Jason Baron @ 2011-08-05 15:24 UTC (permalink / raw)
  To: Richard Henderson, a.p.zijlstra
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

On Fri, Aug 05, 2011 at 08:11:15AM -0700, Richard Henderson wrote:
> On 08/05/2011 01:30 AM, Peter Zijlstra wrote:
> > OK, so _WHY_ does that make a difference and will a next version of
> > gnu-binutils not mess that up?
> 
> The Why is micro-architectural, and I can't answer that.

In tracking this down, I eventually found that just having the
jump_label.o file compiled into the kernel, without actually using
static_branch() or 'asm goto' anywhere, led to a performance hit.
Thus, neither the compiler nor the 'asm goto' itself was actually
causing any degradation.

Since the jump_label.o file is only slow-path code, it can be moved away
from core or heavily called kernel routines. I suspect this is probably
an icache issue, but I can't say for sure.

Thanks,

-Jason

> 
> But ld will never re-order the files as given on the command-line.
> There are too many functions and tables that are constructed 
> piece-wise from input sections; re-ordering them would change
> the semantics of the program.
> 
> 
> r~


* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
  2011-08-05  3:55         ` Paul Turner
@ 2011-08-05 18:28           ` Jason Baron
  0 siblings, 0 replies; 14+ messages in thread
From: Jason Baron @ 2011-08-05 18:28 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
	rth, rostedt

On Thu, Aug 04, 2011 at 08:55:08PM -0700, Paul Turner wrote:
> > --- a/kernel/Makefile
> > +++ b/kernel/Makefile
> > @@ -10,7 +10,7 @@ obj-y     = sched.o fork.o exec_domain.o panic.o printk.o \
> >            kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
> >            hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
> >            notifier.o ksysfs.o pm_qos_params.o sched_clock.o cred.o \
> > -           async.o range.o jump_label.o
> > +           async.o range.o
> >  obj-y += groups.o
> >
> >  ifdef CONFIG_FUNCTION_TRACER
> > @@ -107,6 +107,7 @@ obj-$(CONFIG_PERF_EVENTS) += events/
> >  obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
> >  obj-$(CONFIG_PADATA) += padata.o
> >  obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
> > +obj-$(CONFIG_JUMP_LABEL) += jump_label.o
> >
> >  ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
> >  # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
> >
> 
> Tested-by: Paul Turner <pjt@google.com>
> 
> Let me know if you need any result tables for the actual commit msg.

Hi Paul,

Thanks for taking the time to test this :) I'll post the patch shortly
with my own testing results. Hopefully, it can still be considered for
3.1 b/c of the non-invasive nature of the patch...

> Same goes for making CONFIG_JUMP_LABEL equivalent to default in
> CC_HAS_ASM_GOTO case (at least on x86 anyway).
> 

I originally had CONFIG_JUMP_LABEL implicitly turned on, but we ran into
a 32-bit compiler issue that was causing random, nasty crashes. That
issue has since been resolved in gcc, but we might need to update the
CC_HAS_ASM_GOTO check to deal with that case better. Currently,
we're using the '-maccumulate-outgoing-args' gcc option to work around
the issue for 32-bit x86 (see: arch/x86/Makefile_32.cpu).

With the jump label interface somewhat stabilizing (I say somewhat, b/c Peter
brought up a good use case in the scheduler that it currently doesn't address,
but which we should be able to support without too much churn) and these testing
results, I think it might make sense to consider turning it on by default for
3.2. Thoughts?

Thanks,

-Jason


> 
> >
> > I've tested the patch using a single 'static_branch()' in the getppid() path,
> > and basically running tight loops of calls to getppid(). Before, the
> > patch, I was seeing results similar to what you reported, after the
> > patch, things improved for all metrics. Here are my results for the
> > branch disabled case:
> >
> > With jump labels turned on (CONFIG_JUMP_LABEL), branch disabled:
> >
> >  Performance counter stats for 'bash -c /tmp/timing;true' (50 runs):
> >
> >     3,969,510,217 instructions             #      0.864 IPC     ( +-0.000% )
> >     4,592,334,954 cycles                     ( +-   0.046% )
> >       751,634,470 branches                   ( +-   0.000% )
> >
> >        1.722635797  seconds time elapsed   ( +-   0.046% )
> >
> > Jump labels turned off (CONFIG_JUMP_LABEL not set), branch disabled:
> >
> >  Performance counter stats for 'bash -c /tmp/timing;true' (50 runs):
> >
> >     4,009,611,846 instructions             #      0.867 IPC     ( +-0.000% )
> >     4,622,210,580 cycles                     ( +-   0.012% )
> >       771,662,904 branches                   ( +-   0.000% )
> >
> >        1.734341454  seconds time elapsed   ( +-   0.022% )
> >
> >
> > So all of the measured metrics improved in the jump labels case b/w
> > 0.5% - 2.5%.
> >
> > I'm curious to see what you find with this patch.
> >
> > Thanks,
> >
> > -Jason
> >
> >
> >


* [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  0 siblings, 0 replies; 14+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-add_jump_labels.patch --]
[-- Type: text/plain, Size: 16125 bytes --]

So I'm seeing some strange costs associated with jump_labels; while on paper
the branches and instructions retired improve (as expected) we're taking an
unexpected hit in IPC.

[From the initial mail we have workloads:
  mkdir -p /cgroup/cpu/test
  echo $$ > /dev/cgroup/cpu/test (only cpu,cpuacct mounted)
  (W1) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "for ((i=0;i<5;i++)); do $(dirname $0)/pipe-test 20000; done"
  (W2)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;true"
  (W3)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;"
]

To make some of the figures more clear:

Legend:
!BWC = tip + bwc, BWC compiled out
BWC = tip + bwc
BWC_JL = tip + bwc + jump label (this patch)


Now, comparing under W1 we see:
W1: BWC vs BWC_JL
                            instructions            cycles                  branches              elapsed                
---------------------------------------------------------------------------------------------------------------------
clovertown [BWC]            845934117               974222228               152715407             0.419014188 [baseline]
+unconstrained              857963815 (+1.42)      1007152750 (+3.38)       153140328 (+0.28)     0.433186926 (+3.38)  [rel]
+10000000000/1000:          876937753 (+2.55)      1033978705 (+5.65)       160038434 (+3.59)     0.443638365 (+5.66)  [rel]
+10000000000/1000000:       880276838 (+3.08)      1036176245 (+6.13)       160683878 (+4.15)     0.444577244 (+6.14)  [rel]

barcelona [BWC]             820573353               748178486               148161233             0.342122850 [baseline] 
+unconstrained              817011602 (-0.43)       759838181 (+1.56)       145951513 (-1.49)     0.347462571 (+1.56)  [rel]
+10000000000/1000:          830109086 (+0.26)       770451537 (+1.67)       151228902 (+1.08)     0.350824677 (+1.65)  [rel]
+10000000000/1000000:       830196206 (+0.30)       770704213 (+2.27)       151250413 (+1.12)     0.350962182 (+2.28)  [rel]

westmere [BWC]              802533191               694415157               146071233             0.194428018 [baseline]
+unconstrained              799057936 (-0.43)       751384496 (+8.20)       143875513 (-1.50)     0.211182620 (+8.62)  [rel]
+10000000000/1000:          812033785 (+0.27)       761469084 (+8.51)       149134146 (+1.09)     0.212149229 (+8.28)  [rel]
+10000000000/1000000:       811912834 (+0.27)       757842988 (+7.45)       149113291 (+1.09)     0.211364804 (+7.30)  [rel]

e.g. Barcelona issues ~0.43% fewer instructions, for a total of 817011602, in
the unconstrained case with BWC_JL.


Where "unconstrained, 10000000000/1000, 10000000000/1000000" are the
measurements for BWC_JL, with (%d) being the relative difference to their
BWC counterparts.
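
As a worked check of how the bracketed percentages in the table can be reproduced, here is a minimal sketch. The function name `rel_diff_percent` is hypothetical and only used for illustration; the inputs are taken from the unconstrained rows above.

```c
#include <assert.h>

/* Relative difference of `measured` against `baseline`, in percent.
 * Negative means the measured run used fewer events than the baseline. */
double rel_diff_percent(unsigned long long baseline,
			unsigned long long measured)
{
	assert(baseline != 0);
	return ((double)measured - (double)baseline) * 100.0 /
	       (double)baseline;
}
```

For example, `rel_diff_percent(820573353ULL, 817011602ULL)` gives roughly -0.43 (barcelona instructions, unconstrained), and `rel_diff_percent(694415157ULL, 751384496ULL)` gives roughly +8.20 (westmere cycles, unconstrained).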

W1: BWC vs BWC_JL is very similar.
	BWC vs BWC_JL
clovertown [BWC]            985732031              1283113452               175621212             1.375905653  
+unconstrained              979242938 (-0.66)      1288971141 (+0.46)       172122546 (-1.99)     1.389795165 (+1.01)  [rel]
+10000000000/1000:          999886468 (+0.33)      1296597143 (+1.13)       180554004 (+1.62)     1.392576770 (+1.18)  [rel]
+10000000000/1000000:       999034223 (+0.11)      1293925500 (+0.57)       180413829 (+1.39)     1.391041338 (+0.94)  [rel]

barcelona [BWC]             982139920              1078757792               175417574             1.069537049  
+unconstrained              965443672 (-1.70)      1075377223 (-0.31)       170215844 (-2.97)     1.045595065 (-2.24)  [rel]
+10000000000/1000:          989104943 (+0.05)      1100836668 (+0.52)       178837754 (+1.22)     1.058730316 (-1.77)  [rel]
+10000000000/1000000:       987627489 (-0.32)      1095843758 (-0.17)       178567411 (+0.84)     1.056100899 (-2.28)  [rel]

westmere [BWC]              918633403               896047900               166496917             0.754629182  
+unconstrained              914740541 (-0.42)       903906801 (+0.88)       163652848 (-1.71)     0.758050332 (+0.45)  [rel]
+10000000000/1000:          927517377 (-0.41)       952579771 (+5.67)       170173060 (+0.75)     0.771193786 (+2.43)  [rel]
+10000000000/1000000:       914676985 (-0.89)       936106277 (+3.81)       167683288 (+0.22)     0.764973632 (+1.38)  [rel]

Now this is rather odd: almost across the board we're seeing the expected
drops in instructions and branches, yet we appear to be paying a heavy IPC
price.  The fact that wall-time has scaled equivalently with cycles roughly
rules out the cycles counter being off.
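
To make the IPC drop concrete, here is a small sketch that recomputes instructions-per-cycle from the raw counter values. The helper name `ipc` is hypothetical; the sample inputs are the Westmere unconstrained measurements reported in this mail.

```c
#include <assert.h>

/* Instructions retired per cycle, computed from two perf counter totals. */
double ipc(unsigned long long instructions, unsigned long long cycles)
{
	assert(cycles != 0);
	return (double)instructions / (double)cycles;
}
```

With the Westmere numbers, `ipc(802533191ULL, 694415157ULL)` is about 1.156 for BWC and `ipc(799057936ULL, 751384496ULL)` is about 1.063 for BWC_JL: fewer instructions retired, but noticeably more cycles spent.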

We are seeing the expected behavior in the bandwidth enabled case;
specifically the <jl=jmp><ret><cond><ret> blocks are taking an extra branch
and instruction which shows up on all the numbers above.

With respect to compiler mangling, the text is essentially unchanged in size.
One lurking suspicion is whether the inserted nops have perturbed some of the
jmp/branch alignments?

    text    data     bss     dec     hex filename
 7277206 2827256 2125824 12230286         ba9e8e vmlinux.jump_label
 7276886 2826744 2125824 12229454         ba9b4e vmlinux.no_jump_label

I have checked to make sure that the right instructions are being patched in
at run-time.  I've also pulled a fully patched jump_label out of the kernel
into a userspace test (and benchmarked it directly under perf).  The results
here are also exactly as expected.

e.g.
 Performance counter stats for './jump_test':
     1,500,839,002 instructions, 300,147,081 branches, 702,468,404 cycles
 Performance counter stats for './jump_test 1':
     2,001,014,609 instructions, 400,177,192 branches, 901,758,219 cycles

Overall if we can fix the IPC the benefit in the globally unconstrained case
looks really good.

Any thoughts Jason?

-----
Some more raw data:

perf-stat_to_perf-stat variance in performance for W1:

	BWC_JL vs BWC_JL (sample run-to-run variance on JL measurements)
                            instructions            cycles                  branches              elapsed                
---------------------------------------------------------------------------------------------------------------------
clovertown [BWC_JL]         857963815              1007152750               153140328             0.433186926  
+unconstrained              856457537 (-0.18)       986820040 (-2.02)       152871983 (-0.18)     0.424187340 (-2.08)  [rel]
+10000000000/1000:          880281114 (+0.38)      1009349419 (-2.38)       160668480 (+0.39)     0.433031825 (-2.39)  [rel]
+10000000000/1000000:       881001883 (+0.08)      1008445782 (-2.68)       160811824 (+0.08)     0.432629132 (-2.69)  [rel]

barcelona [BWC_JL]          817011602               759838181               145951513             0.347462571  
+unconstrained              817076246 (+0.01)       758404044 (-0.19)       145958670 (+0.00)     0.346313238 (-0.33)  [rel]
+10000000000/1000:          830087089 (-0.00)       773100724 (+0.34)       151218674 (-0.01)     0.352047450 (+0.35)  [rel]
+10000000000/1000000:       830002149 (-0.02)       773209942 (+0.33)       151208657 (-0.03)     0.352090862 (+0.32)  [rel]

westmere [BWC_JL]           799057936               751384496               143875513             0.211182620  
+unconstrained              799067664 (+0.00)       751165910 (-0.03)       143877385 (+0.00)     0.210928554 (-0.12)  [rel]
+10000000000/1000:          812040497 (+0.00)       748711039 (-1.68)       149135568 (+0.00)     0.208868390 (-1.55)  [rel]
+10000000000/1000000:       811911208 (-0.00)       746860347 (-1.45)       149113194 (-0.00)     0.208663627 (-1.28)  [rel]

	BWC vs BWC (sample run-to-run variance on BWC measurements)

clovertown [BWC]            845934117               974222228               152715407             0.419014188  
+unconstrained              849061624 (+0.37)       965568244 (-0.89)       153288606 (+0.38)     0.415287406 (-0.89)  [rel]
+10000000000/1000:          861138018 (+0.71)       975979688 (-0.28)       155594606 (+0.71)     0.418710227 (-0.28)  [rel]
+10000000000/1000000:       858768659 (+0.56)       972288157 (-0.42)       155163198 (+0.57)     0.417130144 (-0.42)  [rel]

barcelona [BWC]             820573353               748178486               148161233             0.342122850  
+unconstrained              820494225 (-0.01)       748302946 (+0.02)       148147559 (-0.01)     0.341349438 (-0.23)  [rel]
+10000000000/1000:          827929735 (-0.00)       756163375 (-0.22)       149609111 (-0.00)     0.344356113 (-0.22)  [rel]
+10000000000/1000000:       827682550 (-0.00)       759867539 (+0.84)       149565408 (-0.00)     0.346039855 (+0.84)  [rel]

westmere [BWC]              802533191               694415157               146071233             0.194428018  
+unconstrained              802648805 (+0.01)       698052899 (+0.52)       146099982 (+0.02)     0.195632318 (+0.62)  [rel]
+10000000000/1000:          809855427 (-0.00)       703633926 (+0.26)       147519800 (-0.00)     0.196545542 (+0.32)  [rel]
+10000000000/1000000:       809646717 (-0.01)       704895639 (-0.05)       147476169 (-0.02)     0.197022787 (+0.01)  [rel]

Raw Westmere measurements:

BWC:
Case: Unconstrained -1

 Performance counter stats for 'bash -c for ((i=0;i<5;i++)); do ./pipe-test 20000; done' (50 runs):

         802533191 instructions             #      1.156 IPC     ( +-   0.004% )
         694415157 cycles                     ( +-   0.165% )
         146071233 branches                   ( +-   0.003% )

        0.194428018  seconds time elapsed   ( +-   0.437% )

Case: 10000000000/1000:

 Performance counter stats for 'bash -c for ((i=0;i<5;i++)); do ./pipe-test 20000; done' (50 runs):

         809861594 instructions             #      1.154 IPC     ( +-   0.016% )
         701781996 cycles                     ( +-   0.184% )
         147520953 branches                   ( +-   0.022% )

        0.195928354  seconds time elapsed   ( +-   0.262% )


Case: 10000000000/1000000:

 Performance counter stats for 'bash -c for ((i=0;i<5;i++)); do ./pipe-test 20000; done' (50 runs):

         809752541 instructions             #      1.148 IPC     ( +-   0.016% )
         705278419 cycles                     ( +-   0.593% )
         147502154 branches                   ( +-   0.022% )

        0.196993502  seconds time elapsed   ( +-   0.698% )

BWC_JL
Case: Unconstrained -1

 Performance counter stats for 'bash -c for ((i=0;i<5;i++)); do ./pipe-test 20000; done' (50 runs):

         799057936 instructions             #      1.063 IPC     ( +-   0.001% )
         751384496 cycles                     ( +-   0.584% )
         143875513 branches                   ( +-   0.001% )

        0.211182620  seconds time elapsed   ( +-   0.771% )

Case: 10000000000/1000:

 Performance counter stats for 'bash -c for ((i=0;i<5;i++)); do ./pipe-test 20000; done' (50 runs):

         812033785 instructions             #      1.066 IPC     ( +-   0.017% )
         761469084 cycles                     ( +-   0.125% )
         149134146 branches                   ( +-   0.022% )

        0.212149229  seconds time elapsed   ( +-   0.171% )


Case: 10000000000/1000000:

 Performance counter stats for 'bash -c for ((i=0;i<5;i++)); do ./pipe-test 20000; done' (50 runs):

         811912834 instructions             #      1.071 IPC     ( +-   0.017% )
         757842988 cycles                     ( +-   0.158% )
         149113291 branches                   ( +-   0.022% )

        0.211364804  seconds time elapsed   ( +-   0.225% )


Let me know if there's any particular raw data you want; westmere seems the
most interesting because it's taking the biggest hit.

-------


From: Paul Turner <pjt@google.com>
When no groups within the system are constrained we can use jump labels to
reduce overheads -- skipping the per-cfs_rq runtime enabled checks.

Signed-off-by: Paul Turner <pjt@google.com>
---
 kernel/sched.c      |   33 +++++++++++++++++++++++++++++++--
 kernel/sched_fair.c |   15 ++++++++++++---
 2 files changed, 43 insertions(+), 5 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -71,6 +71,7 @@
 #include <linux/ctype.h>
 #include <linux/ftrace.h>
 #include <linux/slab.h>
+#include <linux/jump_label.h>
 
 #include <asm/tlb.h>
 #include <asm/irq_regs.h>
@@ -499,7 +500,32 @@ static void destroy_cfs_bandwidth(struct
 	hrtimer_cancel(&cfs_b->period_timer);
 	hrtimer_cancel(&cfs_b->slack_timer);
 }
-#else
+
+#ifdef HAVE_JUMP_LABEL
+static struct jump_label_key __cfs_bandwidth_enabled;
+
+static inline bool cfs_bandwidth_enabled(void)
+{
+	return static_branch(&__cfs_bandwidth_enabled);
+}
+
+static void account_cfs_bandwidth_enabled(int enabled, int was_enabled)
+{
+	/* only need to count groups transitioning between enabled/!enabled */
+	if (enabled && !was_enabled)
+		jump_label_inc(&__cfs_bandwidth_enabled);
+	else if (!enabled && was_enabled)
+		jump_label_dec(&__cfs_bandwidth_enabled);
+}
+#else /* !HAVE_JUMP_LABEL */
+/* static_branch doesn't help unless supported */
+static int cfs_bandwidth_enabled(void)
+{
+	return 1;
+}
+static void account_cfs_bandwidth_enabled(int enabled, int was_enabled) {}
+#endif /* HAVE_JUMP_LABEL */
+#else /* !CONFIG_CFS_BANDWIDTH */
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
@@ -9025,7 +9051,7 @@ static int __cfs_schedulable(struct task
 
 static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 {
-	int i, ret = 0, runtime_enabled;
+	int i, ret = 0, runtime_enabled, runtime_was_enabled;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
 
 	if (tg == &root_task_group)
@@ -9053,6 +9079,9 @@ static int tg_set_cfs_bandwidth(struct t
 		goto out_unlock;
 
 	runtime_enabled = quota != RUNTIME_INF;
+	runtime_was_enabled = cfs_b->quota != RUNTIME_INF;
+	account_cfs_bandwidth_enabled(runtime_enabled, runtime_was_enabled);
+
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1430,7 +1430,7 @@ static void __account_cfs_rq_runtime(str
 static __always_inline void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 						   unsigned long delta_exec)
 {
-	if (!cfs_rq->runtime_enabled)
+	if (!cfs_bandwidth_enabled() || !cfs_rq->runtime_enabled)
 		return;
 
 	__account_cfs_rq_runtime(cfs_rq, delta_exec);
@@ -1438,13 +1438,13 @@ static __always_inline void account_cfs_
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {
-	return cfs_rq->throttled;
+	return cfs_bandwidth_enabled() && cfs_rq->throttled;
 }
 
 /* check whether cfs_rq, or any parent, is throttled */
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 {
-	return cfs_rq->throttle_count;
+	return cfs_bandwidth_enabled() && cfs_rq->throttle_count;
 }
 
 /*
@@ -1765,6 +1765,9 @@ static void __return_cfs_rq_runtime(stru
 
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
+	if (!cfs_bandwidth_enabled())
+		return;
+
 	if (!cfs_rq->runtime_enabled || !cfs_rq->nr_running)
 		return;
 
@@ -1810,6 +1813,9 @@ static void do_sched_cfs_slack_timer(str
  */
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
 {
+	if (!cfs_bandwidth_enabled())
+		return;
+
 	/* an active group must be handled by the update_curr()->put() path */
 	if (!cfs_rq->runtime_enabled || cfs_rq->curr)
 		return;
@@ -1827,6 +1833,9 @@ static void check_enqueue_throttle(struc
 /* conditionally throttle active cfs_rq's from put_prev_entity() */
 static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
+	if (!cfs_bandwidth_enabled())
+		return;
+
 	if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
 		return;
 

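The enable/disable accounting in the patch above can be sketched in userspace as follows. This is only an illustration under stated assumptions: `struct fake_key` is a plain counter standing in for the kernel's jump label key (no code patching happens here), and `FAKE_RUNTIME_INF`, `struct fake_group`, `group_init` and `group_set_quota` are hypothetical names invented for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for "no quota set" (unconstrained group). */
#define FAKE_RUNTIME_INF (~0ULL)

/* Plain counter emulating a jump label key; the real thing patches code. */
struct fake_key { int count; };
struct fake_group { unsigned long long quota; };

static struct fake_key bandwidth_key;

static void fake_key_inc(struct fake_key *key) { key->count++; }

static void fake_key_dec(struct fake_key *key)
{
	assert(key->count > 0);
	key->count--;
}

/* True while at least one group has a quota configured. */
bool bandwidth_enabled(void)
{
	return bandwidth_key.count > 0;
}

/* Only count groups transitioning between enabled and !enabled. */
static void account_bandwidth_enabled(bool enabled, bool was_enabled)
{
	if (enabled && !was_enabled)
		fake_key_inc(&bandwidth_key);
	else if (!enabled && was_enabled)
		fake_key_dec(&bandwidth_key);
}

void group_init(struct fake_group *g)
{
	g->quota = FAKE_RUNTIME_INF;
}

/* Mirrors the shape of the accounting hook added to tg_set_cfs_bandwidth(). */
void group_set_quota(struct fake_group *g, unsigned long long quota)
{
	bool enabled = quota != FAKE_RUNTIME_INF;
	bool was_enabled = g->quota != FAKE_RUNTIME_INF;

	account_bandwidth_enabled(enabled, was_enabled);
	g->quota = quota;
}
```

Changing the quota of an already-constrained group leaves the counter untouched, so the global check only flips when the first group becomes constrained or the last one becomes unconstrained.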



end of thread, other threads:[~2011-08-05 18:29 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-22  0:32 Jason Baron
2011-07-22  0:57 ` Paul Turner
2011-07-22  1:17   ` [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive Jason Baron
2011-07-22  1:38     ` Paul Turner
2011-07-27 21:58       ` Jason Baron
2011-08-05  3:53         ` Paul Turner
2011-08-05  7:21           ` Peter Zijlstra
2011-08-05  3:55         ` Paul Turner
2011-08-05 18:28           ` Jason Baron
2011-08-05  8:30         ` Peter Zijlstra
2011-08-05 15:11           ` Richard Henderson
2011-08-05 15:14             ` Peter Zijlstra
2011-08-05 15:24             ` Jason Baron
  -- strict thread matches above, loose matches on Subject: below --
2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
2011-07-21 16:43 ` [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive Paul Turner
