* (no subject)
@ 2011-07-22  0:32 Jason Baron
  2011-07-22  0:57 ` Paul Turner
  0 siblings, 1 reply; 13+ messages in thread
From: Jason Baron @ 2011-07-22  0:32 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

rth@redhat.com
Bcc: 
Subject: Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead
 when bandwidth control is inactive
Reply-To: 
In-Reply-To: <20110721184758.403388616@google.com>

On Thu, Jul 21, 2011 at 09:43:42AM -0700, Paul Turner wrote:
> So I'm seeing some strange costs associated with jump_labels; while on paper
> the branches and instructions retired improves (as expected) we're taking an
> unexpected hit in IPC.
> 
> [From the initial mail we have workloads:
>   mkdir -p /cgroup/cpu/test
>   echo $$ > /dev/cgroup/cpu/test (only cpu,cpuacct mounted)
>   (W1) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "for ((i=0;i<5;i++)); do $(dirname $0)/pipe-test 20000; done"
>   (W2)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;true"
>   (W3)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;"
> ]
> 
> To make some of the figures more clear:
> 
> Legend:
> !BWC = tip + bwc, BWC compiled out
> BWC = tip + bwc
> BWC_JL = tip + bwc + jump label (this patch)
> 
> 
> Now, comparing under W1 we see:
> W1: BWC vs BWC_JL
>                             instructions            cycles                  branches              elapsed                
> ---------------------------------------------------------------------------------------------------------------------
> clovertown [BWC]            845934117               974222228               152715407             0.419014188 [baseline]
> +unconstrained              857963815 (+1.42)      1007152750 (+3.38)       153140328 (+0.28)     0.433186926 (+3.38)  [rel]
> +10000000000/1000:          876937753 (+2.55)      1033978705 (+5.65)       160038434 (+3.59)     0.443638365 (+5.66)  [rel]
> +10000000000/1000000:       880276838 (+3.08)      1036176245 (+6.13)       160683878 (+4.15)     0.444577244 (+6.14)  [rel]
> 
> barcelona [BWC]             820573353               748178486               148161233             0.342122850 [baseline] 
> +unconstrained              817011602 (-0.43)       759838181 (+1.56)       145951513 (-1.49)     0.347462571 (+1.56)  [rel]
> +10000000000/1000:          830109086 (+0.26)       770451537 (+1.67)       151228902 (+1.08)     0.350824677 (+1.65)  [rel]
> +10000000000/1000000:       830196206 (+0.30)       770704213 (+2.27)       151250413 (+1.12)     0.350962182 (+2.28)  [rel]
> 
> westmere [BWC]              802533191               694415157               146071233             0.194428018 [baseline]
> +unconstrained              799057936 (-0.43)       751384496 (+8.20)       143875513 (-1.50)     0.211182620 (+8.62)  [rel]
> +10000000000/1000:          812033785 (+0.27)       761469084 (+8.51)       149134146 (+1.09)     0.212149229 (+8.28)  [rel]
> +10000000000/1000000:       811912834 (+0.27)       757842988 (+7.45)       149113291 (+1.09)     0.211364804 (+7.30)  [rel]
> e.g. Barcelona issues ~0.43% less instructions, for a total of 817011602, in
> the unconstrained case with BWC.
> 
> 
> Where "unconstrained, 10000000000/1000, 10000000000/10000" are the on
> measurements for BWC_JL, with (%d) being the relative difference to their
> BWC counterparts.
> 
> W1: BWC vs BWC_JL is very similar.
> 	BWC vs BWC_JL
> clovertown [BWC]            985732031              1283113452               175621212             1.375905653  
> +unconstrained              979242938 (-0.66)      1288971141 (+0.46)       172122546 (-1.99)     1.389795165 (+1.01)  [rel]
> +10000000000/1000:          999886468 (+0.33)      1296597143 (+1.13)       180554004 (+1.62)     1.392576770 (+1.18)  [rel]
> +10000000000/1000000:       999034223 (+0.11)      1293925500 (+0.57)       180413829 (+1.39)     1.391041338 (+0.94)  [rel]
> 
> barcelona [BWC]             982139920              1078757792               175417574             1.069537049  
> +unconstrained              965443672 (-1.70)      1075377223 (-0.31)       170215844 (-2.97)     1.045595065 (-2.24)  [rel]
> +10000000000/1000:          989104943 (+0.05)      1100836668 (+0.52)       178837754 (+1.22)     1.058730316 (-1.77)  [rel]
> +10000000000/1000000:       987627489 (-0.32)      1095843758 (-0.17)       178567411 (+0.84)     1.056100899 (-2.28)  [rel]
> 
> westmere [BWC]              918633403               896047900               166496917             0.754629182  
> +unconstrained              914740541 (-0.42)       903906801 (+0.88)       163652848 (-1.71)     0.758050332 (+0.45)  [rel]
> +10000000000/1000:          927517377 (-0.41)       952579771 (+5.67)       170173060 (+0.75)     0.771193786 (+2.43)  [rel]
> +10000000000/1000000:       914676985 (-0.89)       936106277 (+3.81)       167683288 (+0.22)     0.764973632 (+1.38)  [rel]
> 
> Now this is rather odd, almost across the board we're seeing the expected
> drops in instructions and branches, yet we appear to be paying a heavy IPC
> price.  The fact that wall-time has scaled equivalently with cycles roughly
> rules out the cycles counter being off.
> 
> We are seeing the expected behavior in the bandwidth enabled case;
> specifically the <jl=jmp><ret><cond><ret> blocks are taking an extra branch
> and instruction which shows up on all the numbers above.
> 
> With respect to compiler mangling the text is essentially unchanged in size.
> One lurking suspicion is whether the inserted nops have perturbed some of the
> jmp/branch alignments?
> 
>     text    data     bss     dec     hex filename
>  7277206 2827256 2125824 12230286         ba9e8e vmlinux.jump_label
>  7276886 2826744 2125824 12229454         ba9b4e vmlinux.no_jump_label
>  
>  I have checked to make sure that the right instructions are being patched in
>  at run-time.  I've also pulled a fully patched jump_label out of the kernel
>  into a userspace test (and benchmarked it directly under perf).  The results
>  here are also exactly as expected.
> 
> e.g.
>  Performance counter stats for './jump_test':
>      1,500,839,002 instructions, 300,147,081 branches 702,468,404 cycles
> Performance counter stats for './jump_test 1':
>      2,001,014,609 instructions, 400,177,192 branches 901,758,219 cycles
> 
> Overall if we can fix the IPC the benefit in the globally unconstrained case
> looks really good.
> 
> Any thoughts Jason?
> 

Do you have CONFIG_CC_OPTIMIZE_FOR_SIZE set? I know that when
CONFIG_CC_OPTIMIZE_FOR_SIZE is not set, the compiler can generate
better-optimized code.

thanks,

-Jason


* Re:
  2011-07-22  0:32 Jason Baron
@ 2011-07-22  0:57 ` Paul Turner
  2011-07-22  1:17   ` [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive Jason Baron
  0 siblings, 1 reply; 13+ messages in thread
From: Paul Turner @ 2011-07-22  0:57 UTC (permalink / raw)
  To: Jason Baron
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

On Thu, Jul 21, 2011 at 5:32 PM, Jason Baron <jbaron@redhat.com> wrote:
> rth@redhat.com
> Bcc:
> Subject: Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead
>  when bandwidth control is inactive
> Reply-To:
> In-Reply-To: <20110721184758.403388616@google.com>
>
> On Thu, Jul 21, 2011 at 09:43:42AM -0700, Paul Turner wrote:
>> So I'm seeing some strange costs associated with jump_labels; while on paper
>> the branches and instructions retired improves (as expected) we're taking an
>> unexpected hit in IPC.
>>
>> [From the initial mail we have workloads:
>>   mkdir -p /cgroup/cpu/test
>>   echo $$ > /dev/cgroup/cpu/test (only cpu,cpuacct mounted)
>>   (W1) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "for ((i=0;i<5;i++)); do $(dirname $0)/pipe-test 20000; done"
>>   (W2)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;true"
>>   (W3)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;"
>> ]
>>
>> To make some of the figures more clear:
>>
>> Legend:
>> !BWC = tip + bwc, BWC compiled out
>> BWC = tip + bwc
>> BWC_JL = tip + bwc + jump label (this patch)
>>
>>
>> Now, comparing under W1 we see:
>> W1: BWC vs BWC_JL
>>                             instructions            cycles                  branches              elapsed
>> ---------------------------------------------------------------------------------------------------------------------
>> clovertown [BWC]            845934117               974222228               152715407             0.419014188 [baseline]
>> +unconstrained              857963815 (+1.42)      1007152750 (+3.38)       153140328 (+0.28)     0.433186926 (+3.38)  [rel]
>> +10000000000/1000:          876937753 (+2.55)      1033978705 (+5.65)       160038434 (+3.59)     0.443638365 (+5.66)  [rel]
>> +10000000000/1000000:       880276838 (+3.08)      1036176245 (+6.13)       160683878 (+4.15)     0.444577244 (+6.14)  [rel]
>>
>> barcelona [BWC]             820573353               748178486               148161233             0.342122850 [baseline]
>> +unconstrained              817011602 (-0.43)       759838181 (+1.56)       145951513 (-1.49)     0.347462571 (+1.56)  [rel]
>> +10000000000/1000:          830109086 (+0.26)       770451537 (+1.67)       151228902 (+1.08)     0.350824677 (+1.65)  [rel]
>> +10000000000/1000000:       830196206 (+0.30)       770704213 (+2.27)       151250413 (+1.12)     0.350962182 (+2.28)  [rel]
>>
>> westmere [BWC]              802533191               694415157               146071233             0.194428018 [baseline]
>> +unconstrained              799057936 (-0.43)       751384496 (+8.20)       143875513 (-1.50)     0.211182620 (+8.62)  [rel]
>> +10000000000/1000:          812033785 (+0.27)       761469084 (+8.51)       149134146 (+1.09)     0.212149229 (+8.28)  [rel]
>> +10000000000/1000000:       811912834 (+0.27)       757842988 (+7.45)       149113291 (+1.09)     0.211364804 (+7.30)  [rel]
>> e.g. Barcelona issues ~0.43% less instructions, for a total of 817011602, in
>> the unconstrained case with BWC.
>>
>>
>> Where "unconstrained, 10000000000/1000, 10000000000/10000" are the on
>> measurements for BWC_JL, with (%d) being the relative difference to their
>> BWC counterparts.
>>
>> W1: BWC vs BWC_JL is very similar.
>>       BWC vs BWC_JL
>> clovertown [BWC]            985732031              1283113452               175621212             1.375905653
>> +unconstrained              979242938 (-0.66)      1288971141 (+0.46)       172122546 (-1.99)     1.389795165 (+1.01)  [rel]
>> +10000000000/1000:          999886468 (+0.33)      1296597143 (+1.13)       180554004 (+1.62)     1.392576770 (+1.18)  [rel]
>> +10000000000/1000000:       999034223 (+0.11)      1293925500 (+0.57)       180413829 (+1.39)     1.391041338 (+0.94)  [rel]
>>
>> barcelona [BWC]             982139920              1078757792               175417574             1.069537049
>> +unconstrained              965443672 (-1.70)      1075377223 (-0.31)       170215844 (-2.97)     1.045595065 (-2.24)  [rel]
>> +10000000000/1000:          989104943 (+0.05)      1100836668 (+0.52)       178837754 (+1.22)     1.058730316 (-1.77)  [rel]
>> +10000000000/1000000:       987627489 (-0.32)      1095843758 (-0.17)       178567411 (+0.84)     1.056100899 (-2.28)  [rel]
>>
>> westmere [BWC]              918633403               896047900               166496917             0.754629182
>> +unconstrained              914740541 (-0.42)       903906801 (+0.88)       163652848 (-1.71)     0.758050332 (+0.45)  [rel]
>> +10000000000/1000:          927517377 (-0.41)       952579771 (+5.67)       170173060 (+0.75)     0.771193786 (+2.43)  [rel]
>> +10000000000/1000000:       914676985 (-0.89)       936106277 (+3.81)       167683288 (+0.22)     0.764973632 (+1.38)  [rel]
>>
>> Now this is rather odd, almost across the board we're seeing the expected
>> drops in instructions and branches, yet we appear to be paying a heavy IPC
>> price.  The fact that wall-time has scaled equivalently with cycles roughly
>> rules out the cycles counter being off.
>>
>> We are seeing the expected behavior in the bandwidth enabled case;
>> specifically the <jl=jmp><ret><cond><ret> blocks are taking an extra branch
>> and instruction which shows up on all the numbers above.
>>
>> With respect to compiler mangling the text is essentially unchanged in size.
>> One lurking suspicion is whether the inserted nops have perturbed some of the
>> jmp/branch alignments?
>>
>>     text    data     bss     dec     hex filename
>>  7277206 2827256 2125824 12230286         ba9e8e vmlinux.jump_label
>>  7276886 2826744 2125824 12229454         ba9b4e vmlinux.no_jump_label
>>
>>  I have checked to make sure that the right instructions are being patched in
>>  at run-time.  I've also pulled a fully patched jump_label out of the kernel
>>  into a userspace test (and benchmarked it directly under perf).  The results
>>  here are also exactly as expected.
>>
>> e.g.
>>  Performance counter stats for './jump_test':
>>      1,500,839,002 instructions, 300,147,081 branches 702,468,404 cycles
>> Performance counter stats for './jump_test 1':
>>      2,001,014,609 instructions, 400,177,192 branches 901,758,219 cycles
>>
>> Overall if we can fix the IPC the benefit in the globally unconstrained case
>> looks really good.
>>
>> Any thoughts Jason?
>>
>
> Do you have CONFIG_CC_OPTIMIZE_FOR_SIZE set? I know that when
> CONFIG_CC_OPTIMIZE_FOR_SIZE is not set, the compiler can make the code
> more optimal.
>

Ah I should have mentioned that was one of the holes I stared down:

Builds were -O2 (gcc-4.6.1) and
$  zcat /proc/config.gz | grep CONFIG_CC_OPTIMIZE_FOR_SIZE
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set

Same kernel image across all platforms.






> thanks,
>
> -Jason
>


* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
  2011-07-22  0:57 ` Paul Turner
@ 2011-07-22  1:17   ` Jason Baron
  2011-07-22  1:38     ` Paul Turner
  0 siblings, 1 reply; 13+ messages in thread
From: Jason Baron @ 2011-07-22  1:17 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
	rth

On Thu, Jul 21, 2011 at 05:57:31PM -0700, Paul Turner wrote:
> On Thu, Jul 21, 2011 at 5:32 PM, Jason Baron <jbaron@redhat.com> wrote:
> > rth@redhat.com
> > Bcc:
> > Subject: Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead
> >  when bandwidth control is inactive
> > Reply-To:
> > In-Reply-To: <20110721184758.403388616@google.com>
> >
> > On Thu, Jul 21, 2011 at 09:43:42AM -0700, Paul Turner wrote:
> >> So I'm seeing some strange costs associated with jump_labels; while on paper
> >> the branches and instructions retired improves (as expected) we're taking an
> >> unexpected hit in IPC.
> >>
> >> [From the initial mail we have workloads:
> >>   mkdir -p /cgroup/cpu/test
> >>   echo $$ > /dev/cgroup/cpu/test (only cpu,cpuacct mounted)
> >>   (W1) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "for ((i=0;i<5;i++)); do $(dirname $0)/pipe-test 20000; done"
> >>   (W2)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;true"
> >>   (W3)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;"
> >> ]
> >>
> >> To make some of the figures more clear:
> >>
> >> Legend:
> >> !BWC = tip + bwc, BWC compiled out
> >> BWC = tip + bwc
> >> BWC_JL = tip + bwc + jump label (this patch)
> >>
> >>
> >> Now, comparing under W1 we see:
> >> W1: BWC vs BWC_JL
> >>                             instructions            cycles                  branches              elapsed
> >> ---------------------------------------------------------------------------------------------------------------------
> >> clovertown [BWC]            845934117               974222228               152715407             0.419014188 [baseline]
> >> +unconstrained              857963815 (+1.42)      1007152750 (+3.38)       153140328 (+0.28)     0.433186926 (+3.38)  [rel]
> >> +10000000000/1000:          876937753 (+2.55)      1033978705 (+5.65)       160038434 (+3.59)     0.443638365 (+5.66)  [rel]
> >> +10000000000/1000000:       880276838 (+3.08)      1036176245 (+6.13)       160683878 (+4.15)     0.444577244 (+6.14)  [rel]
> >>
> >> barcelona [BWC]             820573353               748178486               148161233             0.342122850 [baseline]
> >> +unconstrained              817011602 (-0.43)       759838181 (+1.56)       145951513 (-1.49)     0.347462571 (+1.56)  [rel]
> >> +10000000000/1000:          830109086 (+0.26)       770451537 (+1.67)       151228902 (+1.08)     0.350824677 (+1.65)  [rel]
> >> +10000000000/1000000:       830196206 (+0.30)       770704213 (+2.27)       151250413 (+1.12)     0.350962182 (+2.28)  [rel]
> >>
> >> westmere [BWC]              802533191               694415157               146071233             0.194428018 [baseline]
> >> +unconstrained              799057936 (-0.43)       751384496 (+8.20)       143875513 (-1.50)     0.211182620 (+8.62)  [rel]
> >> +10000000000/1000:          812033785 (+0.27)       761469084 (+8.51)       149134146 (+1.09)     0.212149229 (+8.28)  [rel]
> >> +10000000000/1000000:       811912834 (+0.27)       757842988 (+7.45)       149113291 (+1.09)     0.211364804 (+7.30)  [rel]
> >> e.g. Barcelona issues ~0.43% less instructions, for a total of 817011602, in
> >> the unconstrained case with BWC.
> >>
> >>
> >> Where "unconstrained, 10000000000/1000, 10000000000/10000" are the on
> >> measurements for BWC_JL, with (%d) being the relative difference to their
> >> BWC counterparts.
> >>
> >> W1: BWC vs BWC_JL is very similar.
> >>       BWC vs BWC_JL
> >> clovertown [BWC]            985732031              1283113452               175621212             1.375905653
> >> +unconstrained              979242938 (-0.66)      1288971141 (+0.46)       172122546 (-1.99)     1.389795165 (+1.01)  [rel]
> >> +10000000000/1000:          999886468 (+0.33)      1296597143 (+1.13)       180554004 (+1.62)     1.392576770 (+1.18)  [rel]
> >> +10000000000/1000000:       999034223 (+0.11)      1293925500 (+0.57)       180413829 (+1.39)     1.391041338 (+0.94)  [rel]
> >>
> >> barcelona [BWC]             982139920              1078757792               175417574             1.069537049
> >> +unconstrained              965443672 (-1.70)      1075377223 (-0.31)       170215844 (-2.97)     1.045595065 (-2.24)  [rel]
> >> +10000000000/1000:          989104943 (+0.05)      1100836668 (+0.52)       178837754 (+1.22)     1.058730316 (-1.77)  [rel]
> >> +10000000000/1000000:       987627489 (-0.32)      1095843758 (-0.17)       178567411 (+0.84)     1.056100899 (-2.28)  [rel]
> >>
> >> westmere [BWC]              918633403               896047900               166496917             0.754629182
> >> +unconstrained              914740541 (-0.42)       903906801 (+0.88)       163652848 (-1.71)     0.758050332 (+0.45)  [rel]
> >> +10000000000/1000:          927517377 (-0.41)       952579771 (+5.67)       170173060 (+0.75)     0.771193786 (+2.43)  [rel]
> >> +10000000000/1000000:       914676985 (-0.89)       936106277 (+3.81)       167683288 (+0.22)     0.764973632 (+1.38)  [rel]
> >>
> >> Now this is rather odd, almost across the board we're seeing the expected
> >> drops in instructions and branches, yet we appear to be paying a heavy IPC
> >> price.  The fact that wall-time has scaled equivalently with cycles roughly
> >> rules out the cycles counter being off.
> >>

If I understand your results, for Barcelona you did see an improvement
in cycles and elapsed time with jump labels in the unconstrained case?

> >> We are seeing the expected behavior in the bandwidth enabled case;
> >> specifically the <jl=jmp><ret><cond><ret> blocks are taking an extra branch
> >> and instruction which shows up on all the numbers above.
> >>
> >> With respect to compiler mangling the text is essentially unchanged in size.
> >> One lurking suspicion is whether the inserted nops have perturbed some of the
> >> jmp/branch alignments?

Hmmm... not sure. I'm adding Richard Henderson, who worked on 'asm goto'
support in gcc, to the cc list.

> >>
> >>     text    data     bss     dec     hex filename
> >>  7277206 2827256 2125824 12230286         ba9e8e vmlinux.jump_label
> >>  7276886 2826744 2125824 12229454         ba9b4e vmlinux.no_jump_label
> >>

The other thing here is that vmlinux.jump_label includes the extra
kernel/jump_label.o file, so you can roughly subtract that file's text
size to get a fair comparison.

Also, I would have expected the data section to have increased more with
jump labels enabled. Are tracepoints (a current user of jump labels)
disabled?

> >>  I have checked to make sure that the right instructions are being patched in
> >>  at run-time.  I've also pulled a fully patched jump_label out of the kernel
> >>  into a userspace test (and benchmarked it directly under perf).  The results
> >>  here are also exactly as expected.
> >>
> >> e.g.
> >>  Performance counter stats for './jump_test':
> >>      1,500,839,002 instructions, 300,147,081 branches 702,468,404 cycles
> >> Performance counter stats for './jump_test 1':
> >>      2,001,014,609 instructions, 400,177,192 branches 901,758,219 cycles
> >>

What no-op did you use in userspace? I wouldn't think the no-op choice
would make any difference, though... At compile time we use a 'jmp 0', and
then at boot we dynamically patch the 'jmp 0' with the no-op we think works
best.
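
For reference, a disabled-by-default jump-label site looks roughly like
the sketch below at the C level, using the circa-2.6.39/3.0 API names
(struct jump_label_key, static_branch(), jump_label_inc()/jump_label_dec()).
The key and the functions around it are purely illustrative, not the
actual bandwidth-control code:

	/* Illustrative only -- hypothetical key and helpers. */
	#include <linux/jump_label.h>

	static struct jump_label_key bwc_active_key;	/* zero-initialized => disabled */

	static inline bool bwc_active(void)
	{
		/*
		 * With CONFIG_JUMP_LABEL this site is emitted as a 'jmp 0'
		 * and later patched to the arch-preferred nop (branch
		 * disabled) or to a jmp into the out-of-line block
		 * (branch enabled).
		 */
		return static_branch(&bwc_active_key);
	}

	static void bwc_set_active(bool on)
	{
		/* slow path: rewrites the instruction at every site */
		if (on)
			jump_label_inc(&bwc_active_key);
		else
			jump_label_dec(&bwc_active_key);
	}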

thanks,

-Jason

> >> Overall if we can fix the IPC the benefit in the globally unconstrained case
> >> looks really good.
> >>
> >> Any thoughts Jason?
> >>
> >
> > Do you have CONFIG_CC_OPTIMIZE_FOR_SIZE set? I know that when
> > CONFIG_CC_OPTIMIZE_FOR_SIZE is not set, the compiler can make the code
> > more optimal.
> >
> 
> Ah I should have mentioned that was one of the holes I stared down:
> 
> Builds were -O2 (gcc-4.6.1) and
> $  zcat /proc/config.gz | grep CONFIG_CC_OPTIMIZE_FOR_SIZE
> # CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
> 
> Same kernel image across all platforms.
> 
> 
> 
> 
> 
> 
> > thanks,
> >
> > -Jason
> >
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
  2011-07-22  1:17   ` [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive Jason Baron
@ 2011-07-22  1:38     ` Paul Turner
  2011-07-27 21:58       ` Jason Baron
  0 siblings, 1 reply; 13+ messages in thread
From: Paul Turner @ 2011-07-22  1:38 UTC (permalink / raw)
  To: Jason Baron
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
	rth

On Thu, Jul 21, 2011 at 6:17 PM, Jason Baron <jbaron@redhat.com> wrote:
> On Thu, Jul 21, 2011 at 05:57:31PM -0700, Paul Turner wrote:
>> On Thu, Jul 21, 2011 at 5:32 PM, Jason Baron <jbaron@redhat.com> wrote:
>> > rth@redhat.com
>> > Bcc:
>> > Subject: Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead
>> >  when bandwidth control is inactive
>> > Reply-To:
>> > In-Reply-To: <20110721184758.403388616@google.com>
>> >
>> > On Thu, Jul 21, 2011 at 09:43:42AM -0700, Paul Turner wrote:
>> >> So I'm seeing some strange costs associated with jump_labels; while on paper
>> >> the branches and instructions retired improves (as expected) we're taking an
>> >> unexpected hit in IPC.
>> >>
>> >> [From the initial mail we have workloads:
>> >>   mkdir -p /cgroup/cpu/test
>> >>   echo $$ > /dev/cgroup/cpu/test (only cpu,cpuacct mounted)
>> >>   (W1) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "for ((i=0;i<5;i++)); do $(dirname $0)/pipe-test 20000; done"
>> >>   (W2)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;true"
>> >>   (W3)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;"
>> >> ]
>> >>
>> >> To make some of the figures more clear:
>> >>
>> >> Legend:
>> >> !BWC = tip + bwc, BWC compiled out
>> >> BWC = tip + bwc
>> >> BWC_JL = tip + bwc + jump label (this patch)
>> >>
>> >>
>> >> Now, comparing under W1 we see:
>> >> W1: BWC vs BWC_JL
>> >>                             instructions            cycles                  branches              elapsed
>> >> ---------------------------------------------------------------------------------------------------------------------
>> >> clovertown [BWC]            845934117               974222228               152715407             0.419014188 [baseline]
>> >> +unconstrained              857963815 (+1.42)      1007152750 (+3.38)       153140328 (+0.28)     0.433186926 (+3.38)  [rel]
>> >> +10000000000/1000:          876937753 (+2.55)      1033978705 (+5.65)       160038434 (+3.59)     0.443638365 (+5.66)  [rel]
>> >> +10000000000/1000000:       880276838 (+3.08)      1036176245 (+6.13)       160683878 (+4.15)     0.444577244 (+6.14)  [rel]
>> >>
>> >> barcelona [BWC]             820573353               748178486               148161233             0.342122850 [baseline]
>> >> +unconstrained              817011602 (-0.43)       759838181 (+1.56)       145951513 (-1.49)     0.347462571 (+1.56)  [rel]
>> >> +10000000000/1000:          830109086 (+0.26)       770451537 (+1.67)       151228902 (+1.08)     0.350824677 (+1.65)  [rel]
>> >> +10000000000/1000000:       830196206 (+0.30)       770704213 (+2.27)       151250413 (+1.12)     0.350962182 (+2.28)  [rel]
>> >>
>> >> westmere [BWC]              802533191               694415157               146071233             0.194428018 [baseline]
>> >> +unconstrained              799057936 (-0.43)       751384496 (+8.20)       143875513 (-1.50)     0.211182620 (+8.62)  [rel]
>> >> +10000000000/1000:          812033785 (+0.27)       761469084 (+8.51)       149134146 (+1.09)     0.212149229 (+8.28)  [rel]
>> >> +10000000000/1000000:       811912834 (+0.27)       757842988 (+7.45)       149113291 (+1.09)     0.211364804 (+7.30)  [rel]
>> >> e.g. Barcelona issues ~0.43% less instructions, for a total of 817011602, in
>> >> the unconstrained case with BWC.
>> >>
>> >>
>> >> Where "unconstrained, 10000000000/1000, 10000000000/10000" are the on
>> >> measurements for BWC_JL, with (%d) being the relative difference to their
>> >> BWC counterparts.
>> >>
>> >> W1: BWC vs BWC_JL is very similar.
>> >>       BWC vs BWC_JL
>> >> clovertown [BWC]            985732031              1283113452               175621212             1.375905653
>> >> +unconstrained              979242938 (-0.66)      1288971141 (+0.46)       172122546 (-1.99)     1.389795165 (+1.01)  [rel]
>> >> +10000000000/1000:          999886468 (+0.33)      1296597143 (+1.13)       180554004 (+1.62)     1.392576770 (+1.18)  [rel]
>> >> +10000000000/1000000:       999034223 (+0.11)      1293925500 (+0.57)       180413829 (+1.39)     1.391041338 (+0.94)  [rel]
>> >>
>> >> barcelona [BWC]             982139920              1078757792               175417574             1.069537049
>> >> +unconstrained              965443672 (-1.70)      1075377223 (-0.31)       170215844 (-2.97)     1.045595065 (-2.24)  [rel]
>> >> +10000000000/1000:          989104943 (+0.05)      1100836668 (+0.52)       178837754 (+1.22)     1.058730316 (-1.77)  [rel]
>> >> +10000000000/1000000:       987627489 (-0.32)      1095843758 (-0.17)       178567411 (+0.84)     1.056100899 (-2.28)  [rel]
>> >>
>> >> westmere [BWC]              918633403               896047900               166496917             0.754629182
>> >> +unconstrained              914740541 (-0.42)       903906801 (+0.88)       163652848 (-1.71)     0.758050332 (+0.45)  [rel]
>> >> +10000000000/1000:          927517377 (-0.41)       952579771 (+5.67)       170173060 (+0.75)     0.771193786 (+2.43)  [rel]
>> >> +10000000000/1000000:       914676985 (-0.89)       936106277 (+3.81)       167683288 (+0.22)     0.764973632 (+1.38)  [rel]
>> >>
>> >> Now this is rather odd, almost across the board we're seeing the expected
>> >> drops in instructions and branches, yet we appear to be paying a heavy IPC
>> >> price.  The fact that wall-time has scaled equivalently with cycles roughly
>> >> rules out the cycles counter being off.
>> >>
>
> if i understand your results, for barcelona you did see an improvement
> in cycles and eslapsed time with jump labels for unconstrained?
>

Under W2, yes.

>> >> We are seeing the expected behavior in the bandwidth enabled case;
>> >> specifically the <jl=jmp><ret><cond><ret> blocks are taking an extra branch
>> >> and instruction which shows up on all the numbers above.
>> >>
>> >> With respect to compiler mangling the text is essentially unchanged in size.
>> >> One lurking suspicion is whether the inserted nops have perturbed some of the
>> >> jmp/branch alignments?
>
> hmmmm....not sure, I'm adding Richard Henderson, to the 'cc list, who
> worked on the 'asm goto' in gcc.
>
>> >>
>> >>     text    data     bss     dec     hex filename
>> >>  7277206 2827256 2125824 12230286         ba9e8e vmlinux.jump_label
>> >>  7276886 2826744 2125824 12229454         ba9b4e vmlinux.no_jump_label
>> >>
>
> the other thing here is that vmlinux.jump_label includes the extra
> kernel/jump_label.o file, so you can sort of subtract the text size of
> that file to do a fair comparison.

Even without doing that it's only a factor-of-1.00004 (~0.004%) change
in text size.

I was just making the inference that if it's gcc mangling it's likely
in the layout/alignment.

>
> Also, I would have expected the data section to have increased more with
> jump labels enabled. Are tracepoints disabled (a current user of jump
> labels).

Yeah -- Tracing is enabled so the BWC build should have labels
already; this likely accounts for the small increase noted above.

>
>> >>  I have checked to make sure that the right instructions are being patched in
>> >>  at run-time.  I've also pulled a fully patched jump_label out of the kernel
>> >>  into a userspace test (and benchmarked it directly under perf).  The results
>> >>  here are also exactly as expected.
>> >>
>> >> e.g.
>> >>  Performance counter stats for './jump_test':
>> >>      1,500,839,002 instructions, 300,147,081 branches 702,468,404 cycles
>> >> Performance counter stats for './jump_test 1':
>> >>      2,001,014,609 instructions, 400,177,192 branches 901,758,219 cycles
>> >>
>
> what no-op did you use in userspace? I wouldn't think the no-op choice
> would make any difference though...At compile time we use a 'jmp 0', and
> then at boot we dynamically patch the 'jmp 0' with the no-op we think works
> best...
>

Sorry -- what I meant here is that I pulled the run-time chosen "best"
nop out of /proc/kcore and tested a tight loop around a
<JL><RET><COND><RET> sequence (e.g. cfs_rq_throttled()), with JL being
the nop and the jmp respectively.

Specifically for Westmere this ends up being K8_NOP5  -- 0x666666D0
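
A minimal userspace sketch of that kind of loop, for illustration only
(not the actual test program): it times a 5-byte nop against a 5-byte
'jmp +0' under perf stat. The encodings below are just generic x86-64
byte sequences -- 0F 1F 44 00 00 for the nop, E9 00 00 00 00 for a
relative jmp to the next instruction -- not necessarily the nop the
kernel selected at boot:

	/* jl_loop.c -- build: gcc -O2 -o jl_loop jl_loop.c
	 * run:   perf stat -e instructions,cycles,branches ./jl_loop [1]
	 */
	static volatile unsigned long sink;

	int main(int argc, char **argv)
	{
		unsigned long i, iters = 500000000UL;	/* arbitrary count */

		if (argc > 1) {
			/* "enabled" case: 5-byte near jmp with rel32 = 0,
			 * i.e. a jump to the very next instruction */
			for (i = 0; i < iters; i++) {
				asm volatile(".byte 0xe9, 0x00, 0x00, 0x00, 0x00");
				sink += i;
			}
		} else {
			/* "disabled" case: generic 5-byte nop (0F 1F 44 00 00) */
			for (i = 0; i < iters; i++) {
				asm volatile(".byte 0x0f, 0x1f, 0x44, 0x00, 0x00");
				sink += i;
			}
		}
		return 0;
	}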

> thanks,
>
> -Jason
>
>> >> Overall if we can fix the IPC the benefit in the globally unconstrained case
>> >> looks really good.
>> >>
>> >> Any thoughts Jason?
>> >>
>> >
>> > Do you have CONFIG_CC_OPTIMIZE_FOR_SIZE set? I know that when
>> > CONFIG_CC_OPTIMIZE_FOR_SIZE is not set, the compiler can make the code
>> > more optimal.
>> >
>>
>> Ah I should have mentioned that was one of the holes I stared down:
>>
>> Builds were -O2 (gcc-4.6.1) and
>> $  zcat /proc/config.gz | grep CONFIG_CC_OPTIMIZE_FOR_SIZE
>> # CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
>>
>> Same kernel image across all platforms.
>>
>>
>>
>>
>>
>>
>> > thanks,
>> >
>> > -Jason
>> >
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
>


* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
  2011-07-22  1:38     ` Paul Turner
@ 2011-07-27 21:58       ` Jason Baron
  2011-08-05  3:53         ` Paul Turner
                           ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Jason Baron @ 2011-07-27 21:58 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
	rth

On Thu, Jul 21, 2011 at 06:38:01PM -0700, Paul Turner wrote:
> On Thu, Jul 21, 2011 at 6:17 PM, Jason Baron <jbaron@redhat.com> wrote:
> > On Thu, Jul 21, 2011 at 05:57:31PM -0700, Paul Turner wrote:
> >> On Thu, Jul 21, 2011 at 5:32 PM, Jason Baron <jbaron@redhat.com> wrote:
> >> > rth@redhat.com
> >> > Bcc:
> >> > Subject: Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead
> >> >  when bandwidth control is inactive
> >> > Reply-To:
> >> > In-Reply-To: <20110721184758.403388616@google.com>
> >> >
> >> > On Thu, Jul 21, 2011 at 09:43:42AM -0700, Paul Turner wrote:
> >> >> So I'm seeing some strange costs associated with jump_labels; while on paper
> >> >> the branches and instructions retired improves (as expected) we're taking an
> >> >> unexpected hit in IPC.
> >> >>
> >> >> [From the initial mail we have workloads:
> >> >>   mkdir -p /cgroup/cpu/test
> >> >>   echo $$ > /dev/cgroup/cpu/test (only cpu,cpuacct mounted)
> >> >>   (W1) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "for ((i=0;i<5;i++)); do $(dirname $0)/pipe-test 20000; done"
> >> >>   (W2)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;true"
> >> >>   (W3)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;"
> >> >> ]
> >> >>
> >> >> To make some of the figures more clear:
> >> >>
> >> >> Legend:
> >> >> !BWC = tip + bwc, BWC compiled out
> >> >> BWC = tip + bwc
> >> >> BWC_JL = tip + bwc + jump label (this patch)
> >> >>
> >> >>
> >> >> Now, comparing under W1 we see:
> >> >> W1: BWC vs BWC_JL
> >> >>                             instructions            cycles                  branches              elapsed
> >> >> ---------------------------------------------------------------------------------------------------------------------
> >> >> clovertown [BWC]            845934117               974222228               152715407             0.419014188 [baseline]
> >> >> +unconstrained              857963815 (+1.42)      1007152750 (+3.38)       153140328 (+0.28)     0.433186926 (+3.38)  [rel]
> >> >> +10000000000/1000:          876937753 (+2.55)      1033978705 (+5.65)       160038434 (+3.59)     0.443638365 (+5.66)  [rel]
> >> >> +10000000000/1000000:       880276838 (+3.08)      1036176245 (+6.13)       160683878 (+4.15)     0.444577244 (+6.14)  [rel]
> >> >>
> >> >> barcelona [BWC]             820573353               748178486               148161233             0.342122850 [baseline]
> >> >> +unconstrained              817011602 (-0.43)       759838181 (+1.56)       145951513 (-1.49)     0.347462571 (+1.56)  [rel]
> >> >> +10000000000/1000:          830109086 (+0.26)       770451537 (+1.67)       151228902 (+1.08)     0.350824677 (+1.65)  [rel]
> >> >> +10000000000/1000000:       830196206 (+0.30)       770704213 (+2.27)       151250413 (+1.12)     0.350962182 (+2.28)  [rel]
> >> >>
> >> >> westmere [BWC]              802533191               694415157               146071233             0.194428018 [baseline]
> >> >> +unconstrained              799057936 (-0.43)       751384496 (+8.20)       143875513 (-1.50)     0.211182620 (+8.62)  [rel]
> >> >> +10000000000/1000:          812033785 (+0.27)       761469084 (+8.51)       149134146 (+1.09)     0.212149229 (+8.28)  [rel]
> >> >> +10000000000/1000000:       811912834 (+0.27)       757842988 (+7.45)       149113291 (+1.09)     0.211364804 (+7.30)  [rel]
> >> >> e.g. Barcelona issues ~0.43% less instructions, for a total of 817011602, in
> >> >> the unconstrained case with BWC.
> >> >>
> >> >>
> >> >> Where "unconstrained, 10000000000/1000, 10000000000/10000" are the on
> >> >> measurements for BWC_JL, with (%d) being the relative difference to their
> >> >> BWC counterparts.
> >> >>
> >> >> W1: BWC vs BWC_JL is very similar.
> >> >>       BWC vs BWC_JL
> >> >> clovertown [BWC]            985732031              1283113452               175621212             1.375905653
> >> >> +unconstrained              979242938 (-0.66)      1288971141 (+0.46)       172122546 (-1.99)     1.389795165 (+1.01)  [rel]
> >> >> +10000000000/1000:          999886468 (+0.33)      1296597143 (+1.13)       180554004 (+1.62)     1.392576770 (+1.18)  [rel]
> >> >> +10000000000/1000000:       999034223 (+0.11)      1293925500 (+0.57)       180413829 (+1.39)     1.391041338 (+0.94)  [rel]
> >> >>
> >> >> barcelona [BWC]             982139920              1078757792               175417574             1.069537049
> >> >> +unconstrained              965443672 (-1.70)      1075377223 (-0.31)       170215844 (-2.97)     1.045595065 (-2.24)  [rel]
> >> >> +10000000000/1000:          989104943 (+0.05)      1100836668 (+0.52)       178837754 (+1.22)     1.058730316 (-1.77)  [rel]
> >> >> +10000000000/1000000:       987627489 (-0.32)      1095843758 (-0.17)       178567411 (+0.84)     1.056100899 (-2.28)  [rel]
> >> >>
> >> >> westmere [BWC]              918633403               896047900               166496917             0.754629182
> >> >> +unconstrained              914740541 (-0.42)       903906801 (+0.88)       163652848 (-1.71)     0.758050332 (+0.45)  [rel]
> >> >> +10000000000/1000:          927517377 (-0.41)       952579771 (+5.67)       170173060 (+0.75)     0.771193786 (+2.43)  [rel]
> >> >> +10000000000/1000000:       914676985 (-0.89)       936106277 (+3.81)       167683288 (+0.22)     0.764973632 (+1.38)  [rel]
> >> >>
> >> >> Now this is rather odd, almost across the board we're seeing the expected
> >> >> drops in instructions and branches, yet we appear to be paying a heavy IPC
> >> >> price.  The fact that wall-time has scaled equivalently with cycles roughly
> >> >> rules out the cycles counter being off.
> >> >>
> >
> > if i understand your results, for barcelona you did see an improvement
> > in cycles and eslapsed time with jump labels for unconstrained?
> >
> 
> Under W2, yes.
> 
> >> >> We are seeing the expected behavior in the bandwidth enabled case;
> >> >> specifically the <jl=jmp><ret><cond><ret> blocks are taking an extra branch
> >> >> and instruction which shows up on all the numbers above.
> >> >>
> >> >> With respect to compiler mangling the text is essentially unchanged in size.
> >> >> One lurking suspicion is whether the inserted nops have perturbed some of the
> >> >> jmp/branch alignments?
> >
> > hmmmm....not sure, I'm adding Richard Henderson, to the 'cc list, who
> > worked on the 'asm goto' in gcc.
> >
> >> >>
> >> >>     text    data     bss     dec     hex filename
> >> >>  7277206 2827256 2125824 12230286         ba9e8e vmlinux.jump_label
> >> >>  7276886 2826744 2125824 12229454         ba9b4e vmlinux.no_jump_label
> >> >>
> >
> > the other thing here is that vmlinux.jump_label includes the extra
> > kernel/jump_label.o file, so you can sort of subtract the text size of
> > that file to do a fair comparison.
> 
> Even without doing that it's only a 1.00004% change in text size.
> 
> I was just making the inference that if it's gcc mangling it's likely
> in the layout/alignment.
> 
> >
> > Also, I would have expected the data section to have increased more with
> > jump labels enabled. Are tracepoints disabled (a current user of jump
> > labels).
> 
> Yeah -- Tracing is enabled so the BWC build should have labels
> already; this likely accounts for the small increase noted above.
> 
> >
> >> >>  I have checked to make sure that the right instructions are being patched in
> >> >>  at run-time.  I've also pulled a fully patched jump_label out of the kernel
> >> >>  into a userspace test (and benchmarked it directly under perf).  The results
> >> >>  here are also exactly as expected.
> >> >>
> >> >> e.g.
> >> >>  Performance counter stats for './jump_test':
> >> >>      1,500,839,002 instructions, 300,147,081 branches 702,468,404 cycles
> >> >> Performance counter stats for './jump_test 1':
> >> >>      2,001,014,609 instructions, 400,177,192 branches 901,758,219 cycles
> >> >>
> >
> > what no-op did you use in userspace? I wouldn't think the no-op choice
> > would make any difference though...At compile time we use a 'jmp 0', and
> > then at boot we dynamically patch the 'jmp 0' with the no-op we think works
> > best...
> >
> 
> Sorry -- what I meant here is I pulled the run-time chosen "best" nop
> out of /proc/kcore and tested a
> tight loop about a <JL><RET><COND><RET> sequence (e.g.
> cfs_rq_throttled()) with JL being the nop and jmp respectively.
> 
> Specifically for Westmere this ends up being K8_NOP5  -- 0x666666D0
> 
> > thanks,
> >
> > -Jason
> >
> >> >> Overall if we can fix the IPC the benefit in the globally unconstrained case
> >> >> looks really good.
> >> >>
> >> >> Any thoughts Jason?
> >> >>
> >> >
> >> > Do you have CONFIG_CC_OPTIMIZE_FOR_SIZE set? I know that when
> >> > CONFIG_CC_OPTIMIZE_FOR_SIZE is not set, the compiler can make the code
> >> > more optimal.
> >> >
> >>
> >> Ah I should have mentioned that was one of the holes I stared down:
> >>
> >> Builds were -O2 (gcc-4.6.1) and
> >> $  zcat /proc/config.gz | grep CONFIG_CC_OPTIMIZE_FOR_SIZE
> >> # CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
> >>
> >> Same kernel image across all platforms.
> >>
> >>

Hi Paul,

Ok, I think I finally tracked this down. It may seem a bit crazy, but
when we are getting down to cycle counting like this, it seems that the
link order in kernel/Makefile can make a difference. I had jump_label.o
listed in with the core files, whereas all the code in jump_label.o is
really slow-path code (used only when toggling branch values). The
following moves it out:


--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -10,7 +10,7 @@ obj-y     = sched.o fork.o exec_domain.o panic.o printk.o \
 	    kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
 	    hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
 	    notifier.o ksysfs.o pm_qos_params.o sched_clock.o cred.o \
-	    async.o range.o jump_label.o
+	    async.o range.o
 obj-y += groups.o
 
 ifdef CONFIG_FUNCTION_TRACER
@@ -107,6 +107,7 @@ obj-$(CONFIG_PERF_EVENTS) += events/
 obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
 obj-$(CONFIG_PADATA) += padata.o
 obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
+obj-$(CONFIG_JUMP_LABEL) += jump_label.o
 
 ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is


I've tested the patch using a single 'static_branch()' in the getppid()
path, basically running tight loops of calls to getppid() (a sketch of
such a loop follows the numbers below). Before the patch I was seeing
results similar to what you reported; after the patch, things improved
for all metrics. Here are my results for the branch-disabled case:

With jump labels turned on (CONFIG_JUMP_LABEL), branch disabled:

 Performance counter stats for 'bash -c /tmp/timing;true' (50 runs):

     3,969,510,217 instructions             #      0.864 IPC     ( +-0.000% )
     4,592,334,954 cycles                     ( +-   0.046% )
       751,634,470 branches                   ( +-   0.000% )

        1.722635797  seconds time elapsed   ( +-   0.046% )

Jump labels turned off (CONFIG_JUMP_LABEL not set), branch disabled:

 Performance counter stats for 'bash -c /tmp/timing;true' (50 runs):

     4,009,611,846 instructions             #      0.867 IPC     ( +-0.000% )
     4,622,210,580 cycles                     ( +-   0.012% )
       771,662,904 branches                   ( +-   0.000% )

        1.734341454  seconds time elapsed   ( +-   0.022% )


So all of the measured metrics improved in the jump-label case, by
between 0.5% and 2.5%.
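
The /tmp/timing loop described above might look roughly like the
following; this is an illustrative reconstruction, not the actual test
program, and the iteration count is arbitrary:

	/* timing.c -- hammer getppid() so the (disabled) static branch
	 * placed in its path is executed many times.
	 * Build: gcc -O2 -o /tmp/timing timing.c
	 * Run:   perf stat --repeat 50 bash -c '/tmp/timing;true'
	 */
	#include <unistd.h>

	int main(void)
	{
		unsigned long i;

		for (i = 0; i < 100000000UL; i++)
			getppid();
		return 0;
	}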

I'm curious to see what you find with this patch.

Thanks,

-Jason




* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
  2011-07-27 21:58       ` Jason Baron
@ 2011-08-05  3:53         ` Paul Turner
  2011-08-05  7:21           ` Peter Zijlstra
  2011-08-05  3:55         ` Paul Turner
  2011-08-05  8:30         ` Peter Zijlstra
  2 siblings, 1 reply; 13+ messages in thread
From: Paul Turner @ 2011-08-05  3:53 UTC (permalink / raw)
  To: Jason Baron
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
	rth

< snip>

>
> Hi Paul,
>
> Ok, I think I finally tracked this down. It may seem a bit crazy, but
> when we are getting down to cycle counting like this, it seems that the
> link order in the kernel/Makefile can make difference. I had the
> jump_label.o listed after the core files, whereas all the code in
> jump_label.o is really slow path code (used when toggling branch
> values). As follows:
>
>
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -10,7 +10,7 @@ obj-y     = sched.o fork.o exec_domain.o panic.o printk.o \
>            kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
>            hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
>            notifier.o ksysfs.o pm_qos_params.o sched_clock.o cred.o \
> -           async.o range.o jump_label.o
> +           async.o range.o
>  obj-y += groups.o
>
>  ifdef CONFIG_FUNCTION_TRACER
> @@ -107,6 +107,7 @@ obj-$(CONFIG_PERF_EVENTS) += events/
>  obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
>  obj-$(CONFIG_PADATA) += padata.o
>  obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
> +obj-$(CONFIG_JUMP_LABEL) += jump_label.o
>
>  ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
>  # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
>
>
> I've tested the patch using a single 'static_branch()' in the getppid() path,
> and basically running tight loops of calls to getppid(). Before, the
> patch, I was seeing results similar to what you reported, after the
> patch, things improved for all metrics. Here are my results for the
> branch disabled case:
>
> With jump labels turned on (CONFIG_JUMP_LABEL), branch disabled:
>
>  Performance counter stats for 'bash -c /tmp/timing;true' (50 runs):
>
>     3,969,510,217 instructions             #      0.864 IPC     ( +-0.000% )
>     4,592,334,954 cycles                     ( +-   0.046% )
>       751,634,470 branches                   ( +-   0.000% )
>
>        1.722635797  seconds time elapsed   ( +-   0.046% )
>
> Jump labels turned off (CONFIG_JUMP_LABEL not set), branch disabled:
>
>  Performance counter stats for 'bash -c /tmp/timing;true' (50 runs):
>
>     4,009,611,846 instructions             #      0.867 IPC     ( +-0.000% )
>     4,622,210,580 cycles                     ( +-   0.012% )
>       771,662,904 branches                   ( +-   0.000% )
>
>        1.734341454  seconds time elapsed   ( +-   0.022% )
>
>
> So all of the measured metrics improved in the jump labels case b/w
> 0.5% - 2.5%.
>
> I'm curious to see what you find with this patch.
>
> Thanks,
>
> -Jason
>

Hi Jason,

Thanks for taking a look at this.  Sorry, this took a few days to
benchmark all the permutations and we had some issues with internal
proxies which interrupted benchmarking runs.

Results and some analysis follow.

[
Key:

npo_XXX = with CONFIG_JUMP_LABEL, without link order patch (no patched order)
po_XXX = with CONFIG_JUMP_LABEL, with link order patch (patched order)
nojl_XXX = without CONFIG_JUMP_LABEL

Where "XXX" is
head: tip (c5bafb3) without patch series
cfs: tip + patch series - jump_label patch
cfs_jl: tip + patch series + jump_label for unconstrained

Each test was repeated 3 times; each run was 50 repeats, with typically
~<0.1 in-test variance on the reported output
]

Considering just jump labels in tip, comparing against HEAD w/
!CONFIG_JUMP_LABEL

                            instructions            cycles                  branches              elapsed
---------------------------------------------------------------------------------------------------------------------
	Westmere:
njl_head.1                  798832892               722624737               145375836             0.203218936 [baseline]
njl_head.2                  798888783 (+0.01)       746118188 (+3.25)       145386807 (+0.01)     0.208573683 (-2.18)
njl_head.3                  798864253 (+0.00)       731537139 (+1.23)       145382747 (+0.00)     0.204098175 (-4.28)
npo_head.1                  797033521 (-0.23)       731239359 (+1.19)       144571358 (-0.55)     0.206910496 (-2.96)
npo_head.2                  797166434 (-0.21)       728926020 (+0.87)       144603465 (-0.53)     0.202906392 (-4.84)
npo_head.3                  797165370 (-0.21)       725930458 (+0.46)       144603438 (-0.53)     0.202118274 (-5.21)
po_head.1                   797019904 (-0.23)       699008145 (-3.27)       144567652 (-0.56)     0.197272615 (-7.48)
po_head.2                   797037682 (-0.22)       705732419 (-2.34)       144572115 (-0.55)     0.197101692 (-7.56)
po_head.3                   797079804 (-0.22)       698007668 (-3.41)       144580964 (-0.55)     0.194871253 (-8.61)

	Barcelona:
njl_head.1                  816842028               748362637               147462095             0.341654152
njl_head.2                  816849735 (+0.00)       748480742 (+0.02)       147462652 (+0.00)     0.341450734 (-2.90)
njl_head.3                  816834963 (-0.00)       747083797 (-0.17)       147460200 (-0.00)     0.340802353 (-3.09)
npo_head.1                  815068563 (-0.22)       775012690 (+3.56)       146661357 (-0.54)     0.353797321 (+0.61)
npo_head.2                  815033261 (-0.22)       759613364 (+1.50)       146654106 (-0.55)     0.346462671 (-1.48)
npo_head.3                  815029611 (-0.22)       762660196 (+1.91)       146654169 (-0.55)     0.347565129 (-1.16)
po_head.1                   815026489 (-0.22)       767229109 (+2.52)       146653376 (-0.55)     0.350241833 (-0.40)
po_head.2                   815035127 (-0.22)       770224495 (+2.92)       146654019 (-0.55)     0.351352092 (-0.09)
po_head.3                   815109904 (-0.21)       774954096 (+3.55)       146662020 (-0.54)     0.353505054 (+0.53)



With the patch to fix the link order we're typically faster, and it's
probably time to adjust the configs so we get CONFIG_JUMP_LABEL by
default when CC_HAS_ASM_GOTO.

Considering Bandwidth control, comparing vs HEAD w/ CONFIG_JUMP_LABEL:

                            instructions            cycles                  branches              elapsed
---------------------------------------------------------------------------------------------------------------------
	Westmere:
po_head.1                   797019904               699008145               144567652             0.197272615 [Baseline]
po_head.2                   797037682 (+0.00)       705732419 (+0.96)       144572115 (+0.00)     0.197101692 (-4.91)
po_head.3                   797079804 (+0.01)       698007668 (-0.14)       144580964 (+0.01)     0.194871253 (-5.98)
njl_cfs.1                   802649718 (+0.71)       708143552 (+1.31)       146577437 (+1.39)     0.198770168 (-4.10)
njl_cfs.2                   802679078 (+0.71)       707486608 (+1.21)       146582628 (+1.39)     0.197890812 (-4.53)
njl_cfs.3                   802647500 (+0.71)       704770712 (+0.82)       146578141 (+1.39)     0.196742304 (-5.08)
npo_cfs.1                   800661523 (+0.46)       724068093 (+3.59)       145774786 (+0.83)     0.204632700 (-1.27)
npo_cfs.2                   800646997 (+0.46)       718884486 (+2.84)       145772293 (+0.83)     0.201248482 (-2.91)
npo_cfs.3                   800783171 (+0.47)       725140326 (+3.74)       145804350 (+0.86)     0.203266025 (-1.93)
npo_cfs_jl.1                797304605 (+0.04)       687741762 (-1.61)       143666256 (-0.62)     0.194302293 (-6.26)
npo_cfs_jl.2                797446281 (+0.05)       694066715 (-0.71)       143700065 (-0.60)     0.194212118 (-6.30)
npo_cfs_jl.3                797374495 (+0.04)       697561774 (-0.21)       143682692 (-0.61)     0.194935111 (-5.95)
po_cfs.1                    800631004 (+0.45)       715819643 (+2.41)       145769677 (+0.83)     0.200007036 (-3.51)
po_cfs.2                    800642622 (+0.45)       698569729 (-0.06)       145769973 (+0.83)     0.194625680 (-6.10)
po_cfs.3                    800752778 (+0.47)       707282749 (+1.18)       145798992 (+0.85)     0.197047366 (-4.93)
po_cfs_jl.1                 797306617 (+0.04)       686329256 (-1.81)       143666659 (-0.62)     0.193107369 (-6.83)
po_cfs_jl.2                 797434478 (+0.05)       677865445 (-3.02)       143697712 (-0.60)     0.189314824 (-8.66)
po_cfs_jl.3                 797299055 (+0.04)       686371679 (-1.81)       143665758 (-0.62)     0.191859014 (-7.44)

	Barcelona:
po_head.1                   815026489               767229109               146653376             0.350241833 [Baseline]
po_head.2                   815035127 (+0.00)       770224495 (+0.39)       146654019 (+0.00)     0.351352092 (-2.47)
po_head.3                   815109904 (+0.01)       774954096 (+1.01)       146662020 (+0.01)     0.353505054 (-1.87)
njl_cfs.1                   820647075 (+0.69)       756895773 (-1.35)       148663929 (+1.37)     0.345563962 (-4.07)
njl_cfs.2                   820672501 (+0.69)       761520373 (-0.74)       148667815 (+1.37)     0.347529253 (-3.53)
njl_cfs.3                   820664350 (+0.69)       763400895 (-0.50)       148666126 (+1.37)     0.348337223 (-3.30)
npo_cfs.1                   818629349 (+0.44)       758306455 (-1.16)       147854452 (+0.82)     0.346678486 (-3.77)
npo_cfs.2                   818829256 (+0.47)       768393448 (+0.15)       147891099 (+0.84)     0.350678075 (-2.65)
npo_cfs.3                   818697806 (+0.45)       772218715 (+0.65)       147866720 (+0.83)     0.352333672 (-2.20)
npo_cfs_jl.1                815343935 (+0.04)       760127157 (-0.93)       145753233 (-0.61)     0.347184970 (-3.62)
npo_cfs_jl.2                815415786 (+0.05)       775772068 (+1.11)       145762961 (-0.61)     0.353965833 (-1.74)
npo_cfs_jl.3                815403187 (+0.05)       764048918 (-0.41)       145761012 (-0.61)     0.348619922 (-3.23)
po_cfs.1                    819204964 (+0.51)       767156385 (-0.01)       147959727 (+0.89)     0.350737982 (-2.64)
po_cfs.2                    818665676 (+0.45)       764324366 (-0.38)       147860788 (+0.82)     0.348814489 (-3.17)
po_cfs.3                    818661849 (+0.45)       752288492 (-1.95)       147859717 (+0.82)     0.343294319 (-4.70)
po_cfs_jl.1                 815336908 (+0.04)       765760248 (-0.19)       145755155 (-0.61)     0.349608614 (-2.95)
po_cfs_jl.2                 815322295 (+0.04)       765613685 (-0.21)       145751972 (-0.61)     0.349321663 (-3.03)
po_cfs_jl.3                 815310833 (+0.03)       759647967 (-0.99)       145750118 (-0.62)     0.346607639 (-3.78)

Thanks to the magic of compiler re-organization we now see zero
overhead; in fact, a speed-up is realized.

I will re-post v7.3 with:
- rebase to minor changes in tip
- removing RFT from adding jump_labels to CFS
- additional hierarchical period constraint

Thanks for looking into this Jason!

- Paul


* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
  2011-07-27 21:58       ` Jason Baron
  2011-08-05  3:53         ` Paul Turner
@ 2011-08-05  3:55         ` Paul Turner
  2011-08-05 18:28           ` Jason Baron
  2011-08-05  8:30         ` Peter Zijlstra
  2 siblings, 1 reply; 13+ messages in thread
From: Paul Turner @ 2011-08-05  3:55 UTC (permalink / raw)
  To: Jason Baron
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
	rth

> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -10,7 +10,7 @@ obj-y     = sched.o fork.o exec_domain.o panic.o printk.o \
>            kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
>            hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
>            notifier.o ksysfs.o pm_qos_params.o sched_clock.o cred.o \
> -           async.o range.o jump_label.o
> +           async.o range.o
>  obj-y += groups.o
>
>  ifdef CONFIG_FUNCTION_TRACER
> @@ -107,6 +107,7 @@ obj-$(CONFIG_PERF_EVENTS) += events/
>  obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
>  obj-$(CONFIG_PADATA) += padata.o
>  obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
> +obj-$(CONFIG_JUMP_LABEL) += jump_label.o
>
>  ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
>  # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
>

Tested-by: Paul Turner <pjt@google.com>

Let me know if you need any result tables for the actual commit msg.
The same goes for making CONFIG_JUMP_LABEL default to y in the
CC_HAS_ASM_GOTO case (at least on x86, anyway).
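
For anyone who wants to reproduce the comparison, here is a rough sketch
of the kind of test described below: a single static_branch() dropped
into the getppid() path, plus a userspace binary hammering getppid() in a
tight loop. The key name, counter and loop count are made up for
illustration; they are not from the actual test patch.

/* kernel side: guard a trivial bit of work in sys_getppid() */
#include <linux/jump_label.h>

static struct jump_label_key test_key;	/* starts disabled */
static unsigned long test_hits;

SYSCALL_DEFINE0(getppid)
{
	int pid;

	if (static_branch(&test_key))	/* a single nop while disabled */
		test_hits++;

	rcu_read_lock();
	pid = task_tgid_vnr(rcu_dereference(current->real_parent));
	rcu_read_unlock();

	return pid;
}

/* userspace side: stand-in for the timing binary */
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
	long i;

	for (i = 0; i < 100000000L; i++)
		syscall(SYS_getppid);	/* force a real syscall each pass */
	return 0;
}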


>
> I've tested the patch using a single 'static_branch()' in the getppid() path,
> and basically running tight loops of calls to getppid(). Before the patch,
> I was seeing results similar to what you reported; after the patch, things
> improved for all metrics. Here are my results for the branch-disabled case:
>
> With jump labels turned on (CONFIG_JUMP_LABEL), branch disabled:
>
>  Performance counter stats for 'bash -c /tmp/timing;true' (50 runs):
>
>     3,969,510,217 instructions             #      0.864 IPC     ( +-0.000% )
>     4,592,334,954 cycles                     ( +-   0.046% )
>       751,634,470 branches                   ( +-   0.000% )
>
>        1.722635797  seconds time elapsed   ( +-   0.046% )
>
> Jump labels turned off (CONFIG_JUMP_LABEL not set), branch disabled:
>
>  Performance counter stats for 'bash -c /tmp/timing;true' (50 runs):
>
>     4,009,611,846 instructions             #      0.867 IPC     ( +-0.000% )
>     4,622,210,580 cycles                     ( +-   0.012% )
>       771,662,904 branches                   ( +-   0.000% )
>
>        1.734341454  seconds time elapsed   ( +-   0.022% )
>
>
> So all of the measured metrics improved in the jump labels case by
> between 0.5% and 2.5%.
>
> I'm curious to see what you find with this patch.
>
> Thanks,
>
> -Jason
>
>
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
  2011-08-05  3:53         ` Paul Turner
@ 2011-08-05  7:21           ` Peter Zijlstra
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2011-08-05  7:21 UTC (permalink / raw)
  To: Paul Turner
  Cc: Jason Baron, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
	rth

On Thu, 2011-08-04 at 20:53 -0700, Paul Turner wrote:
> 
> I will re-post v7.3 with:
> - a rebase onto the minor changes in tip
> - the RFT tag dropped from the jump-label patch for CFS
> - an additional hierarchical period constraint

Could you rebase to -tip + my patches? Most of your previous set is
already queued there. The reason it's not in -tip yet is that the merge
window fallout still has -tip in a somewhat shaky state.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
  2011-07-27 21:58       ` Jason Baron
  2011-08-05  3:53         ` Paul Turner
  2011-08-05  3:55         ` Paul Turner
@ 2011-08-05  8:30         ` Peter Zijlstra
  2011-08-05 15:11           ` Richard Henderson
  2 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2011-08-05  8:30 UTC (permalink / raw)
  To: Jason Baron
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
	rth

On Wed, 2011-07-27 at 17:58 -0400, Jason Baron wrote:
> Ok, I think I finally tracked this down. It may seem a bit crazy, but
> when we are getting down to cycle counting like this, it seems that the
> link order in the kernel/Makefile can make a difference. I had the
> jump_label.o listed after the core files, whereas all the code in
> jump_label.o is really slow path code (used when toggling branch
> values). As follows:
> 
> 
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -10,7 +10,7 @@ obj-y     = sched.o fork.o exec_domain.o panic.o printk.o \
>             kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
>             hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
>             notifier.o ksysfs.o pm_qos_params.o sched_clock.o cred.o \
> -           async.o range.o jump_label.o
> +           async.o range.o
>  obj-y += groups.o
>  
>  ifdef CONFIG_FUNCTION_TRACER
> @@ -107,6 +107,7 @@ obj-$(CONFIG_PERF_EVENTS) += events/
>  obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
>  obj-$(CONFIG_PADATA) += padata.o
>  obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
> +obj-$(CONFIG_JUMP_LABEL) += jump_label.o 


OK, so _WHY_ does that make a difference and will a next version of
gnu-binutils not mess that up?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
  2011-08-05  8:30         ` Peter Zijlstra
@ 2011-08-05 15:11           ` Richard Henderson
  2011-08-05 15:14             ` Peter Zijlstra
  2011-08-05 15:24             ` Jason Baron
  0 siblings, 2 replies; 13+ messages in thread
From: Richard Henderson @ 2011-08-05 15:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jason Baron, Paul Turner, linux-kernel, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Hidetoshi Seto,
	Ingo Molnar, Pavel Emelyanov

On 08/05/2011 01:30 AM, Peter Zijlstra wrote:
> OK, so _WHY_ does that make a difference and will a next version of
> gnu-binutils not mess that up?

The Why is micro-architectural, and I can't answer that.

But ld will never re-order the files as given on the command-line.
There are too many functions and tables that are constructed 
piece-wise from input sections; re-ordering them would change
the semantics of the program.


r~

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
  2011-08-05 15:11           ` Richard Henderson
@ 2011-08-05 15:14             ` Peter Zijlstra
  2011-08-05 15:24             ` Jason Baron
  1 sibling, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2011-08-05 15:14 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Jason Baron, Paul Turner, linux-kernel, Bharata B Rao,
	Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Hidetoshi Seto,
	Ingo Molnar, Pavel Emelyanov

On Fri, 2011-08-05 at 08:11 -0700, Richard Henderson wrote:
> On 08/05/2011 01:30 AM, Peter Zijlstra wrote:
> > OK, so _WHY_ does that make a difference and will a next version of
> > gnu-binutils not mess that up?
> 
> The Why is micro-architectural, and I can't answer that.
> 
> But ld will never re-order the files as given on the command-line.
> There are too many functions and tables that are constructed 
> piece-wise from input sections; re-ordering them would change
> the semantics of the program.

Right, so I was wondering about things like whole-program-optimization
passes at link time. Since I've no clue why the proposed patch does what
it does, it's hard to say what invariant needs to be kept.



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
  2011-08-05 15:11           ` Richard Henderson
  2011-08-05 15:14             ` Peter Zijlstra
@ 2011-08-05 15:24             ` Jason Baron
  1 sibling, 0 replies; 13+ messages in thread
From: Jason Baron @ 2011-08-05 15:24 UTC (permalink / raw)
  To: Richard Henderson, a.p.zijlstra
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

On Fri, Aug 05, 2011 at 08:11:15AM -0700, Richard Henderson wrote:
> On 08/05/2011 01:30 AM, Peter Zijlstra wrote:
> > OK, so _WHY_ does that make a difference and will a next version of
> > gnu-binutils not mess that up?
> 
> The Why is micro-architectural, and I can't answer that.

In tracking this down, I eventually found that just having the
jump_label.o file compiled into the kernel, but not actually using
static_branch() or 'asm goto' anywhere, led to a performance hit.
Thus, neither the compiler nor the 'asm goto' itself was actually
causing any degradation.

Since the jump_label.o file is only slow-path code, it can be moved away
from core or heavily-called kernel routines. I suspect this is probably
an icache issue, but I can't say for sure.
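
To make the split concrete: the call sites only ever contain the inlined
static_branch() test, while everything that actually patches them
(jump_label_inc()/jump_label_dec() and the arch code they drive) lives in
kernel/jump_label.o and only runs when a key is toggled. Roughly what the
CFS usage looks like, as a sketch of the pattern rather than Paul's
literal patch:

#include <linux/jump_label.h>

static struct jump_label_key __cfs_bandwidth_used;

/* fast path: one nop per call site while bandwidth control is off */
static inline int cfs_bandwidth_used(void)
{
	return static_branch(&__cfs_bandwidth_used);
}

/*
 * slow path: runs only when a group's quota is (un)configured;
 * the key-toggling machinery it calls is what lives in jump_label.o
 */
static void account_cfs_bandwidth_used(int enabled, int was_enabled)
{
	if (enabled && !was_enabled)
		jump_label_inc(&__cfs_bandwidth_used);
	else if (!enabled && was_enabled)
		jump_label_dec(&__cfs_bandwidth_used);
}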

Thanks,

-Jason

> 
> But ld will never re-order the files as given on the command-line.
> There are too many functions and tables that are constructed 
> piece-wise from input sections; re-ordering them would change
> the semantics of the program.
> 
> 
> r~

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
  2011-08-05  3:55         ` Paul Turner
@ 2011-08-05 18:28           ` Jason Baron
  0 siblings, 0 replies; 13+ messages in thread
From: Jason Baron @ 2011-08-05 18:28 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
	rth, rostedt

On Thu, Aug 04, 2011 at 08:55:08PM -0700, Paul Turner wrote:
> > --- a/kernel/Makefile
> > +++ b/kernel/Makefile
> > @@ -10,7 +10,7 @@ obj-y     = sched.o fork.o exec_domain.o panic.o printk.o \
> >            kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
> >            hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
> >            notifier.o ksysfs.o pm_qos_params.o sched_clock.o cred.o \
> > -           async.o range.o jump_label.o
> > +           async.o range.o
> >  obj-y += groups.o
> >
> >  ifdef CONFIG_FUNCTION_TRACER
> > @@ -107,6 +107,7 @@ obj-$(CONFIG_PERF_EVENTS) += events/
> >  obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
> >  obj-$(CONFIG_PADATA) += padata.o
> >  obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
> > +obj-$(CONFIG_JUMP_LABEL) += jump_label.o
> >
> >  ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
> >  # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
> >
> 
> Tested-by: Paul Turner <pjt@google.com>
> 
> Let me know if you need any result tables for the actual commit msg.

Hi Paul,

Thanks for taking the time test this :) I'll post the patch shortly
with my own testing results. Hopefully, it can still be considered for
3.1 b/c of the non-invasive nature of the patch...

> The same goes for making CONFIG_JUMP_LABEL default to y in the
> CC_HAS_ASM_GOTO case (at least on x86, anyway).
> 

I originally had CONFIG_JUMP_LABEL implicitly turned on, but we ran into
a 32-bit compiler issue that was causing random, nasty crashes. That
issue has since been resolved in gcc, but we might need to update the
CC_HAS_ASM_GOTO check to deal with that case better. Currently, we're
using the '-maccumulate-outgoing-args' gcc option to work around the
issue for 32-bit x86 (see: arch/x86/Makefile_32.cpu).
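
For anyone who hasn't looked at the construct in question: static_branch()
is built on gcc's 'asm goto'. A simplified x86-64 sketch of the idea (not
the kernel's literal implementation) looks like this; the straight-line
path is a 5-byte nop, and enabling the key live-patches that nop into a
jump to the l_yes label:

static __always_inline bool my_static_branch(struct jump_label_key *key)
{
	asm goto("1: .byte 0x0f, 0x1f, 0x44, 0x00, 0x00\n\t"	/* 5-byte nop */
		 ".pushsection __jump_table, \"aw\"\n\t"
		 ".balign 8\n\t"
		 ".quad 1b, %l[l_yes], %c0\n\t"		/* site, target, key */
		 ".popsection\n\t"
		 : : "i" (key) : : l_yes);
	return false;
l_yes:
	return true;
}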

With the jump label interface somewhat stabilizing (I say somewhat because
Peter brought up a good use case in the scheduler that it currently doesn't
address, but which we should be able to support without too much churn) and
these testing results, I think it might make sense to consider turning it on
by default for 3.2. Thoughts?

Thanks,

-Jason


> 
> >
> > I've tested the patch using a single 'static_branch()' in the getppid() path,
> > and basically running tight loops of calls to getppid(). Before the patch,
> > I was seeing results similar to what you reported; after the patch, things
> > improved for all metrics. Here are my results for the branch-disabled case:
> >
> > With jump labels turned on (CONFIG_JUMP_LABEL), branch disabled:
> >
> >  Performance counter stats for 'bash -c /tmp/timing;true' (50 runs):
> >
> >     3,969,510,217 instructions             #      0.864 IPC     ( +-0.000% )
> >     4,592,334,954 cycles                     ( +-   0.046% )
> >       751,634,470 branches                   ( +-   0.000% )
> >
> >        1.722635797  seconds time elapsed   ( +-   0.046% )
> >
> > Jump labels turned off (CONFIG_JUMP_LABEL not set), branch disabled:
> >
> >  Performance counter stats for 'bash -c /tmp/timing;true' (50 runs):
> >
> >     4,009,611,846 instructions             #      0.867 IPC     ( +-0.000% )
> >     4,622,210,580 cycles                     ( +-   0.012% )
> >       771,662,904 branches                   ( +-   0.000% )
> >
> >        1.734341454  seconds time elapsed   ( +-   0.022% )
> >
> >
> > So all of the measured metrics improved in the jump labels case by
> > between 0.5% and 2.5%.
> >
> > I'm curious to see what you find with this patch.
> >
> > Thanks,
> >
> > -Jason
> >
> >
> >

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2011-08-05 18:29 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-22  0:32 Jason Baron
2011-07-22  0:57 ` Paul Turner
2011-07-22  1:17   ` [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive Jason Baron
2011-07-22  1:38     ` Paul Turner
2011-07-27 21:58       ` Jason Baron
2011-08-05  3:53         ` Paul Turner
2011-08-05  7:21           ` Peter Zijlstra
2011-08-05  3:55         ` Paul Turner
2011-08-05 18:28           ` Jason Baron
2011-08-05  8:30         ` Peter Zijlstra
2011-08-05 15:11           ` Richard Henderson
2011-08-05 15:14             ` Peter Zijlstra
2011-08-05 15:24             ` Jason Baron
