* (no subject)
@ 2011-07-22 0:32 Jason Baron
2011-07-22 0:57 ` Paul Turner
0 siblings, 1 reply; 14+ messages in thread
From: Jason Baron @ 2011-07-22 0:32 UTC (permalink / raw)
To: Paul Turner
Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov
rth@redhat.com
Bcc:
Subject: Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead
when bandwidth control is inactive
Reply-To:
In-Reply-To: <20110721184758.403388616@google.com>
On Thu, Jul 21, 2011 at 09:43:42AM -0700, Paul Turner wrote:
> So I'm seeing some strange costs associated with jump_labels; while on paper
> the branches and instructions retired improves (as expected) we're taking an
> unexpected hit in IPC.
>
> [From the initial mail we have workloads:
> mkdir -p /cgroup/cpu/test
> echo $$ > /dev/cgroup/cpu/test (only cpu,cpuacct mounted)
> (W1) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "for ((i=0;i<5;i++)); do $(dirname $0)/pipe-test 20000; done"
> (W2)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;true"
> (W3)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;"
> ]
>
> To make some of the figures more clear:
>
> Legend:
> !BWC = tip + bwc, BWC compiled out
> BWC = tip + bwc
> BWC_JL = tip + bwc + jump label (this patch)
>
>
> Now, comparing under W1 we see:
> W1: BWC vs BWC_JL
> instructions cycles branches elapsed
> ---------------------------------------------------------------------------------------------------------------------
> clovertown [BWC] 845934117 974222228 152715407 0.419014188 [baseline]
> +unconstrained 857963815 (+1.42) 1007152750 (+3.38) 153140328 (+0.28) 0.433186926 (+3.38) [rel]
> +10000000000/1000: 876937753 (+2.55) 1033978705 (+5.65) 160038434 (+3.59) 0.443638365 (+5.66) [rel]
> +10000000000/1000000: 880276838 (+3.08) 1036176245 (+6.13) 160683878 (+4.15) 0.444577244 (+6.14) [rel]
>
> barcelona [BWC] 820573353 748178486 148161233 0.342122850 [baseline]
> +unconstrained 817011602 (-0.43) 759838181 (+1.56) 145951513 (-1.49) 0.347462571 (+1.56) [rel]
> +10000000000/1000: 830109086 (+0.26) 770451537 (+1.67) 151228902 (+1.08) 0.350824677 (+1.65) [rel]
> +10000000000/1000000: 830196206 (+0.30) 770704213 (+2.27) 151250413 (+1.12) 0.350962182 (+2.28) [rel]
>
> westmere [BWC] 802533191 694415157 146071233 0.194428018 [baseline]
> +unconstrained 799057936 (-0.43) 751384496 (+8.20) 143875513 (-1.50) 0.211182620 (+8.62) [rel]
> +10000000000/1000: 812033785 (+0.27) 761469084 (+8.51) 149134146 (+1.09) 0.212149229 (+8.28) [rel]
> +10000000000/1000000: 811912834 (+0.27) 757842988 (+7.45) 149113291 (+1.09) 0.211364804 (+7.30) [rel]
> e.g. Barcelona issues ~0.43% fewer instructions, for a total of 817011602, in
> the unconstrained case with BWC_JL than with BWC.
>
>
> Where "unconstrained, 10000000000/1000, 10000000000/1000000" are the
> measurements for BWC_JL, with (%d) being the relative difference to their
> BWC counterparts.
>
> W2: BWC vs BWC_JL is very similar.
> BWC vs BWC_JL
> clovertown [BWC] 985732031 1283113452 175621212 1.375905653
> +unconstrained 979242938 (-0.66) 1288971141 (+0.46) 172122546 (-1.99) 1.389795165 (+1.01) [rel]
> +10000000000/1000: 999886468 (+0.33) 1296597143 (+1.13) 180554004 (+1.62) 1.392576770 (+1.18) [rel]
> +10000000000/1000000: 999034223 (+0.11) 1293925500 (+0.57) 180413829 (+1.39) 1.391041338 (+0.94) [rel]
>
> barcelona [BWC] 982139920 1078757792 175417574 1.069537049
> +unconstrained 965443672 (-1.70) 1075377223 (-0.31) 170215844 (-2.97) 1.045595065 (-2.24) [rel]
> +10000000000/1000: 989104943 (+0.05) 1100836668 (+0.52) 178837754 (+1.22) 1.058730316 (-1.77) [rel]
> +10000000000/1000000: 987627489 (-0.32) 1095843758 (-0.17) 178567411 (+0.84) 1.056100899 (-2.28) [rel]
>
> westmere [BWC] 918633403 896047900 166496917 0.754629182
> +unconstrained 914740541 (-0.42) 903906801 (+0.88) 163652848 (-1.71) 0.758050332 (+0.45) [rel]
> +10000000000/1000: 927517377 (-0.41) 952579771 (+5.67) 170173060 (+0.75) 0.771193786 (+2.43) [rel]
> +10000000000/1000000: 914676985 (-0.89) 936106277 (+3.81) 167683288 (+0.22) 0.764973632 (+1.38) [rel]
>
> Now this is rather odd, almost across the board we're seeing the expected
> drops in instructions and branches, yet we appear to be paying a heavy IPC
> price. The fact that wall-time has scaled equivalently with cycles roughly
> rules out the cycles counter being off.
>
> We are seeing the expected behavior in the bandwidth enabled case;
> specifically the <jl=jmp><ret><cond><ret> blocks are taking an extra branch
> and instruction which shows up on all the numbers above.
>
> With respect to compiler mangling the text is essentially unchanged in size.
> One lurking suspicion is whether the inserted nops have perturbed some of the
> jmp/branch alignments?
>
> text data bss dec hex filename
> 7277206 2827256 2125824 12230286 ba9e8e vmlinux.jump_label
> 7276886 2826744 2125824 12229454 ba9b4e vmlinux.no_jump_label
>
> I have checked to make sure that the right instructions are being patched in
> at run-time. I've also pulled a fully patched jump_label out of the kernel
> into a userspace test (and benchmarked it directly under perf). The results
> here are also exactly as expected.
>
> e.g.
> Performance counter stats for './jump_test':
> 1,500,839,002 instructions, 300,147,081 branches 702,468,404 cycles
> Performance counter stats for './jump_test 1':
> 2,001,014,609 instructions, 400,177,192 branches 901,758,219 cycles
>
> Overall if we can fix the IPC the benefit in the globally unconstrained case
> looks really good.
>
> Any thoughts Jason?
>
Do you have CONFIG_CC_OPTIMIZE_FOR_SIZE set? I know that when
CONFIG_CC_OPTIMIZE_FOR_SIZE is not set, the compiler can make the code
more optimal.
thanks,
-Jason
^ permalink raw reply [flat|nested] 14+ messages in thread
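For reference, the IPC hit described above can be recomputed from the W1 table alone. This is only a rough sketch: it assumes the "[BWC]" row is the baseline and the "+unconstrained" row is the BWC_JL measurement (as the legend suggests), and the helper names ipc() and ipc_change_pct() are made up for this example.

```python
# (instructions, cycles) for the W1 unconstrained comparison, copied from the
# table in the message above. "BWC" is the "[BWC]" baseline row and "BWC_JL"
# is the "+unconstrained" row (assumed mapping, see the note above).
W1 = {
    "clovertown": {"BWC": (845934117, 974222228), "BWC_JL": (857963815, 1007152750)},
    "barcelona": {"BWC": (820573353, 748178486), "BWC_JL": (817011602, 759838181)},
    "westmere": {"BWC": (802533191, 694415157), "BWC_JL": (799057936, 751384496)},
}


def ipc(instructions: int, cycles: int) -> float:
    """Instructions retired per cycle."""
    return instructions / cycles


def ipc_change_pct(platform: str) -> float:
    """Relative IPC change of BWC_JL against BWC, in percent."""
    base = ipc(*W1[platform]["BWC"])
    jump_label = ipc(*W1[platform]["BWC_JL"])
    return (jump_label - base) / base * 100.0


for name in W1:
    base = ipc(*W1[name]["BWC"])
    jump_label = ipc(*W1[name]["BWC_JL"])
    print(f"{name:10s} IPC {base:.3f} -> {jump_label:.3f} ({ipc_change_pct(name):+.2f}%)")
```

On these figures IPC drops on all three platforms, with Westmere showing by far the largest change, which matches the cycle and elapsed-time columns.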
* Re:
2011-07-22 0:32 Jason Baron
@ 2011-07-22 0:57 ` Paul Turner
2011-07-22 1:17 ` [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive Jason Baron
0 siblings, 1 reply; 14+ messages in thread
From: Paul Turner @ 2011-07-22 0:57 UTC (permalink / raw)
To: Jason Baron
Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov
On Thu, Jul 21, 2011 at 5:32 PM, Jason Baron <jbaron@redhat.com> wrote:
> On Thu, Jul 21, 2011 at 09:43:42AM -0700, Paul Turner wrote:
>> So I'm seeing some strange costs associated with jump_labels; while on paper
>> the branches and instructions retired improves (as expected) we're taking an
>> unexpected hit in IPC.
>>
>> [...]
>>
>> Overall if we can fix the IPC the benefit in the globally unconstrained case
>> looks really good.
>>
>> Any thoughts Jason?
>>
>
> Do you have CONFIG_CC_OPTIMIZE_FOR_SIZE set? I know that when
> CONFIG_CC_OPTIMIZE_FOR_SIZE is not set, the compiler can make the code
> more optimal.
>

Ah I should have mentioned that was one of the holes I stared down:

Builds were -O2 (gcc-4.6.1) and
$ zcat /proc/config.gz | grep CONFIG_CC_OPTIMIZE_FOR_SIZE
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set

Same kernel image across all platforms.

> thanks,
>
> -Jason
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
2011-07-22 0:57 ` Paul Turner
@ 2011-07-22 1:17 ` Jason Baron
2011-07-22 1:38 ` Paul Turner
0 siblings, 1 reply; 14+ messages in thread
From: Jason Baron @ 2011-07-22 1:17 UTC (permalink / raw)
To: Paul Turner
Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
rth
On Thu, Jul 21, 2011 at 05:57:31PM -0700, Paul Turner wrote:
> On Thu, Jul 21, 2011 at 5:32 PM, Jason Baron <jbaron@redhat.com> wrote:
> > On Thu, Jul 21, 2011 at 09:43:42AM -0700, Paul Turner wrote:
> >> [...]
> >>
> >> Now this is rather odd, almost across the board we're seeing the expected
> >> drops in instructions and branches, yet we appear to be paying a heavy IPC
> >> price. The fact that wall-time has scaled equivalently with cycles roughly
> >> rules out the cycles counter being off.
> >>

if i understand your results, for barcelona you did see an improvement
in cycles and elapsed time with jump labels for unconstrained?

> >> We are seeing the expected behavior in the bandwidth enabled case;
> >> specifically the <jl=jmp><ret><cond><ret> blocks are taking an extra branch
> >> and instruction which shows up on all the numbers above.
> >>
> >> With respect to compiler mangling the text is essentially unchanged in size.
> >> One lurking suspicion is whether the inserted nops have perturbed some of the
> >> jmp/branch alignments?

hmmmm....not sure, I'm adding Richard Henderson to the 'cc list, who
worked on the 'asm goto' in gcc.

> >>
> >> text data bss dec hex filename
> >> 7277206 2827256 2125824 12230286 ba9e8e vmlinux.jump_label
> >> 7276886 2826744 2125824 12229454 ba9b4e vmlinux.no_jump_label
> >>

the other thing here is that vmlinux.jump_label includes the extra
kernel/jump_label.o file, so you can sort of subtract the text size of
that file to do a fair comparison.

Also, I would have expected the data section to have increased more with
jump labels enabled. Are tracepoints disabled (a current user of jump
labels)?

> >> I have checked to make sure that the right instructions are being patched in
> >> at run-time. I've also pulled a fully patched jump_label out of the kernel
> >> into a userspace test (and benchmarked it directly under perf). The results
> >> here are also exactly as expected.
> >>
> >> e.g.
> >> Performance counter stats for './jump_test':
> >> 1,500,839,002 instructions, 300,147,081 branches 702,468,404 cycles
> >> Performance counter stats for './jump_test 1':
> >> 2,001,014,609 instructions, 400,177,192 branches 901,758,219 cycles
> >>

what no-op did you use in userspace? I wouldn't think the no-op choice
would make any difference though...At compile time we use a 'jmp 0', and
then at boot we dynamically patch the 'jmp 0' with the no-op we think works
best...

thanks,

-Jason
^ permalink raw reply [flat|nested] 14+ messages in thread
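As a quick sanity check on the size comparison discussed in this message, the deltas between the two images can be worked out directly from the quoted `size` output. This is a small sketch, not part of the original thread. The text size of kernel/jump_label.o is not given anywhere in the thread, so it is left out rather than guessed, and relative_change() is a made-up helper name.

```python
def relative_change(new: int, old: int) -> float:
    """Change from old to new, as a percentage of old."""
    return (new - old) / old * 100.0


# Section sizes in bytes, copied from the `size` output quoted in the thread.
JUMP_LABEL = {"text": 7277206, "data": 2827256}
NO_JUMP_LABEL = {"text": 7276886, "data": 2826744}

for section in ("text", "data"):
    delta = JUMP_LABEL[section] - NO_JUMP_LABEL[section]
    pct = relative_change(JUMP_LABEL[section], NO_JUMP_LABEL[section])
    print(f"{section}: {delta:+d} bytes ({pct:+.4f}%)")
```

Text grows by 320 bytes and data by 512 bytes, both well under a tenth of a percent, so any subtraction for kernel/jump_label.o would only shrink an already tiny difference.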
* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
2011-07-22 1:17 ` [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive Jason Baron
@ 2011-07-22 1:38 ` Paul Turner
2011-07-27 21:58 ` Jason Baron
0 siblings, 1 reply; 14+ messages in thread
From: Paul Turner @ 2011-07-22 1:38 UTC (permalink / raw)
To: Jason Baron
Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
rth
On Thu, Jul 21, 2011 at 6:17 PM, Jason Baron <jbaron@redhat.com> wrote:
> On Thu, Jul 21, 2011 at 05:57:31PM -0700, Paul Turner wrote:
>> On Thu, Jul 21, 2011 at 5:32 PM, Jason Baron <jbaron@redhat.com> wrote:
>> > On Thu, Jul 21, 2011 at 09:43:42AM -0700, Paul Turner wrote:
>> >> [...]
>> >>
>> >> Now this is rather odd, almost across the board we're seeing the expected
>> >> drops in instructions and branches, yet we appear to be paying a heavy IPC
>> >> price. The fact that wall-time has scaled equivalently with cycles roughly
>> >> rules out the cycles counter being off.
>> >>
>
> if i understand your results, for barcelona you did see an improvement
> in cycles and elapsed time with jump labels for unconstrained?
>

Under W2, yes.

>> >> We are seeing the expected behavior in the bandwidth enabled case;
>> >> specifically the <jl=jmp><ret><cond><ret> blocks are taking an extra branch
>> >> and instruction which shows up on all the numbers above.
>> >>
>> >> With respect to compiler mangling the text is essentially unchanged in size.
>> >> One lurking suspicion is whether the inserted nops have perturbed some of the
>> >> jmp/branch alignments?
>
> hmmmm....not sure, I'm adding Richard Henderson to the 'cc list, who
> worked on the 'asm goto' in gcc.
>
>> >>
>> >> text data bss dec hex filename
>> >> 7277206 2827256 2125824 12230286 ba9e8e vmlinux.jump_label
>> >> 7276886 2826744 2125824 12229454 ba9b4e vmlinux.no_jump_label
>> >>
>
> the other thing here is that vmlinux.jump_label includes the extra
> kernel/jump_label.o file, so you can sort of subtract the text size of
> that file to do a fair comparison.

Even without doing that it's only a ~0.004% change in text size (a
ratio of 1.00004).

I was just making the inference that if it's gcc mangling it's likely
in the layout/alignment.

>
> Also, I would have expected the data section to have increased more with
> jump labels enabled. Are tracepoints disabled (a current user of jump
> labels)?

Yeah -- Tracing is enabled so the BWC build should have labels
already; this likely accounts for the small increase noted above.

>
>> >> I have checked to make sure that the right instructions are being patched in
>> >> at run-time. I've also pulled a fully patched jump_label out of the kernel
>> >> into a userspace test (and benchmarked it directly under perf). The results
>> >> here are also exactly as expected.
>> >>
>> >> e.g.
>> >> Performance counter stats for './jump_test':
>> >> 1,500,839,002 instructions, 300,147,081 branches 702,468,404 cycles
>> >> Performance counter stats for './jump_test 1':
>> >> 2,001,014,609 instructions, 400,177,192 branches 901,758,219 cycles
>> >>
>
> what no-op did you use in userspace? I wouldn't think the no-op choice
> would make any difference though...At compile time we use a 'jmp 0', and
> then at boot we dynamically patch the 'jmp 0' with the no-op we think works
> best...
>

Sorry -- what I meant here is I pulled the run-time chosen "best" nop
out of /proc/kcore and tested a tight loop about a
<JL><RET><COND><RET> sequence (e.g. cfs_rq_throttled()) with JL being
the nop and jmp respectively.

Specifically for Westmere this ends up being K8_NOP5 -- 0x666666D0

> thanks,
>
> -Jason
^ permalink raw reply [flat|nested] 14+ messages in thread
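To put numbers on the userspace test quoted in this message, the two perf runs can be compared directly. This is only a sketch: it assumes './jump_test' is the run with the nop patched in and './jump_test 1' the run with the jmp, which is how the description reads but is not stated outright. The dictionary keys and the ratio() helper are made up for this example.

```python
# Counter values copied from the quoted perf output for the userspace test.
# Assumption: "nop" is './jump_test' and "jmp" is './jump_test 1'.
RUNS = {
    "nop": {"instructions": 1_500_839_002, "branches": 300_147_081, "cycles": 702_468_404},
    "jmp": {"instructions": 2_001_014_609, "branches": 400_177_192, "cycles": 901_758_219},
}


def ratio(counter: str) -> float:
    """How much larger the jmp run is than the nop run for one counter."""
    return RUNS["jmp"][counter] / RUNS["nop"][counter]


for counter in ("instructions", "branches", "cycles"):
    print(f"{counter:12s} jmp/nop = {ratio(counter):.3f}")

for name, run in RUNS.items():
    print(f"{name}: IPC = {run['instructions'] / run['cycles']:.3f}")
```

Under that assumption the jmp variant retires about a third more instructions and branches and takes about 28% more cycles, so in isolation the nop path is cheaper, consistent with the results being described as expected.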
* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
2011-07-22 1:38 ` Paul Turner
@ 2011-07-27 21:58 ` Jason Baron
2011-08-05 3:53 ` Paul Turner
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Jason Baron @ 2011-07-27 21:58 UTC (permalink / raw)
To: Paul Turner
Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
rth
On Thu, Jul 21, 2011 at 06:38:01PM -0700, Paul Turner wrote:
> On Thu, Jul 21, 2011 at 6:17 PM, Jason Baron <jbaron@redhat.com> wrote:
> > On Thu, Jul 21, 2011 at 05:57:31PM -0700, Paul Turner wrote:
> >> On Thu, Jul 21, 2011 at 5:32 PM, Jason Baron <jbaron@redhat.com> wrote:
> >> > On Thu, Jul 21, 2011 at 09:43:42AM -0700, Paul Turner wrote:
> >> >> [...]
> >> >>
> >> >> Now this is rather odd, almost across the board we're seeing the expected
> >> >> drops in instructions and branches, yet we appear to be paying a heavy IPC
> >> >> price. The fact that wall-time has scaled equivalently with cycles roughly
> >> >> rules out the cycles counter being off.
> >> >>
> >
> > if i understand your results, for barcelona you did see an improvement
> > in cycles and elapsed time with jump labels for unconstrained?
> >
>
> Under W2, yes.
>
> >> >> We are seeing the expected behavior in the bandwidth enabled case;
> >> >> specifically the <jl=jmp><ret><cond><ret> blocks are taking an extra branch
> >> >> and instruction which shows up on all the numbers above.
> >> >>
> >> >> With respect to compiler mangling the text is essentially unchanged in size.
> >> >> One lurking suspicion is whether the inserted nops have perturbed some of the
> >> >> jmp/branch alignments?
> >
> > hmmmm....not sure, I'm adding Richard Henderson to the 'cc list, who
> > worked on the 'asm goto' in gcc.
> >
> >> >>
> >> >> text data bss dec hex filename
> >> >> 7277206 2827256 2125824 12230286 ba9e8e vmlinux.jump_label
> >> >> 7276886 2826744 2125824 12229454 ba9b4e vmlinux.no_jump_label
> >> >>
> >
> > the other thing here is that vmlinux.jump_label includes the extra
> > kernel/jump_label.o file, so you can sort of subtract the text size of
> > that file to do a fair comparison.
>
> Even without doing that it's only a ~0.004% change in text size (a
> ratio of 1.00004).
>
> I was just making the inference that if it's gcc mangling it's likely
> in the layout/alignment.
>
> >
> > Also, I would have expected the data section to have increased more with
> > jump labels enabled. Are tracepoints disabled (a current user of jump
> > labels)?
>
> Yeah -- Tracing is enabled so the BWC build should have labels
> already; this likely accounts for the small increase noted above.
>
> >
> >> >> I have checked to make sure that the right instructions are being patched in
> >> >> at run-time. I've also pulled a fully patched jump_label out of the kernel
> >> >> into a userspace test (and benchmarked it directly under perf).
The results > >> >> here are also exactly as expected. > >> >> > >> >> e.g. > >> >> Performance counter stats for './jump_test': > >> >> 1,500,839,002 instructions, 300,147,081 branches 702,468,404 cycles > >> >> Performance counter stats for './jump_test 1': > >> >> 2,001,014,609 instructions, 400,177,192 branches 901,758,219 cycles > >> >> > > > > what no-op did you use in userspace? I wouldn't think the no-op choice > > would make any difference though...At compile time we use a 'jmp 0', and > > then at boot we dynamically patch the 'jmp 0' with the no-op we think works > > best... > > > > Sorry -- what I meant here is I pulled the run-time chosen "best" nop > out of /proc/kcore and tested a > tight loop about a <JL><RET><COND><RET> sequence (e.g. > cfs_rq_throttled()) with JL being the nop and jmp respectively. > > Specifically for Westmere this ends up being K8_NOP5 -- 0x666666D0 > > > thanks, > > > > -Jason > > > >> >> Overall if we can fix the IPC the benefit in the globally unconstrained case > >> >> looks really good. > >> >> > >> >> Any thoughts Jason? > >> >> > >> > > >> > Do you have CONFIG_CC_OPTIMIZE_FOR_SIZE set? I know that when > >> > CONFIG_CC_OPTIMIZE_FOR_SIZE is not set, the compiler can make the code > >> > more optimal. > >> > > >> > >> Ah I should have mentioned that was one of the holes I stared down: > >> > >> Builds were -O2 (gcc-4.6.1) and > >> $ zcat /proc/config.gz | grep CONFIG_CC_OPTIMIZE_FOR_SIZE > >> # CONFIG_CC_OPTIMIZE_FOR_SIZE is not set > >> > >> Same kernel image across all platforms. > >> > >> Hi Paul, Ok, I think I finally tracked this down. It may seem a bit crazy, but when we are getting down to cycle counting like this, it seems that the link order in the kernel/Makefile can make difference. I had the jump_label.o listed after the core files, whereas all the code in jump_label.o is really slow path code (used when toggling branch values). 
As follows: --- a/kernel/Makefile +++ b/kernel/Makefile @@ -10,7 +10,7 @@ obj-y = sched.o fork.o exec_domain.o panic.o printk.o \ kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \ hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \ notifier.o ksysfs.o pm_qos_params.o sched_clock.o cred.o \ - async.o range.o jump_label.o + async.o range.o obj-y += groups.o ifdef CONFIG_FUNCTION_TRACER @@ -107,6 +107,7 @@ obj-$(CONFIG_PERF_EVENTS) += events/ obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o obj-$(CONFIG_PADATA) += padata.o obj-$(CONFIG_CRASH_DUMP) += crash_dump.o +obj-$(CONFIG_JUMP_LABEL) += jump_label.o ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y) # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is I've tested the patch using a single 'static_branch()' in the getppid() path, and basically running tight loops of calls to getppid(). Before, the patch, I was seeing results similar to what you reported, after the patch, things improved for all metrics. Here are my results for the branch disabled case: With jump labels turned on (CONFIG_JUMP_LABEL), branch disabled: Performance counter stats for 'bash -c /tmp/timing;true' (50 runs): 3,969,510,217 instructions # 0.864 IPC ( +-0.000% ) 4,592,334,954 cycles ( +- 0.046% ) 751,634,470 branches ( +- 0.000% ) 1.722635797 seconds time elapsed ( +- 0.046% ) Jump labels turned off (CONFIG_JUMP_LABEL not set), branch disabled: Performance counter stats for 'bash -c /tmp/timing;true' (50 runs): 4,009,611,846 instructions # 0.867 IPC ( +-0.000% ) 4,622,210,580 cycles ( +- 0.012% ) 771,662,904 branches ( +- 0.000% ) 1.734341454 seconds time elapsed ( +- 0.022% ) So all of the measured metrics improved in the jump labels case b/w 0.5% - 2.5%. I'm curious to see what you find with this patch. Thanks, -Jason ^ permalink raw reply [flat|nested] 14+ messages in thread
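
[Editorial aside, not part of the original mail.] The improvement range quoted in the message above can be re-derived from the two perf stat outputs. The following is only a minimal sketch: the counter values are copied from the mail, while the helper names `pct_reduction` and `ipc` are made up for illustration.

```python
# Counter values copied from the two `perf stat` runs quoted in the mail.
jump_label_on = {
    "instructions": 3_969_510_217,
    "cycles": 4_592_334_954,
    "branches": 751_634_470,
    "elapsed_s": 1.722635797,
}
jump_label_off = {
    "instructions": 4_009_611_846,
    "cycles": 4_622_210_580,
    "branches": 771_662_904,
    "elapsed_s": 1.734341454,
}


def pct_reduction(on, off):
    """Percent by which `on` is lower than `off` (hypothetical helper)."""
    return (off - on) / off * 100.0


def ipc(run):
    """Instructions per cycle for one run (hypothetical helper)."""
    return run["instructions"] / run["cycles"]


# Relative reduction of every metric with CONFIG_JUMP_LABEL enabled.
reductions = {
    name: pct_reduction(jump_label_on[name], jump_label_off[name])
    for name in jump_label_on
}

for name, value in reductions.items():
    print(f"{name:>12}: {value:.2f}% lower with CONFIG_JUMP_LABEL")
print(f"IPC on/off: {ipc(jump_label_on):.3f} / {ipc(jump_label_off):.3f}")
```

Under these inputs every counter comes out lower with jump labels enabled, with branches showing the largest reduction and cycles the smallest, which is in line with the range stated in the mail.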
* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
2011-07-27 21:58 ` Jason Baron
@ 2011-08-05 3:53 ` Paul Turner
2011-08-05 7:21 ` Peter Zijlstra
2011-08-05 3:55 ` Paul Turner
2011-08-05 8:30 ` Peter Zijlstra
2 siblings, 1 reply; 14+ messages in thread
From: Paul Turner @ 2011-08-05 3:53 UTC (permalink / raw)
To: Jason Baron
Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, rth

<snip>
>
> Hi Paul,
>
> Ok, I think I finally tracked this down. It may seem a bit crazy, but
> when we are getting down to cycle counting like this, it seems that the
> link order in the kernel/Makefile can make a difference. I had the
> jump_label.o listed after the core files, whereas all the code in
> jump_label.o is really slow path code (used when toggling branch
> values). As follows:
>
>
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -10,7 +10,7 @@ obj-y = sched.o fork.o exec_domain.o panic.o printk.o \
>      kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
>      hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
>      notifier.o ksysfs.o pm_qos_params.o sched_clock.o cred.o \
> -    async.o range.o jump_label.o
> +    async.o range.o
>  obj-y += groups.o
>
>  ifdef CONFIG_FUNCTION_TRACER
> @@ -107,6 +107,7 @@ obj-$(CONFIG_PERF_EVENTS) += events/
>  obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
>  obj-$(CONFIG_PADATA) += padata.o
>  obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
> +obj-$(CONFIG_JUMP_LABEL) += jump_label.o
>
>  ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
>  # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
>
>
> I've tested the patch using a single 'static_branch()' in the getppid() path,
> and basically running tight loops of calls to getppid(). Before the
> patch, I was seeing results similar to what you reported; after the
> patch, things improved for all metrics. Here are my results for the
> branch disabled case:
>
> With jump labels turned on (CONFIG_JUMP_LABEL), branch disabled:
>
> Performance counter stats for 'bash -c /tmp/timing;true' (50 runs):
>
>   3,969,510,217 instructions   # 0.864 IPC  ( +-0.000% )
>   4,592,334,954 cycles                      ( +- 0.046% )
>     751,634,470 branches                    ( +- 0.000% )
>
>     1.722635797 seconds time elapsed        ( +- 0.046% )
>
> Jump labels turned off (CONFIG_JUMP_LABEL not set), branch disabled:
>
> Performance counter stats for 'bash -c /tmp/timing;true' (50 runs):
>
>   4,009,611,846 instructions   # 0.867 IPC  ( +-0.000% )
>   4,622,210,580 cycles                      ( +- 0.012% )
>     771,662,904 branches                    ( +- 0.000% )
>
>     1.734341454 seconds time elapsed        ( +- 0.022% )
>
>
> So all of the measured metrics improved in the jump labels case b/w
> 0.5% - 2.5%.
>
> I'm curious to see what you find with this patch.
>
> Thanks,
>
> -Jason
>

Hi Jason,

Thanks for taking a look at this.

Sorry, this took a few days to benchmark all the permutations and we had
some issues with internal proxies which interrupted benchmarking runs.

Results and some analysis follow.

[ Key:
npo_XXX = with CONFIG_JUMP_LABEL, without link order patch (no patched order)
po_XXX = with CONFIG_JUMP_LABEL, with link order patch (patched order)
njl_XXX = without CONFIG_JUMP_LABEL

Where "XXX" is
head: tip (c5bafb3) without patch series
cfs: tip + patch series - jump_label patch
cfs_jl: tip + patch series + jump_label for unconstrained

Test was repeated 3 times, each run was 50 repeats w/ typically ~<0.1
in-test variance on reported output
]

Considering just jump labels in tip, comparing against HEAD w/ !CONFIG_JUMP_LABEL

              instructions       cycles             branches           elapsed
---------------------------------------------------------------------------------------------------------------------
Westmere:
njl_head.1    798832892          722624737          145375836          0.203218936  [baseline]
njl_head.2    798888783 (+0.01)  746118188 (+3.25)  145386807 (+0.01)  0.208573683 (-2.18)
njl_head.3    798864253 (+0.00)  731537139 (+1.23)  145382747 (+0.00)  0.204098175 (-4.28)
npo_head.1    797033521 (-0.23)  731239359 (+1.19)  144571358 (-0.55)  0.206910496 (-2.96)
npo_head.2    797166434 (-0.21)  728926020 (+0.87)  144603465 (-0.53)  0.202906392 (-4.84)
npo_head.3    797165370 (-0.21)  725930458 (+0.46)  144603438 (-0.53)  0.202118274 (-5.21)
po_head.1     797019904 (-0.23)  699008145 (-3.27)  144567652 (-0.56)  0.197272615 (-7.48)
po_head.2     797037682 (-0.22)  705732419 (-2.34)  144572115 (-0.55)  0.197101692 (-7.56)
po_head.3     797079804 (-0.22)  698007668 (-3.41)  144580964 (-0.55)  0.194871253 (-8.61)

Barcelona:
njl_head.1    816842028          748362637          147462095          0.341654152
njl_head.2    816849735 (+0.00)  748480742 (+0.02)  147462652 (+0.00)  0.341450734 (-2.90)
njl_head.3    816834963 (-0.00)  747083797 (-0.17)  147460200 (-0.00)  0.340802353 (-3.09)
npo_head.1    815068563 (-0.22)  775012690 (+3.56)  146661357 (-0.54)  0.353797321 (+0.61)
npo_head.2    815033261 (-0.22)  759613364 (+1.50)  146654106 (-0.55)  0.346462671 (-1.48)
npo_head.3    815029611 (-0.22)  762660196 (+1.91)  146654169 (-0.55)  0.347565129 (-1.16)
po_head.1     815026489 (-0.22)  767229109 (+2.52)  146653376 (-0.55)  0.350241833 (-0.40)
po_head.2     815035127 (-0.22)  770224495 (+2.92)  146654019 (-0.55)  0.351352092 (-0.09)
po_head.3     815109904 (-0.21)  774954096 (+3.55)  146662020 (-0.54)  0.353505054 (+0.53)

With the patch to fix link-order we're typically faster and it's
probably time to modulate the configs so we get CONFIG_JUMP_LABEL by
default when CC_HAS_ASM_GOTO.

Considering Bandwidth control, comparing vs HEAD w/ CONFIG_JUMP_LABEL:

              instructions       cycles             branches           elapsed
---------------------------------------------------------------------------------------------------------------------
Westmere:
po_head.1     797019904          699008145          144567652          0.197272615  [Baseline]
po_head.2     797037682 (+0.00)  705732419 (+0.96)  144572115 (+0.00)  0.197101692 (-4.91)
po_head.3     797079804 (+0.01)  698007668 (-0.14)  144580964 (+0.01)  0.194871253 (-5.98)
njl_cfs.1     802649718 (+0.71)  708143552 (+1.31)  146577437 (+1.39)  0.198770168 (-4.10)
njl_cfs.2     802679078 (+0.71)  707486608 (+1.21)  146582628 (+1.39)  0.197890812 (-4.53)
njl_cfs.3     802647500 (+0.71)  704770712 (+0.82)  146578141 (+1.39)  0.196742304 (-5.08)
npo_cfs.1     800661523 (+0.46)  724068093 (+3.59)  145774786 (+0.83)  0.204632700 (-1.27)
npo_cfs.2     800646997 (+0.46)  718884486 (+2.84)  145772293 (+0.83)  0.201248482 (-2.91)
npo_cfs.3     800783171 (+0.47)  725140326 (+3.74)  145804350 (+0.86)  0.203266025 (-1.93)
npo_cfs_jl.1  797304605 (+0.04)  687741762 (-1.61)  143666256 (-0.62)  0.194302293 (-6.26)
npo_cfs_jl.2  797446281 (+0.05)  694066715 (-0.71)  143700065 (-0.60)  0.194212118 (-6.30)
npo_cfs_jl.3  797374495 (+0.04)  697561774 (-0.21)  143682692 (-0.61)  0.194935111 (-5.95)
po_cfs.1      800631004 (+0.45)  715819643 (+2.41)  145769677 (+0.83)  0.200007036 (-3.51)
po_cfs.2      800642622 (+0.45)  698569729 (-0.06)  145769973 (+0.83)  0.194625680 (-6.10)
po_cfs.3      800752778 (+0.47)  707282749 (+1.18)  145798992 (+0.85)  0.197047366 (-4.93)
po_cfs_jl.1   797306617 (+0.04)  686329256 (-1.81)  143666659 (-0.62)  0.193107369 (-6.83)
po_cfs_jl.2   797434478 (+0.05)  677865445 (-3.02)  143697712 (-0.60)  0.189314824 (-8.66)
po_cfs_jl.3   797299055 (+0.04)  686371679 (-1.81)  143665758 (-0.62)  0.191859014 (-7.44)

Barcelona:
po_head.1     815026489          767229109          146653376          0.350241833  [Baseline]
po_head.2     815035127 (+0.00)  770224495 (+0.39)  146654019 (+0.00)  0.351352092 (-2.47)
po_head.3     815109904 (+0.01)  774954096 (+1.01)  146662020 (+0.01)  0.353505054 (-1.87)
njl_cfs.1     820647075 (+0.69)  756895773 (-1.35)  148663929 (+1.37)  0.345563962 (-4.07)
njl_cfs.2     820672501 (+0.69)  761520373 (-0.74)  148667815 (+1.37)  0.347529253 (-3.53)
njl_cfs.3     820664350 (+0.69)  763400895 (-0.50)  148666126 (+1.37)  0.348337223 (-3.30)
npo_cfs.1     818629349 (+0.44)  758306455 (-1.16)  147854452 (+0.82)  0.346678486 (-3.77)
npo_cfs.2     818829256 (+0.47)  768393448 (+0.15)  147891099 (+0.84)  0.350678075 (-2.65)
npo_cfs.3     818697806 (+0.45)  772218715 (+0.65)  147866720 (+0.83)  0.352333672 (-2.20)
npo_cfs_jl.1  815343935 (+0.04)  760127157 (-0.93)  145753233 (-0.61)  0.347184970 (-3.62)
npo_cfs_jl.2  815415786 (+0.05)  775772068 (+1.11)  145762961 (-0.61)  0.353965833 (-1.74)
npo_cfs_jl.3  815403187 (+0.05)  764048918 (-0.41)  145761012 (-0.61)  0.348619922 (-3.23)
po_cfs.1      819204964 (+0.51)  767156385 (-0.01)  147959727 (+0.89)  0.350737982 (-2.64)
po_cfs.2      818665676 (+0.45)  764324366 (-0.38)  147860788 (+0.82)  0.348814489 (-3.17)
po_cfs.3      818661849 (+0.45)  752288492 (-1.95)  147859717 (+0.82)  0.343294319 (-4.70)
po_cfs_jl.1   815336908 (+0.04)  765760248 (-0.19)  145755155 (-0.61)  0.349608614 (-2.95)
po_cfs_jl.2   815322295 (+0.04)  765613685 (-0.21)  145751972 (-0.61)  0.349321663 (-3.03)
po_cfs_jl.3   815310833 (+0.03)  759647967 (-0.99)  145750118 (-0.62)  0.346607639 (-3.78)

Thanks to the magic of compiler re-organization we now report zero
overhead; in fact, a speed-up is realized.

I will re-post v7.3 with:
- rebase to minor changes in tip
- removing RFT from adding jump_labels to CFS
- additional hierarchical period constraint

Thanks for looking into this Jason!

- Paul

^ permalink raw reply [flat|nested] 14+ messages in thread
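
[Editorial aside, not part of the original mail.] The bracketed figures in the tables above are relative differences against the baseline row. As a rough cross-check, the cycles column for the Westmere po_cfs_jl runs can be recomputed from the raw counts. This is a sketch only: the numbers are copied from the bandwidth-control table and `rel_diff_pct` is a hypothetical helper name.

```python
# Westmere cycle counts copied from the bandwidth-control table in the mail.
BASELINE_CYCLES = 699_008_145  # po_head.1, the row marked [Baseline]
po_cfs_jl_cycles = {
    "po_cfs_jl.1": 686_329_256,
    "po_cfs_jl.2": 677_865_445,
    "po_cfs_jl.3": 686_371_679,
}


def rel_diff_pct(value, baseline):
    """Relative difference to the baseline in percent (hypothetical helper)."""
    return (value - baseline) / baseline * 100.0


rel = {
    run: rel_diff_pct(cycles, BASELINE_CYCLES)
    for run, cycles in po_cfs_jl_cycles.items()
}

for run, pct in rel.items():
    print(f"{run}: {pct:+.2f}% cycles vs po_head.1")
```

A negative value means fewer cycles than the baseline, so all three runs with both the link-order patch and the jump-label patch come out ahead of plain tip on Westmere.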
* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
2011-08-05 3:53 ` Paul Turner
@ 2011-08-05 7:21 ` Peter Zijlstra
0 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2011-08-05 7:21 UTC (permalink / raw)
To: Paul Turner
Cc: Jason Baron, linux-kernel, Bharata B Rao, Dhaval Giani,
Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, rth

On Thu, 2011-08-04 at 20:53 -0700, Paul Turner wrote:
>
> I will re-post v7.3 with:
> - rebase to minor changes in tip
> - removing RFT from adding jump_labels to CFS
> - additional hierarchical period constraint

Could you rebase to -tip + my patches? Most of your previous set is
already queued there. The reason it's not in -tip is that the merge
window fallout still has -tip in a somewhat shaky state.

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
2011-07-27 21:58 ` Jason Baron
2011-08-05 3:53 ` Paul Turner
@ 2011-08-05 3:55 ` Paul Turner
2011-08-05 18:28 ` Jason Baron
2011-08-05 8:30 ` Peter Zijlstra
2 siblings, 1 reply; 14+ messages in thread
From: Paul Turner @ 2011-08-05 3:55 UTC (permalink / raw)
To: Jason Baron
Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, rth

> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -10,7 +10,7 @@ obj-y = sched.o fork.o exec_domain.o panic.o printk.o \
>      kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
>      hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
>      notifier.o ksysfs.o pm_qos_params.o sched_clock.o cred.o \
> -    async.o range.o jump_label.o
> +    async.o range.o
>  obj-y += groups.o
>
>  ifdef CONFIG_FUNCTION_TRACER
> @@ -107,6 +107,7 @@ obj-$(CONFIG_PERF_EVENTS) += events/
>  obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
>  obj-$(CONFIG_PADATA) += padata.o
>  obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
> +obj-$(CONFIG_JUMP_LABEL) += jump_label.o
>
>  ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
>  # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
>

Tested-by: Paul Turner <pjt@google.com>

Let me know if you need any result tables for the actual commit msg.

Same goes for making CONFIG_JUMP_LABEL equivalent to default in
CC_HAS_ASM_GOTO case (at least on x86 anyway).

>
> I've tested the patch using a single 'static_branch()' in the getppid() path,
> and basically running tight loops of calls to getppid(). Before the
> patch, I was seeing results similar to what you reported; after the
> patch, things improved for all metrics. Here are my results for the
> branch disabled case:
>
> With jump labels turned on (CONFIG_JUMP_LABEL), branch disabled:
>
> Performance counter stats for 'bash -c /tmp/timing;true' (50 runs):
>
>   3,969,510,217 instructions   # 0.864 IPC  ( +-0.000% )
>   4,592,334,954 cycles                      ( +- 0.046% )
>     751,634,470 branches                    ( +- 0.000% )
>
>     1.722635797 seconds time elapsed        ( +- 0.046% )
>
> Jump labels turned off (CONFIG_JUMP_LABEL not set), branch disabled:
>
> Performance counter stats for 'bash -c /tmp/timing;true' (50 runs):
>
>   4,009,611,846 instructions   # 0.867 IPC  ( +-0.000% )
>   4,622,210,580 cycles                      ( +- 0.012% )
>     771,662,904 branches                    ( +- 0.000% )
>
>     1.734341454 seconds time elapsed        ( +- 0.022% )
>
>
> So all of the measured metrics improved in the jump labels case b/w
> 0.5% - 2.5%.
>
> I'm curious to see what you find with this patch.
>
> Thanks,
>
> -Jason
>
>
>

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
2011-08-05 3:55 ` Paul Turner
@ 2011-08-05 18:28 ` Jason Baron
0 siblings, 0 replies; 14+ messages in thread
From: Jason Baron @ 2011-08-05 18:28 UTC (permalink / raw)
To: Paul Turner
Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
rth, rostedt

On Thu, Aug 04, 2011 at 08:55:08PM -0700, Paul Turner wrote:
> > --- a/kernel/Makefile
> > +++ b/kernel/Makefile
> > @@ -10,7 +10,7 @@ obj-y = sched.o fork.o exec_domain.o panic.o printk.o \
> >      kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
> >      hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
> >      notifier.o ksysfs.o pm_qos_params.o sched_clock.o cred.o \
> > -    async.o range.o jump_label.o
> > +    async.o range.o
> >  obj-y += groups.o
> >
> >  ifdef CONFIG_FUNCTION_TRACER
> > @@ -107,6 +107,7 @@ obj-$(CONFIG_PERF_EVENTS) += events/
> >  obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
> >  obj-$(CONFIG_PADATA) += padata.o
> >  obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
> > +obj-$(CONFIG_JUMP_LABEL) += jump_label.o
> >
> >  ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
> >  # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
> >
>
> Tested-by: Paul Turner <pjt@google.com>
>
> Let me know if you need any result tables for the actual commit msg.

Hi Paul,

Thanks for taking the time to test this :) I'll post the patch shortly
with my own testing results. Hopefully, it can still be considered for
3.1 b/c of the non-invasive nature of the patch...

> Same goes for making CONFIG_JUMP_LABEL equivalent to default in
> CC_HAS_ASM_GOTO case (at least on x86 anyway).
>

I originally had CONFIG_JUMP_LABEL implicitly turned on, but we ran
into a 32-bit compiler issue that was causing random, nasty crashes.
That issue has since been resolved in gcc, but we might need to update
the CC_HAS_ASM_GOTO check to deal with that case better. Currently,
we're using the '-maccumulate-outgoing-args' gcc option to work around
the issue for 32-bit x86 (see: arch/x86/Makefile_32.cpu).

With the jump label interface somewhat stabilizing (I say somewhat, b/c
Peter brought up a good use case in the scheduler that it currently
doesn't address, but which we should be able to support without too
much churn) and these testing results, I think it might make sense to
consider turning it on by default for 3.2. Thoughts?

Thanks,

-Jason

>
> >
> > I've tested the patch using a single 'static_branch()' in the getppid() path,
> > and basically running tight loops of calls to getppid(). Before the
> > patch, I was seeing results similar to what you reported; after the
> > patch, things improved for all metrics. Here are my results for the
> > branch disabled case:
> >
> > With jump labels turned on (CONFIG_JUMP_LABEL), branch disabled:
> >
> > Performance counter stats for 'bash -c /tmp/timing;true' (50 runs):
> >
> >   3,969,510,217 instructions   # 0.864 IPC  ( +-0.000% )
> >   4,592,334,954 cycles                      ( +- 0.046% )
> >     751,634,470 branches                    ( +- 0.000% )
> >
> >     1.722635797 seconds time elapsed        ( +- 0.046% )
> >
> > Jump labels turned off (CONFIG_JUMP_LABEL not set), branch disabled:
> >
> > Performance counter stats for 'bash -c /tmp/timing;true' (50 runs):
> >
> >   4,009,611,846 instructions   # 0.867 IPC  ( +-0.000% )
> >   4,622,210,580 cycles                      ( +- 0.012% )
> >     771,662,904 branches                    ( +- 0.000% )
> >
> >     1.734341454 seconds time elapsed        ( +- 0.022% )
> >
> >
> > So all of the measured metrics improved in the jump labels case b/w
> > 0.5% - 2.5%.
> >
> > I'm curious to see what you find with this patch.
> >
> > Thanks,
> >
> > -Jason

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
2011-07-27 21:58 ` Jason Baron
2011-08-05 3:53 ` Paul Turner
2011-08-05 3:55 ` Paul Turner
@ 2011-08-05 8:30 ` Peter Zijlstra
2011-08-05 15:11 ` Richard Henderson
2 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2011-08-05 8:30 UTC (permalink / raw)
To: Jason Baron
Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, rth

On Wed, 2011-07-27 at 17:58 -0400, Jason Baron wrote:
> Ok, I think I finally tracked this down. It may seem a bit crazy, but
> when we are getting down to cycle counting like this, it seems that the
> link order in the kernel/Makefile can make a difference. I had the
> jump_label.o listed after the core files, whereas all the code in
> jump_label.o is really slow path code (used when toggling branch
> values). As follows:
>
>
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -10,7 +10,7 @@ obj-y = sched.o fork.o exec_domain.o panic.o printk.o \
>      kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
>      hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
>      notifier.o ksysfs.o pm_qos_params.o sched_clock.o cred.o \
> -    async.o range.o jump_label.o
> +    async.o range.o
>  obj-y += groups.o
>
>  ifdef CONFIG_FUNCTION_TRACER
> @@ -107,6 +107,7 @@ obj-$(CONFIG_PERF_EVENTS) += events/
>  obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
>  obj-$(CONFIG_PADATA) += padata.o
>  obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
> +obj-$(CONFIG_JUMP_LABEL) += jump_label.o

OK, so _WHY_ does that make a difference and will a next version of
gnu-binutils not mess that up?

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
2011-08-05 8:30 ` Peter Zijlstra
@ 2011-08-05 15:11 ` Richard Henderson
2011-08-05 15:14 ` Peter Zijlstra
2011-08-05 15:24 ` Jason Baron
0 siblings, 2 replies; 14+ messages in thread
From: Richard Henderson @ 2011-08-05 15:11 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Jason Baron, Paul Turner, linux-kernel, Bharata B Rao,
Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
Srivatsa Vaddagiri, Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar,
Pavel Emelyanov

On 08/05/2011 01:30 AM, Peter Zijlstra wrote:
> OK, so _WHY_ does that make a difference and will a next version of
> gnu-binutils not mess that up?

The Why is micro-architectural, and I can't answer that.

But ld will never re-order the files as given on the command-line.
There are too many functions and tables that are constructed
piece-wise from input sections; re-ordering them would change
the semantics of the program.


r~

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
2011-08-05 15:11 ` Richard Henderson
@ 2011-08-05 15:14 ` Peter Zijlstra
2011-08-05 15:24 ` Jason Baron
1 sibling, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2011-08-05 15:14 UTC (permalink / raw)
To: Richard Henderson
Cc: Jason Baron, Paul Turner, linux-kernel, Bharata B Rao,
Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
Srivatsa Vaddagiri, Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar,
Pavel Emelyanov

On Fri, 2011-08-05 at 08:11 -0700, Richard Henderson wrote:
> On 08/05/2011 01:30 AM, Peter Zijlstra wrote:
> > OK, so _WHY_ does that make a difference and will a next version of
> > gnu-binutils not mess that up?
>
> The Why is micro-architectural, and I can't answer that.
>
> But ld will never re-order the files as given on the command-line.
> There are too many functions and tables that are constructed
> piece-wise from input sections; re-ordering them would change
> the semantics of the program.

Right, so I was wondering about things like whole-program-optimization
passes at link time.

Since I've no clue why the proposed patch does what it does, it's hard
to say what invariant needs to be kept.

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
2011-08-05 15:11 ` Richard Henderson
2011-08-05 15:14 ` Peter Zijlstra
@ 2011-08-05 15:24 ` Jason Baron
1 sibling, 0 replies; 14+ messages in thread
From: Jason Baron @ 2011-08-05 15:24 UTC (permalink / raw)
To: Richard Henderson, a.p.zijlstra
Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

On Fri, Aug 05, 2011 at 08:11:15AM -0700, Richard Henderson wrote:
> On 08/05/2011 01:30 AM, Peter Zijlstra wrote:
> > OK, so _WHY_ does that make a difference and will a next version of
> > gnu-binutils not mess that up?
>
> The Why is micro-architectural, and I can't answer that.

In tracking this down, I eventually found that just having the
jump_label.o file compiled into the kernel, without actually using
static_branch() or 'asm goto' anywhere, led to a performance hit.
Thus, neither the compiler nor the 'asm goto' itself was actually
causing any degradation. Since the jump_label.o file is only slow-path
code, it can be moved away from core or heavily called kernel routines.
I suspect this is probably an icache issue, but I can't say for sure.

Thanks,

-Jason

>
> But ld will never re-order the files as given on the command-line.
> There are too many functions and tables that are constructed
> piece-wise from input sections; re-ordering them would change
> the semantics of the program.
>
>
> r~

^ permalink raw reply [flat|nested] 14+ messages in thread
* [patch 00/18] CFS Bandwidth Control v7.2
@ 2011-07-21 16:43 Paul Turner
2011-07-21 16:43 ` [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive Paul Turner
0 siblings, 1 reply; 14+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
To: linux-kernel
Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

Hi all,

Please find attached the incremental v7.2 for bandwidth control.

This release follows a fairly intensive period of scraping cycles across
various configurations. Unfortunately we seem to be currently taking an IPC
hit for jump_labels (despite a savings in branches/instr. ret) which despite
fairly extensive digging I don't have a good explanation for. The emitted
assembly /looks/ ok, but cycles/wall time is consistently higher across several
platforms.

As such I've demoted the jump label patch to [RFT] while these details are
worked out. But there's no point in holding up the rest of the series any more.

[ Please find the specific discussion related to the above attached to patch
17/18. ]

So -- without jump labels -- the current performance looks like:

                      instructions       cycles             branches
---------------------------------------------------------------------------------------------
clovertown [!BWC]     843695716          965744453          151224759
+unconstrained        845934117 (+0.27)  974222228 (+0.88)  152715407 (+0.99)
+10000000000/1000:    855102086 (+1.35)  978728348 (+1.34)  154495984 (+2.16)
+10000000000/1000000: 853981660 (+1.22)  976344561 (+1.10)  154287243 (+2.03)

barcelona [!BWC]      810514902          761071312          145351489
+unconstrained        820573353 (+1.24)  748178486 (-1.69)  148161233 (+1.93)
+10000000000/1000:    827963132 (+2.15)  757829815 (-0.43)  149611950 (+2.93)
+10000000000/1000000: 827701516 (+2.12)  753575001 (-0.98)  149568284 (+2.90)

westmere [!BWC]       792513879          702882443          143267136
+unconstrained        802533191 (+1.26)  694415157 (-1.20)  146071233 (+1.96)
+10000000000/1000:    809861594 (+2.19)  701781996 (-0.16)  147520953 (+2.97)
+10000000000/1000000: 809752541 (+2.18)  705278419 (+0.34)  147502154 (+2.96)

Under the workload:
mkdir -p /cgroup/cpu/test
echo $$ > /dev/cgroup/cpu/test (only cpu,cpuacct mounted)
(W1) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "for ((i=0;i<5;i++)); do $(dirname $0)/pipe-test 20000; done"

This may seem a strange work-load but it works around some bizarro overheads
currently introduced by perf.
Comparing for example with:
(W2) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;true"
(W3) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;"

We see:
(W1) westmere [!BWC] 792513879 702882443 143267136 0.197246943
(W2) westmere [!BWC] 912241728 772576786 165734252 0.214923134
(W3) westmere [!BWC] 904349725 882084726 162577399 0.748506065

vs an 'ideal' total exec time of (approximately):
$ time taskset -c 0 ./pipe-test 100000
real 0m0.198
user 0m0.007s
sys 0m0.095s

The overhead in W2 is explained by the fact that, when invoking pipe-test directly, one of the siblings becomes the perf_ctx parent, invoking lots of pain every time we switch. I do not have a reasonable explanation as to why (W1) is so much cheaper than (W2); I stumbled across it by accident when I was trying some combinations to reduce the <perf stat>-to-<perf stat> variance.

v7.2
-----------
- Build errors in !CGROUP_SCHED case fixed
- !CONFIG_SMP now 'supported' (#ifdef munging)
- gcc was failing to inline account_cfs_rq_runtime, affecting performance
- checks in expire_cfs_rq_runtime() and check_enqueue_throttle() re-organized to save branches.
- jump labels introduced in the case BWC is not being used system-wide to reduce inert overhead.
- branch saved in expiring runtime (reorganize conditionals)

Hidetoshi, the following patchsets have changed enough to necessitate tweaking of your Reviewed-by:
[patch 09/18] sched: add support for unthrottling group entities (extensive)
[patch 11/18] sched: prevent interactions with throttled entities (update_cfs_shares)
[patch 12/18] sched: prevent buddy interactions with throttled entities (new)

Previous postings:
-----------------
v7.1: https://lkml.org/lkml/2011/7/7/24
v7: http://lkml.org/lkml/2011/6/21/43
v6: http://lkml.org/lkml/2011/5/7/37
v5: http://lkml.org/lkml/2011/3/22/477
v4: http://lkml.org/lkml/2011/2/23/44
v3: http://lkml.org/lkml/2010/10/12/44
v2: http://lkml.org/lkml/2010/4/28/88
Original posting: http://lkml.org/lkml/2010/2/12/393
Prior approaches: http://lkml.org/lkml/2010/1/5/44 ["CFS Hard limits v5"]

Thanks,

- Paul

^ permalink raw reply [flat|nested] 14+ messages in thread
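The parenthesised figures in the cover-letter table are relative differences, in percent, against the [!BWC] baseline row of each machine. As a rough sketch of that arithmetic (the helper names rel_diff_pct and print_row are made up for illustration and are not part of the series), the rows can be recomputed from the raw counters like this:

```c
#include <assert.h>
#include <stdio.h>

/* Relative change of `value` against `baseline`, in percent. */
static double rel_diff_pct(unsigned long long baseline,
			   unsigned long long value)
{
	assert(baseline != 0);	/* a zero baseline has no relative change */
	return ((double)value - (double)baseline) * 100.0 / (double)baseline;
}

/* Print one counter in the "value (+x.xx)" style used by the tables. */
static void print_row(const char *label, unsigned long long baseline,
		      unsigned long long value)
{
	printf("%-22s %llu (%+.2f)\n", label, value,
	       rel_diff_pct(baseline, value));
}
```

For example, calling print_row("+unconstrained", 843695716ULL, 845934117ULL) with the clovertown instruction counts from the table should reproduce the (+0.27) entry, and the barcelona cycle counts (761071312 against 748178486) give the negative (-1.69) entry.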
* [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
0 siblings, 0 replies; 14+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
To: linux-kernel
Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-add_jump_labels.patch --]
[-- Type: text/plain, Size: 16125 bytes --]

So I'm seeing some strange costs associated with jump_labels; while on paper the branches and instructions retired improve (as expected) we're taking an unexpected hit in IPC.

[From the initial mail we have workloads:
mkdir -p /cgroup/cpu/test
echo $$ > /dev/cgroup/cpu/test (only cpu,cpuacct mounted)
(W1) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "for ((i=0;i<5;i++)); do $(dirname $0)/pipe-test 20000; done"
(W2) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;true"
(W3) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;"
]

To make some of the figures more clear:

Legend:
!BWC = tip + bwc, BWC compiled out
BWC = tip + bwc
BWC_JL = tip + bwc + jump label (this patch)

Now, comparing under W1 we see:
W1: BWC vs BWC_JL
                      instructions        cycles              branches            elapsed
---------------------------------------------------------------------------------------------------------------------
clovertown [BWC]      845934117           974222228           152715407           0.419014188 [baseline]
+unconstrained        857963815 (+1.42)   1007152750 (+3.38)  153140328 (+0.28)   0.433186926 (+3.38) [rel]
+10000000000/1000:    876937753 (+2.55)   1033978705 (+5.65)  160038434 (+3.59)   0.443638365 (+5.66) [rel]
+10000000000/1000000: 880276838 (+3.08)   1036176245 (+6.13)  160683878 (+4.15)   0.444577244 (+6.14) [rel]

barcelona [BWC]       820573353           748178486           148161233           0.342122850 [baseline]
+unconstrained        817011602 (-0.43)   759838181 (+1.56)   145951513 (-1.49)   0.347462571 (+1.56) [rel]
+10000000000/1000:    830109086 (+0.26)   770451537 (+1.67)   151228902 (+1.08)   0.350824677 (+1.65) [rel]
+10000000000/1000000: 830196206 (+0.30)   770704213 (+2.27)   151250413 (+1.12)   0.350962182 (+2.28) [rel]

westmere [BWC]        802533191           694415157           146071233           0.194428018 [baseline]
+unconstrained        799057936 (-0.43)   751384496 (+8.20)   143875513 (-1.50)   0.211182620 (+8.62) [rel]
+10000000000/1000:    812033785 (+0.27)   761469084 (+8.51)   149134146 (+1.09)   0.212149229 (+8.28) [rel]
+10000000000/1000000: 811912834 (+0.27)   757842988 (+7.45)   149113291 (+1.09)   0.211364804 (+7.30) [rel]

e.g. Barcelona issues ~0.43% fewer instructions, for a total of 817011602, in the unconstrained case with BWC.

Where "unconstrained, 10000000000/1000, 10000000000/1000000" are the measurements for BWC_JL, with (%d) being the relative difference to their BWC counterparts.

W1: BWC vs BWC_JL is very similar.
BWC vs BWC_JL
clovertown [BWC]      985732031           1283113452          175621212           1.375905653
+unconstrained        979242938 (-0.66)   1288971141 (+0.46)  172122546 (-1.99)   1.389795165 (+1.01) [rel]
+10000000000/1000:    999886468 (+0.33)   1296597143 (+1.13)  180554004 (+1.62)   1.392576770 (+1.18) [rel]
+10000000000/1000000: 999034223 (+0.11)   1293925500 (+0.57)  180413829 (+1.39)   1.391041338 (+0.94) [rel]

barcelona [BWC]       982139920           1078757792          175417574           1.069537049
+unconstrained        965443672 (-1.70)   1075377223 (-0.31)  170215844 (-2.97)   1.045595065 (-2.24) [rel]
+10000000000/1000:    989104943 (+0.05)   1100836668 (+0.52)  178837754 (+1.22)   1.058730316 (-1.77) [rel]
+10000000000/1000000: 987627489 (-0.32)   1095843758 (-0.17)  178567411 (+0.84)   1.056100899 (-2.28) [rel]

westmere [BWC]        918633403           896047900           166496917           0.754629182
+unconstrained        914740541 (-0.42)   903906801 (+0.88)   163652848 (-1.71)   0.758050332 (+0.45) [rel]
+10000000000/1000:    927517377 (-0.41)   952579771 (+5.67)   170173060 (+0.75)   0.771193786 (+2.43) [rel]
+10000000000/1000000: 914676985 (-0.89)   936106277 (+3.81)   167683288 (+0.22)   0.764973632 (+1.38) [rel]

Now this is rather odd, almost across the board we're seeing the expected drops in instructions and branches, yet we appear to be paying a heavy IPC price. The fact that wall-time has scaled equivalently with cycles roughly rules out the cycles counter being off.

We are seeing the expected behavior in the bandwidth enabled case; specifically the <jl=jmp><ret><cond><ret> blocks are taking an extra branch and instruction which shows up on all the numbers above.

With respect to compiler mangling the text is essentially unchanged in size. One lurking suspicion is whether the inserted nops have perturbed some of the jmp/branch alignments?

text    data    bss     dec      hex    filename
7277206 2827256 2125824 12230286 ba9e8e vmlinux.jump_label
7276886 2826744 2125824 12229454 ba9b4e vmlinux.no_jump_label

I have checked to make sure that the right instructions are being patched in at run-time.
I've also pulled a fully patched jump_label out of the kernel into a userspace test (and benchmarked it directly under perf). The results here are also exactly as expected.

e.g.
Performance counter stats for './jump_test':
  1,500,839,002 instructions, 300,147,081 branches
  702,468,404 cycles
Performance counter stats for './jump_test 1':
  2,001,014,609 instructions, 400,177,192 branches
  901,758,219 cycles

Overall if we can fix the IPC the benefit in the globally unconstrained case looks really good.

Any thoughts Jason?

-----
Some more raw data:

perf-stat_to_perf-stat variance in performance for W1:

BWC_JL vs BWC_JL (sample run-to-run variance on JL measurements)
                      instructions        cycles              branches            elapsed
---------------------------------------------------------------------------------------------------------------------
clovertown [BWC_JL]   857963815           1007152750          153140328           0.433186926
+unconstrained        856457537 (-0.18)   986820040 (-2.02)   152871983 (-0.18)   0.424187340 (-2.08) [rel]
+10000000000/1000:    880281114 (+0.38)   1009349419 (-2.38)  160668480 (+0.39)   0.433031825 (-2.39) [rel]
+10000000000/1000000: 881001883 (+0.08)   1008445782 (-2.68)  160811824 (+0.08)   0.432629132 (-2.69) [rel]

barcelona [BWC_JL]    817011602           759838181           145951513           0.347462571
+unconstrained        817076246 (+0.01)   758404044 (-0.19)   145958670 (+0.00)   0.346313238 (-0.33) [rel]
+10000000000/1000:    830087089 (-0.00)   773100724 (+0.34)   151218674 (-0.01)   0.352047450 (+0.35) [rel]
+10000000000/1000000: 830002149 (-0.02)   773209942 (+0.33)   151208657 (-0.03)   0.352090862 (+0.32) [rel]

westmere [BWC_JL]     799057936           751384496           143875513           0.211182620
+unconstrained        799067664 (+0.00)   751165910 (-0.03)   143877385 (+0.00)   0.210928554 (-0.12) [rel]
+10000000000/1000:    812040497 (+0.00)   748711039 (-1.68)   149135568 (+0.00)   0.208868390 (-1.55) [rel]
+10000000000/1000000: 811911208 (-0.00)   746860347 (-1.45)   149113194 (-0.00)   0.208663627 (-1.28) [rel]

BWC vs BWC (sample run-to-run variance on BWC measurements)
ilium [BWC]           845934117           974222228           152715407           0.419014188
+unconstrained        849061624 (+0.37)   965568244 (-0.89)   153288606 (+0.38)   0.415287406 (-0.89) [rel]
+10000000000/1000:    861138018 (+0.71)   975979688 (-0.28)   155594606 (+0.71)   0.418710227 (-0.28) [rel]
+10000000000/1000000: 858768659 (+0.56)   972288157 (-0.42)   155163198 (+0.57)   0.417130144 (-0.42) [rel]

barcelona [BWC]       820573353           748178486           148161233           0.342122850
+unconstrained        820494225 (-0.01)   748302946 (+0.02)   148147559 (-0.01)   0.341349438 (-0.23) [rel]
+10000000000/1000:    827929735 (-0.00)   756163375 (-0.22)   149609111 (-0.00)   0.344356113 (-0.22) [rel]
+10000000000/1000000: 827682550 (-0.00)   759867539 (+0.84)   149565408 (-0.00)   0.346039855 (+0.84) [rel]

westmere [BWC]        802533191           694415157           146071233           0.194428018
+unconstrained        802648805 (+0.01)   698052899 (+0.52)   146099982 (+0.02)   0.195632318 (+0.62) [rel]
+10000000000/1000:    809855427 (-0.00)   703633926 (+0.26)   147519800 (-0.00)   0.196545542 (+0.32) [rel]
+10000000000/1000000: 809646717 (-0.01)   704895639 (-0.05)   147476169 (-0.02)   0.197022787 (+0.01) [rel]

Raw Westmere measurements:

BWC:
Case: Unconstrained -1
Performance counter stats for 'bash -c for ((i=0;i<5;i++)); do ./pipe-test 20000; done' (50 runs):
  802533191 instructions # 1.156 IPC ( +- 0.004% )
  694415157 cycles ( +- 0.165% )
  146071233 branches ( +- 0.003% )
  0.194428018 seconds time elapsed ( +- 0.437% )

Case: 10000000000/1000:
Performance counter stats for 'bash -c for ((i=0;i<5;i++)); do ./pipe-test 20000; done' (50 runs):
  809861594 instructions # 1.154 IPC ( +- 0.016% )
  701781996 cycles ( +- 0.184% )
  147520953 branches ( +- 0.022% )
  0.195928354 seconds time elapsed ( +- 0.262% )

Case: 10000000000/1000000:
Performance counter stats for 'bash -c for ((i=0;i<5;i++)); do ./pipe-test 20000; done' (50 runs):
  809752541 instructions # 1.148 IPC ( +- 0.016% )
  705278419 cycles ( +- 0.593% )
  147502154 branches ( +- 0.022% )
  0.196993502 seconds time elapsed ( +- 0.698% )

BWC_JL
Case: Unconstrained -1
Performance counter stats for 'bash -c for ((i=0;i<5;i++)); do ./pipe-test 20000; done' (50 runs):
  799057936 instructions # 1.063 IPC ( +- 0.001% )
  751384496 cycles ( +- 0.584% )
  143875513 branches ( +- 0.001% )
  0.211182620 seconds time elapsed ( +- 0.771% )

Case: 10000000000/1000:
Performance counter stats for 'bash -c for ((i=0;i<5;i++)); do ./pipe-test 20000; done' (50 runs):
  812033785 instructions # 1.066 IPC ( +- 0.017% )
  761469084 cycles ( +- 0.125% )
  149134146 branches ( +- 0.022% )
  0.212149229 seconds time elapsed ( +- 0.171% )

Case: 10000000000/1000000:
Performance counter stats for 'bash -c for ((i=0;i<5;i++)); do ./pipe-test 20000; done' (50 runs):
  811912834 instructions # 1.071 IPC ( +- 0.017% )
  757842988 cycles ( +- 0.158% )
  149113291 branches ( +- 0.022% )
  0.211364804 seconds time elapsed ( +- 0.225% )

Let me know if there's any particular raw data you want, westmere seems the most interesting because it's taking the biggest hit.

-------

From: Paul Turner <pjt@google.com>

When no groups within the system are constrained we can use jump labels to reduce overheads -- skipping the per-cfs_rq runtime enabled checks.

Signed-off-by: Paul Turner <pjt@google.com>
---
 kernel/sched.c      |   33 +++++++++++++++++++++++++++++++--
 kernel/sched_fair.c |   15 ++++++++++++---
 2 files changed, 43 insertions(+), 5 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -71,6 +71,7 @@
 #include <linux/ctype.h>
 #include <linux/ftrace.h>
 #include <linux/slab.h>
+#include <linux/jump_label.h>
 
 #include <asm/tlb.h>
 #include <asm/irq_regs.h>
@@ -499,7 +500,32 @@ static void destroy_cfs_bandwidth(struct
 	hrtimer_cancel(&cfs_b->period_timer);
 	hrtimer_cancel(&cfs_b->slack_timer);
 }
-#else
+
+#ifdef HAVE_JUMP_LABEL
+static struct jump_label_key __cfs_bandwidth_enabled;
+
+static inline bool cfs_bandwidth_enabled(void)
+{
+	return static_branch(&__cfs_bandwidth_enabled);
+}
+
+static void account_cfs_bandwidth_enabled(int enabled, int was_enabled)
+{
+	/* only need to count groups transitioning between enabled/!enabled */
+	if (enabled && !was_enabled)
+		jump_label_inc(&__cfs_bandwidth_enabled);
+	else if (!enabled && was_enabled)
+		jump_label_dec(&__cfs_bandwidth_enabled);
+}
+#else /* !HAVE_JUMP_LABEL */
+/* static_branch doesn't help unless supported */
+static int cfs_bandwidth_enabled(void)
+{
+	return 1;
+}
+static void account_cfs_bandwidth_enabled(int enabled, int was_enabled) {}
+#endif /* HAVE_JUMP_LABEL */
+#else /* !CONFIG_CFS_BANDWIDTH */
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
@@ -9025,7 +9051,7 @@ static int __cfs_schedulable(struct task
 
 static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 {
-	int i, ret = 0, runtime_enabled;
+	int i, ret = 0, runtime_enabled, runtime_was_enabled;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
 
 	if (tg == &root_task_group)
@@ -9053,6 +9079,9 @@ static int tg_set_cfs_bandwidth(struct t
 		goto out_unlock;
 
 	runtime_enabled = quota != RUNTIME_INF;
+	runtime_was_enabled = cfs_b->quota != RUNTIME_INF;
+	account_cfs_bandwidth_enabled(runtime_enabled, runtime_was_enabled);
+
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1430,7 +1430,7 @@ static void __account_cfs_rq_runtime(str
 static __always_inline void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 						   unsigned long delta_exec)
 {
-	if (!cfs_rq->runtime_enabled)
+	if (!cfs_bandwidth_enabled() || !cfs_rq->runtime_enabled)
 		return;
 
 	__account_cfs_rq_runtime(cfs_rq, delta_exec);
@@ -1438,13 +1438,13 @@ static __always_inline void account_cfs_
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {
-	return cfs_rq->throttled;
+	return cfs_bandwidth_enabled() && cfs_rq->throttled;
 }
 
 /* check whether cfs_rq, or any parent, is throttled */
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 {
-	return cfs_rq->throttle_count;
+	return cfs_bandwidth_enabled() && cfs_rq->throttle_count;
 }
 
 /*
@@ -1765,6 +1765,9 @@ static void __return_cfs_rq_runtime(stru
 
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
+	if (!cfs_bandwidth_enabled())
+		return;
+
 	if (!cfs_rq->runtime_enabled || !cfs_rq->nr_running)
 		return;
 
@@ -1810,6 +1813,9 @@ static void do_sched_cfs_slack_timer(str
  */
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
 {
+	if (!cfs_bandwidth_enabled())
+		return;
+
 	/* an active group must be handled by the update_curr()->put() path */
 	if (!cfs_rq->runtime_enabled || cfs_rq->curr)
 		return;
@@ -1827,6 +1833,9 @@ static void check_enqueue_throttle(struc
 /* conditionally throttle active cfs_rq's from put_prev_entity() */
 static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
+	if (!cfs_bandwidth_enabled())
+		return;
+
 	if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
 		return;
 

^ permalink raw reply [flat|nested] 14+ messages in thread
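The accounting in account_cfs_bandwidth_enabled() only touches the key when a group moves between constrained and unconstrained, so the global check stays off until at least one group has a finite quota. A minimal userspace model of that counting is sketched below. It is an illustration only: a plain int stands in for the jump label key, and the names constrained_groups, bandwidth_enabled, struct group, set_quota and QUOTA_INF are hypothetical stand-ins rather than kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for RUNTIME_INF: "no quota set". */
#define QUOTA_INF (~0ULL)

/* Stand-in for the jump label key: number of constrained groups. */
static int constrained_groups;

/* Models cfs_bandwidth_enabled(): true once any group is constrained. */
static bool bandwidth_enabled(void)
{
	return constrained_groups > 0;
}

/* Only transitions between enabled and !enabled change the count. */
static void account_bandwidth_enabled(bool enabled, bool was_enabled)
{
	if (enabled && !was_enabled) {
		constrained_groups++;
	} else if (!enabled && was_enabled) {
		assert(constrained_groups > 0);	/* never drop below zero */
		constrained_groups--;
	}
}

struct group {
	unsigned long long quota;
};

/* Mirrors the ordering in tg_set_cfs_bandwidth(): account, then store. */
static void set_quota(struct group *g, unsigned long long quota)
{
	bool enabled = quota != QUOTA_INF;
	bool was_enabled = g->quota != QUOTA_INF;

	account_bandwidth_enabled(enabled, was_enabled);
	g->quota = quota;
}
```

In this model, changing an already constrained group's quota from one finite value to another leaves the count alone, which is the point of passing both the new and the previous state.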