linux-kernel.vger.kernel.org archive mirror
* (no subject)
@ 2011-07-22  0:32 Jason Baron
  2011-07-22  0:57 ` Paul Turner
  0 siblings, 1 reply; 14+ messages in thread
From: Jason Baron @ 2011-07-22  0:32 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov

rth@redhat.com
Bcc: 
Subject: Re: [RFT][patch 17/18] sched: use jump labels to reduce overhead
 when bandwidth control is inactive
Reply-To: 
In-Reply-To: <20110721184758.403388616@google.com>

On Thu, Jul 21, 2011 at 09:43:42AM -0700, Paul Turner wrote:
> So I'm seeing some strange costs associated with jump_labels; while on paper
> the branches and instructions retired improves (as expected) we're taking an
> unexpected hit in IPC.
> 
> [From the initial mail we have workloads:
>   mkdir -p /cgroup/cpu/test
>   echo $$ > /dev/cgroup/cpu/test (only cpu,cpuacct mounted)
>   (W1) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "for ((i=0;i<5;i++)); do $(dirname $0)/pipe-test 20000; done"
>   (W2) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;true"
>   (W3) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;"
> ]
> 
> To make some of the figures more clear:
> 
> Legend:
> !BWC = tip + bwc, BWC compiled out
> BWC = tip + bwc
> BWC_JL = tip + bwc + jump label (this patch)
> 
> 
> Now, comparing under W1 we see:
> W1: BWC vs BWC_JL
>                             instructions            cycles                  branches              elapsed                
> ---------------------------------------------------------------------------------------------------------------------
> clovertown [BWC]            845934117               974222228               152715407             0.419014188 [baseline]
> +unconstrained              857963815 (+1.42)      1007152750 (+3.38)       153140328 (+0.28)     0.433186926 (+3.38)  [rel]
> +10000000000/1000:          876937753 (+2.55)      1033978705 (+5.65)       160038434 (+3.59)     0.443638365 (+5.66)  [rel]
> +10000000000/1000000:       880276838 (+3.08)      1036176245 (+6.13)       160683878 (+4.15)     0.444577244 (+6.14)  [rel]
> 
> barcelona [BWC]             820573353               748178486               148161233             0.342122850 [baseline] 
> +unconstrained              817011602 (-0.43)       759838181 (+1.56)       145951513 (-1.49)     0.347462571 (+1.56)  [rel]
> +10000000000/1000:          830109086 (+0.26)       770451537 (+1.67)       151228902 (+1.08)     0.350824677 (+1.65)  [rel]
> +10000000000/1000000:       830196206 (+0.30)       770704213 (+2.27)       151250413 (+1.12)     0.350962182 (+2.28)  [rel]
> 
> westmere [BWC]              802533191               694415157               146071233             0.194428018 [baseline]
> +unconstrained              799057936 (-0.43)       751384496 (+8.20)       143875513 (-1.50)     0.211182620 (+8.62)  [rel]
> +10000000000/1000:          812033785 (+0.27)       761469084 (+8.51)       149134146 (+1.09)     0.212149229 (+8.28)  [rel]
> +10000000000/1000000:       811912834 (+0.27)       757842988 (+7.45)       149113291 (+1.09)     0.211364804 (+7.30)  [rel]
> e.g. Barcelona issues ~0.43% fewer instructions, for a total of 817011602, in
> the unconstrained case with BWC_JL relative to BWC.
> 
> 
> Where "unconstrained, 10000000000/1000, 10000000000/10000" are the on
> measurements for BWC_JL, with (%d) being the relative difference to their
> BWC counterparts.
> 
> W3: BWC vs BWC_JL is very similar.
> 	BWC vs BWC_JL
> clovertown [BWC]            985732031              1283113452               175621212             1.375905653  
> +unconstrained              979242938 (-0.66)      1288971141 (+0.46)       172122546 (-1.99)     1.389795165 (+1.01)  [rel]
> +10000000000/1000:          999886468 (+0.33)      1296597143 (+1.13)       180554004 (+1.62)     1.392576770 (+1.18)  [rel]
> +10000000000/1000000:       999034223 (+0.11)      1293925500 (+0.57)       180413829 (+1.39)     1.391041338 (+0.94)  [rel]
> 
> barcelona [BWC]             982139920              1078757792               175417574             1.069537049  
> +unconstrained              965443672 (-1.70)      1075377223 (-0.31)       170215844 (-2.97)     1.045595065 (-2.24)  [rel]
> +10000000000/1000:          989104943 (+0.05)      1100836668 (+0.52)       178837754 (+1.22)     1.058730316 (-1.77)  [rel]
> +10000000000/1000000:       987627489 (-0.32)      1095843758 (-0.17)       178567411 (+0.84)     1.056100899 (-2.28)  [rel]
> 
> westmere [BWC]              918633403               896047900               166496917             0.754629182  
> +unconstrained              914740541 (-0.42)       903906801 (+0.88)       163652848 (-1.71)     0.758050332 (+0.45)  [rel]
> +10000000000/1000:          927517377 (-0.41)       952579771 (+5.67)       170173060 (+0.75)     0.771193786 (+2.43)  [rel]
> +10000000000/1000000:       914676985 (-0.89)       936106277 (+3.81)       167683288 (+0.22)     0.764973632 (+1.38)  [rel]
> 
> Now this is rather odd: almost across the board we're seeing the expected
> drops in instructions and branches, yet we appear to be paying a heavy IPC
> price.  The fact that wall-time has scaled equivalently with cycles roughly
> rules out the cycles counter being off.
> 
> We are seeing the expected behavior in the bandwidth enabled case;
> specifically the <jl=jmp><ret><cond><ret> blocks are taking an extra branch
> and instruction which shows up on all the numbers above.
> 
> With respect to compiler mangling, the text is essentially unchanged in size.
> One lurking suspicion is whether the inserted nops have perturbed some of the
> jmp/branch alignments?
> 
>     text    data     bss     dec     hex filename
>  7277206 2827256 2125824 12230286         ba9e8e vmlinux.jump_label
>  7276886 2826744 2125824 12229454         ba9b4e vmlinux.no_jump_label
>  
>  I have checked to make sure that the right instructions are being patched in
>  at run-time.  I've also pulled a fully patched jump_label out of the kernel
>  into a userspace test (and benchmarked it directly under perf).  The results
>  here are also exactly as expected.
> 
> e.g.
>  Performance counter stats for './jump_test':
>      1,500,839,002 instructions, 300,147,081 branches, 702,468,404 cycles
> Performance counter stats for './jump_test 1':
>      2,001,014,609 instructions, 400,177,192 branches, 901,758,219 cycles
> 
> Overall if we can fix the IPC the benefit in the globally unconstrained case
> looks really good.
> 
> Any thoughts Jason?
> 

Do you have CONFIG_CC_OPTIMIZE_FOR_SIZE set? I know that when
CONFIG_CC_OPTIMIZE_FOR_SIZE is not set, the compiler can optimize the
code more aggressively.

thanks,

-Jason

* [patch 00/18] CFS Bandwidth Control v7.2
@ 2011-07-21 16:43 Paul Turner
  2011-07-21 16:43 ` [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive Paul Turner
  0 siblings, 1 reply; 14+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

Hi all,

Please find attached the incremental v7.2 for bandwidth control.

This release follows a fairly intensive period of scraping cycles across
various configurations.  Unfortunately we currently seem to be taking an IPC
hit for jump_labels (despite a savings in branches/instructions retired) for
which, even after fairly extensive digging, I don't have a good explanation.
The emitted assembly /looks/ ok, but cycles/wall time is consistently higher
across several platforms.

As such I've demoted the jump-label patch to [RFT] while these details are
worked out.  But there's no point in holding up the rest of the series any
longer.
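For anyone who hasn't been following the series: the idea behind the jump-label
patch is to hide the bandwidth-control hooks behind a statically patched
branch, so that while no group has a quota configured the fast path costs
essentially a nop.  A minimal sketch of that pattern using the current
static_branch()/jump_label_inc() interface -- the key and helper names below
are illustrative only, not the identifiers from patch 17/18:

  /* Illustrative sketch only -- names are invented for this example. */
  #include <linux/jump_label.h>

  static struct jump_label_key cfs_bw_used_key;

  static inline bool cfs_bw_used(void)
  {
          /*
           * With CONFIG_JUMP_LABEL this compiles to a nop on the disabled
           * fast path and is runtime-patched to a jump once the key becomes
           * non-zero; without it, it degrades to a plain load-and-test.
           */
          return static_branch(&cfs_bw_used_key);
  }

  /* Flip the key when the first group gains, or the last group loses, quota. */
  static void cfs_bw_usage_inc(void)
  {
          jump_label_inc(&cfs_bw_used_key);
  }

  static void cfs_bw_usage_dec(void)
  {
          jump_label_dec(&cfs_bw_used_key);
  }

The hot call sites then reduce to "if (!cfs_bw_used()) return;", which is
roughly the shape behind the <jl=jmp><ret><cond><ret> blocks discussed against
patch 17/18.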

[ Please find the specific discussion related to the above attached to patch 
17/18. ]

So -- without jump labels -- the current performance looks like:

                            instructions            cycles                  branches         
---------------------------------------------------------------------------------------------
clovertown [!BWC]           843695716               965744453               151224759        
+unconstrained              845934117 (+0.27)       974222228 (+0.88)       152715407 (+0.99)
+10000000000/1000:          855102086 (+1.35)       978728348 (+1.34)       154495984 (+2.16)
+10000000000/1000000:       853981660 (+1.22)       976344561 (+1.10)       154287243 (+2.03)

barcelona [!BWC]            810514902               761071312               145351489        
+unconstrained              820573353 (+1.24)       748178486 (-1.69)       148161233 (+1.93)
+10000000000/1000:          827963132 (+2.15)       757829815 (-0.43)       149611950 (+2.93)
+10000000000/1000000:       827701516 (+2.12)       753575001 (-0.98)       149568284 (+2.90)

westmere [!BWC]             792513879               702882443               143267136        
+unconstrained              802533191 (+1.26)       694415157 (-1.20)       146071233 (+1.96)
+10000000000/1000:          809861594 (+2.19)       701781996 (-0.16)       147520953 (+2.97)
+10000000000/1000000:       809752541 (+2.18)       705278419 (+0.34)       147502154 (+2.96)

Under the workload:
  mkdir -p /cgroup/cpu/test
  echo $$ > /dev/cgroup/cpu/test (only cpu,cpuacct mounted)
  (W1) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "for ((i=0;i<5;i++)); do $(dirname $0)/pipe-test 20000; done"

This may seem a strange workload but it works around some bizarro overheads
currently introduced by perf.  Comparing for example with:
  (W2) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;true"
  (W3) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;"


We see: 
 (W1)  westmere [!BWC]             792513879               702882443               143267136             0.197246943  
 (W2)  westmere [!BWC]             912241728               772576786               165734252             0.214923134  
 (W3)  westmere [!BWC]             904349725               882084726               162577399             0.748506065  

vs an 'ideal' total exec time of (approximately):
$ time taskset -c 0 ./pipe-test 100000
 real    0m0.198s  user    0m0.007s  sys     0m0.095s

The overhead in (W3) is explained by the fact that, when pipe-test is invoked
directly, one of the siblings becomes the perf_ctx parent, which costs us a lot
of pain every time we switch.  I do not have a reasonable explanation as to why
(W1) is so much cheaper than (W2); I stumbled across it by accident while
trying various combinations to reduce the <perf stat>-to-<perf stat> variance.
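
For context, pipe-test is the usual two-task ping-pong micro-benchmark: a
parent and child bounce a byte back and forth over a pair of pipes N times, so
the measured cost is dominated by context switching.  A minimal self-contained
reconstruction for illustration (my sketch, not the exact binary measured
above):

  /* pipe-test style ping-pong: parent and child pass one byte N times. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/wait.h>

  int main(int argc, char **argv)
  {
          int n = argc > 1 ? atoi(argv[1]) : 100000;
          int ab[2], ba[2];       /* parent->child and child->parent pipes */
          char c = 0;
          int i;

          if (pipe(ab) || pipe(ba)) {
                  perror("pipe");
                  return 1;
          }

          if (fork() == 0) {
                  for (i = 0; i < n; i++)
                          if (read(ab[0], &c, 1) != 1 || write(ba[1], &c, 1) != 1)
                                  return 1;
                  return 0;
          }

          for (i = 0; i < n; i++)
                  if (write(ab[1], &c, 1) != 1 || read(ba[0], &c, 1) != 1)
                          return 1;
          wait(NULL);
          return 0;
  }

Something along these lines can be run under "perf stat --repeat ..." exactly
as in (W1)-(W3) above to gather the same instructions/branches/cycles counts.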

v7.2
-----------
- Build errors in !CGROUP_SCHED case fixed
- !CONFIG_SMP now 'supported' (#ifdef munging)
- gcc was failing to inline account_cfs_rq_runtime, affecting performance (see
  the illustrative sketch after this list)
- checks in expire_cfs_rq_runtime() and check_enqueue_throttle() re-organized
  to save branches.
- jump labels introduced in the case BWC is not being used system-wide to
  reduce inert overhead.
- branch saved in expiring runtime (reorganized conditionals)
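
On the inlining item above: one common way to take the decision out of gcc's
hands is to force it.  The stub below is illustrative only -- the real function
body lives in the series, and the field name here is an assumption:

  /* Force inlining of a small, hot accounting helper (stub body). */
  static __always_inline void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
                                                     unsigned long delta_exec)
  {
          if (!cfs_rq->runtime_enabled)   /* assumed field, for illustration */
                  return;
          /* ... charge delta_exec against the group's remaining quota ... */
  }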

Hidetoshi, the following patches have changed enough to necessitate tweaking
of your Reviewed-by:
[patch 09/18] sched: add support for unthrottling group entities (extensive)
[patch 11/18] sched: prevent interactions with throttled entities (update_cfs_shares)
[patch 12/18] sched: prevent buddy interactions with throttled entities (new)


Previous postings:
-----------------
v7.1: https://lkml.org/lkml/2011/7/7/24
v7: http://lkml.org/lkml/2011/6/21/43
v6: http://lkml.org/lkml/2011/5/7/37
v5: http://lkml.org/lkml/2011/3/22/477
v4: http://lkml.org/lkml/2011/2/23/44
v3: http://lkml.org/lkml/2010/10/12/44
v2: http://lkml.org/lkml/2010/4/28/88
Original posting: http://lkml.org/lkml/2010/2/12/393

Prior approaches: http://lkml.org/lkml/2010/1/5/44 ["CFS Hard limits v5"]

Thanks,

- Paul



Thread overview: 14+ messages
2011-07-22  0:32 Jason Baron
2011-07-22  0:57 ` Paul Turner
2011-07-22  1:17   ` [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive Jason Baron
2011-07-22  1:38     ` Paul Turner
2011-07-27 21:58       ` Jason Baron
2011-08-05  3:53         ` Paul Turner
2011-08-05  7:21           ` Peter Zijlstra
2011-08-05  3:55         ` Paul Turner
2011-08-05 18:28           ` Jason Baron
2011-08-05  8:30         ` Peter Zijlstra
2011-08-05 15:11           ` Richard Henderson
2011-08-05 15:14             ` Peter Zijlstra
2011-08-05 15:24             ` Jason Baron
  -- strict thread matches above, loose matches on Subject: below --
2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
2011-07-21 16:43 ` [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive Paul Turner
