* [patch 00/18] CFS Bandwidth Control v7.2
@ 2011-07-21 16:43 Paul Turner
  2011-07-21 16:43 ` [patch 01/18] sched: (fixlet) dont update shares twice on on_rq parent Paul Turner
                   ` (20 more replies)
  0 siblings, 21 replies; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

Hi all,

Please find attached the incremental v7.2 for bandwidth control.

This release follows a fairly intensive period of scraping cycles across
various configurations.  Unfortunately we currently seem to be taking an IPC
hit for jump_labels (despite a savings in branches/instructions retired) for
which, despite fairly extensive digging, I don't have a good explanation.  The
emitted assembly /looks/ ok, but cycles/wall time is consistently higher across
several platforms.

As such I've demoted the jump-label patch to [RFT] while these details are
worked out, but there's no point in holding up the rest of the series any
longer.

[ Please find the specific discussion related to the above attached to patch 
17/18. ]

So -- without jump labels -- the current performance looks like:

                            instructions            cycles                  branches         
---------------------------------------------------------------------------------------------
clovertown [!BWC]           843695716               965744453               151224759        
+unconstrained              845934117 (+0.27)       974222228 (+0.88)       152715407 (+0.99)
+10000000000/1000:          855102086 (+1.35)       978728348 (+1.34)       154495984 (+2.16)
+10000000000/1000000:       853981660 (+1.22)       976344561 (+1.10)       154287243 (+2.03)

barcelona [!BWC]            810514902               761071312               145351489        
+unconstrained              820573353 (+1.24)       748178486 (-1.69)       148161233 (+1.93)
+10000000000/1000:          827963132 (+2.15)       757829815 (-0.43)       149611950 (+2.93)
+10000000000/1000000:       827701516 (+2.12)       753575001 (-0.98)       149568284 (+2.90)

westmere [!BWC]             792513879               702882443               143267136        
+unconstrained              802533191 (+1.26)       694415157 (-1.20)       146071233 (+1.96)
+10000000000/1000:          809861594 (+2.19)       701781996 (-0.16)       147520953 (+2.97)
+10000000000/1000000:       809752541 (+2.18)       705278419 (+0.34)       147502154 (+2.96)

Under the workload:
  mkdir -p /cgroup/cpu/test
  echo $$ > /cgroup/cpu/test/tasks    (only cpu,cpuacct mounted)
  (W1) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "for ((i=0;i<5;i++)); do $(dirname $0)/pipe-test 20000; done"

This may seem a strange workload but it works around some bizarro overheads
currently introduced by perf.  Comparing, for example, with:
  (W2) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;true"
  (W3) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;"


We see (instructions, cycles, branches, elapsed time):
 (W1)  westmere [!BWC]             792513879               702882443               143267136             0.197246943  
 (W2)  westmere [!BWC]             912241728               772576786               165734252             0.214923134  
 (W3)  westmere [!BWC]             904349725               882084726               162577399             0.748506065  

vs an 'ideal' total exec time of (approximately):
$ time taskset -c 0 ./pipe-test 100000
 real    0m0.198s   user    0m0.007s   sys     0m0.095s

The overhead in W2 is explained by the fact that, when pipe-test is invoked
directly, one of the siblings becomes the perf_ctx parent, which incurs a lot
of overhead every time we switch.  I do not have a reasonable explanation as to
why (W1) is so much cheaper than (W2); I stumbled across it by accident while
trying combinations to reduce the <perf stat>-to-<perf stat> variance.

v7.2
-----------
- Build errors in !CGROUP_SCHED case fixed
- !CONFIG_SMP now 'supported' (#ifdef munging)
- gcc was failing to inline account_cfs_rq_runtime, affecting performance
- checks in expire_cfs_rq_runtime() and check_enqueue_throttle() re-organized
  to save branches.
- jump labels introduced to reduce inert overhead in the case where BWC is not
  being used system-wide
- branch saved in expiring runtime (reorganized conditionals)

Hidetoshi, the following patches have changed enough to necessitate a fresh
look before your Reviewed-by can be carried forward:
[patch 09/18] sched: add support for unthrottling group entities (extensive)
[patch 11/18] sched: prevent interactions with throttled entities (update_cfs_shares)
[patch 12/18] sched: prevent buddy interactions with throttled entities (new)


Previous postings:
-----------------
v7.1: https://lkml.org/lkml/2011/7/7/24
v7: http://lkml.org/lkml/2011/6/21/43
v6: http://lkml.org/lkml/2011/5/7/37
v5: http://lkml.org/lkml/2011/3/22/477
v4: http://lkml.org/lkml/2011/2/23/44
v3: http://lkml.org/lkml/2010/10/12/44
v2: http://lkml.org/lkml/2010/4/28/88
Original posting: http://lkml.org/lkml/2010/2/12/393

Prior approaches: http://lkml.org/lkml/2010/1/5/44 ["CFS Hard limits v5"]

Thanks,

- Paul



* [patch 01/18] sched: (fixlet) dont update shares twice on on_rq parent
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-07-22 11:06   ` Kamalesh Babulal
  2011-07-21 16:43 ` [patch 02/18] sched: hierarchical task accounting for SCHED_OTHER Paul Turner
                   ` (19 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-fix_dequeue_task_buglet.patch --]
[-- Type: text/plain, Size: 898 bytes --]

In dequeue_task_fair() we stop the dequeue when we encounter a parenting entity
with additional weight.  However, we then perform a double shares update on
this entity, since the shares-update traversal continues from this point even
though dequeue_entity() has already updated its queuing cfs_rq.
Avoid this by resuming the traversal from the parent.

Signed-off-by: Paul Turner <pjt@google.com>
---
 kernel/sched_fair.c |    3 +++
 1 file changed, 3 insertions(+)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1370,6 +1370,9 @@ static void dequeue_task_fair(struct rq 
 			 */
 			if (task_sleep && parent_entity(se))
 				set_next_buddy(parent_entity(se));
+
+			/* avoid re-evaluating load for this entity */
+			se = parent_entity(se);
 			break;
 		}
 		flags |= DEQUEUE_SLEEP;




* [patch 02/18] sched: hierarchical task accounting for SCHED_OTHER
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
  2011-07-21 16:43 ` [patch 01/18] sched: (fixlet) dont update shares twice on on_rq parent Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-08-14 16:15   ` [tip:sched/core] sched: Implement " tip-bot for Paul Turner
  2011-07-21 16:43 ` [patch 03/18] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
                   ` (18 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-account_nr_running.patch --]
[-- Type: text/plain, Size: 4530 bytes --]

Introduce hierarchical task accounting for the group scheduling case in CFS,
and promote the responsibility for maintaining rq->nr_running into the
scheduling classes.

The primary motivation is that, with scheduling classes supporting bandwidth
throttling, it is possible for entities participating in throttled sub-trees to
produce no root-visible change in rq->nr_running across activate and
de-activate operations.  This in turn leads to incorrect idle and
weight-per-task load-balance decisions.

This also allows us to make a small fixlet to the fastpath in pick_next_task()
under group scheduling.

Note: this issue also exists with the existing sched_rt throttling mechanism.
This patch does not address that.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c          |    6 ++----
 kernel/sched_fair.c     |   10 ++++++++--
 kernel/sched_rt.c       |    5 ++++-
 kernel/sched_stoptask.c |    2 ++
 4 files changed, 16 insertions(+), 7 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -308,7 +308,7 @@ struct task_group root_task_group;
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
 	struct load_weight load;
-	unsigned long nr_running;
+	unsigned long nr_running, h_nr_running;
 
 	u64 exec_clock;
 	u64 min_vruntime;
@@ -1830,7 +1830,6 @@ static void activate_task(struct rq *rq,
 		rq->nr_uninterruptible--;
 
 	enqueue_task(rq, p, flags);
-	inc_nr_running(rq);
 }
 
 /*
@@ -1842,7 +1841,6 @@ static void deactivate_task(struct rq *r
 		rq->nr_uninterruptible++;
 
 	dequeue_task(rq, p, flags);
-	dec_nr_running(rq);
 }
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
@@ -4194,7 +4192,7 @@ pick_next_task(struct rq *rq)
 	 * Optimization: we know that if all tasks are in
 	 * the fair class we can call that function directly:
 	 */
-	if (likely(rq->nr_running == rq->cfs.nr_running)) {
+	if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
 		p = fair_sched_class.pick_next_task(rq);
 		if (likely(p))
 			return p;
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1332,16 +1332,19 @@ enqueue_task_fair(struct rq *rq, struct 
 			break;
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, flags);
+		cfs_rq->h_nr_running++;
 		flags = ENQUEUE_WAKEUP;
 	}
 
 	for_each_sched_entity(se) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+		cfs_rq = cfs_rq_of(se);
+		cfs_rq->h_nr_running++;
 
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
+	inc_nr_running(rq);
 	hrtick_update(rq);
 }
 
@@ -1361,6 +1364,7 @@ static void dequeue_task_fair(struct rq 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, flags);
+		cfs_rq->h_nr_running--;
 
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
@@ -1379,12 +1383,14 @@ static void dequeue_task_fair(struct rq 
 	}
 
 	for_each_sched_entity(se) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+		cfs_rq = cfs_rq_of(se);
+		cfs_rq->h_nr_running--;
 
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
+	dec_nr_running(rq);
 	hrtick_update(rq);
 }
 
Index: tip/kernel/sched_rt.c
===================================================================
--- tip.orig/kernel/sched_rt.c
+++ tip/kernel/sched_rt.c
@@ -961,6 +961,8 @@ enqueue_task_rt(struct rq *rq, struct ta
 
 	if (!task_current(rq, p) && p->rt.nr_cpus_allowed > 1)
 		enqueue_pushable_task(rq, p);
+
+	inc_nr_running(rq);
 }
 
 static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
@@ -971,6 +973,8 @@ static void dequeue_task_rt(struct rq *r
 	dequeue_rt_entity(rt_se);
 
 	dequeue_pushable_task(rq, p);
+
+	dec_nr_running(rq);
 }
 
 /*
@@ -1863,4 +1867,3 @@ static void print_rt_stats(struct seq_fi
 	rcu_read_unlock();
 }
 #endif /* CONFIG_SCHED_DEBUG */
-
Index: tip/kernel/sched_stoptask.c
===================================================================
--- tip.orig/kernel/sched_stoptask.c
+++ tip/kernel/sched_stoptask.c
@@ -34,11 +34,13 @@ static struct task_struct *pick_next_tas
 static void
 enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags)
 {
+	inc_nr_running(rq);
 }
 
 static void
 dequeue_task_stop(struct rq *rq, struct task_struct *p, int flags)
 {
+	dec_nr_running(rq);
 }
 
 static void yield_task_stop(struct rq *rq)




* [patch 03/18] sched: introduce primitives to account for CFS bandwidth tracking
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
  2011-07-21 16:43 ` [patch 01/18] sched: (fixlet) dont update shares twice on on_rq parent Paul Turner
  2011-07-21 16:43 ` [patch 02/18] sched: hierarchical task accounting for SCHED_OTHER Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-07-22 11:14   ` Kamalesh Babulal
  2011-08-14 16:17   ` [tip:sched/core] sched: Introduce " tip-bot for Paul Turner
  2011-07-21 16:43 ` [patch 04/18] sched: validate CFS quota hierarchies Paul Turner
                   ` (17 subsequent siblings)
  20 siblings, 2 replies; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron,
	Nikhil Rao

[-- Attachment #1: sched-bwc-add_cfs_tg_bandwidth.patch --]
[-- Type: text/plain, Size: 10048 bytes --]

In this patch we introduce the notion of CFS bandwidth, partitioned into
globally unassigned bandwidth and locally claimed bandwidth.

- The global bandwidth is per task_group; it represents a pool of unclaimed
  bandwidth that cfs_rqs can allocate from.
- The local bandwidth is tracked per-cfs_rq; it represents allotments from the
  global pool assigned to a specific cpu.

Bandwidth is managed via cgroupfs by adding two new interfaces to the cpu
subsystem:
- cpu.cfs_period_us : the bandwidth period in usecs
- cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg will be allowed
  to consume over the period above.
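
As a usage sketch (the /cgroup/cpu mount point and the group name 'test' are
arbitrary), giving a group roughly half a cpu of bandwidth looks like:

  mkdir -p /cgroup/cpu/test
  # 50ms of quota every 100ms period => ~0.5 cpu
  echo 100000 > /cgroup/cpu/test/cpu.cfs_period_us
  echo 50000  > /cgroup/cpu/test/cpu.cfs_quota_us
  # a quota of -1 (the default) leaves the group unconstrained
  echo -1     > /cgroup/cpu/test/cpu.cfs_quota_us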

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 init/Kconfig        |   12 +++
 kernel/sched.c      |  196 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched_fair.c |   16 ++++
 3 files changed, 220 insertions(+), 4 deletions(-)

Index: tip/init/Kconfig
===================================================================
--- tip.orig/init/Kconfig
+++ tip/init/Kconfig
@@ -715,6 +715,18 @@ config FAIR_GROUP_SCHED
 	depends on CGROUP_SCHED
 	default CGROUP_SCHED
 
+config CFS_BANDWIDTH
+	bool "CPU bandwidth provisioning for FAIR_GROUP_SCHED"
+	depends on EXPERIMENTAL
+	depends on FAIR_GROUP_SCHED
+	default n
+	help
+	  This option allows users to define CPU bandwidth rates (limits) for
+	  tasks running within the fair group scheduler.  Groups with no limit
+	  set are considered to be unconstrained and will run with no
+	  restriction.
+	  See tip/Documentation/scheduler/sched-bwc.txt for more information.
+
 config RT_GROUP_SCHED
 	bool "Group scheduling for SCHED_RR/FIFO"
 	depends on EXPERIMENTAL
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -244,6 +244,14 @@ struct cfs_rq;
 
 static LIST_HEAD(task_groups);
 
+struct cfs_bandwidth {
+#ifdef CONFIG_CFS_BANDWIDTH
+	raw_spinlock_t lock;
+	ktime_t period;
+	u64 quota;
+#endif
+};
+
 /* task group related information */
 struct task_group {
 	struct cgroup_subsys_state css;
@@ -275,6 +283,8 @@ struct task_group {
 #ifdef CONFIG_SCHED_AUTOGROUP
 	struct autogroup *autogroup;
 #endif
+
+	struct cfs_bandwidth cfs_bandwidth;
 };
 
 /* task_group_lock serializes the addition/removal of task groups */
@@ -374,9 +384,48 @@ struct cfs_rq {
 
 	unsigned long load_contribution;
 #endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	int runtime_enabled;
+	s64 runtime_remaining;
+#endif
 #endif
 };
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+#ifdef CONFIG_CFS_BANDWIDTH
+static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
+{
+	return &tg->cfs_bandwidth;
+}
+
+static inline u64 default_cfs_period(void);
+
+static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+	raw_spin_lock_init(&cfs_b->lock);
+	cfs_b->quota = RUNTIME_INF;
+	cfs_b->period = ns_to_ktime(default_cfs_period());
+}
+
+static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	cfs_rq->runtime_enabled = 0;
+}
+
+static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+{}
+#else
+static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
+static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
+static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
+
+static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
+{
+	return NULL;
+}
+#endif /* CONFIG_CFS_BANDWIDTH */
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
 /* Real-Time classes' related field in a runqueue: */
 struct rt_rq {
 	struct rt_prio_array active;
@@ -7795,6 +7844,7 @@ static void init_tg_cfs_entry(struct tas
 	tg->cfs_rq[cpu] = cfs_rq;
 	init_cfs_rq(cfs_rq, rq);
 	cfs_rq->tg = tg;
+	init_cfs_rq_runtime(cfs_rq);
 
 	tg->se[cpu] = se;
 	/* se could be NULL for root_task_group */
@@ -7930,6 +7980,7 @@ void __init sched_init(void)
 		 * We achieve this by letting root_task_group's tasks sit
 		 * directly in rq->cfs (i.e root_task_group->se[] = NULL).
 		 */
+		init_cfs_bandwidth(&root_task_group.cfs_bandwidth);
 		init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL);
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -8171,6 +8222,8 @@ static void free_fair_sched_group(struct
 {
 	int i;
 
+	destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
+
 	for_each_possible_cpu(i) {
 		if (tg->cfs_rq)
 			kfree(tg->cfs_rq[i]);
@@ -8198,6 +8251,8 @@ int alloc_fair_sched_group(struct task_g
 
 	tg->shares = NICE_0_LOAD;
 
+	init_cfs_bandwidth(tg_cfs_bandwidth(tg));
+
 	for_each_possible_cpu(i) {
 		cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
 				      GFP_KERNEL, cpu_to_node(i));
@@ -8569,7 +8624,7 @@ static int __rt_schedulable(struct task_
 	return walk_tg_tree(tg_schedulable, tg_nop, &data);
 }
 
-static int tg_set_bandwidth(struct task_group *tg,
+static int tg_set_rt_bandwidth(struct task_group *tg,
 		u64 rt_period, u64 rt_runtime)
 {
 	int i, err = 0;
@@ -8608,7 +8663,7 @@ int sched_group_set_rt_runtime(struct ta
 	if (rt_runtime_us < 0)
 		rt_runtime = RUNTIME_INF;
 
-	return tg_set_bandwidth(tg, rt_period, rt_runtime);
+	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
 }
 
 long sched_group_rt_runtime(struct task_group *tg)
@@ -8633,7 +8688,7 @@ int sched_group_set_rt_period(struct tas
 	if (rt_period == 0)
 		return -EINVAL;
 
-	return tg_set_bandwidth(tg, rt_period, rt_runtime);
+	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
 }
 
 long sched_group_rt_period(struct task_group *tg)
@@ -8823,6 +8878,128 @@ static u64 cpu_shares_read_u64(struct cg
 
 	return (u64) scale_load_down(tg->shares);
 }
+
+#ifdef CONFIG_CFS_BANDWIDTH
+const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
+const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
+
+static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
+{
+	int i;
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+	static DEFINE_MUTEX(mutex);
+
+	if (tg == &root_task_group)
+		return -EINVAL;
+
+	/*
+	 * Ensure we have at some amount of bandwidth every period.  This is
+	 * to prevent reaching a state of large arrears when throttled via
+	 * entity_tick() resulting in prolonged exit starvation.
+	 */
+	if (quota < min_cfs_quota_period || period < min_cfs_quota_period)
+		return -EINVAL;
+
+	/*
+	 * Likewise, bound things on the otherside by preventing insane quota
+	 * periods.  This also allows us to normalize in computing quota
+	 * feasibility.
+	 */
+	if (period > max_cfs_quota_period)
+		return -EINVAL;
+
+	mutex_lock(&mutex);
+	raw_spin_lock_irq(&cfs_b->lock);
+	cfs_b->period = ns_to_ktime(period);
+	cfs_b->quota = quota;
+	raw_spin_unlock_irq(&cfs_b->lock);
+
+	for_each_possible_cpu(i) {
+		struct cfs_rq *cfs_rq = tg->cfs_rq[i];
+		struct rq *rq = rq_of(cfs_rq);
+
+		raw_spin_lock_irq(&rq->lock);
+		cfs_rq->runtime_enabled = quota != RUNTIME_INF;
+		cfs_rq->runtime_remaining = 0;
+		raw_spin_unlock_irq(&rq->lock);
+	}
+	mutex_unlock(&mutex);
+
+	return 0;
+}
+
+int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
+{
+	u64 quota, period;
+
+	period = ktime_to_ns(tg_cfs_bandwidth(tg)->period);
+	if (cfs_quota_us < 0)
+		quota = RUNTIME_INF;
+	else
+		quota = (u64)cfs_quota_us * NSEC_PER_USEC;
+
+	return tg_set_cfs_bandwidth(tg, period, quota);
+}
+
+long tg_get_cfs_quota(struct task_group *tg)
+{
+	u64 quota_us;
+
+	if (tg_cfs_bandwidth(tg)->quota == RUNTIME_INF)
+		return -1;
+
+	quota_us = tg_cfs_bandwidth(tg)->quota;
+	do_div(quota_us, NSEC_PER_USEC);
+
+	return quota_us;
+}
+
+int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
+{
+	u64 quota, period;
+
+	period = (u64)cfs_period_us * NSEC_PER_USEC;
+	quota = tg_cfs_bandwidth(tg)->quota;
+
+	if (period <= 0)
+		return -EINVAL;
+
+	return tg_set_cfs_bandwidth(tg, period, quota);
+}
+
+long tg_get_cfs_period(struct task_group *tg)
+{
+	u64 cfs_period_us;
+
+	cfs_period_us = ktime_to_ns(tg_cfs_bandwidth(tg)->period);
+	do_div(cfs_period_us, NSEC_PER_USEC);
+
+	return cfs_period_us;
+}
+
+static s64 cpu_cfs_quota_read_s64(struct cgroup *cgrp, struct cftype *cft)
+{
+	return tg_get_cfs_quota(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_quota_write_s64(struct cgroup *cgrp, struct cftype *cftype,
+				s64 cfs_quota_us)
+{
+	return tg_set_cfs_quota(cgroup_tg(cgrp), cfs_quota_us);
+}
+
+static u64 cpu_cfs_period_read_u64(struct cgroup *cgrp, struct cftype *cft)
+{
+	return tg_get_cfs_period(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_period_write_u64(struct cgroup *cgrp, struct cftype *cftype,
+				u64 cfs_period_us)
+{
+	return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
+}
+
+#endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -8857,6 +9034,18 @@ static struct cftype cpu_files[] = {
 		.write_u64 = cpu_shares_write_u64,
 	},
 #endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		.name = "cfs_quota_us",
+		.read_s64 = cpu_cfs_quota_read_s64,
+		.write_s64 = cpu_cfs_quota_write_s64,
+	},
+	{
+		.name = "cfs_period_us",
+		.read_u64 = cpu_cfs_period_read_u64,
+		.write_u64 = cpu_cfs_period_write_u64,
+	},
+#endif
 #ifdef CONFIG_RT_GROUP_SCHED
 	{
 		.name = "rt_runtime_us",
@@ -9166,4 +9355,3 @@ struct cgroup_subsys cpuacct_subsys = {
 	.subsys_id = cpuacct_subsys_id,
 };
 #endif	/* CONFIG_CGROUP_CPUACCT */
-
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1256,6 +1256,22 @@ entity_tick(struct cfs_rq *cfs_rq, struc
 		check_preempt_tick(cfs_rq, curr);
 }
 
+
+/**************************************************
+ * CFS bandwidth control machinery
+ */
+
+#ifdef CONFIG_CFS_BANDWIDTH
+/*
+ * default period for cfs group bandwidth.
+ * default: 0.1s, units: nanoseconds
+ */
+static inline u64 default_cfs_period(void)
+{
+	return 100000000ULL;
+}
+#endif
+
 /**************************************************
  * CFS operations on tasks:
  */




* [patch 04/18] sched: validate CFS quota hierarchies
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (2 preceding siblings ...)
  2011-07-21 16:43 ` [patch 03/18] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-08-14 16:19   ` [tip:sched/core] sched: Validate " tip-bot for Paul Turner
  2011-07-21 16:43 ` [patch 05/18] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth Paul Turner
                   ` (16 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-consistent_quota.patch --]
[-- Type: text/plain, Size: 5712 bytes --]

Add constraint validation for CFS bandwidth hierarchies.

Validate that:
   max(child bandwidth) <= parent_bandwidth

In a quota limited hierarchy, an unconstrained entity
(e.g. bandwidth==RUNTIME_INF) inherits the bandwidth of its parent.

This constraint is chosen over sum(child_bandwidth) because the notion of
over-commit is valuable within SCHED_OTHER.  Some basic code from the RT case
is re-factored for reuse.
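
As a hypothetical illustration (group names and the /cgroup/cpu mount point are
arbitrary), with bandwidth compared as the normalized ratio quota/period:

  # parent: 50000us quota per 100000us period => 0.5 cpu
  echo 100000 > /cgroup/cpu/parent/cpu.cfs_period_us
  echo 50000  > /cgroup/cpu/parent/cpu.cfs_quota_us
  mkdir /cgroup/cpu/parent/child
  # child ratio 60000/100000 = 0.6 > 0.5 => rejected with -EINVAL
  echo 60000  > /cgroup/cpu/parent/child/cpu.cfs_quota_us
  # an unconstrained child (-1) simply inherits the parent's 0.5 limit
  echo -1     > /cgroup/cpu/parent/child/cpu.cfs_quota_us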

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched.c |  112 +++++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 98 insertions(+), 14 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -249,6 +249,7 @@ struct cfs_bandwidth {
 	raw_spinlock_t lock;
 	ktime_t period;
 	u64 quota;
+	s64 hierarchal_quota;
 #endif
 };
 
@@ -1510,7 +1511,8 @@ static inline void dec_cpu_load(struct r
 	update_load_sub(&rq->load, load);
 }
 
-#if (defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)) || defined(CONFIG_RT_GROUP_SCHED)
+#if defined(CONFIG_RT_GROUP_SCHED) || (defined(CONFIG_FAIR_GROUP_SCHED) && \
+			(defined(CONFIG_SMP) || defined(CONFIG_CFS_BANDWIDTH)))
 typedef int (*tg_visitor)(struct task_group *, void *);
 
 /*
@@ -8522,12 +8524,7 @@ unsigned long sched_group_shares(struct 
 }
 #endif
 
-#ifdef CONFIG_RT_GROUP_SCHED
-/*
- * Ensure that the real time constraints are schedulable.
- */
-static DEFINE_MUTEX(rt_constraints_mutex);
-
+#if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_CFS_BANDWIDTH)
 static unsigned long to_ratio(u64 period, u64 runtime)
 {
 	if (runtime == RUNTIME_INF)
@@ -8535,6 +8532,13 @@ static unsigned long to_ratio(u64 period
 
 	return div64_u64(runtime << 20, period);
 }
+#endif
+
+#ifdef CONFIG_RT_GROUP_SCHED
+/*
+ * Ensure that the real time constraints are schedulable.
+ */
+static DEFINE_MUTEX(rt_constraints_mutex);
 
 /* Must be called with tasklist_lock held */
 static inline int tg_has_rt_tasks(struct task_group *tg)
@@ -8555,7 +8559,7 @@ struct rt_schedulable_data {
 	u64 rt_runtime;
 };
 
-static int tg_schedulable(struct task_group *tg, void *data)
+static int tg_rt_schedulable(struct task_group *tg, void *data)
 {
 	struct rt_schedulable_data *d = data;
 	struct task_group *child;
@@ -8619,7 +8623,7 @@ static int __rt_schedulable(struct task_
 		.rt_runtime = runtime,
 	};
 
-	return walk_tg_tree(tg_schedulable, tg_nop, &data);
+	return walk_tg_tree(tg_rt_schedulable, tg_nop, &data);
 }
 
 static int tg_set_rt_bandwidth(struct task_group *tg,
@@ -8878,14 +8882,17 @@ static u64 cpu_shares_read_u64(struct cg
 }
 
 #ifdef CONFIG_CFS_BANDWIDTH
+static DEFINE_MUTEX(cfs_constraints_mutex);
+
 const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
 const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
 
+static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
+
 static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 {
-	int i;
+	int i, ret = 0;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-	static DEFINE_MUTEX(mutex);
 
 	if (tg == &root_task_group)
 		return -EINVAL;
@@ -8906,7 +8913,11 @@ static int tg_set_cfs_bandwidth(struct t
 	if (period > max_cfs_quota_period)
 		return -EINVAL;
 
-	mutex_lock(&mutex);
+	mutex_lock(&cfs_constraints_mutex);
+	ret = __cfs_schedulable(tg, period, quota);
+	if (ret)
+		goto out_unlock;
+
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
@@ -8921,9 +8932,10 @@ static int tg_set_cfs_bandwidth(struct t
 		cfs_rq->runtime_remaining = 0;
 		raw_spin_unlock_irq(&rq->lock);
 	}
-	mutex_unlock(&mutex);
+out_unlock:
+	mutex_unlock(&cfs_constraints_mutex);
 
-	return 0;
+	return ret;
 }
 
 int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
@@ -8997,6 +9009,78 @@ static int cpu_cfs_period_write_u64(stru
 	return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
 }
 
+struct cfs_schedulable_data {
+	struct task_group *tg;
+	u64 period, quota;
+};
+
+/*
+ * normalize group quota/period to be quota/max_period
+ * note: units are usecs
+ */
+static u64 normalize_cfs_quota(struct task_group *tg,
+			       struct cfs_schedulable_data *d)
+{
+	u64 quota, period;
+
+	if (tg == d->tg) {
+		period = d->period;
+		quota = d->quota;
+	} else {
+		period = tg_get_cfs_period(tg);
+		quota = tg_get_cfs_quota(tg);
+	}
+
+	/* note: these should typically be equivalent */
+	if (quota == RUNTIME_INF || quota == -1)
+		return RUNTIME_INF;
+
+	return to_ratio(period, quota);
+}
+
+static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
+{
+	struct cfs_schedulable_data *d = data;
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+	s64 quota = 0, parent_quota = -1;
+
+	if (!tg->parent) {
+		quota = RUNTIME_INF;
+	} else {
+		struct cfs_bandwidth *parent_b = tg_cfs_bandwidth(tg->parent);
+
+		quota = normalize_cfs_quota(tg, d);
+		parent_quota = parent_b->hierarchal_quota;
+
+		/*
+		 * ensure max(child_quota) <= parent_quota, inherit when no
+		 * limit is set
+		 */
+		if (quota == RUNTIME_INF)
+			quota = parent_quota;
+		else if (parent_quota != RUNTIME_INF && quota > parent_quota)
+			return -EINVAL;
+	}
+	cfs_b->hierarchal_quota = quota;
+
+	return 0;
+}
+
+static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
+{
+	struct cfs_schedulable_data data = {
+		.tg = tg,
+		.period = period,
+		.quota = quota,
+	};
+
+	if (quota != RUNTIME_INF) {
+		do_div(data.period, NSEC_PER_USEC);
+		do_div(data.quota, NSEC_PER_USEC);
+	}
+
+	return walk_tg_tree(tg_cfs_schedulable_down, tg_nop, &data);
+}
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 




* [patch 05/18] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (3 preceding siblings ...)
  2011-07-21 16:43 ` [patch 04/18] sched: validate CFS quota hierarchies Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-08-14 16:21   ` [tip:sched/core] sched: Accumulate " tip-bot for Paul Turner
  2011-07-21 16:43 ` [patch 06/18] sched: add a timer to handle CFS bandwidth refresh Paul Turner
                   ` (15 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron,
	Nikhil Rao

[-- Attachment #1: sched-bwc-account_cfs_rq_runtime.patch --]
[-- Type: text/plain, Size: 6404 bytes --]

Account bandwidth usage at the cfs_rq level rather than against the task_groups
to which they belong.  Whether we are tracking bandwidth on a given cfs_rq is
maintained under cfs_rq->runtime_enabled.

cfs_rqs which belong to a bandwidth-constrained task_group have their runtime
accounted via the update_curr() path, which withdraws bandwidth from the global
pool as needed.  Updates involving the global pool are currently protected
under cfs_bandwidth->lock; local runtime is protected by rq->lock.

This patch only assigns and tracks quota; no action is taken in the case that
cfs_rq->runtime_used exceeds cfs_rq->runtime_assigned.
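
The amount transferred from the global pool to a cfs_rq on each local refill is
governed by the new sysctl added below (default: 5ms).  For example, assuming
the usual procfs layout (values in usecs):

  cat /proc/sys/kernel/sched_cfs_bandwidth_slice_us          # default: 5000
  echo 10000 > /proc/sys/kernel/sched_cfs_bandwidth_slice_us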

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>


---
 include/linux/sched.h |    4 ++
 kernel/sched.c        |    4 +-
 kernel/sched_fair.c   |   79 ++++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sysctl.c       |   10 ++++++
 4 files changed, 94 insertions(+), 3 deletions(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -89,6 +89,20 @@ const_debug unsigned int sysctl_sched_mi
  */
 unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
 
+#ifdef CONFIG_CFS_BANDWIDTH
+/*
+ * Amount of runtime to allocate from global (tg) to local (per-cfs_rq) pool
+ * each time a cfs_rq requests quota.
+ *
+ * Note: in the case that the slice exceeds the runtime remaining (either due
+ * to consumption or the quota being specified to be smaller than the slice)
+ * we will always only issue the remaining available time.
+ *
+ * default: 5 msec, units: microseconds
+  */
+unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
+#endif
+
 static const struct sched_class fair_sched_class;
 
 /**************************************************************
@@ -305,6 +319,8 @@ find_matching_se(struct sched_entity **s
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
+static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+				   unsigned long delta_exec);
 
 /**************************************************************
  * Scheduling class tree data structure manipulation methods:
@@ -602,6 +618,8 @@ static void update_curr(struct cfs_rq *c
 		cpuacct_charge(curtask, delta_exec);
 		account_group_exec_runtime(curtask, delta_exec);
 	}
+
+	account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
 
 static inline void
@@ -1270,6 +1288,58 @@ static inline u64 default_cfs_period(voi
 {
 	return 100000000ULL;
 }
+
+static inline u64 sched_cfs_bandwidth_slice(void)
+{
+	return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
+}
+
+static void assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	struct task_group *tg = cfs_rq->tg;
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+	u64 amount = 0, min_amount;
+
+	/* note: this is a positive sum as runtime_remaining <= 0 */
+	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
+
+	raw_spin_lock(&cfs_b->lock);
+	if (cfs_b->quota == RUNTIME_INF)
+		amount = min_amount;
+	else if (cfs_b->runtime > 0) {
+		amount = min(cfs_b->runtime, min_amount);
+		cfs_b->runtime -= amount;
+	}
+	raw_spin_unlock(&cfs_b->lock);
+
+	cfs_rq->runtime_remaining += amount;
+}
+
+static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+				     unsigned long delta_exec)
+{
+	if (!cfs_rq->runtime_enabled)
+		return;
+
+	cfs_rq->runtime_remaining -= delta_exec;
+	if (cfs_rq->runtime_remaining > 0)
+		return;
+
+	assign_cfs_rq_runtime(cfs_rq);
+}
+
+static __always_inline void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+						   unsigned long delta_exec)
+{
+	if (!cfs_rq->runtime_enabled)
+		return;
+
+	__account_cfs_rq_runtime(cfs_rq, delta_exec);
+}
+
+#else
+static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+				     unsigned long delta_exec) {}
 #endif
 
 /**************************************************
@@ -4262,8 +4332,13 @@ static void set_curr_task_fair(struct rq
 {
 	struct sched_entity *se = &rq->curr->se;
 
-	for_each_sched_entity(se)
-		set_next_entity(cfs_rq_of(se), se);
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+		set_next_entity(cfs_rq, se);
+		/* ensure bandwidth has been allocated on our new cfs_rq */
+		account_cfs_rq_runtime(cfs_rq, 0);
+	}
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
Index: tip/kernel/sysctl.c
===================================================================
--- tip.orig/kernel/sysctl.c
+++ tip/kernel/sysctl.c
@@ -379,6 +379,16 @@ static struct ctl_table kern_table[] = {
 		.extra2		= &one,
 	},
 #endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		.procname	= "sched_cfs_bandwidth_slice_us",
+		.data		= &sysctl_sched_cfs_bandwidth_slice,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &one,
+	},
+#endif
 #ifdef CONFIG_PROVE_LOCKING
 	{
 		.procname	= "prove_locking",
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -248,7 +248,7 @@ struct cfs_bandwidth {
 #ifdef CONFIG_CFS_BANDWIDTH
 	raw_spinlock_t lock;
 	ktime_t period;
-	u64 quota;
+	u64 quota, runtime;
 	s64 hierarchal_quota;
 #endif
 };
@@ -403,6 +403,7 @@ static inline u64 default_cfs_period(voi
 static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
 	raw_spin_lock_init(&cfs_b->lock);
+	cfs_b->runtime = 0;
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
 }
@@ -8921,6 +8922,7 @@ static int tg_set_cfs_bandwidth(struct t
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
+	cfs_b->runtime = quota;
 	raw_spin_unlock_irq(&cfs_b->lock);
 
 	for_each_possible_cpu(i) {
Index: tip/include/linux/sched.h
===================================================================
--- tip.orig/include/linux/sched.h
+++ tip/include/linux/sched.h
@@ -2012,6 +2012,10 @@ static inline void sched_autogroup_fork(
 static inline void sched_autogroup_exit(struct signal_struct *sig) { }
 #endif
 
+#ifdef CONFIG_CFS_BANDWIDTH
+extern unsigned int sysctl_sched_cfs_bandwidth_slice;
+#endif
+
 #ifdef CONFIG_RT_MUTEXES
 extern int rt_mutex_getprio(struct task_struct *p);
 extern void rt_mutex_setprio(struct task_struct *p, int prio);




* [patch 06/18] sched: add a timer to handle CFS bandwidth refresh
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (4 preceding siblings ...)
  2011-07-21 16:43 ` [patch 05/18] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-08-14 16:23   ` [tip:sched/core] sched: Add " tip-bot for Paul Turner
  2011-07-21 16:43 ` [patch 07/18] sched: expire invalid runtime Paul Turner
                   ` (14 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-bandwidth_timers.patch --]
[-- Type: text/plain, Size: 7522 bytes --]

This patch adds a per-task_group timer which handles the refresh of the global
CFS bandwidth pool.

Since the RT pool is using a similar timer there's some small refactoring to
share this support.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c      |  107 +++++++++++++++++++++++++++++++++++++++++-----------
 kernel/sched_fair.c |   40 +++++++++++++++++--
 2 files changed, 123 insertions(+), 24 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -193,10 +193,28 @@ static inline int rt_bandwidth_enabled(v
 	return sysctl_sched_rt_runtime >= 0;
 }
 
-static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
+static void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
 {
-	ktime_t now;
+	unsigned long delta;
+	ktime_t soft, hard, now;
+
+	for (;;) {
+		if (hrtimer_active(period_timer))
+			break;
+
+		now = hrtimer_cb_get_time(period_timer);
+		hrtimer_forward(period_timer, now, period);
 
+		soft = hrtimer_get_softexpires(period_timer);
+		hard = hrtimer_get_expires(period_timer);
+		delta = ktime_to_ns(ktime_sub(hard, soft));
+		__hrtimer_start_range_ns(period_timer, soft, delta,
+					 HRTIMER_MODE_ABS_PINNED, 0);
+	}
+}
+
+static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
+{
 	if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
 		return;
 
@@ -204,22 +222,7 @@ static void start_rt_bandwidth(struct rt
 		return;
 
 	raw_spin_lock(&rt_b->rt_runtime_lock);
-	for (;;) {
-		unsigned long delta;
-		ktime_t soft, hard;
-
-		if (hrtimer_active(&rt_b->rt_period_timer))
-			break;
-
-		now = hrtimer_cb_get_time(&rt_b->rt_period_timer);
-		hrtimer_forward(&rt_b->rt_period_timer, now, rt_b->rt_period);
-
-		soft = hrtimer_get_softexpires(&rt_b->rt_period_timer);
-		hard = hrtimer_get_expires(&rt_b->rt_period_timer);
-		delta = ktime_to_ns(ktime_sub(hard, soft));
-		__hrtimer_start_range_ns(&rt_b->rt_period_timer, soft, delta,
-				HRTIMER_MODE_ABS_PINNED, 0);
-	}
+	start_bandwidth_timer(&rt_b->rt_period_timer, rt_b->rt_period);
 	raw_spin_unlock(&rt_b->rt_runtime_lock);
 }
 
@@ -250,6 +253,9 @@ struct cfs_bandwidth {
 	ktime_t period;
 	u64 quota, runtime;
 	s64 hierarchal_quota;
+
+	int idle, timer_active;
+	struct hrtimer period_timer;
 #endif
 };
 
@@ -399,6 +405,28 @@ static inline struct cfs_bandwidth *tg_c
 }
 
 static inline u64 default_cfs_period(void);
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
+
+static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
+{
+	struct cfs_bandwidth *cfs_b =
+		container_of(timer, struct cfs_bandwidth, period_timer);
+	ktime_t now;
+	int overrun;
+	int idle = 0;
+
+	for (;;) {
+		now = hrtimer_cb_get_time(timer);
+		overrun = hrtimer_forward(timer, now, cfs_b->period);
+
+		if (!overrun)
+			break;
+
+		idle = do_sched_cfs_period_timer(cfs_b, overrun);
+	}
+
+	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
+}
 
 static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
@@ -406,6 +434,9 @@ static void init_cfs_bandwidth(struct cf
 	cfs_b->runtime = 0;
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
+
+	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	cfs_b->period_timer.function = sched_cfs_period_timer;
 }
 
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
@@ -413,8 +444,34 @@ static void init_cfs_rq_runtime(struct c
 	cfs_rq->runtime_enabled = 0;
 }
 
+/* requires cfs_b->lock, may release to reprogram timer */
+static void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+	/*
+	 * The timer may be active because we're trying to set a new bandwidth
+	 * period or because we're racing with the tear-down path
+	 * (timer_active==0 becomes visible before the hrtimer call-back
+	 * terminates).  In either case we ensure that it's re-programmed
+	 */
+	while (unlikely(hrtimer_active(&cfs_b->period_timer))) {
+		raw_spin_unlock(&cfs_b->lock);
+		/* ensure cfs_b->lock is available while we wait */
+		hrtimer_cancel(&cfs_b->period_timer);
+
+		raw_spin_lock(&cfs_b->lock);
+		/* if someone else restarted the timer then we're done */
+		if (cfs_b->timer_active)
+			return;
+	}
+
+	cfs_b->timer_active = 1;
+	start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
+}
+
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
-{}
+{
+	hrtimer_cancel(&cfs_b->period_timer);
+}
 #else
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
@@ -8892,7 +8949,7 @@ static int __cfs_schedulable(struct task
 
 static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 {
-	int i, ret = 0;
+	int i, ret = 0, runtime_enabled;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
 
 	if (tg == &root_task_group)
@@ -8919,10 +8976,18 @@ static int tg_set_cfs_bandwidth(struct t
 	if (ret)
 		goto out_unlock;
 
+	runtime_enabled = quota != RUNTIME_INF;
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
 	cfs_b->runtime = quota;
+
+	/* restart the period timer (if active) to handle new period expiry */
+	if (runtime_enabled && cfs_b->timer_active) {
+		/* force a reprogram */
+		cfs_b->timer_active = 0;
+		__start_cfs_bandwidth(cfs_b);
+	}
 	raw_spin_unlock_irq(&cfs_b->lock);
 
 	for_each_possible_cpu(i) {
@@ -8930,7 +8995,7 @@ static int tg_set_cfs_bandwidth(struct t
 		struct rq *rq = rq_of(cfs_rq);
 
 		raw_spin_lock_irq(&rq->lock);
-		cfs_rq->runtime_enabled = quota != RUNTIME_INF;
+		cfs_rq->runtime_enabled = runtime_enabled;
 		cfs_rq->runtime_remaining = 0;
 		raw_spin_unlock_irq(&rq->lock);
 	}
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1306,9 +1306,16 @@ static void assign_cfs_rq_runtime(struct
 	raw_spin_lock(&cfs_b->lock);
 	if (cfs_b->quota == RUNTIME_INF)
 		amount = min_amount;
-	else if (cfs_b->runtime > 0) {
-		amount = min(cfs_b->runtime, min_amount);
-		cfs_b->runtime -= amount;
+	else {
+		/* ensure bandwidth timer remains active under consumption */
+		if (!cfs_b->timer_active)
+			__start_cfs_bandwidth(cfs_b);
+
+		if (cfs_b->runtime > 0) {
+			amount = min(cfs_b->runtime, min_amount);
+			cfs_b->runtime -= amount;
+			cfs_b->idle = 0;
+		}
 	}
 	raw_spin_unlock(&cfs_b->lock);
 
@@ -1337,6 +1344,33 @@ static __always_inline void account_cfs_
 	__account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
 
+/*
+ * Responsible for refilling a task_group's bandwidth and unthrottling its
+ * cfs_rqs as appropriate. If there has been no activity within the last
+ * period the timer is deactivated until scheduling resumes; cfs_b->idle is
+ * used to track this state.
+ */
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
+{
+	int idle = 1;
+
+	raw_spin_lock(&cfs_b->lock);
+	/* no need to continue the timer with no bandwidth constraint */
+	if (cfs_b->quota == RUNTIME_INF)
+		goto out_unlock;
+
+	idle = cfs_b->idle;
+	cfs_b->runtime = cfs_b->quota;
+
+	/* mark as potentially idle for the upcoming period */
+	cfs_b->idle = 1;
+out_unlock:
+	if (idle)
+		cfs_b->timer_active = 0;
+	raw_spin_unlock(&cfs_b->lock);
+
+	return idle;
+}
 #else
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 				     unsigned long delta_exec) {}




* [patch 07/18] sched: expire invalid runtime
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (5 preceding siblings ...)
  2011-07-21 16:43 ` [patch 06/18] sched: add a timer to handle CFS bandwidth refresh Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-08-14 16:24   ` [tip:sched/core] sched: Expire " tip-bot for Paul Turner
  2011-07-21 16:43 ` [patch 08/18] sched: add support for throttling group entities Paul Turner
                   ` (13 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-expire_cfs_rq_runtime.patch --]
[-- Type: text/plain, Size: 6398 bytes --]

Since quota is managed using global state but consumed on a per-cpu basis we
need to ensure that our per-cpu state is appropriately synchronized.
Most importantly, runtime that is stale (from a previous period) should not be
locally consumable.

We take advantage of existing sched_clock synchronization about the jiffy to
efficiently detect whether we have (globally) crossed a quota boundary.

One catch is that the direction of spread on sched_clock is undefined;
specifically, we don't know whether our local clock is behind or ahead
of the one responsible for the current expiration time.

Fortunately we can differentiate these cases by considering whether the
global deadline has advanced.  If it has not, then we assume our clock to be
"fast" and advance our local expiration; otherwise, we know the deadline has
truly passed and we expire our local runtime.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c      |    4 +-
 kernel/sched_fair.c |   90 ++++++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 84 insertions(+), 10 deletions(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1294,11 +1294,30 @@ static inline u64 sched_cfs_bandwidth_sl
 	return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
 }
 
+/*
+ * Replenish runtime according to assigned quota and update expiration time.
+ * We use sched_clock_cpu directly instead of rq->clock to avoid adding
+ * additional synchronization around rq->lock.
+ *
+ * requires cfs_b->lock
+ */
+static void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
+{
+	u64 now;
+
+	if (cfs_b->quota == RUNTIME_INF)
+		return;
+
+	now = sched_clock_cpu(smp_processor_id());
+	cfs_b->runtime = cfs_b->quota;
+	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
+}
+
 static void assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-	u64 amount = 0, min_amount;
+	u64 amount = 0, min_amount, expires;
 
 	/* note: this is a positive sum as runtime_remaining <= 0 */
 	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -1307,9 +1326,16 @@ static void assign_cfs_rq_runtime(struct
 	if (cfs_b->quota == RUNTIME_INF)
 		amount = min_amount;
 	else {
-		/* ensure bandwidth timer remains active under consumption */
-		if (!cfs_b->timer_active)
+		/*
+		 * If the bandwidth pool has become inactive, then at least one
+		 * period must have elapsed since the last consumption.
+		 * Refresh the global state and ensure bandwidth timer becomes
+		 * active.
+		 */
+		if (!cfs_b->timer_active) {
+			__refill_cfs_bandwidth_runtime(cfs_b);
 			__start_cfs_bandwidth(cfs_b);
+		}
 
 		if (cfs_b->runtime > 0) {
 			amount = min(cfs_b->runtime, min_amount);
@@ -1317,19 +1343,61 @@ static void assign_cfs_rq_runtime(struct
 			cfs_b->idle = 0;
 		}
 	}
+	expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);
 
 	cfs_rq->runtime_remaining += amount;
+	/*
+	 * we may have advanced our local expiration to account for allowed
+	 * spread between our sched_clock and the one on which runtime was
+	 * issued.
+	 */
+	if ((s64)(expires - cfs_rq->runtime_expires) > 0)
+		cfs_rq->runtime_expires = expires;
 }
 
-static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
-				     unsigned long delta_exec)
+/*
+ * Note: This depends on the synchronization provided by sched_clock and the
+ * fact that rq->clock snapshots this value.
+ */
+static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
-	if (!cfs_rq->runtime_enabled)
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	struct rq *rq = rq_of(cfs_rq);
+
+	/* if the deadline is ahead of our clock, nothing to do */
+	if (likely((s64)(rq->clock - cfs_rq->runtime_expires) < 0))
+		return;
+
+	if (cfs_rq->runtime_remaining < 0)
 		return;
 
+	/*
+	 * If the local deadline has passed we have to consider the
+	 * possibility that our sched_clock is 'fast' and the global deadline
+	 * has not truly expired.
+	 *
+	 * Fortunately we can determine whether this is the case by checking
+	 * whether the global deadline has advanced.
+	 */
+
+	if ((s64)(cfs_rq->runtime_expires - cfs_b->runtime_expires) >= 0) {
+		/* extend local deadline, drift is bounded above by 2 ticks */
+		cfs_rq->runtime_expires += TICK_NSEC;
+	} else {
+		/* global deadline is ahead, expiration has passed */
+		cfs_rq->runtime_remaining = 0;
+	}
+}
+
+static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+				     unsigned long delta_exec)
+{
+	/* dock delta_exec before expiring quota (as it could span periods) */
 	cfs_rq->runtime_remaining -= delta_exec;
-	if (cfs_rq->runtime_remaining > 0)
+	expire_cfs_rq_runtime(cfs_rq);
+
+	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
 
 	assign_cfs_rq_runtime(cfs_rq);
@@ -1360,7 +1428,12 @@ static int do_sched_cfs_period_timer(str
 		goto out_unlock;
 
 	idle = cfs_b->idle;
-	cfs_b->runtime = cfs_b->quota;
+	/* if we're going inactive then everything else can be deferred */
+	if (idle)
+		goto out_unlock;
+
+	__refill_cfs_bandwidth_runtime(cfs_b);
+
 
 	/* mark as potentially idle for the upcoming period */
 	cfs_b->idle = 1;
@@ -1579,7 +1652,6 @@ static long effective_load(struct task_g
 
 	return wl;
 }
-
 #else
 
 static inline unsigned long effective_load(struct task_group *tg, int cpu,
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -253,6 +253,7 @@ struct cfs_bandwidth {
 	ktime_t period;
 	u64 quota, runtime;
 	s64 hierarchal_quota;
+	u64 runtime_expires;
 
 	int idle, timer_active;
 	struct hrtimer period_timer;
@@ -393,6 +394,7 @@ struct cfs_rq {
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	int runtime_enabled;
+	u64 runtime_expires;
 	s64 runtime_remaining;
 #endif
 #endif
@@ -8980,8 +8982,8 @@ static int tg_set_cfs_bandwidth(struct t
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
-	cfs_b->runtime = quota;
 
+	__refill_cfs_bandwidth_runtime(cfs_b);
 	/* restart the period timer (if active) to handle new period expiry */
 	if (runtime_enabled && cfs_b->timer_active) {
 		/* force a reprogram */




* [patch 08/18] sched: add support for throttling group entities
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (6 preceding siblings ...)
  2011-07-21 16:43 ` [patch 07/18] sched: expire invalid runtime Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-08-08 15:46   ` Lin Ming
  2011-08-14 16:26   ` [tip:sched/core] sched: Add " tip-bot for Paul Turner
  2011-07-21 16:43 ` [patch 09/18] sched: add support for unthrottling " Paul Turner
                   ` (12 subsequent siblings)
  20 siblings, 2 replies; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-throttle_entities.patch --]
[-- Type: text/plain, Size: 6227 bytes --]

Now that consumption is tracked (via update_curr()) we add support to throttle
group entities (and their corresponding cfs_rqs) in the case where there is no
runtime remaining.

Throttled entities are dequeued to prevent scheduling; additionally we mark
them as throttled (using cfs_rq->throttled) to prevent them from being
re-enqueued until they are unthrottled.  A list of a task_group's throttled
entities is maintained on the cfs_bandwidth structure.

Note: While the machinery for throttling is added in this patch the act of
throttling an entity exceeding its bandwidth is deferred until later within
the series.

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched.c      |    7 ++++
 kernel/sched_fair.c |   89 +++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 92 insertions(+), 4 deletions(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1313,7 +1313,8 @@ static void __refill_cfs_bandwidth_runti
 	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
 }
 
-static void assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+/* returns 0 on failure to allocate runtime */
+static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
@@ -1354,6 +1355,8 @@ static void assign_cfs_rq_runtime(struct
 	 */
 	if ((s64)(expires - cfs_rq->runtime_expires) > 0)
 		cfs_rq->runtime_expires = expires;
+
+	return cfs_rq->runtime_remaining > 0;
 }
 
 /*
@@ -1400,7 +1403,12 @@ static void __account_cfs_rq_runtime(str
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
 
-	assign_cfs_rq_runtime(cfs_rq);
+	/*
+	 * if we're unable to extend our runtime we resched so that the active
+	 * hierarchy can be throttled
+	 */
+	if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
+		resched_task(rq_of(cfs_rq)->curr);
 }
 
 static __always_inline void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
@@ -1412,6 +1420,47 @@ static __always_inline void account_cfs_
 	__account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
 
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq->throttled;
+}
+
+static __used void throttle_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	struct rq *rq = rq_of(cfs_rq);
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	struct sched_entity *se;
+	long task_delta, dequeue = 1;
+
+	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+
+	/* account load preceding throttle */
+	update_cfs_load(cfs_rq, 0);
+
+	task_delta = cfs_rq->h_nr_running;
+	for_each_sched_entity(se) {
+		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
+		/* throttled entity or throttle-on-deactivate */
+		if (!se->on_rq)
+			break;
+
+		if (dequeue)
+			dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
+		qcfs_rq->h_nr_running -= task_delta;
+
+		if (qcfs_rq->load.weight)
+			dequeue = 0;
+	}
+
+	if (!se)
+		rq->nr_running -= task_delta;
+
+	cfs_rq->throttled = 1;
+	raw_spin_lock(&cfs_b->lock);
+	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
+	raw_spin_unlock(&cfs_b->lock);
+}
+
 /*
  * Responsible for refilling a task_group's bandwidth and unthrottling its
  * cfs_rqs as appropriate. If there has been no activity within the last
@@ -1447,6 +1496,11 @@ out_unlock:
 #else
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 				     unsigned long delta_exec) {}
+
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+	return 0;
+}
 #endif
 
 /**************************************************
@@ -1525,7 +1579,17 @@ enqueue_task_fair(struct rq *rq, struct 
 			break;
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, flags);
+
+		/*
+		 * end evaluation on encountering a throttled cfs_rq
+		 *
+		 * note: in the case of encountering a throttled cfs_rq we will
+		 * post the final h_nr_running increment below.
+		*/
+		if (cfs_rq_throttled(cfs_rq))
+			break;
 		cfs_rq->h_nr_running++;
+
 		flags = ENQUEUE_WAKEUP;
 	}
 
@@ -1533,11 +1597,15 @@ enqueue_task_fair(struct rq *rq, struct 
 		cfs_rq = cfs_rq_of(se);
 		cfs_rq->h_nr_running++;
 
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
-	inc_nr_running(rq);
+	if (!se)
+		inc_nr_running(rq);
 	hrtick_update(rq);
 }
 
@@ -1557,6 +1625,15 @@ static void dequeue_task_fair(struct rq 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, flags);
+
+		/*
+		 * end evaluation on encountering a throttled cfs_rq
+		 *
+		 * note: in the case of encountering a throttled cfs_rq we will
+		 * post the final h_nr_running decrement below.
+		*/
+		if (cfs_rq_throttled(cfs_rq))
+			break;
 		cfs_rq->h_nr_running--;
 
 		/* Don't dequeue parent if it has other entities besides us */
@@ -1579,11 +1656,15 @@ static void dequeue_task_fair(struct rq 
 		cfs_rq = cfs_rq_of(se);
 		cfs_rq->h_nr_running--;
 
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
-	dec_nr_running(rq);
+	if (!se)
+		dec_nr_running(rq);
 	hrtick_update(rq);
 }
 
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -257,6 +257,8 @@ struct cfs_bandwidth {
 
 	int idle, timer_active;
 	struct hrtimer period_timer;
+	struct list_head throttled_cfs_rq;
+
 #endif
 };
 
@@ -396,6 +398,9 @@ struct cfs_rq {
 	int runtime_enabled;
 	u64 runtime_expires;
 	s64 runtime_remaining;
+
+	int throttled;
+	struct list_head throttled_list;
 #endif
 #endif
 };
@@ -438,6 +443,7 @@ static void init_cfs_bandwidth(struct cf
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
 
+	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
 }
@@ -445,6 +451,7 @@ static void init_cfs_bandwidth(struct cf
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	cfs_rq->runtime_enabled = 0;
+	INIT_LIST_HEAD(&cfs_rq->throttled_list);
 }
 
 /* requires cfs_b->lock, may release to reprogram timer */



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [patch 09/18] sched: add support for unthrottling group entities
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (7 preceding siblings ...)
  2011-07-21 16:43 ` [patch 08/18] sched: add support for throttling group entities Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-08-14 16:27   ` [tip:sched/core] sched: Add " tip-bot for Paul Turner
  2011-07-21 16:43 ` [patch 10/18] sched: allow for positional tg_tree walks Paul Turner
                   ` (11 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-unthrottle_entities.patch --]
[-- Type: text/plain, Size: 5369 bytes --]

At the start of each period we refresh the global bandwidth pool.  At this time
we must also unthrottle any cfs_rq entities that are now within bandwidth once
more (as quota permits).

Unthrottled entities have their corresponding cfs_rq->throttled flag cleared
and their entities re-enqueued.
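
The distribution step performed at refresh can be summarized with a small
userspace sketch (toy types; illustrative only): each throttled runqueue
receives just enough runtime to bring it one unit above its deficit, until
the refreshed pool runs dry.

  /* Toy model of distribute_cfs_runtime() -- userspace sketch, not kernel code. */
  #include <stdio.h>

  #define NR_THROTTLED 3

  static long long distribute(long long remaining, long long deficit[], int n)
  {
          for (int i = 0; i < n && remaining > 0; i++) {
                  /* amount needed to bring this runqueue just above zero */
                  long long want = deficit[i] + 1;

                  if (want > remaining)
                          want = remaining;
                  remaining -= want;
                  deficit[i] -= want;     /* deficit < 0: it may run again */
          }
          return remaining;
  }

  int main(void)
  {
          long long deficit[NR_THROTTLED] = { 200, 500, 300 };   /* ns owed */
          long long left = distribute(1000, deficit, NR_THROTTLED);

          for (int i = 0; i < NR_THROTTLED; i++)
                  printf("rq%d deficit after: %lld\n", i, deficit[i]);
          printf("returned to pool: %lld\n", left);
          return 0;
  }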

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c      |    3 +
 kernel/sched_fair.c |  127 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 126 insertions(+), 4 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -9008,6 +9008,9 @@ static int tg_set_cfs_bandwidth(struct t
 		raw_spin_lock_irq(&rq->lock);
 		cfs_rq->runtime_enabled = runtime_enabled;
 		cfs_rq->runtime_remaining = 0;
+
+		if (cfs_rq_throttled(cfs_rq))
+			unthrottle_cfs_rq(cfs_rq);
 		raw_spin_unlock_irq(&rq->lock);
 	}
 out_unlock:
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1461,6 +1461,84 @@ static __used void throttle_cfs_rq(struc
 	raw_spin_unlock(&cfs_b->lock);
 }
 
+static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	struct rq *rq = rq_of(cfs_rq);
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	struct sched_entity *se;
+	int enqueue = 1;
+	long task_delta;
+
+	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+
+	cfs_rq->throttled = 0;
+	raw_spin_lock(&cfs_b->lock);
+	list_del_rcu(&cfs_rq->throttled_list);
+	raw_spin_unlock(&cfs_b->lock);
+
+	if (!cfs_rq->load.weight)
+		return;
+
+	task_delta = cfs_rq->h_nr_running;
+	for_each_sched_entity(se) {
+		if (se->on_rq)
+			enqueue = 0;
+
+		cfs_rq = cfs_rq_of(se);
+		if (enqueue)
+			enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
+		cfs_rq->h_nr_running += task_delta;
+
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+	}
+
+	if (!se)
+		rq->nr_running += task_delta;
+
+	/* determine whether we need to wake up potentially idle cpu */
+	if (rq->curr == rq->idle && rq->cfs.nr_running)
+		resched_task(rq->curr);
+}
+
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
+		u64 remaining, u64 expires)
+{
+	struct cfs_rq *cfs_rq;
+	u64 runtime = remaining;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(cfs_rq, &cfs_b->throttled_cfs_rq,
+				throttled_list) {
+		struct rq *rq = rq_of(cfs_rq);
+
+		raw_spin_lock(&rq->lock);
+		if (!cfs_rq_throttled(cfs_rq))
+			goto next;
+
+		runtime = -cfs_rq->runtime_remaining + 1;
+		if (runtime > remaining)
+			runtime = remaining;
+		remaining -= runtime;
+
+		cfs_rq->runtime_remaining += runtime;
+		cfs_rq->runtime_expires = expires;
+
+		/* we check whether we're throttled above */
+		if (cfs_rq->runtime_remaining > 0)
+			unthrottle_cfs_rq(cfs_rq);
+
+next:
+		raw_spin_unlock(&rq->lock);
+
+		if (!remaining)
+			break;
+	}
+	rcu_read_unlock();
+
+	return remaining;
+}
+
 /*
  * Responsible for refilling a task_group's bandwidth and unthrottling its
  * cfs_rqs as appropriate. If there has been no activity within the last
@@ -1469,23 +1547,64 @@ static __used void throttle_cfs_rq(struc
  */
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 {
-	int idle = 1;
+	u64 runtime, runtime_expires;
+	int idle = 1, throttled;
 
 	raw_spin_lock(&cfs_b->lock);
 	/* no need to continue the timer with no bandwidth constraint */
 	if (cfs_b->quota == RUNTIME_INF)
 		goto out_unlock;
 
-	idle = cfs_b->idle;
+	throttled = !list_empty(&cfs_b->throttled_cfs_rq);
+	/* idle depends on !throttled (for the case of a large deficit) */
+	idle = cfs_b->idle && !throttled;
+
 	/* if we're going inactive then everything else can be deferred */
 	if (idle)
 		goto out_unlock;
 
 	__refill_cfs_bandwidth_runtime(cfs_b);
 
+	if (!throttled) {
+		/* mark as potentially idle for the upcoming period */
+		cfs_b->idle = 1;
+		goto out_unlock;
+	}
+
+	/*
+	 * There are throttled entities so we must first use the new bandwidth
+	 * to unthrottle them before making it generally available.  This
+	 * ensures that all existing debts will be paid before a new cfs_rq is
+	 * allowed to run.
+	 */
+	runtime = cfs_b->runtime;
+	runtime_expires = cfs_b->runtime_expires;
+	cfs_b->runtime = 0;
+
+	/*
+	 * This check is repeated as we are holding onto the new bandwidth
+	 * while we unthrottle.  This can potentially race with an unthrottled
+	 * group trying to acquire new bandwidth from the global pool.
+	 */
+	while (throttled && runtime > 0) {
+		raw_spin_unlock(&cfs_b->lock);
+		/* we can't nest cfs_b->lock while distributing bandwidth */
+		runtime = distribute_cfs_runtime(cfs_b, runtime,
+						 runtime_expires);
+		raw_spin_lock(&cfs_b->lock);
+
+		throttled = !list_empty(&cfs_b->throttled_cfs_rq);
+	}
 
-	/* mark as potentially idle for the upcoming period */
-	cfs_b->idle = 1;
+	/* return (any) remaining runtime */
+	cfs_b->runtime = runtime;
+	/*
+	 * While we are ensured activity in the period following an
+	 * unthrottle, this also covers the case in which the new bandwidth is
+	 * insufficient to cover the existing bandwidth deficit.  (Forcing the
+	 * timer to remain active while there are any throttled entities.)
+	 */
+	cfs_b->idle = 0;
 out_unlock:
 	if (idle)
 		cfs_b->timer_active = 0;



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [patch 10/18] sched: allow for positional tg_tree walks
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (8 preceding siblings ...)
  2011-07-21 16:43 ` [patch 09/18] sched: add support for unthrottling " Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-08-14 16:29   ` [tip:sched/core] sched: Allow " tip-bot for Paul Turner
  2011-07-21 16:43 ` [patch 11/18] sched: prevent interactions with throttled entities Paul Turner
                   ` (10 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-refactor-walk_tg_tree.patch --]
[-- Type: text/plain, Size: 3532 bytes --]

Extend walk_tg_tree to accept a positional argument:

static int walk_tg_tree_from(struct task_group *from,
			     tg_visitor down, tg_visitor up, void *data)

Existing semantics are preserved; the caller must hold rcu_lock() or a
sufficient analogue.
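
For illustration, a small userspace sketch of the visitation order (simplified
types, recursion in place of the kernel's iterative goto form, and no RCU):
@down fires when a node is first entered, @up when it is left for the final
time, and the walk never climbs above 'from'.

  /* Toy model of walk_tg_tree_from() visitation order -- sketch only. */
  #include <stdio.h>

  struct node {
          const char *name;
          struct node *children[4];
          int nr_children;
  };

  typedef int (*visitor)(struct node *, void *);

  static int walk_from(struct node *from, visitor down, visitor up, void *data)
  {
          int ret = down(from, data);

          if (ret)
                  return ret;
          for (int i = 0; i < from->nr_children; i++) {
                  ret = walk_from(from->children[i], down, up, data);
                  if (ret)
                          return ret;
          }
          return up(from, data);
  }

  static int pr_down(struct node *n, void *d) { (void)d; printf("down %s\n", n->name); return 0; }
  static int pr_up(struct node *n, void *d) { (void)d; printf("up   %s\n", n->name); return 0; }

  int main(void)
  {
          struct node a = { "a" }, b = { "b" };
          struct node from = { "from", { &a, &b }, 2 };

          /* prints: down from, down a, up a, down b, up b, up from */
          walk_from(&from, pr_down, pr_up, NULL);
          return 0;
  }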

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c |   52 +++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 39 insertions(+), 13 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -1583,20 +1583,23 @@ static inline void dec_cpu_load(struct r
 typedef int (*tg_visitor)(struct task_group *, void *);
 
 /*
- * Iterate the full tree, calling @down when first entering a node and @up when
- * leaving it for the final time.
+ * Iterate task_group tree rooted at *from, calling @down when first entering a
+ * node and @up when leaving it for the final time.
+ *
+ * Caller must hold rcu_lock or sufficient equivalent.
  */
-static int walk_tg_tree(tg_visitor down, tg_visitor up, void *data)
+static int walk_tg_tree_from(struct task_group *from,
+			     tg_visitor down, tg_visitor up, void *data)
 {
 	struct task_group *parent, *child;
 	int ret;
 
-	rcu_read_lock();
-	parent = &root_task_group;
+	parent = from;
+
 down:
 	ret = (*down)(parent, data);
 	if (ret)
-		goto out_unlock;
+		goto out;
 	list_for_each_entry_rcu(child, &parent->children, siblings) {
 		parent = child;
 		goto down;
@@ -1605,19 +1608,29 @@ up:
 		continue;
 	}
 	ret = (*up)(parent, data);
-	if (ret)
-		goto out_unlock;
+	if (ret || parent == from)
+		goto out;
 
 	child = parent;
 	parent = parent->parent;
 	if (parent)
 		goto up;
-out_unlock:
-	rcu_read_unlock();
-
+out:
 	return ret;
 }
 
+/*
+ * Iterate the full tree, calling @down when first entering a node and @up when
+ * leaving it for the final time.
+ *
+ * Caller must hold rcu_lock or sufficient equivalent.
+ */
+
+static inline int walk_tg_tree(tg_visitor down, tg_visitor up, void *data)
+{
+	return walk_tg_tree_from(&root_task_group, down, up, data);
+}
+
 static int tg_nop(struct task_group *tg, void *data)
 {
 	return 0;
@@ -1711,7 +1724,9 @@ static int tg_load_down(struct task_grou
 
 static void update_h_load(long cpu)
 {
+	rcu_read_lock();
 	walk_tg_tree(tg_load_down, tg_nop, (void *)cpu);
+	rcu_read_unlock();
 }
 
 #endif
@@ -8684,13 +8699,19 @@ static int tg_rt_schedulable(struct task
 
 static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)
 {
+	int ret;
+
 	struct rt_schedulable_data data = {
 		.tg = tg,
 		.rt_period = period,
 		.rt_runtime = runtime,
 	};
 
-	return walk_tg_tree(tg_rt_schedulable, tg_nop, &data);
+	rcu_read_lock();
+	ret = walk_tg_tree(tg_rt_schedulable, tg_nop, &data);
+	rcu_read_unlock();
+
+	return ret;
 }
 
 static int tg_set_rt_bandwidth(struct task_group *tg,
@@ -9147,6 +9168,7 @@ static int tg_cfs_schedulable_down(struc
 
 static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
 {
+	int ret;
 	struct cfs_schedulable_data data = {
 		.tg = tg,
 		.period = period,
@@ -9158,7 +9180,11 @@ static int __cfs_schedulable(struct task
 		do_div(data.quota, NSEC_PER_USEC);
 	}
 
-	return walk_tg_tree(tg_cfs_schedulable_down, tg_nop, &data);
+	rcu_read_lock();
+	ret = walk_tg_tree(tg_cfs_schedulable_down, tg_nop, &data);
+	rcu_read_unlock();
+
+	return ret;
 }
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [patch 11/18] sched: prevent interactions with throttled entities
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (9 preceding siblings ...)
  2011-07-21 16:43 ` [patch 10/18] sched: allow for positional tg_tree walks Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-07-22 11:26   ` Kamalesh Babulal
                     ` (2 more replies)
  2011-07-21 16:43 ` [patch 12/18] sched: prevent buddy " Paul Turner
                   ` (9 subsequent siblings)
  20 siblings, 3 replies; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-throttled_shares.patch --]
[-- Type: text/plain, Size: 6437 bytes --]

From the perspective of load-balance and shares distribution, throttled
entities should be invisible.

However, both of these operations work on 'active' lists and are not
inherently aware of what group hierarchies may be present.  In some cases this
may be side-stepped (e.g. we could sideload via tg_load_down in load balance) 
while in others (e.g. update_shares()) it is more difficult to compute without
incurring some O(n^2) costs.

Instead, track hierarchical throttled state at the time of transition.  This
allows us to easily identify whether an entity belongs to a throttled hierarchy
and avoid incorrect interactions with it.

Also, when an entity leaves a throttled hierarchy we need to advance the
windows used for its shares averaging so that the elapsed throttled time is
not counted as part of the cfs_rq's operation.

We also use this information to prevent buddy interactions in the wakeup and
yield_to() paths.
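
A minimal userspace sketch (toy types; illustrative only) of the state being
tracked: throttling a group bumps a counter on every runqueue in its subtree,
so "is this entity inside a throttled hierarchy?" becomes a single counter
test instead of a walk up the tree.

  /* Toy model of throttle_count bookkeeping -- not kernel code. */
  #include <stdio.h>

  struct toy_rq {
          const char *name;
          struct toy_rq *children[4];
          int nr_children;
          int throttle_count;     /* >0: self or some ancestor is throttled */
  };

  static void adjust_subtree(struct toy_rq *rq, int delta)
  {
          rq->throttle_count += delta;
          for (int i = 0; i < rq->nr_children; i++)
                  adjust_subtree(rq->children[i], delta);
  }

  static int throttled_hierarchy(struct toy_rq *rq)
  {
          return rq->throttle_count;      /* O(1) check for load-balance/shares */
  }

  int main(void)
  {
          struct toy_rq leaf = { "leaf" };
          struct toy_rq mid  = { "mid", { &leaf }, 1 };

          adjust_subtree(&mid, +1);       /* throttle "mid" */
          printf("leaf throttled? %d\n", throttled_hierarchy(&leaf));     /* 1 */
          adjust_subtree(&mid, -1);       /* unthrottle */
          printf("leaf throttled? %d\n", throttled_hierarchy(&leaf));     /* 0 */
          return 0;
  }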

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c      |    2 -
 kernel/sched_fair.c |   99 ++++++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 94 insertions(+), 7 deletions(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -725,6 +725,8 @@ account_entity_dequeue(struct cfs_rq *cf
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+/* we need this in update_cfs_load and load-balance functions below */
+static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
 # ifdef CONFIG_SMP
 static void update_cfs_rq_load_contribution(struct cfs_rq *cfs_rq,
 					    int global_update)
@@ -747,7 +749,7 @@ static void update_cfs_load(struct cfs_r
 	u64 now, delta;
 	unsigned long load = cfs_rq->load.weight;
 
-	if (cfs_rq->tg == &root_task_group)
+	if (cfs_rq->tg == &root_task_group || throttled_hierarchy(cfs_rq))
 		return;
 
 	now = rq_of(cfs_rq)->clock_task;
@@ -856,7 +858,7 @@ static void update_cfs_shares(struct cfs
 
 	tg = cfs_rq->tg;
 	se = tg->se[cpu_of(rq_of(cfs_rq))];
-	if (!se)
+	if (!se || throttled_hierarchy(cfs_rq))
 		return;
 #ifndef CONFIG_SMP
 	if (likely(se->load.weight == tg->shares))
@@ -1425,6 +1427,65 @@ static inline int cfs_rq_throttled(struc
 	return cfs_rq->throttled;
 }
 
+/* check whether cfs_rq, or any parent, is throttled */
+static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq->throttle_count;
+}
+
+/*
+ * Ensure that neither of the group entities corresponding to src_cpu or
+ * dest_cpu are members of a throttled hierarchy when performing group
+ * load-balance operations.
+ */
+static inline int throttled_lb_pair(struct task_group *tg,
+				    int src_cpu, int dest_cpu)
+{
+	struct cfs_rq *src_cfs_rq, *dest_cfs_rq;
+
+	src_cfs_rq = tg->cfs_rq[src_cpu];
+	dest_cfs_rq = tg->cfs_rq[dest_cpu];
+
+	return throttled_hierarchy(src_cfs_rq) ||
+	       throttled_hierarchy(dest_cfs_rq);
+}
+
+/* updated child weight may affect parent so we have to do this bottom up */
+static int tg_unthrottle_up(struct task_group *tg, void *data)
+{
+	struct rq *rq = data;
+	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+
+	cfs_rq->throttle_count--;
+#ifdef CONFIG_SMP
+	if (!cfs_rq->throttle_count) {
+		u64 delta = rq->clock_task - cfs_rq->load_stamp;
+
+		/* leaving throttled state, advance shares averaging windows */
+		cfs_rq->load_stamp += delta;
+		cfs_rq->load_last += delta;
+
+		/* update entity weight now that we are on_rq again */
+		update_cfs_shares(cfs_rq);
+	}
+#endif
+
+	return 0;
+}
+
+static int tg_throttle_down(struct task_group *tg, void *data)
+{
+	struct rq *rq = data;
+	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+
+	/* group is entering throttled state, record last load */
+	if (!cfs_rq->throttle_count)
+		update_cfs_load(cfs_rq, 0);
+	cfs_rq->throttle_count++;
+
+	return 0;
+}
+
 static __used void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
@@ -1435,7 +1496,9 @@ static __used void throttle_cfs_rq(struc
 	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
 
 	/* account load preceding throttle */
-	update_cfs_load(cfs_rq, 0);
+	rcu_read_lock();
+	walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
+	rcu_read_unlock();
 
 	task_delta = cfs_rq->h_nr_running;
 	for_each_sched_entity(se) {
@@ -1476,6 +1539,10 @@ static void unthrottle_cfs_rq(struct cfs
 	list_del_rcu(&cfs_rq->throttled_list);
 	raw_spin_unlock(&cfs_b->lock);
 
+	update_rq_clock(rq);
+	/* update hierarchical throttle state */
+	walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);
+
 	if (!cfs_rq->load.weight)
 		return;
 
@@ -1620,6 +1687,17 @@ static inline int cfs_rq_throttled(struc
 {
 	return 0;
 }
+
+static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
+{
+	return 0;
+}
+
+static inline int throttled_lb_pair(struct task_group *tg,
+				    int src_cpu, int dest_cpu)
+{
+	return 0;
+}
 #endif
 
 /**************************************************
@@ -2519,6 +2597,9 @@ move_one_task(struct rq *this_rq, int th
 
 	for_each_leaf_cfs_rq(busiest, cfs_rq) {
 		list_for_each_entry_safe(p, n, &cfs_rq->tasks, se.group_node) {
+			if (throttled_lb_pair(task_group(p),
+					      busiest->cpu, this_cpu))
+				break;
 
 			if (!can_migrate_task(p, busiest, this_cpu,
 						sd, idle, &pinned))
@@ -2630,8 +2711,13 @@ static void update_shares(int cpu)
 	struct rq *rq = cpu_rq(cpu);
 
 	rcu_read_lock();
-	for_each_leaf_cfs_rq(rq, cfs_rq)
+	for_each_leaf_cfs_rq(rq, cfs_rq) {
+		/* throttled entities do not contribute to load */
+		if (throttled_hierarchy(cfs_rq))
+			continue;
+
 		update_shares_cpu(cfs_rq->tg, cpu);
+	}
 	rcu_read_unlock();
 }
 
@@ -2655,9 +2741,10 @@ load_balance_fair(struct rq *this_rq, in
 		u64 rem_load, moved_load;
 
 		/*
-		 * empty group
+		 * empty group or part of a throttled hierarchy
 		 */
-		if (!busiest_cfs_rq->task_weight)
+		if (!busiest_cfs_rq->task_weight ||
+		    throttled_lb_pair(tg, busiest_cpu, this_cpu))
 			continue;
 
 		rem_load = (u64)rem_load_move * busiest_weight;
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -399,7 +399,7 @@ struct cfs_rq {
 	u64 runtime_expires;
 	s64 runtime_remaining;
 
-	int throttled;
+	int throttled, throttle_count;
 	struct list_head throttled_list;
 #endif
 #endif



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [patch 12/18] sched: prevent buddy interactions with throttled entities
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (10 preceding siblings ...)
  2011-07-21 16:43 ` [patch 11/18] sched: prevent interactions with throttled entities Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-08-14 16:32   ` [tip:sched/core] sched: Prevent " tip-bot for Paul Turner
  2011-07-21 16:43 ` [patch 13/18] sched: migrate throttled tasks on HOTPLUG Paul Turner
                   ` (8 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-throttled_buddies.patch --]
[-- Type: text/plain, Size: 2005 bytes --]

Buddies allow us to select "on-rq" entities without actually selecting them
from a cfs_rq's rb_tree.  As a result we must ensure that throttled entities
are not falsely nominated as buddies.  The fact that entities are dequeued
within throttle_cfs_rq() is not sufficient for clearing buddy status, as the
nomination may occur after throttling.
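
As a rough illustration (userspace sketch with toy types; in the patch itself
the guards live in the wakeup-preemption and yield_to() paths rather than
inside set_next_buddy()): a nomination is simply refused for an entity inside
a throttled hierarchy, and yield_to() treats such an entity as not runnable.

  /* Toy sketch of the buddy guards -- illustrative only. */
  #include <stdio.h>

  struct toy_se {
          const char *name;
          int on_rq;
          int throttle_count;     /* hierarchical throttled state (patch 11) */
  };

  static struct toy_se *next_buddy;

  static void nominate_next_buddy(struct toy_se *se)
  {
          /* never nominate an entity inside a throttled hierarchy */
          if (se->throttle_count)
                  return;
          next_buddy = se;
  }

  static int yieldable(struct toy_se *se)
  {
          /* throttled hierarchies are not runnable */
          return se->on_rq && !se->throttle_count;
  }

  int main(void)
  {
          struct toy_se se = { "task", 1, 1 };

          nominate_next_buddy(&se);
          printf("buddy: %s, yieldable: %d\n",
                 next_buddy ? next_buddy->name : "(none)", yieldable(&se));
          return 0;
  }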

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched_fair.c |   18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -2370,6 +2370,15 @@ static void check_preempt_wakeup(struct 
 	if (unlikely(se == pse))
 		return;
 
+	/*
+	 * This is possible from callers such as pull_task(), in which we
+	 * unconditionally check_prempt_curr() after an enqueue (which may have
+	 * lead to a throttle).  This both saves work and prevents false
+	 * next-buddy nomination below.
+	 */
+	if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
+		return;
+
 	if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
 		set_next_buddy(pse);
 		next_buddy_marked = 1;
@@ -2378,6 +2387,12 @@ static void check_preempt_wakeup(struct 
 	/*
 	 * We can come here with TIF_NEED_RESCHED already set from new task
 	 * wake up path.
+	 *
+	 * Note: this also catches the edge-case of curr being in a throttled
+	 * group (e.g. via set_curr_task), since update_curr() (in the
+	 * enqueue of curr) will have resulted in resched being set.  This
+	 * prevents us from potentially nominating it as a false LAST_BUDDY
+	 * below.
 	 */
 	if (test_tsk_need_resched(curr))
 		return;
@@ -2500,7 +2515,8 @@ static bool yield_to_task_fair(struct rq
 {
 	struct sched_entity *se = &p->se;
 
-	if (!se->on_rq)
+	/* throttled hierarchies are not runnable */
+	if (!se->on_rq || throttled_hierarchy(cfs_rq_of(se)))
 		return false;
 
 	/* Tell the scheduler that we'd really like pse to run next. */



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [patch 13/18] sched: migrate throttled tasks on HOTPLUG
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (11 preceding siblings ...)
  2011-07-21 16:43 ` [patch 12/18] sched: prevent buddy " Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-08-14 16:34   ` [tip:sched/core] sched: Migrate " tip-bot for Paul Turner
  2011-07-21 16:43 ` [patch 14/18] sched: throttle entities exceeding their allowed bandwidth Paul Turner
                   ` (7 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-migrate_dead.patch --]
[-- Type: text/plain, Size: 1792 bytes --]

Throttled tasks are invisible to cpu-offline since they are not eligible for
selection by pick_next_task().  The regular 'escape' path for a thread that is
blocked at offline is via ttwu->select_task_rq; however, this will not handle a
throttled group since there are no individual thread wakeups on an unthrottle.

Resolve this by unthrottling offline cpus so that threads can be migrated.
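
A stripped-down userspace sketch (toy types; illustrative only) of the escape
hatch: before tasks are migrated off a dying cpu, every bandwidth-enabled
runqueue is topped back up with valid quota and unthrottled so that its tasks
become visible to pick_next_task() again.

  /* Toy sketch of unthrottling on cpu offline -- not kernel code. */
  #include <stdio.h>

  struct toy_cfs_rq {
          const char *name;
          int runtime_enabled;
          long long runtime_remaining;    /* <= 0 while throttled */
          int throttled;
  };

  static void toy_unthrottle_offline(struct toy_cfs_rq *rqs, int n, long long quota)
  {
          for (int i = 0; i < n; i++) {
                  if (!rqs[i].runtime_enabled)
                          continue;
                  /* clock_task is frozen on a dead cpu: any valid quota will do */
                  rqs[i].runtime_remaining = quota;
                  if (rqs[i].throttled) {
                          rqs[i].throttled = 0;   /* visible to pick_next_task() */
                          printf("unthrottled %s\n", rqs[i].name);
                  }
          }
  }

  int main(void)
  {
          struct toy_cfs_rq rqs[] = {
                  { "grp-a", 1, -5000, 1 },       /* throttled, must be released */
                  { "grp-b", 0, 0, 0 },           /* no bandwidth limits */
          };

          toy_unthrottle_offline(rqs, 2, 100000);
          return 0;
  }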

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c |   27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -6271,6 +6271,30 @@ static void calc_global_load_remove(stru
 	rq->calc_load_active = 0;
 }
 
+#ifdef CONFIG_CFS_BANDWIDTH
+static void unthrottle_offline_cfs_rqs(struct rq *rq)
+{
+	struct cfs_rq *cfs_rq;
+
+	for_each_leaf_cfs_rq(rq, cfs_rq) {
+		struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+
+		if (!cfs_rq->runtime_enabled)
+			continue;
+
+		/*
+		 * clock_task is not advancing so we just need to make sure
+		 * there's some valid quota amount
+		 */
+		cfs_rq->runtime_remaining = cfs_b->quota;
+		if (cfs_rq_throttled(cfs_rq))
+			unthrottle_cfs_rq(cfs_rq);
+	}
+}
+#else
+static void unthrottle_offline_cfs_rqs(struct rq *rq) {}
+#endif
+
 /*
  * Migrate all tasks from the rq, sleeping tasks will be migrated by
  * try_to_wake_up()->select_task_rq().
@@ -6296,6 +6320,9 @@ static void migrate_tasks(unsigned int d
 	 */
 	rq->stop = NULL;
 
+	/* Ensure any throttled groups are reachable by pick_next_task */
+	unthrottle_offline_cfs_rqs(rq);
+
 	for ( ; ; ) {
 		/*
 		 * There's this thread running, bail when that's the only



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [patch 14/18] sched: throttle entities exceeding their allowed bandwidth
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (12 preceding siblings ...)
  2011-07-21 16:43 ` [patch 13/18] sched: migrate throttled tasks on HOTPLUG Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-08-14 16:35   ` [tip:sched/core] sched: Throttle " tip-bot for Paul Turner
  2011-07-21 16:43 ` [patch 15/18] sched: add exports tracking cfs bandwidth control statistics Paul Turner
                   ` (6 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-enable-throttling.patch --]
[-- Type: text/plain, Size: 3848 bytes --]

With the machinery in place to throttle and unthrottle entities, as well as to
handle their participation (or lack thereof), we can now enable throttling.

There are two points at which we must check whether it is time to enter the
throttled state (both are sketched below): put_prev_entity() and
enqueue_entity().

- put_prev_entity() is the typical throttle path; we reach it by exceeding our
  allocated run-time within update_curr()->account_cfs_rq_runtime() and going
  through a reschedule.

- enqueue_entity() covers the case of a wake-up into an already throttled
  group.  In this case we know the group cannot be on_rq and can throttle
  immediately.  Checks are added at the time of put_prev_entity() and
  enqueue_entity().
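
Both checks reduce to small predicates; the following userspace sketch (toy
fields; illustrative only) captures the decision made at each point.

  /* Toy sketch of the two throttle decision points -- not kernel code. */
  #include <stdio.h>

  struct toy_cfs_rq {
          int runtime_enabled;
          long long runtime_remaining;
          int throttled;
          int has_curr;           /* stands in for cfs_rq->curr != NULL */
  };

  static void toy_throttle(struct toy_cfs_rq *rq)
  {
          rq->throttled = 1;
          printf("throttled\n");
  }

  /* enqueue_entity() path: a wake-up into a group whose quota is already gone */
  static void check_enqueue_throttle(struct toy_cfs_rq *rq)
  {
          if (!rq->runtime_enabled || rq->has_curr)       /* active: use put() path */
                  return;
          if (rq->throttled)
                  return;
          if (rq->runtime_remaining <= 0)
                  toy_throttle(rq);
  }

  /* put_prev_entity() path: we ran out of runtime and went through a resched */
  static void check_cfs_rq_runtime(struct toy_cfs_rq *rq)
  {
          if (!rq->runtime_enabled || rq->runtime_remaining > 0)
                  return;
          if (rq->throttled)      /* e.g. forced running via set_curr_task */
                  return;
          toy_throttle(rq);
  }

  int main(void)
  {
          struct toy_cfs_rq rq = { 1, -100, 0, 0 };

          check_enqueue_throttle(&rq);    /* wake-up into an exhausted group */
          check_cfs_rq_runtime(&rq);      /* already throttled: nothing to do */
          return 0;
  }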

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched_fair.c |   52 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 50 insertions(+), 2 deletions(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -989,6 +989,8 @@ place_entity(struct cfs_rq *cfs_rq, stru
 	se->vruntime = vruntime;
 }
 
+static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
+
 static void
 enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
@@ -1018,8 +1020,10 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
 
-	if (cfs_rq->nr_running == 1)
+	if (cfs_rq->nr_running == 1) {
 		list_add_leaf_cfs_rq(cfs_rq);
+		check_enqueue_throttle(cfs_rq);
+	}
 }
 
 static void __clear_buddies_last(struct sched_entity *se)
@@ -1224,6 +1228,8 @@ static struct sched_entity *pick_next_en
 	return se;
 }
 
+static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
+
 static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 {
 	/*
@@ -1233,6 +1239,9 @@ static void put_prev_entity(struct cfs_r
 	if (prev->on_rq)
 		update_curr(cfs_rq);
 
+	/* throttle cfs_rqs exceeding runtime */
+	check_cfs_rq_runtime(cfs_rq);
+
 	check_spread(cfs_rq, prev);
 	if (prev->on_rq) {
 		update_stats_wait_start(cfs_rq, prev);
@@ -1486,7 +1495,7 @@ static int tg_throttle_down(struct task_
 	return 0;
 }
 
-static __used void throttle_cfs_rq(struct cfs_rq *cfs_rq)
+static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
@@ -1679,9 +1688,48 @@ out_unlock:
 
 	return idle;
 }
+
+/*
+ * When a group wakes up we want to make sure that its quota is not already
+ * expired/exceeded, otherwise it may be allowed to steal additional ticks of
+ * runtime as update_curr() throttling can not not trigger until it's on-rq.
+ */
+static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
+{
+	/* an active group must be handled by the update_curr()->put() path */
+	if (!cfs_rq->runtime_enabled || cfs_rq->curr)
+		return;
+
+	/* ensure the group is not already throttled */
+	if (cfs_rq_throttled(cfs_rq))
+		return;
+
+	/* update runtime allocation */
+	account_cfs_rq_runtime(cfs_rq, 0);
+	if (cfs_rq->runtime_remaining <= 0)
+		throttle_cfs_rq(cfs_rq);
+}
+
+/* conditionally throttle active cfs_rq's from put_prev_entity() */
+static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
+		return;
+
+	/*
+	 * it's possible for a throttled entity to be forced into a running
+	 * state (e.g. set_curr_task), in this case we're finished.
+	 */
+	if (cfs_rq_throttled(cfs_rq))
+		return;
+
+	throttle_cfs_rq(cfs_rq);
+}
 #else
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 				     unsigned long delta_exec) {}
+static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
+static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [patch 15/18] sched: add exports tracking cfs bandwidth control statistics
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (13 preceding siblings ...)
  2011-07-21 16:43 ` [patch 14/18] sched: throttle entities exceeding their allowed bandwidth Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-08-14 16:37   ` [tip:sched/core] sched: Add " tip-bot for Nikhil Rao
  2011-07-21 16:43 ` [patch 16/18] sched: return unused runtime on group dequeue Paul Turner
                   ` (5 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron,
	Nikhil Rao

[-- Attachment #1: sched-bwc-throttle_stats.patch --]
[-- Type: text/plain, Size: 3605 bytes --]

From: Nikhil Rao <ncrao@google.com>

This change introduces statistics exports for the cpu sub-system; these are
added through the use of a stat file similar to that exported by other
subsystems.

The following exports are included (a minimal reader is sketched after the
list):

nr_periods:	number of periods in which execution occurred
nr_throttled:	the number of periods above in which execution was throttled
throttled_time:	cumulative wall-time for which any of this group's cpus have
		been throttled
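
For reference, a minimal userspace reader of the new file; the mount point
/cgroup/cpu and group name "test" are assumptions matching the workload setup
used elsewhere in this series, so adjust the path to your hierarchy.

  /* Dump the bandwidth statistics exported via cpu.stat -- sketch only. */
  #include <stdio.h>

  int main(void)
  {
          /* path assumes the cpu cgroup mounted at /cgroup/cpu with group "test" */
          FILE *f = fopen("/cgroup/cpu/test/cpu.stat", "r");
          char key[64];
          unsigned long long val;

          if (!f) {
                  perror("cpu.stat");
                  return 1;
          }
          while (fscanf(f, "%63s %llu", key, &val) == 2)
                  printf("%-16s %llu\n", key, val);
          fclose(f);
          return 0;
  }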

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 kernel/sched.c      |   21 +++++++++++++++++++++
 kernel/sched_fair.c |    7 +++++++
 2 files changed, 28 insertions(+)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -259,6 +259,9 @@ struct cfs_bandwidth {
 	struct hrtimer period_timer;
 	struct list_head throttled_cfs_rq;
 
+	/* statistics */
+	int nr_periods, nr_throttled;
+	u64 throttled_time;
 #endif
 };
 
@@ -399,6 +402,7 @@ struct cfs_rq {
 	u64 runtime_expires;
 	s64 runtime_remaining;
 
+	u64 throttled_timestamp;
 	int throttled, throttle_count;
 	struct list_head throttled_list;
 #endif
@@ -9213,6 +9217,19 @@ static int __cfs_schedulable(struct task
 
 	return ret;
 }
+
+static int cpu_stats_show(struct cgroup *cgrp, struct cftype *cft,
+		struct cgroup_map_cb *cb)
+{
+	struct task_group *tg = cgroup_tg(cgrp);
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+
+	cb->fill(cb, "nr_periods", cfs_b->nr_periods);
+	cb->fill(cb, "nr_throttled", cfs_b->nr_throttled);
+	cb->fill(cb, "throttled_time", cfs_b->throttled_time);
+
+	return 0;
+}
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -9259,6 +9276,10 @@ static struct cftype cpu_files[] = {
 		.read_u64 = cpu_cfs_period_read_u64,
 		.write_u64 = cpu_cfs_period_write_u64,
 	},
+	{
+		.name = "stat",
+		.read_map = cpu_stats_show,
+	},
 #endif
 #ifdef CONFIG_RT_GROUP_SCHED
 	{
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1528,6 +1528,7 @@ static void throttle_cfs_rq(struct cfs_r
 		rq->nr_running -= task_delta;
 
 	cfs_rq->throttled = 1;
+	cfs_rq->throttled_timestamp = rq->clock;
 	raw_spin_lock(&cfs_b->lock);
 	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
 	raw_spin_unlock(&cfs_b->lock);
@@ -1545,8 +1546,10 @@ static void unthrottle_cfs_rq(struct cfs
 
 	cfs_rq->throttled = 0;
 	raw_spin_lock(&cfs_b->lock);
+	cfs_b->throttled_time += rq->clock - cfs_rq->throttled_timestamp;
 	list_del_rcu(&cfs_rq->throttled_list);
 	raw_spin_unlock(&cfs_b->lock);
+	cfs_rq->throttled_timestamp = 0;
 
 	update_rq_clock(rq);
 	/* update hierarchical throttle state */
@@ -1634,6 +1637,7 @@ static int do_sched_cfs_period_timer(str
 	throttled = !list_empty(&cfs_b->throttled_cfs_rq);
 	/* idle depends on !throttled (for the case of a large deficit) */
 	idle = cfs_b->idle && !throttled;
+	cfs_b->nr_periods += overrun;
 
 	/* if we're going inactive then everything else can be deferred */
 	if (idle)
@@ -1647,6 +1651,9 @@ static int do_sched_cfs_period_timer(str
 		goto out_unlock;
 	}
 
+	/* account preceding periods in which throttling occurred */
+	cfs_b->nr_throttled += overrun;
+
 	/*
 	 * There are throttled entities so we must first use the new bandwidth
 	 * to unthrottle them before making it generally available.  This



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [patch 16/18] sched: return unused runtime on group dequeue
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (14 preceding siblings ...)
  2011-07-21 16:43 ` [patch 15/18] sched: add exports tracking cfs bandwidth control statistics Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-08-14 16:39   ` [tip:sched/core] sched: Return " tip-bot for Paul Turner
  2011-07-21 16:43 ` [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive Paul Turner
                   ` (4 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-simple_return_quota.patch --]
[-- Type: text/plain, Size: 8152 bytes --]

When a local cfs_rq blocks we return the majority of its remaining quota to the
global bandwidth pool for use by other runqueues.

We do this only when the quota is current and there is more than
min_cfs_rq_runtime [1ms by default] of runtime remaining on the rq.

In the case where there are throttled runqueues and we have sufficient
bandwidth to meter out a slice, a second timer is kicked off to handle this
delivery, unthrottling where appropriate.
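
The return decision itself is small; here is a userspace sketch (toy fields;
illustrative only): anything above the 1ms floor flows back to the global
pool, provided the local runtime still belongs to the current quota
generation.

  /* Toy sketch of returning slack runtime to the global pool -- not kernel code. */
  #include <stdio.h>

  #define NSEC_PER_MSEC           1000000LL
  #define MIN_CFS_RQ_RUNTIME      (1 * NSEC_PER_MSEC)     /* keep 1ms locally */

  struct toy_pool   { long long runtime; long long expires; };
  struct toy_cfs_rq { long long runtime_remaining; long long expires; };

  static void return_slack(struct toy_pool *pool, struct toy_cfs_rq *rq)
  {
          long long slack = rq->runtime_remaining - MIN_CFS_RQ_RUNTIME;

          if (slack <= 0)
                  return;
          /* only return runtime belonging to the current quota generation */
          if (rq->expires == pool->expires)
                  pool->runtime += slack;
          rq->runtime_remaining -= slack; /* never try to return it twice */
  }

  int main(void)
  {
          struct toy_pool pool = { 0, 42 };
          struct toy_cfs_rq rq = { 3 * NSEC_PER_MSEC, 42 };

          return_slack(&pool, &rq);
          printf("pool=%lld local=%lld\n", pool.runtime, rq.runtime_remaining);
          return 0;
  }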

Using a 'worst case' antagonist which executes on each cpu
for 1ms before moving onto the next on a fairly large machine:

no quota generations:
 197.47 ms       /cgroup/a/cpuacct.usage
 199.46 ms       /cgroup/a/cpuacct.usage
 205.46 ms       /cgroup/a/cpuacct.usage
 198.46 ms       /cgroup/a/cpuacct.usage
 208.39 ms       /cgroup/a/cpuacct.usage
Since we are allowed to use "stale" quota our usage is effectively bounded by
the rate of input into the global pool and performance is relatively stable.

with quota generations [1s increments]:
 119.58 ms       /cgroup/a/cpuacct.usage
 119.65 ms       /cgroup/a/cpuacct.usage
 119.64 ms       /cgroup/a/cpuacct.usage
 119.63 ms       /cgroup/a/cpuacct.usage
 119.60 ms       /cgroup/a/cpuacct.usage
The large deficit here is due to quota generations (/intentionally/) preventing
us from now using previously stranded slack quota.  The cost is that this quota
becomes unavailable.

with quota generations and quota return:
 200.09 ms       /cgroup/a/cpuacct.usage
 200.09 ms       /cgroup/a/cpuacct.usage
 198.09 ms       /cgroup/a/cpuacct.usage
 200.09 ms       /cgroup/a/cpuacct.usage
 200.06 ms       /cgroup/a/cpuacct.usage
By returning unused quota we're able to both stably consume our desired quota
and prevent unintentional overages due to the abuse of slack quota from 
previous quota periods (especially on a large machine).

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched.c      |   15 ++++++-
 kernel/sched_fair.c |  108 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 122 insertions(+), 1 deletion(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -256,7 +256,7 @@ struct cfs_bandwidth {
 	u64 runtime_expires;
 
 	int idle, timer_active;
-	struct hrtimer period_timer;
+	struct hrtimer period_timer, slack_timer;
 	struct list_head throttled_cfs_rq;
 
 	/* statistics */
@@ -417,6 +417,16 @@ static inline struct cfs_bandwidth *tg_c
 
 static inline u64 default_cfs_period(void);
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
+static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b);
+
+static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
+{
+	struct cfs_bandwidth *cfs_b =
+		container_of(timer, struct cfs_bandwidth, slack_timer);
+	do_sched_cfs_slack_timer(cfs_b);
+
+	return HRTIMER_NORESTART;
+}
 
 static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 {
@@ -449,6 +459,8 @@ static void init_cfs_bandwidth(struct cf
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
+	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	cfs_b->slack_timer.function = sched_cfs_slack_timer;
 }
 
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
@@ -484,6 +496,7 @@ static void __start_cfs_bandwidth(struct
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
 	hrtimer_cancel(&cfs_b->period_timer);
+	hrtimer_cancel(&cfs_b->slack_timer);
 }
 #else
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1071,6 +1071,8 @@ static void clear_buddies(struct cfs_rq 
 		__clear_buddies_skip(se);
 }
 
+static void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
+
 static void
 dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
@@ -1109,6 +1111,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 	if (!(flags & DEQUEUE_SLEEP))
 		se->vruntime -= cfs_rq->min_vruntime;
 
+	/* return excess runtime on last dequeue */
+	return_cfs_rq_runtime(cfs_rq);
+
 	update_min_vruntime(cfs_rq);
 	update_cfs_shares(cfs_rq);
 }
@@ -1696,6 +1701,108 @@ out_unlock:
 	return idle;
 }
 
+/* a cfs_rq won't donate quota below this amount */
+static const u64 min_cfs_rq_runtime = 1 * NSEC_PER_MSEC;
+/* minimum remaining period time to redistribute slack quota */
+static const u64 min_bandwidth_expiration = 2 * NSEC_PER_MSEC;
+/* how long we wait to gather additional slack before distributing */
+static const u64 cfs_bandwidth_slack_period = 5 * NSEC_PER_MSEC;
+
+/* are we near the end of the current quota period? */
+static int runtime_refresh_within(struct cfs_bandwidth *cfs_b, u64 min_expire)
+{
+	struct hrtimer *refresh_timer = &cfs_b->period_timer;
+	u64 remaining;
+
+	/* if the call-back is running a quota refresh is already occurring */
+	if (hrtimer_callback_running(refresh_timer))
+		return 1;
+
+	/* is a quota refresh about to occur? */
+	remaining = ktime_to_ns(hrtimer_expires_remaining(refresh_timer));
+	if (remaining < min_expire)
+		return 1;
+
+	return 0;
+}
+
+static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+	u64 min_left = cfs_bandwidth_slack_period + min_bandwidth_expiration;
+
+	/* if there's a quota refresh soon don't bother with slack */
+	if (runtime_refresh_within(cfs_b, min_left))
+		return;
+
+	start_bandwidth_timer(&cfs_b->slack_timer,
+				ns_to_ktime(cfs_bandwidth_slack_period));
+}
+
+/* we know any runtime found here is valid as update_curr() precedes return */
+static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	s64 slack_runtime = cfs_rq->runtime_remaining - min_cfs_rq_runtime;
+
+	if (slack_runtime <= 0)
+		return;
+
+	raw_spin_lock(&cfs_b->lock);
+	if (cfs_b->quota != RUNTIME_INF &&
+	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+		cfs_b->runtime += slack_runtime;
+
+		/* we are under rq->lock, defer unthrottling using a timer */
+		if (cfs_b->runtime > sched_cfs_bandwidth_slice() &&
+		    !list_empty(&cfs_b->throttled_cfs_rq))
+			start_cfs_slack_bandwidth(cfs_b);
+	}
+	raw_spin_unlock(&cfs_b->lock);
+
+	/* even if it's not valid for return we don't want to try again */
+	cfs_rq->runtime_remaining -= slack_runtime;
+}
+
+static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	if (!cfs_rq->runtime_enabled || !cfs_rq->nr_running)
+		return;
+
+	__return_cfs_rq_runtime(cfs_rq);
+}
+
+/*
+ * This is done with a timer (instead of inline with bandwidth return) since
+ * it's necessary to juggle rq->locks to unthrottle their respective cfs_rqs.
+ */
+static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
+{
+	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
+	u64 expires;
+
+	/* confirm we're still not at a refresh boundary */
+	if (runtime_refresh_within(cfs_b, min_bandwidth_expiration))
+		return;
+
+	raw_spin_lock(&cfs_b->lock);
+	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice) {
+		runtime = cfs_b->runtime;
+		cfs_b->runtime = 0;
+	}
+	expires = cfs_b->runtime_expires;
+	raw_spin_unlock(&cfs_b->lock);
+
+	if (!runtime)
+		return;
+
+	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+
+	raw_spin_lock(&cfs_b->lock);
+	if (expires == cfs_b->runtime_expires)
+		cfs_b->runtime = runtime;
+	raw_spin_unlock(&cfs_b->lock);
+}
+
 /*
  * When a group wakes up we want to make sure that its quota is not already
  * expired/exceeded, otherwise it may be allowed to steal additional ticks of
@@ -1737,6 +1844,7 @@ static void account_cfs_rq_runtime(struc
 				     unsigned long delta_exec) {}
 static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
+static void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (15 preceding siblings ...)
  2011-07-21 16:43 ` [patch 16/18] sched: return unused runtime on group dequeue Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-07-21 16:43 ` [patch 18/18] sched: add documentation for bandwidth control Paul Turner
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-add_jump_labels.patch --]
[-- Type: text/plain, Size: 16125 bytes --]

So I'm seeing some strange costs associated with jump_labels; while on paper
the branches and instructions retired improve (as expected), we're taking an
unexpected hit in IPC.

[From the initial mail we have workloads:
  mkdir -p /cgroup/cpu/test
  echo $$ > /dev/cgroup/cpu/test (only cpu,cpuacct mounted)
  (W1) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "for ((i=0;i<5;i++)); do $(dirname $0)/pipe-test 20000; done"
  (W2)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;true"
  (W3)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;"
]

To make some of the figures clearer:

Legend:
!BWC = tip + bwc, BWC compiled out
BWC = tip + bwc
BWC_JL = tip + bwc + jump label (this patch)


Now, comparing under W1 we see:
W1: BWC vs BWC_JL
                            instructions            cycles                  branches              elapsed                
---------------------------------------------------------------------------------------------------------------------
clovertown [BWC]            845934117               974222228               152715407             0.419014188 [baseline]
+unconstrained              857963815 (+1.42)      1007152750 (+3.38)       153140328 (+0.28)     0.433186926 (+3.38)  [rel]
+10000000000/1000:          876937753 (+2.55)      1033978705 (+5.65)       160038434 (+3.59)     0.443638365 (+5.66)  [rel]
+10000000000/1000000:       880276838 (+3.08)      1036176245 (+6.13)       160683878 (+4.15)     0.444577244 (+6.14)  [rel]

barcelona [BWC]             820573353               748178486               148161233             0.342122850 [baseline] 
+unconstrained              817011602 (-0.43)       759838181 (+1.56)       145951513 (-1.49)     0.347462571 (+1.56)  [rel]
+10000000000/1000:          830109086 (+0.26)       770451537 (+1.67)       151228902 (+1.08)     0.350824677 (+1.65)  [rel]
+10000000000/1000000:       830196206 (+0.30)       770704213 (+2.27)       151250413 (+1.12)     0.350962182 (+2.28)  [rel]

westmere [BWC]              802533191               694415157               146071233             0.194428018 [baseline]
+unconstrained              799057936 (-0.43)       751384496 (+8.20)       143875513 (-1.50)     0.211182620 (+8.62)  [rel]
+10000000000/1000:          812033785 (+0.27)       761469084 (+8.51)       149134146 (+1.09)     0.212149229 (+8.28)  [rel]
+10000000000/1000000:       811912834 (+0.27)       757842988 (+7.45)       149113291 (+1.09)     0.211364804 (+7.30)  [rel]
e.g. Barcelona issues ~0.43% fewer instructions, for a total of 817011602, in
the unconstrained case relative to BWC.


Where "unconstrained, 10000000000/1000, 10000000000/10000" are the on
measurements for BWC_JL, with (%d) being the relative difference to their
BWC counterparts.

W2: BWC vs BWC_JL is very similar.
	BWC vs BWC_JL
clovertown [BWC]            985732031              1283113452               175621212             1.375905653  
+unconstrained              979242938 (-0.66)      1288971141 (+0.46)       172122546 (-1.99)     1.389795165 (+1.01)  [rel]
+10000000000/1000:          999886468 (+0.33)      1296597143 (+1.13)       180554004 (+1.62)     1.392576770 (+1.18)  [rel]
+10000000000/1000000:       999034223 (+0.11)      1293925500 (+0.57)       180413829 (+1.39)     1.391041338 (+0.94)  [rel]

barcelona [BWC]             982139920              1078757792               175417574             1.069537049  
+unconstrained              965443672 (-1.70)      1075377223 (-0.31)       170215844 (-2.97)     1.045595065 (-2.24)  [rel]
+10000000000/1000:          989104943 (+0.05)      1100836668 (+0.52)       178837754 (+1.22)     1.058730316 (-1.77)  [rel]
+10000000000/1000000:       987627489 (-0.32)      1095843758 (-0.17)       178567411 (+0.84)     1.056100899 (-2.28)  [rel]

westmere [BWC]              918633403               896047900               166496917             0.754629182  
+unconstrained              914740541 (-0.42)       903906801 (+0.88)       163652848 (-1.71)     0.758050332 (+0.45)  [rel]
+10000000000/1000:          927517377 (-0.41)       952579771 (+5.67)       170173060 (+0.75)     0.771193786 (+2.43)  [rel]
+10000000000/1000000:       914676985 (-0.89)       936106277 (+3.81)       167683288 (+0.22)     0.764973632 (+1.38)  [rel]

Now this is rather odd: almost across the board we're seeing the expected
drops in instructions and branches, yet we appear to be paying a heavy IPC
price.  The fact that wall-time has scaled equivalently with cycles roughly
rules out the cycles counter being off.

We are seeing the expected behavior in the bandwidth enabled case;
specifically the <jl=jmp><ret><cond><ret> blocks are taking an extra branch
and instruction which shows up on all the numbers above.

With respect to compiler mangling the text is essentially unchanged in size.
One lurking suspicion is whether the inserted nops have perturbed some of the
jmp/branch alignments?

    text    data     bss     dec     hex filename
 7277206 2827256 2125824 12230286         ba9e8e vmlinux.jump_label
 7276886 2826744 2125824 12229454         ba9b4e vmlinux.no_jump_label
 
I have checked to make sure that the right instructions are being patched in
at run-time.  I've also pulled a fully patched jump_label out of the kernel
into a userspace test (and benchmarked it directly under perf).  The results
here are also exactly as expected.

e.g.
 Performance counter stats for './jump_test':
     1,500,839,002 instructions, 300,147,081 branches 702,468,404 cycles
Performance counter stats for './jump_test 1':
     2,001,014,609 instructions, 400,177,192 branches 901,758,219 cycles

Overall, if we can fix the IPC, the benefit in the globally unconstrained case
looks really good.

Any thoughts Jason?

-----
Some more raw data:

perf-stat_to_perf-stat variance in performance for W1:

	BWC_JL vs BWC_JL (sample run-to-run variance on JL measurements)
                            instructions            cycles                  branches              elapsed                
---------------------------------------------------------------------------------------------------------------------
clovertown [BWC_JL]         857963815              1007152750               153140328             0.433186926  
+unconstrained              856457537 (-0.18)       986820040 (-2.02)       152871983 (-0.18)     0.424187340 (-2.08)  [rel]
+10000000000/1000:          880281114 (+0.38)      1009349419 (-2.38)       160668480 (+0.39)     0.433031825 (-2.39)  [rel]
+10000000000/1000000:       881001883 (+0.08)      1008445782 (-2.68)       160811824 (+0.08)     0.432629132 (-2.69)  [rel]

barcelona [BWC_JL]          817011602               759838181               145951513             0.347462571  
+unconstrained              817076246 (+0.01)       758404044 (-0.19)       145958670 (+0.00)     0.346313238 (-0.33)  [rel]
+10000000000/1000:          830087089 (-0.00)       773100724 (+0.34)       151218674 (-0.01)     0.352047450 (+0.35)  [rel]
+10000000000/1000000:       830002149 (-0.02)       773209942 (+0.33)       151208657 (-0.03)     0.352090862 (+0.32)  [rel]

westmere [BWC_JL]           799057936               751384496               143875513             0.211182620  
+unconstrained              799067664 (+0.00)       751165910 (-0.03)       143877385 (+0.00)     0.210928554 (-0.12)  [rel]
+10000000000/1000:          812040497 (+0.00)       748711039 (-1.68)       149135568 (+0.00)     0.208868390 (-1.55)  [rel]
+10000000000/1000000:       811911208 (-0.00)       746860347 (-1.45)       149113194 (-0.00)     0.208663627 (-1.28)  [rel]

	BWC vs BWC (sample run-to-run variance on BWC measurements)

ilium [BWC]                845934117               974222228               152715407             0.419014188  
+unconstrained              849061624 (+0.37)       965568244 (-0.89)       153288606 (+0.38)     0.415287406 (-0.89)  [rel]
+10000000000/1000:          861138018 (+0.71)       975979688 (-0.28)       155594606 (+0.71)     0.418710227 (-0.28)  [rel]
+10000000000/1000000:       858768659 (+0.56)       972288157 (-0.42)       155163198 (+0.57)     0.417130144 (-0.42)  [rel]

barcelona [BWC]                820573353               748178486               148161233             0.342122850  
+unconstrained              820494225 (-0.01)       748302946 (+0.02)       148147559 (-0.01)     0.341349438 (-0.23)  [rel]
+10000000000/1000:          827929735 (-0.00)       756163375 (-0.22)       149609111 (-0.00)     0.344356113 (-0.22)  [rel]
+10000000000/1000000:       827682550 (-0.00)       759867539 (+0.84)       149565408 (-0.00)     0.346039855 (+0.84)  [rel]

westmere [BWC]                802533191               694415157               146071233             0.194428018  
+unconstrained              802648805 (+0.01)       698052899 (+0.52)       146099982 (+0.02)     0.195632318 (+0.62)  [rel]
+10000000000/1000:          809855427 (-0.00)       703633926 (+0.26)       147519800 (-0.00)     0.196545542 (+0.32)  [rel]
+10000000000/1000000:       809646717 (-0.01)       704895639 (-0.05)       147476169 (-0.02)     0.197022787 (+0.01)  [rel]

Raw Westmere measurements:

BWC:
Case: Unconstrained -1

 Performance counter stats for 'bash -c for ((i=0;i<5;i++)); do ./pipe-test 20000; done' (50 runs):

         802533191 instructions             #      1.156 IPC     ( +-   0.004% )
         694415157 cycles                     ( +-   0.165% )
         146071233 branches                   ( +-   0.003% )

        0.194428018  seconds time elapsed   ( +-   0.437% )

Case: 10000000000/1000:

 Performance counter stats for 'bash -c for ((i=0;i<5;i++)); do ./pipe-test 20000; done' (50 runs):

         809861594 instructions             #      1.154 IPC     ( +-   0.016% )
         701781996 cycles                     ( +-   0.184% )
         147520953 branches                   ( +-   0.022% )

        0.195928354  seconds time elapsed   ( +-   0.262% )


Case: 10000000000/1000000:

 Performance counter stats for 'bash -c for ((i=0;i<5;i++)); do ./pipe-test 20000; done' (50 runs):

         809752541 instructions             #      1.148 IPC     ( +-   0.016% )
         705278419 cycles                     ( +-   0.593% )
         147502154 branches                   ( +-   0.022% )

        0.196993502  seconds time elapsed   ( +-   0.698% )

BWC_JL:
Case: Unconstrained -1

 Performance counter stats for 'bash -c for ((i=0;i<5;i++)); do ./pipe-test 20000; done' (50 runs):

         799057936 instructions             #      1.063 IPC     ( +-   0.001% )
         751384496 cycles                     ( +-   0.584% )
         143875513 branches                   ( +-   0.001% )

        0.211182620  seconds time elapsed   ( +-   0.771% )

Case: 10000000000/1000:

 Performance counter stats for 'bash -c for ((i=0;i<5;i++)); do ./pipe-test 20000; done' (50 runs):

         812033785 instructions             #      1.066 IPC     ( +-   0.017% )
         761469084 cycles                     ( +-   0.125% )
         149134146 branches                   ( +-   0.022% )

        0.212149229  seconds time elapsed   ( +-   0.171% )


Case: 10000000000/1000000:

 Performance counter stats for 'bash -c for ((i=0;i<5;i++)); do ./pipe-test 20000; done' (50 runs):

         811912834 instructions             #      1.071 IPC     ( +-   0.017% )
         757842988 cycles                     ( +-   0.158% )
         149113291 branches                   ( +-   0.022% )

        0.211364804  seconds time elapsed   ( +-   0.225% )


Let me know if there's any particular raw data you want; westmere seems the
most interesting because it's taking the biggest hit.

-------


From: Paul Turner <pjt@google.com>

When no groups within the system are constrained we can use jump labels to
reduce overheads -- skipping the per-cfs_rq runtime_enabled checks.

Signed-off-by: Paul Turner <pjt@google.com>
---
 kernel/sched.c      |   33 +++++++++++++++++++++++++++++++--
 kernel/sched_fair.c |   15 ++++++++++++---
 2 files changed, 43 insertions(+), 5 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -71,6 +71,7 @@
 #include <linux/ctype.h>
 #include <linux/ftrace.h>
 #include <linux/slab.h>
+#include <linux/jump_label.h>
 
 #include <asm/tlb.h>
 #include <asm/irq_regs.h>
@@ -499,7 +500,32 @@ static void destroy_cfs_bandwidth(struct
 	hrtimer_cancel(&cfs_b->period_timer);
 	hrtimer_cancel(&cfs_b->slack_timer);
 }
-#else
+
+#ifdef HAVE_JUMP_LABEL
+static struct jump_label_key __cfs_bandwidth_enabled;
+
+static inline bool cfs_bandwidth_enabled(void)
+{
+	return static_branch(&__cfs_bandwidth_enabled);
+}
+
+static void account_cfs_bandwidth_enabled(int enabled, int was_enabled)
+{
+	/* only need to count groups transitioning between enabled/!enabled */
+	if (enabled && !was_enabled)
+		jump_label_inc(&__cfs_bandwidth_enabled);
+	else if (!enabled && was_enabled)
+		jump_label_dec(&__cfs_bandwidth_enabled);
+}
+#else /* !HAVE_JUMP_LABEL */
+/* static_branch doesn't help unless supported */
+static int cfs_bandwidth_enabled(void)
+{
+	return 1;
+}
+static void account_cfs_bandwidth_enabled(int enabled, int was_enabled) {}
+#endif /* HAVE_JUMP_LABEL */
+#else /* !CONFIG_CFS_BANDWIDTH */
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
@@ -9025,7 +9051,7 @@ static int __cfs_schedulable(struct task
 
 static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 {
-	int i, ret = 0, runtime_enabled;
+	int i, ret = 0, runtime_enabled, runtime_was_enabled;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
 
 	if (tg == &root_task_group)
@@ -9053,6 +9079,9 @@ static int tg_set_cfs_bandwidth(struct t
 		goto out_unlock;
 
 	runtime_enabled = quota != RUNTIME_INF;
+	runtime_was_enabled = cfs_b->quota != RUNTIME_INF;
+	account_cfs_bandwidth_enabled(runtime_enabled, runtime_was_enabled);
+
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1430,7 +1430,7 @@ static void __account_cfs_rq_runtime(str
 static __always_inline void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 						   unsigned long delta_exec)
 {
-	if (!cfs_rq->runtime_enabled)
+	if (!cfs_bandwidth_enabled() || !cfs_rq->runtime_enabled)
 		return;
 
 	__account_cfs_rq_runtime(cfs_rq, delta_exec);
@@ -1438,13 +1438,13 @@ static __always_inline void account_cfs_
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {
-	return cfs_rq->throttled;
+	return cfs_bandwidth_enabled() && cfs_rq->throttled;
 }
 
 /* check whether cfs_rq, or any parent, is throttled */
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 {
-	return cfs_rq->throttle_count;
+	return cfs_bandwidth_enabled() && cfs_rq->throttle_count;
 }
 
 /*
@@ -1765,6 +1765,9 @@ static void __return_cfs_rq_runtime(stru
 
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
+	if (!cfs_bandwidth_enabled())
+		return;
+
 	if (!cfs_rq->runtime_enabled || !cfs_rq->nr_running)
 		return;
 
@@ -1810,6 +1813,9 @@ static void do_sched_cfs_slack_timer(str
  */
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
 {
+	if (!cfs_bandwidth_enabled())
+		return;
+
 	/* an active group must be handled by the update_curr()->put() path */
 	if (!cfs_rq->runtime_enabled || cfs_rq->curr)
 		return;
@@ -1827,6 +1833,9 @@ static void check_enqueue_throttle(struc
 /* conditionally throttle active cfs_rq's from put_prev_entity() */
 static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
+	if (!cfs_bandwidth_enabled())
+		return;
+
 	if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
 		return;
 



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [patch 18/18] sched: add documentation for bandwidth control
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (16 preceding siblings ...)
  2011-07-21 16:43 ` [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive Paul Turner
@ 2011-07-21 16:43 ` Paul Turner
  2011-08-14 16:41   ` [tip:sched/core] sched: Add " tip-bot for Bharata B Rao
  2011-07-21 23:01 ` [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (2 subsequent siblings)
  20 siblings, 1 reply; 60+ messages in thread
From: Paul Turner @ 2011-07-21 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

[-- Attachment #1: sched-bwc-documentation.patch --]
[-- Type: text/plain, Size: 5488 bytes --]

From: Bharata B Rao <bharata@linux.vnet.ibm.com>

Basic description of usage and effect for CFS Bandwidth Control.

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: Paul Turner <pjt@google.com>
---
 Documentation/scheduler/sched-bwc.txt |  122 ++++++++++++++++++++++++++++++++++
 1 file changed, 122 insertions(+)

Index: tip/Documentation/scheduler/sched-bwc.txt
===================================================================
--- /dev/null
+++ tip/Documentation/scheduler/sched-bwc.txt
@@ -0,0 +1,122 @@
+CFS Bandwidth Control
+=====================
+
+[ This document only discusses CPU bandwidth control for SCHED_NORMAL.
+  The SCHED_RT case is covered in Documentation/scheduler/sched-rt-group.txt ]
+
+CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
+specification of the maximum CPU bandwidth available to a group or hierarchy.
+
+The bandwidth allowed for a group is specified using a quota and period. Within
+each given "period" (microseconds), a group is allowed to consume only up to
+"quota" microseconds of CPU time.  When the CPU bandwidth consumption of a
+group exceeds this limit (for that period), the tasks belonging to its
+hierarchy will be throttled and are not allowed to run again until the next
+period.
+
+A group's unused runtime is globally tracked, being refreshed with quota units
+above at each period boundary.  As threads consume this bandwidth it is
+transferred to cpu-local "silos" on a demand basis.  The amount transferred
+within each of these updates is tunable and described as the "slice".
+
+Management
+----------
+Quota and period are managed within the cpu subsystem via cgroupfs.
+
+cpu.cfs_quota_us: the total available run-time within a period (in microseconds)
+cpu.cfs_period_us: the length of a period (in microseconds)
+cpu.stat: exports throttling statistics [explained further below]
+
+The default values are:
+	cpu.cfs_period_us=100ms
+	cpu.cfs_quota_us=-1
+
+A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
+bandwidth restriction in place; such a group is described as an unconstrained
+bandwidth group.  This represents the traditional work-conserving behavior for
+CFS.
+
+Writing any (valid) positive value(s) will enact the specified bandwidth limit.
+The minimum value allowed for either quota or period is 1ms.  There is also an
+upper bound on the period length of 1s.  Additional restrictions exist when
+bandwidth limits are used in a hierarchical fashion, these are explained in
+more detail below.
+
+Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit
+and return the group to an unconstrained state once more.
+
+Any updates to a group's bandwidth specification will result in it becoming
+unthrottled if it is in a constrained state.
+
+System wide settings
+--------------------
+For efficiency run-time is transferred between the global pool and CPU local
+"silos" in a batch fashion.  This greatly reduces global accounting pressure
+on large systems.  The amount transferred each time such an update is required
+is described as the "slice".
+
+This is tunable via procfs:
+	/proc/sys/kernel/sched_cfs_bandwidth_slice_us (default=5ms)
+
+Larger slice values will reduce transfer overheads, while smaller values allow
+for more fine-grained consumption.
+
+Statistics
+----------
+A group's bandwidth statistics are exported via 3 fields in cpu.stat.
+
+cpu.stat:
+- nr_periods: Number of enforcement intervals that have elapsed.
+- nr_throttled: Number of times the group has been throttled/limited.
+- throttled_time: The total time duration (in nanoseconds) for which entities
+  of the group have been throttled.
+
+This interface is read-only.
+
+Hierarchical considerations
+---------------------------
+The interface enforces that an individual entity's bandwidth is always
+attainable, that is: max(c_i) <= C. However, over-subscription in the
+aggregate case is explicitly allowed to enable work-conserving semantics
+within a hierarchy.
+  e.g. \Sum (c_i) may exceed C
+[ Where C is the parent's bandwidth, and c_i its children ]
+
+
+There are two ways in which a group may become throttled:
+	a. it fully consumes its own quota within a period
+	b. a parent's quota is fully consumed within its period
+
+In case b) above, even though the child may have runtime remaining it will not
+be allowed to run until the parent's runtime is refreshed.
+
+Examples
+--------
+1. Limit a group to 1 CPU worth of runtime.
+
+	If period is 250ms and quota is also 250ms, the group will get
+	1 CPU worth of runtime every 250ms.
+
+	# echo 250000 > cpu.cfs_quota_us /* quota = 250ms */
+	# echo 250000 > cpu.cfs_period_us /* period = 250ms */
+
+2. Limit a group to 2 CPUs worth of runtime on a multi-CPU machine.
+
+	With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
+	runtime every 500ms.
+
+	# echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
+	# echo 500000 > cpu.cfs_period_us /* period = 500ms */
+
+	The larger period here allows for increased burst capacity.
+
+3. Limit a group to 20% of 1 CPU.
+
+	With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU.
+
+	# echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
+	# echo 50000 > cpu.cfs_period_us /* period = 50ms */
+
+	By using a small period here we are ensuring a consistent latency
+	response at the expense of burst capacity.
+
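
[ Not part of the patch above: a minimal, hypothetical consumer of the
  read-only cpu.stat interface documented here.  The cgroup path
  (/cgroup/cpu/test) is an assumption carried over from the test setup used
  earlier in this thread; the three field names are those listed in the
  Statistics section. ]

	#include <stdio.h>

	int main(void)
	{
		FILE *f = fopen("/cgroup/cpu/test/cpu.stat", "r");
		char name[32];
		unsigned long long val;

		if (!f) {
			perror("cpu.stat");
			return 1;
		}

		/* expected fields: nr_periods, nr_throttled, throttled_time (ns) */
		while (fscanf(f, "%31s %llu", name, &val) == 2)
			printf("%s = %llu\n", name, val);

		fclose(f);
		return 0;
	}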


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 00/18] CFS Bandwidth Control v7.2
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (17 preceding siblings ...)
  2011-07-21 16:43 ` [patch 18/18] sched: add documentation for bandwidth control Paul Turner
@ 2011-07-21 23:01 ` Paul Turner
  2011-07-25 14:58 ` Peter Zijlstra
  2011-09-13 12:10 ` Vladimir Davydov
  20 siblings, 0 replies; 60+ messages in thread
From: Paul Turner @ 2011-07-21 23:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

On Thu, Jul 21, 2011 at 9:43 AM, Paul Turner <pjt@google.com> wrote:
> Hi all,
>
> Please find attached the incremental v7.2 for bandwidth control.
>
> This release follows a fairly intensive period of scraping cycles across
> various configurations.  Unfortunately we seem to be currently taking an IPC
> hit for jump_labels (despite a savings in branches/instr. ret) which despite
> fairly extensive digging I don't have a good explanation for.  The emitted
> assembly /looks/ ok, but cycles/wall time is consistently higher across several
> platforms.
>
> As such I've demoted the jumppatch to [RFT] while these details are worked
> out.  But there's no point in holding up the rest of the series any more.
>
> [ Please find the specific discussion related to the above attached to patch
> 17/18. ]
>
> So -- without jump labels -- the current performance looks like:
>
>                            instructions            cycles                  branches
> ---------------------------------------------------------------------------------------------
> clovertown [!BWC]           843695716               965744453               151224759
> +unconstrained              845934117 (+0.27)       974222228 (+0.88)       152715407 (+0.99)
> +10000000000/1000:          855102086 (+1.35)       978728348 (+1.34)       154495984 (+2.16)
> +10000000000/1000000:       853981660 (+1.22)       976344561 (+1.10)       154287243 (+2.03)
>
> barcelona [!BWC]            810514902               761071312               145351489
> +unconstrained              820573353 (+1.24)       748178486 (-1.69)       148161233 (+1.93)
> +10000000000/1000:          827963132 (+2.15)       757829815 (-0.43)       149611950 (+2.93)
> +10000000000/1000000:       827701516 (+2.12)       753575001 (-0.98)       149568284 (+2.90)
>
> westmere [!BWC]             792513879               702882443               143267136
> +unconstrained              802533191 (+1.26)       694415157 (-1.20)       146071233 (+1.96)
> +10000000000/1000:          809861594 (+2.19)       701781996 (-0.16)       147520953 (+2.97)
> +10000000000/1000000:       809752541 (+2.18)       705278419 (+0.34)       147502154 (+2.96)
>
> Under the workload:
>  mkdir -p /cgroup/cpu/test
>  echo $$ > /dev/cgroup/cpu/test (only cpu,cpuacct mounted)
>  (W1) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "for ((i=0;i<5;i++)); do $(dirname $0)/pipe-test 20000; done"
>
> This may seem a strange work-load but it works around some bizarro overheads
> currently introduced by perf.  Comparing for example with::w
>  (W2)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;true"
>  (W3)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;"
>
>
> We see:

(Sorry this is missing an "instructions,cycles,branches,elapsed time" header.)

>  (W1)  westmere [!BWC]             792513879               702882443               143267136             0.197246943
>  (W2)  westmere [!BWC]             912241728               772576786               165734252             0.214923134
>  (W3)  westmere [!BWC]             904349725               882084726               162577399             0.748506065
>
> vs an 'ideal' total exec time of (approximately):
> $ time taskset -c 0 ./pipe-test 100000
>  real    0m0.198s user    0m0.007s sys    0m0.095s
>
> The overhead in W2 is explained by the fact that, when invoking pipe-test
> directly, one of the siblings becomes the perf_ctx parent, incurring lots of
> pain every time we switch.  I do not have a reasonable explanation as to why
> (W1) is so much cheaper than (W2); I stumbled across it by accident when I was
> trying some combinations to reduce the <perf stat>-to-<perf stat> variance.
>
> v7.2
> -----------
> - Build errors in !CGROUP_SCHED case fixed
> - !CONFIG_SMP now 'supported' (#ifdef munging)
> - gcc was failing to inline account_cfs_rq_runtime, affecting performance
> - checks in expire_cfs_rq_runtime() and check_enqueue_throttle() re-organized
>  to save branches.
> - jump labels introduced in the case BWC is not being used system-wide to
>  reduce inert overhead.
> - branch saved in expiring runtime (reorganized conditionals)
>
> Hidetoshi, the following patchsets have changed enough to necessitate tweaking
> of your Reviewed-by:
> [patch 09/18] sched: add support for unthrottling group entities (extensive)
> [patch 11/18] sched: prevent interactions with throttled entities (update_cfs_shares)
> [patch 12/18] sched: prevent buddy interactions with throttled entities (new)
>
>
> Previous postings:
> -----------------
> v7.1: https://lkml.org/lkml/2011/7/7/24
> v7: http://lkml.org/lkml/2011/6/21/43
> v6: http://lkml.org/lkml/2011/5/7/37
> v5: http://lkml.org/lkml/2011/3/22/477
> v4: http://lkml.org/lkml/2011/2/23/44
> v3: http://lkml.org/lkml/2010/10/12/44
> v2: http://lkml.org/lkml/2010/4/28/88
> Original posting: http://lkml.org/lkml/2010/2/12/393
>
> Prior approaches: http://lkml.org/lkml/2010/1/5/44 ["CFS Hard limits v5"]
>
> Thanks,
>
> - Paul
>
>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 01/18] sched: (fixlet) dont update shares twice on on_rq parent
  2011-07-21 16:43 ` [patch 01/18] sched: (fixlet) dont update shares twice on on_rq parent Paul Turner
@ 2011-07-22 11:06   ` Kamalesh Babulal
  0 siblings, 0 replies; 60+ messages in thread
From: Kamalesh Babulal @ 2011-07-22 11:06 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

* Paul Turner <pjt@google.com> [2011-07-21 09:43:26]:

> In dequeue_task_fair() we bail on dequeue when we encounter a parenting entity
> with additional weight.  However, we perform a double shares update on this
> entity as we continue the shares update traversal from this point, despite
> dequeue_entity() having already updated its queuing cfs_rq.
> Avoid this by starting from the parent when we resume.
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> ---
>  kernel/sched_fair.c |    3 +++
>  1 file changed, 3 insertions(+)
> 
> Index: tip/kernel/sched_fair.c
> ===================================================================
> --- tip.orig/kernel/sched_fair.c
> +++ tip/kernel/sched_fair.c
> @@ -1370,6 +1370,9 @@ static void dequeue_task_fair(struct rq 
>  			 */
>  			if (task_sleep && parent_entity(se))
>  				set_next_buddy(parent_entity(se));
> +
> +			/* avoid re-evaluating load for this entity */
> +			se = parent_entity(se);
>  			break;
>  		}
>  		flags |= DEQUEUE_SLEEP;
> 
this patch has been merged into tip

commit 9598c82dcacadc3b9daa8170613fd054c6124d30
Author: Paul Turner <pjt@google.com>
Date:   Wed Jul 6 22:30:37 2011 -0700

    sched: Don't update shares twice on on_rq parent


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 03/18] sched: introduce primitives to account for CFS bandwidth tracking
  2011-07-21 16:43 ` [patch 03/18] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
@ 2011-07-22 11:14   ` Kamalesh Babulal
  2011-08-14 16:17   ` [tip:sched/core] sched: Introduce " tip-bot for Paul Turner
  1 sibling, 0 replies; 60+ messages in thread
From: Kamalesh Babulal @ 2011-07-22 11:14 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron,
	Nikhil Rao

* Paul Turner <pjt@google.com> [2011-07-21 09:43:28]:

> In this patch we introduce the notion of CFS bandwidth, partitioned into
> globally unassigned bandwidth, and locally claimed bandwidth.
> 
> - The global bandwidth is per task_group, it represents a pool of unclaimed
>   bandwidth that cfs_rqs can allocate from.  
> - The local bandwidth is tracked per-cfs_rq, this represents allotments from
>   the global pool bandwidth assigned to a specific cpu.
> 
> Bandwidth is managed via cgroupfs, adding two new interfaces to the cpu subsystem:
> - cpu.cfs_period_us : the bandwidth period in usecs
> - cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg will be allowed
>   to consume over period above.
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> Signed-off-by: Nikhil Rao <ncrao@google.com>
> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
> 
> ---
>  init/Kconfig        |   12 +++
>  kernel/sched.c      |  196 ++++++++++++++++++++++++++++++++++++++++++++++++++--
>  kernel/sched_fair.c |   16 ++++
>  3 files changed, 220 insertions(+), 4 deletions(-)
> 
> Index: tip/init/Kconfig
> ===================================================================
> --- tip.orig/init/Kconfig
> +++ tip/init/Kconfig
> @@ -715,6 +715,18 @@ config FAIR_GROUP_SCHED
>  	depends on CGROUP_SCHED
>  	default CGROUP_SCHED
> 
> +config CFS_BANDWIDTH
> +	bool "CPU bandwidth provisioning for FAIR_GROUP_SCHED"
> +	depends on EXPERIMENTAL
> +	depends on FAIR_GROUP_SCHED
> +	default n
> +	help
> +	  This option allows users to define CPU bandwidth rates (limits) for
> +	  tasks running within the fair group scheduler.  Groups with no limit
> +	  set are considered to be unconstrained and will run with no
> +	  restriction.
> +	  See tip/Documentation/scheduler/sched-bwc.txt for more information.
> +
>  config RT_GROUP_SCHED
>  	bool "Group scheduling for SCHED_RR/FIFO"
>  	depends on EXPERIMENTAL
> Index: tip/kernel/sched.c
> ===================================================================
> --- tip.orig/kernel/sched.c
> +++ tip/kernel/sched.c
> @@ -244,6 +244,14 @@ struct cfs_rq;
> 
>  static LIST_HEAD(task_groups);
> 
> +struct cfs_bandwidth {
> +#ifdef CONFIG_CFS_BANDWIDTH
> +	raw_spinlock_t lock;
> +	ktime_t period;
> +	u64 quota;
> +#endif
> +};
> +
>  /* task group related information */
>  struct task_group {
>  	struct cgroup_subsys_state css;
> @@ -275,6 +283,8 @@ struct task_group {
>  #ifdef CONFIG_SCHED_AUTOGROUP
>  	struct autogroup *autogroup;
>  #endif
> +
> +	struct cfs_bandwidth cfs_bandwidth;
>  };
> 
>  /* task_group_lock serializes the addition/removal of task groups */
> @@ -374,9 +384,48 @@ struct cfs_rq {
> 
>  	unsigned long load_contribution;
>  #endif
> +#ifdef CONFIG_CFS_BANDWIDTH
> +	int runtime_enabled;
> +	s64 runtime_remaining;
> +#endif
>  #endif
>  };
> 
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +#ifdef CONFIG_CFS_BANDWIDTH
> +static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> +{
> +	return &tg->cfs_bandwidth;
> +}
> +
> +static inline u64 default_cfs_period(void);
> +
> +static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> +{
> +	raw_spin_lock_init(&cfs_b->lock);
> +	cfs_b->quota = RUNTIME_INF;
> +	cfs_b->period = ns_to_ktime(default_cfs_period());
> +}
> +
> +static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> +{
> +	cfs_rq->runtime_enabled = 0;
> +}
> +
> +static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> +{}
> +#else
> +static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
> +static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
> +static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
> +
> +static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> +{
> +	return NULL;
> +}
> +#endif /* CONFIG_CFS_BANDWIDTH */
> +#endif /* CONFIG_FAIR_GROUP_SCHED */
> +
>  /* Real-Time classes' related field in a runqueue: */
>  struct rt_rq {
>  	struct rt_prio_array active;
> @@ -7795,6 +7844,7 @@ static void init_tg_cfs_entry(struct tas
>  	tg->cfs_rq[cpu] = cfs_rq;
>  	init_cfs_rq(cfs_rq, rq);
>  	cfs_rq->tg = tg;
> +	init_cfs_rq_runtime(cfs_rq);

this hunk fails to apply, due to the changes introduced by
acb5a9ba3bd7 in the tip tree.
> 
>  	tg->se[cpu] = se;
>  	/* se could be NULL for root_task_group */
> @@ -7930,6 +7980,7 @@ void __init sched_init(void)
>  		 * We achieve this by letting root_task_group's tasks sit
>  		 * directly in rq->cfs (i.e root_task_group->se[] = NULL).
>  		 */
> +		init_cfs_bandwidth(&root_task_group.cfs_bandwidth);
>  		init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL);
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
> 
> @@ -8171,6 +8222,8 @@ static void free_fair_sched_group(struct
>  {
>  	int i;
> 
> +	destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
> +
>  	for_each_possible_cpu(i) {
>  		if (tg->cfs_rq)
>  			kfree(tg->cfs_rq[i]);
> @@ -8198,6 +8251,8 @@ int alloc_fair_sched_group(struct task_g
> 
>  	tg->shares = NICE_0_LOAD;
> 
> +	init_cfs_bandwidth(tg_cfs_bandwidth(tg));
> +
>  	for_each_possible_cpu(i) {
>  		cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
>  				      GFP_KERNEL, cpu_to_node(i));
> @@ -8569,7 +8624,7 @@ static int __rt_schedulable(struct task_
>  	return walk_tg_tree(tg_schedulable, tg_nop, &data);
>  }
> 
> -static int tg_set_bandwidth(struct task_group *tg,
> +static int tg_set_rt_bandwidth(struct task_group *tg,
>  		u64 rt_period, u64 rt_runtime)
>  {
>  	int i, err = 0;
> @@ -8608,7 +8663,7 @@ int sched_group_set_rt_runtime(struct ta
>  	if (rt_runtime_us < 0)
>  		rt_runtime = RUNTIME_INF;
> 
> -	return tg_set_bandwidth(tg, rt_period, rt_runtime);
> +	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
>  }
> 
>  long sched_group_rt_runtime(struct task_group *tg)
> @@ -8633,7 +8688,7 @@ int sched_group_set_rt_period(struct tas
>  	if (rt_period == 0)
>  		return -EINVAL;
> 
> -	return tg_set_bandwidth(tg, rt_period, rt_runtime);
> +	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
>  }
> 
>  long sched_group_rt_period(struct task_group *tg)
> @@ -8823,6 +8878,128 @@ static u64 cpu_shares_read_u64(struct cg
> 
>  	return (u64) scale_load_down(tg->shares);
>  }
> +
> +#ifdef CONFIG_CFS_BANDWIDTH
> +const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
> +const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
> +
> +static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
> +{
> +	int i;
> +	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> +	static DEFINE_MUTEX(mutex);
> +
> +	if (tg == &root_task_group)
> +		return -EINVAL;
> +
> +	/*
> +	 * Ensure we have at some amount of bandwidth every period.  This is
> +	 * to prevent reaching a state of large arrears when throttled via
> +	 * entity_tick() resulting in prolonged exit starvation.
> +	 */
> +	if (quota < min_cfs_quota_period || period < min_cfs_quota_period)
> +		return -EINVAL;
> +
> +	/*
> +	 * Likewise, bound things on the otherside by preventing insane quota
> +	 * periods.  This also allows us to normalize in computing quota
> +	 * feasibility.
> +	 */
> +	if (period > max_cfs_quota_period)
> +		return -EINVAL;
> +
> +	mutex_lock(&mutex);
> +	raw_spin_lock_irq(&cfs_b->lock);
> +	cfs_b->period = ns_to_ktime(period);
> +	cfs_b->quota = quota;
> +	raw_spin_unlock_irq(&cfs_b->lock);
> +
> +	for_each_possible_cpu(i) {
> +		struct cfs_rq *cfs_rq = tg->cfs_rq[i];
> +		struct rq *rq = rq_of(cfs_rq);
> +
> +		raw_spin_lock_irq(&rq->lock);
> +		cfs_rq->runtime_enabled = quota != RUNTIME_INF;
> +		cfs_rq->runtime_remaining = 0;
> +		raw_spin_unlock_irq(&rq->lock);
> +	}
> +	mutex_unlock(&mutex);
> +
> +	return 0;
> +}
> +
> +int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
> +{
> +	u64 quota, period;
> +
> +	period = ktime_to_ns(tg_cfs_bandwidth(tg)->period);
> +	if (cfs_quota_us < 0)
> +		quota = RUNTIME_INF;
> +	else
> +		quota = (u64)cfs_quota_us * NSEC_PER_USEC;
> +
> +	return tg_set_cfs_bandwidth(tg, period, quota);
> +}
> +
> +long tg_get_cfs_quota(struct task_group *tg)
> +{
> +	u64 quota_us;
> +
> +	if (tg_cfs_bandwidth(tg)->quota == RUNTIME_INF)
> +		return -1;
> +
> +	quota_us = tg_cfs_bandwidth(tg)->quota;
> +	do_div(quota_us, NSEC_PER_USEC);
> +
> +	return quota_us;
> +}
> +
> +int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
> +{
> +	u64 quota, period;
> +
> +	period = (u64)cfs_period_us * NSEC_PER_USEC;
> +	quota = tg_cfs_bandwidth(tg)->quota;
> +
> +	if (period <= 0)
> +		return -EINVAL;
> +
> +	return tg_set_cfs_bandwidth(tg, period, quota);
> +}
> +
> +long tg_get_cfs_period(struct task_group *tg)
> +{
> +	u64 cfs_period_us;
> +
> +	cfs_period_us = ktime_to_ns(tg_cfs_bandwidth(tg)->period);
> +	do_div(cfs_period_us, NSEC_PER_USEC);
> +
> +	return cfs_period_us;
> +}
> +
> +static s64 cpu_cfs_quota_read_s64(struct cgroup *cgrp, struct cftype *cft)
> +{
> +	return tg_get_cfs_quota(cgroup_tg(cgrp));
> +}
> +
> +static int cpu_cfs_quota_write_s64(struct cgroup *cgrp, struct cftype *cftype,
> +				s64 cfs_quota_us)
> +{
> +	return tg_set_cfs_quota(cgroup_tg(cgrp), cfs_quota_us);
> +}
> +
> +static u64 cpu_cfs_period_read_u64(struct cgroup *cgrp, struct cftype *cft)
> +{
> +	return tg_get_cfs_period(cgroup_tg(cgrp));
> +}
> +
> +static int cpu_cfs_period_write_u64(struct cgroup *cgrp, struct cftype *cftype,
> +				u64 cfs_period_us)
> +{
> +	return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
> +}
> +
> +#endif /* CONFIG_CFS_BANDWIDTH */
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
> 
>  #ifdef CONFIG_RT_GROUP_SCHED
> @@ -8857,6 +9034,18 @@ static struct cftype cpu_files[] = {
>  		.write_u64 = cpu_shares_write_u64,
>  	},
>  #endif
> +#ifdef CONFIG_CFS_BANDWIDTH
> +	{
> +		.name = "cfs_quota_us",
> +		.read_s64 = cpu_cfs_quota_read_s64,
> +		.write_s64 = cpu_cfs_quota_write_s64,
> +	},
> +	{
> +		.name = "cfs_period_us",
> +		.read_u64 = cpu_cfs_period_read_u64,
> +		.write_u64 = cpu_cfs_period_write_u64,
> +	},
> +#endif
>  #ifdef CONFIG_RT_GROUP_SCHED
>  	{
>  		.name = "rt_runtime_us",
> @@ -9166,4 +9355,3 @@ struct cgroup_subsys cpuacct_subsys = {
>  	.subsys_id = cpuacct_subsys_id,
>  };
>  #endif	/* CONFIG_CGROUP_CPUACCT */
> -
> Index: tip/kernel/sched_fair.c
> ===================================================================
> --- tip.orig/kernel/sched_fair.c
> +++ tip/kernel/sched_fair.c
> @@ -1256,6 +1256,22 @@ entity_tick(struct cfs_rq *cfs_rq, struc
>  		check_preempt_tick(cfs_rq, curr);
>  }
> 
> +
> +/**************************************************
> + * CFS bandwidth control machinery
> + */
> +
> +#ifdef CONFIG_CFS_BANDWIDTH
> +/*
> + * default period for cfs group bandwidth.
> + * default: 0.1s, units: nanoseconds
> + */
> +static inline u64 default_cfs_period(void)
> +{
> +	return 100000000ULL;
> +}
> +#endif
> +
>  /**************************************************
>   * CFS operations on tasks:
>   */
> 
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 11/18] sched: prevent interactions with throttled entities
  2011-07-21 16:43 ` [patch 11/18] sched: prevent interactions with throttled entities Paul Turner
@ 2011-07-22 11:26   ` Kamalesh Babulal
  2011-07-22 11:37     ` Peter Zijlstra
  2011-07-22 11:41   ` Kamalesh Babulal
  2011-08-14 16:30   ` [tip:sched/core] sched: Prevent " tip-bot for Paul Turner
  2 siblings, 1 reply; 60+ messages in thread
From: Kamalesh Babulal @ 2011-07-22 11:26 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

* Paul Turner <pjt@google.com> [2011-07-21 09:43:36]:

> From the perspective of load-balance and shares distribution, throttled
> entities should be invisible.
> 
> However, both of these operations work on 'active' lists and are not
> inherently aware of what group hierarchies may be present.  In some cases this
> may be side-stepped (e.g. we could sideload via tg_load_down in load balance) 
> while in others (e.g. update_shares()) it is more difficult to compute without
> incurring some O(n^2) costs.
> 
> Instead, track hierarchical throttled state at time of transition.  This
> allows us to easily identify whether an entity belongs to a throttled hierarchy
> and avoid incorrect interactions with it.
> 
> Also, when an entity leaves a throttled hierarchy we need to advance its
> time averaging for shares averaging so that the elapsed throttled time is not
> considered as part of the cfs_rq's operation.
> 
> We also use this information to prevent buddy interactions in the wakeup and
> yield_to() paths.
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
> 
> ---
>  kernel/sched.c      |    2 -
>  kernel/sched_fair.c |   99 ++++++++++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 94 insertions(+), 7 deletions(-)
> 
> Index: tip/kernel/sched_fair.c
> ===================================================================
> --- tip.orig/kernel/sched_fair.c
> +++ tip/kernel/sched_fair.c
> @@ -725,6 +725,8 @@ account_entity_dequeue(struct cfs_rq *cf
>  }
> 
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> +/* we need this in update_cfs_load and load-balance functions below */
> +static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
>  # ifdef CONFIG_SMP
>  static void update_cfs_rq_load_contribution(struct cfs_rq *cfs_rq,
>  					    int global_update)
> @@ -747,7 +749,7 @@ static void update_cfs_load(struct cfs_r
>  	u64 now, delta;
>  	unsigned long load = cfs_rq->load.weight;
> 
> -	if (cfs_rq->tg == &root_task_group)
> +	if (cfs_rq->tg == &root_task_group || throttled_hierarchy(cfs_rq))
>  		return;
> 
>  	now = rq_of(cfs_rq)->clock_task;
> @@ -856,7 +858,7 @@ static void update_cfs_shares(struct cfs
> 
>  	tg = cfs_rq->tg;
>  	se = tg->se[cpu_of(rq_of(cfs_rq))];
> -	if (!se)
> +	if (!se || throttled_hierarchy(cfs_rq))
>  		return;
>  #ifndef CONFIG_SMP
>  	if (likely(se->load.weight == tg->shares))
> @@ -1425,6 +1427,65 @@ static inline int cfs_rq_throttled(struc
>  	return cfs_rq->throttled;
>  }
> 
> +/* check whether cfs_rq, or any parent, is throttled */
> +static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
> +{
> +	return cfs_rq->throttle_count;
> +}
> +
> +/*
> + * Ensure that neither of the group entities corresponding to src_cpu or
> + * dest_cpu are members of a throttled hierarchy when performing group
> + * load-balance operations.
> + */
> +static inline int throttled_lb_pair(struct task_group *tg,
> +				    int src_cpu, int dest_cpu)
> +{
> +	struct cfs_rq *src_cfs_rq, *dest_cfs_rq;
> +
> +	src_cfs_rq = tg->cfs_rq[src_cpu];
> +	dest_cfs_rq = tg->cfs_rq[dest_cpu];
> +
> +	return throttled_hierarchy(src_cfs_rq) ||
> +	       throttled_hierarchy(dest_cfs_rq);
> +}
> +
> +/* updated child weight may affect parent so we have to do this bottom up */
> +static int tg_unthrottle_up(struct task_group *tg, void *data)
> +{
> +	struct rq *rq = data;
> +	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
> +
> +	cfs_rq->throttle_count--;
> +#ifdef CONFIG_SMP
> +	if (!cfs_rq->throttle_count) {
> +		u64 delta = rq->clock_task - cfs_rq->load_stamp;
> +
> +		/* leaving throttled state, advance shares averaging windows */
> +		cfs_rq->load_stamp += delta;
> +		cfs_rq->load_last += delta;
> +
> +		/* update entity weight now that we are on_rq again */
> +		update_cfs_shares(cfs_rq);
> +	}
> +#endif
> +
> +	return 0;
> +}
> +
> +static int tg_throttle_down(struct task_group *tg, void *data)
> +{
> +	struct rq *rq = data;
> +	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
> +
> +	/* group is entering throttled state, record last load */
> +	if (!cfs_rq->throttle_count)
> +		update_cfs_load(cfs_rq, 0);
> +	cfs_rq->throttle_count++;
> +
> +	return 0;
> +}
> +
>  static __used void throttle_cfs_rq(struct cfs_rq *cfs_rq)
>  {
>  	struct rq *rq = rq_of(cfs_rq);
> @@ -1435,7 +1496,9 @@ static __used void throttle_cfs_rq(struc
>  	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
> 
>  	/* account load preceding throttle */
> -	update_cfs_load(cfs_rq, 0);
> +	rcu_read_lock();
> +	walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
> +	rcu_read_unlock();
> 
>  	task_delta = cfs_rq->h_nr_running;
>  	for_each_sched_entity(se) {
> @@ -1476,6 +1539,10 @@ static void unthrottle_cfs_rq(struct cfs
>  	list_del_rcu(&cfs_rq->throttled_list);
>  	raw_spin_unlock(&cfs_b->lock);
> 
> +	update_rq_clock(rq);
> +	/* update hierarchical throttle state */
> +	walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);
> +
>  	if (!cfs_rq->load.weight)
>  		return;
> 
> @@ -1620,6 +1687,17 @@ static inline int cfs_rq_throttled(struc
>  {
>  	return 0;
>  }
> +
> +static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
> +{
> +	return 0;
> +}
> +
> +static inline int throttled_lb_pair(struct task_group *tg,
> +				    int src_cpu, int dest_cpu)
> +{
> +	return 0;
> +}
>  #endif
> 
>  /**************************************************
> @@ -2519,6 +2597,9 @@ move_one_task(struct rq *this_rq, int th
> 
>  	for_each_leaf_cfs_rq(busiest, cfs_rq) {
>  		list_for_each_entry_safe(p, n, &cfs_rq->tasks, se.group_node) {
> +			if (throttled_lb_pair(task_group(p),
> +					      busiest->cpu, this_cpu))
> +				break;
> 
>  			if (!can_migrate_task(p, busiest, this_cpu,
>  						sd, idle, &pinned))
> @@ -2630,8 +2711,13 @@ static void update_shares(int cpu)
>  	struct rq *rq = cpu_rq(cpu);
> 
>  	rcu_read_lock();
> -	for_each_leaf_cfs_rq(rq, cfs_rq)
> +	for_each_leaf_cfs_rq(rq, cfs_rq) {

this hunk fails to apply over the latest tip, due to the comments
introduced by 9763b67fb9f3050.

> +		/* throttled entities do not contribute to load */
> +		if (throttled_hierarchy(cfs_rq))
> +			continue;
> +
>  		update_shares_cpu(cfs_rq->tg, cpu);
> +	}
>  	rcu_read_unlock();
>  }
> 
> @@ -2655,9 +2741,10 @@ load_balance_fair(struct rq *this_rq, in
>  		u64 rem_load, moved_load;
> 
>  		/*
> -		 * empty group
> +		 * empty group or part of a throttled hierarchy
>  		 */
> -		if (!busiest_cfs_rq->task_weight)
> +		if (!busiest_cfs_rq->task_weight ||
> +		    throttled_lb_pair(tg, busiest_cpu, this_cpu))
>  			continue;
> 
>  		rem_load = (u64)rem_load_move * busiest_weight;
> Index: tip/kernel/sched.c
> ===================================================================
> --- tip.orig/kernel/sched.c
> +++ tip/kernel/sched.c
> @@ -399,7 +399,7 @@ struct cfs_rq {
>  	u64 runtime_expires;
>  	s64 runtime_remaining;
> 
> -	int throttled;
> +	int throttled, throttle_count;
>  	struct list_head throttled_list;
>  #endif
>  #endif
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 11/18] sched: prevent interactions with throttled entities
  2011-07-22 11:26   ` Kamalesh Babulal
@ 2011-07-22 11:37     ` Peter Zijlstra
  0 siblings, 0 replies; 60+ messages in thread
From: Peter Zijlstra @ 2011-07-22 11:37 UTC (permalink / raw)
  To: Kamalesh Babulal
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

On Fri, 2011-07-22 at 16:56 +0530, Kamalesh Babulal wrote:
> > @@ -2630,8 +2711,13 @@ static void update_shares(int cpu)
> >       struct rq *rq = cpu_rq(cpu);
> > 
> >       rcu_read_lock();
> > -     for_each_leaf_cfs_rq(rq, cfs_rq)
> > +     for_each_leaf_cfs_rq(rq, cfs_rq) {
> 
> this hunk fails to apply over the latest tip, due to the comments
> introduced by 9763b67fb9f3050.

please trim your replies.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 11/18] sched: prevent interactions with throttled entities
  2011-07-21 16:43 ` [patch 11/18] sched: prevent interactions with throttled entities Paul Turner
  2011-07-22 11:26   ` Kamalesh Babulal
@ 2011-07-22 11:41   ` Kamalesh Babulal
  2011-07-22 11:43     ` Peter Zijlstra
  2011-08-14 16:30   ` [tip:sched/core] sched: Prevent " tip-bot for Paul Turner
  2 siblings, 1 reply; 60+ messages in thread
From: Kamalesh Babulal @ 2011-07-22 11:41 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

* Paul Turner <pjt@google.com> [2011-07-21 09:43:36]:

> From the perspective of load-balance and shares distribution, throttled
> entities should be invisible.
> 
> However, both of these operations work on 'active' lists and are not
> inherently aware of what group hierarchies may be present.  In some cases this
> may be side-stepped (e.g. we could sideload via tg_load_down in load balance) 
> while in others (e.g. update_shares()) it is more difficult to compute without
> incurring some O(n^2) costs.
> 
> Instead, track hierarchical throttled state at time of transition.  This
> allows us to easily identify whether an entity belongs to a throttled hierarchy
> and avoid incorrect interactions with it.
> 
> Also, when an entity leaves a throttled hierarchy we need to advance its
> time averaging for shares averaging so that the elapsed throttled time is not
> considered as part of the cfs_rq's operation.
> 
> We also use this information to prevent buddy interactions in the wakeup and
> yield_to() paths.
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
> 
> ---
>  kernel/sched.c      |    2 -
>  kernel/sched_fair.c |   99 ++++++++++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 94 insertions(+), 7 deletions(-)
> 
> Index: tip/kernel/sched_fair.c
> ===================================================================
> --- tip.orig/kernel/sched_fair.c
> +++ tip/kernel/sched_fair.c
> @@ -725,6 +725,8 @@ account_entity_dequeue(struct cfs_rq *cf
>  }
> 
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> +/* we need this in update_cfs_load and load-balance functions below */
> +static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
>  # ifdef CONFIG_SMP
>  static void update_cfs_rq_load_contribution(struct cfs_rq *cfs_rq,
>  					    int global_update)
> @@ -747,7 +749,7 @@ static void update_cfs_load(struct cfs_r
>  	u64 now, delta;
>  	unsigned long load = cfs_rq->load.weight;
> 
> -	if (cfs_rq->tg == &root_task_group)
> +	if (cfs_rq->tg == &root_task_group || throttled_hierarchy(cfs_rq))
>  		return;
> 
>  	now = rq_of(cfs_rq)->clock_task;
> @@ -856,7 +858,7 @@ static void update_cfs_shares(struct cfs
> 
>  	tg = cfs_rq->tg;
>  	se = tg->se[cpu_of(rq_of(cfs_rq))];
> -	if (!se)
> +	if (!se || throttled_hierarchy(cfs_rq))
>  		return;
>  #ifndef CONFIG_SMP
>  	if (likely(se->load.weight == tg->shares))
> @@ -1425,6 +1427,65 @@ static inline int cfs_rq_throttled(struc
>  	return cfs_rq->throttled;
>  }
> 
> +/* check whether cfs_rq, or any parent, is throttled */
> +static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
> +{
> +	return cfs_rq->throttle_count;
> +}
> +
> +/*
> + * Ensure that neither of the group entities corresponding to src_cpu or
> + * dest_cpu are members of a throttled hierarchy when performing group
> + * load-balance operations.
> + */
> +static inline int throttled_lb_pair(struct task_group *tg,
> +				    int src_cpu, int dest_cpu)
> +{
> +	struct cfs_rq *src_cfs_rq, *dest_cfs_rq;
> +
> +	src_cfs_rq = tg->cfs_rq[src_cpu];
> +	dest_cfs_rq = tg->cfs_rq[dest_cpu];
> +
> +	return throttled_hierarchy(src_cfs_rq) ||
> +	       throttled_hierarchy(dest_cfs_rq);
> +}
> +
> +/* updated child weight may affect parent so we have to do this bottom up */
> +static int tg_unthrottle_up(struct task_group *tg, void *data)
> +{
> +	struct rq *rq = data;
> +	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
> +
> +	cfs_rq->throttle_count--;
> +#ifdef CONFIG_SMP
> +	if (!cfs_rq->throttle_count) {
> +		u64 delta = rq->clock_task - cfs_rq->load_stamp;
> +
> +		/* leaving throttled state, advance shares averaging windows */
> +		cfs_rq->load_stamp += delta;
> +		cfs_rq->load_last += delta;
> +
> +		/* update entity weight now that we are on_rq again */
> +		update_cfs_shares(cfs_rq);
> +	}
> +#endif
> +
> +	return 0;
> +}
> +
> +static int tg_throttle_down(struct task_group *tg, void *data)
> +{
> +	struct rq *rq = data;
> +	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
> +
> +	/* group is entering throttled state, record last load */
> +	if (!cfs_rq->throttle_count)
> +		update_cfs_load(cfs_rq, 0);
> +	cfs_rq->throttle_count++;
> +
> +	return 0;
> +}
> +
>  static __used void throttle_cfs_rq(struct cfs_rq *cfs_rq)
>  {
>  	struct rq *rq = rq_of(cfs_rq);
> @@ -1435,7 +1496,9 @@ static __used void throttle_cfs_rq(struc
>  	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
> 
>  	/* account load preceding throttle */
> -	update_cfs_load(cfs_rq, 0);
> +	rcu_read_lock();
> +	walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
> +	rcu_read_unlock();
> 
>  	task_delta = cfs_rq->h_nr_running;
>  	for_each_sched_entity(se) {
> @@ -1476,6 +1539,10 @@ static void unthrottle_cfs_rq(struct cfs
>  	list_del_rcu(&cfs_rq->throttled_list);
>  	raw_spin_unlock(&cfs_b->lock);
> 
> +	update_rq_clock(rq);
> +	/* update hierarchical throttle state */
> +	walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);
> +
>  	if (!cfs_rq->load.weight)
>  		return;
> 
> @@ -1620,6 +1687,17 @@ static inline int cfs_rq_throttled(struc
>  {
>  	return 0;
>  }
> +
> +static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
> +{
> +	return 0;
> +}
> +
> +static inline int throttled_lb_pair(struct task_group *tg,
> +				    int src_cpu, int dest_cpu)
> +{
> +	return 0;
> +}
>  #endif
> 
>  /**************************************************
> @@ -2519,6 +2597,9 @@ move_one_task(struct rq *this_rq, int th
> 
>  	for_each_leaf_cfs_rq(busiest, cfs_rq) {
>  		list_for_each_entry_safe(p, n, &cfs_rq->tasks, se.group_node) {
> +			if (throttled_lb_pair(task_group(p),
> +					      busiest->cpu, this_cpu))
> +				break;
> 
>  			if (!can_migrate_task(p, busiest, this_cpu,
>  						sd, idle, &pinned))
> @@ -2630,8 +2711,13 @@ static void update_shares(int cpu)
>  	struct rq *rq = cpu_rq(cpu);
> 
>  	rcu_read_lock();
> -	for_each_leaf_cfs_rq(rq, cfs_rq)
> +	for_each_leaf_cfs_rq(rq, cfs_rq) {
> +		/* throttled entities do not contribute to load */
> +		if (throttled_hierarchy(cfs_rq))
> +			continue;
> +
>  		update_shares_cpu(cfs_rq->tg, cpu);
> +	}
>  	rcu_read_unlock();
>  }
> 
> @@ -2655,9 +2741,10 @@ load_balance_fair(struct rq *this_rq, in
>  		u64 rem_load, moved_load;
> 
>  		/*
> -		 * empty group
> +		 * empty group or part of a throttled hierarchy
>  		 */
> -		if (!busiest_cfs_rq->task_weight)
> +		if (!busiest_cfs_rq->task_weight ||
> +		    throttled_lb_pair(tg, busiest_cpu, this_cpu))

tip commit 9763b67fb9f30 removes both tg and busiest_cpu from
load_balance_fair.

>  			continue;
> 
>  		rem_load = (u64)rem_load_move * busiest_weight;
> Index: tip/kernel/sched.c
> ===================================================================
> --- tip.orig/kernel/sched.c
> +++ tip/kernel/sched.c
> @@ -399,7 +399,7 @@ struct cfs_rq {
>  	u64 runtime_expires;
>  	s64 runtime_remaining;
> 
> -	int throttled;
> +	int throttled, throttle_count;
>  	struct list_head throttled_list;
>  #endif
>  #endif
> 
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 11/18] sched: prevent interactions with throttled entities
  2011-07-22 11:41   ` Kamalesh Babulal
@ 2011-07-22 11:43     ` Peter Zijlstra
  2011-07-22 18:16       ` Kamalesh Babulal
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2011-07-22 11:43 UTC (permalink / raw)
  To: Kamalesh Babulal
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

On Fri, 2011-07-22 at 17:11 +0530, Kamalesh Babulal wrote:
> >                */
> > -             if (!busiest_cfs_rq->task_weight)
> > +             if (!busiest_cfs_rq->task_weight ||
> > +                 throttled_lb_pair(tg, busiest_cpu, this_cpu))
> 
> tip commit 9763b67fb9f30 removes both tg and busiest_cpu from
> load_balance_fair. 

Grrr, one more untrimmed email and you're in the /dev/null filter.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 11/18] sched: prevent interactions with throttled entities
  2011-07-22 11:43     ` Peter Zijlstra
@ 2011-07-22 18:16       ` Kamalesh Babulal
  0 siblings, 0 replies; 60+ messages in thread
From: Kamalesh Babulal @ 2011-07-22 18:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

* Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-07-22 13:43:39]:

> On Fri, 2011-07-22 at 17:11 +0530, Kamalesh Babulal wrote:
> > >                */
> > > -             if (!busiest_cfs_rq->task_weight)
> > > +             if (!busiest_cfs_rq->task_weight ||
> > > +                 throttled_lb_pair(tg, busiest_cpu, this_cpu))
> > 
> > tip commit 9763b67fb9f30 removes both tg and busiest_cpu from
> > load_balance_fair. 
> 
> Grrr, one more untrimmed email and you're in the /dev/null filter.

Sorry, will take care of it next time.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 00/18] CFS Bandwidth Control v7.2
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (18 preceding siblings ...)
  2011-07-21 23:01 ` [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
@ 2011-07-25 14:58 ` Peter Zijlstra
  2011-07-25 15:00   ` Peter Zijlstra
  2011-09-13 12:10 ` Vladimir Davydov
  20 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2011-07-25 14:58 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

How about something like the below on top?


---
 kernel/sched.c |   33 ++++++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -9228,9 +9228,19 @@ long tg_get_cfs_quota(struct task_group 
 	return quota_us;
 }
 
+int tg_set_cfs_period_down(struct task_group *tg, void *data)
+{
+	u64 period = *(u64 *)data;
+	if (ktime_to_ns(tg->cfs_bandwidth.period) == period)
+		return 0;
+
+	return tg_set_cfs_bandwidth(tg, period, tg->cfs_bandwidth.quota);
+}
+
 int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
 {
 	u64 quota, period;
+	int ret;
 
 	period = (u64)cfs_period_us * NSEC_PER_USEC;
 	quota = tg_cfs_bandwidth(tg)->quota;
@@ -9238,7 +9248,28 @@ int tg_set_cfs_period(struct task_group 
 	if (period <= 0)
 		return -EINVAL;
 
-	return tg_set_cfs_bandwidth(tg, period, quota);
+	/*
+	 * If the parent is bandwidth constrained all its children will
+	 * have to have the same period.
+	 */
+	if (tg->parent && tg->parent->cfs_bandwidth.quota != RUNTIME_INF)
+		return -EINVAL;
+
+	ret = tg_set_cfs_bandwidth(tg, period, quota);
+	if (!ret) {
+		rcu_read_lock();
+		ret = walk_tg_tree_from(tg, tg_set_cfs_period_down, NULL, &period);
+		rcu_read_unlock();
+
+		/*
+		 * If we could change the period on the parent we should be
+		 * able to change the period on all its children since their
+		 * quota is constrained to be equal or less than ours.
+		 */
+		WARN_ON_ONCE(ret);
+	}
+
+	return ret;
 }
 
 long tg_get_cfs_period(struct task_group *tg)


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 00/18] CFS Bandwidth Control v7.2
  2011-07-25 14:58 ` Peter Zijlstra
@ 2011-07-25 15:00   ` Peter Zijlstra
  2011-07-25 16:21     ` Paul E. McKenney
  2011-07-28  2:59     ` Paul Turner
  0 siblings, 2 replies; 60+ messages in thread
From: Peter Zijlstra @ 2011-07-25 15:00 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

On Mon, 2011-07-25 at 16:58 +0200, Peter Zijlstra wrote:
> +               rcu_read_lock();
> +               ret = walk_tg_tree_from(tg, tg_set_cfs_period_down, NULL, &period);
> +               rcu_read_unlock(); 

rcu over a mutex doesn't really work in mainline, bah.. 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 00/18] CFS Bandwidth Control v7.2
  2011-07-25 15:00   ` Peter Zijlstra
@ 2011-07-25 16:21     ` Paul E. McKenney
  2011-07-25 16:28       ` Peter Zijlstra
  2011-07-28  2:59     ` Paul Turner
  1 sibling, 1 reply; 60+ messages in thread
From: Paul E. McKenney @ 2011-07-25 16:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
	Jason Baron

On Mon, Jul 25, 2011 at 05:00:41PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-07-25 at 16:58 +0200, Peter Zijlstra wrote:
> > +               rcu_read_lock();
> > +               ret = walk_tg_tree_from(tg, tg_set_cfs_period_down, NULL, &period);
> > +               rcu_read_unlock(); 
> 
> rcu over a mutex doesn't really work in mainline, bah.. 

SRCU can handle that situation, FWIW.  But yes, blocking in an RCU
read-side critical section is a no-no.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 00/18] CFS Bandwidth Control v7.2
  2011-07-25 16:21     ` Paul E. McKenney
@ 2011-07-25 16:28       ` Peter Zijlstra
  2011-07-25 16:46         ` Paul E. McKenney
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2011-07-25 16:28 UTC (permalink / raw)
  To: paulmck
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
	Jason Baron

On Mon, 2011-07-25 at 09:21 -0700, Paul E. McKenney wrote:
> On Mon, Jul 25, 2011 at 05:00:41PM +0200, Peter Zijlstra wrote:
> > On Mon, 2011-07-25 at 16:58 +0200, Peter Zijlstra wrote:
> > > +               rcu_read_lock();
> > > +               ret = walk_tg_tree_from(tg, tg_set_cfs_period_down, NULL, &period);
> > > +               rcu_read_unlock(); 
> > 
> > rcu over a mutex doesn't really work in mainline, bah.. 
> 
> SRCU can handle that situation, FWIW.  But yes, blocking in an RCU
> read-side critical section is a no-no.

Yeah, I know, but didn't notice until after I sent.. SRCU isn't useful
though, way too slow due to lacking call_srcu().

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 00/18] CFS Bandwidth Control v7.2
  2011-07-25 16:28       ` Peter Zijlstra
@ 2011-07-25 16:46         ` Paul E. McKenney
  2011-07-25 17:08           ` Peter Zijlstra
  0 siblings, 1 reply; 60+ messages in thread
From: Paul E. McKenney @ 2011-07-25 16:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
	Jason Baron

On Mon, Jul 25, 2011 at 06:28:31PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-07-25 at 09:21 -0700, Paul E. McKenney wrote:
> > On Mon, Jul 25, 2011 at 05:00:41PM +0200, Peter Zijlstra wrote:
> > > On Mon, 2011-07-25 at 16:58 +0200, Peter Zijlstra wrote:
> > > > +               rcu_read_lock();
> > > > +               ret = walk_tg_tree_from(tg, tg_set_cfs_period_down, NULL, &period);
> > > > +               rcu_read_unlock(); 
> > > 
> > > rcu over a mutex doesn't really work in mainline, bah.. 
> > 
> > SRCU can handle that situation, FWIW.  But yes, blocking in an RCU
> > read-side critical section is a no-no.
> 
> Yeah, I know, but didn't notice until after I sent.. SRCU isn't useful
> though, way too slow due to lacking call_srcu().

Good point.  How frequently would a call_srcu() be invoked?

In other words, would a really crude hack involving a globally locked
per-srcu_struct callback list and a per-srcu_struct kernel thread be
helpful, or would a slightly less-crude hack involving a per-CPU callback
list be required?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 00/18] CFS Bandwidth Control v7.2
  2011-07-25 16:46         ` Paul E. McKenney
@ 2011-07-25 17:08           ` Peter Zijlstra
  2011-07-25 17:11             ` Dhaval Giani
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2011-07-25 17:08 UTC (permalink / raw)
  To: paulmck
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
	Jason Baron

On Mon, 2011-07-25 at 09:46 -0700, Paul E. McKenney wrote:
> On Mon, Jul 25, 2011 at 06:28:31PM +0200, Peter Zijlstra wrote:
> > On Mon, 2011-07-25 at 09:21 -0700, Paul E. McKenney wrote:
> > > On Mon, Jul 25, 2011 at 05:00:41PM +0200, Peter Zijlstra wrote:
> > > > On Mon, 2011-07-25 at 16:58 +0200, Peter Zijlstra wrote:
> > > > > +               rcu_read_lock();
> > > > > +               ret = walk_tg_tree_from(tg, tg_set_cfs_period_down, NULL, &period);
> > > > > +               rcu_read_unlock(); 
> > > > 
> > > > rcu over a mutex doesn't really work in mainline, bah.. 
> > > 
> > > SRCU can handle that situation, FWIW.  But yes, blocking in an RCU
> > > read-side critical section is a no-no.
> > 
> > Yeah, I know, but didn't notice until after I sent.. SRCU isn't useful
> > though, way too slow due to lacking call_srcu().
> 
> Good point.  How frequently would a call_srcu() be invoked?
> 
> In other words, would a really crude hack involving a globally locked
> per-srcu_struct callback list and a per-srcu_struct kernel thread be
> helpful, or would a slightly less-crude hack involving a per-CPU callback
> list be required?

it would be invoked every time someone kills a cgroup, which I would
consider a slow path, but some folks out there seem to think otherwise
and create/destroy these things like they're free (there was a discussion
some time ago about optimizing the cgroup destroy path, etc.).

Anyway, I think I can sort this particular problem by simply wrapping
the whole crap in cgroup_lock()/cgroup_unlock(), if we want to go this
way anyway.

I consider setting the cgroup parameters an utter slow path, and if
people complain I'll simply tell them to sod off ;-)

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 00/18] CFS Bandwidth Control v7.2
  2011-07-25 17:08           ` Peter Zijlstra
@ 2011-07-25 17:11             ` Dhaval Giani
  2011-07-25 17:35               ` Peter Zijlstra
  0 siblings, 1 reply; 60+ messages in thread
From: Dhaval Giani @ 2011-07-25 17:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: paulmck, Paul Turner, linux-kernel, Bharata B Rao, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

>
> I consider setting the cgroup parameters an utter slow path, and if
> people complain I'll simply tell them to sod off ;-)
>

cgroups parameters? I can think of some funky uses if setting cgroup
parameters was not in the slow path. I think you meant create/destroy
cgroups here :-).

Dhaval

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 00/18] CFS Bandwidth Control v7.2
  2011-07-25 17:11             ` Dhaval Giani
@ 2011-07-25 17:35               ` Peter Zijlstra
  0 siblings, 0 replies; 60+ messages in thread
From: Peter Zijlstra @ 2011-07-25 17:35 UTC (permalink / raw)
  To: Dhaval Giani
  Cc: paulmck, Paul Turner, linux-kernel, Bharata B Rao, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

Dhaval Giani <dhaval.giani@gmail.com> wrote:

>>
>> I consider setting the cgroup parameters an utter slow path, and if
>> people complain I'll simply tell them to sod off ;-)
>>
>
>cgroups parameters? I can think of some funky uses if setting cgroup
>parameters was not in the slow path. I think you meant create/destroy
>cgroups here :-).
>
>Dhaval

No, I meant writing into cgroup files, like the cfs bandwidth period thing.
-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 00/18] CFS Bandwidth Control v7.2
  2011-07-25 15:00   ` Peter Zijlstra
  2011-07-25 16:21     ` Paul E. McKenney
@ 2011-07-28  2:59     ` Paul Turner
  1 sibling, 0 replies; 60+ messages in thread
From: Paul Turner @ 2011-07-28  2:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov, Jason Baron

On Mon, Jul 25, 2011 at 8:00 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, 2011-07-25 at 16:58 +0200, Peter Zijlstra wrote:
>> +               rcu_read_lock();
>> +               ret = walk_tg_tree_from(tg, tg_set_cfs_period_down, NULL, &period);
>> +               rcu_read_unlock();
>
> rcu over a mutex doesn't really work in mainline, bah..
>

Isn't this the other way around though?  We already hold the mutex so
we shouldn't be blocking within the RCU section.

rcu_read_lock() here is only to stop nodes in the tree from disappearing
under us during the walk.

(FWIW this is the same as the rt_schedulable case)

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 08/18] sched: add support for throttling group entities
  2011-07-21 16:43 ` [patch 08/18] sched: add support for throttling group entities Paul Turner
@ 2011-08-08 15:46   ` Lin Ming
  2011-08-08 16:00     ` Peter Zijlstra
  2011-08-14 16:26   ` [tip:sched/core] sched: Add " tip-bot for Paul Turner
  1 sibling, 1 reply; 60+ messages in thread
From: Lin Ming @ 2011-08-08 15:46 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
	Jason Baron

On Fri, Jul 22, 2011 at 12:43 AM, Paul Turner <pjt@google.com> wrote:

> +static __used void throttle_cfs_rq(struct cfs_rq *cfs_rq)
> +{
> +       struct rq *rq = rq_of(cfs_rq);
> +       struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> +       struct sched_entity *se;
> +       long task_delta, dequeue = 1;
> +
> +       se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
> +
> +       /* account load preceding throttle */
> +       update_cfs_load(cfs_rq, 0);
> +
> +       task_delta = cfs_rq->h_nr_running;
> +       for_each_sched_entity(se) {
> +               struct cfs_rq *qcfs_rq = cfs_rq_of(se);
> +               /* throttled entity or throttle-on-deactivate */
> +               if (!se->on_rq)
> +                       break;

Does it mean it's possible that child se is unthrottled but parent se
is throttled?

I thought if parent group was throttled then its children should be
throttled too.
I may have misunderstood the code; please correct me if so.

Thanks,
Lin Ming

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 08/18] sched: add support for throttling group entities
  2011-08-08 15:46   ` Lin Ming
@ 2011-08-08 16:00     ` Peter Zijlstra
  2011-08-08 16:16       ` Paul Turner
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2011-08-08 16:00 UTC (permalink / raw)
  To: Lin Ming
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
	Jason Baron

On Mon, 2011-08-08 at 23:46 +0800, Lin Ming wrote:
> On Fri, Jul 22, 2011 at 12:43 AM, Paul Turner <pjt@google.com> wrote:
> 
> > +static __used void throttle_cfs_rq(struct cfs_rq *cfs_rq)
> > +{
> > +       struct rq *rq = rq_of(cfs_rq);
> > +       struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> > +       struct sched_entity *se;
> > +       long task_delta, dequeue = 1;
> > +
> > +       se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
> > +
> > +       /* account load preceding throttle */
> > +       update_cfs_load(cfs_rq, 0);
> > +
> > +       task_delta = cfs_rq->h_nr_running;
> > +       for_each_sched_entity(se) {
> > +               struct cfs_rq *qcfs_rq = cfs_rq_of(se);
> > +               /* throttled entity or throttle-on-deactivate */
> > +               if (!se->on_rq)
> > +                       break;
> 
> Does it mean it's possible that child se is unthrottled but parent se
> is throttled?

Yep..

> I thought if parent group was throttled then its children should be
> throttled too.
> I may have misunderstood the code; please correct me if so.

That would be costly, as throttling a parent would require throttling
all its children (of which there can be arbitrarily many).


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [patch 08/18] sched: add support for throttling group entities
  2011-08-08 16:00     ` Peter Zijlstra
@ 2011-08-08 16:16       ` Paul Turner
  0 siblings, 0 replies; 60+ messages in thread
From: Paul Turner @ 2011-08-08 16:16 UTC (permalink / raw)
  To: Lin Ming
  Cc: Peter Zijlstra, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelyanov,
	Jason Baron

On Mon, Aug 8, 2011 at 9:00 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Mon, 2011-08-08 at 23:46 +0800, Lin Ming wrote:
>> On Fri, Jul 22, 2011 at 12:43 AM, Paul Turner <pjt@google.com> wrote:
>>
>> > +static __used void throttle_cfs_rq(struct cfs_rq *cfs_rq)
>> > +{
>> > +       struct rq *rq = rq_of(cfs_rq);
>> > +       struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
>> > +       struct sched_entity *se;
>> > +       long task_delta, dequeue = 1;
>> > +
>> > +       se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
>> > +
>> > +       /* account load preceding throttle */
>> > +       update_cfs_load(cfs_rq, 0);
>> > +
>> > +       task_delta = cfs_rq->h_nr_running;
>> > +       for_each_sched_entity(se) {
>> > +               struct cfs_rq *qcfs_rq = cfs_rq_of(se);
>> > +               /* throttled entity or throttle-on-deactivate */
>> > +               if (!se->on_rq)
>> > +                       break;
>>
>> Does it mean it's possible that child se is unthrottled but parent se
>> is throttled?
>
> Yep..
>
>> I thought if parent group was throttled then its children should be
>> throttled too.
>> I may have misunderstood the code; please correct me if so.
>
> That would be costly, as throttling a parent would require throttling
> all its children (of which there can be arbitrarily many).
>

In case it is not clear: the children of a throttled entity cannot be
scheduled; they are implicitly throttled by virtue of their
parent/ancestor having reached its bandwidth limit (and being
throttled).

Consider the hierarchy below:
    A
   / \
  D   B
       \
        C

If A and B both have bandwidth limits then B being on_rq depends on:

1. A being within its bandwidth limit, otherwise the entire hierarchy
would be dequeued
2. B being within its bandwidth limit, otherwise the hierarchy B-C
would be dequeued

B's throttle state is independent of whether A has reached its limit;
however it will not be runnable while A is throttled.

The per-cfs_rq throttle_count may be more directly in line with
your interpretation of "throttled"; it maintains explicit tracking of
whether or not an entity is throttled (including via its parent).
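
To make that concrete, here is a toy userspace model of the same idea
(illustrative only -- the struct and helper names below are made up, not the
kernel's; only the walk-the-ancestors logic mirrors what throttle_count is
tracking):

#include <stdio.h>
#include <stddef.h>

/* Toy model of the hierarchy A -> {D, B}, B -> C discussed above.
 * "throttled" means the group itself ran out of quota; an entity is
 * only runnable if no ancestor (itself included) is throttled. */
struct group {
	const char *name;
	int throttled;			/* this group hit its own limit */
	struct group *parent;
};

/* rough analogue of "throttled directly or via some ancestor" */
static int throttled_hierarchy(const struct group *g)
{
	for (; g; g = g->parent)
		if (g->throttled)
			return 1;
	return 0;
}

int main(void)
{
	struct group A = { "A", 0, NULL };
	struct group D = { "D", 0, &A };
	struct group B = { "B", 0, &A };
	struct group C = { "C", 0, &B };
	struct group *all[] = { &A, &D, &B, &C };

	A.throttled = 1;	/* A exhausts its bandwidth */

	for (size_t i = 0; i < 4; i++)
		printf("%s: throttled=%d runnable=%d\n", all[i]->name,
		       all[i]->throttled, !throttled_hierarchy(all[i]));
	return 0;
}

B and C report throttled=0 but runnable=0, which is exactly the distinction
between a group's own throttle state and being throttled via an ancestor.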

- Paul

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [tip:sched/core] sched: Implement hierarchical task accounting for SCHED_OTHER
  2011-07-21 16:43 ` [patch 02/18] sched: hierarchical task accounting for SCHED_OTHER Paul Turner
@ 2011-08-14 16:15   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Paul Turner @ 2011-08-14 16:15 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, seto.hidetoshi, pjt, tglx, mingo

Commit-ID:  953bfcd10e6f3697233e8e5128c611d275da39c1
Gitweb:     http://git.kernel.org/tip/953bfcd10e6f3697233e8e5128c611d275da39c1
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 21 Jul 2011 09:43:27 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:01:13 +0200

sched: Implement hierarchical task accounting for SCHED_OTHER

Introduce hierarchical task accounting for the group scheduling case in CFS, as
well as promoting the responsibility for maintaining rq->nr_running to the
scheduling classes.

The primary motivation for this is that with scheduling classes supporting
bandwidth throttling it is possible for entities participating in throttled
sub-trees to not have root visible changes in rq->nr_running across activate
and de-activate operations.  This in turn leads to incorrect idle and
weight-per-task load balance decisions.

This also allows us to make a small fixlet to the fastpath in pick_next_task()
under group scheduling.

Note: this issue also exists with the existing sched_rt throttling mechanism.
This patch does not address that.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184756.878333391@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c          |    6 ++----
 kernel/sched_fair.c     |    6 ++++++
 kernel/sched_rt.c       |    5 ++++-
 kernel/sched_stoptask.c |    2 ++
 4 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index cf427bb..cd1a531 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -311,7 +311,7 @@ struct task_group root_task_group;
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
 	struct load_weight load;
-	unsigned long nr_running;
+	unsigned long nr_running, h_nr_running;
 
 	u64 exec_clock;
 	u64 min_vruntime;
@@ -1802,7 +1802,6 @@ static void activate_task(struct rq *rq, struct task_struct *p, int flags)
 		rq->nr_uninterruptible--;
 
 	enqueue_task(rq, p, flags);
-	inc_nr_running(rq);
 }
 
 /*
@@ -1814,7 +1813,6 @@ static void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
 		rq->nr_uninterruptible++;
 
 	dequeue_task(rq, p, flags);
-	dec_nr_running(rq);
 }
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
@@ -4258,7 +4256,7 @@ pick_next_task(struct rq *rq)
 	 * Optimization: we know that if all tasks are in
 	 * the fair class we can call that function directly:
 	 */
-	if (likely(rq->nr_running == rq->cfs.nr_running)) {
+	if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
 		p = fair_sched_class.pick_next_task(rq);
 		if (likely(p))
 			return p;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index f4b732a..f86b0cb 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1310,16 +1310,19 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 			break;
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, flags);
+		cfs_rq->h_nr_running++;
 		flags = ENQUEUE_WAKEUP;
 	}
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
+		cfs_rq->h_nr_running++;
 
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
+	inc_nr_running(rq);
 	hrtick_update(rq);
 }
 
@@ -1339,6 +1342,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, flags);
+		cfs_rq->h_nr_running--;
 
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
@@ -1358,11 +1362,13 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
+		cfs_rq->h_nr_running--;
 
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
+	dec_nr_running(rq);
 	hrtick_update(rq);
 }
 
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index a8c207f..a9d3c6b 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -936,6 +936,8 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 
 	if (!task_current(rq, p) && p->rt.nr_cpus_allowed > 1)
 		enqueue_pushable_task(rq, p);
+
+	inc_nr_running(rq);
 }
 
 static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
@@ -946,6 +948,8 @@ static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 	dequeue_rt_entity(rt_se);
 
 	dequeue_pushable_task(rq, p);
+
+	dec_nr_running(rq);
 }
 
 /*
@@ -1841,4 +1845,3 @@ static void print_rt_stats(struct seq_file *m, int cpu)
 	rcu_read_unlock();
 }
 #endif /* CONFIG_SCHED_DEBUG */
-
diff --git a/kernel/sched_stoptask.c b/kernel/sched_stoptask.c
index 6f43763..8b44e7f 100644
--- a/kernel/sched_stoptask.c
+++ b/kernel/sched_stoptask.c
@@ -34,11 +34,13 @@ static struct task_struct *pick_next_task_stop(struct rq *rq)
 static void
 enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags)
 {
+	inc_nr_running(rq);
 }
 
 static void
 dequeue_task_stop(struct rq *rq, struct task_struct *p, int flags)
 {
+	dec_nr_running(rq);
 }
 
 static void yield_task_stop(struct rq *rq)

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [tip:sched/core] sched: Introduce primitives to account for CFS bandwidth tracking
  2011-07-21 16:43 ` [patch 03/18] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
  2011-07-22 11:14   ` Kamalesh Babulal
@ 2011-08-14 16:17   ` tip-bot for Paul Turner
  1 sibling, 0 replies; 60+ messages in thread
From: tip-bot for Paul Turner @ 2011-08-14 16:17 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, seto.hidetoshi, ncrao,
	pjt, bharata, tglx, mingo

Commit-ID:  ab84d31e15502fb626169ba2663381e34bf965b2
Gitweb:     http://git.kernel.org/tip/ab84d31e15502fb626169ba2663381e34bf965b2
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 21 Jul 2011 09:43:28 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:03:20 +0200

sched: Introduce primitives to account for CFS bandwidth tracking

In this patch we introduce the notion of CFS bandwidth, partitioned into
globally unassigned bandwidth, and locally claimed bandwidth.

 - The global bandwidth is per task_group; it represents a pool of unclaimed
   bandwidth that cfs_rqs can allocate from.
 - The local bandwidth is tracked per-cfs_rq; this represents allotments from
   the global pool, i.e. bandwidth assigned to a specific cpu.

Bandwidth is managed via cgroupfs, adding two new interfaces to the cpu subsystem:
 - cpu.cfs_period_us : the bandwidth period in usecs
 - cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg will be allowed
   to consume over the period above.

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184756.972636699@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 init/Kconfig        |   12 +++
 kernel/sched.c      |  196 +++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched_fair.c |   16 ++++
 3 files changed, 220 insertions(+), 4 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index d627783..d19b3a7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -715,6 +715,18 @@ config FAIR_GROUP_SCHED
 	depends on CGROUP_SCHED
 	default CGROUP_SCHED
 
+config CFS_BANDWIDTH
+	bool "CPU bandwidth provisioning for FAIR_GROUP_SCHED"
+	depends on EXPERIMENTAL
+	depends on FAIR_GROUP_SCHED
+	default n
+	help
+	  This option allows users to define CPU bandwidth rates (limits) for
+	  tasks running within the fair group scheduler.  Groups with no limit
+	  set are considered to be unconstrained and will run with no
+	  restriction.
+	  See tip/Documentation/scheduler/sched-bwc.txt for more information.
+
 config RT_GROUP_SCHED
 	bool "Group scheduling for SCHED_RR/FIFO"
 	depends on EXPERIMENTAL
diff --git a/kernel/sched.c b/kernel/sched.c
index cd1a531..f08cb23 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -247,6 +247,14 @@ struct cfs_rq;
 
 static LIST_HEAD(task_groups);
 
+struct cfs_bandwidth {
+#ifdef CONFIG_CFS_BANDWIDTH
+	raw_spinlock_t lock;
+	ktime_t period;
+	u64 quota;
+#endif
+};
+
 /* task group related information */
 struct task_group {
 	struct cgroup_subsys_state css;
@@ -278,6 +286,8 @@ struct task_group {
 #ifdef CONFIG_SCHED_AUTOGROUP
 	struct autogroup *autogroup;
 #endif
+
+	struct cfs_bandwidth cfs_bandwidth;
 };
 
 /* task_group_lock serializes the addition/removal of task groups */
@@ -377,9 +387,48 @@ struct cfs_rq {
 
 	unsigned long load_contribution;
 #endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	int runtime_enabled;
+	s64 runtime_remaining;
+#endif
 #endif
 };
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+#ifdef CONFIG_CFS_BANDWIDTH
+static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
+{
+	return &tg->cfs_bandwidth;
+}
+
+static inline u64 default_cfs_period(void);
+
+static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+	raw_spin_lock_init(&cfs_b->lock);
+	cfs_b->quota = RUNTIME_INF;
+	cfs_b->period = ns_to_ktime(default_cfs_period());
+}
+
+static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	cfs_rq->runtime_enabled = 0;
+}
+
+static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+{}
+#else
+static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
+static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
+static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
+
+static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
+{
+	return NULL;
+}
+#endif /* CONFIG_CFS_BANDWIDTH */
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
 /* Real-Time classes' related field in a runqueue: */
 struct rt_rq {
 	struct rt_prio_array active;
@@ -7971,6 +8020,7 @@ static void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 	/* allow initial update_cfs_load() to truncate */
 	cfs_rq->load_stamp = 1;
 #endif
+	init_cfs_rq_runtime(cfs_rq);
 
 	tg->cfs_rq[cpu] = cfs_rq;
 	tg->se[cpu] = se;
@@ -8110,6 +8160,7 @@ void __init sched_init(void)
 		 * We achieve this by letting root_task_group's tasks sit
 		 * directly in rq->cfs (i.e root_task_group->se[] = NULL).
 		 */
+		init_cfs_bandwidth(&root_task_group.cfs_bandwidth);
 		init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL);
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -8351,6 +8402,8 @@ static void free_fair_sched_group(struct task_group *tg)
 {
 	int i;
 
+	destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
+
 	for_each_possible_cpu(i) {
 		if (tg->cfs_rq)
 			kfree(tg->cfs_rq[i]);
@@ -8378,6 +8431,8 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 
 	tg->shares = NICE_0_LOAD;
 
+	init_cfs_bandwidth(tg_cfs_bandwidth(tg));
+
 	for_each_possible_cpu(i) {
 		cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
 				      GFP_KERNEL, cpu_to_node(i));
@@ -8753,7 +8808,7 @@ static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)
 	return walk_tg_tree(tg_schedulable, tg_nop, &data);
 }
 
-static int tg_set_bandwidth(struct task_group *tg,
+static int tg_set_rt_bandwidth(struct task_group *tg,
 		u64 rt_period, u64 rt_runtime)
 {
 	int i, err = 0;
@@ -8792,7 +8847,7 @@ int sched_group_set_rt_runtime(struct task_group *tg, long rt_runtime_us)
 	if (rt_runtime_us < 0)
 		rt_runtime = RUNTIME_INF;
 
-	return tg_set_bandwidth(tg, rt_period, rt_runtime);
+	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
 }
 
 long sched_group_rt_runtime(struct task_group *tg)
@@ -8817,7 +8872,7 @@ int sched_group_set_rt_period(struct task_group *tg, long rt_period_us)
 	if (rt_period == 0)
 		return -EINVAL;
 
-	return tg_set_bandwidth(tg, rt_period, rt_runtime);
+	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
 }
 
 long sched_group_rt_period(struct task_group *tg)
@@ -9007,6 +9062,128 @@ static u64 cpu_shares_read_u64(struct cgroup *cgrp, struct cftype *cft)
 
 	return (u64) scale_load_down(tg->shares);
 }
+
+#ifdef CONFIG_CFS_BANDWIDTH
+const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
+const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
+
+static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
+{
+	int i;
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+	static DEFINE_MUTEX(mutex);
+
+	if (tg == &root_task_group)
+		return -EINVAL;
+
+	/*
+	 * Ensure we have at least some amount of bandwidth every period.  This is
+	 * to prevent reaching a state of large arrears when throttled via
+	 * entity_tick() resulting in prolonged exit starvation.
+	 */
+	if (quota < min_cfs_quota_period || period < min_cfs_quota_period)
+		return -EINVAL;
+
+	/*
+	 * Likewise, bound things on the other side by preventing insane quota
+	 * periods.  This also allows us to normalize in computing quota
+	 * feasibility.
+	 */
+	if (period > max_cfs_quota_period)
+		return -EINVAL;
+
+	mutex_lock(&mutex);
+	raw_spin_lock_irq(&cfs_b->lock);
+	cfs_b->period = ns_to_ktime(period);
+	cfs_b->quota = quota;
+	raw_spin_unlock_irq(&cfs_b->lock);
+
+	for_each_possible_cpu(i) {
+		struct cfs_rq *cfs_rq = tg->cfs_rq[i];
+		struct rq *rq = rq_of(cfs_rq);
+
+		raw_spin_lock_irq(&rq->lock);
+		cfs_rq->runtime_enabled = quota != RUNTIME_INF;
+		cfs_rq->runtime_remaining = 0;
+		raw_spin_unlock_irq(&rq->lock);
+	}
+	mutex_unlock(&mutex);
+
+	return 0;
+}
+
+int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
+{
+	u64 quota, period;
+
+	period = ktime_to_ns(tg_cfs_bandwidth(tg)->period);
+	if (cfs_quota_us < 0)
+		quota = RUNTIME_INF;
+	else
+		quota = (u64)cfs_quota_us * NSEC_PER_USEC;
+
+	return tg_set_cfs_bandwidth(tg, period, quota);
+}
+
+long tg_get_cfs_quota(struct task_group *tg)
+{
+	u64 quota_us;
+
+	if (tg_cfs_bandwidth(tg)->quota == RUNTIME_INF)
+		return -1;
+
+	quota_us = tg_cfs_bandwidth(tg)->quota;
+	do_div(quota_us, NSEC_PER_USEC);
+
+	return quota_us;
+}
+
+int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
+{
+	u64 quota, period;
+
+	period = (u64)cfs_period_us * NSEC_PER_USEC;
+	quota = tg_cfs_bandwidth(tg)->quota;
+
+	if (period <= 0)
+		return -EINVAL;
+
+	return tg_set_cfs_bandwidth(tg, period, quota);
+}
+
+long tg_get_cfs_period(struct task_group *tg)
+{
+	u64 cfs_period_us;
+
+	cfs_period_us = ktime_to_ns(tg_cfs_bandwidth(tg)->period);
+	do_div(cfs_period_us, NSEC_PER_USEC);
+
+	return cfs_period_us;
+}
+
+static s64 cpu_cfs_quota_read_s64(struct cgroup *cgrp, struct cftype *cft)
+{
+	return tg_get_cfs_quota(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_quota_write_s64(struct cgroup *cgrp, struct cftype *cftype,
+				s64 cfs_quota_us)
+{
+	return tg_set_cfs_quota(cgroup_tg(cgrp), cfs_quota_us);
+}
+
+static u64 cpu_cfs_period_read_u64(struct cgroup *cgrp, struct cftype *cft)
+{
+	return tg_get_cfs_period(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_period_write_u64(struct cgroup *cgrp, struct cftype *cftype,
+				u64 cfs_period_us)
+{
+	return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
+}
+
+#endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -9041,6 +9218,18 @@ static struct cftype cpu_files[] = {
 		.write_u64 = cpu_shares_write_u64,
 	},
 #endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		.name = "cfs_quota_us",
+		.read_s64 = cpu_cfs_quota_read_s64,
+		.write_s64 = cpu_cfs_quota_write_s64,
+	},
+	{
+		.name = "cfs_period_us",
+		.read_u64 = cpu_cfs_period_read_u64,
+		.write_u64 = cpu_cfs_period_write_u64,
+	},
+#endif
 #ifdef CONFIG_RT_GROUP_SCHED
 	{
 		.name = "rt_runtime_us",
@@ -9350,4 +9539,3 @@ struct cgroup_subsys cpuacct_subsys = {
 	.subsys_id = cpuacct_subsys_id,
 };
 #endif	/* CONFIG_CGROUP_CPUACCT */
-
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index f86b0cb..f24f417 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1234,6 +1234,22 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 		check_preempt_tick(cfs_rq, curr);
 }
 
+
+/**************************************************
+ * CFS bandwidth control machinery
+ */
+
+#ifdef CONFIG_CFS_BANDWIDTH
+/*
+ * default period for cfs group bandwidth.
+ * default: 0.1s, units: nanoseconds
+ */
+static inline u64 default_cfs_period(void)
+{
+	return 100000000ULL;
+}
+#endif
+
 /**************************************************
  * CFS operations on tasks:
  */
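
As a quick illustration of driving the two files added above from userspace
(a sketch only -- the cgroup mount point and group name below are assumptions
and vary per setup; values are in microseconds, as described in the
changelog):

#include <stdio.h>

/* Write one value into a cgroup control file such as cpu.cfs_quota_us. */
static int write_cgroup_val(const char *path, long long val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%lld\n", val);
	return fclose(f);
}

int main(void)
{
	/* assumed mount point/group; allow "grp" to consume 250ms of CPU
	 * time per 500ms period, i.e. roughly half a CPU */
	write_cgroup_val("/cgroup/cpu/grp/cpu.cfs_period_us", 500000);
	write_cgroup_val("/cgroup/cpu/grp/cpu.cfs_quota_us",  250000);
	return 0;
}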

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [tip:sched/core] sched: Validate CFS quota hierarchies
  2011-07-21 16:43 ` [patch 04/18] sched: validate CFS quota hierarchies Paul Turner
@ 2011-08-14 16:19   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Paul Turner @ 2011-08-14 16:19 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, pjt, tglx, mingo

Commit-ID:  a790de99599a29ad3f18667530cf4b9f4b7e3234
Gitweb:     http://git.kernel.org/tip/a790de99599a29ad3f18667530cf4b9f4b7e3234
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 21 Jul 2011 09:43:29 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:03:22 +0200

sched: Validate CFS quota hierarchies

Add constraints validation for CFS bandwidth hierarchies.

Validate that:
   max(child bandwidth) <= parent_bandwidth

In a quota limited hierarchy, an unconstrained entity
(e.g. bandwidth==RUNTIME_INF) inherits the bandwidth of its parent.

This constraint is chosen over sum(child_bandwidth) as the notion of over-commit is
valuable within SCHED_OTHER.  Some basic code from the RT case is re-factored
for reuse.

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.083774572@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |  112 +++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 98 insertions(+), 14 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index f08cb23..ea6850d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -252,6 +252,7 @@ struct cfs_bandwidth {
 	raw_spinlock_t lock;
 	ktime_t period;
 	u64 quota;
+	s64 hierarchal_quota;
 #endif
 };
 
@@ -1518,7 +1519,8 @@ static inline void dec_cpu_load(struct rq *rq, unsigned long load)
 	update_load_sub(&rq->load, load);
 }
 
-#if (defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)) || defined(CONFIG_RT_GROUP_SCHED)
+#if defined(CONFIG_RT_GROUP_SCHED) || (defined(CONFIG_FAIR_GROUP_SCHED) && \
+			(defined(CONFIG_SMP) || defined(CONFIG_CFS_BANDWIDTH)))
 typedef int (*tg_visitor)(struct task_group *, void *);
 
 /*
@@ -8708,12 +8710,7 @@ unsigned long sched_group_shares(struct task_group *tg)
 }
 #endif
 
-#ifdef CONFIG_RT_GROUP_SCHED
-/*
- * Ensure that the real time constraints are schedulable.
- */
-static DEFINE_MUTEX(rt_constraints_mutex);
-
+#if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_CFS_BANDWIDTH)
 static unsigned long to_ratio(u64 period, u64 runtime)
 {
 	if (runtime == RUNTIME_INF)
@@ -8721,6 +8718,13 @@ static unsigned long to_ratio(u64 period, u64 runtime)
 
 	return div64_u64(runtime << 20, period);
 }
+#endif
+
+#ifdef CONFIG_RT_GROUP_SCHED
+/*
+ * Ensure that the real time constraints are schedulable.
+ */
+static DEFINE_MUTEX(rt_constraints_mutex);
 
 /* Must be called with tasklist_lock held */
 static inline int tg_has_rt_tasks(struct task_group *tg)
@@ -8741,7 +8745,7 @@ struct rt_schedulable_data {
 	u64 rt_runtime;
 };
 
-static int tg_schedulable(struct task_group *tg, void *data)
+static int tg_rt_schedulable(struct task_group *tg, void *data)
 {
 	struct rt_schedulable_data *d = data;
 	struct task_group *child;
@@ -8805,7 +8809,7 @@ static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)
 		.rt_runtime = runtime,
 	};
 
-	return walk_tg_tree(tg_schedulable, tg_nop, &data);
+	return walk_tg_tree(tg_rt_schedulable, tg_nop, &data);
 }
 
 static int tg_set_rt_bandwidth(struct task_group *tg,
@@ -9064,14 +9068,17 @@ static u64 cpu_shares_read_u64(struct cgroup *cgrp, struct cftype *cft)
 }
 
 #ifdef CONFIG_CFS_BANDWIDTH
+static DEFINE_MUTEX(cfs_constraints_mutex);
+
 const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
 const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
 
+static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
+
 static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 {
-	int i;
+	int i, ret = 0;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-	static DEFINE_MUTEX(mutex);
 
 	if (tg == &root_task_group)
 		return -EINVAL;
@@ -9092,7 +9099,11 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	if (period > max_cfs_quota_period)
 		return -EINVAL;
 
-	mutex_lock(&mutex);
+	mutex_lock(&cfs_constraints_mutex);
+	ret = __cfs_schedulable(tg, period, quota);
+	if (ret)
+		goto out_unlock;
+
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
@@ -9107,9 +9118,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 		cfs_rq->runtime_remaining = 0;
 		raw_spin_unlock_irq(&rq->lock);
 	}
-	mutex_unlock(&mutex);
+out_unlock:
+	mutex_unlock(&cfs_constraints_mutex);
 
-	return 0;
+	return ret;
 }
 
 int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
@@ -9183,6 +9195,78 @@ static int cpu_cfs_period_write_u64(struct cgroup *cgrp, struct cftype *cftype,
 	return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
 }
 
+struct cfs_schedulable_data {
+	struct task_group *tg;
+	u64 period, quota;
+};
+
+/*
+ * normalize group quota/period to be quota/max_period
+ * note: units are usecs
+ */
+static u64 normalize_cfs_quota(struct task_group *tg,
+			       struct cfs_schedulable_data *d)
+{
+	u64 quota, period;
+
+	if (tg == d->tg) {
+		period = d->period;
+		quota = d->quota;
+	} else {
+		period = tg_get_cfs_period(tg);
+		quota = tg_get_cfs_quota(tg);
+	}
+
+	/* note: these should typically be equivalent */
+	if (quota == RUNTIME_INF || quota == -1)
+		return RUNTIME_INF;
+
+	return to_ratio(period, quota);
+}
+
+static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
+{
+	struct cfs_schedulable_data *d = data;
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+	s64 quota = 0, parent_quota = -1;
+
+	if (!tg->parent) {
+		quota = RUNTIME_INF;
+	} else {
+		struct cfs_bandwidth *parent_b = tg_cfs_bandwidth(tg->parent);
+
+		quota = normalize_cfs_quota(tg, d);
+		parent_quota = parent_b->hierarchal_quota;
+
+		/*
+		 * ensure max(child_quota) <= parent_quota, inherit when no
+		 * limit is set
+		 */
+		if (quota == RUNTIME_INF)
+			quota = parent_quota;
+		else if (parent_quota != RUNTIME_INF && quota > parent_quota)
+			return -EINVAL;
+	}
+	cfs_b->hierarchal_quota = quota;
+
+	return 0;
+}
+
+static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
+{
+	struct cfs_schedulable_data data = {
+		.tg = tg,
+		.period = period,
+		.quota = quota,
+	};
+
+	if (quota != RUNTIME_INF) {
+		do_div(data.period, NSEC_PER_USEC);
+		do_div(data.quota, NSEC_PER_USEC);
+	}
+
+	return walk_tg_tree(tg_cfs_schedulable_down, tg_nop, &data);
+}
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [tip:sched/core] sched: Accumulate per-cfs_rq cpu usage and charge against bandwidth
  2011-07-21 16:43 ` [patch 05/18] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth Paul Turner
@ 2011-08-14 16:21   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Paul Turner @ 2011-08-14 16:21 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, seto.hidetoshi, ncrao,
	pjt, bharata, tglx, mingo

Commit-ID:  ec12cb7f31e28854efae7dd6f9544e0a66379040
Gitweb:     http://git.kernel.org/tip/ec12cb7f31e28854efae7dd6f9544e0a66379040
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 21 Jul 2011 09:43:30 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:03:26 +0200

sched: Accumulate per-cfs_rq cpu usage and charge against bandwidth

Account bandwidth usage on the cfs_rq level versus the task_groups to which
they belong.  Whether we are tracking bandwidth on a given cfs_rq is maintained
under cfs_rq->runtime_enabled.

cfs_rq's which belong to a bandwidth constrained task_group have their runtime
accounted via the update_curr() path, which withdraws bandwidth from the global
pool as desired.  Updates involving the global pool are currently protected
under cfs_bandwidth->lock, local runtime is protected by rq->lock.

This patch only assigns and tracks quota; no action is taken in the case that
cfs_rq->runtime_used exceeds cfs_rq->runtime_assigned.

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.179386821@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/sched.h |    4 ++
 kernel/sched.c        |    4 ++-
 kernel/sched_fair.c   |   79 +++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sysctl.c       |   10 ++++++
 4 files changed, 94 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4ac2c05..bc6f5f2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2040,6 +2040,10 @@ static inline void sched_autogroup_fork(struct signal_struct *sig) { }
 static inline void sched_autogroup_exit(struct signal_struct *sig) { }
 #endif
 
+#ifdef CONFIG_CFS_BANDWIDTH
+extern unsigned int sysctl_sched_cfs_bandwidth_slice;
+#endif
+
 #ifdef CONFIG_RT_MUTEXES
 extern int rt_mutex_getprio(struct task_struct *p);
 extern void rt_mutex_setprio(struct task_struct *p, int prio);
diff --git a/kernel/sched.c b/kernel/sched.c
index ea6850d..35561c6 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -251,7 +251,7 @@ struct cfs_bandwidth {
 #ifdef CONFIG_CFS_BANDWIDTH
 	raw_spinlock_t lock;
 	ktime_t period;
-	u64 quota;
+	u64 quota, runtime;
 	s64 hierarchal_quota;
 #endif
 };
@@ -407,6 +407,7 @@ static inline u64 default_cfs_period(void);
 static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
 	raw_spin_lock_init(&cfs_b->lock);
+	cfs_b->runtime = 0;
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
 }
@@ -9107,6 +9108,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
+	cfs_b->runtime = quota;
 	raw_spin_unlock_irq(&cfs_b->lock);
 
 	for_each_possible_cpu(i) {
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index f24f417..9502aa8 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -89,6 +89,20 @@ const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
  */
 unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
 
+#ifdef CONFIG_CFS_BANDWIDTH
+/*
+ * Amount of runtime to allocate from global (tg) to local (per-cfs_rq) pool
+ * each time a cfs_rq requests quota.
+ *
+ * Note: in the case that the slice exceeds the runtime remaining (either due
+ * to consumption or the quota being specified to be smaller than the slice)
+ * we will always only issue the remaining available time.
+ *
+ * default: 5 msec, units: microseconds
+  */
+unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
+#endif
+
 static const struct sched_class fair_sched_class;
 
 /**************************************************************
@@ -292,6 +306,8 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
+static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+				   unsigned long delta_exec);
 
 /**************************************************************
  * Scheduling class tree data structure manipulation methods:
@@ -583,6 +599,8 @@ static void update_curr(struct cfs_rq *cfs_rq)
 		cpuacct_charge(curtask, delta_exec);
 		account_group_exec_runtime(curtask, delta_exec);
 	}
+
+	account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
 
 static inline void
@@ -1248,6 +1266,58 @@ static inline u64 default_cfs_period(void)
 {
 	return 100000000ULL;
 }
+
+static inline u64 sched_cfs_bandwidth_slice(void)
+{
+	return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
+}
+
+static void assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	struct task_group *tg = cfs_rq->tg;
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+	u64 amount = 0, min_amount;
+
+	/* note: this is a positive sum as runtime_remaining <= 0 */
+	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
+
+	raw_spin_lock(&cfs_b->lock);
+	if (cfs_b->quota == RUNTIME_INF)
+		amount = min_amount;
+	else if (cfs_b->runtime > 0) {
+		amount = min(cfs_b->runtime, min_amount);
+		cfs_b->runtime -= amount;
+	}
+	raw_spin_unlock(&cfs_b->lock);
+
+	cfs_rq->runtime_remaining += amount;
+}
+
+static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+				     unsigned long delta_exec)
+{
+	if (!cfs_rq->runtime_enabled)
+		return;
+
+	cfs_rq->runtime_remaining -= delta_exec;
+	if (cfs_rq->runtime_remaining > 0)
+		return;
+
+	assign_cfs_rq_runtime(cfs_rq);
+}
+
+static __always_inline void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+						   unsigned long delta_exec)
+{
+	if (!cfs_rq->runtime_enabled)
+		return;
+
+	__account_cfs_rq_runtime(cfs_rq, delta_exec);
+}
+
+#else
+static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+				     unsigned long delta_exec) {}
 #endif
 
 /**************************************************
@@ -4266,8 +4336,13 @@ static void set_curr_task_fair(struct rq *rq)
 {
 	struct sched_entity *se = &rq->curr->se;
 
-	for_each_sched_entity(se)
-		set_next_entity(cfs_rq_of(se), se);
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+		set_next_entity(cfs_rq, se);
+		/* ensure bandwidth has been allocated on our new cfs_rq */
+		account_cfs_rq_runtime(cfs_rq, 0);
+	}
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 11d65b5..2d2ecdc 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -379,6 +379,16 @@ static struct ctl_table kern_table[] = {
 		.extra2		= &one,
 	},
 #endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		.procname	= "sched_cfs_bandwidth_slice_us",
+		.data		= &sysctl_sched_cfs_bandwidth_slice,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &one,
+	},
+#endif
 #ifdef CONFIG_PROVE_LOCKING
 	{
 		.procname	= "prove_locking",
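
To put concrete numbers on the slice handoff implemented above (a standalone
sketch -- locking and the RUNTIME_INF case are left out; only the
min()-against-the-global-pool arithmetic of assign_cfs_rq_runtime() is
mirrored, and the variable names are local to this example):

#include <stdio.h>
#include <stdint.h>

#define SLICE_NS	5000000LL	/* 5ms default bandwidth slice */

/* Top the local pool back up to one slice, bounded by whatever is left
 * in the global (per-task_group) pool for this period. */
static int64_t assign_runtime(int64_t *global, int64_t *local)
{
	/* in the kernel this runs with *local <= 0, so this is positive */
	int64_t min_amount = SLICE_NS - *local;
	int64_t amount = 0;

	if (*global > 0) {
		amount = min_amount < *global ? min_amount : *global;
		*global -= amount;
	}
	*local += amount;
	return amount;
}

int main(void)
{
	int64_t global = 20000000;	/* 20ms of quota left this period */
	int64_t local  = -300000;	/* overran the last slice by 0.3ms */
	int64_t got = assign_runtime(&global, &local);

	printf("pulled %lld ns, local now %lld, global now %lld\n",
	       (long long)got, (long long)local, (long long)global);
	return 0;
}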

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [tip:sched/core] sched: Add a timer to handle CFS bandwidth refresh
  2011-07-21 16:43 ` [patch 06/18] sched: add a timer to handle CFS bandwidth refresh Paul Turner
@ 2011-08-14 16:23   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Paul Turner @ 2011-08-14 16:23 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, seto.hidetoshi, pjt, tglx, mingo

Commit-ID:  58088ad0152ba4b7997388c93d0ca208ec1ece75
Gitweb:     http://git.kernel.org/tip/58088ad0152ba4b7997388c93d0ca208ec1ece75
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 21 Jul 2011 09:43:31 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:03:28 +0200

sched: Add a timer to handle CFS bandwidth refresh

This patch adds a per-task_group timer which handles the refresh of the global
CFS bandwidth pool.

Since the RT pool is using a similar timer there's some small refactoring to
share this support.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.277271273@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c      |  107 +++++++++++++++++++++++++++++++++++++++++----------
 kernel/sched_fair.c |   40 +++++++++++++++++-
 2 files changed, 123 insertions(+), 24 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 35561c6..34bf8e6 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -196,10 +196,28 @@ static inline int rt_bandwidth_enabled(void)
 	return sysctl_sched_rt_runtime >= 0;
 }
 
-static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
+static void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
 {
-	ktime_t now;
+	unsigned long delta;
+	ktime_t soft, hard, now;
 
+	for (;;) {
+		if (hrtimer_active(period_timer))
+			break;
+
+		now = hrtimer_cb_get_time(period_timer);
+		hrtimer_forward(period_timer, now, period);
+
+		soft = hrtimer_get_softexpires(period_timer);
+		hard = hrtimer_get_expires(period_timer);
+		delta = ktime_to_ns(ktime_sub(hard, soft));
+		__hrtimer_start_range_ns(period_timer, soft, delta,
+					 HRTIMER_MODE_ABS_PINNED, 0);
+	}
+}
+
+static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
+{
 	if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
 		return;
 
@@ -207,22 +225,7 @@ static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
 		return;
 
 	raw_spin_lock(&rt_b->rt_runtime_lock);
-	for (;;) {
-		unsigned long delta;
-		ktime_t soft, hard;
-
-		if (hrtimer_active(&rt_b->rt_period_timer))
-			break;
-
-		now = hrtimer_cb_get_time(&rt_b->rt_period_timer);
-		hrtimer_forward(&rt_b->rt_period_timer, now, rt_b->rt_period);
-
-		soft = hrtimer_get_softexpires(&rt_b->rt_period_timer);
-		hard = hrtimer_get_expires(&rt_b->rt_period_timer);
-		delta = ktime_to_ns(ktime_sub(hard, soft));
-		__hrtimer_start_range_ns(&rt_b->rt_period_timer, soft, delta,
-				HRTIMER_MODE_ABS_PINNED, 0);
-	}
+	start_bandwidth_timer(&rt_b->rt_period_timer, rt_b->rt_period);
 	raw_spin_unlock(&rt_b->rt_runtime_lock);
 }
 
@@ -253,6 +256,9 @@ struct cfs_bandwidth {
 	ktime_t period;
 	u64 quota, runtime;
 	s64 hierarchal_quota;
+
+	int idle, timer_active;
+	struct hrtimer period_timer;
 #endif
 };
 
@@ -403,6 +409,28 @@ static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
 }
 
 static inline u64 default_cfs_period(void);
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
+
+static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
+{
+	struct cfs_bandwidth *cfs_b =
+		container_of(timer, struct cfs_bandwidth, period_timer);
+	ktime_t now;
+	int overrun;
+	int idle = 0;
+
+	for (;;) {
+		now = hrtimer_cb_get_time(timer);
+		overrun = hrtimer_forward(timer, now, cfs_b->period);
+
+		if (!overrun)
+			break;
+
+		idle = do_sched_cfs_period_timer(cfs_b, overrun);
+	}
+
+	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
+}
 
 static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
@@ -410,6 +438,9 @@ static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	cfs_b->runtime = 0;
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
+
+	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	cfs_b->period_timer.function = sched_cfs_period_timer;
 }
 
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
@@ -417,8 +448,34 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	cfs_rq->runtime_enabled = 0;
 }
 
+/* requires cfs_b->lock, may release to reprogram timer */
+static void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+	/*
+	 * The timer may be active because we're trying to set a new bandwidth
+	 * period or because we're racing with the tear-down path
+	 * (timer_active==0 becomes visible before the hrtimer call-back
+	 * terminates).  In either case we ensure that it's re-programmed
+	 */
+	while (unlikely(hrtimer_active(&cfs_b->period_timer))) {
+		raw_spin_unlock(&cfs_b->lock);
+		/* ensure cfs_b->lock is available while we wait */
+		hrtimer_cancel(&cfs_b->period_timer);
+
+		raw_spin_lock(&cfs_b->lock);
+		/* if someone else restarted the timer then we're done */
+		if (cfs_b->timer_active)
+			return;
+	}
+
+	cfs_b->timer_active = 1;
+	start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
+}
+
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
-{}
+{
+	hrtimer_cancel(&cfs_b->period_timer);
+}
 #else
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
@@ -9078,7 +9135,7 @@ static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
 
 static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 {
-	int i, ret = 0;
+	int i, ret = 0, runtime_enabled;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
 
 	if (tg == &root_task_group)
@@ -9105,10 +9162,18 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	if (ret)
 		goto out_unlock;
 
+	runtime_enabled = quota != RUNTIME_INF;
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
 	cfs_b->runtime = quota;
+
+	/* restart the period timer (if active) to handle new period expiry */
+	if (runtime_enabled && cfs_b->timer_active) {
+		/* force a reprogram */
+		cfs_b->timer_active = 0;
+		__start_cfs_bandwidth(cfs_b);
+	}
 	raw_spin_unlock_irq(&cfs_b->lock);
 
 	for_each_possible_cpu(i) {
@@ -9116,7 +9181,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 		struct rq *rq = rq_of(cfs_rq);
 
 		raw_spin_lock_irq(&rq->lock);
-		cfs_rq->runtime_enabled = quota != RUNTIME_INF;
+		cfs_rq->runtime_enabled = runtime_enabled;
 		cfs_rq->runtime_remaining = 0;
 		raw_spin_unlock_irq(&rq->lock);
 	}
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 9502aa8..af73a8a 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1284,9 +1284,16 @@ static void assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	raw_spin_lock(&cfs_b->lock);
 	if (cfs_b->quota == RUNTIME_INF)
 		amount = min_amount;
-	else if (cfs_b->runtime > 0) {
-		amount = min(cfs_b->runtime, min_amount);
-		cfs_b->runtime -= amount;
+	else {
+		/* ensure bandwidth timer remains active under consumption */
+		if (!cfs_b->timer_active)
+			__start_cfs_bandwidth(cfs_b);
+
+		if (cfs_b->runtime > 0) {
+			amount = min(cfs_b->runtime, min_amount);
+			cfs_b->runtime -= amount;
+			cfs_b->idle = 0;
+		}
 	}
 	raw_spin_unlock(&cfs_b->lock);
 
@@ -1315,6 +1322,33 @@ static __always_inline void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 	__account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
 
+/*
+ * Responsible for refilling a task_group's bandwidth and unthrottling its
+ * cfs_rqs as appropriate. If there has been no activity within the last
+ * period the timer is deactivated until scheduling resumes; cfs_b->idle is
+ * used to track this state.
+ */
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
+{
+	int idle = 1;
+
+	raw_spin_lock(&cfs_b->lock);
+	/* no need to continue the timer with no bandwidth constraint */
+	if (cfs_b->quota == RUNTIME_INF)
+		goto out_unlock;
+
+	idle = cfs_b->idle;
+	cfs_b->runtime = cfs_b->quota;
+
+	/* mark as potentially idle for the upcoming period */
+	cfs_b->idle = 1;
+out_unlock:
+	if (idle)
+		cfs_b->timer_active = 0;
+	raw_spin_unlock(&cfs_b->lock);
+
+	return idle;
+}
 #else
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 				     unsigned long delta_exec) {}
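
The period timer above leans on hrtimer_forward() reporting how many whole
periods were missed; here is a tiny standalone model of just that counting
(plain integers instead of ktime_t, made-up timestamps):

#include <stdio.h>
#include <stdint.h>

/* Model of hrtimer_forward(): push the expiry forward in whole periods
 * until it lies in the future, returning how many periods were skipped. */
static uint64_t timer_forward(uint64_t *expires, uint64_t now, uint64_t period)
{
	uint64_t overrun = 0;

	while (*expires <= now) {
		*expires += period;
		overrun++;
	}
	return overrun;
}

int main(void)
{
	uint64_t period  = 100000000;	/* 100ms default period */
	uint64_t expires = 1000000000;	/* was due at t=1s */
	uint64_t now     = 1250000000;	/* callback runs late, at t=1.25s */
	uint64_t overrun;

	/* like sched_cfs_period_timer(): keep forwarding until no overrun
	 * remains; each non-zero overrun corresponds to one quota refresh */
	while ((overrun = timer_forward(&expires, now, period)) != 0)
		printf("refresh: %llu period(s) elapsed, next expiry %llu\n",
		       (unsigned long long)overrun,
		       (unsigned long long)expires);
	return 0;
}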

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [tip:sched/core] sched: Expire invalid runtime
  2011-07-21 16:43 ` [patch 07/18] sched: expire invalid runtime Paul Turner
@ 2011-08-14 16:24   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Paul Turner @ 2011-08-14 16:24 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, seto.hidetoshi, pjt, tglx, mingo

Commit-ID:  a9cf55b2861057a213e610da2fec52125439a11d
Gitweb:     http://git.kernel.org/tip/a9cf55b2861057a213e610da2fec52125439a11d
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 21 Jul 2011 09:43:32 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:03:31 +0200

sched: Expire invalid runtime

Since quota is managed using a global state but consumed on a per-cpu basis
we need to ensure that our per-cpu state is appropriately synchronized.
Most importantly, runtime that is stale (from a previous period) should not be
locally consumable.

We take advantage of the existing sched_clock synchronization around the
jiffy to efficiently detect whether we have (globally) crossed a quota
boundary.

One catch is that the direction of spread on sched_clock is undefined;
specifically, we don't know whether our local clock is behind or ahead
of the one responsible for the current expiration time.

Fortunately we can differentiate these by considering whether the
global deadline has advanced.  If it has not, then we assume our clock to be
"fast" and advance our local expiration; otherwise, we know the deadline has
truly passed and we expire our local runtime.
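
The deadline checks in the patch below compare clock values via a signed
difference rather than directly; a tiny standalone demo of that idiom
(made-up values, including a wrapped case the idiom tolerates):

#include <stdio.h>
#include <stdint.h>

/* the (s64)(a - b) comparison idiom used for the expiry checks below */
static int deadline_passed(uint64_t now, uint64_t expires)
{
	return (int64_t)(now - expires) >= 0;
}

int main(void)
{
	uint64_t expires = 1000;

	printf("now=900  passed=%d\n", deadline_passed(900, expires));
	printf("now=1500 passed=%d\n", deadline_passed(1500, expires));

	/* deadline just after a u64 wrap, "now" just before it: still
	 * correctly reported as not yet passed */
	printf("near wrap passed=%d\n",
	       deadline_passed(UINT64_MAX - 10, 5));
	return 0;
}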

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.379275352@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c      |    4 ++-
 kernel/sched_fair.c |   90 +++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 84 insertions(+), 10 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 34bf8e6..a2d5514 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -256,6 +256,7 @@ struct cfs_bandwidth {
 	ktime_t period;
 	u64 quota, runtime;
 	s64 hierarchal_quota;
+	u64 runtime_expires;
 
 	int idle, timer_active;
 	struct hrtimer period_timer;
@@ -396,6 +397,7 @@ struct cfs_rq {
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	int runtime_enabled;
+	u64 runtime_expires;
 	s64 runtime_remaining;
 #endif
 #endif
@@ -9166,8 +9168,8 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
-	cfs_b->runtime = quota;
 
+	__refill_cfs_bandwidth_runtime(cfs_b);
 	/* restart the period timer (if active) to handle new period expiry */
 	if (runtime_enabled && cfs_b->timer_active) {
 		/* force a reprogram */
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index af73a8a..9d1adbd 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1272,11 +1272,30 @@ static inline u64 sched_cfs_bandwidth_slice(void)
 	return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
 }
 
+/*
+ * Replenish runtime according to assigned quota and update expiration time.
+ * We use sched_clock_cpu directly instead of rq->clock to avoid adding
+ * additional synchronization around rq->lock.
+ *
+ * requires cfs_b->lock
+ */
+static void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
+{
+	u64 now;
+
+	if (cfs_b->quota == RUNTIME_INF)
+		return;
+
+	now = sched_clock_cpu(smp_processor_id());
+	cfs_b->runtime = cfs_b->quota;
+	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
+}
+
 static void assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-	u64 amount = 0, min_amount;
+	u64 amount = 0, min_amount, expires;
 
 	/* note: this is a positive sum as runtime_remaining <= 0 */
 	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -1285,9 +1304,16 @@ static void assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	if (cfs_b->quota == RUNTIME_INF)
 		amount = min_amount;
 	else {
-		/* ensure bandwidth timer remains active under consumption */
-		if (!cfs_b->timer_active)
+		/*
+		 * If the bandwidth pool has become inactive, then at least one
+		 * period must have elapsed since the last consumption.
+		 * Refresh the global state and ensure bandwidth timer becomes
+		 * active.
+		 */
+		if (!cfs_b->timer_active) {
+			__refill_cfs_bandwidth_runtime(cfs_b);
 			__start_cfs_bandwidth(cfs_b);
+		}
 
 		if (cfs_b->runtime > 0) {
 			amount = min(cfs_b->runtime, min_amount);
@@ -1295,19 +1321,61 @@ static void assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 			cfs_b->idle = 0;
 		}
 	}
+	expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);
 
 	cfs_rq->runtime_remaining += amount;
+	/*
+	 * we may have advanced our local expiration to account for allowed
+	 * spread between our sched_clock and the one on which runtime was
+	 * issued.
+	 */
+	if ((s64)(expires - cfs_rq->runtime_expires) > 0)
+		cfs_rq->runtime_expires = expires;
 }
 
-static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
-				     unsigned long delta_exec)
+/*
+ * Note: This depends on the synchronization provided by sched_clock and the
+ * fact that rq->clock snapshots this value.
+ */
+static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
-	if (!cfs_rq->runtime_enabled)
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	struct rq *rq = rq_of(cfs_rq);
+
+	/* if the deadline is ahead of our clock, nothing to do */
+	if (likely((s64)(rq->clock - cfs_rq->runtime_expires) < 0))
+		return;
+
+	if (cfs_rq->runtime_remaining < 0)
 		return;
 
+	/*
+	 * If the local deadline has passed we have to consider the
+	 * possibility that our sched_clock is 'fast' and the global deadline
+	 * has not truly expired.
+	 *
+	 * Fortunately we can determine whether this is the case by checking
+	 * whether the global deadline has advanced.
+	 */
+
+	if ((s64)(cfs_rq->runtime_expires - cfs_b->runtime_expires) >= 0) {
+		/* extend local deadline, drift is bounded above by 2 ticks */
+		cfs_rq->runtime_expires += TICK_NSEC;
+	} else {
+		/* global deadline is ahead, expiration has passed */
+		cfs_rq->runtime_remaining = 0;
+	}
+}
+
+static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+				     unsigned long delta_exec)
+{
+	/* dock delta_exec before expiring quota (as it could span periods) */
 	cfs_rq->runtime_remaining -= delta_exec;
-	if (cfs_rq->runtime_remaining > 0)
+	expire_cfs_rq_runtime(cfs_rq);
+
+	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
 
 	assign_cfs_rq_runtime(cfs_rq);
@@ -1338,7 +1406,12 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 		goto out_unlock;
 
 	idle = cfs_b->idle;
-	cfs_b->runtime = cfs_b->quota;
+	/* if we're going inactive then everything else can be deferred */
+	if (idle)
+		goto out_unlock;
+
+	__refill_cfs_bandwidth_runtime(cfs_b);
+
 
 	/* mark as potentially idle for the upcoming period */
 	cfs_b->idle = 1;
@@ -1557,7 +1630,6 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 
 	return wl;
 }
-
 #else
 
 static inline unsigned long effective_load(struct task_group *tg, int cpu,


* [tip:sched/core] sched: Add support for throttling group entities
  2011-07-21 16:43 ` [patch 08/18] sched: add support for throttling group entities Paul Turner
  2011-08-08 15:46   ` Lin Ming
@ 2011-08-14 16:26   ` tip-bot for Paul Turner
  1 sibling, 0 replies; 60+ messages in thread
From: tip-bot for Paul Turner @ 2011-08-14 16:26 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, pjt, tglx, mingo

Commit-ID:  85dac906bec3bb41bfaa7ccaa65c4706de5cfdf8
Gitweb:     http://git.kernel.org/tip/85dac906bec3bb41bfaa7ccaa65c4706de5cfdf8
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 21 Jul 2011 09:43:33 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:03:34 +0200

sched: Add support for throttling group entities

Now that consumption is tracked (via update_curr()) we add support to throttle
group entities (and their corresponding cfs_rqs) in the case where there is no
run-time remaining.

Throttled entities are dequeued to prevent scheduling; additionally we mark
them as throttled (using cfs_rq->throttled) to prevent them from becoming
re-enqueued until they are unthrottled.  A list of a task_group's throttled
entities is maintained on the cfs_bandwidth structure.

Note: While the machinery for throttling is added in this patch, the act of
throttling an entity exceeding its bandwidth is deferred until later within
the series.
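
For orientation, the hierarchical dequeue performed at throttle time can be
sketched in standalone C (the struct and helper below are hypothetical
simplifications, not kernel types): walking up from the throttled group, each
entity is dequeued until an ancestor that still has other runnable load is
reached, while the hierarchical task counts are adjusted the whole way up.

#include <stdio.h>

/* hypothetical, flattened stand-in for a task_group's per-cpu state */
struct grp {
	struct grp *parent;
	int on_rq;		/* is this group's entity enqueued in its parent? */
	long other_weight;	/* runnable load besides the throttled child */
	long h_nr_running;	/* hierarchical count of runnable tasks */
};

static void throttle_grp(struct grp *leaf, long task_delta)
{
	struct grp *g;
	int dequeue = 1;

	for (g = leaf; g->parent && g->on_rq; g = g->parent) {
		if (dequeue)
			g->on_rq = 0;		/* dequeue_entity() analogue */
		g->parent->h_nr_running -= task_delta;

		/* an ancestor with other runnable entities stays enqueued */
		if (g->parent->other_weight)
			dequeue = 0;
	}
}

int main(void)
{
	struct grp root = { NULL,  1, 1, 7 };	/* has other runnable load */
	struct grp mid  = { &root, 1, 0, 3 };
	struct grp leaf = { &mid,  1, 0, 3 };

	throttle_grp(&leaf, leaf.h_nr_running);
	printf("leaf on_rq=%d, mid on_rq=%d, root h_nr_running=%ld\n",
	       leaf.on_rq, mid.on_rq, root.h_nr_running);
	return 0;
}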

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.480608533@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c      |    7 ++++
 kernel/sched_fair.c |   89 ++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 92 insertions(+), 4 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index a2d5514..044260a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -260,6 +260,8 @@ struct cfs_bandwidth {
 
 	int idle, timer_active;
 	struct hrtimer period_timer;
+	struct list_head throttled_cfs_rq;
+
 #endif
 };
 
@@ -399,6 +401,9 @@ struct cfs_rq {
 	int runtime_enabled;
 	u64 runtime_expires;
 	s64 runtime_remaining;
+
+	int throttled;
+	struct list_head throttled_list;
 #endif
 #endif
 };
@@ -441,6 +446,7 @@ static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
 
+	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
 }
@@ -448,6 +454,7 @@ static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	cfs_rq->runtime_enabled = 0;
+	INIT_LIST_HEAD(&cfs_rq->throttled_list);
 }
 
 /* requires cfs_b->lock, may release to reprogram timer */
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 9d1adbd..72c9d4e 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1291,7 +1291,8 @@ static void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
 }
 
-static void assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+/* returns 0 on failure to allocate runtime */
+static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
@@ -1332,6 +1333,8 @@ static void assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	 */
 	if ((s64)(expires - cfs_rq->runtime_expires) > 0)
 		cfs_rq->runtime_expires = expires;
+
+	return cfs_rq->runtime_remaining > 0;
 }
 
 /*
@@ -1378,7 +1381,12 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
 
-	assign_cfs_rq_runtime(cfs_rq);
+	/*
+	 * if we're unable to extend our runtime we resched so that the active
+	 * hierarchy can be throttled
+	 */
+	if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
+		resched_task(rq_of(cfs_rq)->curr);
 }
 
 static __always_inline void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
@@ -1390,6 +1398,47 @@ static __always_inline void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 	__account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
 
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq->throttled;
+}
+
+static __used void throttle_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	struct rq *rq = rq_of(cfs_rq);
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	struct sched_entity *se;
+	long task_delta, dequeue = 1;
+
+	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+
+	/* account load preceding throttle */
+	update_cfs_load(cfs_rq, 0);
+
+	task_delta = cfs_rq->h_nr_running;
+	for_each_sched_entity(se) {
+		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
+		/* throttled entity or throttle-on-deactivate */
+		if (!se->on_rq)
+			break;
+
+		if (dequeue)
+			dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
+		qcfs_rq->h_nr_running -= task_delta;
+
+		if (qcfs_rq->load.weight)
+			dequeue = 0;
+	}
+
+	if (!se)
+		rq->nr_running -= task_delta;
+
+	cfs_rq->throttled = 1;
+	raw_spin_lock(&cfs_b->lock);
+	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
+	raw_spin_unlock(&cfs_b->lock);
+}
+
 /*
  * Responsible for refilling a task_group's bandwidth and unthrottling its
  * cfs_rqs as appropriate. If there has been no activity within the last
@@ -1425,6 +1474,11 @@ out_unlock:
 #else
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 				     unsigned long delta_exec) {}
+
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+	return 0;
+}
 #endif
 
 /**************************************************
@@ -1503,7 +1557,17 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 			break;
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, flags);
+
+		/*
+		 * end evaluation on encountering a throttled cfs_rq
+		 *
+		 * note: in the case of encountering a throttled cfs_rq we will
+		 * post the final h_nr_running increment below.
+		*/
+		if (cfs_rq_throttled(cfs_rq))
+			break;
 		cfs_rq->h_nr_running++;
+
 		flags = ENQUEUE_WAKEUP;
 	}
 
@@ -1511,11 +1575,15 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq = cfs_rq_of(se);
 		cfs_rq->h_nr_running++;
 
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
-	inc_nr_running(rq);
+	if (!se)
+		inc_nr_running(rq);
 	hrtick_update(rq);
 }
 
@@ -1535,6 +1603,15 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, flags);
+
+		/*
+		 * end evaluation on encountering a throttled cfs_rq
+		 *
+		 * note: in the case of encountering a throttled cfs_rq we will
+		 * post the final h_nr_running decrement below.
+		*/
+		if (cfs_rq_throttled(cfs_rq))
+			break;
 		cfs_rq->h_nr_running--;
 
 		/* Don't dequeue parent if it has other entities besides us */
@@ -1557,11 +1634,15 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq = cfs_rq_of(se);
 		cfs_rq->h_nr_running--;
 
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
-	dec_nr_running(rq);
+	if (!se)
+		dec_nr_running(rq);
 	hrtick_update(rq);
 }
 


* [tip:sched/core] sched: Add support for unthrottling group entities
  2011-07-21 16:43 ` [patch 09/18] sched: add support for unthrottling " Paul Turner
@ 2011-08-14 16:27   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Paul Turner @ 2011-08-14 16:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, seto.hidetoshi, pjt, tglx, mingo

Commit-ID:  671fd9dabe5239ad218c7eb48b2b9edee50250e6
Gitweb:     http://git.kernel.org/tip/671fd9dabe5239ad218c7eb48b2b9edee50250e6
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 21 Jul 2011 09:43:34 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:03:36 +0200

sched: Add support for unthrottling group entities

At the start of each period we refresh the global bandwidth pool.  At this time
we must also unthrottle any cfs_rq entities that are now within bandwidth once
more (as quota permits).

Unthrottled cfs_rqs have their corresponding throttled flag cleared and their
entities re-enqueued.
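
The distribution step that follows the refresh can be pictured with a
standalone sketch (the structures below are hypothetical; the kernel version
additionally drops cfs_b->lock and takes each rq->lock while doing this):
every throttled runqueue is topped up to just above zero until either all
debtors are served or the freshly refilled runtime runs out.

#include <stdio.h>

/* hypothetical debtor: a throttled runqueue with a non-positive local balance */
struct debtor {
	long long runtime_remaining;
	int throttled;
};

/* hand out 'remaining' runtime; returns whatever is left over */
static long long distribute(struct debtor *d, int n, long long remaining)
{
	int i;

	for (i = 0; i < n && remaining > 0; i++) {
		long long want;

		if (!d[i].throttled)
			continue;

		/* just enough to bring the local balance above zero */
		want = -d[i].runtime_remaining + 1;
		if (want > remaining)
			want = remaining;
		remaining -= want;

		d[i].runtime_remaining += want;
		if (d[i].runtime_remaining > 0)
			d[i].throttled = 0;	/* unthrottle analogue */
	}
	return remaining;
}

int main(void)
{
	struct debtor d[] = { { -300, 1 }, { -50, 1 }, { -1000, 1 } };
	long long left = distribute(d, 3, 500);

	printf("left=%lld, d[2] remaining=%lld throttled=%d\n",
	       left, d[2].runtime_remaining, d[2].throttled);
	return 0;
}

Any remainder is handed back to the global pool, as the final hunk of
do_sched_cfs_period_timer() below does.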

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.574628950@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c      |    3 +
 kernel/sched_fair.c |  127 +++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 126 insertions(+), 4 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 044260a..4bbabc2 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9192,6 +9192,9 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 		raw_spin_lock_irq(&rq->lock);
 		cfs_rq->runtime_enabled = runtime_enabled;
 		cfs_rq->runtime_remaining = 0;
+
+		if (cfs_rq_throttled(cfs_rq))
+			unthrottle_cfs_rq(cfs_rq);
 		raw_spin_unlock_irq(&rq->lock);
 	}
 out_unlock:
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 72c9d4e..7641195 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1439,6 +1439,84 @@ static __used void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	raw_spin_unlock(&cfs_b->lock);
 }
 
+static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	struct rq *rq = rq_of(cfs_rq);
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	struct sched_entity *se;
+	int enqueue = 1;
+	long task_delta;
+
+	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+
+	cfs_rq->throttled = 0;
+	raw_spin_lock(&cfs_b->lock);
+	list_del_rcu(&cfs_rq->throttled_list);
+	raw_spin_unlock(&cfs_b->lock);
+
+	if (!cfs_rq->load.weight)
+		return;
+
+	task_delta = cfs_rq->h_nr_running;
+	for_each_sched_entity(se) {
+		if (se->on_rq)
+			enqueue = 0;
+
+		cfs_rq = cfs_rq_of(se);
+		if (enqueue)
+			enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
+		cfs_rq->h_nr_running += task_delta;
+
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+	}
+
+	if (!se)
+		rq->nr_running += task_delta;
+
+	/* determine whether we need to wake up potentially idle cpu */
+	if (rq->curr == rq->idle && rq->cfs.nr_running)
+		resched_task(rq->curr);
+}
+
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
+		u64 remaining, u64 expires)
+{
+	struct cfs_rq *cfs_rq;
+	u64 runtime = remaining;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(cfs_rq, &cfs_b->throttled_cfs_rq,
+				throttled_list) {
+		struct rq *rq = rq_of(cfs_rq);
+
+		raw_spin_lock(&rq->lock);
+		if (!cfs_rq_throttled(cfs_rq))
+			goto next;
+
+		runtime = -cfs_rq->runtime_remaining + 1;
+		if (runtime > remaining)
+			runtime = remaining;
+		remaining -= runtime;
+
+		cfs_rq->runtime_remaining += runtime;
+		cfs_rq->runtime_expires = expires;
+
+		/* we check whether we're throttled above */
+		if (cfs_rq->runtime_remaining > 0)
+			unthrottle_cfs_rq(cfs_rq);
+
+next:
+		raw_spin_unlock(&rq->lock);
+
+		if (!remaining)
+			break;
+	}
+	rcu_read_unlock();
+
+	return remaining;
+}
+
 /*
  * Responsible for refilling a task_group's bandwidth and unthrottling its
  * cfs_rqs as appropriate. If there has been no activity within the last
@@ -1447,23 +1525,64 @@ static __used void throttle_cfs_rq(struct cfs_rq *cfs_rq)
  */
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 {
-	int idle = 1;
+	u64 runtime, runtime_expires;
+	int idle = 1, throttled;
 
 	raw_spin_lock(&cfs_b->lock);
 	/* no need to continue the timer with no bandwidth constraint */
 	if (cfs_b->quota == RUNTIME_INF)
 		goto out_unlock;
 
-	idle = cfs_b->idle;
+	throttled = !list_empty(&cfs_b->throttled_cfs_rq);
+	/* idle depends on !throttled (for the case of a large deficit) */
+	idle = cfs_b->idle && !throttled;
+
 	/* if we're going inactive then everything else can be deferred */
 	if (idle)
 		goto out_unlock;
 
 	__refill_cfs_bandwidth_runtime(cfs_b);
 
+	if (!throttled) {
+		/* mark as potentially idle for the upcoming period */
+		cfs_b->idle = 1;
+		goto out_unlock;
+	}
+
+	/*
+	 * There are throttled entities so we must first use the new bandwidth
+	 * to unthrottle them before making it generally available.  This
+	 * ensures that all existing debts will be paid before a new cfs_rq is
+	 * allowed to run.
+	 */
+	runtime = cfs_b->runtime;
+	runtime_expires = cfs_b->runtime_expires;
+	cfs_b->runtime = 0;
+
+	/*
+	 * This check is repeated as we are holding onto the new bandwidth
+	 * while we unthrottle.  This can potentially race with an unthrottled
+	 * group trying to acquire new bandwidth from the global pool.
+	 */
+	while (throttled && runtime > 0) {
+		raw_spin_unlock(&cfs_b->lock);
+		/* we can't nest cfs_b->lock while distributing bandwidth */
+		runtime = distribute_cfs_runtime(cfs_b, runtime,
+						 runtime_expires);
+		raw_spin_lock(&cfs_b->lock);
+
+		throttled = !list_empty(&cfs_b->throttled_cfs_rq);
+	}
 
-	/* mark as potentially idle for the upcoming period */
-	cfs_b->idle = 1;
+	/* return (any) remaining runtime */
+	cfs_b->runtime = runtime;
+	/*
+	 * While we are ensured activity in the period following an
+	 * unthrottle, this also covers the case in which the new bandwidth is
+	 * insufficient to cover the existing bandwidth deficit.  (Forcing the
+	 * timer to remain active while there are any throttled entities.)
+	 */
+	cfs_b->idle = 0;
 out_unlock:
 	if (idle)
 		cfs_b->timer_active = 0;


* [tip:sched/core] sched: Allow for positional tg_tree walks
  2011-07-21 16:43 ` [patch 10/18] sched: allow for positional tg_tree walks Paul Turner
@ 2011-08-14 16:29   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Paul Turner @ 2011-08-14 16:29 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, seto.hidetoshi, pjt, tglx, mingo

Commit-ID:  8277434ef1202ce30315f8edb3fc760aa6e74493
Gitweb:     http://git.kernel.org/tip/8277434ef1202ce30315f8edb3fc760aa6e74493
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 21 Jul 2011 09:43:35 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:03:38 +0200

sched: Allow for positional tg_tree walks

Extend walk_tg_tree to accept a positional argument

static int walk_tg_tree_from(struct task_group *from,
			     tg_visitor down, tg_visitor up, void *data)

Existing semantics are preserved; the caller must hold rcu_lock() or a
sufficient analogue.
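
For reference, a minimal self-contained version of such a positional down/up
walk might look as follows (a recursive sketch for brevity, whereas the kernel
implementation is iterative and traverses RCU-protected child lists): @down is
invoked when a node is first entered, @up when it is left for the final time,
and the walk is rooted at an arbitrary node rather than at the tree's root.

#include <stdio.h>

/* a minimal n-ary node; the kernel walks task_groups via their child lists */
struct node {
	const char *name;
	struct node **children;		/* NULL-terminated array, or NULL */
};

typedef int (*visitor)(struct node *, void *);

/* visit 'from' and its descendants: down on entry, up on the final exit */
static int walk_from(struct node *from, visitor down, visitor up, void *data)
{
	int ret = down(from, data);

	if (ret)
		return ret;

	if (from->children) {
		struct node **c;

		for (c = from->children; *c; c++) {
			ret = walk_from(*c, down, up, data);
			if (ret)
				return ret;
		}
	}
	return up(from, data);
}

static int print_down(struct node *n, void *data)
{
	(void)data;
	printf("down %s\n", n->name);
	return 0;
}

static int print_up(struct node *n, void *data)
{
	(void)data;
	printf("up   %s\n", n->name);
	return 0;
}

int main(void)
{
	struct node b = { "B", NULL }, c = { "C", NULL };
	struct node *kids[] = { &b, &c, NULL };
	struct node a = { "A", kids };

	return walk_from(&a, print_down, print_up, NULL);
}

A non-zero return from either visitor aborts the walk, matching the error
handling above.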

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.677889157@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |   50 +++++++++++++++++++++++++++++++++++++-------------
 1 files changed, 37 insertions(+), 13 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 4bbabc2..8ec1e7a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1591,20 +1591,23 @@ static inline void dec_cpu_load(struct rq *rq, unsigned long load)
 typedef int (*tg_visitor)(struct task_group *, void *);
 
 /*
- * Iterate the full tree, calling @down when first entering a node and @up when
- * leaving it for the final time.
+ * Iterate task_group tree rooted at *from, calling @down when first entering a
+ * node and @up when leaving it for the final time.
+ *
+ * Caller must hold rcu_lock or sufficient equivalent.
  */
-static int walk_tg_tree(tg_visitor down, tg_visitor up, void *data)
+static int walk_tg_tree_from(struct task_group *from,
+			     tg_visitor down, tg_visitor up, void *data)
 {
 	struct task_group *parent, *child;
 	int ret;
 
-	rcu_read_lock();
-	parent = &root_task_group;
+	parent = from;
+
 down:
 	ret = (*down)(parent, data);
 	if (ret)
-		goto out_unlock;
+		goto out;
 	list_for_each_entry_rcu(child, &parent->children, siblings) {
 		parent = child;
 		goto down;
@@ -1613,19 +1616,29 @@ up:
 		continue;
 	}
 	ret = (*up)(parent, data);
-	if (ret)
-		goto out_unlock;
+	if (ret || parent == from)
+		goto out;
 
 	child = parent;
 	parent = parent->parent;
 	if (parent)
 		goto up;
-out_unlock:
-	rcu_read_unlock();
-
+out:
 	return ret;
 }
 
+/*
+ * Iterate the full tree, calling @down when first entering a node and @up when
+ * leaving it for the final time.
+ *
+ * Caller must hold rcu_lock or sufficient equivalent.
+ */
+
+static inline int walk_tg_tree(tg_visitor down, tg_visitor up, void *data)
+{
+	return walk_tg_tree_from(&root_task_group, down, up, data);
+}
+
 static int tg_nop(struct task_group *tg, void *data)
 {
 	return 0;
@@ -8870,13 +8883,19 @@ static int tg_rt_schedulable(struct task_group *tg, void *data)
 
 static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)
 {
+	int ret;
+
 	struct rt_schedulable_data data = {
 		.tg = tg,
 		.rt_period = period,
 		.rt_runtime = runtime,
 	};
 
-	return walk_tg_tree(tg_rt_schedulable, tg_nop, &data);
+	rcu_read_lock();
+	ret = walk_tg_tree(tg_rt_schedulable, tg_nop, &data);
+	rcu_read_unlock();
+
+	return ret;
 }
 
 static int tg_set_rt_bandwidth(struct task_group *tg,
@@ -9333,6 +9352,7 @@ static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
 
 static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
 {
+	int ret;
 	struct cfs_schedulable_data data = {
 		.tg = tg,
 		.period = period,
@@ -9344,7 +9364,11 @@ static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
 		do_div(data.quota, NSEC_PER_USEC);
 	}
 
-	return walk_tg_tree(tg_cfs_schedulable_down, tg_nop, &data);
+	rcu_read_lock();
+	ret = walk_tg_tree(tg_cfs_schedulable_down, tg_nop, &data);
+	rcu_read_unlock();
+
+	return ret;
 }
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */


* [tip:sched/core] sched: Prevent interactions with throttled entities
  2011-07-21 16:43 ` [patch 11/18] sched: prevent interactions with throttled entities Paul Turner
  2011-07-22 11:26   ` Kamalesh Babulal
  2011-07-22 11:41   ` Kamalesh Babulal
@ 2011-08-14 16:30   ` tip-bot for Paul Turner
  2 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Paul Turner @ 2011-08-14 16:30 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, seto.hidetoshi, pjt, tglx, mingo

Commit-ID:  64660c864f46202b932b911a69deb09805bdbaf8
Gitweb:     http://git.kernel.org/tip/64660c864f46202b932b911a69deb09805bdbaf8
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 21 Jul 2011 09:43:36 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:03:40 +0200

sched: Prevent interactions with throttled entities

From the perspective of load-balance and shares distribution, throttled
entities should be invisible.

However, both of these operations work on 'active' lists and are not
inherently aware of what group hierarchies may be present.  In some cases this
may be side-stepped (e.g. we could sideload via tg_load_down in load balance)
while in others (e.g. update_shares()) it is more difficult to compute without
incurring some O(n^2) costs.

Instead, track hierarchical throttled state at the time of transition.  This
allows us to easily identify whether an entity belongs to a throttled hierarchy
and avoid incorrect interactions with it.

Also, when an entity leaves a throttled hierarchy we need to advance its
shares averaging windows so that the elapsed throttled time is not
considered as part of the cfs_rq's operation.

We also use this information to prevent buddy interactions in the wakeup and
yield_to() paths.
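
The hierarchical state itself is just a per-cfs_rq counter maintained by the
tree walks at (un)throttle time; a toy standalone sketch of the idea follows
(all names are hypothetical):

#include <stdio.h>

/*
 * Every runqueue below a throttled group carries a non-zero count, so the
 * "inside a throttled hierarchy?" test is a single load.
 */
struct rq_state {
	int throttle_count;
};

static int throttled_hierarchy(const struct rq_state *s)
{
	return s->throttle_count != 0;
}

/* skip a migration when either end sits inside a throttled hierarchy */
static int throttled_pair(const struct rq_state *src, const struct rq_state *dst)
{
	return throttled_hierarchy(src) || throttled_hierarchy(dst);
}

int main(void)
{
	struct rq_state src = { .throttle_count = 2 };	/* two throttled ancestors */
	struct rq_state dst = { .throttle_count = 0 };

	src.throttle_count--;		/* one ancestor was unthrottled */
	printf("migration still blocked: %d\n", throttled_pair(&src, &dst));
	return 0;
}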

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.777916795@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c      |    2 +-
 kernel/sched_fair.c |   99 +++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 94 insertions(+), 7 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 8ec1e7a..5db05f6 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -402,7 +402,7 @@ struct cfs_rq {
 	u64 runtime_expires;
 	s64 runtime_remaining;
 
-	int throttled;
+	int throttled, throttle_count;
 	struct list_head throttled_list;
 #endif
 #endif
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 7641195..5a20894 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -706,6 +706,8 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+/* we need this in update_cfs_load and load-balance functions below */
+static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
 # ifdef CONFIG_SMP
 static void update_cfs_rq_load_contribution(struct cfs_rq *cfs_rq,
 					    int global_update)
@@ -728,7 +730,7 @@ static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
 	u64 now, delta;
 	unsigned long load = cfs_rq->load.weight;
 
-	if (cfs_rq->tg == &root_task_group)
+	if (cfs_rq->tg == &root_task_group || throttled_hierarchy(cfs_rq))
 		return;
 
 	now = rq_of(cfs_rq)->clock_task;
@@ -837,7 +839,7 @@ static void update_cfs_shares(struct cfs_rq *cfs_rq)
 
 	tg = cfs_rq->tg;
 	se = tg->se[cpu_of(rq_of(cfs_rq))];
-	if (!se)
+	if (!se || throttled_hierarchy(cfs_rq))
 		return;
 #ifndef CONFIG_SMP
 	if (likely(se->load.weight == tg->shares))
@@ -1403,6 +1405,65 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 	return cfs_rq->throttled;
 }
 
+/* check whether cfs_rq, or any parent, is throttled */
+static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq->throttle_count;
+}
+
+/*
+ * Ensure that neither of the group entities corresponding to src_cpu or
+ * dest_cpu are members of a throttled hierarchy when performing group
+ * load-balance operations.
+ */
+static inline int throttled_lb_pair(struct task_group *tg,
+				    int src_cpu, int dest_cpu)
+{
+	struct cfs_rq *src_cfs_rq, *dest_cfs_rq;
+
+	src_cfs_rq = tg->cfs_rq[src_cpu];
+	dest_cfs_rq = tg->cfs_rq[dest_cpu];
+
+	return throttled_hierarchy(src_cfs_rq) ||
+	       throttled_hierarchy(dest_cfs_rq);
+}
+
+/* updated child weight may affect parent so we have to do this bottom up */
+static int tg_unthrottle_up(struct task_group *tg, void *data)
+{
+	struct rq *rq = data;
+	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+
+	cfs_rq->throttle_count--;
+#ifdef CONFIG_SMP
+	if (!cfs_rq->throttle_count) {
+		u64 delta = rq->clock_task - cfs_rq->load_stamp;
+
+		/* leaving throttled state, advance shares averaging windows */
+		cfs_rq->load_stamp += delta;
+		cfs_rq->load_last += delta;
+
+		/* update entity weight now that we are on_rq again */
+		update_cfs_shares(cfs_rq);
+	}
+#endif
+
+	return 0;
+}
+
+static int tg_throttle_down(struct task_group *tg, void *data)
+{
+	struct rq *rq = data;
+	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+
+	/* group is entering throttled state, record last load */
+	if (!cfs_rq->throttle_count)
+		update_cfs_load(cfs_rq, 0);
+	cfs_rq->throttle_count++;
+
+	return 0;
+}
+
 static __used void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
@@ -1413,7 +1474,9 @@ static __used void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
 
 	/* account load preceding throttle */
-	update_cfs_load(cfs_rq, 0);
+	rcu_read_lock();
+	walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
+	rcu_read_unlock();
 
 	task_delta = cfs_rq->h_nr_running;
 	for_each_sched_entity(se) {
@@ -1454,6 +1517,10 @@ static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	list_del_rcu(&cfs_rq->throttled_list);
 	raw_spin_unlock(&cfs_b->lock);
 
+	update_rq_clock(rq);
+	/* update hierarchical throttle state */
+	walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);
+
 	if (!cfs_rq->load.weight)
 		return;
 
@@ -1598,6 +1665,17 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {
 	return 0;
 }
+
+static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
+{
+	return 0;
+}
+
+static inline int throttled_lb_pair(struct task_group *tg,
+				    int src_cpu, int dest_cpu)
+{
+	return 0;
+}
 #endif
 
 /**************************************************
@@ -2493,6 +2571,9 @@ move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest,
 
 	for_each_leaf_cfs_rq(busiest, cfs_rq) {
 		list_for_each_entry_safe(p, n, &cfs_rq->tasks, se.group_node) {
+			if (throttled_lb_pair(task_group(p),
+					      busiest->cpu, this_cpu))
+				break;
 
 			if (!can_migrate_task(p, busiest, this_cpu,
 						sd, idle, &pinned))
@@ -2608,8 +2689,13 @@ static void update_shares(int cpu)
 	 * Iterates the task_group tree in a bottom up fashion, see
 	 * list_add_leaf_cfs_rq() for details.
 	 */
-	for_each_leaf_cfs_rq(rq, cfs_rq)
+	for_each_leaf_cfs_rq(rq, cfs_rq) {
+		/* throttled entities do not contribute to load */
+		if (throttled_hierarchy(cfs_rq))
+			continue;
+
 		update_shares_cpu(cfs_rq->tg, cpu);
+	}
 	rcu_read_unlock();
 }
 
@@ -2659,9 +2745,10 @@ load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
 		u64 rem_load, moved_load;
 
 		/*
-		 * empty group
+		 * empty group or part of a throttled hierarchy
 		 */
-		if (!busiest_cfs_rq->task_weight)
+		if (!busiest_cfs_rq->task_weight ||
+		    throttled_lb_pair(busiest_cfs_rq->tg, cpu_of(busiest), this_cpu))
 			continue;
 
 		rem_load = (u64)rem_load_move * busiest_weight;


* [tip:sched/core] sched: Prevent buddy interactions with throttled entities
  2011-07-21 16:43 ` [patch 12/18] sched: prevent buddy " Paul Turner
@ 2011-08-14 16:32   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Paul Turner @ 2011-08-14 16:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, pjt, tglx, mingo

Commit-ID:  5238cdd3873e67a98b28c1161d65d2a615c320a3
Gitweb:     http://git.kernel.org/tip/5238cdd3873e67a98b28c1161d65d2a615c320a3
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 21 Jul 2011 09:43:37 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:03:42 +0200

sched: Prevent buddy interactions with throttled entities

Buddies allow us to select "on-rq" entities without actually selecting them
from a cfs_rq's rb_tree.  As a result we must ensure that throttled entities
are not falsely nominated as buddies.  The fact that entities are dequeued
within throttle_entity is not sufficient for clearing buddy status as the
nomination may occur after throttling.

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.886850167@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched_fair.c |   18 +++++++++++++++++-
 1 files changed, 17 insertions(+), 1 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5a20894..1d4acbe 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2348,6 +2348,15 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	if (unlikely(se == pse))
 		return;
 
+	/*
+	 * This is possible from callers such as pull_task(), in which we
+	 * unconditionally check_preempt_curr() after an enqueue (which may have
+	 * led to a throttle).  This both saves work and prevents false
+	 * next-buddy nomination below.
+	 */
+	if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
+		return;
+
 	if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
 		set_next_buddy(pse);
 		next_buddy_marked = 1;
@@ -2356,6 +2365,12 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	/*
 	 * We can come here with TIF_NEED_RESCHED already set from new task
 	 * wake up path.
+	 *
+	 * Note: this also catches the edge-case of curr being in a throttled
+	 * group (e.g. via set_curr_task), since update_curr() (in the
+	 * enqueue of curr) will have resulted in resched being set.  This
+	 * prevents us from potentially nominating it as a false LAST_BUDDY
+	 * below.
 	 */
 	if (test_tsk_need_resched(curr))
 		return;
@@ -2474,7 +2489,8 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preemp
 {
 	struct sched_entity *se = &p->se;
 
-	if (!se->on_rq)
+	/* throttled hierarchies are not runnable */
+	if (!se->on_rq || throttled_hierarchy(cfs_rq_of(se)))
 		return false;
 
 	/* Tell the scheduler that we'd really like pse to run next. */


* [tip:sched/core] sched: Migrate throttled tasks on HOTPLUG
  2011-07-21 16:43 ` [patch 13/18] sched: migrate throttled tasks on HOTPLUG Paul Turner
@ 2011-08-14 16:34   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Paul Turner @ 2011-08-14 16:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, seto.hidetoshi, pjt, tglx, mingo

Commit-ID:  8cb120d3e41a0464a559d639d519cef563717a4e
Gitweb:     http://git.kernel.org/tip/8cb120d3e41a0464a559d639d519cef563717a4e
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 21 Jul 2011 09:43:38 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:03:44 +0200

sched: Migrate throttled tasks on HOTPLUG

Throttled tasks are invisible to cpu-offline since they are not eligible for
selection by pick_next_task().  The regular 'escape' path for a thread that is
blocked at offline is via ttwu->select_task_rq; however, this will not handle a
throttled group since there are no individual thread wakeups on an unthrottle.

Resolve this by unthrottling offline cpus so that threads can be migrated.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.989000590@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |   27 +++++++++++++++++++++++++++
 1 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 5db05f6..3973172 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6335,6 +6335,30 @@ static void calc_global_load_remove(struct rq *rq)
 	rq->calc_load_active = 0;
 }
 
+#ifdef CONFIG_CFS_BANDWIDTH
+static void unthrottle_offline_cfs_rqs(struct rq *rq)
+{
+	struct cfs_rq *cfs_rq;
+
+	for_each_leaf_cfs_rq(rq, cfs_rq) {
+		struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+
+		if (!cfs_rq->runtime_enabled)
+			continue;
+
+		/*
+		 * clock_task is not advancing so we just need to make sure
+		 * there's some valid quota amount
+		 */
+		cfs_rq->runtime_remaining = cfs_b->quota;
+		if (cfs_rq_throttled(cfs_rq))
+			unthrottle_cfs_rq(cfs_rq);
+	}
+}
+#else
+static void unthrottle_offline_cfs_rqs(struct rq *rq) {}
+#endif
+
 /*
  * Migrate all tasks from the rq, sleeping tasks will be migrated by
  * try_to_wake_up()->select_task_rq().
@@ -6360,6 +6384,9 @@ static void migrate_tasks(unsigned int dead_cpu)
 	 */
 	rq->stop = NULL;
 
+	/* Ensure any throttled groups are reachable by pick_next_task */
+	unthrottle_offline_cfs_rqs(rq);
+
 	for ( ; ; ) {
 		/*
 		 * There's this thread running, bail when that's the only


* [tip:sched/core] sched: Throttle entities exceeding their allowed bandwidth
  2011-07-21 16:43 ` [patch 14/18] sched: throttle entities exceeding their allowed bandwidth Paul Turner
@ 2011-08-14 16:35   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Paul Turner @ 2011-08-14 16:35 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, pjt, tglx, mingo

Commit-ID:  d3d9dc3302368269acf94b7381663b93000fe2fe
Gitweb:     http://git.kernel.org/tip/d3d9dc3302368269acf94b7381663b93000fe2fe
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 21 Jul 2011 09:43:39 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:03:47 +0200

sched: Throttle entities exceeding their allowed bandwidth

With the machinery in place to throttle and unthrottle entities, as well as
handle their participation (or lack thereof), we can now enable throttling.

There are two points at which we must check whether it's time to set the
throttled state:
 put_prev_entity() and enqueue_entity().

- put_prev_entity() is the typical throttle path, we reach it by exceeding our
  allocated run-time within update_curr()->account_cfs_rq_runtime() and going
  through a reschedule.

- enqueue_entity() covers the case of a wake-up into an already throttled
  group.  In this case we know the group cannot be on_rq and can throttle
  immediately.  Checks are added at the time of put_prev_entity() and
  enqueue_entity().

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184758.091415417@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched_fair.c |   52 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 1d4acbe..f9f671a 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -970,6 +970,8 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 	se->vruntime = vruntime;
 }
 
+static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
+
 static void
 enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
@@ -999,8 +1001,10 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
 
-	if (cfs_rq->nr_running == 1)
+	if (cfs_rq->nr_running == 1) {
 		list_add_leaf_cfs_rq(cfs_rq);
+		check_enqueue_throttle(cfs_rq);
+	}
 }
 
 static void __clear_buddies_last(struct sched_entity *se)
@@ -1202,6 +1206,8 @@ static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
 	return se;
 }
 
+static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
+
 static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 {
 	/*
@@ -1211,6 +1217,9 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 	if (prev->on_rq)
 		update_curr(cfs_rq);
 
+	/* throttle cfs_rqs exceeding runtime */
+	check_cfs_rq_runtime(cfs_rq);
+
 	check_spread(cfs_rq, prev);
 	if (prev->on_rq) {
 		update_stats_wait_start(cfs_rq, prev);
@@ -1464,7 +1473,7 @@ static int tg_throttle_down(struct task_group *tg, void *data)
 	return 0;
 }
 
-static __used void throttle_cfs_rq(struct cfs_rq *cfs_rq)
+static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
@@ -1657,9 +1666,48 @@ out_unlock:
 
 	return idle;
 }
+
+/*
+ * When a group wakes up we want to make sure that its quota is not already
+ * expired/exceeded, otherwise it may be allowed to steal additional ticks of
+ * runtime as update_curr() throttling cannot trigger until it's on-rq.
+ */
+static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
+{
+	/* an active group must be handled by the update_curr()->put() path */
+	if (!cfs_rq->runtime_enabled || cfs_rq->curr)
+		return;
+
+	/* ensure the group is not already throttled */
+	if (cfs_rq_throttled(cfs_rq))
+		return;
+
+	/* update runtime allocation */
+	account_cfs_rq_runtime(cfs_rq, 0);
+	if (cfs_rq->runtime_remaining <= 0)
+		throttle_cfs_rq(cfs_rq);
+}
+
+/* conditionally throttle active cfs_rq's from put_prev_entity() */
+static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
+		return;
+
+	/*
+	 * it's possible for a throttled entity to be forced into a running
+	 * state (e.g. set_curr_task), in this case we're finished.
+	 */
+	if (cfs_rq_throttled(cfs_rq))
+		return;
+
+	throttle_cfs_rq(cfs_rq);
+}
 #else
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 				     unsigned long delta_exec) {}
+static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
+static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {


* [tip:sched/core] sched: Add exports tracking cfs bandwidth control statistics
  2011-07-21 16:43 ` [patch 15/18] sched: add exports tracking cfs bandwidth control statistics Paul Turner
@ 2011-08-14 16:37   ` tip-bot for Nikhil Rao
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Nikhil Rao @ 2011-08-14 16:37 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, seto.hidetoshi, ncrao,
	pjt, bharata, tglx, mingo

Commit-ID:  e8da1b18b32064c43881bceef0f051c2110c9ab9
Gitweb:     http://git.kernel.org/tip/e8da1b18b32064c43881bceef0f051c2110c9ab9
Author:     Nikhil Rao <ncrao@google.com>
AuthorDate: Thu, 21 Jul 2011 09:43:40 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:03:49 +0200

sched: Add exports tracking cfs bandwidth control statistics

This change introduces statistics exports for the cpu sub-system; these are
added through the use of a stat file similar to that exported by other
subsystems.

The following exports are included:

nr_periods:	number of periods in which execution occurred
nr_throttled:	the number of periods above in which execution was throttled
throttled_time:	cumulative wall-time that any cpus have been throttled for
		this group
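
A small userspace consumer of this file might look like the sketch below (the
cgroup mount point and group name are assumptions and need adjusting per
system; the three keys are the ones exported above):

#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *path = "/cgroup/cpu/test/cpu.stat";	/* assumed mount/group */
	unsigned long long nr_periods = 0, nr_throttled = 0, throttled_time = 0;
	unsigned long long val;
	char key[64];
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}

	while (fscanf(f, "%63s %llu", key, &val) == 2) {
		if (!strcmp(key, "nr_periods"))
			nr_periods = val;
		else if (!strcmp(key, "nr_throttled"))
			nr_throttled = val;
		else if (!strcmp(key, "throttled_time"))
			throttled_time = val;
	}
	fclose(f);

	printf("throttled in %llu of %llu periods, %.3f ms total\n",
	       nr_throttled, nr_periods, throttled_time / 1e6);
	return 0;
}

The nr_throttled/nr_periods ratio gives the fraction of enforcement intervals
in which the group actually hit its limit.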

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184758.198901931@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c      |   21 +++++++++++++++++++++
 kernel/sched_fair.c |    7 +++++++
 2 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 3973172..35c9185 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -262,6 +262,9 @@ struct cfs_bandwidth {
 	struct hrtimer period_timer;
 	struct list_head throttled_cfs_rq;
 
+	/* statistics */
+	int nr_periods, nr_throttled;
+	u64 throttled_time;
 #endif
 };
 
@@ -402,6 +405,7 @@ struct cfs_rq {
 	u64 runtime_expires;
 	s64 runtime_remaining;
 
+	u64 throttled_timestamp;
 	int throttled, throttle_count;
 	struct list_head throttled_list;
 #endif
@@ -9397,6 +9401,19 @@ static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
 
 	return ret;
 }
+
+static int cpu_stats_show(struct cgroup *cgrp, struct cftype *cft,
+		struct cgroup_map_cb *cb)
+{
+	struct task_group *tg = cgroup_tg(cgrp);
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+
+	cb->fill(cb, "nr_periods", cfs_b->nr_periods);
+	cb->fill(cb, "nr_throttled", cfs_b->nr_throttled);
+	cb->fill(cb, "throttled_time", cfs_b->throttled_time);
+
+	return 0;
+}
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -9443,6 +9460,10 @@ static struct cftype cpu_files[] = {
 		.read_u64 = cpu_cfs_period_read_u64,
 		.write_u64 = cpu_cfs_period_write_u64,
 	},
+	{
+		.name = "stat",
+		.read_map = cpu_stats_show,
+	},
 #endif
 #ifdef CONFIG_RT_GROUP_SCHED
 	{
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index f9f671a..d201f28 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1506,6 +1506,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 		rq->nr_running -= task_delta;
 
 	cfs_rq->throttled = 1;
+	cfs_rq->throttled_timestamp = rq->clock;
 	raw_spin_lock(&cfs_b->lock);
 	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
 	raw_spin_unlock(&cfs_b->lock);
@@ -1523,8 +1524,10 @@ static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 
 	cfs_rq->throttled = 0;
 	raw_spin_lock(&cfs_b->lock);
+	cfs_b->throttled_time += rq->clock - cfs_rq->throttled_timestamp;
 	list_del_rcu(&cfs_rq->throttled_list);
 	raw_spin_unlock(&cfs_b->lock);
+	cfs_rq->throttled_timestamp = 0;
 
 	update_rq_clock(rq);
 	/* update hierarchical throttle state */
@@ -1612,6 +1615,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 	throttled = !list_empty(&cfs_b->throttled_cfs_rq);
 	/* idle depends on !throttled (for the case of a large deficit) */
 	idle = cfs_b->idle && !throttled;
+	cfs_b->nr_periods += overrun;
 
 	/* if we're going inactive then everything else can be deferred */
 	if (idle)
@@ -1625,6 +1629,9 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 		goto out_unlock;
 	}
 
+	/* account preceding periods in which throttling occurred */
+	cfs_b->nr_throttled += overrun;
+
 	/*
 	 * There are throttled entities so we must first use the new bandwidth
 	 * to unthrottle them before making it generally available.  This


* [tip:sched/core] sched: Return unused runtime on group dequeue
  2011-07-21 16:43 ` [patch 16/18] sched: return unused runtime on group dequeue Paul Turner
@ 2011-08-14 16:39   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Paul Turner @ 2011-08-14 16:39 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, pjt, tglx, mingo

Commit-ID:  d8b4986d3dbc4fabc2054d63f1d31d6ed2fb1ca8
Gitweb:     http://git.kernel.org/tip/d8b4986d3dbc4fabc2054d63f1d31d6ed2fb1ca8
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 21 Jul 2011 09:43:41 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:03:54 +0200

sched: Return unused runtime on group dequeue

When a local cfs_rq blocks we return the majority of its remaining quota to the
global bandwidth pool for use by other runqueues.

We do this only when the quota is current and there is more than
min_cfs_rq_runtime [1ms by default] of runtime remaining on the rq.

In the case where there are throttled runqueues and we have sufficient
bandwidth to meter out a slice, a second timer is kicked off to handle this
delivery, unthrottling where appropriate.

Using a 'worst case' antagonist which executes on each cpu
for 1ms before moving on to the next, on a fairly large machine:

no quota generations:

 197.47 ms       /cgroup/a/cpuacct.usage
 199.46 ms       /cgroup/a/cpuacct.usage
 205.46 ms       /cgroup/a/cpuacct.usage
 198.46 ms       /cgroup/a/cpuacct.usage
 208.39 ms       /cgroup/a/cpuacct.usage

Since we are allowed to use "stale" quota our usage is effectively bounded by
the rate of input into the global pool and performance is relatively stable.

with quota generations [1s increments]:

 119.58 ms       /cgroup/a/cpuacct.usage
 119.65 ms       /cgroup/a/cpuacct.usage
 119.64 ms       /cgroup/a/cpuacct.usage
 119.63 ms       /cgroup/a/cpuacct.usage
 119.60 ms       /cgroup/a/cpuacct.usage

The large deficit here is due to quota generations (/intentionally/) preventing
us from using previously stranded slack quota.  The cost is that this quota
becomes unavailable.

with quota generations and quota return:

 200.09 ms       /cgroup/a/cpuacct.usage
 200.09 ms       /cgroup/a/cpuacct.usage
 198.09 ms       /cgroup/a/cpuacct.usage
 200.09 ms       /cgroup/a/cpuacct.usage
 200.06 ms       /cgroup/a/cpuacct.usage

By returning unused quota we're able to both stably consume our desired quota
and prevent unintentional overages due to the abuse of slack quota from
previous quota periods (especially on a large machine).
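
A compressed standalone sketch of the return-side arithmetic (names and
constants are illustrative, mirroring the 1ms default added by this patch):
anything beyond min_cfs_rq_runtime is handed back to the global pool, but only
while the local runtime still belongs to the current quota generation.

#include <stdio.h>

#define NSEC_PER_MSEC	1000000LL

/* a cfs_rq won't donate quota below this amount (1ms by default) */
static const long long min_cfs_rq_runtime = 1 * NSEC_PER_MSEC;

/*
 * 'local' is the blocking runqueue's remaining runtime, 'pool' the global
 * bandwidth pool.  The expiration (generation) stamps must match for the
 * returned runtime to be valid.
 */
static void return_runtime(long long *local, unsigned long long local_expires,
			   long long *pool, unsigned long long pool_expires)
{
	long long slack = *local - min_cfs_rq_runtime;

	if (slack <= 0)
		return;

	if (local_expires == pool_expires)
		*pool += slack;

	/* even if it was stale and not returned, don't try again */
	*local -= slack;
}

int main(void)
{
	long long local = 4 * NSEC_PER_MSEC, pool = 10 * NSEC_PER_MSEC;

	return_runtime(&local, 100, &pool, 100);
	printf("local=%lld ns, pool=%lld ns\n", local, pool);
	return 0;
}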

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184758.306848658@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c      |   15 +++++++-
 kernel/sched_fair.c |  108 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 122 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 35c9185..6baade0 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -259,7 +259,7 @@ struct cfs_bandwidth {
 	u64 runtime_expires;
 
 	int idle, timer_active;
-	struct hrtimer period_timer;
+	struct hrtimer period_timer, slack_timer;
 	struct list_head throttled_cfs_rq;
 
 	/* statistics */
@@ -421,6 +421,16 @@ static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
 
 static inline u64 default_cfs_period(void);
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
+static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b);
+
+static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
+{
+	struct cfs_bandwidth *cfs_b =
+		container_of(timer, struct cfs_bandwidth, slack_timer);
+	do_sched_cfs_slack_timer(cfs_b);
+
+	return HRTIMER_NORESTART;
+}
 
 static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 {
@@ -453,6 +463,8 @@ static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
+	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	cfs_b->slack_timer.function = sched_cfs_slack_timer;
 }
 
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
@@ -488,6 +500,7 @@ static void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
 	hrtimer_cancel(&cfs_b->period_timer);
+	hrtimer_cancel(&cfs_b->slack_timer);
 }
 #else
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index d201f28..1ca2cd4 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1052,6 +1052,8 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		__clear_buddies_skip(se);
 }
 
+static void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
+
 static void
 dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
@@ -1090,6 +1092,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if (!(flags & DEQUEUE_SLEEP))
 		se->vruntime -= cfs_rq->min_vruntime;
 
+	/* return excess runtime on last dequeue */
+	return_cfs_rq_runtime(cfs_rq);
+
 	update_min_vruntime(cfs_rq);
 	update_cfs_shares(cfs_rq);
 }
@@ -1674,6 +1679,108 @@ out_unlock:
 	return idle;
 }
 
+/* a cfs_rq won't donate quota below this amount */
+static const u64 min_cfs_rq_runtime = 1 * NSEC_PER_MSEC;
+/* minimum remaining period time to redistribute slack quota */
+static const u64 min_bandwidth_expiration = 2 * NSEC_PER_MSEC;
+/* how long we wait to gather additional slack before distributing */
+static const u64 cfs_bandwidth_slack_period = 5 * NSEC_PER_MSEC;
+
+/* are we near the end of the current quota period? */
+static int runtime_refresh_within(struct cfs_bandwidth *cfs_b, u64 min_expire)
+{
+	struct hrtimer *refresh_timer = &cfs_b->period_timer;
+	u64 remaining;
+
+	/* if the call-back is running a quota refresh is already occurring */
+	if (hrtimer_callback_running(refresh_timer))
+		return 1;
+
+	/* is a quota refresh about to occur? */
+	remaining = ktime_to_ns(hrtimer_expires_remaining(refresh_timer));
+	if (remaining < min_expire)
+		return 1;
+
+	return 0;
+}
+
+static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+	u64 min_left = cfs_bandwidth_slack_period + min_bandwidth_expiration;
+
+	/* if there's a quota refresh soon don't bother with slack */
+	if (runtime_refresh_within(cfs_b, min_left))
+		return;
+
+	start_bandwidth_timer(&cfs_b->slack_timer,
+				ns_to_ktime(cfs_bandwidth_slack_period));
+}
+
+/* we know any runtime found here is valid as update_curr() precedes return */
+static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	s64 slack_runtime = cfs_rq->runtime_remaining - min_cfs_rq_runtime;
+
+	if (slack_runtime <= 0)
+		return;
+
+	raw_spin_lock(&cfs_b->lock);
+	if (cfs_b->quota != RUNTIME_INF &&
+	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+		cfs_b->runtime += slack_runtime;
+
+		/* we are under rq->lock, defer unthrottling using a timer */
+		if (cfs_b->runtime > sched_cfs_bandwidth_slice() &&
+		    !list_empty(&cfs_b->throttled_cfs_rq))
+			start_cfs_slack_bandwidth(cfs_b);
+	}
+	raw_spin_unlock(&cfs_b->lock);
+
+	/* even if it's not valid for return we don't want to try again */
+	cfs_rq->runtime_remaining -= slack_runtime;
+}
+
+static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+{
+	if (!cfs_rq->runtime_enabled || !cfs_rq->nr_running)
+		return;
+
+	__return_cfs_rq_runtime(cfs_rq);
+}
+
+/*
+ * This is done with a timer (instead of inline with bandwidth return) since
+ * it's necessary to juggle rq->locks to unthrottle their respective cfs_rqs.
+ */
+static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
+{
+	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
+	u64 expires;
+
+	/* confirm we're still not at a refresh boundary */
+	if (runtime_refresh_within(cfs_b, min_bandwidth_expiration))
+		return;
+
+	raw_spin_lock(&cfs_b->lock);
+	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice) {
+		runtime = cfs_b->runtime;
+		cfs_b->runtime = 0;
+	}
+	expires = cfs_b->runtime_expires;
+	raw_spin_unlock(&cfs_b->lock);
+
+	if (!runtime)
+		return;
+
+	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+
+	raw_spin_lock(&cfs_b->lock);
+	if (expires == cfs_b->runtime_expires)
+		cfs_b->runtime = runtime;
+	raw_spin_unlock(&cfs_b->lock);
+}
+
 /*
  * When a group wakes up we want to make sure that its quota is not already
  * expired/exceeded, otherwise it may be allowed to steal additional ticks of
@@ -1715,6 +1822,7 @@ static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 				     unsigned long delta_exec) {}
 static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
+static void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {


* [tip:sched/core] sched: Add documentation for bandwidth control
  2011-07-21 16:43 ` [patch 18/18] sched: add documentation for bandwidth control Paul Turner
@ 2011-08-14 16:41   ` tip-bot for Bharata B Rao
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Bharata B Rao @ 2011-08-14 16:41 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, bharata, a.p.zijlstra, pjt, tglx, mingo

Commit-ID:  88ebc08ea9f721d1345d5414288a308ea42ac458
Gitweb:     http://git.kernel.org/tip/88ebc08ea9f721d1345d5414288a308ea42ac458
Author:     Bharata B Rao <bharata@linux.vnet.ibm.com>
AuthorDate: Thu, 21 Jul 2011 09:43:43 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 14 Aug 2011 12:03:58 +0200

sched: Add documentation for bandwidth control

Basic description of usage and effect for CFS Bandwidth Control.

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184758.498036116@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 Documentation/scheduler/sched-bwc.txt |  122 +++++++++++++++++++++++++++++++++
 1 files changed, 122 insertions(+), 0 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
new file mode 100644
index 0000000..f6b1873
--- /dev/null
+++ b/Documentation/scheduler/sched-bwc.txt
@@ -0,0 +1,122 @@
+CFS Bandwidth Control
+=====================
+
+[ This document only discusses CPU bandwidth control for SCHED_NORMAL.
+  The SCHED_RT case is covered in Documentation/scheduler/sched-rt-group.txt ]
+
+CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
+specification of the maximum CPU bandwidth available to a group or hierarchy.
+
+The bandwidth allowed for a group is specified using a quota and period. Within
+each given "period" (microseconds), a group is allowed to consume only up to
+"quota" microseconds of CPU time.  When the CPU bandwidth consumption of a
+group exceeds this limit (for that period), the tasks belonging to its
+hierarchy will be throttled and are not allowed to run again until the next
+period.
+
+A group's unused runtime is globally tracked and refreshed back to the "quota"
+specified above at each period boundary.  As threads consume this bandwidth it
+is transferred to cpu-local "silos" on demand.  The amount transferred within
+each of these updates is tunable and described as the "slice".
+
+Management
+----------
+Quota and period are managed within the cpu subsystem via cgroupfs.
+
+cpu.cfs_quota_us: the total available run-time within a period (in microseconds)
+cpu.cfs_period_us: the length of a period (in microseconds)
+cpu.stat: exports throttling statistics [explained further below]
+
+The default values are:
+	cpu.cfs_period_us=100ms
+	cpu.cfs_quota_us=-1
+
+A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
+bandwidth restriction in place; such a group is described as an unconstrained
+bandwidth group.  This represents the traditional work-conserving behavior for
+CFS.
+
+Writing any (valid) positive value(s) will enact the specified bandwidth limit.
+The minimum allowed for either the quota or the period is 1ms.  There is also
+an upper bound on the period length of 1s.  Additional restrictions exist when
+bandwidth limits are used in a hierarchical fashion; these are explained in
+more detail below.
+
+Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit
+and return the group to an unconstrained state once more.
+
+Any updates to a group's bandwidth specification will result in it becoming
+unthrottled if it is in a constrained state.
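+
+For example, a minimal sketch of enacting a limit of half a CPU and then
+removing it again (run from the group's directory under the cpu controller
+mount; values illustrative):
+
+	# echo 100000 > cpu.cfs_period_us /* period = 100ms */
+	# echo 50000 > cpu.cfs_quota_us   /* quota = 50ms, i.e. half a CPU */
+	# echo -1 > cpu.cfs_quota_us      /* remove the limit again */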
+
+System wide settings
+--------------------
+For efficiency, run-time is transferred between the global pool and CPU-local
+"silos" in a batch fashion.  This greatly reduces global accounting pressure
+on large systems.  The amount transferred each time such an update is required
+is described as the "slice".
+
+This is tunable via procfs:
+	/proc/sys/kernel/sched_cfs_bandwidth_slice_us (default=5ms)
+
+Larger slice values will reduce transfer overheads, while smaller values allow
+for more fine-grained consumption.
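+
+For example, to raise the slice from the default 5ms to 10ms (an illustrative
+value, not a recommendation):
+
+	# echo 10000 > /proc/sys/kernel/sched_cfs_bandwidth_slice_us /* slice = 10ms */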
+
+Statistics
+----------
+A group's bandwidth statistics are exported via 3 fields in cpu.stat.
+
+cpu.stat:
+- nr_periods: Number of enforcement intervals that have elapsed.
+- nr_throttled: Number of times the group has been throttled/limited.
+- throttled_time: The total time duration (in nanoseconds) for which entities
+  of the group have been throttled.
+
+This interface is read-only.
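+
+For example (hypothetical mount point and group name, illustrative values):
+
+	# cat /cgroup/cpu/test/cpu.stat
+	nr_periods 40
+	nr_throttled 3
+	throttled_time 15000000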
+
+Hierarchical considerations
+---------------------------
+The interface enforces that an individual entity's bandwidth is always
+attainable, that is: max(c_i) <= C. However, over-subscription in the
+aggregate case is explicitly allowed to enable work-conserving semantics
+within a hierarchy.
+  e.g. \Sum (c_i) may exceed C
+[ Where C is the parent's bandwidth, and c_i the bandwidth of its children ]
+
+
+There are two ways in which a group may become throttled:
+	a. it fully consumes its own quota within a period
+	b. a parent's quota is fully consumed within its period
+
+In case b) above, even though the child may have runtime remaining, it will
+not be allowed to run until the parent's runtime is refreshed.
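+
+For example, the following sketch (group names and values illustrative)
+over-subscribes a parent capped at 1 CPU with two children that may each also
+use up to 1 CPU; either child alone can consume the parent's full bandwidth,
+and once the parent's quota is exhausted both children are throttled as in
+case b) above:
+
+	# echo 250000 > parent/cpu.cfs_period_us        /* period = 250ms */
+	# echo 250000 > parent/cpu.cfs_quota_us         /* quota = 250ms, 1 CPU */
+	# echo 250000 > parent/child1/cpu.cfs_period_us
+	# echo 250000 > parent/child1/cpu.cfs_quota_us  /* c_1 = C */
+	# echo 250000 > parent/child2/cpu.cfs_period_us
+	# echo 250000 > parent/child2/cpu.cfs_quota_us  /* c_2 = C, so \Sum c_i > C */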
+
+Examples
+--------
+1. Limit a group to 1 CPU worth of runtime.
+
+	If period is 250ms and quota is also 250ms, the group will get
+	1 CPU worth of runtime every 250ms.
+
+	# echo 250000 > cpu.cfs_quota_us /* quota = 250ms */
+	# echo 250000 > cpu.cfs_period_us /* period = 250ms */
+
+2. Limit a group to 2 CPUs worth of runtime on a multi-CPU machine.
+
+	With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
+	runtime every 500ms.
+
+	# echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
+	# echo 500000 > cpu.cfs_period_us /* period = 500ms */
+
+	The larger period here allows for increased burst capacity.
+
+3. Limit a group to 20% of 1 CPU.
+
+	With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU.
+
+	# echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
+	# echo 50000 > cpu.cfs_period_us /* period = 50ms */
+
+	By using a small period here we are ensuring a consistent latency
+	response at the expense of burst capacity.
+


* Re: [patch 00/18] CFS Bandwidth Control v7.2
  2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
                   ` (19 preceding siblings ...)
  2011-07-25 14:58 ` Peter Zijlstra
@ 2011-09-13 12:10 ` Vladimir Davydov
  2011-09-13 14:00   ` Peter Zijlstra
  2011-09-16  8:06   ` Paul Turner
  20 siblings, 2 replies; 60+ messages in thread
From: Vladimir Davydov @ 2011-09-13 12:10 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelianov,
	Jason Baron

Hello, Paul

I have a question about CFS bandwidth control.

Let's consider a cgroup with several (>1) tasks running on a two CPU
host. Let the limit of the cgroup be 50% (e.g. period=1s, quota=0.5s).
How will tasks of the cgroup be distributed between the two CPUs? Will
they all run on one of the CPUs, or will one half of them run on one CPU
and others run on the other?

Although in both cases the tasks will consume not more than one half of
overall CPU time, the first case (all tasks of the cgroup run on the
same CPU) is obviously better if the tasks are likely to communicate
with each other (e.g. through pipe) which is often the case when cgroups
are used for container virtualization.

In other words, I'd like to know if your code (or the scheduler code)
tries to gather all tasks of the same cgroup on such a subset of all
CPUs so that the tasks can't execute less CPUs without losing quota
during each period. And if not, are you going to address the issue?

On Thu, 2011-07-21 at 20:43 +0400, Paul Turner wrote:
> Hi all,
> 
> Please find attached the incremental v7.2 for bandwidth control.
> 
> This release follows a fairly intensive period of scraping cycles across
> various configurations.  Unfortunately we seem to be currently taking an IPC
> hit for jump_labels (despite a savings in branches/instr. ret) which despite
> fairly extensive digging I don't have a good explanation for.  The emitted
> assembly /looks/ ok, but cycles/wall time is consistently higher across several
> platforms.
> 
> As such I've demoted the jumppatch to [RFT] while these details are worked
> out.  But there's no point in holding up the rest of the series any more.
> 
> [ Please find the specific discussion related to the above attached to patch 
> 17/18. ]
> 
> So -- without jump labels -- the current performance looks like:
> 
>                             instructions            cycles                  branches         
> ---------------------------------------------------------------------------------------------
> clovertown [!BWC]           843695716               965744453               151224759        
> +unconstrained              845934117 (+0.27)       974222228 (+0.88)       152715407 (+0.99)
> +10000000000/1000:          855102086 (+1.35)       978728348 (+1.34)       154495984 (+2.16)
> +10000000000/1000000:       853981660 (+1.22)       976344561 (+1.10)       154287243 (+2.03)
> 
> barcelona [!BWC]            810514902               761071312               145351489        
> +unconstrained              820573353 (+1.24)       748178486 (-1.69)       148161233 (+1.93)
> +10000000000/1000:          827963132 (+2.15)       757829815 (-0.43)       149611950 (+2.93)
> +10000000000/1000000:       827701516 (+2.12)       753575001 (-0.98)       149568284 (+2.90)
> 
> westmere [!BWC]             792513879               702882443               143267136        
> +unconstrained              802533191 (+1.26)       694415157 (-1.20)       146071233 (+1.96)
> +10000000000/1000:          809861594 (+2.19)       701781996 (-0.16)       147520953 (+2.97)
> +10000000000/1000000:       809752541 (+2.18)       705278419 (+0.34)       147502154 (+2.96)
> 
> Under the workload:
>   mkdir -p /cgroup/cpu/test
>   echo $$ > /dev/cgroup/cpu/test (only cpu,cpuacct mounted)
>   (W1) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "for ((i=0;i<5;i++)); do $(dirname $0)/pipe-test 20000; done"
> 
> This may seem a strange work-load but it works around some bizarro overheads
> currently introduced by perf.  Comparing for example with::w
>   (W2)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;true"
>   (W3)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;"
> 
> 
> We see: 
>  (W1)  westmere [!BWC]             792513879               702882443               143267136             0.197246943  
>  (W2)  westmere [!BWC]             912241728               772576786               165734252             0.214923134  
>  (W3)  westmere [!BWC]             904349725               882084726               162577399             0.748506065  
> 
> vs an 'ideal' total exec time of (approximately):
> $ time taskset -c 0 ./pipe-test 100000
>  real    0m0.198 user    0m0.007s ys     0m0.095s
> 
> The overhead in W2 is explained by that invoking pipe-test directly, one of
> the siblings is becoming the perf_ctx parent, invoking lots of pain every time
> we switch.  I do not have a reasonable explantion as to why (W1) is so much
> cheaper than (W2), I stumbled across it by accident when I was trying some
> combinations to reduce the <perf stat>-to-<perf stat> variance.
> 
> v7.2
> -----------
> - Build errors in !CGROUP_SCHED case fixed
> - !CONFIG_SMP now 'supported' (#ifdef munging)
> - gcc was failing to inline account_cfs_rq_runtime, affecting performance
> - checks in expire_cfs_rq_runtime() and check_enqueue_throttle() re-organized
>   to save branches.
> - jump labels introduced in the case BWC is not being used system-wide to
>   reduce inert overhead.
> - branch saved in expiring runtime (reorganize conditonals)
> 
> Hidetoshi, the following patchsets have changed enough to necessitate tweaking
> of your Reviewed-by:
> [patch 09/18] sched: add support for unthrottling group entities (extensive)
> [patch 11/18] sched: prevent interactions with throttled entities (update_cfs_shares)
> [patch 12/18] sched: prevent buddy interactions with throttled entities (new)
> 
> 
> Previous postings:
> -----------------
> v7.1: https://lkml.org/lkml/2011/7/7/24
> v7: http://lkml.org/lkml/2011/6/21/43
> v6: http://lkml.org/lkml/2011/5/7/37
> v5: http://lkml.org/lkml/2011/3 /22/477
> v4: http://lkml.org/lkml/2011/2/23/44
> v3: http://lkml.org/lkml/2010/10/12/44
> v2: http://lkml.org/lkml/2010/4/28/88
> Original posting: http://lkml.org/lkml/2010/2/12/393
> 
> Prior approaches: http://lkml.org/lkml/2010/1/5/44 ["CFS Hard limits v5"]
> 
> Thanks,
> 
> - Paul
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/



* Re: [patch 00/18] CFS Bandwidth Control v7.2
  2011-09-13 12:10 ` Vladimir Davydov
@ 2011-09-13 14:00   ` Peter Zijlstra
  2011-09-16  8:06   ` Paul Turner
  1 sibling, 0 replies; 60+ messages in thread
From: Peter Zijlstra @ 2011-09-13 14:00 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelianov,
	Jason Baron

On Tue, 2011-09-13 at 16:10 +0400, Vladimir Davydov wrote:

> In other words, I'd like to know if your code (or the scheduler code)
> tries to gather all tasks of the same cgroup on such a subset of all
> CPUs 

No

> so that the tasks can't execute less CPUs without losing quota
> during each period. 

what?!

> And if not, are you going to address the issue? 

and no.

There is nothing special about being part of a cgroup that warrants
that; as for pipes, the scheduler already tries to pull the waking task
to the cpu of the waker if possible (or an idle cache sibling).




* Re: [patch 00/18] CFS Bandwidth Control v7.2
  2011-09-13 12:10 ` Vladimir Davydov
  2011-09-13 14:00   ` Peter Zijlstra
@ 2011-09-16  8:06   ` Paul Turner
  2011-09-19  8:22     ` Vladimir Davydov
  1 sibling, 1 reply; 60+ messages in thread
From: Paul Turner @ 2011-09-16  8:06 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelianov,
	Jason Baron

Hi Vladimir,

I had a fairly good conversation with Pavel at LPC regarding these
questions; it's probably worth syncing up with him and then following up
if you still have questions.


On 09/13/11 05:10, Vladimir Davydov wrote:
> Hello, Paul
>
> I have a question about CFS bandwidth control.
>
> Let's consider a cgroup with several (>1) tasks running on a two CPU
> host. Let the limit of the cgroup be 50% (e.g. period=1s, quota=0.5s).
> How will tasks of the cgroup be distributed between the two CPUs? Will
> they all run on one of the CPUs, or will one half of them run on one CPU
> and others run on the other?
>

Parallelism is unconstrained until the bandwidth limit is reached, at
which point the whole group is throttled (effectively CONFIG_NR_CPUS=0)
until the next period refresh.

> Although in both cases the tasks will consume not more than one half of
> overall CPU time, the first case (all tasks of the cgroup run on the
> same CPU) is obviously better if the tasks are likely to communicate
> with each other (e.g. through pipe) which is often the case when cgroups
> are used for container virtualization.
>

This case is handled already by the affine wake-up path.

> In other words, I'd like to know if your code (or the scheduler code)
> tries to gather all tasks of the same cgroup on such a subset of all
> CPUs so that the tasks can't execute less CPUs without losing quota
> during each period. And if not, are you going to address the issue?
>

Parallelism != Bandwidth; no plans at this time.

Thanks!

- Paul


* Re: [patch 00/18] CFS Bandwidth Control v7.2
  2011-09-16  8:06   ` Paul Turner
@ 2011-09-19  8:22     ` Vladimir Davydov
  2011-09-19  8:33       ` Peter Zijlstra
  0 siblings, 1 reply; 60+ messages in thread
From: Vladimir Davydov @ 2011-09-19  8:22 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelianov,
	Jason Baron

On Sep 16, 2011, at 12:06 PM, Paul Turner wrote:

>> Although in both cases the tasks will consume not more than one half of
>> overall CPU time, the first case (all tasks of the cgroup run on the
>> same CPU) is obviously better if the tasks are likely to communicate
>> with each other (e.g. through pipe) which is often the case when cgroups
>> are used for container virtualization.
>> 
> 
> This case is handled already by the affine wake-up path.

But communicating tasks do not necessarily wake each other even if they exchange data through the pipe. And of course, if they use shared memory (e.g. threads), it is not obligatory at all. Also, the wake-affine path is cpu-load aware, i.e. it tries not to overload a cpu it is going to wake a task on. For instance, if we run a context switch test on an idle host, the two tasks will be executing on different cpus although it is better to execute them together on the same cpu.

What I want to say is that sometimes it can be beneficial to constrain the parallelism of a container. And if we are going to limit a container's cpu usage to, for example, one cpu, I guess it is better to make its tasks run on the same cpu instead of spreading them across the system because:

1) It can improve cpu cache utilization.

2) It can reduce the overhead of CFS bandwidth control: contention on the cgroup's quota pool will obviously be lower and the number of throttlings/unthrottlings will be diminished (in the example above, with the limit equal to one cpu, we can forget about it altogether).

3) It can improve the latency of a cgroup whose cpu usage is limited. Consider a cgroup with one interactive and several cpu-bound tasks. Let the limit of the cgroup be 1 cpu (the cgroup should not consume more cpu power than it would if it ran on a UP host). If the tasks run on all the cpus of the host (provided the host is SMP), the cpu-hogs will soon consume all the quota, and the cgroup will be throttled till the end of the period, when the quota is recharged. The interactive task will be throttled too, so the cgroup's latency falls dramatically. However, if all the tasks run on the same cpu, the cgroup is never throttled, and the interactive task easily preempts the cpu-hogs whenever it wants.
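
A rough way to observe the throttling side of this with the current interface
(a sketch only; the mount point, group name and values are illustrative) is to
cap a group at 1 CPU on an SMP host, start several cpu-hogs in it, and watch
nr_throttled grow in cpu.stat:

	# echo 100000 > /cgroup/cpu/test/cpu.cfs_period_us   # 100ms period
	# echo 100000 > /cgroup/cpu/test/cpu.cfs_quota_us    # 1 CPU worth of quota
	# echo $$ > /cgroup/cpu/test/tasks
	# for i in 1 2 3 4; do (while :; do :; done) & done   # cpu-bound children
	# grep nr_throttled /cgroup/cpu/test/cpu.stat         # grows while over limit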

> 
>> In other words, I'd like to know if your code (or the scheduler code)
>> tries to gather all tasks of the same cgroup on such a subset of all
>> CPUs so that the tasks can't execute less CPUs without losing quota
>> during each period. And if not, are you going to address the issue?
>> 
> 
> Parallelism != Bandwidth
> 


I agree.

Nevertheless, theoretically the former can be implemented on top of the latter. This is exactly what we've done in the latest OpenVZ kernel where limiting the number of cpus a container can run on to N is equivalent to setting its limit to N*max-per-cpu-limit.
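
In cfs terms that equivalence is roughly the following (a sketch only, values
illustrative): for a container that should get N CPUs worth of time,

	# period_us=100000                                   # 100ms period
	# n_cpus=2                                           # desired N
	# echo $period_us > cpu.cfs_period_us
	# echo $((n_cpus * period_us)) > cpu.cfs_quota_us    # quota = N * per-cpu limit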

> no plans at this time.


It's a pity :(



* Re: [patch 00/18] CFS Bandwidth Control v7.2
  2011-09-19  8:22     ` Vladimir Davydov
@ 2011-09-19  8:33       ` Peter Zijlstra
  0 siblings, 0 replies; 60+ messages in thread
From: Peter Zijlstra @ 2011-09-19  8:33 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
	Kamalesh Babulal, Hidetoshi Seto, Ingo Molnar, Pavel Emelianov,
	Jason Baron

On Mon, 2011-09-19 at 12:22 +0400, Vladimir Davydov wrote:
> But communicating tasks do not necessarily wake each other even if
> they exchange data through the pipe. And of course, if they use shared
> memory (e.g. threads), it is not obligatory at all. Also, the
> wake-affine path is cpu-load aware, i.e. it tries not to overload a
> cpu it is going to wake a task on. For instance, if we run a context
> switch test on an idle host, the two tasks will be executing on
> different cpus although it is better to execute them together on the
> same cpu.

This is not a problem specific to cgroups, and thus the solution
shouldn't ever live as something related to cgroups.





Thread overview: 60+ messages
2011-07-21 16:43 [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
2011-07-21 16:43 ` [patch 01/18] sched: (fixlet) dont update shares twice on on_rq parent Paul Turner
2011-07-22 11:06   ` Kamalesh Babulal
2011-07-21 16:43 ` [patch 02/18] sched: hierarchical task accounting for SCHED_OTHER Paul Turner
2011-08-14 16:15   ` [tip:sched/core] sched: Implement " tip-bot for Paul Turner
2011-07-21 16:43 ` [patch 03/18] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
2011-07-22 11:14   ` Kamalesh Babulal
2011-08-14 16:17   ` [tip:sched/core] sched: Introduce " tip-bot for Paul Turner
2011-07-21 16:43 ` [patch 04/18] sched: validate CFS quota hierarchies Paul Turner
2011-08-14 16:19   ` [tip:sched/core] sched: Validate " tip-bot for Paul Turner
2011-07-21 16:43 ` [patch 05/18] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth Paul Turner
2011-08-14 16:21   ` [tip:sched/core] sched: Accumulate " tip-bot for Paul Turner
2011-07-21 16:43 ` [patch 06/18] sched: add a timer to handle CFS bandwidth refresh Paul Turner
2011-08-14 16:23   ` [tip:sched/core] sched: Add " tip-bot for Paul Turner
2011-07-21 16:43 ` [patch 07/18] sched: expire invalid runtime Paul Turner
2011-08-14 16:24   ` [tip:sched/core] sched: Expire " tip-bot for Paul Turner
2011-07-21 16:43 ` [patch 08/18] sched: add support for throttling group entities Paul Turner
2011-08-08 15:46   ` Lin Ming
2011-08-08 16:00     ` Peter Zijlstra
2011-08-08 16:16       ` Paul Turner
2011-08-14 16:26   ` [tip:sched/core] sched: Add " tip-bot for Paul Turner
2011-07-21 16:43 ` [patch 09/18] sched: add support for unthrottling " Paul Turner
2011-08-14 16:27   ` [tip:sched/core] sched: Add " tip-bot for Paul Turner
2011-07-21 16:43 ` [patch 10/18] sched: allow for positional tg_tree walks Paul Turner
2011-08-14 16:29   ` [tip:sched/core] sched: Allow " tip-bot for Paul Turner
2011-07-21 16:43 ` [patch 11/18] sched: prevent interactions with throttled entities Paul Turner
2011-07-22 11:26   ` Kamalesh Babulal
2011-07-22 11:37     ` Peter Zijlstra
2011-07-22 11:41   ` Kamalesh Babulal
2011-07-22 11:43     ` Peter Zijlstra
2011-07-22 18:16       ` Kamalesh Babulal
2011-08-14 16:30   ` [tip:sched/core] sched: Prevent " tip-bot for Paul Turner
2011-07-21 16:43 ` [patch 12/18] sched: prevent buddy " Paul Turner
2011-08-14 16:32   ` [tip:sched/core] sched: Prevent " tip-bot for Paul Turner
2011-07-21 16:43 ` [patch 13/18] sched: migrate throttled tasks on HOTPLUG Paul Turner
2011-08-14 16:34   ` [tip:sched/core] sched: Migrate " tip-bot for Paul Turner
2011-07-21 16:43 ` [patch 14/18] sched: throttle entities exceeding their allowed bandwidth Paul Turner
2011-08-14 16:35   ` [tip:sched/core] sched: Throttle " tip-bot for Paul Turner
2011-07-21 16:43 ` [patch 15/18] sched: add exports tracking cfs bandwidth control statistics Paul Turner
2011-08-14 16:37   ` [tip:sched/core] sched: Add " tip-bot for Nikhil Rao
2011-07-21 16:43 ` [patch 16/18] sched: return unused runtime on group dequeue Paul Turner
2011-08-14 16:39   ` [tip:sched/core] sched: Return " tip-bot for Paul Turner
2011-07-21 16:43 ` [RFT][patch 17/18] sched: use jump labels to reduce overhead when bandwidth control is inactive Paul Turner
2011-07-21 16:43 ` [patch 18/18] sched: add documentation for bandwidth control Paul Turner
2011-08-14 16:41   ` [tip:sched/core] sched: Add " tip-bot for Bharata B Rao
2011-07-21 23:01 ` [patch 00/18] CFS Bandwidth Control v7.2 Paul Turner
2011-07-25 14:58 ` Peter Zijlstra
2011-07-25 15:00   ` Peter Zijlstra
2011-07-25 16:21     ` Paul E. McKenney
2011-07-25 16:28       ` Peter Zijlstra
2011-07-25 16:46         ` Paul E. McKenney
2011-07-25 17:08           ` Peter Zijlstra
2011-07-25 17:11             ` Dhaval Giani
2011-07-25 17:35               ` Peter Zijlstra
2011-07-28  2:59     ` Paul Turner
2011-09-13 12:10 ` Vladimir Davydov
2011-09-13 14:00   ` Peter Zijlstra
2011-09-16  8:06   ` Paul Turner
2011-09-19  8:22     ` Vladimir Davydov
2011-09-19  8:33       ` Peter Zijlstra
