* [PATCH 0/2 v4] sched: Rewrite per entity runnable load average tracking
@ 2014-07-17 23:26 Yuyang Du
  2014-07-17 23:26 ` [PATCH 1/2 v4] sched: Remove update_rq_runnable_avg Yuyang Du
From: Yuyang Du @ 2014-07-17 23:26 UTC
  To: mingo, peterz, linux-kernel
  Cc: pjt, bsegall, arjan.van.de.ven, len.brown, rafael.j.wysocki,
	alan.cox, mark.gross, fengguang.wu, Yuyang Du

Thanks to Morten, Ben, and Fengguang.

v4 changes:

- Insert memory barrier before writing cfs_rq->load_last_update_copy.
- Fix typos.

We carried out some performance tests (thanks to Fengguang and his LKP). The results are shown
below. The patchset (two patches) is on top of mainline v3.16-rc3. To make a fair and clear
comparison, the results are presented in two parts:

(1) v3.16-rc3 vs. PATCH 1/2 + 2/2
(2) PATCH 1/2 vs. PATCH 1/2 + 2/2

Overall, this rewrite gives better performance and reduces the net overhead of load average tracking.

--------------------------------------------------------------------------------------

host: lkp-snb01
model: Sandy Bridge-EP
memory: 32G

host: lkp-sb03
model: Sandy Bridge-EP
memory: 64G

host: lkp-nex04
model: Nehalem-EX
memory: 256G

host: xps2
model: Nehalem
memory: 4G

host: lkp-a0x
model: Atom
memory: 8G

Legend:
	[+-]XX% - change percent
	~XX%    - stddev percent

(1) v3.16-rc3         PATCH 1/2 + 2/2
---------------  -------------------------  
      0.03 ~ 0%      +0.0%       0.03 ~ 0%  snb-drag/fileio/600s-100%-1HDD-ext4-64G-1024f-seqwr-sync
     51.72 ~ 1%      +0.5%      51.99 ~ 1%  snb-drag/fileio/600s-100%-1HDD-xfs-64G-1024f-rndrd-sync
     53.24 ~ 0%      +0.9%      53.72 ~ 0%  snb-drag/fileio/600s-100%-1HDD-xfs-64G-1024f-rndrw-sync
      0.01 ~ 0%      +0.0%       0.01 ~ 0%  snb-drag/fileio/600s-100%-1HDD-xfs-64G-1024f-rndwr-sync
      3.27 ~ 0%      -0.1%       3.27 ~ 0%  snb-drag/fileio/600s-100%-1HDD-xfs-64G-1024f-seqrd-sync
      0.02 ~ 0%      +0.0%       0.02 ~ 0%  snb-drag/fileio/600s-100%-1HDD-xfs-64G-1024f-seqrewr-sync
      0.02 ~ 0%      +0.0%       0.02 ~ 0%  snb-drag/fileio/600s-100%-1HDD-xfs-64G-1024f-seqwr-sync
    108.31 ~ 1%      +0.7%     109.06 ~ 0%  TOTAL fileio.request_latency_95%_ms

---------------  -------------------------  
    155810 ~ 3%     +62.6%     253355 ~ 0%  lkp-snb01/hackbench/1600%-process-pipe
    146931 ~ 1%      +5.5%     154948 ~ 0%  lkp-snb01/hackbench/1600%-process-socket
    172780 ~ 1%     +23.0%     212579 ~ 2%  lkp-snb01/hackbench/1600%-threads-pipe
    152966 ~ 0%      +3.6%     158433 ~ 0%  lkp-snb01/hackbench/1600%-threads-socket
     95943 ~ 0%      +2.7%      98501 ~ 0%  lkp-snb01/hackbench/50%-process-pipe
     86759 ~ 0%     +79.4%     155606 ~ 0%  lkp-snb01/hackbench/50%-process-socket
     90232 ~ 0%      +3.3%      93205 ~ 0%  lkp-snb01/hackbench/50%-threads-pipe
     79416 ~ 0%     +85.6%     147379 ~ 0%  lkp-snb01/hackbench/50%-threads-socket
    980841 ~ 1%     +29.9%    1274010 ~ 0%  TOTAL hackbench.throughput

---------------  -------------------------  
  3.02e+08 ~ 5%      -2.5%  2.944e+08 ~ 3%  lkp-a06/qperf/600s
  3.02e+08 ~ 5%      -2.5%  2.944e+08 ~ 3%  TOTAL qperf.sctp.bw

---------------  -------------------------  
 6.578e+08 ~ 1%      +1.1%  6.651e+08 ~ 1%  lkp-a06/qperf/600s
 6.578e+08 ~ 1%      +1.1%  6.651e+08 ~ 1%  TOTAL qperf.tcp.bw

---------------  -------------------------  
 6.678e+08 ~ 0%      +0.7%  6.728e+08 ~ 0%  lkp-a06/qperf/600s
 6.678e+08 ~ 0%      +0.7%  6.728e+08 ~ 0%  TOTAL qperf.udp.recv_bw

---------------  -------------------------  
 6.721e+08 ~ 0%      +1.1%  6.797e+08 ~ 0%  lkp-a06/qperf/600s
 6.721e+08 ~ 0%      +1.1%  6.797e+08 ~ 0%  TOTAL qperf.udp.send_bw

---------------  -------------------------  
     55388 ~ 2%      -1.9%      54324 ~ 0%  lkp-a06/qperf/600s
     55388 ~ 2%      -1.9%      54324 ~ 0%  TOTAL qperf.sctp.latency

---------------  -------------------------  
     39988 ~ 1%      -1.0%      39581 ~ 0%  lkp-a06/qperf/600s
     39988 ~ 1%      -1.0%      39581 ~ 0%  TOTAL qperf.tcp.latency

---------------  -------------------------  
     33022 ~ 2%      -1.6%      32484 ~ 0%  lkp-a06/qperf/600s
     33022 ~ 2%      -1.6%      32484 ~ 0%  TOTAL qperf.udp.latency

---------------  -------------------------  
   1048360 ~ 0%      +0.0%    1048360 ~ 0%  lkp-a05/iperf/300s-udp
   1048360 ~ 0%      +0.0%    1048360 ~ 0%  TOTAL iperf.udp.bps

---------------  -------------------------  
 4.801e+09 ~ 2%      -2.4%  4.688e+09 ~ 0%  lkp-a05/iperf/300s-tcp
 4.801e+09 ~ 2%      -2.4%  4.688e+09 ~ 0%  TOTAL iperf.tcp.receiver.bps

---------------  -------------------------  
 4.801e+09 ~ 2%      -2.4%  4.688e+09 ~ 0%  lkp-a05/iperf/300s-tcp
 4.801e+09 ~ 2%      -2.4%  4.688e+09 ~ 0%  TOTAL iperf.tcp.sender.bps

---------------  -------------------------  
    140261 ~ 1%      +2.6%     143971 ~ 0%  lkp-sb03/nepim/300s-100%-udp
    126862 ~ 1%      +4.4%     132471 ~ 4%  lkp-sb03/nepim/300s-100%-udp6
    577494 ~ 3%      -2.7%     561810 ~ 2%  lkp-sb03/nepim/300s-25%-udp
    515120 ~ 2%      +3.3%     532350 ~ 2%  lkp-sb03/nepim/300s-25%-udp6
   1359739 ~ 3%      +0.8%    1370604 ~ 2%  TOTAL nepim.udp.avg.kbps_in

---------------  -------------------------  
    160888 ~ 2%      +3.2%     165964 ~ 2%  lkp-sb03/nepim/300s-100%-udp
    127159 ~ 1%      +4.4%     132798 ~ 4%  lkp-sb03/nepim/300s-100%-udp6
    653177 ~ 3%      -1.0%     646770 ~ 3%  lkp-sb03/nepim/300s-25%-udp
    515540 ~ 2%      +4.1%     536440 ~ 2%  lkp-sb03/nepim/300s-25%-udp6
   1456766 ~ 3%      +1.7%    1481974 ~ 3%  TOTAL nepim.udp.avg.kbps_out

---------------  -------------------------  
    680285 ~ 1%      +1.7%     691663 ~ 1%  lkp-sb03/nepim/300s-100%-tcp
    645357 ~ 1%      +1.2%     653140 ~ 1%  lkp-sb03/nepim/300s-100%-tcp6
   2850752 ~ 1%      +0.0%    2851577 ~ 0%  lkp-sb03/nepim/300s-25%-tcp
   2588447 ~ 1%      +0.2%    2593352 ~ 0%  lkp-sb03/nepim/300s-25%-tcp6
   6764842 ~ 1%      +0.4%    6789733 ~ 0%  TOTAL nepim.tcp.avg.kbps_in

---------------  -------------------------  
    680449 ~ 1%      +1.7%     691824 ~ 1%  lkp-sb03/nepim/300s-100%-tcp
    645502 ~ 1%      +1.2%     653247 ~ 1%  lkp-sb03/nepim/300s-100%-tcp6
   2850934 ~ 1%      +0.0%    2851776 ~ 0%  lkp-sb03/nepim/300s-25%-tcp
   2588647 ~ 1%      +0.2%    2593553 ~ 0%  lkp-sb03/nepim/300s-25%-tcp6
   6765533 ~ 1%      +0.4%    6790402 ~ 0%  TOTAL nepim.tcp.avg.kbps_out

---------------  -------------------------  
     45789 ~ 1%      +1.9%      46658 ~ 0%  lkp-sb03/nuttcp/300s
     45789 ~ 1%      +1.9%      46658 ~ 0%  TOTAL nuttcp.throughput_Mbps

---------------  -------------------------  
     47139 ~ 4%      +3.6%      48854 ~ 3%  lkp-sb03/thrulay/300s
     47139 ~ 4%      +3.6%      48854 ~ 3%  TOTAL thrulay.throughput

---------------  -------------------------  
      0.02 ~11%     -10.1%       0.02 ~12%  lkp-sb03/thrulay/300s
      0.02 ~11%     -10.1%       0.02 ~12%  TOTAL thrulay.jitter

---------------  -------------------------  
      0.10 ~ 5%      -3.3%       0.10 ~ 4%  lkp-sb03/thrulay/300s
      0.10 ~ 5%      -3.3%       0.10 ~ 4%  TOTAL thrulay.RTT

---------------  -------------------------  
  75644346 ~ 0%      +0.5%   76029397 ~ 0%  xps2/pigz/100%-128K
  77167258 ~ 0%      +0.5%   77522343 ~ 0%  xps2/pigz/100%-512K
 152811604 ~ 0%      +0.5%  153551740 ~ 0%  TOTAL pigz.throughput

---------------  -------------------------  
     12773 ~ 0%      -1.2%      12615 ~ 0%  lkp-nex04/ebizzy/200%-100x-10s
     12773 ~ 0%      -1.2%      12615 ~ 0%  TOTAL ebizzy.throughput

---------------  -------------------------  
      6.87 ~ 2%     -83.6%       1.12 ~ 3%  lkp-snb01/hackbench/50%-process-socket
      6.43 ~ 2%     -79.8%       1.30 ~ 1%  lkp-snb01/hackbench/50%-threads-socket
     13.30 ~ 2%     -81.8%       2.42 ~ 2%  TOTAL perf-profile.cpu-cycles._raw_spin_lock_irqsave.__wake_up_sync_key.sock_def_readable.unix_stream_sendmsg.sock_aio_write

---------------  -------------------------  
      0.90 ~42%     -77.3%       0.20 ~16%  lkp-snb01/hackbench/1600%-process-pipe
      0.90 ~42%     -77.3%       0.20 ~16%  TOTAL perf-profile.cpu-cycles.try_to_wake_up.default_wake_function.autoremove_wake_function.__wake_up_common.__wake_up_sync_key

---------------  -------------------------  
      1.76 ~ 2%     -83.7%       0.29 ~ 8%  lkp-snb01/hackbench/50%-process-socket
      1.08 ~ 1%     -71.8%       0.30 ~ 3%  lkp-snb01/hackbench/50%-threads-socket
      2.84 ~ 2%     -79.2%       0.59 ~ 5%  TOTAL perf-profile.cpu-cycles.__schedule.schedule.schedule_timeout.unix_stream_recvmsg.sock_aio_read

---------------  -------------------------  
      1.78 ~33%     -63.6%       0.65 ~28%  lkp-snb01/hackbench/1600%-process-pipe
      0.92 ~31%     -59.9%       0.37 ~30%  lkp-snb01/hackbench/1600%-threads-pipe
      1.55 ~10%    -100.0%       0.00 ~ 0%  lkp-snb01/hackbench/50%-process-socket
      1.84 ~ 5%     +14.9%       2.11 ~ 2%  lkp-snb01/hackbench/50%-threads-pipe
      1.43 ~ 9%     -79.7%       0.29 ~ 2%  lkp-snb01/hackbench/50%-threads-socket
      7.51 ~17%     -54.5%       3.42 ~10%  TOTAL perf-profile.cpu-cycles._raw_spin_lock.try_to_wake_up.default_wake_function.autoremove_wake_function.__wake_up_common

---------------  -------------------------  
      0.89 ~20%     -88.0%       0.11 ~19%  lkp-snb01/hackbench/1600%-process-pipe
      0.47 ~ 5%    +110.0%       0.98 ~13%  lkp-snb01/hackbench/50%-process-pipe
      1.35 ~14%     -19.7%       1.09 ~13%  TOTAL perf-profile.cpu-cycles.__schedule.schedule_user.sysret_careful.__write_nocancel

---------------  -------------------------  
      2.81 ~ 2%     +40.3%       3.94 ~ 5%  lkp-snb01/hackbench/50%-process-pipe
      1.37 ~ 7%     -82.5%       0.24 ~ 5%  lkp-snb01/hackbench/50%-process-socket
      2.84 ~ 1%     +42.8%       4.06 ~ 1%  lkp-snb01/hackbench/50%-threads-pipe
      1.56 ~ 3%     -75.2%       0.39 ~ 4%  lkp-snb01/hackbench/50%-threads-socket
      8.58 ~ 3%      +0.5%       8.63 ~ 3%  TOTAL perf-profile.cpu-cycles.idle_cpu.select_task_rq_fair.try_to_wake_up.default_wake_function.autoremove_wake_function

---------------  -------------------------  
      2.60 ~33%     -72.5%       0.72 ~16%  lkp-snb01/hackbench/1600%-process-pipe
      0.97 ~15%     -52.8%       0.46 ~17%  lkp-snb01/hackbench/1600%-threads-pipe
      2.85 ~ 1%     +26.9%       3.62 ~ 3%  lkp-snb01/hackbench/50%-process-pipe
      6.42 ~16%     -25.3%       4.80 ~ 6%  TOTAL perf-profile.cpu-cycles.__schedule.schedule.pipe_wait.pipe_read.new_sync_read

---------------  -------------------------  
      1.14 ~22%     -75.2%       0.28 ~16%  lkp-snb01/hackbench/1600%-process-pipe
      0.91 ~14%     -56.9%       0.39 ~16%  lkp-snb01/hackbench/1600%-threads-pipe
      0.88 ~ 2%     +36.5%       1.20 ~ 6%  lkp-snb01/hackbench/50%-process-pipe
      0.88 ~ 2%     +41.6%       1.25 ~ 2%  lkp-snb01/hackbench/50%-threads-pipe
      3.82 ~11%     -18.0%       3.13 ~ 6%  TOTAL perf-profile.cpu-cycles.select_task_rq_fair.try_to_wake_up.default_wake_function.autoremove_wake_function.__wake_up_common


(2) PATCH 1/2         PATCH 1/2 + 2/2 
---------------  -------------------------  
      6.73 ~ 2%     -83.3%       1.12 ~ 3%  lkp-snb01/hackbench/50%-process-socket
      6.63 ~ 0%     -80.4%       1.30 ~ 1%  lkp-snb01/hackbench/50%-threads-socket
     13.36 ~ 1%     -81.9%       2.42 ~ 2%  TOTAL perf-profile.cpu-cycles._raw_spin_lock_irqsave.__wake_up_sync_key.sock_def_readable.unix_stream_sendmsg.sock_aio_write

---------------  -------------------------  
      1.10 ~46%     -81.5%       0.20 ~16%  lkp-snb01/hackbench/1600%-process-pipe
      1.10 ~46%     -81.5%       0.20 ~16%  TOTAL perf-profile.cpu-cycles.try_to_wake_up.default_wake_function.autoremove_wake_function.__wake_up_common.__wake_up_sync_key

---------------  -------------------------  
      1.80 ~ 1%     -84.0%       0.29 ~ 8%  lkp-snb01/hackbench/50%-process-socket
      1.09 ~ 1%     -72.2%       0.30 ~ 3%  lkp-snb01/hackbench/50%-threads-socket
      2.89 ~ 1%     -79.6%       0.59 ~ 5%  TOTAL perf-profile.cpu-cycles.__schedule.schedule.schedule_timeout.unix_stream_recvmsg.sock_aio_read

---------------  -------------------------  
      1.29 ~29%     -49.7%       0.65 ~28%  lkp-snb01/hackbench/1600%-process-pipe
      0.83 ~47%     -55.8%       0.37 ~30%  lkp-snb01/hackbench/1600%-threads-pipe
      1.38 ~ 7%    -100.0%       0.00 ~ 0%  lkp-snb01/hackbench/50%-process-socket
      1.61 ~ 4%     -82.0%       0.29 ~ 2%  lkp-snb01/hackbench/50%-threads-socket
      5.11 ~18%     -74.5%       1.30 ~23%  TOTAL perf-profile.cpu-cycles._raw_spin_lock.try_to_wake_up.default_wake_function.autoremove_wake_function.__wake_up_common

---------------  -------------------------  
      0.83 ~14%     -87.1%       0.11 ~19%  lkp-snb01/hackbench/1600%-process-pipe
      0.50 ~ 3%     +97.3%       0.98 ~13%  lkp-snb01/hackbench/50%-process-pipe
      1.33 ~10%     -18.1%       1.09 ~13%  TOTAL perf-profile.cpu-cycles.__schedule.schedule_user.sysret_careful.__write_nocancel

---------------  -------------------------  
      1.19 ~21%     -52.1%       0.57 ~30%  lkp-snb01/hackbench/1600%-threads-pipe
      2.95 ~ 0%     +33.6%       3.94 ~ 5%  lkp-snb01/hackbench/50%-process-pipe
      1.52 ~ 6%     -84.2%       0.24 ~ 5%  lkp-snb01/hackbench/50%-process-socket
      2.98 ~ 1%     +36.4%       4.06 ~ 1%  lkp-snb01/hackbench/50%-threads-pipe
      1.50 ~ 3%     -74.2%       0.39 ~ 4%  lkp-snb01/hackbench/50%-threads-socket
     10.13 ~ 4%      -9.2%       9.20 ~ 5%  TOTAL perf-profile.cpu-cycles.idle_cpu.select_task_rq_fair.try_to_wake_up.default_wake_function.autoremove_wake_function

---------------  -------------------------  
      2.85 ~35%     -74.9%       0.72 ~16%  lkp-snb01/hackbench/1600%-process-pipe
      0.92 ~13%     -50.2%       0.46 ~17%  lkp-snb01/hackbench/1600%-threads-pipe
      2.92 ~ 1%     +23.9%       3.62 ~ 3%  lkp-snb01/hackbench/50%-process-pipe
      6.69 ~17%     -28.3%       4.80 ~ 6%  TOTAL perf-profile.cpu-cycles.__schedule.schedule.pipe_wait.pipe_read.new_sync_read

---------------  -------------------------  
    153533 ~ 2%     +65.0%     253355 ~ 0%  lkp-snb01/hackbench/1600%-process-pipe
    152059 ~ 0%      +1.9%     154948 ~ 0%  lkp-snb01/hackbench/1600%-process-socket
    174164 ~ 2%     +22.1%     212579 ~ 2%  lkp-snb01/hackbench/1600%-threads-pipe
    158193 ~ 0%      +0.2%     158433 ~ 0%  lkp-snb01/hackbench/1600%-threads-socket
     94656 ~ 0%      +4.1%      98501 ~ 0%  lkp-snb01/hackbench/50%-process-pipe
     87638 ~ 0%     +77.6%     155606 ~ 0%  lkp-snb01/hackbench/50%-process-socket
     89973 ~ 0%      +3.6%      93205 ~ 0%  lkp-snb01/hackbench/50%-threads-pipe
     80210 ~ 0%     +83.7%     147379 ~ 0%  lkp-snb01/hackbench/50%-threads-socket
    990430 ~ 1%     +28.6%    1274010 ~ 0%  TOTAL hackbench.throughput

---------------  -------------------------  
    702188 ~ 0%      -1.5%     691663 ~ 1%  lkp-sb03/nepim/300s-100%-tcp
    655502 ~ 0%      -0.4%     653140 ~ 1%  lkp-sb03/nepim/300s-100%-tcp6
   2860533 ~ 0%      -0.3%    2851577 ~ 0%  lkp-sb03/nepim/300s-25%-tcp
   2609335 ~ 0%      -0.6%    2593352 ~ 0%  lkp-sb03/nepim/300s-25%-tcp6
   6827559 ~ 0%      -0.6%    6789733 ~ 0%  TOTAL nepim.tcp.avg.kbps_in

---------------  -------------------------  
    702354 ~ 0%      -1.5%     691824 ~ 1%  lkp-sb03/nepim/300s-100%-tcp
    655502 ~ 0%      -0.3%     653247 ~ 1%  lkp-sb03/nepim/300s-100%-tcp6
   2860734 ~ 0%      -0.3%    2851776 ~ 0%  lkp-sb03/nepim/300s-25%-tcp
   2609536 ~ 0%      -0.6%    2593553 ~ 0%  lkp-sb03/nepim/300s-25%-tcp6
   6828128 ~ 0%      -0.6%    6790402 ~ 0%  TOTAL nepim.tcp.avg.kbps_out

---------------  -------------------------  
    140076 ~ 0%      +2.8%     143971 ~ 0%  lkp-sb03/nepim/300s-100%-udp
    126302 ~ 0%      +4.9%     132471 ~ 4%  lkp-sb03/nepim/300s-100%-udp6
    557984 ~ 0%      +0.7%     561810 ~ 2%  lkp-sb03/nepim/300s-25%-udp
    501648 ~ 1%      +6.1%     532350 ~ 2%  lkp-sb03/nepim/300s-25%-udp6
   1326011 ~ 0%      +3.4%    1370604 ~ 2%  TOTAL nepim.udp.avg.kbps_in

---------------  -------------------------  
    162279 ~ 1%      +2.3%     165964 ~ 2%  lkp-sb03/nepim/300s-100%-udp
    127240 ~ 1%      +4.4%     132798 ~ 4%  lkp-sb03/nepim/300s-100%-udp6
    649372 ~ 1%      -0.4%     646770 ~ 3%  lkp-sb03/nepim/300s-25%-udp
    502056 ~ 1%      +6.8%     536440 ~ 2%  lkp-sb03/nepim/300s-25%-udp6
   1440949 ~ 1%      +2.8%    1481974 ~ 3%  TOTAL nepim.udp.avg.kbps_out

---------------  -------------------------  
     49149 ~ 1%      -0.6%      48854 ~ 3%  lkp-sb03/thrulay/300s
     49149 ~ 1%      -0.6%      48854 ~ 3%  TOTAL thrulay.throughput

---------------  -------------------------  
      0.02 ~ 9%      +3.6%       0.02 ~12%  lkp-sb03/thrulay/300s
      0.02 ~ 9%      +3.6%       0.02 ~12%  TOTAL thrulay.jitter

---------------  -------------------------  
      0.10 ~ 1%      +2.1%       0.10 ~ 4%  lkp-sb03/thrulay/300s
      0.10 ~ 1%      +2.1%       0.10 ~ 4%  TOTAL thrulay.RTT

---------------  -------------------------  
 4.817e+09 ~ 1%      -2.7%  4.688e+09 ~ 0%  lkp-a05/iperf/300s-tcp
 4.817e+09 ~ 1%      -2.7%  4.688e+09 ~ 0%  TOTAL iperf.tcp.receiver.bps

---------------  -------------------------  
 4.817e+09 ~ 1%      -2.7%  4.688e+09 ~ 0%  lkp-a05/iperf/300s-tcp
 4.817e+09 ~ 1%      -2.7%  4.688e+09 ~ 0%  TOTAL iperf.tcp.sender.bps

---------------  -------------------------  
 3.036e+08 ~ 7%      -3.0%  2.944e+08 ~ 3%  lkp-a06/qperf/600s
 3.036e+08 ~ 7%      -3.0%  2.944e+08 ~ 3%  TOTAL qperf.sctp.bw

---------------  -------------------------  
 6.678e+08 ~ 0%      -0.4%  6.651e+08 ~ 1%  lkp-a06/qperf/600s
 6.678e+08 ~ 0%      -0.4%  6.651e+08 ~ 1%  TOTAL qperf.tcp.bw

---------------  -------------------------  
  6.73e+08 ~ 0%      -0.0%  6.728e+08 ~ 0%  lkp-a06/qperf/600s
  6.73e+08 ~ 0%      -0.0%  6.728e+08 ~ 0%  TOTAL qperf.udp.recv_bw

---------------  -------------------------  
 6.773e+08 ~ 0%      +0.4%  6.797e+08 ~ 0%  lkp-a06/qperf/600s
 6.773e+08 ~ 0%      +0.4%  6.797e+08 ~ 0%  TOTAL qperf.udp.send_bw

---------------  -------------------------  
     54508 ~ 2%      -0.3%      54324 ~ 0%  lkp-a06/qperf/600s
     54508 ~ 2%      -0.3%      54324 ~ 0%  TOTAL qperf.sctp.latency

---------------  -------------------------  
     39293 ~ 1%      +0.7%      39581 ~ 0%  lkp-a06/qperf/600s
     39293 ~ 1%      +0.7%      39581 ~ 0%  TOTAL qperf.tcp.latency

---------------  -------------------------  
     31924 ~ 0%      +1.8%      32484 ~ 0%  lkp-a06/qperf/600s
     31924 ~ 0%      +1.8%      32484 ~ 0%  TOTAL qperf.udp.latency

---------------  -------------------------  
   1048360 ~ 0%      +0.0%    1048360 ~ 0%  lkp-a05/iperf/300s-udp
   1048360 ~ 0%      +0.0%    1048360 ~ 0%  TOTAL iperf.udp.bps

---------------  -------------------------  
     45897 ~ 0%      +1.7%      46658 ~ 0%  lkp-sb03/nuttcp/300s
     45897 ~ 0%      +1.7%      46658 ~ 0%  TOTAL nuttcp.throughput_Mbps

---------------  -------------------------  
  75801537 ~ 0%      +0.3%   76029397 ~ 0%  xps2/pigz/100%-128K
  77314567 ~ 0%      +0.3%   77522343 ~ 0%  xps2/pigz/100%-512K
 153116104 ~ 0%      +0.3%  153551740 ~ 0%  TOTAL pigz.throughput

---------------  -------------------------  
     12763 ~ 0%      -1.2%      12615 ~ 0%  lkp-nex04/ebizzy/200%-100x-10s
     12763 ~ 0%      -1.2%      12615 ~ 0%  TOTAL ebizzy.throughput

--------------------------------------------------------------------------------------

Regarding the overflow issue, we now have for both entity and cfs_rq:

struct sched_avg {
    .....
    u64 load_sum;
    unsigned long load_avg;
    .....
};

Given the weight for both entity and cfs_rq is:

struct load_weight {
    unsigned long weight;
    .....
};

load_sum's max is therefore 47742 * load.weight (which is unsigned long), so on 32-bit
it is absolutely safe. On 64-bit, with unsigned long being 64-bit, we can afford about
4353082796 (= 2^64/47742/88761) entities with the highest weight (= 88761) always
runnable; even considering that decay_load() may multiply by up to 1<<15 for large
values, we can still support 132845 (= 4353082796/2^15) always-runnable entities,
which should be acceptable.

load_avg = load_sum / 47742 = load.weight (which is unsigned long), so it should be
perfectly safe for both entity (even with arbitrary user group shares) and cfs_rq, on
both 32-bit and 64-bit. Originally we saved this division, but we have to bring it back
because of the overflow issue on 32-bit (the load average itself is safe from overflow,
but the rest of the code referencing it consistently uses long, such as cpu_load, etc.,
which prevents us from omitting the division).
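
To make the headroom arithmetic above easy to check, here is a small standalone
user-space sketch (not kernel code; it only relies on the constants already quoted
above: the maximum decayed sum 47742 and the highest task weight 88761, i.e. nice -20):

#include <stdio.h>
#include <stdint.h>

/* LOAD_AVG_MAX from the scheduler and the load.weight of a nice -20 task. */
#define LOAD_AVG_MAX	47742ULL
#define MAX_NICE_WEIGHT	88761ULL

int main(void)
{
	/* Worst-case load_sum of one always-runnable, highest-weight entity. */
	uint64_t entity_max_sum = LOAD_AVG_MAX * MAX_NICE_WEIGHT;

	/* How many such entities a 64-bit cfs_rq load_sum can hold. */
	uint64_t max_entities = UINT64_MAX / entity_max_sum;

	/* Headroom left when decay_load() scales large values by 1<<15. */
	uint64_t with_decay_scaling = max_entities >> 15;

	printf("per-entity max load_sum : %llu\n",
	       (unsigned long long)entity_max_sum);
	printf("entities before overflow: %llu\n",	/* ~4353082796 */
	       (unsigned long long)max_entities);
	printf("with 1<<15 scaling      : %llu\n",	/* ~132845 */
	       (unsigned long long)with_decay_scaling);
	return 0;
}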

v3 changes:

Many thanks to Ben for the v3 revision.

- Fix overflow issue both for entity and cfs_rq on both 32bit and 64bit.
- Track all entities (both task and group entity) due to group entity's clock issue.
  This actually improves code simplicity.
- Make a copy of cfs_rq sched_avg's last_update_time, so that an intact 64-bit
  value can be read on a 32-bit machine despite concurrent updates (hope I did it right).
- Minor fixes and code improvement.

v2 changes:

Thanks to PeterZ and Ben for their help in fixing the issues and improving the
quality, and to Fengguang and his 0Day for finding compile errors in different
configurations for version 2.

- Batch update the tg->load_avg, making sure it is up-to-date before update_cfs_shares
- Remove migrating task from the old CPU/cfs_rq, and do so with atomic operations


Yuyang Du (2):
  sched: Remove update_rq_runnable_avg
  sched: Rewrite per entity runnable load average tracking

 include/linux/sched.h |   21 +-
 kernel/sched/debug.c  |   30 +--
 kernel/sched/fair.c   |  566 ++++++++++++++++---------------------------------
 kernel/sched/proc.c   |    2 +-
 kernel/sched/sched.h  |   22 +-
 5 files changed, 207 insertions(+), 434 deletions(-)

-- 
1.7.9.5



* [PATCH 1/2 v4] sched: Remove update_rq_runnable_avg
  2014-07-17 23:26 [PATCH 0/2 v4] sched: Rewrite per entity runnable load average tracking Yuyang Du
@ 2014-07-17 23:26 ` Yuyang Du
  2014-07-17 23:26 ` [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking Yuyang Du
From: Yuyang Du @ 2014-07-17 23:26 UTC
  To: mingo, peterz, linux-kernel
  Cc: pjt, bsegall, arjan.van.de.ven, len.brown, rafael.j.wysocki,
	alan.cox, mark.gross, fengguang.wu, Yuyang Du

The current rq->avg is not used anywhere, and the code is in the fair
scheduler's critical path, so remove it.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/debug.c |    8 --------
 kernel/sched/fair.c  |   24 ++++--------------------
 kernel/sched/sched.h |    2 --
 3 files changed, 4 insertions(+), 30 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 695f977..4b864c7 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -68,14 +68,6 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 #define PN(F) \
 	SEQ_printf(m, "  .%-30s: %lld.%06ld\n", #F, SPLIT_NS((long long)F))
 
-	if (!se) {
-		struct sched_avg *avg = &cpu_rq(cpu)->avg;
-		P(avg->runnable_avg_sum);
-		P(avg->runnable_avg_period);
-		return;
-	}
-
-
 	PN(se->exec_start);
 	PN(se->vruntime);
 	PN(se->sum_exec_runtime);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fea7d33..1a2d04f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2430,18 +2430,12 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
 	}
 }
 
-static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
-{
-	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable);
-	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
-}
 #else /* CONFIG_FAIR_GROUP_SCHED */
 static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
 						 int force_update) {}
 static inline void __update_tg_runnable_avg(struct sched_avg *sa,
 						  struct cfs_rq *cfs_rq) {}
 static inline void __update_group_entity_contrib(struct sched_entity *se) {}
-static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 static inline void __update_task_entity_contrib(struct sched_entity *se)
@@ -2614,7 +2608,6 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
  */
 void idle_enter_fair(struct rq *this_rq)
 {
-	update_rq_runnable_avg(this_rq, 1);
 }
 
 /*
@@ -2624,7 +2617,6 @@ void idle_enter_fair(struct rq *this_rq)
  */
 void idle_exit_fair(struct rq *this_rq)
 {
-	update_rq_runnable_avg(this_rq, 0);
 }
 
 static int idle_balance(struct rq *this_rq);
@@ -2633,7 +2625,6 @@ static int idle_balance(struct rq *this_rq);
 
 static inline void update_entity_load_avg(struct sched_entity *se,
 					  int update_cfs_rq) {}
-static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
 static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 					   struct sched_entity *se,
 					   int wakeup) {}
@@ -3936,10 +3927,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_entity_load_avg(se, 1);
 	}
 
-	if (!se) {
-		update_rq_runnable_avg(rq, rq->nr_running);
+	if (!se)
 		add_nr_running(rq, 1);
-	}
+
 	hrtick_update(rq);
 }
 
@@ -3997,10 +3987,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_entity_load_avg(se, 1);
 	}
 
-	if (!se) {
+	if (!se)
 		sub_nr_running(rq, 1);
-		update_rq_runnable_avg(rq, 1);
-	}
+
 	hrtick_update(rq);
 }
 
@@ -5437,9 +5426,6 @@ static void __update_blocked_averages_cpu(struct task_group *tg, int cpu)
 		 */
 		if (!se->avg.runnable_avg_sum && !cfs_rq->nr_running)
 			list_del_leaf_cfs_rq(cfs_rq);
-	} else {
-		struct rq *rq = rq_of(cfs_rq);
-		update_rq_runnable_avg(rq, rq->nr_running);
 	}
 }
 
@@ -7352,8 +7338,6 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	if (numabalancing_enabled)
 		task_tick_numa(rq, curr);
-
-	update_rq_runnable_avg(rq, 1);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 31cc02e..a147571 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -542,8 +542,6 @@ struct rq {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this cpu: */
 	struct list_head leaf_cfs_rq_list;
-
-	struct sched_avg avg;
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 	/*
-- 
1.7.9.5



* [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-17 23:26 [PATCH 0/2 v4] sched: Rewrite per entity runnable load average tracking Yuyang Du
  2014-07-17 23:26 ` [PATCH 1/2 v4] sched: Remove update_rq_runnable_avg Yuyang Du
@ 2014-07-17 23:26 ` Yuyang Du
  2014-07-18  9:43   ` Vincent Guittot
  2014-07-18 15:39 ` [PATCH 0/2 " Morten Rasmussen
  2014-07-20  5:46 ` Mike Galbraith
From: Yuyang Du @ 2014-07-17 23:26 UTC
  To: mingo, peterz, linux-kernel
  Cc: pjt, bsegall, arjan.van.de.ven, len.brown, rafael.j.wysocki,
	alan.cox, mark.gross, fengguang.wu, Yuyang Du

The idea of per entity runnable load average (letting runnable time contribute to load
weight) was proposed by Paul Turner, and this rewrite still follows it. The rewrite is
done for the following reasons:

1. cfs_rq's load average (namely runnable_load_avg and blocked_load_avg) is updated
   at the granularity of one entity at a time, which results in the cfs_rq load
   average being only partially updated, or asynchronous across its entities: at any
   time, only one entity is up to date and contributes to the cfs_rq; all other
   entities are effectively lagging behind.

2. cfs_rq load average differs between the top rq->cfs_rq and the other task_groups'
   per CPU cfs_rqs in whether or not blocked_load_avg contributes to the load.

3. How task_group's load is calculated is complex.

This rewrite tackles these by:

1. Combine the runnable and blocked load averages for cfs_rq, and track cfs_rq's load
   average as a whole and use it as such.

2. Track the task entity load average so it can be carried between CPUs on migration;
   group cfs_rq and group entity load averages are tracked for update_cfs_shares and
   task_h_load calculation. task_group's load_avg is aggregated from its per CPU
   cfs_rqs' load_avg, which in turn is aggregated from its sched_entities (both task
   and group entities). A group entity's weight is proportional to its own cfs_rq's
   load_avg / task_group's load_avg.

3. All tasks, cfs_rqs/group entities, and task_groups have a simple, consistent,
   up-to-date, and synchronized load_avg.

This rewrite is in principle equivalent to the previous code in functionality, but it
significantly reduces code complexity and hence increases efficiency and clarity. In
addition, the new load_avg is much smoother/more continuous (no abrupt jumps up and
down) and is decayed/updated more quickly and synchronously to reflect the load
dynamics. As a result, we have less load tracking overhead and better performance.
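
For illustration only, here is a minimal user-space model of the scheme described
above: every 1024us period the old sum is decayed by y (y^32 == 0.5) and the runnable
weight is added; load_avg is the sum normalized by LOAD_AVG_MAX, so an always-runnable
entity converges to its load.weight. It deliberately skips the sub-period
(period_contrib) accounting and the fixed-point tables that __update_load_avg() and
decay_load() use; all names in the sketch are hypothetical.

#include <stdio.h>
#include <math.h>

#define LOAD_AVG_MAX	47742.0		/* ~ 1024 / (1 - y), truncated in the kernel */

struct toy_sched_avg {
	double load_sum;
	double load_avg;
};

/* Advance the average by 'periods' whole 1024us periods. */
static void toy_update(struct toy_sched_avg *sa, unsigned long weight,
		       int runnable, unsigned int periods)
{
	const double y = pow(0.5, 1.0 / 32.0);	/* so that y^32 == 0.5 */

	while (periods--) {
		sa->load_sum *= y;			/* decay the history */
		if (runnable)
			sa->load_sum += weight * 1024.0; /* this period's contribution */
	}
	sa->load_avg = sa->load_sum / LOAD_AVG_MAX;
}

/* Group entity weight is proportional to its cfs_rq's share of the tg load. */
static unsigned long toy_group_shares(unsigned long tg_shares,
				      unsigned long cfs_rq_load_avg,
				      unsigned long tg_load_avg)
{
	return tg_load_avg ? tg_shares * cfs_rq_load_avg / tg_load_avg : tg_shares;
}

int main(void)
{
	struct toy_sched_avg sa = { 0.0, 0.0 };

	toy_update(&sa, 1024, 1, 100);	/* ~100ms always runnable at nice-0 weight */
	printf("after 100 runnable periods: load_avg ~ %.0f\n", sa.load_avg);

	toy_update(&sa, 1024, 0, 32);	/* blocked for one half-life (~32ms) */
	printf("after 32 blocked periods  : load_avg ~ %.0f\n", sa.load_avg);

	printf("group shares for 1/4 of tg load: %lu\n",
	       toy_group_shares(1024, 256, 1024));
	return 0;
}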

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 include/linux/sched.h |   21 +-
 kernel/sched/debug.c  |   22 +-
 kernel/sched/fair.c   |  542 ++++++++++++++++---------------------------------
 kernel/sched/proc.c   |    2 +-
 kernel/sched/sched.h  |   20 +-
 5 files changed, 203 insertions(+), 404 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 306f4f0..c981f26 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1067,16 +1067,21 @@ struct load_weight {
 	u32 inv_weight;
 };
 
+/*
+ * The load_avg represents an infinite geometric series. The 64 bit
+ * load_sum can:
+ * 1) for cfs_rq, afford 4353082796 (=2^64/47742/88761) entities with
+ * the highest weight (=88761) always runnable, we should not overflow
+ * 2) for entity, support any load.weight always runnable
+ */
 struct sched_avg {
 	/*
-	 * These sums represent an infinite geometric series and so are bound
-	 * above by 1024/(1-y).  Thus we only need a u32 to store them for all
-	 * choices of y < 1-2^(-32)*1024.
+	 * The load_avg represents an infinite geometric series.
 	 */
-	u32 runnable_avg_sum, runnable_avg_period;
-	u64 last_runnable_update;
-	s64 decay_count;
-	unsigned long load_avg_contrib;
+	u64 last_update_time;
+	u64 load_sum;
+	unsigned long load_avg;
+	u32 period_contrib;
 };
 
 #ifdef CONFIG_SCHEDSTATS
@@ -1142,7 +1147,7 @@ struct sched_entity {
 #endif
 
 #ifdef CONFIG_SMP
-	/* Per-entity load-tracking */
+	/* Per entity load average tracking */
 	struct sched_avg	avg;
 #endif
 };
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 4b864c7..34a3f26 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -85,10 +85,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 #endif
 	P(se->load.weight);
 #ifdef CONFIG_SMP
-	P(se->avg.runnable_avg_sum);
-	P(se->avg.runnable_avg_period);
-	P(se->avg.load_avg_contrib);
-	P(se->avg.decay_count);
+	P(se->my_q->avg.load_avg);
 #endif
 #undef PN
 #undef P
@@ -205,19 +202,11 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	SEQ_printf(m, "  .%-30s: %d\n", "nr_running", cfs_rq->nr_running);
 	SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
 #ifdef CONFIG_SMP
-	SEQ_printf(m, "  .%-30s: %ld\n", "runnable_load_avg",
-			cfs_rq->runnable_load_avg);
-	SEQ_printf(m, "  .%-30s: %ld\n", "blocked_load_avg",
-			cfs_rq->blocked_load_avg);
+	SEQ_printf(m, "  .%-30s: %lu\n", "load_avg",
+			cfs_rq->avg.load_avg);
 #ifdef CONFIG_FAIR_GROUP_SCHED
-	SEQ_printf(m, "  .%-30s: %ld\n", "tg_load_contrib",
-			cfs_rq->tg_load_contrib);
-	SEQ_printf(m, "  .%-30s: %d\n", "tg_runnable_contrib",
-			cfs_rq->tg_runnable_contrib);
 	SEQ_printf(m, "  .%-30s: %ld\n", "tg_load_avg",
 			atomic_long_read(&cfs_rq->tg->load_avg));
-	SEQ_printf(m, "  .%-30s: %d\n", "tg->runnable_avg",
-			atomic_read(&cfs_rq->tg->runnable_avg));
 #endif
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
@@ -624,10 +613,7 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 
 	P(se.load.weight);
 #ifdef CONFIG_SMP
-	P(se.avg.runnable_avg_sum);
-	P(se.avg.runnable_avg_period);
-	P(se.avg.load_avg_contrib);
-	P(se.avg.decay_count);
+	P(se.avg.load_avg);
 #endif
 	P(policy);
 	P(prio);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1a2d04f..3055b9b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -282,9 +282,6 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
 	return grp->my_q;
 }
 
-static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq,
-				       int force_update);
-
 static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	if (!cfs_rq->on_list) {
@@ -304,8 +301,6 @@ static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 		}
 
 		cfs_rq->on_list = 1;
-		/* We should have no load, but we need to update last_decay. */
-		update_cfs_rq_blocked_load(cfs_rq, 0);
 	}
 }
 
@@ -665,20 +660,27 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 }
 
 #ifdef CONFIG_SMP
-static unsigned long task_h_load(struct task_struct *p);
 
-static inline void __update_task_entity_contrib(struct sched_entity *se);
+/* dependent on LOAD_AVG_PERIOD, see below */
+#define LOAD_AVG_MAX 47742 /* maximum possible load avg */
+
+static unsigned long task_h_load(struct task_struct *p);
 
 /* Give new task start runnable values to heavy its load in infant time */
 void init_task_runnable_average(struct task_struct *p)
 {
-	u32 slice;
+	struct sched_avg *sa = &p->se.avg;
 
-	p->se.avg.decay_count = 0;
-	slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
-	p->se.avg.runnable_avg_sum = slice;
-	p->se.avg.runnable_avg_period = slice;
-	__update_task_entity_contrib(&p->se);
+	sa->last_update_time = 0;
+	/*
+	 * sched_avg's period_contrib should be strictly less than 1024, so
+	 * we give it 1023 to make sure it is almost a full period (1024us), and
+	 * will definitely be updated (after enqueue).
+	 */
+	sa->period_contrib = 1023;
+	sa->load_avg = p->se.load.weight;
+	sa->load_sum = p->se.load.weight * LOAD_AVG_MAX;
+	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
 }
 #else
 void init_task_runnable_average(struct task_struct *p)
@@ -1504,8 +1506,8 @@ static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period)
 		delta = runtime - p->last_sum_exec_runtime;
 		*period = now - p->last_task_numa_placement;
 	} else {
-		delta = p->se.avg.runnable_avg_sum;
-		*period = p->se.avg.runnable_avg_period;
+		delta = p->se.avg.load_avg / p->se.load.weight;
+		*period = LOAD_AVG_MAX;
 	}
 
 	p->last_sum_exec_runtime = runtime;
@@ -2071,13 +2073,9 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
 	long tg_weight;
 
 	/*
-	 * Use this CPU's actual weight instead of the last load_contribution
-	 * to gain a more accurate current total weight. See
-	 * update_cfs_rq_load_contribution().
+	 * Use this CPU's load average instead of actual weight
 	 */
 	tg_weight = atomic_long_read(&tg->load_avg);
-	tg_weight -= cfs_rq->tg_load_contrib;
-	tg_weight += cfs_rq->load.weight;
 
 	return tg_weight;
 }
@@ -2087,7 +2085,7 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
 	long tg_weight, load, shares;
 
 	tg_weight = calc_tg_weight(tg, cfs_rq);
-	load = cfs_rq->load.weight;
+	load = cfs_rq->avg.load_avg;
 
 	shares = (tg->shares * load);
 	if (tg_weight)
@@ -2154,7 +2152,6 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
  * Note: The tables below are dependent on this value.
  */
 #define LOAD_AVG_PERIOD 32
-#define LOAD_AVG_MAX 47742 /* maximum possible load avg */
 #define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_MAX_AVG */
 
 /* Precomputed fixed inverse multiplies for multiplication by y^n */
@@ -2181,7 +2178,7 @@ static const u32 runnable_avg_yN_sum[] = {
  * Approximate:
  *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
  */
-static __always_inline u64 decay_load(u64 val, u64 n)
+static __always_inline u64 decay_load32(u64 val, u64 n)
 {
 	unsigned int local_n;
 
@@ -2210,6 +2207,18 @@ static __always_inline u64 decay_load(u64 val, u64 n)
 	return val >> 32;
 }
 
+static __always_inline u64 decay_load(u64 val, u64 n)
+{
+	if (likely(val <= UINT_MAX))
+		val = decay_load32(val, n);
+	else {
+		val *= (u32)decay_load32(1 << 15, n);
+		val >>= 15;
+	}
+
+	return val;
+}
+
 /*
  * For updates fully spanning n periods, the contribution to runnable
  * average will be: \Sum 1024*y^n
@@ -2234,7 +2243,7 @@ static u32 __compute_runnable_contrib(u64 n)
 		n -= LOAD_AVG_PERIOD;
 	} while (n > LOAD_AVG_PERIOD);
 
-	contrib = decay_load(contrib, n);
+	contrib = decay_load32(contrib, n);
 	return contrib + runnable_avg_yN_sum[n];
 }
 
@@ -2266,21 +2275,20 @@ static u32 __compute_runnable_contrib(u64 n)
  *   load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
  *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
  */
-static __always_inline int __update_entity_runnable_avg(u64 now,
-							struct sched_avg *sa,
-							int runnable)
+static __always_inline int
+__update_load_avg(u64 now, struct sched_avg *sa, unsigned long w)
 {
 	u64 delta, periods;
-	u32 runnable_contrib;
+	u32 contrib;
 	int delta_w, decayed = 0;
 
-	delta = now - sa->last_runnable_update;
+	delta = now - sa->last_update_time;
 	/*
 	 * This should only happen when time goes backwards, which it
 	 * unfortunately does during sched clock init when we swap over to TSC.
 	 */
 	if ((s64)delta < 0) {
-		sa->last_runnable_update = now;
+		sa->last_update_time = now;
 		return 0;
 	}
 
@@ -2291,23 +2299,24 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 	delta >>= 10;
 	if (!delta)
 		return 0;
-	sa->last_runnable_update = now;
+	sa->last_update_time = now;
 
 	/* delta_w is the amount already accumulated against our next period */
-	delta_w = sa->runnable_avg_period % 1024;
+	delta_w = sa->period_contrib;
 	if (delta + delta_w >= 1024) {
-		/* period roll-over */
 		decayed = 1;
 
+		/* how much left for next period will start over, we don't know yet */
+		sa->period_contrib = 0;
+
 		/*
 		 * Now that we know we're crossing a period boundary, figure
 		 * out how much from delta we need to complete the current
 		 * period and accrue it.
 		 */
 		delta_w = 1024 - delta_w;
-		if (runnable)
-			sa->runnable_avg_sum += delta_w;
-		sa->runnable_avg_period += delta_w;
+		if (w)
+			sa->load_sum += w * delta_w;
 
 		delta -= delta_w;
 
@@ -2315,290 +2324,120 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 		periods = delta / 1024;
 		delta %= 1024;
 
-		sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum,
-						  periods + 1);
-		sa->runnable_avg_period = decay_load(sa->runnable_avg_period,
-						     periods + 1);
+		sa->load_sum = decay_load(sa->load_sum, periods + 1);
 
 		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
-		runnable_contrib = __compute_runnable_contrib(periods);
-		if (runnable)
-			sa->runnable_avg_sum += runnable_contrib;
-		sa->runnable_avg_period += runnable_contrib;
+		contrib = __compute_runnable_contrib(periods);
+		if (w)
+			sa->load_sum += w * contrib;
 	}
 
 	/* Remainder of delta accrued against u_0` */
-	if (runnable)
-		sa->runnable_avg_sum += delta;
-	sa->runnable_avg_period += delta;
+	if (w)
+		sa->load_sum += w * delta;
 
-	return decayed;
-}
+	sa->period_contrib += delta;
 
-/* Synchronize an entity's decay with its parenting cfs_rq.*/
-static inline u64 __synchronize_entity_decay(struct sched_entity *se)
-{
-	struct cfs_rq *cfs_rq = cfs_rq_of(se);
-	u64 decays = atomic64_read(&cfs_rq->decay_counter);
+	if (decayed)
+		sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX);
 
-	decays -= se->avg.decay_count;
-	if (!decays)
-		return 0;
-
-	se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays);
-	se->avg.decay_count = 0;
-
-	return decays;
+	return decayed;
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
-						 int force_update)
-{
-	struct task_group *tg = cfs_rq->tg;
-	long tg_contrib;
-
-	tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
-	tg_contrib -= cfs_rq->tg_load_contrib;
-
-	if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
-		atomic_long_add(tg_contrib, &tg->load_avg);
-		cfs_rq->tg_load_contrib += tg_contrib;
-	}
-}
-
 /*
- * Aggregate cfs_rq runnable averages into an equivalent task_group
- * representation for computing load contributions.
+ * Updating tg's load_avg is only necessary before it is used in
+ * update_cfs_share (which is done) and effective_load (which is
+ * not done because it is too costly).
  */
-static inline void __update_tg_runnable_avg(struct sched_avg *sa,
-						  struct cfs_rq *cfs_rq)
-{
-	struct task_group *tg = cfs_rq->tg;
-	long contrib;
-
-	/* The fraction of a cpu used by this cfs_rq */
-	contrib = div_u64((u64)sa->runnable_avg_sum << NICE_0_SHIFT,
-			  sa->runnable_avg_period + 1);
-	contrib -= cfs_rq->tg_runnable_contrib;
-
-	if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) {
-		atomic_add(contrib, &tg->runnable_avg);
-		cfs_rq->tg_runnable_contrib += contrib;
-	}
-}
-
-static inline void __update_group_entity_contrib(struct sched_entity *se)
+static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 {
-	struct cfs_rq *cfs_rq = group_cfs_rq(se);
-	struct task_group *tg = cfs_rq->tg;
-	int runnable_avg;
-
-	u64 contrib;
-
-	contrib = cfs_rq->tg_load_contrib * tg->shares;
-	se->avg.load_avg_contrib = div_u64(contrib,
-				     atomic_long_read(&tg->load_avg) + 1);
+	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
 
-	/*
-	 * For group entities we need to compute a correction term in the case
-	 * that they are consuming <1 cpu so that we would contribute the same
-	 * load as a task of equal weight.
-	 *
-	 * Explicitly co-ordinating this measurement would be expensive, but
-	 * fortunately the sum of each cpus contribution forms a usable
-	 * lower-bound on the true value.
-	 *
-	 * Consider the aggregate of 2 contributions.  Either they are disjoint
-	 * (and the sum represents true value) or they are disjoint and we are
-	 * understating by the aggregate of their overlap.
-	 *
-	 * Extending this to N cpus, for a given overlap, the maximum amount we
-	 * understand is then n_i(n_i+1)/2 * w_i where n_i is the number of
-	 * cpus that overlap for this interval and w_i is the interval width.
-	 *
-	 * On a small machine; the first term is well-bounded which bounds the
-	 * total error since w_i is a subset of the period.  Whereas on a
-	 * larger machine, while this first term can be larger, if w_i is the
-	 * of consequential size guaranteed to see n_i*w_i quickly converge to
-	 * our upper bound of 1-cpu.
-	 */
-	runnable_avg = atomic_read(&tg->runnable_avg);
-	if (runnable_avg < NICE_0_LOAD) {
-		se->avg.load_avg_contrib *= runnable_avg;
-		se->avg.load_avg_contrib >>= NICE_0_SHIFT;
+	if (delta) {
+		atomic_long_add(delta, &cfs_rq->tg->load_avg);
+		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
 	}
 }
 
 #else /* CONFIG_FAIR_GROUP_SCHED */
-static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
-						 int force_update) {}
-static inline void __update_tg_runnable_avg(struct sched_avg *sa,
-						  struct cfs_rq *cfs_rq) {}
-static inline void __update_group_entity_contrib(struct sched_entity *se) {}
+static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {}
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-static inline void __update_task_entity_contrib(struct sched_entity *se)
-{
-	u32 contrib;
+static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
 
-	/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
-	contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
-	contrib /= (se->avg.runnable_avg_period + 1);
-	se->avg.load_avg_contrib = scale_load(contrib);
-}
+#define subtract_until_zero(minuend, subtrahend)	\
+	(subtrahend < minuend ? minuend - subtrahend : 0)
 
-/* Compute the current contribution to load_avg by se, return any delta */
-static long __update_entity_load_avg_contrib(struct sched_entity *se)
+/*
+ * Group cfs_rq's load_avg is used for task_h_load and update_cfs_share
+ * calc.
+ */
+static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 {
-	long old_contrib = se->avg.load_avg_contrib;
+	int decayed;
 
-	if (entity_is_task(se)) {
-		__update_task_entity_contrib(se);
-	} else {
-		__update_tg_runnable_avg(&se->avg, group_cfs_rq(se));
-		__update_group_entity_contrib(se);
+	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
+		long r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
+		cfs_rq->avg.load_avg = subtract_until_zero(cfs_rq->avg.load_avg, r);
+		r *= LOAD_AVG_MAX;
+		cfs_rq->avg.load_sum = subtract_until_zero(cfs_rq->avg.load_sum, r);
 	}
 
-	return se->avg.load_avg_contrib - old_contrib;
-}
+	decayed = __update_load_avg(now, &cfs_rq->avg, cfs_rq->load.weight);
 
-static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
-						 long load_contrib)
-{
-	if (likely(load_contrib < cfs_rq->blocked_load_avg))
-		cfs_rq->blocked_load_avg -= load_contrib;
-	else
-		cfs_rq->blocked_load_avg = 0;
-}
+#ifndef CONFIG_64BIT
+	if (cfs_rq->avg.last_update_time != cfs_rq->load_last_update_time_copy) {
+		smp_wmb();
+		cfs_rq->load_last_update_time_copy = cfs_rq->avg.last_update_time;
+	}
+#endif
 
-static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
+	return decayed;
+}
 
-/* Update a sched_entity's runnable average */
-static inline void update_entity_load_avg(struct sched_entity *se,
-					  int update_cfs_rq)
+/* Update task and its cfs_rq load average */
+static inline void update_load_avg(struct sched_entity *se, int update_tg)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
-	long contrib_delta;
-	u64 now;
+	u64 now = cfs_rq_clock_task(cfs_rq);
 
 	/*
-	 * For a group entity we need to use their owned cfs_rq_clock_task() in
-	 * case they are the parent of a throttled hierarchy.
+	 * Track task load average for carrying it to new CPU after migrated,
+	 * and group sched_entity for task_h_load calc in migration
 	 */
-	if (entity_is_task(se))
-		now = cfs_rq_clock_task(cfs_rq);
-	else
-		now = cfs_rq_clock_task(group_cfs_rq(se));
-
-	if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq))
-		return;
-
-	contrib_delta = __update_entity_load_avg_contrib(se);
+	__update_load_avg(now, &se->avg, se->on_rq * se->load.weight);
 
-	if (!update_cfs_rq)
-		return;
-
-	if (se->on_rq)
-		cfs_rq->runnable_load_avg += contrib_delta;
-	else
-		subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
+	if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
+		update_tg_load_avg(cfs_rq);
 }
 
-/*
- * Decay the load contributed by all blocked children and account this so that
- * their contribution may appropriately discounted when they wake up.
- */
-static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
+/* Add the load generated by se into cfs_rq's load average */
+static inline void enqueue_entity_load_avg(struct sched_entity *se)
 {
-	u64 now = cfs_rq_clock_task(cfs_rq) >> 20;
-	u64 decays;
-
-	decays = now - cfs_rq->last_decay;
-	if (!decays && !force_update)
-		return;
+	struct sched_avg *sa = &se->avg;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	u64 now = cfs_rq_clock_task(cfs_rq);
+	int migrated = 0, decayed;
 
-	if (atomic_long_read(&cfs_rq->removed_load)) {
-		unsigned long removed_load;
-		removed_load = atomic_long_xchg(&cfs_rq->removed_load, 0);
-		subtract_blocked_load_contrib(cfs_rq, removed_load);
-	}
+	if (sa->last_update_time == 0) {
+		sa->last_update_time = now;
 
-	if (decays) {
-		cfs_rq->blocked_load_avg = decay_load(cfs_rq->blocked_load_avg,
-						      decays);
-		atomic64_add(decays, &cfs_rq->decay_counter);
-		cfs_rq->last_decay = now;
+		if (entity_is_task(se))
+			migrated = 1;
 	}
+	else
+		__update_load_avg(now, sa, se->on_rq * se->load.weight);
 
-	__update_cfs_rq_tg_load_contrib(cfs_rq, force_update);
-}
-
-/* Add the load generated by se into cfs_rq's child load-average */
-static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
-						  struct sched_entity *se,
-						  int wakeup)
-{
-	/*
-	 * We track migrations using entity decay_count <= 0, on a wake-up
-	 * migration we use a negative decay count to track the remote decays
-	 * accumulated while sleeping.
-	 *
-	 * Newly forked tasks are enqueued with se->avg.decay_count == 0, they
-	 * are seen by enqueue_entity_load_avg() as a migration with an already
-	 * constructed load_avg_contrib.
-	 */
-	if (unlikely(se->avg.decay_count <= 0)) {
-		se->avg.last_runnable_update = rq_clock_task(rq_of(cfs_rq));
-		if (se->avg.decay_count) {
-			/*
-			 * In a wake-up migration we have to approximate the
-			 * time sleeping.  This is because we can't synchronize
-			 * clock_task between the two cpus, and it is not
-			 * guaranteed to be read-safe.  Instead, we can
-			 * approximate this using our carried decays, which are
-			 * explicitly atomically readable.
-			 */
-			se->avg.last_runnable_update -= (-se->avg.decay_count)
-							<< 20;
-			update_entity_load_avg(se, 0);
-			/* Indicate that we're now synchronized and on-rq */
-			se->avg.decay_count = 0;
-		}
-		wakeup = 0;
-	} else {
-		__synchronize_entity_decay(se);
-	}
+	decayed = update_cfs_rq_load_avg(now, cfs_rq);
 
-	/* migrated tasks did not contribute to our blocked load */
-	if (wakeup) {
-		subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
-		update_entity_load_avg(se, 0);
+	if (migrated) {
+		cfs_rq->avg.load_avg += sa->load_avg;
+		cfs_rq->avg.load_sum += sa->load_sum;
 	}
 
-	cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
-	/* we force update consideration on load-balancer moves */
-	update_cfs_rq_blocked_load(cfs_rq, !wakeup);
-}
-
-/*
- * Remove se's load from this cfs_rq child load-average, if the entity is
- * transitioning to a blocked state we track its projected decay using
- * blocked_load_avg.
- */
-static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
-						  struct sched_entity *se,
-						  int sleep)
-{
-	update_entity_load_avg(se, 1);
-	/* we force update consideration on load-balancer moves */
-	update_cfs_rq_blocked_load(cfs_rq, !sleep);
-
-	cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib;
-	if (sleep) {
-		cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
-		se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
-	} /* migrations, e.g. sleep=0 leave decay_count == 0 */
+	if (decayed || migrated)
+		update_tg_load_avg(cfs_rq);
 }
 
 /*
@@ -2623,16 +2462,8 @@ static int idle_balance(struct rq *this_rq);
 
 #else /* CONFIG_SMP */
 
-static inline void update_entity_load_avg(struct sched_entity *se,
-					  int update_cfs_rq) {}
-static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
-					   struct sched_entity *se,
-					   int wakeup) {}
-static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
-					   struct sched_entity *se,
-					   int sleep) {}
-static inline void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq,
-					      int force_update) {}
+static inline void update_load_avg(struct sched_entity *se, int update_tg) {}
+static inline void enqueue_entity_load_avg(struct sched_entity *se) {}
 
 static inline int idle_balance(struct rq *rq)
 {
@@ -2764,7 +2595,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
+	enqueue_entity_load_avg(se);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
 
@@ -2839,7 +2670,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);
+
+	update_load_avg(se, 1);
 
 	update_stats_dequeue(cfs_rq, se);
 	if (flags & DEQUEUE_SLEEP) {
@@ -3028,7 +2860,7 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 		/* Put 'current' back into the tree. */
 		__enqueue_entity(cfs_rq, prev);
 		/* in !on_rq case, update occurred at dequeue */
-		update_entity_load_avg(prev, 1);
+		update_load_avg(prev, 0);
 	}
 	cfs_rq->curr = NULL;
 }
@@ -3044,8 +2876,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	/*
 	 * Ensure that runnable average is periodically updated.
 	 */
-	update_entity_load_avg(curr, 1);
-	update_cfs_rq_blocked_load(cfs_rq, 1);
+	update_load_avg(curr, 1);
 	update_cfs_shares(cfs_rq);
 
 #ifdef CONFIG_SCHED_HRTICK
@@ -3923,8 +3754,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
+		update_load_avg(se, 1);
 		update_cfs_shares(cfs_rq);
-		update_entity_load_avg(se, 1);
 	}
 
 	if (!se)
@@ -3983,8 +3814,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
+		update_load_avg(se, 1);
 		update_cfs_shares(cfs_rq);
-		update_entity_load_avg(se, 1);
 	}
 
 	if (!se)
@@ -3997,7 +3828,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 /* Used instead of source_load when we know the type == 0 */
 static unsigned long weighted_cpuload(const int cpu)
 {
-	return cpu_rq(cpu)->cfs.runnable_load_avg;
+	return cpu_rq(cpu)->cfs.avg.load_avg;
 }
 
 /*
@@ -4042,7 +3873,7 @@ static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
-	unsigned long load_avg = rq->cfs.runnable_load_avg;
+	unsigned long load_avg = rq->cfs.avg.load_avg;
 
 	if (nr_running)
 		return load_avg / nr_running;
@@ -4161,7 +3992,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 		/*
 		 * w = rw_i + @wl
 		 */
-		w = se->my_q->load.weight + wl;
+		w = se->my_q->avg.load_avg + wl;
 
 		/*
 		 * wl = S * s'_i; see (2)
@@ -4182,7 +4013,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 		/*
 		 * wl = dw_i = S * (s'_i - s_i); see (3)
 		 */
-		wl -= se->load.weight;
+		wl -= se->avg.load_avg;
 
 		/*
 		 * Recursively apply this logic to all parent groups to compute
@@ -4256,14 +4087,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 	 */
 	if (sync) {
 		tg = task_group(current);
-		weight = current->se.load.weight;
+		weight = current->se.avg.load_avg;
 
 		this_load += effective_load(tg, this_cpu, -weight, -weight);
 		load += effective_load(tg, prev_cpu, 0, -weight);
 	}
 
 	tg = task_group(p);
-	weight = p->se.load.weight;
+	weight = p->se.avg.load_avg;
 
 	/*
 	 * In low-load situations, where prev_cpu is idle and this_cpu is idle
@@ -4551,18 +4382,34 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
 {
 	struct sched_entity *se = &p->se;
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	u64 last_update_time;
 
 	/*
-	 * Load tracking: accumulate removed load so that it can be processed
-	 * when we next update owning cfs_rq under rq->lock.  Tasks contribute
-	 * to blocked load iff they have a positive decay-count.  It can never
-	 * be negative here since on-rq tasks have decay-count == 0.
+	 * Task on old CPU catches up with its old cfs_rq, and subtract itself from
+	 * the cfs_rq (task must be off the queue now).
 	 */
-	if (se->avg.decay_count) {
-		se->avg.decay_count = -__synchronize_entity_decay(se);
-		atomic_long_add(se->avg.load_avg_contrib,
-						&cfs_rq->removed_load);
-	}
+#ifndef CONFIG_64BIT
+	u64 last_update_time_copy;
+
+	do {
+		last_update_time_copy = cfs_rq->load_last_update_time_copy;
+		smp_rmb();
+		last_update_time = cfs_rq->avg.last_update_time;
+	} while (last_update_time != last_update_time_copy);
+#else
+	last_update_time = cfs_rq->avg.last_update_time;
+#endif
+	__update_load_avg(last_update_time, &se->avg, 0);
+	atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
+
+	/*
+	 * We are supposed to update the task to "current" time, so that it is up to
+	 * date and ready to go to the new CPU/cfs_rq. But we have difficulty in
+	 * getting what the current time is, so simply throw away the out-of-date
+	 * time. This will result in the wakee task being less decayed, but giving
+	 * the wakee more load does not sound bad.
+	 */
+	se->avg.last_update_time = 0;
 
 	/* We have migrated, no longer consider this task hot */
 	se->exec_start = 0;
@@ -5399,36 +5246,6 @@ next:
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-/*
- * update tg->load_weight by folding this cpu's load_avg
- */
-static void __update_blocked_averages_cpu(struct task_group *tg, int cpu)
-{
-	struct sched_entity *se = tg->se[cpu];
-	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu];
-
-	/* throttled entities do not contribute to load */
-	if (throttled_hierarchy(cfs_rq))
-		return;
-
-	update_cfs_rq_blocked_load(cfs_rq, 1);
-
-	if (se) {
-		update_entity_load_avg(se, 1);
-		/*
-		 * We pivot on our runnable average having decayed to zero for
-		 * list removal.  This generally implies that all our children
-		 * have also been removed (modulo rounding error or bandwidth
-		 * control); however, such cases are rare and we can fix these
-		 * at enqueue.
-		 *
-		 * TODO: fix up out-of-order children on enqueue.
-		 */
-		if (!se->avg.runnable_avg_sum && !cfs_rq->nr_running)
-			list_del_leaf_cfs_rq(cfs_rq);
-	}
-}
-
 static void update_blocked_averages(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -5437,17 +5254,17 @@ static void update_blocked_averages(int cpu)
 
 	raw_spin_lock_irqsave(&rq->lock, flags);
 	update_rq_clock(rq);
+
 	/*
 	 * Iterates the task_group tree in a bottom up fashion, see
 	 * list_add_leaf_cfs_rq() for details.
 	 */
 	for_each_leaf_cfs_rq(rq, cfs_rq) {
-		/*
-		 * Note: We may want to consider periodically releasing
-		 * rq->lock about these updates so that creating many task
-		 * groups does not result in continually extending hold time.
-		 */
-		__update_blocked_averages_cpu(cfs_rq->tg, rq->cpu);
+		/* throttled entities do not contribute to load */
+		if (throttled_hierarchy(cfs_rq))
+			continue;
+
+		update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
 	}
 
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
@@ -5477,14 +5294,14 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 	}
 
 	if (!se) {
-		cfs_rq->h_load = cfs_rq->runnable_load_avg;
+		cfs_rq->h_load = cfs_rq->avg.load_avg;
 		cfs_rq->last_h_load_update = now;
 	}
 
 	while ((se = cfs_rq->h_load_next) != NULL) {
 		load = cfs_rq->h_load;
-		load = div64_ul(load * se->avg.load_avg_contrib,
-				cfs_rq->runnable_load_avg + 1);
+		load = div64_ul(load * se->avg.load_avg,
+				cfs_rq->avg.load_avg + 1);
 		cfs_rq = group_cfs_rq(se);
 		cfs_rq->h_load = load;
 		cfs_rq->last_h_load_update = now;
@@ -5496,8 +5313,8 @@ static unsigned long task_h_load(struct task_struct *p)
 	struct cfs_rq *cfs_rq = task_cfs_rq(p);
 
 	update_cfs_rq_h_load(cfs_rq);
-	return div64_ul(p->se.avg.load_avg_contrib * cfs_rq->h_load,
-			cfs_rq->runnable_load_avg + 1);
+	return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
+			cfs_rq->avg.load_avg + 1);
 }
 #else
 static inline void update_blocked_averages(int cpu)
@@ -5506,7 +5323,7 @@ static inline void update_blocked_averages(int cpu)
 
 static unsigned long task_h_load(struct task_struct *p)
 {
-	return p->se.avg.load_avg_contrib;
+	return p->se.avg.load_avg;
 }
 #endif
 
@@ -7437,14 +7254,14 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
 
 #ifdef CONFIG_SMP
 	/*
-	* Remove our load from contribution when we leave sched_fair
-	* and ensure we don't carry in an old decay_count if we
-	* switch back.
+	* Remove our load from contribution when we leave cfs_rq.
 	*/
-	if (se->avg.decay_count) {
-		__synchronize_entity_decay(se);
-		subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
-	}
+	__update_load_avg(cfs_rq->avg.last_update_time, &se->avg,
+		se->on_rq * se->load.weight);
+	cfs_rq->avg.load_avg =
+		subtract_until_zero(cfs_rq->avg.load_avg, se->avg.load_avg);
+	cfs_rq->avg.load_sum =
+		subtract_until_zero(cfs_rq->avg.load_sum, se->avg.load_sum);
 #endif
 }
 
@@ -7501,8 +7318,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
 #endif
 #ifdef CONFIG_SMP
-	atomic64_set(&cfs_rq->decay_counter, 1);
-	atomic_long_set(&cfs_rq->removed_load, 0);
+	atomic_long_set(&cfs_rq->removed_load_avg, 0);
 #endif
 }
 
@@ -7547,14 +7363,12 @@ static void task_move_group_fair(struct task_struct *p, int on_rq)
 	if (!on_rq) {
 		cfs_rq = cfs_rq_of(se);
 		se->vruntime += cfs_rq->min_vruntime;
+
 #ifdef CONFIG_SMP
-		/*
-		 * migrate_task_rq_fair() will have removed our previous
-		 * contribution, but we must synchronize for ongoing future
-		 * decay.
-		 */
-		se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
-		cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
+		/* Virtually synchronize task with its new cfs_rq */
+		p->se.avg.last_update_time = cfs_rq->avg.last_update_time;
+		cfs_rq->avg.load_avg += p->se.avg.load_avg;
+		cfs_rq->avg.load_sum += p->se.avg.load_sum;
 #endif
 	}
 }
diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
index 16f5a30..8f547fe 100644
--- a/kernel/sched/proc.c
+++ b/kernel/sched/proc.c
@@ -504,7 +504,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
 #ifdef CONFIG_SMP
 static inline unsigned long get_rq_runnable_load(struct rq *rq)
 {
-	return rq->cfs.runnable_load_avg;
+	return rq->cfs.avg.load_avg;
 }
 #else
 static inline unsigned long get_rq_runnable_load(struct rq *rq)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a147571..7c8c2a9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -210,7 +210,6 @@ struct task_group {
 
 #ifdef	CONFIG_SMP
 	atomic_long_t load_avg;
-	atomic_t runnable_avg;
 #endif
 #endif
 
@@ -331,21 +330,16 @@ struct cfs_rq {
 
 #ifdef CONFIG_SMP
 	/*
-	 * CFS Load tracking
-	 * Under CFS, load is tracked on a per-entity basis and aggregated up.
-	 * This allows for the description of both thread and group usage (in
-	 * the FAIR_GROUP_SCHED case).
+	 * CFS load tracking
 	 */
-	unsigned long runnable_load_avg, blocked_load_avg;
-	atomic64_t decay_counter;
-	u64 last_decay;
-	atomic_long_t removed_load;
+	struct sched_avg avg;
+	unsigned long tg_load_avg_contrib;
+	atomic_long_t removed_load_avg;
+#ifndef CONFIG_64BIT
+	u64 load_last_update_time_copy;
+#endif
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-	/* Required to track per-cpu representation of a task_group */
-	u32 tg_runnable_contrib;
-	unsigned long tg_load_contrib;
-
 	/*
 	 *   h_load = weight * f(tg)
 	 *
-- 
1.7.9.5


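The 32-bit last_update_time handling in migrate_task_rq_fair() above mirrors the
min_vruntime/min_vruntime_copy scheme visible in init_cfs_rq(): the writer stores the
value, issues a write barrier, then stores the copy; a lockless reader loads the copy,
issues a read barrier, loads the value, and retries until the two agree. Below is a
schematic user-space rendering of that idea (illustrative names, C11 fences standing in
for smp_wmb()/smp_rmb(); real code would also want READ_ONCE()/WRITE_ONCE() or atomics),
not the patch itself:

#include <stdatomic.h>
#include <stdint.h>

struct avg_sketch {
	uint64_t last_update_time;	/* written with the queue lock held */
	uint64_t last_update_time_copy;	/* consulted by lockless readers */
};

static void publish_time(struct avg_sketch *a, uint64_t now)
{
	a->last_update_time = now;
	atomic_thread_fence(memory_order_release);	/* plays the role of smp_wmb() */
	a->last_update_time_copy = now;
}

static uint64_t read_time(struct avg_sketch *a)
{
	uint64_t t, copy;

	do {
		copy = a->last_update_time_copy;
		atomic_thread_fence(memory_order_acquire);	/* plays the role of smp_rmb() */
		t = a->last_update_time;
	} while (t != copy);		/* torn or racing read: try again */

	return t;
}

On 64-bit kernels the copy and the retry loop are unnecessary, which is why both sides
of the pairing are under #ifndef CONFIG_64BIT.
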
^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-17 23:26 ` [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking Yuyang Du
@ 2014-07-18  9:43   ` Vincent Guittot
  2014-07-27 17:36     ` Yuyang Du
  2014-07-28 10:48   ` [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking Peter Zijlstra
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 47+ messages in thread
From: Vincent Guittot @ 2014-07-18  9:43 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, Peter Zijlstra, linux-kernel, Paul Turner,
	Benjamin Segall, arjan.van.de.ven, Len Brown, rafael.j.wysocki,
	alan.cox, Gross, Mark, fengguang.wu

On 18 July 2014 01:26, Yuyang Du <yuyang.du@intel.com> wrote:
> The idea of per entity runnable load average (letting runnable time contribute to
> load weight) was proposed by Paul Turner, and it is still followed by this rewrite.
> The rewrite is done to address the following issues:
>
> 1. cfs_rq's load average (namely runnable_load_avg and blocked_load_avg) is updated
>    at the granularity of one entity at a time, which results in the cfs_rq load
>    average being partially updated or asynchronous across its entities: at any time,
>    only one entity is up to date and contributing to the cfs_rq; all other entities
>    are effectively lagging behind.
>
> 2. cfs_rq load average differs between the top rq->cfs_rq and other task_groups'
>    per-CPU cfs_rqs in whether or not blocked_load_avg contributes to the load.
>
> 3. How task_group's load is calculated is complex.
>
> This rewrite tackles these by:
>
> 1. Combine the runnable and blocked load averages of a cfs_rq, and track the cfs_rq's
>    load average as a whole so that it is used as such.
>
> 2. Track the task entity's load average for carrying it between CPUs in migration;
>    group cfs_rq and group entity load averages are tracked for the update_cfs_shares
>    and task_h_load calculations. A task_group's load_avg is aggregated from its per
>    CPU cfs_rqs' load_avg, each of which is aggregated from its sched_entities (both
>    task and group entities). A group entity's weight is proportional to its own
>    cfs_rq's load_avg / task_group's load_avg.
>
> 3. All tasks, cfs_rqs/group entities, and task_groups have a simple, consistent,
>    up-to-date, and synchronized load_avg.
>
> This rewrite is in principle equivalent to the previous code in functionality, but
> significantly reduces code complexity and hence increases efficiency and clarity.
> In addition, the new load_avg is much smoother/more continuous (no abrupt jumps up
> and down) and is decayed/updated more quickly and synchronously to reflect the load
> dynamics. As a result, we have less load tracking overhead and better performance.
>
> Signed-off-by: Yuyang Du <yuyang.du@intel.com>
> ---
>  include/linux/sched.h |   21 +-
>  kernel/sched/debug.c  |   22 +-
>  kernel/sched/fair.c   |  542 ++++++++++++++++---------------------------------
>  kernel/sched/proc.c   |    2 +-
>  kernel/sched/sched.h  |   20 +-
>  5 files changed, 203 insertions(+), 404 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 306f4f0..c981f26 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1067,16 +1067,21 @@ struct load_weight {
>         u32 inv_weight;
>  };
>
> +/*
> + * The load_avg represents an infinite geometric series. The 64 bit
> + * load_sum can:
> + * 1) for a cfs_rq, accommodate 4353082796 (=2^64/47742/88761) entities with
> + * the highest weight (=88761) always runnable without overflowing
> + * 2) for an entity, support any load.weight always runnable
> + */
>  struct sched_avg {
>         /*
> -        * These sums represent an infinite geometric series and so are bound
> -        * above by 1024/(1-y).  Thus we only need a u32 to store them for all
> -        * choices of y < 1-2^(-32)*1024.
> +        * The load_avg represents an infinite geometric series.
>          */
> -       u32 runnable_avg_sum, runnable_avg_period;
> -       u64 last_runnable_update;
> -       s64 decay_count;
> -       unsigned long load_avg_contrib;
> +       u64 last_update_time;
> +       u64 load_sum;
> +       unsigned long load_avg;
> +       u32 period_contrib;
>  };
>
>  #ifdef CONFIG_SCHEDSTATS
> @@ -1142,7 +1147,7 @@ struct sched_entity {
>  #endif
>
>  #ifdef CONFIG_SMP
> -       /* Per-entity load-tracking */
> +       /* Per entity load average tracking */
>         struct sched_avg        avg;
>  #endif
>  };
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 4b864c7..34a3f26 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -85,10 +85,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
>  #endif
>         P(se->load.weight);
>  #ifdef CONFIG_SMP
> -       P(se->avg.runnable_avg_sum);
> -       P(se->avg.runnable_avg_period);
> -       P(se->avg.load_avg_contrib);
> -       P(se->avg.decay_count);
> +       P(se->my_q->avg.load_avg);
>  #endif
>  #undef PN
>  #undef P
> @@ -205,19 +202,11 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
>         SEQ_printf(m, "  .%-30s: %d\n", "nr_running", cfs_rq->nr_running);
>         SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
>  #ifdef CONFIG_SMP
> -       SEQ_printf(m, "  .%-30s: %ld\n", "runnable_load_avg",
> -                       cfs_rq->runnable_load_avg);
> -       SEQ_printf(m, "  .%-30s: %ld\n", "blocked_load_avg",
> -                       cfs_rq->blocked_load_avg);
> +       SEQ_printf(m, "  .%-30s: %lu\n", "load_avg",
> +                       cfs_rq->avg.load_avg);
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> -       SEQ_printf(m, "  .%-30s: %ld\n", "tg_load_contrib",
> -                       cfs_rq->tg_load_contrib);
> -       SEQ_printf(m, "  .%-30s: %d\n", "tg_runnable_contrib",
> -                       cfs_rq->tg_runnable_contrib);
>         SEQ_printf(m, "  .%-30s: %ld\n", "tg_load_avg",
>                         atomic_long_read(&cfs_rq->tg->load_avg));
> -       SEQ_printf(m, "  .%-30s: %d\n", "tg->runnable_avg",
> -                       atomic_read(&cfs_rq->tg->runnable_avg));
>  #endif
>  #endif
>  #ifdef CONFIG_CFS_BANDWIDTH
> @@ -624,10 +613,7 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
>
>         P(se.load.weight);
>  #ifdef CONFIG_SMP
> -       P(se.avg.runnable_avg_sum);
> -       P(se.avg.runnable_avg_period);
> -       P(se.avg.load_avg_contrib);
> -       P(se.avg.decay_count);
> +       P(se.avg.load_avg);
>  #endif
>         P(policy);
>         P(prio);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1a2d04f..3055b9b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -282,9 +282,6 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
>         return grp->my_q;
>  }
>
> -static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq,
> -                                      int force_update);
> -
>  static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
>  {
>         if (!cfs_rq->on_list) {
> @@ -304,8 +301,6 @@ static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
>                 }
>
>                 cfs_rq->on_list = 1;
> -               /* We should have no load, but we need to update last_decay. */
> -               update_cfs_rq_blocked_load(cfs_rq, 0);
>         }
>  }
>
> @@ -665,20 +660,27 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  }
>
>  #ifdef CONFIG_SMP
> -static unsigned long task_h_load(struct task_struct *p);
>
> -static inline void __update_task_entity_contrib(struct sched_entity *se);
> +/* dependent on LOAD_AVG_PERIOD, see below */
> +#define LOAD_AVG_MAX 47742 /* maximum possible load avg */
> +
> +static unsigned long task_h_load(struct task_struct *p);
>
>  /* Give new task start runnable values to heavy its load in infant time */
>  void init_task_runnable_average(struct task_struct *p)
>  {
> -       u32 slice;
> +       struct sched_avg *sa = &p->se.avg;
>
> -       p->se.avg.decay_count = 0;
> -       slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
> -       p->se.avg.runnable_avg_sum = slice;
> -       p->se.avg.runnable_avg_period = slice;
> -       __update_task_entity_contrib(&p->se);
> +       sa->last_update_time = 0;
> +       /*
> +        * sched_avg's period_contrib should be strictly less than 1024, so
> +        * we give it 1023 to make sure it is almost a full period (1024us),
> +        * and will definitely be updated (after enqueue).
> +        */
> +       sa->period_contrib = 1023;
> +       sa->load_avg = p->se.load.weight;
> +       sa->load_sum = p->se.load.weight * LOAD_AVG_MAX;
> +       /* when this task is enqueued, it will contribute to its cfs_rq's load_avg */
>  }
>  #else
>  void init_task_runnable_average(struct task_struct *p)
> @@ -1504,8 +1506,8 @@ static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period)
>                 delta = runtime - p->last_sum_exec_runtime;
>                 *period = now - p->last_task_numa_placement;
>         } else {
> -               delta = p->se.avg.runnable_avg_sum;
> -               *period = p->se.avg.runnable_avg_period;
> +               delta = p->se.avg.load_avg / p->se.load.weight;
> +               *period = LOAD_AVG_MAX;
>         }
>
>         p->last_sum_exec_runtime = runtime;
> @@ -2071,13 +2073,9 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
>         long tg_weight;
>
>         /*
> -        * Use this CPU's actual weight instead of the last load_contribution
> -        * to gain a more accurate current total weight. See
> -        * update_cfs_rq_load_contribution().
> +        * Use this CPU's load average instead of actual weight
>          */
>         tg_weight = atomic_long_read(&tg->load_avg);
> -       tg_weight -= cfs_rq->tg_load_contrib;
> -       tg_weight += cfs_rq->load.weight;
>
>         return tg_weight;
>  }
> @@ -2087,7 +2085,7 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
>         long tg_weight, load, shares;
>
>         tg_weight = calc_tg_weight(tg, cfs_rq);
> -       load = cfs_rq->load.weight;
> +       load = cfs_rq->avg.load_avg;
>
>         shares = (tg->shares * load);
>         if (tg_weight)
> @@ -2154,7 +2152,6 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
>   * Note: The tables below are dependent on this value.
>   */
>  #define LOAD_AVG_PERIOD 32
> -#define LOAD_AVG_MAX 47742 /* maximum possible load avg */
>  #define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_MAX_AVG */
>
>  /* Precomputed fixed inverse multiplies for multiplication by y^n */
> @@ -2181,7 +2178,7 @@ static const u32 runnable_avg_yN_sum[] = {
>   * Approximate:
>   *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
>   */
> -static __always_inline u64 decay_load(u64 val, u64 n)
> +static __always_inline u64 decay_load32(u64 val, u64 n)
>  {
>         unsigned int local_n;
>
> @@ -2210,6 +2207,18 @@ static __always_inline u64 decay_load(u64 val, u64 n)
>         return val >> 32;
>  }
>
> +static __always_inline u64 decay_load(u64 val, u64 n)
> +{
> +       if (likely(val <= UINT_MAX))
> +               val = decay_load32(val, n);
> +       else {
> +               val *= (u32)decay_load32(1 << 15, n);
> +               val >>= 15;
> +       }
> +
> +       return val;
> +}
> +
>  /*
>   * For updates fully spanning n periods, the contribution to runnable
>   * average will be: \Sum 1024*y^n
> @@ -2234,7 +2243,7 @@ static u32 __compute_runnable_contrib(u64 n)
>                 n -= LOAD_AVG_PERIOD;
>         } while (n > LOAD_AVG_PERIOD);
>
> -       contrib = decay_load(contrib, n);
> +       contrib = decay_load32(contrib, n);
>         return contrib + runnable_avg_yN_sum[n];
>  }
>
> @@ -2266,21 +2275,20 @@ static u32 __compute_runnable_contrib(u64 n)
>   *   load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
>   *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
>   */
> -static __always_inline int __update_entity_runnable_avg(u64 now,
> -                                                       struct sched_avg *sa,
> -                                                       int runnable)
> +static __always_inline int
> +__update_load_avg(u64 now, struct sched_avg *sa, unsigned long w)
>  {
>         u64 delta, periods;
> -       u32 runnable_contrib;
> +       u32 contrib;
>         int delta_w, decayed = 0;
>
> -       delta = now - sa->last_runnable_update;
> +       delta = now - sa->last_update_time;
>         /*
>          * This should only happen when time goes backwards, which it
>          * unfortunately does during sched clock init when we swap over to TSC.
>          */
>         if ((s64)delta < 0) {
> -               sa->last_runnable_update = now;
> +               sa->last_update_time = now;
>                 return 0;
>         }
>
> @@ -2291,23 +2299,24 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
>         delta >>= 10;
>         if (!delta)
>                 return 0;
> -       sa->last_runnable_update = now;
> +       sa->last_update_time = now;
>
>         /* delta_w is the amount already accumulated against our next period */
> -       delta_w = sa->runnable_avg_period % 1024;
> +       delta_w = sa->period_contrib;
>         if (delta + delta_w >= 1024) {
> -               /* period roll-over */
>                 decayed = 1;
>
> +               /* we don't yet know how much will be left for the next period, so start over */
> +               sa->period_contrib = 0;
> +
>                 /*
>                  * Now that we know we're crossing a period boundary, figure
>                  * out how much from delta we need to complete the current
>                  * period and accrue it.
>                  */
>                 delta_w = 1024 - delta_w;
> -               if (runnable)
> -                       sa->runnable_avg_sum += delta_w;
> -               sa->runnable_avg_period += delta_w;
> +               if (w)
> +                       sa->load_sum += w * delta_w;

Do you really need to have *w for computing the load_sum ? can't you
only use it when computing the load_avg ?

sa->load_avg = div_u64(sa->load_sum * w , LOAD_AVG_MAX)

>
>                 delta -= delta_w;
>
> @@ -2315,290 +2324,120 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
>                 periods = delta / 1024;
>                 delta %= 1024;
>
> -               sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum,
> -                                                 periods + 1);
> -               sa->runnable_avg_period = decay_load(sa->runnable_avg_period,
> -                                                    periods + 1);
> +               sa->load_sum = decay_load(sa->load_sum, periods + 1);
>
>                 /* Efficiently calculate \sum (1..n_period) 1024*y^i */
> -               runnable_contrib = __compute_runnable_contrib(periods);
> -               if (runnable)
> -                       sa->runnable_avg_sum += runnable_contrib;
> -               sa->runnable_avg_period += runnable_contrib;
> +               contrib = __compute_runnable_contrib(periods);
> +               if (w)
> +                       sa->load_sum += w * contrib;
>         }
>
>         /* Remainder of delta accrued against u_0` */
> -       if (runnable)
> -               sa->runnable_avg_sum += delta;
> -       sa->runnable_avg_period += delta;
> +       if (w)
> +               sa->load_sum += w * delta;
>
> -       return decayed;
> -}
> +       sa->period_contrib += delta;
>
> -/* Synchronize an entity's decay with its parenting cfs_rq.*/
> -static inline u64 __synchronize_entity_decay(struct sched_entity *se)
> -{
> -       struct cfs_rq *cfs_rq = cfs_rq_of(se);
> -       u64 decays = atomic64_read(&cfs_rq->decay_counter);
> +       if (decayed)
> +               sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX);
>
> -       decays -= se->avg.decay_count;
> -       if (!decays)
> -               return 0;
> -
> -       se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays);
> -       se->avg.decay_count = 0;
> -
> -       return decays;
> +       return decayed;
>  }
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> -static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
> -                                                int force_update)
> -{
> -       struct task_group *tg = cfs_rq->tg;
> -       long tg_contrib;
> -
> -       tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
> -       tg_contrib -= cfs_rq->tg_load_contrib;
> -
> -       if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
> -               atomic_long_add(tg_contrib, &tg->load_avg);
> -               cfs_rq->tg_load_contrib += tg_contrib;
> -       }
> -}
> -
>  /*
> - * Aggregate cfs_rq runnable averages into an equivalent task_group
> - * representation for computing load contributions.
> + * Updating tg's load_avg is only necessary before it is used in
> + * update_cfs_share (which is done) and effective_load (which is
> + * not done because it is too costly).
>   */
> -static inline void __update_tg_runnable_avg(struct sched_avg *sa,
> -                                                 struct cfs_rq *cfs_rq)
> -{
> -       struct task_group *tg = cfs_rq->tg;
> -       long contrib;
> -
> -       /* The fraction of a cpu used by this cfs_rq */
> -       contrib = div_u64((u64)sa->runnable_avg_sum << NICE_0_SHIFT,
> -                         sa->runnable_avg_period + 1);
> -       contrib -= cfs_rq->tg_runnable_contrib;
> -
> -       if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) {
> -               atomic_add(contrib, &tg->runnable_avg);
> -               cfs_rq->tg_runnable_contrib += contrib;
> -       }
> -}
> -
> -static inline void __update_group_entity_contrib(struct sched_entity *se)
> +static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
>  {
> -       struct cfs_rq *cfs_rq = group_cfs_rq(se);
> -       struct task_group *tg = cfs_rq->tg;
> -       int runnable_avg;
> -
> -       u64 contrib;
> -
> -       contrib = cfs_rq->tg_load_contrib * tg->shares;
> -       se->avg.load_avg_contrib = div_u64(contrib,
> -                                    atomic_long_read(&tg->load_avg) + 1);
> +       long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>
> -       /*
> -        * For group entities we need to compute a correction term in the case
> -        * that they are consuming <1 cpu so that we would contribute the same
> -        * load as a task of equal weight.
> -        *
> -        * Explicitly co-ordinating this measurement would be expensive, but
> -        * fortunately the sum of each cpus contribution forms a usable
> -        * lower-bound on the true value.
> -        *
> -        * Consider the aggregate of 2 contributions.  Either they are disjoint
> -        * (and the sum represents true value) or they are disjoint and we are
> -        * understating by the aggregate of their overlap.
> -        *
> -        * Extending this to N cpus, for a given overlap, the maximum amount we
> -        * understand is then n_i(n_i+1)/2 * w_i where n_i is the number of
> -        * cpus that overlap for this interval and w_i is the interval width.
> -        *
> -        * On a small machine; the first term is well-bounded which bounds the
> -        * total error since w_i is a subset of the period.  Whereas on a
> -        * larger machine, while this first term can be larger, if w_i is the
> -        * of consequential size guaranteed to see n_i*w_i quickly converge to
> -        * our upper bound of 1-cpu.
> -        */
> -       runnable_avg = atomic_read(&tg->runnable_avg);
> -       if (runnable_avg < NICE_0_LOAD) {
> -               se->avg.load_avg_contrib *= runnable_avg;
> -               se->avg.load_avg_contrib >>= NICE_0_SHIFT;
> +       if (delta) {
> +               atomic_long_add(delta, &cfs_rq->tg->load_avg);
> +               cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
>         }
>  }
>
>  #else /* CONFIG_FAIR_GROUP_SCHED */
> -static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
> -                                                int force_update) {}
> -static inline void __update_tg_runnable_avg(struct sched_avg *sa,
> -                                                 struct cfs_rq *cfs_rq) {}
> -static inline void __update_group_entity_contrib(struct sched_entity *se) {}
> +static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {}
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>
> -static inline void __update_task_entity_contrib(struct sched_entity *se)
> -{
> -       u32 contrib;
> +static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
>
> -       /* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
> -       contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
> -       contrib /= (se->avg.runnable_avg_period + 1);
> -       se->avg.load_avg_contrib = scale_load(contrib);
> -}
> +#define subtract_until_zero(minuend, subtrahend)       \
> +       (subtrahend < minuend ? minuend - subtrahend : 0)
>
> -/* Compute the current contribution to load_avg by se, return any delta */
> -static long __update_entity_load_avg_contrib(struct sched_entity *se)
> +/*
> + * Group cfs_rq's load_avg is used for task_h_load and update_cfs_share
> + * calc.
> + */
> +static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>  {
> -       long old_contrib = se->avg.load_avg_contrib;
> +       int decayed;
>
> -       if (entity_is_task(se)) {
> -               __update_task_entity_contrib(se);
> -       } else {
> -               __update_tg_runnable_avg(&se->avg, group_cfs_rq(se));
> -               __update_group_entity_contrib(se);
> +       if (atomic_long_read(&cfs_rq->removed_load_avg)) {
> +               long r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
> +               cfs_rq->avg.load_avg = subtract_until_zero(cfs_rq->avg.load_avg, r);
> +               r *= LOAD_AVG_MAX;
> +               cfs_rq->avg.load_sum = subtract_until_zero(cfs_rq->avg.load_sum, r);
>         }
>
> -       return se->avg.load_avg_contrib - old_contrib;
> -}
> +       decayed = __update_load_avg(now, &cfs_rq->avg, cfs_rq->load.weight);
>
> -static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
> -                                                long load_contrib)
> -{
> -       if (likely(load_contrib < cfs_rq->blocked_load_avg))
> -               cfs_rq->blocked_load_avg -= load_contrib;
> -       else
> -               cfs_rq->blocked_load_avg = 0;
> -}
> +#ifndef CONFIG_64BIT
> +       if (cfs_rq->avg.last_update_time != cfs_rq->load_last_update_time_copy) {
> +               smp_wmb();
> +               cfs_rq->load_last_update_time_copy = cfs_rq->avg.last_update_time;
> +       }
> +#endif
>
> -static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
> +       return decayed;
> +}
>
> -/* Update a sched_entity's runnable average */
> -static inline void update_entity_load_avg(struct sched_entity *se,
> -                                         int update_cfs_rq)
> +/* Update task and its cfs_rq load average */
> +static inline void update_load_avg(struct sched_entity *se, int update_tg)
>  {
>         struct cfs_rq *cfs_rq = cfs_rq_of(se);
> -       long contrib_delta;
> -       u64 now;
> +       u64 now = cfs_rq_clock_task(cfs_rq);
>
>         /*
> -        * For a group entity we need to use their owned cfs_rq_clock_task() in
> -        * case they are the parent of a throttled hierarchy.
> +        * Track the task's load average for carrying it to its new CPU after
> +        * migration, and the group sched_entity's for the task_h_load calc
> +        * used in migration
>          */
> -       if (entity_is_task(se))
> -               now = cfs_rq_clock_task(cfs_rq);
> -       else
> -               now = cfs_rq_clock_task(group_cfs_rq(se));
> -
> -       if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq))
> -               return;
> -
> -       contrib_delta = __update_entity_load_avg_contrib(se);
> +       __update_load_avg(now, &se->avg, se->on_rq * se->load.weight);
>
> -       if (!update_cfs_rq)
> -               return;
> -
> -       if (se->on_rq)
> -               cfs_rq->runnable_load_avg += contrib_delta;
> -       else
> -               subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
> +       if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
> +               update_tg_load_avg(cfs_rq);
>  }
>
> -/*
> - * Decay the load contributed by all blocked children and account this so that
> - * their contribution may appropriately discounted when they wake up.
> - */
> -static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
> +/* Add the load generated by se into cfs_rq's load average */
> +static inline void enqueue_entity_load_avg(struct sched_entity *se)
>  {
> -       u64 now = cfs_rq_clock_task(cfs_rq) >> 20;
> -       u64 decays;
> -
> -       decays = now - cfs_rq->last_decay;
> -       if (!decays && !force_update)
> -               return;
> +       struct sched_avg *sa = &se->avg;
> +       struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +       u64 now = cfs_rq_clock_task(cfs_rq);
> +       int migrated = 0, decayed;
>
> -       if (atomic_long_read(&cfs_rq->removed_load)) {
> -               unsigned long removed_load;
> -               removed_load = atomic_long_xchg(&cfs_rq->removed_load, 0);
> -               subtract_blocked_load_contrib(cfs_rq, removed_load);
> -       }
> +       if (sa->last_update_time == 0) {
> +               sa->last_update_time = now;
>
> -       if (decays) {
> -               cfs_rq->blocked_load_avg = decay_load(cfs_rq->blocked_load_avg,
> -                                                     decays);
> -               atomic64_add(decays, &cfs_rq->decay_counter);
> -               cfs_rq->last_decay = now;
> +               if (entity_is_task(se))
> +                       migrated = 1;
>         }
> +       else
> +               __update_load_avg(now, sa, se->on_rq * se->load.weight);
>
> -       __update_cfs_rq_tg_load_contrib(cfs_rq, force_update);
> -}
> -
> -/* Add the load generated by se into cfs_rq's child load-average */
> -static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
> -                                                 struct sched_entity *se,
> -                                                 int wakeup)
> -{
> -       /*
> -        * We track migrations using entity decay_count <= 0, on a wake-up
> -        * migration we use a negative decay count to track the remote decays
> -        * accumulated while sleeping.
> -        *
> -        * Newly forked tasks are enqueued with se->avg.decay_count == 0, they
> -        * are seen by enqueue_entity_load_avg() as a migration with an already
> -        * constructed load_avg_contrib.
> -        */
> -       if (unlikely(se->avg.decay_count <= 0)) {
> -               se->avg.last_runnable_update = rq_clock_task(rq_of(cfs_rq));
> -               if (se->avg.decay_count) {
> -                       /*
> -                        * In a wake-up migration we have to approximate the
> -                        * time sleeping.  This is because we can't synchronize
> -                        * clock_task between the two cpus, and it is not
> -                        * guaranteed to be read-safe.  Instead, we can
> -                        * approximate this using our carried decays, which are
> -                        * explicitly atomically readable.
> -                        */
> -                       se->avg.last_runnable_update -= (-se->avg.decay_count)
> -                                                       << 20;
> -                       update_entity_load_avg(se, 0);
> -                       /* Indicate that we're now synchronized and on-rq */
> -                       se->avg.decay_count = 0;
> -               }
> -               wakeup = 0;
> -       } else {
> -               __synchronize_entity_decay(se);
> -       }
> +       decayed = update_cfs_rq_load_avg(now, cfs_rq);
>
> -       /* migrated tasks did not contribute to our blocked load */
> -       if (wakeup) {
> -               subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
> -               update_entity_load_avg(se, 0);
> +       if (migrated) {
> +               cfs_rq->avg.load_avg += sa->load_avg;
> +               cfs_rq->avg.load_sum += sa->load_sum;
>         }
>
> -       cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
> -       /* we force update consideration on load-balancer moves */
> -       update_cfs_rq_blocked_load(cfs_rq, !wakeup);
> -}
> -
> -/*
> - * Remove se's load from this cfs_rq child load-average, if the entity is
> - * transitioning to a blocked state we track its projected decay using
> - * blocked_load_avg.
> - */
> -static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
> -                                                 struct sched_entity *se,
> -                                                 int sleep)
> -{
> -       update_entity_load_avg(se, 1);
> -       /* we force update consideration on load-balancer moves */
> -       update_cfs_rq_blocked_load(cfs_rq, !sleep);
> -
> -       cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib;
> -       if (sleep) {
> -               cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
> -               se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
> -       } /* migrations, e.g. sleep=0 leave decay_count == 0 */
> +       if (decayed || migrated)
> +               update_tg_load_avg(cfs_rq);
>  }
>
>  /*
> @@ -2623,16 +2462,8 @@ static int idle_balance(struct rq *this_rq);
>
>  #else /* CONFIG_SMP */
>
> -static inline void update_entity_load_avg(struct sched_entity *se,
> -                                         int update_cfs_rq) {}
> -static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
> -                                          struct sched_entity *se,
> -                                          int wakeup) {}
> -static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
> -                                          struct sched_entity *se,
> -                                          int sleep) {}
> -static inline void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq,
> -                                             int force_update) {}
> +static inline void update_load_avg(struct sched_entity *se, int update_tg) {}
> +static inline void enqueue_entity_load_avg(struct sched_entity *se) {}
>
>  static inline int idle_balance(struct rq *rq)
>  {
> @@ -2764,7 +2595,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>          * Update run-time statistics of the 'current'.
>          */
>         update_curr(cfs_rq);
> -       enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
> +       enqueue_entity_load_avg(se);
>         account_entity_enqueue(cfs_rq, se);
>         update_cfs_shares(cfs_rq);
>
> @@ -2839,7 +2670,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>          * Update run-time statistics of the 'current'.
>          */
>         update_curr(cfs_rq);
> -       dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);
> +
> +       update_load_avg(se, 1);
>
>         update_stats_dequeue(cfs_rq, se);
>         if (flags & DEQUEUE_SLEEP) {
> @@ -3028,7 +2860,7 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
>                 /* Put 'current' back into the tree. */
>                 __enqueue_entity(cfs_rq, prev);
>                 /* in !on_rq case, update occurred at dequeue */
> -               update_entity_load_avg(prev, 1);
> +               update_load_avg(prev, 0);
>         }
>         cfs_rq->curr = NULL;
>  }
> @@ -3044,8 +2876,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>         /*
>          * Ensure that runnable average is periodically updated.
>          */
> -       update_entity_load_avg(curr, 1);
> -       update_cfs_rq_blocked_load(cfs_rq, 1);
> +       update_load_avg(curr, 1);
>         update_cfs_shares(cfs_rq);
>
>  #ifdef CONFIG_SCHED_HRTICK
> @@ -3923,8 +3754,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>                 if (cfs_rq_throttled(cfs_rq))
>                         break;
>
> +               update_load_avg(se, 1);
>                 update_cfs_shares(cfs_rq);
> -               update_entity_load_avg(se, 1);
>         }
>
>         if (!se)
> @@ -3983,8 +3814,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>                 if (cfs_rq_throttled(cfs_rq))
>                         break;
>
> +               update_load_avg(se, 1);
>                 update_cfs_shares(cfs_rq);
> -               update_entity_load_avg(se, 1);
>         }
>
>         if (!se)
> @@ -3997,7 +3828,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  /* Used instead of source_load when we know the type == 0 */
>  static unsigned long weighted_cpuload(const int cpu)
>  {
> -       return cpu_rq(cpu)->cfs.runnable_load_avg;
> +       return cpu_rq(cpu)->cfs.avg.load_avg;
>  }
>
>  /*
> @@ -4042,7 +3873,7 @@ static unsigned long cpu_avg_load_per_task(int cpu)
>  {
>         struct rq *rq = cpu_rq(cpu);
>         unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
> -       unsigned long load_avg = rq->cfs.runnable_load_avg;
> +       unsigned long load_avg = rq->cfs.avg.load_avg;
>
>         if (nr_running)
>                 return load_avg / nr_running;
> @@ -4161,7 +3992,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
>                 /*
>                  * w = rw_i + @wl
>                  */
> -               w = se->my_q->load.weight + wl;
> +               w = se->my_q->avg.load_avg + wl;
>
>                 /*
>                  * wl = S * s'_i; see (2)
> @@ -4182,7 +4013,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
>                 /*
>                  * wl = dw_i = S * (s'_i - s_i); see (3)
>                  */
> -               wl -= se->load.weight;
> +               wl -= se->avg.load_avg;
>
>                 /*
>                  * Recursively apply this logic to all parent groups to compute
> @@ -4256,14 +4087,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>          */
>         if (sync) {
>                 tg = task_group(current);
> -               weight = current->se.load.weight;
> +               weight = current->se.avg.load_avg;
>
>                 this_load += effective_load(tg, this_cpu, -weight, -weight);
>                 load += effective_load(tg, prev_cpu, 0, -weight);
>         }
>
>         tg = task_group(p);
> -       weight = p->se.load.weight;
> +       weight = p->se.avg.load_avg;
>
>         /*
>          * In low-load situations, where prev_cpu is idle and this_cpu is idle
> @@ -4551,18 +4382,34 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
>  {
>         struct sched_entity *se = &p->se;
>         struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +       u64 last_update_time;
>
>         /*
> -        * Load tracking: accumulate removed load so that it can be processed
> -        * when we next update owning cfs_rq under rq->lock.  Tasks contribute
> -        * to blocked load iff they have a positive decay-count.  It can never
> -        * be negative here since on-rq tasks have decay-count == 0.
> +        * The task on its old CPU catches up with its old cfs_rq, and subtracts
> +        * itself from that cfs_rq (the task must be off the queue by now).
>          */
> -       if (se->avg.decay_count) {
> -               se->avg.decay_count = -__synchronize_entity_decay(se);
> -               atomic_long_add(se->avg.load_avg_contrib,
> -                                               &cfs_rq->removed_load);
> -       }
> +#ifndef CONFIG_64BIT
> +       u64 last_update_time_copy;
> +
> +       do {
> +               last_update_time_copy = cfs_rq->load_last_update_time_copy;
> +               smp_rmb();
> +               last_update_time = cfs_rq->avg.last_update_time;
> +       } while (last_update_time != last_update_time_copy);
> +#else
> +       last_update_time = cfs_rq->avg.last_update_time;
> +#endif
> +       __update_load_avg(last_update_time, &se->avg, 0);
> +       atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
> +
> +       /*
> +        * We are supposed to update the task to "current" time, so that it is up to
> +        * date and ready to go to the new CPU/cfs_rq. But we have difficulty getting
> +        * at what the current time is, so simply throw away the out-of-date time.
> +        * This leaves the wakee task less decayed; giving the wakee a bit more load
> +        * is not a bad thing.
> +        */
> +       se->avg.last_update_time = 0;
>
>         /* We have migrated, no longer consider this task hot */
>         se->exec_start = 0;
> @@ -5399,36 +5246,6 @@ next:
>  }
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> -/*
> - * update tg->load_weight by folding this cpu's load_avg
> - */
> -static void __update_blocked_averages_cpu(struct task_group *tg, int cpu)
> -{
> -       struct sched_entity *se = tg->se[cpu];
> -       struct cfs_rq *cfs_rq = tg->cfs_rq[cpu];
> -
> -       /* throttled entities do not contribute to load */
> -       if (throttled_hierarchy(cfs_rq))
> -               return;
> -
> -       update_cfs_rq_blocked_load(cfs_rq, 1);
> -
> -       if (se) {
> -               update_entity_load_avg(se, 1);
> -               /*
> -                * We pivot on our runnable average having decayed to zero for
> -                * list removal.  This generally implies that all our children
> -                * have also been removed (modulo rounding error or bandwidth
> -                * control); however, such cases are rare and we can fix these
> -                * at enqueue.
> -                *
> -                * TODO: fix up out-of-order children on enqueue.
> -                */
> -               if (!se->avg.runnable_avg_sum && !cfs_rq->nr_running)
> -                       list_del_leaf_cfs_rq(cfs_rq);
> -       }
> -}
> -
>  static void update_blocked_averages(int cpu)
>  {
>         struct rq *rq = cpu_rq(cpu);
> @@ -5437,17 +5254,17 @@ static void update_blocked_averages(int cpu)
>
>         raw_spin_lock_irqsave(&rq->lock, flags);
>         update_rq_clock(rq);
> +
>         /*
>          * Iterates the task_group tree in a bottom up fashion, see
>          * list_add_leaf_cfs_rq() for details.
>          */
>         for_each_leaf_cfs_rq(rq, cfs_rq) {
> -               /*
> -                * Note: We may want to consider periodically releasing
> -                * rq->lock about these updates so that creating many task
> -                * groups does not result in continually extending hold time.
> -                */
> -               __update_blocked_averages_cpu(cfs_rq->tg, rq->cpu);
> +               /* throttled entities do not contribute to load */
> +               if (throttled_hierarchy(cfs_rq))
> +                       continue;
> +
> +               update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
>         }
>
>         raw_spin_unlock_irqrestore(&rq->lock, flags);
> @@ -5477,14 +5294,14 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
>         }
>
>         if (!se) {
> -               cfs_rq->h_load = cfs_rq->runnable_load_avg;
> +               cfs_rq->h_load = cfs_rq->avg.load_avg;
>                 cfs_rq->last_h_load_update = now;
>         }
>
>         while ((se = cfs_rq->h_load_next) != NULL) {
>                 load = cfs_rq->h_load;
> -               load = div64_ul(load * se->avg.load_avg_contrib,
> -                               cfs_rq->runnable_load_avg + 1);
> +               load = div64_ul(load * se->avg.load_avg,
> +                               cfs_rq->avg.load_avg + 1);
>                 cfs_rq = group_cfs_rq(se);
>                 cfs_rq->h_load = load;
>                 cfs_rq->last_h_load_update = now;
> @@ -5496,8 +5313,8 @@ static unsigned long task_h_load(struct task_struct *p)
>         struct cfs_rq *cfs_rq = task_cfs_rq(p);
>
>         update_cfs_rq_h_load(cfs_rq);
> -       return div64_ul(p->se.avg.load_avg_contrib * cfs_rq->h_load,
> -                       cfs_rq->runnable_load_avg + 1);
> +       return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
> +                       cfs_rq->avg.load_avg + 1);
>  }
>  #else
>  static inline void update_blocked_averages(int cpu)
> @@ -5506,7 +5323,7 @@ static inline void update_blocked_averages(int cpu)
>
>  static unsigned long task_h_load(struct task_struct *p)
>  {
> -       return p->se.avg.load_avg_contrib;
> +       return p->se.avg.load_avg;
>  }
>  #endif
>
> @@ -7437,14 +7254,14 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
>
>  #ifdef CONFIG_SMP
>         /*
> -       * Remove our load from contribution when we leave sched_fair
> -       * and ensure we don't carry in an old decay_count if we
> -       * switch back.
> +       * Remove our load from contribution when we leave cfs_rq.
>         */
> -       if (se->avg.decay_count) {
> -               __synchronize_entity_decay(se);
> -               subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
> -       }
> +       __update_load_avg(cfs_rq->avg.last_update_time, &se->avg,
> +               se->on_rq * se->load.weight);
> +       cfs_rq->avg.load_avg =
> +               subtract_until_zero(cfs_rq->avg.load_avg, se->avg.load_avg);
> +       cfs_rq->avg.load_sum =
> +               subtract_until_zero(cfs_rq->avg.load_sum, se->avg.load_sum);
>  #endif
>  }
>
> @@ -7501,8 +7318,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
>         cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
>  #endif
>  #ifdef CONFIG_SMP
> -       atomic64_set(&cfs_rq->decay_counter, 1);
> -       atomic_long_set(&cfs_rq->removed_load, 0);
> +       atomic_long_set(&cfs_rq->removed_load_avg, 0);
>  #endif
>  }
>
> @@ -7547,14 +7363,12 @@ static void task_move_group_fair(struct task_struct *p, int on_rq)
>         if (!on_rq) {
>                 cfs_rq = cfs_rq_of(se);
>                 se->vruntime += cfs_rq->min_vruntime;
> +
>  #ifdef CONFIG_SMP
> -               /*
> -                * migrate_task_rq_fair() will have removed our previous
> -                * contribution, but we must synchronize for ongoing future
> -                * decay.
> -                */
> -               se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
> -               cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
> +               /* Virtually synchronize task with its new cfs_rq */
> +               p->se.avg.last_update_time = cfs_rq->avg.last_update_time;
> +               cfs_rq->avg.load_avg += p->se.avg.load_avg;
> +               cfs_rq->avg.load_sum += p->se.avg.load_sum;
>  #endif
>         }
>  }
> diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
> index 16f5a30..8f547fe 100644
> --- a/kernel/sched/proc.c
> +++ b/kernel/sched/proc.c
> @@ -504,7 +504,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
>  #ifdef CONFIG_SMP
>  static inline unsigned long get_rq_runnable_load(struct rq *rq)
>  {
> -       return rq->cfs.runnable_load_avg;
> +       return rq->cfs.avg.load_avg;
>  }
>  #else
>  static inline unsigned long get_rq_runnable_load(struct rq *rq)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index a147571..7c8c2a9 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -210,7 +210,6 @@ struct task_group {
>
>  #ifdef CONFIG_SMP
>         atomic_long_t load_avg;
> -       atomic_t runnable_avg;
>  #endif
>  #endif
>
> @@ -331,21 +330,16 @@ struct cfs_rq {
>
>  #ifdef CONFIG_SMP
>         /*
> -        * CFS Load tracking
> -        * Under CFS, load is tracked on a per-entity basis and aggregated up.
> -        * This allows for the description of both thread and group usage (in
> -        * the FAIR_GROUP_SCHED case).
> +        * CFS load tracking
>          */
> -       unsigned long runnable_load_avg, blocked_load_avg;
> -       atomic64_t decay_counter;
> -       u64 last_decay;
> -       atomic_long_t removed_load;
> +       struct sched_avg avg;
> +       unsigned long tg_load_avg_contrib;
> +       atomic_long_t removed_load_avg;
> +#ifndef CONFIG_64BIT
> +       u64 load_last_update_time_copy;
> +#endif
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> -       /* Required to track per-cpu representation of a task_group */
> -       u32 tg_runnable_contrib;
> -       unsigned long tg_load_contrib;
> -
>         /*
>          *   h_load = weight * f(tg)
>          *
> --
> 1.7.9.5
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 0/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-17 23:26 [PATCH 0/2 v4] sched: Rewrite per entity runnable load average tracking Yuyang Du
  2014-07-17 23:26 ` [PATCH 1/2 v4] sched: Remove update_rq_runnable_avg Yuyang Du
  2014-07-17 23:26 ` [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking Yuyang Du
@ 2014-07-18 15:39 ` Morten Rasmussen
  2014-07-27 19:02   ` Yuyang Du
  2014-07-20  5:46 ` Mike Galbraith
  3 siblings, 1 reply; 47+ messages in thread
From: Morten Rasmussen @ 2014-07-18 15:39 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, peterz, linux-kernel, pjt, bsegall, arjan.van.de.ven,
	len.brown, rafael.j.wysocki, alan.cox, mark.gross, fengguang.wu

On Fri, Jul 18, 2014 at 12:26:04AM +0100, Yuyang Du wrote:
> Thanks to Morten, Ben, and Fengguang.
> 
> v4 changes:
> 
> - Insert memory barrier before writing cfs_rq->load_last_update_copy.
> - Fix typos.

It is quite a challenge keeping up with your revisions :) Three
revisions in five days. It takes time to go through all the changes to
understand the implications of your proposed changes.

I still haven't gotten to the bottom of everything, but this is my view
so far.

1. runnable_avg_period is removed

load_avg_contrib used to be runnable_avg_sum/runnable_avg_period scaled
by the task load weight (priority). The runnable_avg_period is replaced
by a constant in this patch set. The effect of that change is that task
load tracking is no longer more sensitive in the early life of the task,
before it has built up some history. Tasks are now initialized to start
out as if they had been runnable forever (>345ms). If this assumption
about the task's behavior is wrong, it will take longer to converge to
the true average than it did before. The upside is that it is more stable.
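
The ">345ms" figure can be sanity-checked from the constants quoted in the patch,
namely 1024us periods, y^32 = 0.5 and LOAD_AVG_MAX_N = 345, with a few lines of
floating-point C; the kernel's fixed-point tables arrive at LOAD_AVG_MAX = 47742
for the same quantity:

#include <stdio.h>
#include <math.h>

int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);	/* y^32 == 0.5 */
	double sum = 0.0;
	int n;

	for (n = 0; n < 345; n++)		/* 345 periods of 1024us ~= 353ms */
		sum += 1024.0 * pow(y, n);

	printf("sum after 345 full periods:  %.0f\n", sum);
	printf("asymptotic limit 1024/(1-y): %.0f\n", 1024.0 / (1.0 - y));
	return 0;
}

Both prints land within about 0.1% of 47742; the small difference is fixed-point
truncation in the kernel's tables. So a new task is seeded with load_sum =
weight * LOAD_AVG_MAX, i.e. the value it would have reached after roughly 345
periods, about a third of a second, of being continuously runnable.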

2. runnable_load_avg and blocked_load_avg are combined

runnable_load_avg currently represents the sum of the load_avg_contrib of
all tasks on the rq, while blocked_load_avg is the sum for those tasks
not on a runqueue. It makes perfect sense to consider the sum of both
when calculating the load of a cpu, but we currently don't include
blocked_load_avg. The reason for that is that the priority scaling of the
task load_avg_contrib may lead to under-utilization of cpus that
occasionally have a tiny high priority task running. You can easily have a
task that takes 5% of cpu time but has a load_avg_contrib several times
larger than that of a default priority task runnable 100% of the time.

Another thing that might be an issue is that the blocked load of a
terminated task lives on for quite a while until it has decayed away.

I'm all for taking the blocked load into consideration, but this issue
has to be resolved first. Which leads me on to the next thing.

Most of the work going on around energy awareness is based on the load
tracking to estimate task and cpu utilization. It seems that most of the
involved parties think that we need an unweighted variant of the tracked
load as well as tracking the running time of a task. The latter was part
of the original proposal by pjt and Ben, but wasn't used. It seems that
unweighted runnable tracking should be fairly easy to add to your
proposal, but I don't have an overview of whether it is possible to add
running tracking. Do you think that is possible?

Morten

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 0/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-17 23:26 [PATCH 0/2 v4] sched: Rewrite per entity runnable load average tracking Yuyang Du
                   ` (2 preceding siblings ...)
  2014-07-18 15:39 ` [PATCH 0/2 " Morten Rasmussen
@ 2014-07-20  5:46 ` Mike Galbraith
  2014-07-27 19:34   ` Yuyang Du
  3 siblings, 1 reply; 47+ messages in thread
From: Mike Galbraith @ 2014-07-20  5:46 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, peterz, linux-kernel, pjt, bsegall, arjan.van.de.ven,
	len.brown, rafael.j.wysocki, alan.cox, mark.gross, fengguang.wu

On Fri, 2014-07-18 at 07:26 +0800, Yuyang Du wrote: 
> Thanks to Morten, Ben, and Fengguang.
> 
> v4 changes:
> 
> - Insert memory barrier before writing cfs_rq->load_last_update_copy.
> - Fix typos.

My little desktop box says lovely minus signs have had their usual
effect on the general case (cgroups enabled but not in use). 

pipe-test scheduling cross core - full fastpath
3.0.101-default        3.753363 usecs/loop -- avg 3.770737 530.4 KHz   1.000
3.1.10-default         3.723843 usecs/loop -- avg 3.716058 538.2 KHz   1.014
3.2.51-default         3.728060 usecs/loop -- avg 3.710372 539.0 KHz   1.016
3.3.8-default          3.906174 usecs/loop -- avg 3.900399 512.8 KHz    .966
3.4.97-default         3.864158 usecs/loop -- avg 3.865281 517.4 KHz    .975
3.5.7-default          3.967481 usecs/loop -- avg 3.962757 504.7 KHz    .951
3.6.11-default         3.851186 usecs/loop -- avg 3.845321 520.1 KHz    .980
3.7.10-default         3.777869 usecs/loop -- avg 3.776913 529.5 KHz    .998
3.8.13-default         4.049927 usecs/loop -- avg 4.041905 494.8 KHz    .932
3.9.11-default         3.973046 usecs/loop -- avg 3.974208 503.2 KHz    .948
3.10.27-default        4.189598 usecs/loop -- avg 4.189298 477.4 KHz    .900
3.11.10-default        4.293870 usecs/loop -- avg 4.297979 465.3 KHz    .877
3.12.24-default        4.321570 usecs/loop -- avg 4.321961 462.8 KHz    .872
3.13.11-default        4.137845 usecs/loop -- avg 4.134863 483.7 KHz    .911
3.14.10-default        4.145348 usecs/loop -- avg 4.139987 483.1 KHz    .910            
3.15.4-default         4.355594 usecs/loop -- avg 4.351961 459.6 KHz    .866             
3.16.0-default         4.537279 usecs/loop -- avg 4.543532 440.2 KHz    .829     1.000   
3.16.0-default+v4      4.343542 usecs/loop -- avg 4.318803 463.1 KHz    .873     1.052

Extending max depth to 5, cost of depth++ seemingly did not change
despite repeatable dip at depth = 3 (gremlins at play).

mount -t cgroup -o cpu none /cgroups
mkdir -p /cgroups/a/b/c/d/e

cgexec -g cpu:a pipe-test 1
3.16.0-default         5.016373 usecs/loop -- avg 5.021115 398.3 KHz   1.000
3.16.0-default+v4      4.978625 usecs/loop -- avg 4.977381 401.8 KHz   1.008

cgexec -g cpu:a/b pipe-test 1
3.16.0-default         5.543566 usecs/loop -- avg 5.565475 359.4 KHz   1.000
3.16.0-default+v4      5.597399 usecs/loop -- avg 5.570444 359.0 KHz    .998

cgexec -g cpu:a/b/c pipe-test 1
3.16.0-default         6.092256 usecs/loop -- avg 6.094186 328.2 KHz   1.000
3.16.0-default+v4      6.294858 usecs/loop -- avg 6.338453 315.5 KHz    .961

cgexec -g cpu:a/b/c/d pipe-test 1
3.16.0-default         6.719044 usecs/loop -- avg 6.717118 297.7 KHz   1.000
3.16.0-default+v4      6.788559 usecs/loop -- avg 6.710102 298.1 KHz   1.001

cgexec -g cpu:a/b/c/d/e pipe-test 1
3.16.0-default         7.186431 usecs/loop -- avg 7.194884 278.0 KHz   1.000
3.16.0-default+v4      7.368443 usecs/loop -- avg 7.250371 275.8 KHz    .992



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-18  9:43   ` Vincent Guittot
@ 2014-07-27 17:36     ` Yuyang Du
  2014-07-29  9:12       ` Vincent Guittot
  0 siblings, 1 reply; 47+ messages in thread
From: Yuyang Du @ 2014-07-27 17:36 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, Peter Zijlstra, linux-kernel, Paul Turner,
	Benjamin Segall, arjan.van.de.ven, Len Brown, rafael.j.wysocki,
	alan.cox, Gross, Mark, fengguang.wu

Hi Vincent,

On Fri, Jul 18, 2014 at 11:43:00AM +0200, Vincent Guittot wrote:
> > @@ -2291,23 +2299,24 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
> >         delta >>= 10;
> >         if (!delta)
> >                 return 0;
> > -       sa->last_runnable_update = now;
> > +       sa->last_update_time = now;
> >
> >         /* delta_w is the amount already accumulated against our next period */
> > -       delta_w = sa->runnable_avg_period % 1024;
> > +       delta_w = sa->period_contrib;
> >         if (delta + delta_w >= 1024) {
> > -               /* period roll-over */
> >                 decayed = 1;
> >
> > +               /* how much left for next period will start over, we don't know yet */
> > +               sa->period_contrib = 0;
> > +
> >                 /*
> >                  * Now that we know we're crossing a period boundary, figure
> >                  * out how much from delta we need to complete the current
> >                  * period and accrue it.
> >                  */
> >                 delta_w = 1024 - delta_w;
> > -               if (runnable)
> > -                       sa->runnable_avg_sum += delta_w;
> > -               sa->runnable_avg_period += delta_w;
> > +               if (w)
> > +                       sa->load_sum += w * delta_w;
> 
> Do you really need to have *w for computing the load_sum ? can't you
> only use it when computing the load_avg ?
> 
> sa->load_avg = div_u64(sa->load_sum * w , LOAD_AVG_MAX)
> 

For a task, assuming its load.weight does not change much, yes, we can. But in theory, a task's
load.weight can change, and *w in load_sum can take that change into account. For group entities
and cfs_rq, load.weight changes all the time, and I don't know how to do it without *w
in load_sum.
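
To make the trade-off concrete, the difference is just where the weight is
applied (a sketch reusing the names from the hunk above; the load_avg line for
the patch's variant is inferred, not copied from the patch):

	/* weight folded into the sum, as in this patch (weight changes are averaged in): */
	sa->load_sum += w * delta_w;
	sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX);

	/* weight applied only when forming the average, as Vincent suggests (re-weights instantly): */
	sa->load_sum += delta_w;
	sa->load_avg = div_u64(sa->load_sum * w, LOAD_AVG_MAX);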

Sorry for my unresponsiveness last week. I was on vacation and unfortunately failed to
connect to the VPN from where I was.

Thanks,
Yuyang

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 0/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-18 15:39 ` [PATCH 0/2 " Morten Rasmussen
@ 2014-07-27 19:02   ` Yuyang Du
  2014-07-28 10:38     ` Peter Zijlstra
  2014-07-30 10:13     ` Morten Rasmussen
  0 siblings, 2 replies; 47+ messages in thread
From: Yuyang Du @ 2014-07-27 19:02 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, peterz, linux-kernel, pjt, bsegall, arjan.van.de.ven,
	len.brown, rafael.j.wysocki, alan.cox, mark.gross, fengguang.wu

Hi Morten,

On Fri, Jul 18, 2014 at 04:39:31PM +0100, Morten Rasmussen wrote:
> 1. runnable_avg_period is removed
> 
> load_avg_contrib used to be runnable_avg_sum/runnable_avg_period scaled
> by the task load weight (priority). The runnable_avg_period is replaced
> by a constant in this patch set. The effect of that change is that task
> load tracking is no longer more sensitive early in the life of the task until
> it has built up some history. Tasks are now initialized to start out as
> if they have been runnable forever (>345ms). If this assumption about
> the task behavior is wrong it will take longer to converge to the true
> average than it did before. The upside is that it is more stable.

I think "Give new task start runnable values to heavy its load in infant time"
in general is good, with an emphasis on infant. Or from the opposite, making it
zero to let it gain runnable weight looks worse than full weight.
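
For illustration, "full" initial values under this scheme would presumably
amount to something like the following (a sketch; the function name is made up):

	/* hypothetical init: new task starts as if it had been runnable forever */
	static void init_task_load_avg(struct sched_entity *se)
	{
		struct sched_avg *sa = &se->avg;

		sa->last_update_time = 0;	/* set at first enqueue */
		sa->period_contrib = 0;
		sa->load_avg = se->load.weight;
		sa->load_sum = (u64)se->load.weight * LOAD_AVG_MAX;
	}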

> 2. runnable_load_avg and blocked_load_avg are combined
> 
> runnable_load_avg currently represents the sum of load_avg_contrib of
> all tasks on the rq, while blocked_load_avg is the sum of those tasks
> not on a runqueue. It makes perfect sense to consider the sum of both
> when calculating the load of a cpu, but we currently don't include
> blocked_load_avg. The reason for that is the priority scaling of the
> task load_avg_contrib may lead to under-utilization of cpus that
> occasionally have tiny high priority task running. You can easily have a
> task that takes 5% of cpu time but has a load_avg_contrib several times
> larger than a default priority task runnable 100% of the time.

So this is the effect of historical averaging and weight scaling, both of which
are just generally good, but may have bad cases.

> Another thing that might be an issue is that the blocked of a terminated
> task lives on for quite a while until has decayed away.

Good point. To do so, if I read correctly, we need to hook do_exit(), but we will
probably run into rq->lock issues.

What is the opinion/guidance from the maintainers/others?
 
> I'm all for taking the blocked load into consideration, but this issue
> has to be resolved first. Which leads me on to the next thing.
> 
> Most of the work going on around energy awareness is based on the load
> tracking to estimate task and cpu utilization. It seems that most of the
> involved parties think that we need an unweighted variant of the tracked
> load as well as tracking the running time of a task. The latter was part
> of the original proposal by pjt and Ben, but wasn't used. It seems that
> unweighted runnable tracking should be fairly easy to add to your
> proposal, but I don't have an overview of whether it is possible to add
> running tracking. Do you think that is possible?
> 

Running tracking is absolutely possible; it is just a matter of minimizing overhead
(how to do it along with runnable tracking for tasks and maybe for CPUs, but not for
cfs_rq) from an execution and code-cleanliness point of view. We can do it as soon as
it is needed.

Thanks,
Yuyang

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 0/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-20  5:46 ` Mike Galbraith
@ 2014-07-27 19:34   ` Yuyang Du
  2014-07-28  7:49     ` Mike Galbraith
  2014-07-28  8:55     ` Peter Zijlstra
  0 siblings, 2 replies; 47+ messages in thread
From: Yuyang Du @ 2014-07-27 19:34 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: mingo, peterz, linux-kernel, pjt, bsegall, arjan.van.de.ven,
	len.brown, rafael.j.wysocki, alan.cox, mark.gross, fengguang.wu

Thanks a lot, Mike.

Ben asked for this test, but actually I don't know how to get pipe-test, still
not even after googling it.

On Sun, Jul 20, 2014 at 07:46:23AM +0200, Mike Galbraith wrote:
> On Fri, 2014-07-18 at 07:26 +0800, Yuyang Du wrote: 
> > Thanks to Morten, Ben, and Fengguang.
> > 
> > v4 changes:
> > 
> > - Insert memory barrier before writing cfs_rq->load_last_update_copy.
> > - Fix typos.
> 
> My little desktop box says lovely minus signs have had their usual
> effect on the general case (cgroups enabled but not in use). 
> 
> pipe-test scheduling cross core - full fastpath
> 3.0.101-default        3.753363 usecs/loop -- avg 3.770737 530.4 KHz   1.000
> 3.1.10-default         3.723843 usecs/loop -- avg 3.716058 538.2 KHz   1.014
> 3.2.51-default         3.728060 usecs/loop -- avg 3.710372 539.0 KHz   1.016
> 3.3.8-default          3.906174 usecs/loop -- avg 3.900399 512.8 KHz    .966
> 3.4.97-default         3.864158 usecs/loop -- avg 3.865281 517.4 KHz    .975
> 3.5.7-default          3.967481 usecs/loop -- avg 3.962757 504.7 KHz    .951
> 3.6.11-default         3.851186 usecs/loop -- avg 3.845321 520.1 KHz    .980
> 3.7.10-default         3.777869 usecs/loop -- avg 3.776913 529.5 KHz    .998
> 3.8.13-default         4.049927 usecs/loop -- avg 4.041905 494.8 KHz    .932
> 3.9.11-default         3.973046 usecs/loop -- avg 3.974208 503.2 KHz    .948
> 3.10.27-default        4.189598 usecs/loop -- avg 4.189298 477.4 KHz    .900
> 3.11.10-default        4.293870 usecs/loop -- avg 4.297979 465.3 KHz    .877
> 3.12.24-default        4.321570 usecs/loop -- avg 4.321961 462.8 KHz    .872
> 3.13.11-default        4.137845 usecs/loop -- avg 4.134863 483.7 KHz    .911
> 3.14.10-default        4.145348 usecs/loop -- avg 4.139987 483.1 KHz    .910            
> 3.15.4-default         4.355594 usecs/loop -- avg 4.351961 459.6 KHz    .866             
> 3.16.0-default         4.537279 usecs/loop -- avg 4.543532 440.2 KHz    .829     1.000   
> 3.16.0-default+v4      4.343542 usecs/loop -- avg 4.318803 463.1 KHz    .873     1.052
> 
> Extending max depth to 5, cost of depth++ seemingly did not change
> despite repeatable dip at depth = 3 (gremlins at play).
> 
> mount -t cgroup o cpu none /cgroups
> mkdir -p /cgroups/a/b/c/d/e
> 
> cgexec -g cpu:a pipe-test 1
> 3.16.0-default         5.016373 usecs/loop -- avg 5.021115 398.3 KHz   1.000
> 3.16.0-default+v4      4.978625 usecs/loop -- avg 4.977381 401.8 KHz   1.008
> 
> cgexec -g cpu:a/b pipe-test 1
> 3.16.0-default         5.543566 usecs/loop -- avg 5.565475 359.4 KHz   1.000
> 3.16.0-default+v4      5.597399 usecs/loop -- avg 5.570444 359.0 KHz    .998
> 
> cgexec -g cpu:a/b/c pipe-test 1
> 3.16.0-default         6.092256 usecs/loop -- avg 6.094186 328.2 KHz   1.000
> 3.16.0-default+v4      6.294858 usecs/loop -- avg 6.338453 315.5 KHz    .961
> 
> cgexec -g cpu:a/b/c/d pipe-test 1
> 3.16.0-default         6.719044 usecs/loop -- avg 6.717118 297.7 KHz   1.000
> 3.16.0-default+v4      6.788559 usecs/loop -- avg 6.710102 298.1 KHz   1.001
> 
> cgexec -g cpu:a/b/c/d/e pipe-test 1
> 3.16.0-default         7.186431 usecs/loop -- avg 7.194884 278.0 KHz   1.000
> 3.16.0-default+v4      7.368443 usecs/loop -- avg 7.250371 275.8 KHz    .992
> 

So is the result flat compared to before, or a pass?

Thanks,
Yuyang

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 0/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-28  7:49     ` Mike Galbraith
@ 2014-07-28  0:01       ` Yuyang Du
  0 siblings, 0 replies; 47+ messages in thread
From: Yuyang Du @ 2014-07-28  0:01 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: mingo, peterz, linux-kernel, pjt, bsegall, arjan.van.de.ven,
	len.brown, rafael.j.wysocki, alan.cox, mark.gross, fengguang.wu

On Mon, Jul 28, 2014 at 09:49:18AM +0200, Mike Galbraith wrote:
> On Mon, 2014-07-28 at 03:34 +0800, Yuyang Du wrote: 
> > Thanks a lot, Mike.
> > 
> > Ben asked for this test, but actually I don't know how to get pipe-test, still
> > not even after google it.
> 
> You could write it trivially, it's just pipe ping-pong.  I can send you
> a copy of the one I use if you like.  Ingo wrote it way back in the dark
> ages, I use it to watch out for fastpath lard accumulation.

Yes, please, drop me a copy.

> > So the result is flat compared to before or a pass?
> 
> Yeah, my little box said it was a deep cgroup price noop, but a modest
> win for root.  I love minus signs, they work great, though sometimes in
> highly mysterious ways :)
> 

Minus signs are great, and I hope this patchset is not mysterious :)

Thanks,
Yuyang

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 0/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-27 19:34   ` Yuyang Du
@ 2014-07-28  7:49     ` Mike Galbraith
  2014-07-28  0:01       ` Yuyang Du
  2014-07-28  8:55     ` Peter Zijlstra
  1 sibling, 1 reply; 47+ messages in thread
From: Mike Galbraith @ 2014-07-28  7:49 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, peterz, linux-kernel, pjt, bsegall, arjan.van.de.ven,
	len.brown, rafael.j.wysocki, alan.cox, mark.gross, fengguang.wu

On Mon, 2014-07-28 at 03:34 +0800, Yuyang Du wrote: 
> Thanks a lot, Mike.
> 
> Ben asked for this test, but actually I don't know how to get pipe-test, still
> not even after google it.

You could write it trivially, it's just pipe ping-pong.  I can send you
a copy of the one I use if you like.  Ingo wrote it way back in the dark
ages, I use it to watch out for fastpath lard accumulation.
> So the result is flat compared to before or a pass?

Yeah, my little box said it was a deep cgroup price noop, but a modest
win for root.  I love minus signs, they work great, though sometimes in
highly mysterious ways :)

-Mike


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 0/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-27 19:34   ` Yuyang Du
  2014-07-28  7:49     ` Mike Galbraith
@ 2014-07-28  8:55     ` Peter Zijlstra
  1 sibling, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2014-07-28  8:55 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Mike Galbraith, mingo, linux-kernel, pjt, bsegall,
	arjan.van.de.ven, len.brown, rafael.j.wysocki, alan.cox,
	mark.gross, fengguang.wu

[-- Attachment #1: Type: text/plain, Size: 274 bytes --]

On Mon, Jul 28, 2014 at 03:34:53AM +0800, Yuyang Du wrote:
> Thanks a lot, Mike.
> 
> Ben asked for this test, but actually I don't know how to get pipe-test, still
> not even after google it.

perf bench sched pipe

I realize that's a mouthful, but there it is.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 0/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-27 19:02   ` Yuyang Du
@ 2014-07-28 10:38     ` Peter Zijlstra
  2014-07-29  1:17       ` Yuyang Du
  2014-07-30 10:13     ` Morten Rasmussen
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2014-07-28 10:38 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Morten Rasmussen, mingo, linux-kernel, pjt, bsegall,
	arjan.van.de.ven, len.brown, rafael.j.wysocki, alan.cox,
	mark.gross, fengguang.wu

[-- Attachment #1: Type: text/plain, Size: 619 bytes --]

On Mon, Jul 28, 2014 at 03:02:37AM +0800, Yuyang Du wrote:
> > Another thing that might be an issue is that the blocked of a terminated
> > task lives on for quite a while until has decayed away.
> 
> Good point. To do so, if I read correctly, we need to hook do_exit(), but probably
> we are gonna encounter rq->lock issue.
> 
> What is the opinion/guidance from the maintainers/others?

So the entire point of this per entity tracking was to make sure load
numbers reflect reality. We account migrations etc., it would be weird
to then throw all that out the window and let task exit accumulate crap.



[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-17 23:26 ` [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking Yuyang Du
  2014-07-18  9:43   ` Vincent Guittot
@ 2014-07-28 10:48   ` Peter Zijlstra
  2014-07-29  0:56     ` Yuyang Du
  2014-07-28 11:39   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2014-07-28 10:48 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, linux-kernel, pjt, bsegall, arjan.van.de.ven, len.brown,
	rafael.j.wysocki, alan.cox, mark.gross, fengguang.wu

[-- Attachment #1: Type: text/plain, Size: 2049 bytes --]

On Fri, Jul 18, 2014 at 07:26:06AM +0800, Yuyang Du wrote:
> @@ -665,20 +660,27 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  }
>  
>  #ifdef CONFIG_SMP
> -static unsigned long task_h_load(struct task_struct *p);
>  
> -static inline void __update_task_entity_contrib(struct sched_entity *se);
> +/* dependent on LOAD_AVG_PERIOD, see below */
> +#define LOAD_AVG_MAX 47742 /* maximum possible load avg */

Please don't separate this from the rest of the values it belongs to. If
you really have to, move the entire block.

> @@ -2071,13 +2073,9 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
>  	long tg_weight;
>  
>  	/*
> -	 * Use this CPU's actual weight instead of the last load_contribution
> -	 * to gain a more accurate current total weight. See
> -	 * update_cfs_rq_load_contribution().
> +	 * Use this CPU's load average instead of actual weight
>  	 */
>  	tg_weight = atomic_long_read(&tg->load_avg);
> -	tg_weight -= cfs_rq->tg_load_contrib;
> -	tg_weight += cfs_rq->load.weight;

I don't think that comment makes any sense after this. The comment was
there to explain the -=, += things, but that's all gone so it's pretty
trivial now, and i++ /* inc by one */ comments are not useful.

> @@ -2181,7 +2178,7 @@ static const u32 runnable_avg_yN_sum[] = {
>   * Approximate:
>   *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
>   */
> -static __always_inline u64 decay_load(u64 val, u64 n)
> +static __always_inline u64 decay_load32(u64 val, u64 n)
>  {
>  	unsigned int local_n;
>  
> @@ -2210,6 +2207,18 @@ static __always_inline u64 decay_load(u64 val, u64 n)
>  	return val >> 32;
>  }
>  
> +static __always_inline u64 decay_load(u64 val, u64 n)
> +{
> +	if (likely(val <= UINT_MAX))
> +		val = decay_load32(val, n);
> +	else {
> +		val *= (u32)decay_load32(1 << 15, n);
> +		val >>= 15;
> +	}
> +
> +	return val;
> +}

Please just use mul_u64_u32_shr().

/me continues reading the rest of it..

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-17 23:26 ` [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking Yuyang Du
  2014-07-18  9:43   ` Vincent Guittot
  2014-07-28 10:48   ` [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking Peter Zijlstra
@ 2014-07-28 11:39   ` Peter Zijlstra
  2014-07-29  1:09     ` Yuyang Du
  2014-07-28 12:01   ` Peter Zijlstra
  2014-07-28 13:51   ` Peter Zijlstra
  4 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2014-07-28 11:39 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, linux-kernel, pjt, bsegall, arjan.van.de.ven, len.brown,
	rafael.j.wysocki, alan.cox, mark.gross, fengguang.wu

[-- Attachment #1: Type: text/plain, Size: 2584 bytes --]

On Fri, Jul 18, 2014 at 07:26:06AM +0800, Yuyang Du wrote:

> -static inline void __update_tg_runnable_avg(struct sched_avg *sa,
> -						  struct cfs_rq *cfs_rq)
> -{
> -	struct task_group *tg = cfs_rq->tg;
> -	long contrib;
> -
> -	/* The fraction of a cpu used by this cfs_rq */
> -	contrib = div_u64((u64)sa->runnable_avg_sum << NICE_0_SHIFT,
> -			  sa->runnable_avg_period + 1);
> -	contrib -= cfs_rq->tg_runnable_contrib;
> -
> -	if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) {
> -		atomic_add(contrib, &tg->runnable_avg);
> -		cfs_rq->tg_runnable_contrib += contrib;
> -	}
> -}


> -static inline void __update_group_entity_contrib(struct sched_entity *se)
> +static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
>  {
> +	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>  
> +	if (delta) {
> +		atomic_long_add(delta, &cfs_rq->tg->load_avg);
> +		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
>  	}
>  }

We talked about this before, you made that an unconditional atomic op on
an already hot line.

You need some words on why this isn't a problem. Either in a comment or
in the Changelog. You cannot leave such changes undocumented.

> +#define subtract_until_zero(minuend, subtrahend)	\
> +	(subtrahend < minuend ? minuend - subtrahend : 0)

WTH is a minuend or subtrahend? Are you a wordsmith in your spare time
and like to make up your own words?

Also, isn't writing: x = max(0, x-y), far more readable to begin with?

> +/*
> + * Group cfs_rq's load_avg is used for task_h_load and update_cfs_share
> + * calc.
> + */
> +static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>  {
> +	int decayed;
>  
> +	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
> +		long r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
> +		cfs_rq->avg.load_avg = subtract_until_zero(cfs_rq->avg.load_avg, r);
> +		r *= LOAD_AVG_MAX;
> +		cfs_rq->avg.load_sum = subtract_until_zero(cfs_rq->avg.load_sum, r);
>  	}
>  
> +	decayed = __update_load_avg(now, &cfs_rq->avg, cfs_rq->load.weight);
>  
> +#ifndef CONFIG_64BIT
> +	if (cfs_rq->avg.last_update_time != cfs_rq->load_last_update_time_copy) {
> +		smp_wmb();
> +		cfs_rq->load_last_update_time_copy = cfs_rq->avg.last_update_time;
> +	}
> +#endif
>  
> +	return decayed;
> +}

It's a bit unfortunate that we update the copy in a different function than
the original, but I think I see why you did that. But is it at all
likely that we do not need to update? That is, does that compare make
any sense?



[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-17 23:26 ` [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking Yuyang Du
                     ` (2 preceding siblings ...)
  2014-07-28 11:39   ` Peter Zijlstra
@ 2014-07-28 12:01   ` Peter Zijlstra
  2014-07-28 13:51   ` Peter Zijlstra
  4 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2014-07-28 12:01 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, linux-kernel, pjt, bsegall, arjan.van.de.ven, len.brown,
	rafael.j.wysocki, alan.cox, mark.gross, fengguang.wu

[-- Attachment #1: Type: text/plain, Size: 1435 bytes --]

On Fri, Jul 18, 2014 at 07:26:06AM +0800, Yuyang Du wrote:
> -static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
> +/* Add the load generated by se into cfs_rq's load average */
> +static inline void enqueue_entity_load_avg(struct sched_entity *se)
>  {
> +	struct sched_avg *sa = &se->avg;
> +	struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +	u64 now = cfs_rq_clock_task(cfs_rq);
> +	int migrated = 0, decayed;
>  
> +	if (sa->last_update_time == 0) {
> +		sa->last_update_time = now;
>  
> +		if (entity_is_task(se))
> +			migrated = 1;
>  	}
> +	else
> +		__update_load_avg(now, sa, se->on_rq * se->load.weight);

That's a coding style fail, that should look like:

	if () {
	} else {
	}

>  
> +	decayed = update_cfs_rq_load_avg(now, cfs_rq);
>  
> +	if (migrated) {
> +		cfs_rq->avg.load_avg += sa->load_avg;
> +		cfs_rq->avg.load_sum += sa->load_sum;
>  	}
>  
> +	if (decayed || migrated)
> +		update_tg_load_avg(cfs_rq);
>  }


> @@ -2764,7 +2595,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  	 * Update run-time statistics of the 'current'.
>  	 */
>  	update_curr(cfs_rq);
> -	enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
> +	enqueue_entity_load_avg(se);
>  	account_entity_enqueue(cfs_rq, se);
>  	update_cfs_shares(cfs_rq);
>  

Why did you remove the cfs_rq argument only to have to re-compute it?

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-17 23:26 ` [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking Yuyang Du
                     ` (3 preceding siblings ...)
  2014-07-28 12:01   ` Peter Zijlstra
@ 2014-07-28 13:51   ` Peter Zijlstra
  2014-07-28 16:58     ` bsegall
  4 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2014-07-28 13:51 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, linux-kernel, pjt, bsegall, arjan.van.de.ven, len.brown,
	rafael.j.wysocki, alan.cox, mark.gross, fengguang.wu

[-- Attachment #1: Type: text/plain, Size: 3234 bytes --]



> +static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>  {
> +	int decayed;
>  
> +	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
> +		long r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
> +		cfs_rq->avg.load_avg = subtract_until_zero(cfs_rq->avg.load_avg, r);
> +		r *= LOAD_AVG_MAX;
> +		cfs_rq->avg.load_sum = subtract_until_zero(cfs_rq->avg.load_sum, r);
>  	}
>  
> +	decayed = __update_load_avg(now, &cfs_rq->avg, cfs_rq->load.weight);
>  
> +#ifndef CONFIG_64BIT
> +	if (cfs_rq->avg.last_update_time != cfs_rq->load_last_update_time_copy) {
> +		smp_wmb();
> +		cfs_rq->load_last_update_time_copy = cfs_rq->avg.last_update_time;
> +	}
> +#endif
>  
> +	return decayed;
> +}

So on every cfs_rq update we first process the 'pending' removals, then
decay and then store the current timestamp.

> +static inline void enqueue_entity_load_avg(struct sched_entity *se)
>  {
> +	struct sched_avg *sa = &se->avg;
> +	struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +	u64 now = cfs_rq_clock_task(cfs_rq);
> +	int migrated = 0, decayed;
>  
> +	if (sa->last_update_time == 0) {
> +		sa->last_update_time = now;
>  
> +		if (entity_is_task(se))
> +			migrated = 1;
>  	}
> +	else
> +		__update_load_avg(now, sa, se->on_rq * se->load.weight);
>  
> +	decayed = update_cfs_rq_load_avg(now, cfs_rq);
>  
> +	if (migrated) {
> +		cfs_rq->avg.load_avg += sa->load_avg;
> +		cfs_rq->avg.load_sum += sa->load_sum;
>  	}
>  
> +	if (decayed || migrated)
> +		update_tg_load_avg(cfs_rq);
>  }

On enqueue we add ourselves to the cfs_rq.. and assume the entity is
'current' wrt updates since we did that when we just pulled it from the
old rq.

> @@ -4551,18 +4382,34 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
>  {
>  	struct sched_entity *se = &p->se;
>  	struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +	u64 last_update_time;
>  
>  	/*
> +	 * Task on old CPU catches up with its old cfs_rq, and subtract itself from
> +	 * the cfs_rq (task must be off the queue now).
>  	 */
> +#ifndef CONFIG_64BIT
> +	u64 last_update_time_copy;
> +
> +	do {
> +		last_update_time_copy = cfs_rq->load_last_update_time_copy;
> +		smp_rmb();
> +		last_update_time = cfs_rq->avg.last_update_time;
> +	} while (last_update_time != last_update_time_copy);
> +#else
> +	last_update_time = cfs_rq->avg.last_update_time;
> +#endif
> +	__update_load_avg(last_update_time, &se->avg, 0);
> +	atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
> +
> +	/*
> +	 * We are supposed to update the task to "current" time, then its up to date
> +	 * and ready to go to new CPU/cfs_rq. But we have difficulty in getting
> +	 * what current time is, so simply throw away the out-of-date time. This
> +	 * will result in the wakee task is less decayed, but giving the wakee more
> +	 * load sounds not bad.
> +	 */
> +	se->avg.last_update_time = 0;
>  
>  	/* We have migrated, no longer consider this task hot */
>  	se->exec_start = 0;


And here we try and make good on that assumption. The thing I worry
about is what happens if the machine is entirely idle...

What guarantees a semi up-to-date cfs_rq->avg.last_update_time?

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-28 13:51   ` Peter Zijlstra
@ 2014-07-28 16:58     ` bsegall
  2014-07-28 17:19       ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: bsegall @ 2014-07-28 16:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yuyang Du, mingo, linux-kernel, pjt, arjan.van.de.ven, len.brown,
	rafael.j.wysocki, alan.cox, mark.gross, fengguang.wu

Peter Zijlstra <peterz@infradead.org> writes:

>> @@ -4551,18 +4382,34 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
>>  {
>>  	struct sched_entity *se = &p->se;
>>  	struct cfs_rq *cfs_rq = cfs_rq_of(se);
>> +	u64 last_update_time;
>>  
>>  	/*
>> +	 * Task on old CPU catches up with its old cfs_rq, and subtract itself from
>> +	 * the cfs_rq (task must be off the queue now).
>>  	 */
>> +#ifndef CONFIG_64BIT
>> +	u64 last_update_time_copy;
>> +
>> +	do {
>> +		last_update_time_copy = cfs_rq->load_last_update_time_copy;
>> +		smp_rmb();
>> +		last_update_time = cfs_rq->avg.last_update_time;
>> +	} while (last_update_time != last_update_time_copy);
>> +#else
>> +	last_update_time = cfs_rq->avg.last_update_time;
>> +#endif
>> +	__update_load_avg(last_update_time, &se->avg, 0);
>> +	atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
>> +
>> +	/*
>> +	 * We are supposed to update the task to "current" time, then its up to date
>> +	 * and ready to go to new CPU/cfs_rq. But we have difficulty in getting
>> +	 * what current time is, so simply throw away the out-of-date time. This
>> +	 * will result in the wakee task is less decayed, but giving the wakee more
>> +	 * load sounds not bad.
>> +	 */
>> +	se->avg.last_update_time = 0;
>>  
>>  	/* We have migrated, no longer consider this task hot */
>>  	se->exec_start = 0;
>
>
> And here we try and make good on that assumption. The thing I worry
> about is what happens if the machine is entirely idle...
>
> What guarantees an semi up-to-date cfs_rq->avg.last_update_time.

update_blocked_averages I think should do just as good a job as the old
code, which isn't perfect but is about as good as you can get worst case.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-28 16:58     ` bsegall
@ 2014-07-28 17:19       ` Peter Zijlstra
  2014-07-29  1:13         ` Yuyang Du
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2014-07-28 17:19 UTC (permalink / raw)
  To: bsegall
  Cc: Yuyang Du, mingo, linux-kernel, pjt, arjan.van.de.ven, len.brown,
	rafael.j.wysocki, alan.cox, mark.gross, fengguang.wu

[-- Attachment #1: Type: text/plain, Size: 2359 bytes --]

On Mon, Jul 28, 2014 at 09:58:19AM -0700, bsegall@google.com wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> >> @@ -4551,18 +4382,34 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
> >>  {
> >>  	struct sched_entity *se = &p->se;
> >>  	struct cfs_rq *cfs_rq = cfs_rq_of(se);
> >> +	u64 last_update_time;
> >>  
> >>  	/*
> >> +	 * Task on old CPU catches up with its old cfs_rq, and subtract itself from
> >> +	 * the cfs_rq (task must be off the queue now).
> >>  	 */
> >> +#ifndef CONFIG_64BIT
> >> +	u64 last_update_time_copy;
> >> +
> >> +	do {
> >> +		last_update_time_copy = cfs_rq->load_last_update_time_copy;
> >> +		smp_rmb();
> >> +		last_update_time = cfs_rq->avg.last_update_time;
> >> +	} while (last_update_time != last_update_time_copy);
> >> +#else
> >> +	last_update_time = cfs_rq->avg.last_update_time;
> >> +#endif
> >> +	__update_load_avg(last_update_time, &se->avg, 0);
> >> +	atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
> >> +
> >> +	/*
> >> +	 * We are supposed to update the task to "current" time, then its up to date
> >> +	 * and ready to go to new CPU/cfs_rq. But we have difficulty in getting
> >> +	 * what current time is, so simply throw away the out-of-date time. This
> >> +	 * will result in the wakee task is less decayed, but giving the wakee more
> >> +	 * load sounds not bad.
> >> +	 */
> >> +	se->avg.last_update_time = 0;
> >>  
> >>  	/* We have migrated, no longer consider this task hot */
> >>  	se->exec_start = 0;
> >
> >
> > And here we try and make good on that assumption. The thing I worry
> > about is what happens if the machine is entirely idle...
> >
> > What guarantees an semi up-to-date cfs_rq->avg.last_update_time.
> 
> update_blocked_averages I think should do just as good a job as the old
> code, which isn't perfect but is about as good as you can get worst case.

Right, that's called from rebalance_domains() which should more or less
update this value on tick boundaries or thereabouts for most 'active'
cpus.

But if the entire machine is idle, the first wakeup (if it's a cross-cpu one)
might see a very stale timestamp.

If we can fix that, that would be good I suppose, but I'm not
immediately seeing something pretty there, but you're right, it'd not be
worse than the current situation.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-28 10:48   ` [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking Peter Zijlstra
@ 2014-07-29  0:56     ` Yuyang Du
  2014-07-29 13:15       ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Yuyang Du @ 2014-07-29  0:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, pjt, bsegall, arjan.van.de.ven, len.brown,
	rafael.j.wysocki, alan.cox, mark.gross, fengguang.wu

On Mon, Jul 28, 2014 at 12:48:37PM +0200, Peter Zijlstra wrote:
> > +static __always_inline u64 decay_load(u64 val, u64 n)
> > +{
> > +	if (likely(val <= UINT_MAX))
> > +		val = decay_load32(val, n);
> > +	else {
> > +		val *= (u32)decay_load32(1 << 15, n);
> > +		val >>= 15;
> > +	}
> > +
> > +	return val;
> > +}
> 
> Please just use mul_u64_u32_shr().
> 
> /me continues reading the rest of it..

Good. Since 128-bit arithmetic is considered in mul_u64_u32_shr(), load_sum can
afford more tasks :)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-28 11:39   ` Peter Zijlstra
@ 2014-07-29  1:09     ` Yuyang Du
  2014-07-29 13:19       ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Yuyang Du @ 2014-07-29  1:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, pjt, bsegall, arjan.van.de.ven, len.brown,
	rafael.j.wysocki, alan.cox, mark.gross, fengguang.wu

On Mon, Jul 28, 2014 at 01:39:39PM +0200, Peter Zijlstra wrote:
> > -static inline void __update_group_entity_contrib(struct sched_entity *se)
> > +static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> >  {
> > +	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> >  
> > +	if (delta) {
> > +		atomic_long_add(delta, &cfs_rq->tg->load_avg);
> > +		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
> >  	}
> >  }
> 
> We talked about this before, you made that an unconditional atomic op on
> an already hot line.
> 
> You need some words on why this isn't a problem. Either in a comment or
> in the Changelog. You cannot leave such changes undocumented.
 
I am all for not updating a trivial delta, e.g., 1 or 2. I just had no theory
for selecting a "good" threshold.

The current code uses 1/8 or 1/64 of the contribution. Though it is not a fair comparison,
because how the current tg load is calculated is a big story (no offense), I choose
1/64 as the threshold.
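
Concretely, with a 1/64 threshold (mirroring the removed
__update_tg_runnable_avg() pattern above) the update would be roughly:

	static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
	{
		long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

		/* only touch the shared tg->load_avg cacheline for a big enough change */
		if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
			atomic_long_add(delta, &cfs_rq->tg->load_avg);
			cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
		}
	}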

> > +#define subtract_until_zero(minuend, subtrahend)	\
> > +	(subtrahend < minuend ? minuend - subtrahend : 0)
> 
> WTH is a minuend or subtrahend? Are you a wordsmith in your spare time
> and like to make up your own words?
> 
> Also, isn't writing: x = max(0, x-y), far more readable to begin with?
> 

Ok. IIUC, max() does not handle negative numbers very well, and we don't need the type
overhead in max(), so I will still use my macro, but I won't be a wordsmith again :)

> > +/*
> > + * Group cfs_rq's load_avg is used for task_h_load and update_cfs_share
> > + * calc.
> > + */
> > +static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> >  {
> > +	int decayed;
> >  
> > +	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
> > +		long r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
> > +		cfs_rq->avg.load_avg = subtract_until_zero(cfs_rq->avg.load_avg, r);
> > +		r *= LOAD_AVG_MAX;
> > +		cfs_rq->avg.load_sum = subtract_until_zero(cfs_rq->avg.load_sum, r);
> >  	}
> >  
> > +	decayed = __update_load_avg(now, &cfs_rq->avg, cfs_rq->load.weight);
> >  
> > +#ifndef CONFIG_64BIT
> > +	if (cfs_rq->avg.last_update_time != cfs_rq->load_last_update_time_copy) {
> > +		smp_wmb();
> > +		cfs_rq->load_last_update_time_copy = cfs_rq->avg.last_update_time;
> > +	}
> > +#endif
> >  
> > +	return decayed;
> > +}
> 
> Its a bit unfortunate that we update the copy in another function than
> the original, but I think I see why you did that. But is it at all
> likely that we do not need to update? That is, does that compare make
> any sense?

I think we can assume last_update_time will mostly have changed, because it stays
unchanged in only two cases: 1) a negative delta time, 2) within one period (~1ms); these
two cases are seemingly a minority. So yes, we can save the compare.
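
That is, the copy update could simply be made unconditional, something like:

#ifndef CONFIG_64BIT
	smp_wmb();
	cfs_rq->load_last_update_time_copy = cfs_rq->avg.last_update_time;
#endif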


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-28 17:19       ` Peter Zijlstra
@ 2014-07-29  1:13         ` Yuyang Du
  0 siblings, 0 replies; 47+ messages in thread
From: Yuyang Du @ 2014-07-29  1:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: bsegall, mingo, linux-kernel, pjt, arjan.van.de.ven, len.brown,
	rafael.j.wysocki, alan.cox, mark.gross, fengguang.wu

On Mon, Jul 28, 2014 at 07:19:09PM +0200, Peter Zijlstra wrote:
> > > And here we try and make good on that assumption. The thing I worry
> > > about is what happens if the machine is entirely idle...
> > >
> > > What guarantees an semi up-to-date cfs_rq->avg.last_update_time.
> > 
> > update_blocked_averages I think should do just as good a job as the old
> > code, which isn't perfect but is about as good as you can get worst case.
> 
> Right, that's called from rebalance_domains() which should more or less
> update this value on tick boundaries or thereabouts for most 'active'
> cpus.
> 
> But if the entire machine is idle, the first wakeup (if its a x-cpu one)
> might see a very stale timestamp.
> 
> If we can fix that, that would be good I suppose, but I'm not
> immediately seeing something pretty there, but you're right, it'd not be
> worse than the current situation.

What matters is that the time is up-to-date before load_avg is actually used. So yes, we should
have already achieved that.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 0/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-28 10:38     ` Peter Zijlstra
@ 2014-07-29  1:17       ` Yuyang Du
  2014-07-29 13:06         ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Yuyang Du @ 2014-07-29  1:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Morten Rasmussen, mingo, linux-kernel, pjt, bsegall,
	arjan.van.de.ven, len.brown, rafael.j.wysocki, alan.cox,
	mark.gross, fengguang.wu

On Mon, Jul 28, 2014 at 12:38:12PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 28, 2014 at 03:02:37AM +0800, Yuyang Du wrote:
> > > Another thing that might be an issue is that the blocked of a terminated
> > > task lives on for quite a while until has decayed away.
> > 
> > Good point. To do so, if I read correctly, we need to hook do_exit(), but probably
> > we are gonna encounter rq->lock issue.
> > 
> > What is the opinion/guidance from the maintainers/others?
> 
> So the entire point of this per entity tracking was to make sure load
> numbers reflect reality. We account migrations etc., it would be weird
> to then throw all that out the window and let task exit accumulate crap.
> 

Yes. So I will hook into do_exit(). Hope I did it right. Likewise, should group entities also
be handled in group destroy and group offline?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-29  9:12       ` Vincent Guittot
@ 2014-07-29  1:43         ` Yuyang Du
  2014-07-29 13:17           ` Vincent Guittot
  2014-07-29  9:39         ` Peter Zijlstra
  1 sibling, 1 reply; 47+ messages in thread
From: Yuyang Du @ 2014-07-29  1:43 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, Peter Zijlstra, linux-kernel, Paul Turner,
	Benjamin Segall, arjan.van.de.ven, Len Brown, rafael.j.wysocki,
	alan.cox, Gross, Mark, fengguang.wu

On Tue, Jul 29, 2014 at 11:12:37AM +0200, Vincent Guittot wrote:
> >>
> >> Do you really need to have *w for computing the load_sum ? can't you
> >> only use it when computing the load_avg ?
> >>
> >> sa->load_avg = div_u64(sa->load_sum * w , LOAD_AVG_MAX)
> >>
> >
> > For task, assuming its load.weight does not change much, yes, we can. But in theory, task's
> 
> I would even say that the load_avg of a task should not be impacted by
> an old priority value. Once, the priority of a task is changed, we
> should only take into account this new priority to weight the load_avg
> of the task
> 
> > load.weight can change, and *w in load_sum can take into that change. For group entity
> > and cfs_rq, its load.weight changes all the time, I don't know how to do it without *w
> > for load_sum.
> 
> IMHO, we should apply the same policy than the one i mentioned for
> task. So the load_avg of an entity or a cfs_rq will not be disturbed
> by an old but no more valid weight
> 

Well, I see your point. But the problem is that what matters is load_avg vs. load_avg, not a
load_avg by itself. So if load_avg1 discards the old weight when its weight changes, but load_avg2
has had no weight change (or has had one too), the comparison of load_avg1 vs. load_avg2 is not
fair; it is too impacted by the new weight. The point is, we count history, so count the
real history, which is the whole point of why we count the history. Make sense?

Thanks,
Yuyang

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-29  9:39         ` Peter Zijlstra
@ 2014-07-29  1:53           ` Yuyang Du
  2014-07-29 13:35             ` Peter Zijlstra
                               ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Yuyang Du @ 2014-07-29  1:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, mingo, linux-kernel, Paul Turner,
	Benjamin Segall, arjan.van.de.ven, Len Brown, rafael.j.wysocki,
	alan.cox, Gross, Mark, fengguang.wu

On Tue, Jul 29, 2014 at 11:39:11AM +0200, Peter Zijlstra wrote:
> > > For task, assuming its load.weight does not change much, yes, we can. But in theory, task's
> > 
> > I would even say that the load_avg of a task should not be impacted by
> > an old priority value. Once, the priority of a task is changed, we
> > should only take into account this new priority to weight the load_avg
> > of the task
> 
> So for tasks I would immediately agree, and I think for groups too,
> seeing how the group weight is based off of this avg, if you then
> include the old weight we'll get a feedback loop. This might not be
> desired as it would counteract the SMP movement of tasks.

Only by including the old weight do we get the *right* feedback. Say that until the
weight is changed we are balanced; the changed weight then leads to imbalance. Without the
old weight, the imbalance is multiplied by the history, as if we had never been
balanced.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-27 17:36     ` Yuyang Du
@ 2014-07-29  9:12       ` Vincent Guittot
  2014-07-29  1:43         ` Yuyang Du
  2014-07-29  9:39         ` Peter Zijlstra
  0 siblings, 2 replies; 47+ messages in thread
From: Vincent Guittot @ 2014-07-29  9:12 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, Peter Zijlstra, linux-kernel, Paul Turner,
	Benjamin Segall, arjan.van.de.ven, Len Brown, rafael.j.wysocki,
	alan.cox, Gross, Mark, fengguang.wu

On 27 July 2014 19:36, Yuyang Du <yuyang.du@intel.com> wrote:
> Hi Vincent,
>
> On Fri, Jul 18, 2014 at 11:43:00AM +0200, Vincent Guittot wrote:
>> > @@ -2291,23 +2299,24 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
>> >         delta >>= 10;
>> >         if (!delta)
>> >                 return 0;
>> > -       sa->last_runnable_update = now;
>> > +       sa->last_update_time = now;
>> >
>> >         /* delta_w is the amount already accumulated against our next period */
>> > -       delta_w = sa->runnable_avg_period % 1024;
>> > +       delta_w = sa->period_contrib;
>> >         if (delta + delta_w >= 1024) {
>> > -               /* period roll-over */
>> >                 decayed = 1;
>> >
>> > +               /* how much left for next period will start over, we don't know yet */
>> > +               sa->period_contrib = 0;
>> > +
>> >                 /*
>> >                  * Now that we know we're crossing a period boundary, figure
>> >                  * out how much from delta we need to complete the current
>> >                  * period and accrue it.
>> >                  */
>> >                 delta_w = 1024 - delta_w;
>> > -               if (runnable)
>> > -                       sa->runnable_avg_sum += delta_w;
>> > -               sa->runnable_avg_period += delta_w;
>> > +               if (w)
>> > +                       sa->load_sum += w * delta_w;
>>
>> Do you really need to have *w for computing the load_sum ? can't you
>> only use it when computing the load_avg ?
>>
>> sa->load_avg = div_u64(sa->load_sum * w , LOAD_AVG_MAX)
>>
>
> For task, assuming its load.weight does not change much, yes, we can. But in theory, task's

I would even say that the load_avg of a task should not be impacted by
an old priority value. Once the priority of a task is changed, we
should take into account only the new priority when weighting the load_avg
of the task

> load.weight can change, and *w in load_sum can take into that change. For group entity
> and cfs_rq, its load.weight changes all the time, I don't know how to do it without *w
> for load_sum.

IMHO, we should apply the same policy as the one I mentioned for
tasks. So the load_avg of an entity or a cfs_rq will not be disturbed
by an old but no longer valid weight

Vincent
>
> Sorry for my irresponsiveness for last week. I was on vacation and unfortunately failed to
> connect VPN from where I was.
>
> Thanks,
> Yuyang

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-29  9:12       ` Vincent Guittot
  2014-07-29  1:43         ` Yuyang Du
@ 2014-07-29  9:39         ` Peter Zijlstra
  2014-07-29  1:53           ` Yuyang Du
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2014-07-29  9:39 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Yuyang Du, mingo, linux-kernel, Paul Turner, Benjamin Segall,
	arjan.van.de.ven, Len Brown, rafael.j.wysocki, alan.cox, Gross,
	Mark, fengguang.wu

On Tue, Jul 29, 2014 at 11:12:37AM +0200, Vincent Guittot wrote:
> On 27 July 2014 19:36, Yuyang Du <yuyang.du@intel.com> wrote:
> > Hi Vincent,
> >
> > On Fri, Jul 18, 2014 at 11:43:00AM +0200, Vincent Guittot wrote:
> >> > @@ -2291,23 +2299,24 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
> >> >         delta >>= 10;
> >> >         if (!delta)
> >> >                 return 0;
> >> > -       sa->last_runnable_update = now;
> >> > +       sa->last_update_time = now;
> >> >
> >> >         /* delta_w is the amount already accumulated against our next period */
> >> > -       delta_w = sa->runnable_avg_period % 1024;
> >> > +       delta_w = sa->period_contrib;
> >> >         if (delta + delta_w >= 1024) {
> >> > -               /* period roll-over */
> >> >                 decayed = 1;
> >> >
> >> > +               /* how much left for next period will start over, we don't know yet */
> >> > +               sa->period_contrib = 0;
> >> > +
> >> >                 /*
> >> >                  * Now that we know we're crossing a period boundary, figure
> >> >                  * out how much from delta we need to complete the current
> >> >                  * period and accrue it.
> >> >                  */
> >> >                 delta_w = 1024 - delta_w;
> >> > -               if (runnable)
> >> > -                       sa->runnable_avg_sum += delta_w;
> >> > -               sa->runnable_avg_period += delta_w;
> >> > +               if (w)
> >> > +                       sa->load_sum += w * delta_w;
> >>
> >> Do you really need to have *w for computing the load_sum ? can't you
> >> only use it when computing the load_avg ?
> >>
> >> sa->load_avg = div_u64(sa->load_sum * w , LOAD_AVG_MAX)
> >>
> >
> > For task, assuming its load.weight does not change much, yes, we can. But in theory, task's
> 
> I would even say that the load_avg of a task should not be impacted by
> an old priority value. Once, the priority of a task is changed, we
> should only take into account this new priority to weight the load_avg
> of the task

So for tasks I would immediately agree, and I think for groups too,
seeing how the group weight is based off of this avg, if you then
include the old weight we'll get a feedback loop. This might not be
desired as it would counteract the SMP movement of tasks.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 0/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-29  1:17       ` Yuyang Du
@ 2014-07-29 13:06         ` Peter Zijlstra
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2014-07-29 13:06 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Morten Rasmussen, mingo, linux-kernel, pjt, bsegall,
	arjan.van.de.ven, len.brown, rafael.j.wysocki, alan.cox,
	mark.gross, fengguang.wu

On Tue, Jul 29, 2014 at 09:17:13AM +0800, Yuyang Du wrote:
> On Mon, Jul 28, 2014 at 12:38:12PM +0200, Peter Zijlstra wrote:
> > On Mon, Jul 28, 2014 at 03:02:37AM +0800, Yuyang Du wrote:
> > > > Another thing that might be an issue is that the blocked of a terminated
> > > > task lives on for quite a while until has decayed away.
> > > 
> > > Good point. To do so, if I read correctly, we need to hook do_exit(), but probably
> > > we are gonna encounter rq->lock issue.
> > > 
> > > What is the opinion/guidance from the maintainers/others?
> > 
> > So the entire point of this per entity tracking was to make sure load
> > numbers reflect reality. We account migrations etc., it would be weird
> > to then throw all that out the window and let task exit accumulate crap.
> > 
> 
> Yes. So I will hook up do_exit. 

There is sched_class::task_dead(), pjt wanted to use that for this.
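
A minimal sketch of such a hook, reusing the removed_load_avg machinery from
this series (the catch-up __update_load_avg() against the cfs_rq clock, as done
in migrate_task_rq_fair(), is left out here):

	static void task_dead_fair(struct task_struct *p)
	{
		struct sched_entity *se = &p->se;
		struct cfs_rq *cfs_rq = cfs_rq_of(se);

		/* fold the dead task's remaining load into the pending removals */
		atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
	}

	/* hooked up via .task_dead = task_dead_fair in fair_sched_class */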

> Hope I did it right. Likewise, also do group entity
> in group destroy and group offline? 

group destroy, yes. Offline should mostly fix itself already due to it
actively migrating the actual load around I think.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-29  0:56     ` Yuyang Du
@ 2014-07-29 13:15       ` Peter Zijlstra
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2014-07-29 13:15 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, linux-kernel, pjt, bsegall, arjan.van.de.ven, len.brown,
	rafael.j.wysocki, alan.cox, mark.gross, fengguang.wu

On Tue, Jul 29, 2014 at 08:56:41AM +0800, Yuyang Du wrote:
> On Mon, Jul 28, 2014 at 12:48:37PM +0200, Peter Zijlstra wrote:
> > > +static __always_inline u64 decay_load(u64 val, u64 n)
> > > +{
> > > +	if (likely(val <= UINT_MAX))
> > > +		val = decay_load32(val, n);
> > > +	else {
> > > +		val *= (u32)decay_load32(1 << 15, n);
> > > +		val >>= 15;
> > > +	}
> > > +
> > > +	return val;
> > > +}
> > 
> > Please just use mul_u64_u32_shr().
> > 
> > /me continues reading the rest of it..
> 
> Good. Since 128bit is considered in mul_u64_u32_shr, load_sum can
> afford more tasks :)

96bit actually. While for 64bit platforms it uses the 64x64->128 mult, it
only uses 2 32x32->64 mults for 32bit, which isn't sufficient for 128 as
that would require 4.

It also reduces to 1 32x32->64 mult (on 32bit) in case val fits in
32bit.

Therefore it's as efficient as your code, but more accurate, for not
losing bits in the full (val is bigger than 32bit) case.
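
For reference, a decay_load() built directly on mul_u64_u32_shr() could look
roughly like this (keeping the existing runnable_avg_yN_inv[] table and
LOAD_AVG_PERIOD handling; a sketch, not the final patch):

	static __always_inline u64 decay_load(u64 val, u64 n)
	{
		unsigned int local_n;

		if (!n)
			return val;
		else if (unlikely(n > LOAD_AVG_PERIOD * 63))
			return 0;

		/* after the bounds check we can safely collapse to 32-bit */
		local_n = n;

		/*
		 * As y^LOAD_AVG_PERIOD = 1/2, fold the n/PERIOD halvings into a
		 * shift and look up y^(n % PERIOD) in runnable_avg_yN_inv[].
		 */
		if (unlikely(local_n >= LOAD_AVG_PERIOD)) {
			val >>= local_n / LOAD_AVG_PERIOD;
			local_n %= LOAD_AVG_PERIOD;
		}

		/* val * y^local_n without losing the high bits of val */
		return mul_u64_u32_shr(val, runnable_avg_yN_inv[local_n], 32);
	}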

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-29  1:43         ` Yuyang Du
@ 2014-07-29 13:17           ` Vincent Guittot
  2014-07-29 22:27             ` Yuyang Du
  0 siblings, 1 reply; 47+ messages in thread
From: Vincent Guittot @ 2014-07-29 13:17 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, Peter Zijlstra, linux-kernel, Paul Turner,
	Benjamin Segall, arjan.van.de.ven, Len Brown, rafael.j.wysocki,
	alan.cox, Gross, Mark, fengguang.wu

On 29 July 2014 03:43, Yuyang Du <yuyang.du@intel.com> wrote:
> On Tue, Jul 29, 2014 at 11:12:37AM +0200, Vincent Guittot wrote:
>> >>
>> >> Do you really need to have *w for computing the load_sum ? can't you
>> >> only use it when computing the load_avg ?
>> >>
>> >> sa->load_avg = div_u64(sa->load_sum * w , LOAD_AVG_MAX)
>> >>
>> >
>> > For task, assuming its load.weight does not change much, yes, we can. But in theory, task's
>>
>> I would even say that the load_avg of a task should not be impacted by
>> an old priority value. Once, the priority of a task is changed, we
>> should only take into account this new priority to weight the load_avg
>> of the task
>>
>> > load.weight can change, and *w in load_sum can take into that change. For group entity
>> > and cfs_rq, its load.weight changes all the time, I don't know how to do it without *w
>> > for load_sum.
>>
>> IMHO, we should apply the same policy than the one i mentioned for
>> task. So the load_avg of an entity or a cfs_rq will not be disturbed
>> by an old but no more valid weight
>>
>
> Well, I see your point. But the problem is what matters is load_avg vs. load_avg, not a
> load_avg itself. So, if load_avg1 discards old weight if weight is changed, but load_avg2
> has no weight changed or has weight changed, the comparison load_avg1 vs. load_avg2 is not
> fair, but too impacted by the new weight. The point is, we count in history, so connt in the
> real history, which is the whole point of why we count the history. Make sense?

IIUC, you want to soften the impact of a weight change on cfs_rq->load_avg?

>
> Thanks,
> Yuyang

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-29  1:09     ` Yuyang Du
@ 2014-07-29 13:19       ` Peter Zijlstra
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2014-07-29 13:19 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, linux-kernel, pjt, bsegall, arjan.van.de.ven, len.brown,
	rafael.j.wysocki, alan.cox, mark.gross, fengguang.wu

On Tue, Jul 29, 2014 at 09:09:45AM +0800, Yuyang Du wrote:
> > > +#define subtract_until_zero(minuend, subtrahend)	\
> > > +	(subtrahend < minuend ? minuend - subtrahend : 0)
> > 
> > WTH is a minuend or subtrahend? Are you a wordsmith in your spare time
> > and like to make up your own words?
> > 
> > Also, isn't writing: x = max(0, x-y), far more readable to begin with?
> > 
> 
> Ok. IIUC, max() does not handle minus number super good, and we don't need the type
> overhead in max(), so still use my macro, but won't be wordsmith again, :)

The 'type' muck is compile time, it doesn't generate any code.

And max() deals just fine with negative numbers assuming you use signed
types, which you could force with max_t().
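
For instance, the removal path could then read (the field types here are
assumptions):

	long r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);

	cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg - r, 0);
	cfs_rq->avg.load_sum = max_t(s64, cfs_rq->avg.load_sum - (s64)r * LOAD_AVG_MAX, 0);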

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-29  1:53           ` Yuyang Du
@ 2014-07-29 13:35             ` Peter Zijlstra
  2014-07-29 15:55               ` Peter Zijlstra
  2014-07-29 23:08               ` Yuyang Du
  2014-07-31  9:40             ` Vincent Guittot
  2014-07-31  9:56             ` [PATCH 2/2 v4] sched: Rewrite per entity runnable load average Vincent Guittot
  2 siblings, 2 replies; 47+ messages in thread
From: Peter Zijlstra @ 2014-07-29 13:35 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Vincent Guittot, mingo, linux-kernel, Paul Turner,
	Benjamin Segall, arjan.van.de.ven, Len Brown, rafael.j.wysocki,
	alan.cox, Gross, Mark, fengguang.wu

On Tue, Jul 29, 2014 at 09:53:44AM +0800, Yuyang Du wrote:
> On Tue, Jul 29, 2014 at 11:39:11AM +0200, Peter Zijlstra wrote:
> > > > For task, assuming its load.weight does not change much, yes, we can. But in theory, task's
> > > 
> > > I would even say that the load_avg of a task should not be impacted by
> > > an old priority value. Once, the priority of a task is changed, we
> > > should only take into account this new priority to weight the load_avg
> > > of the task
> > 
> > So for tasks I would immediately agree, and I think for groups too,
> > seeing how the group weight is based off of this avg, if you then
> > include the old weight we'll get a feedback loop. This might not be
> > desired as it would counteract the SMP movement of tasks.
> 
> Including the old weight is how we get the *right* feedback. Because, say, until
> the weight is changed we are balanced; the changed weight leads to imbalance. Without
> the old weight, the imbalance is multiplied by the history, as if we had never been
> balanced.

Does not compute, sorry. How would delaying the effect of migrations
help?

Suppose we have 2 cpus and 6 tasks. cpu0 has 2 tasks, cpu1 has 4 tasks.
the group weights are resp. 341 and 682. We compute we have an imbalance
of 341 and need to migrate 170 to equalize. We achieve this by moving
the 1 task, such that both cpus end up with 4 tasks.

After that we want to find weights of 512 and 512. But if we were to
consider old weights, we'd find 426 and 597 making it appear there is
still an imbalance. We could end up migrating more, only to later find
we overshot and now need to go back.

This is the classical ringing problem.

I also don't see any up-sides from doing this.
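
For reference, a back-of-the-envelope sketch (plain C, illustration only) that
reproduces the numbers in this example. Assumptions: tg->shares = 1024, all six
tasks carry equal load, and "considering old weights" is modelled as a
50%-decayed blend of the pre- and post-migration values:

#include <stdio.h>

int main(void)
{
        const int shares = 1024, tasks = 6;
        int before[2] = { 2, 4 };       /* tasks per cpu before balancing */
        int after[2]  = { 3, 3 };       /* tasks per cpu after moving one task */

        for (int cpu = 0; cpu < 2; cpu++) {
                int w_old = shares * before[cpu] / tasks;       /* 341, 682 */
                int w_new = shares * after[cpu] / tasks;        /* 512, 512 */
                int w_mix = (w_old + w_new) / 2;                /* 426, 597 */

                printf("cpu%d: old=%d new=%d half-decayed=%d\n",
                       cpu, w_old, w_new, w_mix);
        }
        return 0;
}

The half-decayed column is what still looks like an imbalance after the
migration, which is the ringing effect described above.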

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-29 13:35             ` Peter Zijlstra
@ 2014-07-29 15:55               ` Peter Zijlstra
  2014-07-29 23:08               ` Yuyang Du
  1 sibling, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2014-07-29 15:55 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Vincent Guittot, mingo, linux-kernel, Paul Turner,
	Benjamin Segall, arjan.van.de.ven, Len Brown, rafael.j.wysocki,
	alan.cox, Gross, Mark, fengguang.wu

On Tue, Jul 29, 2014 at 03:35:10PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 29, 2014 at 09:53:44AM +0800, Yuyang Du wrote:
> > On Tue, Jul 29, 2014 at 11:39:11AM +0200, Peter Zijlstra wrote:
> > > > > For task, assuming its load.weight does not change much, yes, we can. But in theory, task's
> > > > 
> > > > I would even say that the load_avg of a task should not be impacted by
> > > > an old priority value. Once, the priority of a task is changed, we
> > > > should only take into account this new priority to weight the load_avg
> > > > of the task
> > > 
> > > So for tasks I would immediately agree, and I think for groups too,
> > > seeing how the group weight is based off of this avg, if you then
> > > include the old weight we'll get a feedback loop. This might not be
> > > desired as it would counteract the SMP movement of tasks.
> > 
> > Including the old weight is how we get the *right* feedback. Because, say, until
> > the weight is changed we are balanced; the changed weight leads to imbalance. Without
> > the old weight, the imbalance is multiplied by the history, as if we had never been
> > balanced.
> 
> Does not compute, sorry. How would delaying the effect of migrations
> help?
> 
> Suppose we have 2 cpus and 6 tasks. cpu0 has 2 tasks, cpu1 has 4 tasks.
> the group weights are resp. 341 and 682. We compute we have an imbalance
> of 341 and need to migrate 170 to equalize. We achieve this by moving
> the 1 task, such that both cpus end up with 4 tasks.

3 of course.

> After that we want to find weights of 512 and 512. But if we were to
> consider old weights, we'd find 426 and 597 making it appear there is
> still an imbalance. We could end up migrating more, only to later find
> we overshot and now need to go back.
> 
> This is the classical ringing problem.
> 
> I also don't see any up-sides from doing this.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-29 13:17           ` Vincent Guittot
@ 2014-07-29 22:27             ` Yuyang Du
  2014-07-30  8:30               ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Yuyang Du @ 2014-07-29 22:27 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, Peter Zijlstra, linux-kernel, Paul Turner,
	Benjamin Segall, arjan.van.de.ven, Len Brown, rafael.j.wysocki,
	alan.cox, Gross, Mark, fengguang.wu

On Tue, Jul 29, 2014 at 03:17:29PM +0200, Vincent Guittot wrote:
> >>
> >> IMHO, we should apply the same policy than the one i mentioned for
> >> task. So the load_avg of an entity or a cfs_rq will not be disturbed
> >> by an old but no more valid weight
> >>
> >
> > Well, I see your point. But the problem is what matters is load_avg vs. load_avg, not a
> > load_avg itself. So, if load_avg1 discards old weight if weight is changed, but load_avg2
> > has no weight changed or has weight changed, the comparison load_avg1 vs. load_avg2 is not
> > fair, but too impacted by the new weight. The point is, we count in history, so count in the
> > real history, which is the whole point of why we count the history. Make sense?
> 
> IIUC, you want to soften the impact of weight change on cfs_rq-> load_avg ?
> 

Yes, that would be the effect.

Isn't the entire effort starting from PJT and Ben up to now to soften the extremely
dynamic changes (runnable or not, weight change, etc)? Assume task does not change
weight much, but group entity does as Peter mentioned.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-29 13:35             ` Peter Zijlstra
  2014-07-29 15:55               ` Peter Zijlstra
@ 2014-07-29 23:08               ` Yuyang Du
  1 sibling, 0 replies; 47+ messages in thread
From: Yuyang Du @ 2014-07-29 23:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, mingo, linux-kernel, Paul Turner,
	Benjamin Segall, arjan.van.de.ven, Len Brown, rafael.j.wysocki,
	alan.cox, Gross, Mark, fengguang.wu

On Tue, Jul 29, 2014 at 03:35:10PM +0200, Peter Zijlstra wrote:
> 
> Does not compute, sorry. How would delaying the effect of migrations
> help?
> 
> Suppose we have 2 cpus and 6 tasks. cpu0 has 2 tasks, cpu1 has 4 tasks.
> the group weights are resp. 341 and 682. We compute we have an imbalance
> of 341 and need to migrate 170 to equalize. We achieve this by moving
> the 1 task, such that both cpus end up with 4 tasks.
> 
> After that we want to find weights of 512 and 512. But if we were to
> consider old weights, we'd find 426 and 597 making it appear there is
> still an imbalance. We could end up migrating more, only to later find
> we overshot and now need to go back.
> 
> This is the classical ringing problem.
> 
> I also don't see any up-sides from doing this.

I am not sure I understand your example, but it seems to be about group weight
distribution. Since in migration we migrate the load with the task, the 170 will be
moved outright. So the new weight/share distribution would immediately be
512 and 512.

But for the group entity's parent cfs_rq, in terms of how this group entity
contributes to load in the infinite decaying series, it would be,
e.g., 341 before the migration and then 512 thereafter.

Hope this graph helps:

CPU1 <--> cfs_rq1                         CPU2 <--> cfs_rq2
             |                                         |
      |------------|                            |------------|
tsk_entity1   tg_entity1 <--> tg_cfs_rq1  tsk_entity2   tg_entity2 <--> tg_cfs_rq2
                                  |                                         |
                            tsk_entity3                               tsk_entity4

Then On CPU1:

cfs_rq1->avg.load_avg = tsk_entity1->avg.load_avg + tg_entity1->avg.load_avg

tg_cfs_rq1->avg.load_avg = tsk_entity3->avg.load_avg

tg_entity1's weight = tg_cfs_rq1->avg.load_avg / (tg_cfs_rq1->avg.load_avg + tg_cfs_rq2->avg.load_avg)

Same for things on CPU2.
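
A rough numeric reading of these relations, with made-up values (illustration
only, not kernel code; assumptions: tg->shares = 1024 and the group entities
are runnable nearly all the time, so a group entity's load_avg is close to its
weight):

#include <stdio.h>

int main(void)
{
        unsigned long shares = 1024;
        unsigned long tsk1 = 400, tsk3 = 300;        /* hypothetical task load_avg values */
        unsigned long tg_rq1 = tsk3, tg_rq2 = 700;   /* tg_cfs_rq1/2->avg.load_avg */

        /* tg_entity weight follows its cfs_rq's share of the group's load */
        unsigned long tg_e1 = shares * tg_rq1 / (tg_rq1 + tg_rq2);   /* ~307 */
        unsigned long tg_e2 = shares * tg_rq2 / (tg_rq1 + tg_rq2);   /* ~716 */

        printf("tg_entity1 weight ~ %lu, tg_entity2 weight ~ %lu\n", tg_e1, tg_e2);
        printf("cfs_rq1 load_avg  ~ %lu\n", tsk1 + tg_e1);   /* tsk_entity1 + tg_entity1 */
        return 0;
}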


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-30  8:30               ` Peter Zijlstra
@ 2014-07-30  0:40                 ` Yuyang Du
  0 siblings, 0 replies; 47+ messages in thread
From: Yuyang Du @ 2014-07-30  0:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, mingo, linux-kernel, Paul Turner,
	Benjamin Segall, arjan.van.de.ven, Len Brown, rafael.j.wysocki,
	alan.cox, Gross, Mark, fengguang.wu

On Wed, Jul 30, 2014 at 10:30:08AM +0200, Peter Zijlstra wrote:
> > 
> > Isn't the entire effort starting from PJT and Ben up to now to soften the extremely
> > dynamic changes (runnable or not, weight change, etc)? Assume task does not change
> > weight much, but group entity does as Peter mentioned.
> 
> No, softening isn't the point at all. But an integrator is the only
> means of predicting the future given the erratic past.
> 
> The whole point we got into this game is to better compute per cpu group
> weights, not to soften stuff, that's just a necessary evil to more
> accurately predict erratic/unknown behaviour.
> 
> 
Yes, I totally agree. I think what I meant by "soften" is the *effect* of the integrator
that takes/averages the infinite history to predict the future.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-29 22:27             ` Yuyang Du
@ 2014-07-30  8:30               ` Peter Zijlstra
  2014-07-30  0:40                 ` Yuyang Du
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2014-07-30  8:30 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Vincent Guittot, mingo, linux-kernel, Paul Turner,
	Benjamin Segall, arjan.van.de.ven, Len Brown, rafael.j.wysocki,
	alan.cox, Gross, Mark, fengguang.wu

[-- Attachment #1: Type: text/plain, Size: 1513 bytes --]

On Wed, Jul 30, 2014 at 06:27:52AM +0800, Yuyang Du wrote:
> On Tue, Jul 29, 2014 at 03:17:29PM +0200, Vincent Guittot wrote:
> > >>
> > >> IMHO, we should apply the same policy than the one i mentioned for
> > >> task. So the load_avg of an entity or a cfs_rq will not be disturbed
> > >> by an old but no more valid weight
> > >>
> > >
> > > Well, I see your point. But the problem is what matters is load_avg vs. load_avg, not a
> > > load_avg itself. So, if load_avg1 discards old weight if weight is changed, but load_avg2
> > > has no weight changed or has weight changed, the comparison load_avg1 vs. load_avg2 is not
> > > fair, but too impacted by the new weight. The point is, we count in history, so count in the
> > > real history, which is the whole point of why we count the history. Make sense?
> > 
> > IIUC, you want to soften the impact of weight change on cfs_rq-> load_avg ?
> > 
> 
> Yes, that would be the effect.
> 
> Isn't the entire effort starting from PJT and Ben up to now to soften the extremely
> dynamic changes (runnable or not, weight change, etc)? Assume task does not change
> weight much, but group entity does as Peter mentioned.

No, softening isn't the point at all. But an integrator is the only
means of predicting the future given the erratic past.

The whole point we got into this game is to better compute per cpu group
weights, not to soften stuff, that's just a necessary evil to more
accurately predict erratic/unknown behaviour.



[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 0/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-27 19:02   ` Yuyang Du
  2014-07-28 10:38     ` Peter Zijlstra
@ 2014-07-30 10:13     ` Morten Rasmussen
  2014-07-30 10:21       ` Peter Zijlstra
  2014-07-30 19:17       ` Yuyang Du
  1 sibling, 2 replies; 47+ messages in thread
From: Morten Rasmussen @ 2014-07-30 10:13 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, peterz, linux-kernel, pjt, bsegall, arjan.van.de.ven,
	len.brown, rafael.j.wysocki, alan.cox, mark.gross, fengguang.wu

On Sun, Jul 27, 2014 at 08:02:37PM +0100, Yuyang Du wrote:
> Hi Morten,
> 
> On Fri, Jul 18, 2014 at 04:39:31PM +0100, Morten Rasmussen wrote:
> > 1. runnable_avg_period is removed
> > 
> > load_avg_contrib used to be runnable_avg_sum/runnable_avg_period scaled
> > by the task load weight (priority). The runnable_avg_period is replaced
> > by a constant in this patch set. The effect of that change is that task
> > load tracking is no longer more sensitive early in the life of the task until
> > it has built up some history. Tasks are now initialized to start out as
> > if they have been runnable forever (>345ms). If this assumption about
> > the task behavior is wrong it will take longer to converge to the true
> > average than it did before. The upside is that it is more stable.
> 
> I think "Give new task start runnable values to heavy its load in infant time"
> in general is good, with an emphasis on infant. Or, from the opposite direction, making it
> zero and letting it gain runnable weight looks worse than giving it full weight.

Initializing tasks to have full weight is current behaviour, which I
agree with. However, with your changes (dropping runnable_avg_period) it
may take longer for the tracked load of new tasks to converge to the
true load of the task. I don't think it is a big deal, but it is a
change compared to the current implementation.
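
A rough userspace simulation of this convergence point (plain C, illustration
only; assumptions: 1024us periods, y^32 = 1/2, a new task that is really
runnable 25% of the time, and the duty cycle modelled as a fractional
contribution per period):

#include <stdio.h>
#include <math.h>

int main(void)
{
        const double y = pow(0.5, 1.0 / 32.0);
        const double max = 1024.0 / (1.0 - y);  /* the "runnable forever" sum, ~47k */
        const double duty = 0.25;               /* the task's true runnable fraction */
        double full = max, empty = 0.0;         /* seeded with full vs. no history */

        for (int ms = 1; ms <= 512; ms++) {
                full  = full  * y + duty * 1024.0;
                empty = empty * y + duty * 1024.0;
                if ((ms % 64) == 0)
                        printf("%4d ms: full-init=%3.0f%%  empty-init=%3.0f%%  (true=25%%)\n",
                               ms, 100.0 * full / max, 100.0 * empty / max);
        }
        return 0;
}

The full-history seed overestimates a lightly loaded new task for a couple of
hundred ms before settling, which is the slower convergence (and the extra
stability) being traded off here.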

> 
> > 2. runnable_load_avg and blocked_load_avg are combined
> > 
> > runnable_load_avg currently represents the sum of load_avg_contrib of
> > all tasks on the rq, while blocked_load_avg is the sum of those tasks
> > not on a runqueue. It makes perfect sense to consider the sum of both
> > when calculating the load of a cpu, but we currently don't include
> > blocked_load_avg. The reason for that is the priority scaling of the
> > task load_avg_contrib may lead to under-utilization of cpus that
> > occasionally have tiny high priority task running. You can easily have a
> > task that takes 5% of cpu time but has a load_avg_contrib several times
> > larger than a default priority task runnable 100% of the time.
> 
> So this is the effect of historical averaging and weight scaling, both of which
> are just generally good, but may have bad cases.

I don't agree that weight scaling is generally good. There have been
several threads discussing that topic over the last half year or so. It
is there to ensure smp niceness, but it makes load-balancing on systems
which are not fully utilized sub-optimal. You may end up with some cpus
not being fully utilized while others are over-utilized when you have
multiple tasks running at different priorities.

It is a very real problem when user-space uses priorities extensively
like Android does. Tasks related to audio run at very high priorities
but only for a very short amount of time, but due to the priority
scaling their load ends up being several times higher than tasks running
all the time at normal priority. Hence task load is a very poor
indicator of utilization.

> > Another thing that might be an issue is that the blocked load of a terminated
> > task lives on for quite a while until it has decayed away.
> 
> Good point. To do so, if I read correctly, we need to hook do_exit(), but probably
> we are gonna encounter an rq->lock issue.
> 
> What is the opinion/guidance from the maintainers/others?
>  
> > I'm all for taking the blocked load into consideration, but this issue
> > has to be resolved first. Which leads me on to the next thing.
> > 
> > Most of the work going on around energy awareness is based on the load
> > tracking to estimate task and cpu utilization. It seems that most of the
> > involved parties think that we need an unweighted variant of the tracked
> > load as well as tracking the running time of a task. The latter was part
> > of the original proposal by pjt and Ben, but wasn't used. It seems that
> > unweighted runnable tracking should be fairly easy to add to your
> > proposal, but I don't have an overview of whether it is possible to add
> > running tracking. Do you think that is possible?
> > 
> 
> Running tracking is absolutely possible, just a matter of minimizing the overhead
> (how to do it along with runnable tracking for the task and maybe for the CPU, but not for
> the cfs_rq) from an execution and code cleanliness point of view. We can do it as soon as
> it is needed.

From a coding point of view it is very easy to add to the current
load-tracking. We have already discussed putting it back in to enable
better tracking of utilization. It is quite likely needed for the
energy-awareness improvements and also to fix the priority scaling
problem described above. 

IMHO, the above things need to be considered as part of a rewrite of the
load-tracking implementation otherwise we risk having to change it again
soon.

Morten


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 0/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-30 10:13     ` Morten Rasmussen
@ 2014-07-30 10:21       ` Peter Zijlstra
  2014-07-30 10:57         ` Morten Rasmussen
  2014-07-30 19:17       ` Yuyang Du
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2014-07-30 10:21 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Yuyang Du, mingo, linux-kernel, pjt, bsegall, arjan.van.de.ven,
	len.brown, rafael.j.wysocki, alan.cox, mark.gross, fengguang.wu

[-- Attachment #1: Type: text/plain, Size: 570 bytes --]

On Wed, Jul 30, 2014 at 11:13:31AM +0100, Morten Rasmussen wrote:
> It is a very real problem when user-space uses priorities extensively
> like Android does. Tasks related to audio run at very high priorities
> but only for a very short amount of time, but due to the priority
> scaling their load ends up being several times higher than tasks running
> all the time at normal priority. Hence task load is a very poor
> indicator of utilization.

FWIW Android (which I think does this through binder) is utterly insane
in this regard. But yes, we have to live with it.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 0/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-30 10:21       ` Peter Zijlstra
@ 2014-07-30 10:57         ` Morten Rasmussen
  0 siblings, 0 replies; 47+ messages in thread
From: Morten Rasmussen @ 2014-07-30 10:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yuyang Du, mingo, linux-kernel, pjt, bsegall, arjan.van.de.ven,
	len.brown, rafael.j.wysocki, alan.cox, mark.gross, fengguang.wu

On Wed, Jul 30, 2014 at 11:21:28AM +0100, Peter Zijlstra wrote:
> On Wed, Jul 30, 2014 at 11:13:31AM +0100, Morten Rasmussen wrote:
> > It is a very real problem when user-space uses priorities extensively
> > like Android does. Tasks related to audio run at very high priorities
> > but only for a very short amount of time, but due to the priority
> > scaling their load ends up being several times higher than tasks running
> > all the time at normal priority. Hence task load is a very poor
> > indicator of utilization.
> 
> FWIW Android (which I think does this through binder) is utterly insane
> in this regard. But yes, we have to live with it.

I agree that its use of priorities seems a bit extreme. I'm not trying
to defend it in any way :-) 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 0/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-30 10:13     ` Morten Rasmussen
  2014-07-30 10:21       ` Peter Zijlstra
@ 2014-07-30 19:17       ` Yuyang Du
  2014-07-31  8:54         ` Morten Rasmussen
  1 sibling, 1 reply; 47+ messages in thread
From: Yuyang Du @ 2014-07-30 19:17 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, peterz, linux-kernel, pjt, bsegall, arjan.van.de.ven,
	len.brown, rafael.j.wysocki, alan.cox, mark.gross, fengguang.wu

Hi Morten,

On Wed, Jul 30, 2014 at 11:13:31AM +0100, Morten Rasmussen wrote:
> > > 2. runnable_load_avg and blocked_load_avg are combined
> > > 
> > > runnable_load_avg currently represents the sum of load_avg_contrib of
> > > all tasks on the rq, while blocked_load_avg is the sum of those tasks
> > > not on a runqueue. It makes perfect sense to consider the sum of both
> > > when calculating the load of a cpu, but we currently don't include
> > > blocked_load_avg. The reason for that is the priority scaling of the
> > > task load_avg_contrib may lead to under-utilization of cpus that
> > > occasionally have tiny high priority task running. You can easily have a
> > > task that takes 5% of cpu time but has a load_avg_contrib several times
> > > larger than a default priority task runnable 100% of the time.
> > 
> > So this is the effect of historical averaging and weight scaling, both of which
> > are just generally good, but may have bad cases.
> 
> I don't agree that weight scaling is generally good. There have been
> several threads discussing that topic over the last half year or so. It
> is there to ensure smp niceness, but it makes load-balancing on systems
> which are not fully utilized sub-optimal. You may end up with some cpus
> not being fully utilized while others are over-utilized when you have
> multiple tasks running at different priorities.
> 
> It is a very real problem when user-space uses priorities extensively
> like Android does. Tasks related to audio run at very high priorities
> but only for a very short amount of time, but due to the priority
> scaling their load ends up being several times higher than tasks running
> all the time at normal priority. Hence task load is a very poor
> indicator of utilization.
 
I understand the problem you raised, but the problem is not described crystal clearly.

You are saying tasks with big weight contribute too much, even when they are running
for a short time. But is it unfair or does it lead to imbalance? It is hard to say yes,
if the answer is not simply no. They have big weight, so they are supposed to be "unfair"
vs. small-weight tasks for the sake of fairness. In addition, since they run for a short
time, their runnable weight/load is offset by that factor.

I think I am speaking from a pure fairness point of view, which is just generally good
in the sense that we can't think of a more "generally good" thing to replace it.

And you are saying that when a big-weight task is not runnable but already contributes
"too much" load, that leads to under-utilization. So this is a matter of our
prediction algorithm. I am afraid I will say again that the prediction is generally
good. For the audio example, which is strictly periodic, it just can't be better.

FWIW, I am really not sure how serious this under-utilization problem is in the real
world.

I am not saying your argument does not make sense. It makes every sense from a
specific-case point of view. I do think there absolutely can be sub-optimal cases. But as
I said, I just don't think the problem description is clear enough so that we know
it is worth solving (by a pros and cons comparison) and how to solve it, either
generally or specifically.

Plus, as Peter said, we have to live with user space using big weights, and handle
weight as it is supposed to be handled.

Thanks,
Yuyang

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 0/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-31  8:54         ` Morten Rasmussen
@ 2014-07-31  2:15           ` Yuyang Du
  0 siblings, 0 replies; 47+ messages in thread
From: Yuyang Du @ 2014-07-31  2:15 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, peterz, linux-kernel, pjt, bsegall, arjan.van.de.ven,
	len.brown, rafael.j.wysocki, alan.cox, mark.gross, fengguang.wu

On Thu, Jul 31, 2014 at 09:54:21AM +0100, Morten Rasmussen wrote:
> 
> Overall, it is not clear to me why it is necessary to rewrite the
> per-entity load-tracking. The code is somewhat simpler, but I don't see
> any functional additions/improvements. If we have to go through a long
> review and testing process, why not address some of the most obvious
> issues with the existing implementation while we are at it? I don't see
> the point in replacing something sub-optimal with equally sub-optimal
> (or worse).
> 

This is absolutely nonsense. First, we do have improvements. Second, even
with no functional additions, do you really understand what has been
changed besides becoming simpler? Even just simpler, simpler means a lot of things...

> > I do think there absolutely can be sub-optimal cases.

I said there absolutely can be sub-optimal cases, which referred exactly to
the example you gave (one 10% 88761 task vs. eight 100% 1024 tasks). Still, the links
do not say anything about how serious it is. Does it exist? Yes. Is it serious? I don't know.

> > But as I said, I just don't think the problem description is clear enough.

I said your description is not clear enough, and at the time I was not
clear either. Arguably and sadly, none of what you said in this response
made even a tiny bit of progress. About blocked load, prediction, ..., can you
be more wrong?

The problem is not weight scaling. The problem is how weight is accumulated
when not runnable. Why? Consider this: if all tasks are always runnable,
weight scaling can't be more right.

WRT runnable weight, currently it is runnable% * weight (simplified).
Since weight has such a big range, it dwarfs the runnable time ratio. So maybe what
can be done is (what I have in mind):

1) runnable%^2 * weight
2) bigger weight does faster decay

Still, if you can prove the issue is serious, we can try something..., but
nothing is perfect.
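
For what it is worth, a quick illustration (hypothetical, plain C, not a
concrete patch) of what idea 1) would do to the nice=-20 example discussed
elsewhere in this thread, a task runnable 10% of the time with weight 88761
versus a nice=0 task (weight 1024) runnable 100% of the time:

#include <stdio.h>

int main(void)
{
        const double w_hi = 88761.0, w_lo = 1024.0;
        const double r_hi = 0.10, r_lo = 1.0;   /* runnable fractions */

        printf("current  (r * w)  : hi=%6.0f  lo=%6.0f\n",
               r_hi * w_hi, r_lo * w_lo);               /* ~8876 vs 1024 */
        printf("idea 1) (r^2 * w) : hi=%6.0f  lo=%6.0f\n",
               r_hi * r_hi * w_hi, r_lo * r_lo * w_lo); /* ~888 vs 1024 */
        return 0;
}

Squaring the runnable fraction would pull the rarely-running heavy task below
the always-running nice=0 task, i.e. it damps exactly the case Morten is
worried about, though it also changes what load means for every task.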

Thanks,
Yuyang

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 0/2 v4] sched: Rewrite per entity runnable load average tracking
  2014-07-30 19:17       ` Yuyang Du
@ 2014-07-31  8:54         ` Morten Rasmussen
  2014-07-31  2:15           ` Yuyang Du
  0 siblings, 1 reply; 47+ messages in thread
From: Morten Rasmussen @ 2014-07-31  8:54 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, peterz, linux-kernel, pjt, bsegall, arjan.van.de.ven,
	len.brown, rafael.j.wysocki, alan.cox, mark.gross, fengguang.wu

On Wed, Jul 30, 2014 at 08:17:39PM +0100, Yuyang Du wrote:
> Hi Morten,
> 
> On Wed, Jul 30, 2014 at 11:13:31AM +0100, Morten Rasmussen wrote:
> > > > 2. runnable_load_avg and blocked_load_avg are combined
> > > > 
> > > > runnable_load_avg currently represents the sum of load_avg_contrib of
> > > > all tasks on the rq, while blocked_load_avg is the sum of those tasks
> > > > not on a runqueue. It makes perfect sense to consider the sum of both
> > > > when calculating the load of a cpu, but we currently don't include
> > > > blocked_load_avg. The reason for that is the priority scaling of the
> > > > task load_avg_contrib may lead to under-utilization of cpus that
> > > > occasionally have tiny high priority task running. You can easily have a
> > > > task that takes 5% of cpu time but has a load_avg_contrib several times
> > > > larger than a default priority task runnable 100% of the time.
> > > 
> > > So this is the effect of historical averaging and weight scaling, both of which
> > > are just generally good, but may have bad cases.
> > 
> > I don't agree that weight scaling is generally good. There have been
> > several threads discussing that topic over the last half year or so. It
> > is there to ensure smp niceness, but it makes load-balancing on systems
> > which are not fully utilized sub-optimal. You may end up with some cpus
> > not being fully utilized while others are over-utilized when you have
> > multiple tasks running at different priorities.
> > 
> > It is a very real problem when user-space uses priorities extensively
> > like Android does. Tasks related to audio run at very high priorities
> > but only for a very short amount of time, but due to the priority
> > scaling their load ends up being several times higher than tasks running
> > all the time at normal priority. Hence task load is a very poor
> > indicator of utilization.
>  
> I understand the problem you raised, but the problem is not described crystal clearly.
>
> You are saying tasks with big weight contribute too much, even when they are running
> for a short time. But is it unfair or does it lead to imbalance? It is hard to say yes,
> if the answer is not simply no. They have big weight, so they are supposed to be "unfair"
> vs. small-weight tasks for the sake of fairness. In addition, since they run for a short
> time, their runnable weight/load is offset by that factor.

It does lead to imbalance and the problem is indeed very real as I
already said. It has been discussed numerous times before:

https://lkml.org/lkml/2014/5/28/264
https://lkml.org/lkml/2014/1/8/251

Summary:

Default priority (nice=0) has a weight of 1024. nice=-20 has a weight of
88761. So a nice=-20 task that runs ~10% of the time has a load contribution
of ~8876, which is >8x the weight of a nice=0 task that runs 100% of the
time. Load contribution is used for load-balancing, which means that you
will put at least eight 100% nice=0 tasks on a cpu before you start
putting any additional tasks on the cpu with the nice=-20 task. So you
over-subscribe one cpu by 700% while another is idle 90% of the time.
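
Spelled out as a tiny illustration (plain C, not kernel code), the arithmetic
in the summary above:

#include <stdio.h>

int main(void)
{
        const int w_nice_m20 = 88761;   /* nice=-20 weight */
        const int w_nice_0   = 1024;    /* nice=0 (default) weight */

        int contrib_m20 = w_nice_m20 / 10;      /* runnable ~10% of the time */
        int contrib_0   = w_nice_0;             /* runnable 100% of the time */

        printf("nice=-20 @ 10%% : %d\n", contrib_m20);  /* ~8876 */
        printf("nice=0   @ 100%%: %d\n", contrib_0);    /*  1024 */
        printf("nice=0 tasks needed to match: %d\n",
               contrib_m20 / contrib_0);                /* 8 */
        return 0;
}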

You may argue that this is 'fair', but it is very much waste of
resources. Putting nice=0 tasks on the same cpu as the nice=-20 task
will have nearly no effect on the cpu time allocated to the nice=-20 task
due to the vruntime scaling. Hence there is virtually no downside in
terms of giving priority and a lot to be gained in terms of throughput.

Generally, we don't have to care about priority as long as no cpu is
fully utilized. All tasks get the cpu time they need.

The problem with considering blocked priority-scaled load is that the
load doesn't disappear while the task is blocked, so it effectively
reserves too much cpu time for high priority tasks.

A real work use-case where this happens is described here:

https://lkml.org/lkml/2014/1/7/358

> I think I am speaking from a pure fairness point of view, which is just generally good
> in the sense that we can't think of a more "generally good" thing to replace it.

Unweighted utilization. As said above, we only need to care about
priority when cpus are fully utilized. It doesn't break any fairness.

> And you are saying that when a big-weight task is not runnable but already contributes
> "too much" load, that leads to under-utilization. So this is a matter of our
> prediction algorithm. I am afraid I will say again that the prediction is generally
> good. For the audio example, which is strictly periodic, it just can't be better.

I disagree. The priority-scaled prediction is generally bad. Why reserve
up to 88x more cpu time for a task than is actually needed, when
the unweighted load tracking (utilization) is readily available?

> FWIW, I am really not sure how serious this under-utilization problem is in the real
> world.

Again, it is indeed a real world problem. We have experienced it first
hand and have been experimenting with this over the last 2-3 years. I'm
not making this up.

We have included unweighted load (utilization) in our RFC patch set for
the same reason. And the out-of-tree big.LITTLE solution carries similar
patches too.

> I am not saying your argument does not make sense. It makes every sense from a
> specific-case point of view. I do think there absolutely can be sub-optimal cases. But as
> I said, I just don't think the problem description is clear enough so that we know
> it is worth solving (by a pros and cons comparison) and how to solve it, either
> generally or specifically.

I didn't repeat the whole history in my first response as I thought this
had already been debated several times and we had reached agreement that
it is indeed a problem. You are not the first one to propose including
priority scaled blocked load in the load estimation.

> Plus, as Peter said, we have to live with user space using big weights, and handle
> weight as it is supposed to be handled.

I don't follow. Are you saying it is fine to intentionally make
load-balancing worse for any user-space that uses task priorities other
than default?

You can't just ignore users of task priority. You may have the point of
view that you don't care about under-utilization, but there are lots of
users who do. Optimizing for energy consumption is a primary goal for
the mobile space (and servers seem to be moving that way too). This
requires more accurate estimates of cpu utilization to manage how many
cpus are needed. Ignoring priority scaling is moving in the exact
opposite direction and conflicts with other ongoing efforts.

Overall, it is not clear to me why it is necessary to rewrite the
per-entity load-tracking. The code is somewhat simpler, but I don't see
any functional additions/improvements. If we have to go through a long
review and testing process, why not address some of the most obvious
issues with the existing implementation while we are at it? I don't see
the point in replacing something sub-optimal with equally sub-optimal
(or worse).

Morten

^ permalink raw reply	[flat|nested] 47+ messages in thread

* (no subject)
  2014-07-29  1:53           ` Yuyang Du
  2014-07-29 13:35             ` Peter Zijlstra
@ 2014-07-31  9:40             ` Vincent Guittot
  2014-07-31  9:56             ` [PATCH 2/2 v4] sched: Rewrite per entity runnable load average Vincent Guittot
  2 siblings, 0 replies; 47+ messages in thread
From: Vincent Guittot @ 2014-07-31  9:40 UTC (permalink / raw)
  To: yuyang.du, peterz
  Cc: mingo, linux-kernel, pjt, bsegall, arjan.van.de.ven, len.brown,
	rafael.j.wysocki, alan.cox, mark.gross, engguang.wu,
	morten.rasmussen, Vincent Guittot

Hi Yuyang,

Does something like the patch below, to be applied on top of your patchset, seem
like a reasonable add-on?

It adds 1 new usage_sum statistic, which is something that I use to detect the
overload of a rq in my patchset that reworks cpu_power and removes
capacity_factor.

And I think that the change I made on load_sum should match some of Morten's
concerns

Regards,
Vincent

---
Subject: [PATCH] sched: add usage_sum statistic

Add a new statistic that reflects the average time a task is running on a CPU.

load_sum is now the average runnable time before being weighted

The sum of the usage_sum of the tasks that are on a rq is used to detect
the overload of a rq.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 47 +++++++++++++++++++++++++++++++++++------------
 kernel/sched/sched.h  |  2 ++
 3 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b6617a1..3296e76 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1080,6 +1080,7 @@ struct sched_avg {
 	 */
 	u64 last_update_time;
 	u64 load_sum;
+	unsigned long usage_sum;
 	unsigned long load_avg;
 	u32 period_contrib;
 };
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a3a3168..78408a0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -679,7 +679,8 @@ void init_task_runnable_average(struct task_struct *p)
 	 */
 	sa->period_contrib = 1023;
 	sa->load_avg = p->se.load.weight;
-	sa->load_sum = p->se.load.weight * LOAD_AVG_MAX;
+	sa->load_sum = sa->usage_sum = LOAD_AVG_MAX;
+	;
 	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
 }
 #else
@@ -2300,7 +2301,7 @@ static u32 __compute_runnable_contrib(u64 n)
  *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
  */
 static __always_inline int
-__update_load_avg(u64 now, struct sched_avg *sa, unsigned long w)
+__update_load_avg(u64 now, struct sched_avg *sa, unsigned long w, int running)
 {
 	u64 delta, periods;
 	u32 contrib;
@@ -2340,7 +2341,9 @@ __update_load_avg(u64 now, struct sched_avg *sa, unsigned long w)
 		 */
 		delta_w = 1024 - delta_w;
 		if (w)
-			sa->load_sum += w * delta_w;
+			sa->load_sum += delta_w;
+		if (running)
+			sa->usage_sum += delta_w;
 
 		delta -= delta_w;
 
@@ -2349,21 +2352,26 @@ __update_load_avg(u64 now, struct sched_avg *sa, unsigned long w)
 		delta %= 1024;
 
 		sa->load_sum = decay_load(sa->load_sum, periods + 1);
+		sa->usage_sum = decay_load(sa->usage_sum, periods + 1);
 
 		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
 		contrib = __compute_runnable_contrib(periods);
 		if (w)
-			sa->load_sum += w * contrib;
+			sa->load_sum += contrib;
+		if (running)
+			sa->usage_sum += contrib;
 	}
 
 	/* Remainder of delta accrued against u_0` */
 	if (w)
-		sa->load_sum += w * delta;
+		sa->load_sum +=  delta;
+	if (running)
+		sa->usage_sum += delta;
 
 	sa->period_contrib += delta;
 
 	if (decayed)
-		sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX);
+		sa->load_avg = div_u64(sa->load_sum * w, LOAD_AVG_MAX);
 
 	return decayed;
 }
@@ -2404,11 +2412,17 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
 		long r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
 		cfs_rq->avg.load_avg = subtract_until_zero(cfs_rq->avg.load_avg, r);
-		r *= LOAD_AVG_MAX;
+	}
+	if (atomic_long_read(&cfs_rq->removed_load_sum)) {
+		long r = atomic_long_xchg(&cfs_rq->removed_load_sum, 0);
 		cfs_rq->avg.load_sum = subtract_until_zero(cfs_rq->avg.load_sum, r);
 	}
+	if (atomic_long_read(&cfs_rq->removed_usage_sum)) {
+		long r = atomic_long_xchg(&cfs_rq->removed_usage_sum, 0);
+		cfs_rq->avg.usage_sum = subtract_until_zero(cfs_rq->avg.usage_sum, r);
+	}
 
-	decayed = __update_load_avg(now, &cfs_rq->avg, cfs_rq->load.weight);
+	decayed = __update_load_avg(now, &cfs_rq->avg, cfs_rq->load.weight, cfs_rq->curr != NULL);
 
 #ifndef CONFIG_64BIT
 	if (cfs_rq->avg.last_update_time != cfs_rq->load_last_update_time_copy) {
@@ -2430,7 +2444,8 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
 	 * Track task load average for carrying it to new CPU after migrated,
 	 * and group sched_entity for task_h_load calc in migration
 	 */
-	__update_load_avg(now, &se->avg, se->on_rq * se->load.weight);
+	__update_load_avg(now, &se->avg, se->on_rq * se->load.weight,
+			entity_is_task(se) ? task_of(se)->on_cpu : 0);
 
 	if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
 		update_tg_load_avg(cfs_rq);
@@ -2451,13 +2466,14 @@ static inline void enqueue_entity_load_avg(struct sched_entity *se)
 			migrated = 1;
 	}
 	else
-		__update_load_avg(now, sa, se->on_rq * se->load.weight);
+		__update_load_avg(now, sa, se->on_rq * se->load.weight, entity_is_task(se) ? task_of(se)->on_cpu : 0);
 
 	decayed = update_cfs_rq_load_avg(now, cfs_rq);
 
 	if (migrated) {
 		cfs_rq->avg.load_avg += sa->load_avg;
 		cfs_rq->avg.load_sum += sa->load_sum;
+		cfs_rq->avg.usage_sum += sa->usage_sum;
 	}
 
 	if (decayed || migrated)
@@ -4442,8 +4458,10 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
 #else
 	last_update_time = cfs_rq->avg.last_update_time;
 #endif
-	__update_load_avg(last_update_time, &se->avg, 0);
+	__update_load_avg(last_update_time, &se->avg, 0, p->on_cpu);
 	atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
+	atomic_long_add(se->avg.load_sum, &cfs_rq->removed_load_sum);
+	atomic_long_add(se->avg.usage_sum, &cfs_rq->removed_usage_sum);
 
 	/*
 	 * We are supposed to update the task to "current" time, then its up to date
@@ -7316,11 +7334,13 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
 	* Remove our load from contribution when we leave cfs_rq.
 	*/
 	__update_load_avg(cfs_rq->avg.last_update_time, &se->avg,
-		se->on_rq * se->load.weight);
+		se->on_rq * se->load.weight, p->on_cpu);
 	cfs_rq->avg.load_avg =
 		subtract_until_zero(cfs_rq->avg.load_avg, se->avg.load_avg);
 	cfs_rq->avg.load_sum =
 		subtract_until_zero(cfs_rq->avg.load_sum, se->avg.load_sum);
+	cfs_rq->avg.usage_sum =
+		subtract_until_zero(cfs_rq->avg.usage_sum, se->avg.usage_sum);
 #endif
 }
 
@@ -7378,6 +7398,8 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 #endif
 #ifdef CONFIG_SMP
 	atomic_long_set(&cfs_rq->removed_load_avg, 0);
+	atomic_long_set(&cfs_rq->removed_load_sum, 0);
+	atomic_long_set(&cfs_rq->removed_usage_sum, 0);
 #endif
 }
 
@@ -7428,6 +7450,7 @@ static void task_move_group_fair(struct task_struct *p, int on_rq)
 		p->se.avg.last_update_time = cfs_rq->avg.last_update_time;
 		cfs_rq->avg.load_avg += p->se.avg.load_avg;
 		cfs_rq->avg.load_sum += p->se.avg.load_sum;
+		cfs_rq->avg.usage_sum += p->se.avg.usage_sum;
 #endif
 	}
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f21ddde..1bdd878 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -335,6 +335,8 @@ struct cfs_rq {
 	struct sched_avg avg;
 	unsigned long tg_load_avg_contrib;
 	atomic_long_t removed_load_avg;
+	atomic_long_t removed_load_sum;
+	atomic_long_t removed_usage_sum;
 #ifndef CONFIG_64BIT
 	u64 load_last_update_time_copy;
 #endif
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2 v4] sched: Rewrite per entity runnable load average
  2014-07-29  1:53           ` Yuyang Du
  2014-07-29 13:35             ` Peter Zijlstra
  2014-07-31  9:40             ` Vincent Guittot
@ 2014-07-31  9:56             ` Vincent Guittot
  2014-07-31 19:16               ` Yuyang Du
  2 siblings, 1 reply; 47+ messages in thread
From: Vincent Guittot @ 2014-07-31  9:56 UTC (permalink / raw)
  To: yuyang.du, peterz
  Cc: mingo, linux-kernel, pjt, bsegall, arjan.van.de.ven, len.brown,
	rafael.j.wysocki, alan.cox, mark.gross, fengguang.wu,
	morten.rasmussen, Vincent Guittot

Resend with a correct subject

Hi Yuyang,

Does something like the patch below, to be applied on top of your patchset, seem
like a reasonable add-on?

It adds 1 new usage_sum statistic, which is something that I use to detect the
overload of a rq in my patchset that reworks cpu_power and removes
capacity_factor.

And I think that the change I made on load_sum should match some of Morten's
concerns

Regards,
Vincent

---
Subject: [PATCH] sched: add usage_sum statistic

Add a new statistic that reflects the average time a task is running on a CPU.

load_sum is now the average runnable time before being weighted

The sum of the usage_sum of the tasks that are on a rq is used to detect
the overload of a rq.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 47 +++++++++++++++++++++++++++++++++++------------
 kernel/sched/sched.h  |  2 ++
 3 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b6617a1..3296e76 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1080,6 +1080,7 @@ struct sched_avg {
 	 */
 	u64 last_update_time;
 	u64 load_sum;
+	unsigned long usage_sum;
 	unsigned long load_avg;
 	u32 period_contrib;
 };
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a3a3168..78408a0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -679,7 +679,8 @@ void init_task_runnable_average(struct task_struct *p)
 	 */
 	sa->period_contrib = 1023;
 	sa->load_avg = p->se.load.weight;
-	sa->load_sum = p->se.load.weight * LOAD_AVG_MAX;
+	sa->load_sum = sa->usage_sum = LOAD_AVG_MAX;
+	;
 	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
 }
 #else
@@ -2300,7 +2301,7 @@ static u32 __compute_runnable_contrib(u64 n)
  *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
  */
 static __always_inline int
-__update_load_avg(u64 now, struct sched_avg *sa, unsigned long w)
+__update_load_avg(u64 now, struct sched_avg *sa, unsigned long w, int running)
 {
 	u64 delta, periods;
 	u32 contrib;
@@ -2340,7 +2341,9 @@ __update_load_avg(u64 now, struct sched_avg *sa, unsigned long w)
 		 */
 		delta_w = 1024 - delta_w;
 		if (w)
-			sa->load_sum += w * delta_w;
+			sa->load_sum += delta_w;
+		if (running)
+			sa->usage_sum += delta_w;
 
 		delta -= delta_w;
 
@@ -2349,21 +2352,26 @@ __update_load_avg(u64 now, struct sched_avg *sa, unsigned long w)
 		delta %= 1024;
 
 		sa->load_sum = decay_load(sa->load_sum, periods + 1);
+		sa->usage_sum = decay_load(sa->usage_sum, periods + 1);
 
 		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
 		contrib = __compute_runnable_contrib(periods);
 		if (w)
-			sa->load_sum += w * contrib;
+			sa->load_sum += contrib;
+		if (running)
+			sa->usage_sum += contrib;
 	}
 
 	/* Remainder of delta accrued against u_0` */
 	if (w)
-		sa->load_sum += w * delta;
+		sa->load_sum +=  delta;
+	if (running)
+		sa->usage_sum += delta;
 
 	sa->period_contrib += delta;
 
 	if (decayed)
-		sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX);
+		sa->load_avg = div_u64(sa->load_sum * w, LOAD_AVG_MAX);
 
 	return decayed;
 }
@@ -2404,11 +2412,17 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
 		long r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
 		cfs_rq->avg.load_avg = subtract_until_zero(cfs_rq->avg.load_avg, r);
-		r *= LOAD_AVG_MAX;
+	}
+	if (atomic_long_read(&cfs_rq->removed_load_sum)) {
+		long r = atomic_long_xchg(&cfs_rq->removed_load_sum, 0);
 		cfs_rq->avg.load_sum = subtract_until_zero(cfs_rq->avg.load_sum, r);
 	}
+	if (atomic_long_read(&cfs_rq->removed_usage_sum)) {
+		long r = atomic_long_xchg(&cfs_rq->removed_usage_sum, 0);
+		cfs_rq->avg.usage_sum = subtract_until_zero(cfs_rq->avg.usage_sum, r);
+	}
 
-	decayed = __update_load_avg(now, &cfs_rq->avg, cfs_rq->load.weight);
+	decayed = __update_load_avg(now, &cfs_rq->avg, cfs_rq->load.weight, cfs_rq->curr != NULL);
 
 #ifndef CONFIG_64BIT
 	if (cfs_rq->avg.last_update_time != cfs_rq->load_last_update_time_copy) {
@@ -2430,7 +2444,8 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
 	 * Track task load average for carrying it to new CPU after migrated,
 	 * and group sched_entity for task_h_load calc in migration
 	 */
-	__update_load_avg(now, &se->avg, se->on_rq * se->load.weight);
+	__update_load_avg(now, &se->avg, se->on_rq * se->load.weight,
+			entity_is_task(se) ? task_of(se)->on_cpu : 0);
 
 	if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
 		update_tg_load_avg(cfs_rq);
@@ -2451,13 +2466,14 @@ static inline void enqueue_entity_load_avg(struct sched_entity *se)
 			migrated = 1;
 	}
 	else
-		__update_load_avg(now, sa, se->on_rq * se->load.weight);
+		__update_load_avg(now, sa, se->on_rq * se->load.weight, entity_is_task(se) ? task_of(se)->on_cpu : 0);
 
 	decayed = update_cfs_rq_load_avg(now, cfs_rq);
 
 	if (migrated) {
 		cfs_rq->avg.load_avg += sa->load_avg;
 		cfs_rq->avg.load_sum += sa->load_sum;
+		cfs_rq->avg.usage_sum += sa->usage_sum;
 	}
 
 	if (decayed || migrated)
@@ -4442,8 +4458,10 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
 #else
 	last_update_time = cfs_rq->avg.last_update_time;
 #endif
-	__update_load_avg(last_update_time, &se->avg, 0);
+	__update_load_avg(last_update_time, &se->avg, 0, p->on_cpu);
 	atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
+	atomic_long_add(se->avg.load_sum, &cfs_rq->removed_load_sum);
+	atomic_long_add(se->avg.usage_sum, &cfs_rq->removed_usage_sum);
 
 	/*
 	 * We are supposed to update the task to "current" time, then its up to date
@@ -7316,11 +7334,13 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
 	* Remove our load from contribution when we leave cfs_rq.
 	*/
 	__update_load_avg(cfs_rq->avg.last_update_time, &se->avg,
-		se->on_rq * se->load.weight);
+		se->on_rq * se->load.weight, p->on_cpu);
 	cfs_rq->avg.load_avg =
 		subtract_until_zero(cfs_rq->avg.load_avg, se->avg.load_avg);
 	cfs_rq->avg.load_sum =
 		subtract_until_zero(cfs_rq->avg.load_sum, se->avg.load_sum);
+	cfs_rq->avg.usage_sum =
+		subtract_until_zero(cfs_rq->avg.usage_sum, se->avg.usage_sum);
 #endif
 }
 
@@ -7378,6 +7398,8 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 #endif
 #ifdef CONFIG_SMP
 	atomic_long_set(&cfs_rq->removed_load_avg, 0);
+	atomic_long_set(&cfs_rq->removed_load_sum, 0);
+	atomic_long_set(&cfs_rq->removed_usage_sum, 0);
 #endif
 }
 
@@ -7428,6 +7450,7 @@ static void task_move_group_fair(struct task_struct *p, int on_rq)
 		p->se.avg.last_update_time = cfs_rq->avg.last_update_time;
 		cfs_rq->avg.load_avg += p->se.avg.load_avg;
 		cfs_rq->avg.load_sum += p->se.avg.load_sum;
+		cfs_rq->avg.usage_sum += p->se.avg.usage_sum;
 #endif
 	}
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f21ddde..1bdd878 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -335,6 +335,8 @@ struct cfs_rq {
 	struct sched_avg avg;
 	unsigned long tg_load_avg_contrib;
 	atomic_long_t removed_load_avg;
+	atomic_long_t removed_load_sum;
+	atomic_long_t removed_usage_sum;
 #ifndef CONFIG_64BIT
 	u64 load_last_update_time_copy;
 #endif
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2 v4] sched: Rewrite per entity runnable load average
  2014-07-31  9:56             ` [PATCH 2/2 v4] sched: Rewrite per entity runnable load average Vincent Guittot
@ 2014-07-31 19:16               ` Yuyang Du
  2014-08-01  9:28                 ` Vincent Guittot
  0 siblings, 1 reply; 47+ messages in thread
From: Yuyang Du @ 2014-07-31 19:16 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, pjt, bsegall, arjan.van.de.ven,
	len.brown, rafael.j.wysocki, alan.cox, mark.gross, fengguang.wu,
	morten.rasmussen

Hi Vincent,

On Thu, Jul 31, 2014 at 11:56:13AM +0200, Vincent Guittot wrote:
> 
> load_sum is now the average runnable time before being weighted
 
So when weight changes, load_avg will completely use new weight. I have
some cents:

1) Task does not change weight much, so it is practically ok

2) Group entity does change weight much, and very likely back and forth,
   so I really think keeping the intact history will make everything
   more predictable/stable, prevent thrashing, etc.

3) If you do the same for cfs_rq->load.weight, then we simply abandoned
   blocked entities, and all states won't compute. So we then need to
   maintain blocked load average again, and we just can't do cfs_rq load
   average as a whole anymore, but must update at the granularity of an
   entity...

Anyway, it does not seem to me you really need to change load_sum, no? So
could you please not change it?

> The sum of usage_sum of the tasks that are on a rq, is used to detect
> the overload of a rq.

I think you only need usage_sum for task and rq, but not cfs_rq. Others
are ok.
 
> Does something like the patch below to be applied of top of your patchset, seem
> reasonable add-on?
> 

If you only add running statistics, I am all good, and indeed reasonable if
you can make good use of it. I am not at all against adding anything or
adding running average or unweighted anything...

Thanks,
Yuyang

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/2 v4] sched: Rewrite per entity runnable load average
  2014-07-31 19:16               ` Yuyang Du
@ 2014-08-01  9:28                 ` Vincent Guittot
  0 siblings, 0 replies; 47+ messages in thread
From: Vincent Guittot @ 2014-08-01  9:28 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Peter Zijlstra, mingo, linux-kernel, Paul Turner,
	Benjamin Segall, arjan.van.de.ven, Len Brown, rafael.j.wysocki,
	alan.cox, Gross, Mark, fengguang.wu, Morten Rasmussen

On 31 July 2014 21:16, Yuyang Du <yuyang.du@intel.com> wrote:
> Hi Vincent,
>
> On Thu, Jul 31, 2014 at 11:56:13AM +0200, Vincent Guittot wrote:
>>
>> load_sum is now the average runnable time before being weighted
>
> So when weight changes, load_avg will completely use new weight. I have
> some cents:
>
> 1) Task does not change weight much, so it is practically ok
>
> 2) Group entity does change weight much, and very likely back and forth,
>    so I really think keeping the intact history will make everything
>    more predictable/stable, prevent thrashing, etc.
>
> 3) If you do the same for cfs_rq->load.weight, then we simply abandoned
>    blocked entities, and all states won't compute. So we then need to
>    maintain blocked load average again, and we just can't do cfs_rq load
>    average as a whole anymore, but must update at the granularity of an
>    entity...
>
> Anyway, it does not seem to me you really need to change load_sum, no? So
> could you please not change it?
>
>> The sum of usage_sum of the tasks that are on a rq, is used to detect
>> the overload of a rq.
>
> I think you only need usage_sum for task and rq, but not cfs_rq. Others
> are ok.

yes, only usage_sum is useful for my rework of cpu_power

>
>> Does something like the patch below to be applied of top of your patchset, seem
>> reasonable add-on?
>>
>
> If you only add running statistics, I am all good, and indeed reasonable if
> you can make good use of it. I am not at all against adding anything or
> adding running average or unweighted anything...

ok. Thanks

Vincent

>
> Thanks,
> Yuyang

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2014-08-01  9:28 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-17 23:26 [PATCH 0/2 v4] sched: Rewrite per entity runnable load average tracking Yuyang Du
2014-07-17 23:26 ` [PATCH 1/2 v4] sched: Remove update_rq_runnable_avg Yuyang Du
2014-07-17 23:26 ` [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking Yuyang Du
2014-07-18  9:43   ` Vincent Guittot
2014-07-27 17:36     ` Yuyang Du
2014-07-29  9:12       ` Vincent Guittot
2014-07-29  1:43         ` Yuyang Du
2014-07-29 13:17           ` Vincent Guittot
2014-07-29 22:27             ` Yuyang Du
2014-07-30  8:30               ` Peter Zijlstra
2014-07-30  0:40                 ` Yuyang Du
2014-07-29  9:39         ` Peter Zijlstra
2014-07-29  1:53           ` Yuyang Du
2014-07-29 13:35             ` Peter Zijlstra
2014-07-29 15:55               ` Peter Zijlstra
2014-07-29 23:08               ` Yuyang Du
2014-07-31  9:40             ` Vincent Guittot
2014-07-31  9:56             ` [PATCH 2/2 v4] sched: Rewrite per entity runnable load average Vincent Guittot
2014-07-31 19:16               ` Yuyang Du
2014-08-01  9:28                 ` Vincent Guittot
2014-07-28 10:48   ` [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking Peter Zijlstra
2014-07-29  0:56     ` Yuyang Du
2014-07-29 13:15       ` Peter Zijlstra
2014-07-28 11:39   ` Peter Zijlstra
2014-07-29  1:09     ` Yuyang Du
2014-07-29 13:19       ` Peter Zijlstra
2014-07-28 12:01   ` Peter Zijlstra
2014-07-28 13:51   ` Peter Zijlstra
2014-07-28 16:58     ` bsegall
2014-07-28 17:19       ` Peter Zijlstra
2014-07-29  1:13         ` Yuyang Du
2014-07-18 15:39 ` [PATCH 0/2 " Morten Rasmussen
2014-07-27 19:02   ` Yuyang Du
2014-07-28 10:38     ` Peter Zijlstra
2014-07-29  1:17       ` Yuyang Du
2014-07-29 13:06         ` Peter Zijlstra
2014-07-30 10:13     ` Morten Rasmussen
2014-07-30 10:21       ` Peter Zijlstra
2014-07-30 10:57         ` Morten Rasmussen
2014-07-30 19:17       ` Yuyang Du
2014-07-31  8:54         ` Morten Rasmussen
2014-07-31  2:15           ` Yuyang Du
2014-07-20  5:46 ` Mike Galbraith
2014-07-27 19:34   ` Yuyang Du
2014-07-28  7:49     ` Mike Galbraith
2014-07-28  0:01       ` Yuyang Du
2014-07-28  8:55     ` Peter Zijlstra
