* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
       [not found] <1356588535-23251-1-git-send-email-wangyun@linux.vnet.ibm.com>
@ 2013-01-09  9:28 ` Michael Wang
  2013-01-12  8:01   ` Mike Galbraith
  0 siblings, 1 reply; 57+ messages in thread
From: Michael Wang @ 2013-01-09  9:28 UTC (permalink / raw)
  To: Michael Wang; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On 12/27/2012 02:08 PM, Michael Wang wrote:
> This patch set tries to simplify select_task_rq_fair() with a
> schedule balance map.
> 
> After getting rid of the complex code and reorganizing the logic, pgbench
> shows the improvement.
> 
> 	Prev:
> 		| db_size | clients |  tps  |
> 		+---------+---------+-------+
> 		| 22 MB   |       1 |  4437 |
> 		| 22 MB   |      16 | 51351 |
> 		| 22 MB   |      32 | 49959 |
> 		| 7484 MB |       1 |  4078 |
> 		| 7484 MB |      16 | 44681 |
> 		| 7484 MB |      32 | 42463 |
> 		| 15 GB   |       1 |  3992 |
> 		| 15 GB   |      16 | 44107 |
> 		| 15 GB   |      32 | 41797 |
> 
> 	Post:
> 		| db_size | clients |  tps  |
> 		+---------+---------+-------+
> 		| 22 MB   |       1 | 11053 |		+149.11%
> 		| 22 MB   |      16 | 55671 |		+8.41%
> 		| 22 MB   |      32 | 52596 |		+5.28%
> 		| 7483 MB |       1 |  8180 |		+100.59%
> 		| 7483 MB |      16 | 48392 |		+8.31%
> 		| 7483 MB |      32 | 44185 |		+0.18%
> 		| 15 GB   |       1 |  8127 |		+103.58%
> 		| 15 GB   |      16 | 48156 |		+9.18%
> 		| 15 GB   |      32 | 43387 |		+3.8%
> 
> Please check the patches for more details about the schedule balance map;
> they are currently based on linux-next 3.7.0-rc6 and will be rebased onto
> the tip tree in a follow-up version.
> 
> Comments are very welcome.

Could I get some comments for this patch set?

Regards,
Michael Wang

> 
> Tested with:
> 	12 cpu X86 server and linux-next 3.7.0-rc6.
> 
> Michael Wang (2):
> 	[PATCH 1/2] sched: schedule balance map foundation
> 	[PATCH 2/2] sched: simplify select_task_rq_fair() with schedule balance map
> 
> Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
> ---
>  core.c  |   61 +++++++++++++++++++++++++++++
>  fair.c  |  133 +++++++++++++++++++++++++++++++++-------------------------------
>  sched.h |   28 +++++++++++++
>  3 files changed, 159 insertions(+), 63 deletions(-)
> 
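
For readers who don't have the patches at hand, here is a rough,
self-contained sketch of the kind of per-cpu lookup table a "schedule
balance map" could be.  Everything below (names, fields, topology) is an
illustrative assumption rather than code from the actual series; the point
is only that select_task_rq_fair() can index a precomputed table instead of
walking the sched_domain hierarchy on every wakeup.

/*
 * Toy illustration only -- not the actual patches.
 * Build with:  cc -std=c99 -o sbm sbm.c
 */
#include <stdio.h>

#define NR_CPUS    4
#define NR_LEVELS  2	/* level 0: shared cache, level 1: whole machine */

struct domain {
	const char *name;
	int span[NR_CPUS];	/* 1 if this domain covers the cpu */
};

static struct domain cache01 = { "cache0-1", { 1, 1, 0, 0 } };
static struct domain cache23 = { "cache2-3", { 0, 0, 1, 1 } };
static struct domain machine = { "machine",  { 1, 1, 1, 1 } };

/* sbm[cpu][level]: the domain containing 'cpu' at that topology level,
 * filled in once when the domains are built. */
static struct domain *sbm[NR_CPUS][NR_LEVELS];

static void build_sbm(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		sbm[cpu][0] = (cpu < 2) ? &cache01 : &cache23;
		sbm[cpu][1] = &machine;
	}
}

/* Wakeup fast path: a bounded table walk instead of chasing parent
 * pointers to find the lowest domain spanning both cpus. */
static struct domain *lowest_common_domain(int waking_cpu, int prev_cpu)
{
	for (int level = 0; level < NR_LEVELS; level++) {
		struct domain *d = sbm[waking_cpu][level];

		if (d->span[prev_cpu])
			return d;
	}
	return NULL;
}

int main(void)
{
	build_sbm();
	printf("cpu1 waking a task last run on cpu0 -> %s\n",
	       lowest_common_domain(1, 0)->name);
	printf("cpu1 waking a task last run on cpu3 -> %s\n",
	       lowest_common_domain(1, 3)->name);
	return 0;
}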



* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-09  9:28 ` [RFC PATCH 0/2] sched: simplify the select_task_rq_fair() Michael Wang
@ 2013-01-12  8:01   ` Mike Galbraith
  2013-01-12 10:19     ` Mike Galbraith
  2013-01-15  2:46     ` Michael Wang
  0 siblings, 2 replies; 57+ messages in thread
From: Mike Galbraith @ 2013-01-12  8:01 UTC (permalink / raw)
  To: Michael Wang; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On Wed, 2013-01-09 at 17:28 +0800, Michael Wang wrote: 
> On 12/27/2012 02:08 PM, Michael Wang wrote:
> > This patch set tries to simplify select_task_rq_fair() with a
> > schedule balance map.
> > 
> > After getting rid of the complex code and reorganizing the logic, pgbench
> > shows the improvement.
> > 
> > 	Prev:
> > 		| db_size | clients |  tps  |
> > 		+---------+---------+-------+
> > 		| 22 MB   |       1 |  4437 |
> > 		| 22 MB   |      16 | 51351 |
> > 		| 22 MB   |      32 | 49959 |
> > 		| 7484 MB |       1 |  4078 |
> > 		| 7484 MB |      16 | 44681 |
> > 		| 7484 MB |      32 | 42463 |
> > 		| 15 GB   |       1 |  3992 |
> > 		| 15 GB   |      16 | 44107 |
> > 		| 15 GB   |      32 | 41797 |
> > 
> > 	Post:
> > 		| db_size | clients |  tps  |
> > 		+---------+---------+-------+
> > 		| 22 MB   |       1 | 11053 |		+149.11%
> > 		| 22 MB   |      16 | 55671 |		+8.41%
> > 		| 22 MB   |      32 | 52596 |		+5.28%
> > 		| 7483 MB |       1 |  8180 |		+100.59%
> > 		| 7483 MB |      16 | 48392 |		+8.31%
> > 		| 7483 MB |      32 | 44185 |		+0.18%
> > 		| 15 GB   |       1 |  8127 |		+103.58%
> > 		| 15 GB   |      16 | 48156 |		+9.18%
> > 		| 15 GB   |      32 | 43387 |		+3.8%
> > 
> > Please check the patches for more details about the schedule balance map;
> > they are currently based on linux-next 3.7.0-rc6 and will be rebased onto
> > the tip tree in a follow-up version.
> > 
> > Comments are very welcome.
> 
> Could I get some comments for this patch set?

I kinda like it.  It doesn't bounce buddies all over a large package at
low load, doesn't have a tbench dip at clients=cores with HT enabled
that my idle buddy patch does, and your pgbench numbers look very nice.
It's not as good at ramp as idle buddies, but is an improvement over
mainline for both tbench and pgbench.  Cool.

It'll schedule client/server cross node sometimes with you preferring to
leave wakee near prev_cpu, but that's one of those things that can bite
whichever choice you make.  It kills the bounce problem, can't hurt
little boxen, and may help big boxen more often than it hurts, who
knows.

Some tbench numbers:

I had to plug it into 3.0 to play with it; the 3.6-stable kernel I had
been using on the 4x10 core box is misbehaving.

mainline = upstream select_idle_sibling()
idle_buddy = upstream select_idle_sibling() with 37407ea7 reverted

clients                    1          5         10        20         40         80        160
3.0.57-mainline        30.76     146.29    1569.48   4396.10    7851.87   14065.90   14128.40
3.0.57-idle_buddy     291.69    1448.13    2874.62   5329.49    7311.44   13582.20   13927.50
3.0.57-mainline+wang  292.41    1085.70    2048.62   4342.16    8280.17   13494.60   13435.50

It'd be nice to see more numbers; likely there will be plus/minus all over
the map, but from my quick test drive, generic behavior looks healthier.

-Mike



* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-12  8:01   ` Mike Galbraith
@ 2013-01-12 10:19     ` Mike Galbraith
  2013-01-14  9:21       ` Mike Galbraith
  2013-01-15  2:46     ` Michael Wang
  1 sibling, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2013-01-12 10:19 UTC (permalink / raw)
  To: Michael Wang; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra


aim7 compute

Tasks    jobs/min  jti  jobs/min/task      real       cpu
    1      440.41  100       440.4070     13.76      3.65
    5     1923.81   99       384.7619     15.75     26.17
   10     4223.00   99       422.2997     14.35     41.66
   20     7632.24   87       381.6121     15.88     55.85
   40    16378.38   97       409.4595     14.80    169.63
   80    32934.78   99       411.6848     14.72    364.50
  160    63789.47   98       398.6842     15.20    398.55
  320   121200.00   98       378.7500     16.00    471.10
  640   213803.75   96       334.0684     18.14    640.05
 1280   323334.72   93       252.6053     23.99   1106.92
 2560   425846.83   87       166.3464     36.43   2097.59

+simplify select_task_rq_fair()

Tasks    jobs/min  jti  jobs/min/task      real       cpu
    1      441.37  100       441.3693     13.73      3.71
    5     1861.18   99       372.2359     16.28     28.87
   10     2777.27   99       277.7269     21.82     93.90
   20     4069.85   83       203.4923     29.78    140.40
   40     6500.40   69       162.5101     37.29    355.66
   80    31338.07   96       391.7259     15.47    316.10
  160    63207.30   98       395.0456     15.34    405.22
  320   122501.58   96       382.8174     15.83    450.57
  640   219863.95   91       343.5374     17.64    591.74
 1280   333339.06   84       260.4211     23.27   1045.13
 2560   428315.85   79       167.3109     36.22   2075.88

Hm, low end takes a big hit.




* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-12 10:19     ` Mike Galbraith
@ 2013-01-14  9:21       ` Mike Galbraith
  2013-01-15  3:10         ` Michael Wang
  2013-01-17  5:55         ` Michael Wang
  0 siblings, 2 replies; 57+ messages in thread
From: Mike Galbraith @ 2013-01-14  9:21 UTC (permalink / raw)
  To: Michael Wang; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On Sat, 2013-01-12 at 11:19 +0100, Mike Galbraith wrote:

> Hm, low end takes a big hit.

Bah, that's perturbations and knobs.

aim7 compute, three individual runs + average

Stock scheduler knobs..

3.8-wang                                    avg     3.8-virgin                          avg    vs wang
Tasks    jobs/min                      jobs/min                                    jobs/min
    1      435.97    433.48    433.48    434.31        436.91    436.60    434.41    435.97      1.003
    5     2108.56   2120.36   2153.52   2127.48       2239.47   2257.82   2285.07   2260.78      1.062
   10     4205.41   4167.81   4294.83   4222.68       4223.00   4199.58   4252.63   4225.07      1.000
   20     8511.24   8434.24   8614.07   8519.85       8523.21   8505.26   8931.47   8653.31      1.015
   40    13209.81   6389.04   5308.80   8302.55      13011.27  13131.09  13788.40  13310.25      1.603
   80    12239.33  17797.36  20438.45  16825.04      15380.71  14372.96  14080.74  13921.31       .827
  160    52638.44  52609.88  37364.16  47537.49      26644.68  44826.63  41703.23  37724.84       .793
  320   105162.69 111512.36 105909.34 107528.13     102386.48 106141.22 103424.00 103983.90       .967
  640   207290.22 207623.13 204556.96 206490.10     196673.43 193243.65 190210.89 193375.99       .936
 1280   329795.92 326739.68 328399.66 328311.75     305867.51 307931.72 305988.17 306595.80       .933
 2560   414580.44 418156.33 413035.14 415257.30     404000.00 403894.82 402428.02 403440.94       .971

Twiddled knobs..
sched_latency_ns = 24ms
sched_min_granularity_ns = 8ms
sched_wakeup_granularity_ns = 10ms

3.8-wang                                    avg     3.8-virgin                          avg    vs wang
Tasks    jobs/min                      jobs/min                                    jobs/min
    1      437.23    437.23    436.91    437.12        437.86    439.45    438.18    438.49      1.003
    5     2102.71   2121.85   2130.80   2118.45       2223.04   2165.83   2314.74   2234.53      1.054
   10     4282.69   4252.63   4378.61   4304.64       4310.10   4303.98   4310.10   4308.06      1.000
   20     8675.73   8650.96   8725.70   8684.13       8595.74   8638.63   8725.70   8653.35       .996
   40    16546.08  16512.26  16546.08  16534.80      17022.47  16798.34  16717.24  16846.01      1.018
   80    32712.55  32602.56  32493.30  32602.80      33137.39  33137.39  32890.09  33054.95      1.013
  160    63372.55  63125.00  63663.82  63387.12      64510.98  64382.47  64084.60  64326.01      1.014
  320   121885.61 122656.55 121503.76 122015.30     121124.30 121885.61 121732.58 121580.83       .996
  640   218010.12 216066.85 217034.14 217037.03     213450.74 212864.98 212282.43 212866.05       .980
 1280   332339.33 332197.00 332624.36 332386.89     325915.97 325505.67 325232.70 325551.44       .979
 2560   426901.49 426666.67 427254.20 426940.78     424448.70 425263.16 424564.86 424758.90       .994

Much better, ~no difference between kernels for this load.

Except patched 3.8-rc3 kernel crashes on reboot. 

Please stand by while rebooting the system...
[  123.104064] kvm: exiting hardware virtualization
[  123.302908] Disabling[  124.729877] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
[  124.758804] IP: [<ffffffff810821f6>] wake_affine+0x26/0x2f0
[  124.785634] PGD e7089b067 PUD e736f7067 PMD 0 
[  124.810176] Oops: 0000 [#1] SMP 
[  124.829767] Modules linked in: iptable_filter ip_tables x_tables nfsv3 nfs_acl nfs fscache lockd sunrpc autofs4 edd af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod coretemp kvm_intel iTCO_wdt kvm iTCO_vendor_support i7core_edac igb ioatdma lpc_ich tpm_tis ptp crc32c_intel ipv6 joydev edac_core mfd_core pps_core dca microcode tpm hid_generic i2c_i801 tpm_bios ehci_pci acpi_memhotplug sr_mod container pcspkr sg cdrom button rtc_cmos ext3 jbd mbcache mgag200 ttm drm_kms_helper drm i2c_algo_bit sysimgblt sysfillrect i2c_core syscopyarea usbhid hid uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif processor thermal_sys hwmon scsi_dh_alua scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh ata_generic ata_piix libata megaraid_sas scsi_mod
[  125.093116] CPU 36 
[  125.097498] Pid: 0, comm: swapper/36 Not tainted 3.8.0-wang #58 QCI QSSC-S4R/QSSC-S4R
[  125.148934] RIP: 0010:[<ffffffff810821f6>]  [<ffffffff810821f6>] wake_affine+0x26/0x2f0
[  125.183856] RSP: 0018:ffff88046d9dfc70  EFLAGS: 00010082
[  125.213390] RAX: 0000000000000001 RBX: 0000000000000024 RCX: 0000000000000046
[  125.247203] RDX: 0000000000000000 RSI: ffff88046d946280 RDI: 0000000000000000
[  125.280734] RBP: ffff88046d9dfce8 R08: 0000000000000000 R09: 0000000000000000
[  125.317137] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88046fa53980
[  125.354487] R13: 0000000000000024 R14: 0000000000000006 R15: ffff88046d946280
[  125.391549] FS:  0000000000000000(0000) GS:ffff88046fa40000(0000) knlGS:0000000000000000
[  125.431203] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  125.465018] CR2: 0000000000000040 CR3: 0000000e712a1000 CR4: 00000000000007e0
[  125.501827] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  125.536604] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  125.570415] Process swapper/36 (pid: 0, threadinfo ffff88046d9de000, task ffff88046d9dc580)
[  125.606630] Stack:
[  125.629700]  ffffffff810895d9 ffff88046d9dfc88 ffff88046fc52f70 000000006d9462c8
[  125.664431]  0000000000000000 ffff88046d9dfcc8 ffffffff8108a004 ffff88046d9dfce8
[  125.699339]  ffffffff81086134 000000246d9dfcd8 0000000000000024 ffff88046fa53980
[  125.734981] Call Trace:
[  125.758061]  [<ffffffff810895d9>] ? enqueue_entity+0x229/0xa40
[  125.790423]  [<ffffffff8108a004>] ? enqueue_task_fair+0x214/0x560
[  125.823023]  [<ffffffff81086134>] ? select_idle_sibling+0xf4/0x120
[  125.856434]  [<ffffffff810863a9>] select_task_rq_fair+0x249/0x280
[  125.892564]  [<ffffffff8102d056>] ? native_apic_msr_write+0x36/0x40
[  125.925262]  [<ffffffff8107fbbb>] try_to_wake_up+0x12b/0x2b0
[  125.956939]  [<ffffffff8107fd4d>] default_wake_function+0xd/0x10
[  125.989521]  [<ffffffff8106d031>] autoremove_wake_function+0x11/0x40
[  126.022899]  [<ffffffff81075e1a>] __wake_up_common+0x5a/0x90
[  126.054874]  [<ffffffff810794a3>] __wake_up+0x43/0x70
[  126.085086]  [<ffffffff810e2869>] force_quiescent_state+0xe9/0x130
[  126.117469]  [<ffffffff810e420e>] rcu_prepare_for_idle+0x27e/0x480
[  126.150317]  [<ffffffff810e444d>] rcu_eqs_enter_common+0x3d/0x100
[  126.182428]  [<ffffffff810e4642>] rcu_idle_enter+0x92/0xe0
[  126.213041]  [<ffffffff8100abd8>] cpu_idle+0x78/0xd0
[  126.242939]  [<ffffffff8149bcce>] start_secondary+0x7a/0x7c
[  126.273874] Code: 00 00 00 00 00 55 48 89 e5 48 83 ec 78 4c 89 7d f8 89 55 a4 49 89 f7 48 89 5d d8 4c 89 65 e0 4c 89 6d e8 4c 89 75 f0 48 89 7d a8 <8b> 47 40 65 44 8b 04 25 20 b0 00 00 89 45 c8 48 8b 46 08 48 c7 
[  126.358480] RIP  [<ffffine+0x26/0x2f0[  126.392023]  RSP <ffff88046d9dfc70>
[  126.422108] CR2: 0000000000000040
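
For what it's worth, the faulting code bytes above decode to
"mov 0x40(%rdi),%eax": with RDI = 0 that is a load from address 0x40, which
matches CR2.  In other words wake_affine() appears to have been handed a
NULL first argument (its sched_domain pointer) and faulted on its first
field access.  A hypothetical C equivalent of the crash, with a made-up
field name and layout:

/* Hypothetical illustration only.  The first argument arrives in RDI; it
 * is NULL here, and the first thing the function reads is a 32-bit field
 * at offset 0x40 of it -- hence CR2 = 0x40. */
struct sd_like {
	char pad[0x40];
	int  field_at_0x40;
};

static int wake_affine_like(struct sd_like *sd)
{
	return sd->field_at_0x40;	/* sd == NULL => fault at address 0x40 */
}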




* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-12  8:01   ` Mike Galbraith
  2013-01-12 10:19     ` Mike Galbraith
@ 2013-01-15  2:46     ` Michael Wang
  1 sibling, 0 replies; 57+ messages in thread
From: Michael Wang @ 2013-01-15  2:46 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On 01/12/2013 04:01 PM, Mike Galbraith wrote:
> On Wed, 2013-01-09 at 17:28 +0800, Michael Wang wrote: 
>> On 12/27/2012 02:08 PM, Michael Wang wrote:
>>> This patch set tries to simplify select_task_rq_fair() with a
>>> schedule balance map.
>>>
>>> After getting rid of the complex code and reorganizing the logic, pgbench
>>> shows the improvement.
>>>
>>> 	Prev:
>>> 		| db_size | clients |  tps  |
>>> 		+---------+---------+-------+
>>> 		| 22 MB   |       1 |  4437 |
>>> 		| 22 MB   |      16 | 51351 |
>>> 		| 22 MB   |      32 | 49959 |
>>> 		| 7484 MB |       1 |  4078 |
>>> 		| 7484 MB |      16 | 44681 |
>>> 		| 7484 MB |      32 | 42463 |
>>> 		| 15 GB   |       1 |  3992 |
>>> 		| 15 GB   |      16 | 44107 |
>>> 		| 15 GB   |      32 | 41797 |
>>>
>>> 	Post:
>>> 		| db_size | clients |  tps  |
>>> 		+---------+---------+-------+
>>> 		| 22 MB   |       1 | 11053 |		+149.11%
>>> 		| 22 MB   |      16 | 55671 |		+8.41%
>>> 		| 22 MB   |      32 | 52596 |		+5.28%
>>> 		| 7483 MB |       1 |  8180 |		+100.59%
>>> 		| 7483 MB |      16 | 48392 |		+8.31%
>>> 		| 7483 MB |      32 | 44185 |		+0.18%
>>> 		| 15 GB   |       1 |  8127 |		+103.58%
>>> 		| 15 GB   |      16 | 48156 |		+9.18%
>>> 		| 15 GB   |      32 | 43387 |		+3.8%
>>>
>>> Please check the patches for more details about the schedule balance map;
>>> they are currently based on linux-next 3.7.0-rc6 and will be rebased onto
>>> the tip tree in a follow-up version.
>>>
>>> Comments are very welcome.
>>
>> Could I get some comments for this patch set?
> 
> I kinda like it.  It doesn't bounce buddies all over a large package at
> low load, doesn't have a tbench dip at clients=cores with HT enabled
> that my idle buddy patch does, and your pgbench numbers look very nice.
> It's not as good at ramp as idle buddies, but is an improvement over
> mainline for both tbench and pgbench.  Cool.
> 
> It'll schedule client/server cross node sometimes with you preferring to
> leave wakee near prev_cpu, but that's one of those things that can bite
> whichever choice you make.  It kills the bounce problem, can't hurt
> little boxen, and may help big boxen more often than it hurts, who
> knows.
> 
> Some tbench numbers:
> 
> I had to plug it into 3.0 to play with it; the 3.6-stable kernel I had
> been using on the 4x10 core box is misbehaving.

Hi, Mike

Thanks for your reply and test results.

> 
> mainline = upstream select_idle_sibling()
> idle_buddy = upstream select_idle_sibling() with 37407ea7 reverted
> 
> clients                    1          5         10        20         40         80        160
> 3.0.57-mainline        30.76     146.29    1569.48   4396.10    7851.87   14065.90   14128.40
> 3.0.57-idle_buddy     291.69    1448.13    2874.62   5329.49    7311.44   13582.20   13927.50
> 3.0.57-mainline+wang  292.41    1085.70    2048.62   4342.16    8280.17   13494.60   13435.50
> 
> It'd be nice to see more numbers; likely there will be plus/minus all over
> the map, but from my quick test drive, generic behavior looks healthier.

It's good to know that you like the idea, I will re-base the code on
latest tip tree and do more test on it.

Regards,
Michael Wang

> 
> -Mike
> 
> 



* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-14  9:21       ` Mike Galbraith
@ 2013-01-15  3:10         ` Michael Wang
  2013-01-15  4:52           ` Mike Galbraith
  2013-01-17  5:55         ` Michael Wang
  1 sibling, 1 reply; 57+ messages in thread
From: Michael Wang @ 2013-01-15  3:10 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On 01/14/2013 05:21 PM, Mike Galbraith wrote:
> On Sat, 2013-01-12 at 11:19 +0100, Mike Galbraith wrote:
> 
>> Hm, low end takes a big hit.
> 
> Bah, that's perturbations and knobs.
> 
> aim7 compute, three individual runs + average
> 
> Stock scheduler knobs..
> 
> 3.8-wang                                    avg     3.8-virgin                          avg    vs wang
> Tasks    jobs/min                      jobs/min                                    jobs/min
>     1      435.97    433.48    433.48    434.31        436.91    436.60    434.41    435.97      1.003
>     5     2108.56   2120.36   2153.52   2127.48       2239.47   2257.82   2285.07   2260.78      1.062
>    10     4205.41   4167.81   4294.83   4222.68       4223.00   4199.58   4252.63   4225.07      1.000
>    20     8511.24   8434.24   8614.07   8519.85       8523.21   8505.26   8931.47   8653.31      1.015
>    40    13209.81   6389.04   5308.80   8302.55      13011.27  13131.09  13788.40  13310.25      1.603
>    80    12239.33  17797.36  20438.45  16825.04      15380.71  14372.96  14080.74  13921.31       .827
>   160    52638.44  52609.88  37364.16  47537.49      26644.68  44826.63  41703.23  37724.84       .793
>   320   105162.69 111512.36 105909.34 107528.13     102386.48 106141.22 103424.00 103983.90       .967
>   640   207290.22 207623.13 204556.96 206490.10     196673.43 193243.65 190210.89 193375.99       .936
>  1280   329795.92 326739.68 328399.66 328311.75     305867.51 307931.72 305988.17 306595.80       .933
>  2560   414580.44 418156.33 413035.14 415257.30     404000.00 403894.82 402428.02 403440.94       .971
> 
> Twiddled knobs..
> sched_latency_ns = 24ms
> sched_min_granularity_ns = 8ms
> sched_wakeup_granularity_ns = 10ms
> 
> 3.8-wang                                    avg     3.8-virgin                          avg    vs wang
> Tasks    jobs/min                      jobs/min                                    jobs/min
>     1      437.23    437.23    436.91    437.12        437.86    439.45    438.18    438.49      1.003
>     5     2102.71   2121.85   2130.80   2118.45       2223.04   2165.83   2314.74   2234.53      1.054
>    10     4282.69   4252.63   4378.61   4304.64       4310.10   4303.98   4310.10   4308.06      1.000
>    20     8675.73   8650.96   8725.70   8684.13       8595.74   8638.63   8725.70   8653.35       .996
>    40    16546.08  16512.26  16546.08  16534.80      17022.47  16798.34  16717.24  16846.01      1.018
>    80    32712.55  32602.56  32493.30  32602.80      33137.39  33137.39  32890.09  33054.95      1.013
>   160    63372.55  63125.00  63663.82  63387.12      64510.98  64382.47  64084.60  64326.01      1.014
>   320   121885.61 122656.55 121503.76 122015.30     121124.30 121885.61 121732.58 121580.83       .996
>   640   218010.12 216066.85 217034.14 217037.03     213450.74 212864.98 212282.43 212866.05       .980
>  1280   332339.33 332197.00 332624.36 332386.89     325915.97 325505.67 325232.70 325551.44       .979
>  2560   426901.49 426666.67 427254.20 426940.78     424448.70 425263.16 424564.86 424758.90       .994
> 
> Much better, ~no difference between kernels for this load.

Thanks for the testing. Could you please tell me which benchmark
generated these results?

I will try to re-base the patches onto 3.8-rc3 with the NUMA domain taken
into consideration; I suppose that could fix the BUG below and provide
better results.

Regards,
Michael Wang

> 
> Except patched 3.8-rc3 kernel crashes on reboot. 
> 
> Please stand by while rebooting the system...
> [  123.104064] kvm: exiting hardware virtualization
> [  123.302908] Disabling[  124.729877] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
> [  124.758804] IP: [<ffffffff810821f6>] wake_affine+0x26/0x2f0
> [  124.785634] PGD e7089b067 PUD e736f7067 PMD 0 
> [  124.810176] Oops: 0000 [#1] SMP 
> [  124.829767] Modules linked in: iptable_filter ip_tables x_tables nfsv3 nfs_acl nfs fscache lockd sunrpc autofs4 edd af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod coretemp kvm_intel iTCO_wdt kvm iTCO_vendor_support i7core_edac igb ioatdma lpc_ich tpm_tis ptp crc32c_intel ipv6 joydev edac_core mfd_core pps_core dca microcode tpm hid_generic i2c_i801 tpm_bios ehci_pci acpi_memhotplug sr_mod container pcspkr sg cdrom button rtc_cmos ext3 jbd mbcache mgag200 ttm drm_kms_helper drm i2c_algo_bit sysimgblt sysfillrect i2c_core syscopyarea usbhid hid uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif processor thermal_sys hwmon scsi_dh_alua scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh ata_generic ata_piix libata megaraid_sas scsi_mod
> [  125.093116] CPU 36 
> [  125.097498] Pid: 0, comm: swapper/36 Not tainted 3.8.0-wang #58 QCI QSSC-S4R/QSSC-S4R
> [  125.148934] RIP: 0010:[<ffffffff810821f6>]  [<ffffffff810821f6>] wake_affine+0x26/0x2f0
> [  125.183856] RSP: 0018:ffff88046d9dfc70  EFLAGS: 00010082
> [  125.213390] RAX: 0000000000000001 RBX: 0000000000000024 RCX: 0000000000000046
> [  125.247203] RDX: 0000000000000000 RSI: ffff88046d946280 RDI: 0000000000000000
> [  125.280734] RBP: ffff88046d9dfce8 R08: 0000000000000000 R09: 0000000000000000
> [  125.317137] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88046fa53980
> [  125.354487] R13: 0000000000000024 R14: 0000000000000006 R15: ffff88046d946280
> [  125.391549] FS:  0000000000000000(0000) GS:ffff88046fa40000(0000) knlGS:0000000000000000
> [  125.431203] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  125.465018] CR2: 0000000000000040 CR3: 0000000e712a1000 CR4: 00000000000007e0
> [  125.501827] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  125.536604] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [  125.570415] Process swapper/36 (pid: 0, threadinfo ffff88046d9de000, task ffff88046d9dc580)
> [  125.606630] Stack:
> [  125.629700]  ffffffff810895d9 ffff88046d9dfc88 ffff88046fc52f70 000000006d9462c8
> [  125.664431]  0000000000000000 ffff88046d9dfcc8 ffffffff8108a004 ffff88046d9dfce8
> [  125.699339]  ffffffff81086134 000000246d9dfcd8 0000000000000024 ffff88046fa53980
> [  125.734981] Call Trace:
> [  125.758061]  [<ffffffff810895d9>] ? enqueue_entity+0x229/0xa40
> [  125.790423]  [<ffffffff8108a004>] ? enqueue_task_fair+0x214/0x560
> [  125.823023]  [<ffffffff81086134>] ? select_idle_sibling+0xf4/0x120
> [  125.856434]  [<ffffffff810863a9>] select_task_rq_fair+0x249/0x280
> [  125.892564]  [<ffffffff8102d056>] ? native_apic_msr_write+0x36/0x40
> [  125.925262]  [<ffffffff8107fbbb>] try_to_wake_up+0x12b/0x2b0
> [  125.956939]  [<ffffffff8107fd4d>] default_wake_function+0xd/0x10
> [  125.989521]  [<ffffffff8106d031>] autoremove_wake_function+0x11/0x40
> [  126.022899]  [<ffffffff81075e1a>] __wake_up_common+0x5a/0x90
> [  126.054874]  [<ffffffff810794a3>] __wake_up+0x43/0x70
> [  126.085086]  [<ffffffff810e2869>] force_quiescent_state+0xe9/0x130
> [  126.117469]  [<ffffffff810e420e>] rcu_prepare_for_idle+0x27e/0x480
> [  126.150317]  [<ffffffff810e444d>] rcu_eqs_enter_common+0x3d/0x100
> [  126.182428]  [<ffffffff810e4642>] rcu_idle_enter+0x92/0xe0
> [  126.213041]  [<ffffffff8100abd8>] cpu_idle+0x78/0xd0
> [  126.242939]  [<ffffffff8149bcce>] start_secondary+0x7a/0x7c
> [  126.273874] Code: 00 00 00 00 00 55 48 89 e5 48 83 ec 78 4c 89 7d f8 89 55 a4 49 89 f7 48 89 5d d8 4c 89 65 e0 4c 89 6d e8 4c 89 75 f0 48 89 7d a8 <8b> 47 40 65 44 8b 04 25 20 b0 00 00 89 45 c8 48 8b 46 08 48 c7 
> [  126.358480] RIP  [<ffffine+0x26/0x2f0[  126.392023]  RSP <ffff88046d9dfc70>
> [  126.422108] CR2: 0000000000000040
> 
> 



* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-15  3:10         ` Michael Wang
@ 2013-01-15  4:52           ` Mike Galbraith
  2013-01-15  8:26             ` Michael Wang
  0 siblings, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2013-01-15  4:52 UTC (permalink / raw)
  To: Michael Wang; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On Tue, 2013-01-15 at 11:10 +0800, Michael Wang wrote: 
> Thanks for the testing. Could you please tell me which benchmark
> generated these results?

aim7, using the compute workfile, and a datapoints file containing
$Tasks.  multitask -nl -f will prompt for the datapoints file.  You'll
have to bump /proc/sys/kernel/sem if it's left at default by your
distro.  If you want, I can send you a tarball offline.
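
As a concrete example (the exact file format is an assumption, but it
matches the task counts in the tables above), the datapoints file is just
one task count per line:

1
5
10
20
40
80
160
320
640
1280
2560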

> I will try to re-base the patches onto 3.8-rc3 with the NUMA domain taken
> into consideration; I suppose that could fix the BUG below and provide
> better results.

I don't _think_ you'll fix the sweet spot without setting knobs such
that last_buddy can do its thing, but go for it ;-)

-Mike



* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-15  4:52           ` Mike Galbraith
@ 2013-01-15  8:26             ` Michael Wang
  0 siblings, 0 replies; 57+ messages in thread
From: Michael Wang @ 2013-01-15  8:26 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On 01/15/2013 12:52 PM, Mike Galbraith wrote:
> On Tue, 2013-01-15 at 11:10 +0800, Michael Wang wrote: 
>> Thanks for the testing. Could you please tell me which benchmark
>> generated these results?
> 
> aim7, using the compute workfile, and a datapoints file containing
> $Tasks.  multitask -nl -f will prompt for the datapoints file.  You'll
> have to bump /proc/sys/kernel/sem if it's left at default by your
> distro.  If you want, I can send you a tarball offline.
> 
>> I will try to re-base the patches onto 3.8-rc3 with the NUMA domain taken
>> into consideration; I suppose that could fix the BUG below and provide
>> better results.
> 
> I don't _think_ you'll fix the sweet spot without setting knobs such
> that last_buddy can do its thing, but go for it ;-)

Hmm... I'm a little confused; let's discuss this on the next version if the
issues are still there :)

Regards,
Michael Wang

> 
> -Mike
> 
> 



* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-14  9:21       ` Mike Galbraith
  2013-01-15  3:10         ` Michael Wang
@ 2013-01-17  5:55         ` Michael Wang
  2013-01-20  4:09           ` Mike Galbraith
  1 sibling, 1 reply; 57+ messages in thread
From: Michael Wang @ 2013-01-17  5:55 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

Hi, Mike

I've sent out the v2, which I suppose will fix the below BUG and
perform better; please do let me know if it still causes issues on your
arm7 machine.

Regards,
Michael Wang


On 01/14/2013 05:21 PM, Mike Galbraith wrote:
> On Sat, 2013-01-12 at 11:19 +0100, Mike Galbraith wrote:
> 
>> Hm, low end takes a big hit.
> 
> Bah, that's perturbations and knobs.
> 
> aim7 compute, three individual runs + average
> 
> Stock scheduler knobs..
> 
> 3.8-wang                                    avg     3.8-virgin                          avg    vs wang
> Tasks    jobs/min                      jobs/min                                    jobs/min
>     1      435.97    433.48    433.48    434.31        436.91    436.60    434.41    435.97      1.003
>     5     2108.56   2120.36   2153.52   2127.48       2239.47   2257.82   2285.07   2260.78      1.062
>    10     4205.41   4167.81   4294.83   4222.68       4223.00   4199.58   4252.63   4225.07      1.000
>    20     8511.24   8434.24   8614.07   8519.85       8523.21   8505.26   8931.47   8653.31      1.015
>    40    13209.81   6389.04   5308.80   8302.55      13011.27  13131.09  13788.40  13310.25      1.603
>    80    12239.33  17797.36  20438.45  16825.04      15380.71  14372.96  14080.74  13921.31       .827
>   160    52638.44  52609.88  37364.16  47537.49      26644.68  44826.63  41703.23  37724.84       .793
>   320   105162.69 111512.36 105909.34 107528.13     102386.48 106141.22 103424.00 103983.90       .967
>   640   207290.22 207623.13 204556.96 206490.10     196673.43 193243.65 190210.89 193375.99       .936
>  1280   329795.92 326739.68 328399.66 328311.75     305867.51 307931.72 305988.17 306595.80       .933
>  2560   414580.44 418156.33 413035.14 415257.30     404000.00 403894.82 402428.02 403440.94       .971
> 
> Twiddled knobs..
> sched_latency_ns = 24ms
> sched_min_granularity_ns = 8ms
> sched_wakeup_granularity_ns = 10ms
> 
> 3.8-wang                                    avg     3.8-virgin                          avg    vs wang
> Tasks    jobs/min                      jobs/min                                    jobs/min
>     1      437.23    437.23    436.91    437.12        437.86    439.45    438.18    438.49      1.003
>     5     2102.71   2121.85   2130.80   2118.45       2223.04   2165.83   2314.74   2234.53      1.054
>    10     4282.69   4252.63   4378.61   4304.64       4310.10   4303.98   4310.10   4308.06      1.000
>    20     8675.73   8650.96   8725.70   8684.13       8595.74   8638.63   8725.70   8653.35       .996
>    40    16546.08  16512.26  16546.08  16534.80      17022.47  16798.34  16717.24  16846.01      1.018
>    80    32712.55  32602.56  32493.30  32602.80      33137.39  33137.39  32890.09  33054.95      1.013
>   160    63372.55  63125.00  63663.82  63387.12      64510.98  64382.47  64084.60  64326.01      1.014
>   320   121885.61 122656.55 121503.76 122015.30     121124.30 121885.61 121732.58 121580.83       .996
>   640   218010.12 216066.85 217034.14 217037.03     213450.74 212864.98 212282.43 212866.05       .980
>  1280   332339.33 332197.00 332624.36 332386.89     325915.97 325505.67 325232.70 325551.44       .979
>  2560   426901.49 426666.67 427254.20 426940.78     424448.70 425263.16 424564.86 424758.90       .994
> 
> Much better, ~no difference between kernels for this load.
> 
> Except patched 3.8-rc3 kernel crashes on reboot. 
> 
> Please stand by while rebooting the system...
> [  123.104064] kvm: exiting hardware virtualization
> [  123.302908] Disabling[  124.729877] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
> [  124.758804] IP: [<ffffffff810821f6>] wake_affine+0x26/0x2f0
> [  124.785634] PGD e7089b067 PUD e736f7067 PMD 0 
> [  124.810176] Oops: 0000 [#1] SMP 
> [  124.829767] Modules linked in: iptable_filter ip_tables x_tables nfsv3 nfs_acl nfs fscache lockd sunrpc autofs4 edd af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop dm_mod coretemp kvm_intel iTCO_wdt kvm iTCO_vendor_support i7core_edac igb ioatdma lpc_ich tpm_tis ptp crc32c_intel ipv6 joydev edac_core mfd_core pps_core dca microcode tpm hid_generic i2c_i801 tpm_bios ehci_pci acpi_memhotplug sr_mod container pcspkr sg cdrom button rtc_cmos ext3 jbd mbcache mgag200 ttm drm_kms_helper drm i2c_algo_bit sysimgblt sysfillrect i2c_core syscopyarea usbhid hid uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif processor thermal_sys hwmon scsi_dh_alua scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh ata_generic ata_piix libata megaraid_sas scsi_mod
> [  125.093116] CPU 36 
> [  125.097498] Pid: 0, comm: swapper/36 Not tainted 3.8.0-wang #58 QCI QSSC-S4R/QSSC-S4R
> [  125.148934] RIP: 0010:[<ffffffff810821f6>]  [<ffffffff810821f6>] wake_affine+0x26/0x2f0
> [  125.183856] RSP: 0018:ffff88046d9dfc70  EFLAGS: 00010082
> [  125.213390] RAX: 0000000000000001 RBX: 0000000000000024 RCX: 0000000000000046
> [  125.247203] RDX: 0000000000000000 RSI: ffff88046d946280 RDI: 0000000000000000
> [  125.280734] RBP: ffff88046d9dfce8 R08: 0000000000000000 R09: 0000000000000000
> [  125.317137] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88046fa53980
> [  125.354487] R13: 0000000000000024 R14: 0000000000000006 R15: ffff88046d946280
> [  125.391549] FS:  0000000000000000(0000) GS:ffff88046fa40000(0000) knlGS:0000000000000000
> [  125.431203] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  125.465018] CR2: 0000000000000040 CR3: 0000000e712a1000 CR4: 00000000000007e0
> [  125.501827] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  125.536604] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [  125.570415] Process swapper/36 (pid: 0, threadinfo ffff88046d9de000, task ffff88046d9dc580)
> [  125.606630] Stack:
> [  125.629700]  ffffffff810895d9 ffff88046d9dfc88 ffff88046fc52f70 000000006d9462c8
> [  125.664431]  0000000000000000 ffff88046d9dfcc8 ffffffff8108a004 ffff88046d9dfce8
> [  125.699339]  ffffffff81086134 000000246d9dfcd8 0000000000000024 ffff88046fa53980
> [  125.734981] Call Trace:
> [  125.758061]  [<ffffffff810895d9>] ? enqueue_entity+0x229/0xa40
> [  125.790423]  [<ffffffff8108a004>] ? enqueue_task_fair+0x214/0x560
> [  125.823023]  [<ffffffff81086134>] ? select_idle_sibling+0xf4/0x120
> [  125.856434]  [<ffffffff810863a9>] select_task_rq_fair+0x249/0x280
> [  125.892564]  [<ffffffff8102d056>] ? native_apic_msr_write+0x36/0x40
> [  125.925262]  [<ffffffff8107fbbb>] try_to_wake_up+0x12b/0x2b0
> [  125.956939]  [<ffffffff8107fd4d>] default_wake_function+0xd/0x10
> [  125.989521]  [<ffffffff8106d031>] autoremove_wake_function+0x11/0x40
> [  126.022899]  [<ffffffff81075e1a>] __wake_up_common+0x5a/0x90
> [  126.054874]  [<ffffffff810794a3>] __wake_up+0x43/0x70
> [  126.085086]  [<ffffffff810e2869>] force_quiescent_state+0xe9/0x130
> [  126.117469]  [<ffffffff810e420e>] rcu_prepare_for_idle+0x27e/0x480
> [  126.150317]  [<ffffffff810e444d>] rcu_eqs_enter_common+0x3d/0x100
> [  126.182428]  [<ffffffff810e4642>] rcu_idle_enter+0x92/0xe0
> [  126.213041]  [<ffffffff8100abd8>] cpu_idle+0x78/0xd0
> [  126.242939]  [<ffffffff8149bcce>] start_secondary+0x7a/0x7c
> [  126.273874] Code: 00 00 00 00 00 55 48 89 e5 48 83 ec 78 4c 89 7d f8 89 55 a4 49 89 f7 48 89 5d d8 4c 89 65 e0 4c 89 6d e8 4c 89 75 f0 48 89 7d a8 <8b> 47 40 65 44 8b 04 25 20 b0 00 00 89 45 c8 48 8b 46 08 48 c7 
> [  126.358480] RIP  [<ffffine+0x26/0x2f0[  126.392023]  RSP <ffff88046d9dfc70>
> [  126.422108] CR2: 0000000000000040
> 
> 



* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-17  5:55         ` Michael Wang
@ 2013-01-20  4:09           ` Mike Galbraith
  2013-01-21  2:50             ` Michael Wang
  0 siblings, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2013-01-20  4:09 UTC (permalink / raw)
  To: Michael Wang; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On Thu, 2013-01-17 at 13:55 +0800, Michael Wang wrote: 
> Hi, Mike
> 
> I've sent out the v2, which I suppose will fix the below BUG and
> perform better; please do let me know if it still causes issues on your
> arm7 machine.

s/arm7/aim7

Someone swiped half of CPUs/ram, so the box is now 2 10 core nodes vs 4.

stock scheduler knobs

3.8-wang-v2                                 avg     3.8-virgin                          avg    vs wang
Tasks    jobs/min
    1      436.29    435.66    435.97    435.97        437.86    441.69    440.09    439.88      1.008
    5     2361.65   2356.14   2350.66   2356.15       2416.27   2563.45   2374.61   2451.44      1.040
   10     4767.90   4764.15   4779.18   4770.41       4946.94   4832.54   4828.69   4869.39      1.020
   20     9672.79   9703.76   9380.80   9585.78       9634.34   9672.79   9727.13   9678.08      1.009
   40    19162.06  19207.61  19299.36  19223.01      19268.68  19192.40  19056.60  19172.56       .997
   80    37610.55  37465.22  37465.22  37513.66      37263.64  37120.98  37465.22  37283.28       .993
  160    69306.65  69655.17  69257.14  69406.32      69257.14  69306.65  69257.14  69273.64       .998
  320   111512.36 109066.37 111256.45 110611.72     108395.75 107913.19 108335.20 108214.71       .978
  640   142850.83 148483.92 150851.81 147395.52     151974.92 151263.65 151322.67 151520.41      1.027
 1280    52788.89  52706.39  67280.77  57592.01     189931.44 189745.60 189792.02 189823.02      3.295
 2560    75403.91  52905.91  45196.21  57835.34     217368.64 217582.05 217551.54 217500.74      3.760

sched_latency_ns = 24ms
sched_min_granularity_ns = 8ms
sched_wakeup_granularity_ns = 10ms

3.8-wang-v2                                 avg     3.8-virgin                          avg    vs wang
Tasks    jobs/min
    1      436.29    436.60    434.72    435.87        434.41    439.77    438.81    437.66      1.004
    5     2382.08   2393.36   2451.46   2408.96       2451.46   2453.44   2425.94   2443.61      1.014
   10     5029.05   4887.10   5045.80   4987.31       4844.12   4828.69   4844.12   4838.97       .970
   20     9869.71   9734.94   9758.45   9787.70       9513.34   9611.42   9565.90   9563.55       .977
   40    19146.92  19146.92  19192.40  19162.08      18617.51  18603.22  18517.95  18579.56       .969
   80    37177.91  37378.57  37292.31  37282.93      36451.13  36179.10  36233.18  36287.80       .973
  160    70260.87  69109.05  69207.71  69525.87      68281.69  68522.97  68912.58  68572.41       .986
  320   114745.56 113869.64 114474.62 114363.27     114137.73 114137.73 114137.73 114137.73       .998
  640   164338.98 164338.98 164618.00 164431.98     164130.34 164130.34 164130.34 164130.34       .998
 1280   209473.40 209134.54 209473.40 209360.44     210040.62 210040.62 210097.51 210059.58      1.003
 2560   242703.38 242627.46 242779.34 242703.39     244001.26 243847.85 243732.91 243860.67      1.004

As you can see, the load collapsed at the high load end with stock
scheduler knobs (desktop latency).  With knobs set to scale, the delta
disappeared.

I thought perhaps the bogus (shouldn't exist) CPU domain in mainline
somehow contributes to the strange behavioral delta, but killing it made
zero difference.  All of these numbers for both trees were logged with
the below applied, but as noted, it changed nothing.

From: Alex Shi <alex.shi@intel.com>
Date: Mon, 17 Dec 2012 09:42:57 +0800
Subject: [PATCH 01/18] sched: remove SD_PREFER_SIBLING flag

The flag was introduced in commit b5d978e0c7e79a. Its purpose seems to be
to fill up one node first on a NUMA machine by pulling tasks from other
nodes when the node has spare capacity.

Its advantage is that when a few tasks share memory among themselves,
pulling them together helps locality, so there is a performance gain. The
shortcoming is that it keeps unnecessary task migrations thrashing among
different nodes, which reduces that gain and simply hurts performance if
the tasks share no memory.

With the sched NUMA balancing patches coming, the small advantage is
meaningless to us, so better to remove this flag.

Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 include/linux/sched.h    |  1 -
 include/linux/topology.h |  2 --
 kernel/sched/core.c      |  1 -
 kernel/sched/fair.c      | 19 +------------------
 4 files changed, 1 insertion(+), 22 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5dafac3..6dca96c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -836,7 +836,6 @@ enum cpu_idle_type {
 #define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share cpu pkg resources */
 #define SD_SERIALIZE		0x0400	/* Only a single load balancing instance */
 #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
-#define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
 
 extern int __weak arch_sd_sibiling_asym_packing(void);
diff --git a/include/linux/topology.h b/include/linux/topology.h
index d3cf0d6..15864d1 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -100,7 +100,6 @@ int arch_update_cpu_topology(void);
 				| 1*SD_SHARE_CPUPOWER			\
 				| 1*SD_SHARE_PKG_RESOURCES		\
 				| 0*SD_SERIALIZE			\
-				| 0*SD_PREFER_SIBLING			\
 				| arch_sd_sibling_asym_packing()	\
 				,					\
 	.last_balance		= jiffies,				\
@@ -162,7 +161,6 @@ int arch_update_cpu_topology(void);
 				| 0*SD_SHARE_CPUPOWER			\
 				| 0*SD_SHARE_PKG_RESOURCES		\
 				| 0*SD_SERIALIZE			\
-				| 1*SD_PREFER_SIBLING			\
 				,					\
 	.last_balance		= jiffies,				\
 	.balance_interval	= 1,					\
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5dae0d2..8ed2784 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6014,7 +6014,6 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
 					| 0*SD_SHARE_CPUPOWER
 					| 0*SD_SHARE_PKG_RESOURCES
 					| 1*SD_SERIALIZE
-					| 0*SD_PREFER_SIBLING
 					| sd_local_flags(level)
 					,
 		.last_balance		= jiffies,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 59e072b..5d175f2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4339,13 +4339,9 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 static inline void update_sd_lb_stats(struct lb_env *env,
 					int *balance, struct sd_lb_stats *sds)
 {
-	struct sched_domain *child = env->sd->child;
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats sgs;
-	int load_idx, prefer_sibling = 0;
-
-	if (child && child->flags & SD_PREFER_SIBLING)
-		prefer_sibling = 1;
+	int load_idx;
 
 	load_idx = get_sd_load_idx(env->sd, env->idle);
 
@@ -4362,19 +4358,6 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 		sds->total_load += sgs.group_load;
 		sds->total_pwr += sg->sgp->power;
 
-		/*
-		 * In case the child domain prefers tasks go to siblings
-		 * first, lower the sg capacity to one so that we'll try
-		 * and move all the excess tasks away. We lower the capacity
-		 * of a group only if the local group has the capacity to fit
-		 * these excess tasks, i.e. nr_running < group_capacity. The
-		 * extra check prevents the case where you always pull from the
-		 * heaviest group when it is already under-utilized (possible
-		 * with a large weight task outweighs the tasks on the system).
-		 */
-		if (prefer_sibling && !local_group && sds->this_has_capacity)
-			sgs.group_capacity = min(sgs.group_capacity, 1UL);
-
 		if (local_group) {
 			sds->this_load = sgs.avg_load;
 			sds->this = sg;




* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-20  4:09           ` Mike Galbraith
@ 2013-01-21  2:50             ` Michael Wang
  2013-01-21  4:38               ` Mike Galbraith
  0 siblings, 1 reply; 57+ messages in thread
From: Michael Wang @ 2013-01-21  2:50 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On 01/20/2013 12:09 PM, Mike Galbraith wrote:
> On Thu, 2013-01-17 at 13:55 +0800, Michael Wang wrote: 
>> Hi, Mike
>>
>> I've sent out the v2, which I suppose will fix the below BUG and
>> perform better; please do let me know if it still causes issues on your
>> arm7 machine.
> 
> s/arm7/aim7
> 
> Someone swiped half of CPUs/ram, so the box is now 2 10 core nodes vs 4.
> 
> stock scheduler knobs
> 
> 3.8-wang-v2                                 avg     3.8-virgin                          avg    vs wang
> Tasks    jobs/min
>     1      436.29    435.66    435.97    435.97        437.86    441.69    440.09    439.88      1.008
>     5     2361.65   2356.14   2350.66   2356.15       2416.27   2563.45   2374.61   2451.44      1.040
>    10     4767.90   4764.15   4779.18   4770.41       4946.94   4832.54   4828.69   4869.39      1.020
>    20     9672.79   9703.76   9380.80   9585.78       9634.34   9672.79   9727.13   9678.08      1.009
>    40    19162.06  19207.61  19299.36  19223.01      19268.68  19192.40  19056.60  19172.56       .997
>    80    37610.55  37465.22  37465.22  37513.66      37263.64  37120.98  37465.22  37283.28       .993
>   160    69306.65  69655.17  69257.14  69406.32      69257.14  69306.65  69257.14  69273.64       .998
>   320   111512.36 109066.37 111256.45 110611.72     108395.75 107913.19 108335.20 108214.71       .978
>   640   142850.83 148483.92 150851.81 147395.52     151974.92 151263.65 151322.67 151520.41      1.027
>  1280    52788.89  52706.39  67280.77  57592.01     189931.44 189745.60 189792.02 189823.02      3.295
>  2560    75403.91  52905.91  45196.21  57835.34     217368.64 217582.05 217551.54 217500.74      3.760
> 
> sched_latency_ns = 24ms
> sched_min_granularity_ns = 8ms
> sched_wakeup_granularity_ns = 10ms
> 
> 3.8-wang-v2                                 avg     3.8-virgin                          avg    vs wang
> Tasks    jobs/min
>     1      436.29    436.60    434.72    435.87        434.41    439.77    438.81    437.66      1.004
>     5     2382.08   2393.36   2451.46   2408.96       2451.46   2453.44   2425.94   2443.61      1.014
>    10     5029.05   4887.10   5045.80   4987.31       4844.12   4828.69   4844.12   4838.97       .970
>    20     9869.71   9734.94   9758.45   9787.70       9513.34   9611.42   9565.90   9563.55       .977
>    40    19146.92  19146.92  19192.40  19162.08      18617.51  18603.22  18517.95  18579.56       .969
>    80    37177.91  37378.57  37292.31  37282.93      36451.13  36179.10  36233.18  36287.80       .973
>   160    70260.87  69109.05  69207.71  69525.87      68281.69  68522.97  68912.58  68572.41       .986
>   320   114745.56 113869.64 114474.62 114363.27     114137.73 114137.73 114137.73 114137.73       .998
>   640   164338.98 164338.98 164618.00 164431.98     164130.34 164130.34 164130.34 164130.34       .998
>  1280   209473.40 209134.54 209473.40 209360.44     210040.62 210040.62 210097.51 210059.58      1.003
>  2560   242703.38 242627.46 242779.34 242703.39     244001.26 243847.85 243732.91 243860.67      1.004
> 
> As you can see, the load collapsed at the high load end with stock
> scheduler knobs (desktop latency).  With knobs set to scale, the delta
> disappeared.

Thanks for the testing, Mike; please allow me to ask a few questions.

What are those tasks actually doing? What's the workload?

And I'm confused about how those new parameter values were figured out,
and how they could help solve the possible issue.

Do you have any idea about which part of this patch set may cause the issue?

One change by design is that, with the old logic, if it is a wakeup and we
found an affine sd, the select function will never go into the balance
path, but the new logic will in some cases. Do you think this could be a
problem?
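
Roughly, and only to illustrate the question (the mainline side is
paraphrased from the 3.8-era select_task_rq_fair(); the patched side is
just the behaviour described above, not actual code from the series):

/* mainline, roughly: an affine wakeup short-circuits */
if (affine_sd) {
	if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
		prev_cpu = cpu;
	new_cpu = select_idle_sibling(p, prev_cpu);
	goto unlock;		/* never reaches the balance path */
}
/* otherwise walk the sd levels via find_idlest_group() /
 * find_idlest_cpu() ... */

/* with the balance map, as described above, even an affine wakeup
 * may in some cases fall through to that balance path. */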

> 
> I thought perhaps the bogus (shouldn't exist) CPU domain in mainline
> somehow contributes to the strange behavioral delta, but killing it made
> zero difference.  All of these numbers for both trees were logged with
> the below applied, but as noted, it changed nothing.

The patch set was supposed to accelerate things by reducing the cost of
select_task_rq(), so it should be harmless under all conditions.

Regards,
Michael Wang

> 
> From: Alex Shi <alex.shi@intel.com>
> Date: Mon, 17 Dec 2012 09:42:57 +0800
> Subject: [PATCH 01/18] sched: remove SD_PERFER_SIBLING flag
> 
> The flag was introduced in commit b5d978e0c7e79a. Its purpose seems
> trying to fullfill one node first in NUMA machine via pulling tasks
> from other nodes when the node has capacity.
> 
> Its advantage is when few tasks share memories among them, pulling
> together is helpful on locality, so has performance gain. The shortage
> is it will keep unnecessary task migrations thrashing among different
> nodes, that reduces the performance gain, and just hurt performance if
> tasks has no memory cross.
> 
> Thinking about the sched numa balancing patch is coming. The small
> advantage are meaningless to us, So better to remove this flag.
> 
> Reported-by: Mike Galbraith <efault@gmx.de>
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  include/linux/sched.h    |  1 -
>  include/linux/topology.h |  2 --
>  kernel/sched/core.c      |  1 -
>  kernel/sched/fair.c      | 19 +------------------
>  4 files changed, 1 insertion(+), 22 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 5dafac3..6dca96c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -836,7 +836,6 @@ enum cpu_idle_type {
>  #define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share cpu pkg resources */
>  #define SD_SERIALIZE		0x0400	/* Only a single load balancing instance */
>  #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
> -#define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
>  #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
> 
>  extern int __weak arch_sd_sibiling_asym_packing(void);
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index d3cf0d6..15864d1 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -100,7 +100,6 @@ int arch_update_cpu_topology(void);
>  				| 1*SD_SHARE_CPUPOWER			\
>  				| 1*SD_SHARE_PKG_RESOURCES		\
>  				| 0*SD_SERIALIZE			\
> -				| 0*SD_PREFER_SIBLING			\
>  				| arch_sd_sibling_asym_packing()	\
>  				,					\
>  	.last_balance		= jiffies,				\
> @@ -162,7 +161,6 @@ int arch_update_cpu_topology(void);
>  				| 0*SD_SHARE_CPUPOWER			\
>  				| 0*SD_SHARE_PKG_RESOURCES		\
>  				| 0*SD_SERIALIZE			\
> -				| 1*SD_PREFER_SIBLING			\
>  				,					\
>  	.last_balance		= jiffies,				\
>  	.balance_interval	= 1,					\
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5dae0d2..8ed2784 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6014,7 +6014,6 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
>  					| 0*SD_SHARE_CPUPOWER
>  					| 0*SD_SHARE_PKG_RESOURCES
>  					| 1*SD_SERIALIZE
> -					| 0*SD_PREFER_SIBLING
>  					| sd_local_flags(level)
>  					,
>  		.last_balance		= jiffies,
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 59e072b..5d175f2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4339,13 +4339,9 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  static inline void update_sd_lb_stats(struct lb_env *env,
>  					int *balance, struct sd_lb_stats *sds)
>  {
> -	struct sched_domain *child = env->sd->child;
>  	struct sched_group *sg = env->sd->groups;
>  	struct sg_lb_stats sgs;
> -	int load_idx, prefer_sibling = 0;
> -
> -	if (child && child->flags & SD_PREFER_SIBLING)
> -		prefer_sibling = 1;
> +	int load_idx;
> 
>  	load_idx = get_sd_load_idx(env->sd, env->idle);
> 
> @@ -4362,19 +4358,6 @@ static inline void update_sd_lb_stats(struct lb_env *env,
>  		sds->total_load += sgs.group_load;
>  		sds->total_pwr += sg->sgp->power;
> 
> -		/*
> -		 * In case the child domain prefers tasks go to siblings
> -		 * first, lower the sg capacity to one so that we'll try
> -		 * and move all the excess tasks away. We lower the capacity
> -		 * of a group only if the local group has the capacity to fit
> -		 * these excess tasks, i.e. nr_running < group_capacity. The
> -		 * extra check prevents the case where you always pull from the
> -		 * heaviest group when it is already under-utilized (possible
> -		 * with a large weight task outweighs the tasks on the system).
> -		 */
> -		if (prefer_sibling && !local_group && sds->this_has_capacity)
> -			sgs.group_capacity = min(sgs.group_capacity, 1UL);
> -
>  		if (local_group) {
>  			sds->this_load = sgs.avg_load;
>  			sds->this = sg;
> 
> 



* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-21  2:50             ` Michael Wang
@ 2013-01-21  4:38               ` Mike Galbraith
  2013-01-21  5:07                 ` Michael Wang
  0 siblings, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2013-01-21  4:38 UTC (permalink / raw)
  To: Michael Wang; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On Mon, 2013-01-21 at 10:50 +0800, Michael Wang wrote: 
> On 01/20/2013 12:09 PM, Mike Galbraith wrote:
> > On Thu, 2013-01-17 at 13:55 +0800, Michael Wang wrote: 
> >> Hi, Mike
> >>
> >> I've sent out the v2, which I suppose will fix the below BUG and
> >> perform better; please do let me know if it still causes issues on your
> >> arm7 machine.
> > 
> > s/arm7/aim7
> > 
> > Someone swiped half of CPUs/ram, so the box is now 2 10 core nodes vs 4.
> > 
> > stock scheduler knobs
> > 
> > 3.8-wang-v2                                 avg     3.8-virgin                          avg    vs wang
> > Tasks    jobs/min
> >     1      436.29    435.66    435.97    435.97        437.86    441.69    440.09    439.88      1.008
> >     5     2361.65   2356.14   2350.66   2356.15       2416.27   2563.45   2374.61   2451.44      1.040
> >    10     4767.90   4764.15   4779.18   4770.41       4946.94   4832.54   4828.69   4869.39      1.020
> >    20     9672.79   9703.76   9380.80   9585.78       9634.34   9672.79   9727.13   9678.08      1.009
> >    40    19162.06  19207.61  19299.36  19223.01      19268.68  19192.40  19056.60  19172.56       .997
> >    80    37610.55  37465.22  37465.22  37513.66      37263.64  37120.98  37465.22  37283.28       .993
> >   160    69306.65  69655.17  69257.14  69406.32      69257.14  69306.65  69257.14  69273.64       .998
> >   320   111512.36 109066.37 111256.45 110611.72     108395.75 107913.19 108335.20 108214.71       .978
> >   640   142850.83 148483.92 150851.81 147395.52     151974.92 151263.65 151322.67 151520.41      1.027
> >  1280    52788.89  52706.39  67280.77  57592.01     189931.44 189745.60 189792.02 189823.02      3.295
> >  2560    75403.91  52905.91  45196.21  57835.34     217368.64 217582.05 217551.54 217500.74      3.760
> > 
> > sched_latency_ns = 24ms
> > sched_min_granularity_ns = 8ms
> > sched_wakeup_granularity_ns = 10ms
> > 
> > 3.8-wang-v2                                 avg     3.8-virgin                          avg    vs wang
> > Tasks    jobs/min
> >     1      436.29    436.60    434.72    435.87        434.41    439.77    438.81    437.66      1.004
> >     5     2382.08   2393.36   2451.46   2408.96       2451.46   2453.44   2425.94   2443.61      1.014
> >    10     5029.05   4887.10   5045.80   4987.31       4844.12   4828.69   4844.12   4838.97       .970
> >    20     9869.71   9734.94   9758.45   9787.70       9513.34   9611.42   9565.90   9563.55       .977
> >    40    19146.92  19146.92  19192.40  19162.08      18617.51  18603.22  18517.95  18579.56       .969
> >    80    37177.91  37378.57  37292.31  37282.93      36451.13  36179.10  36233.18  36287.80       .973
> >   160    70260.87  69109.05  69207.71  69525.87      68281.69  68522.97  68912.58  68572.41       .986
> >   320   114745.56 113869.64 114474.62 114363.27     114137.73 114137.73 114137.73 114137.73       .998
> >   640   164338.98 164338.98 164618.00 164431.98     164130.34 164130.34 164130.34 164130.34       .998
> >  1280   209473.40 209134.54 209473.40 209360.44     210040.62 210040.62 210097.51 210059.58      1.003
> >  2560   242703.38 242627.46 242779.34 242703.39     244001.26 243847.85 243732.91 243860.67      1.004
> > 
> > As you can see, the load collapsed at the high load end with stock
> > scheduler knobs (desktop latency).  With knobs set to scale, the delta
> > disappeared.
> 
> Thanks for the testing, Mike, please allow me to ask few questions.
> 
> What are those tasks actually doing? what's the workload?

It's the canned aim7 compute load, mixed bag load weighted toward
compute.  Below is the workfile, should give you an idea.

# @(#) workfile.compute:1.3 1/22/96 00:00:00
# Compute Server Mix
FILESIZE: 100K
POOLSIZE: 250M
50  add_double
30  add_int
30  add_long
10  array_rtns
10  disk_cp
30  disk_rd
10  disk_src
20  disk_wrt
40  div_double
30  div_int
50  matrix_rtns
40  mem_rtns_1
40  mem_rtns_2
50  mul_double
30  mul_int
30  mul_long
40  new_raph
40  num_rtns_1
50  page_test
40  series_1
10  shared_memory
30  sieve
20  stream_pipe
30  string_rtns
40  trig_rtns
20  udp_test

> And I'm confusing about how those new parameter value was figured out
> and how could them help solve the possible issue?

Oh, that's easy.  I set sched_min_granularity_ns such that last_buddy
kicks in when a third task arrives on a runqueue, and set
sched_wakeup_granularity_ns near minimum that still allows wakeup
preemption to occur.  Combined effect is reduced over-scheduling.
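
A rough sketch of the mechanism those knobs act on, paraphrased from
memory from kernel/sched/fair.c of that era (the wrapper name
wakeup_preempt_sketch() is made up, the rest follows the mainline
structure, so treat it as an illustration rather than the exact code):

        /*
         * sched_nr_latency is recomputed whenever the knobs change:
         *
         *      sched_nr_latency = DIV_ROUND_UP(sysctl_sched_latency,
         *                                      sysctl_sched_min_granularity);
         *
         * With latency = 24ms and min_granularity = 8ms that yields 3, so
         * the last-buddy hint is only armed once a third task is runnable.
         */
        static void wakeup_preempt_sketch(struct rq *rq, struct task_struct *p)
        {
                struct sched_entity *se = &rq->curr->se, *pse = &p->se;
                int scale = rq->cfs.nr_running >= sched_nr_latency;

                /*
                 * wakeup_preempt_entity() only returns 1 when the waking
                 * task's vruntime lead exceeds a weighted form of
                 * sysctl_sched_wakeup_granularity, so raising that knob
                 * trims wakeup preemption without disabling it.
                 */
                if (wakeup_preempt_entity(se, pse) == 1) {
                        resched_task(rq->curr);

                        /* prefer the preempted task on the next pick */
                        if (sched_feat(LAST_BUDDY) && scale && entity_is_task(se))
                                set_last_buddy(se);
                }
        }

With those values the preemption gate opens less often, and once a third
task is stacked on a runqueue the preempted task gets picked right back
up, which is where the reduced over-scheduling comes from.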
> Do you have any idea about which part in this patch set may cause the issue?

Nope, I'm as puzzled by that as you are.  When the box had 40 cores,
both virgin and patched showed over-scheduling effects, but not like
this.  With 20 cores, symptoms changed in a most puzzling way, and I
don't see how you'd be directly responsible.

> One change by designed is that, for old logical, if it's a wake up and
> we found affine sd, the select func will never go into the balance path,
> but the new logical will, in some cases, do you think this could be a
> problem?

Since it's the high load end, where looking for an idle core is most
likely to be a waste of time, it makes sense that entering the balance
path would hurt _some_, it isn't free.. except for twiddling preemption
knobs making the collapse just go away.  We're still going to enter that
path if all cores are busy, no matter how I twiddle those knobs.
  
> > I thought perhaps the bogus (shouldn't exist) CPU domain in mainline
> > somehow contributes to the strange behavioral delta, but killing it made
> > zero difference.  All of these numbers for both trees were logged with
> > the below applies, but as noted, it changed nothing. 
> 
> The patch set was supposed to do accelerate by reduce the cost of
> select_task_rq(), so it should be harmless for all the conditions.

Yeah, it should just save some cycles, but I like to eliminate known
bugs when testing, just in case.

-Mike


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-21  4:38               ` Mike Galbraith
@ 2013-01-21  5:07                 ` Michael Wang
  2013-01-21  6:42                   ` Mike Galbraith
  0 siblings, 1 reply; 57+ messages in thread
From: Michael Wang @ 2013-01-21  5:07 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On 01/21/2013 12:38 PM, Mike Galbraith wrote:
> On Mon, 2013-01-21 at 10:50 +0800, Michael Wang wrote: 
>> On 01/20/2013 12:09 PM, Mike Galbraith wrote:
>>> On Thu, 2013-01-17 at 13:55 +0800, Michael Wang wrote: 
>>>> Hi, Mike
>>>>
>>>> I've send out the v2, which I suppose it will fix the below BUG and
>>>> perform better, please do let me know if it still cause issues on your
>>>> arm7 machine.
>>>
>>> s/arm7/aim7
>>>
>>> Someone swiped half of CPUs/ram, so the box is now 2 10 core nodes vs 4.
>>>
>>> stock scheduler knobs
>>>
>>> 3.8-wang-v2                                 avg     3.8-virgin                          avg    vs wang
>>> Tasks    jobs/min
>>>     1      436.29    435.66    435.97    435.97        437.86    441.69    440.09    439.88      1.008
>>>     5     2361.65   2356.14   2350.66   2356.15       2416.27   2563.45   2374.61   2451.44      1.040
>>>    10     4767.90   4764.15   4779.18   4770.41       4946.94   4832.54   4828.69   4869.39      1.020
>>>    20     9672.79   9703.76   9380.80   9585.78       9634.34   9672.79   9727.13   9678.08      1.009
>>>    40    19162.06  19207.61  19299.36  19223.01      19268.68  19192.40  19056.60  19172.56       .997
>>>    80    37610.55  37465.22  37465.22  37513.66      37263.64  37120.98  37465.22  37283.28       .993
>>>   160    69306.65  69655.17  69257.14  69406.32      69257.14  69306.65  69257.14  69273.64       .998
>>>   320   111512.36 109066.37 111256.45 110611.72     108395.75 107913.19 108335.20 108214.71       .978
>>>   640   142850.83 148483.92 150851.81 147395.52     151974.92 151263.65 151322.67 151520.41      1.027
>>>  1280    52788.89  52706.39  67280.77  57592.01     189931.44 189745.60 189792.02 189823.02      3.295
>>>  2560    75403.91  52905.91  45196.21  57835.34     217368.64 217582.05 217551.54 217500.74      3.760
>>>
>>> sched_latency_ns = 24ms
>>> sched_min_granularity_ns = 8ms
>>> sched_wakeup_granularity_ns = 10ms
>>>
>>> 3.8-wang-v2                                 avg     3.8-virgin                          avg    vs wang
>>> Tasks    jobs/min
>>>     1      436.29    436.60    434.72    435.87        434.41    439.77    438.81    437.66      1.004
>>>     5     2382.08   2393.36   2451.46   2408.96       2451.46   2453.44   2425.94   2443.61      1.014
>>>    10     5029.05   4887.10   5045.80   4987.31       4844.12   4828.69   4844.12   4838.97       .970
>>>    20     9869.71   9734.94   9758.45   9787.70       9513.34   9611.42   9565.90   9563.55       .977
>>>    40    19146.92  19146.92  19192.40  19162.08      18617.51  18603.22  18517.95  18579.56       .969
>>>    80    37177.91  37378.57  37292.31  37282.93      36451.13  36179.10  36233.18  36287.80       .973
>>>   160    70260.87  69109.05  69207.71  69525.87      68281.69  68522.97  68912.58  68572.41       .986
>>>   320   114745.56 113869.64 114474.62 114363.27     114137.73 114137.73 114137.73 114137.73       .998
>>>   640   164338.98 164338.98 164618.00 164431.98     164130.34 164130.34 164130.34 164130.34       .998
>>>  1280   209473.40 209134.54 209473.40 209360.44     210040.62 210040.62 210097.51 210059.58      1.003
>>>  2560   242703.38 242627.46 242779.34 242703.39     244001.26 243847.85 243732.91 243860.67      1.004
>>>
>>> As you can see, the load collapsed at the high load end with stock
>>> scheduler knobs (desktop latency).  With knobs set to scale, the delta
>>> disappeared.
>>
>> Thanks for the testing, Mike, please allow me to ask few questions.
>>
>> What are those tasks actually doing? what's the workload?
> 
> It's the canned aim7 compute load, mixed bag load weighted toward
> compute.  Below is the workfile, should give you an idea.
> 
> # @(#) workfile.compute:1.3 1/22/96 00:00:00
> # Compute Server Mix
> FILESIZE: 100K
> POOLSIZE: 250M
> 50  add_double
> 30  add_int
> 30  add_long
> 10  array_rtns
> 10  disk_cp
> 30  disk_rd
> 10  disk_src
> 20  disk_wrt
> 40  div_double
> 30  div_int
> 50  matrix_rtns
> 40  mem_rtns_1
> 40  mem_rtns_2
> 50  mul_double
> 30  mul_int
> 30  mul_long
> 40  new_raph
> 40  num_rtns_1
> 50  page_test
> 40  series_1
> 10  shared_memory
> 30  sieve
> 20  stream_pipe
> 30  string_rtns
> 40  trig_rtns
> 20  udp_test
> 

That seems like the default one; could you please show me the numbers in
your datapoints file?

I'm not familiar with this benchmark, but I'd like to give it a try on my
server, to check whether it is a generic issue.

>> And I'm confusing about how those new parameter value was figured out
>> and how could them help solve the possible issue?
> 
> Oh, that's easy.  I set sched_min_granularity_ns such that last_buddy
> kicks in when a third task arrives on a runqueue, and set
> sched_wakeup_granularity_ns near minimum that still allows wakeup
> preemption to occur.  Combined effect is reduced over-scheduling.

That sounds very hard, catching the timing; whatever, it could be an
important clue for the analysis.

>> Do you have any idea about which part in this patch set may cause the issue?
> 
> Nope, I'm as puzzled by that as you are.  When the box had 40 cores,
> both virgin and patched showed over-scheduling effects, but not like
> this.  With 20 cores, symptoms changed in a most puzzling way, and I
> don't see how you'd be directly responsible.

Hmm...

> 
>> One change by designed is that, for old logical, if it's a wake up and
>> we found affine sd, the select func will never go into the balance path,
>> but the new logical will, in some cases, do you think this could be a
>> problem?
> 
> Since it's the high load end, where looking for an idle core is most
> likely to be a waste of time, it makes sense that entering the balance
> path would hurt _some_, it isn't free.. except for twiddling preemption
> knobs making the collapse just go away.  We're still going to enter that
> path if all cores are busy, no matter how I twiddle those knobs.

Maybe we could try changing this back to the old way later, after the
aim7 test on my server.

>   
>>> I thought perhaps the bogus (shouldn't exist) CPU domain in mainline
>>> somehow contributes to the strange behavioral delta, but killing it made
>>> zero difference.  All of these numbers for both trees were logged with
>>> the below applies, but as noted, it changed nothing. 
>>
>> The patch set was supposed to do accelerate by reduce the cost of
>> select_task_rq(), so it should be harmless for all the conditions.
> 
> Yeah, it should just save some cycles, but I like to eliminate known
> bugs when testing, just in case.

Agree, that's really important.

Regards,
Michael Wang

> 
> -Mike
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-21  5:07                 ` Michael Wang
@ 2013-01-21  6:42                   ` Mike Galbraith
  2013-01-21  7:09                     ` Mike Galbraith
  2013-01-21  7:34                     ` Michael Wang
  0 siblings, 2 replies; 57+ messages in thread
From: Mike Galbraith @ 2013-01-21  6:42 UTC (permalink / raw)
  To: Michael Wang; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On Mon, 2013-01-21 at 13:07 +0800, Michael Wang wrote:

> That seems like the default one, could you please show me the numbers in
> your datapoint file?

Yup, I do not touch the workfile.  Datapoints is what you see in the
tabulated result...

1
1
1
5
5
5
10
10
10
...

so it does three consecutive runs at each load level.  I quiesce the
box, set governor to performance,
echo 250 32000 32 4096 > /proc/sys/kernel/sem, then ./multitask -nl -f,
and point it at ./datapoints.

> I'm not familiar with this benchmark, but I'd like to have a try on my
> server, to make sure whether it is a generic issue.

One thing I didn't like about your changes is that you don't ask
wake_affine() if it's ok to pull cross node or not, which I thought might
induce imbalance, but twiddling that didn't fix up the collapse, pretty
much leaving only the balance path.
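
For reference, this is roughly what the unpatched wakeup path does at
that point, a from-memory sketch of mainline select_task_rq_fair()
around 3.8 rather than the patch set's code (affine_sd there is the
lowest SD_WAKE_AFFINE domain spanning both cpu and prev_cpu):

        if (affine_sd) {
                /* the cross-cpu (possibly cross-node) pull needs an ack */
                if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
                        prev_cpu = cpu;

                new_cpu = select_idle_sibling(p, prev_cpu);
                goto unlock;
        }

So wake_affine() gets to veto the pull before any idle-sibling scan of
the waking cpu's cache domain happens.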

> >> And I'm confusing about how those new parameter value was figured out
> >> and how could them help solve the possible issue?
> > 
> > Oh, that's easy.  I set sched_min_granularity_ns such that last_buddy
> > kicks in when a third task arrives on a runqueue, and set
> > sched_wakeup_granularity_ns near minimum that still allows wakeup
> > preemption to occur.  Combined effect is reduced over-scheduling.
> 
> That sounds very hard, to catch the timing, whatever, it could be an
> important clue for analysis.

(Play with the knobs with a bunch of different loads, I think you'll
find that those settings work well)

> >> Do you have any idea about which part in this patch set may cause the issue?
> > 
> > Nope, I'm as puzzled by that as you are.  When the box had 40 cores,
> > both virgin and patched showed over-scheduling effects, but not like
> > this.  With 20 cores, symptoms changed in a most puzzling way, and I
> > don't see how you'd be directly responsible.
> 
> Hmm...
> 
> > 
> >> One change by designed is that, for old logical, if it's a wake up and
> >> we found affine sd, the select func will never go into the balance path,
> >> but the new logical will, in some cases, do you think this could be a
> >> problem?
> > 
> > Since it's the high load end, where looking for an idle core is most
> > likely to be a waste of time, it makes sense that entering the balance
> > path would hurt _some_, it isn't free.. except for twiddling preemption
> > knobs making the collapse just go away.  We're still going to enter that
> > path if all cores are busy, no matter how I twiddle those knobs.
> 
> May be we could try change this back to the old way later, after the aim
> 7 test on my server.

Yeah, something funny is going on.  I'd like select_idle_sibling() to
just go away, its job integrated into one and only one short and
sweet balance path.  I don't see why find_idlest* needs to continue
traversal after seeing a zero.  It should be just fine to say gee, we're
done.  Hohum, so much for pure test and report, twiddle twiddle tweak,
bend spindle mutilate ;-) 
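
Concretely, that early exit would look something like the below, using
the mainline find_idlest_cpu() of that era as the base; only the
zero-load bail-out is new, the rest is a from-memory paraphrase:

        static int
        find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
        {
                unsigned long load, min_load = ULONG_MAX;
                int idlest = -1;
                int i;

                /* Traverse only the allowed CPUs */
                for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
                        load = weighted_cpuload(i);

                        /* an idle cpu can't be beaten, stop traversing */
                        if (!load)
                                return i;

                        if (load < min_load || (load == min_load && i == this_cpu)) {
                                min_load = load;
                                idlest = i;
                        }
                }

                return idlest;
        }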
   
-Mike


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-21  6:42                   ` Mike Galbraith
@ 2013-01-21  7:09                     ` Mike Galbraith
  2013-01-21  7:45                       ` Michael Wang
  2013-01-21  7:34                     ` Michael Wang
  1 sibling, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2013-01-21  7:09 UTC (permalink / raw)
  To: Michael Wang; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On Mon, 2013-01-21 at 07:42 +0100, Mike Galbraith wrote: 
> On Mon, 2013-01-21 at 13:07 +0800, Michael Wang wrote:

> > May be we could try change this back to the old way later, after the aim
> > 7 test on my server.
> 
> Yeah, something funny is going on.

Never entering balance path kills the collapse.  Asking wake_affine()
wrt the pull as before, but allowing us to continue should no idle cpu
be found, still collapsed.  So the source of funny behavior is indeed in
balance_path.

-Mike


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-21  6:42                   ` Mike Galbraith
  2013-01-21  7:09                     ` Mike Galbraith
@ 2013-01-21  7:34                     ` Michael Wang
  2013-01-21  8:26                       ` Mike Galbraith
  1 sibling, 1 reply; 57+ messages in thread
From: Michael Wang @ 2013-01-21  7:34 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On 01/21/2013 02:42 PM, Mike Galbraith wrote:
> On Mon, 2013-01-21 at 13:07 +0800, Michael Wang wrote:
> 
>> That seems like the default one, could you please show me the numbers in
>> your datapoint file?
> 
> Yup, I do not touch the workfile.  Datapoints is what you see in the
> tabulated result...
> 
> 1
> 1
> 1
> 5
> 5
> 5
> 10
> 10
> 10
> ...
> 
> so it does three consecutive runs at each load level.  I quiesce the
> box, set governor to performance, echo 250 32000 32 4096
>> /proc/sys/kernel/sem, then ./multitask -nl -f, and point it
> at ./datapoints.

I have changed the "/proc/sys/kernel/sem" to:

2000    2048000 256     1024

and ran a few rounds; it seems I can't reproduce this issue on my 12 cpu
X86 server:

	prev		post
Tasks    jobs/min  	jobs/min
    1      508.39    	506.69
    5     2792.63   	2792.63
   10     5454.55   	5449.64
   20    10262.49  	10271.19
   40    18089.55  	18184.55
   80    28995.22  	28960.57
  160    41365.19  	41613.73
  320    53099.67  	52767.35
  640    61308.88  	61483.83
 1280    66707.95  	66484.96
 2560    69736.58  	69350.02

Almost nothing changed...I would like to find another machine and do the
test again later.

> 
>> I'm not familiar with this benchmark, but I'd like to have a try on my
>> server, to make sure whether it is a generic issue.
> 
> One thing I didn't like about your changes is that you don't ask
> wake_affine() if it's ok to pull cross node or not, which I though might
> induce imbalance, but twiddling that didn't fix up the collapse, pretty
> much leaving only the balance path.

wake_affine() will still be asked before trying to use the idle sibling
selected from the current cpu's domain, won't it? It's just been delayed,
since its cost is too high.

But you made me notice that I missed the case where prev == current; I'm
not sure whether that's the killer, but I will correct it.

> 
>>>> And I'm confusing about how those new parameter value was figured out
>>>> and how could them help solve the possible issue?
>>>
>>> Oh, that's easy.  I set sched_min_granularity_ns such that last_buddy
>>> kicks in when a third task arrives on a runqueue, and set
>>> sched_wakeup_granularity_ns near minimum that still allows wakeup
>>> preemption to occur.  Combined effect is reduced over-scheduling.
>>
>> That sounds very hard, to catch the timing, whatever, it could be an
>> important clue for analysis.
> 
> (Play with the knobs with a bunch of different loads, I think you'll
> find that those settings work well)
> 
>>>> Do you have any idea about which part in this patch set may cause the issue?
>>>
>>> Nope, I'm as puzzled by that as you are.  When the box had 40 cores,
>>> both virgin and patched showed over-scheduling effects, but not like
>>> this.  With 20 cores, symptoms changed in a most puzzling way, and I
>>> don't see how you'd be directly responsible.
>>
>> Hmm...
>>
>>>
>>>> One change by designed is that, for old logical, if it's a wake up and
>>>> we found affine sd, the select func will never go into the balance path,
>>>> but the new logical will, in some cases, do you think this could be a
>>>> problem?
>>>
>>> Since it's the high load end, where looking for an idle core is most
>>> likely to be a waste of time, it makes sense that entering the balance
>>> path would hurt _some_, it isn't free.. except for twiddling preemption
>>> knobs making the collapse just go away.  We're still going to enter that
>>> path if all cores are busy, no matter how I twiddle those knobs.
>>
>> May be we could try change this back to the old way later, after the aim
>> 7 test on my server.
> 
> Yeah, something funny is going on.  I'd like select_idle_sibling() to
> just go away, that task be integrated into one and only one short and
> sweet balance path.  I don't see why fine_idlest* needs to continue
> traversal after seeing a zero.  
> It should be just fine to say gee, we're
> done.  

Yes, that's true :)

> Hohum, so much for pure test and report, twiddle twiddle tweak,
> bend spindle mutilate ;-) 

The scheduler is sometimes impossible to analyze; the only way to prove
anything is painful, endless testing... and usually we still miss
something in the end...

Regards,
Michael Wang


>    
> -Mike
> 
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-21  7:09                     ` Mike Galbraith
@ 2013-01-21  7:45                       ` Michael Wang
  2013-01-21  9:09                         ` Mike Galbraith
  0 siblings, 1 reply; 57+ messages in thread
From: Michael Wang @ 2013-01-21  7:45 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On 01/21/2013 03:09 PM, Mike Galbraith wrote:
> On Mon, 2013-01-21 at 07:42 +0100, Mike Galbraith wrote: 
>> On Mon, 2013-01-21 at 13:07 +0800, Michael Wang wrote:
> 
>>> May be we could try change this back to the old way later, after the aim
>>> 7 test on my server.
>>
>> Yeah, something funny is going on.
> 
> Never entering balance path kills the collapse.  Asking wake_affine()
> wrt the pull as before, but allowing us to continue should no idle cpu
> be found, still collapsed.  So the source of funny behavior is indeed in
> balance_path.

The patch below, based on the patch set, helps to avoid entering the
balance path if an affine_sd can be found, just like the old logic.
Would you like to give it a try and see whether it helps fix the collapse?

Regards,
Michael Wang

---
 kernel/sched/fair.c |   14 ++++++++------
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d600708..4e95bb0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3297,6 +3297,8 @@ next:
                        sg = sg->next;
                } while (sg != sd->groups);
        }
+
+       return -1;
 done:
        return target;
 }
@@ -3349,7 +3351,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
                 * some cases.
                 */
                new_cpu = select_idle_sibling(p, prev_cpu);
-               if (idle_cpu(new_cpu))
+               if (new_cpu != -1)
                        goto unlock;

                /*
@@ -3363,15 +3365,15 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
                        goto balance_path;

                new_cpu = select_idle_sibling(p, cpu);
-               if (!idle_cpu(new_cpu))
-                       goto balance_path;
-
                /*
                 * Invoke wake_affine() finally since it is no doubt a
                 * performance killer.
                 */
-               if (wake_affine(sbm->affine_map[prev_cpu], p, sync))
-                       goto unlock;
+               if (new_cpu == -1 ||
+                       !wake_affine(sbm->affine_map[prev_cpu], p, sync))
+                       new_cpu = prev_cpu;
+
+               goto unlock;
        }

 balance_path:
-- 
1.7.4.1


> 
> -Mike
> 
> 


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-21  7:34                     ` Michael Wang
@ 2013-01-21  8:26                       ` Mike Galbraith
  2013-01-21  8:46                         ` Michael Wang
  0 siblings, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2013-01-21  8:26 UTC (permalink / raw)
  To: Michael Wang; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On Mon, 2013-01-21 at 15:34 +0800, Michael Wang wrote: 
> On 01/21/2013 02:42 PM, Mike Galbraith wrote:
> > On Mon, 2013-01-21 at 13:07 +0800, Michael Wang wrote:
> > 
> >> That seems like the default one, could you please show me the numbers in
> >> your datapoint file?
> > 
> > Yup, I do not touch the workfile.  Datapoints is what you see in the
> > tabulated result...
> > 
> > 1
> > 1
> > 1
> > 5
> > 5
> > 5
> > 10
> > 10
> > 10
> > ...
> > 
> > so it does three consecutive runs at each load level.  I quiesce the
> > box, set governor to performance, echo 250 32000 32 4096
> >> /proc/sys/kernel/sem, then ./multitask -nl -f, and point it
> > at ./datapoints.
> 
> I have changed the "/proc/sys/kernel/sem" to:
> 
> 2000    2048000 256     1024
> 
> and run few rounds, seems like I can't reproduce this issue on my 12 cpu
> X86 server:
> 
> 	prev		post
> Tasks    jobs/min  	jobs/min
>     1      508.39    	506.69
>     5     2792.63   	2792.63
>    10     5454.55   	5449.64
>    20    10262.49  	10271.19
>    40    18089.55  	18184.55
>    80    28995.22  	28960.57
>   160    41365.19  	41613.73
>   320    53099.67  	52767.35
>   640    61308.88  	61483.83
>  1280    66707.95  	66484.96
>  2560    69736.58  	69350.02
> 
> Almost nothing changed...I would like to find another machine and do the
> test again later.

Hm.  Those numbers look odd.  Ok, I've got 8 more cores, but your hefty
load throughput is low.  When I look at the low end numbers, it seems your cores
are more macho than my 2.27 GHz EX cores, so it should have been a lot
closer.  Oh wait, you said "12 cpu".. so 1 6 core package + HT?  This
box is 2 NUMA nodes (was 4), 2 (was 4) 10 core packages + HT.

-Mike


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-21  8:26                       ` Mike Galbraith
@ 2013-01-21  8:46                         ` Michael Wang
  2013-01-21  9:11                           ` Mike Galbraith
  0 siblings, 1 reply; 57+ messages in thread
From: Michael Wang @ 2013-01-21  8:46 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On 01/21/2013 04:26 PM, Mike Galbraith wrote:
> On Mon, 2013-01-21 at 15:34 +0800, Michael Wang wrote: 
>> On 01/21/2013 02:42 PM, Mike Galbraith wrote:
>>> On Mon, 2013-01-21 at 13:07 +0800, Michael Wang wrote:
>>>
>>>> That seems like the default one, could you please show me the numbers in
>>>> your datapoint file?
>>>
>>> Yup, I do not touch the workfile.  Datapoints is what you see in the
>>> tabulated result...
>>>
>>> 1
>>> 1
>>> 1
>>> 5
>>> 5
>>> 5
>>> 10
>>> 10
>>> 10
>>> ...
>>>
>>> so it does three consecutive runs at each load level.  I quiesce the
>>> box, set governor to performance, echo 250 32000 32 4096
>>>> /proc/sys/kernel/sem, then ./multitask -nl -f, and point it
>>> at ./datapoints.
>>
>> I have changed the "/proc/sys/kernel/sem" to:
>>
>> 2000    2048000 256     1024
>>
>> and run few rounds, seems like I can't reproduce this issue on my 12 cpu
>> X86 server:
>>
>> 	prev		post
>> Tasks    jobs/min  	jobs/min
>>     1      508.39    	506.69
>>     5     2792.63   	2792.63
>>    10     5454.55   	5449.64
>>    20    10262.49  	10271.19
>>    40    18089.55  	18184.55
>>    80    28995.22  	28960.57
>>   160    41365.19  	41613.73
>>   320    53099.67  	52767.35
>>   640    61308.88  	61483.83
>>  1280    66707.95  	66484.96
>>  2560    69736.58  	69350.02
>>
>> Almost nothing changed...I would like to find another machine and do the
>> test again later.
> 
> Hm.  Those numbers look odd.  Ok, I've got 8 more cores, but your hefty
> load throughput is low.  When I look low end numbers, seems your cores
> are more macho than my 2.27 GHz EX cores, so it should have been a lot
> closer.  Oh wait, you said "12 cpu".. so 1 6 core package + HT?  This
> box is 2 NUMA nodes (was 4), 2 (was 4) 10 core packages + HT.

It's a 12 core package, and only 1 physical cpu:

Intel(R) Xeon(R) CPU           X5690  @ 3.47GHz

So does that mean the issue is related to the case where there are
multiple nodes?

Regards,
Michael Wang

> 
> -Mike
> 
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-21  7:45                       ` Michael Wang
@ 2013-01-21  9:09                         ` Mike Galbraith
  2013-01-21  9:22                           ` Michael Wang
  0 siblings, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2013-01-21  9:09 UTC (permalink / raw)
  To: Michael Wang; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On Mon, 2013-01-21 at 15:45 +0800, Michael Wang wrote: 
> On 01/21/2013 03:09 PM, Mike Galbraith wrote:
> > On Mon, 2013-01-21 at 07:42 +0100, Mike Galbraith wrote: 
> >> On Mon, 2013-01-21 at 13:07 +0800, Michael Wang wrote:
> > 
> >>> May be we could try change this back to the old way later, after the aim
> >>> 7 test on my server.
> >>
> >> Yeah, something funny is going on.
> > 
> > Never entering balance path kills the collapse.  Asking wake_affine()
> > wrt the pull as before, but allowing us to continue should no idle cpu
> > be found, still collapsed.  So the source of funny behavior is indeed in
> > balance_path.
> 
> Below patch based on the patch set could help to avoid enter balance path
> if affine_sd could be found, just like the old logical, would you like to
> take a try and see whether it could help fix the collapse?

No, it does not.

> 
> Regards,
> Michael Wang
> 
> ---
>  kernel/sched/fair.c |   14 ++++++++------
>  1 files changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d600708..4e95bb0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3297,6 +3297,8 @@ next:
>                         sg = sg->next;
>                 } while (sg != sd->groups);
>         }
> +
> +       return -1;
>  done:
>         return target;
>  }
> @@ -3349,7 +3351,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>                  * some cases.
>                  */
>                 new_cpu = select_idle_sibling(p, prev_cpu);
> -               if (idle_cpu(new_cpu))
> +               if (new_cpu != -1)
>                         goto unlock;
> 
>                 /*
> @@ -3363,15 +3365,15 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>                         goto balance_path;
> 
>                 new_cpu = select_idle_sibling(p, cpu);
> -               if (!idle_cpu(new_cpu))
> -                       goto balance_path;
> -
>                 /*
>                  * Invoke wake_affine() finally since it is no doubt a
>                  * performance killer.
>                  */
> -               if (wake_affine(sbm->affine_map[prev_cpu], p, sync))
> -                       goto unlock;
> +               if (new_cpu == -1 ||
> +                       !wake_affine(sbm->affine_map[prev_cpu], p, sync))
> +                       new_cpu = prev_cpu;
> +
> +               goto unlock;
>         }
> 
>  balance_path:



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-21  8:46                         ` Michael Wang
@ 2013-01-21  9:11                           ` Mike Galbraith
  0 siblings, 0 replies; 57+ messages in thread
From: Mike Galbraith @ 2013-01-21  9:11 UTC (permalink / raw)
  To: Michael Wang; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On Mon, 2013-01-21 at 16:46 +0800, Michael Wang wrote: 
> On 01/21/2013 04:26 PM, Mike Galbraith wrote:
> > On Mon, 2013-01-21 at 15:34 +0800, Michael Wang wrote: 
> >> On 01/21/2013 02:42 PM, Mike Galbraith wrote:
> >>> On Mon, 2013-01-21 at 13:07 +0800, Michael Wang wrote:
> >>>
> >>>> That seems like the default one, could you please show me the numbers in
> >>>> your datapoint file?
> >>>
> >>> Yup, I do not touch the workfile.  Datapoints is what you see in the
> >>> tabulated result...
> >>>
> >>> 1
> >>> 1
> >>> 1
> >>> 5
> >>> 5
> >>> 5
> >>> 10
> >>> 10
> >>> 10
> >>> ...
> >>>
> >>> so it does three consecutive runs at each load level.  I quiesce the
> >>> box, set governor to performance, echo 250 32000 32 4096
> >>>> /proc/sys/kernel/sem, then ./multitask -nl -f, and point it
> >>> at ./datapoints.
> >>
> >> I have changed the "/proc/sys/kernel/sem" to:
> >>
> >> 2000    2048000 256     1024
> >>
> >> and run few rounds, seems like I can't reproduce this issue on my 12 cpu
> >> X86 server:
> >>
> >> 	prev		post
> >> Tasks    jobs/min  	jobs/min
> >>     1      508.39    	506.69
> >>     5     2792.63   	2792.63
> >>    10     5454.55   	5449.64
> >>    20    10262.49  	10271.19
> >>    40    18089.55  	18184.55
> >>    80    28995.22  	28960.57
> >>   160    41365.19  	41613.73
> >>   320    53099.67  	52767.35
> >>   640    61308.88  	61483.83
> >>  1280    66707.95  	66484.96
> >>  2560    69736.58  	69350.02
> >>
> >> Almost nothing changed...I would like to find another machine and do the
> >> test again later.
> > 
> > Hm.  Those numbers look odd.  Ok, I've got 8 more cores, but your hefty
> > load throughput is low.  When I look low end numbers, seems your cores
> > are more macho than my 2.27 GHz EX cores, so it should have been a lot
> > closer.  Oh wait, you said "12 cpu".. so 1 6 core package + HT?  This
> > box is 2 NUMA nodes (was 4), 2 (was 4) 10 core packages + HT.
> 
> It's a 12 core package, and only 1 physical cpu:
> 
> Intel(R) Xeon(R) CPU           X5690  @ 3.47GHz
> 
> So does that means the issue was related to the case when there are
> multiple nodes?

Seems likely.  I had 4 nodes earlier though, and did NOT see collapse.

-Mike


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-21  9:09                         ` Mike Galbraith
@ 2013-01-21  9:22                           ` Michael Wang
  2013-01-21  9:44                             ` Mike Galbraith
  0 siblings, 1 reply; 57+ messages in thread
From: Michael Wang @ 2013-01-21  9:22 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On 01/21/2013 05:09 PM, Mike Galbraith wrote:
> On Mon, 2013-01-21 at 15:45 +0800, Michael Wang wrote: 
>> On 01/21/2013 03:09 PM, Mike Galbraith wrote:
>>> On Mon, 2013-01-21 at 07:42 +0100, Mike Galbraith wrote: 
>>>> On Mon, 2013-01-21 at 13:07 +0800, Michael Wang wrote:
>>>
>>>>> May be we could try change this back to the old way later, after the aim
>>>>> 7 test on my server.
>>>>
>>>> Yeah, something funny is going on.
>>>
>>> Never entering balance path kills the collapse.  Asking wake_affine()
>>> wrt the pull as before, but allowing us to continue should no idle cpu
>>> be found, still collapsed.  So the source of funny behavior is indeed in
>>> balance_path.
>>
>> Below patch based on the patch set could help to avoid enter balance path
>> if affine_sd could be found, just like the old logical, would you like to
>> take a try and see whether it could help fix the collapse?
> 
> No, it does not.

Hmm... what has changed now compared to the old logic?

Maybe I missed something; well, I think I need to find a machine that
can reproduce the issue first.

Regards,
Michael Wang

> 
>>
>> Regards,
>> Michael Wang
>>
>> ---
>>  kernel/sched/fair.c |   14 ++++++++------
>>  1 files changed, 8 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d600708..4e95bb0 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3297,6 +3297,8 @@ next:
>>                         sg = sg->next;
>>                 } while (sg != sd->groups);
>>         }
>> +
>> +       return -1;
>>  done:
>>         return target;
>>  }
>> @@ -3349,7 +3351,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>>                  * some cases.
>>                  */
>>                 new_cpu = select_idle_sibling(p, prev_cpu);
>> -               if (idle_cpu(new_cpu))
>> +               if (new_cpu != -1)
>>                         goto unlock;
>>
>>                 /*
>> @@ -3363,15 +3365,15 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>>                         goto balance_path;
>>
>>                 new_cpu = select_idle_sibling(p, cpu);
>> -               if (!idle_cpu(new_cpu))
>> -                       goto balance_path;
>> -
>>                 /*
>>                  * Invoke wake_affine() finally since it is no doubt a
>>                  * performance killer.
>>                  */
>> -               if (wake_affine(sbm->affine_map[prev_cpu], p, sync))
>> -                       goto unlock;
>> +               if (new_cpu == -1 ||
>> +                       !wake_affine(sbm->affine_map[prev_cpu], p, sync))
>> +                       new_cpu = prev_cpu;
>> +
>> +               goto unlock;
>>         }
>>
>>  balance_path:
> 
> 
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-21  9:22                           ` Michael Wang
@ 2013-01-21  9:44                             ` Mike Galbraith
  2013-01-21 10:30                               ` Mike Galbraith
  2013-01-22  3:43                               ` Michael Wang
  0 siblings, 2 replies; 57+ messages in thread
From: Mike Galbraith @ 2013-01-21  9:44 UTC (permalink / raw)
  To: Michael Wang; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On Mon, 2013-01-21 at 17:22 +0800, Michael Wang wrote: 
> On 01/21/2013 05:09 PM, Mike Galbraith wrote:
> > On Mon, 2013-01-21 at 15:45 +0800, Michael Wang wrote: 
> >> On 01/21/2013 03:09 PM, Mike Galbraith wrote:
> >>> On Mon, 2013-01-21 at 07:42 +0100, Mike Galbraith wrote: 
> >>>> On Mon, 2013-01-21 at 13:07 +0800, Michael Wang wrote:
> >>>
> >>>>> May be we could try change this back to the old way later, after the aim
> >>>>> 7 test on my server.
> >>>>
> >>>> Yeah, something funny is going on.
> >>>
> >>> Never entering balance path kills the collapse.  Asking wake_affine()
> >>> wrt the pull as before, but allowing us to continue should no idle cpu
> >>> be found, still collapsed.  So the source of funny behavior is indeed in
> >>> balance_path.
> >>
> >> Below patch based on the patch set could help to avoid enter balance path
> >> if affine_sd could be found, just like the old logical, would you like to
> >> take a try and see whether it could help fix the collapse?
> > 
> > No, it does not.
> 
> Hmm...what have changed now compared to the old logical?

What I did earlier to confirm the collapse originates in balance_path is
below.  I just retested to confirm.

Tasks    jobs/min  jti  jobs/min/task      real       cpu
    1      435.34  100       435.3448     13.92      3.76   Mon Jan 21 10:24:00 2013
    1      440.09  100       440.0871     13.77      3.76   Mon Jan 21 10:24:22 2013
    1      440.41  100       440.4070     13.76      3.75   Mon Jan 21 10:24:45 2013
    5     2467.43   99       493.4853     12.28     10.71   Mon Jan 21 10:24:59 2013
    5     2445.52   99       489.1041     12.39     10.98   Mon Jan 21 10:25:14 2013
    5     2475.49   99       495.0980     12.24     10.59   Mon Jan 21 10:25:27 2013
   10     4963.14   99       496.3145     12.21     20.64   Mon Jan 21 10:25:41 2013
   10     4959.08   99       495.9083     12.22     21.26   Mon Jan 21 10:25:54 2013
   10     5415.55   99       541.5550     11.19     11.54   Mon Jan 21 10:26:06 2013
   20     9934.43   96       496.7213     12.20     33.52   Mon Jan 21 10:26:18 2013
   20     9950.74   98       497.5369     12.18     36.52   Mon Jan 21 10:26:31 2013
   20     9893.88   96       494.6939     12.25     34.39   Mon Jan 21 10:26:43 2013
   40    18937.50   98       473.4375     12.80     84.74   Mon Jan 21 10:26:56 2013
   40    18996.87   98       474.9216     12.76     88.64   Mon Jan 21 10:27:09 2013
   40    19146.92   98       478.6730     12.66     89.98   Mon Jan 21 10:27:22 2013
   80    37610.55   98       470.1319     12.89    112.01   Mon Jan 21 10:27:35 2013
   80    37321.02   98       466.5127     12.99    114.21   Mon Jan 21 10:27:48 2013
   80    37610.55   98       470.1319     12.89    111.77   Mon Jan 21 10:28:01 2013
  160    69109.05   98       431.9316     14.03    156.81   Mon Jan 21 10:28:15 2013
  160    69505.38   98       434.4086     13.95    155.33   Mon Jan 21 10:28:29 2013
  160    69207.71   98       432.5482     14.01    155.79   Mon Jan 21 10:28:43 2013
  320   108033.43   98       337.6045     17.95    314.01   Mon Jan 21 10:29:01 2013
  320   108577.83   98       339.3057     17.86    311.79   Mon Jan 21 10:29:19 2013
  320   108395.75   98       338.7367     17.89    312.55   Mon Jan 21 10:29:37 2013
  640   151440.84   98       236.6263     25.61    620.37   Mon Jan 21 10:30:03 2013
  640   151440.84   97       236.6263     25.61    621.23   Mon Jan 21 10:30:29 2013
  640   151145.75   98       236.1652     25.66    622.35   Mon Jan 21 10:30:55 2013
 1280   190117.65   98       148.5294     40.80   1228.40   Mon Jan 21 10:31:36 2013
 1280   189977.96   98       148.4203     40.83   1229.91   Mon Jan 21 10:32:17 2013
 1280   189560.12   98       148.0938     40.92   1231.71   Mon Jan 21 10:32:58 2013
 2560   217857.04   98        85.1004     71.21   2441.61   Mon Jan 21 10:34:09 2013
 2560   217338.19   98        84.8977     71.38   2448.76   Mon Jan 21 10:35:21 2013
 2560   217795.87   97        85.0765     71.23   2443.12   Mon Jan 21 10:36:32 2013

That was with your change backed out, and the q/d below applied.

---
 kernel/sched/fair.c |   27 ++++++---------------------
 1 file changed, 6 insertions(+), 21 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3337,6 +3337,8 @@ select_task_rq_fair(struct task_struct *
 		goto unlock;
 
 	if (sd_flag & SD_BALANCE_WAKE) {
+		new_cpu = prev_cpu;
+
 		/*
 		 * Tasks to be waked is special, memory it relied on
 		 * may has already been cached on prev_cpu, and usually
@@ -3348,33 +3350,16 @@ select_task_rq_fair(struct task_struct *
 		 * from top to bottom, which help to reduce the chance in
 		 * some cases.
 		 */
-		new_cpu = select_idle_sibling(p, prev_cpu);
+		new_cpu = select_idle_sibling(p, new_cpu);
 		if (idle_cpu(new_cpu))
 			goto unlock;
 
-		/*
-		 * No idle cpu could be found in the topology of prev_cpu,
-		 * before jump into the slow balance_path, try search again
-		 * in the topology of current cpu if it is the affine of
-		 * prev_cpu.
-		 */
-		if (!sbm->affine_map[prev_cpu] ||
-				!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
-			goto balance_path;
-
-		new_cpu = select_idle_sibling(p, cpu);
-		if (!idle_cpu(new_cpu))
-			goto balance_path;
+		if (wake_affine(sbm->affine_map[cpu], p, sync))
+			new_cpu = select_idle_sibling(p, cpu);
 
-		/*
-		 * Invoke wake_affine() finally since it is no doubt a
-		 * performance killer.
-		 */
-		if (wake_affine(sbm->affine_map[prev_cpu], p, sync))
-			goto unlock;
+		goto unlock;
 	}
 
-balance_path:
 	new_cpu = (sd_flag & SD_BALANCE_WAKE) ? prev_cpu : cpu;
 	sd = sbm->sd[type][sbm->top_level[type]];
 





^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-21  9:44                             ` Mike Galbraith
@ 2013-01-21 10:30                               ` Mike Galbraith
  2013-01-22  3:43                               ` Michael Wang
  1 sibling, 0 replies; 57+ messages in thread
From: Mike Galbraith @ 2013-01-21 10:30 UTC (permalink / raw)
  To: Michael Wang; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On Mon, 2013-01-21 at 10:44 +0100, Mike Galbraith wrote: 
> On Mon, 2013-01-21 at 17:22 +0800, Michael Wang wrote: 
> > On 01/21/2013 05:09 PM, Mike Galbraith wrote:
> > > On Mon, 2013-01-21 at 15:45 +0800, Michael Wang wrote: 
> > >> On 01/21/2013 03:09 PM, Mike Galbraith wrote:
> > >>> On Mon, 2013-01-21 at 07:42 +0100, Mike Galbraith wrote: 
> > >>>> On Mon, 2013-01-21 at 13:07 +0800, Michael Wang wrote:
> > >>>
> > >>>>> May be we could try change this back to the old way later, after the aim
> > >>>>> 7 test on my server.
> > >>>>
> > >>>> Yeah, something funny is going on.
> > >>>
> > >>> Never entering balance path kills the collapse.  Asking wake_affine()
> > >>> wrt the pull as before, but allowing us to continue should no idle cpu
> > >>> be found, still collapsed.  So the source of funny behavior is indeed in
> > >>> balance_path.
> > >>
> > >> Below patch based on the patch set could help to avoid enter balance path
> > >> if affine_sd could be found, just like the old logical, would you like to
> > >> take a try and see whether it could help fix the collapse?
> > > 
> > > No, it does not.
> > 
> > Hmm...what have changed now compared to the old logical?
> 
> What I did earlier to confirm the collapse originates in balance_path is
> below.  I just retested to confirm.

...

And you can add..

	if (per_cpu(sd_llc_id, cpu) == per_cpu(sd_llc_id, prev_cpu))
		goto unlock;

..before calling select_idle_sibling() the second time to optimize
department of redundancy department idle search algorithm ;-)
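
A hedged sketch of where that would land in the q/d diff above
(paraphrased and untested, names taken from that diff plus the
sd_llc_id check):

        if (sd_flag & SD_BALANCE_WAKE) {
                new_cpu = prev_cpu;

                new_cpu = select_idle_sibling(p, new_cpu);
                if (idle_cpu(new_cpu))
                        goto unlock;

                /*
                 * cpu and prev_cpu share a last-level cache, so the scan
                 * above already covered cpu's topology; a second scan
                 * would only repeat it.
                 */
                if (per_cpu(sd_llc_id, cpu) == per_cpu(sd_llc_id, prev_cpu))
                        goto unlock;

                if (wake_affine(sbm->affine_map[cpu], p, sync))
                        new_cpu = select_idle_sibling(p, cpu);

                goto unlock;
        }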

-Mike


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-21  9:44                             ` Mike Galbraith
  2013-01-21 10:30                               ` Mike Galbraith
@ 2013-01-22  3:43                               ` Michael Wang
  2013-01-22  8:03                                 ` Mike Galbraith
  1 sibling, 1 reply; 57+ messages in thread
From: Michael Wang @ 2013-01-22  3:43 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On 01/21/2013 05:44 PM, Mike Galbraith wrote:
> On Mon, 2013-01-21 at 17:22 +0800, Michael Wang wrote: 
>> On 01/21/2013 05:09 PM, Mike Galbraith wrote:
>>> On Mon, 2013-01-21 at 15:45 +0800, Michael Wang wrote: 
>>>> On 01/21/2013 03:09 PM, Mike Galbraith wrote:
>>>>> On Mon, 2013-01-21 at 07:42 +0100, Mike Galbraith wrote: 
>>>>>> On Mon, 2013-01-21 at 13:07 +0800, Michael Wang wrote:
>>>>>
>>>>>>> May be we could try change this back to the old way later, after the aim
>>>>>>> 7 test on my server.
>>>>>>
>>>>>> Yeah, something funny is going on.
>>>>>
>>>>> Never entering balance path kills the collapse.  Asking wake_affine()
>>>>> wrt the pull as before, but allowing us to continue should no idle cpu
>>>>> be found, still collapsed.  So the source of funny behavior is indeed in
>>>>> balance_path.
>>>>
>>>> Below patch based on the patch set could help to avoid enter balance path
>>>> if affine_sd could be found, just like the old logical, would you like to
>>>> take a try and see whether it could help fix the collapse?
>>>
>>> No, it does not.
>>
>> Hmm...what have changed now compared to the old logical?
> 
> What I did earlier to confirm the collapse originates in balance_path is
> below.  I just retested to confirm.
> 
> Tasks    jobs/min  jti  jobs/min/task      real       cpu
>     1      435.34  100       435.3448     13.92      3.76   Mon Jan 21 10:24:00 2013
>     1      440.09  100       440.0871     13.77      3.76   Mon Jan 21 10:24:22 2013
>     1      440.41  100       440.4070     13.76      3.75   Mon Jan 21 10:24:45 2013
>     5     2467.43   99       493.4853     12.28     10.71   Mon Jan 21 10:24:59 2013
>     5     2445.52   99       489.1041     12.39     10.98   Mon Jan 21 10:25:14 2013
>     5     2475.49   99       495.0980     12.24     10.59   Mon Jan 21 10:25:27 2013
>    10     4963.14   99       496.3145     12.21     20.64   Mon Jan 21 10:25:41 2013
>    10     4959.08   99       495.9083     12.22     21.26   Mon Jan 21 10:25:54 2013
>    10     5415.55   99       541.5550     11.19     11.54   Mon Jan 21 10:26:06 2013
>    20     9934.43   96       496.7213     12.20     33.52   Mon Jan 21 10:26:18 2013
>    20     9950.74   98       497.5369     12.18     36.52   Mon Jan 21 10:26:31 2013
>    20     9893.88   96       494.6939     12.25     34.39   Mon Jan 21 10:26:43 2013
>    40    18937.50   98       473.4375     12.80     84.74   Mon Jan 21 10:26:56 2013
>    40    18996.87   98       474.9216     12.76     88.64   Mon Jan 21 10:27:09 2013
>    40    19146.92   98       478.6730     12.66     89.98   Mon Jan 21 10:27:22 2013
>    80    37610.55   98       470.1319     12.89    112.01   Mon Jan 21 10:27:35 2013
>    80    37321.02   98       466.5127     12.99    114.21   Mon Jan 21 10:27:48 2013
>    80    37610.55   98       470.1319     12.89    111.77   Mon Jan 21 10:28:01 2013
>   160    69109.05   98       431.9316     14.03    156.81   Mon Jan 21 10:28:15 2013
>   160    69505.38   98       434.4086     13.95    155.33   Mon Jan 21 10:28:29 2013
>   160    69207.71   98       432.5482     14.01    155.79   Mon Jan 21 10:28:43 2013
>   320   108033.43   98       337.6045     17.95    314.01   Mon Jan 21 10:29:01 2013
>   320   108577.83   98       339.3057     17.86    311.79   Mon Jan 21 10:29:19 2013
>   320   108395.75   98       338.7367     17.89    312.55   Mon Jan 21 10:29:37 2013
>   640   151440.84   98       236.6263     25.61    620.37   Mon Jan 21 10:30:03 2013
>   640   151440.84   97       236.6263     25.61    621.23   Mon Jan 21 10:30:29 2013
>   640   151145.75   98       236.1652     25.66    622.35   Mon Jan 21 10:30:55 2013
>  1280   190117.65   98       148.5294     40.80   1228.40   Mon Jan 21 10:31:36 2013
>  1280   189977.96   98       148.4203     40.83   1229.91   Mon Jan 21 10:32:17 2013
>  1280   189560.12   98       148.0938     40.92   1231.71   Mon Jan 21 10:32:58 2013
>  2560   217857.04   98        85.1004     71.21   2441.61   Mon Jan 21 10:34:09 2013
>  2560   217338.19   98        84.8977     71.38   2448.76   Mon Jan 21 10:35:21 2013
>  2560   217795.87   97        85.0765     71.23   2443.12   Mon Jan 21 10:36:32 2013
> 
> That was with your change backed out, and the q/d below applied.

So that change helps to solve the issue? Good to know :)

But it will invoke wake_affine() without any delay, so the benefit
of the patch set will be reduced a lot...

I think this change helps to solve the issue because it avoids jumping
into the balance path on wakeup in all cases. I think we can make a
change like the one below to achieve that and meanwhile keep the
benefit of delaying wake_affine().

Since the issue could not be reproduced on my side, I don't know
whether the patch benefits or not, so if you are willing to send out
a formal patch, I would be glad to include it in my patch set ;-)

And the second patch below is a debug one, which will print out
all the sbm info, so we can check whether it was initialized
correctly; just use the command "dmesg | grep WYT" to show the map.

Regards,
Michael Wang

---
 kernel/sched/fair.c |   42 +++++++++++++++++++++++++-----------------
 1 files changed, 25 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2aa26c1..4361333 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3250,7 +3250,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 }

 /*
- * Try and locate an idle CPU in the sched_domain.
+ * Try and locate an idle CPU in the sched_domain, return -1 if failed.
  */
 static int select_idle_sibling(struct task_struct *p, int target)
 {
@@ -3292,13 +3292,13 @@ static int select_idle_sibling(struct task_struct *p, int target)

                        target = cpumask_first_and(sched_group_cpus(sg),
                                        tsk_cpus_allowed(p));
-                       goto done;
+                       return target;
 next:
                        sg = sg->next;
                } while (sg != sd->groups);
        }
-done:
-       return target;
+
+       return -1;
 }

 /*
@@ -3342,40 +3342,48 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
                 * may has already been cached on prev_cpu, and usually
                 * they require low latency.
                 *
-                * So firstly try to locate an idle cpu shared the cache
+                * Therefore, the balance path in such a case may do damage
+                * as easily as it brings benefit: waking up on prev_cpu
+                * may beat waking up on a new lower-loaded cpu because of
+                * the cached memory, and we never know in advance.
+                *
+                * So the principle is: try to find an idle cpu as close to
+                * prev_cpu as possible, and if that fails, just take prev_cpu.
+                *
+                * Firstly try to locate an idle cpu shared the cache
                 * with prev_cpu, it has the chance to break the load
                 * balance, fortunately, select_idle_sibling() will search
                 * from top to bottom, which help to reduce the chance in
                 * some cases.
                 */
                new_cpu = select_idle_sibling(p, prev_cpu);
-               if (idle_cpu(new_cpu))
+               if (new_cpu != -1)
                        goto unlock;

                /*
                 * No idle cpu could be found in the topology of prev_cpu,
-                * before jump into the slow balance_path, try search again
-                * in the topology of current cpu if it is the affine of
-                * prev_cpu.
+                * before take the prev_cpu, try search again in the
+                * topology of current cpu if it is the affine of prev_cpu.
                 */
-               if (cpu == prev_cpu ||
-                               !sbm->affine_map[prev_cpu] ||
+               if (cpu == prev_cpu || !sbm->affine_map[prev_cpu] ||
                                !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
-                       goto balance_path;
+                       goto take_prev;

                new_cpu = select_idle_sibling(p, cpu);
-               if (!idle_cpu(new_cpu))
-                       goto balance_path;
-
                /*
                 * Invoke wake_affine() finally since it is no doubt a
                 * performance killer.
                 */
-               if (wake_affine(sbm->affine_map[prev_cpu], p, sync))
+               if ((new_cpu != -1) &&
+                       wake_affine(sbm->affine_map[prev_cpu], p, sync))
                        goto unlock;
+
+take_prev:
+               new_cpu = prev_cpu;
+               goto unlock;
        }

-balance_path:
+       /* Balance path. */
        new_cpu = (sd_flag & SD_BALANCE_WAKE) ? prev_cpu : cpu;
        sd = sbm->sd[type][sbm->top_level[type]];

-- 
1.7.4.1

DEBUG PATCH:

---
 kernel/sched/core.c |   30 ++++++++++++++++++++++++++++++
 1 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0c63303..f251f29 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5578,6 +5578,35 @@ static void update_top_cache_domain(int cpu)
 static int sbm_max_level;
 DEFINE_PER_CPU_SHARED_ALIGNED(struct sched_balance_map, sbm_array);

+static void debug_sched_balance_map(int cpu)
+{
+       int i, type, level = 0;
+       struct sched_balance_map *sbm = &per_cpu(sbm_array, cpu);
+
+       printk("WYT: sbm of cpu %d\n", cpu);
+
+       for (type = 0; type < SBM_MAX_TYPE; type++) {
+               if (type == SBM_EXEC_TYPE)
+                       printk("WYT: \t exec map\n");
+               else if (type == SBM_FORK_TYPE)
+                       printk("WYT: \t fork map\n");
+               else if (type == SBM_WAKE_TYPE)
+                       printk("WYT: \t wake map\n");
+
+               for (level = 0; level < sbm_max_level; level++) {
+                       if (sbm->sd[type][level])
+                               printk("WYT: \t\t sd %x, idx %d, level %d, weight %d\n", sbm->sd[type][level], level, sbm->sd[type][level]->level, sbm->sd[type][level]->span_weight);
+               }
+       }
+
+       printk("WYT: \t affine map\n");
+
+       for_each_possible_cpu(i) {
+               if (sbm->affine_map[i])
+                       printk("WYT: \t\t affine with cpu %x in sd %x, weight %d\n", i, sbm->affine_map[i], sbm->affine_map[i]->span_weight);
+       }
+}
+
 static void build_sched_balance_map(int cpu)
 {
        struct sched_balance_map *sbm = &per_cpu(sbm_array, cpu);
@@ -5688,6 +5717,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
         * destroy_sched_domains() already do the work.
         */
        build_sched_balance_map(cpu);
+       debug_sched_balance_map(cpu);
        rcu_assign_pointer(rq->sbm, sbm);
 }

-- 
1.7.4.1

> 
> ---
>  kernel/sched/fair.c |   27 ++++++---------------------
>  1 file changed, 6 insertions(+), 21 deletions(-)
> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3337,6 +3337,8 @@ select_task_rq_fair(struct task_struct *
>  		goto unlock;
> 
>  	if (sd_flag & SD_BALANCE_WAKE) {
> +		new_cpu = prev_cpu;
> +
>  		/*
>  		 * Tasks to be waked is special, memory it relied on
>  		 * may has already been cached on prev_cpu, and usually
> @@ -3348,33 +3350,16 @@ select_task_rq_fair(struct task_struct *
>  		 * from top to bottom, which help to reduce the chance in
>  		 * some cases.
>  		 */
> -		new_cpu = select_idle_sibling(p, prev_cpu);
> +		new_cpu = select_idle_sibling(p, new_cpu);
>  		if (idle_cpu(new_cpu))
>  			goto unlock;
> 
> -		/*
> -		 * No idle cpu could be found in the topology of prev_cpu,
> -		 * before jump into the slow balance_path, try search again
> -		 * in the topology of current cpu if it is the affine of
> -		 * prev_cpu.
> -		 */
> -		if (!sbm->affine_map[prev_cpu] ||
> -				!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
> -			goto balance_path;
> -
> -		new_cpu = select_idle_sibling(p, cpu);
> -		if (!idle_cpu(new_cpu))
> -			goto balance_path;
> +		if (wake_affine(sbm->affine_map[cpu], p, sync))
> +			new_cpu = select_idle_sibling(p, cpu);
> 
> -		/*
> -		 * Invoke wake_affine() finally since it is no doubt a
> -		 * performance killer.
> -		 */
> -		if (wake_affine(sbm->affine_map[prev_cpu], p, sync))
> -			goto unlock;
> +		goto unlock;
>  	}
> 
> -balance_path:
>  	new_cpu = (sd_flag & SD_BALANCE_WAKE) ? prev_cpu : cpu;
>  	sd = sbm->sd[type][sbm->top_level[type]];
> 
> 
> 
> 
> 


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-22  3:43                               ` Michael Wang
@ 2013-01-22  8:03                                 ` Mike Galbraith
  2013-01-22  8:56                                   ` Michael Wang
  0 siblings, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2013-01-22  8:03 UTC (permalink / raw)
  To: Michael Wang; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On Tue, 2013-01-22 at 11:43 +0800, Michael Wang wrote: 
> On 01/21/2013 05:44 PM, Mike Galbraith wrote:
> > On Mon, 2013-01-21 at 17:22 +0800, Michael Wang wrote: 
> >> On 01/21/2013 05:09 PM, Mike Galbraith wrote:
> >>> On Mon, 2013-01-21 at 15:45 +0800, Michael Wang wrote: 
> >>>> On 01/21/2013 03:09 PM, Mike Galbraith wrote:
> >>>>> On Mon, 2013-01-21 at 07:42 +0100, Mike Galbraith wrote: 
> >>>>>> On Mon, 2013-01-21 at 13:07 +0800, Michael Wang wrote:
> >>>>>
> >>>>>>> May be we could try change this back to the old way later, after the aim
> >>>>>>> 7 test on my server.
> >>>>>>
> >>>>>> Yeah, something funny is going on.
> >>>>>
> >>>>> Never entering balance path kills the collapse.  Asking wake_affine()
> >>>>> wrt the pull as before, but allowing us to continue should no idle cpu
> >>>>> be found, still collapsed.  So the source of funny behavior is indeed in
> >>>>> balance_path.
> >>>>
> >>>> Below patch based on the patch set could help to avoid enter balance path
> >>>> if affine_sd could be found, just like the old logical, would you like to
> >>>> take a try and see whether it could help fix the collapse?
> >>>
> >>> No, it does not.
> >>
> >> Hmm...what have changed now compared to the old logical?
> > 
> > What I did earlier to confirm the collapse originates in balance_path is
> > below.  I just retested to confirm.
> > 
> > Tasks    jobs/min  jti  jobs/min/task      real       cpu
> >     1      435.34  100       435.3448     13.92      3.76   Mon Jan 21 10:24:00 2013
> >     1      440.09  100       440.0871     13.77      3.76   Mon Jan 21 10:24:22 2013
> >     1      440.41  100       440.4070     13.76      3.75   Mon Jan 21 10:24:45 2013
... 
> > 
> > That was with your change backed out, and the q/d below applied.
> 
> So that change will help to solve the issue? good to know :)
> 
> But it will invoke wake_affine() with out any delay, the benefit
> of the patch set will be reduced a lot...

Yeah, I used size large hammer.

> I think this change help to solve the issue because it avoid jump
> into balance path when wakeup for any cases, I think we can do
> some change like below to achieve this and meanwhile gain benefit
> from delay wake_affine().

Yup, I killed it all the way dead.  I'll see what below does.

I don't really see the point of the wake_affine() change in this set
though.  Its purpose is to decide if a pull is ok or not.  If we don't
need its opinion when we look for a (momentarily?) idle core in
this_domain, we shouldn't need it at all, and could just delete it.  If
we ever enter balance_path, we can't possibly induce imbalance without
there being something broken in that path, no?

BTW, it could well be that an unpatched kernel will collapse as well if
WAKE_BALANCE is turned on.  I've never tried that on a largish box, as
doing any of the wakeup time optional stuff used to make tbench scream.

> Since the issue could not been reproduced on my side, I don't know
> whether the patch benefit or not, so if you are willing to send out
> a formal patch, I would be glad to include it in my patch set ;-)

Just changing to scan prev_cpu before considering pulling would put a
big dent in the bouncing cow problem, but that's the intriguing thing
about this set.. can we have the tbench and pgbench big box gain without
a lot of pain to go with it?  Small boxen will surely benefit, pretty
much can't be hurt, but what about all those fast/light tasks that won't
hop across nodes to red hot data?

No formal patch is likely to result from any testing I do atm at least.
I'm testing your patches because I see potential, I really want it to
work out, but have to see it do that with my own two beady eyeballs ;-)

> And another patch below below is a debug one, which will print out
> all the sbm info, so we could check whether it was initialized
> correctly, just use command "dmesg | grep WYT" to show the map.
> 
> Regards,
> Michael Wang
> 
> ---
>  kernel/sched/fair.c |   42 +++++++++++++++++++++++++-----------------
>  1 files changed, 25 insertions(+), 17 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2aa26c1..4361333 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3250,7 +3250,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>  }
> 
>  /*
> - * Try and locate an idle CPU in the sched_domain.
> + * Try and locate an idle CPU in the sched_domain, return -1 if failed.
>   */
>  static int select_idle_sibling(struct task_struct *p, int target)
>  {
> @@ -3292,13 +3292,13 @@ static int select_idle_sibling(struct task_struct *p, int target)
> 
>                         target = cpumask_first_and(sched_group_cpus(sg),
>                                         tsk_cpus_allowed(p));
> -                       goto done;
> +                       return target;
>  next:
>                         sg = sg->next;
>                 } while (sg != sd->groups);
>         }
> -done:
> -       return target;
> +
> +       return -1;
>  }
> 
>  /*
> @@ -3342,40 +3342,48 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>                  * may has already been cached on prev_cpu, and usually
>                  * they require low latency.
>                  *
> -                * So firstly try to locate an idle cpu shared the cache
> +                * Therefor, balance path in such case will cause damage
> +                * and bring benefit synchronously, wakeup on prev_cpu
> +                * may better than wakeup on a new lower load cpu for the
> +                * cached memory, and we never know.
> +                *
> +                * So the principle is, try to find an idle cpu as close to
> +                * prev_cpu as possible, if failed, just take prev_cpu.
> +                *
> +                * Firstly try to locate an idle cpu shared the cache
>                  * with prev_cpu, it has the chance to break the load
>                  * balance, fortunately, select_idle_sibling() will search
>                  * from top to bottom, which help to reduce the chance in
>                  * some cases.
>                  */
>                 new_cpu = select_idle_sibling(p, prev_cpu);
> -               if (idle_cpu(new_cpu))
> +               if (new_cpu != -1)
>                         goto unlock;
> 
>                 /*
>                  * No idle cpu could be found in the topology of prev_cpu,
> -                * before jump into the slow balance_path, try search again
> -                * in the topology of current cpu if it is the affine of
> -                * prev_cpu.
> +                * before take the prev_cpu, try search again in the
> +                * topology of current cpu if it is the affine of prev_cpu.
>                  */
> -               if (cpu == prev_cpu ||
> -                               !sbm->affine_map[prev_cpu] ||
> +               if (cpu == prev_cpu || !sbm->affine_map[prev_cpu] ||
>                                 !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
> -                       goto balance_path;
> +                       goto take_prev;
> 
>                 new_cpu = select_idle_sibling(p, cpu);
> -               if (!idle_cpu(new_cpu))
> -                       goto balance_path;
> -
>                 /*
>                  * Invoke wake_affine() finally since it is no doubt a
>                  * performance killer.
>                  */
> -               if (wake_affine(sbm->affine_map[prev_cpu], p, sync))
> +               if ((new_cpu != -1) &&
> +                       wake_affine(sbm->affine_map[prev_cpu], p, sync))
>                         goto unlock;
> +
> +take_prev:
> +               new_cpu = prev_cpu;
> +               goto unlock;
>         }
> 
> -balance_path:
> +       /* Balance path. */
>         new_cpu = (sd_flag & SD_BALANCE_WAKE) ? prev_cpu : cpu;
>         sd = sbm->sd[type][sbm->top_level[type]];
> 
> -- 
> 1.7.4.1
> 
> DEBUG PATCH:
> 
> ---
>  kernel/sched/core.c |   30 ++++++++++++++++++++++++++++++
>  1 files changed, 30 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0c63303..f251f29 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5578,6 +5578,35 @@ static void update_top_cache_domain(int cpu)
>  static int sbm_max_level;
>  DEFINE_PER_CPU_SHARED_ALIGNED(struct sched_balance_map, sbm_array);
> 
> +static void debug_sched_balance_map(int cpu)
> +{
> +       int i, type, level = 0;
> +       struct sched_balance_map *sbm = &per_cpu(sbm_array, cpu);
> +
> +       printk("WYT: sbm of cpu %d\n", cpu);
> +
> +       for (type = 0; type < SBM_MAX_TYPE; type++) {
> +               if (type == SBM_EXEC_TYPE)
> +                       printk("WYT: \t exec map\n");
> +               else if (type == SBM_FORK_TYPE)
> +                       printk("WYT: \t fork map\n");
> +               else if (type == SBM_WAKE_TYPE)
> +                       printk("WYT: \t wake map\n");
> +
> +               for (level = 0; level < sbm_max_level; level++) {
> +                       if (sbm->sd[type][level])
> +                               printk("WYT: \t\t sd %x, idx %d, level %d, weight %d\n", sbm->sd[type][level], level, sbm->sd[type][level]->level, sbm->sd[type][level]->span_weight);
> +               }
> +       }
> +
> +       printk("WYT: \t affine map\n");
> +
> +       for_each_possible_cpu(i) {
> +               if (sbm->affine_map[i])
> +                       printk("WYT: \t\t affine with cpu %x in sd %x, weight %d\n", i, sbm->affine_map[i], sbm->affine_map[i]->span_weight);
> +       }
> +}
> +
>  static void build_sched_balance_map(int cpu)
>  {
>         struct sched_balance_map *sbm = &per_cpu(sbm_array, cpu);
> @@ -5688,6 +5717,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
>          * destroy_sched_domains() already do the work.
>          */
>         build_sched_balance_map(cpu);
> +       debug_sched_balance_map(cpu);
>         rcu_assign_pointer(rq->sbm, sbm);
>  }
> 



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-22  8:03                                 ` Mike Galbraith
@ 2013-01-22  8:56                                   ` Michael Wang
  2013-01-22 11:34                                     ` Mike Galbraith
  2013-01-22 14:41                                     ` Mike Galbraith
  0 siblings, 2 replies; 57+ messages in thread
From: Michael Wang @ 2013-01-22  8:56 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On 01/22/2013 04:03 PM, Mike Galbraith wrote:
[snip]
> ... 
>>>
>>> That was with your change backed out, and the q/d below applied.
>>
>> So that change will help to solve the issue? good to know :)
>>
>> But it will invoke wake_affine() with out any delay, the benefit
>> of the patch set will be reduced a lot...
> 
> Yeah, I used size large hammer.
> 
>> I think this change help to solve the issue because it avoid jump
>> into balance path when wakeup for any cases, I think we can do
>> some change like below to achieve this and meanwhile gain benefit
>> from delay wake_affine().
> 
> Yup, I killed it all the way dead.  I'll see what below does.
> 
> I don't really see the point of the wake_affine() change in this set
> though.  Its purpose is to decide if a pull is ok or not.  If we don't
> need its opinion when we look for an (momentarily?) idle core in
> this_domain, we shouldn't need it at all, and could just delete it.

I have a question here: what is wake_affine() supposed to do?
A. check whether a pull keeps things balanced.
B. check whether it's better to pull than not.

I suppose it's A, so my logic is (rough sketch below):
1. find an idle cpu in the topology of prev_cpu.
2. if that fails and the wakeup is affine, find an idle cpu in the
topology of the current cpu.
3. if an idle cpu is found in the current cpu's topology, check whether
it is balanced to pull the task there via wake_affine().
4. if all of that fails, two choices: go to the balance path or directly
return prev_cpu.

So I still need wake_affine() for a final check, but to be honest, I
really doubt whether it's worth caring about balance while waking up;
if the task just runs for several ns and then sleeps again, it's totally
worthless...
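
To make the ordering concrete, here is a rough, untested sketch of the
wake branch as I described it above; it only reuses the identifiers from
the patch in this thread (select_idle_sibling(), wake_affine(), the sbm
maps), so please read it as an illustration of steps 1-4, not as code
to apply:

	if (sd_flag & SD_BALANCE_WAKE) {
		/* 1. an idle cpu in the topology of prev_cpu? */
		new_cpu = select_idle_sibling(p, prev_cpu);
		if (new_cpu != -1)
			goto unlock;

		/* 2. current cpu is an affine of prev_cpu? search it too */
		if (cpu != prev_cpu && sbm->affine_map[prev_cpu] &&
		    cpumask_test_cpu(cpu, tsk_cpus_allowed(p))) {
			new_cpu = select_idle_sibling(p, cpu);
			/* 3. only pull if wake_affine() says balance allows it */
			if (new_cpu != -1 &&
			    wake_affine(sbm->affine_map[prev_cpu], p, sync))
				goto unlock;
		}

		/*
		 * 4. nothing idle found: here I take prev_cpu, the other
		 *    choice would be to fall through to the balance path.
		 */
		new_cpu = prev_cpu;
		goto unlock;
	}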

> If we ever enter balance_path, we can't possibly induce imbalance without
> there being something broken in that path, no?

So your opinion is that something is broken in the new balance path?

> 
> BTW, it could well be that an unpatched kernel will collapse as well if
> WAKE_BALANCE is turned on.  I've never tried that on a largish box, as
> doing any of the wakeup time optional stuff used to make tbench scream.
> 
>> Since the issue could not been reproduced on my side, I don't know
>> whether the patch benefit or not, so if you are willing to send out
>> a formal patch, I would be glad to include it in my patch set ;-)
> 
> Just changing to scan prev_cpu before considering pulling would put a
> big dent in the bouncing cow problem, but that's the intriguing thing
> about this set.. 

So that's my first question: if wake_affine() returning 1 means it's
better to pull than not, then the new way may be harmful; but if it just
tells us that a pull won't break the balance, then I still think the
current domain is just a backup, not the first-choice candidate.

> can we have the tbench and pgbench big box gain without
> a lot of pain to go with it?  Small boxen will surely benefit, pretty
> much can't be hurt, but what about all those fast/light tasks that won't
> hop across nodes to red hot data?

I don't get it... does a task that won't hop mean a task that is always
selected to run on prev_cpu?

We will assign an idle cpu if we find one, but if not, we can use prev_cpu
or go to the balance path and find one, so what's the problem here?

> 
> No formal patch is likely to result from any testing I do atm at least.
> I'm testing your patches because I see potential, I really want it to
> work out, but have to see it do that with my own two beady eyeballs ;-)

Got it.

> 
>> And another patch below below is a debug one, which will print out
>> all the sbm info, so we could check whether it was initialized
>> correctly, just use command "dmesg | grep WYT" to show the map.

What about this patch? Maybe a wrong map is the killer on the balance
path, should we check it? ;-)

Regards,
Michael Wang

>>
>> Regards,
>> Michael Wang
>>
>> ---
>>  kernel/sched/fair.c |   42 +++++++++++++++++++++++++-----------------
>>  1 files changed, 25 insertions(+), 17 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 2aa26c1..4361333 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3250,7 +3250,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>>  }
>>
>>  /*
>> - * Try and locate an idle CPU in the sched_domain.
>> + * Try and locate an idle CPU in the sched_domain, return -1 if failed.
>>   */
>>  static int select_idle_sibling(struct task_struct *p, int target)
>>  {
>> @@ -3292,13 +3292,13 @@ static int select_idle_sibling(struct task_struct *p, int target)
>>
>>                         target = cpumask_first_and(sched_group_cpus(sg),
>>                                         tsk_cpus_allowed(p));
>> -                       goto done;
>> +                       return target;
>>  next:
>>                         sg = sg->next;
>>                 } while (sg != sd->groups);
>>         }
>> -done:
>> -       return target;
>> +
>> +       return -1;
>>  }
>>
>>  /*
>> @@ -3342,40 +3342,48 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>>                  * may has already been cached on prev_cpu, and usually
>>                  * they require low latency.
>>                  *
>> -                * So firstly try to locate an idle cpu shared the cache
>> +                * Therefor, balance path in such case will cause damage
>> +                * and bring benefit synchronously, wakeup on prev_cpu
>> +                * may better than wakeup on a new lower load cpu for the
>> +                * cached memory, and we never know.
>> +                *
>> +                * So the principle is, try to find an idle cpu as close to
>> +                * prev_cpu as possible, if failed, just take prev_cpu.
>> +                *
>> +                * Firstly try to locate an idle cpu shared the cache
>>                  * with prev_cpu, it has the chance to break the load
>>                  * balance, fortunately, select_idle_sibling() will search
>>                  * from top to bottom, which help to reduce the chance in
>>                  * some cases.
>>                  */
>>                 new_cpu = select_idle_sibling(p, prev_cpu);
>> -               if (idle_cpu(new_cpu))
>> +               if (new_cpu != -1)
>>                         goto unlock;
>>
>>                 /*
>>                  * No idle cpu could be found in the topology of prev_cpu,
>> -                * before jump into the slow balance_path, try search again
>> -                * in the topology of current cpu if it is the affine of
>> -                * prev_cpu.
>> +                * before take the prev_cpu, try search again in the
>> +                * topology of current cpu if it is the affine of prev_cpu.
>>                  */
>> -               if (cpu == prev_cpu ||
>> -                               !sbm->affine_map[prev_cpu] ||
>> +               if (cpu == prev_cpu || !sbm->affine_map[prev_cpu] ||
>>                                 !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
>> -                       goto balance_path;
>> +                       goto take_prev;
>>
>>                 new_cpu = select_idle_sibling(p, cpu);
>> -               if (!idle_cpu(new_cpu))
>> -                       goto balance_path;
>> -
>>                 /*
>>                  * Invoke wake_affine() finally since it is no doubt a
>>                  * performance killer.
>>                  */
>> -               if (wake_affine(sbm->affine_map[prev_cpu], p, sync))
>> +               if ((new_cpu != -1) &&
>> +                       wake_affine(sbm->affine_map[prev_cpu], p, sync))
>>                         goto unlock;
>> +
>> +take_prev:
>> +               new_cpu = prev_cpu;
>> +               goto unlock;
>>         }
>>
>> -balance_path:
>> +       /* Balance path. */
>>         new_cpu = (sd_flag & SD_BALANCE_WAKE) ? prev_cpu : cpu;
>>         sd = sbm->sd[type][sbm->top_level[type]];
>>
>> -- 
>> 1.7.4.1
>>
>> DEBUG PATCH:
>>
>> ---
>>  kernel/sched/core.c |   30 ++++++++++++++++++++++++++++++
>>  1 files changed, 30 insertions(+), 0 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 0c63303..f251f29 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -5578,6 +5578,35 @@ static void update_top_cache_domain(int cpu)
>>  static int sbm_max_level;
>>  DEFINE_PER_CPU_SHARED_ALIGNED(struct sched_balance_map, sbm_array);
>>
>> +static void debug_sched_balance_map(int cpu)
>> +{
>> +       int i, type, level = 0;
>> +       struct sched_balance_map *sbm = &per_cpu(sbm_array, cpu);
>> +
>> +       printk("WYT: sbm of cpu %d\n", cpu);
>> +
>> +       for (type = 0; type < SBM_MAX_TYPE; type++) {
>> +               if (type == SBM_EXEC_TYPE)
>> +                       printk("WYT: \t exec map\n");
>> +               else if (type == SBM_FORK_TYPE)
>> +                       printk("WYT: \t fork map\n");
>> +               else if (type == SBM_WAKE_TYPE)
>> +                       printk("WYT: \t wake map\n");
>> +
>> +               for (level = 0; level < sbm_max_level; level++) {
>> +                       if (sbm->sd[type][level])
>> +                               printk("WYT: \t\t sd %x, idx %d, level %d, weight %d\n", sbm->sd[type][level], level, sbm->sd[type][level]->level, sbm->sd[type][level]->span_weight);
>> +               }
>> +       }
>> +
>> +       printk("WYT: \t affine map\n");
>> +
>> +       for_each_possible_cpu(i) {
>> +               if (sbm->affine_map[i])
>> +                       printk("WYT: \t\t affine with cpu %x in sd %x, weight %d\n", i, sbm->affine_map[i], sbm->affine_map[i]->span_weight);
>> +       }
>> +}
>> +
>>  static void build_sched_balance_map(int cpu)
>>  {
>>         struct sched_balance_map *sbm = &per_cpu(sbm_array, cpu);
>> @@ -5688,6 +5717,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
>>          * destroy_sched_domains() already do the work.
>>          */
>>         build_sched_balance_map(cpu);
>> +       debug_sched_balance_map(cpu);
>>         rcu_assign_pointer(rq->sbm, sbm);
>>  }
>>
> 
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-22  8:56                                   ` Michael Wang
@ 2013-01-22 11:34                                     ` Mike Galbraith
  2013-01-23  3:01                                       ` Michael Wang
  2013-01-22 14:41                                     ` Mike Galbraith
  1 sibling, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2013-01-22 11:34 UTC (permalink / raw)
  To: Michael Wang; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On Tue, 2013-01-22 at 16:56 +0800, Michael Wang wrote: 
> On 01/22/2013 04:03 PM, Mike Galbraith wrote:
> [snip]
> > ... 
> >>>
> >>> That was with your change backed out, and the q/d below applied.
> >>
> >> So that change will help to solve the issue? good to know :)
> >>
> >> But it will invoke wake_affine() with out any delay, the benefit
> >> of the patch set will be reduced a lot...
> > 
> > Yeah, I used size large hammer.
> > 
> >> I think this change help to solve the issue because it avoid jump
> >> into balance path when wakeup for any cases, I think we can do
> >> some change like below to achieve this and meanwhile gain benefit
> >> from delay wake_affine().
> > 
> > Yup, I killed it all the way dead.  I'll see what below does.
> > 
> > I don't really see the point of the wake_affine() change in this set
> > though.  Its purpose is to decide if a pull is ok or not.  If we don't
> > need its opinion when we look for an (momentarily?) idle core in
> > this_domain, we shouldn't need it at all, and could just delete it.
> 
> I have a question here, so wake_affine() is:
> A. check whether it is balance to pull.
> B. check whether it's better to pull than not.

A, "is it ok to move this guy to where red hot data awaits" is the way
it has always been used, ie "can we move him here without upsetting
balance too much", with sync hint meaning the waker is likely going to
sleep very soon, so we pretend he's already gone when looking at load.
That sync hint btw doesn't have anywhere near as much real meaning as
would be nice to have.
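
Stripped way down, the shape of that check is something like the sketch
below.  This is only an illustration, not the mainline code: the
cpu_load_of()/task_load_of() helpers are stand-ins, and the real
wake_affine() also weights things with imbalance_pct, load indexes etc.

	/* stand-in helpers, not real kernel functions */
	static long cpu_load_of(int cpu);
	static long task_load_of(struct task_struct *p);

	static int pull_ok(struct task_struct *p, int this_cpu, int prev_cpu,
			   int sync)
	{
		long this_load = cpu_load_of(this_cpu);

		if (sync)				/* waker about to sleep */
			this_load -= task_load_of(current);
		this_load += task_load_of(p);		/* pretend wakee lands here */

		/* ok to pull if this side doesn't end up noticeably heavier */
		return this_load <= cpu_load_of(prev_cpu);
	}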

> I suppose it's A, so my logical is:
> 1. find idle cpu in prev domain.
> 2. if failed and affine, find idle cpu in current domain.

Hm.  If cpu and prev_cpu are cache affine, you already searched both.

> 3. if find idle cpu in current domain, check whether it is balance to
> pull by wake_affine().
> 4. if all failed, two choice, go to balance path or directly return
> prev_cpu.
> 
> So I still need wake_affine() for a final check, but to be honest, I
> really doubt about whether it worth to care about balance while waking
> up, if the task just run several ns then sleep again, it's totally
> worthless...

Not totally worthless, but questionable yes.  It really matters most
when wakee was way over there in cache foo for whatever reason, has no
big footprint, and red hot data is waiting here in cache bar.  Light
tasks migrating helps communicating buddies find each other and perform.

The new NUMA stuff will help heavy tasks, but it won't help with light
tasks that could benefit by moving to the data.  We currently try to
migrate on wakeup, if we do stop doing that, we may get hurt more often
than not, dunno.  Benchmarks will tell.

> > If we ever enter balance_path, we can't possibly induce imbalance without
> > there being something broken in that path, no?
> 
> So your opinion is, some thing broken in the new balance path?

I don't know that, but it is a logical bug candidate. 

> > BTW, it could well be that an unpatched kernel will collapse as well if
> > WAKE_BALANCE is turned on.  I've never tried that on a largish box, as
> > doing any of the wakeup time optional stuff used to make tbench scream.
> > 
> >> Since the issue could not been reproduced on my side, I don't know
> >> whether the patch benefit or not, so if you are willing to send out
> >> a formal patch, I would be glad to include it in my patch set ;-)
> > 
> > Just changing to scan prev_cpu before considering pulling would put a
> > big dent in the bouncing cow problem, but that's the intriguing thing
> > about this set.. 
> 
> So that's my first question, if wake_affine() return 1 means it's better
> to pull than not, then the new way may be harmful, but if it's just told
> us, pull won't break the balance, then I still think, current domain is
> just a backup, not the candidate of first choice.

wake_affine() doesn't know if it'll be a good or bad move, it only says
go for it, load numbers are within parameters.

> > can we have the tbench and pgbench big box gain without
> > a lot of pain to go with it?  Small boxen will surely benefit, pretty
> > much can't be hurt, but what about all those fast/light tasks that won't
> > hop across nodes to red hot data?
> 
> I don't get it... a task won't hop means a task always been selected to
> run on prev_cpu?

Yes.  If wakee is light and has no large footprint to later drag to
its new home, moving it to the hot data is a win.

> We will assign idle cpu if we found, but if not, we can use prev_cpu or
> go to balance path and find one, so what's the problem here?

Wakeup latency may be low, but the task can still perform badly due to
misses.  In the tbench case, cross node data misses aren't anywhere near
as bad as the _everything_ is a miss you get from waker/wakee bouncing
all over a single shared L3. 

> > No formal patch is likely to result from any testing I do atm at least.
> > I'm testing your patches because I see potential, I really want it to
> > work out, but have to see it do that with my own two beady eyeballs ;-)
> 
> Got it.
> 
> > 
> >> And another patch below below is a debug one, which will print out
> >> all the sbm info, so we could check whether it was initialized
> >> correctly, just use command "dmesg | grep WYT" to show the map.
> 
> What about this patch? May be the wrong map is the killer on balance
> path, should we check it? ;-)

Yeah, I haven't actually looked for any booboos, just ran it straight out
of the box ;-)

> Regards,
> Michael Wang
> 
> >>
> >> Regards,
> >> Michael Wang
> >>
> >> ---
> >>  kernel/sched/fair.c |   42 +++++++++++++++++++++++++-----------------
> >>  1 files changed, 25 insertions(+), 17 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 2aa26c1..4361333 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -3250,7 +3250,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
> >>  }
> >>
> >>  /*
> >> - * Try and locate an idle CPU in the sched_domain.
> >> + * Try and locate an idle CPU in the sched_domain, return -1 if failed.
> >>   */
> >>  static int select_idle_sibling(struct task_struct *p, int target)
> >>  {
> >> @@ -3292,13 +3292,13 @@ static int select_idle_sibling(struct task_struct *p, int target)
> >>
> >>                         target = cpumask_first_and(sched_group_cpus(sg),
> >>                                         tsk_cpus_allowed(p));
> >> -                       goto done;
> >> +                       return target;
> >>  next:
> >>                         sg = sg->next;
> >>                 } while (sg != sd->groups);
> >>         }
> >> -done:
> >> -       return target;
> >> +
> >> +       return -1;
> >>  }
> >>
> >>  /*
> >> @@ -3342,40 +3342,48 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> >>                  * may has already been cached on prev_cpu, and usually
> >>                  * they require low latency.
> >>                  *
> >> -                * So firstly try to locate an idle cpu shared the cache
> >> +                * Therefor, balance path in such case will cause damage
> >> +                * and bring benefit synchronously, wakeup on prev_cpu
> >> +                * may better than wakeup on a new lower load cpu for the
> >> +                * cached memory, and we never know.
> >> +                *
> >> +                * So the principle is, try to find an idle cpu as close to
> >> +                * prev_cpu as possible, if failed, just take prev_cpu.
> >> +                *
> >> +                * Firstly try to locate an idle cpu shared the cache
> >>                  * with prev_cpu, it has the chance to break the load
> >>                  * balance, fortunately, select_idle_sibling() will search
> >>                  * from top to bottom, which help to reduce the chance in
> >>                  * some cases.
> >>                  */
> >>                 new_cpu = select_idle_sibling(p, prev_cpu);
> >> -               if (idle_cpu(new_cpu))
> >> +               if (new_cpu != -1)
> >>                         goto unlock;
> >>
> >>                 /*
> >>                  * No idle cpu could be found in the topology of prev_cpu,
> >> -                * before jump into the slow balance_path, try search again
> >> -                * in the topology of current cpu if it is the affine of
> >> -                * prev_cpu.
> >> +                * before take the prev_cpu, try search again in the
> >> +                * topology of current cpu if it is the affine of prev_cpu.
> >>                  */
> >> -               if (cpu == prev_cpu ||
> >> -                               !sbm->affine_map[prev_cpu] ||
> >> +               if (cpu == prev_cpu || !sbm->affine_map[prev_cpu] ||
> >>                                 !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
> >> -                       goto balance_path;
> >> +                       goto take_prev;
> >>
> >>                 new_cpu = select_idle_sibling(p, cpu);
> >> -               if (!idle_cpu(new_cpu))
> >> -                       goto balance_path;
> >> -
> >>                 /*
> >>                  * Invoke wake_affine() finally since it is no doubt a
> >>                  * performance killer.
> >>                  */
> >> -               if (wake_affine(sbm->affine_map[prev_cpu], p, sync))
> >> +               if ((new_cpu != -1) &&
> >> +                       wake_affine(sbm->affine_map[prev_cpu], p, sync))
> >>                         goto unlock;
> >> +
> >> +take_prev:
> >> +               new_cpu = prev_cpu;
> >> +               goto unlock;
> >>         }
> >>
> >> -balance_path:
> >> +       /* Balance path. */
> >>         new_cpu = (sd_flag & SD_BALANCE_WAKE) ? prev_cpu : cpu;
> >>         sd = sbm->sd[type][sbm->top_level[type]];
> >>
> >> -- 
> >> 1.7.4.1
> >>
> >> DEBUG PATCH:
> >>
> >> ---
> >>  kernel/sched/core.c |   30 ++++++++++++++++++++++++++++++
> >>  1 files changed, 30 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index 0c63303..f251f29 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -5578,6 +5578,35 @@ static void update_top_cache_domain(int cpu)
> >>  static int sbm_max_level;
> >>  DEFINE_PER_CPU_SHARED_ALIGNED(struct sched_balance_map, sbm_array);
> >>
> >> +static void debug_sched_balance_map(int cpu)
> >> +{
> >> +       int i, type, level = 0;
> >> +       struct sched_balance_map *sbm = &per_cpu(sbm_array, cpu);
> >> +
> >> +       printk("WYT: sbm of cpu %d\n", cpu);
> >> +
> >> +       for (type = 0; type < SBM_MAX_TYPE; type++) {
> >> +               if (type == SBM_EXEC_TYPE)
> >> +                       printk("WYT: \t exec map\n");
> >> +               else if (type == SBM_FORK_TYPE)
> >> +                       printk("WYT: \t fork map\n");
> >> +               else if (type == SBM_WAKE_TYPE)
> >> +                       printk("WYT: \t wake map\n");
> >> +
> >> +               for (level = 0; level < sbm_max_level; level++) {
> >> +                       if (sbm->sd[type][level])
> >> +                               printk("WYT: \t\t sd %x, idx %d, level %d, weight %d\n", sbm->sd[type][level], level, sbm->sd[type][level]->level, sbm->sd[type][level]->span_weight);
> >> +               }
> >> +       }
> >> +
> >> +       printk("WYT: \t affine map\n");
> >> +
> >> +       for_each_possible_cpu(i) {
> >> +               if (sbm->affine_map[i])
> >> +                       printk("WYT: \t\t affine with cpu %x in sd %x, weight %d\n", i, sbm->affine_map[i], sbm->affine_map[i]->span_weight);
> >> +       }
> >> +}
> >> +
> >>  static void build_sched_balance_map(int cpu)
> >>  {
> >>         struct sched_balance_map *sbm = &per_cpu(sbm_array, cpu);
> >> @@ -5688,6 +5717,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
> >>          * destroy_sched_domains() already do the work.
> >>          */
> >>         build_sched_balance_map(cpu);
> >> +       debug_sched_balance_map(cpu);
> >>         rcu_assign_pointer(rq->sbm, sbm);
> >>  }
> >>
> > 
> > 
> 



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-22  8:56                                   ` Michael Wang
  2013-01-22 11:34                                     ` Mike Galbraith
@ 2013-01-22 14:41                                     ` Mike Galbraith
  2013-01-23  2:44                                       ` Michael Wang
  1 sibling, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2013-01-22 14:41 UTC (permalink / raw)
  To: Michael Wang; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On Tue, 2013-01-22 at 16:56 +0800, Michael Wang wrote:

> What about this patch? May be the wrong map is the killer on balance
> path, should we check it? ;-) 

[    1.232249] Brought up 40 CPUs
[    1.236003] smpboot: Total of 40 processors activated (180873.90 BogoMIPS)
[    1.244744] CPU0 attaching sched-domain:
[    1.254131] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
[    1.252010]  domain 0: span 0,16 level SIBLING
[    1.280001]   groups: 0 (cpu_power = 589) 16 (cpu_power = 589)
[    1.292540]   domain 1: span 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38 level MC
[    1.312001]    groups: 0,16 (cpu_power = 1178) 2,18 (cpu_power = 1178) 4,20 (cpu_power = 1178) 6,22 (cpu_power = 1178) 8,24 (cpu_power = 1178)
                         10,26 (cpu_power = 1178)12,28 (cpu_power = 1178)14,30 (cpu_power = 1178)32,36 (cpu_power = 1178)34,38 (cpu_power = 1178)
[    1.368002]    domain 2: span 0-39 level NUMA
[    1.376001]     groups: 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38 (cpu_power = 11780)
                           1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39 (cpu_power = 11780)
[    1.412546] WYT: sbm of cpu 0
[    1.416001] WYT:      exec map
[    1.424002] WYT:              sd 6ce55000, idx 0, level 0, weight 2
[    1.436001] WYT:              sd 6ce74000, idx 1, level 1, weight 20
[    1.448001] WYT:              sd 6cef3000, idx 3, level 3, weight 40
[    1.460001] WYT:      fork map
[    1.468001] WYT:              sd 6ce55000, idx 0, level 0, weight 2
[    1.480001] WYT:              sd 6ce74000, idx 1, level 1, weight 20
[    1.492001] WYT:              sd 6cef3000, idx 3, level 3, weight 40
[    1.504001] WYT:      wake map
                                 Hi, we're not home right now...
[    1.508001] WYT:      affine map
[    1.516001] WYT:              affine with cpu 0 in sd 6ce55000, weight 2
[    1.528001] WYT:              affine with cpu 1 in sd 6cef3000, weight 40
[    1.544001] WYT:              affine with cpu 2 in sd 6ce74000, weight 20
[    1.556001] WYT:              affine with cpu 3 in sd 6cef3000, weight 40
[    1.568001] WYT:              affine with cpu 4 in sd 6ce74000, weight 20
[    1.584001] WYT:              affine with cpu 5 in sd 6cef3000, weight 40
[    1.596001] WYT:              affine with cpu 6 in sd 6ce74000, weight 20
[    1.608001] WYT:              affine with cpu 7 in sd 6cef3000, weight 40
[    1.624001] WYT:              affine with cpu 8 in sd 6ce74000, weight 20
[    1.636001] WYT:              affine with cpu 9 in sd 6cef3000, weight 40
[    1.648001] WYT:              affine with cpu a in sd 6ce74000, weight 20
[    1.660001] WYT:              affine with cpu b in sd 6cef3000, weight 40
[    1.676001] WYT:              affine with cpu c in sd 6ce74000, weight 20
[    1.688001] WYT:              affine with cpu d in sd 6cef3000, weight 40
[    1.700001] WYT:              affine with cpu e in sd 6ce74000, weight 20
[    1.716001] WYT:              affine with cpu f in sd 6cef3000, weight 40
[    1.728001] WYT:              affine with cpu 10 in sd 6ce55000, weight 2
[    1.740001] WYT:              affine with cpu 11 in sd 6cef3000, weight 40
[    1.756001] WYT:              affine with cpu 12 in sd 6ce74000, weight 20
[    1.768001] WYT:              affine with cpu 13 in sd 6cef3000, weight 40
[    1.780001] WYT:              affine with cpu 14 in sd 6ce74000, weight 20
[    1.796001] WYT:              affine with cpu 15 in sd 6cef3000, weight 40
[    1.808001] WYT:              affine with cpu 16 in sd 6ce74000, weight 20
[    1.820001] WYT:              affine with cpu 17 in sd 6cef3000, weight 40
[    1.836001] WYT:              affine with cpu 18 in sd 6ce74000, weight 20
[    1.848001] WYT:              affine with cpu 19 in sd 6cef3000, weight 40
[    1.860001] WYT:              affine with cpu 1a in sd 6ce74000, weight 20
[    1.876001] WYT:              affine with cpu 1b in sd 6cef3000, weight 40
[    1.888001] WYT:              affine with cpu 1c in sd 6ce74000, weight 20
[    1.900001] WYT:              affine with cpu 1d in sd 6cef3000, weight 40
[    1.916001] WYT:              affine with cpu 1e in sd 6ce74000, weight 20
[    1.928001] WYT:              affine with cpu 1f in sd 6cef3000, weight 40
[    1.940001] WYT:              affine with cpu 20 in sd 6ce74000, weight 20
[    1.956001] WYT:              affine with cpu 21 in sd 6cef3000, weight 40
[    1.968001] WYT:              affine with cpu 22 in sd 6ce74000, weight 20
[    1.984001] WYT:              affine with cpu 23 in sd 6cef3000, weight 40
[    1.996001] WYT:              affine with cpu 24 in sd 6ce74000, weight 20
[    2.008002] WYT:              affine with cpu 25 in sd 6cef3000, weight 40
[    2.024002] WYT:              affine with cpu 26 in sd 6ce74000, weight 20
[    2.036001] WYT:              affine with cpu 27 in sd 6cef3000, weight 40


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-22 14:41                                     ` Mike Galbraith
@ 2013-01-23  2:44                                       ` Michael Wang
  2013-01-23  4:31                                         ` Mike Galbraith
  0 siblings, 1 reply; 57+ messages in thread
From: Michael Wang @ 2013-01-23  2:44 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On 01/22/2013 10:41 PM, Mike Galbraith wrote:
> On Tue, 2013-01-22 at 16:56 +0800, Michael Wang wrote:
> 
>> What about this patch? May be the wrong map is the killer on balance
>> path, should we check it? ;-) 
> 
> [    1.232249] Brought up 40 CPUs
> [    1.236003] smpboot: Total of 40 processors activated (180873.90 BogoMIPS)
> [    1.244744] CPU0 attaching sched-domain:
> [    1.254131] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
> [    1.252010]  domain 0: span 0,16 level SIBLING
> [    1.280001]   groups: 0 (cpu_power = 589) 16 (cpu_power = 589)
> [    1.292540]   domain 1: span 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38 level MC
> [    1.312001]    groups: 0,16 (cpu_power = 1178) 2,18 (cpu_power = 1178) 4,20 (cpu_power = 1178) 6,22 (cpu_power = 1178) 8,24 (cpu_power = 1178)
>                          10,26 (cpu_power = 1178)12,28 (cpu_power = 1178)14,30 (cpu_power = 1178)32,36 (cpu_power = 1178)34,38 (cpu_power = 1178)
> [    1.368002]    domain 2: span 0-39 level NUMA
> [    1.376001]     groups: 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38 (cpu_power = 11780)
>                            1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39 (cpu_power = 11780)

Thanks for the testing; that's not all of the output, just the part for
cpu 0, correct?

> [    1.412546] WYT: sbm of cpu 0
> [    1.416001] WYT:      exec map
> [    1.424002] WYT:              sd 6ce55000, idx 0, level 0, weight 2
> [    1.436001] WYT:              sd 6ce74000, idx 1, level 1, weight 20
> [    1.448001] WYT:              sd 6cef3000, idx 3, level 3, weight 40
> [    1.460001] WYT:      fork map
> [    1.468001] WYT:              sd 6ce55000, idx 0, level 0, weight 2
> [    1.480001] WYT:              sd 6ce74000, idx 1, level 1, weight 20

This is not by design... the sd at idx 2 should point to the level 1 sd
if there is no level 2 sd, so this part is broken... oh, how could a
level 3 sd be there without level 2 being created? strange...

So with this map, the new balance path is no doubt broken; I think we've
got the reason, amazing ;-)

Let's see how to fix it, hmm... it needs some study first; a first rough
idea is sketched below.
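
If the rule really is "an empty idx should fall back to the nearest
lower level that exists", then build_sched_balance_map() could plug the
holes with something like the untested sketch below (SBM_MAX_TYPE,
sbm_max_level and sbm->sd[][] as in the debug patch); I still need to
check this against the actual idx/level assignment:

	{
		struct sched_domain *last;
		int type, idx;

		for (type = 0; type < SBM_MAX_TYPE; type++) {
			last = NULL;
			for (idx = 0; idx < sbm_max_level; idx++) {
				if (sbm->sd[type][idx])
					last = sbm->sd[type][idx];
				else	/* hole: reuse the nearest lower level */
					sbm->sd[type][idx] = last;
			}
		}
	}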

Regards,
Michael Wang

> [    1.492001] WYT:              sd 6cef3000, idx 3, level 3, weight 40
> [    1.504001] WYT:      wake map
>                                  Hi, we're not home right now...
> [    1.508001] WYT:      affine map
> [    1.516001] WYT:              affine with cpu 0 in sd 6ce55000, weight 2
> [    1.528001] WYT:              affine with cpu 1 in sd 6cef3000, weight 40
> [    1.544001] WYT:              affine with cpu 2 in sd 6ce74000, weight 20
> [    1.556001] WYT:              affine with cpu 3 in sd 6cef3000, weight 40
> [    1.568001] WYT:              affine with cpu 4 in sd 6ce74000, weight 20
> [    1.584001] WYT:              affine with cpu 5 in sd 6cef3000, weight 40
> [    1.596001] WYT:              affine with cpu 6 in sd 6ce74000, weight 20
> [    1.608001] WYT:              affine with cpu 7 in sd 6cef3000, weight 40
> [    1.624001] WYT:              affine with cpu 8 in sd 6ce74000, weight 20
> [    1.636001] WYT:              affine with cpu 9 in sd 6cef3000, weight 40
> [    1.648001] WYT:              affine with cpu a in sd 6ce74000, weight 20
> [    1.660001] WYT:              affine with cpu b in sd 6cef3000, weight 40
> [    1.676001] WYT:              affine with cpu c in sd 6ce74000, weight 20
> [    1.688001] WYT:              affine with cpu d in sd 6cef3000, weight 40
> [    1.700001] WYT:              affine with cpu e in sd 6ce74000, weight 20
> [    1.716001] WYT:              affine with cpu f in sd 6cef3000, weight 40
> [    1.728001] WYT:              affine with cpu 10 in sd 6ce55000, weight 2
> [    1.740001] WYT:              affine with cpu 11 in sd 6cef3000, weight 40
> [    1.756001] WYT:              affine with cpu 12 in sd 6ce74000, weight 20
> [    1.768001] WYT:              affine with cpu 13 in sd 6cef3000, weight 40
> [    1.780001] WYT:              affine with cpu 14 in sd 6ce74000, weight 20
> [    1.796001] WYT:              affine with cpu 15 in sd 6cef3000, weight 40
> [    1.808001] WYT:              affine with cpu 16 in sd 6ce74000, weight 20
> [    1.820001] WYT:              affine with cpu 17 in sd 6cef3000, weight 40
> [    1.836001] WYT:              affine with cpu 18 in sd 6ce74000, weight 20
> [    1.848001] WYT:              affine with cpu 19 in sd 6cef3000, weight 40
> [    1.860001] WYT:              affine with cpu 1a in sd 6ce74000, weight 20
> [    1.876001] WYT:              affine with cpu 1b in sd 6cef3000, weight 40
> [    1.888001] WYT:              affine with cpu 1c in sd 6ce74000, weight 20
> [    1.900001] WYT:              affine with cpu 1d in sd 6cef3000, weight 40
> [    1.916001] WYT:              affine with cpu 1e in sd 6ce74000, weight 20
> [    1.928001] WYT:              affine with cpu 1f in sd 6cef3000, weight 40
> [    1.940001] WYT:              affine with cpu 20 in sd 6ce74000, weight 20
> [    1.956001] WYT:              affine with cpu 21 in sd 6cef3000, weight 40
> [    1.968001] WYT:              affine with cpu 22 in sd 6ce74000, weight 20
> [    1.984001] WYT:              affine with cpu 23 in sd 6cef3000, weight 40
> [    1.996001] WYT:              affine with cpu 24 in sd 6ce74000, weight 20
> [    2.008002] WYT:              affine with cpu 25 in sd 6cef3000, weight 40
> [    2.024002] WYT:              affine with cpu 26 in sd 6ce74000, weight 20
> [    2.036001] WYT:              affine with cpu 27 in sd 6cef3000, weight 40


> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-22 11:34                                     ` Mike Galbraith
@ 2013-01-23  3:01                                       ` Michael Wang
  2013-01-23  5:02                                         ` Mike Galbraith
  0 siblings, 1 reply; 57+ messages in thread
From: Michael Wang @ 2013-01-23  3:01 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On 01/22/2013 07:34 PM, Mike Galbraith wrote:
> On Tue, 2013-01-22 at 16:56 +0800, Michael Wang wrote: 
>> On 01/22/2013 04:03 PM, Mike Galbraith wrote:
>> [snip]
>>> ... 
>>>>>
>>>>> That was with your change backed out, and the q/d below applied.
>>>>
>>>> So that change will help to solve the issue? good to know :)
>>>>
>>>> But it will invoke wake_affine() with out any delay, the benefit
>>>> of the patch set will be reduced a lot...
>>>
>>> Yeah, I used size large hammer.
>>>
>>>> I think this change help to solve the issue because it avoid jump
>>>> into balance path when wakeup for any cases, I think we can do
>>>> some change like below to achieve this and meanwhile gain benefit
>>>> from delay wake_affine().
>>>
>>> Yup, I killed it all the way dead.  I'll see what below does.
>>>
>>> I don't really see the point of the wake_affine() change in this set
>>> though.  Its purpose is to decide if a pull is ok or not.  If we don't
>>> need its opinion when we look for an (momentarily?) idle core in
>>> this_domain, we shouldn't need it at all, and could just delete it.
>>
>> I have a question here, so wake_affine() is:
>> A. check whether it is balance to pull.
>> B. check whether it's better to pull than not.
> 
> A, "is it ok to move this guy to where red hot data awaits" is the way
> it has always been used, ie "can we move him here without upsetting
> balance too much", with sync hint meaning the waker is likely going to
> sleep very soon, so we pretend he's already gone when looking at load.
> That sync hint btw doesn't have anywhere near as much real meaning as
> would be nice to have.

Agree.

> 
>> I suppose it's A, so my logical is:
>> 1. find idle cpu in prev domain.
>> 2. if failed and affine, find idle cpu in current domain.
> 
> Hm.  If cpu and prev_cpu are cache affine, you already searched both.
> 

Well, that's true if "affine cpus" means their sd topologies are always
the same, but do we have a guarantee of that?

>> 3. if find idle cpu in current domain, check whether it is balance to
>> pull by wake_affine().
>> 4. if all failed, two choice, go to balance path or directly return
>> prev_cpu.
>>
>> So I still need wake_affine() for a final check, but to be honest, I
>> really doubt about whether it worth to care about balance while waking
>> up, if the task just run several ns then sleep again, it's totally
>> worthless...
> 
> Not totally worthless, but questionable yes.  It really matters most
> when wakee was way over there in cache foo for whatever reason, has no
> big footprint, and red hot data is waiting here in cache bar.  Light
> tasks migrating helps communicating buddies find each other and perform.
> 
> The new NUMA stuff will help heavy tasks, but it won't help with light
> tasks that could benefit by moving to the data.  We currently try to
> migrate on wakeup, if we do stop doing that, we may get hurt more often
> than not, dunno.  Benchmarks will tell.

Agreed; for this patch set, until we get proof that balancing at wakeup
is worse than not doing it, I will choose to do it.  Well, I think the
answer won't be so easy: we need some number to show when a task is
worth balancing at wakeup, and I even doubt whether it is possible to
have such a number...

> 
>>> If we ever enter balance_path, we can't possibly induce imbalance without
>>> there being something broken in that path, no?
>>
>> So your opinion is, some thing broken in the new balance path?
> 
> I don't know that, but it is a logical bug candidate.

And you are right, I think I got something from the debug info you
showed, thanks again for that :)

> 
>>> BTW, it could well be that an unpatched kernel will collapse as well if
>>> WAKE_BALANCE is turned on.  I've never tried that on a largish box, as
>>> doing any of the wakeup time optional stuff used to make tbench scream.
>>>
>>>> Since the issue could not been reproduced on my side, I don't know
>>>> whether the patch benefit or not, so if you are willing to send out
>>>> a formal patch, I would be glad to include it in my patch set ;-)
>>>
>>> Just changing to scan prev_cpu before considering pulling would put a
>>> big dent in the bouncing cow problem, but that's the intriguing thing
>>> about this set.. 
>>
>> So that's my first question, if wake_affine() return 1 means it's better
>> to pull than not, then the new way may be harmful, but if it's just told
>> us, pull won't break the balance, then I still think, current domain is
>> just a backup, not the candidate of first choice.
> 
> wake_affine() doesn't know if it'll be a good or bad move, it only says
> go for it, load numbers are within parameters.
> 
>>> can we have the tbench and pgbench big box gain without
>>> a lot of pain to go with it?  Small boxen will surely benefit, pretty
>>> much can't be hurt, but what about all those fast/light tasks that won't
>>> hop across nodes to red hot data?
>>
>> I don't get it... a task won't hop means a task always been selected to
>> run on prev_cpu?
> 
> Yes.  If wakee is light, has no large footprint to later have to drag to
> its new home, moving it to the hot data is a win.
> 
>> We will assign idle cpu if we found, but if not, we can use prev_cpu or
>> go to balance path and find one, so what's the problem here?
> 
> Wakeup latency may be low, but the task can still perform badly due to
> misses.  In the tbench case, cross node data misses aren't anywhere near
> as bad as the _everything_ is a miss you get from waker/wakee bouncing
> all over a single shared L3. 

Hmm... that seems like another point which can only be decided by
benchmarks.

Regards,
Michael Wang

> 
>>> No formal patch is likely to result from any testing I do atm at least.
>>> I'm testing your patches because I see potential, I really want it to
>>> work out, but have to see it do that with my own two beady eyeballs ;-)
>>
>> Got it.
>>
>>>
>>>> And another patch below below is a debug one, which will print out
>>>> all the sbm info, so we could check whether it was initialized
>>>> correctly, just use command "dmesg | grep WYT" to show the map.
>>
>> What about this patch? May be the wrong map is the killer on balance
>> path, should we check it? ;-)
> 
> Yeah,I haven't actually looked for any booboos, just ran it straight out
> of the box ;-)
> 
>> Regards,
>> Michael Wang
>>
>>>>
>>>> Regards,
>>>> Michael Wang
>>>>
>>>> ---
>>>>  kernel/sched/fair.c |   42 +++++++++++++++++++++++++-----------------
>>>>  1 files changed, 25 insertions(+), 17 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 2aa26c1..4361333 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -3250,7 +3250,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>>>>  }
>>>>
>>>>  /*
>>>> - * Try and locate an idle CPU in the sched_domain.
>>>> + * Try and locate an idle CPU in the sched_domain, return -1 if failed.
>>>>   */
>>>>  static int select_idle_sibling(struct task_struct *p, int target)
>>>>  {
>>>> @@ -3292,13 +3292,13 @@ static int select_idle_sibling(struct task_struct *p, int target)
>>>>
>>>>                         target = cpumask_first_and(sched_group_cpus(sg),
>>>>                                         tsk_cpus_allowed(p));
>>>> -                       goto done;
>>>> +                       return target;
>>>>  next:
>>>>                         sg = sg->next;
>>>>                 } while (sg != sd->groups);
>>>>         }
>>>> -done:
>>>> -       return target;
>>>> +
>>>> +       return -1;
>>>>  }
>>>>
>>>>  /*
>>>> @@ -3342,40 +3342,48 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>>>>                  * may has already been cached on prev_cpu, and usually
>>>>                  * they require low latency.
>>>>                  *
>>>> -                * So firstly try to locate an idle cpu shared the cache
>>>> +                * Therefore, the balance path in such a case will cause damage
>>>> +                * and bring benefit at the same time; a wakeup on prev_cpu
>>>> +                * may be better than a wakeup on a new lower-load cpu for the
>>>> +                * sake of the cached memory, and we never know.
>>>> +                *
>>>> +                * So the principle is, try to find an idle cpu as close to
>>>> +                * prev_cpu as possible, if failed, just take prev_cpu.
>>>> +                *
>>>> +                * Firstly try to locate an idle cpu shared the cache
>>>>                  * with prev_cpu, it has the chance to break the load
>>>>                  * balance, fortunately, select_idle_sibling() will search
>>>>                  * from top to bottom, which help to reduce the chance in
>>>>                  * some cases.
>>>>                  */
>>>>                 new_cpu = select_idle_sibling(p, prev_cpu);
>>>> -               if (idle_cpu(new_cpu))
>>>> +               if (new_cpu != -1)
>>>>                         goto unlock;
>>>>
>>>>                 /*
>>>>                  * No idle cpu could be found in the topology of prev_cpu,
>>>> -                * before jump into the slow balance_path, try search again
>>>> -                * in the topology of current cpu if it is the affine of
>>>> -                * prev_cpu.
>>>> +                * before take the prev_cpu, try search again in the
>>>> +                * topology of current cpu if it is the affine of prev_cpu.
>>>>                  */
>>>> -               if (cpu == prev_cpu ||
>>>> -                               !sbm->affine_map[prev_cpu] ||
>>>> +               if (cpu == prev_cpu || !sbm->affine_map[prev_cpu] ||
>>>>                                 !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
>>>> -                       goto balance_path;
>>>> +                       goto take_prev;
>>>>
>>>>                 new_cpu = select_idle_sibling(p, cpu);
>>>> -               if (!idle_cpu(new_cpu))
>>>> -                       goto balance_path;
>>>> -
>>>>                 /*
>>>>                  * Invoke wake_affine() finally since it is no doubt a
>>>>                  * performance killer.
>>>>                  */
>>>> -               if (wake_affine(sbm->affine_map[prev_cpu], p, sync))
>>>> +               if ((new_cpu != -1) &&
>>>> +                       wake_affine(sbm->affine_map[prev_cpu], p, sync))
>>>>                         goto unlock;
>>>> +
>>>> +take_prev:
>>>> +               new_cpu = prev_cpu;
>>>> +               goto unlock;
>>>>         }
>>>>
>>>> -balance_path:
>>>> +       /* Balance path. */
>>>>         new_cpu = (sd_flag & SD_BALANCE_WAKE) ? prev_cpu : cpu;
>>>>         sd = sbm->sd[type][sbm->top_level[type]];
>>>>
>>>> -- 
>>>> 1.7.4.1
>>>>
>>>> DEBUG PATCH:
>>>>
>>>> ---
>>>>  kernel/sched/core.c |   30 ++++++++++++++++++++++++++++++
>>>>  1 files changed, 30 insertions(+), 0 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index 0c63303..f251f29 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -5578,6 +5578,35 @@ static void update_top_cache_domain(int cpu)
>>>>  static int sbm_max_level;
>>>>  DEFINE_PER_CPU_SHARED_ALIGNED(struct sched_balance_map, sbm_array);
>>>>
>>>> +static void debug_sched_balance_map(int cpu)
>>>> +{
>>>> +       int i, type, level = 0;
>>>> +       struct sched_balance_map *sbm = &per_cpu(sbm_array, cpu);
>>>> +
>>>> +       printk("WYT: sbm of cpu %d\n", cpu);
>>>> +
>>>> +       for (type = 0; type < SBM_MAX_TYPE; type++) {
>>>> +               if (type == SBM_EXEC_TYPE)
>>>> +                       printk("WYT: \t exec map\n");
>>>> +               else if (type == SBM_FORK_TYPE)
>>>> +                       printk("WYT: \t fork map\n");
>>>> +               else if (type == SBM_WAKE_TYPE)
>>>> +                       printk("WYT: \t wake map\n");
>>>> +
>>>> +               for (level = 0; level < sbm_max_level; level++) {
>>>> +                       if (sbm->sd[type][level])
>>>> +                               printk("WYT: \t\t sd %x, idx %d, level %d, weight %d\n", sbm->sd[type][level], level, sbm->sd[type][level]->level, sbm->sd[type][level]->span_weight);
>>>> +               }
>>>> +       }
>>>> +
>>>> +       printk("WYT: \t affine map\n");
>>>> +
>>>> +       for_each_possible_cpu(i) {
>>>> +               if (sbm->affine_map[i])
>>>> +                       printk("WYT: \t\t affine with cpu %x in sd %x, weight %d\n", i, sbm->affine_map[i], sbm->affine_map[i]->span_weight);
>>>> +       }
>>>> +}
>>>> +
>>>>  static void build_sched_balance_map(int cpu)
>>>>  {
>>>>         struct sched_balance_map *sbm = &per_cpu(sbm_array, cpu);
>>>> @@ -5688,6 +5717,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
>>>>          * destroy_sched_domains() already do the work.
>>>>          */
>>>>         build_sched_balance_map(cpu);
>>>> +       debug_sched_balance_map(cpu);
>>>>         rcu_assign_pointer(rq->sbm, sbm);
>>>>  }
>>>>
>>>
>>>
>>
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-23  2:44                                       ` Michael Wang
@ 2013-01-23  4:31                                         ` Mike Galbraith
  2013-01-23  5:09                                           ` Michael Wang
  0 siblings, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2013-01-23  4:31 UTC (permalink / raw)
  To: Michael Wang; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On Wed, 2013-01-23 at 10:44 +0800, Michael Wang wrote: 
> On 01/22/2013 10:41 PM, Mike Galbraith wrote:
> > On Tue, 2013-01-22 at 16:56 +0800, Michael Wang wrote:
> > 
> >> What about this patch? May be the wrong map is the killer on balance
> >> path, should we check it? ;-) 
> > 
> > [    1.232249] Brought up 40 CPUs
> > [    1.236003] smpboot: Total of 40 processors activated (180873.90 BogoMIPS)
> > [    1.244744] CPU0 attaching sched-domain:
> > [    1.254131] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
> > [    1.252010]  domain 0: span 0,16 level SIBLING
> > [    1.280001]   groups: 0 (cpu_power = 589) 16 (cpu_power = 589)
> > [    1.292540]   domain 1: span 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38 level MC
> > [    1.312001]    groups: 0,16 (cpu_power = 1178) 2,18 (cpu_power = 1178) 4,20 (cpu_power = 1178) 6,22 (cpu_power = 1178) 8,24 (cpu_power = 1178)
> >                          10,26 (cpu_power = 1178)12,28 (cpu_power = 1178)14,30 (cpu_power = 1178)32,36 (cpu_power = 1178)34,38 (cpu_power = 1178)
> > [    1.368002]    domain 2: span 0-39 level NUMA
> > [    1.376001]     groups: 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38 (cpu_power = 11780)
> >                            1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39 (cpu_power = 11780)
> 
> Thanks for the testing, that's not all the output but just for cpu 0,
> correct?

Yeah, I presumed one was enough.  You can have more if you like, there's
LOTS more where that came from (reboot is amazing with low speed serial
console -> high latency low bandwidth DSL connection;). 

> > [    1.412546] WYT: sbm of cpu 0
> > [    1.416001] WYT:      exec map
> > [    1.424002] WYT:              sd 6ce55000, idx 0, level 0, weight 2
> > [    1.436001] WYT:              sd 6ce74000, idx 1, level 1, weight 20
> > [    1.448001] WYT:              sd 6cef3000, idx 3, level 3, weight 40
> > [    1.460001] WYT:      fork map
> > [    1.468001] WYT:              sd 6ce55000, idx 0, level 0, weight 2
> > [    1.480001] WYT:              sd 6ce74000, idx 1, level 1, weight 20
> 
> This is not by design... sd in idx 2 should point to level 1 sd if there
> is no level 2 sd, this part is broken...oh, how could level 3 sd be
there without level 2 created? strange...
> 
So with this map, the new balance path will no doubt be broken, I think we
> got the reason, amazing ;-)
> 
> Let's see how to fix it, hmm... need some study firstly.

Another thing that wants fixing: root can set flags for _existing_
domains any way he likes, but when he invokes godly powers to rebuild
domains, he gets what's hard coded, which is neither clever (godly
wrath;), nor wonderful for godly runtime path decisions.

-Mike


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-23  3:01                                       ` Michael Wang
@ 2013-01-23  5:02                                         ` Mike Galbraith
  0 siblings, 0 replies; 57+ messages in thread
From: Mike Galbraith @ 2013-01-23  5:02 UTC (permalink / raw)
  To: Michael Wang; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On Wed, 2013-01-23 at 11:01 +0800, Michael Wang wrote: 
> On 01/22/2013 07:34 PM, Mike Galbraith wrote:
>  I suppose it's A, so my logical is:
> >> 1. find idle cpu in prev domain.
> >> 2. if failed and affine, find idle cpu in current domain.
> > 
> > Hm.  If cpu and prev_cpu are cache affine, you already searched both.
> > 
> 
> Well, it's true if affine cpus means their sd topology are always same,
> but do we have a promise on it?

Ignore that, I think apple/orange communication happened.

-Mike




^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-23  4:31                                         ` Mike Galbraith
@ 2013-01-23  5:09                                           ` Michael Wang
  2013-01-23  6:28                                             ` Mike Galbraith
  0 siblings, 1 reply; 57+ messages in thread
From: Michael Wang @ 2013-01-23  5:09 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On 01/23/2013 12:31 PM, Mike Galbraith wrote:
> On Wed, 2013-01-23 at 10:44 +0800, Michael Wang wrote: 
>> On 01/22/2013 10:41 PM, Mike Galbraith wrote:
>>> On Tue, 2013-01-22 at 16:56 +0800, Michael Wang wrote:
>>>
>>>> What about this patch? May be the wrong map is the killer on balance
>>>> path, should we check it? ;-) 
>>>
>>> [    1.232249] Brought up 40 CPUs
>>> [    1.236003] smpboot: Total of 40 processors activated (180873.90 BogoMIPS)
>>> [    1.244744] CPU0 attaching sched-domain:
>>> [    1.254131] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
>>> [    1.252010]  domain 0: span 0,16 level SIBLING
>>> [    1.280001]   groups: 0 (cpu_power = 589) 16 (cpu_power = 589)
>>> [    1.292540]   domain 1: span 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38 level MC
>>> [    1.312001]    groups: 0,16 (cpu_power = 1178) 2,18 (cpu_power = 1178) 4,20 (cpu_power = 1178) 6,22 (cpu_power = 1178) 8,24 (cpu_power = 1178)
>>>                          10,26 (cpu_power = 1178)12,28 (cpu_power = 1178)14,30 (cpu_power = 1178)32,36 (cpu_power = 1178)34,38 (cpu_power = 1178)
>>> [    1.368002]    domain 2: span 0-39 level NUMA
>>> [    1.376001]     groups: 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38 (cpu_power = 11780)
>>>                            1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39 (cpu_power = 11780)
>>
>> Thanks for the testing, that's not all the output but just for cpu 0,
>> correct?
> 
> Yeah, I presumed one was enough.  You can have more if you like, there's
> LOTS more where that came from (reboot is amazing with low speed serial
> console -> high latency low bandwidth DSL connection;). 
> 
>>> [    1.412546] WYT: sbm of cpu 0
>>> [    1.416001] WYT:      exec map
>>> [    1.424002] WYT:              sd 6ce55000, idx 0, level 0, weight 2
>>> [    1.436001] WYT:              sd 6ce74000, idx 1, level 1, weight 20
>>> [    1.448001] WYT:              sd 6cef3000, idx 3, level 3, weight 40
>>> [    1.460001] WYT:      fork map
>>> [    1.468001] WYT:              sd 6ce55000, idx 0, level 0, weight 2
>>> [    1.480001] WYT:              sd 6ce74000, idx 1, level 1, weight 20
>>
>> This is not by design... sd in idx 2 should point to level 1 sd if there
>> is no level 2 sd, this part is broken...oh, how could level 3 sd be
>> there without level 2 created? strange...
>>
>> So with this map, the new balance path will no doubt be broken, I think we
>> got the reason, amazing ;-)
>>
>> Let's see how to fix it, hmm... need some study firstly.
> 
> Another thing that wants fixing: root can set flags for _existing_
> domains any way he likes,

Can he? Changing the domain flags at runtime? I do remember I used to
send out some patches to achieve that, but they were refused since it's dangerous...

but when he invokes godly powers to rebuild
> domains, he gets what's hard coded, which is neither clever (godly
> wrath;), nor wonderful for godly runtime path decisions.

The purpose is to use a map to describe the sd topology of a cpu; it
should be rebuilt correctly according to the new topology when attaching
a new domain to a cpu.

For this case, it's really strange that level 2 was missed in topology,
I found that in build_sched_domains(), the level was added one by one,
and I don't know why it jumps here...sounds like some BUG to me.

Whatever, the sbm should still work properly as designed, even in such
a strange topology, if it's initialized correctly.

And below patch will do help on it, just based on the original patch set.

Could you please take a try on it, it's supposed to make the balance path
work correctly, and please apply the DEBUG patch below too, so we could know how it
changes, I think this time, we may be able to solve the issue by the right
way ;-)

Regards,
Michael Wang

---
 kernel/sched/core.c |    9 +++------
 1 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0c63303..c2a13bc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5582,7 +5582,6 @@ static void build_sched_balance_map(int cpu)
 {
        struct sched_balance_map *sbm = &per_cpu(sbm_array, cpu);
        struct sched_domain *sd = cpu_rq(cpu)->sd;
-       struct sched_domain *top_sd = NULL;
        int i, type, level = 0;

        memset(sbm->top_level, 0, sizeof((*sbm).top_level));
@@ -5625,11 +5624,9 @@ static void build_sched_balance_map(int cpu)
         * fill the hole to get lower level sd easily.
         */
        for (type = 0; type < SBM_MAX_TYPE; type++) {
-               level = sbm->top_level[type];
-               top_sd = sbm->sd[type][level];
-               if ((++level != sbm_max_level) && top_sd) {
-                       for (; level < sbm_max_level; level++)
-                               sbm->sd[type][level] = top_sd;
+               for (level = 1; level < sbm_max_level; level++) {
+                       if (!sbm->sd[type][level])
+                               sbm->sd[type][level] = sbm->sd[type][level - 1];
                }
        }
 }
-- 
1.7.4.1
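
In case the intent isn't obvious from the diff, here is a stand-alone
sketch of the same "fill the holes downward" idea (plain user-space C,
assuming four levels with the idx 2 slot missing, like the exec map on
your box with idx 0, 1 and 3 present):

#include <stdio.h>

int main(void)
{
	/* levels 0, 1 and 3 present, level 2 missing, as in your dmesg */
	const char *sd[4] = { "SIBLING", "MC", NULL, "NUMA" };
	int level;

	/* same loop as in the patch above: any hole inherits the
	 * next lower level, so a lookup never hits a NULL slot */
	for (level = 1; level < 4; level++) {
		if (!sd[level])
			sd[level] = sd[level - 1];
	}

	for (level = 0; level < 4; level++)
		printf("idx %d -> %s\n", level, sd[level]);

	return 0;
}

With that, idx 2 falls back to the MC domain, so on your box the debug
output should show an idx 2 entry with weight 20 instead of jumping
straight from idx 1 to idx 3.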



DEBUG PATCH:

---
 kernel/sched/core.c |   30 ++++++++++++++++++++++++++++++
 1 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c2a13bc..7c6c736 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5578,6 +5578,35 @@ static void update_top_cache_domain(int cpu)
 static int sbm_max_level;
 DEFINE_PER_CPU_SHARED_ALIGNED(struct sched_balance_map, sbm_array);

+static void debug_sched_balance_map(int cpu)
+{
+       int i, type, level = 0;
+       struct sched_balance_map *sbm = &per_cpu(sbm_array, cpu);
+
+       printk("WYT: sbm of cpu %d\n", cpu);
+
+       for (type = 0; type < SBM_MAX_TYPE; type++) {
+               if (type == SBM_EXEC_TYPE)
+                       printk("WYT: \t exec map\n");
+               else if (type == SBM_FORK_TYPE)
+                       printk("WYT: \t fork map\n");
+               else if (type == SBM_WAKE_TYPE)
+                       printk("WYT: \t wake map\n");
+
+               for (level = 0; level < sbm_max_level; level++) {
+                       if (sbm->sd[type][level])
+                               printk("WYT: \t\t sd %x, idx %d, level %d, weight %d\n", sbm->sd[type][level], level, sbm->sd[type][level]->level, sbm->sd[type][level]->span_weight);
+               }
+       }
+
+       printk("WYT: \t affine map\n");
+
+       for_each_possible_cpu(i) {
+               if (sbm->affine_map[i])
+                       printk("WYT: \t\t affine with cpu %x in sd %x, weight %d\n", i, sbm->affine_map[i], sbm->affine_map[i]->span_weight);
+       }
+}
+
 static void build_sched_balance_map(int cpu)
 {
        struct sched_balance_map *sbm = &per_cpu(sbm_array, cpu);
@@ -5685,6 +5714,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
         * destroy_sched_domains() already do the work.
         */
        build_sched_balance_map(cpu);
+       debug_sched_balance_map(cpu);
        rcu_assign_pointer(rq->sbm, sbm);
 }

-- 
1.7.4.1




> 
> -Mike
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-23  5:09                                           ` Michael Wang
@ 2013-01-23  6:28                                             ` Mike Galbraith
  2013-01-23  7:10                                               ` Michael Wang
  0 siblings, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2013-01-23  6:28 UTC (permalink / raw)
  To: Michael Wang; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

[-- Attachment #1: Type: text/plain, Size: 5162 bytes --]

On Wed, 2013-01-23 at 13:09 +0800, Michael Wang wrote: 
> On 01/23/2013 12:31 PM, Mike Galbraith wrote:

> > Another thing that wants fixing: root can set flags for _existing_
> > domains any way he likes,
> 
> Can he? Changing the domain flags at runtime? I do remember I used to
> send out some patches to achieve that, but they were refused since it's dangerous...

Yes, flags can be set any way you like, which works just fine when flags
are evaluated at runtime.

WRT dangerous: if root says "Let there be stupidity", stupidity should
appear immediately :)

> but when he invokes godly powers to rebuild
> > domains, he gets what's hard coded, which is neither clever (godly
> > wrath;), nor wonderful for godly runtime path decisions.
> 
> The purpose is to use a map to describe the sd topology of a cpu; it
> should be rebuilt correctly according to the new topology when attaching
> a new domain to a cpu.

Try turning FORK/EXEC/WAKE on/off.

echo [01] > [cpuset]/sched_load_balance will rebuild, but resulting
domains won't reflect your flag change. 

> For this case, it's really strange that level 2 was missed in topology,
> I found that in build_sched_domains(), the level was added one by one,
> and I don't know why it jumps here...sounds like some BUG to me.
> 
> Whatever, the sbm should still work properly as designed, even in such
> a strange topology, if it's initialized correctly.
> 
> And below patch will do help on it, just based on the original patch set.
> 
> Could you please take a try on it, it's supposed to make the balance path
> correctly, and please apply below DEBUG patch too, so we could know how it
> changes, I think this time, we may be able to solve the issue by the right
> way ;-)

Done, previous changes backed out, new change applied on top of v2 set.
Full debug output attached.

Domain flags on this box (bogus CPU domain is still patched away).

monteverdi:/abuild/mike/aim7/:[127]# tune-sched-domains
usage: tune-sched-domains <val>
{cpu0/domain0:SIBLING} SD flag: 687
+   1: SD_LOAD_BALANCE:          Do load balancing on this domain
+   2: SD_BALANCE_NEWIDLE:       Balance when about to become idle
+   4: SD_BALANCE_EXEC:          Balance on exec
+   8: SD_BALANCE_FORK:          Balance on fork, clone
-  16: SD_BALANCE_WAKE:          Wake to idle CPU on task wakeup
+  32: SD_WAKE_AFFINE:           Wake task to waking CPU
-  64: SD_PREFER_LOCAL:          Prefer to keep tasks local to this domain
+ 128: SD_SHARE_CPUPOWER:        Domain members share cpu power
- 256: SD_POWERSAVINGS_BALANCE:  Balance for power savings
+ 512: SD_SHARE_PKG_RESOURCES:   Domain members share cpu pkg resources
-1024: SD_SERIALIZE:             Only a single load balancing instance
-2048: SD_ASYM_PACKING:          Place busy groups earlier in the domain
-4096: SD_PREFER_SIBLING:        Prefer to place tasks in a sibling domain
-8192: SD_PREFER_UTILIZATION:    Prefer utilization over SMP nice
{cpu0/domain1:MC} SD flag: 559
+   1: SD_LOAD_BALANCE:          Do load balancing on this domain
+   2: SD_BALANCE_NEWIDLE:       Balance when about to become idle
+   4: SD_BALANCE_EXEC:          Balance on exec
+   8: SD_BALANCE_FORK:          Balance on fork, clone
-  16: SD_BALANCE_WAKE:          Wake to idle CPU on task wakeup
+  32: SD_WAKE_AFFINE:           Wake task to waking CPU
-  64: SD_PREFER_LOCAL:          Prefer to keep tasks local to this domain
- 128: SD_SHARE_CPUPOWER:        Domain members share cpu power
- 256: SD_POWERSAVINGS_BALANCE:  Balance for power savings
+ 512: SD_SHARE_PKG_RESOURCES:   Domain members share cpu pkg resources
-1024: SD_SERIALIZE:             Only a single load balancing instance
-2048: SD_ASYM_PACKING:          Place busy groups earlier in the domain
-4096: SD_PREFER_SIBLING:        Prefer to place tasks in a sibling domain
-8192: SD_PREFER_UTILIZATION:    Prefer utilization over SMP nice
{cpu0/domain2:NUMA} SD flag: 9263
+   1: SD_LOAD_BALANCE:          Do load balancing on this domain
+   2: SD_BALANCE_NEWIDLE:       Balance when about to become idle
+   4: SD_BALANCE_EXEC:          Balance on exec
+   8: SD_BALANCE_FORK:          Balance on fork, clone
-  16: SD_BALANCE_WAKE:          Wake to idle CPU on task wakeup
+  32: SD_WAKE_AFFINE:           Wake task to waking CPU
-  64: SD_PREFER_LOCAL:          Prefer to keep tasks local to this domain
- 128: SD_SHARE_CPUPOWER:        Domain members share cpu power
- 256: SD_POWERSAVINGS_BALANCE:  Balance for power savings
- 512: SD_SHARE_PKG_RESOURCES:   Domain members share cpu pkg resources
+1024: SD_SERIALIZE:             Only a single load balancing instance
-2048: SD_ASYM_PACKING:          Place busy groups earlier in the domain
-4096: SD_PREFER_SIBLING:        Prefer to place tasks in a sibling domain
+8192: SD_PREFER_UTILIZATION:    Prefer utilization over SMP nice
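
The per-domain value is just the sum of the '+' bits, so flipping
SD_BALANCE_WAKE on means writing the value above plus 16 (assuming
tune-sched-domains passes the integer straight through).  A quick
stand-alone check of the arithmetic:

#include <stdio.h>

int main(void)
{
	/* flag values as listed above */
	int sibling = 1 + 2 + 4 + 8 + 32 + 128 + 512;		/* 687 */
	int mc      = 1 + 2 + 4 + 8 + 32 + 512;			/* 559 */
	int numa    = 1 + 2 + 4 + 8 + 32 + 1024 + 8192;		/* 9263 */
	int wake    = 16;					/* SD_BALANCE_WAKE */

	printf("SIBLING %d -> %d\n", sibling, sibling + wake);
	printf("MC      %d -> %d\n", mc, mc + wake);
	printf("NUMA    %d -> %d\n", numa, numa + wake);

	return 0;
}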

Abbreviated test run:
Tasks    jobs/min  jti  jobs/min/task      real       cpu
  640   158044.01   81       246.9438     24.54    577.66   Wed Jan 23 07:14:33 2013
 1280    50434.33   39        39.4018    153.80   5737.57   Wed Jan 23 07:17:07 2013
 2560    47214.07   34        18.4430    328.58  12715.56   Wed Jan 23 07:22:36 2013


[-- Attachment #2: dmesg.gz --]
[-- Type: application/x-gzip, Size: 12112 bytes --]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-23  6:28                                             ` Mike Galbraith
@ 2013-01-23  7:10                                               ` Michael Wang
  2013-01-23  8:20                                                 ` Mike Galbraith
  0 siblings, 1 reply; 57+ messages in thread
From: Michael Wang @ 2013-01-23  7:10 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On 01/23/2013 02:28 PM, Mike Galbraith wrote:
> On Wed, 2013-01-23 at 13:09 +0800, Michael Wang wrote: 
>> On 01/23/2013 12:31 PM, Mike Galbraith wrote:
> 
>>> Another thing that wants fixing: root can set flags for _existing_
>>> domains any way he likes,
>>
>> Can he? Changing the domain flags at runtime? I do remember I used to
>> send out some patches to achieve that, but they were refused since it's dangerous...
> 
> Yes, flags can be set any way you like, which works just fine when flags
> are evaluated at runtime.
> 
> WRT dangerous: if root says "Let there be stupidity", stupidity should
> appear immediately :)
> 
>> but when he invokes godly powers to rebuild
>>> domains, he gets what's hard coded, which is neither clever (godly
>>> wrath;), nor wonderful for godly runtime path decisions.
>>
>> The purpose is to use a map to describe the sd topology of a cpu; it
>> should be rebuilt correctly according to the new topology when attaching
>> a new domain to a cpu.
> 
> Try turning FORK/EXEC/WAKE on/off.
> 
> echo [01] > [cpuset]/sched_load_balance will rebuild, but resulting
> domains won't reflect your flag change. 

Yeah, I've done some tests on it previously, but I failed to enter the
rebuild procedure; need more research on it.

> 
>> For this case, it's really strange that level 2 was missed in topology,
>> I found that in build_sched_domains(), the level was added one by one,
>> and I don't know why it jumps here...sounds like some BUG to me.
>>
>> Whatever, the sbm should still work properly as designed, even in such
>> a strange topology, if it's initialized correctly.
>>
>> And below patch will do help on it, just based on the original patch set.
>>
>> Could you please take a try on it, it's supposed to make the balance path
>> correctly, and please apply below DEBUG patch too, so we could know how it
>> changes, I think this time, we may be able to solve the issue by the right
>> way ;-)
> 
> Done, previous changes backed out, new change applied on top of v2 set.
> Full debug output attached.
> 
> Domain flags on this box (bogus CPU domain is still patched away).
> 
> monteverdi:/abuild/mike/aim7/:[127]# tune-sched-domains
> usage: tune-sched-domains <val>
> {cpu0/domain0:SIBLING} SD flag: 687
> +   1: SD_LOAD_BALANCE:          Do load balancing on this domain
> +   2: SD_BALANCE_NEWIDLE:       Balance when about to become idle
> +   4: SD_BALANCE_EXEC:          Balance on exec
> +   8: SD_BALANCE_FORK:          Balance on fork, clone
> -  16: SD_BALANCE_WAKE:          Wake to idle CPU on task wakeup
> +  32: SD_WAKE_AFFINE:           Wake task to waking CPU
> -  64: SD_PREFER_LOCAL:          Prefer to keep tasks local to this domain
> + 128: SD_SHARE_CPUPOWER:        Domain members share cpu power
> - 256: SD_POWERSAVINGS_BALANCE:  Balance for power savings
> + 512: SD_SHARE_PKG_RESOURCES:   Domain members share cpu pkg resources
> -1024: SD_SERIALIZE:             Only a single load balancing instance
> -2048: SD_ASYM_PACKING:          Place busy groups earlier in the domain
> -4096: SD_PREFER_SIBLING:        Prefer to place tasks in a sibling domain
> -8192: SD_PREFER_UTILIZATION:    Prefer utilization over SMP nice
> {cpu0/domain1:MC} SD flag: 559
> +   1: SD_LOAD_BALANCE:          Do load balancing on this domain
> +   2: SD_BALANCE_NEWIDLE:       Balance when about to become idle
> +   4: SD_BALANCE_EXEC:          Balance on exec
> +   8: SD_BALANCE_FORK:          Balance on fork, clone
> -  16: SD_BALANCE_WAKE:          Wake to idle CPU on task wakeup
> +  32: SD_WAKE_AFFINE:           Wake task to waking CPU
> -  64: SD_PREFER_LOCAL:          Prefer to keep tasks local to this domain
> - 128: SD_SHARE_CPUPOWER:        Domain members share cpu power
> - 256: SD_POWERSAVINGS_BALANCE:  Balance for power savings
> + 512: SD_SHARE_PKG_RESOURCES:   Domain members share cpu pkg resources
> -1024: SD_SERIALIZE:             Only a single load balancing instance
> -2048: SD_ASYM_PACKING:          Place busy groups earlier in the domain
> -4096: SD_PREFER_SIBLING:        Prefer to place tasks in a sibling domain
> -8192: SD_PREFER_UTILIZATION:    Prefer utilization over SMP nice
> {cpu0/domain2:NUMA} SD flag: 9263
> +   1: SD_LOAD_BALANCE:          Do load balancing on this domain
> +   2: SD_BALANCE_NEWIDLE:       Balance when about to become idle
> +   4: SD_BALANCE_EXEC:          Balance on exec
> +   8: SD_BALANCE_FORK:          Balance on fork, clone
> -  16: SD_BALANCE_WAKE:          Wake to idle CPU on task wakeup
> +  32: SD_WAKE_AFFINE:           Wake task to waking CPU
> -  64: SD_PREFER_LOCAL:          Prefer to keep tasks local to this domain
> - 128: SD_SHARE_CPUPOWER:        Domain members share cpu power
> - 256: SD_POWERSAVINGS_BALANCE:  Balance for power savings
> - 512: SD_SHARE_PKG_RESOURCES:   Domain members share cpu pkg resources
> +1024: SD_SERIALIZE:             Only a single load balancing instance
> -2048: SD_ASYM_PACKING:          Place busy groups earlier in the domain
> -4096: SD_PREFER_SIBLING:        Prefer to place tasks in a sibling domain
> +8192: SD_PREFER_UTILIZATION:    Prefer utilization over SMP nice

I will study this BUG candidate later.

> 
> Abbreviated test run:
> Tasks    jobs/min  jti  jobs/min/task      real       cpu
>   640   158044.01   81       246.9438     24.54    577.66   Wed Jan 23 07:14:33 2013
>  1280    50434.33   39        39.4018    153.80   5737.57   Wed Jan 23 07:17:07 2013
>  2560    47214.07   34        18.4430    328.58  12715.56   Wed Jan 23 07:22:36 2013

So still not works... and not going to balance path while waking up will
fix it, looks like that's the only choice if no error on balance path
could be found...benchmark wins again, I'm feeling bad...

I will conclude the info we collected and make a v3 later.

Regards,
Michael Wang

> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-23  7:10                                               ` Michael Wang
@ 2013-01-23  8:20                                                 ` Mike Galbraith
  2013-01-23  8:30                                                   ` Michael Wang
  0 siblings, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2013-01-23  8:20 UTC (permalink / raw)
  To: Michael Wang; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On Wed, 2013-01-23 at 15:10 +0800, Michael Wang wrote: 
> On 01/23/2013 02:28 PM, Mike Galbraith wrote:

> > Abbreviated test run:
> > Tasks    jobs/min  jti  jobs/min/task      real       cpu
> >   640   158044.01   81       246.9438     24.54    577.66   Wed Jan 23 07:14:33 2013
> >  1280    50434.33   39        39.4018    153.80   5737.57   Wed Jan 23 07:17:07 2013
> >  2560    47214.07   34        18.4430    328.58  12715.56   Wed Jan 23 07:22:36 2013
> 
> So still not works... and not going to balance path while waking up will
> fix it, looks like that's the only choice if no error on balance path
> could be found...benchmark wins again, I'm feeling bad...
> 
> I will conclude the info we collected and make a v3 later.

FWIW, I hacked virgin to do full balance if an idle CPU was not found,
leaving the preference to wake cache affine intact though, turned on
WAKE_BALANCE in all domains, and it did not collapse.  In fact, the high
load end, where the idle search will frequently be a waste of cycles,
actually improved a bit.  Things that make ya go hmmm.

Tasks    jobs/min  jti  jobs/min/task      real       cpu
    1      436.60  100       436.5994     13.88      3.80   Wed Jan 23 08:49:21 2013
    1      437.23  100       437.2294     13.86      3.85   Wed Jan 23 08:49:45 2013
    1      440.41  100       440.4070     13.76      3.76   Wed Jan 23 08:50:08 2013
    5     2463.41   99       492.6829     12.30     10.90   Wed Jan 23 08:50:22 2013
    5     2427.88   99       485.5769     12.48     11.90   Wed Jan 23 08:50:37 2013
    5     2431.78   99       486.3563     12.46     11.74   Wed Jan 23 08:50:51 2013
   10     4867.47   99       486.7470     12.45     23.30   Wed Jan 23 08:51:05 2013
   10     4855.77   99       485.5769     12.48     23.35   Wed Jan 23 08:51:18 2013
   10     4891.04   99       489.1041     12.39     22.71   Wed Jan 23 08:51:31 2013
   20     9789.98   96       489.4992     12.38     36.18   Wed Jan 23 08:51:44 2013
   20     9774.19   97       488.7097     12.40     39.58   Wed Jan 23 08:51:56 2013
   20     9774.19   97       488.7097     12.40     37.99   Wed Jan 23 08:52:09 2013
   40    19086.61   98       477.1654     12.70     89.56   Wed Jan 23 08:52:22 2013
   40    19116.72   98       477.9180     12.68     92.69   Wed Jan 23 08:52:35 2013
   40    19056.60   98       476.4151     12.72     90.19   Wed Jan 23 08:52:48 2013
   80    37149.43   98       464.3678     13.05    114.19   Wed Jan 23 08:53:01 2013
   80    37436.29   98       467.9537     12.95    111.54   Wed Jan 23 08:53:14 2013
   80    37206.45   98       465.0806     13.03    111.49   Wed Jan 23 08:53:27 2013
  160    69605.17   97       435.0323     13.93    152.35   Wed Jan 23 08:53:41 2013
  160    69705.25   97       435.6578     13.91    152.05   Wed Jan 23 08:53:55 2013
  160    69356.22   97       433.4764     13.98    154.56   Wed Jan 23 08:54:09 2013
  320   112482.60   94       351.5081     17.24    285.52   Wed Jan 23 08:54:27 2013
  320   112222.22   94       350.6944     17.28    287.80   Wed Jan 23 08:54:44 2013
  320   109994.33   97       343.7323     17.63    302.40   Wed Jan 23 08:55:02 2013
  640   152273.26   94       237.9270     25.47    614.95   Wed Jan 23 08:55:27 2013
  640   153175.36   96       239.3365     25.32    608.48   Wed Jan 23 08:55:53 2013
  640   152994.08   95       239.0533     25.35    609.33   Wed Jan 23 08:56:18 2013
 1280   191101.26   95       149.2979     40.59   1218.71   Wed Jan 23 08:56:59 2013
 1280   191667.90   94       149.7405     40.47   1215.06   Wed Jan 23 08:57:40 2013
 1280   191289.77   94       149.4451     40.55   1217.35   Wed Jan 23 08:58:20 2013
 2560   221654.52   94        86.5838     69.99   2392.78   Wed Jan 23 08:59:31 2013
 2560   221117.45   91        86.3740     70.16   2399.01   Wed Jan 23 09:00:41 2013
 2560   220394.94   93        86.0918     70.39   2409.10   Wed Jan 23 09:01:52 2013
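
Roughly, the hack is just making the affine shortcut conditional on
having actually found an idle cpu, something like this against the
stock select_task_rq_fair() (a sketch, not the exact diff):

	if (affine_sd) {
		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
			prev_cpu = cpu;

		new_cpu = select_idle_sibling(p, prev_cpu);
		/*
		 * Only take the shortcut when an idle cpu was found,
		 * otherwise fall through to the sd loop below, which
		 * does the find_idlest_group() walk now that
		 * SD_BALANCE_WAKE is set in the domains.
		 */
		if (idle_cpu(new_cpu))
			goto unlock;
	}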



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-23  8:20                                                 ` Mike Galbraith
@ 2013-01-23  8:30                                                   ` Michael Wang
  2013-01-23  8:49                                                     ` Mike Galbraith
  0 siblings, 1 reply; 57+ messages in thread
From: Michael Wang @ 2013-01-23  8:30 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On 01/23/2013 04:20 PM, Mike Galbraith wrote:
> On Wed, 2013-01-23 at 15:10 +0800, Michael Wang wrote: 
>> On 01/23/2013 02:28 PM, Mike Galbraith wrote:
> 
>>> Abbreviated test run:
>>> Tasks    jobs/min  jti  jobs/min/task      real       cpu
>>>   640   158044.01   81       246.9438     24.54    577.66   Wed Jan 23 07:14:33 2013
>>>  1280    50434.33   39        39.4018    153.80   5737.57   Wed Jan 23 07:17:07 2013
>>>  2560    47214.07   34        18.4430    328.58  12715.56   Wed Jan 23 07:22:36 2013
>>
>> So still not works... and not going to balance path while waking up will
>> fix it, looks like that's the only choice if no error on balance path
>> could be found...benchmark wins again, I'm feeling bad...
>>
>> I will conclude the info we collected and make a v3 later.
> 
> FWIW, I hacked virgin to do full balance if an idle CPU was not found,
> leaving the preference to wake cache affine intact though, turned on
> WAKE_BALANCE in all domains, and it did not collapse.  In fact, the high
> load end, where the idle search will frequently be a waste of cycles,
> actually improved a bit.  Things that make ya go hmmm.

Oh, does that means the old balance path is good while the new is really
broken, I mean, compared this with the previously results, could we say
that all the collapse was just caused by the change of balance path?

Regards,
Michael Wang

> 
> Tasks    jobs/min  jti  jobs/min/task      real       cpu
>     1      436.60  100       436.5994     13.88      3.80   Wed Jan 23 08:49:21 2013
>     1      437.23  100       437.2294     13.86      3.85   Wed Jan 23 08:49:45 2013
>     1      440.41  100       440.4070     13.76      3.76   Wed Jan 23 08:50:08 2013
>     5     2463.41   99       492.6829     12.30     10.90   Wed Jan 23 08:50:22 2013
>     5     2427.88   99       485.5769     12.48     11.90   Wed Jan 23 08:50:37 2013
>     5     2431.78   99       486.3563     12.46     11.74   Wed Jan 23 08:50:51 2013
>    10     4867.47   99       486.7470     12.45     23.30   Wed Jan 23 08:51:05 2013
>    10     4855.77   99       485.5769     12.48     23.35   Wed Jan 23 08:51:18 2013
>    10     4891.04   99       489.1041     12.39     22.71   Wed Jan 23 08:51:31 2013
>    20     9789.98   96       489.4992     12.38     36.18   Wed Jan 23 08:51:44 2013
>    20     9774.19   97       488.7097     12.40     39.58   Wed Jan 23 08:51:56 2013
>    20     9774.19   97       488.7097     12.40     37.99   Wed Jan 23 08:52:09 2013
>    40    19086.61   98       477.1654     12.70     89.56   Wed Jan 23 08:52:22 2013
>    40    19116.72   98       477.9180     12.68     92.69   Wed Jan 23 08:52:35 2013
>    40    19056.60   98       476.4151     12.72     90.19   Wed Jan 23 08:52:48 2013
>    80    37149.43   98       464.3678     13.05    114.19   Wed Jan 23 08:53:01 2013
>    80    37436.29   98       467.9537     12.95    111.54   Wed Jan 23 08:53:14 2013
>    80    37206.45   98       465.0806     13.03    111.49   Wed Jan 23 08:53:27 2013
>   160    69605.17   97       435.0323     13.93    152.35   Wed Jan 23 08:53:41 2013
>   160    69705.25   97       435.6578     13.91    152.05   Wed Jan 23 08:53:55 2013
>   160    69356.22   97       433.4764     13.98    154.56   Wed Jan 23 08:54:09 2013
>   320   112482.60   94       351.5081     17.24    285.52   Wed Jan 23 08:54:27 2013
>   320   112222.22   94       350.6944     17.28    287.80   Wed Jan 23 08:54:44 2013
>   320   109994.33   97       343.7323     17.63    302.40   Wed Jan 23 08:55:02 2013
>   640   152273.26   94       237.9270     25.47    614.95   Wed Jan 23 08:55:27 2013
>   640   153175.36   96       239.3365     25.32    608.48   Wed Jan 23 08:55:53 2013
>   640   152994.08   95       239.0533     25.35    609.33   Wed Jan 23 08:56:18 2013
>  1280   191101.26   95       149.2979     40.59   1218.71   Wed Jan 23 08:56:59 2013
>  1280   191667.90   94       149.7405     40.47   1215.06   Wed Jan 23 08:57:40 2013
>  1280   191289.77   94       149.4451     40.55   1217.35   Wed Jan 23 08:58:20 2013
>  2560   221654.52   94        86.5838     69.99   2392.78   Wed Jan 23 08:59:31 2013
>  2560   221117.45   91        86.3740     70.16   2399.01   Wed Jan 23 09:00:41 2013
>  2560   220394.94   93        86.0918     70.39   2409.10   Wed Jan 23 09:01:52 2013
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-23  8:30                                                   ` Michael Wang
@ 2013-01-23  8:49                                                     ` Mike Galbraith
  2013-01-23  9:00                                                       ` Michael Wang
  0 siblings, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2013-01-23  8:49 UTC (permalink / raw)
  To: Michael Wang; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On Wed, 2013-01-23 at 16:30 +0800, Michael Wang wrote: 
> On 01/23/2013 04:20 PM, Mike Galbraith wrote:
> > On Wed, 2013-01-23 at 15:10 +0800, Michael Wang wrote: 
> >> On 01/23/2013 02:28 PM, Mike Galbraith wrote:
> > 
> >>> Abbreviated test run:
> >>> Tasks    jobs/min  jti  jobs/min/task      real       cpu
> >>>   640   158044.01   81       246.9438     24.54    577.66   Wed Jan 23 07:14:33 2013
> >>>  1280    50434.33   39        39.4018    153.80   5737.57   Wed Jan 23 07:17:07 2013
> >>>  2560    47214.07   34        18.4430    328.58  12715.56   Wed Jan 23 07:22:36 2013
> >>
> >> So still not works... and not going to balance path while waking up will
> >> fix it, looks like that's the only choice if no error on balance path
> >> could be found...benchmark wins again, I'm feeling bad...
> >>
> >> I will conclude the info we collected and make a v3 later.
> > 
> > FWIW, I hacked virgin to do full balance if an idle CPU was not found,
> > leaving the preference to wake cache affine intact though, turned on
> > WAKE_BALANCE in all domains, and it did not collapse.  In fact, the high
> > load end, where the idle search will frequently be a waste of cycles,
> > actually improved a bit.  Things that make ya go hmmm.
> 
> Oh, does that means the old balance path is good while the new is really
> broken, I mean, compared this with the previously results, could we say
> that all the collapse was just caused by the change of balance path?

That's a good supposition.  I'll see if it holds.

Next, I'm going to try ripping select_idle_sibling() to tiny shreds,
twiddle the balance path a little to see if I can get rid of the bad
stuff for tbench, maybe make some good stuff for pgbench and ilk, ilk
_maybe_ including heavy duty remote network type loads.

There's gonna be some violent axe swinging here shortly.

-Mike


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-23  8:49                                                     ` Mike Galbraith
@ 2013-01-23  9:00                                                       ` Michael Wang
  2013-01-23  9:18                                                         ` Mike Galbraith
  0 siblings, 1 reply; 57+ messages in thread
From: Michael Wang @ 2013-01-23  9:00 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On 01/23/2013 04:49 PM, Mike Galbraith wrote:
> On Wed, 2013-01-23 at 16:30 +0800, Michael Wang wrote: 
>> On 01/23/2013 04:20 PM, Mike Galbraith wrote:
>>> On Wed, 2013-01-23 at 15:10 +0800, Michael Wang wrote: 
>>>> On 01/23/2013 02:28 PM, Mike Galbraith wrote:
>>>
>>>>> Abbreviated test run:
>>>>> Tasks    jobs/min  jti  jobs/min/task      real       cpu
>>>>>   640   158044.01   81       246.9438     24.54    577.66   Wed Jan 23 07:14:33 2013
>>>>>  1280    50434.33   39        39.4018    153.80   5737.57   Wed Jan 23 07:17:07 2013
>>>>>  2560    47214.07   34        18.4430    328.58  12715.56   Wed Jan 23 07:22:36 2013
>>>>
>>>> So still not works... and not going to balance path while waking up will
>>>> fix it, looks like that's the only choice if no error on balance path
>>>> could be found...benchmark wins again, I'm feeling bad...
>>>>
>>>> I will conclude the info we collected and make a v3 later.
>>>
>>> FWIW, I hacked virgin to do full balance if an idle CPU was not found,
>>> leaving the preference to wake cache affine intact though, turned on
>>> WAKE_BALANCE in all domains, and it did not collapse.  In fact, the high
>>> load end, where the idle search will frequently be a waste of cycles,
>>> actually improved a bit.  Things that make ya go hmmm.
>>
>> Oh, does that means the old balance path is good while the new is really
>> broken, I mean, compared this with the previously results, could we say
>> that all the collapse was just caused by the change of balance path?
> 
> That's a good supposition.  I'll see if it holds.

I just notice that there is no sd support the WAKE flag at all according
to your debug info, isn't it?

Which means there is no way to do load balance since we can even not
found a suitable sd for wake up...totally confusing me now ;(

Regards,
Michael Wang

> 
> Next, I'm going to try ripping select_idle_sibling() to tiny shreds,
> twiddle the balance path a little to see if I can get rid of the bad
> stuff for tbench, maybe make some good stuff for pgbench and ilk, ilk
> _maybe_ including heavy duty remote network type loads.
> 
> There's gonna be some violent axe swinging here shortly.
> 
> -Mike
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-23  9:00                                                       ` Michael Wang
@ 2013-01-23  9:18                                                         ` Mike Galbraith
  2013-01-23  9:26                                                           ` Michael Wang
  2013-01-23  9:32                                                           ` Mike Galbraith
  0 siblings, 2 replies; 57+ messages in thread
From: Mike Galbraith @ 2013-01-23  9:18 UTC (permalink / raw)
  To: Michael Wang; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On Wed, 2013-01-23 at 17:00 +0800, Michael Wang wrote: 
> On 01/23/2013 04:49 PM, Mike Galbraith wrote:
> > On Wed, 2013-01-23 at 16:30 +0800, Michael Wang wrote: 
> >> On 01/23/2013 04:20 PM, Mike Galbraith wrote:
> >>> On Wed, 2013-01-23 at 15:10 +0800, Michael Wang wrote: 
> >>>> On 01/23/2013 02:28 PM, Mike Galbraith wrote:
> >>>
> >>>>> Abbreviated test run:
> >>>>> Tasks    jobs/min  jti  jobs/min/task      real       cpu
> >>>>>   640   158044.01   81       246.9438     24.54    577.66   Wed Jan 23 07:14:33 2013
> >>>>>  1280    50434.33   39        39.4018    153.80   5737.57   Wed Jan 23 07:17:07 2013
> >>>>>  2560    47214.07   34        18.4430    328.58  12715.56   Wed Jan 23 07:22:36 2013
> >>>>
> >>>> So still not works... and not going to balance path while waking up will
> >>>> fix it, looks like that's the only choice if no error on balance path
> >>>> could be found...benchmark wins again, I'm feeling bad...
> >>>>
> >>>> I will conclude the info we collected and make a v3 later.
> >>>
> >>> FWIW, I hacked virgin to do full balance if an idle CPU was not found,
> >>> leaving the preference to wake cache affine intact though, turned on
> >>> WAKE_BALANCE in all domains, and it did not collapse.  In fact, the high
> >>> load end, where the idle search will frequently be a waste of cycles,
> >>> actually improved a bit.  Things that make ya go hmmm.
> >>
> >> Oh, does that means the old balance path is good while the new is really
> >> broken, I mean, compared this with the previously results, could we say
> >> that all the collapse was just caused by the change of balance path?
> > 
> > That's a good supposition.  I'll see if it holds.
> 
> I just notice that there is no sd support the WAKE flag at all according
> to your debug info, isn't it?

There is, I turned it on in all domains.

-Mike


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-23  9:18                                                         ` Mike Galbraith
@ 2013-01-23  9:26                                                           ` Michael Wang
  2013-01-23  9:37                                                             ` Mike Galbraith
  2013-01-23  9:32                                                           ` Mike Galbraith
  1 sibling, 1 reply; 57+ messages in thread
From: Michael Wang @ 2013-01-23  9:26 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On 01/23/2013 05:18 PM, Mike Galbraith wrote:
> On Wed, 2013-01-23 at 17:00 +0800, Michael Wang wrote: 
>> On 01/23/2013 04:49 PM, Mike Galbraith wrote:
>>> On Wed, 2013-01-23 at 16:30 +0800, Michael Wang wrote: 
>>>> On 01/23/2013 04:20 PM, Mike Galbraith wrote:
>>>>> On Wed, 2013-01-23 at 15:10 +0800, Michael Wang wrote: 
>>>>>> On 01/23/2013 02:28 PM, Mike Galbraith wrote:
>>>>>
>>>>>>> Abbreviated test run:
>>>>>>> Tasks    jobs/min  jti  jobs/min/task      real       cpu
>>>>>>>   640   158044.01   81       246.9438     24.54    577.66   Wed Jan 23 07:14:33 2013
>>>>>>>  1280    50434.33   39        39.4018    153.80   5737.57   Wed Jan 23 07:17:07 2013
>>>>>>>  2560    47214.07   34        18.4430    328.58  12715.56   Wed Jan 23 07:22:36 2013
>>>>>>
>>>>>> So still not works... and not going to balance path while waking up will
>>>>>> fix it, looks like that's the only choice if no error on balance path
>>>>>> could be found...benchmark wins again, I'm feeling bad...
>>>>>>
>>>>>> I will conclude the info we collected and make a v3 later.
>>>>>
>>>>> FWIW, I hacked virgin to do full balance if an idle CPU was not found,
>>>>> leaving the preference to wake cache affine intact though, turned on
>>>>> WAKE_BALANCE in all domains, and it did not collapse.  In fact, the high
>>>>> load end, where the idle search will frequently be a waste of cycles,
>>>>> actually improved a bit.  Things that make ya go hmmm.
>>>>
>>>> Oh, does that means the old balance path is good while the new is really
>>>> broken, I mean, compared this with the previously results, could we say
>>>> that all the collapse was just caused by the change of balance path?
>>>
>>> That's a good supposition.  I'll see if it holds.
>>
>> I just notice that there is no sd support the WAKE flag at all according
>> to your debug info, isn't it?
> 
> There is, I turned it on in all domains.

So does the debug info show the changes? Maybe I missed some point where
the sbm needs to be rebuilt.

Regards,
Michael Wang

> 
> -Mike
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-23  9:18                                                         ` Mike Galbraith
  2013-01-23  9:26                                                           ` Michael Wang
@ 2013-01-23  9:32                                                           ` Mike Galbraith
  2013-01-24  6:01                                                             ` Michael Wang
  1 sibling, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2013-01-23  9:32 UTC (permalink / raw)
  To: Michael Wang; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On Wed, 2013-01-23 at 10:18 +0100, Mike Galbraith wrote: 
> On Wed, 2013-01-23 at 17:00 +0800, Michael Wang wrote: 
> > On 01/23/2013 04:49 PM, Mike Galbraith wrote:
> > > On Wed, 2013-01-23 at 16:30 +0800, Michael Wang wrote: 
> > >> On 01/23/2013 04:20 PM, Mike Galbraith wrote:
> > >>> On Wed, 2013-01-23 at 15:10 +0800, Michael Wang wrote: 
> > >>>> On 01/23/2013 02:28 PM, Mike Galbraith wrote:
> > >>>
> > >>>>> Abbreviated test run:
> > >>>>> Tasks    jobs/min  jti  jobs/min/task      real       cpu
> > >>>>>   640   158044.01   81       246.9438     24.54    577.66   Wed Jan 23 07:14:33 2013
> > >>>>>  1280    50434.33   39        39.4018    153.80   5737.57   Wed Jan 23 07:17:07 2013
> > >>>>>  2560    47214.07   34        18.4430    328.58  12715.56   Wed Jan 23 07:22:36 2013
> > >>>>
> > >>>> So still not works... and not going to balance path while waking up will
> > >>>> fix it, looks like that's the only choice if no error on balance path
> > >>>> could be found...benchmark wins again, I'm feeling bad...
> > >>>>
> > >>>> I will conclude the info we collected and make a v3 later.
> > >>>
> > >>> FWIW, I hacked virgin to do full balance if an idle CPU was not found,
> > >>> leaving the preference to wake cache affine intact though, turned on
> > >>> WAKE_BALANCE in all domains, and it did not collapse.  In fact, the high
> > >>> load end, where the idle search will frequently be a waste of cycles,
> > >>> actually improved a bit.  Things that make ya go hmmm.
> > >>
> > >> Oh, does that means the old balance path is good while the new is really
> > >> broken, I mean, compared this with the previously results, could we say
> > >> that all the collapse was just caused by the change of balance path?
> > > 
> > > That's a good supposition.  I'll see if it holds.
> > 
> > I just notice that there is no sd support the WAKE flag at all according
> > to your debug info, isn't it?
> 
> There is, I turned it on in all domains.

For your patches, I had to turn it on at birth, but doing that, and
restoring the full balance path to original form killed the collapse.

Tasks    jobs/min  jti  jobs/min/task      real       cpu
  640   152452.83   97       238.2075     25.44    613.48   Wed Jan 23 10:22:12 2013
 1280   190491.16   97       148.8212     40.72   1223.74   Wed Jan 23 10:22:53 2013
 2560   219397.54   95        85.7022     70.71   2422.46   Wed Jan 23 10:24:04 2013

---
 include/linux/topology.h |    6 ++---
 kernel/sched/core.c      |   41 ++++++++++++++++++++++++++++++-------
 kernel/sched/fair.c      |   52 +++++++++++++++++++++++++++++------------------
 3 files changed, 70 insertions(+), 29 deletions(-)

--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -95,7 +95,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
 				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
+				| 1*SD_BALANCE_WAKE			\
 				| 1*SD_WAKE_AFFINE			\
 				| 1*SD_SHARE_CPUPOWER			\
 				| 1*SD_SHARE_PKG_RESOURCES		\
@@ -126,7 +126,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
 				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
+				| 1*SD_BALANCE_WAKE			\
 				| 1*SD_WAKE_AFFINE			\
 				| 0*SD_SHARE_CPUPOWER			\
 				| 1*SD_SHARE_PKG_RESOURCES		\
@@ -156,7 +156,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
 				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
+				| 1*SD_BALANCE_WAKE			\
 				| 1*SD_WAKE_AFFINE			\
 				| 0*SD_SHARE_CPUPOWER			\
 				| 0*SD_SHARE_PKG_RESOURCES		\
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5609,11 +5609,39 @@ static void update_top_cache_domain(int
 static int sbm_max_level;
 DEFINE_PER_CPU_SHARED_ALIGNED(struct sched_balance_map, sbm_array);
 
+static void debug_sched_balance_map(int cpu)
+{
+	int i, type, level = 0;
+	struct sched_balance_map *sbm = &per_cpu(sbm_array, cpu);
+
+	printk("WYT: sbm of cpu %d\n", cpu);
+
+	for (type = 0; type < SBM_MAX_TYPE; type++) {
+		if (type == SBM_EXEC_TYPE)
+			printk("WYT: \t exec map\n");
+		else if (type == SBM_FORK_TYPE)
+			printk("WYT: \t fork map\n");
+		else if (type == SBM_WAKE_TYPE)
+			printk("WYT: \t wake map\n");
+
+		for (level = 0; level < sbm_max_level; level++) {
+			if (sbm->sd[type][level])
+				printk("WYT: \t\t sd %x, idx %d, level %d, weight %d\n", sbm->sd[type][level], level, sbm->sd[type][level]->level, sbm->sd[type][level]->span_weight);
+		}
+	}
+
+	printk("WYT: \t affine map\n");
+
+	for_each_possible_cpu(i) {
+		if (sbm->affine_map[i])
+			printk("WYT: \t\t affine with cpu %x in sd %x, weight %d\n", i, sbm->affine_map[i], sbm->affine_map[i]->span_weight);
+	}
+}
+
 static void build_sched_balance_map(int cpu)
 {
 	struct sched_balance_map *sbm = &per_cpu(sbm_array, cpu);
 	struct sched_domain *sd = cpu_rq(cpu)->sd;
-	struct sched_domain *top_sd = NULL;
 	int i, type, level = 0;
 
 	memset(sbm->top_level, 0, sizeof((*sbm).top_level));
@@ -5656,11 +5684,9 @@ static void build_sched_balance_map(int
 	 * fill the hole to get lower level sd easily.
 	 */
 	for (type = 0; type < SBM_MAX_TYPE; type++) {
-		level = sbm->top_level[type];
-		top_sd = sbm->sd[type][level];
-		if ((++level != sbm_max_level) && top_sd) {
-			for (; level < sbm_max_level; level++)
-				sbm->sd[type][level] = top_sd;
+		for (level = 1; level < sbm_max_level; level++) {
+			if (!sbm->sd[type][level])
+				sbm->sd[type][level] = sbm->sd[type][level - 1];
 		}
 	}
 }
@@ -5719,6 +5745,7 @@ cpu_attach_domain(struct sched_domain *s
 	 * destroy_sched_domains() already do the work.
 	 */
 	build_sched_balance_map(cpu);
+//MIKE	debug_sched_balance_map(cpu);
 	rcu_assign_pointer(rq->sbm, sbm);
 }
 
@@ -6220,7 +6247,7 @@ sd_numa_init(struct sched_domain_topolog
 					| 1*SD_BALANCE_NEWIDLE
 					| 0*SD_BALANCE_EXEC
 					| 0*SD_BALANCE_FORK
-					| 0*SD_BALANCE_WAKE
+					| 1*SD_BALANCE_WAKE
 					| 0*SD_WAKE_AFFINE
 					| 0*SD_SHARE_CPUPOWER
 					| 0*SD_SHARE_PKG_RESOURCES
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3312,7 +3312,7 @@ static int select_idle_sibling(struct ta
 static int
 select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 {
-	struct sched_domain *sd = NULL;
+	struct sched_domain *sd = NULL, *tmp;
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	int new_cpu = cpu;
@@ -3376,31 +3376,45 @@ select_task_rq_fair(struct task_struct *
 
 balance_path:
 	new_cpu = (sd_flag & SD_BALANCE_WAKE) ? prev_cpu : cpu;
-	sd = sbm->sd[type][sbm->top_level[type]];
+	sd = tmp = sbm->sd[type][sbm->top_level[type]];
 
 	while (sd) {
 		int load_idx = sd->forkexec_idx;
-		struct sched_group *sg = NULL;
+		struct sched_group *group;
+		int weight;
+
+		if (!(sd->flags & sd_flag)) {
+			sd = sd->child;
+			continue;
+		}
 
 		if (sd_flag & SD_BALANCE_WAKE)
 			load_idx = sd->wake_idx;
 
-		sg = find_idlest_group(sd, p, cpu, load_idx);
-		if (!sg)
-			goto next_sd;
-
-		new_cpu = find_idlest_cpu(sg, p, cpu);
-		if (new_cpu != -1)
-			cpu = new_cpu;
-next_sd:
-		if (!sd->level)
-			break;
-
-		sbm = cpu_rq(cpu)->sbm;
-		if (!sbm)
-			break;
-
-		sd = sbm->sd[type][sd->level - 1];
+		group = find_idlest_group(sd, p, cpu, load_idx);
+		if (!group) {
+			sd = sd->child;
+			continue;
+		}
+
+		new_cpu = find_idlest_cpu(group, p, cpu);
+		if (new_cpu == -1 || new_cpu == cpu) {
+			/* Now try balancing at a lower domain level of cpu */
+			sd = sd->child;
+			continue;
+		}
+
+		/* Now try balancing at a lower domain level of new_cpu */
+		cpu = new_cpu;
+		weight = sd->span_weight;
+		sd = NULL;
+		for_each_domain(cpu, tmp) {
+			if (weight <= tmp->span_weight)
+				break;
+			if (tmp->flags & sd_flag)
+				sd = tmp;
+		}
+		/* while loop will break here if sd == NULL */
 	}
 
 unlock:



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-23  9:26                                                           ` Michael Wang
@ 2013-01-23  9:37                                                             ` Mike Galbraith
  0 siblings, 0 replies; 57+ messages in thread
From: Mike Galbraith @ 2013-01-23  9:37 UTC (permalink / raw)
  To: Michael Wang; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On Wed, 2013-01-23 at 17:26 +0800, Michael Wang wrote: 
> On 01/23/2013 05:18 PM, Mike Galbraith wrote:
> > On Wed, 2013-01-23 at 17:00 +0800, Michael Wang wrote: 
> >> On 01/23/2013 04:49 PM, Mike Galbraith wrote:
> >>> On Wed, 2013-01-23 at 16:30 +0800, Michael Wang wrote: 
> >>>> On 01/23/2013 04:20 PM, Mike Galbraith wrote:
> >>>>> On Wed, 2013-01-23 at 15:10 +0800, Michael Wang wrote: 
> >>>>>> On 01/23/2013 02:28 PM, Mike Galbraith wrote:
> >>>>>
> >>>>>>> Abbreviated test run:
> >>>>>>> Tasks    jobs/min  jti  jobs/min/task      real       cpu
> >>>>>>>   640   158044.01   81       246.9438     24.54    577.66   Wed Jan 23 07:14:33 2013
> >>>>>>>  1280    50434.33   39        39.4018    153.80   5737.57   Wed Jan 23 07:17:07 2013
> >>>>>>>  2560    47214.07   34        18.4430    328.58  12715.56   Wed Jan 23 07:22:36 2013
> >>>>>>
> >>>>>> So still not works... and not going to balance path while waking up will
> >>>>>> fix it, looks like that's the only choice if no error on balance path
> >>>>>> could be found...benchmark wins again, I'm feeling bad...
> >>>>>>
> >>>>>> I will conclude the info we collected and make a v3 later.
> >>>>>
> >>>>> FWIW, I hacked virgin to do full balance if an idle CPU was not found,
> >>>>> leaving the preference to wake cache affine intact though, turned on
> >>>>> WAKE_BALANCE in all domains, and it did not collapse.  In fact, the high
> >>>>> load end, where the idle search will frequently be a waste of cycles,
> >>>>> actually improved a bit.  Things that make ya go hmmm.
> >>>>
> >>>> Oh, does that means the old balance path is good while the new is really
> >>>> broken, I mean, compared this with the previously results, could we say
> >>>> that all the collapse was just caused by the change of balance path?
> >>>
> >>> That's a good supposition.  I'll see if it holds.
> >>
> >> I just notice that there is no sd support the WAKE flag at all according
> >> to your debug info, isn't it?
> > 
> > There is, I turned it on in all domains.

? Virgin doesn't have any of your patches.  In virgin, I can twiddle
flags effectively with a script.

With your patches, I have to make that happen from the start for it to
be effective, but not in virgin (well nearly virgin) 3.8-rc3.

-Mike


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-23  9:32                                                           ` Mike Galbraith
@ 2013-01-24  6:01                                                             ` Michael Wang
  2013-01-24  6:51                                                               ` Mike Galbraith
  2013-01-24  7:00                                                               ` Michael Wang
  0 siblings, 2 replies; 57+ messages in thread
From: Michael Wang @ 2013-01-24  6:01 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On 01/23/2013 05:32 PM, Mike Galbraith wrote:
[snip]
> ---
>  include/linux/topology.h |    6 ++---
>  kernel/sched/core.c      |   41 ++++++++++++++++++++++++++++++-------
>  kernel/sched/fair.c      |   52 +++++++++++++++++++++++++++++------------------
>  3 files changed, 70 insertions(+), 29 deletions(-)
> 
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -95,7 +95,7 @@ int arch_update_cpu_topology(void);
>  				| 1*SD_BALANCE_NEWIDLE			\
>  				| 1*SD_BALANCE_EXEC			\
>  				| 1*SD_BALANCE_FORK			\
> -				| 0*SD_BALANCE_WAKE			\
> +				| 1*SD_BALANCE_WAKE			\
>  				| 1*SD_WAKE_AFFINE			\
>  				| 1*SD_SHARE_CPUPOWER			\
>  				| 1*SD_SHARE_PKG_RESOURCES		\
> @@ -126,7 +126,7 @@ int arch_update_cpu_topology(void);
>  				| 1*SD_BALANCE_NEWIDLE			\
>  				| 1*SD_BALANCE_EXEC			\
>  				| 1*SD_BALANCE_FORK			\
> -				| 0*SD_BALANCE_WAKE			\
> +				| 1*SD_BALANCE_WAKE			\
>  				| 1*SD_WAKE_AFFINE			\
>  				| 0*SD_SHARE_CPUPOWER			\
>  				| 1*SD_SHARE_PKG_RESOURCES		\
> @@ -156,7 +156,7 @@ int arch_update_cpu_topology(void);
>  				| 1*SD_BALANCE_NEWIDLE			\
>  				| 1*SD_BALANCE_EXEC			\
>  				| 1*SD_BALANCE_FORK			\
> -				| 0*SD_BALANCE_WAKE			\
> +				| 1*SD_BALANCE_WAKE			\
>  				| 1*SD_WAKE_AFFINE			\
>  				| 0*SD_SHARE_CPUPOWER			\
>  				| 0*SD_SHARE_PKG_RESOURCES		\

I've enabled WAKE flag on my box like you did, but still can't see
regression, and I've just tested on a power server with 64 cpu, also
failed to reproduce the issue (not compared with virgin yet, but can't
see collapse).

I will do more testing on the power box to confirm it.

> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5609,11 +5609,39 @@ static void update_top_cache_domain(int
>  static int sbm_max_level;
>  DEFINE_PER_CPU_SHARED_ALIGNED(struct sched_balance_map, sbm_array);
> 
> +static void debug_sched_balance_map(int cpu)
> +{
> +	int i, type, level = 0;
> +	struct sched_balance_map *sbm = &per_cpu(sbm_array, cpu);
> +
> +	printk("WYT: sbm of cpu %d\n", cpu);
> +
> +	for (type = 0; type < SBM_MAX_TYPE; type++) {
> +		if (type == SBM_EXEC_TYPE)
> +			printk("WYT: \t exec map\n");
> +		else if (type == SBM_FORK_TYPE)
> +			printk("WYT: \t fork map\n");
> +		else if (type == SBM_WAKE_TYPE)
> +			printk("WYT: \t wake map\n");
> +
> +		for (level = 0; level < sbm_max_level; level++) {
> +			if (sbm->sd[type][level])
> +				printk("WYT: \t\t sd %x, idx %d, level %d, weight %d\n", sbm->sd[type][level], level, sbm->sd[type][level]->level, sbm->sd[type][level]->span_weight);
> +		}
> +	}
> +
> +	printk("WYT: \t affine map\n");
> +
> +	for_each_possible_cpu(i) {
> +		if (sbm->affine_map[i])
> +			printk("WYT: \t\t affine with cpu %x in sd %x, weight %d\n", i, sbm->affine_map[i], sbm->affine_map[i]->span_weight);
> +	}
> +}
> +
>  static void build_sched_balance_map(int cpu)
>  {
>  	struct sched_balance_map *sbm = &per_cpu(sbm_array, cpu);
>  	struct sched_domain *sd = cpu_rq(cpu)->sd;
> -	struct sched_domain *top_sd = NULL;
>  	int i, type, level = 0;
> 
>  	memset(sbm->top_level, 0, sizeof((*sbm).top_level));
> @@ -5656,11 +5684,9 @@ static void build_sched_balance_map(int
>  	 * fill the hole to get lower level sd easily.
>  	 */
>  	for (type = 0; type < SBM_MAX_TYPE; type++) {
> -		level = sbm->top_level[type];
> -		top_sd = sbm->sd[type][level];
> -		if ((++level != sbm_max_level) && top_sd) {
> -			for (; level < sbm_max_level; level++)
> -				sbm->sd[type][level] = top_sd;
> +		for (level = 1; level < sbm_max_level; level++) {
> +			if (!sbm->sd[type][level])
> +				sbm->sd[type][level] = sbm->sd[type][level - 1];
>  		}
>  	}
>  }
> @@ -5719,6 +5745,7 @@ cpu_attach_domain(struct sched_domain *s
>  	 * destroy_sched_domains() already do the work.
>  	 */
>  	build_sched_balance_map(cpu);
> +//MIKE	debug_sched_balance_map(cpu);
>  	rcu_assign_pointer(rq->sbm, sbm);
>  }
> 
> @@ -6220,7 +6247,7 @@ sd_numa_init(struct sched_domain_topolog
>  					| 1*SD_BALANCE_NEWIDLE
>  					| 0*SD_BALANCE_EXEC
>  					| 0*SD_BALANCE_FORK
> -					| 0*SD_BALANCE_WAKE
> +					| 1*SD_BALANCE_WAKE
>  					| 0*SD_WAKE_AFFINE
>  					| 0*SD_SHARE_CPUPOWER
>  					| 0*SD_SHARE_PKG_RESOURCES
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3312,7 +3312,7 @@ static int select_idle_sibling(struct ta
>  static int
>  select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>  {
> -	struct sched_domain *sd = NULL;
> +	struct sched_domain *sd = NULL, *tmp;
>  	int cpu = smp_processor_id();
>  	int prev_cpu = task_cpu(p);
>  	int new_cpu = cpu;
> @@ -3376,31 +3376,45 @@ select_task_rq_fair(struct task_struct *
> 
>  balance_path:
>  	new_cpu = (sd_flag & SD_BALANCE_WAKE) ? prev_cpu : cpu;
> -	sd = sbm->sd[type][sbm->top_level[type]];
> +	sd = tmp = sbm->sd[type][sbm->top_level[type]];
> 
>  	while (sd) {
>  		int load_idx = sd->forkexec_idx;
> -		struct sched_group *sg = NULL;
> +		struct sched_group *group;
> +		int weight;
> +
> +		if (!(sd->flags & sd_flag)) {
> +			sd = sd->child;
> +			continue;
> +		}
> 
>  		if (sd_flag & SD_BALANCE_WAKE)
>  			load_idx = sd->wake_idx;
> 
> -		sg = find_idlest_group(sd, p, cpu, load_idx);
> -		if (!sg)
> -			goto next_sd;
> -
> -		new_cpu = find_idlest_cpu(sg, p, cpu);
> -		if (new_cpu != -1)
> -			cpu = new_cpu;
> -next_sd:
> -		if (!sd->level)
> -			break;
> -
> -		sbm = cpu_rq(cpu)->sbm;
> -		if (!sbm)
> -			break;
> -
> -		sd = sbm->sd[type][sd->level - 1];

Maybe we could test it part by part? I'm planning to write another debug
patch with which we could compare just parts of the two paths; I will
send it to you when I finish it.
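
If it helps that part-by-part comparison, one rough way to switch
between the two paths at run time could be a sched_feat() toggle. This
is only a sketch; the SBM_BALANCE name and the else branch are made up
here, not taken from the patches:

        /* kernel/sched/features.h */
        SCHED_FEAT(SBM_BALANCE, true)

        /* kernel/sched/fair.c, in select_task_rq_fair() at balance_path: */
        if (sched_feat(SBM_BALANCE)) {
                /* new path: start from the top sd recorded in the sbm */
                sd = sbm->sd[type][sbm->top_level[type]];
        } else {
                /* old path: walk rq->sd, keep the highest sd with the flag */
                for_each_domain(cpu, tmp) {
                        if (tmp->flags & sd_flag)
                                sd = tmp;
                }
        }

With CONFIG_SCHED_DEBUG the toggle can then be flipped at run time via
"echo NO_SBM_BALANCE > /sys/kernel/debug/sched_features", so the same
kernel can run both variants under the same load.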

Regards,
Michael Wang

> +		group = find_idlest_group(sd, p, cpu, load_idx);
> +		if (!group) {
> +			sd = sd->child;
> +			continue;
> +		}
> +
> +		new_cpu = find_idlest_cpu(group, p, cpu);
> +		if (new_cpu == -1 || new_cpu == cpu) {
> +			/* Now try balancing at a lower domain level of cpu */
> +			sd = sd->child;
> +			continue;
> +		}
> +
> +		/* Now try balancing at a lower domain level of new_cpu */
> +		cpu = new_cpu;
> +		weight = sd->span_weight;
> +		sd = NULL;
> +		for_each_domain(cpu, tmp) {
> +			if (weight <= tmp->span_weight)
> +				break;
> +			if (tmp->flags & sd_flag)
> +				sd = tmp;
> +		}
> +		/* while loop will break here if sd == NULL */
>  	}
> 
>  unlock:
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-24  6:01                                                             ` Michael Wang
@ 2013-01-24  6:51                                                               ` Mike Galbraith
  2013-01-24  7:15                                                                 ` Michael Wang
  2013-01-24  7:00                                                               ` Michael Wang
  1 sibling, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2013-01-24  6:51 UTC (permalink / raw)
  To: Michael Wang; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On Thu, 2013-01-24 at 14:01 +0800, Michael Wang wrote:

> I've enabled WAKE flag on my box like you did, but still can't see
> regression, and I've just tested on a power server with 64 cpu, also
> failed to reproduce the issue (not compared with virgin yet, but can't
> see collapse).

I'm not surprised.  I'm seeing enough inconsistent crap to come to the
conclusion that stock scheduler knobs flat can't be used on a largish
box, they're just too preempt-happy, leading to weird crap.

My 2 missing nodes came back, and the very same kernel that highly
repeatably collapsed with 2 nodes does not with 4 nodes, and 2 nodes
does not collapse with only preemption knob tweaking, and that's
bullshit.  Virgin shows instability in the mid-range, make a tiny tweak
that should have little if any effect there, and that instability
vanishes entirely.  Test runs are not consistent enough boot to boot etc
etc.  Either stock knobs suck on NUMA boxen, or this box is possessed.

-Mike


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-24  6:01                                                             ` Michael Wang
  2013-01-24  6:51                                                               ` Mike Galbraith
@ 2013-01-24  7:00                                                               ` Michael Wang
  1 sibling, 0 replies; 57+ messages in thread
From: Michael Wang @ 2013-01-24  7:00 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On 01/24/2013 02:01 PM, Michael Wang wrote:
> On 01/23/2013 05:32 PM, Mike Galbraith wrote:
> [snip]
>> ---
>>  include/linux/topology.h |    6 ++---
>>  kernel/sched/core.c      |   41 ++++++++++++++++++++++++++++++-------
>>  kernel/sched/fair.c      |   52 +++++++++++++++++++++++++++++------------------
>>  3 files changed, 70 insertions(+), 29 deletions(-)
>>
>> --- a/include/linux/topology.h
>> +++ b/include/linux/topology.h
>> @@ -95,7 +95,7 @@ int arch_update_cpu_topology(void);
>>  				| 1*SD_BALANCE_NEWIDLE			\
>>  				| 1*SD_BALANCE_EXEC			\
>>  				| 1*SD_BALANCE_FORK			\
>> -				| 0*SD_BALANCE_WAKE			\
>> +				| 1*SD_BALANCE_WAKE			\
>>  				| 1*SD_WAKE_AFFINE			\
>>  				| 1*SD_SHARE_CPUPOWER			\
>>  				| 1*SD_SHARE_PKG_RESOURCES		\
>> @@ -126,7 +126,7 @@ int arch_update_cpu_topology(void);
>>  				| 1*SD_BALANCE_NEWIDLE			\
>>  				| 1*SD_BALANCE_EXEC			\
>>  				| 1*SD_BALANCE_FORK			\
>> -				| 0*SD_BALANCE_WAKE			\
>> +				| 1*SD_BALANCE_WAKE			\
>>  				| 1*SD_WAKE_AFFINE			\
>>  				| 0*SD_SHARE_CPUPOWER			\
>>  				| 1*SD_SHARE_PKG_RESOURCES		\
>> @@ -156,7 +156,7 @@ int arch_update_cpu_topology(void);
>>  				| 1*SD_BALANCE_NEWIDLE			\
>>  				| 1*SD_BALANCE_EXEC			\
>>  				| 1*SD_BALANCE_FORK			\
>> -				| 0*SD_BALANCE_WAKE			\
>> +				| 1*SD_BALANCE_WAKE			\
>>  				| 1*SD_WAKE_AFFINE			\
>>  				| 0*SD_SHARE_CPUPOWER			\
>>  				| 0*SD_SHARE_PKG_RESOURCES		\
> 
> I've enabled WAKE flag on my box like you did, but still can't see
> regression, and I've just tested on a power server with 64 cpu, also
> failed to reproduce the issue (not compared with virgin yet, but can't
> see collapse).
> 
> I will do more testing on the power box to confirm it.

I still can't reproduce the issue, but according to my default sd
topology there are some differences:

WYT: sbm of cpu 0
WYT: 	 exec map
WYT: 		 sd f051be80, idx 0, level 0, weight 4
WYT: 		 sd f08b3700, idx 1, level 1, weight 32
WYT: 		 sd f08b3700, idx 2, level 1, weight 32
WYT: 	 fork map
WYT: 		 sd f051be80, idx 0, level 0, weight 4
WYT: 		 sd f08b3700, idx 1, level 1, weight 32
WYT: 		 sd f08b3700, idx 2, level 1, weight 32
WYT: 	 wake map
WYT: 		 sd f051be80, idx 0, level 0, weight 4
WYT: 		 sd f08b3700, idx 1, level 1, weight 32
WYT: 		 sd f08b6300, idx 2, level 2, weight 64
WYT: 	 affine map
WYT: 		 affine with cpu 0 in sd f051be80, weight 4
WYT: 		 affine with cpu 1 in sd f051be80, weight 4
WYT: 		 affine with cpu 2 in sd f051be80, weight 4
WYT: 		 affine with cpu 3 in sd f051be80, weight 4
		...

And there are only sibling, cpu and numa levels here, with no mc level
like your box has, but that looks harmless to me... doesn't it?

These are the aim7 results for the patched kernel; they look just fine.

Tasks    jobs/min  jti  jobs/min/task      real       cpu
    1      424.07  100       424.0728     14.29      4.29   Thu Jan 24 01:52:22 2013
    5     2561.28   99       512.2570     11.83      8.82   Thu Jan 24 01:52:35 2013
   10     5033.22   97       503.3223     12.04     16.35   Thu Jan 24 01:52:47 2013
   20    10350.13   98       517.5064     11.71     28.54   Thu Jan 24 01:52:59 2013
   40    20116.18   98       502.9046     12.05     62.06   Thu Jan 24 01:53:11 2013
   80    39255.06   98       490.6883     12.35    122.18   Thu Jan 24 01:53:24 2013
  160    69405.87   97       433.7867     13.97    234.41   Thu Jan 24 01:53:38 2013
  320   111192.66   92       347.4771     17.44    463.18   Thu Jan 24 01:53:56 2013
  640   158044.01   86       246.9438     24.54    920.38   Thu Jan 24 01:54:20 2013
 1280   199763.07   87       156.0649     38.83   1833.75   Thu Jan 24 01:54:59 2013
 2560   229933.30   81        89.8177     67.47   3665.30   Thu Jan 24 01:56:07 2013

And this is my cpu info:
processor	: 63
cpu		: POWER7 (raw), altivec supported
clock		: 8.388608MHz
revision	: 2.3 (pvr 003f 0203)

Regards,
Michael Wang

> 
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -5609,11 +5609,39 @@ static void update_top_cache_domain(int
>>  static int sbm_max_level;
>>  DEFINE_PER_CPU_SHARED_ALIGNED(struct sched_balance_map, sbm_array);
>>
>> +static void debug_sched_balance_map(int cpu)
>> +{
>> +	int i, type, level = 0;
>> +	struct sched_balance_map *sbm = &per_cpu(sbm_array, cpu);
>> +
>> +	printk("WYT: sbm of cpu %d\n", cpu);
>> +
>> +	for (type = 0; type < SBM_MAX_TYPE; type++) {
>> +		if (type == SBM_EXEC_TYPE)
>> +			printk("WYT: \t exec map\n");
>> +		else if (type == SBM_FORK_TYPE)
>> +			printk("WYT: \t fork map\n");
>> +		else if (type == SBM_WAKE_TYPE)
>> +			printk("WYT: \t wake map\n");
>> +
>> +		for (level = 0; level < sbm_max_level; level++) {
>> +			if (sbm->sd[type][level])
>> +				printk("WYT: \t\t sd %x, idx %d, level %d, weight %d\n", sbm->sd[type][level], level, sbm->sd[type][level]->level, sbm->sd[type][level]->span_weight);
>> +		}
>> +	}
>> +
>> +	printk("WYT: \t affine map\n");
>> +
>> +	for_each_possible_cpu(i) {
>> +		if (sbm->affine_map[i])
>> +			printk("WYT: \t\t affine with cpu %x in sd %x, weight %d\n", i, sbm->affine_map[i], sbm->affine_map[i]->span_weight);
>> +	}
>> +}
>> +
>>  static void build_sched_balance_map(int cpu)
>>  {
>>  	struct sched_balance_map *sbm = &per_cpu(sbm_array, cpu);
>>  	struct sched_domain *sd = cpu_rq(cpu)->sd;
>> -	struct sched_domain *top_sd = NULL;
>>  	int i, type, level = 0;
>>
>>  	memset(sbm->top_level, 0, sizeof((*sbm).top_level));
>> @@ -5656,11 +5684,9 @@ static void build_sched_balance_map(int
>>  	 * fill the hole to get lower level sd easily.
>>  	 */
>>  	for (type = 0; type < SBM_MAX_TYPE; type++) {
>> -		level = sbm->top_level[type];
>> -		top_sd = sbm->sd[type][level];
>> -		if ((++level != sbm_max_level) && top_sd) {
>> -			for (; level < sbm_max_level; level++)
>> -				sbm->sd[type][level] = top_sd;
>> +		for (level = 1; level < sbm_max_level; level++) {
>> +			if (!sbm->sd[type][level])
>> +				sbm->sd[type][level] = sbm->sd[type][level - 1];
>>  		}
>>  	}
>>  }
>> @@ -5719,6 +5745,7 @@ cpu_attach_domain(struct sched_domain *s
>>  	 * destroy_sched_domains() already do the work.
>>  	 */
>>  	build_sched_balance_map(cpu);
>> +//MIKE	debug_sched_balance_map(cpu);
>>  	rcu_assign_pointer(rq->sbm, sbm);
>>  }
>>
>> @@ -6220,7 +6247,7 @@ sd_numa_init(struct sched_domain_topolog
>>  					| 1*SD_BALANCE_NEWIDLE
>>  					| 0*SD_BALANCE_EXEC
>>  					| 0*SD_BALANCE_FORK
>> -					| 0*SD_BALANCE_WAKE
>> +					| 1*SD_BALANCE_WAKE
>>  					| 0*SD_WAKE_AFFINE
>>  					| 0*SD_SHARE_CPUPOWER
>>  					| 0*SD_SHARE_PKG_RESOURCES
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3312,7 +3312,7 @@ static int select_idle_sibling(struct ta
>>  static int
>>  select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>>  {
>> -	struct sched_domain *sd = NULL;
>> +	struct sched_domain *sd = NULL, *tmp;
>>  	int cpu = smp_processor_id();
>>  	int prev_cpu = task_cpu(p);
>>  	int new_cpu = cpu;
>> @@ -3376,31 +3376,45 @@ select_task_rq_fair(struct task_struct *
>>
>>  balance_path:
>>  	new_cpu = (sd_flag & SD_BALANCE_WAKE) ? prev_cpu : cpu;
>> -	sd = sbm->sd[type][sbm->top_level[type]];
>> +	sd = tmp = sbm->sd[type][sbm->top_level[type]];
>>
>>  	while (sd) {
>>  		int load_idx = sd->forkexec_idx;
>> -		struct sched_group *sg = NULL;
>> +		struct sched_group *group;
>> +		int weight;
>> +
>> +		if (!(sd->flags & sd_flag)) {
>> +			sd = sd->child;
>> +			continue;
>> +		}
>>
>>  		if (sd_flag & SD_BALANCE_WAKE)
>>  			load_idx = sd->wake_idx;
>>
>> -		sg = find_idlest_group(sd, p, cpu, load_idx);
>> -		if (!sg)
>> -			goto next_sd;
>> -
>> -		new_cpu = find_idlest_cpu(sg, p, cpu);
>> -		if (new_cpu != -1)
>> -			cpu = new_cpu;
>> -next_sd:
>> -		if (!sd->level)
>> -			break;
>> -
>> -		sbm = cpu_rq(cpu)->sbm;
>> -		if (!sbm)
>> -			break;
>> -
>> -		sd = sbm->sd[type][sd->level - 1];
> 
> Maybe we could test it part by part? I'm planning to write another debug
> patch with which we could compare just parts of the two paths; I will
> send it to you when I finish it.
> 
> Regards,
> Michael Wang
> 
>> +		group = find_idlest_group(sd, p, cpu, load_idx);
>> +		if (!group) {
>> +			sd = sd->child;
>> +			continue;
>> +		}
>> +
>> +		new_cpu = find_idlest_cpu(group, p, cpu);
>> +		if (new_cpu == -1 || new_cpu == cpu) {
>> +			/* Now try balancing at a lower domain level of cpu */
>> +			sd = sd->child;
>> +			continue;
>> +		}
>> +
>> +		/* Now try balancing at a lower domain level of new_cpu */
>> +		cpu = new_cpu;
>> +		weight = sd->span_weight;
>> +		sd = NULL;
>> +		for_each_domain(cpu, tmp) {
>> +			if (weight <= tmp->span_weight)
>> +				break;
>> +			if (tmp->flags & sd_flag)
>> +				sd = tmp;
>> +		}
>> +		/* while loop will break here if sd == NULL */
>>  	}
>>
>>  unlock:
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
>>
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-24  6:51                                                               ` Mike Galbraith
@ 2013-01-24  7:15                                                                 ` Michael Wang
  2013-01-24  7:47                                                                   ` Mike Galbraith
  0 siblings, 1 reply; 57+ messages in thread
From: Michael Wang @ 2013-01-24  7:15 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On 01/24/2013 02:51 PM, Mike Galbraith wrote:
> On Thu, 2013-01-24 at 14:01 +0800, Michael Wang wrote:
> 
>> I've enabled WAKE flag on my box like you did, but still can't see
>> regression, and I've just tested on a power server with 64 cpu, also
>> failed to reproduce the issue (not compared with virgin yet, but can't
>> see collapse).
> 
> I'm not surprised.  I'm seeing enough inconsistent crap to come to the
> conclusion that stock scheduler knobs flat can't be used on a largish
> box, they're just too preempt-happy, leading to weird crap.
> 
> My 2 missing nodes came back, and the very same kernel that highly
> repeatably collapsed with 2 nodes does not with 4 nodes, and 2 nodes
> does not collapse with only preemption knob tweaking, and that's
> bullshit.  Virgin shows instability in the mid-range, make a tiny tweak
> that should have little if any effect there, and that instability
> vanishes entirely.  Test runs are not consistent enough boot to boot etc
> etc.  Either stock knobs suck on NUMA boxen, or this box is possessed.

Mike, I wonder whether the reason changing back to the old way makes
the collapse go away may not be a logic error in the new balance path;
it just changed the cost of select_task_rq(), and whether that cost is
higher or lower, it accidentally achieves the same effect as your knob
tweaking, which is why the old path looks better than the new one.
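
If it is about cost, something as dumb as the sketch below could check
that; the counters and the wrapper are made up for illustration and are
not part of the patches (select_task_rq() signature as in 3.8):

        /* kernel/sched/core.c: crude per-cpu cost accounting */
        DEFINE_PER_CPU(u64, strq_ns);
        DEFINE_PER_CPU(u64, strq_calls);

        static int timed_select_task_rq(struct task_struct *p,
                                        int sd_flags, int wake_flags)
        {
                u64 t0 = local_clock();
                int cpu = select_task_rq(p, sd_flags, wake_flags);

                this_cpu_add(strq_ns, local_clock() - t0);
                this_cpu_inc(strq_calls);
                return cpu;
        }

Dumping the two counters for the old and the new kernel under the same
aim7 load would show whether the average cost per call really moved.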

Regards,
Michael Wang

> 
> -Mike
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-24  7:15                                                                 ` Michael Wang
@ 2013-01-24  7:47                                                                   ` Mike Galbraith
  2013-01-24  8:14                                                                     ` Michael Wang
  0 siblings, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2013-01-24  7:47 UTC (permalink / raw)
  To: Michael Wang; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On Thu, 2013-01-24 at 15:15 +0800, Michael Wang wrote: 
> On 01/24/2013 02:51 PM, Mike Galbraith wrote:
> > On Thu, 2013-01-24 at 14:01 +0800, Michael Wang wrote:
> > 
> >> I've enabled WAKE flag on my box like you did, but still can't see
> >> regression, and I've just tested on a power server with 64 cpu, also
> >> failed to reproduce the issue (not compared with virgin yet, but can't
> >> see collapse).
> > 
> > I'm not surprised.  I'm seeing enough inconsistent crap to come to the
> > conclusion that stock scheduler knobs flat can't be used on a largish
> > box, they're just too preempt-happy, leading to weird crap.
> > 
> > My 2 missing nodes came back, and the very same kernel that highly
> > repeatably collapsed with 2 nodes does not with 4 nodes, and 2 nodes
> > does not collapse with only preemption knob tweaking, and that's
> > bullshit.  Virgin shows instability in the mid-range, make a tiny tweak
> > that should have little if any effect there, and that instability
> > vanishes entirely.  Test runs are not consistent enough boot to boot etc
> > etc.  Either stock knobs suck on NUMA boxen, or this box is possessed.
> 
> Mike, I wonder whether the reason changing back to the old way makes
> the collapse go away may not be a logic error in the new balance path;
> it just changed the cost of select_task_rq(), and whether that cost is
> higher or lower, it accidentally achieves the same effect as your knob
> tweaking, which is why the old path looks better than the new one.

That's what I'm saying, it's a useless crap side-effect of a preempt
happy kernel.  Results with these knobs are just not stable.  Results go
wildly unstable with 2 nodes vs 4 in this box, but can be stabilized in
all with preemption knob adjustment.. or phase of moon might make them
appear stable.. or not.

-Mike


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-24  7:47                                                                   ` Mike Galbraith
@ 2013-01-24  8:14                                                                     ` Michael Wang
  2013-01-24  9:07                                                                       ` Mike Galbraith
  0 siblings, 1 reply; 57+ messages in thread
From: Michael Wang @ 2013-01-24  8:14 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On 01/24/2013 03:47 PM, Mike Galbraith wrote:
> On Thu, 2013-01-24 at 15:15 +0800, Michael Wang wrote: 
>> On 01/24/2013 02:51 PM, Mike Galbraith wrote:
>>> On Thu, 2013-01-24 at 14:01 +0800, Michael Wang wrote:
>>>
>>>> I've enabled WAKE flag on my box like you did, but still can't see
>>>> regression, and I've just tested on a power server with 64 cpu, also
>>>> failed to reproduce the issue (not compared with virgin yet, but can't
>>>> see collapse).
>>>
>>> I'm not surprised.  I'm seeing enough inconsistent crap to come to the
>>> conclusion that stock scheduler knobs flat can't be used on a largish
>>> box, they're just too preempt-happy, leading to weird crap.
>>>
>>> My 2 missing nodes came back, and the very same kernel that highly
>>> repeatably collapsed with 2 nodes does not with 4 nodes, and 2 nodes
>>> does not collapse with only preemption knob tweaking, and that's
>>> bullshit.  Virgin shows instability in the mid-range, make a tiny tweak
>>> that should have little if any effect there, and that instability
>>> vanishes entirely.  Test runs are not consistent enough boot to boot etc
>>> etc.  Either stock knobs suck on NUMA boxen, or this box is possessed.
>>
>> Mike, I wonder whether the reason changing back to the old way makes
>> the collapse go away may not be a logic error in the new balance path;
>> it just changed the cost of select_task_rq(), and whether that cost is
>> higher or lower, it accidentally achieves the same effect as your knob
>> tweaking, which is why the old path looks better than the new one.
> 
> That's what I'm saying, it's a useless crap side-effect of a preempt
> happy kernel.  Results with these knobs are just not stable.  Results go
> wildly unstable with 2 nodes vs 4 in this box, but can be stabilized in
> all with preemption knob adjustment.. or phase of moon might make them
> appear stable.. or not.

Yeah, it's time to stop blaming the patch now; it's not the real killer
on your box.

Well, at least it was worth being tortured on it: we found several
points I had missed, we are more familiar with the balance path now, and
we found some places we could do better. All of this is thanks to your
kind help; it's nice to work with you ;-)

Now it's time to work on v3 I think, let's see what we could get this time.

Regards,
Michael Wang

> 
> -Mike
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-24  8:14                                                                     ` Michael Wang
@ 2013-01-24  9:07                                                                       ` Mike Galbraith
  2013-01-24  9:26                                                                         ` Michael Wang
  0 siblings, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2013-01-24  9:07 UTC (permalink / raw)
  To: Michael Wang; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On Thu, 2013-01-24 at 16:14 +0800, Michael Wang wrote:

> Now it's time to work on v3 I think, let's see what we could get this time.

Maybe v3 can try to not waste so much ram on affine map?

Even better would be if it could just go away, along with relic of the
bad old days wake_affine(), and we make the balance path so damn light
but clever that select_idle_sibling() can go away too... and a pony ;-)

-Mike


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-24  9:07                                                                       ` Mike Galbraith
@ 2013-01-24  9:26                                                                         ` Michael Wang
  2013-01-24 10:34                                                                           ` Mike Galbraith
  0 siblings, 1 reply; 57+ messages in thread
From: Michael Wang @ 2013-01-24  9:26 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On 01/24/2013 05:07 PM, Mike Galbraith wrote:
> On Thu, 2013-01-24 at 16:14 +0800, Michael Wang wrote:
> 
>> Now it's time to work on v3 I think, let's see what we could get this time.
> 
> Maybe v3 can try to not waste so much ram on affine map?

Yeah, that has been a question in my mind at very beginning, but how...

> 
> Even better would be if it could just go away, along with relic of the
> bad old days wake_affine(), and we make the balance path so damn light
> but clever that select_idle_sibling() can go away too... and a pony ;-)

Hmm...may be, I need some consideration here, a totally balance path,
interesting...

But I think we still need the clean code which sbm bring to us, do we?

Regards,
Michael Wang

> 
> -Mike
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-24  9:26                                                                         ` Michael Wang
@ 2013-01-24 10:34                                                                           ` Mike Galbraith
  2013-01-25  2:14                                                                             ` Michael Wang
  0 siblings, 1 reply; 57+ messages in thread
From: Mike Galbraith @ 2013-01-24 10:34 UTC (permalink / raw)
  To: Michael Wang; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On Thu, 2013-01-24 at 17:26 +0800, Michael Wang wrote: 
> On 01/24/2013 05:07 PM, Mike Galbraith wrote:
> > On Thu, 2013-01-24 at 16:14 +0800, Michael Wang wrote:
> > 
> >> Now it's time to work on v3 I think, let's see what we could get this time.
> > 
> > Maybe v3 can try to not waste so much ram on affine map?
> 
> Yeah, that has been a question in my mind at very beginning, but how...

Allocate at domain build time the max we can acquire via hotplug?
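
Something like the below perhaps; a rough sketch only, it assumes
affine_map becomes a struct sched_domain ** in the sbm rather than a
fixed NR_CPUS-sized array, and the helper name is made up:

        /* size affine_map by nr_cpu_ids at domain build time */
        static int alloc_sbm_affine_map(struct sched_balance_map *sbm)
        {
                sbm->affine_map = kzalloc(nr_cpu_ids * sizeof(*sbm->affine_map),
                                          GFP_KERNEL);
                return sbm->affine_map ? 0 : -ENOMEM;
        }

Called once per cpu before build_sched_balance_map(), it would only pay
for the cpus that can ever show up via hotplug instead of for NR_CPUS.
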
> > Even better would be if it could just go away, along with relic of the
> > bad old days wake_affine(), and we make the balance path so damn light
> > but clever that select_idle_sibling() can go away too... and a pony ;-)
> 
> Hmm...may be, I need some consideration here, a totally balance path,
> interesting...

Unification is the right target, hitting it might not be so easy though.

> But I think we still need the clean code which sbm bring to us, do we?

Sure, if it makes things perform better.

-Mike


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-24 10:34                                                                           ` Mike Galbraith
@ 2013-01-25  2:14                                                                             ` Michael Wang
  0 siblings, 0 replies; 57+ messages in thread
From: Michael Wang @ 2013-01-25  2:14 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, mingo, peterz, mingo, a.p.zijlstra

On 01/24/2013 06:34 PM, Mike Galbraith wrote:
> On Thu, 2013-01-24 at 17:26 +0800, Michael Wang wrote: 
>> On 01/24/2013 05:07 PM, Mike Galbraith wrote:
>>> On Thu, 2013-01-24 at 16:14 +0800, Michael Wang wrote:
>>>
>>>> Now it's time to work on v3 I think, let's see what we could get this time.
>>>
>>> Maybe v3 can try to not waste so much ram on affine map?
>>
>> Yeah, that has been a question in my mind at very beginning, but how...
> 
> Allocate at domain build time the max we can acquire via hotplug?
>>> Even better would be if it could just go away, along with relic of the
>>> bad old days wake_affine(), and we make the balance path so damn light
>>> but clever that select_idle_sibling() can go away too... and a pony ;-)
>>
>> Hmm...may be, I need some consideration here, a totally balance path,
>> interesting...
> 
> Unification is the right target, hitting it might not be so easy though.
> 
>> But I think we still need the clean code which sbm bring to us, do we?
> 
> Sure, if it makes things perform better.

OK, let me think about these; big changes may not show up in v3, but I
will keep them in mind.

Regards,
Michael Wang

> 
> -Mike
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-11 10:13 ` Nikunj A Dadhania
@ 2013-01-15  2:20   ` Michael Wang
  0 siblings, 0 replies; 57+ messages in thread
From: Michael Wang @ 2013-01-15  2:20 UTC (permalink / raw)
  To: Nikunj A Dadhania
  Cc: LKML, Ingo Molnar, Peter Zijlstra, Paul Turner, Tejun Heo,
	Mike Galbraith, Andrew Morton

On 01/11/2013 06:13 PM, Nikunj A Dadhania wrote:
> Hi Michael,
> 
> Michael Wang <wangyun@linux.vnet.ibm.com> writes:
>> 	Prev:
>> 		+---------+---------+-------+
>> 		| 7484 MB |      32 | 42463 |
>> 	Post:
>> 		| 7483 MB |      32 | 44185 |		+0.18%
> That should be +4.05%

Hi, Nikunj

Thanks for your notify, that's my mistake on the calculation, will
correct it.

Regards,
Michael Wang

> 
> Regards
> Nikunj
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
  2013-01-11  8:15 Michael Wang
@ 2013-01-11 10:13 ` Nikunj A Dadhania
  2013-01-15  2:20   ` Michael Wang
  0 siblings, 1 reply; 57+ messages in thread
From: Nikunj A Dadhania @ 2013-01-11 10:13 UTC (permalink / raw)
  To: Michael Wang, LKML
  Cc: Ingo Molnar, Peter Zijlstra, Paul Turner, Tejun Heo,
	Mike Galbraith, Andrew Morton

Hi Michael,

Michael Wang <wangyun@linux.vnet.ibm.com> writes:
> 	Prev:
> 		+---------+---------+-------+
> 		| 7484 MB |      32 | 42463 |
> 	Post:
> 		| 7483 MB |      32 | 44185 |		+0.18%
That should be +4.05%

Regards
Nikunj


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
@ 2013-01-11  8:15 Michael Wang
  2013-01-11 10:13 ` Nikunj A Dadhania
  0 siblings, 1 reply; 57+ messages in thread
From: Michael Wang @ 2013-01-11  8:15 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Peter Zijlstra, Paul Turner, Tejun Heo,
	Mike Galbraith, Andrew Morton

This patch set is trying to simplify the select_task_rq_fair() with
schedule balance map.

After get rid of the complex code and reorganize the logical, pgbench show
the improvement.

	Prev:
		| db_size | clients |  tps  |
		+---------+---------+-------+
		| 22 MB   |       1 |  4437 |
		| 22 MB   |      16 | 51351 |
		| 22 MB   |      32 | 49959 |
		| 7484 MB |       1 |  4078 |
		| 7484 MB |      16 | 44681 |
		| 7484 MB |      32 | 42463 |
		| 15 GB   |       1 |  3992 |
		| 15 GB   |      16 | 44107 |
		| 15 GB   |      32 | 41797 |

	Post:
		| db_size | clients |  tps  |
		+---------+---------+-------+
		| 22 MB   |       1 | 11053 |		+149.11%
		| 22 MB   |      16 | 55671 |		+8.41%
		| 22 MB   |      32 | 52596 |		+5.28%
		| 7483 MB |       1 |  8180 |		+100.59%
		| 7483 MB |      16 | 48392 |		+8.31%
		| 7483 MB |      32 | 44185 |		+0.18%
		| 15 GB   |       1 |  8127 |		+103.58%
		| 15 GB   |      16 | 48156 |		+9.18%
		| 15 GB   |      32 | 43387 |		+3.8%

Please check the patch for more details about schedule balance map, they
currently based on linux-next 3.7.0-rc6, will rebase them to tip tree in
follow version.

Comments are very welcomed.

Test with:
	12 cpu X86 server and linux-next 3.7.0-rc6.

Michael Wang (2):
	[PATCH 1/2] sched: schedule balance map foundation
	[PATCH 2/2] sched: simplify select_task_rq_fair() with schedule balance map

Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
---
 core.c  |   61 +++++++++++++++++++++++++++++
 fair.c  |  133 +++++++++++++++++++++++++++++++++-------------------------------
 sched.h |   28 +++++++++++++
 3 files changed, 159 insertions(+), 63 deletions(-)


^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2013-01-25  2:14 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1356588535-23251-1-git-send-email-wangyun@linux.vnet.ibm.com>
2013-01-09  9:28 ` [RFC PATCH 0/2] sched: simplify the select_task_rq_fair() Michael Wang
2013-01-12  8:01   ` Mike Galbraith
2013-01-12 10:19     ` Mike Galbraith
2013-01-14  9:21       ` Mike Galbraith
2013-01-15  3:10         ` Michael Wang
2013-01-15  4:52           ` Mike Galbraith
2013-01-15  8:26             ` Michael Wang
2013-01-17  5:55         ` Michael Wang
2013-01-20  4:09           ` Mike Galbraith
2013-01-21  2:50             ` Michael Wang
2013-01-21  4:38               ` Mike Galbraith
2013-01-21  5:07                 ` Michael Wang
2013-01-21  6:42                   ` Mike Galbraith
2013-01-21  7:09                     ` Mike Galbraith
2013-01-21  7:45                       ` Michael Wang
2013-01-21  9:09                         ` Mike Galbraith
2013-01-21  9:22                           ` Michael Wang
2013-01-21  9:44                             ` Mike Galbraith
2013-01-21 10:30                               ` Mike Galbraith
2013-01-22  3:43                               ` Michael Wang
2013-01-22  8:03                                 ` Mike Galbraith
2013-01-22  8:56                                   ` Michael Wang
2013-01-22 11:34                                     ` Mike Galbraith
2013-01-23  3:01                                       ` Michael Wang
2013-01-23  5:02                                         ` Mike Galbraith
2013-01-22 14:41                                     ` Mike Galbraith
2013-01-23  2:44                                       ` Michael Wang
2013-01-23  4:31                                         ` Mike Galbraith
2013-01-23  5:09                                           ` Michael Wang
2013-01-23  6:28                                             ` Mike Galbraith
2013-01-23  7:10                                               ` Michael Wang
2013-01-23  8:20                                                 ` Mike Galbraith
2013-01-23  8:30                                                   ` Michael Wang
2013-01-23  8:49                                                     ` Mike Galbraith
2013-01-23  9:00                                                       ` Michael Wang
2013-01-23  9:18                                                         ` Mike Galbraith
2013-01-23  9:26                                                           ` Michael Wang
2013-01-23  9:37                                                             ` Mike Galbraith
2013-01-23  9:32                                                           ` Mike Galbraith
2013-01-24  6:01                                                             ` Michael Wang
2013-01-24  6:51                                                               ` Mike Galbraith
2013-01-24  7:15                                                                 ` Michael Wang
2013-01-24  7:47                                                                   ` Mike Galbraith
2013-01-24  8:14                                                                     ` Michael Wang
2013-01-24  9:07                                                                       ` Mike Galbraith
2013-01-24  9:26                                                                         ` Michael Wang
2013-01-24 10:34                                                                           ` Mike Galbraith
2013-01-25  2:14                                                                             ` Michael Wang
2013-01-24  7:00                                                               ` Michael Wang
2013-01-21  7:34                     ` Michael Wang
2013-01-21  8:26                       ` Mike Galbraith
2013-01-21  8:46                         ` Michael Wang
2013-01-21  9:11                           ` Mike Galbraith
2013-01-15  2:46     ` Michael Wang
2013-01-11  8:15 Michael Wang
2013-01-11 10:13 ` Nikunj A Dadhania
2013-01-15  2:20   ` Michael Wang
