* [RFC][PATCH 0/15] sched: Fix scheduling for multi-node processors
@ 2009-08-20 13:12 Andreas Herrmann
  2009-08-20 13:15 ` [PATCH 1/15] x86, sched: Add config option for multi-node CPU scheduling Andreas Herrmann
                   ` (14 more replies)
  0 siblings, 15 replies; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-20 13:12 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel

Hi,

Subsequent patches adapt scheduling code to support multi-node processors.

In short, the changes must meet two requirements:

(1) The set of CPUs in a NUMA node no longer necessarily spans entire
    sockets. (Current code assumes it does; see the illustration below.)

(2) The additional hierarchy in the CPU topology (i.e. the internal
    node) can be exploited for load balancing when power saving matters.
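
    To illustrate (1): on the 24-CPU Magny-Cours test system used in
    the examples below, each socket spans two NUMA nodes:

      socket 0: CPUs  0-11  =  node 0 (0-5)   + node 1 (6-11)
      socket 1: CPUs 12-23  =  node 3 (12-17) + node 2 (18-23)

    (Node numbering as reported by numactl in example (3) below;
    socket spans as in the MN domain dumps in example (2).)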

Patches 1-7 add basic support for a new scheduling domain (called MN for multi-node)
Patch 8 adds a knob to control power_savings balancing for MN domain
Patches 9, 10 add the snippets to do the power_savings balancing for MN domain
Patch 11 adds a way to pass unlimited __cpu_power information to upper domain levels
Patch 12 allows NODE domain to be parent of MC domain (and thus child of MN domain)
Patch 13 detects whether NODE domain is parent of MC instead of CPU domain
Patch 14 fixes perf policy scheduling when NODE domain is parent of MC domain
Patch 15 fixes cpu_coregroup_mask to use mask of core_siblings instead of node_siblings
         (I admit that this change would better belong with my topology patches.)

To apply the patches you need to use tip/master as of today
(containing the sched cleanup patches) plus the 8 topology patches
that I have sent recently. (See
http://marc.info/?l=linux-kernel&m=124964980507887)

The full set of power saving scheduling options on multi-node
processors is only available with CONFIG_SCHED_MN=y. See the examples
and example output below.


Regards,

Andreas

PS: I'm sending this as an RFC, although it seems pretty stable. I
    want to do some more testing before asking for it to be applied
    to the tip tree, and I also want to check the performance impact
    of the new sched domain level.

-- 
Operating | Advanced Micro Devices GmbH
  System  | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. München, Germany
 Research | Geschäftsführer: Thomas M. McCoy, Giuliano Meroni
  Center  | Sitz: Dornach, Gemeinde Aschheim, Landkreis München
  (OSRC)  | Registergericht München, HRB Nr. 43632
--------------------------------------------------------------------------------
Examples:
=========
(1) To demonstrate power_savings balancing I provide top output while
    the system is partially loaded.

 (Note: sched_mc_power_savings=sched_mn_power_savings=0)

 # for i in `seq 1 6`; do nbench& done

 top - 15:41:08 up 7 min,  3 users,  load average: 2.49, 0.64, 0.21
 Tasks: 267 total,   7 running, 260 sleeping,   0 stopped,   0 zombie
 Cpu0  : 49.7%us,  0.0%sy,  0.0%ni, 50.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu5  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu6  : 50.2%us,  0.0%sy,  0.0%ni, 49.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu8  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu9  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu10 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu11 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu12 : 48.8%us,  0.0%sy,  0.0%ni, 51.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu13 :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu14 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu15 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu16 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu17 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu18 : 50.8%us,  0.0%sy,  0.0%ni, 49.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu19 :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu20 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu21 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu22 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu23 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

 # echo 1  >> /sys/devices/system/cpu/sched_mn_power_savings

 top - 15:42:27 up 8 min,  3 users,  load average: 3.91, 1.49, 0.54
 Tasks: 267 total,   7 running, 260 sleeping,   0 stopped,   0 zombie
 Cpu0  :  0.0%us,  0.3%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu1  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu2  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu5  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu6  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu7  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu8  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu9  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu10 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu11 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu12 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu13 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu14 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu15 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu16 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu17 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu18 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu19 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu20 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu21 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu22 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu23 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

 # echo 2  >> /sys/devices/system/cpu/sched_mc_power_savings 

 top - 15:43:09 up 9 min,  3 users,  load average: 4.93, 2.06, 0.77
 Tasks: 267 total,   7 running, 260 sleeping,   0 stopped,   0 zombie
 Cpu0  : 99.0%us,  0.0%sy,  0.0%ni,  1.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu1  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu2  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu3  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu4  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu5  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu6  :  0.7%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu8  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu9  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu10 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu11 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu12 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu13 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu14 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu15 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu16 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu17 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu18 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu19 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu20 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu21 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu22 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu23 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

 # echo 0  >> /sys/devices/system/cpu/sched_mc_power_savings 
 # echo 0  >> /sys/devices/system/cpu/sched_mn_power_savings 

 top - 15:44:22 up 10 min,  3 users,  load average: 5.38, 2.86, 1.15
 Tasks: 267 total,   7 running, 260 sleeping,   0 stopped,   0 zombie
 Cpu0  : 49.2%us,  0.3%sy,  0.0%ni, 50.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu5  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu6  : 50.8%us,  0.0%sy,  0.0%ni, 49.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu8  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu9  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu10 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu11 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu12 : 49.2%us,  0.0%sy,  0.0%ni, 50.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu13 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu14 :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu15 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu16 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu17 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu18 : 50.7%us,  0.0%sy,  0.0%ni, 49.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu19 :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu20 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu21 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu22 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu23 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

--------------------------------------------------------------------------------
(2) To illustrate the new domain hierarchy I give examples of the sched
    domains and groups of CPU 23 on my test system:

CONFIG_SCHED_MN=n
sched_mc_power_savings=0

 CPU23 attaching sched-domain:
  domain 0: span 18-23 level MC
   groups: 23 18 19 20 21 22
   domain 1: span 18-23 level NODE
    groups: 18-23 (__cpu_power = 6144)
    domain 2: span 0-23 level CPU
     groups: 18-23 0-5 6-11 12-17

CONFIG_SCHED_MN=n
sched_mc_power_savings=2

 CPU23 attaching sched-domain:
  domain 0: span 18-23 level MC
   groups: 23 18 19 20 21 22
   domain 1: span 18-23 level NODE
    groups: 18-23 (__cpu_power = 6144)
    domain 2: span 0-23 level CPU
     groups: 18-23 (__cpu_power = 6144) 0-5 (__cpu_power = 6144) 6-11 (__cpu_power = 6144) 12-17 (__cpu_power = 6144)

CONFIG_SCHED_MN=y, CONFIG_NUMA=y
sched_mc_power_savings=0, sched_mn_power_savings=0

 CPU23 attaching sched-domain:
  domain 0: span 18-23 level MC
   groups: 23 18 19 20 21 22
   domain 1: span 18-23 level NODE
    groups: 18-23 (__cpu_power = 6144)
    domain 2: span 12-23 level MN
     groups: 18-23 12-17
     domain 3: span 0-23 level CPU
      groups: 12-23 0-11

CONFIG_SCHED_MN=y, CONFIG_NUMA=y
sched_mc_power_savings=0, sched_mn_power_savings=1

 CPU23 attaching sched-domain:
  domain 0: span 18-23 level MC
   groups: 23 18 19 20 21 22
   domain 1: span 18-23 level NODE
    groups: 18-23 (__cpu_power = 6144)
    domain 2: span 12-23 level MN
     groups: 18-23 12-17
     domain 3: span 0-23 level CPU
      groups: 12-23 (__cpu_power = 12288) 0-11 (__cpu_power = 12288)

CONFIG_SCHED_MN=y, CONFIG_NUMA=y
sched_mc_power_savings=2, sched_mn_power_savings=0

 CPU23 attaching sched-domain:
  domain 0: span 18-23 level MC
   groups: 23 18 19 20 21 22
   domain 1: span 18-23 level NODE
    groups: 18-23 (__cpu_power = 6144)
    domain 2: span 12-23 level MN
     groups: 18-23 (__cpu_power = 6144) 12-17 (__cpu_power = 6144)
     domain 3: span 0-23 level CPU
      groups: 12-23 (__cpu_power = 12288) 0-11 (__cpu_power = 12288)

CONFIG_SCHED_MN=y, CONFIG_NUMA=y, CONFIG_ACPI_NUMA=n
(and CONFIG_SCHED_MN=y, CONFIG_NUMA=n)
sched_mc_power_savings=0, sched_mn_power_savings=0

 CPU23 attaching sched-domain:
  domain 0: span 18-23 level MC
   groups: 23 18 19 20 21 22
   domain 1: span 12-23 level MN
    groups: 18-23 12-17
    domain 2: span 0-23 level CPU
     groups: 12-23 0-11

CONFIG_SCHED_MN=y, CONFIG_NUMA=y, CONFIG_ACPI_NUMA=n
(and CONFIG_SCHED_MN=y, CONFIG_NUMA=n)
sched_mc_power_savings=0, sched_mn_power_savings=1

 CPU23 attaching sched-domain:
  domain 0: span 18-23 level MC
   groups: 23 18 19 20 21 22
   domain 1: span 12-23 level MN
    groups: 18-23 12-17
    domain 2: span 0-23 level CPU
     groups: 12-23 (__cpu_power = 12288) 0-11 (__cpu_power = 12288)

CONFIG_SCHED_MN=y, CONFIG_NUMA=y, CONFIG_ACPI_NUMA=n
(and CONFIG_SCHED_MN=y, CONFIG_NUMA=n)
sched_mc_power_savings=2, sched_mn_power_savings=0

 CPU23 attaching sched-domain:
  domain 0: span 18-23 level MC
   groups: 23 18 19 20 21 22
   domain 1: span 12-23 level MN
    groups: 18-23 (__cpu_power = 6144) 12-17 (__cpu_power = 6144)
    domain 2: span 0-23 level CPU
     groups: 12-23 (__cpu_power = 12288) 0-11 (__cpu_power = 12288)

--------------------------------------------------------------------------------
(3) Further information -- just for completeness.

With NUMA support and SRAT detection the kernel uses the following
NUMA information:

  # numactl --hardware
 available: 4 nodes (0-3)
 node 0 cpus: 0 1 2 3 4 5
 node 0 size: 2047 MB
 node 0 free: 1761 MB
 node 1 cpus: 6 7 8 9 10 11
 node 1 size: 2046 MB
 node 1 free: 1990 MB
 node 2 cpus: 18 19 20 21 22 23
 node 2 size: 2048 MB
 node 2 free: 2004 MB
 node 3 cpus: 12 13 14 15 16 17
 node 3 size: 2048 MB
 node 3 free: 2002 MB
 node distances:
 node   0   1   2   3 
   0:  10  16  16  16 
   1:  16  10  16  16 
   2:  16  16  10  16 
   3:  16  16  16  10 

Without ACPI SRAT support (e.g. CONFIG_ACPI_NUMA=n) the NUMA
information is:

 # numactl  --hardware
 available: 1 nodes (0-0)
 node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 node 0 size: 8189 MB
 node 0 free: 7900 MB
 node distances:
 node   0 
   0:  10

--------------------------------------------------------------------------------
FINI




* [PATCH 1/15] x86, sched: Add config option for multi-node CPU scheduling
  2009-08-20 13:12 [RFC][PATCH 0/15] sched: Fix scheduling for multi-node processors Andreas Herrmann
@ 2009-08-20 13:15 ` Andreas Herrmann
  2009-08-21 13:50   ` Valdis.Kletnieks
  2009-08-20 13:34 ` [PATCH 2/15] sched, x86: Provide initializer for MN scheduling domain, define MN level Andreas Herrmann
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-20 13:15 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel


I've decided to add this as a sub-item of MC scheduling. I expect the
normal case to be that a multi-node CPU has more than one core on each
of its nodes, so using MN scheduling without MC scheduling does not
make much sense.

Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
---
 arch/x86/Kconfig |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 65fb791..594e7bc 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -713,6 +713,16 @@ config SCHED_MC
 	  making when dealing with multi-core CPU chips at a cost of slightly
 	  increased overhead in some places. If unsure say N here.
 
+config SCHED_MN
+	def_bool n
+	prompt "Multi-node CPU scheduler support"
+	depends on X86_HT && SCHED_MC
+	---help---
+	  Multi-node CPU scheduler support improves the CPU
+	  scheduler's decision making when dealing with multi-node
+	  CPUs (e.g. AMD Magny-Cours) at a cost of slightly increased
+	  overhead in some places. If unsure say N here.
+
 source "kernel/Kconfig.preempt"
 
 config X86_UP_APIC
-- 
1.6.0.4





* [PATCH 2/15] sched, x86: Provide initializer for MN scheduling domain, define MN level
  2009-08-20 13:12 [RFC][PATCH 0/15] sched: Fix scheduling for multi-node processors Andreas Herrmann
  2009-08-20 13:15 ` [PATCH 1/15] x86, sched: Add config option for multi-node CPU scheduling Andreas Herrmann
@ 2009-08-20 13:34 ` Andreas Herrmann
  2009-08-20 13:34 ` [PATCH 3/15] sched: Add cpumask to be used when building MN domain Andreas Herrmann
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-20 13:34 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel


Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
---
 arch/x86/include/asm/topology.h |   25 +++++++++++++++++++++++++
 include/linux/sched.h           |    1 +
 2 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index d53ef91..6d7d133 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -181,6 +181,31 @@ static inline void setup_node_to_cpumask_map(void) { }
 
 #endif
 
+#ifdef CONFIG_SCHED_MN
+/* Common values for multi-node siblings */
+#ifndef SD_MN_INIT
+#define SD_MN_INIT (struct sched_domain) {		\
+	.min_interval		= 1,			\
+	.max_interval		= 4,			\
+	.busy_factor		= 64,			\
+	.imbalance_pct		= 125,			\
+	.cache_nice_tries	= 1,			\
+	.busy_idx		= 2,			\
+	.wake_idx		= 1,			\
+	.forkexec_idx		= 1,			\
+	.flags			= SD_LOAD_BALANCE	\
+				| SD_BALANCE_FORK	\
+				| SD_BALANCE_EXEC	\
+				| SD_WAKE_AFFINE	\
+				| SD_WAKE_BALANCE	\
+				| sd_balance_for_package_power()\
+				| sd_power_saving_flags(),\
+	.last_balance		= jiffies,		\
+	.balance_interval	= 1,			\
+}
+#endif
+#endif /* CONFIG_SCHED_MN */
+
 #include <asm-generic/topology.h>
 
 extern const struct cpumask *cpu_coregroup_mask(int cpu);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index af1e328..3a1f8db 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -901,6 +901,7 @@ enum sched_domain_level {
 	SD_LV_NONE = 0,
 	SD_LV_SIBLING,
 	SD_LV_MC,
+	SD_LV_MN,
 	SD_LV_CPU,
 	SD_LV_NODE,
 	SD_LV_ALLNODES,
-- 
1.6.0.4





* [PATCH 3/15] sched: Add cpumask to be used when building MN domain
  2009-08-20 13:12 [RFC][PATCH 0/15] sched: Fix scheduling for multi-node processors Andreas Herrmann
  2009-08-20 13:15 ` [PATCH 1/15] x86, sched: Add config option for multi-node CPU scheduling Andreas Herrmann
  2009-08-20 13:34 ` [PATCH 2/15] sched, x86: Provide initializer for MN scheduling domain, define MN level Andreas Herrmann
@ 2009-08-20 13:34 ` Andreas Herrmann
  2009-08-20 13:35 ` [PATCH 4/15] sched: Define per CPU variables and cpu_to_group function for " Andreas Herrmann
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-20 13:34 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel


Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
---
 kernel/sched.c |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index c780eed..9990c3a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8207,6 +8207,7 @@ struct s_data {
 	cpumask_var_t		nodemask;
 	cpumask_var_t		this_sibling_map;
 	cpumask_var_t		this_core_map;
+	cpumask_var_t		this_cpu_node_map;
 	cpumask_var_t		send_covered;
 	cpumask_var_t		tmpmask;
 	struct sched_group	**sched_group_nodes;
@@ -8218,6 +8219,7 @@ enum s_alloc {
 	sa_rootdomain,
 	sa_tmpmask,
 	sa_send_covered,
+	sa_this_cpu_node_map,
 	sa_this_core_map,
 	sa_this_sibling_map,
 	sa_nodemask,
@@ -8594,6 +8596,8 @@ static void __free_domain_allocs(struct s_data *d, enum s_alloc what,
 		free_cpumask_var(d->tmpmask); /* fall through */
 	case sa_send_covered:
 		free_cpumask_var(d->send_covered); /* fall through */
+	case sa_this_cpu_node_map:
+		free_cpumask_var(d->this_cpu_node_map); /* fall through */
 	case sa_this_core_map:
 		free_cpumask_var(d->this_core_map); /* fall through */
 	case sa_this_sibling_map:
@@ -8640,8 +8644,10 @@ static enum s_alloc __visit_domain_allocation_hell(struct s_data *d,
 		return sa_nodemask;
 	if (!alloc_cpumask_var(&d->this_core_map, GFP_KERNEL))
 		return sa_this_sibling_map;
-	if (!alloc_cpumask_var(&d->send_covered, GFP_KERNEL))
+	if (!alloc_cpumask_var(&d->this_cpu_node_map, GFP_KERNEL))
 		return sa_this_core_map;
+	if (!alloc_cpumask_var(&d->send_covered, GFP_KERNEL))
+		return sa_this_cpu_node_map;
 	if (!alloc_cpumask_var(&d->tmpmask, GFP_KERNEL))
 		return sa_send_covered;
 	d->rd = alloc_rootdomain();
-- 
1.6.0.4





* [PATCH 4/15] sched: Define per CPU variables and cpu_to_group function for MN domain
  2009-08-20 13:12 [RFC][PATCH 0/15] sched: Fix scheduling for multi-node processors Andreas Herrmann
                   ` (2 preceding siblings ...)
  2009-08-20 13:34 ` [PATCH 3/15] sched: Add cpumask to be used when building MN domain Andreas Herrmann
@ 2009-08-20 13:35 ` Andreas Herrmann
  2009-08-20 13:36 ` [PATCH 5/15] sched: Add function to build MN sched domain Andreas Herrmann
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-20 13:35 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel


Additionally, fix up cpu_to_phys_group() in case of CONFIG_SCHED_MN=y.

Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
---
 kernel/sched.c |   32 +++++++++++++++++++++++++++-----
 1 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 9990c3a..d85985d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8255,9 +8255,27 @@ cpu_to_cpu_group(int cpu, const struct cpumask *cpu_map,
 #ifdef CONFIG_SCHED_MC
 static DEFINE_PER_CPU(struct static_sched_domain, core_domains);
 static DEFINE_PER_CPU(struct static_sched_group, sched_group_core);
-#endif /* CONFIG_SCHED_MC */
 
-#if defined(CONFIG_SCHED_MC) && defined(CONFIG_SCHED_SMT)
+/*
+ * multi-node sched-domains:
+ */
+#ifdef CONFIG_SCHED_MN
+static DEFINE_PER_CPU(struct static_sched_domain, cpu_node_domains);
+static DEFINE_PER_CPU(struct static_sched_group, sched_group_cpu_node);
+
+static int cpu_to_cpu_node_group(int cpu, const struct cpumask *cpu_map,
+                                 struct sched_group **sg, struct cpumask *mask)
+{
+        int group;
+        cpumask_and(mask, cpu_coregroup_mask(cpu), cpu_map);
+        group = cpumask_first(mask);
+        if (sg)
+                *sg = &per_cpu(sched_group_cpu_node, group).sg;
+        return group;
+}
+#endif /* CONFIG_SCHED_MN */
+
+#ifdef CONFIG_SCHED_SMT
 static int
 cpu_to_core_group(int cpu, const struct cpumask *cpu_map,
 		  struct sched_group **sg, struct cpumask *mask)
@@ -8270,7 +8288,7 @@ cpu_to_core_group(int cpu, const struct cpumask *cpu_map,
 		*sg = &per_cpu(sched_group_core, group).sg;
 	return group;
 }
-#elif defined(CONFIG_SCHED_MC)
+#else
 static int
 cpu_to_core_group(int cpu, const struct cpumask *cpu_map,
 		  struct sched_group **sg, struct cpumask *unused)
@@ -8279,7 +8297,8 @@ cpu_to_core_group(int cpu, const struct cpumask *cpu_map,
 		*sg = &per_cpu(sched_group_core, cpu).sg;
 	return cpu;
 }
-#endif
+#endif /* CONFIG_SCHED_SMT */
+#endif /* CONFIG_SCHED_MC */
 
 static DEFINE_PER_CPU(struct static_sched_domain, phys_domains);
 static DEFINE_PER_CPU(struct static_sched_group, sched_group_phys);
@@ -8289,7 +8308,10 @@ cpu_to_phys_group(int cpu, const struct cpumask *cpu_map,
 		  struct sched_group **sg, struct cpumask *mask)
 {
 	int group;
-#ifdef CONFIG_SCHED_MC
+#ifdef CONFIG_SCHED_MN
+	cpumask_and(mask, topology_cpu_node_cpumask(cpu), cpu_map);
+	group = cpumask_first(mask);
+#elif defined(CONFIG_SCHED_MC)
 	cpumask_and(mask, cpu_coregroup_mask(cpu), cpu_map);
 	group = cpumask_first(mask);
 #elif defined(CONFIG_SCHED_SMT)
-- 
1.6.0.4





* [PATCH 5/15] sched: Add function to build MN sched domain
  2009-08-20 13:12 [RFC][PATCH 0/15] sched: Fix scheduling for multi-node processors Andreas Herrmann
                   ` (3 preceding siblings ...)
  2009-08-20 13:35 ` [PATCH 4/15] sched: Define per CPU variables and cpu_to_group function for " Andreas Herrmann
@ 2009-08-20 13:36 ` Andreas Herrmann
  2009-08-20 13:37 ` [PATCH 6/15] sched: Add support for MN domain in build_sched_groups Andreas Herrmann
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-20 13:36 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel


Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
---
 kernel/sched.c |   21 +++++++++++++++++++++
 1 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index d85985d..7b8b2ab 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8569,6 +8569,9 @@ SD_INIT_FUNC(CPU)
 #ifdef CONFIG_SCHED_MC
  SD_INIT_FUNC(MC)
 #endif
+#ifdef CONFIG_SCHED_MN
+ SD_INIT_FUNC(MN)
+#endif
 
 static int default_relax_domain_level = -1;
 
@@ -8727,6 +8730,24 @@ static struct sched_domain *__build_cpu_sched_domain(struct s_data *d,
 	return sd;
 }
 
+static struct sched_domain *__build_mn_sched_domain(struct s_data *d,
+	const struct cpumask *cpu_map, struct sched_domain_attr *attr,
+	struct sched_domain *parent, int i)
+{
+	struct sched_domain *sd = parent;
+#ifdef CONFIG_SCHED_MN
+	sd = &per_cpu(cpu_node_domains, i).sd;
+	SD_INIT(sd, MN);
+	set_domain_attribute(sd, attr);
+	cpumask_and(sched_domain_span(sd), cpu_map,
+		    topology_cpu_node_cpumask(i));
+	sd->parent = parent;
+	parent->child = sd;
+	cpu_to_cpu_node_group(i, cpu_map, &sd->groups, d->tmpmask);
+#endif
+	return sd;
+}
+
 static struct sched_domain *__build_mc_sched_domain(struct s_data *d,
 	const struct cpumask *cpu_map, struct sched_domain_attr *attr,
 	struct sched_domain *parent, int i)
-- 
1.6.0.4





* [PATCH 6/15] sched: Add support for MN domain in build_sched_groups
  2009-08-20 13:12 [RFC][PATCH 0/15] sched: Fix scheduling for multi-node processors Andreas Herrmann
                   ` (4 preceding siblings ...)
  2009-08-20 13:36 ` [PATCH 5/15] sched: Add function to build MN sched domain Andreas Herrmann
@ 2009-08-20 13:37 ` Andreas Herrmann
  2009-08-20 13:38 ` [PATCH 7/15] sched: Activate build of MN domains Andreas Herrmann
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-20 13:37 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel


Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
---
 kernel/sched.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 7b8b2ab..cc16629 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8805,6 +8805,16 @@ static void build_sched_groups(struct s_data *d, enum sched_domain_level l,
 						d->send_covered, d->tmpmask);
 		break;
 #endif
+#ifdef CONFIG_SCHED_MN
+	case SD_LV_MN: /* set up multi-node groups */
+		cpumask_and(d->this_cpu_node_map, cpu_map,
+			    topology_cpu_node_cpumask(cpu));
+		if (cpu == cpumask_first(d->this_cpu_node_map))
+			init_sched_build_groups(d->this_cpu_node_map, cpu_map,
+						&cpu_to_cpu_node_group,
+						d->send_covered, d->tmpmask);
+		break;
+#endif
 	case SD_LV_CPU: /* set up physical groups */
 		cpumask_and(d->nodemask, cpumask_of_node(cpu), cpu_map);
 		if (!cpumask_empty(d->nodemask))
-- 
1.6.0.4





* [PATCH 7/15] sched: Activate build of MN domains
  2009-08-20 13:12 [RFC][PATCH 0/15] sched: Fix scheduling for multi-node processors Andreas Herrmann
                   ` (5 preceding siblings ...)
  2009-08-20 13:37 ` [PATCH 6/15] sched: Add support for MN domain in build_sched_groups Andreas Herrmann
@ 2009-08-20 13:38 ` Andreas Herrmann
  2009-08-20 13:39 ` [PATCH 8/15] sched: Add parameter sched_mn_power_savings to control MN domain sched policy Andreas Herrmann
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-20 13:38 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel


I.e. call __build_mn_sched_domain(), build the corresponding groups,
and calculate group power.

Note: still missing are changes in various places to actually detect
the new domain hierarchy and fix up dependent code.

Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
---
 kernel/sched.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index cc16629..6cfc840 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8862,6 +8862,7 @@ static int __build_sched_domains(const struct cpumask *cpu_map,
 
 		sd = __build_numa_sched_domains(&d, cpu_map, attr, i);
 		sd = __build_cpu_sched_domain(&d, cpu_map, attr, sd, i);
+		sd = __build_mn_sched_domain(&d, cpu_map, attr, sd, i);
 		sd = __build_mc_sched_domain(&d, cpu_map, attr, sd, i);
 		sd = __build_smt_sched_domain(&d, cpu_map, attr, sd, i);
 	}
@@ -8869,6 +8870,7 @@ static int __build_sched_domains(const struct cpumask *cpu_map,
 	for_each_cpu(i, cpu_map) {
 		build_sched_groups(&d, SD_LV_SIBLING, cpu_map, i);
 		build_sched_groups(&d, SD_LV_MC, cpu_map, i);
+		build_sched_groups(&d, SD_LV_MN, cpu_map, i);
 	}
 
 	/* Set up physical groups */
@@ -8898,6 +8900,12 @@ static int __build_sched_domains(const struct cpumask *cpu_map,
 		init_sched_groups_power(i, sd);
 	}
 #endif
+#ifdef CONFIG_SCHED_MN
+	for_each_cpu(i, cpu_map) {
+		sd = &per_cpu(cpu_node_domains, i).sd;
+		init_sched_groups_power(i, sd);
+	}
+#endif
 
 	for_each_cpu(i, cpu_map) {
 		sd = &per_cpu(phys_domains, i).sd;
-- 
1.6.0.4





* [PATCH 8/15] sched: Add parameter sched_mn_power_savings to control MN domain sched policy
  2009-08-20 13:12 [RFC][PATCH 0/15] sched: Fix scheduling for multi-node processors Andreas Herrmann
                   ` (6 preceding siblings ...)
  2009-08-20 13:38 ` [PATCH 7/15] sched: Activate build of MN domains Andreas Herrmann
@ 2009-08-20 13:39 ` Andreas Herrmann
  2009-08-24 14:56   ` Peter Zijlstra
  2009-08-26  9:30   ` Gautham R Shenoy
  2009-08-20 13:40 ` [PATCH 9/15] sched: Check sched_mn_power_savings when setting flags for CPU and MN domains Andreas Herrmann
                   ` (6 subsequent siblings)
  14 siblings, 2 replies; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-20 13:39 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel


Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
---
 include/linux/sched.h |    4 +++-
 kernel/sched.c        |   38 ++++++++++++++++++++++++++++++++------
 2 files changed, 35 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3a1f8db..5755643 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -832,7 +832,9 @@ enum powersavings_balance_level {
 	MAX_POWERSAVINGS_BALANCE_LEVELS
 };
 
-extern int sched_mc_power_savings, sched_smt_power_savings;
+extern int sched_mn_power_savings;
+extern int sched_mc_power_savings;
+extern int sched_smt_power_savings;
 
 static inline int sd_balance_for_mc_power(void)
 {
diff --git a/kernel/sched.c b/kernel/sched.c
index 6cfc840..ebcda58 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8179,7 +8179,9 @@ static void sched_domain_node_span(int node, struct cpumask *span)
 }
 #endif /* CONFIG_NUMA */
 
-int sched_smt_power_savings = 0, sched_mc_power_savings = 0;
+int sched_mn_power_savings = 0;
+int sched_mc_power_savings = 0;
+int sched_smt_power_savings = 0;
 
 /*
  * The cpus mask in sched_group and sched_domain hangs off the end.
@@ -9135,7 +9137,8 @@ static void arch_reinit_sched_domains(void)
 	put_online_cpus();
 }
 
-static ssize_t sched_power_savings_store(const char *buf, size_t count, int smt)
+static ssize_t sched_power_savings_store(const char *buf, size_t count,
+					 enum sched_domain_level dl)
 {
 	unsigned int level = 0;
 
@@ -9152,16 +9155,34 @@ static ssize_t sched_power_savings_store(const char *buf, size_t count, int smt)
 	if (level >= MAX_POWERSAVINGS_BALANCE_LEVELS)
 		return -EINVAL;
 
-	if (smt)
+	if (dl == SD_LV_SIBLING)
 		sched_smt_power_savings = level;
-	else
+	else if (dl == SD_LV_MC)
 		sched_mc_power_savings = level;
+	else if (dl == SD_LV_MN)
+		sched_mn_power_savings = level;
 
 	arch_reinit_sched_domains();
 
 	return count;
 }
 
+#ifdef CONFIG_SCHED_MN
+static ssize_t sched_mn_power_savings_show(struct sysdev_class *class,
+					   char *page)
+{
+	return sprintf(page, "%u\n", sched_mn_power_savings);
+}
+static ssize_t sched_mn_power_savings_store(struct sysdev_class *class,
+					    const char *buf, size_t count)
+{
+	return sched_power_savings_store(buf, count, SD_LV_MN);
+}
+static SYSDEV_CLASS_ATTR(sched_mn_power_savings, 0644,
+			 sched_mn_power_savings_show,
+			 sched_mn_power_savings_store);
+#endif
+
 #ifdef CONFIG_SCHED_MC
 static ssize_t sched_mc_power_savings_show(struct sysdev_class *class,
 					   char *page)
@@ -9171,7 +9192,7 @@ static ssize_t sched_mc_power_savings_show(struct sysdev_class *class,
 static ssize_t sched_mc_power_savings_store(struct sysdev_class *class,
 					    const char *buf, size_t count)
 {
-	return sched_power_savings_store(buf, count, 0);
+	return sched_power_savings_store(buf, count, SD_LV_MC);
 }
 static SYSDEV_CLASS_ATTR(sched_mc_power_savings, 0644,
 			 sched_mc_power_savings_show,
@@ -9187,7 +9208,7 @@ static ssize_t sched_smt_power_savings_show(struct sysdev_class *dev,
 static ssize_t sched_smt_power_savings_store(struct sysdev_class *dev,
 					     const char *buf, size_t count)
 {
-	return sched_power_savings_store(buf, count, 1);
+	return sched_power_savings_store(buf, count, SD_LV_SIBLING);
 }
 static SYSDEV_CLASS_ATTR(sched_smt_power_savings, 0644,
 		   sched_smt_power_savings_show,
@@ -9208,6 +9229,11 @@ int __init sched_create_sysfs_power_savings_entries(struct sysdev_class *cls)
 		err = sysfs_create_file(&cls->kset.kobj,
 					&attr_sched_mc_power_savings.attr);
 #endif
+#ifdef CONFIG_SCHED_MN
+	if (!err && mc_capable())
+		err = sysfs_create_file(&cls->kset.kobj,
+					&attr_sched_mn_power_savings.attr);
+#endif
 	return err;
 }
 #endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
-- 
1.6.0.4





* [PATCH 9/15] sched: Check sched_mn_power_savings when setting flags for CPU and MN domains
  2009-08-20 13:12 [RFC][PATCH 0/15] sched: Fix scheduling for multi-node processors Andreas Herrmann
                   ` (7 preceding siblings ...)
  2009-08-20 13:39 ` [PATCH 8/15] sched: Add parameter sched_mn_power_savings to control MN domain sched policy Andreas Herrmann
@ 2009-08-20 13:40 ` Andreas Herrmann
  2009-08-24 14:57   ` Peter Zijlstra
  2009-08-26 10:01   ` Gautham R Shenoy
  2009-08-20 13:41 ` [PATCH 10/15] sched: Check for sched_mn_power_savings when doing load balancing Andreas Herrmann
                   ` (5 subsequent siblings)
  14 siblings, 2 replies; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-20 13:40 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel


Use the new function sd_balance_for_mn_power() and adapt
sd_balance_for_package_power() and sd_power_saving_flags() to
correctly set the flags SD_POWERSAVINGS_BALANCE and SD_BALANCE_NEWIDLE
in the CPU and MN domains.

Furthermore, add the flag SD_SHARE_PKG_RESOURCES to the MN domain.
Rationale: a multi-node processor most likely shares package resources
(on Magny-Cours the package constitutes a "voltage domain").

Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
---
 arch/x86/include/asm/topology.h |    3 ++-
 include/linux/sched.h           |   14 ++++++++++++--
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 6d7d133..4a520b8 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -198,7 +198,8 @@ static inline void setup_node_to_cpumask_map(void) { }
 				| SD_BALANCE_EXEC	\
 				| SD_WAKE_AFFINE	\
 				| SD_WAKE_BALANCE	\
-				| sd_balance_for_package_power()\
+				| SD_SHARE_PKG_RESOURCES\
+				| sd_balance_for_mn_power()\
 				| sd_power_saving_flags(),\
 	.last_balance		= jiffies,		\
 	.balance_interval	= 1,			\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5755643..c53bdd8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -844,9 +844,18 @@ static inline int sd_balance_for_mc_power(void)
 	return 0;
 }
 
+static inline int sd_balance_for_mn_power(void)
+{
+	if (sched_mc_power_savings || sched_smt_power_savings)
+		return SD_POWERSAVINGS_BALANCE;
+
+	return 0;
+}
+
 static inline int sd_balance_for_package_power(void)
 {
-	if (sched_mc_power_savings | sched_smt_power_savings)
+	if (sched_mn_power_savings || sched_mc_power_savings ||
+	    sched_smt_power_savings)
 		return SD_POWERSAVINGS_BALANCE;
 
 	return 0;
@@ -860,7 +869,8 @@ static inline int sd_balance_for_package_power(void)
 
 static inline int sd_power_saving_flags(void)
 {
-	if (sched_mc_power_savings | sched_smt_power_savings)
+	if (sched_mn_power_savings || sched_mc_power_savings ||
+	    sched_smt_power_savings)
 		return SD_BALANCE_NEWIDLE;
 
 	return 0;
-- 
1.6.0.4





* [PATCH 10/15] sched: Check for sched_mn_power_savings when doing load balancing
  2009-08-20 13:12 [RFC][PATCH 0/15] sched: Fix scheduling for multi-node processors Andreas Herrmann
                   ` (8 preceding siblings ...)
  2009-08-20 13:40 ` [PATCH 9/15] sched: Check sched_mn_power_savings when setting flags for CPU and MN domains Andreas Herrmann
@ 2009-08-20 13:41 ` Andreas Herrmann
  2009-08-24 15:03   ` Peter Zijlstra
  2009-08-20 13:41 ` [PATCH 11/15] sched: Pass unlimited __cpu_power information to upper domain level groups Andreas Herrmann
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-20 13:41 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel


The patch adds support for POWERSAVINGS_BALANCE_BASIC at the MN domain
level. Currently POWERSAVINGS_BALANCE_WAKEUP is not used for the MN
domain.

(I have to admit that I don't yet fully understand the benefit of
POWERSAVINGS_BALANCE_WAKEUP (where a dedicated wakeup CPU is used)
over POWERSAVINGS_BALANCE_BASIC. I also have not found an example
that demonstrates the difference between those two power-saving
levels.)

Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
---
 kernel/sched.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index ebcda58..7a0d710 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4591,7 +4591,8 @@ static int find_new_ilb(int cpu)
 	 * Have idle load balancer selection from semi-idle packages only
 	 * when power-aware load balancing is enabled
 	 */
-	if (!(sched_smt_power_savings || sched_mc_power_savings))
+	if (!(sched_smt_power_savings || sched_mc_power_savings ||
+	      sched_mn_power_savings))
 		goto out_done;
 
 	/*
@@ -4681,7 +4682,7 @@ int select_nohz_load_balancer(int stop_tick)
 			int new_ilb;
 
 			if (!(sched_smt_power_savings ||
-						sched_mc_power_savings))
+			      sched_mc_power_savings || sched_mn_power_savings))
 				return 1;
 			/*
 			 * Check to see if there is a more power-efficient
-- 
1.6.0.4





* [PATCH 11/15] sched: Pass unlimited __cpu_power information to upper domain level groups
  2009-08-20 13:12 [RFC][PATCH 0/15] sched: Fix scheduling for multi-node processors Andreas Herrmann
                   ` (9 preceding siblings ...)
  2009-08-20 13:41 ` [PATCH 10/15] sched: Check for sched_mn_power_savings when doing load balancing Andreas Herrmann
@ 2009-08-20 13:41 ` Andreas Herrmann
  2009-08-24 15:21   ` Peter Zijlstra
  2009-08-20 13:42 ` [PATCH 12/15] sched: Allow NODE domain to be parent of MC instead of CPU domain Andreas Herrmann
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-20 13:41 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel


For performance reasons, __cpu_power in a sched_group might be limited
such that the group can handle only one task. To correctly calculate
the capacity of groups at upper domain levels, the unlimited power
information is required. This patch stores the unlimited __cpu_power
information in sched_group.orig_power and uses it when calculating
__cpu_power in upper domain level groups.
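
A sketch of the intended arithmetic, using numbers from the 24-CPU
test system in the cover letter (SCHED_LOAD_SCALE is 1024, six cores
per node):

  node-level group, perf policy: __cpu_power = 1024  (limited, one task)
                                 orig_power  = 6144  (6 * 1024)
  socket-level group (2 nodes):  __cpu_power = 6144 + 6144 = 12288

This matches the "__cpu_power = 12288" groups in the cover-letter
domain dumps; summing the limited __cpu_power values instead would
yield 2048 and grossly underestimate the group's capacity.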

Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
---
 include/linux/sched.h |    8 +++++++-
 kernel/sched.c        |   36 ++++++++++++++++++++++++------------
 2 files changed, 31 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c53bdd8..d230717 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -890,7 +890,13 @@ struct sched_group {
 	 * (see include/linux/reciprocal_div.h)
 	 */
 	u32 reciprocal_cpu_power;
-
+	/*
+	 * Backup of original power for this group.
+	 * It is used to pass correct power information to upper
+	 * domain level groups in case __cpu_power is limited for
+	 * performance reasons.
+	 */
+	unsigned int orig_power;
 	/*
 	 * The CPUs this group covers.
 	 *
diff --git a/kernel/sched.c b/kernel/sched.c
index 7a0d710..464b6ba 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8376,6 +8376,7 @@ static void init_numa_sched_groups_power(struct sched_group *group_head)
 
 			sg_inc_cpu_power(sg, sd->groups->__cpu_power);
 		}
+		sg->orig_power = sg->__cpu_power;
 		sg = sg->next;
 	} while (sg != group_head);
 }
@@ -8514,18 +8515,9 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
 	child = sd->child;
 
 	sd->groups->__cpu_power = 0;
-
-	/*
-	 * For perf policy, if the groups in child domain share resources
-	 * (for example cores sharing some portions of the cache hierarchy
-	 * or SMT), then set this domain groups cpu_power such that each group
-	 * can handle only one task, when there are other idle groups in the
-	 * same sched domain.
-	 */
-	if (!child || (!(sd->flags & SD_POWERSAVINGS_BALANCE) &&
-		       (child->flags &
-			(SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES)))) {
+	if (!child) {
 		sg_inc_cpu_power(sd->groups, SCHED_LOAD_SCALE);
+		sd->groups->orig_power = sd->groups->__cpu_power;
 		return;
 	}
 
@@ -8534,9 +8526,29 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
 	 */
 	group = child->groups;
 	do {
-		sg_inc_cpu_power(sd->groups, group->__cpu_power);
+		sg_inc_cpu_power(sd->groups, group->orig_power);
 		group = group->next;
 	} while (group != child->groups);
+	sd->groups->orig_power = sd->groups->__cpu_power;
+
+	/*
+	 * For perf policy, if the groups in child domain share resources
+	 * (for example cores sharing some portions of the cache hierarchy
+	 * or SMT), then set this domain groups cpu_power such that each group
+	 * can handle only one task, when there are other idle groups in the
+	 * same sched domain.
+	 * Note: Unmodified power information is kept in orig_power and
+	 *       can be used in higher domain levels to calculate
+	 *       and reflect the correct capacity of a sched_group.
+	 *       This is required for power_savings scheduling.
+	 */
+	if (!(sd->flags & SD_POWERSAVINGS_BALANCE) &&
+	    ((child->flags &
+	      (SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES)))) {
+		sd->groups->__cpu_power = 0;
+		sg_inc_cpu_power(sd->groups, SCHED_LOAD_SCALE);
+	}
+
 }
 
 /*
-- 
1.6.0.4





* [PATCH 12/15] sched: Allow NODE domain to be parent of MC instead of CPU domain
  2009-08-20 13:12 [RFC][PATCH 0/15] sched: Fix scheduling for multi-node processors Andreas Herrmann
                   ` (10 preceding siblings ...)
  2009-08-20 13:41 ` [PATCH 11/15] sched: Pass unlimited __cpu_power information to upper domain level groups Andreas Herrmann
@ 2009-08-20 13:42 ` Andreas Herrmann
  2009-08-24 15:32   ` Peter Zijlstra
  2009-08-20 13:43 ` [PATCH 13/15] sched: Detect child domain of NUMA (aka NODE) domain Andreas Herrmann
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-20 13:42 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel


The level of the NODE domain's child domain is provided in
s_data.numa_child_level. Several adaptations are then required when
creating the domain hierarchy. In case the NODE domain is parent of
the MC domain we have to:
- limit the NODE domain's span in sched_domain_node_span() to not
  exceed the corresponding topology_core_cpumask
- fix the CPU domain span to cover the entire cpu_map
- fix the CPU domain sched groups to cover entire physical groups
  instead of covering a node (a node sched_group might be a proper
  subset of a CPU sched_group)
- use the correct child domain in init_numa_sched_groups_power() when
  calculating sched_group.__cpu_power in the NODE domain
- calculate the group power of the NODE domain after its child domain

Note: As I have no idea when the ALLNODES domain is required,
      I assumed that an ALLNODES domain exists only if the NODE domain
      is parent of the CPU domain.
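
For reference, the hierarchy this produces on the test system from the
cover letter (CONFIG_SCHED_MN=y, CONFIG_NUMA=y, CPU 23):

  domain 0: span 18-23  level MC    (cores of one internal node)
  domain 1: span 18-23  level NODE  (NUMA node)
  domain 2: span 12-23  level MN    (both nodes of the socket)
  domain 3: span  0-23  level CPU   (whole machine)

i.e. the NODE domain is now parent of the MC domain instead of parent
of the CPU domain.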

Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
---
 kernel/sched.c |  106 ++++++++++++++++++++++++++++++++++++++-----------------
 1 files changed, 73 insertions(+), 33 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 464b6ba..b03701d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8161,7 +8161,8 @@ static int find_next_best_node(int node, nodemask_t *used_nodes)
  * should be one that prevents unnecessary balancing, but also spreads tasks
  * out optimally.
  */
-static void sched_domain_node_span(int node, struct cpumask *span)
+static void sched_domain_node_span(int node, struct cpumask *span,
+				   enum sched_domain_level child_level)
 {
 	nodemask_t used_nodes;
 	int i;
@@ -8177,6 +8178,10 @@ static void sched_domain_node_span(int node, struct cpumask *span)
 
 		cpumask_or(span, span, cpumask_of_node(next_node));
 	}
+
+	if (child_level == SD_LV_MC)
+		cpumask_and(span, span, topology_core_cpumask(
+			      cpumask_first(cpumask_of_node(node))));
 }
 #endif /* CONFIG_NUMA */
 
@@ -8201,6 +8206,7 @@ struct static_sched_domain {
 };
 
 struct s_data {
+	enum sched_domain_level numa_child_level;
 #ifdef CONFIG_NUMA
 	int			sd_allnodes;
 	cpumask_var_t		domainspan;
@@ -8354,7 +8360,8 @@ static int cpu_to_allnodes_group(int cpu, const struct cpumask *cpu_map,
 	return group;
 }
 
-static void init_numa_sched_groups_power(struct sched_group *group_head)
+static void init_numa_sched_groups_power(struct sched_group *group_head,
+					 enum sched_domain_level child_level)
 {
 	struct sched_group *sg = group_head;
 	int j;
@@ -8365,7 +8372,11 @@ static void init_numa_sched_groups_power(struct sched_group *group_head)
 		for_each_cpu(j, sched_group_cpus(sg)) {
 			struct sched_domain *sd;
 
-			sd = &per_cpu(phys_domains, j).sd;
+			if (child_level == SD_LV_CPU)
+				sd = &per_cpu(phys_domains, j).sd;
+			else /* SD_LV_MC */
+				sd = &per_cpu(core_domains, j).sd;
+
 			if (j != group_first_cpu(sd->groups)) {
 				/*
 				 * Only add "power" once for each
@@ -8394,7 +8405,7 @@ static int build_numa_sched_groups(struct s_data *d,
 		goto out;
 	}
 
-	sched_domain_node_span(num, d->domainspan);
+	sched_domain_node_span(num, d->domainspan, d->numa_child_level);
 	cpumask_and(d->domainspan, d->domainspan, cpu_map);
 
 	sg = kmalloc_node(sizeof(struct sched_group) + cpumask_size(),
@@ -8699,15 +8710,15 @@ static enum s_alloc __visit_domain_allocation_hell(struct s_data *d,
 }
 
 static struct sched_domain *__build_numa_sched_domains(struct s_data *d,
-	const struct cpumask *cpu_map, struct sched_domain_attr *attr, int i)
+	const struct cpumask *cpu_map, struct sched_domain_attr *attr,
+	struct sched_domain *parent, int i)
 {
-	struct sched_domain *sd = NULL;
+	struct sched_domain *sd = parent;
 #ifdef CONFIG_NUMA
-	struct sched_domain *parent;
-
 	d->sd_allnodes = 0;
-	if (cpumask_weight(cpu_map) >
-	    SD_NODES_PER_DOMAIN * cpumask_weight(d->nodemask)) {
+	if ((cpumask_weight(cpu_map) >
+	     SD_NODES_PER_DOMAIN * cpumask_weight(d->nodemask)) &&
+	    (d->numa_child_level == SD_LV_CPU)) {
 		sd = &per_cpu(allnodes_domains, i).sd;
 		SD_INIT(sd, ALLNODES);
 		set_domain_attribute(sd, attr);
@@ -8720,7 +8731,8 @@ static struct sched_domain *__build_numa_sched_domains(struct s_data *d,
 	sd = &per_cpu(node_domains, i).sd;
 	SD_INIT(sd, NODE);
 	set_domain_attribute(sd, attr);
-	sched_domain_node_span(cpu_to_node(i), sched_domain_span(sd));
+	sched_domain_node_span(cpu_to_node(i), sched_domain_span(sd),
+			       d->numa_child_level);
 	sd->parent = parent;
 	if (parent)
 		parent->child = sd;
@@ -8737,10 +8749,12 @@ static struct sched_domain *__build_cpu_sched_domain(struct s_data *d,
 	sd = &per_cpu(phys_domains, i).sd;
 	SD_INIT(sd, CPU);
 	set_domain_attribute(sd, attr);
-	cpumask_copy(sched_domain_span(sd), d->nodemask);
 	sd->parent = parent;
-	if (parent)
+	if (parent) {
+		cpumask_copy(sched_domain_span(sd), d->nodemask);
 		parent->child = sd;
+	} else
+		cpumask_copy(sched_domain_span(sd), cpu_map);
 	cpu_to_phys_group(i, cpu_map, &sd->groups, d->tmpmask);
 	return sd;
 }
@@ -8831,11 +8845,18 @@ static void build_sched_groups(struct s_data *d, enum sched_domain_level l,
 		break;
 #endif
 	case SD_LV_CPU: /* set up physical groups */
-		cpumask_and(d->nodemask, cpumask_of_node(cpu), cpu_map);
-		if (!cpumask_empty(d->nodemask))
-			init_sched_build_groups(d->nodemask, cpu_map,
-						&cpu_to_phys_group,
-						d->send_covered, d->tmpmask);
+		if (d->numa_child_level == SD_LV_MC) {
+			init_sched_build_groups(cpu_map, cpu_map,
+                                                &cpu_to_phys_group,
+                                                d->send_covered, d->tmpmask);
+		} else {
+			cpumask_and(d->nodemask, cpumask_of_node(cpu), cpu_map);
+			if (!cpumask_empty(d->nodemask))
+				init_sched_build_groups(d->nodemask, cpu_map,
+							&cpu_to_phys_group,
+							d->send_covered,
+							d->tmpmask);
+		}
 		break;
 #ifdef CONFIG_NUMA
 	case SD_LV_ALLNODES:
@@ -8859,9 +8880,8 @@ static int __build_sched_domains(const struct cpumask *cpu_map,
 	struct s_data d;
 	struct sched_domain *sd;
 	int i;
-#ifdef CONFIG_NUMA
-	d.sd_allnodes = 0;
-#endif
+
+	d.numa_child_level = SD_LV_NONE;
 
 	alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
 	if (alloc_state != sa_rootdomain)
@@ -8875,9 +8895,18 @@ static int __build_sched_domains(const struct cpumask *cpu_map,
 		cpumask_and(d.nodemask, cpumask_of_node(cpu_to_node(i)),
 			    cpu_map);
 
-		sd = __build_numa_sched_domains(&d, cpu_map, attr, i);
-		sd = __build_cpu_sched_domain(&d, cpu_map, attr, sd, i);
-		sd = __build_mn_sched_domain(&d, cpu_map, attr, sd, i);
+		if (d.numa_child_level == SD_LV_CPU) {
+			sd = __build_numa_sched_domains(&d, cpu_map, attr,
+							NULL, i);
+			sd = __build_cpu_sched_domain(&d, cpu_map, attr, sd, i);
+			sd = __build_mn_sched_domain(&d, cpu_map, attr, sd, i);
+		} else {
+			sd = __build_cpu_sched_domain(&d, cpu_map, attr,
+						      NULL, i);
+			sd = __build_mn_sched_domain(&d, cpu_map, attr, sd, i);
+			sd = __build_numa_sched_domains(&d, cpu_map, attr,
+							sd, i);
+		}
 		sd = __build_mc_sched_domain(&d, cpu_map, attr, sd, i);
 		sd = __build_smt_sched_domain(&d, cpu_map, attr, sd, i);
 	}
@@ -8915,6 +8944,15 @@ static int __build_sched_domains(const struct cpumask *cpu_map,
 		init_sched_groups_power(i, sd);
 	}
 #endif
+
+#ifdef CONFIG_NUMA
+	if (d.numa_child_level == SD_LV_MC)
+		for (i = 0; i < nr_node_ids; i++)
+			init_numa_sched_groups_power(d.sched_group_nodes[i],
+						     d.numa_child_level);
+#endif
+
+
 #ifdef CONFIG_SCHED_MN
 	for_each_cpu(i, cpu_map) {
 		sd = &per_cpu(cpu_node_domains, i).sd;
@@ -8928,15 +8966,17 @@ static int __build_sched_domains(const struct cpumask *cpu_map,
 	}
 
 #ifdef CONFIG_NUMA
-	for (i = 0; i < nr_node_ids; i++)
-		init_numa_sched_groups_power(d.sched_group_nodes[i]);
-
-	if (d.sd_allnodes) {
-		struct sched_group *sg;
-
-		cpu_to_allnodes_group(cpumask_first(cpu_map), cpu_map, &sg,
-								d.tmpmask);
-		init_numa_sched_groups_power(sg);
+	if (d.numa_child_level == SD_LV_CPU) {
+		for (i = 0; i < nr_node_ids; i++)
+			init_numa_sched_groups_power(d.sched_group_nodes[i],
+						     d.numa_child_level);
+
+		if (d.sd_allnodes) {
+			struct sched_group *sg;
+			cpu_to_allnodes_group(cpumask_first(cpu_map),
+					      cpu_map, &sg, d.tmpmask);
+			init_numa_sched_groups_power(sg, d.numa_child_level);
+		}
 	}
 #endif
 
-- 
1.6.0.4





* [PATCH 13/15] sched: Detect child domain of NUMA (aka NODE) domain
  2009-08-20 13:12 [RFC][PATCH 0/15] sched: Fix scheduling for multi-node processors Andreas Herrmann
                   ` (11 preceding siblings ...)
  2009-08-20 13:42 ` [PATCH 12/15] sched: Allow NODE domain to be parent of MC instead of CPU domain Andreas Herrmann
@ 2009-08-20 13:43 ` Andreas Herrmann
  2009-08-24 15:34   ` Peter Zijlstra
  2009-08-20 13:45 ` [PATCH 14/15] sched: Conditionally limit __cpu_power when child sched domain has type NODE Andreas Herrmann
  2009-08-20 13:46 ` [PATCH 15/15] x86: Fix cpu_coregroup_mask to return correct cpumask on multi-node processors Andreas Herrmann
  14 siblings, 1 reply; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-20 13:43 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel


On multi-node processors a NUMA node might not span a whole socket;
instead, a socket might span several NUMA nodes.

This patch introduces a check whether the NODE domain is parent of
the MC domain and sets s_data.numa_child_level accordingly.
(See the previous patch for further details.)
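
A worked example using the test system from the cover letter: for
CPU 23 the NUMA node cpumask is 18-23 while the socket cpumask is
12-23. The node mask is a proper subset of the socket mask, so the
check below yields proper_subset = 1 and numa_child_level is set to
SD_LV_MC, i.e. the NODE domain becomes parent of the MC domain.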

Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
---
 kernel/sched.c |   43 +++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index b03701d..0c950dc 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8869,6 +8869,45 @@ static void build_sched_groups(struct s_data *d, enum sched_domain_level l,
 	}
 }
 
+static enum sched_domain_level get_numa_child_domain_level(struct s_data *d,
+	const struct cpumask *cpu_map)
+{
+	enum sched_domain_level dl = SD_LV_NONE;
+#ifdef CONFIG_NUMA
+	int i;
+	int proper_superset = 0;
+	int proper_subset = 0;
+
+	for_each_cpu(i, cpu_map) {
+		cpumask_and(d->tmpmask, cpu_map, topology_cpu_node_cpumask(i));
+		cpumask_and(d->nodemask, cpu_map,
+			    cpumask_of_node(cpu_to_node(i)));
+
+		/* NUMA node's CPU set is proper subset of socket's CPU set */
+		if (cpumask_subset(d->nodemask, d->tmpmask) &&
+		    !cpumask_subset(d->tmpmask, d->nodemask))
+			proper_subset = 1;
+
+		/* socket's CPU set is proper subset of NUMA node's CPU set */
+		if (!cpumask_subset(d->nodemask, d->tmpmask) &&
+		    cpumask_subset(d->tmpmask, d->nodemask))
+			proper_superset = 1;
+	}
+
+	if (proper_subset && proper_superset)
+		printk(KERN_ERR "sched: inconsistent NUMA hierarchy\n");
+	else if (proper_subset) {
+		printk(KERN_DEBUG "sched: NUMA child domain: MC\n");
+		dl = SD_LV_MC;
+	} else {
+		printk(KERN_DEBUG "sched: NUMA child domain: CPU\n");
+		dl = SD_LV_CPU;
+	}
+
+#endif
+	return dl;
+}
+
 /*
  * Build sched domains for a given set of cpus and attach the sched domains
  * to the individual cpus
@@ -8881,13 +8920,13 @@ static int __build_sched_domains(const struct cpumask *cpu_map,
 	struct sched_domain *sd;
 	int i;
 
-	d.numa_child_level = SD_LV_NONE;
-
 	alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
 	if (alloc_state != sa_rootdomain)
 		goto error;
 	alloc_state = sa_sched_groups;
 
+	d.numa_child_level = get_numa_child_domain_level(&d, cpu_map);
+
 	/*
 	 * Set up domains for cpus specified by the cpu_map.
 	 */
-- 
1.6.0.4




^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 14/15] sched: Conditionally limit __cpu_power when child sched domain has type NODE
  2009-08-20 13:12 [RFC][PATCH 0/15] sched: Fix scheduling for multi-node processors Andreas Herrmann
                   ` (12 preceding siblings ...)
  2009-08-20 13:43 ` [PATCH 13/15] sched: Detect child domain of NUMA (aka NODE) domain Andreas Herrmann
@ 2009-08-20 13:45 ` Andreas Herrmann
  2009-08-24 15:35   ` Peter Zijlstra
  2009-08-20 13:46 ` [PATCH 15/15] x86: Fix cpu_coregroup_mask to return correct cpumask on multi-node processors Andreas Herrmann
  14 siblings, 1 reply; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-20 13:45 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel


We need this in case of the performance policy. All sched_groups in
the child's parent domain (MN in this case) should be limited such that
tasks are balanced among these sched_groups.

Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
---
 kernel/sched.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 0c950dc..ab88d88 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8555,11 +8555,11 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
 	 */
 	if (!(sd->flags & SD_POWERSAVINGS_BALANCE) &&
 	    ((child->flags &
-	      (SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES)))) {
+	      (SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES)) ||
+	     (child->level == SD_LV_NODE))) {
 		sd->groups->__cpu_power = 0;
 		sg_inc_cpu_power(sd->groups, SCHED_LOAD_SCALE);
 	}
-
 }
 
 /*
-- 
1.6.0.4




^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 15/15] x86: Fix cpu_coregroup_mask to return correct cpumask on multi-node processors
  2009-08-20 13:12 [RFC][PATCH 0/15] sched: Fix scheduling for multi-node processors Andreas Herrmann
                   ` (13 preceding siblings ...)
  2009-08-20 13:45 ` [PATCH 14/15] sched: Conditionally limit __cpu_power when child sched domain has type NODE Andreas Herrmann
@ 2009-08-20 13:46 ` Andreas Herrmann
  2009-08-24 15:36   ` Peter Zijlstra
  14 siblings, 1 reply; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-20 13:46 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel


The correct mask that describes the core-siblings of a processor
is topology_core_cpumask. See the topology adaptation patches, especially
http://marc.info/?l=linux-kernel&m=124964999608179

Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
---
 arch/x86/kernel/smpboot.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index f797214..f39bb2c 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -446,7 +446,7 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
 	 * And for power savings, we return cpu_core_map
 	 */
 	if (sched_mc_power_savings || sched_smt_power_savings)
-		return cpu_core_mask(cpu);
+		return topology_core_cpumask(cpu);
 	else
 		return c->llc_shared_map;
 }
-- 
1.6.0.4




^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/15] x86, sched: Add config option for multi-node CPU scheduling
  2009-08-20 13:15 ` [PATCH 1/15] x86, sched: Add config option for multi-node CPU scheduling Andreas Herrmann
@ 2009-08-21 13:50   ` Valdis.Kletnieks
  2009-08-24  8:49     ` Andreas Herrmann
  0 siblings, 1 reply; 64+ messages in thread
From: Valdis.Kletnieks @ 2009-08-21 13:50 UTC (permalink / raw)
  To: Andreas Herrmann; +Cc: Peter Zijlstra, Ingo Molnar, linux-kernel


On Thu, 20 Aug 2009 15:15:21 +0200, Andreas Herrmann said:
> 
> I've decided to add this as a subitem of MC scheduling.
> I think the normal case will be that a multi-node CPU has more than 1
> core on each of its nodes. Thus using MN scheduling without MC
> scheduling does not make much sense.
> 
> Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
> ---
>  arch/x86/Kconfig |   10 ++++++++++
>  1 files changed, 10 insertions(+), 0 deletions(-)

Is this patch series bisectable? I admit not having checked deeply, and
at least at first glance the rest looks sane - but usually Kconfig changes
are kept until the *last* patch so that there's no chance of inter-patch
dependencies breaking...


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/15] x86, sched: Add config option for multi-node CPU scheduling
  2009-08-21 13:50   ` Valdis.Kletnieks
@ 2009-08-24  8:49     ` Andreas Herrmann
  0 siblings, 0 replies; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-24  8:49 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: Peter Zijlstra, Ingo Molnar, linux-kernel

On Fri, Aug 21, 2009 at 09:50:35AM -0400, Valdis.Kletnieks@vt.edu wrote:
> On Thu, 20 Aug 2009 15:15:21 +0200, Andreas Herrmann said:
> > 
> > I've decided to add this as a subitem of MC scheduling.
> > I think the normal case will be that a multi-node CPU has more than 1
> > core on each of its nodes. Thus using MN scheduling without MC
> > scheduling does not make much sense.
> > 
> > Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
> > ---
> >  arch/x86/Kconfig |   10 ++++++++++
> >  1 files changed, 10 insertions(+), 0 deletions(-)
> 
> Is this patch series bisectable? I admit not having checked deeply, and
> at least at first glance the rest looks sane - but usually Kconfig changes
> are kept until the *last* patch so that there's no chance of inter-patch
> dependencies breaking...

It should be bisectable -- i.e. it shouldn't cause build errors.
Furthermore, the real usage of MN domains gets activated with
patch 7.


Regards,

Andreas


-- 
Operating | Advanced Micro Devices GmbH
  System  | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. München, Germany
 Research | Geschäftsführer: Thomas M. McCoy, Giuliano Meroni
  Center  | Sitz: Dornach, Gemeinde Aschheim, Landkreis München
  (OSRC)  | Registergericht München, HRB Nr. 43632



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 8/15] sched: Add parameter sched_mn_power_savings to control MN domain sched policy
  2009-08-20 13:39 ` [PATCH 8/15] sched: Add parameter sched_mn_power_savings to control MN domain sched policy Andreas Herrmann
@ 2009-08-24 14:56   ` Peter Zijlstra
  2009-08-24 15:32     ` Vaidyanathan Srinivasan
  2009-08-25  6:24     ` Andreas Herrmann
  2009-08-26  9:30   ` Gautham R Shenoy
  1 sibling, 2 replies; 64+ messages in thread
From: Peter Zijlstra @ 2009-08-24 14:56 UTC (permalink / raw)
  To: Andreas Herrmann
  Cc: Ingo Molnar, linux-kernel, Gautham Shenoy, Srivatsa Vaddagiri,
	Dipankar Sarma, Balbir Singh, svaidy, Arun R Bharadwaj

On Thu, 2009-08-20 at 15:39 +0200, Andreas Herrmann wrote:
> Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
> ---

> +#ifdef CONFIG_SCHED_MN
> +	if (!err && mc_capable())
> +		err = sysfs_create_file(&cls->kset.kobj,
> +					&attr_sched_mn_power_savings.attr);
> +#endif

*sigh* another crappy sysfs file

Guys, can't we come up with anything better than sched_*_power_saving=n?

This configuration space is _way_ too large, and now it gets even
crazier.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 9/15] sched: Check sched_mn_power_savings when setting flags for CPU and MN domains
  2009-08-20 13:40 ` [PATCH 9/15] sched: Check sched_mn_power_savings when setting flags for CPU and MN domains Andreas Herrmann
@ 2009-08-24 14:57   ` Peter Zijlstra
  2009-08-25  9:34     ` Gautham R Shenoy
  2009-08-26 10:01   ` Gautham R Shenoy
  1 sibling, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2009-08-24 14:57 UTC (permalink / raw)
  To: Andreas Herrmann; +Cc: Ingo Molnar, linux-kernel, Gautham Shenoy

On Thu, 2009-08-20 at 15:40 +0200, Andreas Herrmann wrote:
> Use new function sd_balance_for_mn_power() and adapt
> sd_balance_for_package_power() and sd_power_saving_flags() for correct
> setting of flags SD_POWERSAVINGS_BALANCE and SD_BALANCE_NEWIDLE in CPU
> and MN domains.
> 
> Furthermore add flag SD_SHARE_PKG_RESOURCES to MN domain.
> Rationale: a multi-node processor most likely shares package resources
> (on Magny-Cours the package constitutes a "voltage domain").

IIRC SD_SHARE_PKG_RESOURCES plays games with the cpu_power of a
sched_domain, which breaks in all kinds of curious ways; this adds more
breakage afaict.

ego?

> Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
> ---
>  arch/x86/include/asm/topology.h |    3 ++-
>  include/linux/sched.h           |   14 ++++++++++++--
>  2 files changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> index 6d7d133..4a520b8 100644
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -198,7 +198,8 @@ static inline void setup_node_to_cpumask_map(void) { }
>  				| SD_BALANCE_EXEC	\
>  				| SD_WAKE_AFFINE	\
>  				| SD_WAKE_BALANCE	\
> -				| sd_balance_for_package_power()\
> +				| SD_SHARE_PKG_RESOURCES\
> +				| sd_balance_for_mn_power()\
>  				| sd_power_saving_flags(),\
>  	.last_balance		= jiffies,		\
>  	.balance_interval	= 1,			\
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 5755643..c53bdd8 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -844,9 +844,18 @@ static inline int sd_balance_for_mc_power(void)
>  	return 0;
>  }
>  
> +static inline int sd_balance_for_mn_power(void)
> +{
> +	if (sched_mc_power_savings || sched_smt_power_savings)
> +		return SD_POWERSAVINGS_BALANCE;
> +
> +	return 0;
> +}
> +
>  static inline int sd_balance_for_package_power(void)
>  {
> -	if (sched_mc_power_savings | sched_smt_power_savings)
> +	if (sched_mn_power_savings || sched_mc_power_savings ||
> +	    sched_smt_power_savings)
>  		return SD_POWERSAVINGS_BALANCE;
>  
>  	return 0;
> @@ -860,7 +869,8 @@ static inline int sd_balance_for_package_power(void)
>  
>  static inline int sd_power_saving_flags(void)
>  {
> -	if (sched_mc_power_savings | sched_smt_power_savings)
> +	if (sched_mn_power_savings || sched_mc_power_savings ||
> +	    sched_smt_power_savings)
>  		return SD_BALANCE_NEWIDLE;
>  
>  	return 0;

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 10/15] sched: Check for sched_mn_power_savings when doing load balancing
  2009-08-20 13:41 ` [PATCH 10/15] sched: Check for sched_mn_power_savings when doing load balancing Andreas Herrmann
@ 2009-08-24 15:03   ` Peter Zijlstra
  2009-08-24 15:40     ` Vaidyanathan Srinivasan
  0 siblings, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2009-08-24 15:03 UTC (permalink / raw)
  To: Andreas Herrmann; +Cc: Ingo Molnar, linux-kernel, Vaidyanathan Srinivasan

On Thu, 2009-08-20 at 15:41 +0200, Andreas Herrmann wrote:
> The patch adds support for POWERSAVINGS_BALANCE_BASIC for MN domain
> level. Currently POWERSAVINGS_BALANCE_WAKEUP is not used for MN domain.
> 
> (I have to admit that so far I don't have the correct understanding
> what's the benefit of POWERSAVINGS_BALANCE_WAKEUP (when a dedicated
> wakeup CPU is used) in contrast to POWERSAVINGS_BALANCE_BASIC.  I also
> have not found an example that would demonstrate the difference
> between those two powersaving levels.)

blame svaidy for not writing enough comments ;-)

iirc it moves tasks to sched_mc_preferred_wakeup_cpu instead of waking
an idle cpu, this leaves idle cpus idle longer at the cost of creating
overload on other cpus.

> Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
> ---
>  kernel/sched.c |    5 +++--
>  1 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index ebcda58..7a0d710 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -4591,7 +4591,8 @@ static int find_new_ilb(int cpu)
>  	 * Have idle load balancer selection from semi-idle packages only
>  	 * when power-aware load balancing is enabled
>  	 */
> -	if (!(sched_smt_power_savings || sched_mc_power_savings))
> +	if (!(sched_smt_power_savings || sched_mc_power_savings ||
> +	      sched_mn_power_savings))
>  		goto out_done;
>  
>  	/*
> @@ -4681,7 +4682,7 @@ int select_nohz_load_balancer(int stop_tick)
>  			int new_ilb;
>  
>  			if (!(sched_smt_power_savings ||
> -						sched_mc_power_savings))
> +			      sched_mc_power_savings || sched_mn_power_savings))
>  				return 1;
>  			/*
>  			 * Check to see if there is a more power-efficient

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 11/15] sched: Pass unlimited __cpu_power information to upper domain level groups
  2009-08-20 13:41 ` [PATCH 11/15] sched: Pass unlimited __cpu_power information to upper domain level groups Andreas Herrmann
@ 2009-08-24 15:21   ` Peter Zijlstra
  2009-08-24 16:44     ` Balbir Singh
  2009-08-25  8:51     ` Andreas Herrmann
  0 siblings, 2 replies; 64+ messages in thread
From: Peter Zijlstra @ 2009-08-24 15:21 UTC (permalink / raw)
  To: Andreas Herrmann
  Cc: Ingo Molnar, linux-kernel, Gautham Shenoy, svaidy, Balbir Singh

On Thu, 2009-08-20 at 15:41 +0200, Andreas Herrmann wrote:
> For performance reasons __cpu_power in a sched_group might be limited
> such that the group can handle only one task. To correctly calculate
> the capacity in upper domain level groups the unlimited power
> information is required. This patch stores unlimited __cpu_power
> information in sched_groups.orig_power and uses this when calculating
> __cpu_power in upper domain level groups.

OK, so this tries to fix the cpu_power wreckage?

ok, so let me try this with an example:


Suppose we have a dual-core with shared cache and SMT

  0-3     MC
0-1 2-3   SMT

Then both levels fancy setting SHARED_RESOURCES and both levels end up
normalizing the cpu_power to 1, so when we unplug cpu 2, load-balancing
gets all screwy because the whole system doesn't get normalized
properly.

What you propose here is every time we muck with cpu_power we keep the
real stuff in orig_power and use that to compute the level above.

Except you don't use it in the load-balancer proper, so normalization is
still hosed.

Its a creative solution, but I'd rather see cpu_power returned to a
straight sum of actual power to normalize the inter-cpu runqueue weights
and do the placement decision using something else.
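
To make that concrete, a toy userspace sketch (not kernel code; the
"load target proportional to cpu_power" rule is an assumption of what
a clean balancer would do). With cpu 2 unplugged, the clamped powers
make the 2-thread group and the lone cpu look equally capable:

#include <stdio.h>

#define SCHED_LOAD_SCALE 1024UL

/* balance target per group: share of total load proportional to the
 * group's reported cpu_power */
static void targets(const char *tag, unsigned long p0, unsigned long p1)
{
	unsigned long total_load  = 3 * SCHED_LOAD_SCALE;	/* 3 nice-0 tasks */
	unsigned long total_power = p0 + p1;

	printf("%s: group 0-1 gets %.2f tasks, cpu 3 gets %.2f tasks\n", tag,
	       (double)total_load * p0 / total_power / SCHED_LOAD_SCALE,
	       (double)total_load * p1 / total_power / SCHED_LOAD_SCALE);
}

int main(void)
{
	/* cpus 0-1 are SMT siblings, cpu 2 is unplugged, cpu 3 runs alone */
	targets("clamped", SCHED_LOAD_SCALE, SCHED_LOAD_SCALE);	/* 1.50/1.50 */
	targets("summed ", 2 * SCHED_LOAD_SCALE, SCHED_LOAD_SCALE);	/* 2.00/1.00 */
	return 0;
}

With summed powers the 2:1 split falls out naturally; with clamped
powers the balancer aims for a 50/50 split, which is exactly the
screwiness above.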

> Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
> ---
>  include/linux/sched.h |    8 +++++++-
>  kernel/sched.c        |   36 ++++++++++++++++++++++++------------
>  2 files changed, 31 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index c53bdd8..d230717 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -890,7 +890,13 @@ struct sched_group {
>  	 * (see include/linux/reciprocal_div.h)
>  	 */
>  	u32 reciprocal_cpu_power;
> -
> +	/*
> +	 * Backup of original power for this group.
> +	 * It is used to pass correct power information to upper
> +	 * domain level groups in case __cpu_power is limited for
> +	 * performance reasons.
> +	 */
> +	unsigned int orig_power;
>  	/*
>  	 * The CPUs this group covers.
>  	 *
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 7a0d710..464b6ba 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -8376,6 +8376,7 @@ static void init_numa_sched_groups_power(struct sched_group *group_head)
>  
>  			sg_inc_cpu_power(sg, sd->groups->__cpu_power);
>  		}
> +		sg->orig_power = sg->__cpu_power;
>  		sg = sg->next;
>  	} while (sg != group_head);
>  }
> @@ -8514,18 +8515,9 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
>  	child = sd->child;
>  
>  	sd->groups->__cpu_power = 0;
> -
> -	/*
> -	 * For perf policy, if the groups in child domain share resources
> -	 * (for example cores sharing some portions of the cache hierarchy
> -	 * or SMT), then set this domain groups cpu_power such that each group
> -	 * can handle only one task, when there are other idle groups in the
> -	 * same sched domain.
> -	 */
> -	if (!child || (!(sd->flags & SD_POWERSAVINGS_BALANCE) &&
> -		       (child->flags &
> -			(SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES)))) {
> +	if (!child) {
>  		sg_inc_cpu_power(sd->groups, SCHED_LOAD_SCALE);
> +		sd->groups->orig_power = sd->groups->__cpu_power;
>  		return;
>  	}
>  
> @@ -8534,9 +8526,29 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
>  	 */
>  	group = child->groups;
>  	do {
> -		sg_inc_cpu_power(sd->groups, group->__cpu_power);
> +		sg_inc_cpu_power(sd->groups, group->orig_power);
>  		group = group->next;
>  	} while (group != child->groups);
> +	sd->groups->orig_power = sd->groups->__cpu_power;
> +
> +	/*
> +	 * For perf policy, if the groups in child domain share resources
> +	 * (for example cores sharing some portions of the cache hierarchy
> +	 * or SMT), then set this domain groups cpu_power such that each group
> +	 * can handle only one task, when there are other idle groups in the
> +	 * same sched domain.
> +	 * Note: Unmodified power information is kept in orig_power and
> +	 *       can be used in higher domain levels to calculate
> +	 *       and reflect the correct capacity of a sched_group.
> +	 *       This is required for power_savings scheduling.
> +	 */
> +	if (!(sd->flags & SD_POWERSAVINGS_BALANCE) &&
> +	    ((child->flags &
> +	      (SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES)))) {
> +		sd->groups->__cpu_power = 0;
> +		sg_inc_cpu_power(sd->groups, SCHED_LOAD_SCALE);
> +	}
> +
>  }
>  
>  /*

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 8/15] sched: Add parameter sched_mn_power_savings to control MN domain sched policy
  2009-08-24 14:56   ` Peter Zijlstra
@ 2009-08-24 15:32     ` Vaidyanathan Srinivasan
  2009-08-24 15:45       ` Peter Zijlstra
  2009-08-25  7:50       ` Andreas Herrmann
  2009-08-25  6:24     ` Andreas Herrmann
  1 sibling, 2 replies; 64+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-08-24 15:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andreas Herrmann, Ingo Molnar, linux-kernel, Gautham Shenoy,
	Srivatsa Vaddagiri, Dipankar Sarma, Balbir Singh,
	Arun R Bharadwaj

* Peter Zijlstra <peterz@infradead.org> [2009-08-24 16:56:18]:

> On Thu, 2009-08-20 at 15:39 +0200, Andreas Herrmann wrote:
> > Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
> > ---
> 
> > +#ifdef CONFIG_SCHED_MN
> > +	if (!err && mc_capable())
> > +		err = sysfs_create_file(&cls->kset.kobj,
> > +					&attr_sched_mn_power_savings.attr);
> > +#endif
> 
> *sigh* another crappy sysfs file
> 
> Guys, can't we come up with anything better than sched_*_power_saving=n?
> 
> This configuration space is _way_ too large, and now it gets even
> crazier.

Hi Peter and Andreas,

Actually we had sched_power_savings and related simplifications, but
that did not really simplify the interface.

As for this multi-node MN stuff, Gautham had posted a better solution
to propagate the sched_mc flags without the need for a new sysfs file
and related changes.

Please take a look at: http://lkml.org/lkml/2009/3/31/137 and
http://lkml.org/lkml/2009/3/31/142 which actually degenerates the
domain.

However Andreas's requirements seem to indicate multiple nodes within
a single socket.  I did not yet completely understand that topology.
Some form of smart degeneration may save an additional tunable here.

Thanks for pointing me to this patch.

--Vaidy

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 12/15] sched: Allow NODE domain to be parent of MC instead of CPU domain
  2009-08-20 13:42 ` [PATCH 12/15] sched: Allow NODE domain to be parent of MC instead of CPU domain Andreas Herrmann
@ 2009-08-24 15:32   ` Peter Zijlstra
  2009-08-25  8:55     ` Andreas Herrmann
  0 siblings, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2009-08-24 15:32 UTC (permalink / raw)
  To: Andreas Herrmann; +Cc: Ingo Molnar, linux-kernel

On Thu, 2009-08-20 at 15:42 +0200, Andreas Herrmann wrote:
> The level of NODE domain's child domain is provided in s_data.numa_child_level.
> Then several adaptations are required when creating the domain hierarchy.
> In case NODE domain is parent of MC domain we have to:
> - limit NODE domains' span in sched_domain_node_span() to not exceed
>   corresponding topology_core_cpumask.
> - fix CPU domain span to cover entire cpu_map
> - fix CPU domain sched groups to cover entire physical groups instead of
>   covering a node (a node sched_group might be a proper subset of a CPU
>   sched_group).
> - use correct child domain in init_numa_sched_groups_power() when
>   calculating sched_group.__cpu_power in NODE domain
> - calculate group_power of NODE domain after its child domain
> 
> Note: As I have no idea when the ALLNODES domain is required
>       I assumed that an ALLNODES domain exists only if NODE domain
>       is parent of CPU domain.

I think its only used when the regular node level is too large, then we
split it into smaller bits. SGI folks who run crazy large machines use
this.

/me mumbles about renaming the domain level; CPU is the physical socket
level, right? stupid names.

Patch sounds funky though; numa_child_level should be evident from the
tree build.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 13/15] sched: Detect child domain of NUMA (aka NODE) domain
  2009-08-20 13:43 ` [PATCH 13/15] sched: Detect child domain of NUMA (aka NODE) domain Andreas Herrmann
@ 2009-08-24 15:34   ` Peter Zijlstra
  2009-08-25  9:13     ` Andreas Herrmann
  0 siblings, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2009-08-24 15:34 UTC (permalink / raw)
  To: Andreas Herrmann; +Cc: Ingo Molnar, linux-kernel

On Thu, 2009-08-20 at 15:43 +0200, Andreas Herrmann wrote:
> On multi-node processors a NUMA node might not span a socket.
> Instead a socket might span several NUMA nodes.
> 
> This patch introduces a check whether NODE domain is parent
> of MC domain and sets s_data.numa_child_level accordingly.
> (See previous patch for further details.)

right, except that the previous patch was rather cryptic :/

So you're proposing to have the NODE level depend on multi-node and then
flip NODE and CPU around?



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 14/15] sched: Conditionally limit __cpu_power when child sched domain has type NODE
  2009-08-20 13:45 ` [PATCH 14/15] sched: Conditionally limit __cpu_power when child sched domain has type NODE Andreas Herrmann
@ 2009-08-24 15:35   ` Peter Zijlstra
  2009-08-25  9:19     ` Andreas Herrmann
  0 siblings, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2009-08-24 15:35 UTC (permalink / raw)
  To: Andreas Herrmann; +Cc: Ingo Molnar, linux-kernel

On Thu, 2009-08-20 at 15:45 +0200, Andreas Herrmann wrote:
> We need this in case of the performance policy. All sched_groups in
> the child's parent domain (MN in this case) should be limited such that
> tasks are balanced among these sched_groups.

/me fails at correlating the above changelog and the below patch.

So here we go messing up cpu_power again in order to influence the
placement policy?

> Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
> ---
>  kernel/sched.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 0c950dc..ab88d88 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -8555,11 +8555,11 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
>  	 */
>  	if (!(sd->flags & SD_POWERSAVINGS_BALANCE) &&
>  	    ((child->flags &
> -	      (SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES)))) {
> +	      (SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES)) ||
> +	     (child->level == SD_LV_NODE))) {
>  		sd->groups->__cpu_power = 0;
>  		sg_inc_cpu_power(sd->groups, SCHED_LOAD_SCALE);
>  	}
> -
>  }
>  
>  /*

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 15/15] x86: Fix cpu_coregroup_mask to return correct cpumask on multi-node processors
  2009-08-20 13:46 ` [PATCH 15/15] x86: Fix cpu_coregroup_mask to return correct cpumask on multi-node processors Andreas Herrmann
@ 2009-08-24 15:36   ` Peter Zijlstra
  2009-08-24 18:21     ` Ingo Molnar
  2009-08-25  9:31     ` Andreas Herrmann
  0 siblings, 2 replies; 64+ messages in thread
From: Peter Zijlstra @ 2009-08-24 15:36 UTC (permalink / raw)
  To: Andreas Herrmann; +Cc: Ingo Molnar, linux-kernel

On Thu, 2009-08-20 at 15:46 +0200, Andreas Herrmann wrote:
> The correct mask that describes the core-siblings of a processor
> is topology_core_cpumask. See the topology adaptation patches, especially
> http://marc.info/?l=linux-kernel&m=124964999608179


argh, violence, murder kill.. this is the worst possible hack and you're
extending it :/

> Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
> ---
>  arch/x86/kernel/smpboot.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index f797214..f39bb2c 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -446,7 +446,7 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
>  	 * And for power savings, we return cpu_core_map
>  	 */
>  	if (sched_mc_power_savings || sched_smt_power_savings)
> -		return cpu_core_mask(cpu);
> +		return topology_core_cpumask(cpu);
>  	else
>  		return c->llc_shared_map;
>  }

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 10/15] sched: Check for sched_mn_power_savings when doing load balancing
  2009-08-24 15:03   ` Peter Zijlstra
@ 2009-08-24 15:40     ` Vaidyanathan Srinivasan
  2009-08-25  8:00       ` Andreas Herrmann
  0 siblings, 1 reply; 64+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-08-24 15:40 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Andreas Herrmann, Ingo Molnar, linux-kernel

* Peter Zijlstra <peterz@infradead.org> [2009-08-24 17:03:40]:

> On Thu, 2009-08-20 at 15:41 +0200, Andreas Herrmann wrote:
> > The patch adds support for POWERSAVINGS_BALANCE_BASIC for MN domain
> > level. Currently POWERSAVINGS_BALANCE_WAKEUP is not used for MN domain.
> > 
> > (I have to admit that so far I don't have the correct understanding
> > what's the benefit of POWERSAVINGS_BALANCE_WAKEUP (when a dedicated
> > wakeup CPU is used) in contrast to POWERSAVINGS_BALANCE_BASIC.  I also
> > have not found an example that would demonstrate the difference
> > between those two powersaving levels.)
> 
> blame svaidy for not writing enough comments ;-)

I am here to explain ;)

> iirc it moves tasks to sched_mc_preferred_wakeup_cpu instead of waking
> an idle cpu, this leaves idle cpus idle longer at the cost of creating
> overload on other cpus.

Yes, as Peter said, the POWERSAVINGS_BALANCE_WAKEUP biases task
wakeups to sched_mc_preferred_wakeup_cpu which has been nominated from
previous load balance loops.

Task wakeup biasing of sched_mc=2 works for most workloads like
kernbench and other sleeping tasks that come in and out of runqueue.
The default sched_mc=1 will work only for jobs running much longer
than the loadbalance interval or an almost 100% CPU-intensive job where
the load balancer can take time to identify the load pattern and
initiate a task migration.

The wakeup biasing (sched_mc=2) will help move bursty jobs faster and
statistically pack them in a single package and save power.
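
Roughly, as a standalone sketch (the decision rule below is a
paraphrase for illustration, not copied from sched.c, and
preferred_wakeup_cpu stands in for sched_mc_preferred_wakeup_cpu):

#include <stdio.h>

enum { POWERSAVINGS_BALANCE_NONE, POWERSAVINGS_BALANCE_BASIC,
       POWERSAVINGS_BALANCE_WAKEUP };

static int powersavings = POWERSAVINGS_BALANCE_WAKEUP;	/* sched_mc=2 */
static int preferred_wakeup_cpu = 2;	/* nominated by a prior balance pass */

/* pick a cpu for a waking task: at level 2, bias toward the nominated
 * semi-busy cpu instead of waking a cpu in an otherwise idle package */
static int select_wakeup_cpu(int idle_cpu)
{
	if (powersavings >= POWERSAVINGS_BALANCE_WAKEUP)
		return preferred_wakeup_cpu;
	return idle_cpu;	/* levels 0/1: spread to the idle cpu */
}

int main(void)
{
	printf("waking task placed on cpu %d\n", select_wakeup_cpu(5));
	return 0;
}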
 
> > Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
> > ---
> >  kernel/sched.c |    5 +++--
> >  1 files changed, 3 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index ebcda58..7a0d710 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -4591,7 +4591,8 @@ static int find_new_ilb(int cpu)
> >  	 * Have idle load balancer selection from semi-idle packages only
> >  	 * when power-aware load balancing is enabled
> >  	 */
> > -	if (!(sched_smt_power_savings || sched_mc_power_savings))
> > +	if (!(sched_smt_power_savings || sched_mc_power_savings ||
> > +	      sched_mn_power_savings))
> >  		goto out_done;
> >  
> >  	/*
> > @@ -4681,7 +4682,7 @@ int select_nohz_load_balancer(int stop_tick)
> >  			int new_ilb;
> >  
> >  			if (!(sched_smt_power_savings ||
> > -						sched_mc_power_savings))
> > +			      sched_mc_power_savings || sched_mn_power_savings))
> >  				return 1;
> >  			/*
> >  			 * Check to see if there is a more power-efficient


You can achieve the balancing effects by propagating the SD_ flags at
the right domain level with the same sysfs interface.  At some point
we wanted to change to sched_power_savings=N and set the flags
according to system topology to provide consolidation at the right
sched_domain and save power.
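
A standalone toy of that single-knob direction (the "N innermost
levels consolidate" semantics are an assumption for illustration, not
what the referenced patches implement):

#include <stdio.h>

#define SD_POWERSAVINGS_BALANCE 0x01	/* illustrative value */

enum level { LV_SMT, LV_MC, LV_MN, LV_NR };
static const char *lv_name[LV_NR] = { "SMT", "MC", "MN" };

/* one knob: N = number of levels, innermost first, that consolidate */
static unsigned int level_flags(int lv, int sched_power_savings)
{
	return lv < sched_power_savings ? SD_POWERSAVINGS_BALANCE : 0;
}

int main(void)
{
	int n = 2;	/* sched_power_savings=2: pack SMT and MC, spread MN */
	int lv;

	for (lv = LV_SMT; lv < LV_NR; lv++)
		printf("%s: flags=%#x\n", lv_name[lv], level_flags(lv, n));
	return 0;
}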

--Vaidy


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 8/15] sched: Add parameter sched_mn_power_savings to control MN domain sched policy
  2009-08-24 15:32     ` Vaidyanathan Srinivasan
@ 2009-08-24 15:45       ` Peter Zijlstra
  2009-08-25  7:52         ` Andreas Herrmann
  2009-08-25  7:50       ` Andreas Herrmann
  1 sibling, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2009-08-24 15:45 UTC (permalink / raw)
  To: svaidy
  Cc: Andreas Herrmann, Ingo Molnar, linux-kernel, Gautham Shenoy,
	Srivatsa Vaddagiri, Dipankar Sarma, Balbir Singh,
	Arun R Bharadwaj

On Mon, 2009-08-24 at 21:02 +0530, Vaidyanathan Srinivasan wrote:
> * Peter Zijlstra <peterz@infradead.org> [2009-08-24 16:56:18]:
> 
> > On Thu, 2009-08-20 at 15:39 +0200, Andreas Herrmann wrote:
> > > Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
> > > ---
> > 
> > > +#ifdef CONFIG_SCHED_MN
> > > +	if (!err && mc_capable())
> > > +		err = sysfs_create_file(&cls->kset.kobj,
> > > +					&attr_sched_mn_power_savings.attr);
> > > +#endif
> > 
> > *sigh* another crappy sysfs file
> > 
> > Guys, can't we come up with anything better than sched_*_power_saving=n?
> > 
> > This configuration space is _way_ too large, and now it gets even
> > crazier.
> 
> Hi Peter and Andreas,
> 
> Actually we had sched_power_savings and related simplifications, but
> that did not really simplify the interface.

Well, I prefer a single sched_power knob that either goes on or off.

A user really isn't interested in exploring a 3^3 configuration space
{PERF, POWER, POWER-WAKE-BALANCE} x {SMT, MC, MN} in order to find what
works best.

> As for this multi-node MN stuff, Gautham had posted a better solution
> to propagate the sched_mc flags without the need for a new sysfs file
> and related changes.
> 
> Please take a look at: http://lkml.org/lkml/2009/3/31/137 and
> http://lkml.org/lkml/2009/3/31/142 which actually degenerates the
> domain.

Ah, right, that got lost in my inbox :/ Let me go read those too.

> However Andreas's requirements seem to indicate multiple nodes within
> a single socket.  I did not yet completely understand that topology.
> Some form of smart degeneration may save an additional tunable here.

Yes, apparently AMD is going to put multiple nodes in a single socket,
not sure how they do that. Andreas, do these chips have multiple memory
busses?

I was thinking chips were pin constrained and wouldn't add a whole
second memory interface to the package, but what do I know...



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 11/15] sched: Pass unlimited __cpu_power information to upper domain level groups
  2009-08-24 15:21   ` Peter Zijlstra
@ 2009-08-24 16:44     ` Balbir Singh
  2009-08-24 17:26       ` Peter Zijlstra
  2009-08-25  8:51     ` Andreas Herrmann
  1 sibling, 1 reply; 64+ messages in thread
From: Balbir Singh @ 2009-08-24 16:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andreas Herrmann, Ingo Molnar, linux-kernel, Gautham Shenoy, svaidy

* Peter Zijlstra <peterz@infradead.org> [2009-08-24 17:21:37]:

> On Thu, 2009-08-20 at 15:41 +0200, Andreas Herrmann wrote:
> > For performance reasons __cpu_power in a sched_group might be limited
> > such that the group can handle only one task. To correctly calculate
> > the capacity in upper domain level groups the unlimited power
> > information is required. This patch stores unlimited __cpu_power
> > information in sched_groups.orig_power and uses this when calculating
> > __cpu_power in upper domain level groups.
> 
> OK, so this tries to fix the cpu_power wreckage?
> 
> ok, so let me try this with an example:
> 
> 
> Suppose we have a dual-core with shared cache and SMT
> 
>   0-3     MC
> 0-1 2-3   SMT
> 
> Then both levels fancy setting SHARED_RESOURCES and both levels end up
> normalizing the cpu_power to 1, so when we unplug cpu 2, load-balancing
> gets all screwy because the whole system doesn't get normalized
> properly.
> 
> What you propose here is every time we muck with cpu_power we keep the
> real stuff in orig_power and use that to compute the level above.
> 
> Except you don't use it in the load-balancer proper, so normalization is
> still hosed.
> 
> Its a creative solution, but I'd rather see cpu_power returned to a
> straight sum of actual power to normalize the inter-cpu runqueue weights
> and do the placement decision using something else.

The real solution is to find a way to solve asymmetric load balancing,
I suppose. The asymmetry might be due to cores being hot-plugged for
example

-- 
	Balbir

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 11/15] sched: Pass unlimited __cpu_power information to upper domain level groups
  2009-08-24 16:44     ` Balbir Singh
@ 2009-08-24 17:26       ` Peter Zijlstra
  2009-08-24 18:19         ` Balbir Singh
  0 siblings, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2009-08-24 17:26 UTC (permalink / raw)
  To: balbir
  Cc: Andreas Herrmann, Ingo Molnar, linux-kernel, Gautham Shenoy, svaidy

On Mon, 2009-08-24 at 22:14 +0530, Balbir Singh wrote:
> * Peter Zijlstra <peterz@infradead.org> [2009-08-24 17:21:37]:
> 
> > On Thu, 2009-08-20 at 15:41 +0200, Andreas Herrmann wrote:
> > > For performance reasons __cpu_power in a sched_group might be limited
> > > such that the group can handle only one task. To correctly calculate
> > > the capacity in upper domain level groups the unlimited power
> > > information is required. This patch stores unlimited __cpu_power
> > > information in sched_groups.orig_power and uses this when calculating
> > > __cpu_power in upper domain level groups.
> > 
> > OK, so this tries to fix the cpu_power wreckage?
> > 
> > ok, so let me try this with an example:
> > 
> > 
> > Suppose we have a dual-core with shared cache and SMT
> > 
> >   0-3     MC
> > 0-1 2-3   SMT
> > 
> > Then both levels fancy setting SHARED_RESOURCES and both levels end up
> > normalizing the cpu_power to 1, so when we unplug cpu 2, load-balancing
> > gets all screwy because the whole system doesn't get normalized
> > properly.
> > 
> > What you propose here is every time we muck with cpu_power we keep the
> > real stuff in orig_power and use that to compute the level above.
> > 
> > Except you don't use it in the load-balancer proper, so normalization is
> > still hosed.
> > 
> > Its a creative solution, but I'd rather see cpu_power returned to a
> > straight sum of actual power to normalize the inter-cpu runqueue weights
> > and do the placement decision using something else.
> 
> The real solution is to find a way to solve asymmetric load balancing,
> I suppose. The asymmetry might be due to cores being hot-plugged for
> example

No, the solution is to not use cpu_power for placement and use it for
normalization of the weight only. That would make the asym work by
definition.

The real fun comes when we then introduce dynamic cpu_power based on
feedback from things like aperf/mperf ratios for SMT and feedback from
the RT scheduler.

The trouble is that cpu_power is now abused for placement decisions too,
and that needs to be taken out.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 11/15] sched: Pass unlimited __cpu_power information to upper domain level groups
  2009-08-24 17:26       ` Peter Zijlstra
@ 2009-08-24 18:19         ` Balbir Singh
  2009-08-25  7:11           ` Peter Zijlstra
  0 siblings, 1 reply; 64+ messages in thread
From: Balbir Singh @ 2009-08-24 18:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andreas Herrmann, Ingo Molnar, linux-kernel, Gautham Shenoy, svaidy

* Peter Zijlstra <peterz@infradead.org> [2009-08-24 19:26:50]:

> On Mon, 2009-08-24 at 22:14 +0530, Balbir Singh wrote:
> > * Peter Zijlstra <peterz@infradead.org> [2009-08-24 17:21:37]:
> > 
> > > On Thu, 2009-08-20 at 15:41 +0200, Andreas Herrmann wrote:
> > > > For performance reasons __cpu_power in a sched_group might be limited
> > > > such that the group can handle only one task. To correctly calculate
> > > > the capacity in upper domain level groups the unlimited power
> > > > information is required. This patch stores unlimited __cpu_power
> > > > information in sched_groups.orig_power and uses this when calculating
> > > > __cpu_power in upper domain level groups.
> > > 
> > > OK, so this tries to fix the cpu_power wreckage?
> > > 
> > > ok, so let me try this with an example:
> > > 
> > > 
> > > Suppose we have a dual-core with shared cache and SMT
> > > 
> > >   0-3     MC
> > > 0-1 2-3   SMT
> > > 
> > > Then both levels fancy setting SHARED_RESOURCES and both levels end up
> > > normalizing the cpu_power to 1, so when we unplug cpu 2, load-balancing
> > > gets all screwy because the whole system doesn't get normalized
> > > properly.
> > > 
> > > What you propose here is every time we muck with cpu_power we keep the
> > > real stuff in orig_power and use that to compute the level above.
> > > 
> > > Except you don't use it in the load-balancer proper, so normalization is
> > > still hosed.
> > > 
> > > Its a creative solution, but I'd rather see cpu_power returned to a
> > > straight sum of actual power to normalize the inter-cpu runqueue weights
> > > and do the placement decision using something else.
> > 
> > The real solution is to find a way to solve asymmetric load balancing,
> > I suppose. The asymmetry might be due to cores being hot-plugged for
> > example
> 
> No, the solution is to not use cpu_power for placement and use it for
> normalization of the weight only. That would make the asym work by
> definition.
> 
> The real fun comes when we then introduce dynamic cpu_power based on
> feedback from things like aperf/mperf ratios for SMT and feedback from
> the RT scheduler.
>

That reminds me, accounting is currently broken and should be based on
APERF/MPERF (Power gets it right - based on SPURR).
 
> The trouble is that cpu_power is now abused for placement decisions too,
> and that needs to be taken out.

OK.. so you propose extending the static cpu_power to dynamic
cpu_power but based on current topology?

-- 
	Balbir

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 15/15] x86: Fix cpu_coregroup_mask to return correct cpumask on multi-node processors
  2009-08-24 15:36   ` Peter Zijlstra
@ 2009-08-24 18:21     ` Ingo Molnar
  2009-08-25 10:13       ` Andreas Herrmann
  2009-08-25  9:31     ` Andreas Herrmann
  1 sibling, 1 reply; 64+ messages in thread
From: Ingo Molnar @ 2009-08-24 18:21 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Andreas Herrmann, linux-kernel


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, 2009-08-20 at 15:46 +0200, Andreas Herrmann wrote:
> > The correct mask that describes the core-siblings of a processor
> > is topology_core_cpumask. See the topology adaptation patches, especially
> > http://marc.info/?l=linux-kernel&m=124964999608179
> 
> argh, violence, murder kill.. this is the worst possible hack and 
> you're extending it :/

I think most of the trouble here comes from having inconsistent 
names, a rather static structure for sched-domains setup and then we 
are confusing things back and forth.

Right now we have thread/sibling, core, CPU/socket and node, with 
many data structures around these hardcoded. Certain scheduler 
features only operate on the hardcoded fields.

Now Magny-Cours adds a socket-internal node construct to the whole 
thing, names it randomly and basically breaks the semi-static 
representation.

We cannot just flip around our static names and hope it goes well 
and everything just drops into place. Everything just falls apart 
really instead.

Instead we should have an arch-defined tree and a CPU architecture 
dependent ASCII name associated with each level - but not hardcoded 
into the scheduler.

Plus we should have independent scheduler domain feature flags that 
can be turned on/off at various levels of that tree, depending on 
the cache and interconnect properties of the hardware - without 
having to worry about what the ASCII name says. Those features 
should be capable of working not just on the lowest level of the 
tree, but on higher levels too, regardless of whether that level is 
called a 'core', a 'socket' or an 'internal node' on the ASCII level 
really.

This is why i insisted on handling the Magny-Cours topology 
discovery and enumeration patches together with the scheduler 
patches. It can easily become a mess if extended.
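
A minimal userspace sketch of what i mean (all names and flag
placements below are illustrative, not an existing kernel API):

#include <stdio.h>

/* illustrative flag bits, not the kernel's actual values */
#define SD_SHARE_CPUPOWER	0x01
#define SD_SHARE_PKG_RESOURCES	0x02

struct sched_level_desc {
	const char *name;	/* arch-chosen ASCII name */
	unsigned int sd_flags;	/* features enabled at this level */
};

/* a Magny-Cours-like tree, innermost level first */
static const struct sched_level_desc topo[] = {
	{ "core",          SD_SHARE_PKG_RESOURCES },	/* cores share an L3 */
	{ "internal-node", 0 },
	{ "socket",        0 },
	{ "numa",          0 },
};

int main(void)
{
	unsigned int i;

	for (i = 0; i < sizeof(topo) / sizeof(topo[0]); i++)
		printf("level %u: %-13s flags=%#x\n", i, topo[i].name,
		       topo[i].sd_flags);
	return 0;
}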

	Ingo

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 8/15] sched: Add parameter sched_mn_power_savings to control MN domain sched policy
  2009-08-24 14:56   ` Peter Zijlstra
  2009-08-24 15:32     ` Vaidyanathan Srinivasan
@ 2009-08-25  6:24     ` Andreas Herrmann
  2009-08-25  6:41       ` Peter Zijlstra
  1 sibling, 1 reply; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-25  6:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Gautham Shenoy, Srivatsa Vaddagiri,
	Dipankar Sarma, Balbir Singh, svaidy, Arun R Bharadwaj

On Mon, Aug 24, 2009 at 04:56:18PM +0200, Peter Zijlstra wrote:
> On Thu, 2009-08-20 at 15:39 +0200, Andreas Herrmann wrote:
> > Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
> > ---
> 
> > +#ifdef CONFIG_SCHED_MN
> > +	if (!err && mc_capable())
> > +		err = sysfs_create_file(&cls->kset.kobj,
> > +					&attr_sched_mn_power_savings.attr);
> > +#endif
> 
> *sigh* another crappy sysfs file
> 
> Guys, can't we come up with anything better than sched_*_power_saving=n?

Thought this was a settled thing. At least there are already two
such parameters. So using the existing convention is an obvious
thing, no?
 
> This configuration space is _way_ too large, and now it gets even
> crazier.

I don't fully agree.

Having one control interface for each domain level is just one
approach. It gives the user full control of scheduling policies.
It just might have to be properly documented.

In another mail Vaidy mentioned that

  "at some point we wanted to change the interface to
   sched_power_savings=N and set the flags according to system
   topology".

But how will you decide at which domain level you have to do power
savings scheduling?

Using sched_mn_power_savings=1 is quite different from
sched_smt_power_savings=1. You probably save the most power if you
switch on power saving scheduling at each domain level, i.e. first
filling threads of one core, then filling all cores on one internal
node, then filling all internal nodes of one socket.

But for performance reasons a user might just want to use power
savings in the MN domain. How would you allow the user to configure that
with just one interface? By passing the domain level to
sched_power_savings, e.g. sched_power_savings=MC instead of the power
saving level?

Besides that, don't we have to keep the user-interface stable, i.e.
stick to sched_smt_power_savings and sched_mc_power_savings?


Regards,
Andreas

-- 
Operating | Advanced Micro Devices GmbH
  System  | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. München, Germany
 Research | Geschäftsführer: Thomas M. McCoy, Giuliano Meroni
  Center  | Sitz: Dornach, Gemeinde Aschheim, Landkreis München
  (OSRC)  | Registergericht München, HRB Nr. 43632



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 8/15] sched: Add parameter sched_mn_power_savings to control MN domain sched policy
  2009-08-25  6:24     ` Andreas Herrmann
@ 2009-08-25  6:41       ` Peter Zijlstra
  2009-08-25  8:38         ` Andreas Herrmann
  0 siblings, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2009-08-25  6:41 UTC (permalink / raw)
  To: Andreas Herrmann
  Cc: Ingo Molnar, linux-kernel, Gautham Shenoy, Srivatsa Vaddagiri,
	Dipankar Sarma, Balbir Singh, svaidy, Arun R Bharadwaj

On Tue, 2009-08-25 at 08:24 +0200, Andreas Herrmann wrote:
> On Mon, Aug 24, 2009 at 04:56:18PM +0200, Peter Zijlstra wrote:
> > On Thu, 2009-08-20 at 15:39 +0200, Andreas Herrmann wrote:
> > > Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
> > > ---
> > 
> > > +#ifdef CONFIG_SCHED_MN
> > > +	if (!err && mc_capable())
> > > +		err = sysfs_create_file(&cls->kset.kobj,
> > > +					&attr_sched_mn_power_savings.attr);
> > > +#endif
> > 
> > *sigh* another crappy sysfs file
> > 
> > Guys, can't we come up with anything better than sched_*_power_saving=n?
> 
> Thought this was a settled thing. At least there are already two
> such parameters. So using the existing convention is an obvious
> thing, no?

Well, yes its the obvious thing, but I'm questioning whether its the
best thing ;-)

> > This configuration space is _way_ too large, and now it gets even
> > crazier.
> 
> I don't fully agree.
> 
> Having one control interface for each domain level is just one
> approach. It gives the user full control of scheduling policies.
> It just might have to be properly documented.
> 
> In another mail Vaidy mentioned that
> 
>   "at some point we wanted to change the interface to
>    sched_power_savings=N and set the flags according to system
>    topology".
> 
> But how will you decide at which domain level you have to do power
> savings scheduling?

The user isn't interested in knowing about domains and cpu topology in
99% of the cases; all they want is the machine not burning power like
there's no tomorrow.

Users (me included) have no interest in exploring a 27-state power
configuration space in order to find out what works best for them; I'd
throw up my hands and not bother, really.

> Using sched_mn_power_savings=1 is quite different from
> sched_smt_power_savings=1. You probably save the most power if you
> switch on power saving scheduling at each domain level, i.e. first
> filling threads of one core, then filling all cores on one internal
> node, then filling all internal nodes of one socket.
> 
> But for performance reasons a user might just want to use power
> savings in the MN domain. How would you allow the user to configure that
> with just one interface? By passing the domain level to
> sched_power_savings, e.g. sched_power_savings=MC instead of the power
> saving level?

Sure its different, it reduces the configuration space, that gives less
choice, but does make it accessible.

Ask joe-admin what he prefers.

If you're really really worried people might miss the joy of fine tuning
their power scheduling, then we can provide a dual interface, one for
dumb people like me, and one for crazy people like you ;-)

> Besides that, don't we have to keep the user-interface stable, i.e.
> stick to sched_smt_power_savings and sched_mc_power_savings?

Don't ever defend crappy stuff with interface stability, that's just
lame ;-)

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 11/15] sched: Pass unlimited __cpu_power information to upper domain level groups
  2009-08-24 18:19         ` Balbir Singh
@ 2009-08-25  7:11           ` Peter Zijlstra
  2009-08-25  8:04             ` Balbir Singh
  0 siblings, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2009-08-25  7:11 UTC (permalink / raw)
  To: balbir
  Cc: Andreas Herrmann, Ingo Molnar, linux-kernel, Gautham Shenoy, svaidy

On Mon, 2009-08-24 at 23:49 +0530, Balbir Singh wrote:

> That reminds me, accounting is currently broken and should be based on
> APERF/MPERF (Power gets it right - based on SPURR).

What accounting?

> > The trouble is that cpu_power is now abused for placement decisions too,
> > and that needs to be taken out.
> 
> OK.. so you propose extending the static cpu_power to dynamic
> cpu_power but based on current topology?

Right, so cpu_power is primarily used to normalize domain weight in the
load-balancer.

Suppose a 4 core machine with 1 unplugged core:

 0,1,3

0,1  3

The sd-0,1 will have cpu_power 2048, while the sd-3 will have 1024; this
allows find_busiest_group() for sd-0,1,3 to pick the one which is
relatively most overloaded.

Supposing 3, 2, 2 (nice0) tasks on these cores, the domain weight of
sd-0,1 is 5*1024 and sd-3 is 2*1024; normalized, that becomes 5/2 and 2
resp., which clearly shows sd-0,1 to be the busiest of the pair.
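
The same arithmetic as a standalone toy (userspace C; avg_load is
group_load * SCHED_LOAD_SCALE / cpu_power, loosely after
find_busiest_group(), with the numbers from the example above):

#include <stdio.h>

#define SCHED_LOAD_SCALE 1024UL

int main(void)
{
	/* sd-0,1: two cores, 5 nice-0 tasks; sd-3: one core, 2 tasks */
	unsigned long power[2] = { 2 * SCHED_LOAD_SCALE, SCHED_LOAD_SCALE };
	unsigned long load[2]  = { 5 * SCHED_LOAD_SCALE, 2 * SCHED_LOAD_SCALE };
	int i;

	for (i = 0; i < 2; i++)
		printf("group %d: avg_load %lu (%.2f tasks per cpu)\n", i,
		       load[i] * SCHED_LOAD_SCALE / power[i],
		       (double)load[i] / power[i]);
	return 0;
}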

Now back in the days Nick wrote all this, he did the cpu_power hack for
SMT which sets the combined cpu_power of 2 threads (that's all we had
back then) to 1024, because two threads share 1 core, and are roughly as
fast.

He then also used this to influence task placement, preferring to move
tasks to another sibling domain before getting the second thread active,
this worked.

Then multi-core with shared caches came along and people did the same
trick for mc power save in order to get that placement stuff, but that
horribly broke the load-balancer normalization.

Now comes multi-node, and people asking for more elaborate placement
strategies and all this starts creaking like a ghost house about to
collapse.

Therefore I want cpu_power back to load normalization only, and do the
placement stuff with something else.

Once cpu_power is pure again, we can start making it dynamic: for SMT we
can utilize APERF/MPERF to guesstimate the actual work capacity of
threads, and scale cpu_power back based on RT time used on the cpu.

Then when we walk the domain tree for load-balancing we re-do the
cpu_power sum, etc..
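
Roughly along these lines (a userspace sketch from memory; the MSR
plumbing is omitted and the scaling rule is invented for
illustration):

#include <stdio.h>
#include <stdint.h>

#define SCHED_LOAD_SCALE 1024UL

/* aperf/mperf deltas approximate delivered vs. reference cycles;
 * rt_avg is the share of the period consumed by RT tasks */
static unsigned long dynamic_cpu_power(uint64_t aperf, uint64_t mperf,
				       unsigned long rt_avg,
				       unsigned long period)
{
	unsigned long power = SCHED_LOAD_SCALE * aperf / mperf;

	power = power * (period - rt_avg) / period;
	return power ? power : 1;	/* never report zero capacity */
}

int main(void)
{
	/* an SMT sibling running at ~60% effective speed, RT eating 25% */
	printf("cpu_power = %lu\n", dynamic_cpu_power(600, 1000, 256, 1024));
	return 0;
}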



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 8/15] sched: Add parameter sched_mn_power_savings to control MN domain sched policy
  2009-08-24 15:32     ` Vaidyanathan Srinivasan
  2009-08-24 15:45       ` Peter Zijlstra
@ 2009-08-25  7:50       ` Andreas Herrmann
  1 sibling, 0 replies; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-25  7:50 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Gautham Shenoy,
	Srivatsa Vaddagiri, Dipankar Sarma, Balbir Singh,
	Arun R Bharadwaj

On Mon, Aug 24, 2009 at 09:02:29PM +0530, Vaidyanathan Srinivasan wrote:
> * Peter Zijlstra <peterz@infradead.org> [2009-08-24 16:56:18]:
> 
> > On Thu, 2009-08-20 at 15:39 +0200, Andreas Herrmann wrote:
> > > Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
> > > ---
> > 
> > > +#ifdef CONFIG_SCHED_MN
> > > +	if (!err && mc_capable())
> > > +		err = sysfs_create_file(&cls->kset.kobj,
> > > +					&attr_sched_mn_power_savings.attr);
> > > +#endif
> > 
> > *sigh* another crappy sysfs file
> > 
> > Guys, can't we come up with anything better than sched_*_power_saving=n?
> > 
> > This configuration space is _way_ too large, and now it gets even
> > crazier.
> 
> Hi Peter and Andreas,
> 
> Actually we had sched_power_savings and related simplifications, but
> that did not really simplify the interface.
> 
> As for this multi-node MN stuff, Gautham had posted a better solution
> to propagate the sched_mc flags without the need for a new sysfs file
> and related changes.

For Magny-Cours it might be sufficient to just propagate the sched_mc
flags into the MN domain. But then the MC domain shouldn't use the
sched_mc power saving flag (for performance reasons). But I don't know
what other multi-node CPUs we will have in the future. So on an
abstract level power savings scheduling in sched_mc is not equal to
power savings scheduling on sched_mn.

> Please take a look at: http://lkml.org/lkml/2009/3/31/137 and
> http://lkml.org/lkml/2009/3/31/142 which actually degenerates the
> domain.
> 
> However Andreas's requirement seem to indicate multiple nodes within
> a single socket.  I did not yet completely understand that topology.
> Some for of smart degeneration may save an additional tunable here.

I doubt that a "form of smart degeneration" would help. You want to
have sched_groups for each internal node and you most likely always
want to do balancing between those two. But in case of power saving
scheduling you want to utilize an entire socket before you deploy
another socket. Hence I think, in the end, "smart degeneration" would
rather mean some sort of hackery which won't make the code easier.

> Thanks for pointing me to this patch.

Ditto.
I'll have a look at your power savings simplifications asap.


Thanks,

Andreas

-- 
Operating | Advanced Micro Devices GmbH
  System  | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. München, Germany
 Research | Geschäftsführer: Thomas M. McCoy, Giuliano Meroni
  Center  | Sitz: Dornach, Gemeinde Aschheim, Landkreis München
  (OSRC)  | Registergericht München, HRB Nr. 43632



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 8/15] sched: Add parameter sched_mn_power_savings to control MN domain sched policy
  2009-08-24 15:45       ` Peter Zijlstra
@ 2009-08-25  7:52         ` Andreas Herrmann
  0 siblings, 0 replies; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-25  7:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: svaidy, Ingo Molnar, linux-kernel, Gautham Shenoy,
	Srivatsa Vaddagiri, Dipankar Sarma, Balbir Singh

On Mon, Aug 24, 2009 at 05:45:14PM +0200, Peter Zijlstra wrote:
> On Mon, 2009-08-24 at 21:02 +0530, Vaidyanathan Srinivasan wrote:
> > * Peter Zijlstra <peterz@infradead.org> [2009-08-24 16:56:18]:
> > 
> > > On Thu, 2009-08-20 at 15:39 +0200, Andreas Herrmann wrote:
> > > > Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
> > > > ---
> > > 
> > > > +#ifdef CONFIG_SCHED_MN
> > > > +	if (!err && mc_capable())
> > > > +		err = sysfs_create_file(&cls->kset.kobj,
> > > > +					&attr_sched_mn_power_savings.attr);
> > > > +#endif
> > > 
> > > *sigh* another crappy sysfs file
> > > 
> > > Guys, can't we come up with anything better than sched_*_power_saving=n?
> > > 
> > > This configuration space is _way_ too large, and now it gets even
> > > crazier.
> > 
> > Hi Peter and Andreas,
> > 
> > Actually we had sched_power_savings and related simplifications, but
> > that did not really simplify the interface.
> 
> Well, I prefer a single sched_power knob that either goes on or off.

IMHO all options that are selectable at the moment have to map to that
single knob. One user might want to fill one socket for power savings
but still want to balance tasks between the internal nodes. Another
user wants to have highest possible power savings and likes to see all
threads utilized before another core is used, even if the FPU, cache,
or whatever else is shared between threads on the same core.

> A user really isn't interested in exploring a 3^3 configuration space
> {PERF, POWER, POWER-WAKE-BALANCE} x {SMT, MC, MN} in order to find what
> works best.

Why not just give the average user some hints about what he should
select on his machine, but still let power users decide for themselves
what best fits their purpose, and provide means/knobs to select what
they want?

> > As for this mulit-node MN stuff, Gautham had posted a better solution
> > to propagate the sched_mc flags without need for new sysfs file and
> > related changes.

> > Please take a look at: http://lkml.org/lkml/2009/3/31/137 and
> > http://lkml.org/lkml/2009/3/31/142 which actually degenerates the
> > domain.
> 
> Ah, right, that got lost in my inbox :/ Let me go read those too.
> 
> > However Andreas's requirements seem to indicate multiple nodes within
> > a single socket.  I did not yet completely understand that topology.
> > Some form of smart degeneration may save an additional tunable here.

> Yes, apparently AMD is going to put multiple nodes in a single socket,
> not sure how they do that, Andreas do these chips have multiple memory
> busses?

In contrast to current AMD processors (supporting two DRAM channels
per socket), Magny-Cours (a new package type) has four DRAM channels.

> I was thinking chips were pin constrained and wouldn't add a whole
> second memory interface to the package, but what do I know...

The two additional channels of a Magny-Cours processor of course
require additional pins.


Regards,

Andreas

-- 
Operating | Advanced Micro Devices GmbH
  System  | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. München, Germany
 Research | Geschäftsführer: Thomas M. McCoy, Giuliano Meroni
  Center  | Sitz: Dornach, Gemeinde Aschheim, Landkreis München
  (OSRC)  | Registergericht München, HRB Nr. 43632



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 10/15] sched: Check for sched_mn_power_savings when doing load balancing
  2009-08-24 15:40     ` Vaidyanathan Srinivasan
@ 2009-08-25  8:00       ` Andreas Herrmann
  0 siblings, 0 replies; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-25  8:00 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan; +Cc: Peter Zijlstra, Ingo Molnar, linux-kernel

On Mon, Aug 24, 2009 at 09:10:13PM +0530, Vaidyanathan Srinivasan wrote:
> * Peter Zijlstra <peterz@infradead.org> [2009-08-24 17:03:40]:
> 
> > On Thu, 2009-08-20 at 15:41 +0200, Andreas Herrmann wrote:
> > > The patch adds support for POWERSAVINGS_BALANCE_BASIC for MN domain
> > > level. Currently POWERSAVINGS_BALANCE_WAKEUP is not used for MN domain.
> > > 
> > > (I have to admit that so far I don't correctly understand the
> > > benefit of POWERSAVINGS_BALANCE_WAKEUP (when a dedicated wakeup
> > > CPU is used) in contrast to POWERSAVINGS_BALANCE_BASIC.  I also
> > > have not found an example that would demonstrate the difference
> > > between those two powersaving levels.)
> > 
> > blame svaidy for not writing enough comments ;-)
> 
> I am here to explain ;)
> 
> > iirc it moves tasks to sched_mc_preferred_wakeup_cpu instead of waking
> > an idle cpu; this leaves idle cpus idle longer at the cost of creating
> > overload on other cpus.
> 
> Yes, as Peter said, POWERSAVINGS_BALANCE_WAKEUP biases task
> wakeups to sched_mc_preferred_wakeup_cpu, which has been nominated
> in previous load balance loops.
> 
> Task wakeup biasing of sched_mc=2 works for most workloads like
> kernbench and other sleeping tasks that come in and out of the runqueue.
> The default sched_mc=1 will work only for jobs running much longer
> than the loadbalance interval or almost 100% CPU intensive jobs where

Ok, one of my tests used 100% CPU intensive jobs, and for those the
sched_mc=1 or sched_mn=1 level was sufficient to show the effect of
load balancing.

> the load balancer can take time to identify the load pattern and
> initiate a task migration.

> The wakeup biasing (sched_mc=2) will help move bursty jobs faster and
> statistically pack them in a single package and save power.

That means that wakeup biasing will also make sense for the MN domain.
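
For reference, a rough sketch (plain C, purely illustrative -- the
names are made up, this is not the kernel implementation) of what the
wakeup biasing amounts to:

/*
 * With POWERSAVINGS_BALANCE_WAKEUP (level 2), a waking task is
 * directed to the CPU nominated by previous load-balance passes
 * instead of to an idle CPU on another package.
 */
static int bias_wakeup_cpu(int default_cpu, int preferred_wakeup_cpu,
			   int power_savings_level)
{
	if (power_savings_level >= 2 && preferred_wakeup_cpu >= 0)
		return preferred_wakeup_cpu;	/* pack the busy package */
	return default_cpu;			/* normal wakeup placement */
}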


Thanks,

Andreas

-- 
Operating | Advanced Micro Devices GmbH
  System  | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. München, Germany
 Research | Geschäftsführer: Thomas M. McCoy, Giuliano Meroni
  Center  | Sitz: Dornach, Gemeinde Aschheim, Landkreis München
  (OSRC)  | Registergericht München, HRB Nr. 43632



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 11/15] sched: Pass unlimited __cpu_power information to upper domain level groups
  2009-08-25  7:11           ` Peter Zijlstra
@ 2009-08-25  8:04             ` Balbir Singh
  2009-08-25  8:30               ` Peter Zijlstra
  0 siblings, 1 reply; 64+ messages in thread
From: Balbir Singh @ 2009-08-25  8:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andreas Herrmann, Ingo Molnar, linux-kernel, Gautham Shenoy, svaidy

* Peter Zijlstra <peterz@infradead.org> [2009-08-25 09:11:14]:

> On Mon, 2009-08-24 at 23:49 +0530, Balbir Singh wrote:
> 
> > That reminds me, accounting is currently broken and should be based on
> > APERF/MPERF (Power gets it right - based on SPURR).
> 
> What accounting?
> 


We need scaled time accounting for x86 (see *timescaled). By scaled
accounting I mean the ratio APERF/MPERF.
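
To make that concrete, a rough sketch (plain C, purely illustrative --
not a kernel interface, and ignoring 64-bit overflow for clarity).
APERF counts cycles actually delivered while MPERF counts at the
nominal frequency, so the delta ratio turns wall-clock runtime into a
measure of work done:

#include <stdint.h>

static uint64_t scale_runtime(uint64_t delta_exec_ns,
			      uint64_t delta_aperf, uint64_t delta_mperf)
{
	if (!delta_mperf)	/* avoid division by zero */
		return delta_exec_ns;
	return delta_exec_ns * delta_aperf / delta_mperf;
}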

> > > The trouble is that cpu_power is now abused for placement decisions too,
> > > and that needs to be taken out.
> > 
> > OK.. so you propose extending the static cpu_power to dynamic
> > cpu_power but based on current topology?
> 
> Right, so cpu_power is primarily used to normalize domain weight in the
> load-balancer.
> 
> Suppose a 4 core machine with 1 unplugged core:
> 
>  0,1,3
> 
> 0,1  3
> 
> The sd-0,1 will have cpu_power 2048, while the sd-3 will have 1024, this
> allowed find_busiest_group() for sd-0,1,3 to pick the one which is
> relatively most overloaded.
> 
> Supposing 3, 2, 2 (nice0) tasks on these cores, the domain weight of
> sd-0,1 is 5*1024 and sd-3 is 2*1024, normalized that becomes 5/2 and 2
> resp. which clearly shows sd-0,1 to be the busiest of the pair.
> 
> Now back in the days Nick wrote all this, he did the cpu_power hack for
> SMT which sets the combined cpu_power of 2 threads (that's all we had
> back then) to 1024, because two threads share 1 core, and are roughly as
> fast.
> 
> He then also used this to influence task placement, preferring to move
> tasks to another sibling domain before getting the second thread active,
> this worked.
> 
> Then multi-core with shared caches came along and people did the same
> trick for mc power save in order to get that placement stuff, but that
> horribly broke the load-balancer normalization.
> 
> Now comes multi-node, and people asking for more elaborate placement
> strategies and all this starts creaking like a ghost house about to
> collapse.
> 
> Therefore I want cpu_power back to load normalization only, and do the
> placement stuff with something else.
> 


What do you have in mind for the something else? Aren't normalization
and placement two sides of the same coin? My concern is that load
normalization might give recommendations different from the placement
stuff; what do we do then?

> Once cpu_power is pure again, we can start making it dynamic, for SMT we
> can utilize APERF/MPERF to guesstimate the actual work capacity of
> threads, and scaling cpu_power back based on RT time used on the cpu.
> 
> Then when we walk the domain tree for load-balancing we re-do the
> cpu_power sum, etc..
> 
> 

-- 
	Balbir

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 11/15] sched: Pass unlimited __cpu_power information to upper domain level groups
  2009-08-25  8:04             ` Balbir Singh
@ 2009-08-25  8:30               ` Peter Zijlstra
  0 siblings, 0 replies; 64+ messages in thread
From: Peter Zijlstra @ 2009-08-25  8:30 UTC (permalink / raw)
  To: balbir
  Cc: Andreas Herrmann, Ingo Molnar, linux-kernel, Gautham Shenoy, svaidy

On Tue, 2009-08-25 at 13:34 +0530, Balbir Singh wrote:
> * Peter Zijlstra <peterz@infradead.org> [2009-08-25 09:11:14]:
> 
> > On Mon, 2009-08-24 at 23:49 +0530, Balbir Singh wrote:
> > 
> > > That reminds me, accounting is currently broken and should be based on
> > > APERF/MPERF (Power gets it right - based on SPURR).
> > 
> > What accounting?
> > 
> 
> 
> We need scaled time accounting for x86 (see *timescaled). By scaled
> accounting I mean the ratio APERF/MPERF.

Runtime accounting? I don't see why that would need to be scaled by a/m,
we're accounting wall-time, not a virtual time quantity that represents
work.

> > > > The trouble is that cpu_power is now abused for placement decisions too,
> > > > and that needs to be taken out.
> > > 
> > > OK.. so you propose extending the static cpu_power to dynamic
> > > cpu_power but based on current topology?
> > 
> > Right, so cpu_power is primarily used to normalize domain weight in the
> > load-balancer.
> > 
> > Suppose a 4 core machine with 1 unplugged core:
> > 
> >  0,1,3
> > 
> > 0,1  3
> > 
> > The sd-0,1 will have cpu_power 2048, while the sd-3 will have 1024, this
> > allowed find_busiest_group() for sd-0,1,3 to pick the one which is
> > relatively most overloaded.
> > 
> > Supposing 3, 2, 2 (nice0) tasks on these cores, the domain weight of
> > sd-0,1 is 5*1024 and sd-3 is 2*1024, normalized that becomes 5/2 and 2
> > resp. which clearly shows sd-0,1 to be the busiest of the pair.
> > 
> > Now back in the days Nick wrote all this, he did the cpu_power hack for
> > SMT which sets the combined cpu_power of 2 threads (that's all we had
> > back then) to 1024, because two threads share 1 core, and are roughly as
> > fast.
> > 
> > He then also used this to influence task placement, preferring to move
> > tasks to another sibling domain before getting the second thread active,
> > this worked.
> > 
> > Then multi-core with shared caches came along and people did the same
> > trick for mc power save in order to get that placement stuff, but that
> > horribly broke the load-balancer normalization.
> > 
> > Now comes multi-node, and people asking for more elaborate placement
> > strategies and all this starts creaking like a ghost house about to
> > collapse.
> > 
> > Therefore I want cpu_power back to load normalization only, and do the
> > placement stuff with something else.
> > 

> What do you have in mind for the something else? Aren't normalization
> and placement two sides of the same coin? My concern is that load
> normalization might give recommendations different from the placement
> stuff; what do we do then?

They are related but not the same. People have been asking for placement
policies that go beyond that relation.

Also the current ties between them are already strained on multi-level
placement policies.

So what I'd like to see is to move all placement decisions to SD_flags and
restore cpu_power to a straight sum of work capacity.
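
To illustrate with the example quoted above, a rough sketch (assuming
cpu_power is a straight sum of capacity; not actual kernel code):

#define SCHED_LOAD_SCALE 1024UL

/*
 * Group load normalized by capacity, as find_busiest_group() needs
 * it: 5 tasks on power 2048 yield 2.5*1024, 2 tasks on power 1024
 * yield 2*1024, so the two-CPU group is correctly seen as busier.
 */
static unsigned long normalized_load(unsigned long nr_tasks,
				     unsigned long cpu_power)
{
	return nr_tasks * SCHED_LOAD_SCALE * SCHED_LOAD_SCALE / cpu_power;
}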

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 8/15] sched: Add parameter sched_mn_power_savings to control MN domain sched policy
  2009-08-25  6:41       ` Peter Zijlstra
@ 2009-08-25  8:38         ` Andreas Herrmann
  0 siblings, 0 replies; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-25  8:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Gautham Shenoy, Srivatsa Vaddagiri,
	Dipankar Sarma, Balbir Singh, svaidy, Arun R Bharadwaj

On Tue, Aug 25, 2009 at 08:41:36AM +0200, Peter Zijlstra wrote:
> On Tue, 2009-08-25 at 08:24 +0200, Andreas Herrmann wrote:
> > On Mon, Aug 24, 2009 at 04:56:18PM +0200, Peter Zijlstra wrote:
> > > On Thu, 2009-08-20 at 15:39 +0200, Andreas Herrmann wrote:
> > > > Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
> > > > ---
> > > 
> > > > +#ifdef CONFIG_SCHED_MN
> > > > +	if (!err && mc_capable())
> > > > +		err = sysfs_create_file(&cls->kset.kobj,
> > > > +					&attr_sched_mn_power_savings.attr);
> > > > +#endif
> > > 
> > > *sigh* another crappy sysfs file
> > > 
> > > Guys, can't we come up with anything better than sched_*_power_saving=n?
> > 
> > I thought this was a settled thing. At least there are already two
> > such parameters. So using the existing convention is an obvious
> > thing, no?
> 
> Well, yes its the obvious thing, but I'm questioning whether its the
> best thing ;-)

Ok.

> > > This configuration space is _way_ too large, and now it gets even
> > > crazier.
> > 
> > I don't fully agree.
> > 
> > Having one control interface for each domain level is just one
> > approach. It gives the user full control of scheduling policies.
> > It just might have to be properly documented.
> > 
> > In another mail Vaidy mentioned that
> > 
> >   "at some point we wanted to change the interface to
> >    sched_power_savings=N and and set the flags according to system
> >    topology".
> > 
> > But how will you decide at which domain level you have to do power
> > savings scheduling?
> 
> The user isn't interested in knowing about domains and cpu topology in
> 99% of the cases, all they want is the machine not burning power like
> there's no tomorrow.
> 
> Users (me including) have no interest exploring a 27-state power
> configuration space in order to find out what works best for them, I'd
> throw up my hands and not bother, really.

If we have only a single knob (with 0==performance, 1==power savings)
then the arch-specific code must properly set the required SD flags
after CPU/topology detection. Only then can the scheduler code do the
right thing.

Imagine you have the following "virtual" CPU topology in a server:

- more than one thread per core (sharing cache, FPU, whatsoever)
- multiple cores per internal node (sharing cache, maybe same memory channels)
- multiple internal nodes per socket
- multiple sockets

For power savings scheduling you can choose one or more options from:

(a) You might save power when first utilizing all threads of one core, but
    degrade performance by not using other cores.

(b) You might save power when first utilizing all cores of an internal node,
    but you degrade performance by not using other internal nodes.

(c) You might save power when first utilizing all internal nodes of one socket
    before using another socket.

With only a single knob, would you switch on (a), (b) and (c)?
Or do you decide to switch on only (c) because performance degradation
is too high with (a) and (b)?

One solution could be to have
- two sysfs attributes:
  * sched_power_domain, value=one of {SMT, MC, MN}
  * sched_power_level, value=one of {0, 1, 2}
- and an implicit rule that (a) implies (b) and (b) implies (c).
- Note: this implies that it's impossible to switch on only (a).
A sketch of how these two attributes could map to per-level flags
follows below.
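
A rough sketch (plain C, purely illustrative -- the enum and helper
are hypothetical, not kernel interfaces) of how the pair could map to
per-level SD_POWERSAVINGS_BALANCE decisions under that implicit rule:

enum power_domain { PD_MN, PD_MC, PD_SMT };	/* MN = highest level */

/*
 * Choosing a low level (e.g. SMT, option (a)) switches on power
 * savings at all higher levels as well; sched_power_level == 0
 * means pure performance policy.
 */
static int level_does_powersave(enum power_domain sched_power_domain,
				enum power_domain level,
				int sched_power_level)
{
	if (!sched_power_level)
		return 0;
	return level <= sched_power_domain;
}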

> > Using sched_mn_power_savings=1 is quite different from
> > sched_smt_power_savings=1. Probably you save the most power if you
> > switch on power saving scheduling at each domain level, i.e. first
> > filling threads of one core, then filling all cores on one internal
> > node, then filling all internal nodes of one socket.
> > 
> > But for performance reasons a user might just want to use power
> > savings in the MN domain. How would you allow the user to configure
> > that with just one interface? Passing the domain level to
> > sched_power_savings, e.g. sched_power_savings=MC instead of the power
> > saving level?
> 
> Sure its different, it reduces the configuration space, that gives less
> choice, but does make it accessible.
> 
> Ask joe-admin what he prefers.
> 
> If you're really really worried people might miss the joy of fine tuning
> their power scheduling, then we can provide a dual interface, one for
> dumb people like me, and one for crazy people like you ;-)

> > Besides that, don't we have to keep the user-interface stable, i.e.
> > stick to sched_smt_power_savings and sched_mc_power_savings?
> 
> Don't ever defend crappy stuff with interface stability, that's just
> lame ;-)

Yep, I have no problem with changing interfaces if they are considered
crappy.

But we should have an appropriate replacement.


Thanks,

Andreas

-- 
Operating | Advanced Micro Devices GmbH
  System  | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. München, Germany
 Research | Geschäftsführer: Thomas M. McCoy, Giuliano Meroni
  Center  | Sitz: Dornach, Gemeinde Aschheim, Landkreis München
  (OSRC)  | Registergericht München, HRB Nr. 43632



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 11/15] sched: Pass unlimited __cpu_power information to upper domain level groups
  2009-08-24 15:21   ` Peter Zijlstra
  2009-08-24 16:44     ` Balbir Singh
@ 2009-08-25  8:51     ` Andreas Herrmann
  1 sibling, 0 replies; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-25  8:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Gautham Shenoy, svaidy, Balbir Singh

On Mon, Aug 24, 2009 at 05:21:37PM +0200, Peter Zijlstra wrote:
> On Thu, 2009-08-20 at 15:41 +0200, Andreas Herrmann wrote:
> > For performance reasons __cpu_power in a sched_group might be limited
> > such that the group can handle only one task. To correctly calculate
> > the capacity in upper domain level groups the unlimited power
> > information is required. This patch stores unlimited __cpu_power
> > information in sched_groups.orig_power and uses this when calculating
> > __cpu_power in upper domain level groups.
> 
> OK, so this tries to fix the cpu_power wreckage?

Not completely. Just (partially) for my MN domain needs.

> ok, so let me try this with an example:
> 
> Suppose we have a dual-core with shared cache and SMT
> 
>   0-3     MC
> 0-1 2-3   SMT
> 
> Then both levels fancy setting SHARED_RESOURCES and both levels end up
> normalizing the cpu_power to 1, so when we unplug cpu 2, load-balancing
> gets all screwy because the whole system doesn't get normalized
> properly.

So normalization is broken already, right?

In case of sched_smt_power_savings we have 1024 as __cpu_power for
each SMT sched_group. And at the MC level we always have 2048 as long
as there are two sched_groups at the SMT level.

> What you propose here is every time we muck with cpu_power we keep the
> real stuff in orig_power and use that to compute the level above.

Yes.

> Except you don't use it in the load-balancer proper, so normalization is
> still hosed.

Yes, the normalization problem that you've mentioned is not fixed by that.
But it might be advisable to fix it.

> Its a creative solution, but I'd rather see cpu_power returned to a
> straight sum of actual power to normalize the inter-cpu runqueue weights
> and do the placement decision using something else.

This means not artificially restricting __cpu_power to 1024 for
performance scheduling?

Seconded.
But I don't have an impromptu patch for this. ;-(


Regards,
Andreas

-- 
Operating | Advanced Micro Devices GmbH
  System  | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. München, Germany
 Research | Geschäftsführer: Thomas M. McCoy, Giuliano Meroni
  Center  | Sitz: Dornach, Gemeinde Aschheim, Landkreis München
  (OSRC)  | Registergericht München, HRB Nr. 43632



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 12/15] sched: Allow NODE domain to be parent of MC instead of CPU domain
  2009-08-24 15:32   ` Peter Zijlstra
@ 2009-08-25  8:55     ` Andreas Herrmann
  0 siblings, 0 replies; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-25  8:55 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel

On Mon, Aug 24, 2009 at 05:32:40PM +0200, Peter Zijlstra wrote:
> On Thu, 2009-08-20 at 15:42 +0200, Andreas Herrmann wrote:
> > The level of NODE domain's child domain is provided in s_data.numa_child_level.
> > Then several adaptations are required when creating the domain hierarchy.
> > In case NODE domain is parent of MC domain we have to:
> > - limit NODE domains' span in sched_domain_node_span() to not exceed
> >   corresponding topology_core_cpumask.
> > - fix CPU domain span to cover entire cpu_map
> > - fix CPU domain sched groups to cover entire physical groups instead of
> >   covering a node (a node sched_group might be a proper subset of a CPU
> >   sched_group).
> > - use correct child domain in init_numa_sched_groups_power() when
> >   calculating sched_group.__cpu_power in NODE domain
> > - calculate group_power of NODE domain after its child domain
> > 
> > Note: As I have no idea when the ALLNODES domain is required
> >       I assumed that an ALLNODES domain exists only if NODE domain
> >       is parent of CPU domain.
> 
> I think it's only used when the regular node level is too large; then we
> split it into smaller bits. SGI folks who run crazy large machines use
> this.

Ok.

> /me mumbles about renaming the domain level, CPU is the physical socket
> level, right? stupid names.
> 
> Patch sounds funky though, numa_child_level should be evident from the
> tree build.

In the current code the numa_child_level must be known before/while
the tree is built. Of course once the tree is ready, the child domain
is known (apart from degeneration).

Andreas

-- 
Operating | Advanced Micro Devices GmbH
  System  | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. München, Germany
 Research | Geschäftsführer: Thomas M. McCoy, Giuliano Meroni
  Center  | Sitz: Dornach, Gemeinde Aschheim, Landkreis München
  (OSRC)  | Registergericht München, HRB Nr. 43632



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 13/15] sched: Detect child domain of NUMA (aka NODE) domain
  2009-08-24 15:34   ` Peter Zijlstra
@ 2009-08-25  9:13     ` Andreas Herrmann
  0 siblings, 0 replies; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-25  9:13 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel

On Mon, Aug 24, 2009 at 05:34:18PM +0200, Peter Zijlstra wrote:
> On Thu, 2009-08-20 at 15:43 +0200, Andreas Herrmann wrote:
> > On multi-node processors a NUMA node might not span a socket.
> > Instead a socket might span several NUMA nodes.
> > 
> > This patch introduces a check whether NODE domain is parent
> > of MC domain and sets s_data.numa_child_level accordingly.
> > (See previous patch for further details.)
> 
> right, except that the previous patch
> was rather cryptic :/

Sorry for that.

> So you're proposing to have the NODE level depend on multi-node and then
> flip NODE and CPU around?

Conditionally.

Only if a NUMA node does not span an entire socket, e.g.
  node 0: 0-3
  node 1: 4-7
  socket 0: 0-3, 4-7

You may have an SRAT that describes one NUMA node containing all
sockets, e.g.
  node 0: 0-7
  socket 0: 0-7
If we have something like that on a multi-node processor system then
we don't need to flip NODE and CPU around.

The same is true if there is no SRAT, the SRAT is bogus, or
CONFIG_ACPI_NUMA=n.

In theory, I could also think of node interleaving where a NUMA node
spans internal nodes of a socket on a multi-node processor -- no flip
in domain hierarchy needed.



In short: as soon as a socket spans more than one NUMA node we have
to flip NODE and CPU.
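
To illustrate, a rough sketch of that check (not the patch itself; it
just combines the existing cpumask helpers with the accessors from my
topology patches):

#include <linux/cpumask.h>
#include <linux/topology.h>

/*
 * NODE must sit below CPU exactly when a CPU's NUMA node is a
 * proper subset of its socket (its core siblings).
 */
static bool node_below_socket(int cpu)
{
	const struct cpumask *node = cpumask_of_node(cpu_to_node(cpu));
	const struct cpumask *core = topology_core_cpumask(cpu);

	return cpumask_subset(node, core) && !cpumask_equal(node, core);
}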


Regards,
Andreas

-- 
Operating | Advanced Micro Devices GmbH
  System  | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. München, Germany
 Research | Geschäftsführer: Thomas M. McCoy, Giuliano Meroni
  Center  | Sitz: Dornach, Gemeinde Aschheim, Landkreis München
  (OSRC)  | Registergericht München, HRB Nr. 43632



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 14/15] sched: Conditionally limit __cpu_power when child sched domain has type NODE
  2009-08-24 15:35   ` Peter Zijlstra
@ 2009-08-25  9:19     ` Andreas Herrmann
  0 siblings, 0 replies; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-25  9:19 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel

On Mon, Aug 24, 2009 at 05:35:32PM +0200, Peter Zijlstra wrote:
> On Thu, 2009-08-20 at 15:45 +0200, Andreas Herrmann wrote:
> > We need this in case of performance policy. All sched_groups in
> > child's parent domain (MN in this case) should be limited such that
> > tasks are balanced among these sched_groups.
> 
> /me fails at correlating the above changelog and the below patch.

> So here we go messing up cpu_power again in order to influence the
> placement policy?

Yep, to restrict the capacity of sched_groups in the MN domain.

But you already pointed out that messing up cpu_power is
unwanted.

So I have to look for an alternative to influence placement policy
such that cpu_power is kept intact and always properly normalized.
 
> > Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
> > ---
> >  kernel/sched.c |    4 ++--
> >  1 files changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 0c950dc..ab88d88 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -8555,11 +8555,11 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
> >  	 */
> >  	if (!(sd->flags & SD_POWERSAVINGS_BALANCE) &&
> >  	    ((child->flags &
> > -	      (SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES)))) {
> > +	      (SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES)) ||
> > +	     (child->level == SD_LV_NODE))) {
> >  		sd->groups->__cpu_power = 0;
> >  		sg_inc_cpu_power(sd->groups, SCHED_LOAD_SCALE);
> >  	}
> > -
> >  }
> >  
> >  /*
> 

Andreas

-- 
Operating | Advanced Micro Devices GmbH
  System  | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. München, Germany
 Research | Geschäftsführer: Thomas M. McCoy, Giuliano Meroni
  Center  | Sitz: Dornach, Gemeinde Aschheim, Landkreis München
  (OSRC)  | Registergericht München, HRB Nr. 43632



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 15/15] x86: Fix cpu_coregroup_mask to return correct cpumask on multi-node processors
  2009-08-24 15:36   ` Peter Zijlstra
  2009-08-24 18:21     ` Ingo Molnar
@ 2009-08-25  9:31     ` Andreas Herrmann
  2009-08-25  9:55       ` Peter Zijlstra
  1 sibling, 1 reply; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-25  9:31 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel

On Mon, Aug 24, 2009 at 05:36:16PM +0200, Peter Zijlstra wrote:
> On Thu, 2009-08-20 at 15:46 +0200, Andreas Herrmann wrote:
> > The correct mask that describes core-siblings of a processor
> > is topology_core_cpumask. See topology adaptation patches, especially
> > http://marc.info/?l=linux-kernel&m=124964999608179
> 
> 
> argh, violence, murder kill.. this is the worst possible hack and you're
> extending it :/

So this is the third code area
(besides sched_*_power_savings sysfs interface, and the __cpu_power fiddling)
that is crap, mess, a hack.

Didn't know that I'd enter such a minefield when touching this code. ;-(

What would be your preferred solution for the
core_cpumask/llc_shared_map stuff?  Another domain level to get rid of
this function?

> > Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
> > ---
> >  arch/x86/kernel/smpboot.c |    2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > index f797214..f39bb2c 100644
> > --- a/arch/x86/kernel/smpboot.c
> > +++ b/arch/x86/kernel/smpboot.c
> > @@ -446,7 +446,7 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
> >  	 * And for power savings, we return cpu_core_map
> >  	 */
> >  	if (sched_mc_power_savings || sched_smt_power_savings)
> > -		return cpu_core_mask(cpu);
> > +		return topology_core_cpumask(cpu);
> >  	else
> >  		return c->llc_shared_map;
> >  }
> 

Andreas

-- 
Operating | Advanced Micro Devices GmbH
  System  | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. München, Germany
 Research | Geschäftsführer: Thomas M. McCoy, Giuliano Meroni
  Center  | Sitz: Dornach, Gemeinde Aschheim, Landkreis München
  (OSRC)  | Registergericht München, HRB Nr. 43632



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 9/15] sched: Check sched_mn_power_savings when setting flags for CPU and MN domains
  2009-08-24 14:57   ` Peter Zijlstra
@ 2009-08-25  9:34     ` Gautham R Shenoy
  0 siblings, 0 replies; 64+ messages in thread
From: Gautham R Shenoy @ 2009-08-25  9:34 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Andreas Herrmann, Ingo Molnar, linux-kernel

On Mon, Aug 24, 2009 at 04:57:42PM +0200, Peter Zijlstra wrote:
> On Thu, 2009-08-20 at 15:40 +0200, Andreas Herrmann wrote:
> > Use new function sd_balance_for_mn_power() and adapt
> > sd_balance_for_package_power() and sd_power_saving_flags() for correct
> > setting of flags SD_POWERSAVINGS_BALANCE and SD_BALANCE_NEWIDLE in CPU
> > and MN domains.
> > 
> > Furthermore add flag SD_SHARE_PKG_RESOURCES to MN domain.
> > Rationale: a multi-node processor most likely shares package resources
> > (on Magny-Cours the package constitutes a "voltage domain").
> 
> IIRC SD_SHARE_PKG_RESOURCES plays games with the cpu_pwer of a
> sched_domain, which breaks in all kinds of curious ways, this adds more
> breakage afaict.
> 
> ego?

A domain which has SD_SHARE_PKG_RESOURCES will always have
__cpu_power = SCHED_LOAD_SCALE if the domain hasn't set the
SD_POWERSAVINGS_BALANCE flag.

The problem you are talking about is that when you offline a CPU of
such a domain, it will still show the same cpu_power, which can
confuse the scheduler.

E.g.: take a dual-socket dual-core machine. In the absence of
SD_POWERSAVINGS_BALANCE, the SD_LV_CPU domain which has
SD_SHARE_PKG_RESOURCES set will have both of its group->cpu_power
values set to SCHED_LOAD_SCALE. If we offline, say, one of the four
cores, the group->cpu_power of the corresponding group will still be
SCHED_LOAD_SCALE.

This might affect the fairness calculations. For example, if you have
6 tasks running, the ideal placement would have been 4 on the socket
whose CPUs are all online and 2 on the one where a CPU has been
offlined. But in this case we will have 3 + 3, which is not correct.
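
For reference, the arithmetic behind that as a standalone sketch (not
kernel code):

#include <stdio.h>

int main(void)
{
	unsigned long scale = 1024;
	unsigned long full = 2 * scale;	/* socket with 2 online cores */
	unsigned long half = 1 * scale;	/* socket with 1 online core  */
	unsigned long tasks = 6;

	/* With honest cpu_power the tasks split proportionally: 4 + 2. */
	printf("%lu + %lu\n", tasks * full / (full + half),
	       tasks * half / (full + half));

	/*
	 * With both groups pinned to SCHED_LOAD_SCALE they look equal,
	 * and the balancer converges on 3 + 3 instead.
	 */
	return 0;
}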


> 
> > Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
> > ---
> >  arch/x86/include/asm/topology.h |    3 ++-
> >  include/linux/sched.h           |   14 ++++++++++++--
> >  2 files changed, 14 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> > index 6d7d133..4a520b8 100644
> > --- a/arch/x86/include/asm/topology.h
> > +++ b/arch/x86/include/asm/topology.h
> > @@ -198,7 +198,8 @@ static inline void setup_node_to_cpumask_map(void) { }
> >  				| SD_BALANCE_EXEC	\
> >  				| SD_WAKE_AFFINE	\
> >  				| SD_WAKE_BALANCE	\
> > -				| sd_balance_for_package_power()\
> > +				| SD_SHARE_PKG_RESOURCES\
> > +				| sd_balance_for_mn_power()\
> >  				| sd_power_saving_flags(),\
> >  	.last_balance		= jiffies,		\
> >  	.balance_interval	= 1,			\
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 5755643..c53bdd8 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -844,9 +844,18 @@ static inline int sd_balance_for_mc_power(void)
> >  	return 0;
> >  }
> >  
> > +static inline int sd_balance_for_mn_power(void)
> > +{
> > +	if (sched_mc_power_savings || sched_smt_power_savings)
> > +		return SD_POWERSAVINGS_BALANCE;
> > +
> > +	return 0;
> > +}
> > +
> >  static inline int sd_balance_for_package_power(void)
> >  {
> > -	if (sched_mc_power_savings | sched_smt_power_savings)
> > +	if (sched_mn_power_savings || sched_mc_power_savings ||
> > +	    sched_smt_power_savings)
> >  		return SD_POWERSAVINGS_BALANCE;
> >  
> >  	return 0;
> > @@ -860,7 +869,8 @@ static inline int sd_balance_for_package_power(void)
> >  
> >  static inline int sd_power_saving_flags(void)
> >  {
> > -	if (sched_mc_power_savings | sched_smt_power_savings)
> > +	if (sched_mn_power_savings || sched_mc_power_savings ||
> > +	    sched_smt_power_savings)
> >  		return SD_BALANCE_NEWIDLE;
> >  
> >  	return 0;

-- 
Thanks and Regards
gautham

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 15/15] x86: Fix cpu_coregroup_mask to return correct cpumask on multi-node processors
  2009-08-25  9:31     ` Andreas Herrmann
@ 2009-08-25  9:55       ` Peter Zijlstra
  2009-08-25 10:20         ` Ingo Molnar
                           ` (2 more replies)
  0 siblings, 3 replies; 64+ messages in thread
From: Peter Zijlstra @ 2009-08-25  9:55 UTC (permalink / raw)
  To: Andreas Herrmann; +Cc: Ingo Molnar, linux-kernel

On Tue, 2009-08-25 at 11:31 +0200, Andreas Herrmann wrote:
> On Mon, Aug 24, 2009 at 05:36:16PM +0200, Peter Zijlstra wrote:
> > On Thu, 2009-08-20 at 15:46 +0200, Andreas Herrmann wrote:
> > > The correct mask that describes core-siblings of a processor
> > > is topology_core_cpumask. See topology adaptation patches, especially
> > > http://marc.info/?l=linux-kernel&m=124964999608179
> > 
> > 
> > argh, violence, murder kill.. this is the worst possible hack and you're
> > extending it :/
> 
> So this is the third code area
> (besides sched_*_power_savings sysfs interface, and the __cpu_power fiddling)
> that is crap, mess, a hack.
> 
> Didn't know that I'd enter such a minefield when touching this code. ;-(

Yeah, you're lucky that way ;-) It's been creaking for a while, and I've
been making noises to the IBM people (who so far have been the main
source of power saving patches) to clean this up, but now you trod onto
all of it at once..

> What would be your preferred solution for the
> core_cpumask/llc_shared_map stuff?  Another domain level to get rid of
> this function?

Right, I'd like to see everything exposed as domain levels.


numa-cluster
numa
socket
in-socket-numa
multi-core
shared-cache
core
threads

We currently have a fixed order of these things, but I think we should
simply provide helpers for building the sd tree and let the arch code do
that instead of exporting all these masks in a fixed order.

Once we get the arch domain tree, we run the degeneration logic to cull
all the trivial domains and fold SD flags.
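
Roughly something like this (the struct and the table are
hypothetical, just to make the idea concrete):

#include <linux/cpumask.h>
#include <linux/sched.h>
#include <linux/topology.h>

struct sd_level {
	const char *name;			/* e.g. "socket"      */
	const struct cpumask *(*mask)(int cpu);	/* span at this level */
	int flags;				/* SD_* behaviour     */
};

/* wrappers, since the topology accessors are macros */
static const struct cpumask *thread_mask(int cpu)
{
	return topology_thread_cpumask(cpu);
}

static const struct cpumask *socket_mask(int cpu)
{
	return topology_core_cpumask(cpu);
}

static const struct cpumask *node_mask(int cpu)
{
	return cpumask_of_node(cpu_to_node(cpu));
}

/*
 * Example arch table, innermost level first; generic code would
 * build the tree from this and cull levels that degenerate to a
 * single group.
 */
static struct sd_level mn_levels[] = {
	{ "threads",        thread_mask,        SD_SHARE_CPUPOWER },
	{ "shared-cache",   cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES },
	{ "in-socket-numa", node_mask,          0 },
	{ "socket",         socket_mask,        0 },
};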

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 15/15] x86: Fix cpu_coregroup_mask to return correct cpumask on multi-node processors
  2009-08-24 18:21     ` Ingo Molnar
@ 2009-08-25 10:13       ` Andreas Herrmann
  2009-08-25 10:36         ` Ingo Molnar
  0 siblings, 1 reply; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-25 10:13 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Zijlstra, linux-kernel

On Mon, Aug 24, 2009 at 08:21:54PM +0200, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Thu, 2009-08-20 at 15:46 +0200, Andreas Herrmann wrote:
> > > The correct mask that describes core-siblings of a processor
> > > is topology_core_cpumask. See topology adaptation patches, especially
> > > http://marc.info/?l=linux-kernel&m=124964999608179
> > 
> > argh, violence, murder kill.. this is the worst possible hack and 
> > you're extending it :/
> 
> I think most of the trouble here comes from having inconsistent 
> names, a rather static structure for sched-domains setup and then we 
> are confusing things back and forth.
> 
> Right now we have thread/sibling, core, CPU/socket and node, with 
> many data structures around these hardcoded. Certain scheduler 
> features only operate on the hardcoded fields.
> 
> Now Magny-Cours adds a socket internal node construct to the whole 
> thing, names it randomly and basically breaks the semi-static 
> representation.
> 
> We cannot just flip around our static names and hope it goes well 
> and everything just drops into place. Everything just falls apart 
> really instead.
> 
> Instead we should have an arch-defined tree and a CPU architecture 
> dependent ASCII name associated with each level - but not hardcoded 
> into the scheduler.

I admit that it's strange to have the x86 specific SCHED_SMT/MC
snippets in common code.

And the NUMA/SD_NODE stuff is not used by all architectures either.

Having an arch-defined tree seems the right thing to do.

> Plus we should have independent scheduler domains feature flags that 
> can be turned on/off in various levels of that tree, depending on 
> the cache and interconnect properties of the hardware - without 
> having to worry about what the ASCII name says. Those features 
> should be capable to work not just on the lowest level of the tree, 
> but on higher levels too, regardless whether that level is called a 
> 'core', a 'socket' or an 'internal node' on the ASCII level really.
> 
> This is why i insisted on handling the Magny-Cours topology 
> discovery and enumeration patches together with the scheduler 
> patches. It can easily become a mess if extended.

I don't buy this argument.

The main source of information when building sched-domains will be the
CPU topology. That must be provided somehow, independently of how
scheduling domains are created. When the domains are built you just
need to know which cpumask to use when the sched_groups and the
domain's span are determined.

Thus I think the topology detection is rather self-contained and
can/should be provided independent of how the scheduler side is going
to be implemented.

> 	Ingo


Regards,
Andreas

-- 
Operating | Advanced Micro Devices GmbH
  System  | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. München, Germany
 Research | Geschäftsführer: Thomas M. McCoy, Giuliano Meroni
  Center  | Sitz: Dornach, Gemeinde Aschheim, Landkreis München
  (OSRC)  | Registergericht München, HRB Nr. 43632



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 15/15] x86: Fix cpu_coregroup_mask to return correct cpumask on multi-node processors
  2009-08-25  9:55       ` Peter Zijlstra
@ 2009-08-25 10:20         ` Ingo Molnar
  2009-08-25 10:24         ` Andreas Herrmann
  2009-08-27 15:25         ` Andreas Herrmann
  2 siblings, 0 replies; 64+ messages in thread
From: Ingo Molnar @ 2009-08-25 10:20 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Andreas Herrmann, linux-kernel


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, 2009-08-25 at 11:31 +0200, Andreas Herrmann wrote:
> > On Mon, Aug 24, 2009 at 05:36:16PM +0200, Peter Zijlstra wrote:
> > > On Thu, 2009-08-20 at 15:46 +0200, Andreas Herrmann wrote:
> > > > The correct mask that describes core-siblings of a processor
> > > > is topology_core_cpumask. See topology adaptation patches, especially
> > > > http://marc.info/?l=linux-kernel&m=124964999608179
> > > 
> > > 
> > > argh, violence, murder kill.. this is the worst possible hack and you're
> > > extending it :/
> > 
> > So this is the third code area
> > (besides sched_*_power_savings sysfs interface, and the __cpu_power fiddling)
> > that is crap, mess, a hack.
> > 
> > Didn't know that I'd enter such a minefield when touching this code. ;-(
> 
> Yeah, you're lucky that way ;-) It's been creaking for a while, and I've
> been making noises to the IBM people (who so far have been the main
> source of power saving patches) to clean this up, but now you trod onto
> all of it at once..
> 
> > What would be your preferred solution for the
> > core_cpumask/llc_shared_map stuff?  Another domain level to get rid of
> > this function?
> 
> Right, I'd like to see everything exposed as domain levels.
> 
>  numa-cluster
>  numa
>  socket
>  in-socket-numa
>  multi-core
>  shared-cache
>  core
>  threads
> 
> We currently have a fixed order of these things, but I think we 
> should simply provide helpers for building the sd tree and let the 
> arch code do that instead of exporting all these masks in a fixed 
> order.
> 
> Once we get the arch domain tree, we do degenerate stuff to cull 
> all the trivial domains and fold SD flags.

Btw., to move this into the realm of possibility for .32, we can 
start this by adding the framework and then crudely cutting off 
these wrongly layered connections to the architecture code and doing 
a clean core.

We might regress, but the fixes will be isolated and forward-looking.

	Ingo

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 15/15] x86: Fix cpu_coregroup_mask to return correct cpumask on multi-node processors
  2009-08-25  9:55       ` Peter Zijlstra
  2009-08-25 10:20         ` Ingo Molnar
@ 2009-08-25 10:24         ` Andreas Herrmann
  2009-08-25 10:28           ` Ingo Molnar
  2009-08-25 10:35           ` Peter Zijlstra
  2009-08-27 15:25         ` Andreas Herrmann
  2 siblings, 2 replies; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-25 10:24 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel

On Tue, Aug 25, 2009 at 11:55:43AM +0200, Peter Zijlstra wrote:
> On Tue, 2009-08-25 at 11:31 +0200, Andreas Herrmann wrote:
> > On Mon, Aug 24, 2009 at 05:36:16PM +0200, Peter Zijlstra wrote:
> > > On Thu, 2009-08-20 at 15:46 +0200, Andreas Herrmann wrote:
> > > > The correct mask that describes core-siblings of a processor
> > > > is topology_core_cpumask. See topology adaptation patches, especially
> > > > http://marc.info/?l=linux-kernel&m=124964999608179
> > > 
> > > 
> > > argh, violence, murder kill.. this is the worst possible hack and you're
> > > extending it :/
> > 
> > So this is the third code area
> > (besides sched_*_power_savings sysfs interface, and the __cpu_power fiddling)
> > that is crap, mess, a hack.
> > 
> > Didn't know that I'd enter such a minefield when touching this code. ;-(
> 
> Yeah, you're lucky that way ;-) It's been creaking for a while, and I've
> been making noises to the IBM people (who so far have been the main
> source of power saving patches) to clean this up, but now you trod onto
> all of it at once..
> 
> > What would be your preferred solution for the
> > core_cpumask/llc_shared_map stuff?  Another domain level to get rid of
> > this function?
> 
> Right, I'd like to see everything exposed as domain levels.
> 
> 
> numa-cluster
> numa
> socket
> in-socket-numa
> multi-core
> shared-cache
> core
> threads
> 
> We currently have a fixed order of these things, but I think we should
> simply provide helpers for building the sd tree and let the arch code do
> that instead of exporting all these masks in a fixed order.
> 
> Once we get the arch domain tree, we do degenerate stuff to cull all the
> trivial domains and fold SD flags.

So any in-socket-numa is only going to happen with the arch-defined
domain tree.

Now that this is settled you should throw away the
__build_sched_domains cleanup patches that are in tip. They won't be
of use when the domain creation code is fundamentally changed.


Regards,

Andreas

-- 
Operating | Advanced Micro Devices GmbH
  System  | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. München, Germany
 Research | Geschäftsführer: Thomas M. McCoy, Giuliano Meroni
  Center  | Sitz: Dornach, Gemeinde Aschheim, Landkreis München
  (OSRC)  | Registergericht München, HRB Nr. 43632



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 15/15] x86: Fix cpu_coregroup_mask to return correct cpumask on multi-node processors
  2009-08-25 10:24         ` Andreas Herrmann
@ 2009-08-25 10:28           ` Ingo Molnar
  2009-08-25 10:35           ` Peter Zijlstra
  1 sibling, 0 replies; 64+ messages in thread
From: Ingo Molnar @ 2009-08-25 10:28 UTC (permalink / raw)
  To: Andreas Herrmann; +Cc: Peter Zijlstra, linux-kernel


* Andreas Herrmann <andreas.herrmann3@amd.com> wrote:

> On Tue, Aug 25, 2009 at 11:55:43AM +0200, Peter Zijlstra wrote:
> > On Tue, 2009-08-25 at 11:31 +0200, Andreas Herrmann wrote:
> > > On Mon, Aug 24, 2009 at 05:36:16PM +0200, Peter Zijlstra wrote:
> > > > On Thu, 2009-08-20 at 15:46 +0200, Andreas Herrmann wrote:
> > > > > The correct mask that describes core-siblings of a processor
> > > > > is topology_core_cpumask. See topology adaptation patches, especially
> > > > > http://marc.info/?l=linux-kernel&m=124964999608179
> > > > 
> > > > 
> > > > argh, violence, murder kill.. this is the worst possible hack and you're
> > > > extending it :/
> > > 
> > > So this is the third code area
> > > (besides sched_*_power_savings sysfs interface, and the __cpu_power fiddling)
> > > that is crap, mess, a hack.
> > > 
> > > Didn't know that I'd enter such a minefield when touching this code. ;-(
> > 
> > Yeah, you're lucky that way ;-) It's been creaking for a while, and I've
> > been making noises to the IBM people (who so far have been the main
> > source of power saving patches) to clean this up, but now you trod onto
> > all of it at once..
> > 
> > > What would be your preferred solution for the
> > > core_cpumask/llc_shared_map stuff?  Another domain level to get rid of
> > > this function?
> > 
> > Right, I'd like to see everything exposed as domain levels.
> > 
> > 
> > numa-cluster
> > numa
> > socket
> > in-socket-numa
> > multi-core
> > shared-cache
> > core
> > threads
> > 
> > We currently have a fixed order of these things, but I think we should
> > simply provide helpers for building the sd tree and let the arch code do
> > that instead of exporting all these masks in a fixed order.
> > 
> > Once we get the arch domain tree, we do degenerate stuff to cull all the
> > trivial domains and fold SD flags.
> 
> So any in-socket-numa is only going to happen with the
> arch-defined domain tree.
> 
> Now that this is settled you should throw away the 
> __build_sched_domains cleanup patches that are in tip. They won't 
> be of use when the domain creation code is fundamentally changed.

I'd rather keep them - it gives a better/cleaner basis to develop 
the new stuff.

	Ingo

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 15/15] x86: Fix cpu_coregroup_mask to return correct cpumask on multi-node processors
  2009-08-25 10:24         ` Andreas Herrmann
  2009-08-25 10:28           ` Ingo Molnar
@ 2009-08-25 10:35           ` Peter Zijlstra
  2009-08-27 15:42             ` Andreas Herrmann
  1 sibling, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2009-08-25 10:35 UTC (permalink / raw)
  To: Andreas Herrmann; +Cc: Ingo Molnar, linux-kernel

On Tue, 2009-08-25 at 12:24 +0200, Andreas Herrmann wrote:
> On Tue, Aug 25, 2009 at 11:55:43AM +0200, Peter Zijlstra wrote:
> > On Tue, 2009-08-25 at 11:31 +0200, Andreas Herrmann wrote:
> > > On Mon, Aug 24, 2009 at 05:36:16PM +0200, Peter Zijlstra wrote:
> > > > On Thu, 2009-08-20 at 15:46 +0200, Andreas Herrmann wrote:
> > > > > The correct mask that describes core-siblings of a processor
> > > > > is topology_core_cpumask. See topology adaptation patches, especially
> > > > > http://marc.info/?l=linux-kernel&m=124964999608179
> > > > 
> > > > 
> > > > argh, violence, murder kill.. this is the worst possible hack and you're
> > > > extending it :/
> > > 
> > > So this is the third code area
> > > (besides sched_*_power_savings sysfs interface, and the __cpu_power fiddling)
> > > that is crap, mess, a hack.
> > > 
> > > Didn't know that I'd enter such a minefield when touching this code. ;-(
> > 
> > Yeah, you're lucky that way ;-) It's been creaking for a while, and I've
> > been making noises to the IBM people (who so far have been the main
> > source of power saving patches) to clean this up, but now you trod onto
> > all of it at once..
> > 
> > > What would be your preferred solution for the
> > > core_cpumask/llc_shared_map stuff?  Another domain level to get rid of
> > > this function?
> > 
> > Right, I'd like to see everything exposed as domain levels.
> > 
> > 
> > numa-cluster
> > numa
> > socket
> > in-socket-numa
> > multi-core
> > shared-cache
> > core
> > threads
> > 
> > We currently have a fixed order of these things, but I think we should
> > simply provide helpers for building the sd tree and let the arch code do
> > that instead of exporting all these masks in a fixed order.
> > 
> > Once we get the arch domain tree, we do degenerate stuff to cull all the
> > trivial domains and fold SD flags.
> 
> So any in-socket-numa is only going to happen with the arch-defined
> domain tree.

Well, we could see what it takes to make this work without that. I mean,
this is just how I'd like to see it end up; it doesn't mean we cannot
work on it from multiple angles at the same time.

> Now that this is settled you should throw away the
> __build_sched_domains cleanup patches that are in tip. They won't be
> of use when the domain creation code is fundamentally changed.

I'm not sure that's needed; we can continue work on refactoring that.
Small steps towards something better seem a better plan than a single
large step.



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 15/15] x86: Fix cpu_coregroup_mask to return correct cpumask on multi-node processors
  2009-08-25 10:13       ` Andreas Herrmann
@ 2009-08-25 10:36         ` Ingo Molnar
  2009-08-27 13:18           ` Andreas Herrmann
  0 siblings, 1 reply; 64+ messages in thread
From: Ingo Molnar @ 2009-08-25 10:36 UTC (permalink / raw)
  To: Andreas Herrmann; +Cc: Peter Zijlstra, linux-kernel


* Andreas Herrmann <andreas.herrmann3@amd.com> wrote:

> On Mon, Aug 24, 2009 at 08:21:54PM +0200, Ingo Molnar wrote:
> > 
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > On Thu, 2009-08-20 at 15:46 +0200, Andreas Herrmann wrote:
> > > > The correct mask that describes core-siblings of a processor
> > > > is topology_core_cpumask. See topology adaptation patches, especially
> > > > http://marc.info/?l=linux-kernel&m=124964999608179
> > > 
> > > argh, violence, murder kill.. this is the worst possible hack and 
> > > you're extending it :/
> > 
> > I think most of the trouble here comes from having inconsistent 
> > names, a rather static structure for sched-domains setup and 
> > then we are confusing things back and forth.
> > 
> > Right now we have thread/sibling, core, CPU/socket and node, 
> > with many data structures around these hardcoded. Certain 
> > scheduler features only operate on the hardcoded fields.
> > 
> > Now Magny-Cours adds a socket internal node construct to the 
> > whole thing, names it randomly and basically breaks the 
> > semi-static representation.
> > 
> > We cannot just flip around our static names and hope it goes 
> > well and everything just drops into place. Everything just falls 
> > apart really instead.
> > 
> > Instead we should have an arch-defined tree and a CPU 
> > architecture dependent ASCII name associated with each level - 
> > but not hardcoded into the scheduler.
> 
> I admit that it's strange to have the x86 specific SCHED_SMT/MC 
> snippets in common code.
> 
> And the NUMA/SD_NODE stuff is not used by all architectures 
> either.
> 
> Having an arch-defined tree seems the right thing to do.

yep, with generic helpers to reduce per arch bloat. 
(named/structured in a neutral way)

> > Plus we should have independent scheduler domains feature flags 
> > that can be turned on/off in various levels of that tree, 
> > depending on the cache and interconnect properties of the 
> > hardware - without having to worry about what the ASCII name 
> > says. Those features should be capable to work not just on the 
> > lowest level of the tree, but on higher levels too, regardless 
> > whether that level is called a 'core', a 'socket' or an 
> > 'internal node' on the ASCII level really.
> > 
> > This is why i insisted on handling the Magny-Cours topology 
> > discovery and enumeration patches together with the scheduler 
> > patches. It can easily become a mess if extended.
> 
> I don't buy this argument.
> 
> The main source of information when building sched-domains will be 
> the CPU topology. That must be provided somehow independent of how 
> scheduling domains are created. When the domains are built you 
> just need to know which cpumask to use when the sched_groups and 
> domain's span are determined.
> 
> Thus I think the topology detection is rather self-contained and 
> can/should be provided independent of how the scheduler side is 
> going to be implemented.

This is the sysfs bits? What is this needed for exactly? The 
scheduler is pretty much the most important thing to tune in a 
topology aware manner, besides memory allocations.

	Ingo

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 8/15] sched: Add parameter sched_mn_power_savings to control MN domain sched policy
  2009-08-20 13:39 ` [PATCH 8/15] sched: Add parameter sched_mn_power_savings to control MN domain sched policy Andreas Herrmann
  2009-08-24 14:56   ` Peter Zijlstra
@ 2009-08-26  9:30   ` Gautham R Shenoy
  2009-08-27 12:47     ` Andreas Herrmann
  1 sibling, 1 reply; 64+ messages in thread
From: Gautham R Shenoy @ 2009-08-26  9:30 UTC (permalink / raw)
  To: Andreas Herrmann
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel,
	Vaidyanathan Srinivasan, Balbir Singh

Hi Andreas,

On Thu, Aug 20, 2009 at 03:39:14PM +0200, Andreas Herrmann wrote:
> @@ -9208,6 +9229,11 @@ int __init sched_create_sysfs_power_savings_entries(struct sysdev_class *cls)
>  		err = sysfs_create_file(&cls->kset.kobj,
>  					&attr_sched_mc_power_savings.attr);
>  #endif
> +#ifdef CONFIG_SCHED_MN
> +	if (!err && mc_capable())
> +		err = sysfs_create_file(&cls->kset.kobj,
> +					&attr_sched_mn_power_savings.attr);
> +#endif

This would create the sysfs tunable even on systems which are
mc_capable() but don't have multiple nodes in a package, no?

>  	return err;
>  }
>  #endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
> -- 
> 1.6.0.4
> 
> 

-- 
Thanks and Regards
gautham

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 9/15] sched: Check sched_mn_power_savings when setting flags for CPU and MN domains
  2009-08-20 13:40 ` [PATCH 9/15] sched: Check sched_mn_power_savings when setting flags for CPU and MN domains Andreas Herrmann
  2009-08-24 14:57   ` Peter Zijlstra
@ 2009-08-26 10:01   ` Gautham R Shenoy
  1 sibling, 0 replies; 64+ messages in thread
From: Gautham R Shenoy @ 2009-08-26 10:01 UTC (permalink / raw)
  To: Andreas Herrmann
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Balbir Singh,
	Vaidyanathan Srinivasan

On Thu, Aug 20, 2009 at 03:40:13PM +0200, Andreas Herrmann wrote:
> 
> Use new function sd_balance_for_mn_power() and adapt
> sd_balance_for_package_power() and sd_power_saving_flags() for correct
> setting of flags SD_POWERSAVINGS_BALANCE and SD_BALANCE_NEWIDLE in CPU
> and MN domains.
> 
> Furthermore add flag SD_SHARE_PKG_RESOURCES to MN domain.
> Rationale: a multi-node processor most likely shares package resources
> (on Magny-Cours the package constitutes a "voltage domain").
> 
> Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
> ---
>  arch/x86/include/asm/topology.h |    3 ++-
>  include/linux/sched.h           |   14 ++++++++++++--
>  2 files changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> index 6d7d133..4a520b8 100644
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -198,7 +198,8 @@ static inline void setup_node_to_cpumask_map(void) { }
>  				| SD_BALANCE_EXEC	\
>  				| SD_WAKE_AFFINE	\
>  				| SD_WAKE_BALANCE	\
> -				| sd_balance_for_package_power()\
> +				| SD_SHARE_PKG_RESOURCES\
> +				| sd_balance_for_mn_power()\
>  				| sd_power_saving_flags(),\
>  	.last_balance		= jiffies,		\
>  	.balance_interval	= 1,			\
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 5755643..c53bdd8 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -844,9 +844,18 @@ static inline int sd_balance_for_mc_power(void)
>  	return 0;
>  }
> 
> +static inline int sd_balance_for_mn_power(void)
> +{
> +	if (sched_mc_power_savings || sched_smt_power_savings)
> +		return SD_POWERSAVINGS_BALANCE;
> +
> +	return 0;

This again implies that if SD_POWERSAVINGS_BALANCE is set at any level,
it must also be set at its parent.

With this constraint, there can only be 4 combinations.
0) SD_POWERSAVINGS_BALANCE not set.
1) SD_POWERSAVINGS_BALANCE set at SD_LV_CPU.
2) SD_POWERSAVINGS_BALANCE set at SD_LV_MN and SD_LV_CPU
3) SD_POWERSAVINGS_BALANCE set at SD_LV_MC, SD_LV_MN and SD_LV_CPU.

If we could independently decide the aggressiveness of consolidation
(i.e., 1 or 2), we could do away with these multiple sysfs variables and
have a single tunable.

Does this make sense?
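
For what it's worth, a rough sketch (purely illustrative) of how a
single integer could encode those four combinations:

/* tunable: 0..3 as enumerated above; level: 0=CPU, 1=MN, 2=MC */
static int powersave_at(int tunable, int level)
{
	/*
	 * tunable=3 sets the flag at MC, MN and CPU; tunable=1 only
	 * at CPU; tunable=0 nowhere -- the implication comes for free.
	 */
	return tunable > level;
}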

> +
>  static inline int sd_balance_for_package_power(void)
>  {
> -	if (sched_mc_power_savings | sched_smt_power_savings)
> +	if (sched_mn_power_savings || sched_mc_power_savings ||
> +	    sched_smt_power_savings)
>  		return SD_POWERSAVINGS_BALANCE;
> 
>  	return 0;
> @@ -860,7 +869,8 @@ static inline int sd_balance_for_package_power(void)
> 
>  static inline int sd_power_saving_flags(void)
>  {
> -	if (sched_mc_power_savings | sched_smt_power_savings)
> +	if (sched_mn_power_savings || sched_mc_power_savings ||
> +	    sched_smt_power_savings)
>  		return SD_BALANCE_NEWIDLE;
> 
>  	return 0;
> -- 
> 1.6.0.4
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

-- 
Thanks and Regards
gautham

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 8/15] sched: Add parameter sched_mn_power_savings to control MN domain sched policy
  2009-08-26  9:30   ` Gautham R Shenoy
@ 2009-08-27 12:47     ` Andreas Herrmann
  0 siblings, 0 replies; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-27 12:47 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel,
	Vaidyanathan Srinivasan, Balbir Singh

On Wed, Aug 26, 2009 at 03:00:43PM +0530, Gautham R Shenoy wrote:
> Hi Andreas,
> 
> On Thu, Aug 20, 2009 at 03:39:14PM +0200, Andreas Herrmann wrote:
> > @@ -9208,6 +9229,11 @@ int __init sched_create_sysfs_power_savings_entries(struct sysdev_class *cls)
> >  		err = sysfs_create_file(&cls->kset.kobj,
> >  					&attr_sched_mc_power_savings.attr);
> >  #endif
> > +#ifdef CONFIG_SCHED_MN
> > +	if (!err && mc_capable())
> > +		err = sysfs_create_file(&cls->kset.kobj,
> > +					&attr_sched_mn_power_savings.attr);
> > +#endif
> 
> This would create the sysfs tunable even on systems which are
> mc_capable() but don't have multi-nodes on a package, no?

Yes, that is a bug. I should have introduced mn_capable() to
create this file only if really required.
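
For illustration, a minimal sketch of such a guard. mn_capable() and
the nr_nodes_per_socket() helper are hypothetical names, not code from
the posted patches:

  #ifdef CONFIG_SCHED_MN
  /* true only for packages with more than one internal node */
  static inline int mn_capable(void)
  {
          return nr_nodes_per_socket() > 1;  /* helper assumed to exist */
  }
  #endif

  /* ... in sched_create_sysfs_power_savings_entries(): */
  #ifdef CONFIG_SCHED_MN
          if (!err && mn_capable())
                  err = sysfs_create_file(&cls->kset.kobj,
                                          &attr_sched_mn_power_savings.attr);
  #endif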


Thanks,
Andreas

-- 
Operating | Advanced Micro Devices GmbH
  System  | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. München, Germany
 Research | Geschäftsführer: Andrew Bowd, Thomas M. McCoy, Giuliano Meroni
  Center  | Sitz: Dornach, Gemeinde Aschheim, Landkreis München
  (OSRC)  | Registergericht München, HRB Nr. 43632




* Re: [PATCH 15/15] x86: Fix cpu_coregroup_mask to return correct cpumask on multi-node processors
  2009-08-25 10:36         ` Ingo Molnar
@ 2009-08-27 13:18           ` Andreas Herrmann
  0 siblings, 0 replies; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-27 13:18 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Zijlstra, linux-kernel

On Tue, Aug 25, 2009 at 12:36:51PM +0200, Ingo Molnar wrote:
> 
> * Andreas Herrmann <andreas.herrmann3@amd.com> wrote:
> 
> > On Mon, Aug 24, 2009 at 08:21:54PM +0200, Ingo Molnar wrote:
> > > 
> > > * Peter Zijlstra <peterz@infradead.org> wrote:
> > > 
> > > > On Thu, 2009-08-20 at 15:46 +0200, Andreas Herrmann wrote:
> > > > > The correct mask that describes core-siblings of a processor
> > > > > is topology_core_cpumask. See topology adaptation patches, especially
> > > > > http://marc.info/?l=linux-kernel&m=124964999608179
> > > > 
> > > > argh, violence, murder kill.. this is the worst possible hack and 
> > > > you're extending it :/
> > > 
> > > I think most of the trouble here comes from having inconsistent 
> > > names, a rather static structure for sched-domains setup and 
> > > then we are confusing things back and forth.
> > > 
> > > Right now we have thread/sibling, core, CPU/socket and node, 
> > > with many data structures around these hardcoded. Certain 
> > > scheduler features only operate on the hardcoded fields.
> > > 
> > > Now Magny-Cours adds a socket internal node construct to the 
> > > whole thing, names it randomly and basically breaks the 
> > > semi-static representation.
> > > 
> > > We cannot just flip around our static names and hope it goes 
> > > well and everything just drops into place. Everything just falls 
> > > apart really instead.
> > > 
> > > Instead we should have an arch-defined tree and a CPU 
> > > architecture dependent ASCII name associated with each level - 
> > > but not hardcoded into the scheduler.
> > 
> > I admit that it's strange to have the x86 specific SCHED_SMT/MC 
> > snippets in common code.
> > 
> > And the NUMA/SD_NODE stuff is not used by all architectures 
> > either.
> > 
> > Having an arch-defined tree seems the right thing to do.
> 
> yep, with generic helpers to reduce per arch bloat. 
> (named/structured in a neutral way)
> 
> > > Plus we should have independent scheduler domains feature flags 
> > > that can be turned on/off in various levels of that tree, 
> > > depending on the cache and interconnect properties of the 
> > > hardware - without having to worry about what the ASCII name 
> > > says. Those features should be capable to work not just on the 
> > > lowest level of the tree, but on higher levels too, regardless 
> > > whether that level is called a 'core', a 'socket' or an 
> > > 'internal node' on the ASCII level really.
> > > 
> > > This is why i insisted on handling the Magny-Cours topology 
> > > discovery and enumeration patches together with the scheduler 
> > > patches. It can easily become a mess if extended.
> > 
> > I don't buy this argument.
> > 
> > The main source of information when building sched-domains will be 
> > the CPU topology. That must be provided somehow independent of how 
> > scheduling domains are created. When the domains are built you 
> > just need to know which cpumask to use when the sched_groups and 
> > domain's span are determined.
> > 
> > Thus I think the topology detection is rather self-contained and 
> > can/should be provided independent of how the scheduler side is 
> > going to be implemented.
> 
> This is the sysfs bits?

So you "only" object to the sysfs topology additions, correct?

> What is this needed for exactly?

It is needed when you want to know which cores share the same
northbridge or, more generally, which cores are on the same die.
That directly leads to the question whether a more generic
nomenclature should be used: chip_siblings instead of
cpu_node_siblings (which could cover all MCM processors).

A user who wants to pin tasks to dedicated CPUs might need this
information.
Maybe you even want to count northbridge events with PCL and thus have
to know which CPUs share the same northbridge and where you have to
bind the tasks/threads that you want to monitor.

> The scheduler is pretty much the most important thing to tune in a
> topology aware manner, besides memory allocations.

I can leave out the patches that introduce the interface. But I
really want to have a cpu_node_map for a CPU and the cpu_node_id in
cpuinfo_x86 plus the two fixes (for L3 cache and MCE).

Instead of using new sysfs topology attributes the user can also
gather the node information from the shared_cpu_map of the L3
cache. That's not as straightforward as keeping all topology
information in one place but I can live with that.
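
For illustration, a minimal userspace sketch of that approach. It
assumes the L3 shows up as cache/index3 in sysfs, which is common but
not guaranteed:

  #include <stdio.h>

  int main(void)
  {
          char path[128], map[256];
          int cpu;

          for (cpu = 0; cpu < 64; cpu++) {
                  FILE *f;

                  snprintf(path, sizeof(path),
                           "/sys/devices/system/cpu/cpu%d/cache/index3/shared_cpu_map",
                           cpu);
                  f = fopen(path, "r");
                  if (!f)
                          continue;  /* CPU absent or no L3 information */
                  if (fgets(map, sizeof(map), f))
                          printf("cpu%d shares its L3 with: %s", cpu, map);
                  fclose(f);
          }
          return 0;
  }

CPUs that print identical masks share an L3, and hence a northbridge;
that mask is also what you would feed to sched_setaffinity() when
binding monitoring threads.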


Regards,
Andreas

-- 
Operating | Advanced Micro Devices GmbH
  System  | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. München, Germany
 Research | Geschäftsführer: Andrew Bowd, Thomas M. McCoy, Giuliano Meroni
  Center  | Sitz: Dornach, Gemeinde Aschheim, Landkreis München
  (OSRC)  | Registergericht München, HRB Nr. 43632




* Re: [PATCH 15/15] x86: Fix cpu_coregroup_mask to return correct cpumask on multi-node processors
  2009-08-25  9:55       ` Peter Zijlstra
  2009-08-25 10:20         ` Ingo Molnar
  2009-08-25 10:24         ` Andreas Herrmann
@ 2009-08-27 15:25         ` Andreas Herrmann
  2009-08-28 10:39           ` Peter Zijlstra
  2 siblings, 1 reply; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-27 15:25 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel

On Tue, Aug 25, 2009 at 11:55:43AM +0200, Peter Zijlstra wrote:
> On Tue, 2009-08-25 at 11:31 +0200, Andreas Herrmann wrote:
> > On Mon, Aug 24, 2009 at 05:36:16PM +0200, Peter Zijlstra wrote:
> > > On Thu, 2009-08-20 at 15:46 +0200, Andreas Herrmann wrote:
> > > > The correct mask that describes core-siblings of a processor
> > > > is topology_core_cpumask. See topology adaptation patches, especially
> > > > http://marc.info/?l=linux-kernel&m=124964999608179
> > > 
> > > 
> > > argh, violence, murder kill.. this is the worst possible hack and you're
> > > extending it :/
> > 
> > So this is the third code area
> > (besides sched_*_power_savings sysfs interface, and the __cpu_power fiddling)
> > that is crap, mess, a hack.
> > 
> > Didn't know that I'd enter such a minefield when touching this code. ;-(
> 
> Yeah, you're lucky that way ;-) Its been creaking for a while, and I've
> been making noises to the IBM people (who so far have been the main
> source of power saving patches) to clean this up, but now you trod onto
> all of it at once..
> 
> > What would be your preferred solution for the
> > core_cpumask/llc_shared_map stuff?  Another domain level to get rid of
> > this function?
> 
> Right, I'd like to see everything exposed as domain levels.
> 
> 
> numa-cluster
> numa
> socket
> in-socket-numa
> multi-core
> shared-cache
> core
> threads
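
A hedged sketch of what such an arch-defined tree with per-level masks
and flags might look like; every name below is made up for
illustration, none of this exists in the current code:

  struct sd_topo_level {
          const char *name;                        /* "threads", "socket", ... */
          const struct cpumask *(*mask)(int cpu);  /* span at this level */
          int sd_flags;                            /* SD_* flags for the level */
  };

  /* per-arch table, innermost level first, NULL-terminated */
  static struct sd_topo_level x86_sd_topology[] = {
          { "threads",        cpu_smt_mask_of,    SD_SHARE_CPUPOWER },
          { "shared-cache",   cpu_cache_mask_of,  SD_SHARE_PKG_RESOURCES },
          { "in-socket-numa", cpu_node_mask_of,   0 },
          { "socket",         cpu_socket_mask_of, 0 },
          { NULL, },
  };

Generic code would walk the table bottom-up, build one domain per
level, and cull levels whose mask degenerates to the child's.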

Out of curiosity, when does cpu_core_mask differ from llc_shared_map
on Intel? Only in case of MCM (e.g. Core2 Quad)?

> If yes, the hackery of cpu_coregroup_mask() could be replaced by
the domain that I'd like to introduce for Magny-Cours:

  MC domain span would represent one die.
  The new domain would span all dies in an MCM.

Bad idea?



Andreas

-- 
Operating | Advanced Micro Devices GmbH
  System  | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. München, Germany
 Research | Geschäftsführer: Andrew Bowd, Thomas M. McCoy, Giuliano Meroni
  Center  | Sitz: Dornach, Gemeinde Aschheim, Landkreis München
  (OSRC)  | Registergericht München, HRB Nr. 43632




* Re: [PATCH 15/15] x86: Fix cpu_coregroup_mask to return correct cpumask on multi-node processors
  2009-08-25 10:35           ` Peter Zijlstra
@ 2009-08-27 15:42             ` Andreas Herrmann
  0 siblings, 0 replies; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-27 15:42 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel

On Tue, Aug 25, 2009 at 12:35:02PM +0200, Peter Zijlstra wrote:
> On Tue, 2009-08-25 at 12:24 +0200, Andreas Herrmann wrote:
> > On Tue, Aug 25, 2009 at 11:55:43AM +0200, Peter Zijlstra wrote:
> > > On Tue, 2009-08-25 at 11:31 +0200, Andreas Herrmann wrote:
> > > > On Mon, Aug 24, 2009 at 05:36:16PM +0200, Peter Zijlstra wrote:
> > > > > On Thu, 2009-08-20 at 15:46 +0200, Andreas Herrmann wrote:
> > > > > > The correct mask that describes core-siblings of a processor
> > > > > > is topology_core_cpumask. See topology adaptation patches, especially
> > > > > > http://marc.info/?l=linux-kernel&m=124964999608179
> > > > > 
> > > > > 
> > > > > argh, violence, murder kill.. this is the worst possible hack and you're
> > > > > extending it :/
> > > > 
> > > > So this is the third code area
> > > > (besides sched_*_power_savings sysfs interface, and the __cpu_power fiddling)
> > > > that is crap, mess, a hack.
> > > > 
> > > > Didn't know that I'd enter such a minefield when touching this code. ;-(
> > > 
> > > Yeah, you're lucky that way ;-) Its been creaking for a while, and I've
> > > been making noises to the IBM people (who so far have been the main
> > > source of power saving patches) to clean this up, but now you trod onto
> > > all of it at once..
> > > 
> > > > What would be your preferred solution for the
> > > > core_cpumask/llc_shared_map stuff?  Another domain level to get rid of
> > > > this function?
> > > 
> > > Right, I'd like to see everything exposed as domain levels.
> > > 
> > > 
> > > numa-cluster
> > > numa
> > > socket
> > > in-socket-numa
> > > multi-core
> > > shared-cache
> > > core
> > > threads
> > > 
> > > We currently have a fixed order of these things, but I think we should
> > > simply provide helpers for building the sd tree and let the arch code do
> > > that instead of exporting all these masks in a fixed order.
> > > 
> > > Once we get the arch domain tree, we do degenerate stuff to cull all the
> > > trivial domains and fold SD flags.
> > 
> > So any in-socket-numa is only going to happen with the arch-defined
> > domain tree.
> 
> Well, we could see what it takes to make this work without that. I mean,
> this is just how I'd like to see it end up, doesn't mean we cannot work
> on it from multiple angles at the same time.

Yup.

> > Now that this is settled you should throw away the
> > __build_sched_domains cleanup patches that are in tip. They won't be
> > of use when domain creation code is basically changed.
> 
> I'm not sure that's needed, we can continue work on refactoring that.
> Small steps towards something better seems a better plan than a single
> large step.

Ok.


Andreas

-- 
Operating | Advanced Micro Devices GmbH
  System  | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. München, Germany
 Research | Geschäftsführer: Andrew Bowd, Thomas M. McCoy, Giuliano Meroni
  Center  | Sitz: Dornach, Gemeinde Aschheim, Landkreis München
  (OSRC)  | Registergericht München, HRB Nr. 43632




* Re: [PATCH 15/15] x86: Fix cpu_coregroup_mask to return correct cpumask on multi-node processors
  2009-08-27 15:25         ` Andreas Herrmann
@ 2009-08-28 10:39           ` Peter Zijlstra
  2009-08-28 12:03             ` Andreas Herrmann
  0 siblings, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2009-08-28 10:39 UTC (permalink / raw)
  To: Andreas Herrmann; +Cc: Ingo Molnar, linux-kernel

On Thu, 2009-08-27 at 17:25 +0200, Andreas Herrmann wrote:
> On Tue, Aug 25, 2009 at 11:55:43AM +0200, Peter Zijlstra wrote:
> > On Tue, 2009-08-25 at 11:31 +0200, Andreas Herrmann wrote:
> > > On Mon, Aug 24, 2009 at 05:36:16PM +0200, Peter Zijlstra wrote:
> > > > On Thu, 2009-08-20 at 15:46 +0200, Andreas Herrmann wrote:
> > > > > The correct mask that describes core-siblings of a processor
> > > > > is topology_core_cpumask. See topology adaptation patches, especially
> > > > > http://marc.info/?l=linux-kernel&m=124964999608179
> > > > 
> > > > 
> > > > argh, violence, murder kill.. this is the worst possible hack and you're
> > > > extending it :/
> > > 
> > > So this is the third code area
> > > (besides sched_*_power_savings sysfs interface, and the __cpu_power fiddling)
> > > that is crap, mess, a hack.
> > > 
> > > Didn't know that I'd enter such a minefield when touching this code. ;-(
> > 
> > Yeah, you're lucky that way ;-) Its been creaking for a while, and I've
> > been making noises to the IBM people (who so far have been the main
> > source of power saving patches) to clean this up, but now you trod onto
> > all of it at once..
> > 
> > > > What would be your preferred solution for the
> > > core_cpumask/llc_shared_map stuff?  Another domain level to get rid of
> > > this function?
> > 
> > Right, I'd like to see everything exposed as domain levels.
> > 
> > 
> > numa-cluster
> > numa
> > socket
> > in-socket-numa
> > multi-core
> > shared-cache
> > core
> > threads
> 
> Out of curiosity, when does cpu_core_mask differ from llc_shared_map
> on Intel? Only in case of MCM (e.g. Core2 Quad)?

Yes, I think both c2q and some dual-core opteron have multiple cache
domains per socket.

> > If yes, the hackery of cpu_coregroup_mask() could be replaced by
> the domain that I'd like to introduce for Magny-Cours:
> 
>   MC domain span would represent one die.
>   The new domain would span all dies in an MCM.
> 
> Bad idea?

No, I think all the mentioned chips have the multi-die thing in common,
the intel c2q has 2 dual-core dies, the opteron I have seems to be two
single cores and this magny thing has 2 many cores -- teh pun, sides
aching :-)

So the generalization to dies per socket seems sensible.
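
A hedged sketch of that generalization (cpu_die_mask() and
cpu_mcm_mask() are hypothetical names, and the llc accessor is assumed
to exist):

  /* MC level: one die, i.e. the CPUs sharing a last-level cache */
  static const struct cpumask *cpu_die_mask(int cpu)
  {
          return cpu_llc_shared_mask(cpu);   /* accessor assumed */
  }

  /* new level above MC: the whole multi-chip module / package */
  static const struct cpumask *cpu_mcm_mask(int cpu)
  {
          return topology_core_cpumask(cpu); /* all cores in the package */
  }

With those two levels, the power-savings special case in
cpu_coregroup_mask() could go away: MC would always mean "one die", and
the MCM level would cover Core2 Quad, dual-die Opterons and Magny-Cours
alike.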





* Re: [PATCH 15/15] x86: Fix cpu_coregroup_mask to return correct cpumask on multi-node processors
  2009-08-28 10:39           ` Peter Zijlstra
@ 2009-08-28 12:03             ` Andreas Herrmann
  2009-08-28 12:50               ` Peter Zijlstra
  0 siblings, 1 reply; 64+ messages in thread
From: Andreas Herrmann @ 2009-08-28 12:03 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel

On Fri, Aug 28, 2009 at 12:39:44PM +0200, Peter Zijlstra wrote:
> On Thu, 2009-08-27 at 17:25 +0200, Andreas Herrmann wrote:
> > On Tue, Aug 25, 2009 at 11:55:43AM +0200, Peter Zijlstra wrote:
> > > On Tue, 2009-08-25 at 11:31 +0200, Andreas Herrmann wrote:
> > > > On Mon, Aug 24, 2009 at 05:36:16PM +0200, Peter Zijlstra wrote:
> > > > > On Thu, 2009-08-20 at 15:46 +0200, Andreas Herrmann wrote:
> > > > > > The correct mask that describes core-siblings of a processor
> > > > > > is topology_core_cpumask. See topology adaptation patches, especially
> > > > > > http://marc.info/?l=linux-kernel&m=124964999608179
> > > > > 
> > > > > 
> > > > > argh, violence, murder kill.. this is the worst possible hack and you're
> > > > > extending it :/
> > > > 
> > > > So this is the third code area
> > > > (besides sched_*_power_savings sysfs interface, and the __cpu_power fiddling)
> > > > that is crap, mess, a hack.
> > > > 
> > > > Didn't know that I'd enter such a minefield when touching this code. ;-(
> > > 
> > > Yeah, you're lucky that way ;-) Its been creaking for a while, and I've
> > > been making noises to the IBM people (who so far have been the main
> > > source of power saving patches) to clean this up, but now you trod onto
> > > all of it at once..
> > > 
> > > > What would be your preferred solution for the
> > > > core_cpumask/llc_shared_map stuff?  Another domain level to get rid of
> > > > this function?
> > > 
> > > Right, I'd like to see everything exposed as domain levels.
> > > 
> > > 
> > > numa-cluster
> > > numa
> > > socket
> > > in-socket-numa
> > > multi-core
> > > shared-cache
> > > core
> > > threads
> > 
> > Out of curiosity, when does cpu_core_mask differ from llc_shared_map
> > on Intel? Only in case of MCM (e.g. Core2 Quad)?
> 
> Yes, I think both c2q and some dual-core opteron have multiple cache
> domains per socket.
> 
> > If yes, the hackery of cpu_coregroup_mask() could be replaced by
> > the domain that I'd like to introduce for Magny-Cours:
> > 
> >   MC domain span would represent one die.
> >   The new domain would span all dies in an MCM.
> > 
> > Bad idea?
> 
> No, I think all the mentioned chips have the multi-die thing in common,
> the intel c2q has 2 dual-core dies,

> the opteron I have seems to be two
> single cores

Really? I am not aware of such a thing.

Can you check how many sets of northbridge functions you have?  If
you have two dies in one package, you should see one set of
PCI functions at bus 0 device 24 and a second set of PCI functions at
bus 0 device 25, e.g.

  # lspci -d 1022:
  ...
  00:18.0 Host bridge [0600]: Advanced Micro Devices [AMD] Family 10h
          [Opteron, Athlon64, Sempron] HyperTransport Configuration [1022:1200]
  ...
  00:19.0 Host bridge [0600]: Advanced Micro Devices [AMD] Family 10h
          [Opteron, Athlon64, Sempron] HyperTransport Configuration [1022:1200]
  ... 

> and this magny thing has 2 many cores -- teh pun, sides
> aching :-)
> 
> So the generalization to dies per socket seems sensible.

Yup


Andreas

-- 
Operating | Advanced Micro Devices GmbH
  System  | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. München, Germany
 Research | Geschäftsführer: Andrew Bowd, Thomas M. McCoy, Giuliano Meroni
  Center  | Sitz: Dornach, Gemeinde Aschheim, Landkreis München
  (OSRC)  | Registergericht München, HRB Nr. 43632




* Re: [PATCH 15/15] x86: Fix cpu_coregroup_mask to return correct cpumask on multi-node processors
  2009-08-28 12:03             ` Andreas Herrmann
@ 2009-08-28 12:50               ` Peter Zijlstra
  0 siblings, 0 replies; 64+ messages in thread
From: Peter Zijlstra @ 2009-08-28 12:50 UTC (permalink / raw)
  To: Andreas Herrmann; +Cc: Ingo Molnar, linux-kernel

On Fri, 2009-08-28 at 14:03 +0200, Andreas Herrmann wrote:
> 
> > the opteron I have seems to be two
> > single cores
> 
> Really? I am not aware of such a thing.
> 
> Can you check how many sets of northbridge functions you have?  If
> you have two dies in one package, you should see one set of
> PCI functions at bus 0 device 24 and a second set of PCI functions at
> bus 0 device 25, e.g.
> 
>   # lspci -d 1022:
>   ...
>   00:18.0 Host bridge [0600]: Advanced Micro Devices [AMD] Family 10h
>           [Opteron, Athlon64, Sempron] HyperTransport Configuration [1022:1200]
>   ...
>   00:19.0 Host bridge [0600]: Advanced Micro Devices [AMD] Family 10h
>           [Opteron, Athlon64, Sempron] HyperTransport Configuration [1022:1200]
>   ... 

Doesn't appear to be the case, maybe I mis-remembered the various masks
for that machine.

Oh well ;-)



end of thread

Thread overview: 64+ messages
2009-08-20 13:12 [RFC][PATCH 0/15] sched: Fix scheduling for multi-node processors Andreas Herrmann
2009-08-20 13:15 ` [PATCH 1/15] x86, sched: Add config option for multi-node CPU scheduling Andreas Herrmann
2009-08-21 13:50   ` Valdis.Kletnieks
2009-08-24  8:49     ` Andreas Herrmann
2009-08-20 13:34 ` [PATCH 2/15] sched, x86: Provide initializer for MN scheduling domain, define MN level Andreas Herrmann
2009-08-20 13:34 ` [PATCH 3/15] sched: Add cpumask to be used when building MN domain Andreas Herrmann
2009-08-20 13:35 ` [PATCH 4/15] sched: Define per CPU variables and cpu_to_group function for " Andreas Herrmann
2009-08-20 13:36 ` [PATCH 5/15] sched: Add function to build MN sched domain Andreas Herrmann
2009-08-20 13:37 ` [PATCH 6/15] sched: Add support for MN domain in build_sched_groups Andreas Herrmann
2009-08-20 13:38 ` [PATCH 7/15] sched: Activate build of MN domains Andreas Herrmann
2009-08-20 13:39 ` [PATCH 8/15] sched: Add parameter sched_mn_power_savings to control MN domain sched policy Andreas Herrmann
2009-08-24 14:56   ` Peter Zijlstra
2009-08-24 15:32     ` Vaidyanathan Srinivasan
2009-08-24 15:45       ` Peter Zijlstra
2009-08-25  7:52         ` Andreas Herrmann
2009-08-25  7:50       ` Andreas Herrmann
2009-08-25  6:24     ` Andreas Herrmann
2009-08-25  6:41       ` Peter Zijlstra
2009-08-25  8:38         ` Andreas Herrmann
2009-08-26  9:30   ` Gautham R Shenoy
2009-08-27 12:47     ` Andreas Herrmann
2009-08-20 13:40 ` [PATCH 9/15] sched: Check sched_mn_power_savings when setting flags for CPU and MN domains Andreas Herrmann
2009-08-24 14:57   ` Peter Zijlstra
2009-08-25  9:34     ` Gautham R Shenoy
2009-08-26 10:01   ` Gautham R Shenoy
2009-08-20 13:41 ` [PATCH 10/15] sched: Check for sched_mn_power_savings when doing load balancing Andreas Herrmann
2009-08-24 15:03   ` Peter Zijlstra
2009-08-24 15:40     ` Vaidyanathan Srinivasan
2009-08-25  8:00       ` Andreas Herrmann
2009-08-20 13:41 ` [PATCH 11/15] sched: Pass unlimited __cpu_power information to upper domain level groups Andreas Herrmann
2009-08-24 15:21   ` Peter Zijlstra
2009-08-24 16:44     ` Balbir Singh
2009-08-24 17:26       ` Peter Zijlstra
2009-08-24 18:19         ` Balbir Singh
2009-08-25  7:11           ` Peter Zijlstra
2009-08-25  8:04             ` Balbir Singh
2009-08-25  8:30               ` Peter Zijlstra
2009-08-25  8:51     ` Andreas Herrmann
2009-08-20 13:42 ` [PATCH 12/15] sched: Allow NODE domain to be parent of MC instead of CPU domain Andreas Herrmann
2009-08-24 15:32   ` Peter Zijlstra
2009-08-25  8:55     ` Andreas Herrmann
2009-08-20 13:43 ` [PATCH 13/15] sched: Detect child domain of NUMA (aka NODE) domain Andreas Herrmann
2009-08-24 15:34   ` Peter Zijlstra
2009-08-25  9:13     ` Andreas Herrmann
2009-08-20 13:45 ` [PATCH 14/15] sched: Conditionally limit __cpu_power when child sched domain has type NODE Andreas Herrmann
2009-08-24 15:35   ` Peter Zijlstra
2009-08-25  9:19     ` Andreas Herrmann
2009-08-20 13:46 ` [PATCH 15/15] x86: Fix cpu_coregroup_mask to return correct cpumask on multi-node processors Andreas Herrmann
2009-08-24 15:36   ` Peter Zijlstra
2009-08-24 18:21     ` Ingo Molnar
2009-08-25 10:13       ` Andreas Herrmann
2009-08-25 10:36         ` Ingo Molnar
2009-08-27 13:18           ` Andreas Herrmann
2009-08-25  9:31     ` Andreas Herrmann
2009-08-25  9:55       ` Peter Zijlstra
2009-08-25 10:20         ` Ingo Molnar
2009-08-25 10:24         ` Andreas Herrmann
2009-08-25 10:28           ` Ingo Molnar
2009-08-25 10:35           ` Peter Zijlstra
2009-08-27 15:42             ` Andreas Herrmann
2009-08-27 15:25         ` Andreas Herrmann
2009-08-28 10:39           ` Peter Zijlstra
2009-08-28 12:03             ` Andreas Herrmann
2009-08-28 12:50               ` Peter Zijlstra
