* [PATCH 0/2] s390: introduce drawer scheduling domain
@ 2016-06-08  9:09 Heiko Carstens
  2016-06-08  9:09 ` [PATCH 1/2] topology/sysfs: provide drawer id and siblings attributes Heiko Carstens
  2016-06-08  9:09 ` [PATCH 2/2] s390/topology: add drawer scheduling domain level Heiko Carstens
  0 siblings, 2 replies; 11+ messages in thread
From: Heiko Carstens @ 2016-06-08  9:09 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Peter Zijlstra; +Cc: Martin Schwidefsky, linux-kernel

Greg, Peter,

the s390 z13 machines expose yet another topology level, which sits on
top of the already existing "book" scheduling domain that is currently
only used by s390.

Therefore I would like to introduce a new "drawer" level to the
topology sysfs code. The scheduler code itself does not need any
changes to support this, since I can introduce as many scheduling
domains as I like within architecture code. However, the sysfs
representation is common code.
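
For reference, the only architecture-side mechanism needed is a larger
topology table passed to set_sched_topology(). A minimal sketch of the
pattern, mirroring the table the second patch installs for s390:

	static struct sched_domain_topology_level s390_topology[] = {
		{ cpu_thread_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
		{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
		{ cpu_book_mask, SD_INIT_NAME(BOOK) },
		{ cpu_drawer_mask, SD_INIT_NAME(DRAWER) },	/* new level */
		{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
		{ NULL, },
	};

	/* installed once during early boot, see s390_topology_init() */
	set_sched_topology(s390_topology);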

The first patch introduces the files

/sys/devices/system/cpu/cpuX/topology/drawer_id
/sys/devices/system/cpu/cpuX/topology/drawer_siblings
/sys/devices/system/cpu/cpuX/topology/drawer_siblings_list

if CONFIG_SCHED_DRAWER is enabled.
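
For illustration, a minimal user space sketch that reads the new
attributes for cpu0 (hypothetical helper, not part of the patches;
actual values are machine dependent and the files are simply absent if
CONFIG_SCHED_DRAWER is off):

	#include <stdio.h>

	/* print one topology attribute of cpu0, e.g. "drawer_id" */
	static void print_attr(const char *name)
	{
		char path[256], buf[256];
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu0/topology/%s", name);
		f = fopen(path, "r");
		if (!f)
			return;
		if (fgets(buf, sizeof(buf), f))
			printf("%s: %s", name, buf);
		fclose(f);
	}

	int main(void)
	{
		print_attr("drawer_id");
		print_attr("drawer_siblings");
		print_attr("drawer_siblings_list");
		return 0;
	}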

The second patch implements support for the new topology level on
s390. Performance measurements showed an increase of up to 8% for some
workloads, with no apparent negative impact.

If this is acceptable I'd like to add these patches to the s390 tree
for the next merge window.

For reference the z13 machine is described here:
www.redbooks.ibm.com/redbooks/pdfs/sg248251.pdf

Chapter 2 (or 2.2) gives a brief overview of what I'm talking about.

Heiko Carstens (2):
  topology/sysfs: provide drawer id and siblings attributes
  s390/topology: add drawer scheduling domain level

 Documentation/cputopology.txt    | 40 ++++++++++++++++++++++++++++++++--------
 arch/s390/Kconfig                |  4 ++++
 arch/s390/include/asm/topology.h |  4 ++++
 arch/s390/kernel/topology.c      | 33 +++++++++++++++++++++++++++------
 arch/s390/numa/mode_emu.c        | 25 ++++++++++++++++++++-----
 drivers/base/topology.c          | 13 +++++++++++++
 6 files changed, 100 insertions(+), 19 deletions(-)

-- 
2.6.6

* [PATCH 1/2] topology/sysfs: provide drawer id and siblings attributes
  2016-06-08  9:09 [PATCH 0/2] s390: introduce drawer scheduling domain Heiko Carstens
@ 2016-06-08  9:09 ` Heiko Carstens
  2016-06-08  9:09 ` [PATCH 2/2] s390/topology: add drawer scheduling domain level Heiko Carstens
  1 sibling, 0 replies; 11+ messages in thread
From: Heiko Carstens @ 2016-06-08  9:09 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Peter Zijlstra; +Cc: Martin Schwidefsky, linux-kernel

The s390 cpu topology gained another hierarchy level. The top level is
now called drawer and contains several books. A book used to be the
top level.

In order to expose the cpu topology to user space, allow the creation
of new sysfs attributes dependent on CONFIG_SCHED_DRAWER, which an
architecture may define and select.

These additional attributes will be available:

/sys/devices/system/cpu/cpuX/topology/drawer_id
/sys/devices/system/cpu/cpuX/topology/drawer_siblings
/sys/devices/system/cpu/cpuX/topology/drawer_siblings_list
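
For example, on a hypothetical machine where cpuX shares a drawer with
cpus 0-15, these could contain "0", "0000ffff" and "0-15" respectively:
drawer_id is a plain integer, the siblings file a hex cpu mask and the
siblings_list file a human-readable cpu list (exact values and mask
width are machine and config dependent).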

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
---
 Documentation/cputopology.txt | 40 ++++++++++++++++++++++++++++++++--------
 drivers/base/topology.c       | 13 +++++++++++++
 2 files changed, 45 insertions(+), 8 deletions(-)

diff --git a/Documentation/cputopology.txt b/Documentation/cputopology.txt
index 12b1b25b4da9..f722f227a73b 100644
--- a/Documentation/cputopology.txt
+++ b/Documentation/cputopology.txt
@@ -20,48 +20,70 @@ to /proc/cpuinfo output of some architectures:
 	identifier (rather than the kernel's).	The actual value is
 	architecture and platform dependent.
 
-4) /sys/devices/system/cpu/cpuX/topology/thread_siblings:
+4) /sys/devices/system/cpu/cpuX/topology/drawer_id:
+
+	the drawer ID of cpuX. Typically it is the hardware platform's
+	identifier (rather than the kernel's).	The actual value is
+	architecture and platform dependent.
+
+5) /sys/devices/system/cpu/cpuX/topology/thread_siblings:
 
 	internal kernel map of cpuX's hardware threads within the same
 	core as cpuX.
 
-5) /sys/devices/system/cpu/cpuX/topology/thread_siblings_list:
+6) /sys/devices/system/cpu/cpuX/topology/thread_siblings_list:
 
 	human-readable list of cpuX's hardware threads within the same
 	core as cpuX.
 
-6) /sys/devices/system/cpu/cpuX/topology/core_siblings:
+7) /sys/devices/system/cpu/cpuX/topology/core_siblings:
 
 	internal kernel map of cpuX's hardware threads within the same
 	physical_package_id.
 
-7) /sys/devices/system/cpu/cpuX/topology/core_siblings_list:
+8) /sys/devices/system/cpu/cpuX/topology/core_siblings_list:
 
 	human-readable list of cpuX's hardware threads within the same
 	physical_package_id.
 
-8) /sys/devices/system/cpu/cpuX/topology/book_siblings:
+9) /sys/devices/system/cpu/cpuX/topology/book_siblings:
 
 	internal kernel map of cpuX's hardware threads within the same
 	book_id.
 
-9) /sys/devices/system/cpu/cpuX/topology/book_siblings_list:
+10) /sys/devices/system/cpu/cpuX/topology/book_siblings_list:
 
 	human-readable list of cpuX's hardware threads within the same
 	book_id.
 
+11) /sys/devices/system/cpu/cpuX/topology/drawer_siblings:
+
+	internal kernel map of cpuX's hardware threads within the same
+	drawer_id.
+
+12) /sys/devices/system/cpu/cpuX/topology/drawer_siblings_list:
+
+	human-readable list of cpuX's hardware threads within the same
+	drawer_id.
+
 To implement it in an architecture-neutral way, a new source file,
-drivers/base/topology.c, is to export the 6 or 9 attributes. The three book
-related sysfs files will only be created if CONFIG_SCHED_BOOK is selected.
+drivers/base/topology.c, is to export the 6 to 12 attributes. The book
+and drawer related sysfs files will only be created if CONFIG_SCHED_BOOK
+and CONFIG_SCHED_DRAWER are selected.
+
+CONFIG_SCHED_BOOK and CONFIG_SCHED_DRAWER are currently only used on s390,
+where they reflect the cpu and cache hierarchy.
 
 For an architecture to support this feature, it must define some of
 these macros in include/asm-XXX/topology.h:
 #define topology_physical_package_id(cpu)
 #define topology_core_id(cpu)
 #define topology_book_id(cpu)
+#define topology_drawer_id(cpu)
 #define topology_sibling_cpumask(cpu)
 #define topology_core_cpumask(cpu)
 #define topology_book_cpumask(cpu)
+#define topology_drawer_cpumask(cpu)
 
 The type of **_id macros is int.
 The type of **_cpumask macros is (const) struct cpumask *. The latter
@@ -78,6 +100,8 @@ not defined by include/asm-XXX/topology.h:
 
 For architectures that don't support books (CONFIG_SCHED_BOOK) there are no
 default definitions for topology_book_id() and topology_book_cpumask().
+For architectures that don't support drawers (CONFIG_SCHED_DRAWER) there are
+no default definitions for topology_drawer_id() and topology_drawer_cpumask().
 
 Additionally, CPU topology information is provided under
 /sys/devices/system/cpu and includes these files.  The internal
diff --git a/drivers/base/topology.c b/drivers/base/topology.c
index 8b7d7f8e5851..df3c97cb4c99 100644
--- a/drivers/base/topology.c
+++ b/drivers/base/topology.c
@@ -77,6 +77,14 @@ static DEVICE_ATTR_RO(book_siblings);
 static DEVICE_ATTR_RO(book_siblings_list);
 #endif
 
+#ifdef CONFIG_SCHED_DRAWER
+define_id_show_func(drawer_id);
+static DEVICE_ATTR_RO(drawer_id);
+define_siblings_show_func(drawer_siblings, drawer_cpumask);
+static DEVICE_ATTR_RO(drawer_siblings);
+static DEVICE_ATTR_RO(drawer_siblings_list);
+#endif
+
 static struct attribute *default_attrs[] = {
 	&dev_attr_physical_package_id.attr,
 	&dev_attr_core_id.attr,
@@ -89,6 +97,11 @@ static struct attribute *default_attrs[] = {
 	&dev_attr_book_siblings.attr,
 	&dev_attr_book_siblings_list.attr,
 #endif
+#ifdef CONFIG_SCHED_DRAWER
+	&dev_attr_drawer_id.attr,
+	&dev_attr_drawer_siblings.attr,
+	&dev_attr_drawer_siblings_list.attr,
+#endif
 	NULL
 };
 
-- 
2.6.6

* [PATCH 2/2] s390/topology: add drawer scheduling domain level
  2016-06-08  9:09 [PATCH 0/2] s390: introduce drawer scheduling domain Heiko Carstens
  2016-06-08  9:09 ` [PATCH 1/2] topology/sysfs: provide drawer id and siblings attributes Heiko Carstens
@ 2016-06-08  9:09 ` Heiko Carstens
  2016-06-13 11:06   ` Peter Zijlstra
  1 sibling, 1 reply; 11+ messages in thread
From: Heiko Carstens @ 2016-06-08  9:09 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Peter Zijlstra; +Cc: Martin Schwidefsky, linux-kernel

The z13 machine added a fourth level to the cpu topology
information. The new top level is called drawer.

A drawer contains two books, which used to be the top level.

Adding this additional scheduling domain did show performance
improvements for some workloads of up to 8%, while there don't
seem to be any workloads impacted in a negative way.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
---
 arch/s390/Kconfig                |  4 ++++
 arch/s390/include/asm/topology.h |  4 ++++
 arch/s390/kernel/topology.c      | 33 +++++++++++++++++++++++++++------
 arch/s390/numa/mode_emu.c        | 25 ++++++++++++++++++++-----
 4 files changed, 55 insertions(+), 11 deletions(-)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index a8c259059adf..9d35d6d084da 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -477,6 +477,9 @@ config SCHED_MC
 config SCHED_BOOK
 	def_bool n
 
+config SCHED_DRAWER
+	def_bool n
+
 config SCHED_TOPOLOGY
 	def_bool y
 	prompt "Topology scheduler support"
@@ -484,6 +487,7 @@ config SCHED_TOPOLOGY
 	select SCHED_SMT
 	select SCHED_MC
 	select SCHED_BOOK
+	select SCHED_DRAWER
 	help
 	  Topology scheduler support improves the CPU scheduler's decision
 	  making when dealing with machines that have multi-threading,
diff --git a/arch/s390/include/asm/topology.h b/arch/s390/include/asm/topology.h
index 6b53962e807e..f15f5571ca2b 100644
--- a/arch/s390/include/asm/topology.h
+++ b/arch/s390/include/asm/topology.h
@@ -14,10 +14,12 @@ struct cpu_topology_s390 {
 	unsigned short core_id;
 	unsigned short socket_id;
 	unsigned short book_id;
+	unsigned short drawer_id;
 	unsigned short node_id;
 	cpumask_t thread_mask;
 	cpumask_t core_mask;
 	cpumask_t book_mask;
+	cpumask_t drawer_mask;
 };
 
 DECLARE_PER_CPU(struct cpu_topology_s390, cpu_topology);
@@ -30,6 +32,8 @@ DECLARE_PER_CPU(struct cpu_topology_s390, cpu_topology);
 #define topology_core_cpumask(cpu)	  (&per_cpu(cpu_topology, cpu).core_mask)
 #define topology_book_id(cpu)		  (per_cpu(cpu_topology, cpu).book_id)
 #define topology_book_cpumask(cpu)	  (&per_cpu(cpu_topology, cpu).book_mask)
+#define topology_drawer_id(cpu)		  (per_cpu(cpu_topology, cpu).drawer_id)
+#define topology_drawer_cpumask(cpu)	  (&per_cpu(cpu_topology, cpu).drawer_mask)
 
 #define mc_capable() 1
 
diff --git a/arch/s390/kernel/topology.c b/arch/s390/kernel/topology.c
index 64298a867589..44745e751c3a 100644
--- a/arch/s390/kernel/topology.c
+++ b/arch/s390/kernel/topology.c
@@ -46,6 +46,7 @@ static DECLARE_WORK(topology_work, topology_work_fn);
  */
 static struct mask_info socket_info;
 static struct mask_info book_info;
+static struct mask_info drawer_info;
 
 DEFINE_PER_CPU(struct cpu_topology_s390, cpu_topology);
 EXPORT_PER_CPU_SYMBOL_GPL(cpu_topology);
@@ -80,6 +81,7 @@ static cpumask_t cpu_thread_map(unsigned int cpu)
 }
 
 static struct mask_info *add_cpus_to_mask(struct topology_core *tl_core,
+					  struct mask_info *drawer,
 					  struct mask_info *book,
 					  struct mask_info *socket,
 					  int one_socket_per_cpu)
@@ -97,9 +99,11 @@ static struct mask_info *add_cpus_to_mask(struct topology_core *tl_core,
 			continue;
 		for (i = 0; i <= smp_cpu_mtid; i++) {
 			topo = &per_cpu(cpu_topology, lcpu + i);
+			topo->drawer_id = drawer->id;
 			topo->book_id = book->id;
 			topo->core_id = rcore;
 			topo->thread_id = lcpu + i;
+			cpumask_set_cpu(lcpu + i, &drawer->mask);
 			cpumask_set_cpu(lcpu + i, &book->mask);
 			cpumask_set_cpu(lcpu + i, &socket->mask);
 			if (one_socket_per_cpu)
@@ -128,6 +132,11 @@ static void clear_masks(void)
 		cpumask_clear(&info->mask);
 		info = info->next;
 	}
+	info = &drawer_info;
+	while (info) {
+		cpumask_clear(&info->mask);
+		info = info->next;
+	}
 }
 
 static union topology_entry *next_tle(union topology_entry *tle)
@@ -141,12 +150,17 @@ static void __tl_to_masks_generic(struct sysinfo_15_1_x *info)
 {
 	struct mask_info *socket = &socket_info;
 	struct mask_info *book = &book_info;
+	struct mask_info *drawer = &drawer_info;
 	union topology_entry *tle, *end;
 
 	tle = info->tle;
 	end = (union topology_entry *)((unsigned long)info + info->length);
 	while (tle < end) {
 		switch (tle->nl) {
+		case 3:
+			drawer = drawer->next;
+			drawer->id = tle->container.id;
+			break;
 		case 2:
 			book = book->next;
 			book->id = tle->container.id;
@@ -156,7 +170,7 @@ static void __tl_to_masks_generic(struct sysinfo_15_1_x *info)
 			socket->id = tle->container.id;
 			break;
 		case 0:
-			add_cpus_to_mask(&tle->cpu, book, socket, 0);
+			add_cpus_to_mask(&tle->cpu, drawer, book, socket, 0);
 			break;
 		default:
 			clear_masks();
@@ -170,6 +184,7 @@ static void __tl_to_masks_z10(struct sysinfo_15_1_x *info)
 {
 	struct mask_info *socket = &socket_info;
 	struct mask_info *book = &book_info;
+	struct mask_info *drawer = &drawer_info;
 	union topology_entry *tle, *end;
 
 	tle = info->tle;
@@ -181,7 +196,7 @@ static void __tl_to_masks_z10(struct sysinfo_15_1_x *info)
 			book->id = tle->container.id;
 			break;
 		case 0:
-			socket = add_cpus_to_mask(&tle->cpu, book, socket, 1);
+			socket = add_cpus_to_mask(&tle->cpu, drawer, book, socket, 1);
 			break;
 		default:
 			clear_masks();
@@ -257,11 +272,13 @@ static void update_cpu_masks(void)
 		topo->thread_mask = cpu_thread_map(cpu);
 		topo->core_mask = cpu_group_map(&socket_info, cpu);
 		topo->book_mask = cpu_group_map(&book_info, cpu);
+		topo->drawer_mask = cpu_group_map(&drawer_info, cpu);
 		if (!MACHINE_HAS_TOPOLOGY) {
 			topo->thread_id = cpu;
 			topo->core_id = cpu;
 			topo->socket_id = cpu;
 			topo->book_id = cpu;
+			topo->drawer_id = cpu;
 		}
 	}
 	numa_update_cpu_topology();
@@ -269,10 +286,7 @@ static void update_cpu_masks(void)
 
 void store_topology(struct sysinfo_15_1_x *info)
 {
-	if (topology_max_mnest >= 3)
-		stsi(info, 15, 1, 3);
-	else
-		stsi(info, 15, 1, 2);
+	stsi(info, 15, 1, min(topology_max_mnest, 4));
 }
 
 int arch_update_cpu_topology(void)
@@ -442,6 +456,11 @@ static const struct cpumask *cpu_book_mask(int cpu)
 	return &per_cpu(cpu_topology, cpu).book_mask;
 }
 
+static const struct cpumask *cpu_drawer_mask(int cpu)
+{
+	return &per_cpu(cpu_topology, cpu).drawer_mask;
+}
+
 static int __init early_parse_topology(char *p)
 {
 	return kstrtobool(p, &topology_enabled);
@@ -452,6 +471,7 @@ static struct sched_domain_topology_level s390_topology[] = {
 	{ cpu_thread_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
 	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
 	{ cpu_book_mask, SD_INIT_NAME(BOOK) },
+	{ cpu_drawer_mask, SD_INIT_NAME(DRAWER) },
 	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
 	{ NULL, },
 };
@@ -487,6 +507,7 @@ static int __init s390_topology_init(void)
 	printk(KERN_CONT " / %d\n", info->mnest);
 	alloc_masks(info, &socket_info, 1);
 	alloc_masks(info, &book_info, 2);
+	alloc_masks(info, &drawer_info, 3);
 	set_sched_topology(s390_topology);
 	return 0;
 }
diff --git a/arch/s390/numa/mode_emu.c b/arch/s390/numa/mode_emu.c
index 828d0695d0d4..fbc394e16b2c 100644
--- a/arch/s390/numa/mode_emu.c
+++ b/arch/s390/numa/mode_emu.c
@@ -34,7 +34,8 @@
 #define DIST_CORE	1
 #define DIST_MC		2
 #define DIST_BOOK	3
-#define DIST_MAX	4
+#define DIST_DRAWER	4
+#define DIST_MAX	5
 
 /* Node distance reported to common code */
 #define EMU_NODE_DIST	10
@@ -43,7 +44,7 @@
 #define NODE_ID_FREE	-1
 
 /* Different levels of toptree */
-enum toptree_level {CORE, MC, BOOK, NODE, TOPOLOGY};
+enum toptree_level {CORE, MC, BOOK, DRAWER, NODE, TOPOLOGY};
 
 /* The two toptree IDs */
 enum {TOPTREE_ID_PHYS, TOPTREE_ID_NUMA};
@@ -114,6 +115,14 @@ static int cores_free(struct toptree *tree)
  */
 static struct toptree *core_node(struct toptree *core)
 {
+	return core->parent->parent->parent->parent;
+}
+
+/*
+ * Return drawer of core
+ */
+static struct toptree *core_drawer(struct toptree *core)
+{
 	return core->parent->parent->parent;
 }
 
@@ -138,6 +147,8 @@ static struct toptree *core_mc(struct toptree *core)
  */
 static int dist_core_to_core(struct toptree *core1, struct toptree *core2)
 {
+	if (core_drawer(core1)->id != core_drawer(core2)->id)
+		return DIST_DRAWER;
 	if (core_book(core1)->id != core_book(core2)->id)
 		return DIST_BOOK;
 	if (core_mc(core1)->id != core_mc(core2)->id)
@@ -262,6 +273,8 @@ static void toptree_to_numa_first(struct toptree *numa, struct toptree *phys)
 	struct toptree *core;
 
 	/* Always try to move perfectly fitting structures first */
+	move_level_to_numa(numa, phys, DRAWER, true);
+	move_level_to_numa(numa, phys, DRAWER, false);
 	move_level_to_numa(numa, phys, BOOK, true);
 	move_level_to_numa(numa, phys, BOOK, false);
 	move_level_to_numa(numa, phys, MC, true);
@@ -335,7 +348,7 @@ static struct toptree *toptree_to_numa(struct toptree *phys)
  */
 static struct toptree *toptree_from_topology(void)
 {
-	struct toptree *phys, *node, *book, *mc, *core;
+	struct toptree *phys, *node, *drawer, *book, *mc, *core;
 	struct cpu_topology_s390 *top;
 	int cpu;
 
@@ -344,10 +357,11 @@ static struct toptree *toptree_from_topology(void)
 	for_each_online_cpu(cpu) {
 		top = &per_cpu(cpu_topology, cpu);
 		node = toptree_get_child(phys, 0);
-		book = toptree_get_child(node, top->book_id);
+		drawer = toptree_get_child(node, top->drawer_id);
+		book = toptree_get_child(drawer, top->book_id);
 		mc = toptree_get_child(book, top->socket_id);
 		core = toptree_get_child(mc, top->core_id);
-		if (!book || !mc || !core)
+		if (!drawer || !book || !mc || !core)
 			panic("NUMA emulation could not allocate memory");
 		cpumask_set_cpu(cpu, &core->mask);
 		toptree_update_mask(mc);
@@ -368,6 +382,7 @@ static void topology_add_core(struct toptree *core)
 		cpumask_copy(&top->thread_mask, &core->mask);
 		cpumask_copy(&top->core_mask, &core_mc(core)->mask);
 		cpumask_copy(&top->book_mask, &core_book(core)->mask);
+		cpumask_copy(&top->drawer_mask, &core_drawer(core)->mask);
 		cpumask_set_cpu(cpu, &node_to_cpumask_map[core_node(core)->id]);
 		top->node_id = core_node(core)->id;
 	}
-- 
2.6.6

* Re: [PATCH 2/2] s390/topology: add drawer scheduling domain level
  2016-06-08  9:09 ` [PATCH 2/2] s390/topology: add drawer scheduling domain level Heiko Carstens
@ 2016-06-13 11:06   ` Peter Zijlstra
  2016-06-13 11:22     ` Heiko Carstens
  2016-06-13 11:25     ` Heiko Carstens
  0 siblings, 2 replies; 11+ messages in thread
From: Peter Zijlstra @ 2016-06-13 11:06 UTC (permalink / raw)
  To: Heiko Carstens; +Cc: Greg Kroah-Hartman, Martin Schwidefsky, linux-kernel

On Wed, Jun 08, 2016 at 11:09:16AM +0200, Heiko Carstens wrote:
> The z13 machine added a fourth level to the cpu topology
> information. The new top level is called drawer.
> 
> A drawer contains two books, which used to be the top level.
> 
> Adding this additional scheduling domain did show performance
> improvements for some workloads of up to 8%, while there don't
> seem to be any workloads impacted in a negative way.

Right; so no objection.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

You still don't want to make NUMA explicit on this thing? So while I
suppose the SC 480M L4 cache does hide some of it, there can be up to 8
nodes on this thing. Which seems to me there's a win to be had by exposing
it.

Of course, the moment you go all virt/LPAR on it, that all gets really
interesting, but for those cases where you run 1:1 it might make sense.

Also, are you sure you don't want some of the behaviour changed for the
drawer domains? I could for example imagine you wouldn't want
SD_WAKE_AFFINE set (we disable that for NUMA domains as well).

* Re: [PATCH 2/2] s390/topology: add drawer scheduling domain level
  2016-06-13 11:06   ` Peter Zijlstra
@ 2016-06-13 11:22     ` Heiko Carstens
  2016-06-13 13:06       ` Peter Zijlstra
  2016-06-13 11:25     ` Heiko Carstens
  1 sibling, 1 reply; 11+ messages in thread
From: Heiko Carstens @ 2016-06-13 11:22 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Greg Kroah-Hartman, Martin Schwidefsky, linux-kernel

On Mon, Jun 13, 2016 at 01:06:21PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 08, 2016 at 11:09:16AM +0200, Heiko Carstens wrote:
> > The z13 machine added a fourth level to the cpu topology
> > information. The new top level is called drawer.
> > 
> > A drawer contains two books, which used to be the top level.
> > 
> > Adding this additional scheduling domain did show performance
> > improvements for some workloads of up to 8%, while there don't
> > seem to be any workloads impacted in a negative way.
> 
> Right; so no objection.
> 
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Thanks!

> You still don't want to make NUMA explicit on this thing? So while I
> suppose the SC 480M L4 cache does hide some of it, there can be up to 8
> nodes on this thing. Which seems to me there's a win to be had by exposing
> it.
> 
> Of course, the moment you go all virt/LPAR on it, that all gets really
> interesting, but for those cases where you run 1:1 it might make sense.

Yes, and actually we are all virt/LPAR always, so this is unfortunately not
very easy to do. And yes, I do agree that for the 1:1 case it most likely
would make sense, however we don't have any run-time guarantee to stay 1:1.

> Also, are you sure you don't want some of the behaviour changed for the
> drawer domains? I could for example imagine you wouldn't want
> SD_WAKE_AFFINE set (we disable that for NUMA domains as well).

That's something we need to look into further as well. Thanks for pointing
this out!

* Re: [PATCH 2/2] s390/topology: add drawer scheduling domain level
  2016-06-13 11:06   ` Peter Zijlstra
  2016-06-13 11:22     ` Heiko Carstens
@ 2016-06-13 11:25     ` Heiko Carstens
  2016-06-13 11:33       ` Peter Zijlstra
  1 sibling, 1 reply; 11+ messages in thread
From: Heiko Carstens @ 2016-06-13 11:25 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Greg Kroah-Hartman, Martin Schwidefsky, linux-kernel

On Mon, Jun 13, 2016 at 01:06:21PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 08, 2016 at 11:09:16AM +0200, Heiko Carstens wrote:
> > The z13 machine added a fourth level to the cpu topology
> > information. The new top level is called drawer.
> > 
> > A drawer contains two books, which used to be the top level.
> > 
> > Adding this additional scheduling domain did show performance
> > improvements for some workloads of up to 8%, while there don't
> > seem to be any workloads impacted in a negative way.
> 
> Right; so no objection.
> 
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

May I add your ACK also to the sysfs patch?

* Re: [PATCH 2/2] s390/topology: add drawer scheduling domain level
  2016-06-13 11:25     ` Heiko Carstens
@ 2016-06-13 11:33       ` Peter Zijlstra
  0 siblings, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2016-06-13 11:33 UTC (permalink / raw)
  To: Heiko Carstens; +Cc: Greg Kroah-Hartman, Martin Schwidefsky, linux-kernel

On Mon, Jun 13, 2016 at 01:25:53PM +0200, Heiko Carstens wrote:
> On Mon, Jun 13, 2016 at 01:06:21PM +0200, Peter Zijlstra wrote:
> > On Wed, Jun 08, 2016 at 11:09:16AM +0200, Heiko Carstens wrote:
> > > The z13 machine added a fourth level to the cpu topology
> > > information. The new top level is called drawer.
> > > 
> > > A drawer contains two books, which used to be the top level.
> > > 
> > > Adding this additional scheduling domain did show performance
> > > improvements for some workloads of up to 8%, while there don't
> > > seem to be any workloads impacted in a negative way.
> > 
> > Right; so no objection.
> > 
> > Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
> May I add your ACK also to the sysfs patch?

Not really my area, nor something I've ever looked hard at, but the
patch seems to have the right shape, so sure ;-)

* Re: [PATCH 2/2] s390/topology: add drawer scheduling domain level
  2016-06-13 11:22     ` Heiko Carstens
@ 2016-06-13 13:06       ` Peter Zijlstra
  2016-06-13 13:19         ` Martin Schwidefsky
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2016-06-13 13:06 UTC (permalink / raw)
  To: Heiko Carstens; +Cc: Greg Kroah-Hartman, Martin Schwidefsky, linux-kernel

On Mon, Jun 13, 2016 at 01:22:30PM +0200, Heiko Carstens wrote:
> Yes, and actually we are all virt/LPAR always, so this is unfortunately not
> very easy to do. And yes, I do agree that for the 1:1 case it most likely
> would make sense, however we don't have any run-time guarantee to stay 1:1.

One option would be to make it a boot option; such that the
administrator has to set it. At that point, if the admin creates
multiple LPARs it's on him.
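
For illustration, such a boot option could follow the same
early_param()/kstrtobool() pattern as early_parse_topology() in
arch/s390/kernel/topology.c (visible in the diff above); a hypothetical
sketch, with the option name and variable made up for illustration:

	static bool real_numa_enabled;

	/* e.g. "real_numa=1" on the kernel command line */
	static int __init early_parse_real_numa(char *p)
	{
		return kstrtobool(p, &real_numa_enabled);
	}
	early_param("real_numa", early_parse_real_numa);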

* Re: [PATCH 2/2] s390/topology: add drawer scheduling domain level
  2016-06-13 13:06       ` Peter Zijlstra
@ 2016-06-13 13:19         ` Martin Schwidefsky
  2016-06-13 13:53           ` Peter Zijlstra
  0 siblings, 1 reply; 11+ messages in thread
From: Martin Schwidefsky @ 2016-06-13 13:19 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Heiko Carstens, Greg Kroah-Hartman, linux-kernel

On Mon, 13 Jun 2016 15:06:47 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Jun 13, 2016 at 01:22:30PM +0200, Heiko Carstens wrote:
> > Yes, and actually we are all virt/LPAR always, so this is unfortunately not
> > very easy to do. And yes, I do agree that for the 1:1 case it most likely
> > would make sense, however we don't have any run-time guarantee to stay 1:1.
> 
> One option would be to make it a boot option; such that the
> administrator has to set it. At that point, if the admin creates
> multiple LPARs its on him.

Unfortunately not good enough. The LPAR code tries to optimize the layout
at the time a partition is activated. The landscape of already running
partitions can change at this point.

To get around this you would have to activate *all* partitions first and
then start the operating systems in a second step.

And then there is concurrent repair which will move things around if a
piece of memory goes bad. This happens rarely though.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

* Re: [PATCH 2/2] s390/topology: add drawer scheduling domain level
  2016-06-13 13:19         ` Martin Schwidefsky
@ 2016-06-13 13:53           ` Peter Zijlstra
  2016-06-13 14:37             ` Martin Schwidefsky
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2016-06-13 13:53 UTC (permalink / raw)
  To: Martin Schwidefsky; +Cc: Heiko Carstens, Greg Kroah-Hartman, linux-kernel

On Mon, Jun 13, 2016 at 03:19:42PM +0200, Martin Schwidefsky wrote:
> On Mon, 13 Jun 2016 15:06:47 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Mon, Jun 13, 2016 at 01:22:30PM +0200, Heiko Carstens wrote:
> > > Yes, and actually we are all virt/LPAR always, so this is unfortunately not
> > > very easy to do. And yes, I do agree that for the 1:1 case it most likely
> > > would make sense, however we don't have any run-time guarantee to stay 1:1.
> > 
> > One option would be to make it a boot option; such that the
> > administrator has to set it. At that point, if the admin creates
> > multiple LPARs it's on him.
> 
> Unfortunately not good enough. The LPAR code tries to optimize the layout
> at the time a partition is activated. The landscape of already running
> partitions can change at this point.

Would not the admin _know_ this? It would be him activating partitions
after all, no?

> To get around this you would have to activate *all* partitions first and
> then start the operating systems in a second step.

Arguably, you only care about the single partition covering the entire
machine case, so I don't see that being a problem.

Again, admin _knows_ this.

> And then there is concurrent repair which will move things around if a
> piece of memory goes bad. This happens rarely though.

That would be magic disturbance indeed, nothing much to do about that.

* Re: [PATCH 2/2] s390/topology: add drawer scheduling domain level
  2016-06-13 13:53           ` Peter Zijlstra
@ 2016-06-13 14:37             ` Martin Schwidefsky
  0 siblings, 0 replies; 11+ messages in thread
From: Martin Schwidefsky @ 2016-06-13 14:37 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Heiko Carstens, Greg Kroah-Hartman, linux-kernel

On Mon, 13 Jun 2016 15:53:02 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Jun 13, 2016 at 03:19:42PM +0200, Martin Schwidefsky wrote:
> > On Mon, 13 Jun 2016 15:06:47 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > On Mon, Jun 13, 2016 at 01:22:30PM +0200, Heiko Carstens wrote:
> > > > Yes, and actually we are all virt/LPAR always, so this is unfortunately not
> > > > very easy to do. And yes, I do agree that for the 1:1 case it most likely
> > > > would make sense, however we don't have any run-time guarantee to stay 1:1.
> > > 
> > > One option would be to make it a boot option; such that the
> > > administrator has to set it. At that point, if the admin creates
> > > multiple LPARs it's on him.
> > 
> > Unfortunately not good enough. The LPAR code tries to optimize the layout
> > at the time a partition is activated. The landscape of already running
> > partitions can change at this point.
> 
> Would not the admin _know_ this? It would be him activating partitions
> after all, no?

This is all fine and good in a static environment where you can afford to
stop all partitions to do a reconfiguration. There you could get away with
a kernel option that enables "real" NUMA.

But as a general solution this fails. Consider this scenario: you have several
partitions already running with a workload that you do *not* want to interrupt
right now, think stock exchange. And now another partition urgently needs more
memory. To do this you have to shut it down, deactivate it, update the profile
with more memory, re-activate it and restart the OS. End result: memory
landscape could have changed.

> > To get around this you would have to activate *all* partitions first and
> > then start the operating systems in a second step.
> 
> Arguably, you only care about the single partition covering the entire
> machine case, so I don't see that being a problem.
> 
> Again, admin _knows_ this.

The single partition case is boring; several large partitions too big for a
single node are the hard part.

> > And then there is concurrent repair which will move things around if a
> > piece of memory goes bad. This happens rarely though.
> 
> That would be magic disturbance indeed, nothing much to do about that.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.
