* [mm v2 0/3] Support memory cgroup hotplug
@ 2016-11-23  4:36 ` Balbir Singh
  0 siblings, 0 replies; 30+ messages in thread
From: Balbir Singh @ 2016-11-23  4:36 UTC (permalink / raw)
  To: linux-mm, linuxppc-dev
  Cc: Balbir Singh, Tejun Heo, Andrew Morton, Johannes Weiner,
	Michal Hocko, Vladimir Davydov

In the absence of memory hotplug support we use extra memory proportional
to (possible_nodes - online_nodes) * number_of_cgroups. PPC64 carries a
patch that avoids this large consumption with a large number of cgroups by
limiting the possible nodes to the online nodes. This series adds hotplug
support to memory cgroups and reverts the commit that limited possible
nodes to online nodes.
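
As a rough illustration (a sketch only, not code from this series; the
actual structure size depends on the kernel configuration), the static
overhead being targeted is:

	/*
	 * Illustrative only: per-node memcg state is allocated for every
	 * possible node today, even for nodes that will never come online.
	 */
	static size_t memcg_static_waste(unsigned long nr_cgroups)
	{
		return (num_possible_nodes() - num_online_nodes()) *
			sizeof(struct mem_cgroup_per_node) * nr_cgroups;
	}

On the ppc64 configuration discussed later in this thread, this added up
to roughly 200MB for 400 cgroup directories.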

Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org> 
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>

I've tested these patches in a VM with two nodes and movable
nodes enabled. I've offlined nodes and checked that the system
and cgroups with tasks deep in the hierarchy continue to work
fine.

These patches are on top of linux-next (20161117)

Changelog v2:
	Add get/put_online_mems() around node iteration
	Use MEM_OFFLINE/MEM_ONLINE instead of MEM_GOING_OFFLINE/ONLINE

Balbir Singh (3):
  mm: Add basic infrastructure for memcg hotplug support
  mm: Move operations to hotplug callbacks
  powerpc/mm: fix node_possible_map limitations

 arch/powerpc/mm/numa.c |   7 ----
 mm/memcontrol.c        | 107 +++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 94 insertions(+), 20 deletions(-)

-- 
2.5.5

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [mm v2 1/3] mm: Add basic infrastructure for memcg hotplug support
  2016-11-23  4:36 ` Balbir Singh
@ 2016-11-23  4:36   ` Balbir Singh
  -1 siblings, 0 replies; 30+ messages in thread
From: Balbir Singh @ 2016-11-23  4:36 UTC (permalink / raw)
  To: linux-mm, linuxppc-dev; +Cc: Balbir Singh

The lack of hotplug support makes us allocate all memory
upfront for per-node data structures, for all possible nodes.
With a large number of cgroups this can be an overhead. PPC64
actually limits n_possible nodes to n_online to avoid some of
this overhead.

This patch adds the basic notifiers to listen to hotplug
events and does the allocation and freeing of those structures
per cgroup. We walk every cgroup per event; it's a trade-off
of allocating upfront vs allocating on demand and freeing
on offline.
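
In sketch form (illustrative only -- the callbacks added by this patch
are empty stubs and are only filled in by the next patch), the online
handler ends up looking roughly like:

	static int memcg_node_online(int node)
	{
		struct mem_cgroup *memcg;

		/* walk every cgroup and allocate its per-node state on demand */
		for_each_mem_cgroup(memcg) {
			if (alloc_mem_cgroup_per_node_info(memcg, node))
				return -ENOMEM;
		}
		return 0;
	}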

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
---
 mm/memcontrol.c | 46 ++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 38 insertions(+), 8 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 175ec51..5482c7d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -63,6 +63,7 @@
 #include <linux/lockdep.h>
 #include <linux/file.h>
 #include <linux/tracehook.h>
+#include <linux/memory.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -4106,14 +4107,7 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 {
 	struct mem_cgroup_per_node *pn;
 	int tmp = node;
-	/*
-	 * This routine is called against possible nodes.
-	 * But it's BUG to call kmalloc() against offline node.
-	 *
-	 * TODO: this routine can waste much memory for nodes which will
-	 *       never be onlined. It's better to use memory hotplug callback
-	 *       function.
-	 */
+
 	if (!node_state(node, N_NORMAL_MEMORY))
 		tmp = -1;
 	pn = kzalloc_node(sizeof(*pn), GFP_KERNEL, tmp);
@@ -5764,6 +5758,41 @@ static int __init cgroup_memory(char *s)
 }
 __setup("cgroup.memory=", cgroup_memory);
 
+static void memcg_node_offline(int node)
+{
+}
+
+static void memcg_node_online(int node)
+{
+}
+
+static int memcg_memory_hotplug_callback(struct notifier_block *self,
+					unsigned long action, void *arg)
+{
+	struct memory_notify *marg = arg;
+	int node = marg->status_change_nid;
+
+	switch (action) {
+	case MEM_GOING_OFFLINE:
+	case MEM_CANCEL_OFFLINE:
+	case MEM_ONLINE:
+		break;
+	case MEM_GOING_ONLINE:
+		memcg_node_online(node);
+		break;
+	case MEM_CANCEL_ONLINE:
+	case MEM_OFFLINE:
+		memcg_node_offline(node);
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block memcg_memory_hotplug_nb __meminitdata = {
+	.notifier_call = memcg_memory_hotplug_callback,
+	.priority = IPC_CALLBACK_PRI,
+};
+
 /*
  * subsys_initcall() for memory controller.
  *
@@ -5789,6 +5818,7 @@ static int __init mem_cgroup_init(void)
 
 	cpuhp_setup_state_nocalls(CPUHP_MM_MEMCQ_DEAD, "mm/memctrl:dead", NULL,
 				  memcg_hotplug_cpu_dead);
+	register_hotmemory_notifier(&memcg_memory_hotplug_nb);
 
 	for_each_possible_cpu(cpu)
 		INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work,
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [mm v2 2/3] mm: Move operations to hotplug callbacks
  2016-11-23  4:36 ` Balbir Singh
@ 2016-11-23  4:36   ` Balbir Singh
  -1 siblings, 0 replies; 30+ messages in thread
From: Balbir Singh @ 2016-11-23  4:36 UTC (permalink / raw)
  To: linux-mm, linuxppc-dev
  Cc: Balbir Singh, Tejun Heo, Andrew Morton, Johannes Weiner,
	Michal Hocko, Vladimir Davydov

Move routines that do operations on all nodes to operate on
just the online nodes. Most of the changes are fairly obvious
(like the ones related to the per-node soft limit tree).

Implications of this patch:

1. get/put_online_mems() around for_each_online_node paths,
   as sketched below. These are not expected to be fast paths.
2. Memory allocation/free is on demand. On a system with a
   large number of cgroups we expect savings proportional to
   number of cgroups * size of the per-node structure(s).
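
The locking pattern in (1), shown in context in the diff below, is:

	get_online_mems();
	for_each_online_node(nid) {
		/* per-node work that must not race with memory hotplug */
	}
	put_online_mems();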

Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
---
 mm/memcontrol.c | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 75 insertions(+), 8 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5482c7d..cdfc3e8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -497,11 +497,13 @@ static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
 	struct mem_cgroup_per_node *mz;
 	int nid;
 
-	for_each_node(nid) {
+	get_online_mems();
+	for_each_online_node(nid) {
 		mz = mem_cgroup_nodeinfo(memcg, nid);
 		mctz = soft_limit_tree_node(nid);
 		mem_cgroup_remove_exceeded(mz, mctz);
 	}
+	put_online_mems();
 }
 
 static struct mem_cgroup_per_node *
@@ -895,7 +897,8 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
 	int i;
 
 	while ((memcg = parent_mem_cgroup(memcg))) {
-		for_each_node(nid) {
+		get_online_mems();
+		for_each_online_node(nid) {
 			mz = mem_cgroup_nodeinfo(memcg, nid);
 			for (i = 0; i <= DEF_PRIORITY; i++) {
 				iter = &mz->iter[i];
@@ -903,6 +906,7 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
 					dead_memcg, NULL);
 			}
 		}
+		put_online_mems();
 	}
 }
 
@@ -1343,6 +1347,10 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
 {
 	return 0;
 }
+
+static void mem_cgroup_may_update_nodemask(struct mem_cgroup *memcg)
+{
+}
 #endif
 
 static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
@@ -4133,8 +4141,10 @@ static void mem_cgroup_free(struct mem_cgroup *memcg)
 	int node;
 
 	memcg_wb_domain_exit(memcg);
-	for_each_node(node)
+	get_online_mems();
+	for_each_online_node(node)
 		free_mem_cgroup_per_node_info(memcg, node);
+	put_online_mems();
 	free_percpu(memcg->stat);
 	kfree(memcg);
 }
@@ -4162,9 +4172,11 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	if (!memcg->stat)
 		goto fail;
 
-	for_each_node(node)
+	get_online_mems();
+	for_each_online_node(node)
 		if (alloc_mem_cgroup_per_node_info(memcg, node))
 			goto fail;
+	put_online_mems();
 
 	if (memcg_wb_domain_init(memcg, GFP_KERNEL))
 		goto fail;
@@ -4187,6 +4199,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
 	return memcg;
 fail:
+	put_online_mems();
 	if (memcg->id.id > 0)
 		idr_remove(&mem_cgroup_idr, memcg->id.id);
 	mem_cgroup_free(memcg);
@@ -5760,10 +5773,61 @@ __setup("cgroup.memory=", cgroup_memory);
 
 static void memcg_node_offline(int node)
 {
+	struct mem_cgroup *memcg;
+	struct mem_cgroup_tree_per_node *rtpn;
+	struct mem_cgroup_per_node *mz;
+
+	if (node < 0)
+		return;
+
+	rtpn = soft_limit_tree_node(node);
+
+	for_each_mem_cgroup(memcg) {
+		mz = mem_cgroup_nodeinfo(memcg, node);
+		/* mz can be NULL if node_online failed */
+		if (mz)
+			mem_cgroup_remove_exceeded(mz, rtpn);
+
+		free_mem_cgroup_per_node_info(memcg, node);
+		mem_cgroup_may_update_nodemask(memcg);
+	}
+
+	kfree(rtpn);
+
 }
 
-static void memcg_node_online(int node)
+static int memcg_node_online(int node)
 {
+	struct mem_cgroup *memcg;
+	struct mem_cgroup_tree_per_node *rtpn;
+	struct mem_cgroup_per_node *mz;
+
+	if (node < 0)
+		return 0;
+
+	rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, node);
+
+	rtpn->rb_root = RB_ROOT;
+	spin_lock_init(&rtpn->lock);
+	soft_limit_tree.rb_tree_per_node[node] = rtpn;
+
+	for_each_mem_cgroup(memcg) {
+		if (alloc_mem_cgroup_per_node_info(memcg, node))
+			goto fail;
+		mem_cgroup_may_update_nodemask(memcg);
+	}
+	return 0;
+fail:
+	/*
+	 * We don't want mz in node_offline to trip when
+	 * allocation fails and CANCEL_ONLINE gets called
+	 */
+	for_each_mem_cgroup(memcg) {
+		mz = mem_cgroup_nodeinfo(memcg, node);
+		free_mem_cgroup_per_node_info(memcg, node);
+		mz = NULL;
+	}
+	return -ENOMEM;
 }
 
 static int memcg_memory_hotplug_callback(struct notifier_block *self,
@@ -5773,12 +5837,13 @@ static int memcg_memory_hotplug_callback(struct notifier_block *self,
 	int node = marg->status_change_nid;
 
 	switch (action) {
-	case MEM_GOING_OFFLINE:
 	case MEM_CANCEL_OFFLINE:
+	case MEM_GOING_OFFLINE:
 	case MEM_ONLINE:
 		break;
 	case MEM_GOING_ONLINE:
-		memcg_node_online(node);
+		if (memcg_node_online(node))
+			return NOTIFY_BAD;
 		break;
 	case MEM_CANCEL_ONLINE:
 	case MEM_OFFLINE:
@@ -5824,7 +5889,8 @@ static int __init mem_cgroup_init(void)
 		INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work,
 			  drain_local_stock);
 
-	for_each_node(node) {
+	get_online_mems();
+	for_each_online_node(node) {
 		struct mem_cgroup_tree_per_node *rtpn;
 
 		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL,
@@ -5834,6 +5900,7 @@ static int __init mem_cgroup_init(void)
 		spin_lock_init(&rtpn->lock);
 		soft_limit_tree.rb_tree_per_node[node] = rtpn;
 	}
+	put_online_mems();
 
 	return 0;
 }
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [mm v2 3/3] powerpc/mm: fix node_possible_map limitations
  2016-11-23  4:36 ` Balbir Singh
@ 2016-11-23  4:36   ` Balbir Singh
  -1 siblings, 0 replies; 30+ messages in thread
From: Balbir Singh @ 2016-11-23  4:36 UTC (permalink / raw)
  To: linux-mm, linuxppc-dev
  Cc: Balbir Singh, Tejun Heo, Andrew Morton, Johannes Weiner,
	Michal Hocko, Vladimir Davydov

We've fixed the memory hotplug issue with memcg, hence
this workaround is no longer required.

Reverts: commit 3af229f2071f
("powerpc/numa: Reset node_possible_map to only node_online_map")

Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/powerpc/mm/numa.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index a51c188..ca8c2ab 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -916,13 +916,6 @@ void __init initmem_init(void)
 
 	memblock_dump_all();
 
-	/*
-	 * Reduce the possible NUMA nodes to the online NUMA nodes,
-	 * since we do not support node hotplug. This ensures that  we
-	 * lower the maximum NUMA node ID to what is actually present.
-	 */
-	nodes_and(node_possible_map, node_possible_map, node_online_map);
-
 	for_each_online_node(nid) {
 		unsigned long start_pfn, end_pfn;
 
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [mm v2 0/3] Support memory cgroup hotplug
  2016-11-23  4:36 ` Balbir Singh
@ 2016-11-23  7:25   ` Michal Hocko
  -1 siblings, 0 replies; 30+ messages in thread
From: Michal Hocko @ 2016-11-23  7:25 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, linuxppc-dev, Tejun Heo, Andrew Morton,
	Johannes Weiner, Vladimir Davydov

On Wed 23-11-16 15:36:51, Balbir Singh wrote:
> In the absence of hotplug we use extra memory proportional to
> (possible_nodes - online_nodes) * number_of_cgroups. PPC64 has a patch
> to disable large consumption with large number of cgroups. This patch
> adds hotplug support to memory cgroups and reverts the commit that
> limited possible nodes to online nodes.

Balbir,
I have asked this in the previous version but there still seems to be a
lack of information about _why_ we want this, _how_ much we save on
the memory overhead on most systems and _why_ the additional complexity
is really worth it. Please make sure to add all of this to the cover
letter.

I still haven't gotten around to looking into those patches because I am
swamped with other things, but to be honest I do not really see a strong
justification to make this a high priority for me.

> Cc: Tejun Heo <tj@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org> 
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> 
> I've tested this patches under a VM with two nodes and movable
> nodes enabled. I've offlined nodes and checked that the system
> and cgroups with tasks deep in the hierarchy continue to work
> fine.
> 
> These patches are on top of linux-next (20161117)
> 
> Changelog v2:
> 	Add get/put_online_mems() around node iteration
> 	Use MEM_OFFLINE/MEM_ONLINE instead of MEM_GOING_OFFLINE/ONLINE
> 
> Balbir Singh (3):
>   mm: Add basic infrastructure for memcg hotplug support
>   mm: Move operations to hotplug callbacks
>   powerpc/mm: fix node_possible_map limitations
> 
>  arch/powerpc/mm/numa.c |   7 ----
>  mm/memcontrol.c        | 107 +++++++++++++++++++++++++++++++++++++++++++------
>  2 files changed, 94 insertions(+), 20 deletions(-)
> 
> -- 
> 2.5.5

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [mm v2 0/3] Support memory cgroup hotplug
  2016-11-23  7:25   ` Michal Hocko
@ 2016-11-23  7:50     ` Balbir Singh
  -1 siblings, 0 replies; 30+ messages in thread
From: Balbir Singh @ 2016-11-23  7:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linuxppc-dev, Tejun Heo, Andrew Morton,
	Johannes Weiner, Vladimir Davydov



On 23/11/16 18:25, Michal Hocko wrote:
> On Wed 23-11-16 15:36:51, Balbir Singh wrote:
>> In the absence of hotplug we use extra memory proportional to
>> (possible_nodes - online_nodes) * number_of_cgroups. PPC64 has a patch
>> to disable large consumption with large number of cgroups. This patch
>> adds hotplug support to memory cgroups and reverts the commit that
>> limited possible nodes to online nodes.
> 
> Balbir,
> I have asked this in the previous version but there still seems to be a
> lack of information of _why_ do we want this, _how_ much do we save on
> the memory overhead on most systems and _why_ the additional complexity
> is really worth it. Please make sure to add all this in the cover
> letter.
> 

The data is in the patch referred to in patch 3. The order of waste was
200MB for 400 cgroup directories, enough for us to restrict possible_map
to online_map. These patches allow us to have a larger possible map and
to online nodes that are not in the online_map, which is currently a
restriction on ppc64.

A typical system that I use has about 100-150 directories, depending on the
number of users/docker instances/configuration/virtual machines. These numbers
will only grow as we pack more of these instances onto them.

From a complexity viewpoint, the patches are quite straightforward.

> I still didn't get to look into those patches because I am swamped with
> other things but to be honest I do not really see a strong justification
> to make it high priority for me.
> 

I am OK if you need more time to review them, but I've been pushing them
to fix the cases I've mentioned above.

Balbir Singh.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [mm v2 0/3] Support memory cgroup hotplug
  2016-11-23  7:50     ` Balbir Singh
@ 2016-11-23  8:07       ` Michal Hocko
  -1 siblings, 0 replies; 30+ messages in thread
From: Michal Hocko @ 2016-11-23  8:07 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, linuxppc-dev, Tejun Heo, Andrew Morton,
	Johannes Weiner, Vladimir Davydov

On Wed 23-11-16 18:50:42, Balbir Singh wrote:
> 
> 
> On 23/11/16 18:25, Michal Hocko wrote:
> > On Wed 23-11-16 15:36:51, Balbir Singh wrote:
> >> In the absence of hotplug we use extra memory proportional to
> >> (possible_nodes - online_nodes) * number_of_cgroups. PPC64 has a patch
> >> to disable large consumption with large number of cgroups. This patch
> >> adds hotplug support to memory cgroups and reverts the commit that
> >> limited possible nodes to online nodes.
> > 
> > Balbir,
> > I have asked this in the previous version but there still seems to be a
> > lack of information of _why_ do we want this, _how_ much do we save on
> > the memory overhead on most systems and _why_ the additional complexity
> > is really worth it. Please make sure to add all this in the cover
> > letter.
> > 
> 
> The data is in the patch referred to in patch 3. The order of waste was
> 200MB for 400 cgroup directories enough for us to restrict possible_map
> to online_map. These patches allow us to have a larger possible map and
> allow onlining nodes not in the online_map, which is currently a restriction
> on ppc64.

How common is it to have possible_map >> online_map? If this is ppc64-specific
then what is the downside of keeping the current restriction instead?

> A typical system that I use has about 100-150 directories, depending on the
> number of users/docker instances/configuration/virtual machines. These numbers
> will only grow as we pack more of these instances on them.
> 
> From a complexity view point, the patches are quite straight forward.

Well, I would like to hear more about that. {get,put}_online_memory
at random places doesn't sound all that straightforward to me.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [mm v2 0/3] Support memory cgroup hotplug
  2016-11-23  8:07       ` Michal Hocko
@ 2016-11-23  8:37         ` Balbir Singh
  -1 siblings, 0 replies; 30+ messages in thread
From: Balbir Singh @ 2016-11-23  8:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linuxppc-dev, Tejun Heo, Andrew Morton,
	Johannes Weiner, Vladimir Davydov



On 23/11/16 19:07, Michal Hocko wrote:
> On Wed 23-11-16 18:50:42, Balbir Singh wrote:
>>
>>
>> On 23/11/16 18:25, Michal Hocko wrote:
>>> On Wed 23-11-16 15:36:51, Balbir Singh wrote:
>>>> In the absence of hotplug we use extra memory proportional to
>>>> (possible_nodes - online_nodes) * number_of_cgroups. PPC64 has a patch
>>>> to disable large consumption with large number of cgroups. This patch
>>>> adds hotplug support to memory cgroups and reverts the commit that
>>>> limited possible nodes to online nodes.
>>>
>>> Balbir,
>>> I have asked this in the previous version but there still seems to be a
>>> lack of information of _why_ do we want this, _how_ much do we save on
>>> the memory overhead on most systems and _why_ the additional complexity
>>> is really worth it. Please make sure to add all this in the cover
>>> letter.
>>>
>>
>> The data is in the patch referred to in patch 3. The order of waste was
>> 200MB for 400 cgroup directories enough for us to restrict possible_map
>> to online_map. These patches allow us to have a larger possible map and
>> allow onlining nodes not in the online_map, which is currently a restriction
>> on ppc64.
> 
> How common is to have possible_map >> online_map? If this is ppc64 then
> what is the downside of keeping the current restriction instead?
> 

On my system CONFIG_NODE_SHIFT is 8 (256 nodes) while the possible nodes are 2.
The downside is that we lose the ability to hotplug and online an offline node.
Please see http://www.spinics.net/lists/linux-mm/msg116724.html

>> A typical system that I use has about 100-150 directories, depending on the
>> number of users/docker instances/configuration/virtual machines. These numbers
>> will only grow as we pack more of these instances on them.
>>
>> From a complexity view point, the patches are quite straight forward.
> 
> Well, I would like to hear more about that. {get,put}_online_memory
> at random places doesn't sound all that straightforward to me.
> 

I thought those places were not random :) I tried to think them out as
discussed with Vladimir. I don't claim the code is bug free, we can fix
any bugs as we test this more.

Balbir Singh.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [mm v2 0/3] Support memory cgroup hotplug
  2016-11-23  8:37         ` Balbir Singh
@ 2016-11-23  9:28           ` Michal Hocko
  -1 siblings, 0 replies; 30+ messages in thread
From: Michal Hocko @ 2016-11-23  9:28 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, linuxppc-dev, Tejun Heo, Andrew Morton,
	Johannes Weiner, Vladimir Davydov

On Wed 23-11-16 19:37:16, Balbir Singh wrote:
> 
> 
> On 23/11/16 19:07, Michal Hocko wrote:
> > On Wed 23-11-16 18:50:42, Balbir Singh wrote:
> >>
> >>
> >> On 23/11/16 18:25, Michal Hocko wrote:
> >>> On Wed 23-11-16 15:36:51, Balbir Singh wrote:
> >>>> In the absence of hotplug we use extra memory proportional to
> >>>> (possible_nodes - online_nodes) * number_of_cgroups. PPC64 has a patch
> >>>> to disable large consumption with large number of cgroups. This patch
> >>>> adds hotplug support to memory cgroups and reverts the commit that
> >>>> limited possible nodes to online nodes.
> >>>
> >>> Balbir,
> >>> I have asked this in the previous version but there still seems to be a
> >>> lack of information of _why_ do we want this, _how_ much do we save on
> >>> the memory overhead on most systems and _why_ the additional complexity
> >>> is really worth it. Please make sure to add all this in the cover
> >>> letter.
> >>>
> >>
> >> The data is in the patch referred to in patch 3. The order of waste was
> >> 200MB for 400 cgroup directories enough for us to restrict possible_map
> >> to online_map. These patches allow us to have a larger possible map and
> >> allow onlining nodes not in the online_map, which is currently a restriction
> >> on ppc64.
> > 
> > How common is to have possible_map >> online_map? If this is ppc64 then
> > what is the downside of keeping the current restriction instead?
> > 
> 
> On my system CONFIG_NODE_SHIFT is 8, 256 nodes and possible_nodes are 2
> The downside is the ability to hotplug and online an offline node.
> Please see http://www.spinics.net/lists/linux-mm/msg116724.html

OK, so we are slowly getting to what I've asked originally ;) So who
cares? Depending on CONFIG_NODE_SHIFT (which tends to be quite large in
distribution or other general purpose kernels) the overhead is 424B (as
per pahole on the current kernel) for one NUMA node. Most machines are
expected to have 1-4 NUMA nodes, so the overhead might be somewhere around
100K per memcg (with 256 possible nodes). Not a trivial amount for sure,
but I would rather encourage people to lower the possible node count for
their hardware if it is artificially large.

> >> A typical system that I use has about 100-150 directories, depending on the
> >> number of users/docker instances/configuration/virtual machines. These numbers
> >> will only grow as we pack more of these instances on them.
> >>
> >> From a complexity view point, the patches are quite straight forward.
> > 
> > Well, I would like to hear more about that. {get,put}_online_memory
> > at random places doesn't sound all that straightforward to me.
> > 
> 
> I thought those places were not random :) I tried to think them out as
> discussed with Vladimir. I don't claim the code is bug free, we can fix
> any bugs as we test this more.

I am more worried about synchronization with the hotplug, which tends to
be a PITA in places where we were simply safe by definition until now. We
do not have all that many users of memcg->nodeinfo[nid] from what I can
see, but are all of them safe to never race with the hotplug? A lack of a
high-level design description is less than encouraging. So please try to
spend some time describing how we use nodeinfo currently, how the
synchronization with the hotplug is supposed to work, and what
guarantees that no stale nodeinfos can ever be used. This is just too
easy to get wrong...
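
For example (a hypothetical interleaving, only to illustrate the kind of
window I mean):

	/* reader side, e.g. soft limit / reclaim */
	struct mem_cgroup_per_node *mz = mem_cgroup_nodeinfo(memcg, nid);

	/*
	 * If the node is offlined at this point, memcg_node_offline()
	 * frees memcg->nodeinfo[nid] behind our back and every later
	 * dereference of mz is a use-after-free.
	 */
	mem_cgroup_remove_exceeded(mz, soft_limit_tree_node(nid));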
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [mm v2 0/3] Support memory cgroup hotplug
  2016-11-23  9:28           ` Michal Hocko
@ 2016-11-23 13:05             ` Balbir Singh
  -1 siblings, 0 replies; 30+ messages in thread
From: Balbir Singh @ 2016-11-23 13:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linuxppc-dev, Tejun Heo, Andrew Morton,
	Johannes Weiner, Vladimir Davydov



On 23/11/16 20:28, Michal Hocko wrote:
> On Wed 23-11-16 19:37:16, Balbir Singh wrote:
>>
>>
>> On 23/11/16 19:07, Michal Hocko wrote:
>>> On Wed 23-11-16 18:50:42, Balbir Singh wrote:
>>>>
>>>>
>>>> On 23/11/16 18:25, Michal Hocko wrote:
>>>>> On Wed 23-11-16 15:36:51, Balbir Singh wrote:
>>>>>> In the absence of hotplug we use extra memory proportional to
>>>>>> (possible_nodes - online_nodes) * number_of_cgroups. PPC64 has a patch
>>>>>> to disable large consumption with large number of cgroups. This patch
>>>>>> adds hotplug support to memory cgroups and reverts the commit that
>>>>>> limited possible nodes to online nodes.
>>>>>
>>>>> Balbir,
>>>>> I have asked this in the previous version but there still seems to be a
>>>>> lack of information of _why_ do we want this, _how_ much do we save on
>>>>> the memory overhead on most systems and _why_ the additional complexity
>>>>> is really worth it. Please make sure to add all this in the cover
>>>>> letter.
>>>>>
>>>>
>>>> The data is in the patch referred to in patch 3. The order of waste was
>>>> 200MB for 400 cgroup directories enough for us to restrict possible_map
>>>> to online_map. These patches allow us to have a larger possible map and
>>>> allow onlining nodes not in the online_map, which is currently a restriction
>>>> on ppc64.
>>>
>>> How common is to have possible_map >> online_map? If this is ppc64 then
>>> what is the downside of keeping the current restriction instead?
>>>
>>
>> On my system CONFIG_NODE_SHIFT is 8, 256 nodes and possible_nodes are 2
>> The downside is the ability to hotplug and online an offline node.
>> Please see http://www.spinics.net/lists/linux-mm/msg116724.html
> 
> OK, so we are slowly getting to what I've asked originally ;) So who
> cares? Depending on CONFIG_NODE_SHIFT (which tends to be quite large in
> distribution or other general purpose kernels) the overhead is 424B (as
> per pahole on the current kernel) for one numa node. Most machines are
> to be expected 1-4 numa nodes so the overhead might be somewhere around
> 100K per memcg (with 256 possible nodes). Not trivial amount for sure
> but I would rather encourage people to lower the possible node count for
> their hardware if it is artificially large.
> 

On my desktop NODES_SHIFT is 6, and many distro kernels have it at 9. I've
known of solutions that use fake NUMA for partitioning and need as many
nodes as possible.

>>>> A typical system that I use has about 100-150 directories, depending on the
>>>> number of users/docker instances/configuration/virtual machines. These numbers
>>>> will only grow as we pack more of these instances on them.
>>>>
>>>> From a complexity view point, the patches are quite straight forward.
>>>
>>> Well, I would like to hear more about that. {get,put}_online_memory
>>> at random places doesn't sound all that straightforward to me.
>>>
>>
>> I thought those places were not random :) I tried to think them out as
>> discussed with Vladimir. I don't claim the code is bug free, we can fix
>> any bugs as we test this more.
> 
> I am more worried about synchronization with the hotplug which tends to
> be a PITA in places were we were simply safe by definition until now. We
> do not have all that many users of memcg->nodeinfo[nid] from what I can see
> but are all of them safe to never race with the hotplug. A lack of
> highlevel design description is less than encouraging.

As in an explanation? The design is dictated by the notifier and the
actions to take when a node comes online/offline.

> So please try to
> spend some time describing how do we use nodeinfo currently and how is
> the synchronization with the hotplug supposed to work and what
> guarantees that no stale nodinfos can be ever used. This is just too
> easy to get wrong...
> 

OK.. I'll add that in the next cover letter

Balbir

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [mm v2 0/3] Support memory cgroup hotplug
  2016-11-23 13:05             ` Balbir Singh
@ 2016-11-23 13:22               ` Michal Hocko
  -1 siblings, 0 replies; 30+ messages in thread
From: Michal Hocko @ 2016-11-23 13:22 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, linuxppc-dev, Tejun Heo, Andrew Morton,
	Johannes Weiner, Vladimir Davydov

On Thu 24-11-16 00:05:12, Balbir Singh wrote:
> 
> 
> On 23/11/16 20:28, Michal Hocko wrote:
[...]
> > I am more worried about synchronization with the hotplug, which tends to
> > be a PITA in places where we were simply safe by definition until now. We
> > do not have all that many users of memcg->nodeinfo[nid] from what I can see,
> > but are all of them guaranteed never to race with hotplug? The lack of a
> > high-level design description is less than encouraging.
> 
> As in an explanation? The design is dictated by the notifier and the actions
> to take when a node comes online or goes offline.

Sure, but how are all the users of lruvec (for example), which is stored in
the nodeinfo AFAIR, supposed to synchronize with the notifier? Really, if
you are doing something dynamic, then the first thing to explain is the
synchronization. There might be really good reasons why we do not have to
care about explicit synchronization for most code paths, but my past
experience with many subtle hotplug-related bugs just makes me a bit
suspicious. So in other words, please make sure to document as much as
possible. This will make the review so much easier.
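
To ground the question, here is the kind of guard being asked about
(get/put_online_mems() around the node walk), sketched with a made-up walker;
the NULL check stands in for however the series handles never-onlined nodes:

/* Sketch: exclude memory hotplug while walking per-node memcg state. */
#include <linux/memcontrol.h>
#include <linux/memory_hotplug.h>
#include <linux/nodemask.h>

static void memcg_walk_nodeinfo(struct mem_cgroup *memcg)
{
    int nid;

    get_online_mems();      /* blocks concurrent node online/offline */
    for_each_node(nid) {
        struct mem_cgroup_per_node *pn = memcg->nodeinfo[nid];

        if (!pn)            /* node never onlined, nothing allocated */
            continue;
        /* ... operate on pn (lruvec, counters, ...) ... */
    }
    put_online_mems();
}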

> > So please try to
> > spend some time describing how we use nodeinfo currently, how the
> > synchronization with hotplug is supposed to work, and what guarantees
> > that no stale nodeinfos can ever be used. This is just too easy to get
> > wrong...
> > 
> 
> OK, I'll add that in the next cover letter.

Thanks!

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [mm v2 0/3] Support memory cgroup hotplug
  2016-11-23 13:05             ` Balbir Singh
@ 2016-11-28 21:10               ` Tejun Heo
  -1 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2016-11-28 21:10 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Michal Hocko, linux-mm, linuxppc-dev, Andrew Morton,
	Johannes Weiner, Vladimir Davydov

On Thu, Nov 24, 2016 at 12:05:12AM +1100, Balbir Singh wrote:
> On my desktop NODES_SHIFT is 6, and many distro kernels have it at 9. I've known
> of solutions that use fake NUMA for partitioning and need as many nodes as
> possible.

It was a crude kludge that people used before memcg.  If people still
use it, that's fine but we don't want to optimize / make code
complicated for it, so let's please put away this part of
justification.

It's understandable that some kernels want to have a large NODES_SHIFT
to support a wide range of configurations, but if that makes the wastage
too high, the simpler solution is updating the users to use the runtime
detected possible number / mask instead of the compile time
NODES_SHIFT.  Note that we do exactly the same thing for per-cpu
things - we configure a high max but do all operations on what's
possible on the system.

NUMA code already has possible detection.  Why not simply make memcg
use those instead of MAX_NUMNODES like how we use nr_cpu_ids instead
of NR_CPUS?
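
In code terms, the suggestion amounts to bounding loops by the runtime masks
rather than the compile-time constants, in the same way the cpu code does; a
trivial sketch, with the loop bodies elided:

/* Sketch: iterate what is possible at runtime, not the Kconfig maximum. */
#include <linux/cpumask.h>
#include <linux/nodemask.h>

static void walk_possible_resources(void)
{
    int cpu, nid;

    /* CPUs: bounded by cpu_possible_mask / nr_cpu_ids, not NR_CPUS. */
    for_each_possible_cpu(cpu) {
        /* ... per-cpu setup ... */
    }

    /* Nodes: bounded by node_possible_map, not MAX_NUMNODES. */
    for_each_node(nid) {
        /* ... per-node setup ... */
    }
}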

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [mm v2 0/3] Support memory cgroup hotplug
  2016-11-28 21:10               ` Tejun Heo
@ 2016-11-29  0:09                 ` Balbir Singh
  -1 siblings, 0 replies; 30+ messages in thread
From: Balbir Singh @ 2016-11-29  0:09 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, linux-mm, linuxppc-dev, Andrew Morton,
	Johannes Weiner, Vladimir Davydov



On 29/11/16 08:10, Tejun Heo wrote:
> On Thu, Nov 24, 2016 at 12:05:12AM +1100, Balbir Singh wrote:
>> On my desktop NODES_SHIFT is 6, and many distro kernels have it at 9. I've known
>> of solutions that use fake NUMA for partitioning and need as many nodes as
>> possible.
> 
> It was a crude kludge that people used before memcg.  If people still
> use it, that's fine but we don't want to optimize / make code
> complicated for it, so let's please put away this part of
> justification.

Are you suggesting those use cases can be ignored now?

> 
> > It's understandable that some kernels want to have a large NODES_SHIFT
> > to support a wide range of configurations, but if that makes the wastage
> > too high, the simpler solution is updating the users to use the runtime
> > detected possible number / mask instead of the compile time
> > NODES_SHIFT.  Note that we do exactly the same thing for per-cpu
> > things - we configure a high max but do all operations on what's
> > possible on the system.
> 
> NUMA code already has possible detection.  Why not simply make memcg
> use those instead of MAX_NUMNODES like how we use nr_cpu_ids instead
> of NR_CPUS?
> 

node_possible_map is set to node_online_map at the moment for ppc64,
which becomes a problem when hotplugging a node that was not already
online.

I am not sure what you mean by possible detection. node_possible_map
is set based on CONFIG_NODES_SHIFT and then can be adjusted by the
architecture (if desired). Are you suggesting the firmware populate
it?
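
For context, the ppc64 restriction being talked about boils down to something
like the following (reconstructed from the description here, not quoted from
arch/powerpc/mm/numa.c):

/* Sketch: clamp possible nodes to those online at boot. */
#include <linux/init.h>
#include <linux/nodemask.h>

static void __init clamp_possible_nodes(void)
{
    /*
     * After this, a node that was not online at boot can never be
     * hotplugged, which is the limitation the series removes.
     */
    nodes_and(node_possible_map, node_possible_map, node_online_map);
}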

Thanks,
Balbir Singh


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [mm v2 0/3] Support memory cgroup hotplug
  2016-11-29  0:09                 ` Balbir Singh
@ 2016-11-29  0:42                   ` Tejun Heo
  -1 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2016-11-29  0:42 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Michal Hocko, linux-mm, linuxppc-dev, Andrew Morton,
	Johannes Weiner, Vladimir Davydov

Hello, Balbir.

On Tue, Nov 29, 2016 at 11:09:26AM +1100, Balbir Singh wrote:
> On 29/11/16 08:10, Tejun Heo wrote:
> > On Thu, Nov 24, 2016 at 12:05:12AM +1100, Balbir Singh wrote:
> >> On my desktop NODES_SHIFT is 6, and many distro kernels have it at 9. I've known
> >> of solutions that use fake NUMA for partitioning and need as many nodes as
> >> possible.
> > 
> > It was a crude kludge that people used before memcg.  If people still
> > use it, that's fine but we don't want to optimize / make code
> > complicated for it, so let's please put away this part of
> > justification.
> 
> Are you suggesting those use cases can be ignored now?

Don't do that.  When did I say that?  What I said is that it isn't a
good idea to optimize and complicate the code base for it at this
point.  It shouldn't be a controversial argument given fake numa's
inherent issues and general lack of popularity.

Besides, does node hotplug even apply to fake numa?  ISTR it being
configured statically on the boot prompt.

> > NUMA code already has possible detection.  Why not simply make memcg
> > use those instead of MAX_NUMNODES like how we use nr_cpu_ids instead
> > of NR_CPUS?
> 
> node_possible_map is set to node_online_map at the moment for ppc64,
> which becomes a problem when hotplugging a node that was not already
> online.
> 
> I am not sure what you mean by possible detection. node_possible_map
> is set based on CONFIG_NODES_SHIFT and then can be adjusted by the
> architecture (if desired). Are you suggesting the firmware populate
> it?

That's what we do with cpus.  The kernel is built with a high maximum
limit, and the kernel queries the firmware during boot to determine how
many are actually possible on the system, which in most cases isn't
too far from what's already on the system.  I don't see why we would
take a different approach with NUMA nodes.
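
A sketch of what that boot-time step could look like on the node side,
assuming a hypothetical firmware query (firmware_max_node_count() is made up
for illustration), mirroring how boot code marks possible CPUs:

/* Sketch: derive node_possible_map from firmware early in boot. */
#include <linux/init.h>
#include <linux/nodemask.h>

int firmware_max_node_count(void);      /* hypothetical firmware query */

static void __init setup_possible_nodes(void)
{
    int nid, max = firmware_max_node_count();

    nodes_clear(node_possible_map);
    for (nid = 0; nid < max; nid++)
        node_set(nid, node_possible_map);
    /* node_online_map still only contains nodes with memory right now. */
}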

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [mm v2 0/3] Support memory cgroup hotplug
  2016-11-29  0:42                   ` Tejun Heo
@ 2016-11-29  4:57                     ` Balbir Singh
  -1 siblings, 0 replies; 30+ messages in thread
From: Balbir Singh @ 2016-11-29  4:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, linux-mm, linuxppc-dev, Andrew Morton,
	Johannes Weiner, Vladimir Davydov



On 29/11/16 11:42, Tejun Heo wrote:
> Hello, Balbir.
> 
> On Tue, Nov 29, 2016 at 11:09:26AM +1100, Balbir Singh wrote:
>> On 29/11/16 08:10, Tejun Heo wrote:
>>> On Thu, Nov 24, 2016 at 12:05:12AM +1100, Balbir Singh wrote:
>>>> On my desktop NODES_SHIFT is 6, and many distro kernels have it at 9. I've known
>>>> of solutions that use fake NUMA for partitioning and need as many nodes as
>>>> possible.
>>>
>>> It was a crude kludge that people used before memcg.  If people still
>>> use it, that's fine but we don't want to optimize / make code
>>> complicated for it, so let's please put away this part of
>>> justification.
>>
>> Are you suggesting those use cases can be ignored now?
> 
> Don't do that.  When did I say that?  What I said is that it isn't a
> good idea to optimize and complicate the code base for it at this
> point.  It shouldn't be a controversial argument given fake numa's
> inherent issues and general lack of popularity.
> 
> Besides, does node hotplug even apply to fake numa?  ISTR it being
> configured statically on the boot prompt.

Good point, not sure if it worked with that setup as well.

> 
>>> NUMA code already has possible detection.  Why not simply make memcg
>>> use those instead of MAX_NUMNODES like how we use nr_cpu_ids instead
>>> of NR_CPUS?
>>
>> node_possible_map is set to node_online_map at the moment for ppc64,
>> which becomes a problem when hotplugging a node that was not already
>> online.
>>
>> I am not sure what you mean by possible detection. node_possible_map
>> is set based on CONFIG_NODES_SHIFT and then can be adjusted by the
>> architecture (if desired). Are you suggesting the firmware populate
>> it?
> 
> That's what we do with cpus.  The kernel is built with a high maximum
> limit, and the kernel queries the firmware during boot to determine how
> many are actually possible on the system, which in most cases isn't
> too far from what's already on the system.  I don't see why we would
> take a different approach with NUMA nodes.

Agreed for the short term. I think memory hotplug needs a review, and we'll
probably get it fixed as we make progress along the way.

Balbir


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2016-11-29  4:58 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-23  4:36 [mm v2 0/3] Support memory cgroup hotplug Balbir Singh
2016-11-23  4:36 ` Balbir Singh
2016-11-23  4:36 ` [mm v2 1/3] mm: Add basic infrastructure for memcg hotplug support Balbir Singh
2016-11-23  4:36   ` Balbir Singh
2016-11-23  4:36 ` [mm v2 2/3] mm: Move operations to hotplug callbacks Balbir Singh
2016-11-23  4:36   ` Balbir Singh
2016-11-23  4:36 ` [mm v2 3/3] powerpc/mm: fix node_possible_map limitations Balbir Singh
2016-11-23  4:36   ` Balbir Singh
2016-11-23  7:25 ` [mm v2 0/3] Support memory cgroup hotplug Michal Hocko
2016-11-23  7:25   ` Michal Hocko
2016-11-23  7:50   ` Balbir Singh
2016-11-23  7:50     ` Balbir Singh
2016-11-23  8:07     ` Michal Hocko
2016-11-23  8:07       ` Michal Hocko
2016-11-23  8:37       ` Balbir Singh
2016-11-23  8:37         ` Balbir Singh
2016-11-23  9:28         ` Michal Hocko
2016-11-23  9:28           ` Michal Hocko
2016-11-23 13:05           ` Balbir Singh
2016-11-23 13:05             ` Balbir Singh
2016-11-23 13:22             ` Michal Hocko
2016-11-23 13:22               ` Michal Hocko
2016-11-28 21:10             ` Tejun Heo
2016-11-28 21:10               ` Tejun Heo
2016-11-29  0:09               ` Balbir Singh
2016-11-29  0:09                 ` Balbir Singh
2016-11-29  0:42                 ` Tejun Heo
2016-11-29  0:42                   ` Tejun Heo
2016-11-29  4:57                   ` Balbir Singh
2016-11-29  4:57                     ` Balbir Singh
