linux-mm.kvack.org archive mirror
* [RESEND][v1 0/3] Support memory cgroup hotplug
@ 2016-11-15 23:44 Balbir Singh
  2016-11-15 23:44 ` [RESEND] [PATCH v1 1/3] Add basic infrastructure for memcg hotplug support Balbir Singh
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Balbir Singh @ 2016-11-15 23:44 UTC (permalink / raw)
  To: mpe, hannes, mhocko, vdavydov.dev
  Cc: linuxppc-dev, linux-mm, Balbir Singh, Tejun Heo, Andrew Morton

In the absence of hotplug support we use extra memory proportional to
(possible_nodes - online_nodes) * number_of_cgroups. PPC64 carries a
patch that limits possible nodes to online nodes to avoid this large
consumption with many cgroups. This series adds hotplug support to
memory cgroups and reverts that commit.
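
For a rough sense of scale (an illustrative estimate, not a
measurement): with 256 possible nodes but only 2 online, every cgroup
carries per-node state for 254 nodes that may never be used. If each
per-node structure is on the order of a kilobyte, a system with 1000
cgroups ties up roughly 250 MB for offline nodes alone.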

Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org> 
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>

I've tested these patches in a VM with two nodes and movable
nodes enabled. I've offlined nodes and checked that the system,
and cgroups with tasks deep in the hierarchy, continue to work
fine.

Balbir Singh (3):
  Add basic infrastructure for memcg hotplug support
  Move from all possible nodes to online nodes
  powerpc: fix node_possible_map limitations

 arch/powerpc/mm/numa.c |  7 ----
 mm/memcontrol.c        | 96 +++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 83 insertions(+), 20 deletions(-)

-- 
2.5.5


* [RESEND] [PATCH v1 1/3] Add basic infrastructure for memcg hotplug support
  2016-11-15 23:44 [RESEND][v1 0/3] Support memory cgroup hotplug Balbir Singh
@ 2016-11-15 23:44 ` Balbir Singh
  2016-11-16  9:01   ` Vladimir Davydov
  2016-11-15 23:45 ` [RESEND] [PATCH v1 2/3] Move from all possible nodes to online nodes Balbir Singh
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Balbir Singh @ 2016-11-15 23:44 UTC (permalink / raw)
  To: mpe, hannes, mhocko, vdavydov.dev
  Cc: linuxppc-dev, linux-mm, Balbir Singh, Tejun Heo, Andrew Morton

The lack of hotplug support makes us allocate all memory
upfront for per-node data structures. With a large number
of cgroups this can be an overhead. PPC64 actually limits
n_possible nodes to n_online to avoid some of this overhead.

This patch adds the basic notifiers to listen to hotplug
events and does the allocation and freeing of those structures
per cgroup. We walk every cgroup per event; it's a trade-off
between allocating upfront and allocating on demand plus
freeing on offline.

Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org> 
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
---
 mm/memcontrol.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 60 insertions(+), 8 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 91dfc7c..5585fce 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -63,6 +63,7 @@
 #include <linux/lockdep.h>
 #include <linux/file.h>
 #include <linux/tracehook.h>
+#include <linux/memory.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -1342,6 +1343,10 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
 {
 	return 0;
 }
+
+static void mem_cgroup_may_update_nodemask(struct mem_cgroup *memcg)
+{
+}
 #endif
 
 static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
@@ -4115,14 +4120,7 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 {
 	struct mem_cgroup_per_node *pn;
 	int tmp = node;
-	/*
-	 * This routine is called against possible nodes.
-	 * But it's BUG to call kmalloc() against offline node.
-	 *
-	 * TODO: this routine can waste much memory for nodes which will
-	 *       never be onlined. It's better to use memory hotplug callback
-	 *       function.
-	 */
+
 	if (!node_state(node, N_NORMAL_MEMORY))
 		tmp = -1;
 	pn = kzalloc_node(sizeof(*pn), GFP_KERNEL, tmp);
@@ -5773,6 +5771,59 @@ static int __init cgroup_memory(char *s)
 }
 __setup("cgroup.memory=", cgroup_memory);
 
+static void memcg_node_offline(int node)
+{
+	struct mem_cgroup *memcg;
+
+	if (node < 0)
+		return;
+
+	for_each_mem_cgroup(memcg) {
+		free_mem_cgroup_per_node_info(memcg, node);
+		mem_cgroup_may_update_nodemask(memcg);
+	}
+}
+
+static void memcg_node_online(int node)
+{
+	struct mem_cgroup *memcg;
+
+	if (node < 0)
+		return;
+
+	for_each_mem_cgroup(memcg) {
+		alloc_mem_cgroup_per_node_info(memcg, node);
+		mem_cgroup_may_update_nodemask(memcg);
+	}
+}
+
+static int memcg_memory_hotplug_callback(struct notifier_block *self,
+					unsigned long action, void *arg)
+{
+	struct memory_notify *marg = arg;
+	int node = marg->status_change_nid;
+
+	switch (action) {
+	case MEM_GOING_OFFLINE:
+	case MEM_CANCEL_ONLINE:
+		memcg_node_offline(node);
+		break;
+	case MEM_GOING_ONLINE:
+	case MEM_CANCEL_OFFLINE:
+		memcg_node_online(node);
+		break;
+	case MEM_ONLINE:
+	case MEM_OFFLINE:
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block memcg_memory_hotplug_nb __meminitdata = {
+	.notifier_call = memcg_memory_hotplug_callback,
+	.priority = IPC_CALLBACK_PRI,
+};
+
 /*
  * subsys_initcall() for memory controller.
  *
@@ -5797,6 +5848,7 @@ static int __init mem_cgroup_init(void)
 #endif
 
 	hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
+	register_hotmemory_notifier(&memcg_memory_hotplug_nb);
 
 	for_each_possible_cpu(cpu)
 		INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work,
-- 
2.5.5


* [RESEND] [PATCH v1 2/3] Move from all possible nodes to online nodes
  2016-11-15 23:44 [RESEND][v1 0/3] Support memory cgroup hotplug Balbir Singh
  2016-11-15 23:44 ` [RESEND] [PATCH v1 1/3] Add basic infrastructure for memcg hotplug support Balbir Singh
@ 2016-11-15 23:45 ` Balbir Singh
  2016-11-15 23:45 ` [RESEND] [PATCH v1 3/3] powerpc: fix node_possible_map limitations Balbir Singh
  2016-11-21 14:03 ` [RESEND][v1 0/3] Support memory cgroup hotplug Michal Hocko
  3 siblings, 0 replies; 13+ messages in thread
From: Balbir Singh @ 2016-11-15 23:45 UTC (permalink / raw)
  To: mpe, hannes, mhocko, vdavydov.dev
  Cc: linuxppc-dev, linux-mm, Balbir Singh, Tejun Heo, Andrew Morton

Convert routines that operate on all possible nodes to
operate on online nodes only. Most of the changes are
straightforward (like the ones related to the per-node
soft limit tree).

Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org> 
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
---
 mm/memcontrol.c | 28 +++++++++++++++++++++++-----
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5585fce..cc49fa2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -497,7 +497,7 @@ static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
 	struct mem_cgroup_per_node *mz;
 	int nid;
 
-	for_each_node(nid) {
+	for_each_online_node(nid) {
 		mz = mem_cgroup_nodeinfo(memcg, nid);
 		mctz = soft_limit_tree_node(nid);
 		mem_cgroup_remove_exceeded(mz, mctz);
@@ -895,7 +895,7 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
 	int i;
 
 	while ((memcg = parent_mem_cgroup(memcg))) {
-		for_each_node(nid) {
+		for_each_online_node(nid) {
 			mz = mem_cgroup_nodeinfo(memcg, nid);
 			for (i = 0; i <= DEF_PRIORITY; i++) {
 				iter = &mz->iter[i];
@@ -4146,7 +4146,7 @@ static void mem_cgroup_free(struct mem_cgroup *memcg)
 	int node;
 
 	memcg_wb_domain_exit(memcg);
-	for_each_node(node)
+	for_each_online_node(node)
 		free_mem_cgroup_per_node_info(memcg, node);
 	free_percpu(memcg->stat);
 	kfree(memcg);
@@ -4175,7 +4175,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	if (!memcg->stat)
 		goto fail;
 
-	for_each_node(node)
+	for_each_online_node(node)
 		if (alloc_mem_cgroup_per_node_info(memcg, node))
 			goto fail;
 
@@ -5774,11 +5774,21 @@ __setup("cgroup.memory=", cgroup_memory);
 static void memcg_node_offline(int node)
 {
 	struct mem_cgroup *memcg;
+	struct mem_cgroup_tree_per_node *rtpn;
+	struct mem_cgroup_tree_per_node *mctz;
+	struct mem_cgroup_per_node *mz;
 
 	if (node < 0)
 		return;
 
+	rtpn = soft_limit_tree.rb_tree_per_node[node];
+	kfree(rtpn);
+
 	for_each_mem_cgroup(memcg) {
+		mz = mem_cgroup_nodeinfo(memcg, node);
+		mctz = soft_limit_tree_node(node);
+		mem_cgroup_remove_exceeded(mz, mctz);
+
 		free_mem_cgroup_per_node_info(memcg, node);
 		mem_cgroup_may_update_nodemask(memcg);
 	}
@@ -5787,10 +5797,18 @@ static void memcg_node_offline(int node)
 static void memcg_node_online(int node)
 {
 	struct mem_cgroup *memcg;
+	struct mem_cgroup_tree_per_node *rtpn;
 
 	if (node < 0)
 		return;
 
+	rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL,
+			    node_online(node) ? node : NUMA_NO_NODE);
+
+	rtpn->rb_root = RB_ROOT;
+	spin_lock_init(&rtpn->lock);
+	soft_limit_tree.rb_tree_per_node[node] = rtpn;
+
 	for_each_mem_cgroup(memcg) {
 		alloc_mem_cgroup_per_node_info(memcg, node);
 		mem_cgroup_may_update_nodemask(memcg);
@@ -5854,7 +5872,7 @@ static int __init mem_cgroup_init(void)
 		INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work,
 			  drain_local_stock);
 
-	for_each_node(node) {
+	for_each_online_node(node) {
 		struct mem_cgroup_tree_per_node *rtpn;
 
 		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL,
-- 
2.5.5


* [RESEND] [PATCH v1 3/3] powerpc: fix node_possible_map limitations
  2016-11-15 23:44 [RESEND][v1 0/3] Support memory cgroup hotplug Balbir Singh
  2016-11-15 23:44 ` [RESEND] [PATCH v1 1/3] Add basic infrastructure for memcg hotplug support Balbir Singh
  2016-11-15 23:45 ` [RESEND] [PATCH v1 2/3] Move from all possible nodes to online nodes Balbir Singh
@ 2016-11-15 23:45 ` Balbir Singh
  2016-11-16 16:40   ` Reza Arbab
  2016-11-21 14:03 ` [RESEND][v1 0/3] Support memory cgroup hotplug Michal Hocko
  3 siblings, 1 reply; 13+ messages in thread
From: Balbir Singh @ 2016-11-15 23:45 UTC (permalink / raw)
  To: mpe, hannes, mhocko, vdavydov.dev
  Cc: linuxppc-dev, linux-mm, Balbir Singh, Tejun Heo, Andrew Morton

We've fixed the memory hotplug issue with memcg, so this
workaround should no longer be required.

Reverts: commit 3af229f2071f
("powerpc/numa: Reset node_possible_map to only node_online_map")

Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org> 
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>

---
 arch/powerpc/mm/numa.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index a51c188..ca8c2ab 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -916,13 +916,6 @@ void __init initmem_init(void)
 
 	memblock_dump_all();
 
-	/*
-	 * Reduce the possible NUMA nodes to the online NUMA nodes,
-	 * since we do not support node hotplug. This ensures that  we
-	 * lower the maximum NUMA node ID to what is actually present.
-	 */
-	nodes_and(node_possible_map, node_possible_map, node_online_map);
-
 	for_each_online_node(nid) {
 		unsigned long start_pfn, end_pfn;
 
-- 
2.5.5


* Re: [RESEND] [PATCH v1 1/3] Add basic infrastructure for memcg hotplug support
  2016-11-15 23:44 ` [RESEND] [PATCH v1 1/3] Add basic infrastructure for memcg hotplug support Balbir Singh
@ 2016-11-16  9:01   ` Vladimir Davydov
  2016-11-17  0:28     ` Balbir Singh
  0 siblings, 1 reply; 13+ messages in thread
From: Vladimir Davydov @ 2016-11-16  9:01 UTC (permalink / raw)
  To: Balbir Singh
  Cc: mpe, hannes, mhocko, linuxppc-dev, linux-mm, Tejun Heo, Andrew Morton

Hello,

On Wed, Nov 16, 2016 at 10:44:59AM +1100, Balbir Singh wrote:
> The lack of hotplug support makes us allocate all memory
> upfront for per-node data structures. With a large number
> of cgroups this can be an overhead. PPC64 actually limits
> n_possible nodes to n_online to avoid some of this overhead.
> 
> This patch adds the basic notifiers to listen to hotplug
> events and does the allocation and freeing of those structures
> per cgroup. We walk every cgroup per event; it's a trade-off
> between allocating upfront and allocating on demand plus
> freeing on offline.
> 
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org> 
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> 
> Signed-off-by: Balbir Singh <bsingharora@gmail.com>
> ---
>  mm/memcontrol.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 60 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 91dfc7c..5585fce 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -63,6 +63,7 @@
>  #include <linux/lockdep.h>
>  #include <linux/file.h>
>  #include <linux/tracehook.h>
> +#include <linux/memory.h>
>  #include "internal.h"
>  #include <net/sock.h>
>  #include <net/ip.h>
> @@ -1342,6 +1343,10 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
>  {
>  	return 0;
>  }
> +
> +static void mem_cgroup_may_update_nodemask(struct mem_cgroup *memcg)
> +{
> +}
>  #endif
>  
>  static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
> @@ -4115,14 +4120,7 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
>  {
>  	struct mem_cgroup_per_node *pn;
>  	int tmp = node;
> -	/*
> -	 * This routine is called against possible nodes.
> -	 * But it's BUG to call kmalloc() against offline node.
> -	 *
> -	 * TODO: this routine can waste much memory for nodes which will
> -	 *       never be onlined. It's better to use memory hotplug callback
> -	 *       function.
> -	 */
> +
>  	if (!node_state(node, N_NORMAL_MEMORY))
>  		tmp = -1;
>  	pn = kzalloc_node(sizeof(*pn), GFP_KERNEL, tmp);
> @@ -5773,6 +5771,59 @@ static int __init cgroup_memory(char *s)
>  }
>  __setup("cgroup.memory=", cgroup_memory);
>  
> +static void memcg_node_offline(int node)
> +{
> +	struct mem_cgroup *memcg;
> +
> +	if (node < 0)
> +		return;

Is this possible?

> +
> +	for_each_mem_cgroup(memcg) {
> +		free_mem_cgroup_per_node_info(memcg, node);
> +		mem_cgroup_may_update_nodemask(memcg);

If memcg->numainfo_events is 0, mem_cgroup_may_update_nodemask() won't
update memcg->scan_nodes. Is it OK?
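
For reference, the check in question looks roughly like this in the
mm/memcontrol.c of this era (a paraphrased sketch, not the verbatim
source):

	static void mem_cgroup_may_update_nodemask(struct mem_cgroup *memcg)
	{
		int nid;

		/* no NUMA info events recorded: return without touching
		 * memcg->scan_nodes */
		if (!atomic_read(&memcg->numainfo_events))
			return;
		if (atomic_inc_return(&memcg->numainfo_updating) > 1)
			return;

		/* rebuild the mask of nodes this memcg uses memory from */
		memcg->scan_nodes = node_states[N_MEMORY];
		for_each_node_mask(nid, node_states[N_MEMORY])
			if (!test_mem_cgroup_node_reclaimable(memcg, nid, false))
				node_clear(nid, memcg->scan_nodes);

		atomic_set(&memcg->numainfo_events, 0);
		atomic_set(&memcg->numainfo_updating, 0);
	}

so a memcg with no pending numainfo events would keep a stale node in
scan_nodes after the offline.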

> +	}

What if a memory cgroup is created or destroyed while you're walking the
tree? Should we perhaps use get_online_mems() in mem_cgroup_alloc() to
avoid that?

> +}
> +
> +static void memcg_node_online(int node)
> +{
> +	struct mem_cgroup *memcg;
> +
> +	if (node < 0)
> +		return;
> +
> +	for_each_mem_cgroup(memcg) {
> +		alloc_mem_cgroup_per_node_info(memcg, node);
> +		mem_cgroup_may_update_nodemask(memcg);
> +	}
> +}
> +
> +static int memcg_memory_hotplug_callback(struct notifier_block *self,
> +					unsigned long action, void *arg)
> +{
> +	struct memory_notify *marg = arg;
> +	int node = marg->status_change_nid;
> +
> +	switch (action) {
> +	case MEM_GOING_OFFLINE:
> +	case MEM_CANCEL_ONLINE:
> +		memcg_node_offline(node);

Judging by __offline_pages(), the MEM_GOING_OFFLINE event is emitted
before migrating pages off the node. So, I guess freeing per-node info
here isn't quite correct, as pages still need it to be moved from the
node's LRU lists. Better move it to MEM_OFFLINE?
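
The ordering in question, paraphrased from the __offline_pages() of
this era (a sketch, not the verbatim source):

	/* simplified event ordering inside __offline_pages() */
	memory_notify(MEM_GOING_OFFLINE, &arg);	/* per-node info freed here */
	...
	do_migrate_range(start_pfn, end_pfn);	/* pages still on node LRUs */
	...
	memory_notify(MEM_OFFLINE, &arg);	/* node drained; safe to free */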

> +		break;
> +	case MEM_GOING_ONLINE:
> +	case MEM_CANCEL_OFFLINE:
> +		memcg_node_online(node);
> +		break;
> +	case MEM_ONLINE:
> +	case MEM_OFFLINE:
> +		break;
> +	}
> +	return NOTIFY_OK;
> +}
> +
> +static struct notifier_block memcg_memory_hotplug_nb __meminitdata = {
> +	.notifier_call = memcg_memory_hotplug_callback,
> +	.priority = IPC_CALLBACK_PRI,

I wonder why you chose this priority?

> +};
> +
>  /*
>   * subsys_initcall() for memory controller.
>   *
> @@ -5797,6 +5848,7 @@ static int __init mem_cgroup_init(void)
>  #endif
>  
>  	hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
> +	register_hotmemory_notifier(&memcg_memory_hotplug_nb);
>  
>  	for_each_possible_cpu(cpu)
>  		INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work,

I guess we should modify mem_cgroup_alloc/free() in the scope of this
patch, otherwise it doesn't make much sense IMHO. Maybe it's even
worth merging patches 1 and 2 altogether.

Thanks,
Vladimir


* Re: [RESEND] [PATCH v1 3/3] powerpc: fix node_possible_map limitations
  2016-11-15 23:45 ` [RESEND] [PATCH v1 3/3] powerpc: fix node_possible_map limitations Balbir Singh
@ 2016-11-16 16:40   ` Reza Arbab
  2016-11-16 16:45     ` [PATCH] powerpc/mm: allow memory hotplug into an offline node Reza Arbab
  0 siblings, 1 reply; 13+ messages in thread
From: Reza Arbab @ 2016-11-16 16:40 UTC (permalink / raw)
  To: Balbir Singh
  Cc: mpe, hannes, mhocko, vdavydov.dev, Tejun Heo, linux-mm,
	linuxppc-dev, Andrew Morton

On Wed, Nov 16, 2016 at 10:45:01AM +1100, Balbir Singh wrote:
>Reverts: commit 3af229f2071f
>("powerpc/numa: Reset node_possible_map to only node_online_map")

Nice! With this limitation going away, I have a small patch to enable 
onlining new nodes via memory hotplug. Incoming.

-- 
Reza Arbab


* [PATCH] powerpc/mm: allow memory hotplug into an offline node
  2016-11-16 16:40   ` Reza Arbab
@ 2016-11-16 16:45     ` Reza Arbab
  2017-02-01  1:05       ` Michael Ellerman
  0 siblings, 1 reply; 13+ messages in thread
From: Reza Arbab @ 2016-11-16 16:45 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras, Andrew Morton
  Cc: linuxppc-dev, linux-mm, Balbir Singh, Nathan Fontenot, John Allen

Relax the check preventing us from hotplugging into an offline node.

This limitation was added in commit 482ec7c403d2 ("[PATCH] powerpc numa:
Support sparse online node map") to prevent adding resources to an
uninitialized node.

These days, there is no harm in doing so. The addition will actually
cause the node to be initialized and onlined; add_memory_resource()
calls hotadd_new_pgdat() (if necessary) and node_set_online().
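
The relevant call chain, roughly (paraphrased from the code of this
era):

	add_memory_resource(nid, res, online)
	    hotadd_new_pgdat(nid, start);	/* only if no pgdat exists yet */
	    arch_add_memory(nid, start, size, ...);
	    node_set_online(nid);		/* node initialized and onlined */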

Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: John Allen <jallen@linux.vnet.ibm.com>
Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
---
This applies on top of "powerpc/mm: allow memory hotplug into a
memoryless node", currently in the -mm tree:
http://lkml.kernel.org/r/1479160961-25840-2-git-send-email-arbab@linux.vnet.ibm.com

 arch/powerpc/mm/numa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index d69f6f6..07620c9 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1091,7 +1091,7 @@ int hot_add_scn_to_nid(unsigned long scn_addr)
 		nid = hot_add_node_scn_to_nid(scn_addr);
 	}
 
-	if (nid < 0 || !node_online(nid))
+	if (nid < 0 || !node_possible(nid))
 		nid = first_online_node;
 
 	return nid;
-- 
1.8.3.1


* Re: [RESEND] [PATCH v1 1/3] Add basic infrastructure for memcg hotplug support
  2016-11-16  9:01   ` Vladimir Davydov
@ 2016-11-17  0:28     ` Balbir Singh
  2016-11-21  8:36       ` Vladimir Davydov
  0 siblings, 1 reply; 13+ messages in thread
From: Balbir Singh @ 2016-11-17  0:28 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: mpe, hannes, mhocko, linuxppc-dev, linux-mm, Tejun Heo, Andrew Morton



On 16/11/16 20:01, Vladimir Davydov wrote:
> Hello,
> 
> On Wed, Nov 16, 2016 at 10:44:59AM +1100, Balbir Singh wrote:
>> The lack of hotplug support makes us allocate all memory
>> upfront for per-node data structures. With a large number
>> of cgroups this can be an overhead. PPC64 actually limits
>> n_possible nodes to n_online to avoid some of this overhead.
>>
>> This patch adds the basic notifiers to listen to hotplug
>> events and does the allocation and freeing of those structures
>> per cgroup. We walk every cgroup per event; it's a trade-off
>> between allocating upfront and allocating on demand plus
>> freeing on offline.
>>
>> Cc: Tejun Heo <tj@kernel.org>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Michal Hocko <mhocko@kernel.org> 
>> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
>>
>> Signed-off-by: Balbir Singh <bsingharora@gmail.com>
>> ---
>>  mm/memcontrol.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
>>  1 file changed, 60 insertions(+), 8 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 91dfc7c..5585fce 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -63,6 +63,7 @@
>>  #include <linux/lockdep.h>
>>  #include <linux/file.h>
>>  #include <linux/tracehook.h>
>> +#include <linux/memory.h>
>>  #include "internal.h"
>>  #include <net/sock.h>
>>  #include <net/ip.h>
>> @@ -1342,6 +1343,10 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
>>  {
>>  	return 0;
>>  }
>> +
>> +static void mem_cgroup_may_update_nodemask(struct mem_cgroup *memcg)
>> +{
>> +}
>>  #endif
>>  
>>  static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
>> @@ -4115,14 +4120,7 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
>>  {
>>  	struct mem_cgroup_per_node *pn;
>>  	int tmp = node;
>> -	/*
>> -	 * This routine is called against possible nodes.
>> -	 * But it's BUG to call kmalloc() against offline node.
>> -	 *
>> -	 * TODO: this routine can waste much memory for nodes which will
>> -	 *       never be onlined. It's better to use memory hotplug callback
>> -	 *       function.
>> -	 */
>> +
>>  	if (!node_state(node, N_NORMAL_MEMORY))
>>  		tmp = -1;
>>  	pn = kzalloc_node(sizeof(*pn), GFP_KERNEL, tmp);
>> @@ -5773,6 +5771,59 @@ static int __init cgroup_memory(char *s)
>>  }
>>  __setup("cgroup.memory=", cgroup_memory);
>>  
>> +static void memcg_node_offline(int node)
>> +{
>> +	struct mem_cgroup *memcg;
>> +
>> +	if (node < 0)
>> +		return;
> 
> Is this possible?

Yes, please see node_states_check_changes_online/offline;
status_change_nid is set to -1 when the event does not change any
node's online state.

> 
>> +
>> +	for_each_mem_cgroup(memcg) {
>> +		free_mem_cgroup_per_node_info(memcg, node);
>> +		mem_cgroup_may_update_nodemask(memcg);
> 
> If memcg->numainfo_events is 0, mem_cgroup_may_update_nodemask() won't
> update memcg->scan_nodes. Is it OK?
> 
>> +	}
> 
> What if a memory cgroup is created or destroyed while you're walking the
> tree? Should we perhaps use get_online_mems() in mem_cgroup_alloc() to
> avoid that?
> 

The iterator internally takes rcu_read_lock() to avoid any side effects
of cgroups being added/removed. I suspect you are also suggesting using
get_online_mems() around each call to for_each_online_node.

My understanding so far is

1. invalidate_reclaim_iterators should be safe (no bad side-effects)
2. mem_cgroup_free - should be safe as well
3. mem_cgroup_alloc - needs protection
4. mem_cgroup_init - needs protection
5. mem_cgroup_remove_from_trees - should be safe

>> +}
>> +
>> +static void memcg_node_online(int node)
>> +{
>> +	struct mem_cgroup *memcg;
>> +
>> +	if (node < 0)
>> +		return;
>> +
>> +	for_each_mem_cgroup(memcg) {
>> +		alloc_mem_cgroup_per_node_info(memcg, node);
>> +		mem_cgroup_may_update_nodemask(memcg);
>> +	}
>> +}
>> +
>> +static int memcg_memory_hotplug_callback(struct notifier_block *self,
>> +					unsigned long action, void *arg)
>> +{
>> +	struct memory_notify *marg = arg;
>> +	int node = marg->status_change_nid;
>> +
>> +	switch (action) {
>> +	case MEM_GOING_OFFLINE:
>> +	case MEM_CANCEL_ONLINE:
>> +		memcg_node_offline(node);
> 
> Judging by __offline_pages(), the MEM_GOING_OFFLINE event is emitted
> before migrating pages off the node. So, I guess freeing per-node info
> here isn't quite correct, as pages still need it to be moved from the
> node's LRU lists. Better move it to MEM_OFFLINE?
> 

Good point, will redo

>> +		break;
>> +	case MEM_GOING_ONLINE:
>> +	case MEM_CANCEL_OFFLINE:
>> +		memcg_node_online(node);
>> +		break;
>> +	case MEM_ONLINE:
>> +	case MEM_OFFLINE:
>> +		break;
>> +	}
>> +	return NOTIFY_OK;
>> +}
>> +
>> +static struct notifier_block memcg_memory_hotplug_nb __meminitdata = {
>> +	.notifier_call = memcg_memory_hotplug_callback,
>> +	.priority = IPC_CALLBACK_PRI,
> 
> I wonder why you chose this priority?
> 

I just chose the lowest priority

>> +};
>> +
>>  /*
>>   * subsys_initcall() for memory controller.
>>   *
>> @@ -5797,6 +5848,7 @@ static int __init mem_cgroup_init(void)
>>  #endif
>>  
>>  	hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
>> +	register_hotmemory_notifier(&memcg_memory_hotplug_nb);
>>  
>>  	for_each_possible_cpu(cpu)
>>  		INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work,
> 
> I guess we should modify mem_cgroup_alloc/free() in the scope of this
> patch, otherwise it doesn't make much sense IMHO. Maybe it's even
> worth merging patches 1 and 2 altogether.
> 


Thanks for the review, I'll revisit the organization of the patches.


Balbir Singh


* Re: [RESEND] [PATCH v1 1/3] Add basic infrastructure for memcg hotplug support
  2016-11-17  0:28     ` Balbir Singh
@ 2016-11-21  8:36       ` Vladimir Davydov
  2016-11-22  0:17         ` Balbir Singh
  0 siblings, 1 reply; 13+ messages in thread
From: Vladimir Davydov @ 2016-11-21  8:36 UTC (permalink / raw)
  To: Balbir Singh
  Cc: mpe, hannes, mhocko, linuxppc-dev, linux-mm, Tejun Heo, Andrew Morton

On Thu, Nov 17, 2016 at 11:28:12AM +1100, Balbir Singh wrote:
> >> @@ -5773,6 +5771,59 @@ static int __init cgroup_memory(char *s)
> >>  }
> >>  __setup("cgroup.memory=", cgroup_memory);
> >>  
> >> +static void memcg_node_offline(int node)
> >> +{
> >> +	struct mem_cgroup *memcg;
> >> +
> >> +	if (node < 0)
> >> +		return;
> > 
> > Is this possible?
> 
> Yes, please see node_states_check_changes_online/offline;
> status_change_nid is set to -1 when the event does not change any
> node's online state.

OK, I see.

> 
> > 
> >> +
> >> +	for_each_mem_cgroup(memcg) {
> >> +		free_mem_cgroup_per_node_info(memcg, node);
> >> +		mem_cgroup_may_update_nodemask(memcg);
> > 
> > If memcg->numainfo_events is 0, mem_cgroup_may_update_nodemask() won't
> > update memcg->scan_nodes. Is it OK?
> > 
> >> +	}
> > 
> > What if a memory cgroup is created or destroyed while you're walking the
> > tree? Should we perhaps use get_online_mems() in mem_cgroup_alloc() to
> > avoid that?
> > 
> 
> The iterator internally takes rcu_read_lock() to avoid any side effects
> of cgroups being added/removed. I suspect you are also suggesting using
> get_online_mems() around each call to for_each_online_node.
> 
> My understanding so far is
> 
> 1. invalidate_reclaim_iterators should be safe (no bad side-effects)
> 2. mem_cgroup_free - should be safe as well
> 3. mem_cgroup_alloc - needs protection
> 4. mem_cgroup_init - needs protection
> 5. mem_cgroup_remove_from_trees - should be safe

I'm not into the memory hotplug code, but my understanding is that if
memcg offline happens to race with node unplug, it's possible that

 - mem_cgroup_free() doesn't free the node's data, because it sees the
   node as already offline
 - memcg hotplug code doesn't free the node's data either, because it
   sees the cgroup as offline

Maybe we should surround all the loops over online nodes with
get/put_online_mems() to be sure that nothing wrong can happen.
They are slow paths, anyway.
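
Something like the following, as a sketch of that suggestion against
the v1 helpers (untested, illustrative only):

	static struct mem_cgroup *mem_cgroup_alloc(void)
	{
		...
		get_online_mems();	/* block node online/offline events */
		for_each_online_node(node) {
			if (alloc_mem_cgroup_per_node_info(memcg, node)) {
				put_online_mems();
				goto fail;
			}
		}
		put_online_mems();	/* walk done; hotplug may proceed */
		...
	}

with the same get/put_online_mems() bracketing around the node loops
in mem_cgroup_free() and mem_cgroup_init().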

Thanks,
Vladimir


* Re: [RESEND][v1 0/3] Support memory cgroup hotplug
  2016-11-15 23:44 [RESEND][v1 0/3] Support memory cgroup hotplug Balbir Singh
                   ` (2 preceding siblings ...)
  2016-11-15 23:45 ` [RESEND] [PATCH v1 3/3] powerpc: fix node_possible_map limitations Balbir Singh
@ 2016-11-21 14:03 ` Michal Hocko
  2016-11-22  0:16   ` Balbir Singh
  3 siblings, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2016-11-21 14:03 UTC (permalink / raw)
  To: Balbir Singh
  Cc: mpe, hannes, vdavydov.dev, linuxppc-dev, linux-mm, Tejun Heo,
	Andrew Morton

On Wed 16-11-16 10:44:58, Balbir Singh wrote:
> In the absence of hotplug support we use extra memory proportional to
> (possible_nodes - online_nodes) * number_of_cgroups. PPC64 carries a
> patch that limits possible nodes to online nodes to avoid this large
> consumption with many cgroups. This series adds hotplug support to
> memory cgroups and reverts that commit.

I didn't get to read the patches yet (I am currently swamped by emails
after a longer vacation, so bear with me) but this doesn't tell us _why_
we want this and how much we can actually save. In general, being
dynamic is more complex, and most systems tend to have possible_nodes
close to online_nodes in my experience (well, at least on most
reasonable architectures). I would also appreciate some high-level
description of the implications, e.g. how do we synchronize with the
hotplug operations when iterating node-specific data structures.

Thanks!

-- 
Michal Hocko
SUSE Labs


* Re: [RESEND][v1 0/3] Support memory cgroup hotplug
  2016-11-21 14:03 ` [RESEND][v1 0/3] Support memory cgroup hotplug Michal Hocko
@ 2016-11-22  0:16   ` Balbir Singh
  0 siblings, 0 replies; 13+ messages in thread
From: Balbir Singh @ 2016-11-22  0:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: mpe, hannes, vdavydov.dev, linuxppc-dev, linux-mm, Tejun Heo,
	Andrew Morton



On 22/11/16 01:03, Michal Hocko wrote:
> On Wed 16-11-16 10:44:58, Balbir Singh wrote:
>> In the absence of hotplug support we use extra memory proportional to
>> (possible_nodes - online_nodes) * number_of_cgroups. PPC64 carries a
>> patch that limits possible nodes to online nodes to avoid this large
>> consumption with many cgroups. This series adds hotplug support to
>> memory cgroups and reverts that commit.
> 
> I didn't get to read the patches yet (I am currently swamped by emails
> after a longer vacation, so bear with me) but this doesn't tell us _why_
> we want this and how much we can actually save.

The motivation was 3af229f2071f
(powerpc/numa: Reset node_possible_map to only node_online_map)

> In general, being dynamic is more complex, and most systems tend to
> have possible_nodes close to online_nodes in my experience (well, at
> least on most reasonable architectures). I would also appreciate some
> high-level description of the implications, e.g. how do we synchronize
> with the hotplug operations when iterating node-specific data
> structures.

I agree dynamic is more complex, but I think we'll begin to see a lot
more of it. The rules are not hard IMHO. From an implications
perspective, it means that we need get/put_online_mems() in certain
paths, specifically mem_cgroup_alloc/free and mem_cgroup_init, from
what I can see so far.

Thanks for the review!

Balbir Singh


* Re: [RESEND] [PATCH v1 1/3] Add basic infrastructure for memcg hotplug support
  2016-11-21  8:36       ` Vladimir Davydov
@ 2016-11-22  0:17         ` Balbir Singh
  0 siblings, 0 replies; 13+ messages in thread
From: Balbir Singh @ 2016-11-22  0:17 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: mpe, hannes, mhocko, linuxppc-dev, linux-mm, Tejun Heo, Andrew Morton


>>
>> The iterator internally takes rcu_read_lock() to avoid any side effects
>> of cgroups being added/removed. I suspect you are also suggesting using
>> get_online_mems() around each call to for_each_online_node.
>>
>> My understanding so far is
>>
>> 1. invalidate_reclaim_iterators should be safe (no bad side-effects)
>> 2. mem_cgroup_free - should be safe as well
>> 3. mem_cgroup_alloc - needs protection
>> 4. mem_cgroup_init - needs protection
>> 5. mem_cgroup_remove_from_trees - should be safe
> 
> I'm not into the memory hotplug code, but my understanding is that if
> memcg offline happens to race with node unplug, it's possible that
> 
>  - mem_cgroup_free() doesn't free the node's data, because it sees the
>    node as already offline
>  - memcg hotplug code doesn't free the node's data either, because it
>    sees the cgroup as offline
> 
> Maybe we should surround all the loops over online nodes with
> get/put_online_mems() to be sure that nothing wrong can happen.
> They are slow paths, anyway.
> 

Makes sense, agreed

Balbir


* Re: powerpc/mm: allow memory hotplug into an offline node
  2016-11-16 16:45     ` [PATCH] powerpc/mm: allow memory hotplug into an offline node Reza Arbab
@ 2017-02-01  1:05       ` Michael Ellerman
  0 siblings, 0 replies; 13+ messages in thread
From: Michael Ellerman @ 2017-02-01  1:05 UTC (permalink / raw)
  To: Reza Arbab, Benjamin Herrenschmidt, Paul Mackerras, Andrew Morton
  Cc: linux-mm, John Allen, linuxppc-dev, Nathan Fontenot

On Wed, 2016-11-16 at 16:45:03 UTC, Reza Arbab wrote:
> Relax the check preventing us from hotplugging into an offline node.
> 
> This limitation was added in commit 482ec7c403d2 ("[PATCH] powerpc numa:
> Support sparse online node map") to prevent adding resources to an
> uninitialized node.
> 
> These days, there is no harm in doing so. The addition will actually
> cause the node to be initialized and onlined; add_memory_resource()
> calls hotadd_new_pgdat() (if necessary) and node_set_online().
> 
> Cc: Balbir Singh <bsingharora@gmail.com>
> Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
> Cc: John Allen <jallen@linux.vnet.ibm.com>
> Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/2a8628d41602dc9f988af051a657ee

cheers


Thread overview: 13+ messages
2016-11-15 23:44 [RESEND][v1 0/3] Support memory cgroup hotplug Balbir Singh
2016-11-15 23:44 ` [RESEND] [PATCH v1 1/3] Add basic infrastructure for memcg hotplug support Balbir Singh
2016-11-16  9:01   ` Vladimir Davydov
2016-11-17  0:28     ` Balbir Singh
2016-11-21  8:36       ` Vladimir Davydov
2016-11-22  0:17         ` Balbir Singh
2016-11-15 23:45 ` [RESEND] [PATCH v1 2/3] Move from all possible nodes to online nodes Balbir Singh
2016-11-15 23:45 ` [RESEND] [PATCH v1 3/3] powerpc: fix node_possible_map limitations Balbir Singh
2016-11-16 16:40   ` Reza Arbab
2016-11-16 16:45     ` [PATCH] powerpc/mm: allow memory hotplug into an offline node Reza Arbab
2017-02-01  1:05       ` Michael Ellerman
2016-11-21 14:03 ` [RESEND][v1 0/3] Support memory cgroup hotplug Michal Hocko
2016-11-22  0:16   ` Balbir Singh
