linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/2] mm/cgroup soft limit data allocation
@ 2017-02-23 13:36 Laurent Dufour
  2017-02-23 13:36 ` [PATCH v2 1/2] mm/cgroup: avoid panic when init with low memory Laurent Dufour
  2017-02-23 13:36 ` [PATCH v2 2/2] mm/cgroup: delay soft limit data allocation Laurent Dufour
  0 siblings, 2 replies; 9+ messages in thread
From: Laurent Dufour @ 2017-02-23 13:36 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Balbir Singh
  Cc: cgroups, linux-mm, linux-kernel

The first patch of this series fixes a panic that occurs when the
allocation of the soft limit data triggers reclaim, which in turn
accesses the soft limit data before it has been allocated.

The second patch, as suggested by Michal Hocko, goes one step further
and delays the soft limit data allocation until a soft limit is set.

V1->V2:
 - move sanity pointer checks to the first patch
 - also defer the allocation of the pointer table
 - return an error if the allocation fails

Laurent Dufour (2):
  mm/cgroup: avoid panic when init with low memory
  mm/cgroup: delay soft limit data allocation

 mm/memcontrol.c | 74 ++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 57 insertions(+), 17 deletions(-)

-- 
2.7.4


* [PATCH v2 1/2] mm/cgroup: avoid panic when init with low memory
  2017-02-23 13:36 [PATCH v2 0/2] mm/cgroup soft limit data allocation Laurent Dufour
@ 2017-02-23 13:36 ` Laurent Dufour
  2017-02-23 15:12   ` Michal Hocko
                     ` (3 more replies)
  2017-02-23 13:36 ` [PATCH v2 2/2] mm/cgroup: delay soft limit data allocation Laurent Dufour
  1 sibling, 4 replies; 9+ messages in thread
From: Laurent Dufour @ 2017-02-23 13:36 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Balbir Singh
  Cc: cgroups, linux-mm, linux-kernel

The system may panic during initialisation when almost all of the
memory is assigned to huge pages using the kernel command line
parameter hugepages=xxxx. The panic may look like this:

[    0.082289] Unable to handle kernel paging request for data at address 0x00000000
[    0.082338] Faulting instruction address: 0xc000000000302b88
[    0.082377] Oops: Kernel access of bad area, sig: 11 [#1]
[    0.082408] SMP NR_CPUS=2048 [    0.082424] NUMA
[    0.082440] pSeries
[    0.082457] Modules linked in:
[    0.082490] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.9.0-15-generic #16-Ubuntu
[    0.082536] task: c00000021ed01600 task.stack: c00000010d108000
[    0.082575] NIP: c000000000302b88 LR: c000000000270e04 CTR: c00000000016cfd0
[    0.082621] REGS: c00000010d10b2c0 TRAP: 0300   Not tainted (4.9.0-15-generic)
[    0.082666] MSR: 8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>[ 0.082770]   CR: 28424422  XER: 00000000
[    0.082793] CFAR: c0000000003d28b8 DAR: 0000000000000000 DSISR: 40000000 SOFTE: 1
GPR00: c000000000270e04 c00000010d10b540 c00000000141a300 c00000010fff6300
GPR04: 0000000000000000 00000000026012c0 c00000010d10b630 0000000487ab0000
GPR08: 000000010ee90000 c000000001454fd8 0000000000000000 0000000000000000
GPR12: 0000000000004400 c00000000fb80000 00000000026012c0 00000000026012c0
GPR16: 00000000026012c0 0000000000000000 0000000000000000 0000000000000002
GPR20: 000000000000000c 0000000000000000 0000000000000000 00000000024200c0
GPR24: c0000000016eef48 0000000000000000 c00000010fff7d00 00000000026012c0
GPR28: 0000000000000000 c00000010fff7d00 c00000010fff6300 c00000010d10b6d0
NIP [c000000000302b88] mem_cgroup_soft_limit_reclaim+0xf8/0x4f0
[    0.083456] LR [c000000000270e04] do_try_to_free_pages+0x1b4/0x450
[    0.083494] Call Trace:
[    0.083511] [c00000010d10b540] [c00000010d10b640] 0xc00000010d10b640 (unreliable)
[    0.083567] [c00000010d10b610] [c000000000270e04] do_try_to_free_pages+0x1b4/0x450
[    0.083622] [c00000010d10b6b0] [c000000000271198] try_to_free_pages+0xf8/0x270
[    0.083676] [c00000010d10b740] [c000000000259dd8] __alloc_pages_nodemask+0x7a8/0xff0
[    0.083729] [c00000010d10b960] [c0000000002dd274] new_slab+0x104/0x8e0
[    0.083776] [c00000010d10ba40] [c0000000002e03d0] ___slab_alloc+0x620/0x700
[    0.083822] [c00000010d10bb70] [c0000000002e04e4] __slab_alloc+0x34/0x60
[    0.083868] [c00000010d10bba0] [c0000000002e101c] kmem_cache_alloc_node_trace+0xdc/0x310
[    0.083947] [c00000010d10bc00] [c000000000eb8120] mem_cgroup_init+0x158/0x1c8
[    0.083994] [c00000010d10bc40] [c00000000000dde8] do_one_initcall+0x68/0x1d0
[    0.084041] [c00000010d10bd00] [c000000000e84184] kernel_init_freeable+0x278/0x360
[    0.084094] [c00000010d10bdc0] [c00000000000e714] kernel_init+0x24/0x170
[    0.084143] [c00000010d10be30] [c00000000000c0e8] ret_from_kernel_thread+0x5c/0x74
[    0.084195] Instruction dump:
[    0.084220] eb81ffe0 eba1ffe8 ebc1fff0 ebe1fff8 4e800020 3d230001 e9499a42 3d220004
[    0.084300] 3929acd8 794a1f24 7d295214 eac90100 <e9360000> 2fa90000 419eff74 3b200000
[    0.084382] ---[ end trace 342f5208b00d01b6 ]---

This is a chicken-and-egg issue: the kernel tries to free memory while
allocating the per-node data in mem_cgroup_init(), but that reclaim
path calls mem_cgroup_soft_limit_reclaim(), which assumes that this
data has already been allocated.

As mem_cgroup_soft_limit_reclaim() is best effort, it should simply
return when the data is not yet allocated.

This patch also fixes potential NULL pointer dereferences in
mem_cgroup_remove_from_trees() and mem_cgroup_update_tree().
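
The guard described above can be illustrated outside the kernel. The
following is only a user-space sketch, not the kernel code: the names
tree_per_node, soft_limit_nodes, NR_NODES and nr_exceeded are
hypothetical stand-ins for the per-node soft limit data. It shows a
best-effort function that reports "nothing reclaimed" when its per-node
data has not been set up yet, instead of dereferencing a NULL pointer.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES 4	/* hypothetical node count for this sketch */

/* Hypothetical stand-in for struct mem_cgroup_tree_per_node. */
struct tree_per_node {
	int nr_exceeded;	/* stand-in for the tree of over-limit groups */
};

/* Per-node table; every slot stays NULL until init code fills it in. */
static struct tree_per_node *soft_limit_nodes[NR_NODES];

static struct tree_per_node *tree_node(int nid)
{
	assert(nid >= 0 && nid < NR_NODES);
	return soft_limit_nodes[nid];
}

/*
 * Best-effort reclaim: return 0 when the node data is missing or empty
 * rather than touching memory that was never allocated.
 */
static unsigned long soft_limit_reclaim(int nid)
{
	struct tree_per_node *mctz = tree_node(nid);

	if (!mctz || mctz->nr_exceeded == 0)
		return 0;

	/* Pretend one page is reclaimed per over-limit group. */
	return (unsigned long)mctz->nr_exceeded;
}
```

Calling soft_limit_reclaim() before any slot is populated returns 0,
which mirrors the early-boot situation where reclaim runs before
mem_cgroup_init() has finished.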

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/memcontrol.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 45867e439d31..a9f10fde44a6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -465,6 +465,8 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
 	struct mem_cgroup_tree_per_node *mctz;
 
 	mctz = soft_limit_tree_from_page(page);
+	if (!mctz)
+		return;
 	/*
 	 * Necessary to update all ancestors when hierarchy is used.
 	 * because their event counter is not touched.
@@ -502,7 +504,8 @@ static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
 	for_each_node(nid) {
 		mz = mem_cgroup_nodeinfo(memcg, nid);
 		mctz = soft_limit_tree_node(nid);
-		mem_cgroup_remove_exceeded(mz, mctz);
+		if (mctz)
+			mem_cgroup_remove_exceeded(mz, mctz);
 	}
 }
 
@@ -2557,7 +2560,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 	 * is empty. Do it lockless to prevent lock bouncing. Races
 	 * are acceptable as soft limit is best effort anyway.
 	 */
-	if (RB_EMPTY_ROOT(&mctz->rb_root))
+	if (!mctz || RB_EMPTY_ROOT(&mctz->rb_root))
 		return 0;
 
 	/*
-- 
2.7.4


* [PATCH v2 2/2] mm/cgroup: delay soft limit data allocation
  2017-02-23 13:36 [PATCH v2 0/2] mm/cgroup soft limit data allocation Laurent Dufour
  2017-02-23 13:36 ` [PATCH v2 1/2] mm/cgroup: avoid panic when init with low memory Laurent Dufour
@ 2017-02-23 13:36 ` Laurent Dufour
  2017-02-23 15:31   ` Michal Hocko
  1 sibling, 1 reply; 9+ messages in thread
From: Laurent Dufour @ 2017-02-23 13:36 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Balbir Singh
  Cc: cgroups, linux-mm, linux-kernel

Until a soft limit is set on a cgroup, the soft limit data is unused,
so delay its allocation until a limit is set.
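
The general shape of such a deferred allocation can be sketched in
user space. This is an illustration under assumptions, not the patch
itself: it uses a pthread mutex and calloc() in place of the kernel
primitives, and NR_NODES, struct tree and struct tree_per_node are
hypothetical stand-ins. The table is zeroed on allocation so that the
cleanup path can safely free slots that were never filled.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

#define NR_NODES 4	/* hypothetical; the kernel table has MAX_NUMNODES slots */

struct tree_per_node { int nr_exceeded; };
struct tree { struct tree_per_node *per_node[NR_NODES]; };

static struct tree *soft_limit_tree;	/* stays NULL until first use */
static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;

/* Allocate the table on first call; later calls only check the pointer. */
static bool soft_limit_initialize(void)
{
	struct tree *tree;
	bool ret = false;
	int node;

	pthread_mutex_lock(&init_lock);
	if (soft_limit_tree) {
		ret = true;
		goto out;
	}

	/* Zeroed table: unset slots are NULL, so cleanup can free them all. */
	tree = calloc(1, sizeof(*tree));
	if (!tree)
		goto out;

	for (node = 0; node < NR_NODES; node++) {
		tree->per_node[node] = calloc(1, sizeof(*tree->per_node[node]));
		if (!tree->per_node[node])
			goto cleanup;
	}
	soft_limit_tree = tree;
	ret = true;
out:
	pthread_mutex_unlock(&init_lock);
	return ret;
cleanup:
	for (node = 0; node < NR_NODES; node++)
		free(tree->per_node[node]);	/* free(NULL) is a no-op */
	free(tree);
	goto out;
}
```

A caller that sets a limit would invoke soft_limit_initialize() first
and report an out-of-memory error if it returns false; a second call is
cheap because it only takes the lock and sees the existing table.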

Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 mm/memcontrol.c | 67 ++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 52 insertions(+), 15 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a9f10fde44a6..c639c898809d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -142,7 +142,7 @@ struct mem_cgroup_tree {
 	struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES];
 };
 
-static struct mem_cgroup_tree soft_limit_tree __read_mostly;
+static struct mem_cgroup_tree *soft_limit_tree __read_mostly;
 
 /* for OOM */
 struct mem_cgroup_eventfd_list {
@@ -381,10 +381,52 @@ mem_cgroup_page_nodeinfo(struct mem_cgroup *memcg, struct page *page)
 	return memcg->nodeinfo[nid];
 }
 
+static bool soft_limit_initialize(void)
+{
+	static DEFINE_MUTEX(soft_limit_mutex);
+	struct mem_cgroup_tree *tree;
+	bool ret = true;
+	int node;
+
+	mutex_lock(&soft_limit_mutex);
+	if (soft_limit_tree)
+		goto bail;
+
+	tree = kzalloc(sizeof(*soft_limit_tree), GFP_KERNEL);
+	if (!tree) {
+		ret = false;
+		goto bail;
+	}
+	for_each_node(node) {
+		struct mem_cgroup_tree_per_node *rtpn;
+
+		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL,
+				    node_online(node) ? node : NUMA_NO_NODE);
+		if (!rtpn)
+			goto cleanup;
+
+		rtpn->rb_root = RB_ROOT;
+		spin_lock_init(&rtpn->lock);
+		tree->rb_tree_per_node[node] = rtpn;
+	}
+	WRITE_ONCE(soft_limit_tree, tree);
+bail:
+	mutex_unlock(&soft_limit_mutex);
+	return ret;
+cleanup:
+	for_each_node(node)
+		kfree(tree->rb_tree_per_node[node]);
+	kfree(tree);
+	ret = false;
+	goto bail;
+}
+
 static struct mem_cgroup_tree_per_node *
 soft_limit_tree_node(int nid)
 {
-	return soft_limit_tree.rb_tree_per_node[nid];
+	if (!soft_limit_tree)
+		return NULL;
+	return soft_limit_tree->rb_tree_per_node[nid];
 }
 
 static struct mem_cgroup_tree_per_node *
@@ -392,7 +434,9 @@ soft_limit_tree_from_page(struct page *page)
 {
 	int nid = page_to_nid(page);
 
-	return soft_limit_tree.rb_tree_per_node[nid];
+	if (!soft_limit_tree)
+		return NULL;
+	return soft_limit_tree->rb_tree_per_node[nid];
 }
 
 static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz,
@@ -3003,6 +3047,10 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
 		}
 		break;
 	case RES_SOFT_LIMIT:
+		if (!soft_limit_initialize()) {
+			ret = -ENOMEM;
+			break;
+		}
 		memcg->soft_limit = nr_pages;
 		ret = 0;
 		break;
@@ -5777,7 +5825,7 @@ __setup("cgroup.memory=", cgroup_memory);
  */
 static int __init mem_cgroup_init(void)
 {
-	int cpu, node;
+	int cpu;
 
 #ifndef CONFIG_SLOB
 	/*
@@ -5797,17 +5845,6 @@ static int __init mem_cgroup_init(void)
 		INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work,
 			  drain_local_stock);
 
-	for_each_node(node) {
-		struct mem_cgroup_tree_per_node *rtpn;
-
-		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL,
-				    node_online(node) ? node : NUMA_NO_NODE);
-
-		rtpn->rb_root = RB_ROOT;
-		spin_lock_init(&rtpn->lock);
-		soft_limit_tree.rb_tree_per_node[node] = rtpn;
-	}
-
 	return 0;
 }
 subsys_initcall(mem_cgroup_init);
-- 
2.7.4


* Re: [PATCH v2 1/2] mm/cgroup: avoid panic when init with low memory
  2017-02-23 13:36 ` [PATCH v2 1/2] mm/cgroup: avoid panic when init with low memory Laurent Dufour
@ 2017-02-23 15:12   ` Michal Hocko
  2017-02-23 18:39   ` Johannes Weiner
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Michal Hocko @ 2017-02-23 15:12 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: Johannes Weiner, Vladimir Davydov, Balbir Singh, cgroups,
	linux-mm, linux-kernel

On Thu 23-02-17 14:36:38, Laurent Dufour wrote:
> The system may panic during initialisation when almost all of the
> memory is assigned to huge pages using the kernel command line
> parameter hugepages=xxxx. The panic may look like this:
> 
> [...]
> 
> This is a chicken-and-egg issue: the kernel tries to free memory while
> allocating the per-node data in mem_cgroup_init(), but that reclaim
> path calls mem_cgroup_soft_limit_reclaim(), which assumes that this
> data has already been allocated.
> 
> As mem_cgroup_soft_limit_reclaim() is best effort, it should simply
> return when the data is not yet allocated.
> 
> This patch also fixes potential NULL pointer dereferences in
> mem_cgroup_remove_from_trees() and mem_cgroup_update_tree().
> 
> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  mm/memcontrol.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 45867e439d31..a9f10fde44a6 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -465,6 +465,8 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
>  	struct mem_cgroup_tree_per_node *mctz;
>  
>  	mctz = soft_limit_tree_from_page(page);
> +	if (!mctz)
> +		return;
>  	/*
>  	 * Necessary to update all ancestors when hierarchy is used.
>  	 * because their event counter is not touched.
> @@ -502,7 +504,8 @@ static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
>  	for_each_node(nid) {
>  		mz = mem_cgroup_nodeinfo(memcg, nid);
>  		mctz = soft_limit_tree_node(nid);
> -		mem_cgroup_remove_exceeded(mz, mctz);
> +		if (mctz)
> +			mem_cgroup_remove_exceeded(mz, mctz);
>  	}
>  }
>  
> @@ -2557,7 +2560,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
>  	 * is empty. Do it lockless to prevent lock bouncing. Races
>  	 * are acceptable as soft limit is best effort anyway.
>  	 */
> -	if (RB_EMPTY_ROOT(&mctz->rb_root))
> +	if (!mctz || RB_EMPTY_ROOT(&mctz->rb_root))
>  		return 0;
>  
>  	/*
> -- 
> 2.7.4

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2 2/2] mm/cgroup: delay soft limit data allocation
  2017-02-23 13:36 ` [PATCH v2 2/2] mm/cgroup: delay soft limit data allocation Laurent Dufour
@ 2017-02-23 15:31   ` Michal Hocko
  2017-02-23 19:03     ` Johannes Weiner
  0 siblings, 1 reply; 9+ messages in thread
From: Michal Hocko @ 2017-02-23 15:31 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: Johannes Weiner, Vladimir Davydov, Balbir Singh, cgroups,
	linux-mm, linux-kernel

On Thu 23-02-17 14:36:39, Laurent Dufour wrote:
> Until a soft limit is set on a cgroup, the soft limit data is unused,
> so delay its allocation until a limit is set.

Hmm, I am still undecided whether this is actually worth it. On one hand,
distribution kernels tend to have quite a large NODES_SHIFT (e.g. SLES has
NODES_SHIFT=10), and then we will save 8kB+12kB, which is not a hell of a
lot but always good if we can save that, especially for a rarely used
feature. On the other hand, the code has grown (it was in the __init
section previously), which is a minus.
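
As a rough check of the first figure: assuming the shift above refers
to the kernel's CONFIG_NODES_SHIFT and that pointers are 8 bytes wide,
the rb_tree_per_node[MAX_NUMNODES] pointer table alone accounts for the
8kB. The sketch below only reproduces that arithmetic in user space;
the 12kB for the per-node structures depends on their size and is not
derived here.

```c
#include <assert.h>
#include <stddef.h>

#define NODES_SHIFT 10				/* value quoted for SLES */
#define MAX_NUMNODES (1UL << NODES_SHIFT)	/* 1024 possible nodes */

/* Size in bytes of a table holding one pointer per possible node. */
static size_t node_table_bytes(void)
{
	return MAX_NUMNODES * sizeof(void *);
}
```

With 8-byte pointers this gives 1024 * 8 = 8192 bytes, i.e. 8kB.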

What do you think Johannes?

This would be useful info for the changelog, btw.

> Suggested-by: Michal Hocko <mhocko@kernel.org>
> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>

The patch looks good to me so feel free to add
Reviewed-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/memcontrol.c | 67 ++++++++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 52 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a9f10fde44a6..c639c898809d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -142,7 +142,7 @@ struct mem_cgroup_tree {
>  	struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES];
>  };
>  
> -static struct mem_cgroup_tree soft_limit_tree __read_mostly;
> +static struct mem_cgroup_tree *soft_limit_tree __read_mostly;
>  
>  /* for OOM */
>  struct mem_cgroup_eventfd_list {
> @@ -381,10 +381,52 @@ mem_cgroup_page_nodeinfo(struct mem_cgroup *memcg, struct page *page)
>  	return memcg->nodeinfo[nid];
>  }
>  
> +static bool soft_limit_initialize(void)
> +{
> +	static DEFINE_MUTEX(soft_limit_mutex);
> +	struct mem_cgroup_tree *tree;
> +	bool ret = true;
> +	int node;
> +
> +	mutex_lock(&soft_limit_mutex);
> +	if (soft_limit_tree)
> +		goto bail;
> +
> +	tree = kzalloc(sizeof(*soft_limit_tree), GFP_KERNEL);
> +	if (!tree) {
> +		ret = false;
> +		goto bail;
> +	}
> +	for_each_node(node) {
> +		struct mem_cgroup_tree_per_node *rtpn;
> +
> +		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL,
> +				    node_online(node) ? node : NUMA_NO_NODE);
> +		if (!rtpn)
> +			goto cleanup;
> +
> +		rtpn->rb_root = RB_ROOT;
> +		spin_lock_init(&rtpn->lock);
> +		tree->rb_tree_per_node[node] = rtpn;
> +	}
> +	WRITE_ONCE(soft_limit_tree, tree);
> +bail:
> +	mutex_unlock(&soft_limit_mutex);
> +	return ret;
> +cleanup:
> +	for_each_node(node)
> +		kfree(tree->rb_tree_per_node[node]);
> +	kfree(tree);
> +	ret = false;
> +	goto bail;
> +}
> +
>  static struct mem_cgroup_tree_per_node *
>  soft_limit_tree_node(int nid)
>  {
> -	return soft_limit_tree.rb_tree_per_node[nid];
> +	if (!soft_limit_tree)
> +		return NULL;
> +	return soft_limit_tree->rb_tree_per_node[nid];
>  }
>  
>  static struct mem_cgroup_tree_per_node *
> @@ -392,7 +434,9 @@ soft_limit_tree_from_page(struct page *page)
>  {
>  	int nid = page_to_nid(page);
>  
> -	return soft_limit_tree.rb_tree_per_node[nid];
> +	if (!soft_limit_tree)
> +		return NULL;
> +	return soft_limit_tree->rb_tree_per_node[nid];
>  }
>  
>  static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz,
> @@ -3003,6 +3047,10 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
>  		}
>  		break;
>  	case RES_SOFT_LIMIT:
> +		if (!soft_limit_initialize()) {
> +			ret = -ENOMEM;
> +			break;
> +		}
>  		memcg->soft_limit = nr_pages;
>  		ret = 0;
>  		break;
> @@ -5777,7 +5825,7 @@ __setup("cgroup.memory=", cgroup_memory);
>   */
>  static int __init mem_cgroup_init(void)
>  {
> -	int cpu, node;
> +	int cpu;
>  
>  #ifndef CONFIG_SLOB
>  	/*
> @@ -5797,17 +5845,6 @@ static int __init mem_cgroup_init(void)
>  		INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work,
>  			  drain_local_stock);
>  
> -	for_each_node(node) {
> -		struct mem_cgroup_tree_per_node *rtpn;
> -
> -		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL,
> -				    node_online(node) ? node : NUMA_NO_NODE);
> -
> -		rtpn->rb_root = RB_ROOT;
> -		spin_lock_init(&rtpn->lock);
> -		soft_limit_tree.rb_tree_per_node[node] = rtpn;
> -	}
> -
>  	return 0;
>  }
>  subsys_initcall(mem_cgroup_init);
> -- 
> 2.7.4

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2 1/2] mm/cgroup: avoid panic when init with low memory
  2017-02-23 13:36 ` [PATCH v2 1/2] mm/cgroup: avoid panic when init with low memory Laurent Dufour
  2017-02-23 15:12   ` Michal Hocko
@ 2017-02-23 18:39   ` Johannes Weiner
  2017-02-24 11:10   ` Michal Hocko
  2017-02-24 13:42   ` Balbir Singh
  3 siblings, 0 replies; 9+ messages in thread
From: Johannes Weiner @ 2017-02-23 18:39 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: Michal Hocko, Vladimir Davydov, Balbir Singh, cgroups, linux-mm,
	linux-kernel

On Thu, Feb 23, 2017 at 02:36:38PM +0100, Laurent Dufour wrote:
> The system may panic during initialisation when almost all of the
> memory is assigned to huge pages using the kernel command line
> parameter hugepages=xxxx. The panic may look like this:
> 
> [...]
> 
> This is a chicken-and-egg issue: the kernel tries to free memory while
> allocating the per-node data in mem_cgroup_init(), but that reclaim
> path calls mem_cgroup_soft_limit_reclaim(), which assumes that this
> data has already been allocated.
> 
> As mem_cgroup_soft_limit_reclaim() is best effort, it should simply
> return when the data is not yet allocated.
> 
> This patch also fixes potential NULL pointer dereferences in
> mem_cgroup_remove_from_trees() and mem_cgroup_update_tree().
> 
> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [PATCH v2 2/2] mm/cgroup: delay soft limit data allocation
  2017-02-23 15:31   ` Michal Hocko
@ 2017-02-23 19:03     ` Johannes Weiner
  0 siblings, 0 replies; 9+ messages in thread
From: Johannes Weiner @ 2017-02-23 19:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Laurent Dufour, Vladimir Davydov, Balbir Singh, cgroups,
	linux-mm, linux-kernel

On Thu, Feb 23, 2017 at 04:31:07PM +0100, Michal Hocko wrote:
> On Thu 23-02-17 14:36:39, Laurent Dufour wrote:
> > Until a soft limit is set on a cgroup, the soft limit data is unused,
> > so delay its allocation until a limit is set.
> 
> Hmm, I am still undecided whether this is actually worth it. On one hand,
> distribution kernels tend to have quite a large NODES_SHIFT (e.g. SLES has
> NODES_SHIFT=10), and then we will save 8kB+12kB, which is not a hell of a
> lot but always good if we can save that, especially for a rarely used
> feature. On the other hand, the code has grown (it was in the __init
> section previously), which is a minus.
> 
> What do you think Johannes?

Hohumm, saving 5 pages on a NUMA machine vs. the additional complexity
and the increased risk of memory problems when somebody sets up a soft
limit after some uptime... I don't think I can give a strong yes or no
on this one, so inertia wins for me; I'd just leave it alone.


* Re: [PATCH v2 1/2] mm/cgroup: avoid panic when init with low memory
  2017-02-23 13:36 ` [PATCH v2 1/2] mm/cgroup: avoid panic when init with low memory Laurent Dufour
  2017-02-23 15:12   ` Michal Hocko
  2017-02-23 18:39   ` Johannes Weiner
@ 2017-02-24 11:10   ` Michal Hocko
  2017-02-24 13:42   ` Balbir Singh
  3 siblings, 0 replies; 9+ messages in thread
From: Michal Hocko @ 2017-02-24 11:10 UTC (permalink / raw)
  To: Laurent Dufour, Andrew Morton
  Cc: Johannes Weiner, Vladimir Davydov, Balbir Singh, cgroups,
	linux-mm, linux-kernel

Andrew, could you pick up this patch?

On Thu 23-02-17 14:36:38, Laurent Dufour wrote:
> The system may panic during initialisation when almost all of the
> memory is assigned to huge pages using the kernel command line
> parameter hugepages=xxxx. The panic may look like this:
> 
> [...]
> 
> This is a chicken-and-egg issue: the kernel tries to free memory while
> allocating the per-node data in mem_cgroup_init(), but that reclaim
> path calls mem_cgroup_soft_limit_reclaim(), which assumes that this
> data has already been allocated.
> 
> As mem_cgroup_soft_limit_reclaim() is best effort, it should simply
> return when the data is not yet allocated.
> 
> This patch also fixes potential NULL pointer dereferences in
> mem_cgroup_remove_from_trees() and mem_cgroup_update_tree().
> 
> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
> ---
>  mm/memcontrol.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 45867e439d31..a9f10fde44a6 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -465,6 +465,8 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
>  	struct mem_cgroup_tree_per_node *mctz;
>  
>  	mctz = soft_limit_tree_from_page(page);
> +	if (!mctz)
> +		return;
>  	/*
>  	 * Necessary to update all ancestors when hierarchy is used.
>  	 * because their event counter is not touched.
> @@ -502,7 +504,8 @@ static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
>  	for_each_node(nid) {
>  		mz = mem_cgroup_nodeinfo(memcg, nid);
>  		mctz = soft_limit_tree_node(nid);
> -		mem_cgroup_remove_exceeded(mz, mctz);
> +		if (mctz)
> +			mem_cgroup_remove_exceeded(mz, mctz);
>  	}
>  }
>  
> @@ -2557,7 +2560,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
>  	 * is empty. Do it lockless to prevent lock bouncing. Races
>  	 * are acceptable as soft limit is best effort anyway.
>  	 */
> -	if (RB_EMPTY_ROOT(&mctz->rb_root))
> +	if (!mctz || RB_EMPTY_ROOT(&mctz->rb_root))
>  		return 0;
>  
>  	/*
> -- 
> 2.7.4

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2 1/2] mm/cgroup: avoid panic when init with low memory
  2017-02-23 13:36 ` [PATCH v2 1/2] mm/cgroup: avoid panic when init with low memory Laurent Dufour
                     ` (2 preceding siblings ...)
  2017-02-24 11:10   ` Michal Hocko
@ 2017-02-24 13:42   ` Balbir Singh
  3 siblings, 0 replies; 9+ messages in thread
From: Balbir Singh @ 2017-02-24 13:42 UTC (permalink / raw)
  To: Laurent Dufour, Johannes Weiner, Michal Hocko, Vladimir Davydov
  Cc: cgroups, linux-mm, linux-kernel



On 24/02/17 00:36, Laurent Dufour wrote:
> The system may panic when initialisation is done when almost all the
> memory is assigned to the huge pages using the kernel command line
> parameter hugepages=xxxx. Panic may occur like this:
> 
> [    0.082289] Unable to handle kernel paging request for data at address 0x00000000
> [    0.082338] Faulting instruction address: 0xc000000000302b88
> [    0.082377] Oops: Kernel access of bad area, sig: 11 [#1]
> [    0.082408] SMP NR_CPUS=2048 [    0.082424] NUMA
> [    0.082440] pSeries
> [    0.082457] Modules linked in:
> [    0.082490] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.9.0-15-generic #16-Ubuntu
> [    0.082536] task: c00000021ed01600 task.stack: c00000010d108000
> [    0.082575] NIP: c000000000302b88 LR: c000000000270e04 CTR: c00000000016cfd0
> [    0.082621] REGS: c00000010d10b2c0 TRAP: 0300   Not tainted (4.9.0-15-generic)
> [    0.082666] MSR: 8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>[ 0.082770]   CR: 28424422  XER: 00000000
> [    0.082793] CFAR: c0000000003d28b8 DAR: 0000000000000000 DSISR: 40000000 SOFTE: 1
> GPR00: c000000000270e04 c00000010d10b540 c00000000141a300 c00000010fff6300
> GPR04: 0000000000000000 00000000026012c0 c00000010d10b630 0000000487ab0000
> GPR08: 000000010ee90000 c000000001454fd8 0000000000000000 0000000000000000
> GPR12: 0000000000004400 c00000000fb80000 00000000026012c0 00000000026012c0
> GPR16: 00000000026012c0 0000000000000000 0000000000000000 0000000000000002
> GPR20: 000000000000000c 0000000000000000 0000000000000000 00000000024200c0
> GPR24: c0000000016eef48 0000000000000000 c00000010fff7d00 00000000026012c0
> GPR28: 0000000000000000 c00000010fff7d00 c00000010fff6300 c00000010d10b6d0
> NIP [c000000000302b88] mem_cgroup_soft_limit_reclaim+0xf8/0x4f0
> [    0.083456] LR [c000000000270e04] do_try_to_free_pages+0x1b4/0x450
> [    0.083494] Call Trace:
> [    0.083511] [c00000010d10b540] [c00000010d10b640] 0xc00000010d10b640 (unreliable)
> [    0.083567] [c00000010d10b610] [c000000000270e04] do_try_to_free_pages+0x1b4/0x450
> [    0.083622] [c00000010d10b6b0] [c000000000271198] try_to_free_pages+0xf8/0x270
> [    0.083676] [c00000010d10b740] [c000000000259dd8] __alloc_pages_nodemask+0x7a8/0xff0
> [    0.083729] [c00000010d10b960] [c0000000002dd274] new_slab+0x104/0x8e0
> [    0.083776] [c00000010d10ba40] [c0000000002e03d0] ___slab_alloc+0x620/0x700
> [    0.083822] [c00000010d10bb70] [c0000000002e04e4] __slab_alloc+0x34/0x60
> [    0.083868] [c00000010d10bba0] [c0000000002e101c] kmem_cache_alloc_node_trace+0xdc/0x310
> [    0.083947] [c00000010d10bc00] [c000000000eb8120] mem_cgroup_init+0x158/0x1c8
> [    0.083994] [c00000010d10bc40] [c00000000000dde8] do_one_initcall+0x68/0x1d0
> [    0.084041] [c00000010d10bd00] [c000000000e84184] kernel_init_freeable+0x278/0x360
> [    0.084094] [c00000010d10bdc0] [c00000000000e714] kernel_init+0x24/0x170
> [    0.084143] [c00000010d10be30] [c00000000000c0e8] ret_from_kernel_thread+0x5c/0x74
> [    0.084195] Instruction dump:
> [    0.084220] eb81ffe0 eba1ffe8 ebc1fff0 ebe1fff8 4e800020 3d230001 e9499a42 3d220004
> [    0.084300] 3929acd8 794a1f24 7d295214 eac90100 <e9360000> 2fa90000 419eff74 3b200000
> [    0.084382] ---[ end trace 342f5208b00d01b6 ]---
> 
> This is a chicken and egg issue where the kernel tries to get free
> memory when allocating per node data in mem_cgroup_init(), but in that
> path mem_cgroup_soft_limit_reclaim() is called which assumes that
> these data are allocated.
> 
> As mem_cgroup_soft_limit_reclaim() is best effort, it should return
> when these data are not yet allocated.
> 
> This patch also fixes potential null pointer access in
> mem_cgroup_remove_from_trees() and mem_cgroup_update_tree().
> 
> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
> ---
Acked-by: Balbir Singh <bsingharora@gmail.com>
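
For readers following the thread, the guard described in the quoted
changelog can be sketched in user space as below. This is only an
illustration under stated assumptions, not the kernel code: MAX_NODES,
struct soft_limit_tree, soft_limit_trees[], soft_limit_tree_alloc() and
soft_limit_reclaim() are hypothetical stand-ins, and the call to the
reclaim function from inside the allocation helper merely simulates the
page allocator entering reclaim before the per-node pointer has been
published, as in the call trace above.

```c
#include <stddef.h>
#include <stdlib.h>

#define MAX_NODES 4 /* hypothetical node count for this sketch */

/* Hypothetical stand-in for the per-node soft limit tree. */
struct soft_limit_tree {
	int nr_exceeded; /* groups currently over their soft limit */
};

/* Per-node pointers start out NULL, i.e. "not yet allocated". */
static struct soft_limit_tree *soft_limit_trees[MAX_NODES];

/* May return NULL when the node's tree has not been allocated yet. */
static struct soft_limit_tree *soft_limit_tree_node(int nid)
{
	if (nid < 0 || nid >= MAX_NODES)
		return NULL;
	return soft_limit_trees[nid];
}

/*
 * Best effort: report 0 pages reclaimed when the tree is missing or
 * empty, instead of dereferencing a NULL pointer.
 */
static unsigned long soft_limit_reclaim(int nid)
{
	struct soft_limit_tree *t = soft_limit_tree_node(nid);
	unsigned long reclaimed;

	if (!t || t->nr_exceeded == 0)
		return 0;

	reclaimed = (unsigned long)t->nr_exceeded;
	t->nr_exceeded = 0;
	return reclaimed;
}

/* Allocate the tree for one node; returns 0 on success, -1 on failure. */
static int soft_limit_tree_alloc(int nid)
{
	struct soft_limit_tree *t;

	if (nid < 0 || nid >= MAX_NODES)
		return -1;
	if (soft_limit_trees[nid])
		return 0; /* already allocated */

	/*
	 * Simulated memory pressure: reclaim runs while the pointer is
	 * still NULL. Without the !t check above this would crash.
	 */
	(void)soft_limit_reclaim(nid);

	t = calloc(1, sizeof(*t));
	if (!t)
		return -1;

	soft_limit_trees[nid] = t;
	return 0;
}
```

In this sketch, calling soft_limit_reclaim() on a node whose tree was
never allocated simply returns 0, which mirrors the
"!mctz || RB_EMPTY_ROOT(...)" check added by the patch.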


Thread overview: 9+ messages
2017-02-23 13:36 [PATCH v2 0/2] mm/cgroup soft limit data allocation Laurent Dufour
2017-02-23 13:36 ` [PATCH v2 1/2] mm/cgroup: avoid panic when init with low memory Laurent Dufour
2017-02-23 15:12   ` Michal Hocko
2017-02-23 18:39   ` Johannes Weiner
2017-02-24 11:10   ` Michal Hocko
2017-02-24 13:42   ` Balbir Singh
2017-02-23 13:36 ` [PATCH v2 2/2] mm/cgroup: delay soft limit data allocation Laurent Dufour
2017-02-23 15:31   ` Michal Hocko
2017-02-23 19:03     ` Johannes Weiner
