From: Hugh Dickins <hughd@google.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Wen Congyang <wency@cn.fujitsu.com>,
	linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
	linux-mm@kvack.org, Jiang Liu <liuj97@gmail.com>,
	mhocko@suse.cz, bsingharora@gmail.com,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Konstantin Khlebnikov <khlebnikov@openvz.org>,
	paul.gortmaker@windriver.com, Tang Chen <tangchen@cn.fujitsu.com>
Subject: [PATCH] memcg: fix hotplugged memory zone oops
Date: Thu, 1 Nov 2012 18:28:02 -0700 (PDT)
Message-ID: <alpine.LNX.2.00.1211011822190.20048@eggly.anvils>
In-Reply-To: <20121018220306.GA1739@cmpxchg.org>

When MEMCG is configured on (even when it's disabled by boot option),
the zone pointer used for stats updates when adding a page to, or
removing it from, its lru list is nowadays taken from the struct lruvec.
(On many configurations, calculating zone from page is slower.)
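
To illustrate, the add-to-lru path looks roughly like this (a simplified
sketch in the style of this era's add_page_to_lru_list() from
include/linux/mm_inline.h; details elided, so treat it as a sketch
rather than the exact source):

	static __always_inline void add_page_to_lru_list(struct page *page,
					struct lruvec *lruvec, enum lru_list lru)
	{
		int nr_pages = hpage_nr_pages(page);

		/* account the page against the lru list it is joining */
		mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
		list_add(&page->lru, &lruvec->lists[lru]);
		/* the zone for the stats update comes from the lruvec */
		__mod_zone_page_state(lruvec_zone(lruvec),
				      NR_LRU_BASE + lru, nr_pages);
	}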

But we have no code to update all the lruvecs (per zone, per memcg)
when a memory node is hotadded.  Here's an extract from the oops which
results when running numactl to bind a program to a newly onlined node:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000f60
IP: [<ffffffff811870b9>] __mod_zone_page_state+0x9/0x60
PGD 0 
Oops: 0000 [#1] SMP 
CPU 2 
Pid: 1219, comm: numactl Not tainted 3.6.0-rc5+ #180 Bochs Bochs
Process numactl (pid: 1219, threadinfo ffff880039abc000, task ffff8800383c4ce0)
Stack:
 ffff880039abdaf8 ffffffff8117390f ffff880039abdaf8 000000008167c601
 ffffffff81174162 ffff88003a480f00 0000000000000001 ffff8800395e0000
 ffff88003dbd0e80 0000000000000282 ffff880039abdb48 ffffffff81174181
Call Trace:
 [<ffffffff8117390f>] __pagevec_lru_add_fn+0xdf/0x140
 [<ffffffff81174181>] pagevec_lru_move_fn+0xb1/0x100
 [<ffffffff811741ec>] __pagevec_lru_add+0x1c/0x30
 [<ffffffff81174383>] lru_add_drain_cpu+0xa3/0x130
 [<ffffffff8117443f>] lru_add_drain+0x2f/0x40
 ...
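
Decoding the fault (an inference from the trace and the description
below, not verified against this exact build): lruvec_init() ran before
NODE_DATA(node) was allocated, so lruvec->zone was computed from a NULL
pgdat and ended up a small bogus pointer; the first stats update on the
newly onlined node then dereferenced a field of it.  Conceptually:

	/* on the hotadded node, before this fix: */
	struct lruvec *lruvec = mem_cgroup_page_lruvec(page, zone);
	/* lruvec->zone is &NODE_DATA(node)->node_zones[zid], computed
	 * while NODE_DATA(node) was still NULL: a near-NULL address... */
	__mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, 1);
	/* ...so reading a member of it faults at 0x0f60, as above */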

The natural solution might be to use a memcg callback whenever memory
is hotadded; but that solution has not been scoped out, and it happens
that we do have an easy location at which to update lruvec->zone.  The
lruvec pointer is discovered either by mem_cgroup_zone_lruvec() or by
mem_cgroup_page_lruvec(), and both of those do know the right zone.

So check and set lruvec->zone in those; and remove the inadequate
attempt to set lruvec->zone from lruvec_init(), which is called
before NODE_DATA(node) has been allocated in such cases.

Ah, there was one exception.  For no particularly good reason,
mem_cgroup_force_empty_list() has its own code for deciding lruvec.
Change it to use the standard mem_cgroup_zone_lruvec() and
mem_cgroup_get_lru_size() too.  In fact it was already safe against
such an oops (the lru lists in danger could only be empty),
but we're better proofed against future changes this way.

Reported-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: stable@vger.kernel.org
---
I've marked this for stable (3.6) since we introduced the problem
in 3.5 (now closed to stable); but I have no idea if this is the
only fix needed to get memory hotadd working with memcg in 3.6,
and received no answer when I enquired twice before.

 include/linux/mmzone.h |    2 -
 mm/memcontrol.c        |   46 +++++++++++++++++++++++++++++----------
 mm/mmzone.c            |    6 -----
 mm/page_alloc.c        |    2 -
 4 files changed, 38 insertions(+), 18 deletions(-)

--- 3.7-rc3/include/linux/mmzone.h	2012-10-14 16:16:57.665308933 -0700
+++ linux/include/linux/mmzone.h	2012-11-01 14:31:04.284185741 -0700
@@ -752,7 +752,7 @@ extern int init_currently_empty_zone(str
 				     unsigned long size,
 				     enum memmap_context context);
 
-extern void lruvec_init(struct lruvec *lruvec, struct zone *zone);
+extern void lruvec_init(struct lruvec *lruvec);
 
 static inline struct zone *lruvec_zone(struct lruvec *lruvec)
 {
--- 3.7-rc3/mm/memcontrol.c	2012-10-14 16:16:58.341309118 -0700
+++ linux/mm/memcontrol.c	2012-11-01 14:31:04.284185741 -0700
@@ -1055,12 +1055,24 @@ struct lruvec *mem_cgroup_zone_lruvec(st
 				      struct mem_cgroup *memcg)
 {
 	struct mem_cgroup_per_zone *mz;
+	struct lruvec *lruvec;
 
-	if (mem_cgroup_disabled())
-		return &zone->lruvec;
+	if (mem_cgroup_disabled()) {
+		lruvec = &zone->lruvec;
+		goto out;
+	}
 
 	mz = mem_cgroup_zoneinfo(memcg, zone_to_nid(zone), zone_idx(zone));
-	return &mz->lruvec;
+	lruvec = &mz->lruvec;
+out:
+	/*
+	 * Since a node can be onlined after the mem_cgroup was created,
+	 * we have to be prepared to initialize lruvec->zone here;
+	 * and if offlined then reonlined, we need to reinitialize it.
+	 */
+	if (unlikely(lruvec->zone != zone))
+		lruvec->zone = zone;
+	return lruvec;
 }
 
 /*
@@ -1087,9 +1099,12 @@ struct lruvec *mem_cgroup_page_lruvec(st
 	struct mem_cgroup_per_zone *mz;
 	struct mem_cgroup *memcg;
 	struct page_cgroup *pc;
+	struct lruvec *lruvec;
 
-	if (mem_cgroup_disabled())
-		return &zone->lruvec;
+	if (mem_cgroup_disabled()) {
+		lruvec = &zone->lruvec;
+		goto out;
+	}
 
 	pc = lookup_page_cgroup(page);
 	memcg = pc->mem_cgroup;
@@ -1107,7 +1122,16 @@ struct lruvec *mem_cgroup_page_lruvec(st
 		pc->mem_cgroup = memcg = root_mem_cgroup;
 
 	mz = page_cgroup_zoneinfo(memcg, page);
-	return &mz->lruvec;
+	lruvec = &mz->lruvec;
+out:
+	/*
+	 * Since a node can be onlined after the mem_cgroup was created,
+	 * we have to be prepared to initialize lruvec->zone here;
+	 * and if offlined then reonlined, we need to reinitialize it.
+	 */
+	if (unlikely(lruvec->zone != zone))
+		lruvec->zone = zone;
+	return lruvec;
 }
 
 /**
@@ -3688,17 +3712,17 @@ unsigned long mem_cgroup_soft_limit_recl
 static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
 				int node, int zid, enum lru_list lru)
 {
-	struct mem_cgroup_per_zone *mz;
+	struct lruvec *lruvec;
 	unsigned long flags, loop;
 	struct list_head *list;
 	struct page *busy;
 	struct zone *zone;
 
 	zone = &NODE_DATA(node)->node_zones[zid];
-	mz = mem_cgroup_zoneinfo(memcg, node, zid);
-	list = &mz->lruvec.lists[lru];
+	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+	list = &lruvec->lists[lru];
 
-	loop = mz->lru_size[lru];
+	loop = mem_cgroup_get_lru_size(lruvec, lru);
 	/* give some margin against EBUSY etc...*/
 	loop += 256;
 	busy = NULL;
@@ -4736,7 +4760,7 @@ static int alloc_mem_cgroup_per_zone_inf
 
 	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
 		mz = &pn->zoneinfo[zone];
-		lruvec_init(&mz->lruvec, &NODE_DATA(node)->node_zones[zone]);
+		lruvec_init(&mz->lruvec);
 		mz->usage_in_excess = 0;
 		mz->on_tree = false;
 		mz->memcg = memcg;
--- 3.7-rc3/mm/mmzone.c	2012-09-30 16:47:46.000000000 -0700
+++ linux/mm/mmzone.c	2012-11-01 14:31:04.284185741 -0700
@@ -87,7 +87,7 @@ int memmap_valid_within(unsigned long pf
 }
 #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
 
-void lruvec_init(struct lruvec *lruvec, struct zone *zone)
+void lruvec_init(struct lruvec *lruvec)
 {
 	enum lru_list lru;
 
@@ -95,8 +95,4 @@ void lruvec_init(struct lruvec *lruvec,
 
 	for_each_lru(lru)
 		INIT_LIST_HEAD(&lruvec->lists[lru]);
-
-#ifdef CONFIG_MEMCG
-	lruvec->zone = zone;
-#endif
 }
--- 3.7-rc3/mm/page_alloc.c	2012-10-28 13:48:00.021774166 -0700
+++ linux/mm/page_alloc.c	2012-11-01 14:31:04.284185741 -0700
@@ -4505,7 +4505,7 @@ static void __paginginit free_area_init_
 		zone->zone_pgdat = pgdat;
 
 		zone_pcp_init(zone);
-		lruvec_init(&zone->lruvec, zone);
+		lruvec_init(&zone->lruvec);
 		if (!size)
 			continue;
 
