linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	alan@lxorguk.ukuu.org.uk, Tang Chen <tangchen@cn.fujitsu.com>,
	Hugh Dickins <hughd@google.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Konstantin Khlebnikov <khlebnikov@openvz.org>,
	Wen Congyang <wency@cn.fujitsu.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: [ 49/83] memcg: fix hotplugged memory zone oops
Date: Wed, 21 Nov 2012 16:42:11 -0800	[thread overview]
Message-ID: <20121122004217.979865059@linuxfoundation.org> (raw)
In-Reply-To: <20121122004212.371862690@linuxfoundation.org>

3.6-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Hugh Dickins <hughd@google.com>

commit bea8c150a7efbc0f204e709b7274fe273f55e0d3 upstream.

When MEMCG is configured on (even when it's disabled by boot option),
when adding or removing a page to/from its lru list, the zone pointer
used for stats updates is nowadays taken from the struct lruvec.  (On
many configurations, calculating zone from page is slower.)

But we have no code to update all the lruvecs (per zone, per memcg) when
a memory node is hotadded.  Here's an extract from the oops which
results when running numactl to bind a program to a newly onlined node:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000f60
  IP:  __mod_zone_page_state+0x9/0x60
  Pid: 1219, comm: numactl Not tainted 3.6.0-rc5+ #180 Bochs Bochs
  Process numactl (pid: 1219, threadinfo ffff880039abc000, task ffff8800383c4ce0)
  Call Trace:
    __pagevec_lru_add_fn+0xdf/0x140
    pagevec_lru_move_fn+0xb1/0x100
    __pagevec_lru_add+0x1c/0x30
    lru_add_drain_cpu+0xa3/0x130
    lru_add_drain+0x2f/0x40
   ...

The natural solution might be to use a memcg callback whenever memory is
hotadded; but that solution has not been scoped out, and it happens that
we do have an easy location at which to update lruvec->zone.  The lruvec
pointer is discovered either by mem_cgroup_zone_lruvec() or by
mem_cgroup_page_lruvec(), and both of those do know the right zone.

So check and set lruvec->zone in those; and remove the inadequate
attempt to set lruvec->zone from lruvec_init(), which is called before
NODE_DATA(node) has been allocated in such cases.

Ah, there was one exceptionr.  For no particularly good reason,
mem_cgroup_force_empty_list() has its own code for deciding lruvec.
Change it to use the standard mem_cgroup_zone_lruvec() and
mem_cgroup_get_lru_size() too.  In fact it was already safe against such
an oops (the lru lists in danger could only be empty), but we're better
proofed against future changes this way.

I've marked this for stable (3.6) since we introduced the problem in 3.5
(now closed to stable); but I have no idea if this is the only fix
needed to get memory hotadd working with memcg in 3.6, and received no
answer when I enquired twice before.

Reported-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 include/linux/mmzone.h |    2 +-
 mm/memcontrol.c        |   46 +++++++++++++++++++++++++++++++++++-----------
 mm/mmzone.c            |    6 +-----
 mm/page_alloc.c        |    2 +-
 4 files changed, 38 insertions(+), 18 deletions(-)

--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -744,7 +744,7 @@ extern int init_currently_empty_zone(str
 				     unsigned long size,
 				     enum memmap_context context);
 
-extern void lruvec_init(struct lruvec *lruvec, struct zone *zone);
+extern void lruvec_init(struct lruvec *lruvec);
 
 static inline struct zone *lruvec_zone(struct lruvec *lruvec)
 {
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1061,12 +1061,24 @@ struct lruvec *mem_cgroup_zone_lruvec(st
 				      struct mem_cgroup *memcg)
 {
 	struct mem_cgroup_per_zone *mz;
+	struct lruvec *lruvec;
 
-	if (mem_cgroup_disabled())
-		return &zone->lruvec;
+	if (mem_cgroup_disabled()) {
+		lruvec = &zone->lruvec;
+		goto out;
+	}
 
 	mz = mem_cgroup_zoneinfo(memcg, zone_to_nid(zone), zone_idx(zone));
-	return &mz->lruvec;
+	lruvec = &mz->lruvec;
+out:
+	/*
+	 * Since a node can be onlined after the mem_cgroup was created,
+	 * we have to be prepared to initialize lruvec->zone here;
+	 * and if offlined then reonlined, we need to reinitialize it.
+	 */
+	if (unlikely(lruvec->zone != zone))
+		lruvec->zone = zone;
+	return lruvec;
 }
 
 /*
@@ -1093,9 +1105,12 @@ struct lruvec *mem_cgroup_page_lruvec(st
 	struct mem_cgroup_per_zone *mz;
 	struct mem_cgroup *memcg;
 	struct page_cgroup *pc;
+	struct lruvec *lruvec;
 
-	if (mem_cgroup_disabled())
-		return &zone->lruvec;
+	if (mem_cgroup_disabled()) {
+		lruvec = &zone->lruvec;
+		goto out;
+	}
 
 	pc = lookup_page_cgroup(page);
 	memcg = pc->mem_cgroup;
@@ -1113,7 +1128,16 @@ struct lruvec *mem_cgroup_page_lruvec(st
 		pc->mem_cgroup = memcg = root_mem_cgroup;
 
 	mz = page_cgroup_zoneinfo(memcg, page);
-	return &mz->lruvec;
+	lruvec = &mz->lruvec;
+out:
+	/*
+	 * Since a node can be onlined after the mem_cgroup was created,
+	 * we have to be prepared to initialize lruvec->zone here;
+	 * and if offlined then reonlined, we need to reinitialize it.
+	 */
+	if (unlikely(lruvec->zone != zone))
+		lruvec->zone = zone;
+	return lruvec;
 }
 
 /**
@@ -3703,17 +3727,17 @@ unsigned long mem_cgroup_soft_limit_recl
 static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
 				int node, int zid, enum lru_list lru)
 {
-	struct mem_cgroup_per_zone *mz;
+	struct lruvec *lruvec;
 	unsigned long flags, loop;
 	struct list_head *list;
 	struct page *busy;
 	struct zone *zone;
 
 	zone = &NODE_DATA(node)->node_zones[zid];
-	mz = mem_cgroup_zoneinfo(memcg, node, zid);
-	list = &mz->lruvec.lists[lru];
+	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+	list = &lruvec->lists[lru];
 
-	loop = mz->lru_size[lru];
+	loop = mem_cgroup_get_lru_size(lruvec, lru);
 	/* give some margin against EBUSY etc...*/
 	loop += 256;
 	busy = NULL;
@@ -4751,7 +4775,7 @@ static int alloc_mem_cgroup_per_zone_inf
 
 	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
 		mz = &pn->zoneinfo[zone];
-		lruvec_init(&mz->lruvec, &NODE_DATA(node)->node_zones[zone]);
+		lruvec_init(&mz->lruvec);
 		mz->usage_in_excess = 0;
 		mz->on_tree = false;
 		mz->memcg = memcg;
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -87,7 +87,7 @@ int memmap_valid_within(unsigned long pf
 }
 #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
 
-void lruvec_init(struct lruvec *lruvec, struct zone *zone)
+void lruvec_init(struct lruvec *lruvec)
 {
 	enum lru_list lru;
 
@@ -95,8 +95,4 @@ void lruvec_init(struct lruvec *lruvec,
 
 	for_each_lru(lru)
 		INIT_LIST_HEAD(&lruvec->lists[lru]);
-
-#ifdef CONFIG_MEMCG
-	lruvec->zone = zone;
-#endif
 }
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4456,7 +4456,7 @@ static void __paginginit free_area_init_
 		zone->zone_pgdat = pgdat;
 
 		zone_pcp_init(zone);
-		lruvec_init(&zone->lruvec, zone);
+		lruvec_init(&zone->lruvec);
 		if (!size)
 			continue;
 



  parent reply	other threads:[~2012-11-22 22:47 UTC|newest]

Thread overview: 92+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-11-22  0:41 [ 00/83] 3.6.8-stable review Greg Kroah-Hartman
2012-11-22  0:41 ` [ 01/83] mm: bugfix: set current->reclaim_state to NULL while returning from kswapd() Greg Kroah-Hartman
2012-11-22  0:41 ` [ 02/83] libata-acpi: Fix NULL ptr derference in ata_acpi_dev_handle Greg Kroah-Hartman
2012-11-22  0:41 ` [ 03/83] xfs: drop buffer io reference when a bad bio is built Greg Kroah-Hartman
2012-11-22  0:41 ` [ 04/83] mac80211: sync acccess to tx_filtered/ps_tx_buf queues Greg Kroah-Hartman
2012-11-22  0:41 ` [ 05/83] mac80211: dont send null data packet when not associated Greg Kroah-Hartman
2012-11-22  0:41 ` [ 06/83] mac80211: call skb_dequeue/ieee80211_free_txskb instead of __skb_queue_purge Greg Kroah-Hartman
2012-11-22  0:41 ` [ 07/83] PCI/PM: Fix deadlock when unbinding device if parent in D3cold Greg Kroah-Hartman
2012-11-22  0:41 ` [ 08/83] PCI/PM: Resume device before shutdown Greg Kroah-Hartman
2012-11-22  0:41 ` [ 09/83] PCI/PM: Fix proc config reg access for D3cold and bridge suspending Greg Kroah-Hartman
2012-11-22  0:41 ` [ 10/83] fanotify: fix missing break Greg Kroah-Hartman
2012-11-22  0:41 ` [ 11/83] module: fix out-by-one error in kallsyms Greg Kroah-Hartman
2012-11-23 10:35   ` satoru takeuchi
2012-11-26 18:43     ` Greg Kroah-Hartman
2012-12-03  0:04     ` Rusty Russell
2012-11-22  0:41 ` [ 12/83] virtio: Dont access index after unregister Greg Kroah-Hartman
2012-11-22  0:41 ` [ 13/83] cifs: fix potential buffer overrun in cifs.idmap handling code Greg Kroah-Hartman
2012-11-22  0:41 ` [ 14/83] cifs: Do not lookup hashed negative dentry in cifs_atomic_open Greg Kroah-Hartman
2012-11-22  0:41 ` [ 15/83] crypto: cryptd - disable softirqs in cryptd_queue_worker to prevent data corruption Greg Kroah-Hartman
2012-11-22  0:41 ` [ 16/83] ARM: at91/AT91SAM9G45: fix crypto peripherals irq issue due to sparse irq support Greg Kroah-Hartman
2012-11-22  0:41 ` [ 17/83] ptp: update adjfreq callback description Greg Kroah-Hartman
2012-11-22  0:41 ` [ 18/83] ALSA: hda: Cirrus: Fix coefficient index for beep configuration Greg Kroah-Hartman
2012-11-22  0:41 ` [ 19/83] ALSA: HDA: Fix digital microphone on CS420x Greg Kroah-Hartman
2012-11-22  0:41 ` [ 20/83] ALSA: hda - Force to reset IEC958 status bits for AD codecs Greg Kroah-Hartman
2012-11-22  0:41 ` [ 21/83] ALSA: hda - Fix empty DAC filling in patch_via.c Greg Kroah-Hartman
2012-11-22  0:41 ` [ 22/83] ALSA: hda - Fix invalid connections in VT1802 codec Greg Kroah-Hartman
2012-11-22  0:41 ` [ 23/83] ALSA: hda - Improve HP depop when system enter to S3 Greg Kroah-Hartman
2012-11-22  0:41 ` [ 24/83] ALSA: hda - Add new codec ALC668 and ALC900 (default name ALC1150) Greg Kroah-Hartman
2012-11-22  0:41 ` [ 25/83] ALSA: hda - Add a missing quirk entry for iMac 9,1 Greg Kroah-Hartman
2012-11-22  0:41 ` [ 26/83] ASoC: wm8978: pll incorrectly configured when codec is master Greg Kroah-Hartman
2012-11-22  0:41 ` [ 27/83] ASoC: cs42l52: fix the return value of cs42l52_set_fmt() Greg Kroah-Hartman
2012-11-22  0:41 ` [ 28/83] ASoC: dapm: Use card_list during DAPM shutdown Greg Kroah-Hartman
2012-11-22  0:41 ` [ 29/83] ASoC: core: Double control update err for snd_soc_put_volsw_sx Greg Kroah-Hartman
2012-11-22  0:41 ` [ 30/83] UBIFS: fix mounting problems after power cuts Greg Kroah-Hartman
2012-11-22  0:41 ` [ 31/83] UBIFS: introduce categorized lprops counter Greg Kroah-Hartman
2012-11-22  0:41 ` [ 32/83] pstore: Fix NULL pointer dereference in console writes Greg Kroah-Hartman
2012-11-22  0:41 ` [ 33/83] regulator: fix voltage check in regulator_is_supported_voltage() Greg Kroah-Hartman
2012-11-22  0:41 ` [ 34/83] i2c-mux-pinctrl: Fix probe error path Greg Kroah-Hartman
2012-11-22  0:41 ` [ 35/83] ARM: imx: ehci: fix host power mask bit Greg Kroah-Hartman
2012-11-22  4:52   ` Michael D. Burkey
2012-11-26 18:44     ` Greg Kroah-Hartman
2012-11-26 19:17       ` Michael D. Burkey
2012-11-22  0:41 ` [ 36/83] ARM: dt: tegra: fix length of pad control and mux registers Greg Kroah-Hartman
2012-11-22  0:41 ` [ 37/83] Revert "Staging: Android alarm: IOCTL command encoding fix" Greg Kroah-Hartman
2012-11-22  0:42 ` [ 38/83] s390/gup: add missing TASK_SIZE check to get_user_pages_fast() Greg Kroah-Hartman
2012-11-22  0:42 ` [ 39/83] USB: keyspan: fix typo causing GPF on open Greg Kroah-Hartman
2012-11-22  0:42 ` [ 40/83] USB: usb_wwan: fix bulk-urb allocation Greg Kroah-Hartman
2012-11-22  0:42 ` [ 41/83] USB: option: add Novatel E362 and Dell Wireless 5800 USB IDs Greg Kroah-Hartman
2012-11-22  0:42 ` [ 42/83] USB: option: add Alcatel X220/X500D " Greg Kroah-Hartman
2012-11-22  0:42 ` [ 43/83] drm/i915/sdvo: clean up connectors on intel_sdvo_init() failures Greg Kroah-Hartman
2012-11-22  0:42 ` [ 44/83] drm/radeon: fix logic error in atombios_encoders.c Greg Kroah-Hartman
2012-11-22  0:42 ` [ 45/83] tmpfs: fix shmem_getpage_gfp() VM_BUG_ON Greg Kroah-Hartman
2012-11-22  0:42 ` [ 46/83] KVM: x86: Fix invalid secondary exec controls in vmx_cpuid_update() Greg Kroah-Hartman
2012-11-22  0:42 ` [ 47/83] ttm: Clear the ttm page allocated from high memory zone correctly Greg Kroah-Hartman
2012-11-22  0:42 ` [ 48/83] memcg: oom: fix totalpages calculation for memory.swappiness==0 Greg Kroah-Hartman
2012-11-22  0:42 ` Greg Kroah-Hartman [this message]
2012-11-22  0:42 ` [ 50/83] iwlwifi: handle DMA mapping failures Greg Kroah-Hartman
2012-11-22  0:42 ` [ 51/83] wireless: allow 40 MHz on world roaming channels 12/13 Greg Kroah-Hartman
2012-11-22  0:42 ` [ 52/83] Bluetooth: Fix having bogus entries in mgmt_read_index_list reply Greg Kroah-Hartman
2012-11-22  0:42 ` [ 53/83] m68k: fix sigset_t accessor functions Greg Kroah-Hartman
2012-11-22  0:42 ` [ 54/83] ipv4: avoid undefined behavior in do_ip_setsockopt() Greg Kroah-Hartman
2012-11-22  0:42 ` [ 55/83] ipv4/ip_vti.c: VTI fix post-decryption forwarding Greg Kroah-Hartman
2012-11-22  0:42 ` [ 56/83] ipv6: setsockopt(IPIPPROTO_IPV6, IPV6_MINHOPCOUNT) forgot to set return value Greg Kroah-Hartman
2012-11-22  0:42 ` [ 57/83] net: correct check in dev_addr_del() Greg Kroah-Hartman
2012-11-22  0:42 ` [ 58/83] net-rps: Fix brokeness causing OOO packets Greg Kroah-Hartman
2012-11-22  0:42 ` [ 59/83] tcp: fix retransmission in repair mode Greg Kroah-Hartman
2012-11-22  0:42 ` [ 60/83] tcp: handle tcp_net_metrics_init() order-5 memory allocation failures Greg Kroah-Hartman
2012-11-22  0:42 ` [ 61/83] tmpfs: change final i_blocks BUG to WARNING Greg Kroah-Hartman
2012-11-22  0:42 ` [ 62/83] ALSA: usb-audio: Fix crash at re-preparing the PCM stream Greg Kroah-Hartman
2012-11-22  0:42 ` [ 63/83] GFS2: Dont call file_accessed() with a shared glock Greg Kroah-Hartman
2012-11-22  0:42 ` [ 64/83] r8169: use unlimited DMA burst for TX Greg Kroah-Hartman
2012-11-22  0:42 ` [ 65/83] xen/events: fix RCU warning, or Call idle notifier after irq_enter() Greg Kroah-Hartman
2012-11-22  0:42 ` [ 66/83] SCSI: isci: Allow SSP tasks into the task management path Greg Kroah-Hartman
2012-11-22  0:42 ` [ 67/83] tg3: unconditionally select HWMON support when tg3 is enabled Greg Kroah-Hartman
2012-11-22  0:42 ` [ 68/83] r8169: Fix WoL on RTL8168d/8111d Greg Kroah-Hartman
2012-11-22  0:42 ` [ 69/83] r8169: allow multicast packets on sub-8168f chipset Greg Kroah-Hartman
2012-11-22  0:42 ` [ 70/83] netfilter: nf_nat: dont check for port change on ICMP tuples Greg Kroah-Hartman
2012-11-22  0:42 ` [ 71/83] netfilter: xt_TEE: dont use destination address found in header Greg Kroah-Hartman
2012-11-22  0:42 ` [ 72/83] netfilter: nf_conntrack: fix rt_gateway checks for H.323 helper Greg Kroah-Hartman
2012-11-22  0:42 ` [ 73/83] s390/signal: set correct address space control Greg Kroah-Hartman
2012-11-22  0:42 ` [ 74/83] NFC: Use dynamic initialization for rwlocks Greg Kroah-Hartman
2012-11-22  0:42 ` [ 75/83] reiserfs: Fix lock ordering during remount Greg Kroah-Hartman
2012-11-22  0:42 ` [ 76/83] reiserfs: Protect reiserfs_quota_on() with write lock Greg Kroah-Hartman
2012-11-22  0:42 ` [ 77/83] reiserfs: Move quota calls out of " Greg Kroah-Hartman
2012-11-22  0:42 ` [ 78/83] reiserfs: Protect reiserfs_quota_write() with " Greg Kroah-Hartman
2012-11-22  0:42 ` [ 79/83] intel-iommu: Fix lookup in add device Greg Kroah-Hartman
2012-11-22  0:42 ` [ 80/83] selinux: fix sel_netnode_insert() suspicious rcu dereference Greg Kroah-Hartman
2012-11-22  0:42 ` [ 81/83] ACPI video: Ignore errors after _DOD evaluation Greg Kroah-Hartman
2012-11-22 20:40   ` Christoph Biedl
2012-11-22 21:27     ` Greg KH
2012-11-22  0:42 ` [ 82/83] Revert "serial: omap: fix software flow control" Greg Kroah-Hartman
2012-11-22  0:42 ` [ 83/83] ext4: fix metadata checksum calculation for the superblock Greg Kroah-Hartman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20121122004217.979865059@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=akpm@linux-foundation.org \
    --cc=alan@lxorguk.ukuu.org.uk \
    --cc=hannes@cmpxchg.org \
    --cc=hughd@google.com \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=khlebnikov@openvz.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=stable@vger.kernel.org \
    --cc=tangchen@cn.fujitsu.com \
    --cc=torvalds@linux-foundation.org \
    --cc=wency@cn.fujitsu.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).