Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
From: Michal Hocko <mhocko@kernel.org>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Oscar Salvador <osalvador@techadventures.net>,
	Vlastimil Babka <vbabka@suse.cz>,
	Pavel Tatashin <pasha.tatashin@oracle.com>,
	Reza Arbab <arbab@linux.vnet.ibm.com>,
	Igor Mammedov <imammedo@redhat.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>,
	linux-mm@kvack.org, Michal Hocko <mhocko@suse.com>
Subject: [PATCH 1/2] mm, memory_hotplug: make has_unmovable_pages more robust
Date: Wed, 23 May 2018 14:55:54 +0200
Message-ID: <20180523125555.30039-2-mhocko@kernel.org> (raw)
In-Reply-To: <20180523125555.30039-1-mhocko@kernel.org>

From: Michal Hocko <mhocko@suse.com>

Oscar has reported:
: Due to an unfortunate setting with movablecore, memblocks containing bootmem
: memory (pages marked by get_page_bootmem()) ended up marked in zone_movable.
: So while trying to remove that memory, the system failed in do_migrate_range
: and __offline_pages never returned.
:
: This can be reproduced by running
: qemu-system-x86_64 -m 6G,slots=8,maxmem=8G -numa node,mem=4096M -numa node,mem=2048M
: and movablecore=4G kernel command line
:
: linux kernel: BIOS-provided physical RAM map:
: linux kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
: linux kernel: BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
: linux kernel: BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
: linux kernel: BIOS-e820: [mem 0x0000000000100000-0x00000000bffdffff] usable
: linux kernel: BIOS-e820: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
: linux kernel: BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
: linux kernel: BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
: linux kernel: BIOS-e820: [mem 0x0000000100000000-0x00000001bfffffff] usable
: linux kernel: NX (Execute Disable) protection: active
: linux kernel: SMBIOS 2.8 present.
: linux kernel: DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org
: linux kernel: Hypervisor detected: KVM
: linux kernel: e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
: linux kernel: e820: remove [mem 0x000a0000-0x000fffff] usable
: linux kernel: last_pfn = 0x1c0000 max_arch_pfn = 0x400000000
:
: linux kernel: SRAT: PXM 0 -> APIC 0x00 -> Node 0
: linux kernel: SRAT: PXM 1 -> APIC 0x01 -> Node 1
: linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
: linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
: linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x13fffffff]
: linux kernel: ACPI: SRAT: Node 1 PXM 1 [mem 0x140000000-0x1bfffffff]
: linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x1c0000000-0x43fffffff] hotplug
: linux kernel: NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0xbfffffff] -> [mem 0x0
: linux kernel: NUMA: Node 0 [mem 0x00000000-0xbfffffff] + [mem 0x100000000-0x13fffffff] -> [mem 0
: linux kernel: NODE_DATA(0) allocated [mem 0x13ffd6000-0x13fffffff]
: linux kernel: NODE_DATA(1) allocated [mem 0x1bffd3000-0x1bfffcfff]
:
: zoneinfo shows that the zone movable is placed into both numa nodes:
: Node 0, zone  Movable
:   pages free     160140
:         min      1823
:         low      2278
:         high     2733
:         spanned  262144
:         present  262144
:         managed  245670
: Node 1, zone  Movable
:   pages free     448427
:         min      3827
:         low      4783
:         high     5739
:         spanned  524288
:         present  524288
:         managed  515766

Note how only Node 0 has a hutplugable memory region which would rule
it out from the early memblock allocations (most likely memmap). Node1
will surely contain memmaps on the same node and those would prevent
offlining to succeed. So this is arguably a configuration issue.
Although one could argue that we should be more clever and rule early
allocations from the zone movable. This would be correct but probably
not worth the effort considering what a hack movablecore is.

Anyway, We could do better for those cases though. We rely on
start_isolate_page_range resp. has_unmovable_pages to do their job. The
first one isolates the whole range to be offlined so that we do not
allocate from it anymore and the later makes sure we are not stumbling
over non-migrateable pages.

has_unmovable_pages is overly optimistic, however. It doesn't check all
the pages if we are withing zone_movable because we rely that those
pages will be always migrateable. As it turns out we are still not
perfect there. While bootmem pages in zonemovable sound like a clear bug
which should be fixed let's remove the optimization for now and warn if
we encounter unmovable pages in zone_movable in the meantime. That
should help for now at least.

Btw. this wasn't a real problem until 72b39cfc4d75 ("mm, memory_hotplug:
do not fail offlining too early") because we used to have a small number
of retries and then failed. This turned out to be too fragile though.

Reported-by: Oscar Salvador <osalvador@techadventures.net>
Tested-by: Oscar Salvador <osalvador@techadventures.net>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c6f4008ea55..b9a45753244d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7629,11 +7629,12 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
 	unsigned long pfn, iter, found;
 
 	/*
-	 * For avoiding noise data, lru_add_drain_all() should be called
-	 * If ZONE_MOVABLE, the zone never contains unmovable pages
+	 * TODO we could make this much more efficient by not checking every
+	 * page in the range if we know all of them are in MOVABLE_ZONE and
+	 * that the movable zone guarantees that pages are migratable but
+	 * the later is not the case right now unfortunatelly. E.g. movablecore
+	 * can still lead to having bootmem allocations in zone_movable.
 	 */
-	if (zone_idx(zone) == ZONE_MOVABLE)
-		return false;
 
 	/*
 	 * CMA allocations (alloc_contig_range) really need to mark isolate
@@ -7654,7 +7655,7 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
 		page = pfn_to_page(check);
 
 		if (PageReserved(page))
-			return true;
+			goto unmovable;
 
 		/*
 		 * Hugepages are not in LRU lists, but they're movable.
@@ -7704,9 +7705,12 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
 		 * page at boot.
 		 */
 		if (found > count)
-			return true;
+			goto unmovable;
 	}
 	return false;
+unmovable:
+	WARN_ON_ONCE(zone_idx(zone) == ZONE_MOVABLE);
+	return true;
 }
 
 #if (defined(CONFIG_MEMORY_ISOLATION) && defined(CONFIG_COMPACTION)) || defined(CONFIG_CMA)
-- 
2.17.0

  reply index

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-05-23 12:55 [PATCH 0/2] few memory hotplug fixes Michal Hocko
2018-05-23 12:55 ` Michal Hocko [this message]
2018-05-23 13:14   ` [PATCH 1/2] mm, memory_hotplug: make has_unmovable_pages more robust Pavel Tatashin
2018-05-23 12:55 ` [PATCH 2/2] mm: do not warn on offline nodes unless the specific node is explicitly requested Michal Hocko
2018-05-23 13:14   ` Pavel Tatashin
2018-05-23 13:45   ` Anshuman Khandual
2018-05-23 14:06     ` Michal Hocko
2018-05-24  3:22       ` Anshuman Khandual
2018-05-24  8:00         ` Michal Hocko
2018-05-25  4:50           ` Anshuman Khandual

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180523125555.30039-2-mhocko@kernel.org \
    --to=mhocko@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=arbab@linux.vnet.ibm.com \
    --cc=imammedo@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=osalvador@techadventures.net \
    --cc=pasha.tatashin@oracle.com \
    --cc=vbabka@suse.cz \
    --cc=vkuznets@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org
	public-inbox-index linux-mm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git