linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [v4 0/6] optimize memory hotplug
@ 2018-02-15 16:59 Pavel Tatashin
  2018-02-15 16:59 ` [v4 1/6] mm/memory_hotplug: enforce block size aligned range check Pavel Tatashin
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Pavel Tatashin @ 2018-02-15 16:59 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, akpm, mgorman, mhocko, linux-mm,
	linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86,
	dan.j.williams, kirill.shutemov, bhe

Changelog:
	v3 - v4
	Addressed comments from Ingo Molnar and from Michal Hocko
	Splitted 4th patch into three patches
	Instead of using section table to save node ids, saving node id in
	the first page of every section.

	v2 - v3
	Fixed two issues found during testing
	Addressed Kbuild warning reports

	v1 - v2
	Added struct page poisoning checking in order to verify that
	struct pages are never accessed until initialized during memory
	hotplug

This patchset:
- Improves hotplug performance by eliminating a number of
struct page traverses during memory hotplug.

- Fixes some issues with hotplugging, where boundaries
were not properly checked. And on x86 block size was not properly aligned
with end of memory

- Also, potentially improves boot performance by eliminating condition from
  __init_single_page().

- Adds robustness by verifying that that struct pages are correctly
  poisoned when flags are accessed.

The following experiments were performed on Xeon(R) CPU E7-8895 v3 @ 2.60GHz
with 1T RAM:

booting in qemu with 960G of memory, time to initialize struct pages:

no-kvm:
	TRY1		TRY2
BEFORE:	39.433668	39.39705
AFTER:	36.903781	36.989329

with-kvm:
BEFORE:	10.977447	11.103164
AFTER:	10.929072	10.751885

Hotplug 896G memory:
no-kvm:
	TRY1		TRY2
BEFORE: 848.740000	846.910000
AFTER:  783.070000	786.560000

with-kvm:
	TRY1		TRY2
BEFORE: 34.410000	33.57
AFTER:	29.810000	29.580000

Pavel Tatashin (6):
  mm/memory_hotplug: enforce block size aligned range check
  x86/mm/memory_hotplug: determine block size based on the end of boot
    memory
  mm: uninitialized struct page poisoning sanity checking
  mm/memory_hotplug: optimize probe routine
  mm/memory_hotplug: don't read nid from struct page during hotplug
  mm/memory_hotplug: optimize memory hotplug

 arch/x86/mm/init_64.c      | 33 +++++++++++++++++++++++++++++----
 drivers/base/memory.c      | 38 +++++++++++++++++++++-----------------
 drivers/base/node.c        | 22 +++++++++++++++-------
 include/linux/mm.h         |  4 +++-
 include/linux/node.h       |  4 ++--
 include/linux/page-flags.h | 22 +++++++++++++++++-----
 mm/memblock.c              |  2 +-
 mm/memory_hotplug.c        | 36 ++++++++++++++----------------------
 mm/page_alloc.c            | 28 ++++++++++------------------
 mm/sparse.c                |  9 ++++++++-
 10 files changed, 120 insertions(+), 78 deletions(-)

-- 
2.16.1

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [v4 1/6] mm/memory_hotplug: enforce block size aligned range check
  2018-02-15 16:59 [v4 0/6] optimize memory hotplug Pavel Tatashin
@ 2018-02-15 16:59 ` Pavel Tatashin
  2018-02-15 16:59 ` [v4 2/6] x86/mm/memory_hotplug: determine block size based on the end of boot memory Pavel Tatashin
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Pavel Tatashin @ 2018-02-15 16:59 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, akpm, mgorman, mhocko, linux-mm,
	linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86,
	dan.j.williams, kirill.shutemov, bhe

Start qemu with the following arguments:

-m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G

Which boots machine with 64G and adds a device mem1 with 2G that can be
hotplugged later.

Also make sure that .config has the following options turned on:

CONFIG_MEMORY_HOTPLUG
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
CONFIG_ACPI_HOTPLUG_MEMORY

Using the qemu monitor hotplug the memory:

(qemu) device_add pc-dimm,id=dimm1,memdev=mem1

The operation will fail with the following trace:

WARNING: CPU: 0 PID: 91 at drivers/base/memory.c:205
pages_correctly_reserved+0xe6/0x110
Modules linked in:
CPU: 0 PID: 91 Comm: systemd-udevd Not tainted 4.16.0-rc1_pt_master #29
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
RIP: 0010:pages_correctly_reserved+0xe6/0x110
RSP: 0018:ffffbe5086b53d98 EFLAGS: 00010246
RAX: ffff9acb3fff3180 RBX: ffff9acaf7646038 RCX: 0000000000000800
RDX: ffff9acb3fff3000 RSI: 0000000000000218 RDI: 00000000010c0000
RBP: 0000000001080000 R08: ffffe81f83000040 R09: 0000000001100000
R10: ffff9acb3fff6000 R11: 0000000000000246 R12: 0000000000080000
R13: 0000000000000000 R14: ffffbe5086b53f08 R15: ffff9acaf7506f20
FS:  00007fd7f20da8c0(0000) GS:ffff9acb3fc00000(0000) knlGS:000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd7f20f2000 CR3: 0000000ff7ac2001 CR4: 00000000001606f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 memory_subsys_online+0x44/0xa0
 device_online+0x51/0x80
 store_mem_state+0x5e/0xe0
 kernfs_fop_write+0xfa/0x170
 __vfs_write+0x2e/0x150
 ? __inode_security_revalidate+0x47/0x60
 ? selinux_file_permission+0xd5/0x130
 ? _cond_resched+0x10/0x20
 vfs_write+0xa8/0x1a0
 ? find_vma+0x54/0x60
 SyS_write+0x4d/0xb0
 do_syscall_64+0x5d/0x110
 entry_SYSCALL_64_after_hwframe+0x21/0x86
RIP: 0033:0x7fd7f0d3a840
RSP: 002b:00007fff5db77c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007fd7f0d3a840
RDX: 0000000000000006 RSI: 00007fd7f20f2000 RDI: 0000000000000007
RBP: 00007fd7f20f2000 R08: 000055db265c4ab0 R09: 00007fd7f20da8c0
R10: 0000000000000006 R11: 0000000000000246 R12: 000055db265c49d0
R13: 0000000000000006 R14: 000055db265c5510 R15: 000000000000000b
Code: fe ff ff 07 00 77 24 48 89 f8 48 c1 e8 17 49 8b 14 c2 48 85 d2 74 14
40 0f b6 c6 49 81 c0 00 00 20 00 48 c1 e0 04 48 01 d0 75 93 <0f> ff 31 c0
c3 b8 01 00 00 00 c3 31 d2 48 c7 c7 b0 32 67 a6 31
---[ end trace 6203bc4f1a5d30e8 ]---

The problem is detected in: drivers/base/memory.c

static bool pages_correctly_reserved(unsigned long start_pfn)
                 if (WARN_ON_ONCE(!pfn_valid(pfn)))

This function loops through every section in the newly added memory block
and verifies that the first pfn in each section is valid, meaning section
exists, has mapping (struct page array), and is online.

The block size on x86 is usually 128M, but when machine is booted with
more than 64G of memory the block size is changed to 2G:

$ cat /sys/devices/system/memory/block_size_bytes
80000000

or

$ dmesg | grep "block size"
[    0.086469] x86/mm: Memory block size: 2048MB

During memory hotplug, and hotremove we verify that the range is section
size aligned, but we actually must verify that it is block size aligned,
because that is the proper unit for hotplug operations.  See:
Documentation/memory-hotplug.txt

So, when the start_pfn of newly added memory is not block size aligned, we
can get a memory block with partially populated sections.

In our case the start_pfn starts from the last_pfn (end of physical
memory).

$ dmesg | grep last_pfn
[    0.000000] e820: last_pfn = 0x1040000 max_arch_pfn = 0x400000000

0x1040000 == 65G, and so is not 2G aligned!

The fix is to enforce that memory that is hotplugged and hotremoved is
block size aligned.

With this fix, running the above sequence yield to the following result:

(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
Block size [0x80000000] unaligned hotplug range: start 0x1040000000,
							size 0x80000000
acpi PNP0C80:00: add_memory failed
acpi PNP0C80:00: acpi_memory_enable_device() error
acpi PNP0C80:00: Enumeration failure

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/memory_hotplug.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b2bd52ff7605..565048f496f7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1083,15 +1083,16 @@ int try_online_node(int nid)
 
 static int check_hotplug_memory_range(u64 start, u64 size)
 {
-	u64 start_pfn = PFN_DOWN(start);
+	unsigned long block_sz = memory_block_size_bytes();
+	u64 block_nr_pages = block_sz >> PAGE_SHIFT;
 	u64 nr_pages = size >> PAGE_SHIFT;
+	u64 start_pfn = PFN_DOWN(start);
 
-	/* Memory range must be aligned with section */
-	if ((start_pfn & ~PAGE_SECTION_MASK) ||
-	    (nr_pages % PAGES_PER_SECTION) || (!nr_pages)) {
-		pr_err("Section-unaligned hotplug range: start 0x%llx, size 0x%llx\n",
-				(unsigned long long)start,
-				(unsigned long long)size);
+	/* memory range must be block size aligned */
+	if (!nr_pages || !IS_ALIGNED(start_pfn, block_nr_pages) ||
+	    !IS_ALIGNED(nr_pages, block_nr_pages)) {
+		pr_err("Block size [%#lx] unaligned hotplug range: start %#llx, size %#llx",
+		       block_sz, start, size);
 		return -EINVAL;
 	}
 
-- 
2.16.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [v4 2/6] x86/mm/memory_hotplug: determine block size based on the end of boot memory
  2018-02-15 16:59 [v4 0/6] optimize memory hotplug Pavel Tatashin
  2018-02-15 16:59 ` [v4 1/6] mm/memory_hotplug: enforce block size aligned range check Pavel Tatashin
@ 2018-02-15 16:59 ` Pavel Tatashin
  2018-02-15 16:59 ` [v4 3/6] mm: uninitialized struct page poisoning sanity checking Pavel Tatashin
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Pavel Tatashin @ 2018-02-15 16:59 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, akpm, mgorman, mhocko, linux-mm,
	linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86,
	dan.j.williams, kirill.shutemov, bhe

Memory sections are combined into "memory block" chunks.  These chunks are
the units upon which memory can be added and removed.

On x86 the new memory may be added after the end of the boot memory,
therefore, if block size does not align with end of boot memory, memory
hotplugging/hotremoving can be broken.

Currently, whenever machine is booted with more than 64G the block size
is unconditionally increased to 2G from the base 128M. This is done in
order to reduce number of memory device files in sysfs:
	/sys/devices/system/memory/memoryXXX

We must use the largest allowed block size that aligns to the next
address to be able to hotplug the next block of memory.

So, when memory is larger or equal to 64G, we check the end address and
find the largest block size that is still power of two but smaller or
equal to 2G.

Before, the fix:
Run qemu with:
-m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G

(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
Block size [0x80000000] unaligned hotplug range: start 0x1040000000,
							size 0x80000000
acpi PNP0C80:00: add_memory failed
acpi PNP0C80:00: acpi_memory_enable_device() error
acpi PNP0C80:00: Enumeration failure

With the fix memory is added successfully as the block size is set to 1G,
and therefore aligns with start address 0x1040000000.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/init_64.c | 33 +++++++++++++++++++++++++++++----
 1 file changed, 29 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 8b72923f1d35..9ac776de2252 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1326,14 +1326,39 @@ int kern_addr_valid(unsigned long addr)
 	return pfn_valid(pte_pfn(*pte));
 }
 
+/*
+ * Block size is the minimum amount of memory which can be hotplugged or
+ * hotremoved. It must be power of two and must be equal or larger than
+ * MIN_MEMORY_BLOCK_SIZE.
+ */
+#define MAX_BLOCK_SIZE (2UL << 30)
+
+/* Amount of ram needed to start using large blocks */
+#define MEM_SIZE_FOR_LARGE_BLOCK (64UL << 30)
+
 static unsigned long probe_memory_block_size(void)
 {
-	unsigned long bz = MIN_MEMORY_BLOCK_SIZE;
+	unsigned long boot_mem_end = max_pfn << PAGE_SHIFT;
+	unsigned long bz;
 
-	/* if system is UV or has 64GB of RAM or more, use large blocks */
-	if (is_uv_system() || ((max_pfn << PAGE_SHIFT) >= (64UL << 30)))
-		bz = 2UL << 30; /* 2GB */
+	/* If this is UV system, always set 2G block size */
+	if (is_uv_system()) {
+		bz = MAX_BLOCK_SIZE;
+		goto done;
+	}
 
+	/* Use regular block if RAM is smaller than MEM_SIZE_FOR_LARGE_BLOCK */
+	if (boot_mem_end < MEM_SIZE_FOR_LARGE_BLOCK) {
+		bz = MIN_MEMORY_BLOCK_SIZE;
+		goto done;
+	}
+
+	/* Find the largest allowed block size that aligns to memory end */
+	for (bz = MAX_BLOCK_SIZE; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1) {
+		if (IS_ALIGNED(boot_mem_end, bz))
+			break;
+	}
+done:
 	pr_info("x86/mm: Memory block size: %ldMB\n", bz >> 20);
 
 	return bz;
-- 
2.16.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [v4 3/6] mm: uninitialized struct page poisoning sanity checking
  2018-02-15 16:59 [v4 0/6] optimize memory hotplug Pavel Tatashin
  2018-02-15 16:59 ` [v4 1/6] mm/memory_hotplug: enforce block size aligned range check Pavel Tatashin
  2018-02-15 16:59 ` [v4 2/6] x86/mm/memory_hotplug: determine block size based on the end of boot memory Pavel Tatashin
@ 2018-02-15 16:59 ` Pavel Tatashin
  2018-02-16  9:23   ` Ingo Molnar
  2018-02-15 16:59 ` [v4 4/6] mm/memory_hotplug: optimize probe routine Pavel Tatashin
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Pavel Tatashin @ 2018-02-15 16:59 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, akpm, mgorman, mhocko, linux-mm,
	linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86,
	dan.j.williams, kirill.shutemov, bhe

During boot we poison struct page memory in order to ensure that no one is
accessing this memory until the struct pages are initialized in
__init_single_page().

This patch adds more scrutiny to this checking by making sure that flags
do not equal the poison pattern when they are accessed.  The pattern is all
ones.

Since node id is also stored in struct page, and may be accessed quite
early, we add this enforcement into page_to_nid() function as well.
Note, this is applicable only when NODE_NOT_IN_PAGE_FLAGS=n

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mm.h         |  4 +++-
 include/linux/page-flags.h | 22 +++++++++++++++++-----
 mm/memblock.c              |  2 +-
 3 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ad06d42adb1a..ad71136a6494 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -896,7 +896,9 @@ extern int page_to_nid(const struct page *page);
 #else
 static inline int page_to_nid(const struct page *page)
 {
-	return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
+	struct page *p = (struct page *)page;
+
+	return (PF_POISONED_CHECK(p)->flags >> NODES_PGSHIFT) & NODES_MASK;
 }
 #endif
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 50c2b8786831..e34a27727b9a 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -156,9 +156,18 @@ static __always_inline int PageCompound(struct page *page)
 	return test_bit(PG_head, &page->flags) || PageTail(page);
 }
 
+#define	PAGE_POISON_PATTERN	-1l
+static inline int PagePoisoned(const struct page *page)
+{
+	return page->flags == PAGE_POISON_PATTERN;
+}
+
 /*
  * Page flags policies wrt compound pages
  *
+ * PF_POISONED_CHECK
+ *     check if this struct page poisoned/uninitialized
+ *
  * PF_ANY:
  *     the page flag is relevant for small, head and tail pages.
  *
@@ -176,17 +185,20 @@ static __always_inline int PageCompound(struct page *page)
  * PF_NO_COMPOUND:
  *     the page flag is not relevant for compound pages.
  */
-#define PF_ANY(page, enforce)	page
-#define PF_HEAD(page, enforce)	compound_head(page)
+#define PF_POISONED_CHECK(page) ({					\
+		VM_BUG_ON_PGFLAGS(PagePoisoned(page), page);		\
+		page; })
+#define PF_ANY(page, enforce)	PF_POISONED_CHECK(page)
+#define PF_HEAD(page, enforce)	PF_POISONED_CHECK(compound_head(page))
 #define PF_ONLY_HEAD(page, enforce) ({					\
 		VM_BUG_ON_PGFLAGS(PageTail(page), page);		\
-		page;})
+		PF_POISONED_CHECK(page); })
 #define PF_NO_TAIL(page, enforce) ({					\
 		VM_BUG_ON_PGFLAGS(enforce && PageTail(page), page);	\
-		compound_head(page);})
+		PF_POISONED_CHECK(compound_head(page)); })
 #define PF_NO_COMPOUND(page, enforce) ({				\
 		VM_BUG_ON_PGFLAGS(enforce && PageCompound(page), page);	\
-		page;})
+		PF_POISONED_CHECK(page); })
 
 /*
  * Macros to create function definitions for page flags
diff --git a/mm/memblock.c b/mm/memblock.c
index 5a9ca2a1751b..d85c8754e0ce 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1373,7 +1373,7 @@ void * __init memblock_virt_alloc_try_nid_raw(
 					   min_addr, max_addr, nid);
 #ifdef CONFIG_DEBUG_VM
 	if (ptr && size > 0)
-		memset(ptr, 0xff, size);
+		memset(ptr, PAGE_POISON_PATTERN, size);
 #endif
 	return ptr;
 }
-- 
2.16.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [v4 4/6] mm/memory_hotplug: optimize probe routine
  2018-02-15 16:59 [v4 0/6] optimize memory hotplug Pavel Tatashin
                   ` (2 preceding siblings ...)
  2018-02-15 16:59 ` [v4 3/6] mm: uninitialized struct page poisoning sanity checking Pavel Tatashin
@ 2018-02-15 16:59 ` Pavel Tatashin
  2018-02-16  9:13   ` Ingo Molnar
  2018-02-19 13:23   ` Michal Hocko
  2018-02-15 16:59 ` [v4 5/6] mm/memory_hotplug: don't read nid from struct page during hotplug Pavel Tatashin
  2018-02-15 16:59 ` [v4 6/6] mm/memory_hotplug: optimize memory hotplug Pavel Tatashin
  5 siblings, 2 replies; 18+ messages in thread
From: Pavel Tatashin @ 2018-02-15 16:59 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, akpm, mgorman, mhocko, linux-mm,
	linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86,
	dan.j.williams, kirill.shutemov, bhe

When memory is hotplugged pages_correctly_reserved() is called to verify
that the added memory is present, this routine traverses through every
struct page and verifies that PageReserved() is set. This is a slow
operation especially if a large amount of memory is added.

Instead of checking every page, it is enough to simply check that the
section is present, has mapping (struct page array is allocated), and the
mapping is online.

In addition, we should not excpect that probe routine sets flags in struct
page, as the struct pages have not yet been initialized. The initialization
should be done in __init_single_page(), the same as during boot.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 drivers/base/memory.c | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index fe4b24f05f6a..deb3f029b451 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -187,13 +187,14 @@ int memory_isolate_notify(unsigned long val, void *v)
 }
 
 /*
- * The probe routines leave the pages reserved, just as the bootmem code does.
- * Make sure they're still that way.
+ * The probe routines leave the pages uninitialized, just as the bootmem code
+ * does. Make sure we do not access them, but instead use only information from
+ * within sections.
  */
-static bool pages_correctly_reserved(unsigned long start_pfn)
+static bool pages_correctly_probed(unsigned long start_pfn)
 {
-	int i, j;
-	struct page *page;
+	unsigned long section_nr = pfn_to_section_nr(start_pfn);
+	unsigned long section_nr_end = section_nr + sections_per_block;
 	unsigned long pfn = start_pfn;
 
 	/*
@@ -201,21 +202,24 @@ static bool pages_correctly_reserved(unsigned long start_pfn)
 	 * SPARSEMEM_VMEMMAP. We lookup the page once per section
 	 * and assume memmap is contiguous within each section
 	 */
-	for (i = 0; i < sections_per_block; i++, pfn += PAGES_PER_SECTION) {
+	for (; section_nr < section_nr_end; section_nr++) {
 		if (WARN_ON_ONCE(!pfn_valid(pfn)))
 			return false;
-		page = pfn_to_page(pfn);
-
-		for (j = 0; j < PAGES_PER_SECTION; j++) {
-			if (PageReserved(page + j))
-				continue;
-
-			printk(KERN_WARNING "section number %ld page number %d "
-				"not reserved, was it already online?\n",
-				pfn_to_section_nr(pfn), j);
 
+		if (!present_section_nr(section_nr)) {
+			pr_warn("section %ld pfn[%lx, %lx) not present",
+				section_nr, pfn, pfn + PAGES_PER_SECTION);
+			return false;
+		} else if (!valid_section_nr(section_nr)) {
+			pr_warn("section %ld pfn[%lx, %lx) no valid memmap",
+				section_nr, pfn, pfn + PAGES_PER_SECTION);
+			return false;
+		} else if (online_section_nr(section_nr)) {
+			pr_warn("section %ld pfn[%lx, %lx) is already online",
+				section_nr, pfn, pfn + PAGES_PER_SECTION);
 			return false;
 		}
+		pfn += PAGES_PER_SECTION;
 	}
 
 	return true;
@@ -237,7 +241,7 @@ memory_block_action(unsigned long phys_index, unsigned long action, int online_t
 
 	switch (action) {
 	case MEM_ONLINE:
-		if (!pages_correctly_reserved(start_pfn))
+		if (!pages_correctly_probed(start_pfn))
 			return -EBUSY;
 
 		ret = online_pages(start_pfn, nr_pages, online_type);
-- 
2.16.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [v4 5/6] mm/memory_hotplug: don't read nid from struct page during hotplug
  2018-02-15 16:59 [v4 0/6] optimize memory hotplug Pavel Tatashin
                   ` (3 preceding siblings ...)
  2018-02-15 16:59 ` [v4 4/6] mm/memory_hotplug: optimize probe routine Pavel Tatashin
@ 2018-02-15 16:59 ` Pavel Tatashin
  2018-02-16  9:19   ` Ingo Molnar
  2018-02-19 13:37   ` Michal Hocko
  2018-02-15 16:59 ` [v4 6/6] mm/memory_hotplug: optimize memory hotplug Pavel Tatashin
  5 siblings, 2 replies; 18+ messages in thread
From: Pavel Tatashin @ 2018-02-15 16:59 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, akpm, mgorman, mhocko, linux-mm,
	linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86,
	dan.j.williams, kirill.shutemov, bhe

During memory hotplugging the probe routine will leave struct pages
uninitialized, the same as it is currently done during boot. Therefore, we
do not want to access the inside of struct pages before
__init_single_page() is called during onlining.

Because during hotplug we know that pages in one memory block belong to
the same numa node, we can skip the checking. We should keep checking for
the boot case.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 drivers/base/memory.c |  2 +-
 drivers/base/node.c   | 22 +++++++++++++++-------
 include/linux/node.h  |  4 ++--
 3 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index deb3f029b451..a14fb0cd424a 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -731,7 +731,7 @@ int register_new_memory(int nid, struct mem_section *section)
 	}
 
 	if (mem->section_count == sections_per_block)
-		ret = register_mem_sect_under_node(mem, nid);
+		ret = register_mem_sect_under_node(mem, nid, false);
 out:
 	mutex_unlock(&mem_sysfs_mutex);
 	return ret;
diff --git a/drivers/base/node.c b/drivers/base/node.c
index ee090ab9171c..d7cfc8d8a5c5 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -397,7 +397,8 @@ static int __ref get_nid_for_pfn(unsigned long pfn)
 }
 
 /* register memory section under specified node if it spans that node */
-int register_mem_sect_under_node(struct memory_block *mem_blk, int nid)
+int register_mem_sect_under_node(struct memory_block *mem_blk, int nid,
+				 bool check_nid)
 {
 	int ret;
 	unsigned long pfn, sect_start_pfn, sect_end_pfn;
@@ -423,11 +424,18 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid)
 			continue;
 		}
 
-		page_nid = get_nid_for_pfn(pfn);
-		if (page_nid < 0)
-			continue;
-		if (page_nid != nid)
-			continue;
+		/*
+		 * We need to check if page belongs to nid only for the boot
+		 * case, during hotplug we know that all pages in the memory
+		 * block belong to the same node.
+		 */
+		if (check_nid) {
+			page_nid = get_nid_for_pfn(pfn);
+			if (page_nid < 0)
+				continue;
+			if (page_nid != nid)
+				continue;
+		}
 		ret = sysfs_create_link_nowarn(&node_devices[nid]->dev.kobj,
 					&mem_blk->dev.kobj,
 					kobject_name(&mem_blk->dev.kobj));
@@ -502,7 +510,7 @@ int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages)
 
 		mem_blk = find_memory_block_hinted(mem_sect, mem_blk);
 
-		ret = register_mem_sect_under_node(mem_blk, nid);
+		ret = register_mem_sect_under_node(mem_blk, nid, true);
 		if (!err)
 			err = ret;
 
diff --git a/include/linux/node.h b/include/linux/node.h
index 4ece0fee0ffc..41f171861dcc 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -67,7 +67,7 @@ extern void unregister_one_node(int nid);
 extern int register_cpu_under_node(unsigned int cpu, unsigned int nid);
 extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid);
 extern int register_mem_sect_under_node(struct memory_block *mem_blk,
-						int nid);
+						int nid, bool check_nid);
 extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
 					   unsigned long phys_index);
 
@@ -97,7 +97,7 @@ static inline int unregister_cpu_under_node(unsigned int cpu, unsigned int nid)
 	return 0;
 }
 static inline int register_mem_sect_under_node(struct memory_block *mem_blk,
-							int nid)
+							int nid, bool check_nid)
 {
 	return 0;
 }
-- 
2.16.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [v4 6/6] mm/memory_hotplug: optimize memory hotplug
  2018-02-15 16:59 [v4 0/6] optimize memory hotplug Pavel Tatashin
                   ` (4 preceding siblings ...)
  2018-02-15 16:59 ` [v4 5/6] mm/memory_hotplug: don't read nid from struct page during hotplug Pavel Tatashin
@ 2018-02-15 16:59 ` Pavel Tatashin
  2018-02-16  9:30   ` Ingo Molnar
  2018-02-19 13:42   ` Michal Hocko
  5 siblings, 2 replies; 18+ messages in thread
From: Pavel Tatashin @ 2018-02-15 16:59 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, akpm, mgorman, mhocko, linux-mm,
	linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86,
	dan.j.williams, kirill.shutemov, bhe

During memory hotplugging we traverse struct pages three times:

1. memset(0) in sparse_add_one_section()
2. loop in __add_section() to set do: set_page_node(page, nid); and
   SetPageReserved(page);
3. loop in memmap_init_zone() to call __init_single_pfn()

This patch remove the first two loops, and leaves only loop 3. All struct
pages are initialized in one place, the same as it is done during boot.

The benefits:
- We improve the memory hotplug performance because we are not evicting
  cache several times and also reduce loop branching overheads.

- Remove condition from hotpath in __init_single_pfn(), that was added in
  order to fix the problem that was reported by Bharata in the above email
  thread, thus also improve the performance during normal boot.

- Make memory hotplug more similar to boot memory initialization path
  because we zero and initialize struct pages only in one function.

- Simplifies memory hotplug strut page initialization code, and thus
  enables future improvements, such as multi-threading the initialization
  of struct pages in order to improve the hotplug performance even further
  on larger machines.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 mm/memory_hotplug.c | 21 ++++++---------------
 mm/page_alloc.c     | 28 ++++++++++------------------
 mm/sparse.c         |  9 ++++++++-
 3 files changed, 24 insertions(+), 34 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 565048f496f7..ee04dae21f0c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -250,7 +250,7 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
 		struct vmem_altmap *altmap, bool want_memblock)
 {
 	int ret;
-	int i;
+	struct page *page;
 
 	if (pfn_valid(phys_start_pfn))
 		return -EEXIST;
@@ -260,21 +260,12 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
 		return ret;
 
 	/*
-	 * Make all the pages reserved so that nobody will stumble over half
-	 * initialized state.
-	 * FIXME: We also have to associate it with a node because page_to_nid
-	 * relies on having page with the proper node.
+	 * The first page in every section holds node id, this is because we
+	 * will need it in online_pages().
 	 */
-	for (i = 0; i < PAGES_PER_SECTION; i++) {
-		unsigned long pfn = phys_start_pfn + i;
-		struct page *page;
-		if (!pfn_valid(pfn))
-			continue;
-
-		page = pfn_to_page(pfn);
-		set_page_node(page, nid);
-		SetPageReserved(page);
-	}
+	page = pfn_to_page(phys_start_pfn);
+	mm_zero_struct_page(page);
+	set_page_node(page, nid);
 
 	if (!want_memblock)
 		return 0;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 81e18ceef579..2667b35fca41 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1177,10 +1177,9 @@ static void free_one_page(struct zone *zone,
 }
 
 static void __meminit __init_single_page(struct page *page, unsigned long pfn,
-				unsigned long zone, int nid, bool zero)
+				unsigned long zone, int nid)
 {
-	if (zero)
-		mm_zero_struct_page(page);
+	mm_zero_struct_page(page);
 	set_page_links(page, zone, nid, pfn);
 	init_page_count(page);
 	page_mapcount_reset(page);
@@ -1194,12 +1193,6 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn,
 #endif
 }
 
-static void __meminit __init_single_pfn(unsigned long pfn, unsigned long zone,
-					int nid, bool zero)
-{
-	return __init_single_page(pfn_to_page(pfn), pfn, zone, nid, zero);
-}
-
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 static void __meminit init_reserved_page(unsigned long pfn)
 {
@@ -1218,7 +1211,7 @@ static void __meminit init_reserved_page(unsigned long pfn)
 		if (pfn >= zone->zone_start_pfn && pfn < zone_end_pfn(zone))
 			break;
 	}
-	__init_single_pfn(pfn, zid, nid, true);
+	__init_single_page(pfn_to_page(pfn), pfn, zid, nid);
 }
 #else
 static inline void init_reserved_page(unsigned long pfn)
@@ -1535,7 +1528,7 @@ static unsigned long  __init deferred_init_pages(int nid, int zid,
 		} else {
 			page++;
 		}
-		__init_single_page(page, pfn, zid, nid, true);
+		__init_single_page(page, pfn, zid, nid);
 		nr_pages++;
 	}
 	return (nr_pages);
@@ -5328,6 +5321,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 	pg_data_t *pgdat = NODE_DATA(nid);
 	unsigned long pfn;
 	unsigned long nr_initialised = 0;
+	struct page *page;
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 	struct memblock_region *r = NULL, *tmp;
 #endif
@@ -5389,6 +5383,11 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 #endif
 
 not_early:
+		page = pfn_to_page(pfn);
+		__init_single_page(page, pfn, zone, nid);
+		if (context == MEMMAP_HOTPLUG)
+			SetPageReserved(page);
+
 		/*
 		 * Mark the block movable so that blocks are reserved for
 		 * movable at startup. This will force kernel allocations
@@ -5405,15 +5404,8 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		 * because this is done early in sparse_add_one_section
 		 */
 		if (!(pfn & (pageblock_nr_pages - 1))) {
-			struct page *page = pfn_to_page(pfn);
-
-			__init_single_page(page, pfn, zone, nid,
-					context != MEMMAP_HOTPLUG);
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
 			cond_resched();
-		} else {
-			__init_single_pfn(pfn, zone, nid,
-					context != MEMMAP_HOTPLUG);
 		}
 	}
 }
diff --git a/mm/sparse.c b/mm/sparse.c
index 7af5e7a92528..eb72de54089c 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -816,7 +816,14 @@ int __meminit sparse_add_one_section(struct pglist_data *pgdat,
 		goto out;
 	}
 
-	memset(memmap, 0, sizeof(struct page) * PAGES_PER_SECTION);
+#ifdef CONFIG_DEBUG_VM
+	/*
+	 * poison uninitialized struct pages in order to catch invalid flags
+	 * combinations.
+	 */
+	memset(memmap, PAGE_POISON_PATTERN,
+	       sizeof(struct page) * PAGES_PER_SECTION);
+#endif
 
 	section_mark_present(ms);
 
-- 
2.16.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [v4 4/6] mm/memory_hotplug: optimize probe routine
  2018-02-15 16:59 ` [v4 4/6] mm/memory_hotplug: optimize probe routine Pavel Tatashin
@ 2018-02-16  9:13   ` Ingo Molnar
  2018-02-16 13:07     ` Pavel Tatashin
  2018-02-19 13:23   ` Michal Hocko
  1 sibling, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2018-02-16  9:13 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: steven.sistare, daniel.m.jordan, akpm, mgorman, mhocko, linux-mm,
	linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86,
	dan.j.williams, kirill.shutemov, bhe


* Pavel Tatashin <pasha.tatashin@oracle.com> wrote:

> When memory is hotplugged pages_correctly_reserved() is called to verify
> that the added memory is present, this routine traverses through every
> struct page and verifies that PageReserved() is set. This is a slow
> operation especially if a large amount of memory is added.
> 
> Instead of checking every page, it is enough to simply check that the
> section is present, has mapping (struct page array is allocated), and the
> mapping is online.
> 
> In addition, we should not excpect that probe routine sets flags in struct
> page, as the struct pages have not yet been initialized. The initialization
> should be done in __init_single_page(), the same as during boot.
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>

Reviewed-by: Ingo Molnar <mingo@kernel.org>

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [v4 5/6] mm/memory_hotplug: don't read nid from struct page during hotplug
  2018-02-15 16:59 ` [v4 5/6] mm/memory_hotplug: don't read nid from struct page during hotplug Pavel Tatashin
@ 2018-02-16  9:19   ` Ingo Molnar
  2018-02-16 13:09     ` Pavel Tatashin
  2018-02-19 13:37   ` Michal Hocko
  1 sibling, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2018-02-16  9:19 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: steven.sistare, daniel.m.jordan, akpm, mgorman, mhocko, linux-mm,
	linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86,
	dan.j.williams, kirill.shutemov, bhe


* Pavel Tatashin <pasha.tatashin@oracle.com> wrote:

> During memory hotplugging the probe routine will leave struct pages
> uninitialized, the same as it is currently done during boot. Therefore, we
> do not want to access the inside of struct pages before
> __init_single_page() is called during onlining.
> 
> Because during hotplug we know that pages in one memory block belong to
> the same numa node, we can skip the checking. We should keep checking for
> the boot case.
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> ---
>  drivers/base/memory.c |  2 +-
>  drivers/base/node.c   | 22 +++++++++++++++-------
>  include/linux/node.h  |  4 ++--
>  3 files changed, 18 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index deb3f029b451..a14fb0cd424a 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -731,7 +731,7 @@ int register_new_memory(int nid, struct mem_section *section)
>  	}
>  
>  	if (mem->section_count == sections_per_block)
> -		ret = register_mem_sect_under_node(mem, nid);
> +		ret = register_mem_sect_under_node(mem, nid, false);
>  out:

The namespace of all these memory range handling functions is horribly random,
and I think now it got worse: we add an assumption that register_new_memory() is 
implicitly called as part of hotplugged memory (where things are pre-cleared) - 
but nothing in its naming suggests so.

How about renaming it to hotplug_memory_register() or so?

With that change you can add:

  Reviewed-by: Ingo Molnar <mingo@kernel.org>

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [v4 3/6] mm: uninitialized struct page poisoning sanity checking
  2018-02-15 16:59 ` [v4 3/6] mm: uninitialized struct page poisoning sanity checking Pavel Tatashin
@ 2018-02-16  9:23   ` Ingo Molnar
  2018-02-16 13:10     ` Pavel Tatashin
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2018-02-16  9:23 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: steven.sistare, daniel.m.jordan, akpm, mgorman, mhocko, linux-mm,
	linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86,
	dan.j.williams, kirill.shutemov, bhe


* Pavel Tatashin <pasha.tatashin@oracle.com> wrote:

> During boot we poison struct page memory in order to ensure that no one is
> accessing this memory until the struct pages are initialized in
> __init_single_page().
> 
> This patch adds more scrutiny to this checking by making sure that flags
> do not equal the poison pattern when they are accessed.  The pattern is all
> ones.
> 
> Since node id is also stored in struct page, and may be accessed quite
> early, we add this enforcement into page_to_nid() function as well.
> Note, this is applicable only when NODE_NOT_IN_PAGE_FLAGS=n
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> Reviewed-by: Ingo Molnar <mingo@kernel.org>
> Acked-by: Michal Hocko <mhocko@suse.com>

Please always start patch titles with a verb, i.e.:

 mm: Add uninitialized struct page poisoning sanity check

or so.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [v4 6/6] mm/memory_hotplug: optimize memory hotplug
  2018-02-15 16:59 ` [v4 6/6] mm/memory_hotplug: optimize memory hotplug Pavel Tatashin
@ 2018-02-16  9:30   ` Ingo Molnar
  2018-02-16 13:12     ` Pavel Tatashin
  2018-02-19 13:42   ` Michal Hocko
  1 sibling, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2018-02-16  9:30 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: steven.sistare, daniel.m.jordan, akpm, mgorman, mhocko, linux-mm,
	linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86,
	dan.j.williams, kirill.shutemov, bhe


* Pavel Tatashin <pasha.tatashin@oracle.com> wrote:

> During memory hotplugging we traverse struct pages three times:
> 
> 1. memset(0) in sparse_add_one_section()
> 2. loop in __add_section() to set do: set_page_node(page, nid); and
>    SetPageReserved(page);
> 3. loop in memmap_init_zone() to call __init_single_pfn()
> 
> This patch remove the first two loops, and leaves only loop 3. All struct
> pages are initialized in one place, the same as it is done during boot.

s/remove
 /removes

> The benefits:
> - We improve the memory hotplug performance because we are not evicting
>   cache several times and also reduce loop branching overheads.

s/We improve the memory hotplug performance
 /We improve memory hotplug performance

s/not evicting cache several times
 /not evicting the cache several times

s/overheads
 /overhead

> - Remove condition from hotpath in __init_single_pfn(), that was added in
>   order to fix the problem that was reported by Bharata in the above email
>   thread, thus also improve the performance during normal boot.

s/improve the performance
 /improve performance

> - Make memory hotplug more similar to boot memory initialization path
>   because we zero and initialize struct pages only in one function.

s/more similar to boot memory initialization path
 /more similar to the boot memory initialization path

> - Simplifies memory hotplug strut page initialization code, and thus
>   enables future improvements, such as multi-threading the initialization
>   of struct pages in order to improve the hotplug performance even further
>   on larger machines.

s/strut
 /struct

s/to improve the hotplug performance even further
 /to improve hotplug performance even further

> @@ -260,21 +260,12 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
>  		return ret;
>  
>  	/*
> -	 * Make all the pages reserved so that nobody will stumble over half
> -	 * initialized state.
> -	 * FIXME: We also have to associate it with a node because page_to_nid
> -	 * relies on having page with the proper node.
> +	 * The first page in every section holds node id, this is because we
> +	 * will need it in online_pages().

s/holds node id
 /holds the node id

> +#ifdef CONFIG_DEBUG_VM
> +	/*
> +	 * poison uninitialized struct pages in order to catch invalid flags
> +	 * combinations.

Please capitalize sentences properly.

> +	 */
> +	memset(memmap, PAGE_POISON_PATTERN,
> +	       sizeof(struct page) * PAGES_PER_SECTION);
> +#endif

I'd suggest writing this into a single line:

	memset(memmap, PAGE_POISON_PATTERN, sizeof(struct page)*PAGES_PER_SECTION);

(And ignore any checkpatch whinging - the line break didn't make it more 
readable.)

With those details fixed, and assuming that this patch was tested:

  Reviewed-by: Ingo Molnar <mingo@kernel.org>

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [v4 4/6] mm/memory_hotplug: optimize probe routine
  2018-02-16  9:13   ` Ingo Molnar
@ 2018-02-16 13:07     ` Pavel Tatashin
  0 siblings, 0 replies; 18+ messages in thread
From: Pavel Tatashin @ 2018-02-16 13:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steve Sistare, Daniel Jordan, Andrew Morton, Mel Gorman,
	Michal Hocko, Linux Memory Management List,
	Linux Kernel Mailing List, Greg Kroah-Hartman, Vlastimil Babka,
	Bharata B Rao, Thomas Gleixner, mingo, hpa, x86, dan.j.williams,
	kirill.shutemov, bhe

> Reviewed-by: Ingo Molnar <mingo@kernel.org>

Thank you!

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [v4 5/6] mm/memory_hotplug: don't read nid from struct page during hotplug
  2018-02-16  9:19   ` Ingo Molnar
@ 2018-02-16 13:09     ` Pavel Tatashin
  0 siblings, 0 replies; 18+ messages in thread
From: Pavel Tatashin @ 2018-02-16 13:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steve Sistare, Daniel Jordan, Andrew Morton, Mel Gorman,
	Michal Hocko, Linux Memory Management List,
	Linux Kernel Mailing List, Greg Kroah-Hartman, Vlastimil Babka,
	Bharata B Rao, Thomas Gleixner, mingo, hpa, x86, dan.j.williams,
	kirill.shutemov, bhe

> The namespace of all these memory range handling functions is horribly random,
> and I think now it got worse: we add an assumption that register_new_memory() is
> implicitly called as part of hotplugged memory (where things are pre-cleared) -
> but nothing in its naming suggests so.
>
> How about renaming it to hotplug_memory_register() or so?

Sure, I will rename it.

>
> With that change you can add:
>
>   Reviewed-by: Ingo Molnar <mingo@kernel.org>

Thank you,
Pavel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [v4 3/6] mm: uninitialized struct page poisoning sanity checking
  2018-02-16  9:23   ` Ingo Molnar
@ 2018-02-16 13:10     ` Pavel Tatashin
  0 siblings, 0 replies; 18+ messages in thread
From: Pavel Tatashin @ 2018-02-16 13:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steve Sistare, Daniel Jordan, Andrew Morton, Mel Gorman,
	Michal Hocko, Linux Memory Management List,
	Linux Kernel Mailing List, Greg Kroah-Hartman, Vlastimil Babka,
	Bharata B Rao, Thomas Gleixner, mingo, hpa, x86, dan.j.williams,
	kirill.shutemov, bhe

OK, I will update the title.

>
> Please always start patch titles with a verb, i.e.:
>
>  mm: Add uninitialized struct page poisoning sanity check

OK, I will update the title.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [v4 6/6] mm/memory_hotplug: optimize memory hotplug
  2018-02-16  9:30   ` Ingo Molnar
@ 2018-02-16 13:12     ` Pavel Tatashin
  0 siblings, 0 replies; 18+ messages in thread
From: Pavel Tatashin @ 2018-02-16 13:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steve Sistare, Daniel Jordan, Andrew Morton, Mel Gorman,
	Michal Hocko, Linux Memory Management List,
	Linux Kernel Mailing List, Greg Kroah-Hartman, Vlastimil Babka,
	Bharata B Rao, Thomas Gleixner, mingo, hpa, x86, dan.j.williams,
	kirill.shutemov, bhe

>
>   Reviewed-by: Ingo Molnar <mingo@kernel.org>

Thank you for your review! I will address all of your comments in the
next patch iteration.

Pavel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [v4 4/6] mm/memory_hotplug: optimize probe routine
  2018-02-15 16:59 ` [v4 4/6] mm/memory_hotplug: optimize probe routine Pavel Tatashin
  2018-02-16  9:13   ` Ingo Molnar
@ 2018-02-19 13:23   ` Michal Hocko
  1 sibling, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2018-02-19 13:23 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: steven.sistare, daniel.m.jordan, akpm, mgorman, linux-mm,
	linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86,
	dan.j.williams, kirill.shutemov, bhe

On Thu 15-02-18 11:59:18, Pavel Tatashin wrote:
> When memory is hotplugged pages_correctly_reserved() is called to verify
> that the added memory is present, this routine traverses through every
> struct page and verifies that PageReserved() is set. This is a slow
> operation especially if a large amount of memory is added.
> 
> Instead of checking every page, it is enough to simply check that the
> section is present, has mapping (struct page array is allocated), and the
> mapping is online.
> 
> In addition, we should not excpect that probe routine sets flags in struct
> page, as the struct pages have not yet been initialized. The initialization
> should be done in __init_single_page(), the same as during boot.
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  drivers/base/memory.c | 36 ++++++++++++++++++++----------------
>  1 file changed, 20 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index fe4b24f05f6a..deb3f029b451 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -187,13 +187,14 @@ int memory_isolate_notify(unsigned long val, void *v)
>  }
>  
>  /*
> - * The probe routines leave the pages reserved, just as the bootmem code does.
> - * Make sure they're still that way.
> + * The probe routines leave the pages uninitialized, just as the bootmem code
> + * does. Make sure we do not access them, but instead use only information from
> + * within sections.
>   */
> -static bool pages_correctly_reserved(unsigned long start_pfn)
> +static bool pages_correctly_probed(unsigned long start_pfn)
>  {
> -	int i, j;
> -	struct page *page;
> +	unsigned long section_nr = pfn_to_section_nr(start_pfn);
> +	unsigned long section_nr_end = section_nr + sections_per_block;
>  	unsigned long pfn = start_pfn;
>  
>  	/*
> @@ -201,21 +202,24 @@ static bool pages_correctly_reserved(unsigned long start_pfn)
>  	 * SPARSEMEM_VMEMMAP. We lookup the page once per section
>  	 * and assume memmap is contiguous within each section
>  	 */
> -	for (i = 0; i < sections_per_block; i++, pfn += PAGES_PER_SECTION) {
> +	for (; section_nr < section_nr_end; section_nr++) {
>  		if (WARN_ON_ONCE(!pfn_valid(pfn)))
>  			return false;
> -		page = pfn_to_page(pfn);
> -
> -		for (j = 0; j < PAGES_PER_SECTION; j++) {
> -			if (PageReserved(page + j))
> -				continue;
> -
> -			printk(KERN_WARNING "section number %ld page number %d "
> -				"not reserved, was it already online?\n",
> -				pfn_to_section_nr(pfn), j);
>  
> +		if (!present_section_nr(section_nr)) {
> +			pr_warn("section %ld pfn[%lx, %lx) not present",
> +				section_nr, pfn, pfn + PAGES_PER_SECTION);
> +			return false;
> +		} else if (!valid_section_nr(section_nr)) {
> +			pr_warn("section %ld pfn[%lx, %lx) no valid memmap",
> +				section_nr, pfn, pfn + PAGES_PER_SECTION);
> +			return false;
> +		} else if (online_section_nr(section_nr)) {
> +			pr_warn("section %ld pfn[%lx, %lx) is already online",
> +				section_nr, pfn, pfn + PAGES_PER_SECTION);
>  			return false;
>  		}
> +		pfn += PAGES_PER_SECTION;
>  	}
>  
>  	return true;
> @@ -237,7 +241,7 @@ memory_block_action(unsigned long phys_index, unsigned long action, int online_t
>  
>  	switch (action) {
>  	case MEM_ONLINE:
> -		if (!pages_correctly_reserved(start_pfn))
> +		if (!pages_correctly_probed(start_pfn))
>  			return -EBUSY;
>  
>  		ret = online_pages(start_pfn, nr_pages, online_type);
> -- 
> 2.16.1
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [v4 5/6] mm/memory_hotplug: don't read nid from struct page during hotplug
  2018-02-15 16:59 ` [v4 5/6] mm/memory_hotplug: don't read nid from struct page during hotplug Pavel Tatashin
  2018-02-16  9:19   ` Ingo Molnar
@ 2018-02-19 13:37   ` Michal Hocko
  1 sibling, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2018-02-19 13:37 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: steven.sistare, daniel.m.jordan, akpm, mgorman, linux-mm,
	linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86,
	dan.j.williams, kirill.shutemov, bhe

On Thu 15-02-18 11:59:19, Pavel Tatashin wrote:
> During memory hotplugging the probe routine will leave struct pages
> uninitialized, the same as it is currently done during boot. Therefore, we
> do not want to access the inside of struct pages before
> __init_single_page() is called during onlining.
> 
> Because during hotplug we know that pages in one memory block belong to
> the same numa node, we can skip the checking. We should keep checking for
> the boot case.

This could be more specific. What about the following:
"
register_mem_sect_under_node is careful to check the node id of each
pfn in the memblock range to handle configurations with interleaving
nodes. This is not really needed for the memory hotplug because hotadded
ranges are bound to a single NUMA node. We simply cannot handle interleaving
NUMA nodes in the same memblock currently and there are no signs that
anybody would want anything like that in future. That would require much
more refactoring.

This is a preparatory patch for later patches.
"
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>

Other than that this looks sane to me. register_mem_sect_under_node
could see some more improvements on top. E.g. do we really have to crawl
each pfn and try to recreate the sysfs machinery just to realize that we
already have one for the current memblock?

Acked-by: Michal Hocko <mhocko@suse.com

> ---
>  drivers/base/memory.c |  2 +-
>  drivers/base/node.c   | 22 +++++++++++++++-------
>  include/linux/node.h  |  4 ++--
>  3 files changed, 18 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index deb3f029b451..a14fb0cd424a 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -731,7 +731,7 @@ int register_new_memory(int nid, struct mem_section *section)
>  	}
>  
>  	if (mem->section_count == sections_per_block)
> -		ret = register_mem_sect_under_node(mem, nid);
> +		ret = register_mem_sect_under_node(mem, nid, false);
>  out:
>  	mutex_unlock(&mem_sysfs_mutex);
>  	return ret;
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index ee090ab9171c..d7cfc8d8a5c5 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -397,7 +397,8 @@ static int __ref get_nid_for_pfn(unsigned long pfn)
>  }
>  
>  /* register memory section under specified node if it spans that node */
> -int register_mem_sect_under_node(struct memory_block *mem_blk, int nid)
> +int register_mem_sect_under_node(struct memory_block *mem_blk, int nid,
> +				 bool check_nid)
>  {
>  	int ret;
>  	unsigned long pfn, sect_start_pfn, sect_end_pfn;
> @@ -423,11 +424,18 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid)
>  			continue;
>  		}
>  
> -		page_nid = get_nid_for_pfn(pfn);
> -		if (page_nid < 0)
> -			continue;
> -		if (page_nid != nid)
> -			continue;
> +		/*
> +		 * We need to check if page belongs to nid only for the boot
> +		 * case, during hotplug we know that all pages in the memory
> +		 * block belong to the same node.
> +		 */
> +		if (check_nid) {
> +			page_nid = get_nid_for_pfn(pfn);
> +			if (page_nid < 0)
> +				continue;
> +			if (page_nid != nid)
> +				continue;
> +		}
>  		ret = sysfs_create_link_nowarn(&node_devices[nid]->dev.kobj,
>  					&mem_blk->dev.kobj,
>  					kobject_name(&mem_blk->dev.kobj));
> @@ -502,7 +510,7 @@ int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages)
>  
>  		mem_blk = find_memory_block_hinted(mem_sect, mem_blk);
>  
> -		ret = register_mem_sect_under_node(mem_blk, nid);
> +		ret = register_mem_sect_under_node(mem_blk, nid, true);
>  		if (!err)
>  			err = ret;
>  
> diff --git a/include/linux/node.h b/include/linux/node.h
> index 4ece0fee0ffc..41f171861dcc 100644
> --- a/include/linux/node.h
> +++ b/include/linux/node.h
> @@ -67,7 +67,7 @@ extern void unregister_one_node(int nid);
>  extern int register_cpu_under_node(unsigned int cpu, unsigned int nid);
>  extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid);
>  extern int register_mem_sect_under_node(struct memory_block *mem_blk,
> -						int nid);
> +						int nid, bool check_nid);
>  extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
>  					   unsigned long phys_index);
>  
> @@ -97,7 +97,7 @@ static inline int unregister_cpu_under_node(unsigned int cpu, unsigned int nid)
>  	return 0;
>  }
>  static inline int register_mem_sect_under_node(struct memory_block *mem_blk,
> -							int nid)
> +							int nid, bool check_nid)
>  {
>  	return 0;
>  }
> -- 
> 2.16.1
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [v4 6/6] mm/memory_hotplug: optimize memory hotplug
  2018-02-15 16:59 ` [v4 6/6] mm/memory_hotplug: optimize memory hotplug Pavel Tatashin
  2018-02-16  9:30   ` Ingo Molnar
@ 2018-02-19 13:42   ` Michal Hocko
  1 sibling, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2018-02-19 13:42 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: steven.sistare, daniel.m.jordan, akpm, mgorman, linux-mm,
	linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86,
	dan.j.williams, kirill.shutemov, bhe

On Thu 15-02-18 11:59:20, Pavel Tatashin wrote:
[...]
> @@ -260,21 +260,12 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
>  		return ret;
>  
>  	/*
> -	 * Make all the pages reserved so that nobody will stumble over half
> -	 * initialized state.
> -	 * FIXME: We also have to associate it with a node because page_to_nid
> -	 * relies on having page with the proper node.
> +	 * The first page in every section holds node id, this is because we
> +	 * will need it in online_pages().
>  	 */
> -	for (i = 0; i < PAGES_PER_SECTION; i++) {
> -		unsigned long pfn = phys_start_pfn + i;
> -		struct page *page;
> -		if (!pfn_valid(pfn))
> -			continue;
> -
> -		page = pfn_to_page(pfn);
> -		set_page_node(page, nid);
> -		SetPageReserved(page);
> -	}
> +	page = pfn_to_page(phys_start_pfn);
> +	mm_zero_struct_page(page);
> +	set_page_node(page, nid);

I really dislike this part. It is just too subtle assumption. We can
safely store the node id into memory_block and push it down the way to
online. Or am I missing something?

Btw. the rest of the series seem good as is so I would go with it and
keep this last patch aparat and make sure to do it properly rather than
add more hacks.
  
>  	if (!want_memblock)
>  		return 0;
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2018-02-19 13:42 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-15 16:59 [v4 0/6] optimize memory hotplug Pavel Tatashin
2018-02-15 16:59 ` [v4 1/6] mm/memory_hotplug: enforce block size aligned range check Pavel Tatashin
2018-02-15 16:59 ` [v4 2/6] x86/mm/memory_hotplug: determine block size based on the end of boot memory Pavel Tatashin
2018-02-15 16:59 ` [v4 3/6] mm: uninitialized struct page poisoning sanity checking Pavel Tatashin
2018-02-16  9:23   ` Ingo Molnar
2018-02-16 13:10     ` Pavel Tatashin
2018-02-15 16:59 ` [v4 4/6] mm/memory_hotplug: optimize probe routine Pavel Tatashin
2018-02-16  9:13   ` Ingo Molnar
2018-02-16 13:07     ` Pavel Tatashin
2018-02-19 13:23   ` Michal Hocko
2018-02-15 16:59 ` [v4 5/6] mm/memory_hotplug: don't read nid from struct page during hotplug Pavel Tatashin
2018-02-16  9:19   ` Ingo Molnar
2018-02-16 13:09     ` Pavel Tatashin
2018-02-19 13:37   ` Michal Hocko
2018-02-15 16:59 ` [v4 6/6] mm/memory_hotplug: optimize memory hotplug Pavel Tatashin
2018-02-16  9:30   ` Ingo Molnar
2018-02-16 13:12     ` Pavel Tatashin
2018-02-19 13:42   ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).