* [PATCH v3 0/4] optimize memory hotplug
From: Pavel Tatashin @ 2018-02-13 19:31 UTC
  To: steven.sistare, daniel.m.jordan, akpm, mgorman, mhocko, linux-mm,
	linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86,
	dan.j.williams, kirill.shutemov, bhe

Changelog:
	v2 - v3
	Fixed two issues found during testing
	Addressed Kbuild warning reports

	v1 - v2
	Added struct page poisoning checks to verify that struct pages
	are never accessed until they are initialized during memory
	hotplug

This patchset:
- Improves hotplug performance by eliminating several struct page
  traversals during memory hotplug.

- Fixes issues with hotplugging where boundaries were not properly
  checked, and where on x86 the block size was not properly aligned with
  the end of memory.

- Potentially improves boot performance by eliminating a condition from
  __init_single_page().

- Adds robustness by verifying that struct pages are correctly
  poisoned when flags are accessed.

The following experiments were performed on Xeon(R) CPU E7-8895 v3 @ 2.60GHz
with 1T RAM:

Booting in qemu with 960G of memory, time (in seconds) to initialize struct pages:

no-kvm:
	TRY1		TRY2
BEFORE:	39.433668	39.397050
AFTER:	36.903781	36.989329

with-kvm:
	TRY1		TRY2
BEFORE:	10.977447	11.103164
AFTER:	10.929072	10.751885

Hotplugging 896G of memory, time in seconds:
no-kvm:
	TRY1		TRY2
BEFORE: 848.740000	846.910000
AFTER:  783.070000	786.560000

with-kvm:
	TRY1		TRY2
BEFORE: 34.410000	33.570000
AFTER:	29.810000	29.580000

Pavel Tatashin (4):
  mm/memory_hotplug: enforce block size aligned range check
  x86/mm/memory_hotplug: determine block size based on the end of boot
    memory
  mm: uninitialized struct page poisoning sanity checking
  mm/memory_hotplug: optimize memory hotplug

 arch/x86/mm/init_64.c          | 33 +++++++++++++++++++++++++++++----
 drivers/base/memory.c          | 38 +++++++++++++++++++++-----------------
 drivers/base/node.c            | 17 ++++++++++-------
 include/linux/memory_hotplug.h |  2 ++
 include/linux/mm.h             |  4 +++-
 include/linux/node.h           |  4 ++--
 include/linux/page-flags.h     | 22 +++++++++++++++++-----
 mm/memblock.c                  |  2 +-
 mm/memory_hotplug.c            | 36 ++++++++++--------------------------
 mm/page_alloc.c                | 28 ++++++++++------------------
 mm/sparse.c                    | 29 ++++++++++++++++++++++++++---
 11 files changed, 131 insertions(+), 84 deletions(-)

-- 
2.16.1

* [PATCH v3 1/4] mm/memory_hotplug: enforce block size aligned range check
From: Pavel Tatashin @ 2018-02-13 19:31 UTC
  To: steven.sistare, daniel.m.jordan, akpm, mgorman, mhocko, linux-mm,
	linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86,
	dan.j.williams, kirill.shutemov, bhe

Start qemu with the following arguments:
-m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G

This boots the machine with 64G of memory and adds a 2G device, mem1,
which can be hotplugged later.

Also make sure that the kernel config has the following options enabled:
CONFIG_MEMORY_HOTPLUG
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
CONFIG_ACPI_HOTPLUG_MEMORY

From the qemu monitor, hotplug the memory:
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1

The operation will fail with the following trace:

WARNING: CPU: 0 PID: 91 at drivers/base/memory.c:205
pages_correctly_reserved+0xe6/0x110
Modules linked in:
CPU: 0 PID: 91 Comm: systemd-udevd Not tainted 4.16.0-rc1_pt_master #29
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
RIP: 0010:pages_correctly_reserved+0xe6/0x110
RSP: 0018:ffffbe5086b53d98 EFLAGS: 00010246
RAX: ffff9acb3fff3180 RBX: ffff9acaf7646038 RCX: 0000000000000800
RDX: ffff9acb3fff3000 RSI: 0000000000000218 RDI: 00000000010c0000
RBP: 0000000001080000 R08: ffffe81f83000040 R09: 0000000001100000
R10: ffff9acb3fff6000 R11: 0000000000000246 R12: 0000000000080000
R13: 0000000000000000 R14: ffffbe5086b53f08 R15: ffff9acaf7506f20
FS:  00007fd7f20da8c0(0000) GS:ffff9acb3fc00000(0000) knlGS:000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd7f20f2000 CR3: 0000000ff7ac2001 CR4: 00000000001606f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 memory_subsys_online+0x44/0xa0
 device_online+0x51/0x80
 store_mem_state+0x5e/0xe0
 kernfs_fop_write+0xfa/0x170
 __vfs_write+0x2e/0x150
 ? __inode_security_revalidate+0x47/0x60
 ? selinux_file_permission+0xd5/0x130
 ? _cond_resched+0x10/0x20
 vfs_write+0xa8/0x1a0
 ? find_vma+0x54/0x60
 SyS_write+0x4d/0xb0
 do_syscall_64+0x5d/0x110
 entry_SYSCALL_64_after_hwframe+0x21/0x86
RIP: 0033:0x7fd7f0d3a840
RSP: 002b:00007fff5db77c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007fd7f0d3a840
RDX: 0000000000000006 RSI: 00007fd7f20f2000 RDI: 0000000000000007
RBP: 00007fd7f20f2000 R08: 000055db265c4ab0 R09: 00007fd7f20da8c0
R10: 0000000000000006 R11: 0000000000000246 R12: 000055db265c49d0
R13: 0000000000000006 R14: 000055db265c5510 R15: 000000000000000b
Code: fe ff ff 07 00 77 24 48 89 f8 48 c1 e8 17 49 8b 14 c2 48 85 d2 74 14
40 0f b6 c6 49 81 c0 00 00 20 00 48 c1 e0 04 48 01 d0 75 93 <0f> ff 31 c0
c3 b8 01 00 00 00 c3 31 d2 48 c7 c7 b0 32 67 a6 31
---[ end trace 6203bc4f1a5d30e8 ]---

The problem is detected in drivers/base/memory.c:
static bool pages_correctly_reserved(unsigned long start_pfn)
205                 if (WARN_ON_ONCE(!pfn_valid(pfn)))

This function loops through every section in the newly added memory block
and verifies that the first pfn of each section is valid, meaning the
section exists, has a mapping (struct page array), and is online.
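
A condensed user-space sketch of that failing check (pfn_valid() here is
a toy stand-in, and the constants assume the setup in this patch: boot
memory ending at 65G plus a 2G DIMM; the real function also verifies
that every page is reserved):

#include <stdbool.h>
#include <stdio.h>

#define PAGES_PER_SECTION	(1UL << 15)	/* 128M / 4K on x86  */
static const int sections_per_block = 16;	/* 2G block	     */

/* stand-in for the kernel's pfn_valid(): boot memory ends at 65G and
 * the 2G DIMM at 67G, so pfns are valid below 0x10c0000 */
static bool pfn_valid(unsigned long pfn)
{
	return pfn < 0x10c0000;
}

static bool pages_correctly_reserved(unsigned long start_pfn)
{
	unsigned long pfn = start_pfn;
	int i;

	/* one validity probe per section in the block */
	for (i = 0; i < sections_per_block; i++, pfn += PAGES_PER_SECTION)
		if (!pfn_valid(pfn))
			return false;	/* fires for the unaligned tail */
	return true;
}

int main(void)
{
	/* the 2G block [66G, 68G) is only half-covered by the DIMM */
	printf("%d\n", pages_correctly_reserved(0x1080000));	/* 0 */
	return 0;
}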

The block size on x86 is usually 128M, but when a machine is booted with more
than 64G of memory, the block size is changed to 2G:
$ cat /sys/devices/system/memory/block_size_bytes
80000000

or
$ dmesg | grep "block size"
[    0.086469] x86/mm: Memory block size: 2048MB

During memory hotplug and hotremove we verify that the range is section
size aligned, but we actually must verify that it is block size aligned,
because that is the proper unit for hotplug operations. See:
Documentation/memory-hotplug.txt

So, when the start_pfn of newly added memory is not block size aligned,
we can get a memory block in which only part of the sections are properly
populated.

In our case the start_pfn is the last_pfn (the end of physical memory):
$ dmesg | grep last_pfn
[    0.000000] e820: last_pfn = 0x1040000 max_arch_pfn = 0x400000000

pfn 0x1040000 corresponds to 65G, which is not 2G aligned!
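
A quick user-space sketch of that arithmetic (values taken from the
dmesg output above; IS_ALIGNED mirrors the kernel macro):

#include <stdio.h>

#define IS_ALIGNED(x, a)	(((x) & ((a) - 1)) == 0)

int main(void)
{
	/* last_pfn = 0x1040000, PAGE_SHIFT = 12 */
	unsigned long long start = 0x1040000ULL << 12;

	printf("start:      %#llx\n", start);			/* 0x1040000000 */
	printf("2G aligned: %d\n", IS_ALIGNED(start, 2ULL << 30)); /* 0 */
	printf("1G aligned: %d\n", IS_ALIGNED(start, 1ULL << 30)); /* 1 */
	return 0;
}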

The fix is to enforce that memory that is hotplugged and hotremoved is
block size aligned.

With this fix, running the above sequence yields the following result:
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
Block size [0x80000000] unaligned hotplug range: start 0x1040000000,
							size 0x80000000
acpi PNP0C80:00: add_memory failed
acpi PNP0C80:00: acpi_memory_enable_device() error
acpi PNP0C80:00: Enumeration failure

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 mm/memory_hotplug.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b2bd52ff7605..565048f496f7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1083,15 +1083,16 @@ int try_online_node(int nid)
 
 static int check_hotplug_memory_range(u64 start, u64 size)
 {
-	u64 start_pfn = PFN_DOWN(start);
+	unsigned long block_sz = memory_block_size_bytes();
+	u64 block_nr_pages = block_sz >> PAGE_SHIFT;
 	u64 nr_pages = size >> PAGE_SHIFT;
+	u64 start_pfn = PFN_DOWN(start);
 
-	/* Memory range must be aligned with section */
-	if ((start_pfn & ~PAGE_SECTION_MASK) ||
-	    (nr_pages % PAGES_PER_SECTION) || (!nr_pages)) {
-		pr_err("Section-unaligned hotplug range: start 0x%llx, size 0x%llx\n",
-				(unsigned long long)start,
-				(unsigned long long)size);
+	/* memory range must be block size aligned */
+	if (!nr_pages || !IS_ALIGNED(start_pfn, block_nr_pages) ||
+	    !IS_ALIGNED(nr_pages, block_nr_pages)) {
+		pr_err("Block size [%#lx] unaligned hotplug range: start %#llx, size %#llx",
+		       block_sz, start, size);
 		return -EINVAL;
 	}
 
-- 
2.16.1

* [PATCH v3 2/4] x86/mm/memory_hotplug: determine block size based on the end of boot memory
From: Pavel Tatashin @ 2018-02-13 19:31 UTC
  To: steven.sistare, daniel.m.jordan, akpm, mgorman, mhocko, linux-mm,
	linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86,
	dan.j.williams, kirill.shutemov, bhe

Memory sections are combined into "memory block" chunks. These chunks are
the units upon which memory can be added and removed.

On x86, new memory may be added after the end of boot memory; therefore,
if the block size does not align with the end of boot memory, memory
hot-plugging/hot-removing can be broken.

Currently, whenever a machine is booted with more than 64G, the block
size is unconditionally increased from the base 128M to 2G in order to
reduce the number of memory devices in sysfs:
	/sys/devices/system/memory/memoryXXX

But we must use the largest allowed block size that aligns to the end of
boot memory, which is where the next block of memory will be hotplugged.

So, when memory is larger than 64G, we check the end address and find
the largest block size that is still a power of two, aligned to the end
of memory, and smaller than or equal to 2G.
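
A user-space sketch of that search (a standalone re-implementation for
illustration, assuming a 64-bit build; on x86 MIN_MEMORY_BLOCK_SIZE is
the 128M section size):

#include <stdio.h>

#define MIN_MEMORY_BLOCK_SIZE	(128UL << 20)
#define MAX_BLOCK_SIZE		(2UL << 30)
#define IS_ALIGNED(x, a)	(((x) & ((a) - 1)) == 0)

int main(void)
{
	unsigned long boot_mem_end = 65UL << 30;	/* 0x1040000000 */
	unsigned long bz;

	/* walk down from 2G until the size divides the memory end */
	for (bz = MAX_BLOCK_SIZE; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1)
		if (IS_ALIGNED(boot_mem_end, bz))
			break;

	printf("block size: %luMB\n", bz >> 20);	/* prints 1024 */
	return 0;
}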

Before the fix:
Run qemu with:
-m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G

(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
Block size [0x80000000] unaligned hotplug range: start 0x1040000000,
							size 0x80000000
acpi PNP0C80:00: add_memory failed
acpi PNP0C80:00: acpi_memory_enable_device() error
acpi PNP0C80:00: Enumeration failure

With the fix, memory is added successfully because the block size is set
to 1G, which aligns with the start address 0x1040000000.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 arch/x86/mm/init_64.c | 33 +++++++++++++++++++++++++++++----
 1 file changed, 29 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 1ab42c852069..f7dc80364397 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1326,14 +1326,39 @@ int kern_addr_valid(unsigned long addr)
 	return pfn_valid(pte_pfn(*pte));
 }
 
+/*
+ * Block size is the minimum quantum of memory which can be hot-plugged or
+ * hot-removed. It must be power of two, and must be equal or larger than
+ * MIN_MEMORY_BLOCK_SIZE.
+ */
+#define MAX_BLOCK_SIZE (2UL << 30)
+
+/* Amount of ram needed to start using large blocks */
+#define MEM_SIZE_FOR_LARGE_BLOCK (64UL << 30)
+
 static unsigned long probe_memory_block_size(void)
 {
-	unsigned long bz = MIN_MEMORY_BLOCK_SIZE;
+	unsigned long boot_mem_end = max_pfn << PAGE_SHIFT;
+	unsigned long bz;
 
-	/* if system is UV or has 64GB of RAM or more, use large blocks */
-	if (is_uv_system() || ((max_pfn << PAGE_SHIFT) >= (64UL << 30)))
-		bz = 2UL << 30; /* 2GB */
+	/* If this is UV system, always set 2G block size */
+	if (is_uv_system()) {
+		bz = MAX_BLOCK_SIZE;
+		goto done;
+	}
 
+	/* Use regular block if RAM is smaller than MEM_SIZE_FOR_LARGE_BLOCK */
+	if (boot_mem_end < MEM_SIZE_FOR_LARGE_BLOCK) {
+		bz = MIN_MEMORY_BLOCK_SIZE;
+		goto done;
+	}
+
+	/* Find the largest allowed block size that aligns to memory end */
+	for (bz = MAX_BLOCK_SIZE; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1) {
+		if (IS_ALIGNED(boot_mem_end, bz))
+			break;
+	}
+done:
 	pr_info("x86/mm: Memory block size: %ldMB\n", bz >> 20);
 
 	return bz;
-- 
2.16.1

* [PATCH v3 3/4] mm: uninitialized struct page poisoning sanity checking
From: Pavel Tatashin @ 2018-02-13 19:31 UTC
  To: steven.sistare, daniel.m.jordan, akpm, mgorman, mhocko, linux-mm,
	linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86,
	dan.j.williams, kirill.shutemov, bhe

During boot we poison struct page memory in order to ensure that no one is
accessing this memory until the struct pages are initialized in
__init_single_page().

This patch adds more scrutiny to this checking by making sure that flags
do not equal the poison pattern when they are accessed. The pattern is
all ones.

Since the node id is also stored in struct page and may be accessed quite
early, we add the enforcement to the page_to_nid() function as well.
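
An illustrative user-space sketch of the check (struct page is reduced
to just its flags field here; PAGE_POISON_PATTERN and PagePoisoned()
match the definitions added below):

#include <stdio.h>
#include <string.h>

#define PAGE_POISON_PATTERN	~0ul

struct page { unsigned long flags; };

static int PagePoisoned(const struct page *page)
{
	return page->flags == PAGE_POISON_PATTERN;
}

int main(void)
{
	struct page p;

	memset(&p, 0xff, sizeof(p));		/* as the memblock poisoning does */
	printf("poisoned: %d\n", PagePoisoned(&p));	/* 1 */

	p.flags = 0;				/* after __init_single_page() */
	printf("poisoned: %d\n", PagePoisoned(&p));	/* 0 */
	return 0;
}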

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 include/linux/mm.h         |  4 +++-
 include/linux/page-flags.h | 22 +++++++++++++++++-----
 mm/memblock.c              |  2 +-
 3 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ad06d42adb1a..ad71136a6494 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -896,7 +896,9 @@ extern int page_to_nid(const struct page *page);
 #else
 static inline int page_to_nid(const struct page *page)
 {
-	return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
+	struct page *p = (struct page *)page;
+
+	return (PF_POISONED_CHECK(p)->flags >> NODES_PGSHIFT) & NODES_MASK;
 }
 #endif
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 50c2b8786831..5d5493e1f7ba 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -156,9 +156,18 @@ static __always_inline int PageCompound(struct page *page)
 	return test_bit(PG_head, &page->flags) || PageTail(page);
 }
 
+#define	PAGE_POISON_PATTERN	~0ul
+static inline int PagePoisoned(const struct page *page)
+{
+	return page->flags == PAGE_POISON_PATTERN;
+}
+
 /*
  * Page flags policies wrt compound pages
  *
+ * PF_POISONED_CHECK
+ *     check if this struct page poisoned/uninitialized
+ *
  * PF_ANY:
  *     the page flag is relevant for small, head and tail pages.
  *
@@ -176,17 +185,20 @@ static __always_inline int PageCompound(struct page *page)
  * PF_NO_COMPOUND:
  *     the page flag is not relevant for compound pages.
  */
-#define PF_ANY(page, enforce)	page
-#define PF_HEAD(page, enforce)	compound_head(page)
+#define PF_POISONED_CHECK(page) ({					\
+		VM_BUG_ON_PGFLAGS(PagePoisoned(page), page);		\
+		page;})
+#define PF_ANY(page, enforce)	PF_POISONED_CHECK(page)
+#define PF_HEAD(page, enforce)	PF_POISONED_CHECK(compound_head(page))
 #define PF_ONLY_HEAD(page, enforce) ({					\
 		VM_BUG_ON_PGFLAGS(PageTail(page), page);		\
-		page;})
+		PF_POISONED_CHECK(page);})
 #define PF_NO_TAIL(page, enforce) ({					\
 		VM_BUG_ON_PGFLAGS(enforce && PageTail(page), page);	\
-		compound_head(page);})
+		PF_POISONED_CHECK(compound_head(page));})
 #define PF_NO_COMPOUND(page, enforce) ({				\
 		VM_BUG_ON_PGFLAGS(enforce && PageCompound(page), page);	\
-		page;})
+		PF_POISONED_CHECK(page);})
 
 /*
  * Macros to create function definitions for page flags
diff --git a/mm/memblock.c b/mm/memblock.c
index 5a9ca2a1751b..d85c8754e0ce 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1373,7 +1373,7 @@ void * __init memblock_virt_alloc_try_nid_raw(
 					   min_addr, max_addr, nid);
 #ifdef CONFIG_DEBUG_VM
 	if (ptr && size > 0)
-		memset(ptr, 0xff, size);
+		memset(ptr, PAGE_POISON_PATTERN, size);
 #endif
 	return ptr;
 }
-- 
2.16.1

* [PATCH v3 4/4] mm/memory_hotplug: optimize memory hotplug
From: Pavel Tatashin @ 2018-02-13 19:31 UTC
  To: steven.sistare, daniel.m.jordan, akpm, mgorman, mhocko, linux-mm,
	linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86,
	dan.j.williams, kirill.shutemov, bhe

This patch was inspired by the discussion of this problem:
http://lkml.kernel.org/r/20180130083006.GB1245@in.ibm.com

Currently, during memory hotplugging we traverse struct pages several
times:

1. memset(0) in sparse_add_one_section()
2. loop in __add_section() to call set_page_node(page, nid) and
   SetPageReserved(page)
3. loop in pages_correctly_reserved() to check that the reserved flag is set
4. loop in memmap_init_zone() to call __init_single_pfn()

This patch removes loops 1, 2, and 3, leaving only loop 4, where all
struct page fields are initialized in one go, the same as is done during
boot.

The benefits:
- Improves memory hotplug performance, because we no longer evict the
  cache several times and we also reduce loop branching overhead (see
  the single-pass sketch after this list).

- Removes a condition from the hot path in __init_single_page() that was
  added in order to fix the problem reported by Bharata in the above
  email thread, thus also improving performance during normal boot.

- Makes memory hotplug more similar to the boot memory initialization
  path, because we zero and initialize struct pages in only one function.

- Simplifies the memory hotplug struct page initialization code, and thus
  enables future improvements, such as multi-threading the initialization
  of struct pages in order to improve hotplug performance even further on
  larger machines.
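
As referenced above, a minimal user-space sketch of that single pass
(all names here are hypothetical stand-ins, not kernel code):

#include <stdio.h>
#include <string.h>

struct fake_page { unsigned long flags; int nid; int refcount; };

/* everything a page needs, done while its cache line is hot */
static void init_one(struct fake_page *p, int nid)
{
	memset(p, 0, sizeof(*p));	/* replaces pass 1: memset(0)	     */
	p->nid = nid;			/* replaces pass 2: set_page_node()  */
	p->refcount = 1;		/* rest of __init_single_page()	     */
}

int main(void)
{
	enum { NPAGES = 1 << 16 };
	static struct fake_page memmap[NPAGES];
	int i;

	/* a single traversal, like the patched memmap_init_zone() path */
	for (i = 0; i < NPAGES; i++)
		init_one(&memmap[i], 0);

	printf("initialized %d pages in one pass\n", NPAGES);
	return 0;
}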

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 drivers/base/memory.c          | 38 +++++++++++++++++++++-----------------
 drivers/base/node.c            | 17 ++++++++++-------
 include/linux/memory_hotplug.h |  2 ++
 include/linux/node.h           |  4 ++--
 mm/memory_hotplug.c            | 21 ++-------------------
 mm/page_alloc.c                | 28 ++++++++++------------------
 mm/sparse.c                    | 29 ++++++++++++++++++++++++++---
 7 files changed, 73 insertions(+), 66 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index fe4b24f05f6a..a14fb0cd424a 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -187,13 +187,14 @@ int memory_isolate_notify(unsigned long val, void *v)
 }
 
 /*
- * The probe routines leave the pages reserved, just as the bootmem code does.
- * Make sure they're still that way.
+ * The probe routines leave the pages uninitialized, just as the bootmem code
+ * does. Make sure we do not access them, but instead use only information from
+ * within sections.
  */
-static bool pages_correctly_reserved(unsigned long start_pfn)
+static bool pages_correctly_probed(unsigned long start_pfn)
 {
-	int i, j;
-	struct page *page;
+	unsigned long section_nr = pfn_to_section_nr(start_pfn);
+	unsigned long section_nr_end = section_nr + sections_per_block;
 	unsigned long pfn = start_pfn;
 
 	/*
@@ -201,21 +202,24 @@ static bool pages_correctly_reserved(unsigned long start_pfn)
 	 * SPARSEMEM_VMEMMAP. We lookup the page once per section
 	 * and assume memmap is contiguous within each section
 	 */
-	for (i = 0; i < sections_per_block; i++, pfn += PAGES_PER_SECTION) {
+	for (; section_nr < section_nr_end; section_nr++) {
 		if (WARN_ON_ONCE(!pfn_valid(pfn)))
 			return false;
-		page = pfn_to_page(pfn);
-
-		for (j = 0; j < PAGES_PER_SECTION; j++) {
-			if (PageReserved(page + j))
-				continue;
-
-			printk(KERN_WARNING "section number %ld page number %d "
-				"not reserved, was it already online?\n",
-				pfn_to_section_nr(pfn), j);
 
+		if (!present_section_nr(section_nr)) {
+			pr_warn("section %ld pfn[%lx, %lx) not present",
+				section_nr, pfn, pfn + PAGES_PER_SECTION);
+			return false;
+		} else if (!valid_section_nr(section_nr)) {
+			pr_warn("section %ld pfn[%lx, %lx) no valid memmap",
+				section_nr, pfn, pfn + PAGES_PER_SECTION);
+			return false;
+		} else if (online_section_nr(section_nr)) {
+			pr_warn("section %ld pfn[%lx, %lx) is already online",
+				section_nr, pfn, pfn + PAGES_PER_SECTION);
 			return false;
 		}
+		pfn += PAGES_PER_SECTION;
 	}
 
 	return true;
@@ -237,7 +241,7 @@ memory_block_action(unsigned long phys_index, unsigned long action, int online_t
 
 	switch (action) {
 	case MEM_ONLINE:
-		if (!pages_correctly_reserved(start_pfn))
+		if (!pages_correctly_probed(start_pfn))
 			return -EBUSY;
 
 		ret = online_pages(start_pfn, nr_pages, online_type);
@@ -727,7 +731,7 @@ int register_new_memory(int nid, struct mem_section *section)
 	}
 
 	if (mem->section_count == sections_per_block)
-		ret = register_mem_sect_under_node(mem, nid);
+		ret = register_mem_sect_under_node(mem, nid, false);
 out:
 	mutex_unlock(&mem_sysfs_mutex);
 	return ret;
diff --git a/drivers/base/node.c b/drivers/base/node.c
index ee090ab9171c..f1c0c130ac88 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -397,7 +397,8 @@ static int __ref get_nid_for_pfn(unsigned long pfn)
 }
 
 /* register memory section under specified node if it spans that node */
-int register_mem_sect_under_node(struct memory_block *mem_blk, int nid)
+int register_mem_sect_under_node(struct memory_block *mem_blk, int nid,
+				 bool check_nid)
 {
 	int ret;
 	unsigned long pfn, sect_start_pfn, sect_end_pfn;
@@ -423,11 +424,13 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid)
 			continue;
 		}
 
-		page_nid = get_nid_for_pfn(pfn);
-		if (page_nid < 0)
-			continue;
-		if (page_nid != nid)
-			continue;
+		if (check_nid) {
+			page_nid = get_nid_for_pfn(pfn);
+			if (page_nid < 0)
+				continue;
+			if (page_nid != nid)
+				continue;
+		}
 		ret = sysfs_create_link_nowarn(&node_devices[nid]->dev.kobj,
 					&mem_blk->dev.kobj,
 					kobject_name(&mem_blk->dev.kobj));
@@ -502,7 +505,7 @@ int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages)
 
 		mem_blk = find_memory_block_hinted(mem_sect, mem_blk);
 
-		ret = register_mem_sect_under_node(mem_blk, nid);
+		ret = register_mem_sect_under_node(mem_blk, nid, true);
 		if (!err)
 			err = ret;
 
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index aba5f86eb038..0f2c17bef9d6 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -234,6 +234,8 @@ void put_online_mems(void);
 void mem_hotplug_begin(void);
 void mem_hotplug_done(void);
 
+int get_section_nid(unsigned long section_nr);
+
 extern void set_zone_contiguous(struct zone *zone);
 extern void clear_zone_contiguous(struct zone *zone);
 
diff --git a/include/linux/node.h b/include/linux/node.h
index 4ece0fee0ffc..41f171861dcc 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -67,7 +67,7 @@ extern void unregister_one_node(int nid);
 extern int register_cpu_under_node(unsigned int cpu, unsigned int nid);
 extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid);
 extern int register_mem_sect_under_node(struct memory_block *mem_blk,
-						int nid);
+						int nid, bool check_nid);
 extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
 					   unsigned long phys_index);
 
@@ -97,7 +97,7 @@ static inline int unregister_cpu_under_node(unsigned int cpu, unsigned int nid)
 	return 0;
 }
 static inline int register_mem_sect_under_node(struct memory_block *mem_blk,
-							int nid)
+							int nid, bool check_nid)
 {
 	return 0;
 }
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 565048f496f7..cf1041380ab7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -250,7 +250,6 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
 		struct vmem_altmap *altmap, bool want_memblock)
 {
 	int ret;
-	int i;
 
 	if (pfn_valid(phys_start_pfn))
 		return -EEXIST;
@@ -259,23 +258,6 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
 	if (ret < 0)
 		return ret;
 
-	/*
-	 * Make all the pages reserved so that nobody will stumble over half
-	 * initialized state.
-	 * FIXME: We also have to associate it with a node because page_to_nid
-	 * relies on having page with the proper node.
-	 */
-	for (i = 0; i < PAGES_PER_SECTION; i++) {
-		unsigned long pfn = phys_start_pfn + i;
-		struct page *page;
-		if (!pfn_valid(pfn))
-			continue;
-
-		page = pfn_to_page(pfn);
-		set_page_node(page, nid);
-		SetPageReserved(page);
-	}
-
 	if (!want_memblock)
 		return 0;
 
@@ -909,7 +891,8 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	int ret;
 	struct memory_notify arg;
 
-	nid = pfn_to_nid(pfn);
+	nid = get_section_nid(pfn_to_section_nr(pfn));
+
 	/* associate pfn range with the zone */
 	zone = move_pfn_range(online_type, nid, pfn, nr_pages);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 81e18ceef579..2667b35fca41 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1177,10 +1177,9 @@ static void free_one_page(struct zone *zone,
 }
 
 static void __meminit __init_single_page(struct page *page, unsigned long pfn,
-				unsigned long zone, int nid, bool zero)
+				unsigned long zone, int nid)
 {
-	if (zero)
-		mm_zero_struct_page(page);
+	mm_zero_struct_page(page);
 	set_page_links(page, zone, nid, pfn);
 	init_page_count(page);
 	page_mapcount_reset(page);
@@ -1194,12 +1193,6 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn,
 #endif
 }
 
-static void __meminit __init_single_pfn(unsigned long pfn, unsigned long zone,
-					int nid, bool zero)
-{
-	return __init_single_page(pfn_to_page(pfn), pfn, zone, nid, zero);
-}
-
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 static void __meminit init_reserved_page(unsigned long pfn)
 {
@@ -1218,7 +1211,7 @@ static void __meminit init_reserved_page(unsigned long pfn)
 		if (pfn >= zone->zone_start_pfn && pfn < zone_end_pfn(zone))
 			break;
 	}
-	__init_single_pfn(pfn, zid, nid, true);
+	__init_single_page(pfn_to_page(pfn), pfn, zid, nid);
 }
 #else
 static inline void init_reserved_page(unsigned long pfn)
@@ -1535,7 +1528,7 @@ static unsigned long  __init deferred_init_pages(int nid, int zid,
 		} else {
 			page++;
 		}
-		__init_single_page(page, pfn, zid, nid, true);
+		__init_single_page(page, pfn, zid, nid);
 		nr_pages++;
 	}
 	return (nr_pages);
@@ -5328,6 +5321,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 	pg_data_t *pgdat = NODE_DATA(nid);
 	unsigned long pfn;
 	unsigned long nr_initialised = 0;
+	struct page *page;
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 	struct memblock_region *r = NULL, *tmp;
 #endif
@@ -5389,6 +5383,11 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 #endif
 
 not_early:
+		page = pfn_to_page(pfn);
+		__init_single_page(page, pfn, zone, nid);
+		if (context == MEMMAP_HOTPLUG)
+			SetPageReserved(page);
+
 		/*
 		 * Mark the block movable so that blocks are reserved for
 		 * movable at startup. This will force kernel allocations
@@ -5405,15 +5404,8 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		 * because this is done early in sparse_add_one_section
 		 */
 		if (!(pfn & (pageblock_nr_pages - 1))) {
-			struct page *page = pfn_to_page(pfn);
-
-			__init_single_page(page, pfn, zone, nid,
-					context != MEMMAP_HOTPLUG);
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
 			cond_resched();
-		} else {
-			__init_single_pfn(pfn, zone, nid,
-					context != MEMMAP_HOTPLUG);
 		}
 	}
 }
diff --git a/mm/sparse.c b/mm/sparse.c
index 7af5e7a92528..d7808307023b 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -30,11 +30,14 @@ struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT]
 #endif
 EXPORT_SYMBOL(mem_section);
 
-#ifdef NODE_NOT_IN_PAGE_FLAGS
+#if defined(NODE_NOT_IN_PAGE_FLAGS) || defined(CONFIG_MEMORY_HOTPLUG)
 /*
  * If we did not store the node number in the page then we have to
  * do a lookup in the section_to_node_table in order to find which
  * node the page belongs to.
+ *
+ * We also use this data in case memory hotplugging is enabled to be
+ * able to determine nid while struct pages are not yet initialized.
  */
 #if MAX_NUMNODES <= 256
 static u8 section_to_node_table[NR_MEM_SECTIONS] __cacheline_aligned;
@@ -42,17 +45,28 @@ static u8 section_to_node_table[NR_MEM_SECTIONS] __cacheline_aligned;
 static u16 section_to_node_table[NR_MEM_SECTIONS] __cacheline_aligned;
 #endif
 
+#ifdef NODE_NOT_IN_PAGE_FLAGS
 int page_to_nid(const struct page *page)
 {
 	return section_to_node_table[page_to_section(page)];
 }
 EXPORT_SYMBOL(page_to_nid);
+#endif /* NODE_NOT_IN_PAGE_FLAGS */
 
 static void set_section_nid(unsigned long section_nr, int nid)
 {
 	section_to_node_table[section_nr] = nid;
 }
-#else /* !NODE_NOT_IN_PAGE_FLAGS */
+
+/* Return NID for given section number */
+int get_section_nid(unsigned long section_nr)
+{
+	if (WARN_ON(section_nr >= NR_MEM_SECTIONS))
+		return 0;
+	return section_to_node_table[section_nr];
+}
+EXPORT_SYMBOL(get_section_nid);
+#else /* ! (NODE_NOT_IN_PAGE_FLAGS || CONFIG_MEMORY_HOTPLUG) */
 static inline void set_section_nid(unsigned long section_nr, int nid)
 {
 }
@@ -816,7 +830,14 @@ int __meminit sparse_add_one_section(struct pglist_data *pgdat,
 		goto out;
 	}
 
-	memset(memmap, 0, sizeof(struct page) * PAGES_PER_SECTION);
+#ifdef CONFIG_DEBUG_VM
+	/*
+	 * poison uninitialized struct pages in order to catch invalid flags
+	 * combinations.
+	 */
+	memset(memmap, PAGE_POISON_PATTERN,
+	       sizeof(struct page) * PAGES_PER_SECTION);
+#endif
 
 	section_mark_present(ms);
 
@@ -827,6 +848,8 @@ int __meminit sparse_add_one_section(struct pglist_data *pgdat,
 	if (ret <= 0) {
 		kfree(usemap);
 		__kfree_section_memmap(memmap, altmap);
+	} else {
+		set_section_nid(section_nr, pgdat->node_id);
 	}
 	return ret;
 }
-- 
2.16.1

 }
-- 
2.16.1

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 0/4] optimize memory hotplug
  2018-02-13 19:31 ` Pavel Tatashin
@ 2018-02-13 21:53   ` Andrew Morton
  -1 siblings, 0 replies; 38+ messages in thread
From: Andrew Morton @ 2018-02-13 21:53 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: steven.sistare, daniel.m.jordan, mgorman, mhocko, linux-mm,
	linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86,
	dan.j.williams, kirill.shutemov, bhe

On Tue, 13 Feb 2018 14:31:55 -0500 Pavel Tatashin <pasha.tatashin@oracle.com> wrote:

> This patchset:
> - Improves hotplug performance by eliminating a number of
> struct page traverses during memory hotplug.
> 
> - Fixes some issues with hotplugging, where boundaries
> were not properly checked. And on x86 block size was not properly aligned
> with end of memory
> 
> - Also, potentially improves boot performance by eliminating condition from
>   __init_single_page().
> 
> - Adds robustness by verifying that that struct pages are correctly
>   poisoned when flags are accessed.

I'm now attempting to get a 100% review rate on MM patches, which is
why I started adding my Reviewed-by: when I do that thing.

I'm not familiar enough with this code to add my own Reviewed-by:, and
we'll need to figure out what to do in such cases.  I shall be sending
out periodic review-status summaries.

If you're able to identify a suitable reviewer for this work and to
offer them beer, that would help.  Let's see what happens as the weeks
unfold.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 0/4] optimize memory hotplug
  2018-02-13 21:53   ` Andrew Morton
@ 2018-02-14  8:09     ` Ingo Molnar
  -1 siblings, 0 replies; 38+ messages in thread
From: Ingo Molnar @ 2018-02-14  8:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Pavel Tatashin, steven.sistare, daniel.m.jordan, mgorman, mhocko,
	linux-mm, linux-kernel, gregkh, vbabka, bharata, tglx, mingo,
	hpa, x86, dan.j.williams, kirill.shutemov, bhe


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Tue, 13 Feb 2018 14:31:55 -0500 Pavel Tatashin <pasha.tatashin@oracle.com> wrote:
> 
> > This patchset:
> > - Improves hotplug performance by eliminating a number of
> > struct page traverses during memory hotplug.
> > 
> > - Fixes some issues with hotplugging, where boundaries
> > were not properly checked. And on x86 block size was not properly aligned
> > with end of memory
> > 
> > - Also, potentially improves boot performance by eliminating condition from
> >   __init_single_page().
> > 
> > - Adds robustness by verifying that that struct pages are correctly
> >   poisoned when flags are accessed.
> 
> I'm now attempting to get a 100% review rate on MM patches, which is
> why I started adding my Reviewed-by: when I do that thing.
> 
> I'm not familiar enough with this code to add my own Reviewed-by:, and
> we'll need to figure out what to do in such cases.  I shall be sending
> out periodic review-status summaries.
> 
> If you're able to identify a suitable reviewer for this work and to
> offer them beer, that would help.  Let's see what happens as the weeks
> unfold.

The largest fix, patch #2, looks good to me and fixes a real bug.
Patches #1 and #3 also look good to me (assuming the runtime overhead
added by patch #3 is OK with you):

  Reviewed-by: Ingo Molnar <mingo@kernel.org>

(I suspect patch #1 and patch #2 should also get a Cc: stable.)

Patch #4 is too large to review IMO: it should be split up into as many patches as 
practically possible. That will also help bisectability, should anything break.

Before applying these patches please fix changelog and code comment spelling.

But it's all good stuff AFAICS!

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 0/4] optimize memory hotplug
  2018-02-14  8:09     ` Ingo Molnar
@ 2018-02-14 14:14       ` Pavel Tatashin
  -1 siblings, 0 replies; 38+ messages in thread
From: Pavel Tatashin @ 2018-02-14 14:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Steve Sistare, Daniel Jordan, Mel Gorman,
	Michal Hocko, Linux Memory Management List,
	Linux Kernel Mailing List, Greg Kroah-Hartman, Vlastimil Babka,
	Bharata B Rao, Thomas Gleixner, mingo, hpa, x86, dan.j.williams,
	kirill.shutemov, bhe

Hi Ingo,

Thank you very much for your review. I will address the spelling issues,
and will also try to split patch #4.  Regarding the runtime concern
about patch #3: the extra checking is only performed when both of
the following CONFIGs are enabled:

CONFIG_DEBUG_VM=y
CONFIG_DEBUG_VM_PGFLAGS=y

I do not expect either of these to ever be enabled on production systems.
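
As a quick illustration of why this is free otherwise: with
CONFIG_DEBUG_VM_PGFLAGS=n, the VM_BUG_ON_PGFLAGS() used by the new
PF_POISONED_CHECK() compiles away entirely. A sketch along the lines of
include/linux/mmdebug.h (the exact macro bodies there may differ
slightly):

#ifdef CONFIG_DEBUG_VM_PGFLAGS
#define VM_BUG_ON_PGFLAGS(cond, page)   VM_BUG_ON_PAGE(cond, page)
#else
#define VM_BUG_ON_PGFLAGS(cond, page)   BUILD_BUG_ON_INVALID(cond)
#endif

BUILD_BUG_ON_INVALID() only type-checks its argument, so the poison test
generates no code in production builds.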

Thank you,
Pavel

On Wed, Feb 14, 2018 at 3:09 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andrew Morton <akpm@linux-foundation.org> wrote:
>
>> On Tue, 13 Feb 2018 14:31:55 -0500 Pavel Tatashin <pasha.tatashin@oracle.com> wrote:
>>
>> > This patchset:
>> > - Improves hotplug performance by eliminating a number of
>> > struct page traverses during memory hotplug.
>> >
>> > - Fixes some issues with hotplugging, where boundaries
>> > were not properly checked. And on x86 block size was not properly aligned
>> > with end of memory
>> >
>> > - Also, potentially improves boot performance by eliminating condition from
>> >   __init_single_page().
>> >
>> > - Adds robustness by verifying that that struct pages are correctly
>> >   poisoned when flags are accessed.
>>
>> I'm now attempting to get a 100% review rate on MM patches, which is
>> why I started adding my Reviewed-by: when I do that thing.
>>
>> I'm not familiar enough with this code to add my own Reviewed-by:, and
>> we'll need to figure out what to do in such cases.  I shall be sending
>> out periodic review-status summaries.
>>
>> If you're able to identify a suitable reviewer for this work and to
>> offer them beer, that would help.  Let's see what happens as the weeks
>> unfold.
>
> The largest fix, patch #2, looks good to me and fixes a real bug.
> Patches #1 and #3 also look good to me (assuming the runtime overhead
> added by patch #3 is OK with you):
>
>   Reviewed-by: Ingo Molnar <mingo@kernel.org>
>
> (I suspect patch #1 and patch #2 should also get a Cc: stable.)
>
> Patch #4 is too large to review IMO: it should be split up into as many patches as
> practically possible. That will also help bisectability, should anything break.
>
> Before applying these patches please fix changelog and code comment spelling.
>
> But it's all good stuff AFAICS!
>
> Thanks,
>
>         Ingo
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 1/4] mm/memory_hotplug: enforce block size aligned range check
  2018-02-13 19:31   ` Pavel Tatashin
@ 2018-02-15 11:34     ` Michal Hocko
  -1 siblings, 0 replies; 38+ messages in thread
From: Michal Hocko @ 2018-02-15 11:34 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: steven.sistare, daniel.m.jordan, akpm, mgorman, linux-mm,
	linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86,
	dan.j.williams, kirill.shutemov, bhe

On Tue 13-02-18 14:31:56, Pavel Tatashin wrote:
> Start qemu with the following arguments:
> -m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G
> 
> Which: boots machine with 64G, and adds a device mem1 with 2G which can be
> hotplugged later.
> 
> Also make sure that config has the following turned on:
> CONFIG_MEMORY_HOTPLUG
> CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
> CONFIG_ACPI_HOTPLUG_MEMORY
> 
> From the qemu monitor, hotplug the memory:
> (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
> 
> The operation will fail with the following trace:
> 
> WARNING: CPU: 0 PID: 91 at drivers/base/memory.c:205
> pages_correctly_reserved+0xe6/0x110
> Modules linked in:
> CPU: 0 PID: 91 Comm: systemd-udevd Not tainted 4.16.0-rc1_pt_master #29
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
> RIP: 0010:pages_correctly_reserved+0xe6/0x110
> RSP: 0018:ffffbe5086b53d98 EFLAGS: 00010246
> RAX: ffff9acb3fff3180 RBX: ffff9acaf7646038 RCX: 0000000000000800
> RDX: ffff9acb3fff3000 RSI: 0000000000000218 RDI: 00000000010c0000
> RBP: 0000000001080000 R08: ffffe81f83000040 R09: 0000000001100000
> R10: ffff9acb3fff6000 R11: 0000000000000246 R12: 0000000000080000
> R13: 0000000000000000 R14: ffffbe5086b53f08 R15: ffff9acaf7506f20
> FS:  00007fd7f20da8c0(0000) GS:ffff9acb3fc00000(0000) knlGS:000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fd7f20f2000 CR3: 0000000ff7ac2001 CR4: 00000000001606f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
>  memory_subsys_online+0x44/0xa0
>  device_online+0x51/0x80
>  store_mem_state+0x5e/0xe0
>  kernfs_fop_write+0xfa/0x170
>  __vfs_write+0x2e/0x150
>  ? __inode_security_revalidate+0x47/0x60
>  ? selinux_file_permission+0xd5/0x130
>  ? _cond_resched+0x10/0x20
>  vfs_write+0xa8/0x1a0
>  ? find_vma+0x54/0x60
>  SyS_write+0x4d/0xb0
>  do_syscall_64+0x5d/0x110
>  entry_SYSCALL_64_after_hwframe+0x21/0x86
> RIP: 0033:0x7fd7f0d3a840
> RSP: 002b:00007fff5db77c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007fd7f0d3a840
> RDX: 0000000000000006 RSI: 00007fd7f20f2000 RDI: 0000000000000007
> RBP: 00007fd7f20f2000 R08: 000055db265c4ab0 R09: 00007fd7f20da8c0
> R10: 0000000000000006 R11: 0000000000000246 R12: 000055db265c49d0
> R13: 0000000000000006 R14: 000055db265c5510 R15: 000000000000000b
> Code: fe ff ff 07 00 77 24 48 89 f8 48 c1 e8 17 49 8b 14 c2 48 85 d2 74 14
> 40 0f b6 c6 49 81 c0 00 00 20 00 48 c1 e0 04 48 01 d0 75 93 <0f> ff 31 c0
> c3 b8 01 00 00 00 c3 31 d2 48 c7 c7 b0 32 67 a6 31
> ---[ end trace 6203bc4f1a5d30e8 ]---
> 
> The problem is detected in: drivers/base/memory.c
> static bool pages_correctly_reserved(unsigned long start_pfn)
> 205                 if (WARN_ON_ONCE(!pfn_valid(pfn)))
> 
> This function loops through every section in the newly added memory block
> and verifies that the first pfn is valid, meaning the section exists,
> has a mapping (struct page array), and is online.
> 
> The block size on x86 is usually 128M, but when the machine is booted
> with more than 64G of memory, the block size is changed to 2G:
> $ cat /sys/devices/system/memory/block_size_bytes
> 80000000
> 
> or
> $ dmesg | grep "block size"
> [    0.086469] x86/mm: Memory block size: 2048MB
> 
> During memory hotplug and hotremove we verify that the range is section
> size aligned, but we actually must verify that it is block size aligned,
> because that is the proper unit for hotplug operations. See:
> Documentation/memory-hotplug.txt
> 
> So, when the start_pfn of newly added memory is not block size aligned, we
> can get a memory block in which only part of the sections are properly
> populated.
> 
> In our case the start_pfn starts from the last_pfn (end of physical
> memory).
> $ dmesg | grep last_pfn
> [    0.000000] e820: last_pfn = 0x1040000 max_arch_pfn = 0x400000000
> 
> pfn 0x1040000 corresponds to 65G, which is not 2G aligned!
> 
> The fix is to enforce that memory that is hotplugged and hotremoved is
> block size aligned.
> 
> With this fix, running the above sequence yields the following result:
> (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
> Block size [0x80000000] unaligned hotplug range: start 0x1040000000,
> 							size 0x80000000
> acpi PNP0C80:00: add_memory failed
> acpi PNP0C80:00: acpi_memory_enable_device() error
> acpi PNP0C80:00: Enumeration failure


The whole memblock != section_size thing sucks! It leads to corner cases
like the one you see. There is no real reason why we shouldn't be able to
online a 2G-unaligned memory range. Failing for that reason is just wrong.
The whole thing is just not well thought through and works only for
well-configured cases.

Your patch doesn't address the underlying problem. On the other hand,
checking the memory section here is conceptually incorrect because, as
you say, it is not the hotplug unit, so I am OK with the patch
regardless. At the very least it deserves a big fat TODO to fix this
properly. I am not sure why we insist on the alignment in the first
place. All we should care about is a proper memory-section-based range.
The code is crap and assumes pageblock-aligned starts in some places,
but there shouldn't be anything fundamental preventing that from
changing.
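
For illustration, the arithmetic behind the new check, reduced to a
standalone userspace sketch (IS_ALIGNED is redefined locally to mirror
the kernel macro; the addresses are taken from the report above):

#include <stdio.h>
#include <stdint.h>

#define IS_ALIGNED(x, a)        (((x) & ((a) - 1)) == 0)

int main(void)
{
        uint64_t start = 0x1040000000ULL;       /* 65G: end of boot memory */

        /* a 2G block size rejects this hotplug start address ... */
        printf("2G aligned: %d\n", IS_ALIGNED(start, 0x80000000ULL));   /* 0 */
        /* ... while a 1G block size would accept it */
        printf("1G aligned: %d\n", IS_ALIGNED(start, 0x40000000ULL));   /* 1 */
        return 0;
}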

> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/memory_hotplug.c | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index b2bd52ff7605..565048f496f7 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1083,15 +1083,16 @@ int try_online_node(int nid)
>  
>  static int check_hotplug_memory_range(u64 start, u64 size)
>  {
> -	u64 start_pfn = PFN_DOWN(start);
> +	unsigned long block_sz = memory_block_size_bytes();
> +	u64 block_nr_pages = block_sz >> PAGE_SHIFT;
>  	u64 nr_pages = size >> PAGE_SHIFT;
> +	u64 start_pfn = PFN_DOWN(start);
>  
> -	/* Memory range must be aligned with section */
> -	if ((start_pfn & ~PAGE_SECTION_MASK) ||
> -	    (nr_pages % PAGES_PER_SECTION) || (!nr_pages)) {
> -		pr_err("Section-unaligned hotplug range: start 0x%llx, size 0x%llx\n",
> -				(unsigned long long)start,
> -				(unsigned long long)size);
> +	/* memory range must be block size aligned */
> +	if (!nr_pages || !IS_ALIGNED(start_pfn, block_nr_pages) ||
> +	    !IS_ALIGNED(nr_pages, block_nr_pages)) {
> +		pr_err("Block size [%#lx] unaligned hotplug range: start %#llx, size %#llx",
> +		       block_sz, start, size);
>  		return -EINVAL;
>  	}
>  
> -- 
> 2.16.1
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/4] x86/mm/memory_hotplug: determine block size based on the end of boot memory
  2018-02-13 19:31   ` Pavel Tatashin
@ 2018-02-15 11:37     ` Michal Hocko
  -1 siblings, 0 replies; 38+ messages in thread
From: Michal Hocko @ 2018-02-15 11:37 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: steven.sistare, daniel.m.jordan, akpm, mgorman, linux-mm,
	linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86,
	dan.j.williams, kirill.shutemov, bhe

On Tue 13-02-18 14:31:57, Pavel Tatashin wrote:
> Memory sections are combined into "memory block" chunks. These chunks are
> the units upon which memory can be added and removed.
> 
> On x86, new memory may be added after the end of the boot memory;
> therefore, if the block size does not align with the end of boot memory,
> memory hot-plugging/hot-removing can be broken.
> 
> Currently, whenever the machine is booted with more than 64G, the block
> size is unconditionally increased to 2G from the base 128M in order to
> reduce the number of memory devices in sysfs:
> 	/sys/devices/system/memory/memoryXXX
> 
> But, we must use the largest allowed block size that aligns to the next
> address to be able to hotplug the next block of memory.
> 
> So, when memory is larger than 64G, we check the end address and find the
> largest block size that is still a power of two but smaller than or equal to 2G.
> 
> Before the fix:
> Run qemu with:
> -m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G
> 
> (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
> Block size [0x80000000] unaligned hotplug range: start 0x1040000000,
> 							size 0x80000000
> acpi PNP0C80:00: add_memory failed
> acpi PNP0C80:00: acpi_memory_enable_device() error
> acpi PNP0C80:00: Enumeration failure
> 
> With the fix memory is added successfully, as the block size is set to 1G,
> and therefore aligns with start address 0x1040000000.

I dunno. If the x86 maintainers are OK with this then why not, but I do
not like how x86-specific this is. I would much rather address this by
making the memblock user interface more sane.
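
For reference, the selection loop from the hunk below as a standalone
userspace sketch, run against the example from the changelog
(MIN_MEMORY_BLOCK_SIZE assumed to be the 128M x86 value):

#include <stdio.h>

#define MIN_MEMORY_BLOCK_SIZE   (128UL << 20)   /* x86 section size */
#define MAX_BLOCK_SIZE          (2UL << 30)     /* 2G */

int main(void)
{
        unsigned long boot_mem_end = 0x1040000000UL;    /* 65G: last_pfn << PAGE_SHIFT */
        unsigned long bz;

        /* largest power-of-two block size that aligns to the memory end */
        for (bz = MAX_BLOCK_SIZE; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1)
                if ((boot_mem_end & (bz - 1)) == 0)
                        break;

        printf("block size: %luMB\n", bz >> 20);        /* prints 1024 */
        return 0;
}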

> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> ---
>  arch/x86/mm/init_64.c | 33 +++++++++++++++++++++++++++++----
>  1 file changed, 29 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 1ab42c852069..f7dc80364397 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1326,14 +1326,39 @@ int kern_addr_valid(unsigned long addr)
>  	return pfn_valid(pte_pfn(*pte));
>  }
>  
> +/*
> + * Block size is the minimum quantum of memory which can be hot-plugged or
> + * hot-removed. It must be power of two, and must be equal or larger than
> + * MIN_MEMORY_BLOCK_SIZE.
> + */
> +#define MAX_BLOCK_SIZE (2UL << 30)
> +
> +/* Amount of ram needed to start using large blocks */
> +#define MEM_SIZE_FOR_LARGE_BLOCK (64UL << 30)
> +
>  static unsigned long probe_memory_block_size(void)
>  {
> -	unsigned long bz = MIN_MEMORY_BLOCK_SIZE;
> +	unsigned long boot_mem_end = max_pfn << PAGE_SHIFT;
> +	unsigned long bz;
>  
> -	/* if system is UV or has 64GB of RAM or more, use large blocks */
> -	if (is_uv_system() || ((max_pfn << PAGE_SHIFT) >= (64UL << 30)))
> -		bz = 2UL << 30; /* 2GB */
> +	/* If this is UV system, always set 2G block size */
> +	if (is_uv_system()) {
> +		bz = MAX_BLOCK_SIZE;
> +		goto done;
> +	}
>  
> +	/* Use regular block if RAM is smaller than MEM_SIZE_FOR_LARGE_BLOCK */
> +	if (boot_mem_end < MEM_SIZE_FOR_LARGE_BLOCK) {
> +		bz = MIN_MEMORY_BLOCK_SIZE;
> +		goto done;
> +	}
> +
> +	/* Find the largest allowed block size that aligns to memory end */
> +	for (bz = MAX_BLOCK_SIZE; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1) {
> +		if (IS_ALIGNED(boot_mem_end, bz))
> +			break;
> +	}
> +done:
>  	pr_info("x86/mm: Memory block size: %ldMB\n", bz >> 20);
>  
>  	return bz;
> -- 
> 2.16.1
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 3/4] mm: uninitialized struct page poisoning sanity checking
  2018-02-13 19:31   ` Pavel Tatashin
@ 2018-02-15 11:53     ` Michal Hocko
  -1 siblings, 0 replies; 38+ messages in thread
From: Michal Hocko @ 2018-02-15 11:53 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: steven.sistare, daniel.m.jordan, akpm, mgorman, linux-mm,
	linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86,
	dan.j.williams, kirill.shutemov, bhe

On Tue 13-02-18 14:31:58, Pavel Tatashin wrote:
> During boot we poison struct page memory in order to ensure that no one is
> accessing this memory until the struct pages are initialized in
> __init_single_page().
> 
> This patch adds more scrutiny to this checking, by making sure that flags
> do not equal to poison pattern when the are accessed. The pattern is all

s@the are@they are@

> ones.
> 
> Since, node id is also stored in struct page, and may be accessed quiet

s@quiet@quite@

> early we add the enforcement into page_to_nid() function as well.

It would be worth adding that this applies only to
NODE_NOT_IN_PAGE_FLAGS=n
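
To make the failure mode concrete, here is the new check reduced to a
standalone userspace sketch (struct page stubbed down to its flags
field; PagePoisoned() mirrors the helper added below):

#include <stdio.h>
#include <string.h>

#define PAGE_POISON_PATTERN     ~0ul

struct page { unsigned long flags; };

static int PagePoisoned(const struct page *page)
{
        return page->flags == PAGE_POISON_PATTERN;
}

int main(void)
{
        struct page page;

        /* what memblock_virt_alloc_try_nid_raw() does under CONFIG_DEBUG_VM */
        memset(&page, 0xff, sizeof(page));
        printf("poisoned: %d\n", PagePoisoned(&page));  /* 1: flags == ~0ul */

        page.flags = 0;                                 /* "initialized" page */
        printf("poisoned: %d\n", PagePoisoned(&page));  /* 0 */
        return 0;
}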

> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>

Other than that it looks like a reasonable debugging feature.

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/mm.h         |  4 +++-
>  include/linux/page-flags.h | 22 +++++++++++++++++-----
>  mm/memblock.c              |  2 +-
>  3 files changed, 21 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ad06d42adb1a..ad71136a6494 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -896,7 +896,9 @@ extern int page_to_nid(const struct page *page);
>  #else
>  static inline int page_to_nid(const struct page *page)
>  {
> -	return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
> +	struct page *p = (struct page *)page;
> +
> +	return (PF_POISONED_CHECK(p)->flags >> NODES_PGSHIFT) & NODES_MASK;
>  }
>  #endif
>  
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 50c2b8786831..5d5493e1f7ba 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -156,9 +156,18 @@ static __always_inline int PageCompound(struct page *page)
>  	return test_bit(PG_head, &page->flags) || PageTail(page);
>  }
>  
> +#define	PAGE_POISON_PATTERN	~0ul
> +static inline int PagePoisoned(const struct page *page)
> +{
> +	return page->flags == PAGE_POISON_PATTERN;
> +}
> +
>  /*
>   * Page flags policies wrt compound pages
>   *
> + * PF_POISONED_CHECK
> + *     check if this struct page poisoned/uninitialized
> + *
>   * PF_ANY:
>   *     the page flag is relevant for small, head and tail pages.
>   *
> @@ -176,17 +185,20 @@ static __always_inline int PageCompound(struct page *page)
>   * PF_NO_COMPOUND:
>   *     the page flag is not relevant for compound pages.
>   */
> -#define PF_ANY(page, enforce)	page
> -#define PF_HEAD(page, enforce)	compound_head(page)
> +#define PF_POISONED_CHECK(page) ({					\
> +		VM_BUG_ON_PGFLAGS(PagePoisoned(page), page);		\
> +		page;})
> +#define PF_ANY(page, enforce)	PF_POISONED_CHECK(page)
> +#define PF_HEAD(page, enforce)	PF_POISONED_CHECK(compound_head(page))
>  #define PF_ONLY_HEAD(page, enforce) ({					\
>  		VM_BUG_ON_PGFLAGS(PageTail(page), page);		\
> -		page;})
> +		PF_POISONED_CHECK(page);})
>  #define PF_NO_TAIL(page, enforce) ({					\
>  		VM_BUG_ON_PGFLAGS(enforce && PageTail(page), page);	\
> -		compound_head(page);})
> +		PF_POISONED_CHECK(compound_head(page));})
>  #define PF_NO_COMPOUND(page, enforce) ({				\
>  		VM_BUG_ON_PGFLAGS(enforce && PageCompound(page), page);	\
> -		page;})
> +		PF_POISONED_CHECK(page);})
>  
>  /*
>   * Macros to create function definitions for page flags
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 5a9ca2a1751b..d85c8754e0ce 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1373,7 +1373,7 @@ void * __init memblock_virt_alloc_try_nid_raw(
>  					   min_addr, max_addr, nid);
>  #ifdef CONFIG_DEBUG_VM
>  	if (ptr && size > 0)
> -		memset(ptr, 0xff, size);
> +		memset(ptr, PAGE_POISON_PATTERN, size);
>  #endif
>  	return ptr;
>  }
> -- 
> 2.16.1
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 4/4] mm/memory_hotplug: optimize memory hotplug
  2018-02-13 19:31   ` Pavel Tatashin
@ 2018-02-15 12:43     ` Michal Hocko
  -1 siblings, 0 replies; 38+ messages in thread
From: Michal Hocko @ 2018-02-15 12:43 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: steven.sistare, daniel.m.jordan, akpm, mgorman, linux-mm,
	linux-kernel, gregkh, vbabka, bharata, tglx, mingo, hpa, x86,
	dan.j.williams, kirill.shutemov, bhe

On Tue 13-02-18 14:31:59, Pavel Tatashin wrote:
[...]
> @@ -201,21 +202,24 @@ static bool pages_correctly_reserved(unsigned long start_pfn)
>  	 * SPARSEMEM_VMEMMAP. We lookup the page once per section
>  	 * and assume memmap is contiguous within each section
>  	 */
> -	for (i = 0; i < sections_per_block; i++, pfn += PAGES_PER_SECTION) {
> +	for (; section_nr < section_nr_end; section_nr++) {
>  		if (WARN_ON_ONCE(!pfn_valid(pfn)))
>  			return false;
> -		page = pfn_to_page(pfn);
> -
> -		for (j = 0; j < PAGES_PER_SECTION; j++) {
> -			if (PageReserved(page + j))
> -				continue;
> -
> -			printk(KERN_WARNING "section number %ld page number %d "
> -				"not reserved, was it already online?\n",
> -				pfn_to_section_nr(pfn), j);
>  
> +		if (!present_section_nr(section_nr)) {
> +			pr_warn("section %ld pfn[%lx, %lx) not present",
> +				section_nr, pfn, pfn + PAGES_PER_SECTION);
> +			return false;
> +		} else if (!valid_section_nr(section_nr)) {
> +			pr_warn("section %ld pfn[%lx, %lx) no valid memmap",
> +				section_nr, pfn, pfn + PAGES_PER_SECTION);
> +			return false;
> +		} else if (online_section_nr(section_nr)) {
> +			pr_warn("section %ld pfn[%lx, %lx) is already online",
> +				section_nr, pfn, pfn + PAGES_PER_SECTION);
>  			return false;
>  		}
> +		pfn += PAGES_PER_SECTION;
>  	}

This should be a separate patch IMHO. It is an optimization on its own.
The original code tries to be sparsemem-neutral, but we depend on
sparsemem anyway.

[...]
>  /* register memory section under specified node if it spans that node */
> -int register_mem_sect_under_node(struct memory_block *mem_blk, int nid)
> +int register_mem_sect_under_node(struct memory_block *mem_blk, int nid,
> +				 bool check_nid)

This check_nid begs for documentation. When do we need to set it? I can
see that the register_new_memory path doesn't check the node id. It is
quite reasonable to expect that a new memblock doesn't span multiple
NUMA nodes, which can be the case for register_one_node, but a word or
two are really due.
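
Hypothetically, something along these lines above the function (the
wording is mine, and the hot-add behaviour is inferred from the hunks
below):

/*
 * register_mem_sect_under_node - link a memory block's sections under @nid
 * @check_nid:  true from link_mem_sections(): struct pages are already
 *              initialized, so verify each section's nid via
 *              get_nid_for_pfn() before creating the sysfs link.
 *              false from the hot-add path: struct pages are still
 *              poisoned, and a newly added block is not expected to
 *              span nodes, so trust the caller-provided @nid.
 */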

>  {
>  	int ret;
>  	unsigned long pfn, sect_start_pfn, sect_end_pfn;
> @@ -423,11 +424,13 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid)
>  			continue;
>  		}
>  
> -		page_nid = get_nid_for_pfn(pfn);
> -		if (page_nid < 0)
> -			continue;
> -		if (page_nid != nid)
> -			continue;
> +		if (check_nid) {
> +			page_nid = get_nid_for_pfn(pfn);
> +			if (page_nid < 0)
> +				continue;
> +			if (page_nid != nid)
> +				continue;
> +		}
>  		ret = sysfs_create_link_nowarn(&node_devices[nid]->dev.kobj,
>  					&mem_blk->dev.kobj,
>  					kobject_name(&mem_blk->dev.kobj));
> @@ -502,7 +505,7 @@ int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages)
>  
>  		mem_blk = find_memory_block_hinted(mem_sect, mem_blk);
>  
> -		ret = register_mem_sect_under_node(mem_blk, nid);
> +		ret = register_mem_sect_under_node(mem_blk, nid, true);
>  		if (!err)
>  			err = ret;
>  

I would be tempted to split this into a separate patch as well. The
review will be much easier.

[...]
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 7af5e7a92528..d7808307023b 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -30,11 +30,14 @@ struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT]
>  #endif
>  EXPORT_SYMBOL(mem_section);
>  
> -#ifdef NODE_NOT_IN_PAGE_FLAGS
> +#if defined(NODE_NOT_IN_PAGE_FLAGS) || defined(CONFIG_MEMORY_HOTPLUG)
>  /*
>   * If we did not store the node number in the page then we have to
>   * do a lookup in the section_to_node_table in order to find which
>   * node the page belongs to.
> + *
> + * We also use this data in case memory hotplugging is enabled to be
> + * able to determine nid while struct pages are not yet initialized.
>   */
>  #if MAX_NUMNODES <= 256
>  static u8 section_to_node_table[NR_MEM_SECTIONS] __cacheline_aligned;
> @@ -42,17 +45,28 @@ static u8 section_to_node_table[NR_MEM_SECTIONS] __cacheline_aligned;
>  static u16 section_to_node_table[NR_MEM_SECTIONS] __cacheline_aligned;
>  #endif
>  
> +#ifdef NODE_NOT_IN_PAGE_FLAGS
>  int page_to_nid(const struct page *page)
>  {
>  	return section_to_node_table[page_to_section(page)];
>  }
>  EXPORT_SYMBOL(page_to_nid);
> +#endif /* NODE_NOT_IN_PAGE_FLAGS */

This is quite ugly. You allocate 256MB for small NUMA systems and 512MB
for larger NUMAs unconditionally for MEMORY_HOTPLUG. I see you need it
to safely replace page_to_nid by get_section_nid, but this is just too
high a price. Please note that this shouldn't really be needed, at
least not for onlining. We already _do_ know the node association of
the pfn range, so we should be able to get the nid from memblock.
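
A rough sketch of that direction (hypothetical helper; assumes
CONFIG_HAVE_MEMBLOCK_NODE_MAP so memblock_search_pfn_nid() is available,
and that the hot-added range is represented in memblock):

static int section_nr_to_nid(unsigned long section_nr)
{
        unsigned long pfn = section_nr_to_pfn(section_nr);
        unsigned long start_pfn, end_pfn;

        /* nid of the memblock region containing pfn */
        return memblock_search_pfn_nid(pfn, &start_pfn, &end_pfn);
}

That would avoid growing section_to_node_table for every
MEMORY_HOTPLUG=y kernel.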

[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 1/4] mm/memory_hotplug: enforce block size aligned range check
  2018-02-15 11:34     ` Michal Hocko
@ 2018-02-15 13:36       ` Pavel Tatashin
  -1 siblings, 0 replies; 38+ messages in thread
From: Pavel Tatashin @ 2018-02-15 13:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Steve Sistare, Daniel Jordan, Andrew Morton, Mel Gorman,
	Linux Memory Management List, Linux Kernel Mailing List,
	Greg Kroah-Hartman, Vlastimil Babka, Bharata B Rao,
	Thomas Gleixner, mingo, hpa, x86, dan.j.williams,
	kirill.shutemov, bhe

Hi Michal,

Thank you very much for your reviews and for Acking this patch.

>
> The whole memblock != section_size sucks! It leads to corner cases like
> you see. There is no real reason why we shouldn't be able to online a
> 2G unaligned memory range. Failing for that purpose is just wrong. The
> whole thing is just not well thought through and works only for
> well-configured cases.

Hotplug operates over memory blocks, and it seems that conceptually
memory blocks are OK: their sizes are defined by arch, and may
represent a pluggable dimm (on virtual machines it is a different
story though). If we forced memory blocks to be equal to section size,
that would force us to handle millions of memory devices in sysfs,
which would not scale well.

>
> Your patch doesn't address the underlying problem.

What is the underlying problem? The hotplug operation was allowed, but
we ended up with a half-populated memory block, which is broken. The
patch solves this problem by making sure that this is never the case
for any arch, no matter what block size is defined as the unit of
hotplugging.
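
For reference, a minimal sketch of such a block-size-aligned range
check (reconstructed from the description above, not quoted from the
patch):

static int check_hotplug_memory_range(u64 start, u64 size)
{
	unsigned long block_sz = memory_block_size_bytes();
	u64 block_nr_pages = block_sz >> PAGE_SHIFT;
	u64 start_pfn = PFN_DOWN(start);
	u64 nr_pages = size >> PAGE_SHIFT;

	/* The hotplugged range must be block size aligned. */
	if (!nr_pages || !IS_ALIGNED(start_pfn, block_nr_pages) ||
	    !IS_ALIGNED(nr_pages, block_nr_pages)) {
		pr_err("Block size [%#lx] unaligned hotplug range: start %#llx, size %#llx\n",
		       block_sz, start, size);
		return -EINVAL;
	}

	return 0;
}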

> On the other hand, it
> is incorrect to check memory section here conceptually because this is
> not a hotplug unit as you say so I am OK with the patch regardless. It
> deserves a big fat TODO to fix this properly at least. I am not sure why
> we insist on the alignment in the first place. All we should care about
> is the proper memory section based range. The code is crap and it
> assumes pageblock start aligned at some places but there shouldn't be
> anything fundamental to change that.

So, if I understand correctly, ideally you would like to redefine unit
of memory hotplug to be equal to section size?

>
>> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
>
> Acked-by: Michal Hocko <mhocko@suse.com>

Thank you,
Pavel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/4] x86/mm/memory_hotplug: determine block size based on the end of boot memory
  2018-02-15 11:37     ` Michal Hocko
@ 2018-02-15 13:39       ` Pavel Tatashin
  -1 siblings, 0 replies; 38+ messages in thread
From: Pavel Tatashin @ 2018-02-15 13:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Steve Sistare, Daniel Jordan, Andrew Morton, Mel Gorman,
	Linux Memory Management List, Linux Kernel Mailing List,
	Greg Kroah-Hartman, Vlastimil Babka, Bharata B Rao,
	Thomas Gleixner, mingo, hpa, x86, dan.j.williams,
	kirill.shutemov, bhe

> I dunno. If x86 maintainers are OK with this then why not, but I do not
> like how this is x86 specific. I would much rather address this by
> making the memblock user interface more sane.
>

Hi Michal,

Ingo Molnar reviewed this patch and okayed it.

Thank you,
Pavel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 3/4] mm: uninitialized struct page poisoning sanity checking
  2018-02-15 11:53     ` Michal Hocko
@ 2018-02-15 13:41       ` Pavel Tatashin
  -1 siblings, 0 replies; 38+ messages in thread
From: Pavel Tatashin @ 2018-02-15 13:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Steve Sistare, Daniel Jordan, Andrew Morton, Mel Gorman,
	Linux Memory Management List, Linux Kernel Mailing List,
	Greg Kroah-Hartman, Vlastimil Babka, Bharata B Rao,
	Thomas Gleixner, mingo, hpa, x86, dan.j.williams,
	kirill.shutemov, bhe

> Acked-by: Michal Hocko <mhocko@suse.com>

Thank you, I will do the changes that you requested.

Pavel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 4/4] mm/memory_hotplug: optimize memory hotplug
  2018-02-15 12:43     ` Michal Hocko
@ 2018-02-15 13:46       ` Pavel Tatashin
  -1 siblings, 0 replies; 38+ messages in thread
From: Pavel Tatashin @ 2018-02-15 13:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Steve Sistare, Daniel Jordan, Andrew Morton, Mel Gorman,
	Linux Memory Management List, Linux Kernel Mailing List,
	Greg Kroah-Hartman, Vlastimil Babka, Bharata B Rao,
	Thomas Gleixner, mingo, hpa, x86, dan.j.williams,
	kirill.shutemov, bhe

> This should be a separate patch IMHO. It is an optimization on its
> own. The original code tries to be sparse neutral but we do depend on
> sparse anyway.

Yes, Mingo already asked me to split this patch. I've done just that
and will send it out soon.

>
> [...]
>>  /* register memory section under specified node if it spans that node */
>> -int register_mem_sect_under_node(struct memory_block *mem_blk, int nid)
>> +int register_mem_sect_under_node(struct memory_block *mem_blk, int nid,
>> +                              bool check_nid)
>
> This check_nid begs for documentation. When do we need to set it? I
> can see that the register_new_memory path doesn't check the node id.
> It is quite reasonable to expect that a new memblock doesn't span
> multiple NUMA nodes, which can be the case for register_one_node, but
> a word or two are really due.

OK, I will add a comment, and BTW, this is also going to be a separate
patch for ease of review.
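
A possible shape for that comment (wording assumed here, not quoted
from the eventual patch):

/*
 * register_mem_sect_under_node() - link a memory block's sections
 * under @nid in sysfs.
 * @check_nid: if true, verify via get_nid_for_pfn() that each pfn in
 * the block really belongs to @nid before linking; needed on paths
 * such as register_one_node() where a block may span nodes. If false,
 * trust the caller's @nid; used when hot-adding memory whose struct
 * pages are not yet initialized and so cannot be inspected.
 */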

>
>>  {
>>       int ret;
>>       unsigned long pfn, sect_start_pfn, sect_end_pfn;
>> @@ -423,11 +424,13 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid)
>>                       continue;
>>               }
>>
>> -             page_nid = get_nid_for_pfn(pfn);
>> -             if (page_nid < 0)
>> -                     continue;
>> -             if (page_nid != nid)
>> -                     continue;
>> +             if (check_nid) {
>> +                     page_nid = get_nid_for_pfn(pfn);
>> +                     if (page_nid < 0)
>> +                             continue;
>> +                     if (page_nid != nid)
>> +                             continue;
>> +             }
>>               ret = sysfs_create_link_nowarn(&node_devices[nid]->dev.kobj,
>>                                       &mem_blk->dev.kobj,
>>                                       kobject_name(&mem_blk->dev.kobj));
>> @@ -502,7 +505,7 @@ int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages)
>>
>>               mem_blk = find_memory_block_hinted(mem_sect, mem_blk);
>>
>> -             ret = register_mem_sect_under_node(mem_blk, nid);
>> +             ret = register_mem_sect_under_node(mem_blk, nid, true);
>>               if (!err)
>>                       err = ret;
>>
>
> I would be tempted to split this into a separate patch as well. The
> review will be much easier.

Yes, but that would be the last patch in the series.

> This is quite ugly. You allocate 256MB for small NUMA systems and 512MB
> for larger NUMAs unconditionally for MEMORY_HOTPLUG. I see you need it
> to safely replace page_to_nid by get_section_nid, but this is just too
> high a price. Please note that this shouldn't really be needed, at
> least not for onlining: we already _do_ know the node association with
> the pfn range, so we should be able to get the nid from memblock.

OK, I will think of a different place to store the nid temporarily, or
of another way to get it.

Thank you,
Pavel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 1/4] mm/memory_hotplug: enforce block size aligned range check
  2018-02-15 13:36       ` Pavel Tatashin
@ 2018-02-15 14:40         ` Michal Hocko
  -1 siblings, 0 replies; 38+ messages in thread
From: Michal Hocko @ 2018-02-15 14:40 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: Steve Sistare, Daniel Jordan, Andrew Morton, Mel Gorman,
	Linux Memory Management List, Linux Kernel Mailing List,
	Greg Kroah-Hartman, Vlastimil Babka, Bharata B Rao,
	Thomas Gleixner, mingo, hpa, x86, dan.j.williams,
	kirill.shutemov, bhe

On Thu 15-02-18 08:36:00, Pavel Tatashin wrote:
> Hi Michal,
> 
> Thank you very much for your reviews and for Acking this patch.
> 
> >
> > The whole memblock != section_size sucks! It leads to corner cases like
> > you see. There is no real reason why we shouldn't be able to online a
> > 2G unaligned memory range. Failing for that purpose is just wrong. The
> > whole thing is just not well thought through and works only for
> > well-configured cases.
> 
> Hotplug operates over memory blocks, and it seems that conceptually
> memory blocks are OK: their sizes are defined by arch, and may
> represent a pluggable dimm (on virtual machines it is a different
> story though). If we forced memory blocks to be equal to section size,
> that would force us to handle millions of memory devices in sysfs,
> which would not scale well.

Yes, I am very well aware of the reason why the memblock is larger on
larger systems. I was merely ranting about the way it has been added to
the existing code.

> > Your patch doesn't address the underlying problem.
> 
> What is the underlying problem? The hotplug operation was allowed, but
> we ended up with a half-populated memory block, which is broken. The
> patch solves this problem by making sure that this is never the case
> for any arch, no matter what block size is defined as the unit of
> hotplugging.

The underlying problem is that we require some alignment here. There
shouldn't be any reason to do so. The only requirement dictated by the
memory model is the size of the section.
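
In other words, the minimal check the memory model would dictate is
section granularity only (illustrative sketch, variable names assumed,
not code from the thread):

	/* Only section alignment is fundamental. */
	if (!nr_pages || !IS_ALIGNED(start_pfn, PAGES_PER_SECTION) ||
	    !IS_ALIGNED(nr_pages, PAGES_PER_SECTION))
		return -EINVAL;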

> > On the other hand, it
> > is incorrect to check memory section here conceptually because this is
> > not a hotplug unit as you say so I am OK with the patch regardless. It
> > deserves a big fat TODO to fix this properly at least. I am not sure why
> > we insist on the alignment in the first place. All we should care about
> > is the proper memory section based range. The code is crap and it
> > assumes pageblock start aligned at some places but there shouldn't be
> > anything fundamental to change that.
> 
> So, if I understand correctly, ideally you would like to redefine unit
> of memory hotplug to be equal to section size?

No, not really. I just think the alignment shouldn't really matter. Each
memory block should simply represent a hotpluggable entity with a
well-defined pfn start and size (in multiples of the section size). This
is in fact what we do internally anyway. One problem might be that
existing userspace might depend on the existing size restrictions, so we
might not be able to have variable block sizes. But the block size
alignment should be fixable.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 1/4] mm/memory_hotplug: enforce block size aligned range check
  2018-02-15 14:40         ` Michal Hocko
@ 2018-02-15 15:05           ` Pavel Tatashin
  -1 siblings, 0 replies; 38+ messages in thread
From: Pavel Tatashin @ 2018-02-15 15:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Steve Sistare, Daniel Jordan, Andrew Morton, Mel Gorman,
	Linux Memory Management List, Linux Kernel Mailing List,
	Greg Kroah-Hartman, Vlastimil Babka, Bharata B Rao,
	Thomas Gleixner, mingo, hpa, x86, dan.j.williams,
	kirill.shutemov, bhe

> No, not really. I just think the alignment shouldn't really matter. Each
> memory block should simply represent a hotpluggable entity with a
> well-defined pfn start and size (in multiples of the section size). This
> is in fact what we do internally anyway. One problem might be that
> existing userspace might depend on the existing size restrictions, so we
> might not be able to have variable block sizes. But the block size
> alignment should be fixable.
> 
Hi Michal,

I see what you mean, and I agree Linux should simply honor reasonable 
requests from HW/HV.

On x86 qemu the hotpluggable entity is 128M; on sun4v SPARC it is 256M.
With the current scheme we would still end up with a huge number of
memory devices in sysfs if the block size were fixed and equal to the
minimum hotpluggable entity. Just as an example, SPARC sun4v may have
logical domains up to 32T; with 256M granularity that is 131K files in
/sys/devices/system/memory/!

But if it is variable, I am not sure how to solve it. The whole
interface would have to be redefined, because even if we hotplugged a
highly aligned large chunk of memory and created only one memory device
for it, we would still need a way to remove just a small piece of that
memory if the underlying HV/HW requested it.

/sys/devices/system/memory/block_size_bytes

would have to be moved into each memory block, and

echo offline > /sys/devices/system/memory/memoryXXX/state

would need to be redefined somehow to work on only part of the block.
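
Purely as a hypothetical illustration of the first point (nothing like
this exists; the attribute name and layout are assumptions), a
per-block size attribute could look like:

/* Hypothetical per-block "phys_size" attribute in drivers/base/memory.c. */
static ssize_t phys_size_show(struct device *dev,
			      struct device_attribute *attr, char *buf)
{
	struct memory_block *mem = to_memory_block(dev);
	unsigned long sections = mem->end_section_nr -
				 mem->start_section_nr + 1;

	return sprintf(buf, "%#lx\n", sections * MIN_MEMORY_BLOCK_SIZE);
}
static DEVICE_ATTR_RO(phys_size);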

I am not really sure what a good solution would be without breaking the 
userspace.

Thank you,
Pavel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/4] x86/mm/memory_hotplug: determine block size based on the end of boot memory
  2018-02-15 13:39       ` Pavel Tatashin
@ 2018-02-15 19:00         ` Ingo Molnar
  -1 siblings, 0 replies; 38+ messages in thread
From: Ingo Molnar @ 2018-02-15 19:00 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: Michal Hocko, Steve Sistare, Daniel Jordan, Andrew Morton,
	Mel Gorman, Linux Memory Management List,
	Linux Kernel Mailing List, Greg Kroah-Hartman, Vlastimil Babka,
	Bharata B Rao, Thomas Gleixner, mingo, hpa, x86, dan.j.williams,
	kirill.shutemov, bhe


* Pavel Tatashin <pasha.tatashin@oracle.com> wrote:

> > I dunno. If x86 maintainers are OK with this then why not, but I do not
> > like how this is x86 specific. I would much rather address this by
> > making the memblock user interface more sane.
> >
> 
> Hi Michal,
> 
> Ingo Molnar reviewed this patch, and Okayed it.

But I'd not be against robustifying the whole generic interface against such 
misconfiguration either.

But having the warning should be enough in practice, right?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread

Thread overview: 38+ messages
2018-02-13 19:31 [PATCH v3 0/4] optimize memory hotplug Pavel Tatashin
2018-02-13 19:31 ` [PATCH v3 1/4] mm/memory_hotplug: enforce block size aligned range check Pavel Tatashin
2018-02-15 11:34   ` Michal Hocko
2018-02-15 13:36     ` Pavel Tatashin
2018-02-15 14:40       ` Michal Hocko
2018-02-15 15:05         ` Pavel Tatashin
2018-02-13 19:31 ` [PATCH v3 2/4] x86/mm/memory_hotplug: determine block size based on the end of boot memory Pavel Tatashin
2018-02-15 11:37   ` Michal Hocko
2018-02-15 13:39     ` Pavel Tatashin
2018-02-15 19:00       ` Ingo Molnar
2018-02-13 19:31 ` [PATCH v3 3/4] mm: uninitialized struct page poisoning sanity checking Pavel Tatashin
2018-02-15 11:53   ` Michal Hocko
2018-02-15 13:41     ` Pavel Tatashin
2018-02-13 19:31 ` [PATCH v3 4/4] mm/memory_hotplug: optimize memory hotplug Pavel Tatashin
2018-02-15 12:43   ` Michal Hocko
2018-02-15 13:46     ` Pavel Tatashin
2018-02-13 21:53 ` [PATCH v3 0/4] " Andrew Morton
2018-02-14  8:09   ` Ingo Molnar
2018-02-14 14:14     ` Pavel Tatashin
