linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH v2 0/4] mm, memory_hotplug: allocate memmap from hotadded memory
@ 2019-01-22 10:37 Oscar Salvador
  2019-01-22 10:37 ` [RFC PATCH v2 1/4] mm, memory_hotplug: cleanup memory offline path Oscar Salvador
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: Oscar Salvador @ 2019-01-22 10:37 UTC (permalink / raw)
  To: linux-mm
  Cc: mhocko, dan.j.williams, Pavel.Tatashin, david, linux-kernel,
	dave.hansen, Oscar Salvador

Hi,

This is v2 of the RFC I sent back in October [1].
In this new version I tried to reduce the complexity as much as possible,
plus some cleanups.

[Testing]

I have tested it on x86_64 (small/big memblocks) and on powerpc.
On both architectures, hot-add/hot-remove and online/offline operations
worked as expected using vmemmap pages; I have not seen any issues so far.
I wanted to try it out on Hyper-V/Xen, but I did not manage to.
I plan to do so during this week (if time allows).
I would also like to test it on arm64, but I am not sure I can grab
an arm64 box anytime soon.

[Cover letter]:

This is another step to make memory hotplug more usable. The primary
goal of this patchset is to reduce the memory overhead of hot-added
memory (at least for the SPARSE_VMEMMAP memory model). The way we currently
populate the memmap (struct page array) has two main drawbacks:

a) it consumes additional memory until the hotadded memory itself is
   onlined, and
b) the memmap might end up on a different NUMA node, which is especially
   true for the movable_node configuration.

a) is a problem especially for memory hotplug based memory "ballooning"
   solutions, where the delay between the physical memory hotplug and the
   onlining can lead to OOM, which led to the introduction of hacks like auto
   onlining (see 31bc3858ea3e ("memory-hotplug: add automatic onlining
   policy for the newly added memory")).

b) can have performance drawbacks.

I have also seen hot-add operations failing on powerpc due to the fact
that we try to use order-8 pages when populating the memmap array.
Given a 64KB base page size, that is 16MB.
If we run out of those, we just fail the operation and cannot add
more memory.
We could fall back to base pages as x86_64 does, but we can do better.

One way to mitigate all these issues is to simply allocate the memmap array
(which is the largest memory footprint of the physical memory hotplug)
from the hotadded memory itself. The SPARSE_VMEMMAP memory model allows us to
map any pfn range, so the memory doesn't need to be online to be usable
for the array. See patch 3 for more details. In short, I am reusing the
existing vmem_altmap which achieves the same thing for nvdimm
device memory.
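
For reference, struct vmem_altmap basically describes a pool of pfns at the
start of the hot-added range that the vmemmap population code can carve the
memmap from instead of using the page allocator. A minimal sketch of how the
hotplug side points it at the range (variable names here are illustrative;
patch 3 does this in init_altmap_memmap()):

	struct vmem_altmap altmap = {};

	/* serve memmap allocations from the beginning of the hot-added range */
	altmap.base_pfn = start_pfn;	/* first pfn of the hot-added range */
	altmap.free     = nr_pages;	/* pfns the memmap may be carved from */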

There is also one potential drawback, though. If somebody uses memory
hotplug for 1G (gigantic) hugetlb pages then this scheme will obviously not
work for them, because each memory block will contain a reserved
area. Large x86 machines will use 2G memblocks, so at least one 1G page
will be available, but this is still not 2G...

I am not really sure somebody does that, nor how reliably that can work
in practice. Nevertheless, I _believe_ that onlining more memory into
virtual machines is a much more common use case. Anyway, if there ever is a
strong demand for such a use case we have basically three options: a) enlarge
memory blocks even more, b) enhance the altmap allocation strategy and reuse
low memory sections to host memmaps of other sections on the same NUMA
node, or c) make the memmap allocation strategy configurable to fall back to
the current allocation.
 
[Overall design]:

Let us say we hot-add 2GB of memory on x86_64 (memblock size = 128M).
That is:

 - 16 sections
 - 524288 pages
 - 8192 vmemmap pages (out of those 524288; we spend 512 pages per section)
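
 For reference, the numbers above work out as follows (assuming 4KB base
 pages and a 64-byte struct page):

   2GB / 128MB per memblock            = 16 sections
   2GB / 4KB per page                  = 524288 pages
   524288 pages * 64B per struct page  = 32MB of memmap = 8192 pages
   8192 vmemmap pages / 16 sections    = 512 pages per section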

 The range of pages is: 0xffffea0004000000 - 0xffffea0006000000
 The vmemmap range is:  0xffffea0004000000 - 0xffffea0004080000

 0xffffea0004000000 is the head vmemmap page (first page), while all the others
 are "tails".

 We keep the following information in these pages:

 - Head page:
   - head->_refcount: number of sections
   - head->private :  number of vmemmap pages
 - Tail page:
   - tail->freelist : pointer to the head

This is done because it eases the work in places where we have to compute the
number of vmemmap pages to know how much we have to skip, etc., and to keep
the present_pages accounting right.
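
As an illustration, a pfn walker that stumbles upon a vmemmap page can use this
metadata to figure out how many pages to skip in one go. A minimal sketch (the
helper name here is made up; patch 3 implements the same idea in
check_nr_vmemmap_pages()):

	/*
	 * Given a page inside a vmemmap area, return how many vmemmap pages
	 * remain from this page (inclusive) up to the end of the area.
	 */
	static unsigned long nr_vmemmap_pages_left(struct page *page)
	{
		/* every vmemmap page carries a pointer to the head page */
		struct page *head = (struct page *)page->freelist;
		/* the head keeps the total number of vmemmap pages in ->private */
		unsigned long nr_vmemmap_pages = page_private(head);

		return nr_vmemmap_pages - (page - head);
	}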

When we want to hot-remove the range, we need to be careful because the first
pages of that range are used for the memmap mapping, so if we removed those
first, we would blow up while accessing the others later on.
For that reason we keep the number of sections in head->_refcount, to know how
long we have to defer the freeing.

Since sections are removed sequentially in a hot-remove operation, the
approach taken here is that every time we hit free_section_memmap(), we
decrease the refcount of the head.
When it reaches 0, we know that we hit the last section, so we call
vmemmap_free() for the whole memory range backwards, which makes sure that
the pages used for the mapping are the last to be freed.
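
Conceptually, the deferred freeing looks roughly like the sketch below
(simplified from the mm/sparse.c changes in patch 3; the hot-remove path is
serialized by the hotplug lock, so plain statics are enough here):

	static struct page *head_vmemmap_page;	/* head of the range being removed */
	static bool in_vmemmap_range;

	static void free_section_memmap(struct page *memmap,
					struct vmem_altmap *altmap)
	{
		unsigned long start = (unsigned long)memmap;
		unsigned long end = (unsigned long)(memmap + PAGES_PER_SECTION);

		if (PageVmemmap(memmap) && !in_vmemmap_range) {
			/* first section of a self-hosted range: remember the head */
			in_vmemmap_range = true;
			head_vmemmap_page = memmap;
		}

		if (in_vmemmap_range) {
			/* one section gone: drop the refcount kept in the head */
			if (page_ref_dec_and_test(head_vmemmap_page))
				/* last section: free the whole range, backwards */
				free_deferred_vmemmap_range(start, end);
			return;
		}

		vmemmap_free(start, end, altmap);
	}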

The accounting is as follows:

 Vmemmap pages are charged to spanned/present_pages, but not to managed_pages.

I still have to check a couple of things, like adding an accounting item
like VMEMMAP_PAGES to /proc/meminfo to make it easier to spot the memory that
went in there, testing Hyper-V/Xen to see how they react to the fact that
we are using the beginning of the memory range for our own purposes, and
checking the gigantic pages + hotplug case.
I also have to check that there are no compilation/runtime errors with
CONFIG_SPARSEMEM but !CONFIG_SPARSEMEM_VMEMMAP.
But before that, I would like to get people's feedback on the overall
design, and ideas/suggestions.


[1] https://patchwork.kernel.org/cover/10685835/

Michal Hocko (3):
  mm, memory_hotplug: cleanup memory offline path
  mm, memory_hotplug: provide a more generic restrictions for memory
    hotplug
  mm, sparse: rename kmalloc_section_memmap, __kfree_section_memmap

Oscar Salvador (1):
  mm, memory_hotplug: allocate memmap from the added memory range for
    sparse-vmemmap

 arch/arm64/mm/mmu.c            |  10 ++-
 arch/ia64/mm/init.c            |   5 +-
 arch/powerpc/mm/init_64.c      |   7 ++
 arch/powerpc/mm/mem.c          |   6 +-
 arch/s390/mm/init.c            |  12 ++-
 arch/sh/mm/init.c              |   6 +-
 arch/x86/mm/init_32.c          |   6 +-
 arch/x86/mm/init_64.c          |  20 +++--
 drivers/hv/hv_balloon.c        |   1 +
 drivers/xen/balloon.c          |   1 +
 include/linux/memory_hotplug.h |  42 ++++++++--
 include/linux/memremap.h       |   2 +-
 include/linux/page-flags.h     |  23 +++++
 kernel/memremap.c              |   9 +-
 mm/compaction.c                |   8 ++
 mm/memory_hotplug.c            | 186 +++++++++++++++++++++++++++++------------
 mm/page_alloc.c                |  47 ++++++++++-
 mm/page_isolation.c            |  13 +++
 mm/sparse.c                    | 124 +++++++++++++++++++++++++--
 mm/util.c                      |   2 +
 20 files changed, 431 insertions(+), 99 deletions(-)

-- 
2.13.7


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [RFC PATCH v2 1/4] mm, memory_hotplug: cleanup memory offline path
  2019-01-22 10:37 [RFC PATCH v2 0/4] mm, memory_hotplug: allocate memmap from hotadded memory Oscar Salvador
@ 2019-01-22 10:37 ` Oscar Salvador
  2019-01-22 10:37 ` [RFC PATCH v2 2/4] mm, memory_hotplug: provide a more generic restrictions for memory hotplug Oscar Salvador
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Oscar Salvador @ 2019-01-22 10:37 UTC (permalink / raw)
  To: linux-mm
  Cc: mhocko, dan.j.williams, Pavel.Tatashin, david, linux-kernel,
	dave.hansen, Oscar Salvador

From: Michal Hocko <mhocko@suse.com>

check_pages_isolated_cb currently accounts the whole pfn range as being
offlined if test_pages_isolated succeeds on the range. This is based on
the assumption that all pages in the range are freed, which is currently
the case most of the time, but it won't be with later changes. I haven't
double checked, but if the range contains invalid pfns we could
theoretically over-account and underflow the zone's managed pages.

Move the offlined pages counting to offline_isolated_pages_cb and
rely on __offline_isolated_pages to return the correct value.
check_pages_isolated_cb will still do its primary job and check the pfn
range.

While we are at it, remove check_pages_isolated and offline_isolated_pages
and use walk_system_ram_range directly, as online_pages does.
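
After this change, __offline_pages() drives both steps through
walk_system_ram_range() directly, roughly like this (a simplified extract of
the hunks below):

	/* check again that all pages in the range are isolated */
	ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn, NULL,
				    check_pages_isolated_cb);
	...
	/* offline and let the callback accumulate the number of offlined pages */
	walk_system_ram_range(start_pfn, end_pfn - start_pfn, &offlined_pages,
			      offline_isolated_pages_cb);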

Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 include/linux/memory_hotplug.h |  2 +-
 mm/memory_hotplug.c            | 45 +++++++++++-------------------------------
 mm/page_alloc.c                | 11 +++++++++--
 3 files changed, 22 insertions(+), 36 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index d56bfbacf7d6..1a230dde6027 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -85,7 +85,7 @@ extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
 extern int online_pages(unsigned long, unsigned long, int);
 extern int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn,
 	unsigned long *valid_start, unsigned long *valid_end);
-extern void __offline_isolated_pages(unsigned long, unsigned long);
+extern unsigned long __offline_isolated_pages(unsigned long, unsigned long);
 
 typedef int (*online_page_callback_t)(struct page *page, unsigned int order);
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ec22c86d9f89..6efa44087b37 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1451,17 +1451,12 @@ static int
 offline_isolated_pages_cb(unsigned long start, unsigned long nr_pages,
 			void *data)
 {
-	__offline_isolated_pages(start, start + nr_pages);
+	unsigned long offlined_pages;
+	offlined_pages = __offline_isolated_pages(start, start + nr_pages);
+	*(unsigned long *)data += offlined_pages;
 	return 0;
 }
 
-static void
-offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
-{
-	walk_system_ram_range(start_pfn, end_pfn - start_pfn, NULL,
-				offline_isolated_pages_cb);
-}
-
 /*
  * Check all pages in range, recoreded as memory resource, are isolated.
  */
@@ -1469,26 +1464,7 @@ static int
 check_pages_isolated_cb(unsigned long start_pfn, unsigned long nr_pages,
 			void *data)
 {
-	int ret;
-	long offlined = *(long *)data;
-	ret = test_pages_isolated(start_pfn, start_pfn + nr_pages, true);
-	offlined = nr_pages;
-	if (!ret)
-		*(long *)data += offlined;
-	return ret;
-}
-
-static long
-check_pages_isolated(unsigned long start_pfn, unsigned long end_pfn)
-{
-	long offlined = 0;
-	int ret;
-
-	ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn, &offlined,
-			check_pages_isolated_cb);
-	if (ret < 0)
-		offlined = (long)ret;
-	return offlined;
+	return test_pages_isolated(start_pfn, start_pfn + nr_pages, true);
 }
 
 static int __init cmdline_parse_movable_node(char *p)
@@ -1573,7 +1549,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
 		  unsigned long end_pfn)
 {
 	unsigned long pfn, nr_pages;
-	long offlined_pages;
+	unsigned long offlined_pages = 0;
 	int ret, node;
 	unsigned long flags;
 	unsigned long valid_start, valid_end;
@@ -1650,13 +1626,16 @@ static int __ref __offline_pages(unsigned long start_pfn,
 			goto failed_removal_isolated;
 		}
 		/* check again */
-		offlined_pages = check_pages_isolated(start_pfn, end_pfn);
-	} while (offlined_pages < 0);
+		ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn, NULL,
+							check_pages_isolated_cb);
+	} while (ret);
 
-	pr_info("Offlined Pages %ld\n", offlined_pages);
 	/* Ok, all of our target is isolated.
 	   We cannot do rollback at this point. */
-	offline_isolated_pages(start_pfn, end_pfn);
+	walk_system_ram_range(start_pfn, end_pfn - start_pfn, &offlined_pages,
+						offline_isolated_pages_cb);
+
+	pr_info("Offlined Pages %ld\n", offlined_pages);
 	/* reset pagetype flags and makes migrate type to be MOVABLE */
 	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
 	/* removal success */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d7a521971a05..cad7468a0f20 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8479,7 +8479,7 @@ void zone_pcp_reset(struct zone *zone)
  * All pages in the range must be in a single zone and isolated
  * before calling this.
  */
-void
+unsigned long
 __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 {
 	struct page *page;
@@ -8487,12 +8487,15 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 	unsigned int order, i;
 	unsigned long pfn;
 	unsigned long flags;
+	unsigned long offlined_pages = 0;
+
 	/* find the first valid pfn */
 	for (pfn = start_pfn; pfn < end_pfn; pfn++)
 		if (pfn_valid(pfn))
 			break;
 	if (pfn == end_pfn)
-		return;
+		return offlined_pages;
+
 	offline_mem_sections(pfn, end_pfn);
 	zone = page_zone(pfn_to_page(pfn));
 	spin_lock_irqsave(&zone->lock, flags);
@@ -8510,12 +8513,14 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 		if (unlikely(!PageBuddy(page) && PageHWPoison(page))) {
 			pfn++;
 			SetPageReserved(page);
+			offlined_pages++;
 			continue;
 		}
 
 		BUG_ON(page_count(page));
 		BUG_ON(!PageBuddy(page));
 		order = page_order(page);
+		offlined_pages += 1 << order;
 #ifdef CONFIG_DEBUG_VM
 		pr_info("remove from free list %lx %d %lx\n",
 			pfn, 1 << order, end_pfn);
@@ -8528,6 +8533,8 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 		pfn += (1 << order);
 	}
 	spin_unlock_irqrestore(&zone->lock, flags);
+
+	return offlined_pages;
 }
 #endif
 
-- 
2.13.7


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC PATCH v2 2/4] mm, memory_hotplug: provide a more generic restrictions for memory hotplug
  2019-01-22 10:37 [RFC PATCH v2 0/4] mm, memory_hotplug: allocate memmap from hotadded memory Oscar Salvador
  2019-01-22 10:37 ` [RFC PATCH v2 1/4] mm, memory_hotplug: cleanup memory offline path Oscar Salvador
@ 2019-01-22 10:37 ` Oscar Salvador
  2019-01-22 10:37 ` [RFC PATCH v2 3/4] mm, memory_hotplug: allocate memmap from the added memory range for sparse-vmemmap Oscar Salvador
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Oscar Salvador @ 2019-01-22 10:37 UTC (permalink / raw)
  To: linux-mm
  Cc: mhocko, dan.j.williams, Pavel.Tatashin, david, linux-kernel,
	dave.hansen, Oscar Salvador

From: Michal Hocko <mhocko@suse.com>

arch_add_memory and __add_pages take a want_memblock argument which controls
whether the newly added memory should get the sysfs memblock user API (e.g.
ZONE_DEVICE users do not want/need this interface). Some callers even
want to control where the memmap is allocated from by configuring an
altmap.

Add a more generic hotplug context for arch_add_memory and __add_pages.
struct mhp_restrictions contains flags which specify additional
features to be enabled by the memory hotplug (MHP_MEMBLOCK_API
currently) and an altmap for an alternative memmap allocator.

Please note that the complete altmap propagation down to the vmemmap code
is still not done in this patch. It will be done in a follow-up to
reduce the churn here.

This patch shouldn't introduce any functional change.
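
For a regular memory hotplug caller the conversion then boils down to
something like this (taken from the add_memory_resource() hunk below):

	struct mhp_restrictions restrictions = {};

	/* regular hotplug wants the sysfs memblock API, default memmap allocator */
	restrictions.flags = MHP_MEMBLOCK_API;
	ret = arch_add_memory(nid, start, size, &restrictions);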

Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 arch/arm64/mm/mmu.c            |  5 ++---
 arch/ia64/mm/init.c            |  5 ++---
 arch/powerpc/mm/mem.c          |  6 +++---
 arch/s390/mm/init.c            |  6 +++---
 arch/sh/mm/init.c              |  6 +++---
 arch/x86/mm/init_32.c          |  6 +++---
 arch/x86/mm/init_64.c          | 10 +++++-----
 include/linux/memory_hotplug.h | 25 +++++++++++++++++++------
 kernel/memremap.c              |  9 ++++++---
 mm/memory_hotplug.c            | 10 ++++++----
 10 files changed, 52 insertions(+), 36 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index b6f5aa52ac67..3926969f9187 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1049,8 +1049,7 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		    bool want_memblock)
+int arch_add_memory(int nid, u64 start, u64 size, struct mhp_restrictions *restrictions)
 {
 	int flags = 0;
 
@@ -1061,6 +1060,6 @@ int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
 			     size, PAGE_KERNEL, pgd_pgtable_alloc, flags);
 
 	return __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
-			   altmap, want_memblock);
+							restrictions);
 }
 #endif
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 29d841525ca1..f7bacfde1b7c 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -644,14 +644,13 @@ mem_init (void)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+int arch_add_memory(int nid, u64 start, u64 size, struct mhp_restrictions *restrictions)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
-	ret = __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	ret = __add_pages(nid, start_pfn, nr_pages, restrictions);
 	if (ret)
 		printk("%s: Problem encountered in __add_pages() as ret=%d\n",
 		       __func__,  ret);
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 33cc6f676fa6..30a2a9b668d7 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -117,8 +117,8 @@ int __weak remove_section_mapping(unsigned long start, unsigned long end)
 	return -ENODEV;
 }
 
-int __meminit arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+int __meminit arch_add_memory(int nid, u64 start, u64 size,
+			struct mhp_restrictions *restrictions)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
@@ -135,7 +135,7 @@ int __meminit arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *
 	}
 	flush_inval_dcache_range(start, start + size);
 
-	return __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	return __add_pages(nid, start_pfn, nr_pages, restrictions);
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 3e82f66d5c61..9ae71a82e9e1 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -224,8 +224,8 @@ device_initcall(s390_cma_mem_init);
 
 #endif /* CONFIG_CMA */
 
-int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+int arch_add_memory(int nid, u64 start, u64 size,
+		struct mhp_restrictions *restrictions)
 {
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long size_pages = PFN_DOWN(size);
@@ -235,7 +235,7 @@ int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
 	if (rc)
 		return rc;
 
-	rc = __add_pages(nid, start_pfn, size_pages, altmap, want_memblock);
+	rc = __add_pages(nid, start_pfn, size_pages, restrictions);
 	if (rc)
 		vmem_remove_mapping(start, size);
 	return rc;
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index a0fa4de03dd5..000232933934 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -410,15 +410,15 @@ void free_initrd_mem(unsigned long start, unsigned long end)
 #endif
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+int arch_add_memory(int nid, u64 start, u64 size,
+		struct mhp_restrictions *restrictions)
 {
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
 	/* We only have ZONE_NORMAL, so this is easy.. */
-	ret = __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	ret = __add_pages(nid, start_pfn, nr_pages, restrictions);
 	if (unlikely(ret))
 		printk("%s: Failed, __add_pages() == %d\n", __func__, ret);
 
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 85c94f9a87f8..755dbed85531 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -850,13 +850,13 @@ void __init mem_init(void)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+int arch_add_memory(int nid, u64 start, u64 size,
+			struct mhp_restrictions *restrictions)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
-	return __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	return __add_pages(nid, start_pfn, nr_pages, restrictions);
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index bccff68e3267..db42c11b48fb 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -777,11 +777,11 @@ static void update_end_of_memory_vars(u64 start, u64 size)
 }
 
 int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
-		struct vmem_altmap *altmap, bool want_memblock)
+				struct mhp_restrictions *restrictions)
 {
 	int ret;
 
-	ret = __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	ret = __add_pages(nid, start_pfn, nr_pages, restrictions);
 	WARN_ON_ONCE(ret);
 
 	/* update max_pfn, max_low_pfn and high_memory */
@@ -791,15 +791,15 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
 	return ret;
 }
 
-int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+int arch_add_memory(int nid, u64 start, u64 size,
+			struct mhp_restrictions *restrictions)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
 	init_memory_mapping(start, start + size);
 
-	return add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	return add_pages(nid, start_pfn, nr_pages, restrictions);
 }
 
 #define PAGE_INUSE 0xFD
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 1a230dde6027..4e0d75b17715 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -113,20 +113,33 @@ extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
 	unsigned long nr_pages, struct vmem_altmap *altmap);
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
+/*
+ * Do we want sysfs memblock files created. This will allow userspace to online
+ * and offline memory explicitly. Lack of this bit means that the caller has to
+ * call move_pfn_range_to_zone to finish the initialization.
+ */
+
+#define MHP_MEMBLOCK_API               1<<0
+
+/* Restrictions for the memory hotplug */
+struct mhp_restrictions {
+	unsigned long flags;    /* MHP_ flags */
+	struct vmem_altmap *altmap; /* use this alternative allocator for memmaps */
+};
+
 /* reasonably generic interface to expand the physical pages */
 extern int __add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
-		struct vmem_altmap *altmap, bool want_memblock);
+					struct mhp_restrictions *restrictions);
 
 #ifndef CONFIG_ARCH_HAS_ADD_PAGES
 static inline int add_pages(int nid, unsigned long start_pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap,
-		bool want_memblock)
+		unsigned long nr_pages, struct mhp_restrictions *restrictions)
 {
-	return __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	return __add_pages(nid, start_pfn, nr_pages, restrictions);
 }
 #else /* ARCH_HAS_ADD_PAGES */
 int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
-		struct vmem_altmap *altmap, bool want_memblock);
+				struct mhp_restrictions *restrictions);
 #endif /* ARCH_HAS_ADD_PAGES */
 
 #ifdef CONFIG_NUMA
@@ -328,7 +341,7 @@ extern int __add_memory(int nid, u64 start, u64 size);
 extern int add_memory(int nid, u64 start, u64 size);
 extern int add_memory_resource(int nid, struct resource *resource);
 extern int arch_add_memory(int nid, u64 start, u64 size,
-		struct vmem_altmap *altmap, bool want_memblock);
+			struct mhp_restrictions *restrictions);
 extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap);
 extern bool is_memblock_offlined(struct memory_block *mem);
diff --git a/kernel/memremap.c b/kernel/memremap.c
index a856cb5ff192..d42f11673979 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -149,6 +149,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 	struct resource *res = &pgmap->res;
 	struct dev_pagemap *conflict_pgmap;
 	pgprot_t pgprot = PAGE_KERNEL;
+	struct mhp_restrictions restrictions = {};
 	int error, nid, is_ram;
 
 	if (!pgmap->ref || !pgmap->kill)
@@ -199,6 +200,9 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 	if (error)
 		goto err_pfn_remap;
 
+	/* We do not want any optional features only our own memmap */
+	restrictions.altmap = altmap;
+
 	mem_hotplug_begin();
 
 	/*
@@ -214,7 +218,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 	 */
 	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
 		error = add_pages(nid, align_start >> PAGE_SHIFT,
-				align_size >> PAGE_SHIFT, NULL, false);
+				align_size >> PAGE_SHIFT, &restrictions);
 	} else {
 		error = kasan_add_zero_shadow(__va(align_start), align_size);
 		if (error) {
@@ -222,8 +226,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 			goto err_kasan;
 		}
 
-		error = arch_add_memory(nid, align_start, align_size, altmap,
-				false);
+		error = arch_add_memory(nid, align_start, align_size, &restrictions);
 	}
 
 	if (!error) {
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 6efa44087b37..8313279136ff 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -271,12 +271,12 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
  * add the new pages.
  */
 int __ref __add_pages(int nid, unsigned long phys_start_pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap,
-		bool want_memblock)
+		unsigned long nr_pages, struct mhp_restrictions *restrictions)
 {
 	unsigned long i;
 	int err = 0;
 	int start_sec, end_sec;
+	struct vmem_altmap *altmap = restrictions->altmap;
 
 	/* during initialize mem_map, align hot-added range to section */
 	start_sec = pfn_to_section_nr(phys_start_pfn);
@@ -297,7 +297,7 @@ int __ref __add_pages(int nid, unsigned long phys_start_pfn,
 
 	for (i = start_sec; i <= end_sec; i++) {
 		err = __add_section(nid, section_nr_to_pfn(i), altmap,
-				want_memblock);
+				restrictions->flags & MHP_MEMBLOCK_API);
 
 		/*
 		 * EEXIST is finally dealt with by ioresource collision
@@ -1108,6 +1108,7 @@ int __ref add_memory_resource(int nid, struct resource *res)
 	u64 start, size;
 	bool new_node = false;
 	int ret;
+	struct mhp_restrictions restrictions = {};
 
 	start = res->start;
 	size = resource_size(res);
@@ -1132,7 +1133,8 @@ int __ref add_memory_resource(int nid, struct resource *res)
 	new_node = ret;
 
 	/* call arch's memory hotadd */
-	ret = arch_add_memory(nid, start, size, NULL, true);
+	restrictions.flags = MHP_MEMBLOCK_API;
+	ret = arch_add_memory(nid, start, size, &restrictions);
 	if (ret < 0)
 		goto error;
 
-- 
2.13.7


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC PATCH v2 3/4] mm, memory_hotplug: allocate memmap from the added memory range for sparse-vmemmap
  2019-01-22 10:37 [RFC PATCH v2 0/4] mm, memory_hotplug: allocate memmap from hotadded memory Oscar Salvador
  2019-01-22 10:37 ` [RFC PATCH v2 1/4] mm, memory_hotplug: cleanup memory offline path Oscar Salvador
  2019-01-22 10:37 ` [RFC PATCH v2 2/4] mm, memory_hotplug: provide a more generic restrictions for memory hotplug Oscar Salvador
@ 2019-01-22 10:37 ` Oscar Salvador
  2019-01-22 10:37 ` [RFC PATCH v2 4/4] mm, sparse: rename kmalloc_section_memmap, __kfree_section_memmap Oscar Salvador
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Oscar Salvador @ 2019-01-22 10:37 UTC (permalink / raw)
  To: linux-mm
  Cc: mhocko, dan.j.williams, Pavel.Tatashin, david, linux-kernel,
	dave.hansen, Oscar Salvador

Physical memory hotadd has to allocate a memmap (struct page array) for
the newly added memory section. Currently, alloc_pages_node() is used
for those allocations.

This has some disadvantages:
 a) existing memory is consumed for that purpose
    (~2MB per 128MB memory section on x86_64)
 b) if the whole node is movable then we have off-node struct pages
    which has performance drawbacks.

a) has turned out to be a problem for memory hotplug based ballooning
because userspace might not react in time to online memory while
the memory consumed during the physical hotadd is enough to push the
system to OOM. 31bc3858ea3e ("memory-hotplug: add automatic onlining
policy for the newly added memory") has been added to work around that
problem.

I have also seen hot-add operations failing on powerpc due to the fact
that we try to use order-8 pages. If the base page size is 64KB, this
gives us 16MB, and if we run out of those, we simply fail.
One could argue that we can fall back to base pages as we do on x86_64.

But we can do much better when CONFIG_SPARSEMEM_VMEMMAP=y, because vmemmap
page tables can map arbitrary memory. That means that we can simply
use the beginning of each memory section and map struct pages there.
struct pages which back the allocated space then just need to be treated
carefully.

Add {_Set,_Clear}PageVmemmap helpers to distinguish those pages in pfn
walkers. We do not have any spare page flag for this purpose, so we use the
combination of the PageReserved bit, which already tells that the page should
be ignored by the core mm code, and store VMEMMAP_PAGE (which sets all
bits but PAGE_MAPPING_FLAGS) into page->mapping.

On the memory hotplug front, add a new MHP_MEMMAP_FROM_RANGE restriction
flag. The user is supposed to set the flag if the memmap should be allocated
from the hotadded range. Please note that this is just a hint and
architecture code can veto it if it cannot be supported. E.g. s390
cannot support this currently because the physical memory range is made
accessible only during memory online.

Implementation-wise, we reuse the vmem_altmap infrastructure to override
the default allocator used by vmemmap_populate. Once the memmap is
allocated we need a way to mark altmap pfns used for the allocation.
For this, we define an init() and a constructor() callback
in the mhp_restrictions structure.

init() now points to init_altmap_memmap(), and constructor() to
mark_vmemmap_pages().

init_altmap_memmap() takes care of checking the flags and initializes the
vmem_altmap structure with the required fields.
mark_vmemmap_pages() takes care of marking the pages as Vmemmap and initializes
some fields we need.
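
Wired together in add_memory_resource() (see the corresponding hunk below),
this looks roughly like:

	struct mhp_restrictions restrictions = {};

	restrictions.flags = MHP_MEMBLOCK_API;
	if (hotplug_vmemmap_enabled) {
		/* ask for a self-hosted memmap and hook up the two callbacks */
		restrictions.flags |= MHP_MEMMAP_FROM_RANGE;
		restrictions.init = init_altmap_memmap;
		restrictions.constructor = mark_vmemmap_pages;
	}
	ret = arch_add_memory(nid, start, size, &restrictions);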

The current layout of the Vmemmap pages is:

- There is a head Vmemmap (first page), which has the following fields set:
  * page->_refcount: number of sections that used this altmap
  * page->private: total number of vmemmap pages
- The remaining vmemmap pages have:
  * page->freelist: pointer to the head vmemmap page

This is done to ease the computations we need in some places.

So, let us say we hot-add 9GB on x86_64 (72 sections of 128MB, 512 vmemmap
pages per section):

head->_refcount = 72 sections
head->private = 36864 vmemmap pages
tail->freelist = head

We keep a _refcount of the used sections to know how long we have to defer
the call to vmemmap_free().
The thing is that the first pages of the hot-added range are used to create
the memmap mapping, so we cannot remove those first, otherwise we would blow up.

What we do is that, since sections are removed sequentially when we hot-remove
a memory range, we wait until we hit the last section, and then we free
the whole range via vmemmap_free() backwards.
We know that it is the last section because in every pass we
decrease head->_refcount, and when it reaches 0, we have hit our last section.

We also have to be careful about those pages during online and offline
operations. They are simply skipped now, so online will keep them
reserved and thus unusable for any other purpose, and offline ignores them
so they do not block the offline operation.

Please note that only the memory hotplug is currently using this
allocation scheme. The boot time memmap allocation could use the same
trick as well but this is not done yet.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 arch/arm64/mm/mmu.c            |   5 +-
 arch/powerpc/mm/init_64.c      |   7 +++
 arch/s390/mm/init.c            |   6 ++
 arch/x86/mm/init_64.c          |  10 ++++
 drivers/hv/hv_balloon.c        |   1 +
 drivers/xen/balloon.c          |   1 +
 include/linux/memory_hotplug.h |  23 +++++--
 include/linux/memremap.h       |   2 +-
 include/linux/page-flags.h     |  23 +++++++
 mm/compaction.c                |   8 +++
 mm/memory_hotplug.c            | 133 ++++++++++++++++++++++++++++++++++++-----
 mm/page_alloc.c                |  36 ++++++++++-
 mm/page_isolation.c            |  13 ++++
 mm/sparse.c                    | 108 +++++++++++++++++++++++++++++++++
 mm/util.c                      |   2 +
 15 files changed, 354 insertions(+), 24 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 3926969f9187..c4eb6d96d088 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -749,7 +749,10 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 		if (pmd_none(READ_ONCE(*pmdp))) {
 			void *p = NULL;
 
-			p = vmemmap_alloc_block_buf(PMD_SIZE, node);
+			if (altmap)
+				p = altmap_alloc_block_buf(PMD_SIZE, altmap);
+			else
+				p = vmemmap_alloc_block_buf(PMD_SIZE, node);
 			if (!p)
 				return -ENOMEM;
 
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index a5091c034747..d8b487a6f019 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -296,6 +296,13 @@ void __ref vmemmap_free(unsigned long start, unsigned long end,
 
 		if (base_pfn >= alt_start && base_pfn < alt_end) {
 			vmem_altmap_free(altmap, nr_pages);
+		} else if (PageVmemmap(page)) {
+			/*
+			 * runtime vmemmap pages are residing inside the memory
+			 * section so they do not have to be freed anywhere.
+			 */
+			while (PageVmemmap(page))
+				__ClearPageVmemmap(page++);
 		} else if (PageReserved(page)) {
 			/* allocated from bootmem */
 			if (page_size < PAGE_SIZE) {
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 9ae71a82e9e1..75e96860a9ac 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -231,6 +231,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
 	unsigned long size_pages = PFN_DOWN(size);
 	int rc;
 
+	/*
+	 * Physical memory is added only later during the memory online so we
+	 * cannot use the added range at this stage unfortunately.
+	 */
+	restrictions->flags &= ~MHP_MEMMAP_FROM_RANGE;
+
 	rc = vmem_add_mapping(start, size);
 	if (rc)
 		return rc;
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index db42c11b48fb..2e40c9e637b9 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -809,6 +809,16 @@ static void __meminit free_pagetable(struct page *page, int order)
 	unsigned long magic;
 	unsigned int nr_pages = 1 << order;
 
+	/*
+	 * runtime vmemmap pages are residing inside the memory section so
+	 * they do not have to be freed anywhere.
+	 */
+	if (PageVmemmap(page)) {
+		while (nr_pages--)
+			__ClearPageVmemmap(page++);
+		return;
+	}
+
 	/* bootmem page has reserved flag */
 	if (PageReserved(page)) {
 		__ClearPageReserved(page);
diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index b32036cbb7a4..582d6e8c734d 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -1585,6 +1585,7 @@ static int balloon_probe(struct hv_device *dev,
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 	do_hot_add = hot_add;
+	hotplug_vmemmap_enabled = false;
 #else
 	do_hot_add = false;
 #endif
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 3ff8f91b1fea..678e835718cf 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -715,6 +715,7 @@ static int __init balloon_init(void)
 	set_online_page_callback(&xen_online_pages);
 	register_memory_notifier(&xen_memory_nb);
 	register_sysctl_table(xen_root);
+	hotplug_vmemmap_enabled = false;
 #endif
 
 #ifdef CONFIG_XEN_PV
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 4e0d75b17715..89317ef50a61 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -118,13 +118,27 @@ extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
  * and offline memory explicitly. Lack of this bit means that the caller has to
  * call move_pfn_range_to_zone to finish the initialization.
  */
-
 #define MHP_MEMBLOCK_API               1<<0
 
-/* Restrictions for the memory hotplug */
+/*
+ * Do we want memmap (struct page array) allocated from the hotadded range.
+ * Please note that only SPARSE_VMEMMAP implements this feature and some
+ * architectures might not support it even for that memory model (e.g. s390)
+ */
+#define MHP_MEMMAP_FROM_RANGE          1<<1
+
+/* Restrictions for the memory hotplug
+ * flags: MHP_ flags
+ * altmap: use this alternative allocator for memmaps
+ * init: callback to be called before we add this memory
+ * constructor: callback to be called once the memory has been added
+ */
 struct mhp_restrictions {
-	unsigned long flags;    /* MHP_ flags */
-	struct vmem_altmap *altmap; /* use this alternative allocator for memmaps */
+	unsigned long flags;
+	struct vmem_altmap *altmap;
+	void (*init)(unsigned long, unsigned long, struct vmem_altmap *,
+						struct mhp_restrictions *);
+	void (*constructor)(struct vmem_altmap *, struct mhp_restrictions *);
 };
 
 /* reasonably generic interface to expand the physical pages */
@@ -345,6 +359,7 @@ extern int arch_add_memory(int nid, u64 start, u64 size,
 extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap);
 extern bool is_memblock_offlined(struct memory_block *mem);
+extern void mark_vmemmap_pages(struct vmem_altmap *self, struct mhp_restrictions *r);
 extern int sparse_add_one_section(int nid, unsigned long start_pfn,
 				  struct vmem_altmap *altmap);
 extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index f0628660d541..cfde1c1febb7 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -16,7 +16,7 @@ struct device;
  * @alloc: track pages consumed, private to vmemmap_populate()
  */
 struct vmem_altmap {
-	const unsigned long base_pfn;
+	unsigned long base_pfn;
 	const unsigned long reserve;
 	unsigned long free;
 	unsigned long align;
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 808b4183e30d..2483fcbe8ed6 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -437,6 +437,29 @@ static __always_inline int __PageMovable(struct page *page)
 				PAGE_MAPPING_MOVABLE;
 }
 
+#define VMEMMAP_PAGE ~PAGE_MAPPING_FLAGS
+static __always_inline int PageVmemmap(struct page *page)
+{
+	return PageReserved(page) && (unsigned long)page->mapping == VMEMMAP_PAGE;
+}
+
+static __always_inline void __ClearPageVmemmap(struct page *page)
+{
+	__ClearPageReserved(page);
+	page->mapping = NULL;
+}
+
+static __always_inline void __SetPageVmemmap(struct page *page)
+{
+	__SetPageReserved(page);
+	page->mapping = (void *)VMEMMAP_PAGE;
+}
+
+static __always_inline struct page *vmemmap_get_head(struct page *page)
+{
+	return (struct page *)page->freelist;
+}
+
 #ifdef CONFIG_KSM
 /*
  * A KSM page is one of those write-protected "shared pages" or "merged pages"
diff --git a/mm/compaction.c b/mm/compaction.c
index 9830f81cd27f..8bf59eaed204 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -852,6 +852,14 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		page = pfn_to_page(low_pfn);
 
 		/*
+		 * Vmemmap pages are pages that are used for creating the memmap
+		 * array mapping, and they reside in their hot-added memory range.
+		 * Therefore, we cannot migrate them.
+		 */
+		if (PageVmemmap(page))
+			goto isolate_fail;
+
+		/*
 		 * Check if the pageblock has already been marked skipped.
 		 * Only the aligned PFN is checked as the caller isolates
 		 * COMPACT_CLUSTER_MAX at a time so the second call must
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 8313279136ff..3c9eb3b82b34 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -73,6 +73,12 @@ bool memhp_auto_online = true;
 #endif
 EXPORT_SYMBOL_GPL(memhp_auto_online);
 
+/*
+ * Do we want to allocate the memmap array from the
+ * hot-added range?
+ */
+bool hotplug_vmemmap_enabled = true;
+
 static int __init setup_memhp_default_state(char *str)
 {
 	if (!strcmp(str, "online"))
@@ -264,6 +270,18 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
 	return hotplug_memory_register(nid, __pfn_to_section(phys_start_pfn));
 }
 
+static void init_altmap_memmap(unsigned long pfn, unsigned long nr_pages,
+						struct vmem_altmap *altmap,
+						struct mhp_restrictions *r)
+{
+	if (!(r->flags & MHP_MEMMAP_FROM_RANGE))
+		return;
+
+	altmap->base_pfn = pfn;
+	altmap->free = nr_pages;
+	r->altmap = altmap;
+}
+
 /*
  * Reasonably generic function for adding memory.  It is
  * expected that archs that support memory hotplug will
@@ -276,12 +294,18 @@ int __ref __add_pages(int nid, unsigned long phys_start_pfn,
 	unsigned long i;
 	int err = 0;
 	int start_sec, end_sec;
-	struct vmem_altmap *altmap = restrictions->altmap;
+	struct vmem_altmap *altmap;
+	struct vmem_altmap __memblk_altmap = {};
 
 	/* during initialize mem_map, align hot-added range to section */
 	start_sec = pfn_to_section_nr(phys_start_pfn);
 	end_sec = pfn_to_section_nr(phys_start_pfn + nr_pages - 1);
 
+	if (restrictions->init)
+		restrictions->init(phys_start_pfn, nr_pages, &__memblk_altmap,
+								restrictions);
+
+	altmap = restrictions->altmap;
 	if (altmap) {
 		/*
 		 * Validate altmap is within bounds of the total request
@@ -310,6 +334,12 @@ int __ref __add_pages(int nid, unsigned long phys_start_pfn,
 		cond_resched();
 	}
 	vmemmap_populate_print_last();
+
+	/*
+	 * Check if we have a constructor
+	 */
+	if (restrictions->constructor)
+		restrictions->constructor(altmap, restrictions);
 out:
 	return err;
 }
@@ -694,17 +724,48 @@ static int online_pages_blocks(unsigned long start, unsigned long nr_pages)
 	return onlined_pages;
 }
 
+static unsigned long check_nr_vmemmap_pages(struct page *page)
+{
+	if (PageVmemmap(page)) {
+		struct page *head = vmemmap_get_head(page);
+		unsigned long vmemmap_pages = page_private(head);
+
+		return vmemmap_pages - (page - head);
+	}
+
+	return 0;
+}
+
 static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
 			void *arg)
 {
 	unsigned long onlined_pages = *(unsigned long *)arg;
+	unsigned long pfn = start_pfn;
+	unsigned long skip_pages = 0;
+
+	if (PageVmemmap(pfn_to_page(pfn))) {
+		/*
+		 * We do not want to send vmemmap pages to __free_pages_core,
+		 * as we will have to populate that with checks to make sure
+		 * vmemmap pages preserve their state.
+		 * Skipping them here saves us some complexity, and has the
+		 * side effect of not accounting vmemmap pages as managed_pages.
+		 */
+		skip_pages = check_nr_vmemmap_pages(pfn_to_page(pfn));
+		skip_pages = min_t(unsigned long, skip_pages, nr_pages);
+		pfn += skip_pages;
+	}
 
-	if (PageReserved(pfn_to_page(start_pfn)))
-		onlined_pages = online_pages_blocks(start_pfn, nr_pages);
+	if ((nr_pages > skip_pages) && PageReserved(pfn_to_page(pfn)))
+		onlined_pages = online_pages_blocks(pfn, nr_pages - skip_pages);
 
 	online_mem_sections(start_pfn, start_pfn + nr_pages);
 
-	*(unsigned long *)arg += onlined_pages;
+	/*
+	 * We do want to account vmemmap pages to present_pages, so
+	 * make sure to add it up.
+	 */
+	*(unsigned long *)arg += onlined_pages + skip_pages;
 	return 0;
 }
 
@@ -1134,6 +1195,12 @@ int __ref add_memory_resource(int nid, struct resource *res)
 
 	/* call arch's memory hotadd */
 	restrictions.flags = MHP_MEMBLOCK_API;
+	if (hotplug_vmemmap_enabled) {
+		restrictions.flags |= MHP_MEMMAP_FROM_RANGE;
+		restrictions.init = init_altmap_memmap;
+		restrictions.constructor = mark_vmemmap_pages;
+	}
+
 	ret = arch_add_memory(nid, start, size, &restrictions);
 	if (ret < 0)
 		goto error;
@@ -1547,8 +1614,7 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
 		node_clear_state(node, N_MEMORY);
 }
 
-static int __ref __offline_pages(unsigned long start_pfn,
-		  unsigned long end_pfn)
+static int __ref __offline_pages(unsigned long start_pfn, unsigned long end_pfn)
 {
 	unsigned long pfn, nr_pages;
 	unsigned long offlined_pages = 0;
@@ -1558,14 +1624,30 @@ static int __ref __offline_pages(unsigned long start_pfn,
 	struct zone *zone;
 	struct memory_notify arg;
 	char *reason;
+	unsigned long nr_vmemmap_pages = 0;
+	bool skip_migration = false;
 
 	mem_hotplug_begin();
 
+	if (PageVmemmap(pfn_to_page(start_pfn))) {
+		nr_vmemmap_pages = check_nr_vmemmap_pages(pfn_to_page(start_pfn));
+		if (start_pfn + nr_vmemmap_pages >= end_pfn) {
+			/*
+			 * It can be that depending on how large is the
+			 * hot-added range, an entire memblock only contains
+			 * vmemmap pages.
+			 * Should be that the case, there is no reason in trying
+			 * to isolate and migrate this range.
+			 */
+			nr_vmemmap_pages = end_pfn - start_pfn;
+			skip_migration = true;
+		}
+	}
+
 	/* This makes hotplug much easier...and readable.
 	   we assume this for now. .*/
 	if (!test_pages_in_a_zone(start_pfn, end_pfn, &valid_start,
 				  &valid_end)) {
-		mem_hotplug_done();
 		ret = -EINVAL;
 		reason = "multizone range";
 		goto failed_removal;
@@ -1575,14 +1657,15 @@ static int __ref __offline_pages(unsigned long start_pfn,
 	node = zone_to_nid(zone);
 	nr_pages = end_pfn - start_pfn;
 
-	/* set above range as isolated */
-	ret = start_isolate_page_range(start_pfn, end_pfn,
-				       MIGRATE_MOVABLE,
-				       SKIP_HWPOISON | REPORT_FAILURE);
-	if (ret) {
-		mem_hotplug_done();
-		reason = "failure to isolate range";
-		goto failed_removal;
+	if (!skip_migration) {
+		/* set above range as isolated */
+		ret = start_isolate_page_range(start_pfn, end_pfn,
+					       MIGRATE_MOVABLE,
+					       SKIP_HWPOISON | REPORT_FAILURE);
+		if (ret) {
+			reason = "failure to isolate range";
+			goto failed_removal;
+		}
 	}
 
 	arg.start_pfn = start_pfn;
@@ -1596,6 +1679,13 @@ static int __ref __offline_pages(unsigned long start_pfn,
 		goto failed_removal_isolated;
 	}
 
+	if (skip_migration)
+		/*
+		 * If the entire memblock is populated with vmemmap pages,
+		 * there is nothing we can migrate, so skip it.
+		 */
+		goto no_migration;
+
 	do {
 		for (pfn = start_pfn; pfn;) {
 			if (signal_pending(current)) {
@@ -1634,14 +1724,25 @@ static int __ref __offline_pages(unsigned long start_pfn,
 
 	/* Ok, all of our target is isolated.
 	   We cannot do rollback at this point. */
+no_migration:
 	walk_system_ram_range(start_pfn, end_pfn - start_pfn, &offlined_pages,
 						offline_isolated_pages_cb);
 
 	pr_info("Offlined Pages %ld\n", offlined_pages);
 	/* reset pagetype flags and makes migrate type to be MOVABLE */
-	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
+	if (!skip_migration)
+		undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
 	/* removal success */
 	adjust_managed_page_count(pfn_to_page(start_pfn), -offlined_pages);
+
+	/*
+	 * Vmemmap pages are not being accounted to managed_pages but to
+	 * present_pages.
+	 * We need to add them up to the already offlined pages to get
+	 * the accounting right.
+	 */
+	offlined_pages += nr_vmemmap_pages;
+
 	zone->present_pages -= offlined_pages;
 
 	pgdat_resize_lock(zone->zone_pgdat, &flags);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cad7468a0f20..05492cc95d74 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1257,14 +1257,19 @@ static void __meminit __init_struct_page_nolru(struct page *page,
 					       unsigned long zone, int nid,
 					       bool is_reserved)
 {
-	mm_zero_struct_page(page);
+	if (!PageVmemmap(page)) {
+		/*
+		 * Vmemmap pages need to preserve their state.
+		 */
+		mm_zero_struct_page(page);
+		init_page_count(page);
+	}
 
 	/*
 	 * We can use a non-atomic operation for setting the
 	 * PG_reserved flag as we are still initializing the pages.
 	 */
 	set_page_links(page, zone, nid, pfn, is_reserved);
-	init_page_count(page);
 	page_mapcount_reset(page);
 	page_cpupid_reset_last(page);
 	page_kasan_tag_reset(page);
@@ -8138,6 +8143,19 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
 
 		page = pfn_to_page(check);
 
+		/*
+		 * Vmemmap pages are marked as reserved, so skip them here,
+		 * otherwise the check below will drive us to a bad conclusion.
+		 */
+		if (PageVmemmap(page)) {
+			struct page *head = vmemmap_get_head(page);
+			unsigned int skip_pages;
+
+			skip_pages = page_private(head) - (page - head);
+			iter += skip_pages - 1;
+			continue;
+		}
+
 		if (PageReserved(page))
 			goto unmovable;
 
@@ -8506,6 +8524,20 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 			continue;
 		}
 		page = pfn_to_page(pfn);
+
+		/*
+		 * Vmemmap pages are self-hosted in the hot-added range,
+		 * we do not need to free them, so skip them.
+		 */
+		if (PageVmemmap(page)) {
+			struct page *head = vmemmap_get_head(page);
+			unsigned long skip_pages;
+
+			skip_pages = page_private(head) - (page - head);
+			pfn += skip_pages;
+			continue;
+		}
+
 		/*
 		 * The HWPoisoned page may be not in buddy system, and
 		 * page_count() is not 0.
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index ce323e56b34d..e29b378f39ae 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -155,6 +155,8 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
 		page = pfn_to_online_page(pfn + i);
 		if (!page)
 			continue;
+		if (PageVmemmap(page))
+			continue;
 		return page;
 	}
 	return NULL;
@@ -257,6 +259,17 @@ __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
 			continue;
 		}
 		page = pfn_to_page(pfn);
+		if (PageVmemmap(page)) {
+			/*
+			 * Vmemmap pages are not isolated. Skip them.
+			 */
+			struct page *head = vmemmap_get_head(page);
+			unsigned long skip_pages;
+
+			skip_pages = page_private(head) - (page - head);
+			pfn += skip_pages;
+			continue;
+		}
 		if (PageBuddy(page))
 			/*
 			 * If the page is on a free list, it has to be on
diff --git a/mm/sparse.c b/mm/sparse.c
index 7ea5dc6c6b19..dd30468dc8f5 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -579,6 +579,103 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
 #endif
 
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
+void mark_vmemmap_pages(struct vmem_altmap *self, struct mhp_restrictions *r)
+{
+	unsigned long pfn = self->base_pfn + self->reserve;
+	unsigned long nr_pages = self->alloc;
+	unsigned long nr_sects = self->free / PAGES_PER_SECTION;
+	unsigned long i;
+	struct page *head;
+
+	if (!(r->flags & MHP_MEMMAP_FROM_RANGE) || !nr_pages)
+		return;
+
+	/*
+	 * All allocations for the memory hotplug are the same sized so align
+	 * should be 0.
+	 */
+	WARN_ON(self->align);
+
+	/*
+	 * Mark these pages as Vmemmap pages.
+	 * We keep track of the sections used by this altmap by means
+	 * of a refcount, so we know how much do we have to defer the call
+	 * to vmemmap_free for this memory range.
+	 * This refcount is kept in the first vmemmap page (head).
+	 * For example:
+	 * We add 10GB: (ffffea0004000000 - ffffea000427ffc0)
+	 * ffffea0004000000 will have a refcount of 80.
+	 * To easily get the head of any vmemmap page, we keep a pointer of it
+	 * in page->freelist.
+	 * We also keep the total nr of pages used by this altmap in the head
+	 * page.
+	 * So, we have this picture:
+	 *
+	 * Head page:
+	 *  page->_refcount: nr of sections
+	 *  page->private: nr of vmemmap pages
+	 * Tail page:
+	 *  page->freelist: pointer to the head page
+	 */
+
+	/*
+	 * Head, first vmemmap page.
+	 */
+	head = pfn_to_page(pfn);
+
+	for (i = 0; i < nr_pages; i++, pfn++) {
+		struct page *page = pfn_to_page(pfn);
+
+		mm_zero_struct_page(page);
+		__SetPageVmemmap(page);
+		page->freelist = head;
+		init_page_count(page);
+	}
+	set_page_count(head, (int)nr_sects);
+	set_page_private(head, nr_pages);
+}
+
+/*
+ * If the range we are trying to remove was hot-added with vmemmap pages,
+ * we need to keep track of it to know how long we have to defer the
+ * freeing.
+ * Since sections are removed sequentially in __remove_pages()->__remove_section(),
+ * we just wait until we hit the last section.
+ * Once that happens, we can trigger free_deferred_vmemmap_range to actually
+ * free the whole memory-range.
+ * This is done because we actually have to free the memory-range backwards.
+ * The reason is that the first pages of that memory are used for the pagetables
+ * in order to create the memmap mapping.
+ * If we removed those pages first, we would blow up, so the vmemmap pages have
+ * to be freed the last.
+ * Since hot-add/hot-remove operations are serialized by the hotplug lock, we know
+ * that once we start a hot-remove operation, we will go all the way down until it
+ * is done, so we do not need any locking for these two variables.
+ */
+static struct page *head_vmemmap_page;
+static bool in_vmemmap_range;
+
+static inline bool vmemmap_dec_and_test(void)
+{
+	return page_ref_dec_and_test(head_vmemmap_page);
+}
+
+static void free_deferred_vmemmap_range(unsigned long start,
+					unsigned long end)
+{
+	unsigned long nr_pages = end - start;
+	unsigned long first_section = (unsigned long)head_vmemmap_page;
+
+	while (start >= first_section) {
+		pr_info("vmemmap_free: %lx - %lx\n", start, end);
+		vmemmap_free(start, end, NULL);
+		end = start;
+		start -= nr_pages;
+	}
+	head_vmemmap_page = NULL;
+	in_vmemmap_range = false;
+}
+
 static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
 		struct vmem_altmap *altmap)
 {
@@ -591,6 +688,17 @@ static void __kfree_section_memmap(struct page *memmap,
 	unsigned long start = (unsigned long)memmap;
 	unsigned long end = (unsigned long)(memmap + PAGES_PER_SECTION);
 
+	if (PageVmemmap(memmap) && !in_vmemmap_range) {
+		in_vmemmap_range = true;
+		head_vmemmap_page = memmap;
+	}
+
+	if (in_vmemmap_range) {
+		if (vmemmap_dec_and_test())
+			free_deferred_vmemmap_range(start, end);
+		return;
+	}
+
 	vmemmap_free(start, end, altmap);
 }
 #ifdef CONFIG_MEMORY_HOTREMOVE
diff --git a/mm/util.c b/mm/util.c
index 1ea055138043..e0ac8712a392 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -517,6 +517,8 @@ struct address_space *page_mapping(struct page *page)
 	mapping = page->mapping;
 	if ((unsigned long)mapping & PAGE_MAPPING_ANON)
 		return NULL;
+	if ((unsigned long)mapping == VMEMMAP_PAGE)
+		return NULL;
 
 	return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS);
 }
-- 
2.13.7


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC PATCH v2 4/4] mm, sparse: rename kmalloc_section_memmap, __kfree_section_memmap
  2019-01-22 10:37 [RFC PATCH v2 0/4] mm, memory_hotplug: allocate memmap from hotadded memory Oscar Salvador
                   ` (2 preceding siblings ...)
  2019-01-22 10:37 ` [RFC PATCH v2 3/4] mm, memory_hotplug: allocate memmap from the added memory range for sparse-vmemmap Oscar Salvador
@ 2019-01-22 10:37 ` Oscar Salvador
  2019-01-25  8:53 ` [RFC PATCH v2 0/4] mm, memory_hotplug: allocate memmap from hotadded memory David Hildenbrand
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Oscar Salvador @ 2019-01-22 10:37 UTC (permalink / raw)
  To: linux-mm
  Cc: mhocko, dan.j.williams, Pavel.Tatashin, david, linux-kernel,
	dave.hansen, Oscar Salvador

From: Michal Hocko <mhocko@suse.com>

The "kmalloc" prefix is misleading.
Rename the functions to alloc_section_memmap/free_section_memmap, which
better reflects their functionality.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/sparse.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/sparse.c b/mm/sparse.c
index dd30468dc8f5..27428b965d46 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -676,13 +676,13 @@ static void free_deferred_vmemmap_range(unsigned long start,
 	in_vmemmap_range = false;
 }
 
-static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
+static inline struct page *alloc_section_memmap(unsigned long pnum, int nid,
 		struct vmem_altmap *altmap)
 {
 	/* This will make the necessary allocations eventually. */
 	return sparse_mem_map_populate(pnum, nid, altmap);
 }
-static void __kfree_section_memmap(struct page *memmap,
+static void free_section_memmap(struct page *memmap,
 		struct vmem_altmap *altmap)
 {
 	unsigned long start = (unsigned long)memmap;
@@ -732,13 +732,13 @@ static struct page *__kmalloc_section_memmap(void)
 	return ret;
 }
 
-static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
+static inline struct page *alloc_section_memmap(unsigned long pnum, int nid,
 		struct vmem_altmap *altmap)
 {
 	return __kmalloc_section_memmap();
 }
 
-static void __kfree_section_memmap(struct page *memmap,
+static void free_section_memmap(struct page *memmap,
 		struct vmem_altmap *altmap)
 {
 	if (is_vmalloc_addr(memmap))
@@ -803,12 +803,12 @@ int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
 	if (ret < 0 && ret != -EEXIST)
 		return ret;
 	ret = 0;
-	memmap = kmalloc_section_memmap(section_nr, nid, altmap);
+	memmap = alloc_section_memmap(section_nr, nid, altmap);
 	if (!memmap)
 		return -ENOMEM;
 	usemap = __kmalloc_section_usemap();
 	if (!usemap) {
-		__kfree_section_memmap(memmap, altmap);
+		free_section_memmap(memmap, altmap);
 		return -ENOMEM;
 	}
 
@@ -830,7 +830,7 @@ int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
 out:
 	if (ret < 0) {
 		kfree(usemap);
-		__kfree_section_memmap(memmap, altmap);
+		free_section_memmap(memmap, altmap);
 	}
 	return ret;
 }
@@ -881,7 +881,7 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap,
 	if (PageSlab(usemap_page) || PageCompound(usemap_page)) {
 		kfree(usemap);
 		if (memmap)
-			__kfree_section_memmap(memmap, altmap);
+			free_section_memmap(memmap, altmap);
 		return;
 	}
 
-- 
2.13.7


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH v2 0/4] mm, memory_hotplug: allocate memmap from hotadded memory
  2019-01-22 10:37 [RFC PATCH v2 0/4] mm, memory_hotplug: allocate memmap from hotadded memory Oscar Salvador
                   ` (3 preceding siblings ...)
  2019-01-22 10:37 ` [RFC PATCH v2 4/4] mm, sparse: rename kmalloc_section_memmap, __kfree_section_memmap Oscar Salvador
@ 2019-01-25  8:53 ` David Hildenbrand
  2019-01-29  8:43   ` Oscar Salvador
  2019-01-30 21:52 ` Oscar Salvador
  2019-02-12 12:47 ` Jonathan Cameron
  6 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand @ 2019-01-25  8:53 UTC (permalink / raw)
  To: Oscar Salvador, linux-mm
  Cc: mhocko, dan.j.williams, Pavel.Tatashin, linux-kernel, dave.hansen

On 22.01.19 11:37, Oscar Salvador wrote:
> Hi,
> 
> this is the v2 of the first RFC I sent back then in October [1].
> In this new version I tried to reduce the complexity as much as possible,
> plus some clean ups.
> 
> [Testing]
> 
> I have tested it on "x86_64" (small/big memblocks) and on "powerpc".
> On both architectures hot-add/hot-remove online/offline operations
> worked as expected using vmemmap pages, I have not seen any issues so far.
> I wanted to try it out on Hyper-V/Xen, but I did not manage to.
> I plan to do so along this week (if time allows).
> I would also like to test it on arm64, but I am not sure I can grab
> an arm64 box anytime soon.
> 
> [Coverletter]:
> 
> This is another step to make the memory hotplug more usable. The primary
> goal of this patchset is to reduce memory overhead of the hot added
> memory (at least for SPARSE_VMEMMAP memory model). The current way we use
> to populate memmap (struct page array) has two main drawbacks:
> 
> a) it consumes an additional memory until the hotadded memory itself is
>    onlined and
> b) memmap might end up on a different numa node which is especially true
>    for movable_node configuration.
> 
> a) is problem especially for memory hotplug based memory "ballooning"
>    solutions when the delay between physical memory hotplug and the
>    onlining can lead to OOM and that led to introduction of hacks like auto
>    onlining (see 31bc3858ea3e ("memory-hotplug: add automatic onlining
>    policy for the newly added memory")).
> 
> b) can have performance drawbacks.
> 
> I have also seen hot-add operations failing on powerpc due to the fact
> that we try to use order-8 pages when populating the memmap array.
> Given 64KB base pagesize, that is 16MB.
> If we run out of those, we just fail the operation and we cannot add
> more memory.
> We could fallback to base pages as x86_64 does, but we can do better.
> 
> One way to mitigate all these issues is to simply allocate memmap array
> (which is the largest memory footprint of the physical memory hotplug)
> from the hotadded memory itself. VMEMMAP memory model allows us to map
> any pfn range so the memory doesn't need to be online to be usable
> for the array. See patch 3 for more details. In short I am reusing an
> existing vmem_altmap which wants to achieve the same thing for nvdim
> device memory.
> 

I only had a quick glimpse. I would prefer it if the caller of add_memory()
could specify whether it is ok to allocate the vmemmap from the range.
This e.g. allows the ACPI DIMM code to allocate from the range, while
other mechanisms (Xen, Hyper-V, virtio-mem) can opt in once they
actually support it.

Also, while s390x standby memory cannot support allocating from the
range, virtio-mem could easily support it on s390x.

Not sure what such an interface would look like, but I would really like
to have control over that at the add_memory() level, not per arch.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH v2 0/4] mm, memory_hotplug: allocate memmap from hotadded memory
  2019-01-25  8:53 ` [RFC PATCH v2 0/4] mm, memory_hotplug: allocate memmap from hotadded memory David Hildenbrand
@ 2019-01-29  8:43   ` Oscar Salvador
  2019-01-29 10:08     ` David Hildenbrand
  0 siblings, 1 reply; 16+ messages in thread
From: Oscar Salvador @ 2019-01-29  8:43 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, mhocko, dan.j.williams, Pavel.Tatashin, linux-kernel,
	dave.hansen

On Fri, Jan 25, 2019 at 09:53:35AM +0100, David Hildenbrand wrote:
Hi David,

> I only had a quick glimpse. I would prefer if the caller of add_memory()
> can specify whether it would be ok to allocate vmmap from the range.
> This e.g. allows ACPI dimm code to allocate from the range, however
> other machanisms (XEN, hyper-v, virtio-mem) can allow it once they
> actually support it.

Well, I think this can be done, and it might make more sense, as we
could get rid of some of the other flags we use to prevent allocating
the vmemmap, besides mhp_restrictions.
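
Just to make the idea a bit more concrete, the caller-side opt-in could look
roughly like this (the flag name and the extra add_memory_resource() parameter
below are made up for illustration, the real interface may end up different):

	/*
	 * Hypothetical sketch: the caller states that it is fine to place
	 * the memmap in the hot-added range itself.  The flag name and the
	 * extra parameter are illustrative only.
	 */
	struct mhp_restrictions restrictions = {
		.flags = MHP_MEMMAP_FROM_RANGE,		/* made-up flag */
	};

	rc = add_memory_resource(nid, res, &restrictions);

Mechanisms that do not support it yet would simply not set the flag and keep
the current behaviour.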

> 
> Also, while s390x standby memory cannot support allocating from the
> range, virtio-mem could easily support it on s390x.
> 
> Not sure how such an interface could look like, but I would really like
> to have control over that on the add_memory() interface, not per arch.

Let me try it out and will report back.

Btw, since you are a virt guy, would it be feasible for you to test the patchset
on Hyper-V, Xen or your virtio-mem driver?

Thanks David!

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH v2 0/4] mm, memory_hotplug: allocate memmap from hotadded memory
  2019-01-29  8:43   ` Oscar Salvador
@ 2019-01-29 10:08     ` David Hildenbrand
  0 siblings, 0 replies; 16+ messages in thread
From: David Hildenbrand @ 2019-01-29 10:08 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: linux-mm, mhocko, dan.j.williams, Pavel.Tatashin, linux-kernel,
	dave.hansen, Vitaly Kuznetsov

On 29.01.19 09:43, Oscar Salvador wrote:
> On Fri, Jan 25, 2019 at 09:53:35AM +0100, David Hildenbrand wrote:
> Hi David,
> 
>> I only had a quick glimpse. I would prefer if the caller of add_memory()
>> can specify whether it would be ok to allocate vmmap from the range.
>> This e.g. allows ACPI dimm code to allocate from the range, however
>> other machanisms (XEN, hyper-v, virtio-mem) can allow it once they
>> actually support it.
> 
> Well, I think this can be done, and it might make more sense, as we
> would get rid of some other flags to prevent allocating vmemmap
> besides mhp_restrictions.

Maybe we can also start passing a struct to add_memory() to describe
such properties. This would avoid having to change all the layers over
and over again. We would just have to establish some rules to avoid
breaking stuff. E.g. the struct always has to be initialized to 0, so new
features won't break callers that don't want to make use of them.

E.g. memory block types (or whatever we come up with that is better) would
otherwise also have to add new parameters to add_memory() and friends.
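
Something along these lines is what I have in mind (purely a sketch, the name
and the members are made up and not a proposal for the final interface):

	struct add_memory_params {
		unsigned long flags;	/* e.g. a hypothetical MHP_MEMMAP_ON_MEMORY */
		/*
		 * New members get appended here; callers zero-initialize
		 * the whole struct, so adding members does not break them.
		 */
	};

	int add_memory(int nid, u64 start, u64 size,
		       struct add_memory_params *params);

Callers that don't care just pass a zeroed struct and get today's behaviour.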

> 
>>
>> Also, while s390x standby memory cannot support allocating from the
>> range, virtio-mem could easily support it on s390x.
>>
>> Not sure how such an interface could look like, but I would really like
>> to have control over that on the add_memory() interface, not per arch.
> 
> Let me try it out and will report back.
> 
> Btw, since you are a virt-guy, would it be do feasible for you to test the patchset
> on hyper-v, xen or your virtio-mem driver?

I don't have a Xen or Hyper-V installation myself. cc-ing Vitaly, maybe
he has time and resources to test on Hyper-V.

I'll be reworking my virtio-mem prototype soon and try it with this
patchset then! But this could take a little longer, as I have tons of
other stuff on my plate :) So don't worry about virtio-mem too much for now.

> 
> Thanks David!
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH v2 0/4] mm, memory_hotplug: allocate memmap from hotadded memory
  2019-01-22 10:37 [RFC PATCH v2 0/4] mm, memory_hotplug: allocate memmap from hotadded memory Oscar Salvador
                   ` (4 preceding siblings ...)
  2019-01-25  8:53 ` [RFC PATCH v2 0/4] mm, memory_hotplug: allocate memmap from hotadded memory David Hildenbrand
@ 2019-01-30 21:52 ` Oscar Salvador
  2019-01-31  7:23   ` Michal Hocko
  2019-02-12 12:47 ` Jonathan Cameron
  6 siblings, 1 reply; 16+ messages in thread
From: Oscar Salvador @ 2019-01-30 21:52 UTC (permalink / raw)
  To: linux-mm
  Cc: mhocko, dan.j.williams, Pavel.Tatashin, david, linux-kernel, dave.hansen

On Tue, Jan 22, 2019 at 11:37:04AM +0100, Oscar Salvador wrote:
> I yet have to check a couple of things like creating an accounting item
> like VMEMMAP_PAGES to show in /proc/meminfo to ease to spot the memory that
> went in there, testing Hyper-V/Xen to see how they react to the fact that
> we are using the beginning of the memory-range for our own purposes, and to
> check the thing about gigantic pages + hotplug.
> I also have to check that there is no compilation/runtime errors when
> CONFIG_SPARSEMEM but !CONFIG_SPARSEMEM_VMEMMAP.
> But before that, I would like to get people's feedback about the overall
> design, and ideas/suggestions.

just a friendly reminder if some feedback is possible :-)

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH v2 0/4] mm, memory_hotplug: allocate memmap from hotadded memory
  2019-01-30 21:52 ` Oscar Salvador
@ 2019-01-31  7:23   ` Michal Hocko
  2019-01-31  8:03     ` Oscar Salvador
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2019-01-31  7:23 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: linux-mm, dan.j.williams, Pavel.Tatashin, david, linux-kernel,
	dave.hansen

On Wed 30-01-19 22:52:04, Oscar Salvador wrote:
> On Tue, Jan 22, 2019 at 11:37:04AM +0100, Oscar Salvador wrote:
> > I yet have to check a couple of things like creating an accounting item
> > like VMEMMAP_PAGES to show in /proc/meminfo to ease to spot the memory that
> > went in there, testing Hyper-V/Xen to see how they react to the fact that
> > we are using the beginning of the memory-range for our own purposes, and to
> > check the thing about gigantic pages + hotplug.
> > I also have to check that there is no compilation/runtime errors when
> > CONFIG_SPARSEMEM but !CONFIG_SPARSEMEM_VMEMMAP.
> > But before that, I would like to get people's feedback about the overall
> > design, and ideas/suggestions.
> 
> just a friendly reminder if some feedback is possible :-)

I will be off next week and will not get to this this week.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH v2 0/4] mm, memory_hotplug: allocate memmap from hotadded memory
  2019-01-31  7:23   ` Michal Hocko
@ 2019-01-31  8:03     ` Oscar Salvador
  0 siblings, 0 replies; 16+ messages in thread
From: Oscar Salvador @ 2019-01-31  8:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, dan.j.williams, Pavel.Tatashin, david, linux-kernel,
	dave.hansen

On Thu, Jan 31, 2019 at 08:23:19AM +0100, Michal Hocko wrote:
> On Wed 30-01-19 22:52:04, Oscar Salvador wrote:
> > On Tue, Jan 22, 2019 at 11:37:04AM +0100, Oscar Salvador wrote:
> > > I yet have to check a couple of things like creating an accounting item
> > > like VMEMMAP_PAGES to show in /proc/meminfo to ease to spot the memory that
> > > went in there, testing Hyper-V/Xen to see how they react to the fact that
> > > we are using the beginning of the memory-range for our own purposes, and to
> > > check the thing about gigantic pages + hotplug.
> > > I also have to check that there is no compilation/runtime errors when
> > > CONFIG_SPARSEMEM but !CONFIG_SPARSEMEM_VMEMMAP.
> > > But before that, I would like to get people's feedback about the overall
> > > design, and ideas/suggestions.
> > 
> > just a friendly reminder if some feedback is possible :-)
> 
> I will be off next week and will not get to this this week.

Sure, it can wait.
In the meantime I will take the chance to clean up a couple of things.

Thanks
-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH v2 0/4] mm, memory_hotplug: allocate memmap from hotadded memory
  2019-01-22 10:37 [RFC PATCH v2 0/4] mm, memory_hotplug: allocate memmap from hotadded memory Oscar Salvador
                   ` (5 preceding siblings ...)
  2019-01-30 21:52 ` Oscar Salvador
@ 2019-02-12 12:47 ` Jonathan Cameron
  2019-02-12 13:21   ` Shameerali Kolothum Thodi
  6 siblings, 1 reply; 16+ messages in thread
From: Jonathan Cameron @ 2019-02-12 12:47 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: linux-mm, mhocko, dan.j.williams, Pavel.Tatashin, david,
	linux-kernel, dave.hansen, shameerali.kolothum.thodi, linuxarm,
	Robin Murphy

On Tue, 22 Jan 2019 11:37:04 +0100
Oscar Salvador <osalvador@suse.de> wrote:

> Hi,
> 
> this is the v2 of the first RFC I sent back then in October [1].
> In this new version I tried to reduce the complexity as much as possible,
> plus some clean ups.
> 
> [Testing]
> 
> I have tested it on "x86_64" (small/big memblocks) and on "powerpc".
> On both architectures hot-add/hot-remove online/offline operations
> worked as expected using vmemmap pages, I have not seen any issues so far.
> I wanted to try it out on Hyper-V/Xen, but I did not manage to.
> I plan to do so along this week (if time allows).
> I would also like to test it on arm64, but I am not sure I can grab
> an arm64 box anytime soon.

Hi Oscar,

I ran tests on one of our arm64 machines. That particular machine doesn't actually have
the mechanics for hotplug, so it was all 'faked', but software-wise it's all the
same.

Upshot, seems to work as expected on arm64 as well.
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Hot-remove currently relies on some out-of-tree patches (and dirty hacks) due
to the usual issue with how arm64 does pfn_valid(). It's not even vaguely
ready for upstream. I'll aim to post an informational set for anyone else
testing in this area (it's more or less just a rebase of the patches from
a few years ago).

+CC Shameer who has been testing the virtualization side for more details on
that, and Robin who is driving forward memory hotplug in general on the arm64
side.

Thanks,

Jonathan

> 
> [Coverletter]:
> 
> This is another step to make the memory hotplug more usable. The primary
> goal of this patchset is to reduce memory overhead of the hot added
> memory (at least for SPARSE_VMEMMAP memory model). The current way we use
> to populate memmap (struct page array) has two main drawbacks:
> 
> a) it consumes an additional memory until the hotadded memory itself is
>    onlined and
> b) memmap might end up on a different numa node which is especially true
>    for movable_node configuration.
> 
> a) is problem especially for memory hotplug based memory "ballooning"
>    solutions when the delay between physical memory hotplug and the
>    onlining can lead to OOM and that led to introduction of hacks like auto
>    onlining (see 31bc3858ea3e ("memory-hotplug: add automatic onlining
>    policy for the newly added memory")).
> 
> b) can have performance drawbacks.
> 
> I have also seen hot-add operations failing on powerpc due to the fact
> that we try to use order-8 pages when populating the memmap array.
> Given 64KB base pagesize, that is 16MB.
> If we run out of those, we just fail the operation and we cannot add
> more memory.
> We could fallback to base pages as x86_64 does, but we can do better.
> 
> One way to mitigate all these issues is to simply allocate memmap array
> (which is the largest memory footprint of the physical memory hotplug)
> from the hotadded memory itself. VMEMMAP memory model allows us to map
> any pfn range so the memory doesn't need to be online to be usable
> for the array. See patch 3 for more details. In short I am reusing an
> existing vmem_altmap which wants to achieve the same thing for nvdim
> device memory.
> 
> There is also one potential drawback, though. If somebody uses memory
> hotplug for 1G (gigantic) hugetlb pages then this scheme will not work
> for them obviously because each memory block will contain reserved
> area. Large x86 machines will use 2G memblocks so at least one 1G page
> will be available but this is still not 2G...
> 
> I am not really sure somebody does that and how reliable that can work
> actually. Nevertheless, I _believe_ that onlining more memory into
> virtual machines is much more common usecase. Anyway if there ever is a
> strong demand for such a usecase we have basically 3 options a) enlarge
> memory blocks even more b) enhance altmap allocation strategy and reuse
> low memory sections to host memmaps of other sections on the same NUMA
> node c) have the memmap allocation strategy configurable to fallback to
> the current allocation.
>  
> [Overall design]:
> 
> Let us say we hot-add 2GB of memory on a x86_64 (memblock size = 128M).
> That is:
> 
>  - 16 sections
>  - 524288 pages
>  - 8192 vmemmap pages (out of those 524288. We spend 512 pages for each section)
> 
>  The range of pages is: 0xffffea0004000000 - 0xffffea0006000000
>  The vmemmap range is:  0xffffea0004000000 - 0xffffea0004080000
> 
>  0xffffea0004000000 is the head vmemmap page (first page), while all the others
>  are "tails".
> 
>  We keep the following information in it:
> 
>  - Head page:
>    - head->_refcount: number of sections
>    - head->private :  number of vmemmap pages
>  - Tail page:
>    - tail->freelist : pointer to the head
> 
> This is done because it eases the work in cases where we have to compute the
> number of vmemmap pages to know how much do we have to skip etc, and to keep
> the right accounting to present_pages.
> 
> When we want to hot-remove the range, we need to be careful because the first
> pages of that range, are used for the memmap maping, so if we remove those
> first, we would blow up while accessing the others later on.
> For that reason we keep the number of sections in head->_refcount, to know how
> much do we have to defer the free up.
> 
> Since in a hot-remove operation, sections are being removed sequentially, the
> approach taken here is that every time we hit free_section_memmap(), we decrease
> the refcount of the head.
> When it reaches 0, we know that we hit the last section, so we call
> vmemmap_free() for the whole memory-range in backwards, so we make sure that
> the pages used for the mapping will be latest to be freed up.
> 
> The accounting is as follows:
> 
>  Vmemmap pages are charged to spanned/present_paged, but not to manages_pages.
> 
> I yet have to check a couple of things like creating an accounting item
> like VMEMMAP_PAGES to show in /proc/meminfo to ease to spot the memory that
> went in there, testing Hyper-V/Xen to see how they react to the fact that
> we are using the beginning of the memory-range for our own purposes, and to
> check the thing about gigantic pages + hotplug.
> I also have to check that there is no compilation/runtime errors when
> CONFIG_SPARSEMEM but !CONFIG_SPARSEMEM_VMEMMAP.
> But before that, I would like to get people's feedback about the overall
> design, and ideas/suggestions.
> 
> 
> [1] https://patchwork.kernel.org/cover/10685835/
> 
> Michal Hocko (3):
>   mm, memory_hotplug: cleanup memory offline path
>   mm, memory_hotplug: provide a more generic restrictions for memory
>     hotplug
>   mm, sparse: rename kmalloc_section_memmap, __kfree_section_memmap
> 
> Oscar Salvador (1):
>   mm, memory_hotplug: allocate memmap from the added memory range for
>     sparse-vmemmap
> 
>  arch/arm64/mm/mmu.c            |  10 ++-
>  arch/ia64/mm/init.c            |   5 +-
>  arch/powerpc/mm/init_64.c      |   7 ++
>  arch/powerpc/mm/mem.c          |   6 +-
>  arch/s390/mm/init.c            |  12 ++-
>  arch/sh/mm/init.c              |   6 +-
>  arch/x86/mm/init_32.c          |   6 +-
>  arch/x86/mm/init_64.c          |  20 +++--
>  drivers/hv/hv_balloon.c        |   1 +
>  drivers/xen/balloon.c          |   1 +
>  include/linux/memory_hotplug.h |  42 ++++++++--
>  include/linux/memremap.h       |   2 +-
>  include/linux/page-flags.h     |  23 +++++
>  kernel/memremap.c              |   9 +-
>  mm/compaction.c                |   8 ++
>  mm/memory_hotplug.c            | 186 +++++++++++++++++++++++++++++------------
>  mm/page_alloc.c                |  47 ++++++++++-
>  mm/page_isolation.c            |  13 +++
>  mm/sparse.c                    | 124 +++++++++++++++++++++++++--
>  mm/util.c                      |   2 +
>  20 files changed, 431 insertions(+), 99 deletions(-)
> 



^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [RFC PATCH v2 0/4] mm, memory_hotplug: allocate memmap from hotadded memory
  2019-02-12 12:47 ` Jonathan Cameron
@ 2019-02-12 13:21   ` Shameerali Kolothum Thodi
  2019-02-12 13:56     ` Oscar Salvador
  0 siblings, 1 reply; 16+ messages in thread
From: Shameerali Kolothum Thodi @ 2019-02-12 13:21 UTC (permalink / raw)
  To: Jonathan Cameron, Oscar Salvador
  Cc: linux-mm, mhocko, dan.j.williams, Pavel.Tatashin, david,
	linux-kernel, dave.hansen, Linuxarm, Robin Murphy



> -----Original Message-----
> From: Jonathan Cameron
> Sent: 12 February 2019 12:47
> To: Oscar Salvador <osalvador@suse.de>
> Cc: linux-mm@kvack.org; mhocko@suse.com; dan.j.williams@intel.com;
> Pavel.Tatashin@microsoft.com; david@redhat.com;
> linux-kernel@vger.kernel.org; dave.hansen@intel.com; Shameerali Kolothum
> Thodi <shameerali.kolothum.thodi@huawei.com>; Linuxarm
> <linuxarm@huawei.com>; Robin Murphy <robin.murphy@arm.com>
> Subject: Re: [RFC PATCH v2 0/4] mm, memory_hotplug: allocate memmap from
> hotadded memory
> 
> On Tue, 22 Jan 2019 11:37:04 +0100
> Oscar Salvador <osalvador@suse.de> wrote:
> 
> > Hi,
> >
> > this is the v2 of the first RFC I sent back then in October [1].
> > In this new version I tried to reduce the complexity as much as possible,
> > plus some clean ups.
> >
> > [Testing]
> >
> > I have tested it on "x86_64" (small/big memblocks) and on "powerpc".
> > On both architectures hot-add/hot-remove online/offline operations
> > worked as expected using vmemmap pages, I have not seen any issues so far.
> > I wanted to try it out on Hyper-V/Xen, but I did not manage to.
> > I plan to do so along this week (if time allows).
> > I would also like to test it on arm64, but I am not sure I can grab
> > an arm64 box anytime soon.
> 
> Hi Oscar,
> 
> I ran tests on one of our arm64 machines. Particular machine doesn't actually
> have
> the mechanics for hotplug, so was all 'faked', but software wise it's all the
> same.
> 
> Upshot, seems to work as expected on arm64 as well.
> Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> Remove currently relies on some out of tree patches (and dirty hacks) due
> to the usual issue with how arm64 does pfn_valid. It's not even vaguely
> ready for upstream. I'll aim to post an informational set for anyone else
> testing in this area (it's more or less just a rebase of the patches from
> a few years ago).
> 
> +CC Shameer who has been testing the virtualization side for more details on
> that, 

Right, I have sent out an RFC series[1] to enable memory hotplug for the QEMU ARM
virt platform. Using that QEMU, I ran a few tests with your patches on a HiSilicon
ARM64 platform. It looks like it is doing the job.

root@ubuntu:~# uname -a
Linux ubuntu 5.0.0-rc1-mm1-00173-g22b0744 #5 SMP PREEMPT Tue Feb 5 10:32:26 GMT 2019 aarch64 aarch64 aarch64 GNU/Linux

root@ubuntu:~# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0
node 0 size: 981 MB
node 0 free: 854 MB
node 1 cpus:
node 1 size: 0 MB
node 1 free: 0 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10 
root@ubuntu:~# (qemu) 
(qemu) object_add memory-backend-ram,id=mem1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1,node=1
root@ubuntu:~# 
root@ubuntu:~# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0
node 0 size: 981 MB
node 0 free: 853 MB
node 1 cpus:
node 1 size: 1008 MB
node 1 free: 1008 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10 
root@ubuntu:~#  

FWIW,
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>

Thanks,
Shameer
[1] https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06966.html

> and Robin who is driving forward memory hotplug in general on the arm64
> side.
> 
> Thanks,
> 
> Jonathan
> 
> >
> > [Coverletter]:
> >
> > This is another step to make the memory hotplug more usable. The primary
> > goal of this patchset is to reduce memory overhead of the hot added
> > memory (at least for SPARSE_VMEMMAP memory model). The current way
> we use
> > to populate memmap (struct page array) has two main drawbacks:
> >
> > a) it consumes an additional memory until the hotadded memory itself is
> >    onlined and
> > b) memmap might end up on a different numa node which is especially true
> >    for movable_node configuration.
> >
> > a) is problem especially for memory hotplug based memory "ballooning"
> >    solutions when the delay between physical memory hotplug and the
> >    onlining can lead to OOM and that led to introduction of hacks like auto
> >    onlining (see 31bc3858ea3e ("memory-hotplug: add automatic onlining
> >    policy for the newly added memory")).
> >
> > b) can have performance drawbacks.
> >
> > I have also seen hot-add operations failing on powerpc due to the fact
> > that we try to use order-8 pages when populating the memmap array.
> > Given 64KB base pagesize, that is 16MB.
> > If we run out of those, we just fail the operation and we cannot add
> > more memory.
> > We could fallback to base pages as x86_64 does, but we can do better.
> >
> > One way to mitigate all these issues is to simply allocate memmap array
> > (which is the largest memory footprint of the physical memory hotplug)
> > from the hotadded memory itself. VMEMMAP memory model allows us to
> map
> > any pfn range so the memory doesn't need to be online to be usable
> > for the array. See patch 3 for more details. In short I am reusing an
> > existing vmem_altmap which wants to achieve the same thing for nvdim
> > device memory.
> >
> > There is also one potential drawback, though. If somebody uses memory
> > hotplug for 1G (gigantic) hugetlb pages then this scheme will not work
> > for them obviously because each memory block will contain reserved
> > area. Large x86 machines will use 2G memblocks so at least one 1G page
> > will be available but this is still not 2G...
> >
> > I am not really sure somebody does that and how reliable that can work
> > actually. Nevertheless, I _believe_ that onlining more memory into
> > virtual machines is much more common usecase. Anyway if there ever is a
> > strong demand for such a usecase we have basically 3 options a) enlarge
> > memory blocks even more b) enhance altmap allocation strategy and reuse
> > low memory sections to host memmaps of other sections on the same NUMA
> > node c) have the memmap allocation strategy configurable to fallback to
> > the current allocation.
> >
> > [Overall design]:
> >
> > Let us say we hot-add 2GB of memory on a x86_64 (memblock size = 128M).
> > That is:
> >
> >  - 16 sections
> >  - 524288 pages
> >  - 8192 vmemmap pages (out of those 524288. We spend 512 pages for each
> section)
> >
> >  The range of pages is: 0xffffea0004000000 - 0xffffea0006000000
> >  The vmemmap range is:  0xffffea0004000000 - 0xffffea0004080000
> >
> >  0xffffea0004000000 is the head vmemmap page (first page), while all the
> others
> >  are "tails".
> >
> >  We keep the following information in it:
> >
> >  - Head page:
> >    - head->_refcount: number of sections
> >    - head->private :  number of vmemmap pages
> >  - Tail page:
> >    - tail->freelist : pointer to the head
> >
> > This is done because it eases the work in cases where we have to compute
> the
> > number of vmemmap pages to know how much do we have to skip etc, and to
> keep
> > the right accounting to present_pages.
> >
> > When we want to hot-remove the range, we need to be careful because the
> first
> > pages of that range, are used for the memmap maping, so if we remove
> those
> > first, we would blow up while accessing the others later on.
> > For that reason we keep the number of sections in head->_refcount, to know
> how
> > much do we have to defer the free up.
> >
> > Since in a hot-remove operation, sections are being removed sequentially, the
> > approach taken here is that every time we hit free_section_memmap(), we
> decrease
> > the refcount of the head.
> > When it reaches 0, we know that we hit the last section, so we call
> > vmemmap_free() for the whole memory-range in backwards, so we make
> sure that
> > the pages used for the mapping will be latest to be freed up.
> >
> > The accounting is as follows:
> >
> >  Vmemmap pages are charged to spanned/present_paged, but not to
> manages_pages.
> >
> > I yet have to check a couple of things like creating an accounting item
> > like VMEMMAP_PAGES to show in /proc/meminfo to ease to spot the
> memory that
> > went in there, testing Hyper-V/Xen to see how they react to the fact that
> > we are using the beginning of the memory-range for our own purposes, and
> to
> > check the thing about gigantic pages + hotplug.
> > I also have to check that there is no compilation/runtime errors when
> > CONFIG_SPARSEMEM but !CONFIG_SPARSEMEM_VMEMMAP.
> > But before that, I would like to get people's feedback about the overall
> > design, and ideas/suggestions.
> >
> >
> > [1] https://patchwork.kernel.org/cover/10685835/
> >
> > Michal Hocko (3):
> >   mm, memory_hotplug: cleanup memory offline path
> >   mm, memory_hotplug: provide a more generic restrictions for memory
> >     hotplug
> >   mm, sparse: rename kmalloc_section_memmap,
> __kfree_section_memmap
> >
> > Oscar Salvador (1):
> >   mm, memory_hotplug: allocate memmap from the added memory range
> for
> >     sparse-vmemmap
> >
> >  arch/arm64/mm/mmu.c            |  10 ++-
> >  arch/ia64/mm/init.c            |   5 +-
> >  arch/powerpc/mm/init_64.c      |   7 ++
> >  arch/powerpc/mm/mem.c          |   6 +-
> >  arch/s390/mm/init.c            |  12 ++-
> >  arch/sh/mm/init.c              |   6 +-
> >  arch/x86/mm/init_32.c          |   6 +-
> >  arch/x86/mm/init_64.c          |  20 +++--
> >  drivers/hv/hv_balloon.c        |   1 +
> >  drivers/xen/balloon.c          |   1 +
> >  include/linux/memory_hotplug.h |  42 ++++++++--
> >  include/linux/memremap.h       |   2 +-
> >  include/linux/page-flags.h     |  23 +++++
> >  kernel/memremap.c              |   9 +-
> >  mm/compaction.c                |   8 ++
> >  mm/memory_hotplug.c            | 186
> +++++++++++++++++++++++++++++------------
> >  mm/page_alloc.c                |  47 ++++++++++-
> >  mm/page_isolation.c            |  13 +++
> >  mm/sparse.c                    | 124
> +++++++++++++++++++++++++--
> >  mm/util.c                      |   2 +
> >  20 files changed, 431 insertions(+), 99 deletions(-)
> >
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH v2 0/4] mm, memory_hotplug: allocate memmap from hotadded memory
  2019-02-12 13:21   ` Shameerali Kolothum Thodi
@ 2019-02-12 13:56     ` Oscar Salvador
  2019-02-12 14:42       ` Michal Hocko
  0 siblings, 1 reply; 16+ messages in thread
From: Oscar Salvador @ 2019-02-12 13:56 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: Jonathan Cameron, linux-mm, mhocko, dan.j.williams,
	Pavel.Tatashin, david, linux-kernel, dave.hansen, Linuxarm,
	Robin Murphy

On Tue, Feb 12, 2019 at 01:21:38PM +0000, Shameerali Kolothum Thodi wrote:
> > Hi Oscar,
> > 
> > I ran tests on one of our arm64 machines. Particular machine doesn't actually
> > have
> > the mechanics for hotplug, so was all 'faked', but software wise it's all the
> > same.
> > 
> > Upshot, seems to work as expected on arm64 as well.
> > Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Thanks Jonathan for having given it a spin, much appreciated!
I was short of arm64 machines.

> (qemu) object_add memory-backend-ram,id=mem1,size=1G
> (qemu) device_add pc-dimm,id=dimm1,memdev=mem1,node=1
> root@ubuntu:~# 
> root@ubuntu:~# numactl -H
...
> node 1 cpus:
> node 1 size: 1008 MB
> node 1 free: 1008 MB
> node distances:
> node   0   1 
>   0:  10  20 
>   1:  20  10 
> root@ubuntu:~#  

Ok, this is what I wanted to see.
When you hotplugged 1GB, 16MB out of 1024MB were spent
on the memmap array, which is why you only see 1008MB there.

I am not sure what the default section size on arm64 is, but assuming
it is 128MB, that would make sense, as 1GB means 8 sections
and each section's memmap takes 2MB.
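
Spelling out the numbers (assuming 128MB sections, a 4KB base page size and a
64 byte struct page, which is what I would expect here):

	1GB   / 128MB = 8 sections
	128MB / 4KB   = 32768 pages per section
	32768 * 64B   = 2MB of memmap per section
	8     * 2MB   = 16MB reserved, hence 1024MB - 16MB = 1008MB usable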

That means that at least the mechanism works.

> 
> FWIW,
> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>

thanks for having tested it ;-)!
-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH v2 0/4] mm, memory_hotplug: allocate memmap from hotadded memory
  2019-02-12 13:56     ` Oscar Salvador
@ 2019-02-12 14:42       ` Michal Hocko
  2019-02-12 14:50         ` Oscar Salvador
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2019-02-12 14:42 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Shameerali Kolothum Thodi, Jonathan Cameron, linux-mm,
	dan.j.williams, Pavel.Tatashin, david, linux-kernel, dave.hansen,
	Linuxarm, Robin Murphy

On Tue 12-02-19 14:56:58, Oscar Salvador wrote:
> On Tue, Feb 12, 2019 at 01:21:38PM +0000, Shameerali Kolothum Thodi wrote:
> > > Hi Oscar,
> > > 
> > > I ran tests on one of our arm64 machines. Particular machine doesn't actually
> > > have
> > > the mechanics for hotplug, so was all 'faked', but software wise it's all the
> > > same.
> > > 
> > > Upshot, seems to work as expected on arm64 as well.
> > > Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> Thanks Jonathan for having given it a spin, much appreciated!
> I was short of arm64 machines.
> 
> > (qemu) object_add memory-backend-ram,id=mem1,size=1G
> > (qemu) device_add pc-dimm,id=dimm1,memdev=mem1,node=1
> > root@ubuntu:~# 
> > root@ubuntu:~# numactl -H
> ...
> > node 1 cpus:
> > node 1 size: 1008 MB
> > node 1 free: 1008 MB
> > node distances:
> > node   0   1 
> >   0:  10  20 
> >   1:  20  10 
> > root@ubuntu:~#  
> 
> Ok, this is what I wanted to see.
> When you hotplugged 1GB, 16MB out of 1024MB  were spent
> for the memmap array, that is why you only see 1008MB there.
> 
> I am not sure what is the default section size for arm64, but assuming
> is 128MB, that would make sense as 1GB would mean 8 sections,
> and each section takes 2MB.
> 
> That means that at least the mechanism works.

Please make sure to test on a larger machine which has multi-section
memblocks. This is where I was hitting bugs hard.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH v2 0/4] mm, memory_hotplug: allocate memmap from hotadded memory
  2019-02-12 14:42       ` Michal Hocko
@ 2019-02-12 14:50         ` Oscar Salvador
  0 siblings, 0 replies; 16+ messages in thread
From: Oscar Salvador @ 2019-02-12 14:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Shameerali Kolothum Thodi, Jonathan Cameron, linux-mm,
	dan.j.williams, Pavel.Tatashin, david, linux-kernel, dave.hansen,
	Linuxarm, Robin Murphy

On Tue, Feb 12, 2019 at 03:42:42PM +0100, Michal Hocko wrote:
> Please make sure to test on a larger machine which has multi section
> memblocks. This is where I was hitting on bugs hard.

I tested the patchset with large memblocks (2GB) on x86_64, and it worked
fine as well.
On powerpc I was only able to test it on normal memblocks, but I will check
if I can boost the memory there to get large memblocks.

And about arm64, I will talk to Jonathan off-list to see if we can do the same.

Btw, in the meantime, we could get some parts reviewed perhaps.
-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2019-02-12 14:50 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-22 10:37 [RFC PATCH v2 0/4] mm, memory_hotplug: allocate memmap from hotadded memory Oscar Salvador
2019-01-22 10:37 ` [RFC PATCH v2 1/4] mm, memory_hotplug: cleanup memory offline path Oscar Salvador
2019-01-22 10:37 ` [RFC PATCH v2 2/4] mm, memory_hotplug: provide a more generic restrictions for memory hotplug Oscar Salvador
2019-01-22 10:37 ` [RFC PATCH v2 3/4] mm, memory_hotplug: allocate memmap from the added memory range for sparse-vmemmap Oscar Salvador
2019-01-22 10:37 ` [RFC PATCH v2 4/4] mm, sparse: rename kmalloc_section_memmap, __kfree_section_memmap Oscar Salvador
2019-01-25  8:53 ` [RFC PATCH v2 0/4] mm, memory_hotplug: allocate memmap from hotadded memory David Hildenbrand
2019-01-29  8:43   ` Oscar Salvador
2019-01-29 10:08     ` David Hildenbrand
2019-01-30 21:52 ` Oscar Salvador
2019-01-31  7:23   ` Michal Hocko
2019-01-31  8:03     ` Oscar Salvador
2019-02-12 12:47 ` Jonathan Cameron
2019-02-12 13:21   ` Shameerali Kolothum Thodi
2019-02-12 13:56     ` Oscar Salvador
2019-02-12 14:42       ` Michal Hocko
2019-02-12 14:50         ` Oscar Salvador
