* [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
@ 2019-03-28 13:43 Oscar Salvador
  2019-03-28 13:43 ` [PATCH 1/4] mm, memory_hotplug: cleanup memory offline path Oscar Salvador
                   ` (5 more replies)
  0 siblings, 6 replies; 42+ messages in thread
From: Oscar Salvador @ 2019-03-28 13:43 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, david, dan.j.williams, Jonathan.Cameron,
	anshuman.khandual, linux-kernel, linux-mm, Oscar Salvador

Hi,

since the last two RFCs went almost unnoticed (thanks David for the feedback),
I decided to re-work some parts to make them simpler, give the series more
testing, and drop the RFC tag to see if it gets more attention.
I also addressed David's feedback, so now all users of add_memory/__add_memory/
add_memory_resource can specify whether they want to use this feature or not.
I also fixed some compilation issues when CONFIG_SPARSEMEM_VMEMMAP is not set.

[Testing]

Testing has been carried out on the following platforms:

 - x86_64 (small and big memblocks)
 - powerpc
 - arm64 (tested by colleagues at Huawei)

I plan to test it on Xen and Hyper-V, but for now those two will not be
using this feature, and neither will DAX/pmem.

Of course, if this does not run into any strong objection, my next step is to
work on enabling this on Xen/Hyper-V.

[Coverletter]

This is another step to make memory hotplug more usable. The primary
goal of this patchset is to reduce the memory overhead of hot-added
memory (at least for the SPARSEMEM_VMEMMAP memory model). The way we
currently populate the memmap (struct page array) has two main drawbacks:

a) it consumes additional memory until the hotadded memory itself is
   onlined, and
b) the memmap might end up on a different NUMA node, which is especially
   true for the movable_node configuration.

a) is especially a problem for memory-hotplug-based memory "ballooning"
   solutions, where the delay between the physical memory hotplug and the
   onlining can lead to OOM; that led to the introduction of hacks like auto
   onlining (see 31bc3858ea3e ("memory-hotplug: add automatic onlining
   policy for the newly added memory")).

b) can have performance drawbacks.

I have also seen hot-add operations fail on some architectures because they
ran out of order-x pages.
E.g. on powerpc, in certain configurations we use order-8 pages, and given
a 64KB base page size, that is 16MB.
If we run out of those, we simply fail the operation and cannot add more
memory.
We could fall back to base pages as x86_64 does, but we can do better.

One way to mitigate all these issues is to simply allocate the memmap array
(which is the largest memory footprint of the physical memory hotplug)
from the hotadded memory itself. The VMEMMAP memory model allows us to map
any pfn range, so the memory doesn't need to be online to be usable for
the array. See patch 3 for more details. In short, I am reusing the
existing vmem_altmap infrastructure, which achieves the same thing for
nvdimm device memory.

There is also one potential drawback, though. If somebody uses memory
hotplug for 1GB (gigantic) hugetlb pages, then this scheme will obviously not
work for them, because each memory block will contain a reserved area. Large
x86 machines will use 2GB memblocks, so at least one 1GB page will be
available, but this is still not 2GB...

If that is a problem, we can always configure a fallback strategy to
use the current scheme.

Since this only works when CONFIG_SPARSEMEM_VMEMMAP is set,
we check for it before setting the flag that enables the feature,
regardless of whether the user asked for it.
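
For illustration, a minimal sketch of how that gating ends up looking in
add_memory_resource() (see patch 3 for the real code):

	static inline bool can_use_vmemmap(void)
	{
		return IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP);
	}

	/* in add_memory_resource() */
	restrictions.flags = MHP_MEMBLOCK_API;
	if (can_use_vmemmap() && use_vmemmap)
		restrictions.flags |= MHP_MEMMAP_FROM_RANGE;
	ret = arch_add_memory(nid, start, size, &restrictions);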

[Overall design]:

Let us say we hot-add 2GB of memory on a x86_64 (memblock size = 128M).
That is:

 - 16 sections
 - 524288 pages
 - 8192 vmemmap pages (out of those 524288; we spend 512 pages per section)

 The range of pages is: 0xffffea0004000000 - 0xffffea0006000000
 The vmemmap range is:  0xffffea0004000000 - 0xffffea0004080000

 0xffffea0004000000 is the head vmemmap page (first page), while all the others
 are "tails".

 We keep the following information in those vmemmap pages:

 - Head page:
   - head->_refcount: number of sections
   - head->private :  number of vmemmap pages
 - Tail page:
   - tail->freelist : pointer to the head

This is done because it eases the work in places where we have to compute the
number of vmemmap pages to know how much to skip etc., and it keeps the
accounting of present_pages right.
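
As a sketch, any pfn walker can derive how many pages to skip from a vmemmap
page via its head page; this mirrors the get_nr_vmemmap_pages() helper
introduced in patch 3:

	static unsigned long get_nr_vmemmap_pages(struct page *page)
	{
		/* every tail keeps a pointer to the head in page->freelist */
		struct page *head = (struct page *)page->freelist;

		/* head->private holds the total number of vmemmap pages */
		return head->private - (page - head);
	}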

When we want to hot-remove the range, we need to be careful because the first
pages of that range are used for the memmap mapping, so if we remove those
first, we would blow up while accessing the others later on.
For that reason we keep the number of sections in head->_refcount, to know
how long we have to defer the freeing.

Since sections are removed sequentially in a hot-remove operation, the
approach taken here is that every time we hit free_section_memmap(), we
decrease the refcount of the head.
When it reaches 0, we know that we hit the last section, so we call
vmemmap_free() for the whole memory range backwards, making sure that the
pages used for the mapping are the last ones to be freed.
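
Sketched out, the hot-remove side boils down to the following (this is
roughly what deferred_vmemmap_free() in patch 3 does; head_vmemmap_page and
in_vmemmap_range are static state in mm/sparse.c):

	/* called from free_section_memmap() for every section removed */
	static void deferred_vmemmap_free(unsigned long start, unsigned long end)
	{
		if (!in_vmemmap_range) {
			in_vmemmap_range = true;
			head_vmemmap_page = (struct page *)start;
		}

		/* only the last section drops head->_refcount to 0 ... */
		if (page_ref_dec_and_test(head_vmemmap_page))
			/* ... and only then is the whole range freed, backwards */
			free_deferred_vmemmap_range(start, end);
	}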

Vmemmap pages are charged to spanned_pages/present_pages, but not to
managed_pages.

Michal Hocko (3):
  mm, memory_hotplug: cleanup memory offline path
  mm, memory_hotplug: provide a more generic restrictions for memory
    hotplug
  mm, sparse: rename kmalloc_section_memmap, __kfree_section_memmap

Oscar Salvador (1):
  mm, memory_hotplug: allocate memmap from the added memory range for
    sparse-vmemmap

 arch/arm64/mm/mmu.c                             |  10 +-
 arch/ia64/mm/init.c                             |   5 +-
 arch/powerpc/mm/init_64.c                       |   7 +
 arch/powerpc/mm/mem.c                           |   6 +-
 arch/powerpc/platforms/powernv/memtrace.c       |   2 +-
 arch/powerpc/platforms/pseries/hotplug-memory.c |   2 +-
 arch/s390/mm/init.c                             |  12 +-
 arch/sh/mm/init.c                               |   6 +-
 arch/x86/mm/init_32.c                           |   6 +-
 arch/x86/mm/init_64.c                           |  20 ++-
 drivers/acpi/acpi_memhotplug.c                  |   2 +-
 drivers/base/memory.c                           |   2 +-
 drivers/dax/kmem.c                              |   2 +-
 drivers/hv/hv_balloon.c                         |   2 +-
 drivers/s390/char/sclp_cmd.c                    |   2 +-
 drivers/xen/balloon.c                           |   2 +-
 include/linux/memory_hotplug.h                  |  53 ++++++--
 include/linux/memremap.h                        |   2 +-
 include/linux/page-flags.h                      |  34 +++++
 kernel/memremap.c                               |   9 +-
 mm/compaction.c                                 |   6 +
 mm/memory_hotplug.c                             | 168 ++++++++++++++++--------
 mm/page_alloc.c                                 |  30 ++++-
 mm/page_isolation.c                             |  11 ++
 mm/sparse.c                                     | 104 +++++++++++++--
 mm/util.c                                       |   2 +
 26 files changed, 393 insertions(+), 114 deletions(-)

-- 
2.13.7


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 1/4] mm, memory_hotplug: cleanup memory offline path
  2019-03-28 13:43 [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory Oscar Salvador
@ 2019-03-28 13:43 ` Oscar Salvador
  2019-04-03  8:43   ` Michal Hocko
  2019-03-28 13:43 ` [PATCH 2/4] mm, memory_hotplug: provide a more generic restrictions for memory hotplug Oscar Salvador
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 42+ messages in thread
From: Oscar Salvador @ 2019-03-28 13:43 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, david, dan.j.williams, Jonathan.Cameron,
	anshuman.khandual, linux-kernel, linux-mm, Oscar Salvador

From: Michal Hocko <mhocko@suse.com>

check_pages_isolated_cb currently accounts the whole pfn range as being
offlined if test_pages_isolated succeeds on the range. This is based on
the assumption that all pages in the range are freed, which is currently
the case most of the time, but it won't be with later changes, as pages
marked as vmemmap won't be isolated.

Move the offlined pages counting to offline_isolated_pages_cb and
rely on __offline_isolated_pages to return the correct value.
check_pages_isolated_cb will still do its primary job and check the pfn
range.

While we are at it, remove check_pages_isolated and offline_isolated_pages
and use walk_system_ram_range directly, as we do in online_pages.
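
In short, both callers end up using the same walk_system_ram_range() pattern
that online_pages already uses; roughly (see the diff below):

	/* check again: a non-zero return means not everything is isolated yet */
	ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn, NULL,
				    check_pages_isolated_cb);

	/* once isolated, account the offlined pages through the callback */
	walk_system_ram_range(start_pfn, end_pfn - start_pfn, &offlined_pages,
			      offline_isolated_pages_cb);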

Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 include/linux/memory_hotplug.h |  2 +-
 mm/memory_hotplug.c            | 45 +++++++++++-------------------------------
 mm/page_alloc.c                | 11 +++++++++--
 3 files changed, 21 insertions(+), 37 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 8ade08c50d26..42ba7199f701 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -87,7 +87,7 @@ extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
 extern int online_pages(unsigned long, unsigned long, int);
 extern int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn,
 	unsigned long *valid_start, unsigned long *valid_end);
-extern void __offline_isolated_pages(unsigned long, unsigned long);
+extern unsigned long __offline_isolated_pages(unsigned long, unsigned long);
 
 typedef void (*online_page_callback_t)(struct page *page, unsigned int order);
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0082d699be94..5139b3bfd8b0 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1453,17 +1453,12 @@ static int
 offline_isolated_pages_cb(unsigned long start, unsigned long nr_pages,
 			void *data)
 {
-	__offline_isolated_pages(start, start + nr_pages);
+	unsigned long offlined_pages;
+	offlined_pages = __offline_isolated_pages(start, start + nr_pages);
+	*(unsigned long *)data += offlined_pages;
 	return 0;
 }
 
-static void
-offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
-{
-	walk_system_ram_range(start_pfn, end_pfn - start_pfn, NULL,
-				offline_isolated_pages_cb);
-}
-
 /*
  * Check all pages in range, recoreded as memory resource, are isolated.
  */
@@ -1471,26 +1466,7 @@ static int
 check_pages_isolated_cb(unsigned long start_pfn, unsigned long nr_pages,
 			void *data)
 {
-	int ret;
-	long offlined = *(long *)data;
-	ret = test_pages_isolated(start_pfn, start_pfn + nr_pages, true);
-	offlined = nr_pages;
-	if (!ret)
-		*(long *)data += offlined;
-	return ret;
-}
-
-static long
-check_pages_isolated(unsigned long start_pfn, unsigned long end_pfn)
-{
-	long offlined = 0;
-	int ret;
-
-	ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn, &offlined,
-			check_pages_isolated_cb);
-	if (ret < 0)
-		offlined = (long)ret;
-	return offlined;
+	return test_pages_isolated(start_pfn, start_pfn + nr_pages, true);
 }
 
 static int __init cmdline_parse_movable_node(char *p)
@@ -1575,7 +1551,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
 		  unsigned long end_pfn)
 {
 	unsigned long pfn, nr_pages;
-	long offlined_pages;
+	unsigned long offlined_pages = 0;
 	int ret, node, nr_isolate_pageblock;
 	unsigned long flags;
 	unsigned long valid_start, valid_end;
@@ -1651,14 +1627,15 @@ static int __ref __offline_pages(unsigned long start_pfn,
 			goto failed_removal_isolated;
 		}
 		/* check again */
-		offlined_pages = check_pages_isolated(start_pfn, end_pfn);
-	} while (offlined_pages < 0);
+		ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn, NULL,
+							check_pages_isolated_cb);
+	} while (ret);
 
-	pr_info("Offlined Pages %ld\n", offlined_pages);
 	/* Ok, all of our target is isolated.
 	   We cannot do rollback at this point. */
-	offline_isolated_pages(start_pfn, end_pfn);
-
+	walk_system_ram_range(start_pfn, end_pfn - start_pfn, &offlined_pages,
+						offline_isolated_pages_cb);
+	pr_info("Offlined Pages %ld\n", offlined_pages);
 	/*
 	 * Onlining will reset pagetype flags and makes migrate type
 	 * MOVABLE, so just need to decrease the number of isolated
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d96ca5bc555b..d128f53888b8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8374,7 +8374,7 @@ void zone_pcp_reset(struct zone *zone)
  * All pages in the range must be in a single zone and isolated
  * before calling this.
  */
-void
+unsigned long
 __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 {
 	struct page *page;
@@ -8382,12 +8382,15 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 	unsigned int order, i;
 	unsigned long pfn;
 	unsigned long flags;
+	unsigned long offlined_pages = 0;
+
 	/* find the first valid pfn */
 	for (pfn = start_pfn; pfn < end_pfn; pfn++)
 		if (pfn_valid(pfn))
 			break;
 	if (pfn == end_pfn)
-		return;
+		return offlined_pages;
+
 	offline_mem_sections(pfn, end_pfn);
 	zone = page_zone(pfn_to_page(pfn));
 	spin_lock_irqsave(&zone->lock, flags);
@@ -8405,12 +8408,14 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 		if (unlikely(!PageBuddy(page) && PageHWPoison(page))) {
 			pfn++;
 			SetPageReserved(page);
+			offlined_pages++;
 			continue;
 		}
 
 		BUG_ON(page_count(page));
 		BUG_ON(!PageBuddy(page));
 		order = page_order(page);
+		offlined_pages += 1 << order;
 #ifdef CONFIG_DEBUG_VM
 		pr_info("remove from free list %lx %d %lx\n",
 			pfn, 1 << order, end_pfn);
@@ -8423,6 +8428,8 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 		pfn += (1 << order);
 	}
 	spin_unlock_irqrestore(&zone->lock, flags);
+
+	return offlined_pages;
 }
 #endif
 
-- 
2.13.7


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 2/4] mm, memory_hotplug: provide a more generic restrictions for memory hotplug
  2019-03-28 13:43 [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory Oscar Salvador
  2019-03-28 13:43 ` [PATCH 1/4] mm, memory_hotplug: cleanup memory offline path Oscar Salvador
@ 2019-03-28 13:43 ` Oscar Salvador
  2019-04-03  8:46   ` Michal Hocko
  2019-03-28 13:43 ` [PATCH 3/4] mm, memory_hotplug: allocate memmap from the added memory range for sparse-vmemmap Oscar Salvador
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 42+ messages in thread
From: Oscar Salvador @ 2019-03-28 13:43 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, david, dan.j.williams, Jonathan.Cameron,
	anshuman.khandual, linux-kernel, linux-mm, Oscar Salvador

From: Michal Hocko <mhocko@suse.com>

arch_add_memory and __add_pages take a want_memblock argument which controls
whether the newly added memory should get the sysfs memblock user API (e.g.
ZONE_DEVICE users do not want/need this interface). Some callers even
want to control where we allocate the memmap from, by configuring an
altmap.

Add a more generic hotplug context for arch_add_memory and __add_pages.
struct mhp_restrictions contains flags for additional features to be
enabled by the memory hotplug (currently only MHP_MEMBLOCK_API) and an
altmap for an alternative memmap allocator.

Please note that the complete altmap propagation down to vmemmap code
is still not done in this patch. It will be done in the follow up to
reduce the churn here.

This patch shouldn't introduce any functional change.
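
A caller that wants the sysfs memblock API then simply fills the new
structure and passes it down; roughly what add_memory_resource() does after
this change:

	struct mhp_restrictions restrictions = {};

	restrictions.flags = MHP_MEMBLOCK_API;	/* create sysfs memblock files */
	/* restrictions.altmap stays NULL -> default memmap allocator */
	ret = arch_add_memory(nid, start, size, &restrictions);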

Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 arch/arm64/mm/mmu.c            |  5 ++---
 arch/ia64/mm/init.c            |  5 ++---
 arch/powerpc/mm/mem.c          |  6 +++---
 arch/s390/mm/init.c            |  6 +++---
 arch/sh/mm/init.c              |  6 +++---
 arch/x86/mm/init_32.c          |  6 +++---
 arch/x86/mm/init_64.c          | 10 +++++-----
 include/linux/memory_hotplug.h | 29 +++++++++++++++++++++++------
 kernel/memremap.c              |  9 ++++++---
 mm/memory_hotplug.c            | 10 ++++++----
 10 files changed, 56 insertions(+), 36 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index e97f018ff740..8c0d5484b38c 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1046,8 +1046,7 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		    bool want_memblock)
+int arch_add_memory(int nid, u64 start, u64 size, struct mhp_restrictions *restrictions)
 {
 	int flags = 0;
 
@@ -1058,6 +1057,6 @@ int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
 			     size, PAGE_KERNEL, pgd_pgtable_alloc, flags);
 
 	return __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
-			   altmap, want_memblock);
+							restrictions);
 }
 #endif
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index e49200e31750..7af16f5d5ca6 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -666,14 +666,13 @@ mem_init (void)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+int arch_add_memory(int nid, u64 start, u64 size, struct mhp_restrictions *restrictions)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
-	ret = __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	ret = __add_pages(nid, start_pfn, nr_pages, restrictions);
 	if (ret)
 		printk("%s: Problem encountered in __add_pages() as ret=%d\n",
 		       __func__,  ret);
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index f6787f90e158..76bcc29fa3e1 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -109,8 +109,8 @@ int __weak remove_section_mapping(unsigned long start, unsigned long end)
 	return -ENODEV;
 }
 
-int __meminit arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+int __meminit arch_add_memory(int nid, u64 start, u64 size,
+			struct mhp_restrictions *restrictions)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
@@ -127,7 +127,7 @@ int __meminit arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *
 	}
 	flush_inval_dcache_range(start, start + size);
 
-	return __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	return __add_pages(nid, start_pfn, nr_pages, restrictions);
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 3e82f66d5c61..9ae71a82e9e1 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -224,8 +224,8 @@ device_initcall(s390_cma_mem_init);
 
 #endif /* CONFIG_CMA */
 
-int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+int arch_add_memory(int nid, u64 start, u64 size,
+		struct mhp_restrictions *restrictions)
 {
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long size_pages = PFN_DOWN(size);
@@ -235,7 +235,7 @@ int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
 	if (rc)
 		return rc;
 
-	rc = __add_pages(nid, start_pfn, size_pages, altmap, want_memblock);
+	rc = __add_pages(nid, start_pfn, size_pages, restrictions);
 	if (rc)
 		vmem_remove_mapping(start, size);
 	return rc;
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index 70621324db41..32798bd4c32f 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -416,15 +416,15 @@ void free_initrd_mem(unsigned long start, unsigned long end)
 #endif
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+int arch_add_memory(int nid, u64 start, u64 size,
+		struct mhp_restrictions *restrictions)
 {
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
 	/* We only have ZONE_NORMAL, so this is easy.. */
-	ret = __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	ret = __add_pages(nid, start_pfn, nr_pages, restrictions);
 	if (unlikely(ret))
 		printk("%s: Failed, __add_pages() == %d\n", __func__, ret);
 
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 85c94f9a87f8..755dbed85531 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -850,13 +850,13 @@ void __init mem_init(void)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+int arch_add_memory(int nid, u64 start, u64 size,
+			struct mhp_restrictions *restrictions)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
-	return __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	return __add_pages(nid, start_pfn, nr_pages, restrictions);
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index bccff68e3267..db42c11b48fb 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -777,11 +777,11 @@ static void update_end_of_memory_vars(u64 start, u64 size)
 }
 
 int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
-		struct vmem_altmap *altmap, bool want_memblock)
+				struct mhp_restrictions *restrictions)
 {
 	int ret;
 
-	ret = __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	ret = __add_pages(nid, start_pfn, nr_pages, restrictions);
 	WARN_ON_ONCE(ret);
 
 	/* update max_pfn, max_low_pfn and high_memory */
@@ -791,15 +791,15 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
 	return ret;
 }
 
-int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+int arch_add_memory(int nid, u64 start, u64 size,
+			struct mhp_restrictions *restrictions)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
 	init_memory_mapping(start, start + size);
 
-	return add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	return add_pages(nid, start_pfn, nr_pages, restrictions);
 }
 
 #define PAGE_INUSE 0xFD
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 42ba7199f701..119a012d43b8 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -117,20 +117,37 @@ extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
 	unsigned long nr_pages, struct vmem_altmap *altmap);
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
+/*
+ * Do we want sysfs memblock files created. This will allow userspace to online
+ * and offline memory explicitly. Lack of this bit means that the caller has to
+ * call move_pfn_range_to_zone to finish the initialization.
+ */
+
+#define MHP_MEMBLOCK_API               1<<0
+
+/*
+ * Restrictions for the memory hotplug:
+ * flags:  MHP_ flags
+ * altmap: alternative allocator for memmap array
+ */
+struct mhp_restrictions {
+	unsigned long flags;
+	struct vmem_altmap *altmap;
+};
+
 /* reasonably generic interface to expand the physical pages */
 extern int __add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
-		struct vmem_altmap *altmap, bool want_memblock);
+					struct mhp_restrictions *restrictions);
 
 #ifndef CONFIG_ARCH_HAS_ADD_PAGES
 static inline int add_pages(int nid, unsigned long start_pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap,
-		bool want_memblock)
+		unsigned long nr_pages, struct mhp_restrictions *restrictions)
 {
-	return __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	return __add_pages(nid, start_pfn, nr_pages, restrictions);
 }
 #else /* ARCH_HAS_ADD_PAGES */
 int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
-		struct vmem_altmap *altmap, bool want_memblock);
+				struct mhp_restrictions *restrictions);
 #endif /* ARCH_HAS_ADD_PAGES */
 
 #ifdef CONFIG_NUMA
@@ -332,7 +349,7 @@ extern int __add_memory(int nid, u64 start, u64 size);
 extern int add_memory(int nid, u64 start, u64 size);
 extern int add_memory_resource(int nid, struct resource *resource);
 extern int arch_add_memory(int nid, u64 start, u64 size,
-		struct vmem_altmap *altmap, bool want_memblock);
+			struct mhp_restrictions *restrictions);
 extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap);
 extern bool is_memblock_offlined(struct memory_block *mem);
diff --git a/kernel/memremap.c b/kernel/memremap.c
index a856cb5ff192..d42f11673979 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -149,6 +149,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 	struct resource *res = &pgmap->res;
 	struct dev_pagemap *conflict_pgmap;
 	pgprot_t pgprot = PAGE_KERNEL;
+	struct mhp_restrictions restrictions = {};
 	int error, nid, is_ram;
 
 	if (!pgmap->ref || !pgmap->kill)
@@ -199,6 +200,9 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 	if (error)
 		goto err_pfn_remap;
 
+	/* We do not want any optional features only our own memmap */
+	restrictions.altmap = altmap;
+
 	mem_hotplug_begin();
 
 	/*
@@ -214,7 +218,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 	 */
 	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
 		error = add_pages(nid, align_start >> PAGE_SHIFT,
-				align_size >> PAGE_SHIFT, NULL, false);
+				align_size >> PAGE_SHIFT, &restrictions);
 	} else {
 		error = kasan_add_zero_shadow(__va(align_start), align_size);
 		if (error) {
@@ -222,8 +226,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 			goto err_kasan;
 		}
 
-		error = arch_add_memory(nid, align_start, align_size, altmap,
-				false);
+		error = arch_add_memory(nid, align_start, align_size, &restrictions);
 	}
 
 	if (!error) {
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 5139b3bfd8b0..836cb026ed7b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -273,12 +273,12 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
  * add the new pages.
  */
 int __ref __add_pages(int nid, unsigned long phys_start_pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap,
-		bool want_memblock)
+		unsigned long nr_pages, struct mhp_restrictions *restrictions)
 {
 	unsigned long i;
 	int err = 0;
 	int start_sec, end_sec;
+	struct vmem_altmap *altmap = restrictions->altmap;
 
 	/* during initialize mem_map, align hot-added range to section */
 	start_sec = pfn_to_section_nr(phys_start_pfn);
@@ -299,7 +299,7 @@ int __ref __add_pages(int nid, unsigned long phys_start_pfn,
 
 	for (i = start_sec; i <= end_sec; i++) {
 		err = __add_section(nid, section_nr_to_pfn(i), altmap,
-				want_memblock);
+				restrictions->flags & MHP_MEMBLOCK_API);
 
 		/*
 		 * EEXIST is finally dealt with by ioresource collision
@@ -1099,6 +1099,7 @@ int __ref add_memory_resource(int nid, struct resource *res)
 	u64 start, size;
 	bool new_node = false;
 	int ret;
+	struct mhp_restrictions restrictions = {};
 
 	start = res->start;
 	size = resource_size(res);
@@ -1123,7 +1124,8 @@ int __ref add_memory_resource(int nid, struct resource *res)
 	new_node = ret;
 
 	/* call arch's memory hotadd */
-	ret = arch_add_memory(nid, start, size, NULL, true);
+	restrictions.flags = MHP_MEMBLOCK_API;
+	ret = arch_add_memory(nid, start, size, &restrictions);
 	if (ret < 0)
 		goto error;
 
-- 
2.13.7


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 3/4] mm, memory_hotplug: allocate memmap from the added memory range for sparse-vmemmap
  2019-03-28 13:43 [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory Oscar Salvador
  2019-03-28 13:43 ` [PATCH 1/4] mm, memory_hotplug: cleanup memory offline path Oscar Salvador
  2019-03-28 13:43 ` [PATCH 2/4] mm, memory_hotplug: provide a more generic restrictions for memory hotplug Oscar Salvador
@ 2019-03-28 13:43 ` Oscar Salvador
  2019-03-28 13:43 ` [PATCH 4/4] mm, sparse: rename kmalloc_section_memmap, __kfree_section_memmap Oscar Salvador
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 42+ messages in thread
From: Oscar Salvador @ 2019-03-28 13:43 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, david, dan.j.williams, Jonathan.Cameron,
	anshuman.khandual, linux-kernel, linux-mm, Oscar Salvador

Physical memory hotadd has to allocate a memmap (struct page array) for
the newly added memory section. Currently, alloc_pages_node() is used
for those allocations.

This has some disadvantages:
 a) existing memory is consumed for that purpose
    (~2MB per 128MB memory section on x86_64)
 b) if the whole node is movable then we have off-node struct pages
    which have performance drawbacks.

a) has turned out to be a problem for memory hotplug based ballooning
because userspace might not react in time to online memory while
the memory consumed during the physical hotadd is enough to push the
system to OOM. 31bc3858ea3e ("memory-hotplug: add automatic onlining
policy for the newly added memory") has been added to work around that
problem.

I have also seen hot-add operations failing on powerpc due to the fact
that we try to use order-8 pages. If the base page size is 64KB, this
gives us 16MB, and if we run out of those, we simply fail.
One could argue that we could fall back to base pages as we do on x86_64.

But we can do much better when CONFIG_SPARSEMEM_VMEMMAP=y, because vmemmap
page tables can map arbitrary memory. That means that we can simply
use the beginning of each memory section and map struct pages there.
The struct pages which back the allocated space then just need to be
treated carefully.

Add {_Set,_Clear}PageVmemmap helpers to distinguish those pages in pfn
walkers. We do not have any spare page flag for this purpose so use the
combination of PageReserved bit which already tells that the page should
be ignored by the core mm code and store VMEMMAP_PAGE (which sets all
bits but PAGE_MAPPING_FLAGS) into page->mapping.

There is one case where we cannot check for PageReserved, and that is
when we have poisoning enabled + VM_BUG_ON_PGFLAGS is on + the page
is not initialized.
This happens in __init_single_page, where we do have to preserve the
state of PageVmemmap pages, so we cannot zero the page.
I added __PageVmemmap for that purpose, as it only checks for the
page->mapping field.
It should be enough as these pages are not yet onlined.

On the memory hotplug front, add a new MHP_MEMMAP_FROM_RANGE restriction
flag. The user is supposed to set the flag if the memmap should be allocated
from the hotadded range.
Right now, this is passed to add_memory(), __add_memory() and
add_memory_resource().
Unfortunately we do not have a single entry point, as Hyper-V, ACPI and Xen
use those three functions, so all those users have to specify whether they
want the memmap array allocated from the hot-added range.
For the time being, only ACPI enables it.

Implementation wise, we reuse the vmem_altmap infrastructure to override
the default allocator used by __vmemmap_populate. Once the memmap is
allocated, we need a way to mark altmap pfns used for the allocation.
If MHP_MEMMAP_FROM_RANGE was passed, we set up the layout of the altmap
structure at the beginning of __add_pages(), and then we call
mark_vmemmap_pages() after the memory has been added.
mark_vmemmap_pages marks the pages as vmemmap and sets some metadata:

The current layout of the vmemmap pages is:

- There is a head Vmemmap (first page), which has the following fields set:
  * page->_refcount: number of sections that used this altmap
  * page->private: total number of vmemmap pages
- The remaining vmemmap pages have:
  * page->freelist: pointer to the head vmemmap page

This is done to ease the computations we need in some places.

E.g:
Let us say we hot-add 9GB on x86_64:

head->_refcount = 72 sections
head->private = 36864 vmemmap pages
each tail page's ->freelist = head

We keep a _refcount of the used sections to know how long we have to defer
the call to vmemmap_free().
The thing is that the first pages of the hot-added range are used to create
the memmap mapping, so we cannot remove those first, otherwise we would blow
up.

Since sections are removed sequentially when we hot-remove a memory range,
we wait until we hit the last section, and then we free the whole range
backwards via vmemmap_free().
We know that it is the last section because on every pass we decrease
head->_refcount, and when it reaches 0, we have hit the last section.

We also have to be careful about those pages during online and offline
operations. They are simply skipped: online keeps them reserved (and so
unusable for any other purpose) and offline ignores them so they do not
block the offline operation.
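
For example, the pfn walkers touched below skip over the whole vmemmap area
in one step, along the lines of (see the page_isolation.c hunk):

	page = pfn_to_page(pfn);
	if (PageVmemmap(page)) {
		/* vmemmap pages are neither isolated nor migrated */
		pfn += get_nr_vmemmap_pages(page);
		continue;
	}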

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 arch/arm64/mm/mmu.c                             |   5 +-
 arch/powerpc/mm/init_64.c                       |   7 ++
 arch/powerpc/platforms/powernv/memtrace.c       |   2 +-
 arch/powerpc/platforms/pseries/hotplug-memory.c |   2 +-
 arch/s390/mm/init.c                             |   6 ++
 arch/x86/mm/init_64.c                           |  10 +++
 drivers/acpi/acpi_memhotplug.c                  |   2 +-
 drivers/base/memory.c                           |   2 +-
 drivers/dax/kmem.c                              |   2 +-
 drivers/hv/hv_balloon.c                         |   2 +-
 drivers/s390/char/sclp_cmd.c                    |   2 +-
 drivers/xen/balloon.c                           |   2 +-
 include/linux/memory_hotplug.h                  |  22 ++++-
 include/linux/memremap.h                        |   2 +-
 include/linux/page-flags.h                      |  34 +++++++
 mm/compaction.c                                 |   6 ++
 mm/memory_hotplug.c                             | 115 ++++++++++++++++++++----
 mm/page_alloc.c                                 |  19 +++-
 mm/page_isolation.c                             |  11 +++
 mm/sparse.c                                     |  88 ++++++++++++++++++
 mm/util.c                                       |   2 +
 21 files changed, 309 insertions(+), 34 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 8c0d5484b38c..9607d4a3fc6b 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -746,7 +746,10 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 		if (pmd_none(READ_ONCE(*pmdp))) {
 			void *p = NULL;
 
-			p = vmemmap_alloc_block_buf(PMD_SIZE, node);
+			if (altmap)
+				p = altmap_alloc_block_buf(PMD_SIZE, altmap);
+			else
+				p = vmemmap_alloc_block_buf(PMD_SIZE, node);
 			if (!p)
 				return -ENOMEM;
 
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index a4c155af1597..94bf60c1b388 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -294,6 +294,13 @@ void __ref vmemmap_free(unsigned long start, unsigned long end,
 
 		if (base_pfn >= alt_start && base_pfn < alt_end) {
 			vmem_altmap_free(altmap, nr_pages);
+		} else if (PageVmemmap(page)) {
+			/*
+			 * runtime vmemmap pages are residing inside the memory
+			 * section so they do not have to be freed anywhere.
+			 */
+			while (PageVmemmap(page))
+				__ClearPageVmemmap(page++);
 		} else if (PageReserved(page)) {
 			/* allocated from bootmem */
 			if (page_size < PAGE_SIZE) {
diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c
index 248a38ad25c7..6aa07ef19849 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -233,7 +233,7 @@ static int memtrace_online(void)
 			ent->mem = 0;
 		}
 
-		if (add_memory(ent->nid, ent->start, ent->size)) {
+		if (add_memory(ent->nid, ent->start, ent->size, true)) {
 			pr_err("Failed to add trace memory to node %d\n",
 				ent->nid);
 			ret += 1;
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index d291b618a559..ffe02414248d 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -670,7 +670,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 	nid = memory_add_physaddr_to_nid(lmb->base_addr);
 
 	/* Add the memory */
-	rc = __add_memory(nid, lmb->base_addr, block_sz);
+	rc = __add_memory(nid, lmb->base_addr, block_sz, true);
 	if (rc) {
 		invalidate_lmb_associativity_index(lmb);
 		return rc;
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 9ae71a82e9e1..75e96860a9ac 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -231,6 +231,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
 	unsigned long size_pages = PFN_DOWN(size);
 	int rc;
 
+	/*
+	 * Physical memory is added only later during the memory online so we
+	 * cannot use the added range at this stage unfortunately.
+	 */
+	restrictions->flags &= ~MHP_MEMMAP_FROM_RANGE;
+
 	rc = vmem_add_mapping(start, size);
 	if (rc)
 		return rc;
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index db42c11b48fb..2e40c9e637b9 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -809,6 +809,16 @@ static void __meminit free_pagetable(struct page *page, int order)
 	unsigned long magic;
 	unsigned int nr_pages = 1 << order;
 
+	/*
+	 * runtime vmemmap pages are residing inside the memory section so
+	 * they do not have to be freed anywhere.
+	 */
+	if (PageVmemmap(page)) {
+		while (nr_pages--)
+			__ClearPageVmemmap(page++);
+		return;
+	}
+
 	/* bootmem page has reserved flag */
 	if (PageReserved(page)) {
 		__ClearPageReserved(page);
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index 8fe0960ea572..ff9d78def208 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -228,7 +228,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
 		if (node < 0)
 			node = memory_add_physaddr_to_nid(info->start_addr);
 
-		result = __add_memory(node, info->start_addr, info->length);
+		result = __add_memory(node, info->start_addr, info->length, true);
 
 		/*
 		 * If the memory block has been used by the kernel, add_memory()
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index cb8347500ce2..28fd2a5cc0c9 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -510,7 +510,7 @@ static ssize_t probe_store(struct device *dev, struct device_attribute *attr,
 
 	nid = memory_add_physaddr_to_nid(phys_addr);
 	ret = __add_memory(nid, phys_addr,
-			   MIN_MEMORY_BLOCK_SIZE * sections_per_block);
+			   MIN_MEMORY_BLOCK_SIZE * sections_per_block, true);
 
 	if (ret)
 		goto out;
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index a02318c6d28a..904371834390 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -65,7 +65,7 @@ int dev_dax_kmem_probe(struct device *dev)
 	new_res->flags = IORESOURCE_SYSTEM_RAM;
 	new_res->name = dev_name(dev);
 
-	rc = add_memory(numa_node, new_res->start, resource_size(new_res));
+	rc = add_memory(numa_node, new_res->start, resource_size(new_res), false);
 	if (rc)
 		return rc;
 
diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index dd475f3bcc8a..c88f32964b5f 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -741,7 +741,7 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size,
 
 		nid = memory_add_physaddr_to_nid(PFN_PHYS(start_pfn));
 		ret = add_memory(nid, PFN_PHYS((start_pfn)),
-				(HA_CHUNK << PAGE_SHIFT));
+				(HA_CHUNK << PAGE_SHIFT), false);
 
 		if (ret) {
 			pr_err("hot_add memory failed error is %d\n", ret);
diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c
index 37d42de06079..b021c96cb5c7 100644
--- a/drivers/s390/char/sclp_cmd.c
+++ b/drivers/s390/char/sclp_cmd.c
@@ -406,7 +406,7 @@ static void __init add_memory_merged(u16 rn)
 	if (!size)
 		goto skip_add;
 	for (addr = start; addr < start + size; addr += block_size)
-		add_memory(numa_pfn_to_nid(PFN_DOWN(addr)), addr, block_size);
+		add_memory(numa_pfn_to_nid(PFN_DOWN(addr)), addr, block_size, false);
 skip_add:
 	first_rn = rn;
 	num = 1;
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index d37dd5bb7a8f..59f25f3a91d0 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -352,7 +352,7 @@ static enum bp_state reserve_additional_memory(void)
 	mutex_unlock(&balloon_mutex);
 	/* add_memory_resource() requires the device_hotplug lock */
 	lock_device_hotplug();
-	rc = add_memory_resource(nid, resource);
+	rc = add_memory_resource(nid, resource, false);
 	unlock_device_hotplug();
 	mutex_lock(&balloon_mutex);
 
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 119a012d43b8..c304c2f529da 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -126,6 +126,14 @@ extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
 #define MHP_MEMBLOCK_API               1<<0
 
 /*
+ * Do we want memmap (struct page array) allocated from the hotadded range.
+ * Please note that only SPARSE_VMEMMAP implements this feature and some
+ * architectures might not support it even for that memory model (e.g. s390)
+ */
+
+#define MHP_MEMMAP_FROM_RANGE          1<<1
+
+/*
  * Restrictions for the memory hotplug:
  * flags:  MHP_ flags
  * altmap: alternative allocator for memmap array
@@ -345,9 +353,9 @@ static inline void __remove_memory(int nid, u64 start, u64 size) {}
 extern void __ref free_area_init_core_hotplug(int nid);
 extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
 		void *arg, int (*func)(struct memory_block *, void *));
-extern int __add_memory(int nid, u64 start, u64 size);
-extern int add_memory(int nid, u64 start, u64 size);
-extern int add_memory_resource(int nid, struct resource *resource);
+extern int __add_memory(int nid, u64 start, u64 size, bool use_vmemmap);
+extern int add_memory(int nid, u64 start, u64 size, bool use_vmemmap);
+extern int add_memory_resource(int nid, struct resource *resource, bool use_vmemmap);
 extern int arch_add_memory(int nid, u64 start, u64 size,
 			struct mhp_restrictions *restrictions);
 extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
@@ -363,4 +371,12 @@ extern bool allow_online_pfn_range(int nid, unsigned long pfn, unsigned long nr_
 		int online_type);
 extern struct zone *zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
 		unsigned long nr_pages);
+
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+extern void mark_vmemmap_pages(struct vmem_altmap *self,
+				struct mhp_restrictions *r);
+#else
+static inline void mark_vmemmap_pages(struct vmem_altmap *self,
+					struct mhp_restrictions *r) {}
+#endif
 #endif /* __LINUX_MEMORY_HOTPLUG_H */
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index f0628660d541..cfde1c1febb7 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -16,7 +16,7 @@ struct device;
  * @alloc: track pages consumed, private to vmemmap_populate()
  */
 struct vmem_altmap {
-	const unsigned long base_pfn;
+	unsigned long base_pfn;
 	const unsigned long reserve;
 	unsigned long free;
 	unsigned long align;
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 9f8712a4b1a5..6718f6f04676 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -466,6 +466,40 @@ static __always_inline int __PageMovable(struct page *page)
 				PAGE_MAPPING_MOVABLE;
 }
 
+#define VMEMMAP_PAGE ~PAGE_MAPPING_FLAGS
+static __always_inline int PageVmemmap(struct page *page)
+{
+	return PageReserved(page) && (unsigned long)page->mapping == VMEMMAP_PAGE;
+}
+
+static __always_inline int __PageVmemmap(struct page *page)
+{
+	return (unsigned long)page->mapping == VMEMMAP_PAGE;
+}
+
+static __always_inline void __ClearPageVmemmap(struct page *page)
+{
+	__ClearPageReserved(page);
+	page->mapping = NULL;
+}
+
+static __always_inline void __SetPageVmemmap(struct page *page)
+{
+	__SetPageReserved(page);
+	page->mapping = (void *)VMEMMAP_PAGE;
+}
+
+static __always_inline struct page *vmemmap_get_head(struct page *page)
+{
+	return (struct page *)page->freelist;
+}
+
+static __always_inline unsigned long get_nr_vmemmap_pages(struct page *page)
+{
+       struct page *head = vmemmap_get_head(page);
+       return head->private - (page - head);
+}
+
 #ifdef CONFIG_KSM
 /*
  * A KSM page is one of those write-protected "shared pages" or "merged pages"
diff --git a/mm/compaction.c b/mm/compaction.c
index f171a83707ce..926b1b424de8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -850,6 +850,12 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		page = pfn_to_page(low_pfn);
 
 		/*
+		 * Vmemmap pages cannot be migrated.
+		 */
+		if (PageVmemmap(page))
+			goto isolate_fail;
+
+		/*
 		 * Check if the pageblock has already been marked skipped.
 		 * Only the aligned PFN is checked as the caller isolates
 		 * COMPACT_CLUSTER_MAX at a time so the second call must
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 836cb026ed7b..0eb60c80b8bc 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -96,6 +96,11 @@ void mem_hotplug_done(void)
 	cpus_read_unlock();
 }
 
+static inline bool can_use_vmemmap(void)
+{
+	return IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP);
+}
+
 u64 max_mem_size = U64_MAX;
 
 /* add this memory to iomem resource */
@@ -278,7 +283,16 @@ int __ref __add_pages(int nid, unsigned long phys_start_pfn,
 	unsigned long i;
 	int err = 0;
 	int start_sec, end_sec;
-	struct vmem_altmap *altmap = restrictions->altmap;
+	struct vmem_altmap *altmap;
+	struct vmem_altmap __memblk_altmap = {};
+
+	if (restrictions->flags & MHP_MEMMAP_FROM_RANGE) {
+		__memblk_altmap.base_pfn = phys_start_pfn;
+		__memblk_altmap.free = nr_pages;
+		restrictions->altmap = &__memblk_altmap;
+	}
+
+	altmap = restrictions->altmap;
 
 	/* during initialize mem_map, align hot-added range to section */
 	start_sec = pfn_to_section_nr(phys_start_pfn);
@@ -312,6 +326,10 @@ int __ref __add_pages(int nid, unsigned long phys_start_pfn,
 		cond_resched();
 	}
 	vmemmap_populate_print_last();
+
+	if (restrictions->flags & MHP_MEMMAP_FROM_RANGE)
+		mark_vmemmap_pages(altmap, restrictions);
+
 out:
 	return err;
 }
@@ -677,6 +695,13 @@ static int online_pages_blocks(unsigned long start, unsigned long nr_pages)
 	while (start < end) {
 		order = min(MAX_ORDER - 1,
 			get_order(PFN_PHYS(end) - PFN_PHYS(start)));
+		/*
+		 * Check if the pfn is aligned to its order.
+		 * If not, we decrement the order until it is,
+		 * otherwise __free_one_page will bug us.
+		 */
+		while (start & ((1 << order) - 1))
+			order--;
 		(*online_page_callback)(pfn_to_page(start), order);
 
 		onlined_pages += (1UL << order);
@@ -689,13 +714,33 @@ static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
 			void *arg)
 {
 	unsigned long onlined_pages = *(unsigned long *)arg;
+	unsigned long pfn = start_pfn;
+	unsigned long nr_vmemmap_pages = 0;
+	bool skip_online = false;
+
+	if (PageVmemmap(pfn_to_page(start_pfn))) {
+		/*
+		 * We do not want to send Vmemmap pages to the buddy allocator.
+		 * Skip them.
+		 */
+		nr_vmemmap_pages = get_nr_vmemmap_pages(pfn_to_page(start_pfn));
+		nr_vmemmap_pages = min(nr_vmemmap_pages, nr_pages);
+		pfn += nr_vmemmap_pages;
+		if (nr_vmemmap_pages == nr_pages)
+			/*
+			 * If the entire memblock contains only vmemmap pages,
+			 * we do not have pages to free to the buddy allocator.
+			 */
+			skip_online = true;
+	}
 
-	if (PageReserved(pfn_to_page(start_pfn)))
-		onlined_pages += online_pages_blocks(start_pfn, nr_pages);
+	if (!skip_online && PageReserved(pfn_to_page(pfn)))
+		onlined_pages += online_pages_blocks(pfn,
+					nr_pages - nr_vmemmap_pages);
 
 	online_mem_sections(start_pfn, start_pfn + nr_pages);
 
-	*(unsigned long *)arg = onlined_pages;
+	*(unsigned long *)arg = onlined_pages + nr_vmemmap_pages;
 	return 0;
 }
 
@@ -1094,7 +1139,7 @@ static int online_memory_block(struct memory_block *mem, void *arg)
  *
  * we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG
  */
-int __ref add_memory_resource(int nid, struct resource *res)
+int __ref add_memory_resource(int nid, struct resource *res, bool use_vmemmap)
 {
 	u64 start, size;
 	bool new_node = false;
@@ -1125,6 +1170,9 @@ int __ref add_memory_resource(int nid, struct resource *res)
 
 	/* call arch's memory hotadd */
 	restrictions.flags = MHP_MEMBLOCK_API;
+	if (can_use_vmemmap() && use_vmemmap)
+		restrictions.flags |= MHP_MEMMAP_FROM_RANGE;
+
 	ret = arch_add_memory(nid, start, size, &restrictions);
 	if (ret < 0)
 		goto error;
@@ -1166,7 +1214,7 @@ int __ref add_memory_resource(int nid, struct resource *res)
 }
 
 /* requires device_hotplug_lock, see add_memory_resource() */
-int __ref __add_memory(int nid, u64 start, u64 size)
+int __ref __add_memory(int nid, u64 start, u64 size, bool use_vmemmap)
 {
 	struct resource *res;
 	int ret;
@@ -1175,18 +1223,18 @@ int __ref __add_memory(int nid, u64 start, u64 size)
 	if (IS_ERR(res))
 		return PTR_ERR(res);
 
-	ret = add_memory_resource(nid, res);
+	ret = add_memory_resource(nid, res, use_vmemmap);
 	if (ret < 0)
 		release_memory_resource(res);
 	return ret;
 }
 
-int add_memory(int nid, u64 start, u64 size)
+int add_memory(int nid, u64 start, u64 size, bool use_vmemmap)
 {
 	int rc;
 
 	lock_device_hotplug();
-	rc = __add_memory(nid, start, size);
+	rc = __add_memory(nid, start, size, use_vmemmap);
 	unlock_device_hotplug();
 
 	return rc;
@@ -1554,15 +1602,31 @@ static int __ref __offline_pages(unsigned long start_pfn,
 {
 	unsigned long pfn, nr_pages;
 	unsigned long offlined_pages = 0;
-	int ret, node, nr_isolate_pageblock;
+	int ret, node, nr_isolate_pageblock = 0;
 	unsigned long flags;
 	unsigned long valid_start, valid_end;
 	struct zone *zone;
 	struct memory_notify arg;
 	char *reason;
+	unsigned long nr_vmemmap_pages = 0;
+	bool skip_block = false;
 
 	mem_hotplug_begin();
 
+	if (PageVmemmap(pfn_to_page(start_pfn))) {
+		nr_vmemmap_pages = get_nr_vmemmap_pages(pfn_to_page(start_pfn));
+		if (start_pfn + nr_vmemmap_pages >= end_pfn) {
+			/*
+			 * Depending on how large was the hot-added range,
+			 * an entire memblock can only contain vmemmap pages.
+			 * Should be that the case, just skip isolation and
+			 * migration.
+			 */
+			nr_vmemmap_pages = end_pfn - start_pfn;
+			skip_block = true;
+		}
+	}
+
 	/* This makes hotplug much easier...and readable.
 	   we assume this for now. .*/
 	if (!test_pages_in_a_zone(start_pfn, end_pfn, &valid_start,
@@ -1576,15 +1640,17 @@ static int __ref __offline_pages(unsigned long start_pfn,
 	node = zone_to_nid(zone);
 	nr_pages = end_pfn - start_pfn;
 
-	/* set above range as isolated */
-	ret = start_isolate_page_range(start_pfn, end_pfn,
-				       MIGRATE_MOVABLE,
-				       SKIP_HWPOISON | REPORT_FAILURE);
-	if (ret < 0) {
-		reason = "failure to isolate range";
-		goto failed_removal;
+	if (!skip_block) {
+		/* set above range as isolated */
+		ret = start_isolate_page_range(start_pfn, end_pfn,
+					       MIGRATE_MOVABLE,
+					       SKIP_HWPOISON | REPORT_FAILURE);
+		if (ret < 0) {
+			reason = "failure to isolate range";
+			goto failed_removal;
+		}
+		nr_isolate_pageblock = ret;
 	}
-	nr_isolate_pageblock = ret;
 
 	arg.start_pfn = start_pfn;
 	arg.nr_pages = nr_pages;
@@ -1597,6 +1663,9 @@ static int __ref __offline_pages(unsigned long start_pfn,
 		goto failed_removal_isolated;
 	}
 
+	if (skip_block)
+		goto no_migration;
+
 	do {
 		for (pfn = start_pfn; pfn;) {
 			if (signal_pending(current)) {
@@ -1633,6 +1702,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
 							check_pages_isolated_cb);
 	} while (ret);
 
+no_migration:
 	/* Ok, all of our target is isolated.
 	   We cannot do rollback at this point. */
 	walk_system_ram_range(start_pfn, end_pfn - start_pfn, &offlined_pages,
@@ -1649,6 +1719,12 @@ static int __ref __offline_pages(unsigned long start_pfn,
 
 	/* removal success */
 	adjust_managed_page_count(pfn_to_page(start_pfn), -offlined_pages);
+
+	/*
+	 * Vmemmap pages are not being accounted to managed_pages but to
+	 * present_pages.
+	 */
+	offlined_pages += nr_vmemmap_pages;
 	zone->present_pages -= offlined_pages;
 
 	pgdat_resize_lock(zone->zone_pgdat, &flags);
@@ -1677,7 +1753,8 @@ static int __ref __offline_pages(unsigned long start_pfn,
 	return 0;
 
 failed_removal_isolated:
-	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
+	if (!skip_block)
+		undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
 	memory_notify(MEM_CANCEL_OFFLINE, &arg);
 failed_removal:
 	pr_debug("memory offlining [mem %#010llx-%#010llx] failed due to %s\n",
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d128f53888b8..6c026690a0b2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1273,9 +1273,12 @@ static void free_one_page(struct zone *zone,
 static void __meminit __init_single_page(struct page *page, unsigned long pfn,
 				unsigned long zone, int nid)
 {
-	mm_zero_struct_page(page);
+	if (!__PageVmemmap(page)) {
+		mm_zero_struct_page(page);
+		init_page_count(page);
+	}
+
 	set_page_links(page, zone, nid, pfn);
-	init_page_count(page);
 	page_mapcount_reset(page);
 	page_cpupid_reset_last(page);
 	page_kasan_tag_reset(page);
@@ -8033,6 +8036,14 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
 
 		page = pfn_to_page(check);
 
+		if (PageVmemmap(page)) {
+			/*
+			 * Vmemmap pages are marked reserved, so skip them here.
+			 */
+			iter += get_nr_vmemmap_pages(page) - 1;
+			continue;
+                }
+
 		if (PageReserved(page))
 			goto unmovable;
 
@@ -8401,6 +8412,10 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 			continue;
 		}
 		page = pfn_to_page(pfn);
+		if (PageVmemmap(page)) {
+			pfn += get_nr_vmemmap_pages(page);
+			continue;
+		}
 		/*
 		 * The HWPoisoned page may be not in buddy system, and
 		 * page_count() is not 0.
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 019280712e1b..0081ab74c1ba 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -156,6 +156,10 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
 		page = pfn_to_online_page(pfn + i);
 		if (!page)
 			continue;
+		if (PageVmemmap(page)) {
+			i += get_nr_vmemmap_pages(page) - 1;
+			continue;
+		}
 		return page;
 	}
 	return NULL;
@@ -270,6 +274,13 @@ __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
 			continue;
 		}
 		page = pfn_to_page(pfn);
+		if (PageVmemmap(page)) {
+			/*
+			 * Vmemmap pages are not isolated. Skip them.
+			 */
+			pfn += get_nr_vmemmap_pages(page);
+			continue;
+		}
 		if (PageBuddy(page))
 			/*
 			 * If the page is on a free list, it has to be on
diff --git a/mm/sparse.c b/mm/sparse.c
index 56e057c432f9..82c7b119eb1d 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -590,6 +590,89 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
 #endif
 
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
+void mark_vmemmap_pages(struct vmem_altmap *self, struct mhp_restrictions *r)
+{
+	unsigned long pfn = self->base_pfn + self->reserve;
+	unsigned long nr_pages = self->alloc;
+	unsigned long nr_sects = self->free / PAGES_PER_SECTION;
+	unsigned long i;
+	struct page *head;
+
+	if (!nr_pages)
+		return;
+
+	/*
+	 * All allocations for the memory hotplug are the same sized so align
+	 * should be 0.
+	 */
+	WARN_ON(self->align);
+
+	/*
+	 * Mark these pages as Vmemmap pages.
+	 * Layout:
+	 * [Head->refcount] : Nr sections used by this altmap
+	 * [Head->private]  : Nr of vmemmap pages
+	 * [Tail->freelist] : Pointer to the head page
+	 */
+
+	/*
+	 * Head, first vmemmap page
+	 */
+	head = pfn_to_page(pfn);
+	for (i = 0; i < nr_pages; i++, pfn++) {
+		struct page *page = pfn_to_page(pfn);
+
+		mm_zero_struct_page(page);
+		__SetPageVmemmap(page);
+		page->freelist = head;
+		init_page_count(page);
+	}
+	set_page_count(head, (int)nr_sects);
+	set_page_private(head, nr_pages);
+}
+
+/*
+ * If the range we are trying to remove was hot-added with vmemmap pages,
+ * we need to keep track of it to know how much do we have do defer the
+ * free up. Since sections are removed sequentally in __remove_pages()->
+ * __remove_section(), we just wait until we hit the last section.
+ * Once that happens, we can trigger free_deferred_vmemmap_range to actually
+ * free the whole memory-range.
+ */
+static struct page *head_vmemmap_page;
+static bool in_vmemmap_range;
+
+static inline bool vmemmap_dec_and_test(void)
+{
+	return page_ref_dec_and_test(head_vmemmap_page);
+}
+
+static void free_deferred_vmemmap_range(unsigned long start,
+					unsigned long end)
+{
+	unsigned long nr_pages = end - start;
+	unsigned long first_section = (unsigned long)head_vmemmap_page;
+
+	while (start >= first_section) {
+		vmemmap_free(start, end, NULL);
+		end = start;
+		start -= nr_pages;
+	}
+	head_vmemmap_page = NULL;
+	in_vmemmap_range = false;
+}
+
+static void deferred_vmemmap_free(unsigned long start, unsigned long end)
+{
+	if (!in_vmemmap_range) {
+		in_vmemmap_range = true;
+		head_vmemmap_page = (struct page *)start;
+	}
+
+	if (vmemmap_dec_and_test())
+		free_deferred_vmemmap_range(start, end);
+}
+
 static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
 		struct vmem_altmap *altmap)
 {
@@ -602,6 +685,11 @@ static void __kfree_section_memmap(struct page *memmap,
 	unsigned long start = (unsigned long)memmap;
 	unsigned long end = (unsigned long)(memmap + PAGES_PER_SECTION);
 
+	if (PageVmemmap(memmap) || in_vmemmap_range) {
+		deferred_vmemmap_free(start, end);
+		return;
+	}
+
 	vmemmap_free(start, end, altmap);
 }
 #ifdef CONFIG_MEMORY_HOTREMOVE
diff --git a/mm/util.c b/mm/util.c
index d559bde497a9..5521d0a6c9e3 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -531,6 +531,8 @@ struct address_space *page_mapping(struct page *page)
 	mapping = page->mapping;
 	if ((unsigned long)mapping & PAGE_MAPPING_ANON)
 		return NULL;
+	if ((unsigned long)mapping == VMEMMAP_PAGE)
+		return NULL;
 
 	return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS);
 }
-- 
2.13.7


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 4/4] mm, sparse: rename kmalloc_section_memmap, __kfree_section_memmap
  2019-03-28 13:43 [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory Oscar Salvador
                   ` (2 preceding siblings ...)
  2019-03-28 13:43 ` [PATCH 3/4] mm, memory_hotplug: allocate memmap from the added memory range for sparse-vmemmap Oscar Salvador
@ 2019-03-28 13:43 ` Oscar Salvador
  2019-03-28 15:09 ` [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory David Hildenbrand
  2019-03-29 22:23 ` John Hubbard
  5 siblings, 0 replies; 42+ messages in thread
From: Oscar Salvador @ 2019-03-28 13:43 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, david, dan.j.williams, Jonathan.Cameron,
	anshuman.khandual, linux-kernel, linux-mm, Oscar Salvador

From: Michal Hocko <mhocko@suse.com>

The "kmalloc" part of the name is misleading.
Rename them to alloc_section_memmap/free_section_memmap, which
better reflects the functionality.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/sparse.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/sparse.c b/mm/sparse.c
index 82c7b119eb1d..63c1d0bd4755 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -673,13 +673,13 @@ static void deferred_vmemmap_free(unsigned long start, unsigned long end)
 		free_deferred_vmemmap_range(start, end);
 }
 
-static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
+static inline struct page *alloc_section_memmap(unsigned long pnum, int nid,
 		struct vmem_altmap *altmap)
 {
 	/* This will make the necessary allocations eventually. */
 	return sparse_mem_map_populate(pnum, nid, altmap);
 }
-static void __kfree_section_memmap(struct page *memmap,
+static void free_section_memmap(struct page *memmap,
 		struct vmem_altmap *altmap)
 {
 	unsigned long start = (unsigned long)memmap;
@@ -723,13 +723,13 @@ static struct page *__kmalloc_section_memmap(void)
 	return ret;
 }
 
-static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
+static inline struct page *alloc_section_memmap(unsigned long pnum, int nid,
 		struct vmem_altmap *altmap)
 {
 	return __kmalloc_section_memmap();
 }
 
-static void __kfree_section_memmap(struct page *memmap,
+static void free_section_memmap(struct page *memmap,
 		struct vmem_altmap *altmap)
 {
 	if (is_vmalloc_addr(memmap))
@@ -794,12 +794,12 @@ int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
 	if (ret < 0 && ret != -EEXIST)
 		return ret;
 	ret = 0;
-	memmap = kmalloc_section_memmap(section_nr, nid, altmap);
+	memmap = alloc_section_memmap(section_nr, nid, altmap);
 	if (!memmap)
 		return -ENOMEM;
 	usemap = __kmalloc_section_usemap();
 	if (!usemap) {
-		__kfree_section_memmap(memmap, altmap);
+		free_section_memmap(memmap, altmap);
 		return -ENOMEM;
 	}
 
@@ -821,7 +821,7 @@ int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
 out:
 	if (ret < 0) {
 		kfree(usemap);
-		__kfree_section_memmap(memmap, altmap);
+		free_section_memmap(memmap, altmap);
 	}
 	return ret;
 }
@@ -872,7 +872,7 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap,
 	if (PageSlab(usemap_page) || PageCompound(usemap_page)) {
 		kfree(usemap);
 		if (memmap)
-			__kfree_section_memmap(memmap, altmap);
+			free_section_memmap(memmap, altmap);
 		return;
 	}
 
-- 
2.13.7


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
  2019-03-28 13:43 [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory Oscar Salvador
                   ` (3 preceding siblings ...)
  2019-03-28 13:43 ` [PATCH 4/4] mm, sparse: rename kmalloc_section_memmap, __kfree_section_memmap Oscar Salvador
@ 2019-03-28 15:09 ` David Hildenbrand
  2019-03-28 15:31   ` David Hildenbrand
  2019-03-29  8:30   ` Oscar Salvador
  2019-03-29 22:23 ` John Hubbard
  5 siblings, 2 replies; 42+ messages in thread
From: David Hildenbrand @ 2019-03-28 15:09 UTC (permalink / raw)
  To: Oscar Salvador, akpm
  Cc: mhocko, dan.j.williams, Jonathan.Cameron, anshuman.khandual,
	linux-kernel, linux-mm

On 28.03.19 14:43, Oscar Salvador wrote:
> Hi,
> 
> since last two RFCs were almost unnoticed (thanks David for the feedback),
> I decided to re-work some parts to make it more simple and give it a more
> testing, and drop the RFC, to see if it gets more attention.
> I also added David's feedback, so now all users of add_memory/__add_memory/
> add_memory_resource can specify whether they want to use this feature or not.

Terrific, I will also definitely try to make use of that in the next
virtio-mem prototype (looks like I'll finally have time to look into it
again).

> I also fixed some compilation issues when CONFIG_SPARSEMEM_VMEMMAP is not set.
> 
> [Testing]
> 
> Testing has been carried out on the following platforms:
> 
>  - x86_64 (small and big memblocks)
>  - powerpc
>  - arm64 (Huawei's fellows)
> 
> I plan to test it on Xen and Hyper-V, but for now those two will not be
> using this feature, and neither DAX/pmem.

I think doing it step by step is the right approach. Less likely to
break stuff.

> 
> Of course, if this does not find any strong objection, my next step is to
> work on enabling this on Xen/Hyper-V.
> 
> [Coverletter]
> 
> This is another step to make the memory hotplug more usable. The primary
> goal of this patchset is to reduce memory overhead of the hot added
> memory (at least for SPARSE_VMEMMAP memory model). The current way we use
> to populate memmap (struct page array) has two main drawbacks:
> 
> a) it consumes an additional memory until the hotadded memory itself is
>    onlined and
> b) memmap might end up on a different numa node which is especially true
>    for movable_node configuration.
> 
> a) is problem especially for memory hotplug based memory "ballooning"
>    solutions when the delay between physical memory hotplug and the
>    onlining can lead to OOM and that led to introduction of hacks like auto
>    onlining (see 31bc3858ea3e ("memory-hotplug: add automatic onlining
>    policy for the newly added memory")).
> 
> b) can have performance drawbacks.
> 
> I have also seen hot-add operations failing on archs because they
> were running out of order-x pages.
> E.g On powerpc, in certain configurations, we use order-8 pages,
> and given 64KB base pagesize, that is 16MB.
> If we run out of those, we just fail the operation and we cannot add
> more memory.
> We could fallback to base pages as x86_64 does, but we can do better.
> 
> One way to mitigate all these issues is to simply allocate memmap array
> (which is the largest memory footprint of the physical memory hotplug)
> from the hotadded memory itself. VMEMMAP memory model allows us to map
> any pfn range so the memory doesn't need to be online to be usable
> for the array. See patch 3 for more details. In short I am reusing an
> existing vmem_altmap which wants to achieve the same thing for nvdim
> device memory.
> 
> There is also one potential drawback, though. If somebody uses memory
> hotplug for 1G (gigantic) hugetlb pages then this scheme will not work
> for them obviously because each memory block will contain reserved
> area. Large x86 machines will use 2G memblocks so at least one 1G page
> will be available but this is still not 2G...
> 
> If that is a problem, we can always configure a fallback strategy to
> use the current scheme.
> 
> Since this only works when CONFIG_VMEMMAP_ENABLED is set,
> we do check for it before setting the flag that allows use
> to use the feature, no matter if the user wanted it.
> 
> [Overall design]:
> 
> Let us say we hot-add 2GB of memory on a x86_64 (memblock size = 128M).
> That is:
> 
>  - 16 sections
>  - 524288 pages
>  - 8192 vmemmap pages (out of those 524288. We spend 512 pages for each section)
> 
>  The range of pages is: 0xffffea0004000000 - 0xffffea0006000000
>  The vmemmap range is:  0xffffea0004000000 - 0xffffea0004080000
> 
>  0xffffea0004000000 is the head vmemmap page (first page), while all the others
>  are "tails".
> 
>  We keep the following information in it:
> 
>  - Head page:
>    - head->_refcount: number of sections
>    - head->private :  number of vmemmap pages
>  - Tail page:
>    - tail->freelist : pointer to the head
> 
> This is done because it eases the work in cases where we have to compute the
> number of vmemmap pages to know how much do we have to skip etc, and to keep
> the right accounting to present_pages.
> 
> When we want to hot-remove the range, we need to be careful because the first
> pages of that range, are used for the memmap maping, so if we remove those
> first, we would blow up while accessing the others later on.
> For that reason we keep the number of sections in head->_refcount, to know how
> much do we have to defer the free up.
> 
> Since in a hot-remove operation, sections are being removed sequentially, the
> approach taken here is that every time we hit free_section_memmap(), we decrease
> the refcount of the head.
> When it reaches 0, we know that we hit the last section, so we call
> vmemmap_free() for the whole memory-range in backwards, so we make sure that
> the pages used for the mapping will be latest to be freed up.
> 
> Vmemmap pages are charged to spanned/present_paged, but not to manages_pages.
> 

I guess one important thing to mention is that it is no longer possible
to remove memory in a different granularity than it was added. I vaguely
remember that ACPI code sometimes "reuses" parts of already added
memory. We would have to validate that this can indeed not be an issue.

drivers/acpi/acpi_memhotplug.c:

result = __add_memory(node, info->start_addr, info->length);
if (result && result != -EEXIST)
	continue;

What would happen when removing this dimm (->remove_memory())


Also have a look at

arch/powerpc/platforms/powernv/memtrace.c

I consider it evil code. It will simply try to offline+unplug *some*
memory it finds in *some granularity*. Not sure if this might be
problematic.

Would there be any "safety net" for adding/removing memory in different
granularities?

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
  2019-03-28 15:09 ` [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory David Hildenbrand
@ 2019-03-28 15:31   ` David Hildenbrand
  2019-03-29  8:45     ` Oscar Salvador
  2019-03-29  8:30   ` Oscar Salvador
  1 sibling, 1 reply; 42+ messages in thread
From: David Hildenbrand @ 2019-03-28 15:31 UTC (permalink / raw)
  To: Oscar Salvador, akpm
  Cc: mhocko, dan.j.williams, Jonathan.Cameron, anshuman.khandual,
	linux-kernel, linux-mm

On 28.03.19 16:09, David Hildenbrand wrote:
> On 28.03.19 14:43, Oscar Salvador wrote:
>> Hi,
>>
>> since last two RFCs were almost unnoticed (thanks David for the feedback),
>> I decided to re-work some parts to make it more simple and give it a more
>> testing, and drop the RFC, to see if it gets more attention.
>> I also added David's feedback, so now all users of add_memory/__add_memory/
>> add_memory_resource can specify whether they want to use this feature or not.
> 
> Terrific, I will also definetly try to make use of that in the next
> virito-mem prototype (looks like I'll finally have time to look into it
> again).
> 
>> I also fixed some compilation issues when CONFIG_SPARSEMEM_VMEMMAP is not set.
>>
>> [Testing]
>>
>> Testing has been carried out on the following platforms:
>>
>>  - x86_64 (small and big memblocks)
>>  - powerpc
>>  - arm64 (Huawei's fellows)
>>
>> I plan to test it on Xen and Hyper-V, but for now those two will not be
>> using this feature, and neither DAX/pmem.
> 
> I think doing it step by step is the right approach. Less likely to
> break stuff.
> 
>>
>> Of course, if this does not find any strong objection, my next step is to
>> work on enabling this on Xen/Hyper-V.
>>
>> [Coverletter]
>>
>> This is another step to make the memory hotplug more usable. The primary
>> goal of this patchset is to reduce memory overhead of the hot added
>> memory (at least for SPARSE_VMEMMAP memory model). The current way we use
>> to populate memmap (struct page array) has two main drawbacks:
>>
>> a) it consumes an additional memory until the hotadded memory itself is
>>    onlined and
>> b) memmap might end up on a different numa node which is especially true
>>    for movable_node configuration.
>>
>> a) is problem especially for memory hotplug based memory "ballooning"
>>    solutions when the delay between physical memory hotplug and the
>>    onlining can lead to OOM and that led to introduction of hacks like auto
>>    onlining (see 31bc3858ea3e ("memory-hotplug: add automatic onlining
>>    policy for the newly added memory")).
>>
>> b) can have performance drawbacks.
>>
>> I have also seen hot-add operations failing on archs because they
>> were running out of order-x pages.
>> E.g On powerpc, in certain configurations, we use order-8 pages,
>> and given 64KB base pagesize, that is 16MB.
>> If we run out of those, we just fail the operation and we cannot add
>> more memory.
>> We could fallback to base pages as x86_64 does, but we can do better.
>>
>> One way to mitigate all these issues is to simply allocate memmap array
>> (which is the largest memory footprint of the physical memory hotplug)
>> from the hotadded memory itself. VMEMMAP memory model allows us to map
>> any pfn range so the memory doesn't need to be online to be usable
>> for the array. See patch 3 for more details. In short I am reusing an
>> existing vmem_altmap which wants to achieve the same thing for nvdim
>> device memory.
>>
>> There is also one potential drawback, though. If somebody uses memory
>> hotplug for 1G (gigantic) hugetlb pages then this scheme will not work
>> for them obviously because each memory block will contain reserved
>> area. Large x86 machines will use 2G memblocks so at least one 1G page
>> will be available but this is still not 2G...
>>
>> If that is a problem, we can always configure a fallback strategy to
>> use the current scheme.
>>
>> Since this only works when CONFIG_VMEMMAP_ENABLED is set,
>> we do check for it before setting the flag that allows use
>> to use the feature, no matter if the user wanted it.
>>
>> [Overall design]:
>>
>> Let us say we hot-add 2GB of memory on a x86_64 (memblock size = 128M).
>> That is:
>>
>>  - 16 sections
>>  - 524288 pages
>>  - 8192 vmemmap pages (out of those 524288. We spend 512 pages for each section)
>>
>>  The range of pages is: 0xffffea0004000000 - 0xffffea0006000000
>>  The vmemmap range is:  0xffffea0004000000 - 0xffffea0004080000
>>
>>  0xffffea0004000000 is the head vmemmap page (first page), while all the others
>>  are "tails".
>>
>>  We keep the following information in it:
>>
>>  - Head page:
>>    - head->_refcount: number of sections
>>    - head->private :  number of vmemmap pages
>>  - Tail page:
>>    - tail->freelist : pointer to the head
>>
>> This is done because it eases the work in cases where we have to compute the
>> number of vmemmap pages to know how much do we have to skip etc, and to keep
>> the right accounting to present_pages.
>>
>> When we want to hot-remove the range, we need to be careful because the first
>> pages of that range, are used for the memmap maping, so if we remove those
>> first, we would blow up while accessing the others later on.
>> For that reason we keep the number of sections in head->_refcount, to know how
>> much do we have to defer the free up.
>>
>> Since in a hot-remove operation, sections are being removed sequentially, the
>> approach taken here is that every time we hit free_section_memmap(), we decrease
>> the refcount of the head.
>> When it reaches 0, we know that we hit the last section, so we call
>> vmemmap_free() for the whole memory-range in backwards, so we make sure that
>> the pages used for the mapping will be latest to be freed up.
>>
>> Vmemmap pages are charged to spanned/present_paged, but not to manages_pages.
>>
> 
> I guess one important thing to mention is that it is no longer possible
> to remove memory in a different granularity it was added. I slightly
> remember that ACPI code sometimes "reuses" parts of already added
> memory. We would have to validate that this can indeed not be an issue.
> 
> drivers/acpi/acpi_memhotplug.c:
> 
> result = __add_memory(node, info->start_addr, info->length);
> if (result && result != -EEXIST)
> 	continue;
> 
> What would happen when removing this dimm (->remove_memory())
> 
> 
> Also have a look at
> 
> arch/powerpc/platforms/powernv/memtrace.c
> 
> I consider it evil code. It will simply try to offline+unplug *some*
> memory it finds in *some granularity*. Not sure if this might be
> problematic-
> 
> Would there be any "safety net" for adding/removing memory in different
> granularities?
> 

Correct me if I am wrong. I think I was confused - vmemmap data is still
allocated *per memory block*, not for the whole added memory, correct?

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
  2019-03-28 15:09 ` [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory David Hildenbrand
  2019-03-28 15:31   ` David Hildenbrand
@ 2019-03-29  8:30   ` Oscar Salvador
  2019-03-29  8:51     ` David Hildenbrand
  1 sibling, 1 reply; 42+ messages in thread
From: Oscar Salvador @ 2019-03-29  8:30 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, mhocko, dan.j.williams, Jonathan.Cameron,
	anshuman.khandual, linux-kernel, linux-mm

On Thu, Mar 28, 2019 at 04:09:06PM +0100, David Hildenbrand wrote:
> On 28.03.19 14:43, Oscar Salvador wrote:
> > Hi,
> > 
> > since last two RFCs were almost unnoticed (thanks David for the feedback),
> > I decided to re-work some parts to make it more simple and give it a more
> > testing, and drop the RFC, to see if it gets more attention.
> > I also added David's feedback, so now all users of add_memory/__add_memory/
> > add_memory_resource can specify whether they want to use this feature or not.
> 
> Terrific, I will also definetly try to make use of that in the next
> virito-mem prototype (looks like I'll finally have time to look into it
> again).

Great, I would like to see how this works there :-).

> I guess one important thing to mention is that it is no longer possible
> to remove memory in a different granularity it was added. I slightly
> remember that ACPI code sometimes "reuses" parts of already added
> memory. We would have to validate that this can indeed not be an issue.
> 
> drivers/acpi/acpi_memhotplug.c:
> 
> result = __add_memory(node, info->start_addr, info->length);
> if (result && result != -EEXIST)
> 	continue;
> 
> What would happen when removing this dimm (->remove_memory())

Yeah, I see the point.
Well, we are safe here because the vmemmap data is being allocated in
every call to __add_memory/add_memory/add_memory_resource.

E.g:

* With a memblock granularity of 128M

# object_add memory-backend-ram,id=ram0,size=256M
# device_add pc-dimm,id=dimm0,memdev=ram0,node=1

I am not sure how ACPI gets to split the DIMM in memory resources
(aka mem_device->res_list), but it does not really matter.
For each mem_device->res_list item, we will make a call to __add_memory,
which will allocate the vmemmap data for __that__ item, we do not care
about the others.

And when removing the DIMM, acpi_memory_remove_memory will make a call to
__remove_memory() for each mem_device->res_list item, and that will take
care of freeing up the vmemmap data.

Now, in all my tests, ACPI always considered a DIMM a single memory resource,
but maybe under different circumstances it gets to split it into different mem
resources.
But it does not really matter, as vmemmap data is being created and isolated in
every call to __add_memory.
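
To illustrate the symmetry (a simplified sketch only, not verbatim kernel
code; it condenses the drivers/acpi/acpi_memhotplug.c flow referenced above,
with declarations, error handling and locking elided), both paths walk the
very same mem_device->res_list, so the vmemmap allocated by one __add_memory()
call is torn down by the matching __remove_memory() call:

static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
{
	struct acpi_memory_info *info;

	list_for_each_entry(info, &mem_device->res_list, list) {
		/* vmemmap for __this__ resource is carved out here */
		result = __add_memory(node, info->start_addr, info->length);
		...
	}
	...
}

static void acpi_memory_remove_memory(struct acpi_memory_device *mem_device)
{
	struct acpi_memory_info *info, *n;

	list_for_each_entry_safe(info, n, &mem_device->res_list, list) {
		/* ... and given back here, for the very same range */
		__remove_memory(nid, info->start_addr, info->length);
		...
	}
}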

> Also have a look at
> 
> arch/powerpc/platforms/powernv/memtrace.c
> 
> I consider it evil code. It will simply try to offline+unplug *some*
> memory it finds in *some granularity*. Not sure if this might be
> problematic-

Heh, memtrace from powerpc ^^. I saw some oddities coming from there, though
not with my code, because I did not get to test that specifically.
But I am interested to see if it can trigger something, so I will be testing
that in the next days.

> Would there be any "safety net" for adding/removing memory in different
> granularities?

Uhm, I do not think we need it, or at least I cannot think of a case where this
could cause trouble with the current design.
Can you think of any? 

Thanks David ;-)

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
  2019-03-28 15:31   ` David Hildenbrand
@ 2019-03-29  8:45     ` Oscar Salvador
  2019-03-29  8:56       ` David Hildenbrand
  2019-03-29 13:42       ` Michal Hocko
  0 siblings, 2 replies; 42+ messages in thread
From: Oscar Salvador @ 2019-03-29  8:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, mhocko, dan.j.williams, Jonathan.Cameron,
	anshuman.khandual, linux-kernel, linux-mm

On Thu, Mar 28, 2019 at 04:31:44PM +0100, David Hildenbrand wrote:
> Correct me if I am wrong. I think I was confused - vmemmap data is still
> allocated *per memory block*, not for the whole added memory, correct?

No, vmemmap data is allocated per memory-resource added.
In the case of a DIMM, that would be the DIMM; in the case of a qemu
memory-device, that would be that memory-device.
That is assuming that ACPI does not split the DIMM/memory-device into several
memory resources.
If that happens, then acpi_memory_enable_device() calls __add_memory for every
memory-resource, which means that the vmemmap data will be allocated per
memory-resource.
I did not see this happening though, and I am not sure under which circumstances
it can happen (I have to study the ACPI code a bit more).

The problem with allocating vmemmap data per memblock, is the fragmentation.
Let us say you do the following:

* memblock granularity 128M

(qemu) object_add memory-backend-ram,id=ram0,size=256M
(qemu) device_add pc-dimm,id=dimm0,memdev=ram0,node=1

This will create two memblocks (2 sections), and if we allocate the vmemmap
data for each section within its own section (memblock), you only get
126M of contiguous memory.

So, the approach taken is to allocate the vmemmap data corresponding to the
whole DIMM/memory-device/memory-resource from the beginning of its memory.

In the example from above, the vmemmap data for both sections is allocated from
the beginning of the first section:

memmap array takes 2MB per section, so 512 pfns.
If we add 2 sections:

[  pfn#0  ]  \
[  ...    ]  |  vmemmap used for memmap array
[pfn#1023 ]  /  

[pfn#1024 ]  \
[  ...    ]  |  used as normal memory
[pfn#65536]  /

So, out of 256M, we get 252M to use as a real memory, as 4M will be used for
building the memmap array.

Actually, it can happen that depending on how big a DIMM/memory-device is,
the first memblock(s) are fully used for the memmap array (of course, this
can only be seen when adding a huge DIMM/memory-device).
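
To spell out the arithmetic behind those numbers (assuming 4K base pages and
a 64-byte struct page, as in the x86_64 example above):

  128M section / 4K page size             = 32768 pfns per section
  32768 pfns   * 64 bytes per struct page = 2M of memmap per section (512 pfns)
  2 sections   * 2M                       = 4M of vmemmap
  256M - 4M                               = 252M left as usable memory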

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
  2019-03-29  8:30   ` Oscar Salvador
@ 2019-03-29  8:51     ` David Hildenbrand
  0 siblings, 0 replies; 42+ messages in thread
From: David Hildenbrand @ 2019-03-29  8:51 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: akpm, mhocko, dan.j.williams, Jonathan.Cameron,
	anshuman.khandual, linux-kernel, linux-mm


> Great, I would like to see how this works there :-).
> 
>> I guess one important thing to mention is that it is no longer possible
>> to remove memory in a different granularity it was added. I slightly
>> remember that ACPI code sometimes "reuses" parts of already added
>> memory. We would have to validate that this can indeed not be an issue.
>>
>> drivers/acpi/acpi_memhotplug.c:
>>
>> result = __add_memory(node, info->start_addr, info->length);
>> if (result && result != -EEXIST)
>> 	continue;
>>
>> What would happen when removing this dimm (->remove_memory())
> 
> Yeah, I see the point.
> Well, we are safe here because the vmemmap data is being allocated in
> every call to __add_memory/add_memory/add_memory_resource.
> 
> E.g:
> 
> * Being memblock granularity 128M
> 
> # object_add memory-backend-ram,id=ram0,size=256M
> # device_add pc-dimm,id=dimm0,memdev=ram0,node=1

So, this should result in one __add_memory() call with 256MB, creating
two memory block devices (128MB). I *assume* (haven't looked at the
details yet, sorry) that you will allocate vmemmap for (and on!) each of
these two 128MB sections/memblocks, correct?

> 
> I am not sure how ACPI gets to split the DIMM in memory resources
> (aka mem_device->res_list), but it does not really matter.
> For each mem_device->res_list item, we will make a call to __add_memory,
> which will allocate the vmemmap data for __that__ item, we do not care
> about the others.
> 
> And when removing the DIMM, acpi_memory_remove_memory will make a call to
> __remove_memory() for each mem_device->res_list item, and that will take
> care of free up the vmemmap data.

Ah okay, that makes sense.

> 
> Now, with all my tests, ACPI always considered a DIMM a single memory resource,
> but maybe under different circumstances it gets to split it in different mem
> resources.
> But it does not really matter, as vmemmap data is being created and isolated in
> every call to __add_memory.

Yes, as long as the calls to add_memory match remove_memory, we are
totally fine. I am wondering if that could not be the case. A simplified
example:

A DIMM overlaps with some other system ram, as detected and added during
boot. When detecting the dimm, __add_memory() returns -EEXIST.

Now, when unplugging the dimm, we call remove_memory(), but only remove
the DIMM part. I wonder how/if something like that can happen and how
the system would react.

I guess I'll have to do some more ACPI code reading to find out how this
-EEXIST case can come to life.

> 
>> Also have a look at
>>
>> arch/powerpc/platforms/powernv/memtrace.c
>>
>> I consider it evil code. It will simply try to offline+unplug *some*
>> memory it finds in *some granularity*. Not sure if this might be
>> problematic-
> 
> Heh, memtrace from powerpc ^^, I saw some oddities coming from there, but
> with my code though because I did not get to test that in concret.
> But I am interested to see if it can trigger something, so I will be testing
> that the next days.
> 
>> Would there be any "safety net" for adding/removing memory in different
>> granularities?
> 
> Uhm, I do not think we need it, or at least I cannot think of a case where this
> could cause trouble with the current design.
> Can you think of any? 

Nope, as long as it works (especially no change to what we had before),
no safety net needed :)


I was just curious if add_memory() followed by remove_memory() used to
work before and if your patches might change that behavior.

Thanks! Will try to look into the details soon!

> 
> Thanks David ;-)
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
  2019-03-29  8:45     ` Oscar Salvador
@ 2019-03-29  8:56       ` David Hildenbrand
  2019-03-29  9:01         ` David Hildenbrand
  2019-03-29  9:20         ` Oscar Salvador
  2019-03-29 13:42       ` Michal Hocko
  1 sibling, 2 replies; 42+ messages in thread
From: David Hildenbrand @ 2019-03-29  8:56 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: akpm, mhocko, dan.j.williams, Jonathan.Cameron,
	anshuman.khandual, linux-kernel, linux-mm

On 29.03.19 09:45, Oscar Salvador wrote:
> On Thu, Mar 28, 2019 at 04:31:44PM +0100, David Hildenbrand wrote:
>> Correct me if I am wrong. I think I was confused - vmemmap data is still
>> allocated *per memory block*, not for the whole added memory, correct?
> 
> No, vmemap data is allocated per memory-resource added.
> In case a DIMM, would be a DIMM, in case a qemu memory-device, would be that
> memory-device.
> That is counting that ACPI does not split the DIMM/memory-device in several memory
> resources.
> If that happens, then acpi_memory_enable_device() calls __add_memory for every
> memory-resource, which means that the vmemmap data will be allocated per
> memory-resource.
> I did not see this happening though, and I am not sure under which circumstances
> can happen (I have to study the ACPI code a bit more).
> 
> The problem with allocating vmemmap data per memblock, is the fragmentation.
> Let us say you do the following:
> 
> * memblock granularity 128M
> 
> (qemu) object_add memory-backend-ram,id=ram0,size=256M
> (qemu) device_add pc-dimm,id=dimm0,memdev=ram0,node=1
> 
> This will create two memblocks (2 sections), and if we allocate the vmemmap
> data for each corresponding section within it section(memblock), you only get
> 126M contiguous memory.

Oh okay, so actually the way I guessed it would be now.

While this makes total sense, I'll have to look at how it is currently
handled, meaning if there is a change. I somewhat remember that
deferred struct page initialization would initialize the vmemmap per section,
not per memory resource.

But as I work on 10 things differently, my mind sometimes seems to
forget stuff in order to replace it with random nonsense. Will look into
the details to not have to ask too many dumb questions.

> 
> So, the taken approach is to allocate the vmemmap data corresponging to the
> whole DIMM/memory-device/memory-resource from the beginning of its memory.
> 
> In the example from above, the vmemmap data for both sections is allocated from
> the beginning of the first section:
> 
> memmap array takes 2MB per section, so 512 pfns.
> If we add 2 sections:
> 
> [  pfn#0  ]  \
> [  ...    ]  |  vmemmap used for memmap array
> [pfn#1023 ]  /  
> 
> [pfn#1024 ]  \
> [  ...    ]  |  used as normal memory
> [pfn#65536]  /
> 
> So, out of 256M, we get 252M to use as a real memory, as 4M will be used for
> building the memmap array.
> 
> Actually, it can happen that depending on how big a DIMM/memory-device is,
> the first/s memblock is fully used for the memmap array (of course, this
> can only be seen when adding a huge DIMM/memory-device).
> 

Just stating here that with your code, add_memory() and remove_memory()
always have to be called with the same granularity. Will have to see if
that implies a change.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
  2019-03-29  8:56       ` David Hildenbrand
@ 2019-03-29  9:01         ` David Hildenbrand
  2019-03-29  9:20         ` Oscar Salvador
  1 sibling, 0 replies; 42+ messages in thread
From: David Hildenbrand @ 2019-03-29  9:01 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: akpm, mhocko, dan.j.williams, Jonathan.Cameron,
	anshuman.khandual, linux-kernel, linux-mm

On 29.03.19 09:56, David Hildenbrand wrote:
> On 29.03.19 09:45, Oscar Salvador wrote:
>> On Thu, Mar 28, 2019 at 04:31:44PM +0100, David Hildenbrand wrote:
>>> Correct me if I am wrong. I think I was confused - vmemmap data is still
>>> allocated *per memory block*, not for the whole added memory, correct?
>>
>> No, vmemap data is allocated per memory-resource added.
>> In case a DIMM, would be a DIMM, in case a qemu memory-device, would be that
>> memory-device.
>> That is counting that ACPI does not split the DIMM/memory-device in several memory
>> resources.
>> If that happens, then acpi_memory_enable_device() calls __add_memory for every
>> memory-resource, which means that the vmemmap data will be allocated per
>> memory-resource.
>> I did not see this happening though, and I am not sure under which circumstances
>> can happen (I have to study the ACPI code a bit more).
>>
>> The problem with allocating vmemmap data per memblock, is the fragmentation.
>> Let us say you do the following:
>>
>> * memblock granularity 128M
>>
>> (qemu) object_add memory-backend-ram,id=ram0,size=256M
>> (qemu) device_add pc-dimm,id=dimm0,memdev=ram0,node=1
>>
>> This will create two memblocks (2 sections), and if we allocate the vmemmap
>> data for each corresponding section within it section(memblock), you only get
>> 126M contiguous memory.
> 
> Oh okay, so actually the way I guessed it would be now.
> 
> While this makes totally sense, I'll have to look how it is currently
> handled, meaning if there is a change. I somewhat remembering that
> delayed struct pages initialization would initialize vmmap per section,
> not per memory resource.
> 
> But as I work on 10 things differently, my mind sometimes seems to
> forget stuff in order to replace it with random nonsense. Will look into
> the details to not have to ask too many dumb questions.

s/differently/concurrently/

See, nonsense ;)

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
  2019-03-29  8:56       ` David Hildenbrand
  2019-03-29  9:01         ` David Hildenbrand
@ 2019-03-29  9:20         ` Oscar Salvador
  1 sibling, 0 replies; 42+ messages in thread
From: Oscar Salvador @ 2019-03-29  9:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, mhocko, dan.j.williams, Jonathan.Cameron,
	anshuman.khandual, linux-kernel, linux-mm

On Fri, Mar 29, 2019 at 09:56:37AM +0100, David Hildenbrand wrote:
> Oh okay, so actually the way I guessed it would be now.
> 
> While this makes totally sense, I'll have to look how it is currently
> handled, meaning if there is a change. I somewhat remembering that
> delayed struct pages initialization would initialize vmmap per section,
> not per memory resource.

Uhm, the memmap array for each section is built early during boot.
We actually do not care about deferred struct page initialization there.
What we do is:

- We go through all memblock regions marked as memory
- We mark the sections within those regions present
- We initialize those sections and build the corresponding memmap array

The thing is that sparse_init_nid() allocates/reserves a buffer big enough
to hold the memmap array for all those sections, and for each memmap
array we need to allocate, we consume it from that buffer, using contiguous
memory (a rough sketch of that flow follows the function list below).

Have a look at:

- sparse_memory_present_with_active_regions()
- sparse_init()
- sparse_init_nid
- sparse_buffer_init
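
Rough sketch of that boot-time flow (simplified; names as in ~v5.1
mm/sparse.c, with arguments and details elided):

	sparse_memory_present_with_active_regions()
		-> marks every section backing boot memory as present

	sparse_init()
		-> for each node: sparse_init_nid()
			-> sparse_buffer_init()   /* one big contiguous buffer
						     for the whole node */
			-> for each present section: populate its memmap
			   from that buffer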

> But as I work on 10 things differently, my mind sometimes seems to
> forget stuff in order to replace it with random nonsense. Will look into
> the details to not have to ask too many dumb questions.
> 
> > 
> > So, the taken approach is to allocate the vmemmap data corresponging to the
> > whole DIMM/memory-device/memory-resource from the beginning of its memory.
> > 
> > In the example from above, the vmemmap data for both sections is allocated from
> > the beginning of the first section:
> > 
> > memmap array takes 2MB per section, so 512 pfns.
> > If we add 2 sections:
> > 
> > [  pfn#0  ]  \
> > [  ...    ]  |  vmemmap used for memmap array
> > [pfn#1023 ]  /  
> > 
> > [pfn#1024 ]  \
> > [  ...    ]  |  used as normal memory
> > [pfn#65536]  /
> > 
> > So, out of 256M, we get 252M to use as a real memory, as 4M will be used for
> > building the memmap array.
> > 
> > Actually, it can happen that depending on how big a DIMM/memory-device is,
> > the first/s memblock is fully used for the memmap array (of course, this
> > can only be seen when adding a huge DIMM/memory-device).
> > 
> 
> Just stating here, that with your code, add_memory() and remove_memory()
> always have to be called in the same granularity. Will have to see if
> that implies a change.

Well, I only tested it in such a scenario, yes, but I think that the ACPI code
enforces that somehow.
I will take a closer look though.

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
  2019-03-29  8:45     ` Oscar Salvador
  2019-03-29  8:56       ` David Hildenbrand
@ 2019-03-29 13:42       ` Michal Hocko
  2019-04-01  7:59         ` Oscar Salvador
  1 sibling, 1 reply; 42+ messages in thread
From: Michal Hocko @ 2019-03-29 13:42 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: David Hildenbrand, akpm, dan.j.williams, Jonathan.Cameron,
	anshuman.khandual, linux-kernel, linux-mm

On Fri 29-03-19 09:45:47, Oscar Salvador wrote:
[...]
> * memblock granularity 128M
> 
> (qemu) object_add memory-backend-ram,id=ram0,size=256M
> (qemu) device_add pc-dimm,id=dimm0,memdev=ram0,node=1
> 
> This will create two memblocks (2 sections), and if we allocate the vmemmap
> data for each corresponding section within it section(memblock), you only get
> 126M contiguous memory.
> 
> So, the taken approach is to allocate the vmemmap data corresponging to the
> whole DIMM/memory-device/memory-resource from the beginning of its memory.
> 
> In the example from above, the vmemmap data for both sections is allocated from
> the beginning of the first section:
> 
> memmap array takes 2MB per section, so 512 pfns.
> If we add 2 sections:
> 
> [  pfn#0  ]  \
> [  ...    ]  |  vmemmap used for memmap array
> [pfn#1023 ]  /  
> 
> [pfn#1024 ]  \
> [  ...    ]  |  used as normal memory
> [pfn#65536]  /
> 
> So, out of 256M, we get 252M to use as a real memory, as 4M will be used for
> building the memmap array.

Having a larger contiguous area is definitely nice to have but you also
have to consider the other side of the thing. If we have a movable
memblock with unmovable memory then we are breaking the movable
property. So there should be some flexibility for caller to tell whether
to allocate on per device or per memblock. Or we need something to move
memmaps during the hotremove.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
  2019-03-28 13:43 [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory Oscar Salvador
                   ` (4 preceding siblings ...)
  2019-03-28 15:09 ` [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory David Hildenbrand
@ 2019-03-29 22:23 ` John Hubbard
  2019-04-01  7:52   ` Oscar Salvador
  5 siblings, 1 reply; 42+ messages in thread
From: John Hubbard @ 2019-03-29 22:23 UTC (permalink / raw)
  To: Oscar Salvador, akpm
  Cc: mhocko, david, dan.j.williams, Jonathan.Cameron,
	anshuman.khandual, linux-kernel, linux-mm

On 3/28/19 6:43 AM, Oscar Salvador wrote:
> Hi,
> 
> since last two RFCs were almost unnoticed (thanks David for the feedback),
> I decided to re-work some parts to make it more simple and give it a more
> testing, and drop the RFC, to see if it gets more attention.
> I also added David's feedback, so now all users of add_memory/__add_memory/
> add_memory_resource can specify whether they want to use this feature or not.
> I also fixed some compilation issues when CONFIG_SPARSEMEM_VMEMMAP is not set.
> 

Hi Oscar, say, what tree and/or commit does this series apply to? I'm having some
trouble finding the right place. Sorry for the easy question, I did try quite
a few trees already...

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
  2019-03-29 22:23 ` John Hubbard
@ 2019-04-01  7:52   ` Oscar Salvador
  0 siblings, 0 replies; 42+ messages in thread
From: Oscar Salvador @ 2019-04-01  7:52 UTC (permalink / raw)
  To: John Hubbard
  Cc: akpm, mhocko, david, dan.j.williams, Jonathan.Cameron,
	anshuman.khandual, linux-kernel, linux-mm

On Fri, Mar 29, 2019 at 03:23:00PM -0700, John Hubbard wrote:
> On 3/28/19 6:43 AM, Oscar Salvador wrote:
> > Hi,
> > 
> > since last two RFCs were almost unnoticed (thanks David for the feedback),
> > I decided to re-work some parts to make it more simple and give it a more
> > testing, and drop the RFC, to see if it gets more attention.
> > I also added David's feedback, so now all users of add_memory/__add_memory/
> > add_memory_resource can specify whether they want to use this feature or not.
> > I also fixed some compilation issues when CONFIG_SPARSEMEM_VMEMMAP is not set.
> > 
> 
> Hi Oscar, say, what tree and/or commit does this series apply to? I'm having some
> trouble finding the right place. Sorry for the easy question, I did try quite
> a few trees already...

Hi John, I somehow forgot to mention it in the cover-letter, sorry.
This patchset is based on v5.1-rc2-31-gece06d4a8149 + the following fixes
on top:

* https://patchwork.kernel.org/patch/10862609/
* https://patchwork.kernel.org/patch/10862611/
* https://patchwork.kernel.org/patch/10853049/

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
  2019-03-29 13:42       ` Michal Hocko
@ 2019-04-01  7:59         ` Oscar Salvador
  2019-04-01 11:53           ` Michal Hocko
  0 siblings, 1 reply; 42+ messages in thread
From: Oscar Salvador @ 2019-04-01  7:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, akpm, dan.j.williams, Jonathan.Cameron,
	anshuman.khandual, linux-kernel, linux-mm

On Fri, Mar 29, 2019 at 02:42:43PM +0100, Michal Hocko wrote:
> Having a larger contiguous area is definitely nice to have but you also
> have to consider the other side of the thing. If we have a movable
> memblock with unmovable memory then we are breaking the movable
> property. So there should be some flexibility for caller to tell whether
> to allocate on per device or per memblock. Or we need something to move
> memmaps during the hotremove.

By movable memblock you mean a memblock whose pages can be migrated over when
this memblock is offlined, right?

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
  2019-04-01  7:59         ` Oscar Salvador
@ 2019-04-01 11:53           ` Michal Hocko
  2019-04-02  8:28             ` Oscar Salvador
  0 siblings, 1 reply; 42+ messages in thread
From: Michal Hocko @ 2019-04-01 11:53 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: David Hildenbrand, akpm, dan.j.williams, Jonathan.Cameron,
	anshuman.khandual, linux-kernel, linux-mm

On Mon 01-04-19 09:59:36, Oscar Salvador wrote:
> On Fri, Mar 29, 2019 at 02:42:43PM +0100, Michal Hocko wrote:
> > Having a larger contiguous area is definitely nice to have but you also
> > have to consider the other side of the thing. If we have a movable
> > memblock with unmovable memory then we are breaking the movable
> > property. So there should be some flexibility for caller to tell whether
> > to allocate on per device or per memblock. Or we need something to move
> > memmaps during the hotremove.
> 
> By movable memblock you mean a memblock whose pages can be migrated over when
> this memblock is offlined, right?

I am mostly thinking about the movable_node kernel parameter, which makes
newly hotplugged memory go into ZONE_MOVABLE, and people do use that to
make sure such memory can later be hotremoved.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
  2019-04-01 11:53           ` Michal Hocko
@ 2019-04-02  8:28             ` Oscar Salvador
  2019-04-02  8:39               ` David Hildenbrand
  2019-04-02 12:48               ` Michal Hocko
  0 siblings, 2 replies; 42+ messages in thread
From: Oscar Salvador @ 2019-04-02  8:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, akpm, dan.j.williams, Jonathan.Cameron,
	anshuman.khandual, linux-kernel, linux-mm

On Mon, Apr 01, 2019 at 01:53:06PM +0200, Michal Hocko wrote:
> On Mon 01-04-19 09:59:36, Oscar Salvador wrote:
> > On Fri, Mar 29, 2019 at 02:42:43PM +0100, Michal Hocko wrote:
> > > Having a larger contiguous area is definitely nice to have but you also
> > > have to consider the other side of the thing. If we have a movable
> > > memblock with unmovable memory then we are breaking the movable
> > > property. So there should be some flexibility for caller to tell whether
> > > to allocate on per device or per memblock. Or we need something to move
> > > memmaps during the hotremove.
> > 
> > By movable memblock you mean a memblock whose pages can be migrated over when
> > this memblock is offlined, right?
> 
> I am mostly thinking about movable_node kernel parameter which makes
> newly hotpluged memory go into ZONE_MOVABLE and people do use that to
> make sure such a memory can be later hotremoved.

Uhm, I might be missing your point, but hot-added memory that makes use of
vmemmap pages can be hot-removed like any other memory.

Vmemmap pages do not count as unmovable memory, they just stick around
until all the sections they refer to have been removed, and then we proceed
with removing them.
So, to put it in another way: vmemmap pages are left in the system until the
whole memory device (DIMM, virt mem-device or whatever) is completely
hot-removed.
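
In pseudo-steps, using the head-page accounting from patch 3 (where
head->_refcount holds the number of sections backed by this vmemmap):

	remove section #1 .. #N-1  -> page_ref_dec_and_test(head) == false
				      (the memmap stays mapped)
	remove section #N (last)   -> page_ref_dec_and_test(head) == true
				      -> vmemmap_free() over the whole range,
					 walking backwards so the pages backing
					 the memmap itself are torn down last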

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
  2019-04-02  8:28             ` Oscar Salvador
@ 2019-04-02  8:39               ` David Hildenbrand
  2019-04-02 12:48               ` Michal Hocko
  1 sibling, 0 replies; 42+ messages in thread
From: David Hildenbrand @ 2019-04-02  8:39 UTC (permalink / raw)
  To: Oscar Salvador, Michal Hocko
  Cc: akpm, dan.j.williams, Jonathan.Cameron, anshuman.khandual,
	linux-kernel, linux-mm

On 02.04.19 10:28, Oscar Salvador wrote:
> On Mon, Apr 01, 2019 at 01:53:06PM +0200, Michal Hocko wrote:
>> On Mon 01-04-19 09:59:36, Oscar Salvador wrote:
>>> On Fri, Mar 29, 2019 at 02:42:43PM +0100, Michal Hocko wrote:
>>>> Having a larger contiguous area is definitely nice to have but you also
>>>> have to consider the other side of the thing. If we have a movable
>>>> memblock with unmovable memory then we are breaking the movable
>>>> property. So there should be some flexibility for caller to tell whether
>>>> to allocate on per device or per memblock. Or we need something to move
>>>> memmaps during the hotremove.
>>>
>>> By movable memblock you mean a memblock whose pages can be migrated over when
>>> this memblock is offlined, right?
>>
>> I am mostly thinking about movable_node kernel parameter which makes
>> newly hotpluged memory go into ZONE_MOVABLE and people do use that to
>> make sure such a memory can be later hotremoved.
> 
> Uhm, I might be missing your point, but hot-added memory that makes use of
> vmemmap pages can be hot-removed as any other memory.
> 
> Vmemmap pages do not account as unmovable memory, they just stick around
> until all sections they referred to have been removed, and then, we proceed
> with removing them.
> So, to put it in another way: vmemmap pages are left in the system until the
> whole memory device (DIMM, virt mem-device or whatever) is completely
> hot-removed.

Indeed, separate memblocks can be offlined, but vmemmap is removed along
with remove_memory().

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
  2019-04-02  8:28             ` Oscar Salvador
  2019-04-02  8:39               ` David Hildenbrand
@ 2019-04-02 12:48               ` Michal Hocko
  2019-04-03  8:01                 ` Oscar Salvador
  1 sibling, 1 reply; 42+ messages in thread
From: Michal Hocko @ 2019-04-02 12:48 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: David Hildenbrand, akpm, dan.j.williams, Jonathan.Cameron,
	anshuman.khandual, linux-kernel, linux-mm

On Tue 02-04-19 10:28:15, Oscar Salvador wrote:
> On Mon, Apr 01, 2019 at 01:53:06PM +0200, Michal Hocko wrote:
> > On Mon 01-04-19 09:59:36, Oscar Salvador wrote:
> > > On Fri, Mar 29, 2019 at 02:42:43PM +0100, Michal Hocko wrote:
> > > > Having a larger contiguous area is definitely nice to have but you also
> > > > have to consider the other side of the thing. If we have a movable
> > > > memblock with unmovable memory then we are breaking the movable
> > > > property. So there should be some flexibility for caller to tell whether
> > > > to allocate on per device or per memblock. Or we need something to move
> > > > memmaps during the hotremove.
> > > 
> > > By movable memblock you mean a memblock whose pages can be migrated over when
> > > this memblock is offlined, right?
> > 
> > I am mostly thinking about movable_node kernel parameter which makes
> > newly hotpluged memory go into ZONE_MOVABLE and people do use that to
> > make sure such a memory can be later hotremoved.
> 
> Uhm, I might be missing your point, but hot-added memory that makes use of
> vmemmap pages can be hot-removed as any other memory.
> 
> Vmemmap pages do not account as unmovable memory, they just stick around
> until all sections they referred to have been removed, and then, we proceed
> with removing them.
> So, to put it in another way: vmemmap pages are left in the system until the
> whole memory device (DIMM, virt mem-device or whatever) is completely
> hot-removed.

So what is going to happen when you hotadd two memblocks. The first one
holds memmaps and then you want to hotremove (not just offline) it?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
  2019-04-02 12:48               ` Michal Hocko
@ 2019-04-03  8:01                 ` Oscar Salvador
  2019-04-03  8:12                   ` Michal Hocko
  0 siblings, 1 reply; 42+ messages in thread
From: Oscar Salvador @ 2019-04-03  8:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, akpm, dan.j.williams, Jonathan.Cameron,
	anshuman.khandual, linux-kernel, linux-mm

On Tue, Apr 02, 2019 at 02:48:45PM +0200, Michal Hocko wrote:
> So what is going to happen when you hotadd two memblocks. The first one
> holds memmaps and then you want to hotremove (not just offline) it?

If you hot-add two memblocks, this means that either:

a) you hot-add a 256MB-memory-device (128MB per memblock)
b) you hot-add two 128MB-memory-devices

Either way, hot-removing only works on a memory-device as a whole, so
there is no problem.

Vmemmaps are created per hot-add operation, which means that
vmemmaps will be created for the hot-added range.
And since hot-add/hot-remove operations work with the same granularity,
there is no problem.

E.g:

# (qemu) object_add memory-backend-ram,id=ram0,size=128M
# (qemu) device_add pc-dimm,id=dimm0,memdev=ram0,node=1

# (qemu) object_add memory-backend-ram,id=ram1,size=512M
# (qemu) device_add pc-dimm,id=dimm1,memdev=ram1,node=1

# (qemu) object_add memory-backend-ram,id=ram2,size=1G
# (qemu) device_add pc-dimm,id=dimm2,memdev=ram2,node=1

These are three hot-add operations.
Each hot-add operation will create and use vmemmaps to hold the memmap for
its hot-added sections.
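
With the 2M-of-memmap-per-128M-section overhead discussed earlier (assuming
4K base pages and a 64-byte struct page), that would roughly reserve:

	dimm0 (128M):  1 section  ->  2M of vmemmap
	dimm1 (512M):  4 sections ->  8M of vmemmap
	dimm2 (  1G):  8 sections -> 16M of vmemmap

each carved out of the beginning of its own hot-added range, and each freed
again only when that whole device is hot-removed.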

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
  2019-04-03  8:01                 ` Oscar Salvador
@ 2019-04-03  8:12                   ` Michal Hocko
  2019-04-03  8:17                     ` David Hildenbrand
  2019-04-03  8:34                     ` Oscar Salvador
  0 siblings, 2 replies; 42+ messages in thread
From: Michal Hocko @ 2019-04-03  8:12 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: David Hildenbrand, akpm, dan.j.williams, Jonathan.Cameron,
	anshuman.khandual, linux-kernel, linux-mm

On Wed 03-04-19 10:01:16, Oscar Salvador wrote:
> On Tue, Apr 02, 2019 at 02:48:45PM +0200, Michal Hocko wrote:
> > So what is going to happen when you hotadd two memblocks. The first one
> > holds memmaps and then you want to hotremove (not just offline) it?
> 
> If you hot-add two memblocks, this means that either:
> 
> a) you hot-add a 256MB-memory-device (128MB per memblock)
> b) you hot-add two 128MB-memory-device
> 
> Either way, hot-removing only works for memory-device as a whole, so
> there is no problem.
> 
> Vmemmaps are created per hot-added operations, this means that
> vmemmaps will be created for the hot-added range.
> And since hot-add/hot-remove operations works with the same granularity,
> there is no problem.

What prevents somebody from calling arch_add_memory for a range spanning
multiple memblocks directly from a driver? In other words, aren't you
making assumptions about future usage based on the qemu use case?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
  2019-04-03  8:12                   ` Michal Hocko
@ 2019-04-03  8:17                     ` David Hildenbrand
  2019-04-03  8:37                       ` Michal Hocko
  2019-04-03  8:34                     ` Oscar Salvador
  1 sibling, 1 reply; 42+ messages in thread
From: David Hildenbrand @ 2019-04-03  8:17 UTC (permalink / raw)
  To: Michal Hocko, Oscar Salvador
  Cc: akpm, dan.j.williams, Jonathan.Cameron, anshuman.khandual,
	linux-kernel, linux-mm

On 03.04.19 10:12, Michal Hocko wrote:
> On Wed 03-04-19 10:01:16, Oscar Salvador wrote:
>> On Tue, Apr 02, 2019 at 02:48:45PM +0200, Michal Hocko wrote:
>>> So what is going to happen when you hotadd two memblocks. The first one
>>> holds memmaps and then you want to hotremove (not just offline) it?
>>
>> If you hot-add two memblocks, this means that either:
>>
>> a) you hot-add a 256MB-memory-device (128MB per memblock)
>> b) you hot-add two 128MB-memory-device
>>
>> Either way, hot-removing only works for memory-device as a whole, so
>> there is no problem.
>>
>> Vmemmaps are created per hot-added operations, this means that
>> vmemmaps will be created for the hot-added range.
>> And since hot-add/hot-remove operations works with the same granularity,
>> there is no problem.
> 
> What does prevent calling somebody arch_add_memory for a range spanning
> multiple memblocks from a driver directly. In other words aren't you

To drivers, we only expose add_memory() and friends. And I think this is
a good idea.

> making  assumptions about a future usage based on the qemu usecase?
> 

As I noted, we only have an issue if add_memory() and
remove_memory() are called with different granularity. I gave two
examples where this might not be the case, but we will have to look into
the details.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
  2019-04-03  8:12                   ` Michal Hocko
  2019-04-03  8:17                     ` David Hildenbrand
@ 2019-04-03  8:34                     ` Oscar Salvador
  2019-04-03  8:36                       ` David Hildenbrand
  1 sibling, 1 reply; 42+ messages in thread
From: Oscar Salvador @ 2019-04-03  8:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, akpm, dan.j.williams, Jonathan.Cameron,
	anshuman.khandual, linux-kernel, linux-mm

On Wed, Apr 03, 2019 at 10:12:32AM +0200, Michal Hocko wrote:
> What prevents somebody from calling arch_add_memory directly from a driver
> for a range spanning multiple memblocks? In other words, aren't you
> making assumptions about future usage based on the qemu use case?

Well, right now they cannot, as it is not exported.
But if we want to do that in the future, then yes, I would have to
be more careful, because I made the assumption that hot-add/hot-remove
work at the same granularity, which is the case right now.

That said, I think that something like you suggested before, giving
the caller the option to specify whether it wants vmemmaps per the
whole hot-added range or per memblock, is a reasonable thing to do.
That way there will not be a problem working with different granularities
in hot-add/hot-remove operations and we would be on the safe side.
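
Just to make the idea concrete, a rough sketch of what I have in mind
(the flag names below are made up for illustration, and add_memory()
does not take a flags argument today):

	/* hypothetical flags a caller could pass down at hot-add time */
	#define MHP_MEMMAP_PER_RANGE	(1UL << 1)	/* one vmemmap for the whole range */
	#define MHP_MEMMAP_PER_MEMBLOCK	(1UL << 2)	/* one vmemmap per 128MB memblock */

	/* device that is always added/removed as a whole */
	add_memory(nid, start, SZ_256M, MHP_MEMMAP_PER_RANGE);

	/* device that wants to hot-remove in 128MB chunks later on */
	add_memory(nid, start, SZ_512M, MHP_MEMMAP_PER_MEMBLOCK);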

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
  2019-04-03  8:34                     ` Oscar Salvador
@ 2019-04-03  8:36                       ` David Hildenbrand
  0 siblings, 0 replies; 42+ messages in thread
From: David Hildenbrand @ 2019-04-03  8:36 UTC (permalink / raw)
  To: Oscar Salvador, Michal Hocko
  Cc: akpm, dan.j.williams, Jonathan.Cameron, anshuman.khandual,
	linux-kernel, linux-mm

On 03.04.19 10:34, Oscar Salvador wrote:
> On Wed, Apr 03, 2019 at 10:12:32AM +0200, Michal Hocko wrote:
>> What prevents somebody from calling arch_add_memory directly from a driver
>> for a range spanning multiple memblocks? In other words, aren't you
>> making assumptions about future usage based on the qemu use case?
> 
> Well, right now they cannot, as it is not exported.
> But if we want to do that in the future, then yes, I would have to
> be more careful, because I made the assumption that hot-add/hot-remove
> work at the same granularity, which is the case right now.
> 
> That said, I think that something like you suggested before, giving
> the caller the option to specify whether it wants vmemmaps per the
> whole hot-added range or per memblock, is a reasonable thing to do.
> That way there will not be a problem working with different granularities
> in hot-add/hot-remove operations and we would be on the safe side.

There might still be an issue if the person adding memory is not the
same one removing it. I am not yet sure if we should even allow
add_memory/remove_memory with different granularity. But as I noted,
ACPI and powernv might already be doing exactly that.


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
  2019-04-03  8:17                     ` David Hildenbrand
@ 2019-04-03  8:37                       ` Michal Hocko
  2019-04-03  8:41                         ` David Hildenbrand
  2019-04-03  9:40                         ` Oscar Salvador
  0 siblings, 2 replies; 42+ messages in thread
From: Michal Hocko @ 2019-04-03  8:37 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Oscar Salvador, akpm, dan.j.williams, Jonathan.Cameron,
	anshuman.khandual, linux-kernel, linux-mm

On Wed 03-04-19 10:17:26, David Hildenbrand wrote:
> On 03.04.19 10:12, Michal Hocko wrote:
> > On Wed 03-04-19 10:01:16, Oscar Salvador wrote:
> >> On Tue, Apr 02, 2019 at 02:48:45PM +0200, Michal Hocko wrote:
> >>> So what is going to happen when you hotadd two memblocks, the first one
> >>> holds memmaps, and then you want to hotremove (not just offline) it?
> >>
> >> If you hot-add two memblocks, this means that either:
> >>
> >> a) you hot-add one 256MB memory device (128MB per memblock)
> >> b) you hot-add two 128MB memory devices
> >>
> >> Either way, hot-removing only works on a memory device as a whole, so
> >> there is no problem.
> >>
> >> Vmemmaps are created per hot-add operation, which means that
> >> vmemmaps will be created for the whole hot-added range.
> >> And since hot-add/hot-remove operations work at the same granularity,
> >> there is no problem.
> > 
> > What prevents somebody from calling arch_add_memory directly from a driver
> > for a range spanning multiple memblocks? In other words, aren't you
> 
> To drivers, we only expose add_memory() and friends. And I think this is
> a good idea.
> 
> > making assumptions about future usage based on the qemu use case?
> > 
> 
> As I noted, we only have an issue if add_memory() and
> remove_memory() are called with different granularity. I gave two
> examples where that might not hold, but we will have to look into
> the details.

It seems natural that a DIMM will be hot-removed all at once because
you cannot hot-remove half of a DIMM, right? But I can envision that
people might want to hot-remove a faulty part of a really large DIMM
because they would like to save some resources.

With different users asking for the hotplug functionality, I do not
think we want to make such a strong assumption as that hot-remove will
have the same granularity as hot-add.


That being said, it should be up to the caller of the hotplug code to tell
the vmemmap allocation strategy. For starters, I would only pack vmemmaps
for "regular" kernel zone memory. We should be more careful with movable
zones. We can always re-evaluate later when there is a strong demand for
huge pages in movable zones, but that is not the case now because those
pages are not really movable in practice.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
  2019-04-03  8:37                       ` Michal Hocko
@ 2019-04-03  8:41                         ` David Hildenbrand
  2019-04-03  8:49                           ` Michal Hocko
  2019-04-03  8:50                           ` Oscar Salvador
  2019-04-03  9:40                         ` Oscar Salvador
  1 sibling, 2 replies; 42+ messages in thread
From: David Hildenbrand @ 2019-04-03  8:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Oscar Salvador, akpm, dan.j.williams, Jonathan.Cameron,
	anshuman.khandual, linux-kernel, linux-mm

On 03.04.19 10:37, Michal Hocko wrote:
> On Wed 03-04-19 10:17:26, David Hildenbrand wrote:
>> On 03.04.19 10:12, Michal Hocko wrote:
>>> On Wed 03-04-19 10:01:16, Oscar Salvador wrote:
>>>> On Tue, Apr 02, 2019 at 02:48:45PM +0200, Michal Hocko wrote:
>>>>> So what is going to happen when you hotadd two memblocks, the first one
>>>>> holds memmaps, and then you want to hotremove (not just offline) it?
>>>>
>>>> If you hot-add two memblocks, this means that either:
>>>>
>>>> a) you hot-add one 256MB memory device (128MB per memblock)
>>>> b) you hot-add two 128MB memory devices
>>>>
>>>> Either way, hot-removing only works on a memory device as a whole, so
>>>> there is no problem.
>>>>
>>>> Vmemmaps are created per hot-add operation, which means that
>>>> vmemmaps will be created for the whole hot-added range.
>>>> And since hot-add/hot-remove operations work at the same granularity,
>>>> there is no problem.
>>>
>>> What prevents somebody from calling arch_add_memory directly from a driver
>>> for a range spanning multiple memblocks? In other words, aren't you
>>
>> To drivers, we only expose add_memory() and friends. And I think this is
>> a good idea.
>>
>>> making assumptions about future usage based on the qemu use case?
>>>
>>
>> As I noted, we only have an issue if add_memory() and
>> remove_memory() are called with different granularity. I gave two
>> examples where that might not hold, but we will have to look into
>> the details.
> 
> It seems natural that a DIMM will be hot-removed all at once because
> you cannot hot-remove half of a DIMM, right? But I can envision that
> people might want to hot-remove a faulty part of a really large DIMM
> because they would like to save some resources.

Even for virtio-mem, something like that would be useful. But I could
try to live without it :) Add a lot of memory in one go when starting up
(add_memory()) - much faster than doing individual add_memory()
calls. When removing memory, as soon as all parts of a memblock are
offline, remove only that memblock to save memory (remove_memory()).

There, I would need to allocate the vmemmap per memblock.
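
In pseudo-code, the pattern would be roughly the following (sizes and
the variable names are illustrative only, and it assumes some way to
request per-memblock vmemmaps at hot-add time):

	/* startup: one big hot-add request, vmemmaps packed per memblock */
	add_memory(nid, base, SZ_4G);

	/* later: as soon as one 128MB memblock is fully offline again */
	remove_memory(nid, memblock_start, SZ_128M);	/* gives its memmap back too */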

> 
> With different users asking for the hotplug functionality, I do not
> think we want to make such a strong assumption as that hot-remove will
> have the same granularity as hot-add.
> 

Then we have to make sure it works for all use cases.

> 
> That being said, it should be up to the caller of the hotplug code to tell
> the vmemmap allocation strategy. For starters, I would only pack vmemmaps
> for "regular" kernel zone memory. We should be more careful with movable
> zones. We can always re-evaluate later when there is a strong demand for
> huge pages in movable zones, but that is not the case now because those
> pages are not really movable in practice.

There remains the issue of a potentially different user trying to remove
memory it didn't add, at some other granularity. We then really have to
identify and isolate that case.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/4] mm, memory_hotplug: cleanup memory offline path
  2019-03-28 13:43 ` [PATCH 1/4] mm, memory_hotplug: cleanup memory offline path Oscar Salvador
@ 2019-04-03  8:43   ` Michal Hocko
  0 siblings, 0 replies; 42+ messages in thread
From: Michal Hocko @ 2019-04-03  8:43 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: akpm, david, dan.j.williams, Jonathan.Cameron, anshuman.khandual,
	linux-kernel, linux-mm

On Thu 28-03-19 14:43:17, Oscar Salvador wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> check_pages_isolated_cb currently accounts the whole pfn range as being
> offlined if test_pages_isolated succeeds on the range. This is based on
> the assumption that all pages in the range are freed, which is currently
> true in most cases, but it won't be with later changes, as pages
> marked as vmemmap won't be isolated.
> 
> Move the offlined pages counting to offline_isolated_pages_cb and
> rely on __offline_isolated_pages to return the correct value.
> check_pages_isolated_cb will still do its primary job and check the pfn
> range.
> 
> While we are at it, remove check_pages_isolated and offline_isolated_pages
> and use walk_system_ram_range directly, as is done in online_pages.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Oscar Salvador <osalvador@suse.de>

Can we merge this even without the rest of the series? It looks like a
useful thing regardless of this series. Sure we do not really use it
right now but it cleans up the code and actually removes more than it
adds...
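
For context, the counting side of the change boils down to roughly the
following callback for walk_system_ram_range (a sketch based on the
changelog; the actual patch may differ in detail):

	static int offline_isolated_pages_cb(unsigned long start_pfn,
					     unsigned long nr_pages, void *data)
	{
		unsigned long *offlined_pages = (unsigned long *)data;

		/* __offline_isolated_pages now returns how many pages it offlined */
		*offlined_pages += __offline_isolated_pages(start_pfn,
							    start_pfn + nr_pages);
		return 0;
	}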

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/4] mm, memory_hotplug: provide a more generic restrictions for memory hotplug
  2019-03-28 13:43 ` [PATCH 2/4] mm, memory_hotplug: provide a more generic restrictions for memory hotplug Oscar Salvador
@ 2019-04-03  8:46   ` Michal Hocko
  2019-04-03  8:48     ` David Hildenbrand
  2019-04-04 10:04     ` Oscar Salvador
  0 siblings, 2 replies; 42+ messages in thread
From: Michal Hocko @ 2019-04-03  8:46 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: akpm, david, dan.j.williams, Jonathan.Cameron, anshuman.khandual,
	linux-kernel, linux-mm

On Thu 28-03-19 14:43:18, Oscar Salvador wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> arch_add_memory and __add_pages take a want_memblock argument which controls
> whether the newly added memory should get the sysfs memblock user API (e.g.
> ZONE_DEVICE users do not want/need this interface). Some callers even
> want to control where we allocate the memmap from, by configuring an
> altmap.
> 
> Add a more generic hotplug context for arch_add_memory and __add_pages.
> struct mhp_restrictions contains flags which specify additional
> features to be enabled by memory hotplug (MHP_MEMBLOCK_API
> currently) and an altmap for the alternative memmap allocator.
> 
> Please note that the complete altmap propagation down to the vmemmap code
> is still not done in this patch. It will be done in a follow-up to
> reduce the churn here.
> 
> This patch shouldn't introduce any functional change.

Is there an agreement on the interface here? Or do we want to hide the altmap
behind some more general-looking interface? If the former is true, can
we merge it, as it touches code that might cause merge conflicts later on
since multiple people are working on this area?
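
For reference, based on the changelog the context boils down to something
like this (a sketch from the description above, not necessarily the exact
code in the patch):

	/* features that can be enabled at hot-add time */
	#define MHP_MEMBLOCK_API	(1UL << 0)	/* create sysfs memory block devices */

	struct mhp_restrictions {
		unsigned long flags;		/* MHP_* feature flags */
		struct vmem_altmap *altmap;	/* alternative allocator for the memmap */
	};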
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/4] mm, memory_hotplug: provide a more generic restrictions for memory hotplug
  2019-04-03  8:46   ` Michal Hocko
@ 2019-04-03  8:48     ` David Hildenbrand
  2019-04-04 10:04     ` Oscar Salvador
  1 sibling, 0 replies; 42+ messages in thread
From: David Hildenbrand @ 2019-04-03  8:48 UTC (permalink / raw)
  To: Michal Hocko, Oscar Salvador
  Cc: akpm, dan.j.williams, Jonathan.Cameron, anshuman.khandual,
	linux-kernel, linux-mm

On 03.04.19 10:46, Michal Hocko wrote:
> On Thu 28-03-19 14:43:18, Oscar Salvador wrote:
>> From: Michal Hocko <mhocko@suse.com>
>>
>> arch_add_memory and __add_pages take a want_memblock argument which controls
>> whether the newly added memory should get the sysfs memblock user API (e.g.
>> ZONE_DEVICE users do not want/need this interface). Some callers even
>> want to control where we allocate the memmap from, by configuring an
>> altmap.
>>
>> Add a more generic hotplug context for arch_add_memory and __add_pages.
>> struct mhp_restrictions contains flags which specify additional
>> features to be enabled by memory hotplug (MHP_MEMBLOCK_API
>> currently) and an altmap for the alternative memmap allocator.
>>
>> Please note that the complete altmap propagation down to the vmemmap code
>> is still not done in this patch. It will be done in a follow-up to
>> reduce the churn here.
>>
>> This patch shouldn't introduce any functional change.
> 
> Is there an agreement on the interface here? Or do we want to hide the altmap
> behind some more general-looking interface? If the former is true, can
> we merge it, as it touches code that might cause merge conflicts later on
> since multiple people are working on this area?
> 
I was wondering if, instead of calling it "mhp_restrictions", we should
call it something like "mhp_options", so other stuff might be easier to
fit in. Especially so we don't have to touch all these functions
whenever we simply want to pass yet another parameter down to the core
- or remove one.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
  2019-04-03  8:41                         ` David Hildenbrand
@ 2019-04-03  8:49                           ` Michal Hocko
  2019-04-03  8:53                             ` David Hildenbrand
  2019-04-03  8:50                           ` Oscar Salvador
  1 sibling, 1 reply; 42+ messages in thread
From: Michal Hocko @ 2019-04-03  8:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Oscar Salvador, akpm, dan.j.williams, Jonathan.Cameron,
	anshuman.khandual, linux-kernel, linux-mm

On Wed 03-04-19 10:41:35, David Hildenbrand wrote:
> On 03.04.19 10:37, Michal Hocko wrote:
[...]
> > That being said, it should be up to the caller of the hotplug code to tell
> > the vmemmap allocation strategy. For starters, I would only pack vmemmaps
> > for "regular" kernel zone memory. We should be more careful with movable
> > zones. We can always re-evaluate later when there is a strong demand for
> > huge pages in movable zones, but that is not the case now because those
> > pages are not really movable in practice.
> 
> There remains the issue of a potentially different user trying to remove
> memory it didn't add, at some other granularity. We then really have to
> identify and isolate that case.

Can you give an example of a sensible use case that would require this?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
  2019-04-03  8:41                         ` David Hildenbrand
  2019-04-03  8:49                           ` Michal Hocko
@ 2019-04-03  8:50                           ` Oscar Salvador
  2019-04-03  8:54                             ` David Hildenbrand
  1 sibling, 1 reply; 42+ messages in thread
From: Oscar Salvador @ 2019-04-03  8:50 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Hocko, akpm, dan.j.williams, Jonathan.Cameron,
	anshuman.khandual, linux-kernel, linux-mm

On Wed, Apr 03, 2019 at 10:41:35AM +0200, David Hildenbrand wrote:
> > That being said, it should be up to the caller of the hotplug code to tell
> > the vmemmap allocation strategy. For starters, I would only pack vmemmaps
> > for "regular" kernel zone memory. We should be more careful with movable
> > zones. We can always re-evaluate later when there is a strong demand for
> > huge pages in movable zones, but that is not the case now because those
> > pages are not really movable in practice.
> 
> There remains the issue of a potentially different user trying to remove
> memory it didn't add, at some other granularity. We then really have to
> identify and isolate that case.

If we let the caller specify whether it wants vmemmaps per memblock or per range,
I would trust that caller to do the right thing and specify one or the other
depending on what it wants to do in the future.

So, say a driver adds 512MB of memory and specifies that it wants vmemmaps per
memblock because later on it would like to hot-remove in chunks of 128MB.

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
  2019-04-03  8:49                           ` Michal Hocko
@ 2019-04-03  8:53                             ` David Hildenbrand
  0 siblings, 0 replies; 42+ messages in thread
From: David Hildenbrand @ 2019-04-03  8:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Oscar Salvador, akpm, dan.j.williams, Jonathan.Cameron,
	anshuman.khandual, linux-kernel, linux-mm

On 03.04.19 10:49, Michal Hocko wrote:
> On Wed 03-04-19 10:41:35, David Hildenbrand wrote:
>> On 03.04.19 10:37, Michal Hocko wrote:
> [...]
>>> That being said, it should be up to the caller of the hotplug code to tell
>>> the vmemmap allocation strategy. For starters, I would only pack vmemmaps
>>> for "regular" kernel zone memory. We should be more careful with movable
>>> zones. We can always re-evaluate later when there is a strong demand for
>>> huge pages in movable zones, but that is not the case now because those
>>> pages are not really movable in practice.
>>
>> There remains the issue of a potentially different user trying to remove
>> memory it didn't add, at some other granularity. We then really have to
>> identify and isolate that case.
> 
> Can you give an example of a sensible use case that would require this?
> 

The two cases I mentioned are

1. arch/powerpc/platforms/powernv/memtrace.c: memtrace_alloc_node()

AFAICS, memtrace tries to offline + remove memory that it did not add
itself.

"Remove memory in memory block size chunks so that iomem resources are
always split to the same size and we never try to remove memory that
spans two iomem resources"

2. drivers/acpi/acpi_memhotplug.c:acpi_memory_enable_device()

We might hit "__add_memory() == -EEXIST" and continue. When removing the
devices, __remove_memory() is called. I still have to find out whether that
could imply removing at a different granularity than was added.
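
The relevant loop in acpi_memory_enable_device() looks roughly like this
(heavily trimmed, so the details might be slightly off):

	list_for_each_entry(info, &mem_device->res_list, list) {
		if (!info->length)
			continue;

		result = __add_memory(node, info->start_addr, info->length);
		/*
		 * -EEXIST means (part of) the range was already known to the
		 * kernel; we do not bail out but keep the device enabled.
		 */
		if (result && result != -EEXIST)
			continue;

		info->enabled = true;
		num_enabled++;
	}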

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
  2019-04-03  8:50                           ` Oscar Salvador
@ 2019-04-03  8:54                             ` David Hildenbrand
  0 siblings, 0 replies; 42+ messages in thread
From: David Hildenbrand @ 2019-04-03  8:54 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Michal Hocko, akpm, dan.j.williams, Jonathan.Cameron,
	anshuman.khandual, linux-kernel, linux-mm

On 03.04.19 10:50, Oscar Salvador wrote:
> On Wed, Apr 03, 2019 at 10:41:35AM +0200, David Hildenbrand wrote:
>>> That being said, it should be up to the caller of the hotplug code to tell
>>> the vmemmap allocation strategy. For starters, I would only pack vmemmaps
>>> for "regular" kernel zone memory. We should be more careful with movable
>>> zones. We can always re-evaluate later when there is a strong demand for
>>> huge pages in movable zones, but that is not the case now because those
>>> pages are not really movable in practice.
>>
>> There remains the issue of a potentially different user trying to remove
>> memory it didn't add, at some other granularity. We then really have to
>> identify and isolate that case.
> 
> If we let the caller specify whether it wants vmemmaps per memblock or per range,
> I would trust that caller to do the right thing and specify one or the other
> depending on what it wants to do in the future.
> 
> So, say a driver adds 512MB of memory and specifies that it wants vmemmaps per
> memblock because later on it would like to hot-remove in chunks of 128MB.
> 

I am talking about the memtrace and ACPI thing. Otherwise I agree, trust
the user iff the user is the same person adding/removing memory.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
  2019-04-03  8:37                       ` Michal Hocko
  2019-04-03  8:41                         ` David Hildenbrand
@ 2019-04-03  9:40                         ` Oscar Salvador
  2019-04-03 10:46                           ` Michal Hocko
  2019-04-04 10:25                           ` Vlastimil Babka
  1 sibling, 2 replies; 42+ messages in thread
From: Oscar Salvador @ 2019-04-03  9:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, akpm, dan.j.williams, Jonathan.Cameron,
	anshuman.khandual, linux-kernel, linux-mm

On Wed, Apr 03, 2019 at 10:37:57AM +0200, Michal Hocko wrote:
> That being said, it should be up to the caller of the hotplug code to tell
> the vmemmap allocation strategy. For starters, I would only pack vmemmaps
> for "regular" kernel zone memory. We should be more careful with movable
> zones. We can always re-evaluate later when there is a strong demand for
> huge pages in movable zones, but that is not the case now because those
> pages are not really movable in practice.

I agree that it makes sense to let the caller specify whether it wants to
allocate vmemmaps per memblock or per memory range, so we are more flexible
when it comes to granularity in hot-add/hot-remove operations.

But the thing is that the zones are picked at onlining stage, while
vmemmaps are created at hot-add stage, so I am not sure we can define
the strategy depending on the zone.

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
  2019-04-03  9:40                         ` Oscar Salvador
@ 2019-04-03 10:46                           ` Michal Hocko
  2019-04-04 10:25                           ` Vlastimil Babka
  1 sibling, 0 replies; 42+ messages in thread
From: Michal Hocko @ 2019-04-03 10:46 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: David Hildenbrand, akpm, dan.j.williams, Jonathan.Cameron,
	anshuman.khandual, linux-kernel, linux-mm

On Wed 03-04-19 11:40:54, Oscar Salvador wrote:
> On Wed, Apr 03, 2019 at 10:37:57AM +0200, Michal Hocko wrote:
> > That being said, it should be up to the caller of the hotplug code to tell
> > the vmemmap allocation strategy. For starters, I would only pack vmemmaps
> > for "regular" kernel zone memory. We should be more careful with movable
> > zones. We can always re-evaluate later when there is a strong demand for
> > huge pages in movable zones, but that is not the case now because those
> > pages are not really movable in practice.
> 
> I agree that it makes sense to let the caller specify whether it wants to
> allocate vmemmaps per memblock or per memory range, so we are more flexible
> when it comes to granularity in hot-add/hot-remove operations.

And just to be more specific: the API shouldn't really care about this
implementation detail. So ideally the caller of add_pages just picks
the proper allocator and the rest is completely transparent.

> But the thing is that the zones are picked at onlining stage, while
> vmemmaps are created at hot-add stage, so I am not sure we can define
> the strategy depending on the zone.

This is a good point. We do have a means to tell the default zone for
regular memory hotplug, so this can be reused based on the pfn range
(default_zone_for_pfn).
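
Something along these lines at hot-add time, i.e. decide based on the zone
the range would be onlined to by default (just a sketch; default_zone_for_pfn
is currently static to memory_hotplug.c, and the restrictions/altmap names
are only illustrative):

	struct zone *zone = default_zone_for_pfn(nid, start_pfn, nr_pages);

	/*
	 * Only pack the memmap into the hot-added range when the memory is
	 * expected to end up in a kernel zone; be conservative for MOVABLE.
	 */
	if (zone_idx(zone) != ZONE_MOVABLE)
		restrictions.altmap = &altmap;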
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/4] mm, memory_hotplug: provide a more generic restrictions for memory hotplug
  2019-04-03  8:46   ` Michal Hocko
  2019-04-03  8:48     ` David Hildenbrand
@ 2019-04-04 10:04     ` Oscar Salvador
  2019-04-04 10:06       ` David Hildenbrand
  2019-04-04 10:31       ` Michal Hocko
  1 sibling, 2 replies; 42+ messages in thread
From: Oscar Salvador @ 2019-04-04 10:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, david, dan.j.williams, Jonathan.Cameron, anshuman.khandual,
	linux-kernel, linux-mm

On Wed, Apr 03, 2019 at 10:46:03AM +0200, Michal Hocko wrote:
> On Thu 28-03-19 14:43:18, Oscar Salvador wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > arch_add_memory and __add_pages take a want_memblock argument which controls
> > whether the newly added memory should get the sysfs memblock user API (e.g.
> > ZONE_DEVICE users do not want/need this interface). Some callers even
> > want to control where we allocate the memmap from, by configuring an
> > altmap.
> > 
> > Add a more generic hotplug context for arch_add_memory and __add_pages.
> > struct mhp_restrictions contains flags which specify additional
> > features to be enabled by memory hotplug (MHP_MEMBLOCK_API
> > currently) and an altmap for the alternative memmap allocator.
> > 
> > Please note that the complete altmap propagation down to the vmemmap code
> > is still not done in this patch. It will be done in a follow-up to
> > reduce the churn here.
> > 
> > This patch shouldn't introduce any functional change.
> 
> Is there an agreement on the interface here? Or do we want to hide the altmap
> behind some more general-looking interface? If the former is true, can
> we merge it, as it touches code that might cause merge conflicts later on
> since multiple people are working on this area?

Uhm, I think that the interface is fine for now.
I thought about providing some callbacks to build the altmap layout, but I
realized that it was overcomplicated and I would rather start easy.
Maybe the naming could be changed to what David suggested, something like
"mhp_options", which actually looks more generic and allows us to stuff more
things into it should the need arise in the future.
But that is something that can come afterwards, I guess.

But merging this now is not a bad idea, taking into account that some people
are working on the same area and merge conflicts arise easily.
Otherwise reworking it for every version is going to be a pita.

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/4] mm, memory_hotplug: provide a more generic restrictions for memory hotplug
  2019-04-04 10:04     ` Oscar Salvador
@ 2019-04-04 10:06       ` David Hildenbrand
  2019-04-04 10:31       ` Michal Hocko
  1 sibling, 0 replies; 42+ messages in thread
From: David Hildenbrand @ 2019-04-04 10:06 UTC (permalink / raw)
  To: Oscar Salvador, Michal Hocko
  Cc: akpm, dan.j.williams, Jonathan.Cameron, anshuman.khandual,
	linux-kernel, linux-mm

On 04.04.19 12:04, Oscar Salvador wrote:
> On Wed, Apr 03, 2019 at 10:46:03AM +0200, Michal Hocko wrote:
>> On Thu 28-03-19 14:43:18, Oscar Salvador wrote:
>>> From: Michal Hocko <mhocko@suse.com>
>>>
>>> arch_add_memory and __add_pages take a want_memblock argument which controls
>>> whether the newly added memory should get the sysfs memblock user API (e.g.
>>> ZONE_DEVICE users do not want/need this interface). Some callers even
>>> want to control where we allocate the memmap from, by configuring an
>>> altmap.
>>>
>>> Add a more generic hotplug context for arch_add_memory and __add_pages.
>>> struct mhp_restrictions contains flags which specify additional
>>> features to be enabled by memory hotplug (MHP_MEMBLOCK_API
>>> currently) and an altmap for the alternative memmap allocator.
>>>
>>> Please note that the complete altmap propagation down to the vmemmap code
>>> is still not done in this patch. It will be done in a follow-up to
>>> reduce the churn here.
>>>
>>> This patch shouldn't introduce any functional change.
>>
>> Is there an agreement on the interface here? Or do we want to hide the altmap
>> behind some more general-looking interface? If the former is true, can
>> we merge it, as it touches code that might cause merge conflicts later on
>> since multiple people are working on this area?
> 
> Uhm, I think that the interface is fine for now.
> I thought about providing some callbacks to build the altmap layout, but I
> realized that it was overcomplicated and I would rather start easy.
> Maybe the naming could be changed to what David suggested, something like
> "mhp_options", which actually looks more generic and allows us to stuff more
> things into it should the need arise in the future.
> But that is something that can come afterwards, I guess.

I'd vote to rename it right away, but feel free to continue how you prefer.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory
  2019-04-03  9:40                         ` Oscar Salvador
  2019-04-03 10:46                           ` Michal Hocko
@ 2019-04-04 10:25                           ` Vlastimil Babka
  1 sibling, 0 replies; 42+ messages in thread
From: Vlastimil Babka @ 2019-04-04 10:25 UTC (permalink / raw)
  To: Oscar Salvador, Michal Hocko
  Cc: David Hildenbrand, akpm, dan.j.williams, Jonathan.Cameron,
	anshuman.khandual, linux-kernel, linux-mm

On 4/3/19 11:40 AM, Oscar Salvador wrote:
> On Wed, Apr 03, 2019 at 10:37:57AM +0200, Michal Hocko wrote:
>> That being said, it should be up to the caller of the hotplug code to tell
>> the vmemmap allocation strategy. For starters, I would only pack vmemmaps
>> for "regular" kernel zone memory. We should be more careful with movable
>> zones. We can always re-evaluate later when there is a strong demand for
>> huge pages in movable zones, but that is not the case now because those
>> pages are not really movable in practice.
> 
> I agree that it makes sense to let the caller specify whether it wants to
> allocate vmemmaps per memblock or per memory range, so we are more flexible
> when it comes to granularity in hot-add/hot-remove operations.

Please also add the possibility to not allocate from the hotadded memory
(i.e. allocate from node 0 as is done now). I know about some MCDRAM users
who want to expose as much as possible to userspace, and not even give up
those ~4% for the memmap.

> But the thing is that the zones are picked at onlining stage, while
> vmemmaps are created at hot-add stage, so I am not sure we can define
> the strategy depending on the zone.
> 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/4] mm, memory_hotplug: provide a more generic restrictions for memory hotplug
  2019-04-04 10:04     ` Oscar Salvador
  2019-04-04 10:06       ` David Hildenbrand
@ 2019-04-04 10:31       ` Michal Hocko
  2019-04-04 12:04         ` Oscar Salvador
  1 sibling, 1 reply; 42+ messages in thread
From: Michal Hocko @ 2019-04-04 10:31 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: akpm, david, dan.j.williams, Jonathan.Cameron, anshuman.khandual,
	linux-kernel, linux-mm

On Thu 04-04-19 12:04:05, Oscar Salvador wrote:
> On Wed, Apr 03, 2019 at 10:46:03AM +0200, Michal Hocko wrote:
> > On Thu 28-03-19 14:43:18, Oscar Salvador wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > arch_add_memory and __add_pages take a want_memblock argument which controls
> > > whether the newly added memory should get the sysfs memblock user API (e.g.
> > > ZONE_DEVICE users do not want/need this interface). Some callers even
> > > want to control where we allocate the memmap from, by configuring an
> > > altmap.
> > > 
> > > Add a more generic hotplug context for arch_add_memory and __add_pages.
> > > struct mhp_restrictions contains flags which specify additional
> > > features to be enabled by memory hotplug (MHP_MEMBLOCK_API
> > > currently) and an altmap for the alternative memmap allocator.
> > > 
> > > Please note that the complete altmap propagation down to the vmemmap code
> > > is still not done in this patch. It will be done in a follow-up to
> > > reduce the churn here.
> > > 
> > > This patch shouldn't introduce any functional change.
> > 
> > Is there an agreement on the interface here? Or do we want to hide the altmap
> > behind some more general-looking interface? If the former is true, can
> > we merge it, as it touches code that might cause merge conflicts later on
> > since multiple people are working on this area?
> 
> Uhm, I think that the interface is fine for now.
> I thought about providing some callbacks to build the altmap layout, but I
> realized that it was overcomplicated and I would rather start easy.
> Maybe the naming could be changed to what David suggested, something like
> "mhp_options", which actually looks more generic and allows us to stuff more
> things into it should the need arise in the future.
> But that is something that can come afterwards, I guess.
> 
> But merging this now is not a bad idea, taking into account that some people
> are working on the same area and merge conflicts arise easily.
> Otherwise reworking it for every version is going to be a pita.

I do not care a wee bit about the naming, TBH. Do as you like. But please
repost just these two patches and we can discuss the rest of this feature
in a separate discussion.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/4] mm, memory_hotplug: provide a more generic restrictions for memory hotplug
  2019-04-04 10:31       ` Michal Hocko
@ 2019-04-04 12:04         ` Oscar Salvador
  0 siblings, 0 replies; 42+ messages in thread
From: Oscar Salvador @ 2019-04-04 12:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, david, dan.j.williams, Jonathan.Cameron, anshuman.khandual,
	linux-kernel, linux-mm

On Thu, Apr 04, 2019 at 12:31:15PM +0200, Michal Hocko wrote:
> On Thu 04-04-19 12:04:05, Oscar Salvador wrote:
> > On Wed, Apr 03, 2019 at 10:46:03AM +0200, Michal Hocko wrote:
> > > On Thu 28-03-19 14:43:18, Oscar Salvador wrote:
> > > > From: Michal Hocko <mhocko@suse.com>
> > > > 
> > > > arch_add_memory and __add_pages take a want_memblock argument which controls
> > > > whether the newly added memory should get the sysfs memblock user API (e.g.
> > > > ZONE_DEVICE users do not want/need this interface). Some callers even
> > > > want to control where we allocate the memmap from, by configuring an
> > > > altmap.
> > > > 
> > > > Add a more generic hotplug context for arch_add_memory and __add_pages.
> > > > struct mhp_restrictions contains flags which specify additional
> > > > features to be enabled by memory hotplug (MHP_MEMBLOCK_API
> > > > currently) and an altmap for the alternative memmap allocator.
> > > > 
> > > > Please note that the complete altmap propagation down to the vmemmap code
> > > > is still not done in this patch. It will be done in a follow-up to
> > > > reduce the churn here.
> > > > 
> > > > This patch shouldn't introduce any functional change.
> > > 
> > > Is there an agreement on the interface here? Or do we want to hide the altmap
> > > behind some more general-looking interface? If the former is true, can
> > > we merge it, as it touches code that might cause merge conflicts later on
> > > since multiple people are working on this area?
> > 
> > Uhm, I think that the interface is fine for now.
> > I thought about providing some callbacks to build the altmap layout, but I
> > realized that it was overcomplicated and I would rather start easy.
> > Maybe the naming could be changed to what David suggested, something like
> > "mhp_options", which actually looks more generic and allows us to stuff more
> > things into it should the need arise in the future.
> > But that is something that can come afterwards, I guess.
> > 
> > But merging this now is not a bad idea, taking into account that some people
> > are working on the same area and merge conflicts arise easily.
> > Otherwise reworking it for every version is going to be a pita.
> 
> I do not care a wee bit about the naming, TBH. Do as you like. But please
> repost just these two patches and we can discuss the rest of this feature
> in a separate discussion.

Sure, I will repost them in the next hour (just want to check that everything
is alright).

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2019-04-04 12:04 UTC | newest]

Thread overview: 42+ messages
2019-03-28 13:43 [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory Oscar Salvador
2019-03-28 13:43 ` [PATCH 1/4] mm, memory_hotplug: cleanup memory offline path Oscar Salvador
2019-04-03  8:43   ` Michal Hocko
2019-03-28 13:43 ` [PATCH 2/4] mm, memory_hotplug: provide a more generic restrictions for memory hotplug Oscar Salvador
2019-04-03  8:46   ` Michal Hocko
2019-04-03  8:48     ` David Hildenbrand
2019-04-04 10:04     ` Oscar Salvador
2019-04-04 10:06       ` David Hildenbrand
2019-04-04 10:31       ` Michal Hocko
2019-04-04 12:04         ` Oscar Salvador
2019-03-28 13:43 ` [PATCH 3/4] mm, memory_hotplug: allocate memmap from the added memory range for sparse-vmemmap Oscar Salvador
2019-03-28 13:43 ` [PATCH 4/4] mm, sparse: rename kmalloc_section_memmap, __kfree_section_memmap Oscar Salvador
2019-03-28 15:09 ` [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory David Hildenbrand
2019-03-28 15:31   ` David Hildenbrand
2019-03-29  8:45     ` Oscar Salvador
2019-03-29  8:56       ` David Hildenbrand
2019-03-29  9:01         ` David Hildenbrand
2019-03-29  9:20         ` Oscar Salvador
2019-03-29 13:42       ` Michal Hocko
2019-04-01  7:59         ` Oscar Salvador
2019-04-01 11:53           ` Michal Hocko
2019-04-02  8:28             ` Oscar Salvador
2019-04-02  8:39               ` David Hildenbrand
2019-04-02 12:48               ` Michal Hocko
2019-04-03  8:01                 ` Oscar Salvador
2019-04-03  8:12                   ` Michal Hocko
2019-04-03  8:17                     ` David Hildenbrand
2019-04-03  8:37                       ` Michal Hocko
2019-04-03  8:41                         ` David Hildenbrand
2019-04-03  8:49                           ` Michal Hocko
2019-04-03  8:53                             ` David Hildenbrand
2019-04-03  8:50                           ` Oscar Salvador
2019-04-03  8:54                             ` David Hildenbrand
2019-04-03  9:40                         ` Oscar Salvador
2019-04-03 10:46                           ` Michal Hocko
2019-04-04 10:25                           ` Vlastimil Babka
2019-04-03  8:34                     ` Oscar Salvador
2019-04-03  8:36                       ` David Hildenbrand
2019-03-29  8:30   ` Oscar Salvador
2019-03-29  8:51     ` David Hildenbrand
2019-03-29 22:23 ` John Hubbard
2019-04-01  7:52   ` Oscar Salvador
