* [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
@ 2018-05-23 15:11 David Hildenbrand
  2018-05-23 15:11   ` David Hildenbrand
                   ` (10 more replies)
  0 siblings, 11 replies; 42+ messages in thread
From: David Hildenbrand @ 2018-05-23 15:11 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, David Hildenbrand, Alexander Potapenko,
	Andrew Morton, Andrey Ryabinin, Balbir Singh, Baoquan He,
	Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams,
	Dave Young, Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini,
	Huang Ying, Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara,
	Jérôme Glisse, Joonsoo Kim, Juergen Gross,
	Kate Stewart, Kirill A. Shutemov, Matthew Wilcox, Mel Gorman,
	Michael Ellerman, Michal Hocko, Miles Chen, Oscar Salvador,
	Paul Mackerras, Pavel Tatashin, Philippe Ombredanne,
	Rashmica Gupta, Reza Arbab, Souptick Joarder, Tetsuo Handa,
	Thomas Gleixner, Vlastimil Babka

This is now the !RFC version. I did some additional tests and inspected
all memory notifiers. At least page_ext and kasan need fixes.

==========

I am right now working on a paravirtualized memory device ("virtio-mem").
Such a device controls a memory region and the amount of memory available
via it. Memory is not indicated/added/onlined via ACPI and friends; the
device driver is responsible for that.

When the device driver starts up, it will add and online the requested
amount of memory from its assigned physical memory region. On request, it
can either add (online) more memory or try to remove (offline) memory. As
it will be a virtio driver, we also want to be able to build it as a
loadable kernel module.

Such a device can be thought of as a "resizable DIMM" or a "huge
number of 4MB DIMMs" that can be automatically managed.

As we want to be able to add/remove small chunks of memory to a VM without
fragmenting guest memory ("it's not what the guest pays for" and "what if
the hypervisor wants to use huge pages"), it looks like we can do that
under Linux at a 4MB granularity by using online_pages()/offline_pages().

We add a section and online only 4MB blocks of it on demand. The remaining
memory might therefore not be accessible. For kdump and the
onlining/offlining code, we have to mark pages as offline before a new
section is visible to the system
(e.g. as these pages might not be backed by real memory in the hypervisor).
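The add-then-online-on-demand scheme can be sketched as a small userspace model (illustrative only, not kernel code; it assumes 4KB pages, 128MB sections and a 4MB online/offline unit, which are common x86-64 values):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative sizes; the real values come from the architecture. */
#define PAGE_SIZE_B		4096UL
#define SECTION_B		(128UL << 20)	/* one memory section */
#define CHUNK_B			(4UL << 20)	/* online/offline unit */
#define PAGES_PER_SECTION	(SECTION_B / PAGE_SIZE_B)
#define PAGES_PER_CHUNK		(CHUNK_B / PAGE_SIZE_B)

enum page_state { P_OFFLINE, P_ONLINE };

/* Adding a section marks every page offline first, so kdump and the
 * onlining/offlining code never see a half-initialized state. */
static void add_section(enum page_state *pages)
{
	for (unsigned long i = 0; i < PAGES_PER_SECTION; i++)
		pages[i] = P_OFFLINE;
}

/* Online a single 4MB chunk of the section on demand. */
static void online_chunk(enum page_state *pages, unsigned long chunk)
{
	for (unsigned long i = 0; i < PAGES_PER_CHUNK; i++)
		pages[chunk * PAGES_PER_CHUNK + i] = P_ONLINE;
}

/* The section itself may only be marked offline once all pages are. */
static bool section_fully_offline(const enum page_state *pages)
{
	for (unsigned long i = 0; i < PAGES_PER_SECTION; i++)
		if (pages[i] != P_OFFLINE)
			return false;
	return true;
}
```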

This is not a balloon driver. Main differences:
- We can add more memory to a VM without having to use a mixture of
  technologies - e.g. ACPI for plugging, balloon for unplugging (in contrast
  to virtio-balloon).
- The device is responsible for its own memory only - will not inflate on
  any system memory. (in contrast to all balloons)
- Works on a coarser granularity (e.g. 4MB because that's what we can
  online/offline in Linux). We are not using the buddy allocator when
  unplugging but really search for chunks of memory we can offline. We
  actually can support arbitrary block sizes. (in contrast to all balloons)
- That's why we don't fragment guest memory.
- A device can belong to exactly one NUMA node. This way we can online/
  offline memory in a fine-grained, NUMA-aware way, even if the guest does
  not know how to spell NUMA. (in contrast to all balloons)
- Architectures that don't have proper memory hotplug interfaces (e.g. s390x)
  get memory hotplug support. I have a prototype for s390x.
- Once all 4MB chunks of a memory block are offline, we can remove the
  memory block and therefore the struct pages. (in contrast to all balloons)

This essentially allows us to add/remove 4MB chunks to/from a VM. Especially
without caring about the future when adding memory ("If I add a 128GB DIMM
I can only unplug 128GB again") or running into limits ("If I want my VM to
grow to 4TB, I have to plug at least 16GB per DIMM").
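The numbers in the 4TB example work out as follows (assuming the frequently cited limit of roughly 256 hotpluggable DIMM slots; that limit is an assumption here, not part of this series):

```c
#include <assert.h>

/* How many units of unit_bytes are needed to reach target_bytes? */
static unsigned long units_needed(unsigned long target_bytes,
				  unsigned long unit_bytes)
{
	return target_bytes / unit_bytes;
}
```

With 256 slots, reaching 4TB forces 16GB DIMMs (4TB / 256 = 16GB), each of which can only be unplugged as a whole; with 4MB chunks, the same 4TB is covered by 1,048,576 independently removable units.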

Future work:
 - Performance improvements
 - Be smarter about which blocks to offline first (e.g. free ones)
 - Automatically manage assignment to the NORMAL/MOVABLE zone to make
   unplug more likely to succeed.
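The "offline free blocks first" item can be sketched as a selection heuristic (a hypothetical userspace sketch; the function name and interface are invented for illustration):

```c
#include <assert.h>

/* Prefer a chunk whose pages are all free (offlining it needs no page
 * migration); otherwise fall back to the chunk with the most free pages. */
static int pick_offline_candidate(const unsigned int *free_pages, int nchunks,
				  unsigned int pages_per_chunk)
{
	int best = -1;

	for (int i = 0; i < nchunks; i++) {
		if (free_pages[i] == pages_per_chunk)
			return i;	/* fully free: cheapest to offline */
		if (best < 0 || free_pages[i] > free_pages[best])
			best = i;
	}
	return best;
}
```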

I will post the next prototype of virtio-mem shortly. This time for real :)

==========

RFCv2 -> v1:
- "mm: introduce and use PageOffline()"
-- fix set_page_address() handling for WANT_PAGE_VIRTUAL
- Include "mm/page_ext.c: support online/offline of memory < section size"
- Include "kasan: prepare for online/offline of different start/size"
- Include "mm/memory_hotplug: onlining pages can only fail due to notifiers"


David Hildenbrand (10):
  mm: introduce and use PageOffline()
  mm/page_ext.c: support online/offline of memory < section size
  kasan: prepare for online/offline of different start/size
  kdump: include PAGE_OFFLINE_MAPCOUNT_VALUE in VMCOREINFO
  mm/memory_hotplug: limit offline_pages() to sizes we can actually
    handle
  mm/memory_hotplug: onlining pages can only fail due to notifiers
  mm/memory_hotplug: print only with DEBUG_VM in online/offline_pages()
  mm/memory_hotplug: allow to control onlining/offlining of memory by a
    driver
  mm/memory_hotplug: teach offline_pages() to not try forever
  mm/memory_hotplug: allow online/offline memory by a kernel module

 arch/powerpc/platforms/powernv/memtrace.c |   2 +-
 drivers/base/memory.c                     |  25 +--
 drivers/base/node.c                       |   1 -
 drivers/xen/balloon.c                     |   2 +-
 include/linux/memory.h                    |   2 +-
 include/linux/memory_hotplug.h            |  20 ++-
 include/linux/mm.h                        |  10 ++
 include/linux/page-flags.h                |   9 ++
 kernel/crash_core.c                       |   1 +
 mm/kasan/kasan.c                          | 107 ++++++++-----
 mm/memory_hotplug.c                       | 180 +++++++++++++++++-----
 mm/page_alloc.c                           |  32 ++--
 mm/page_ext.c                             |   9 +-
 mm/sparse.c                               |  25 ++-
 14 files changed, 315 insertions(+), 110 deletions(-)

-- 
2.17.0

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v1 01/10] mm: introduce and use PageOffline()
  2018-05-23 15:11 [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver David Hildenbrand
@ 2018-05-23 15:11   ` David Hildenbrand
  2018-05-23 15:11 ` [PATCH v1 02/10] mm/page_ext.c: support online/offline of memory < section size David Hildenbrand
                     ` (9 subsequent siblings)
  10 siblings, 0 replies; 42+ messages in thread
From: David Hildenbrand @ 2018-05-23 15:11 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, David Hildenbrand, Greg Kroah-Hartman, Ingo Molnar,
	Andrew Morton, Pavel Tatashin, Philippe Ombredanne,
	Thomas Gleixner, Dan Williams, Michal Hocko, Jan Kara,
	Kirill A. Shutemov, Jérôme Glisse, Matthew Wilcox,
	Souptick Joarder, Hugh Dickins, Huang Ying, Miles Chen,
	Vlastimil Babka, Reza Arbab, Mel Gorman, Tetsuo Handa

offline_pages() theoretically works on sub-section sizes. The problem is
that we have no way to know which pages are actually offline, so right now
offline_pages() will always mark the whole section as offline.

In addition, in virtualized environments, we might soon have pages that are
logically offline and shall no longer be read or written - e.g. because
we offline a subsection and told our hypervisor to remove it. We need a way
(e.g. for kdump) to flag these pages (like PG_hwpoison), otherwise kdump
will happily access all memory and crash the system when accessing
memory that is not meant to be accessed.

Marking pages as offline will later also allow us to give kdump that
information and to mark a section as offline once all of its pages are
offline. It is safe to use the mapcount for this, as all of these pages
have been logically removed from the system (offline_pages()).

This e.g. allows us to add/remove memory to/from a VM in 4MB chunks.

Please note that we can't use PG_reserved for this. PG_reserved does not
imply that
- a page should not be dumped
- a page is offline and we should mark the section offline
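The marker works like the existing buddy/balloon/kmemcg page types: a distinct negative value stored in the otherwise-unused _mapcount of a page that has no mappings. A rough userspace model (the sentinel values match the ones used by this series; the struct is a stand-in for struct page, where _mapcount is actually an atomic):

```c
#include <assert.h>

/* Sentinel values as in include/linux/page-flags.h with this series. */
#define PAGE_BUDDY_MAPCOUNT_VALUE	(-128)
#define PAGE_BALLOON_MAPCOUNT_VALUE	(-256)
#define PAGE_KMEMCG_MAPCOUNT_VALUE	(-512)
#define PAGE_OFFLINE_MAPCOUNT_VALUE	(-1024)

struct page_model {
	int _mapcount;	/* -1 for a freshly initialized, unmapped page */
};

static void set_page_offline(struct page_model *p)
{
	p->_mapcount = PAGE_OFFLINE_MAPCOUNT_VALUE;
}

static void clear_page_offline(struct page_model *p)
{
	p->_mapcount = -1;
}

static int page_is_offline(const struct page_model *p)
{
	return p->_mapcount == PAGE_OFFLINE_MAPCOUNT_VALUE;
}
```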

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jan Kara <jack@suse.cz>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Miles Chen <miles.chen@mediatek.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/base/node.c        |  1 -
 include/linux/memory.h     |  1 -
 include/linux/mm.h         | 10 ++++++++++
 include/linux/page-flags.h |  9 +++++++++
 mm/memory_hotplug.c        | 32 +++++++++++++++++++++++---------
 mm/page_alloc.c            | 32 ++++++++++++++++++--------------
 mm/sparse.c                | 25 ++++++++++++++++++++++++-
 7 files changed, 84 insertions(+), 26 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 7a3a580821e0..58a889b2b2f4 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -408,7 +408,6 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid,
 	if (!mem_blk)
 		return -EFAULT;
 
-	mem_blk->nid = nid;
 	if (!node_online(nid))
 		return 0;
 
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 31ca3e28b0eb..9f8cd856ca1e 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -33,7 +33,6 @@ struct memory_block {
 	void *hw;			/* optional pointer to fw/hw data */
 	int (*phys_callback)(struct memory_block *);
 	struct device dev;
-	int nid;			/* NID for this memory block */
 };
 
 int arch_get_memory_phys_device(unsigned long start_pfn);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c6fa9a255dbf..197c251be775 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1111,7 +1111,15 @@ static inline void set_page_address(struct page *page, void *address)
 {
 	page->virtual = address;
 }
+static inline void set_page_virtual(struct page *page, enum zone_type zone)
+{
+	/* The shift won't overflow because ZONE_NORMAL is below 4G. */
+	if (!is_highmem_idx(zone))
+		set_page_address(page, __va(page_to_pfn(page) << PAGE_SHIFT));
+}
 #define page_address_init()  do { } while(0)
+#else
+#define set_page_virtual(page, zone)  do { } while(0)
 #endif
 
 #if defined(HASHED_PAGE_VIRTUAL)
@@ -2063,6 +2071,8 @@ extern unsigned long find_min_pfn_with_active_regions(void);
 extern void free_bootmem_with_active_regions(int nid,
 						unsigned long max_low_pfn);
 extern void sparse_memory_present_with_active_regions(int nid);
+extern void __meminit init_single_page(struct page *page, unsigned long pfn,
+				       unsigned long zone, int nid);
 
 #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e34a27727b9a..07ec6e48073b 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -686,6 +686,15 @@ PAGE_MAPCOUNT_OPS(Balloon, BALLOON)
 #define PAGE_KMEMCG_MAPCOUNT_VALUE		(-512)
 PAGE_MAPCOUNT_OPS(Kmemcg, KMEMCG)
 
+/*
+ * PageOffline() indicates that a page is offline (either never online via
+ * online_pages() or offlined via offline_pages()). Nobody in the system
+ * should have a reference to these pages. In virtual environments,
+ * the backing storage might already have been removed. Don't touch!
+ */
+#define PAGE_OFFLINE_MAPCOUNT_VALUE		(-1024)
+PAGE_MAPCOUNT_OPS(Offline, OFFLINE)
+
 extern bool is_free_buddy_page(struct page *page);
 
 __PAGEFLAG(Isolated, isolated, PF_ANY);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f74826cdceea..7f7bd2acb55b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -250,6 +250,7 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
 		struct vmem_altmap *altmap, bool want_memblock)
 {
 	int ret;
+	int i;
 
 	if (pfn_valid(phys_start_pfn))
 		return -EEXIST;
@@ -258,6 +259,25 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
 	if (ret < 0)
 		return ret;
 
+	/*
+	 * Mark all the pages in the section as offline before creating the
+	 * memory block and onlining any sub-sections (and therefore marking
+	 * the whole section as online). Mark them reserved so nobody will
+	 * stumble over a half-initialized state.
+	 */
+	for (i = 0; i < PAGES_PER_SECTION; i++) {
+		unsigned long pfn = phys_start_pfn + i;
+		struct page *page;
+		if (!pfn_valid(pfn))
+			continue;
+		page = pfn_to_page(pfn);
+
+		/* dummy zone, the actual one will be set when onlining pages */
+		init_single_page(page, pfn, ZONE_NORMAL, nid);
+		SetPageReserved(page);
+		__SetPageOffline(page);
+	}
+
 	if (!want_memblock)
 		return 0;
 
@@ -651,6 +671,7 @@ EXPORT_SYMBOL_GPL(__online_page_increment_counters);
 
 void __online_page_free(struct page *page)
 {
+	__ClearPageOffline(page);
 	__free_reserved_page(page);
 }
 EXPORT_SYMBOL_GPL(__online_page_free);
@@ -891,15 +912,8 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	int nid;
 	int ret;
 	struct memory_notify arg;
-	struct memory_block *mem;
-
-	/*
-	 * We can't use pfn_to_nid() because nid might be stored in struct page
-	 * which is not yet initialized. Instead, we find nid from memory block.
-	 */
-	mem = find_memory_block(__pfn_to_section(pfn));
-	nid = mem->nid;
 
+	nid = pfn_to_nid(pfn);
 	/* associate pfn range with the zone */
 	zone = move_pfn_range(online_type, nid, pfn, nr_pages);
 
@@ -1426,7 +1440,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 }
 
 /*
- * remove from free_area[] and mark all as Reserved.
+ * remove from free_area[] and mark all as Reserved and Offline.
  */
 static int
 offline_isolated_pages_cb(unsigned long start, unsigned long nr_pages,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 905db9d7962f..cd75686ff1d7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1171,7 +1171,7 @@ static void free_one_page(struct zone *zone,
 	spin_unlock(&zone->lock);
 }
 
-static void __meminit __init_single_page(struct page *page, unsigned long pfn,
+void __meminit init_single_page(struct page *page, unsigned long pfn,
 				unsigned long zone, int nid)
 {
 	mm_zero_struct_page(page);
@@ -1181,11 +1181,7 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn,
 	page_cpupid_reset_last(page);
 
 	INIT_LIST_HEAD(&page->lru);
-#ifdef WANT_PAGE_VIRTUAL
-	/* The shift won't overflow because ZONE_NORMAL is below 4G. */
-	if (!is_highmem_idx(zone))
-		set_page_address(page, __va(pfn << PAGE_SHIFT));
-#endif
+	set_page_virtual(page, zone);
 }
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
@@ -1206,7 +1202,7 @@ static void __meminit init_reserved_page(unsigned long pfn)
 		if (pfn >= zone->zone_start_pfn && pfn < zone_end_pfn(zone))
 			break;
 	}
-	__init_single_page(pfn_to_page(pfn), pfn, zid, nid);
+	init_single_page(pfn_to_page(pfn), pfn, zid, nid);
 }
 #else
 static inline void init_reserved_page(unsigned long pfn)
@@ -1523,7 +1519,7 @@ static unsigned long  __init deferred_init_pages(int nid, int zid,
 		} else {
 			page++;
 		}
-		__init_single_page(page, pfn, zid, nid);
+		init_single_page(page, pfn, zid, nid);
 		nr_pages++;
 	}
 	return (nr_pages);
@@ -5514,9 +5510,13 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 
 not_early:
 		page = pfn_to_page(pfn);
-		__init_single_page(page, pfn, zone, nid);
-		if (context == MEMMAP_HOTPLUG)
-			SetPageReserved(page);
+		if (context == MEMMAP_HOTPLUG) {
+			/* everything but the zone was initialized */
+			set_page_zone(page, zone);
+			set_page_virtual(page, zone);
+		}
+		else
+			init_single_page(page, pfn, zone, nid);
 
 		/*
 		 * Mark the block movable so that blocks are reserved for
@@ -6404,7 +6404,7 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
 #ifdef CONFIG_HAVE_MEMBLOCK
 /*
  * Only struct pages that are backed by physical memory are zeroed and
- * initialized by going through __init_single_page(). But, there are some
+ * initialized by going through init_single_page(). But, there are some
  * struct pages which are reserved in memblock allocator and their fields
  * may be accessed (for example page_to_pfn() on some configuration accesses
  * flags). We must explicitly zero those struct pages.
@@ -8005,7 +8005,6 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 			break;
 	if (pfn == end_pfn)
 		return;
-	offline_mem_sections(pfn, end_pfn);
 	zone = page_zone(pfn_to_page(pfn));
 	spin_lock_irqsave(&zone->lock, flags);
 	pfn = start_pfn;
@@ -8022,11 +8021,13 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 		if (unlikely(!PageBuddy(page) && PageHWPoison(page))) {
 			pfn++;
 			SetPageReserved(page);
+			__SetPageOffline(page);
 			continue;
 		}
 
 		BUG_ON(page_count(page));
 		BUG_ON(!PageBuddy(page));
+		BUG_ON(PageOffline(page));
 		order = page_order(page);
 #ifdef CONFIG_DEBUG_VM
 		pr_info("remove from free list %lx %d %lx\n",
@@ -8035,11 +8036,14 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 		list_del(&page->lru);
 		rmv_page_order(page);
 		zone->free_area[order].nr_free--;
-		for (i = 0; i < (1 << order); i++)
+		for (i = 0; i < (1 << order); i++) {
 			SetPageReserved((page+i));
+			__SetPageOffline(page + i);
+		}
 		pfn += (1 << order);
 	}
 	spin_unlock_irqrestore(&zone->lock, flags);
+	offline_mem_sections(start_pfn, end_pfn);
 }
 #endif
 
diff --git a/mm/sparse.c b/mm/sparse.c
index 73dc2fcc0eab..645d8a4e41a0 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -623,7 +623,24 @@ void online_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-/* Mark all memory sections within the pfn range as online */
+static bool all_pages_in_section_offline(unsigned long section_nr)
+{
+	unsigned long pfn = section_nr_to_pfn(section_nr);
+	struct page *page;
+	int i;
+
+	for (i = 0; i < PAGES_PER_SECTION; i++, pfn++) {
+		if (!pfn_valid(pfn))
+			continue;
+
+		page = pfn_to_page(pfn);
+		if (!PageOffline(page))
+			return false;
+	}
+	return true;
+}
+
+/* Try to mark all memory sections within the pfn range as offline */
 void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
 {
 	unsigned long pfn;
@@ -639,6 +656,12 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
 		if (WARN_ON(!valid_section_nr(section_nr)))
 			continue;
 
+		/* if we don't cover whole sections, check all pages */
+		if ((section_nr_to_pfn(section_nr) != start_pfn ||
+		     start_pfn + PAGES_PER_SECTION >= end_pfn) &&
+		    !all_pages_in_section_offline(section_nr))
+			continue;
+
 		ms = __nr_to_section(section_nr);
 		ms->section_mem_map &= ~SECTION_IS_ONLINE;
 	}
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v1 02/10] mm/page_ext.c: support online/offline of memory < section size
  2018-05-23 15:11 [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver David Hildenbrand
  2018-05-23 15:11   ` David Hildenbrand
@ 2018-05-23 15:11 ` David Hildenbrand
  2018-05-23 15:11 ` [PATCH v1 03/10] kasan: prepare for online/offline of different start/size David Hildenbrand
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 42+ messages in thread
From: David Hildenbrand @ 2018-05-23 15:11 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, David Hildenbrand, Andrew Morton, Michal Hocko,
	Vlastimil Babka, Oscar Salvador, Kate Stewart, Jaewon Kim,
	Greg Kroah-Hartman

Right now, we would free the extended page data if parts of a section
are offlined or if onlining is aborted, even though some pages are
still online.

We can simply check whether the section is online to see if we are
allowed to free. init_section_page_ext() already takes care of the
allocation part for sub-sections.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oscar Salvador <osalvador@techadventures.net>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Jaewon Kim <jaewon31.kim@samsung.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/page_ext.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/page_ext.c b/mm/page_ext.c
index 5295ef331165..71a025128dac 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -320,7 +320,9 @@ static int __meminit online_page_ext(unsigned long start_pfn,
 
 	/* rollback */
 	for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
-		__free_page_ext(pfn);
+		/* still online? nothing to do then */
+		if (!online_section_nr(pfn_to_section_nr(pfn)))
+			__free_page_ext(pfn);
 
 	return -ENOMEM;
 }
@@ -334,7 +336,10 @@ static int __meminit offline_page_ext(unsigned long start_pfn,
 	end = SECTION_ALIGN_UP(start_pfn + nr_pages);
 
 	for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
-		__free_page_ext(pfn);
+		/* still online? nothing to do then */
+		if (!online_section_nr(pfn_to_section_nr(pfn)))
+			__free_page_ext(pfn);
+
 	return 0;
 
 }
-- 
2.17.0


* [PATCH v1 03/10] kasan: prepare for online/offline of different start/size
  2018-05-23 15:11 [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver David Hildenbrand
  2018-05-23 15:11   ` David Hildenbrand
  2018-05-23 15:11 ` [PATCH v1 02/10] mm/page_ext.c: support online/offline of memory < section size David Hildenbrand
@ 2018-05-23 15:11 ` David Hildenbrand
  2018-05-23 15:11 ` [PATCH v1 04/10] kdump: include PAGE_OFFLINE_MAPCOUNT_VALUE in VMCOREINFO David Hildenbrand
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 42+ messages in thread
From: David Hildenbrand @ 2018-05-23 15:11 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, David Hildenbrand, Andrey Ryabinin,
	Alexander Potapenko, Dmitry Vyukov, kasan-dev

The memory notifier has an important restriction right now: it only
works if offline_pages() is called with the same parameters as
online_pages().

To overcome this restriction, let's handle it per section. We could do
it at a smaller granularity, but then we would get more vm_area overhead
and could not cleanly check for the actually online parts.

A section is marked online as soon as at least one page is online.
Similarly, a section is marked offline as soon as all pages are offline.

So handling it on a per-section basis allows us to be more flexible. We
assume here that a section is not split between boot and hotplug
memory.

Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: kasan-dev@googlegroups.com
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/kasan/kasan.c | 107 ++++++++++++++++++++++++++++++-----------------
 1 file changed, 69 insertions(+), 38 deletions(-)

diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c
index a8b85706e2d6..901601a562a9 100644
--- a/mm/kasan/kasan.c
+++ b/mm/kasan/kasan.c
@@ -827,62 +827,93 @@ static bool shadow_mapped(unsigned long addr)
 	return !pte_none(*pte);
 }
 
-static int __meminit kasan_mem_notifier(struct notifier_block *nb,
-			unsigned long action, void *data)
+static void kasan_offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 {
-	struct memory_notify *mem_data = data;
-	unsigned long nr_shadow_pages, start_kaddr, shadow_start;
-	unsigned long shadow_end, shadow_size;
+	unsigned long start = SECTION_ALIGN_DOWN(start_pfn);
+	unsigned long end = SECTION_ALIGN_UP(start_pfn + nr_pages);
+	unsigned long pfn;
 
-	nr_shadow_pages = mem_data->nr_pages >> KASAN_SHADOW_SCALE_SHIFT;
-	start_kaddr = (unsigned long)pfn_to_kaddr(mem_data->start_pfn);
-	shadow_start = (unsigned long)kasan_mem_to_shadow((void *)start_kaddr);
-	shadow_size = nr_shadow_pages << PAGE_SHIFT;
-	shadow_end = shadow_start + shadow_size;
+	for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
+		void *addr, *shadow_start;
+		struct vm_struct *vm;
 
-	if (WARN_ON(mem_data->nr_pages % KASAN_SHADOW_SCALE_SIZE) ||
-		WARN_ON(start_kaddr % (KASAN_SHADOW_SCALE_SIZE << PAGE_SHIFT)))
-		return NOTIFY_BAD;
+		/* still online? nothing to do then */
+		if (online_section_nr(pfn_to_section_nr(pfn)))
+			continue;
 
-	switch (action) {
-	case MEM_GOING_ONLINE: {
-		void *ret;
+		addr = pfn_to_kaddr(pfn);
+		shadow_start = kasan_mem_to_shadow(addr);
+
+		/*
+		 * Only hot-added memory has a vm_area. Freeing shadow mapped
+		 * during boot would be tricky, so we'll just have to keep it.
+		 */
+		vm = find_vm_area(shadow_start);
+		if (vm)
+			vfree(shadow_start);
+	}
+}
+
+static int kasan_online_pages(unsigned long start_pfn, unsigned long nr_pages)
+{
+	unsigned long start = SECTION_ALIGN_DOWN(start_pfn);
+	unsigned long end = SECTION_ALIGN_UP(start_pfn + nr_pages);
+	unsigned long pfn;
+
+	for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
+		unsigned long shadow_start, shadow_size;
+		void *addr, *ret;
+
+		/* already online? nothing to do then */
+		if (online_section_nr(pfn_to_section_nr(pfn)))
+			continue;
+
+		addr = pfn_to_kaddr(pfn);
+		shadow_size = (PAGES_PER_SECTION << PAGE_SHIFT) >>
+			      KASAN_SHADOW_SCALE_SHIFT;
+		shadow_start = (unsigned long)kasan_mem_to_shadow(addr);
 
 		/*
 		 * If shadow is mapped already than it must have been mapped
-		 * during the boot. This could happen if we onlining previously
+		 * during boot. This could happen if we're onlining previously
 		 * offlined memory.
 		 */
 		if (shadow_mapped(shadow_start))
-			return NOTIFY_OK;
+			continue;
 
 		ret = __vmalloc_node_range(shadow_size, PAGE_SIZE, shadow_start,
-					shadow_end, GFP_KERNEL,
-					PAGE_KERNEL, VM_NO_GUARD,
-					pfn_to_nid(mem_data->start_pfn),
-					__builtin_return_address(0));
+					   shadow_start + shadow_size,
+					   GFP_KERNEL, PAGE_KERNEL, VM_NO_GUARD,
+					   pfn_to_nid(pfn),
+					   __builtin_return_address(0));
 		if (!ret)
-			return NOTIFY_BAD;
-
+			goto out_free;
 		kmemleak_ignore(ret);
-		return NOTIFY_OK;
 	}
-	case MEM_CANCEL_ONLINE:
-	case MEM_OFFLINE: {
-		struct vm_struct *vm;
+	return 0;
+out_free:
+	kasan_offline_pages(start_pfn, nr_pages);
+	return -ENOMEM;
+}
 
-		/*
-		 * Only hot-added memory have vm_area. Freeing shadow
-		 * mapped during boot would be tricky, so we'll just
-		 * have to keep it.
-		 */
-		vm = find_vm_area((void *)shadow_start);
-		if (vm)
-			vfree((void *)shadow_start);
-	}
+static int __meminit kasan_mem_notifier(struct notifier_block *nb,
+			unsigned long action, void *data)
+{
+	struct memory_notify *mem_data = data;
+	int ret = 0;
+
+	switch (action) {
+	case MEM_GOING_ONLINE:
+		ret = kasan_online_pages(mem_data->start_pfn,
+					 mem_data->nr_pages);
+		break;
+	case MEM_CANCEL_ONLINE:
+	case MEM_OFFLINE:
+		kasan_offline_pages(mem_data->start_pfn, mem_data->nr_pages);
+		break;
 	}
 
-	return NOTIFY_OK;
+	return notifier_from_errno(ret);
 }
 
 static int __init kasan_memhotplug_init(void)
-- 
2.17.0


* [PATCH v1 04/10] kdump: include PAGE_OFFLINE_MAPCOUNT_VALUE in VMCOREINFO
  2018-05-23 15:11 [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver David Hildenbrand
                   ` (2 preceding siblings ...)
  2018-05-23 15:11 ` [PATCH v1 03/10] kasan: prepare for online/offline of different start/size David Hildenbrand
@ 2018-05-23 15:11 ` David Hildenbrand
  2018-05-23 15:11 ` [PATCH v1 05/10] mm/memory_hotplug: limit offline_pages() to sizes we can actually handle David Hildenbrand
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 42+ messages in thread
From: David Hildenbrand @ 2018-05-23 15:11 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, David Hildenbrand, Andrew Morton, Dave Young,
	Baoquan He, Hari Bathini, Kirill A. Shutemov

This allows dump tools to skip pages that are offline.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 kernel/crash_core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index f7674d676889..c0a45e9ba84e 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -464,6 +464,7 @@ static int __init crash_save_vmcoreinfo_init(void)
 #ifdef CONFIG_HUGETLB_PAGE
 	VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR);
 #endif
+	VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE);
 
 	arch_crash_save_vmcoreinfo();
 	update_vmcoreinfo_note();
-- 
2.17.0


* [PATCH v1 05/10] mm/memory_hotplug: limit offline_pages() to sizes we can actually handle
  2018-05-23 15:11 [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver David Hildenbrand
                   ` (3 preceding siblings ...)
  2018-05-23 15:11 ` [PATCH v1 04/10] kdump: include PAGE_OFFLINE_MAPCOUNT_VALUE in VMCOREINFO David Hildenbrand
@ 2018-05-23 15:11 ` David Hildenbrand
  2018-05-23 15:11 ` [PATCH v1 06/10] mm/memory_hotplug: onlining pages can only fail due to notifiers David Hildenbrand
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 42+ messages in thread
From: David Hildenbrand @ 2018-05-23 15:11 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, David Hildenbrand, Andrew Morton, Vlastimil Babka,
	Michal Hocko, Dan Williams, Joonsoo Kim, Greg Kroah-Hartman,
	Pavel Tatashin, Reza Arbab, Thomas Gleixner

We have to take care of MAX_ORDER. Page blocks might contain references
to the next page block, so sometimes a page block cannot be offlined
independently. E.g. on x86: the page block size is 2MB, while
MAX_ORDER - 1 (10) allows 4MB allocations.

E.g. a buddy page could either overlap at the beginning or the end of the
range to offline. While the end case could be handled easily (shrink the
buddy page), overlaps at the beginning are hard to handle (unknown page
order).

Let's document offline_pages() while at it.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/memory_hotplug.h |  6 ++++++
 mm/memory_hotplug.c            | 22 ++++++++++++++++++----
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index e0e49b5b1ee1..d71829d54360 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -294,6 +294,12 @@ static inline void pgdat_resize_init(struct pglist_data *pgdat) {}
 #endif /* !(CONFIG_MEMORY_HOTPLUG || CONFIG_DEFERRED_STRUCT_PAGE_INIT) */
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
+/*
+ * Isolation and offlining code cannot deal with pages (e.g. buddy)
+ * overlapping with the range to be offlined yet.
+ */
+#define offline_nr_pages	max((unsigned long)pageblock_nr_pages, \
+				    (unsigned long)MAX_ORDER_NR_PAGES)
 
 extern bool is_mem_section_removable(unsigned long pfn, unsigned long nr_pages);
 extern void try_offline_node(int nid);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7f7bd2acb55b..c971295a1100 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1599,10 +1599,9 @@ static int __ref __offline_pages(unsigned long start_pfn,
 	struct zone *zone;
 	struct memory_notify arg;
 
-	/* at least, alignment against pageblock is necessary */
-	if (!IS_ALIGNED(start_pfn, pageblock_nr_pages))
+	if (!IS_ALIGNED(start_pfn, offline_nr_pages))
 		return -EINVAL;
-	if (!IS_ALIGNED(end_pfn, pageblock_nr_pages))
+	if (!IS_ALIGNED(end_pfn, offline_nr_pages))
 		return -EINVAL;
 	/* This makes hotplug much easier...and readable.
 	   we assume this for now. .*/
@@ -1700,7 +1699,22 @@ static int __ref __offline_pages(unsigned long start_pfn,
 	return ret;
 }
 
-/* Must be protected by mem_hotplug_begin() or a device_lock */
+/**
+ * offline_pages - offline pages in a given range (that are currently online)
+ * @start_pfn: start pfn of the memory range
+ * @nr_pages: the number of pages
+ *
+ * This function tries to offline the given pages. The alignment/size that
+ * can be used is given by offline_nr_pages.
+ *
+ * Returns 0 if successful, -EBUSY if the pages cannot be offlined,
+ * -EINVAL if start_pfn/nr_pages is not properly aligned or not in a zone,
+ * and -EINTR if interrupted by a signal.
+ *
+ * Bad things will happen if pages in the range are already offline.
+ *
+ * Must be protected by mem_hotplug_begin() or a device_lock
+ */
 int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 {
 	return __offline_pages(start_pfn, start_pfn + nr_pages);
-- 
2.17.0


* [PATCH v1 06/10] mm/memory_hotplug: onlining pages can only fail due to notifiers
  2018-05-23 15:11 [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver David Hildenbrand
                   ` (4 preceding siblings ...)
  2018-05-23 15:11 ` [PATCH v1 05/10] mm/memory_hotplug: limit offline_pages() to sizes we can actually handle David Hildenbrand
@ 2018-05-23 15:11 ` David Hildenbrand
  2018-05-23 15:11 ` [PATCH v1 07/10] mm/memory_hotplug: print only with DEBUG_VM in online/offline_pages() David Hildenbrand
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 42+ messages in thread
From: David Hildenbrand @ 2018-05-23 15:11 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, David Hildenbrand, Andrew Morton, Michal Hocko,
	Vlastimil Babka, Dan Williams, Reza Arbab, Pavel Tatashin,
	Thomas Gleixner

Onlining pages can only fail if a notifier reported a problem (e.g. -ENOMEM).
Remove and restructure error handling. While at it, document how
online_pages() can be used right now.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory_hotplug.c | 47 +++++++++++++++++++++++++++++----------------
 1 file changed, 30 insertions(+), 17 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c971295a1100..8c0b7d85252b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -902,7 +902,26 @@ static struct zone * __meminit move_pfn_range(int online_type, int nid,
 	return zone;
 }
 
-/* Must be protected by mem_hotplug_begin() or a device_lock */
+/**
+ * online_pages - online pages in a given range (that are currently offline)
+ * @start_pfn: start pfn of the memory range
+ * @nr_pages: the number of pages
+ * @online_type: how to online pages (esp. to which zone to add them)
+ *
+ * This function onlines the given pages. Usually, any alignment / size
+ * can be used. However, all pages of memory to be removed later on in
+ * one piece via remove_memory() should be onlined the same way and at
+ * least the first page should be onlined if anything else is onlined.
+ * The zone of the first page is used to fixup zones when removing memory
+ * later on (see __remove_pages()).
+ *
+ * Returns 0 if successful, an error code if a memory notifier reported a
+ *         problem (e.g. -ENOMEM).
+ *
+ * Bad things will happen if pages in the range are already online.
+ *
+ * Must be protected by mem_hotplug_begin() or a device_lock
+ */
 int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_type)
 {
 	unsigned long flags;
@@ -923,8 +942,13 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 
 	ret = memory_notify(MEM_GOING_ONLINE, &arg);
 	ret = notifier_to_errno(ret);
-	if (ret)
-		goto failed_addition;
+	if (ret) {
+		pr_debug("online_pages [mem %#010llx-%#010llx] failed\n",
+			 (unsigned long long) pfn << PAGE_SHIFT,
+			 (((unsigned long long) pfn + nr_pages) << PAGE_SHIFT) - 1);
+		memory_notify(MEM_CANCEL_ONLINE, &arg);
+		return ret;
+	}
 
 	/*
 	 * If this zone is not populated, then it is not in zonelist.
@@ -936,13 +960,9 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 		setup_zone_pageset(zone);
 	}
 
-	ret = walk_system_ram_range(pfn, nr_pages, &onlined_pages,
-		online_pages_range);
-	if (ret) {
-		if (need_zonelists_rebuild)
-			zone_pcp_reset(zone);
-		goto failed_addition;
-	}
+	/* onlining pages cannot fail */
+	walk_system_ram_range(pfn, nr_pages, &onlined_pages,
+			      online_pages_range);
 
 	zone->present_pages += onlined_pages;
 
@@ -972,13 +992,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	if (onlined_pages)
 		memory_notify(MEM_ONLINE, &arg);
 	return 0;
-
-failed_addition:
-	pr_debug("online_pages [mem %#010llx-%#010llx] failed\n",
-		 (unsigned long long) pfn << PAGE_SHIFT,
-		 (((unsigned long long) pfn + nr_pages) << PAGE_SHIFT) - 1);
-	memory_notify(MEM_CANCEL_ONLINE, &arg);
-	return ret;
 }
 #endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
 
-- 
2.17.0


* [PATCH v1 07/10] mm/memory_hotplug: print only with DEBUG_VM in online/offline_pages()
  2018-05-23 15:11 [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver David Hildenbrand
                   ` (5 preceding siblings ...)
  2018-05-23 15:11 ` [PATCH v1 06/10] mm/memory_hotplug: onlining pages can only fail due to notifiers David Hildenbrand
@ 2018-05-23 15:11 ` David Hildenbrand
  2018-05-23 15:11 ` [PATCH v1 08/10] mm/memory_hotplug: allow to control onlining/offlining of memory by a driver David Hildenbrand
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 42+ messages in thread
From: David Hildenbrand @ 2018-05-23 15:11 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, David Hildenbrand, Andrew Morton, Michal Hocko,
	Vlastimil Babka, Dan Williams, Reza Arbab, Pavel Tatashin,
	Thomas Gleixner

Let's try to minimize the noise.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory_hotplug.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 8c0b7d85252b..27f7c27f57ac 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -943,9 +943,11 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	ret = memory_notify(MEM_GOING_ONLINE, &arg);
 	ret = notifier_to_errno(ret);
 	if (ret) {
+#ifdef CONFIG_DEBUG_VM
 		pr_debug("online_pages [mem %#010llx-%#010llx] failed\n",
 			 (unsigned long long) pfn << PAGE_SHIFT,
 			 (((unsigned long long) pfn + nr_pages) << PAGE_SHIFT) - 1);
+#endif
 		memory_notify(MEM_CANCEL_ONLINE, &arg);
 		return ret;
 	}
@@ -1668,7 +1670,9 @@ static int __ref __offline_pages(unsigned long start_pfn,
 	offlined_pages = check_pages_isolated(start_pfn, end_pfn);
 	if (offlined_pages < 0)
 		goto repeat;
+#ifdef CONFIG_DEBUG_VM
 	pr_info("Offlined Pages %ld\n", offlined_pages);
+#endif
 	/* Ok, all of our target is isolated.
 	   We cannot do rollback at this point. */
 	offline_isolated_pages(start_pfn, end_pfn);
@@ -1703,9 +1707,11 @@ static int __ref __offline_pages(unsigned long start_pfn,
 	return 0;
 
 failed_removal:
+#ifdef CONFIG_DEBUG_VM
 	pr_debug("memory offlining [mem %#010llx-%#010llx] failed\n",
 		 (unsigned long long) start_pfn << PAGE_SHIFT,
 		 ((unsigned long long) end_pfn << PAGE_SHIFT) - 1);
+#endif
 	memory_notify(MEM_CANCEL_OFFLINE, &arg);
 	/* pushback to free area */
 	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
-- 
2.17.0


* [PATCH v1 08/10] mm/memory_hotplug: allow to control onlining/offlining of memory by a driver
  2018-05-23 15:11 [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver David Hildenbrand
                   ` (6 preceding siblings ...)
  2018-05-23 15:11 ` [PATCH v1 07/10] mm/memory_hotplug: print only with DEBUG_VM in online/offline_pages() David Hildenbrand
@ 2018-05-23 15:11 ` David Hildenbrand
  2018-05-23 15:11 ` [PATCH v1 09/10] mm/memory_hotplug: teach offline_pages() to not try forever David Hildenbrand
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 42+ messages in thread
From: David Hildenbrand @ 2018-05-23 15:11 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, David Hildenbrand, Greg Kroah-Hartman,
	Boris Ostrovsky, Juergen Gross, Ingo Molnar, Andrew Morton,
	Pavel Tatashin, Vlastimil Babka, Michal Hocko, Dan Williams,
	Joonsoo Kim, Reza Arbab, Thomas Gleixner

Some devices (esp. paravirtualized) might want to control
- when to online/offline a memory block
- how to online memory (MOVABLE/NORMAL)
- in which granularity to online/offline memory

So let's add a new flag "driver_managed" and disallow changing the
state from user space. Device onlining/offlining will still work;
however, the memory will not actually be onlined/offlined. That has to
be handled by the device driver that owns the memory.

Please note that we still have to create user-visible memory blocks,
since they are required to trigger the right udev events in order to
reload kexec/kdump. They also allow us to see what is going on in the
system (e.g. which memory blocks are still around).

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/base/memory.c          | 22 ++++++++++++++--------
 drivers/xen/balloon.c          |  2 +-
 include/linux/memory.h         |  1 +
 include/linux/memory_hotplug.h |  4 +++-
 mm/memory_hotplug.c            | 34 ++++++++++++++++++++++++++++++++--
 5 files changed, 51 insertions(+), 12 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index bffe8616bd55..3b8616551561 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -231,27 +231,28 @@ static bool pages_correctly_probed(unsigned long start_pfn)
  * Must already be protected by mem_hotplug_begin().
  */
 static int
-memory_block_action(unsigned long phys_index, unsigned long action, int online_type)
+memory_block_action(struct memory_block *mem, unsigned long action)
 {
-	unsigned long start_pfn;
+	unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
 	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
-	int ret;
+	int ret = 0;
 
-	start_pfn = section_nr_to_pfn(phys_index);
+	if (mem->driver_managed)
+		return 0;
 
 	switch (action) {
 	case MEM_ONLINE:
 		if (!pages_correctly_probed(start_pfn))
 			return -EBUSY;
 
-		ret = online_pages(start_pfn, nr_pages, online_type);
+		ret = online_pages(start_pfn, nr_pages, mem->online_type);
 		break;
 	case MEM_OFFLINE:
 		ret = offline_pages(start_pfn, nr_pages);
 		break;
 	default:
 		WARN(1, KERN_WARNING "%s(%ld, %ld) unknown action: "
-		     "%ld\n", __func__, phys_index, action, action);
+		     "%ld\n", __func__, mem->start_section_nr, action, action);
 		ret = -EINVAL;
 	}
 
@@ -269,8 +270,7 @@ static int memory_block_change_state(struct memory_block *mem,
 	if (to_state == MEM_OFFLINE)
 		mem->state = MEM_GOING_OFFLINE;
 
-	ret = memory_block_action(mem->start_section_nr, to_state,
-				mem->online_type);
+	ret = memory_block_action(mem, to_state);
 
 	mem->state = ret ? from_state_req : to_state;
 
@@ -350,6 +350,11 @@ store_mem_state(struct device *dev,
 	 */
 	mem_hotplug_begin();
 
+	if (mem->driver_managed) {
+		ret = -EINVAL;
+		goto out;
+	}
+
 	switch (online_type) {
 	case MMOP_ONLINE_KERNEL:
 	case MMOP_ONLINE_MOVABLE:
@@ -364,6 +369,7 @@ store_mem_state(struct device *dev,
 		ret = -EINVAL; /* should never happen */
 	}
 
+out:
 	mem_hotplug_done();
 err:
 	unlock_device_hotplug();
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 065f0b607373..89981d573c06 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -401,7 +401,7 @@ static enum bp_state reserve_additional_memory(void)
 	 * callers drop the mutex before trying again.
 	 */
 	mutex_unlock(&balloon_mutex);
-	rc = add_memory_resource(nid, resource, memhp_auto_online);
+	rc = add_memory_resource(nid, resource, memhp_auto_online, false);
 	mutex_lock(&balloon_mutex);
 
 	if (rc) {
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 9f8cd856ca1e..018c5e5ecde1 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -29,6 +29,7 @@ struct memory_block {
 	unsigned long state;		/* serialized by the dev->lock */
 	int section_count;		/* serialized by mem_sysfs_mutex */
 	int online_type;		/* for passing data to online routine */
+	bool driver_managed;		/* driver handles online/offline */
 	int phys_device;		/* to which fru does this belong? */
 	void *hw;			/* optional pointer to fw/hw data */
 	int (*phys_callback)(struct memory_block *);
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index d71829d54360..497e28f5b000 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -326,7 +326,9 @@ static inline void remove_memory(int nid, u64 start, u64 size) {}
 extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
 		void *arg, int (*func)(struct memory_block *, void *));
 extern int add_memory(int nid, u64 start, u64 size);
-extern int add_memory_resource(int nid, struct resource *resource, bool online);
+extern int add_memory_driver_managed(int nid, u64 start, u64 size);
+extern int add_memory_resource(int nid, struct resource *resource, bool online,
+			       bool driver_managed);
 extern int arch_add_memory(int nid, u64 start, u64 size,
 		struct vmem_altmap *altmap, bool want_memblock);
 extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 27f7c27f57ac..1610e214bfc8 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1124,8 +1124,15 @@ static int online_memory_block(struct memory_block *mem, void *arg)
 	return device_online(&mem->dev);
 }
 
+static int mark_memory_block_driver_managed(struct memory_block *mem, void *arg)
+{
+	mem->driver_managed = true;
+	return 0;
+}
+
 /* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */
-int __ref add_memory_resource(int nid, struct resource *res, bool online)
+int __ref add_memory_resource(int nid, struct resource *res, bool online,
+			      bool driver_managed)
 {
 	u64 start, size;
 	pg_data_t *pgdat = NULL;
@@ -1133,6 +1140,9 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
 	bool new_node;
 	int ret;
 
+	if (online && driver_managed)
+		return -EINVAL;
+
 	start = res->start;
 	size = resource_size(res);
 
@@ -1204,6 +1214,9 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
 	if (online)
 		walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1),
 				  NULL, online_memory_block);
+	else if (driver_managed)
+		walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1),
+				  NULL, mark_memory_block_driver_managed);
 
 	goto out;
 
@@ -1228,13 +1241,30 @@ int __ref add_memory(int nid, u64 start, u64 size)
 	if (IS_ERR(res))
 		return PTR_ERR(res);
 
-	ret = add_memory_resource(nid, res, memhp_auto_online);
+	ret = add_memory_resource(nid, res, memhp_auto_online, false);
 	if (ret < 0)
 		release_memory_resource(res);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(add_memory);
 
+int __ref add_memory_driver_managed(int nid, u64 start, u64 size)
+{
+	struct resource *res;
+	int ret;
+
+	res = register_memory_resource(start, size);
+	if (IS_ERR(res))
+		return PTR_ERR(res);
+
+	ret = add_memory_resource(nid, res, false, true);
+	if (ret < 0)
+		release_memory_resource(res);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(add_memory_driver_managed);
+
+
 #ifdef CONFIG_MEMORY_HOTREMOVE
 /*
  * A free page on the buddy free lists (not the per-cpu lists) has PageBuddy
-- 
2.17.0


* [PATCH v1 09/10] mm/memory_hotplug: teach offline_pages() to not try forever
  2018-05-23 15:11 [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver David Hildenbrand
                   ` (7 preceding siblings ...)
  2018-05-23 15:11 ` [PATCH v1 08/10] mm/memory_hotplug: allow to control onlining/offlining of memory by a driver David Hildenbrand
@ 2018-05-23 15:11 ` David Hildenbrand
  2018-05-24 14:39   ` Michal Hocko
  2018-05-23 15:11 ` [PATCH v1 10/10] mm/memory_hotplug: allow online/offline memory by a kernel module David Hildenbrand
  2018-05-24  7:53 ` [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver Michal Hocko
  10 siblings, 1 reply; 42+ messages in thread
From: David Hildenbrand @ 2018-05-23 15:11 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, David Hildenbrand, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Greg Kroah-Hartman,
	Rashmica Gupta, Balbir Singh, Andrew Morton, Michal Hocko,
	Vlastimil Babka, Dan Williams, Joonsoo Kim, Pavel Tatashin,
	Reza Arbab, Thomas Gleixner

It can easily happen that we get stuck forever trying to offline
pages, e.g. due to persistent errors.

Let's add a way to change this behavior and fail fast.

This is interesting if offline_pages() is called from a driver and we
just want to find some block to offline.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rashmica Gupta <rashmica.g@gmail.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/powerpc/platforms/powernv/memtrace.c |  2 +-
 drivers/base/memory.c                     |  2 +-
 include/linux/memory_hotplug.h            |  8 ++++----
 mm/memory_hotplug.c                       | 14 ++++++++++----
 4 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c
index fc222a0c2ac4..8ce71f7e1558 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -110,7 +110,7 @@ static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 nr_pages)
 	walk_memory_range(start_pfn, end_pfn, (void *)MEM_GOING_OFFLINE,
 			  change_memblock_state);
 
-	if (offline_pages(start_pfn, nr_pages)) {
+	if (offline_pages(start_pfn, nr_pages, true)) {
 		walk_memory_range(start_pfn, end_pfn, (void *)MEM_ONLINE,
 				  change_memblock_state);
 		return false;
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 3b8616551561..c785e4c01b23 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -248,7 +248,7 @@ memory_block_action(struct memory_block *mem, unsigned long action)
 		ret = online_pages(start_pfn, nr_pages, mem->online_type);
 		break;
 	case MEM_OFFLINE:
-		ret = offline_pages(start_pfn, nr_pages);
+		ret = offline_pages(start_pfn, nr_pages, true);
 		break;
 	default:
 		WARN(1, KERN_WARNING "%s(%ld, %ld) unknown action: "
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 497e28f5b000..ae53017b54df 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -303,7 +303,8 @@ static inline void pgdat_resize_init(struct pglist_data *pgdat) {}
 
 extern bool is_mem_section_removable(unsigned long pfn, unsigned long nr_pages);
 extern void try_offline_node(int nid);
-extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
+extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
+			 bool retry_forever);
 extern void remove_memory(int nid, u64 start, u64 size);
 
 #else
@@ -315,7 +316,8 @@ static inline bool is_mem_section_removable(unsigned long pfn,
 
 static inline void try_offline_node(int nid) {}
 
-static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
+static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
+				bool retry_forever)
 {
 	return -EINVAL;
 }
@@ -333,9 +335,7 @@ extern int arch_add_memory(int nid, u64 start, u64 size,
 		struct vmem_altmap *altmap, bool want_memblock);
 extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap);
-extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern bool is_memblock_offlined(struct memory_block *mem);
-extern void remove_memory(int nid, u64 start, u64 size);
 extern int sparse_add_one_section(struct pglist_data *pgdat,
 		unsigned long start_pfn, struct vmem_altmap *altmap);
 extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 1610e214bfc8..3a5845a33910 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1633,8 +1633,8 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
 		node_clear_state(node, N_MEMORY);
 }
 
-static int __ref __offline_pages(unsigned long start_pfn,
-		  unsigned long end_pfn)
+static int __ref __offline_pages(unsigned long start_pfn, unsigned long end_pfn,
+				 bool retry_forever)
 {
 	unsigned long pfn, nr_pages;
 	long offlined_pages;
@@ -1686,6 +1686,10 @@ static int __ref __offline_pages(unsigned long start_pfn,
 	pfn = scan_movable_pages(start_pfn, end_pfn);
 	if (pfn) { /* We have movable pages */
 		ret = do_migrate_range(pfn, end_pfn);
+		if (ret && !retry_forever) {
+			ret = -EBUSY;
+			goto failed_removal;
+		}
 		goto repeat;
 	}
 
@@ -1752,6 +1756,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
  * offline_pages - offline pages in a given range (that are currently online)
  * @start_pfn: start pfn of the memory range
  * @nr_pages: the number of pages
+ * @retry_forever: whether to retry (possibly) forever
  *
  * This function tries to offline the given pages. The alignment/size that
  * can be used is given by offline_nr_pages.
@@ -1764,9 +1769,10 @@ static int __ref __offline_pages(unsigned long start_pfn,
  *
  * Must be protected by mem_hotplug_begin() or a device_lock
  */
-int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
+int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
+		  bool retry_forever)
 {
-	return __offline_pages(start_pfn, start_pfn + nr_pages);
+	return __offline_pages(start_pfn, start_pfn + nr_pages, retry_forever);
 }
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
-- 
2.17.0


* [PATCH v1 10/10] mm/memory_hotplug: allow online/offline memory by a kernel module
  2018-05-23 15:11 [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver David Hildenbrand
                   ` (8 preceding siblings ...)
  2018-05-23 15:11 ` [PATCH v1 09/10] mm/memory_hotplug: teach offline_pages() to not try forever David Hildenbrand
@ 2018-05-23 15:11 ` David Hildenbrand
  2018-05-23 19:51   ` Christoph Hellwig
  2018-05-24  7:53 ` [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver Michal Hocko
  10 siblings, 1 reply; 42+ messages in thread
From: David Hildenbrand @ 2018-05-23 15:11 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, David Hildenbrand, Greg Kroah-Hartman,
	Andrew Morton, Vlastimil Babka, Michal Hocko, Dan Williams,
	Pavel Tatashin, Joonsoo Kim, Thomas Gleixner

Kernel modules that want to control how/when memory is onlined/offlined
need a proper interface to these functions. Also, for adding memory
properly, memory_block_size_bytes is required.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/base/memory.c          |  1 +
 include/linux/memory_hotplug.h |  2 ++
 mm/memory_hotplug.c            | 27 +++++++++++++++++++++++++--
 3 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index c785e4c01b23..0a7c79cfaaf8 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -88,6 +88,7 @@ unsigned long __weak memory_block_size_bytes(void)
 {
 	return MIN_MEMORY_BLOCK_SIZE;
 }
+EXPORT_SYMBOL(memory_block_size_bytes);
 
 static unsigned long get_memory_block_size(void)
 {
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index ae53017b54df..0e3e48410415 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -97,6 +97,8 @@ extern void __online_page_increment_counters(struct page *page);
 extern void __online_page_free(struct page *page);
 
 extern int try_online_node(int nid);
+extern int online_memory_blocks(uint64_t start, uint64_t size);
+extern int offline_memory_blocks(uint64_t start, uint64_t size);
 
 extern bool memhp_auto_online;
 /* If movable_node boot option specified */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 3a5845a33910..071976e5e7f6 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -89,12 +89,14 @@ void mem_hotplug_begin(void)
 	cpus_read_lock();
 	percpu_down_write(&mem_hotplug_lock);
 }
+EXPORT_SYMBOL(mem_hotplug_begin);
 
 void mem_hotplug_done(void)
 {
 	percpu_up_write(&mem_hotplug_lock);
 	cpus_read_unlock();
 }
+EXPORT_SYMBOL(mem_hotplug_done);
 
 /* add this memory to iomem resource */
 static struct resource *register_memory_resource(u64 start, u64 size)
@@ -995,6 +997,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 		memory_notify(MEM_ONLINE, &arg);
 	return 0;
 }
+EXPORT_SYMBOL(online_pages);
 #endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
 
 static void reset_node_present_pages(pg_data_t *pgdat)
@@ -1124,6 +1127,25 @@ static int online_memory_block(struct memory_block *mem, void *arg)
 	return device_online(&mem->dev);
 }
 
+static int offline_memory_block(struct memory_block *mem, void *arg)
+{
+	return device_offline(&mem->dev);
+}
+
+int online_memory_blocks(uint64_t start, uint64_t size)
+{
+	return walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1),
+				 NULL, online_memory_block);
+}
+EXPORT_SYMBOL(online_memory_blocks);
+
+int offline_memory_blocks(uint64_t start, uint64_t size)
+{
+	return walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1),
+				 NULL, offline_memory_block);
+}
+EXPORT_SYMBOL(offline_memory_blocks);
+
 static int mark_memory_block_driver_managed(struct memory_block *mem, void *arg)
 {
 	mem->driver_managed = true;
@@ -1212,8 +1234,7 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online,
 
 	/* online pages if requested */
 	if (online)
-		walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1),
-				  NULL, online_memory_block);
+		online_memory_blocks(start, size);
 	else if (driver_managed)
 		walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1),
 				  NULL, mark_memory_block_driver_managed);
@@ -1312,6 +1333,7 @@ bool is_mem_section_removable(unsigned long start_pfn, unsigned long nr_pages)
 	/* All pageblocks in the memory block are likely to be hot-removable */
 	return true;
 }
+EXPORT_SYMBOL(is_mem_section_removable);
 
 /*
  * Confirm all pages in a range [start, end) belong to the same zone.
@@ -1774,6 +1796,7 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 {
 	return __offline_pages(start_pfn, start_pfn + nr_pages, retry_forever);
 }
+EXPORT_SYMBOL(offline_pages);
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
 /**
-- 
2.17.0


* Re: [PATCH v1 10/10] mm/memory_hotplug: allow online/offline memory by a kernel module
  2018-05-23 15:11 ` [PATCH v1 10/10] mm/memory_hotplug: allow online/offline memory by a kernel module David Hildenbrand
@ 2018-05-23 19:51   ` Christoph Hellwig
  2018-05-24  5:59     ` David Hildenbrand
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2018-05-23 19:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Greg Kroah-Hartman, Andrew Morton,
	Vlastimil Babka, Michal Hocko, Dan Williams, Pavel Tatashin,
	Joonsoo Kim, Thomas Gleixner

On Wed, May 23, 2018 at 05:11:51PM +0200, David Hildenbrand wrote:
> Kernel modules that want to control how/when memory is onlined/offlined
> need a proper interface to these functions. Also, for adding memory
> properly, memory_block_size_bytes is required.

Which module?  Please send it along with the enabling code.

> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -88,6 +88,7 @@ unsigned long __weak memory_block_size_bytes(void)
>  {
>  	return MIN_MEMORY_BLOCK_SIZE;
>  }
> +EXPORT_SYMBOL(memory_block_size_bytes);

> +EXPORT_SYMBOL(mem_hotplug_begin);

> +EXPORT_SYMBOL(mem_hotplug_done);

EXPORT_SYMBOL_GPL for any deep down VM internals, please.


* Re: [PATCH v1 10/10] mm/memory_hotplug: allow online/offline memory by a kernel module
  2018-05-23 19:51   ` Christoph Hellwig
@ 2018-05-24  5:59     ` David Hildenbrand
  0 siblings, 0 replies; 42+ messages in thread
From: David Hildenbrand @ 2018-05-24  5:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-mm, linux-kernel, Greg Kroah-Hartman, Andrew Morton,
	Vlastimil Babka, Michal Hocko, Dan Williams, Pavel Tatashin,
	Joonsoo Kim, Thomas Gleixner

On 23.05.2018 21:51, Christoph Hellwig wrote:
> On Wed, May 23, 2018 at 05:11:51PM +0200, David Hildenbrand wrote:
>> Kernel modules that want to control how/when memory is onlined/offlined
>> need a proper interface to these functions. Also, for adding memory
>> properly, memory_block_size_bytes is required.
> 
> Which module?  Please send it along with the enabling code.

Hi,

as indicated in the cover letter, it is called "virtio-mem".
I sent it yesterday as a separate series (RFC).

Cover letter: https://lkml.org/lkml/2018/5/23/800
Relevant patch: https://lkml.org/lkml/2018/5/23/803

> 
>> --- a/drivers/base/memory.c
>> +++ b/drivers/base/memory.c
>> @@ -88,6 +88,7 @@ unsigned long __weak memory_block_size_bytes(void)
>>  {
>>  	return MIN_MEMORY_BLOCK_SIZE;
>>  }
>> +EXPORT_SYMBOL(memory_block_size_bytes);
> 
>> +EXPORT_SYMBOL(mem_hotplug_begin);
> 
>> +EXPORT_SYMBOL(mem_hotplug_done);
> 
> EXPORT_SYMBOL_GPL for any deep down VM internals, please.
> 

I continued using what was already being used for symbols in this file.
If there are no other opinions, I'll switch to EXPORT_SYMBOL_GPL. Thanks!

-- 

Thanks,

David / dhildenb


* Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
  2018-05-23 15:11 [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver David Hildenbrand
                   ` (9 preceding siblings ...)
  2018-05-23 15:11 ` [PATCH v1 10/10] mm/memory_hotplug: allow online/offline memory by a kernel module David Hildenbrand
@ 2018-05-24  7:53 ` Michal Hocko
  2018-05-24  8:31   ` David Hildenbrand
  10 siblings, 1 reply; 42+ messages in thread
From: Michal Hocko @ 2018-05-24  7:53 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Alexander Potapenko, Andrew Morton,
	Andrey Ryabinin, Balbir Singh, Baoquan He,
	Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams,
	Dave Young, Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini,
	Huang Ying, Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara,
	Jérôme Glisse, Joonsoo Kim, Juergen Gross,
	Kate Stewart, Kirill A. Shutemov, Matthew Wilcox, Mel Gorman,
	Michael Ellerman, Miles Chen, Oscar Salvador, Paul Mackerras,
	Pavel Tatashin, Philippe Ombredanne, Rashmica Gupta, Reza Arbab,
	Souptick Joarder, Tetsuo Handa, Thomas Gleixner, Vlastimil Babka

I've had some questions before and I am not sure they are fully covered.
At least not in the cover letter (I didn't get much further yet) which
should give us a highlevel overview of the feature.

On Wed 23-05-18 17:11:41, David Hildenbrand wrote:
> This is now the !RFC version. I did some additional tests and inspected
> all memory notifiers. At least page_ext and kasan need fixes.
> 
> ==========
> 
> I am right now working on a paravirtualized memory device ("virtio-mem").
> These devices control a memory region and the amount of memory available
> via it. Memory will not be indicated/added/onlined via ACPI and friends,
> the device driver is responsible for it.
> 
> When the device driver starts up, it will add and online the requested
> amount of memory from its assigned physical memory region. On request, it
> can add (online) either more memory or try to remove (offline) memory. As
> it will be a virtio module, we also want to be able to have it as a loadable
> kernel module.

How do you handle the offline case? Do you online all the memory to
zone_movable?

> Such a device can be thought of like a "resizable DIMM" or a "huge
> number of 4MB DIMMS" that can be automatically managed.

Why do we need such a small granularity? The whole memory hotplug is
centered around memory sections and those are 128MB in size. Smaller
sizes simply do not fit into that concept. How do you deal with that?

> As we want to be able to add/remove small chunks of memory to a VM without
> fragmenting guest memory ("it's not what the guest pays for" and "what if
> the hypervisor wants to use huge pages"), it looks like we can do that
> under Linux in a 4MB granularity by using online_pages()/offline_pages()

Please expand on this some more. Larger logical units usually lead to a
smaller fragmentation.

> We add a segment and online only 4MB blocks of it on demand. So the other
> memory might not be accessible.

But you still allocate vmemmap for the full memory section, right? That
would mean that you spend 2MB to online 4MB of memory. Sounds quite
wasteful to me.

> For kdump and onlining/offlining code, we
> have to mark pages as offline before a new segment is visible to the system
> (e.g. as these pages might not be backed by real memory in the hypervisor).

Please expand on the kdump part. That is really confusing because
hotplug should simply not depend on kdump at all. Moreover why don't you
simply mark those pages reserved and pull them out from the page
allocator?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
  2018-05-24  7:53 ` [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver Michal Hocko
@ 2018-05-24  8:31   ` David Hildenbrand
  2018-05-24  8:56     ` Dave Young
  2018-05-24  9:31     ` Michal Hocko
  0 siblings, 2 replies; 42+ messages in thread
From: David Hildenbrand @ 2018-05-24  8:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Alexander Potapenko, Andrew Morton,
	Andrey Ryabinin, Balbir Singh, Baoquan He,
	Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams,
	Dave Young, Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini,
	Huang Ying, Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara,
	Jérôme Glisse, Joonsoo Kim, Juergen Gross,
	Kate Stewart, Kirill A. Shutemov, Matthew Wilcox, Mel Gorman,
	Michael Ellerman, Miles Chen, Oscar Salvador, Paul Mackerras,
	Pavel Tatashin, Philippe Ombredanne, Rashmica Gupta, Reza Arbab,
	Souptick Joarder, Tetsuo Handa, Thomas Gleixner, Vlastimil Babka

On 24.05.2018 09:53, Michal Hocko wrote:
> I've had some questions before and I am not sure they are fully covered.
> At least not in the cover letter (I didn't get much further yet) which
> should give us a highlevel overview of the feature.

Sure, I can give you more details. Adding all details to the cover
letter will result in a cover letter that nobody will read :)

> 
> On Wed 23-05-18 17:11:41, David Hildenbrand wrote:
>> This is now the !RFC version. I did some additional tests and inspected
>> all memory notifiers. At least page_ext and kasan need fixes.
>>
>> ==========
>>
>> I am right now working on a paravirtualized memory device ("virtio-mem").
>> These devices control a memory region and the amount of memory available
>> via it. Memory will not be indicated/added/onlined via ACPI and friends,
>> the device driver is responsible for it.
>>
>> When the device driver starts up, it will add and online the requested
>> amount of memory from its assigned physical memory region. On request, it
>> can add (online) either more memory or try to remove (offline) memory. As
>> it will be a virtio module, we also want to be able to have it as a loadable
>> kernel module.
> 
> How do you handle the offline case? Do you online all the memory to
> zone_movable?

Right now everything is added to ZONE_NORMAL. I have some plans to
change that, but that will require more work (auto assigning to
ZONE_MOVABLE or ZONE_NORMAL depending on certain heuristics). For now
this showcases that offlining of memory actually works on that
granularity and it can be used in some scenarios. To make unplug more
reliable, more work is needed.

> 
>> Such a device can be thought of like a "resizable DIMM" or a "huge
>> number of 4MB DIMMS" that can be automatically managed.
> 
> Why do we need such a small granularity? The whole memory hotplug is
> centered around memory sections and those are 128MB in size. Smaller
> sizes simply do not fit into that concept. How do you deal with that?

1. Why do we need such a small granularity?

Because we can :) No, honestly, on s390x it is 256MB and, if I am not
wrong, on certain x86/arm machines it is even 1 or 2GB. This simply gives
more flexibility when plugging memory. (thinking about cloud environments)

Allowing to unplug such small chunks is actually the interesting thing.
Try unplugging a 128MB DIMM. With ZONE_NORMAL: pretty much impossible.
With ZONE_MOVABLE: maybe possible. Try to find one 4MB chunk of a
"128MB" DIMM that can be unplugged: With ZONE_NORMAL: maybe possible.
With ZONE_MOVABLE: likely possible.

But let's not go into the discussion of ZONE_MOVABLE vs. ZONE_NORMAL, I
plan to work on that in the future.

Think about it that way: A compromise between section based memory
hotplug and page based ballooning.


2. memory hotplug and 128MB size

Interesting point. One thing to note is that "The whole memory hotplug
is centered around memory sections and those are 128MB in size" is
simply the current state of how it is implemented in Linux, nothing more.

E.g. Windows supports 1MB DIMMs. So the statement "Smaller sizes simply
do not fit into that concept" is wrong. It simply does not fit
*perfectly* into the way it is implemented right now in Linux. But as
this patch series shows, this is just a minor drawback we can easily
work around.

"How do you deal with that?"

I think the question is answered below: add a section and online only
parts of it.


> 
>> As we want to be able to add/remove small chunks of memory to a VM without
>> fragmenting guest memory ("it's not what the guest pays for" and "what if
>> the hypervisor wants to use huge pages"), it looks like we can do that
>> under Linux in a 4MB granularity by using online_pages()/offline_pages()
> 
> Please expand on this some more. Larger logical units usually lead to a
> smaller fragmentation.

I want to avoid what balloon drivers do: rip out random pages,
fragmenting guest memory until we eventually trigger the OOM killer. So
instead, using 4MB chunks produces no fragmentation. And if I can't find
such a chunk anymore: bad luck. At least I won't be risking stability of
my guest.

Does that answer your question?

> 
>> We add a segment and online only 4MB blocks of it on demand. So the other
>> memory might not be accessible.
> 
> But you still allocate vmemmap for the full memory section, right? That
> would mean that you spend 2MB to online 4MB of memory. Sounds quite
> wasteful to me.

This is true for the first 4MB chunk of a section, but not for the
remaining ones. Of course, I try to minimize the number of such sections
(ideally, this would only be one "open section" for a virtio-mem device
when only plugging memory). So once we online further 4MB chunks, the
overhead gets smaller and smaller.

> 
>> For kdump and onlining/offlining code, we
>> have to mark pages as offline before a new segment is visible to the system
>> (e.g. as these pages might not be backed by real memory in the hypervisor).
> 
> Please expand on the kdump part. That is really confusing because
> hotplug should simply not depend on kdump at all. Moreover why don't you
> simply mark those pages reserved and pull them out from the page
> allocator?

1. "hotplug should simply not depend on kdump at all"

In theory yes. In the current state we already have to trigger kdump to
reload whenever we add/remove a memory block.


2. kdump part

Whenever we offline a page and tell the hypervisor about it ("unplug"),
we should not assume that we can read that page again. Now, if dumping
tools assume they can read all memory that is offline, we are in trouble.

It is the same thing as we already have with PG_hwpoison. Just a
different meaning - "don't touch this page, it is offline" compared to
"don't touch this page, hw is broken".

Balloon drivers solve this problem by always allowing unplugged memory
to be read. In virtio-mem, this cannot and should not be guaranteed.

And what we have to do to make this work is actually pretty simple: Just
like PG_hwpoison, track per page whether it is online and provide this
information to kdump.


3. Marking pages reserved and pulling them out of the page allocator

This is basically what e.g. the XEN balloon does. I don't see how that
helps related to kdump. Can you explain how that would solve any of the
problems I am trying to solve here? This does neither solve the unplug
part nor the "tell dump tools to not read such memory" part.

Thanks for looking into this.

-- 

Thanks,

David / dhildenb


* Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
  2018-05-24  8:31   ` David Hildenbrand
@ 2018-05-24  8:56     ` Dave Young
  2018-05-24  9:14       ` David Hildenbrand
  2018-05-24  9:31     ` Michal Hocko
  1 sibling, 1 reply; 42+ messages in thread
From: Dave Young @ 2018-05-24  8:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Hocko, linux-mm, linux-kernel, Alexander Potapenko,
	Andrew Morton, Andrey Ryabinin, Balbir Singh, Baoquan He,
	Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams,
	Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini, Huang Ying,
	Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara,
	Jérôme Glisse, Joonsoo Kim, Juergen Gross,
	Kate Stewart, Kirill A. Shutemov, Matthew Wilcox, Mel Gorman,
	Michael Ellerman, Miles Chen, Oscar Salvador, Paul Mackerras,
	Pavel Tatashin, Philippe Ombredanne, Rashmica Gupta, Reza Arbab,
	Souptick Joarder, Tetsuo Handa, Thomas Gleixner, Vlastimil Babka

Hi,

[snip]
> > 
> >> For kdump and onlining/offlining code, we
> >> have to mark pages as offline before a new segment is visible to the system
> >> (e.g. as these pages might not be backed by real memory in the hypervisor).
> > 
> > Please expand on the kdump part. That is really confusing because
> > hotplug should simply not depend on kdump at all. Moreover why don't you
> > simply mark those pages reserved and pull them out from the page
> > allocator?
> 
> 1. "hotplug should simply not depend on kdump at all"
> 
> In theory yes. In the current state we already have to trigger kdump to
> reload whenever we add/remove a memory block.
> 
> 
> 2. kdump part
> 
> Whenever we offline a page and tell the hypervisor about it ("unplug"),
> we should not assume that we can read that page again. Now, if dumping
> tools assume they can read all memory that is offline, we are in trouble.
> 
> It is the same thing as we already have with Pg_hwpoison. Just a
> different meaning - "don't touch this page, it is offline" compared to
> "don't touch this page, hw is broken".

Does that mean, in case of an offline, no kdump reload as mentioned in 1)?

If we have the offline event and reload kdump, I assume the memory state
is refreshed so kdump will not read the memory offlined, am I missing
something?

> 
> Balloon drivers solve this problem by always allowing to read unplugged
> memory. In virtio-mem, this cannot and should even not be guaranteed.
> 

Hmm, that sounds like a bug..

> And what we have to do to make this work is actually pretty simple: Just
> like Pg_hwpoison, track per page if it is online and provide this
> information to kdump.
> 
> 

Thanks
Dave


* Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
  2018-05-24  8:56     ` Dave Young
@ 2018-05-24  9:14       ` David Hildenbrand
  2018-05-28  8:28         ` Dave Young
  0 siblings, 1 reply; 42+ messages in thread
From: David Hildenbrand @ 2018-05-24  9:14 UTC (permalink / raw)
  To: Dave Young
  Cc: Michal Hocko, linux-mm, linux-kernel, Alexander Potapenko,
	Andrew Morton, Andrey Ryabinin, Balbir Singh, Baoquan He,
	Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams,
	Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini, Huang Ying,
	Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara,
	Jérôme Glisse, Joonsoo Kim, Juergen Gross,
	Kate Stewart, Kirill A. Shutemov, Matthew Wilcox, Mel Gorman,
	Michael Ellerman, Miles Chen, Oscar Salvador, Paul Mackerras,
	Pavel Tatashin, Philippe Ombredanne, Rashmica Gupta, Reza Arbab,
	Souptick Joarder, Tetsuo Handa, Thomas Gleixner, Vlastimil Babka

On 24.05.2018 10:56, Dave Young wrote:
> Hi,
> 
> [snip]
>>>
>>>> For kdump and onlining/offlining code, we
>>>> have to mark pages as offline before a new segment is visible to the system
>>>> (e.g. as these pages might not be backed by real memory in the hypervisor).
>>>
>>> Please expand on the kdump part. That is really confusing because
>>> hotplug should simply not depend on kdump at all. Moreover why don't you
>>> simply mark those pages reserved and pull them out from the page
>>> allocator?
>>
>> 1. "hotplug should simply not depend on kdump at all"
>>
>> In theory yes. In the current state we already have to trigger kdump to
>> reload whenever we add/remove a memory block.
>>
>>
>> 2. kdump part
>>
>> Whenever we offline a page and tell the hypervisor about it ("unplug"),
>> we should not assume that we can read that page again. Now, if dumping
>> tools assume they can read all memory that is offline, we are in trouble.
>>
>> It is the same thing as we already have with Pg_hwpoison. Just a
>> different meaning - "don't touch this page, it is offline" compared to
>> "don't touch this page, hw is broken".
> 
> Does that means in case an offline no kdump reload as mentioned in 1)?
> 
> If we have the offline event and reload kdump, I assume the memory state
> is refreshed so kdump will not read the memory offlined, am I missing
> something?

If a whole section is offline: yes. (ACPI hotplug)

If pages are online but broken ("logically offline" - hwpoison): no

If single pages are logically offline: no. (Balloon inflation - let's
call it unplug, as that's the term some people use)

If only subsections (4MB chunks) are offline: no.

Exporting memory ranges to kdump at a granularity smaller than the
section size would (a) be heavily complicated, (b) introduce a lot of
overhead for this tracking data, and (c) make us retrigger kdump way
too often.

So simply marking pages offline in the struct pages and telling kdump
about it is the straightforward thing to do. And it is fairly easy to
add and implement, as we have the exact same thing in place for hwpoison.

> 
>>
>> Balloon drivers solve this problem by always allowing to read unplugged
>> memory. In virtio-mem, this cannot and should even not be guaranteed.
>>
> 
> Hmm, that sounds a bug..

I can give you a simple example why reading such unplugged (or balloon
inflated) memory is problematic: Huge page backed guests.

There is no zero page for huge pages. So if we allow the guest to read
that memory any time, we cannot guarantee that we actually consume less
memory in the hypervisor. This is absolutely to be avoided.

Existing balloon drivers don't support huge page backed guests. (well
you can inflate, but the hypervisor cannot madvise() 4k on a huge page,
resulting in no action being performed). This scenario is to be
supported with virtio-mem.


So yes, this is actually a bug in e.g. virtio-balloon implementations:

With "VIRTIO_BALLOON_F_MUST_TELL_HOST" we have to tell the hypervisor
before we access a page again. kdump cannot do this and does not care,
so this page is silently accessed and dumped. This is one of the main
reasons why extending virtio-balloon hypervisor implementations to
support host-enforced R/W protection is impossible.

> 
>> And what we have to do to make this work is actually pretty simple: Just
>> like Pg_hwpoison, track per page if it is online and provide this
>> information to kdump.
>>
>>
> 
> Thanks
> Dave
> 


-- 

Thanks,

David / dhildenb


* Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
  2018-05-24  8:31   ` David Hildenbrand
  2018-05-24  8:56     ` Dave Young
@ 2018-05-24  9:31     ` Michal Hocko
  2018-05-24 10:45       ` David Hildenbrand
  1 sibling, 1 reply; 42+ messages in thread
From: Michal Hocko @ 2018-05-24  9:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Alexander Potapenko, Andrew Morton,
	Andrey Ryabinin, Balbir Singh, Baoquan He,
	Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams,
	Dave Young, Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini,
	Huang Ying, Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara,
	Jérôme Glisse, Joonsoo Kim, Juergen Gross,
	Kate Stewart, Kirill A. Shutemov, Matthew Wilcox, Mel Gorman,
	Michael Ellerman, Miles Chen, Oscar Salvador, Paul Mackerras,
	Pavel Tatashin, Philippe Ombredanne, Rashmica Gupta, Reza Arbab,
	Souptick Joarder, Tetsuo Handa, Thomas Gleixner, Vlastimil Babka

On Thu 24-05-18 10:31:30, David Hildenbrand wrote:
> On 24.05.2018 09:53, Michal Hocko wrote:
> > I've had some questions before and I am not sure they are fully covered.
> > At least not in the cover letter (I didn't get much further yet) which
> > should give us a highlevel overview of the feature.
> 
> Sure, I can give you more details. Adding all details to the cover
> letter will result in a cover letter that nobody will read :)

Well, these are not details. Those are mostly highlevel design points
and integration with the existing hotplug scheme. I am definitely not
suggesting describing the code etc...

> > On Wed 23-05-18 17:11:41, David Hildenbrand wrote:
> >> This is now the !RFC version. I did some additional tests and inspected
> >> all memory notifiers. At least page_ext and kasan need fixes.
> >>
> >> ==========
> >>
> >> I am right now working on a paravirtualized memory device ("virtio-mem").
> >> These devices control a memory region and the amount of memory available
> >> via it. Memory will not be indicated/added/onlined via ACPI and friends,
> >> the device driver is responsible for it.
> >>
> >> When the device driver starts up, it will add and online the requested
> >> amount of memory from its assigned physical memory region. On request, it
> >> can add (online) either more memory or try to remove (offline) memory. As
> >> it will be a virtio module, we also want to be able to have it as a loadable
> >> kernel module.
> > 
> > How do you handle the offline case? Do you online all the memory to
> > zone_movable?
> 
> Right now everything is added to ZONE_NORMAL. I have some plans to
> change that, but that will require more work (auto assigning to
> ZONE_MOVABLE or ZONE_NORMAL depending on certain heuristics). For now
> this showcases that offlining of memory actually works on that
> granularity and it can be used in some scenarios. To make unplug more
> reliable, more work is needed.

Spell that out then. Memory offline is basically unusable for
zone_normal. So you are talking about adding memory only in practice and
it would be fair to be explicit about that.

> >> Such a device can be thought of like a "resizable DIMM" or a "huge
> >> number of 4MB DIMMS" that can be automatically managed.
> > 
> > Why do we need such a small granularity? The whole memory hotplug is
> > centered around memory sections and those are 128MB in size. Smaller
> > sizes simply do not fit into that concept. How do you deal with that?
> 
> 1. Why do we need such a small granularity?
> 
> Because we can :) No, honestly, on s390x it is 256MB and if I am not
> wrong on certain x86/arm machines it is even 1 or 2GB. This simply gives
> more flexibility when plugging memory. (thinking about cloud environments)

We can, but if that makes the memory hotplug (cluttered enough in its
current state) more complicated, then we simply won't. Or at least I will
not ack anything that goes in that direction.
 
> Allowing to unplug such small chunks is actually the interesting thing.

Not really. The vmemmap will stay behind and so you are still wasting
memory. Well, unless you want to use small ptes for the hotplug memory,
which is just too suboptimal.

> Try unplugging a 128MB DIMM. With ZONE_NORMAL: pretty much impossible.
> With ZONE_MOVABLE: maybe possible.

It should always be possible. We have some players who pin pages for
arbitrary amounts of time, even from zone movable, but we should focus on
fixing them or come up with a way to handle that. Zone movable is about
movable memory pretty much by definition.

> Try to find one 4MB chunk of a
> "128MB" DIMM that can be unplugged: With ZONE_NORMAL
> maybe possible. With ZONE_MOVABLE: likely possible.
> 
> But let's not go into the discussion of ZONE_MOVABLE vs. ZONE_NORMAL, I
> plan to work on that in the future.
> 
> Think about it that way: A compromise between section based memory
> hotplug and page based ballooning.
> 
> 
> 2. memory hotplug and 128MB size
> 
> Interesting point. One thing to note is that "The whole memory hotplug
> is centered around memory sections and those are 128MB in size" is
> simply the current state how it is implemented in Linux, nothing more.

Yes, and we do care about that because the whole memory hotplug is a
bunch of hacks duct-taped together to address very specific use cases.
It took me one year to put it into a state where my eyes do not bleed
any time I have to look there. There are still way too many problems to
address. I certainly do not want to add more complication. Quite
contrary, the whole code cries for cleanups and sanity.
 
> E.g. Windows supports 1MB DIMMs So the statement "Smaller sizes simply
> do not fit into that concept" is wrong. It simply does not fit
> *perfectly* into the way it is implemented right now in Linux. But as
> this patch series shows, this is just a minor drawback we can easily
> work around.
> 
> "How do you deal with that?"
> 
> I think the question is answered below: add a section and online only
> parts of it.
> 
> 
> > 
> >> As we want to be able to add/remove small chunks of memory to a VM without
> >> fragmenting guest memory ("it's not what the guest pays for" and "what if
> >> the hypervisor wants to use huge pages"), it looks like we can do that
> >> under Linux in a 4MB granularity by using online_pages()/offline_pages()
> > 
> > Please expand on this some more. Larger logical units usually lead to a
> > smaller fragmentation.
> 
> I want to avoid what balloon drivers do: rip out random pages,
> fragmenting guest memory until we eventually trigger the OOM killer. So
> instead, using 4MB chunks produces no fragmentation. And if I can't find
> such a chunk anymore: bad luck. At least I won't be risking stability of
> my guest.
> 
> Does that answer your question?

So you basically pull out those pages from the page allocator and mark
them offline (reserved, whatever)? Why do you need any integration into
the hotplug code base then? You should be perfectly fine to work on
top and only teach the hotplug code to recognize that your pages are
free when somebody decides to offline the whole section. I can think of
a callback that would allow that.

But then you are, well a balloon driver, aren't you?
 
> >> We add a segment and online only 4MB blocks of it on demand. So the other
> >> memory might not be accessible.
> > 
> > But you still allocate vmemmap for the full memory section, right? That
> > would mean that you spend 2MB to online 4MB of memory. Sounds quite
> > wasteful to me.
> 
> This is true for the first 4MB chunk of a section, but not for the
> remaining ones. Of course, I try to minimize the number of such sections
> (ideally, this would only be one "open section" for a virtio-mem device
> when only plugging memory). So once we online further 4MB chunk, the
> overhead gets smaller and smaller.
> 
> > 
> >> For kdump and onlining/offlining code, we
> >> have to mark pages as offline before a new segment is visible to the system
> >> (e.g. as these pages might not be backed by real memory in the hypervisor).
> > 
> > Please expand on the kdump part. That is really confusing because
> > hotplug should simply not depend on kdump at all. Moreover why don't you
> > simply mark those pages reserved and pull them out from the page
> > allocator?
> 
> 1. "hotplug should simply not depend on kdump at all"
> 
> In theory yes. In the current state we already have to trigger kdump to
> reload whenever we add/remove a memory block.

More details please.
 
> 2. kdump part
> 
> Whenever we offline a page and tell the hypervisor about it ("unplug"),
> we should not assume that we can read that page again. Now, if dumping
> tools assume they can read all memory that is offline, we are in trouble.

Sure. Just make those pages reserved. Nobody should touch those IIRC.

> It is the same thing as we already have with Pg_hwpoison. Just a
> different meaning - "don't touch this page, it is offline" compared to
> "don't touch this page, hw is broken".
> 
> Balloon drivers solve this problem by always allowing to read unplugged
> memory. In virtio-mem, this cannot and should even not be guaranteed.
> 
> And what we have to do to make this work is actually pretty simple: Just
> like Pg_hwpoison, track per page if it is online and provide this
> information to kdump.

If somebody doesn't check your new flag then you are screwed anyway.
Existing code should be quite used to PageReserved.
 
> 3. Marking pages reserved and pulling them out of the page allocator
> 
> This is basically what e.g. the XEN balloon does. I don't see how that
> helps related to kdump. Can you explain how that would solve any of the
> problems I am trying to solve here? This does neither solve the unplug
> part nor the "tell dump tools to not read such memory" part.

I still do not understand the unplug part, to be honest. And the rest
should be pretty much solvable. Or what do I miss? Your 4MB can be
perfectly emulated on top of existing sections by pulling pages out of
the allocator. How you mark those pages is an implementation detail.
PageReserved sounds like the easiest way forward.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
  2018-05-24  9:31     ` Michal Hocko
@ 2018-05-24 10:45       ` David Hildenbrand
  2018-05-24 12:03         ` Michal Hocko
  0 siblings, 1 reply; 42+ messages in thread
From: David Hildenbrand @ 2018-05-24 10:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Alexander Potapenko, Andrew Morton,
	Andrey Ryabinin, Balbir Singh, Baoquan He,
	Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams,
	Dave Young, Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini,
	Huang Ying, Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara,
	Jérôme Glisse, Joonsoo Kim, Juergen Gross,
	Kate Stewart, Kirill A. Shutemov, Matthew Wilcox, Mel Gorman,
	Michael Ellerman, Miles Chen, Oscar Salvador, Paul Mackerras,
	Pavel Tatashin, Philippe Ombredanne, Rashmica Gupta, Reza Arbab,
	Souptick Joarder, Tetsuo Handa, Thomas Gleixner, Vlastimil Babka

On 24.05.2018 11:31, Michal Hocko wrote:
> On Thu 24-05-18 10:31:30, David Hildenbrand wrote:
>> On 24.05.2018 09:53, Michal Hocko wrote:
>>> I've had some questions before and I am not sure they are fully covered.
>>> At least not in the cover letter (I didn't get much further yet) which
>>> should give us a highlevel overview of the feature.
>>
>> Sure, I can give you more details. Adding all details to the cover
>> letter will result in a cover letter that nobody will read :)
> 
> Well, these are not details. Those are mostly highlevel design points
> and integration with the existing hotplug scheme. I am definitely not
> suggesting describing the code etc...

Will definitely do as we move on.

> 
>>> On Wed 23-05-18 17:11:41, David Hildenbrand wrote:
>>>> This is now the !RFC version. I did some additional tests and inspected
>>>> all memory notifiers. At least page_ext and kasan need fixes.
>>>>
>>>> ==========
>>>>
>>>> I am right now working on a paravirtualized memory device ("virtio-mem").
>>>> These devices control a memory region and the amount of memory available
>>>> via it. Memory will not be indicated/added/onlined via ACPI and friends,
>>>> the device driver is responsible for it.
>>>>
>>>> When the device driver starts up, it will add and online the requested
>>>> amount of memory from its assigned physical memory region. On request, it
>>>> can add (online) either more memory or try to remove (offline) memory. As
>>>> it will be a virtio module, we also want to be able to have it as a loadable
>>>> kernel module.
>>>
>>> How do you handle the offline case? Do you online all the memory to
>>> zone_movable?
>>
>> Right now everything is added to ZONE_NORMAL. I have some plans to
>> change that, but that will require more work (auto assigning to
>> ZONE_MOVABLE or ZONE_NORMAL depending on certain heuristics). For now
>> this showcases that offlining of memory actually works on that
>> granularity and it can be used in some scenarios. To make unplug more
>> reliable, more work is needed.
> 
> Spell that out then. Memory offline is basically unusable for
> zone_normal. So you are talking about adding memory only in practice and
> it would be fair to be explicit about that.

Yes, I will add that. For now, hotplug works reliably (except when
already very tight on memory). Unplug is done on a "best can do" basis.
Nothing critical happens if I can't free up a 4MB chunk. Bad luck.

> 
>>>> Such a device can be thought of like a "resizable DIMM" or a "huge
>>>> number of 4MB DIMMS" that can be automatically managed.
>>>
>>> Why do we need such a small granularity? The whole memory hotplug is
>>> centered around memory sections and those are 128MB in size. Smaller
>>> sizes simply do not fit into that concept. How do you deal with that?
>>
>> 1. Why do we need such a small granularity?
>>
>> Because we can :) No, honestly, on s390x it is 256MB and if I am not
>> wrong on certain x86/arm machines it is even 1 or 2GB. This simply gives
>> more flexibility when plugging memory. (thinking about cloud environments)
> 
> We can but if that makes the memory hotplug (cluttered enough in the
> current state) more complicated then we simply won't. Or at least I will
> not ack anything that will go that direction.

I agree, but at least from the current set of patches it doesn't
increase the complexity. At least from my POV. E.g. onlining of arbitrary
sizes was always implemented. Offlining was claimed to be supported to
some extent. And I am cleaning that up here (e.g. in offline_pages() or
in the memory notifiers). I am using what the existing interfaces
promised, but fixing it and cleaning it up.

>  
>> Allowing to unplug such small chunks is actually the interesting thing.
> 
> Not really. The vmemmap will stay behind and so you are still wasting
> memory. Well, unless you want to have small ptes for the hotplug memory
> which is just too suboptimal
> 
>> Try unplugging a 128MB DIMM. With ZONE_NORMAL: pretty much impossible.
>> With ZONE_MOVABLE: maybe possible.
> 
> It should be always possible. We have some players who pin pages for
> arbitrary amount of time even from zone movable but we should focus on
> fixing them or come with a way to handle that. Zone movable is about
> movable memory pretty much by definition.

You exactly describe what has been the case for way too long. But this
is only the tip of the iceberg. Simply adding all memory to
ZONE_MOVABLE is not going to work (we create an imbalance - e.g. page
tables have to go into ZONE_NORMAL; this imbalance will have to be
managed later on). That's why I am rather thinking about said assignment
to different zones in the future. For now, using ZONE_NORMAL is the
easiest approach.

> 
>> Try to find one 4MB chunk of a
>> "128MB" DIMM that can be unplugged: With ZONE_NORMAL
>> maybe possible. With ZONE_MOVABLE: likely possible.
>>
>> But let's not go into the discussion of ZONE_MOVABLE vs. ZONE_NORMAL, I
>> plan to work on that in the future.
>>
>> Think about it that way: A compromise between section based memory
>> hotplug and page based ballooning.
>>
>>
>> 2. memory hotplug and 128MB size
>>
>> Interesting point. One thing to note is that "The whole memory hotplug
>> is centered around memory sections and those are 128MB in size" is
>> simply the current state how it is implemented in Linux, nothing more.
> 
> Yes, and we do care about that because the whole memory hotplug is a
> bunch of hacks duct taped together to address very specific usecases.
> It took me one year to put it into a state that my eyes do not bleed
> anytime I have to look there. There are still way too many problems to
> address. I certainly do not want to add more complication. Quite
> contrary, the whole code cries for cleanups and sanity.

And I highly appreciate your effort. But look at the details: I am even
cleaning up online_pages() and offline_pages(). And this is not the end
of my contributions :) This is one step in that direction. It
showcases what is easily possible right now. With existing interfaces.

>> I want to avoid what balloon drivers do: rip out random pages,
>> fragmenting guest memory until we eventually trigger the OOM killer. So
>> instead, using 4MB chunks produces no fragmentation. And if I can't find
>> such a chunk anymore: bad luck. At least I won't be risking stability of
>> my guest.
>>
>> Does that answer your question?
> 
> So you basically pull out those pages from the page allocator and mark
> them offline (reserved what ever)? Why do you need any integration to
> the hotplug code base then? You should be perfectly fine to work on
> top and only teach that hotplug code to recognize your pages are being
> free when somebody decides to offline the whole section. I can think of
> a callback that would allow that.
> 
> But then you are, well a balloon driver, aren't you?

Pointing you at: [1]

I *cannot* use the page allocator. Using it would be a potential addon
(for special cases!) in the future. I really scan for removable chunks
in the memory region a certain virtio-mem device owns. Why I can't do it:

1. I have to allocate memory in a certain physical address range (the
range that belongs to a virtio-mem device). Well, we could write an
allocator.

2. I might have to deal with chunks that are bigger than MAX_ORDER - 1.
Say my virtio-mem device has a block size of 8MB and I can only allocate
4MB. I'm out of luck then.

So, no, virtio-mem is not a balloon driver :)

>>>
>>>> For kdump and onlining/offlining code, we
>>>> have to mark pages as offline before a new segment is visible to the system
>>>> (e.g. as these pages might not be backed by real memory in the hypervisor).
>>>
>>> Please expand on the kdump part. That is really confusing because
>>> hotplug should simply not depend on kdump at all. Moreover why don't you
>>> simply mark those pages reserved and pull them out from the page
>>> allocator?
>>
>> 1. "hotplug should simply not depend on kdump at all"
>>
>> In theory yes. In the current state we already have to trigger kdump to
>> reload whenever we add/remove a memory block.
> 
> More details please.

I just had another look at the whole complexity of
makedumpfile/kdump/uevents and I'll follow up with a detailed description.

kdump.service is definitely reloaded when setting a memory block
online/offline (not when adding/removing, as I wrongly claimed before).

I'll follow up with a more detailed description and all the pointers.
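(For reference, distribution kdump packages typically wire this up via udev
rules roughly like the following; this is illustrative only, and the exact
file names and rule text vary between distributions:)

```
# e.g. /usr/lib/udev/rules.d/98-kexec.rules (illustrative, not verbatim)
SUBSYSTEM=="memory", ACTION=="online",  PROGRAM="/bin/systemctl try-restart kdump.service"
SUBSYSTEM=="memory", ACTION=="offline", PROGRAM="/bin/systemctl try-restart kdump.service"
```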

>  
>> 2. kdump part
>>
>> Whenever we offline a page and tell the hypervisor about it ("unplug"),
>> we should not assume that we can read that page again. Now, if dumping
>> tools assume they can read all memory that is offline, we are in trouble.
> 
> Sure. Just make those pages reserved. Nobody should touch those IIRC.

I think I answered that question already (see [1]) in another thread: We
have certain buffers that are marked reserved. Reserved does not imply
"don't dump"; all dump tools I am aware of will dump reserved pages. I
cannot use reserved to mark sections offline once all pages are offline.

And I don't think the current approach of using a mapcount value is the
problematic part. This is straightforward now.

> 
>> It is the same thing as we already have with Pg_hwpoison. Just a
>> different meaning - "don't touch this page, it is offline" compared to
>> "don't touch this page, hw is broken".
>>
>> Balloon drivers solve this problem by always allowing to read unplugged
>> memory. In virtio-mem, this cannot and should even not be guaranteed.
>>
>> And what we have to do to make this work is actually pretty simple: Just
>> like Pg_hwpoison, track per page if it is online and provide this
>> information to kdump.
> 
> If somebody doesn't check your new flag then you are screwed anyway.

Yes, but that's not a problem in a new environment. We don't modify
existing environments. The same used to be true for Pg_hwpoison.

> Existing code should be quite used to PageReserved.

Please refer to [3]:makedump.c

And the checks performed to exclude pages. No sign of reserved pages.

>  
>> 3. Marking pages reserved and pulling them out of the page allocator
>>
>> This is basically what e.g. the XEN balloon does. I don't see how that
>> helps related to kdump. Can you explain how that would solve any of the
>> problems I am trying to solve here? This does neither solve the unplug
>> part nor the "tell dump tools to not read such memory" part.
> 
> I still do not understand the unplug part, to be honest. And the rest
> should be pretty much solveable. Or what do I miss. Your 4MB can be
> perfectly emulated on top of existing sections and pulling pages out of
> the allocator. How you mark those pages is an implementation detail.
> PageReserved sounds like the easiest way forward.
> 

No allocator. Please refer to the virtio-mem prototype and the
explanation I gave above. That should give you a better idea of what I do
in the current prototype.

[1] https://lkml.org/lkml/2018/5/23/803
[2] https://www.spinics.net/lists/linux-mm/msg150029.html
[3] https://github.com/jmesmon/makedumpfile

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
  2018-05-24 10:45       ` David Hildenbrand
@ 2018-05-24 12:03         ` Michal Hocko
  2018-05-24 14:04           ` David Hildenbrand
  2018-05-25 15:08             ` David Hildenbrand
  0 siblings, 2 replies; 42+ messages in thread
From: Michal Hocko @ 2018-05-24 12:03 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Alexander Potapenko, Andrew Morton,
	Andrey Ryabinin, Balbir Singh, Baoquan He,
	Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams,
	Dave Young, Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini,
	Huang Ying, Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara,
	Jérôme Glisse, Joonsoo Kim, Juergen Gross,
	Kate Stewart, Kirill A. Shutemov, Matthew Wilcox, Mel Gorman,
	Michael Ellerman, Miles Chen, Oscar Salvador, Paul Mackerras,
	Pavel Tatashin, Philippe Ombredanne, Rashmica Gupta, Reza Arbab,
	Souptick Joarder, Tetsuo Handa, Thomas Gleixner, Vlastimil Babka

On Thu 24-05-18 12:45:50, David Hildenbrand wrote:
> On 24.05.2018 11:31, Michal Hocko wrote:
> > On Thu 24-05-18 10:31:30, David Hildenbrand wrote:
[...]
> >> Allowing to unplug such small chunks is actually the interesting thing.
> > 
> > Not really. The vmemmap will stay behind and so you are still wasting
> > memory. Well, unless you want to have small ptes for the hotplug memory
> > which is just too suboptimal
> > 
> >> Try unplugging a 128MB DIMM. With ZONE_NORMAL: pretty much impossible.
> >> With ZONE_MOVABLE: maybe possible.
> > 
> > It should be always possible. We have some players who pin pages for
> > arbitrary amount of time even from zone movable but we should focus on
> > fixing them or come with a way to handle that. Zone movable is about
> > movable memory pretty much by definition.
> 
> You exactly describe what has been the case for way too long. But this
> is only the tip of the ice berg. Simply adding all memory to
> ZONE_MOVABLE is not going to work (we create an imbalance - e.g. page
> tables have to go into ZONE_NORMAL. this imbalance will have to be
> managed later on). That's why I am rather thinking about said assignment
> to different zones in the future. For now using ZONE_NORMAL is the
> easiest approach.

Well, I think it would be fair to say that memory hotplug is not really
suitable for balancing memory between guests. Exactly because of the
offline part. If you want to have a reliable offline you just screw
yourself to the highmem land problems. This is the primary reason why I
really detest any ballooning-like solutions based on memory hotplug.
I can see the case for adding memory to increase the initial guests
sizes but that can be achieved with what we have currently without any
additional complexity in the generic code. So you really have to have
some seriously convincing arguments in hand. So far I haven't heard
any, to be honest. This is just yet another thing that can be achieved
with what other ballooning solutions are doing. I might be wrong because
this is not really my area and I might underestimate some nuances but
as already said, you have to be really convincing...

> >> Try to find one 4MB chunk of a
> >> "128MB" DIMM that can be unplugged: With ZONE_NORMAL
> >> maybe possible. With ZONE_MOVABLE: likely possible.
> >>
> >> But let's not go into the discussion of ZONE_MOVABLE vs. ZONE_NORMAL, I
> >> plan to work on that in the future.
> >>
> >> Think about it that way: A compromise between section based memory
> >> hotplug and page based ballooning.
> >>
> >>
> >> 2. memory hotplug and 128MB size
> >>
> >> Interesting point. One thing to note is that "The whole memory hotplug
> >> is centered around memory sections and those are 128MB in size" is
> >> simply the current state how it is implemented in Linux, nothing more.
> > 
> > Yes, and we do care about that because the whole memory hotplug is a
> > bunch of hacks duct taped together to address very specific usecases.
> > It took me one year to put it into a state that my eyes do not bleed
> > anytime I have to look there. There are still way too many problems to
> > address. I certainly do not want to add more complication. Quite
> > contrary, the whole code cries for cleanups and sanity.
> 
> And I highly appreciate your effort. But look at the details: I am even
> cleaning up online_pages() and offline_pages(). And this is not the end
> of my contributions :) This is one step into that direction. It
> showcases what is easily possible right now. With existing interfaces.

If you can bring some cleanups then great. I would suggest pulling those
out and posting them separately.
 
> >> I want to avoid what balloon drivers do: rip out random pages,
> >> fragmenting guest memory until we eventually trigger the OOM killer. So
> >> instead, using 4MB chunks produces no fragmentation. And if I can't find
> >> such a chunk anymore: bad luck. At least I won't be risking stability of
> >> my guest.
> >>
> >> Does that answer your question?
> > 
> > So you basically pull out those pages from the page allocator and mark
> > them offline (reserved what ever)? Why do you need any integration to
> > the hotplug code base then? You should be perfectly fine to work on
> > top and only teach that hotplug code to recognize your pages are being
> > free when somebody decides to offline the whole section. I can think of
> > a callback that would allow that.
> > 
> > But then you are, well a balloon driver, aren't you?
> 
> Pointing you at: [1]
> 
> I *cannot* use the page allocator. Using it would be a potential addon
> (for special cases!) in the future. I really scan for removable chunks
> in the memory region a certain virtio-mem device owns. Why can't I do it:
> 
> 1. I have to allocate memory in certain physical address range (the
> range that belongs to a virtio-mem device). Well, we could write an
> allocator.

Not really. Why can't you simply mimic what the hotplug already does?
Scan a pfn range and isolate/migrate it? We can abstract such
functionality to make it more usable.

> 2. I might have to deal with chunks that are bigger than MAX_ORDER - 1.
> Say my virito-mem device has a block size of 8MB and I only can allocate
> 4MB. I'm out of luck then.

Confused again. So you are managing 4M chunks in 8M units?

> So, no, virtio-mem is not a balloon driver :)
[...]
> >> 1. "hotplug should simply not depend on kdump at all"
> >>
> >> In theory yes. In the current state we already have to trigger kdump to
> >> reload whenever we add/remove a memory block.
> > 
> > More details please.
> 
> I just had another look at the whole complexity of
> makedumfile/kdump/uevents and I'll follow up with a detailed description.
> 
> kdump.service is definitely reloaded when setting a memory block
> online/offline (not when adding/removing as I wrongly claimed before).
> 
> I'll follow up with a more detailed description and all the pointers.

Please make sure to describe the architecture then. I have no
idea what kdump.service is supposed to do, for example.

> >> 2. kdump part
> >>
> >> Whenever we offline a page and tell the hypervisor about it ("unplug"),
> >> we should not assume that we can read that page again. Now, if dumping
> >> tools assume they can read all memory that is offline, we are in trouble.
> > 
> > Sure. Just make those pages reserved. Nobody should touch those IIRC.
> 
> I think I answered that question already (see [1]) in another thread: We
> have certain buffers that are marked reserved. Reserved does not imply
> don't dump. all dump tools I am aware of will dump reserved pages. I
> cannot use reserved to mark sections offline once all pages are offline.

Then fix those dump tools. They shouldn't have any business touching
reserved memory, should they? That memory can be in an arbitrary state.
 
> And I don't think the current approach of using a mapcount value is the
> problematic part. This is straight forward now.

The thing is that struct page space is extremely scarce. Any time you
take a bit there, somebody else will have to scratch his head much
harder in the future. So if we can go with the existing infrastructure
then it should be preferable. And we do have "this page is mine, do not
even think about touching it" - PageReserved.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
  2018-05-24 12:03         ` Michal Hocko
@ 2018-05-24 14:04           ` David Hildenbrand
  2018-05-24 14:22             ` Michal Hocko
  2018-05-25 15:08             ` David Hildenbrand
  1 sibling, 1 reply; 42+ messages in thread
From: David Hildenbrand @ 2018-05-24 14:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Alexander Potapenko, Andrew Morton,
	Andrey Ryabinin, Balbir Singh, Baoquan He,
	Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams,
	Dave Young, Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini,
	Huang Ying, Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara,
	Jérôme Glisse, Joonsoo Kim, Juergen Gross,
	Kate Stewart, Kirill A. Shutemov, Matthew Wilcox, Mel Gorman,
	Michael Ellerman, Miles Chen, Oscar Salvador, Paul Mackerras,
	Pavel Tatashin, Philippe Ombredanne, Rashmica Gupta, Reza Arbab,
	Souptick Joarder, Tetsuo Handa, Thomas Gleixner, Vlastimil Babka

>> You exactly describe what has been the case for way too long. But this
>> is only the tip of the ice berg. Simply adding all memory to
>> ZONE_MOVABLE is not going to work (we create an imbalance - e.g. page
>> tables have to go into ZONE_NORMAL. this imbalance will have to be
>> managed later on). That's why I am rather thinking about said assignment
>> to different zones in the future. For now using ZONE_NORMAL is the
>> easiest approach.
> 
> Well, I think it would be fair to say that memory hotplug is not really
> suitable for balancing memory between guests. Exactly because of the
> offline part. If you want to have a reliable offline you just screw
> yourself to the highmem land problems. This is the primary reason why I
> really detest any ballooning-like solutions based on the memory hotplug.

The important remark first: I hate using ballooning. I hate it for
literally any use case. And I keep stressing that virtio-mem is not a
balloon driver. Because it isn't. And I don't want people to think it is
one that can be used for some sort of "auto ballooning" or
"collaborative memory management". We have people working on other
approaches to solve these problems (free page hinting, fake DAX etc.).

The use case is *not* to balance memory between guests. The use case is
to be able to add (and eventually remove portions of) memory to a guest
easily, in an automatically managed fashion.

Simple use case:

You want to give a guest more memory. You resize the virtio-mem device.
virtio-mem will find chunks and online them. No manual interaction. No
finding of free ACPI slots. No manual onlining in the guest. No special
migration handling in the hypervisor. One command in the hypervisor.

You want to try to remove some memory from the guest. You resize the
virtio-mem device. No manual search for an ACPI DIMM that can be
removed. No manual offlining in the guest. No special migration handling
in the hypervisor. virtio-mem will handle it. One command in the
hypervisor. If the requested amount of memory cannot be unplugged, bad
luck, no guarantees.

So if you want to put it that way, assuming the virtio-mem block size is
exactly 128MB (which is possible!), we will basically manage sections
automatically on x86. Which is, in my view, already a major improvement.

The 4MB part is just an optimization, because in virtual environments we
are actually able to plug 4MB, as you can see. So why fall back to
128MB/2GB if it is all already in the code? All we have to do is tell
kdump not to touch the memory. How to do that is what we are arguing about.

> I can see the case for adding memory to increase the initial guests
> sizes but that can be achieved with what we have currently without any
> additional complexity in the generic code. So you really have to have
> some seriously convincing arguments in hands. So far I haven't heard
> any, to be honest. This is just yet another thing that can be achieved
> with what other ballooning solutions are doing. I might be wrong because
> this is not really my area and I might underestimate some nuances but
> as already said, you have to be really convincing..
If I plug a 16GB DIMM I can only unplug 16GB again. This is a clear
restriction. Not even talking about zones. By using 4MB/8MB/128MB parts
we have *a lot* more freedom. And a paravirtualized interface gives us
all the freedom to support this.

And again, no ballooning please. Combinations of ACPI hotplug and
balloon-based (fake) memory unplug are to be avoided by any means. And
they only work on selected architectures.

Please go back to the benefits section in the cover letter, what you are
looking for is all there (e.g. support for architectures without proper
HW interfaces).

Also, please, before you start using the term "additional complexity",
have a look at the patches. It is all already in the code and I am
fixing offline_pages() and reusing it. Again, please have a look at how
much of all this is already working. You keep phrasing it as if I were
putting big hacks and complexity into existing code. This is not true.

> 
>>>> Try to find one 4MB chunk of a
>>>> "128MB" DIMM that can be unplugged: With ZONE_NORMAL
>>>> maybe possible. With ZONE_MOVABLE: likely possible.
>>>>
>>>> But let's not go into the discussion of ZONE_MOVABLE vs. ZONE_NORMAL, I
>>>> plan to work on that in the future.
>>>>
>>>> Think about it that way: A compromise between section based memory
>>>> hotplug and page based ballooning.
>>>>
>>>>
>>>> 2. memory hotplug and 128MB size
>>>>
>>>> Interesting point. One thing to note is that "The whole memory hotplug
>>>> is centered around memory sections and those are 128MB in size" is
>>>> simply the current state how it is implemented in Linux, nothing more.
>>>
>>> Yes, and we do care about that because the whole memory hotplug is a
>>> bunch of hacks duct taped together to address very specific usecases.
>>> It took me one year to put it into a state that my eyes do not bleed
>>> anytime I have to look there. There are still way too many problems to
>>> address. I certainly do not want to add more complication. Quite
>>> contrary, the whole code cries for cleanups and sanity.
>>
>> And I highly appreciate your effort. But look at the details: I am even
>> cleaning up online_pages() and offline_pages(). And this is not the end
>> of my contributions :) This is one step into that direction. It
>> showcases what is easily possible right now. With existing interfaces.
> 
> If you can bring some cleanups then great. I would suggest pulling those
> out and post separately.

The cleanups rely on patch number 1: properly marking pages as free to
detect when sections are offline. And even that part is straightforward
and contains no added complexity.

>  
>>>> I want to avoid what balloon drivers do: rip out random pages,
>>>> fragmenting guest memory until we eventually trigger the OOM killer. So
>>>> instead, using 4MB chunks produces no fragmentation. And if I can't find
>>>> such a chunk anymore: bad luck. At least I won't be risking stability of
>>>> my guest.
>>>>
>>>> Does that answer your question?
>>>
>>> So you basically pull out those pages from the page allocator and mark
>>> them offline (reserved, whatever)? Why do you need any integration with
>>> the hotplug code base then? You should be perfectly fine to work on
>>> top and only teach that hotplug code to recognize your pages are being
>>> free when somebody decides to offline the whole section. I can think of
>>> a callback that would allow that.
>>>
>>> But then you are, well a balloon driver, aren't you?
>>
>> Pointing you at: [1]
>>
>> I *cannot* use the page allocator. Using it would be a potential add-on
>> (for special cases!) in the future. I really scan for removable chunks
>> in the memory region a certain virtio-mem device owns. Why I can't use it:
>>
>> 1. I have to allocate memory in a certain physical address range (the
>> range that belongs to a virtio-mem device). Well, we could write an
>> allocator.
> 
> Not really. Why cannot you simply mimic what the hotplug already does?
> Scan a pfn range and isolate/migrate it? We can abstract such
> functionality to be better usable.

Yes, I could, but why potentially duplicate code and "introduce more
complexity"? It is all already there in the offlining code, and I am
reusing it. And it works just fine.

More importantly (please have a look at virtio-mem), it is a clean
implementation. No patching up of memory counts, no special function
calls. Using add_memory/remove_memory/online_pages/offline_pages is
enough for the "core" part.

Whatever you describe related to allocators and onlining screams of a
lot of complexity and new interfaces.

> 
>> 2. I might have to deal with chunks that are bigger than MAX_ORDER - 1.
>> Say my virtio-mem device has a block size of 8MB and I can only allocate
>> 4MB. I'm out of luck then.
> 
> Confused again. So you are managing 4M chunks in 8M units?

Now the funny part: if my virtio-mem block size is 8MB, I will manage
8MB chunks. If it is 128MB, I will manage 128MB chunks ("sections").

The point I was making is: I cannot allocate 8MB/128MB using the buddy
allocator. All I want to do is manage the memory a virtio-mem device
provides as flexibly as possible.

> 
>> So, no, virtio-mem is not a balloon driver :)
> [...]
>>>> 1. "hotplug should simply not depend on kdump at all"
>>>>
>>>> In theory yes. In the current state we already have to trigger kdump to
>>>> reload whenever we add/remove a memory block.
>>>
>>> More details please.
>>
>> I just had another look at the whole complexity of
>> makedumpfile/kdump/uevents and I'll follow up with a detailed description.
>>
>> kdump.service is definitely reloaded when setting a memory block
>> online/offline (not when adding/removing as I wrongly claimed before).
>>
>> I'll follow up with a more detailed description and all the pointers.
> 
> Please make sure to describe what is the architecture then. I have no
> idea what kdump.service is supposed to do for example.

Yes, will try to demangle how this all plays together :)

> 
>>>> 2. kdump part
>>>>
>>>> Whenever we offline a page and tell the hypervisor about it ("unplug"),
>>>> we should not assume that we can read that page again. Now, if dumping
>>>> tools assume they can read all memory that is offline, we are in trouble.
>>>
>>> Sure. Just make those pages reserved. Nobody should touch those IIRC.
>>
>> I think I answered that question already (see [1]) in another thread: We
>> have certain buffers that are marked reserved. Reserved does not imply
>> "don't dump"; all dump tools I am aware of will dump reserved pages. I
>> cannot use reserved to mark sections offline once all pages are offline.
> 
> Then fix those dump tools. They shouldn't have any business touching
> reserved memory, should they? That memory can be in an arbitrary state.

As I already stated, I don't think so. Please have a look at how
reserved is used - not exclusively for marking pages offline. There is a
reason why dump tools don't exclude such pages.

And for marking sections properly as offline (what offline_pages()
already implements to 99.999%), we need a new bit.

Again, it is all there and there is no added complexity. We just have to
set one additional bit properly.

>  
>> And I don't think the current approach of using a mapcount value is the
>> problematic part. This is straight forward now.
> 
> The thing is that struct page space is extremely scarce. Anytime you
> take a bit out of there, somebody else will have to scratch his head much
> harder in the future. So if we can go with the existing infrastructure
> then it should be preferable. And we do have "this page is mine do not
> even think about touching it" - PageReserved.
Yes, it is space. But I think it is space well spent for the reasons
outlined :)

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
  2018-05-24 14:04           ` David Hildenbrand
@ 2018-05-24 14:22             ` Michal Hocko
  2018-05-24 21:07               ` David Hildenbrand
  0 siblings, 1 reply; 42+ messages in thread
From: Michal Hocko @ 2018-05-24 14:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Alexander Potapenko, Andrew Morton,
	Andrey Ryabinin, Balbir Singh, Baoquan He,
	Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams,
	Dave Young, Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini,
	Huang Ying, Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara,
	Jérôme Glisse, Joonsoo Kim, Juergen Gross,
	Kate Stewart, Kirill A. Shutemov, Matthew Wilcox, Mel Gorman,
	Michael Ellerman, Miles Chen, Oscar Salvador, Paul Mackerras,
	Pavel Tatashin, Philippe Ombredanne, Rashmica Gupta, Reza Arbab,
	Souptick Joarder, Tetsuo Handa, Thomas Gleixner, Vlastimil Babka

I will go over the rest of the email later I just wanted to make this
point clear because I suspect we are talking past each other.

On Thu 24-05-18 16:04:38, David Hildenbrand wrote:
[...]
> The point I was making is: I cannot allocate 8MB/128MB using the buddy
> allocator. All I want to do is manage the memory a virtio-mem device
> provides as flexible as possible.

I didn't mean to use the page allocator to isolate pages from it. We do
have other means. Have a look at the page isolation framework and have a
look how the current memory hotplug (ab)uses it. In short you mark the
desired physical memory range as isolated (nobody can allocate from it)
and then simply remove it from the page allocator. And you are done with
it. Your particular range is gone, nobody will ever use it. If you mark
those struct pages reserved then pfn walkers should already ignore them.
If you keep those pages with ref count 0 then even hotplug should work
seamlessly (I would have to double check).

So all I am arguing is that whatever your driver wants to do can be
handled without touching the hotplug code much. You would still need
to add new ranges in the mem section units and manage on top of that.
You need to do that anyway to keep track of which parts are in use or
offlined, right? Now the mem sections: you have to keep those anyway
for the memmaps. Our sparse memory model simply works in those units.
Even if you make a part of that range unavailable, the section will
still be there.

Do I make at least some sense or I am completely missing your point?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v1 09/10] mm/memory_hotplug: teach offline_pages() to not try forever
  2018-05-23 15:11 ` [PATCH v1 09/10] mm/memory_hotplug: teach offline_pages() to not try forever David Hildenbrand
@ 2018-05-24 14:39   ` Michal Hocko
  2018-05-24 20:36     ` David Hildenbrand
  0 siblings, 1 reply; 42+ messages in thread
From: Michal Hocko @ 2018-05-24 14:39 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Greg Kroah-Hartman, Rashmica Gupta,
	Balbir Singh, Andrew Morton, Vlastimil Babka, Dan Williams,
	Joonsoo Kim, Pavel Tatashin, Reza Arbab, Thomas Gleixner

[I didn't really go through the other patches but this one caught my eye
 just because of the similar request proposed yesterday]

On Wed 23-05-18 17:11:50, David Hildenbrand wrote:
[...]
> @@ -1686,6 +1686,10 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  	pfn = scan_movable_pages(start_pfn, end_pfn);
>  	if (pfn) { /* We have movable pages */
>  		ret = do_migrate_range(pfn, end_pfn);
> +		if (ret && !retry_forever) {
> +			ret = -EBUSY;
> +			goto failed_removal;
> +		}
>  		goto repeat;
>  	}
>  

Btw. this will not work in practice. Even a single temporary pin on a page
will fail way too easily. If you really need to control this then make
it a retry counter with default -1UL.

We really do need a better error reporting from do_migrate_range and
distinguish transient from permanent failures. In general we shouldn't
even get here for pages which are not migrateable...
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v1 09/10] mm/memory_hotplug: teach offline_pages() to not try forever
  2018-05-24 14:39   ` Michal Hocko
@ 2018-05-24 20:36     ` David Hildenbrand
  0 siblings, 0 replies; 42+ messages in thread
From: David Hildenbrand @ 2018-05-24 20:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Greg Kroah-Hartman, Rashmica Gupta,
	Balbir Singh, Andrew Morton, Vlastimil Babka, Dan Williams,
	Joonsoo Kim, Pavel Tatashin, Reza Arbab, Thomas Gleixner

On 24.05.2018 16:39, Michal Hocko wrote:
> [I didn't really go through the other patches but this one caught my eye
>  just because of the similar request proposed yesterday]
> 
> On Wed 23-05-18 17:11:50, David Hildenbrand wrote:
> [...]
>> @@ -1686,6 +1686,10 @@ static int __ref __offline_pages(unsigned long start_pfn,
>>  	pfn = scan_movable_pages(start_pfn, end_pfn);
>>  	if (pfn) { /* We have movable pages */
>>  		ret = do_migrate_range(pfn, end_pfn);
>> +		if (ret && !retry_forever) {
>> +			ret = -EBUSY;
>> +			goto failed_removal;
>> +		}
>>  		goto repeat;
>>  	}
>>  
> 
> Btw. this will not work in practice. Even a single temporary pin on a page
> will fail way too easily. If you really need to control this then make
> it a retry counter with default -1UL.

Interestingly, this will work for the one specific use case that I am
using this interface for right now.

The reason is that I don't want to offline a specific chunk, I want to
find some chunks to offline (in contrast to e.g. DIMMs, where you really
try to offline a very specific one).

If I get a failure on that chunk (e.g. temporary pin) I will retry the
next chunk. At one point, I will eventually retry this chunk and then it
succeeds.

> 
> We really do need a better error reporting from do_migrate_range and
> distinguish transient from permanent failures. In general we shouldn't
> even get here for pages which are not migrateable...

I totally agree. I also want to know if an error is permanent or
transient - and I want the possibility to "fail fast" (e.g. -EAGAIN)
instead of looping forever.


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
  2018-05-24 14:22             ` Michal Hocko
@ 2018-05-24 21:07               ` David Hildenbrand
  2018-06-11 11:53                 ` David Hildenbrand
  2018-07-18 13:19                 ` Michal Hocko
  0 siblings, 2 replies; 42+ messages in thread
From: David Hildenbrand @ 2018-05-24 21:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Alexander Potapenko, Andrew Morton,
	Andrey Ryabinin, Balbir Singh, Baoquan He,
	Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams,
	Dave Young, Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini,
	Huang Ying, Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara,
	Jérôme Glisse, Joonsoo Kim, Juergen Gross,
	Kate Stewart, Kirill A. Shutemov, Matthew Wilcox, Mel Gorman,
	Michael Ellerman, Miles Chen, Oscar Salvador, Paul Mackerras,
	Pavel Tatashin, Philippe Ombredanne, Rashmica Gupta, Reza Arbab,
	Souptick Joarder, Tetsuo Handa, Thomas Gleixner, Vlastimil Babka

On 24.05.2018 16:22, Michal Hocko wrote:
> I will go over the rest of the email later I just wanted to make this
> point clear because I suspect we are talking past each other.

It sounds like we are now talking about how to solve the problem. I like
that :)

> 
> On Thu 24-05-18 16:04:38, David Hildenbrand wrote:
> [...]
>> The point I was making is: I cannot allocate 8MB/128MB using the buddy
>> allocator. All I want to do is manage the memory a virtio-mem device
>> provides as flexible as possible.
> 
> I didn't mean to use the page allocator to isolate pages from it. We do
> have other means. Have a look at the page isolation framework and have a
> look how the current memory hotplug (ab)uses it. In short you mark the
> desired physical memory range as isolated (nobody can allocate from it)
> and then simply remove it from the page allocator. And you are done with
> it. Your particular range is gone, nobody will ever use it. If you mark
> those struct pages reserved then pfn walkers should already ignore them.
> If you keep those pages with ref count 0 then even hotplug should work
> seamlessly (I would have to double check).
> 
> So all I am arguing is that whatever your driver wants to do can be
> handled without touching the hotplug code much. You would still need
> to add new ranges in the mem section units and manage on top of that.
> You need to do that anyway to keep track of which parts are in use or
> offlined, right? Now the mem sections: you have to keep those anyway
> for the memmaps. Our sparse memory model simply works in those units.
> Even if you make a part of that range unavailable, the section will
> still be there.
> 
> Do I make at least some sense or I am completely missing your point?
> 

I think we're heading somewhere. I understand that you want to separate
this "semi offline" part from the general offlining code. If so, we
should definitely enforce section alignment for
online_pages()/offline_pages().

Importantly, what I need is:

1. Indicate and prepare memory sections to be used for adding memory
   chunks (right now add_memory())
2. Make memory chunks of a section available to the system (right now
   online_pages())
3. Remove memory chunks of a section from the system (right now
   offline_pages())
4. Remove memory sections from the system (right now remove_memory())
5. Hinder dumping tools from reading memory chunks that are logically
   offline (right now PageOffline())
6. For 3. find removable memory chunks in a certain memory range with a
   variable size.

In an ideal world, 2. would never fail (in contrast to online_pages()
right now). This might make some further developments I have in mind
easier :) So if we can come up with an approach that can guarantee that,
extra points.

So what I think you are talking about is the following.

For 1. Use add_memory() followed by online_pages(). Don't actually
       online the pages, keep them reserved (like the XEN balloon). Fixup
       stats.
For 2. Expose reserved pages to the buddy allocator. Clear the reserved
       bit. Fixup stats. This can never fail. (yay)
For 3. Isolate pages, try to move everything away (basically, but not
       completely, the offlining code). Set the reserved flag. Fixup stats.
For 4. offline_pages() followed by remove_memory().
       -> Q: How to distinguish reserved-offline from other reserved
             pages? offline_pages() has to be able to deal with that.
For 5. I don't think we can use the reserved flag here.
       -> Q: What else to use?
For 6. Scan for movable ranges, then use 3.


"You need to do that anyway to keep track of what parts are in use or
 offlined anyway right?"

I would manually track which chunks of a section are logically offline
(I already do that right now).

Is that what you had in mind? If not, where does your idea differ?
How could we solve 4/5? Of course, PageOffline() is again an option.

Thanks!

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
  2018-05-24 12:03         ` Michal Hocko
@ 2018-05-25 15:08             ` David Hildenbrand
  2018-05-25 15:08             ` David Hildenbrand
  1 sibling, 0 replies; 42+ messages in thread
From: David Hildenbrand @ 2018-05-25 15:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Alexander Potapenko, Andrew Morton,
	Andrey Ryabinin, Balbir Singh, Baoquan He,
	Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams,
	Dave Young, Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini,
	Huang Ying, Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara,
	Jérôme Glisse, Joonsoo Kim, Juergen Gross,
	Kate Stewart, Kirill A. Shutemov, Matthew Wilcox, Mel Gorman,
	Michael Ellerman, Miles Chen, Oscar Salvador, Paul Mackerras,
	Pavel Tatashin, Philippe Ombredanne, Rashmica Gupta, Reza Arbab,
	Souptick Joarder, Tetsuo Handa, Thomas Gleixner, Vlastimil Babka,
	kexec

>> So, no, virtio-mem is not a balloon driver :)
> [...]
>>>> 1. "hotplug should simply not depend on kdump at all"
>>>>
>>>> In theory yes. In the current state we already have to trigger kdump to
>>>> reload whenever we add/remove a memory block.
>>>
>>> More details please.
>>
>> I just had another look at the whole complexity of
>> makedumpfile/kdump/uevents and I'll follow up with a detailed description.
>>
>> kdump.service is definitely reloaded when setting a memory block
>> online/offline (not when adding/removing as I wrongly claimed before).
>>
>> I'll follow up with a more detailed description and all the pointers.
> 
> Please make sure to describe what is the architecture then. I have no
> idea what kdump.service is supposed to do for example.

Giving a high level description, going into applicable details:


Dump tools always generate the dump file from /proc/vmcore inside the
kexec environment. This is a vmcore dump in ELF format, with required
and optional headers and notes.


1. Core collectors

The tool that writes /proc/vmcore into a file is called "core collector".

"This allows you to specify the command to copy the vmcore. You could
use the dump filtering program makedumpfile, the default one, to
retrieve your core, which on some arches can drastically reduce core
file size." [1]

E.g. under RHEL, the only supported core collector is in fact
makedumpfile [2][3], which is able to exclude e.g. hwpoison pages that
could otherwise result in a crash if you simply copy /proc/vmcore into a
file on the hard disk.


2. vmcoreinfo

/proc/vmcore can optionally contain a vmcoreinfo note that exposes some
magic variables necessary to e.g. find and interpret segments, but also
struct pages. This is generated in "kernel/crash_core.c" in the crashed
Linux kernel.

...
VMCOREINFO_SYMBOL_ARRAY(mem_section);
VMCOREINFO_LENGTH(mem_section, NR_SECTION_ROOTS);
...
VMCOREINFO_NUMBER(PG_hwpoison);
...
VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);
...

If it is not available, tools instead try to extract the relevant
symbols/variables/pointers from vmlinux (similar to e.g. GDB).


3. PT_LOAD / Memory holes

Each vmcore contains "PT_LOAD" segments. These segments define which
physical memory areas are available in the vmcore (and to which virtual
addresses they translate). Generated e.g. in "kernel/kexec_file.c" - and
in some other places, see "git grep Elf64_Phdr".

This information is generated in the crashed kernel.

arch/x86/kernel/crash.c:
 walk_system_ram_res() is effectively used to generate PT_LOAD segments

arch/s390/kernel/crash_dump.c:
 for_each_mem_range() is effectively used to generate PT_LOAD
 information

At this point, I don't see how offline sections are treated. I assume
they are always also included. So PT_LOAD will include all memory, no
matter if online or offline.


4. Reloading kexec/kdump.service

The important thing is that the vmcore *excluding* the actual memory has
to be prepared by the *old* kernel. The kexec kernel will allow to
- Read the prepared vmcore (contained in kexec kernel)
- Read the memory

So dump tools only have the vmcore (esp. PT_LOAD) to figure out which
physical memory was available in the *old* system. The kexec kernel
neither reads nor interprets segments/struct pages from the old kernel
(and there would be no way to really do it). All it does is allow
reading old memory as defined in the prepared vmcore. If that memory is
not accessible or broken (hwpoison), we will crash the system.

So what does this imply? The vmcore (including PT_LOAD segments) has to
be regenerated every time memory is added to or removed from the system.
Otherwise the data contained in the prepared vmcore is stale. As far as
I understand this cannot be done by the actual kernel when
adding/removing memory but has to be done by user space.

The same is e.g. also true when hot(un)plugging CPUs.

This is done by reloading kexec, resulting in a regeneration of the
vmcore. Udev events are used to reload kdump.service and therefore
regenerate it. These events are triggered when onlining/offlining a
memory block.

...
SUBSYSTEM=="memory", ACTION=="online", PROGRAM="/bin/systemctl
try-restart kdump.service"
SUBSYSTEM=="memory", ACTION=="offline", PROGRAM="/bin/systemctl
try-restart kdump.service"
...

For "online", this is the right thing to do.

I am right now not 100% sure whether that is the right thing to do for
"offline". I guess we should actually regenerate after "remove" events,
but I didn't follow the details. Otherwise it could happen that the
vmcore is regenerated before the actual removal of memory blocks, so the
applicable memory blocks would still be included as PT_LOAD in the
vmcore. If we then remove the actual DIMM, trying to dump the vmcore
will result in reading invalid memory. But maybe I am missing something
there.


5. Access to vmcore / memory in the kexec environment

fs/proc/vmcore.c: contains the code for parsing vmcore in the kexec
kernel, prepared by the crashed kernel. The kexec kernel provides read
access to /proc/vmcore on this basis.

All PT_LOAD sections will be converted and stored in "vmcore_list".

When reading the vmcore, this list will be used to actually provide
access to the original crash memory (__read_vmcore()).

So only memory that was originally in a vmcore PT_LOAD segment will be
allowed to be read.

read_from_oldmem() will perform the actual read. At that point we have
no control over old page flags or segments. Just a straight memory read.

There is special handling for e.g. XEN in there: pfn_is_ram() can be
used to prevent reading inflated memory (register_oldmem_pfn_is_ram).

However, reusing that for virtio-mem, with multiple devices and queues
and such, might not be possible. It is the last resort :)


6. makedumpfile

makedumpfile can exclude free (buddy) pages, hwpoison pages and some
more. It will *not* exclude reserved pages or balloon-inflated (e.g.
virtio-balloon) pages. So it will read inflated pages and, if they are
zero, save a compressed zero page. Either way, it will (read) access
that memory.

makedumpfile was adapted to the new SECTION_IS_ONLINE bit (to mask out
the right section address), but offline sections will *not* be excluded.
So all memory in offline sections will also be accessed and dumped -
unless pages don't fall into a PT_LOAD segment ("memory hole"), in which
case they are not accessed.


7. Further information

Some more details can be found in "Documentation/kdump/kdump.txt".


"All of the necessary information about the system kernel's core image
is encoded in the ELF format, and stored in a reserved area of memory
before a crash. The physical address of the start of the ELF header is
passed to the dump-capture kernel through the elfcorehdr= boot
parameter."
-> I am pretty sure this is why the kexec reload from user space is
   necessary

"For s390x there are two kdump modes: If a ELF header is specified with
 the elfcorehdr= kernel parameter, it is used by the kdump kernel as it
 is done on all other architectures. If no elfcorehdr= kernel parameter
 is specified, the s390x kdump kernel dynamically creates the header.
 The second mode has the advantage that for CPU and memory hotplug,
 kdump has not to be reloaded with kexec_load()."


Any experts, please jump in :)



[1] https://www.systutorials.com/docs/linux/man/5-kdump/
[2] https://sourceforge.net/projects/makedumpfile/
[3] git://git.code.sf.net/p/makedumpfile/code

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
@ 2018-05-25 15:08             ` David Hildenbrand
  0 siblings, 0 replies; 42+ messages in thread
From: David Hildenbrand @ 2018-05-25 15:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kate Stewart, Jaewon Kim, Oscar Salvador, Jan Kara,
	Boris Ostrovsky, Benjamin Herrenschmidt, Balbir Singh, linux-mm,
	Alexander Potapenko, Hari Bathini, Rashmica Gupta, Ingo Molnar,
	Baoquan He, Matthew Wilcox, Michael Ellerman, Hugh Dickins,
	Pavel Tatashin, Tetsuo Handa, Mel Gorman, Huang Ying,
	Andrey Ryabinin, Dave Young, Vlastimil Babka, Reza Arbab,
	Jérôme Glisse, Dan Williams, Joonsoo Kim,
	Dmitry Vyukov, Juergen Gross, Greg Kroah-Hartman, kexec,
	linux-kernel, Miles Chen, Thomas Gleixner, Souptick Joarder,
	Philippe Ombredanne, Paul Mackerras, Andrew Morton,
	Kirill A. Shutemov

>> So, no, virtio-mem is not a balloon driver :)
> [...]
>>>> 1. "hotplug should simply not depend on kdump at all"
>>>>
>>>> In theory yes. In the current state we already have to trigger kdump to
>>>> reload whenever we add/remove a memory block.
>>>
>>> More details please.
>>
>> I just had another look at the whole complexity of
>> makedumfile/kdump/uevents and I'll follow up with a detailed description.
>>
>> kdump.service is definitely reloaded when setting a memory block
>> online/offline (not when adding/removing as I wrongly claimed before).
>>
>> I'll follow up with a more detailed description and all the pointers.
> 
> Please make sure to describe what is the architecture then. I have no
> idea what kdump.servise is supposed to do for example.

Giving a high level description, going into applicable details:


Dump tools always generate the dump file from /proc/vmcore inside the
kexec environment. This is a vmcore dump in ELF format, with required
and optional headers and notes.


1. Core collectors

The tool that writes /proc/vmcore into a file is called "core collector".

"This allows you to specify the command to copy the vmcore. You could
use the dump filtering program makedumpfile, the default one, to
retrieve your core, which on some arches can drastically reduce core
file size." [1]

Under RHEL, for example, the only supported core collector is in fact
makedumpfile [2][3], which is able to exclude e.g. hwpoison pages;
reading those could otherwise result in a crash if you simply copied
/proc/vmcore into a file on the hard disk.


2. vmcoreinfo

/proc/vmcore can optionally contain vmcoreinfo data, which exposes some
magic variables necessary to find and interpret e.g. segments, but also
struct pages. This is generated in "kernel/crash_core.c" in the crashed
Linux kernel.

...
VMCOREINFO_SYMBOL_ARRAY(mem_section);
VMCOREINFO_LENGTH(mem_section, NR_SECTION_ROOTS);
...
VMCOREINFO_NUMBER(PG_hwpoison);
...
VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);
...

If it is not available, tools try to extract the relevant
symbols/variables/pointers from vmlinux instead (similar to what e.g.
GDB does).


3. PT_LOAD / Memory holes

Each vmcore contains "PT_LOAD" segments. These segments define which
physical memory areas are available in the vmcore (and to which virtual
addresses they translate). Generated e.g. in "kernel/kexec_file.c" - and
in some other places ("git grep Elf64_Phdr").

This information is generated in the crashed kernel.

arch/x86/kernel/crash.c:
 walk_system_ram_res() is effectively used to generate PT_LOAD segments

arch/s390/kernel/crash_dump.c:
 for_each_mem_range() is effectively used to generate PT_LOAD
 information

At this point, I don't see how offline sections are treated. I assume
they are always also included. So PT_LOAD segments will include all
memory, no matter whether it is online or offline.


4. Reloading kexec/kdump.service

The important thing is that the vmcore *excluding* the actual memory has
to be prepared by the *old* kernel. The kexec kernel allows one to
- Read the prepared vmcore (contained in the kexec kernel)
- Read the memory

So dump tools only have the vmcore (esp. its PT_LOAD segments) to figure
out which physical memory was available in the *old* system. The kexec
kernel neither reads nor interprets segments/struct pages from the old
kernel (and there would be no way to really do it). All it does is allow
reading old memory as defined in the prepared vmcore. If that memory is
not accessible or broken (hwpoison), we will crash the system.

So what does this imply? The vmcore (including its PT_LOAD segments) has
to be regenerated every time memory is added to or removed from the
system. Otherwise the data contained in the prepared vmcore is stale. As
far as I understand, this cannot be done by the kernel itself when
adding/removing memory but has to be done by user space.

The same is e.g. also true when hot(un)plugging CPUs.

This is done by reloading kexec, resulting in a regeneration of the
vmcore. udev events are used to reload kdump.service and therefore
trigger the regeneration. These events are fired when onlining/offlining
a memory block.

...
SUBSYSTEM=="memory", ACTION=="online", PROGRAM="/bin/systemctl
try-restart kdump.service"
SUBSYSTEM=="memory", ACTION=="offline", PROGRAM="/bin/systemctl
try-restart kdump.service"
...

For "online", this is the right thing to do.

I am right now not 100% sure whether that is the right thing to do for
"offline". I guess we should actually regenerate after "remove" events,
but I didn't follow the details. Otherwise it could happen that the
vmcore is regenerated before the actual removal of the memory blocks, so
the applicable memory blocks would still be included as PT_LOAD segments
in the vmcore. If we then remove the actual DIMM, trying to dump the
vmcore will result in reading invalid memory. But maybe I am missing
something there.


5. Access to vmcore / memory in the kexec environment

fs/proc/vmcore.c contains the code in the kexec kernel for parsing the
vmcore prepared by the crashed kernel. On this basis, the kexec kernel
provides read access to /proc/vmcore.

All PT_LOAD segments will be converted and stored in "vmcore_list".

When reading the vmcore, this list will be used to actually provide
access to the original crash memory (__read_vmcore()).

So only memory that was originally covered by a vmcore PT_LOAD segment
will be allowed to be read.

read_from_oldmem() will perform the actual read. At that point we have
no control over old page flags or segments. Just a straight memory read.

There is special handling for e.g. Xen in there: pfn_is_ram() can be
used to prevent reading inflated memory (register_oldmem_pfn_is_ram).

However, reusing that for virtio-mem with multiple devices and queues and
such might not be possible. It is the last resort :)


6. makedumpfile

makedumpfile can exclude free (buddy) pages, hwpoison pages and some
more. It will *not* exclude reserved pages or balloon (e.g.
virtio-balloon) inflated pages. So it will read inflated pages and, if
they are zero, save a compressed zero page. Either way, it will
(read-)access that memory.

makedumpfile was adapted to the new SECTION_IS_ONLINE bit (to mask it
out and obtain the right section address), but offline sections will
*not* be excluded. So all memory in offline sections will also be
accessed and dumped - unless pages fall outside of PT_LOAD segments
("memory holes"), in which case they are not accessed.


7. Further information

Some more details can be found in "Documentation/kdump/kdump.txt".


"All of the necessary information about the system kernel's core image
is encoded in the ELF format, and stored in a reserved area of memory
before a crash. The physical address of the start of the ELF header is
passed to the dump-capture kernel through the elfcorehdr= boot
parameter."
-> I am pretty sure this is why the kexec reload from user space is
   necessary

"For s390x there are two kdump modes: If a ELF header is specified with
 the elfcorehdr= kernel parameter, it is used by the kdump kernel as it
 is done on all other architectures. If no elfcorehdr= kernel parameter
 is specified, the s390x kdump kernel dynamically creates the header.
 The second mode has the advantage that for CPU and memory hotplug,
 kdump has not to be reloaded with kexec_load()."


Any experts, please jump in :)



[1] https://www.systutorials.com/docs/linux/man/5-kdump/
[2] https://sourceforge.net/projects/makedumpfile/
[3] git://git.code.sf.net/p/makedumpfile/code

-- 

Thanks,

David / dhildenb

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
  2018-05-24  9:14       ` David Hildenbrand
@ 2018-05-28  8:28         ` Dave Young
  2018-05-28 10:03           ` David Hildenbrand
  0 siblings, 1 reply; 42+ messages in thread
From: Dave Young @ 2018-05-28  8:28 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Hocko, linux-mm, linux-kernel, Alexander Potapenko,
	Andrew Morton, Andrey Ryabinin, Balbir Singh, Baoquan He,
	Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams,
	Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini, Huang Ying,
	Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara,
	Jérôme Glisse, Joonsoo Kim, Juergen Gross,
	Kate Stewart, Kirill A. Shutemov, Matthew Wilcox, Mel Gorman,
	Michael Ellerman, Miles Chen, Oscar Salvador, Paul Mackerras,
	Pavel Tatashin, Philippe Ombredanne, Rashmica Gupta, Reza Arbab,
	Souptick Joarder, Tetsuo Handa, Thomas Gleixner, Vlastimil Babka

On 05/24/18 at 11:14am, David Hildenbrand wrote:
> On 24.05.2018 10:56, Dave Young wrote:
> > Hi,
> > 
> > [snip]
> >>>
> >>>> For kdump and onlining/offlining code, we
> >>>> have to mark pages as offline before a new segment is visible to the system
> >>>> (e.g. as these pages might not be backed by real memory in the hypervisor).
> >>>
> >>> Please expand on the kdump part. That is really confusing because
> >>> hotplug should simply not depend on kdump at all. Moreover why don't you
> >>> simply mark those pages reserved and pull them out from the page
> >>> allocator?
> >>
> >> 1. "hotplug should simply not depend on kdump at all"
> >>
> >> In theory yes. In the current state we already have to trigger kdump to
> >> reload whenever we add/remove a memory block.
> >>
> >>
> >> 2. kdump part
> >>
> >> Whenever we offline a page and tell the hypervisor about it ("unplug"),
> >> we should not assume that we can read that page again. Now, if dumping
> >> tools assume they can read all memory that is offline, we are in trouble.
> >>
> >> It is the same thing as we already have with Pg_hwpoison. Just a
> >> different meaning - "don't touch this page, it is offline" compared to
> >> "don't touch this page, hw is broken".
> > 
> > Does that means in case an offline no kdump reload as mentioned in 1)?
> > 
> > If we have the offline event and reload kdump, I assume the memory state
> > is refreshed so kdump will not read the memory offlined, am I missing
> > something?
> 
> If a whole section is offline: yes. (ACPI hotplug)
> 
> If pages are online but broken ("logically offline" - hwpoison): no
> 
> If single pages are logically offline: no. (Balloon inflation - let's
> call it unplug as that's what some people refer to)
> 
> If only subsections (4MB chunks) are offline: no.
> 
> Exporting memory ranges in a smaller granularity to kdump than section
> size would a) be heavily complicated b) introduce a lot of overhead for
> this tracking data c) make us retrigger kdump way too often.
> 
> So simply marking pages offline in the struct pages and telling kdump
> about it is the straight forward thing to do. And it is fairly easy to
> add and implement as we have the exact same thing in place for hwpoison.

Ok, it is clear enough. In case of fine-grained page offlining it is like
a hwpoison page, so a userspace patch for makedumpfile is needed to
exclude such pages when copying the vmcore.

> 
> > 
> >>
> >> Balloon drivers solve this problem by always allowing to read unplugged
> >> memory. In virtio-mem, this cannot and should even not be guaranteed.
> >>
> > 
> > Hmm, that sounds a bug..
> 
> I can give you a simple example why reading such unplugged (or balloon
> inflated) memory is problematic: Huge page backed guests.
> 
> There is no zero page for huge pages. So if we allow the guest to read
> that memory any time, we cannot guarantee that we actually consume less
> memory in the hypervisor. This is absolutely to be avoided.
> 
> Existing balloon drivers don't support huge page backed guests. (well
> you can inflate, but the hypervisor cannot madvise() 4k on a huge page,
> resulting in no action being performed). This scenario is to be
> supported with virtio-mem.
> 
> 
> So yes, this is actually a bug in e.g. virtio-balloon implementations:
> 
> With "VIRTIO_BALLOON_F_MUST_TELL_HOST" we have to tell the hypervisor
> before we access a page again. kdump cannot do this and does not care,
> so this page is silently accessed and dumped. One of the main problems
> why extending virtio-balloon hypervisor implementations to support
> host-enforced R/W protection is impossible.

I'm not sure I got all virt related background, but still thank you
for the detailed explanation.  This is the first time I heard about
this, nobody complained before :(

> 
> > 
> >> And what we have to do to make this work is actually pretty simple: Just
> >> like Pg_hwpoison, track per page if it is online and provide this
> >> information to kdump.
> >>
> >>
> > 
> > Thanks
> > Dave
> > 
> 
> 
> -- 
> 
> Thanks,
> 
> David / dhildenb

Thanks
Dave

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
  2018-05-28  8:28         ` Dave Young
@ 2018-05-28 10:03           ` David Hildenbrand
  0 siblings, 0 replies; 42+ messages in thread
From: David Hildenbrand @ 2018-05-28 10:03 UTC (permalink / raw)
  To: Dave Young
  Cc: Michal Hocko, linux-mm, linux-kernel, Alexander Potapenko,
	Andrew Morton, Andrey Ryabinin, Balbir Singh, Baoquan He,
	Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams,
	Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini, Huang Ying,
	Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara,
	Jérôme Glisse, Joonsoo Kim, Juergen Gross,
	Kate Stewart, Kirill A. Shutemov, Matthew Wilcox, Mel Gorman,
	Michael Ellerman, Miles Chen, Oscar Salvador, Paul Mackerras,
	Pavel Tatashin, Philippe Ombredanne, Rashmica Gupta, Reza Arbab,
	Souptick Joarder, Tetsuo Handa, Thomas Gleixner, Vlastimil Babka

On 28.05.2018 10:28, Dave Young wrote:
> On 05/24/18 at 11:14am, David Hildenbrand wrote:
>> On 24.05.2018 10:56, Dave Young wrote:
>>> Hi,
>>>
>>> [snip]
>>>>>
>>>>>> For kdump and onlining/offlining code, we
>>>>>> have to mark pages as offline before a new segment is visible to the system
>>>>>> (e.g. as these pages might not be backed by real memory in the hypervisor).
>>>>>
>>>>> Please expand on the kdump part. That is really confusing because
>>>>> hotplug should simply not depend on kdump at all. Moreover why don't you
>>>>> simply mark those pages reserved and pull them out from the page
>>>>> allocator?
>>>>
>>>> 1. "hotplug should simply not depend on kdump at all"
>>>>
>>>> In theory yes. In the current state we already have to trigger kdump to
>>>> reload whenever we add/remove a memory block.
>>>>
>>>>
>>>> 2. kdump part
>>>>
>>>> Whenever we offline a page and tell the hypervisor about it ("unplug"),
>>>> we should not assume that we can read that page again. Now, if dumping
>>>> tools assume they can read all memory that is offline, we are in trouble.
>>>>
>>>> It is the same thing as we already have with Pg_hwpoison. Just a
>>>> different meaning - "don't touch this page, it is offline" compared to
>>>> "don't touch this page, hw is broken".
>>>
>>> Does that means in case an offline no kdump reload as mentioned in 1)?
>>>
>>> If we have the offline event and reload kdump, I assume the memory state
>>> is refreshed so kdump will not read the memory offlined, am I missing
>>> something?
>>
>> If a whole section is offline: yes. (ACPI hotplug)

After my investigation and reply to the other subthread, I think this is
not the case.

If a section/memory block is offline, it will currently still be dumped
as far as I can see. The ONLINE flag for sections is not (yet)
interpreted in makedumpfile.

>>
>> If pages are online but broken ("logically offline" - hwpoison): no
>>
>> If single pages are logically offline: no. (Balloon inflation - let's
>> call it unplug as that's what some people refer to)
>>
>> If only subsections (4MB chunks) are offline: no.
>>
>> Exporting memory ranges in a smaller granularity to kdump than section
>> size would a) be heavily complicated b) introduce a lot of overhead for
>> this tracking data c) make us retrigger kdump way too often.
>>
>> So simply marking pages offline in the struct pages and telling kdump
>> about it is the straight forward thing to do. And it is fairly easy to
>> add and implement as we have the exact same thing in place for hwpoison.
> 
> Ok, it is clear enough. In case of fine-grained page offlining it is
> like a hwpoison page, so a userspace patch for makedumpfile is needed
> to exclude such pages when copying the vmcore.

Exactly - do not touch pages that have no backing in the hypervisor.
Even if the pages were readable on the hypervisor side, it makes no
sense to read/process them, as they are logically offline and their
content is of no importance anymore - a performance improvement and a
possible dump size reduction.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
  2018-05-24 21:07               ` David Hildenbrand
@ 2018-06-11 11:53                 ` David Hildenbrand
  2018-06-11 11:56                   ` Michal Hocko
  2018-07-18 13:19                 ` Michal Hocko
  1 sibling, 1 reply; 42+ messages in thread
From: David Hildenbrand @ 2018-06-11 11:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Alexander Potapenko, Andrew Morton,
	Andrey Ryabinin, Balbir Singh, Baoquan He,
	Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams,
	Dave Young, Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini,
	Huang Ying, Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara,
	Jérôme Glisse, Joonsoo Kim, Juergen Gross,
	Kate Stewart, Kirill A. Shutemov, Matthew Wilcox, Mel Gorman,
	Michael Ellerman, Miles Chen, Oscar Salvador, Paul Mackerras,
	Pavel Tatashin, Philippe Ombredanne, Rashmica Gupta, Reza Arbab,
	Souptick Joarder, Tetsuo Handa, Thomas Gleixner, Vlastimil Babka

On 24.05.2018 23:07, David Hildenbrand wrote:
> On 24.05.2018 16:22, Michal Hocko wrote:
>> I will go over the rest of the email later I just wanted to make this
>> point clear because I suspect we are talking past each other.
> 
> It sounds like we are now talking about how to solve the problem. I like
> that :)
> 

Hi Michal,

did you have time to think about the details of your proposed idea?
(especially the questions I had as response below to make it work at all?)

Personally, I still think that using PG_reserved is wrong and that your
proposal will be significantly more complicated.

Thanks!

>>
>> On Thu 24-05-18 16:04:38, David Hildenbrand wrote:
>> [...]
>>> The point I was making is: I cannot allocate 8MB/128MB using the buddy
>>> allocator. All I want to do is manage the memory a virtio-mem device
>>> provides as flexible as possible.
>>
>> I didn't mean to use the page allocator to isolate pages from it. We do
>> have other means. Have a look at the page isolation framework and have a
>> look how the current memory hotplug (ab)uses it. In short you mark the
>> desired physical memory range as isolated (nobody can allocate from it)
>> and then simply remove it from the page allocator. And you are done with
>> it. Your particular range is gone, nobody will ever use it. If you mark
>> those struct pages reserved then pfn walkers should already ignore them.
>> If you keep those pages with ref count 0 then even hotplug should work
>> seemlessly (I would have to double check).
>>
>> So all I am arguing is that whatever your driver wants to do can be
>> handled without touching the hotplug code much. You would still need
>> to add new ranges in the mem section units and manage on top of that.
>> You need to do that anyway to keep track of what parts are in use or
>> offlined anyway right? Now the mem sections. You have to do that anyway
>> for memmaps. Our sparse memory model simply works in those units. Even
>> if you make a part of that range unavailable then the section will still
>> be there.
>>
>> Do I make at least some sense or I am completely missing your point?
>>
> 
> I think we're heading somewhere. I understand that you want to separate
> this "semi" offline part from the general offlining code. If so, we
> should definitely enforce segment alignment for online_pages/offline_pages.
> 
> Importantly, what I need is:
> 
> 1. Indicate and prepare memory sections to be used for adding memory
>    chunks (right now add_memory())
> 2. Make memory chunks of a section available to the system (right now
>    online_pages())
> 3. Remove memory chunks of a section from the system (right now
>    offline_pages())
> 4. Remove memory sections from the system (right now remove_memory())
> 5. Hinder dumping tools from reading memory chunks that are logically
>    offline (right now PageOffline())
> 6. For 3. find removable memory chunks in a certain memory range with a
>    variable size.
> 
> In an ideal world, 2. would never fail (in contrast to online_pages()
> right now). This might make some further developments I have in mind
> easier :) So if we can come up with an approach that can guarantee that,
> extra points.
> 
> So what I think you are talking about is the following.
> 
> For 1. Use add_memory() followed by online_pages(). Don't actually
>        online the pages, keep them reserved (like XEN balloon). Fixup
>        stats.
> For 2. Expose reserved pages to Buddy allocator. Clear reserved bit.
>        Fixup stats. This can never fail. (yay)
> For 3. Isolate pages, try to move everything away (basically but not
>        completely the offlining code). Set reserved flag. Fixup flags.
> For 4. offline_pages() followed by remove_memory().
>        -> Q: How to distinguish reserved offline from other reserved
>              pages? offline_pages() has to be able to deal with that
> For 5. I don't think we can use reserved flag here.
>        -> Q: What else to use?
> For 6. Scan for movable ranges. The use
> 
> 
> "You need to do that anyway to keep track of what parts are in use or
>  offlined anyway right?"
> 
> I would manually track which chunks of a section is logically offline (I
> do that right now already).
> 
> Is that what you had in mind? If not, where does your idea differ.
> How could we solve 4/5. Of course, PageOffline() is again an option.
> 
> Thanks!
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
  2018-06-11 11:53                 ` David Hildenbrand
@ 2018-06-11 11:56                   ` Michal Hocko
  2018-06-11 12:33                     ` David Hildenbrand
  0 siblings, 1 reply; 42+ messages in thread
From: Michal Hocko @ 2018-06-11 11:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Alexander Potapenko, Andrew Morton,
	Andrey Ryabinin, Balbir Singh, Baoquan He,
	Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams,
	Dave Young, Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini,
	Huang Ying, Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara,
	Jérôme Glisse, Joonsoo Kim, Juergen Gross,
	Kate Stewart, Kirill A. Shutemov, Matthew Wilcox, Mel Gorman,
	Michael Ellerman, Miles Chen, Oscar Salvador, Paul Mackerras,
	Pavel Tatashin, Philippe Ombredanne, Rashmica Gupta, Reza Arbab,
	Souptick Joarder, Tetsuo Handa, Thomas Gleixner, Vlastimil Babka

On Mon 11-06-18 13:53:49, David Hildenbrand wrote:
> On 24.05.2018 23:07, David Hildenbrand wrote:
> > On 24.05.2018 16:22, Michal Hocko wrote:
> >> I will go over the rest of the email later I just wanted to make this
> >> point clear because I suspect we are talking past each other.
> > 
> > It sounds like we are now talking about how to solve the problem. I like
> > that :)
> > 
> 
> Hi Michal,
> 
> did you have time to think about the details of your proposed idea?

Not really. Sorry about that. It's been busy time. I am planning to
revisit after merge window closes.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
  2018-06-11 11:56                   ` Michal Hocko
@ 2018-06-11 12:33                     ` David Hildenbrand
  2018-07-16 19:48                       ` David Hildenbrand
  0 siblings, 1 reply; 42+ messages in thread
From: David Hildenbrand @ 2018-06-11 12:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Alexander Potapenko, Andrew Morton,
	Andrey Ryabinin, Balbir Singh, Baoquan He,
	Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams,
	Dave Young, Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini,
	Huang Ying, Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara,
	Jérôme Glisse, Joonsoo Kim, Juergen Gross,
	Kate Stewart, Kirill A. Shutemov, Matthew Wilcox, Mel Gorman,
	Michael Ellerman, Miles Chen, Oscar Salvador, Paul Mackerras,
	Pavel Tatashin, Philippe Ombredanne, Rashmica Gupta, Reza Arbab,
	Souptick Joarder, Tetsuo Handa, Thomas Gleixner, Vlastimil Babka

On 11.06.2018 13:56, Michal Hocko wrote:
> On Mon 11-06-18 13:53:49, David Hildenbrand wrote:
>> On 24.05.2018 23:07, David Hildenbrand wrote:
>>> On 24.05.2018 16:22, Michal Hocko wrote:
>>>> I will go over the rest of the email later I just wanted to make this
>>>> point clear because I suspect we are talking past each other.
>>>
>>> It sounds like we are now talking about how to solve the problem. I like
>>> that :)
>>>
>>
>> Hi Michal,
>>
>> did you have time to think about the details of your proposed idea?
> 
> Not really. Sorry about that. It's been busy time. I am planning to
> revisit after merge window closes.
> 

Sure no worries, I still have a bunch of other things to work on. But it
would be nice to clarify soon in which direction I have to head to get
this implemented and upstream (e.g. what I proposed, what you proposed
or maybe something different).

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
  2018-06-11 12:33                     ` David Hildenbrand
@ 2018-07-16 19:48                       ` David Hildenbrand
  2018-07-16 20:05                         ` Michal Hocko
  0 siblings, 1 reply; 42+ messages in thread
From: David Hildenbrand @ 2018-07-16 19:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Alexander Potapenko, Andrew Morton,
	Andrey Ryabinin, Balbir Singh, Baoquan He,
	Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams,
	Dave Young, Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini,
	Huang Ying, Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara,
	Jérôme Glisse, Joonsoo Kim, Juergen Gross,
	Kate Stewart, Kirill A. Shutemov, Matthew Wilcox, Mel Gorman,
	Michael Ellerman, Miles Chen, Oscar Salvador, Paul Mackerras,
	Pavel Tatashin, Philippe Ombredanne, Rashmica Gupta, Reza Arbab,
	Souptick Joarder, Tetsuo Handa, Thomas Gleixner, Vlastimil Babka

On 11.06.2018 14:33, David Hildenbrand wrote:
> On 11.06.2018 13:56, Michal Hocko wrote:
>> On Mon 11-06-18 13:53:49, David Hildenbrand wrote:
>>> On 24.05.2018 23:07, David Hildenbrand wrote:
>>>> On 24.05.2018 16:22, Michal Hocko wrote:
>>>>> I will go over the rest of the email later I just wanted to make this
>>>>> point clear because I suspect we are talking past each other.
>>>>
>>>> It sounds like we are now talking about how to solve the problem. I like
>>>> that :)
>>>>
>>>
>>> Hi Michal,
>>>
>>> did you have time to think about the details of your proposed idea?
>>
>> Not really. Sorry about that. It's been busy time. I am planning to
>> revisit after merge window closes.
>>
> 
> Sure no worries, I still have a bunch of other things to work on. But it
> would be nice to clarify soon in which direction I have to head to get
> this implemented and upstream (e.g. what I proposed, what you proposed
> or maybe something different).
> 
I would really like to make progress here.

I pointed out basic problems/questions with the proposed alternative. I
think I answered all your questions. But you also said that you are not
going to accept the current approach. So some decision has to be made.

Although it's very demotivating and frustrating (I hope not all work in
the MM area will be like this), if there is no guidance on how to
proceed, I'll have to switch to adding/removing/onlining/offlining whole
segments. This is not what I want, but maybe this has a higher chance of
getting reviews/acks.

Understanding that you are busy, please if you make suggestions, follow
up on responses.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
  2018-07-16 19:48                       ` David Hildenbrand
@ 2018-07-16 20:05                         ` Michal Hocko
  2018-07-18  9:56                           ` David Hildenbrand
  0 siblings, 1 reply; 42+ messages in thread
From: Michal Hocko @ 2018-07-16 20:05 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Alexander Potapenko, Andrew Morton,
	Andrey Ryabinin, Balbir Singh, Baoquan He,
	Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams,
	Dave Young, Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini,
	Huang Ying, Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara,
	Jérôme Glisse, Joonsoo Kim, Juergen Gross,
	Kate Stewart, Kirill A. Shutemov, Matthew Wilcox, Mel Gorman,
	Michael Ellerman, Miles Chen, Oscar Salvador, Paul Mackerras,
	Pavel Tatashin, Philippe Ombredanne, Rashmica Gupta, Reza Arbab,
	Souptick Joarder, Tetsuo Handa, Thomas Gleixner, Vlastimil Babka

On Mon 16-07-18 21:48:59, David Hildenbrand wrote:
> On 11.06.2018 14:33, David Hildenbrand wrote:
> > On 11.06.2018 13:56, Michal Hocko wrote:
> >> On Mon 11-06-18 13:53:49, David Hildenbrand wrote:
> >>> On 24.05.2018 23:07, David Hildenbrand wrote:
> >>>> On 24.05.2018 16:22, Michal Hocko wrote:
> >>>>> I will go over the rest of the email later I just wanted to make this
> >>>>> point clear because I suspect we are talking past each other.
> >>>>
> >>>> It sounds like we are now talking about how to solve the problem. I like
> >>>> that :)
> >>>>
> >>>
> >>> Hi Michal,
> >>>
> >>> did you have time to think about the details of your proposed idea?
> >>
> >> Not really. Sorry about that. It's been busy time. I am planning to
> >> revisit after merge window closes.
> >>
> > 
> > Sure no worries, I still have a bunch of other things to work on. But it
> > would be nice to clarify soon in which direction I have to head to get
> > this implemented and upstream (e.g. what I proposed, what you proposed
> > or maybe something different).
> > 
> I would really like to make progress here.
> 
> I pointed out basic problems/questions with the proposed alternative. I
> think I answered all your questions. But you also said that you are not
> going to accept the current approach. So some decision has to be made.
> 
> Although it's very demotivating and frustrating (I hope not all work in
> the MM area will be like this), if there is no guidance on how to
> proceed, I'll have to switch to adding/removing/onlining/offlining whole
> segments. This is not what I want, but maybe this has a higher chance of
> getting reviews/acks.
> 
> Understanding that you are busy, please if you make suggestions, follow
> up on responses.

I plan to get back to this. It's busy time with too many things
happening both upstream and on my work table as well. Sorry about that.
I do understand your frustration but there is only so much time I have.
There are not that many people to review this code, unfortunately.

In principle though, I still maintain my position that the memory
hotplug code is way too subtle to add more on top. Maybe the code can be
reworked to be less section oriented but that will be a lot of work.
If you _really_ need a smaller granularity I do not have a better
suggestion than to emulate that on top of sections. I still have to go
back to your last emails though.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
  2018-07-16 20:05                         ` Michal Hocko
@ 2018-07-18  9:56                           ` David Hildenbrand
  2018-07-18 11:23                             ` Michal Hocko
  0 siblings, 1 reply; 42+ messages in thread
From: David Hildenbrand @ 2018-07-18  9:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Alexander Potapenko, Andrew Morton,
	Andrey Ryabinin, Balbir Singh, Baoquan He,
	Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams,
	Dave Young, Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini,
	Huang Ying, Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara,
	Jérôme Glisse, Joonsoo Kim, Juergen Gross,
	Kate Stewart, Kirill A. Shutemov, Matthew Wilcox, Mel Gorman,
	Michael Ellerman, Miles Chen, Oscar Salvador, Paul Mackerras,
	Pavel Tatashin, Philippe Ombredanne, Rashmica Gupta, Reza Arbab,
	Souptick Joarder, Tetsuo Handa, Thomas Gleixner, Vlastimil Babka

On 16.07.2018 22:05, Michal Hocko wrote:
> On Mon 16-07-18 21:48:59, David Hildenbrand wrote:
>> On 11.06.2018 14:33, David Hildenbrand wrote:
>>> On 11.06.2018 13:56, Michal Hocko wrote:
>>>> On Mon 11-06-18 13:53:49, David Hildenbrand wrote:
>>>>> On 24.05.2018 23:07, David Hildenbrand wrote:
>>>>>> On 24.05.2018 16:22, Michal Hocko wrote:
>>>>>>> I will go over the rest of the email later I just wanted to make this
>>>>>>> point clear because I suspect we are talking past each other.
>>>>>>
>>>>>> It sounds like we are now talking about how to solve the problem. I like
>>>>>> that :)
>>>>>>
>>>>>
>>>>> Hi Michal,
>>>>>
>>>>> did you have time to think about the details of your proposed idea?
>>>>
>>>> Not really. Sorry about that. It's been busy time. I am planning to
>>>> revisit after merge window closes.
>>>>
>>>
>>> Sure no worries, I still have a bunch of other things to work on. But it
>>> would be nice to clarify soon in which direction I have to head to get
>>> this implemented and upstream (e.g. what I proposed, what you proposed
>>> or maybe something different).
>>>
>> I would really like to make progress here.
>>
>> I pointed out basic problems/questions with the proposed alternative. I
>> think I answered all your questions. But you also said that you are not
>> going to accept the current approach. So some decision has to be made.
>>
>> Although it's very demotivating and frustrating (I hope not all work in
>> the MM area will be like this), if there is no guidance on how to
>> proceed, I'll have to switch to adding/removing/onlining/offlining whole
>> segments. This is not what I want, but maybe this has a higher chance of
>> getting reviews/acks.
>>
>> Understanding that you are busy, please if you make suggestions, follow
>> up on responses.
> 
> I plan to get back to this. It's busy time with too many things
> happening both upstream and on my work table as well. Sorry about that.
> I do understand your frustration but there is only that much time I
> have. There are not that many people to review this code unfortunately.
> 
> In principle though, I still maintain my position that the memory
> hotplug code is way too subtle to add more on top. Maybe the code can be
> reworked to be less section oriented but that will be a lot of work.
> If you _really_ need a smaller granularity I do not have a better
> suggestion than to emulate that on top of sections. I still have to go
> back to your last emails though.
> 

The only way I see to do this on top is to use a new bit for
marking pages as offline (PageOffline - patch 1).

When a section is added, all pages are initialized to PageOffline.

online_pages() can then be prevented from onlining specific pages using
the well-known hook set_online_page_callback().

In my driver, I can manually "soft offline" parts by setting them to
PageOffline, or "soft online" them again (clearing PageOffline).

offline_pages() can then skip all pages that are already "soft offline"
- PageOffline set - and effectively set the section offline.


Without this new bit, offline_pages() cannot know whether a page is
actually offline or simply reserved by some other part of the system.
Imagine that all parts of a section are "soft offline". Now I want to
offline the section and remove the memory. I would have to temporarily
online all pages again, adding them to the buddy in order to properly
offline them using offline_pages(). That is prone to races, as these
pages must not be touched.

So in summary, PageOffline would have to be used, but
online_pages()/offline_pages() would continue calling e.g. notifiers on
a per-segment basis. This boils down to patch 1 plus another patch that
makes offline_pages() skip pages that are already offline.

Once you have some spare cycles, please let me know what you think or
which other alternatives you see. Thanks.

-- 

Thanks,

David / dhildenb


* Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
  2018-07-18  9:56                           ` David Hildenbrand
@ 2018-07-18 11:23                             ` Michal Hocko
  0 siblings, 0 replies; 42+ messages in thread
From: Michal Hocko @ 2018-07-18 11:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Alexander Potapenko, Andrew Morton,
	Andrey Ryabinin, Balbir Singh, Baoquan He,
	Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams,
	Dave Young, Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini,
	Huang Ying, Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara,
	Jérôme Glisse, Joonsoo Kim, Juergen Gross,
	Kate Stewart, Kirill A. Shutemov, Matthew Wilcox, Mel Gorman,
	Michael Ellerman, Miles Chen, Oscar Salvador, Paul Mackerras,
	Pavel Tatashin, Philippe Ombredanne, Rashmica Gupta, Reza Arbab,
	Souptick Joarder, Tetsuo Handa, Thomas Gleixner, Vlastimil Babka

On Wed 18-07-18 11:56:29, David Hildenbrand wrote:
> On 16.07.2018 22:05, Michal Hocko wrote:
> > On Mon 16-07-18 21:48:59, David Hildenbrand wrote:
> >> On 11.06.2018 14:33, David Hildenbrand wrote:
> >>> On 11.06.2018 13:56, Michal Hocko wrote:
> >>>> On Mon 11-06-18 13:53:49, David Hildenbrand wrote:
> >>>>> On 24.05.2018 23:07, David Hildenbrand wrote:
> >>>>>> On 24.05.2018 16:22, Michal Hocko wrote:
> >>>>>>> I will go over the rest of the email later I just wanted to make this
> >>>>>>> point clear because I suspect we are talking past each other.
> >>>>>>
> >>>>>> It sounds like we are now talking about how to solve the problem. I like
> >>>>>> that :)
> >>>>>>
> >>>>>
> >>>>> Hi Michal,
> >>>>>
> >>>>> did you have time to think about the details of your proposed idea?
> >>>>
> >>>> Not really. Sorry about that. It's been busy time. I am planning to
> >>>> revisit after merge window closes.
> >>>>
> >>>
> >>> Sure no worries, I still have a bunch of other things to work on. But it
> >>> would be nice to clarify soon in which direction I have to head to get
> >>> this implemented and upstream (e.g. what I proposed, what you proposed
> >>> or maybe something different).
> >>>
> >> I would really like to make progress here.
> >>
> >> I pointed out basic problems/questions with the proposed alternative. I
> >> think I answered all your questions. But you also said that you are not
> >> going to accept the current approach. So some decision has to be made.
> >>
> >> Although it's very demotivating and frustrating (I hope not all work in
> >> the MM area will be like this), if there is no guidance on how to
> >> proceed, I'll have to switch to adding/removing/onlining/offlining whole
> >> segments. This is not what I want, but maybe this has a higher chance of
> >> getting reviews/acks.
> >>
> >> Understanding that you are busy, please if you make suggestions, follow
> >> up on responses.
> > 
> > I plan to get back to this. It's busy time with too many things
> > happening both upstream and on my work table as well. Sorry about that.
> > I do understand your frustration but there is only that much time I
> > have. There are not that many people to review this code unfortunately.
> > 
> > In principle though, I still maintain my position that the memory
> > hotplug code is way too subtle to add more on top. Maybe the code can be
> > reworked to be less section oriented but that will be a lot of work.
> > If you _really_ need a smaller granularity I do not have a better
> > suggestion than to emulate that on top of sections. I still have to go
> > back to your last emails though.
> > 
> 
> The only way I see doing the stuff on top will be using a new bit for
> marking pages as offline (PageOffline - Patch 1).
> 
> When a section is added, all pages are initialized to PageOffline.
> 
> online_pages() can be then hindered to online specific pages using the
> well known hook set_online_page_callback().

Not really. You just make those pages unavailable from the page
allocator and mark them reserved. You can keep them linked at your
convenience. If you need to put them back to the allocator, just do so
and drop the reserved bit.
Once you gather a section's worth of pages, you can simply offline and
remove the whole section.

> In my driver, I can manually "soft offline" parts, setting them to
> PageOffline or "soft online" them again (including clearing PageOffline).
> 
> offline_pages() can then skip all pages that are already "soft offline"
> - PageOffline set - and effectively set the section offline.
> 
> 
> Without this new bit offline_pages() cannot know if a page is actually
> offline or simply reserved by some other part of the system. Imagine
> that all parts of a section are "soft offline". Now I want to offline
> the section and remove the memory. I would have to temporarily online
> all pages again, adding them to the buddy in order to properly offline
> them using offline_pages(). Prone to races as these pages must not be
> touched.

Not really. Once the page is reserved it is yours. You can reuse any
parts of the struct page as you wish. You can call the state PageOffline
and teach the offlining code to skip such pages (with some protocol to
ensure they will not become online all of a sudden, of course). I have no
problem integrating that part into the generic hotplug code. It should
be a trivial check at a single place. But I do not think you really need
a new page flag for that.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
  2018-05-24 21:07               ` David Hildenbrand
  2018-06-11 11:53                 ` David Hildenbrand
@ 2018-07-18 13:19                 ` Michal Hocko
  2018-07-18 13:39                   ` David Hildenbrand
  1 sibling, 1 reply; 42+ messages in thread
From: Michal Hocko @ 2018-07-18 13:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Alexander Potapenko, Andrew Morton,
	Andrey Ryabinin, Balbir Singh, Baoquan He,
	Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams,
	Dave Young, Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini,
	Huang Ying, Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara,
	Jérôme Glisse, Joonsoo Kim, Juergen Gross,
	Kate Stewart, Kirill A. Shutemov, Matthew Wilcox, Mel Gorman,
	Michael Ellerman, Miles Chen, Oscar Salvador, Paul Mackerras,
	Pavel Tatashin, Philippe Ombredanne, Rashmica Gupta, Reza Arbab,
	Souptick Joarder, Tetsuo Handa, Thomas Gleixner, Vlastimil Babka

[got back to this really late. Sorry about that]

On Thu 24-05-18 23:07:23, David Hildenbrand wrote:
> On 24.05.2018 16:22, Michal Hocko wrote:
> > I will go over the rest of the email later I just wanted to make this
> > point clear because I suspect we are talking past each other.
> 
> It sounds like we are now talking about how to solve the problem. I like
> that :)
> 
> > 
> > On Thu 24-05-18 16:04:38, David Hildenbrand wrote:
> > [...]
> >> The point I was making is: I cannot allocate 8MB/128MB using the buddy
> >> allocator. All I want to do is manage the memory a virtio-mem device
> >> provides as flexible as possible.
> > 
> > I didn't mean to use the page allocator to isolate pages from it. We do
> > have other means. Have a look at the page isolation framework and have a
> > look how the current memory hotplug (ab)uses it. In short you mark the
> > desired physical memory range as isolated (nobody can allocate from it)
> > and then simply remove it from the page allocator. And you are done with
> > it. Your particular range is gone, nobody will ever use it. If you mark
> > those struct pages reserved then pfn walkers should already ignore them.
> > If you keep those pages with ref count 0 then even hotplug should work
> > seemlessly (I would have to double check).
> > 
> > So all I am arguing is that whatever your driver wants to do can be
> > handled without touching the hotplug code much. You would still need
> > to add new ranges in the mem section units and manage on top of that.
> > You need to do that anyway to keep track of what parts are in use or
> > offlined anyway right? Now the mem sections. You have to do that anyway
> > for memmaps. Our sparse memory model simply works in those units. Even
> > if you make a part of that range unavailable then the section will still
> > be there.
> > 
> > Do I make at least some sense or I am completely missing your point?
> > 
> 
> I think we're heading somewhere. I understand that you want to separate
> this "semi" offline part from the general offlining code. If so, we
> should definitely enforce segment alignment for online_pages/offline_pages.
> 
> Importantly, what I need is:
> 
> 1. Indicate and prepare memory sections to be used for adding memory
>    chunks (right now add_memory())

Yes, this is section based. So you will always get memmap (struct page)
for the whole section.

> 2. Make memory chunks of a section available to the system (right now
>    online_pages())

Yes, this doesn't have to be section based. All you need is to mark
remaining pages as offline. They are reserved at this moment so nobody
should touch them.

> 3. Remove memory chunks of a section from the system (right now
>    offline_pages())

Yes. All we need is to note that those reserved pages are actually good
to offline. I have mentioned that reserved pages are yours at this stage
so you can note the special state without an additional page flag.

The generic hotplug code just has to learn about this new state.
has_unmovable_pages sounds like a proper place to do that. You simply
clear the offline state and the PageReserved and you are done with the
page.

> 4. Remove memory sections from the system (right now remove_memory())

no change needed

> 5. Hinder dumping tools from reading memory chunks that are logically
>    offline (right now PageOffline())

I still fail to see why we even care about some dumping tools. Pages
are reserved, so they simply shouldn't touch that memory at all.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
  2018-07-18 13:19                 ` Michal Hocko
@ 2018-07-18 13:39                   ` David Hildenbrand
  2018-07-18 13:43                     ` Michal Hocko
  0 siblings, 1 reply; 42+ messages in thread
From: David Hildenbrand @ 2018-07-18 13:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Alexander Potapenko, Andrew Morton,
	Andrey Ryabinin, Balbir Singh, Baoquan He,
	Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams,
	Dave Young, Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini,
	Huang Ying, Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara,
	Jérôme Glisse, Joonsoo Kim, Juergen Gross,
	Kate Stewart, Kirill A. Shutemov, Matthew Wilcox, Mel Gorman,
	Michael Ellerman, Miles Chen, Oscar Salvador, Paul Mackerras,
	Pavel Tatashin, Philippe Ombredanne, Rashmica Gupta, Reza Arbab,
	Souptick Joarder, Tetsuo Handa, Thomas Gleixner, Vlastimil Babka

On 18.07.2018 15:19, Michal Hocko wrote:
> [got back to this really late. Sorry about that]
> 
> On Thu 24-05-18 23:07:23, David Hildenbrand wrote:
>> On 24.05.2018 16:22, Michal Hocko wrote:
>>> I will go over the rest of the email later I just wanted to make this
>>> point clear because I suspect we are talking past each other.
>>
>> It sounds like we are now talking about how to solve the problem. I like
>> that :)
>>
>>>
>>> On Thu 24-05-18 16:04:38, David Hildenbrand wrote:
>>> [...]
>>>> The point I was making is: I cannot allocate 8MB/128MB using the buddy
>>>> allocator. All I want to do is manage the memory a virtio-mem device
>>>> provides as flexible as possible.
>>>
>>> I didn't mean to use the page allocator to isolate pages from it. We do
>>> have other means. Have a look at the page isolation framework and have a
>>> look how the current memory hotplug (ab)uses it. In short you mark the
>>> desired physical memory range as isolated (nobody can allocate from it)
>>> and then simply remove it from the page allocator. And you are done with
>>> it. Your particular range is gone, nobody will ever use it. If you mark
>>> those struct pages reserved then pfn walkers should already ignore them.
>>> If you keep those pages with ref count 0 then even hotplug should work
>>> seemlessly (I would have to double check).
>>>
>>> So all I am arguing is that whatever your driver wants to do can be
>>> handled without touching the hotplug code much. You would still need
>>> to add new ranges in the mem section units and manage on top of that.
>>> You need to do that anyway to keep track of what parts are in use or
>>> offlined anyway right? Now the mem sections. You have to do that anyway
>>> for memmaps. Our sparse memory model simply works in those units. Even
>>> if you make a part of that range unavailable then the section will still
>>> be there.
>>>
>>> Do I make at least some sense or I am completely missing your point?
>>>
>>
>> I think we're heading somewhere. I understand that you want to separate
>> this "semi" offline part from the general offlining code. If so, we
>> should definitely enforce segment alignment for online_pages/offline_pages.
>>
>> Importantly, what I need is:
>>
>> 1. Indicate and prepare memory sections to be used for adding memory
>>    chunks (right now add_memory())
> 
> Yes, this is section based. So you will always get memmap (struct page)
> for the whole section.
> 
>> 2. Make memory chunks of a section available to the system (right now
>>    online_pages())
> 
> Yes, this doesn't have to be section based. All you need is to mark
> remaining pages as offline. They are reserved at this moment so nobody
> should touch tehem.
> 
>> 3. Remove memory chunks of a section from the system (right now
>>    offline_pages())
> 
> Yes. All we need is to note that those reserved pages are actually good
> to offline. I have mentioned that reserved pages are yours at this stage
> so you can note the special state without an additional page flag.
> 
> The generic hotplug code just have to learn about this new state.
> has_unmovable_pages sounds like a proper place to do that. You simply
> clear the offline state and the PageReserved and you are done with the
> page.
> 

I agree. This would be minimally invasive - notifiers are still called
on whole segments.

>> 4. Remove memory sections from the system (right now remove_memory())
> 
> no change needed
> 
>> 5. Hinder dumping tools from reading memory chunks that are logically
>>    offline (right now PageOffline())
> 
> I still fail to see why do we even care about some dumping tools. Pages
> are reserved so they simply shouldn't touch that memory at all.
> 

Thanks for having a look!

I wonder why reserved pages were never excluded by dump tools. So I
assume there is some magic hidden in them.

`git grep SetPageReserved` returns a number of buffers that are not to
be swapped. So "reserved" there is used for:
  "PG_reserved is set for special pages, which can never be swapped out"

And my point would be that these pages are still to be dumped (just as
it is being done now). They are valid memory.

It seems like this bit is used for two different purposes. My take, then,
would be to have another way of indicating "don't swap" vs. "page not
accessible / offline". And that's why I propose PageOffline.

I would even go one step further and rename "reserved" to "dontswap".


-- 

Thanks,

David / dhildenb


* Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
  2018-07-18 13:39                   ` David Hildenbrand
@ 2018-07-18 13:43                     ` Michal Hocko
  2018-07-18 13:47                       ` David Hildenbrand
  0 siblings, 1 reply; 42+ messages in thread
From: Michal Hocko @ 2018-07-18 13:43 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Alexander Potapenko, Andrew Morton,
	Andrey Ryabinin, Balbir Singh, Baoquan He,
	Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams,
	Dave Young, Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini,
	Huang Ying, Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara,
	Jérôme Glisse, Joonsoo Kim, Juergen Gross,
	Kate Stewart, Kirill A. Shutemov, Matthew Wilcox, Mel Gorman,
	Michael Ellerman, Miles Chen, Oscar Salvador, Paul Mackerras,
	Pavel Tatashin, Philippe Ombredanne, Rashmica Gupta, Reza Arbab,
	Souptick Joarder, Tetsuo Handa, Thomas Gleixner, Vlastimil Babka

On Wed 18-07-18 15:39:29, David Hildenbrand wrote:
> On 18.07.2018 15:19, Michal Hocko wrote:
> > [got back to this really late. Sorry about that]
> > 
> > On Thu 24-05-18 23:07:23, David Hildenbrand wrote:
> >> On 24.05.2018 16:22, Michal Hocko wrote:
> >>> I will go over the rest of the email later I just wanted to make this
> >>> point clear because I suspect we are talking past each other.
> >>
> >> It sounds like we are now talking about how to solve the problem. I like
> >> that :)
> >>
> >>>
> >>> On Thu 24-05-18 16:04:38, David Hildenbrand wrote:
> >>> [...]
> >>>> The point I was making is: I cannot allocate 8MB/128MB using the buddy
> >>>> allocator. All I want to do is manage the memory a virtio-mem device
> >>>> provides as flexible as possible.
> >>>
> >>> I didn't mean to use the page allocator to isolate pages from it. We do
> >>> have other means. Have a look at the page isolation framework and have a
> >>> look how the current memory hotplug (ab)uses it. In short you mark the
> >>> desired physical memory range as isolated (nobody can allocate from it)
> >>> and then simply remove it from the page allocator. And you are done with
> >>> it. Your particular range is gone, nobody will ever use it. If you mark
> >>> those struct pages reserved then pfn walkers should already ignore them.
> >>> If you keep those pages with ref count 0 then even hotplug should work
> >>> seemlessly (I would have to double check).
> >>>
> >>> So all I am arguing is that whatever your driver wants to do can be
> >>> handled without touching the hotplug code much. You would still need
> >>> to add new ranges in the mem section units and manage on top of that.
> >>> You need to do that anyway to keep track of what parts are in use or
> >>> offlined anyway right? Now the mem sections. You have to do that anyway
> >>> for memmaps. Our sparse memory model simply works in those units. Even
> >>> if you make a part of that range unavailable then the section will still
> >>> be there.
> >>>
> >>> Do I make at least some sense or I am completely missing your point?
> >>>
> >>
> >> I think we're heading somewhere. I understand that you want to separate
> >> this "semi" offline part from the general offlining code. If so, we
> >> should definitely enforce segment alignment for online_pages/offline_pages.
> >>
> >> Importantly, what I need is:
> >>
> >> 1. Indicate and prepare memory sections to be used for adding memory
> >>    chunks (right now add_memory())
> > 
> > Yes, this is section based. So you will always get memmap (struct page)
> > for the whole section.
> > 
> >> 2. Make memory chunks of a section available to the system (right now
> >>    online_pages())
> > 
> > Yes, this doesn't have to be section based. All you need is to mark
> > remaining pages as offline. They are reserved at this moment so nobody
> > should touch tehem.
> > 
> >> 3. Remove memory chunks of a section from the system (right now
> >>    offline_pages())
> > 
> > Yes. All we need is to note that those reserved pages are actually good
> > to offline. I have mentioned that reserved pages are yours at this stage
> > so you can note the special state without an additional page flag.
> > 
> > The generic hotplug code just have to learn about this new state.
> > has_unmovable_pages sounds like a proper place to do that. You simply
> > clear the offline state and the PageReserved and you are done with the
> > page.
> > 
> 
> I agree. This would be minimal invassive - notifiers are still called on
> whole segment.

That shouldn't matter because notifiers should never step on pages they
do not manage or own.

> >> 4. Remove memory sections from the system (right now remove_memory())
> > 
> > no change needed
> > 
> >> 5. Hinder dumping tools from reading memory chunks that are logically
> >>    offline (right now PageOffline())
> > 
> > I still fail to see why do we even care about some dumping tools. Pages
> > are reserved so they simply shouldn't touch that memory at all.
> > 
> 
> Thanks for having a look!
> 
> I wonder why reserved pages never got excluded by dump tools. So I
> assume there is some kind of magic hidden in it.
> 
> `git grep SetPageReserved` returns a number of buffers that are not to
> be swapped. So "reserved" there is used for:
>   "PG_reserved is set for special pages, which can never be swapped out"

That was an ancient meaning of the flag. The flag in general means that
you shouldn't touch it unless you own it.

> And my point would be that these pages are still to be dumped (just as
> it is being done now). They are valid memory.

Then fix kdump or whatever is touching them.

> It seems like this bit is used for two different purposes. My take would
> be then to have another way of indicating "don't swap" vs. "page not
> accessible / offline". And that's why I propose PageOffline.
> 
> I would even go one step further and rename "reserved" to "dontswap".

No, it really doesn't have that meaning for years.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
  2018-07-18 13:43                     ` Michal Hocko
@ 2018-07-18 13:47                       ` David Hildenbrand
  2018-07-18 13:56                         ` Michal Hocko
  0 siblings, 1 reply; 42+ messages in thread
From: David Hildenbrand @ 2018-07-18 13:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Alexander Potapenko, Andrew Morton,
	Andrey Ryabinin, Balbir Singh, Baoquan He,
	Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams,
	Dave Young, Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini,
	Huang Ying, Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara,
	Jérôme Glisse, Joonsoo Kim, Juergen Gross,
	Kate Stewart, Kirill A. Shutemov, Matthew Wilcox, Mel Gorman,
	Michael Ellerman, Miles Chen, Oscar Salvador, Paul Mackerras,
	Pavel Tatashin, Philippe Ombredanne, Rashmica Gupta, Reza Arbab,
	Souptick Joarder, Tetsuo Handa, Thomas Gleixner, Vlastimil Babka

On 18.07.2018 15:43, Michal Hocko wrote:
> On Wed 18-07-18 15:39:29, David Hildenbrand wrote:
>> On 18.07.2018 15:19, Michal Hocko wrote:
>>> [got back to this really late. Sorry about that]
>>>
>>> On Thu 24-05-18 23:07:23, David Hildenbrand wrote:
>>>> On 24.05.2018 16:22, Michal Hocko wrote:
>>>>> I will go over the rest of the email later I just wanted to make this
>>>>> point clear because I suspect we are talking past each other.
>>>>
>>>> It sounds like we are now talking about how to solve the problem. I like
>>>> that :)
>>>>
>>>>>
>>>>> On Thu 24-05-18 16:04:38, David Hildenbrand wrote:
>>>>> [...]
>>>>>> The point I was making is: I cannot allocate 8MB/128MB using the buddy
>>>>>> allocator. All I want to do is manage the memory a virtio-mem device
>>>>>> provides as flexible as possible.
>>>>>
>>>>> I didn't mean to use the page allocator to isolate pages from it. We do
>>>>> have other means. Have a look at the page isolation framework and have a
>>>>> look how the current memory hotplug (ab)uses it. In short you mark the
>>>>> desired physical memory range as isolated (nobody can allocate from it)
>>>>> and then simply remove it from the page allocator. And you are done with
>>>>> it. Your particular range is gone, nobody will ever use it. If you mark
>>>>> those struct pages reserved then pfn walkers should already ignore them.
>>>>> If you keep those pages with ref count 0 then even hotplug should work
>>>>> seemlessly (I would have to double check).
>>>>>
>>>>> So all I am arguing is that whatever your driver wants to do can be
>>>>> handled without touching the hotplug code much. You would still need
>>>>> to add new ranges in the mem section units and manage on top of that.
>>>>> You need to do that anyway to keep track of what parts are in use or
>>>>> offlined anyway right? Now the mem sections. You have to do that anyway
>>>>> for memmaps. Our sparse memory model simply works in those units. Even
>>>>> if you make a part of that range unavailable then the section will still
>>>>> be there.
>>>>>
>>>>> Do I make at least some sense or I am completely missing your point?
>>>>>
>>>>
>>>> I think we're heading somewhere. I understand that you want to separate
>>>> this "semi" offline part from the general offlining code. If so, we
>>>> should definitely enforce segment alignment for online_pages/offline_pages.
>>>>
>>>> Importantly, what I need is:
>>>>
>>>> 1. Indicate and prepare memory sections to be used for adding memory
>>>>    chunks (right now add_memory())
>>>
>>> Yes, this is section based. So you will always get memmap (struct page)
>>> for the whole section.
>>>
>>>> 2. Make memory chunks of a section available to the system (right now
>>>>    online_pages())
>>>
>>> Yes, this doesn't have to be section based. All you need is to mark
>>> remaining pages as offline. They are reserved at this moment so nobody
>>> should touch tehem.
>>>
>>>> 3. Remove memory chunks of a section from the system (right now
>>>>    offline_pages())
>>>
>>> Yes. All we need is to note that those reserved pages are actually good
>>> to offline. I have mentioned that reserved pages are yours at this stage
>>> so you can note the special state without an additional page flag.
>>>
>>> The generic hotplug code just have to learn about this new state.
>>> has_unmovable_pages sounds like a proper place to do that. You simply
>>> clear the offline state and the PageReserved and you are done with the
>>> page.
>>>
>>
>> I agree. This would be minimal invassive - notifiers are still called on
>> whole segment.
> 
> That shouldn't matter because notifiers should never step on pages they
> do not manage or own.
> 
>>>> 4. Remove memory sections from the system (right now remove_memory())
>>>
>>> no change needed
>>>
>>>> 5. Hinder dumping tools from reading memory chunks that are logically
>>>>    offline (right now PageOffline())
>>>
>>> I still fail to see why do we even care about some dumping tools. Pages
>>> are reserved so they simply shouldn't touch that memory at all.
>>>
>>
>> Thanks for having a look!
>>
>> I wonder why reserved pages never got excluded by dump tools. So I
>> assume there is some kind of magic hidden in it.
>>
>> `git grep SetPageReserved` returns a number of buffers that are not to
>> be swapped. So "reserved" there is used for:
>>   "PG_reserved is set for special pages, which can never be swapped out"
> 
> That was an ancient menaing of the flag. The flag in general means that
> you shouldn't touch it unless you own it.
> 
>> And my point would be that these pages are still to be dumped (just as
>> it is being done now). They are valid memory.
> 
> Then fix kdump or what ever is touching them.

If the rule is really reserved -> don't touch, then I agree.

> 
>> It seems like this bit is used for two different purposes. My take would
>> be then to have another way of indicating "don't swap" vs. "page not
>> accessible / offline". And that's why I propose PageOffline.
>>
>> I would even go one step further and rename "reserved" to "dontswap".
> 
> No, it really doesn't have that meaning for years.
> 

So would you agree to change the comment in page-flags.h to something like

"PG_reserved is set for special pages, that should never be touched
(read/written). Some of them might not even exist."

Thanks!

-- 

Thanks,

David / dhildenb


* Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
  2018-07-18 13:47                       ` David Hildenbrand
@ 2018-07-18 13:56                         ` Michal Hocko
  0 siblings, 0 replies; 42+ messages in thread
From: Michal Hocko @ 2018-07-18 13:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Alexander Potapenko, Andrew Morton,
	Andrey Ryabinin, Balbir Singh, Baoquan He,
	Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams,
	Dave Young, Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini,
	Huang Ying, Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara,
	Jérôme Glisse, Joonsoo Kim, Juergen Gross,
	Kate Stewart, Kirill A. Shutemov, Matthew Wilcox, Mel Gorman,
	Michael Ellerman, Miles Chen, Oscar Salvador, Paul Mackerras,
	Pavel Tatashin, Philippe Ombredanne, Rashmica Gupta, Reza Arbab,
	Souptick Joarder, Tetsuo Handa, Thomas Gleixner, Vlastimil Babka

On Wed 18-07-18 15:47:33, David Hildenbrand wrote:
[...]
> So would you agree to change the comment in page-flags.h to something like
> 
> "PG_reserved is set for special pages that should never be touched
> (read/written). Some of them might not even exist."

Yes. Except you should mention that they shouldn't be touched by anybody
except their owners.
-- 
Michal Hocko
SUSE Labs


end of thread, other threads:[~2018-07-18 13:56 UTC | newest]

Thread overview: 42+ messages
2018-05-23 15:11 [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver David Hildenbrand
2018-05-23 15:11 ` [PATCH v1 01/10] mm: introduce and use PageOffline() David Hildenbrand
2018-05-23 15:11   ` David Hildenbrand
2018-05-23 15:11 ` [PATCH v1 02/10] mm/page_ext.c: support online/offline of memory < section size David Hildenbrand
2018-05-23 15:11 ` [PATCH v1 03/10] kasan: prepare for online/offline of different start/size David Hildenbrand
2018-05-23 15:11 ` [PATCH v1 04/10] kdump: include PAGE_OFFLINE_MAPCOUNT_VALUE in VMCOREINFO David Hildenbrand
2018-05-23 15:11 ` [PATCH v1 05/10] mm/memory_hotplug: limit offline_pages() to sizes we can actually handle David Hildenbrand
2018-05-23 15:11 ` [PATCH v1 06/10] mm/memory_hotplug: onlining pages can only fail due to notifiers David Hildenbrand
2018-05-23 15:11 ` [PATCH v1 07/10] mm/memory_hotplug: print only with DEBUG_VM in online/offline_pages() David Hildenbrand
2018-05-23 15:11 ` [PATCH v1 08/10] mm/memory_hotplug: allow to control onlining/offlining of memory by a driver David Hildenbrand
2018-05-23 15:11 ` [PATCH v1 09/10] mm/memory_hotplug: teach offline_pages() to not try forever David Hildenbrand
2018-05-24 14:39   ` Michal Hocko
2018-05-24 20:36     ` David Hildenbrand
2018-05-23 15:11 ` [PATCH v1 10/10] mm/memory_hotplug: allow online/offline memory by a kernel module David Hildenbrand
2018-05-23 19:51   ` Christoph Hellwig
2018-05-24  5:59     ` David Hildenbrand
2018-05-24  7:53 ` [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver Michal Hocko
2018-05-24  8:31   ` David Hildenbrand
2018-05-24  8:56     ` Dave Young
2018-05-24  9:14       ` David Hildenbrand
2018-05-28  8:28         ` Dave Young
2018-05-28 10:03           ` David Hildenbrand
2018-05-24  9:31     ` Michal Hocko
2018-05-24 10:45       ` David Hildenbrand
2018-05-24 12:03         ` Michal Hocko
2018-05-24 14:04           ` David Hildenbrand
2018-05-24 14:22             ` Michal Hocko
2018-05-24 21:07               ` David Hildenbrand
2018-06-11 11:53                 ` David Hildenbrand
2018-06-11 11:56                   ` Michal Hocko
2018-06-11 12:33                     ` David Hildenbrand
2018-07-16 19:48                       ` David Hildenbrand
2018-07-16 20:05                         ` Michal Hocko
2018-07-18  9:56                           ` David Hildenbrand
2018-07-18 11:23                             ` Michal Hocko
2018-07-18 13:19                 ` Michal Hocko
2018-07-18 13:39                   ` David Hildenbrand
2018-07-18 13:43                     ` Michal Hocko
2018-07-18 13:47                       ` David Hildenbrand
2018-07-18 13:56                         ` Michal Hocko
2018-05-25 15:08           ` David Hildenbrand
2018-05-25 15:08             ` David Hildenbrand
