* [PATCH RFC] mm/memory_hotplug: Introduce memory block types
@ 2018-09-28 15:03 ` David Hildenbrand
  0 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-09-28 15:03 UTC (permalink / raw)
  To: linux-mm
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Dave Hansen,
	Heiko Carstens, Pavel Tatashin, Michal Hocko, Paul Mackerras,
	H. Peter Anvin, Rashmica Gupta, Boris Ostrovsky, linux-s390,
	Michael Neuling, Stephen Hemminger, Yoshinori Sato,
	Michael Ellerman, David Hildenbrand, linux-acpi, Ingo Molnar,
	xen-devel

How and when to online hotplugged memory is hard to manage for
distributions because different memory types have to be treated
differently. Right now, we need complicated udev rules that e.g. check
whether we are running on s390x, on a physical system or on a
virtualized system. But sometimes there is also the demand to online
memory immediately in the kernel while adding it, without waiting for
user space to make a decision. And on virtualized systems the
requirements might differ, depending on "how" the memory was added (and
whether it will eventually get unplugged again - DIMM vs.
paravirtualized mechanisms).
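For illustration, the kind of rule distributions ship today looks
roughly like the following sketch (the file name is hypothetical and the
exact rule varies per distribution; this is an assumption, not a copy of
any shipped rule):

```
# /usr/lib/udev/rules.d/40-memory-hotplug.rules (hypothetical path)
# Online hot-added memory blocks - but not on s390x, where standby
# memory must only be onlined on explicit admin request.
SUBSYSTEM=="memory", ACTION=="add", PROGRAM="/bin/uname -m", \
    RESULT!="s390x", ATTR{state}=="offline", ATTR{state}="online"
```

Every new special case (another architecture, another hypervisor) grows
rules like this further, which is exactly the complexity this patch
tries to move out of user space.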

On the one hand, we have physical systems where we sometimes want to be
able to unplug memory again - e.g. a DIMM - so we optionally have to
online it to the MOVABLE zone. That decision is usually made in user
space.

On the other hand, we have memory that should never be onlined
automatically, only when asked for by an administrator. Such memory
currently only exists in virtualized environments like s390x, where the
concept of "standby" memory exists: memory is detected and added during
boot, so it can be onlined later, when requested by the administrator or
some tooling. Only when it is onlined will the memory actually be
allocated in the hypervisor.

But then, we also have paravirtualized devices (namely Xen and Hyper-V
balloons) that hotplug memory that - right now - will never be removed
from the system again via offline_pages/remove_memory. If at all, such
memory is logically unplugged and handed back to the hypervisor via
ballooning.

For paravirtualized devices it is relevant that memory is onlined as
quickly as possible after adding it - and that it is added to the NORMAL
zone. Otherwise, too much memory could be added in a row (but not
onlined), resulting in out-of-memory conditions due to the additional
memory consumed by "struct pages" and friends. The MOVABLE zone as well
as onlining delays might be very problematic and lead to crashes (e.g.
zone imbalance).
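To make the "struct pages" concern concrete, a rough back-of-envelope
sketch (assuming 4 KiB pages and a 64-byte struct page - typical values,
but both are configuration-dependent):

```shell
# Memmap overhead (in MiB) of memory that was added but not yet onlined;
# the struct pages backing the new range consume already-online memory.
memmap_overhead_mib() {
	local added_mib=$1
	# one 64-byte struct page per 4096-byte page => 64/4096 = 1/64
	echo $(( added_mib * 64 / 4096 ))
}

# e.g. hot-adding 16 GiB (16384 MiB) without onlining it consumes
# roughly 256 MiB of memory that is already online
memmap_overhead_mib 16384
```

So each added-but-not-onlined range eats roughly 1.5% of its size from
already-online memory, which is how adding too much memory in a row can
run the system out of memory.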

Therefore, introduce memory block types and, when adding memory, online
it depending on its type. Expose the memory block type to user space, so
user space handlers can start to process only "normal" memory; other
memory block types can simply be ignored. One thing less to worry about
in user space.
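With the new "type" attribute, a user space handler could shrink to
something like the following sketch (the sysfs attribute matches this
patch; the policy itself is of course just an assumption about what a
distribution might choose):

```shell
# Decide whether user space should online a memory block, given the
# contents of /sys/devices/system/memory/memoryX/type from this patch.
should_online() {
	case "$1" in
	normal)   echo yes ;;  # DIMMs etc.: apply distribution policy
	standby)  echo no  ;;  # s390x standby: only on admin request
	paravirt) echo no  ;;  # Xen/Hyper-V: the kernel onlines it itself
	*)        echo no  ;;  # unknown type: be conservative
	esac
}
```

A udev handler would then do something along the lines of
[ "$(should_online "$(cat /sys$DEVPATH/type)")" = yes ] &&
echo online > /sys$DEVPATH/state - no more architecture or hypervisor
special-casing.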

Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: "Jonathan Neuschäfer" <j.neuschaefer@gmx.net>
Cc: Joe Perches <joe@perches.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Rashmica Gupta <rashmica.g@gmail.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: "mike.travis@hpe.com" <mike.travis@hpe.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mathieu Malaterre <malat@debian.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---

This patch is based on the current mm-tree, where some related patches
of mine that touch the add_memory() functions currently reside.

 arch/ia64/mm/init.c                       |  4 +-
 arch/powerpc/mm/mem.c                     |  4 +-
 arch/powerpc/platforms/powernv/memtrace.c |  3 +-
 arch/s390/mm/init.c                       |  4 +-
 arch/sh/mm/init.c                         |  4 +-
 arch/x86/mm/init_32.c                     |  4 +-
 arch/x86/mm/init_64.c                     |  8 +--
 drivers/acpi/acpi_memhotplug.c            |  3 +-
 drivers/base/memory.c                     | 63 ++++++++++++++++++++---
 drivers/hv/hv_balloon.c                   | 33 ++----------
 drivers/s390/char/sclp_cmd.c              |  3 +-
 drivers/xen/balloon.c                     |  2 +-
 include/linux/memory.h                    | 28 +++++++++-
 include/linux/memory_hotplug.h            | 17 +++---
 mm/hmm.c                                  |  6 ++-
 mm/memory_hotplug.c                       | 31 ++++++-----
 16 files changed, 139 insertions(+), 78 deletions(-)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index d5e12ff1d73c..813d1d86bf95 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -646,13 +646,13 @@ mem_init (void)
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+		    int memory_block_type)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
-	ret = __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	ret = __add_pages(nid, start_pfn, nr_pages, altmap, memory_block_type);
 	if (ret)
 		printk("%s: Problem encountered in __add_pages() as ret=%d\n",
 		       __func__,  ret);
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 5551f5870dcc..dd32fcc9099c 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -118,7 +118,7 @@ int __weak remove_section_mapping(unsigned long start, unsigned long end)
 }
 
 int __meminit arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+			      int memory_block_type)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
@@ -135,7 +135,7 @@ int __meminit arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *
 	}
 	flush_inval_dcache_range(start, start + size);
 
-	return __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	return __add_pages(nid, start_pfn, nr_pages, altmap, memory_block_type);
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c
index 84d038ed3882..57d6b3d46382 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -232,7 +232,8 @@ static int memtrace_online(void)
 			ent->mem = 0;
 		}
 
-		if (add_memory(ent->nid, ent->start, ent->size)) {
+		if (add_memory(ent->nid, ent->start, ent->size,
+			       MEMORY_BLOCK_NORMAL)) {
 			pr_err("Failed to add trace memory to node %d\n",
 				ent->nid);
 			ret += 1;
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index e472cd763eb3..b5324527c7f6 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -222,7 +222,7 @@ device_initcall(s390_cma_mem_init);
 #endif /* CONFIG_CMA */
 
 int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+		    int memory_block_type)
 {
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long size_pages = PFN_DOWN(size);
@@ -232,7 +232,7 @@ int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
 	if (rc)
 		return rc;
 
-	rc = __add_pages(nid, start_pfn, size_pages, altmap, want_memblock);
+	rc = __add_pages(nid, start_pfn, size_pages, altmap, memory_block_type);
 	if (rc)
 		vmem_remove_mapping(start, size);
 	return rc;
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index c8c13c777162..6b876000731a 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -419,14 +419,14 @@ void free_initrd_mem(unsigned long start, unsigned long end)
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+		    int memory_block_type)
 {
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
 	/* We only have ZONE_NORMAL, so this is easy.. */
-	ret = __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	ret = __add_pages(nid, start_pfn, nr_pages, altmap, memory_block_type);
 	if (unlikely(ret))
 		printk("%s: Failed, __add_pages() == %d\n", __func__, ret);
 
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index f2837e4c40b3..4f50cd4467a9 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -851,12 +851,12 @@ void __init mem_init(void)
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+		    int memory_block_type)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
-	return __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	return __add_pages(nid, start_pfn, nr_pages, altmap, memory_block_type);
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 5fab264948c2..fc3df573f0f3 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -783,11 +783,11 @@ static void update_end_of_memory_vars(u64 start, u64 size)
 }
 
 int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
-		struct vmem_altmap *altmap, bool want_memblock)
+		struct vmem_altmap *altmap, int memory_block_type)
 {
 	int ret;
 
-	ret = __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	ret = __add_pages(nid, start_pfn, nr_pages, altmap, memory_block_type);
 	WARN_ON_ONCE(ret);
 
 	/* update max_pfn, max_low_pfn and high_memory */
@@ -798,14 +798,14 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
 }
 
 int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+		    int memory_block_type)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
 	init_memory_mapping(start, start + size);
 
-	return add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	return add_pages(nid, start_pfn, nr_pages, altmap, memory_block_type);
 }
 
 #define PAGE_INUSE 0xFD
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index 8fe0960ea572..c5f646b4e97e 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -228,7 +228,8 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
 		if (node < 0)
 			node = memory_add_physaddr_to_nid(info->start_addr);
 
-		result = __add_memory(node, info->start_addr, info->length);
+		result = __add_memory(node, info->start_addr, info->length,
+				      MEMORY_BLOCK_NORMAL);
 
 		/*
 		 * If the memory block has been used by the kernel, add_memory()
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 0e5985682642..2686101e41b5 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -381,6 +381,32 @@ static ssize_t show_phys_device(struct device *dev,
 	return sprintf(buf, "%d\n", mem->phys_device);
 }
 
+static ssize_t type_show(struct device *dev, struct device_attribute *attr,
+			 char *buf)
+{
+	struct memory_block *mem = to_memory_block(dev);
+	ssize_t len = 0;
+
+	switch (mem->type) {
+	case MEMORY_BLOCK_NORMAL:
+		len = sprintf(buf, "normal\n");
+		break;
+	case MEMORY_BLOCK_STANDBY:
+		len = sprintf(buf, "standby\n");
+		break;
+	case MEMORY_BLOCK_PARAVIRT:
+		len = sprintf(buf, "paravirt\n");
+		break;
+	default:
+		len = sprintf(buf, "ERROR-UNKNOWN-%d\n", mem->type);
+		WARN_ON(1);
+		break;
+	}
+
+	return len;
+}
+
 #ifdef CONFIG_MEMORY_HOTREMOVE
 static void print_allowed_zone(char *buf, int nid, unsigned long start_pfn,
 		unsigned long nr_pages, int online_type,
@@ -442,6 +468,7 @@ static DEVICE_ATTR(phys_index, 0444, show_mem_start_phys_index, NULL);
 static DEVICE_ATTR(state, 0644, show_mem_state, store_mem_state);
 static DEVICE_ATTR(phys_device, 0444, show_phys_device, NULL);
 static DEVICE_ATTR(removable, 0444, show_mem_removable, NULL);
+static DEVICE_ATTR_RO(type);
 
 /*
  * Block size attribute stuff
@@ -514,7 +541,8 @@ memory_probe_store(struct device *dev, struct device_attribute *attr,
 
 	nid = memory_add_physaddr_to_nid(phys_addr);
 	ret = __add_memory(nid, phys_addr,
-			   MIN_MEMORY_BLOCK_SIZE * sections_per_block);
+			   MIN_MEMORY_BLOCK_SIZE * sections_per_block,
+			   MEMORY_BLOCK_NORMAL);
 
 	if (ret)
 		goto out;
@@ -620,6 +648,7 @@ static struct attribute *memory_memblk_attrs[] = {
 	&dev_attr_state.attr,
 	&dev_attr_phys_device.attr,
 	&dev_attr_removable.attr,
+	&dev_attr_type.attr,
 #ifdef CONFIG_MEMORY_HOTREMOVE
 	&dev_attr_valid_zones.attr,
 #endif
@@ -657,13 +686,17 @@ int register_memory(struct memory_block *memory)
 }
 
 static int init_memory_block(struct memory_block **memory,
-			     struct mem_section *section, unsigned long state)
+			     struct mem_section *section, unsigned long state,
+			     int memory_block_type)
 {
 	struct memory_block *mem;
 	unsigned long start_pfn;
 	int scn_nr;
 	int ret = 0;
 
+	if (memory_block_type == MEMORY_BLOCK_NONE)
+		return -EINVAL;
+
 	mem = kzalloc(sizeof(*mem), GFP_KERNEL);
 	if (!mem)
 		return -ENOMEM;
@@ -675,6 +708,7 @@ static int init_memory_block(struct memory_block **memory,
 	mem->state = state;
 	start_pfn = section_nr_to_pfn(mem->start_section_nr);
 	mem->phys_device = arch_get_memory_phys_device(start_pfn);
+	mem->type = memory_block_type;
 
 	ret = register_memory(mem);
 
@@ -699,7 +733,8 @@ static int add_memory_block(int base_section_nr)
 
 	if (section_count == 0)
 		return 0;
-	ret = init_memory_block(&mem, __nr_to_section(section_nr), MEM_ONLINE);
+	ret = init_memory_block(&mem, __nr_to_section(section_nr), MEM_ONLINE,
+				MEMORY_BLOCK_NORMAL);
 	if (ret)
 		return ret;
 	mem->section_count = section_count;
@@ -710,19 +745,35 @@ static int add_memory_block(int base_section_nr)
  * need an interface for the VM to add new memory regions,
  * but without onlining it.
  */
-int hotplug_memory_register(int nid, struct mem_section *section)
+int hotplug_memory_register(int nid, struct mem_section *section,
+			    int memory_block_type)
 {
 	int ret = 0;
 	struct memory_block *mem;
 
 	mutex_lock(&mem_sysfs_mutex);
 
+	/* make sure there is no memblock if we don't want one */
+	if (memory_block_type == MEMORY_BLOCK_NONE) {
+		mem = find_memory_block(section);
+		if (mem) {
+			put_device(&mem->dev);
+			ret = -EINVAL;
+		}
+		goto out;
+	}
+
 	mem = find_memory_block(section);
 	if (mem) {
-		mem->section_count++;
+		/* make sure the type matches */
+		if (mem->type == memory_block_type)
+			mem->section_count++;
+		else
+			ret = -EINVAL;
 		put_device(&mem->dev);
 	} else {
-		ret = init_memory_block(&mem, section, MEM_OFFLINE);
+		ret = init_memory_block(&mem, section, MEM_OFFLINE,
+					memory_block_type);
 		if (ret)
 			goto out;
 		mem->section_count++;
diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index b1b788082793..5a8d18c4d699 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -537,11 +537,6 @@ struct hv_dynmem_device {
 	 */
 	bool host_specified_ha_region;
 
-	/*
-	 * State to synchronize hot-add.
-	 */
-	struct completion  ol_waitevent;
-	bool ha_waiting;
 	/*
 	 * This thread handles hot-add
 	 * requests from the host as well as notifying
@@ -640,14 +635,6 @@ static int hv_memory_notifier(struct notifier_block *nb, unsigned long val,
 	unsigned long flags, pfn_count;
 
 	switch (val) {
-	case MEM_ONLINE:
-	case MEM_CANCEL_ONLINE:
-		if (dm_device.ha_waiting) {
-			dm_device.ha_waiting = false;
-			complete(&dm_device.ol_waitevent);
-		}
-		break;
-
 	case MEM_OFFLINE:
 		spin_lock_irqsave(&dm_device.ha_lock, flags);
 		pfn_count = hv_page_offline_check(mem->start_pfn,
@@ -665,9 +652,7 @@ static int hv_memory_notifier(struct notifier_block *nb, unsigned long val,
 		}
 		spin_unlock_irqrestore(&dm_device.ha_lock, flags);
 		break;
-	case MEM_GOING_ONLINE:
-	case MEM_GOING_OFFLINE:
-	case MEM_CANCEL_OFFLINE:
+	default:
 		break;
 	}
 	return NOTIFY_OK;
@@ -731,12 +716,10 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size,
 		has->covered_end_pfn +=  processed_pfn;
 		spin_unlock_irqrestore(&dm_device.ha_lock, flags);
 
-		init_completion(&dm_device.ol_waitevent);
-		dm_device.ha_waiting = !memhp_auto_online;
-
 		nid = memory_add_physaddr_to_nid(PFN_PHYS(start_pfn));
 		ret = add_memory(nid, PFN_PHYS((start_pfn)),
-				(HA_CHUNK << PAGE_SHIFT));
+				 (HA_CHUNK << PAGE_SHIFT),
+				 MEMORY_BLOCK_PARAVIRT);
 
 		if (ret) {
 			pr_err("hot_add memory failed error is %d\n", ret);
@@ -757,16 +740,6 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size,
 			break;
 		}
 
-		/*
-		 * Wait for the memory block to be onlined when memory onlining
-		 * is done outside of kernel (memhp_auto_online). Since the hot
-		 * add has succeeded, it is ok to proceed even if the pages in
-		 * the hot added region have not been "onlined" within the
-		 * allowed time.
-		 */
-		if (dm_device.ha_waiting)
-			wait_for_completion_timeout(&dm_device.ol_waitevent,
-						    5*HZ);
 		post_status(&dm_device);
 	}
 }
diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c
index d7686a68c093..1928a2411456 100644
--- a/drivers/s390/char/sclp_cmd.c
+++ b/drivers/s390/char/sclp_cmd.c
@@ -406,7 +406,8 @@ static void __init add_memory_merged(u16 rn)
 	if (!size)
 		goto skip_add;
 	for (addr = start; addr < start + size; addr += block_size)
-		add_memory(numa_pfn_to_nid(PFN_DOWN(addr)), addr, block_size);
+		add_memory(numa_pfn_to_nid(PFN_DOWN(addr)), addr, block_size,
+			   MEMORY_BLOCK_STANDBY);
 skip_add:
 	first_rn = rn;
 	num = 1;
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index fdfc64f5acea..291a8aac6af3 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -397,7 +397,7 @@ static enum bp_state reserve_additional_memory(void)
 	mutex_unlock(&balloon_mutex);
 	/* add_memory_resource() requires the device_hotplug lock */
 	lock_device_hotplug();
-	rc = add_memory_resource(nid, resource, memhp_auto_online);
+	rc = add_memory_resource(nid, resource, MEMORY_BLOCK_PARAVIRT);
 	unlock_device_hotplug();
 	mutex_lock(&balloon_mutex);
 
diff --git a/include/linux/memory.h b/include/linux/memory.h
index a6ddefc60517..3dc2a0b12653 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -23,6 +23,30 @@
 
 #define MIN_MEMORY_BLOCK_SIZE     (1UL << SECTION_SIZE_BITS)
 
+/*
+ * NONE:     No memory block is to be created (e.g. device memory).
+ * NORMAL:   Memory block that represents normal (boot or hotplugged) memory
+ *           (e.g. ACPI DIMMs) that should be onlined either automatically
+ *           (memhp_auto_online) or manually by user space to select a
+ *           specific zone.
+ *           Applicable to memhp_auto_online.
+ * STANDBY:  Memory block that represents standby memory that should only
+ *           be onlined on demand by user space (e.g. standby memory on
+ *           s390x), but never automatically by the kernel.
+ *           Not applicable to memhp_auto_online.
+ * PARAVIRT: Memory block that represents memory added by
+ *           paravirtualized mechanisms (e.g. hyper-v, xen) that will
+ *           always automatically get onlined. Memory will be unplugged
+ *           using ballooning, not by relying on the MOVABLE ZONE.
+ *           Not applicable to memhp_auto_online.
+ */
+enum {
+	MEMORY_BLOCK_NONE,
+	MEMORY_BLOCK_NORMAL,
+	MEMORY_BLOCK_STANDBY,
+	MEMORY_BLOCK_PARAVIRT,
+};
+
 struct memory_block {
 	unsigned long start_section_nr;
 	unsigned long end_section_nr;
@@ -34,6 +58,7 @@ struct memory_block {
 	int (*phys_callback)(struct memory_block *);
 	struct device dev;
 	int nid;			/* NID for this memory block */
+	int type;			/* type of this memory block */
 };
 
 int arch_get_memory_phys_device(unsigned long start_pfn);
@@ -111,7 +136,8 @@ extern int register_memory_notifier(struct notifier_block *nb);
 extern void unregister_memory_notifier(struct notifier_block *nb);
 extern int register_memory_isolate_notifier(struct notifier_block *nb);
 extern void unregister_memory_isolate_notifier(struct notifier_block *nb);
-int hotplug_memory_register(int nid, struct mem_section *section);
+int hotplug_memory_register(int nid, struct mem_section *section,
+			    int memory_block_type);
 #ifdef CONFIG_MEMORY_HOTREMOVE
 extern int unregister_memory_section(struct mem_section *);
 #endif
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index ffd9cd10fcf3..b560a9ee0e8c 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -115,18 +115,18 @@ extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
 
 /* reasonably generic interface to expand the physical pages */
 extern int __add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
-		struct vmem_altmap *altmap, bool want_memblock);
+		struct vmem_altmap *altmap, int memory_block_type);
 
 #ifndef CONFIG_ARCH_HAS_ADD_PAGES
 static inline int add_pages(int nid, unsigned long start_pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap,
-		bool want_memblock)
+		int memory_block_type)
 {
-	return __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	return __add_pages(nid, start_pfn, nr_pages, altmap, memory_block_type);
 }
 #else /* ARCH_HAS_ADD_PAGES */
 int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
-		struct vmem_altmap *altmap, bool want_memblock);
+	      struct vmem_altmap *altmap, int memory_block_type);
 #endif /* ARCH_HAS_ADD_PAGES */
 
 #ifdef CONFIG_NUMA
@@ -324,11 +324,12 @@ static inline void __remove_memory(int nid, u64 start, u64 size) {}
 extern void __ref free_area_init_core_hotplug(int nid);
 extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
 		void *arg, int (*func)(struct memory_block *, void *));
-extern int __add_memory(int nid, u64 start, u64 size);
-extern int add_memory(int nid, u64 start, u64 size);
-extern int add_memory_resource(int nid, struct resource *resource, bool online);
+extern int __add_memory(int nid, u64 start, u64 size, int memory_block_type);
+extern int add_memory(int nid, u64 start, u64 size, int memory_block_type);
+extern int add_memory_resource(int nid, struct resource *resource,
+			       int memory_block_type);
 extern int arch_add_memory(int nid, u64 start, u64 size,
-		struct vmem_altmap *altmap, bool want_memblock);
+			   struct vmem_altmap *altmap, int memory_block_type);
 extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
diff --git a/mm/hmm.c b/mm/hmm.c
index c968e49f7a0c..2350f6f6ab42 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -32,6 +32,7 @@
 #include <linux/jump_label.h>
 #include <linux/mmu_notifier.h>
 #include <linux/memory_hotplug.h>
+#include <linux/memory.h>
 
 #define PA_SECTION_SIZE (1UL << PA_SECTION_SHIFT)
 
@@ -1096,10 +1097,11 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
 	 */
 	if (devmem->pagemap.type == MEMORY_DEVICE_PUBLIC)
 		ret = arch_add_memory(nid, align_start, align_size, NULL,
-				false);
+				      MEMORY_BLOCK_NONE);
 	else
 		ret = add_pages(nid, align_start >> PAGE_SHIFT,
-				align_size >> PAGE_SHIFT, NULL, false);
+				align_size >> PAGE_SHIFT, NULL,
+				MEMORY_BLOCK_NONE);
 	if (ret) {
 		mem_hotplug_done();
 		goto error_add_memory;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d4c7e42e46f3..bce6c41d721c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -246,7 +246,7 @@ void __init register_page_bootmem_info_node(struct pglist_data *pgdat)
 #endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */
 
 static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
-		struct vmem_altmap *altmap, bool want_memblock)
+		struct vmem_altmap *altmap, int memory_block_type)
 {
 	int ret;
 
@@ -257,10 +257,11 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
 	if (ret < 0)
 		return ret;
 
-	if (!want_memblock)
+	if (memory_block_type == MEMORY_BLOCK_NONE)
 		return 0;
 
-	return hotplug_memory_register(nid, __pfn_to_section(phys_start_pfn));
+	return hotplug_memory_register(nid, __pfn_to_section(phys_start_pfn),
+				       memory_block_type);
 }
 
 /*
@@ -271,7 +272,7 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
  */
 int __ref __add_pages(int nid, unsigned long phys_start_pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap,
-		bool want_memblock)
+		int memory_block_type)
 {
 	unsigned long i;
 	int err = 0;
@@ -296,7 +297,7 @@ int __ref __add_pages(int nid, unsigned long phys_start_pfn,
 
 	for (i = start_sec; i <= end_sec; i++) {
 		err = __add_section(nid, section_nr_to_pfn(i), altmap,
-				want_memblock);
+				    memory_block_type);
 
 		/*
 		 * EEXIST is finally dealt with by ioresource collision
@@ -1099,7 +1100,8 @@ static int online_memory_block(struct memory_block *mem, void *arg)
  *
  * we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG
  */
-int __ref add_memory_resource(int nid, struct resource *res, bool online)
+int __ref add_memory_resource(int nid, struct resource *res,
+			      int memory_block_type)
 {
 	u64 start, size;
 	bool new_node = false;
@@ -1108,6 +1110,9 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
 	start = res->start;
 	size = resource_size(res);
 
+	if (memory_block_type == MEMORY_BLOCK_NONE)
+		return -EINVAL;
+
 	ret = check_hotplug_memory_range(start, size);
 	if (ret)
 		return ret;
@@ -1128,7 +1133,7 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
 	new_node = ret;
 
 	/* call arch's memory hotadd */
-	ret = arch_add_memory(nid, start, size, NULL, true);
+	ret = arch_add_memory(nid, start, size, NULL, memory_block_type);
 	if (ret < 0)
 		goto error;
 
@@ -1153,8 +1158,8 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
 	/* device_online() will take the lock when calling online_pages() */
 	mem_hotplug_done();
 
-	/* online pages if requested */
-	if (online)
+	if (memory_block_type == MEMORY_BLOCK_PARAVIRT ||
+	    (memory_block_type == MEMORY_BLOCK_NORMAL && memhp_auto_online))
 		walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1),
 				  NULL, online_memory_block);
 
@@ -1169,7 +1174,7 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
 }
 
 /* requires device_hotplug_lock, see add_memory_resource() */
-int __ref __add_memory(int nid, u64 start, u64 size)
+int __ref __add_memory(int nid, u64 start, u64 size, int memory_block_type)
 {
 	struct resource *res;
 	int ret;
@@ -1178,18 +1183,18 @@ int __ref __add_memory(int nid, u64 start, u64 size)
 	if (IS_ERR(res))
 		return PTR_ERR(res);
 
-	ret = add_memory_resource(nid, res, memhp_auto_online);
+	ret = add_memory_resource(nid, res, memory_block_type);
 	if (ret < 0)
 		release_memory_resource(res);
 	return ret;
 }
 
-int add_memory(int nid, u64 start, u64 size)
+int add_memory(int nid, u64 start, u64 size, int memory_block_type)
 {
 	int rc;
 
 	lock_device_hotplug();
-	rc = __add_memory(nid, start, size);
+	rc = __add_memory(nid, start, size, memory_block_type);
 	unlock_device_hotplug();
 
 	return rc;
-- 
2.17.1

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply related	[flat|nested] 144+ messages in thread

* [PATCH RFC] mm/memory_hotplug: Introduce memory block types
@ 2018-09-28 15:03 ` David Hildenbrand
  0 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-09-28 15:03 UTC (permalink / raw)
  To: linux-mm
  Cc: xen-devel, devel, linux-acpi, linux-sh, linux-s390, linuxppc-dev,
	linux-kernel, linux-ia64, David Hildenbrand, Tony Luck,
	Fenghua Yu, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Martin Schwidefsky, Heiko Carstens,
	Yoshinori Sato, Rich Felker, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Rafael J. Wysocki, Len Brown, Greg Kroah-Hartman,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger,
	Boris Ostrovsky, Juergen Gross, Jérôme Glisse,
	Andrew Morton, Mike Rapoport, Dan Williams, Stephen Rothwell,
	Michal Hocko, Kirill A. Shutemov, Nicholas Piggin,
	Jonathan Neuschäfer, Joe Perches, Michael Neuling,
	Mauricio Faria de Oliveira, Balbir Singh, Rashmica Gupta,
	Pavel Tatashin, Rob Herring, Philippe Ombredanne, Kate Stewart,
	mike.travis, Joonsoo Kim, Oscar Salvador, Mathieu Malaterre

How to/when to online hotplugged memory is hard to manage for
distributions because different memory types are to be treated differently.
Right now, we need complicated udev rules that e.g. check if we are
running on s390x, on a physical system or on a virtualized system. But
there is also sometimes the demand to really online memory immediately
while adding in the kernel and not to wait for user space to make a
decision. And on virtualized systems there might be different
requirements, depending on "how" the memory was added (and if it will
eventually get unplugged again - DIMM vs. paravirtualized mechanisms).

On the one hand, we have physical systems where we sometimes
want to be able to unplug memory again - e.g. a DIMM - so we have to online
it to the MOVABLE zone optionally. That decision is usually made in user
space.

On the other hand, we have memory that should never be onlined
automatically, only when asked for by an administrator. Such memory only
applies to virtualized environments like s390x, where the concept of
"standby" memory exists. Memory is detected and added during boot, so it
can be onlined when requested by the admininistrator or some tooling.
Only when onlining, memory will be allocated in the hypervisor.

But then, we also have paravirtualized devices (namely xen and hyper-v
balloons), that hotplug memory that will never ever be removed from a
system right now using offline_pages/remove_memory. If at all, this memory
is logically unplugged and handed back to the hypervisor via ballooning.

For paravirtualized devices it is relevant that memory is onlined as
quickly as possible after it has been added - and that it is added to
the NORMAL zone. Otherwise, too much memory might be added in a row
(but not onlined), resulting in out-of-memory conditions due to the
additional memory needed for "struct pages" and friends. The MOVABLE
zone as well as onlining delays can be very problematic and lead to
crashes (e.g. zone imbalance).

Therefore, introduce memory block types and online memory depending on
the type when adding the memory. Expose the memory type to user space,
so user space handlers can start to process only "normal" memory. Other
memory block types can be ignored. One thing less to worry about in
user space.

Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: "Jonathan Neuschäfer" <j.neuschaefer@gmx.net>
Cc: Joe Perches <joe@perches.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Rashmica Gupta <rashmica.g@gmail.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: "mike.travis@hpe.com" <mike.travis@hpe.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mathieu Malaterre <malat@debian.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---

This patch is based on the current mm-tree, where some related
patches from me are currently residing that touched the add_memory()
functions.

 arch/ia64/mm/init.c                       |  4 +-
 arch/powerpc/mm/mem.c                     |  4 +-
 arch/powerpc/platforms/powernv/memtrace.c |  3 +-
 arch/s390/mm/init.c                       |  4 +-
 arch/sh/mm/init.c                         |  4 +-
 arch/x86/mm/init_32.c                     |  4 +-
 arch/x86/mm/init_64.c                     |  8 +--
 drivers/acpi/acpi_memhotplug.c            |  3 +-
 drivers/base/memory.c                     | 63 ++++++++++++++++++++---
 drivers/hv/hv_balloon.c                   | 33 ++----------
 drivers/s390/char/sclp_cmd.c              |  3 +-
 drivers/xen/balloon.c                     |  2 +-
 include/linux/memory.h                    | 28 +++++++++-
 include/linux/memory_hotplug.h            | 17 +++---
 mm/hmm.c                                  |  6 ++-
 mm/memory_hotplug.c                       | 31 ++++++-----
 16 files changed, 139 insertions(+), 78 deletions(-)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index d5e12ff1d73c..813d1d86bf95 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -646,13 +646,13 @@ mem_init (void)
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+		    int memory_block_type)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
-	ret = __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	ret = __add_pages(nid, start_pfn, nr_pages, altmap, memory_block_type);
 	if (ret)
 		printk("%s: Problem encountered in __add_pages() as ret=%d\n",
 		       __func__,  ret);
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 5551f5870dcc..dd32fcc9099c 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -118,7 +118,7 @@ int __weak remove_section_mapping(unsigned long start, unsigned long end)
 }
 
 int __meminit arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+			      int memory_block_type)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
@@ -135,7 +135,7 @@ int __meminit arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *
 	}
 	flush_inval_dcache_range(start, start + size);
 
-	return __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	return __add_pages(nid, start_pfn, nr_pages, altmap, memory_block_type);
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c
index 84d038ed3882..57d6b3d46382 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -232,7 +232,8 @@ static int memtrace_online(void)
 			ent->mem = 0;
 		}
 
-		if (add_memory(ent->nid, ent->start, ent->size)) {
+		if (add_memory(ent->nid, ent->start, ent->size,
+			       MEMORY_BLOCK_NORMAL)) {
 			pr_err("Failed to add trace memory to node %d\n",
 				ent->nid);
 			ret += 1;
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index e472cd763eb3..b5324527c7f6 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -222,7 +222,7 @@ device_initcall(s390_cma_mem_init);
 #endif /* CONFIG_CMA */
 
 int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+		    int memory_block_type)
 {
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long size_pages = PFN_DOWN(size);
@@ -232,7 +232,7 @@ int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
 	if (rc)
 		return rc;
 
-	rc = __add_pages(nid, start_pfn, size_pages, altmap, want_memblock);
+	rc = __add_pages(nid, start_pfn, size_pages, altmap, memory_block_type);
 	if (rc)
 		vmem_remove_mapping(start, size);
 	return rc;
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index c8c13c777162..6b876000731a 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -419,14 +419,14 @@ void free_initrd_mem(unsigned long start, unsigned long end)
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+		    int memory_block_type)
 {
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
 	/* We only have ZONE_NORMAL, so this is easy.. */
-	ret = __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	ret = __add_pages(nid, start_pfn, nr_pages, altmap, memory_block_type);
 	if (unlikely(ret))
 		printk("%s: Failed, __add_pages() == %d\n", __func__, ret);
 
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index f2837e4c40b3..4f50cd4467a9 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -851,12 +851,12 @@ void __init mem_init(void)
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+		    int memory_block_type)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
-	return __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	return __add_pages(nid, start_pfn, nr_pages, altmap, memory_block_type);
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 5fab264948c2..fc3df573f0f3 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -783,11 +783,11 @@ static void update_end_of_memory_vars(u64 start, u64 size)
 }
 
 int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
-		struct vmem_altmap *altmap, bool want_memblock)
+		struct vmem_altmap *altmap, int memory_block_type)
 {
 	int ret;
 
-	ret = __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	ret = __add_pages(nid, start_pfn, nr_pages, altmap, memory_block_type);
 	WARN_ON_ONCE(ret);
 
 	/* update max_pfn, max_low_pfn and high_memory */
@@ -798,14 +798,14 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
 }
 
 int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+		    int memory_block_type)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
 	init_memory_mapping(start, start + size);
 
-	return add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	return add_pages(nid, start_pfn, nr_pages, altmap, memory_block_type);
 }
 
 #define PAGE_INUSE 0xFD
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index 8fe0960ea572..c5f646b4e97e 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -228,7 +228,8 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
 		if (node < 0)
 			node = memory_add_physaddr_to_nid(info->start_addr);
 
-		result = __add_memory(node, info->start_addr, info->length);
+		result = __add_memory(node, info->start_addr, info->length,
+				      MEMORY_BLOCK_NORMAL);
 
 		/*
 		 * If the memory block has been used by the kernel, add_memory()
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 0e5985682642..2686101e41b5 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -381,6 +381,32 @@ static ssize_t show_phys_device(struct device *dev,
 	return sprintf(buf, "%d\n", mem->phys_device);
 }
 
+static ssize_t type_show(struct device *dev, struct device_attribute *attr,
+			 char *buf)
+{
+	struct memory_block *mem = to_memory_block(dev);
+	ssize_t len = 0;
+
+	switch (mem->type) {
+	case MEMORY_BLOCK_NORMAL:
+		len = sprintf(buf, "normal\n");
+		break;
+	case MEMORY_BLOCK_STANDBY:
+		len = sprintf(buf, "standby\n");
+		break;
+	case MEMORY_BLOCK_PARAVIRT:
+		len = sprintf(buf, "paravirt\n");
+		break;
+	default:
+		len = sprintf(buf, "ERROR-UNKNOWN-%d\n",
+				mem->type);
+		WARN_ON(1);
+		break;
+	}
+
+	return len;
+}
+
 #ifdef CONFIG_MEMORY_HOTREMOVE
 static void print_allowed_zone(char *buf, int nid, unsigned long start_pfn,
 		unsigned long nr_pages, int online_type,
@@ -442,6 +468,7 @@ static DEVICE_ATTR(phys_index, 0444, show_mem_start_phys_index, NULL);
 static DEVICE_ATTR(state, 0644, show_mem_state, store_mem_state);
 static DEVICE_ATTR(phys_device, 0444, show_phys_device, NULL);
 static DEVICE_ATTR(removable, 0444, show_mem_removable, NULL);
+static DEVICE_ATTR_RO(type);
 
 /*
  * Block size attribute stuff
@@ -514,7 +541,8 @@ memory_probe_store(struct device *dev, struct device_attribute *attr,
 
 	nid = memory_add_physaddr_to_nid(phys_addr);
 	ret = __add_memory(nid, phys_addr,
-			   MIN_MEMORY_BLOCK_SIZE * sections_per_block);
+			   MIN_MEMORY_BLOCK_SIZE * sections_per_block,
+			   MEMORY_BLOCK_NORMAL);
 
 	if (ret)
 		goto out;
@@ -620,6 +648,7 @@ static struct attribute *memory_memblk_attrs[] = {
 	&dev_attr_state.attr,
 	&dev_attr_phys_device.attr,
 	&dev_attr_removable.attr,
+	&dev_attr_type.attr,
 #ifdef CONFIG_MEMORY_HOTREMOVE
 	&dev_attr_valid_zones.attr,
 #endif
@@ -657,13 +686,17 @@ int register_memory(struct memory_block *memory)
 }
 
 static int init_memory_block(struct memory_block **memory,
-			     struct mem_section *section, unsigned long state)
+			     struct mem_section *section, unsigned long state,
+			     int memory_block_type)
 {
 	struct memory_block *mem;
 	unsigned long start_pfn;
 	int scn_nr;
 	int ret = 0;
 
+	if (memory_block_type == MEMORY_BLOCK_NONE)
+		return -EINVAL;
+
 	mem = kzalloc(sizeof(*mem), GFP_KERNEL);
 	if (!mem)
 		return -ENOMEM;
@@ -675,6 +708,7 @@ static int init_memory_block(struct memory_block **memory,
 	mem->state = state;
 	start_pfn = section_nr_to_pfn(mem->start_section_nr);
 	mem->phys_device = arch_get_memory_phys_device(start_pfn);
+	mem->type = memory_block_type;
 
 	ret = register_memory(mem);
 
@@ -699,7 +733,8 @@ static int add_memory_block(int base_section_nr)
 
 	if (section_count == 0)
 		return 0;
-	ret = init_memory_block(&mem, __nr_to_section(section_nr), MEM_ONLINE);
+	ret = init_memory_block(&mem, __nr_to_section(section_nr), MEM_ONLINE,
+				MEMORY_BLOCK_NORMAL);
 	if (ret)
 		return ret;
 	mem->section_count = section_count;
@@ -710,19 +745,35 @@ static int add_memory_block(int base_section_nr)
  * need an interface for the VM to add new memory regions,
  * but without onlining it.
  */
-int hotplug_memory_register(int nid, struct mem_section *section)
+int hotplug_memory_register(int nid, struct mem_section *section,
+			    int memory_block_type)
 {
 	int ret = 0;
 	struct memory_block *mem;
 
 	mutex_lock(&mem_sysfs_mutex);
 
+	/* make sure there is no memblock if we don't want one */
+	if (memory_block_type == MEMORY_BLOCK_NONE) {
+		mem = find_memory_block(section);
+		if (mem) {
+			put_device(&mem->dev);
+			ret = -EINVAL;
+		}
+		goto out;
+	}
+
 	mem = find_memory_block(section);
 	if (mem) {
-		mem->section_count++;
+		/* make sure the type matches */
+		if (mem->type == memory_block_type)
+			mem->section_count++;
+		else
+			ret = -EINVAL;
 		put_device(&mem->dev);
 	} else {
-		ret = init_memory_block(&mem, section, MEM_OFFLINE);
+		ret = init_memory_block(&mem, section, MEM_OFFLINE,
+					memory_block_type);
 		if (ret)
 			goto out;
 		mem->section_count++;
diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index b1b788082793..5a8d18c4d699 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -537,11 +537,6 @@ struct hv_dynmem_device {
 	 */
 	bool host_specified_ha_region;
 
-	/*
-	 * State to synchronize hot-add.
-	 */
-	struct completion  ol_waitevent;
-	bool ha_waiting;
 	/*
 	 * This thread handles hot-add
 	 * requests from the host as well as notifying
@@ -640,14 +635,6 @@ static int hv_memory_notifier(struct notifier_block *nb, unsigned long val,
 	unsigned long flags, pfn_count;
 
 	switch (val) {
-	case MEM_ONLINE:
-	case MEM_CANCEL_ONLINE:
-		if (dm_device.ha_waiting) {
-			dm_device.ha_waiting = false;
-			complete(&dm_device.ol_waitevent);
-		}
-		break;
-
 	case MEM_OFFLINE:
 		spin_lock_irqsave(&dm_device.ha_lock, flags);
 		pfn_count = hv_page_offline_check(mem->start_pfn,
@@ -665,9 +652,7 @@ static int hv_memory_notifier(struct notifier_block *nb, unsigned long val,
 		}
 		spin_unlock_irqrestore(&dm_device.ha_lock, flags);
 		break;
-	case MEM_GOING_ONLINE:
-	case MEM_GOING_OFFLINE:
-	case MEM_CANCEL_OFFLINE:
+	default:
 		break;
 	}
 	return NOTIFY_OK;
@@ -731,12 +716,10 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size,
 		has->covered_end_pfn +=  processed_pfn;
 		spin_unlock_irqrestore(&dm_device.ha_lock, flags);
 
-		init_completion(&dm_device.ol_waitevent);
-		dm_device.ha_waiting = !memhp_auto_online;
-
 		nid = memory_add_physaddr_to_nid(PFN_PHYS(start_pfn));
 		ret = add_memory(nid, PFN_PHYS((start_pfn)),
-				(HA_CHUNK << PAGE_SHIFT));
+				 (HA_CHUNK << PAGE_SHIFT),
+				 MEMORY_BLOCK_PARAVIRT);
 
 		if (ret) {
 			pr_err("hot_add memory failed error is %d\n", ret);
@@ -757,16 +740,6 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size,
 			break;
 		}
 
-		/*
-		 * Wait for the memory block to be onlined when memory onlining
-		 * is done outside of kernel (memhp_auto_online). Since the hot
-		 * add has succeeded, it is ok to proceed even if the pages in
-		 * the hot added region have not been "onlined" within the
-		 * allowed time.
-		 */
-		if (dm_device.ha_waiting)
-			wait_for_completion_timeout(&dm_device.ol_waitevent,
-						    5*HZ);
 		post_status(&dm_device);
 	}
 }
diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c
index d7686a68c093..1928a2411456 100644
--- a/drivers/s390/char/sclp_cmd.c
+++ b/drivers/s390/char/sclp_cmd.c
@@ -406,7 +406,8 @@ static void __init add_memory_merged(u16 rn)
 	if (!size)
 		goto skip_add;
 	for (addr = start; addr < start + size; addr += block_size)
-		add_memory(numa_pfn_to_nid(PFN_DOWN(addr)), addr, block_size);
+		add_memory(numa_pfn_to_nid(PFN_DOWN(addr)), addr, block_size,
+			   MEMORY_BLOCK_STANDBY);
 skip_add:
 	first_rn = rn;
 	num = 1;
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index fdfc64f5acea..291a8aac6af3 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -397,7 +397,7 @@ static enum bp_state reserve_additional_memory(void)
 	mutex_unlock(&balloon_mutex);
 	/* add_memory_resource() requires the device_hotplug lock */
 	lock_device_hotplug();
-	rc = add_memory_resource(nid, resource, memhp_auto_online);
+	rc = add_memory_resource(nid, resource, MEMORY_BLOCK_PARAVIRT);
 	unlock_device_hotplug();
 	mutex_lock(&balloon_mutex);
 
diff --git a/include/linux/memory.h b/include/linux/memory.h
index a6ddefc60517..3dc2a0b12653 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -23,6 +23,30 @@
 
 #define MIN_MEMORY_BLOCK_SIZE     (1UL << SECTION_SIZE_BITS)
 
+/*
+ * NONE:     No memory block is to be created (e.g. device memory).
+ * NORMAL:   Memory block that represents normal (boot or hotplugged) memory
+ *           (e.g. ACPI DIMMs) that should be onlined either automatically
+ *           (memhp_auto_online) or manually by user space to select a
+ *           specific zone.
+ *           Applicable to memhp_auto_online.
+ * STANDBY:  Memory block that represents standby memory that should only
+ *           be onlined on demand by user space (e.g. standby memory on
+ *           s390x), but never automatically by the kernel.
+ *           Not applicable to memhp_auto_online.
+ * PARAVIRT: Memory block that represents memory added by
+ *           paravirtualized mechanisms (e.g. hyper-v, xen) that will
+ *           always automatically get onlined. Memory will be unplugged
+ *           using ballooning, not by relying on the MOVABLE ZONE.
+ *           Not applicable to memhp_auto_online.
+ */
+enum {
+	MEMORY_BLOCK_NONE,
+	MEMORY_BLOCK_NORMAL,
+	MEMORY_BLOCK_STANDBY,
+	MEMORY_BLOCK_PARAVIRT,
+};
+
 struct memory_block {
 	unsigned long start_section_nr;
 	unsigned long end_section_nr;
@@ -34,6 +58,7 @@ struct memory_block {
 	int (*phys_callback)(struct memory_block *);
 	struct device dev;
 	int nid;			/* NID for this memory block */
+	int type;			/* type of this memory block */
 };
 
 int arch_get_memory_phys_device(unsigned long start_pfn);
@@ -111,7 +136,8 @@ extern int register_memory_notifier(struct notifier_block *nb);
 extern void unregister_memory_notifier(struct notifier_block *nb);
 extern int register_memory_isolate_notifier(struct notifier_block *nb);
 extern void unregister_memory_isolate_notifier(struct notifier_block *nb);
-int hotplug_memory_register(int nid, struct mem_section *section);
+int hotplug_memory_register(int nid, struct mem_section *section,
+			    int memory_block_type);
 #ifdef CONFIG_MEMORY_HOTREMOVE
 extern int unregister_memory_section(struct mem_section *);
 #endif
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index ffd9cd10fcf3..b560a9ee0e8c 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -115,18 +115,18 @@ extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
 
 /* reasonably generic interface to expand the physical pages */
 extern int __add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
-		struct vmem_altmap *altmap, bool want_memblock);
+		struct vmem_altmap *altmap, int memory_block_type);
 
 #ifndef CONFIG_ARCH_HAS_ADD_PAGES
 static inline int add_pages(int nid, unsigned long start_pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap,
-		bool want_memblock)
+		int memory_block_type)
 {
-	return __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	return __add_pages(nid, start_pfn, nr_pages, altmap, memory_block_type);
 }
 #else /* ARCH_HAS_ADD_PAGES */
 int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
-		struct vmem_altmap *altmap, bool want_memblock);
+	      struct vmem_altmap *altmap, int memory_block_type);
 #endif /* ARCH_HAS_ADD_PAGES */
 
 #ifdef CONFIG_NUMA
@@ -324,11 +324,12 @@ static inline void __remove_memory(int nid, u64 start, u64 size) {}
 extern void __ref free_area_init_core_hotplug(int nid);
 extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
 		void *arg, int (*func)(struct memory_block *, void *));
-extern int __add_memory(int nid, u64 start, u64 size);
-extern int add_memory(int nid, u64 start, u64 size);
-extern int add_memory_resource(int nid, struct resource *resource, bool online);
+extern int __add_memory(int nid, u64 start, u64 size, int memory_block_type);
+extern int add_memory(int nid, u64 start, u64 size, int memory_block_type);
+extern int add_memory_resource(int nid, struct resource *resource,
+			       int memory_block_type);
 extern int arch_add_memory(int nid, u64 start, u64 size,
-		struct vmem_altmap *altmap, bool want_memblock);
+			   struct vmem_altmap *altmap, int memory_block_type);
 extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
diff --git a/mm/hmm.c b/mm/hmm.c
index c968e49f7a0c..2350f6f6ab42 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -32,6 +32,7 @@
 #include <linux/jump_label.h>
 #include <linux/mmu_notifier.h>
 #include <linux/memory_hotplug.h>
+#include <linux/memory.h>
 
 #define PA_SECTION_SIZE (1UL << PA_SECTION_SHIFT)
 
@@ -1096,10 +1097,11 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
 	 */
 	if (devmem->pagemap.type == MEMORY_DEVICE_PUBLIC)
 		ret = arch_add_memory(nid, align_start, align_size, NULL,
-				false);
+				      MEMORY_BLOCK_NONE);
 	else
 		ret = add_pages(nid, align_start >> PAGE_SHIFT,
-				align_size >> PAGE_SHIFT, NULL, false);
+				align_size >> PAGE_SHIFT, NULL,
+				MEMORY_BLOCK_NONE);
 	if (ret) {
 		mem_hotplug_done();
 		goto error_add_memory;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d4c7e42e46f3..bce6c41d721c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -246,7 +246,7 @@ void __init register_page_bootmem_info_node(struct pglist_data *pgdat)
 #endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */
 
 static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
-		struct vmem_altmap *altmap, bool want_memblock)
+		struct vmem_altmap *altmap, int memory_block_type)
 {
 	int ret;
 
@@ -257,10 +257,11 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
 	if (ret < 0)
 		return ret;
 
-	if (!want_memblock)
+	if (memory_block_type == MEMORY_BLOCK_NONE)
 		return 0;
 
-	return hotplug_memory_register(nid, __pfn_to_section(phys_start_pfn));
+	return hotplug_memory_register(nid, __pfn_to_section(phys_start_pfn),
+				       memory_block_type);
 }
 
 /*
@@ -271,7 +272,7 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
  */
 int __ref __add_pages(int nid, unsigned long phys_start_pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap,
-		bool want_memblock)
+		int memory_block_type)
 {
 	unsigned long i;
 	int err = 0;
@@ -296,7 +297,7 @@ int __ref __add_pages(int nid, unsigned long phys_start_pfn,
 
 	for (i = start_sec; i <= end_sec; i++) {
 		err = __add_section(nid, section_nr_to_pfn(i), altmap,
-				want_memblock);
+				    memory_block_type);
 
 		/*
 		 * EEXIST is finally dealt with by ioresource collision
@@ -1099,7 +1100,8 @@ static int online_memory_block(struct memory_block *mem, void *arg)
  *
  * we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG
  */
-int __ref add_memory_resource(int nid, struct resource *res, bool online)
+int __ref add_memory_resource(int nid, struct resource *res,
+			      int memory_block_type)
 {
 	u64 start, size;
 	bool new_node = false;
@@ -1108,6 +1110,9 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
 	start = res->start;
 	size = resource_size(res);
 
+	if (memory_block_type == MEMORY_BLOCK_NONE)
+		return -EINVAL;
+
 	ret = check_hotplug_memory_range(start, size);
 	if (ret)
 		return ret;
@@ -1128,7 +1133,7 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
 	new_node = ret;
 
 	/* call arch's memory hotadd */
-	ret = arch_add_memory(nid, start, size, NULL, true);
+	ret = arch_add_memory(nid, start, size, NULL, memory_block_type);
 	if (ret < 0)
 		goto error;
 
@@ -1153,8 +1158,8 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
 	/* device_online() will take the lock when calling online_pages() */
 	mem_hotplug_done();
 
-	/* online pages if requested */
-	if (online)
+	if (memory_block_type == MEMORY_BLOCK_PARAVIRT ||
+	    (memory_block_type == MEMORY_BLOCK_NORMAL && memhp_auto_online))
 		walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1),
 				  NULL, online_memory_block);
 
@@ -1169,7 +1174,7 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
 }
 
 /* requires device_hotplug_lock, see add_memory_resource() */
-int __ref __add_memory(int nid, u64 start, u64 size)
+int __ref __add_memory(int nid, u64 start, u64 size, int memory_block_type)
 {
 	struct resource *res;
 	int ret;
@@ -1178,18 +1183,18 @@ int __ref __add_memory(int nid, u64 start, u64 size)
 	if (IS_ERR(res))
 		return PTR_ERR(res);
 
-	ret = add_memory_resource(nid, res, memhp_auto_online);
+	ret = add_memory_resource(nid, res, memory_block_type);
 	if (ret < 0)
 		release_memory_resource(res);
 	return ret;
 }
 
-int add_memory(int nid, u64 start, u64 size)
+int add_memory(int nid, u64 start, u64 size, int memory_block_type)
 {
 	int rc;
 
 	lock_device_hotplug();
-	rc = __add_memory(nid, start, size);
+	rc = __add_memory(nid, start, size, memory_block_type);
 	unlock_device_hotplug();
 
 	return rc;
-- 
2.17.1

 	}
 	flush_inval_dcache_range(start, start + size);
 
-	return __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	return __add_pages(nid, start_pfn, nr_pages, altmap, memory_block_type);
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c
index 84d038ed3882..57d6b3d46382 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -232,7 +232,8 @@ static int memtrace_online(void)
 			ent->mem = 0;
 		}
 
-		if (add_memory(ent->nid, ent->start, ent->size)) {
+		if (add_memory(ent->nid, ent->start, ent->size,
+			       MEMORY_BLOCK_NORMAL)) {
 			pr_err("Failed to add trace memory to node %d\n",
 				ent->nid);
 			ret += 1;
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index e472cd763eb3..b5324527c7f6 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -222,7 +222,7 @@ device_initcall(s390_cma_mem_init);
 #endif /* CONFIG_CMA */
 
 int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+		    int memory_block_type)
 {
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long size_pages = PFN_DOWN(size);
@@ -232,7 +232,7 @@ int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
 	if (rc)
 		return rc;
 
-	rc = __add_pages(nid, start_pfn, size_pages, altmap, want_memblock);
+	rc = __add_pages(nid, start_pfn, size_pages, altmap, memory_block_type);
 	if (rc)
 		vmem_remove_mapping(start, size);
 	return rc;
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index c8c13c777162..6b876000731a 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -419,14 +419,14 @@ void free_initrd_mem(unsigned long start, unsigned long end)
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+		    int memory_block_type)
 {
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
 	/* We only have ZONE_NORMAL, so this is easy.. */
-	ret = __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	ret = __add_pages(nid, start_pfn, nr_pages, altmap, memory_block_type);
 	if (unlikely(ret))
 		printk("%s: Failed, __add_pages() == %d\n", __func__, ret);
 
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index f2837e4c40b3..4f50cd4467a9 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -851,12 +851,12 @@ void __init mem_init(void)
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+		    int memory_block_type)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
-	return __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	return __add_pages(nid, start_pfn, nr_pages, altmap, memory_block_type);
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 5fab264948c2..fc3df573f0f3 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -783,11 +783,11 @@ static void update_end_of_memory_vars(u64 start, u64 size)
 }
 
 int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
-		struct vmem_altmap *altmap, bool want_memblock)
+		struct vmem_altmap *altmap, int memory_block_type)
 {
 	int ret;
 
-	ret = __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	ret = __add_pages(nid, start_pfn, nr_pages, altmap, memory_block_type);
 	WARN_ON_ONCE(ret);
 
 	/* update max_pfn, max_low_pfn and high_memory */
@@ -798,14 +798,14 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
 }
 
 int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+		    int memory_block_type)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
 	init_memory_mapping(start, start + size);
 
-	return add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	return add_pages(nid, start_pfn, nr_pages, altmap, memory_block_type);
 }
 
 #define PAGE_INUSE 0xFD
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index 8fe0960ea572..c5f646b4e97e 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -228,7 +228,8 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
 		if (node < 0)
 			node = memory_add_physaddr_to_nid(info->start_addr);
 
-		result = __add_memory(node, info->start_addr, info->length);
+		result = __add_memory(node, info->start_addr, info->length,
+				      MEMORY_BLOCK_NORMAL);
 
 		/*
 		 * If the memory block has been used by the kernel, add_memory()
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 0e5985682642..2686101e41b5 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -381,6 +381,32 @@ static ssize_t show_phys_device(struct device *dev,
 	return sprintf(buf, "%d\n", mem->phys_device);
 }
 
+static ssize_t type_show(struct device *dev, struct device_attribute *attr,
+			 char *buf)
+{
+	struct memory_block *mem = to_memory_block(dev);
+	ssize_t len = 0;
+
+	switch (mem->type) {
+	case MEMORY_BLOCK_NORMAL:
+		len = sprintf(buf, "normal\n");
+		break;
+	case MEMORY_BLOCK_STANDBY:
+		len = sprintf(buf, "standby\n");
+		break;
+	case MEMORY_BLOCK_PARAVIRT:
+		len = sprintf(buf, "paravirt\n");
+		break;
+	default:
+		len = sprintf(buf, "ERROR-UNKNOWN-%d\n",
+				mem->type);
+		WARN_ON(1);
+		break;
+	}
+
+	return len;
+}
+
 #ifdef CONFIG_MEMORY_HOTREMOVE
 static void print_allowed_zone(char *buf, int nid, unsigned long start_pfn,
 		unsigned long nr_pages, int online_type,
@@ -442,6 +468,7 @@ static DEVICE_ATTR(phys_index, 0444, show_mem_start_phys_index, NULL);
 static DEVICE_ATTR(state, 0644, show_mem_state, store_mem_state);
 static DEVICE_ATTR(phys_device, 0444, show_phys_device, NULL);
 static DEVICE_ATTR(removable, 0444, show_mem_removable, NULL);
+static DEVICE_ATTR_RO(type);
 
 /*
  * Block size attribute stuff
@@ -514,7 +541,8 @@ memory_probe_store(struct device *dev, struct device_attribute *attr,
 
 	nid = memory_add_physaddr_to_nid(phys_addr);
 	ret = __add_memory(nid, phys_addr,
-			   MIN_MEMORY_BLOCK_SIZE * sections_per_block);
+			   MIN_MEMORY_BLOCK_SIZE * sections_per_block,
+			   MEMORY_BLOCK_NORMAL);
 
 	if (ret)
 		goto out;
@@ -620,6 +648,7 @@ static struct attribute *memory_memblk_attrs[] = {
 	&dev_attr_state.attr,
 	&dev_attr_phys_device.attr,
 	&dev_attr_removable.attr,
+	&dev_attr_type.attr,
 #ifdef CONFIG_MEMORY_HOTREMOVE
 	&dev_attr_valid_zones.attr,
 #endif
@@ -657,13 +686,17 @@ int register_memory(struct memory_block *memory)
 }
 
 static int init_memory_block(struct memory_block **memory,
-			     struct mem_section *section, unsigned long state)
+			     struct mem_section *section, unsigned long state,
+			     int memory_block_type)
 {
 	struct memory_block *mem;
 	unsigned long start_pfn;
 	int scn_nr;
 	int ret = 0;
 
+	if (memory_block_type == MEMORY_BLOCK_NONE)
+		return -EINVAL;
+
 	mem = kzalloc(sizeof(*mem), GFP_KERNEL);
 	if (!mem)
 		return -ENOMEM;
@@ -675,6 +708,7 @@ static int init_memory_block(struct memory_block **memory,
 	mem->state = state;
 	start_pfn = section_nr_to_pfn(mem->start_section_nr);
 	mem->phys_device = arch_get_memory_phys_device(start_pfn);
+	mem->type = memory_block_type;
 
 	ret = register_memory(mem);
 
@@ -699,7 +733,8 @@ static int add_memory_block(int base_section_nr)
 
 	if (section_count == 0)
 		return 0;
-	ret = init_memory_block(&mem, __nr_to_section(section_nr), MEM_ONLINE);
+	ret = init_memory_block(&mem, __nr_to_section(section_nr), MEM_ONLINE,
+				MEMORY_BLOCK_NORMAL);
 	if (ret)
 		return ret;
 	mem->section_count = section_count;
@@ -710,19 +745,35 @@ static int add_memory_block(int base_section_nr)
  * need an interface for the VM to add new memory regions,
  * but without onlining it.
  */
-int hotplug_memory_register(int nid, struct mem_section *section)
+int hotplug_memory_register(int nid, struct mem_section *section,
+			    int memory_block_type)
 {
 	int ret = 0;
 	struct memory_block *mem;
 
 	mutex_lock(&mem_sysfs_mutex);
 
+	/* make sure there is no memblock if we don't want one */
+	if (memory_block_type == MEMORY_BLOCK_NONE) {
+		mem = find_memory_block(section);
+		if (mem) {
+			put_device(&mem->dev);
+			ret = -EINVAL;
+		}
+		goto out;
+	}
+
 	mem = find_memory_block(section);
 	if (mem) {
-		mem->section_count++;
+		/* make sure the type matches */
+		if (mem->type == memory_block_type)
+			mem->section_count++;
+		else
+			ret = -EINVAL;
 		put_device(&mem->dev);
 	} else {
-		ret = init_memory_block(&mem, section, MEM_OFFLINE);
+		ret = init_memory_block(&mem, section, MEM_OFFLINE,
+					memory_block_type);
 		if (ret)
 			goto out;
 		mem->section_count++;
diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index b1b788082793..5a8d18c4d699 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -537,11 +537,6 @@ struct hv_dynmem_device {
 	 */
 	bool host_specified_ha_region;
 
-	/*
-	 * State to synchronize hot-add.
-	 */
-	struct completion  ol_waitevent;
-	bool ha_waiting;
 	/*
 	 * This thread handles hot-add
 	 * requests from the host as well as notifying
@@ -640,14 +635,6 @@ static int hv_memory_notifier(struct notifier_block *nb, unsigned long val,
 	unsigned long flags, pfn_count;
 
 	switch (val) {
-	case MEM_ONLINE:
-	case MEM_CANCEL_ONLINE:
-		if (dm_device.ha_waiting) {
-			dm_device.ha_waiting = false;
-			complete(&dm_device.ol_waitevent);
-		}
-		break;
-
 	case MEM_OFFLINE:
 		spin_lock_irqsave(&dm_device.ha_lock, flags);
 		pfn_count = hv_page_offline_check(mem->start_pfn,
@@ -665,9 +652,7 @@ static int hv_memory_notifier(struct notifier_block *nb, unsigned long val,
 		}
 		spin_unlock_irqrestore(&dm_device.ha_lock, flags);
 		break;
-	case MEM_GOING_ONLINE:
-	case MEM_GOING_OFFLINE:
-	case MEM_CANCEL_OFFLINE:
+	default:
 		break;
 	}
 	return NOTIFY_OK;
@@ -731,12 +716,10 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size,
 		has->covered_end_pfn +=  processed_pfn;
 		spin_unlock_irqrestore(&dm_device.ha_lock, flags);
 
-		init_completion(&dm_device.ol_waitevent);
-		dm_device.ha_waiting = !memhp_auto_online;
-
 		nid = memory_add_physaddr_to_nid(PFN_PHYS(start_pfn));
 		ret = add_memory(nid, PFN_PHYS((start_pfn)),
-				(HA_CHUNK << PAGE_SHIFT));
+				 (HA_CHUNK << PAGE_SHIFT),
+				 MEMORY_BLOCK_PARAVIRT);
 
 		if (ret) {
 			pr_err("hot_add memory failed error is %d\n", ret);
@@ -757,16 +740,6 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size,
 			break;
 		}
 
-		/*
-		 * Wait for the memory block to be onlined when memory onlining
-		 * is done outside of kernel (memhp_auto_online). Since the hot
-		 * add has succeeded, it is ok to proceed even if the pages in
-		 * the hot added region have not been "onlined" within the
-		 * allowed time.
-		 */
-		if (dm_device.ha_waiting)
-			wait_for_completion_timeout(&dm_device.ol_waitevent,
-						    5*HZ);
 		post_status(&dm_device);
 	}
 }
diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c
index d7686a68c093..1928a2411456 100644
--- a/drivers/s390/char/sclp_cmd.c
+++ b/drivers/s390/char/sclp_cmd.c
@@ -406,7 +406,8 @@ static void __init add_memory_merged(u16 rn)
 	if (!size)
 		goto skip_add;
 	for (addr = start; addr < start + size; addr += block_size)
-		add_memory(numa_pfn_to_nid(PFN_DOWN(addr)), addr, block_size);
+		add_memory(numa_pfn_to_nid(PFN_DOWN(addr)), addr, block_size,
+			   MEMORY_BLOCK_STANDBY);
 skip_add:
 	first_rn = rn;
 	num = 1;
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index fdfc64f5acea..291a8aac6af3 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -397,7 +397,7 @@ static enum bp_state reserve_additional_memory(void)
 	mutex_unlock(&balloon_mutex);
 	/* add_memory_resource() requires the device_hotplug lock */
 	lock_device_hotplug();
-	rc = add_memory_resource(nid, resource, memhp_auto_online);
+	rc = add_memory_resource(nid, resource, MEMORY_BLOCK_PARAVIRT);
 	unlock_device_hotplug();
 	mutex_lock(&balloon_mutex);
 
diff --git a/include/linux/memory.h b/include/linux/memory.h
index a6ddefc60517..3dc2a0b12653 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -23,6 +23,30 @@
 
 #define MIN_MEMORY_BLOCK_SIZE     (1UL << SECTION_SIZE_BITS)
 
+/*
+ * NONE:     No memory block is to be created (e.g. device memory).
+ * NORMAL:   Memory block that represents normal (boot or hotplugged) memory
+ *           (e.g. ACPI DIMMs) that should be onlined either automatically
+ *           (memhp_auto_online) or manually by user space to select a
+ *           specific zone.
+ *           Applicable to memhp_auto_online.
+ * STANDBY:  Memory block that represents standby memory that should only
+ *           be onlined on demand by user space (e.g. standby memory on
+ *           s390x), but never automatically by the kernel.
+ *           Not applicable to memhp_auto_online.
+ * PARAVIRT: Memory block that represents memory added by
+ *           paravirtualized mechanisms (e.g. hyper-v, xen) that will
+ *           always automatically get onlined. Memory will be unplugged
+ *           using ballooning, not by relying on the MOVABLE ZONE.
+ *           Not applicable to memhp_auto_online.
+ */
+enum {
+	MEMORY_BLOCK_NONE,
+	MEMORY_BLOCK_NORMAL,
+	MEMORY_BLOCK_STANDBY,
+	MEMORY_BLOCK_PARAVIRT,
+};
+
 struct memory_block {
 	unsigned long start_section_nr;
 	unsigned long end_section_nr;
@@ -34,6 +58,7 @@ struct memory_block {
 	int (*phys_callback)(struct memory_block *);
 	struct device dev;
 	int nid;			/* NID for this memory block */
+	int type;			/* type of this memory block */
 };
 
 int arch_get_memory_phys_device(unsigned long start_pfn);
@@ -111,7 +136,8 @@ extern int register_memory_notifier(struct notifier_block *nb);
 extern void unregister_memory_notifier(struct notifier_block *nb);
 extern int register_memory_isolate_notifier(struct notifier_block *nb);
 extern void unregister_memory_isolate_notifier(struct notifier_block *nb);
-int hotplug_memory_register(int nid, struct mem_section *section);
+int hotplug_memory_register(int nid, struct mem_section *section,
+			    int memory_block_type);
 #ifdef CONFIG_MEMORY_HOTREMOVE
 extern int unregister_memory_section(struct mem_section *);
 #endif
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index ffd9cd10fcf3..b560a9ee0e8c 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -115,18 +115,18 @@ extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
 
 /* reasonably generic interface to expand the physical pages */
 extern int __add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
-		struct vmem_altmap *altmap, bool want_memblock);
+		struct vmem_altmap *altmap, int memory_block_type);
 
 #ifndef CONFIG_ARCH_HAS_ADD_PAGES
 static inline int add_pages(int nid, unsigned long start_pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap,
-		bool want_memblock)
+		int memory_block_type)
 {
-	return __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	return __add_pages(nid, start_pfn, nr_pages, altmap, memory_block_type);
 }
 #else /* ARCH_HAS_ADD_PAGES */
 int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
-		struct vmem_altmap *altmap, bool want_memblock);
+	      struct vmem_altmap *altmap, int memory_block_type);
 #endif /* ARCH_HAS_ADD_PAGES */
 
 #ifdef CONFIG_NUMA
@@ -324,11 +324,12 @@ static inline void __remove_memory(int nid, u64 start, u64 size) {}
 extern void __ref free_area_init_core_hotplug(int nid);
 extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
 		void *arg, int (*func)(struct memory_block *, void *));
-extern int __add_memory(int nid, u64 start, u64 size);
-extern int add_memory(int nid, u64 start, u64 size);
-extern int add_memory_resource(int nid, struct resource *resource, bool online);
+extern int __add_memory(int nid, u64 start, u64 size, int memory_block_type);
+extern int add_memory(int nid, u64 start, u64 size, int memory_block_type);
+extern int add_memory_resource(int nid, struct resource *resource,
+			       int memory_block_type);
 extern int arch_add_memory(int nid, u64 start, u64 size,
-		struct vmem_altmap *altmap, bool want_memblock);
+			   struct vmem_altmap *altmap, int memory_block_type);
 extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
diff --git a/mm/hmm.c b/mm/hmm.c
index c968e49f7a0c..2350f6f6ab42 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -32,6 +32,7 @@
 #include <linux/jump_label.h>
 #include <linux/mmu_notifier.h>
 #include <linux/memory_hotplug.h>
+#include <linux/memory.h>
 
 #define PA_SECTION_SIZE (1UL << PA_SECTION_SHIFT)
 
@@ -1096,10 +1097,11 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
 	 */
 	if (devmem->pagemap.type == MEMORY_DEVICE_PUBLIC)
 		ret = arch_add_memory(nid, align_start, align_size, NULL,
-				false);
+				      MEMORY_BLOCK_NONE);
 	else
 		ret = add_pages(nid, align_start >> PAGE_SHIFT,
-				align_size >> PAGE_SHIFT, NULL, false);
+				align_size >> PAGE_SHIFT, NULL,
+				MEMORY_BLOCK_NONE);
 	if (ret) {
 		mem_hotplug_done();
 		goto error_add_memory;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d4c7e42e46f3..bce6c41d721c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -246,7 +246,7 @@ void __init register_page_bootmem_info_node(struct pglist_data *pgdat)
 #endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */
 
 static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
-		struct vmem_altmap *altmap, bool want_memblock)
+		struct vmem_altmap *altmap, int memory_block_type)
 {
 	int ret;
 
@@ -257,10 +257,11 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
 	if (ret < 0)
 		return ret;
 
-	if (!want_memblock)
+	if (memory_block_type == MEMORY_BLOCK_NONE)
 		return 0;
 
-	return hotplug_memory_register(nid, __pfn_to_section(phys_start_pfn));
+	return hotplug_memory_register(nid, __pfn_to_section(phys_start_pfn),
+				       memory_block_type);
 }
 
 /*
@@ -271,7 +272,7 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
  */
 int __ref __add_pages(int nid, unsigned long phys_start_pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap,
-		bool want_memblock)
+		int memory_block_type)
 {
 	unsigned long i;
 	int err = 0;
@@ -296,7 +297,7 @@ int __ref __add_pages(int nid, unsigned long phys_start_pfn,
 
 	for (i = start_sec; i <= end_sec; i++) {
 		err = __add_section(nid, section_nr_to_pfn(i), altmap,
-				want_memblock);
+				    memory_block_type);
 
 		/*
 		 * EEXIST is finally dealt with by ioresource collision
@@ -1099,7 +1100,8 @@ static int online_memory_block(struct memory_block *mem, void *arg)
  *
  * we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG
  */
-int __ref add_memory_resource(int nid, struct resource *res, bool online)
+int __ref add_memory_resource(int nid, struct resource *res,
+			      int memory_block_type)
 {
 	u64 start, size;
 	bool new_node = false;
@@ -1108,6 +1110,9 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
 	start = res->start;
 	size = resource_size(res);
 
+	if (memory_block_type == MEMORY_BLOCK_NONE)
+		return -EINVAL;
+
 	ret = check_hotplug_memory_range(start, size);
 	if (ret)
 		return ret;
@@ -1128,7 +1133,7 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
 	new_node = ret;
 
 	/* call arch's memory hotadd */
-	ret = arch_add_memory(nid, start, size, NULL, true);
+	ret = arch_add_memory(nid, start, size, NULL, memory_block_type);
 	if (ret < 0)
 		goto error;
 
@@ -1153,8 +1158,8 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
 	/* device_online() will take the lock when calling online_pages() */
 	mem_hotplug_done();
 
-	/* online pages if requested */
-	if (online)
+	if (memory_block_type == MEMORY_BLOCK_PARAVIRT ||
+	    (memory_block_type == MEMORY_BLOCK_NORMAL && memhp_auto_online))
 		walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1),
 				  NULL, online_memory_block);
 
@@ -1169,7 +1174,7 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
 }
 
 /* requires device_hotplug_lock, see add_memory_resource() */
-int __ref __add_memory(int nid, u64 start, u64 size)
+int __ref __add_memory(int nid, u64 start, u64 size, int memory_block_type)
 {
 	struct resource *res;
 	int ret;
@@ -1178,18 +1183,18 @@ int __ref __add_memory(int nid, u64 start, u64 size)
 	if (IS_ERR(res))
 		return PTR_ERR(res);
 
-	ret = add_memory_resource(nid, res, memhp_auto_online);
+	ret = add_memory_resource(nid, res, memory_block_type);
 	if (ret < 0)
 		release_memory_resource(res);
 	return ret;
 }
 
-int add_memory(int nid, u64 start, u64 size)
+int add_memory(int nid, u64 start, u64 size, int memory_block_type)
 {
 	int rc;
 
 	lock_device_hotplug();
-	rc = __add_memory(nid, start, size);
+	rc = __add_memory(nid, start, size, memory_block_type);
 	unlock_device_hotplug();
 
 	return rc;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-09-28 15:03 ` David Hildenbrand
  (?)
@ 2018-09-28 17:02   ` Dave Hansen
  -1 siblings, 0 replies; 144+ messages in thread
From: Dave Hansen @ 2018-09-28 17:02 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Heiko Carstens,
	Pavel Tatashin, Michal Hocko, Paul Mackerras, H. Peter Anvin,
	Rashmica Gupta, Boris Ostrovsky, linux-s390, Michael Neuling,
	Stephen Hemminger, Yoshinori Sato, Michael Ellerman, linux-acpi,
	Ingo Molnar, xen-devel, Rob Herring, Len Brown

It's really nice if these kinds of things are broken up.  First, replace
the old want_memblock parameter, then add the parameter to the
__add_pages() calls.

> +/*
> + * NONE:     No memory block is to be created (e.g. device memory).
> + * NORMAL:   Memory block that represents normal (boot or hotplugged) memory
> + *           (e.g. ACPI DIMMs) that should be onlined either automatically
> + *           (memhp_auto_online) or manually by user space to select a
> + *           specific zone.
> + *           Applicable to memhp_auto_online.
> + * STANDBY:  Memory block that represents standby memory that should only
> + *           be onlined on demand by user space (e.g. standby memory on
> + *           s390x), but never automatically by the kernel.
> + *           Not applicable to memhp_auto_online.
> + * PARAVIRT: Memory block that represents memory added by
> + *           paravirtualized mechanisms (e.g. hyper-v, xen) that will
> + *           always automatically get onlined. Memory will be unplugged
> + *           using ballooning, not by relying on the MOVABLE ZONE.
> + *           Not applicable to memhp_auto_online.
> + */
> +enum {
> +	MEMORY_BLOCK_NONE,
> +	MEMORY_BLOCK_NORMAL,
> +	MEMORY_BLOCK_STANDBY,
> +	MEMORY_BLOCK_PARAVIRT,
> +};

This does not seem like the best way to expose these.

STANDBY, for instance, seems to be essentially a replacement for a check
against running on s390 in userspace to implement a _typical_ s390
policy.  It seems rather weird to try to make the userspace policy
determination easier by telling userspace about the typical s390 policy
via the kernel.

As for the OOM issues, that sounds like something we need to fix by
refusing to do (or delaying) hot-add operations once we consume too much
ZONE_NORMAL from memmap[]s rather than trying to indirectly tell
userspace to hurry things along.

So, to my eye, we need:

 +enum {
 +	MEMORY_BLOCK_NONE,
 +	MEMORY_BLOCK_STANDBY, /* the default */
 +	MEMORY_BLOCK_AUTO_ONLINE,
 +};

and we can probably collapse NONE into AUTO_ONLINE because userspace
ends up doing the same thing for both: nothing.

>  struct memory_block {
>  	unsigned long start_section_nr;
>  	unsigned long end_section_nr;
> @@ -34,6 +58,7 @@ struct memory_block {
>  	int (*phys_callback)(struct memory_block *);
>  	struct device dev;
>  	int nid;			/* NID for this memory block */
> +	int type;			/* type of this memory block */
>  };

Shouldn't we just be creating and using an actual named enum type?



* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
@ 2018-09-28 17:02   ` Dave Hansen
  0 siblings, 0 replies; 144+ messages in thread
From: Dave Hansen @ 2018-09-28 17:02 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Heiko Carstens, Pavel Tatashin, Michal Hocko, Paul Mackerras,
	H. Peter Anvin, Rashmica Gupta, K. Y. Srinivasan,
	Boris Ostrovsky, linux-s390, Michael Neuling, Stephen Hemminger,
	Yoshinori Sato, linux-acpi, Ingo Molnar, xen-devel, Rob Herring,
	Len Brown, Fenghua Yu, Stephen Rothwell, mike.travis,
	Haiyang Zhang, Dan Williams, Jonathan Neuschäfer,
	Nicholas Piggin, Joe Perches, Jérôme Glisse,
	Mike Rapoport, Borislav Petkov, Andy Lutomirski, Thomas Gleixner,
	Joonsoo Kim, Oscar Salvador, Juergen Gross, Tony Luck,
	Mathieu Malaterre, Greg Kroah-Hartman, Rafael J. Wysocki,
	linux-kernel, Mauricio Faria de Oliveira, Philippe Ombredanne,
	Martin Schwidefsky, devel, Andrew Morton, linuxppc-dev,
	Kirill A. Shutemov

It's really nice if these kinds of things are broken up.  First, replace
the old want_memblock parameter, then add the parameter to the
__add_page() calls.

> +/*
> + * NONE:     No memory block is to be created (e.g. device memory).
> + * NORMAL:   Memory block that represents normal (boot or hotplugged) memory
> + *           (e.g. ACPI DIMMs) that should be onlined either automatically
> + *           (memhp_auto_online) or manually by user space to select a
> + *           specific zone.
> + *           Applicable to memhp_auto_online.
> + * STANDBY:  Memory block that represents standby memory that should only
> + *           be onlined on demand by user space (e.g. standby memory on
> + *           s390x), but never automatically by the kernel.
> + *           Not applicable to memhp_auto_online.
> + * PARAVIRT: Memory block that represents memory added by
> + *           paravirtualized mechanisms (e.g. hyper-v, xen) that will
> + *           always automatically get onlined. Memory will be unplugged
> + *           using ballooning, not by relying on the MOVABLE ZONE.
> + *           Not applicable to memhp_auto_online.
> + */
> +enum {
> +	MEMORY_BLOCK_NONE,
> +	MEMORY_BLOCK_NORMAL,
> +	MEMORY_BLOCK_STANDBY,
> +	MEMORY_BLOCK_PARAVIRT,
> +};

This does not seem like the best way to expose these.

STANDBY, for instance, seems to be essentially a replacement for a check
against running on s390 in userspace to implement a _typical_ s390
policy.  It seems rather weird to try to make the userspace policy
determination easier by telling userspace about the typical s390 policy
via the kernel.

As for the OOM issues, that sounds like something we need to fix by
refusing to do (or delaying) hot-add operations once we consume too much
ZONE_NORMAL from memmap[]s rather than trying to indirectly tell
userspace to hurry things along.

So, to my eye, we need:

 +enum {
 +	MEMORY_BLOCK_NONE,
 +	MEMORY_BLOCK_STANDBY, /* the default */
 +	MEMORY_BLOCK_AUTO_ONLINE,
 +};

and we can probably collapse NONE into AUTO_ONLINE because userspace
ends up doing the same thing for both: nothing.

>  struct memory_block {
>  	unsigned long start_section_nr;
>  	unsigned long end_section_nr;
> @@ -34,6 +58,7 @@ struct memory_block {
>  	int (*phys_callback)(struct memory_block *);
>  	struct device dev;
>  	int nid;			/* NID for this memory block */
> +	int type;			/* type of this memory block */
>  };

Shouldn't we just be creating and using an actual named enum type?

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-09-28 15:03 ` David Hildenbrand
  (?)
@ 2018-10-01  8:40   ` Michal Hocko
  -1 siblings, 0 replies; 144+ messages in thread
From: Michal Hocko @ 2018-10-01  8:40 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Dave Hansen,
	Heiko Carstens, linux-mm, Pavel Tatashin, Paul Mackerras,
	H. Peter Anvin, Rashmica Gupta, Boris Ostrovsky, linux-s390,
	Michael Neuling, Stephen Hemminger, Yoshinori Sato,
	Michael Ellerman, linux-acpi, Ingo Molnar, xen-devel,
	Rob Herring

On Fri 28-09-18 17:03:57, David Hildenbrand wrote:
[...]

I haven't read the patch itself but I just wanted to note one thing
about this part

> For paravirtualized devices it is relevant that memory is onlined as
> quickly as possible after adding - and that it is added to the NORMAL
> zone. Otherwise, it could happen that too much memory in a row is added
> (but not onlined), resulting in out-of-memory conditions due to the
> additional memory for "struct pages" and friends. MOVABLE zone as well
> as delays might be very problematic and lead to crashes (e.g. zone
> imbalance).

I have proposed (but haven't finished this due to other stuff) a
solution for this. Newly added memory can host memmaps itself and then
you do not have the problem in the first place. For vmemmap it would
have an advantage that you do not really have to beg for 2MB pages to
back the whole section but you would get it for free because the initial
part of the section is by definition properly aligned and unused.

I yet have to think about the whole proposal but I am missing the most
important part. _Who_ is going to use the new exported information and
for what purpose. You said that distributions have a hard time
distinguishing different types of onlining policies, but isn't this
something that is inherently use-case specific?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-09-28 17:02   ` Dave Hansen
  (?)
@ 2018-10-01  9:13     ` David Hildenbrand
  -1 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-10-01  9:13 UTC (permalink / raw)
  To: Dave Hansen, linux-mm
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Heiko Carstens,
	Pavel Tatashin, Michal Hocko, Paul Mackerras, H. Peter Anvin,
	Rashmica Gupta, Boris Ostrovsky, linux-s390, Michael Neuling,
	Stephen Hemminger, Yoshinori Sato, Michael Ellerman, linux-acpi,
	Ingo Molnar, xen-devel, Rob Herring, Len Brown

On 28/09/2018 19:02, Dave Hansen wrote:
> It's really nice if these kinds of things are broken up.  First, replace
> the old want_memblock parameter, then add the parameter to the
> __add_page() calls.

Definitely, once we agree that it is not nuts, I will split it up for
the next version :)

> 
>> +/*
>> + * NONE:     No memory block is to be created (e.g. device memory).
>> + * NORMAL:   Memory block that represents normal (boot or hotplugged) memory
>> + *           (e.g. ACPI DIMMs) that should be onlined either automatically
>> + *           (memhp_auto_online) or manually by user space to select a
>> + *           specific zone.
>> + *           Applicable to memhp_auto_online.
>> + * STANDBY:  Memory block that represents standby memory that should only
>> + *           be onlined on demand by user space (e.g. standby memory on
>> + *           s390x), but never automatically by the kernel.
>> + *           Not applicable to memhp_auto_online.
>> + * PARAVIRT: Memory block that represents memory added by
>> + *           paravirtualized mechanisms (e.g. hyper-v, xen) that will
>> + *           always automatically get onlined. Memory will be unplugged
>> + *           using ballooning, not by relying on the MOVABLE ZONE.
>> + *           Not applicable to memhp_auto_online.
>> + */
>> +enum {
>> +	MEMORY_BLOCK_NONE,
>> +	MEMORY_BLOCK_NORMAL,
>> +	MEMORY_BLOCK_STANDBY,
>> +	MEMORY_BLOCK_PARAVIRT,
>> +};
> 
> This does not seem like the best way to expose these.
> 
> STANDBY, for instance, seems to be essentially a replacement for a check
> against running on s390 in userspace to implement a _typical_ s390
> policy.  It seems rather weird to try to make the userspace policy
> determination easier by telling userspace about the typical s390 policy
> via the kernel.

Now comes the fun part: I am working on another paravirtualized memory
hotplug mechanism for KVM guests, based on virtio ("virtio-mem").

These devices can potentially be used concurrently with
- s390x standby memory
- DIMMs

What should a policy in user space look like when new memory gets added
- on s390x? Not onlining paravirtualized memory is very wrong.
- on e.g. x86? Onlining memory to the MOVABLE zone is very wrong.

So having the type of memory available in user space is very important here.
Relying on checks like "isS390()", "isKVMGuest()" or "isHyperVGuest()"
to decide whether to online memory and how to online memory is wrong.
Only some specific memory types (which I call "normal") are to be
handled by user space.

For the other ones, we exactly know what to do:
- standby? don't online
- paravirt? always online to normal zone

I will add some more details as reply to Michal.

> 
> As for the OOM issues, that sounds like something we need to fix by
> refusing to do (or delaying) hot-add operations once we consume too much
> ZONE_NORMAL from memmap[]s rather than trying to indirectly tell
> userspace to hurry things along.

That is a moving target and doing that automatically is basically
impossible. You can add a lot of memory to the movable zone and
everything is fine. Suddenly a lot of processes are started - boom.
MOVABLE should only ever be used if you expect an unplug. And for
paravirtualized devices, a "typical" unplug does not exist.

> 
> So, to my eye, we need:
> 
>  +enum {
>  +	MEMORY_BLOCK_NONE,
>  +	MEMORY_BLOCK_STANDBY, /* the default */
>  +	MEMORY_BLOCK_AUTO_ONLINE,
>  +};

auto-online is strongly misleading; that's why I called it "normal", but
I am open to suggestions. The information about devices handled fully
in the kernel - "paravirt" - is key for me.

> 
> and we can probably collapse NONE into AUTO_ONLINE because userspace
> ends up doing the same thing for both: nothing.

For external reasons, yes, for internal reasons no (see hmm/device
memory). In user space, we will never end up with MEMORY_BLOCK_NONE,
because there is no memory block.

> 
>>  struct memory_block {
>>  	unsigned long start_section_nr;
>>  	unsigned long end_section_nr;
>> @@ -34,6 +58,7 @@ struct memory_block {
>>  	int (*phys_callback)(struct memory_block *);
>>  	struct device dev;
>>  	int nid;			/* NID for this memory block */
>> +	int type;			/* type of this memory block */
>>  };
> 
> Shouldn't we just be creating and using an actual named enum type?
> 

That makes sense.

Thanks!

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-09-28 17:02   ` Dave Hansen
  (?)
  (?)
@ 2018-10-01  9:13   ` David Hildenbrand
  -1 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-10-01  9:13 UTC (permalink / raw)
  To: Dave Hansen, linux-mm
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Heiko Carstens,
	Pavel Tatashin, Michal Hocko, Paul Mackerras, H. Peter Anvin,
	Rashmica Gupta, K. Y. Srinivasan, Boris Ostrovsky, linux-s390,
	Michael Neuling, Stephen Hemminger, Yoshinori Sato,
	Michael Ellerman, linux-acpi, Ingo Molnar, xen-devel,
	Rob Herring

On 28/09/2018 19:02, Dave Hansen wrote:
> It's really nice if these kinds of things are broken up.  First, replace
> the old want_memblock parameter, then add the parameter to the
> __add_page() calls.

Definitely, once we agree that is is not nuts, I will split it up for
the next version :)

> 
>> +/*
>> + * NONE:     No memory block is to be created (e.g. device memory).
>> + * NORMAL:   Memory block that represents normal (boot or hotplugged) memory
>> + *           (e.g. ACPI DIMMs) that should be onlined either automatically
>> + *           (memhp_auto_online) or manually by user space to select a
>> + *           specific zone.
>> + *           Applicable to memhp_auto_online.
>> + * STANDBY:  Memory block that represents standby memory that should only
>> + *           be onlined on demand by user space (e.g. standby memory on
>> + *           s390x), but never automatically by the kernel.
>> + *           Not applicable to memhp_auto_online.
>> + * PARAVIRT: Memory block that represents memory added by
>> + *           paravirtualized mechanisms (e.g. hyper-v, xen) that will
>> + *           always automatically get onlined. Memory will be unplugged
>> + *           using ballooning, not by relying on the MOVABLE ZONE.
>> + *           Not applicable to memhp_auto_online.
>> + */
>> +enum {
>> +	MEMORY_BLOCK_NONE,
>> +	MEMORY_BLOCK_NORMAL,
>> +	MEMORY_BLOCK_STANDBY,
>> +	MEMORY_BLOCK_PARAVIRT,
>> +};
> 
> This does not seem like the best way to expose these.
> 
> STANDBY, for instance, seems to be essentially a replacement for a check
> against running on s390 in userspace to implement a _typical_ s390
> policy.  It seems rather weird to try to make the userspace policy
> determination easier by telling userspace about the typical s390 policy
> via the kernel.

Now comes the fun part: I am working on another paravirtualized memory
hotplug mechanism for KVM guests, based on virtio ("virtio-mem").

These devices can potentially be used concurrently with
- s390x standby memory
- DIMMs

What should a policy in user space look like when new memory gets added
- on s390x? Not onlining paravirtualized memory is very wrong.
- on e.g. x86? Onlining memory to the MOVABLE zone is very wrong.

So the type of memory is very important here to have in user space.
Relying on checks like "isS390()", "isKVMGuest()" or "isHyperVGuest()"
to decide whether to online memory and how to online memory is wrong.
Only some specific memory types (which I call "normal") are to be
handled by user space.

For the other ones, we exactly know what to do:
- standby? don't online
- paravirt? always online to normal zone
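
As a sketch of what such a user-space policy could look like once the
type is exposed (e.g. via sysfs), assuming hypothetical type strings
matching the proposed enum - the function name and string values are
invented for illustration, not part of the patch:

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical user-space policy helper: map a memory block type (as
 * this patch would expose it) to an onlining action. Only "normal"
 * memory is left to the administrator; the rest is known in advance.
 */
static const char *policy_for_type(const char *type)
{
	if (strcmp(type, "standby") == 0)
		return "keep-offline";	/* online only on admin request */
	if (strcmp(type, "paravirt") == 0)
		return "online-normal";	/* always online to the NORMAL zone */
	if (strcmp(type, "normal") == 0)
		return "ask-userspace";	/* admin decides, e.g. ZONE_MOVABLE */
	return "ignore";		/* unknown type: do nothing */
}
```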

I will add some more details as reply to Michal.

> 
> As for the OOM issues, that sounds like something we need to fix by
> refusing to do (or delaying) hot-add operations once we consume too much
> ZONE_NORMAL from memmap[]s rather than trying to indirectly tell
> userspace to hurry thing along.

That is a moving target and doing that automatically is basically
impossible. You can add a lot of memory to the movable zone and
everything is fine. Suddenly a lot of processes are started - boom.
MOVABLE should only ever be used if you expect an unplug. And for
paravirtualized devices, a "typical" unplug does not exist.

> 
> So, to my eye, we need:
> 
>  +enum {
>  +	MEMORY_BLOCK_NONE,
>  +	MEMORY_BLOCK_STANDBY, /* the default */
>  +	MEMORY_BLOCK_AUTO_ONLINE,
>  +};

"auto-online" is strongly misleading, which is why I called it "normal",
but I am open to suggestions. The information about devices handled fully
in the kernel - "paravirt" - is key for me.

> 
> and we can probably collapse NONE into AUTO_ONLINE because userspace
> ends up doing the same thing for both: nothing.

For external reasons, yes, for internal reasons no (see hmm/device
memory). In user space, we will never end up with MEMORY_BLOCK_NONE,
because there is no memory block.

> 
>>  struct memory_block {
>>  	unsigned long start_section_nr;
>>  	unsigned long end_section_nr;
>> @@ -34,6 +58,7 @@ struct memory_block {
>>  	int (*phys_callback)(struct memory_block *);
>>  	struct device dev;
>>  	int nid;			/* NID for this memory block */
>> +	int type;			/* type of this memory block */
>>  };
> 
> Shouldn't we just be creating and using an actual named enum type?
> 

That makes sense.
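
A named enum could look like the following; the type name
`memory_block_type` and the stub struct are assumptions for illustration:

```c
#include <assert.h>

/* A possible named enum type for the memory block types above. */
enum memory_block_type {
	MEMORY_BLOCK_NONE,
	MEMORY_BLOCK_NORMAL,
	MEMORY_BLOCK_STANDBY,
	MEMORY_BLOCK_PARAVIRT,
};

/* The struct member would then carry the enum instead of a plain int. */
struct memory_block_stub {
	enum memory_block_type type;	/* type of this memory block */
};
```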

Thanks!

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-10-01  8:40   ` Michal Hocko
  (?)
@ 2018-10-01  9:34     ` David Hildenbrand
  -1 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-10-01  9:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Dave Hansen,
	Heiko Carstens, linux-mm, Pavel Tatashin, Paul Mackerras,
	H. Peter Anvin, Rashmica Gupta, Boris Ostrovsky, linux-s390,
	Michael Neuling, Stephen Hemminger, Yoshinori Sato,
	Michael Ellerman, linux-acpi, Ingo Molnar, xen-devel,
	Rob Herring

On 01/10/2018 10:40, Michal Hocko wrote:
> On Fri 28-09-18 17:03:57, David Hildenbrand wrote:
> [...]
> 
> I haven't read the patch itself but I just wanted to note one thing
> about this part
> 
>> For paravirtualized devices it is relevant that memory is onlined as
>> quickly as possible after adding - and that it is added to the NORMAL
>> zone. Otherwise, it could happen that too much memory in a row is added
>> (but not onlined), resulting in out-of-memory conditions due to the
>> additional memory for "struct pages" and friends. MOVABLE zone as well
>> as delays might be very problematic and lead to crashes (e.g. zone
>> imbalance).
> 
> I have proposed (but haven't finished this due to other stuff) a
> solution for this. Newly added memory can host memmaps itself and then
> you do not have the problem in the first place. For vmemmap it would
> have an advantage that you do not really have to beg for 2MB pages to
> back the whole section but you would get it for free because the initial
> part of the section is by definition properly aligned and unused.

So the plan is to "host metadata for new memory on the memory itself".
Just want to note that this is basically impossible for s390x with the
current mechanisms. (added memory is dead, until onlining notifies the
hypervisor and memory is allocated). It will also be problematic for
paravirtualized memory devices (e.g. XEN's "not backed by the
hypervisor" hacks).

This would only be possible for memory DIMMs - memory that is, as far as
I can see, completely accessible, or where at least some specified
"first part" is accessible.

Other metadata, such as extended struct pages and friends, poses similar problems.

(I really like the idea of adding memory without allocating memory in
the hypervisor in the first place, please keep me posted).

And please note: This solves some problematic part ("adding too much
memory to the movable zone or not onlining it"), but not the issue of
zone imbalance in the first place, and not one issue I am trying to
tackle here: don't add paravirtualized memory to the movable zone.
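
For a sense of scale, the "struct pages" overhead under discussion can be
sketched; the constants below assume x86-64 defaults (128 MiB memory
sections, 4 KiB base pages, a 64-byte struct page) and are assumptions
for illustration:

```c
#include <assert.h>

/* Assumed x86-64 defaults, for illustration only. */
#define SECTION_SIZE	(128UL << 20)	/* 128 MiB memory section */
#define BASE_PAGE_SIZE	(4UL << 10)	/* 4 KiB base page */
#define STRUCT_PAGE_SZ	64UL		/* sizeof(struct page) */

/* memmap bytes needed per hot-added section. */
static unsigned long memmap_bytes_per_section(void)
{
	unsigned long pages = SECTION_SIZE / BASE_PAGE_SIZE;	/* 32768 */

	return pages * STRUCT_PAGE_SZ;	/* 2 MiB, matching the "2MB pages"
					 * vmemmap backing mentioned above */
}
```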

> 
> I yet have to think about the whole proposal but I am missing the most
> important part. _Who_ is going to use the new exported information and
> for what purpose. You said that distributions have a hard time
> distinguishing different types of onlining policies but isn't this
> something that is inherently use-case specific?
> 

Let's think about a distribution. We have a clash of use cases here
(just what you describe). What I propose solves one part of it ("handle
what you know how to handle right in the kernel").

1. Users of DIMMs usually expect that they can be unplugged again. That
is why you want to control how to online memory in user space (== add it
to the movable zone).

2. Users of standby memory (s390) expect that memory will never be
onlined automatically. It will be onlined manually.

3. Users of paravirtualized devices (esp. Hyper-V) don't care about
memory unplug in the sense of MOVABLE at all. They (or Hyper-V!) will
add a whole bunch of memory and expect that everything works fine. So
that memory is onlined immediately and that memory is added to the
NORMAL zone. Users never want the MOVABLE zone.

1. is a reason why distributions usually don't configure
"MEMORY_HOTPLUG_DEFAULT_ONLINE", because you really want the option for
MOVABLE zone. That, however, implies that e.g. on x86 you have to
handle all new memory in user space, including Hyper-V memory.
There, you then have to check for things like "isHyperV()" to decide
"oh, yes, this should definitely not go to the MOVABLE zone".

As you know, I am working on virtio-mem, which can basically be combined
with 1 or 2. And user space has no idea about the difference between
added memory blocks. Was it memory from a DIMM (== ZONE_MOVABLE)? Was it
memory from a paravirtualized device (== ZONE_NORMAL)? Was it standby
memory? (don't online)


That part, I try to solve with this interface.

To answer your question: User space will only care about "normal" memory
and then decide how to online it (for now, usually MOVABLE, because
that's what customers expect with DIMMs). The use case of DIMMs we
don't know, and therefore we can't expose it. The use case of the other
types we already know exactly in the kernel.

Existing user space hacks will continue to work but can be replaced by a
new check against "normal" memory block types.

Thanks for looking into this!

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-10-01  9:13     ` David Hildenbrand
  (?)
@ 2018-10-01 16:24       ` Dave Hansen
  -1 siblings, 0 replies; 144+ messages in thread
From: Dave Hansen @ 2018-10-01 16:24 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Heiko Carstens,
	Pavel Tatashin, Michal Hocko, Paul Mackerras, H. Peter Anvin,
	Rashmica Gupta, Boris Ostrovsky, linux-s390, Michael Neuling,
	Stephen Hemminger, Yoshinori Sato, Michael Ellerman, linux-acpi,
	Ingo Molnar, xen-devel, Rob Herring, Len Brown

> What should a policy in user space look like when new memory gets added
> - on s390x? Not onlining paravirtualized memory is very wrong.

Because we're going to balloon it away in a moment anyway?

We have auto-onlining.  Why isn't that being used on s390?


> So the type of memory is very important here to have in user space.
> Relying on checks like "isS390()", "isKVMGuest()" or "isHyperVGuest()"
> to decide whether to online memory and how to online memory is wrong.
> Only some specific memory types (which I call "normal") are to be
> handled by user space.
> 
> For the other ones, we exactly know what to do:
> - standby? don't online

I think you're horribly conflating the software desire for what the
state should be with the hardware itself.

>> As for the OOM issues, that sounds like something we need to fix by
>> refusing to do (or delaying) hot-add operations once we consume too much
>> ZONE_NORMAL from memmap[]s rather than trying to indirectly tell
>> userspace to hurry thing along.
> 
> That is a moving target and doing that automatically is basically
> impossible.

Nah.  We know how much metadata we've allocated.  We know how much
ZONE_NORMAL we are eating.  We can *easily* add something to
add_memory() that just sleeps until the ratio is not out-of-whack.

> You can add a lot of memory to the movable zone and
> everything is fine. Suddenly a lot of processes are started - boom.
> MOVABLE should only ever be used if you expect an unplug. And for
> paravirtualized devices, a "typical" unplug does not exist.

No, it's more complicated than that.  People use MOVABLE, for instance,
to allow more consistent huge page allocations.  It's certainly not just
hot-remove.

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-10-01  9:13     ` David Hildenbrand
  (?)
  (?)
@ 2018-10-01 16:24     ` Dave Hansen
  -1 siblings, 0 replies; 144+ messages in thread
From: Dave Hansen @ 2018-10-01 16:24 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Heiko Carstens,
	Pavel Tatashin, Michal Hocko, Paul Mackerras, H. Peter Anvin,
	Rashmica Gupta, K. Y. Srinivasan, Boris Ostrovsky, linux-s390,
	Michael Neuling, Stephen Hemminger, Yoshinori Sato,
	Michael Ellerman, linux-acpi, Ingo Molnar, xen-devel,
	Rob Herring

> How should a policy in user space look like when new memory gets added
> - on s390x? Not onlining paravirtualized memory is very wrong.

Because we're going to balloon it away in a moment anyway?

We have auto-onlining.  Why isn't that being used on s390?


> So the type of memory is very important here to have in user space.
> Relying on checks like "isS390()", "isKVMGuest()" or "isHyperVGuest()"
> to decide whether to online memory and how to online memory is wrong.
> Only some specific memory types (which I call "normal") are to be
> handled by user space.
> 
> For the other ones, we exactly know what to do:
> - standby? don't online

I think you're horribly conflating the software desire for what the stae
should be and the hardware itself.

>> As for the OOM issues, that sounds like something we need to fix by
>> refusing to do (or delaying) hot-add operations once we consume too much
>> ZONE_NORMAL from memmap[]s rather than trying to indirectly tell
>> userspace to hurry thing along.
> 
> That is a moving target and doing that automatically is basically
> impossible.

Nah.  We know how much metadata we've allocated.  We know how much
ZONE_NORMAL we are eating.  We can *easily* add something to
add_memory() that just sleeps until the ratio is not out-of-whack.

> You can add a lot of memory to the movable zone and
> everything is fine. Suddenly a lot of processes are started - boom.
> MOVABLE should only every be used if you expect an unplug. And for
> paravirtualized devices, a "typical" unplug does not exist.

No, it's more complicated than that.  People use MOVABLE, for instance,
to allow more consistent huge page allocations.  It's certainly not just
hot-remove.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-10-01  9:34     ` David Hildenbrand
@ 2018-10-02 13:47       ` Michal Hocko
  -1 siblings, 0 replies; 144+ messages in thread
From: Michal Hocko @ 2018-10-02 13:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Dave Hansen,
	Heiko Carstens, linux-mm, Pavel Tatashin, Paul Mackerras,
	H. Peter Anvin, Rashmica Gupta, Boris Ostrovsky, linux-s390,
	Michael Neuling, Stephen Hemminger, Yoshinori Sato,
	Michael Ellerman, linux-acpi, Ingo Molnar, xen-devel,
	Rob Herring

On Mon 01-10-18 11:34:25, David Hildenbrand wrote:
> On 01/10/2018 10:40, Michal Hocko wrote:
> > On Fri 28-09-18 17:03:57, David Hildenbrand wrote:
> > [...]
> > 
> > I haven't read the patch itself but I just wanted to note one thing
> > about this part
> > 
> >> For paravirtualized devices it is relevant that memory is onlined as
> >> quickly as possible after adding - and that it is added to the NORMAL
> >> zone. Otherwise, it could happen that too much memory in a row is added
> >> (but not onlined), resulting in out-of-memory conditions due to the
> >> additional memory for "struct pages" and friends. MOVABLE zone as well
> >> as delays might be very problematic and lead to crashes (e.g. zone
> >> imbalance).
> > 
> > I have proposed (but haven't finished this due to other stuff) a
> > solution for this. Newly added memory can host memmaps itself and then
> > you do not have the problem in the first place. For vmemmap it would
> > have an advantage that you do not really have to beg for 2MB pages to
> > back the whole section but you would get it for free because the initial
> > part of the section is by definition properly aligned and unused.
> 
> So the plan is to "host metadata for new memory on the memory itself".
> Just want to note that this is basically impossible for s390x with the
> current mechanisms. (added memory is dead, until onlining notifies the
> hypervisor and memory is allocated). It will also be problematic for
> paravirtualized memory devices (e.g. XEN's "not backed by the
> hypervisor" hacks).

OK, I understand that not all usecases can use self-hosted memmaps;
the others do not have much choice left, though. You have to allocate
from somewhere. Well, an alternative would be to have no memmap until
onlining, but I am not sure how much work that would be.
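
For reference, a sketch of the arithmetic behind the "2MB pages" remark quoted above (x86-64 defaults assumed: 128 MiB sections, 4 KiB pages, a 64-byte struct page):

```c
/* Hypothetical constants for illustration; these match common x86-64
 * SPARSEMEM defaults but are not taken from kernel headers. */
#define SECTION_BYTES		(128UL << 20)
#define PAGE_BYTES		4096UL
#define STRUCT_PAGE_BYTES	64UL

static unsigned long pages_per_section(void)
{
	return SECTION_BYTES / PAGE_BYTES;	/* 32768 pages */
}

/* Pages the section's own memmap would occupy at its aligned start. */
static unsigned long memmap_pages_per_section(void)
{
	return pages_per_section() * STRUCT_PAGE_BYTES / PAGE_BYTES;
}
```

That comes to 512 pages, i.e. exactly 2 MiB: the aligned, initially unused start of a hotplugged section can back its own vmemmap with a single 2 MB page, with no allocation from elsewhere.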

> This would only be possible for memory DIMMs, memory that is completely
> accessible as far as I can see. Or at least, some specified "first part"
> is accessible.
> 
> Other problems are other metadata like extended struct pages and friends.

I wouldn't really worry about extended struct pages. Those should be
used for debugging purposes mostly. Or at least that was the case last
time I checked.

> (I really like the idea of adding memory without allocating memory in
> the hypervisor in the first place, please keep me tuned).
> 
> And please note: This solves some problematic part ("adding too much
> memory to the movable zone or not onlining it"), but not the issue of
> zone imbalance in the first place. And not one issue I try to tackle
> here: don't add paravirtualized memory to the movable zone.

Zone imbalance is an inherent problem of the highmem zone. It is
essentially the highmem zone we all loved so much back in 32b days.
Yes the movable zone doesn't have any addressing limitations so it is a
bit more relaxed but considering the hotplug scenarios I have seen so
far people just want to have full NUMA nodes movable to allow replacing
DIMMs. And then we are back to square one and the zone imbalance issue.
You have those regardless where memmaps are allocated from.

> > I yet have to think about the whole proposal but I am missing the most
> > important part. _Who_ is going to use the new exported information and
> > for what purpose. You said that distributions have a hard time
> > distinguishing different types of onlining policies but isn't this
> > something that is inherently usecase specific?
> > 
> 
> Let's think about a distribution. We have a clash of use cases here
> (just what you describe). What I propose solves one part of it ("handle
> what you know how to handle right in the kernel").
> 
> 1. Users of DIMMs usually expect that they can be unplugged again. That
> is why you want to control how to online memory in user space (== add it
> to the movable zone).

Which is only true if you really want to hotremove them. I am not going
to tell how much I believe in this usecase but movable policy is not
generally applicable here.

> 2. Users of standby memory (s390) expect that memory will never be
> onlined automatically. It will be onlined manually.

yeah

> 3. Users of paravirtualized devices (esp. Hyper-V) don't care about
> memory unplug in the sense of MOVABLE at all. They (or Hyper-V!) will
> add a whole bunch of memory and expect that everything works fine. So
> that memory is onlined immediately and that memory is added to the
> NORMAL zone. Users never want the MOVABLE zone.

Then the immediate question would be why to use memory hotplug for that
at all? Why don't you simply start with a huge pre-allocated physical
address space and balloon memory in and out on demand? Why do you want
to inject new memory during the runtime?

> 1. is a reason why distributions usually don't configure
> "MEMORY_HOTPLUG_DEFAULT_ONLINE", because you really want the option for
> MOVABLE zone. That however implies that, e.g. for x86, you have to
> handle all new memory in user space, especially also HyperV memory.
> There, you then have to check for things like "isHyperV()" to decide
> "oh, yes, this should definitely not go to the MOVABLE zone".

Why do you need a generic hotplug rule in the first place? Why don't you
simply provide different set of rules for different usecases? Let users
decide which usecase they prefer rather than try to be clever which
almost always hits weird corner cases.
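
As a sketch of what such per-usecase rules could look like once the type is exposed (type and policy names here are hypothetical, not taken from the patch):

```c
/* One rule per memory type, instead of one generic hotplug rule. */
enum memory_block_type {
	MEMORY_BLOCK_NORMAL,	/* e.g. a DIMM: user space decides */
	MEMORY_BLOCK_STANDBY,	/* s390x standby: never auto-online */
	MEMORY_BLOCK_PARAVIRT,	/* balloon-added: online to NORMAL */
};

enum online_policy {
	ONLINE_POLICY_USERSPACE,	/* defer to udev rules */
	ONLINE_POLICY_NONE,		/* wait for the administrator */
	ONLINE_POLICY_NORMAL_ZONE,	/* online immediately, NORMAL zone */
};

static enum online_policy default_policy(enum memory_block_type type)
{
	switch (type) {
	case MEMORY_BLOCK_STANDBY:
		return ONLINE_POLICY_NONE;
	case MEMORY_BLOCK_PARAVIRT:
		return ONLINE_POLICY_NORMAL_ZONE;
	default:
		return ONLINE_POLICY_USERSPACE;
	}
}
```

User space (or the kernel) could then pick the rule per type, instead of guessing from "isS390()"-style environment checks.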
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-10-02 13:47       ` Michal Hocko
  (?)
@ 2018-10-02 15:25         ` David Hildenbrand
  -1 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-10-02 15:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Dave Hansen,
	Heiko Carstens, linux-mm, Pavel Tatashin, Paul Mackerras,
	H. Peter Anvin, Rashmica Gupta, Boris Ostrovsky, linux-s390,
	Michael Neuling, Stephen Hemminger, Yoshinori Sato,
	Michael Ellerman, linux-acpi, Ingo Molnar, xen-devel,
	Rob Herring

On 02/10/2018 15:47, Michal Hocko wrote:
> On Mon 01-10-18 11:34:25, David Hildenbrand wrote:
>> On 01/10/2018 10:40, Michal Hocko wrote:
>>> On Fri 28-09-18 17:03:57, David Hildenbrand wrote:
>>> [...]
>>>
>>> I haven't read the patch itself but I just wanted to note one thing
>>> about this part
>>>
>>>> For paravirtualized devices it is relevant that memory is onlined as
>>>> quickly as possible after adding - and that it is added to the NORMAL
>>>> zone. Otherwise, it could happen that too much memory in a row is added
>>>> (but not onlined), resulting in out-of-memory conditions due to the
>>>> additional memory for "struct pages" and friends. MOVABLE zone as well
>>>> as delays might be very problematic and lead to crashes (e.g. zone
>>>> imbalance).
>>>
>>> I have proposed (but haven't finished this due to other stuff) a
>>> solution for this. Newly added memory can host memmaps itself and then
>>> you do not have the problem in the first place. For vmemmap it would
>>> have an advantage that you do not really have to beg for 2MB pages to
>>> back the whole section but you would get it for free because the initial
>>> part of the section is by definition properly aligned and unused.
>>
>> So the plan is to "host metadata for new memory on the memory itself".
>> Just want to note that this is basically impossible for s390x with the
>> current mechanisms. (added memory is dead, until onlining notifies the
>> hypervisor and memory is allocated). It will also be problematic for
>> paravirtualized memory devices (e.g. XEN's "not backed by the
>> hypervisor" hacks).
> 
> OK, I understand that not all usecases can use self memmap hosting;
> others do not have much choice left though. You have to allocate from
> somewhere. Well, an alternative would be to have no memmap until
> onlining, but I am not sure how much work that would be.
> 
>> This would only be possible for memory DIMMs, memory that is completely
>> accessible as far as I can see. Or at least, some specified "first part"
>> is accessible.
>>
>> Other problems are other metadata like extended struct pages and friends.
> 
> I wouldn't really worry about extended struct pages. Those should be
> used for debugging purposes mostly. Or at least that was the case last
> time I've checked.

Yes, I guess that is true. Being able to add and online memory without
the need for additional (external) memory would be the ultimate goal,
but that is highly complicated. Still, steps in that direction are a
good idea.

> 
>> (I really like the idea of adding memory without allocating memory in
>> the hypervisor in the first place, please keep me tuned).
>>
>> And please note: This solves some problematic part ("adding too much
>> memory to the movable zone or not onlining it"), but not the issue of
>> zone imbalance in the first place. And not one issue I try to tackle
>> here: don't add paravirtualized memory to the movable zone.
> 
> Zone imbalance is an inherent problem of the highmem zone. It is
> essentially the highmem zone we all loved so much back in 32b days.
> Yes the movable zone doesn't have any addressing limitations so it is a
> bit more relaxed but considering the hotplug scenarios I have seen so
> far people just want to have full NUMA nodes movable to allow replacing
> DIMMs. And then we are back to square one and the zone imbalance issue.
> You have those regardless where memmaps are allocated from.

Unfortunately yes. And things get more complicated as you are adding
whole DIMMs but get notifications at the granularity of memory blocks.
Usually you are no longer interested in onlining any memory block of
that DIMM as MOVABLE as soon as you have to online one memory block of
that DIMM as NORMAL - because that single block can already block
unplugging the whole DIMM.
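To make the granularity mismatch above concrete, here is a rough sketch
in plain Python. It assumes the 128 MiB memory block size that is common
on x86-64; the real value is an implementation detail exposed via
/sys/devices/system/memory/block_size_bytes.

```python
# Sketch: map a hotplugged DIMM to the memory blocks it spans.
# Assumes a 128 MiB memory block size (common on x86-64, but not
# guaranteed - read block_size_bytes from sysfs on a real system).

MIB = 1024 * 1024
BLOCK_SIZE = 128 * MIB

def memory_blocks(dimm_base, dimm_size, block_size=BLOCK_SIZE):
    """Return the ids of the memory blocks covered by a DIMM."""
    first = dimm_base // block_size
    last = (dimm_base + dimm_size - 1) // block_size
    return list(range(first, last + 1))

# A 2 GiB DIMM at physical address 4 GiB spans 16 memory blocks; each
# block triggers a separate add notification and is onlined on its own,
# so a single block onlined NORMAL can pin the whole DIMM.
blocks = memory_blocks(4 * 1024 * MIB, 2 * 1024 * MIB)
print(len(blocks))  # 16
```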

> 
>>> I yet have to think about the whole proposal but I am missing the most
>>> important part. _Who_ is going to use the new exported information and
>>> for what purpose. You said that distributions have hard time to
>>> distinguish different types of onlinining policies but isn't this
>>> something that is inherently usecase specific?
>>>
>>
>> Let's think about a distribution. We have a clash of use cases here
>> (just what you describe). What I propose solves one part of it ("handle
>> what you know how to handle right in the kernel").
>>
>> 1. Users of DIMMs usually expect that they can be unplugged again. That
>> is why you want to control how to online memory in user space (== add it
>> to the movable zone).
> 
> Which is only true if you really want to hotremove them. I am not going
> to tell how much I believe in this usecase but movable policy is not
> generally applicable here.

Customers expect this to work, and both of us know that we can't make
any guarantees. At least MOVABLE makes it more likely to work; with
NORMAL it is basically impossible.

> 
>> 2. Users of standby memory (s390) expect that memory will never be
>> onlined automatically. It will be onlined manually.
> 
> yeah
> 
>> 3. Users of paravirtualized devices (esp. Hyper-V) don't care about
>> memory unplug in the sense of MOVABLE at all. They (or Hyper-V!) will
>> add a whole bunch of memory and expect that everything works fine. So
>> that memory is onlined immediately and that memory is added to the
>> NORMAL zone. Users never want the MOVABLE zone.
> 
> Then the immediate question would be why to use memory hotplug for that
> at all? Why don't you simply start with a huge pre-allocated physical
> address space and balloon memory in and out on demand? Why do you want
> to inject new memory during the runtime?

Let's assume you have a guest with 20GB size that should eventually be
allowed to grow to 4TB. You would have to allocate metadata for 4TB
right from the beginning. That's definitely not what we want. That is
why memory hotplug is used by e.g. XEN or Hyper-V. With Hyper-V, the
hypervisor even tells you at which places additional memory has been
made available.
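A back-of-the-envelope calculation shows why pre-allocating metadata for
the maximum size is unattractive. The numbers below are assumptions: 64
bytes per "struct page" and 4 KiB base pages are typical on x86-64, but
both are implementation details that can vary.

```python
# Rough estimate of "struct page" (memmap) overhead for a guest that
# could grow to 4 TiB. STRUCT_PAGE and PAGE_SIZE are assumed values,
# typical for x86-64 but not guaranteed.

KIB, GIB, TIB = 1024, 1024**3, 1024**4
STRUCT_PAGE = 64      # bytes per struct page (assumed)
PAGE_SIZE = 4 * KIB   # base page size (assumed)

def memmap_overhead(ram_bytes):
    """Bytes of memmap needed to describe ram_bytes of memory."""
    return ram_bytes // PAGE_SIZE * STRUCT_PAGE

# Pre-allocating for 4 TiB costs ~64 GiB of metadata - more than three
# times the entire initial 20 GiB of guest RAM.
print(memmap_overhead(4 * TIB) // GIB)  # 64
```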

> 
>> 1. is a reason why distributions usually don't configure
>> "MEMORY_HOTPLUG_DEFAULT_ONLINE", because you really want the option for
>> MOVABLE zone. That however implies, that e.g. for x86, you have to
>> handle all new memory in user space, especially also HyperV memory.
>> There, you then have to check for things like "isHyperV()" to decide
>> "oh, yes, this should definitely not go to the MOVABLE zone".
> 
> Why do you need a generic hotplug rule in the first place? Why don't you
> simply provide different set of rules for different usecases? Let users
> decide which usecase they prefer rather than try to be clever which
> almost always hits weird corner cases.
> 

Memory hotplug has to work as reliably as possible out of the box.
Letting the user make simple decisions like "oh, I am on Hyper-V, I want
to online memory to the normal zone" does not feel right. But yes, we
should definitely allow modifications. So a sane default rule plus
possible modifications is usually a good idea.

I think Dave has a point with using MOVABLE for huge page use cases. And
there might be other corner cases as you correctly state.

I wonder if this patch itself, minus modifying online/offline, might
make sense. We can then implement simple rules in user space:

if (normal) {
	/* customers expect hotplugged DIMMs to be unpluggable */
	online_movable();
} else if (paravirt) {
	/* paravirt memory should by default always go to the NORMAL zone */
	online();
} else {
	/* standby memory will never get onlined automatically */
}

Compare that to having to guess what is to be done (isKVM(), isHyperV(),
isS390() ...) and failing once such a check is no longer unique (e.g.
virtio-mem and ACPI support for x86 KVM).
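Such a rule set could live in user space as, for example, a udev rule.
The sketch below is hypothetical: it assumes a memory block "type"
attribute like the one this RFC proposes, and the attribute values
("normal", "paravirt", "standby") are made-up placeholders.

```
# Hypothetical udev rules, assuming the memory block "type" attribute
# proposed in this RFC (attribute values are illustrative only).

# DIMMs: keep them unpluggable by onlining to the MOVABLE zone.
SUBSYSTEM=="memory", ACTION=="add", ATTR{type}=="normal", \
  ATTR{state}="online_movable"

# Paravirtualized memory: online to the NORMAL zone right away.
SUBSYSTEM=="memory", ACTION=="add", ATTR{type}=="paravirt", \
  ATTR{state}="online"

# Standby memory (s390x): no rule - never onlined automatically.
```

The point is that the rule matches on a kernel-exported property instead
of guessing from the hypervisor or architecture.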

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-10-02 13:47       ` Michal Hocko
  (?)
  (?)
@ 2018-10-02 15:25       ` David Hildenbrand
  -1 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-10-02 15:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Dave Hansen,
	Heiko Carstens, linux-mm, Pavel Tatashin, Paul Mackerras,
	H. Peter Anvin, Rashmica Gupta, K. Y. Srinivasan,
	Boris Ostrovsky, linux-s390, Michael Neuling, Stephen Hemminger,
	Yoshinori Sato, Michael Ellerman, linux-acpi, Ingo Molnar,
	xen-devel

On 02/10/2018 15:47, Michal Hocko wrote:
> On Mon 01-10-18 11:34:25, David Hildenbrand wrote:
>> On 01/10/2018 10:40, Michal Hocko wrote:
>>> On Fri 28-09-18 17:03:57, David Hildenbrand wrote:
>>> [...]
>>>
>>> I haven't read the patch itself but I just wanted to note one thing
>>> about this part
>>>
>>>> For paravirtualized devices it is relevant that memory is onlined as
>>>> quickly as possible after adding - and that it is added to the NORMAL
>>>> zone. Otherwise, it could happen that too much memory in a row is added
>>>> (but not onlined), resulting in out-of-memory conditions due to the
>>>> additional memory for "struct pages" and friends. MOVABLE zone as well
>>>> as delays might be very problematic and lead to crashes (e.g. zone
>>>> imbalance).
>>>
>>> I have proposed (but haven't finished this due to other stuff) a
>>> solution for this. Newly added memory can host memmaps itself and then
>>> you do not have the problem in the first place. For vmemmap it would
>>> have an advantage that you do not really have to beg for 2MB pages to
>>> back the whole section but you would get it for free because the initial
>>> part of the section is by definition properly aligned and unused.
>>
>> So the plan is to "host metadata for new memory on the memory itself".
>> Just want to note that this is basically impossible for s390x with the
>> current mechanisms. (added memory is dead, until onlining notifies the
>> hypervisor and memory is allocated). It will also be problematic for
>> paravirtualized memory devices (e.g. XEN's "not backed by the
>> hypervisor" hacks).
> 
> OK, I understand that not all usecases can use self memmap hosting
> others do not have much choice left though. You have to allocate from
> somewhere. Well and alternative would be to have no memmap until
> onlining but I am not sure how much work that would be.
> 
>> This would only be possible for memory DIMMs, memory that is completely
>> accessible as far as I can see. Or at least, some specified "first part"
>> is accessible.
>>
>> Other problems are other metadata like extended struct pages and friends.
> 
> I wouldn't really worry about extended struct pages. Those should be
> used for debugging purposes mostly. Ot at least that was the case last
> time I've checked.

Yes, I guess that is true. Being able to add and online memory without
the need for additional (external) memory would be the ultimate goal,
but highly complicated. But steps into that direction is a good idea.

> 
>> (I really like the idea of adding memory without allocating memory in
>> the hypervisor in the first place, please keep me tuned).
>>
>> And please note: This solves some problematic part ("adding too much
>> memory to the movable zone or not onlining it"), but not the issue of
>> zone imbalance in the first place. And not one issue I try to tackle
>> here: don't add paravirtualized memory to the movable zone.
> 
> Zone imbalance is an inherent problem of the highmem zone. It is
> essentially the highmem zone we all loved so much back in 32b days.
> Yes the movable zone doesn't have any addressing limitations so it is a
> bit more relaxed but considering the hotplug scenarios I have seen so
> far people just want to have full NUMA nodes movable to allow replacing
> DIMMs. And then we are back to square one and the zone imbalance issue.
> You have those regardless where memmaps are allocated from.

Unfortunately yes. And things get more complicated as you are adding a
whole DIMMs and get notifications in the granularity of memory blocks.
Usually you are not interested in onlining any memory block of that DIMM
as MOVABLE as soon as you would have to online one memory block of that
DIMM as NORMAL - because that can already block the whole DIMM.

> 
>>> I yet have to think about the whole proposal but I am missing the most
>>> important part. _Who_ is going to use the new exported information and
>>> for what purpose. You said that distributions have hard time to
>>> distinguish different types of onlinining policies but isn't this
>>> something that is inherently usecase specific?
>>>
>>
>> Let's think about a distribution. We have a clash of use cases here
>> (just what you describe). What I propose solves one part of it ("handle
>> what you know how to handle right in the kernel").
>>
>> 1. Users of DIMMs usually expect that they can be unplugged again. That
>> is why you want to control how to online memory in user space (== add it
>> to the movable zone).
> 
> Which is only true if you really want to hotremove them. I am not going
> to tell how much I believe in this usecase but movable policy is not
> generally applicable here.

Customers expect this to work and the both of us know that we can't make
any guarantees. At least MOVABLE makes it more likely to work. NORMAL is
basically impossible.

> 
>> 2. Users of standby memory (s390) expect that memory will never be
>> onlined automatically. It will be onlined manually.
> 
> yeah
> 
>> 3. Users of paravirtualized devices (esp. Hyper-V) don't care about
>> memory unplug in the sense of MOVABLE at all. They (or Hyper-V!) will
>> add a whole bunch of memory and expect that everything works fine. So
>> that memory is onlined immediately and that memory is added to the
>> NORMAL zone. Users never want the MOVABLE zone.
> 
> Then the immediate question would be why to use memory hotplug for that
> at all? Why don't you simply start with a huge pre-allocated physical
> address space and balloon memory in an out per demand. Why do you want
> to inject new memory during the runtime?

Let's assume you have a guest with 20GB size and eventually want to
allow to grow it to 4TB. You would have to allocate metadata for 4TB
right from the beginning. That's definitely now what we want. That is
why memory hotplug is used by e.g. XEN or Hyper-V. With Hyper-V, the
hypervisor even tells you at which places additional memory has been
made available.

> 
>> 1. is a reason why distributions usually don't configure
>> "MEMORY_HOTPLUG_DEFAULT_ONLINE", because you really want the option for
>> MOVABLE zone. That however implies, that e.g. for x86, you have to
>> handle all new memory in user space, especially also HyperV memory.
>> There, you then have to check for things like "isHyperV()" to decide
>> "oh, yes, this should definitely not go to the MOVABLE zone".
> 
> Why do you need a generic hotplug rule in the first place? Why don't you
> simply provide different set of rules for different usecases? Let users
> decide which usecase they prefer rather than try to be clever which
> almost always hits weird corner cases.
> 

Memory hotplug has to work as reliable as we can out of the box. Letting
the user make simple decisions like "oh, I am on hyper-V, I want to
online memory to the normal zone" does not feel right. But yes, we
should definitely allow to make modifications. So some sane default rule
+ possible modification is usually a good idea.

I think Dave has a point with using MOVABLE for huge page use cases. And
there might be other corner cases as you correctly state.

I wonder if this patch itself minus modifying online/offline might make
sense. We can then implement simple rules in user space

if (normal) {
	/* customers expect hotplugged DIMMs to be unpluggable */
	online_movable();
} else if (paravirt) {
	/* paravirt memory should as default always go to the NORMAL */
	online();
} else {
	/* standby memory will never get onlined automatically */
}

Compared to having to guess what is to be done (isKVM(), isHyperV(),
isS390() ...) and failing once this is no longer unique (e.g. virtio-mem
and ACPI support for x86 KVM).
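The rule above could be wired up from user space against the existing
memory sysfs interface. This is only a sketch: the per-block `type`
attribute is hypothetical (it is what this RFC would have to provide),
while the `state` values `online_movable` and `online` already exist
under /sys/devices/system/memory/memoryX/:

```python
import os
from typing import Optional

# Sketch: apply the rule above from user space (e.g. from a udev RUN
# helper). The per-block "type" attribute is hypothetical; the "state"
# values written are the existing sysfs interface.
SYSFS = "/sys/devices/system/memory"

def online_mode(block_type: str) -> Optional[str]:
    """Map a (hypothetical) memory block type to the state to write."""
    if block_type == "normal":
        return "online_movable"  # hotplugged DIMMs stay unpluggable
    if block_type == "paravirt":
        return "online"          # paravirt memory defaults to NORMAL
    return None                  # standby memory is never auto-onlined

def handle_block(block: str) -> None:
    """Online one memory block (e.g. "memory42") according to its type."""
    with open(os.path.join(SYSFS, block, "type")) as f:  # hypothetical
        mode = online_mode(f.read().strip())
    if mode is not None:
        with open(os.path.join(SYSFS, block, "state"), "w") as f:
            f.write(mode)
```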

-- 

Thanks,

David / dhildenb

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-10-02 15:25         ` David Hildenbrand
  (?)
@ 2018-10-03 13:38           ` Vitaly Kuznetsov
  -1 siblings, 0 replies; 144+ messages in thread
From: Vitaly Kuznetsov @ 2018-10-03 13:38 UTC (permalink / raw)
  To: David Hildenbrand, Michal Hocko
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Dave Hansen,
	Heiko Carstens, linux-mm, Paul Mackerras, H. Peter Anvin,
	Stephen Rothwell, Rashmica Gupta, Dan Williams, linux-s390,
	Michael Neuling, Stephen Hemminger, Yoshinori Sato,
	Michael Ellerman, linux-acpi, Ingo Molnar, xen-devel, Len Brown,
	Pavel

David Hildenbrand <david@redhat.com> writes:

> On 02/10/2018 15:47, Michal Hocko wrote:
...
>> 
>> Why do you need a generic hotplug rule in the first place? Why don't you
>> simply provide different set of rules for different usecases? Let users
>> decide which usecase they prefer rather than try to be clever which
>> almost always hits weird corner cases.
>> 
>
> Memory hotplug has to work as reliably as we can make it out of the
> box. Letting the user make simple decisions like "oh, I am on hyper-V,
> I want to online memory to the normal zone" does not feel right. But
> yes, we should definitely allow modifications.

Last time I was thinking about the imperfectness of the auto-online
solution we have and any other solution we're able to suggest, an idea
came to my mind - what if we add an eBPF attach point to the
auto-onlining mechanism, effectively offloading decision-making to
userspace? We'll of course need to provide all required data (e.g. how
memory blocks are aligned with physical DIMMs, as it makes no sense to
online part of a DIMM as normal and the rest as movable, since it's
going to be impossible to unplug such a DIMM anyway).
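The proposed attach point essentially reduces to a pure per-block
decision function supplied by the distro. As a rough model (plain
Python rather than eBPF, with all field names invented for
illustration):

```python
from dataclasses import dataclass

# Toy model of the proposed hook: the kernel would pass the attached
# program a description of the new memory block and act on its verdict.
# All field names here are invented for illustration.
@dataclass
class MemBlock:
    paravirt: bool           # added by a paravirt device (Hyper-V, virtio)
    covers_whole_dimm: bool  # block boundaries align with a physical DIMM

def online_policy(blk: MemBlock) -> str:
    if blk.paravirt:
        # Paravirt memory may be needed for kernel allocations.
        return "online"
    if blk.covers_whole_dimm:
        # A DIMM can only stay unpluggable if all of it is MOVABLE.
        return "online_movable"
    # Onlining part of a DIMM movable and the rest normal makes the
    # DIMM impossible to unplug anyway, so don't bother with MOVABLE.
    return "online"

print(online_policy(MemBlock(paravirt=False, covers_whole_dimm=True)))
# → online_movable
```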

-- 
Vitaly

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-10-03 13:38           ` Vitaly Kuznetsov
  (?)
@ 2018-10-03 13:44             ` Michal Hocko
  -1 siblings, 0 replies; 144+ messages in thread
From: Michal Hocko @ 2018-10-03 13:44 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Dave Hansen,
	Heiko Carstens, linux-mm, Paul Mackerras, H. Peter Anvin,
	Stephen Rothwell, Rashmica Gupta, Dan Williams, linux-s390,
	Michael Neuling, Stephen Hemminger, Yoshinori Sato,
	Michael Ellerman, David Hildenbrand, linux-acpi, Ingo Molnar,
	xen-devel, Len

On Wed 03-10-18 15:38:04, Vitaly Kuznetsov wrote:
> David Hildenbrand <david@redhat.com> writes:
> 
> > On 02/10/2018 15:47, Michal Hocko wrote:
> ...
> >> 
> >> Why do you need a generic hotplug rule in the first place? Why don't you
> >> simply provide different set of rules for different usecases? Let users
> >> decide which usecase they prefer rather than try to be clever which
> >> almost always hits weird corner cases.
> >> 
> >
> > Memory hotplug has to work as reliably as we can make it out of the
> > box. Letting the user make simple decisions like "oh, I am on hyper-V,
> > I want to online memory to the normal zone" does not feel right. But
> > yes, we should definitely allow modifications.
> 
> Last time I was thinking about the imperfectness of the auto-online
> solution we have and any other solution we're able to suggest, an idea
> came to my mind - what if we add an eBPF attach point to the
> auto-onlining mechanism, effectively offloading decision-making to
> userspace? We'll of course need to provide all required data (e.g. how
> memory blocks are aligned with physical DIMMs, as it makes no sense to
> online part of a DIMM as normal and the rest as movable, since it's
> going to be impossible to unplug such a DIMM anyway).

And how does that differ from the notification mechanism we have? Just
by not relying on process scheduling? If so, then this revolves around
the implementation detail that you care about time-to-hot-add vs.
time-to-online. And that is a solvable problem - just allocate memmaps
from the hot-added memory.

As David said, some of the memory cannot be onlined without further
steps (e.g. when it is standby memory, as David called it) and then I
fail to see how eBPF helps in any way.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-10-03 13:44             ` Michal Hocko
  (?)
@ 2018-10-03 13:52               ` Vitaly Kuznetsov
  -1 siblings, 0 replies; 144+ messages in thread
From: Vitaly Kuznetsov @ 2018-10-03 13:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Dave Hansen,
	Heiko Carstens, linux-mm, Paul Mackerras, H. Peter Anvin,
	Stephen Rothwell, Rashmica Gupta, Dan Williams, linux-s390,
	Michael Neuling, Stephen Hemminger, Yoshinori Sato,
	Michael Ellerman, David Hildenbrand, linux-acpi, Ingo Molnar,
	xen-devel, Len

Michal Hocko <mhocko@kernel.org> writes:

> On Wed 03-10-18 15:38:04, Vitaly Kuznetsov wrote:
>> David Hildenbrand <david@redhat.com> writes:
>> 
>> > On 02/10/2018 15:47, Michal Hocko wrote:
>> ...
>> >> 
>> >> Why do you need a generic hotplug rule in the first place? Why don't you
>> >> simply provide different set of rules for different usecases? Let users
>> >> decide which usecase they prefer rather than try to be clever which
>> >> almost always hits weird corner cases.
>> >> 
>> >
>> > Memory hotplug has to work as reliably as we can make it out of the
>> > box. Letting the user make simple decisions like "oh, I am on hyper-V,
>> > I want to online memory to the normal zone" does not feel right. But
>> > yes, we should definitely allow modifications.
>> 
>> Last time I was thinking about the imperfectness of the auto-online
>> solution we have and any other solution we're able to suggest, an idea
>> came to my mind - what if we add an eBPF attach point to the
>> auto-onlining mechanism, effectively offloading decision-making to
>> userspace? We'll of course need to provide all required data (e.g. how
>> memory blocks are aligned with physical DIMMs, as it makes no sense to
>> online part of a DIMM as normal and the rest as movable, since it's
>> going to be impossible to unplug such a DIMM anyway).
>
> And how does that differ from the notification mechanism we have? Just
> by not relying on process scheduling? If so, then this revolves around
> the implementation detail that you care about time-to-hot-add vs.
> time-to-online. And that is a solvable problem - just allocate memmaps
> from the hot-added memory.

It is more than just memmaps (e.g. forking the udev process that does
memory onlining also needs memory), but yes, the main idea is to make
the onlining synchronous with hotplug.

>
> As David said, some of the memory cannot be onlined without further
> steps (e.g. when it is standby memory, as David called it) and then I
> fail to see how eBPF helps in any way.

And also, we can fight till the end of days here trying to come up with
an onlining solution that works for everyone; eBPF would move this
decision to the distro level.

-- 
Vitaly

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
@ 2018-10-03 13:52               ` Vitaly Kuznetsov
  0 siblings, 0 replies; 144+ messages in thread
From: Vitaly Kuznetsov @ 2018-10-03 13:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, Kate Stewart, Rich Felker, linux-ia64,
	linux-sh, Peter Zijlstra, Benjamin Herrenschmidt, Balbir Singh,
	Dave Hansen, Heiko Carstens, linux-mm, Pavel Tatashin,
	Paul Mackerras, H. Peter Anvin, Rashmica Gupta, Boris Ostrovsky,
	linux-s390, Michael Neuling, Stephen Hemminger, Yoshinori Sato,
	Michael Ellerman, linux-acpi, Ingo Molnar, xen-devel,
	Rob Herring, Len Brown, Fenghua Yu, Stephen Rothwell,
	mike.travis, Haiyang Zhang, Dan Williams,
	Jonathan Neuschäfer, Nicholas Piggin, Joe Perches,
	Jérôme Glisse, Mike Rapoport, Borislav Petkov,
	Andy Lutomirski, Thomas Gleixner, Joonsoo Kim, Oscar Salvador,
	Juergen Gross, Tony Luck, Mathieu Malaterre, Greg Kroah-Hartman,
	Rafael J. Wysocki, linux-kernel, Mauricio Faria de Oliveira,
	Philippe Ombredanne, Martin Schwidefsky, devel, Andrew Morton,
	linuxppc-dev, Kirill A. Shutemov

Michal Hocko <mhocko@kernel.org> writes:

> On Wed 03-10-18 15:38:04, Vitaly Kuznetsov wrote:
>> David Hildenbrand <david@redhat.com> writes:
>> 
>> > On 02/10/2018 15:47, Michal Hocko wrote:
>> ...
>> >> 
>> >> Why do you need a generic hotplug rule in the first place? Why don't you
>> >> simply provide different set of rules for different usecases? Let users
>> >> decide which usecase they prefer rather than try to be clever which
>> >> almost always hits weird corner cases.
>> >> 
>> >
>> > Memory hotplug has to work as reliable as we can out of the box. Letting
>> > the user make simple decisions like "oh, I am on hyper-V, I want to
>> > online memory to the normal zone" does not feel right. But yes, we
>> > should definitely allow to make modifications.
>> 
>> Last time I was thinking about the imperfectness of the auto-online
>> solution we have and any other solution we're able to suggest an idea
>> came to my mind - what if we add an eBPF attach point to the
>> auto-onlining mechanism effecively offloading decision-making to
>> userspace. We'll of couse need to provide all required data (e.g. how
>> memory blocks are aligned with physical DIMMs as it makes no sense to
>> online part of DIMM as normal and the rest as movable as it's going to
>> be impossible to unplug such DIMM anyways).
>
> And how does that differ from the notification mechanism we have? Just
> by not relying on the process scheduling? If yes then this revolves
> around the implementation detail that you care about time-to-hot-add
> vs. time-to-online. And that is a solveable problem - just allocate
> memmaps from the hot-added memory.

It is more than just memmaps (e.g. forking udev process doing memory
onlining also needs memory) but yes, the main idea is to make the
onlining synchronous with hotplug.

>
> As David said some of the memory cannot be onlined without further steps
> (e.g. when it is standby as David called it) and then I fail to see how
> eBPF help in any way.

and also, we can fight till the end of days here trying to come up with
an onlining solution which would work for everyone and eBPF would move
this decision to distro level.

-- 
Vitaly

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
@ 2018-10-03 13:52               ` Vitaly Kuznetsov
  0 siblings, 0 replies; 144+ messages in thread
From: Vitaly Kuznetsov @ 2018-10-03 13:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Dave Hansen, Heiko Carstens, linux-mm, Paul Mackerras,
	H. Peter Anvin, Stephen Rothwell, Rashmica Gupta, Dan Williams,
	linux-s390, Michael Neuling, Stephen Hemminger, Yoshinori Sato,
	David Hildenbrand, linux-acpi, Ingo Molnar, xen-devel, Len Brown,
	Pavel Tatashin, Rob Herring, mike.travis, Haiyang Zhang,
	Jonathan Neuschäfer, Nicholas Piggin, Martin Schwidefsky,
	Jérôme Glisse, Mike Rapoport, Borislav Petkov,
	Andy Lutomirski, Boris Ostrovsky, Andrew Morton, Oscar Salvador,
	Juergen Gross, Tony Luck, Mathieu Malaterre, Greg Kroah-Hartman,
	Rafael J. Wysocki, linux-kernel, Fenghua Yu,
	Mauricio Faria de Oliveira, Thomas Gleixner, Philippe Ombredanne,
	Joe Perches, devel, Joonsoo Kim, linuxppc-dev,
	Kirill A. Shutemov

Michal Hocko <mhocko@kernel.org> writes:

> On Wed 03-10-18 15:38:04, Vitaly Kuznetsov wrote:
>> David Hildenbrand <david@redhat.com> writes:
>> 
>> > On 02/10/2018 15:47, Michal Hocko wrote:
>> ...
>> >> 
>> >> Why do you need a generic hotplug rule in the first place? Why don't you
>> >> simply provide different set of rules for different usecases? Let users
>> >> decide which usecase they prefer rather than try to be clever which
>> >> almost always hits weird corner cases.
>> >> 
>> >
>> > Memory hotplug has to work as reliable as we can out of the box. Letting
>> > the user make simple decisions like "oh, I am on hyper-V, I want to
>> > online memory to the normal zone" does not feel right. But yes, we
>> > should definitely allow to make modifications.
>> 
>> Last time I was thinking about the imperfectness of the auto-online
>> solution we have and any other solution we're able to suggest an idea
>> came to my mind - what if we add an eBPF attach point to the
>> auto-onlining mechanism effecively offloading decision-making to
>> userspace. We'll of couse need to provide all required data (e.g. how
>> memory blocks are aligned with physical DIMMs as it makes no sense to
>> online part of DIMM as normal and the rest as movable as it's going to
>> be impossible to unplug such DIMM anyways).
>
> And how does that differ from the notification mechanism we have? Just
> by not relying on the process scheduling? If yes then this revolves
> around the implementation detail that you care about time-to-hot-add
> vs. time-to-online. And that is a solveable problem - just allocate
> memmaps from the hot-added memory.

It is more than just memmaps (e.g. forking udev process doing memory
onlining also needs memory) but yes, the main idea is to make the
onlining synchronous with hotplug.

>
> As David said some of the memory cannot be onlined without further steps
> (e.g. when it is standby as David called it) and then I fail to see how
> eBPF help in any way.

and also, we can fight till the end of days here trying to come up with
an onlining solution which would work for everyone and eBPF would move
this decision to distro level.

-- 
Vitaly

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-10-03 13:44             ` Michal Hocko
                               ` (2 preceding siblings ...)
  (?)
@ 2018-10-03 13:52             ` Vitaly Kuznetsov
  -1 siblings, 0 replies; 144+ messages in thread
From: Vitaly Kuznetsov @ 2018-10-03 13:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Dave Hansen,
	Heiko Carstens, linux-mm, Paul Mackerras, H. Peter Anvin,
	Stephen Rothwell, Rashmica Gupta, Dan Williams, linux-s390,
	Michael Neuling, Stephen Hemminger, Yoshinori Sato,
	Michael Ellerman, David Hildenbrand, linux-acpi, Ingo Molnar,
	xen-devel, Len

Michal Hocko <mhocko@kernel.org> writes:

> On Wed 03-10-18 15:38:04, Vitaly Kuznetsov wrote:
>> David Hildenbrand <david@redhat.com> writes:
>> 
>> > On 02/10/2018 15:47, Michal Hocko wrote:
>> ...
>> >> 
>> >> Why do you need a generic hotplug rule in the first place? Why don't you
>> >> simply provide different set of rules for different usecases? Let users
>> >> decide which usecase they prefer rather than try to be clever which
>> >> almost always hits weird corner cases.
>> >> 
>> >
>> > Memory hotplug has to work as reliable as we can out of the box. Letting
>> > the user make simple decisions like "oh, I am on hyper-V, I want to
>> > online memory to the normal zone" does not feel right. But yes, we
>> > should definitely allow to make modifications.
>> 
>> Last time I was thinking about the imperfectness of the auto-online
>> solution we have and any other solution we're able to suggest an idea
>> came to my mind - what if we add an eBPF attach point to the
>> auto-onlining mechanism effecively offloading decision-making to
>> userspace. We'll of couse need to provide all required data (e.g. how
>> memory blocks are aligned with physical DIMMs as it makes no sense to
>> online part of DIMM as normal and the rest as movable as it's going to
>> be impossible to unplug such DIMM anyways).
>
> And how does that differ from the notification mechanism we have? Just
> by not relying on the process scheduling? If yes then this revolves
> around the implementation detail that you care about time-to-hot-add
> vs. time-to-online. And that is a solveable problem - just allocate
> memmaps from the hot-added memory.

It is more than just memmaps (e.g. forking udev process doing memory
onlining also needs memory) but yes, the main idea is to make the
onlining synchronous with hotplug.

>
> As David said, some of the memory cannot be onlined without further
> steps (e.g. when it is standby, as David called it) and then I fail to
> see how eBPF helps in any way.

and also, we can fight till the end of days here trying to come up with
an onlining solution that works for everyone, while eBPF would move
this decision to the distro level.

-- 
Vitaly

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-10-02 15:25         ` David Hildenbrand
  (?)
@ 2018-10-03 13:54           ` Michal Hocko
  -1 siblings, 0 replies; 144+ messages in thread
From: Michal Hocko @ 2018-10-03 13:54 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Dave Hansen,
	Heiko Carstens, linux-mm, Pavel Tatashin, Paul Mackerras,
	H. Peter Anvin, Rashmica Gupta, Boris Ostrovsky, linux-s390,
	Michael Neuling, Stephen Hemminger, Yoshinori Sato,
	Michael Ellerman, linux-acpi, Ingo Molnar, xen-devel,
	Rob Herring

On Tue 02-10-18 17:25:19, David Hildenbrand wrote:
> On 02/10/2018 15:47, Michal Hocko wrote:
[...]
> > Zone imbalance is an inherent problem of the highmem zone. It is
> > essentially the highmem zone we all loved so much back in 32b days.
> > Yes the movable zone doesn't have any addressing limitations so it is a
> > bit more relaxed but considering the hotplug scenarios I have seen so
> > far people just want to have full NUMA nodes movable to allow replacing
> > DIMMs. And then we are back to square one and the zone imbalance issue.
> > You have those regardless where memmaps are allocated from.
> 
> Unfortunately yes. And things get more complicated as you are adding
> whole DIMMs and get notifications at the granularity of memory blocks.
> Usually you are no longer interested in onlining any memory block of
> that DIMM as MOVABLE as soon as you would have to online one memory
> block of that DIMM as NORMAL - because that one block can already pin
> the whole DIMM.

For the purpose of hotremove, yes. But as Dave has noted people are
(ab)using zone movable for other purposes - e.g. large pages.
 
[...]
> > Then the immediate question would be why to use memory hotplug for that
> > at all? Why don't you simply start with a huge pre-allocated physical
> > address space and balloon memory in and out on demand? Why do you want
> > to inject new memory at runtime?
> 
> Let's assume you have a guest with 20GB of memory that you eventually
> want to allow to grow to 4TB. You would have to allocate metadata for
> 4TB right from the beginning. That's definitely not what we want. That
> is why memory hotplug is used by e.g. Xen or Hyper-V. With Hyper-V, the
> hypervisor even tells you at which places additional memory has been
> made available.

Then you have to live with the fact that your hot added memory will be
self hosted and find a way for ballooning to work with that. The price
would be that some part of the memory is not really balloonable in the
end.
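As a rough sanity check of the metadata cost in the quoted example above (assuming the common 64-byte struct page per 4 KiB base page on x86-64; both numbers are typical, not guaranteed):

```shell
#!/bin/sh
# Back-of-the-envelope memmap cost: one struct page (~64 bytes on x86-64)
# per 4 KiB base page, i.e. metadata is roughly 1/64 of the managed memory.
memmap_overhead_bytes() {
    # $1: memory size in bytes; prints the memmap size in bytes
    echo $(( $1 / 4096 * 64 ))
}

four_tib=$(( 4 * 1024 * 1024 * 1024 * 1024 ))
echo "$(( $(memmap_overhead_bytes "$four_tib") / 1024 / 1024 / 1024 )) GiB"
# -> 64 GiB of metadata to cover the 4TB maximum, even for a 20GB guest
```

That 1/64 ratio is why pre-allocating the metadata for the maximum size up front is unattractive, and why self-hosting the memmap in the hot-added memory itself comes at a real (if bounded) cost.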

> >> 1. is a reason why distributions usually don't configure
> >> "MEMORY_HOTPLUG_DEFAULT_ONLINE", because you really want the option for
> >> MOVABLE zone. That however implies, that e.g. for x86, you have to
> >> handle all new memory in user space, especially also HyperV memory.
> >> There, you then have to check for things like "isHyperV()" to decide
> >> "oh, yes, this should definitely not go to the MOVABLE zone".
> > 
> > Why do you need a generic hotplug rule in the first place? Why don't you
> > simply provide different set of rules for different usecases? Let users
> > decide which usecase they prefer rather than try to be clever which
> > almost always hits weird corner cases.
> > 
> 
> Memory hotplug has to work as reliably as possible out of the box.
> Letting the user make simple decisions like "oh, I am on Hyper-V, I
> want to online memory to the normal zone" does not feel right.

Users usually know what their usecase is, and then it is just a matter
of plumbing (e.g. a distribution can provide proper tools to deploy
those usecases) to choose the right (and, to the user, obscure) way to
make it work.

> But yes, we should definitely allow modifications. So some sane
> default rule + possible modification is usually a good idea.
> 
> I think Dave has a point with using MOVABLE for huge page use cases. And
> there might be other corner cases as you correctly state.
> 
> I wonder if this patch itself minus modifying online/offline might make
> sense. We can then implement simple rules in user space
> 
> if (normal) {
> 	/* customers expect hotplugged DIMMs to be unpluggable */
> 	online_movable();
> } else if (paravirt) {
> 	/* paravirt memory should by default always go to NORMAL */
> 	online();
> } else {
> 	/* standby memory will never get onlined automatically */
> }
> 
> Compared to having to guess what is to be done (isKVM(), isHyperV(),
> isS390() ...) and failing once this is no longer unique (e.g. virtio-mem
> and ACPI support for x86 KVM).
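The rules quoted above could be sketched in userspace roughly as follows. The type values (normal/paravirt/standby) correspond to the per-block type this RFC proposes to export and are hypothetical; "online" and "online_movable" are the existing values accepted by /sys/devices/system/memory/memoryN/state.

```shell
#!/bin/sh
# Userspace onlining policy keyed on a per-memory-block type, as sketched
# in the quoted mail. The type names are the hypothetical attribute from
# this RFC; the returned strings are existing sysfs "state" values.
decide_online_mode() {
    # $1: memory block type; prints the state to write, or "none"
    case "$1" in
        normal)   echo online_movable ;;  # hotplugged DIMMs stay unpluggable
        paravirt) echo online ;;          # paravirt memory goes to NORMAL
        *)        echo none ;;            # standby: leave it to the admin
    esac
}
```

A udev rule would then write the returned value to the block's state file, doing nothing for "none".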

I am worried that exporting a type will just push us even further into
the corner. The current design is really simple and two-stage, and that
is good because it allows for very different usecases. The more specific
the API becomes, the more likely we are to hit "I haven't even dreamed
somebody would be using hotplug for this thing". And I would bet this
will happen sooner or later.

Just look at how the whole auto onlining screwed the API to work around
an implementation detail. It has created a one-purpose behavior that
doesn't suit many usecases. Yet we have to live with it because somebody
really relies on it. Let's not repeat the same errors.
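For context, the auto-onlining behavior being criticized here is a single global sysfs knob applied to all hot-added memory regardless of origin. A small wrapper illustrates how coarse it is (the sysfs path is the real interface; the helper functions, and the SYSFS_ROOT parameterization used so they can be exercised against a fake tree, are illustrative):

```shell
#!/bin/sh
# The existing global auto-online policy: one setting for all hot-added
# memory. On a real system SYSFS_ROOT is simply /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

get_auto_online() {
    cat "$SYSFS_ROOT/devices/system/memory/auto_online_blocks"
}

set_auto_online() {
    # $1: "offline" or "online" (newer kernels may accept more values)
    echo "$1" > "$SYSFS_ROOT/devices/system/memory/auto_online_blocks"
}
```

Because this is one knob for everything, it cannot distinguish a DIMM that should become MOVABLE from paravirt memory that must stay in NORMAL - which is exactly the one-purpose behavior lamented above.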
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-10-03 13:52               ` Vitaly Kuznetsov
  (?)
@ 2018-10-03 14:07                 ` Dave Hansen
  -1 siblings, 0 replies; 144+ messages in thread
From: Dave Hansen @ 2018-10-03 14:07 UTC (permalink / raw)
  To: Vitaly Kuznetsov, Michal Hocko
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Heiko Carstens, linux-mm,
	Paul Mackerras, H. Peter Anvin, Stephen Rothwell, Rashmica Gupta,
	Dan Williams, linux-s390, Michael Neuling, Stephen Hemminger,
	Yoshinori Sato, Michael Ellerman, David Hildenbrand, linux-acpi,
	Ingo Molnar, xen-devel, Len Brown

On 10/03/2018 06:52 AM, Vitaly Kuznetsov wrote:
> It is more than just memmaps (e.g. forking udev process doing memory
> onlining also needs memory) but yes, the main idea is to make the
> onlining synchronous with hotplug.

That's a good theoretical concern.

But, is it a problem we need to solve in practice?

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-10-03 13:52               ` Vitaly Kuznetsov
  (?)
@ 2018-10-03 14:24                 ` Michal Hocko
  -1 siblings, 0 replies; 144+ messages in thread
From: Michal Hocko @ 2018-10-03 14:24 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Dave Hansen,
	Heiko Carstens, linux-mm, Paul Mackerras, H. Peter Anvin,
	Stephen Rothwell, Rashmica Gupta, Dan Williams, linux-s390,
	Michael Neuling, Stephen Hemminger, Yoshinori Sato,
	Michael Ellerman, David Hildenbrand, linux-acpi, Ingo Molnar,
	xen-devel, Len

On Wed 03-10-18 15:52:24, Vitaly Kuznetsov wrote:
[...]
> > As David said, some of the memory cannot be onlined without further
> > steps (e.g. when it is standby, as David called it) and then I fail
> > to see how eBPF helps in any way.
> 
> and also, we can fight till the end of days here trying to come up with
> an onlining solution that works for everyone, while eBPF would move
> this decision to the distro level.

The point is that there is _no_ general onlining solution. This is
basically policy, which belongs in userspace.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
@ 2018-10-03 14:24                 ` Michal Hocko
  0 siblings, 0 replies; 144+ messages in thread
From: Michal Hocko @ 2018-10-03 14:24 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: David Hildenbrand, Kate Stewart, Rich Felker, linux-ia64,
	linux-sh, Peter Zijlstra, Benjamin Herrenschmidt, Balbir Singh,
	Dave Hansen, Heiko Carstens, linux-mm, Pavel Tatashin,
	Paul Mackerras, H. Peter Anvin, Rashmica Gupta, Boris Ostrovsky,
	linux-s390, Michael Neuling, Stephen Hemminger, Yoshinori Sato,
	Michael Ellerman, linux-acpi, Ingo Molnar, xen-devel,
	Rob Herring, Len Brown, Fenghua Yu, Stephen Rothwell,
	mike.travis, Haiyang Zhang, Dan Williams,
	Jonathan Neuschäfer, Nicholas Piggin, Joe Perches,
	Jérôme Glisse, Mike Rapoport, Borislav Petkov,
	Andy Lutomirski, Thomas Gleixner, Joonsoo Kim, Oscar Salvador,
	Juergen Gross, Tony Luck, Mathieu Malaterre, Greg Kroah-Hartman,
	Rafael J. Wysocki, linux-kernel, Mauricio Faria de Oliveira,
	Philippe Ombredanne, Martin Schwidefsky, devel, Andrew Morton,
	linuxppc-dev, Kirill A. Shutemov

On Wed 03-10-18 15:52:24, Vitaly Kuznetsov wrote:
[...]
> > As David said some of the memory cannot be onlined without further steps
> > (e.g. when it is standby as David called it) and then I fail to see how
> > eBPF help in any way.
> 
> and also, we can fight till the end of days here trying to come up with
> an onlining solution which would work for everyone and eBPF would move
> this decision to distro level.

The point is that there is _no_ general onlining solution. This is
basically policy which belongs to the userspace.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-10-03 14:07                 ` Dave Hansen
  (?)
@ 2018-10-03 14:34                   ` Vitaly Kuznetsov
  -1 siblings, 0 replies; 144+ messages in thread
From: Vitaly Kuznetsov @ 2018-10-03 14:34 UTC (permalink / raw)
  To: Dave Hansen, Michal Hocko
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Heiko Carstens, linux-mm,
	Paul Mackerras, H. Peter Anvin, Stephen Rothwell, Rashmica Gupta,
	Dan Williams, linux-s390, Michael Neuling, Stephen Hemminger,
	Yoshinori Sato, Michael Ellerman, David Hildenbrand, linux-acpi,
	Ingo Molnar, xen-devel, Len Brown

Dave Hansen <dave.hansen@linux.intel.com> writes:

> On 10/03/2018 06:52 AM, Vitaly Kuznetsov wrote:
>> It is more than just memmaps (e.g. forking udev process doing memory
>> onlining also needs memory) but yes, the main idea is to make the
>> onlining synchronous with hotplug.
>
> That's a good theoretical concern.
>
> But, is it a problem we need to solve in practice?

Yes, unfortunately. It was previously discovered that when we try to
hotplug tons of memory into a low-memory system (a common scenario with
VMs) we end up with an OOM, because for every new memory block we need
to allocate page tables, struct pages, ... and we need memory to do
that. The userspace program doing the memory onlining also needs memory
to run, and in case it prefers to fork to handle hundreds of
notifications ... well, it may get OOM-killed before it manages to
online anything.

Allocating all kernel objects from the newly hotplugged blocks would
definitely help to manage the situation, but as I said this won't solve
the 'forking udev' problem completely (it will likely remain in
'extreme' cases only; we can probably work around it by onlining with a
dedicated process which doesn't allocate memory).
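Memory onlining happens through the sysfs memory-block interface, and a
single long-lived process can drive it without forking per event. A
minimal sketch (the base-directory parameter exists only so the function
can be exercised against a fake tree; on a real system it would be
/sys/devices/system/memory, written as root):

```shell
# Sketch: walk the memory-block directories once and online everything
# that is still offline, from one already-resident process (no fork per
# uevent, so no extra memory needed per hot-added block).
online_offline_blocks() {
    base="$1"    # normally /sys/devices/system/memory
    for state in "$base"/memory*/state; do
        [ -f "$state" ] || continue              # glob may match nothing
        if [ "$(cat "$state")" = "offline" ]; then
            # "online" lets the kernel pick a zone; "online_movable" or
            # "online_kernel" force the MOVABLE/NORMAL zone instead.
            echo online > "$state"
        fi
    done
}
```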

-- 
Vitaly

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-10-03 13:54           ` Michal Hocko
  (?)
@ 2018-10-03 17:00             ` David Hildenbrand
  -1 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-10-03 17:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Dave Hansen,
	Heiko Carstens, linux-mm, Pavel Tatashin, Paul Mackerras,
	H. Peter Anvin, Rashmica Gupta, Boris Ostrovsky, linux-s390,
	Michael Neuling, Stephen Hemminger, Yoshinori Sato,
	Michael Ellerman, linux-acpi, Ingo Molnar, xen-devel,
	Rob Herring

On 03/10/2018 15:54, Michal Hocko wrote:
> On Tue 02-10-18 17:25:19, David Hildenbrand wrote:
>> On 02/10/2018 15:47, Michal Hocko wrote:
> [...]
>>> Zone imbalance is an inherent problem of the highmem zone. It is
>>> essentially the highmem zone we all loved so much back in 32b days.
>>> Yes the movable zone doesn't have any addressing limitations so it is a
>>> bit more relaxed but considering the hotplug scenarios I have seen so
>>> far people just want to have full NUMA nodes movable to allow replacing
>>> DIMMs. And then we are back to square one and the zone imbalance issue.
>>> You have those regardless where memmaps are allocated from.
>>
>> Unfortunately yes. And things get more complicated as you are adding a
>> whole DIMMs and get notifications in the granularity of memory blocks.
>> Usually you are not interested in onlining any memory block of that DIMM
>> as MOVABLE as soon as you would have to online one memory block of that
>> DIMM as NORMAL - because that can already block the whole DIMM.
> 
> For the purpose of the hotremove, yes. But as Dave has noted people are
> (ab)using zone movable for other purposes - e.g. large pages.

That might be right for some very special use cases. For most users
this is not the case (meaning it should be the default, but if the user
wants to change it, they should be allowed to).

>  
> [...]
>>> Then the immediate question would be why to use memory hotplug for that
>>> at all? Why don't you simply start with a huge pre-allocated physical
>>> address space and balloon memory in and out on demand. Why do you want
>>> to inject new memory during the runtime?
>>
>> Let's assume you have a guest with 20GB size and eventually want to
>> allow to grow it to 4TB. You would have to allocate metadata for 4TB
>> right from the beginning. That's definitely not what we want. That is
>> why memory hotplug is used by e.g. XEN or Hyper-V. With Hyper-V, the
>> hypervisor even tells you at which places additional memory has been
>> made available.
> 
> Then you have to live with the fact that your hot added memory will be
> self hosted and find a way for ballooning to work with that. The price
> would be that some part of the memory is not really balloonable in the
> end.
> 
>>>> 1. is a reason why distributions usually don't configure
>>>> "MEMORY_HOTPLUG_DEFAULT_ONLINE", because you really want the option for
>>>> MOVABLE zone. That however implies, that e.g. for x86, you have to
>>>> handle all new memory in user space, especially also HyperV memory.
>>>> There, you then have to check for things like "isHyperV()" to decide
>>>> "oh, yes, this should definitely not go to the MOVABLE zone".
>>>
>>> Why do you need a generic hotplug rule in the first place? Why don't you
>>> simply provide different set of rules for different usecases? Let users
>>> decide which usecase they prefer rather than try to be clever which
>>> almost always hits weird corner cases.
>>>
>>
>> Memory hotplug has to work as reliably as possible out of the box. Letting
>> the user make simple decisions like "oh, I am on hyper-V, I want to
>> online memory to the normal zone" does not feel right.
> 
> Users usually know what is their usecase and then it is just a matter of
> plumbing (e.g. distribution can provide proper tools to deploy those
> usecases) to choose the right (and, for the user, obscure) way to make it work.

I disagree. If we can ship sane defaults, we should do that and allow
changes to be made later on. This is how distributions have been working
forever. But yes, allowing modifications is always a good idea to tailor
things to special-case user scenarios (tuned or whatever we have in
place).

> 
>> But yes, we
>> should definitely allow to make modifications. So some sane default rule
>> + possible modification is usually a good idea.
>>
>> I think Dave has a point with using MOVABLE for huge page use cases. And
>> there might be other corner cases as you correctly state.
>>
>> I wonder if this patch itself, minus modifying online/offline, might make
>> sense. We can then implement simple rules in user space:
>>
>> if (normal) {
>> 	/* customers expect hotplugged DIMMs to be unpluggable */
>> 	online_movable();
>> } else if (paravirt) {
>> 	/* paravirt memory should by default always go to the NORMAL zone */
>> 	online();
>> } else {
>> 	/* standby memory will never get onlined automatically */
>> }
>>
>> Compared to having to guess what is to be done (isKVM(), isHyperV(),
>> isS390() ...) and failing once this is no longer unique (e.g. virtio-mem
>> and ACPI support for x86 KVM).
> 
> I am worried that exposing a type will just push us even further into the
> corner. The current design is really simple and 2-stage, and that is good
> because it allows for very different usecases. The more specific the API
> is, the more likely we are going to hit "I haven't even dreamed somebody
> would be using hotplug for this thing". And I would bet this will happen
> sooner or later.

Exposing the type of memory is, from my point of view, just forwarding
facts to user space. We should not export arbitrary information, that is true.

> 
> Just look at how the whole auto onlining screwed the API to workaround
> an implementation detail. It has created a one purpose behavior that
> doesn't suit many usecases. Yet we have to live with that because
> somebody really relies on it. Let's not repeat same errors.
> 

Let me rephrase: You state that user space has to make the decision and
that the user should be able to set/reconfigure rules. That is perfectly fine.

But then we should give user space access to sufficient information to
make that decision. This might be the type of memory, as we learned
(which is what part of this patch proposes), but maybe more later, e.g.
to which physical device the memory belongs (e.g. so it can be onlined
all movable or all normal) ...
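The userspace rules quoted earlier in this mail could then key off such
an exported type. A sketch in shell, with the caveat that the per-block
"type" attribute and its values "normal" / "paravirtualized" / "standby"
are assumptions taken from this RFC's direction, not an existing
mainline interface (the path argument stands in for an entry under
/sys/devices/system/memory):

```shell
# Sketch: online one memory block according to a hypothetical exported
# "type" attribute, mirroring the if/else rules quoted above. The
# attribute name and its values are assumptions for illustration only.
online_by_type() {
    block="$1"    # e.g. /sys/devices/system/memory/memory42
    case "$(cat "$block/type")" in
    normal)
        # customers expect hotplugged DIMMs to be unpluggable
        echo online_movable > "$block/state" ;;
    paravirtualized)
        # paravirt memory should by default go to the NORMAL zone
        echo online > "$block/state" ;;
    *)
        # standby memory is never onlined automatically
        ;;
    esac
}
```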

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
@ 2018-10-03 17:00             ` David Hildenbrand
  0 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-10-03 17:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, xen-devel, devel, linux-acpi, linux-sh, linux-s390,
	linuxppc-dev, linux-kernel, linux-ia64, Tony Luck, Fenghua Yu,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Martin Schwidefsky, Heiko Carstens, Yoshinori Sato, Rich Felker,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Rafael J. Wysocki,
	Len Brown, Greg Kroah-Hartman, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrew Morton, Mike Rapoport,
	Dan Williams, Stephen Rothwell, Kirill A. Shutemov,
	Nicholas Piggin, Jonathan Neuschäfer, Joe Perches,
	Michael Neuling, Mauricio Faria de Oliveira, Balbir Singh,
	Rashmica Gupta, Pavel Tatashin, Rob Herring, Philippe Ombredanne,
	Kate Stewart, mike.travis, Joonsoo Kim, Oscar Salvador,
	Mathieu Malaterre

On 03/10/2018 15:54, Michal Hocko wrote:
> On Tue 02-10-18 17:25:19, David Hildenbrand wrote:
>> On 02/10/2018 15:47, Michal Hocko wrote:
> [...]
>>> Zone imbalance is an inherent problem of the highmem zone. It is
>>> essentially the highmem zone we all loved so much back in 32b days.
>>> Yes the movable zone doesn't have any addressing limitations so it is a
>>> bit more relaxed but considering the hotplug scenarios I have seen so
>>> far people just want to have full NUMA nodes movable to allow replacing
>>> DIMMs. And then we are back to square one and the zone imbalance issue.
>>> You have those regardless where memmaps are allocated from.
>>
>> Unfortunately yes. And things get more complicated as you are adding a
>> whole DIMMs and get notifications in the granularity of memory blocks.
>> Usually you are not interested in onlining any memory block of that DIMM
>> as MOVABLE as soon as you would have to online one memory block of that
>> DIMM as NORMAL - because that can already block the whole DIMM.
> 
> For the purpose of the hotremove, yes. But as Dave has noted people are
> (ab)using zone movable for other purposes - e.g. large pages.

That might be right for some very special use cases. For most of users
this is not the case (meaning it should be the default but if the user
wants to change it, he should be allowed to change it).

>  
> [...]
>>> Then the immediate question would be why to use memory hotplug for that
>>> at all? Why don't you simply start with a huge pre-allocated physical
>>> address space and balloon memory in an out per demand. Why do you want
>>> to inject new memory during the runtime?
>>
>> Let's assume you have a guest with 20GB size and eventually want to
>> allow to grow it to 4TB. You would have to allocate metadata for 4TB
>> right from the beginning. That's definitely now what we want. That is
>> why memory hotplug is used by e.g. XEN or Hyper-V. With Hyper-V, the
>> hypervisor even tells you at which places additional memory has been
>> made available.
> 
> Then you have to live with the fact that your hot added memory will be
> self hosted and find a way for ballooning to work with that. The price
> would be that some part of the memory is not really balloonable in the
> end.
> 
>>>> 1. is a reason why distributions usually don't configure
>>>> "MEMORY_HOTPLUG_DEFAULT_ONLINE", because you really want the option for
>>>> MOVABLE zone. That however implies, that e.g. for x86, you have to
>>>> handle all new memory in user space, especially also HyperV memory.
>>>> There, you then have to check for things like "isHyperV()" to decide
>>>> "oh, yes, this should definitely not go to the MOVABLE zone".
>>>
>>> Why do you need a generic hotplug rule in the first place? Why don't you
>>> simply provide different set of rules for different usecases? Let users
>>> decide which usecase they prefer rather than try to be clever which
>>> almost always hits weird corner cases.
>>>
>>
>> Memory hotplug has to work as reliable as we can out of the box. Letting
>> the user make simple decisions like "oh, I am on hyper-V, I want to
>> online memory to the normal zone" does not feel right.
> 
> Users usually know what is their usecase and then it is just a matter of
> plumbing (e.g. distribution can provide proper tools to deploy those
> usecases) to chose the right and for user obscure way to make it work.

I disagree. If we can ship sane defaults, we should do that and allow to
make changes later on. This is how distributions have been working for
ever. But yes, allowing to make modifications is always a good idea to
tailor it to some special case user scenarios. (tuned or whatever we
have in place).

> 
>> But yes, we
>> should definitely allow to make modifications. So some sane default rule
>> + possible modification is usually a good idea.
>>
>> I think Dave has a point with using MOVABLE for huge page use cases. And
>> there might be other corner cases as you correctly state.
>>
>> I wonder if this patch itself minus modifying online/offline might make
>> sense. We can then implement simple rules in user space
>>
>> if (normal) {
>> 	/* customers expect hotplugged DIMMs to be unpluggable */
>> 	online_movable();
>> } else if (paravirt) {
>> 	/* paravirt memory should as default always go to the NORMAL */
>> 	online();
>> } else {
>> 	/* standby memory will never get onlined automatically */
>> }
>>
>> Compared to having to guess what is to be done (isKVM(), isHyperV,
>> isS390 ...) and failing once this is no longer unique (e.g. virtio-mem
>> and ACPI support for x86 KVM).
> 
> I am worried that exporing a type will just push us even further to the
> corner. The current design is really simple and 2 stage and that is good
> because it allows for very different usecases. The more specific the API
> be the more likely we are going to hit "I haven't even dreamed somebody
> would be using hotplug for this thing". And I would bet this will happen
> sooner or later.

Exposing the type of memory is in my point of view just forwarding facts
to user space. We should not export arbitrary information, that is true.

> 
> Just look at how the whole auto onlining screwed the API to workaround
> an implementation detail. It has created a one purpose behavior that
> doesn't suite many usecases. Yet we have to live with that because
> somebody really relies on it. Let's not repeat same errors.
> 

Let me rephrase: You state that user space has to make the decision and
that user should be able to set/reconfigure rules. That is perfectly fine.

But then we should give user space access to sufficient information to
make a decision. This might be the type of memory as we learned (what
some part of this patch proposes), but maybe later more, e.g. to which
physical device memory belongs (e.g. to hotplug it all movable or all
normal) ...

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
@ 2018-10-03 17:00             ` David Hildenbrand
  0 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-10-03 17:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Dave Hansen, Heiko Carstens, linux-mm, Pavel Tatashin,
	Paul Mackerras, H. Peter Anvin, Rashmica Gupta, K. Y. Srinivasan,
	Boris Ostrovsky, linux-s390, Michael Neuling, Stephen Hemminger,
	Yoshinori Sato, linux-acpi, Ingo Molnar, xen-devel, Rob Herring,
	Len Brown, Fenghua Yu, Stephen Rothwell, mike.travis,
	Haiyang Zhang, Dan Williams, Jonathan Neuschäfer,
	Nicholas Piggin, Joe Perches, Jérôme Glisse,
	Mike Rapoport, Borislav Petkov, Andy Lutomirski, Thomas Gleixner,
	Joonsoo Kim, Oscar Salvador, Juergen Gross, Tony Luck,
	Mathieu Malaterre, Greg Kroah-Hartman, Rafael J. Wysocki,
	linux-kernel, Mauricio Faria de Oliveira, Philippe Ombredanne,
	Martin Schwidefsky, devel, Andrew Morton, linuxppc-dev,
	Kirill A. Shutemov

On 03/10/2018 15:54, Michal Hocko wrote:
> On Tue 02-10-18 17:25:19, David Hildenbrand wrote:
>> On 02/10/2018 15:47, Michal Hocko wrote:
> [...]
>>> Zone imbalance is an inherent problem of the highmem zone. It is
>>> essentially the highmem zone we all loved so much back in 32b days.
>>> Yes the movable zone doesn't have any addressing limitations so it is a
>>> bit more relaxed but considering the hotplug scenarios I have seen so
>>> far people just want to have full NUMA nodes movable to allow replacing
>>> DIMMs. And then we are back to square one and the zone imbalance issue.
>>> You have those regardless where memmaps are allocated from.
>>
>> Unfortunately yes. And things get more complicated as you are adding a
>> whole DIMMs and get notifications in the granularity of memory blocks.
>> Usually you are not interested in onlining any memory block of that DIMM
>> as MOVABLE as soon as you would have to online one memory block of that
>> DIMM as NORMAL - because that can already block the whole DIMM.
> 
> For the purpose of the hotremove, yes. But as Dave has noted people are
> (ab)using zone movable for other purposes - e.g. large pages.

That might be right for some very special use cases. For most of users
this is not the case (meaning it should be the default but if the user
wants to change it, he should be allowed to change it).

>  
> [...]
>>> Then the immediate question would be why to use memory hotplug for that
>>> at all? Why don't you simply start with a huge pre-allocated physical
>>> address space and balloon memory in an out per demand. Why do you want
>>> to inject new memory during the runtime?
>>
>> Let's assume you have a guest with 20GB size and eventually want to
>> allow to grow it to 4TB. You would have to allocate metadata for 4TB
>> right from the beginning. That's definitely now what we want. That is
>> why memory hotplug is used by e.g. XEN or Hyper-V. With Hyper-V, the
>> hypervisor even tells you at which places additional memory has been
>> made available.
> 
> Then you have to live with the fact that your hot added memory will be
> self hosted and find a way for ballooning to work with that. The price
> would be that some part of the memory is not really balloonable in the
> end.
> 
>>>> 1. is a reason why distributions usually don't configure
>>>> "MEMORY_HOTPLUG_DEFAULT_ONLINE", because you really want the option for
>>>> MOVABLE zone. That however implies, that e.g. for x86, you have to
>>>> handle all new memory in user space, especially also HyperV memory.
>>>> There, you then have to check for things like "isHyperV()" to decide
>>>> "oh, yes, this should definitely not go to the MOVABLE zone".
>>>
>>> Why do you need a generic hotplug rule in the first place? Why don't you
>>> simply provide different set of rules for different usecases? Let users
>>> decide which usecase they prefer rather than try to be clever which
>>> almost always hits weird corner cases.
>>>
>>
>> Memory hotplug has to work as reliable as we can out of the box. Letting
>> the user make simple decisions like "oh, I am on hyper-V, I want to
>> online memory to the normal zone" does not feel right.
> 
> Users usually know what is their usecase and then it is just a matter of
> plumbing (e.g. distribution can provide proper tools to deploy those
> usecases) to chose the right and for user obscure way to make it work.

I disagree. If we can ship sane defaults, we should do that and allow to
make changes later on. This is how distributions have been working for
ever. But yes, allowing to make modifications is always a good idea to
tailor it to some special case user scenarios. (tuned or whatever we
have in place).

> 
>> But yes, we
>> should definitely allow to make modifications. So some sane default rule
>> + possible modification is usually a good idea.
>>
>> I think Dave has a point with using MOVABLE for huge page use cases. And
>> there might be other corner cases as you correctly state.
>>
>> I wonder if this patch itself minus modifying online/offline might make
>> sense. We can then implement simple rules in user space
>>
>> if (normal) {
>> 	/* customers expect hotplugged DIMMs to be unpluggable */
>> 	online_movable();
>> } else if (paravirt) {
>> 	/* paravirt memory should as default always go to the NORMAL */
>> 	online();
>> } else {
>> 	/* standby memory will never get onlined automatically */
>> }
>>
>> Compared to having to guess what is to be done (isKVM(), isHyperV,
>> isS390 ...) and failing once this is no longer unique (e.g. virtio-mem
>> and ACPI support for x86 KVM).
> 
> I am worried that exporing a type will just push us even further to the
> corner. The current design is really simple and 2 stage and that is good
> because it allows for very different usecases. The more specific the API
> be the more likely we are going to hit "I haven't even dreamed somebody
> would be using hotplug for this thing". And I would bet this will happen
> sooner or later.

Exposing the type of memory is in my point of view just forwarding facts
to user space. We should not export arbitrary information, that is true.

> 
> Just look at how the whole auto onlining screwed the API to workaround
> an implementation detail. It has created a one purpose behavior that
> doesn't suite many usecases. Yet we have to live with that because
> somebody really relies on it. Let's not repeat same errors.
> 

Let me rephrase: You state that user space has to make the decision and
that user should be able to set/reconfigure rules. That is perfectly fine.

But then we should give user space access to sufficient information to
make a decision. This might be the type of memory as we learned (what
some part of this patch proposes), but maybe later more, e.g. to which
physical device memory belongs (e.g. to hotplug it all movable or all
normal) ...

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-10-03 13:54           ` Michal Hocko
                             ` (2 preceding siblings ...)
  (?)
@ 2018-10-03 17:00           ` David Hildenbrand
  -1 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-10-03 17:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Dave Hansen,
	Heiko Carstens, linux-mm, Pavel Tatashin, Paul Mackerras,
	H. Peter Anvin, Rashmica Gupta, K. Y. Srinivasan,
	Boris Ostrovsky, linux-s390, Michael Neuling, Stephen Hemminger,
	Yoshinori Sato, Michael Ellerman, linux-acpi, Ingo Molnar,
	xen-devel

On 03/10/2018 15:54, Michal Hocko wrote:
> On Tue 02-10-18 17:25:19, David Hildenbrand wrote:
>> On 02/10/2018 15:47, Michal Hocko wrote:
> [...]
>>> Zone imbalance is an inherent problem of the highmem zone. It is
>>> essentially the highmem zone we all loved so much back in 32b days.
>>> Yes the movable zone doesn't have any addressing limitations so it is a
>>> bit more relaxed but considering the hotplug scenarios I have seen so
>>> far people just want to have full NUMA nodes movable to allow replacing
>>> DIMMs. And then we are back to square one and the zone imbalance issue.
>>> You have those regardless where memmaps are allocated from.
>>
>> Unfortunately yes. And things get more complicated as you are adding
>> whole DIMMs and get notifications at the granularity of memory blocks.
>> Usually you are not interested in onlining any memory block of that DIMM
>> as MOVABLE as soon as you would have to online one memory block of that
>> DIMM as NORMAL - because that can already block the whole DIMM.
> 
> For the purpose of the hotremove, yes. But as Dave has noted people are
> (ab)using zone movable for other purposes - e.g. large pages.

That might be right for some very special use cases. For most users
this is not the case (meaning it should be the default, but if the user
wants to change it, he should be allowed to change it).

>  
> [...]
>>> Then the immediate question would be why to use memory hotplug for that
>>> at all? Why don't you simply start with a huge pre-allocated physical
>>> address space and balloon memory in and out on demand. Why do you want
>>> to inject new memory during the runtime?
>>
>> Let's assume you have a guest with 20GB size and eventually want to
>> allow to grow it to 4TB. You would have to allocate metadata for 4TB
>> right from the beginning. That's definitely not what we want. That is
>> why memory hotplug is used by e.g. XEN or Hyper-V. With Hyper-V, the
>> hypervisor even tells you at which places additional memory has been
>> made available.
> 
> Then you have to live with the fact that your hot added memory will be
> self hosted and find a way for ballooning to work with that. The price
> would be that some part of the memory is not really balloonable in the
> end.
> 
>>>> 1. is a reason why distributions usually don't configure
>>>> "MEMORY_HOTPLUG_DEFAULT_ONLINE", because you really want the option for
>>>> MOVABLE zone. That however implies, that e.g. for x86, you have to
>>>> handle all new memory in user space, especially also HyperV memory.
>>>> There, you then have to check for things like "isHyperV()" to decide
>>>> "oh, yes, this should definitely not go to the MOVABLE zone".
>>>
>>> Why do you need a generic hotplug rule in the first place? Why don't you
>>> simply provide different set of rules for different usecases? Let users
>>> decide which usecase they prefer rather than try to be clever which
>>> almost always hits weird corner cases.
>>>
>>
>> Memory hotplug has to work as reliably as possible out of the box. Letting
>> the user make simple decisions like "oh, I am on Hyper-V, I want to
>> online memory to the normal zone" does not feel right.
> 
> Users usually know what their usecase is and then it is just a matter of
> plumbing (e.g. a distribution can provide proper tools to deploy those
> usecases) to choose the right way to make it work, even if that way is
> obscure to the user.

I disagree. If we can ship sane defaults, we should do that and allow
changes to be made later on. This is how distributions have been working
forever. But yes, allowing modifications is always a good idea, to
tailor things to special-case user scenarios (tuned or whatever we have
in place).

> 
>> But yes, we
>> should definitely allow to make modifications. So some sane default rule
>> + possible modification is usually a good idea.
>>
>> I think Dave has a point with using MOVABLE for huge page use cases. And
>> there might be other corner cases as you correctly state.
>>
>> I wonder if this patch itself minus modifying online/offline might make
>> sense. We can then implement simple rules in user space
>>
>> if (normal) {
>> 	/* customers expect hotplugged DIMMs to be unpluggable */
>> 	online_movable();
>> } else if (paravirt) {
>> 	/* paravirt memory should as default always go to the NORMAL */
>> 	online();
>> } else {
>> 	/* standby memory will never get onlined automatically */
>> }
>>
>> Compared to having to guess what is to be done (isKVM(), isHyperV,
>> isS390 ...) and failing once this is no longer unique (e.g. virtio-mem
>> and ACPI support for x86 KVM).
> 
> I am worried that exporting a type will just push us even further to the
> corner. The current design is really simple and 2 stage and that is good
> because it allows for very different usecases. The more specific the API
> is, the more likely we are to hit "I haven't even dreamed somebody
> would be using hotplug for this thing". And I would bet this will happen
> sooner or later.

Exposing the type of memory is, from my point of view, just forwarding
facts to user space. We should not export arbitrary information, that is true.

> 
> Just look at how the whole auto onlining screwed the API to work around
> an implementation detail. It has created a one-purpose behavior that
> doesn't suit many usecases. Yet we have to live with that because
> somebody really relies on it. Let's not repeat the same errors.
> 

Let me rephrase: You state that user space has to make the decision and
that the user should be able to set/reconfigure rules. That is perfectly fine.

But then we should give user space access to sufficient information to
make a decision. This might be the type of memory, as we learned (which
is what part of this patch proposes), but maybe more later, e.g. which
physical device the memory belongs to (e.g. to online all of it movable
or all of it normal) ...

-- 

Thanks,

David / dhildenb

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-10-03 14:24                 ` Michal Hocko
  (?)
@ 2018-10-03 17:06                   ` David Hildenbrand
  -1 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-10-03 17:06 UTC (permalink / raw)
  To: Michal Hocko, Vitaly Kuznetsov
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Dave Hansen,
	Heiko Carstens, linux-mm, Paul Mackerras, H. Peter Anvin,
	Stephen Rothwell, Rashmica Gupta, Dan Williams, linux-s390,
	Michael Neuling, Stephen Hemminger, Yoshinori Sato,
	Michael Ellerman, linux-acpi, Ingo Molnar, xen-devel, Len Brown,
	Pavel

On 03/10/2018 16:24, Michal Hocko wrote:
> On Wed 03-10-18 15:52:24, Vitaly Kuznetsov wrote:
> [...]
>>> As David said some of the memory cannot be onlined without further steps
>>> (e.g. when it is standby as David called it) and then I fail to see how
>>> eBPF helps in any way.
>>
>> and also, we can fight till the end of days here trying to come up with
>> an onlining solution which would work for everyone and eBPF would move
>> this decision to distro level.
> 
> The point is that there is _no_ general onlining solution. This is
> basically policy which belongs to the userspace.
> 

As already stated, I guess we should then provide user space with
sufficient information to make a good decision (to implement rules).

The eBPF idea is basically the same, only the rules are formulated
differently and handled directly in the kernel. Still, it might e.g. be
relevant whether memory is standby memory (that is what I remember as
the official s390x name) or something else.

Right now, the (udev) rules we have make assumptions based on general
system properties (s390x, HyperV ...).
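For reference, such guess-based rules are often as blunt as the following
udev fragment (the shape is illustrative; real distribution rules wrap it
in s390x/Hyper-V-style conditions):

```
# Illustrative udev rule: online every hotplugged memory block,
# with no knowledge of what kind of memory it is.
SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"
```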

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-10-03 14:34                   ` Vitaly Kuznetsov
  (?)
@ 2018-10-03 17:14                     ` David Hildenbrand
  -1 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-10-03 17:14 UTC (permalink / raw)
  To: Vitaly Kuznetsov, Dave Hansen, Michal Hocko
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Heiko Carstens, linux-mm,
	Paul Mackerras, H. Peter Anvin, Stephen Rothwell, Rashmica Gupta,
	Dan Williams, linux-s390, Michael Neuling, Stephen Hemminger,
	Yoshinori Sato, Michael Ellerman, linux-acpi, Ingo Molnar,
	xen-devel, Len Brown, Pavel Tatashin, Rob

On 03/10/2018 16:34, Vitaly Kuznetsov wrote:
> Dave Hansen <dave.hansen@linux.intel.com> writes:
> 
>> On 10/03/2018 06:52 AM, Vitaly Kuznetsov wrote:
>>> It is more than just memmaps (e.g. forking udev process doing memory
>>> onlining also needs memory) but yes, the main idea is to make the
>>> onlining synchronous with hotplug.
>>
>> That's a good theoretical concern.
>>
>> But, is it a problem we need to solve in practice?
> 
> Yes, unfortunately. It was previously discovered that when we try to
> hotplug tons of memory to a low memory system (this is a common scenario
> with VMs) we end up with OOM because for all new memory blocks we need
> to allocate page tables, struct pages, ... and we need memory to do
> that. The userspace program doing memory onlining also needs memory to
> run and in case it prefers to fork to handle hundreds of notifications
> ... well, it may get OOMkilled before it manages to online anything.
> 
> Allocating all kernel objects from the newly hotplugged blocks would
> definitely help to manage the situation but as I said this won't solve
> the 'forking udev' problem completely (it will likely remain in
> 'extreme' cases only). We can probably work around it by onlining with a
> dedicated process which doesn't do memory allocation.
> 

I guess the problem is even worse. We always have two phases

1. add memory - requires memory allocation
2. online memory - might require memory allocations e.g. for slab/slub

So if we just added memory but don't have sufficient memory to start a
user space process to trigger onlining, then we most likely also don't
have sufficient memory to online the memory right away (in some scenarios).

We would have to allocate all new memory for 1 and 2 from the memory to
be onlined. I guess the latter part is less trivial.

So while onlining the memory from the kernel might make things a little
more robust, we would still run the risk of OOM / onlining failing.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-10-03 17:14                     ` David Hildenbrand
  (?)
@ 2018-10-04  6:19                       ` Michal Hocko
  -1 siblings, 0 replies; 144+ messages in thread
From: Michal Hocko @ 2018-10-04  6:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Dave Hansen,
	Heiko Carstens, linux-mm, Paul Mackerras, H. Peter Anvin,
	Stephen Rothwell, Rashmica Gupta, Dan Williams, linux-s390,
	Michael Neuling, Stephen Hemminger, Yoshinori Sato,
	Michael Ellerman, linux-acpi, Ingo Molnar, xen-devel, Len Brown,
	Pavel

On Wed 03-10-18 19:14:05, David Hildenbrand wrote:
> On 03/10/2018 16:34, Vitaly Kuznetsov wrote:
> > Dave Hansen <dave.hansen@linux.intel.com> writes:
> > 
> >> On 10/03/2018 06:52 AM, Vitaly Kuznetsov wrote:
> >>> It is more than just memmaps (e.g. forking udev process doing memory
> >>> onlining also needs memory) but yes, the main idea is to make the
> >>> onlining synchronous with hotplug.
> >>
> >> That's a good theoretical concern.
> >>
> >> But, is it a problem we need to solve in practice?
> > 
> > Yes, unfortunately. It was previously discovered that when we try to
> > hotplug tons of memory to a low memory system (this is a common scenario
> > with VMs) we end up with OOM because for all new memory blocks we need
> > to allocate page tables, struct pages, ... and we need memory to do
> > that. The userspace program doing memory onlining also needs memory to
> > run and in case it prefers to fork to handle hundreds of notifications
> > ... well, it may get OOMkilled before it manages to online anything.
> > 
> > Allocating all kernel objects from the newly hotplugged blocks would
> > definitely help to manage the situation but as I said this won't solve
> > the 'forking udev' problem completely (it will likely remain in
> > 'extreme' cases only). We can probably work around it by onlining with a
> > dedicated process which doesn't do memory allocation.
> > 
> 
> I guess the problem is even worse. We always have two phases
> 
> 1. add memory - requires memory allocation
> 2. online memory - might require memory allocations e.g. for slab/slub
> 
> So if we just added memory but don't have sufficient memory to start a
> user space process to trigger onlining, then we most likely also don't
> have sufficient memory to online the memory right away (in some scenarios).
> 
> We would have to allocate all new memory for 1 and 2 from the memory to
> be onlined. I guess the latter part is less trivial.
> 
> So while onlining the memory from the kernel might make things a little
> more robust, we would still run the risk of OOM / onlining failing.

Yes, _theoretically_. Is this a practical problem for reasonable
configurations though? I mean, this will never be perfect and we simply
cannot support all possible configurations. We should focus on a
reasonable subset of them. From my practical experience the vast
majority of the memory consumed is for memmaps (roughly 1.5% of the
added memory). That is not a lot, but I agree that allocating it from
the normal zone, and off node, is not great. Especially the second part,
which is noticeable for whole-node hotplug.
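As a back-of-the-envelope check on that ~1.5% figure (the 64-byte struct
page size is the typical x86-64 value, used here as an assumption):

```python
STRUCT_PAGE = 64   # bytes; typical sizeof(struct page) on x86-64
PAGE_SIZE = 4096   # bytes

def memmap_overhead(bytes_added: int) -> int:
    """Bytes consumed by struct pages for `bytes_added` of hotplugged memory."""
    return (bytes_added // PAGE_SIZE) * STRUCT_PAGE

GIB = 1 << 30
# 128 GiB of hotplugged memory needs 2 GiB of memmaps:
print(memmap_overhead(128 * GIB) // GIB)   # 2
print(f"{STRUCT_PAGE / PAGE_SIZE:.2%}")    # 1.56%
```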

I have a feeling that arguing about fork not being able to proceed, or
OOMing during memory hotplug, is a bit of a stretch and a sign of
misconfiguration.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-10-03 17:00             ` David Hildenbrand
  (?)
@ 2018-10-04  6:28               ` Michal Hocko
  -1 siblings, 0 replies; 144+ messages in thread
From: Michal Hocko @ 2018-10-04  6:28 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Dave Hansen,
	Heiko Carstens, linux-mm, Pavel Tatashin, Paul Mackerras,
	H. Peter Anvin, Rashmica Gupta, Boris Ostrovsky, linux-s390,
	Michael Neuling, Stephen Hemminger, Yoshinori Sato,
	Michael Ellerman, linux-acpi, Ingo Molnar, xen-devel,
	Rob Herring

On Wed 03-10-18 19:00:29, David Hildenbrand wrote:
[...]
> Let me rephrase: You state that user space has to make the decision and
> that user should be able to set/reconfigure rules. That is perfectly fine.
> 
> But then we should give user space access to sufficient information to
> make a decision. This might be the type of memory as we learned (what
> some part of this patch proposes), but maybe later more, e.g. to which
> physical device memory belongs (e.g. to hotplug it all movable or all
> normal) ...

I am pretty sure that the user knows whether he/she wants to use
ballooning in Hyper-V or Xen, or whether memory hotplug should be used
as a "RAS" feature to allow adding and removing DIMMs for reliability.
Why shouldn't we have a package to deploy an appropriate set of udev
rules for each of those use cases? I am pretty sure you need some other
plumbing to enable them anyway (e.g. RAS would require the movable_node
kernel parameter, ballooning a kernel module, etc.).

Really, one udev script to rule them all will simply never work.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-10-04  6:28               ` Michal Hocko
  (?)
@ 2018-10-04  7:40                 ` David Hildenbrand
  -1 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-10-04  7:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Dave Hansen,
	Heiko Carstens, linux-mm, Pavel Tatashin, Paul Mackerras,
	H. Peter Anvin, Rashmica Gupta, Boris Ostrovsky, linux-s390,
	Michael Neuling, Stephen Hemminger, Yoshinori Sato,
	Michael Ellerman, linux-acpi, Ingo Molnar, xen-devel,
	Rob Herring

On 04/10/2018 08:28, Michal Hocko wrote:
> On Wed 03-10-18 19:00:29, David Hildenbrand wrote:
> [...]
>> Let me rephrase: You state that user space has to make the decision and
>> that user should be able to set/reconfigure rules. That is perfectly fine.
>>
>> But then we should give user space access to sufficient information to
>> make a decision. This might be the type of memory as we learned (what
>> some part of this patch proposes), but maybe later more, e.g. to which
>> physical device memory belongs (e.g. to hotplug it all movable or all
>> normal) ...
> 
> I am pretty sure that the user knows whether he/she wants to use
> ballooning in Hyper-V or Xen, or whether memory hotplug should be used
> as a "RAS" feature to allow adding and removing DIMMs for reliability.
> Why shouldn't we have a package to deploy an appropriate set of udev
> rules for each of those use cases? I am pretty sure you need some other
> plumbing to enable them anyway (e.g. RAS would require the movable_node
> kernel parameter, ballooning a kernel module, etc.).
> 
> Really, one udev script to rule them all will simply never work.
> 

I am on your side. We will need multiple ones. But we need sane
defaults. And a default rule will always exist. And users will expect
that the defaults somewhat match their expectations unless they really
have some special use cases.

All I am saying is, again, that if user space is to make decisions, it
should get sufficient information to make sane decisions. And in my point
of view, the type of memory allows us to make these decisions and to
provide a "single udev script to rule them all" with sane defaults.

I at least think the distinction between "auto-online" and "standby" is
required (what Dave suggested).

Then we can make a simple rule:

if (auto-online memory) {
	if (virtual environment) {
		"online"
	} else {
		"online_movable"
	}
}
/* standby memory not onlined as default */

We are able to provide sane defaults.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-10-01 16:24       ` Dave Hansen
  (?)
@ 2018-10-04  7:48         ` David Hildenbrand
  -1 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-10-04  7:48 UTC (permalink / raw)
  To: Dave Hansen, linux-mm
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Heiko Carstens,
	Pavel Tatashin, Michal Hocko, Paul Mackerras, H. Peter Anvin,
	Rashmica Gupta, Boris Ostrovsky, linux-s390, Michael Neuling,
	Stephen Hemminger, Yoshinori Sato, Michael Ellerman, linux-acpi,
	Ingo Molnar, xen-devel, Rob Herring, Len Brown

On 01/10/2018 18:24, Dave Hansen wrote:
>> What should a policy in user space look like when new memory gets added
>> - on s390x? Not onlining paravirtualized memory is very wrong.
> 
> Because we're going to balloon it away in a moment anyway?

No, rather somebody wanted this VM to have more memory, so it should use
it - basically what Hyper-V or Xen also do (in contrast to the concept
of standby memory on s390).

> We have auto-onlining.  Why isn't that being used on s390?

Do you mean the sys parameter? How would that help?

> 
> 
>> So the type of memory is very important here to have in user space.
>> Relying on checks like "isS390()", "isKVMGuest()" or "isHyperVGuest()"
>> to decide whether to online memory and how to online memory is wrong.
>> Only some specific memory types (which I call "normal") are to be
>> handled by user space.
>>
>> For the other ones, we exactly know what to do:
>> - standby? don't online
> 
> I think you're horribly conflating the software desire for what the state
> should be and the hardware itself.

Agreed, user space should be able to configure it.

> 
>>> As for the OOM issues, that sounds like something we need to fix by
>>> refusing to do (or delaying) hot-add operations once we consume too much
>>> ZONE_NORMAL from memmap[]s rather than trying to indirectly tell
>>> userspace to hurry thing along.
>>
>> That is a moving target and doing that automatically is basically
>> impossible.
> 
> Nah.  We know how much metadata we've allocated.  We know how much
> ZONE_NORMAL we are eating.  We can *easily* add something to
> add_memory() that just sleeps until the ratio is not out-of-whack.
> 
>> You can add a lot of memory to the movable zone and
>> everything is fine. Suddenly a lot of processes are started - boom.
>> MOVABLE should only ever be used if you expect an unplug. And for
>> paravirtualized devices, a "typical" unplug does not exist.
> 
> No, it's more complicated than that.  People use MOVABLE, for instance,
> to allow more consistent huge page allocations.  It's certainly not just
> hot-remove.
> 

As noted in the other thread, that's a good point. We have to allow
user space to make that decision.

I agree with your initial proposal to distinguish "standby" from
"auto-online". It would allow for sane defaults in user space.
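The "sysfs parameter" referred to above is the kernel-wide auto-onlining knob next to the per-block `state` files under `/sys/devices/system/memory`. The sketch below goes through the motions against a scratch directory (so it is runnable without privileges or real hotplug); on a real system `SYSFS` would be `/sys/devices/system/memory`:

```shell
# Scratch stand-in for /sys/devices/system/memory; block number 42 is
# arbitrary for the demonstration.
SYSFS=$(mktemp -d)
mkdir -p "$SYSFS/memory42"
echo offline > "$SYSFS/memory42/state"

# The kernel-wide default applied to *every* hotplugged block - the
# limitation discussed in this thread: one global policy cannot fit
# DIMMs, s390x standby memory and paravirtualized balloons at once.
echo online_movable > "$SYSFS/auto_online_blocks"

# Per-block onlining, as a udev rule or an administrator would do it.
echo online > "$SYSFS/memory42/state"
cat "$SYSFS/memory42/state"
```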

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 144+ messages in thread


* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-10-03 14:24                 ` Michal Hocko
  (?)
@ 2018-10-04  8:12                   ` David Hildenbrand
  -1 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-10-04  8:12 UTC (permalink / raw)
  To: Michal Hocko, Vitaly Kuznetsov
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Dave Hansen,
	Heiko Carstens, linux-mm, Paul Mackerras, H. Peter Anvin,
	Stephen Rothwell, Rashmica Gupta, Dan Williams, linux-s390,
	Michael Neuling, Stephen Hemminger, Yoshinori Sato,
	Michael Ellerman, linux-acpi, Ingo Molnar, xen-devel, Len Brown,
	Pavel

On 03/10/2018 16:24, Michal Hocko wrote:
> On Wed 03-10-18 15:52:24, Vitaly Kuznetsov wrote:
> [...]
>>> As David said some of the memory cannot be onlined without further steps
>>> (e.g. when it is standby as David called it) and then I fail to see how
>>> eBPF help in any way.
>>
>> and also, we can fight till the end of days here trying to come up with
>> an onlining solution which would work for everyone and eBPF would move
>> this decision to distro level.
> 
> The point is that there is _no_ general onlining solution. This is
> basically policy which belongs to the userspace.
> 

Vitaly's point was that this policy would be formulated by user space via
eBPF and then executed by the kernel. So the work of formulating the
policy would still have to be done by user space.

I guess the only problem this would partially solve is memory onlining
failing because we can no longer fork in user space. Just as you said, I
also think that is rather a serious misconfiguration - we would most
probably already have other applications triggering the OOM killer.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 144+ messages in thread


* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-10-04  6:19                       ` Michal Hocko
  (?)
@ 2018-10-04  8:13                         ` David Hildenbrand
  -1 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-10-04  8:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Dave Hansen,
	Heiko Carstens, linux-mm, Paul Mackerras, H. Peter Anvin,
	Stephen Rothwell, Rashmica Gupta, Dan Williams, linux-s390,
	Michael Neuling, Stephen Hemminger, Yoshinori Sato,
	Michael Ellerman, linux-acpi, Ingo Molnar, xen-devel, Len Brown,
	Pavel

On 04/10/2018 08:19, Michal Hocko wrote:
> On Wed 03-10-18 19:14:05, David Hildenbrand wrote:
>> On 03/10/2018 16:34, Vitaly Kuznetsov wrote:
>>> Dave Hansen <dave.hansen@linux.intel.com> writes:
>>>
>>>> On 10/03/2018 06:52 AM, Vitaly Kuznetsov wrote:
>>>>> It is more than just memmaps (e.g. forking udev process doing memory
>>>>> onlining also needs memory) but yes, the main idea is to make the
>>>>> onlining synchronous with hotplug.
>>>>
>>>> That's a good theoretical concern.
>>>>
>>>> But, is it a problem we need to solve in practice?
>>>
>>> Yes, unfortunately. It was previously discovered that when we try to
>>> hotplug tons of memory to a low memory system (this is a common scenario
>>> with VMs) we end up with OOM because for all new memory blocks we need
>>> to allocate page tables, struct pages, ... and we need memory to do
>>> that. The userspace program doing memory onlining also needs memory to
>>> run and in case it prefers to fork to handle hundreds of notifications
>>> ... well, it may get OOMkilled before it manages to online anything.
>>>
>>> Allocating all kernel objects from the newly hotplugged blocks would
>>> definitely help to manage the situation but as I said this won't solve
>>> the 'forking udev' problem completely (it will likely remain in
>>> 'extreme' cases only). We can probably work around it by onlining with a
>>> dedicated process which doesn't do memory allocation).
>>>
>>
>> I guess the problem is even worse. We always have two phases
>>
>> 1. add memory - requires memory allocation
>> 2. online memory - might require memory allocations e.g. for slab/slub
>>
>> So if we just added memory but don't have sufficient memory to start a
>> user space process to trigger onlining, then we most likely also don't
>> have sufficient memory to online the memory right away (in some scenarios).
>>
>> We would have to allocate all new memory for 1 and 2 from the memory to
>> be onlined. I guess the latter part is less trivial.
>>
>> So while onlining the memory from the kernel might make things a little
>> more robust, we would still have the chance for OOM / onlining failing.
> 
> Yes, _theoretically_. Is this a practical problem for reasonable
> configurations though? I mean, this will never be perfect and we simply
> cannot support all possible configurations. We should focus on
> reasonable subset of them. From my practical experience the vast
> majority of memory is consumed by memmaps (roughly 1.5%). That is not a
> lot but I agree that allocating that from the zone normal and off node
> is not great. Especially the second part which is noticeable for whole
> node hotplug.
> 
> I have a feeling that arguing about fork not being able to proceed or
> OOMing for the memory hotplug is a bit of a stretch and a sign of a
> misconfiguration.
> 

Just to rephrase, I have the same opinion. Something is already messed
up if we cannot even fork anymore. We will already have OOMs all over
the place before/during/after forking.
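The "roughly 1.5%" memmap overhead quoted above follows directly from the bookkeeping: one struct page per 4 KiB page of hotplugged memory, where 64 bytes is the typical struct page size on common 64-bit configurations (an assumption; the exact size depends on the kernel config):

```shell
# memmap overhead ratio: one 64-byte struct page (assumed size) per
# 4096-byte page of hotplugged memory
awk 'BEGIN { printf "%.2f%%\n", 64 / 4096 * 100 }'   # -> 1.56%
```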

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
@ 2018-10-04  8:13                         ` David Hildenbrand
  0 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-10-04  8:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vitaly Kuznetsov, Dave Hansen, Kate Stewart, Rich Felker,
	linux-ia64, linux-sh, Peter Zijlstra, Benjamin Herrenschmidt,
	Balbir Singh, Heiko Carstens, linux-mm, Pavel Tatashin,
	Paul Mackerras, H. Peter Anvin, Rashmica Gupta, Boris Ostrovsky,
	linux-s390, Michael Neuling, Stephen Hemminger, Yoshinori Sato,
	Michael Ellerman, linux-acpi, Ingo Molnar, xen-devel,
	Rob Herring, Len Brown, Fenghua Yu, Stephen Rothwell,
	mike.travis, Haiyang Zhang, Dan Williams,
	Jonathan Neuschäfer, Nicholas Piggin, Joe Perches,
	Jérôme Glisse, Mike Rapoport, Borislav Petkov,
	Andy Lutomirski, Thomas Gleixner, Joonsoo Kim, Oscar Salvador,
	Juergen Gross, Tony Luck, Mathieu Malaterre, Greg Kroah-Hartman,
	Rafael J. Wysocki, linux-kernel, Mauricio Faria de Oliveira,
	Philippe Ombredanne, Martin Schwidefsky, devel, Andrew Morton,
	linuxppc-dev, Kirill A. Shutemov

On 04/10/2018 08:19, Michal Hocko wrote:
> On Wed 03-10-18 19:14:05, David Hildenbrand wrote:
>> On 03/10/2018 16:34, Vitaly Kuznetsov wrote:
>>> Dave Hansen <dave.hansen@linux.intel.com> writes:
>>>
>>>> On 10/03/2018 06:52 AM, Vitaly Kuznetsov wrote:
>>>>> It is more than just memmaps (e.g. forking udev process doing memory
>>>>> onlining also needs memory) but yes, the main idea is to make the
>>>>> onlining synchronous with hotplug.
>>>>
>>>> That's a good theoretical concern.
>>>>
>>>> But, is it a problem we need to solve in practice?
>>>
>>> Yes, unfortunately. It was previously discovered that when we try to
>>> hotplug tons of memory to a low memory system (this is a common scenario
>>> with VMs) we end up with OOM because for all new memory blocks we need
>>> to allocate page tables, struct pages, ... and we need memory to do
>>> that. The userspace program doing memory onlining also needs memory to
>>> run and in case it prefers to fork to handle hundreds of notfifications
>>> ... well, it may get OOMkilled before it manages to online anything.
>>>
>>> Allocating all kernel objects from the newly hotplugged blocks would
>>> definitely help to manage the situation but as I said this won't solve
>>> the 'forking udev' problem completely (it will likely remain in
>>> 'extreme' cases only. We can probably work around it by onlining with a
>>> dedicated process which doesn't do memory allocation).
>>>
>>
>> I guess the problem is even worse. We always have two phases
>>
>> 1. add memory - requires memory allocation
>> 2. online memory - might require memory allocations e.g. for slab/slub
>>
>> So if we just added memory but don't have sufficient memory to start a
>> user space process to trigger onlining, then we most likely also don't
>> have sufficient memory to online the memory right away (in some scenarios).
>>
>> We would have to allocate all new memory for 1 and 2 from the memory to
>> be onlined. I guess the latter part is less trivial.
>>
>> So while onlining the memory from the kernel might make things a little
>> more robust, we would still have the chance for OOM / onlining failing.
> 
> Yes, _theoretically_. Is this a practical problem for reasonable
> configurations though? I mean, this will never be perfect and we simply
> cannot support all possible configurations. We should focus on
> reasonable subset of them. From my practical experience the vast
> majority of memory is consumed by memmaps (roughly 1.5%). That is not a
> lot but I agree that allocating that from the zone normal and off node
> is not great. Especially the second part which is noticeable for whole
> node hotplug.
> 
> I have a feeling that arguing about fork not able to proceed or OOMing
> for the memory hotplug is a bit of a stretch and a sign a of
> misconfiguration.
> 

Just to rephrase, I have the same opinion. Something is already messed
up if we cannot even fork anymore. We will have OOM already all over the
place before/during/after forking.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
@ 2018-10-04  8:13                         ` David Hildenbrand
  0 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-10-04  8:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Dave Hansen, Heiko Carstens, linux-mm, Paul Mackerras,
	H. Peter Anvin, Stephen Rothwell, Rashmica Gupta, Dan Williams,
	linux-s390, Michael Neuling, Stephen Hemminger, Yoshinori Sato,
	linux-acpi, Ingo Molnar, xen-devel, Len Brown, Pavel Tatashin,
	Rob Herring, mike.travis, Haiyang Zhang,
	Jonathan Neuschäfer, Nicholas Piggin, Martin Schwidefsky,
	Jérôme Glisse, Mike Rapoport, Borislav Petkov,
	Andy Lutomirski, Boris Ostrovsky, Joonsoo Kim, Oscar Salvador,
	Juergen Gross, Tony Luck, Andrew Morton, Mathieu Malaterre,
	Greg Kroah-Hartman, Rafael J. Wysocki, linux-kernel, Fenghua Yu,
	Mauricio Faria de Oliveira, Thomas Gleixner, Philippe Ombredanne,
	Joe Perches, devel, Vitaly Kuznetsov, linuxppc-dev,
	Kirill A. Shutemov

On 04/10/2018 08:19, Michal Hocko wrote:
> On Wed 03-10-18 19:14:05, David Hildenbrand wrote:
>> On 03/10/2018 16:34, Vitaly Kuznetsov wrote:
>>> Dave Hansen <dave.hansen@linux.intel.com> writes:
>>>
>>>> On 10/03/2018 06:52 AM, Vitaly Kuznetsov wrote:
>>>>> It is more than just memmaps (e.g. forking udev process doing memory
>>>>> onlining also needs memory) but yes, the main idea is to make the
>>>>> onlining synchronous with hotplug.
>>>>
>>>> That's a good theoretical concern.
>>>>
>>>> But, is it a problem we need to solve in practice?
>>>
>>> Yes, unfortunately. It was previously discovered that when we try to
>>> hotplug tons of memory to a low memory system (this is a common scenario
>>> with VMs) we end up with OOM because for all new memory blocks we need
>>> to allocate page tables, struct pages, ... and we need memory to do
>>> that. The userspace program doing memory onlining also needs memory to
>>> run and in case it prefers to fork to handle hundreds of notfifications
>>> ... well, it may get OOMkilled before it manages to online anything.
>>>
>>> Allocating all kernel objects from the newly hotplugged blocks would
>>> definitely help to manage the situation but as I said this won't solve
>>> the 'forking udev' problem completely (it will likely remain in
>>> 'extreme' cases only. We can probably work around it by onlining with a
>>> dedicated process which doesn't do memory allocation).
>>>
>>
>> I guess the problem is even worse. We always have two phases
>>
>> 1. add memory - requires memory allocation
>> 2. online memory - might require memory allocations e.g. for slab/slub
>>
>> So if we just added memory but don't have sufficient memory to start a
>> user space process to trigger onlining, then we most likely also don't
>> have sufficient memory to online the memory right away (in some scenarios).
>>
>> We would have to allocate all new memory for 1 and 2 from the memory to
>> be onlined. I guess the latter part is less trivial.
>>
>> So while onlining the memory from the kernel might make things a little
>> more robust, we would still have the chance for OOM / onlining failing.
> 
> Yes, _theoretically_. Is this a practical problem for reasonable
> configurations though? I mean, this will never be perfect and we simply
> cannot support all possible configurations. We should focus on
> reasonable subset of them. From my practical experience the vast
> majority of memory is consumed by memmaps (roughly 1.5%). That is not a
> lot but I agree that allocating that from the zone normal and off node
> is not great. Especially the second part which is noticeable for whole
> node hotplug.
> 
> I have a feeling that arguing about fork not able to proceed or OOMing
> for the memory hotplug is a bit of a stretch and a sign a of
> misconfiguration.
> 

Just to rephrase, I have the same opinion. Something is already messed
up if we cannot even fork anymore. We will already have OOMs all over
the place before/during/after forking.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-10-04  8:13                         ` David Hildenbrand
  (?)
@ 2018-10-04 15:28                           ` Michal Suchánek
  -1 siblings, 0 replies; 144+ messages in thread
From: Michal Suchánek @ 2018-10-04 15:28 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Dave Hansen, Heiko Carstens, Michal Hocko, linux-mm,
	Paul Mackerras, H. Peter Anvin, Rashmica Gupta, Boris Ostrovsky,
	Stephen Rothwell, Michael Neuling, Stephen Hemminger,
	Yoshinori Sato, Vitaly Kuznetsov, linux-acpi, Ingo Molnar,
	xen-devel, Rob Herring, Len Brown, Pavel Tatashin, linux-s390

On Thu, 4 Oct 2018 10:13:48 +0200
David Hildenbrand <david@redhat.com> wrote:

ok, so what is the problem here?

Handling the hotplug in userspace through udev may be suboptimal and
kernel handling might be faster but that's orthogonal to the problem at
hand.

The state of the art is to determine what to do with hotplugged memory
in userspace based on platform and virtualization type.

Changing the default to depend on the driver that added the memory
rather than platform type should solve the issue of VMs growing
different types of memory device emulation.

Am I missing something?

Thanks

Michal
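The platform-dependent userspace policy described above usually lives in a udev rule. A sketch of one common shape (the match keys are standard udev syntax; the s390x carve-out is one example of a platform check, and real distro rules vary):

```
SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", PROGRAM=="/usr/bin/uname -m", RESULT!="s390x", ATTR{state}="online"
```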

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-10-04 15:28                           ` Michal Suchánek
  (?)
@ 2018-10-04 15:45                             ` David Hildenbrand
  -1 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-10-04 15:45 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Dave Hansen, Heiko Carstens, Michal Hocko, linux-mm,
	Paul Mackerras, H. Peter Anvin, Rashmica Gupta, Boris Ostrovsky,
	Stephen Rothwell, Michael Neuling, Stephen Hemminger,
	Yoshinori Sato, Vitaly Kuznetsov, linux-acpi, Ingo Molnar,
	xen-devel, Rob Herring, Len Brown, Pavel Tatashin, linux-s390

On 04/10/2018 17:28, Michal Suchánek wrote:
> On Thu, 4 Oct 2018 10:13:48 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
> ok, so what is the problem here?
> 
> Handling the hotplug in userspace through udev may be suboptimal and
> kernel handling might be faster but that's orthogonal to the problem at
> hand.

Yes, that one to solve is a different story.

> 
> The state of the art is to determine what to do with hotplugged memory
> in userspace based on platform and virtualization type.

Exactly.

> 
> Changing the default to depend on the driver that added the memory
> rather than platform type should solve the issue of VMs growing
> different types of memory device emulation.

Yes, my original proposal (this patch) was to handle it in the kernel
for known types. But as we learned, some use cases might still require
making a decision in user space.

So providing user space either with some type hint (auto-online vs.
standby) or with the driver that added it (system vs. hyper-v ...) would
solve the issue.

> 
> Am I missing something?
> 

No, that's it. Thanks!

> Thanks
> 
> Michal
> 


-- 

Thanks,

David / dhildenb
_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-10-04 15:45                             ` David Hildenbrand
  (?)
@ 2018-10-04 17:50                               ` Michal Suchánek
  -1 siblings, 0 replies; 144+ messages in thread
From: Michal Suchánek @ 2018-10-04 17:50 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Joonsoo Kim, Dave Hansen, Heiko Carstens, Michal Hocko, linux-mm,
	Paul Mackerras, H. Peter Anvin, Rashmica Gupta, Dan Williams,
	Stephen Rothwell, Michael Neuling, Stephen Hemminger,
	Yoshinori Sato, linux-acpi, Ingo Molnar, xen-devel, Len Brown,
	Pavel Tatashin, Rob Herring, mike.travis

On Thu, 4 Oct 2018 17:45:13 +0200
David Hildenbrand <david@redhat.com> wrote:

> On 04/10/2018 17:28, Michal Suchánek wrote:

> > 
> > The state of the art is to determine what to do with hotplugged
> > memory in userspace based on platform and virtualization type.  
> 
> Exactly.
> 
> > 
> > Changing the default to depend on the driver that added the memory
> > rather than platform type should solve the issue of VMs growing
> > different types of memory device emulation.  
> 
> Yes, my original proposal (this patch) was to handle it in the kernel
> for known types. But as we learned, there might be some use cases that
> might still require to make a decision in user space.
> 
> So providing the user space either with some type hint (auto-online
> vs. standby) or the driver that added it (system vs. hyper-v ...)
> would solve the issue.

Is that not available in the udev event?

Thanks

Michal
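For reference, the add uevent for a memory block carries only generic keys; a sketched payload (the block number and sequence number are arbitrary examples):

```
ACTION=add
DEVPATH=/devices/system/memory/memory32
SUBSYSTEM=memory
SEQNUM=2047
```

Nothing in it identifies which driver added the memory.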

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-10-04 17:50                               ` Michal Suchánek
  (?)
@ 2018-10-05  7:37                                 ` David Hildenbrand
  -1 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-10-05  7:37 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Joonsoo Kim, Dave Hansen, Heiko Carstens, Michal Hocko, linux-mm,
	Paul Mackerras, H. Peter Anvin, Rashmica Gupta, Dan Williams,
	Stephen Rothwell, Michael Neuling, Stephen Hemminger,
	Yoshinori Sato, linux-acpi, Ingo Molnar, xen-devel, Len Brown,
	Pavel Tatashin, Rob Herring, mike.travis

On 04/10/2018 19:50, Michal Suchánek wrote:
> On Thu, 4 Oct 2018 17:45:13 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 04/10/2018 17:28, Michal Suchánek wrote:
> 
>>>
>>> The state of the art is to determine what to do with hotplugged
>>> memory in userspace based on platform and virtualization type.  
>>
>> Exactly.
>>
>>>
>>> Changing the default to depend on the driver that added the memory
>>> rather than platform type should solve the issue of VMs growing
>>> different types of memory device emulation.  
>>
>> Yes, my original proposal (this patch) was to handle it in the kernel
>> for known types. But as we learned, there might be some use cases that
>> might still require to make a decision in user space.
>>
>> So providing the user space either with some type hint (auto-online
>> vs. standby) or the driver that added it (system vs. hyper-v ...)
>> would solve the issue.
> 
> Is that not available in the udev event?
> 

Not that I am aware of. Memory block "devices" have no drivers.

ls -la /sys/devices/system/memory/memory0/subsystem/drivers
total 0

(add_memory()/add_memory_resource() creates the memory block devices
when called from a driver)


> Thanks
> 
> Michal
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
@ 2018-10-05  7:37                                 ` David Hildenbrand
  0 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-10-05  7:37 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Joonsoo Kim, Dave Hansen, Heiko Carstens, Michal Hocko, linux-mm,
	Paul Mackerras, H. Peter Anvin, Rashmica Gupta, Dan Williams,
	Stephen Rothwell, Michael Neuling, Stephen Hemminger,
	Yoshinori Sato, linux-acpi, Ingo Molnar, xen-devel, Len Brown,
	Pavel Tatashin, Rob Herring, mike.travis, Haiyang Zhang,
	Philippe Ombredanne, Jonathan Neuschäfer, Nicholas Piggin,
	Martin Schwidefsky, Jérôme Glisse, Mike Rapoport,
	Borislav Petkov, Andy Lutomirski, Boris Ostrovsky, Andrew Morton,
	Oscar Salvador, Juergen Gross, Tony Luck, Mathieu Malaterre,
	linux-s390, Rafael J. Wysocki, linux-kernel, Fenghua Yu,
	Mauricio Faria de Oliveira, Thomas Gleixner, Greg Kroah-Hartman,
	Joe Perches, devel, Vitaly Kuznetsov, linuxppc-dev,
	Kirill A. Shutemov

On 04/10/2018 19:50, Michal Suchánek wrote:
> On Thu, 4 Oct 2018 17:45:13 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 04/10/2018 17:28, Michal Suchánek wrote:
> 
>>>
>>> The state of the art is to determine what to do with hotplugged
>>> memory in userspace based on platform and virtualization type.  
>>
>> Exactly.
>>
>>>
>>> Changing the default to depend on the driver that added the memory
>>> rather than platform type should solve the issue of VMs growing
>>> different types of memory device emulation.  
>>
>> Yes, my original proposal (this patch) was to handle it in the kernel
>> for known types. But as we learned, there are some use cases that
>> might still require making a decision in user space.
>>
>> So providing the user space either with some type hint (auto-online
>> vs. standby) or the driver that added it (system vs. hyper-v ...)
>> would solve the issue.
> 
> Is that not available in the udev event?
> 

Not that I am aware of. Memory block "devices" have no drivers.

ls -la /sys/devices/system/memory/memory0/subsystem/drivers
total 0

(add_memory()/add_memory_resource() creates the memory block devices
when called from a driver)


> Thanks
> 
> Michal
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-10-04 17:50                               ` Michal Suchánek
                                                 ` (2 preceding siblings ...)
  (?)
@ 2018-10-05  7:37                               ` David Hildenbrand
  -1 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-10-05  7:37 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Joonsoo Kim, Dave Hansen, Heiko Carstens, Michal Hocko, linux-mm,
	Paul Mackerras, H. Peter Anvin, Rashmica Gupta, Dan Williams,
	Stephen Rothwell, Michael Neuling, Stephen Hemminger,
	Yoshinori Sato, linux-acpi, Ingo Molnar, xen-devel, Len Brown,
	Pavel Tatashin, Rob Herring, mike.travis

On 04/10/2018 19:50, Michal Suchánek wrote:
> On Thu, 4 Oct 2018 17:45:13 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 04/10/2018 17:28, Michal Suchánek wrote:
> 
>>>
>>> The state of the art is to determine what to do with hotplugged
>>> memory in userspace based on platform and virtualization type.  
>>
>> Exactly.
>>
>>>
>>> Changing the default to depend on the driver that added the memory
>>> rather than platform type should solve the issue of VMs growing
>>> different types of memory device emulation.  
>>
>> Yes, my original proposal (this patch) was to handle it in the kernel
>> for known types. But as we learned, there are some use cases that
>> might still require making a decision in user space.
>>
>> So providing the user space either with some type hint (auto-online
>> vs. standby) or the driver that added it (system vs. hyper-v ...)
>> would solve the issue.
> 
> Is that not available in the udev event?
> 

Not that I am aware of. Memory block "devices" have no drivers.

ls -la /sys/devices/system/memory/memory0/subsystem/drivers
total 0

(add_memory()/add_memory_resource() creates the memory block devices
when called from a driver)


> Thanks
> 
> Michal
> 


-- 

Thanks,

David / dhildenb

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-09-28 15:03 ` David Hildenbrand
  (?)
@ 2018-11-23 11:13   ` David Hildenbrand
  -1 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-11-23 11:13 UTC (permalink / raw)
  To: linux-mm
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Dave Hansen,
	Heiko Carstens, Pavel Tatashin, Michal Hocko, Paul Mackerras,
	H. Peter Anvin, Rashmica Gupta, Boris Ostrovsky, linux-s390,
	Michael Neuling, Stephen Hemminger, Yoshinori Sato,
	Michael Ellerman, linux-acpi, Ingo Molnar, xen-devel

On 28.09.18 17:03, David Hildenbrand wrote:
> How to/when to online hotplugged memory is hard to manage for
> distributions because different memory types are to be treated differently.
> Right now, we need complicated udev rules that e.g. check if we are
> running on s390x, on a physical system or on a virtualized system. But
> there is also sometimes the demand to really online memory immediately
> while adding in the kernel and not to wait for user space to make a
> decision. And on virtualized systems there might be different
> requirements, depending on "how" the memory was added (and if it will
> eventually get unplugged again - DIMM vs. paravirtualized mechanisms).
> 
> On the one hand, we have physical systems where we sometimes
> want to be able to unplug memory again - e.g. a DIMM - so we have to online
> it to the MOVABLE zone optionally. That decision is usually made in user
> space.
> 
> On the other hand, we have memory that should never be onlined
> automatically, only when asked for by an administrator. Such memory only
> applies to virtualized environments like s390x, where the concept of
> "standby" memory exists. Memory is detected and added during boot, so it
> can be onlined when requested by the administrator or some tooling.
> Only when onlining, memory will be allocated in the hypervisor.
> 
> But then, we also have paravirtualized devices (namely xen and hyper-v
> balloons), that hotplug memory that will never ever be removed from a
> system right now using offline_pages/remove_memory. If at all, this memory
> is logically unplugged and handed back to the hypervisor via ballooning.
> 
> For paravirtualized devices it is relevant that memory is onlined as
> quickly as possible after adding - and that it is added to the NORMAL
> zone. Otherwise, it could happen that too much memory in a row is added
> (but not onlined), resulting in out-of-memory conditions due to the
> additional memory for "struct pages" and friends. MOVABLE zone as well
> as delays might be very problematic and lead to crashes (e.g. zone
> imbalance).
> 
> Therefore, introduce memory block types and online memory depending on
> it when adding the memory. Expose the memory type to user space, so user
> space handlers can start to process only "normal" memory. Other memory
> block types can be ignored. One thing less to worry about in user space.
> 

So I was looking into alternatives.

1. Provide only "normal" and "standby" memory types to user space. This
way user space can make smarter decisions about how to online memory.
Not really sure if this is the right way to go.


2. Use device driver information (as mentioned by Michal S.).

The problem right now is that there are no drivers for memory block
devices. The "memory" subsystem has no drivers, so the KOBJ_ADD uevent
will not contain any DRIVER information, and we have no idea what kind of
memory block device we hold in our hands.

$ udevadm info -q all -a /sys/devices/system/memory/memory0

  looking at device '/devices/system/memory/memory0':
    KERNEL=="memory0"
    SUBSYSTEM=="memory"
    DRIVER==""
    ATTR{online}=="1"
    ATTR{phys_device}=="0"
    ATTR{phys_index}=="00000000"
    ATTR{removable}=="0"
    ATTR{state}=="online"
    ATTR{valid_zones}=="none"


If we provided "fake" drivers for the memory block devices we want
to treat in a special way in user space (e.g. standby memory on s390x),
user space could use that information to make smarter decisions.

Adding such drivers might work. My suggestion would be to let ordinary
DIMMs be without a driver for now and only special case standby memory
and eventually paravirtualized memory devices (XEN and Hyper-V).
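
To make option 1 concrete, a user-space handler could branch on such a
type hint roughly as follows. This is only a sketch: the `memory_type`
strings and the `prefer_movable` knob are assumptions for illustration,
not an existing kernel interface. The returned strings do, however,
match the values a memory block's sysfs `state` file accepts today
("online", "online_kernel", "online_movable").

```python
def online_action(memory_type: str, prefer_movable: bool = False) -> str:
    """Map a hypothetical memory block 'type' attribute to an onlining
    decision. 'ignore' means: leave the block to the administrator."""
    if memory_type == "standby":
        # s390x standby memory: never online automatically.
        return "ignore"
    if memory_type == "paravirtualized":
        # XEN/Hyper-V balloons: online as fast as possible, NORMAL zone.
        return "online_kernel"
    # "normal" (e.g. a DIMM): user space may pick MOVABLE to allow unplug.
    return "online_movable" if prefer_movable else "online"
```

A handler would then write the returned value (other than "ignore") to
/sys/devices/system/memory/memoryX/state.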

Any thoughts?


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
@ 2018-11-23 11:13   ` David Hildenbrand
  0 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-11-23 11:13 UTC (permalink / raw)
  To: linux-mm
  Cc: xen-devel, devel, linux-acpi, linux-sh, linux-s390, linuxppc-dev,
	linux-kernel, linux-ia64, Tony Luck, Fenghua Yu,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Martin Schwidefsky, Heiko Carstens, Yoshinori Sato, Rich Felker,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Rafael J. Wysocki,
	Len Brown, Greg Kroah-Hartman, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrew Morton, Mike Rapoport,
	Dan Williams, Stephen Rothwell, Michal Hocko, Kirill A. Shutemov,
	Nicholas Piggin, Jonathan Neuschäfer, Joe Perches,
	Michael Neuling, Mauricio Faria de Oliveira, Balbir Singh,
	Rashmica Gupta, Pavel Tatashin, Rob Herring, Philippe Ombredanne,
	Kate Stewart, mike.travis, Joonsoo Kim, Oscar Salvador,
	Mathieu Malaterre, Michal Suchánek

On 28.09.18 17:03, David Hildenbrand wrote:
> How to/when to online hotplugged memory is hard to manage for
> distributions because different memory types are to be treated differently.
> Right now, we need complicated udev rules that e.g. check if we are
> running on s390x, on a physical system or on a virtualized system. But
> there is also sometimes the demand to really online memory immediately
> while adding in the kernel and not to wait for user space to make a
> decision. And on virtualized systems there might be different
> requirements, depending on "how" the memory was added (and if it will
> eventually get unplugged again - DIMM vs. paravirtualized mechanisms).
> 
> On the one hand, we have physical systems where we sometimes
> want to be able to unplug memory again - e.g. a DIMM - so we have to online
> it to the MOVABLE zone optionally. That decision is usually made in user
> space.
> 
> On the other hand, we have memory that should never be onlined
> automatically, only when asked for by an administrator. Such memory only
> applies to virtualized environments like s390x, where the concept of
> "standby" memory exists. Memory is detected and added during boot, so it
> can be onlined when requested by the administrator or some tooling.
> Only when onlining, memory will be allocated in the hypervisor.
> 
> But then, we also have paravirtualized devices (namely xen and hyper-v
> balloons), that hotplug memory that will never ever be removed from a
> system right now using offline_pages/remove_memory. If at all, this memory
> is logically unplugged and handed back to the hypervisor via ballooning.
> 
> For paravirtualized devices it is relevant that memory is onlined as
> quickly as possible after adding - and that it is added to the NORMAL
> zone. Otherwise, it could happen that too much memory in a row is added
> (but not onlined), resulting in out-of-memory conditions due to the
> additional memory for "struct pages" and friends. MOVABLE zone as well
> as delays might be very problematic and lead to crashes (e.g. zone
> imbalance).
> 
> Therefore, introduce memory block types and online memory depending on
> it when adding the memory. Expose the memory type to user space, so user
> space handlers can start to process only "normal" memory. Other memory
> block types can be ignored. One thing less to worry about in user space.
> 

So I was looking into alternatives.

1. Provide only "normal" and "standby" memory types to user space. This
way user space can make smarter decisions about how to online memory.
Not really sure if this is the right way to go.


2. Use device driver information (as mentioned by Michal S.).

The problem right now is that there are no drivers for memory block
devices. The "memory" subsystem has no drivers, so the KOBJ_ADD uevent
will not contain any DRIVER information, and we have no idea what kind of
memory block device we hold in our hands.

$ udevadm info -q all -a /sys/devices/system/memory/memory0

  looking at device '/devices/system/memory/memory0':
    KERNEL=="memory0"
    SUBSYSTEM=="memory"
    DRIVER==""
    ATTR{online}=="1"
    ATTR{phys_device}=="0"
    ATTR{phys_index}=="00000000"
    ATTR{removable}=="0"
    ATTR{state}=="online"
    ATTR{valid_zones}=="none"


If we provided "fake" drivers for the memory block devices we want
to treat in a special way in user space (e.g. standby memory on s390x),
user space could use that information to make smarter decisions.

Adding such drivers might work. My suggestion would be to let ordinary
DIMMs be without a driver for now and only special case standby memory
and eventually paravirtualized memory devices (XEN and Hyper-V).

Any thoughts?


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
@ 2018-11-23 11:13   ` David Hildenbrand
  0 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-11-23 11:13 UTC (permalink / raw)
  To: linux-mm
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Dave Hansen, Heiko Carstens, Pavel Tatashin, Michal Hocko,
	Paul Mackerras, H. Peter Anvin, Rashmica Gupta, K. Y. Srinivasan,
	Boris Ostrovsky, linux-s390, Michael Neuling, Stephen Hemminger,
	Yoshinori Sato, linux-acpi, Ingo Molnar, xen-devel,
	Michal Suchánek, Rob Herring, Len Brown, Fenghua Yu,
	Stephen Rothwell, mike.travis, Haiyang Zhang, Dan Williams,
	Jonathan Neuschäfer, Nicholas Piggin, Joe Perches,
	Jérôme Glisse, Mike Rapoport, Borislav Petkov,
	Andy Lutomirski, Thomas Gleixner, Joonsoo Kim, Oscar Salvador,
	Juergen Gross, Tony Luck, Mathieu Malaterre, Greg Kroah-Hartman,
	Rafael J. Wysocki, linux-kernel, Mauricio Faria de Oliveira,
	Philippe Ombredanne, Martin Schwidefsky, devel, Andrew Morton,
	linuxppc-dev, Kirill A. Shutemov

On 28.09.18 17:03, David Hildenbrand wrote:
> How to/when to online hotplugged memory is hard to manage for
> distributions because different memory types are to be treated differently.
> Right now, we need complicated udev rules that e.g. check if we are
> running on s390x, on a physical system or on a virtualized system. But
> there is also sometimes the demand to really online memory immediately
> while adding in the kernel and not to wait for user space to make a
> decision. And on virtualized systems there might be different
> requirements, depending on "how" the memory was added (and if it will
> eventually get unplugged again - DIMM vs. paravirtualized mechanisms).
> 
> On the one hand, we have physical systems where we sometimes
> want to be able to unplug memory again - e.g. a DIMM - so we have to online
> it to the MOVABLE zone optionally. That decision is usually made in user
> space.
> 
> On the other hand, we have memory that should never be onlined
> automatically, only when asked for by an administrator. Such memory only
> applies to virtualized environments like s390x, where the concept of
> "standby" memory exists. Memory is detected and added during boot, so it
> can be onlined when requested by the administrator or some tooling.
> Only when onlining, memory will be allocated in the hypervisor.
> 
> But then, we also have paravirtualized devices (namely xen and hyper-v
> balloons), that hotplug memory that will never ever be removed from a
> system right now using offline_pages/remove_memory. If at all, this memory
> is logically unplugged and handed back to the hypervisor via ballooning.
> 
> For paravirtualized devices it is relevant that memory is onlined as
> quickly as possible after adding - and that it is added to the NORMAL
> zone. Otherwise, it could happen that too much memory in a row is added
> (but not onlined), resulting in out-of-memory conditions due to the
> additional memory for "struct pages" and friends. MOVABLE zone as well
> as delays might be very problematic and lead to crashes (e.g. zone
> imbalance).
> 
> Therefore, introduce memory block types and online memory depending on
> it when adding the memory. Expose the memory type to user space, so user
> space handlers can start to process only "normal" memory. Other memory
> block types can be ignored. One thing less to worry about in user space.
> 

So I was looking into alternatives.

1. Provide only "normal" and "standby" memory types to user space. This
way user space can make smarter decisions about how to online memory.
Not really sure if this is the right way to go.


2. Use device driver information (as mentioned by Michal S.).

The problem right now is that there are no drivers for memory block
devices. The "memory" subsystem has no drivers, so the KOBJ_ADD uevent
will not contain any DRIVER information, and we have no idea what kind of
memory block device we hold in our hands.

$ udevadm info -q all -a /sys/devices/system/memory/memory0

  looking at device '/devices/system/memory/memory0':
    KERNEL=="memory0"
    SUBSYSTEM=="memory"
    DRIVER==""
    ATTR{online}=="1"
    ATTR{phys_device}=="0"
    ATTR{phys_index}=="00000000"
    ATTR{removable}=="0"
    ATTR{state}=="online"
    ATTR{valid_zones}=="none"


If we provided "fake" drivers for the memory block devices we want
to treat in a special way in user space (e.g. standby memory on s390x),
user space could use that information to make smarter decisions.

Adding such drivers might work. My suggestion would be to let ordinary
DIMMs be without a driver for now and only special case standby memory
and eventually paravirtualized memory devices (XEN and Hyper-V).

Any thoughts?


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-09-28 15:03 ` David Hildenbrand
                   ` (6 preceding siblings ...)
  (?)
@ 2018-11-23 11:13 ` David Hildenbrand
  -1 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-11-23 11:13 UTC (permalink / raw)
  To: linux-mm
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Dave Hansen,
	Heiko Carstens, Pavel Tatashin, Michal Hocko, Paul Mackerras,
	H. Peter Anvin, Rashmica Gupta, K. Y. Srinivasan,
	Boris Ostrovsky, linux-s390, Michael Neuling, Stephen Hemminger,
	Yoshinori Sato, Michael Ellerman, linux-acpi, Ingo Molnar,
	xen-devel

On 28.09.18 17:03, David Hildenbrand wrote:
> How to/when to online hotplugged memory is hard to manage for
> distributions because different memory types are to be treated differently.
> Right now, we need complicated udev rules that e.g. check if we are
> running on s390x, on a physical system or on a virtualized system. But
> there is also sometimes the demand to really online memory immediately
> while adding in the kernel and not to wait for user space to make a
> decision. And on virtualized systems there might be different
> requirements, depending on "how" the memory was added (and if it will
> eventually get unplugged again - DIMM vs. paravirtualized mechanisms).
> 
> On the one hand, we have physical systems where we sometimes
> want to be able to unplug memory again - e.g. a DIMM - so we have to online
> it to the MOVABLE zone optionally. That decision is usually made in user
> space.
> 
> On the other hand, we have memory that should never be onlined
> automatically, only when asked for by an administrator. Such memory only
> applies to virtualized environments like s390x, where the concept of
> "standby" memory exists. Memory is detected and added during boot, so it
> can be onlined when requested by the administrator or some tooling.
> Only when onlining, memory will be allocated in the hypervisor.
> 
> But then, we also have paravirtualized devices (namely xen and hyper-v
> balloons), that hotplug memory that will never ever be removed from a
> system right now using offline_pages/remove_memory. If at all, this memory
> is logically unplugged and handed back to the hypervisor via ballooning.
> 
> For paravirtualized devices it is relevant that memory is onlined as
> quickly as possible after adding - and that it is added to the NORMAL
> zone. Otherwise, it could happen that too much memory in a row is added
> (but not onlined), resulting in out-of-memory conditions due to the
> additional memory for "struct pages" and friends. MOVABLE zone as well
> as delays might be very problematic and lead to crashes (e.g. zone
> imbalance).
> 
> Therefore, introduce memory block types and online memory depending on
> it when adding the memory. Expose the memory type to user space, so user
> space handlers can start to process only "normal" memory. Other memory
> block types can be ignored. One thing less to worry about in user space.
> 

So I was looking into alternatives.

1. Provide only "normal" and "standby" memory types to user space. This
way user space can make smarter decisions about how to online memory.
Not really sure if this is the right way to go.


2. Use device driver information (as mentioned by Michal S.).

The problem right now is that there are no drivers for memory block
devices. The "memory" subsystem has no drivers, so the KOBJ_ADD uevent
will not contain any DRIVER information, and we have no idea what kind of
memory block device we hold in our hands.

$ udevadm info -q all -a /sys/devices/system/memory/memory0

  looking at device '/devices/system/memory/memory0':
    KERNEL=="memory0"
    SUBSYSTEM=="memory"
    DRIVER==""
    ATTR{online}=="1"
    ATTR{phys_device}=="0"
    ATTR{phys_index}=="00000000"
    ATTR{removable}=="0"
    ATTR{state}=="online"
    ATTR{valid_zones}=="none"


If we provided "fake" drivers for the memory block devices we want
to treat in a special way in user space (e.g. standby memory on s390x),
user space could use that information to make smarter decisions.

Adding such drivers might work. My suggestion would be to let ordinary
DIMMs be without a driver for now and only special case standby memory
and eventually paravirtualized memory devices (XEN and Hyper-V).

Any thoughts?


-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-11-23 11:13   ` David Hildenbrand
  (?)
@ 2018-11-23 18:06     ` Michal Suchánek
  -1 siblings, 0 replies; 144+ messages in thread
From: Michal Suchánek @ 2018-11-23 18:06 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Dave Hansen, Heiko Carstens, linux-mm, Michal Hocko,
	Paul Mackerras, H. Peter Anvin, Stephen Rothwell, Rashmica Gupta,
	Dan Williams, linux-s390, Michael Neuling, Stephen Hemminger,
	Yoshinori Sato, linux-acpi, Ingo Molnar, xen-devel, Len Brown,
	Pavel Tatashin, Rob Herring, mike.travis@hpe.com

On Fri, 23 Nov 2018 12:13:58 +0100
David Hildenbrand <david@redhat.com> wrote:

> On 28.09.18 17:03, David Hildenbrand wrote:
> > How to/when to online hotplugged memory is hard to manage for
> > distributions because different memory types are to be treated differently.
> > Right now, we need complicated udev rules that e.g. check if we are
> > running on s390x, on a physical system or on a virtualized system. But
> > there is also sometimes the demand to really online memory immediately
> > while adding in the kernel and not to wait for user space to make a
> > decision. And on virtualized systems there might be different
> > requirements, depending on "how" the memory was added (and if it will
> > eventually get unplugged again - DIMM vs. paravirtualized mechanisms).
> > 
> > On the one hand, we have physical systems where we sometimes
> > want to be able to unplug memory again - e.g. a DIMM - so we have to online
> > it to the MOVABLE zone optionally. That decision is usually made in user
> > space.
> > 
> > On the other hand, we have memory that should never be onlined
> > automatically, only when asked for by an administrator. Such memory only
> > applies to virtualized environments like s390x, where the concept of
> > "standby" memory exists. Memory is detected and added during boot, so it
> > can be onlined when requested by the administrator or some tooling.
> > Only when onlining, memory will be allocated in the hypervisor.
> > 
> > But then, we also have paravirtualized devices (namely xen and hyper-v
> > balloons), that hotplug memory that will never ever be removed from a
> > system right now using offline_pages/remove_memory. If at all, this memory
> > is logically unplugged and handed back to the hypervisor via ballooning.
> > 
> > For paravirtualized devices it is relevant that memory is onlined as
> > quickly as possible after adding - and that it is added to the NORMAL
> > zone. Otherwise, it could happen that too much memory in a row is added
> > (but not onlined), resulting in out-of-memory conditions due to the
> > additional memory for "struct pages" and friends. MOVABLE zone as well
> > as delays might be very problematic and lead to crashes (e.g. zone
> > imbalance).
> > 
> > Therefore, introduce memory block types and online memory depending on
> > it when adding the memory. Expose the memory type to user space, so user
> > space handlers can start to process only "normal" memory. Other memory
> > block types can be ignored. One thing less to worry about in user space.
> >   
> 
> So I was looking into alternatives.
> 
> 1. Provide only "normal" and "standby" memory types to user space. This
> way user space can make smarter decisions about how to online memory.
> Not really sure if this is the right way to go.
> 
> 
> 2. Use device driver information (as mentioned by Michal S.).
> 
> The problem right now is that there are no drivers for memory block
> devices. The "memory" subsystem has no drivers, so the KOBJ_ADD uevent
> will not contain any DRIVER information, and we have no idea what kind of
> memory block device we hold in our hands.
> 
> $ udevadm info -q all -a /sys/devices/system/memory/memory0
> 
>   looking at device '/devices/system/memory/memory0':
>     KERNEL=="memory0"
>     SUBSYSTEM=="memory"
>     DRIVER==""
>     ATTR{online}=="1"
>     ATTR{phys_device}=="0"
>     ATTR{phys_index}=="00000000"
>     ATTR{removable}=="0"
>     ATTR{state}=="online"
>     ATTR{valid_zones}=="none"
> 
> 
> If we provided "fake" drivers for the memory block devices we want
> to treat in a special way in user space (e.g. standby memory on s390x),
> user space could use that information to make smarter decisions.
> 
> Adding such drivers might work. My suggestion would be to let ordinary
> DIMMs be without a driver for now and only special case standby memory
> and eventually paravirtualized memory devices (XEN and Hyper-V).
> 
> Any thoughts?

If we are going to fake the driver information, we may as well add the
type attribute and be done with it.

I think the problem with the patch was more with the semantics than with
the attribute itself.

What is normal, paravirtualized, and standby memory?

I can understand a DIMM device, a balloon device, or whatever mechanism
for adding memory you might have.

I can understand "memory designated as standby by the cluster
administrator".

However, DIMM vs. balloon is orthogonal to standby and should not be
conflated into one property.

"paravirtualized" means nothing at all to me in relation to memory type
and the desired online policy.

Lastly, I would suggest that if you add any property, you add it to *all*
hotplugged memory. That way, user space can detect whether it can
rely on the information from your patch or not. Leaving some memory
untagged makes things needlessly vague.

Thanks

Michal

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
@ 2018-11-23 18:06     ` Michal Suchánek
  0 siblings, 0 replies; 144+ messages in thread
From: Michal Suchánek @ 2018-11-23 18:06 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, Kate Stewart, Rich Felker, linux-ia64, linux-sh,
	Peter Zijlstra, Dave Hansen, Heiko Carstens, Pavel Tatashin,
	Michal Hocko, Paul Mackerras,
H. Peter Anvin <hpa@zytor.com>,
	Rashmica Gupta <rashmica.g@gmail.com>,
	K. Y. Srinivasan <kys@microsoft.com>,
	Boris Ostrovsky, linux-s390, Michael Neuling, Stephen Hemminger,
	Yoshinori Sato, linux-acpi, Ingo Molnar, xen-devel, Rob Herring,
	Len Brown, Fenghua Yu, Stephen Rothwell, mike.travis,
	Haiyang Zhang, Dan Williams, Jonathan Neuschäfer,
	Nicholas Piggin, Joe Perches, Jérôme Glisse,
	Mike Rapoport, Borislav Petkov, Andy Lutomirski, Thomas Gleixner,
	Joonsoo Kim, Oscar Salvador, Juergen Gross, Tony Luck,
	Mathieu Malaterre, Greg Kroah-Hartman, Rafael J. Wysocki,
	linux-kernel, Mauricio Faria de Oliveira, Philippe Ombredanne,
	Martin Schwidefsky, devel, Andrew Morton, linuxppc-dev

On Fri, 23 Nov 2018 12:13:58 +0100
David Hildenbrand <david@redhat.com> wrote:

> On 28.09.18 17:03, David Hildenbrand wrote:
> > How to/when to online hotplugged memory is hard to manage for
> > distributions because different memory types are to be treated differently.
> > Right now, we need complicated udev rules that e.g. check if we are
> > running on s390x, on a physical system or on a virtualized system. But
> > there is also sometimes the demand to really online memory immediately
> > while adding in the kernel and not to wait for user space to make a
> > decision. And on virtualized systems there might be different
> > requirements, depending on "how" the memory was added (and if it will
> > eventually get unplugged again - DIMM vs. paravirtualized mechanisms).
> > 
> > On the one hand, we have physical systems where we sometimes
> > want to be able to unplug memory again - e.g. a DIMM - so we have to online
> > it to the MOVABLE zone optionally. That decision is usually made in user
> > space.
> > 
> > On the other hand, we have memory that should never be onlined
> > automatically, only when asked for by an administrator. Such memory only
> > applies to virtualized environments like s390x, where the concept of
> > "standby" memory exists. Memory is detected and added during boot, so it
> > can be onlined when requested by the administrator or some tooling.
> > Only when onlining, memory will be allocated in the hypervisor.
> > 
> > But then, we also have paravirtualized devices (namely xen and hyper-v
> > balloons), that hotplug memory that will never ever be removed from a
> > system right now using offline_pages/remove_memory. If at all, this memory
> > is logically unplugged and handed back to the hypervisor via ballooning.
> > 
> > For paravirtualized devices it is relevant that memory is onlined as
> > quickly as possible after adding - and that it is added to the NORMAL
> > zone. Otherwise, it could happen that too much memory in a row is added
> > (but not onlined), resulting in out-of-memory conditions due to the
> > additional memory for "struct pages" and friends. MOVABLE zone as well
> > as delays might be very problematic and lead to crashes (e.g. zone
> > imbalance).
> > 
> > Therefore, introduce memory block types and online memory depending on
> > it when adding the memory. Expose the memory type to user space, so user
> > space handlers can start to process only "normal" memory. Other memory
> > block types can be ignored. One thing less to worry about in user space.
> >   
> 
> So I was looking into alternatives.
> 
> 1. Provide only "normal" and "standby" memory types to user space. This
> way user space can make smarter decisions about how to online memory.
> Not really sure if this is the right way to go.
> 
> 
> 2. Use device driver information (as mentioned by Michal S.).
> 
> The problem right now is that there are no drivers for memory block
> devices. The "memory" subsystem has no drivers, so the KOBJ_ADD uevent
> will not contain "DRIVER" information and we have no idea what kind of
> memory block device we hold in our hands.
> 
> $ udevadm info -q all -a /sys/devices/system/memory/memory0
> 
>   looking at device '/devices/system/memory/memory0':
>     KERNEL=="memory0"
>     SUBSYSTEM=="memory"
>     DRIVER==""
>     ATTR{online}=="1"
>     ATTR{phys_device}=="0"
>     ATTR{phys_index}=="00000000"
>     ATTR{removable}=="0"
>     ATTR{state}=="online"
>     ATTR{valid_zones}=="none"
> 
> 
> If we would provide "fake" drivers for the memory block devices we want
> to treat in a special way in user space (e.g. standby memory on s390x),
> user space could use that information to make smarter decisions.
> 
> Adding such drivers might work. My suggestion would be to leave ordinary
> DIMMs without a driver for now and only special-case standby memory
> and eventually paravirtualized memory devices (XEN and Hyper-V).
> 
> Any thoughts?

If we are going to fake the driver information we may as well add the
type attribute and be done with it.

I think the problem with the patch was more with the semantics than the
attribute itself.

What is normal, paravirtualized, and standby memory?

I can understand a DIMM device, a balloon device, or whatever mechanism
for adding memory you might have.

I can understand "memory designated as standby by the cluster
administrator".

However, DIMM vs. balloon is orthogonal to standby and should not be
conflated into one property.

To me, "paravirtualized" says nothing at all about the memory type or
the desired online policy.

Lastly, I would suggest that if you add any property, you add it to *all*
memory that is hotplugged. That way user space can detect whether it can
rely on the information from your patch or not. Leaving some memory
untagged makes things needlessly vague.

Thanks

Michal
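
[Editorial note: for reference, the kind of udev rule the thread calls
"complicated" looks roughly like the sketch below. It is illustrative
only, not taken from any particular distribution's rules.]

```
# Sketch of the kind of rule distributions ship today: online newly added
# memory blocks from user space, movable so they can be unplugged again.
# The hard part (omitted here) is excluding cases like s390x standby
# memory, which must not be onlined automatically.
SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online_movable"
```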

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-11-23 18:06     ` Michal Suchánek
  (?)
@ 2018-11-26 12:30       ` David Hildenbrand
  -1 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-11-26 12:30 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Dave Hansen, Heiko Carstens, linux-mm, Michal Hocko,
	Paul Mackerras, H. Peter Anvin, Stephen Rothwell, Rashmica Gupta,
	Dan Williams, linux-s390, Michael Neuling, Stephen Hemminger,
	Yoshinori Sato, linux-acpi, Ingo Molnar, xen-devel, Len Brown,
	Pavel Tatashin, Rob Herring, mike.travis@hpe.com

On 23.11.18 19:06, Michal Suchánek wrote:
> On Fri, 23 Nov 2018 12:13:58 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 28.09.18 17:03, David Hildenbrand wrote:
>>> How to/when to online hotplugged memory is hard to manage for
>>> distributions because different memory types are to be treated differently.
>>> Right now, we need complicated udev rules that e.g. check if we are
>>> running on s390x, on a physical system or on a virtualized system. But
>>> there is also sometimes the demand to online memory immediately when
>>> it is added in the kernel, and not to wait for user space to make a
>>> decision. And on virtualized systems there might be different
>>> requirements, depending on "how" the memory was added (and if it will
>>> eventually get unplugged again - DIMM vs. paravirtualized mechanisms).
>>>
>>> On the one hand, we have physical systems where we sometimes
>>> want to be able to unplug memory again - e.g. a DIMM - so we have to online
>>> it to the MOVABLE zone optionally. That decision is usually made in user
>>> space.
>>>
>>> On the other hand, we have memory that should never be onlined
>>> automatically, only when asked for by an administrator. Such memory only
>>> applies to virtualized environments like s390x, where the concept of
>>> "standby" memory exists. Memory is detected and added during boot, so it
>>> can be onlined when requested by the administrator or some tooling.
>>> Only when onlining will memory actually be allocated in the hypervisor.
>>>
>>> But then, we also have paravirtualized devices (namely xen and hyper-v
>>> balloons), that hotplug memory that will never ever be removed from a
>>> system right now using offline_pages/remove_memory. If at all, this memory
>>> is logically unplugged and handed back to the hypervisor via ballooning.
>>>
>>> For paravirtualized devices it is relevant that memory is onlined as
>>> quickly as possible after adding - and that it is added to the NORMAL
>>> zone. Otherwise, it could happen that too much memory in a row is added
>>> (but not onlined), resulting in out-of-memory conditions due to the
>>> additional memory for "struct pages" and friends. MOVABLE zone as well
>>> as delays might be very problematic and lead to crashes (e.g. zone
>>> imbalance).
>>>
>>> Therefore, introduce memory block types and online memory depending on
>>> it when adding the memory. Expose the memory type to user space, so user
>>> space handlers can start to process only "normal" memory. Other memory
>>> block types can be ignored. One thing less to worry about in user space.
>>>   
>>
>> So I was looking into alternatives.
>>
>> 1. Provide only "normal" and "standby" memory types to user space. This
>> way user space can make smarter decisions about how to online memory.
>> Not really sure if this is the right way to go.
>>
>>
>> 2. Use device driver information (as mentioned by Michal S.).
>>
>> The problem right now is that there are no drivers for memory block
>> devices. The "memory" subsystem has no drivers, so the KOBJ_ADD uevent
>> will not contain "DRIVER" information and we have no idea what kind of
>> memory block device we hold in our hands.
>>
>> $ udevadm info -q all -a /sys/devices/system/memory/memory0
>>
>>   looking at device '/devices/system/memory/memory0':
>>     KERNEL=="memory0"
>>     SUBSYSTEM=="memory"
>>     DRIVER==""
>>     ATTR{online}=="1"
>>     ATTR{phys_device}=="0"
>>     ATTR{phys_index}=="00000000"
>>     ATTR{removable}=="0"
>>     ATTR{state}=="online"
>>     ATTR{valid_zones}=="none"
>>
>>
>> If we would provide "fake" drivers for the memory block devices we want
>> to treat in a special way in user space (e.g. standby memory on s390x),
>> user space could use that information to make smarter decisions.
>>
>> Adding such drivers might work. My suggestion would be to leave ordinary
>> DIMMs without a driver for now and only special-case standby memory
>> and eventually paravirtualized memory devices (XEN and Hyper-V).
>>
>> Any thoughts?
> 
> If we are going to fake the driver information we may as well add the
> type attribute and be done with it.
> 
> I think the problem with the patch was more with the semantics than the
> attribute itself.
> 
> What is normal, paravirtualized, and standby memory?
> 
> I can understand a DIMM device, a balloon device, or whatever mechanism
> for adding memory you might have.
> 
> I can understand "memory designated as standby by the cluster
> administrator".
> 
> However, DIMM vs. balloon is orthogonal to standby and should not be
> conflated into one property.
> 
> To me, "paravirtualized" says nothing at all about the memory type or
> the desired online policy.

Right, so whatever we come up with, it should allow user space to decide
- whether memory is to be onlined automatically
- to which zone memory is to be onlined

The rules are encoded in user space; the type will allow making that
decision. One important part will be whether the memory can eventually be
offlined and removed again (DIMM-style unplug) vs. unplug being handled
balloon-style.

As we learned, some use cases might require onlining balloon memory to
the movable zone, e.g. in order to make better use of huge pages. This
has to be handled in user space.

I'll think about possible types.
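
[Editorial note: as a sketch of what such types could buy user space,
a handler could reduce its decision to a table lookup. The type names
and the policy mapping below are hypothetical - no such attribute
exists in the kernel at this point in the thread.]

```python
# Hypothetical sketch: map a proposed per-memory-block "type" attribute to
# the value user space would write to /sys/devices/system/memory/memoryN/state.
# The type names and policies are illustrative only, not part of any patch.
ONLINE_POLICY = {
    "dimm": "online_movable",    # may be unplugged again -> MOVABLE zone
    "balloon": "online_kernel",  # online quickly, NORMAL zone
    "standby": None,             # never online automatically (s390x)
}

def decide_online(block_type):
    """Return the state to request for a memory block, or None to skip it."""
    return ONLINE_POLICY.get(block_type)
```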

> 
> Lastly I would suggest if you add any property you add it to *all*
> memory that is hotplugged. That way the userspace can detect if it can
> rely on the information from your patch or not. Leaving some memory
> untagged makes things needlessly vague.

Yes, that makes sense.

Thanks!

> 
> Thanks
> 
> Michal
> 


-- 

Thanks,

David / dhildenb
_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
@ 2018-11-26 12:30       ` David Hildenbrand
  0 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-11-26 12:30 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: linux-mm, Kate Stewart, Rich Felker, linux-ia64, linux-sh,
	Peter Zijlstra, Dave Hansen, Heiko Carstens, Pavel Tatashin,
	Michal Hocko, Paul Mackerras, H. Peter Anvin, Rashmica Gupta,
	K. Y. Srinivasan, Boris Ostrovsky, linux-s390, Michael Neuling,
	Stephen Hemminger, Yoshinori Sato, linux-acpi, Ingo Molnar,
	xen-devel, Rob Herring, Len Brown, Fenghua Yu, Stephen Rothwell,
	mike.travis, Haiyang Zhang, Dan Williams,
	Jonathan Neuschäfer, Nicholas Piggin, Joe Perches,
	Jérôme Glisse, Mike Rapoport, Borislav Petkov,
	Andy Lutomirski, Thomas Gleixner, Joonsoo Kim, Oscar Salvador,
	Juergen Gross, Tony Luck, Mathieu Malaterre, Greg Kroah-Hartman,
	Rafael J. Wysocki, linux-kernel, Mauricio Faria de Oliveira,
	Philippe Ombredanne, Martin Schwidefsky, devel, Andrew Morton,
	linuxppc-dev, Kirill A. Shutemov

On 23.11.18 19:06, Michal Suchánek wrote:
> On Fri, 23 Nov 2018 12:13:58 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 28.09.18 17:03, David Hildenbrand wrote:
>>> How to/when to online hotplugged memory is hard to manage for
>>> distributions because different memory types are to be treated differently.
>>> Right now, we need complicated udev rules that e.g. check if we are
>>> running on s390x, on a physical system or on a virtualized system. But
>>> there is also sometimes the demand to really online memory immediately
>>> while adding in the kernel and not to wait for user space to make a
>>> decision. And on virtualized systems there might be different
>>> requirements, depending on "how" the memory was added (and if it will
>>> eventually get unplugged again - DIMM vs. paravirtualized mechanisms).
>>>
>>> On the one hand, we have physical systems where we sometimes
>>> want to be able to unplug memory again - e.g. a DIMM - so we have to online
>>> it to the MOVABLE zone optionally. That decision is usually made in user
>>> space.
>>>
>>> On the other hand, we have memory that should never be onlined
>>> automatically, only when asked for by an administrator. Such memory only
>>> applies to virtualized environments like s390x, where the concept of
>>> "standby" memory exists. Memory is detected and added during boot, so it
>>> can be onlined when requested by the admininistrator or some tooling.
>>> Only when onlining, memory will be allocated in the hypervisor.
>>>
>>> But then, we also have paravirtualized devices (namely xen and hyper-v
>>> balloons), that hotplug memory that will never ever be removed from a
>>> system right now using offline_pages/remove_memory. If at all, this memory
>>> is logically unplugged and handed back to the hypervisor via ballooning.
>>>
>>> For paravirtualized devices it is relevant that memory is onlined as
>>> quickly as possible after adding - and that it is added to the NORMAL
>>> zone. Otherwise, it could happen that too much memory in a row is added
>>> (but not onlined), resulting in out-of-memory conditions due to the
>>> additional memory for "struct pages" and friends. MOVABLE zone as well
>>> as delays might be very problematic and lead to crashes (e.g. zone
>>> imbalance).
>>>
>>> Therefore, introduce memory block types and online memory depending on
>>> it when adding the memory. Expose the memory type to user space, so user
>>> space handlers can start to process only "normal" memory. Other memory
>>> block types can be ignored. One thing less to worry about in user space.
>>>   
>>
>> So I was looking into alternatives.
>>
>> 1. Provide only "normal" and "standby" memory types to user space. This
>> way user space can make smarter decisions about how to online memory.
>> Not really sure if this is the right way to go.
>>
>>
>> 2. Use device driver information (as mentioned by Michal S.).
>>
>> The problem right now is that there are no drivers for memory block
>> devices. The "memory" subsystem has no drivers, so the KOBJ_ADD uevent
>> will not contain a "DRIVER" information and we ave no idea what kind of
>> memory block device we hold in our hands.
>>
>> $ udevadm info -q all -a /sys/devices/system/memory/memory0
>>
>>   looking at device '/devices/system/memory/memory0':
>>     KERNEL=="memory0"
>>     SUBSYSTEM=="memory"
>>     DRIVER==""
>>     ATTR{online}=="1"
>>     ATTR{phys_device}=="0"
>>     ATTR{phys_index}=="00000000"
>>     ATTR{removable}=="0"
>>     ATTR{state}=="online"
>>     ATTR{valid_zones}=="none"
>>
>>
>> If we would provide "fake" drivers for the memory block devices we want
>> to treat in a special way in user space (e.g. standby memory on s390x),
>> user space could use that information to make smarter decisions.
>>
>> Adding such drivers might work. My suggestion would be to let ordinary
>> DIMMs be without a driver for now and only special case standby memory
>> and eventually paravirtualized memory devices (XEN and Hyper-V).
>>
>> Any thoughts?
> 
> If we are going to fake the driver information we may as well add the
> type attribute and be done with it.
> 
> I think the problem with the patch was more with the semantic than the
> attribute itself.
> 
> What is normal, paravirtualized, and standby memory?
> 
> I can understand DIMM device, baloon device, or whatever mechanism for
> adding memory you might have.
> 
> I can understand "memory designated as standby by the cluster
> administrator".
> 
> However, DIMM vs baloon is orthogonal to standby and should not be
> conflated into one property.
> 
> paravirtualized means nothing at all in relationship to memory type and
> the desired online policy to me.

Right, so with whatever we come up, it should allow to make a decision
in user space about
- if memory is to be onlined automatically
- to which zone memory is to be onlined

The rules are encoded in user space, the type will allow to make a
decision. One important part will be if the memory can eventually be
offlined + removed again (DIMM style unplug) vs. memory unplug is
handled balloon-style.

As we learned, some use cases might require to e.g. online balloon
memory to the movable zone in order to make better use of huge pages.
This has to be handled in user space.

I'll think about possible types.

> 
> Lastly I would suggest if you add any property you add it to *all*
> memory that is hotplugged. That way the userspace can detect if it can
> rely on the information from your patch or not. Leaving some memory
> untagged makes things needlessly vague.

Yes, that makes sense.

Thanks!

> 
> Thanks
> 
> Michal
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
@ 2018-11-26 12:30       ` David Hildenbrand
  0 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-11-26 12:30 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Dave Hansen, Heiko Carstens, linux-mm, Michal Hocko,
	Paul Mackerras, H. Peter Anvin, Stephen Rothwell, Rashmica Gupta,
	K. Y. Srinivasan, Dan Williams, linux-s390, Michael Neuling,
	Stephen Hemminger, Yoshinori Sato, linux-acpi, Ingo Molnar,
	xen-devel, Len Brown, Pavel Tatashin, Rob Herring, mike.travis,
	Haiyang Zhang, Jonathan Neuschäfer, Nicholas Piggin,
	Martin Schwidefsky, Jérôme Glisse, Mike Rapoport,
	Borislav Petkov, Andy Lutomirski, Boris Ostrovsky, Andrew Morton,
	Oscar Salvador, Juergen Gross, Tony Luck, Mathieu Malaterre,
	Greg Kroah-Hartman, Rafael J. Wysocki, linux-kernel, Fenghua Yu,
	Mauricio Faria de Oliveira, Thomas Gleixner, Philippe Ombredanne,
	Joe Perches, devel, Joonsoo Kim, linuxppc-dev,
	Kirill A. Shutemov

On 23.11.18 19:06, Michal Suchánek wrote:
> On Fri, 23 Nov 2018 12:13:58 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 28.09.18 17:03, David Hildenbrand wrote:
>>> How to/when to online hotplugged memory is hard to manage for
>>> distributions because different memory types are to be treated differently.
>>> Right now, we need complicated udev rules that e.g. check if we are
>>> running on s390x, on a physical system or on a virtualized system. But
>>> there is also sometimes the demand to really online memory immediately
>>> while adding in the kernel and not to wait for user space to make a
>>> decision. And on virtualized systems there might be different
>>> requirements, depending on "how" the memory was added (and if it will
>>> eventually get unplugged again - DIMM vs. paravirtualized mechanisms).
>>>
>>> On the one hand, we have physical systems where we sometimes
>>> want to be able to unplug memory again - e.g. a DIMM - so we have to online
>>> it to the MOVABLE zone optionally. That decision is usually made in user
>>> space.
>>>
>>> On the other hand, we have memory that should never be onlined
>>> automatically, only when asked for by an administrator. Such memory only
>>> applies to virtualized environments like s390x, where the concept of
>>> "standby" memory exists. Memory is detected and added during boot, so it
>>> can be onlined when requested by the admininistrator or some tooling.
>>> Only when onlining, memory will be allocated in the hypervisor.
>>>
>>> But then, we also have paravirtualized devices (namely xen and hyper-v
>>> balloons), that hotplug memory that will never ever be removed from a
>>> system right now using offline_pages/remove_memory. If at all, this memory
>>> is logically unplugged and handed back to the hypervisor via ballooning.
>>>
>>> For paravirtualized devices it is relevant that memory is onlined as
>>> quickly as possible after adding - and that it is added to the NORMAL
>>> zone. Otherwise, it could happen that too much memory in a row is added
>>> (but not onlined), resulting in out-of-memory conditions due to the
>>> additional memory for "struct pages" and friends. MOVABLE zone as well
>>> as delays might be very problematic and lead to crashes (e.g. zone
>>> imbalance).
>>>
>>> Therefore, introduce memory block types and online memory depending on
>>> it when adding the memory. Expose the memory type to user space, so user
>>> space handlers can start to process only "normal" memory. Other memory
>>> block types can be ignored. One thing less to worry about in user space.
>>>   
>>
>> So I was looking into alternatives.
>>
>> 1. Provide only "normal" and "standby" memory types to user space. This
>> way user space can make smarter decisions about how to online memory.
>> Not really sure if this is the right way to go.
>>
>>
>> 2. Use device driver information (as mentioned by Michal S.).
>>
>> The problem right now is that there are no drivers for memory block
>> devices. The "memory" subsystem has no drivers, so the KOBJ_ADD uevent
>> will not contain "DRIVER" information, and we have no idea what kind of
>> memory block device we hold in our hands.
>>
>> $ udevadm info -q all -a /sys/devices/system/memory/memory0
>>
>>   looking at device '/devices/system/memory/memory0':
>>     KERNEL=="memory0"
>>     SUBSYSTEM=="memory"
>>     DRIVER==""
>>     ATTR{online}=="1"
>>     ATTR{phys_device}=="0"
>>     ATTR{phys_index}=="00000000"
>>     ATTR{removable}=="0"
>>     ATTR{state}=="online"
>>     ATTR{valid_zones}=="none"
>>
>>
>> If we would provide "fake" drivers for the memory block devices we want
>> to treat in a special way in user space (e.g. standby memory on s390x),
>> user space could use that information to make smarter decisions.
>>
>> Adding such drivers might work. My suggestion would be to let ordinary
>> DIMMs be without a driver for now and only special case standby memory
>> and eventually paravirtualized memory devices (XEN and Hyper-V).
>>
>> Any thoughts?
> 
> If we are going to fake the driver information we may as well add the
> type attribute and be done with it.
> 
>> I think the problem with the patch was more with the semantics than the
> attribute itself.
> 
> What is normal, paravirtualized, and standby memory?
> 
> I can understand a DIMM device, a balloon device, or whatever mechanism for
> adding memory you might have.
> 
> I can understand "memory designated as standby by the cluster
> administrator".
> 
> However, DIMM vs. balloon is orthogonal to standby and should not be
> conflated into one property.
> 
> "paravirtualized" means nothing at all to me in relation to memory type
> and the desired online policy.

Right, so whatever we come up with should allow user space to decide
- whether memory is to be onlined automatically
- to which zone memory is to be onlined

The rules are encoded in user space; the type will allow that decision to
be made. One important part will be whether the memory can eventually be
offlined and removed again (DIMM-style unplug) or whether unplug is
handled balloon-style.

As we learned, some use cases might require onlining balloon memory to
the movable zone, e.g. in order to make better use of huge pages. This
has to be handled in user space.

I'll think about possible types.
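A user-space handler implementing such rules might look roughly like the
following sketch. This is an assumption-laden illustration, not an existing
interface: the type strings ("standby", "balloon", "dimm") are hypothetical,
while "online", "online_kernel" and "online_movable" are the existing sysfs
state values requesting the default, NORMAL and MOVABLE zones respectively.

```python
# Hypothetical user-space onlining policy, e.g. invoked from a udev rule.
# The mem_type values are placeholders; no such kernel attribute exists yet.

def onlining_policy(mem_type, want_unplug=False):
    """Return (online_now, state) for a hotplugged memory block.

    `state` is what a handler would write to
    /sys/devices/system/memory/memoryN/state, or None to leave it offline.
    """
    if mem_type == "standby":
        # s390x standby memory: never online automatically; onlining is
        # what actually allocates the memory in the hypervisor.
        return (False, None)
    if mem_type == "balloon":
        # Balloon-style memory is never removed block-wise via
        # offline_pages/remove_memory, so online it quickly and as
        # !movable (NORMAL zone) to avoid zone imbalance.
        return (True, "online_kernel")
    # Ordinary DIMM: online to MOVABLE only if we may want to unplug later.
    return (True, "online_movable" if want_unplug else "online")
```

The point is that the kernel only needs to expose enough type information
for such a table-driven decision; the policy itself stays in user space.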

> 
> Lastly, I would suggest that if you add any property, you add it to *all*
> memory that is hotplugged. That way user space can detect whether it can
> rely on the information from your patch or not. Leaving some memory
> untagged makes things needlessly vague.

Yes, that makes sense.

Thanks!

> 
> Thanks
> 
> Michal
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-11-26 12:30       ` David Hildenbrand
  (?)
@ 2018-11-26 13:33         ` David Hildenbrand
  -1 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-11-26 13:33 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Dave Hansen, Heiko Carstens, linux-mm, Michal Hocko,
	Paul Mackerras, H. Peter Anvin, Stephen Rothwell, Rashmica Gupta,
	Dan Williams, linux-s390, Michael Neuling, Stephen Hemminger,
	Yoshinori Sato, linux-acpi, Ingo Molnar, xen-devel, Len Brown,
	Pavel Tatashin, Rob Herring, mike.travis@hpe.com

On 26.11.18 13:30, David Hildenbrand wrote:
> On 23.11.18 19:06, Michal Suchánek wrote:
>> On Fri, 23 Nov 2018 12:13:58 +0100
>> David Hildenbrand <david@redhat.com> wrote:
>>
>>> On 28.09.18 17:03, David Hildenbrand wrote:
>>>> How to/when to online hotplugged memory is hard to manage for
>>>> distributions because different memory types are to be treated differently.
>>>> Right now, we need complicated udev rules that e.g. check if we are
>>>> running on s390x, on a physical system or on a virtualized system. But
>>>> there is also sometimes the demand to really online memory immediately
>>>> while adding in the kernel and not to wait for user space to make a
>>>> decision. And on virtualized systems there might be different
>>>> requirements, depending on "how" the memory was added (and if it will
>>>> eventually get unplugged again - DIMM vs. paravirtualized mechanisms).
>>>>
>>>> On the one hand, we have physical systems where we sometimes
>>>> want to be able to unplug memory again - e.g. a DIMM - so we have to online
>>>> it to the MOVABLE zone optionally. That decision is usually made in user
>>>> space.
>>>>
>>>> On the other hand, we have memory that should never be onlined
>>>> automatically, only when asked for by an administrator. Such memory only
>>>> applies to virtualized environments like s390x, where the concept of
>>>> "standby" memory exists. Memory is detected and added during boot, so it
>>>> can be onlined when requested by the administrator or some tooling.
>>>> Only upon onlining will memory be allocated in the hypervisor.
>>>>
>>>> But then, we also have paravirtualized devices (namely xen and hyper-v
>>>> balloons), that hotplug memory that will never ever be removed from a
>>>> system right now using offline_pages/remove_memory. If at all, this memory
>>>> is logically unplugged and handed back to the hypervisor via ballooning.
>>>>
>>>> For paravirtualized devices it is relevant that memory is onlined as
>>>> quickly as possible after adding - and that it is added to the NORMAL
>>>> zone. Otherwise, it could happen that too much memory in a row is added
>>>> (but not onlined), resulting in out-of-memory conditions due to the
>>>> additional memory for "struct pages" and friends. MOVABLE zone as well
>>>> as delays might be very problematic and lead to crashes (e.g. zone
>>>> imbalance).
>>>>
>>>> Therefore, introduce memory block types and online memory depending on
>>>> it when adding the memory. Expose the memory type to user space, so user
>>>> space handlers can start to process only "normal" memory. Other memory
>>>> block types can be ignored. One thing less to worry about in user space.
>>>>   
>>>
>>> So I was looking into alternatives.
>>>
>>> 1. Provide only "normal" and "standby" memory types to user space. This
>>> way user space can make smarter decisions about how to online memory.
>>> Not really sure if this is the right way to go.
>>>
>>>
>>> 2. Use device driver information (as mentioned by Michal S.).
>>>
>>> The problem right now is that there are no drivers for memory block
>>> devices. The "memory" subsystem has no drivers, so the KOBJ_ADD uevent
>>> will not contain "DRIVER" information, and we have no idea what kind of
>>> memory block device we hold in our hands.
>>>
>>> $ udevadm info -q all -a /sys/devices/system/memory/memory0
>>>
>>>   looking at device '/devices/system/memory/memory0':
>>>     KERNEL=="memory0"
>>>     SUBSYSTEM=="memory"
>>>     DRIVER==""
>>>     ATTR{online}=="1"
>>>     ATTR{phys_device}=="0"
>>>     ATTR{phys_index}=="00000000"
>>>     ATTR{removable}=="0"
>>>     ATTR{state}=="online"
>>>     ATTR{valid_zones}=="none"
>>>
>>>
>>> If we would provide "fake" drivers for the memory block devices we want
>>> to treat in a special way in user space (e.g. standby memory on s390x),
>>> user space could use that information to make smarter decisions.
>>>
>>> Adding such drivers might work. My suggestion would be to let ordinary
>>> DIMMs be without a driver for now and only special case standby memory
>>> and eventually paravirtualized memory devices (XEN and Hyper-V).
>>>
>>> Any thoughts?
>>
>> If we are going to fake the driver information we may as well add the
>> type attribute and be done with it.
>>
>> I think the problem with the patch was more with the semantics than the
>> attribute itself.
>>
>> What is normal, paravirtualized, and standby memory?
>>
>> I can understand a DIMM device, a balloon device, or whatever mechanism for
>> adding memory you might have.
>>
>> I can understand "memory designated as standby by the cluster
>> administrator".
>>
>> However, DIMM vs. balloon is orthogonal to standby and should not be
>> conflated into one property.
>>
>> "paravirtualized" means nothing at all to me in relation to memory type
>> and the desired online policy.
> 
> Right, so whatever we come up with should allow user space to decide
> - whether memory is to be onlined automatically

And I will think about whether we really should model standby memory.
Maybe it is better to have something like this in user space (as Dan noted):

if (isS390x() && type == "dimm") {
	/* don't online; on s390x systems, DIMMs are standby memory */
}

Then, in addition, we could have

if (type == "balloon") {
	/*
	 * Balloon will not be unplugged by offlining the whole block at
	 * once, online as !movable.
	 */
}

But I'll have to think about the wording / types etc. (I like neither
"dimm" nor "balloon").
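For completeness, the two fragments above can be folded into one runnable
sketch. The type strings remain placeholders, as noted; "online" and
"online_kernel" are the existing sysfs state values:

```python
def decide_state(arch, mem_type):
    """Combine the two checks above into one decision.

    Returns the sysfs state to write for the memory block, or None to
    leave the block offline.
    """
    if arch == "s390x" and mem_type == "dimm":
        # Don't online; on s390x systems, DIMMs are standby memory.
        return None
    if mem_type == "balloon":
        # Balloon memory will not be unplugged by offlining a whole
        # block at once, so online it as !movable (NORMAL zone).
        return "online_kernel"
    # Everything else: online, leaving the zone choice to existing policy.
    return "online"
```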

-- 

Thanks,

David / dhildenb
_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
@ 2018-11-26 13:33         ` David Hildenbrand
  0 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-11-26 13:33 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: linux-mm, Kate Stewart, Rich Felker, linux-ia64, linux-sh,
	Peter Zijlstra, Dave Hansen, Heiko Carstens, Pavel Tatashin,
	Michal Hocko, Paul Mackerras, H. Peter Anvin, Rashmica Gupta,
	K. Y. Srinivasan, Boris Ostrovsky, linux-s390, Michael Neuling,
	Stephen Hemminger, Yoshinori Sato, linux-acpi, Ingo Molnar,
	xen-devel, Rob Herring, Len Brown, Fenghua Yu, Stephen Rothwell,
	mike.travis, Haiyang Zhang, Dan Williams,
	Jonathan Neuschäfer, Nicholas Piggin, Joe Perches,
	Jérôme Glisse, Mike Rapoport, Borislav Petkov,
	Andy Lutomirski, Thomas Gleixner, Joonsoo Kim, Oscar Salvador,
	Juergen Gross, Tony Luck, Mathieu Malaterre, Greg Kroah-Hartman,
	Rafael J. Wysocki, linux-kernel, Mauricio Faria de Oliveira,
	Philippe Ombredanne, Martin Schwidefsky, devel, Andrew Morton,
	linuxppc-dev, Kirill A. Shutemov

On 26.11.18 13:30, David Hildenbrand wrote:
> On 23.11.18 19:06, Michal Suchánek wrote:
>> On Fri, 23 Nov 2018 12:13:58 +0100
>> David Hildenbrand <david@redhat.com> wrote:
>>
>>> On 28.09.18 17:03, David Hildenbrand wrote:
>>>> How to/when to online hotplugged memory is hard to manage for
>>>> distributions because different memory types are to be treated differently.
>>>> Right now, we need complicated udev rules that e.g. check if we are
>>>> running on s390x, on a physical system or on a virtualized system. But
>>>> there is also sometimes the demand to really online memory immediately
>>>> while adding in the kernel and not to wait for user space to make a
>>>> decision. And on virtualized systems there might be different
>>>> requirements, depending on "how" the memory was added (and if it will
>>>> eventually get unplugged again - DIMM vs. paravirtualized mechanisms).
>>>>
>>>> On the one hand, we have physical systems where we sometimes
>>>> want to be able to unplug memory again - e.g. a DIMM - so we have to online
>>>> it to the MOVABLE zone optionally. That decision is usually made in user
>>>> space.
>>>>
>>>> On the other hand, we have memory that should never be onlined
>>>> automatically, only when asked for by an administrator. Such memory only
>>>> applies to virtualized environments like s390x, where the concept of
>>>> "standby" memory exists. Memory is detected and added during boot, so it
>>>> can be onlined when requested by the admininistrator or some tooling.
>>>> Only when onlining, memory will be allocated in the hypervisor.
>>>>
>>>> But then, we also have paravirtualized devices (namely xen and hyper-v
>>>> balloons), that hotplug memory that will never ever be removed from a
>>>> system right now using offline_pages/remove_memory. If at all, this memory
>>>> is logically unplugged and handed back to the hypervisor via ballooning.
>>>>
>>>> For paravirtualized devices it is relevant that memory is onlined as
>>>> quickly as possible after adding - and that it is added to the NORMAL
>>>> zone. Otherwise, it could happen that too much memory in a row is added
>>>> (but not onlined), resulting in out-of-memory conditions due to the
>>>> additional memory for "struct pages" and friends. MOVABLE zone as well
>>>> as delays might be very problematic and lead to crashes (e.g. zone
>>>> imbalance).
>>>>
>>>> Therefore, introduce memory block types and online memory depending on
>>>> it when adding the memory. Expose the memory type to user space, so user
>>>> space handlers can start to process only "normal" memory. Other memory
>>>> block types can be ignored. One thing less to worry about in user space.
>>>>   
>>>
>>> So I was looking into alternatives.
>>>
>>> 1. Provide only "normal" and "standby" memory types to user space. This
>>> way user space can make smarter decisions about how to online memory.
>>> Not really sure if this is the right way to go.
>>>
>>>
>>> 2. Use device driver information (as mentioned by Michal S.).
>>>
>>> The problem right now is that there are no drivers for memory block
>>> devices. The "memory" subsystem has no drivers, so the KOBJ_ADD uevent
>>> will not contain a "DRIVER" information and we ave no idea what kind of
>>> memory block device we hold in our hands.
>>>
>>> $ udevadm info -q all -a /sys/devices/system/memory/memory0
>>>
>>>   looking at device '/devices/system/memory/memory0':
>>>     KERNEL=="memory0"
>>>     SUBSYSTEM=="memory"
>>>     DRIVER==""
>>>     ATTR{online}=="1"
>>>     ATTR{phys_device}=="0"
>>>     ATTR{phys_index}=="00000000"
>>>     ATTR{removable}=="0"
>>>     ATTR{state}=="online"
>>>     ATTR{valid_zones}=="none"
>>>
>>>
>>> If we would provide "fake" drivers for the memory block devices we want
>>> to treat in a special way in user space (e.g. standby memory on s390x),
>>> user space could use that information to make smarter decisions.
>>>
>>> Adding such drivers might work. My suggestion would be to let ordinary
>>> DIMMs be without a driver for now and only special case standby memory
>>> and eventually paravirtualized memory devices (XEN and Hyper-V).
>>>
>>> Any thoughts?
>>
>> If we are going to fake the driver information we may as well add the
>> type attribute and be done with it.
>>
>> I think the problem with the patch was more with the semantic than the
>> attribute itself.
>>
>> What is normal, paravirtualized, and standby memory?
>>
>> I can understand DIMM device, baloon device, or whatever mechanism for
>> adding memory you might have.
>>
>> I can understand "memory designated as standby by the cluster
>> administrator".
>>
>> However, DIMM vs baloon is orthogonal to standby and should not be
>> conflated into one property.
>>
>> paravirtualized means nothing at all in relationship to memory type and
>> the desired online policy to me.
> 
> Right, so with whatever we come up, it should allow to make a decision
> in user space about
> - if memory is to be onlined automatically

And I will think about if we really should model standby memory. Maybe
it is really better to have in user space something like (as Dan noted)

if (isS390x() && type == "dimm") {
	/* don't online, on s390x system DIMMs are standby memory */
}

The we could have in addition

if (type == "balloon") {
	/*
	 * Balloon will not be unplugged by offlining the whole block at
	 * once, online as !movable.
	 */
}

But I'll have to think about the wording / types etc. (I neither like
"dimm" nor "balloon").

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
@ 2018-11-26 13:33         ` David Hildenbrand
  0 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-11-26 13:33 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Dave Hansen, Heiko Carstens, linux-mm, Michal Hocko,
	Paul Mackerras, H. Peter Anvin, Stephen Rothwell, Rashmica Gupta,
	K. Y. Srinivasan, Dan Williams, linux-s390, Michael Neuling,
	Stephen Hemminger, Yoshinori Sato, linux-acpi, Ingo Molnar,
	xen-devel, Len Brown, Pavel Tatashin, Rob Herring, mike.travis,
	Haiyang Zhang, Jonathan Neuschäfer, Nicholas Piggin,
	Martin Schwidefsky, Jérôme Glisse, Mike Rapoport,
	Borislav Petkov, Andy Lutomirski, Boris Ostrovsky, Andrew Morton,
	Oscar Salvador, Juergen Gross, Tony Luck, Mathieu Malaterre,
	Greg Kroah-Hartman, Rafael J. Wysocki, linux-kernel, Fenghua Yu,
	Mauricio Faria de Oliveira, Thomas Gleixner, Philippe Ombredanne,
	Joe Perches, devel, Joonsoo Kim, linuxppc-dev,
	Kirill A. Shutemov

On 26.11.18 13:30, David Hildenbrand wrote:
> On 23.11.18 19:06, Michal Suchánek wrote:
>> On Fri, 23 Nov 2018 12:13:58 +0100
>> David Hildenbrand <david@redhat.com> wrote:
>>
>>> On 28.09.18 17:03, David Hildenbrand wrote:
>>>> How to/when to online hotplugged memory is hard to manage for
>>>> distributions because different memory types are to be treated differently.
>>>> Right now, we need complicated udev rules that e.g. check if we are
>>>> running on s390x, on a physical system or on a virtualized system. But
>>>> there is also sometimes the demand to really online memory immediately
>>>> while adding in the kernel and not to wait for user space to make a
>>>> decision. And on virtualized systems there might be different
>>>> requirements, depending on "how" the memory was added (and if it will
>>>> eventually get unplugged again - DIMM vs. paravirtualized mechanisms).
>>>>
>>>> On the one hand, we have physical systems where we sometimes
>>>> want to be able to unplug memory again - e.g. a DIMM - so we have to online
>>>> it to the MOVABLE zone optionally. That decision is usually made in user
>>>> space.
>>>>
>>>> On the other hand, we have memory that should never be onlined
>>>> automatically, only when asked for by an administrator. Such memory only
>>>> applies to virtualized environments like s390x, where the concept of
>>>> "standby" memory exists. Memory is detected and added during boot, so it
>>>> can be onlined when requested by the admininistrator or some tooling.
>>>> Only when onlining, memory will be allocated in the hypervisor.
>>>>
>>>> But then, we also have paravirtualized devices (namely xen and hyper-v
>>>> balloons), that hotplug memory that will never ever be removed from a
>>>> system right now using offline_pages/remove_memory. If at all, this memory
>>>> is logically unplugged and handed back to the hypervisor via ballooning.
>>>>
>>>> For paravirtualized devices it is relevant that memory is onlined as
>>>> quickly as possible after adding - and that it is added to the NORMAL
>>>> zone. Otherwise, it could happen that too much memory in a row is added
>>>> (but not onlined), resulting in out-of-memory conditions due to the
>>>> additional memory for "struct pages" and friends. MOVABLE zone as well
>>>> as delays might be very problematic and lead to crashes (e.g. zone
>>>> imbalance).
>>>>
>>>> Therefore, introduce memory block types and online memory depending on
>>>> it when adding the memory. Expose the memory type to user space, so user
>>>> space handlers can start to process only "normal" memory. Other memory
>>>> block types can be ignored. One thing less to worry about in user space.
>>>>   
>>>
>>> So I was looking into alternatives.
>>>
>>> 1. Provide only "normal" and "standby" memory types to user space. This
>>> way user space can make smarter decisions about how to online memory.
>>> Not really sure if this is the right way to go.
>>>
>>>
>>> 2. Use device driver information (as mentioned by Michal S.).
>>>
>>> The problem right now is that there are no drivers for memory block
>>> devices. The "memory" subsystem has no drivers, so the KOBJ_ADD uevent
>>> will not contain a "DRIVER" information and we ave no idea what kind of
>>> memory block device we hold in our hands.
>>>
>>> $ udevadm info -q all -a /sys/devices/system/memory/memory0
>>>
>>>   looking at device '/devices/system/memory/memory0':
>>>     KERNEL=="memory0"
>>>     SUBSYSTEM=="memory"
>>>     DRIVER==""
>>>     ATTR{online}=="1"
>>>     ATTR{phys_device}=="0"
>>>     ATTR{phys_index}=="00000000"
>>>     ATTR{removable}=="0"
>>>     ATTR{state}=="online"
>>>     ATTR{valid_zones}=="none"
>>>
>>>
>>> If we would provide "fake" drivers for the memory block devices we want
>>> to treat in a special way in user space (e.g. standby memory on s390x),
>>> user space could use that information to make smarter decisions.
>>>
>>> Adding such drivers might work. My suggestion would be to let ordinary
>>> DIMMs be without a driver for now and only special case standby memory
>>> and eventually paravirtualized memory devices (XEN and Hyper-V).
>>>
>>> Any thoughts?
>>
>> If we are going to fake the driver information we may as well add the
>> type attribute and be done with it.
>>
>> I think the problem with the patch was more with the semantic than the
>> attribute itself.
>>
>> What is normal, paravirtualized, and standby memory?
>>
>> I can understand DIMM device, baloon device, or whatever mechanism for
>> adding memory you might have.
>>
>> I can understand "memory designated as standby by the cluster
>> administrator".
>>
>> However, DIMM vs baloon is orthogonal to standby and should not be
>> conflated into one property.
>>
>> paravirtualized means nothing at all in relationship to memory type and
>> the desired online policy to me.
> 
> Right, so with whatever we come up, it should allow to make a decision
> in user space about
> - if memory is to be onlined automatically

And I will think about if we really should model standby memory. Maybe
it is really better to have in user space something like (as Dan noted)

if (isS390x() && type == "dimm") {
	/* don't online, on s390x system DIMMs are standby memory */
}

The we could have in addition

if (type == "balloon") {
	/*
	 * Balloon will not be unplugged by offlining the whole block at
	 * once, online as !movable.
	 */
}

But I'll have to think about the wording / types etc. (I neither like
"dimm" nor "balloon").

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-11-26 12:30       ` David Hildenbrand
                         ` (2 preceding siblings ...)
  (?)
@ 2018-11-26 13:33       ` David Hildenbrand
  -1 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-11-26 13:33 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Dave Hansen, Heiko Carstens, linux-mm, Michal Hocko,
	Paul Mackerras, H. Peter Anvin, Stephen Rothwell, Rashmica Gupta,
	K. Y. Srinivasan, Dan Williams, linux-s390, Michael Neuling,
	Stephen Hemminger, Yoshinori Sato, linux-acpi, Ingo Molnar,
	xen-devel, Len Brown, Pavel Tatashin, Rob Herring

On 26.11.18 13:30, David Hildenbrand wrote:
> On 23.11.18 19:06, Michal Suchánek wrote:
>> On Fri, 23 Nov 2018 12:13:58 +0100
>> David Hildenbrand <david@redhat.com> wrote:
>>
>>> On 28.09.18 17:03, David Hildenbrand wrote:
>>>> How to/when to online hotplugged memory is hard to manage for
>>>> distributions because different memory types are to be treated differently.
>>>> Right now, we need complicated udev rules that e.g. check if we are
>>>> running on s390x, on a physical system or on a virtualized system. But
>>>> there is also sometimes the demand to really online memory immediately
>>>> while adding in the kernel and not to wait for user space to make a
>>>> decision. And on virtualized systems there might be different
>>>> requirements, depending on "how" the memory was added (and if it will
>>>> eventually get unplugged again - DIMM vs. paravirtualized mechanisms).
>>>>
>>>> On the one hand, we have physical systems where we sometimes
>>>> want to be able to unplug memory again - e.g. a DIMM - so we have to online
>>>> it to the MOVABLE zone optionally. That decision is usually made in user
>>>> space.
>>>>
>>>> On the other hand, we have memory that should never be onlined
>>>> automatically, only when asked for by an administrator. Such memory only
>>>> applies to virtualized environments like s390x, where the concept of
>>>> "standby" memory exists. Memory is detected and added during boot, so it
>>>> can be onlined when requested by the admininistrator or some tooling.
>>>> Only when onlining, memory will be allocated in the hypervisor.
>>>>
>>>> But then, we also have paravirtualized devices (namely xen and hyper-v
>>>> balloons), that hotplug memory that will never ever be removed from a
>>>> system right now using offline_pages/remove_memory. If at all, this memory
>>>> is logically unplugged and handed back to the hypervisor via ballooning.
>>>>
>>>> For paravirtualized devices it is relevant that memory is onlined as
>>>> quickly as possible after adding - and that it is added to the NORMAL
>>>> zone. Otherwise, it could happen that too much memory in a row is added
>>>> (but not onlined), resulting in out-of-memory conditions due to the
>>>> additional memory for "struct pages" and friends. MOVABLE zone as well
>>>> as delays might be very problematic and lead to crashes (e.g. zone
>>>> imbalance).
>>>>
>>>> Therefore, introduce memory block types and online memory depending on
>>>> it when adding the memory. Expose the memory type to user space, so user
>>>> space handlers can start to process only "normal" memory. Other memory
>>>> block types can be ignored. One thing less to worry about in user space.
>>>>   
>>>
>>> So I was looking into alternatives.
>>>
>>> 1. Provide only "normal" and "standby" memory types to user space. This
>>> way user space can make smarter decisions about how to online memory.
>>> Not really sure if this is the right way to go.
>>>
>>>
>>> 2. Use device driver information (as mentioned by Michal S.).
>>>
>>> The problem right now is that there are no drivers for memory block
>>> devices. The "memory" subsystem has no drivers, so the KOBJ_ADD uevent
>>> will not contain any "DRIVER" information, and we have no idea what kind of
>>> memory block device we hold in our hands.
>>>
>>> $ udevadm info -q all -a /sys/devices/system/memory/memory0
>>>
>>>   looking at device '/devices/system/memory/memory0':
>>>     KERNEL=="memory0"
>>>     SUBSYSTEM=="memory"
>>>     DRIVER==""
>>>     ATTR{online}=="1"
>>>     ATTR{phys_device}=="0"
>>>     ATTR{phys_index}=="00000000"
>>>     ATTR{removable}=="0"
>>>     ATTR{state}=="online"
>>>     ATTR{valid_zones}=="none"
>>>
>>>
>>> If we would provide "fake" drivers for the memory block devices we want
>>> to treat in a special way in user space (e.g. standby memory on s390x),
>>> user space could use that information to make smarter decisions.
>>>
>>> Adding such drivers might work. My suggestion would be to let ordinary
>>> DIMMs be without a driver for now and only special case standby memory
>>> and eventually paravirtualized memory devices (XEN and Hyper-V).
>>>
>>> Any thoughts?
>>
>> If we are going to fake the driver information we may as well add the
>> type attribute and be done with it.
>>
>> I think the problem with the patch was more with the semantic than the
>> attribute itself.
>>
>> What is normal, paravirtualized, and standby memory?
>>
>> I can understand DIMM device, balloon device, or whatever mechanism for
>> adding memory you might have.
>>
>> I can understand "memory designated as standby by the cluster
>> administrator".
>>
>> However, DIMM vs balloon is orthogonal to standby and should not be
>> conflated into one property.
>>
>> paravirtualized means nothing at all in relationship to memory type and
>> the desired online policy to me.
> 
> Right, so with whatever we come up, it should allow to make a decision
> in user space about
> - if memory is to be onlined automatically

And I will think about if we really should model standby memory. Maybe
it is really better to have in user space something like (as Dan noted)

if (isS390x() && type == "dimm") {
	/* don't online, on s390x system DIMMs are standby memory */
}

Then we could have, in addition:

if (type == "balloon") {
	/*
	 * Balloon will not be unplugged by offlining the whole block at
	 * once, online as !movable.
	 */
}

But I'll have to think about the wording / types etc. (I neither like
"dimm" nor "balloon").
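
A userspace handler combining the two checks could be sketched in C as
below. This is purely illustrative: is_s390x() is a placeholder, and the
"dimm"/"balloon" type strings are just the working names from this
discussion, not an existing kernel interface.

```c
#include <stdbool.h>
#include <string.h>

/* Placeholder: a real handler would check the machine type, e.g. via uname(2). */
static bool is_s390x(void)
{
	return false;
}

/* Should a newly added memory block be onlined automatically? */
static bool should_auto_online(const char *type)
{
	if (is_s390x() && strcmp(type, "dimm") == 0)
		return false;	/* on s390x, "DIMMs" are standby memory */
	return true;
}

/*
 * Balloon memory will not be unplugged by offlining whole blocks, so
 * onlining it as !movable (NORMAL zone) avoids zone imbalance.
 */
static bool should_online_movable(const char *type)
{
	return strcmp(type, "balloon") != 0;
}
```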

-- 

Thanks,

David / dhildenb

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-11-26 13:33         ` David Hildenbrand
  (?)
@ 2018-11-26 14:20           ` Michal Suchánek
  -1 siblings, 0 replies; 144+ messages in thread
From: Michal Suchánek @ 2018-11-26 14:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Dave Hansen, Heiko Carstens, linux-mm, Michal Hocko,
	Paul Mackerras, H. Peter Anvin, Rashmica Gupta, Boris Ostrovsky,
	Stephen Rothwell, Michael Neuling, Stephen Hemminger,
	Yoshinori Sato, linux-acpi, Ingo Molnar, xen-devel, Rob Herring,
	Len Brown, Pavel Tatashin, linux-s390

On Mon, 26 Nov 2018 14:33:29 +0100
David Hildenbrand <david@redhat.com> wrote:

> On 26.11.18 13:30, David Hildenbrand wrote:
> > On 23.11.18 19:06, Michal Suchánek wrote:  

> >>
> >> If we are going to fake the driver information we may as well add the
> >> type attribute and be done with it.
> >>
> >> I think the problem with the patch was more with the semantic than the
> >> attribute itself.
> >>
> >> What is normal, paravirtualized, and standby memory?
> >>
> >> I can understand DIMM device, balloon device, or whatever mechanism for
> >> adding memory you might have.
> >>
> >> I can understand "memory designated as standby by the cluster
> >> administrator".
> >>
> >> However, DIMM vs balloon is orthogonal to standby and should not be
> >> conflated into one property.
> >>
> >> paravirtualized means nothing at all in relationship to memory type and
> >> the desired online policy to me.  
> > 
> > Right, so with whatever we come up, it should allow to make a decision
> > in user space about
> > - if memory is to be onlined automatically  
> 
> And I will think about if we really should model standby memory. Maybe
> it is really better to have in user space something like (as Dan noted)

If it is possible to designate the memory as standby or online in the
s390 admin interface and the kernel does have access to this
information it makes sense to forward it to userspace (as separate
s390-specific property). If not then you need to make some kind of
assumption like below and the user can tune the script according to
their usecase.

> 
> if (isS390x() && type == "dimm") {
> 	/* don't online, on s390x system DIMMs are standby memory */
> }

Thanks

Michal
_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-11-26 14:20           ` Michal Suchánek
  (?)
@ 2018-11-26 15:59             ` David Hildenbrand
  -1 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-11-26 15:59 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Dave Hansen, Heiko Carstens, linux-mm, Michal Hocko,
	Paul Mackerras, H. Peter Anvin, Rashmica Gupta, Boris Ostrovsky,
	Stephen Rothwell, Michael Neuling, Stephen Hemminger,
	Yoshinori Sato, linux-acpi, Ingo Molnar, xen-devel, Rob Herring,
	Len Brown, Pavel Tatashin, linux-s390

On 26.11.18 15:20, Michal Suchánek wrote:
> On Mon, 26 Nov 2018 14:33:29 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 26.11.18 13:30, David Hildenbrand wrote:
>>> On 23.11.18 19:06, Michal Suchánek wrote:  
> 
>>>>
>>>> If we are going to fake the driver information we may as well add the
>>>> type attribute and be done with it.
>>>>
>>>> I think the problem with the patch was more with the semantic than the
>>>> attribute itself.
>>>>
>>>> What is normal, paravirtualized, and standby memory?
>>>>
>>>> I can understand DIMM device, balloon device, or whatever mechanism for
>>>> adding memory you might have.
>>>>
>>>> I can understand "memory designated as standby by the cluster
>>>> administrator".
>>>>
>>>> However, DIMM vs balloon is orthogonal to standby and should not be
>>>> conflated into one property.
>>>>
>>>> paravirtualized means nothing at all in relationship to memory type and
>>>> the desired online policy to me.  
>>>
>>> Right, so with whatever we come up, it should allow to make a decision
>>> in user space about
>>> - if memory is to be onlined automatically  
>>
>> And I will think about if we really should model standby memory. Maybe
>> it is really better to have in user space something like (as Dan noted)
> 
> If it is possible to designate the memory as standby or online in the
> s390 admin interface and the kernel does have access to this
> information it makes sense to forward it to userspace (as separate
> s390-specific property). If not then you need to make some kind of
> assumption like below and the user can tune the script according to
> their usecase.

Also true, standby memory really represents a distinct type of memory
block (memory seems to be there but really isn't). Right now I am
thinking about something like this (tried to formulate it on a very
generic level because we can't predict which mechanism might want to
make use of these types in the future).


/*
 * Memory block types allow user space to formulate rules if and how to
 * online memory blocks. The types are exposed to user space as text
 * strings in sysfs. While the typical online strategies are described
 * along with the types, there are use cases where that can differ (e.g.
 * use MOVABLE zone for more reliable huge page usage, use NORMAL zone
 * due to zone imbalance or because memory unplug is not intended).
 *
 * MEMORY_BLOCK_NONE:
 *  No memory block is to be created (e.g. device memory). Used internally
 *  only.
 *
 * MEMORY_BLOCK_REMOVABLE:
 *  This memory block type should be treated as if it can be
 *  removed/unplugged from the system again. E.g. there is a hardware
 *  interface to unplug such memory. This memory block type is usually
 *  onlined to the MOVABLE zone, to e.g. make offlining of it more
 *  reliable. Examples include ACPI and PPC DIMMs.
 *
 * MEMORY_BLOCK_UNREMOVABLE:
 *  This memory block type should be treated as if it can not be
 *  removed/unplugged again. E.g. there is no hardware interface to
 *  unplug such memory. This memory block type is usually onlined to
 *  the NORMAL zone, as offlining is not beneficial. Examples include boot
 *  memory on most architectures and memory added via balloon devices.
 *
 * MEMORY_BLOCK_STANDBY:
 *  The memory block type should be treated as if it can be
 *  removed/unplugged again, however the actual memory hot(un)plug is
 *  performed by onlining/offlining. In virtual environments, such memory
 *  is usually added during boot and never removed. Onlining memory will
 *  result in memory getting allocated to a VM. This memory type is usually
 *  not onlined automatically but explicitly by the administrator. One
 *  example is standby memory on s390x.
 */
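
As a rough illustration, the types above could map to default online
behavior as in the following sketch. It is based only on the comment, and
the function name and policy strings are assumptions, not code from the
actual patch.

```c
#include <string.h>

/* Sketch: enum mirroring the proposed memory block types. */
enum memory_block_type {
	MEMORY_BLOCK_NONE,
	MEMORY_BLOCK_REMOVABLE,
	MEMORY_BLOCK_UNREMOVABLE,
	MEMORY_BLOCK_STANDBY,
};

/* Usual (not mandatory) online strategy per type, per the comment above. */
static const char *default_online_policy(enum memory_block_type type)
{
	switch (type) {
	case MEMORY_BLOCK_REMOVABLE:
		return "online_movable";	/* keep offlining reliable */
	case MEMORY_BLOCK_UNREMOVABLE:
		return "online";		/* NORMAL zone, offlining not useful */
	case MEMORY_BLOCK_STANDBY:
		return "offline";		/* admin onlines explicitly */
	default:
		return "none";			/* no memory block created */
	}
}
```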

> 
>>
>> if (isS390x() && type == "dimm") {
>> 	/* don't online, on s390x system DIMMs are standby memory */
>> }
> 
> Thanks
> 
> Michal
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
@ 2018-11-26 15:59             ` David Hildenbrand
  0 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-11-26 15:59 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Dave Hansen, Heiko Carstens, linux-mm, Michal Hocko,
	Paul Mackerras, H. Peter Anvin, Stephen Rothwell, Rashmica Gupta,
	K. Y. Srinivasan, Dan Williams, linux-s390, Michael Neuling,
	Stephen Hemminger, Yoshinori Sato, linux-acpi, Ingo Molnar,
	xen-devel, Len Brown, Pavel Tatashin, Rob Herring, mike.travis,
	Haiyang Zhang, Jonathan Neuschäfer, Nicholas Piggin,
	Martin Schwidefsky, Jérôme Glisse, Mike Rapoport,
	Borislav Petkov, Andy Lutomirski, Boris Ostrovsky, Andrew Morton,
	Oscar Salvador, Juergen Gross, Tony Luck, Mathieu Malaterre,
	Greg Kroah-Hartman, Rafael J. Wysocki, linux-kernel, Fenghua Yu,
	Mauricio Faria de Oliveira, Thomas Gleixner, Philippe Ombredanne,
	Joe Perches, devel, Joonsoo Kim, linuxppc-dev,
	Kirill A. Shutemov

On 26.11.18 15:20, Michal Suchánek wrote:
> On Mon, 26 Nov 2018 14:33:29 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 26.11.18 13:30, David Hildenbrand wrote:
>>> On 23.11.18 19:06, Michal Suchánek wrote:  
> 
>>>>
>>>> If we are going to fake the driver information we may as well add the
>>>> type attribute and be done with it.
>>>>
>>>> I think the problem with the patch was more with the semantic than the
>>>> attribute itself.
>>>>
>>>> What is normal, paravirtualized, and standby memory?
>>>>
>>>> I can understand DIMM device, baloon device, or whatever mechanism for
>>>> adding memory you might have.
>>>>
>>>> I can understand "memory designated as standby by the cluster
>>>> administrator".
>>>>
>>>> However, DIMM vs baloon is orthogonal to standby and should not be
>>>> conflated into one property.
>>>>
>>>> paravirtualized means nothing at all in relationship to memory type and
>>>> the desired online policy to me.  
>>>
>>> Right, so with whatever we come up, it should allow to make a decision
>>> in user space about
>>> - if memory is to be onlined automatically  
>>
>> And I will think about if we really should model standby memory. Maybe
>> it is really better to have in user space something like (as Dan noted)
> 
> If it is possible to designate the memory as standby or online in the
> s390 admin interface and the kernel does have access to this
> information it makes sense to forward it to userspace (as separate
> s390-specific property). If not then you need to make some kind of
> assumption like below and the user can tune the script according to
> their usecase.

Also true, standby memory really represents a distinct type of memory
block (memory seems to be there but really isn't). Right now I am
thinking about something like this (tried to formulate it on a very
generic level because we can't predict which mechanism might want to
make use of these types in the future).


/*
 * Memory block types allow user space to formulate rules if and how to
 * online memory blocks. The types are exposed to user space as text
 * strings in sysfs. While the typical online strategies are described
 * along with the types, there are use cases where that can differ (e.g.
 * use MOVABLE zone for more reliable huge page usage, use NORMAL zone
 * due to zone imbalance or because memory unplug is not intended).
 *
 * MEMORY_BLOCK_NONE:
 *  No memory block is to be created (e.g. device memory). Used internally
 *  only.
 *
 * MEMORY_BLOCK_REMOVABLE:
 *  This memory block type should be treated as if it can be
 *  removed/unplugged from the system again. E.g. there is a hardware
 *  interface to unplug such memory. This memory block type is usually
 *  onlined to the MOVABLE zone, to e.g. make offlining of it more
 *  reliable. Examples include ACPI and PPC DIMMs.
 *
 * MEMORY_BLOCK_UNREMOVABLE:
 *  This memory block type should be treated as if it can not be
 *  removed/unplugged again. E.g. there is no hardware interface to
 *  unplug such memory. This memory block type is usually onlined to
 *  the NORMAL zone, as offlining is not beneficial. Examples include boot
 *  memory on most architectures and memory added via balloon devices.
 *
 * MEMORY_BLOCK_STANDBY:
 *  The memory block type should be treated as if it can be
 *  removed/unplugged again, however the actual memory hot(un)plug is
 *  performed by onlining/offlining. In virtual environments, such memory
 *  is usually added during boot and never removed. Onlining memory will
 *  result in memory getting allocated to a VM. This memory type is usually
 *  not onlined automatically but explicitly by the administrator. One
 *  example is standby memory on s390x.
 */

> 
>>
>> if (isS390x() && type == "dimm") {
>> 	/* don't online, on s390x system DIMMs are standby memory */
>> }
> 
> Thanks
> 
> Michal
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
@ 2018-11-26 15:59             ` David Hildenbrand
  0 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-11-26 15:59 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Dave Hansen, Heiko Carstens, linux-mm, Michal Hocko,
	Paul Mackerras, H. Peter Anvin, Rashmica Gupta, K. Y. Srinivasan,
	Boris Ostrovsky, Stephen Rothwell, Michael Neuling,
	Stephen Hemminger, Yoshinori Sato, linux-acpi, Ingo Molnar,
	xen-devel, Rob Herring, Len Brown, Pavel Tatashin, linux-s390,
	mike.travis, Haiyang Zhang, Jonathan Neuschäfer,
	Nicholas Piggin, Joe Perches, Jérôme Glisse,
	Mike Rapoport, Borislav Petkov, Andy Lutomirski, Dan Williams,
	Joonsoo Kim, Oscar Salvador, Juergen Gross, Tony Luck,
	Mathieu Malaterre, Greg Kroah-Hartman, Rafael J. Wysocki,
	linux-kernel, Fenghua Yu, Mauricio Faria de Oliveira,
	Thomas Gleixner, Philippe Ombredanne, Martin Schwidefsky, devel,
	Andrew Morton, linuxppc-dev, Kirill A. Shutemov

On 26.11.18 15:20, Michal Suchánek wrote:
> On Mon, 26 Nov 2018 14:33:29 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 26.11.18 13:30, David Hildenbrand wrote:
>>> On 23.11.18 19:06, Michal Suchánek wrote:  
> 
>>>>
>>>> If we are going to fake the driver information we may as well add the
>>>> type attribute and be done with it.
>>>>
>>>> I think the problem with the patch was more with the semantic than the
>>>> attribute itself.
>>>>
>>>> What is normal, paravirtualized, and standby memory?
>>>>
>>>> I can understand DIMM device, baloon device, or whatever mechanism for
>>>> adding memory you might have.
>>>>
>>>> I can understand "memory designated as standby by the cluster
>>>> administrator".
>>>>
>>>> However, DIMM vs baloon is orthogonal to standby and should not be
>>>> conflated into one property.
>>>>
>>>> paravirtualized means nothing at all in relationship to memory type and
>>>> the desired online policy to me.  
>>>
>>> Right, so with whatever we come up, it should allow to make a decision
>>> in user space about
>>> - if memory is to be onlined automatically  
>>
>> And I will think about if we really should model standby memory. Maybe
>> it is really better to have in user space something like (as Dan noted)
> 
> If it is possible to designate the memory as standby or online in the
> s390 admin interface and the kernel does have access to this
> information it makes sense to forward it to userspace (as separate
> s390-specific property). If not then you need to make some kind of
> assumption like below and the user can tune the script according to
> their usecase.

Also true, standby memory really represents a distinct type of memory
block (memory seems to be there but really isn't). Right now I am
thinking about something like this (tried to formulate it on a very
generic level because we can't predict which mechanism might want to
make use of these types in the future).


/*
 * Memory block types allow user space to formulate rules if and how to
 * online memory blocks. The types are exposed to user space as text
 * strings in sysfs. While the typical online strategies are described
 * along with the types, there are use cases where that can differ (e.g.
 * use MOVABLE zone for more reliable huge page usage, use NORMAL zone
 * due to zone imbalance or because memory unplug is not intended).
 *
 * MEMORY_BLOCK_NONE:
 *  No memory block is to be created (e.g. device memory). Used internally
 *  only.
 *
 * MEMORY_BLOCK_REMOVABLE:
 *  This memory block type should be treated as if it can be
 *  removed/unplugged from the system again. E.g. there is a hardware
 *  interface to unplug such memory. This memory block type is usually
 *  onlined to the MOVABLE zone, to e.g. make offlining of it more
 *  reliable. Examples include ACPI and PPC DIMMs.
 *
 * MEMORY_BLOCK_UNREMOVABLE:
 *  This memory block type should be treated as if it can not be
 *  removed/unplugged again. E.g. there is no hardware interface to
 *  unplug such memory. This memory block type is usually onlined to
 *  the NORMAL zone, as offlining is not beneficial. Examples include boot
 *  memory on most architectures and memory added via balloon devices.
 *
 * MEMORY_BLOCK_STANDBY:
 *  The memory block type should be treated as if it can be
 *  removed/unplugged again, however the actual memory hot(un)plug is
 *  performed by onlining/offlining. In virtual environments, such memory
 *  is usually added during boot and never removed. Onlining memory will
 *  result in memory getting allocated to a VM. This memory type is usually
 *  not onlined automatically but explicitly by the administrator. One
 *  example is standby memory on s390x.
 */

> 
>>
>> if (isS390x() && type == "dimm") {
>> 	/* don't online, on s390x system DIMMs are standby memory */
>> }
> 
> Thanks
> 
> Michal
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-11-26 14:20           ` Michal Suchánek
  (?)
  (?)
@ 2018-11-26 15:59           ` David Hildenbrand
  -1 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-11-26 15:59 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Dave Hansen, Heiko Carstens, linux-mm, Michal Hocko,
	Paul Mackerras, H. Peter Anvin, Rashmica Gupta, K. Y. Srinivasan,
	Boris Ostrovsky, Stephen Rothwell, Michael Neuling,
	Stephen Hemminger, Yoshinori Sato, linux-acpi, Ingo Molnar,
	xen-devel, Rob Herring, Len Brown, Pavel Tatashin, linux-s390

On 26.11.18 15:20, Michal Suchánek wrote:
> On Mon, 26 Nov 2018 14:33:29 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 26.11.18 13:30, David Hildenbrand wrote:
>>> On 23.11.18 19:06, Michal Suchánek wrote:  
> 
>>>>
>>>> If we are going to fake the driver information we may as well add the
>>>> type attribute and be done with it.
>>>>
>>>> I think the problem with the patch was more with the semantic than the
>>>> attribute itself.
>>>>
>>>> What is normal, paravirtualized, and standby memory?
>>>>
>>>> I can understand DIMM device, baloon device, or whatever mechanism for
>>>> adding memory you might have.
>>>>
>>>> I can understand "memory designated as standby by the cluster
>>>> administrator".
>>>>
>>>> However, DIMM vs baloon is orthogonal to standby and should not be
>>>> conflated into one property.
>>>>
>>>> paravirtualized means nothing at all in relationship to memory type and
>>>> the desired online policy to me.  
>>>
>>> Right, so with whatever we come up, it should allow to make a decision
>>> in user space about
>>> - if memory is to be onlined automatically  
>>
>> And I will think about if we really should model standby memory. Maybe
>> it is really better to have in user space something like (as Dan noted)
> 
> If it is possible to designate the memory as standby or online in the
> s390 admin interface and the kernel does have access to this
> information it makes sense to forward it to userspace (as separate
> s390-specific property). If not then you need to make some kind of
> assumption like below and the user can tune the script according to
> their usecase.

Also true, standby memory really represents a distinct type of memory
block (memory seems to be there but really isn't). Right now I am
thinking about something like this (tried to formulate it on a very
generic level because we can't predict which mechanism might want to
make use of these types in the future).


/*
 * Memory block types allow user space to formulate rules if and how to
 * online memory blocks. The types are exposed to user space as text
 * strings in sysfs. While the typical online strategies are described
 * along with the types, there are use cases where that can differ (e.g.
 * use MOVABLE zone for more reliable huge page usage, use NORMAL zone
 * due to zone imbalance or because memory unplug is not intended).
 *
 * MEMORY_BLOCK_NONE:
 *  No memory block is to be created (e.g. device memory). Used internally
 *  only.
 *
 * MEMORY_BLOCK_REMOVABLE:
 *  This memory block type should be treated as if it can be
 *  removed/unplugged from the system again. E.g. there is a hardware
 *  interface to unplug such memory. This memory block type is usually
 *  onlined to the MOVABLE zone, to e.g. make offlining of it more
 *  reliable. Examples include ACPI and PPC DIMMs.
 *
 * MEMORY_BLOCK_UNREMOVABLE:
 *  This memory block type should be treated as if it can not be
 *  removed/unplugged again. E.g. there is no hardware interface to
 *  unplug such memory. This memory block type is usually onlined to
 *  the NORMAL zone, as offlining is not beneficial. Examples include boot
 *  memory on most architectures and memory added via balloon devices.
 *
 * MEMORY_BLOCK_STANDBY:
 *  The memory block type should be treated as if it can be
 *  removed/unplugged again, however the actual memory hot(un)plug is
 *  performed by onlining/offlining. In virtual environments, such memory
 *  is usually added during boot and never removed. Onlining memory will
 *  result in memory getting allocated to a VM. This memory type is usually
 *  not onlined automatically but explicitly by the administrator. One
 *  example is standby memory on s390x.
 */
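
A user-space policy consuming such a type attribute could look roughly
like the sketch below (e.g. as a udev RUN helper). This is a hedged,
hypothetical example: the sysfs attribute name ("type"), its location,
and the type strings are assumptions derived from the proposal above,
not an existing kernel ABI.

```shell
#!/bin/sh
# Hypothetical policy helper for the proposed memory block "type"
# attribute. Type strings mirror the RFC's MEMORY_BLOCK_* names.

# Map a memory block type to the value written to .../memoryX/state.
# Prints nothing for types that should not be onlined automatically.
online_mode() {
    case "$1" in
        removable)   echo online_movable ;;  # allow later unplug (MOVABLE zone)
        unremovable) echo online_kernel ;;   # NORMAL zone, unplug not expected
        standby)     ;;                      # leave to the administrator (s390x)
    esac
}

# Example wiring (assumed udev RUN helper, $1 = memory block sysfs dir):
# block="$1"
# mode=$(online_mode "$(cat "$block/type")")
# [ -n "$mode" ] && echo "$mode" > "$block/state"
```

The point of the indirection is that the kernel only states *what* the
block is; the mapping to an online mode stays a user-space decision.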

> 
>>
>> if (isS390x() && type == "dimm") {
>> 	/* don't online, on s390x system DIMMs are standby memory */
>> }
> 
> Thanks
> 
> Michal
> 


-- 

Thanks,

David / dhildenb

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-11-26 15:59             ` David Hildenbrand
  (?)
@ 2018-11-27 16:32               ` Michal Suchánek
  -1 siblings, 0 replies; 144+ messages in thread
From: Michal Suchánek @ 2018-11-27 16:32 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Dave Hansen, Heiko Carstens, linux-mm, Michal Hocko,
	Paul Mackerras, H. Peter Anvin, Rashmica Gupta, Dan Williams,
	Stephen Rothwell, Michael Neuling, Stephen Hemminger,
	Yoshinori Sato, linux-acpi, Ingo Molnar, xen-devel, Len Brown,
	Pavel Tatashin, Rob Herring, mike.travis, Haiyang

On Mon, 26 Nov 2018 16:59:14 +0100
David Hildenbrand <david@redhat.com> wrote:

> On 26.11.18 15:20, Michal Suchánek wrote:
> > On Mon, 26 Nov 2018 14:33:29 +0100
> > David Hildenbrand <david@redhat.com> wrote:
> >   
> >> On 26.11.18 13:30, David Hildenbrand wrote:  
> >>> On 23.11.18 19:06, Michal Suchánek wrote:    
> >   
> >>>>
> >>>> If we are going to fake the driver information we may as well add the
> >>>> type attribute and be done with it.
> >>>>
> >>>> I think the problem with the patch was more with the semantic than the
> >>>> attribute itself.
> >>>>
> >>>> What is normal, paravirtualized, and standby memory?
> >>>>
> >>>> I can understand DIMM device, balloon device, or whatever mechanism for
> >>>> adding memory you might have.
> >>>>
> >>>> I can understand "memory designated as standby by the cluster
> >>>> administrator".
> >>>>
> >>>> However, DIMM vs balloon is orthogonal to standby and should not be
> >>>> conflated into one property.
> >>>>
> >>>> paravirtualized means nothing at all in relationship to memory type and
> >>>> the desired online policy to me.    
> >>>
> >>> Right, so with whatever we come up, it should allow to make a decision
> >>> in user space about
> >>> - if memory is to be onlined automatically    
> >>
> >> And I will think about if we really should model standby memory. Maybe
> >> it is really better to have in user space something like (as Dan noted)  
> > 
> > If it is possible to designate the memory as standby or online in the
> > s390 admin interface and the kernel does have access to this
> > information it makes sense to forward it to userspace (as separate
> > s390-specific property). If not then you need to make some kind of
> > assumption like below and the user can tune the script according to
> > their usecase.  
> 
> Also true, standby memory really represents a distinct type of memory
> block (memory seems to be there but really isn't). Right now I am
> thinking about something like this (tried to formulate it on a very
> generic level because we can't predict which mechanism might want to
> make use of these types in the future).
> 
> 
> /*
>  * Memory block types allow user space to formulate rules if and how to
>  * online memory blocks. The types are exposed to user space as text
>  * strings in sysfs. While the typical online strategies are described
>  * along with the types, there are use cases where that can differ (e.g.
>  * use MOVABLE zone for more reliable huge page usage, use NORMAL zone
>  * due to zone imbalance or because memory unplug is not intended).
>  *
>  * MEMORY_BLOCK_NONE:
>  *  No memory block is to be created (e.g. device memory). Used internally
>  *  only.
>  *
>  * MEMORY_BLOCK_REMOVABLE:
>  *  This memory block type should be treated as if it can be
>  *  removed/unplugged from the system again. E.g. there is a hardware
>  *  interface to unplug such memory. This memory block type is usually
>  *  onlined to the MOVABLE zone, to e.g. make offlining of it more
>  *  reliable. Examples include ACPI and PPC DIMMs.
>  *
>  * MEMORY_BLOCK_UNREMOVABLE:
>  *  This memory block type should be treated as if it can not be
>  *  removed/unplugged again. E.g. there is no hardware interface to
>  *  unplug such memory. This memory block type is usually onlined to
>  *  the NORMAL zone, as offlining is not beneficial. Examples include boot
>  *  memory on most architectures and memory added via balloon devices.

AFAIK the balloon device can be inflated as well, so this does not really
describe how this memory type works in any meaningful way. Also, it
should not be possible to see this kind of memory from userspace. The
balloon driver just takes existing memory that is properly backed,
allocates it for itself, and allows the hypervisor to use it. Thus it
creates the equivalent of s390 standby memory, which is not backed in
the VM. When memory is reclaimed from the hypervisor, the balloon driver
frees it, making it available to the VM kernel again. However, the whole
time the memory appears present in the machine, and no hotplug events
should be visible unless the docs I am looking at are really outdated.

>  *
>  * MEMORY_BLOCK_STANDBY:
>  *  The memory block type should be treated as if it can be
>  *  removed/unplugged again, however the actual memory hot(un)plug is
>  *  performed by onlining/offlining. In virtual environments, such memory
>  *  is usually added during boot and never removed. Onlining memory will
>  *  result in memory getting allocated to a VM. This memory type is usually
>  *  not onlined automatically but explicitly by the administrator. One
>  *  example is standby memory on s390x.

Again, this does not meaningfully describe the memory type. There is
no memory on standby. There is in fact no backing at all unless you
online it. So this probably is some kind of shared memory. However, the
(de)allocation is controlled differently compared to the balloon device.
The concept is very similar, though.

Thanks

Michal
_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 144+ messages in thread

* Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types
  2018-11-27 16:32               ` Michal Suchánek
  (?)
@ 2018-11-27 16:47                 ` David Hildenbrand
  -1 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-11-27 16:47 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Dave Hansen, Heiko Carstens, linux-mm, Michal Hocko,
	Paul Mackerras, H. Peter Anvin, Rashmica Gupta, Dan Williams,
	Stephen Rothwell, Michael Neuling, Stephen Hemminger,
	Yoshinori Sato, linux-acpi, Ingo Molnar, xen-devel, Len Brown,
	Pavel Tatashin, Rob Herring, mike.travis, Haiyang

On 27.11.18 17:32, Michal Suchánek wrote:
> On Mon, 26 Nov 2018 16:59:14 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 26.11.18 15:20, Michal Suchánek wrote:
>>> On Mon, 26 Nov 2018 14:33:29 +0100
>>> David Hildenbrand <david@redhat.com> wrote:
>>>   
>>>> On 26.11.18 13:30, David Hildenbrand wrote:  
>>>>> On 23.11.18 19:06, Michal Suchánek wrote:    
>>>   
>>>>>>
>>>>>> If we are going to fake the driver information we may as well add the
>>>>>> type attribute and be done with it.
>>>>>>
>>>>>> I think the problem with the patch was more with the semantic than the
>>>>>> attribute itself.
>>>>>>
>>>>>> What is normal, paravirtualized, and standby memory?
>>>>>>
>>>>>> I can understand DIMM device, balloon device, or whatever mechanism for
>>>>>> adding memory you might have.
>>>>>>
>>>>>> I can understand "memory designated as standby by the cluster
>>>>>> administrator".
>>>>>>
>>>>>> However, DIMM vs balloon is orthogonal to standby and should not be
>>>>>> conflated into one property.
>>>>>>
>>>>>> paravirtualized means nothing at all in relationship to memory type and
>>>>>> the desired online policy to me.    
>>>>>
>>>>> Right, so with whatever we come up, it should allow to make a decision
>>>>> in user space about
>>>>> - if memory is to be onlined automatically    
>>>>
>>>> And I will think about if we really should model standby memory. Maybe
>>>> it is really better to have in user space something like (as Dan noted)  
>>>
>>> If it is possible to designate the memory as standby or online in the
>>> s390 admin interface and the kernel does have access to this
>>> information it makes sense to forward it to userspace (as separate
>>> s390-specific property). If not then you need to make some kind of
>>> assumption like below and the user can tune the script according to
>>> their usecase.  
>>
>> Also true, standby memory really represents a distinct type of memory
>> block (memory seems to be there but really isn't). Right now I am
>> thinking about something like this (tried to formulate it on a very
>> generic level because we can't predict which mechanism might want to
>> make use of these types in the future).
>>
>>
>> /*
>>  * Memory block types allow user space to formulate rules if and how to
>>  * online memory blocks. The types are exposed to user space as text
>>  * strings in sysfs. While the typical online strategies are described
>>  * along with the types, there are use cases where that can differ (e.g.
>>  * use MOVABLE zone for more reliable huge page usage, use NORMAL zone
>>  * due to zone imbalance or because memory unplug is not intended).
>>  *
>>  * MEMORY_BLOCK_NONE:
>>  *  No memory block is to be created (e.g. device memory). Used internally
>>  *  only.
>>  *
>>  * MEMORY_BLOCK_REMOVABLE:
>>  *  This memory block type should be treated as if it can be
>>  *  removed/unplugged from the system again. E.g. there is a hardware
>>  *  interface to unplug such memory. This memory block type is usually
>>  *  onlined to the MOVABLE zone, to e.g. make offlining of it more
>>  *  reliable. Examples include ACPI and PPC DIMMs.
>>  *
>>  * MEMORY_BLOCK_UNREMOVABLE:
>>  *  This memory block type should be treated as if it can not be
>>  *  removed/unplugged again. E.g. there is no hardware interface to
>>  *  unplug such memory. This memory block type is usually onlined to
>>  *  the NORMAL zone, as offlining is not beneficial. Examples include boot
>>  *  memory on most architectures and memory added via balloon devices.
> 
> AFAIK the balloon device can be inflated as well, so this does not really
> describe how this memory type works in any meaningful way. Also, it
> should not be possible to see this kind of memory from userspace. The
> balloon driver just takes existing memory that is properly backed,
> allocates it for itself, and allows the hypervisor to use it. Thus it
> creates the equivalent of s390 standby memory, which is not backed in
> the VM. When memory is reclaimed from the hypervisor, the balloon driver
> frees it, making it available to the VM kernel again. However, the whole
> time the memory appears present in the machine, and no hotplug events
> should be visible unless the docs I am looking at are really outdated.

It's all not optimal yet.

Don't confuse what I describe here with inflated/deflated memory. Xen
and Hyper-V add *new* memory to the system using add_memory(). New
memory blocks. This memory will never be removed using the typical
"offline + remove_memory()" approach. It will be removed using
ballooning (if at all), and only in pieces. So it will usually be onlined
to the NORMAL zone (but user space can later implement whatever rule
it wants).

I am not talking about any kind of inflation/deflation. I am talking
about memory blocks added to the system via add_memory().

Inflation/deflation does not belong in the memory block interface.
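
The distinction is observable from user space today: ballooning changes
free memory within existing blocks, while add_memory()-style hotplug
creates new memory block directories (and udev "add" events). A rough
sketch using real sysfs/procfs paths, with the behavioral claim taken
from this thread rather than from any one hypervisor's documentation:

```shell
#!/bin/sh
# Snapshot the two quantities that move differently under ballooning
# vs. memory block hotplug. Ballooning shifts MemFree without changing
# the block count; Xen/Hyper-V hot-add increases the block count.

blocks()  { ls -d /sys/devices/system/memory/memory* 2>/dev/null | wc -l; }
free_kb() { awk '/^MemFree:/ {print $2}' /proc/meminfo; }

echo "memory blocks: $(blocks), free: $(free_kb) kB"
```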

> 
>>  *
>>  * MEMORY_BLOCK_STANDBY:
>>  *  The memory block type should be treated as if it can be
>>  *  removed/unplugged again, however the actual memory hot(un)plug is
>>  *  performed by onlining/offlining. In virtual environments, such memory
>>  *  is usually added during boot and never removed. Onlining memory will
>>  *  result in memory getting allocated to a VM. This memory type is usually
>>  *  not onlined automatically but explicitly by the administrator. One
>>  *  example is standby memory on s390x.
> 
> Again, this does not meaningfully describe the memory type. There is
> no memory on standby. There is in fact no backing at all unless you
> online it. So this probably is some kind of shared memory. However, the
> (de)allocation is controlled differently compared to the balloon device.
> The concept is very similar, though.

We have memory blocks and we have to describe them somehow. On s390x,
standby memory is modeled via memory blocks that are offline - that is
the way it is represented. I am still thinking about possible ways to
describe this via a memory type. And here the message should be "don't
online this unless you are aware of the consequences, this is not your
ordinary DIMM".

Which types of memory would you have in mind? The problem we are trying
to solve is to give user space an idea of if and how to online memory.
And to make it aware that there are different types that are expected to
be handled differently.

-- 

Thanks,

David / dhildenb
_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 144+ messages in thread

* [PATCH RFC] mm/memory_hotplug: Introduce memory block types
@ 2018-09-28 15:03 David Hildenbrand
  0 siblings, 0 replies; 144+ messages in thread
From: David Hildenbrand @ 2018-09-28 15:03 UTC (permalink / raw)
  To: linux-mm
  Cc: Kate Stewart, Rich Felker, linux-ia64, linux-sh, Peter Zijlstra,
	Benjamin Herrenschmidt, Balbir Singh, Dave Hansen,
	Heiko Carstens, Pavel Tatashin, Michal Hocko, Paul Mackerras,
	H. Peter Anvin, Rashmica Gupta, K. Y. Srinivasan,
	Boris Ostrovsky, linux-s390, Michael Neuling, Stephen Hemminger,
	Yoshinori Sato, Michael Ellerman, David Hildenbrand, linux-acpi,
	Ingo

How and when to online hotplugged memory is hard for distributions to
manage, because different memory types have to be treated differently.
Right now, we need complicated udev rules that e.g. check whether we are
running on s390x, on a physical system or on a virtualized system. But
sometimes there is also the demand to online memory immediately in the
kernel while adding it, without waiting for user space to make a
decision. And on virtualized systems the requirements may differ,
depending on "how" the memory was added (and whether it will eventually
get unplugged again - DIMM vs. paravirtualized mechanisms).

On the one hand, we have physical systems where we sometimes want to be
able to unplug memory again - e.g. a DIMM - so we optionally have to
online it to the MOVABLE zone. That decision is usually made in user
space.

On the other hand, we have memory that should never be onlined
automatically, but only when asked for by an administrator. Such memory
is found in virtualized environments like s390x, where the concept of
"standby" memory exists. Memory is detected and added during boot, so it
can be onlined when requested by the administrator or some tooling.
Only when onlining will the memory actually be allocated in the
hypervisor.

But then, we also have paravirtualized devices (namely xen and hyper-v
balloons), that hotplug memory that will never ever be removed from a
system right now using offline_pages/remove_memory. If at all, this memory
is logically unplugged and handed back to the hypervisor via ballooning.

For paravirtualized devices it is relevant that memory is onlined as
quickly as possible after adding - and that it is added to the NORMAL
zone. Otherwise, it could happen that too much memory is added in a row
(but not onlined), resulting in out-of-memory conditions due to the
additional memory needed for "struct page" and friends. The MOVABLE
zone, as well as onlining delays, can be very problematic and lead to
crashes (e.g. due to zone imbalance).
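As a back-of-the-envelope illustration of that overhead (the 4 KiB page
size and 64-byte struct page are typical x86_64 values, but both are
configuration dependent):

```c
#include <stdint.h>

/*
 * Rough illustration of why adding lots of memory without onlining it
 * is dangerous: the "struct page" array for the new memory is allocated
 * from memory that is *already* online.  The sizes below are
 * assumptions (4 KiB pages, 64 bytes per struct page, as on a typical
 * x86_64 configuration).
 */
static uint64_t struct_page_overhead(uint64_t added_bytes)
{
	const uint64_t page_size = 4096;
	const uint64_t sizeof_struct_page = 64;

	return added_bytes / page_size * sizeof_struct_page;
}
```

With these numbers the page array costs 1/64 of the added memory, so
hotplugging 128 GiB that stays offline still consumes about 2 GiB of
already-onlined memory before a single new page is usable.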

Therefore, introduce memory block types and online memory depending on
it when adding the memory. Expose the memory type to user space, so user
space handlers can start to process only "normal" memory. Other memory
block types can be ignored. One thing less to worry about in user space.
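For reference, a self-contained sketch of the type enum and its sysfs
string mapping. Only MEMORY_BLOCK_NORMAL appears verbatim in the patch
below; the other names and the exact strings are assumptions based on
this thread, and the real definition would live in
include/linux/memory.h:

```c
/*
 * Sketch of the proposed memory block types and the text strings user
 * space would see in sysfs.  Names other than MEMORY_BLOCK_NORMAL are
 * assumptions inferred from the discussion.
 */
enum memory_block_type {
	MEMORY_BLOCK_NONE,		/* no memory block created (device memory) */
	MEMORY_BLOCK_NORMAL,		/* ordinary memory, no special handling */
	MEMORY_BLOCK_REMOVABLE,		/* e.g. ACPI/PPC DIMMs */
	MEMORY_BLOCK_UNREMOVABLE,	/* e.g. boot memory, Xen/Hyper-V added memory */
	MEMORY_BLOCK_STANDBY,		/* e.g. s390x standby memory */
};

static const char *memory_block_type_str(enum memory_block_type t)
{
	switch (t) {
	case MEMORY_BLOCK_NORMAL:	return "normal";
	case MEMORY_BLOCK_REMOVABLE:	return "removable";
	case MEMORY_BLOCK_UNREMOVABLE:	return "unremovable";
	case MEMORY_BLOCK_STANDBY:	return "standby";
	default:			return "none";
	}
}
```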

Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: "Jonathan Neuschäfer" <j.neuschaefer@gmx.net>
Cc: Joe Perches <joe@perches.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Rashmica Gupta <rashmica.g@gmail.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: "mike.travis@hpe.com" <mike.travis@hpe.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mathieu Malaterre <malat@debian.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---

This patch is based on the current mm-tree, where some related
patches from me are currently residing that touched the add_memory()
functions.

 arch/ia64/mm/init.c                       |  4 +-
 arch/powerpc/mm/mem.c                     |  4 +-
 arch/powerpc/platforms/powernv/memtrace.c |  3 +-
 arch/s390/mm/init.c                       |  4 +-
 arch/sh/mm/init.c                         |  4 +-
 arch/x86/mm/init_32.c                     |  4 +-
 arch/x86/mm/init_64.c                     |  8 +--
 drivers/acpi/acpi_memhotplug.c            |  3 +-
 drivers/base/memory.c                     | 63 ++++++++++++++++++++---
 drivers/hv/hv_balloon.c                   | 33 ++----------
 drivers/s390/char/sclp_cmd.c              |  3 +-
 drivers/xen/balloon.c                     |  2 +-
 include/linux/memory.h                    | 28 +++++++++-
 include/linux/memory_hotplug.h            | 17 +++---
 mm/hmm.c                                  |  6 ++-
 mm/memory_hotplug.c                       | 31 ++++++-----
 16 files changed, 139 insertions(+), 78 deletions(-)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index d5e12ff1d73c..813d1d86bf95 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -646,13 +646,13 @@ mem_init (void)
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+		    int memory_block_type)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
-	ret = __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	ret = __add_pages(nid, start_pfn, nr_pages, altmap, memory_block_type);
 	if (ret)
 		printk("%s: Problem encountered in __add_pages() as ret=%d\n",
 		       __func__,  ret);
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 5551f5870dcc..dd32fcc9099c 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -118,7 +118,7 @@ int __weak remove_section_mapping(unsigned long start, unsigned long end)
 }
 
 int __meminit arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+			      int memory_block_type)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
@@ -135,7 +135,7 @@ int __meminit arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *
 	}
 	flush_inval_dcache_range(start, start + size);
 
-	return __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	return __add_pages(nid, start_pfn, nr_pages, altmap, memory_block_type);
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c
index 84d038ed3882..57d6b3d46382 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -232,7 +232,8 @@ static int memtrace_online(void)
 			ent->mem = 0;
 		}
 
-		if (add_memory(ent->nid, ent->start, ent->size)) {
+		if (add_memory(ent->nid, ent->start, ent->size,
+			       MEMORY_BLOCK_NORMAL)) {
 			pr_err("Failed to add trace memory to node %d\n",
 				ent->nid);
 			ret += 1;
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index e472cd763eb3..b5324527c7f6 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -222,7 +222,7 @@ device_initcall(s390_cma_mem_init);
 #endif /* CONFIG_CMA */
 
 int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+		    int memory_block_type)
 {
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long size_pages = PFN_DOWN(size);
@@ -232,7 +232,7 @@ int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
 	if (rc)
 		return rc;
 
-	rc = __add_pages(nid, start_pfn, size_pages, altmap, want_memblock);
+	rc = __add_pages(nid, start_pfn, size_pages, altmap, memory_block_type);
 	if (rc)
 		vmem_remove_mapping(start, size);
 	return rc;
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index c8c13c777162..6b876000731a 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -419,14 +419,14 @@ void free_initrd_mem(unsigned long start, unsigned long end)
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+		    int memory_block_type)
 {
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
 	/* We only have ZONE_NORMAL, so this is easy.. */
-	ret = __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	ret = __add_pages(nid, start_pfn, nr_pages, altmap, memory_block_type);
 	if (unlikely(ret))
 		printk("%s: Failed, __add_pages() == %d\n", __func__, ret);
 
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index f2837e4c40b3..4f50cd4467a9 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -851,12 +851,12 @@ void __init mem_init(void)
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+		    int memory_block_type)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
-	return __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	return __add_pages(nid, start_pfn, nr_pages, altmap, memory_block_type);
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 5fab264948c2..fc3df573f0f3 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -783,11 +783,11 @@ static void update_end_of_memory_vars(u64 start, u64 size)
 }
 
 int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
-		struct vmem_altmap *altmap, bool want_memblock)
+		struct vmem_altmap *altmap, int memory_block_type)
 {
 	int ret;
 
-	ret = __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	ret = __add_pages(nid, start_pfn, nr_pages, altmap, memory_block_type);
 	WARN_ON_ONCE(ret);
 
 	/* update max_pfn, max_low_pfn and high_memory */
@@ -798,14 +798,14 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
 }
 
 int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-		bool want_memblock)
+		    int memory_block_type)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
 	init_memory_mapping(start, start + size);
 
-	return add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	return add_pages(nid, start_pfn, nr_pages, altmap, memory_block_type);
 }
 
 #define PAGE_INUSE 0xFD
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index 8fe0960ea572..c5f646b4e97e 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -228,7 +228,8 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
 		if (node < 0)
 			node = memory_add_physaddr_to_nid(info->start_addr);
 
-		result = __add_memory(node, info->start_addr, info->length);
+		result = __add_memory(node, info->start_addr, info->length,
+				      MEMORY_BLOCK_NORMAL);
 
 		/*
 		 * If the memory block has been used by the kernel, add_memory()
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 0e5985682642..2686101e41b5 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -381,6 +381,32 @@ static ssize_t show_phys_device(struct device *dev,
 	return sprintf(buf, "%d\n", mem->phys_device);
 }
 
+static ssize_t type_show(struct device *dev, struct device_attribute *attr,
+			 char *buf)
+{
+	struct memory_block *mem = to_memory_block(dev);
+	ssize_t len = 0;
+
+	switch (mem->type) {
+	case MEMORY_BLOCK_NORMAL:
+		len = sprintf(buf, "normal\n");
+		break;
+	case MEMORY_BLOCK_STANDBY:
+		len = sprintf(buf, "standby\n");
+		break;
+	case MEMORY_BLOCK_PARAVIRT:
+		len = sprintf(buf, "paravirt\n");
+		break;
+	default:
+		len = sprintf(buf, "ERROR-UNKNOWN-%d\n",
+				mem->type);
+		WARN_ON(1);
+		break;
+	}
+
+	return len;
+}
+
 #ifdef CONFIG_MEMORY_HOTREMOVE
 static void print_allowed_zone(char *buf, int nid, unsigned long start_pfn,
 		unsigned long nr_pages, int online_type,
@@ -442,6 +468,7 @@ static DEVICE_ATTR(phys_index, 0444, show_mem_start_phys_index, NULL);
 static DEVICE_ATTR(state, 0644, show_mem_state, store_mem_state);
 static DEVICE_ATTR(phys_device, 0444, show_phys_device, NULL);
 static DEVICE_ATTR(removable, 0444, show_mem_removable, NULL);
+static DEVICE_ATTR_RO(type);
 
 /*
  * Block size attribute stuff
@@ -514,7 +541,8 @@ memory_probe_store(struct device *dev, struct device_attribute *attr,
 
 	nid = memory_add_physaddr_to_nid(phys_addr);
 	ret = __add_memory(nid, phys_addr,
-			   MIN_MEMORY_BLOCK_SIZE * sections_per_block);
+			   MIN_MEMORY_BLOCK_SIZE * sections_per_block,
+			   MEMORY_BLOCK_NORMAL);
 
 	if (ret)
 		goto out;
@@ -620,6 +648,7 @@ static struct attribute *memory_memblk_attrs[] = {
 	&dev_attr_state.attr,
 	&dev_attr_phys_device.attr,
 	&dev_attr_removable.attr,
+	&dev_attr_type.attr,
 #ifdef CONFIG_MEMORY_HOTREMOVE
 	&dev_attr_valid_zones.attr,
 #endif
@@ -657,13 +686,17 @@ int register_memory(struct memory_block *memory)
 }
 
 static int init_memory_block(struct memory_block **memory,
-			     struct mem_section *section, unsigned long state)
+			     struct mem_section *section, unsigned long state,
+			     int memory_block_type)
 {
 	struct memory_block *mem;
 	unsigned long start_pfn;
 	int scn_nr;
 	int ret = 0;
 
+	if (memory_block_type == MEMORY_BLOCK_NONE)
+		return -EINVAL;
+
 	mem = kzalloc(sizeof(*mem), GFP_KERNEL);
 	if (!mem)
 		return -ENOMEM;
@@ -675,6 +708,7 @@ static int init_memory_block(struct memory_block **memory,
 	mem->state = state;
 	start_pfn = section_nr_to_pfn(mem->start_section_nr);
 	mem->phys_device = arch_get_memory_phys_device(start_pfn);
+	mem->type = memory_block_type;
 
 	ret = register_memory(mem);
 
@@ -699,7 +733,8 @@ static int add_memory_block(int base_section_nr)
 
 	if (section_count == 0)
 		return 0;
-	ret = init_memory_block(&mem, __nr_to_section(section_nr), MEM_ONLINE);
+	ret = init_memory_block(&mem, __nr_to_section(section_nr), MEM_ONLINE,
+				MEMORY_BLOCK_NORMAL);
 	if (ret)
 		return ret;
 	mem->section_count = section_count;
@@ -710,19 +745,35 @@ static int add_memory_block(int base_section_nr)
  * need an interface for the VM to add new memory regions,
  * but without onlining it.
  */
-int hotplug_memory_register(int nid, struct mem_section *section)
+int hotplug_memory_register(int nid, struct mem_section *section,
+			    int memory_block_type)
 {
 	int ret = 0;
 	struct memory_block *mem;
 
 	mutex_lock(&mem_sysfs_mutex);
 
+	/* make sure there is no memblock if we don't want one */
+	if (memory_block_type == MEMORY_BLOCK_NONE) {
+		mem = find_memory_block(section);
+		if (mem) {
+			put_device(&mem->dev);
+			ret = -EINVAL;
+		}
+		goto out;
+	}
+
 	mem = find_memory_block(section);
 	if (mem) {
-		mem->section_count++;
+		/* make sure the type matches */
+		if (mem->type == memory_block_type)
+			mem->section_count++;
+		else
+			ret = -EINVAL;
 		put_device(&mem->dev);
 	} else {
-		ret = init_memory_block(&mem, section, MEM_OFFLINE);
+		ret = init_memory_block(&mem, section, MEM_OFFLINE,
+					memory_block_type);
 		if (ret)
 			goto out;
 		mem->section_count++;
diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index b1b788082793..5a8d18c4d699 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -537,11 +537,6 @@ struct hv_dynmem_device {
 	 */
 	bool host_specified_ha_region;
 
-	/*
-	 * State to synchronize hot-add.
-	 */
-	struct completion  ol_waitevent;
-	bool ha_waiting;
 	/*
 	 * This thread handles hot-add
 	 * requests from the host as well as notifying
@@ -640,14 +635,6 @@ static int hv_memory_notifier(struct notifier_block *nb, unsigned long val,
 	unsigned long flags, pfn_count;
 
 	switch (val) {
-	case MEM_ONLINE:
-	case MEM_CANCEL_ONLINE:
-		if (dm_device.ha_waiting) {
-			dm_device.ha_waiting = false;
-			complete(&dm_device.ol_waitevent);
-		}
-		break;
-
 	case MEM_OFFLINE:
 		spin_lock_irqsave(&dm_device.ha_lock, flags);
 		pfn_count = hv_page_offline_check(mem->start_pfn,
@@ -665,9 +652,7 @@ static int hv_memory_notifier(struct notifier_block *nb, unsigned long val,
 		}
 		spin_unlock_irqrestore(&dm_device.ha_lock, flags);
 		break;
-	case MEM_GOING_ONLINE:
-	case MEM_GOING_OFFLINE:
-	case MEM_CANCEL_OFFLINE:
+	default:
 		break;
 	}
 	return NOTIFY_OK;
@@ -731,12 +716,10 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size,
 		has->covered_end_pfn +=  processed_pfn;
 		spin_unlock_irqrestore(&dm_device.ha_lock, flags);
 
-		init_completion(&dm_device.ol_waitevent);
-		dm_device.ha_waiting = !memhp_auto_online;
-
 		nid = memory_add_physaddr_to_nid(PFN_PHYS(start_pfn));
 		ret = add_memory(nid, PFN_PHYS((start_pfn)),
-				(HA_CHUNK << PAGE_SHIFT));
+				 (HA_CHUNK << PAGE_SHIFT),
+				 MEMORY_BLOCK_PARAVIRT);
 
 		if (ret) {
 			pr_err("hot_add memory failed error is %d\n", ret);
@@ -757,16 +740,6 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size,
 			break;
 		}
 
-		/*
-		 * Wait for the memory block to be onlined when memory onlining
-		 * is done outside of kernel (memhp_auto_online). Since the hot
-		 * add has succeeded, it is ok to proceed even if the pages in
-		 * the hot added region have not been "onlined" within the
-		 * allowed time.
-		 */
-		if (dm_device.ha_waiting)
-			wait_for_completion_timeout(&dm_device.ol_waitevent,
-						    5*HZ);
 		post_status(&dm_device);
 	}
 }
diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c
index d7686a68c093..1928a2411456 100644
--- a/drivers/s390/char/sclp_cmd.c
+++ b/drivers/s390/char/sclp_cmd.c
@@ -406,7 +406,8 @@ static void __init add_memory_merged(u16 rn)
 	if (!size)
 		goto skip_add;
 	for (addr = start; addr < start + size; addr += block_size)
-		add_memory(numa_pfn_to_nid(PFN_DOWN(addr)), addr, block_size);
+		add_memory(numa_pfn_to_nid(PFN_DOWN(addr)), addr, block_size,
+			   MEMORY_BLOCK_STANDBY);
 skip_add:
 	first_rn = rn;
 	num = 1;
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index fdfc64f5acea..291a8aac6af3 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -397,7 +397,7 @@ static enum bp_state reserve_additional_memory(void)
 	mutex_unlock(&balloon_mutex);
 	/* add_memory_resource() requires the device_hotplug lock */
 	lock_device_hotplug();
-	rc = add_memory_resource(nid, resource, memhp_auto_online);
+	rc = add_memory_resource(nid, resource, MEMORY_BLOCK_PARAVIRT);
 	unlock_device_hotplug();
 	mutex_lock(&balloon_mutex);
 
diff --git a/include/linux/memory.h b/include/linux/memory.h
index a6ddefc60517..3dc2a0b12653 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -23,6 +23,30 @@
 
 #define MIN_MEMORY_BLOCK_SIZE     (1UL << SECTION_SIZE_BITS)
 
+/*
+ * NONE:     No memory block is to be created (e.g. device memory).
+ * NORMAL:   Memory block that represents normal (boot or hotplugged) memory
+ *           (e.g. ACPI DIMMs) that should be onlined either automatically
+ *           (memhp_auto_online) or manually by user space to select a
+ *           specific zone.
+ *           Applicable to memhp_auto_online.
+ * STANDBY:  Memory block that represents standby memory that should only
+ *           be onlined on demand by user space (e.g. standby memory on
+ *           s390x), but never automatically by the kernel.
+ *           Not applicable to memhp_auto_online.
+ * PARAVIRT: Memory block that represents memory added by
+ *           paravirtualized mechanisms (e.g. hyper-v, xen) that will
+ *           always automatically get onlined. Memory will be unplugged
+ *           using ballooning, not by relying on the MOVABLE ZONE.
+ *           Not applicable to memhp_auto_online.
+ */
+enum {
+	MEMORY_BLOCK_NONE,
+	MEMORY_BLOCK_NORMAL,
+	MEMORY_BLOCK_STANDBY,
+	MEMORY_BLOCK_PARAVIRT,
+};
+
 struct memory_block {
 	unsigned long start_section_nr;
 	unsigned long end_section_nr;
@@ -34,6 +58,7 @@ struct memory_block {
 	int (*phys_callback)(struct memory_block *);
 	struct device dev;
 	int nid;			/* NID for this memory block */
+	int type;			/* type of this memory block */
 };
 
 int arch_get_memory_phys_device(unsigned long start_pfn);
@@ -111,7 +136,8 @@ extern int register_memory_notifier(struct notifier_block *nb);
 extern void unregister_memory_notifier(struct notifier_block *nb);
 extern int register_memory_isolate_notifier(struct notifier_block *nb);
 extern void unregister_memory_isolate_notifier(struct notifier_block *nb);
-int hotplug_memory_register(int nid, struct mem_section *section);
+int hotplug_memory_register(int nid, struct mem_section *section,
+			    int memory_block_type);
 #ifdef CONFIG_MEMORY_HOTREMOVE
 extern int unregister_memory_section(struct mem_section *);
 #endif
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index ffd9cd10fcf3..b560a9ee0e8c 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -115,18 +115,18 @@ extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
 
 /* reasonably generic interface to expand the physical pages */
 extern int __add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
-		struct vmem_altmap *altmap, bool want_memblock);
+		struct vmem_altmap *altmap, int memory_block_type);
 
 #ifndef CONFIG_ARCH_HAS_ADD_PAGES
 static inline int add_pages(int nid, unsigned long start_pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap,
-		bool want_memblock)
+		int memory_block_type)
 {
-	return __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+	return __add_pages(nid, start_pfn, nr_pages, altmap, memory_block_type);
 }
 #else /* ARCH_HAS_ADD_PAGES */
 int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
-		struct vmem_altmap *altmap, bool want_memblock);
+	      struct vmem_altmap *altmap, int memory_block_type);
 #endif /* ARCH_HAS_ADD_PAGES */
 
 #ifdef CONFIG_NUMA
@@ -324,11 +324,12 @@ static inline void __remove_memory(int nid, u64 start, u64 size) {}
 extern void __ref free_area_init_core_hotplug(int nid);
 extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
 		void *arg, int (*func)(struct memory_block *, void *));
-extern int __add_memory(int nid, u64 start, u64 size);
-extern int add_memory(int nid, u64 start, u64 size);
-extern int add_memory_resource(int nid, struct resource *resource, bool online);
+extern int __add_memory(int nid, u64 start, u64 size, int memory_block_type);
+extern int add_memory(int nid, u64 start, u64 size, int memory_block_type);
+extern int add_memory_resource(int nid, struct resource *resource,
+			       int memory_block_type);
 extern int arch_add_memory(int nid, u64 start, u64 size,
-		struct vmem_altmap *altmap, bool want_memblock);
+			   struct vmem_altmap *altmap, int memory_block_type);
 extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
diff --git a/mm/hmm.c b/mm/hmm.c
index c968e49f7a0c..2350f6f6ab42 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -32,6 +32,7 @@
 #include <linux/jump_label.h>
 #include <linux/mmu_notifier.h>
 #include <linux/memory_hotplug.h>
+#include <linux/memory.h>
 
 #define PA_SECTION_SIZE (1UL << PA_SECTION_SHIFT)
 
@@ -1096,10 +1097,11 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
 	 */
 	if (devmem->pagemap.type == MEMORY_DEVICE_PUBLIC)
 		ret = arch_add_memory(nid, align_start, align_size, NULL,
-				false);
+				      MEMORY_BLOCK_NONE);
 	else
 		ret = add_pages(nid, align_start >> PAGE_SHIFT,
-				align_size >> PAGE_SHIFT, NULL, false);
+				align_size >> PAGE_SHIFT, NULL,
+				MEMORY_BLOCK_NONE);
 	if (ret) {
 		mem_hotplug_done();
 		goto error_add_memory;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d4c7e42e46f3..bce6c41d721c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -246,7 +246,7 @@ void __init register_page_bootmem_info_node(struct pglist_data *pgdat)
 #endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */
 
 static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
-		struct vmem_altmap *altmap, bool want_memblock)
+		struct vmem_altmap *altmap, int memory_block_type)
 {
 	int ret;
 
@@ -257,10 +257,11 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
 	if (ret < 0)
 		return ret;
 
-	if (!want_memblock)
+	if (memory_block_type == MEMORY_BLOCK_NONE)
 		return 0;
 
-	return hotplug_memory_register(nid, __pfn_to_section(phys_start_pfn));
+	return hotplug_memory_register(nid, __pfn_to_section(phys_start_pfn),
+				       memory_block_type);
 }
 
 /*
@@ -271,7 +272,7 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
  */
 int __ref __add_pages(int nid, unsigned long phys_start_pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap,
-		bool want_memblock)
+		int memory_block_type)
 {
 	unsigned long i;
 	int err = 0;
@@ -296,7 +297,7 @@ int __ref __add_pages(int nid, unsigned long phys_start_pfn,
 
 	for (i = start_sec; i <= end_sec; i++) {
 		err = __add_section(nid, section_nr_to_pfn(i), altmap,
-				want_memblock);
+				    memory_block_type);
 
 		/*
 		 * EEXIST is finally dealt with by ioresource collision
@@ -1099,7 +1100,8 @@ static int online_memory_block(struct memory_block *mem, void *arg)
  *
  * we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG
  */
-int __ref add_memory_resource(int nid, struct resource *res, bool online)
+int __ref add_memory_resource(int nid, struct resource *res,
+			      int memory_block_type)
 {
 	u64 start, size;
 	bool new_node = false;
@@ -1108,6 +1110,9 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
 	start = res->start;
 	size = resource_size(res);
 
+	if (memory_block_type == MEMORY_BLOCK_NONE)
+		return -EINVAL;
+
 	ret = check_hotplug_memory_range(start, size);
 	if (ret)
 		return ret;
@@ -1128,7 +1133,7 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
 	new_node = ret;
 
 	/* call arch's memory hotadd */
-	ret = arch_add_memory(nid, start, size, NULL, true);
+	ret = arch_add_memory(nid, start, size, NULL, memory_block_type);
 	if (ret < 0)
 		goto error;
 
@@ -1153,8 +1158,8 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
 	/* device_online() will take the lock when calling online_pages() */
 	mem_hotplug_done();
 
-	/* online pages if requested */
-	if (online)
+	if (memory_block_type == MEMORY_BLOCK_PARAVIRT ||
+	    (memory_block_type == MEMORY_BLOCK_NORMAL && memhp_auto_online))
 		walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1),
 				  NULL, online_memory_block);
 
@@ -1169,7 +1174,7 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
 }
 
 /* requires device_hotplug_lock, see add_memory_resource() */
-int __ref __add_memory(int nid, u64 start, u64 size)
+int __ref __add_memory(int nid, u64 start, u64 size, int memory_block_type)
 {
 	struct resource *res;
 	int ret;
@@ -1178,18 +1183,18 @@ int __ref __add_memory(int nid, u64 start, u64 size)
 	if (IS_ERR(res))
 		return PTR_ERR(res);
 
-	ret = add_memory_resource(nid, res, memhp_auto_online);
+	ret = add_memory_resource(nid, res, memory_block_type);
 	if (ret < 0)
 		release_memory_resource(res);
 	return ret;
 }
 
-int add_memory(int nid, u64 start, u64 size)
+int add_memory(int nid, u64 start, u64 size, int memory_block_type)
 {
 	int rc;
 
 	lock_device_hotplug();
-	rc = __add_memory(nid, start, size);
+	rc = __add_memory(nid, start, size, memory_block_type);
 	unlock_device_hotplug();
 
 	return rc;
-- 
2.17.1

