* [PATCH RFC v1] mm: is_mem_section_removable() overhaul
@ 2020-01-17 10:57 David Hildenbrand
  2020-01-17 11:33 ` Michal Hocko
From: David Hildenbrand @ 2020-01-17 10:57 UTC
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Greg Kroah-Hartman,
	Rafael J. Wysocki, Andrew Morton, Leonardo Bras, Nathan Lynch,
	Allison Randal, Nathan Fontenot, Thomas Gleixner, Michal Hocko,
	Dan Williams, Stephen Rothwell, Anshuman Khandual, lantianyu1986,
	linuxppc-dev

Let's refactor that code. We want to check if we can offline memory
blocks. Add a new function is_memory_block_offlineable() for that and
make it call is_mem_section_offlineable() for each contained section.
Within is_mem_section_offlineable(), add some more sanity checks and
directly bail out if the section contains holes or if it spans multiple
zones.

The old code was inherently racy with concurrent offlining/memory
unplug. Let's avoid that and grab the device_hotplug_lock. Luckily
we are already holding it when calling from powerpc code.

Note1: If somebody wants to export this function for use in driver code, we
need a variant that takes the device_hotplug_lock.
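
A rough sketch of what such a variant could look like (hypothetical
name, not part of this patch):

bool is_memory_block_offlineable_locked(struct memory_block *mem)
{
	bool rc;

	/* take the lock on behalf of (driver) callers */
	lock_device_hotplug();
	rc = is_memory_block_offlineable(mem);
	unlock_device_hotplug();
	return rc;
}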

Note2: If we could have a zombie device (not clear yet), the present
section checks would properly bail out early.

Note3: I'd prefer the mem_hotplug_lock in read, but as we are about to
change the locking on the removal path (IOW, don't hold it when removing
memory block devices), I do not want to go down that path.

Note4: So far we would have returned "removable" even though offlining
would have been blocked by memory holes, multiple zones, or missing
sections.

Tested with DIMMs on x86-64. Compile-tested on Power.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Leonardo Bras <leonardo@linux.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Allison Randal <allison@lohutok.net>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: lantianyu1986@gmail.com
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 .../platforms/pseries/hotplug-memory.c        | 24 ++-----
 drivers/base/memory.c                         | 37 ++++++----
 include/linux/memory.h                        |  1 +
 include/linux/memory_hotplug.h                |  5 +-
 mm/memory_hotplug.c                           | 68 +++++++++----------
 5 files changed, 67 insertions(+), 68 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index c126b94d1943..8d80159465e4 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -337,34 +337,24 @@ static int pseries_remove_mem_node(struct device_node *np)
 
 static bool lmb_is_removable(struct drmem_lmb *lmb)
 {
-	int i, scns_per_block;
-	bool rc = true;
-	unsigned long pfn, block_sz;
-	u64 phys_addr;
+	struct memory_block *mem;
+	bool rc = false;
 
 	if (!(lmb->flags & DRCONF_MEM_ASSIGNED))
 		return false;
 
-	block_sz = memory_block_size_bytes();
-	scns_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;
-	phys_addr = lmb->base_addr;
-
 #ifdef CONFIG_FA_DUMP
 	/*
 	 * Don't hot-remove memory that falls in fadump boot memory area
 	 * and memory that is reserved for capturing old kernel memory.
 	 */
-	if (is_fadump_memory_area(phys_addr, block_sz))
+	if (is_fadump_memory_area(lmb->base_addr, memory_block_size_bytes()))
 		return false;
 #endif
-
-	for (i = 0; i < scns_per_block; i++) {
-		pfn = PFN_DOWN(phys_addr);
-		if (!pfn_present(pfn))
-			continue;
-
-		rc = rc && is_mem_section_removable(pfn, PAGES_PER_SECTION);
-		phys_addr += MIN_MEMORY_BLOCK_SIZE;
+	mem = lmb_to_memblock(lmb);
+	if (mem) {
+		rc = is_memory_block_offlineable(mem);
+		put_device(&mem->dev);
 	}
 
 	return rc;
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index c6d288fad493..f744250c34d0 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -104,6 +104,25 @@ static ssize_t phys_index_show(struct device *dev,
 	return sprintf(buf, "%08lx\n", phys_index);
 }
 
+/*
+ * Test if a memory block is likely to be offlineable. Returns true if
+ * the block is already offline.
+ *
+ * Called under device_hotplug_lock.
+ */
+bool is_memory_block_offlineable(struct memory_block *mem)
+{
+	int i;
+
+	if (mem->state != MEM_ONLINE)
+		return true;
+
+	for (i = 0; i < sections_per_block; i++)
+		if (!is_mem_section_offlineable(mem->start_section_nr + i))
+			return false;
+	return true;
+}
+
 /*
  * Show whether the memory block is likely to be offlineable (or is already
  * offline). Once offline, the memory block could be removed. The return
@@ -114,20 +133,14 @@ static ssize_t removable_show(struct device *dev, struct device_attribute *attr,
 			      char *buf)
 {
 	struct memory_block *mem = to_memory_block(dev);
-	unsigned long pfn;
-	int ret = 1, i;
-
-	if (mem->state != MEM_ONLINE)
-		goto out;
+	int ret;
 
-	for (i = 0; i < sections_per_block; i++) {
-		if (!present_section_nr(mem->start_section_nr + i))
-			continue;
-		pfn = section_nr_to_pfn(mem->start_section_nr + i);
-		ret &= is_mem_section_removable(pfn, PAGES_PER_SECTION);
-	}
+	ret = lock_device_hotplug_sysfs();
+	if (ret)
+		return ret;
+	ret = is_memory_block_offlineable(mem);
+	unlock_device_hotplug();
 
-out:
 	return sprintf(buf, "%d\n", ret);
 }
 
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 0b8d791b6669..faf03eb64ecc 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -91,6 +91,7 @@ typedef int (*walk_memory_blocks_func_t)(struct memory_block *, void *);
 extern int walk_memory_blocks(unsigned long start, unsigned long size,
 			      void *arg, walk_memory_blocks_func_t func);
 extern int for_each_memory_block(void *arg, walk_memory_blocks_func_t func);
+extern bool is_memory_block_offlineable(struct memory_block *mem);
 #define CONFIG_MEM_BLOCK_SIZE	(PAGES_PER_SECTION<<PAGE_SHIFT)
 #endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
 
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index f4d59155f3d4..8d087772f748 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -306,15 +306,14 @@ static inline void pgdat_resize_init(struct pglist_data *pgdat) {}
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
 
-extern bool is_mem_section_removable(unsigned long pfn, unsigned long nr_pages);
+extern bool is_mem_section_offlineable(unsigned long nr);
 extern void try_offline_node(int nid);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern int remove_memory(int nid, u64 start, u64 size);
 extern void __remove_memory(int nid, u64 start, u64 size);
 
 #else
-static inline bool is_mem_section_removable(unsigned long pfn,
-					unsigned long nr_pages)
+static inline bool is_mem_section_offlineable(unsigned long nr)
 {
 	return false;
 }
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7a6de9b0dcab..a6d14d2b7f0c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1128,46 +1128,51 @@ static unsigned long next_active_pageblock(unsigned long pfn)
 	return pfn + pageblock_nr_pages;
 }
 
-static bool is_pageblock_removable_nolock(unsigned long pfn)
+static int count_system_ram_pages_cb(unsigned long start_pfn,
+				     unsigned long nr_pages, void *data)
 {
-	struct page *page = pfn_to_page(pfn);
+	unsigned long *nr_system_ram_pages = data;
+
+	*nr_system_ram_pages += nr_pages;
+	return 0;
+}
+
+/*
+ * Check if a section is likely to be offlineable.
+ *
+ * Called with device_hotplug_lock.
+ */
+bool is_mem_section_offlineable(unsigned long nr)
+{
+	const unsigned long start_pfn = section_nr_to_pfn(nr);
+	const unsigned long end_pfn = start_pfn + PAGES_PER_SECTION;
+	unsigned long pfn, nr_pages = 0;
 	struct zone *zone;
 
-	/*
-	 * We have to be careful here because we are iterating over memory
-	 * sections which are not zone aware so we might end up outside of
-	 * the zone but still within the section.
-	 * We have to take care about the node as well. If the node is offline
-	 * its NODE_DATA will be NULL - see page_zone.
-	 */
-	if (!node_online(page_to_nid(page)))
+	if (!present_section_nr(nr))
 		return false;
-
-	zone = page_zone(page);
-	pfn = page_to_pfn(page);
-	if (!zone_spans_pfn(zone, pfn))
+	if (!online_section_nr(nr))
 		return false;
 
-	return !has_unmovable_pages(zone, page, MIGRATE_MOVABLE,
-				    MEMORY_OFFLINE);
-}
-
-/* Checks if this range of memory is likely to be hot-removable. */
-bool is_mem_section_removable(unsigned long start_pfn, unsigned long nr_pages)
-{
-	unsigned long end_pfn, pfn;
+	/* we don't allow offlining sections with holes */
+	walk_system_ram_range(start_pfn, PAGES_PER_SECTION, &nr_pages,
+			      count_system_ram_pages_cb);
+	if (nr_pages != PAGES_PER_SECTION)
+		return false;
 
-	end_pfn = min(start_pfn + nr_pages,
-			zone_end_pfn(page_zone(pfn_to_page(start_pfn))));
+	/* we don't allow offlining sections with mixed zones/nodes */
+	zone = test_pages_in_a_zone(start_pfn, end_pfn);
+	if (!zone)
+		return false;
 
-	/* Check the starting page of each pageblock within the range */
+	/* check each pageblock for unmovable pages */
 	for (pfn = start_pfn; pfn < end_pfn; pfn = next_active_pageblock(pfn)) {
-		if (!is_pageblock_removable_nolock(pfn))
+		if (has_unmovable_pages(zone, pfn_to_page(pfn), MIGRATE_MOVABLE,
+					MEMORY_OFFLINE))
 			return false;
 		cond_resched();
 	}
 
-	/* All pageblocks in the memory block are likely to be hot-removable */
 	return true;
 }
 
@@ -1436,15 +1441,6 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
 		node_clear_state(node, N_MEMORY);
 }
 
-static int count_system_ram_pages_cb(unsigned long start_pfn,
-				     unsigned long nr_pages, void *data)
-{
-	unsigned long *nr_system_ram_pages = data;
-
-	*nr_system_ram_pages += nr_pages;
-	return 0;
-}
-
 static int __ref __offline_pages(unsigned long start_pfn,
 		  unsigned long end_pfn)
 {
-- 
2.24.1



* Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
  2020-01-17 10:57 [PATCH RFC v1] mm: is_mem_section_removable() overhaul David Hildenbrand
@ 2020-01-17 11:33 ` Michal Hocko
  2020-01-17 13:08   ` David Hildenbrand
From: Michal Hocko @ 2020-01-17 11:33 UTC
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Greg Kroah-Hartman, Rafael J. Wysocki,
	Andrew Morton, Leonardo Bras, Nathan Lynch, Allison Randal,
	Nathan Fontenot, Thomas Gleixner, Dan Williams, Stephen Rothwell,
	Anshuman Khandual, lantianyu1986, linuxppc-dev

On Fri 17-01-20 11:57:59, David Hildenbrand wrote:
> Let's refactor that code. We want to check if we can offline memory
> blocks. Add a new function is_memory_block_offlineable() for that and
> make it call is_mem_section_offlineable() for each contained section.
> Within is_mem_section_offlineable(), add some more sanity checks and
> directly bail out if the section contains holes or if it spans multiple
> zones.

I didn't read the patch (yet) but I am wondering. If we want to touch
this code, can we simply always return true there? I mean whoever
depends on this check is racy and the failure can happen even after
the sysfs says good to go, right? The check is essentially as expensive
as calling the offlining code itself. So the only usecase I can think of
is a dumb driver to crawl over blocks and check which is removable and
try to hotremove it. But just trying to offline one block after another
is essentially going to achieve the same.
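
(Illustratively, the "just try" approach in userspace is no more than
the following sketch - the sysfs path is the documented ABI, the helper
itself is made up:)

#include <stdio.h>

/* Try to offline memory block `nr` by writing "offline" to its state
 * file; a failed write (e.g. EBUSY) is the authoritative answer. */
static int try_offline_block(unsigned long nr)
{
	char path[64];
	FILE *f;
	int rc = 0;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/memory/memory%lu/state", nr);
	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fputs("offline", f) == EOF)
		rc = -1;
	if (fclose(f))	/* with stdio buffering, the error may show here */
		rc = -1;
	return rc;
}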

Or does anybody see any reasonable usecase that would break if we did
that unconditional behavior?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
  2020-01-17 11:33 ` Michal Hocko
@ 2020-01-17 13:08   ` David Hildenbrand
  2020-01-17 14:52     ` Michal Hocko
From: David Hildenbrand @ 2020-01-17 13:08 UTC
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Greg Kroah-Hartman, Rafael J. Wysocki,
	Andrew Morton, Leonardo Bras, Nathan Lynch, Allison Randal,
	Nathan Fontenot, Thomas Gleixner, Dan Williams, Stephen Rothwell,
	Anshuman Khandual, lantianyu1986, linuxppc-dev

On 17.01.20 12:33, Michal Hocko wrote:
> On Fri 17-01-20 11:57:59, David Hildenbrand wrote:
>> Let's refactor that code. We want to check if we can offline memory
>> blocks. Add a new function is_memory_block_offlineable() for that and
>> make it call is_mem_section_offlineable() for each contained section.
>> Within is_mem_section_offlineable(), add some more sanity checks and
>> directly bail out if the section contains holes or if it spans multiple
>> zones.
> 
> I didn't read the patch (yet) but I am wondering. If we want to touch
> this code, can we simply always return true there? I mean whoever
> depends on this check is racy and the failure can happen even after
> the sysfs says good to go, right? The check is essentially as expensive
> as calling the offlining code itself. So the only usecase I can think of
> is a dumb driver to crawl over blocks and check which is removable and
> try to hotremove it. But just trying to offline one block after another
> is essentially going to achieve the same.

Some thoughts:

1. It allows you to check if memory is likely to be offlineable without
doing expensive locking and trying to isolate pages (meaning:
zone->lock, mem_hotplug_lock. but also, calling drain_all_pages()
when isolating)

2. There are use cases that want to identify a memory block/DIMM to
unplug. One example is PPC DLPAR code (see this patch). Going over all
memory block trying to offline them is an expensive operation.

3. powerpc-utils (https://github.com/ibm-power-utilities/powerpc-utils)
makes use of /sys/.../removable to speed up the search AFAIK (see the
sketch after this list).

4. lsmem displays/groups by "removable".

5. If "removable=false" then it usually really is not offlineable.
Of course, there could also be races (free the last unmovable page),
but it means "don't even try". OTOH, "removable=true" is more racy,
and gives less guarantees. ("looks okay, feel free to try")
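
FWIW, the userspace side of 3./5. boils down to something like this
sketch (only the sysfs path is the real, documented ABI; the helper is
made up):

#include <stdio.h>

/* Read /sys/.../memoryN/removable: 0 means "don't even try", 1 means
 * "looks okay, feel free to try", -1 means the attribute is gone. */
static int block_removable(unsigned long nr)
{
	char path[64];
	FILE *f;
	int val = -1;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/memory/memory%lu/removable", nr);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%d", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}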

> 
> Or does anybody see any reasonable usecase that would break if we did
> that unconditional behavior?

If we would always return "true", then the whole reason the
interface originally was introduced would be "broken" (meaning, less
performant as you would try to offline any memory block).

commit 5c755e9fd813810680abd56ec09a5f90143e815b
Author: Badari Pulavarty <pbadari@us.ibm.com>
Date:   Wed Jul 23 21:28:19 2008 -0700

    memory-hotplug: add sysfs removable attribute for hotplug memory remove
    
    Memory may be hot-removed on a per-memory-block basis, particularly on
    POWER where the SPARSEMEM section size often matches the memory-block
    size.  A user-level agent must be able to identify which sections of
    memory are likely to be removable before attempting the potentially
    expensive operation.  This patch adds a file called "removable" to the
    memory directory in sysfs to help such an agent.  In this patch, a memory
    block is considered removable if;


I'd love to see this go away (just like "valid_zones"), but I don't
think it is possible.

So this patch makes it work a little more correctly (multiplezones, holes),
cleans up the code and avoids races with unplug code. It can, however,
not give more guarantees if memory offlining will actually succeed.

-- 
Thanks,

David / dhildenb



* Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
  2020-01-17 13:08   ` David Hildenbrand
@ 2020-01-17 14:52     ` Michal Hocko
  2020-01-17 14:58       ` David Hildenbrand
From: Michal Hocko @ 2020-01-17 14:52 UTC
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Greg Kroah-Hartman, Rafael J. Wysocki,
	Andrew Morton, Leonardo Bras, Nathan Lynch, Allison Randal,
	Nathan Fontenot, Thomas Gleixner, Dan Williams, Stephen Rothwell,
	Anshuman Khandual, lantianyu1986, linuxppc-dev

On Fri 17-01-20 14:08:06, David Hildenbrand wrote:
> On 17.01.20 12:33, Michal Hocko wrote:
> > On Fri 17-01-20 11:57:59, David Hildenbrand wrote:
> >> Let's refactor that code. We want to check if we can offline memory
> >> blocks. Add a new function is_memory_block_offlineable() for that and
> >> make it call is_mem_section_offlineable() for each contained section.
> >> Within is_mem_section_offlineable(), add some more sanity checks and
> >> directly bail out if the section contains holes or if it spans multiple
> >> zones.
> > 
> > I didn't read the patch (yet) but I am wondering. If we want to touch
> > this code, can we simply always return true there? I mean whoever
> > depends on this check is racy and the failure can happen even after
> > the sysfs says good to go, right? The check is essentially as expensive
> > as calling the offlining code itself. So the only usecase I can think of
> > is a dumb driver to crawl over blocks and check which is removable and
> > try to hotremove it. But just trying to offline one block after another
> > is essentially going to achieve the same.
> 
> Some thoughts:
> 
> 1. It allows you to check if memory is likely to be offlineable without
> doing expensive locking and trying to isolate pages (meaning:
> zone->lock, mem_hotplug_lock. but also, calling drain_all_pages()
> when isolating)
> 
> 2. There are use cases that want to identify a memory block/DIMM to
> unplug. One example is PPC DLPAR code (see this patch). Going over all
> memory block trying to offline them is an expensive operation.
> 
> 3. powerpc-utils (https://github.com/ibm-power-utilities/powerpc-utils)
> makes use of /sys/.../removable to speed up the search AFAIK.

Well, while I do see those points I am not really sure they are worth
having a broken (by-definition) interface.
 
> 4. lsmem displays/groups by "removable".

Is anybody really using that?

> 5. If "removable=false" then it usually really is not offlineable.
> Of course, there could also be races (free the last unmovable page),
> but it means "don't even try". OTOH, "removable=true" is more racy,
> and gives less guarantees. ("looks okay, feel free to try")

Yeah, but you could be already pessimistic and try movable zones before
other kernel zones.

> > Or does anybody see any reasonable usecase that would break if we did
> > that unconditional behavior?
> 
> >> If we would always return "true", then the whole reason the
> interface originally was introduced would be "broken" (meaning, less
> performant as you would try to offline any memory block).

I would argue that the whole interface is broken ;). Not the first time
in the kernel development history and not the last time either. What I
am trying to say here is that unless there are _real_ usecases depending
on knowing that something surely is _not_ offlineable then I would just
try to drop the functionality while preserving the interface and see
what happens.
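
(IOW, something like the following sketch - keep the file, drop the
logic:)

static ssize_t removable_show(struct device *dev,
			      struct device_attribute *attr, char *buf)
{
	/* pretend everything is removable; offlining itself may still fail */
	return sprintf(buf, "%d\n", 1);
}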
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
  2020-01-17 14:52     ` Michal Hocko
@ 2020-01-17 14:58       ` David Hildenbrand
  2020-01-17 15:29         ` Michal Hocko
From: David Hildenbrand @ 2020-01-17 14:58 UTC
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Greg Kroah-Hartman, Rafael J. Wysocki,
	Andrew Morton, Leonardo Bras, Nathan Lynch, Allison Randal,
	Nathan Fontenot, Thomas Gleixner, Dan Williams, Stephen Rothwell,
	Anshuman Khandual, lantianyu1986, linuxppc-dev

On 17.01.20 15:52, Michal Hocko wrote:
> On Fri 17-01-20 14:08:06, David Hildenbrand wrote:
>> On 17.01.20 12:33, Michal Hocko wrote:
>>> On Fri 17-01-20 11:57:59, David Hildenbrand wrote:
>>>> Let's refactor that code. We want to check if we can offline memory
>>>> blocks. Add a new function is_memory_block_offlineable() for that and
>>>> make it call is_mem_section_offlineable() for each contained section.
>>>> Within is_mem_section_offlineable(), add some more sanity checks and
>>>> directly bail out if the section contains holes or if it spans multiple
>>>> zones.
>>>
>>> I didn't read the patch (yet) but I am wondering. If we want to touch
>>> this code, can we simply always return true there? I mean whoever
>>> depends on this check is racy and the failure can happen even after
>>> the sysfs says good to go, right? The check is essentially as expensive
>>> as calling the offlining code itself. So the only usecase I can think of
>>> is a dumb driver to crawl over blocks and check which is removable and
>>> try to hotremove it. But just trying to offline one block after another
>>> is essentially going to achieve the same.
>>
>> Some thoughts:
>>
>> 1. It allows you to check if memory is likely to be offlineable without
>> doing expensive locking and trying to isolate pages (meaning:
>> zone->lock, mem_hotplug_lock. but also, calling drain_all_pages()
>> when isolating)
>>
>> 2. There are use cases that want to identify a memory block/DIMM to
>> unplug. One example is PPC DLPAR code (see this patch). Going over all
>> memory block trying to offline them is an expensive operation.
>>
>> 3. powerpc-utils (https://github.com/ibm-power-utilities/powerpc-utils)
>> makes use of /sys/.../removable to speed up the search AFAIK.
> 
> Well, while I do see those points I am not really sure they are worth
> having a broken (by-definition) interface.

It's a pure speedup. And for that, the interface has been working
perfectly fine for years?

>  
>> 4. lsmem displays/groups by "removable".
> 
> Is anybody really using that?

Well at least I am using that when testing to identify which
(ZONE_NORMAL!) block I can easily offline/re-online (e.g., to validate
all the zone shrinking stuff I have been fixing)

So there is at least one user ;)

[...]
> 
>>> Or does anybody see any reasonable usecase that would break if we did
>>> that unconditional behavior?
>>
>>> If we would always return "true", then the whole reason the
>> interface originally was introduced would be "broken" (meaning, less
>> performant as you would try to offline any memory block).
> 
> I would argue that the whole interface is broken ;). Not the first time
> in the kernel development history and not the last time either. What I
> am trying to say here is that unless there are _real_ usecases depending
> on knowing that something surely is _not_ offlineable then I would just
> try to drop the functionality while preserving the interface and see
> what happens.

I can see that, but I can perfectly well understand why - especially
powerpc - wants a fast way to sense which blocks it actually makes sense
to try to offline.

The original patch correctly states
   "which sections of
    memory are likely to be removable before attempting the potentially
    expensive operation."

It works as designed I would say.

-- 
Thanks,

David / dhildenb



* Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
  2020-01-17 14:58       ` David Hildenbrand
@ 2020-01-17 15:29         ` Michal Hocko
  2020-01-17 15:54           ` Dan Williams
From: Michal Hocko @ 2020-01-17 15:29 UTC
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Greg Kroah-Hartman, Rafael J. Wysocki,
	Andrew Morton, Leonardo Bras, Nathan Lynch, Allison Randal,
	Nathan Fontenot, Thomas Gleixner, Dan Williams, Stephen Rothwell,
	Anshuman Khandual, lantianyu1986, linuxppc-dev

On Fri 17-01-20 15:58:26, David Hildenbrand wrote:
> On 17.01.20 15:52, Michal Hocko wrote:
> > On Fri 17-01-20 14:08:06, David Hildenbrand wrote:
> >> On 17.01.20 12:33, Michal Hocko wrote:
> >>> On Fri 17-01-20 11:57:59, David Hildenbrand wrote:
> >>>> Let's refactor that code. We want to check if we can offline memory
> >>>> blocks. Add a new function is_memory_block_offlineable() for that and
> >>>> make it call is_mem_section_offlineable() for each contained section.
> >>>> Within is_mem_section_offlineable(), add some more sanity checks and
> >>>> directly bail out if the section contains holes or if it spans multiple
> >>>> zones.
> >>>
> >>> I didn't read the patch (yet) but I am wondering. If we want to touch
> >>> this code, can we simply always return true there? I mean whoever
> >>> depends on this check is racy and the failure can happen even after
> >>> the sysfs says good to go, right? The check is essentially as expensive
> >>> as calling the offlining code itself. So the only usecase I can think of
> >>> is a dumb driver to crawl over blocks and check which is removable and
> >>> try to hotremove it. But just trying to offline one block after another
> >>> is essentially going to achieve the same.
> >>
> >> Some thoughts:
> >>
> >> 1. It allows you to check if memory is likely to be offlineable without
> >> doing expensive locking and trying to isolate pages (meaning:
> >> zone->lock, mem_hotplug_lock. but also, calling drain_all_pages()
> >> when isolating)
> >>
> >> 2. There are use cases that want to identify a memory block/DIMM to
> >> unplug. One example is PPC DLPAR code (see this patch). Going over all
> >> memory block trying to offline them is an expensive operation.
> >>
> >> 3. powerpc-utils (https://github.com/ibm-power-utilities/powerpc-utils)
> >> makes use of /sys/.../removable to speed up the search AFAIK.
> > 
> > Well, while I do see those points I am not really sure they are worth
> > having a broken (by-definition) interface.
> 
> It's a pure speedup. And for that, the interface has been working
> perfectly fine for years?
> 
> >  
> >> 4. lsmem displays/groups by "removable".
> > 
> > Is anybody really using that?
> 
> Well at least I am using that when testing to identify which
> (ZONE_NORMAL!) block I can easily offline/re-online (e.g., to validate
> all the zone shrinking stuff I have been fixing)
> 
> So there is at least one user ;)

Fair enough. But I would argue that there are better ways to do the same
solely for testing purposes, rather than having subtly broken code to
maintain.
 
> > 
> >>> Or does anybody see any reasonable usecase that would break if we did
> >>> that unconditional behavior?
> >>
> >>> If we would always return "true", then the whole reason the
> >> interface originally was introduced would be "broken" (meaning, less
> >> performant as you would try to offline any memory block).
> > 
> > I would argue that the whole interface is broken ;). Not the first time
> > in the kernel development history and not the last time either. What I
> > am trying to say here is that unless there are _real_ usecases depending
> > on knowing that something surely is _not_ offlineable then I would just
> > try to drop the functionality while preserving the interface and see
> > what happens.
> 
> I can see that, but I can perfectly well understand why - especially
> powerpc - wants a fast way to sense which blocks it actually makes sense
> to try to offline.
> 
> The original patch correctly states
>    "which sections of
>     memory are likely to be removable before attempting the potentially
>     expensive operation."
> 
> It works as designed I would say.

Then I would just keep it crippled the same way it has been for years
without anybody noticing.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
  2020-01-17 15:29         ` Michal Hocko
@ 2020-01-17 15:54           ` Dan Williams
  2020-01-17 16:10             ` David Hildenbrand
From: Dan Williams @ 2020-01-17 15:54 UTC
  To: Michal Hocko
  Cc: David Hildenbrand, Linux Kernel Mailing List, Linux MM,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Greg Kroah-Hartman, Rafael J. Wysocki, Andrew Morton,
	Leonardo Bras, Nathan Lynch, Allison Randal, Nathan Fontenot,
	Thomas Gleixner, Stephen Rothwell, Anshuman Khandual,
	lantianyu1986, linuxppc-dev

On Fri, Jan 17, 2020 at 7:30 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Fri 17-01-20 15:58:26, David Hildenbrand wrote:
> > On 17.01.20 15:52, Michal Hocko wrote:
> > > On Fri 17-01-20 14:08:06, David Hildenbrand wrote:
> > >> On 17.01.20 12:33, Michal Hocko wrote:
> > >>> On Fri 17-01-20 11:57:59, David Hildenbrand wrote:
> > >>>> Let's refactor that code. We want to check if we can offline memory
> > >>>> blocks. Add a new function is_memory_block_offlineable() for that and
> > >>>> make it call is_mem_section_offlineable() for each contained section.
> > >>>> Within is_mem_section_offlineable(), add some more sanity checks and
> > >>>> directly bail out if the section contains holes or if it spans multiple
> > >>>> zones.
> > >>>
> > >>> I didn't read the patch (yet) but I am wondering. If we want to touch
> > >>> this code, can we simply always return true there? I mean whoever
> > >>> depends on this check is racy and the failure can happen even after
> > >>> the sysfs says good to go, right? The check is essentially as expensive
> > >>> as calling the offlining code itself. So the only usecase I can think of
> > >>> is a dumb driver to crawl over blocks and check which is removable and
> > >>> try to hotremove it. But just trying to offline one block after another
> > >>> is essentially going to achieve the same.
> > >>
> > >> Some thoughts:
> > >>
> > >> 1. It allows you to check if memory is likely to be offlineable without
> > >> doing expensive locking and trying to isolate pages (meaning:
> > >> zone->lock, mem_hotplug_lock. but also, calling drain_all_pages()
> > >> when isolating)
> > >>
> > >> 2. There are use cases that want to identify a memory block/DIMM to
> > >> unplug. One example is PPC DLPAR code (see this patch). Going over all
> > >> memory block trying to offline them is an expensive operation.
> > >>
> > >> 3. powerpc-utils (https://github.com/ibm-power-utilities/powerpc-utils)
> > >> makes use of /sys/.../removable to speed up the search AFAIK.
> > >
> > > Well, while I do see those points I am not really sure they are worth
> > > having a broken (by-definition) interface.
> >
> > It's a pure speedup. And for that, the interface has been working
> > perfectly fine for years?
> >
> > >
> > >> 4. lsmem displays/groups by "removable".
> > >
> > > Is anybody really using that?
> >
> > Well at least I am using that when testing to identify which
> > (ZONE_NORMAL!) block I can easily offline/re-online (e.g., to validate
> > all the zone shrinking stuff I have been fixing)
> >
> > So there is at least one user ;)
>
> Fair enough. But I would argue that there are better ways to do the same
> solely for testing purposes, rather than having subtly broken code to
> maintain.
>
> > >
> > >>> Or does anybody see any reasonable usecase that would break if we did
> > >>> that unconditional behavior?
> > >>
> > >>> If we would always return "true", then the whole reason the
> > >> interface originally was introduced would be "broken" (meaning, less
> > >> performant as you would try to offline any memory block).
> > >
> > > I would argue that the whole interface is broken ;). Not the first time
> > > in the kernel development history and not the last time either. What I
> > > am trying to say here is that unless there are _real_ usecases depending
> > > on knowing that something surely is _not_ offlineable then I would just
> > > try to drop the functionality while preserving the interface and see
> > > what happens.
> >
> > I can see that, but I can perfectly well understand why - especially
> > powerpc - wants a fast way to sense which blocks it actually makes sense
> > to try to offline.
> >
> > The original patch correctly states
> >    "which sections of
> >     memory are likely to be removable before attempting the potentially
> >     expensive operation."
> >
> > It works as designed I would say.
>
> Then I would just keep it crippled the same way it has been for years
> without anybody noticing.

I tend to agree. At least the kmem driver that wants to unplug memory
could not use an interface that does not give stable answers. It just
relies on remove_memory() to return a definitive error.
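
(Roughly this placeholder pattern - names are made up, this is not
actual kmem code:)

/* attempt the operation and let the core give the definitive answer */
static int try_unplug_range(int nid, u64 start, u64 size)
{
	int rc;

	rc = offline_pages(PFN_DOWN(start), size >> PAGE_SHIFT);
	if (rc)
		return rc;	/* e.g. -EBUSY: unmovable/pinned pages */
	return remove_memory(nid, start, size);
}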


* Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
  2020-01-17 15:54           ` Dan Williams
@ 2020-01-17 16:10             ` David Hildenbrand
  2020-01-17 16:57               ` Dan Williams
From: David Hildenbrand @ 2020-01-17 16:10 UTC
  To: Dan Williams, Michal Hocko
  Cc: Linux Kernel Mailing List, Linux MM, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Greg Kroah-Hartman,
	Rafael J. Wysocki, Andrew Morton, Leonardo Bras, Nathan Lynch,
	Allison Randal, Nathan Fontenot, Thomas Gleixner,
	Stephen Rothwell, Anshuman Khandual, lantianyu1986, linuxppc-dev

On 17.01.20 16:54, Dan Williams wrote:
> On Fri, Jan 17, 2020 at 7:30 AM Michal Hocko <mhocko@kernel.org> wrote:
>>
>> On Fri 17-01-20 15:58:26, David Hildenbrand wrote:
>>> On 17.01.20 15:52, Michal Hocko wrote:
>>>> On Fri 17-01-20 14:08:06, David Hildenbrand wrote:
>>>>> On 17.01.20 12:33, Michal Hocko wrote:
>>>>>> On Fri 17-01-20 11:57:59, David Hildenbrand wrote:
>>>>>>> Let's refactor that code. We want to check if we can offline memory
>>>>>>> blocks. Add a new function is_memory_block_offlineable() for that and
>>>>>>> make it call is_mem_section_offlineable() for each contained section.
>>>>>>> Within is_mem_section_offlineable(), add some more sanity checks and
>>>>>>> directly bail out if the section contains holes or if it spans multiple
>>>>>>> zones.
>>>>>>
>>>>>> I didn't read the patch (yet) but I am wondering. If we want to touch
>>>>>> this code, can we simply always return true there? I mean whoever
>>>>>> depends on this check is racy and the failure can happen even after
>>>>>> the sysfs says good to go, right? The check is essentially as expensive
>>>>>> as calling the offlining code itself. So the only usecase I can think of
>>>>>> is a dumb driver to crawl over blocks and check which is removable and
>>>>>> try to hotremove it. But just trying to offline one block after another
>>>>>> is essentially going to achieve the same.
>>>>>
>>>>> Some thoughts:
>>>>>
>>>>> 1. It allows you to check if memory is likely to be offlineable without
>>>>> doing expensive locking and trying to isolate pages (meaning:
>>>>> zone->lock, mem_hotplug_lock. but also, calling drain_all_pages()
>>>>> when isolating)
>>>>>
>>>>> 2. There are use cases that want to identify a memory block/DIMM to
>>>>> unplug. One example is PPC DLPAR code (see this patch). Going over all
>>>>> memory block trying to offline them is an expensive operation.
>>>>>
>>>>> 3. powerpc-utils (https://github.com/ibm-power-utilities/powerpc-utils)
>>>>> makes use of /sys/.../removable to speed up the search AFAIK.
>>>>
>>>> Well, while I do see those points I am not really sure they are worth
>>>> having a broken (by-definition) interface.
>>>
>>> It's a pure speedup. And for that, the interface has been working
>>> perfectly fine for years?
>>>
>>>>
>>>>> 4. lsmem displays/groups by "removable".
>>>>
>>>> Is anybody really using that?
>>>
>>> Well at least I am using that when testing to identify which
>>> (ZONE_NORMAL!) block I can easily offline/re-online (e.g., to validate
>>> all the zone shrinking stuff I have been fixing)
>>>
>>> So there is at least one user ;)
>>
>> Fair enough. But I would argue that there are better ways to do the same
>> solely for testing purposes, rather than having subtly broken code to
>> maintain.
>>
>>>>
>>>>>> Or does anybody see any reasonable usecase that would break if we did
>>>>>> that unconditional behavior?
>>>>>
>>>>> If we would always return "true", then the whole reason the
>>>>> interface originally was introduced would be "broken" (meaning, less
>>>>> performant as you would try to offline any memory block).
>>>>
>>>> I would argue that the whole interface is broken ;). Not the first time
>>>> in the kernel development history and not the last time either. What I
>>>> am trying to say here is that unless there are _real_ usecases depending
>>>> on knowing that something surely is _not_ offlineable then I would just
>>>> try to drop the functionality while preserving the interface and see
>>>> what happens.
>>>
>>> I can see that, but I can perfectly well understand why - especially
>>> powerpc - wants a fast way to sense which blocks it actually makes sense
>>> to try to offline.
>>>
>>> The original patch correctly states
>>>    "which sections of
>>>     memory are likely to be removable before attempting the potentially
>>>     expensive operation."
>>>
>>> It works as designed I would say.
>>
>> Then I would just keep it crippled the same way it has been for years
>> without anybody noticing.
> 
> I tend to agree. At least the kmem driver that wants to unplug memory
> could not use an interface that does not give stable answers. It just
> relies on remove_memory() to return a definitive error.
> 

Just because kmem cannot reuse such an interface doesn't mean we should
not touch it (or I am not getting your point). Especially, this
interface is about "can it likely be offlined and then eventually be
removed (if there is a HW interface for that)" (as documented), not
about "will remove_memory() work".

We do have users and if we agree to keep it (what I think we should as I
expressed) then I think we should un-cripple and fix it. After all we
have to maintain it. The current interface provides what was documented
- "likely to be offlineable". (the chosen name was just horribly bad -
as I expressed a while ago already :) )

-- 
Thanks,

David / dhildenb



* Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
  2020-01-17 16:10             ` David Hildenbrand
@ 2020-01-17 16:57               ` Dan Williams
  2020-01-20  7:48                 ` Michal Hocko
From: Dan Williams @ 2020-01-17 16:57 UTC
  To: David Hildenbrand
  Cc: Michal Hocko, Linux Kernel Mailing List, Linux MM,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Greg Kroah-Hartman, Rafael J. Wysocki, Andrew Morton,
	Leonardo Bras, Nathan Lynch, Allison Randal, Nathan Fontenot,
	Thomas Gleixner, Stephen Rothwell, Anshuman Khandual,
	lantianyu1986, linuxppc-dev

On Fri, Jan 17, 2020 at 8:11 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 17.01.20 16:54, Dan Williams wrote:
> > On Fri, Jan 17, 2020 at 7:30 AM Michal Hocko <mhocko@kernel.org> wrote:
> >>
> >> On Fri 17-01-20 15:58:26, David Hildenbrand wrote:
> >>> On 17.01.20 15:52, Michal Hocko wrote:
> >>>> On Fri 17-01-20 14:08:06, David Hildenbrand wrote:
> >>>>> On 17.01.20 12:33, Michal Hocko wrote:
> >>>>>> On Fri 17-01-20 11:57:59, David Hildenbrand wrote:
> >>>>>>> Let's refactor that code. We want to check if we can offline memory
> >>>>>>> blocks. Add a new function is_memory_block_offlineable() for that and
> >>>>>>> make it call is_mem_section_offlineable() for each contained section.
> >>>>>>> Within is_mem_section_offlineable(), add some more sanity checks and
> >>>>>>> directly bail out if the section contains holes or if it spans multiple
> >>>>>>> zones.
> >>>>>>
> >>>>>> I didn't read the patch (yet) but I am wondering. If we want to touch
> >>>>>> this code, can we simply always return true there? I mean whoever
> >>>>>> depends on this check is racy and the failure can happen even after
> >>>>>> the sysfs says good to go, right? The check is essentially as expensive
> >>>>>> as calling the offlining code itself. So the only usecase I can think of
> >>>>>> is a dumb driver to crawl over blocks and check which is removable and
> >>>>>> try to hotremove it. But just trying to offline one block after another
> >>>>>> is essentially going to achieve the same.
> >>>>>
> >>>>> Some thoughts:
> >>>>>
> >>>>> 1. It allows you to check if memory is likely to be offlineable without
> >>>>> doing expensive locking and trying to isolate pages (meaning:
> >>>>> zone->lock, mem_hotplug_lock. but also, calling drain_all_pages()
> >>>>> when isolating)
> >>>>>
> >>>>> 2. There are use cases that want to identify a memory block/DIMM to
> >>>>> unplug. One example is PPC DLPAR code (see this patch). Going over all
> >>>>> memory block trying to offline them is an expensive operation.
> >>>>>
> >>>>> 3. powerpc-utils (https://github.com/ibm-power-utilities/powerpc-utils)
> >>>>> makes use of /sys/.../removable to speed up the search AFAIK.
> >>>>
> >>>> Well, while I do see those points I am not really sure they are worth
> >>>> having a broken (by-definition) interface.
> >>>
> >>> It's a pure speedup. And for that, the interface has been working
> >>> perfectly fine for years?
> >>>
> >>>>
> >>>>> 4. lsmem displays/groups by "removable".
> >>>>
> >>>> Is anybody really using that?
> >>>
> >>> Well at least I am using that when testing to identify which
> >>> (ZONE_NORMAL!) block I can easily offline/re-online (e.g., to validate
> >>> all the zone shrinking stuff I have been fixing)
> >>>
> >>> So there is at least one user ;)
> >>
> >> Fair enough. But I would argue that there are better ways to do the same
> >> solely for testing purposes, rather than having subtly broken code to
> >> maintain.
> >>
> >>>>
> >>>>>> Or does anybody see any reasonable usecase that would break if we did
> >>>>>> that unconditional behavior?
> >>>>>
> >>>>> If we would always return "true", then the whole reason the
> >>>>> interface originally was introduced would be "broken" (meaning, less
> >>>>> performant as you would try to offline any memory block).
> >>>>
> >>>> I would argue that the whole interface is broken ;). Not the first time
> >>>> in the kernel development history and not the last time either. What I
> >>>> am trying to say here is that unless there are _real_ usecases depending
> >>>> on knowing that something surely is _not_ offlineable then I would just
> >>>> try to drop the functionality while preserving the interface and see
> >>>> what happens.
> >>>
> >>> I can see that, but I can perfectly well understand why - especially
> >>> powerpc - wants a fast way to sense which blocks it actually makes sense
> >>> to try to offline.
> >>>
> >>> The original patch correctly states
> >>>    "which sections of
> >>>     memory are likely to be removable before attempting the potentially
> >>>     expensive operation."
> >>>
> >>> It works as designed I would say.
> >>
> >> Then I would just keep it crippled the same way it has been for years
> >> without anybody noticing.
> >
> > I tend to agree. At least the kmem driver that wants to unplug memory
> > could not use an interface that does not give stable answers. It just
> > relies on remove_memory() to return a definitive error.
> >
>
> Just because kmem cannot reuse such an interface doesn't mean we should
> not touch it (or I am not getting your point). Especially, this
> interface is about "can it likely be offlined and then eventually be
> removed (if there is a HW interface for that)" (as documented), not
> about "will remove_memory() work".

Unless the user is willing to hold the device_hotplug_lock over the
evaluation, the result is unreliable. For example, the changes to
removable_show() are no better than they were previously because the
result is invalidated as soon as the lock is dropped.

> We do have users and if we agree to keep it (what I think we should as I
> expressed) then I think we should un-cripple and fix it. After all we
> have to maintain it. The current interface provides what was documented
> - "likely to be offlineable". (the chosen name was just horribly bad -
> as I expressed a while ago already :) )

Are there users that misbehave because they try to offline a section
with holes? I brought up kmem because it's an unplug user that does
not care whether the failure was due to pinned pages or holes in
sections. Do others care about that precision in a meaningful way?


* Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
  2020-01-17 16:57               ` Dan Williams
@ 2020-01-20  7:48                 ` Michal Hocko
  2020-01-20  9:14                   ` David Hildenbrand
From: Michal Hocko @ 2020-01-20  7:48 UTC
  To: Dan Williams
  Cc: David Hildenbrand, Linux Kernel Mailing List, Linux MM,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Greg Kroah-Hartman, Rafael J. Wysocki, Andrew Morton,
	Leonardo Bras, Nathan Lynch, Allison Randal, Nathan Fontenot,
	Thomas Gleixner, Stephen Rothwell, Anshuman Khandual,
	lantianyu1986, linuxppc-dev

On Fri 17-01-20 08:57:51, Dan Williams wrote:
[...]
> Unless the user is willing to hold the device_hotplug_lock over the
> evaluation, the result is unreliable.

Do we want to hold the device_hotplug_lock from this user readable file
in the first place? My book says that this just waits to become a
problem.

Really, the interface is flawed and should have never been merged in the
first place. We cannot simply remove it altogether I am afraid so let's
at least remove the bogus code and pretend that the world is a better
place where everything is removable except the reality sucks...
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
  2020-01-20  7:48                 ` Michal Hocko
@ 2020-01-20  9:14                   ` David Hildenbrand
  2020-01-20  9:20                     ` David Hildenbrand
  2020-01-21 12:07                     ` Michal Hocko
From: David Hildenbrand @ 2020-01-20  9:14 UTC
  To: Michal Hocko, Dan Williams
  Cc: Linux Kernel Mailing List, Linux MM, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Greg Kroah-Hartman,
	Rafael J. Wysocki, Andrew Morton, Leonardo Bras, Nathan Lynch,
	Allison Randal, Nathan Fontenot, Thomas Gleixner,
	Stephen Rothwell, Anshuman Khandual, lantianyu1986, linuxppc-dev

On 20.01.20 08:48, Michal Hocko wrote:
> On Fri 17-01-20 08:57:51, Dan Williams wrote:
> [...]
>> Unless the user is willing to hold the device_hotplug_lock over the
>> evaluation, the result is unreliable.
> 
> Do we want to hold the device_hotplug_lock from this user readable file
> in the first place? My book says that this just waits to become a
> problem.

It was the "big hammer" solution for this RFC.

I think we could do with a try_lock() on the device_lock() paired with a
device->removed flag. The latter is helpful for properly catching zombie
devices on the onlining/offlining path either way (and on my todo list).
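
Roughly like the following sketch (mem->removed would be the new flag,
nothing of this exists yet):

static ssize_t removable_show(struct device *dev,
			      struct device_attribute *attr, char *buf)
{
	struct memory_block *mem = to_memory_block(dev);
	bool ret = false;

	if (!device_trylock(dev))
		return -EBUSY;
	/* mem->removed would be set while unplugging the block */
	if (!mem->removed)
		ret = is_memory_block_offlineable(mem);
	device_unlock(dev);

	return sprintf(buf, "%d\n", ret);
}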

> 
> Really, the interface is flawed and should have never been merged in the
> first place. We cannot simply remove it altogether I am afraid so let's
> at least remove the bogus code and pretend that the world is a better
> place where everything is removable except the reality sucks...

As I expressed already, the interface works as designed/documented and
has been used like that for years. I tend to agree that it never should
have been merged like that.

We have (at least) two places that are racy (with concurrent memory
hotplug):

1. /sys/.../memoryX/removable
- a) make it always return yes and make the interface useless
- b) add proper locking and keep it running as is (e.g., so David can
     identify offlineable memory blocks :) ).

2. /sys/.../memoryX/valid_zones
- a) always return "none" if the memory is online
- b) add proper locking and keep it running as is
- c) cache the result ("zone") when a block is onlined (e.g., in
mem->zone. If it is NULL, either mixed zones or unknown)

At least 2. already screams for proper device_lock() locking as the
mem->state is not stable across the function call.

1a and 2a are the easiest solutions but remove all ways to identify if a
memory block could theoretically be offlined - without trying
(especially, also to identify the MOVABLE zone).

I tend to prefer 1b) and 2c), paired with proper device_lock() locking.
We don't affect existing use cases but are able to simplify the code +
fix the races.
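
For 2c, roughly (assumed field and helper name, nothing of this exists
yet):

/* cache the zone whenever the block state changes; NULL means
 * mixed zones or unknown */
static void memory_block_update_zone(struct memory_block *mem)
{
	const unsigned long start_pfn =
		section_nr_to_pfn(mem->start_section_nr);

	if (mem->state != MEM_ONLINE)
		mem->zone = NULL;
	else
		mem->zone = test_pages_in_a_zone(start_pfn,
				start_pfn + sections_per_block * PAGES_PER_SECTION);
}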

What's your opinion? Any alternatives?

-- 
Thanks,

David / dhildenb



* Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
  2020-01-20  9:14                   ` David Hildenbrand
@ 2020-01-20  9:20                     ` David Hildenbrand
  2020-01-21 12:07                     ` Michal Hocko
From: David Hildenbrand @ 2020-01-20  9:20 UTC
  To: Michal Hocko, Dan Williams
  Cc: Linux Kernel Mailing List, Linux MM, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Greg Kroah-Hartman,
	Rafael J. Wysocki, Andrew Morton, Leonardo Bras, Nathan Lynch,
	Allison Randal, Nathan Fontenot, Thomas Gleixner,
	Stephen Rothwell, Anshuman Khandual, lantianyu1986, linuxppc-dev

On 20.01.20 10:14, David Hildenbrand wrote:
> On 20.01.20 08:48, Michal Hocko wrote:
>> On Fri 17-01-20 08:57:51, Dan Williams wrote:
>> [...]
>>> Unless the user is willing to hold the device_hotplug_lock over the
>>> evaluation, the result is unreliable.
>>
>> Do we want to hold the device_hotplug_lock from this user readable file
>> in the first place? My book says that this just waits to become a
>> problem.
> 
> It was the "big hammer" solution for this RFC.
> 
> I think we could do with a try_lock() on the device_lock() paired with a
> device->removed flag. The latter is helpful for properly catching zombie
> devices on the onlining/offlining path either way (and on my todo list).

We do have dev->p->dead which could come in handy.

-- 
Thanks,

David / dhildenb



* Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
  2020-01-20  9:14                   ` David Hildenbrand
  2020-01-20  9:20                     ` David Hildenbrand
@ 2020-01-21 12:07                     ` Michal Hocko
  2020-01-22 10:39                       ` David Hildenbrand
From: Michal Hocko @ 2020-01-21 12:07 UTC
  To: David Hildenbrand
  Cc: Dan Williams, Linux Kernel Mailing List, Linux MM,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Greg Kroah-Hartman, Rafael J. Wysocki, Andrew Morton,
	Leonardo Bras, Nathan Lynch, Allison Randal, Nathan Fontenot,
	Thomas Gleixner, Stephen Rothwell, Anshuman Khandual,
	lantianyu1986, linuxppc-dev

On Mon 20-01-20 10:14:44, David Hildenbrand wrote:
> On 20.01.20 08:48, Michal Hocko wrote:
> > On Fri 17-01-20 08:57:51, Dan Williams wrote:
> > [...]
> >> Unless the user is willing to hold the device_hotplug_lock over the
> >> evaluation, the result is unreliable.
> > 
> > Do we want to hold the device_hotplug_lock from this user readable file
> > in the first place? My book says that this just waits to become a
> > problem.
> 
> It was the "big hammer" solution for this RFC.
> 
> I think we could do with a try_lock() on the device_lock() paired with a
> device->removed flag. The latter is helpful for properly catching zombie
> devices on the onlining/offlining path either way (and on my todo list).

try_lock would be more considerate. It would at least make any potential
hammering a bit harder.

> > Really, the interface is flawed and should have never been merged in the
> > first place. We cannot simply remove it altogether I am afraid so let's
> > at least remove the bogus code and pretend that the world is a better
> > place where everything is removable except the reality sucks...
> 
> As I expressed already, the interface works as designed/documented and
> has been used like that for years.

It seems we do differ in the usefulness though. Using a crappy interface
for years doesn't make it less crappy. I do realize we cannot remove the
interface but we can remove issues with the implementation and I dare to
say that most existing users wouldn't really notice.

> I tend to agree that it never should have been merged like that.
> 
> We have (at least) two places that are racy (with concurrent memory
> hotplug):
> 
> 1. /sys/.../memoryX/removable
> - a) make it always return yes and make the interface useless
> - b) add proper locking and keep it running as is (e.g., so David can
>      identify offlineable memory blocks :) ).
> 
> 2. /sys/.../memoryX/valid_zones
> - a) always return "none" if the memory is online
> - b) add proper locking and keep it running as is
> - c) cache the result ("zone") when a block is onlined (e.g., in
> mem->zone. If it is NULL, either mixed zones or unknown)
> 
> At least 2. already screams for proper device_lock() locking as the
> mem->state is not stable across the function call.
> 
> 1a and 2a are the easiest solutions but remove all ways to identify if a
> memory block could theoretically be offlined - without trying
> (especially, also to identify the MOVABLE zone).
> 
> I tend to prefer 1b) and 2c), paired with proper device_lock() locking.
> We don't affect existing use cases but are able to simplify the code +
> fix the races.
> 
> What's your opinion? Any alternatives?

1a) and 2c) if you ask me.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
  2020-01-21 12:07                     ` Michal Hocko
@ 2020-01-22 10:39                       ` David Hildenbrand
  2020-01-22 10:42                         ` Michal Hocko
From: David Hildenbrand @ 2020-01-22 10:39 UTC
  To: Michal Hocko
  Cc: Dan Williams, Linux Kernel Mailing List, Linux MM,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Greg Kroah-Hartman, Rafael J. Wysocki, Andrew Morton,
	Leonardo Bras, Nathan Lynch, Allison Randal, Nathan Fontenot,
	Thomas Gleixner, Stephen Rothwell, Anshuman Khandual,
	lantianyu1986, linuxppc-dev

>>> Really, the interface is flawed and should have never been merged in the
>>> first place. We cannot simply remove it altogether I am afraid so let's
>>> at least remove the bogus code and pretend that the world is a better
>>> place where everything is removable except the reality sucks...
>>
>> As I expressed already, the interface works as designed/documented and
>> has been used like that for years.
> 
> It seems we do differ in the usefulness though. Using a crappy interface
> for years doesn't make it less crappy. I do realize we cannot remove the
> interface but we can remove issues with the implementation and I dare to
> say that most existing users wouldn't really notice.

Well, at least powerpc-utils (the reason this interface was introduced)
will notice: a) performance-wise, and b) because more logging output will
be generated (offlining will obviously be attempted on non-offlineable
blocks).

However, it should not break, because we could have had races
before/false positives.

> 
>> I tend to agree that it never should have been merged like that.
>>
>> We have (at least) two places that are racy (with concurrent memory
>> hotplug):
>>
>> 1. /sys/.../memoryX/removable
>> - a) make it always return yes and make the interface useless
>> - b) add proper locking and keep it running as is (e.g., so David can
>>      identify offlineable memory blocks :) ).
>>
>> 2. /sys/.../memoryX/valid_zones
>> - a) always return "none" if the memory is online
>> - b) add proper locking and keep it running as is
>> - c) cache the result ("zone") when a block is onlined (e.g., in
>> mem->zone. If it is NULL, either mixed zones or unknown)
>>
>> At least 2. already screams for proper device_lock() locking, as the
>> mem->state is not stable across the function call.
>>
>> 1a and 2a are the easiest solutions, but they remove any way to identify
>> whether a memory block could theoretically be offlined without actually
>> trying (and especially, the ability to identify the MOVABLE zone).
>>
>> I tend to prefer 1b) and 2c), paired with proper device_lock() locking.
>> We don't affect existing use cases but are able to simplify the code +
>> fix the races.
>>
>> What's your opinion? Any alternatives?
> 
> 1a) and 2c) if you ask me.
> 

I'll look into all of that; it just might take a little while (busy
with a lot of stuff). But then again, it does not seem to be urgent.

1a) will be easy, I'll post a patch soon that we can let rest in -next
for a bit to see if people start to scream out loud.
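
For 2c), roughly what I have in mind (a sketch only, not an actual
patch; the helper name is made up): when onlining a block succeeds,
cache the zone it ended up in, e.g., derived like this:

/*
 * Sketch: return the single zone spanning the just-onlined range,
 * or NULL if its pages ended up in mixed zones.
 */
static struct zone *block_zone(unsigned long start_pfn,
			       unsigned long nr_pages)
{
	struct zone *zone = page_zone(pfn_to_page(start_pfn));
	unsigned long pfn;

	for (pfn = start_pfn + 1; pfn < start_pfn + nr_pages; pfn++)
		if (page_zone(pfn_to_page(pfn)) != zone)
			return NULL;	/* mixed zones */
	return zone;
}

valid_zones could then report the cached mem->zone without rescanning
the pfn range.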

-- 
Thanks,

David / dhildenb



* Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
  2020-01-22 10:39                       ` David Hildenbrand
@ 2020-01-22 10:42                         ` Michal Hocko
  2020-01-22 10:54                           ` David Hildenbrand
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2020-01-22 10:42 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Dan Williams, Linux Kernel Mailing List, Linux MM,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Greg Kroah-Hartman, Rafael J. Wysocki, Andrew Morton,
	Leonardo Bras, Nathan Lynch, Allison Randal, Nathan Fontenot,
	Thomas Gleixner, Stephen Rothwell, Anshuman Khandual,
	lantianyu1986, linuxppc-dev

On Wed 22-01-20 11:39:08, David Hildenbrand wrote:
> >>> Really, the interface is flawed and should have never been merged in the
> >>> first place. We cannot simply remove it altogether I am afraid so let's
> >>> at least remove the bogus code and pretend that the world is a better
> >>> place where everything is removable except the reality sucks...
> >>
> >> As I expressed already, the interface works as designed/documented and
> >> has been used like that for years.
> > 
> > It seems we do differ in the usefulness though. Using a crappy interface
> > for years doesn't make it less crappy. I do realize we cannot remove the
> > interface but we can remove issues with the implementation and I dare to
> > say that most existing users wouldn't really notice.
> 
> Well, at least powerpc-utils (the reason this interface was introduced)
> will notice: a) performance-wise, and b) because more logging output
> will be generated (offlining will obviously be attempted on
> non-offlineable blocks).

I would really appreciate a specific example of a real use case. I am
not familiar with powerpc-utils workflows myself.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
  2020-01-22 10:42                         ` Michal Hocko
@ 2020-01-22 10:54                           ` David Hildenbrand
  2020-01-22 11:58                             ` David Hildenbrand
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2020-01-22 10:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dan Williams, Linux Kernel Mailing List, Linux MM,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Greg Kroah-Hartman, Rafael J. Wysocki, Andrew Morton,
	Leonardo Bras, Nathan Lynch, Allison Randal, Nathan Fontenot,
	Thomas Gleixner, Stephen Rothwell, Anshuman Khandual,
	lantianyu1986, linuxppc-dev

On 22.01.20 11:42, Michal Hocko wrote:
> On Wed 22-01-20 11:39:08, David Hildenbrand wrote:
>>>>> Really, the interface is flawed and should have never been merged in the
>>>>> first place. We cannot simply remove it altogether I am afraid so let's
>>>>> at least remove the bogus code and pretend that the world is a better
>>>>> place where everything is removable except the reality sucks...
>>>>
>>>> As I expressed already, the interface works as designed/documented and
>>>> has been used like that for years.
>>>
>>> It seems we do differ in the usefulness though. Using a crappy interface
>>> for years doesn't make it less crappy. I do realize we cannot remove the
>>> interface but we can remove issues with the implementation and I dare to
>>> say that most existing users wouldn't really notice.
>>
>> Well, at least powerpc-utils (the reason this interface was introduced)
>> will notice: a) performance-wise, and b) because more logging output
>> will be generated (offlining will obviously be attempted on
>> non-offlineable blocks).
> 
> I would really appreciate a specific example of a real use case. I am
> not familiar with powerpc-utils workflows myself.
> 

Not an expert myself:

https://github.com/ibm-power-utilities/powerpc-utils

-> src/drmgr/drslot_chrp_mem.c

On a request to remove some memory, it will

a) Read "->removable" of all memory blocks ("lmb")
b) Check if the request can be fulfilled using the removable blocks
c) Try to offline the candidate memory blocks. If offlining a block
succeeds, trigger its removal using some hypervisor hooks.

Interestingly, with "AMS ballooning", it already considers the
"removable" information useless (most probably because of
non-migratable balloon pages that can nevertheless be offlined - I
assume due to the powerpc code that I converted to proper balloon
compaction just recently). a) and b) are then skipped.

Returning "yes" on all blocks will make it handle them just as if "AMS
ballooning" were active. So every memory block will be tried. That
should work, but it will be slower if no ballooning is active.
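
Just to illustrate the pattern (a minimal sketch, not the actual
drslot_chrp_mem.c code; the fixed block count is made up and error
handling is kept to a bare minimum):

#include <stdio.h>

/* Read a 0/1 sysfs attribute; returns -1 on error. */
static int read_flag(const char *path)
{
	FILE *f = fopen(path, "r");
	int c;

	if (!f)
		return -1;
	c = fgetc(f);
	fclose(f);
	return c == '1';
}

/* Ask the kernel to offline one memory block; returns 0 on success. */
static int try_offline(int block)
{
	char path[64];
	FILE *f;
	int rc;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/memory/memory%d/state", block);
	f = fopen(path, "w");
	if (!f)
		return -1;
	rc = fputs("offline", f) == EOF ? -1 : 0;
	if (fclose(f))
		rc = -1;
	return rc;
}

int main(void)
{
	char path[64];
	int block;

	/* 64 blocks is an arbitrary example; real tools enumerate sysfs. */
	for (block = 0; block < 64; block++) {
		snprintf(path, sizeof(path),
			 "/sys/devices/system/memory/memory%d/removable",
			 block);
		/* Steps a)/b): only consider blocks flagged removable. */
		if (read_flag(path) != 1)
			continue;
		/* Step c): actually try to offline the candidate. */
		if (!try_offline(block))
			printf("offlined memory%d\n", block);
	}
	return 0;
}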

-- 
Thanks,

David / dhildenb



* Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
  2020-01-22 10:54                           ` David Hildenbrand
@ 2020-01-22 11:58                             ` David Hildenbrand
  2020-01-22 16:46                               ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2020-01-22 11:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dan Williams, Linux Kernel Mailing List, Linux MM,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Greg Kroah-Hartman, Rafael J. Wysocki, Andrew Morton,
	Leonardo Bras, Nathan Lynch, Allison Randal, Nathan Fontenot,
	Thomas Gleixner, Stephen Rothwell, Anshuman Khandual,
	lantianyu1986, linuxppc-dev

On 22.01.20 11:54, David Hildenbrand wrote:
> On 22.01.20 11:42, Michal Hocko wrote:
>> On Wed 22-01-20 11:39:08, David Hildenbrand wrote:
>>>>>> Really, the interface is flawed and should have never been merged in the
>>>>>> first place. We cannot simply remove it altogether I am afraid so let's
>>>>>> at least remove the bogus code and pretend that the world is a better
>>>>>> place where everything is removable except the reality sucks...
>>>>>
>>>>> As I expressed already, the interface works as designed/documented and
>>>>> has been used like that for years.
>>>>
>>>> It seems we do differ in the usefulness though. Using a crappy interface
>>>> for years doesn't make it less crappy. I do realize we cannot remove the
>>>> interface but we can remove issues with the implementation and I dare to
>>>> say that most existing users wouldn't really notice.
>>>
>>> Well, at least powerpc-utils (the reason this interface was introduced)
>>> will notice: a) performance-wise, and b) because more logging output
>>> will be generated (offlining will obviously be attempted on
>>> non-offlineable blocks).
>>
>> I would really appreciate a specific example of a real use case. I am
>> not familiar with powerpc-utils workflows myself.
>>
> 
> Not an expert myself:
> 
> https://github.com/ibm-power-utilities/powerpc-utils
> 
> -> src/drmgr/drslot_chrp_mem.c
> 
> On a request to remove some memory, it will
> 
> a) Read "->removable" of all memory blocks ("lmb")
> b) Check if the request can be fulfilled using the removable blocks
> c) Try to offline the candidate memory blocks. If offlining a block
> succeeds, trigger its removal using some hypervisor hooks.
> 
> Interestingly, with "AMS ballooning", it already considers the
> "removable" information useless (most probably because of
> non-migratable balloon pages that can nevertheless be offlined - I
> assume due to the powerpc code that I converted to proper balloon
> compaction just recently). a) and b) are then skipped.
> 
> Returning "yes" on all blocks will make it handle them just as if "AMS
> ballooning" were active. So every memory block will be tried. That
> should work, but it will be slower if no ballooning is active.
> 

On lsmem:

https://www.ibm.com/support/knowledgecenter/linuxonibm/com.ibm.linux.z.lgdd/lgdd_r_lsmem_cmd.html

"
Removable
    yes if the memory range can be set offline, no if it cannot be set
offline. A dash (-) means that the range is already offline. The kernel
method that identifies removable memory ranges is heuristic and not
exact. Occasionally, memory ranges are falsely reported as removable or
falsely reported as not removable.
"

Usage of lsmem paired with chmem:

https://access.redhat.com/solutions/3937181


Especially interesting for IBM z Systems, where memory
onlining/offlining will trigger the actual population of memory in the
hypervisor. So if an admin wants to offline some memory (to give it back
to the hypervisor), they would use lsmem to identify such blocks first,
instead of trying random blocks until one offlining request succeeds.

E.g., documented in

https://books.google.de/books?id=1UEhDQAAQBAJ&pg=PA117&lpg=PA117&dq=lsmem+removable&source=bl&ots=OzMfU6Gbzu&sig=ACfU3U2IfH0eTVJs0qu50FdkysA3iC0elw&hl=de&sa=X&ved=2ahUKEwjQpdXQkpfnAhVOzqQKHTN4BsoQ6AEwBXoECAoQAQ#v=onepage&q=lsmem%20removable&f=false


So I still think that the interface is useful and is getting used in
real life. Users tolerate false positives/negatives.

-- 
Thanks,

David / dhildenb



* Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
  2020-01-22 11:58                             ` David Hildenbrand
@ 2020-01-22 16:46                               ` Michal Hocko
  2020-01-22 18:15                                 ` David Hildenbrand
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2020-01-22 16:46 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Dan Williams, Linux Kernel Mailing List, Linux MM,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Greg Kroah-Hartman, Rafael J. Wysocki, Andrew Morton,
	Leonardo Bras, Nathan Lynch, Allison Randal, Nathan Fontenot,
	Thomas Gleixner, Stephen Rothwell, Anshuman Khandual,
	lantianyu1986, linuxppc-dev

On Wed 22-01-20 12:58:16, David Hildenbrand wrote:
> On 22.01.20 11:54, David Hildenbrand wrote:
> > On 22.01.20 11:42, Michal Hocko wrote:
> >> On Wed 22-01-20 11:39:08, David Hildenbrand wrote:
> >>>>>> Really, the interface is flawed and should have never been merged in the
> >>>>>> first place. We cannot simply remove it altogether I am afraid so let's
> >>>>>> at least remove the bogus code and pretend that the world is a better
> >>>>>> place where everything is removable except the reality sucks...
> >>>>>
> >>>>> As I expressed already, the interface works as designed/documented and
> >>>>> has been used like that for years.
> >>>>
> >>>> It seems we do differ in the usefulness though. Using a crappy interface
> >>>> for years doesn't make it less crappy. I do realize we cannot remove the
> >>>> interface but we can remove issues with the implementation and I dare to
> >>>> say that most existing users wouldn't really notice.
> >>>
> >>> Well, at least powerpc-utils (the reason this interface was introduced)
> >>> will notice: a) performance-wise, and b) because more logging output
> >>> will be generated (offlining will obviously be attempted on
> >>> non-offlineable blocks).
> >>
> >> I would really appreciate a specific example of a real use case. I am
> >> not familiar with powerpc-utils workflows myself.
> >>
> > 
> > Not an expert myself:
> > 
> > https://github.com/ibm-power-utilities/powerpc-utils
> > 
> > -> src/drmgr/drslot_chrp_mem.c
> > 
> > On a request to remove some memory, it will
> > 
> > a) Read "->removable" of all memory blocks ("lmb")
> > b) Check if the request can be fulfilled using the removable blocks
> > c) Try to offline the candidate memory blocks. If offlining a block
> > succeeds, trigger its removal using some hypervisor hooks.
> > 
> > Interestingly, with "AMS ballooning", it already considers the
> > "removable" information useless (most probably because of
> > non-migratable balloon pages that can nevertheless be offlined - I
> > assume due to the powerpc code that I converted to proper balloon
> > compaction just recently). a) and b) are then skipped.
> > 
> > Returning "yes" on all blocks will make it handle them just as if "AMS
> > ballooning" were active. So every memory block will be tried. That
> > should work, but it will be slower if no ballooning is active.
> > 
> 
> On lsmem:
> 
> https://www.ibm.com/support/knowledgecenter/linuxonibm/com.ibm.linux.z.lgdd/lgdd_r_lsmem_cmd.html
> 
> "
> Removable
>     yes if the memory range can be set offline, no if it cannot be set
> offline. A dash (-) means that the range is already offline. The kernel
> method that identifies removable memory ranges is heuristic and not
> exact. Occasionally, memory ranges are falsely reported as removable or
> falsely reported as not removable.
> "
> 
> Usage of lsmem paired with chmem:
> 
> https://access.redhat.com/solutions/3937181
> 
> 
> Especially interesting for IBM z Systems, where memory
> onlining/offlining will trigger the actual population of memory in the
> hypervisor. So if an admin wants to offline some memory (to give it back
> to the hypervisor), they would use lsmem to identify such blocks first,
> instead of trying random blocks until one offlining request succeeds.

I am sorry for being dense here but I still do not understand why s390
and the way it does hotremove matter here. After all, there are
no arch-specific operations done until the memory is offlined. Also,
randomly checking memory blocks and then hoping that the offline will
succeed is not much different from just trying to offline the
block. Both have to crawl through the pfn range and bail out on
unmovable memory.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
  2020-01-22 16:46                               ` Michal Hocko
@ 2020-01-22 18:15                                 ` David Hildenbrand
  2020-01-22 18:38                                   ` Michal Hocko
  2020-01-22 19:01                                   ` Michal Hocko
  0 siblings, 2 replies; 24+ messages in thread
From: David Hildenbrand @ 2020-01-22 18:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dan Williams, Linux Kernel Mailing List, Linux MM,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Greg Kroah-Hartman, Rafael J. Wysocki, Andrew Morton,
	Leonardo Bras, Nathan Lynch, Allison Randal, Nathan Fontenot,
	Thomas Gleixner, Stephen Rothwell, Anshuman Khandual,
	lantianyu1986, linuxppc-dev

On 22.01.20 17:46, Michal Hocko wrote:
> On Wed 22-01-20 12:58:16, David Hildenbrand wrote:
>> On 22.01.20 11:54, David Hildenbrand wrote:
>>> On 22.01.20 11:42, Michal Hocko wrote:
>>>> On Wed 22-01-20 11:39:08, David Hildenbrand wrote:
>>>>>>>> Really, the interface is flawed and should have never been merged in the
>>>>>>>> first place. We cannot simply remove it altogether I am afraid so let's
>>>>>>>> at least remove the bogus code and pretend that the world is a better
>>>>>>>> place where everything is removable except the reality sucks...
>>>>>>>
>>>>>>> As I expressed already, the interface works as designed/documented and
>>>>>>> has been used like that for years.
>>>>>>
>>>>>> It seems we do differ in the usefulness though. Using a crappy interface
>>>>>> for years doesn't make it less crappy. I do realize we cannot remove the
>>>>>> interface but we can remove issues with the implementation and I dare to
>>>>>> say that most existing users wouldn't really notice.
>>>>>
>>>>> Well, at least powerpc-utils (the reason this interface was introduced)
>>>>> will notice: a) performance-wise, and b) because more logging output
>>>>> will be generated (offlining will obviously be attempted on
>>>>> non-offlineable blocks).
>>>>
>>>> I would really appreciate a specific example of a real use case. I am
>>>> not familiar with powerpc-utils workflows myself.
>>>>
>>>
>>> Not an expert myself:
>>>
>>> https://github.com/ibm-power-utilities/powerpc-utils
>>>
>>> -> src/drmgr/drslot_chrp_mem.c
>>>
>>> On a request to remove some memory, it will
>>>
>>> a) Read "->removable" of all memory blocks ("lmb")
>>> b) Check if the request can be fulfilled using the removable blocks
>>> c) Try to offline the candidate memory blocks. If offlining a block
>>> succeeds, trigger its removal using some hypervisor hooks.
>>>
>>> Interestingly, with "AMS ballooning", it already considers the
>>> "removable" information useless (most probably because of
>>> non-migratable balloon pages that can nevertheless be offlined - I
>>> assume due to the powerpc code that I converted to proper balloon
>>> compaction just recently). a) and b) are then skipped.
>>>
>>> Returning "yes" on all blocks will make it handle them just as if "AMS
>>> ballooning" were active. So every memory block will be tried. That
>>> should work, but it will be slower if no ballooning is active.
>>>
>>
>> On lsmem:
>>
>> https://www.ibm.com/support/knowledgecenter/linuxonibm/com.ibm.linux.z.lgdd/lgdd_r_lsmem_cmd.html
>>
>> "
>> Removable
>>     yes if the memory range can be set offline, no if it cannot be set
>> offline. A dash (-) means that the range is already offline. The kernel
>> method that identifies removable memory ranges is heuristic and not
>> exact. Occasionally, memory ranges are falsely reported as removable or
>> falsely reported as not removable.
>> "
>>
>> Usage of lsmem paired with chmem:
>>
>> https://access.redhat.com/solutions/3937181
>>
>>
>> Especially interesting for IBM z Systems, where memory
>> onlining/offlining will trigger the actual population of memory in the
>> hypervisor. So if an admin wants to offline some memory (to give it back
>> to the hypervisor), they would use lsmem to identify such blocks first,
>> instead of trying random blocks until one offlining request succeeds.
> 
> I am sorry for being dense here but I still do not understand why s390

It's good that we talk about it :) It's hard to reconstruct actual use
cases from tools and some documentation only ...

Side note (just FYI): One difference on s390x compared to other
architectures (AFAIKS) is that once memory is offline, you might not be
allowed (by the hypervisor) to online it again - because it was
effectively unplugged. Such memory is not removed via remove_memory(),
it's simply kept offline.


> and the way it does hotremove matter here. After all, there are
> no arch-specific operations done until the memory is offlined. Also,
> randomly checking memory blocks and then hoping that the offline will
> succeed is not much different from just trying to offline the
> block. Both have to crawl through the pfn range and bail out on
> unmovable memory.

I think in general we have two approaches to memory unplugging.

1. Know explicitly what you want to unplug (e.g., a DIMM spanning
multiple memory blocks).

2. Find random memory blocks you can offline/unplug.


For 1, I think we both agree that we don't need this. Just try to
offline and you know if it worked.

Now of course, for 2 you can try random blocks until you succeed. From
a sysadmin point of view that's very inefficient. From a powerpc-utils
point of view, that's inefficient.

I just learned that "chmem"[1] has a mode where you can specify a "size"
and not only a range. So a sysadmin can still control onlining/offlining
for this use case with a few commands. In other tools (e.g.,
powerpc-utils), well, you have to try to offline random memory blocks
(just like chmem does).
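
To make that concrete, a rough sketch of such a try-until-satisfied
loop (again a sketch only - not actual chmem or powerpc-utils code -
assuming a try_offline() helper like the one sketched earlier in the
thread):

/*
 * Offline "request" bytes by blindly trying blocks, as a tool has to
 * do without a usable "removable" hint. Returns the amount actually
 * offlined, which may be less than requested.
 */
static unsigned long offline_by_size(unsigned long request,
				     unsigned long block_size,
				     int nr_blocks)
{
	unsigned long done = 0;
	int block;

	for (block = 0; block < nr_blocks && done < request; block++) {
		/* No pre-filtering: every block is a candidate. */
		if (!try_offline(block))
			done += block_size;
	}
	return done;
}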


AFAICS, once we make /sys/.../removable useless, I can see the following
changes:

1. Trying to offline a certain number of memory blocks gets slower/takes
longer/is less efficient. That might be tolerable. The tools seem to
keep working.

2. You can no longer make a rough estimate of how much memory you could
offline - before you actually try to offline it. I can only imagine that
something like this makes sense in a virtual environment (e.g., IBM z)
to balance memory between virtual machines, but I am not aware of a real
user of something like that.


So what I can do is

a) Come up with a patch that rips that stuff out (well, I have that
lying around already)

b) Describe the existing users + changes we will see

c) CC relevant people I identify (lsmem/chmem/powerpc-utils/etc.) on the
patch to see if we are missing other use cases/users/implications.

Sounds like a plan?


[1]
https://git.kernel.org/pub/scm/utils/util-linux/util-linux.git/tree/sys-utils/chmem.c

-- 
Thanks,

David / dhildenb



* Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
  2020-01-22 18:15                                 ` David Hildenbrand
@ 2020-01-22 18:38                                   ` Michal Hocko
  2020-01-22 18:46                                     ` David Hildenbrand
  2020-01-22 19:01                                   ` Michal Hocko
  1 sibling, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2020-01-22 18:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Dan Williams, Linux Kernel Mailing List, Linux MM,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Greg Kroah-Hartman, Rafael J. Wysocki, Andrew Morton,
	Leonardo Bras, Nathan Lynch, Allison Randal, Nathan Fontenot,
	Thomas Gleixner, Stephen Rothwell, Anshuman Khandual,
	lantianyu1986, linuxppc-dev

On Wed 22-01-20 19:15:47, David Hildenbrand wrote:
> On 22.01.20 17:46, Michal Hocko wrote:
> > On Wed 22-01-20 12:58:16, David Hildenbrand wrote:
[...]
> >> Especially interesting for IBM z Systems, where memory
> >> onlining/offlining will trigger the actual population of memory in the
> >> hypervisor. So if an admin wants to offline some memory (to give it back
> >> to the hypervisor), they would use lsmem to identify such blocks first,
> >> instead of trying random blocks until one offlining request succeeds.
> > 
> > I am sorry for being dense here but I still do not understand why s390
> 
> It's good that we talk about it :) It's hard to reconstruct actual use
> cases from tools and some documentation only ...
> 
> Side note (just FYI): One difference on s390x compared to other
> architectures (AFAIKS) is that once memory is offline, you might not be
> allowed (by the hypervisor) to online it again - because it was
> effectively unplugged. Such memory is not removed via remove_memory(),
> it's simply kept offline.

I have a very vague understanding of s390 specialities but this is not
really relevant to the discussion AFAICS because this happens _after_
offlining.
 
> > and the way it does hotremove matter here. After all, there are
> > no arch-specific operations done until the memory is offlined. Also,
> > randomly checking memory blocks and then hoping that the offline will
> > succeed is not much different from just trying to offline the
> > block. Both have to crawl through the pfn range and bail out on
> > unmovable memory.
> 
> I think in general we have two approaches to memory unplugging.
> 
> 1. Know explicitly what you want to unplug (e.g., a DIMM spanning
> multiple memory blocks).
> 
> 2. Find random memory blocks you can offline/unplug.
> 
> 
> For 1, I think we both agree that we don't need this. Just try to
> offline and you know if it worked.
> 
> Now of course, for 2 you can try random blocks until you succeed. From
> a sysadmin point of view that's very inefficient. From a powerpc-utils
> point of view, that's inefficient.

How exactly is check + offline more optimal than offline, which does the
check as its first step? I will get to your later points after this is
clarified.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
  2020-01-22 18:38                                   ` Michal Hocko
@ 2020-01-22 18:46                                     ` David Hildenbrand
  2020-01-22 19:09                                       ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2020-01-22 18:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dan Williams, Linux Kernel Mailing List, Linux MM,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Greg Kroah-Hartman, Rafael J. Wysocki, Andrew Morton,
	Leonardo Bras, Nathan Lynch, Allison Randal, Nathan Fontenot,
	Thomas Gleixner, Stephen Rothwell, Anshuman Khandual,
	lantianyu1986, linuxppc-dev

On 22.01.20 19:38, Michal Hocko wrote:
> On Wed 22-01-20 19:15:47, David Hildenbrand wrote:
>> On 22.01.20 17:46, Michal Hocko wrote:
>>> On Wed 22-01-20 12:58:16, David Hildenbrand wrote:
> [...]
>>>> Especially interesting for IBM z Systems, where memory
>>>> onlining/offlining will trigger the actual population of memory in the
>>>> hypervisor. So if an admin wants to offline some memory (to give it back
>>>> to the hypervisor), they would use lsmem to identify such blocks first,
>>>> instead of trying random blocks until one offlining request succeeds.
>>>
>>> I am sorry for being dense here but I still do not understand why s390
>>
>> It's good that we talk about it :) It's hard to reconstruct actual use
>> cases from tools and some documentation only ...
>>
>> Side note (just FYI): One difference on s390x compared to other
>> architectures (AFAIKS) is that once memory is offline, you might not be
>> allowed (by the hypervisor) to online it again - because it was
>> effectively unplugged. Such memory is not removed via remove_memory(),
>> it's simply kept offline.
> 
> I have a very vague understanding of s390 specialities but this is not
> really relevant to the discussion AFAICS because this happens _after_
> offlining.

Jep, that's why I flagged it as a side note.

>  
>>> and the way it does hotremove matter here. After all, there are
>>> no arch-specific operations done until the memory is offlined. Also,
>>> randomly checking memory blocks and then hoping that the offline will
>>> succeed is not much different from just trying to offline the
>>> block. Both have to crawl through the pfn range and bail out on
>>> unmovable memory.
>>
>> I think in general we have two approaches to memory unplugging.
>>
>> 1. Know explicitly what you want to unplug (e.g., a DIMM spanning
>> multiple memory blocks).
>>
>> 2. Find random memory blocks you can offline/unplug.
>>
>>
>> For 1, I think we both agree that we don't need this. Just try to
>> offline and you know if it worked.
>>
>> Now of course, for 2 you can try random blocks until you succeed. From
>> a sysadmin point of view that's very inefficient. From a powerpc-utils
>> point of view, that's inefficient.
> 
> How exactly is check + offline more optimal than offline, which does the
> check as its first step? I will get to your later points after this is
> clarified.

Scanning (almost) locklessly is more efficient than bouncing back and
forth with the device_hotplug_lock, mem_hotplug_lock, cpu_hotplug_lock,
and zone locks - as far as I understand. And as far as I understood,
that was the whole reason for the original commit.
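
For reference, the scan in question is conceptually something like the
following (a heavily simplified sketch; the real
is_mem_section_removable() path has many more checks):

/*
 * Walk the pfn range without taking the hotplug locks and bail out
 * on the first page that does not look movable.
 */
static bool looks_offlineable(unsigned long start_pfn,
			      unsigned long nr_pages)
{
	unsigned long pfn;

	for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++) {
		struct page *page = pfn_to_page(pfn);

		/* Free pages and LRU pages can typically be migrated. */
		if (PageBuddy(page) || PageLRU(page))
			continue;
		return false;
	}
	return true;
}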

Anyhow, you should have read until the end of my mail to find what you
were looking for :)

-- 
Thanks,

David / dhildenb



* Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
  2020-01-22 18:15                                 ` David Hildenbrand
  2020-01-22 18:38                                   ` Michal Hocko
@ 2020-01-22 19:01                                   ` Michal Hocko
  1 sibling, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2020-01-22 19:01 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Dan Williams, Linux Kernel Mailing List, Linux MM,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Greg Kroah-Hartman, Rafael J. Wysocki, Andrew Morton,
	Leonardo Bras, Nathan Lynch, Allison Randal, Nathan Fontenot,
	Thomas Gleixner, Stephen Rothwell, Anshuman Khandual,
	lantianyu1986, linuxppc-dev

On Wed 22-01-20 19:15:47, David Hildenbrand wrote:
[...]
> c) CC relevant people I identify (lsmem/chmem/powerpc-utils/etc.) on the
> patch to see if we are missing other use cases/users/implications.
> 
> Sounds like a plan?

I would start with this step. Thanks!
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
  2020-01-22 18:46                                     ` David Hildenbrand
@ 2020-01-22 19:09                                       ` Michal Hocko
  2020-01-22 20:51                                         ` Dan Williams
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2020-01-22 19:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Dan Williams, Linux Kernel Mailing List, Linux MM,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Greg Kroah-Hartman, Rafael J. Wysocki, Andrew Morton,
	Leonardo Bras, Nathan Lynch, Allison Randal, Nathan Fontenot,
	Thomas Gleixner, Stephen Rothwell, Anshuman Khandual,
	lantianyu1986, linuxppc-dev

On Wed 22-01-20 19:46:15, David Hildenbrand wrote:
> On 22.01.20 19:38, Michal Hocko wrote:
[...]
> > How exactly is check + offline more optimal than offline, which does the
> > check as its first step? I will get to your later points after this is
> > clarified.
> 
> Scanning (almost) locklessly is more efficient than bouncing back and
> forth with the device_hotplug_lock, mem_hotplug_lock, cpu_hotplug_lock,
> and zone locks - as far as I understand.

All but the zone lock shouldn't really be contended and as such
shouldn't cause any trouble. zone->lock really depends on the page
allocator usage, of course. But as soon as we have contention, it is
just more likely that the result is less reliable.

I would also be really curious how much actual time could be saved
by this - some real numbers - because hotplug operations shouldn't
happen so often that this would stand out. At least that is my
understanding.

> And as far as I understood, that was the whole reason for the original
> commit.

Well, I have my doubts, but it might be just me and I might be wrong.
My experience with a large part of the memory hotplug functionality is
that it was driven by good intentions but without the due diligence to
think beyond the most obvious use case. Having a removable flag on the
memblock sounds like a neat idea, of course. But an inherently racy
flag is just borderline useful.

Anyway, I will stop at this point and wait for real use cases.

Thanks!
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH RFC v1] mm: is_mem_section_removable() overhaul
  2020-01-22 19:09                                       ` Michal Hocko
@ 2020-01-22 20:51                                         ` Dan Williams
  0 siblings, 0 replies; 24+ messages in thread
From: Dan Williams @ 2020-01-22 20:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, Linux Kernel Mailing List, Linux MM,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Greg Kroah-Hartman, Rafael J. Wysocki, Andrew Morton,
	Leonardo Bras, Nathan Lynch, Allison Randal, Nathan Fontenot,
	Thomas Gleixner, Stephen Rothwell, Anshuman Khandual,
	lantianyu1986, linuxppc-dev

On Wed, Jan 22, 2020 at 11:09 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 22-01-20 19:46:15, David Hildenbrand wrote:
> > On 22.01.20 19:38, Michal Hocko wrote:
> [...]
> > > How exactly is check + offline more optimal than offline, which does the
> > > check as its first step? I will get to your later points after this is
> > > clarified.
> >
> > Scanning (almost) locklessly is more efficient than bouncing back and
> > forth with the device_hotplug_lock, mem_hotplug_lock, cpu_hotplug_lock,
> > and zone locks - as far as I understand.
>
> All but the zone lock shouldn't really be contended and as such
> shouldn't cause any trouble. zone->lock really depends on the page
> allocator usage, of course. But as soon as we have contention, it is
> just more likely that the result is less reliable.
>
> I would also be really curious how much actual time could be saved
> by this - some real numbers - because hotplug operations shouldn't
> happen so often that this would stand out. At least that is my
> understanding.
>
> And as far as I understood, that was the whole reason for the original
> commit.
>
> Well, I have my doubts, but it might be just me and I might be wrong.
> My experience with a large part of the memory hotplug functionality is
> that it was driven by good intentions but without the due diligence to
> think beyond the most obvious use case. Having a removable flag on the
> memblock sounds like a neat idea, of course. But an inherently racy
> flag is just borderline useful.
>
> Anyway, I will stop at this point and wait for real use cases.

...that, and practical numbers showing that optimizing an interface
that can at best give rough-estimate answers is worth the code change.


Thread overview: 24+ messages
2020-01-17 10:57 [PATCH RFC v1] mm: is_mem_section_removable() overhaul David Hildenbrand
2020-01-17 11:33 ` Michal Hocko
2020-01-17 13:08   ` David Hildenbrand
2020-01-17 14:52     ` Michal Hocko
2020-01-17 14:58       ` David Hildenbrand
2020-01-17 15:29         ` Michal Hocko
2020-01-17 15:54           ` Dan Williams
2020-01-17 16:10             ` David Hildenbrand
2020-01-17 16:57               ` Dan Williams
2020-01-20  7:48                 ` Michal Hocko
2020-01-20  9:14                   ` David Hildenbrand
2020-01-20  9:20                     ` David Hildenbrand
2020-01-21 12:07                     ` Michal Hocko
2020-01-22 10:39                       ` David Hildenbrand
2020-01-22 10:42                         ` Michal Hocko
2020-01-22 10:54                           ` David Hildenbrand
2020-01-22 11:58                             ` David Hildenbrand
2020-01-22 16:46                               ` Michal Hocko
2020-01-22 18:15                                 ` David Hildenbrand
2020-01-22 18:38                                   ` Michal Hocko
2020-01-22 18:46                                     ` David Hildenbrand
2020-01-22 19:09                                       ` Michal Hocko
2020-01-22 20:51                                         ` Dan Williams
2020-01-22 19:01                                   ` Michal Hocko
