* [PATCH 0/2] arm64/hotplug: Process MEM_OFFLINE and MEM_CANCEL_OFFLINE events
@ 2020-04-15  6:39 ` Anshuman Khandual
From: Anshuman Khandual @ 2020-04-15  6:39 UTC
  To: linux-mm, linux-arm-kernel
  Cc: akpm, catalin.marinas, will, mark.rutland, Anshuman Khandual,
	Michal Hocko, Dan Williams, David Hildenbrand, Yu Zhao,
	Hsin-Yi Wang, Thomas Gleixner, Steve Capper, linux-kernel

This series improves arm64 memory event notifier (hot remove) robustness by
enabling it to detect potential problems (if any) in the future. But first
it enumerates memory isolation failure reasons that can be sent across a
notifier. This series does not go beyond arm64 to explore whether these
failure reason codes could be used in other existing registered memory
notifiers. Please do let me know if there are any other potential use cases;
I will be happy to incorporate them next time around. Also, should we add
similar failure reasons for online_pages() as well?

This series has been tested on arm64, boot tested on x86 and build tested
on multiple other platforms.

This series applies on v5.7-rc1.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Steve Capper <steve.capper@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org

Anshuman Khandual (2):
  mm/hotplug: Enumerate memory range offlining failure reasons
  arm64/hotplug: Process MEM_OFFLINE and MEM_CANCEL_OFFLINE

 arch/arm64/mm/mmu.c    | 52 ++++++++++++++++++++++++++++++++++++++----
 drivers/base/memory.c  |  9 ++++++++
 include/linux/memory.h | 27 ++++++++++++++++++++++
 mm/memory_hotplug.c    | 24 ++++++++++++-------
 4 files changed, 99 insertions(+), 13 deletions(-)

-- 
2.20.1



* [PATCH 1/2] mm/hotplug: Enumerate memory range offlining failure reasons
  2020-04-15  6:39 ` Anshuman Khandual
@ 2020-04-15  6:39   ` Anshuman Khandual
From: Anshuman Khandual @ 2020-04-15  6:39 UTC
  To: linux-mm, linux-arm-kernel
  Cc: akpm, catalin.marinas, will, mark.rutland, Anshuman Khandual,
	David Hildenbrand, Michal Hocko, Dan Williams, linux-kernel

Currently only a debug message describing the reason is shown when memory
range offlining fails. Although sufficient for debugging purposes, these
messages cannot be used by registered memory event notifiers that might be
interested in the MEM_CANCEL_OFFLINE event and its possible reasons.

This enumerates all existing memory range offline failure reason codes, thus
enabling their further use. It also adds a new element to the memory notifier
structure (void *data) that carries the offline failure reason code to all
registered notifiers when the offlining process fails and MEM_CANCEL_OFFLINE
is triggered.
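
For illustration only (a userspace sketch, not part of the patch): the
mechanism described above boils down to an enum, a name table, and a
notifier argument that carries a pointer to the reason. The enum and the
strings are taken from the diff below; every mock_ name and the bounds
check are assumptions added for this sketch. The patch's own
offline_failure() indexes the table directly.

```c
#include <stddef.h>

/* Same reason codes as the enum added to include/linux/memory.h. */
enum offline_failure_reason {
	OFFLINE_FAILURE_MEMHOLES,
	OFFLINE_FAILURE_MULTIZONE,
	OFFLINE_FAILURE_ISOLATE,
	OFFLINE_FAILURE_NOTIFIER,
	OFFLINE_FAILURE_SIGNAL,
	OFFLINE_FAILURE_DISSOLVE,
};

/* Same strings that __offline_pages() used to assign to "reason". */
static const char *const offline_failure_names[] = {
	[OFFLINE_FAILURE_MEMHOLES]	= "memory holes",
	[OFFLINE_FAILURE_MULTIZONE]	= "multizone range",
	[OFFLINE_FAILURE_ISOLATE]	= "failure to isolate range",
	[OFFLINE_FAILURE_NOTIFIER]	= "notifier failure",
	[OFFLINE_FAILURE_SIGNAL]	= "signal backoff",
	[OFFLINE_FAILURE_DISSOLVE]	= "failure to dissolve huge pages",
};

/* Hypothetical bounds-checked lookup (the check is an assumption). */
const char *mock_offline_failure(int reason)
{
	size_t n = sizeof(offline_failure_names) /
		   sizeof(offline_failure_names[0]);

	if (reason < 0 || (size_t)reason >= n)
		return "unknown";
	return offline_failure_names[reason];
}

/* Cut-down stand-in for struct memory_notify with the new member. */
struct mock_memory_notify {
	unsigned long start_pfn;
	unsigned long nr_pages;
	void *data;	/* points at the reason on MEM_CANCEL_OFFLINE */
};

/* How a notifier could decode the reason carried in arg->data. */
const char *mock_cancel_reason(const struct mock_memory_notify *arg)
{
	if (!arg->data)
		return "none";
	return mock_offline_failure(
		*(const enum offline_failure_reason *)arg->data);
}
```

A caller would point data at a local enum offline_failure_reason before
sending the cancel event, which is what the mm/memory_hotplug.c hunk does
with "arg.data = &reason".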

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 drivers/base/memory.c  |  9 +++++++++
 include/linux/memory.h | 27 +++++++++++++++++++++++++++
 mm/memory_hotplug.c    | 24 ++++++++++++++++--------
 3 files changed, 52 insertions(+), 8 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index dbec3a05590a..2a6d52984803 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -159,6 +159,15 @@ static ssize_t state_show(struct device *dev, struct device_attribute *attr,
 
 int memory_notify(unsigned long val, void *v)
 {
+	struct memory_notify *arg = v;
+
+	/*
+	 * arg->data should be available and processed only for
+	 * MEM_CANCEL_OFFLINE event. Drop this warning when its
+	 * usage goes beyond MEM_CANCEL_OFFLINE.
+	 */
+	WARN_ON((val != MEM_CANCEL_OFFLINE) && arg->data);
+
 	return blocking_notifier_call_chain(&memory_chain, val, v);
 }
 
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 439a89e758d8..7914b0dbd4bb 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -44,12 +44,39 @@ int set_memory_block_size_order(unsigned int order);
 #define	MEM_CANCEL_ONLINE	(1<<4)
 #define	MEM_CANCEL_OFFLINE	(1<<5)
 
+/*
+ * Memory offline failure reasons
+ */
+enum offline_failure_reason {
+	OFFLINE_FAILURE_MEMHOLES,
+	OFFLINE_FAILURE_MULTIZONE,
+	OFFLINE_FAILURE_ISOLATE,
+	OFFLINE_FAILURE_NOTIFIER,
+	OFFLINE_FAILURE_SIGNAL,
+	OFFLINE_FAILURE_DISSOLVE,
+};
+
+static const char *const offline_failure_names[] = {
+	[OFFLINE_FAILURE_MEMHOLES]	= "memory holes",
+	[OFFLINE_FAILURE_MULTIZONE]	= "multizone range",
+	[OFFLINE_FAILURE_ISOLATE]	= "failure to isolate range",
+	[OFFLINE_FAILURE_NOTIFIER]	= "notifier failure",
+	[OFFLINE_FAILURE_SIGNAL]	= "signal backoff",
+	[OFFLINE_FAILURE_DISSOLVE]	= "failure to dissolve huge pages",
+};
+
+static inline const char *offline_failure(int reason)
+{
+	return offline_failure_names[reason];
+}
+
 struct memory_notify {
 	unsigned long start_pfn;
 	unsigned long nr_pages;
 	int status_change_nid_normal;
 	int status_change_nid_high;
 	int status_change_nid;
+	void *data;
 };
 
 struct notifier_block;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index fc0aad0bc1f5..2b733902dfcf 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -787,6 +787,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 
 	arg.start_pfn = pfn;
 	arg.nr_pages = nr_pages;
+	arg.data = NULL;
 	node_states_check_changes_online(nr_pages, zone, &arg);
 
 	ret = memory_notify(MEM_GOING_ONLINE, &arg);
@@ -1466,7 +1467,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
 	unsigned long flags;
 	struct zone *zone;
 	struct memory_notify arg;
-	char *reason;
+	enum offline_failure_reason reason;
 
 	mem_hotplug_begin();
 
@@ -1482,7 +1483,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
 			      count_system_ram_pages_cb);
 	if (nr_pages != end_pfn - start_pfn) {
 		ret = -EINVAL;
-		reason = "memory holes";
+		reason = OFFLINE_FAILURE_MEMHOLES;
 		goto failed_removal;
 	}
 
@@ -1491,7 +1492,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
 	zone = test_pages_in_a_zone(start_pfn, end_pfn);
 	if (!zone) {
 		ret = -EINVAL;
-		reason = "multizone range";
+		reason = OFFLINE_FAILURE_MULTIZONE;
 		goto failed_removal;
 	}
 	node = zone_to_nid(zone);
@@ -1501,19 +1502,20 @@ static int __ref __offline_pages(unsigned long start_pfn,
 				       MIGRATE_MOVABLE,
 				       MEMORY_OFFLINE | REPORT_FAILURE);
 	if (ret < 0) {
-		reason = "failure to isolate range";
+		reason = OFFLINE_FAILURE_ISOLATE;
 		goto failed_removal;
 	}
 	nr_isolate_pageblock = ret;
 
 	arg.start_pfn = start_pfn;
 	arg.nr_pages = nr_pages;
+	arg.data = NULL;
 	node_states_check_changes_offline(nr_pages, zone, &arg);
 
 	ret = memory_notify(MEM_GOING_OFFLINE, &arg);
 	ret = notifier_to_errno(ret);
 	if (ret) {
-		reason = "notifier failure";
+		reason = OFFLINE_FAILURE_NOTIFIER;
 		goto failed_removal_isolated;
 	}
 
@@ -1521,7 +1523,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
 		for (pfn = start_pfn; pfn;) {
 			if (signal_pending(current)) {
 				ret = -EINTR;
-				reason = "signal backoff";
+				reason = OFFLINE_FAILURE_SIGNAL;
 				goto failed_removal_isolated;
 			}
 
@@ -1545,7 +1547,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
 		 */
 		ret = dissolve_free_huge_pages(start_pfn, end_pfn);
 		if (ret) {
-			reason = "failure to dissolve huge pages";
+			reason = OFFLINE_FAILURE_DISSOLVE;
 			goto failed_removal_isolated;
 		}
 		/* check again */
@@ -1599,12 +1601,18 @@ static int __ref __offline_pages(unsigned long start_pfn,
 
 failed_removal_isolated:
 	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
+
+	/*
+	 * Send the offline failure reason to all registered
+	 * notifiers for MEM_CANCEL_OFFLINE.
+	 */
+	arg.data = &reason;
 	memory_notify(MEM_CANCEL_OFFLINE, &arg);
 failed_removal:
 	pr_debug("memory offlining [mem %#010llx-%#010llx] failed due to %s\n",
 		 (unsigned long long) start_pfn << PAGE_SHIFT,
 		 ((unsigned long long) end_pfn << PAGE_SHIFT) - 1,
-		 reason);
+		 offline_failure(reason));
 	/* pushback to free area */
 	mem_hotplug_done();
 	return ret;
-- 
2.20.1



* [PATCH 2/2] arm64/hotplug: Process MEM_OFFLINE and MEM_CANCEL_OFFLINE
  2020-04-15  6:39 ` Anshuman Khandual
@ 2020-04-15  6:39   ` Anshuman Khandual
From: Anshuman Khandual @ 2020-04-15  6:39 UTC
  To: linux-mm, linux-arm-kernel
  Cc: akpm, catalin.marinas, will, mark.rutland, Anshuman Khandual,
	Steve Capper, David Hildenbrand, Yu Zhao, Hsin-Yi Wang,
	Thomas Gleixner, linux-kernel

Process MEM_OFFLINE and MEM_CANCEL_OFFLINE memory events to intercept any
possible error conditions during a memory offline operation. This covers the
cases where boot memory still got offlined even after an explicit notifier
failure, or where an offline request for non-boot memory got declined. This
helps improve memory notifier robustness while also enhancing debug
capabilities during various potential memory offlining error conditions.
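
For illustration only (a userspace sketch, not the kernel code): the
decision table that the notifier implements after this patch can be
modelled as below. The mock_ and MOCK_ names, the section size and the
rule "pfns below MOCK_BOOT_END_PFN are boot memory" are all assumptions
standing in for early_section(), PAGES_PER_SECTION and the real notifier
return values.

```c
#include <stdbool.h>

enum mock_action {
	MOCK_GOING_OFFLINE,
	MOCK_OFFLINE,
	MOCK_CANCEL_OFFLINE,
	MOCK_OTHER,
};

enum mock_verdict { MOCK_NOTIFY_OK, MOCK_NOTIFY_BAD };

#define MOCK_PAGES_PER_SECTION	0x8000UL	/* assumed section size */
#define MOCK_BOOT_END_PFN	0x100000UL	/* assumed end of boot memory */

/* Stand-in for walking the sections and testing early_section(). */
static bool mock_range_has_boot_memory(unsigned long start_pfn,
				       unsigned long nr_pages)
{
	unsigned long end_pfn = start_pfn + nr_pages;
	unsigned long pfn;

	for (pfn = start_pfn; pfn < end_pfn; pfn += MOCK_PAGES_PER_SECTION)
		if (pfn < MOCK_BOOT_END_PFN)
			return true;
	return false;
}

/*
 * cancelled_by_notifier models "reason == OFFLINE_FAILURE_NOTIFIER",
 * i.e. the reason decoded from arg->data on MEM_CANCEL_OFFLINE.
 */
enum mock_verdict mock_bootmem_notifier(enum mock_action action,
					unsigned long start_pfn,
					unsigned long nr_pages,
					bool cancelled_by_notifier)
{
	bool boot = mock_range_has_boot_memory(start_pfn, nr_pages);

	switch (action) {
	case MOCK_GOING_OFFLINE:
		/* Veto any attempt to offline boot memory. */
		return boot ? MOCK_NOTIFY_BAD : MOCK_NOTIFY_OK;
	case MOCK_OFFLINE:
		/* Boot memory went offline despite the veto. */
		return boot ? MOCK_NOTIFY_BAD : MOCK_NOTIFY_OK;
	case MOCK_CANCEL_OFFLINE:
		if (!cancelled_by_notifier)
			return MOCK_NOTIFY_OK;
		/* A notifier veto is only expected for boot memory. */
		return boot ? MOCK_NOTIFY_OK : MOCK_NOTIFY_BAD;
	default:
		return MOCK_NOTIFY_OK;
	}
}
```

Note that the cancel branch cannot tell which notifier vetoed the
request; it only knows that some registered notifier did.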

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Steve Capper <steve.capper@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 arch/arm64/mm/mmu.c | 52 ++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 47 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index a374e4f51a62..48c71d8a29b2 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1422,13 +1422,55 @@ static int prevent_bootmem_remove_notifier(struct notifier_block *nb,
 	unsigned long end_pfn = arg->start_pfn + arg->nr_pages;
 	unsigned long pfn = arg->start_pfn;
 
-	if (action != MEM_GOING_OFFLINE)
+	if ((action != MEM_GOING_OFFLINE) && (action != MEM_OFFLINE) &&
+		(action != MEM_CANCEL_OFFLINE))
 		return NOTIFY_OK;
 
-	for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
-		ms = __pfn_to_section(pfn);
-		if (early_section(ms))
-			return NOTIFY_BAD;
+	if (action == MEM_GOING_OFFLINE) {
+		for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+			ms = __pfn_to_section(pfn);
+			if (early_section(ms)) {
+				pr_warn("Boot memory offlining attempted\n");
+				return NOTIFY_BAD;
+			}
+		}
+	} else if (action == MEM_OFFLINE) {
+		for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+			ms = __pfn_to_section(pfn);
+			if (early_section(ms)) {
+
+				/*
+				 * This should have never happened. Boot memory
+				 * offlining should have been prevented by this
+				 * very notifier. Probably some memory removal
+				 * procedure might have changed which would then
+				 * require further debug.
+				 */
+				pr_err("Boot memory offlined\n");
+				return NOTIFY_BAD;
+			}
+		}
+	} else if (action == MEM_CANCEL_OFFLINE) {
+		enum offline_failure_reason reason = *(int *)arg->data;
+
+		if (reason != OFFLINE_FAILURE_NOTIFIER)
+			return NOTIFY_OK;
+
+		for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+			ms = __pfn_to_section(pfn);
+			if (early_section(ms))
+				return NOTIFY_OK;
+		}
+
+		/*
+		 * This should have never happened. Non boot memory
+		 * offlining should never have been prevented from
+		 * this notifier. Probably some memory hot removal
+		 * procedure might have changed which would then
+		 * require further debug.
+		 */
+		pr_err("Notifier declined non boot memory offlining\n");
+		return NOTIFY_BAD;
 	}
 	return NOTIFY_OK;
 }
-- 
2.20.1



* Re: [PATCH 0/2] arm64/hotplug: Process MEM_OFFLINE and MEM_CANCEL_OFFLINE events
  2020-04-15  6:39 ` Anshuman Khandual
@ 2020-04-15  7:35   ` David Hildenbrand
From: David Hildenbrand @ 2020-04-15  7:35 UTC
  To: Anshuman Khandual, linux-mm, linux-arm-kernel
  Cc: akpm, catalin.marinas, mark.rutland, Michal Hocko, Dan Williams,
	Yu Zhao, Hsin-Yi Wang, Thomas Gleixner, Steve Capper,
	linux-kernel

On 15.04.20 08:39, Anshuman Khandual wrote:
> This series improves arm64 memory event notifier (hot remove) robustness by
> enabling it to detect potential problems (if any) in the future. But first
> it enumerates memory isolation failure reasons that can be sent across a
> notifier. This series does not go beyond arm64 to explore if these failure
> reason codes could be used in other existing registered memory notifiers.
> Please do let me know if there is any other potential use cases, will be
> happy to incorporate next time around. Also should we add similar failure
> reasons for online_pages() as well ?
> 
> This series has been tested on arm64, boot tested on x86 and build tested
> on multiple other platforms.
> 

I'm sorry, but I have to nack this series. Why?

1. A hotplug notifier should not have to bother why offlining failed. He
received a MEM_GOING_OFFLINE, followed by a MEM_CANCEL_OFFLINE. That's
all he really has to know. Undo what you've done, end of story.

2. Patch 2 just introduces dead code that should never happen unless
something is horribly broken in the core (memory offlined although
nacked from notifier). And, it (for *whatever reason*) thinks it's okay
to bail out if another notifier canceled offlining hotplugged memory.
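
To make the contract in point 1 concrete, here is a minimal userspace
sketch (an illustration under stated assumptions, not kernel code): the
notifier prepares on the going-offline event and restores its previous
state on cancel, without ever looking at a reason. The mock_subsys
structure and the MOCK_EV_ event names are hypothetical.

```c
enum mock_event {
	MOCK_EV_GOING_OFFLINE,
	MOCK_EV_OFFLINE,
	MOCK_EV_CANCEL_OFFLINE,
};

/* Hypothetical per-subsystem state touched by memory hot remove. */
struct mock_subsys {
	int quiesced;	/* 1 while an offline request is pending */
	int released;	/* number of ranges fully given up */
};

void mock_subsys_notify(struct mock_subsys *s, enum mock_event ev)
{
	switch (ev) {
	case MOCK_EV_GOING_OFFLINE:
		/* Prepare: stop using the range. */
		s->quiesced = 1;
		break;
	case MOCK_EV_OFFLINE:
		/* Commit: the range is gone for good. */
		s->quiesced = 0;
		s->released++;
		break;
	case MOCK_EV_CANCEL_OFFLINE:
		/* Undo the preparation; why it failed is irrelevant. */
		s->quiesced = 0;
		break;
	}
}
```

The cancel branch is the same whatever caused the failure, which is why
no reason code needs to be passed along.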

I fail to see the benefit for core changes and

 4 files changed, 99 insertions(+), 13 deletions(-)

-- 
Thanks,

David / dhildenb



* Re: [PATCH 0/2] arm64/hotplug: Process MEM_OFFLINE and MEM_CANCEL_OFFLINE events
  2020-04-15  7:35   ` David Hildenbrand
@ 2020-04-15 10:16     ` Michal Hocko
From: Michal Hocko @ 2020-04-15 10:16 UTC
  To: David Hildenbrand
  Cc: Anshuman Khandual, linux-mm, linux-arm-kernel, akpm,
	catalin.marinas, mark.rutland, Dan Williams, Yu Zhao,
	Hsin-Yi Wang, Thomas Gleixner, Steve Capper, linux-kernel

On Wed 15-04-20 09:35:33, David Hildenbrand wrote:
> On 15.04.20 08:39, Anshuman Khandual wrote:
> > This series improves arm64 memory event notifier (hot remove) robustness by
> > enabling it to detect potential problems (if any) in the future. But first
> > it enumerates memory isolation failure reasons that can be sent across a
> > notifier. This series does not go beyond arm64 to explore if these failure
> > reason codes could be used in other existing registered memory notifiers.
> > Please do let me know if there is any other potential use cases, will be
> > happy to incorporate next time around. Also should we add similar failure
> > reasons for online_pages() as well ?
> > 
> > This series has been tested on arm64, boot tested on x86 and build tested
> > on multiple other platforms.
> > 
> 
> I'm sorry, but I have to nack this series. Why?
> 
> 1. A hotplug notifier should not have to bother why offlining failed. He
> received a MEM_GOING_OFFLINE, followed by a MEM_CANCEL_OFFLINE. That's
> all he really has to know. Undo what you've done, end of story.
> 
> 2. Patch 2 just introduces dead code that should never happen unless
> something is horribly broken in the core (memory offlined although
> nacked from notifier). And, it (for *whatever reason*) thinks it's okay
> > to bail out if another notifier canceled offlining hotplugged memory.
> 
> I fail to see the benefit for core changes and

Agreed! If arm64 wants to check and report early bootmem memory
offlining then just do it. There is no reason to add a whole machinery
for that. Cancel notifier is indeed only supposed to restore the state
before GOING_OFFLINE.

>  4 files changed, 99 insertions(+), 13 deletions(-)

-- 
Michal Hocko
SUSE Labs
