* [PATCH 0/2] arm64/hotplug: Process MEM_OFFLINE and MEM_CANCEL_OFFLINE events @ 2020-04-15 6:39 ` Anshuman Khandual 0 siblings, 0 replies; 10+ messages in thread From: Anshuman Khandual @ 2020-04-15 6:39 UTC (permalink / raw) To: linux-mm, linux-arm-kernel Cc: akpm, catalin.marinas, will, mark.rutland, Anshuman Khandual, Michal Hocko, Dan Williams, David Hildenbrand, Yu Zhao, Hsin-Yi Wang, Thomas Gleixner, Steve Capper, linux-kernel This series improves arm64 memory event notifier (hot remove) robustness by enabling it to detect potential problems (if any) in the future. But first it enumerates memory isolation failure reasons that can be sent across a notifier. This series does not go beyond arm64 to explore if these failure reason codes could be used in other existing registered memory notifiers. Please do let me know if there is any other potential use cases, will be happy to incorporate next time around. Also should we add similar failure reasons for online_pages() as well ? This series has been tested on arm64, boot tested on x86 and build tested on multiple other platforms. This series applies on v5.7-rc1. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Hsin-Yi Wang <hsinyi@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Steve Capper <steve.capper@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Anshuman Khandual (2): mm/hotplug: Enumerate memory range offlining failure reasons arm64/hotplug: Process MEM_OFFLINE and MEM_CANCEL_OFFLINE arch/arm64/mm/mmu.c | 52 ++++++++++++++++++++++++++++++++++++++---- drivers/base/memory.c | 9 ++++++++ include/linux/memory.h | 27 ++++++++++++++++++++++ mm/memory_hotplug.c | 24 ++++++++++++------- 4 files changed, 99 insertions(+), 13 deletions(-) -- 2.20.1 ^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH 0/2] arm64/hotplug: Process MEM_OFFLINE and MEM_CANCEL_OFFLINE events @ 2020-04-15 6:39 ` Anshuman Khandual 0 siblings, 0 replies; 10+ messages in thread From: Anshuman Khandual @ 2020-04-15 6:39 UTC (permalink / raw) To: linux-mm, linux-arm-kernel Cc: mark.rutland, Michal Hocko, will, Anshuman Khandual, catalin.marinas, David Hildenbrand, Steve Capper, linux-kernel, Thomas Gleixner, Hsin-Yi Wang, akpm, Dan Williams, Yu Zhao This series improves arm64 memory event notifier (hot remove) robustness by enabling it to detect potential problems (if any) in the future. But first it enumerates memory isolation failure reasons that can be sent across a notifier. This series does not go beyond arm64 to explore if these failure reason codes could be used in other existing registered memory notifiers. Please do let me know if there is any other potential use cases, will be happy to incorporate next time around. Also should we add similar failure reasons for online_pages() as well ? This series has been tested on arm64, boot tested on x86 and build tested on multiple other platforms. This series applies on v5.7-rc1. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Hsin-Yi Wang <hsinyi@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Steve Capper <steve.capper@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Anshuman Khandual (2): mm/hotplug: Enumerate memory range offlining failure reasons arm64/hotplug: Process MEM_OFFLINE and MEM_CANCEL_OFFLINE arch/arm64/mm/mmu.c | 52 ++++++++++++++++++++++++++++++++++++++---- drivers/base/memory.c | 9 ++++++++ include/linux/memory.h | 27 ++++++++++++++++++++++ mm/memory_hotplug.c | 24 ++++++++++++------- 4 files changed, 99 insertions(+), 13 deletions(-) -- 2.20.1 _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH 1/2] mm/hotplug: Enumerate memory range offlining failure reasons 2020-04-15 6:39 ` Anshuman Khandual @ 2020-04-15 6:39 ` Anshuman Khandual -1 siblings, 0 replies; 10+ messages in thread From: Anshuman Khandual @ 2020-04-15 6:39 UTC (permalink / raw) To: linux-mm, linux-arm-kernel Cc: akpm, catalin.marinas, will, mark.rutland, Anshuman Khandual, David Hildenbrand, Michal Hocko, Dan Williams, linux-kernel Currently just a debug message is shown describing the reason during memory range offline failure. Even though just sufficient for debugging purpose, these messages can not be used in registered memory event notifiers that might be interested in MEM_CANCEL_OFFLINE event and it's possible reasons. This enumerates all existing memory range offline failure reason codes thus enabling their further effective utilization. It also adds a new element in memory notifier structure (void *data) that will carry this offline failure reason code into all registered notifiers when offlining process fails and MEM_CANCEL_OFFLINE is triggered. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> --- drivers/base/memory.c | 9 +++++++++ include/linux/memory.h | 27 +++++++++++++++++++++++++++ mm/memory_hotplug.c | 24 ++++++++++++++++-------- 3 files changed, 52 insertions(+), 8 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index dbec3a05590a..2a6d52984803 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -159,6 +159,15 @@ static ssize_t state_show(struct device *dev, struct device_attribute *attr, int memory_notify(unsigned long val, void *v) { + struct memory_notify *arg = v; + + /* + * arg->data should be available and processed only for + * MEM_CANCEL_OFFLINE event. Drop this warning when it's + * usage goes beyond MEM_CANCEL_OFFLINE. + */ + WARN_ON((val != MEM_CANCEL_OFFLINE) && arg->data); + return blocking_notifier_call_chain(&memory_chain, val, v); } diff --git a/include/linux/memory.h b/include/linux/memory.h index 439a89e758d8..7914b0dbd4bb 100644 --- a/include/linux/memory.h +++ b/include/linux/memory.h @@ -44,12 +44,39 @@ int set_memory_block_size_order(unsigned int order); #define MEM_CANCEL_ONLINE (1<<4) #define MEM_CANCEL_OFFLINE (1<<5) +/* + * Memory offline failure reasons + */ +enum offline_failure_reason { + OFFLINE_FAILURE_MEMHOLES, + OFFLINE_FAILURE_MULTIZONE, + OFFLINE_FAILURE_ISOLATE, + OFFLINE_FAILURE_NOTIFIER, + OFFLINE_FAILURE_SIGNAL, + OFFLINE_FAILURE_DISSOLVE, +}; + +static const char *const offline_failure_names[] = { + [OFFLINE_FAILURE_MEMHOLES] = "memory holes", + [OFFLINE_FAILURE_MULTIZONE] = "multizone range", + [OFFLINE_FAILURE_ISOLATE] = "failure to isolate range", + [OFFLINE_FAILURE_NOTIFIER] = "notifier failure", + [OFFLINE_FAILURE_SIGNAL] = "signal backoff", + [OFFLINE_FAILURE_DISSOLVE] = "failure to dissolve huge pages", +}; + +static inline const char *offline_failure(int reason) +{ + return offline_failure_names[reason]; +} + struct memory_notify { unsigned long start_pfn; unsigned long nr_pages; int status_change_nid_normal; int status_change_nid_high; int status_change_nid; + void *data; }; struct notifier_block; diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index fc0aad0bc1f5..2b733902dfcf 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -787,6 +787,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, arg.start_pfn = pfn; arg.nr_pages = nr_pages; + arg.data = NULL; node_states_check_changes_online(nr_pages, zone, &arg); ret = memory_notify(MEM_GOING_ONLINE, &arg); @@ -1466,7 +1467,7 @@ static int __ref __offline_pages(unsigned long start_pfn, unsigned long flags; struct zone *zone; struct memory_notify arg; - char *reason; + enum offline_failure_reason reason; mem_hotplug_begin(); @@ -1482,7 +1483,7 @@ static int __ref __offline_pages(unsigned long start_pfn, count_system_ram_pages_cb); if (nr_pages != end_pfn - start_pfn) { ret = -EINVAL; - reason = "memory holes"; + reason = OFFLINE_FAILURE_MEMHOLES; goto failed_removal; } @@ -1491,7 +1492,7 @@ static int __ref __offline_pages(unsigned long start_pfn, zone = test_pages_in_a_zone(start_pfn, end_pfn); if (!zone) { ret = -EINVAL; - reason = "multizone range"; + reason = OFFLINE_FAILURE_MULTIZONE; goto failed_removal; } node = zone_to_nid(zone); @@ -1501,19 +1502,20 @@ static int __ref __offline_pages(unsigned long start_pfn, MIGRATE_MOVABLE, MEMORY_OFFLINE | REPORT_FAILURE); if (ret < 0) { - reason = "failure to isolate range"; + reason = OFFLINE_FAILURE_ISOLATE; goto failed_removal; } nr_isolate_pageblock = ret; arg.start_pfn = start_pfn; arg.nr_pages = nr_pages; + arg.data = NULL; node_states_check_changes_offline(nr_pages, zone, &arg); ret = memory_notify(MEM_GOING_OFFLINE, &arg); ret = notifier_to_errno(ret); if (ret) { - reason = "notifier failure"; + reason = OFFLINE_FAILURE_NOTIFIER; goto failed_removal_isolated; } @@ -1521,7 +1523,7 @@ static int __ref __offline_pages(unsigned long start_pfn, for (pfn = start_pfn; pfn;) { if (signal_pending(current)) { ret = -EINTR; - reason = "signal backoff"; + reason = OFFLINE_FAILURE_SIGNAL; goto failed_removal_isolated; } @@ -1545,7 +1547,7 @@ static int __ref __offline_pages(unsigned long start_pfn, */ ret = dissolve_free_huge_pages(start_pfn, end_pfn); if (ret) { - reason = "failure to dissolve huge pages"; + reason = OFFLINE_FAILURE_DISSOLVE; goto failed_removal_isolated; } /* check again */ @@ -1599,12 +1601,18 @@ static int __ref __offline_pages(unsigned long start_pfn, failed_removal_isolated: undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE); + + /* + * Send the offline failure reason to all registered + * notifiers for MEM_CANCEL_OFFLINE. + */ + arg.data = &reason; memory_notify(MEM_CANCEL_OFFLINE, &arg); failed_removal: pr_debug("memory offlining [mem %#010llx-%#010llx] failed due to %s\n", (unsigned long long) start_pfn << PAGE_SHIFT, ((unsigned long long) end_pfn << PAGE_SHIFT) - 1, - reason); + offline_failure(reason)); /* pushback to free area */ mem_hotplug_done(); return ret; -- 2.20.1 ^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 1/2] mm/hotplug: Enumerate memory range offlining failure reasons @ 2020-04-15 6:39 ` Anshuman Khandual 0 siblings, 0 replies; 10+ messages in thread From: Anshuman Khandual @ 2020-04-15 6:39 UTC (permalink / raw) To: linux-mm, linux-arm-kernel Cc: mark.rutland, Michal Hocko, will, David Hildenbrand, catalin.marinas, Anshuman Khandual, linux-kernel, akpm, Dan Williams Currently just a debug message is shown describing the reason during memory range offline failure. Even though just sufficient for debugging purpose, these messages can not be used in registered memory event notifiers that might be interested in MEM_CANCEL_OFFLINE event and it's possible reasons. This enumerates all existing memory range offline failure reason codes thus enabling their further effective utilization. It also adds a new element in memory notifier structure (void *data) that will carry this offline failure reason code into all registered notifiers when offlining process fails and MEM_CANCEL_OFFLINE is triggered. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> --- drivers/base/memory.c | 9 +++++++++ include/linux/memory.h | 27 +++++++++++++++++++++++++++ mm/memory_hotplug.c | 24 ++++++++++++++++-------- 3 files changed, 52 insertions(+), 8 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index dbec3a05590a..2a6d52984803 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -159,6 +159,15 @@ static ssize_t state_show(struct device *dev, struct device_attribute *attr, int memory_notify(unsigned long val, void *v) { + struct memory_notify *arg = v; + + /* + * arg->data should be available and processed only for + * MEM_CANCEL_OFFLINE event. Drop this warning when it's + * usage goes beyond MEM_CANCEL_OFFLINE. + */ + WARN_ON((val != MEM_CANCEL_OFFLINE) && arg->data); + return blocking_notifier_call_chain(&memory_chain, val, v); } diff --git a/include/linux/memory.h b/include/linux/memory.h index 439a89e758d8..7914b0dbd4bb 100644 --- a/include/linux/memory.h +++ b/include/linux/memory.h @@ -44,12 +44,39 @@ int set_memory_block_size_order(unsigned int order); #define MEM_CANCEL_ONLINE (1<<4) #define MEM_CANCEL_OFFLINE (1<<5) +/* + * Memory offline failure reasons + */ +enum offline_failure_reason { + OFFLINE_FAILURE_MEMHOLES, + OFFLINE_FAILURE_MULTIZONE, + OFFLINE_FAILURE_ISOLATE, + OFFLINE_FAILURE_NOTIFIER, + OFFLINE_FAILURE_SIGNAL, + OFFLINE_FAILURE_DISSOLVE, +}; + +static const char *const offline_failure_names[] = { + [OFFLINE_FAILURE_MEMHOLES] = "memory holes", + [OFFLINE_FAILURE_MULTIZONE] = "multizone range", + [OFFLINE_FAILURE_ISOLATE] = "failure to isolate range", + [OFFLINE_FAILURE_NOTIFIER] = "notifier failure", + [OFFLINE_FAILURE_SIGNAL] = "signal backoff", + [OFFLINE_FAILURE_DISSOLVE] = "failure to dissolve huge pages", +}; + +static inline const char *offline_failure(int reason) +{ + return offline_failure_names[reason]; +} + struct memory_notify { unsigned long start_pfn; unsigned long nr_pages; int status_change_nid_normal; int status_change_nid_high; int status_change_nid; + void *data; }; struct notifier_block; diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index fc0aad0bc1f5..2b733902dfcf 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -787,6 +787,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, arg.start_pfn = pfn; arg.nr_pages = nr_pages; + arg.data = NULL; node_states_check_changes_online(nr_pages, zone, &arg); ret = memory_notify(MEM_GOING_ONLINE, &arg); @@ -1466,7 +1467,7 @@ static int __ref __offline_pages(unsigned long start_pfn, unsigned long flags; struct zone *zone; struct memory_notify arg; - char *reason; + enum offline_failure_reason reason; mem_hotplug_begin(); @@ -1482,7 +1483,7 @@ static int __ref __offline_pages(unsigned long start_pfn, count_system_ram_pages_cb); if (nr_pages != end_pfn - start_pfn) { ret = -EINVAL; - reason = "memory holes"; + reason = OFFLINE_FAILURE_MEMHOLES; goto failed_removal; } @@ -1491,7 +1492,7 @@ static int __ref __offline_pages(unsigned long start_pfn, zone = test_pages_in_a_zone(start_pfn, end_pfn); if (!zone) { ret = -EINVAL; - reason = "multizone range"; + reason = OFFLINE_FAILURE_MULTIZONE; goto failed_removal; } node = zone_to_nid(zone); @@ -1501,19 +1502,20 @@ static int __ref __offline_pages(unsigned long start_pfn, MIGRATE_MOVABLE, MEMORY_OFFLINE | REPORT_FAILURE); if (ret < 0) { - reason = "failure to isolate range"; + reason = OFFLINE_FAILURE_ISOLATE; goto failed_removal; } nr_isolate_pageblock = ret; arg.start_pfn = start_pfn; arg.nr_pages = nr_pages; + arg.data = NULL; node_states_check_changes_offline(nr_pages, zone, &arg); ret = memory_notify(MEM_GOING_OFFLINE, &arg); ret = notifier_to_errno(ret); if (ret) { - reason = "notifier failure"; + reason = OFFLINE_FAILURE_NOTIFIER; goto failed_removal_isolated; } @@ -1521,7 +1523,7 @@ static int __ref __offline_pages(unsigned long start_pfn, for (pfn = start_pfn; pfn;) { if (signal_pending(current)) { ret = -EINTR; - reason = "signal backoff"; + reason = OFFLINE_FAILURE_SIGNAL; goto failed_removal_isolated; } @@ -1545,7 +1547,7 @@ static int __ref __offline_pages(unsigned long start_pfn, */ ret = dissolve_free_huge_pages(start_pfn, end_pfn); if (ret) { - reason = "failure to dissolve huge pages"; + reason = OFFLINE_FAILURE_DISSOLVE; goto failed_removal_isolated; } /* check again */ @@ -1599,12 +1601,18 @@ static int __ref __offline_pages(unsigned long start_pfn, failed_removal_isolated: undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE); + + /* + * Send the offline failure reason to all registered + * notifiers for MEM_CANCEL_OFFLINE. + */ + arg.data = &reason; memory_notify(MEM_CANCEL_OFFLINE, &arg); failed_removal: pr_debug("memory offlining [mem %#010llx-%#010llx] failed due to %s\n", (unsigned long long) start_pfn << PAGE_SHIFT, ((unsigned long long) end_pfn << PAGE_SHIFT) - 1, - reason); + offline_failure(reason)); /* pushback to free area */ mem_hotplug_done(); return ret; -- 2.20.1 _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 2/2] arm64/hotplug: Process MEM_OFFLINE and MEM_CANCEL_OFFLINE 2020-04-15 6:39 ` Anshuman Khandual @ 2020-04-15 6:39 ` Anshuman Khandual -1 siblings, 0 replies; 10+ messages in thread From: Anshuman Khandual @ 2020-04-15 6:39 UTC (permalink / raw) To: linux-mm, linux-arm-kernel Cc: akpm, catalin.marinas, will, mark.rutland, Anshuman Khandual, Steve Capper, David Hildenbrand, Yu Zhao, Hsin-Yi Wang, Thomas Gleixner, linux-kernel Process MEM_OFFLINE and MEM_CANCEL_OFFLINE memory events to intercept any possible error conditions during memory offline operation. This includes if boot memory still got offlined even after an expilicit notifier failure or if non-boot memory got declined for an offline request. This help improve memory notifier robustness while also enhancing debug capabilities during various potential memory offlining error conditions. Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Steve Capper <steve.capper@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Hsin-Yi Wang <hsinyi@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> --- arch/arm64/mm/mmu.c | 52 ++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 47 insertions(+), 5 deletions(-) diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index a374e4f51a62..48c71d8a29b2 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -1422,13 +1422,55 @@ static int prevent_bootmem_remove_notifier(struct notifier_block *nb, unsigned long end_pfn = arg->start_pfn + arg->nr_pages; unsigned long pfn = arg->start_pfn; - if (action != MEM_GOING_OFFLINE) + if ((action != MEM_GOING_OFFLINE) && (action != MEM_OFFLINE) && + (action != MEM_CANCEL_OFFLINE)) return NOTIFY_OK; - for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) { - ms = __pfn_to_section(pfn); - if (early_section(ms)) - return NOTIFY_BAD; + if (action == MEM_GOING_OFFLINE) { + for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) { + ms = __pfn_to_section(pfn); + if (early_section(ms)) { + pr_warn("Boot memory offlining attempted\n"); + return NOTIFY_BAD; + } + } + } else if (action == MEM_OFFLINE) { + for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) { + ms = __pfn_to_section(pfn); + if (early_section(ms)) { + + /* + * This should have never happened. Boot memory + * offlining should have been prevented by this + * very notifier. Probably some memory removal + * procedure might have changed which would then + * require further debug. + */ + pr_err("Boot memory offlined\n"); + return NOTIFY_BAD; + } + } + } else if (action == MEM_CANCEL_OFFLINE) { + enum offline_failure_reason reason = *(int *)arg->data; + + if (reason != OFFLINE_FAILURE_NOTIFIER) + return NOTIFY_OK; + + for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) { + ms = __pfn_to_section(pfn); + if (early_section(ms)) + return NOTIFY_OK; + } + + /* + * This should have never happened. Non boot memory + * offlining should never have been prevented from + * this notifier. Probably some memory hot removal + * procedure might have changed which would then + * require further debug. + */ + pr_err("Notifier declined non boot memory offlining\n"); + return NOTIFY_BAD; } return NOTIFY_OK; } -- 2.20.1 ^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 2/2] arm64/hotplug: Process MEM_OFFLINE and MEM_CANCEL_OFFLINE @ 2020-04-15 6:39 ` Anshuman Khandual 0 siblings, 0 replies; 10+ messages in thread From: Anshuman Khandual @ 2020-04-15 6:39 UTC (permalink / raw) To: linux-mm, linux-arm-kernel Cc: mark.rutland, will, Steve Capper, catalin.marinas, Anshuman Khandual, David Hildenbrand, linux-kernel, Hsin-Yi Wang, akpm, Thomas Gleixner, Yu Zhao Process MEM_OFFLINE and MEM_CANCEL_OFFLINE memory events to intercept any possible error conditions during memory offline operation. This includes if boot memory still got offlined even after an expilicit notifier failure or if non-boot memory got declined for an offline request. This help improve memory notifier robustness while also enhancing debug capabilities during various potential memory offlining error conditions. Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Steve Capper <steve.capper@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Hsin-Yi Wang <hsinyi@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> --- arch/arm64/mm/mmu.c | 52 ++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 47 insertions(+), 5 deletions(-) diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index a374e4f51a62..48c71d8a29b2 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -1422,13 +1422,55 @@ static int prevent_bootmem_remove_notifier(struct notifier_block *nb, unsigned long end_pfn = arg->start_pfn + arg->nr_pages; unsigned long pfn = arg->start_pfn; - if (action != MEM_GOING_OFFLINE) + if ((action != MEM_GOING_OFFLINE) && (action != MEM_OFFLINE) && + (action != MEM_CANCEL_OFFLINE)) return NOTIFY_OK; - for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) { - ms = __pfn_to_section(pfn); - if (early_section(ms)) - return NOTIFY_BAD; + if (action == MEM_GOING_OFFLINE) { + for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) { + ms = __pfn_to_section(pfn); + if (early_section(ms)) { + pr_warn("Boot memory offlining attempted\n"); + return NOTIFY_BAD; + } + } + } else if (action == MEM_OFFLINE) { + for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) { + ms = __pfn_to_section(pfn); + if (early_section(ms)) { + + /* + * This should have never happened. Boot memory + * offlining should have been prevented by this + * very notifier. Probably some memory removal + * procedure might have changed which would then + * require further debug. + */ + pr_err("Boot memory offlined\n"); + return NOTIFY_BAD; + } + } + } else if (action == MEM_CANCEL_OFFLINE) { + enum offline_failure_reason reason = *(int *)arg->data; + + if (reason != OFFLINE_FAILURE_NOTIFIER) + return NOTIFY_OK; + + for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) { + ms = __pfn_to_section(pfn); + if (early_section(ms)) + return NOTIFY_OK; + } + + /* + * This should have never happened. Non boot memory + * offlining should never have been prevented from + * this notifier. Probably some memory hot removal + * procedure might have changed which would then + * require further debug. + */ + pr_err("Notifier declined non boot memory offlining\n"); + return NOTIFY_BAD; } return NOTIFY_OK; } -- 2.20.1 _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH 0/2] arm64/hotplug: Process MEM_OFFLINE and MEM_CANCEL_OFFLINE events 2020-04-15 6:39 ` Anshuman Khandual @ 2020-04-15 7:35 ` David Hildenbrand -1 siblings, 0 replies; 10+ messages in thread From: David Hildenbrand @ 2020-04-15 7:35 UTC (permalink / raw) To: Anshuman Khandual, linux-mm, linux-arm-kernel Cc: akpm, catalin.marinas, mark.rutland, Michal Hocko, Dan Williams, Yu Zhao, Hsin-Yi Wang, Thomas Gleixner, Steve Capper, linux-kernel On 15.04.20 08:39, Anshuman Khandual wrote: > This series improves arm64 memory event notifier (hot remove) robustness by > enabling it to detect potential problems (if any) in the future. But first > it enumerates memory isolation failure reasons that can be sent across a > notifier. This series does not go beyond arm64 to explore if these failure > reason codes could be used in other existing registered memory notifiers. > Please do let me know if there is any other potential use cases, will be > happy to incorporate next time around. Also should we add similar failure > reasons for online_pages() as well ? > > This series has been tested on arm64, boot tested on x86 and build tested > on multiple other platforms. > I'm sorry, but I have to nack this series. Why? 1. A hotplug notifier should not have to bother why offlining failed. He received a MEM_GOING_OFFLINE, followed by a MEM_CANCEL_OFFLINE. That's all he really has to know. Undo what you've done, end of story. 2. Patch 2 just introduces dead code that should never happen unless something is horribly broken in the core (memory offlined although nacked from notifier). And, it (for *whatever reason*) thinks it's okay to bail out if another notifier canceled offlining hotplugged memory. I fail to see the benefit for core changes and 4 files changed, 99 insertions(+), 13 deletions(-) -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 0/2] arm64/hotplug: Process MEM_OFFLINE and MEM_CANCEL_OFFLINE events @ 2020-04-15 7:35 ` David Hildenbrand 0 siblings, 0 replies; 10+ messages in thread From: David Hildenbrand @ 2020-04-15 7:35 UTC (permalink / raw) To: Anshuman Khandual, linux-mm, linux-arm-kernel Cc: mark.rutland, Michal Hocko, Yu Zhao, Steve Capper, catalin.marinas, linux-kernel, Thomas Gleixner, Hsin-Yi Wang, akpm, Dan Williams On 15.04.20 08:39, Anshuman Khandual wrote: > This series improves arm64 memory event notifier (hot remove) robustness by > enabling it to detect potential problems (if any) in the future. But first > it enumerates memory isolation failure reasons that can be sent across a > notifier. This series does not go beyond arm64 to explore if these failure > reason codes could be used in other existing registered memory notifiers. > Please do let me know if there is any other potential use cases, will be > happy to incorporate next time around. Also should we add similar failure > reasons for online_pages() as well ? > > This series has been tested on arm64, boot tested on x86 and build tested > on multiple other platforms. > I'm sorry, but I have to nack this series. Why? 1. A hotplug notifier should not have to bother why offlining failed. He received a MEM_GOING_OFFLINE, followed by a MEM_CANCEL_OFFLINE. That's all he really has to know. Undo what you've done, end of story. 2. Patch 2 just introduces dead code that should never happen unless something is horribly broken in the core (memory offlined although nacked from notifier). And, it (for *whatever reason*) thinks it's okay to bail out if another notifier canceled offlining hotplugged memory. I fail to see the benefit for core changes and 4 files changed, 99 insertions(+), 13 deletions(-) -- Thanks, David / dhildenb _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 0/2] arm64/hotplug: Process MEM_OFFLINE and MEM_CANCEL_OFFLINE events 2020-04-15 7:35 ` David Hildenbrand @ 2020-04-15 10:16 ` Michal Hocko -1 siblings, 0 replies; 10+ messages in thread From: Michal Hocko @ 2020-04-15 10:16 UTC (permalink / raw) To: David Hildenbrand Cc: Anshuman Khandual, linux-mm, linux-arm-kernel, akpm, catalin.marinas, mark.rutland, Dan Williams, Yu Zhao, Hsin-Yi Wang, Thomas Gleixner, Steve Capper, linux-kernel On Wed 15-04-20 09:35:33, David Hildenbrand wrote: > On 15.04.20 08:39, Anshuman Khandual wrote: > > This series improves arm64 memory event notifier (hot remove) robustness by > > enabling it to detect potential problems (if any) in the future. But first > > it enumerates memory isolation failure reasons that can be sent across a > > notifier. This series does not go beyond arm64 to explore if these failure > > reason codes could be used in other existing registered memory notifiers. > > Please do let me know if there is any other potential use cases, will be > > happy to incorporate next time around. Also should we add similar failure > > reasons for online_pages() as well ? > > > > This series has been tested on arm64, boot tested on x86 and build tested > > on multiple other platforms. > > > > I'm sorry, but I have to nack this series. Why? > > 1. A hotplug notifier should not have to bother why offlining failed. He > received a MEM_GOING_OFFLINE, followed by a MEM_CANCEL_OFFLINE. That's > all he really has to know. Undo what you've done, end of story. > > 2. Patch 2 just introduces dead code that should never happen unless > something is horribly broken in the core (memory offlined although > nacked from notifier). And, it (for *whatever reason*) thinks it's okay > to bail out if another noYtifier canceled offlining hotplugged memory. > > I fail to see the benefit for core changes and Agreed! If arm64 wants to check and report early bootmem memory offlining then just do it. There is no reason to add a whole machinery for that. Cancel notifier is indeed only supposed to restore the state before GOING_OFFLINE. > 4 files changed, 99 insertions(+), 13 deletions(-) -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 0/2] arm64/hotplug: Process MEM_OFFLINE and MEM_CANCEL_OFFLINE events @ 2020-04-15 10:16 ` Michal Hocko 0 siblings, 0 replies; 10+ messages in thread From: Michal Hocko @ 2020-04-15 10:16 UTC (permalink / raw) To: David Hildenbrand Cc: mark.rutland, Yu Zhao, Anshuman Khandual, catalin.marinas, Steve Capper, linux-kernel, linux-mm, Thomas Gleixner, Hsin-Yi Wang, akpm, Dan Williams, linux-arm-kernel On Wed 15-04-20 09:35:33, David Hildenbrand wrote: > On 15.04.20 08:39, Anshuman Khandual wrote: > > This series improves arm64 memory event notifier (hot remove) robustness by > > enabling it to detect potential problems (if any) in the future. But first > > it enumerates memory isolation failure reasons that can be sent across a > > notifier. This series does not go beyond arm64 to explore if these failure > > reason codes could be used in other existing registered memory notifiers. > > Please do let me know if there is any other potential use cases, will be > > happy to incorporate next time around. Also should we add similar failure > > reasons for online_pages() as well ? > > > > This series has been tested on arm64, boot tested on x86 and build tested > > on multiple other platforms. > > > > I'm sorry, but I have to nack this series. Why? > > 1. A hotplug notifier should not have to bother why offlining failed. He > received a MEM_GOING_OFFLINE, followed by a MEM_CANCEL_OFFLINE. That's > all he really has to know. Undo what you've done, end of story. > > 2. Patch 2 just introduces dead code that should never happen unless > something is horribly broken in the core (memory offlined although > nacked from notifier). And, it (for *whatever reason*) thinks it's okay > to bail out if another noYtifier canceled offlining hotplugged memory. > > I fail to see the benefit for core changes and Agreed! If arm64 wants to check and report early bootmem memory offlining then just do it. There is no reason to add a whole machinery for that. Cancel notifier is indeed only supposed to restore the state before GOING_OFFLINE. > 4 files changed, 99 insertions(+), 13 deletions(-) -- Michal Hocko SUSE Labs _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2020-04-15 10:24 UTC | newest] Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-04-15 6:39 [PATCH 0/2] arm64/hotplug: Process MEM_OFFLINE and MEM_CANCEL_OFFLINE events Anshuman Khandual 2020-04-15 6:39 ` Anshuman Khandual 2020-04-15 6:39 ` [PATCH 1/2] mm/hotplug: Enumerate memory range offlining failure reasons Anshuman Khandual 2020-04-15 6:39 ` Anshuman Khandual 2020-04-15 6:39 ` [PATCH 2/2] arm64/hotplug: Process MEM_OFFLINE and MEM_CANCEL_OFFLINE Anshuman Khandual 2020-04-15 6:39 ` Anshuman Khandual 2020-04-15 7:35 ` [PATCH 0/2] arm64/hotplug: Process MEM_OFFLINE and MEM_CANCEL_OFFLINE events David Hildenbrand 2020-04-15 7:35 ` David Hildenbrand 2020-04-15 10:16 ` Michal Hocko 2020-04-15 10:16 ` Michal Hocko
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.