* [PATCH v18 0/7] crash: Kernel handling of CPU and memory hot un/plug
From: Eric DeVolder @ 2023-01-31 22:42 UTC
  To: linux-kernel, x86, kexec, ebiederm, dyoung, bhe, vgoyal
  Cc: tglx, mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky, robh,
	efault, rppt, david, sourabhjain, konrad.wilk, boris.ostrovsky,
	eric.devolder

Once the kdump service is loaded, if changes to CPUs or memory occur,
either by hot un/plug or off/onlining, the crash elfcorehdr must also
be updated.

The elfcorehdr describes to kdump the CPUs and memory in the system,
and any inaccuracies can result in a vmcore with missing CPU context
or memory regions.

The current solution utilizes udev to initiate an unload-then-reload
of the kdump image (e.g. kernel, initrd, boot_params, purgatory and
elfcorehdr) by the userspace kexec utility. In previous posts I have
outlined the significant performance problems related to offloading
this activity to userspace.

This patchset introduces a generic crash handler that registers with
the CPU and memory notifiers. Upon CPU or memory changes, from either
hot un/plug or off/onlining, this generic handler is invoked and
performs the necessary housekeeping, for example obtaining the
appropriate lock, and then invokes an architecture-specific handler
to perform the elfcorehdr update.

In the case of x86_64, the arch-specific handler generates a new
elfcorehdr and overwrites the old one in memory; thus no userspace
involvement is needed.
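
As a rough sketch of that flow (function and member names follow this
series' changelogs; the details are illustrative, not the exact patch):

  static void handle_hotplug_event(unsigned int hp_action, unsigned int cpu)
  {
          /* Serialize against concurrent kexec load/unload */
          if (!kexec_trylock())
                  return;

          if (kexec_crash_image) {
                  struct kimage *image = kexec_crash_image;

                  /* Record what happened for the arch handler; cpu
                   * matters only for CPU hot-remove events, and that
                   * bookkeeping is elided here.
                   */
                  image->hp_action = hp_action;

                  /* Arch code regenerates the elfcorehdr and overwrites
                   * the stale copy in the crash kernel region.
                   */
                  arch_crash_handle_hotplug_event(image);
          }

          kexec_unlock();
  }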

To realize the benefits of this patchset, or to test it, one must make
a couple of minor changes to userspace:

 - Prevent udev from updating kdump crash kernel on hot un/plug changes.
   Add the following as the first lines to the RHEL udev rule file
   /usr/lib/udev/rules.d/98-kexec.rules:

   # The kernel handles updates to crash elfcorehdr for cpu and memory changes
   SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
   SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"

   With this changeset applied, the two rules evaluate to false for
   cpu and memory change events and thus skip the userspace
   unload-then-reload of kdump (a sketch of the kernel attribute
   backing crash_hotplug follows this list).

 - Switch to kexec_file_load() for loading the kdump kernel.
   E.g. on RHEL, in /usr/bin/kdumpctl, change to:
    standard_kexec_args="-p -d -s"
   where the added -s selects the kexec_file_load() syscall.
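
For reference, the crash_hotplug attribute those udev rules test is
introduced by patch 6/7 of this series. A minimal sketch of the cpu
side (the helper name crash_hotplug_cpu_support() is assumed here,
not taken from the patch):

  static ssize_t crash_hotplug_show(struct device *dev,
                                    struct device_attribute *attr,
                                    char *buf)
  {
          /* Reads 1 when the kernel updates the elfcorehdr itself */
          return sysfs_emit(buf, "%d\n", crash_hotplug_cpu_support());
  }
  static DEVICE_ATTR_ADMIN_RO(crash_hotplug);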

This kernel patchset also supports kexec_load() with a modified kexec
userspace utility. A working changeset to the kexec userspace utility
is posted to the kexec-tools mailing list here:

 http://lists.infradead.org/pipermail/kexec/2022-October/026032.html

To use the kexec-tools patch: apply it, build and install kexec-tools,
then change kdumpctl's standard_kexec_args to replace -s with
--hotplug. Removing -s reverts to the kexec_load() syscall, and
adding --hotplug invokes the changes put forth in the kexec-tools
patch.

Regards,
eric
---
v18: 31jan2023
 - Rebased onto 6.2.0-rc6
 - Renamed struct kimage member hotplug_event to hp_action, and
   re-enumerated the KEXEC_CRASH_HP_x items, adding _NONE at 0.
 - Moved to cpuhp state CPUHP_BP_PREPARE_DYN instead of
   CPUHP_AP_ONLINE_DYN in order to minimize the window of time during
   which a CPU is not reflected in the elfcorehdr.
 - Reworked some of the comments and commit messages to offer
   more of the why, than what, per Thomas Gleixner.

v17: 18jan2023
 - Rebased onto 6.2.0-rc4
 - Moved a bit of code around so that kexec_load()-only builds
   work, per Sourabh.
 - Corrected computation of number of memory region Phdrs needed
   when x86 memory hotplug is not enabled, per Baoquan.

v16: 5jan2023
 https://lkml.org/lkml/2023/1/5/673
 - Rebased onto 6.2.0-rc2
 - Corrected error identified by Baoquan.

v15: 9dec2022
 https://lkml.org/lkml/2022/12/9/520
 - Rebased onto 6.1.0-rc8
 - Replaced arch_un/map_crash_pages() with direct use of
   kun/map_local_pages(), per Boris.
 - Some x86 changes, per Boris.

v14: 16nov2022
 https://lkml.org/lkml/2022/11/16/1645
 - Rebased onto 6.1.0-rc5
 - Introduced CRASH_HOTPLUG Kconfig item to better fine tune
   compilation of feature components, per Boris.
 - Removed hp_action parameter to arch_crash_handle_hotplug_event()
   as it is unused.

v13: 31oct2022
 https://lkml.org/lkml/2022/10/31/854
 - Rebased onto 6.1.0-rc3, which meant converting from
   mutex_lock(&kexec_mutex) to the new kexec_trylock().
 - Moved arch_un/map_crash_pages() into kexec.h with a default
   implementation using k/unmap_local_pages().
 - Changed more #ifdef's into IS_ENABLED()
 - Changed CRASH_MAX_MEMORY_RANGES to 8192 from 32768, and moved it
   into x86 crash.c as a #define rather than a Kconfig item, per Boris.
 - Check number of Phdrs against PN_XNUM, max possible.

v12: 9sep2022
 https://lkml.org/lkml/2022/9/9/1358
 - Rebased onto 6.0-rc4
 - Addressed some minor formatting items, per Baoquan

v11: 26aug2022
 https://lkml.org/lkml/2022/8/26/963
 - Rebased onto 6.0-rc2
 - Redid the rework of __weak to use asm/kexec.h, per Baoquan
 - Reworked some comments and minor items, per Baoquan

v10: 21jul2022
 https://lkml.org/lkml/2022/7/21/1007
 - Rebased to 5.19.0-rc7
 - Per Sourabh, corrected build issue with arch_un/map_crash_pages()
   for architectures not supporting this feature.
 - Per David Hildenbrand, removed the WARN_ONCE() altogether.
 - Per Dave Hansen, converted to use of kmap_local_page().
 - Per Baoquan He, replaced use of __weak with the kexec technique.

v9: 13jun2022
 https://lkml.org/lkml/2022/6/13/3382
 - Rebased to 5.18.0
 - Per Sourabh, moved crash_prepare_elf64_headers() into common
   crash_core.c to avoid compile issues with the kexec_load()-only path.
 - Per David Hildenbrand, replaced mutex_trylock() with mutex_lock().
 - Changed the __weak arch_crash_handle_hotplug_event() to utilize
   WARN_ONCE() instead of WARN(). Fixed some formatting issues.
 - Per Sourabh, introduced sysfs attribute crash_hotplug for memory
   and CPUs; for use by userspace (udev) to determine if the kernel
   performs crash hot un/plug support.
 - Per Sourabh, moved the code detecting the elfcorehdr segment from
   arch/x86 into crash_core:handle_hotplug_event() so both kexec_load
   and kexec_file_load can benefit.
 - Updated userspace kexec-tools kexec utility to reflect change to
   using CRASH_MAX_MEMORY_RANGES and get_nr_cpus().
 - Updated the new proposed udev rules to reflect using the sysfs
   attributes crash_hotplug.

v8: 5may2022
 https://lkml.org/lkml/2022/5/5/1133
 - Per Borislav Petkov, eliminated CONFIG_CRASH_HOTPLUG in favor
   of CONFIG_HOTPLUG_CPU || CONFIG_MEMORY_HOTPLUG, i.e. a new define
   is not needed. Also used IS_ENABLED() rather than #ifdef's.
   Renamed crash_hotplug_handler() to handle_hotplug_event().
   And other corrections.
 - Per Baoquan, minimized the parameters to
   arch_crash_handle_hotplug_event() to hp_action and cpu.
 - Introduce KEXEC_CRASH_HP_INVALID_CPU definition, per Baoquan.
 - Per Sourabh Jain, renamed and repurposed CRASH_HOTPLUG_ELFCOREHDR_SZ
   to CONFIG_CRASH_MAX_MEMORY_RANGES, mirroring the kexec-tools change
   by David Hildenbrand. Folded this patch into the x86
   kexec_file_load support patch.

v7: 13apr2022
 https://lkml.org/lkml/2022/4/13/850
 - Resolved parameter usage to crash_hotplug_handler(), per Baoquan.

v6: 1apr2022
 https://lkml.org/lkml/2022/4/1/1203
 - Reword commit messages and some comment cleanup per Baoquan.
 - Changed elf_index to elfcorehdr_index for clarity.
 - Minor code changes per Baoquan.

v5: 3mar2022
 https://lkml.org/lkml/2022/3/3/674
 - Reworded description of CRASH_HOTPLUG_ELFCOREHDR_SZ, per
   David Hildenbrand.
 - Slightly refactored a few patches, per Baoquan's recommendation.

v4: 9feb2022
 https://lkml.org/lkml/2022/2/9/1406
 - Refactored patches per Baoquan's suggestions.
 - A few corrections, per Baoquan.

v3: 10jan2022
 https://lkml.org/lkml/2022/1/10/1212
 - Rebased per Baoquan He's request.
 - Changed memory notifier per David Hildenbrand.
 - Providing example kexec userspace change in cover letter.

RFC v2: 7dec2021
 https://lkml.org/lkml/2021/12/7/1088
 - Acting upon Baoquan He's suggestion of removing the elfcorehdr from
   the purgatory list of segments, removed the purgatory code from the
   patchset; it is significantly simpler now.

RFC v1: 18nov2021
 https://lkml.org/lkml/2021/11/18/845
 - working patchset demonstrating kernel handling of hotplug
   updates to x86 elfcorehdr for kexec_file_load

RFC: 14dec2020
 https://lkml.org/lkml/2020/12/14/532
 - proposed concept of allowing kernel to handle hotplug update
   of elfcorehdr
---


Eric DeVolder (7):
  crash: move a few code bits to setup support of crash hotplug
  crash: prototype change for crash_prepare_elf64_headers()
  crash: add generic infrastructure for crash hotplug support
  kexec: exclude elfcorehdr from the segment digest
  kexec: exclude hot remove cpu from elfcorehdr notes
  crash: memory and cpu hotplug sysfs attributes
  x86/crash: add x86 crash hotplug support

 .../admin-guide/mm/memory-hotplug.rst         |   8 +
 Documentation/core-api/cpu_hotplug.rst        |  18 +
 arch/arm64/kernel/machine_kexec_file.c        |   6 +-
 arch/powerpc/kexec/file_load_64.c             |   2 +-
 arch/riscv/kernel/elf_kexec.c                 |   7 +-
 arch/x86/Kconfig                              |  13 +
 arch/x86/include/asm/kexec.h                  |  15 +
 arch/x86/kernel/crash.c                       | 124 ++++++-
 drivers/base/cpu.c                            |  14 +
 drivers/base/memory.c                         |  13 +
 include/linux/crash_core.h                    |   9 +
 include/linux/kexec.h                         |  53 ++-
 kernel/crash_core.c                           | 337 ++++++++++++++++++
 kernel/kexec_file.c                           | 187 +---------
 14 files changed, 595 insertions(+), 211 deletions(-)

-- 
2.31.1



* [PATCH v18 1/7] crash: move a few code bits to setup support of crash hotplug
From: Eric DeVolder @ 2023-01-31 22:42 UTC
  To: linux-kernel, x86, kexec, ebiederm, dyoung, bhe, vgoyal
  Cc: tglx, mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky, robh,
	efault, rppt, david, sourabhjain, konrad.wilk, boris.ostrovsky,
	eric.devolder

The crash hotplug support leans on the work for the kexec_file_load()
syscall. To also support the kexec_load() syscall, a few bits of code
need to be moved outside of CONFIG_KEXEC_FILE. As such, these bits are
moved out of kexec_file.c and into the common location crash_core.c.

No functionality change.
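
For orientation, a sketch of how callers use the two moved helpers
(modeled on the arch prepare_elf_headers() paths; the values and the
function name example_prepare() are illustrative only):

  static int example_prepare(void **hdrs, unsigned long *sz)
  {
          struct crash_mem *cmem;
          int ret;

          cmem = vzalloc(struct_size(cmem, ranges, 2));
          if (!cmem)
                  return -ENOMEM;

          cmem->max_nr_ranges = 2;
          cmem->nr_ranges = 1;
          cmem->ranges[0].start = 0x1000;
          cmem->ranges[0].end = 0x3fff;

          /* Excluding [0x2000, 0x2fff] splits the one range into two */
          ret = crash_exclude_mem_range(cmem, 0x2000, 0x2fff);
          if (!ret)
                  ret = crash_prepare_elf64_headers(cmem, true, hdrs, sz);

          vfree(cmem);
          return ret;
  }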

Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
---
 include/linux/kexec.h |  30 +++----
 kernel/crash_core.c   | 182 ++++++++++++++++++++++++++++++++++++++++++
 kernel/kexec_file.c   | 181 -----------------------------------------
 3 files changed, 197 insertions(+), 196 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 5dd4343c1bbe..582ea213467a 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -105,6 +105,21 @@ struct compat_kexec_segment {
 };
 #endif
 
+/* Alignment required for elf header segment */
+#define ELF_CORE_HEADER_ALIGN   4096
+
+struct crash_mem {
+	unsigned int max_nr_ranges;
+	unsigned int nr_ranges;
+	struct range ranges[];
+};
+
+extern int crash_exclude_mem_range(struct crash_mem *mem,
+				   unsigned long long mstart,
+				   unsigned long long mend);
+extern int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map,
+				       void **addr, unsigned long *sz);
+
 #ifdef CONFIG_KEXEC_FILE
 struct purgatory_info {
 	/*
@@ -238,21 +253,6 @@ static inline int arch_kexec_locate_mem_hole(struct kexec_buf *kbuf)
 }
 #endif
 
-/* Alignment required for elf header segment */
-#define ELF_CORE_HEADER_ALIGN   4096
-
-struct crash_mem {
-	unsigned int max_nr_ranges;
-	unsigned int nr_ranges;
-	struct range ranges[];
-};
-
-extern int crash_exclude_mem_range(struct crash_mem *mem,
-				   unsigned long long mstart,
-				   unsigned long long mend);
-extern int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map,
-				       void **addr, unsigned long *sz);
-
 #ifndef arch_kexec_apply_relocations_add
 /*
  * arch_kexec_apply_relocations_add - apply relocations of type RELA
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 87ef6096823f..8a439b6d723b 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -10,6 +10,7 @@
 #include <linux/utsname.h>
 #include <linux/vmalloc.h>
 #include <linux/sizes.h>
+#include <linux/kexec.h>
 
 #include <asm/page.h>
 #include <asm/sections.h>
@@ -314,6 +315,187 @@ static int __init parse_crashkernel_dummy(char *arg)
 }
 early_param("crashkernel", parse_crashkernel_dummy);
 
+int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map,
+			  void **addr, unsigned long *sz)
+{
+	Elf64_Ehdr *ehdr;
+	Elf64_Phdr *phdr;
+	unsigned long nr_cpus = num_possible_cpus(), nr_phdr, elf_sz;
+	unsigned char *buf;
+	unsigned int cpu, i;
+	unsigned long long notes_addr;
+	unsigned long mstart, mend;
+
+	/* extra phdr for vmcoreinfo ELF note */
+	nr_phdr = nr_cpus + 1;
+	nr_phdr += mem->nr_ranges;
+
+	/*
+	 * kexec-tools creates an extra PT_LOAD phdr for kernel text mapping
+	 * area (for example, ffffffff80000000 - ffffffffa0000000 on x86_64).
+	 * I think this is required by tools like gdb. So same physical
+	 * memory will be mapped in two ELF headers. One will contain kernel
+	 * text virtual addresses and other will have __va(physical) addresses.
+	 */
+
+	nr_phdr++;
+	elf_sz = sizeof(Elf64_Ehdr) + nr_phdr * sizeof(Elf64_Phdr);
+	elf_sz = ALIGN(elf_sz, ELF_CORE_HEADER_ALIGN);
+
+	buf = vzalloc(elf_sz);
+	if (!buf)
+		return -ENOMEM;
+
+	ehdr = (Elf64_Ehdr *)buf;
+	phdr = (Elf64_Phdr *)(ehdr + 1);
+	memcpy(ehdr->e_ident, ELFMAG, SELFMAG);
+	ehdr->e_ident[EI_CLASS] = ELFCLASS64;
+	ehdr->e_ident[EI_DATA] = ELFDATA2LSB;
+	ehdr->e_ident[EI_VERSION] = EV_CURRENT;
+	ehdr->e_ident[EI_OSABI] = ELF_OSABI;
+	memset(ehdr->e_ident + EI_PAD, 0, EI_NIDENT - EI_PAD);
+	ehdr->e_type = ET_CORE;
+	ehdr->e_machine = ELF_ARCH;
+	ehdr->e_version = EV_CURRENT;
+	ehdr->e_phoff = sizeof(Elf64_Ehdr);
+	ehdr->e_ehsize = sizeof(Elf64_Ehdr);
+	ehdr->e_phentsize = sizeof(Elf64_Phdr);
+
+	/* Prepare one phdr of type PT_NOTE for each present CPU */
+	for_each_present_cpu(cpu) {
+		phdr->p_type = PT_NOTE;
+		notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
+		phdr->p_offset = phdr->p_paddr = notes_addr;
+		phdr->p_filesz = phdr->p_memsz = sizeof(note_buf_t);
+		(ehdr->e_phnum)++;
+		phdr++;
+	}
+
+	/* Prepare one PT_NOTE header for vmcoreinfo */
+	phdr->p_type = PT_NOTE;
+	phdr->p_offset = phdr->p_paddr = paddr_vmcoreinfo_note();
+	phdr->p_filesz = phdr->p_memsz = VMCOREINFO_NOTE_SIZE;
+	(ehdr->e_phnum)++;
+	phdr++;
+
+	/* Prepare PT_LOAD type program header for kernel text region */
+	if (need_kernel_map) {
+		phdr->p_type = PT_LOAD;
+		phdr->p_flags = PF_R|PF_W|PF_X;
+		phdr->p_vaddr = (unsigned long) _text;
+		phdr->p_filesz = phdr->p_memsz = _end - _text;
+		phdr->p_offset = phdr->p_paddr = __pa_symbol(_text);
+		ehdr->e_phnum++;
+		phdr++;
+	}
+
+	/* Go through all the ranges in mem->ranges[] and prepare phdr */
+	for (i = 0; i < mem->nr_ranges; i++) {
+		mstart = mem->ranges[i].start;
+		mend = mem->ranges[i].end;
+
+		phdr->p_type = PT_LOAD;
+		phdr->p_flags = PF_R|PF_W|PF_X;
+		phdr->p_offset  = mstart;
+
+		phdr->p_paddr = mstart;
+		phdr->p_vaddr = (unsigned long) __va(mstart);
+		phdr->p_filesz = phdr->p_memsz = mend - mstart + 1;
+		phdr->p_align = 0;
+		ehdr->e_phnum++;
+		pr_debug("Crash PT_LOAD ELF header. phdr=%p vaddr=0x%llx, paddr=0x%llx, sz=0x%llx e_phnum=%d p_offset=0x%llx\n",
+			phdr, phdr->p_vaddr, phdr->p_paddr, phdr->p_filesz,
+			ehdr->e_phnum, phdr->p_offset);
+		phdr++;
+	}
+
+	*addr = buf;
+	*sz = elf_sz;
+	return 0;
+}
+
+int crash_exclude_mem_range(struct crash_mem *mem,
+			    unsigned long long mstart, unsigned long long mend)
+{
+	int i, j;
+	unsigned long long start, end, p_start, p_end;
+	struct range temp_range = {0, 0};
+
+	for (i = 0; i < mem->nr_ranges; i++) {
+		start = mem->ranges[i].start;
+		end = mem->ranges[i].end;
+		p_start = mstart;
+		p_end = mend;
+
+		if (mstart > end || mend < start)
+			continue;
+
+		/* Truncate any area outside of range */
+		if (mstart < start)
+			p_start = start;
+		if (mend > end)
+			p_end = end;
+
+		/* Found completely overlapping range */
+		if (p_start == start && p_end == end) {
+			mem->ranges[i].start = 0;
+			mem->ranges[i].end = 0;
+			if (i < mem->nr_ranges - 1) {
+				/* Shift rest of the ranges to left */
+				for (j = i; j < mem->nr_ranges - 1; j++) {
+					mem->ranges[j].start =
+						mem->ranges[j+1].start;
+					mem->ranges[j].end =
+							mem->ranges[j+1].end;
+				}
+
+				/*
+				 * Continue to check if there are another overlapping ranges
+				 * from the current position because of shifting the above
+				 * mem ranges.
+				 */
+				i--;
+				mem->nr_ranges--;
+				continue;
+			}
+			mem->nr_ranges--;
+			return 0;
+		}
+
+		if (p_start > start && p_end < end) {
+			/* Split original range */
+			mem->ranges[i].end = p_start - 1;
+			temp_range.start = p_end + 1;
+			temp_range.end = end;
+		} else if (p_start != start)
+			mem->ranges[i].end = p_start - 1;
+		else
+			mem->ranges[i].start = p_end + 1;
+		break;
+	}
+
+	/* If a split happened, add the split to array */
+	if (!temp_range.end)
+		return 0;
+
+	/* Split happened */
+	if (i == mem->max_nr_ranges - 1)
+		return -ENOMEM;
+
+	/* Location where new range should go */
+	j = i + 1;
+	if (j < mem->nr_ranges) {
+		/* Move over all ranges one slot towards the end */
+		for (i = mem->nr_ranges - 1; i >= j; i--)
+			mem->ranges[i + 1] = mem->ranges[i];
+	}
+
+	mem->ranges[j].start = temp_range.start;
+	mem->ranges[j].end = temp_range.end;
+	mem->nr_ranges++;
+	return 0;
+}
+
 Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
 			  void *data, size_t data_len)
 {
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index dd5983010b7b..ead3443e7f9d 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -1135,184 +1135,3 @@ int kexec_purgatory_get_set_symbol(struct kimage *image, const char *name,
 	return 0;
 }
 #endif /* CONFIG_ARCH_HAS_KEXEC_PURGATORY */
-
-int crash_exclude_mem_range(struct crash_mem *mem,
-			    unsigned long long mstart, unsigned long long mend)
-{
-	int i, j;
-	unsigned long long start, end, p_start, p_end;
-	struct range temp_range = {0, 0};
-
-	for (i = 0; i < mem->nr_ranges; i++) {
-		start = mem->ranges[i].start;
-		end = mem->ranges[i].end;
-		p_start = mstart;
-		p_end = mend;
-
-		if (mstart > end || mend < start)
-			continue;
-
-		/* Truncate any area outside of range */
-		if (mstart < start)
-			p_start = start;
-		if (mend > end)
-			p_end = end;
-
-		/* Found completely overlapping range */
-		if (p_start == start && p_end == end) {
-			mem->ranges[i].start = 0;
-			mem->ranges[i].end = 0;
-			if (i < mem->nr_ranges - 1) {
-				/* Shift rest of the ranges to left */
-				for (j = i; j < mem->nr_ranges - 1; j++) {
-					mem->ranges[j].start =
-						mem->ranges[j+1].start;
-					mem->ranges[j].end =
-							mem->ranges[j+1].end;
-				}
-
-				/*
-				 * Continue to check if there are another overlapping ranges
-				 * from the current position because of shifting the above
-				 * mem ranges.
-				 */
-				i--;
-				mem->nr_ranges--;
-				continue;
-			}
-			mem->nr_ranges--;
-			return 0;
-		}
-
-		if (p_start > start && p_end < end) {
-			/* Split original range */
-			mem->ranges[i].end = p_start - 1;
-			temp_range.start = p_end + 1;
-			temp_range.end = end;
-		} else if (p_start != start)
-			mem->ranges[i].end = p_start - 1;
-		else
-			mem->ranges[i].start = p_end + 1;
-		break;
-	}
-
-	/* If a split happened, add the split to array */
-	if (!temp_range.end)
-		return 0;
-
-	/* Split happened */
-	if (i == mem->max_nr_ranges - 1)
-		return -ENOMEM;
-
-	/* Location where new range should go */
-	j = i + 1;
-	if (j < mem->nr_ranges) {
-		/* Move over all ranges one slot towards the end */
-		for (i = mem->nr_ranges - 1; i >= j; i--)
-			mem->ranges[i + 1] = mem->ranges[i];
-	}
-
-	mem->ranges[j].start = temp_range.start;
-	mem->ranges[j].end = temp_range.end;
-	mem->nr_ranges++;
-	return 0;
-}
-
-int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map,
-			  void **addr, unsigned long *sz)
-{
-	Elf64_Ehdr *ehdr;
-	Elf64_Phdr *phdr;
-	unsigned long nr_cpus = num_possible_cpus(), nr_phdr, elf_sz;
-	unsigned char *buf;
-	unsigned int cpu, i;
-	unsigned long long notes_addr;
-	unsigned long mstart, mend;
-
-	/* extra phdr for vmcoreinfo ELF note */
-	nr_phdr = nr_cpus + 1;
-	nr_phdr += mem->nr_ranges;
-
-	/*
-	 * kexec-tools creates an extra PT_LOAD phdr for kernel text mapping
-	 * area (for example, ffffffff80000000 - ffffffffa0000000 on x86_64).
-	 * I think this is required by tools like gdb. So same physical
-	 * memory will be mapped in two ELF headers. One will contain kernel
-	 * text virtual addresses and other will have __va(physical) addresses.
-	 */
-
-	nr_phdr++;
-	elf_sz = sizeof(Elf64_Ehdr) + nr_phdr * sizeof(Elf64_Phdr);
-	elf_sz = ALIGN(elf_sz, ELF_CORE_HEADER_ALIGN);
-
-	buf = vzalloc(elf_sz);
-	if (!buf)
-		return -ENOMEM;
-
-	ehdr = (Elf64_Ehdr *)buf;
-	phdr = (Elf64_Phdr *)(ehdr + 1);
-	memcpy(ehdr->e_ident, ELFMAG, SELFMAG);
-	ehdr->e_ident[EI_CLASS] = ELFCLASS64;
-	ehdr->e_ident[EI_DATA] = ELFDATA2LSB;
-	ehdr->e_ident[EI_VERSION] = EV_CURRENT;
-	ehdr->e_ident[EI_OSABI] = ELF_OSABI;
-	memset(ehdr->e_ident + EI_PAD, 0, EI_NIDENT - EI_PAD);
-	ehdr->e_type = ET_CORE;
-	ehdr->e_machine = ELF_ARCH;
-	ehdr->e_version = EV_CURRENT;
-	ehdr->e_phoff = sizeof(Elf64_Ehdr);
-	ehdr->e_ehsize = sizeof(Elf64_Ehdr);
-	ehdr->e_phentsize = sizeof(Elf64_Phdr);
-
-	/* Prepare one phdr of type PT_NOTE for each present CPU */
-	for_each_present_cpu(cpu) {
-		phdr->p_type = PT_NOTE;
-		notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
-		phdr->p_offset = phdr->p_paddr = notes_addr;
-		phdr->p_filesz = phdr->p_memsz = sizeof(note_buf_t);
-		(ehdr->e_phnum)++;
-		phdr++;
-	}
-
-	/* Prepare one PT_NOTE header for vmcoreinfo */
-	phdr->p_type = PT_NOTE;
-	phdr->p_offset = phdr->p_paddr = paddr_vmcoreinfo_note();
-	phdr->p_filesz = phdr->p_memsz = VMCOREINFO_NOTE_SIZE;
-	(ehdr->e_phnum)++;
-	phdr++;
-
-	/* Prepare PT_LOAD type program header for kernel text region */
-	if (need_kernel_map) {
-		phdr->p_type = PT_LOAD;
-		phdr->p_flags = PF_R|PF_W|PF_X;
-		phdr->p_vaddr = (unsigned long) _text;
-		phdr->p_filesz = phdr->p_memsz = _end - _text;
-		phdr->p_offset = phdr->p_paddr = __pa_symbol(_text);
-		ehdr->e_phnum++;
-		phdr++;
-	}
-
-	/* Go through all the ranges in mem->ranges[] and prepare phdr */
-	for (i = 0; i < mem->nr_ranges; i++) {
-		mstart = mem->ranges[i].start;
-		mend = mem->ranges[i].end;
-
-		phdr->p_type = PT_LOAD;
-		phdr->p_flags = PF_R|PF_W|PF_X;
-		phdr->p_offset  = mstart;
-
-		phdr->p_paddr = mstart;
-		phdr->p_vaddr = (unsigned long) __va(mstart);
-		phdr->p_filesz = phdr->p_memsz = mend - mstart + 1;
-		phdr->p_align = 0;
-		ehdr->e_phnum++;
-		pr_debug("Crash PT_LOAD ELF header. phdr=%p vaddr=0x%llx, paddr=0x%llx, sz=0x%llx e_phnum=%d p_offset=0x%llx\n",
-			phdr, phdr->p_vaddr, phdr->p_paddr, phdr->p_filesz,
-			ehdr->e_phnum, phdr->p_offset);
-		phdr++;
-	}
-
-	*addr = buf;
-	*sz = elf_sz;
-	return 0;
-}
-- 
2.31.1



* [PATCH v18 2/7] crash: prototype change for crash_prepare_elf64_headers()
From: Eric DeVolder @ 2023-01-31 22:42 UTC
  To: linux-kernel, x86, kexec, ebiederm, dyoung, bhe, vgoyal
  Cc: tglx, mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky, robh,
	efault, rppt, david, sourabhjain, konrad.wilk, boris.ostrovsky,
	eric.devolder

From within crash_prepare_elf64_headers() there is a need to
reference the struct kimage hotplug members. As such, this
change passes the struct kimage as a parameter to
crash_prepare_elf64_headers(). The hotplug members are added
in "crash: add generic infrastructure for crash hotplug support".

This is preparation for a later patch; no functionality change.
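
To illustrate the intent (the hotplug members arrive in a later patch;
hp_action comes from this series' changelogs, while
KEXEC_CRASH_HP_REMOVE_CPU and offlinecpu are hypothetical names used
only for this sketch):

  /* Sketch: with the image available, the per-CPU PT_NOTE loop in
   * crash_prepare_elf64_headers() can consult the hotplug state,
   * e.g. to skip a CPU that is being hot-removed.
   */
  static bool crash_skip_cpu_note(struct kimage *image, unsigned int cpu)
  {
          return image &&
                 image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU &&
                 cpu == image->offlinecpu;
  }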

Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Acked-by: Baoquan He <bhe@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 arch/arm64/kernel/machine_kexec_file.c | 4 ++--
 arch/powerpc/kexec/file_load_64.c      | 2 +-
 arch/riscv/kernel/elf_kexec.c          | 7 ++++---
 arch/x86/kernel/crash.c                | 2 +-
 include/linux/kexec.h                  | 7 +++++--
 kernel/crash_core.c                    | 4 ++--
 6 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/kernel/machine_kexec_file.c b/arch/arm64/kernel/machine_kexec_file.c
index a11a6e14ba89..2f7b773a83bb 100644
--- a/arch/arm64/kernel/machine_kexec_file.c
+++ b/arch/arm64/kernel/machine_kexec_file.c
@@ -39,7 +39,7 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image)
 	return kexec_image_post_load_cleanup_default(image);
 }
 
-static int prepare_elf_headers(void **addr, unsigned long *sz)
+static int prepare_elf_headers(struct kimage *image, void **addr, unsigned long *sz)
 {
 	struct crash_mem *cmem;
 	unsigned int nr_ranges;
@@ -74,7 +74,7 @@ static int prepare_elf_headers(void **addr, unsigned long *sz)
 			goto out;
 	}
 
-	ret = crash_prepare_elf64_headers(cmem, true, addr, sz);
+	ret = crash_prepare_elf64_headers(image, cmem, true, addr, sz);
 
 out:
 	kfree(cmem);
diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c
index af8854f9eae3..e51d8059535b 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -799,7 +799,7 @@ static int load_elfcorehdr_segment(struct kimage *image, struct kexec_buf *kbuf)
 		goto out;
 
 	/* Setup elfcorehdr segment */
-	ret = crash_prepare_elf64_headers(cmem, false, &headers, &headers_sz);
+	ret = crash_prepare_elf64_headers(image, cmem, false, &headers, &headers_sz);
 	if (ret) {
 		pr_err("Failed to prepare elf headers for the core\n");
 		goto out;
diff --git a/arch/riscv/kernel/elf_kexec.c b/arch/riscv/kernel/elf_kexec.c
index 5372b708fae2..8bb2233bd5bb 100644
--- a/arch/riscv/kernel/elf_kexec.c
+++ b/arch/riscv/kernel/elf_kexec.c
@@ -130,7 +130,8 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
 	return 0;
 }
 
-static int prepare_elf_headers(void **addr, unsigned long *sz)
+static int prepare_elf_headers(struct kimage *image,
+	void **addr, unsigned long *sz)
 {
 	struct crash_mem *cmem;
 	unsigned int nr_ranges;
@@ -152,7 +153,7 @@ static int prepare_elf_headers(void **addr, unsigned long *sz)
 	/* Exclude crashkernel region */
 	ret = crash_exclude_mem_range(cmem, crashk_res.start, crashk_res.end);
 	if (!ret)
-		ret = crash_prepare_elf64_headers(cmem, true, addr, sz);
+		ret = crash_prepare_elf64_headers(image, cmem, true, addr, sz);
 
 out:
 	kfree(cmem);
@@ -224,7 +225,7 @@ static void *elf_kexec_load(struct kimage *image, char *kernel_buf,
 
 	/* Add elfcorehdr */
 	if (image->type == KEXEC_TYPE_CRASH) {
-		ret = prepare_elf_headers(&headers, &headers_sz);
+		ret = prepare_elf_headers(image, &headers, &headers_sz);
 		if (ret) {
 			pr_err("Preparing elf core header failed\n");
 			goto out;
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 305514431f26..8a9bc9807813 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -265,7 +265,7 @@ static int prepare_elf_headers(struct kimage *image, void **addr,
 		goto out;
 
 	/* By default prepare 64bit headers */
-	ret =  crash_prepare_elf64_headers(cmem, IS_ENABLED(CONFIG_X86_64), addr, sz);
+	ret =  crash_prepare_elf64_headers(image, cmem, IS_ENABLED(CONFIG_X86_64), addr, sz);
 
 out:
 	vfree(cmem);
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 582ea213467a..27ef420c7a45 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -117,8 +117,11 @@ struct crash_mem {
 extern int crash_exclude_mem_range(struct crash_mem *mem,
 				   unsigned long long mstart,
 				   unsigned long long mend);
-extern int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map,
-				       void **addr, unsigned long *sz);
+extern int crash_prepare_elf64_headers(struct kimage *image,
+				   struct crash_mem *mem,
+				   int need_kernel_map,
+				   void **addr,
+				   unsigned long *sz);
 
 #ifdef CONFIG_KEXEC_FILE
 struct purgatory_info {
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 8a439b6d723b..a3b7b60b63f1 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -315,8 +315,8 @@ static int __init parse_crashkernel_dummy(char *arg)
 }
 early_param("crashkernel", parse_crashkernel_dummy);
 
-int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map,
-			  void **addr, unsigned long *sz)
+int crash_prepare_elf64_headers(struct kimage *image, struct crash_mem *mem,
+			  int need_kernel_map, void **addr, unsigned long *sz)
 {
 	Elf64_Ehdr *ehdr;
 	Elf64_Phdr *phdr;
-- 
2.31.1



* [PATCH v18 3/7] crash: add generic infrastructure for crash hotplug support
  2023-01-31 22:42 ` Eric DeVolder
@ 2023-01-31 22:42   ` Eric DeVolder
  -1 siblings, 0 replies; 70+ messages in thread
From: Eric DeVolder @ 2023-01-31 22:42 UTC (permalink / raw)
  To: linux-kernel, x86, kexec, ebiederm, dyoung, bhe, vgoyal
  Cc: tglx, mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky, robh,
	efault, rppt, david, sourabhjain, konrad.wilk, boris.ostrovsky,
	eric.devolder

To support crash hotplug, a mechanism is needed to update the crash
elfcorehdr upon CPU or memory changes (e.g. hot un/plug or
off/onlining).

To track CPU changes, callbacks are registered with the cpuhp
mechanism via cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN). The
crash hotplug elfcorehdr update has no explicit ordering requirement
(relative to other cpuhp states), and so meets the criteria for
using CPUHP_BP_PREPARE_DYN. CPUHP_BP_PREPARE_DYN is a dynamic state
and avoids the need to introduce a new state for crash hotplug.
Also, it is the last state in the PREPARE group, just prior to the
STARTING group, which is very close to the CPU starting up in a
plug/online situation, or stopping in an unplug/offline situation.
This minimizes the window of time during an actual plug/online or
unplug/offline situation in which the elfcorehdr would be
inaccurate.

Note that when a CPU is being unplugged/offlined, it is still in
for_each_present_cpu() during the regeneration of the elfcorehdr
(the CPU does not exit the present list until cpuhp state processing
reaches CPUHP_OFFLINE). Thus the soon-to-be-offlined CPU must be
explicitly checked for and excluded. See patch 'kexec: exclude hot
remove cpu from elfcorehdr notes'.

To track memory changes, a notifier is registered to capture the
memory block MEM_ONLINE and MEM_OFFLINE events via
register_memory_notifier().

The CPU callbacks and memory notifier invoke handle_hotplug_event(),
which performs the needed housekeeping and then dispatches the event
to the architecture-specific arch_crash_handle_hotplug_event() to
update the elfcorehdr with the current state of CPUs and memory.
The kexec_lock is held throughout the update.
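
For context, a minimal sketch of how an architecture opts in,
overriding the #ifndef'd no-op stub added to <linux/kexec.h>
(illustrative only; the names mirror this series, and patch
'x86/crash: add x86 crash hotplug support' provides the real x86
implementation):

   /* arch/<arch>/include/asm/kexec.h */
   void arch_crash_handle_hotplug_event(struct kimage *image);
   #define arch_crash_handle_hotplug_event arch_crash_handle_hotplug_event

   /* arch/<arch>/kernel/crash.c (skeleton) */
   void arch_crash_handle_hotplug_event(struct kimage *image)
   {
           /* image->hp_action identifies the CPU or memory change;
            * regenerate the elfcorehdr here and write it over the
            * old one in the reserved crash kernel region.
            */
   }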

Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Acked-by: Baoquan He <bhe@redhat.com>
---
 include/linux/crash_core.h |   9 +++
 include/linux/kexec.h      |  12 ++++
 kernel/crash_core.c        | 139 +++++++++++++++++++++++++++++++++++++
 3 files changed, 160 insertions(+)

diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
index de62a722431e..ed868d237c07 100644
--- a/include/linux/crash_core.h
+++ b/include/linux/crash_core.h
@@ -84,4 +84,13 @@ int parse_crashkernel_high(char *cmdline, unsigned long long system_ram,
 int parse_crashkernel_low(char *cmdline, unsigned long long system_ram,
 		unsigned long long *crash_size, unsigned long long *crash_base);
 
+#define KEXEC_CRASH_HP_NONE			0
+#define KEXEC_CRASH_HP_REMOVE_CPU		1
+#define KEXEC_CRASH_HP_ADD_CPU			2
+#define KEXEC_CRASH_HP_REMOVE_MEMORY		3
+#define KEXEC_CRASH_HP_ADD_MEMORY		4
+#define KEXEC_CRASH_HP_INVALID_CPU		-1U
+
+struct kimage;
+
 #endif /* LINUX_CRASH_CORE_H */
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 27ef420c7a45..a52624ae4452 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -33,6 +33,7 @@ extern note_buf_t __percpu *crash_notes;
 #include <linux/compat.h>
 #include <linux/ioport.h>
 #include <linux/module.h>
+#include <linux/highmem.h>
 #include <asm/kexec.h>
 
 /* Verify architecture specific macros are defined */
@@ -371,6 +372,13 @@ struct kimage {
 	struct purgatory_info purgatory_info;
 #endif
 
+#ifdef CONFIG_CRASH_HOTPLUG
+	int hp_action;
+	unsigned int offlinecpu;
+	bool elfcorehdr_index_valid;
+	int elfcorehdr_index;
+#endif
+
 #ifdef CONFIG_IMA_KEXEC
 	/* Virtual address of IMA measurement buffer for kexec syscall */
 	void *ima_buffer;
@@ -500,6 +508,10 @@ static inline int arch_kexec_post_alloc_pages(void *vaddr, unsigned int pages, g
 static inline void arch_kexec_pre_free_pages(void *vaddr, unsigned int pages) { }
 #endif
 
+#ifndef arch_crash_handle_hotplug_event
+static inline void arch_crash_handle_hotplug_event(struct kimage *image) { }
+#endif
+
 #else /* !CONFIG_KEXEC_CORE */
 struct pt_regs;
 struct task_struct;
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index a3b7b60b63f1..5545de4597d0 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -11,6 +11,8 @@
 #include <linux/vmalloc.h>
 #include <linux/sizes.h>
 #include <linux/kexec.h>
+#include <linux/memory.h>
+#include <linux/cpuhotplug.h>
 
 #include <asm/page.h>
 #include <asm/sections.h>
@@ -18,6 +20,7 @@
 #include <crypto/sha1.h>
 
 #include "kallsyms_internal.h"
+#include "kexec_internal.h"
 
 /* vmcoreinfo stuff */
 unsigned char *vmcoreinfo_data;
@@ -697,3 +700,139 @@ static int __init crash_save_vmcoreinfo_init(void)
 }
 
 subsys_initcall(crash_save_vmcoreinfo_init);
+
+#ifdef CONFIG_CRASH_HOTPLUG
+#undef pr_fmt
+#define pr_fmt(fmt) "crash hp: " fmt
+/*
+ * To accurately reflect hot un/plug changes of CPU and memory resources
+ * (including onlining and offlining of those resources), the elfcorehdr
+ * (which is passed to the crash kernel via the elfcorehdr= parameter)
+ * must be updated with the new list of CPUs and memory regions.
+ *
+ * In order to make changes to elfcorehdr, two conditions are needed:
+ * First, the segment containing the elfcorehdr must be large enough
+ * to permit a growing number of resources; the elfcorehdr memory size
+ * is based on NR_CPUS_DEFAULT and CRASH_MAX_MEMORY_RANGES.
+ * Second, purgatory must explicitly exclude the elfcorehdr from the
+ * list of segments it checks (since the elfcorehdr changes and thus
+ * would require an update to purgatory itself to update the digest).
+ */
+static void handle_hotplug_event(unsigned int hp_action, unsigned int cpu)
+{
+	/* Obtain lock while changing crash information */
+	if (kexec_trylock()) {
+
+		/* Check kdump is loaded */
+		if (kexec_crash_image) {
+			struct kimage *image = kexec_crash_image;
+
+			if (hp_action == KEXEC_CRASH_HP_ADD_CPU ||
+				hp_action == KEXEC_CRASH_HP_REMOVE_CPU)
+				pr_debug("hp_action %u, cpu %u\n", hp_action, cpu);
+			else
+				pr_debug("hp_action %u\n", hp_action);
+
+			/*
+			 * When the struct kimage is allocated, it is wiped to zero, so
+			 * the elfcorehdr_index_valid defaults to false. Find the
+			 * segment containing the elfcorehdr, if not already found.
+			 * This works for both the kexec_load and kexec_file_load paths.
+			 */
+			if (!image->elfcorehdr_index_valid) {
+				unsigned long mem;
+				unsigned char *ptr;
+				unsigned int n;
+
+				for (n = 0; n < image->nr_segments; n++) {
+					mem = image->segment[n].mem;
+					ptr = kmap_local_page(pfn_to_page(mem >> PAGE_SHIFT));
+					if (ptr) {
+						/* The segment containing elfcorehdr */
+						if (memcmp(ptr, ELFMAG, SELFMAG) == 0) {
+							image->elfcorehdr_index = (int)n;
+							image->elfcorehdr_index_valid = true;
+						}
+						kunmap_local(ptr);
+					}
+				}
+			}
+
+			if (!image->elfcorehdr_index_valid) {
+				pr_err("unable to locate elfcorehdr segment\n");
+				goto out;
+			}
+
+			/* Needed in order for the segments to be updated */
+			arch_kexec_unprotect_crashkres();
+
+			/* Differentiate between normal load and hotplug update */
+			image->hp_action = hp_action;
+
+			/* Now invoke arch-specific update handler */
+			arch_crash_handle_hotplug_event(image);
+
+			/* No longer handling a hotplug event */
+			image->hp_action = KEXEC_CRASH_HP_NONE;
+
+			/* Change back to read-only */
+			arch_kexec_protect_crashkres();
+		}
+
+out:
+		/* Release lock now that update complete */
+		kexec_unlock();
+	}
+}
+
+static int crash_memhp_notifier(struct notifier_block *nb, unsigned long val, void *v)
+{
+	switch (val) {
+	case MEM_ONLINE:
+		handle_hotplug_event(KEXEC_CRASH_HP_ADD_MEMORY,
+			KEXEC_CRASH_HP_INVALID_CPU);
+		break;
+
+	case MEM_OFFLINE:
+		handle_hotplug_event(KEXEC_CRASH_HP_REMOVE_MEMORY,
+			KEXEC_CRASH_HP_INVALID_CPU);
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block crash_memhp_nb = {
+	.notifier_call = crash_memhp_notifier,
+	.priority = 0
+};
+
+static int crash_cpuhp_online(unsigned int cpu)
+{
+	handle_hotplug_event(KEXEC_CRASH_HP_ADD_CPU, cpu);
+	return 0;
+}
+
+static int crash_cpuhp_offline(unsigned int cpu)
+{
+	handle_hotplug_event(KEXEC_CRASH_HP_REMOVE_CPU, cpu);
+	return 0;
+}
+
+static int __init crash_hotplug_init(void)
+{
+	int result = 0;
+
+	if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG))
+		register_memory_notifier(&crash_memhp_nb);
+
+	if (IS_ENABLED(CONFIG_HOTPLUG_CPU))
+		result = cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN,
+						   "crash/cpuhp",
+						   crash_cpuhp_online,
+						   crash_cpuhp_offline);
+
+	return result;
+}
+
+subsys_initcall(crash_hotplug_init);
+#endif
-- 
2.31.1


* [PATCH v18 4/7] kexec: exclude elfcorehdr from the segment digest
  2023-01-31 22:42 ` Eric DeVolder
@ 2023-01-31 22:42   ` Eric DeVolder
  -1 siblings, 0 replies; 70+ messages in thread
From: Eric DeVolder @ 2023-01-31 22:42 UTC (permalink / raw)
  To: linux-kernel, x86, kexec, ebiederm, dyoung, bhe, vgoyal
  Cc: tglx, mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky, robh,
	efault, rppt, david, sourabhjain, konrad.wilk, boris.ostrovsky,
	eric.devolder

When a crash kernel is loaded via the kexec_file_load() syscall, the
kernel places the various segments (i.e. crash kernel, crash initrd,
boot_params, elfcorehdr, purgatory, etc.) in memory. For those
architectures that utilize purgatory, a hash digest of the segments
is calculated for integrity checking. This digest is embedded into
the purgatory image prior to placing purgatory in memory.

This patchset updates the elfcorehdr on CPU or memory changes.
However, changes to the elfcorehdr in turn cause purgatory
integrity checking to fail (at crash time, so no vmcore is created).
Therefore, this patch explicitly excludes the elfcorehdr segment
from the list of segments used to create the digest. This permits
updates to the elfcorehdr in response to CPU or memory changes, and
avoids the need to recompute the hash digest and reload purgatory.
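
For illustration, a simplified sketch of the purgatory-side check
(modeled on the existing x86 purgatory sha256 verification; names
and details here are illustrative, not verbatim purgatory code). Any
segment whose contents may legitimately change after load must be
left out of the recorded regions, or this comparison fails:

   static int verify_sha256_digest(void)
   {
           struct sha_region *ptr, *end;
           struct sha256_state ctx;
           u8 digest[SHA256_DIGEST_SIZE];

           sha256_init(&ctx);
           end = &sha_regions[ARRAY_SIZE(sha_regions)];
           /* elfcorehdr is simply never recorded in sha_regions */
           for (ptr = sha_regions; ptr < end; ptr++)
                   sha256_update(&ctx, (u8 *)(uintptr_t)ptr->start,
                                 ptr->len);
           sha256_final(&ctx, digest);
           /* non-zero means integrity failure: no capture kernel boot */
           return memcmp(digest, sha256_digest, sizeof(digest)) ? 1 : 0;
   }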

Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Acked-by: Baoquan He <bhe@redhat.com>
---
 kernel/kexec_file.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index ead3443e7f9d..2f3b20b52e5d 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -723,6 +723,12 @@ static int kexec_calculate_store_digests(struct kimage *image)
 	for (j = i = 0; i < image->nr_segments; i++) {
 		struct kexec_segment *ksegment;
 
+#ifdef CONFIG_CRASH_HOTPLUG
+		/* Exclude elfcorehdr segment to allow future changes via hotplug */
+		if (image->elfcorehdr_index_valid && (i == image->elfcorehdr_index))
+			continue;
+#endif
+
 		ksegment = &image->segment[i];
 		/*
 		 * Skip purgatory as it will be modified once we put digest
-- 
2.31.1


* [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
  2023-01-31 22:42 ` Eric DeVolder
@ 2023-01-31 22:42   ` Eric DeVolder
  -1 siblings, 0 replies; 70+ messages in thread
From: Eric DeVolder @ 2023-01-31 22:42 UTC (permalink / raw)
  To: linux-kernel, x86, kexec, ebiederm, dyoung, bhe, vgoyal
  Cc: tglx, mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky, robh,
	efault, rppt, david, sourabhjain, konrad.wilk, boris.ostrovsky,
	eric.devolder

In crash_prepare_elf64_headers(), for_each_present_cpu() is used
to create the new elfcorehdr. When handling CPU hot unplug/offline
events, the CPU is still in the present list (not until cpuhp state
processing reaches CPUHP_OFFLINE does the CPU exit the list). Thus
the CPU must be explicitly excluded when building the new list of
CPUs.

This change records, in handle_hotplug_event(), the CPU to be
excluded, and adds the corresponding check to
crash_prepare_elf64_headers().
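
For illustration, a sketch of the ordering during teardown (this
restates the above; it is not code from this patch):

   /* cpuhp teardown during an unplug/offline event:
    *
    *  CPUHP_ONLINE           teardown begins, cpu fully present
    *    ...
    *  CPUHP_BP_PREPARE_DYN   crash_cpuhp_offline() runs here; the
    *                         cpu is still in for_each_present_cpu(),
    *                         so image->offlinecpu marks it for
    *                         exclusion from the new elfcorehdr
    *    ...
    *  CPUHP_OFFLINE          only now does the cpu exit the list
    */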

Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Acked-by: Baoquan He <bhe@redhat.com>
---
 kernel/crash_core.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 5545de4597d0..d985d334fae4 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -366,6 +366,14 @@ int crash_prepare_elf64_headers(struct kimage *image, struct crash_mem *mem,
 
 	/* Prepare one phdr of type PT_NOTE for each present CPU */
 	for_each_present_cpu(cpu) {
+#ifdef CONFIG_CRASH_HOTPLUG
+		if (IS_ENABLED(CONFIG_HOTPLUG_CPU)) {
+			/* Skip the soon-to-be offlined cpu */
+			if ((image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU) &&
+				(cpu == image->offlinecpu))
+				continue;
+		}
+#endif
 		phdr->p_type = PT_NOTE;
 		notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
 		phdr->p_offset = phdr->p_paddr = notes_addr;
@@ -769,6 +777,14 @@ static void handle_hotplug_event(unsigned int hp_action, unsigned int cpu)
 			/* Differentiate between normal load and hotplug update */
 			image->hp_action = hp_action;
 
+			/*
+			 * Record which CPU is being unplugged/offlined, so that it
+			 * is explicitly excluded in crash_prepare_elf64_headers().
+			 */
+			image->offlinecpu =
+				(hp_action == KEXEC_CRASH_HP_REMOVE_CPU) ?
+					cpu : KEXEC_CRASH_HP_INVALID_CPU;
+
 			/* Now invoke arch-specific update handler */
 			arch_crash_handle_hotplug_event(image);
 
-- 
2.31.1


* [PATCH v18 6/7] crash: memory and cpu hotplug sysfs attributes
  2023-01-31 22:42 ` Eric DeVolder
@ 2023-01-31 22:42   ` Eric DeVolder
  -1 siblings, 0 replies; 70+ messages in thread
From: Eric DeVolder @ 2023-01-31 22:42 UTC (permalink / raw)
  To: linux-kernel, x86, kexec, ebiederm, dyoung, bhe, vgoyal
  Cc: tglx, mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky, robh,
	efault, rppt, david, sourabhjain, konrad.wilk, boris.ostrovsky,
	eric.devolder

This introduces the crash_hotplug attribute for memory and CPUs,
for use by userspace. These attributes directly facilitate the udev
rule that manages userspace reloading of the crash kernel upon hot
un/plug changes.

For memory, this changeset introduces the crash_hotplug attribute
to the /sys/devices/system/memory directory. For example:

 # udevadm info --attribute-walk /sys/devices/system/memory/memory81
  looking at device '/devices/system/memory/memory81':
    KERNEL=="memory81"
    SUBSYSTEM=="memory"
    DRIVER==""
    ATTR{online}=="1"
    ATTR{phys_device}=="0"
    ATTR{phys_index}=="00000051"
    ATTR{removable}=="1"
    ATTR{state}=="online"
    ATTR{valid_zones}=="Movable"

  looking at parent device '/devices/system/memory':
    KERNELS=="memory"
    SUBSYSTEMS==""
    DRIVERS==""
    ATTRS{auto_online_blocks}=="offline"
    ATTRS{block_size_bytes}=="8000000"
    ATTRS{crash_hotplug}=="1"

For CPUs, this changeset introduces the crash_hotplug attribute
to the /sys/devices/system/cpu directory. For example:

 # udevadm info --attribute-walk /sys/devices/system/cpu/cpu0
  looking at device '/devices/system/cpu/cpu0':
    KERNEL=="cpu0"
    SUBSYSTEM=="cpu"
    DRIVER=="processor"
    ATTR{crash_notes}=="277c38600"
    ATTR{crash_notes_size}=="368"
    ATTR{online}=="1"

  looking at parent device '/devices/system/cpu':
    KERNELS=="cpu"
    SUBSYSTEMS==""
    DRIVERS==""
    ATTRS{crash_hotplug}=="1"
    ATTRS{isolated}==""
    ATTRS{kernel_max}=="8191"
    ATTRS{nohz_full}=="  (null)"
    ATTRS{offline}=="4-7"
    ATTRS{online}=="0-3"
    ATTRS{possible}=="0-7"
    ATTRS{present}=="0-3"

With these sysfs attributes in place, it is possible to efficiently
instruct the udev rule to skip crash kernel reloading.

For example, the following is the proposed udev rule change for RHEL
system 98-kexec.rules (as the first lines of the rule file):

 # The kernel handles updates to crash elfcorehdr for cpu and memory changes
 SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
 SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"

When examined in the context of 98-kexec.rules, the above change
tests whether crash_hotplug is set and, if so, skips the
userspace-initiated unload-then-reload of the crash kernel.

CPU and memory checks are separated in accordance with the
CONFIG_HOTPLUG_CPU and CONFIG_MEMORY_HOTPLUG kernel config options.
If an architecture supports, for example, memory hotplug but not
CPU hotplug, then the /sys/devices/system/memory/crash_hotplug
attribute file is present, but the /sys/devices/system/cpu/crash_hotplug
attribute file is NOT present. Thus the udev rule skips userspace
processing of memory hot un/plug events, but evaluates false for
CPU events, allowing userspace to process CPU hot un/plug events
(i.e. the unload-then-reload of the kdump capture kernel).

Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Acked-by: Baoquan He <bhe@redhat.com>
---
 .../admin-guide/mm/memory-hotplug.rst          |  8 ++++++++
 Documentation/core-api/cpu_hotplug.rst         | 18 ++++++++++++++++++
 drivers/base/cpu.c                             | 14 ++++++++++++++
 drivers/base/memory.c                          | 13 +++++++++++++
 include/linux/kexec.h                          |  8 ++++++++
 5 files changed, 61 insertions(+)

diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst
index a3c9e8ad8fa0..15fd1751a63c 100644
--- a/Documentation/admin-guide/mm/memory-hotplug.rst
+++ b/Documentation/admin-guide/mm/memory-hotplug.rst
@@ -293,6 +293,14 @@ The following files are currently defined:
 		       Availability depends on the CONFIG_ARCH_MEMORY_PROBE
 		       kernel configuration option.
 ``uevent``	       read-write: generic udev file for device subsystems.
+``crash_hotplug``      read-only: when changes to the system memory map
+		       occur due to hot un/plug of memory, this file contains
+		       '1' if the kernel updates the kdump capture kernel memory
+		       map itself (via elfcorehdr), or '0' if userspace must update
+		       the kdump capture kernel memory map.
+
+		       Availability depends on the CONFIG_MEMORY_HOTPLUG kernel
+		       configuration option.
 ====================== =========================================================
 
 .. note::
diff --git a/Documentation/core-api/cpu_hotplug.rst b/Documentation/core-api/cpu_hotplug.rst
index f75778d37488..0c8dc3fe5f94 100644
--- a/Documentation/core-api/cpu_hotplug.rst
+++ b/Documentation/core-api/cpu_hotplug.rst
@@ -750,6 +750,24 @@ will receive all events. A script like::
 
 can process the event further.
 
+When changes to the CPUs in the system occur, the sysfs file
+/sys/devices/system/cpu/crash_hotplug contains '1' if the kernel
+updates the kdump capture kernel list of CPUs itself (via elfcorehdr),
+or '0' if userspace must update the kdump capture kernel list of CPUs.
+
+The availability depends on the CONFIG_HOTPLUG_CPU kernel configuration
+option.
+
+To skip userspace processing of CPU hot un/plug events for kdump
+(i.e. the unload-then-reload to obtain a current list of CPUs), this sysfs
+file can be used in a udev rule as follows:
+
+ SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
+
+For a cpu hot un/plug event, if the architecture supports kernel updates
+of the elfcorehdr (which contains the list of CPUs), then the rule skips
+the unload-then-reload of the kdump capture kernel.
+
 Kernel Inline Documentations Reference
 ======================================
 
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 4c98849577d4..fedbf87f9d13 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -293,6 +293,17 @@ static ssize_t print_cpus_nohz_full(struct device *dev,
 static DEVICE_ATTR(nohz_full, 0444, print_cpus_nohz_full, NULL);
 #endif
 
+#ifdef CONFIG_HOTPLUG_CPU
+#include <linux/kexec.h>
+static ssize_t crash_hotplug_show(struct device *dev,
+				     struct device_attribute *attr,
+				     char *buf)
+{
+	return sprintf(buf, "%d\n", crash_hotplug_cpu_support());
+}
+static DEVICE_ATTR_ADMIN_RO(crash_hotplug);
+#endif
+
 static void cpu_device_release(struct device *dev)
 {
 	/*
@@ -469,6 +480,9 @@ static struct attribute *cpu_root_attrs[] = {
 #ifdef CONFIG_NO_HZ_FULL
 	&dev_attr_nohz_full.attr,
 #endif
+#ifdef CONFIG_HOTPLUG_CPU
+	&dev_attr_crash_hotplug.attr,
+#endif
 #ifdef CONFIG_GENERIC_CPU_AUTOPROBE
 	&dev_attr_modalias.attr,
 #endif
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index fe98fb8d94e5..a3f37cb57d79 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -495,6 +495,16 @@ static ssize_t auto_online_blocks_store(struct device *dev,
 
 static DEVICE_ATTR_RW(auto_online_blocks);
 
+#ifdef CONFIG_MEMORY_HOTPLUG
+#include <linux/kexec.h>
+static ssize_t crash_hotplug_show(struct device *dev,
+				       struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%d\n", crash_hotplug_memory_support());
+}
+static DEVICE_ATTR_RO(crash_hotplug);
+#endif
+
 /*
  * Some architectures will have custom drivers to do this, and
  * will not need to do it from userspace.  The fake hot-add code
@@ -894,6 +904,9 @@ static struct attribute *memory_root_attrs[] = {
 
 	&dev_attr_block_size_bytes.attr,
 	&dev_attr_auto_online_blocks.attr,
+#ifdef CONFIG_MEMORY_HOTPLUG
+	&dev_attr_crash_hotplug.attr,
+#endif
 	NULL
 };
 
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index a52624ae4452..ef2b607fa105 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -512,6 +512,14 @@ static inline void arch_kexec_pre_free_pages(void *vaddr, unsigned int pages) {
 static inline void arch_crash_handle_hotplug_event(struct kimage *image) { }
 #endif
 
+#ifndef crash_hotplug_cpu_support
+static inline int crash_hotplug_cpu_support(void) { return 0; }
+#endif
+
+#ifndef crash_hotplug_memory_support
+static inline int crash_hotplug_memory_support(void) { return 0; }
+#endif
+
 #else /* !CONFIG_KEXEC_CORE */
 struct pt_regs;
 struct task_struct;
-- 
2.31.1


* [PATCH v18 7/7] x86/crash: add x86 crash hotplug support
  2023-01-31 22:42 ` Eric DeVolder
@ 2023-01-31 22:42   ` Eric DeVolder
  -1 siblings, 0 replies; 70+ messages in thread
From: Eric DeVolder @ 2023-01-31 22:42 UTC (permalink / raw)
  To: linux-kernel, x86, kexec, ebiederm, dyoung, bhe, vgoyal
  Cc: tglx, mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky, robh,
	efault, rppt, david, sourabhjain, konrad.wilk, boris.ostrovsky,
	eric.devolder

When CPU or memory is hot un/plugged, or off/onlined, the crash
elfcorehdr, which describes the CPUs and memory in the system,
must also be updated.

The segment containing the elfcorehdr is identified at run-time
in crash_core:handle_hotplug_event(), which works for both the
kexec_load() and kexec_file_load() syscalls. A new elfcorehdr
is generated from the available CPUs and memory into a buffer,
and then installed over the top of the existing elfcorehdr.

In the patch 'kexec: exclude elfcorehdr from the segment digest',
the need to update purgatory due to a change in the elfcorehdr was
eliminated. As a result, no changes to purgatory or boot_params are
needed (the elfcorehdr= kernel command line parameter pointer
remains unchanged and correct); only the elfcorehdr itself needs
updating.

To accommodate a growing number of resources via hotplug, the
elfcorehdr segment must be sufficiently large to accommodate
changes; see the CRASH_MAX_MEMORY_RANGES description. This applies
only to the kexec_file_load() syscall; for kexec_load(), userspace
will need to size the segment similarly.
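
As a worked example of the sizing arithmetic (illustrative; it
assumes NR_CPUS_DEFAULT of 8192, the largest x86_64 value, so
smaller configurations shrink accordingly):

   /* Worst-case phdr count and elfcorehdr buffer size:
    *
    *   pnum = 2          VMCOREINFO + kernel text map
    *        + 8192       one PT_NOTE per CPU (NR_CPUS_DEFAULT)
    *        + 8192       one PT_LOAD per range (CRASH_MAX_MEMORY_RANGES)
    *        = 16386      well under PN_XNUM (0xffff)
    *
    *   size = sizeof(Elf64_Ehdr) + pnum * sizeof(Elf64_Phdr)
    *        = 64 + 16386 * 56 = 917680 bytes, just under 1 MiB
    */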

To support the kexec_load() syscall in the absence of
kexec_file_load() syscall support, and with CONFIG_CRASH_HOTPLUG
enabled, prepare_elf_headers() and its dependents are moved outside
of CONFIG_KEXEC_FILE.

Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
---
 arch/x86/Kconfig             |  13 ++++
 arch/x86/include/asm/kexec.h |  15 +++++
 arch/x86/kernel/crash.c      | 122 +++++++++++++++++++++++++++++++++--
 3 files changed, 143 insertions(+), 7 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3604074a878b..2ca5e19b8f19 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2119,6 +2119,19 @@ config CRASH_DUMP
 	  (CONFIG_RELOCATABLE=y).
 	  For more details see Documentation/admin-guide/kdump/kdump.rst
 
+config CRASH_HOTPLUG
+	bool "Update the crash elfcorehdr on system configuration changes"
+	default n
+	depends on CRASH_DUMP && (HOTPLUG_CPU || MEMORY_HOTPLUG)
+	help
+	  Enable direct update to the crash elfcorehdr (which contains
+	  the list of CPUs and memory regions to be dumped upon a crash)
+	  in response to hot plug/unplug or online/offline of CPUs or
+	  memory. This is far more efficient than having userspace
+	  unload and reload the crash kernel for each change.
+
+	  If unsure, say Y.
+
 config KEXEC_JUMP
 	bool "kexec jump"
 	depends on KEXEC && HIBERNATION
diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index a3760ca796aa..1bc852ce347d 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -212,6 +212,21 @@ typedef void crash_vmclear_fn(void);
 extern crash_vmclear_fn __rcu *crash_vmclear_loaded_vmcss;
 extern void kdump_nmi_shootdown_cpus(void);
 
+#ifdef CONFIG_CRASH_HOTPLUG
+void arch_crash_handle_hotplug_event(struct kimage *image);
+#define arch_crash_handle_hotplug_event arch_crash_handle_hotplug_event
+
+#ifdef CONFIG_HOTPLUG_CPU
+static inline int crash_hotplug_cpu_support(void) { return 1; }
+#define crash_hotplug_cpu_support crash_hotplug_cpu_support
+#endif
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static inline int crash_hotplug_memory_support(void) { return 1; }
+#define crash_hotplug_memory_support crash_hotplug_memory_support
+#endif
+#endif
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_X86_KEXEC_H */
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 8a9bc9807813..5c9e01fe27f5 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -42,6 +42,21 @@
 #include <asm/crash.h>
 #include <asm/cmdline.h>
 
+/*
+ * For the kexec_file_load() syscall path, specify the maximum number of
+ * memory regions that the elfcorehdr buffer/segment can accommodate.
+ * These regions are obtained via walk_system_ram_res(); eg. the
+ * 'System RAM' entries in /proc/iomem.
+ * This value is combined with NR_CPUS_DEFAULT and multiplied by
+ * sizeof(Elf64_Phdr) to determine the final elfcorehdr memory buffer/
+ * segment size.
+ * The value 8192, for example, covers a (sparsely populated) 1TiB system
+ * consisting of 128MiB memblocks, while resulting in an elfcorehdr
+ * memory buffer/segment size under 1MiB. This represents a sane choice
+ * to accommodate both baremetal and virtual machine configurations.
+ */
+#define CRASH_MAX_MEMORY_RANGES 8192
+
 /* Used while preparing memory map entries for second kernel */
 struct crash_memmap_data {
 	struct boot_params *params;
@@ -173,8 +188,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
 	crash_save_cpu(regs, safe_smp_processor_id());
 }
 
-#ifdef CONFIG_KEXEC_FILE
-
 static int get_nr_ram_ranges_callback(struct resource *res, void *arg)
 {
 	unsigned int *nr_ranges = arg;
@@ -246,7 +259,7 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
 
 /* Prepare elf headers. Return addr and size */
 static int prepare_elf_headers(struct kimage *image, void **addr,
-					unsigned long *sz)
+					unsigned long *sz, unsigned long *nr_mem_ranges)
 {
 	struct crash_mem *cmem;
 	int ret;
@@ -264,6 +277,9 @@ static int prepare_elf_headers(struct kimage *image, void **addr,
 	if (ret)
 		goto out;
 
+	/* Return the computed number of memory ranges, for hotplug usage */
+	*nr_mem_ranges = cmem->nr_ranges;
+
 	/* By default prepare 64bit headers */
 	ret =  crash_prepare_elf64_headers(image, cmem, IS_ENABLED(CONFIG_X86_64), addr, sz);
 
@@ -272,6 +288,7 @@ static int prepare_elf_headers(struct kimage *image, void **addr,
 	return ret;
 }
 
+#ifdef CONFIG_KEXEC_FILE
 static int add_e820_entry(struct boot_params *params, struct e820_entry *entry)
 {
 	unsigned int nr_e820_entries;
@@ -386,18 +403,45 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
 int crash_load_segments(struct kimage *image)
 {
 	int ret;
+	unsigned long nr_mem_ranges;
 	struct kexec_buf kbuf = { .image = image, .buf_min = 0,
 				  .buf_max = ULONG_MAX, .top_down = false };
 
 	/* Prepare elf headers and add a segment */
-	ret = prepare_elf_headers(image, &kbuf.buffer, &kbuf.bufsz);
+	ret = prepare_elf_headers(image, &kbuf.buffer, &kbuf.bufsz, &nr_mem_ranges);
 	if (ret)
 		return ret;
 
-	image->elf_headers = kbuf.buffer;
-	image->elf_headers_sz = kbuf.bufsz;
+	image->elf_headers	= kbuf.buffer;
+	image->elf_headers_sz	= kbuf.bufsz;
+	kbuf.memsz		= kbuf.bufsz;
+
+	if (IS_ENABLED(CONFIG_CRASH_HOTPLUG)) {
+		/*
+		 * Ensure the elfcorehdr segment is large enough for hotplug changes.
+		 * Start with VMCOREINFO and kernel_map and maximum CPUs.
+		 */
+		unsigned long pnum = 2 + CONFIG_NR_CPUS_DEFAULT;
+
+		if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG))
+			pnum += CRASH_MAX_MEMORY_RANGES;
+		else
+			pnum += nr_mem_ranges;
+
+		if (pnum < (unsigned long)PN_XNUM) {
+			kbuf.memsz = pnum * sizeof(Elf64_Phdr);
+			kbuf.memsz += sizeof(Elf64_Ehdr);
+
+			image->elfcorehdr_index = image->nr_segments;
+			image->elfcorehdr_index_valid = true;
+
+			/* Mark as usable to crash kernel, else crash kernel fails on boot */
+			image->elf_headers_sz = kbuf.memsz;
+		} else {
+			pr_err("number of Phdrs %lu exceeds max\n", pnum);
+		}
+	}
 
-	kbuf.memsz = kbuf.bufsz;
 	kbuf.buf_align = ELF_CORE_HEADER_ALIGN;
 	kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
 	ret = kexec_add_buffer(&kbuf);
@@ -410,3 +454,67 @@ int crash_load_segments(struct kimage *image)
 	return ret;
 }
 #endif /* CONFIG_KEXEC_FILE */
+
+#ifdef CONFIG_CRASH_HOTPLUG
+
+#undef pr_fmt
+#define pr_fmt(fmt) "crash hp: " fmt
+
+/**
+ * arch_crash_handle_hotplug_event() - Handle hotplug elfcorehdr changes
+ * @image: the active struct kimage
+ *
+ * The new elfcorehdr is prepared in a kernel buffer, and then it is
+ * written on top of the existing/old elfcorehdr.
+ */
+void arch_crash_handle_hotplug_event(struct kimage *image)
+{
+	void *elfbuf = NULL, *old_elfcorehdr;
+	unsigned long nr_mem_ranges;
+	unsigned long mem, memsz;
+	unsigned long elfsz = 0;
+
+	/*
+	 * Create the new elfcorehdr reflecting the changes to CPU and/or
+	 * memory resources.
+	 */
+	if (prepare_elf_headers(image, &elfbuf, &elfsz, &nr_mem_ranges)) {
+		pr_err("unable to prepare elfcore headers\n");
+		goto out;
+	}
+
+	/*
+	 * Obtain address and size of the elfcorehdr segment, and
+	 * check it against the new elfcorehdr buffer.
+	 */
+	mem = image->segment[image->elfcorehdr_index].mem;
+	memsz = image->segment[image->elfcorehdr_index].memsz;
+	if (elfsz > memsz) {
+		pr_err("update elfcorehdr elfsz %lu > memsz %lu\n",
+			elfsz, memsz);
+		goto out;
+	}
+
+	/*
+	 * Copy new elfcorehdr over the old elfcorehdr at destination.
+	 */
+	old_elfcorehdr = kmap_local_page(pfn_to_page(mem >> PAGE_SHIFT));
+	if (!old_elfcorehdr) {
+		pr_err("updating elfcorehdr failed\n");
+		goto out;
+	}
+
+	/*
+	 * Temporarily invalidate the crash image while the
+	 * elfcorehdr is updated.
+	 */
+	xchg(&kexec_crash_image, NULL);
+	memcpy_flushcache(old_elfcorehdr, elfbuf, elfsz);
+	xchg(&kexec_crash_image, image);
+	kunmap_local(old_elfcorehdr);
+	pr_debug("updated elfcorehdr\n");
+
+out:
+	vfree(elfbuf);
+}
+#endif
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 70+ messages in thread
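
For a sense of scale, the worst-case arithmetic performed by the sizing logic
in crash_load_segments() above can be sketched as a standalone calculation.
This is illustrative only: the CPU count of 8192 is an assumption here
(CONFIG_NR_CPUS_DEFAULT depends on the kernel configuration; 8192 corresponds
to x86_64 MAXSMP), and the little program below is not part of the patch.

#include <elf.h>
#include <stdio.h>

/* Illustrative worst-case elfcorehdr sizing, mirroring crash_load_segments().
 * NR_CPUS_DEFAULT below is an assumed value, not taken from the patch.
 */
#define NR_CPUS_DEFAULT		8192
#define CRASH_MAX_MEMORY_RANGES	8192

int main(void)
{
	/* VMCOREINFO + kernel_map, one PT_NOTE per CPU, plus memory ranges */
	unsigned long pnum = 2 + NR_CPUS_DEFAULT + CRASH_MAX_MEMORY_RANGES;
	unsigned long memsz = pnum * sizeof(Elf64_Phdr) + sizeof(Elf64_Ehdr);

	/* pnum = 16386, well below PN_XNUM (0xffff); memsz = 16386 * 56 + 64
	 * = 917680 bytes on x86_64, i.e. just under 1 MiB, matching the
	 * CRASH_MAX_MEMORY_RANGES comment in crash.c.
	 */
	printf("phdrs=%lu, elfcorehdr segment=%lu bytes\n", pnum, memsz);
	return 0;
}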


* Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
  2023-01-31 22:42   ` Eric DeVolder
@ 2023-02-01 11:33     ` Thomas Gleixner
  -1 siblings, 0 replies; 70+ messages in thread
From: Thomas Gleixner @ 2023-02-01 11:33 UTC (permalink / raw)
  To: Eric DeVolder, linux-kernel, x86, kexec, ebiederm, dyoung, bhe, vgoyal
  Cc: mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky, robh,
	efault, rppt, david, sourabhjain, konrad.wilk, boris.ostrovsky,
	eric.devolder

Eric!

On Tue, Jan 31 2023 at 17:42, Eric DeVolder wrote:
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -366,6 +366,14 @@ int crash_prepare_elf64_headers(struct kimage *image, struct crash_mem *mem,
>  
>  	/* Prepare one phdr of type PT_NOTE for each present CPU */
>  	for_each_present_cpu(cpu) {
> +#ifdef CONFIG_CRASH_HOTPLUG
> +		if (IS_ENABLED(CONFIG_HOTPLUG_CPU)) {
> +			/* Skip the soon-to-be offlined cpu */
> +			if ((image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU) &&
> +				(cpu == image->offlinecpu))
> +				continue;
> +		}
> +#endif

I'm failing to see how the above is correct in any way. Look at the
following sequence of events:

     1) Offline CPU$N

        -> Prepare elf headers with CPU$N excluded

     2) Another hotplug operation != 'Online CPU$N'

        -> Prepare elf headers with CPU$N included

Also in case of loading the crash kernel in the situation where not all
present CPUs are online (think boot time SMT disable) then your
resulting crash image will contain all present CPUs and none of the
offline CPUs are excluded.

How does that make any sense at all?

This image->hp_action and image->offlinecpu dance is engineering
voodoo. You just can do:

        for_each_present_cpu(cpu) {
            if (!cpu_online(cpu))
            	continue;
            do_stuff(cpu);

which does the right thing in all situations and can be further
simplified to:

        for_each_online_cpu(cpu) {
            do_stuff(cpu);

without the need for ifdefs or whatever.

No?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 70+ messages in thread


* Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
  2023-02-01 11:33     ` Thomas Gleixner
@ 2023-02-06  8:12       ` Sourabh Jain
  -1 siblings, 0 replies; 70+ messages in thread
From: Sourabh Jain @ 2023-02-06  8:12 UTC (permalink / raw)
  To: Thomas Gleixner, Eric DeVolder, linux-kernel, x86, kexec,
	ebiederm, dyoung, bhe, vgoyal
  Cc: mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky, robh,
	efault, rppt, david, konrad.wilk, boris.ostrovsky

Hello Thomas,

On 01/02/23 17:03, Thomas Gleixner wrote:
> Eric!
>
> On Tue, Jan 31 2023 at 17:42, Eric DeVolder wrote:
>> --- a/kernel/crash_core.c
>> +++ b/kernel/crash_core.c
>> @@ -366,6 +366,14 @@ int crash_prepare_elf64_headers(struct kimage *image, struct crash_mem *mem,
>>   
>>   	/* Prepare one phdr of type PT_NOTE for each present CPU */
>>   	for_each_present_cpu(cpu) {
>> +#ifdef CONFIG_CRASH_HOTPLUG
>> +		if (IS_ENABLED(CONFIG_HOTPLUG_CPU)) {
>> +			/* Skip the soon-to-be offlined cpu */
>> +			if ((image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU) &&
>> +				(cpu == image->offlinecpu))
>> +				continue;
>> +		}
>> +#endif
> I'm failing to see how the above is correct in any way. Look at the
> following sequence of events:
>
>       1) Offline CPU$N
>
>          -> Prepare elf headers with CPU$N excluded
>
>       2) Another hotplug operation != 'Online CPU$N'
>
>          -> Prepare elf headers with CPU$N included
>
> Also in case of loading the crash kernel in the situation where not all
> present CPUs are online (think boot time SMT disable) then your
> resulting crash image will contain all present CPUs and none of the
> offline CPUs are excluded.
>
> How does that make any sense at all?
>
> This image->hp_action and image->offlinecpu dance is engineering
> voodoo. You just can do:
>
>          for_each_present_cpu(cpu) {
>              if (!cpu_online(cpu))
>              	continue;
>              do_stuff(cpu);
>
> which does the right thing in all situations and can be further
> simplified to:
>
>          for_each_online_cpu(cpu) {
>              do_stuff(cpu);

What will be the implication on x86 if we pack PT_NOTE for possible CPUs?

IIUC, on boot the crash notes are created for possible CPUs using pcpu_alloc,
and when the system is on the crash path the crash notes for online CPUs are
populated with the required data while the rest are left untouched.
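
As a minimal sketch, assuming crash_prepare_elf64_headers() keeps its existing
ehdr/phdr/notes_addr bookkeeping, the packing would amount to changing only the
iterator (this is my reading of the idea, not code from the series):

	/* Sketch: emit one PT_NOTE per *possible* CPU so the elfcorehdr never
	 * needs regenerating on CPU hotplug; the percpu crash notes buffers
	 * already exist for every possible CPU from boot.
	 */
	for_each_possible_cpu(cpu) {
		phdr->p_type = PT_NOTE;
		notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
		phdr->p_offset = phdr->p_paddr = notes_addr;
		phdr->p_filesz = phdr->p_memsz = sizeof(note_buf_t);
		(ehdr->e_phnum)++;
		phdr++;
	}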

And I think the /proc/vmcore generation in the kdump/second kernel, as well as
makedumpfile, takes care of empty crash notes belonging to offline CPUs.

Any thoughts?

Thanks,
Sourabh

^ permalink raw reply	[flat|nested] 70+ messages in thread


* Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
  2023-02-06  8:12       ` Sourabh Jain
@ 2023-02-06 13:03         ` Thomas Gleixner
  -1 siblings, 0 replies; 70+ messages in thread
From: Thomas Gleixner @ 2023-02-06 13:03 UTC (permalink / raw)
  To: Sourabh Jain, Eric DeVolder, linux-kernel, x86, kexec, ebiederm,
	dyoung, bhe, vgoyal
  Cc: mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky, robh,
	efault, rppt, david, konrad.wilk, boris.ostrovsky

On Mon, Feb 06 2023 at 13:42, Sourabh Jain wrote:
> On 01/02/23 17:03, Thomas Gleixner wrote:
>> Also in case of loading the crash kernel in the situation where not all
>> present CPUs are online (think boot time SMT disable) then your
>> resulting crash image will contain all present CPUs and none of the
>> offline CPUs are excluded.
>>
>> How does that make any sense at all?
>>
>> This image->hp_action and image->offlinecpu dance is engineering
>> voodoo. You just can do:
>>
>>          for_each_present_cpu(cpu) {
>>              if (!cpu_online(cpu))
>>              	continue;
>>              do_stuff(cpu);
>>
>> which does the right thing in all situations and can be further
>> simplified to:
>>
>>          for_each_online_cpu(cpu) {
>>              do_stuff(cpu);
>
> What will be the implication on x86 if we pack PT_NOTE for possible
> CPUs?

I don't know.

> IIUC, on boot the crash notes are create for possible CPUs using pcpu_alloc
> and when the system is on crash path the crash notes for online CPUs is
> populated with the required data and rest crash notes are untouched.

Which should be fine. That's a postprocessing problem, and it's
unclear to me from the changelogs what actual problem is being
solved here.

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 70+ messages in thread


* Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
  2023-02-01 11:33     ` Thomas Gleixner
@ 2023-02-07 17:23       ` Eric DeVolder
  -1 siblings, 0 replies; 70+ messages in thread
From: Eric DeVolder @ 2023-02-07 17:23 UTC (permalink / raw)
  To: Thomas Gleixner, linux-kernel, x86, kexec, ebiederm, dyoung, bhe, vgoyal
  Cc: mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky, robh,
	efault, rppt, david, sourabhjain, konrad.wilk, boris.ostrovsky



On 2/1/23 05:33, Thomas Gleixner wrote:
> Eric!
> 
> On Tue, Jan 31 2023 at 17:42, Eric DeVolder wrote:
>> --- a/kernel/crash_core.c
>> +++ b/kernel/crash_core.c
>> @@ -366,6 +366,14 @@ int crash_prepare_elf64_headers(struct kimage *image, struct crash_mem *mem,
>>   
>>   	/* Prepare one phdr of type PT_NOTE for each present CPU */
>>   	for_each_present_cpu(cpu) {
>> +#ifdef CONFIG_CRASH_HOTPLUG
>> +		if (IS_ENABLED(CONFIG_HOTPLUG_CPU)) {
>> +			/* Skip the soon-to-be offlined cpu */
>> +			if ((image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU) &&
>> +				(cpu == image->offlinecpu))
>> +				continue;
>> +		}
>> +#endif
> 
> I'm failing to see how the above is correct in any way. Look at the
> following sequence of events:
> 
>       1) Offline CPU$N
> 
>          -> Prepare elf headers with CPU$N excluded
> 
>       2) Another hotplug operation != 'Online CPU$N'
> 
>          -> Prepare elf headers with CPU$N included
> 
> Also in case of loading the crash kernel in the situation where not all
> present CPUs are online (think boot time SMT disable) then your
> resulting crash image will contain all present CPUs and none of the
> offline CPUs are excluded.
> 
> How does that make any sense at all?
> 
> This image->hp_action and image->offlinecpu dance is engineering
> voodoo. You just can do:
> 
>          for_each_present_cpu(cpu) {
>              if (!cpu_online(cpu))
>              	continue;
>              do_stuff(cpu);
> 
> which does the right thing in all situations and can be further
> simplified to:
> 
>          for_each_online_cpu(cpu) {
>              do_stuff(cpu);
> 
> without the need for ifdefs or whatever.
> 
> No?
> 
> Thanks,
> 
>          tglx

Thomas,

I've been re-examining the cpuhp framework and now understand its
operation a bit better.

Up until now, this patch series has been using either CPUHP_AP_ONLINE_DYN
or more recently CPUHP_BP_PREPARE_DYN with the same handler for both the
startup and teardown callbacks. This resulted in the cpu state, as seen by
my handler, being incorrect in one direction or the other. For example,
when using CPUHP_AP_ONLINE_DYN, cpu_online() always resulted in 1 for the
cpu in my callback, even during tear down. For CPUHP_BP_PREPARE_DYN,
cpu_online() always resulted in 0. Thus the offlinecpu voodoo.

But no more!

The reason, as I now understand, is simple. A cpu will not show as online
until after state CPUHP_BRINGUP_CPU (when working from CPUHP_OFFLINE towards
CPUHP_ONLINE). And a cpu will not show as offline until after state
CPUHP_TEARDOWN_CPU (when working reverse order from CPUHP_ONLINE to
CPUHP_OFFLINE).

The CPUHP_BRINGUP_CPU is the last state of the PREPARE section, and boots
the new cpu. It is code running on the booting cpu that marks itself as
online.

  CPUHP_BRINGUP_CPU
    .startup()
      bringup_cpu()
        __cpu_up()
         smp_ops.cpu_up()
          native_cpu_up()
           do_boot_cpu()
            ===== on new cpu! =====
            start_secondary()
             set_cpu_online(true)

There are quite a few CPUHP_..._STARTING states before the cpu is in a productive state.

The CPUHP_TEARDOWN_CPU is the last state in the STARTING section, and takes the cpu down.
Work/irqs are removed from this cpu and re-assigned to others.

  CPUHP_TEARDOWN_CPU
    .teardown()
     takedown_cpu()
      take_cpu_down()
       __cpu_disable()
        smp_ops.cpu_disable()
         native_cpu_disable()
          cpu_disable_common()
           remove_cpu_from_maps()
            set_cpu_online(false)

So my latest solution is to introduce two new CPUHP states, CPUHP_AP_ELFCOREHDR_ONLINE
for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm open to better names.

The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after CPUHP_BRINGUP_CPU. My
attempts at locating this state failed when inside the STARTING section, so I located
this just inside the ONLINE section. The crash hotplug handler is registered on
this state as the callback for the .startup method.

The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before CPUHP_TEARDOWN_CPU, and I
placed it at the end of the PREPARE section. This crash hotplug handler is also
registered on this state as the callback for the .teardown method.

diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 6c6859bfc454..52d2db4d793e 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -131,6 +131,7 @@ enum cpuhp_state {
     CPUHP_ZCOMP_PREPARE,
     CPUHP_TIMERS_PREPARE,
     CPUHP_MIPS_SOC_PREPARE,
+   CPUHP_BP_ELFCOREHDR_OFFLINE,
     CPUHP_BP_PREPARE_DYN,
     CPUHP_BP_PREPARE_DYN_END        = CPUHP_BP_PREPARE_DYN + 20,
     CPUHP_BRINGUP_CPU,
@@ -205,6 +206,7 @@ enum cpuhp_state {

     /* Online section invoked on the hotplugged CPU from the hotplug thread */
     CPUHP_AP_ONLINE_IDLE,
+   CPUHP_AP_ELFCOREHDR_ONLINE,
     CPUHP_AP_SCHED_WAIT_EMPTY,
     CPUHP_AP_SMPBOOT_THREADS,
     CPUHP_AP_X86_VDSO_VMA_ONLINE,

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 8a439b6d723b..e1a3430f06f4 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c

+   if (IS_ENABLED(CONFIG_HOTPLUG_CPU)) {
+       result = cpuhp_setup_state_nocalls(CPUHP_AP_ELFCOREHDR_ONLINE,
+                          "crash/cpuhp_online", crash_cpuhp_online, NULL);
+       result = cpuhp_setup_state_nocalls(CPUHP_BP_ELFCOREHDR_OFFLINE,
+                          "crash/cpuhp_offline", NULL, crash_cpuhp_offline);
+   }

With the above, there is no need for offlinecpu, as the crash hotplug handler
callback now observes the correct cpu_online() state in both online and offline
activities.
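
For completeness, a sketch of the two callbacks referenced in the registration
above; the common handler name and the KEXEC_CRASH_HP_ADD_CPU action value are
assumptions on my part, following the hp_action values used elsewhere in this
series:

/* Sketch only: both callbacks funnel into the generic crash hotplug
 * handler in crash_core.c; the exact handler name/signature is assumed.
 */
static int crash_cpuhp_online(unsigned int cpu)
{
	crash_handle_hotplug_event(KEXEC_CRASH_HP_ADD_CPU, cpu);
	return 0;
}

static int crash_cpuhp_offline(unsigned int cpu)
{
	crash_handle_hotplug_event(KEXEC_CRASH_HP_REMOVE_CPU, cpu);
	return 0;
}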

Which leads me to the next item. Thomas you suggested

           for_each_online_cpu(cpu) {
               do_stuff(cpu);

I've been looking into this further, and don't yet have a conclusion.
In light of Sourabh's comments/concerns about packing PT_NOTES, I
need to determine if my introduction of

        if (IS_ENABLED(CONFIG_CRASH_HOTPLUG)) {
            if (!cpu_online(cpu)) continue;
        }

does not cause other downstream issues. My testing was focused on
hot plug/unplugging cpus in a last-on-first-off manner, whereas
I now realize cpus can be onlined/offlined sparsely (thus the PT_NOTE
packing concern).

I'm making my way through percpu crash_notes, elfcorehdr, vmcoreinfo,
makedumpfile and (the consumer of it all) the userspace crash utility,
in order to understand the impact of moving from for_each_present_cpu()
to for_each_online_cpu().

At any rate, I wanted to at least put forth the introduction of the
two new CPUHP states and solicit feedback there while I investigate
the for_each_online_cpu() matter.

Thanks for pushing me on this topic!
eric


^ permalink raw reply related	[flat|nested] 70+ messages in thread


* Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
  2023-02-07 17:23       ` Eric DeVolder
@ 2023-02-08 13:44         ` Thomas Gleixner
  -1 siblings, 0 replies; 70+ messages in thread
From: Thomas Gleixner @ 2023-02-08 13:44 UTC (permalink / raw)
  To: Eric DeVolder, linux-kernel, x86, kexec, ebiederm, dyoung, bhe, vgoyal
  Cc: mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky, robh,
	efault, rppt, david, sourabhjain, konrad.wilk, boris.ostrovsky

Eric!

On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
> On 2/1/23 05:33, Thomas Gleixner wrote:
>
> So my latest solution is introduce two new CPUHP states, CPUHP_AP_ELFCOREHDR_ONLINE
> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm open to better names.
>
> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after CPUHP_BRINGUP_CPU. My
> attempts at locating this state failed when inside the STARTING section, so I located
> this just inside the ONLINE sectoin. The crash hotplug handler is registered on
> this state as the callback for the .startup method.
>
> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before CPUHP_TEARDOWN_CPU, and I
> placed it at the end of the PREPARE section. This crash hotplug handler is also
> registered on this state as the callback for the .teardown method.

TBH, that's still overengineered. Something like this:

bool cpu_is_alive(unsigned int cpu)
{
	struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);

	return data_race(st->state) > CPUHP_AP_IDLE_DEAD;
}

and use this to query the actual state at crash time. That spares all
those callback heuristics.
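
IOW, the loop in crash_prepare_elf64_headers() becomes a sketch like
this, assuming cpu_is_alive() is made visible to the crash code:

        for_each_present_cpu(cpu) {
            if (!cpu_is_alive(cpu))
            	continue;
            do_stuff(cpu);
        }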

> I'm making my way though percpu crash_notes, elfcorehdr, vmcoreinfo,
> makedumpfile and (the consumer of it all) the userspace crash utility,
> in order to understand the impact of moving from for_each_present_cpu()
> to for_each_online_cpu().

Is the packing actually worth the trouble? What's the actual win?

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 70+ messages in thread


* Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
  2023-02-08 13:44         ` Thomas Gleixner
@ 2023-02-09 17:31           ` Eric DeVolder
  -1 siblings, 0 replies; 70+ messages in thread
From: Eric DeVolder @ 2023-02-09 17:31 UTC (permalink / raw)
  To: Thomas Gleixner, linux-kernel, x86, kexec, ebiederm, dyoung, bhe, vgoyal
  Cc: mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky, robh,
	efault, rppt, david, sourabhjain, konrad.wilk, boris.ostrovsky



On 2/8/23 07:44, Thomas Gleixner wrote:
> Eric!
> 
> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
>> On 2/1/23 05:33, Thomas Gleixner wrote:
>>
>> So my latest solution is introduce two new CPUHP states, CPUHP_AP_ELFCOREHDR_ONLINE
>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm open to better names.
>>
>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after CPUHP_BRINGUP_CPU. My
>> attempts at locating this state failed when inside the STARTING section, so I located
>> this just inside the ONLINE sectoin. The crash hotplug handler is registered on
>> this state as the callback for the .startup method.
>>
>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before CPUHP_TEARDOWN_CPU, and I
>> placed it at the end of the PREPARE section. This crash hotplug handler is also
>> registered on this state as the callback for the .teardown method.
> 
> TBH, that's still overengineered. Something like this:
> 
> bool cpu_is_alive(unsigned int cpu)
> {
> 	struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
> 
> 	return data_race(st->state) > CPUHP_AP_IDLE_DEAD;
> }
> 
> and use this to query the actual state at crash time. That spares all
> those callback heuristics.
> 
>> I'm making my way though percpu crash_notes, elfcorehdr, vmcoreinfo,
>> makedumpfile and (the consumer of it all) the userspace crash utility,
>> in order to understand the impact of moving from for_each_present_cpu()
>> to for_each_online_cpu().
> 
> Is the packing actually worth the trouble? What's the actual win?
> 
> Thanks,
> 
>          tglx
> 
> 

Thomas,
I've investigated the passing of crash notes through the vmcore. What I've learned is that:

- linux/fs/proc/vmcore.c (which makedumpfile references to do its job) does
   not care what the contents of cpu PT_NOTES are, but it does coalesce them together.

- makedumpfile will count the number of cpu PT_NOTES in order to determine its
   nr_cpus variable, which is reported in a header, but otherwise unused (except
   for sadump method).

- the crash utility, for the purposes of determining the cpus, does not appear to
   reference the elfcorehdr PT_NOTEs. Instead it locates the various
   cpu_[possible|present|online]_mask and computes nr_cpus from that, and also of
   course which are online. In addition, when crash does reference the cpu PT_NOTE,
   to get its prstatus, it does so by using a percpu technique directly in the vmcore
   image memory, not via the ELF structure. Said differently, it appears to me that
   crash utility doesn't rely on the ELF PT_NOTEs for cpus; rather it obtains them
   via kernel cpumasks and the memory within the vmcore.

With this understanding, I did some testing. Perhaps the most telling test was that I
changed the number of cpu PT_NOTEs emitted in crash_prepare_elf64_headers() to just 1,
hot plugged some cpus, then also took a few offline sparsely via chcpu, then generated a
vmcore. The crash utility had no problem loading the vmcore; it reported the proper number
of cpus and the number offline (despite only one cpu PT_NOTE), and after changing to a
different cpu via 'set -c 30' the backtrace was completely valid.

My take away is that crash utility does not rely upon ELF cpu PT_NOTEs, it obtains the
cpu information directly from kernel data structures. Perhaps at one time crash relied
upon the ELF information, but no more. (Perhaps there are other crash dump analyzers
that might rely on the ELF info?)

So, all this to say that I see no need to change crash_prepare_elf64_headers(). There
is no compelling reason to move away from for_each_present_cpu(), or modify the list for
online/offline.

Which then leaves the topic of the cpuhp state on which to register. Perhaps reverting
back to the use of CPUHP_BP_PREPARE_DYN is the right answer. There does not appear to
be a compelling need to accurately track whether the cpu went online/offline for the
purposes of creating the elfcorehdr, as ultimately the crash utility pulls that from
kernel data structures, not the elfcorehdr.

I think this is what Sourabh has known, and why he has been advocating for an
optimization path that avoids regenerating the elfcorehdr on cpu changes (because
the percpu structs are already laid out). I do think it best to leave that as an
arch choice.

Comments?

Thanks!
eric







^ permalink raw reply	[flat|nested] 70+ messages in thread


* Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
  2023-02-09 17:31           ` Eric DeVolder
@ 2023-02-09 18:43             ` Sourabh Jain
  -1 siblings, 0 replies; 70+ messages in thread
From: Sourabh Jain @ 2023-02-09 18:43 UTC (permalink / raw)
  To: Eric DeVolder, Thomas Gleixner, linux-kernel, x86, kexec,
	ebiederm, dyoung, bhe, vgoyal
  Cc: mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky, robh,
	efault, rppt, david, konrad.wilk, boris.ostrovsky

Hello Eric,

On 09/02/23 23:01, Eric DeVolder wrote:
>
>
> On 2/8/23 07:44, Thomas Gleixner wrote:
>> Eric!
>>
>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
>>> On 2/1/23 05:33, Thomas Gleixner wrote:
>>>
>>> So my latest solution is introduce two new CPUHP states, 
>>> CPUHP_AP_ELFCOREHDR_ONLINE
>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm open 
>>> to better names.
>>>
>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after 
>>> CPUHP_BRINGUP_CPU. My
>>> attempts at locating this state failed when inside the STARTING 
>>> section, so I located
>>> this just inside the ONLINE sectoin. The crash hotplug handler is 
>>> registered on
>>> this state as the callback for the .startup method.
>>>
>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before 
>>> CPUHP_TEARDOWN_CPU, and I
>>> placed it at the end of the PREPARE section. This crash hotplug 
>>> handler is also
>>> registered on this state as the callback for the .teardown method.
>>
>> TBH, that's still overengineered. Something like this:
>>
>> bool cpu_is_alive(unsigned int cpu)
>> {
>>     struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
>>
>>     return data_race(st->state) > CPUHP_AP_IDLE_DEAD;
>> }
>>
>> and use this to query the actual state at crash time. That spares all
>> those callback heuristics.
>>
>>> I'm making my way though percpu crash_notes, elfcorehdr, vmcoreinfo,
>>> makedumpfile and (the consumer of it all) the userspace crash utility,
>>> in order to understand the impact of moving from for_each_present_cpu()
>>> to for_each_online_cpu().
>>
>> Is the packing actually worth the trouble? What's the actual win?
>>
>> Thanks,
>>
>>          tglx
>>
>>
>
> Thomas,
> I've investigated the passing of crash notes through the vmcore. What 
> I've learned is that:
>
> - linux/fs/proc/vmcore.c (which makedumpfile references to do its job) 
> does
>   not care what the contents of cpu PT_NOTES are, but it does coalesce 
> them together.
>
> - makedumpfile will count the number of cpu PT_NOTES in order to 
> determine its
>   nr_cpus variable, which is reported in a header, but otherwise 
> unused (except
>   for sadump method).
>
> - the crash utility, for the purposes of determining the cpus, does 
> not appear to
>   reference the elfcorehdr PT_NOTEs. Instead it locates the various
>   cpu_[possible|present|online]_mask and computes nr_cpus from that, 
> and also of
>   course which are online. In addition, when crash does reference the 
> cpu PT_NOTE,
>   to get its prstatus, it does so by using a percpu technique directly 
> in the vmcore
>   image memory, not via the ELF structure. Said differently, it 
> appears to me that
>   crash utility doesn't rely on the ELF PT_NOTEs for cpus; rather it 
> obtains them
>   via kernel cpumasks and the memory within the vmcore.
>
> With this understanding, I did some testing. Perhaps the most telling 
> test was that I
> changed the number of cpu PT_NOTEs emitted in the 
> crash_prepare_elf64_headers() to just 1,
> hot plugged some cpus, then also took a few offline sparsely via 
> chcpu, then generated a
> vmcore. The crash utility had no problem loading the vmcore, it 
> reported the proper number
> of cpus and the number offline (despite only one cpu PT_NOTE), and 
> changing to a different
> cpu via 'set -c 30' and the backtrace was completely valid.
>
> My take away is that crash utility does not rely upon ELF cpu 
> PT_NOTEs, it obtains the
> cpu information directly from kernel data structures. Perhaps at one 
> time crash relied
> upon the ELF information, but no more. (Perhaps there are other crash 
> dump analyzers
> that might rely on the ELF info?)
>
> So, all this to say that I see no need to change 
> crash_prepare_elf64_headers(). There
> is no compelling reason to move away from for_each_present_cpu(), or 
> modify the list for
> online/offline.
>
> Which then leaves the topic of the cpuhp state on which to register. 
> Perhaps reverting
> back to the use of CPUHP_BP_PREPARE_DYN is the right answer. There 
> does not appear to
> be a compelling need to accurately track whether the cpu went 
> online/offline for the
> purposes of creating the elfcorehdr, as ultimately the crash utility 
> pulls that from
> kernel data structures, not the elfcorehdr.
>
> I think this is what Sourabh has known, and he has been advocating for
> an optimization path that avoids regenerating the elfcorehdr on cpu
> changes (because all the percpu structs are already laid out). I do
> think it best to leave that as an arch choice.

Since things are clear on how the PT_NOTEs are consumed in the kdump
kernel [fs/proc/vmcore.c], makedumpfile, and the crash tool, I need your
opinion on this:

Do we really need to regenerate the elfcorehdr for CPU hotplug events?
If yes, can you please list the elfcorehdr components that change due
to CPU hotplug?

 From what I understood, crash notes are prepared for possible CPUs as
the system boots and
could be used to create a PT_NOTE section for each possible CPU while 
generating the elfcorehdr
during the kdump kernel load.

Now once the elfcorehdr is loaded with PT_NOTEs for every possible CPU 
there is no need to
regenerate it for CPU hotplug events. Or do we?

Thanks,
Sourabh Jain

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v18 3/7] crash: add generic infrastructure for crash hotplug support
  2023-01-31 22:42   ` Eric DeVolder
@ 2023-02-09 19:10     ` Sourabh Jain
  -1 siblings, 0 replies; 70+ messages in thread
From: Sourabh Jain @ 2023-02-09 19:10 UTC (permalink / raw)
  To: Eric DeVolder, linux-kernel, x86, kexec, ebiederm, dyoung, bhe, vgoyal
  Cc: tglx, mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky, robh,
	efault, rppt, david, konrad.wilk, boris.ostrovsky

Hello Eric,

On 01/02/23 04:12, Eric DeVolder wrote:
> To support crash hotplug, a mechanism is needed to update the crash
> elfcorehdr upon CPU or memory changes (eg. hot un/plug or off/
> onlining).
>
> To track CPU changes, callbacks are registered with the cpuhp
> mechanism via cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN). The
> crash hotplug elfcorehdr update has no explicit ordering requirement
> (relative to other cpuhp states), so it meets the criteria for
> utilizing CPUHP_BP_PREPARE_DYN. CPUHP_BP_PREPARE_DYN is a dynamic
> state and avoids the need to introduce a new state for crash
> hotplug. Also, this is the last state in the PREPARE group, just
> prior to the STARTING group, which is very close to the CPU
> starting up in a plug/online situation, or stopping in an unplug/
> offline situation. This minimizes the window of time during an
> actual plug/online or unplug/offline situation in which the
> elfcorehdr would be inaccurate.
>
> Note that when a CPU is being unplugged/offlined, the CPU is still
> in for_each_present_cpu() during the regeneration of the
> elfcorehdr. Thus there is a need to explicitly check and exclude
> the soon-to-be offlined CPU. See patch 'kexec: exclude hot remove
> cpu from elfcorehdr notes'.
>
> To track memory changes, a notifier is registered to capture the
> memblock MEM_ONLINE and MEM_OFFLINE events via register_memory_notifier().
>
> The cpu callbacks and memory notifiers invoke handle_hotplug_event()
> which performs needed tasks and then dispatches the event to the
> architecture specific arch_crash_handle_hotplug_event() to update the
> elfcorehdr with the current state of CPUs and memory. During the
> process, the kexec_lock is held.
>
> Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
> Acked-by: Baoquan He <bhe@redhat.com>
> ---
>   include/linux/crash_core.h |   9 +++
>   include/linux/kexec.h      |  12 ++++
>   kernel/crash_core.c        | 139 +++++++++++++++++++++++++++++++++++++
>   3 files changed, 160 insertions(+)
>
> diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
> index de62a722431e..ed868d237c07 100644
> --- a/include/linux/crash_core.h
> +++ b/include/linux/crash_core.h
> @@ -84,4 +84,13 @@ int parse_crashkernel_high(char *cmdline, unsigned long long system_ram,
>   int parse_crashkernel_low(char *cmdline, unsigned long long system_ram,
>   		unsigned long long *crash_size, unsigned long long *crash_base);
>   
> +#define KEXEC_CRASH_HP_NONE			0
> +#define KEXEC_CRASH_HP_REMOVE_CPU		1
> +#define KEXEC_CRASH_HP_ADD_CPU			2
> +#define KEXEC_CRASH_HP_REMOVE_MEMORY		3
> +#define KEXEC_CRASH_HP_ADD_MEMORY		4
> +#define KEXEC_CRASH_HP_INVALID_CPU		-1U
> +
> +struct kimage;
> +
>   #endif /* LINUX_CRASH_CORE_H */
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index 27ef420c7a45..a52624ae4452 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -33,6 +33,7 @@ extern note_buf_t __percpu *crash_notes;
>   #include <linux/compat.h>
>   #include <linux/ioport.h>
>   #include <linux/module.h>
> +#include <linux/highmem.h>
>   #include <asm/kexec.h>
>   
>   /* Verify architecture specific macros are defined */
> @@ -371,6 +372,13 @@ struct kimage {
>   	struct purgatory_info purgatory_info;
>   #endif
>   
> +#ifdef CONFIG_CRASH_HOTPLUG
> +	int hp_action;
> +	unsigned int offlinecpu;
> +	bool elfcorehdr_index_valid;
> +	int elfcorehdr_index;

Maybe I am reiterating myself, but I think we can manage without
elfcorehdr_index_valid.

Here is how:
Initialize elfcorehdr_index with a negative value in the
do_kimage_alloc_init function (it is called for both kexec_load and
kexec_file_load).

Now when control reaches the handle_hotplug_event function, if
elfcorehdr_index still has a negative value, find the correct index and
re-initialize elfcorehdr_index.
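
[A minimal sketch of that scheme, as fragments; find_elfcorehdr_index()
is a hypothetical helper standing in for the lookup described above,
while the field name follows the quoted diff:

	/* in do_kimage_alloc_init(), common to both load syscalls: */
	image->elfcorehdr_index = -1;	/* segment not located yet */

	/* in handle_hotplug_event(), before updating the elfcorehdr: */
	if (image->elfcorehdr_index < 0)
		image->elfcorehdr_index = find_elfcorehdr_index(image);

With elfcorehdr_index declared as a plain int, the negative sentinel
replaces the separate elfcorehdr_index_valid flag.]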

Thoughts?

Thanks,
Sourabh Jain


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
  2023-02-09 18:43             ` Sourabh Jain
@ 2023-02-09 19:39               ` Eric DeVolder
  -1 siblings, 0 replies; 70+ messages in thread
From: Eric DeVolder @ 2023-02-09 19:39 UTC (permalink / raw)
  To: Sourabh Jain, Thomas Gleixner, linux-kernel, x86, kexec,
	ebiederm, dyoung, bhe, vgoyal
  Cc: mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky, robh,
	efault, rppt, david, konrad.wilk, boris.ostrovsky



On 2/9/23 12:43, Sourabh Jain wrote:
> Hello Eric,
> 
> On 09/02/23 23:01, Eric DeVolder wrote:
>>
>>
>> On 2/8/23 07:44, Thomas Gleixner wrote:
>>> Eric!
>>>
>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
>>>> On 2/1/23 05:33, Thomas Gleixner wrote:
>>>>
>>>> So my latest solution is introduce two new CPUHP states, CPUHP_AP_ELFCOREHDR_ONLINE
>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm open to better names.
>>>>
>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after CPUHP_BRINGUP_CPU. My
>>>> attempts at locating this state failed when inside the STARTING section, so I located
>>>> this just inside the ONLINE section. The crash hotplug handler is registered on
>>>> this state as the callback for the .startup method.
>>>>
>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before CPUHP_TEARDOWN_CPU, and I
>>>> placed it at the end of the PREPARE section. This crash hotplug handler is also
>>>> registered on this state as the callback for the .teardown method.
>>>
>>> TBH, that's still overengineered. Something like this:
>>>
>>> bool cpu_is_alive(unsigned int cpu)
>>> {
>>>     struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
>>>
>>>     return data_race(st->state) > CPUHP_AP_IDLE_DEAD;
>>> }
>>>
>>> and use this to query the actual state at crash time. That spares all
>>> those callback heuristics.
>>>
>>>> I'm making my way through percpu crash_notes, elfcorehdr, vmcoreinfo,
>>>> makedumpfile and (the consumer of it all) the userspace crash utility,
>>>> in order to understand the impact of moving from for_each_present_cpu()
>>>> to for_each_online_cpu().
>>>
>>> Is the packing actually worth the trouble? What's the actual win?
>>>
>>> Thanks,
>>>
>>>          tglx
>>>
>>>
>>
>> Thomas,
>> I've investigated the passing of crash notes through the vmcore. What I've learned is that:
>>
>> - linux/fs/proc/vmcore.c (which makedumpfile references to do its job) does
>>   not care what the contents of cpu PT_NOTES are, but it does coalesce them together.
>>
>> - makedumpfile will count the number of cpu PT_NOTES in order to determine its
>>   nr_cpus variable, which is reported in a header, but otherwise unused (except
>>   for sadump method).
>>
>> - the crash utility, for the purposes of determining the cpus, does not appear to
>>   reference the elfcorehdr PT_NOTEs. Instead it locates the various
>>   cpu_[possible|present|online]_mask and computes nr_cpus from that, and also of
>>   course which are online. In addition, when crash does reference the cpu PT_NOTE,
>>   to get its prstatus, it does so by using a percpu technique directly in the vmcore
>>   image memory, not via the ELF structure. Said differently, it appears to me that
>>   crash utility doesn't rely on the ELF PT_NOTEs for cpus; rather it obtains them
>>   via kernel cpumasks and the memory within the vmcore.
>>
>> With this understanding, I did some testing. Perhaps the most telling test was that I
>> changed the number of cpu PT_NOTEs emitted in the crash_prepare_elf64_headers() to just 1,
>> hot plugged some cpus, then also took a few offline sparsely via chcpu, then generated a
>> vmcore. The crash utility had no problem loading the vmcore; it reported the proper number
>> of cpus and the number offline (despite only one cpu PT_NOTE), and changing to a different
>> cpu via 'set -c 30' produced a completely valid backtrace.
>>
>> My takeaway is that the crash utility does not rely upon ELF cpu PT_NOTEs; it obtains the
>> cpu information directly from kernel data structures. Perhaps at one time crash relied
>> upon the ELF information, but no more. (Perhaps there are other crash dump analyzers
>> that might rely on the ELF info?)
>>
>> So, all this to say that I see no need to change crash_prepare_elf64_headers(). There
>> is no compelling reason to move away from for_each_present_cpu(), or modify the list for
>> online/offline.
>>
>> Which then leaves the topic of the cpuhp state on which to register. Perhaps reverting
>> back to the use of CPUHP_BP_PREPARE_DYN is the right answer. There does not appear to
>> be a compelling need to accurately track whether the cpu went online/offline for the
>> purposes of creating the elfcorehdr, as ultimately the crash utility pulls that from
>> kernel data structures, not the elfcorehdr.
>>
>> I think this is what Sourabh has known and has been advocating for an optimization
>> path that allows not regenerating the elfcorehdr on cpu changes (because all the percpu
>> structs are already laid out). I do think it best to leave that as an arch choice.
> 
> Since things are clear on how the PT_NOTES are consumed in kdump kernel [fs/proc/vmcore.c],
> makedumpfile, and the crash tool, I need your opinion on this:
> 
> Do we really need to regenerate elfcorehdr for CPU hotplug events?
> If yes, can you please list the elfcorehdr components that change due to CPU hotplug?
Due to the use of for_each_present_cpu(), it is possible for the number of cpu PT_NOTEs
to fluctuate as cpus are un/plugged. Onlining/offlining of cpus does not impact the
number of cpu PT_NOTEs (as the cpus are still present).

> 
>  From what I understood, crash notes are prepared for possible CPUs as the system boots and
> could be used to create a PT_NOTE section for each possible CPU while generating the elfcorehdr
> during the kdump kernel load.
> 
> Now once the elfcorehdr is loaded with PT_NOTEs for every possible CPU there is no need to
> regenerate it for CPU hotplug events. Or do we?

For onlining/offlining of cpus, there is no need to regenerate the elfcorehdr. However,
for actual hot un/plug of cpus, the answer is yes due to for_each_present_cpu(). The
caveat here, of course, is that if the crash utility is the only coredump analyzer of concern,
then it doesn't care about these cpu PT_NOTEs and there would be no need to re-generate them.

Also, I'm not sure if ARM cpu hotplug, which is just now coming into mainstream, impacts
any of this.

Perhaps the one item that might help here is to distinguish between actual hot un/plug of
cpus, versus onlining/offlining. At the moment, I can not distinguish between a hot plug
event and an online event (and unplug/offline). If those were distinguishable, then we
could only regenerate on un/plug events.

Or perhaps moving to for_each_possible_cpu() is the better choice?
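
[For concreteness, a sketch of what that change would look like in
crash_prepare_elf64_headers(); the loop body is abridged from the
in-tree function as of this series, and only the iterator and its
comment change:

-	/* Prepare one phdr of type PT_NOTE for each present CPU */
-	for_each_present_cpu(cpu) {
+	/* Prepare one phdr of type PT_NOTE for each possible CPU */
+	for_each_possible_cpu(cpu) {
 		phdr->p_type = PT_NOTE;
 		notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
 		phdr->p_offset = phdr->p_paddr = notes_addr;
 		phdr->p_filesz = phdr->p_memsz = sizeof(note_buf_t);
 		(ehdr->e_phnum)++;
 		phdr++;
 	}

Since crash_notes is a percpu allocation, per_cpu_ptr() yields a stable
address for every possible CPU, so the PT_NOTE count would no longer
change across hot un/plug.]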

eric


> 
> Thanks,
> Sourabh Jain

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
  2023-02-09 19:39               ` Eric DeVolder
@ 2023-02-10  6:29                 ` Sourabh Jain
  -1 siblings, 0 replies; 70+ messages in thread
From: Sourabh Jain @ 2023-02-10  6:29 UTC (permalink / raw)
  To: Eric DeVolder, Thomas Gleixner, linux-kernel, x86, kexec,
	ebiederm, dyoung, bhe, vgoyal
  Cc: mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky, robh,
	efault, rppt, david, konrad.wilk, boris.ostrovsky


On 10/02/23 01:09, Eric DeVolder wrote:
>
>
> On 2/9/23 12:43, Sourabh Jain wrote:
>> Hello Eric,
>>
>> On 09/02/23 23:01, Eric DeVolder wrote:
>>>
>>>
>>> On 2/8/23 07:44, Thomas Gleixner wrote:
>>>> Eric!
>>>>
>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
>>>>> On 2/1/23 05:33, Thomas Gleixner wrote:
>>>>>
>>>>> So my latest solution is introduce two new CPUHP states, 
>>>>> CPUHP_AP_ELFCOREHDR_ONLINE
>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm 
>>>>> open to better names.
>>>>>
>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after 
>>>>> CPUHP_BRINGUP_CPU. My
>>>>> attempts at locating this state failed when inside the STARTING 
>>>>> section, so I located
>>>>> this just inside the ONLINE section. The crash hotplug handler is
>>>>> registered on
>>>>> this state as the callback for the .startup method.
>>>>>
>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before 
>>>>> CPUHP_TEARDOWN_CPU, and I
>>>>> placed it at the end of the PREPARE section. This crash hotplug 
>>>>> handler is also
>>>>> registered on this state as the callback for the .teardown method.
>>>>
>>>> TBH, that's still overengineered. Something like this:
>>>>
>>>> bool cpu_is_alive(unsigned int cpu)
>>>> {
>>>>     struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
>>>>
>>>>     return data_race(st->state) > CPUHP_AP_IDLE_DEAD;
>>>> }
>>>>
>>>> and use this to query the actual state at crash time. That spares all
>>>> those callback heuristics.
>>>>
>>>>> I'm making my way through percpu crash_notes, elfcorehdr, vmcoreinfo,
>>>>> makedumpfile and (the consumer of it all) the userspace crash 
>>>>> utility,
>>>>> in order to understand the impact of moving from 
>>>>> for_each_present_cpu()
>>>>> to for_each_online_cpu().
>>>>
>>>> Is the packing actually worth the trouble? What's the actual win?
>>>>
>>>> Thanks,
>>>>
>>>>          tglx
>>>>
>>>>
>>>
>>> Thomas,
>>> I've investigated the passing of crash notes through the vmcore. 
>>> What I've learned is that:
>>>
>>> - linux/fs/proc/vmcore.c (which makedumpfile references to do its 
>>> job) does
>>>   not care what the contents of cpu PT_NOTES are, but it does 
>>> coalesce them together.
>>>
>>> - makedumpfile will count the number of cpu PT_NOTES in order to 
>>> determine its
>>>   nr_cpus variable, which is reported in a header, but otherwise 
>>> unused (except
>>>   for sadump method).
>>>
>>> - the crash utility, for the purposes of determining the cpus, does 
>>> not appear to
>>>   reference the elfcorehdr PT_NOTEs. Instead it locates the various
>>>   cpu_[possible|present|online]_mask and computes nr_cpus from that, 
>>> and also of
>>>   course which are online. In addition, when crash does reference 
>>> the cpu PT_NOTE,
>>>   to get its prstatus, it does so by using a percpu technique 
>>> directly in the vmcore
>>>   image memory, not via the ELF structure. Said differently, it 
>>> appears to me that
>>>   crash utility doesn't rely on the ELF PT_NOTEs for cpus; rather it 
>>> obtains them
>>>   via kernel cpumasks and the memory within the vmcore.
>>>
>>> With this understanding, I did some testing. Perhaps the most 
>>> telling test was that I
>>> changed the number of cpu PT_NOTEs emitted in the 
>>> crash_prepare_elf64_headers() to just 1,
>>> hot plugged some cpus, then also took a few offline sparsely via 
>>> chcpu, then generated a
>>> vmcore. The crash utility had no problem loading the vmcore; it
>>> reported the proper number of cpus and the number offline (despite
>>> only one cpu PT_NOTE), and changing to a different cpu via
>>> 'set -c 30' produced a completely valid backtrace.
>>>
>>> My takeaway is that the crash utility does not rely upon ELF cpu
>>> PT_NOTEs; it obtains the
>>> cpu information directly from kernel data structures. Perhaps at one 
>>> time crash relied
>>> upon the ELF information, but no more. (Perhaps there are other 
>>> crash dump analyzers
>>> that might rely on the ELF info?)
>>>
>>> So, all this to say that I see no need to change 
>>> crash_prepare_elf64_headers(). There
>>> is no compelling reason to move away from for_each_present_cpu(), or 
>>> modify the list for
>>> online/offline.
>>>
>>> Which then leaves the topic of the cpuhp state on which to register. 
>>> Perhaps reverting
>>> back to the use of CPUHP_BP_PREPARE_DYN is the right answer. There 
>>> does not appear to
>>> be a compelling need to accurately track whether the cpu went 
>>> online/offline for the
>>> purposes of creating the elfcorehdr, as ultimately the crash utility 
>>> pulls that from
>>> kernel data structures, not the elfcorehdr.
>>>
>>> I think this is what Sourabh has known and has been advocating for 
>>> an optimization
>>> path that allows not regenerating the elfcorehdr on cpu changes 
>>> (because all the percpu
>>> structs are already laid out). I do think it best to leave that as an
>>> arch choice.
>>
>> Since things are clear on how the PT_NOTES are consumed in kdump 
>> kernel [fs/proc/vmcore.c],
>> makedumpfile, and the crash tool, I need your opinion on this:
>>
>> Do we really need to regenerate elfcorehdr for CPU hotplug events?
>> If yes, can you please list the elfcorehdr components that change
>> due to CPU hotplug?
> Due to the use of for_each_present_cpu(), it is possible for the 
> number of cpu PT_NOTEs
> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus does 
> not impact the
> number of cpu PT_NOTEs (as the cpus are still present).
>
>>
>>  From what I understood, crash notes are prepared for possible CPUs 
>> as the system boots and
>> could be used to create a PT_NOTE section for each possible CPU while 
>> generating the elfcorehdr
>> during the kdump kernel load.
>>
>> Now once the elfcorehdr is loaded with PT_NOTEs for every possible 
>> CPU there is no need to
>> regenerate it for CPU hotplug events. Or do we?
>
> For onlining/offlining of cpus, there is no need to regenerate the 
> elfcorehdr. However,
> for actual hot un/plug of cpus, the answer is yes due to 
> for_each_present_cpu(). The
> caveat here of course is that if crash utility is the only coredump 
> analyzer of concern,
> then it doesn't care about these cpu PT_NOTEs and there would be no 
> need to re-generate them.
>
> Also, I'm not sure if ARM cpu hotplug, which is just now coming into 
> mainstream, impacts
> any of this.
>
> Perhaps the one item that might help here is to distinguish between 
> actual hot un/plug of
> cpus, versus onlining/offlining. At the moment, I can not distinguish 
> between a hot plug
> event and an online event (and unplug/offline). If those were 
> distinguishable, then we
> could only regenerate on un/plug events.
>
> Or perhaps moving to for_each_possible_cpu() is the better choice?

Yes, because once the elfcorehdr is built with possible CPUs we don't
have to worry about the hot[un]plug case.

Here is my view on how things should be handled if a core-dump analyzer 
is dependent on
elfcorehdr PT_NOTEs to find online/offline CPUs.

A PT_NOTE in the elfcorehdr holds the address of the corresponding crash
note (the kernel has one crash note per CPU for every possible CPU).
Though the crash notes are allocated during boot, they are populated
when the system is on the crash path.

This is how crash notes are populated on PowerPC, and I expect it is
something similar on other architectures too.

The crashing CPU sends an IPI to every other online CPU with a callback
function that updates the crash note of that specific CPU. Once the IPI
completes, the crashing CPU updates its own crash note and proceeds
further.

The crash notes of CPUs remain uninitialized if the CPUs were offline or
hot unplugged at the time of the system crash. The core-dump analyzer
should be able to identify [un]initialized crash notes and display the
information accordingly.
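
[As a sketch of the analyzer-side check this implies (assumed logic,
not taken from any existing tool): a crash note that was never written
at crash time is still the zero-filled boot-time allocation, so its ELF
note header is empty and distinguishable from a real NT_PRSTATUS note:

#include <elf.h>
#include <stdbool.h>

/* Hypothetical analyzer helper: crash_save_cpu() writes a
 * "CORE"/NT_PRSTATUS note, so an untouched note buffer (offline or
 * unplugged CPU) still has an all-zero header. */
static bool crash_note_initialized(const Elf64_Nhdr *nhdr)
{
	return nhdr->n_namesz != 0 && nhdr->n_type == NT_PRSTATUS;
}
]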

Thoughts?

- Sourabh

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v18 3/7] crash: add generic infrastructure for crash hotplug support
  2023-02-09 19:10     ` Sourabh Jain
@ 2023-02-10 16:51       ` Eric DeVolder
  -1 siblings, 0 replies; 70+ messages in thread
From: Eric DeVolder @ 2023-02-10 16:51 UTC (permalink / raw)
  To: Sourabh Jain, linux-kernel, x86, kexec, ebiederm, dyoung, bhe, vgoyal
  Cc: tglx, mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky, robh,
	efault, rppt, david, konrad.wilk, boris.ostrovsky



On 2/9/23 13:10, Sourabh Jain wrote:
> Hello Eric,
> 
> On 01/02/23 04:12, Eric DeVolder wrote:
>> To support crash hotplug, a mechanism is needed to update the crash
>> elfcorehdr upon CPU or memory changes (eg. hot un/plug or off/
>> onlining).
>>
>> To track CPU changes, callbacks are registered with the cpuhp
>> mechanism via cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN). The
>> crash hotplug elfcorehdr update has no explicit ordering requirement
>> (relative to other cpuhp states), so it meets the criteria for
>> utilizing CPUHP_BP_PREPARE_DYN. CPUHP_BP_PREPARE_DYN is a dynamic
>> state and avoids the need to introduce a new state for crash
>> hotplug. Also, this is the last state in the PREPARE group, just
>> prior to the STARTING group, which is very close to the CPU
>> starting up in a plug/online situation, or stopping in an unplug/
>> offline situation. This minimizes the window of time during an
>> actual plug/online or unplug/offline situation in which the
>> elfcorehdr would be inaccurate.
>>
>> Note that when a CPU is being unplugged/offlined, the CPU is still
>> in for_each_present_cpu() during the regeneration of the
>> elfcorehdr. Thus there is a need to explicitly check and exclude
>> the soon-to-be offlined CPU. See patch 'kexec: exclude hot remove
>> cpu from elfcorehdr notes'.
>>
>> To track memory changes, a notifier is registered to capture the
>> memblock MEM_ONLINE and MEM_OFFLINE events via register_memory_notifier().
>>
>> The cpu callbacks and memory notifiers invoke handle_hotplug_event()
>> which performs needed tasks and then dispatches the event to the
>> architecture specific arch_crash_handle_hotplug_event() to update the
>> elfcorehdr with the current state of CPUs and memory. During the
>> process, the kexec_lock is held.
>>
>> Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
>> Acked-by: Baoquan He <bhe@redhat.com>
>> ---
>>   include/linux/crash_core.h |   9 +++
>>   include/linux/kexec.h      |  12 ++++
>>   kernel/crash_core.c        | 139 +++++++++++++++++++++++++++++++++++++
>>   3 files changed, 160 insertions(+)
>>
>> diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
>> index de62a722431e..ed868d237c07 100644
>> --- a/include/linux/crash_core.h
>> +++ b/include/linux/crash_core.h
>> @@ -84,4 +84,13 @@ int parse_crashkernel_high(char *cmdline, unsigned long long system_ram,
>>   int parse_crashkernel_low(char *cmdline, unsigned long long system_ram,
>>           unsigned long long *crash_size, unsigned long long *crash_base);
>> +#define KEXEC_CRASH_HP_NONE            0
>> +#define KEXEC_CRASH_HP_REMOVE_CPU        1
>> +#define KEXEC_CRASH_HP_ADD_CPU            2
>> +#define KEXEC_CRASH_HP_REMOVE_MEMORY        3
>> +#define KEXEC_CRASH_HP_ADD_MEMORY        4
>> +#define KEXEC_CRASH_HP_INVALID_CPU        -1U
>> +
>> +struct kimage;
>> +
>>   #endif /* LINUX_CRASH_CORE_H */
>> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
>> index 27ef420c7a45..a52624ae4452 100644
>> --- a/include/linux/kexec.h
>> +++ b/include/linux/kexec.h
>> @@ -33,6 +33,7 @@ extern note_buf_t __percpu *crash_notes;
>>   #include <linux/compat.h>
>>   #include <linux/ioport.h>
>>   #include <linux/module.h>
>> +#include <linux/highmem.h>
>>   #include <asm/kexec.h>
>>   /* Verify architecture specific macros are defined */
>> @@ -371,6 +372,13 @@ struct kimage {
>>       struct purgatory_info purgatory_info;
>>   #endif
>> +#ifdef CONFIG_CRASH_HOTPLUG
>> +    int hp_action;
>> +    unsigned int offlinecpu;
>> +    bool elfcorehdr_index_valid;
>> +    int elfcorehdr_index;
> 
> Maybe I am reiterating myself, but I think we can manage without elfcorehdr_index_valid.
> 
> Here is how:
> Initialize the elfcorehdr_index with a negative value in do_kimage_alloc_init
> function (it is called for both kexec_load and kexec_file_load).
> 
> Now when control reaches the handle_hotplug_event function, if elfcorehdr_index
> still has a negative value, find the correct index and re-initialize elfcorehdr_index.
> 
> Thoughts?
> 
> Thanks,
> Sourabh Jain
> 
ok, I'll eliminate elfcorehdr_index_valid.
eric
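
[For readers following the thread, a sketch of the registration flow
described in the quoted commit message. The kernel/crash_core.c hunk is
not quoted above, so the callback names and bodies below are
assumptions; only CPUHP_BP_PREPARE_DYN, register_memory_notifier(),
handle_hotplug_event() and the KEXEC_CRASH_HP_* actions come from the
patch text:

static int crash_cpuhp_up(unsigned int cpu)
{
	handle_hotplug_event(KEXEC_CRASH_HP_ADD_CPU, cpu);
	return 0;
}

static int crash_cpuhp_down(unsigned int cpu)
{
	handle_hotplug_event(KEXEC_CRASH_HP_REMOVE_CPU, cpu);
	return 0;
}

static int crash_memhp_notifier(struct notifier_block *nb,
				unsigned long action, void *arg)
{
	switch (action) {
	case MEM_ONLINE:
		handle_hotplug_event(KEXEC_CRASH_HP_ADD_MEMORY,
				     KEXEC_CRASH_HP_INVALID_CPU);
		break;
	case MEM_OFFLINE:
		handle_hotplug_event(KEXEC_CRASH_HP_REMOVE_MEMORY,
				     KEXEC_CRASH_HP_INVALID_CPU);
		break;
	}
	return NOTIFY_OK;
}

static struct notifier_block crash_memhp_nb = {
	.notifier_call = crash_memhp_notifier,
};

static int __init crash_hotplug_init(void)
{
	if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG))
		register_memory_notifier(&crash_memhp_nb);

	return cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN, "crash/cpuhp",
					 crash_cpuhp_up, crash_cpuhp_down);
}
subsys_initcall(crash_hotplug_init);
]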

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
  2023-02-10  6:29                 ` Sourabh Jain
@ 2023-02-11  0:35                   ` Eric DeVolder
  -1 siblings, 0 replies; 70+ messages in thread
From: Eric DeVolder @ 2023-02-11  0:35 UTC (permalink / raw)
  To: Sourabh Jain, Thomas Gleixner, linux-kernel, x86, kexec,
	ebiederm, dyoung, bhe, vgoyal
  Cc: mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky, robh,
	efault, rppt, david, konrad.wilk, boris.ostrovsky



On 2/10/23 00:29, Sourabh Jain wrote:
> 
> On 10/02/23 01:09, Eric DeVolder wrote:
>>
>>
>> On 2/9/23 12:43, Sourabh Jain wrote:
>>> Hello Eric,
>>>
>>> On 09/02/23 23:01, Eric DeVolder wrote:
>>>>
>>>>
>>>> On 2/8/23 07:44, Thomas Gleixner wrote:
>>>>> Eric!
>>>>>
>>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
>>>>>> On 2/1/23 05:33, Thomas Gleixner wrote:
>>>>>>
>>>>>> So my latest solution is introduce two new CPUHP states, CPUHP_AP_ELFCOREHDR_ONLINE
>>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm open to better names.
>>>>>>
>>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after CPUHP_BRINGUP_CPU. My
>>>>>> attempts at locating this state failed when inside the STARTING section, so I located
>>>>>> this just inside the ONLINE section. The crash hotplug handler is registered on
>>>>>> this state as the callback for the .startup method.
>>>>>>
>>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before CPUHP_TEARDOWN_CPU, and I
>>>>>> placed it at the end of the PREPARE section. This crash hotplug handler is also
>>>>>> registered on this state as the callback for the .teardown method.
>>>>>
>>>>> TBH, that's still overengineered. Something like this:
>>>>>
>>>>> bool cpu_is_alive(unsigned int cpu)
>>>>> {
>>>>>     struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
>>>>>
>>>>>     return data_race(st->state) <= CPUHP_AP_IDLE_DEAD;
>>>>> }
>>>>>
>>>>> and use this to query the actual state at crash time. That spares all
>>>>> those callback heuristics.
>>>>>
>>>>>> I'm making my way through percpu crash_notes, elfcorehdr, vmcoreinfo,
>>>>>> makedumpfile and (the consumer of it all) the userspace crash utility,
>>>>>> in order to understand the impact of moving from for_each_present_cpu()
>>>>>> to for_each_online_cpu().
>>>>>
>>>>> Is the packing actually worth the trouble? What's the actual win?
>>>>>
>>>>> Thanks,
>>>>>
>>>>>          tglx
>>>>>
>>>>>
>>>>
>>>> Thomas,
>>>> I've investigated the passing of crash notes through the vmcore. What I've learned is that:
>>>>
>>>> - linux/fs/proc/vmcore.c (which makedumpfile references to do its job) does
>>>>   not care what the contents of cpu PT_NOTES are, but it does coalesce them together.
>>>>
>>>> - makedumpfile will count the number of cpu PT_NOTES in order to determine its
>>>>   nr_cpus variable, which is reported in a header, but otherwise unused (except
>>>>   for sadump method).
>>>>
>>>> - the crash utility, for the purposes of determining the cpus, does not appear to
>>>>   reference the elfcorehdr PT_NOTEs. Instead it locates the various
>>>>   cpu_[possible|present|online]_mask and computes nr_cpus from that, and also of
>>>>   course which are online. In addition, when crash does reference the cpu PT_NOTE,
>>>>   to get its prstatus, it does so by using a percpu technique directly in the vmcore
>>>>   image memory, not via the ELF structure. Said differently, it appears to me that
>>>>   crash utility doesn't rely on the ELF PT_NOTEs for cpus; rather it obtains them
>>>>   via kernel cpumasks and the memory within the vmcore.
>>>>
>>>> With this understanding, I did some testing. Perhaps the most telling test was that I
>>>> changed the number of cpu PT_NOTEs emitted in the crash_prepare_elf64_headers() to just 1,
>>>> hot plugged some cpus, then also took a few offline sparsely via chcpu, then generated a
>>>> vmcore. The crash utility had no problem loading the vmcore, it reported the proper number
>>>> of cpus and the number offline (despite only one cpu PT_NOTE), and changing to a different
>>>> cpu via 'set -c 30' and the backtrace was completely valid.
>>>>
>>>> My take away is that crash utility does not rely upon ELF cpu PT_NOTEs, it obtains the
>>>> cpu information directly from kernel data structures. Perhaps at one time crash relied
>>>> upon the ELF information, but no more. (Perhaps there are other crash dump analyzers
>>>> that might rely on the ELF info?)
>>>>
>>>> So, all this to say that I see no need to change crash_prepare_elf64_headers(). There
>>>> is no compelling reason to move away from for_each_present_cpu(), or modify the list for
>>>> online/offline.
>>>>
>>>> Which then leaves the topic of the cpuhp state on which to register. Perhaps reverting
>>>> back to the use of CPUHP_BP_PREPARE_DYN is the right answer. There does not appear to
>>>> be a compelling need to accurately track whether the cpu went online/offline for the
>>>> purposes of creating the elfcorehdr, as ultimately the crash utility pulls that from
>>>> kernel data structures, not the elfcorehdr.
>>>>
>>>> I think this is what Sourabh has known and has been advocating for an optimization
>>>> path that allows not regenerating the elfcorehdr on cpu changes (because all the percpu
>>>> structs are all laid out). I do think it best to leave that as an arch choice.
>>>
>>> Since things are clear on how the PT_NOTES are consumed in kdump kernel [fs/proc/vmcore.c],
>>> makedumpfile, and crash tool I need your opinion on this:
>>>
>>> Do we really need to regenerate elfcorehdr for CPU hotplug events?
>>> If yes, can you please list the elfcorehdr components that change due to CPU hotplug.
>> Due to the use of for_each_present_cpu(), it is possible for the number of cpu PT_NOTEs
>> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus does not impact the
>> number of cpu PT_NOTEs (as the cpus are still present).
>>
>>>
>>> From what I understood, crash notes are prepared for possible CPUs as the system boots and
>>> could be used to create a PT_NOTE section for each possible CPU while generating the elfcorehdr
>>> during the kdump kernel load.
>>>
>>> Now once the elfcorehdr is loaded with PT_NOTEs for every possible CPU there is no need to
>>> regenerate it for CPU hotplug events. Or do we?
>>
>> For onlining/offlining of cpus, there is no need to regenerate the elfcorehdr. However,
>> for actual hot un/plug of cpus, the answer is yes due to for_each_present_cpu(). The
>> caveat here of course is that if crash utility is the only coredump analyzer of concern,
>> then it doesn't care about these cpu PT_NOTEs and there would be no need to re-generate them.
>>
>> Also, I'm not sure if ARM cpu hotplug, which is just now coming into mainstream, impacts
>> any of this.
>>
>> Perhaps the one item that might help here is to distinguish between actual hot un/plug of
>> cpus, versus onlining/offlining. At the moment, I can not distinguish between a hot plug
>> event and an online event (and unplug/offline). If those were distinguishable, then we
>> could only regenerate on un/plug events.
>>
>> Or perhaps moving to for_each_possible_cpu() is the better choice?
> 
> Yes, because once elfcorehdr is built with possible CPUs we don't have to worry about
> the hot[un]plug case.
> 
> Here is my view on how things should be handled if a core-dump analyzer is dependent on
> elfcorehdr PT_NOTEs to find online/offline CPUs.
> 
> A PT_NOTE in elfcorehdr holds the address of the corresponding crash notes (kernel has
> one crash note per CPU for every possible CPU). Though the crash notes are allocated
> at boot time, they are populated when the system is on the crash path.
> 
> This is how crash notes are populated on PowerPC and I am expecting it would be something
> similar on other architectures too.
> 
> The crashing CPU sends an IPI to every other online CPU with a callback function that updates the
> crash notes of that specific CPU. Once the IPI completes the crashing CPU updates its own crash
> note and proceeds further.
> 
> The crash notes of CPUs remain uninitialized if the CPUs were offline or hot unplugged at the time
> of the system crash. The core-dump analyzer should be able to identify [un]initialized crash notes
> and display the information accordingly.
> 
> Thoughts?
> 
> - Sourabh

In general, I agree with your points. You've presented a strong case to go with 
for_each_possible_cpu() in crash_prepare_elf64_headers(); those crash notes would always be 
present, and we could ignore changes to cpus wrt elfcorehdr updates.

But what do we do about kexec_load() syscall? The way the userspace utility works is it determines 
cpus by:
  nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
which is not the equivalent of possible_cpus. So the complete list of cpu PT_NOTEs is not generated 
up front. We would need a solution for that?
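
One possible direction, sketched only for illustration: size nr_cpus from the
kernel's possible mask instead. The helper below is hypothetical (not existing
kexec-tools code) and assumes the sysfs file holds a single "0-N" range:

  #include <stdio.h>

  static long nr_possible_cpus(void)
  {
      FILE *f = fopen("/sys/devices/system/cpu/possible", "r");
      unsigned int first = 0, last = 0;
      long nr = -1;

      if (!f)
          return -1;
      /* the common case is a single range such as "0-7" */
      if (fscanf(f, "%u-%u", &first, &last) == 2)
          nr = last - first + 1;
      fclose(f);
      return nr;
  }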

Thanks,
eric

PS. I'll be on vacation all of next week, returning 20feb.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
  2023-02-11  0:35                   ` Eric DeVolder
@ 2023-02-13  4:40                     ` Sourabh Jain
  -1 siblings, 0 replies; 70+ messages in thread
From: Sourabh Jain @ 2023-02-13  4:40 UTC (permalink / raw)
  To: Eric DeVolder, Thomas Gleixner, linux-kernel, x86, kexec,
	ebiederm, dyoung, bhe, vgoyal
  Cc: mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky, robh,
	efault, rppt, david, konrad.wilk, boris.ostrovsky


On 11/02/23 06:05, Eric DeVolder wrote:
>
>
> On 2/10/23 00:29, Sourabh Jain wrote:
>>
>> On 10/02/23 01:09, Eric DeVolder wrote:
>>>
>>>
>>> On 2/9/23 12:43, Sourabh Jain wrote:
>>>> Hello Eric,
>>>>
>>>> On 09/02/23 23:01, Eric DeVolder wrote:
>>>>>
>>>>>
>>>>> On 2/8/23 07:44, Thomas Gleixner wrote:
>>>>>> Eric!
>>>>>>
>>>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
>>>>>>> On 2/1/23 05:33, Thomas Gleixner wrote:
>>>>>>>
>>>>>>> So my latest solution is to introduce two new CPUHP states, 
>>>>>>> CPUHP_AP_ELFCOREHDR_ONLINE
>>>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm 
>>>>>>> open to better names.
>>>>>>>
>>>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after 
>>>>>>> CPUHP_BRINGUP_CPU. My
>>>>>>> attempts at locating this state failed when inside the STARTING 
>>>>>>> section, so I located
>>>>>>> this just inside the ONLINE section. The crash hotplug handler 
>>>>>>> is registered on
>>>>>>> this state as the callback for the .startup method.
>>>>>>>
>>>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before 
>>>>>>> CPUHP_TEARDOWN_CPU, and I
>>>>>>> placed it at the end of the PREPARE section. This crash hotplug 
>>>>>>> handler is also
>>>>>>> registered on this state as the callback for the .teardown method.
>>>>>>
>>>>>> TBH, that's still overengineered. Something like this:
>>>>>>
>>>>>> bool cpu_is_alive(unsigned int cpu)
>>>>>> {
>>>>>>     struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
>>>>>>
>>>>>>     return data_race(st->state) <= CPUHP_AP_IDLE_DEAD;
>>>>>> }
>>>>>>
>>>>>> and use this to query the actual state at crash time. That spares 
>>>>>> all
>>>>>> those callback heuristics.
>>>>>>
>>>>>>> I'm making my way through percpu crash_notes, elfcorehdr, 
>>>>>>> vmcoreinfo,
>>>>>>> makedumpfile and (the consumer of it all) the userspace crash 
>>>>>>> utility,
>>>>>>> in order to understand the impact of moving from 
>>>>>>> for_each_present_cpu()
>>>>>>> to for_each_online_cpu().
>>>>>>
>>>>>> Is the packing actually worth the trouble? What's the actual win?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>>          tglx
>>>>>>
>>>>>>
>>>>>
>>>>> Thomas,
>>>>> I've investigated the passing of crash notes through the vmcore. 
>>>>> What I've learned is that:
>>>>>
>>>>> - linux/fs/proc/vmcore.c (which makedumpfile references to do its 
>>>>> job) does
>>>>>   not care what the contents of cpu PT_NOTES are, but it does 
>>>>> coalesce them together.
>>>>>
>>>>> - makedumpfile will count the number of cpu PT_NOTES in order to 
>>>>> determine its
>>>>>   nr_cpus variable, which is reported in a header, but otherwise 
>>>>> unused (except
>>>>>   for sadump method).
>>>>>
>>>>> - the crash utility, for the purposes of determining the cpus, 
>>>>> does not appear to
>>>>>   reference the elfcorehdr PT_NOTEs. Instead it locates the various
>>>>>   cpu_[possible|present|online]_mask and computes nr_cpus from 
>>>>> that, and also of
>>>>>   course which are online. In addition, when crash does reference 
>>>>> the cpu PT_NOTE,
>>>>>   to get its prstatus, it does so by using a percpu technique 
>>>>> directly in the vmcore
>>>>>   image memory, not via the ELF structure. Said differently, it 
>>>>> appears to me that
>>>>>   crash utility doesn't rely on the ELF PT_NOTEs for cpus; rather 
>>>>> it obtains them
>>>>>   via kernel cpumasks and the memory within the vmcore.
>>>>>
>>>>> With this understanding, I did some testing. Perhaps the most 
>>>>> telling test was that I
>>>>> changed the number of cpu PT_NOTEs emitted in the 
>>>>> crash_prepare_elf64_headers() to just 1,
>>>>> hot plugged some cpus, then also took a few offline sparsely via 
>>>>> chcpu, then generated a
>>>>> vmcore. The crash utility had no problem loading the vmcore, it 
>>>>> reported the proper number
>>>>> of cpus and the number offline (despite only one cpu PT_NOTE), and 
>>>>> changing to a different
>>>>> cpu via 'set -c 30' and the backtrace was completely valid.
>>>>>
>>>>> My take away is that crash utility does not rely upon ELF cpu 
>>>>> PT_NOTEs, it obtains the
>>>>> cpu information directly from kernel data structures. Perhaps at 
>>>>> one time crash relied
>>>>> upon the ELF information, but no more. (Perhaps there are other 
>>>>> crash dump analyzers
>>>>> that might rely on the ELF info?)
>>>>>
>>>>> So, all this to say that I see no need to change 
>>>>> crash_prepare_elf64_headers(). There
>>>>> is no compelling reason to move away from for_each_present_cpu(), 
>>>>> or modify the list for
>>>>> online/offline.
>>>>>
>>>>> Which then leaves the topic of the cpuhp state on which to 
>>>>> register. Perhaps reverting
>>>>> back to the use of CPUHP_BP_PREPARE_DYN is the right answer. There 
>>>>> does not appear to
>>>>> be a compelling need to accurately track whether the cpu went 
>>>>> online/offline for the
>>>>> purposes of creating the elfcorehdr, as ultimately the crash 
>>>>> utility pulls that from
>>>>> kernel data structures, not the elfcorehdr.
>>>>>
>>>>> I think this is what Sourabh has known and has been advocating for 
>>>>> an optimization
>>>>> path that allows not regenerating the elfcorehdr on cpu changes 
>>>>> (because all the percpu
>>>>> structs are all laid out). I do think it best to leave that as an 
>>>>> arch choice.
>>>>
>>>> Since things are clear on how the PT_NOTES are consumed in kdump 
>>>> kernel [fs/proc/vmcore.c],
>>>> makedumpfile, and crash tool I need your opinion on this:
>>>>
>>>> Do we really need to regenerate elfcorehdr for CPU hotplug events?
>>>> If yes, can you please list the elfcorehdr components that change 
>>>> due to CPU hotplug.
>>> Due to the use of for_each_present_cpu(), it is possible for the 
>>> number of cpu PT_NOTEs
>>> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus does 
>>> not impact the
>>> number of cpu PT_NOTEs (as the cpus are still present).
>>>
>>>>
>>>> From what I understood, crash notes are prepared for possible CPUs 
>>>> as the system boots and
>>>> could be used to create a PT_NOTE section for each possible CPU 
>>>> while generating the elfcorehdr
>>>> during the kdump kernel load.
>>>>
>>>> Now once the elfcorehdr is loaded with PT_NOTEs for every possible 
>>>> CPU there is no need to
>>>> regenerate it for CPU hotplug events. Or do we?
>>>
>>> For onlining/offlining of cpus, there is no need to regenerate the 
>>> elfcorehdr. However,
>>> for actual hot un/plug of cpus, the answer is yes due to 
>>> for_each_present_cpu(). The
>>> caveat here of course is that if crash utility is the only coredump 
>>> analyzer of concern,
>>> then it doesn't care about these cpu PT_NOTEs and there would be no 
>>> need to re-generate them.
>>>
>>> Also, I'm not sure if ARM cpu hotplug, which is just now coming into 
>>> mainstream, impacts
>>> any of this.
>>>
>>> Perhaps the one item that might help here is to distinguish between 
>>> actual hot un/plug of
>>> cpus, versus onlining/offlining. At the moment, I can not 
>>> distinguish between a hot plug
>>> event and an online event (and unplug/offline). If those were 
>>> distinguishable, then we
>>> could only regenerate on un/plug events.
>>>
>>> Or perhaps moving to for_each_possible_cpu() is the better choice?
>>
>> Yes, because once elfcorehdr is built with possible CPUs we don't 
>> have to worry about
>> the hot[un]plug case.
>>
>> Here is my view on how things should be handled if a core-dump 
>> analyzer is dependent on
>> elfcorehdr PT_NOTEs to find online/offline CPUs.
>>
>> A PT_NOTE in elfcorehdr holds the address of the corresponding crash 
>> notes (kernel has
>> one crash note per CPU for every possible CPU). Though the crash 
>> notes are allocated
>> at boot time, they are populated when the system is on the 
>> crash path.
>>
>> This is how crash notes are populated on PowerPC and I am expecting 
>> it would be something
>> similar on other architectures too.
>>
>> The crashing CPU sends an IPI to every other online CPU with a callback 
>> function that updates the
>> crash notes of that specific CPU. Once the IPI completes the crashing 
>> CPU updates its own crash
>> note and proceeds further.
>>
>> The crash notes of CPUs remain uninitialized if the CPUs were offline 
>> or hot unplugged at the time
>> of the system crash. The core-dump analyzer should be able to identify 
>> [un]initialized crash notes 
>> and display the information accordingly.
>>
>> Thoughts?
>>
>> - Sourabh
>
> In general, I agree with your points. You've presented a strong case 
> to go with for_each_possible_cpu() in crash_prepare_elf64_headers(); 
> those crash notes would always be present, and we could ignore 
> changes to cpus wrt elfcorehdr updates.
>
> But what do we do about kexec_load() syscall? The way the userspace 
> utility works is it determines cpus by:
>  nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
> which is not the equivalent of possible_cpus. So the complete list of 
> cpu PT_NOTEs is not generated up front. We would need a solution for 
> that?
Hello Eric,

The sysconf document says _SC_NPROCESSORS_CONF is processors configured, 
isn't that equivalent to possible CPUs?

What exactly does sysconf(_SC_NPROCESSORS_CONF) return on x86? IIUC, on 
PowerPC it is possible CPUs.

If sysconf(_SC_NPROCESSORS_CONF) is not consistent, then we can go with
/sys/devices/system/cpu/possible for the kexec_load case.

Thoughts?

- Sourabh Jain

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
  2023-02-13  4:40                     ` Sourabh Jain
@ 2023-02-13 12:52                       ` Thomas Gleixner
  -1 siblings, 0 replies; 70+ messages in thread
From: Thomas Gleixner @ 2023-02-13 12:52 UTC (permalink / raw)
  To: Sourabh Jain, Eric DeVolder, linux-kernel, x86, kexec, ebiederm,
	dyoung, bhe, vgoyal
  Cc: mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky, robh,
	efault, rppt, david, konrad.wilk, boris.ostrovsky

On Mon, Feb 13 2023 at 10:10, Sourabh Jain wrote:
> The sysconf document says _SC_NPROCESSORS_CONF is processors configured, 
> isn't that equivalent to possible CPUs?

glibc tries to evaluate that in the following order:

  1) /sys/devices/system/cpu/cpu*

     That's present CPUs not possible CPUs

  2) /proc/stat

     That's online CPUs

  3) sched_getaffinity()

     That's online CPUs at best. In the worst case it's an affinity mask
     which is set on a process group

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
  2023-02-13 12:52                       ` Thomas Gleixner
@ 2023-02-15  2:53                         ` Sourabh Jain
  -1 siblings, 0 replies; 70+ messages in thread
From: Sourabh Jain @ 2023-02-15  2:53 UTC (permalink / raw)
  To: Eric DeVolder, Thomas Gleixner, linux-kernel, x86, kexec,
	ebiederm, dyoung, bhe, vgoyal
  Cc: mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky, robh,
	efault, rppt, david, konrad.wilk, boris.ostrovsky


On 13/02/23 18:22, Thomas Gleixner wrote:
> On Mon, Feb 13 2023 at 10:10, Sourabh Jain wrote:
>> The sysconf document says _SC_NPROCESSORS_CONF is processors configured,
>> isn't that equivalent to possible CPUs?
> glibc tries to evaluate that in the following order:
>
>    1) /sys/devices/system/cpu/cpu*
>
>       That's present CPUs not possible CPUs
>
>    2) /proc/stat
>
>       That's online CPUs
>
>    3) sched_getaffinity()
>
>       That's online CPUs at best. In the worst case it's an affinity mask
>       which is set on a process group

Thanks for the clarification Thomas.

- Sourabh

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
  2023-02-10  6:29                 ` Sourabh Jain
@ 2023-02-23 20:34                   ` Eric DeVolder
  -1 siblings, 0 replies; 70+ messages in thread
From: Eric DeVolder @ 2023-02-23 20:34 UTC (permalink / raw)
  To: Sourabh Jain, Thomas Gleixner, linux-kernel, x86, kexec,
	ebiederm, dyoung, bhe, vgoyal
  Cc: mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky, robh,
	efault, rppt, david, konrad.wilk, boris.ostrovsky



On 2/10/23 00:29, Sourabh Jain wrote:
> 
> On 10/02/23 01:09, Eric DeVolder wrote:
>>
>>
>> On 2/9/23 12:43, Sourabh Jain wrote:
>>> Hello Eric,
>>>
>>> On 09/02/23 23:01, Eric DeVolder wrote:
>>>>
>>>>
>>>> On 2/8/23 07:44, Thomas Gleixner wrote:
>>>>> Eric!
>>>>>
>>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
>>>>>> On 2/1/23 05:33, Thomas Gleixner wrote:
>>>>>>
>>>>>> So my latest solution is to introduce two new CPUHP states, CPUHP_AP_ELFCOREHDR_ONLINE
>>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm open to better names.
>>>>>>
>>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after CPUHP_BRINGUP_CPU. My
>>>>>> attempts at locating this state failed when inside the STARTING section, so I located
>>>>>> this just inside the ONLINE section. The crash hotplug handler is registered on
>>>>>> this state as the callback for the .startup method.
>>>>>>
>>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before CPUHP_TEARDOWN_CPU, and I
>>>>>> placed it at the end of the PREPARE section. This crash hotplug handler is also
>>>>>> registered on this state as the callback for the .teardown method.
>>>>>
>>>>> TBH, that's still overengineered. Something like this:
>>>>>
>>>>> bool cpu_is_alive(unsigned int cpu)
>>>>> {
>>>>>     struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
>>>>>
>>>>>     return data_race(st->state) <= CPUHP_AP_IDLE_DEAD;
>>>>> }
>>>>>
>>>>> and use this to query the actual state at crash time. That spares all
>>>>> those callback heuristics.
>>>>>
>>>>>> I'm making my way through percpu crash_notes, elfcorehdr, vmcoreinfo,
>>>>>> makedumpfile and (the consumer of it all) the userspace crash utility,
>>>>>> in order to understand the impact of moving from for_each_present_cpu()
>>>>>> to for_each_online_cpu().
>>>>>
>>>>> Is the packing actually worth the trouble? What's the actual win?
>>>>>
>>>>> Thanks,
>>>>>
>>>>>          tglx
>>>>>
>>>>>
>>>>
>>>> Thomas,
>>>> I've investigated the passing of crash notes through the vmcore. What I've learned is that:
>>>>
>>>> - linux/fs/proc/vmcore.c (which makedumpfile references to do its job) does
>>>>   not care what the contents of cpu PT_NOTES are, but it does coalesce them together.
>>>>
>>>> - makedumpfile will count the number of cpu PT_NOTES in order to determine its
>>>>   nr_cpus variable, which is reported in a header, but otherwise unused (except
>>>>   for sadump method).
>>>>
>>>> - the crash utility, for the purposes of determining the cpus, does not appear to
>>>>   reference the elfcorehdr PT_NOTEs. Instead it locates the various
>>>>   cpu_[possible|present|online]_mask and computes nr_cpus from that, and also of
>>>>   course which are online. In addition, when crash does reference the cpu PT_NOTE,
>>>>   to get its prstatus, it does so by using a percpu technique directly in the vmcore
>>>>   image memory, not via the ELF structure. Said differently, it appears to me that
>>>>   crash utility doesn't rely on the ELF PT_NOTEs for cpus; rather it obtains them
>>>>   via kernel cpumasks and the memory within the vmcore.
>>>>
>>>> With this understanding, I did some testing. Perhaps the most telling test was that I
>>>> changed the number of cpu PT_NOTEs emitted in the crash_prepare_elf64_headers() to just 1,
>>>> hot plugged some cpus, then also took a few offline sparsely via chcpu, then generated a
>>>> vmcore. The crash utility had no problem loading the vmcore, it reported the proper number
>>>> of cpus and the number offline (despite only one cpu PT_NOTE), and changing to a different
>>>> cpu via 'set -c 30' and the backtrace was completely valid.
>>>>
>>>> My take away is that crash utility does not rely upon ELF cpu PT_NOTEs, it obtains the
>>>> cpu information directly from kernel data structures. Perhaps at one time crash relied
>>>> upon the ELF information, but no more. (Perhaps there are other crash dump analyzers
>>>> that might rely on the ELF info?)
>>>>
>>>> So, all this to say that I see no need to change crash_prepare_elf64_headers(). There
>>>> is no compelling reason to move away from for_each_present_cpu(), or modify the list for
>>>> online/offline.
>>>>
>>>> Which then leaves the topic of the cpuhp state on which to register. Perhaps reverting
>>>> back to the use of CPUHP_BP_PREPARE_DYN is the right answer. There does not appear to
>>>> be a compelling need to accurately track whether the cpu went online/offline for the
>>>> purposes of creating the elfcorehdr, as ultimately the crash utility pulls that from
>>>> kernel data structures, not the elfcorehdr.
>>>>
>>>> I think this is what Sourabh has known and has been advocating for an optimization
>>>> path that allows not regenerating the elfcorehdr on cpu changes (because all the percpu
>>>> structs are all laid out). I do think it best to leave that as an arch choice.
>>>
>>> Since things are clear on how the PT_NOTES are consumed in kdump kernel [fs/proc/vmcore.c],
>>> makedumpfile, and crash tool I need your opinion on this:
>>>
>>> Do we really need to regenerate elfcorehdr for CPU hotplug events?
>>> If yes, can you please list the elfcorehdr components that change due to CPU hotplug.
>> Due to the use of for_each_present_cpu(), it is possible for the number of cpu PT_NOTEs
>> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus does not impact the
>> number of cpu PT_NOTEs (as the cpus are still present).
>>
>>>
>>> From what I understood, crash notes are prepared for possible CPUs as the system boots and
>>> could be used to create a PT_NOTE section for each possible CPU while generating the elfcorehdr
>>> during the kdump kernel load.
>>>
>>> Now once the elfcorehdr is loaded with PT_NOTEs for every possible CPU there is no need to
>>> regenerate it for CPU hotplug events. Or do we?
>>
>> For onlining/offlining of cpus, there is no need to regenerate the elfcorehdr. However,
>> for actual hot un/plug of cpus, the answer is yes due to for_each_present_cpu(). The
>> caveat here of course is that if crash utility is the only coredump analyzer of concern,
>> then it doesn't care about these cpu PT_NOTEs and there would be no need to re-generate them.
>>
>> Also, I'm not sure if ARM cpu hotplug, which is just now coming into mainstream, impacts
>> any of this.
>>
>> Perhaps the one item that might help here is to distinguish between actual hot un/plug of
>> cpus, versus onlining/offlining. At the moment, I can not distinguish between a hot plug
>> event and an online event (and unplug/offline). If those were distinguishable, then we
>> could only regenerate on un/plug events.
>>
>> Or perhaps moving to for_each_possible_cpu() is the better choice?
> 
> Yes, because once elfcorehdr is built with possible CPUs we don't have to worry about
> the hot[un]plug case.
> 
> Here is my view on how things should be handled if a core-dump analyzer is dependent on
> elfcorehdr PT_NOTEs to find online/offline CPUs.
> 
> A PT_NOTE in elfcorehdr holds the address of the corresponding crash notes (kernel has
> one crash note per CPU for every possible CPU). Though the crash notes are allocated
> at boot time, they are populated when the system is on the crash path.
> 
> This is how crash notes are populated on PowerPC and I am expecting it would be something
> similar on other architectures too.
> 
> The crashing CPU sends an IPI to every other online CPU with a callback function that updates the
> crash notes of that specific CPU. Once the IPI completes the crashing CPU updates its own crash
> note and proceeds further.
> 
> The crash notes of CPUs remain uninitialized if the CPUs were offline or hot unplugged at the time
> of the system crash. The core-dump analyzer should be able to identify [un]initialized crash notes
> and display the information accordingly.
> 
> Thoughts?
> 
> - Sourabh

I've been examining what it would mean to move to for_each_possible_cpu() in 
crash_prepare_elf64_headers(). I think it means:

- Changing for_each_present_cpu() to for_each_possible_cpu() in crash_prepare_elf64_headers().
- For the kexec_load() syscall path, rewrite the incoming/supplied elfcorehdr immediately at load time 
with the elfcorehdr generated by crash_prepare_elf64_headers().
- Eliminate/remove the cpuhp machinery for handling crash hotplug events.

This would then set up PT_NOTEs for all possible cpus, which should in theory accommodate crash 
analyzers that rely on ELF PT_NOTEs for crash_notes.
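
For the first item, the change itself would be a one-liner; roughly, based on
the current shape of the loop in crash_prepare_elf64_headers():

  /*
   * Emit one PT_NOTE per possible cpu (was: for_each_present_cpu()),
   * so the note count no longer fluctuates with cpu hot un/plug.
   */
  for_each_possible_cpu(cpu) {
      phdr->p_type = PT_NOTE;
      notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
      phdr->p_offset = phdr->p_paddr = notes_addr;
      phdr->p_filesz = phdr->p_memsz = sizeof(note_buf_t);
      (ehdr->e_phnum)++;
      phdr++;
  }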

If staying with for_each_present_cpu() is ultimately decided, then I think we leave the cpuhp 
machinery in place and let each arch decide how to handle crash cpu hotplug events. The overhead 
for doing this is very minimal, and the events are likely very infrequent.

No matter which is decided, supporting crash hotplug for kexec_load still requires changes to the 
userspace kexec-tools utility (for excluding the elfcorehdr from the purgatory hash, and providing 
an appropriately sized elfcorehdr buffer).
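
On the sizing point, a rough illustration of the worst-case math; the helper
name and layout are made up for the example, not actual kexec-tools code:

  #include <elf.h>

  /*
   * Illustrative only: reserve one PT_NOTE per possible cpu, one for
   * vmcoreinfo, plus the memory-range program headers, so an in-kernel
   * regeneration can never outgrow the buffer.
   */
  static unsigned long elfcorehdr_bufsz(long possible_cpus, long mem_ranges)
  {
      unsigned long sz = sizeof(Elf64_Ehdr);

      sz += (possible_cpus + 1 /* vmcoreinfo */) * sizeof(Elf64_Phdr);
      sz += mem_ranges * sizeof(Elf64_Phdr);
      return sz;
  }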

I know Sourabh votes for for_each_possible_cpu(); Thomas/Boris/Baoquan/others, I'd appreciate your 
opinion/insight here!

Thanks!
eric

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
@ 2023-02-23 20:34                   ` Eric DeVolder
  0 siblings, 0 replies; 70+ messages in thread
From: Eric DeVolder @ 2023-02-23 20:34 UTC (permalink / raw)
  To: Sourabh Jain, Thomas Gleixner, linux-kernel, x86, kexec,
	ebiederm, dyoung, bhe, vgoyal
  Cc: mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky, robh,
	efault, rppt, david, konrad.wilk, boris.ostrovsky



On 2/10/23 00:29, Sourabh Jain wrote:
> 
> On 10/02/23 01:09, Eric DeVolder wrote:
>>
>>
>> On 2/9/23 12:43, Sourabh Jain wrote:
>>> Hello Eric,
>>>
>>> On 09/02/23 23:01, Eric DeVolder wrote:
>>>>
>>>>
>>>> On 2/8/23 07:44, Thomas Gleixner wrote:
>>>>> Eric!
>>>>>
>>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
>>>>>> On 2/1/23 05:33, Thomas Gleixner wrote:
>>>>>>
>>>>>> So my latest solution is introduce two new CPUHP states, CPUHP_AP_ELFCOREHDR_ONLINE
>>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm open to better names.
>>>>>>
>>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after CPUHP_BRINGUP_CPU. My
>>>>>> attempts at locating this state failed when inside the STARTING section, so I located
>>>>>> this just inside the ONLINE sectoin. The crash hotplug handler is registered on
>>>>>> this state as the callback for the .startup method.
>>>>>>
>>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before CPUHP_TEARDOWN_CPU, and I
>>>>>> placed it at the end of the PREPARE section. This crash hotplug handler is also
>>>>>> registered on this state as the callback for the .teardown method.
>>>>>
>>>>> TBH, that's still overengineered. Something like this:
>>>>>
>>>>> bool cpu_is_alive(unsigned int cpu)
>>>>> {
>>>>>     struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
>>>>>
>>>>>     return data_race(st->state) <= CPUHP_AP_IDLE_DEAD;
>>>>> }
>>>>>
>>>>> and use this to query the actual state at crash time. That spares all
>>>>> those callback heuristics.
>>>>>
>>>>>> I'm making my way though percpu crash_notes, elfcorehdr, vmcoreinfo,
>>>>>> makedumpfile and (the consumer of it all) the userspace crash utility,
>>>>>> in order to understand the impact of moving from for_each_present_cpu()
>>>>>> to for_each_online_cpu().
>>>>>
>>>>> Is the packing actually worth the trouble? What's the actual win?
>>>>>
>>>>> Thanks,
>>>>>
>>>>>          tglx
>>>>>
>>>>>
>>>>
>>>> Thomas,
>>>> I've investigated the passing of crash notes through the vmcore. What I've learned is that:
>>>>
>>>> - linux/fs/proc/vmcore.c (which makedumpfile references to do its job) does
>>>>   not care what the contents of cpu PT_NOTES are, but it does coalesce them together.
>>>>
>>>> - makedumpfile will count the number of cpu PT_NOTES in order to determine its
>>>>   nr_cpus variable, which is reported in a header, but otherwise unused (except
>>>>   for sadump method).
>>>>
>>>> - the crash utility, for the purposes of determining the cpus, does not appear to
>>>>   reference the elfcorehdr PT_NOTEs. Instead it locates the various
>>>>   cpu_[possible|present|online]_mask and computes nr_cpus from that, and also of
>>>>   course which are online. In addition, when crash does reference the cpu PT_NOTE,
>>>>   to get its prstatus, it does so by using a percpu technique directly in the vmcore
>>>>   image memory, not via the ELF structure. Said differently, it appears to me that
>>>>   crash utility doesn't rely on the ELF PT_NOTEs for cpus; rather it obtains them
>>>>   via kernel cpumasks and the memory within the vmcore.
>>>>
>>>> With this understanding, I did some testing. Perhaps the most telling test was that I
>>>> changed the number of cpu PT_NOTEs emitted in the crash_prepare_elf64_headers() to just 1,
>>>> hot plugged some cpus, then also took a few offline sparsely via chcpu, then generated a
>>>> vmcore. The crash utility had no problem loading the vmcore, it reported the proper number
>>>> of cpus and the number offline (despite only one cpu PT_NOTE), and changing to a different
>>>> cpu via 'set -c 30' and the backtrace was completely valid.
>>>>
>>>> My take away is that crash utility does not rely upon ELF cpu PT_NOTEs, it obtains the
>>>> cpu information directly from kernel data structures. Perhaps at one time crash relied
>>>> upon the ELF information, but no more. (Perhaps there are other crash dump analyzers
>>>> that might rely on the ELF info?)
>>>>
>>>> So, all this to say that I see no need to change crash_prepare_elf64_headers(). There
>>>> is no compelling reason to move away from for_each_present_cpu(), or modify the list for
>>>> online/offline.
>>>>
>>>> Which then leaves the topic of the cpuhp state on which to register. Perhaps reverting
>>>> back to the use of CPUHP_BP_PREPARE_DYN is the right answer. There does not appear to
>>>> be a compelling need to accurately track whether the cpu went online/offline for the
>>>> purposes of creating the elfcorehdr, as ultimately the crash utility pulls that from
>>>> kernel data structures, not the elfcorehdr.
>>>>
>>>> I think this is what Sourabh has known and has been advocating for an optimization
>>>> path that allows not regenerating the elfcorehdr on cpu changes (because all the percpu
>>>> structs are all laid out). I do think it best to leave that as an arch choice.
>>>
>>> Since things are clear on how the PT_NOTES are consumed in kdump kernel [fs/proc/vmcore.c],
>>> makedumpfile, and crash tool I need your opinion on this:
>>>
>>> Do we really need to regenerate elfcorehdr for CPU hotplug events?
>>> If yes, can you please list the elfcorehdr components that change due to CPU hotplug.
>> Due to the use of for_each_present_cpu(), it is possible for the number of cpu PT_NOTEs
>> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus does not impact the
>> number of cpu PT_NOTEs (as the cpus are still present).
>>
>>>
>>> From what I understood, crash notes are prepared for possible CPUs as system boots and
>>> could be used to create a PT_NOTE section for each possible CPU while generating the elfcorehdr
>>> during the kdump kernel load.
>>>
>>> Now once the elfcorehdr is loaded with PT_NOTEs for every possible CPU there is no need to
>>> regenerate it for CPU hotplug events. Or do we?
>>
>> For onlining/offlining of cpus, there is no need to regenerate the elfcorehdr. However,
>> for actual hot un/plug of cpus, the answer is yes due to for_each_present_cpu(). The
>> caveat here of course is that if crash utility is the only coredump analyzer of concern,
>> then it doesn't care about these cpu PT_NOTEs and there would be no need to re-generate them.
>>
>> Also, I'm not sure if ARM cpu hotplug, which is just now coming into mainstream, impacts
>> any of this.
>>
>> Perhaps the one item that might help here is to distinguish between actual hot un/plug of
>> cpus, versus onlining/offlining. At the moment, I can not distinguish between a hot plug
>> event and an online event (and unplug/offline). If those were distinguishable, then we
>> could only regenerate on un/plug events.
>>
>> Or perhaps moving to for_each_possible_cpu() is the better choice?
> 
> Yes, because once elfcorehdr is built with possible CPUs we don't have to worry about
> hot[un]plug case.
> 
> Here is my view on how things should be handled if a core-dump analyzer is dependent on
> elfcorehdr PT_NOTEs to find online/offline CPUs.
> 
> A PT_NOTE in elfcorehdr holds the address of the corresponding crash notes (kernel has
> one crash note per CPU for every possible CPU). Though the crash notes are allocated
> during the boot time they are populated when the system is on the crash path.
> 
> This is how crash notes are populated on PowerPC and I am expecting it would be something
> similar on other architectures too.
> 
> The crashing CPU sends IPI to every other online CPU with a callback function that updates the
> crash notes of that specific CPU. Once the IPI completes the crashing CPU updates its own crash
> note and proceeds further.
> 
> The crash notes of CPUs remain uninitialized if the CPUs were offline or hot unplugged at the time of the
> system crash. The core-dump analyzer should be able to identify [un]initialized crash notes
> and display the information accordingly.
> 
> Thoughts?
> 
> - Sourabh
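
To make the sequence Sourabh describes concrete, here is a rough sketch modeled on the
powerpc code (arch/powerpc/kexec/crash.c); crash_send_ipi() and crash_ipi_callback()
come from there, with the bodies trimmed down to just the crash-notes steps:

/* Runs on every other online cpu via the crash IPI. */
static void crash_ipi_callback(struct pt_regs *regs)
{
        crash_save_cpu(regs, smp_processor_id()); /* fill this cpu's percpu crash note */
        /* ... park the cpu with interrupts off ... */
}

void default_machine_crash_shutdown(struct pt_regs *regs)
{
        /* Ask every other online cpu to save its own crash note. */
        crash_send_ipi(crash_ipi_callback);

        /* ... wait (with a timeout) for the other cpus to respond ... */

        /* Finally save the crashing cpu's own note and proceed. */
        crash_save_cpu(regs, smp_processor_id());
}

Offline or unplugged cpus never run the callback, which is why their notes remain
uninitialized in the resulting dump.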

I've been examining what it would mean to move to for_each_possible_cpu() in 
crash_prepare_elf64_headers(). I think it means:

- Changing for_each_present_cpu() to for_each_possible_cpu() in crash_prepare_elf64_headers().
- For the kexec_load() syscall path, rewrite the incoming/supplied elfcorehdr immediately on load 
with the elfcorehdr generated by crash_prepare_elf64_headers().
- Eliminate/remove the cpuhp machinery for handling crash hotplug events.
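
For the first item, the heart of it is the loop that emits one PT_NOTE phdr per cpu, each 
pointing at that cpu's crash_notes buffer; simplified from the kernel's 
crash_prepare_elf64_headers(), with only the iterator changed:

        /* Prepare one phdr of type PT_NOTE for each possible cpu */
        for_each_possible_cpu(cpu) {            /* was: for_each_present_cpu(cpu) */
                phdr->p_type = PT_NOTE;
                notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
                phdr->p_offset = phdr->p_paddr = notes_addr;
                phdr->p_filesz = phdr->p_memsz = sizeof(note_buf_t);
                (ehdr->e_phnum)++;
                phdr++;
        }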

This would then set up PT_NOTEs for all possible cpus, which should in theory accommodate crash 
analyzers that rely on ELF PT_NOTEs for crash_notes.

If staying with for_each_present_cpu() is ultimately decided, then I think we leave the cpuhp 
machinery in place and let each arch decide how to handle crash cpu hotplug events. The overhead 
for doing this is very minimal, and the events are likely very infrequent.

No matter which is decided, to support crash hotplug for kexec_load still requires changes to the 
userspace kexec-tools utility (for excluding the elfcorehdr from the purgatory hash, and providing 
an appropriately sized elfcorehdr buffer).

I know Sourabh votes for for_each_possible_cpu(); Thomas/Boris/Baoquan/others, I'd appreciate your 
opinion/insight here!

Thanks!
eric


* Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
  2023-02-23 20:34                   ` Eric DeVolder
@ 2023-02-24  8:34                     ` Sourabh Jain
  -1 siblings, 0 replies; 70+ messages in thread
From: Sourabh Jain @ 2023-02-24  8:34 UTC (permalink / raw)
  To: Eric DeVolder, Thomas Gleixner, linux-kernel, x86, kexec,
	ebiederm, dyoung, bhe, vgoyal
  Cc: mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky, robh,
	efault, rppt, david, konrad.wilk, boris.ostrovsky


On 24/02/23 02:04, Eric DeVolder wrote:
>
>
> On 2/10/23 00:29, Sourabh Jain wrote:
>>
>> On 10/02/23 01:09, Eric DeVolder wrote:
>>>
>>>
>>> On 2/9/23 12:43, Sourabh Jain wrote:
>>>> Hello Eric,
>>>>
>>>> On 09/02/23 23:01, Eric DeVolder wrote:
>>>>>
>>>>>
>>>>> On 2/8/23 07:44, Thomas Gleixner wrote:
>>>>>> Eric!
>>>>>>
>>>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
>>>>>>> On 2/1/23 05:33, Thomas Gleixner wrote:
>>>>>>>
>>>>>>> So my latest solution is introduce two new CPUHP states, 
>>>>>>> CPUHP_AP_ELFCOREHDR_ONLINE
>>>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm 
>>>>>>> open to better names.
>>>>>>>
>>>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after 
>>>>>>> CPUHP_BRINGUP_CPU. My
>>>>>>> attempts at locating this state failed when inside the STARTING 
>>>>>>> section, so I located
>>>>>>> this just inside the ONLINE section. The crash hotplug handler 
>>>>>>> is registered on
>>>>>>> this state as the callback for the .startup method.
>>>>>>>
>>>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before 
>>>>>>> CPUHP_TEARDOWN_CPU, and I
>>>>>>> placed it at the end of the PREPARE section. This crash hotplug 
>>>>>>> handler is also
>>>>>>> registered on this state as the callback for the .teardown method.
>>>>>>
>>>>>> TBH, that's still overengineered. Something like this:
>>>>>>
>>>>>> bool cpu_is_alive(unsigned int cpu)
>>>>>> {
>>>>>>     struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
>>>>>>
>>>>>>     return data_race(st->state) > CPUHP_AP_IDLE_DEAD;
>>>>>> }
>>>>>>
>>>>>> and use this to query the actual state at crash time. That spares 
>>>>>> all
>>>>>> those callback heuristics.
>>>>>>
>>>>>>> I'm making my way through percpu crash_notes, elfcorehdr, 
>>>>>>> vmcoreinfo,
>>>>>>> makedumpfile and (the consumer of it all) the userspace crash 
>>>>>>> utility,
>>>>>>> in order to understand the impact of moving from 
>>>>>>> for_each_present_cpu()
>>>>>>> to for_each_online_cpu().
>>>>>>
>>>>>> Is the packing actually worth the trouble? What's the actual win?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>>          tglx
>>>>>>
>>>>>>
>>>>>
>>>>> Thomas,
>>>>> I've investigated the passing of crash notes through the vmcore. 
>>>>> What I've learned is that:
>>>>>
>>>>> - linux/fs/proc/vmcore.c (which makedumpfile references to do its 
>>>>> job) does
>>>>>   not care what the contents of cpu PT_NOTES are, but it does 
>>>>> coalesce them together.
>>>>>
>>>>> - makedumpfile will count the number of cpu PT_NOTES in order to 
>>>>> determine its
>>>>>   nr_cpus variable, which is reported in a header, but otherwise 
>>>>> unused (except
>>>>>   for sadump method).
>>>>>
>>>>> - the crash utility, for the purposes of determining the cpus, 
>>>>> does not appear to
>>>>>   reference the elfcorehdr PT_NOTEs. Instead it locates the various
>>>>>   cpu_[possible|present|online]_mask and computes nr_cpus from 
>>>>> that, and also of
>>>>>   course which are online. In addition, when crash does reference 
>>>>> the cpu PT_NOTE,
>>>>>   to get its prstatus, it does so by using a percpu technique 
>>>>> directly in the vmcore
>>>>>   image memory, not via the ELF structure. Said differently, it 
>>>>> appears to me that
>>>>>   crash utility doesn't rely on the ELF PT_NOTEs for cpus; rather 
>>>>> it obtains them
>>>>>   via kernel cpumasks and the memory within the vmcore.
>>>>>
>>>>> With this understanding, I did some testing. Perhaps the most 
>>>>> telling test was that I
>>>>> changed the number of cpu PT_NOTEs emitted in the 
>>>>> crash_prepare_elf64_headers() to just 1,
>>>>> hot plugged some cpus, then also took a few offline sparsely via 
>>>>> chcpu, then generated a
>>>>> vmcore. The crash utility had no problem loading the vmcore; it 
>>>>> reported the proper number
>>>>> of cpus and the number offline (despite only one cpu PT_NOTE), and 
>>>>> switching to a different
>>>>> cpu via 'set -c 30' produced a completely valid backtrace.
>>>>>
>>>>> My take away is that crash utility does not rely upon ELF cpu 
>>>>> PT_NOTEs, it obtains the
>>>>> cpu information directly from kernel data structures. Perhaps at 
>>>>> one time crash relied
>>>>> upon the ELF information, but no more. (Perhaps there are other 
>>>>> crash dump analyzers
>>>>> that might rely on the ELF info?)
>>>>>
>>>>> So, all this to say that I see no need to change 
>>>>> crash_prepare_elf64_headers(). There
>>>>> is no compelling reason to move away from for_each_present_cpu(), 
>>>>> or modify the list for
>>>>> online/offline.
>>>>>
>>>>> Which then leaves the topic of the cpuhp state on which to 
>>>>> register. Perhaps reverting
>>>>> back to the use of CPUHP_BP_PREPARE_DYN is the right answer. There 
>>>>> does not appear to
>>>>> be a compelling need to accurately track whether the cpu went 
>>>>> online/offline for the
>>>>> purposes of creating the elfcorehdr, as ultimately the crash 
>>>>> utility pulls that from
>>>>> kernel data structures, not the elfcorehdr.
>>>>>
>>>>> I think this is what Sourabh has known and has been advocating for 
>>>>> an optimization
>>>>> path that allows not regenerating the elfcorehdr on cpu changes 
>>>>> (because all the percpu
>>>>> structs are already laid out). I do think it best to leave that as an 
>>>>> arch choice.
>>>>
>>>> Since things are clear on how the PT_NOTES are consumed in kdump 
>>>> kernel [fs/proc/vmcore.c],
>>>> makedumpfile, and crash tool I need your opinion on this:
>>>>
>>>> Do we really need to regenerate elfcorehdr for CPU hotplug events?
>>>> If yes, can you please list the elfcorehdr components that change 
>>>> due to CPU hotplug.
>>> Due to the use of for_each_present_cpu(), it is possible for the 
>>> number of cpu PT_NOTEs
>>> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus does 
>>> not impact the
>>> number of cpu PT_NOTEs (as the cpus are still present).
>>>
>>>>
>>>> From what I understood, crash notes are prepared for possible CPUs 
>>>> as system boots and
>>>> could be used to create a PT_NOTE section for each possible CPU 
>>>> while generating the elfcorehdr
>>>> during the kdump kernel load.
>>>>
>>>> Now once the elfcorehdr is loaded with PT_NOTEs for every possible 
>>>> CPU there is no need to
>>>> regenerate it for CPU hotplug events. Or do we?
>>>
>>> For onlining/offlining of cpus, there is no need to regenerate the 
>>> elfcorehdr. However,
>>> for actual hot un/plug of cpus, the answer is yes due to 
>>> for_each_present_cpu(). The
>>> caveat here of course is that if crash utility is the only coredump 
>>> analyzer of concern,
>>> then it doesn't care about these cpu PT_NOTEs and there would be no 
>>> need to re-generate them.
>>>
>>> Also, I'm not sure if ARM cpu hotplug, which is just now coming into 
>>> mainstream, impacts
>>> any of this.
>>>
>>> Perhaps the one item that might help here is to distinguish between 
>>> actual hot un/plug of
>>> cpus, versus onlining/offlining. At the moment, I can not 
>>> distinguish between a hot plug
>>> event and an online event (and unplug/offline). If those were 
>>> distinguishable, then we
>>> could only regenerate on un/plug events.
>>>
>>> Or perhaps moving to for_each_possible_cpu() is the better choice?
>>
>> Yes, because once elfcorehdr is built with possible CPUs we don't 
>> have to worry about
>> hot[un]plug case.
>>
>> Here is my view on how things should be handled if a core-dump 
>> analyzer is dependent on
>> elfcorehdr PT_NOTEs to find online/offline CPUs.
>>
>> A PT_NOTE in elfcorehdr holds the address of the corresponding crash 
>> notes (kernel has
>> one crash note per CPU for every possible CPU). Though the crash 
>> notes are allocated
>> during the boot time they are populated when the system is on the 
>> crash path.
>>
>> This is how crash notes are populated on PowerPC and I am expecting 
>> it would be something
>> similar on other architectures too.
>>
>> The crashing CPU sends IPI to every other online CPU with a callback 
>> function that updates the
>> crash notes of that specific CPU. Once the IPI completes the crashing 
>> CPU updates its own crash
>> note and proceeds further.
>>
>> The crash notes of CPUs remain uninitialized if the CPUs were offline 
>> or hot unplugged at the time of the
>> system crash. The core-dump analyzer should be able to identify 
>> [un]initialized crash notes
>> and display the information accordingly.
>>
>> Thoughts?
>>
>> - Sourabh
>
> I've been examining what it would mean to move to 
> for_each_possible_cpu() in crash_prepare_elf64_headers(). I think it 
> means:
>
> - Changing for_each_present_cpu() to for_each_possible_cpu() in 
> crash_prepare_elf64_headers().
> - For kexec_load() syscall path, rewrite the incoming/supplied 
> elfcorehdr immediately on the load with the elfcorehdr generated by 
> crash_prepare_elf64_headers().
> - Eliminate/remove the cpuhp machinery for handling crash hotplug events.

If for_each_present_cpu() is replaced with for_each_possible_cpu(), I still 
need the cpuhp machinery to update the FDT kexec segment for the CPU hot 
add case.
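
For reference, that machinery is small; a minimal sketch of such a registration, where 
crash_fdt_hotplug_cpu() stands in for a hypothetical arch handler that refreshes the 
FDT kexec segment:

static int crash_fdt_hotplug_cpu(unsigned int cpu)
{
        /* Regenerate/patch the FDT kexec segment for the new set of cpus. */
        return 0;
}

static int __init crash_fdt_hp_init(void)
{
        int ret;

        /* One dynamic state covers both the hot add and hot remove paths. */
        ret = cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN, "crash/fdt:prepare",
                                        crash_fdt_hotplug_cpu,   /* startup */
                                        crash_fdt_hotplug_cpu);  /* teardown */
        return ret < 0 ? ret : 0;
}
subsys_initcall(crash_fdt_hp_init);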


>
> This would then setup PT_NOTEs for all possible cpus, which should in 
> theory accommodate crash analyzers that rely on ELF PT_NOTEs for 
> crash_notes.
>
> If staying with for_each_present_cpu() is ultimately decided, then I 
> think leaving the cpuhp machinery in place and each arch could decide 
> how to handle crash cpu hotplug events. The overhead for doing this is 
> very minimal, and the events are likely very infrequent.

I agree. Some architectures may need cpuhp machinery to update kexec 
segment[s] other than the elfcorehdr. For example, the FDT on PowerPC.

- Sourabh Jain


* Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
  2023-02-24  8:34                     ` Sourabh Jain
@ 2023-02-24 20:16                       ` Eric DeVolder
  -1 siblings, 0 replies; 70+ messages in thread
From: Eric DeVolder @ 2023-02-24 20:16 UTC (permalink / raw)
  To: Sourabh Jain, Thomas Gleixner, linux-kernel, x86, kexec,
	ebiederm, dyoung, bhe, vgoyal
  Cc: mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky, robh,
	efault, rppt, david, konrad.wilk, boris.ostrovsky



On 2/24/23 02:34, Sourabh Jain wrote:
> 
> On 24/02/23 02:04, Eric DeVolder wrote:
>>
>>
>> On 2/10/23 00:29, Sourabh Jain wrote:
>>>
>>> On 10/02/23 01:09, Eric DeVolder wrote:
>>>>
>>>>
>>>> On 2/9/23 12:43, Sourabh Jain wrote:
>>>>> Hello Eric,
>>>>>
>>>>> On 09/02/23 23:01, Eric DeVolder wrote:
>>>>>>
>>>>>>
>>>>>> On 2/8/23 07:44, Thomas Gleixner wrote:
>>>>>>> Eric!
>>>>>>>
>>>>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
>>>>>>>> On 2/1/23 05:33, Thomas Gleixner wrote:
>>>>>>>>
>>>>>>>> So my latest solution is introduce two new CPUHP states, CPUHP_AP_ELFCOREHDR_ONLINE
>>>>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm open to better names.
>>>>>>>>
>>>>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after CPUHP_BRINGUP_CPU. My
>>>>>>>> attempts at locating this state failed when inside the STARTING section, so I located
>>>>>>>> this just inside the ONLINE section. The crash hotplug handler is registered on
>>>>>>>> this state as the callback for the .startup method.
>>>>>>>>
>>>>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before CPUHP_TEARDOWN_CPU, and I
>>>>>>>> placed it at the end of the PREPARE section. This crash hotplug handler is also
>>>>>>>> registered on this state as the callback for the .teardown method.
>>>>>>>
>>>>>>> TBH, that's still overengineered. Something like this:
>>>>>>>
>>>>>>> bool cpu_is_alive(unsigned int cpu)
>>>>>>> {
>>>>>>>     struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
>>>>>>>
>>>>>>>     return data_race(st->state) > CPUHP_AP_IDLE_DEAD;
>>>>>>> }
>>>>>>>
>>>>>>> and use this to query the actual state at crash time. That spares all
>>>>>>> those callback heuristics.
>>>>>>>
>>>>>>>> I'm making my way through percpu crash_notes, elfcorehdr, vmcoreinfo,
>>>>>>>> makedumpfile and (the consumer of it all) the userspace crash utility,
>>>>>>>> in order to understand the impact of moving from for_each_present_cpu()
>>>>>>>> to for_each_online_cpu().
>>>>>>>
>>>>>>> Is the packing actually worth the trouble? What's the actual win?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>>          tglx
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> Thomas,
>>>>>> I've investigated the passing of crash notes through the vmcore. What I've learned is that:
>>>>>>
>>>>>> - linux/fs/proc/vmcore.c (which makedumpfile references to do its job) does
>>>>>>   not care what the contents of cpu PT_NOTES are, but it does coalesce them together.
>>>>>>
>>>>>> - makedumpfile will count the number of cpu PT_NOTES in order to determine its
>>>>>>   nr_cpus variable, which is reported in a header, but otherwise unused (except
>>>>>>   for sadump method).
>>>>>>
>>>>>> - the crash utility, for the purposes of determining the cpus, does not appear to
>>>>>>   reference the elfcorehdr PT_NOTEs. Instead it locates the various
>>>>>>   cpu_[possible|present|online]_mask and computes nr_cpus from that, and also of
>>>>>>   course which are online. In addition, when crash does reference the cpu PT_NOTE,
>>>>>>   to get its prstatus, it does so by using a percpu technique directly in the vmcore
>>>>>>   image memory, not via the ELF structure. Said differently, it appears to me that
>>>>>>   crash utility doesn't rely on the ELF PT_NOTEs for cpus; rather it obtains them
>>>>>>   via kernel cpumasks and the memory within the vmcore.
>>>>>>
>>>>>> With this understanding, I did some testing. Perhaps the most telling test was that I
>>>>>> changed the number of cpu PT_NOTEs emitted in the crash_prepare_elf64_headers() to just 1,
>>>>>> hot plugged some cpus, then also took a few offline sparsely via chcpu, then generated a
>>>>>> vmcore. The crash utility had no problem loading the vmcore; it reported the proper number
>>>>>> of cpus and the number offline (despite only one cpu PT_NOTE), and switching to a different
>>>>>> cpu via 'set -c 30' produced a completely valid backtrace.
>>>>>>
>>>>>> My take away is that crash utility does not rely upon ELF cpu PT_NOTEs, it obtains the
>>>>>> cpu information directly from kernel data structures. Perhaps at one time crash relied
>>>>>> upon the ELF information, but no more. (Perhaps there are other crash dump analyzers
>>>>>> that might rely on the ELF info?)
>>>>>>
>>>>>> So, all this to say that I see no need to change crash_prepare_elf64_headers(). There
>>>>>> is no compelling reason to move away from for_each_present_cpu(), or modify the list for
>>>>>> online/offline.
>>>>>>
>>>>>> Which then leaves the topic of the cpuhp state on which to register. Perhaps reverting
>>>>>> back to the use of CPUHP_BP_PREPARE_DYN is the right answer. There does not appear to
>>>>>> be a compelling need to accurately track whether the cpu went online/offline for the
>>>>>> purposes of creating the elfcorehdr, as ultimately the crash utility pulls that from
>>>>>> kernel data structures, not the elfcorehdr.
>>>>>>
>>>>>> I think this is what Sourabh has known and has been advocating for an optimization
>>>>>> path that allows not regenerating the elfcorehdr on cpu changes (because all the percpu
>>>>>> structs are already laid out). I do think it best to leave that as an arch choice.
>>>>>
>>>>> Since things are clear on how the PT_NOTES are consumed in kdump kernel [fs/proc/vmcore.c],
>>>>> makedumpfile, and crash tool I need your opinion on this:
>>>>>
>>>>> Do we really need to regenerate elfcorehdr for CPU hotplug events?
>>>>> If yes, can you please list the elfcorehdr components that change due to CPU hotplug.
>>>> Due to the use of for_each_present_cpu(), it is possible for the number of cpu PT_NOTEs
>>>> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus does not impact the
>>>> number of cpu PT_NOTEs (as the cpus are still present).
>>>>
>>>>>
>>>>> From what I understood, crash notes are prepared for possible CPUs as system boots and
>>>>> could be used to create a PT_NOTE section for each possible CPU while generating the elfcorehdr
>>>>> during the kdump kernel load.
>>>>>
>>>>> Now once the elfcorehdr is loaded with PT_NOTEs for every possible CPU there is no need to
>>>>> regenerate it for CPU hotplug events. Or do we?
>>>>
>>>> For onlining/offlining of cpus, there is no need to regenerate the elfcorehdr. However,
>>>> for actual hot un/plug of cpus, the answer is yes due to for_each_present_cpu(). The
>>>> caveat here of course is that if crash utility is the only coredump analyzer of concern,
>>>> then it doesn't care about these cpu PT_NOTEs and there would be no need to re-generate them.
>>>>
>>>> Also, I'm not sure if ARM cpu hotplug, which is just now coming into mainstream, impacts
>>>> any of this.
>>>>
>>>> Perhaps the one item that might help here is to distinguish between actual hot un/plug of
>>>> cpus, versus onlining/offlining. At the moment, I can not distinguish between a hot plug
>>>> event and an online event (and unplug/offline). If those were distinguishable, then we
>>>> could only regenerate on un/plug events.
>>>>
>>>> Or perhaps moving to for_each_possible_cpu() is the better choice?
>>>
>>> Yes, because once elfcorehdr is built with possible CPUs we don't have to worry about
>>> hot[un]plug case.
>>>
>>> Here is my view on how things should be handled if a core-dump analyzer is dependent on
>>> elfcorehdr PT_NOTEs to find online/offline CPUs.
>>>
>>> A PT_NOTE in elfcorehdr holds the address of the corresponding crash notes (kernel has
>>> one crash note per CPU for every possible CPU). Though the crash notes are allocated
>>> during the boot time they are populated when the system is on the crash path.
>>>
>>> This is how crash notes are populated on PowerPC and I am expecting it would be something
>>> similar on other architectures too.
>>>
>>> The crashing CPU sends IPI to every other online CPU with a callback function that updates the
>>> crash notes of that specific CPU. Once the IPI completes the crashing CPU updates its own crash
>>> note and proceeds further.
>>>
>>> The crash notes of CPUs remain uninitialized if the CPUs were offline or hot unplugged at the time of the
>>> system crash. The core-dump analyzer should be able to identify [un]initialized crash notes
>>> and display the information accordingly.
>>>
>>> Thoughts?
>>>
>>> - Sourabh
>>
>> I've been examining what it would mean to move to for_each_possible_cpu() in 
>> crash_prepare_elf64_headers(). I think it means:
>>
>> - Changing for_each_present_cpu() to for_each_possible_cpu() in crash_prepare_elf64_headers().
>> - For kexec_load() syscall path, rewrite the incoming/supplied elfcorehdr immediately on the load 
>> with the elfcorehdr generated by crash_prepare_elf64_headers().
>> - Eliminate/remove the cpuhp machinery for handling crash hotplug events.
> 
> If for_each_present_cpu is replaced with for_each_possible_cpu I still need cpuhp machinery
> to update FDT kexec segment for CPU hot add case.

Ah, ok, that's important! So the cpuhp callbacks are still needed.
> 
> 
>>
>> This would then setup PT_NOTEs for all possible cpus, which should in theory accommodate crash 
>> analyzers that rely on ELF PT_NOTEs for crash_notes.
>>
>> If staying with for_each_present_cpu() is ultimately decided, then I think leaving the cpuhp 
>> machinery in place and each arch could decide how to handle crash cpu hotplug events. The overhead 
>> for doing this is very minimal, and the events are likely very infrequent.
> 
> I agree. Some architectures may need cpuhp machinery to update kexec segment[s] other than 
> elfcorehdr. For example FDT on PowerPC.
> 
> - Sourabh Jain

OK, I was thinking that the desire was to eliminate the cpuhp callbacks. In reality, the desire is 
to change to for_each_possible_cpu(). Given that the kernel creates crash_notes for all possible 
cpus upon kernel boot, there seems to be no reason not to do this?
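
(That boot-time allocation is just a percpu buffer, so it implicitly covers every possible 
cpu; condensed from crash_notes_memory_init() in kernel/kexec_core.c:)

static note_buf_t __percpu *crash_notes;

static int __init crash_notes_memory_init(void)
{
        /* One note-sized slot per possible cpu, suitably aligned. */
        crash_notes = __alloc_percpu(sizeof(note_buf_t),
                                     min(roundup_pow_of_two(sizeof(note_buf_t)),
                                         PAGE_SIZE));
        return crash_notes ? 0 : -ENOMEM;
}
subsys_initcall(crash_notes_memory_init);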

HOWEVER...

It's not clear to me that this particular change needs to be part of this series. Its inclusion 
would facilitate PPC support, but doesn't "solve" anything in general. In fact it causes kexec_load 
and kexec_file_load to deviate (kexec_load via userspace kexec does the equivalent of 
for_each_present_cpu() whereas with this change kexec_file_load would do for_each_possible_cpu(), 
until a hot plug event, after which both would do for_each_possible_cpu()). And if this change were to 
arrive as part of Sourabh's PPC support, then it does not appear to impact x86 (not sure about other 
arches). And the 'crash' dump analyzer doesn't care either way.

Including this change would enable an optimization path (for x86 at least) that short-circuits cpu 
hotplug changes in the arch crash handler, for example:

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index aca3f1817674..0883f6b11de4 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -473,6 +473,11 @@ void arch_crash_handle_hotplug_event(struct kimage *image)
     unsigned long mem, memsz;
     unsigned long elfsz = 0;

+   if (image->file_mode && (
+       image->hp_action == KEXEC_CRASH_HP_ADD_CPU ||
+       image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU))
+       return;
+
     /*
      * Create the new elfcorehdr reflecting the changes to CPU and/or
      * memory resources.

I'm not sure that is compelling given the infrequent nature of cpu hotplug events.

In my mind I still have a question about the kexec_load() path. The userspace kexec utility cannot 
do the equivalent of for_each_possible_cpu(). It can obtain the max possible cpus from 
/sys/devices/system/cpu/possible, but for cpus not present, the /sys/devices/system/cpu/cpuXX 
entries are not available, and so neither are their crash_notes. My attempts to expose all cpuXX 
led to odd behavior requiring changes in ACPI and arch code that looked untenable.
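
For reference, a stripped-down sketch of the per-cpu sysfs lookup that userspace does (the 
real version lives in kexec-tools' crashdump code; the helper name here is made up):

#include <stdio.h>

/* Fill *addr with the physical address of @cpu's crash note buffer.
 * Returns -1 when the cpuXX directory is absent, which is exactly the
 * possible-but-not-present case described above. */
static int read_crash_notes(int cpu, unsigned long long *addr)
{
        char path[64];
        FILE *fp;
        int ok;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/crash_notes", cpu);
        fp = fopen(path, "r");
        if (!fp)
                return -1;
        ok = (fscanf(fp, "%llx", addr) == 1);
        fclose(fp);
        return ok ? 0 : -1;
}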

There seem to be these options available for the kexec_load() path:
- immediately rewrite the elfcorehdr upon load via a call to crash_prepare_elf64_headers(). I've 
made this work with the following, as proof of concept:

diff --git a/kernel/kexec.c b/kernel/kexec.c
index cb8e6e6f983c..4eb201270f97 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -163,6 +163,12 @@ static int do_kexec_load(unsigned long entry, unsigned long
     kimage_free(image);
  out_unlock:
     kexec_unlock();
+   if (IS_ENABLED(CONFIG_CRASH_HOTPLUG)) {
+       if ((flags & KEXEC_ON_CRASH) && kexec_crash_image) {
+           crash_handle_hotplug_event(KEXEC_CRASH_HP_NONE, KEXEC_CRASH_HP_INVALID_CPU);
+       }
+   }
     return ret;
  }

- Another option is to spend the time to determine whether exposing all cpuXX is a viable solution; I 
have no idea what the impact on userspace of possible-but-not-yet-present cpuXX entries would 
be. It might also mean requiring a 'present' entry available within each cpuXX.

- Another option is to simply let the hot plug events rewrite the elfcorehdr on demand. This is what 
I originally put forth, but I'm not sure how this impacts PPC given the for_each_possible_cpu() change.

The concern is that today, both kexec_load and kexec_file_load mirror each other with respect to 
for_each_present_cpu(); that is, userspace kexec generates the same elfcorehdr for cpus as 
kexec_file_load would. But by changing to for_each_possible_cpu(), the two would deviate.

Thoughts?
eric


* Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
  2023-02-24 20:16                       ` Eric DeVolder
@ 2023-02-27  6:11                         ` Sourabh Jain
  -1 siblings, 0 replies; 70+ messages in thread
From: Sourabh Jain @ 2023-02-27  6:11 UTC (permalink / raw)
  To: Eric DeVolder, Thomas Gleixner, linux-kernel, x86, kexec,
	ebiederm, dyoung, bhe, vgoyal
  Cc: mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky, robh,
	efault, rppt, david, konrad.wilk, boris.ostrovsky


On 25/02/23 01:46, Eric DeVolder wrote:
>
>
> On 2/24/23 02:34, Sourabh Jain wrote:
>>
>> On 24/02/23 02:04, Eric DeVolder wrote:
>>>
>>>
>>> On 2/10/23 00:29, Sourabh Jain wrote:
>>>>
>>>> On 10/02/23 01:09, Eric DeVolder wrote:
>>>>>
>>>>>
>>>>> On 2/9/23 12:43, Sourabh Jain wrote:
>>>>>> Hello Eric,
>>>>>>
>>>>>> On 09/02/23 23:01, Eric DeVolder wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2/8/23 07:44, Thomas Gleixner wrote:
>>>>>>>> Eric!
>>>>>>>>
>>>>>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
>>>>>>>>> On 2/1/23 05:33, Thomas Gleixner wrote:
>>>>>>>>>
>>>>>>>>> So my latest solution is introduce two new CPUHP states, 
>>>>>>>>> CPUHP_AP_ELFCOREHDR_ONLINE
>>>>>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. 
>>>>>>>>> I'm open to better names.
>>>>>>>>>
>>>>>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after 
>>>>>>>>> CPUHP_BRINGUP_CPU. My
>>>>>>>>> attempts at locating this state failed when inside the 
>>>>>>>>> STARTING section, so I located
>>>>>>>>> this just inside the ONLINE section. The crash hotplug handler 
>>>>>>>>> is registered on
>>>>>>>>> this state as the callback for the .startup method.
>>>>>>>>>
>>>>>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before 
>>>>>>>>> CPUHP_TEARDOWN_CPU, and I
>>>>>>>>> placed it at the end of the PREPARE section. This crash 
>>>>>>>>> hotplug handler is also
>>>>>>>>> registered on this state as the callback for the .teardown 
>>>>>>>>> method.
>>>>>>>>
>>>>>>>> TBH, that's still overengineered. Something like this:
>>>>>>>>
>>>>>>>> bool cpu_is_alive(unsigned int cpu)
>>>>>>>> {
>>>>>>>>     struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
>>>>>>>>
>>>>>>>>     return data_race(st->state) > CPUHP_AP_IDLE_DEAD;
>>>>>>>> }
>>>>>>>>
>>>>>>>> and use this to query the actual state at crash time. That 
>>>>>>>> spares all
>>>>>>>> those callback heuristics.
>>>>>>>>
>>>>>>>>> I'm making my way through percpu crash_notes, elfcorehdr, 
>>>>>>>>> vmcoreinfo,
>>>>>>>>> makedumpfile and (the consumer of it all) the userspace crash 
>>>>>>>>> utility,
>>>>>>>>> in order to understand the impact of moving from 
>>>>>>>>> for_each_present_cpu()
>>>>>>>>> to for_each_online_cpu().
>>>>>>>>
>>>>>>>> Is the packing actually worth the trouble? What's the actual win?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>>          tglx
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> Thomas,
>>>>>>> I've investigated the passing of crash notes through the vmcore. 
>>>>>>> What I've learned is that:
>>>>>>>
>>>>>>> - linux/fs/proc/vmcore.c (which makedumpfile references to do 
>>>>>>> its job) does
>>>>>>>   not care what the contents of cpu PT_NOTES are, but it does 
>>>>>>> coalesce them together.
>>>>>>>
>>>>>>> - makedumpfile will count the number of cpu PT_NOTES in order to 
>>>>>>> determine its
>>>>>>>   nr_cpus variable, which is reported in a header, but otherwise 
>>>>>>> unused (except
>>>>>>>   for sadump method).
>>>>>>>
>>>>>>> - the crash utility, for the purposes of determining the cpus, 
>>>>>>> does not appear to
>>>>>>>   reference the elfcorehdr PT_NOTEs. Instead it locates the various
>>>>>>>   cpu_[possible|present|online]_mask and computes nr_cpus from 
>>>>>>> that, and also of
>>>>>>>   course which are online. In addition, when crash does 
>>>>>>> reference the cpu PT_NOTE,
>>>>>>>   to get its prstatus, it does so by using a percpu technique 
>>>>>>> directly in the vmcore
>>>>>>>   image memory, not via the ELF structure. Said differently, it 
>>>>>>> appears to me that
>>>>>>>   crash utility doesn't rely on the ELF PT_NOTEs for cpus; 
>>>>>>> rather it obtains them
>>>>>>>   via kernel cpumasks and the memory within the vmcore.
>>>>>>>
>>>>>>> With this understanding, I did some testing. Perhaps the most 
>>>>>>> telling test was that I
>>>>>>> changed the number of cpu PT_NOTEs emitted in the 
>>>>>>> crash_prepare_elf64_headers() to just 1,
>>>>>>> hot plugged some cpus, then also took a few offline sparsely via 
>>>>>>> chcpu, then generated a
>>>>>>> vmcore. The crash utility had no problem loading the vmcore, it 
>>>>>>> reported the proper number
>>>>>>> of cpus and the number offline (despite only one cpu PT_NOTE), 
>>>>>>> and switching to a different
>>>>>>> cpu via 'set -c 30' produced a completely valid backtrace.
>>>>>>>
>>>>>>> My take away is that crash utility does not rely upon ELF cpu 
>>>>>>> PT_NOTEs, it obtains the
>>>>>>> cpu information directly from kernel data structures. Perhaps at 
>>>>>>> one time crash relied
>>>>>>> upon the ELF information, but no more. (Perhaps there are other 
>>>>>>> crash dump analyzers
>>>>>>> that might rely on the ELF info?)
>>>>>>>
>>>>>>> So, all this to say that I see no need to change 
>>>>>>> crash_prepare_elf64_headers(). There
>>>>>>> is no compelling reason to move away from 
>>>>>>> for_each_present_cpu(), or modify the list for
>>>>>>> online/offline.
>>>>>>>
>>>>>>> Which then leaves the topic of the cpuhp state on which to 
>>>>>>> register. Perhaps reverting
>>>>>>> back to the use of CPUHP_BP_PREPARE_DYN is the right answer. 
>>>>>>> There does not appear to
>>>>>>> be a compelling need to accurately track whether the cpu went 
>>>>>>> online/offline for the
>>>>>>> purposes of creating the elfcorehdr, as ultimately the crash 
>>>>>>> utility pulls that from
>>>>>>> kernel data structures, not the elfcorehdr.
>>>>>>>
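
A minimal sketch of what registering a crash handler on CPUHP_BP_PREPARE_DYN
could look like. The callback and init function names here are hypothetical;
crash_handle_hotplug_event() and the KEXEC_CRASH_HP_* values are the ones
from this patchset:

  static int crash_cpuhp_online(unsigned int cpu)
  {
          /* BP_PREPARE startup runs on a control CPU, before bringup */
          crash_handle_hotplug_event(KEXEC_CRASH_HP_ADD_CPU, cpu);
          return 0;
  }

  static int crash_cpuhp_offline(unsigned int cpu)
  {
          /* BP_PREPARE teardown runs on a control CPU, after the CPU died */
          crash_handle_hotplug_event(KEXEC_CRASH_HP_REMOVE_CPU, cpu);
          return 0;
  }

  static int __init crash_cpuhp_init(void)
  {
          int ret;

          /* Dynamic state; _nocalls avoids invoking the startup callback
           * for CPUs that are already online at registration time. */
          ret = cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN, "crash/cpuhp",
                                          crash_cpuhp_online,
                                          crash_cpuhp_offline);
          return ret < 0 ? ret : 0;
  }
  subsys_initcall(crash_cpuhp_init);
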
>>>>>>> I think this is what Sourabh has known and has been advocating 
>>>>>>> for an optimization
>>>>>>> path that allows not regenerating the elfcorehdr on cpu changes 
>>>>>>> (because all the percpu
>>>>>>> structs are all laid out). I do think it best to leave that as 
>>>>>>> an arch choice.
>>>>>>
>>>>>> Since things are clear on how the PT_NOTES are consumed in kdump 
>>>>>> kernel [fs/proc/vmcore.c],
>>>>>> makedumpfile, and crash tool I need your opinion on this:
>>>>>>
>>>>>> Do we really need to regenerate elfcorehdr for CPU hotplug events?
>>>>>> If yes, can you please list the elfcorehdr components that 
>>>>>> change due to CPU hotplug?
>>>>> Due to the use of for_each_present_cpu(), it is possible for the 
>>>>> number of cpu PT_NOTEs
>>>>> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus 
>>>>> does not impact the
>>>>> number of cpu PT_NOTEs (as the cpus are still present).
>>>>>
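
For reference, the loop at issue in crash_prepare_elf64_headers()
(kernel/crash_core.c) looks roughly like this; it emits one PT_NOTE program
header per present cpu, each pointing at that cpu's preallocated crash note:

  /* Prepare one phdr of type PT_NOTE for each present CPU */
  for_each_present_cpu(cpu) {
          phdr->p_type = PT_NOTE;
          notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
          phdr->p_offset = phdr->p_paddr = notes_addr;
          phdr->p_filesz = phdr->p_memsz = sizeof(note_buf_t);
          (ehdr->e_phnum)++;
          phdr++;
  }

Swapping the iterator to for_each_possible_cpu() is the change under
discussion.
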
>>>>>>
>>>>>>  From what I understood, crash notes are prepared for possible 
>>>>>> CPUs as system boots and
>>>>>> could be used to create a PT_NOTE section for each possible CPU 
>>>>>> while generating the elfcorehdr
>>>>>> during the kdump kernel load.
>>>>>>
>>>>>> Now once the elfcorehdr is loaded with PT_NOTEs for every 
>>>>>> possible CPU there is no need to
>>>>>> regenerate it for CPU hotplug events. Or do we?
>>>>>
>>>>> For onlining/offlining of cpus, there is no need to regenerate the 
>>>>> elfcorehdr. However,
>>>>> for actual hot un/plug of cpus, the answer is yes due to 
>>>>> for_each_present_cpu(). The
>>>>> caveat here of course is that if crash utility is the only 
>>>>> coredump analyzer of concern,
>>>>> then it doesn't care about these cpu PT_NOTEs and there would be 
>>>>> no need to re-generate them.
>>>>>
>>>>> Also, I'm not sure if ARM cpu hotplug, which is just now coming 
>>>>> into mainstream, impacts
>>>>> any of this.
>>>>>
>>>>> Perhaps the one item that might help here is to distinguish 
>>>>> between actual hot un/plug of
>>>>> cpus, versus onlining/offlining. At the moment, I can not 
>>>>> distinguish between a hot plug
>>>>> event and an online event (and unplug/offline). If those were 
>>>>> distinguishable, then we
>>>>> could only regenerate on un/plug events.
>>>>>
>>>>> Or perhaps moving to for_each_possible_cpu() is the better choice?
>>>>
>>>> Yes, because once elfcorehdr is built with possible CPUs we don't 
>>>> have to worry about
>>>> hot[un]plug case.
>>>>
>>>> Here is my view on how things should be handled if a core-dump 
>>>> analyzer is dependent on
>>>> elfcorehdr PT_NOTEs to find online/offline CPUs.
>>>>
>>>> A PT_NOTE in elfcorehdr holds the address of the corresponding 
>>>> crash notes (kernel has
>>>> one crash note per CPU for every possible CPU). Though the crash 
>>>> notes are allocated
>>>> during the boot time they are populated when the system is on the 
>>>> crash path.
>>>>
>>>> This is how crash notes are populated on PowerPC and I am expecting 
>>>> it would be something
>>>> similar on other architectures too.
>>>>
>>>> The crashing CPU sends IPI to every other online CPU with a 
>>>> callback function that updates the
>>>> crash notes of that specific CPU. Once the IPI completes the 
>>>> crashing CPU updates its own crash
>>>> note and proceeds further.
>>>>
>>>> The crash notes of CPUs remain uninitialized if the CPUs were 
>>>> offline or hot unplugged at the time of the
>>>> system crash. The core-dump analyzer should be able to identify 
>>>> [un]/initialized crash notes
>>>> and display the information accordingly.
>>>>
>>>> Thoughts?
>>>>
>>>> - Sourabh
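
The per-cpu note update described above corresponds roughly to the generic
helper in kernel/kexec_core.c; the IPI callback on each remote CPU, and then
the crashing CPU itself, fills its own preallocated note with an NT_PRSTATUS
built from the crash-time register state:

  void crash_save_cpu(struct pt_regs *regs, int cpu)
  {
          struct elf_prstatus prstatus;
          u32 *buf;

          if ((cpu < 0) || (cpu >= nr_cpu_ids))
                  return;

          buf = (u32 *)per_cpu_ptr(crash_notes, cpu);
          if (!buf)
                  return;
          memset(&prstatus, 0, sizeof(prstatus));
          prstatus.common.pr_pid = current->pid;
          elf_core_copy_regs(&prstatus.pr_reg, regs);
          buf = append_elf_note(buf, KEXEC_CORE_NOTE_NAME, NT_PRSTATUS,
                                &prstatus, sizeof(prstatus));
          final_note(buf);
  }
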
>>>
>>> I've been examining what it would mean to move to 
>>> for_each_possible_cpu() in crash_prepare_elf64_headers(). I think it 
>>> means:
>>>
>>> - Changing for_each_present_cpu() to for_each_possible_cpu() in 
>>> crash_prepare_elf64_headers().
>>> - For kexec_load() syscall path, rewrite the incoming/supplied 
>>> elfcorehdr immediately on the load with the elfcorehdr generated by 
>>> crash_prepare_elf64_headers().
>>> - Eliminate/remove the cpuhp machinery for handling crash hotplug 
>>> events.
>>
>> If for_each_present_cpu is replaced with for_each_possible_cpu I 
>> still need cpuhp machinery
>> to update the FDT kexec segment for the CPU hot add case.
>
> Ah, ok, that's important! So the cpuhp callbacks are still needed.
>>
>>
>>>
>>> This would then set up PT_NOTEs for all possible cpus, which should 
>>> in theory accommodate crash analyzers that rely on ELF PT_NOTEs for 
>>> crash_notes.
>>>
>>> If staying with for_each_present_cpu() is ultimately decided, then I 
>>> think the cpuhp machinery should be left in place and each arch can 
>>> decide how to handle crash cpu hotplug events. The overhead for 
>>> doing this is very minimal, and the events are likely very infrequent.
>>
>> I agree. Some architectures may need cpuhp machinery to update kexec 
>> segment[s] other than the elfcorehdr. For example, the FDT on PowerPC.
>>
>> - Sourabh Jain
>
> OK, I was thinking that the desire was to eliminate the cpuhp 
> callbacks. In reality, the desire is to change to 
> for_each_possible_cpu(). Given that the kernel creates crash_notes for 
> all possible cpus upon kernel boot, there seems to be no reason to not 
> do this?
>
> HOWEVER...
>
> It's not clear to me that this particular change needs to be part of 
> this series. Its inclusion would facilitate PPC support, but doesn't 
> "solve" anything in general. In fact it causes kexec_load and 
> kexec_file_load to deviate (kexec_load via userspace kexec does the 
> equivalent of for_each_present_cpu(), whereas with this change 
> kexec_file_load would do for_each_possible_cpu(); until a hot plug 
> event, after which both would do for_each_possible_cpu()). And if this change 
> were to arrive as part of Sourabh's PPC support, then it does not 
> appear to impact x86 (not sure about other arches). And the 'crash' 
> dump analyzer doesn't care either way.
>
> Including this change would enable an optimization path (for x86 at 
> least) that short-circuits cpu hotplug changes in the arch crash 
> handler, for example:
>
> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> index aca3f1817674..0883f6b11de4 100644
> --- a/arch/x86/kernel/crash.c
> +++ b/arch/x86/kernel/crash.c
> @@ -473,6 +473,11 @@ void arch_crash_handle_hotplug_event(struct kimage *image)
>     unsigned long mem, memsz;
>     unsigned long elfsz = 0;
>
> +   if (image->file_mode && (
> +       image->hp_action == KEXEC_CRASH_HP_ADD_CPU ||
> +       image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU))
> +       return;
> +
>     /*
>      * Create the new elfcorehdr reflecting the changes to CPU and/or
>      * memory resources.
>
> I'm not sure that is compelling given the infrequent nature of cpu 
> hotplug events.
It certainly closes/reduces the window where kdump is not active due 
to the kexec segment update.

>
> In my mind I still have a question about the kexec_load() path. The 
> userspace kexec cannot do the equivalent of for_each_possible_cpu(). 
> It can obtain max possible cpus from /sys/devices/system/cpu/possible, 
> but for those cpus not present the /sys/devices/system/cpu/cpuXX is 
> not available and so the crash_notes entries are not available. My 
> attempts to expose all cpuXX led to odd behavior that required 
> changes in ACPI and arch code that looked untenable.
>
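
A minimal userspace sketch (not from any posted patch) of reading the
possible mask; it assumes the usual contiguous "0-N" cpulist form, with a
plain "0" on single-cpu systems:

  #include <stdio.h>

  /* Return the highest possible CPU number, or -1 on error */
  static int max_possible_cpu(void)
  {
          FILE *f = fopen("/sys/devices/system/cpu/possible", "r");
          int first = 0, last = 0;

          if (!f)
                  return -1;
          if (fscanf(f, "%d-%d", &first, &last) < 2)
                  last = first;   /* single-cpu form: just "0" */
          fclose(f);
          return last;
  }
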
> There seem to be these options available for the kexec_load() path:
> - immediately rewrite the elfcorehdr upon load via a call to 
> crash_prepare_elf64_headers(). I've made this work with the following, 
> as proof of concept:
Yes, regenerating/patching the elfcorehdr could be an option for 
the kexec_load syscall.

>
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index cb8e6e6f983c..4eb201270f97 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -163,6 +163,12 @@ static int do_kexec_load(unsigned long entry, 
> unsigned long
>     kimage_free(image);
>  out_unlock:
>     kexec_unlock();
> +   if (IS_ENABLED(CONFIG_CRASH_HOTPLUG)) {
> +       if ((flags & KEXEC_ON_CRASH) && kexec_crash_image) {
> +           crash_handle_hotplug_event(KEXEC_CRASH_HP_NONE, 
> KEXEC_CRASH_HP_INVALID_CPU);
> +       }
> +   }
>     return ret;
>  }
>
> - Another option is to spend the time to determine whether exposing all 
> cpuXX is a viable solution; I have no idea what the impacts to userspace 
> of possible-but-not-yet-present cpuXX entries would be. It 
> might also mean requiring a 'present' entry available within the cpuXX.
>
> - Another option is to simply let the hot plug events rewrite the 
> elfcorehdr on demand. This is what I've originally put forth, but not 
> sure how this impacts PPC given for_each_possible_cpu() change.
Given that /sys/devices/system/cpu/cpuXX is not present for 
possible-but-not-yet-present CPUs, I am wondering whether we even have crash 
notes for possible CPUs on x86?
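
For reference: crash_notes is a percpu allocation, and percpu storage is
instantiated for every possible CPU, on x86 as elsewhere, so the note
buffers themselves do exist for possible-but-not-present cpus. The
allocation in kernel/kexec_core.c is roughly:

  static int __init crash_notes_memory_init(void)
  {
          /* Allocate memory for saving cpu registers. */
          size_t size, align;

          size = sizeof(note_buf_t);
          align = min(roundup_pow_of_two(sizeof(note_buf_t)), PAGE_SIZE);

          /* percpu: instantiated for every possible CPU */
          crash_notes = __alloc_percpu(size, align);
          if (!crash_notes) {
                  pr_warn("Memory allocation for saving cpu register states failed\n");
                  return -ENOMEM;
          }
          return 0;
  }
  subsys_initcall(crash_notes_memory_init);
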
>
> The concern is that today, both kexec_load and kexec_file_load mirror 
> each other with respect to for_each_present_cpu(); that is userspace 
> kexec is able to generate the elfcorehdr the same as would 
> kexec_file_load, for cpus. But by changing to for_each_possible_cpu(), 
> the two would deviate.

Thanks,
Sourabh Jain

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
  2023-02-13  4:40                     ` Sourabh Jain
@ 2023-02-28 12:44                       ` Baoquan He
  -1 siblings, 0 replies; 70+ messages in thread
From: Baoquan He @ 2023-02-28 12:44 UTC (permalink / raw)
  To: Sourabh Jain
  Cc: Eric DeVolder, Thomas Gleixner, linux-kernel, x86, kexec,
	ebiederm, dyoung, vgoyal, mingo, bp, dave.hansen, hpa, nramas,
	thomas.lendacky, robh, efault, rppt, david, konrad.wilk,
	boris.ostrovsky

On 02/13/23 at 10:10am, Sourabh Jain wrote:
> 
> On 11/02/23 06:05, Eric DeVolder wrote:
> > 
> > 
> > On 2/10/23 00:29, Sourabh Jain wrote:
> > > 
> > > On 10/02/23 01:09, Eric DeVolder wrote:
> > > > 
> > > > 
> > > > On 2/9/23 12:43, Sourabh Jain wrote:
> > > > > Hello Eric,
> > > > > 
> > > > > On 09/02/23 23:01, Eric DeVolder wrote:
> > > > > > 
> > > > > > 
> > > > > > On 2/8/23 07:44, Thomas Gleixner wrote:
> > > > > > > Eric!
> > > > > > > 
> > > > > > > On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
> > > > > > > > On 2/1/23 05:33, Thomas Gleixner wrote:
> > > > > > > > 
> > > > > > > > So my latest solution is to introduce two new CPUHP
> > > > > > > > states, CPUHP_AP_ELFCOREHDR_ONLINE
> > > > > > > > for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for
> > > > > > > > offlining. I'm open to better names.
> > > > > > > > 
> > > > > > > > The CPUHP_AP_ELFCOREHDR_ONLINE needs to be
> > > > > > > > placed after CPUHP_BRINGUP_CPU. My
> > > > > > > > attempts at locating this state failed when
> > > > > > > > inside the STARTING section, so I located
> > > > > > > > this just inside the ONLINE section. The crash
> > > > > > > > hotplug handler is registered on
> > > > > > > > this state as the callback for the .startup method.
> > > > > > > > 
> > > > > > > > The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be
> > > > > > > > placed before CPUHP_TEARDOWN_CPU, and I
> > > > > > > > placed it at the end of the PREPARE section.
> > > > > > > > This crash hotplug handler is also
> > > > > > > > registered on this state as the callback for the .teardown method.
> > > > > > > 
> > > > > > > TBH, that's still overengineered. Something like this:
> > > > > > > 
> > > > > > > bool cpu_is_alive(unsigned int cpu)
> > > > > > > {
> > > > > > >     struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
> > > > > > > 
> > > > > > >     return data_race(st->state) > CPUHP_AP_IDLE_DEAD;
> > > > > > > }
> > > > > > > 
> > > > > > > and use this to query the actual state at crash
> > > > > > > time. That spares all
> > > > > > > those callback heuristics.
> > > > > > > 
> > > > > > > > I'm making my way through percpu crash_notes,
> > > > > > > > elfcorehdr, vmcoreinfo,
> > > > > > > > makedumpfile and (the consumer of it all) the
> > > > > > > > userspace crash utility,
> > > > > > > > in order to understand the impact of moving from
> > > > > > > > for_each_present_cpu()
> > > > > > > > to for_each_online_cpu().
> > > > > > > 
> > > > > > > Is the packing actually worth the trouble? What's the actual win?
> > > > > > > 
> > > > > > > Thanks,
> > > > > > > 
> > > > > > >          tglx
> > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > Thomas,
> > > > > > I've investigated the passing of crash notes through the
> > > > > > vmcore. What I've learned is that:
> > > > > > 
> > > > > > - linux/fs/proc/vmcore.c (which makedumpfile references
> > > > > > to do its job) does
> > > > > >   not care what the contents of cpu PT_NOTES are, but it
> > > > > > does coalesce them together.
> > > > > > 
> > > > > > - makedumpfile will count the number of cpu PT_NOTES in
> > > > > > order to determine its
> > > > > >   nr_cpus variable, which is reported in a header, but
> > > > > > otherwise unused (except
> > > > > >   for sadump method).
> > > > > > 
> > > > > > - the crash utility, for the purposes of determining the
> > > > > > cpus, does not appear to
> > > > > >   reference the elfcorehdr PT_NOTEs. Instead it locates the various
> > > > > >   cpu_[possible|present|online]_mask and computes
> > > > > > nr_cpus from that, and also of
> > > > > >   course which are online. In addition, when crash does
> > > > > > reference the cpu PT_NOTE,
> > > > > >   to get its prstatus, it does so by using a percpu
> > > > > > technique directly in the vmcore
> > > > > >   image memory, not via the ELF structure. Said
> > > > > > differently, it appears to me that
> > > > > >   crash utility doesn't rely on the ELF PT_NOTEs for
> > > > > > cpus; rather it obtains them
> > > > > >   via kernel cpumasks and the memory within the vmcore.
> > > > > > 
> > > > > > With this understanding, I did some testing. Perhaps the
> > > > > > most telling test was that I
> > > > > > changed the number of cpu PT_NOTEs emitted in the
> > > > > > crash_prepare_elf64_headers() to just 1,
> > > > > > hot plugged some cpus, then also took a few offline
> > > > > > sparsely via chcpu, then generated a
> > > > > > vmcore. The crash utility had no problem loading the
> > > > > > vmcore, it reported the proper number
> > > > > > of cpus and the number offline (despite only one cpu
> > > > > > PT_NOTE), and switching to a different
> > > > > > cpu via 'set -c 30' produced a completely valid backtrace.
> > > > > > 
> > > > > > My take away is that crash utility does not rely upon
> > > > > > ELF cpu PT_NOTEs, it obtains the
> > > > > > cpu information directly from kernel data structures.
> > > > > > Perhaps at one time crash relied
> > > > > > upon the ELF information, but no more. (Perhaps there
> > > > > > are other crash dump analyzers
> > > > > > that might rely on the ELF info?)
> > > > > > 
> > > > > > So, all this to say that I see no need to change
> > > > > > crash_prepare_elf64_headers(). There
> > > > > > is no compelling reason to move away from
> > > > > > for_each_present_cpu(), or modify the list for
> > > > > > online/offline.
> > > > > > 
> > > > > > Which then leaves the topic of the cpuhp state on which
> > > > > > to register. Perhaps reverting
> > > > > > back to the use of CPUHP_BP_PREPARE_DYN is the right
> > > > > > answer. There does not appear to
> > > > > > be a compelling need to accurately track whether the cpu
> > > > > > went online/offline for the
> > > > > > purposes of creating the elfcorehdr, as ultimately the
> > > > > > crash utility pulls that from
> > > > > > kernel data structures, not the elfcorehdr.
> > > > > > 
> > > > > > I think this is what Sourabh has known and has been
> > > > > > advocating for an optimization
> > > > > > path that allows not regenerating the elfcorehdr on cpu
> > > > > > changes (because all the percpu
> > > > > > structs are all laid out). I do think it best to leave
> > > > > > that as an arch choice.
> > > > > 
> > > > > Since things are clear on how the PT_NOTES are consumed in
> > > > > kdump kernel [fs/proc/vmcore.c],
> > > > > makedumpfile, and crash tool I need your opinion on this:
> > > > > 
> > > > > Do we really need to regenerate elfcorehdr for CPU hotplug events?
> > > > > If yes, can you please list the elfcorehdr components that
> > > > > change due to CPU hotplug?
> > > > Due to the use of for_each_present_cpu(), it is possible for the
> > > > number of cpu PT_NOTEs
> > > > to fluctuate as cpus are un/plugged. Onlining/offlining of cpus
> > > > does not impact the
> > > > number of cpu PT_NOTEs (as the cpus are still present).
> > > > 
> > > > > 
> > > > >  From what I understood, crash notes are prepared for
> > > > > possible CPUs as system boots and
> > > > > could be used to create a PT_NOTE section for each possible
> > > > > CPU while generating the elfcorehdr
> > > > > during the kdump kernel load.
> > > > > 
> > > > > Now once the elfcorehdr is loaded with PT_NOTEs for every
> > > > > possible CPU there is no need to
> > > > > regenerate it for CPU hotplug events. Or do we?
> > > > 
> > > > For onlining/offlining of cpus, there is no need to regenerate
> > > > the elfcorehdr. However,
> > > > for actual hot un/plug of cpus, the answer is yes due to
> > > > for_each_present_cpu(). The
> > > > caveat here of course is that if crash utility is the only
> > > > coredump analyzer of concern,
> > > > then it doesn't care about these cpu PT_NOTEs and there would be
> > > > no need to re-generate them.
> > > > 
> > > > Also, I'm not sure if ARM cpu hotplug, which is just now coming
> > > > into mainstream, impacts
> > > > any of this.
> > > > 
> > > > Perhaps the one item that might help here is to distinguish
> > > > between actual hot un/plug of
> > > > cpus, versus onlining/offlining. At the moment, I can not
> > > > distinguish between a hot plug
> > > > event and an online event (and unplug/offline). If those were
> > > > distinguishable, then we
> > > > could only regenerate on un/plug events.
> > > > 
> > > > Or perhaps moving to for_each_possible_cpu() is the better choice?
> > > 
> > > Yes, because once elfcorehdr is built with possible CPUs we don't
> > > have to worry about
> > > hot[un]plug case.
> > > 
> > > Here is my view on how things should be handled if a core-dump
> > > analyzer is dependent on
> > > elfcorehdr PT_NOTEs to find online/offline CPUs.
> > > 
> > > A PT_NOTE in elfcorehdr holds the address of the corresponding crash
> > > notes (kernel has
> > > one crash note per CPU for every possible CPU). Though the crash
> > > notes are allocated
> > > during the boot time they are populated when the system is on the
> > > crash path.
> > > 
> > > This is how crash notes are populated on PowerPC and I am expecting
> > > it would be something
> > > similar on other architectures too.
> > > 
> > > The crashing CPU sends IPI to every other online CPU with a callback
> > > function that updates the
> > > crash notes of that specific CPU. Once the IPI completes the
> > > crashing CPU updates its own crash
> > > note and proceeds further.
> > > 
> > > The crash notes of CPUs remain uninitialized if the CPUs were
> > > offline or hot unplugged at the time of the
> > > system crash. The core-dump analyzer should be able to identify
> > > [un]/initialized crash notes
> > > and display the information accordingly.
> > > 
> > > Thoughts?
> > > 
> > > - Sourabh
> > 
> > In general, I agree with your points. You've presented a strong case to
> > go with for_each_possible_cpu() in crash_prepare_elf64_headers() and
> > those crash notes would always be present, and we can ignore changes to
> > cpus wrt/ elfcorehdr updates.
> > 
> > But what do we do about kexec_load() syscall? The way the userspace
> > utility works is it determines cpus by:
> >  nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
> > which is not the equivalent of possible_cpus. So the complete list of
> > cpu PT_NOTEs is not generated up front. We would need a solution for
> > that?
> Hello Eric,
> 
> The sysconf document says _SC_NPROCESSORS_CONF is processors configured,
> isn't that equivalent to possible CPUs?
> 
> What exactly does sysconf(_SC_NPROCESSORS_CONF) return on x86? IIUC, on PowerPC
> it is possible CPUs.

From the sysconf man page, my understanding is that _SC_NPROCESSORS_CONF
returns the possible cpus, while _SC_NPROCESSORS_ONLN returns the present
cpus. If these are true, we can use them.

But I am wondering why the existing present cpu way is going to be
discarded. Sorry, I tried to go through this thread, but it's too long; can
anyone summarize the reason in a few short, clear sentences? Sorry again
for that.

> 
> In case sysconf(_SC_NPROCESSORS_CONF) is not consistent, then we can go with
> /sys/devices/system/cpu/possible for the kexec_load case.
> 
> Thoughts?
> 
> - Sourabh Jain
> 
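
A quick standalone check of what the two sysconf values report on a given
system (a sketch, not from the thread); comparing the output with
/sys/devices/system/cpu/{possible,present,online} shows which kernel mask
each one actually tracks:

  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          printf("_SC_NPROCESSORS_CONF = %ld\n",
                 sysconf(_SC_NPROCESSORS_CONF));
          printf("_SC_NPROCESSORS_ONLN = %ld\n",
                 sysconf(_SC_NPROCESSORS_ONLN));
          return 0;
  }
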


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
  2023-02-28 12:44                       ` Baoquan He
@ 2023-02-28 18:52                         ` Eric DeVolder
  -1 siblings, 0 replies; 70+ messages in thread
From: Eric DeVolder @ 2023-02-28 18:52 UTC (permalink / raw)
  To: Baoquan He, Sourabh Jain
  Cc: Thomas Gleixner, linux-kernel, x86, kexec, ebiederm, dyoung,
	vgoyal, mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky,
	robh, efault, rppt, david, konrad.wilk, boris.ostrovsky



On 2/28/23 06:44, Baoquan He wrote:
> On 02/13/23 at 10:10am, Sourabh Jain wrote:
>>
>> On 11/02/23 06:05, Eric DeVolder wrote:
>>>
>>>
>>> On 2/10/23 00:29, Sourabh Jain wrote:
>>>>
>>>> On 10/02/23 01:09, Eric DeVolder wrote:
>>>>>
>>>>>
>>>>> On 2/9/23 12:43, Sourabh Jain wrote:
>>>>>> Hello Eric,
>>>>>>
>>>>>> On 09/02/23 23:01, Eric DeVolder wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2/8/23 07:44, Thomas Gleixner wrote:
>>>>>>>> Eric!
>>>>>>>>
>>>>>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
>>>>>>>>> On 2/1/23 05:33, Thomas Gleixner wrote:
>>>>>>>>>
>>>>>>>>> So my latest solution is to introduce two new CPUHP
>>>>>>>>> states, CPUHP_AP_ELFCOREHDR_ONLINE
>>>>>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for
>>>>>>>>> offlining. I'm open to better names.
>>>>>>>>>
>>>>>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be
>>>>>>>>> placed after CPUHP_BRINGUP_CPU. My
>>>>>>>>> attempts at locating this state failed when
>>>>>>>>> inside the STARTING section, so I located
>>>>>>>>> this just inside the ONLINE section. The crash
>>>>>>>>> hotplug handler is registered on
>>>>>>>>> this state as the callback for the .startup method.
>>>>>>>>>
>>>>>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be
>>>>>>>>> placed before CPUHP_TEARDOWN_CPU, and I
>>>>>>>>> placed it at the end of the PREPARE section.
>>>>>>>>> This crash hotplug handler is also
>>>>>>>>> registered on this state as the callback for the .teardown method.
>>>>>>>>
>>>>>>>> TBH, that's still overengineered. Something like this:
>>>>>>>>
>>>>>>>> bool cpu_is_alive(unsigned int cpu)
>>>>>>>> {
>>>>>>>>      struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
>>>>>>>>
>>>>>>>>      return data_race(st->state) > CPUHP_AP_IDLE_DEAD;
>>>>>>>> }
>>>>>>>>
>>>>>>>> and use this to query the actual state at crash
>>>>>>>> time. That spares all
>>>>>>>> those callback heuristics.
>>>>>>>>
>>>>>>>>> I'm making my way through percpu crash_notes,
>>>>>>>>> elfcorehdr, vmcoreinfo,
>>>>>>>>> makedumpfile and (the consumer of it all) the
>>>>>>>>> userspace crash utility,
>>>>>>>>> in order to understand the impact of moving from
>>>>>>>>> for_each_present_cpu()
>>>>>>>>> to for_each_online_cpu().
>>>>>>>>
>>>>>>>> Is the packing actually worth the trouble? What's the actual win?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>>           tglx
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> Thomas,
>>>>>>> I've investigated the passing of crash notes through the
>>>>>>> vmcore. What I've learned is that:
>>>>>>>
>>>>>>> - linux/fs/proc/vmcore.c (which makedumpfile references
>>>>>>> to do its job) does
>>>>>>>    not care what the contents of cpu PT_NOTES are, but it
>>>>>>> does coalesce them together.
>>>>>>>
>>>>>>> - makedumpfile will count the number of cpu PT_NOTES in
>>>>>>> order to determine its
>>>>>>>    nr_cpus variable, which is reported in a header, but
>>>>>>> otherwise unused (except
>>>>>>>    for sadump method).
>>>>>>>
>>>>>>> - the crash utility, for the purposes of determining the
>>>>>>> cpus, does not appear to
>>>>>>>    reference the elfcorehdr PT_NOTEs. Instead it locates the various
>>>>>>>    cpu_[possible|present|online]_mask and computes
>>>>>>> nr_cpus from that, and also of
>>>>>>>    course which are online. In addition, when crash does
>>>>>>> reference the cpu PT_NOTE,
>>>>>>>    to get its prstatus, it does so by using a percpu
>>>>>>> technique directly in the vmcore
>>>>>>>    image memory, not via the ELF structure. Said
>>>>>>> differently, it appears to me that
>>>>>>>    crash utility doesn't rely on the ELF PT_NOTEs for
>>>>>>> cpus; rather it obtains them
>>>>>>>    via kernel cpumasks and the memory within the vmcore.
>>>>>>>
>>>>>>> With this understanding, I did some testing. Perhaps the
>>>>>>> most telling test was that I
>>>>>>> changed the number of cpu PT_NOTEs emitted in the
>>>>>>> crash_prepare_elf64_headers() to just 1,
>>>>>>> hot plugged some cpus, then also took a few offline
>>>>>>> sparsely via chcpu, then generated a
>>>>>>> vmcore. The crash utility had no problem loading the
>>>>>>> vmcore, it reported the proper number
>>>>>>> of cpus and the number offline (despite only one cpu
>>>>>>> PT_NOTE), and switching to a different
>>>>>>> cpu via 'set -c 30' produced a completely valid backtrace.
>>>>>>>
>>>>>>> My take away is that crash utility does not rely upon
>>>>>>> ELF cpu PT_NOTEs, it obtains the
>>>>>>> cpu information directly from kernel data structures.
>>>>>>> Perhaps at one time crash relied
>>>>>>> upon the ELF information, but no more. (Perhaps there
>>>>>>> are other crash dump analyzers
>>>>>>> that might rely on the ELF info?)
>>>>>>>
>>>>>>> So, all this to say that I see no need to change
>>>>>>> crash_prepare_elf64_headers(). There
>>>>>>> is no compelling reason to move away from
>>>>>>> for_each_present_cpu(), or modify the list for
>>>>>>> online/offline.
>>>>>>>
>>>>>>> Which then leaves the topic of the cpuhp state on which
>>>>>>> to register. Perhaps reverting
>>>>>>> back to the use of CPUHP_BP_PREPARE_DYN is the right
>>>>>>> answer. There does not appear to
>>>>>>> be a compelling need to accurately track whether the cpu
>>>>>>> went online/offline for the
>>>>>>> purposes of creating the elfcorehdr, as ultimately the
>>>>>>> crash utility pulls that from
>>>>>>> kernel data structures, not the elfcorehdr.
>>>>>>>
>>>>>>> I think this is what Sourabh has known and has been
>>>>>>> advocating for an optimization
>>>>>>> path that allows not regenerating the elfcorehdr on cpu
>>>>>>> changes (because all the percpu
>>>>>>> structs are all laid out). I do think it best to leave
>>>>>>> that as an arch choice.
>>>>>>
>>>>>> Since things are clear on how the PT_NOTES are consumed in
>>>>>> kdump kernel [fs/proc/vmcore.c],
>>>>>> makedumpfile, and crash tool I need your opinion on this:
>>>>>>
>>>>>> Do we really need to regenerate elfcorehdr for CPU hotplug events?
>>>>>> If yes, can you please list the elfcorehdr components that
>>>>>> change due to CPU hotplug?
>>>>> Due to the use of for_each_present_cpu(), it is possible for the
>>>>> number of cpu PT_NOTEs
>>>>> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus
>>>>> does not impact the
>>>>> number of cpu PT_NOTEs (as the cpus are still present).
>>>>>
>>>>>>
>>>>>>   From what I understood, crash notes are prepared for
>>>>>> possible CPUs as system boots and
>>>>>> could be used to create a PT_NOTE section for each possible
>>>>>> CPU while generating the elfcorehdr
>>>>>> during the kdump kernel load.
>>>>>>
>>>>>> Now once the elfcorehdr is loaded with PT_NOTEs for every
>>>>>> possible CPU there is no need to
>>>>>> regenerate it for CPU hotplug events. Or do we?
>>>>>
>>>>> For onlining/offlining of cpus, there is no need to regenerate
>>>>> the elfcorehdr. However,
>>>>> for actual hot un/plug of cpus, the answer is yes due to
>>>>> for_each_present_cpu(). The
>>>>> caveat here of course is that if crash utility is the only
>>>>> coredump analyzer of concern,
>>>>> then it doesn't care about these cpu PT_NOTEs and there would be
>>>>> no need to re-generate them.
>>>>>
>>>>> Also, I'm not sure if ARM cpu hotplug, which is just now coming
>>>>> into mainstream, impacts
>>>>> any of this.
>>>>>
>>>>> Perhaps the one item that might help here is to distinguish
>>>>> between actual hot un/plug of
>>>>> cpus, versus onlining/offlining. At the moment, I can not
>>>>> distinguish between a hot plug
>>>>> event and an online event (and unplug/offline). If those were
>>>>> distinguishable, then we
>>>>> could only regenerate on un/plug events.
>>>>>
>>>>> Or perhaps moving to for_each_possible_cpu() is the better choice?
>>>>
>>>> Yes, because once elfcorehdr is built with possible CPUs we don't
>>>> have to worry about
>>>> hot[un]plug case.
>>>>
>>>> Here is my view on how things should be handled if a core-dump
>>>> analyzer is dependent on
>>>> elfcorehdr PT_NOTEs to find online/offline CPUs.
>>>>
>>>> A PT_NOTE in elfcorehdr holds the address of the corresponding crash
>>>> notes (the kernel has
>>>> one crash note per CPU for every possible CPU). Though the crash
>>>> notes are allocated
>>>> at boot time, they are populated when the system is on the
>>>> crash path.
>>>>
>>>> This is how crash notes are populated on PowerPC and I am expecting
>>>> it would be something
>>>> similar on other architectures too.
>>>>
>>>> The crashing CPU sends an IPI to every other online CPU with a callback
>>>> function that updates the
>>>> crash notes of that specific CPU. Once the IPI completes, the
>>>> crashing CPU updates its own crash
>>>> note and proceeds further.
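
A condensed sketch of that flow, modeled loosely on the powerpc crash
path (arch/powerpc/kernel/crash.c) with the spin/synchronization details
omitted:

	/* Runs on each online secondary CPU when the crash IPI arrives;
	 * crash_save_cpu() fills in that cpu's per-cpu crash note. */
	void crash_ipi_callback(struct pt_regs *regs)
	{
		int cpu = smp_processor_id();

		hard_irq_disable();
		crash_save_cpu(regs, cpu);
		/* ... then wait for the crashing CPU to proceed with kexec ... */
	}
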
>>>>
>>>> The crash notes of CPUs remain uninitialized if the CPUs were
>>>> offline or hot unplugged at the time
>>>> of the system crash. The core-dump analyzer should be able to identify
>>>> [un]initialized crash notes
>>>> and display the information accordingly.
>>>>
>>>> Thoughts?
>>>>
>>>> - Sourabh
>>>
>>> In general, I agree with your points. You've presented a strong case to
>>> go with for_each_possible_cpu() in crash_prepare_elf64_headers(); then
>>> those crash notes would always be present, and we can ignore changes to
>>> cpus wrt elfcorehdr updates.
>>>
>>> But what do we do about the kexec_load() syscall? The way the userspace
>>> utility works is that it determines cpus by:
>>>   nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
>>> which is not the equivalent of possible_cpus. So the complete list of
>>> cpu PT_NOTEs is not generated up front. We would need a solution for
>>> that?
>> Hello Eric,
>>
>> The sysconf documentation says _SC_NPROCESSORS_CONF is the processors configured;
>> isn't that equivalent to possible CPUs?
>>
>> What exactly does sysconf(_SC_NPROCESSORS_CONF) return on x86? IIUC, on PowerPC
>> it is possible CPUs.
> 
Baoquan,

>  From the sysconf man page, my understanding is that _SC_NPROCESSORS_CONF
> returns the possible cpus, while _SC_NPROCESSORS_ONLN returns present
> cpus. If these are true, we can use them.

Thomas Gleixner has pointed out that:

  glibc tries to evaluate that in the following order:
   1) /sys/devices/system/cpu/cpu*
      That's present CPUs not possible CPUs
   2) /proc/stat
      That's online CPUs
   3) sched_getaffinity()
      That's online CPUs at best. In the worst case it's an affinity mask
      which is set on a process group

meaning that _SC_NPROCESSORS_CONF is not equivalent to possible_cpus(). Furthermore, the
/sys/devices/system/cpu/cpuXX entries are not available for not-present-but-possible cpus; thus
the userspace kexec utility cannot write out the elfcorehdr with all possible cpus listed.
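
For illustration, a minimal userspace sketch of the alternative raised
below (reading the possible mask directly from sysfs; this is not
current kexec-tools code):

	#include <stdio.h>

	/* Parse /sys/devices/system/cpu/possible (e.g. "0-7", or just "0"
	 * on a single-cpu system) to learn the possible cpu count, which
	 * sysconf(_SC_NPROCESSORS_CONF) cannot reliably provide. */
	static int max_possible_cpus(void)
	{
		int first, last, n;
		FILE *f = fopen("/sys/devices/system/cpu/possible", "r");

		if (!f)
			return -1;
		n = fscanf(f, "%d-%d", &first, &last);
		fclose(f);
		if (n < 1)
			return -1;
		if (n == 1)
			last = first;	/* single cpu, no "-N" suffix */
		return last + 1;
	}

	int main(void)
	{
		printf("possible cpus: %d\n", max_possible_cpus());
		return 0;
	}
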

> 
> But I am wondering why the existing present cpu way is going to be
> discarded. Sorry, I tried to go through this thread, but it's too long; can
> anyone summarize the reason in a few short, clear sentences? Sorry
> again for that.

Utilizing for_each_possible_cpu() in crash_prepare_elf64_headers() would, in the case of
kexec_file_load(), simplify some issues Sourabh has encountered for PPC support.
It would also enable an optimization that permits NOT re-generating the elfcorehdr on cpu changes,
as all the [possible] cpus are already described in the elfcorehdr.

I've pointed out that this change would have kexec_load (as kexec-tools can only write out,
initially, the present_cpus()) initially deviate from kexec_file_load (which would now write out the
possible_cpus()). This deviation would disappear after the first hotplug event (due to calling
crash_prepare_elf64_headers()). Alternatively, I've provided a simple way for kexec_load to rewrite its
elfcorehdr upon initial load (by calling into the crash hotplug handler).

Can you think of any side effects of going to for_each_possible_cpu()?

Thanks,
eric


> 
>>
>> In case sysconf(_SC_NPROCESSORS_CONF) is not consistent, then we can go with
>> /sys/devices/system/cpu/possible for the kexec_load case.
>>
>> Thoughts?
>>
>> - Sourabh Jain
>>
> 

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
  2023-02-27  6:11                         ` Sourabh Jain
@ 2023-02-28 21:50                           ` Eric DeVolder
  -1 siblings, 0 replies; 70+ messages in thread
From: Eric DeVolder @ 2023-02-28 21:50 UTC (permalink / raw)
  To: Sourabh Jain, Thomas Gleixner, linux-kernel, x86, kexec,
	ebiederm, dyoung, bhe, vgoyal
  Cc: mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky, robh,
	efault, rppt, david, konrad.wilk, boris.ostrovsky



On 2/27/23 00:11, Sourabh Jain wrote:
> 
> On 25/02/23 01:46, Eric DeVolder wrote:
>>
>>
>> On 2/24/23 02:34, Sourabh Jain wrote:
>>>
>>> On 24/02/23 02:04, Eric DeVolder wrote:
>>>>
>>>>
>>>> On 2/10/23 00:29, Sourabh Jain wrote:
>>>>>
>>>>> On 10/02/23 01:09, Eric DeVolder wrote:
>>>>>>
>>>>>>
>>>>>> On 2/9/23 12:43, Sourabh Jain wrote:
>>>>>>> Hello Eric,
>>>>>>>
>>>>>>> On 09/02/23 23:01, Eric DeVolder wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2/8/23 07:44, Thomas Gleixner wrote:
>>>>>>>>> Eric!
>>>>>>>>>
>>>>>>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
>>>>>>>>>> On 2/1/23 05:33, Thomas Gleixner wrote:
>>>>>>>>>>
>>>>>>>>>> So my latest solution is to introduce two new CPUHP states, CPUHP_AP_ELFCOREHDR_ONLINE
>>>>>>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm open to better names.
>>>>>>>>>>
>>>>>>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after CPUHP_BRINGUP_CPU. My
>>>>>>>>>> attempts at locating this state failed when inside the STARTING section, so I located
>>>>>>>>>> this just inside the ONLINE section. The crash hotplug handler is registered on
>>>>>>>>>> this state as the callback for the .startup method.
>>>>>>>>>>
>>>>>>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before CPUHP_TEARDOWN_CPU, and I
>>>>>>>>>> placed it at the end of the PREPARE section. This crash hotplug handler is also
>>>>>>>>>> registered on this state as the callback for the .teardown method.
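
As a sketch of the registration being described (handler names are
hypothetical; the cpuhp API and the hp_action values are those used
elsewhere in this series):

	static int crash_cpuhp_online(unsigned int cpu)
	{
		crash_handle_hotplug_event(KEXEC_CRASH_HP_ADD_CPU, cpu);
		return 0;
	}

	static int crash_cpuhp_offline(unsigned int cpu)
	{
		crash_handle_hotplug_event(KEXEC_CRASH_HP_REMOVE_CPU, cpu);
		return 0;
	}

	static int __init crash_hotplug_init(void)
	{
		int ret;

		/* .startup on the online state, .teardown on the offline state. */
		ret = cpuhp_setup_state_nocalls(CPUHP_AP_ELFCOREHDR_ONLINE,
						"crash/elfcorehdr:online",
						crash_cpuhp_online, NULL);
		if (ret < 0)
			return ret;
		return cpuhp_setup_state_nocalls(CPUHP_BP_ELFCOREHDR_OFFLINE,
						 "crash/elfcorehdr:offline",
						 NULL, crash_cpuhp_offline);
	}
	subsys_initcall(crash_hotplug_init);
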
>>>>>>>>>
>>>>>>>>> TBH, that's still overengineered. Something like this:
>>>>>>>>>
>>>>>>>>> bool cpu_is_alive(unsigned int cpu)
>>>>>>>>> {
>>>>>>>>>     struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
>>>>>>>>>
>>>>>>>>>     return data_race(st->state) > CPUHP_AP_IDLE_DEAD;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> and use this to query the actual state at crash time. That spares all
>>>>>>>>> those callback heuristics.
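
For instance, the crash path could then simply skip cpus that were not
actually up, instead of tracking transitions via cpuhp callbacks (a
hypothetical sketch; the surrounding loop is illustrative only):

	for_each_present_cpu(cpu) {
		if (!cpu_is_alive(cpu))
			continue;	/* offlined, or never brought up */
		/* ... record crash data for this live cpu ... */
	}
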
>>>>>>>>>
>>>>>>>>>> I'm making my way through percpu crash_notes, elfcorehdr, vmcoreinfo,
>>>>>>>>>> makedumpfile and (the consumer of it all) the userspace crash utility,
>>>>>>>>>> in order to understand the impact of moving from for_each_present_cpu()
>>>>>>>>>> to for_each_online_cpu().
>>>>>>>>>
>>>>>>>>> Is the packing actually worth the trouble? What's the actual win?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>>          tglx
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> Thomas,
>>>>>>>> I've investigated the passing of crash notes through the vmcore. What I've learned is that:
>>>>>>>>
>>>>>>>> - linux/fs/proc/vmcore.c (which makedumpfile references to do its job) does
>>>>>>>>   not care what the contents of cpu PT_NOTES are, but it does coalesce them together.
>>>>>>>>
>>>>>>>> - makedumpfile will count the number of cpu PT_NOTES in order to determine its
>>>>>>>>   nr_cpus variable, which is reported in a header, but otherwise unused (except
>>>>>>>>   for sadump method).
>>>>>>>>
>>>>>>>> - the crash utility, for the purposes of determining the cpus, does not appear to
>>>>>>>>   reference the elfcorehdr PT_NOTEs. Instead it locates the various
>>>>>>>>   cpu_[possible|present|online]_mask and computes nr_cpus from that, and also of
>>>>>>>>   course which are online. In addition, when crash does reference the cpu PT_NOTE,
>>>>>>>>   to get its prstatus, it does so by using a percpu technique directly in the vmcore
>>>>>>>>   image memory, not via the ELF structure. Said differently, it appears to me that
>>>>>>>>   crash utility doesn't rely on the ELF PT_NOTEs for cpus; rather it obtains them
>>>>>>>>   via kernel cpumasks and the memory within the vmcore.
>>>>>>>>
>>>>>>>> With this understanding, I did some testing. Perhaps the most telling test was that I
>>>>>>>> changed the number of cpu PT_NOTEs emitted in the crash_prepare_elf64_headers() to just 1,
>>>>>>>> hot plugged some cpus, then also took a few offline sparsely via chcpu, then generated a
>>>>>>>> vmcore. The crash utility had no problem loading the vmcore; it reported the proper number
>>>>>>>> of cpus and the number offline (despite only one cpu PT_NOTE), and after changing to a different
>>>>>>>> cpu via 'set -c 30' the backtrace was completely valid.
>>>>>>>>
>>>>>>>> My takeaway is that the crash utility does not rely upon ELF cpu PT_NOTEs; it obtains the
>>>>>>>> cpu information directly from kernel data structures. Perhaps at one time crash relied
>>>>>>>> upon the ELF information, but no more. (Perhaps there are other crash dump analyzers
>>>>>>>> that might rely on the ELF info?)
>>>>>>>>
>>>>>>>> So, all this to say that I see no need to change crash_prepare_elf64_headers(). There
>>>>>>>> is no compelling reason to move away from for_each_present_cpu(), or modify the list for
>>>>>>>> online/offline.
>>>>>>>>
>>>>>>>> Which then leaves the topic of the cpuhp state on which to register. Perhaps reverting
>>>>>>>> back to the use of CPUHP_BP_PREPARE_DYN is the right answer. There does not appear to
>>>>>>>> be a compelling need to accurately track whether the cpu went online/offline for the
>>>>>>>> purposes of creating the elfcorehdr, as ultimately the crash utility pulls that from
>>>>>>>> kernel data structures, not the elfcorehdr.
>>>>>>>>
>>>>>>>> I think this is what Sourabh has known and has been advocating for: an optimization
>>>>>>>> path that allows not regenerating the elfcorehdr on cpu changes (because the percpu
>>>>>>>> structs are already laid out). I do think it best to leave that as an arch choice.
>>>>>>>
>>>>>>> Since things are clear on how the PT_NOTES are consumed in the kdump kernel [fs/proc/vmcore.c],
>>>>>>> makedumpfile, and the crash tool, I need your opinion on this:
>>>>>>>
>>>>>>> Do we really need to regenerate elfcorehdr for CPU hotplug events?
>>>>>>> If yes, can you please list the elfcorehdr components that change due to CPU hotplug?
>>>>>> Due to the use of for_each_present_cpu(), it is possible for the number of cpu PT_NOTEs
>>>>>> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus does not impact the
>>>>>> number of cpu PT_NOTEs (as the cpus are still present).
>>>>>>
>>>>>>>
>>>>>>>  From what I understood, crash notes are prepared for possible CPUs as system boots and
>>>>>>> could be used to create a PT_NOTE section for each possible CPU while generating the elfcorehdr
>>>>>>> during the kdump kernel load.
>>>>>>>
>>>>>>> Now once the elfcorehdr is loaded with PT_NOTEs for every possible CPU there is no need to
>>>>>>> regenerate it for CPU hotplug events. Or do we?
>>>>>>
>>>>>> For onlining/offlining of cpus, there is no need to regenerate the elfcorehdr. However,
>>>>>> for actual hot un/plug of cpus, the answer is yes due to for_each_present_cpu(). The
>>>>>> caveat here of course is that if crash utility is the only coredump analyzer of concern,
>>>>>> then it doesn't care about these cpu PT_NOTEs and there would be no need to re-generate them.
>>>>>>
>>>>>> Also, I'm not sure if ARM cpu hotplug, which is just now coming into mainstream, impacts
>>>>>> any of this.
>>>>>>
>>>>>> Perhaps the one item that might help here is to distinguish between actual hot un/plug of
>>>>>> cpus, versus onlining/offlining. At the moment, I can not distinguish between a hot plug
>>>>>> event and an online event (and unplug/offline). If those were distinguishable, then we
>>>>>> could only regenerate on un/plug events.
>>>>>>
>>>>>> Or perhaps moving to for_each_possible_cpu() is the better choice?
>>>>>
>>>>> Yes, because once elfcorehdr is built with possible CPUs we don't have to worry about
>>>>> hot[un]plug case.
>>>>>
>>>>> Here is my view on how things should be handled if a core-dump analyzer is dependent on
>>>>> elfcorehdr PT_NOTEs to find online/offline CPUs.
>>>>>
>>>>> A PT_NOTE in elfcorehdr holds the address of the corresponding crash notes (the kernel has
>>>>> one crash note per CPU for every possible CPU). Though the crash notes are allocated
>>>>> at boot time, they are populated when the system is on the crash path.
>>>>>
>>>>> This is how crash notes are populated on PowerPC and I am expecting it would be something
>>>>> similar on other architectures too.
>>>>>
>>>>> The crashing CPU sends an IPI to every other online CPU with a callback function that updates the
>>>>> crash notes of that specific CPU. Once the IPI completes, the crashing CPU updates its own crash
>>>>> note and proceeds further.
>>>>>
>>>>> The crash notes of CPUs remain uninitialized if the CPUs were offline or hot unplugged at the time
>>>>> of the system crash. The core-dump analyzer should be able to identify [un]initialized crash notes
>>>>> and display the information accordingly.
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> - Sourabh
>>>>
>>>> I've been examining what it would mean to move to for_each_possible_cpu() in 
>>>> crash_prepare_elf64_headers(). I think it means:
>>>>
>>>> - Changing for_each_present_cpu() to for_each_possible_cpu() in crash_prepare_elf64_headers().
>>>> - For kexec_load() syscall path, rewrite the incoming/supplied elfcorehdr immediately on the 
>>>> load with the elfcorehdr generated by crash_prepare_elf64_headers().
>>>> - Eliminate/remove the cpuhp machinery for handling crash hotplug events.
>>>
>>> If for_each_present_cpu is replaced with for_each_possible_cpu I still need the cpuhp machinery
>>> to update the FDT kexec segment for the CPU hot-add case.
>>
>> Ah, ok, that's important! So the cpuhp callbacks are still needed.
>>>
>>>
>>>>
>>>> This would then set up PT_NOTEs for all possible cpus, which should in theory accommodate crash
>>>> analyzers that rely on ELF PT_NOTEs for crash_notes.
>>>>
>>>> If staying with for_each_present_cpu() is ultimately decided, then I think the cpuhp
>>>> machinery should be left in place, and each arch could decide how to handle crash cpu hotplug events. The
>>>> overhead for doing this is very minimal, and the events are likely very infrequent.
>>>
>>> I agree. Some architectures may need cpuhp machinery to update kexec segment[s] other than
>>> elfcorehdr. For example, the FDT on PowerPC.
>>>
>>> - Sourabh Jain
>>
>> OK, I was thinking that the desire was to eliminate the cpuhp callbacks. In reality, the desire is 
>> to change to for_each_possible_cpu(). Given that the kernel creates crash_notes for all possible 
>> cpus upon kernel boot, there seems to be no reason to not do this?
>>
>> HOWEVER...
>>
>> It's not clear to me that this particular change needs to be part of this series. Its inclusion
>> would facilitate PPC support, but doesn't "solve" anything in general. In fact it causes
>> kexec_load and kexec_file_load to deviate (kexec_load via userspace kexec does the equivalent of
>> for_each_present_cpu() whereas with this change kexec_file_load would do for_each_possible_cpu();
>> until a hot plug event, after which both would do for_each_possible_cpu()). And if this change were to
>> arrive as part of Sourabh's PPC support, then it does not appear to impact x86 (not sure about 
>> other arches). And the 'crash' dump analyzer doesn't care either way.
>>
>> Including this change would enable an optimization path (for x86 at least) that short-circuits cpu 
>> hotplug changes in the arch crash handler, for example:
>>
>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>> index aca3f1817674..0883f6b11de4 100644
>> --- a/arch/x86/kernel/crash.c
>> +++ b/arch/x86/kernel/crash.c
>> @@ -473,6 +473,11 @@ void arch_crash_handle_hotplug_event(struct kimage *image)
>>     unsigned long mem, memsz;
>>     unsigned long elfsz = 0;
>>
>> +   if (image->file_mode && (
>> +       image->hp_action == KEXEC_CRASH_HP_ADD_CPU ||
>> +       image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU))
>> +       return;
>> +
>>     /*
>>      * Create the new elfcorehdr reflecting the changes to CPU and/or
>>      * memory resources.
>>
>> I'm not sure that is compelling given the infrequent nature of cpu hotplug events.
> It certainly closes/reduces the window where kdump is not active due to the kexec segment update.

Fair enough. I plan to include this change in v19.

> 
>>
>> In my mind I still have a question about the kexec_load() path. The userspace kexec cannot do the
>> equivalent of for_each_possible_cpu(). It can obtain the max possible cpus from
>> /sys/devices/system/cpu/possible, but for those cpus not present the /sys/devices/system/cpu/cpuXX
>> is not available and so the crash_notes entries are not available. My attempts to expose all cpuXX
>> led to odd behavior requiring changes in ACPI and arch code that looked untenable.
>>
>> There seem to be these options available for the kexec_load() path:
>> - immediately rewrite the elfcorehdr upon load via a call to crash_prepare_elf64_headers(). I've 
>> made this work with the following, as proof of concept:
> Yes regenerating/patching the elfcorehdr could be an option for kexec_load syscall.
So this is not needed by x86, but more so by ppc. Should this change be in the ppc set or this set?


> 
>>
>> diff --git a/kernel/kexec.c b/kernel/kexec.c
>> index cb8e6e6f983c..4eb201270f97 100644
>> --- a/kernel/kexec.c
>> +++ b/kernel/kexec.c
>> @@ -163,6 +163,12 @@ static int do_kexec_load(unsigned long entry, unsigned long
>>     kimage_free(image);
>>  out_unlock:
>>     kexec_unlock();
>> +   if (IS_ENABLED(CONFIG_CRASH_HOTPLUG)) {
>> +       if ((flags & KEXEC_ON_CRASH) && kexec_crash_image) {
>> +           crash_handle_hotplug_event(KEXEC_CRASH_HP_NONE, KEXEC_CRASH_HP_INVALID_CPU);
>> +       }
>> +   }
>>     return ret;
>>  }
>>
>> - Another option is to spend the time to determine whether exposing all cpuXX is a viable solution; I
>> have no idea what the impacts to userspace of possible-but-not-yet-present cpuXX entries
>> would be. It might also mean requiring a 'present' entry available within the cpuXX.
>>
>> - Another option is to simply let the hot plug events rewrite the elfcorehdr on demand. This is
>> what I originally put forth, but I am not sure how this impacts PPC given the for_each_possible_cpu()
>> change.
> Given that /sys/devices/system/cpu/cpuXX is not present for possible-but-not-yet-present CPUs, I am
> wondering whether we even have crash notes for possible CPUs on x86?
Yes, there are crash_notes for all possible cpus on x86.
eric
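
For reference, the boot-time allocation in question is essentially the
following (kernel/kexec_core.c's crash_notes_memory_init(), trimmed of
its percpu alignment handling); alloc_percpu() reserves a buffer for
every possible cpu, which is why the notes exist even for
not-yet-present cpus:

	static int __init crash_notes_memory_init(void)
	{
		/* Allocate memory for saving cpu registers. */
		crash_notes = alloc_percpu(note_buf_t);
		if (!crash_notes) {
			pr_warn("Memory allocation for saving cpu register states failed\n");
			return -ENOMEM;
		}
		return 0;
	}
	subsys_initcall(crash_notes_memory_init);
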

>>
>> The concern is that today, both kexec_load and kexec_file_load mirror each other with respect to
>> for_each_present_cpu(); that is, userspace kexec is able to generate the elfcorehdr for cpus the
>> same as kexec_file_load would. But by changing to for_each_possible_cpu(), the two would deviate.
> 
> Thanks,
> Sourabh Jain

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
  2023-02-28 21:50                           ` Eric DeVolder
@ 2023-03-01  6:22                             ` Sourabh Jain
  -1 siblings, 0 replies; 70+ messages in thread
From: Sourabh Jain @ 2023-03-01  6:22 UTC (permalink / raw)
  To: Eric DeVolder, Thomas Gleixner, linux-kernel, x86, kexec,
	ebiederm, dyoung, bhe, vgoyal
  Cc: mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky, robh,
	efault, rppt, david, konrad.wilk, boris.ostrovsky


On 01/03/23 03:20, Eric DeVolder wrote:
>
>
> On 2/27/23 00:11, Sourabh Jain wrote:
>>
>> On 25/02/23 01:46, Eric DeVolder wrote:
>>>
>>>
>>> On 2/24/23 02:34, Sourabh Jain wrote:
>>>>
>>>> On 24/02/23 02:04, Eric DeVolder wrote:
>>>>>
>>>>>
>>>>> On 2/10/23 00:29, Sourabh Jain wrote:
>>>>>>
>>>>>> On 10/02/23 01:09, Eric DeVolder wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2/9/23 12:43, Sourabh Jain wrote:
>>>>>>>> Hello Eric,
>>>>>>>>
>>>>>>>> On 09/02/23 23:01, Eric DeVolder wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2/8/23 07:44, Thomas Gleixner wrote:
>>>>>>>>>> Eric!
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
>>>>>>>>>>> On 2/1/23 05:33, Thomas Gleixner wrote:
>>>>>>>>>>>
>>>>>>>>>>> So my latest solution is to introduce two new CPUHP states,
>>>>>>>>>>> CPUHP_AP_ELFCOREHDR_ONLINE
>>>>>>>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. 
>>>>>>>>>>> I'm open to better names.
>>>>>>>>>>>
>>>>>>>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after 
>>>>>>>>>>> CPUHP_BRINGUP_CPU. My
>>>>>>>>>>> attempts at locating this state failed when inside the 
>>>>>>>>>>> STARTING section, so I located
>>>>>>>>>>> this just inside the ONLINE section. The crash hotplug
>>>>>>>>>>> handler is registered on
>>>>>>>>>>> this state as the callback for the .startup method.
>>>>>>>>>>>
>>>>>>>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before 
>>>>>>>>>>> CPUHP_TEARDOWN_CPU, and I
>>>>>>>>>>> placed it at the end of the PREPARE section. This crash 
>>>>>>>>>>> hotplug handler is also
>>>>>>>>>>> registered on this state as the callback for the .teardown 
>>>>>>>>>>> method.
>>>>>>>>>>
>>>>>>>>>> TBH, that's still overengineered. Something like this:
>>>>>>>>>>
>>>>>>>>>> bool cpu_is_alive(unsigned int cpu)
>>>>>>>>>> {
>>>>>>>>>>     struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
>>>>>>>>>>
>>>>>>>>>>     return data_race(st->state) > CPUHP_AP_IDLE_DEAD;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> and use this to query the actual state at crash time. That 
>>>>>>>>>> spares all
>>>>>>>>>> those callback heuristics.
>>>>>>>>>>
>>>>>>>>>>> I'm making my way through percpu crash_notes, elfcorehdr,
>>>>>>>>>>> vmcoreinfo,
>>>>>>>>>>> makedumpfile and (the consumer of it all) the userspace 
>>>>>>>>>>> crash utility,
>>>>>>>>>>> in order to understand the impact of moving from 
>>>>>>>>>>> for_each_present_cpu()
>>>>>>>>>>> to for_each_online_cpu().
>>>>>>>>>>
>>>>>>>>>> Is the packing actually worth the trouble? What's the actual 
>>>>>>>>>> win?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>>          tglx
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thomas,
>>>>>>>>> I've investigated the passing of crash notes through the 
>>>>>>>>> vmcore. What I've learned is that:
>>>>>>>>>
>>>>>>>>> - linux/fs/proc/vmcore.c (which makedumpfile references to do 
>>>>>>>>> its job) does
>>>>>>>>>   not care what the contents of cpu PT_NOTES are, but it does 
>>>>>>>>> coalesce them together.
>>>>>>>>>
>>>>>>>>> - makedumpfile will count the number of cpu PT_NOTES in order 
>>>>>>>>> to determine its
>>>>>>>>>   nr_cpus variable, which is reported in a header, but 
>>>>>>>>> otherwise unused (except
>>>>>>>>>   for sadump method).
>>>>>>>>>
>>>>>>>>> - the crash utility, for the purposes of determining the cpus, 
>>>>>>>>> does not appear to
>>>>>>>>>   reference the elfcorehdr PT_NOTEs. Instead it locates the 
>>>>>>>>> various
>>>>>>>>>   cpu_[possible|present|online]_mask and computes nr_cpus from 
>>>>>>>>> that, and also of
>>>>>>>>>   course which are online. In addition, when crash does 
>>>>>>>>> reference the cpu PT_NOTE,
>>>>>>>>>   to get its prstatus, it does so by using a percpu technique 
>>>>>>>>> directly in the vmcore
>>>>>>>>>   image memory, not via the ELF structure. Said differently, 
>>>>>>>>> it appears to me that
>>>>>>>>>   crash utility doesn't rely on the ELF PT_NOTEs for cpus; 
>>>>>>>>> rather it obtains them
>>>>>>>>>   via kernel cpumasks and the memory within the vmcore.
>>>>>>>>>
>>>>>>>>> With this understanding, I did some testing. Perhaps the most 
>>>>>>>>> telling test was that I
>>>>>>>>> changed the number of cpu PT_NOTEs emitted in the 
>>>>>>>>> crash_prepare_elf64_headers() to just 1,
>>>>>>>>> hot plugged some cpus, then also took a few offline sparsely 
>>>>>>>>> via chcpu, then generated a
>>>>>>>>> vmcore. The crash utility had no problem loading the vmcore;
>>>>>>>>> it reported the proper number
>>>>>>>>> of cpus and the number offline (despite only one cpu PT_NOTE),
>>>>>>>>> and after changing to a different
>>>>>>>>> cpu via 'set -c 30' the backtrace was completely valid.
>>>>>>>>>
>>>>>>>>> My takeaway is that the crash utility does not rely upon ELF cpu
>>>>>>>>> PT_NOTEs; it obtains the
>>>>>>>>> cpu information directly from kernel data structures. Perhaps 
>>>>>>>>> at one time crash relied
>>>>>>>>> upon the ELF information, but no more. (Perhaps there are 
>>>>>>>>> other crash dump analyzers
>>>>>>>>> that might rely on the ELF info?)
>>>>>>>>>
>>>>>>>>> So, all this to say that I see no need to change 
>>>>>>>>> crash_prepare_elf64_headers(). There
>>>>>>>>> is no compelling reason to move away from 
>>>>>>>>> for_each_present_cpu(), or modify the list for
>>>>>>>>> online/offline.
>>>>>>>>>
>>>>>>>>> Which then leaves the topic of the cpuhp state on which to 
>>>>>>>>> register. Perhaps reverting
>>>>>>>>> back to the use of CPUHP_BP_PREPARE_DYN is the right answer. 
>>>>>>>>> There does not appear to
>>>>>>>>> be a compelling need to accurately track whether the cpu went 
>>>>>>>>> online/offline for the
>>>>>>>>> purposes of creating the elfcorehdr, as ultimately the crash 
>>>>>>>>> utility pulls that from
>>>>>>>>> kernel data structures, not the elfcorehdr.
>>>>>>>>>
>>>>>>>>> I think this is what Sourabh has known and has been advocating
>>>>>>>>> for: an optimization
>>>>>>>>> path that allows not regenerating the elfcorehdr on cpu
>>>>>>>>> changes (because the percpu
>>>>>>>>> structs are already laid out). I do think it best to leave that as
>>>>>>>>> an arch choice.
>>>>>>>>
>>>>>>>> Since things are clear on how the PT_NOTES are consumed in
>>>>>>>> the kdump kernel [fs/proc/vmcore.c],
>>>>>>>> makedumpfile, and the crash tool, I need your opinion on this:
>>>>>>>>
>>>>>>>> Do we really need to regenerate elfcorehdr for CPU hotplug events?
>>>>>>>> If yes, can you please list the elfcorehdr components that
>>>>>>>> change due to CPU hotplug?
>>>>>>> Due to the use of for_each_present_cpu(), it is possible for the 
>>>>>>> number of cpu PT_NOTEs
>>>>>>> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus 
>>>>>>> does not impact the
>>>>>>> number of cpu PT_NOTEs (as the cpus are still present).
>>>>>>>
>>>>>>>>
>>>>>>>>  From what I understood, crash notes are prepared for possible 
>>>>>>>> CPUs as system boots and
>>>>>>>> could be used to create a PT_NOTE section for each possible CPU 
>>>>>>>> while generating the elfcorehdr
>>>>>>>> during the kdump kernel load.
>>>>>>>>
>>>>>>>> Now once the elfcorehdr is loaded with PT_NOTEs for every 
>>>>>>>> possible CPU there is no need to
>>>>>>>> regenerate it for CPU hotplug events. Or do we?
>>>>>>>
>>>>>>> For onlining/offlining of cpus, there is no need to regenerate 
>>>>>>> the elfcorehdr. However,
>>>>>>> for actual hot un/plug of cpus, the answer is yes due to 
>>>>>>> for_each_present_cpu(). The
>>>>>>> caveat here of course is that if crash utility is the only 
>>>>>>> coredump analyzer of concern,
>>>>>>> then it doesn't care about these cpu PT_NOTEs and there would be 
>>>>>>> no need to re-generate them.
>>>>>>>
>>>>>>> Also, I'm not sure if ARM cpu hotplug, which is just now coming 
>>>>>>> into mainstream, impacts
>>>>>>> any of this.
>>>>>>>
>>>>>>> Perhaps the one item that might help here is to distinguish 
>>>>>>> between actual hot un/plug of
>>>>>>> cpus, versus onlining/offlining. At the moment, I can not 
>>>>>>> distinguish between a hot plug
>>>>>>> event and an online event (and unplug/offline). If those were 
>>>>>>> distinguishable, then we
>>>>>>> could only regenerate on un/plug events.
>>>>>>>
>>>>>>> Or perhaps moving to for_each_possible_cpu() is the better choice?
>>>>>>
>>>>>> Yes, because once the elfcorehdr is built with possible CPUs we
>>>>>> don't have to worry about the hot[un]plug case.
>>>>>>
>>>>>> Here is my view on how things should be handled if a core-dump
>>>>>> analyzer is dependent on elfcorehdr PT_NOTEs to find
>>>>>> online/offline CPUs.
>>>>>>
>>>>>> A PT_NOTE in the elfcorehdr holds the address of the corresponding
>>>>>> crash notes (the kernel has one crash note per CPU for every
>>>>>> possible CPU). Though the crash notes are allocated at boot time,
>>>>>> they are populated when the system is on the crash path.
>>>>>>
>>>>>> This is how crash notes are populated on PowerPC, and I expect it
>>>>>> is something similar on other architectures too.
>>>>>>
>>>>>> The crashing CPU sends an IPI to every other online CPU with a
>>>>>> callback function that updates the crash notes of that specific
>>>>>> CPU. Once the IPI completes, the crashing CPU updates its own
>>>>>> crash note and proceeds further.
>>>>>>
>>>>>> The crash notes of CPUs remain uninitialized if the CPUs were
>>>>>> offline or hot unplugged at the time of the system crash. The
>>>>>> core-dump analyzer should be able to identify [un]initialized
>>>>>> crash notes and display the information accordingly.
>>>>>>
>>>>>> Thoughts?
>>>>>>
>>>>>> - Sourabh
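
(For illustration only: a minimal userspace sketch of the detection
described above, assuming the analyzer has already read one CPU's
crash_notes buffer out of the vmcore. A note that was never written at
crash time is still all zeroes, so an empty ELF note header marks an
offline or unplugged CPU. This helper is hypothetical, not taken from
any existing tool.)

	#include <elf.h>
	#include <stddef.h>

	static int crash_note_initialized(const void *buf, size_t len)
	{
		const Elf64_Nhdr *nhdr = buf;

		if (len < sizeof(*nhdr))
			return 0;
		/* A zero-filled header means the CPU never wrote its note. */
		return nhdr->n_namesz != 0 && nhdr->n_descsz != 0;
	}
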
>>>>>
>>>>> I've been examining what it would mean to move to 
>>>>> for_each_possible_cpu() in crash_prepare_elf64_headers(). I think 
>>>>> it means:
>>>>>
>>>>> - Changing for_each_present_cpu() to for_each_possible_cpu() in 
>>>>> crash_prepare_elf64_headers().
>>>>> - For kexec_load() syscall path, rewrite the incoming/supplied 
>>>>> elfcorehdr immediately on the load with the elfcorehdr generated 
>>>>> by crash_prepare_elf64_headers().
>>>>> - Eliminate/remove the cpuhp machinery for handling crash hotplug 
>>>>> events.
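
(For reference, the first item above amounts to a one-line change; a
hypothetical sketch of the diff, wherever crash_prepare_elf64_headers()
lives in the tree at hand:)

	-	for_each_present_cpu(cpu) {
	+	for_each_possible_cpu(cpu) {
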
>>>>
>>>> If for_each_present_cpu() is replaced with for_each_possible_cpu(),
>>>> I still need the cpuhp machinery to update the FDT kexec segment
>>>> for the CPU hot add case.
>>>
>>> Ah, ok, that's important! So the cpuhp callbacks are still needed.
>>>>
>>>>
>>>>>
>>>>> This would then set up PT_NOTEs for all possible cpus, which should 
>>>>> in theory accommodate crash analyzers that rely on ELF PT_NOTEs 
>>>>> for crash_notes.
>>>>>
>>>>> If staying with for_each_present_cpu() is ultimately decided, then
>>>>> I think we leave the cpuhp machinery in place and each arch can
>>>>> decide how to handle crash cpu hotplug events. The overhead for
>>>>> doing this is very minimal, and the events are likely very
>>>>> infrequent.
>>>>
>>>> I agree. Some architectures may need the cpuhp machinery to update
>>>> kexec segment[s] other than the elfcorehdr, for example the FDT on PowerPC.
>>>>
>>>> - Sourabh Jain
>>>
>>> OK, I was thinking that the desire was to eliminate the cpuhp 
>>> callbacks. In reality, the desire is to change to 
>>> for_each_possible_cpu(). Given that the kernel creates crash_notes 
>>> for all possible cpus upon kernel boot, there seems to be no reason 
>>> to not do this?
>>>
>>> HOWEVER...
>>>
>>> It's not clear to me that this particular change needs to be part of
>>> this series. Its inclusion would facilitate PPC support, but
>>> doesn't "solve" anything in general. In fact it causes kexec_load
>>> and kexec_file_load to deviate (kexec_load via userspace kexec does
>>> the equivalent of for_each_present_cpu(), whereas with this change
>>> kexec_file_load would do for_each_possible_cpu(); after a hot plug
>>> event both would do for_each_possible_cpu()). And if this
>>> change were to arrive as part of Sourabh's PPC support, then it does
>>> not appear to impact x86 (not sure about other arches). And the
>>> 'crash' dump analyzer doesn't care either way.
>>>
>>> Including this change would enable an optimization path (for x86 at 
>>> least) that short-circuits cpu hotplug changes in the arch crash 
>>> handler, for example:
>>>
>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>>> index aca3f1817674..0883f6b11de4 100644
>>> --- a/arch/x86/kernel/crash.c
>>> +++ b/arch/x86/kernel/crash.c
>>> @@ -473,6 +473,11 @@ void arch_crash_handle_hotplug_event(struct kimage *image)
>>>     unsigned long mem, memsz;
>>>     unsigned long elfsz = 0;
>>>
>>> +   if (image->file_mode && (
>>> +       image->hp_action == KEXEC_CRASH_HP_ADD_CPU ||
>>> +       image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU))
>>> +       return;
>>> +
>>>     /*
>>>      * Create the new elfcorehdr reflecting the changes to CPU and/or
>>>      * memory resources.
>>>
>>> I'm not sure that is compelling given the infrequent nature of cpu 
>>> hotplug events.
>> It certainly closes/reduces the window where kdump is not active due
>> to a kexec segment update.
>
> Fair enough. I plan to include this change in v19.
>
>>
>>>
>>> In my mind I still have a question about the kexec_load() path. The
>>> userspace kexec cannot do the equivalent of
>>> for_each_possible_cpu(). It can obtain max possible cpus from
>>> /sys/devices/system/cpu/possible, but for those cpus not present the
>>> /sys/devices/system/cpu/cpuXX is not available and so the
>>> crash_notes entries are not available. My attempts to expose all
>>> cpuXX led to odd behavior that required changes in ACPI and
>>> arch code that looked untenable.
>>>
>>> There seem to be these options available for kexec_load() path:
>>> - immediately rewrite the elfcorehdr upon load via a call to 
>>> crash_prepare_elf64_headers(). I've made this work with the 
>>> following, as proof of concept:
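
(The proof of concept itself is snipped from the quote above. Purely as
a hypothetical sketch of the idea -- elfcorehdr_index and
get_crash_mem_ranges() are assumed names for illustration, not the
actual patch -- regenerate the elfcorehdr with the kernel's own CPU
view right after kexec_load() copies in the user-built segments:)

	static int crash_rewrite_elfcorehdr(struct kimage *image)
	{
		struct kexec_segment *seg;
		struct crash_mem *cmem;
		unsigned long elf_sz;
		void *elf_addr, *dst;
		int ret;

		seg = &image->segment[image->elfcorehdr_index];
		cmem = get_crash_mem_ranges();	/* assumed arch helper */
		if (!cmem)
			return -ENOMEM;

		ret = crash_prepare_elf64_headers(cmem, 1, &elf_addr, &elf_sz);
		vfree(cmem);
		if (ret)
			return ret;

		/* Overwrite the userspace-provided elfcorehdr segment;
		 * a single-page elfcorehdr is assumed for brevity. */
		if (elf_sz > seg->memsz) {
			vfree(elf_addr);
			return -ENOMEM;
		}
		dst = kmap_local_page(pfn_to_page(seg->mem >> PAGE_SHIFT));
		memcpy(dst, elf_addr, elf_sz);
		kunmap_local(dst);
		vfree(elf_addr);
		return 0;
	}
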
>> Yes, regenerating/patching the elfcorehdr could be an option for
>> the kexec_load syscall.
> So this is not needed by x86, but more so by ppc. Should this change 
> be in the ppc set or this set?
Since /sys/devices/system/cpu/cpuXX represents possible CPUs on PowerPC,
there is no need for elfcorehdr regeneration on PowerPC for the kexec_load
case for CPU hotplug events.

My ask is: keep the cpuhp machinery so that architectures can update
other kexec segments, if needed, for the CPU add/remove case.

In case x86 has nothing to update on CPU hotplug events and you want to
remove the CPU hotplug machinery, I can add the same in the ppc patch
series.

Thanks,
Sourabh Jain

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
  2023-03-01  6:22                             ` Sourabh Jain
@ 2023-03-01 14:16                               ` Eric DeVolder
  -1 siblings, 0 replies; 70+ messages in thread
From: Eric DeVolder @ 2023-03-01 14:16 UTC (permalink / raw)
  To: Sourabh Jain, Thomas Gleixner, linux-kernel, x86, kexec,
	ebiederm, dyoung, bhe, vgoyal
  Cc: mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky, robh,
	efault, rppt, david, konrad.wilk, boris.ostrovsky



On 3/1/23 00:22, Sourabh Jain wrote:
> 
> On 01/03/23 03:20, Eric DeVolder wrote:
>>
>>
>> On 2/27/23 00:11, Sourabh Jain wrote:
>>>
>>> On 25/02/23 01:46, Eric DeVolder wrote:
>>>>
>>>>
>>>> On 2/24/23 02:34, Sourabh Jain wrote:
>>>>>
>>>>> On 24/02/23 02:04, Eric DeVolder wrote:
>>>>>>
>>>>>>
>>>>>> On 2/10/23 00:29, Sourabh Jain wrote:
>>>>>>>
>>>>>>> On 10/02/23 01:09, Eric DeVolder wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2/9/23 12:43, Sourabh Jain wrote:
>>>>>>>>> Hello Eric,
>>>>>>>>>
>>>>>>>>> On 09/02/23 23:01, Eric DeVolder wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 2/8/23 07:44, Thomas Gleixner wrote:
>>>>>>>>>>> Eric!
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
>>>>>>>>>>>> On 2/1/23 05:33, Thomas Gleixner wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> So my latest solution is to introduce two new CPUHP states, CPUHP_AP_ELFCOREHDR_ONLINE
>>>>>>>>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm open to better names.
>>>>>>>>>>>>
>>>>>>>>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after CPUHP_BRINGUP_CPU. My
>>>>>>>>>>>> attempts at locating this state failed when inside the STARTING section, so I located
>>>>>>>>>>>> this just inside the ONLINE section. The crash hotplug handler is registered on
>>>>>>>>>>>> this state as the callback for the .startup method.
>>>>>>>>>>>>
>>>>>>>>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before CPUHP_TEARDOWN_CPU, and I
>>>>>>>>>>>> placed it at the end of the PREPARE section. This crash hotplug handler is also
>>>>>>>>>>>> registered on this state as the callback for the .teardown method.
>>>>>>>>>>>
>>>>>>>>>>> TBH, that's still overengineered. Something like this:
>>>>>>>>>>>
>>>>>>>>>>> bool cpu_is_alive(unsigned int cpu)
>>>>>>>>>>> {
>>>>>>>>>>>     struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
>>>>>>>>>>>
>>>>>>>>>>>     return data_race(st->state) > CPUHP_AP_IDLE_DEAD;
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> and use this to query the actual state at crash time. That spares all
>>>>>>>>>>> those callback heuristics.
>>>>>>>>>>>
>>>>>>>>>>>> I'm making my way through percpu crash_notes, elfcorehdr, vmcoreinfo,
>>>>>>>>>>>> makedumpfile and (the consumer of it all) the userspace crash utility,
>>>>>>>>>>>> in order to understand the impact of moving from for_each_present_cpu()
>>>>>>>>>>>> to for_each_online_cpu().
>>>>>>>>>>>
>>>>>>>>>>> Is the packing actually worth the trouble? What's the actual win?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>>          tglx
>>>>>>>>>>>
>>>>>>>>>>>
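
(For illustration only: a hedged sketch of how such a crash-time query
might be used, with record_cpu_note() standing in as an assumed
placeholder for whatever per-cpu bookkeeping the crash path performs:)

	for_each_possible_cpu(cpu) {
		if (!cpu_is_alive(cpu))
			continue;	/* offline/unplugged: note stays zeroed */
		record_cpu_note(cpu);	/* assumed placeholder */
	}
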
>>>>>>>>>>
>>>>>>>>>> Thomas,
>>>>>>>>>> I've investigated the passing of crash notes through the vmcore. What I've learned is that:
>>>>>>>>>>
>>>>>>>>>> - linux/fs/proc/vmcore.c (which makedumpfile references to do its job) does
>>>>>>>>>>   not care what the contents of cpu PT_NOTES are, but it does coalesce them together.
>>>>>>>>>>
>>>>>>>>>> - makedumpfile will count the number of cpu PT_NOTES in order to determine its
>>>>>>>>>>   nr_cpus variable, which is reported in a header, but otherwise unused (except
>>>>>>>>>>   for sadump method).
>>>>>>>>>>
>>>>>>>>>> - the crash utility, for the purposes of determining the cpus, does not appear to
>>>>>>>>>>   reference the elfcorehdr PT_NOTEs. Instead it locates the various
>>>>>>>>>>   cpu_[possible|present|online]_mask and computes nr_cpus from that, and also of
>>>>>>>>>>   course which are online. In addition, when crash does reference the cpu PT_NOTE,
>>>>>>>>>>   to get its prstatus, it does so by using a percpu technique directly in the vmcore
>>>>>>>>>>   image memory, not via the ELF structure. Said differently, it appears to me that
>>>>>>>>>>   crash utility doesn't rely on the ELF PT_NOTEs for cpus; rather it obtains them
>>>>>>>>>>   via kernel cpumasks and the memory within the vmcore.
>>>>>>>>>>
>>>>>>>>>> With this understanding, I did some testing. Perhaps the most telling test was that I
>>>>>>>>>> changed the number of cpu PT_NOTEs emitted in the crash_prepare_elf64_headers() to just 1,
>>>>>>>>>> hot plugged some cpus, then also took a few offline sparsely via chcpu, then generated a
>>>>>>>>>> vmcore. The crash utility had no problem loading the vmcore, it reported the proper number
>>>>>>>>>> of cpus and the number offline (despite only one cpu PT_NOTE), and changing to a different
>>>>>>>>>> cpu via 'set -c 30' and the backtrace was completely valid.
>>>>>>>>>>
>>>>>>>>>> My take away is that crash utility does not rely upon ELF cpu PT_NOTEs, it obtains the
>>>>>>>>>> cpu information directly from kernel data structures. Perhaps at one time crash relied
>>>>>>>>>> upon the ELF information, but no more. (Perhaps there are other crash dump analyzers
>>>>>>>>>> that might rely on the ELF info?)
>>>>>>>>>>
>>>>>>>>>> So, all this to say that I see no need to change crash_prepare_elf64_headers(). There
>>>>>>>>>> is no compelling reason to move away from for_each_present_cpu(), or modify the list for
>>>>>>>>>> online/offline.
>>>>>>>>>>
>>>>>>>>>> Which then leaves the topic of the cpuhp state on which to register. Perhaps reverting
>>>>>>>>>> back to the use of CPUHP_BP_PREPARE_DYN is the right answer. There does not appear to
>>>>>>>>>> be a compelling need to accurately track whether the cpu went online/offline for the
>>>>>>>>>> purposes of creating the elfcorehdr, as ultimately the crash utility pulls that from
>>>>>>>>>> kernel data structures, not the elfcorehdr.
>>>>>>>>>>
>>>>>>>>>> I think this is what Sourabh has known and has been advocating: an optimization
>>>>>>>>>> path that avoids regenerating the elfcorehdr on cpu changes (because all the percpu
>>>>>>>>>> structs are already laid out). I do think it best to leave that as an arch choice.
>>>>>>>>>
>>>>>>>>> Since things are clear on how the PT_NOTES are consumed in kdump kernel [fs/proc/vmcore.c],
>>>>>>>>> makedumpfile, and crash tool I need your opinion on this:
>>>>>>>>>
>>>>>>>>> Do we really need to regenerate elfcorehdr for CPU hotplug events?
>>>>>>>>> If yes, can you please list the elfcorehdr components that change due to CPU hotplug.
>>>>>>>> Due to the use of for_each_present_cpu(), it is possible for the number of cpu PT_NOTEs
>>>>>>>> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus does not impact the
>>>>>>>> number of cpu PT_NOTEs (as the cpus are still present).
>>>>>>>>
>>>>>>>>>
>>>>>>>>>  From what I understood, crash notes are prepared for possible CPUs as system boots and
>>>>>>>>> could be used to create a PT_NOTE section for each possible CPU while generating the 
>>>>>>>>> elfcorehdr
>>>>>>>>> during the kdump kernel load.
>>>>>>>>>
>>>>>>>>> Now once the elfcorehdr is loaded with PT_NOTEs for every possible CPU there is no need to
>>>>>>>>> regenerate it for CPU hotplug events. Or do we?
>>>>>>>>
>>>>>>>> For onlining/offlining of cpus, there is no need to regenerate the elfcorehdr. However,
>>>>>>>> for actual hot un/plug of cpus, the answer is yes due to for_each_present_cpu(). The
>>>>>>>> caveat here of course is that if crash utility is the only coredump analyzer of concern,
>>>>>>>> then it doesn't care about these cpu PT_NOTEs and there would be no need to re-generate them.
>>>>>>>>
>>>>>>>> Also, I'm not sure if ARM cpu hotplug, which is just now coming into mainstream, impacts
>>>>>>>> any of this.
>>>>>>>>
>>>>>>>> Perhaps the one item that might help here is to distinguish between actual hot un/plug of
>>>>>>>> cpus, versus onlining/offlining. At the moment, I can not distinguish between a hot plug
>>>>>>>> event and an online event (and unplug/offline). If those were distinguishable, then we
>>>>>>>> could only regenerate on un/plug events.
>>>>>>>>
>>>>>>>> Or perhaps moving to for_each_possible_cpu() is the better choice?
>>>>>>>
>>>>>>> Yes, because once the elfcorehdr is built with possible CPUs we don't have to worry about
>>>>>>> the hot[un]plug case.
>>>>>>>
>>>>>>> Here is my view on how things should be handled if a core-dump analyzer is dependent on
>>>>>>> elfcorehdr PT_NOTEs to find online/offline CPUs.
>>>>>>>
>>>>>>> A PT_NOTE in the elfcorehdr holds the address of the corresponding crash notes (the kernel has
>>>>>>> one crash note per CPU for every possible CPU). Though the crash notes are allocated
>>>>>>> at boot time, they are populated when the system is on the crash path.
>>>>>>>
>>>>>>> This is how crash notes are populated on PowerPC, and I expect it is something
>>>>>>> similar on other architectures too.
>>>>>>>
>>>>>>> The crashing CPU sends an IPI to every other online CPU with a callback function that updates the
>>>>>>> crash notes of that specific CPU. Once the IPI completes, the crashing CPU updates its own crash
>>>>>>> note and proceeds further.
>>>>>>>
>>>>>>> The crash notes of CPUs remain uninitialized if the CPUs were offline or hot unplugged at the
>>>>>>> time of the system crash. The core-dump analyzer should be able to identify [un]initialized
>>>>>>> crash notes and display the information accordingly.
>>>>>>>
>>>>>>> Thoughts?
>>>>>>>
>>>>>>> - Sourabh
>>>>>>
>>>>>> I've been examining what it would mean to move to for_each_possible_cpu() in 
>>>>>> crash_prepare_elf64_headers(). I think it means:
>>>>>>
>>>>>> - Changing for_each_present_cpu() to for_each_possible_cpu() in crash_prepare_elf64_headers().
>>>>>> - For kexec_load() syscall path, rewrite the incoming/supplied elfcorehdr immediately on the 
>>>>>> load with the elfcorehdr generated by crash_prepare_elf64_headers().
>>>>>> - Eliminate/remove the cpuhp machinery for handling crash hotplug events.
>>>>>
>>>>> If for_each_present_cpu() is replaced with for_each_possible_cpu(), I still need the cpuhp
>>>>> machinery to update the FDT kexec segment for the CPU hot add case.
>>>>
>>>> Ah, ok, that's important! So the cpuhp callbacks are still needed.
>>>>>
>>>>>
>>>>>>
>>>>>> This would then set up PT_NOTEs for all possible cpus, which should in theory accommodate crash 
>>>>>> analyzers that rely on ELF PT_NOTEs for crash_notes.
>>>>>>
>>>>>> If staying with for_each_present_cpu() is ultimately decided, then I think we leave the cpuhp
>>>>>> machinery in place and each arch can decide how to handle crash cpu hotplug events. The
>>>>>> overhead for doing this is very minimal, and the events are likely very infrequent.
>>>>>
>>>>> I agree. Some architectures may need the cpuhp machinery to update kexec segment[s] other than
>>>>> the elfcorehdr, for example the FDT on PowerPC.
>>>>>
>>>>> - Sourabh Jain
>>>>
>>>> OK, I was thinking that the desire was to eliminate the cpuhp callbacks. In reality, the desire 
>>>> is to change to for_each_possible_cpu(). Given that the kernel creates crash_notes for all 
>>>> possible cpus upon kernel boot, there seems to be no reason to not do this?
>>>>
>>>> HOWEVER...
>>>>
>>>> It's not clear to me that this particular change needs to be part of this series. Its inclusion
>>>> would facilitate PPC support, but doesn't "solve" anything in general. In fact it causes
>>>> kexec_load and kexec_file_load to deviate (kexec_load via userspace kexec does the equivalent of
>>>> for_each_present_cpu(), whereas with this change kexec_file_load would do
>>>> for_each_possible_cpu(); after a hot plug event both would do for_each_possible_cpu()). And
>>>> if this change were to arrive as part of Sourabh's PPC support, then it does not appear to
>>>> impact x86 (not sure about other arches). And the 'crash' dump analyzer doesn't care either way.
>>>>
>>>> Including this change would enable an optimization path (for x86 at least) that short-circuits 
>>>> cpu hotplug changes in the arch crash handler, for example:
>>>>
>>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>>>> index aca3f1817674..0883f6b11de4 100644
>>>> --- a/arch/x86/kernel/crash.c
>>>> +++ b/arch/x86/kernel/crash.c
>>>> @@ -473,6 +473,11 @@ void arch_crash_handle_hotplug_event(struct kimage *image)
>>>>     unsigned long mem, memsz;
>>>>     unsigned long elfsz = 0;
>>>>
>>>> +   if (image->file_mode && (
>>>> +       image->hp_action == KEXEC_CRASH_HP_ADD_CPU ||
>>>> +       image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU))
>>>> +       return;
>>>> +
>>>>     /*
>>>>      * Create the new elfcorehdr reflecting the changes to CPU and/or
>>>>      * memory resources.
>>>>
>>>> I'm not sure that is compelling given the infrequent nature of cpu hotplug events.
>>> It certainly closes/reduces the window where kdump is not active due to a kexec segment update.
>>
>> Fair enough. I plan to include this change in v19.
>>
>>>
>>>>
>>>> In my mind I still have a question about the kexec_load() path. The userspace kexec cannot do the
>>>> equivalent of for_each_possible_cpu(). It can obtain max possible cpus from
>>>> /sys/devices/system/cpu/possible, but for those cpus not present the
>>>> /sys/devices/system/cpu/cpuXX is not available and so the crash_notes entries are not available.
>>>> My attempts to expose all cpuXX led to odd behavior that required changes in ACPI and arch
>>>> code that looked untenable.
>>>>
>>>> There seem to be these options available for kexec_load() path:
>>>> - immediately rewrite the elfcorehdr upon load via a call to crash_prepare_elf64_headers(). I've 
>>>> made this work with the following, as proof of concept:
>>> Yes, regenerating/patching the elfcorehdr could be an option for the kexec_load syscall.
>> So this is not needed by x86, but more so by ppc. Should this change be in the ppc set or this set?
> Since /sys/devices/system/cpu/cpuXX represents possible CPUs on PowerPC, there is no need for
> elfcorehdr regeneration on PowerPC for the kexec_load case for CPU hotplug events.
> 
> My ask is: keep the cpuhp machinery so that architectures can update other kexec segments, if
> needed, for the CPU add/remove case.
> 
> In case x86 has nothing to update on CPU hotplug events and you want to remove the CPU hotplug
> machinery, I can add the same in the ppc patch series.

I'll keep the cpuhp machinery; it is needed in particular for kexec_load usage, since we are
changing crash_prepare_elf64_headers() to for_each_possible_cpu().
eric
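
(For context, keeping the machinery means the registration itself stays
small; a minimal sketch using the dynamic prepare state, with
crash_cpuhp_online/offline as assumed callback names:)

	ret = cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN, "crash/cpuhp",
					crash_cpuhp_online,
					crash_cpuhp_offline);
	if (ret < 0)
		pr_warn("crash: cpuhp callback registration failed\n");
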

> 
> Thanks,
> Sourabh Jain

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
  2023-02-28 18:52                         ` Eric DeVolder
@ 2023-03-01 15:48                           ` Eric DeVolder
  -1 siblings, 0 replies; 70+ messages in thread
From: Eric DeVolder @ 2023-03-01 15:48 UTC (permalink / raw)
  To: Baoquan He, Sourabh Jain
  Cc: Thomas Gleixner, linux-kernel, x86, kexec, ebiederm, dyoung,
	vgoyal, mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky,
	robh, efault, rppt, david, konrad.wilk, boris.ostrovsky



On 2/28/23 12:52, Eric DeVolder wrote:
> 
> 
> On 2/28/23 06:44, Baoquan He wrote:
>> On 02/13/23 at 10:10am, Sourabh Jain wrote:
>>>
>>> On 11/02/23 06:05, Eric DeVolder wrote:
>>>>
>>>>
>>>> On 2/10/23 00:29, Sourabh Jain wrote:
>>>>>
>>>>> On 10/02/23 01:09, Eric DeVolder wrote:
>>>>>>
>>>>>>
>>>>>> On 2/9/23 12:43, Sourabh Jain wrote:
>>>>>>> Hello Eric,
>>>>>>>
>>>>>>> On 09/02/23 23:01, Eric DeVolder wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2/8/23 07:44, Thomas Gleixner wrote:
>>>>>>>>> Eric!
>>>>>>>>>
>>>>>>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
>>>>>>>>>> On 2/1/23 05:33, Thomas Gleixner wrote:
>>>>>>>>>>
>>>>>>>>>> So my latest solution is to introduce two new CPUHP
>>>>>>>>>> states, CPUHP_AP_ELFCOREHDR_ONLINE
>>>>>>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for
>>>>>>>>>> offlining. I'm open to better names.
>>>>>>>>>>
>>>>>>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be
>>>>>>>>>> placed after CPUHP_BRINGUP_CPU. My
>>>>>>>>>> attempts at locating this state failed when
>>>>>>>>>> inside the STARTING section, so I located
>>>>>>>>>> this just inside the ONLINE section. The crash
>>>>>>>>>> hotplug handler is registered on
>>>>>>>>>> this state as the callback for the .startup method.
>>>>>>>>>>
>>>>>>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be
>>>>>>>>>> placed before CPUHP_TEARDOWN_CPU, and I
>>>>>>>>>> placed it at the end of the PREPARE section.
>>>>>>>>>> This crash hotplug handler is also
>>>>>>>>>> registered on this state as the callback for the .teardown method.
>>>>>>>>>
>>>>>>>>> TBH, that's still overengineered. Something like this:
>>>>>>>>>
>>>>>>>>> bool cpu_is_alive(unsigned int cpu)
>>>>>>>>> {
>>>>>>>>>      struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
>>>>>>>>>
>>>>>>>>>      return data_race(st->state) > CPUHP_AP_IDLE_DEAD;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> and use this to query the actual state at crash
>>>>>>>>> time. That spares all
>>>>>>>>> those callback heuristics.
>>>>>>>>>
>>>>>>>>>> I'm making my way through percpu crash_notes,
>>>>>>>>>> elfcorehdr, vmcoreinfo,
>>>>>>>>>> makedumpfile and (the consumer of it all) the
>>>>>>>>>> userspace crash utility,
>>>>>>>>>> in order to understand the impact of moving from
>>>>>>>>>> for_each_present_cpu()
>>>>>>>>>> to for_each_online_cpu().
>>>>>>>>>
>>>>>>>>> Is the packing actually worth the trouble? What's the actual win?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>>           tglx
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> Thomas,
>>>>>>>> I've investigated the passing of crash notes through the
>>>>>>>> vmcore. What I've learned is that:
>>>>>>>>
>>>>>>>> - linux/fs/proc/vmcore.c (which makedumpfile references
>>>>>>>> to do its job) does
>>>>>>>>    not care what the contents of cpu PT_NOTES are, but it
>>>>>>>> does coalesce them together.
>>>>>>>>
>>>>>>>> - makedumpfile will count the number of cpu PT_NOTES in
>>>>>>>> order to determine its
>>>>>>>>    nr_cpus variable, which is reported in a header, but
>>>>>>>> otherwise unused (except
>>>>>>>>    for sadump method).
>>>>>>>>
>>>>>>>> - the crash utility, for the purposes of determining the
>>>>>>>> cpus, does not appear to
>>>>>>>>    reference the elfcorehdr PT_NOTEs. Instead it locates the various
>>>>>>>>    cpu_[possible|present|online]_mask and computes
>>>>>>>> nr_cpus from that, and also of
>>>>>>>>    course which are online. In addition, when crash does
>>>>>>>> reference the cpu PT_NOTE,
>>>>>>>>    to get its prstatus, it does so by using a percpu
>>>>>>>> technique directly in the vmcore
>>>>>>>>    image memory, not via the ELF structure. Said
>>>>>>>> differently, it appears to me that
>>>>>>>>    crash utility doesn't rely on the ELF PT_NOTEs for
>>>>>>>> cpus; rather it obtains them
>>>>>>>>    via kernel cpumasks and the memory within the vmcore.
>>>>>>>>
>>>>>>>> With this understanding, I did some testing. Perhaps the
>>>>>>>> most telling test was that I
>>>>>>>> changed the number of cpu PT_NOTEs emitted in the
>>>>>>>> crash_prepare_elf64_headers() to just 1,
>>>>>>>> hot plugged some cpus, then also took a few offline
>>>>>>>> sparsely via chcpu, then generated a
>>>>>>>> vmcore. The crash utility had no problem loading the
>>>>>>>> vmcore, it reported the proper number
>>>>>>>> of cpus and the number offline (despite only one cpu
>>>>>>>> PT_NOTE), and changing to a different
>>>>>>>> cpu via 'set -c 30' and the backtrace was completely valid.
>>>>>>>>
>>>>>>>> My take away is that crash utility does not rely upon
>>>>>>>> ELF cpu PT_NOTEs, it obtains the
>>>>>>>> cpu information directly from kernel data structures.
>>>>>>>> Perhaps at one time crash relied
>>>>>>>> upon the ELF information, but no more. (Perhaps there
>>>>>>>> are other crash dump analyzers
>>>>>>>> that might rely on the ELF info?)
>>>>>>>>
>>>>>>>> So, all this to say that I see no need to change
>>>>>>>> crash_prepare_elf64_headers(). There
>>>>>>>> is no compelling reason to move away from
>>>>>>>> for_each_present_cpu(), or modify the list for
>>>>>>>> online/offline.
>>>>>>>>
>>>>>>>> Which then leaves the topic of the cpuhp state on which
>>>>>>>> to register. Perhaps reverting
>>>>>>>> back to the use of CPUHP_BP_PREPARE_DYN is the right
>>>>>>>> answer. There does not appear to
>>>>>>>> be a compelling need to accurately track whether the cpu
>>>>>>>> went online/offline for the
>>>>>>>> purposes of creating the elfcorehdr, as ultimately the
>>>>>>>> crash utility pulls that from
>>>>>>>> kernel data structures, not the elfcorehdr.
>>>>>>>>
>>>>>>>> I think this is what Sourabh has known and has been
>>>>>>>> advocating for an optimization
>>>>>>>> path that allows not regenerating the elfcorehdr on cpu
>>>>>>>> changes (because all the percpu
>>>>>>>> structs are already laid out). I do think it best to leave
>>>>>>>> that as an arch choice.
>>>>>>>
>>>>>>> Since things are clear on how the PT_NOTEs are consumed in the
>>>>>>> kdump kernel [fs/proc/vmcore.c],
>>>>>>> makedumpfile, and the crash tool, I need your opinion on this:
>>>>>>>
>>>>>>> Do we really need to regenerate elfcorehdr for CPU hotplug events?
>>>>>>> If yes, can you please list the elfcorehdr components that
>>>>>>> change due to CPU hotplug?
>>>>>> Due to the use of for_each_present_cpu(), it is possible for the
>>>>>> number of cpu PT_NOTEs
>>>>>> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus
>>>>>> does not impact the
>>>>>> number of cpu PT_NOTEs (as the cpus are still present).
>>>>>>
>>>>>>>
>>>>>>>   From what I understood, crash notes are prepared for
>>>>>>> possible CPUs as the system boots and
>>>>>>> could be used to create a PT_NOTE section for each possible
>>>>>>> CPU while generating the elfcorehdr
>>>>>>> during the kdump kernel load.
>>>>>>>
>>>>>>> Now once the elfcorehdr is loaded with PT_NOTEs for every
>>>>>>> possible CPU there is no need to
>>>>>>> regenerate it for CPU hotplug events. Or do we?
>>>>>>
>>>>>> For onlining/offlining of cpus, there is no need to regenerate
>>>>>> the elfcorehdr. However,
>>>>>> for actual hot un/plug of cpus, the answer is yes due to
>>>>>> for_each_present_cpu(). The
>>>>>> caveat here of course is that if crash utility is the only
>>>>>> coredump analyzer of concern,
>>>>>> then it doesn't care about these cpu PT_NOTEs and there would be
>>>>>> no need to re-generate them.
>>>>>>
>>>>>> Also, I'm not sure if ARM cpu hotplug, which is just now coming
>>>>>> into the mainstream, impacts
>>>>>> any of this.
>>>>>>
>>>>>> Perhaps the one item that might help here is to distinguish
>>>>>> between actual hot un/plug of
>>>>>> cpus, versus onlining/offlining. At the moment, I cannot
>>>>>> distinguish between a hot plug
>>>>>> event and an online event (nor unplug/offline). If those were
>>>>>> distinguishable, then we
>>>>>> could regenerate only on un/plug events.
>>>>>>
>>>>>> Or perhaps moving to for_each_possible_cpu() is the better choice?
>>>>>
>>>>> Yes, because once elfcorehdr is built with possible CPUs we don't
>>>>> have to worry about
>>>>> the hot[un]plug case.
>>>>>
>>>>> Here is my view on how things should be handled if a core-dump
>>>>> analyzer is dependent on
>>>>> elfcorehdr PT_NOTEs to find online/offline CPUs.
>>>>>
>>>>> A PT_NOTE in elfcorehdr holds the address of the corresponding crash
>>>>> notes (the kernel has
>>>>> one crash note per CPU for every possible CPU). Though the crash
>>>>> notes are allocated
>>>>> at boot time, they are populated when the system is on the
>>>>> crash path.
>>>>>
>>>>> This is how crash notes are populated on PowerPC and I am expecting
>>>>> it would be something
>>>>> similar on other architectures too.
>>>>>
>>>>> The crashing CPU sends an IPI to every other online CPU with a callback
>>>>> function that updates the
>>>>> crash notes of that specific CPU. Once the IPI completes, the
>>>>> crashing CPU updates its own crash
>>>>> note and proceeds further.
>>>>>
>>>>> The crash notes of CPUs remain uninitialized if the CPUs were
>>>>> offline or hot unplugged at the time of the
>>>>> system crash. The core-dump analyzer should be able to identify
>>>>> [un]initialized crash notes
>>>>> and display the information accordingly.
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> - Sourabh
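
For reference, the crash-path population described above corresponds to
crash_save_cpu(); a condensed sketch of that helper as it appears in
mainline around this time (details vary by release):

void crash_save_cpu(struct pt_regs *regs, int cpu)
{
	struct elf_prstatus prstatus;
	u32 *buf;

	if ((cpu < 0) || (cpu >= nr_cpu_ids))
		return;

	/* crash_notes was allocated at boot for every possible cpu */
	buf = (u32 *)per_cpu_ptr(crash_notes, cpu);
	if (!buf)
		return;
	memset(&prstatus, 0, sizeof(prstatus));
	prstatus.common.pr_pid = current->pid;
	elf_core_copy_regs(&prstatus.pr_reg, regs);
	buf = append_elf_note(buf, KEXEC_CORE_NOTE_NAME, NT_PRSTATUS,
			      &prstatus, sizeof(prstatus));
	final_note(buf);
}

A cpu that is offline or unplugged at crash time never executes this, so
its note buffer stays zero-filled.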
>>>>
>>>> In general, I agree with your points. You've presented a strong case to
>>>> go with for_each_possible_cpu() in crash_prepare_elf64_headers() and
>>>> those crash notes would always be present, and we can ignore changes to
>>>> cpus wrt/ elfcorehdr updates.
>>>>
>>>> But what do we do about kexec_load() syscall? The way the userspace
>>>> utility works is it determines cpus by:
>>>>   nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
>>>> which is not the equivalent of possible_cpus. So the complete list of
>>>> cpu PT_NOTEs is not generated up front. We would need a solution for
>>>> that?
>>> Hello Eric,
>>>
>>> The sysconf documentation says _SC_NPROCESSORS_CONF is processors configured;
>>> isn't that equivalent to possible CPUs?
>>>
>>> What exactly does sysconf(_SC_NPROCESSORS_CONF) return on x86? IIUC, on powerPC
>>> it is possible CPUs.
>>
> Baoquan,
> 
>>  From the sysconf man page, my understanding is that _SC_NPROCESSORS_CONF
>> returns the possible cpus, while _SC_NPROCESSORS_ONLN returns present
>> cpus. If these are true, we can use them.
> 
> Thomas Gleixner has pointed out that:
> 
>   glibc tries to evaluate that in the following order:
>    1) /sys/devices/system/cpu/cpu*
>       That's present CPUs not possible CPUs
>    2) /proc/stat
>       That's online CPUs
>    3) sched_getaffinity()
>       That's online CPUs at best. In the worst case it's an affinity mask
>       which is set on a process group
> 
> meaning that _SC_NPROCESSORS_CONF is not equivalent to possible_cpus(). Furthermore, the 
> /sys/devices/system/cpu/cpuXX entries are not available for not-present-but-possible cpus; thus 
> userspace kexec utility cannot write out the elfcorehdr with all possible cpus listed.
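
(For the kexec_load path, the possible set is nevertheless discoverable:
the kernel exports it as a cpulist string in /sys/devices/system/cpu/possible,
e.g. "0-159". A minimal userspace sketch of deriving the count from it,
as a hypothetical helper rather than existing kexec-tools code:

#include <stdio.h>
#include <string.h>

/* Parse a kernel cpulist like "0-7,16-23" and return highest cpu + 1. */
static int parse_possible_cpus(void)
{
	FILE *fp = fopen("/sys/devices/system/cpu/possible", "r");
	char buf[4096];
	char *s = buf, *tok;
	int first, last, max = -1, n;

	if (!fp || !fgets(buf, sizeof(buf), fp)) {
		if (fp)
			fclose(fp);
		return -1;
	}
	fclose(fp);

	while ((tok = strsep(&s, ",\n")) && *tok) {
		n = sscanf(tok, "%d-%d", &first, &last);
		if (n < 1)
			return -1;	/* malformed entry */
		if (n == 1)
			last = first;	/* single cpu, not a range */
		if (last > max)
			max = last;
	}
	return max + 1;		/* "0-159" yields 160 */
}

Parsing this file instead of calling sysconf() would let the utility emit
one PT_NOTE per possible cpu up front.)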
> 
>>
>> But I am wondering why the existing present cpu way is going to be
>> discarded. Sorry, I tried to go through this thread but it's too long; can
>> anyone summarize the reason in shorter, clearer sentences? Sorry
>> again for that.
> 
> By utilizing for_each_possible_cpu() in crash_prepare_elf64_headers(), in the case of the 
> kexec_file_load(), this change would simplify some issues Sourabh has encountered for PPC support. 
> It would also enable an optimization that permits NOT re-generating the elfcorehdr on cpu changes, 
> as all the [possible] cpus are already described in the elfcorehdr.
> 
> I've pointed out that this change would have kexec_load (as kexec-tools can initially only write out 
> the present_cpus()) deviate from kexec_file_load (which would now write out the 
> possible_cpus()). This deviation would disappear after the first hotplug event (due to calling 
> crash_prepare_elf64_headers()). Alternatively, I've provided a simple way for kexec_load to rewrite its 
> elfcorehdr upon initial load (by calling into the crash hotplug handler).
> 
> Can you think of any side effects of going to for_each_possible_cpu()?
> 
> Thanks,
> eric

Well, this won't be shorter sentences, but hopefully it makes the case clearer. Below I've 
cut-n-pasted my current patch w/ commit message which explains it all.

Please let me know if you can think of any side effects not addressed!
Thanks,
eric

> 
> 
>>
>>>
>>> In case sysconf(_SC_NPROCESSORS_CONF) is not consistent, then we can go with
>>> /sys/devices/system/cpu/possible for the kexec_load case.
>>>
>>> Thoughts?
>>>
>>> - Sourabh Jain
>>>
>>

 From b56aa428b07d970f26e3c3704d54ce8805f05ddc Mon Sep 17 00:00:00 2001
From: Eric DeVolder <eric.devolder@oracle.com>
Date: Tue, 28 Feb 2023 14:20:04 -0500
Subject: [PATCH v19 3/7] crash: change crash_prepare_elf64_headers() to
  for_each_possible_cpu()

The function crash_prepare_elf64_headers() generates the elfcorehdr
which describes the cpus and memory in the system for the crash kernel.
In particular, it writes out ELF PT_NOTEs for memory regions and the
processors in the system.

With respect to the cpus, the current implementation utilizes
for_each_present_cpu() which means that as cpus are added and removed,
the elfcorehdr must again be updated to reflect the new set of cpus.

The reasoning behind the change to use for_each_possible_cpu() is:

- At kernel boot time, all percpu crash_notes are allocated for all
   possible cpus; that is, crash_notes are not allocated dynamically
   when cpus are plugged/unplugged. Thus the crash_notes for each
   possible cpu are always available.

- crash_prepare_elf64_headers() creates an ELF PT_NOTE per cpu.
   Changing to for_each_possible_cpu() is valid as the crash_notes
   pointed to by each cpu PT_NOTE are present and always valid.

Furthermore, examining a common crash processing path of:

  kernel panic -> crash kernel -> makedumpfile -> 'crash' analyzer
            elfcorehdr      /proc/vmcore     vmcore

reveals how the ELF cpu PT_NOTEs are utilized:

- Upon panic, each cpu is sent an IPI and shuts itself down, recording
  its state in its crash_notes. When all cpus are shut down, the
  crash kernel is launched with a pointer to the elfcorehdr.

- The crash kernel via linux/fs/proc/vmcore.c does not examine or
  use the contents of the PT_NOTEs; it exposes them via /proc/vmcore.

- The makedumpfile utility uses /proc/vmcore and reads the cpu
  PT_NOTEs to craft a nr_cpus variable, which is reported in a
  header but otherwise generally unused. Makedumpfile creates the
  vmcore.

- The 'crash' dump analyzer does not appear to reference the cpu
  PT_NOTEs. Instead it looks up the cpu_[possible|present|online]_mask
  symbols and directly examines those structure contents from vmcore
  memory. From that information it is able to determine which cpus
  are present and online, and locate the corresponding crash_notes.
  Said differently, it appears to me that the 'crash' analyzer does not
  rely on the ELF PT_NOTEs for cpus; rather it obtains the information
  directly via kernel symbols and the memory within the vmcore.

(There may be other vmcore-generating and analysis tools that do use
these PT_NOTEs, but 'makedumpfile' and 'crash' seem to me to be the
most common solution.)

This change results in the benefit of having all cpus described in
the elfcorehdr, and therefore reducing the need to re-generate the
elfcorehdr on cpu changes, at the small expense of an additional
56 bytes per PT_NOTE for not-present-but-possible cpus.

On systems where kexec_file_load() syscall is utilized, all the above
is valid. On systems where kexec_load() syscall is utilized, there
may be the need for the elfcorehdr to be regenerated once. The reason
being that some archs only populate the 'present' cpus in the
/sys/devices/system/cpu entries, which the userspace 'kexec' utility
uses to generate the userspace-supplied elfcorehdr. In this situation,
one memory or cpu change will rewrite the elfcorehdr via the
crash_prepare_elf64_headers() function and now all possible cpus will
be described, just as with kexec_file_load() syscall.

Suggested-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
---
  kernel/crash_core.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index dba4b75f7541..537b199a8774 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -365,7 +365,7 @@ int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map,
  	ehdr->e_phentsize = sizeof(Elf64_Phdr);

  	/* Prepare one phdr of type PT_NOTE for each present CPU */
-	for_each_present_cpu(cpu) {
+	for_each_possible_cpu(cpu) {
  		phdr->p_type = PT_NOTE;
  		notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
  		phdr->p_offset = phdr->p_paddr = notes_addr;
-- 
2.31.1
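
Two reference points for the commit message above. The boot-time
allocation of crash_notes for all possible cpus follows from the percpu
allocator, which instantiates storage for every possible cpu by
construction; condensed from kernel/kexec_core.c of roughly this vintage:

note_buf_t __percpu *crash_notes;

static int __init crash_notes_memory_init(void)
{
	/* Allocate memory for saving cpu registers. */
	size_t size, align;

	size = sizeof(note_buf_t);
	align = min(roundup_pow_of_two(sizeof(note_buf_t)), PAGE_SIZE);
	crash_notes = __alloc_percpu(size, align);
	if (!crash_notes) {
		pr_warn("Memory allocation for saving cpu register states failed\n");
		return -ENOMEM;
	}
	return 0;
}
subsys_initcall(crash_notes_memory_init);

And the 56 bytes per extra PT_NOTE is simply sizeof(Elf64_Phdr): two
4-byte words plus six 8-byte fields.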


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
  2023-02-28 18:52                         ` Eric DeVolder
@ 2023-03-02  5:23                           ` Sourabh Jain
  -1 siblings, 0 replies; 70+ messages in thread
From: Sourabh Jain @ 2023-03-02  5:23 UTC (permalink / raw)
  To: Eric DeVolder, Baoquan He
  Cc: Thomas Gleixner, linux-kernel, x86, kexec, ebiederm, dyoung,
	vgoyal, mingo, bp, dave.hansen, hpa, nramas, thomas.lendacky,
	robh, efault, rppt, david, konrad.wilk, boris.ostrovsky


On 01/03/23 00:22, Eric DeVolder wrote:
>
>
> On 2/28/23 06:44, Baoquan He wrote:
>> On 02/13/23 at 10:10am, Sourabh Jain wrote:
>>>
>>> On 11/02/23 06:05, Eric DeVolder wrote:
>>>>
>>>>
>>>> On 2/10/23 00:29, Sourabh Jain wrote:
>>>>>
>>>>> On 10/02/23 01:09, Eric DeVolder wrote:
>>>>>>
>>>>>>
>>>>>> On 2/9/23 12:43, Sourabh Jain wrote:
>>>>>>> Hello Eric,
>>>>>>>
>>>>>>> On 09/02/23 23:01, Eric DeVolder wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2/8/23 07:44, Thomas Gleixner wrote:
>>>>>>>>> Eric!
>>>>>>>>>
>>>>>>>>> On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
>>>>>>>>>> On 2/1/23 05:33, Thomas Gleixner wrote:
>>>>>>>>>>
>>>>>>>>>> So my latest solution is to introduce two new CPUHP
>>>>>>>>>> states, CPUHP_AP_ELFCOREHDR_ONLINE
>>>>>>>>>> for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for
>>>>>>>>>> offlining. I'm open to better names.
>>>>>>>>>>
>>>>>>>>>> The CPUHP_AP_ELFCOREHDR_ONLINE needs to be
>>>>>>>>>> placed after CPUHP_BRINGUP_CPU. My
>>>>>>>>>> attempts at locating this state failed when
>>>>>>>>>> inside the STARTING section, so I located
>>>>>>>>>> this just inside the ONLINE section. The crash
>>>>>>>>>> hotplug handler is registered on
>>>>>>>>>> this state as the callback for the .startup method.
>>>>>>>>>>
>>>>>>>>>> The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be
>>>>>>>>>> placed before CPUHP_TEARDOWN_CPU, and I
>>>>>>>>>> placed it at the end of the PREPARE section.
>>>>>>>>>> This crash hotplug handler is also
>>>>>>>>>> registered on this state as the callback for the .teardown 
>>>>>>>>>> method.
>>>>>>>>>
>>>>>>>>> TBH, that's still overengineered. Something like this:
>>>>>>>>>
>>>>>>>>> bool cpu_is_alive(unsigned int cpu)
>>>>>>>>> {
>>>>>>>>>      struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
>>>>>>>>>
>>>>>>>>>      return data_race(st->state) <= CPUHP_AP_IDLE_DEAD;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> and use this to query the actual state at crash
>>>>>>>>> time. That spares all
>>>>>>>>> those callback heuristics.
>>>>>>>>>
>>>>>>>>>> I'm making my way through percpu crash_notes,
>>>>>>>>>> elfcorehdr, vmcoreinfo,
>>>>>>>>>> makedumpfile and (the consumer of it all) the
>>>>>>>>>> userspace crash utility,
>>>>>>>>>> in order to understand the impact of moving from
>>>>>>>>>> for_each_present_cpu()
>>>>>>>>>> to for_each_online_cpu().
>>>>>>>>>
>>>>>>>>> Is the packing actually worth the trouble? What's the actual win?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>>           tglx
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> Thomas,
>>>>>>>> I've investigated the passing of crash notes through the
>>>>>>>> vmcore. What I've learned is that:
>>>>>>>>
>>>>>>>> - linux/fs/proc/vmcore.c (which makedumpfile references
>>>>>>>> to do its job) does
>>>>>>>>    not care what the contents of cpu PT_NOTES are, but it
>>>>>>>> does coalesce them together.
>>>>>>>>
>>>>>>>> - makedumpfile will count the number of cpu PT_NOTES in
>>>>>>>> order to determine its
>>>>>>>>    nr_cpus variable, which is reported in a header, but
>>>>>>>> otherwise unused (except
>>>>>>>>    for sadump method).
>>>>>>>>
>>>>>>>> - the crash utility, for the purposes of determining the
>>>>>>>> cpus, does not appear to
>>>>>>>>    reference the elfcorehdr PT_NOTEs. Instead it locates the 
>>>>>>>> various
>>>>>>>>    cpu_[possible|present|online]_mask and computes
>>>>>>>> nr_cpus from that, and also of
>>>>>>>>    course which are online. In addition, when crash does
>>>>>>>> reference the cpu PT_NOTE,
>>>>>>>>    to get its prstatus, it does so by using a percpu
>>>>>>>> technique directly in the vmcore
>>>>>>>>    image memory, not via the ELF structure. Said
>>>>>>>> differently, it appears to me that
>>>>>>>>    crash utility doesn't rely on the ELF PT_NOTEs for
>>>>>>>> cpus; rather it obtains them
>>>>>>>>    via kernel cpumasks and the memory within the vmcore.
>>>>>>>>
>>>>>>>> With this understanding, I did some testing. Perhaps the
>>>>>>>> most telling test was that I
>>>>>>>> changed the number of cpu PT_NOTEs emitted in the
>>>>>>>> crash_prepare_elf64_headers() to just 1,
>>>>>>>> hot plugged some cpus, then also took a few offline
>>>>>>>> sparsely via chcpu, then generated a
>>>>>>>> vmcore. The crash utility had no problem loading the
>>>>>>>> vmcore; it reported the proper number
>>>>>>>> of cpus and the number offline (despite only one cpu
>>>>>>>> PT_NOTE), and switching to a different
>>>>>>>> cpu via 'set -c 30' produced a completely valid backtrace.
>>>>>>>>
>>>>>>>> My takeaway is that crash utility does not rely upon
>>>>>>>> ELF cpu PT_NOTEs; it obtains the
>>>>>>>> cpu information directly from kernel data structures.
>>>>>>>> Perhaps at one time crash relied
>>>>>>>> upon the ELF information, but no more. (Perhaps there
>>>>>>>> are other crash dump analyzers
>>>>>>>> that might rely on the ELF info?)
>>>>>>>>
>>>>>>>> So, all this to say that I see no need to change
>>>>>>>> crash_prepare_elf64_headers(). There
>>>>>>>> is no compelling reason to move away from
>>>>>>>> for_each_present_cpu(), or modify the list for
>>>>>>>> online/offline.
>>>>>>>>
>>>>>>>> Which then leaves the topic of the cpuhp state on which
>>>>>>>> to register. Perhaps reverting
>>>>>>>> back to the use of CPUHP_BP_PREPARE_DYN is the right
>>>>>>>> answer. There does not appear to
>>>>>>>> be a compelling need to accurately track whether the cpu
>>>>>>>> went online/offline for the
>>>>>>>> purposes of creating the elfcorehdr, as ultimately the
>>>>>>>> crash utility pulls that from
>>>>>>>> kernel data structures, not the elfcorehdr.
>>>>>>>>
>>>>>>>> I think this is what Sourabh has understood and has been
>>>>>>>> advocating for an optimization
>>>>>>>> path that allows not regenerating the elfcorehdr on cpu
>>>>>>>> changes (because all the percpu
>>>>>>>> structs are already laid out). I do think it best to leave
>>>>>>>> that as an arch choice.
>>>>>>>
>>>>>>> Since things are clear on how the PT_NOTEs are consumed in the
>>>>>>> kdump kernel [fs/proc/vmcore.c],
>>>>>>> makedumpfile, and the crash tool, I need your opinion on this:
>>>>>>>
>>>>>>> Do we really need to regenerate elfcorehdr for CPU hotplug events?
>>>>>>> If yes, can you please list the elfcorehdr components that
>>>>>>> change due to CPU hotplug?
>>>>>> Due to the use of for_each_present_cpu(), it is possible for the
>>>>>> number of cpu PT_NOTEs
>>>>>> to fluctuate as cpus are un/plugged. Onlining/offlining of cpus
>>>>>> does not impact the
>>>>>> number of cpu PT_NOTEs (as the cpus are still present).
>>>>>>
>>>>>>>
>>>>>>>   From what I understood, crash notes are prepared for
>>>>>>> possible CPUs as the system boots and
>>>>>>> could be used to create a PT_NOTE section for each possible
>>>>>>> CPU while generating the elfcorehdr
>>>>>>> during the kdump kernel load.
>>>>>>>
>>>>>>> Now once the elfcorehdr is loaded with PT_NOTEs for every
>>>>>>> possible CPU there is no need to
>>>>>>> regenerate it for CPU hotplug events. Or do we?
>>>>>>
>>>>>> For onlining/offlining of cpus, there is no need to regenerate
>>>>>> the elfcorehdr. However,
>>>>>> for actual hot un/plug of cpus, the answer is yes due to
>>>>>> for_each_present_cpu(). The
>>>>>> caveat here of course is that if crash utility is the only
>>>>>> coredump analyzer of concern,
>>>>>> then it doesn't care about these cpu PT_NOTEs and there would be
>>>>>> no need to re-generate them.
>>>>>>
>>>>>> Also, I'm not sure if ARM cpu hotplug, which is just now coming
>>>>>> into the mainstream, impacts
>>>>>> any of this.
>>>>>>
>>>>>> Perhaps the one item that might help here is to distinguish
>>>>>> between actual hot un/plug of
>>>>>> cpus, versus onlining/offlining. At the moment, I cannot
>>>>>> distinguish between a hot plug
>>>>>> event and an online event (nor unplug/offline). If those were
>>>>>> distinguishable, then we
>>>>>> could regenerate only on un/plug events.
>>>>>>
>>>>>> Or perhaps moving to for_each_possible_cpu() is the better choice?
>>>>>
>>>>> Yes, because once elfcorehdr is built with possible CPUs we don't
>>>>> have to worry about
>>>>> the hot[un]plug case.
>>>>>
>>>>> Here is my view on how things should be handled if a core-dump
>>>>> analyzer is dependent on
>>>>> elfcorehdr PT_NOTEs to find online/offline CPUs.
>>>>>
>>>>> A PT_NOTE in elfcorehdr holds the address of the corresponding crash
>>>>> notes (the kernel has
>>>>> one crash note per CPU for every possible CPU). Though the crash
>>>>> notes are allocated
>>>>> at boot time, they are populated when the system is on the
>>>>> crash path.
>>>>>
>>>>> This is how crash notes are populated on PowerPC and I am expecting
>>>>> it would be something
>>>>> similar on other architectures too.
>>>>>
>>>>> The crashing CPU sends an IPI to every other online CPU with a callback
>>>>> function that updates the
>>>>> crash notes of that specific CPU. Once the IPI completes, the
>>>>> crashing CPU updates its own crash
>>>>> note and proceeds further.
>>>>>
>>>>> The crash notes of CPUs remain uninitialized if the CPUs were
>>>>> offline or hot unplugged at the time of the
>>>>> system crash. The core-dump analyzer should be able to identify
>>>>> [un]initialized crash notes
>>>>> and display the information accordingly.
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> - Sourabh
>>>>
>>>> In general, I agree with your points. You've presented a strong 
>>>> case to
>>>> go with for_each_possible_cpu() in crash_prepare_elf64_headers() and
>>>> those crash notes would always be present, and we can ignore 
>>>> changes to
>>>> cpus wrt/ elfcorehdr updates.
>>>>
>>>> But what do we do about kexec_load() syscall? The way the userspace
>>>> utility works is it determines cpus by:
>>>>   nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
>>>> which is not the equivalent of possible_cpus. So the complete list of
>>>> cpu PT_NOTEs is not generated up front. We would need a solution for
>>>> that?
>>> Hello Eric,
>>>
>>> The sysconf documentation says _SC_NPROCESSORS_CONF is processors 
>>> configured;
>>> isn't that equivalent to possible CPUs?
>>>
>>> What exactly does sysconf(_SC_NPROCESSORS_CONF) return on x86? IIUC, on 
>>> powerPC
>>> it is possible CPUs.
>>
> Baoquan,
>
>>  From the sysconf man page, my understanding is that _SC_NPROCESSORS_CONF
>> returns the possible cpus, while _SC_NPROCESSORS_ONLN returns present
>> cpus. If these are true, we can use them.
>
> Thomas Gleixner has pointed out that:
>
>  glibc tries to evaluate that in the following order:
>   1) /sys/devices/system/cpu/cpu*
>      That's present CPUs not possible CPUs
>   2) /proc/stat
>      That's online CPUs
>   3) sched_getaffinity()
>      That's online CPUs at best. In the worst case it's an affinity mask
>      which is set on a process group
>
> meaning that _SC_NPROCESSORS_CONF is not equivalent to 
> possible_cpus(). Furthermore, the /sys/devices/system/cpu/cpuXX 
> entries are not available for not-present-but-possible cpus; thus 
> userspace kexec utility cannot write out the elfcorehdr with all 
> possible cpus listed.
>
>>
>> But I am wondering why the existing present cpu way is going to be
>> discarded. Sorry, I tried to go through this thread but it's too long; can
>> anyone summarize the reason in shorter, clearer sentences? Sorry
>> again for that.
>
Hello Eric,

> By utilizing for_each_possible_cpu() in crash_prepare_elf64_headers(), 
> in the case of the kexec_file_load(), this change would simplify some 
> issues Sourabh has encountered for PPC support.

Things are fine even with for_each_present_cpu() on PPC. It is just that I
want to avoid regenerating the elfcorehdr on every CPU change, by describing
all possible CPUs in it up front.


Thanks,
Sourabh Jain


^ permalink raw reply	[flat|nested] 70+ messages in thread
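
On Sourabh's point that the analyzer should identify [un]initialized
crash notes: since the percpu note buffers are zero-filled at allocation
and only written on the crash path, an analyzer-side check can be as
simple as validating the leading ELF note header. A hypothetical sketch,
not code from makedumpfile or crash:

#include <elf.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * A populated crash note starts with a valid Elf64_Nhdr for an
 * NT_PRSTATUS note; the buffer of a cpu that was offline or unplugged
 * at crash time is still all zeros.
 */
static bool crash_note_populated(const void *buf, size_t len)
{
	const Elf64_Nhdr *nhdr = buf;

	if (len < sizeof(*nhdr))
		return false;
	return nhdr->n_namesz != 0 && nhdr->n_type == NT_PRSTATUS;
}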

* Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes
  2023-03-01 15:48                           ` Eric DeVolder
@ 2023-03-02 10:51                             ` Baoquan He
  -1 siblings, 0 replies; 70+ messages in thread
From: Baoquan He @ 2023-03-02 10:51 UTC (permalink / raw)
  To: Eric DeVolder
  Cc: Sourabh Jain, Thomas Gleixner, linux-kernel, x86, kexec,
	ebiederm, dyoung, vgoyal, mingo, bp, dave.hansen, hpa, nramas,
	thomas.lendacky, robh, efault, rppt, david, konrad.wilk,
	boris.ostrovsky

On 03/01/23 at 09:48am, Eric DeVolder wrote:
...... 
> From b56aa428b07d970f26e3c3704d54ce8805f05ddc Mon Sep 17 00:00:00 2001
> From: Eric DeVolder <eric.devolder@oracle.com>
> Date: Tue, 28 Feb 2023 14:20:04 -0500
> Subject: [PATCH v19 3/7] crash: change crash_prepare_elf64_headers() to
>  for_each_possible_cpu()
> 
> The function crash_prepare_elf64_headers() generates the elfcorehdr
> which describes the cpus and memory in the system for the crash kernel.
> In particular, it writes out ELF PT_NOTEs for memory regions and the
> processors in the system.
> 
> With respect to the cpus, the current implementation utilizes
> for_each_present_cpu() which means that as cpus are added and removed,
> the elfcorehdr must again be updated to reflect the new set of cpus.
> 
> The reasoning behind the change to use for_each_possible_cpu() is:
> 
> - At kernel boot time, all percpu crash_notes are allocated for all
>   possible cpus; that is, crash_notes are not allocated dynamically
>   when cpus are plugged/unplugged. Thus the crash_notes for each
>   possible cpu are always available.
> 
> - crash_prepare_elf64_headers() creates an ELF PT_NOTE per cpu.
>   Changing to for_each_possible_cpu() is valid as the crash_notes
>   pointed to by each cpu PT_NOTE are present and always valid.
> 
> Furthermore, examining a common crash processing path of:
> 
>  kernel panic -> crash kernel -> makedumpfile -> 'crash' analyzer
>            elfcorehdr      /proc/vmcore     vmcore
> 
> reveals how the ELF cpu PT_NOTEs are utilized:
> 
> - Upon panic, each cpu is sent an IPI and shuts itself down, recording
>  its state in its crash_notes. When all cpus are shut down, the
>  crash kernel is launched with a pointer to the elfcorehdr.
> 
> - The crash kernel via linux/fs/proc/vmcore.c does not examine or
>  use the contents of the PT_NOTEs; it exposes them via /proc/vmcore.
> 
> - The makedumpfile utility uses /proc/vmcore and reads the cpu
>  PT_NOTEs to craft a nr_cpus variable, which is reported in a
>  header but otherwise generally unused. Makedumpfile creates the
>  vmcore.
> 
> - The 'crash' dump analyzer does not appear to reference the cpu
>  PT_NOTEs. Instead it looks up the cpu_[possible|present|online]_mask
>  symbols and directly examines those structure contents from vmcore
>  memory. From that information it is able to determine which cpus
>  are present and online, and locate the corresponding crash_notes.
>  Said differently, it appears to me that the 'crash' analyzer does not
>  rely on the ELF PT_NOTEs for cpus; rather it obtains the information
>  directly via kernel symbols and the memory within the vmcore.
> 
> (There may be other vmcore-generating and analysis tools that do use
> these PT_NOTEs, but 'makedumpfile' and 'crash' seem to me to be the
> most common solution.)
> 
> This change results in the benefit of having all cpus described in
> the elfcorehdr, and therefore reducing the need to re-generate the
> elfcorehdr on cpu changes, at the small expense of an additional
> 56 bytes per PT_NOTE for not-present-but-possible cpus.
> 
> On systems where the kexec_file_load() syscall is used, all of the
> above holds. On systems where the kexec_load() syscall is used, the
> elfcorehdr may still need to be regenerated once. The reason is that
> some archs populate only the 'present' cpus in the
> /sys/devices/system/cpu entries, which the userspace 'kexec' utility
> uses to generate the userspace-supplied elfcorehdr. In this situation,
> a single memory or cpu change will rewrite the elfcorehdr via
> crash_prepare_elf64_headers(), after which all possible cpus are
> described, just as with the kexec_file_load() syscall.

So, with for_each_possible_cpu(), we don't need to respond to cpu
hotplug events at all, right? If so, that does bring a benefit,
although kexec_load won't gain as much from it. So far, it looks
not bad.
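
To illustrate the invariant this relies on, here is a hypothetical
snippet (mine, not from the patch): cpu_possible_mask is fixed at
boot and is a superset of cpu_present_mask, so a loop over the
possible cpus never has to be re-run when cpus are hot added or
removed:

    #include <linux/cpumask.h>
    #include <linux/printk.h>

    /* Hypothetical illustration only, not kernel or patch code */
    static void compare_cpu_iterators(void)
    {
        unsigned int cpu, nr_present = 0, nr_possible = 0;

        for_each_present_cpu(cpu)    /* changes with hot un/plug */
            nr_present++;
        for_each_possible_cpu(cpu)   /* fixed for the system's lifetime */
            nr_possible++;

        /* nr_possible >= nr_present always holds */
        pr_info("PT_NOTE phdrs: present=%u possible=%u\n",
                nr_present, nr_possible);
    }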

> 
> Suggested-by: Sourabh Jain <sourabhjain@linux.ibm.com>
> Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
> ---
>  kernel/crash_core.c | 2 +-
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index dba4b75f7541..537b199a8774 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -365,7 +365,7 @@ int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map,
>  	ehdr->e_phentsize = sizeof(Elf64_Phdr);
> 
> -	/* Prepare one phdr of type PT_NOTE for each present CPU */
> +	/* Prepare one phdr of type PT_NOTE for each possible CPU */
> -	for_each_present_cpu(cpu) {
> +	for_each_possible_cpu(cpu) {
>  		phdr->p_type = PT_NOTE;
>  		notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
>  		phdr->p_offset = phdr->p_paddr = notes_addr;
> -- 
> 2.31.1
> 


^ permalink raw reply	[flat|nested] 70+ messages in thread

end of thread, other threads:[~2023-03-02 10:53 UTC | newest]

Thread overview: 70+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-31 22:42 [PATCH v18 0/7] crash: Kernel handling of CPU and memory hot un/plug Eric DeVolder
2023-01-31 22:42 ` Eric DeVolder
2023-01-31 22:42 ` [PATCH v18 1/7] crash: move a few code bits to setup support of crash hotplug Eric DeVolder
2023-01-31 22:42   ` Eric DeVolder
2023-01-31 22:42 ` [PATCH v18 2/7] crash: prototype change for crash_prepare_elf64_headers() Eric DeVolder
2023-01-31 22:42   ` Eric DeVolder
2023-01-31 22:42 ` [PATCH v18 3/7] crash: add generic infrastructure for crash hotplug support Eric DeVolder
2023-01-31 22:42   ` Eric DeVolder
2023-02-09 19:10   ` Sourabh Jain
2023-02-09 19:10     ` Sourabh Jain
2023-02-10 16:51     ` Eric DeVolder
2023-02-10 16:51       ` Eric DeVolder
2023-01-31 22:42 ` [PATCH v18 4/7] kexec: exclude elfcorehdr from the segment digest Eric DeVolder
2023-01-31 22:42   ` Eric DeVolder
2023-01-31 22:42 ` [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes Eric DeVolder
2023-01-31 22:42   ` Eric DeVolder
2023-02-01 11:33   ` Thomas Gleixner
2023-02-01 11:33     ` Thomas Gleixner
2023-02-06  8:12     ` Sourabh Jain
2023-02-06  8:12       ` Sourabh Jain
2023-02-06 13:03       ` Thomas Gleixner
2023-02-06 13:03         ` Thomas Gleixner
2023-02-07 17:23     ` Eric DeVolder
2023-02-07 17:23       ` Eric DeVolder
2023-02-08 13:44       ` Thomas Gleixner
2023-02-08 13:44         ` Thomas Gleixner
2023-02-09 17:31         ` Eric DeVolder
2023-02-09 17:31           ` Eric DeVolder
2023-02-09 18:43           ` Sourabh Jain
2023-02-09 18:43             ` Sourabh Jain
2023-02-09 19:39             ` Eric DeVolder
2023-02-09 19:39               ` Eric DeVolder
2023-02-10  6:29               ` Sourabh Jain
2023-02-10  6:29                 ` Sourabh Jain
2023-02-11  0:35                 ` Eric DeVolder
2023-02-11  0:35                   ` Eric DeVolder
2023-02-13  4:40                   ` Sourabh Jain
2023-02-13  4:40                     ` Sourabh Jain
2023-02-13 12:52                     ` Thomas Gleixner
2023-02-13 12:52                       ` Thomas Gleixner
2023-02-15  2:53                       ` Sourabh Jain
2023-02-15  2:53                         ` Sourabh Jain
2023-02-28 12:44                     ` Baoquan He
2023-02-28 12:44                       ` Baoquan He
2023-02-28 18:52                       ` Eric DeVolder
2023-02-28 18:52                         ` Eric DeVolder
2023-03-01 15:48                         ` Eric DeVolder
2023-03-01 15:48                           ` Eric DeVolder
2023-03-02 10:51                           ` Baoquan He
2023-03-02 10:51                             ` Baoquan He
2023-03-02  5:23                         ` Sourabh Jain
2023-03-02  5:23                           ` Sourabh Jain
2023-02-23 20:34                 ` Eric DeVolder
2023-02-23 20:34                   ` Eric DeVolder
2023-02-24  8:34                   ` Sourabh Jain
2023-02-24  8:34                     ` Sourabh Jain
2023-02-24 20:16                     ` Eric DeVolder
2023-02-24 20:16                       ` Eric DeVolder
2023-02-27  6:11                       ` Sourabh Jain
2023-02-27  6:11                         ` Sourabh Jain
2023-02-28 21:50                         ` Eric DeVolder
2023-02-28 21:50                           ` Eric DeVolder
2023-03-01  6:22                           ` Sourabh Jain
2023-03-01  6:22                             ` Sourabh Jain
2023-03-01 14:16                             ` Eric DeVolder
2023-03-01 14:16                               ` Eric DeVolder
2023-01-31 22:42 ` [PATCH v18 6/7] crash: memory and cpu hotplug sysfs attributes Eric DeVolder
2023-01-31 22:42   ` Eric DeVolder
2023-01-31 22:42 ` [PATCH v18 7/7] x86/crash: add x86 crash hotplug support Eric DeVolder
2023-01-31 22:42   ` Eric DeVolder
