* [PATCH 0/3 v3] x86/kdump: Fix 'kmem -s' reported an invalid freepointer when SME was active @ 2019-10-12 2:21 Lianbo Jiang 2019-10-12 2:21 ` [PATCH 1/3 " Lianbo Jiang ` (2 more replies) 0 siblings, 3 replies; 16+ messages in thread From: Lianbo Jiang @ 2019-10-12 2:21 UTC (permalink / raw) To: linux-kernel Cc: tglx, mingo, bp, hpa, x86, bhe, dyoung, jgross, dhowells, Thomas.Lendacky, ebiederm, vgoyal, kexec In purgatory(), the main things are as below: [1] verify sha256 hashes for various segments. Lets keep these codes, and do not touch the logic. [2] copy the first 640k content to a backup region. Lets safely remove it and clean all code related to backup region. This patch series will remove the backup region, because the current handling of copying the first 640k runs into problems when SME is active. The low 1MiB region will always be reserved when the crashkernel kernel command line option is specified. And this way makes it unnecessary to do anything with the low 1MiB region, because the memory allocated later won't fall into the low 1MiB area. This series includes three patches: [1] Fix 'kmem -s' reported an invalid freepointer when SME was active The low 1MiB region will always be reserved when the crashkernel kernel command line option is specified, which ensures that the memory allocated later won't fall into the low 1MiB area. [2] x86/kdump cleanup: remove the unused crash_copy_backup_region() The crash_copy_backup_region() has never been used, so clean up the redundant code. [3] x86/kdump: clean up all the code related to the backup region Remove the backup region and clean up. Changes since v1: [1] Add extra checking condition: when the crashkernel option is specified, reserve the low 640k area. Changes since v2: [1] Reserve the low 1MiB region when the crashkernel option is only specified.(Suggested by Eric) [2] Remove the unused crash_copy_backup_region() [3] Remove the backup region and clean up [4] Split them into three patches Lianbo Jiang (3): x86/kdump: Fix 'kmem -s' reported an invalid freepointer when SME was active x86/kdump cleanup: remove the unused crash_copy_backup_region() x86/kdump: clean up all the code related to the backup region arch/x86/include/asm/crash.h | 1 - arch/x86/include/asm/kexec.h | 10 ---- arch/x86/include/asm/purgatory.h | 10 ---- arch/x86/kernel/crash.c | 91 ++++++------------------------ arch/x86/kernel/machine_kexec_64.c | 47 --------------- arch/x86/purgatory/purgatory.c | 19 ------- arch/x86/realmode/init.c | 11 ++++ 7 files changed, 27 insertions(+), 162 deletions(-) -- 2.17.1 ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 1/3 v3] x86/kdump: Fix 'kmem -s' reported an invalid freepointer when SME was active 2019-10-12 2:21 [PATCH 0/3 v3] x86/kdump: Fix 'kmem -s' reported an invalid freepointer when SME was active Lianbo Jiang @ 2019-10-12 2:21 ` Lianbo Jiang 2019-10-12 2:21 ` [PATCH 2/3 v3] x86/kdump cleanup: remove the unused crash_copy_backup_region() Lianbo Jiang 2019-10-12 2:21 ` [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region Lianbo Jiang 2 siblings, 0 replies; 16+ messages in thread From: Lianbo Jiang @ 2019-10-12 2:21 UTC (permalink / raw) To: linux-kernel Cc: tglx, mingo, bp, hpa, x86, bhe, dyoung, jgross, dhowells, Thomas.Lendacky, ebiederm, vgoyal, kexec Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204793 Kdump kernel will reuse the first 640k region because of some reasons, for example: the trampline and conventional PC system BIOS region may require to allocate memory in this area. Obviously, kdump kernel will also overwrite the first 640k region, therefore, kernel has to copy the contents of the first 640k area to a backup area, which is done in purgatory(), because vmcore may need the old memory. When vmcore is dumped, kdump kernel will read the old memory from the backup area of the first 640k area. Basically, the main reason should be clear, kernel does not correctly handle the first 640k region when SME is active, which causes that kernel does not properly copy these old memory to the backup area in purgatory(). Therefore, kdump kernel reads out the incorrect contents from the backup area when dumping vmcore. Finally, the phenomenon is as follow: [root linux]$ crash vmlinux /var/crash/127.0.0.1-2019-09-19-08\:31\:27/vmcore WARNING: kernel relocated [240MB]: patching 97110 gdb minimal_symbol values KERNEL: /var/crash/127.0.0.1-2019-09-19-08:31:27/vmlinux DUMPFILE: /var/crash/127.0.0.1-2019-09-19-08:31:27/vmcore [PARTIAL DUMP] CPUS: 128 DATE: Thu Sep 19 08:31:18 2019 UPTIME: 00:01:21 LOAD AVERAGE: 0.16, 0.07, 0.02 TASKS: 1343 NODENAME: amd-ethanol RELEASE: 5.3.0-rc7+ VERSION: #4 SMP Thu Sep 19 08:14:00 EDT 2019 MACHINE: x86_64 (2195 Mhz) MEMORY: 127.9 GB PANIC: "Kernel panic - not syncing: sysrq triggered crash" PID: 9789 COMMAND: "bash" TASK: "ffff89711894ae80 [THREAD_INFO: ffff89711894ae80]" CPU: 83 STATE: TASK_RUNNING (PANIC) crash> kmem -s|grep -i invalid kmem: dma-kmalloc-512: slab:ffffd77680001c00 invalid freepointer:a6086ac099f0c5a4 kmem: dma-kmalloc-512: slab:ffffd77680001c00 invalid freepointer:a6086ac099f0c5a4 crash> BTW: I also tried to fix the above problem in purgatory(), but there are too many restricts in purgatory() context, for example: i can't allocate new memory to create the identity mapping page table for SME situation. Currently, there are two places where the first 640k area is needed, the first one is in the find_trampoline_placement(), another one is in the reserve_real_mode(), and their content doesn't matter. To avoid the above error, when the crashkernel kernel command line option is specified, lets reserve the remain low 1MiB memory(after reserving real mode memroy) so that the allocated memory does not fall into the low 1MiB area, which makes us not to copy the first 640k content to a backup region in purgatory(). In addition, also need to clean all the code related to the backup region later. Signed-off-by: Lianbo Jiang <lijiang@redhat.com> --- arch/x86/realmode/init.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c index 7dce39c8c034..bf4c8ffc5ed9 100644 --- a/arch/x86/realmode/init.c +++ b/arch/x86/realmode/init.c @@ -34,6 +34,17 @@ void __init reserve_real_mode(void) memblock_reserve(mem, size); set_real_mode_mem(mem); + +#ifdef CONFIG_KEXEC_CORE + /* + * When the crashkernel option is specified, only use the low + * 1MiB for the real mode trampoline. + */ + if (strstr(boot_command_line, "crashkernel=")) { + memblock_reserve(0, SZ_1M); + pr_info("Reserving low 1MiB of memory for crashkernel\n"); + } +#endif /* CONFIG_KEXEC_CORE */ } static void __init setup_real_mode(void) -- 2.17.1 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 2/3 v3] x86/kdump cleanup: remove the unused crash_copy_backup_region() 2019-10-12 2:21 [PATCH 0/3 v3] x86/kdump: Fix 'kmem -s' reported an invalid freepointer when SME was active Lianbo Jiang 2019-10-12 2:21 ` [PATCH 1/3 " Lianbo Jiang @ 2019-10-12 2:21 ` Lianbo Jiang 2019-10-12 2:21 ` [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region Lianbo Jiang 2 siblings, 0 replies; 16+ messages in thread From: Lianbo Jiang @ 2019-10-12 2:21 UTC (permalink / raw) To: linux-kernel Cc: tglx, mingo, bp, hpa, x86, bhe, dyoung, jgross, dhowells, Thomas.Lendacky, ebiederm, vgoyal, kexec The crash_copy_backup_region() has never been used, so clean up the redundant code. Signed-off-by: Lianbo Jiang <lijiang@redhat.com> --- arch/x86/include/asm/crash.h | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/x86/include/asm/crash.h b/arch/x86/include/asm/crash.h index 0acf5ee45a21..089b2850f9d1 100644 --- a/arch/x86/include/asm/crash.h +++ b/arch/x86/include/asm/crash.h @@ -3,7 +3,6 @@ #define _ASM_X86_CRASH_H int crash_load_segments(struct kimage *image); -int crash_copy_backup_region(struct kimage *image); int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params); void crash_smp_send_stop(void); -- 2.17.1 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region 2019-10-12 2:21 [PATCH 0/3 v3] x86/kdump: Fix 'kmem -s' reported an invalid freepointer when SME was active Lianbo Jiang 2019-10-12 2:21 ` [PATCH 1/3 " Lianbo Jiang 2019-10-12 2:21 ` [PATCH 2/3 v3] x86/kdump cleanup: remove the unused crash_copy_backup_region() Lianbo Jiang @ 2019-10-12 2:21 ` Lianbo Jiang 2019-10-12 11:26 ` Eric W. Biederman 2 siblings, 1 reply; 16+ messages in thread From: Lianbo Jiang @ 2019-10-12 2:21 UTC (permalink / raw) To: linux-kernel Cc: tglx, mingo, bp, hpa, x86, bhe, dyoung, jgross, dhowells, Thomas.Lendacky, ebiederm, vgoyal, kexec When the crashkernel kernel command line option is specified, the low 1MiB memory will always be reserved, which makes that the memory allocated later won't fall into the low 1MiB area, thereby, it's not necessary to create a backup region and also no need to copy the first 640k content to a backup region. Currently, the code related to the backup region can be safely removed, so lets clean up. Signed-off-by: Lianbo Jiang <lijiang@redhat.com> --- arch/x86/include/asm/kexec.h | 10 ---- arch/x86/include/asm/purgatory.h | 10 ---- arch/x86/kernel/crash.c | 91 ++++++------------------------ arch/x86/kernel/machine_kexec_64.c | 47 --------------- arch/x86/purgatory/purgatory.c | 19 ------- 5 files changed, 16 insertions(+), 161 deletions(-) diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h index 5e7d6b46de97..6802c59e8252 100644 --- a/arch/x86/include/asm/kexec.h +++ b/arch/x86/include/asm/kexec.h @@ -66,10 +66,6 @@ struct kimage; # define KEXEC_ARCH KEXEC_ARCH_X86_64 #endif -/* Memory to backup during crash kdump */ -#define KEXEC_BACKUP_SRC_START (0UL) -#define KEXEC_BACKUP_SRC_END (640 * 1024UL - 1) /* 640K */ - /* * This function is responsible for capturing register states if coming * via panic otherwise just fix up the ss and sp if coming via kernel @@ -154,12 +150,6 @@ struct kimage_arch { pud_t *pud; pmd_t *pmd; pte_t *pte; - /* Details of backup region */ - unsigned long backup_src_start; - unsigned long backup_src_sz; - - /* Physical address of backup segment */ - unsigned long backup_load_addr; /* Core ELF header buffer */ void *elf_headers; diff --git a/arch/x86/include/asm/purgatory.h b/arch/x86/include/asm/purgatory.h index 92c34e517da1..5528e9325049 100644 --- a/arch/x86/include/asm/purgatory.h +++ b/arch/x86/include/asm/purgatory.h @@ -6,16 +6,6 @@ #include <linux/purgatory.h> extern void purgatory(void); -/* - * These forward declarations serve two purposes: - * - * 1) Make sparse happy when checking arch/purgatory - * 2) Document that these are required to be global so the symbol - * lookup in kexec works - */ -extern unsigned long purgatory_backup_dest; -extern unsigned long purgatory_backup_src; -extern unsigned long purgatory_backup_sz; #endif /* __ASSEMBLY__ */ #endif /* _ASM_PURGATORY_H */ diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c index eb651fbde92a..cc5774fc84c0 100644 --- a/arch/x86/kernel/crash.c +++ b/arch/x86/kernel/crash.c @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs) #ifdef CONFIG_KEXEC_FILE -static unsigned long crash_zero_bytes; - static int get_nr_ram_ranges_callback(struct resource *res, void *arg) { unsigned int *nr_ranges = arg; @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg) { struct crash_mem *cmem = arg; - cmem->ranges[cmem->nr_ranges].start = res->start; - cmem->ranges[cmem->nr_ranges].end = res->end; - cmem->nr_ranges++; + if (res->start >= SZ_1M) { + cmem->ranges[cmem->nr_ranges].start = res->start; + cmem->ranges[cmem->nr_ranges].end = res->end; + cmem->nr_ranges++; + } else if (res->end > SZ_1M) { + cmem->ranges[cmem->nr_ranges].start = SZ_1M; + cmem->ranges[cmem->nr_ranges].end = res->end; + cmem->nr_ranges++; + } return 0; } @@ -246,9 +250,7 @@ static int prepare_elf_headers(struct kimage *image, void **addr, unsigned long *sz) { struct crash_mem *cmem; - Elf64_Ehdr *ehdr; - Elf64_Phdr *phdr; - int ret, i; + int ret; cmem = fill_up_crash_elf_data(); if (!cmem) @@ -270,19 +272,6 @@ static int prepare_elf_headers(struct kimage *image, void **addr, if (ret) goto out; - /* - * If a range matches backup region, adjust offset to backup - * segment. - */ - ehdr = (Elf64_Ehdr *)*addr; - phdr = (Elf64_Phdr *)(ehdr + 1); - for (i = 0; i < ehdr->e_phnum; phdr++, i++) - if (phdr->p_type == PT_LOAD && - phdr->p_paddr == image->arch.backup_src_start && - phdr->p_memsz == image->arch.backup_src_sz) { - phdr->p_offset = image->arch.backup_load_addr; - break; - } out: vfree(cmem); return ret; @@ -321,19 +310,11 @@ static int memmap_exclude_ranges(struct kimage *image, struct crash_mem *cmem, unsigned long long mend) { unsigned long start, end; - int ret = 0; cmem->ranges[0].start = mstart; cmem->ranges[0].end = mend; cmem->nr_ranges = 1; - /* Exclude Backup region */ - start = image->arch.backup_load_addr; - end = start + image->arch.backup_src_sz - 1; - ret = crash_exclude_mem_range(cmem, start, end); - if (ret) - return ret; - /* Exclude elf header region */ start = image->arch.elf_load_addr; end = start + image->arch.elf_headers_sz - 1; @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params) memset(&cmd, 0, sizeof(struct crash_memmap_data)); cmd.params = params; - /* Add first 640K segment */ - ei.addr = image->arch.backup_src_start; - ei.size = image->arch.backup_src_sz; + /* + * Add the low memory range[0x1000, SZ_1M], skip + * the first zero page. + */ + ei.addr = PAGE_SIZE; + ei.size = SZ_1M - PAGE_SIZE; ei.type = E820_TYPE_RAM; add_e820_entry(params, &ei); @@ -409,55 +393,12 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params) return ret; } -static int determine_backup_region(struct resource *res, void *arg) -{ - struct kimage *image = arg; - - image->arch.backup_src_start = res->start; - image->arch.backup_src_sz = resource_size(res); - - /* Expecting only one range for backup region */ - return 1; -} - int crash_load_segments(struct kimage *image) { int ret; struct kexec_buf kbuf = { .image = image, .buf_min = 0, .buf_max = ULONG_MAX, .top_down = false }; - /* - * Determine and load a segment for backup area. First 640K RAM - * region is backup source - */ - - ret = walk_system_ram_res(KEXEC_BACKUP_SRC_START, KEXEC_BACKUP_SRC_END, - image, determine_backup_region); - - /* Zero or postive return values are ok */ - if (ret < 0) - return ret; - - /* Add backup segment. */ - if (image->arch.backup_src_sz) { - kbuf.buffer = &crash_zero_bytes; - kbuf.bufsz = sizeof(crash_zero_bytes); - kbuf.memsz = image->arch.backup_src_sz; - kbuf.buf_align = PAGE_SIZE; - /* - * Ideally there is no source for backup segment. This is - * copied in purgatory after crash. Just add a zero filled - * segment for now to make sure checksum logic works fine. - */ - ret = kexec_add_buffer(&kbuf); - if (ret) - return ret; - image->arch.backup_load_addr = kbuf.mem; - pr_debug("Loaded backup region at 0x%lx backup_start=0x%lx memsz=0x%lx\n", - image->arch.backup_load_addr, - image->arch.backup_src_start, kbuf.memsz); - } - /* Prepare elf headers and add a segment */ ret = prepare_elf_headers(image, &kbuf.buffer, &kbuf.bufsz); if (ret) diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c index 5dcd438ad8f2..16e125a50b33 100644 --- a/arch/x86/kernel/machine_kexec_64.c +++ b/arch/x86/kernel/machine_kexec_64.c @@ -298,48 +298,6 @@ static void load_segments(void) ); } -#ifdef CONFIG_KEXEC_FILE -/* Update purgatory as needed after various image segments have been prepared */ -static int arch_update_purgatory(struct kimage *image) -{ - int ret = 0; - - if (!image->file_mode) - return 0; - - /* Setup copying of backup region */ - if (image->type == KEXEC_TYPE_CRASH) { - ret = kexec_purgatory_get_set_symbol(image, - "purgatory_backup_dest", - &image->arch.backup_load_addr, - sizeof(image->arch.backup_load_addr), 0); - if (ret) - return ret; - - ret = kexec_purgatory_get_set_symbol(image, - "purgatory_backup_src", - &image->arch.backup_src_start, - sizeof(image->arch.backup_src_start), 0); - if (ret) - return ret; - - ret = kexec_purgatory_get_set_symbol(image, - "purgatory_backup_sz", - &image->arch.backup_src_sz, - sizeof(image->arch.backup_src_sz), 0); - if (ret) - return ret; - } - - return ret; -} -#else /* !CONFIG_KEXEC_FILE */ -static inline int arch_update_purgatory(struct kimage *image) -{ - return 0; -} -#endif /* CONFIG_KEXEC_FILE */ - int machine_kexec_prepare(struct kimage *image) { unsigned long start_pgtable; @@ -353,11 +311,6 @@ int machine_kexec_prepare(struct kimage *image) if (result) return result; - /* update purgatory as needed */ - result = arch_update_purgatory(image); - if (result) - return result; - return 0; } diff --git a/arch/x86/purgatory/purgatory.c b/arch/x86/purgatory/purgatory.c index 3b95410ff0f8..2961234d0795 100644 --- a/arch/x86/purgatory/purgatory.c +++ b/arch/x86/purgatory/purgatory.c @@ -14,28 +14,10 @@ #include "../boot/string.h" -unsigned long purgatory_backup_dest __section(.kexec-purgatory); -unsigned long purgatory_backup_src __section(.kexec-purgatory); -unsigned long purgatory_backup_sz __section(.kexec-purgatory); - u8 purgatory_sha256_digest[SHA256_DIGEST_SIZE] __section(.kexec-purgatory); struct kexec_sha_region purgatory_sha_regions[KEXEC_SEGMENT_MAX] __section(.kexec-purgatory); -/* - * On x86, second kernel requries first 640K of memory to boot. Copy - * first 640K to a backup region in reserved memory range so that second - * kernel can use first 640K. - */ -static int copy_backup_region(void) -{ - if (purgatory_backup_dest) { - memcpy((void *)purgatory_backup_dest, - (void *)purgatory_backup_src, purgatory_backup_sz); - } - return 0; -} - static int verify_sha256_digest(void) { struct kexec_sha_region *ptr, *end; @@ -66,7 +48,6 @@ void purgatory(void) for (;;) ; } - copy_backup_region(); } /* -- 2.17.1 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region 2019-10-12 2:21 ` [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region Lianbo Jiang @ 2019-10-12 11:26 ` Eric W. Biederman 2019-10-12 12:16 ` Dave Young 0 siblings, 1 reply; 16+ messages in thread From: Eric W. Biederman @ 2019-10-12 11:26 UTC (permalink / raw) To: Lianbo Jiang Cc: linux-kernel, tglx, mingo, bp, hpa, x86, bhe, dyoung, jgross, dhowells, Thomas.Lendacky, vgoyal, kexec Lianbo Jiang <lijiang@redhat.com> writes: > When the crashkernel kernel command line option is specified, the > low 1MiB memory will always be reserved, which makes that the memory > allocated later won't fall into the low 1MiB area, thereby, it's not > necessary to create a backup region and also no need to copy the first > 640k content to a backup region. > > Currently, the code related to the backup region can be safely removed, > so lets clean up. > > Signed-off-by: Lianbo Jiang <lijiang@redhat.com> > --- > diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c > index eb651fbde92a..cc5774fc84c0 100644 > --- a/arch/x86/kernel/crash.c > +++ b/arch/x86/kernel/crash.c > @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs) > > #ifdef CONFIG_KEXEC_FILE > > -static unsigned long crash_zero_bytes; > - > static int get_nr_ram_ranges_callback(struct resource *res, void *arg) > { > unsigned int *nr_ranges = arg; > @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg) > { > struct crash_mem *cmem = arg; > > - cmem->ranges[cmem->nr_ranges].start = res->start; > - cmem->ranges[cmem->nr_ranges].end = res->end; > - cmem->nr_ranges++; > + if (res->start >= SZ_1M) { > + cmem->ranges[cmem->nr_ranges].start = res->start; > + cmem->ranges[cmem->nr_ranges].end = res->end; > + cmem->nr_ranges++; > + } else if (res->end > SZ_1M) { > + cmem->ranges[cmem->nr_ranges].start = SZ_1M; > + cmem->ranges[cmem->nr_ranges].end = res->end; > + cmem->nr_ranges++; > + } What is going on with this chunk? I can guess but this needs a clear comment. > > return 0; > } > @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params) > memset(&cmd, 0, sizeof(struct crash_memmap_data)); > cmd.params = params; > > - /* Add first 640K segment */ > - ei.addr = image->arch.backup_src_start; > - ei.size = image->arch.backup_src_sz; > + /* > + * Add the low memory range[0x1000, SZ_1M], skip > + * the first zero page. > + */ > + ei.addr = PAGE_SIZE; > + ei.size = SZ_1M - PAGE_SIZE; > ei.type = E820_TYPE_RAM; > add_e820_entry(params, &ei); Likewise here. Why do we need a special case? Why the magic with PAGE_SIZE? Is this needed because of your other special case above? When SME is active and the crashkernel command line is enabled do we just want to leave the low 1MB unencrypted? So we don't need any special cases? Eric ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region 2019-10-12 11:26 ` Eric W. Biederman @ 2019-10-12 12:16 ` Dave Young 2019-10-13 3:54 ` Eric W. Biederman 2019-10-14 10:02 ` lijiang 0 siblings, 2 replies; 16+ messages in thread From: Dave Young @ 2019-10-12 12:16 UTC (permalink / raw) To: Eric W. Biederman Cc: Lianbo Jiang, linux-kernel, tglx, mingo, bp, hpa, x86, bhe, jgross, dhowells, Thomas.Lendacky, vgoyal, kexec Hi Eric, On 10/12/19 at 06:26am, Eric W. Biederman wrote: > Lianbo Jiang <lijiang@redhat.com> writes: > > > When the crashkernel kernel command line option is specified, the > > low 1MiB memory will always be reserved, which makes that the memory > > allocated later won't fall into the low 1MiB area, thereby, it's not > > necessary to create a backup region and also no need to copy the first > > 640k content to a backup region. > > > > Currently, the code related to the backup region can be safely removed, > > so lets clean up. > > > > Signed-off-by: Lianbo Jiang <lijiang@redhat.com> > > --- > > > diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c > > index eb651fbde92a..cc5774fc84c0 100644 > > --- a/arch/x86/kernel/crash.c > > +++ b/arch/x86/kernel/crash.c > > @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs) > > > > #ifdef CONFIG_KEXEC_FILE > > > > -static unsigned long crash_zero_bytes; > > - > > static int get_nr_ram_ranges_callback(struct resource *res, void *arg) > > { > > unsigned int *nr_ranges = arg; > > @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg) > > { > > struct crash_mem *cmem = arg; > > > > - cmem->ranges[cmem->nr_ranges].start = res->start; > > - cmem->ranges[cmem->nr_ranges].end = res->end; > > - cmem->nr_ranges++; > > + if (res->start >= SZ_1M) { > > + cmem->ranges[cmem->nr_ranges].start = res->start; > > + cmem->ranges[cmem->nr_ranges].end = res->end; > > + cmem->nr_ranges++; > > + } else if (res->end > SZ_1M) { > > + cmem->ranges[cmem->nr_ranges].start = SZ_1M; > > + cmem->ranges[cmem->nr_ranges].end = res->end; > > + cmem->nr_ranges++; > > + } > > What is going on with this chunk? I can guess but this needs a clear > comment. Indeed it needs some code comment, this is based on some offline discussion. cat /proc/vmcore will give a warning because ioremap is mapping the system ram. We pass the first 1M to kdump kernel in e820 as system ram so that 2nd kernel can use the low 1M memory because for example the trampoline code. > > > > > return 0; > > } > > > @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params) > > memset(&cmd, 0, sizeof(struct crash_memmap_data)); > > cmd.params = params; > > > > - /* Add first 640K segment */ > > - ei.addr = image->arch.backup_src_start; > > - ei.size = image->arch.backup_src_sz; > > + /* > > + * Add the low memory range[0x1000, SZ_1M], skip > > + * the first zero page. > > + */ > > + ei.addr = PAGE_SIZE; > > + ei.size = SZ_1M - PAGE_SIZE; > > ei.type = E820_TYPE_RAM; > > add_e820_entry(params, &ei); > > Likewise here. Why do we need a special case? > Why the magic with PAGE_SIZE? Good catch, the zero page part is useless, I think no other special reason, just assumed zero page is not usable, but it should be ok to remove the special handling, just pass 0 - 1M is good enough. Thanks Dave ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region 2019-10-12 12:16 ` Dave Young @ 2019-10-13 3:54 ` Eric W. Biederman 2019-10-13 9:36 ` lijiang 2019-10-13 10:20 ` Dave Young 2019-10-14 10:02 ` lijiang 1 sibling, 2 replies; 16+ messages in thread From: Eric W. Biederman @ 2019-10-13 3:54 UTC (permalink / raw) To: Dave Young Cc: Lianbo Jiang, linux-kernel, tglx, mingo, bp, hpa, x86, bhe, jgross, dhowells, Thomas.Lendacky, vgoyal, kexec Dave Young <dyoung@redhat.com> writes: > Hi Eric, > > On 10/12/19 at 06:26am, Eric W. Biederman wrote: >> Lianbo Jiang <lijiang@redhat.com> writes: >> >> > When the crashkernel kernel command line option is specified, the >> > low 1MiB memory will always be reserved, which makes that the memory >> > allocated later won't fall into the low 1MiB area, thereby, it's not >> > necessary to create a backup region and also no need to copy the first >> > 640k content to a backup region. >> > >> > Currently, the code related to the backup region can be safely removed, >> > so lets clean up. >> > >> > Signed-off-by: Lianbo Jiang <lijiang@redhat.com> >> > --- >> >> > diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c >> > index eb651fbde92a..cc5774fc84c0 100644 >> > --- a/arch/x86/kernel/crash.c >> > +++ b/arch/x86/kernel/crash.c >> > @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs) >> > >> > #ifdef CONFIG_KEXEC_FILE >> > >> > -static unsigned long crash_zero_bytes; >> > - >> > static int get_nr_ram_ranges_callback(struct resource *res, void *arg) >> > { >> > unsigned int *nr_ranges = arg; >> > @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg) >> > { >> > struct crash_mem *cmem = arg; >> > >> > - cmem->ranges[cmem->nr_ranges].start = res->start; >> > - cmem->ranges[cmem->nr_ranges].end = res->end; >> > - cmem->nr_ranges++; >> > + if (res->start >= SZ_1M) { >> > + cmem->ranges[cmem->nr_ranges].start = res->start; >> > + cmem->ranges[cmem->nr_ranges].end = res->end; >> > + cmem->nr_ranges++; >> > + } else if (res->end > SZ_1M) { >> > + cmem->ranges[cmem->nr_ranges].start = SZ_1M; >> > + cmem->ranges[cmem->nr_ranges].end = res->end; >> > + cmem->nr_ranges++; >> > + } >> >> What is going on with this chunk? I can guess but this needs a clear >> comment. > > Indeed it needs some code comment, this is based on some offline > discussion. cat /proc/vmcore will give a warning because ioremap is > mapping the system ram. > > We pass the first 1M to kdump kernel in e820 as system ram so that 2nd > kernel can use the low 1M memory because for example the trampoline > code. > >> >> > >> > return 0; >> > } >> >> > @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params) >> > memset(&cmd, 0, sizeof(struct crash_memmap_data)); >> > cmd.params = params; >> > >> > - /* Add first 640K segment */ >> > - ei.addr = image->arch.backup_src_start; >> > - ei.size = image->arch.backup_src_sz; >> > + /* >> > + * Add the low memory range[0x1000, SZ_1M], skip >> > + * the first zero page. >> > + */ >> > + ei.addr = PAGE_SIZE; >> > + ei.size = SZ_1M - PAGE_SIZE; >> > ei.type = E820_TYPE_RAM; >> > add_e820_entry(params, &ei); >> >> Likewise here. Why do we need a special case? >> Why the magic with PAGE_SIZE? > > Good catch, the zero page part is useless, I think no other special > reason, just assumed zero page is not usable, but it should be ok to > remove the special handling, just pass 0 - 1M is good enough. But if we have stopped special casing the low 1M. Why do we need a special case here at all? If you need the special case it is almost certainly wrong to say you have ram above 640KiB and below 1MiB. That is the legacy ROM and video MMIO area. There is a reason the original code said 640KiB. Eric ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region 2019-10-13 3:54 ` Eric W. Biederman @ 2019-10-13 9:36 ` lijiang 2019-10-15 11:04 ` Eric W. Biederman 2019-10-13 10:20 ` Dave Young 1 sibling, 1 reply; 16+ messages in thread From: lijiang @ 2019-10-13 9:36 UTC (permalink / raw) To: Eric W. Biederman, Dave Young Cc: linux-kernel, tglx, mingo, bp, hpa, x86, bhe, jgross, dhowells, Thomas.Lendacky, vgoyal, kexec 在 2019年10月13日 11:54, Eric W. Biederman 写道: > Dave Young <dyoung@redhat.com> writes: > >> Hi Eric, >> >> On 10/12/19 at 06:26am, Eric W. Biederman wrote: >>> Lianbo Jiang <lijiang@redhat.com> writes: >>> >>>> When the crashkernel kernel command line option is specified, the >>>> low 1MiB memory will always be reserved, which makes that the memory >>>> allocated later won't fall into the low 1MiB area, thereby, it's not >>>> necessary to create a backup region and also no need to copy the first >>>> 640k content to a backup region. >>>> >>>> Currently, the code related to the backup region can be safely removed, >>>> so lets clean up. >>>> >>>> Signed-off-by: Lianbo Jiang <lijiang@redhat.com> >>>> --- >>> >>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c >>>> index eb651fbde92a..cc5774fc84c0 100644 >>>> --- a/arch/x86/kernel/crash.c >>>> +++ b/arch/x86/kernel/crash.c >>>> @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs) >>>> >>>> #ifdef CONFIG_KEXEC_FILE >>>> >>>> -static unsigned long crash_zero_bytes; >>>> - >>>> static int get_nr_ram_ranges_callback(struct resource *res, void *arg) >>>> { >>>> unsigned int *nr_ranges = arg; >>>> @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg) >>>> { >>>> struct crash_mem *cmem = arg; >>>> >>>> - cmem->ranges[cmem->nr_ranges].start = res->start; >>>> - cmem->ranges[cmem->nr_ranges].end = res->end; >>>> - cmem->nr_ranges++; >>>> + if (res->start >= SZ_1M) { >>>> + cmem->ranges[cmem->nr_ranges].start = res->start; >>>> + cmem->ranges[cmem->nr_ranges].end = res->end; >>>> + cmem->nr_ranges++; >>>> + } else if (res->end > SZ_1M) { >>>> + cmem->ranges[cmem->nr_ranges].start = SZ_1M; >>>> + cmem->ranges[cmem->nr_ranges].end = res->end; >>>> + cmem->nr_ranges++; >>>> + } >>> >>> What is going on with this chunk? I can guess but this needs a clear >>> comment. >> >> Indeed it needs some code comment, this is based on some offline >> discussion. cat /proc/vmcore will give a warning because ioremap is >> mapping the system ram. >> >> We pass the first 1M to kdump kernel in e820 as system ram so that 2nd >> kernel can use the low 1M memory because for example the trampoline >> code. >> >>> >>>> >>>> return 0; >>>> } >>> >>>> @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params) >>>> memset(&cmd, 0, sizeof(struct crash_memmap_data)); >>>> cmd.params = params; >>>> >>>> - /* Add first 640K segment */ >>>> - ei.addr = image->arch.backup_src_start; >>>> - ei.size = image->arch.backup_src_sz; >>>> + /* >>>> + * Add the low memory range[0x1000, SZ_1M], skip >>>> + * the first zero page. >>>> + */ >>>> + ei.addr = PAGE_SIZE; >>>> + ei.size = SZ_1M - PAGE_SIZE; >>>> ei.type = E820_TYPE_RAM; >>>> add_e820_entry(params, &ei); >>> >>> Likewise here. Why do we need a special case? >>> Why the magic with PAGE_SIZE? >> >> Good catch, the zero page part is useless, I think no other special >> reason, just assumed zero page is not usable, but it should be ok to >> remove the special handling, just pass 0 - 1M is good enough. > > But if we have stopped special casing the low 1M. Why do we need a > special case here at all? > Here, need to pass the low memory range to kdump kernel, which will guarantee the availability of low memory in kdump kernel, otherwise, kdump kernel won't use the low memory region. > If you need the special case it is almost certainly wrong to say you > have ram above 640KiB and below 1MiB. That is the legacy ROM and video > MMIO area. > > There is a reason the original code said 640KiB. > Do you mean that the 640k region is good enough here instead of 1MiB? Thanks. Lianbo > Eric > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region 2019-10-13 9:36 ` lijiang @ 2019-10-15 11:04 ` Eric W. Biederman 2019-10-16 8:40 ` lijiang 2019-10-16 14:30 ` lijiang 0 siblings, 2 replies; 16+ messages in thread From: Eric W. Biederman @ 2019-10-15 11:04 UTC (permalink / raw) To: lijiang Cc: Dave Young, linux-kernel, tglx, mingo, bp, hpa, x86, bhe, jgross, dhowells, Thomas.Lendacky, vgoyal, kexec lijiang <lijiang@redhat.com> writes: > 在 2019年10月13日 11:54, Eric W. Biederman 写道: >> Dave Young <dyoung@redhat.com> writes: >> >>> Hi Eric, >>> >>> On 10/12/19 at 06:26am, Eric W. Biederman wrote: >>>> Lianbo Jiang <lijiang@redhat.com> writes: >>>> >>>>> When the crashkernel kernel command line option is specified, the >>>>> low 1MiB memory will always be reserved, which makes that the memory >>>>> allocated later won't fall into the low 1MiB area, thereby, it's not >>>>> necessary to create a backup region and also no need to copy the first >>>>> 640k content to a backup region. >>>>> >>>>> Currently, the code related to the backup region can be safely removed, >>>>> so lets clean up. >>>>> >>>>> Signed-off-by: Lianbo Jiang <lijiang@redhat.com> >>>>> --- >>>> >>>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c >>>>> index eb651fbde92a..cc5774fc84c0 100644 >>>>> --- a/arch/x86/kernel/crash.c >>>>> +++ b/arch/x86/kernel/crash.c >>>>> @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs) >>>>> >>>>> #ifdef CONFIG_KEXEC_FILE >>>>> >>>>> -static unsigned long crash_zero_bytes; >>>>> - >>>>> static int get_nr_ram_ranges_callback(struct resource *res, void *arg) >>>>> { >>>>> unsigned int *nr_ranges = arg; >>>>> @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg) >>>>> { >>>>> struct crash_mem *cmem = arg; >>>>> >>>>> - cmem->ranges[cmem->nr_ranges].start = res->start; >>>>> - cmem->ranges[cmem->nr_ranges].end = res->end; >>>>> - cmem->nr_ranges++; >>>>> + if (res->start >= SZ_1M) { >>>>> + cmem->ranges[cmem->nr_ranges].start = res->start; >>>>> + cmem->ranges[cmem->nr_ranges].end = res->end; >>>>> + cmem->nr_ranges++; >>>>> + } else if (res->end > SZ_1M) { >>>>> + cmem->ranges[cmem->nr_ranges].start = SZ_1M; >>>>> + cmem->ranges[cmem->nr_ranges].end = res->end; >>>>> + cmem->nr_ranges++; >>>>> + } >>>> >>>> What is going on with this chunk? I can guess but this needs a clear >>>> comment. >>> >>> Indeed it needs some code comment, this is based on some offline >>> discussion. cat /proc/vmcore will give a warning because ioremap is >>> mapping the system ram. >>> >>> We pass the first 1M to kdump kernel in e820 as system ram so that 2nd >>> kernel can use the low 1M memory because for example the trampoline >>> code. >>> >>>> >>>>> >>>>> return 0; >>>>> } >>>> >>>>> @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params) >>>>> memset(&cmd, 0, sizeof(struct crash_memmap_data)); >>>>> cmd.params = params; >>>>> >>>>> - /* Add first 640K segment */ >>>>> - ei.addr = image->arch.backup_src_start; >>>>> - ei.size = image->arch.backup_src_sz; >>>>> + /* >>>>> + * Add the low memory range[0x1000, SZ_1M], skip >>>>> + * the first zero page. >>>>> + */ >>>>> + ei.addr = PAGE_SIZE; >>>>> + ei.size = SZ_1M - PAGE_SIZE; >>>>> ei.type = E820_TYPE_RAM; >>>>> add_e820_entry(params, &ei); >>>> >>>> Likewise here. Why do we need a special case? >>>> Why the magic with PAGE_SIZE? >>> >>> Good catch, the zero page part is useless, I think no other special >>> reason, just assumed zero page is not usable, but it should be ok to >>> remove the special handling, just pass 0 - 1M is good enough. >> >> But if we have stopped special casing the low 1M. Why do we need a >> special case here at all? >> > Here, need to pass the low memory range to kdump kernel, which will guarantee > the availability of low memory in kdump kernel, otherwise, kdump kernel won't > use the low memory region. > >> If you need the special case it is almost certainly wrong to say you >> have ram above 640KiB and below 1MiB. That is the legacy ROM and video >> MMIO area. >> >> There is a reason the original code said 640KiB. >> > Do you mean that the 640k region is good enough here instead of 1MiB? Reading through the code of crash_setup_memap_entries I see that what the code is doing now. The code is repeating the e820 memory map with the memory areas that were not reserved for the crash kernel removed. In which case what the code needs to be doing something like: cmd.type = E820_TYPE_RAM; flags = IORESOURCE_MEM; walk_iomem_res_desc(IORES_DESC_RESERVED, flags, 0, 1024*1024, &cmd, memmap_entry_callback); Depending on which bugs exist it might make sense to limit this to the low 640KiB. But finding something the kernel already recognizes as RAM should prevent most of those problems already. Barring bugs I admit it doesn't make sense to repeat the work that someone else has already done. This bit: /* Add e820 reserved ranges */ cmd.type = E820_TYPE_RESERVED; flags = IORESOURCE_MEM; walk_iomem_res_desc(IORES_DESC_RESERVED, flags, 0, -1, &cmd, memmap_entry_callback); Should probably start at 1MiB instead of 0. Just so we don't report the memory below 1MiB as unconditionally reserved. I don't properly understand the IORES_DESC_RESERVED flag, and how that differs from flags. So please test my suggestions to verify the code works as expected. Eric ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region 2019-10-15 11:04 ` Eric W. Biederman @ 2019-10-16 8:40 ` lijiang 2019-10-16 14:30 ` lijiang 1 sibling, 0 replies; 16+ messages in thread From: lijiang @ 2019-10-16 8:40 UTC (permalink / raw) To: Eric W. Biederman Cc: Dave Young, linux-kernel, tglx, mingo, bp, hpa, x86, bhe, jgross, dhowells, Thomas.Lendacky, vgoyal, kexec 在 2019年10月15日 19:04, Eric W. Biederman 写道: > lijiang <lijiang@redhat.com> writes: > >> 在 2019年10月13日 11:54, Eric W. Biederman 写道: >>> Dave Young <dyoung@redhat.com> writes: >>> >>>> Hi Eric, >>>> >>>> On 10/12/19 at 06:26am, Eric W. Biederman wrote: >>>>> Lianbo Jiang <lijiang@redhat.com> writes: >>>>> >>>>>> When the crashkernel kernel command line option is specified, the >>>>>> low 1MiB memory will always be reserved, which makes that the memory >>>>>> allocated later won't fall into the low 1MiB area, thereby, it's not >>>>>> necessary to create a backup region and also no need to copy the first >>>>>> 640k content to a backup region. >>>>>> >>>>>> Currently, the code related to the backup region can be safely removed, >>>>>> so lets clean up. >>>>>> >>>>>> Signed-off-by: Lianbo Jiang <lijiang@redhat.com> >>>>>> --- >>>>> >>>>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c >>>>>> index eb651fbde92a..cc5774fc84c0 100644 >>>>>> --- a/arch/x86/kernel/crash.c >>>>>> +++ b/arch/x86/kernel/crash.c >>>>>> @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs) >>>>>> >>>>>> #ifdef CONFIG_KEXEC_FILE >>>>>> >>>>>> -static unsigned long crash_zero_bytes; >>>>>> - >>>>>> static int get_nr_ram_ranges_callback(struct resource *res, void *arg) >>>>>> { >>>>>> unsigned int *nr_ranges = arg; >>>>>> @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg) >>>>>> { >>>>>> struct crash_mem *cmem = arg; >>>>>> >>>>>> - cmem->ranges[cmem->nr_ranges].start = res->start; >>>>>> - cmem->ranges[cmem->nr_ranges].end = res->end; >>>>>> - cmem->nr_ranges++; >>>>>> + if (res->start >= SZ_1M) { >>>>>> + cmem->ranges[cmem->nr_ranges].start = res->start; >>>>>> + cmem->ranges[cmem->nr_ranges].end = res->end; >>>>>> + cmem->nr_ranges++; >>>>>> + } else if (res->end > SZ_1M) { >>>>>> + cmem->ranges[cmem->nr_ranges].start = SZ_1M; >>>>>> + cmem->ranges[cmem->nr_ranges].end = res->end; >>>>>> + cmem->nr_ranges++; >>>>>> + } >>>>> >>>>> What is going on with this chunk? I can guess but this needs a clear >>>>> comment. >>>> >>>> Indeed it needs some code comment, this is based on some offline >>>> discussion. cat /proc/vmcore will give a warning because ioremap is >>>> mapping the system ram. >>>> >>>> We pass the first 1M to kdump kernel in e820 as system ram so that 2nd >>>> kernel can use the low 1M memory because for example the trampoline >>>> code. >>>> >>>>> >>>>>> >>>>>> return 0; >>>>>> } >>>>> >>>>>> @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params) >>>>>> memset(&cmd, 0, sizeof(struct crash_memmap_data)); >>>>>> cmd.params = params; >>>>>> >>>>>> - /* Add first 640K segment */ >>>>>> - ei.addr = image->arch.backup_src_start; >>>>>> - ei.size = image->arch.backup_src_sz; >>>>>> + /* >>>>>> + * Add the low memory range[0x1000, SZ_1M], skip >>>>>> + * the first zero page. >>>>>> + */ >>>>>> + ei.addr = PAGE_SIZE; >>>>>> + ei.size = SZ_1M - PAGE_SIZE; >>>>>> ei.type = E820_TYPE_RAM; >>>>>> add_e820_entry(params, &ei); >>>>> >>>>> Likewise here. Why do we need a special case? >>>>> Why the magic with PAGE_SIZE? >>>> >>>> Good catch, the zero page part is useless, I think no other special >>>> reason, just assumed zero page is not usable, but it should be ok to >>>> remove the special handling, just pass 0 - 1M is good enough. >>> >>> But if we have stopped special casing the low 1M. Why do we need a >>> special case here at all? >>> >> Here, need to pass the low memory range to kdump kernel, which will guarantee >> the availability of low memory in kdump kernel, otherwise, kdump kernel won't >> use the low memory region. >> >>> If you need the special case it is almost certainly wrong to say you >>> have ram above 640KiB and below 1MiB. That is the legacy ROM and video >>> MMIO area. >>> >>> There is a reason the original code said 640KiB. >>> >> Do you mean that the 640k region is good enough here instead of 1MiB? > > Reading through the code of crash_setup_memap_entries I see that what > the code is doing now. The code is repeating the e820 memory map with > the memory areas that were not reserved for the crash kernel removed. > > In which case what the code needs to be doing something like: > > cmd.type = E820_TYPE_RAM; > flags = IORESOURCE_MEM; > walk_iomem_res_desc(IORES_DESC_RESERVED, flags, 0, 1024*1024, &cmd, > memmap_entry_callback); > > Depending on which bugs exist it might make sense to limit this to > the low 640KiB. But finding something the kernel already recognizes > as RAM should prevent most of those problems already. Barring bugs > I admit it doesn't make sense to repeat the work that someone else > has already done. > > This bit: > /* Add e820 reserved ranges */ > cmd.type = E820_TYPE_RESERVED; > flags = IORESOURCE_MEM; > walk_iomem_res_desc(IORES_DESC_RESERVED, flags, 0, -1, &cmd, > memmap_entry_callback); > > Should probably start at 1MiB instead of 0. Just so we don't report the > memory below 1MiB as unconditionally reserved. I don't properly > understand the IORES_DESC_RESERVED flag, and how that differs from > flags. So please test my suggestions to verify the code works as > expected. > Thanks for your comment, Eric. I will make a test based on your suggestions. But i need an SME machine, maybe i will reply later. Thanks. Lianbo > Eric > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region 2019-10-15 11:04 ` Eric W. Biederman 2019-10-16 8:40 ` lijiang @ 2019-10-16 14:30 ` lijiang 2019-10-16 16:51 ` Eric W. Biederman 1 sibling, 1 reply; 16+ messages in thread From: lijiang @ 2019-10-16 14:30 UTC (permalink / raw) To: Eric W. Biederman Cc: Dave Young, linux-kernel, tglx, mingo, bp, hpa, x86, bhe, jgross, dhowells, Thomas.Lendacky, vgoyal, kexec 在 2019年10月15日 19:04, Eric W. Biederman 写道: > lijiang <lijiang@redhat.com> writes: > >> 在 2019年10月13日 11:54, Eric W. Biederman 写道: >>> Dave Young <dyoung@redhat.com> writes: >>> >>>> Hi Eric, >>>> >>>> On 10/12/19 at 06:26am, Eric W. Biederman wrote: >>>>> Lianbo Jiang <lijiang@redhat.com> writes: >>>>> >>>>>> When the crashkernel kernel command line option is specified, the >>>>>> low 1MiB memory will always be reserved, which makes that the memory >>>>>> allocated later won't fall into the low 1MiB area, thereby, it's not >>>>>> necessary to create a backup region and also no need to copy the first >>>>>> 640k content to a backup region. >>>>>> >>>>>> Currently, the code related to the backup region can be safely removed, >>>>>> so lets clean up. >>>>>> >>>>>> Signed-off-by: Lianbo Jiang <lijiang@redhat.com> >>>>>> --- >>>>> >>>>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c >>>>>> index eb651fbde92a..cc5774fc84c0 100644 >>>>>> --- a/arch/x86/kernel/crash.c >>>>>> +++ b/arch/x86/kernel/crash.c >>>>>> @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs) >>>>>> >>>>>> #ifdef CONFIG_KEXEC_FILE >>>>>> >>>>>> -static unsigned long crash_zero_bytes; >>>>>> - >>>>>> static int get_nr_ram_ranges_callback(struct resource *res, void *arg) >>>>>> { >>>>>> unsigned int *nr_ranges = arg; >>>>>> @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg) >>>>>> { >>>>>> struct crash_mem *cmem = arg; >>>>>> >>>>>> - cmem->ranges[cmem->nr_ranges].start = res->start; >>>>>> - cmem->ranges[cmem->nr_ranges].end = res->end; >>>>>> - cmem->nr_ranges++; >>>>>> + if (res->start >= SZ_1M) { >>>>>> + cmem->ranges[cmem->nr_ranges].start = res->start; >>>>>> + cmem->ranges[cmem->nr_ranges].end = res->end; >>>>>> + cmem->nr_ranges++; >>>>>> + } else if (res->end > SZ_1M) { >>>>>> + cmem->ranges[cmem->nr_ranges].start = SZ_1M; >>>>>> + cmem->ranges[cmem->nr_ranges].end = res->end; >>>>>> + cmem->nr_ranges++; >>>>>> + } >>>>> >>>>> What is going on with this chunk? I can guess but this needs a clear >>>>> comment. >>>> >>>> Indeed it needs some code comment, this is based on some offline >>>> discussion. cat /proc/vmcore will give a warning because ioremap is >>>> mapping the system ram. >>>> >>>> We pass the first 1M to kdump kernel in e820 as system ram so that 2nd >>>> kernel can use the low 1M memory because for example the trampoline >>>> code. >>>> >>>>> >>>>>> >>>>>> return 0; >>>>>> } >>>>> >>>>>> @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params) >>>>>> memset(&cmd, 0, sizeof(struct crash_memmap_data)); >>>>>> cmd.params = params; >>>>>> >>>>>> - /* Add first 640K segment */ >>>>>> - ei.addr = image->arch.backup_src_start; >>>>>> - ei.size = image->arch.backup_src_sz; >>>>>> + /* >>>>>> + * Add the low memory range[0x1000, SZ_1M], skip >>>>>> + * the first zero page. >>>>>> + */ >>>>>> + ei.addr = PAGE_SIZE; >>>>>> + ei.size = SZ_1M - PAGE_SIZE; >>>>>> ei.type = E820_TYPE_RAM; >>>>>> add_e820_entry(params, &ei); >>>>> >>>>> Likewise here. Why do we need a special case? >>>>> Why the magic with PAGE_SIZE? >>>> >>>> Good catch, the zero page part is useless, I think no other special >>>> reason, just assumed zero page is not usable, but it should be ok to >>>> remove the special handling, just pass 0 - 1M is good enough. >>> >>> But if we have stopped special casing the low 1M. Why do we need a >>> special case here at all? >>> >> Here, need to pass the low memory range to kdump kernel, which will guarantee >> the availability of low memory in kdump kernel, otherwise, kdump kernel won't >> use the low memory region. >> >>> If you need the special case it is almost certainly wrong to say you >>> have ram above 640KiB and below 1MiB. That is the legacy ROM and video >>> MMIO area. >>> >>> There is a reason the original code said 640KiB. >>> >> Do you mean that the 640k region is good enough here instead of 1MiB? > > Reading through the code of crash_setup_memap_entries I see that what > the code is doing now. The code is repeating the e820 memory map with > the memory areas that were not reserved for the crash kernel removed. > > In which case what the code needs to be doing something like: > > cmd.type = E820_TYPE_RAM; > flags = IORESOURCE_MEM; > walk_iomem_res_desc(IORES_DESC_RESERVED, flags, 0, 1024*1024, &cmd, > memmap_entry_callback); > The above code does not get the results what we expected, it gets the reserved memory marked as 'IORES_DESC_RESERVED' in the low 1MiB range. Finally, kdump kernel happened the panic as follow: ...... [ 3.555662] Kernel panic - not syncing: Real mode trampoline was not allocated [ 3.556660] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.4.0-rc3+ #4 [ 3.556660] Hardware name: AMD Corporation Speedway/Speedway, BIOS RSW1009C 07/27/2018 [ 3.556660] Call Trace: [ 3.556660] dump_stack+0x46/0x60 [ 3.556660] panic+0xfb/0x2d7 [ 3.556660] ? hv_init_spinlocks+0x7f/0x7f [ 3.556660] init_real_mode+0x27/0x1fa [ 3.556660] ? hv_init_spinlocks+0x7f/0x7f [ 3.556660] ? do_one_initcall+0x46/0x1e4 [ 3.556660] ? proc_register+0xd0/0x130 [ 3.556660] ? kernel_init_freeable+0xe2/0x242 [ 3.556660] ? rest_init+0xaa/0xaa [ 3.556660] ? kernel_init+0xa/0x106 [ 3.556660] ? ret_from_fork+0x22/0x40 [ 3.556660] Rebooting in 10 seconds.. [ 3.556660] ACPI MEMORY or I/O RESET_REG. I modified the above code, and tested it. This can find out the system ram in the low 1MiB range. And it worked well. @@ -356,11 +338,11 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params) memset(&cmd, 0, sizeof(struct crash_memmap_data)); cmd.params = params; + /* Add the low 1MiB */ + cmd.type = E820_TYPE_RAM; + flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY; + walk_iomem_res_desc(IORES_DESC_NONE, flags, 0, 1024*1024 - 1, &cmd, + memmap_entry_callback); > Depending on which bugs exist it might make sense to limit this to > the low 640KiB. But finding something the kernel already recognizes > as RAM should prevent most of those problems already. Barring bugs > I admit it doesn't make sense to repeat the work that someone else > has already done. > > This bit: > /* Add e820 reserved ranges */ > cmd.type = E820_TYPE_RESERVED; > flags = IORESOURCE_MEM; > walk_iomem_res_desc(IORES_DESC_RESERVED, flags, 0, -1, &cmd, > memmap_entry_callback); > > Should probably start at 1MiB instead of 0. Just so we don't report the If so, it can not find out the reserved memory marked as 'IORES_DESC_RESERVED' in the low 1MiB range, finally, it doesn't pass the reserved memory in the low 1MiB to kdump kernel, which could cause some problems, such as SME or PCI MMCONFIG issue. > memory below 1MiB as unconditionally reserved. I don't properly > understand the IORES_DESC_RESERVED flag, and how that differs from I found three commits about 'IORES_DESC_RESERVED' flag, hope this helps. 1.ae9e13d621d6 ("x86/e820, ioport: Add a new I/O resource descriptor IORES_DESC_RESERVED") 2.5da04cc86d12 ("x86/mm: Rework ioremap resource mapping determination") 3.980621daf368 ("x86/crash: Add e820 reserved ranges to kdump kernel's e820 table") > flags. So please test my suggestions to verify the code works as > expected. > I have tested the two changes that you mentioned, please refer to the reply above. Thanks. Lianbo > Eric > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region 2019-10-16 14:30 ` lijiang @ 2019-10-16 16:51 ` Eric W. Biederman 0 siblings, 0 replies; 16+ messages in thread From: Eric W. Biederman @ 2019-10-16 16:51 UTC (permalink / raw) To: lijiang Cc: Dave Young, linux-kernel, tglx, mingo, bp, hpa, x86, bhe, jgross, dhowells, Thomas.Lendacky, vgoyal, kexec lijiang <lijiang@redhat.com> writes: > 在 2019年10月15日 19:04, Eric W. Biederman 写道: >> lijiang <lijiang@redhat.com> writes: >> >>> 在 2019年10月13日 11:54, Eric W. Biederman 写道: >>>> Dave Young <dyoung@redhat.com> writes: >>>> >>>>> Hi Eric, >>>>> >>>>> On 10/12/19 at 06:26am, Eric W. Biederman wrote: >>>>>> Lianbo Jiang <lijiang@redhat.com> writes: >>>>>> >>>>>>> When the crashkernel kernel command line option is specified, the >>>>>>> low 1MiB memory will always be reserved, which makes that the memory >>>>>>> allocated later won't fall into the low 1MiB area, thereby, it's not >>>>>>> necessary to create a backup region and also no need to copy the first >>>>>>> 640k content to a backup region. >>>>>>> >>>>>>> Currently, the code related to the backup region can be safely removed, >>>>>>> so lets clean up. >>>>>>> >>>>>>> Signed-off-by: Lianbo Jiang <lijiang@redhat.com> >>>>>>> --- >>>>>> >>>>>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c >>>>>>> index eb651fbde92a..cc5774fc84c0 100644 >>>>>>> --- a/arch/x86/kernel/crash.c >>>>>>> +++ b/arch/x86/kernel/crash.c >>>>>>> @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs) >>>>>>> >>>>>>> #ifdef CONFIG_KEXEC_FILE >>>>>>> >>>>>>> -static unsigned long crash_zero_bytes; >>>>>>> - >>>>>>> static int get_nr_ram_ranges_callback(struct resource *res, void *arg) >>>>>>> { >>>>>>> unsigned int *nr_ranges = arg; >>>>>>> @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg) >>>>>>> { >>>>>>> struct crash_mem *cmem = arg; >>>>>>> >>>>>>> - cmem->ranges[cmem->nr_ranges].start = res->start; >>>>>>> - cmem->ranges[cmem->nr_ranges].end = res->end; >>>>>>> - cmem->nr_ranges++; >>>>>>> + if (res->start >= SZ_1M) { >>>>>>> + cmem->ranges[cmem->nr_ranges].start = res->start; >>>>>>> + cmem->ranges[cmem->nr_ranges].end = res->end; >>>>>>> + cmem->nr_ranges++; >>>>>>> + } else if (res->end > SZ_1M) { >>>>>>> + cmem->ranges[cmem->nr_ranges].start = SZ_1M; >>>>>>> + cmem->ranges[cmem->nr_ranges].end = res->end; >>>>>>> + cmem->nr_ranges++; >>>>>>> + } >>>>>> >>>>>> What is going on with this chunk? I can guess but this needs a clear >>>>>> comment. >>>>> >>>>> Indeed it needs some code comment, this is based on some offline >>>>> discussion. cat /proc/vmcore will give a warning because ioremap is >>>>> mapping the system ram. >>>>> >>>>> We pass the first 1M to kdump kernel in e820 as system ram so that 2nd >>>>> kernel can use the low 1M memory because for example the trampoline >>>>> code. >>>>> >>>>>> >>>>>>> >>>>>>> return 0; >>>>>>> } >>>>>> >>>>>>> @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params) >>>>>>> memset(&cmd, 0, sizeof(struct crash_memmap_data)); >>>>>>> cmd.params = params; >>>>>>> >>>>>>> - /* Add first 640K segment */ >>>>>>> - ei.addr = image->arch.backup_src_start; >>>>>>> - ei.size = image->arch.backup_src_sz; >>>>>>> + /* >>>>>>> + * Add the low memory range[0x1000, SZ_1M], skip >>>>>>> + * the first zero page. >>>>>>> + */ >>>>>>> + ei.addr = PAGE_SIZE; >>>>>>> + ei.size = SZ_1M - PAGE_SIZE; >>>>>>> ei.type = E820_TYPE_RAM; >>>>>>> add_e820_entry(params, &ei); >>>>>> >>>>>> Likewise here. Why do we need a special case? >>>>>> Why the magic with PAGE_SIZE? >>>>> >>>>> Good catch, the zero page part is useless, I think no other special >>>>> reason, just assumed zero page is not usable, but it should be ok to >>>>> remove the special handling, just pass 0 - 1M is good enough. >>>> >>>> But if we have stopped special casing the low 1M. Why do we need a >>>> special case here at all? >>>> >>> Here, need to pass the low memory range to kdump kernel, which will guarantee >>> the availability of low memory in kdump kernel, otherwise, kdump kernel won't >>> use the low memory region. >>> >>>> If you need the special case it is almost certainly wrong to say you >>>> have ram above 640KiB and below 1MiB. That is the legacy ROM and video >>>> MMIO area. >>>> >>>> There is a reason the original code said 640KiB. >>>> >>> Do you mean that the 640k region is good enough here instead of 1MiB? >> >> Reading through the code of crash_setup_memap_entries I see that what >> the code is doing now. The code is repeating the e820 memory map with >> the memory areas that were not reserved for the crash kernel removed. >> >> In which case what the code needs to be doing something like: >> >> cmd.type = E820_TYPE_RAM; >> flags = IORESOURCE_MEM; >> walk_iomem_res_desc(IORES_DESC_RESERVED, flags, 0, 1024*1024, &cmd, >> memmap_entry_callback); >> > The above code does not get the results what we expected, it gets the reserved > memory marked as 'IORES_DESC_RESERVED' in the low 1MiB range. > > Finally, kdump kernel happened the panic as follow: > ...... > [ 3.555662] Kernel panic - not syncing: Real mode trampoline was not allocated > [ 3.556660] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.4.0-rc3+ #4 > [ 3.556660] Hardware name: AMD Corporation Speedway/Speedway, BIOS RSW1009C 07/27/2018 > [ 3.556660] Call Trace: > [ 3.556660] dump_stack+0x46/0x60 > [ 3.556660] panic+0xfb/0x2d7 > [ 3.556660] ? hv_init_spinlocks+0x7f/0x7f > [ 3.556660] init_real_mode+0x27/0x1fa > [ 3.556660] ? hv_init_spinlocks+0x7f/0x7f > [ 3.556660] ? do_one_initcall+0x46/0x1e4 > [ 3.556660] ? proc_register+0xd0/0x130 > [ 3.556660] ? kernel_init_freeable+0xe2/0x242 > [ 3.556660] ? rest_init+0xaa/0xaa > [ 3.556660] ? kernel_init+0xa/0x106 > [ 3.556660] ? ret_from_fork+0x22/0x40 > [ 3.556660] Rebooting in 10 seconds.. > [ 3.556660] ACPI MEMORY or I/O RESET_REG. > > I modified the above code, and tested it. This can find out the system ram in > the low 1MiB range. And it worked well. > > @@ -356,11 +338,11 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params) > memset(&cmd, 0, sizeof(struct crash_memmap_data)); > cmd.params = params; > > + /* Add the low 1MiB */ > + cmd.type = E820_TYPE_RAM; > + flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY; > + walk_iomem_res_desc(IORES_DESC_NONE, flags, 0, 1024*1024 - 1, &cmd, > + memmap_entry_callback); > That looks like a very reasonable fix. >> Depending on which bugs exist it might make sense to limit this to >> the low 640KiB. But finding something the kernel already recognizes >> as RAM should prevent most of those problems already. Barring bugs >> I admit it doesn't make sense to repeat the work that someone else >> has already done. >> >> This bit: >> /* Add e820 reserved ranges */ >> cmd.type = E820_TYPE_RESERVED; >> flags = IORESOURCE_MEM; >> walk_iomem_res_desc(IORES_DESC_RESERVED, flags, 0, -1, &cmd, >> memmap_entry_callback); >> >> Should probably start at 1MiB instead of 0. Just so we don't report the > If so, it can not find out the reserved memory marked as 'IORES_DESC_RESERVED' in > the low 1MiB range, finally, it doesn't pass the reserved memory in the low 1MiB to > kdump kernel, which could cause some problems, such as SME or PCI MMCONFIG issue. Good point. For some reason I was thinking IORESOURCE_MEM and IORESOURCE_SYSTEM_RAM were the same thing. It has been way to long since I have been in that part of the code. So yes let's leave that part alone. >> memory below 1MiB as unconditionally reserved. I don't properly >> understand the IORES_DESC_RESERVED flag, and how that differs from > I found three commits about 'IORES_DESC_RESERVED' flag, hope this helps. > 1.ae9e13d621d6 ("x86/e820, ioport: Add a new I/O resource descriptor IORES_DESC_RESERVED") > 2.5da04cc86d12 ("x86/mm: Rework ioremap resource mapping determination") > 3.980621daf368 ("x86/crash: Add e820 reserved ranges to kdump kernel's e820 table") > >> flags. So please test my suggestions to verify the code works as >> expected. >> > I have tested the two changes that you mentioned, please refer to the > reply above. Thank you. It looks like you have figured out how these things should work. Eric ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region 2019-10-13 3:54 ` Eric W. Biederman 2019-10-13 9:36 ` lijiang @ 2019-10-13 10:20 ` Dave Young 1 sibling, 0 replies; 16+ messages in thread From: Dave Young @ 2019-10-13 10:20 UTC (permalink / raw) To: Eric W. Biederman Cc: Lianbo Jiang, linux-kernel, tglx, mingo, bp, hpa, x86, bhe, jgross, dhowells, Thomas.Lendacky, vgoyal, kexec On 10/12/19 at 10:54pm, Eric W. Biederman wrote: > Dave Young <dyoung@redhat.com> writes: > > > Hi Eric, > > > > On 10/12/19 at 06:26am, Eric W. Biederman wrote: > >> Lianbo Jiang <lijiang@redhat.com> writes: > >> > >> > When the crashkernel kernel command line option is specified, the > >> > low 1MiB memory will always be reserved, which makes that the memory > >> > allocated later won't fall into the low 1MiB area, thereby, it's not > >> > necessary to create a backup region and also no need to copy the first > >> > 640k content to a backup region. > >> > > >> > Currently, the code related to the backup region can be safely removed, > >> > so lets clean up. > >> > > >> > Signed-off-by: Lianbo Jiang <lijiang@redhat.com> > >> > --- > >> > >> > diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c > >> > index eb651fbde92a..cc5774fc84c0 100644 > >> > --- a/arch/x86/kernel/crash.c > >> > +++ b/arch/x86/kernel/crash.c > >> > @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs) > >> > > >> > #ifdef CONFIG_KEXEC_FILE > >> > > >> > -static unsigned long crash_zero_bytes; > >> > - > >> > static int get_nr_ram_ranges_callback(struct resource *res, void *arg) > >> > { > >> > unsigned int *nr_ranges = arg; > >> > @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg) > >> > { > >> > struct crash_mem *cmem = arg; > >> > > >> > - cmem->ranges[cmem->nr_ranges].start = res->start; > >> > - cmem->ranges[cmem->nr_ranges].end = res->end; > >> > - cmem->nr_ranges++; > >> > + if (res->start >= SZ_1M) { > >> > + cmem->ranges[cmem->nr_ranges].start = res->start; > >> > + cmem->ranges[cmem->nr_ranges].end = res->end; > >> > + cmem->nr_ranges++; > >> > + } else if (res->end > SZ_1M) { > >> > + cmem->ranges[cmem->nr_ranges].start = SZ_1M; > >> > + cmem->ranges[cmem->nr_ranges].end = res->end; > >> > + cmem->nr_ranges++; > >> > + } > >> > >> What is going on with this chunk? I can guess but this needs a clear > >> comment. > > > > Indeed it needs some code comment, this is based on some offline > > discussion. cat /proc/vmcore will give a warning because ioremap is > > mapping the system ram. > > > > We pass the first 1M to kdump kernel in e820 as system ram so that 2nd > > kernel can use the low 1M memory because for example the trampoline > > code. > > > >> > >> > > >> > return 0; > >> > } > >> > >> > @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params) > >> > memset(&cmd, 0, sizeof(struct crash_memmap_data)); > >> > cmd.params = params; > >> > > >> > - /* Add first 640K segment */ > >> > - ei.addr = image->arch.backup_src_start; > >> > - ei.size = image->arch.backup_src_sz; > >> > + /* > >> > + * Add the low memory range[0x1000, SZ_1M], skip > >> > + * the first zero page. > >> > + */ > >> > + ei.addr = PAGE_SIZE; > >> > + ei.size = SZ_1M - PAGE_SIZE; > >> > ei.type = E820_TYPE_RAM; > >> > add_e820_entry(params, &ei); > >> > >> Likewise here. Why do we need a special case? > >> Why the magic with PAGE_SIZE? > > > > Good catch, the zero page part is useless, I think no other special > > reason, just assumed zero page is not usable, but it should be ok to > > remove the special handling, just pass 0 - 1M is good enough. > > But if we have stopped special casing the low 1M. Why do we need a > special case here at all? Seems both Lianbo and I do not understand the query. Let me try to explain it more. The 2nd kernel still need use low memory for the trampoline use. So we have to let 2nd kernel access the low memory as system ram. The original special case is far more than above we are doing, it does several things: 1. backup the low 640K into kdump reserved high memory 2. set the low 640K as System RAM in kdump kernel as we do in this patch. 3. in /proc/vmcore elf header, map the low 640K to the backup region so that /proc/vmcore can give right old memory for that region. After the change we are doing in this series, we dropped the 1 and 3 but 2 is still needed because kdump kernel still need use low memory. But we do not care the vmcore part because nobody use the memory in old kernel, we already reserve it, and excluded the range in vmcore. I think another thing you mentioned about some reserved memory under 1M, even if we set 0-1M as System RAM, we still keep all the reserved regions in /proc/iomem as identical between 1st and 2nd kernels, so it just works, see below about cat /proc/iomem in kdump kernel (dropped into a shell before copying the vmcore out): kdump:/# cat /proc/iomem|less 00000000-00000fff : Reserved 00001000-0009ffff : System RAM 000a0000-000bffff : PCI Bus 0000:00 000f0000-000fffff : System ROM 30000000-3000006f : System RAM 30000070-39f5cfff : System RAM 38600000-39001070 : Kernel code 39001071-396ce5ff : Kernel data 39acd000-39bfffff : Kernel bss You can see 000a0000-000bffff : PCI Bus 0000:00 is same across the kdump reboot. But maybe if it is not elegant enough with simply using 0 - 1M, maybe use 0 - 640K as Lianbo said in another reply? ------------- BTW, we also discussed about compatibility issues, for kexec_file it just works because our change is in kernel. For kexec-tools part, we can just leave the userspace code as is, that means if one wants the SME case be fixed he needs an kernel update to reserve the low memory. We can not drop the kexec-tools special case about 640K because if we drop it and people use old kernels which does not reserve low 1M then kdump can not work at all. Thanks Dave ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region 2019-10-12 12:16 ` Dave Young 2019-10-13 3:54 ` Eric W. Biederman @ 2019-10-14 10:02 ` lijiang 2019-10-15 11:11 ` Eric W. Biederman 1 sibling, 1 reply; 16+ messages in thread From: lijiang @ 2019-10-14 10:02 UTC (permalink / raw) To: Dave Young, Eric W. Biederman Cc: linux-kernel, tglx, mingo, bp, hpa, x86, bhe, jgross, dhowells, Thomas.Lendacky, vgoyal, kexec 在 2019年10月12日 20:16, Dave Young 写道: > Hi Eric, > > On 10/12/19 at 06:26am, Eric W. Biederman wrote: >> Lianbo Jiang <lijiang@redhat.com> writes: >> >>> When the crashkernel kernel command line option is specified, the >>> low 1MiB memory will always be reserved, which makes that the memory >>> allocated later won't fall into the low 1MiB area, thereby, it's not >>> necessary to create a backup region and also no need to copy the first >>> 640k content to a backup region. >>> >>> Currently, the code related to the backup region can be safely removed, >>> so lets clean up. >>> >>> Signed-off-by: Lianbo Jiang <lijiang@redhat.com> >>> --- >> >>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c >>> index eb651fbde92a..cc5774fc84c0 100644 >>> --- a/arch/x86/kernel/crash.c >>> +++ b/arch/x86/kernel/crash.c >>> @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs) >>> >>> #ifdef CONFIG_KEXEC_FILE >>> >>> -static unsigned long crash_zero_bytes; >>> - >>> static int get_nr_ram_ranges_callback(struct resource *res, void *arg) >>> { >>> unsigned int *nr_ranges = arg; >>> @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg) >>> { >>> struct crash_mem *cmem = arg; >>> >>> - cmem->ranges[cmem->nr_ranges].start = res->start; >>> - cmem->ranges[cmem->nr_ranges].end = res->end; >>> - cmem->nr_ranges++; >>> + if (res->start >= SZ_1M) { >>> + cmem->ranges[cmem->nr_ranges].start = res->start; >>> + cmem->ranges[cmem->nr_ranges].end = res->end; >>> + cmem->nr_ranges++; >>> + } else if (res->end > SZ_1M) { >>> + cmem->ranges[cmem->nr_ranges].start = SZ_1M; >>> + cmem->ranges[cmem->nr_ranges].end = res->end; >>> + cmem->nr_ranges++; >>> + } >> >> What is going on with this chunk? I can guess but this needs a clear >> comment. > > Indeed it needs some code comment, this is based on some offline > discussion. cat /proc/vmcore will give a warning because ioremap is > mapping the system ram. > > We pass the first 1M to kdump kernel in e820 as system ram so that 2nd > kernel can use the low 1M memory because for example the trampoline > code. > Thank you, Eric and Dave. I will add the code comment as below if it would be OK. @@ -234,9 +232,20 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg) { struct crash_mem *cmem = arg; - cmem->ranges[cmem->nr_ranges].start = res->start; - cmem->ranges[cmem->nr_ranges].end = res->end; - cmem->nr_ranges++; + /* + * Currently, pass the low 1MiB range to kdump kernel in e820 + * as system ram so that kdump kernel can also use the low 1MiB + * memory due to the real mode trampoline code. + * And later, the low 1MiB range will be exclued from elf header, + * which will avoid remapping the 1MiB system ram when dumping + * vmcore. + */ + if (res->start >= SZ_1M) { + cmem->ranges[cmem->nr_ranges].start = res->start; + cmem->ranges[cmem->nr_ranges].end = res->end; + cmem->nr_ranges++; + } else if (res->end > SZ_1M) { + cmem->ranges[cmem->nr_ranges].start = SZ_1M; + cmem->ranges[cmem->nr_ranges].end = res->end; + cmem->nr_ranges++; + } return 0; } >> >>> >>> return 0; >>> } >> >>> @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params) >>> memset(&cmd, 0, sizeof(struct crash_memmap_data)); >>> cmd.params = params; >>> >>> - /* Add first 640K segment */ >>> - ei.addr = image->arch.backup_src_start; >>> - ei.size = image->arch.backup_src_sz; >>> + /* >>> + * Add the low memory range[0x1000, SZ_1M], skip >>> + * the first zero page. >>> + */ >>> + ei.addr = PAGE_SIZE; >>> + ei.size = SZ_1M - PAGE_SIZE; >>> ei.type = E820_TYPE_RAM; >>> add_e820_entry(params, &ei); >> >> Likewise here. Why do we need a special case? >> Why the magic with PAGE_SIZE? > > Good catch, the zero page part is useless, I think no other special > reason, just assumed zero page is not usable, but it should be ok to > remove the special handling, just pass 0 - 1M is good enough. >> Thanks > Dave > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region 2019-10-14 10:02 ` lijiang @ 2019-10-15 11:11 ` Eric W. Biederman 2019-10-16 3:24 ` lijiang 0 siblings, 1 reply; 16+ messages in thread From: Eric W. Biederman @ 2019-10-15 11:11 UTC (permalink / raw) To: lijiang Cc: Dave Young, linux-kernel, tglx, mingo, bp, hpa, x86, bhe, jgross, dhowells, Thomas.Lendacky, vgoyal, kexec lijiang <lijiang@redhat.com> writes: > 在 2019年10月12日 20:16, Dave Young 写道: >> Hi Eric, >> >> On 10/12/19 at 06:26am, Eric W. Biederman wrote: >>> Lianbo Jiang <lijiang@redhat.com> writes: >>> >>>> When the crashkernel kernel command line option is specified, the >>>> low 1MiB memory will always be reserved, which makes that the memory >>>> allocated later won't fall into the low 1MiB area, thereby, it's not >>>> necessary to create a backup region and also no need to copy the first >>>> 640k content to a backup region. >>>> >>>> Currently, the code related to the backup region can be safely removed, >>>> so lets clean up. >>>> >>>> Signed-off-by: Lianbo Jiang <lijiang@redhat.com> >>>> --- >>> >>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c >>>> index eb651fbde92a..cc5774fc84c0 100644 >>>> --- a/arch/x86/kernel/crash.c >>>> +++ b/arch/x86/kernel/crash.c >>>> @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs) >>>> >>>> #ifdef CONFIG_KEXEC_FILE >>>> >>>> -static unsigned long crash_zero_bytes; >>>> - >>>> static int get_nr_ram_ranges_callback(struct resource *res, void *arg) >>>> { >>>> unsigned int *nr_ranges = arg; >>>> @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg) >>>> { >>>> struct crash_mem *cmem = arg; >>>> >>>> - cmem->ranges[cmem->nr_ranges].start = res->start; >>>> - cmem->ranges[cmem->nr_ranges].end = res->end; >>>> - cmem->nr_ranges++; >>>> + if (res->start >= SZ_1M) { >>>> + cmem->ranges[cmem->nr_ranges].start = res->start; >>>> + cmem->ranges[cmem->nr_ranges].end = res->end; >>>> + cmem->nr_ranges++; >>>> + } else if (res->end > SZ_1M) { >>>> + cmem->ranges[cmem->nr_ranges].start = SZ_1M; >>>> + cmem->ranges[cmem->nr_ranges].end = res->end; >>>> + cmem->nr_ranges++; >>>> + } >>> >>> What is going on with this chunk? I can guess but this needs a clear >>> comment. >> >> Indeed it needs some code comment, this is based on some offline >> discussion. cat /proc/vmcore will give a warning because ioremap is >> mapping the system ram. >> >> We pass the first 1M to kdump kernel in e820 as system ram so that 2nd >> kernel can use the low 1M memory because for example the trampoline >> code. >> > Thank you, Eric and Dave. I will add the code comment as below if it would be OK. > > @@ -234,9 +232,20 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg) > { > struct crash_mem *cmem = arg; > > - cmem->ranges[cmem->nr_ranges].start = res->start; > - cmem->ranges[cmem->nr_ranges].end = res->end; > - cmem->nr_ranges++; > + /* > + * Currently, pass the low 1MiB range to kdump kernel in e820 > + * as system ram so that kdump kernel can also use the low 1MiB > + * memory due to the real mode trampoline code. > + * And later, the low 1MiB range will be exclued from elf header, > + * which will avoid remapping the 1MiB system ram when dumping > + * vmcore. > + */ > + if (res->start >= SZ_1M) { > + cmem->ranges[cmem->nr_ranges].start = res->start; > + cmem->ranges[cmem->nr_ranges].end = res->end; > + cmem->nr_ranges++; > + } else if (res->end > SZ_1M) { > + cmem->ranges[cmem->nr_ranges].start = SZ_1M; > + cmem->ranges[cmem->nr_ranges].end = res->end; > + cmem->nr_ranges++; > + } > > return 0; > } I just read through the appropriate section of crash.c and the way things are structured doing this work in prepare_elf64_ram_headers_callback is wrong. This can be done in a simpler manner in elf_header_exclude_ranges. Something like: /* The low 1MiB is always reserved */ ret = crash_exclude_mem_range(cmem, 0, 1024*1024); if (ret) return ret; Eric ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region 2019-10-15 11:11 ` Eric W. Biederman @ 2019-10-16 3:24 ` lijiang 0 siblings, 0 replies; 16+ messages in thread From: lijiang @ 2019-10-16 3:24 UTC (permalink / raw) To: Eric W. Biederman Cc: Dave Young, linux-kernel, tglx, mingo, bp, hpa, x86, bhe, jgross, dhowells, Thomas.Lendacky, vgoyal, kexec 在 2019年10月15日 19:11, Eric W. Biederman 写道: > lijiang <lijiang@redhat.com> writes: > >> 在 2019年10月12日 20:16, Dave Young 写道: >>> Hi Eric, >>> >>> On 10/12/19 at 06:26am, Eric W. Biederman wrote: >>>> Lianbo Jiang <lijiang@redhat.com> writes: >>>> >>>>> When the crashkernel kernel command line option is specified, the >>>>> low 1MiB memory will always be reserved, which makes that the memory >>>>> allocated later won't fall into the low 1MiB area, thereby, it's not >>>>> necessary to create a backup region and also no need to copy the first >>>>> 640k content to a backup region. >>>>> >>>>> Currently, the code related to the backup region can be safely removed, >>>>> so lets clean up. >>>>> >>>>> Signed-off-by: Lianbo Jiang <lijiang@redhat.com> >>>>> --- >>>> >>>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c >>>>> index eb651fbde92a..cc5774fc84c0 100644 >>>>> --- a/arch/x86/kernel/crash.c >>>>> +++ b/arch/x86/kernel/crash.c >>>>> @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs) >>>>> >>>>> #ifdef CONFIG_KEXEC_FILE >>>>> >>>>> -static unsigned long crash_zero_bytes; >>>>> - >>>>> static int get_nr_ram_ranges_callback(struct resource *res, void *arg) >>>>> { >>>>> unsigned int *nr_ranges = arg; >>>>> @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg) >>>>> { >>>>> struct crash_mem *cmem = arg; >>>>> >>>>> - cmem->ranges[cmem->nr_ranges].start = res->start; >>>>> - cmem->ranges[cmem->nr_ranges].end = res->end; >>>>> - cmem->nr_ranges++; >>>>> + if (res->start >= SZ_1M) { >>>>> + cmem->ranges[cmem->nr_ranges].start = res->start; >>>>> + cmem->ranges[cmem->nr_ranges].end = res->end; >>>>> + cmem->nr_ranges++; >>>>> + } else if (res->end > SZ_1M) { >>>>> + cmem->ranges[cmem->nr_ranges].start = SZ_1M; >>>>> + cmem->ranges[cmem->nr_ranges].end = res->end; >>>>> + cmem->nr_ranges++; >>>>> + } >>>> >>>> What is going on with this chunk? I can guess but this needs a clear >>>> comment. >>> >>> Indeed it needs some code comment, this is based on some offline >>> discussion. cat /proc/vmcore will give a warning because ioremap is >>> mapping the system ram. >>> >>> We pass the first 1M to kdump kernel in e820 as system ram so that 2nd >>> kernel can use the low 1M memory because for example the trampoline >>> code. >>> >> Thank you, Eric and Dave. I will add the code comment as below if it would be OK. >> >> @@ -234,9 +232,20 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg) >> { >> struct crash_mem *cmem = arg; >> >> - cmem->ranges[cmem->nr_ranges].start = res->start; >> - cmem->ranges[cmem->nr_ranges].end = res->end; >> - cmem->nr_ranges++; >> + /* >> + * Currently, pass the low 1MiB range to kdump kernel in e820 >> + * as system ram so that kdump kernel can also use the low 1MiB >> + * memory due to the real mode trampoline code. >> + * And later, the low 1MiB range will be exclued from elf header, >> + * which will avoid remapping the 1MiB system ram when dumping >> + * vmcore. >> + */ >> + if (res->start >= SZ_1M) { >> + cmem->ranges[cmem->nr_ranges].start = res->start; >> + cmem->ranges[cmem->nr_ranges].end = res->end; >> + cmem->nr_ranges++; >> + } else if (res->end > SZ_1M) { >> + cmem->ranges[cmem->nr_ranges].start = SZ_1M; >> + cmem->ranges[cmem->nr_ranges].end = res->end; >> + cmem->nr_ranges++; >> + } >> >> return 0; >> } > > I just read through the appropriate section of crash.c and the way > things are structured doing this work in > prepare_elf64_ram_headers_callback is wrong. > > This can be done in a simpler manner in elf_header_exclude_ranges. > Something like: > Thank you, Eric. It seems that here is a more reasonable place, i will make a test about it and improve it in next post. Lianbo > /* The low 1MiB is always reserved */ > ret = crash_exclude_mem_range(cmem, 0, 1024*1024); > if (ret) > return ret; > > Eric > ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2019-10-16 16:52 UTC | newest] Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-10-12 2:21 [PATCH 0/3 v3] x86/kdump: Fix 'kmem -s' reported an invalid freepointer when SME was active Lianbo Jiang 2019-10-12 2:21 ` [PATCH 1/3 " Lianbo Jiang 2019-10-12 2:21 ` [PATCH 2/3 v3] x86/kdump cleanup: remove the unused crash_copy_backup_region() Lianbo Jiang 2019-10-12 2:21 ` [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region Lianbo Jiang 2019-10-12 11:26 ` Eric W. Biederman 2019-10-12 12:16 ` Dave Young 2019-10-13 3:54 ` Eric W. Biederman 2019-10-13 9:36 ` lijiang 2019-10-15 11:04 ` Eric W. Biederman 2019-10-16 8:40 ` lijiang 2019-10-16 14:30 ` lijiang 2019-10-16 16:51 ` Eric W. Biederman 2019-10-13 10:20 ` Dave Young 2019-10-14 10:02 ` lijiang 2019-10-15 11:11 ` Eric W. Biederman 2019-10-16 3:24 ` lijiang
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).