All of lore.kernel.org
 help / color / mirror / Atom feed
From: lijiang <lijiang@redhat.com>
To: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Young <dyoung@redhat.com>,
	linux-kernel@vger.kernel.org, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, hpa@zytor.com, x86@kernel.org,
	bhe@redhat.com, jgross@suse.com, dhowells@redhat.com,
	Thomas.Lendacky@amd.com, vgoyal@redhat.com,
	kexec@lists.infradead.org
Subject: Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region
Date: Wed, 16 Oct 2019 22:30:15 +0800	[thread overview]
Message-ID: <93e6713a-8027-3a7a-4445-9bee56b19f62@redhat.com> (raw)
In-Reply-To: <8736fu1d8k.fsf@x220.int.ebiederm.org>

在 2019年10月15日 19:04, Eric W. Biederman 写道:
> lijiang <lijiang@redhat.com> writes:
> 
>> 在 2019年10月13日 11:54, Eric W. Biederman 写道:
>>> Dave Young <dyoung@redhat.com> writes:
>>>
>>>> Hi Eric,
>>>>
>>>> On 10/12/19 at 06:26am, Eric W. Biederman wrote:
>>>>> Lianbo Jiang <lijiang@redhat.com> writes:
>>>>>
>>>>>> When the crashkernel kernel command line option is specified, the
>>>>>> low 1MiB memory will always be reserved, which makes that the memory
>>>>>> allocated later won't fall into the low 1MiB area, thereby, it's not
>>>>>> necessary to create a backup region and also no need to copy the first
>>>>>> 640k content to a backup region.
>>>>>>
>>>>>> Currently, the code related to the backup region can be safely removed,
>>>>>> so lets clean up.
>>>>>>
>>>>>> Signed-off-by: Lianbo Jiang <lijiang@redhat.com>
>>>>>> ---
>>>>>
>>>>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>>>>>> index eb651fbde92a..cc5774fc84c0 100644
>>>>>> --- a/arch/x86/kernel/crash.c
>>>>>> +++ b/arch/x86/kernel/crash.c
>>>>>> @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>>>>>>  
>>>>>>  #ifdef CONFIG_KEXEC_FILE
>>>>>>  
>>>>>> -static unsigned long crash_zero_bytes;
>>>>>> -
>>>>>>  static int get_nr_ram_ranges_callback(struct resource *res, void *arg)
>>>>>>  {
>>>>>>  	unsigned int *nr_ranges = arg;
>>>>>> @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
>>>>>>  {
>>>>>>  	struct crash_mem *cmem = arg;
>>>>>>  
>>>>>> -	cmem->ranges[cmem->nr_ranges].start = res->start;
>>>>>> -	cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>>> -	cmem->nr_ranges++;
>>>>>> +	if (res->start >= SZ_1M) {
>>>>>> +		cmem->ranges[cmem->nr_ranges].start = res->start;
>>>>>> +		cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>>> +		cmem->nr_ranges++;
>>>>>> +	} else if (res->end > SZ_1M) {
>>>>>> +		cmem->ranges[cmem->nr_ranges].start = SZ_1M;
>>>>>> +		cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>>> +		cmem->nr_ranges++;
>>>>>> +	}
>>>>>
>>>>> What is going on with this chunk?  I can guess but this needs a clear
>>>>> comment.
>>>>
>>>> Indeed it needs some code comment, this is based on some offline
>>>> discussion.  cat /proc/vmcore will give a warning because ioremap is
>>>> mapping the system ram.
>>>>
>>>> We pass the first 1M to kdump kernel in e820 as system ram so that 2nd
>>>> kernel can use the low 1M memory because for example the trampoline
>>>> code.
>>>>
>>>>>
>>>>>>  
>>>>>>  	return 0;
>>>>>>  }
>>>>>
>>>>>> @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
>>>>>>  	memset(&cmd, 0, sizeof(struct crash_memmap_data));
>>>>>>  	cmd.params = params;
>>>>>>  
>>>>>> -	/* Add first 640K segment */
>>>>>> -	ei.addr = image->arch.backup_src_start;
>>>>>> -	ei.size = image->arch.backup_src_sz;
>>>>>> +	/*
>>>>>> +	 * Add the low memory range[0x1000, SZ_1M], skip
>>>>>> +	 * the first zero page.
>>>>>> +	 */
>>>>>> +	ei.addr = PAGE_SIZE;
>>>>>> +	ei.size = SZ_1M - PAGE_SIZE;
>>>>>>  	ei.type = E820_TYPE_RAM;
>>>>>>  	add_e820_entry(params, &ei);
>>>>>
>>>>> Likewise here.  Why do we need a special case?
>>>>> Why the magic with PAGE_SIZE?
>>>>
>>>> Good catch, the zero page part is useless, I think no other special
>>>> reason, just assumed zero page is not usable, but it should be ok to
>>>> remove the special handling, just pass 0 - 1M is good enough.
>>>
>>> But if we have stopped special casing the low 1M.  Why do we need a
>>> special case here at all?
>>>
>> Here, need to pass the low memory range to kdump kernel, which will guarantee
>> the availability of low memory in kdump kernel, otherwise, kdump kernel won't
>> use the low memory region.
>>
>>> If you need the special case it is almost certainly wrong to say you
>>> have ram above 640KiB and below 1MiB.  That is the legacy ROM and video
>>> MMIO area.
>>>
>>> There is a reason the original code said 640KiB.
>>>
>> Do you mean that the 640k region is good enough here instead of 1MiB?
> 
> Reading through the code of crash_setup_memap_entries I see that what
> the code is doing now.  The code is repeating the e820 memory map with
> the memory areas that were not reserved for the crash kernel removed.
> 
> In which case what the code needs to be doing something like:
> 
> 	cmd.type = E820_TYPE_RAM;
> 	flags = IORESOURCE_MEM;
> 	walk_iomem_res_desc(IORES_DESC_RESERVED, flags, 0, 1024*1024, &cmd,
> 			       memmap_entry_callback);
> 
The above code does not get the results what we expected, it gets the reserved
memory marked as 'IORES_DESC_RESERVED' in the low 1MiB range.

Finally, kdump kernel happened the panic as follow:
......
[    3.555662] Kernel panic - not syncing: Real mode trampoline was not allocated
[    3.556660] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.4.0-rc3+ #4
[    3.556660] Hardware name: AMD Corporation Speedway/Speedway, BIOS RSW1009C 07/27/2018
[    3.556660] Call Trace:
[    3.556660]  dump_stack+0x46/0x60
[    3.556660]  panic+0xfb/0x2d7
[    3.556660]  ? hv_init_spinlocks+0x7f/0x7f
[    3.556660]  init_real_mode+0x27/0x1fa
[    3.556660]  ? hv_init_spinlocks+0x7f/0x7f
[    3.556660]  ? do_one_initcall+0x46/0x1e4
[    3.556660]  ? proc_register+0xd0/0x130
[    3.556660]  ? kernel_init_freeable+0xe2/0x242
[    3.556660]  ? rest_init+0xaa/0xaa
[    3.556660]  ? kernel_init+0xa/0x106
[    3.556660]  ? ret_from_fork+0x22/0x40
[    3.556660] Rebooting in 10 seconds..
[    3.556660] ACPI MEMORY or I/O RESET_REG.

I modified the above code, and tested it. This can find out the system ram in
the low 1MiB range. And it worked well.

@@ -356,11 +338,11 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
        memset(&cmd, 0, sizeof(struct crash_memmap_data));
        cmd.params = params;
 
+       /* Add the low 1MiB */
+       cmd.type = E820_TYPE_RAM;
+       flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
+       walk_iomem_res_desc(IORES_DESC_NONE, flags, 0, 1024*1024 - 1, &cmd,
+                       memmap_entry_callback);

> Depending on which bugs exist it might make sense to limit this to
> the low 640KiB.  But finding something the kernel already recognizes
> as RAM should prevent most of those problems already.  Barring bugs
> I admit it doesn't make sense to repeat the work that someone else
> has already done.
> 
> This bit:
> 	/* Add e820 reserved ranges */
> 	cmd.type = E820_TYPE_RESERVED;
> 	flags = IORESOURCE_MEM;
> 	walk_iomem_res_desc(IORES_DESC_RESERVED, flags, 0, -1, &cmd,
> 			   memmap_entry_callback);
> 
> Should probably start at 1MiB instead of 0.  Just so we don't report the
If so, it can not find out the reserved memory marked as 'IORES_DESC_RESERVED' in
the low 1MiB range, finally, it doesn't pass the reserved memory in the low 1MiB to
kdump kernel, which could cause some problems, such as SME or PCI MMCONFIG issue.

> memory below 1MiB as unconditionally reserved.   I don't properly
> understand the IORES_DESC_RESERVED flag, and how that differs from
I found three commits about 'IORES_DESC_RESERVED' flag, hope this helps.
1.ae9e13d621d6 ("x86/e820, ioport: Add a new I/O resource descriptor IORES_DESC_RESERVED")
2.5da04cc86d12 ("x86/mm: Rework ioremap resource mapping determination")
3.980621daf368 ("x86/crash: Add e820 reserved ranges to kdump kernel's e820 table")

> flags.  So please test my suggestions to verify the code works as
> expected.
> 
I have tested the two changes that you mentioned, please refer to the reply above.

Thanks.
Lianbo

> Eric
> 

WARNING: multiple messages have this Message-ID (diff)
From: lijiang <lijiang@redhat.com>
To: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: jgross@suse.com, Thomas.Lendacky@amd.com, bhe@redhat.com,
	x86@kernel.org, kexec@lists.infradead.org,
	linux-kernel@vger.kernel.org, dhowells@redhat.com,
	mingo@redhat.com, bp@alien8.de, hpa@zytor.com,
	tglx@linutronix.de, Dave Young <dyoung@redhat.com>,
	vgoyal@redhat.com
Subject: Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region
Date: Wed, 16 Oct 2019 22:30:15 +0800	[thread overview]
Message-ID: <93e6713a-8027-3a7a-4445-9bee56b19f62@redhat.com> (raw)
In-Reply-To: <8736fu1d8k.fsf@x220.int.ebiederm.org>

在 2019年10月15日 19:04, Eric W. Biederman 写道:
> lijiang <lijiang@redhat.com> writes:
> 
>> 在 2019年10月13日 11:54, Eric W. Biederman 写道:
>>> Dave Young <dyoung@redhat.com> writes:
>>>
>>>> Hi Eric,
>>>>
>>>> On 10/12/19 at 06:26am, Eric W. Biederman wrote:
>>>>> Lianbo Jiang <lijiang@redhat.com> writes:
>>>>>
>>>>>> When the crashkernel kernel command line option is specified, the
>>>>>> low 1MiB memory will always be reserved, which makes that the memory
>>>>>> allocated later won't fall into the low 1MiB area, thereby, it's not
>>>>>> necessary to create a backup region and also no need to copy the first
>>>>>> 640k content to a backup region.
>>>>>>
>>>>>> Currently, the code related to the backup region can be safely removed,
>>>>>> so lets clean up.
>>>>>>
>>>>>> Signed-off-by: Lianbo Jiang <lijiang@redhat.com>
>>>>>> ---
>>>>>
>>>>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>>>>>> index eb651fbde92a..cc5774fc84c0 100644
>>>>>> --- a/arch/x86/kernel/crash.c
>>>>>> +++ b/arch/x86/kernel/crash.c
>>>>>> @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>>>>>>  
>>>>>>  #ifdef CONFIG_KEXEC_FILE
>>>>>>  
>>>>>> -static unsigned long crash_zero_bytes;
>>>>>> -
>>>>>>  static int get_nr_ram_ranges_callback(struct resource *res, void *arg)
>>>>>>  {
>>>>>>  	unsigned int *nr_ranges = arg;
>>>>>> @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
>>>>>>  {
>>>>>>  	struct crash_mem *cmem = arg;
>>>>>>  
>>>>>> -	cmem->ranges[cmem->nr_ranges].start = res->start;
>>>>>> -	cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>>> -	cmem->nr_ranges++;
>>>>>> +	if (res->start >= SZ_1M) {
>>>>>> +		cmem->ranges[cmem->nr_ranges].start = res->start;
>>>>>> +		cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>>> +		cmem->nr_ranges++;
>>>>>> +	} else if (res->end > SZ_1M) {
>>>>>> +		cmem->ranges[cmem->nr_ranges].start = SZ_1M;
>>>>>> +		cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>>> +		cmem->nr_ranges++;
>>>>>> +	}
>>>>>
>>>>> What is going on with this chunk?  I can guess but this needs a clear
>>>>> comment.
>>>>
>>>> Indeed it needs some code comment, this is based on some offline
>>>> discussion.  cat /proc/vmcore will give a warning because ioremap is
>>>> mapping the system ram.
>>>>
>>>> We pass the first 1M to kdump kernel in e820 as system ram so that 2nd
>>>> kernel can use the low 1M memory because for example the trampoline
>>>> code.
>>>>
>>>>>
>>>>>>  
>>>>>>  	return 0;
>>>>>>  }
>>>>>
>>>>>> @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
>>>>>>  	memset(&cmd, 0, sizeof(struct crash_memmap_data));
>>>>>>  	cmd.params = params;
>>>>>>  
>>>>>> -	/* Add first 640K segment */
>>>>>> -	ei.addr = image->arch.backup_src_start;
>>>>>> -	ei.size = image->arch.backup_src_sz;
>>>>>> +	/*
>>>>>> +	 * Add the low memory range[0x1000, SZ_1M], skip
>>>>>> +	 * the first zero page.
>>>>>> +	 */
>>>>>> +	ei.addr = PAGE_SIZE;
>>>>>> +	ei.size = SZ_1M - PAGE_SIZE;
>>>>>>  	ei.type = E820_TYPE_RAM;
>>>>>>  	add_e820_entry(params, &ei);
>>>>>
>>>>> Likewise here.  Why do we need a special case?
>>>>> Why the magic with PAGE_SIZE?
>>>>
>>>> Good catch, the zero page part is useless, I think no other special
>>>> reason, just assumed zero page is not usable, but it should be ok to
>>>> remove the special handling, just pass 0 - 1M is good enough.
>>>
>>> But if we have stopped special casing the low 1M.  Why do we need a
>>> special case here at all?
>>>
>> Here, need to pass the low memory range to kdump kernel, which will guarantee
>> the availability of low memory in kdump kernel, otherwise, kdump kernel won't
>> use the low memory region.
>>
>>> If you need the special case it is almost certainly wrong to say you
>>> have ram above 640KiB and below 1MiB.  That is the legacy ROM and video
>>> MMIO area.
>>>
>>> There is a reason the original code said 640KiB.
>>>
>> Do you mean that the 640k region is good enough here instead of 1MiB?
> 
> Reading through the code of crash_setup_memap_entries I see that what
> the code is doing now.  The code is repeating the e820 memory map with
> the memory areas that were not reserved for the crash kernel removed.
> 
> In which case what the code needs to be doing something like:
> 
> 	cmd.type = E820_TYPE_RAM;
> 	flags = IORESOURCE_MEM;
> 	walk_iomem_res_desc(IORES_DESC_RESERVED, flags, 0, 1024*1024, &cmd,
> 			       memmap_entry_callback);
> 
The above code does not get the results what we expected, it gets the reserved
memory marked as 'IORES_DESC_RESERVED' in the low 1MiB range.

Finally, kdump kernel happened the panic as follow:
......
[    3.555662] Kernel panic - not syncing: Real mode trampoline was not allocated
[    3.556660] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.4.0-rc3+ #4
[    3.556660] Hardware name: AMD Corporation Speedway/Speedway, BIOS RSW1009C 07/27/2018
[    3.556660] Call Trace:
[    3.556660]  dump_stack+0x46/0x60
[    3.556660]  panic+0xfb/0x2d7
[    3.556660]  ? hv_init_spinlocks+0x7f/0x7f
[    3.556660]  init_real_mode+0x27/0x1fa
[    3.556660]  ? hv_init_spinlocks+0x7f/0x7f
[    3.556660]  ? do_one_initcall+0x46/0x1e4
[    3.556660]  ? proc_register+0xd0/0x130
[    3.556660]  ? kernel_init_freeable+0xe2/0x242
[    3.556660]  ? rest_init+0xaa/0xaa
[    3.556660]  ? kernel_init+0xa/0x106
[    3.556660]  ? ret_from_fork+0x22/0x40
[    3.556660] Rebooting in 10 seconds..
[    3.556660] ACPI MEMORY or I/O RESET_REG.

I modified the above code, and tested it. This can find out the system ram in
the low 1MiB range. And it worked well.

@@ -356,11 +338,11 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
        memset(&cmd, 0, sizeof(struct crash_memmap_data));
        cmd.params = params;
 
+       /* Add the low 1MiB */
+       cmd.type = E820_TYPE_RAM;
+       flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
+       walk_iomem_res_desc(IORES_DESC_NONE, flags, 0, 1024*1024 - 1, &cmd,
+                       memmap_entry_callback);

> Depending on which bugs exist it might make sense to limit this to
> the low 640KiB.  But finding something the kernel already recognizes
> as RAM should prevent most of those problems already.  Barring bugs
> I admit it doesn't make sense to repeat the work that someone else
> has already done.
> 
> This bit:
> 	/* Add e820 reserved ranges */
> 	cmd.type = E820_TYPE_RESERVED;
> 	flags = IORESOURCE_MEM;
> 	walk_iomem_res_desc(IORES_DESC_RESERVED, flags, 0, -1, &cmd,
> 			   memmap_entry_callback);
> 
> Should probably start at 1MiB instead of 0.  Just so we don't report the
If so, it can not find out the reserved memory marked as 'IORES_DESC_RESERVED' in
the low 1MiB range, finally, it doesn't pass the reserved memory in the low 1MiB to
kdump kernel, which could cause some problems, such as SME or PCI MMCONFIG issue.

> memory below 1MiB as unconditionally reserved.   I don't properly
> understand the IORES_DESC_RESERVED flag, and how that differs from
I found three commits about 'IORES_DESC_RESERVED' flag, hope this helps.
1.ae9e13d621d6 ("x86/e820, ioport: Add a new I/O resource descriptor IORES_DESC_RESERVED")
2.5da04cc86d12 ("x86/mm: Rework ioremap resource mapping determination")
3.980621daf368 ("x86/crash: Add e820 reserved ranges to kdump kernel's e820 table")

> flags.  So please test my suggestions to verify the code works as
> expected.
> 
I have tested the two changes that you mentioned, please refer to the reply above.

Thanks.
Lianbo

> Eric
> 

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

  parent reply	other threads:[~2019-10-16 14:30 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-10-12  2:21 [PATCH 0/3 v3] x86/kdump: Fix 'kmem -s' reported an invalid freepointer when SME was active Lianbo Jiang
2019-10-12  2:21 ` Lianbo Jiang
2019-10-12  2:21 ` [PATCH 1/3 " Lianbo Jiang
2019-10-12  2:21   ` Lianbo Jiang
2019-10-12  2:21 ` [PATCH 2/3 v3] x86/kdump cleanup: remove the unused crash_copy_backup_region() Lianbo Jiang
2019-10-12  2:21   ` Lianbo Jiang
2019-10-12  2:21 ` [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region Lianbo Jiang
2019-10-12  2:21   ` Lianbo Jiang
2019-10-12 11:26   ` Eric W. Biederman
2019-10-12 11:26     ` Eric W. Biederman
2019-10-12 12:16     ` Dave Young
2019-10-12 12:16       ` Dave Young
2019-10-13  3:54       ` Eric W. Biederman
2019-10-13  3:54         ` Eric W. Biederman
2019-10-13  9:36         ` lijiang
2019-10-13  9:36           ` lijiang
2019-10-15 11:04           ` Eric W. Biederman
2019-10-15 11:04             ` Eric W. Biederman
2019-10-16  8:40             ` lijiang
2019-10-16  8:40               ` lijiang
2019-10-16 14:30             ` lijiang [this message]
2019-10-16 14:30               ` lijiang
2019-10-16 16:51               ` Eric W. Biederman
2019-10-16 16:51                 ` Eric W. Biederman
2019-10-13 10:20         ` Dave Young
2019-10-13 10:20           ` Dave Young
2019-10-14 10:02       ` lijiang
2019-10-14 10:02         ` lijiang
2019-10-15 11:11         ` Eric W. Biederman
2019-10-15 11:11           ` Eric W. Biederman
2019-10-16  3:24           ` lijiang
2019-10-16  3:24             ` lijiang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=93e6713a-8027-3a7a-4445-9bee56b19f62@redhat.com \
    --to=lijiang@redhat.com \
    --cc=Thomas.Lendacky@amd.com \
    --cc=bhe@redhat.com \
    --cc=bp@alien8.de \
    --cc=dhowells@redhat.com \
    --cc=dyoung@redhat.com \
    --cc=ebiederm@xmission.com \
    --cc=hpa@zytor.com \
    --cc=jgross@suse.com \
    --cc=kexec@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=tglx@linutronix.de \
    --cc=vgoyal@redhat.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.