LKML Archive on lore.kernel.org
 help / color / Atom feed
From: ebiederm@xmission.com (Eric W. Biederman)
To: lijiang <lijiang@redhat.com>
Cc: Dave Young <dyoung@redhat.com>,
	linux-kernel@vger.kernel.org, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, hpa@zytor.com, x86@kernel.org,
	bhe@redhat.com, jgross@suse.com, dhowells@redhat.com,
	Thomas.Lendacky@amd.com, vgoyal@redhat.com,
	kexec@lists.infradead.org
Subject: Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region
Date: Wed, 16 Oct 2019 11:51:33 -0500
Message-ID: <87eezczl9m.fsf@x220.int.ebiederm.org> (raw)
In-Reply-To: <93e6713a-8027-3a7a-4445-9bee56b19f62@redhat.com> (lijiang's message of "Wed, 16 Oct 2019 22:30:15 +0800")

lijiang <lijiang@redhat.com> writes:

> 在 2019年10月15日 19:04, Eric W. Biederman 写道:
>> lijiang <lijiang@redhat.com> writes:
>> 
>>> 在 2019年10月13日 11:54, Eric W. Biederman 写道:
>>>> Dave Young <dyoung@redhat.com> writes:
>>>>
>>>>> Hi Eric,
>>>>>
>>>>> On 10/12/19 at 06:26am, Eric W. Biederman wrote:
>>>>>> Lianbo Jiang <lijiang@redhat.com> writes:
>>>>>>
>>>>>>> When the crashkernel kernel command line option is specified, the
>>>>>>> low 1MiB memory will always be reserved, which makes that the memory
>>>>>>> allocated later won't fall into the low 1MiB area, thereby, it's not
>>>>>>> necessary to create a backup region and also no need to copy the first
>>>>>>> 640k content to a backup region.
>>>>>>>
>>>>>>> Currently, the code related to the backup region can be safely removed,
>>>>>>> so lets clean up.
>>>>>>>
>>>>>>> Signed-off-by: Lianbo Jiang <lijiang@redhat.com>
>>>>>>> ---
>>>>>>
>>>>>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>>>>>>> index eb651fbde92a..cc5774fc84c0 100644
>>>>>>> --- a/arch/x86/kernel/crash.c
>>>>>>> +++ b/arch/x86/kernel/crash.c
>>>>>>> @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>>>>>>>  
>>>>>>>  #ifdef CONFIG_KEXEC_FILE
>>>>>>>  
>>>>>>> -static unsigned long crash_zero_bytes;
>>>>>>> -
>>>>>>>  static int get_nr_ram_ranges_callback(struct resource *res, void *arg)
>>>>>>>  {
>>>>>>>  	unsigned int *nr_ranges = arg;
>>>>>>> @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
>>>>>>>  {
>>>>>>>  	struct crash_mem *cmem = arg;
>>>>>>>  
>>>>>>> -	cmem->ranges[cmem->nr_ranges].start = res->start;
>>>>>>> -	cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>>>> -	cmem->nr_ranges++;
>>>>>>> +	if (res->start >= SZ_1M) {
>>>>>>> +		cmem->ranges[cmem->nr_ranges].start = res->start;
>>>>>>> +		cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>>>> +		cmem->nr_ranges++;
>>>>>>> +	} else if (res->end > SZ_1M) {
>>>>>>> +		cmem->ranges[cmem->nr_ranges].start = SZ_1M;
>>>>>>> +		cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>>>> +		cmem->nr_ranges++;
>>>>>>> +	}
>>>>>>
>>>>>> What is going on with this chunk?  I can guess but this needs a clear
>>>>>> comment.
>>>>>
>>>>> Indeed it needs some code comment, this is based on some offline
>>>>> discussion.  cat /proc/vmcore will give a warning because ioremap is
>>>>> mapping the system ram.
>>>>>
>>>>> We pass the first 1M to kdump kernel in e820 as system ram so that 2nd
>>>>> kernel can use the low 1M memory because for example the trampoline
>>>>> code.
>>>>>
>>>>>>
>>>>>>>  
>>>>>>>  	return 0;
>>>>>>>  }
>>>>>>
>>>>>>> @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
>>>>>>>  	memset(&cmd, 0, sizeof(struct crash_memmap_data));
>>>>>>>  	cmd.params = params;
>>>>>>>  
>>>>>>> -	/* Add first 640K segment */
>>>>>>> -	ei.addr = image->arch.backup_src_start;
>>>>>>> -	ei.size = image->arch.backup_src_sz;
>>>>>>> +	/*
>>>>>>> +	 * Add the low memory range[0x1000, SZ_1M], skip
>>>>>>> +	 * the first zero page.
>>>>>>> +	 */
>>>>>>> +	ei.addr = PAGE_SIZE;
>>>>>>> +	ei.size = SZ_1M - PAGE_SIZE;
>>>>>>>  	ei.type = E820_TYPE_RAM;
>>>>>>>  	add_e820_entry(params, &ei);
>>>>>>
>>>>>> Likewise here.  Why do we need a special case?
>>>>>> Why the magic with PAGE_SIZE?
>>>>>
>>>>> Good catch, the zero page part is useless, I think no other special
>>>>> reason, just assumed zero page is not usable, but it should be ok to
>>>>> remove the special handling, just pass 0 - 1M is good enough.
>>>>
>>>> But if we have stopped special casing the low 1M.  Why do we need a
>>>> special case here at all?
>>>>
>>> Here, need to pass the low memory range to kdump kernel, which will guarantee
>>> the availability of low memory in kdump kernel, otherwise, kdump kernel won't
>>> use the low memory region.
>>>
>>>> If you need the special case it is almost certainly wrong to say you
>>>> have ram above 640KiB and below 1MiB.  That is the legacy ROM and video
>>>> MMIO area.
>>>>
>>>> There is a reason the original code said 640KiB.
>>>>
>>> Do you mean that the 640k region is good enough here instead of 1MiB?
>> 
>> Reading through the code of crash_setup_memap_entries I see that what
>> the code is doing now.  The code is repeating the e820 memory map with
>> the memory areas that were not reserved for the crash kernel removed.
>> 
>> In which case what the code needs to be doing something like:
>> 
>> 	cmd.type = E820_TYPE_RAM;
>> 	flags = IORESOURCE_MEM;
>> 	walk_iomem_res_desc(IORES_DESC_RESERVED, flags, 0, 1024*1024, &cmd,
>> 			       memmap_entry_callback);
>> 
> The above code does not get the results what we expected, it gets the reserved
> memory marked as 'IORES_DESC_RESERVED' in the low 1MiB range.
>
> Finally, kdump kernel happened the panic as follow:
> ......
> [    3.555662] Kernel panic - not syncing: Real mode trampoline was not allocated
> [    3.556660] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.4.0-rc3+ #4
> [    3.556660] Hardware name: AMD Corporation Speedway/Speedway, BIOS RSW1009C 07/27/2018
> [    3.556660] Call Trace:
> [    3.556660]  dump_stack+0x46/0x60
> [    3.556660]  panic+0xfb/0x2d7
> [    3.556660]  ? hv_init_spinlocks+0x7f/0x7f
> [    3.556660]  init_real_mode+0x27/0x1fa
> [    3.556660]  ? hv_init_spinlocks+0x7f/0x7f
> [    3.556660]  ? do_one_initcall+0x46/0x1e4
> [    3.556660]  ? proc_register+0xd0/0x130
> [    3.556660]  ? kernel_init_freeable+0xe2/0x242
> [    3.556660]  ? rest_init+0xaa/0xaa
> [    3.556660]  ? kernel_init+0xa/0x106
> [    3.556660]  ? ret_from_fork+0x22/0x40
> [    3.556660] Rebooting in 10 seconds..
> [    3.556660] ACPI MEMORY or I/O RESET_REG.
>
> I modified the above code, and tested it. This can find out the system ram in
> the low 1MiB range. And it worked well.
>
> @@ -356,11 +338,11 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
>         memset(&cmd, 0, sizeof(struct crash_memmap_data));
>         cmd.params = params;
>  
> +       /* Add the low 1MiB */
> +       cmd.type = E820_TYPE_RAM;
> +       flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
> +       walk_iomem_res_desc(IORES_DESC_NONE, flags, 0, 1024*1024 - 1, &cmd,
> +                       memmap_entry_callback);
>

That looks like a very reasonable fix.

>> Depending on which bugs exist it might make sense to limit this to
>> the low 640KiB.  But finding something the kernel already recognizes
>> as RAM should prevent most of those problems already.  Barring bugs
>> I admit it doesn't make sense to repeat the work that someone else
>> has already done.
>> 
>> This bit:
>> 	/* Add e820 reserved ranges */
>> 	cmd.type = E820_TYPE_RESERVED;
>> 	flags = IORESOURCE_MEM;
>> 	walk_iomem_res_desc(IORES_DESC_RESERVED, flags, 0, -1, &cmd,
>> 			   memmap_entry_callback);
>> 
>> Should probably start at 1MiB instead of 0.  Just so we don't report the
> If so, it can not find out the reserved memory marked as 'IORES_DESC_RESERVED' in
> the low 1MiB range, finally, it doesn't pass the reserved memory in the low 1MiB to
> kdump kernel, which could cause some problems, such as SME or PCI MMCONFIG issue.

Good point. For some reason I was thinking IORESOURCE_MEM and
IORESOURCE_SYSTEM_RAM were the same thing.  It has been way to long
since I have been in that part of the code.

So yes let's leave that part alone.

>> memory below 1MiB as unconditionally reserved.   I don't properly
>> understand the IORES_DESC_RESERVED flag, and how that differs from
> I found three commits about 'IORES_DESC_RESERVED' flag, hope this helps.
> 1.ae9e13d621d6 ("x86/e820, ioport: Add a new I/O resource descriptor IORES_DESC_RESERVED")
> 2.5da04cc86d12 ("x86/mm: Rework ioremap resource mapping determination")
> 3.980621daf368 ("x86/crash: Add e820 reserved ranges to kdump kernel's e820 table")
>
>> flags.  So please test my suggestions to verify the code works as
>> expected.
>> 
> I have tested the two changes that you mentioned, please refer to the
> reply above.

Thank you.  It looks like you have figured out how these things should
work.

Eric


  reply index

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-10-12  2:21 [PATCH 0/3 v3] x86/kdump: Fix 'kmem -s' reported an invalid freepointer when SME was active Lianbo Jiang
2019-10-12  2:21 ` [PATCH 1/3 " Lianbo Jiang
2019-10-12  2:21 ` [PATCH 2/3 v3] x86/kdump cleanup: remove the unused crash_copy_backup_region() Lianbo Jiang
2019-10-12  2:21 ` [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region Lianbo Jiang
2019-10-12 11:26   ` ebiederm
2019-10-12 12:16     ` Dave Young
2019-10-13  3:54       ` ebiederm
2019-10-13  9:36         ` lijiang
2019-10-15 11:04           ` ebiederm
2019-10-16  8:40             ` lijiang
2019-10-16 14:30             ` lijiang
2019-10-16 16:51               ` ebiederm [this message]
2019-10-13 10:20         ` Dave Young
2019-10-14 10:02       ` lijiang
2019-10-15 11:11         ` ebiederm
2019-10-16  3:24           ` lijiang

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87eezczl9m.fsf@x220.int.ebiederm.org \
    --to=ebiederm@xmission.com \
    --cc=Thomas.Lendacky@amd.com \
    --cc=bhe@redhat.com \
    --cc=bp@alien8.de \
    --cc=dhowells@redhat.com \
    --cc=dyoung@redhat.com \
    --cc=hpa@zytor.com \
    --cc=jgross@suse.com \
    --cc=kexec@lists.infradead.org \
    --cc=lijiang@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=tglx@linutronix.de \
    --cc=vgoyal@redhat.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git
	git clone --mirror https://lore.kernel.org/lkml/7 lkml/git/7.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org
	public-inbox-index lkml

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git