From: Dave Young <dyoung@redhat.com>
To: Jiri Bohac <jbohac@suse.cz>
Cc: x86@kernel.org, Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>,  Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	 "H. Peter Anvin" <hpa@zytor.com>,
	linux-kernel@vger.kernel.org, kexec@lists.infradead.org,
	 Baoquan He <bhe@redhat.com>,
	Eric Biederman <ebiederm@xmission.com>,
	 "Huang, Kai" <kai.huang@intel.com>
Subject: Re: [PATCH V2] x86/kexec: do not update E820 kexec table for setup_data
Date: Fri, 22 Mar 2024 10:17:19 +0800
Message-ID: <CALu+AoT2jYfVTFpVvUJv5T+OdSAQzYw0kn74EighK5-4A3O16w@mail.gmail.com>
In-Reply-To: <ZfwMribQCTKWSO9l@dwarf.suse.cz>

Hi Jiri,

On Thu, 21 Mar 2024 at 18:32, Jiri Bohac <jbohac@suse.cz> wrote:
>
> Hi,
>
> On Thu, Mar 21, 2024 at 05:23:20PM +0800, Dave Young wrote:
> > Crashkernel reservation recently failed on a ThinkPad T440s laptop.
> > Actually the memblock reservation succeeded, but the later
> > insert_resource() failed.
> >
> > Test steps:
> > kexec load -> /* make sure to add the crashkernel param, e.g. crashkernel=160M */
> >     kexec reboot ->
> >         dmesg|grep "crashkernel reserved";
> >             a crashkernel memory range like the one below is reserved successfully:
> >             0x00000000d0000000 - 0x00000000da000000
> >         but there is no "Crash kernel" region in /proc/iomem
> >
> > The background story is as follows:
> >
> > Currently the E820 code reserves setup_data regions for both the current
> > kernel and the kexec kernel, and it inserts them into the resource list.
> > When the kexec kernel boots, nobody passes it the old setup_data; kexec
> > only passes fresh SETUP_EFI and SETUP_IMA if needed.  Thus the old
> > setup_data memory is not used at all.
> >
> > Because the old kernel also updates the kexec e820 table, the kexec kernel
> > sees those regions as E820_TYPE_RESERVED_KERN, and the old setup_data
> > regions are later inserted into the resource list in the kexec kernel by
> > e820__reserve_resources().
> >
> > Note that because no setup_data is passed in for those old regions, they
> > are not reserved early (by early_reserve_memory()), so the crashkernel
> > memblock reservation treats them as usable memory and could place the
> > crashkernel region over the old setup_data regions.  That is the bug I
> > noticed here: the kdump insert_resource() failed because
> > e820__reserve_resources() had already added the overlapping chunks to
> > /proc/iomem.
>
> Wouldn't this be caused by
> 4a693ce65b186fddc1a73621bd6f941e6e3eca21 ("kdump: defer the
> insertion of crashkernel resources")?
>
> Before that the crashkernel resources were inserted from
> arch_reserve_crashkernel() which is called before
> e820__reserve_resources().

I think reverting the commit you mentioned could paper over this issue,
but it is not the root cause.  Yes, arch_reserve_crashkernel() can then
succeed, but e820 still tries to reserve the setup_data that overlaps with
the crashkernel region for another purpose.
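To make the ordering argument concrete, here is a toy user-space model in C of top-level /proc/iomem entries. It is not the kernel's insert_resource(): the names (struct res_range, struct res_model, model_insert()) are hypothetical, and the model only encodes a simplified rule, namely that a new range may nest inside, or wrap around, an existing sibling but must not partially overlap one. The crashkernel range comes from the dmesg output above (end made inclusive, 0xd9ffffff); the old setup_data address 0xd5000000 and the System RAM bounds are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one top-level /proc/iomem entry (end is inclusive). */
struct res_range {
	uint64_t start;
	uint64_t end;
	const char *name;
};

#define RES_MODEL_MAX 16

/* Flat list of top-level siblings; children are not tracked. */
struct res_model {
	struct res_range r[RES_MODEL_MAX];
	size_t n;
};

/* True if the two ranges share at least one byte. */
static bool res_overlaps(const struct res_range *a, const struct res_range *b)
{
	return a->start <= b->end && b->start <= a->end;
}

/* True if @outer fully covers @inner. */
static bool res_contains(const struct res_range *outer,
			 const struct res_range *inner)
{
	return outer->start <= inner->start && inner->end <= outer->end;
}

/*
 * Returns 0 if @new can be recorded, -1 on a partial overlap with an
 * existing entry (or if the model is full). Nesting in either direction
 * is accepted; the model only records the range, it does not build a tree.
 */
static int model_insert(struct res_model *m, struct res_range new)
{
	assert(new.start <= new.end);

	for (size_t i = 0; i < m->n; i++) {
		const struct res_range *old = &m->r[i];

		if (!res_overlaps(old, &new))
			continue;
		if (res_contains(old, &new) || res_contains(&new, old))
			continue;	/* would nest: allowed */
		return -1;		/* partial overlap: conflict */
	}
	if (m->n == RES_MODEL_MAX)
		return -1;
	m->r[m->n++] = new;
	return 0;
}
```

In this model, inserting the crash kernel range fails when System RAM has been split into adjacent entries around the old setup_data range, and succeeds (as a nested range) when System RAM is a single entry. Inserting the crash kernel first, as before 4a693ce65b18, only moves the failure to the later System RAM insertion, which is consistent with the point that reordering does not remove the overlap itself.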

>
> The semantics of E820_TYPE_RESERVED_KERN with respect to kexec are
> quite inconsistent. It's treated as E820_TYPE_RAM by
> e820__memblock_setup() and e820_type_to_iomem_type().
>
> The problem we're seeing here is the result of the former.
> e820__memblock_setup() will add the E820_TYPE_RESERVED_KERN
> region to the memblock, merging with the neighbouring memblocks,
> allowing the crashkernel region to span across the originally
> reserved area.
>
> e820_type_to_iomem_type() treating E820_TYPE_RESERVED_KERN as
> E820_TYPE_RAM will make the E820_TYPE_RESERVED_KERN range appear as
> System RAM in /proc/iomem. If the old kexec_load (not
> kexec_file_load) syscall is used, the userspace kexec utility
> will construct the e820 table based on the contents of
> /proc/iomem and the kexec kernel will see the
> E820_TYPE_RESERVED_KERN range as E820_TYPE_RAM.  When
> kexec_file_load is used, the E820_TYPE_RESERVED_KERN type is
> propagated to the kexec kernel by bzImage64_load() /
> setup_e820_entries().

This is true, but it does not matter for the kexec kernel: those regions
are only reserved for the 1st kernel and are not meaningful to the kexec
kernel.  Using them as system RAM in the 2nd (kexec) kernel is fine.
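The memblock side of Jiri's description can be sketched in C as well. The model below assumes, as described above, that both RAM and RESERVED_KERN entries are handed to memblock and that touching ranges coalesce into one region. The enum, structs and functions (model_memblock_setup(), fits_in_one_range()) are hypothetical stand-ins, not the kernel's E820_TYPE_* values or the memblock API, and the table layout (RAM, a one-page old setup_data entry, RAM) is invented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for E820_TYPE_RAM/_RESERVED/_RESERVED_KERN (values are not the kernel's). */
enum model_type { MODEL_RAM, MODEL_RESERVED, MODEL_RESERVED_KERN };

struct model_entry {
	uint64_t addr;
	uint64_t size;
	enum model_type type;
};

struct model_range {
	uint64_t base;
	uint64_t end;	/* exclusive */
};

/* RAM and RESERVED_KERN are both handed to memblock, per the description above. */
static bool added_to_memblock(enum model_type t)
{
	return t == MODEL_RAM || t == MODEL_RESERVED_KERN;
}

/*
 * Collects memblock-like ranges from a sorted, non-overlapping table and
 * coalesces ranges that touch. Entries beyond @max ranges are dropped.
 * Returns the number of ranges written to @out.
 */
static size_t model_memblock_setup(const struct model_entry *tbl, size_t n,
				   struct model_range *out, size_t max)
{
	size_t cnt = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t base = tbl[i].addr;
		uint64_t end = tbl[i].addr + tbl[i].size;

		if (!added_to_memblock(tbl[i].type))
			continue;
		assert(base < end);
		if (cnt > 0 && out[cnt - 1].end == base)
			out[cnt - 1].end = end;		/* merge with previous range */
		else if (cnt < max)
			out[cnt++] = (struct model_range){ base, end };
	}
	return cnt;
}

/* True if [base, end) lies entirely inside one of the ranges. */
static bool fits_in_one_range(const struct model_range *r, size_t n,
			      uint64_t base, uint64_t end)
{
	for (size_t i = 0; i < n; i++)
		if (r[i].base <= base && end <= r[i].end)
			return true;
	return false;
}
```

With the middle entry typed RESERVED_KERN, the three entries merge into a single range and the 160M window [0xd0000000, 0xda000000) fits inside it. For contrast, with the middle entry typed as plain RESERVED, the window straddles a gap and no longer fits.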

>
>
> > Index: linux/arch/x86/kernel/e820.c
> > ===================================================================
> > --- linux.orig/arch/x86/kernel/e820.c
> > +++ linux/arch/x86/kernel/e820.c
> > @@ -1015,16 +1015,6 @@ void __init e820__reserve_setup_data(voi
> >               pa_next = data->next;
> >
> >               e820__range_update(pa_data, sizeof(*data)+data->len, E820_TYPE_RAM, E820_TYPE_RESERVED_KERN);
> > -
> > -             /*
> > -              * SETUP_EFI and SETUP_IMA are supplied by kexec and do not need
> > -              * to be reserved.
> > -              */
> > -             if (data->type != SETUP_EFI && data->type != SETUP_IMA)
> > -                     e820__range_update_kexec(pa_data,
> > -                                              sizeof(*data) + data->len,
> > -                                              E820_TYPE_RAM, E820_TYPE_RESERVED_KERN);
> > -
>
> Your tree is missing this recent commit:
> 7fd817c906503b6813ea3b41f5fdf4192449a707 ("x86/e820: Don't
> reserve SETUP_RNG_SEED in e820").
>
> Wouldn't this fix [/paper over] your problem as well? I.e., isn't
> SETUP_RNG_SEED the setup_data item that's causing your problem?

Thanks for catching this; I will rebase and repost.

But it does not "fix" the problem, because my problem is related to a
different setup_data range.  I think it is SETUP_PCI (not 100% sure, but it
is certainly not SETUP_RNG_SEED).
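For reference, the filter that the quoted hunk removes can be modelled in C as a walk over the setup_data list. This sketch assumes a 16-byte header (the next, type and len fields of struct setup_data) and uses hypothetical stand-in names and enum values (struct model_setup_data, old_kexec_spans()); it does not model early_memremap() or the SETUP_RNG_SEED change from 7fd817c906503b6813ea3b41f5fdf4192449a707, and the addresses and lengths are invented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the SETUP_* types named in the thread (values are not the kernel's). */
enum model_setup_type { M_SETUP_PCI, M_SETUP_EFI, M_SETUP_IMA };

/* Assumed header size: struct setup_data's next (u64) + type (u32) + len (u32). */
#define M_SETUP_HDR_SIZE 16u

/* Hypothetical list node; pa is the physical address of the header. */
struct model_setup_data {
	uint64_t pa;
	enum model_setup_type type;
	uint32_t len;
	const struct model_setup_data *next;
};

struct model_span {
	uint64_t start;
	uint64_t end;	/* exclusive: start + header + payload */
};

/* The condition from the removed lines of the quoted hunk. */
static bool old_code_marks_kexec_table(enum model_setup_type t)
{
	return t != M_SETUP_EFI && t != M_SETUP_IMA;
}

/*
 * Walks the list in order and records the spans that the pre-patch code
 * would also mark as RESERVED_KERN in the kexec e820 table.
 * Returns the number of spans written to @out (at most @max).
 */
static size_t old_kexec_spans(const struct model_setup_data *d,
			      struct model_span *out, size_t max)
{
	size_t n = 0;

	for (; d != NULL; d = d->next) {
		uint64_t end = d->pa + M_SETUP_HDR_SIZE + d->len;

		assert(end > d->pa);	/* no wrap-around in the model */
		if (old_code_marks_kexec_table(d->type) && n < max)
			out[n++] = (struct model_span){ d->pa, end };
	}
	return n;
}
```

Only the SETUP_PCI entry passes the old filter, so with the pre-patch code it would still be carried into the kexec e820 table as RESERVED_KERN. With the patch, the loop no longer updates the kexec table for any setup_data entry.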

>
> Regards,
>
> --
> Jiri Bohac <jbohac@suse.cz>
> SUSE Labs, Prague, Czechia
>
>
Thanks
Dave

