Re: [PATCH] x86/efi: Do not release sub-1MB memory regions when the crashkernel option is specified

From: Andy Lutomirski <luto@amacapital.net>
To: Baoquan He <bhe@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>,
	Lianbo Jiang <lijiang@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>,
	linux-efi <linux-efi@vger.kernel.org>,
	Platform Driver <platform-driver-x86@vger.kernel.org>,
	X86 ML <x86@kernel.org>, Ard Biesheuvel <ardb@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	Darren Hart <dvhart@infradead.org>,
	Andy Shevchenko <andy@infradead.org>,
	kexec@lists.infradead.org, Dave Young <dyoung@redhat.com>
Subject: Re: [PATCH] x86/efi: Do not release sub-1MB memory regions when the crashkernel option is specified
Date: Mon, 12 Apr 2021 08:24:25 -0700	[thread overview]
Message-ID: <CALCETrV0dgn1=7CoB+BSHdDuzqtfpKGOPvjJg+sNo74VrcJE=A@mail.gmail.com> (raw)
In-Reply-To: <20210412095231.GC4282@MiWiFi-R3L-srv>

On Mon, Apr 12, 2021 at 2:52 AM Baoquan He <bhe@redhat.com> wrote:
>
> On 04/11/21 at 06:49pm, Andy Lutomirski wrote:
> >
> >
> > > On Apr 11, 2021, at 6:14 PM, Baoquan He <bhe@redhat.com> wrote:
> > >
> > > On 04/09/21 at 07:59pm, H. Peter Anvin wrote:
> > >> Why don't we do this unconditionally? At the very best we gain half a megabyte of memory (except the trampoline, which has to live there, but it is only a few kilobytes.)
> > >
> > > This is a great suggestion, thanks. I think we can fix it in this way to
> > > make code simpler. Then the specific caring of real mode in
> > > efi_free_boot_services() can be removed too.
> > >
> >
> > This whole situation makes me think that the code is buggy before and buggy after.
> >
> > The issue here (I think) is that various pieces of code want to reserve specific pieces of otherwise-available low memory for their own nefarious uses. I don’t know *why* crash kernel needs this, but that doesn’t matter too much.
>
> Kdump kernel also need go through real mode code path during bootup. It
> is not different than normal kernel except that it skips the firmware
> resetting. So kdump kernel needs low 1M as system RAM just as normal
> kernel does. Here we reserve the whole low 1M with memblock_reserve()
> to avoid any later kernel or driver data reside in this area. Otherwise,
> we need dump the content of this area to vmcore. As we know, when crash
> happened, the old memory of 1st kernel should be untouched until vmcore
> dumping read out its content. Meanwhile, kdump kernel need reuse low 1M.
> In the past, we used a back up region to copy out the low 1M area, and
> map the back up region into the low 1M area in vmcore elf file. In
> 6f599d84231fd27 ("x86/kdump: Always reserve the low 1M when the crashkernel
> option is specified"), we changed to lock the whole low 1M to avoid
> writting any kernel data into, like this we can skip this area when
> dumping vmcore.
>
> Above is why we try to memblock reserve the whole low 1M. We don't want
> to use it, just don't want anyone to use it in 1st kernel.
>
> >
> > I propose that the right solution is to give low-memory-reserving code paths two chances to do what they need: once at the very beginning and once after EFI boot services are freed.
> >
> > Alternatively, just reserve *all* otherwise unused sub 1M memory up front, then release it right after releasing boot services, and then invoke the special cases exactly once.
>
> I am not sure if I got both suggested ways clearly. They look a little
> complicated in our case. As I explained at above, we want the whole low
> 1M locked up, not one piece or some pieces of it.

My second suggestion is probably the better one.  Here it is, concretely:

The early (pre-free_efi_boot_services) code just reserves all
available sub-1M memory unconditionally, but it specially marks it as
reserved-but-available-later.  We stop allocating the trampoline page
at this stage.

In free_efi_boot_services, instead of *freeing* the sub-1M memory, we
stick it in the pile of reserved memory created in the early step.
This may involve splitting a block, kind of like the current
trampoline late allocation works.

Then, *after* free_efi_boot_services(), we run a single block of code
that lets everything that wants sub-1M code claim some.  This means
that the trampoline gets allocated and, if crashkernel wants to claim
everything else, it can.  After that, everything still unclaimed gets
freed.

Does that make sense?

--Andy