All of lore.kernel.org
 help / color / mirror / Atom feed
* RFC x86/boot/64: BOOT_PGT_SIZE definition for compressed kernel
@ 2020-10-25  0:41 Arvind Sankar
  2020-10-25  2:20 ` Arvind Sankar
  2020-10-27 12:40 ` Kirill A. Shutemov
  0 siblings, 2 replies; 6+ messages in thread
From: Arvind Sankar @ 2020-10-25  0:41 UTC (permalink / raw)
  To: x86; +Cc: linux-kernel, Kirill A. Shutemov, Kees Cook, Joerg Roedel

Hi, I think the definition of BOOT_PGT_SIZE in
arch/x86/include/asm/boot.h is insufficient, especially after
  ca0e22d4f011 ("x86/boot/compressed/64: Always switch to own page table")

Currently, it allocates 6 pages if KASLR is disabled, and either 17 or
19 pages depending on X86_VERBOSE_BOOTUP if KASLR is enabled.

- The X86_VERBOSE_BOOTUP test shouldn't be done: that only disables
  debug messages, but warnings/errors are always output to VGA memory,
  so the two extra pages for mapping video RAM are always needed.

- The calculation wasn't updated for X86_5LEVEL, which requires at least
  one more page for the P4D level, and in theory could require two extra
  pages for each of the 4 mappings (compressed kernel, output kernel,
  boot_params and command line), though that would require a system with
  truly ginormous amounts of RAM.

- If KASLR is disabled, there are only 6 pages, but now that we're
  always setting up our own page table, we need 1+(2+2)*3 (one PGD, and
  two PUD and two PMD pages for kernel, boot_params and command line),
  and 2 more pages for the video RAM, and more for 5-level. Even for
  !RELOCATABLE, 13 pages might be needed.

- SEV-ES needs one more page because it needs to do a PTE-level mapping
  for the GHCB page.

- The static calculation is also busted because
  boot/compressed/{kaslr.c,acpi.c} can scan the setup data, EFI
  configuration tables and the EFI memmap, and none of these are
  accounted for. They used to be scanned while still on the
  firmware/bootloader page tables, but now our page tables have to cover
  them as well. Trying to add up the worst case for all of these, and
  anything else the compressed kernel might potentially access seems
  like a lost cause.

We could do something similar to what the main kernel does with
early_dynamic_pgts: map the compressed kernel at a fixed virtual
address (in negative address space, say); recycle all the other mappings
until we're done with decompression, and then map the output,
boot_params and command line. The number of pages needed for this can be
statically calculated, for 4-level paging we'd need 2 pages for the
fixed mapping, 12 pages for the other three, and one PGD page.

Thoughts?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: RFC x86/boot/64: BOOT_PGT_SIZE definition for compressed kernel
  2020-10-25  0:41 RFC x86/boot/64: BOOT_PGT_SIZE definition for compressed kernel Arvind Sankar
@ 2020-10-25  2:20 ` Arvind Sankar
  2020-10-27 12:40 ` Kirill A. Shutemov
  1 sibling, 0 replies; 6+ messages in thread
From: Arvind Sankar @ 2020-10-25  2:20 UTC (permalink / raw)
  To: Arvind Sankar
  Cc: x86, linux-kernel, Kirill A. Shutemov, Kees Cook, Joerg Roedel

On Sat, Oct 24, 2020 at 08:41:58PM -0400, Arvind Sankar wrote:
> Hi, I think the definition of BOOT_PGT_SIZE in
> arch/x86/include/asm/boot.h is insufficient, especially after
>   ca0e22d4f011 ("x86/boot/compressed/64: Always switch to own page table")
> 
> Currently, it allocates 6 pages if KASLR is disabled, and either 17 or
> 19 pages depending on X86_VERBOSE_BOOTUP if KASLR is enabled.
> 
> - The X86_VERBOSE_BOOTUP test shouldn't be done: that only disables
>   debug messages, but warnings/errors are always output to VGA memory,
>   so the two extra pages for mapping video RAM are always needed.
> 
> - The calculation wasn't updated for X86_5LEVEL, which requires at least
>   one more page for the P4D level, and in theory could require two extra
>   pages for each of the 4 mappings (compressed kernel, output kernel,
>   boot_params and command line), though that would require a system with
>   truly ginormous amounts of RAM.
> 
> - If KASLR is disabled, there are only 6 pages, but now that we're
>   always setting up our own page table, we need 1+(2+2)*3 (one PGD, and
>   two PUD and two PMD pages for kernel, boot_params and command line),
>   and 2 more pages for the video RAM, and more for 5-level. Even for
>   !RELOCATABLE, 13 pages might be needed.
> 
> - SEV-ES needs one more page because it needs to do a PTE-level mapping
>   for the GHCB page.
> 
> - The static calculation is also busted because
>   boot/compressed/{kaslr.c,acpi.c} can scan the setup data, EFI
>   configuration tables and the EFI memmap, and none of these are
>   accounted for. They used to be scanned while still on the
>   firmware/bootloader page tables, but now our page tables have to cover
>   them as well. Trying to add up the worst case for all of these, and
>   anything else the compressed kernel might potentially access seems
>   like a lost cause.
> 
> We could do something similar to what the main kernel does with
> early_dynamic_pgts: map the compressed kernel at a fixed virtual
> address (in negative address space, say); recycle all the other mappings
> until we're done with decompression, and then map the output,
> boot_params and command line. The number of pages needed for this can be
> statically calculated, for 4-level paging we'd need 2 pages for the
> fixed mapping, 12 pages for the other three, and one PGD page.
> 
> Thoughts?

Or just bump BOOT_PGT_SIZE to some largeish number?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: RFC x86/boot/64: BOOT_PGT_SIZE definition for compressed kernel
  2020-10-25  0:41 RFC x86/boot/64: BOOT_PGT_SIZE definition for compressed kernel Arvind Sankar
  2020-10-25  2:20 ` Arvind Sankar
@ 2020-10-27 12:40 ` Kirill A. Shutemov
  2020-10-27 13:16   ` Arvind Sankar
  2020-10-27 15:46   ` Joerg Roedel
  1 sibling, 2 replies; 6+ messages in thread
From: Kirill A. Shutemov @ 2020-10-27 12:40 UTC (permalink / raw)
  To: Arvind Sankar
  Cc: x86, linux-kernel, Kirill A. Shutemov, Kees Cook, Joerg Roedel

On Sat, Oct 24, 2020 at 08:41:58PM -0400, Arvind Sankar wrote:
> Hi, I think the definition of BOOT_PGT_SIZE in
> arch/x86/include/asm/boot.h is insufficient, especially after
>   ca0e22d4f011 ("x86/boot/compressed/64: Always switch to own page table")
> 
> Currently, it allocates 6 pages if KASLR is disabled, and either 17 or
> 19 pages depending on X86_VERBOSE_BOOTUP if KASLR is enabled.
> 
> - The X86_VERBOSE_BOOTUP test shouldn't be done: that only disables
>   debug messages, but warnings/errors are always output to VGA memory,
>   so the two extra pages for mapping video RAM are always needed.
> 
> - The calculation wasn't updated for X86_5LEVEL, which requires at least
>   one more page for the P4D level, and in theory could require two extra
>   pages for each of the 4 mappings (compressed kernel, output kernel,
>   boot_params and command line), though that would require a system with
>   truly ginormous amounts of RAM.

Or sparse physical memory map. I hacked QEMU before for testing 5-level
paging:

https://gist.github.com/kiryl/d45eb54110944ff95e544972d8bdac1d

> - If KASLR is disabled, there are only 6 pages, but now that we're
>   always setting up our own page table, we need 1+(2+2)*3 (one PGD, and
>   two PUD and two PMD pages for kernel, boot_params and command line),
>   and 2 more pages for the video RAM, and more for 5-level. Even for
>   !RELOCATABLE, 13 pages might be needed.

The comment for BOOT_PGT_SIZE has to be updated.

BTW, what happens if we underestimate BOOT_PGT_SIZE? Do we overwrite
something?

> - SEV-ES needs one more page because it needs to do a PTE-level mapping
>   for the GHCB page.
> 
> - The static calculation is also busted because
>   boot/compressed/{kaslr.c,acpi.c} can scan the setup data, EFI
>   configuration tables and the EFI memmap, and none of these are
>   accounted for. They used to be scanned while still on the
>   firmware/bootloader page tables, but now our page tables have to cover
>   them as well. Trying to add up the worst case for all of these, and
>   anything else the compressed kernel might potentially access seems
>   like a lost cause.
> 
> We could do something similar to what the main kernel does with
> early_dynamic_pgts: map the compressed kernel at a fixed virtual
> address (in negative address space, say); recycle all the other mappings
> until we're done with decompression, and then map the output,
> boot_params and command line. The number of pages needed for this can be
> statically calculated, for 4-level paging we'd need 2 pages for the
> fixed mapping, 12 pages for the other three, and one PGD page.

Recycling idea look promising to me, but it would require handling #PF in
decompression code, right? It is considerable complication of the code.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: RFC x86/boot/64: BOOT_PGT_SIZE definition for compressed kernel
  2020-10-27 12:40 ` Kirill A. Shutemov
@ 2020-10-27 13:16   ` Arvind Sankar
  2020-10-27 14:31     ` Kirill A. Shutemov
  2020-10-27 15:46   ` Joerg Roedel
  1 sibling, 1 reply; 6+ messages in thread
From: Arvind Sankar @ 2020-10-27 13:16 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Arvind Sankar, x86, linux-kernel, Kirill A. Shutemov, Kees Cook,
	Joerg Roedel

On Tue, Oct 27, 2020 at 03:40:07PM +0300, Kirill A. Shutemov wrote:
> On Sat, Oct 24, 2020 at 08:41:58PM -0400, Arvind Sankar wrote:
> > Hi, I think the definition of BOOT_PGT_SIZE in
> > arch/x86/include/asm/boot.h is insufficient, especially after
> >   ca0e22d4f011 ("x86/boot/compressed/64: Always switch to own page table")
> > 
> > Currently, it allocates 6 pages if KASLR is disabled, and either 17 or
> > 19 pages depending on X86_VERBOSE_BOOTUP if KASLR is enabled.
> > 
> > - The X86_VERBOSE_BOOTUP test shouldn't be done: that only disables
> >   debug messages, but warnings/errors are always output to VGA memory,
> >   so the two extra pages for mapping video RAM are always needed.
> > 
> > - The calculation wasn't updated for X86_5LEVEL, which requires at least
> >   one more page for the P4D level, and in theory could require two extra
> >   pages for each of the 4 mappings (compressed kernel, output kernel,
> >   boot_params and command line), though that would require a system with
> >   truly ginormous amounts of RAM.
> 
> Or sparse physical memory map. I hacked QEMU before for testing 5-level
> paging:
> 
> https://gist.github.com/kiryl/d45eb54110944ff95e544972d8bdac1d
> 
> > - If KASLR is disabled, there are only 6 pages, but now that we're
> >   always setting up our own page table, we need 1+(2+2)*3 (one PGD, and
> >   two PUD and two PMD pages for kernel, boot_params and command line),
> >   and 2 more pages for the video RAM, and more for 5-level. Even for
> >   !RELOCATABLE, 13 pages might be needed.
> 
> The comment for BOOT_PGT_SIZE has to be updated.
> 
> BTW, what happens if we underestimate BOOT_PGT_SIZE? Do we overwrite
> something?

No, it checks whether it ran out of pages, so it will just error out and
hang.

> 
> > - SEV-ES needs one more page because it needs to do a PTE-level mapping
> >   for the GHCB page.
> > 
> > - The static calculation is also busted because
> >   boot/compressed/{kaslr.c,acpi.c} can scan the setup data, EFI
> >   configuration tables and the EFI memmap, and none of these are
> >   accounted for. They used to be scanned while still on the
> >   firmware/bootloader page tables, but now our page tables have to cover
> >   them as well. Trying to add up the worst case for all of these, and
> >   anything else the compressed kernel might potentially access seems
> >   like a lost cause.
> > 
> > We could do something similar to what the main kernel does with
> > early_dynamic_pgts: map the compressed kernel at a fixed virtual
> > address (in negative address space, say); recycle all the other mappings
> > until we're done with decompression, and then map the output,
> > boot_params and command line. The number of pages needed for this can be
> > statically calculated, for 4-level paging we'd need 2 pages for the
> > fixed mapping, 12 pages for the other three, and one PGD page.
> 
> Recycling idea look promising to me, but it would require handling #PF in
> decompression code, right? It is considerable complication of the code.
> 

The #PF handler is already there now with the SEV-ES series, but I agree
it would still complicate things. It's simpler to just increase
BOOT_PGT_SIZE and make it unconditional (i.e. bump it to say 32 or 64
even if !KASLR). It's @nobits anyway so it would not increase the size
of the bzImage, just require a slightly larger memory allocation by the
bootloader.

Another alternative is reusing the KASLR code, which contains a memory
allocator, and use it to find system memory for the page tables, but
that also seems like an over-engineered approach.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: RFC x86/boot/64: BOOT_PGT_SIZE definition for compressed kernel
  2020-10-27 13:16   ` Arvind Sankar
@ 2020-10-27 14:31     ` Kirill A. Shutemov
  0 siblings, 0 replies; 6+ messages in thread
From: Kirill A. Shutemov @ 2020-10-27 14:31 UTC (permalink / raw)
  To: Arvind Sankar
  Cc: x86, linux-kernel, Kirill A. Shutemov, Kees Cook, Joerg Roedel

On Tue, Oct 27, 2020 at 09:16:17AM -0400, Arvind Sankar wrote:
> The #PF handler is already there now with the SEV-ES series, but I agree
> it would still complicate things. It's simpler to just increase
> BOOT_PGT_SIZE and make it unconditional (i.e. bump it to say 32 or 64
> even if !KASLR). It's @nobits anyway so it would not increase the size
> of the bzImage, just require a slightly larger memory allocation by the
> bootloader.

I guess it's fine. But I'm worried we leave the picture in the same
fragile state. We still rely on the magic number.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: RFC x86/boot/64: BOOT_PGT_SIZE definition for compressed kernel
  2020-10-27 12:40 ` Kirill A. Shutemov
  2020-10-27 13:16   ` Arvind Sankar
@ 2020-10-27 15:46   ` Joerg Roedel
  1 sibling, 0 replies; 6+ messages in thread
From: Joerg Roedel @ 2020-10-27 15:46 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Arvind Sankar, x86, linux-kernel, Kirill A. Shutemov, Kees Cook

On Tue, Oct 27, 2020 at 03:40:07PM +0300, Kirill A. Shutemov wrote:
> BTW, what happens if we underestimate BOOT_PGT_SIZE? Do we overwrite
> something?

The boot code will print an error and stop the machine when allocating a
page-table page fails.

I also think that bumping BOOT_PGT_SIZE up to have more pages available
is a good short-term solution. Recycling pages will also need to take
page encryption attributes into account, as for SEV-ES the GHCB page
needs to be mapped unencrypted.

Another option to safe some memory is to make use of GB pages in the
decompression code. Machines where the current BOOT_PGT_SIZE is too
small will likely support GB pages too.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2020-10-27 17:44 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-25  0:41 RFC x86/boot/64: BOOT_PGT_SIZE definition for compressed kernel Arvind Sankar
2020-10-25  2:20 ` Arvind Sankar
2020-10-27 12:40 ` Kirill A. Shutemov
2020-10-27 13:16   ` Arvind Sankar
2020-10-27 14:31     ` Kirill A. Shutemov
2020-10-27 15:46   ` Joerg Roedel

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.