Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

From: Borislav Petkov <bp@alien8.de>
To: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>,
	pbonzini@redhat.com, ebiggers@kernel.org, x86@kernel.org,
	linux-kernel@vger.kernel.org, qemu-devel@nongnu.org,
	ardb@kernel.org, kraxel@redhat.com, philmd@linaro.org
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data
Date: Fri, 30 Dec 2022 20:54:11 +0100
Message-ID: <Y69B40T9kWfxZpmf@zn.tnic>
In-Reply-To: <CAHmME9oPUJemVRvO3HX0q4BJGTFuzbLYANeizuRcNq2=Ykk1Gg@mail.gmail.com>

On Fri, Dec 30, 2022 at 06:07:24PM +0100, Jason A. Donenfeld wrote:
> Look closer at the boot process. The compressed image is initially at
> 0x100000, but it gets relocated to a safer area at the end of
> startup_64:

That is the address we're executing from here; rip looks like 0x100xxx.

> /*
>  * Copy the compressed kernel to the end of our buffer
>  * where decompression in place becomes safe.
>  */
>         pushq   %rsi
>         leaq    (_bss-8)(%rip), %rsi
>         leaq    rva(_bss-8)(%rbx), %rdi

when you get to here, it looks something like this:

        leaq    (_bss-8)(%rip), %rsi		# 0x9e7ff8
        leaq    rva(_bss-8)(%rbx), %rdi		# 0xc6eeff8

so the source address is that _bss thing and we copy...

>         movl    $(_bss - startup_32), %ecx
>         shrl    $3, %ecx
>         std

... backwards since DF=1.

Up to:

# rsi = 0xffff8
# rdi = 0xbe06ff8

Ok, so the lowest source address is 0x100000. Good.
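
To double-check the pointer values above, here is a small sketch of what
"std; rep movsq" does to %rsi and %rdi. The four addresses are the ones quoted
in this mail; the byte count is not quoted anywhere and is only inferred from
the two %rsi values, so treat it as an assumption.

```python
# Model of "std; rep movsq": with DF=1 each iteration copies one qword
# and then decrements both %rsi and %rdi by 8.
QWORD = 8

def pointers_after_backward_copy(rsi, rdi, qwords):
    """Return (rsi, rdi) once `qwords` qwords have been copied with DF=1."""
    return rsi - QWORD * qwords, rdi - QWORD * qwords

# Values quoted in the mail.
rsi_start, rdi_start = 0x9e7ff8, 0xc6eeff8
rsi_end_seen, rdi_end_seen = 0xffff8, 0xbe06ff8

# Assumption: the copied byte count is inferred from the two %rsi values.
copied_bytes = rsi_start - rsi_end_seen
qwords = copied_bytes // QWORD

rsi_end, rdi_end = pointers_after_backward_copy(rsi_start, rdi_start, qwords)

# The last qword moved sits one qword above the final pointer values.
lowest_src = rsi_end + QWORD
lowest_dst = rdi_end + QWORD

print(hex(copied_bytes), hex(lowest_src), hex(lowest_dst))
```

If the inferred count is right, the computed %rdi should match the observed
one, and the lowest source qword is the one at 0x100000.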

> HOWEVER, qemu currently appends setup_data to the end of the
> compressed kernel image,

Yeah, you mean the kernel which starts executing at 0x100000, i.e., that part
which is compressed/head_64.S and which does the above and the relocation etc.

> and this part isn't moved, and setup_data links aren't walked/relocated. So
> that means the original address remains, of 0x100000.

See above: when it starts copying the kernel image backwards to a higher
address, that last byte is at 0x9e7ff8 so I'm guessing qemu has put setup_data
*after* that address. And that doesn't get copied ofc.
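
As a rough illustration of "placed after the image, so not copied", the sketch
below walks a setup_data list and flags the nodes which fall outside the range
moved by the backward copy. The header layout (u64 next, u32 type, u32 len)
follows the boot protocol; the node addresses, types and lengths are made up,
since I'm only guessing where qemu put them.

```python
import struct

# Header of struct setup_data: u64 next, u32 type, u32 len,
# followed by `len` bytes of payload.
HEADER = struct.Struct("<QII")

# Range moved by the backward copy, from the addresses in the mail.
COPY_START, COPY_END = 0x100000, 0x9e8000

def walk_setup_data(read, head):
    """Yield (addr, type, len) per node; `read(addr, n)` returns n bytes."""
    addr = head
    while addr:
        nxt, typ, length = HEADER.unpack(read(addr, HEADER.size))
        yield addr, typ, length
        addr = nxt

def was_copied(addr, size):
    """True if [addr, addr + size) lies fully inside the copied range."""
    return COPY_START <= addr and addr + size <= COPY_END

# Hypothetical layout: two nodes placed right after the image end.
# Addresses, types and lengths are invented for illustration only.
nodes = {
    0x9e8000: HEADER.pack(0x9e8040, 9, 48),
    0x9e8040: HEADER.pack(0, 2, 16),
}

def read(addr, n):
    return nodes[addr][:n]

visited = list(walk_setup_data(read, 0x9e8000))
left_behind = [addr for addr, _, length in visited
               if not was_copied(addr, HEADER.size + length)]
print([hex(a) for a in left_behind])
```

With that made-up layout every node ends up in left_behind, i.e., it stays at
its original address while the image moves.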

So far, so good.

Now later, we extract the compressed kernel created with the mkpiggy magic:

input_data:
.incbin "arch/x86/boot/compressed/vmlinux.bin.gz"
input_data_end:

by doing

/*
 * Do the extraction, and jump to the new kernel..
 */

        pushq   %rsi                    /* Save the real mode argument */	# 0x13d00
        movq    %rsi, %rdi              /* real mode address */			# 0x13d00
        leaq    boot_heap(%rip), %rsi   /* malloc area for uncompression */	# 0xc6ef000
        leaq    input_data(%rip), %rdx  /* input_data */			# 0xbe073a8
        movl    input_len(%rip), %ecx   /* input_len */				# 0x8cfe13
        movq    %rbp, %r8               /* output target address */		# 0x1000000
        movl    output_len(%rip), %r9d  /* decompressed length, end of relocs */
        call    extract_kernel          /* returns kernel location in %rax */
        popq    %rsi

(actual addresses at the end.)
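
A quick sanity check on those numbers, as a sketch: where does the compressed
payload end, and does it sit inside the relocated image? The bounds of the
relocated image are not printed above; they are derived here from the two
%rdi values of the backward copy, so that part is an assumption.

```python
def span(start, length):
    """Half-open address range [start, start + length)."""
    return (start, start + length)

def contains(outer, inner):
    """True if `inner` lies fully inside `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

# Register values quoted in the mail.
input_data, input_len = 0xbe073a8, 0x8cfe13
boot_heap = 0xc6ef000
output = 0x1000000

payload = span(input_data, input_len)

# Assumption: relocated image bounds derived from the %rdi values of the
# backward copy (first qword written at 0xc6eeff8, last at 0xbe07000).
relocated_image = (0xbe06ff8 + 8, 0xc6eeff8 + 8)

payload_inside_image = contains(relocated_image, payload)
gap_below_heap = boot_heap - payload[1]
output_below_payload = payload[0] - output

print(hex(payload[1]), payload_inside_image,
      hex(gap_below_heap), hex(output_below_payload))
```

output_len is not shown above, so the sketch deliberately says nothing about
where the decompressed image ends.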

Now, you say you triplefault somewhere in initialize_identity_maps() when
trying to access setup_data. But if you look a couple of lines before that
call, we do

	call load_stage2_idt

which sets up a boot-time #PF handler, do_boot_page_fault(), and that handler
does call kernel_add_identity_map(), so it should *actually* map any unmapped
setup_data addresses.
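
What I'd expect to happen, as a toy model only: a fault on an unmapped address
maps the surrounding region 1:1 and the access is retried. The 2M granularity,
the class and its method names are assumptions of this sketch, not the
kernel's code, and the real handler does more checking than this. The
setup_data address is made up.

```python
PMD_SIZE = 2 * 1024 * 1024      # 2 MiB mapping granularity assumed here
PMD_MASK = ~(PMD_SIZE - 1)

class IdentityMap:
    """Toy model of demand-driven identity mapping (hypothetical names)."""

    def __init__(self):
        self.mapped = set()     # bases of the 2 MiB regions mapped 1:1
        self.faults = 0

    def add_identity_map(self, start, end):
        # Map every 2 MiB region touching [start, end).
        for base in range(start & PMD_MASK, end, PMD_SIZE):
            self.mapped.add(base)

    def page_fault(self, address):
        # Stand-in for the boot #PF handler: map the faulting region.
        self.faults += 1
        start = address & PMD_MASK
        self.add_identity_map(start, start + PMD_SIZE)

    def access(self, address):
        # An access to an unmapped region faults once, then succeeds.
        if (address & PMD_MASK) not in self.mapped:
            self.page_fault(address)
        return address

ident = IdentityMap()
setup_data_addr = 0x9e8000      # made-up address right after the image
ident.access(setup_data_addr)   # first touch: faults and gets mapped
ident.access(setup_data_addr)   # second touch: already mapped
print(ident.faults, [hex(b) for b in sorted(ident.mapped)])
```

In this model the first access costs one fault and nothing worse.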

So why doesn't it do that and why do you triplefault?

Hmmm.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette