Linux-EFI Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH] x86: setup: extend low identity map to cover whole kernel range
@ 2015-10-14 11:30 Paolo Bonzini
  2015-10-14 13:52 ` Matt Fleming
  0 siblings, 1 reply; 9+ messages in thread
From: Paolo Bonzini @ 2015-10-14 11:30 UTC (permalink / raw)
  To: linux-kernel; +Cc: hpa, x86, stable, lersek, matt.fleming, bp, linux-efi

On 32-bit systems, the initial_page_table is reused by
efi_call_phys_prolog as an identity map to call
SetVirtualAddressMap.  efi_call_phys_prolog takes care of
converting the current CPU's GDT to a physical address too.

For PAE kernels the identity mapping is achieved by aliasing the
first PDPE for the kernel memory mapping into the first PDPE
of initial_page_table.  This makes the EFI stub's trick "just work".

However, for non-PAE kernels there is no guarantee that the identity
mapping in the initial_page_table extends as far as the GDT; in this
case, accesses to the GDT will cause a page fault (which quickly becomes
a triple fault).  Fix this by copying the kernel mappings from
swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at
identity mapping.

For some reason, this is only reproducible with QEMU's dynamic translation
mode, and not for example with KVM.  However, even under KVM one can clearly
see that the page table is bogus:

    $ qemu-system-i386 -pflash OVMF.fd -M q35 vmlinuz0 -s -S -daemonize
    $ gdb
    (gdb) target remote localhost:1234
    (gdb) hb *0x02858f6f
    Hardware assisted breakpoint 1 at 0x2858f6f
    (gdb) c
    Continuing.

    Breakpoint 1, 0x02858f6f in ?? ()
    (gdb) monitor info registers
    ...
    GDT=     0724e000 000000ff
    IDT=     fffbb000 000007ff
    CR0=0005003b CR2=ff896000 CR3=032b7000 CR4=00000690
    ...

The page directory is sane:

    (gdb) x/4wx 0x32b7000
    0x32b7000:	0x03398063	0x03399063	0x0339a063	0x0339b063
    (gdb) x/4wx 0x3398000
    0x3398000:	0x00000163	0x00001163	0x00002163	0x00003163
    (gdb) x/4wx 0x3399000
    0x3399000:	0x00400003	0x00401003	0x00402003	0x00403003

but our particular page directory entry is empty:

    (gdb) x/1wx 0x32b7000 + (0x724e000 >> 22) * 4
    0x32b7070:	0x00000000

Reported-by: Laszlo Ersek <lersek@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kernel/setup.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 80f874bf999e..24154bd12307 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1198,6 +1198,14 @@ void __init setup_arch(char **cmdline_p)
 	clone_pgd_range(initial_page_table + KERNEL_PGD_BOUNDARY,
 			swapper_pg_dir     + KERNEL_PGD_BOUNDARY,
 			KERNEL_PGD_PTRS);
+
+	/*
+	 * sync back low identity map too.  It is used for example
+	 * in the 32-bit EFI stub.
+	 */
+	clone_pgd_range(initial_page_table,
+			swapper_pg_dir     + KERNEL_PGD_BOUNDARY,
+			KERNEL_PGD_PTRS);
 #endif
 
 	tboot_probe();
-- 
2.5.0

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] x86: setup: extend low identity map to cover whole kernel range
  2015-10-14 11:30 [PATCH] x86: setup: extend low identity map to cover whole kernel range Paolo Bonzini
@ 2015-10-14 13:52 ` Matt Fleming
  2015-10-14 14:29   ` Paolo Bonzini
       [not found]   ` <20151014135211.GB2782-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org>
  0 siblings, 2 replies; 9+ messages in thread
From: Matt Fleming @ 2015-10-14 13:52 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, hpa, x86, stable, lersek, matt.fleming, bp,
	linux-efi, Andy Lutomirski

(Pulling in luto for low-level x86 fu)

On Wed, 14 Oct, at 01:30:45PM, Paolo Bonzini wrote:
> On 32-bit systems, the initial_page_table is reused by
> efi_call_phys_prolog as an identity map to call
> SetVirtualAddressMap.  efi_call_phys_prolog takes care of
> converting the current CPU's GDT to a physical address too.
> 
> For PAE kernels the identity mapping is achieved by aliasing the
> first PDPE for the kernel memory mapping into the first PDPE
> of initial_page_table.  This makes the EFI stub's trick "just work".
> 
> However, for non-PAE kernels there is no guarantee that the identity
> mapping in the initial_page_table extends as far as the GDT; in this
> case, accesses to the GDT will cause a page fault (which quickly becomes
> a triple fault).  Fix this by copying the kernel mappings from
> swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at
> identity mapping.
 
Oops, good catch guys. This is clearly a bug, but...

> For some reason, this is only reproducible with QEMU's dynamic translation
> mode, and not for example with KVM.  However, even under KVM one can clearly
> see that the page table is bogus:
> 
>     $ qemu-system-i386 -pflash OVMF.fd -M q35 vmlinuz0 -s -S -daemonize
>     $ gdb
>     (gdb) target remote localhost:1234
>     (gdb) hb *0x02858f6f
>     Hardware assisted breakpoint 1 at 0x2858f6f
>     (gdb) c
>     Continuing.
> 
>     Breakpoint 1, 0x02858f6f in ?? ()
>     (gdb) monitor info registers
>     ...
>     GDT=     0724e000 000000ff
>     IDT=     fffbb000 000007ff
>     CR0=0005003b CR2=ff896000 CR3=032b7000 CR4=00000690
>     ...
> 
> The page directory is sane:
> 
>     (gdb) x/4wx 0x32b7000
>     0x32b7000:	0x03398063	0x03399063	0x0339a063	0x0339b063
>     (gdb) x/4wx 0x3398000
>     0x3398000:	0x00000163	0x00001163	0x00002163	0x00003163
>     (gdb) x/4wx 0x3399000
>     0x3399000:	0x00400003	0x00401003	0x00402003	0x00403003
> 
> but our particular page directory entry is empty:
> 
>     (gdb) x/1wx 0x32b7000 + (0x724e000 >> 22) * 4
>     0x32b7070:	0x00000000
 
... I'm a little surprised you managed to trigger this at all, because
the GDT we load in efi_call_phys_prolog() is part of the per-cpu data
section and therefore part of the kernel image.

The kernel image *is* mapped in the identity range, even for non-PAE
kernels.

So yes, you're right this is a bug but I'm not sure how you're
actually triggering it.

> Reported-by: Laszlo Ersek <lersek@redhat.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/kernel/setup.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 80f874bf999e..24154bd12307 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1198,6 +1198,14 @@ void __init setup_arch(char **cmdline_p)
>  	clone_pgd_range(initial_page_table + KERNEL_PGD_BOUNDARY,
>  			swapper_pg_dir     + KERNEL_PGD_BOUNDARY,
>  			KERNEL_PGD_PTRS);
> +
> +	/*
> +	 * sync back low identity map too.  It is used for example
> +	 * in the 32-bit EFI stub.
> +	 */
> +	clone_pgd_range(initial_page_table,
> +			swapper_pg_dir     + KERNEL_PGD_BOUNDARY,
> +			KERNEL_PGD_PTRS);
>  #endif
>  
>  	tboot_probe();
> -- 
> 2.5.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-efi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Matt Fleming, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] x86: setup: extend low identity map to cover whole kernel range
  2015-10-14 13:52 ` Matt Fleming
@ 2015-10-14 14:29   ` Paolo Bonzini
  2015-10-14 21:04     ` Matt Fleming
       [not found]   ` <20151014135211.GB2782-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org>
  1 sibling, 1 reply; 9+ messages in thread
From: Paolo Bonzini @ 2015-10-14 14:29 UTC (permalink / raw)
  To: Matt Fleming
  Cc: linux-kernel, hpa, x86, stable, lersek, matt.fleming, bp,
	linux-efi, Andy Lutomirski



On 14/10/2015 15:52, Matt Fleming wrote:
>> > However, for non-PAE kernels there is no guarantee that the identity
>> > mapping in the initial_page_table extends as far as the GDT; in this
>> > case, accesses to the GDT will cause a page fault (which quickly becomes
>> > a triple fault).  Fix this by copying the kernel mappings from
>> > swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at
>> > identity mapping.
>  
> Oops, good catch guys. This is clearly a bug, but...
> 
> ... I'm a little surprised you managed to trigger this at all, because
> the GDT we load in efi_call_phys_prolog() is part of the per-cpu data
> section and therefore part of the kernel image.

Only until setup_percpu, which is earlier than SetVirtualAddressMap.
For example, I get:

  setup_percpu: NR_CPUS:8 nr_cpumask_bits:8 nr_cpu_ids:1 nr_node_ids:1
  PERCPU: Embedded 18 pages/cpu @c728e000 s41800 r0 d31928 u73728
                                  ^^^^^^^
but the kernel image ends at 0x037fffff.

The GDT is 0xc728e000 in this run, so the GDT is at the beginning of the
relocated percpu area.

In the above run, the FS base that switch_to_new_gdt loads is 0x551C000.
 You have 0x728E000 - 0x551C000 = 0x1D72000, and from tracing I see that
one of the GDT values that is loaded very early is exactly 0xC1D72000.
That _is_ inside the kernel image of course when you remove PAGE_OFFSET.

Paolo

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] x86: setup: extend low identity map to cover whole kernel range
       [not found]   ` <20151014135211.GB2782-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org>
@ 2015-10-14 16:22     ` Andy Lutomirski
  2015-10-14 21:00       ` Matt Fleming
  0 siblings, 1 reply; 9+ messages in thread
From: Andy Lutomirski @ 2015-10-14 16:22 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Paolo Bonzini, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	H. Peter Anvin, X86 ML, stable, Laszlo Ersek, Matt Fleming,
	Borislav Petkov, linux-efi-u79uwXL29TY76Z2rM5mHXA

On Wed, Oct 14, 2015 at 6:52 AM, Matt Fleming <matt-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org> wrote:
> (Pulling in luto for low-level x86 fu)
>
> On Wed, 14 Oct, at 01:30:45PM, Paolo Bonzini wrote:
>> On 32-bit systems, the initial_page_table is reused by
>> efi_call_phys_prolog as an identity map to call
>> SetVirtualAddressMap.  efi_call_phys_prolog takes care of
>> converting the current CPU's GDT to a physical address too.
>>
>> For PAE kernels the identity mapping is achieved by aliasing the
>> first PDPE for the kernel memory mapping into the first PDPE
>> of initial_page_table.  This makes the EFI stub's trick "just work".
>>
>> However, for non-PAE kernels there is no guarantee that the identity
>> mapping in the initial_page_table extends as far as the GDT; in this
>> case, accesses to the GDT will cause a page fault (which quickly becomes
>> a triple fault).  Fix this by copying the kernel mappings from
>> swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at
>> identity mapping.
>
> Oops, good catch guys. This is clearly a bug, but...
>
>> For some reason, this is only reproducible with QEMU's dynamic translation
>> mode, and not for example with KVM.  However, even under KVM one can clearly
>> see that the page table is bogus:

I haven't looked at the code, but it wouldn't surprise me if this is
some kind of TLB issue.  With the hardware TLB (which is in use on
KVM), it seems quite likely that the GDT is pretty much always in the
TLB and, if nothing flushes global mappings, then it'll probably stick
around.

--Andy

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] x86: setup: extend low identity map to cover whole kernel range
  2015-10-14 16:22     ` Andy Lutomirski
@ 2015-10-14 21:00       ` Matt Fleming
       [not found]         ` <20151014210050.GE2782-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Matt Fleming @ 2015-10-14 21:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Paolo Bonzini, linux-kernel, H. Peter Anvin, X86 ML, stable,
	Laszlo Ersek, Matt Fleming, Borislav Petkov, linux-efi

On Wed, 14 Oct, at 09:22:03AM, Andy Lutomirski wrote:
> On Wed, Oct 14, 2015 at 6:52 AM, Matt Fleming <matt@codeblueprint.co.uk> wrote:
> > (Pulling in luto for low-level x86 fu)
> >
> > On Wed, 14 Oct, at 01:30:45PM, Paolo Bonzini wrote:
> >> On 32-bit systems, the initial_page_table is reused by
> >> efi_call_phys_prolog as an identity map to call
> >> SetVirtualAddressMap.  efi_call_phys_prolog takes care of
> >> converting the current CPU's GDT to a physical address too.
> >>
> >> For PAE kernels the identity mapping is achieved by aliasing the
> >> first PDPE for the kernel memory mapping into the first PDPE
> >> of initial_page_table.  This makes the EFI stub's trick "just work".
> >>
> >> However, for non-PAE kernels there is no guarantee that the identity
> >> mapping in the initial_page_table extends as far as the GDT; in this
> >> case, accesses to the GDT will cause a page fault (which quickly becomes
> >> a triple fault).  Fix this by copying the kernel mappings from
> >> swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at
> >> identity mapping.
> >
> > Oops, good catch guys. This is clearly a bug, but...
> >
> >> For some reason, this is only reproducible with QEMU's dynamic translation
> >> mode, and not for example with KVM.  However, even under KVM one can clearly
> >> see that the page table is bogus:
> 
> I haven't looked at the code, but it wouldn't surprise me if this is
> some kind of TLB issue.  With the hardware TLB (which is in use on
> KVM), it seems quite likely that the GDT is pretty much always in the
> TLB and, if nothing flushes global mappings, then it'll probably stick
> around.

>From some quick experiments it appears that you can skate past this
issue if you don't receive any interrupts while the bogus GDT pointer
is loaded, or if you avoid reloading the segment registers in general.
Which is interesting because I assumed that writing to GDTR took
immediate effect.

Up until commit 23a0d4e8fa6d ("efi: Disable interrupts around EFI
calls, not in the epilog/prolog calls") interrupts were disabled
around the prolog and epilog calls, and the functional GDT was
re-installed before interrupts were re-enabled. 

That does explain why no one has complained about this issue before.

-- 
Matt Fleming, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] x86: setup: extend low identity map to cover whole kernel range
  2015-10-14 14:29   ` Paolo Bonzini
@ 2015-10-14 21:04     ` Matt Fleming
  0 siblings, 0 replies; 9+ messages in thread
From: Matt Fleming @ 2015-10-14 21:04 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, hpa, x86, stable, lersek, matt.fleming, bp,
	linux-efi, Andy Lutomirski

On Wed, 14 Oct, at 04:29:33PM, Paolo Bonzini wrote:
> 
> On 14/10/2015 15:52, Matt Fleming wrote:
> >> > However, for non-PAE kernels there is no guarantee that the identity
> >> > mapping in the initial_page_table extends as far as the GDT; in this
> >> > case, accesses to the GDT will cause a page fault (which quickly becomes
> >> > a triple fault).  Fix this by copying the kernel mappings from
> >> > swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at
> >> > identity mapping.
> >  
> > Oops, good catch guys. This is clearly a bug, but...
> > 
> > ... I'm a little surprised you managed to trigger this at all, because
> > the GDT we load in efi_call_phys_prolog() is part of the per-cpu data
> > section and therefore part of the kernel image.
> 
> Only until setup_percpu, which is earlier than SetVirtualAddressMap.
> For example, I get:
> 
>   setup_percpu: NR_CPUS:8 nr_cpumask_bits:8 nr_cpu_ids:1 nr_node_ids:1
>   PERCPU: Embedded 18 pages/cpu @c728e000 s41800 r0 d31928 u73728
>                                   ^^^^^^^
> but the kernel image ends at 0x037fffff.
>
> The GDT is 0xc728e000 in this run, so the GDT is at the beginning of the
> relocated percpu area.

Ah, good point. I completely missed that the percpu sections get
relocated.

-- 
Matt Fleming, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] x86: setup: extend low identity map to cover whole kernel range
       [not found]         ` <20151014210050.GE2782-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org>
@ 2015-10-14 21:39           ` Andy Lutomirski
  2015-10-15  9:45             ` Matt Fleming
       [not found]             ` <CALCETrU=YL8yWpp29xO0N7TEVogX1j5Fyk5M_FpJTa9ZOS21Zw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 2 replies; 9+ messages in thread
From: Andy Lutomirski @ 2015-10-14 21:39 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Paolo Bonzini, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	H. Peter Anvin, X86 ML, stable, Laszlo Ersek, Matt Fleming,
	Borislav Petkov, linux-efi-u79uwXL29TY76Z2rM5mHXA

On Wed, Oct 14, 2015 at 2:00 PM, Matt Fleming <matt-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org> wrote:
> On Wed, 14 Oct, at 09:22:03AM, Andy Lutomirski wrote:
>> On Wed, Oct 14, 2015 at 6:52 AM, Matt Fleming <matt-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org> wrote:
>> > (Pulling in luto for low-level x86 fu)
>> >
>> > On Wed, 14 Oct, at 01:30:45PM, Paolo Bonzini wrote:
>> >> On 32-bit systems, the initial_page_table is reused by
>> >> efi_call_phys_prolog as an identity map to call
>> >> SetVirtualAddressMap.  efi_call_phys_prolog takes care of
>> >> converting the current CPU's GDT to a physical address too.
>> >>
>> >> For PAE kernels the identity mapping is achieved by aliasing the
>> >> first PDPE for the kernel memory mapping into the first PDPE
>> >> of initial_page_table.  This makes the EFI stub's trick "just work".
>> >>
>> >> However, for non-PAE kernels there is no guarantee that the identity
>> >> mapping in the initial_page_table extends as far as the GDT; in this
>> >> case, accesses to the GDT will cause a page fault (which quickly becomes
>> >> a triple fault).  Fix this by copying the kernel mappings from
>> >> swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at
>> >> identity mapping.
>> >
>> > Oops, good catch guys. This is clearly a bug, but...
>> >
>> >> For some reason, this is only reproducible with QEMU's dynamic translation
>> >> mode, and not for example with KVM.  However, even under KVM one can clearly
>> >> see that the page table is bogus:
>>
>> I haven't looked at the code, but it wouldn't surprise me if this is
>> some kind of TLB issue.  With the hardware TLB (which is in use on
>> KVM), it seems quite likely that the GDT is pretty much always in the
>> TLB and, if nothing flushes global mappings, then it'll probably stick
>> around.
>
> From some quick experiments it appears that you can skate past this
> issue if you don't receive any interrupts while the bogus GDT pointer
> is loaded, or if you avoid reloading the segment registers in general.
> Which is interesting because I assumed that writing to GDTR took
> immediate effect.

Trivia for your amusement:

AFAICT it's entirely permissible for the GDTR and/or LDT descriptor to
point to unmapped memory.  Any attempt to use them (segment loads,
interrupts, IRET, etc) will try to access that memory as if the access
came from CPL 0 and, if the access fails, will generate a valid page
fault with CR2 pointing into the GDT or LDT.

Xen is nuts^Wclever and actually uses this.

Of course, if your #PF vector references a GDT or LDT descriptor and
trying to load that descriptor results in a page fault, you get a
double fault.

I learned this while trying to puzzle out why v1 of my LDT
synchronization patch caused random faults on Xen.

--Andy

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] x86: setup: extend low identity map to cover whole kernel range
  2015-10-14 21:39           ` Andy Lutomirski
@ 2015-10-15  9:45             ` Matt Fleming
       [not found]             ` <CALCETrU=YL8yWpp29xO0N7TEVogX1j5Fyk5M_FpJTa9ZOS21Zw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 0 replies; 9+ messages in thread
From: Matt Fleming @ 2015-10-15  9:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Paolo Bonzini, linux-kernel, H. Peter Anvin, X86 ML, stable,
	Laszlo Ersek, Matt Fleming, Borislav Petkov, linux-efi

On Wed, 14 Oct, at 02:39:58PM, Andy Lutomirski wrote:
> 
> Trivia for your amusement:
> 
> AFAICT it's entirely permissible for the GDTR and/or LDT descriptor to
> point to unmapped memory.  Any attempt to use them (segment loads,
> interrupts, IRET, etc) will try to access that memory as if the access
> came from CPL 0 and, if the access fails, will generate a valid page
> fault with CR2 pointing into the GDT or LDT.
> 
> Xen is nuts^Wclever and actually uses this.
> 
> Of course, if your #PF vector references a GDT or LDT descriptor and
> trying to load that descriptor results in a page fault, you get a
> double fault.
> 
> I learned this while trying to puzzle out why v1 of my LDT
> synchronization patch caused random faults on Xen.

Ha, interesting! Thanks Andy, it's good to get confirmation.

OK, I think we understand this issue well enough to call this fixed.
I've queued the following patch up in my urgent tree and I'll send a
pull request to the tip folks tomorrow.

---

>From 72ac9f242fad7c3bf3e06461e34c9a546d8cb411 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini@redhat.com>
Date: Wed, 14 Oct 2015 13:30:45 +0200
Subject: [PATCH] x86/setup: Extend low identity map to cover whole kernel
 range

On 32-bit systems, the initial_page_table is reused by
efi_call_phys_prolog as an identity map to call
SetVirtualAddressMap.  efi_call_phys_prolog takes care of
converting the current CPU's GDT to a physical address too.

For PAE kernels the identity mapping is achieved by aliasing the
first PDPE for the kernel memory mapping into the first PDPE
of initial_page_table.  This makes the EFI stub's trick "just work".

However, for non-PAE kernels there is no guarantee that the identity
mapping in the initial_page_table extends as far as the GDT; in this
case, accesses to the GDT will cause a page fault (which quickly becomes
a triple fault).  Fix this by copying the kernel mappings from
swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at
identity mapping.

For some reason, this is only reproducible with QEMU's dynamic translation
mode, and not for example with KVM.  However, even under KVM one can clearly
see that the page table is bogus:

    $ qemu-system-i386 -pflash OVMF.fd -M q35 vmlinuz0 -s -S -daemonize
    $ gdb
    (gdb) target remote localhost:1234
    (gdb) hb *0x02858f6f
    Hardware assisted breakpoint 1 at 0x2858f6f
    (gdb) c
    Continuing.

    Breakpoint 1, 0x02858f6f in ?? ()
    (gdb) monitor info registers
    ...
    GDT=     0724e000 000000ff
    IDT=     fffbb000 000007ff
    CR0=0005003b CR2=ff896000 CR3=032b7000 CR4=00000690
    ...

The page directory is sane:

    (gdb) x/4wx 0x32b7000
    0x32b7000:	0x03398063	0x03399063	0x0339a063	0x0339b063
    (gdb) x/4wx 0x3398000
    0x3398000:	0x00000163	0x00001163	0x00002163	0x00003163
    (gdb) x/4wx 0x3399000
    0x3399000:	0x00400003	0x00401003	0x00402003	0x00403003

but our particular page directory entry is empty:

    (gdb) x/1wx 0x32b7000 + (0x724e000 >> 22) * 4
    0x32b7070:	0x00000000

[ It appears that you can skate past this issue if you don't receive
  any interrupts while the bogus GDT pointer is loaded, or if you avoid
  reloading the segment registers in general.

  Andy Lutomirski provides some additional insight:

   "AFAICT it's entirely permissible for the GDTR and/or LDT
    descriptor to point to unmapped memory.  Any attempt to use them
    (segment loads, interrupts, IRET, etc) will try to access that memory
    as if the access came from CPL 0 and, if the access fails, will
    generate a valid page fault with CR2 pointing into the GDT or
    LDT."

  Up until commit 23a0d4e8fa6d ("efi: Disable interrupts around EFI
  calls, not in the epilog/prolog calls") interrupts were disabled
  around the prolog and epilog calls, and the functional GDT was
  re-installed before interrupts were re-enabled.

  Which explains why no one has hit this issue until now. ]

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reported-by: Laszlo Ersek <lersek@redhat.com>
Cc: <stable@vger.kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
[ Updated changelog. ]
---
 arch/x86/kernel/setup.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index fdb7f2a2d328..a3cccbfc5f77 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1173,6 +1173,14 @@ void __init setup_arch(char **cmdline_p)
 	clone_pgd_range(initial_page_table + KERNEL_PGD_BOUNDARY,
 			swapper_pg_dir     + KERNEL_PGD_BOUNDARY,
 			KERNEL_PGD_PTRS);
+
+	/*
+	 * sync back low identity map too.  It is used for example
+	 * in the 32-bit EFI stub.
+	 */
+	clone_pgd_range(initial_page_table,
+			swapper_pg_dir     + KERNEL_PGD_BOUNDARY,
+			KERNEL_PGD_PTRS);
 #endif
 
 	tboot_probe();
-- 
2.1.0

-- 
Matt Fleming, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] x86: setup: extend low identity map to cover whole kernel range
       [not found]             ` <CALCETrU=YL8yWpp29xO0N7TEVogX1j5Fyk5M_FpJTa9ZOS21Zw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-10-15 12:18               ` H. Peter Anvin
  0 siblings, 0 replies; 9+ messages in thread
From: H. Peter Anvin @ 2015-10-15 12:18 UTC (permalink / raw)
  To: Andy Lutomirski, Matt Fleming
  Cc: Paolo Bonzini, linux-kernel-u79uwXL29TY76Z2rM5mHXA, X86 ML,
	stable, Laszlo Ersek, Matt Fleming, Borislav Petkov,
	linux-efi-u79uwXL29TY76Z2rM5mHXA

On October 14, 2015 2:39:58 PM PDT, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>On Wed, Oct 14, 2015 at 2:00 PM, Matt Fleming
><matt-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org> wrote:
>> On Wed, 14 Oct, at 09:22:03AM, Andy Lutomirski wrote:
>>> On Wed, Oct 14, 2015 at 6:52 AM, Matt Fleming
><matt-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org> wrote:
>>> > (Pulling in luto for low-level x86 fu)
>>> >
>>> > On Wed, 14 Oct, at 01:30:45PM, Paolo Bonzini wrote:
>>> >> On 32-bit systems, the initial_page_table is reused by
>>> >> efi_call_phys_prolog as an identity map to call
>>> >> SetVirtualAddressMap.  efi_call_phys_prolog takes care of
>>> >> converting the current CPU's GDT to a physical address too.
>>> >>
>>> >> For PAE kernels the identity mapping is achieved by aliasing the
>>> >> first PDPE for the kernel memory mapping into the first PDPE
>>> >> of initial_page_table.  This makes the EFI stub's trick "just
>work".
>>> >>
>>> >> However, for non-PAE kernels there is no guarantee that the
>identity
>>> >> mapping in the initial_page_table extends as far as the GDT; in
>this
>>> >> case, accesses to the GDT will cause a page fault (which quickly
>becomes
>>> >> a triple fault).  Fix this by copying the kernel mappings from
>>> >> swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET
>and at
>>> >> identity mapping.
>>> >
>>> > Oops, good catch guys. This is clearly a bug, but...
>>> >
>>> >> For some reason, this is only reproducible with QEMU's dynamic
>translation
>>> >> mode, and not for example with KVM.  However, even under KVM one
>can clearly
>>> >> see that the page table is bogus:
>>>
>>> I haven't looked at the code, but it wouldn't surprise me if this is
>>> some kind of TLB issue.  With the hardware TLB (which is in use on
>>> KVM), it seems quite likely that the GDT is pretty much always in
>the
>>> TLB and, if nothing flushes global mappings, then it'll probably
>stick
>>> around.
>>
>> From some quick experiments it appears that you can skate past this
>> issue if you don't receive any interrupts while the bogus GDT pointer
>> is loaded, or if you avoid reloading the segment registers in
>general.
>> Which is interesting because I assumed that writing to GDTR took
>> immediate effect.
>
>Trivia for your amusement:
>
>AFAICT it's entirely permissible for the GDTR and/or LDT descriptor to
>point to unmapped memory.  Any attempt to use them (segment loads,
>interrupts, IRET, etc) will try to access that memory as if the access
>came from CPL 0 and, if the access fails, will generate a valid page
>fault with CR2 pointing into the GDT or LDT.
>
>Xen is nuts^Wclever and actually uses this.
>
>Of course, if your #PF vector references a GDT or LDT descriptor and
>trying to load that descriptor results in a page fault, you get a
>double fault.
>
>I learned this while trying to puzzle out why v1 of my LDT
>synchronization patch caused random faults on Xen.
>
>--Andy

There is no "if"... you can't get to an interrupt vector without going through the GDT or LDT.  That being said, the GDT or LDT can be partially mapped.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, back to index

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-14 11:30 [PATCH] x86: setup: extend low identity map to cover whole kernel range Paolo Bonzini
2015-10-14 13:52 ` Matt Fleming
2015-10-14 14:29   ` Paolo Bonzini
2015-10-14 21:04     ` Matt Fleming
     [not found]   ` <20151014135211.GB2782-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org>
2015-10-14 16:22     ` Andy Lutomirski
2015-10-14 21:00       ` Matt Fleming
     [not found]         ` <20151014210050.GE2782-mF/unelCI9GS6iBeEJttW/XRex20P6io@public.gmane.org>
2015-10-14 21:39           ` Andy Lutomirski
2015-10-15  9:45             ` Matt Fleming
     [not found]             ` <CALCETrU=YL8yWpp29xO0N7TEVogX1j5Fyk5M_FpJTa9ZOS21Zw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-10-15 12:18               ` H. Peter Anvin

Linux-EFI Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-efi/0 linux-efi/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-efi linux-efi/ https://lore.kernel.org/linux-efi \
		linux-efi@vger.kernel.org
	public-inbox-index linux-efi

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-efi


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git