linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table
@ 2016-04-27 15:41 Alex Thorlton
  2016-04-27 18:23 ` Alex Thorlton
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Alex Thorlton @ 2016-04-27 15:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Matt Fleming, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	linux-efi, Russ Anderson, Dimitri Sivanich, mike travis,
	Alex Thorlton, Nathan Zimmer

Hey guys,

We've hit an issue that we haven't seen before on recent community
kernels.  Hoping that you can provide some insight.

Late last week I hit this bad paging request in uv_system_init on one of
our small UV systems:

8<---
[    0.355998] UV: Found UV2000/3000 hub
[    0.360088] BUG: unable to handle kernel paging request at ffff8800fb600000
[    0.367882] IP: [<ffffffff81b6e906>] uv_system_init+0x1f8/0x1051
[    0.374604] PGD 1f83067 PUD 0 
[    0.378030] Oops: 0000 [#1] SMP 
[    0.381652] Modules linked in:
[    0.385069] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.6.0-rc3-community-4.6-rc2+ #454
[    0.394003] Hardware name: SGI UV3000/UV3000, BIOS SGI UV 3000 series BIOS 01/15/2015
[    0.402743] task: ffff88086ee18000 ti: ffff88086ee1c000 task.ti: ffff88086ee1c000
[    0.411097] RIP: 0010:[<ffffffff81b6e906>]  [<ffffffff81b6e906>] uv_system_init+0x1f8/0x1051
[    0.420530] RSP: 0000:ffff88086ee1fdb0  EFLAGS: 00010286
[    0.426458] RAX: ffff8800fb600000 RBX: 0000000000000000 RCX: 000000000000c780
[    0.434421] RDX: 00000000fa000000 RSI: 0000000000000246 RDI: 0000000000000246
[    0.442385] RBP: ffff88086ee1fe48 R08: 00000000fffffffe R09: 0000000000000000
[    0.450348] R10: 0000000000000005 R11: 0000000000000146 R12: 000000000000c780
[    0.458312] R13: 000000000000a0f0 R14: 0000000000000037 R15: 0000000000000038
[    0.466277] FS:  0000000000000000(0000) GS:ffff880870c00000(0000) knlGS:0000000000000000
[    0.475307] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.481718] CR2: ffff8800fb600000 CR3: 0000000001a06000 CR4: 00000000001406f0
[    0.489682] Stack:
[    0.491924]  0000000000000037 0000000000000038 ffff88086ee1fdd0 ffffffff810bef29
[    0.500219]  ffff88086ee1fe28 ffffffff8116426d ffff880800000010 ffff88086ee1fe38
[    0.508517]  ffff88086ee1fdf8 ffffffff8116426d 0000000000000002 0000000000000001
[    0.516810] Call Trace:
[    0.519542]  [<ffffffff810bef29>] ? vprintk_default+0x29/0x40
[    0.525957]  [<ffffffff8116426d>] ? printk+0x4d/0x4f
[    0.531497]  [<ffffffff8116426d>] ? printk+0x4d/0x4f
[    0.537041]  [<ffffffff81b6a2d2>] native_smp_prepare_cpus+0x299/0x2e4
[    0.544232]  [<ffffffff81b5b19b>] kernel_init_freeable+0xc3/0x220
[    0.551035]  [<ffffffff815b236e>] kernel_init+0xe/0x110
[    0.556869]  [<ffffffff815be102>] ret_from_fork+0x22/0x40
[    0.562893]  [<ffffffff815b2360>] ? rest_init+0x80/0x80
[    0.568723] Code: 65 48 03 05 1d b8 49 7e 80 78 14 02 ba 00 00 00 fa 76 0b 48 89 c8 65 48 03 05 07 b8 49 7e 48 b8 00 00 60 01 00 88 ff ff 48 09 d0 <48> 8b 00 88 c3 48 c1 e8 06 41 bd 01 00 00 00 88 c1 83 e3 3f 4c 
[    0.590530] RIP  [<ffffffff81b6e906>] uv_system_init+0x1f8/0x1051
[    0.597341]  RSP <ffff88086ee1fdb0>
[    0.601230] CR2: ffff8800fb600000
[    0.604933] ---[ end trace 06918ff61c5d4cab ]---
[    0.610091] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
--->8

A bit of digging will tell us that this is the failing line:

m_n_config.v = uv_read_local_mmr(UVH_RH_GAM_CONFIG_MMR );

A bisect (with a bit of extra digging into a merge commit) led me to
commit 67a9108ed4313 ("x86/efi: Build our own page table structures") as
being the first bad commit.

After a bit of poking around, I found that the part of this change that
is actually breaking things for us is the fact that the EFI memory
regions are no longer being mapped into the trampoline_pgd in
__map_region.

Just as a quick test, I made this small change and it got my UV to boot:

8<---
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 49e4dd4..15bf5df 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -296,6 +296,7 @@ static void __init __map_region(efi_memory_desc_t *md, u64 va)
 {
        unsigned long flags = _PAGE_RW;
        unsigned long pfn;
+       pgd_t *tramp_pgd = (pgd_t *)__va(real_mode_header->trampoline_pgd);
        pgd_t *pgd = efi_pgd;

        if (!(md->attribute & EFI_MEMORY_WB))
@@ -305,6 +306,10 @@ static void __init __map_region(efi_memory_desc_t *md, u64 va)
        if (kernel_map_pages_in_pgd(pgd, pfn, va, md->num_pages, flags))
                pr_warn("Error mapping PA 0x%llx -> VA 0x%llx!\n",
                           md->phys_addr, va);
+
+       if (kernel_map_pages_in_pgd(tramp_pgd, pfn, va, md->num_pages, flags))
+               pr_warn("Error mapping in trampoline_pgd PA 0x%llx -> VA 0x%llx!\n",
+                          md->phys_addr, va);
 }

 void __init efi_map_region(efi_memory_desc_t *md)
--->8

>From what I know, this works because the first PGD entry (the one
containing the identity mappings) of the trampoline_pgd is shared with
the main kernel PGD (init_level4_pgt), so when __map_region maps this
stuff into the trampoline_pgd, we get it in the kernel PGD as well.

The way I understand it, the motivation behind this change is to isolate
the EFI runtime services code/data from the regular kernel page tables,
but it appears that we've isolated a bit more than that.  I do
understand that if we were doing an actual EFI callback, we'd switch
over to the efi_pgd, where we'd have this stuff mapped in, but, until
now, we've been able to read these MMRs using the regular kernel page
table without issues.

Please let me know what you guys think about this issue, and if you have
any suggestions for possible solutions.  Any input is greatly
appreciated!

- Alex

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table
  2016-04-27 15:41 [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table Alex Thorlton
@ 2016-04-27 18:23 ` Alex Thorlton
  2016-04-27 22:51 ` Borislav Petkov
  2016-04-29  9:01 ` Matt Fleming
  2 siblings, 0 replies; 20+ messages in thread
From: Alex Thorlton @ 2016-04-27 18:23 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: linux-kernel, Matt Fleming, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, linux-efi, Russ Anderson, Dimitri Sivanich,
	mike travis, Nathan Zimmer, Borislav Petkov

On Wed, Apr 27, 2016 at 10:41:32AM -0500, Alex Thorlton wrote:

Adding Boris to Cc.

> Hey guys,
> 
> We've hit an issue that we haven't seen before on recent community
> kernels.  Hoping that you can provide some insight.
> 
> Late last week I hit this bad paging request in uv_system_init on one of
> our small UV systems:
> 
> 8<---
> [    0.355998] UV: Found UV2000/3000 hub
> [    0.360088] BUG: unable to handle kernel paging request at ffff8800fb600000
> [    0.367882] IP: [<ffffffff81b6e906>] uv_system_init+0x1f8/0x1051
> [    0.374604] PGD 1f83067 PUD 0 
> [    0.378030] Oops: 0000 [#1] SMP 
> [    0.381652] Modules linked in:
> [    0.385069] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.6.0-rc3-community-4.6-rc2+ #454
> [    0.394003] Hardware name: SGI UV3000/UV3000, BIOS SGI UV 3000 series BIOS 01/15/2015
> [    0.402743] task: ffff88086ee18000 ti: ffff88086ee1c000 task.ti: ffff88086ee1c000
> [    0.411097] RIP: 0010:[<ffffffff81b6e906>]  [<ffffffff81b6e906>] uv_system_init+0x1f8/0x1051
> [    0.420530] RSP: 0000:ffff88086ee1fdb0  EFLAGS: 00010286
> [    0.426458] RAX: ffff8800fb600000 RBX: 0000000000000000 RCX: 000000000000c780
> [    0.434421] RDX: 00000000fa000000 RSI: 0000000000000246 RDI: 0000000000000246
> [    0.442385] RBP: ffff88086ee1fe48 R08: 00000000fffffffe R09: 0000000000000000
> [    0.450348] R10: 0000000000000005 R11: 0000000000000146 R12: 000000000000c780
> [    0.458312] R13: 000000000000a0f0 R14: 0000000000000037 R15: 0000000000000038
> [    0.466277] FS:  0000000000000000(0000) GS:ffff880870c00000(0000) knlGS:0000000000000000
> [    0.475307] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.481718] CR2: ffff8800fb600000 CR3: 0000000001a06000 CR4: 00000000001406f0
> [    0.489682] Stack:
> [    0.491924]  0000000000000037 0000000000000038 ffff88086ee1fdd0 ffffffff810bef29
> [    0.500219]  ffff88086ee1fe28 ffffffff8116426d ffff880800000010 ffff88086ee1fe38
> [    0.508517]  ffff88086ee1fdf8 ffffffff8116426d 0000000000000002 0000000000000001
> [    0.516810] Call Trace:
> [    0.519542]  [<ffffffff810bef29>] ? vprintk_default+0x29/0x40
> [    0.525957]  [<ffffffff8116426d>] ? printk+0x4d/0x4f
> [    0.531497]  [<ffffffff8116426d>] ? printk+0x4d/0x4f
> [    0.537041]  [<ffffffff81b6a2d2>] native_smp_prepare_cpus+0x299/0x2e4
> [    0.544232]  [<ffffffff81b5b19b>] kernel_init_freeable+0xc3/0x220
> [    0.551035]  [<ffffffff815b236e>] kernel_init+0xe/0x110
> [    0.556869]  [<ffffffff815be102>] ret_from_fork+0x22/0x40
> [    0.562893]  [<ffffffff815b2360>] ? rest_init+0x80/0x80
> [    0.568723] Code: 65 48 03 05 1d b8 49 7e 80 78 14 02 ba 00 00 00 fa 76 0b 48 89 c8 65 48 03 05 07 b8 49 7e 48 b8 00 00 60 01 00 88 ff ff 48 09 d0 <48> 8b 00 88 c3 48 c1 e8 06 41 bd 01 00 00 00 88 c1 83 e3 3f 4c 
> [    0.590530] RIP  [<ffffffff81b6e906>] uv_system_init+0x1f8/0x1051
> [    0.597341]  RSP <ffff88086ee1fdb0>
> [    0.601230] CR2: ffff8800fb600000
> [    0.604933] ---[ end trace 06918ff61c5d4cab ]---
> [    0.610091] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
> --->8
> 
> A bit of digging will tell us that this is the failing line:
> 
> m_n_config.v = uv_read_local_mmr(UVH_RH_GAM_CONFIG_MMR );
> 
> A bisect (with a bit of extra digging into a merge commit) led me to
> commit 67a9108ed4313 ("x86/efi: Build our own page table structures") as
> being the first bad commit.
> 
> After a bit of poking around, I found that the part of this change that
> is actually breaking things for us is the fact that the EFI memory
> regions are no longer being mapped into the trampoline_pgd in
> __map_region.
> 
> Just as a quick test, I made this small change and it got my UV to boot:
> 
> 8<---
> diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
> index 49e4dd4..15bf5df 100644
> --- a/arch/x86/platform/efi/efi_64.c
> +++ b/arch/x86/platform/efi/efi_64.c
> @@ -296,6 +296,7 @@ static void __init __map_region(efi_memory_desc_t *md, u64 va)
>  {
>         unsigned long flags = _PAGE_RW;
>         unsigned long pfn;
> +       pgd_t *tramp_pgd = (pgd_t *)__va(real_mode_header->trampoline_pgd);
>         pgd_t *pgd = efi_pgd;
> 
>         if (!(md->attribute & EFI_MEMORY_WB))
> @@ -305,6 +306,10 @@ static void __init __map_region(efi_memory_desc_t *md, u64 va)
>         if (kernel_map_pages_in_pgd(pgd, pfn, va, md->num_pages, flags))
>                 pr_warn("Error mapping PA 0x%llx -> VA 0x%llx!\n",
>                            md->phys_addr, va);
> +
> +       if (kernel_map_pages_in_pgd(tramp_pgd, pfn, va, md->num_pages, flags))
> +               pr_warn("Error mapping in trampoline_pgd PA 0x%llx -> VA 0x%llx!\n",
> +                          md->phys_addr, va);
>  }
> 
>  void __init efi_map_region(efi_memory_desc_t *md)
> --->8
> 
> From what I know, this works because the first PGD entry (the one
> containing the identity mappings) of the trampoline_pgd is shared with
> the main kernel PGD (init_level4_pgt), so when __map_region maps this
> stuff into the trampoline_pgd, we get it in the kernel PGD as well.
> 
> The way I understand it, the motivation behind this change is to isolate
> the EFI runtime services code/data from the regular kernel page tables,
> but it appears that we've isolated a bit more than that.  I do
> understand that if we were doing an actual EFI callback, we'd switch
> over to the efi_pgd, where we'd have this stuff mapped in, but, until
> now, we've been able to read these MMRs using the regular kernel page
> table without issues.
> 
> Please let me know what you guys think about this issue, and if you have
> any suggestions for possible solutions.  Any input is greatly
> appreciated!
> 
> - Alex

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table
  2016-04-27 15:41 [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table Alex Thorlton
  2016-04-27 18:23 ` Alex Thorlton
@ 2016-04-27 22:51 ` Borislav Petkov
  2016-04-28  1:41   ` Alex Thorlton
  2016-04-29  9:01 ` Matt Fleming
  2 siblings, 1 reply; 20+ messages in thread
From: Borislav Petkov @ 2016-04-27 22:51 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: linux-kernel, Matt Fleming, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, linux-efi, Russ Anderson, Dimitri Sivanich,
	mike travis, Nathan Zimmer

On Wed, Apr 27, 2016 at 10:41:32AM -0500, Alex Thorlton wrote:
> A bit of digging will tell us that this is the failing line:
> 
> m_n_config.v = uv_read_local_mmr(UVH_RH_GAM_CONFIG_MMR );

That looks like

All code
========
   0:   65 48 03 05 1d b8 49    add    %gs:0x7e49b81d(%rip),%rax        # 0x7e49b825
   7:   7e 
   8:   80 78 14 02             cmpb   $0x2,0x14(%rax)
   c:   ba 00 00 00 fa          mov    $0xfa000000,%edx
  11:   76 0b                   jbe    0x1e
  13:   48 89 c8                mov    %rcx,%rax
  16:   65 48 03 05 07 b8 49    add    %gs:0x7e49b807(%rip),%rax        # 0x7e49b825
  1d:   7e 
  1e:   48 b8 00 00 60 01 00    movabs $0xffff880001600000,%rax
  25:   88 ff ff 
  28:   48 09 d0                or     %rdx,%rax
  2b:*  48 8b 00                mov    (%rax),%rax              <-- trapping instruction
  2e:   88 c3                   mov    %al,%bl
  30:   48 c1 e8 06             shr    $0x6,%rax
  34:   41 bd 01 00 00 00       mov    $0x1,%r13d
  3a:   88 c1                   mov    %al,%cl
  3c:   83 e3 3f                and    $0x3f,%ebx

but why does this have anything to do with the EFI pagetable, at all?
The MMRs should be mapped in the normal kernel page table, right?

And your dirty fix of mapping into trampoline_pgd doesn't make any
sense...

How do the MMRs get mapped on that box exactly? And why aren't they
mapped in the normal kernel page table all of a sudden?

/me is confused and goes to bed.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table
  2016-04-27 22:51 ` Borislav Petkov
@ 2016-04-28  1:41   ` Alex Thorlton
  2016-04-28 12:57     ` Borislav Petkov
  0 siblings, 1 reply; 20+ messages in thread
From: Alex Thorlton @ 2016-04-28  1:41 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Alex Thorlton, linux-kernel, Matt Fleming, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, linux-efi, Russ Anderson,
	Dimitri Sivanich, mike travis, Nathan Zimmer

On Thu, Apr 28, 2016 at 12:51:22AM +0200, Borislav Petkov wrote:
> On Wed, Apr 27, 2016 at 10:41:32AM -0500, Alex Thorlton wrote:
> > A bit of digging will tell us that this is the failing line:
> > 
> > m_n_config.v = uv_read_local_mmr(UVH_RH_GAM_CONFIG_MMR );
> 
> That looks like
> 
> All code
> ========
>    0:   65 48 03 05 1d b8 49    add    %gs:0x7e49b81d(%rip),%rax        # 0x7e49b825
>    7:   7e 
>    8:   80 78 14 02             cmpb   $0x2,0x14(%rax)
>    c:   ba 00 00 00 fa          mov    $0xfa000000,%edx
>   11:   76 0b                   jbe    0x1e
>   13:   48 89 c8                mov    %rcx,%rax
>   16:   65 48 03 05 07 b8 49    add    %gs:0x7e49b807(%rip),%rax        # 0x7e49b825
>   1d:   7e 
>   1e:   48 b8 00 00 60 01 00    movabs $0xffff880001600000,%rax
>   25:   88 ff ff 
>   28:   48 09 d0                or     %rdx,%rax
>   2b:*  48 8b 00                mov    (%rax),%rax              <-- trapping instruction
>   2e:   88 c3                   mov    %al,%bl
>   30:   48 c1 e8 06             shr    $0x6,%rax
>   34:   41 bd 01 00 00 00       mov    $0x1,%r13d
>   3a:   88 c1                   mov    %al,%cl
>   3c:   83 e3 3f                and    $0x3f,%ebx
> 
> but why does this have anything to do with the EFI pagetable, at all?

In this particular instance, it's not using the EFI page table - it's
showing that the register isn't mapped into the main kernel page table,
via the bad paging request.  The issue isn't that there's something
wrong with the EFI page table, but that something appears to be missing
from the kernel table.

> The MMRs should be mapped in the normal kernel page table, right?

Well, quite a while back, these MMRs got mapped in using
init_extra_mapping_uc in map_low_mmrs.  Back then, yes, they were mapped
straight into the kernel page table.

After d2f7cbe7b26a74 ("x86/efi: Runtime services virtual mapping") was
introduced (I'm sure you remember the BIOS issue we had a while back) we
had to fall back to using EFI_OLD_MEMMAP, so for a while we relied on
efi_ioremap.

Eventually we got a BIOS fix that allowed us to start using the new
memmap scheme, at which point we removed the init_extra_mapping_uc()s,
since the efi_map_region code appeared to be doing what we needed.

So, short answer is, they were mapped in and working up until now.

> And your dirty fix of mapping into trampoline_pgd doesn't make any
> sense...

I *think* it does?  This one took me a while to figure out - I wrote the
fix at midnight last night, so please inform me if my logic sounds
insane, haha.

In efi_map_region we first do the __map_region for the physical address,
which will get mapped into the lowest PGD entry of the trampoline_pgd.
I believe the lowest PGD entry in the trampoline_pgd is actually the
same as init_mm.pgd + pgd_index(PAGE_OFFSET) (see setup_real_mode), so
you're effectively mapping into the kernel 1:1 space when you map into
pgd_index 0 of the trampoline_pgd.

By adding the mappings to the trampoline_pgd back in, I think I've also
added them back into the kernel.

> How do the MMRs get mapped on that box exactly?

Believe I covered this above.  Let me know if you have questions!

> And why aren't they
> mapped in the normal kernel page table all of a sudden?

It looks to me like there might be a bit of a complicated chain of
events that led us to this point.  I thought I was close to having this
worked out, but after looking back over it I'm starting to think, "how
was it working in the first place?!"

At this point, considering the fact that my fix definitely gets the UV
to boot, I'm going to stick to my claim that adding the maps back into
the trampoline_pgd fixes at least part of the issue - not saying it's
the right way, but it's an intersting data point.

Despite the fact that my fix might help remedy the situation, I will
admit that my understanding of the mechanics behind said fix could be
flawed :)

I'm going to have to think about this some more tomorrow.  Let me know
if you have any ideas!

- Alex

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table
  2016-04-28  1:41   ` Alex Thorlton
@ 2016-04-28 12:57     ` Borislav Petkov
  2016-04-29 15:41       ` Alex Thorlton
  0 siblings, 1 reply; 20+ messages in thread
From: Borislav Petkov @ 2016-04-28 12:57 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: linux-kernel, Matt Fleming, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, linux-efi, Russ Anderson, Dimitri Sivanich,
	mike travis, Nathan Zimmer

On Wed, Apr 27, 2016 at 08:41:28PM -0500, Alex Thorlton wrote:
> In this particular instance, it's not using the EFI page table - it's
> showing that the register isn't mapped into the main kernel page table,
> via the bad paging request.  The issue isn't that there's something
> wrong with the EFI page table, but that something appears to be missing
> from the kernel table.

So my question stands: you said 67a9108ed4313 is causing this and this
commit creates an EFI-specific page table and that shouldn't have
anything to do with how the MMR stuff is mapped, should it?

> Well, quite a while back, these MMRs got mapped in using
> init_extra_mapping_uc in map_low_mmrs.  Back then, yes, they were mapped
> straight into the kernel page table.
> 
> After d2f7cbe7b26a74 ("x86/efi: Runtime services virtual mapping") was
> introduced (I'm sure you remember the BIOS issue we had a while back) we
> had to fall back to using EFI_OLD_MEMMAP, so for a while we relied on
> efi_ioremap.

Hmm, but you see my confusion, right? Why is init_extra_mapping_uc()
influenced by the EFI changes? It uses __init_extra_mapping() and it
maps into init_mm's pgd.

> Eventually we got a BIOS fix that allowed us to start using the new
> memmap scheme, at which point we removed the init_extra_mapping_uc()s,
> since the efi_map_region code appeared to be doing what we needed.

Why would you even do that? Why are you even mapping MMRs using EFI
facilities?

I'm more confused today.

So what I see from here is this:

* MMRs and EFI shouldn't have anything in common. Imagine there were an
UV box without EFI (you probably are going to say there's no such thing
but imagine anyway): how are you going to map the MMR space then?

* I think you should restore the old case where you mapped the MMRs
using init_extra_mapping_uc().

And I think

  d394f2d9d8e1 ("x86/platform/UV: Remove EFI memmap quirk for UV2+")

was wrong in doing

+       if (is_uv1_hub())
+               map_low_mmrs();

for the simple fact that MMRs mapping and EFI shouldn't depend on one
another.

So the proper fix would be to remove the if- check.

Or am I missing something here and MMRs need EFI?

 (However, UV1 apparently works fine without it).

Right?

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table
  2016-04-27 15:41 [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table Alex Thorlton
  2016-04-27 18:23 ` Alex Thorlton
  2016-04-27 22:51 ` Borislav Petkov
@ 2016-04-29  9:01 ` Matt Fleming
  2016-04-29 15:45   ` Alex Thorlton
  2 siblings, 1 reply; 20+ messages in thread
From: Matt Fleming @ 2016-04-29  9:01 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	linux-efi, Russ Anderson, Dimitri Sivanich, mike travis,
	Nathan Zimmer

On Wed, 27 Apr, at 10:41:32AM, Alex Thorlton wrote:
> 
> From what I know, this works because the first PGD entry (the one
> containing the identity mappings) of the trampoline_pgd is shared with
> the main kernel PGD (init_level4_pgt), so when __map_region maps this
> stuff into the trampoline_pgd, we get it in the kernel PGD as well.
 
Correct.

> The way I understand it, the motivation behind this change is to isolate
> the EFI runtime services code/data from the regular kernel page tables,

Yep.

> but it appears that we've isolated a bit more than that.  I do
> understand that if we were doing an actual EFI callback, we'd switch
> over to the efi_pgd, where we'd have this stuff mapped in, but, until
> now, we've been able to read these MMRs using the regular kernel page
> table without issues.
> 
> Please let me know what you guys think about this issue, and if you have
> any suggestions for possible solutions.  Any input is greatly
> appreciated!

The issue is that you're relying on the EFI mapping code to map these
regions for you, and that's a bug. These regions have nothing to do
with EFI.

Arguably it's always been a bug. You're not alone though, there were
also other pieces of code piggy-backing on the EFI mapping functions,
and EFI code piggy-backing on the regular kernel mapping functions,
see 452308de6105 ("x86/efi: Fix boot crash by always mapping boot
service regions into new EFI page tables").

I agree with Boris' suggestion that removing the "if (is_uv1_hub())"
check in uv_system_init() is probably the way to go, since it should
always be the responsibility of uv_system_init() to setup these
mappings.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table
  2016-04-28 12:57     ` Borislav Petkov
@ 2016-04-29 15:41       ` Alex Thorlton
  2016-04-30 22:12         ` Matt Fleming
  2016-05-02 10:02         ` Borislav Petkov
  0 siblings, 2 replies; 20+ messages in thread
From: Alex Thorlton @ 2016-04-29 15:41 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Alex Thorlton, linux-kernel, Matt Fleming, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, linux-efi, Russ Anderson,
	Dimitri Sivanich, mike travis, Nathan Zimmer

On Thu, Apr 28, 2016 at 02:57:11PM +0200, Borislav Petkov wrote:
> > After d2f7cbe7b26a74 ("x86/efi: Runtime services virtual mapping") was
> > introduced (I'm sure you remember the BIOS issue we had a while back) we
> > had to fall back to using EFI_OLD_MEMMAP, so for a while we relied on
> > efi_ioremap.
> 
> Hmm, but you see my confusion, right? Why is init_extra_mapping_uc()
> influenced by the EFI changes? It uses __init_extra_mapping() and it
> maps into init_mm's pgd.

I think the answer here is that the init_extra_mapping wasn't
necessarily affected by the EFI changes.  What happened is that the new
EFI changes in d2f7cbe7b26a74 mapped in the register for us, before
uv_system_init came along and tried to do it with the
init_extra_mapping.  When uv_system_init tried to do the extra mappings,
they fail because the register was already mapped by the EFI code.

> > Eventually we got a BIOS fix that allowed us to start using the new
> > memmap scheme, at which point we removed the init_extra_mapping_uc()s,
> > since the efi_map_region code appeared to be doing what we needed.
> 
> Why would you even do that? Why are you even mapping MMRs using EFI
> facilities?

I believe this was unintentional behavior.  Here's the order that I
think all this got mixed up in (the first step being what was actually
wrong in the first place):

1) From the very beginning, we've had the address space containing those
   registers tucked inside and EFI memory descriptor.  Initially this
   wasn't an issue, I believe because the efi_ioremap was putting them
   in ioremap space, so there was no collision when we went to do the
   init_extra_mapping.
2) EFI code changed in d2f7cbe7b26a74 to do stuff with efi_map_region,
   which then comes in and maps the register into the kernel 1:1 space,
   where our init_extra_mapping used to place it.
   * This caused our original issue, because the init_extra_mapping
     would blow up in uv_system_init when the registers were already
     mapped.
   * Note that we had one BIOS bug that initally prevented us from using
     the new mapping scheme at all.  We *thought* we had this cleared
     up, but it appears we may have missed something.
3) I removed the init_extra_mapping in d394f2d9d8e1 becausem, as
   described above, it was causing a crash during uv_system_init when we
   switched over to the new EFI memmap scheme.
   * Note that UV1 did not receive a BIOS fix, so it will eternally use
     EFI_OLD_MEMMAP.
4) The latest change came along, and pulled the 1:1 mappings out of the
   trampoline_pgd (and therefore out of the kernel page table) and here
   we are today.

> I'm more confused today.

It took me a about 3 entire days to wrap my head around all this :)

> So what I see from here is this:
> 
> * MMRs and EFI shouldn't have anything in common. Imagine there were an
> UV box without EFI (you probably are going to say there's no such thing
> but imagine anyway): how are you going to map the MMR space then?

Right - I'm beginning to see where we went wrong here :)

> * I think you should restore the old case where you mapped the MMRs
> using init_extra_mapping_uc().
> 
> And I think
> 
>   d394f2d9d8e1 ("x86/platform/UV: Remove EFI memmap quirk for UV2+")
> 
> was wrong in doing
> 
> +       if (is_uv1_hub())
> +               map_low_mmrs();
> 
> for the simple fact that MMRs mapping and EFI shouldn't depend on one
> another.

I agree here.

> So the proper fix would be to remove the if- check.
> 
> Or am I missing something here and MMRs need EFI?

I think this is partially correct, but in doing that, we find that we're
still missing something.  Watch what happens when I make this small
tweak to my kernel:

8<---
diff --git a/arch/x86/kernel/apic/x2apic_uv_x.c
b/arch/x86/kernel/apic/x2apic_uv_x.c
index 624db005..91ac029 100644
--- a/arch/x86/kernel/apic/x2apic_uv_x.c
+++ b/arch/x86/kernel/apic/x2apic_uv_x.c
@@ -891,7 +891,7 @@ void __init uv_system_init(void)
        pr_info("UV: Found %s hub\n", hub);
 
        /* We now only need to map the MMRs on UV1 */
-       if (is_uv1_hub())
+       //if (is_uv1_hub())
                map_low_mmrs();
 
        m_n_config.v = uv_read_local_mmr(UVH_RH_GAM_CONFIG_MMR );
--->8

Here's the result:

8<---
[    5.353656] BUG: unable to handle kernel paging request at ffff88006a1ab938
[    5.361448] IP: [<ffff88006a1ab938>] 0xffff88006a1ab938
[    5.367290] PGD 1f81067 PUD 87ffff067 PMD 87fff8067 PTE 0
[    5.373356] Oops: 0010 [#1] SMP 
[    5.376977] Modules linked in:
[    5.380395] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.5.0-rc2-uv4-comm-debug-fix+ #538
[    5.389428] Hardware name: SGI UV3000/UV3000, BIOS SGI UV 3000 series BIOS 01/15/2015
[    5.398169] task: ffff880867ec4040 ti: ffff880867ec8000 task.ti: ffff880867ec8000
[    5.406522] RIP: 0010:[<ffff88006a1ab938>]  [<ffff88006a1ab938>] 0xffff88006a1ab938
[    5.415080] RSP: 0000:ffff880867ecbc88  EFLAGS: 00010086
[    5.421006] RAX: 0000000000000000 RBX: 0000000000000282 RCX: 0000000000000001
[    5.428971] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff88006a1ab938
[    5.436935] RBP: ffff880867ecbd58 R08: ffff880867ecbd68 R09: ffff880867ecbd70
[    5.444900] R10: ffffffffffffffda R11: 000000006a1ab938 R12: 0000000000000000
[    5.452864] R13: ffffffff81dcf0b8 R14: ffffffff81dcf0c0 R15: ffffffff81dcf0a0
[    5.460829] FS:  0000000000000000(0000) GS:ffff880878c00000(0000) knlGS:0000000000000000
[    5.469861] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    5.476274] CR2: ffff88006a1ab938 CR3: 0000000001a0a000 CR4: 00000000001406f0
[    5.484240] Stack:
[    5.486483]  ffffffff8105d7f8 0000000000000000 0000000000000006 0000000000000006
[    5.494777]  000000000000001e 0000000000000000 0000000000000000 ffff880867ecbd38
[    5.503074]  0000000080050033 0000000000000000 0000000000000000 0000000000000000
[    5.511368] Call Trace:
[    5.514098]  [<ffffffff8105d7f8>] ? efi_call+0x58/0x90
[    5.519834]  [<ffffffff8106033d>] ? uv_bios_call_irqsave+0x5d/0x80
[    5.526733]  [<ffffffff810603a0>] uv_bios_get_sn_info+0x40/0xb0
[    5.533344]  [<ffffffff81b6f824>] uv_system_init+0x772/0x104d
[    5.539751]  [<ffffffff810bd479>] ? vprintk_default+0x29/0x40
[    5.546159]  [<ffffffff81161cf8>] ? printk+0x4d/0x4f
[    5.551692]  [<ffffffff81b6ac75>] native_smp_prepare_cpus+0x299/0x2e4
[    5.558884]  [<ffffffff81b5c18e>] kernel_init_freeable+0xc3/0x21b
[    5.565680]  [<ffffffff815acd00>] ? rest_init+0x80/0x80
[    5.571502]  [<ffffffff815acd0e>] kernel_init+0xe/0xf0
[    5.577238]  [<ffffffff815b87cf>] ret_from_fork+0x3f/0x70
[    5.583264]  [<ffffffff815acd00>] ? rest_init+0x80/0x80
[    5.589093] Code:  Bad RIP value.
[    5.592812] RIP  [<ffff88006a1ab938>] 0xffff88006a1ab938
[    5.598748]  RSP <ffff880867ecbc88>
[    5.602638] CR2: ffff88006a1ab938
[    5.606339] ---[ end trace 3abaacb020c74a50 ]---
[    5.611487] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
--->8

You can see here that we've made it past the MMR read in uv_system_init,
but we die inside of our first EFI callback.  In this example, it looks
like we're using the kernel page table at the time of the failure, and I
believe that the failing address is somewhere in our EFI runtime code:

[    0.803396] efi: ATHORLTON EFI md dump:
[    0.803396]         type: 5
[    0.803396]         pad: 0
[    0.803396]         phys_addr: 6a0a6000
[    0.803396]         virt_addr: 0
[    0.803396]         num_pages: 184
[    0.803396]         attribute: 800000000000000f

So it looks like we're trying to read from EFI runtime space while using
the kernel page table, which fails, presumably because the space is also
not mapped into the kernel page table.  While I understand *why* it
fails, and why the address isn't mapped, I don't fully know how we
should handle fixing it.

It would appear after a good amount of investigation that we've been
skating by on some (perhaps not so subtle) undefined/unintentional
behvaior inside the EFI memory mapping code.

I will need to talk with our BIOS team about this to see what they
think.  Any input you guys might have on how we might remedy this
situation would be really valuable.  I think we have a good
understanding of what's going wrong here, but now we need to work out a
way to fix it.

Thanks for you help so far!

- Alex

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table
  2016-04-29  9:01 ` Matt Fleming
@ 2016-04-29 15:45   ` Alex Thorlton
  0 siblings, 0 replies; 20+ messages in thread
From: Alex Thorlton @ 2016-04-29 15:45 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Alex Thorlton, linux-kernel, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, linux-efi, Russ Anderson, Dimitri Sivanich,
	mike travis, Nathan Zimmer

On Fri, Apr 29, 2016 at 10:01:15AM +0100, Matt Fleming wrote:
> On Wed, 27 Apr, at 10:41:32AM, Alex Thorlton wrote:
> > 
> > From what I know, this works because the first PGD entry (the one
> > containing the identity mappings) of the trampoline_pgd is shared with
> > the main kernel PGD (init_level4_pgt), so when __map_region maps this
> > stuff into the trampoline_pgd, we get it in the kernel PGD as well.
>  
> Correct.
> 
> > The way I understand it, the motivation behind this change is to isolate
> > the EFI runtime services code/data from the regular kernel page tables,
> 
> Yep.

Thanks for confirming my suspicions.  This one was really bothering me.

I'd rather have broken code and know why it's broken, than have working
code and not know why it works :)

> > but it appears that we've isolated a bit more than that.  I do
> > understand that if we were doing an actual EFI callback, we'd switch
> > over to the efi_pgd, where we'd have this stuff mapped in, but, until
> > now, we've been able to read these MMRs using the regular kernel page
> > table without issues.
> > 
> > Please let me know what you guys think about this issue, and if you have
> > any suggestions for possible solutions.  Any input is greatly
> > appreciated!
> 
> The issue is that you're relying on the EFI mapping code to map these
> regions for you, and that's a bug. These regions have nothing to do
> with EFI.

Understood, and agreed.

> Arguably it's always been a bug. You're not alone though, there were
> also other pieces of code piggy-backing on the EFI mapping functions,
> and EFI code piggy-backing on the regular kernel mapping functions,
> see 452308de6105 ("x86/efi: Fix boot crash by always mapping boot
> service regions into new EFI page tables").
> 
> I agree with Boris' suggestion that removing the "if (is_uv1_hub())"
> check in uv_system_init() is probably the way to go, since it should
> always be the responsibility of uv_system_init() to setup these
> mappings.

I have tried this, but there are some issues there as well (see my
response to Boris's last message).  I think we're getting closer to
having this all figured out, but there are still some pieces missing
from the puzzle.

Thanks for the input, Matt.  Please let me know if you have any ideas
about the remaining issue that I described!

- Alex

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table
  2016-04-29 15:41       ` Alex Thorlton
@ 2016-04-30 22:12         ` Matt Fleming
  2016-05-02 21:39           ` Alex Thorlton
  2016-05-02 10:02         ` Borislav Petkov
  1 sibling, 1 reply; 20+ messages in thread
From: Matt Fleming @ 2016-04-30 22:12 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: Borislav Petkov, linux-kernel, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, linux-efi, Russ Anderson, Dimitri Sivanich,
	mike travis, Nathan Zimmer

On Fri, 29 Apr, at 10:41:19AM, Alex Thorlton wrote:
> 
> You can see here that we've made it past the MMR read in uv_system_init,
> but we die inside of our first EFI callback.  In this example, it looks
> like we're using the kernel page table at the time of the failure, and I
> believe that the failing address is somewhere in our EFI runtime code:
> 
> [    0.803396] efi: ATHORLTON EFI md dump:
> [    0.803396]         type: 5
> [    0.803396]         pad: 0
> [    0.803396]         phys_addr: 6a0a6000
> [    0.803396]         virt_addr: 0
> [    0.803396]         num_pages: 184
> [    0.803396]         attribute: 800000000000000f
> 
> So it looks like we're trying to read from EFI runtime space while using
> the kernel page table, which fails, presumably because the space is also
> not mapped into the kernel page table.  While I understand *why* it
> fails, and why the address isn't mapped, I don't fully know how we
> should handle fixing it.

How come you're not using the new EFI page tables while making EFI
runtime calls?

Who owns the MMR space and what is it used for? Do both the kernel and
the firmware need access to it? My SGI UV knowledge is zero, so I'm
happy to be educated! I can't think of any analogous memory regions on
x86 where the EFI services require the kernel to map them, other than
the EFI regions themselves.

Runtime EFI regions should be opaque, isolated and self-contained.
This is why there are two phases of execution for firmware; before and
after ExitBootServices(). Once the kernel takes control after
ExitBootServices() firmware can no longer provide timer services, for
example, and doesn't care where the kernel maps the LAPIC because it
never tries to access it.

The fact that the UV firmware does care where the MMR space is mapped
makes me suspect that there's a lack of isolation.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table
  2016-04-29 15:41       ` Alex Thorlton
  2016-04-30 22:12         ` Matt Fleming
@ 2016-05-02 10:02         ` Borislav Petkov
  2016-05-02 22:27           ` Alex Thorlton
  1 sibling, 1 reply; 20+ messages in thread
From: Borislav Petkov @ 2016-05-02 10:02 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: linux-kernel, Matt Fleming, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, linux-efi, Russ Anderson, Dimitri Sivanich,
	mike travis, Nathan Zimmer

On Fri, Apr 29, 2016 at 10:41:19AM -0500, Alex Thorlton wrote:
> I think this is partially correct, but in doing that, we find that we're
> still missing something.  Watch what happens when I make this small
> tweak to my kernel:
> 
> 8<---
> diff --git a/arch/x86/kernel/apic/x2apic_uv_x.c
> b/arch/x86/kernel/apic/x2apic_uv_x.c
> index 624db005..91ac029 100644
> --- a/arch/x86/kernel/apic/x2apic_uv_x.c
> +++ b/arch/x86/kernel/apic/x2apic_uv_x.c
> @@ -891,7 +891,7 @@ void __init uv_system_init(void)
>         pr_info("UV: Found %s hub\n", hub);
>  
>         /* We now only need to map the MMRs on UV1 */
> -       if (is_uv1_hub())
> +       //if (is_uv1_hub())
>                 map_low_mmrs();
>  
>         m_n_config.v = uv_read_local_mmr(UVH_RH_GAM_CONFIG_MMR );
> --->8
> 
> Here's the result:
> 
> 8<---
> [    5.353656] BUG: unable to handle kernel paging request at ffff88006a1ab938
> [    5.361448] IP: [<ffff88006a1ab938>] 0xffff88006a1ab938
> [    5.367290] PGD 1f81067 PUD 87ffff067 PMD 87fff8067 PTE 0
> [    5.373356] Oops: 0010 [#1] SMP 
> [    5.376977] Modules linked in:
> [    5.380395] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.5.0-rc2-uv4-comm-debug-fix+ #538
> [    5.389428] Hardware name: SGI UV3000/UV3000, BIOS SGI UV 3000 series BIOS 01/15/2015
> [    5.398169] task: ffff880867ec4040 ti: ffff880867ec8000 task.ti: ffff880867ec8000
> [    5.406522] RIP: 0010:[<ffff88006a1ab938>]  [<ffff88006a1ab938>] 0xffff88006a1ab938
> [    5.415080] RSP: 0000:ffff880867ecbc88  EFLAGS: 00010086
> [    5.421006] RAX: 0000000000000000 RBX: 0000000000000282 RCX: 0000000000000001
> [    5.428971] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff88006a1ab938
> [    5.436935] RBP: ffff880867ecbd58 R08: ffff880867ecbd68 R09: ffff880867ecbd70
> [    5.444900] R10: ffffffffffffffda R11: 000000006a1ab938 R12: 0000000000000000
> [    5.452864] R13: ffffffff81dcf0b8 R14: ffffffff81dcf0c0 R15: ffffffff81dcf0a0
> [    5.460829] FS:  0000000000000000(0000) GS:ffff880878c00000(0000) knlGS:0000000000000000
> [    5.469861] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    5.476274] CR2: ffff88006a1ab938 CR3: 0000000001a0a000 CR4: 00000000001406f0
> [    5.484240] Stack:
> [    5.486483]  ffffffff8105d7f8 0000000000000000 0000000000000006 0000000000000006
> [    5.494777]  000000000000001e 0000000000000000 0000000000000000 ffff880867ecbd38
> [    5.503074]  0000000080050033 0000000000000000 0000000000000000 0000000000000000
> [    5.511368] Call Trace:
> [    5.514098]  [<ffffffff8105d7f8>] ? efi_call+0x58/0x90
> [    5.519834]  [<ffffffff8106033d>] ? uv_bios_call_irqsave+0x5d/0x80
> [    5.526733]  [<ffffffff810603a0>] uv_bios_get_sn_info+0x40/0xb0
> [    5.533344]  [<ffffffff81b6f824>] uv_system_init+0x772/0x104d
> [    5.539751]  [<ffffffff810bd479>] ? vprintk_default+0x29/0x40
> [    5.546159]  [<ffffffff81161cf8>] ? printk+0x4d/0x4f
> [    5.551692]  [<ffffffff81b6ac75>] native_smp_prepare_cpus+0x299/0x2e4
> [    5.558884]  [<ffffffff81b5c18e>] kernel_init_freeable+0xc3/0x21b
> [    5.565680]  [<ffffffff815acd00>] ? rest_init+0x80/0x80
> [    5.571502]  [<ffffffff815acd0e>] kernel_init+0xe/0xf0
> [    5.577238]  [<ffffffff815b87cf>] ret_from_fork+0x3f/0x70
> [    5.583264]  [<ffffffff815acd00>] ? rest_init+0x80/0x80
> [    5.589093] Code:  Bad RIP value.
> [    5.592812] RIP  [<ffff88006a1ab938>] 0xffff88006a1ab938
> [    5.598748]  RSP <ffff880867ecbc88>
> [    5.602638] CR2: ffff88006a1ab938
> [    5.606339] ---[ end trace 3abaacb020c74a50 ]---
> [    5.611487] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
> --->8
> 
> You can see here that we've made it past the MMR read in uv_system_init,
> but we die inside of our first EFI callback.  In this example, it looks
> like we're using the kernel page table at the time of the failure, and I
> believe that the failing address is somewhere in our EFI runtime code:

I think I see what's going on:

[    5.367290] PGD 1f81067 PUD 87ffff067 PMD 87fff8067 PTE 0

PTE 0 because, most probably, you need to sync
efi_sync_low_kernel_mappings(). Why?

So the point of time this call is done, is, IINM, after we have
enabled virtual mode. I.e., it is being done in start_kernel() and
your callstack points at rest_init() which happens later in that same
function.

So IMO what you should be doing, instead, is doing efi_call_virt() in
uv_bios_call() which should take care of everything.

I think this naked efi_call() in uv_bios_call() should not be there
but UV should be calling the _phys or _virt helpers from the EFI core.

Preferrably someone should go and audit all those EFI calls in UV and
figure out which one to use, _phys or _virt depending on the point in
time this call is being done.

For example, uv_system_init() should all be _virt calls, obviously.
And from a quick scan, most of the EFI calls are coming from there so
everything should be _virt.

Btw, uv_bios_call_reentrant() looks unused, might want to remove it
while at it.

Hmmm.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table
  2016-04-30 22:12         ` Matt Fleming
@ 2016-05-02 21:39           ` Alex Thorlton
  2016-05-02 22:17             ` Mike Travis
  2016-05-09 21:55             ` Matt Fleming
  0 siblings, 2 replies; 20+ messages in thread
From: Alex Thorlton @ 2016-05-02 21:39 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Alex Thorlton, Borislav Petkov, linux-kernel, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, linux-efi, Russ Anderson,
	Dimitri Sivanich, mike travis, Nathan Zimmer

On Sat, Apr 30, 2016 at 11:12:09PM +0100, Matt Fleming wrote:
> On Fri, 29 Apr, at 10:41:19AM, Alex Thorlton wrote:
> > 
> > You can see here that we've made it past the MMR read in uv_system_init,
> > but we die inside of our first EFI callback.  In this example, it looks
> > like we're using the kernel page table at the time of the failure, and I
> > believe that the failing address is somewhere in our EFI runtime code:
> > 
> > [    0.803396] efi: ATHORLTON EFI md dump:
> > [    0.803396]         type: 5
> > [    0.803396]         pad: 0
> > [    0.803396]         phys_addr: 6a0a6000
> > [    0.803396]         virt_addr: 0
> > [    0.803396]         num_pages: 184
> > [    0.803396]         attribute: 800000000000000f
> > 
> > So it looks like we're trying to read from EFI runtime space while using
> > the kernel page table, which fails, presumably because the space is also
> > not mapped into the kernel page table.  While I understand *why* it
> > fails, and why the address isn't mapped, I don't fully know how we
> > should handle fixing it.
> 
> How come you're not using the new EFI page tables while making EFI
> runtime calls?

That is a good question :)  Until now, I wasn't aware that we were doing
it this way.  As Boris pointed out, we need to look into this and get it
corrected.  I've started playing with it, but haven't quite got things
figured out yet.  I'll explaing in my response to Boris.

> Who owns the MMR space and what is it used for? Do both the kernel and
> the firmware need access to it? My SGI UV knowledge is zero, so I'm
> happy to be educated! I can't think of any analogous memory regions on
> x86 where the EFI services require the kernel to map them, other than
> the EFI regions themselves.

We have MMRs that get used for a ton of different purposes.  I'm only
familiar with the details of a few of them, but they provide a bunch of
information about various bits of SGI-specific hardware (i.e. the hub
and the BAU) and I think there are also some that allow you to control
that hardware.

This is more Mike Travis's department - he might be able to paint a
better picture.

> Runtime EFI regions should be opaque, isolated and self-contained.
> This is why there are two phases of execution for firmware; before and
> after ExitBootServices(). Once the kernel takes control after
> ExitBootServices() firmware can no longer provide timer services, for
> example, and doesn't care where the kernel maps the LAPIC because it
> never tries to access it.
> 
> The fact that the UV firmware does care where the MMR space is mapped
> makes me suspect that there's a lack of isolation.

The way it has been described to me is that there are MMRs that we need
access to directly from the kernel, as well as during some of our
runtime EFI callbacks, so we need them mapped in both the EFI page
tables, and the kernel page tables (glossing over the fact that we have
not been properly utilizing said EFI page tables).

It looks like mapping in the registers explicitly gets us where we need
to be, as far as the MMRs go.  I'm still working on sorting out the
problems with our callbacks, but I think that's sort of a separate
issue.

If you think we're violating EFI rules by accessing these registers from
both sides of the fence, please let me know.  I'd like to make sure that
we get everything behaving the way it should be!

- Alex

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table
  2016-05-02 21:39           ` Alex Thorlton
@ 2016-05-02 22:17             ` Mike Travis
  2016-05-09 21:55             ` Matt Fleming
  1 sibling, 0 replies; 20+ messages in thread
From: Mike Travis @ 2016-05-02 22:17 UTC (permalink / raw)
  To: Alex Thorlton, Matt Fleming
  Cc: Borislav Petkov, linux-kernel, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, linux-efi, Russ Anderson, Dimitri Sivanich,
	Nathan Zimmer



On 5/2/2016 2:39 PM, Alex Thorlton wrote:
>> > Who owns the MMR space and what is it used for? Do both the kernel and
>> > the firmware need access to it? My SGI UV knowledge is zero, so I'm
>> > happy to be educated! I can't think of any analogous memory regions on
>> > x86 where the EFI services require the kernel to map them, other than
>> > the EFI regions themselves.
>
> We have MMRs that get used for a ton of different purposes.  I'm only
> familiar with the details of a few of them, but they provide a bunch of
> information about various bits of SGI-specific hardware (i.e. the hub
> and the BAU) and I think there are also some that allow you to control
> that hardware.
> 
> This is more Mike Travis's department - he might be able to paint a
> better picture.
> 

I don't have the complete picture either.  But from what I do understand,
the answer is "yes", BIOS and the kernel both access the MMRs during the
lifetime of the system operation.  The vast majority of the MMRs are setup
by UV BIOS during system startup.  At some point they switch from each
blade/board/chassis running separately, to one doing overall system setup.
Then BIOS hands over the system to the kernel as expected.

After the hand off, the kernel and drivers, use the MMRs for machine control
and feedback.  In addition, various services are provided by BIOS in the
"System Management Mode" layer.  I'm not completely sure how the regs are
shared while in this mode, I think they both need the same address mappings.
Russ or Dimitri would know more about this, as this is the "kernel/BIOS
interface" the efi bios callbacks are using (most commonly in UV kernel
modules).

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table
  2016-05-02 10:02         ` Borislav Petkov
@ 2016-05-02 22:27           ` Alex Thorlton
  2016-05-03  0:10             ` Alex Thorlton
  0 siblings, 1 reply; 20+ messages in thread
From: Alex Thorlton @ 2016-05-02 22:27 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Alex Thorlton, linux-kernel, Matt Fleming, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, linux-efi, Russ Anderson,
	Dimitri Sivanich, mike travis, Nathan Zimmer

On Mon, May 02, 2016 at 12:02:22PM +0200, Borislav Petkov wrote:
> On Fri, Apr 29, 2016 at 10:41:19AM -0500, Alex Thorlton wrote:
> > You can see here that we've made it past the MMR read in uv_system_init,
> > but we die inside of our first EFI callback.  In this example, it looks
> > like we're using the kernel page table at the time of the failure, and I
> > believe that the failing address is somewhere in our EFI runtime code:
> 
> I think I see what's going on:
> 
> [    5.367290] PGD 1f81067 PUD 87ffff067 PMD 87fff8067 PTE 0
> 
> PTE 0 because, most probably, you need to sync
> efi_sync_low_kernel_mappings(). Why?
> 
> So the point of time this call is done, is, IINM, after we have
> enabled virtual mode. I.e., it is being done in start_kernel() and
> your callstack points at rest_init() which happens later in that same
> function.
> 
> So IMO what you should be doing, instead, is doing efi_call_virt() in
> uv_bios_call() which should take care of everything.
> 
> I think this naked efi_call() in uv_bios_call() should not be there
> but UV should be calling the _phys or _virt helpers from the EFI core.
> 
> Preferrably someone should go and audit all those EFI calls in UV and
> figure out which one to use, _phys or _virt depending on the point in
> time this call is being done.

I think you're definitely right about needing to switch to something
closer to the efi_call_virt/phys macros, but those, unfortunately, don't
work right out of the box.  The efi_call_virt macro assumes that the
function pointer being passed in is in efi.systab, but our function
pointer is in efi.uv_systab.

I'm working on getting a slightly modified efi_call_virt working, but am
having some problems with it.  I've got it switching over to the EFI
page table, but it's still giving me a bad paging request while trying
to call into our EFI code.

> For example, uv_system_init() should all be _virt calls, obviously.
> And from a quick scan, most of the EFI calls are coming from there so
> everything should be _virt.

I'll make sure we take a look at this once we've sorted out our other
issues.  Thanks for pointing this out!

> Btw, uv_bios_call_reentrant() looks unused, might want to remove it
> while at it.

I'll be sure to check this out as well.

Thanks for the help.  I'll get back to you when I know a bit more about
what's happening with our runtime callbacks!

- Alex

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table
  2016-05-02 22:27           ` Alex Thorlton
@ 2016-05-03  0:10             ` Alex Thorlton
  2016-05-03  9:48               ` Borislav Petkov
  0 siblings, 1 reply; 20+ messages in thread
From: Alex Thorlton @ 2016-05-03  0:10 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: Borislav Petkov, linux-kernel, Matt Fleming, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, linux-efi, Russ Anderson,
	Dimitri Sivanich, mike travis, Nathan Zimmer

On Mon, May 02, 2016 at 05:27:19PM -0500, Alex Thorlton wrote:
> Thanks for the help.  I'll get back to you when I know a bit more about
> what's happening with our runtime callbacks!

I've made a bit of progress here.  I was able to switch over to a very
slightly modified version of efi_call_virt and then tweak uv_bios_call
just a bit, and that (along with the re-introduction of the map_low_mmrs
calls) got my machine to boot:

8<---
diff --git a/arch/x86/kernel/apic/x2apic_uv_x.c
b/arch/x86/kernel/apic/x2apic_uv_x.c
index 8f4942e..82aa6a7 100644
--- a/arch/x86/kernel/apic/x2apic_uv_x.c
+++ b/arch/x86/kernel/apic/x2apic_uv_x.c
@@ -892,7 +892,7 @@ void __init uv_system_init(void)
        pr_info("UV: Found %s hub\n", hub);
 
        /* We now only need to map the MMRs on UV1 */
-       if (is_uv1_hub())
+       //if (is_uv1_hub())
                map_low_mmrs();
 
        m_n_config.v = uv_read_local_mmr(UVH_RH_GAM_CONFIG_MMR );

diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index 53748c4..0c4d347 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -48,6 +48,16 @@ extern unsigned long asmlinkage efi_call_phys(void *, ...);
        __s;                                                            \
 })
 
+#define uv_call_virt(f, args...) \
+({                                                                     \
+       efi_status_t __s;                                               \
+       kernel_fpu_begin();                                             \
+       __s = ((efi_##f##_t __attribute__((regparm(0)))*)               \
+               f)(args);                                               \
+       kernel_fpu_end();                                               \
+       __s;                                                            \
+})
+
 /* Use this macro if your virtual call does not return any value */
 #define __efi_call_virt(f, args...) \
 ({                                                                     \
@@ -104,6 +114,32 @@ struct efi_scratch {
        __s;                                                            \
 })
 
+#define uv_call_virt(f, ...)                                           \
+({                                                                     \
+       efi_status_t __s;                                               \
+                                                                       \
+       efi_sync_low_kernel_mappings();                                 \
+       preempt_disable();                                              \
+       __kernel_fpu_begin();                                           \
+                                                                       \
+       if (efi_scratch.use_pgd) {                                      \
+               efi_scratch.prev_cr3 = read_cr3();                      \
+               write_cr3((unsigned long)efi_scratch.efi_pgt);          \
+               __flush_tlb_all();                                      \
+       }                                                               \
+                                                                       \
+       __s = efi_call((void *)f, __VA_ARGS__); \
+                                                                       \
+       if (efi_scratch.use_pgd) {                                      \
+               write_cr3(efi_scratch.prev_cr3);                        \
+               __flush_tlb_all();                                      \
+       }                                                               \
+                                                                       \
+       __kernel_fpu_end();                                             \
+       preempt_enable();                                               \
+       __s;                                                            \
+})
+
 /*
  * All X86_64 virt calls return non-void values. Thus, use non-void call for
  * virt calls that would be void on X86_32.

diff --git a/arch/x86/platform/uv/bios_uv.c b/arch/x86/platform/uv/bios_uv.c
index 1584cbe..6e99f81 100644
--- a/arch/x86/platform/uv/bios_uv.c
+++ b/arch/x86/platform/uv/bios_uv.c
@@ -39,8 +39,8 @@ s64 uv_bios_call(enum uv_bios_cmd which, u64 a1, u64 a2, u64 a3, u64 a4, u64 a5)
                 */
                return BIOS_STATUS_UNIMPLEMENTED;
 
-       ret = efi_call((void *)__va(tab->function), (u64)which,
-                       a1, a2, a3, a4, a5);
+       ret = uv_call_virt(tab->function, (u64)which, a1, a2, a3, a4, a5);
+
        return ret;
 }
 EXPORT_SYMBOL_GPL(uv_bios_call);
--->8

Note that the only change I made to efi_call_virt was to change
efi.systab->runtime->f to simply f in the efi_call line.  This works up
until we try to do callbacks from a loaded module.  When we try that we
hit this:

[   56.232086] BUG: unable to handle kernel paging request at ffffffff8106148f
[   56.239880] IP: [<fffffffedbb408ce>] 0xfffffffedbb408ce
[   56.245721] PGD 8698e0067 PUD 1a08063 PMD 10001e1 
[   56.251102] Oops: 0003 [#1] SMP 
[   56.254725] Modules linked in: hwperf(OE+) af_packet(E) iscsi_ibft(E) iscsi_boot_sysfs(E) msr(E) intel_rapl(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) nl
s_iso8859_1(E) nls_cp437(E) irqbypass(E) vfat(E) crct10dif_pclmul(E) fat(E) crc32_pclmul(E) ghash_clmulni_intel(E) drbg(E) ansi_cprng(E) aesni_intel(E) aes_x86_64(E) lrw(E) igb(E) iTCO_wdt(E) 
gf128mul(E) glue_helper(E) iTCO_vendor_support(E) sb_edac(E) ablk_helper(E) ptp(E) edac_core(E) dm_mod(E) cryptd(E) pcspkr(E) lpc_ich(E) pps_core(E) i2c_i801(E) mfd_core(E) mei_me(E) ioatdma(E
) mei(E) shpchp(E) wmi(E) dca(E) processor(E) button(E) efivarfs(E) xfs(E) libcrc32c(E) sd_mod(E) mgag200(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) xhci_p
ci(E) fb_sys_fops(E) ahci(E) ehci_pci(E) ttm(E) xhci_hcd(E) ehci_hcd(E) crc32c_intel(E) libahci(E) ata_generic(E) drm(E) usbcore(E) libata(E) usb_common(E) sg(E) scsi_mod(E) autofs4(E)
[   56.348094] CPU: 12 PID: 7515 Comm: modprobe Tainted: G           OE   4.6.0-rc5-maplow-uvcall+ #552
[   56.358290] Hardware name: SGI UV3000/UV3000, BIOS SGI UV 3000 series BIOS 01/15/2015
[   56.367032] task: ffff8808650a1100 ti: ffff8808647ac000 task.ti: ffff8808647ac000
[   56.375385] RIP: 0010:[<fffffffedbb408ce>]  [<fffffffedbb408ce>] 0xfffffffedbb408ce
[   56.383941] RSP: 0018:ffff8808647afad0  EFLAGS: 00010246
[   56.389869] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff8106148f
[   56.397833] RDX: 0000000000000003 RSI: 0000000000000002 RDI: 0000000000000000
[   56.405797] RBP: ffff8808647afc78 R08: 0000000000000002 R09: 0000000000000002
[   56.413760] R10: 000000006a1b8540 R11: 0000000000000000 R12: 0000000000000002
[   56.421724] R13: 0000000000000000 R14: 0000000000010000 R15: 0000000000000003
[   56.429688] FS:  00007f6cf45c2700(0000) GS:ffff880878d80000(0000) knlGS:0000000000000000
[   56.438720] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   56.445131] CR2: ffffffff8106148f CR3: 00000008698df000 CR4: 00000000001406e0
[   56.453096] Stack:
[   56.455337]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[   56.463632]  ffff8808673ace78 ffff8808673ace60 0000000000000000 0000000000000003
[   56.471928]  ffff8808647afb20 ffffffff81094402 ffff8808647afb68 ffffffff810abbb5
[   56.480225] Call Trace:
[   56.482958]  [<ffffffff81094402>] ? default_wake_function+0x12/0x20
[   56.489953]  [<ffffffff810abbb5>] ? __wake_up_common+0x55/0x90
[   56.496466]  [<ffffffff8106148f>] ? uv_bios_call+0x6f/0x110
[   56.502686]  [<ffffffff8105ea3c>] ? efi_call+0x5c/0x90
[   56.508425]  [<ffffffff8130c751>] ? vsnprintf+0x251/0x4b0
[   56.514449]  [<ffffffff8106148f>] ? uv_bios_call+0x6f/0x110
[   56.520667]  [<ffffffff8106148f>] uv_bios_call+0x6f/0x110
[   56.526693]  [<ffffffffa0296850>] ? uv_hwperf_deregister_procfs+0x90/0x90 [hwperf]
[   56.535143]  [<ffffffffa0296948>] uv_hwperf_entry+0xf8/0x200 [hwperf]
[   56.542334]  [<ffffffff810003dd>] do_one_initcall+0xad/0x1e0
[   56.548646]  [<ffffffff81167ad2>] ? do_init_module+0x27/0x1da
[   56.555058]  [<ffffffff81167b0b>] do_init_module+0x60/0x1da
[   56.561279]  [<ffffffff810ef0f2>] load_module+0x1402/0x1ae0
[   56.567500]  [<ffffffff810eb970>] ? __symbol_put+0x40/0x40
[   56.573623]  [<ffffffff810ef9d9>] SYSC_finit_module+0xa9/0xd0
[   56.580035]  [<ffffffff810efa1e>] SyS_finit_module+0xe/0x10
[   56.586257]  [<ffffffff815c7432>] entry_SYSCALL_64_fastpath+0x1a/0xa4
[   56.593444] Code: 49 83 f9 02 0f 82 67 03 00 00 b9 01 00 00 00 e8 6d 12 00 00 ba 03 00 00 00 48 8b c8 e8 d0 12 00 00 48 8b 8c 24 d0 00 00 00 33 ff <66> 89 01 48 81 39 3f 42 0f 00 ba fe 00 0
0 00 48 0f 44 fa 48 8b 
[   56.615234] RIP  [<fffffffedbb408ce>] 0xfffffffedbb408ce
[   56.621173]  RSP <ffff8808647afad0>
[   56.625063] CR2: ffffffff8106148f
[   56.628763] ---[ end trace fee09972b1382958 ]---
[   56.633914] Kernel panic - not syncing: Fatal exception
[   56.639767] Kernel Offset: disabled
[   56.643659] ---[ end Kernel panic - not syncing: Fatal exception

The bad paging request here appears to be on the:

if (efi_scratch.use_pgd)

Line of uv_call_virt.  It looks like it's having trouble accessing the
efi_scratch struct using the EFI page table.  I'm not sure why this
is an issue with callbacks from modules and not with the ones in
uv_system_init and friends.

I'll keep investigating the module issue.  Looks like we're getting
closer to sorting this out!

Let me know if you have thoughts about the way I'm getting stuff
working.  I'm thinking there's probably a better way to do this than by
copying the whole efi_call_virt macro - this was a quick and dirty
solution.

- Alex

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table
  2016-05-03  0:10             ` Alex Thorlton
@ 2016-05-03  9:48               ` Borislav Petkov
  2016-05-03 18:47                 ` Alex Thorlton
  0 siblings, 1 reply; 20+ messages in thread
From: Borislav Petkov @ 2016-05-03  9:48 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: linux-kernel, Matt Fleming, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, linux-efi, Russ Anderson, Dimitri Sivanich,
	mike travis, Nathan Zimmer

On Mon, May 02, 2016 at 07:10:36PM -0500, Alex Thorlton wrote:
> +#define uv_call_virt(f, args...) \
> +({                                                                     \
> +       efi_status_t __s;                                               \
> +       kernel_fpu_begin();                                             \
> +       __s = ((efi_##f##_t __attribute__((regparm(0)))*)               \
> +               f)(args);                                               \
> +       kernel_fpu_end();                                               \
> +       __s;                                                            \
> +})

Right, can you use the EFI-ones in
drivers/firmware/efi/runtime-wrappers.c directly?

 (So they're going to land there, I'm staring at latest -tip and those calls
  have become all fancy now:

#define efi_call_virt(f, args...)                                       \
({                                                                      \
        efi_status_t __s;                                               \
        unsigned long flags;                                            \
        arch_efi_call_virt_setup();                                     \
        local_save_flags(flags);                                        \
        __s = arch_efi_call_virt(f, args);                              \
        efi_call_virt_check_flags(flags, __stringify(f));               \
        arch_efi_call_virt_teardown();                                  \
        __s;                                                            \
})

with efi_call_virt_check_flags() checking for IRQ flags corruption... And so
on, but that's beside the point... )

> +
>  /* Use this macro if your virtual call does not return any value */
>  #define __efi_call_virt(f, args...) \
>  ({                                                                     \
> @@ -104,6 +114,32 @@ struct efi_scratch {
>         __s;                                                            \
>  })
>  
> +#define uv_call_virt(f, ...)                                           \
> +({                                                                     \
> +       efi_status_t __s;                                               \
> +                                                                       \
> +       efi_sync_low_kernel_mappings();                                 \
> +       preempt_disable();                                              \
> +       __kernel_fpu_begin();                                           \
> +                                                                       \
> +       if (efi_scratch.use_pgd) {                                      \
> +               efi_scratch.prev_cr3 = read_cr3();                      \
> +               write_cr3((unsigned long)efi_scratch.efi_pgt);          \
> +               __flush_tlb_all();                                      \
> +       }                                                               \
> +                                                                       \
> +       __s = efi_call((void *)f, __VA_ARGS__); \
> +                                                                       \
> +       if (efi_scratch.use_pgd) {                                      \
> +               write_cr3(efi_scratch.prev_cr3);                        \
> +               __flush_tlb_all();                                      \
> +       }                                                               \
> +                                                                       \
> +       __kernel_fpu_end();                                             \
> +       preempt_enable();                                               \
> +       __s;                                                            \
> +})
> +
>  /*
>   * All X86_64 virt calls return non-void values. Thus, use non-void call for
>   * virt calls that would be void on X86_32.
> 
> diff --git a/arch/x86/platform/uv/bios_uv.c b/arch/x86/platform/uv/bios_uv.c
> index 1584cbe..6e99f81 100644
> --- a/arch/x86/platform/uv/bios_uv.c
> +++ b/arch/x86/platform/uv/bios_uv.c
> @@ -39,8 +39,8 @@ s64 uv_bios_call(enum uv_bios_cmd which, u64 a1, u64 a2, u64 a3, u64 a4, u64 a5)
>                  */
>                 return BIOS_STATUS_UNIMPLEMENTED;
>  
> -       ret = efi_call((void *)__va(tab->function), (u64)which,
> -                       a1, a2, a3, a4, a5);
> +       ret = uv_call_virt(tab->function, (u64)which, a1, a2, a3, a4, a5);
> +

That could be simply

		efi_call_virt(tab->function, ...)

methinks.

>         return ret;
>  }
>  EXPORT_SYMBOL_GPL(uv_bios_call);
> --->8
> 
> Note that the only change I made to efi_call_virt was to change
> efi.systab->runtime->f to simply f in the efi_call line.  This works up
> until we try to do callbacks from a loaded module.  When we try that we
> hit this:
> 
> [   56.232086] BUG: unable to handle kernel paging request at ffffffff8106148f
> [   56.239880] IP: [<fffffffedbb408ce>] 0xfffffffedbb408ce
> [   56.245721] PGD 8698e0067 PUD 1a08063 PMD 10001e1 

PMD looks ok to me.

Have you tried CONFIG_EFI_PGT_DUMP ? It will dump to dmesg so you might
be able to spot stuff.

Also, you could dump them from debugfs *right* before loading the module
and then look at stuff.

Also 2, booting with efi=debug should give you how the EFI regions get
mapped.

...

> The bad paging request here appears to be on the:
> 
> if (efi_scratch.use_pgd)
> 
> Line of uv_call_virt.  It looks like it's having trouble accessing the
> efi_scratch struct using the EFI page table.  I'm not sure why this
> is an issue with callbacks from modules and not with the ones in
> uv_system_init and friends.

Just this one module or all modules doing EFI calls?

> I'll keep investigating the module issue.  Looks like we're getting
> closer to sorting this out!
> 
> Let me know if you have thoughts about the way I'm getting stuff
> working.  I'm thinking there's probably a better way to do this than by
> copying the whole efi_call_virt macro - this was a quick and dirty
> solution.

Yeah, try using the generic facilities. We should be able to accomodate
all users...

HTH.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table
  2016-05-03  9:48               ` Borislav Petkov
@ 2016-05-03 18:47                 ` Alex Thorlton
  2016-05-04 10:36                   ` Borislav Petkov
  0 siblings, 1 reply; 20+ messages in thread
From: Alex Thorlton @ 2016-05-03 18:47 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Alex Thorlton, linux-kernel, Matt Fleming, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, linux-efi, Russ Anderson,
	Dimitri Sivanich, mike travis, Nathan Zimmer

On Tue, May 03, 2016 at 11:48:20AM +0200, Borislav Petkov wrote:
> On Mon, May 02, 2016 at 07:10:36PM -0500, Alex Thorlton wrote:
> > +#define uv_call_virt(f, args...) \
> > +({                                                                     \
> > +       efi_status_t __s;                                               \
> > +       kernel_fpu_begin();                                             \
> > +       __s = ((efi_##f##_t __attribute__((regparm(0)))*)               \
> > +               f)(args);                                               \
> > +       kernel_fpu_end();                                               \
> > +       __s;                                                            \
> > +})
> 
> Right, can you use the EFI-ones in
> drivers/firmware/efi/runtime-wrappers.c directly?
> 
>  (So they're going to land there, I'm staring at latest -tip and those calls
>   have become all fancy now:
> with efi_call_virt_check_flags() checking for IRQ flags corruption... And so
> on, but that's beside the point... )

I think this will work for us, for the most part.  Only issue is that
the efi_call_virt macro is only accessible from inside
runtime-wrappers.c.  If we could pull that macro (and whatever else it
needs) up to the header file, I think that might work for us.  Not sure
if that's the appropriate solution, but it's a start.

> > -       ret = efi_call((void *)__va(tab->function), (u64)which,
> > -                       a1, a2, a3, a4, a5);
> > +       ret = uv_call_virt(tab->function, (u64)which, a1, a2, a3, a4, a5);
> > +
> 
> That could be simply
> 
> 		efi_call_virt(tab->function, ...)
> 
> methinks.

Looking at the -tip code, this sounds correct.

> > [   56.232086] BUG: unable to handle kernel paging request at ffffffff8106148f
> > [   56.239880] IP: [<fffffffedbb408ce>] 0xfffffffedbb408ce
> > [   56.245721] PGD 8698e0067 PUD 1a08063 PMD 10001e1 
> 
> PMD looks ok to me.
> 
> Have you tried CONFIG_EFI_PGT_DUMP ? It will dump to dmesg so you might
> be able to spot stuff.

Yes, I do have CONFIG_EFI_PGT_DUMP=y.  I don't *think* I see anything
strange in there, but I could be missing something.  I will send you a
full dump of my log buffer wit MLs et. al. off of Cc.

Take note that the Oops bits here indicate that it was a *write* from
kernel space that triggered this most recent Oops, whereas the ones we
were hitting before were all just missing pages in the mappings.

This means my suggestiong about the "if(efi_scratch..." bit was wrong.
This issue is still rolling around in my head.  I'll address it below.

> Also, you could dump them from debugfs *right* before loading the module
> and then look at stuff.
> 
> Also 2, booting with efi=debug should give you how the EFI regions get
> mapped.

I will try both of those ideas and see if I notice anything
interesting.

> > The bad paging request here appears to be on the:
> > 
> > if (efi_scratch.use_pgd)
> > 
> > Line of uv_call_virt.  It looks like it's having trouble accessing the
> > efi_scratch struct using the EFI page table.  I'm not sure why this
> > is an issue with callbacks from modules and not with the ones in
> > uv_system_init and friends.
> 
> Just this one module or all modules doing EFI calls?

I think this is our only module that does EFI calls (at least the only
module doing EFI calls in efi.uv_systab, as opposed to efi.systab), so
I'm not sure about that.

> > I'm thinking there's probably a better way to do this than by
> > copying the whole efi_call_virt macro - this was a quick and dirty
> > solution.
> 
> Yeah, try using the generic facilities. We should be able to accomodate
> all users...

This is probably the answer for the future, when we can expect the
changes to these macros be merged with the mainline kernel, but I don't
know exactly how long it will be before that happens.

I think we really want to come up with a fix to get into the mainline
kernel, sooner, rather than later.  If that means coming up with an
interim fix, and then planning for the changes in -tip later, then we
might have to deal with that.

Even though we're still having issues with our callbacks, we have
proof-of-concept fixes for our first 2 issues.  In my mind, I think it
would be best to fix as much as possible now, just to make sure that
everything is in the best shape possible.  If this means leaving the
callbacks broken for now, that might just have to be a fact of life.

Let me know what you guys think!

- Alex

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table
  2016-05-03 18:47                 ` Alex Thorlton
@ 2016-05-04 10:36                   ` Borislav Petkov
  2016-05-04 16:32                     ` Alex Thorlton
  0 siblings, 1 reply; 20+ messages in thread
From: Borislav Petkov @ 2016-05-04 10:36 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: linux-kernel, Matt Fleming, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, linux-efi, Russ Anderson, Dimitri Sivanich,
	mike travis, Nathan Zimmer

On Tue, May 03, 2016 at 01:47:51PM -0500, Alex Thorlton wrote:
> I think this will work for us, for the most part.  Only issue is that
> the efi_call_virt macro is only accessible from inside
> runtime-wrappers.c.  If we could pull that macro (and whatever else it
> needs) up to the header file, I think that might work for us.  Not sure
> if that's the appropriate solution, but it's a start.

Should be doable. You could give it a try and see how ugly it can get.

> Yes, I do have CONFIG_EFI_PGT_DUMP=y.  I don't *think* I see anything
> strange in there, but I could be missing something.  I will send you a
> full dump of my log buffer wit MLs et. al. off of Cc.

Sure.

> Take note that the Oops bits here indicate that it was a *write* from
> kernel space that triggered this most recent Oops, whereas the ones we
> were hitting before were all just missing pages in the mappings.
> 
> This means my suggestiong about the "if(efi_scratch..." bit was wrong.
> This issue is still rolling around in my head.  I'll address it below.

One thing I don't see in your uv_call_virt() is you're not grabbing
efi_runtime_lock like the rest of the EFI callers do. And there's
__wake_up_common() somewhere there in the callstack, not on the current
frame but there's also another uv_bios_call() in there and this all
looks like some locking issue...

So please convert it to the generic one first, do the calls as runtime
services in drivers/firmware/efi/runtime-wrappers.c do and we can
continue debugging.

> This is probably the answer for the future, when we can expect the
> changes to these macros be merged with the mainline kernel, but I don't
> know exactly how long it will be before that happens.

What's the hurry exactly here? You want stuff fixed in 4.6 when it
releases in less than two weeks?

Lemme try to understand the fallout range: that's only UV1 or UV3 too?
Because the latest oops comes from UV3...

If it is UV1 only, I'd say we don't care since you guys wanted to even
kill that support :-)

Btw, does "efi=old_memmap" fix things and could it be used as an interim
workaround until we've fixed everything properly and stuff has trickled
into -stable.?

Thanks.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table
  2016-05-04 10:36                   ` Borislav Petkov
@ 2016-05-04 16:32                     ` Alex Thorlton
  0 siblings, 0 replies; 20+ messages in thread
From: Alex Thorlton @ 2016-05-04 16:32 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Alex Thorlton, linux-kernel, Matt Fleming, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, linux-efi, Russ Anderson,
	Dimitri Sivanich, mike travis, Nathan Zimmer

On Wed, May 04, 2016 at 12:36:36PM +0200, Borislav Petkov wrote:
> On Tue, May 03, 2016 at 01:47:51PM -0500, Alex Thorlton wrote:
> > I think this will work for us, for the most part.  Only issue is that
> > the efi_call_virt macro is only accessible from inside
> > runtime-wrappers.c.  If we could pull that macro (and whatever else it
> > needs) up to the header file, I think that might work for us.  Not sure
> > if that's the appropriate solution, but it's a start.
> 
> Should be doable. You could give it a try and see how ugly it can get.

I can do that.  I don't think it should be too bad - I just wanted to
make sure that was an appropriate move before starting to work on it.

> > Yes, I do have CONFIG_EFI_PGT_DUMP=y.  I don't *think* I see anything
> > strange in there, but I could be missing something.  I will send you a
> > full dump of my log buffer wit MLs et. al. off of Cc.
> 
> Sure.

I am sending this shortly.  Yesterday evening got away from me :)

> > Take note that the Oops bits here indicate that it was a *write* from
> > kernel space that triggered this most recent Oops, whereas the ones we
> > were hitting before were all just missing pages in the mappings.
> > 
> > This means my suggestiong about the "if(efi_scratch..." bit was wrong.
> > This issue is still rolling around in my head.  I'll address it below.
> 
> One thing I don't see in your uv_call_virt() is you're not grabbing
> efi_runtime_lock like the rest of the EFI callers do. And there's
> __wake_up_common() somewhere there in the callstack, not on the current
> frame but there's also another uv_bios_call() in there and this all
> looks like some locking issue...
> 
> So please convert it to the generic one first, do the calls as runtime
> services in drivers/firmware/efi/runtime-wrappers.c do and we can
> continue debugging.

Got it.

> > This is probably the answer for the future, when we can expect the
> > changes to these macros be merged with the mainline kernel, but I don't
> > know exactly how long it will be before that happens.
> 
> What's the hurry exactly here? You want stuff fixed in 4.6 when it
> releases in less than two weeks?

Well, in a perfect world, yes.  I realize that might be a bit of a
stretch, but we'd *really* prefer to have 4.6 not be outright broken.  I
think we might be able to get at least a few small fixes through to at
least get our machines booting.  If worse comes to worse, we can get the
fixes into -tip and then wrap back around and try to fix up 4.6 in a
later stable kernel release.  I guess the best we can do is try to work
quickly and see where things end up.

> Lemme try to understand the fallout range: that's only UV1 or UV3 too?
> Because the latest oops comes from UV3...
> 
> If it is UV1 only, I'd say we don't care since you guys wanted to even
> kill that support :-)

Sorry, I may not have made this clear.  Currently *all* UVs *except* for
UV1s are broken.  All of the testing I've done since we started
discussing this issue has been done on a UV3000, but everything >= UV2
is currently broken.

> Btw, does "efi=old_memmap" fix things and could it be used as an interim
> workaround until we've fixed everything properly and stuff has trickled
> into -stable.?

Unfortunately, without the call for map_low_mmrs, even that doesn't
work.  I think that's an easy fix that we might be able to get in for
4.6 though.  It's literally a one-liner.  I'm going to try to get that
out today, so at least our old workaround still works.  I think it might
still have some trouble with modules doing EFI calls, but I'd be at
least halfway happy if the machine boots :)

- Alex

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table
  2016-05-02 21:39           ` Alex Thorlton
  2016-05-02 22:17             ` Mike Travis
@ 2016-05-09 21:55             ` Matt Fleming
  2016-05-10 17:35               ` Alex Thorlton
  1 sibling, 1 reply; 20+ messages in thread
From: Matt Fleming @ 2016-05-09 21:55 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: Borislav Petkov, linux-kernel, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, linux-efi, Russ Anderson, Dimitri Sivanich,
	mike travis, Nathan Zimmer

On Mon, 02 May, at 04:39:31PM, Alex Thorlton wrote:
> 
> If you think we're violating EFI rules by accessing these registers from
> both sides of the fence, please let me know.  I'd like to make sure that
> we get everything behaving the way it should be!

Oh no, I don't think this is violating the UEFI spec at all, but I do
think it goes against the spirit of the way other implementations are
designed; with maximum separation between firmware and kernel.

In a perfect world, I'd suggest mapping the MMR range in both the
kernel and firmware, at different virtual address ranges, but have
the firmware's version opaque to the kernel and only described by an
EfiMemoryMappedIO region, or something. That is ignoring any region
type conflicts that may arise.

Of course, we don't operate in a perfect world, so a good solution may
be to just insert the kernels' mapping into the EFI page tables.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table
  2016-05-09 21:55             ` Matt Fleming
@ 2016-05-10 17:35               ` Alex Thorlton
  0 siblings, 0 replies; 20+ messages in thread
From: Alex Thorlton @ 2016-05-10 17:35 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Alex Thorlton, Borislav Petkov, linux-kernel, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, linux-efi, Russ Anderson,
	Dimitri Sivanich, mike travis, Nathan Zimmer

On Mon, May 09, 2016 at 10:55:24PM +0100, Matt Fleming wrote:
> On Mon, 02 May, at 04:39:31PM, Alex Thorlton wrote:
> > 
> > If you think we're violating EFI rules by accessing these registers from
> > both sides of the fence, please let me know.  I'd like to make sure that
> > we get everything behaving the way it should be!
> 
> Oh no, I don't think this is violating the UEFI spec at all, but I do
> think it goes against the spirit of the way other implementations are
> designed; with maximum separation between firmware and kernel.

Understood.  Thanks!

> In a perfect world, I'd suggest mapping the MMR range in both the
> kernel and firmware, at different virtual address ranges, but have
> the firmware's version opaque to the kernel and only described by an
> EfiMemoryMappedIO region, or something. That is ignoring any region
> type conflicts that may arise.

The code I'm working on right now will do exactly this, so I think we're
on the right track there.  My solution is still a bit hacky, and we're
still working out some kinks with runtime EFI calls, but we're getting
closer.

I will send up an RFC patch with my findings as soon as we've worked out
our callback issues.  I'm working with our BIOS guys on this as we
speak.

Thanks again for the help, Matt!

- Alex

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2016-05-10 17:35 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-27 15:41 [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table Alex Thorlton
2016-04-27 18:23 ` Alex Thorlton
2016-04-27 22:51 ` Borislav Petkov
2016-04-28  1:41   ` Alex Thorlton
2016-04-28 12:57     ` Borislav Petkov
2016-04-29 15:41       ` Alex Thorlton
2016-04-30 22:12         ` Matt Fleming
2016-05-02 21:39           ` Alex Thorlton
2016-05-02 22:17             ` Mike Travis
2016-05-09 21:55             ` Matt Fleming
2016-05-10 17:35               ` Alex Thorlton
2016-05-02 10:02         ` Borislav Petkov
2016-05-02 22:27           ` Alex Thorlton
2016-05-03  0:10             ` Alex Thorlton
2016-05-03  9:48               ` Borislav Petkov
2016-05-03 18:47                 ` Alex Thorlton
2016-05-04 10:36                   ` Borislav Petkov
2016-05-04 16:32                     ` Alex Thorlton
2016-04-29  9:01 ` Matt Fleming
2016-04-29 15:45   ` Alex Thorlton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).