linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* x86: pgtable / kaslr initialisation (OOB) help
@ 2023-06-14 13:23 Lee Jones
  2023-06-14 14:16 ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread
From: Lee Jones @ 2023-06-14 13:23 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin, linux-kernel,
	linux-mm
  Cc: Lee Jones

Good afternoon,

I'm looking for a little expert help with a out-of-bounds issue I've
been working on.  Please let me know if my present understanding does
not quite match reality.

- Report

The following highly indeterministic kernel panic was reported to occur every
few hundred boots:

    Kernel panic - not syncing: Fatal exception
     RIP: 0010:alloc_ucounts+0x68/0x280
     RSP: 0018:ffffb53ac13dfe98 EFLAGS: 00010006
     RAX: 0000000000000000 RBX: 000000015b804063 RCX: ffff9bb60d5aa500
     RDX: 0000000000000001 RSI: 00000000000003fb RDI: 0000000000000001
     RBP: ffffb53ac13dfec0 R08: 0000000000000000 R09: ffffd53abf600000
     R10: ffff9bb60b4bdd80 R11: ffffffffbcde7bb0 R12: ffffffffbf04e710
     R13: ffffffffbf405160 R14: 00000000000003fb R15: ffffffffbf1afc60
     FS:  00007f5be44d5000(0000) GS:ffff9bb6b9cc0000(0000) knlGS:0000000000000000
     CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     CR2: 000000015b80407b CR3: 00000001038ea000 CR4: 00000000000406a0
     DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
     DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
     Call Trace:
      <TASK>
      __sys_setuid+0x1a0/0x290
      __x64_sys_setuid+0xc/0x10
      do_syscall_64+0x43/0x90
      ? asm_exc_page_fault+0x8/0x30
      entry_SYSCALL_64_after_hwframe+0x44/0xae

- Problem

The issue was eventually tracked down to the attempted dereference of a value
located in a corrupted hash table.  ucounts_hashtable is an array of 1024
struct hlists.  Each element is the head of its own linked list where previous
ucount allocations are stored.  The [20]th element of ucounts_hashtable was
being consistently trashed on each and every boot.  However the indeterminism
comes from it being accessed only every few hundred boots.

The issue disappeared, or was at least unidentifiable when !(LTO=full) or when
memory base randomisation (a.k.a. KASLR) was disabled, rending GDB all but
impossible to use effectively.

The cause of the corruption was uncovered using a verity of different debugging
techniques and was eventually tracked down to page table manipulation in early
architecture setup.

The following line in arch/x86/realmode/init.c [0] allocates a variable, just 8
Bytes in size, to "hold the pgd entry used on booting additional CPUs":

  pgd_t trampoline_pgd_entry;

The address of that variable is then passed from init_trampoline_kaslr() via a
call to set_pgd() [1] to have a value (not too important here) assigned to it.
Numerous abstractions take place, eventually leading to native_set_p4d(), an
inline function [2] contained in arch/x86/include/asm/pgtable_64.h.

From here, intentionally or otherwise, a call to pti_set_user_pgtbl() is made.
This is where the out-of-bounds write eventually occurs.  It is not known (by
me) why this function is called.  The returned result is subsequently used as a
value to write using the WRITE_ONCE macro.  Perhaps the premature write is not
intended.  This is what I hope to find out.

A little way down in pti_set_user_pgtbl() [3] the following line occurs:

  kernel_to_user_pgdp(pgdp)->pgd = pgd.pgd

The kernel_to_user_pgdp() part takes the address of pgdp (a.k.a.
trampoline_pgd_entry) and ends up flipping the 12th bit, essentially adding 4k
(0x1000) to the address.  Then the part at the end assigns our value (still not
important here) to it.  However, if we remember that only 8 Bytes was allocated
(globally) for trampoline_pgd_entry, then means we just stored the value into
the outlands (what we now know to be allocated to another global storage user
ucounts_hashtable).

- Ask

Hopefully I haven't messed anything up in the explanation here.  Please let me
know if this is the case.  I'm keen to understand what the intention was and
what we might do to fix it, if of course this hasn't already been remedied
somewhere.

As ever, any help / guidance would be gratefully received.

Kind regards,
Lee

[0] https://elixir.bootlin.com/linux/latest/source/arch/x86/realmode/init.c#L18
[1] https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/kaslr.c#L178
[2] https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/pgtable_64.h#L142
[3] https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/pti.c#L142

-- 
Lee Jones [李琼斯]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: x86: pgtable / kaslr initialisation (OOB) help
  2023-06-14 13:23 x86: pgtable / kaslr initialisation (OOB) help Lee Jones
@ 2023-06-14 14:16 ` Peter Zijlstra
  2023-06-14 14:37   ` Lee Jones
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2023-06-14 14:16 UTC (permalink / raw)
  To: Lee Jones
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, linux-kernel, linux-mm

On Wed, Jun 14, 2023 at 02:23:39PM +0100, Lee Jones wrote:
> Good afternoon,
> 
> I'm looking for a little expert help with a out-of-bounds issue I've
> been working on.  Please let me know if my present understanding does
> not quite match reality.
> 
> - Report
> 
> The following highly indeterministic kernel panic was reported to occur every
> few hundred boots:
> 
>     Kernel panic - not syncing: Fatal exception
>      RIP: 0010:alloc_ucounts+0x68/0x280
>      RSP: 0018:ffffb53ac13dfe98 EFLAGS: 00010006
>      RAX: 0000000000000000 RBX: 000000015b804063 RCX: ffff9bb60d5aa500
>      RDX: 0000000000000001 RSI: 00000000000003fb RDI: 0000000000000001
>      RBP: ffffb53ac13dfec0 R08: 0000000000000000 R09: ffffd53abf600000
>      R10: ffff9bb60b4bdd80 R11: ffffffffbcde7bb0 R12: ffffffffbf04e710
>      R13: ffffffffbf405160 R14: 00000000000003fb R15: ffffffffbf1afc60
>      FS:  00007f5be44d5000(0000) GS:ffff9bb6b9cc0000(0000) knlGS:0000000000000000
>      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>      CR2: 000000015b80407b CR3: 00000001038ea000 CR4: 00000000000406a0
>      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>      Call Trace:
>       <TASK>
>       __sys_setuid+0x1a0/0x290
>       __x64_sys_setuid+0xc/0x10
>       do_syscall_64+0x43/0x90
>       ? asm_exc_page_fault+0x8/0x30
>       entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> - Problem
> 
> The issue was eventually tracked down to the attempted dereference of a value
> located in a corrupted hash table.  ucounts_hashtable is an array of 1024
> struct hlists.  Each element is the head of its own linked list where previous
> ucount allocations are stored.  The [20]th element of ucounts_hashtable was
> being consistently trashed on each and every boot.  However the indeterminism
> comes from it being accessed only every few hundred boots.
> 
> The issue disappeared, or was at least unidentifiable when !(LTO=full) or when
> memory base randomisation (a.k.a. KASLR) was disabled, rending GDB all but
> impossible to use effectively.
> 
> The cause of the corruption was uncovered using a verity of different debugging
> techniques and was eventually tracked down to page table manipulation in early
> architecture setup.
> 
> The following line in arch/x86/realmode/init.c [0] allocates a variable, just 8
> Bytes in size, to "hold the pgd entry used on booting additional CPUs":
> 
>   pgd_t trampoline_pgd_entry;
> 
> The address of that variable is then passed from init_trampoline_kaslr() via a
> call to set_pgd() [1] to have a value (not too important here) assigned to it.
> Numerous abstractions take place, eventually leading to native_set_p4d(), an
> inline function [2] contained in arch/x86/include/asm/pgtable_64.h.
> 
> From here, intentionally or otherwise, a call to pti_set_user_pgtbl() is made.
> This is where the out-of-bounds write eventually occurs.  It is not known (by
> me) why this function is called.  The returned result is subsequently used as a
> value to write using the WRITE_ONCE macro.  Perhaps the premature write is not
> intended.  This is what I hope to find out.
> 
> A little way down in pti_set_user_pgtbl() [3] the following line occurs:
> 
>   kernel_to_user_pgdp(pgdp)->pgd = pgd.pgd
> 
> The kernel_to_user_pgdp() part takes the address of pgdp (a.k.a.
> trampoline_pgd_entry) and ends up flipping the 12th bit, essentially adding 4k
> (0x1000) to the address.  Then the part at the end assigns our value (still not
> important here) to it.  However, if we remember that only 8 Bytes was allocated
> (globally) for trampoline_pgd_entry, then means we just stored the value into
> the outlands (what we now know to be allocated to another global storage user
> ucounts_hashtable).
> 
> - Ask
> 
> Hopefully I haven't messed anything up in the explanation here.  Please let me
> know if this is the case.  I'm keen to understand what the intention was and
> what we might do to fix it, if of course this hasn't already been remedied
> somewhere.
> 
> As ever, any help / guidance would be gratefully received.

This is the PTI, or Page-Table-Isolation muck that is the Meltdown
mitigation. Basically we end up running with 2 sets of page-tables, a
kernel and user pagetable, where the user pagetable only contains the
bare minimum of the kernel to make the transition (and consequently
userspace cannot access the kernel data).

The way this is done is by making the PGD two consecutive pages aligned
to 2*PAGE_SIZE, and a switch between the two CR3 values is a simple
bitop. The kernel is the first (or lower address) while the user is the
second (or higher address).

This is what kernel_to_user_pgdp() is doing, and what you'll find in
things like SWITCH_TO_{KERNEL,USER}_CR3 from arch/x86/entry/calling.h.

Now the problem is that:

1) this kaslr trampoline stuff does not play by the same rules, but
triggers it because it's about the first 1M of memory -- aka userspace
range.

2) we've already set X86_FEATURE_PTI by the time this happens

I've not yet remebered enough of this crap to figure out the best way to
cure this. Clearly we don't need PTI to perform kalsr, but all of this
early setup is a giant maze so perhaps PTI is needed for something else
this early.

IFF not, the solution might be to set PTI later.



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: x86: pgtable / kaslr initialisation (OOB) help
  2023-06-14 14:16 ` Peter Zijlstra
@ 2023-06-14 14:37   ` Lee Jones
  2023-06-14 14:45     ` Dave Hansen
  0 siblings, 1 reply; 9+ messages in thread
From: Lee Jones @ 2023-06-14 14:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, linux-kernel, linux-mm

On Wed, 14 Jun 2023, Peter Zijlstra wrote:

> On Wed, Jun 14, 2023 at 02:23:39PM +0100, Lee Jones wrote:
> > Good afternoon,
> > 
> > I'm looking for a little expert help with a out-of-bounds issue I've
> > been working on.  Please let me know if my present understanding does
> > not quite match reality.
> > 
> > - Report
> > 
> > The following highly indeterministic kernel panic was reported to occur every
> > few hundred boots:
> > 
> >     Kernel panic - not syncing: Fatal exception
> >      RIP: 0010:alloc_ucounts+0x68/0x280
> >      RSP: 0018:ffffb53ac13dfe98 EFLAGS: 00010006
> >      RAX: 0000000000000000 RBX: 000000015b804063 RCX: ffff9bb60d5aa500
> >      RDX: 0000000000000001 RSI: 00000000000003fb RDI: 0000000000000001
> >      RBP: ffffb53ac13dfec0 R08: 0000000000000000 R09: ffffd53abf600000
> >      R10: ffff9bb60b4bdd80 R11: ffffffffbcde7bb0 R12: ffffffffbf04e710
> >      R13: ffffffffbf405160 R14: 00000000000003fb R15: ffffffffbf1afc60
> >      FS:  00007f5be44d5000(0000) GS:ffff9bb6b9cc0000(0000) knlGS:0000000000000000
> >      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >      CR2: 000000015b80407b CR3: 00000001038ea000 CR4: 00000000000406a0
> >      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >      Call Trace:
> >       <TASK>
> >       __sys_setuid+0x1a0/0x290
> >       __x64_sys_setuid+0xc/0x10
> >       do_syscall_64+0x43/0x90
> >       ? asm_exc_page_fault+0x8/0x30
> >       entry_SYSCALL_64_after_hwframe+0x44/0xae
> > 
> > - Problem
> > 
> > The issue was eventually tracked down to the attempted dereference of a value
> > located in a corrupted hash table.  ucounts_hashtable is an array of 1024
> > struct hlists.  Each element is the head of its own linked list where previous
> > ucount allocations are stored.  The [20]th element of ucounts_hashtable was
> > being consistently trashed on each and every boot.  However the indeterminism
> > comes from it being accessed only every few hundred boots.
> > 
> > The issue disappeared, or was at least unidentifiable when !(LTO=full) or when
> > memory base randomisation (a.k.a. KASLR) was disabled, rending GDB all but
> > impossible to use effectively.
> > 
> > The cause of the corruption was uncovered using a verity of different debugging
> > techniques and was eventually tracked down to page table manipulation in early
> > architecture setup.
> > 
> > The following line in arch/x86/realmode/init.c [0] allocates a variable, just 8
> > Bytes in size, to "hold the pgd entry used on booting additional CPUs":
> > 
> >   pgd_t trampoline_pgd_entry;
> > 
> > The address of that variable is then passed from init_trampoline_kaslr() via a
> > call to set_pgd() [1] to have a value (not too important here) assigned to it.
> > Numerous abstractions take place, eventually leading to native_set_p4d(), an
> > inline function [2] contained in arch/x86/include/asm/pgtable_64.h.
> > 
> > From here, intentionally or otherwise, a call to pti_set_user_pgtbl() is made.
> > This is where the out-of-bounds write eventually occurs.  It is not known (by
> > me) why this function is called.  The returned result is subsequently used as a
> > value to write using the WRITE_ONCE macro.  Perhaps the premature write is not
> > intended.  This is what I hope to find out.
> > 
> > A little way down in pti_set_user_pgtbl() [3] the following line occurs:
> > 
> >   kernel_to_user_pgdp(pgdp)->pgd = pgd.pgd
> > 
> > The kernel_to_user_pgdp() part takes the address of pgdp (a.k.a.
> > trampoline_pgd_entry) and ends up flipping the 12th bit, essentially adding 4k
> > (0x1000) to the address.  Then the part at the end assigns our value (still not
> > important here) to it.  However, if we remember that only 8 Bytes was allocated
> > (globally) for trampoline_pgd_entry, then means we just stored the value into
> > the outlands (what we now know to be allocated to another global storage user
> > ucounts_hashtable).
> > 
> > - Ask
> > 
> > Hopefully I haven't messed anything up in the explanation here.  Please let me
> > know if this is the case.  I'm keen to understand what the intention was and
> > what we might do to fix it, if of course this hasn't already been remedied
> > somewhere.
> > 
> > As ever, any help / guidance would be gratefully received.
> 
> This is the PTI, or Page-Table-Isolation muck that is the Meltdown
> mitigation. Basically we end up running with 2 sets of page-tables, a
> kernel and user pagetable, where the user pagetable only contains the
> bare minimum of the kernel to make the transition (and consequently
> userspace cannot access the kernel data).
> 
> The way this is done is by making the PGD two consecutive pages aligned
> to 2*PAGE_SIZE, and a switch between the two CR3 values is a simple
> bitop. The kernel is the first (or lower address) while the user is the
> second (or higher address).
> 
> This is what kernel_to_user_pgdp() is doing, and what you'll find in
> things like SWITCH_TO_{KERNEL,USER}_CR3 from arch/x86/entry/calling.h.
> 
> Now the problem is that:
> 
> 1) this kaslr trampoline stuff does not play by the same rules, but
> triggers it because it's about the first 1M of memory -- aka userspace
> range.
> 
> 2) we've already set X86_FEATURE_PTI by the time this happens
> 
> I've not yet remebered enough of this crap to figure out the best way to
> cure this. Clearly we don't need PTI to perform kalsr, but all of this
> early setup is a giant maze so perhaps PTI is needed for something else
> this early.
> 
> IFF not, the solution might be to set PTI later.

I see.

Still unsure how we (the kernel) can/should write to an area of memory
that does not belong to it.  Should we allocate enough memory
(2*PAGE_SIZE? rather than 8-Bytes) for trampoline_pgd_entry to consume
in a more sane way?

-- 
Lee Jones [李琼斯]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: x86: pgtable / kaslr initialisation (OOB) help
  2023-06-14 14:37   ` Lee Jones
@ 2023-06-14 14:45     ` Dave Hansen
  2023-06-14 15:06       ` Lee Jones
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Hansen @ 2023-06-14 14:45 UTC (permalink / raw)
  To: Lee Jones, Peter Zijlstra
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, linux-kernel, linux-mm

On 6/14/23 07:37, Lee Jones wrote:
> Still unsure how we (the kernel) can/should write to an area of memory
> that does not belong to it.  Should we allocate enough memory
> (2*PAGE_SIZE? rather than 8-Bytes) for trampoline_pgd_entry to consume
> in a more sane way?

No.

I think this:

                set_pgd(&trampoline_pgd_entry,
                        __pgd(_KERNPG_TABLE | __pa(p4d_page_tramp)));

is bogus-ish.  set_pgd() wants to operate on a pgd_t inside a pgd
*PAGE*.  But it's just being pointed at a single  _entry_.  The address
of 'trampoline_pgd_entry' in your case  also just (unfortunately)
happens to pass the:

	__pti_set_user_pgtbl -> pgdp_maps_userspace()

test.  I _think_ we want these to just be something like:

	trampoline_pgd_entry = __pgd(_KERNPG_TABLE |
				     __pa(p4d_page_tramp);

That'll keep us away from all of the set_pgd()-induced nastiness.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: x86: pgtable / kaslr initialisation (OOB) help
  2023-06-14 14:45     ` Dave Hansen
@ 2023-06-14 15:06       ` Lee Jones
  2023-06-14 15:10         ` Lee Jones
  0 siblings, 1 reply; 9+ messages in thread
From: Lee Jones @ 2023-06-14 15:06 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Zijlstra, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin, linux-kernel,
	linux-mm

Thanks for chiming in Dave.  I hoped you would.

On Wed, 14 Jun 2023, Dave Hansen wrote:

> On 6/14/23 07:37, Lee Jones wrote:
> > Still unsure how we (the kernel) can/should write to an area of memory
> > that does not belong to it.  Should we allocate enough memory
> > (2*PAGE_SIZE? rather than 8-Bytes) for trampoline_pgd_entry to consume
> > in a more sane way?
> 
> No.
> 
> I think this:
> 
>                 set_pgd(&trampoline_pgd_entry,
>                         __pgd(_KERNPG_TABLE | __pa(p4d_page_tramp)));
> 
> is bogus-ish.  set_pgd() wants to operate on a pgd_t inside a pgd
> *PAGE*.  But it's just being pointed at a single  _entry_.  The address
> of 'trampoline_pgd_entry' in your case  also just (unfortunately)
> happens to pass the:
> 
> 	__pti_set_user_pgtbl -> pgdp_maps_userspace()
> 
> test.  I _think_ we want these to just be something like:
> 
> 	trampoline_pgd_entry = __pgd(_KERNPG_TABLE |
> 				     __pa(p4d_page_tramp);
> 
> That'll keep us away from all of the set_pgd()-induced nastiness.

Okay.  Is this what you're suggesting?

diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c                 v
index d336bb0cb38b..803595c7dcc8 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -176,7 +176,7 @@ void __meminit init_trampoline_kaslr(void)
                set_pgd(&trampoline_pgd_entry,
                        __pgd(_KERNPG_TABLE | __pa(p4d_page_tramp)));
        } else {
-               set_pgd(&trampoline_pgd_entry,
-                       __pgd(_KERNPG_TABLE | __pa(pud_page_tramp)));
+               trampoline_pgd_entry =
+                       __pgd(_KERNPG_TABLE | __pa(p4d_page_tramp);
        }
 }

If so, I'll take it for a spin right now.

-- 
Lee Jones [李琼斯]

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: x86: pgtable / kaslr initialisation (OOB) help
  2023-06-14 15:06       ` Lee Jones
@ 2023-06-14 15:10         ` Lee Jones
  2023-06-14 15:26           ` Lee Jones
  0 siblings, 1 reply; 9+ messages in thread
From: Lee Jones @ 2023-06-14 15:10 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Zijlstra, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin, linux-kernel,
	linux-mm

On Wed, 14 Jun 2023, Lee Jones wrote:

> Thanks for chiming in Dave.  I hoped you would.
> 
> On Wed, 14 Jun 2023, Dave Hansen wrote:
> 
> > On 6/14/23 07:37, Lee Jones wrote:
> > > Still unsure how we (the kernel) can/should write to an area of memory
> > > that does not belong to it.  Should we allocate enough memory
> > > (2*PAGE_SIZE? rather than 8-Bytes) for trampoline_pgd_entry to consume
> > > in a more sane way?
> > 
> > No.
> > 
> > I think this:
> > 
> >                 set_pgd(&trampoline_pgd_entry,
> >                         __pgd(_KERNPG_TABLE | __pa(p4d_page_tramp)));
> > 
> > is bogus-ish.  set_pgd() wants to operate on a pgd_t inside a pgd
> > *PAGE*.  But it's just being pointed at a single  _entry_.  The address
> > of 'trampoline_pgd_entry' in your case  also just (unfortunately)
> > happens to pass the:
> > 
> > 	__pti_set_user_pgtbl -> pgdp_maps_userspace()
> > 
> > test.  I _think_ we want these to just be something like:
> > 
> > 	trampoline_pgd_entry = __pgd(_KERNPG_TABLE |
> > 				     __pa(p4d_page_tramp);
> > 
> > That'll keep us away from all of the set_pgd()-induced nastiness.
> 
> Okay.  Is this what you're suggesting?
> 
> diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c                 v
> index d336bb0cb38b..803595c7dcc8 100644
> --- a/arch/x86/mm/kaslr.c
> +++ b/arch/x86/mm/kaslr.c
> @@ -176,7 +176,7 @@ void __meminit init_trampoline_kaslr(void)
>                 set_pgd(&trampoline_pgd_entry,
>                         __pgd(_KERNPG_TABLE | __pa(p4d_page_tramp)));
>         } else {
> -               set_pgd(&trampoline_pgd_entry,
> -                       __pgd(_KERNPG_TABLE | __pa(pud_page_tramp)));
> +               trampoline_pgd_entry =
> +                       __pgd(_KERNPG_TABLE | __pa(p4d_page_tramp);

Note the change of *.page_tramp here.

  s/pud/p4d/

I'm assuming that too was intentional?

>         }
>  }
> 
> If so, I'll take it for a spin right now.

-- 
Lee Jones [李琼斯]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: x86: pgtable / kaslr initialisation (OOB) help
  2023-06-14 15:10         ` Lee Jones
@ 2023-06-14 15:26           ` Lee Jones
  2023-06-14 16:01             ` Dave Hansen
  0 siblings, 1 reply; 9+ messages in thread
From: Lee Jones @ 2023-06-14 15:26 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Zijlstra, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin, linux-kernel,
	linux-mm

On Wed, 14 Jun 2023, Lee Jones wrote:

> On Wed, 14 Jun 2023, Lee Jones wrote:
> 
> > Thanks for chiming in Dave.  I hoped you would.
> > 
> > On Wed, 14 Jun 2023, Dave Hansen wrote:
> > 
> > > On 6/14/23 07:37, Lee Jones wrote:
> > > > Still unsure how we (the kernel) can/should write to an area of memory
> > > > that does not belong to it.  Should we allocate enough memory
> > > > (2*PAGE_SIZE? rather than 8-Bytes) for trampoline_pgd_entry to consume
> > > > in a more sane way?
> > > 
> > > No.
> > > 
> > > I think this:
> > > 
> > >                 set_pgd(&trampoline_pgd_entry,
> > >                         __pgd(_KERNPG_TABLE | __pa(p4d_page_tramp)));
> > > 
> > > is bogus-ish.  set_pgd() wants to operate on a pgd_t inside a pgd
> > > *PAGE*.  But it's just being pointed at a single  _entry_.  The address
> > > of 'trampoline_pgd_entry' in your case  also just (unfortunately)
> > > happens to pass the:
> > > 
> > > 	__pti_set_user_pgtbl -> pgdp_maps_userspace()
> > > 
> > > test.  I _think_ we want these to just be something like:
> > > 
> > > 	trampoline_pgd_entry = __pgd(_KERNPG_TABLE |
> > > 				     __pa(p4d_page_tramp);
> > > 
> > > That'll keep us away from all of the set_pgd()-induced nastiness.
> > 
> > Okay.  Is this what you're suggesting?
> > 
> > diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c                 v
> > index d336bb0cb38b..803595c7dcc8 100644
> > --- a/arch/x86/mm/kaslr.c
> > +++ b/arch/x86/mm/kaslr.c
> > @@ -176,7 +176,7 @@ void __meminit init_trampoline_kaslr(void)
> >                 set_pgd(&trampoline_pgd_entry,
> >                         __pgd(_KERNPG_TABLE | __pa(p4d_page_tramp)));
> >         } else {
> > -               set_pgd(&trampoline_pgd_entry,
> > -                       __pgd(_KERNPG_TABLE | __pa(pud_page_tramp)));
> > +               trampoline_pgd_entry =
> > +                       __pgd(_KERNPG_TABLE | __pa(p4d_page_tramp);
> 
> Note the change of *.page_tramp here.
> 
>   s/pud/p4d/
> 
> I'm assuming that too was intentional?

Never mind.  I can see that p4d_page_tramp is local to the if() segment.

While we're at it, does the if() segment look correct to you:

  if (pgtable_l5_enabled()) {
        p4d_page_tramp = alloc_low_page();

        p4d_tramp = p4d_page_tramp + p4d_index(paddr);

        set_p4d(p4d_tramp,
                __p4d(_KERNPG_TABLE | __pa(pud_page_tramp)));

        set_pgd(&trampoline_pgd_entry,
                __pgd(_KERNPG_TABLE | __pa(p4d_page_tramp)));
  } else {
        trampoline_pgd_entry =
                __pgd(_KERNPG_TABLE | __pa(pud_page_tramp));
  }

 - pud_page_tramp is being passed to set_p4d()
 - p4d_page_tramp is being passed to set_pgd()

Should those be the other way around, or am I missing the point?

-- 
Lee Jones [李琼斯]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: x86: pgtable / kaslr initialisation (OOB) help
  2023-06-14 15:26           ` Lee Jones
@ 2023-06-14 16:01             ` Dave Hansen
  2023-06-14 16:09               ` Lee Jones
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Hansen @ 2023-06-14 16:01 UTC (permalink / raw)
  To: Lee Jones
  Cc: Peter Zijlstra, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin, linux-kernel,
	linux-mm

[-- Attachment #1: Type: text/plain, Size: 3297 bytes --]

On 6/14/23 08:26, Lee Jones wrote:
> On Wed, 14 Jun 2023, Lee Jones wrote:
> 
>> On Wed, 14 Jun 2023, Lee Jones wrote:
>>
>>> Thanks for chiming in Dave.  I hoped you would.
>>>
>>> On Wed, 14 Jun 2023, Dave Hansen wrote:
>>>
>>>> On 6/14/23 07:37, Lee Jones wrote:
>>>>> Still unsure how we (the kernel) can/should write to an area of memory
>>>>> that does not belong to it.  Should we allocate enough memory
>>>>> (2*PAGE_SIZE? rather than 8-Bytes) for trampoline_pgd_entry to consume
>>>>> in a more sane way?
>>>>
>>>> No.
>>>>
>>>> I think this:
>>>>
>>>>                 set_pgd(&trampoline_pgd_entry,
>>>>                         __pgd(_KERNPG_TABLE | __pa(p4d_page_tramp)));
>>>>
>>>> is bogus-ish.  set_pgd() wants to operate on a pgd_t inside a pgd
>>>> *PAGE*.  But it's just being pointed at a single  _entry_.  The address
>>>> of 'trampoline_pgd_entry' in your case  also just (unfortunately)
>>>> happens to pass the:
>>>>
>>>> 	__pti_set_user_pgtbl -> pgdp_maps_userspace()
>>>>
>>>> test.  I _think_ we want these to just be something like:
>>>>
>>>> 	trampoline_pgd_entry = __pgd(_KERNPG_TABLE |
>>>> 				     __pa(p4d_page_tramp);
>>>>
>>>> That'll keep us away from all of the set_pgd()-induced nastiness.
>>>
>>> Okay.  Is this what you're suggesting?
>>>
>>> diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c                 v
>>> index d336bb0cb38b..803595c7dcc8 100644
>>> --- a/arch/x86/mm/kaslr.c
>>> +++ b/arch/x86/mm/kaslr.c
>>> @@ -176,7 +176,7 @@ void __meminit init_trampoline_kaslr(void)
>>>                 set_pgd(&trampoline_pgd_entry,
>>>                         __pgd(_KERNPG_TABLE | __pa(p4d_page_tramp)));
>>>         } else {
>>> -               set_pgd(&trampoline_pgd_entry,
>>> -                       __pgd(_KERNPG_TABLE | __pa(pud_page_tramp)));
>>> +               trampoline_pgd_entry =
>>> +                       __pgd(_KERNPG_TABLE | __pa(p4d_page_tramp);
>>
>> Note the change of *.page_tramp here.
>>
>>   s/pud/p4d/
>>
>> I'm assuming that too was intentional?
> 
> Never mind.  I can see that p4d_page_tramp is local to the if() segment.
> 
> While we're at it, does the if() segment look correct to you:
> 
>   if (pgtable_l5_enabled()) {
>         p4d_page_tramp = alloc_low_page();
> 
>         p4d_tramp = p4d_page_tramp + p4d_index(paddr);
> 
>         set_p4d(p4d_tramp,
>                 __p4d(_KERNPG_TABLE | __pa(pud_page_tramp)));
> 
>         set_pgd(&trampoline_pgd_entry,
>                 __pgd(_KERNPG_TABLE | __pa(p4d_page_tramp)));
>   } else {
>         trampoline_pgd_entry =
>                 __pgd(_KERNPG_TABLE | __pa(pud_page_tramp));
>   }
> 
>  - pud_page_tramp is being passed to set_p4d()
>  - p4d_page_tramp is being passed to set_pgd()
> 
> Should those be the other way around, or am I missing the point?

You're missing the point. :)

PGDs are always set up to point to the physical address of the thing at
one lower level than them.  A page is allocated for that level when
5-level paging is in play.  No page is needed when it is not in play.

The pattern is _almost_ always

	pgd = ... __pa(p4d);

In other words, point the PGD at the physical address of a p4d.  But
things get funky on systems without p4ds, thus the special casing here.

Does the (completely untested) attached patch fix your problem?

[-- Attachment #2: trampoline_pgd_entry.patch --]
[-- Type: text/x-patch, Size: 794 bytes --]



---

 b/arch/x86/mm/kaslr.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff -puN arch/x86/mm/kaslr.c~trampoline_pgd_entry arch/x86/mm/kaslr.c
--- a/arch/x86/mm/kaslr.c~trampoline_pgd_entry	2023-06-14 08:54:08.685554094 -0700
+++ b/arch/x86/mm/kaslr.c	2023-06-14 08:55:36.077089793 -0700
@@ -172,10 +172,10 @@ void __meminit init_trampoline_kaslr(voi
 		set_p4d(p4d_tramp,
 			__p4d(_KERNPG_TABLE | __pa(pud_page_tramp)));
 
-		set_pgd(&trampoline_pgd_entry,
-			__pgd(_KERNPG_TABLE | __pa(p4d_page_tramp)));
+		trampoline_pgd_entry =
+			__pgd(_KERNPG_TABLE | __pa(p4d_page_tramp));
 	} else {
-		set_pgd(&trampoline_pgd_entry,
-			__pgd(_KERNPG_TABLE | __pa(pud_page_tramp)));
+		trampoline_pgd_entry =
+		       	__pgd(_KERNPG_TABLE | __pa(pud_page_tramp));
 	}
 }
_

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: x86: pgtable / kaslr initialisation (OOB) help
  2023-06-14 16:01             ` Dave Hansen
@ 2023-06-14 16:09               ` Lee Jones
  0 siblings, 0 replies; 9+ messages in thread
From: Lee Jones @ 2023-06-14 16:09 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Zijlstra, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin, linux-kernel,
	linux-mm

On Wed, 14 Jun 2023, Dave Hansen wrote:

> On 6/14/23 08:26, Lee Jones wrote:
> > On Wed, 14 Jun 2023, Lee Jones wrote:
> > 
> >> On Wed, 14 Jun 2023, Lee Jones wrote:
> >>
> >>> Thanks for chiming in Dave.  I hoped you would.
> >>>
> >>> On Wed, 14 Jun 2023, Dave Hansen wrote:
> >>>
> >>>> On 6/14/23 07:37, Lee Jones wrote:
> >>>>> Still unsure how we (the kernel) can/should write to an area of memory
> >>>>> that does not belong to it.  Should we allocate enough memory
> >>>>> (2*PAGE_SIZE? rather than 8-Bytes) for trampoline_pgd_entry to consume
> >>>>> in a more sane way?
> >>>>
> >>>> No.
> >>>>
> >>>> I think this:
> >>>>
> >>>>                 set_pgd(&trampoline_pgd_entry,
> >>>>                         __pgd(_KERNPG_TABLE | __pa(p4d_page_tramp)));
> >>>>
> >>>> is bogus-ish.  set_pgd() wants to operate on a pgd_t inside a pgd
> >>>> *PAGE*.  But it's just being pointed at a single  _entry_.  The address
> >>>> of 'trampoline_pgd_entry' in your case  also just (unfortunately)
> >>>> happens to pass the:
> >>>>
> >>>> 	__pti_set_user_pgtbl -> pgdp_maps_userspace()
> >>>>
> >>>> test.  I _think_ we want these to just be something like:
> >>>>
> >>>> 	trampoline_pgd_entry = __pgd(_KERNPG_TABLE |
> >>>> 				     __pa(p4d_page_tramp);
> >>>>
> >>>> That'll keep us away from all of the set_pgd()-induced nastiness.
> >>>
> >>> Okay.  Is this what you're suggesting?
> >>>
> >>> diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c                 v
> >>> index d336bb0cb38b..803595c7dcc8 100644
> >>> --- a/arch/x86/mm/kaslr.c
> >>> +++ b/arch/x86/mm/kaslr.c
> >>> @@ -176,7 +176,7 @@ void __meminit init_trampoline_kaslr(void)
> >>>                 set_pgd(&trampoline_pgd_entry,
> >>>                         __pgd(_KERNPG_TABLE | __pa(p4d_page_tramp)));
> >>>         } else {
> >>> -               set_pgd(&trampoline_pgd_entry,
> >>> -                       __pgd(_KERNPG_TABLE | __pa(pud_page_tramp)));
> >>> +               trampoline_pgd_entry =
> >>> +                       __pgd(_KERNPG_TABLE | __pa(p4d_page_tramp);
> >>
> >> Note the change of *.page_tramp here.
> >>
> >>   s/pud/p4d/
> >>
> >> I'm assuming that too was intentional?
> > 
> > Never mind.  I can see that p4d_page_tramp is local to the if() segment.
> > 
> > While we're at it, does the if() segment look correct to you:
> > 
> >   if (pgtable_l5_enabled()) {
> >         p4d_page_tramp = alloc_low_page();
> > 
> >         p4d_tramp = p4d_page_tramp + p4d_index(paddr);
> > 
> >         set_p4d(p4d_tramp,
> >                 __p4d(_KERNPG_TABLE | __pa(pud_page_tramp)));
> > 
> >         set_pgd(&trampoline_pgd_entry,
> >                 __pgd(_KERNPG_TABLE | __pa(p4d_page_tramp)));
> >   } else {
> >         trampoline_pgd_entry =
> >                 __pgd(_KERNPG_TABLE | __pa(pud_page_tramp));
> >   }
> > 
> >  - pud_page_tramp is being passed to set_p4d()
> >  - p4d_page_tramp is being passed to set_pgd()
> > 
> > Should those be the other way around, or am I missing the point?
> 
> You're missing the point. :)

Super, thanks for the explanation.
 
> PGDs are always set up to point to the physical address of the thing at
> one lower level than them.  A page is allocated for that level when
> 5-level paging is in play.  No page is needed when it is not in play.
> 
> The pattern is _almost_ always
> 
> 	pgd = ... __pa(p4d);
> 
> In other words, point the PGD at the physical address of a p4d.  But
> things get funky on systems without p4ds, thus the special casing here.
> 
> Does the (completely untested) attached patch fix your problem?

I just submitted a (tested) patch.

It doesn't cover the if() segment though.  I'll do so and resubmit.

> ---
> 
>  b/arch/x86/mm/kaslr.c |    8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff -puN arch/x86/mm/kaslr.c~trampoline_pgd_entry arch/x86/mm/kaslr.c
> --- a/arch/x86/mm/kaslr.c~trampoline_pgd_entry	2023-06-14 08:54:08.685554094 -0700
> +++ b/arch/x86/mm/kaslr.c	2023-06-14 08:55:36.077089793 -0700
> @@ -172,10 +172,10 @@ void __meminit init_trampoline_kaslr(voi
>  		set_p4d(p4d_tramp,
>  			__p4d(_KERNPG_TABLE | __pa(pud_page_tramp)));
>  
> -		set_pgd(&trampoline_pgd_entry,
> -			__pgd(_KERNPG_TABLE | __pa(p4d_page_tramp)));
> +		trampoline_pgd_entry =
> +			__pgd(_KERNPG_TABLE | __pa(p4d_page_tramp));
>  	} else {
> -		set_pgd(&trampoline_pgd_entry,
> -			__pgd(_KERNPG_TABLE | __pa(pud_page_tramp)));
> +		trampoline_pgd_entry =
> +		       	__pgd(_KERNPG_TABLE | __pa(pud_page_tramp));
>  	}
>  }
> _


-- 
Lee Jones [李琼斯]

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2023-06-14 16:09 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-14 13:23 x86: pgtable / kaslr initialisation (OOB) help Lee Jones
2023-06-14 14:16 ` Peter Zijlstra
2023-06-14 14:37   ` Lee Jones
2023-06-14 14:45     ` Dave Hansen
2023-06-14 15:06       ` Lee Jones
2023-06-14 15:10         ` Lee Jones
2023-06-14 15:26           ` Lee Jones
2023-06-14 16:01             ` Dave Hansen
2023-06-14 16:09               ` Lee Jones

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).