* Re: [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline
2018-09-03 22:59 ` [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline Andy Lutomirski
@ 2018-09-04 7:04 ` Peter Zijlstra
2018-09-05 21:31 ` Andy Lutomirski
2018-09-07 9:35 ` Borislav Petkov
` (5 subsequent siblings)
6 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2018-09-04 7:04 UTC (permalink / raw)
To: Andy Lutomirski
Cc: x86, Borislav Petkov, LKML, Dave Hansen, Adrian Hunter,
Alexander Shishkin, Arnaldo Carvalho de Melo, Linus Torvalds,
Josh Poimboeuf, Joerg Roedel, Jiri Olsa, Andi Kleen
On Mon, Sep 03, 2018 at 03:59:44PM -0700, Andy Lutomirski wrote:
> The SYSCALL64 trampoline has a couple of nice properties:
>
> - The usual sequence of SWAPGS followed by two GS-relative accesses to
> set up RSP is somewhat slow because the GS-relative accesses need
> to wait for SWAPGS to finish. The trampoline approach allows
> RIP-relative accesses to set up RSP, which avoids the stall.
>
> - The trampoline avoids any percpu access before CR3 is set up,
> which means that no percpu memory needs to be mapped in the user
> page tables. This prevents using Meltdown to read any percpu memory
> outside the cpu_entry_area and prevents using timing leaks
> to directly locate the percpu areas.
>
> The downsides of using a trampoline may outweigh the upsides, however.
> It adds an extra non-contiguous I$ cache line to system calls, and it
> forces an indirect jump to transfer control back to the normal kernel
> text after CR3 is set up. The latter is because x86 lacks a 64-bit
> direct jump instruction that could jump from the trampoline to the entry
> text. With retpolines enabled, the indirect jump is extremely slow.
>
> This patch changes the code to map the percpu TSS into the user page
> tables to allow the non-trampoline SYSCALL64 path to work under PTI.
> This does not add a new direct information leak, since the TSS is
> readable by Meltdown from the cpu_entry_area alias regardless. It
> does allow a timing attack to locate the percpu area, but KASLR is
> more or less a lost cause against local attack on CPUs vulnerable to
> Meltdown regardless. As far as I'm concerned, on current hardware,
> KASLR is only useful to mitigate remote attacks that try to attack
> the kernel without first gaining RCE against a vulnerable user
> process.
>
> On Skylake, with CONFIG_RETPOLINE=y and KPTI on, this reduces
> syscall overhead from ~237ns to ~228ns.
>
> There is a possible alternative approach: we could instead move the
> trampoline within 2G of the entry text and make a separate copy for
> each CPU. Then we could use a direct jump to rejoin the normal
> entry path.
Can we have a few words on why this solution and not this alternative? I
mean, you raise the possibility, but then surely you chose not to
implement that. Might as well share that with us.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline
2018-09-04 7:04 ` Peter Zijlstra
@ 2018-09-05 21:31 ` Andy Lutomirski
2018-09-07 12:36 ` Peter Zijlstra
2018-09-07 19:54 ` Thomas Gleixner
0 siblings, 2 replies; 24+ messages in thread
From: Andy Lutomirski @ 2018-09-05 21:31 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Andy Lutomirski, X86 ML, Borislav Petkov, LKML, Dave Hansen,
Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
Linus Torvalds, Josh Poimboeuf, Joerg Roedel, Jiri Olsa,
Andi Kleen
On Tue, Sep 4, 2018 at 12:04 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Sep 03, 2018 at 03:59:44PM -0700, Andy Lutomirski wrote:
>> The SYSCALL64 trampoline has a couple of nice properties:
>>
>> - The usual sequence of SWAPGS followed by two GS-relative accesses to
>> set up RSP is somewhat slow because the GS-relative accesses need
>> to wait for SWAPGS to finish. The trampoline approach allows
>> RIP-relative accesses to set up RSP, which avoids the stall.
>>
>> - The trampoline avoids any percpu access before CR3 is set up,
>> which means that no percpu memory needs to be mapped in the user
>> page tables. This prevents using Meltdown to read any percpu memory
>> outside the cpu_entry_area and prevents using timing leaks
>> to directly locate the percpu areas.
>>
>> The downsides of using a trampoline may outweigh the upsides, however.
>> It adds an extra non-contiguous I$ cache line to system calls, and it
>> forces an indirect jump to transfer control back to the normal kernel
>> text after CR3 is set up. The latter is because x86 lacks a 64-bit
>> direct jump instruction that could jump from the trampoline to the entry
>> text. With retpolines enabled, the indirect jump is extremely slow.
>>
>> This patch changes the code to map the percpu TSS into the user page
>> tables to allow the non-trampoline SYSCALL64 path to work under PTI.
>> This does not add a new direct information leak, since the TSS is
>> readable by Meltdown from the cpu_entry_area alias regardless. It
>> does allow a timing attack to locate the percpu area, but KASLR is
>> more or less a lost cause against local attack on CPUs vulnerable to
>> Meltdown regardless. As far as I'm concerned, on current hardware,
>> KASLR is only useful to mitigate remote attacks that try to attack
>> the kernel without first gaining RCE against a vulnerable user
>> process.
>>
>> On Skylake, with CONFIG_RETPOLINE=y and KPTI on, this reduces
>> syscall overhead from ~237ns to ~228ns.
>>
>> There is a possible alternative approach: we could instead move the
>> trampoline within 2G of the entry text and make a separate copy for
>> each CPU. Then we could use a direct jump to rejoin the normal
>> entry path.
>
> Can we have a few words on why this solution and not this alternative? I
> mean, you raise the possibility, but then surely you chose not to
> implement that. Might as well share that with us.
I can give some pros and cons. With the other approach:
- We avoid a pipeline stall.
- We execute from an extra page and read from another extra page
during the syscall. (The latter is because we need to use a relative
addressing mode to find sp1 -- it's the same *cacheline* we'd use
anyway, but we're accessing it using an alias, so it's an extra TLB
entry.)
- We use more memory. This would be one page per CPU for a simple
implementation and 64-ish bytes per CPU or one page per node for a
more complex implementation.
- More code complexity.
I'm not convinced this is a good tradeoff.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline
2018-09-05 21:31 ` Andy Lutomirski
@ 2018-09-07 12:36 ` Peter Zijlstra
2018-09-07 19:54 ` Thomas Gleixner
1 sibling, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2018-09-07 12:36 UTC (permalink / raw)
To: Andy Lutomirski
Cc: X86 ML, Borislav Petkov, LKML, Dave Hansen, Adrian Hunter,
Alexander Shishkin, Arnaldo Carvalho de Melo, Linus Torvalds,
Josh Poimboeuf, Joerg Roedel, Jiri Olsa, Andi Kleen
On Wed, Sep 05, 2018 at 02:31:28PM -0700, Andy Lutomirski wrote:
> On Tue, Sep 4, 2018 at 12:04 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Mon, Sep 03, 2018 at 03:59:44PM -0700, Andy Lutomirski wrote:
> >> There is a possible alternative approach: we could instead move the
> >> trampoline within 2G of the entry text and make a separate copy for
> >> each CPU. Then we could use a direct jump to rejoin the normal
> >> entry path.
> >
> > Can we have a few words on why this solution and not this alternative? I
> > mean, you raise the possibility, but then surely you chose not to
> > implement that. Might as well share that with us.
>
> I can give some pros and cons. With the other approach:
>
> - We avoid a pipeline stall.
> - We execute from an extra page and read from another extra page
> during the syscall. (The latter is because we need to use a relative
> addressing mode to find sp1 -- it's the same *cacheline* we'd use
> anyway, but we're accessing it using an alias, so it's an extra TLB
> entry.)
> - We use more memory. This would be one page per CPU for a simple
> implementation and 64-ish bytes per CPU or one page per node for a
> more complex implementation.
> - More code complexity.
>
> I'm not convinced this is a good tradeoff.
Fair enough, thanks!
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline
2018-09-05 21:31 ` Andy Lutomirski
2018-09-07 12:36 ` Peter Zijlstra
@ 2018-09-07 19:54 ` Thomas Gleixner
2018-09-08 0:04 ` Linus Torvalds
1 sibling, 1 reply; 24+ messages in thread
From: Thomas Gleixner @ 2018-09-07 19:54 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Peter Zijlstra, X86 ML, Borislav Petkov, LKML, Dave Hansen,
Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
Linus Torvalds, Josh Poimboeuf, Joerg Roedel, Jiri Olsa,
Andi Kleen
On Wed, 5 Sep 2018, Andy Lutomirski wrote:
> On Tue, Sep 4, 2018 at 12:04 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > Can we have a few words on why this solution and not this alternative? I
> > mean, you raise the possibility, but then surely you chose not to
> > implement that. Might as well share that with us.
>
> I can give some pros and cons. With the other approach:
>
> - We avoid a pipeline stall.
Which is good.
> - We execute from an extra page and read from another extra page
> during the syscall. (The latter is because we need to use a relative
> addressing mode to find sp1 -- it's the same *cacheline* we'd use
> anyway, but we're accessing it using an alias, so it's an extra TLB
> entry.)
Ok, but is this really an issue with PTI?
> - We use more memory. This would be one page per CPU for a simple
> implementation and 64-ish bytes per CPU or one page per node for a
> more complex implementation.
That's the least interesting argument really.
> - More code complexity.
Ok, but how much complex code is that?
> I'm not convinced this is a good tradeoff.
Well, the real question here is whether this has any advantage vs. the
percpu area exposure?
Thanks,
tglx
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline
2018-09-07 19:54 ` Thomas Gleixner
@ 2018-09-08 0:04 ` Linus Torvalds
2018-09-08 4:32 ` Andy Lutomirski
2018-09-08 6:33 ` Thomas Gleixner
0 siblings, 2 replies; 24+ messages in thread
From: Linus Torvalds @ 2018-09-08 0:04 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Andrew Lutomirski, Peter Zijlstra, the arch/x86 maintainers,
Borislav Petkov, Linux Kernel Mailing List, Dave Hansen,
Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
Josh Poimboeuf, Joerg Roedel, Jiri Olsa, Andi Kleen
On Fri, Sep 7, 2018 at 12:54 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> > - We execute from an extra page and read from another extra page
> > during the syscall. (The latter is because we need to use a relative
> > addressing mode to find sp1 -- it's the same *cacheline* we'd use
> > anyway, but we're accessing it using an alias, so it's an extra TLB
> > entry.)
>
> Ok, but is this really an issue with PTI?
I'd expect it to be *more* of an issue with PTI, since you're already
wasting TLB entries due to the whole "two different page tables".
Sure, address space ID's save you from reloading them all the time,
but don't help with capacity.
But yeah, in the sense of "with PTI, all kernel entries are slow
anyway, so none of this matters" is probably correct in a very real
sense.
That said, the real reason I like Andy's patch series is that I think
it's simpler than the alternatives (including the current setup). No
subtle mappings, no nothing. It removes a lot more lines than it adds,
and half the lines that it *does* add are comments.
Virtual mapping tricks may be cool, but in the end, not having to use
them is better still, I think.
Linus
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline
2018-09-08 0:04 ` Linus Torvalds
@ 2018-09-08 4:32 ` Andy Lutomirski
2018-09-08 6:36 ` Thomas Gleixner
2018-09-08 6:33 ` Thomas Gleixner
1 sibling, 1 reply; 24+ messages in thread
From: Andy Lutomirski @ 2018-09-08 4:32 UTC (permalink / raw)
To: Linus Torvalds
Cc: Thomas Gleixner, Andrew Lutomirski, Peter Zijlstra,
the arch/x86 maintainers, Borislav Petkov,
Linux Kernel Mailing List, Dave Hansen, Adrian Hunter,
Alexander Shishkin, Arnaldo Carvalho de Melo, Josh Poimboeuf,
Joerg Roedel, Jiri Olsa, Andi Kleen
On Fri, Sep 7, 2018 at 5:04 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Sep 7, 2018 at 12:54 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> > - We execute from an extra page and read from another extra page
>> > during the syscall. (The latter is because we need to use a relative
>> > addressing mode to find sp1 -- it's the same *cacheline* we'd use
>> > anyway, but we're accessing it using an alias, so it's an extra TLB
>> > entry.)
>>
>> Ok, but is this really an issue with PTI?
>
> I'd expect it to be *more* of an issue with PTI, since you're already
> wasting TLB entries due to the whole "two different page tables".
>
> Sure, address space ID's save you from reloading them all the time,
> but don't help with capacity.
>
> But yeah, in the sense of "with PTI, all kernel entries are slow
> anyway, so none of this matters" is probably correct in a very real
> sense.
>
> That said, the real reason I like Andy's patch series is that I think
> it's simpler than the alternatives (including the current setup). No
> subtle mappings, no nothing. It removes a lot more lines than it adds,
> and half the lines that it *does* add are comments.
>
> Virtual mapping tricks may be cool, but in the end, not having to use
> them is better still, I think.
>
If (and this is a *big* if) all the percpu data is within 2GB of the
entry text, then we could avoid this extra TLB reference by accessing
it directly instead of using an alias.
I suppose the summary is that the retpoline-free trampoline variant is
even more complicated than the code I'm removing in this series, and
that it would be at best a teeny tiny win. Once all the Spectre dust
settles and we get fixed CPUs, we could consider re-optimizing this.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline
2018-09-08 4:32 ` Andy Lutomirski
@ 2018-09-08 6:36 ` Thomas Gleixner
0 siblings, 0 replies; 24+ messages in thread
From: Thomas Gleixner @ 2018-09-08 6:36 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Linus Torvalds, Peter Zijlstra, the arch/x86 maintainers,
Borislav Petkov, Linux Kernel Mailing List, Dave Hansen,
Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
Josh Poimboeuf, Joerg Roedel, Jiri Olsa, Andi Kleen
On Fri, 7 Sep 2018, Andy Lutomirski wrote:
> On Fri, Sep 7, 2018 at 5:04 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > Virtual mapping tricks may be cool, but in the end, not having to use
> > them is better still, I think.
> >
>
> If (and this is a *big* if) all the percpu data is within 2GB of the
> entry text, then we could avoid this extra TLB reference by accessing
> it directly instead of using an alias.
>
> I suppose the summary is that the retpoline-free trampoline variant is
> even more complicated than the code I'm removing in this series, and
> that it would be at best a teeny tiny win. Once all the Spectre dust
> settles and we get fixed CPUs, we could consider re-optimizing this.
That's going to be after my retirement ...
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline
2018-09-08 0:04 ` Linus Torvalds
2018-09-08 4:32 ` Andy Lutomirski
@ 2018-09-08 6:33 ` Thomas Gleixner
1 sibling, 0 replies; 24+ messages in thread
From: Thomas Gleixner @ 2018-09-08 6:33 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Lutomirski, Peter Zijlstra, the arch/x86 maintainers,
Borislav Petkov, Linux Kernel Mailing List, Dave Hansen,
Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
Josh Poimboeuf, Joerg Roedel, Jiri Olsa, Andi Kleen
On Fri, 7 Sep 2018, Linus Torvalds wrote:
> On Fri, Sep 7, 2018 at 12:54 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > > - We execute from an extra page and read from another extra page
> > > during the syscall. (The latter is because we need to use a relative
> > > addressing mode to find sp1 -- it's the same *cacheline* we'd use
> > > anyway, but we're accessing it using an alias, so it's an extra TLB
> > > entry.)
> >
> > Ok, but is this really an issue with PTI?
>
> I'd expect it to be *more* of an issue with PTI, since you're already
> wasting TLB entries due to the whole "two different page tables".
>
> Sure, address space ID's save you from reloading them all the time,
> but don't help with capacity.
>
> But yeah, in the sense of "with PTI, all kernel entries are slow
> anyway, so none of this matters" is probably correct in a very real
> sense.
Unfortunately yes. As Andy pointed out the alternative approach would trade
an avoided pipeline stall with an extra TLB.
> That said, the real reason I like Andy's patch series is that I think
> it's simpler than the alternatives (including the current setup). No
> subtle mappings, no nothing. It removes a lot more lines than it adds,
> and half the lines that it *does* add are comments.
I agree and in fact I had the series already queued for merging, I just
wanted to get the full picture and add some of that information to the
changelog.
> Virtual mapping tricks may be cool, but in the end, not having to use
> them is better still, I think.
Sure, unless they provide a real value. The maybe smaller leak of address
information is probably not worth it.
Thanks,
tglx
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline
2018-09-03 22:59 ` [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline Andy Lutomirski
2018-09-04 7:04 ` Peter Zijlstra
@ 2018-09-07 9:35 ` Borislav Petkov
2018-09-07 16:40 ` Josh Poimboeuf
` (4 subsequent siblings)
6 siblings, 0 replies; 24+ messages in thread
From: Borislav Petkov @ 2018-09-07 9:35 UTC (permalink / raw)
To: Andy Lutomirski
Cc: x86, LKML, Dave Hansen, Adrian Hunter, Alexander Shishkin,
Arnaldo Carvalho de Melo, Linus Torvalds, Josh Poimboeuf,
Joerg Roedel, Jiri Olsa, Andi Kleen, Peter Zijlstra
On Mon, Sep 03, 2018 at 03:59:44PM -0700, Andy Lutomirski wrote:
> The SYSCALL64 trampoline has a couple of nice properties:
>
> - The usual sequence of SWAPGS followed by two GS-relative accesses to
> set up RSP is somewhat slow because the GS-relative accesses need
> to wait for SWAPGS to finish. The trampoline approach allows
> RIP-relative accesses to set up RSP, which avoids the stall.
...
> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> index 31341ae7309f..7e79154846c8 100644
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -434,11 +434,42 @@ static void __init pti_clone_p4d(unsigned long addr)
> }
>
> /*
> - * Clone the CPU_ENTRY_AREA into the user space visible page table.
> + * Clone the CPU_ENTRY_AREA and associated data into the user space visible
> + * page table.
> */
> static void __init pti_clone_user_shared(void)
> {
> + unsigned cpu;
Make that
unsigned int cpu;
Otherwise, patches removing complex code are always good!
Reviewed-by: Borislav Petkov <bp@suse.de>
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline
2018-09-03 22:59 ` [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline Andy Lutomirski
2018-09-04 7:04 ` Peter Zijlstra
2018-09-07 9:35 ` Borislav Petkov
@ 2018-09-07 16:40 ` Josh Poimboeuf
2018-09-08 4:35 ` Andy Lutomirski
2018-09-08 9:35 ` [tip:x86/pti] " tip-bot for Andy Lutomirski
` (3 subsequent siblings)
6 siblings, 1 reply; 24+ messages in thread
From: Josh Poimboeuf @ 2018-09-07 16:40 UTC (permalink / raw)
To: Andy Lutomirski
Cc: x86, Borislav Petkov, LKML, Dave Hansen, Adrian Hunter,
Alexander Shishkin, Arnaldo Carvalho de Melo, Linus Torvalds,
Joerg Roedel, Jiri Olsa, Andi Kleen, Peter Zijlstra
On Mon, Sep 03, 2018 at 03:59:44PM -0700, Andy Lutomirski wrote:
> The SYSCALL64 trampoline has a couple of nice properties:
>
> - The usual sequence of SWAPGS followed by two GS-relative accesses to
> set up RSP is somewhat slow because the GS-relative accesses need
> to wait for SWAPGS to finish. The trampoline approach allows
> RIP-relative accesses to set up RSP, which avoids the stall.
>
> - The trampoline avoids any percpu access before CR3 is set up,
> which means that no percpu memory needs to be mapped in the user
> page tables. This prevents using Meltdown to read any percpu memory
> outside the cpu_entry_area and prevents using timing leaks
> to directly locate the percpu areas.
>
> The downsides of using a trampoline may outweigh the upsides, however.
> It adds an extra non-contiguous I$ cache line to system calls, and it
> forces an indirect jump to transfer control back to the normal kernel
> text after CR3 is set up. The latter is because x86 lacks a 64-bit
> direct jump instruction that could jump from the trampoline to the entry
> text. With retpolines enabled, the indirect jump is extremely slow.
>
> This patch changes the code to map the percpu TSS into the user page
> tables to allow the non-trampoline SYSCALL64 path to work under PTI.
> This does not add a new direct information leak, since the TSS is
> readable by Meltdown from the cpu_entry_area alias regardless. It
> does allow a timing attack to locate the percpu area, but KASLR is
> more or less a lost cause against local attack on CPUs vulnerable to
> Meltdown regardless. As far as I'm concerned, on current hardware,
> KASLR is only useful to mitigate remote attacks that try to attack
> the kernel without first gaining RCE against a vulnerable user
> process.
>
> On Skylake, with CONFIG_RETPOLINE=y and KPTI on, this reduces
> syscall overhead from ~237ns to ~228ns.
>
> There is a possible alternative approach: we could instead move the
> trampoline within 2G of the entry text and make a separate copy for
> each CPU. Then we could use a direct jump to rejoin the normal
> entry path.
>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
The following commit should also be reverted:
4d99e4136580 ("perf machine: Workaround missing maps for x86 PTI entry trampolines")
--
Josh
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline
2018-09-07 16:40 ` Josh Poimboeuf
@ 2018-09-08 4:35 ` Andy Lutomirski
0 siblings, 0 replies; 24+ messages in thread
From: Andy Lutomirski @ 2018-09-08 4:35 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: Andy Lutomirski, X86 ML, Borislav Petkov, LKML, Dave Hansen,
Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
Linus Torvalds, Joerg Roedel, Jiri Olsa, Andi Kleen,
Peter Zijlstra
On Fri, Sep 7, 2018 at 9:40 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Mon, Sep 03, 2018 at 03:59:44PM -0700, Andy Lutomirski wrote:
>> The SYSCALL64 trampoline has a couple of nice properties:
>>
>> - The usual sequence of SWAPGS followed by two GS-relative accesses to
>> set up RSP is somewhat slow because the GS-relative accesses need
>> to wait for SWAPGS to finish. The trampoline approach allows
>> RIP-relative accesses to set up RSP, which avoids the stall.
>>
>> - The trampoline avoids any percpu access before CR3 is set up,
>> which means that no percpu memory needs to be mapped in the user
>> page tables. This prevents using Meltdown to read any percpu memory
>> outside the cpu_entry_area and prevents using timing leaks
>> to directly locate the percpu areas.
>>
>> The downsides of using a trampoline may outweigh the upsides, however.
>> It adds an extra non-contiguous I$ cache line to system calls, and it
>> forces an indirect jump to transfer control back to the normal kernel
>> text after CR3 is set up. The latter is because x86 lacks a 64-bit
>> direct jump instruction that could jump from the trampoline to the entry
>> text. With retpolines enabled, the indirect jump is extremely slow.
>>
>> This patch changes the code to map the percpu TSS into the user page
>> tables to allow the non-trampoline SYSCALL64 path to work under PTI.
>> This does not add a new direct information leak, since the TSS is
>> readable by Meltdown from the cpu_entry_area alias regardless. It
>> does allow a timing attack to locate the percpu area, but KASLR is
>> more or less a lost cause against local attack on CPUs vulnerable to
>> Meltdown regardless. As far as I'm concerned, on current hardware,
>> KASLR is only useful to mitigate remote attacks that try to attack
>> the kernel without first gaining RCE against a vulnerable user
>> process.
>>
>> On Skylake, with CONFIG_RETPOLINE=y and KPTI on, this reduces
>> syscall overhead from ~237ns to ~228ns.
>>
>> There is a possible alternative approach: we could instead move the
>> trampoline within 2G of the entry text and make a separate copy for
>> each CPU. Then we could use a direct jump to rejoin the normal
>> entry path.
>>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>
> The following commit should also be reverted:
>
> 4d99e4136580 ("perf machine: Workaround missing maps for x86 PTI entry trampolines")
I think we shouldn't, since the perf folks want new perf versions to
be fully functional on older kernels. My plan was to let the perf
crew do whatever reversions they felt appropriate.
^ permalink raw reply [flat|nested] 24+ messages in thread
* [tip:x86/pti] x86/pti/64: Remove the SYSCALL64 entry trampoline
2018-09-03 22:59 ` [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline Andy Lutomirski
` (2 preceding siblings ...)
2018-09-07 16:40 ` Josh Poimboeuf
@ 2018-09-08 9:35 ` tip-bot for Andy Lutomirski
2018-09-08 9:57 ` tip-bot for Andy Lutomirski
` (2 subsequent siblings)
6 siblings, 0 replies; 24+ messages in thread
From: tip-bot for Andy Lutomirski @ 2018-09-08 9:35 UTC (permalink / raw)
To: linux-tip-commits
Cc: joro, tglx, bp, acme, mingo, peterz, jolsa, dave.hansen, bp,
alexander.shishkin, hpa, linux-kernel, jpoimboe, torvalds, ak,
adrian.hunter, luto
Commit-ID: 344347941aba1ec906ff50b4cdb8178c906746f2
Gitweb: https://git.kernel.org/tip/344347941aba1ec906ff50b4cdb8178c906746f2
Author: Andy Lutomirski <luto@kernel.org>
AuthorDate: Mon, 3 Sep 2018 15:59:44 -0700
Committer: Thomas Gleixner <tglx@linutronix.de>
CommitDate: Sat, 8 Sep 2018 11:20:12 +0200
x86/pti/64: Remove the SYSCALL64 entry trampoline
The SYSCALL64 trampoline has a couple of nice properties:
- The usual sequence of SWAPGS followed by two GS-relative accesses to
set up RSP is somewhat slow because the GS-relative accesses need
to wait for SWAPGS to finish. The trampoline approach allows
RIP-relative accesses to set up RSP, which avoids the stall.
- The trampoline avoids any percpu access before CR3 is set up,
which means that no percpu memory needs to be mapped in the user
page tables. This prevents using Meltdown to read any percpu memory
outside the cpu_entry_area and prevents using timing leaks
to directly locate the percpu areas.
The downsides of using a trampoline may outweigh the upsides, however.
It adds an extra non-contiguous I$ cache line to system calls, and it
forces an indirect jump to transfer control back to the normal kernel
text after CR3 is set up. The latter is because x86 lacks a 64-bit
direct jump instruction that could jump from the trampoline to the entry
text. With retpolines enabled, the indirect jump is extremely slow.
Change the code to map the percpu TSS into the user page tables to allow
the non-trampoline SYSCALL64 path to work under PTI. This does not add a
new direct information leak, since the TSS is readable by Meltdown from the
cpu_entry_area alias regardless. It does allow a timing attack to locate
the percpu area, but KASLR is more or less a lost cause against local
attack on CPUs vulnerable to Meltdown regardless. As far as I'm concerned,
on current hardware, KASLR is only useful to mitigate remote attacks that
try to attack the kernel without first gaining RCE against a vulnerable
user process.
On Skylake, with CONFIG_RETPOLINE=y and KPTI on, this reduces syscall
overhead from ~237ns to ~228ns.
There is a possible alternative approach: Move the trampoline within 2G of
the entry text and make a separate copy for each CPU. This would allow a
direct jump to rejoin the normal entry path. There are pro's and con's for
this approach:
+ It avoids a pipeline stall
- It executes from an extra page and read from another extra page during
the syscall. The latter is because it needs to use a relative
addressing mode to find sp1 -- it's the same *cacheline*, but accessed
using an alias, so it's an extra TLB entry.
- Slightly more memory. This would be one page per CPU for a simple
implementation and 64-ish bytes per CPU or one page per node for a more
complex implementation.
- More code complexity.
The current approach is chosen for simplicity and because the alternative
does not provide a significant benefit, which makes it worth.
[ tglx: Added the alternative discussion to the changelog ]
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/8c7c6e483612c3e4e10ca89495dc160b1aa66878.1536015544.git.luto@kernel.org
---
arch/x86/entry/entry_64.S | 69 +----------------------------------
arch/x86/include/asm/cpu_entry_area.h | 2 -
arch/x86/include/asm/sections.h | 1 -
arch/x86/kernel/asm-offsets.c | 2 -
arch/x86/kernel/cpu/common.c | 11 +-----
arch/x86/kernel/kprobes/core.c | 10 +----
arch/x86/kernel/vmlinux.lds.S | 10 -----
arch/x86/mm/cpu_entry_area.c | 36 ------------------
arch/x86/mm/pti.c | 33 ++++++++++++++++-
9 files changed, 36 insertions(+), 138 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 7e82e553183a..0d728142467f 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -142,67 +142,6 @@ END(native_usergs_sysret64)
* with them due to bugs in both AMD and Intel CPUs.
*/
- .pushsection .entry_trampoline, "ax"
-
-/*
- * The code in here gets remapped into cpu_entry_area's trampoline. This means
- * that the assembler and linker have the wrong idea as to where this code
- * lives (and, in fact, it's mapped more than once, so it's not even at a
- * fixed address). So we can't reference any symbols outside the entry
- * trampoline and expect it to work.
- *
- * Instead, we carefully abuse %rip-relative addressing.
- * _entry_trampoline(%rip) refers to the start of the remapped) entry
- * trampoline. We can thus find cpu_entry_area with this macro:
- */
-
-#define CPU_ENTRY_AREA \
- _entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)
-
-/* The top word of the SYSENTER stack is hot and is usable as scratch space. */
-#define RSP_SCRATCH CPU_ENTRY_AREA_entry_stack + \
- SIZEOF_entry_stack - 8 + CPU_ENTRY_AREA
-
-ENTRY(entry_SYSCALL_64_trampoline)
- UNWIND_HINT_EMPTY
- swapgs
-
- /* Stash the user RSP. */
- movq %rsp, RSP_SCRATCH
-
- /* Note: using %rsp as a scratch reg. */
- SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
-
- /* Load the top of the task stack into RSP */
- movq CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
-
- /* Start building the simulated IRET frame. */
- pushq $__USER_DS /* pt_regs->ss */
- pushq RSP_SCRATCH /* pt_regs->sp */
- pushq %r11 /* pt_regs->flags */
- pushq $__USER_CS /* pt_regs->cs */
- pushq %rcx /* pt_regs->ip */
-
- /*
- * x86 lacks a near absolute jump, and we can't jump to the real
- * entry text with a relative jump. We could push the target
- * address and then use retq, but this destroys the pipeline on
- * many CPUs (wasting over 20 cycles on Sandy Bridge). Instead,
- * spill RDI and restore it in a second-stage trampoline.
- */
- pushq %rdi
- movq $entry_SYSCALL_64_stage2, %rdi
- JMP_NOSPEC %rdi
-END(entry_SYSCALL_64_trampoline)
-
- .popsection
-
-ENTRY(entry_SYSCALL_64_stage2)
- UNWIND_HINT_EMPTY
- popq %rdi
- jmp entry_SYSCALL_64_after_hwframe
-END(entry_SYSCALL_64_stage2)
-
ENTRY(entry_SYSCALL_64)
UNWIND_HINT_EMPTY
/*
@@ -212,13 +151,9 @@ ENTRY(entry_SYSCALL_64)
*/
swapgs
- /*
- * This path is only taken when PAGE_TABLE_ISOLATION is disabled so it
- * is not required to switch CR3.
- *
- * tss.sp2 is scratch space.
- */
+ /* tss.sp2 is scratch space. */
movq %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
+ SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
/* Construct struct pt_regs on stack */
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 4a7884b8dca5..29c706415443 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -30,8 +30,6 @@ struct cpu_entry_area {
*/
struct tss_struct tss;
- char entry_trampoline[PAGE_SIZE];
-
#ifdef CONFIG_X86_64
/*
* Exception stacks used for IST entries.
diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
index 4a911a382ade..8ea1cfdbeabc 100644
--- a/arch/x86/include/asm/sections.h
+++ b/arch/x86/include/asm/sections.h
@@ -11,7 +11,6 @@ extern char __end_rodata_aligned[];
#if defined(CONFIG_X86_64)
extern char __end_rodata_hpage_align[];
-extern char __entry_trampoline_start[], __entry_trampoline_end[];
#endif
#endif /* _ASM_X86_SECTIONS_H */
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index fc2e90d3429a..083c01309027 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -99,8 +99,6 @@ void common(void) {
OFFSET(TLB_STATE_user_pcid_flush_mask, tlb_state, user_pcid_flush_mask);
/* Layout info for cpu_entry_area */
- OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
- OFFSET(CPU_ENTRY_AREA_entry_trampoline, cpu_entry_area, entry_trampoline);
OFFSET(CPU_ENTRY_AREA_entry_stack, cpu_entry_area, entry_stack_page);
DEFINE(SIZEOF_entry_stack, sizeof(struct entry_stack));
DEFINE(MASK_entry_stack, (~(sizeof(struct entry_stack) - 1)));
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 44c4ef3d989b..a3c939656579 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1531,19 +1531,10 @@ EXPORT_PER_CPU_SYMBOL(__preempt_count);
/* May not be marked __init: used by software suspend */
void syscall_init(void)
{
- extern char _entry_trampoline[];
- extern char entry_SYSCALL_64_trampoline[];
-
int cpu = smp_processor_id();
- unsigned long SYSCALL64_entry_trampoline =
- (unsigned long)get_cpu_entry_area(cpu)->entry_trampoline +
- (entry_SYSCALL_64_trampoline - _entry_trampoline);
wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
- if (static_cpu_has(X86_FEATURE_PTI))
- wrmsrl(MSR_LSTAR, SYSCALL64_entry_trampoline);
- else
- wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
+ wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
#ifdef CONFIG_IA32_EMULATION
wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index b0d1e81c96bb..f802cf5b4478 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -1066,18 +1066,10 @@ NOKPROBE_SYMBOL(kprobe_exceptions_notify);
bool arch_within_kprobe_blacklist(unsigned long addr)
{
- bool is_in_entry_trampoline_section = false;
-
-#ifdef CONFIG_X86_64
- is_in_entry_trampoline_section =
- (addr >= (unsigned long)__entry_trampoline_start &&
- addr < (unsigned long)__entry_trampoline_end);
-#endif
return (addr >= (unsigned long)__kprobes_text_start &&
addr < (unsigned long)__kprobes_text_end) ||
(addr >= (unsigned long)__entry_text_start &&
- addr < (unsigned long)__entry_text_end) ||
- is_in_entry_trampoline_section;
+ addr < (unsigned long)__entry_text_end);
}
int __init arch_init_kprobes(void)
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 8bde0a419f86..9c77d2df9c27 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -118,16 +118,6 @@ SECTIONS
*(.fixup)
*(.gnu.warning)
-#ifdef CONFIG_X86_64
- . = ALIGN(PAGE_SIZE);
- __entry_trampoline_start = .;
- _entry_trampoline = .;
- *(.entry_trampoline)
- . = ALIGN(PAGE_SIZE);
- __entry_trampoline_end = .;
- ASSERT(. - _entry_trampoline == PAGE_SIZE, "entry trampoline is too big");
-#endif
-
#ifdef CONFIG_RETPOLINE
__indirect_thunk_start = .;
*(.text.__x86.indirect_thunk)
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c
index 076ebdce9bd4..12d7e7fb4efd 100644
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -15,7 +15,6 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(struct entry_stack_page, entry_stack_storage)
#ifdef CONFIG_X86_64
static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
-static DEFINE_PER_CPU(struct kcore_list, kcore_entry_trampoline);
#endif
struct cpu_entry_area *get_cpu_entry_area(int cpu)
@@ -83,8 +82,6 @@ static void percpu_setup_debug_store(int cpu)
static void __init setup_cpu_entry_area(int cpu)
{
#ifdef CONFIG_X86_64
- extern char _entry_trampoline[];
-
/* On 64-bit systems, we use a read-only fixmap GDT and TSS. */
pgprot_t gdt_prot = PAGE_KERNEL_RO;
pgprot_t tss_prot = PAGE_KERNEL_RO;
@@ -146,43 +143,10 @@ static void __init setup_cpu_entry_area(int cpu)
cea_map_percpu_pages(&get_cpu_entry_area(cpu)->exception_stacks,
&per_cpu(exception_stacks, cpu),
sizeof(exception_stacks) / PAGE_SIZE, PAGE_KERNEL);
-
- cea_set_pte(&get_cpu_entry_area(cpu)->entry_trampoline,
- __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
- /*
- * The cpu_entry_area alias addresses are not in the kernel binary
- * so they do not show up in /proc/kcore normally. This adds entries
- * for them manually.
- */
- kclist_add_remap(&per_cpu(kcore_entry_trampoline, cpu),
- _entry_trampoline,
- &get_cpu_entry_area(cpu)->entry_trampoline, PAGE_SIZE);
#endif
percpu_setup_debug_store(cpu);
}
-#ifdef CONFIG_X86_64
-int arch_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
- char *name)
-{
- unsigned int cpu, ncpu = 0;
-
- if (symnum >= num_possible_cpus())
- return -EINVAL;
-
- for_each_possible_cpu(cpu) {
- if (ncpu++ >= symnum)
- break;
- }
-
- *value = (unsigned long)&get_cpu_entry_area(cpu)->entry_trampoline;
- *type = 't';
- strlcpy(name, "__entry_SYSCALL_64_trampoline", KSYM_NAME_LEN);
-
- return 0;
-}
-#endif
-
static __init void setup_cpu_entry_area_ptes(void)
{
#ifdef CONFIG_X86_32
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index c1fc1ae6b429..6bcf651c8bd6 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -434,11 +434,42 @@ static void __init pti_clone_p4d(unsigned long addr)
}
/*
- * Clone the CPU_ENTRY_AREA into the user space visible page table.
+ * Clone the CPU_ENTRY_AREA and associated data into the user space visible
+ * page table.
*/
static void __init pti_clone_user_shared(void)
{
+ unsigned cpu;
+
pti_clone_p4d(CPU_ENTRY_AREA_BASE);
+
+ for_each_possible_cpu(cpu) {
+ /*
+ * The SYSCALL64 entry code needs to be able to find the
+ * thread stack and needs one word of scratch space in which
+ * to spill a register. All of this lives in the TSS, in
+ * the sp1 and sp2 slots.
+ *
+ * This is done for all possible CPUs during boot to ensure
+ * that it's propagated to all mms. If we were to add one of
+ * these mappings during CPU hotplug, we would need to take
+ * some measure to make sure that every mm that subsequently
+ * ran on that CPU would have the relevant PGD entry in its
+ * pagetables. The usual vmalloc_fault() mechanism would not
+ * work for page faults taken in entry_SYSCALL_64 before RSP
+ * is set up.
+ */
+
+ unsigned long va = (unsigned long)&per_cpu(cpu_tss_rw, cpu);
+ phys_addr_t pa = per_cpu_ptr_to_phys((void *)va);
+ pte_t *target_pte;
+
+ target_pte = pti_user_pagetable_walk_pte(va);
+ if (WARN_ON(!target_pte))
+ return;
+
+ *target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL);
+ }
}
#else /* CONFIG_X86_64 */
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [tip:x86/pti] x86/pti/64: Remove the SYSCALL64 entry trampoline
2018-09-03 22:59 ` [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline Andy Lutomirski
` (3 preceding siblings ...)
2018-09-08 9:35 ` [tip:x86/pti] " tip-bot for Andy Lutomirski
@ 2018-09-08 9:57 ` tip-bot for Andy Lutomirski
2018-09-12 19:33 ` tip-bot for Andy Lutomirski
2018-09-12 19:36 ` tip-bot for Andy Lutomirski
6 siblings, 0 replies; 24+ messages in thread
From: tip-bot for Andy Lutomirski @ 2018-09-08 9:57 UTC (permalink / raw)
To: linux-tip-commits
Cc: luto, linux-kernel, dave.hansen, torvalds, alexander.shishkin,
jpoimboe, acme, adrian.hunter, bp, hpa, jolsa, tglx, joro, ak,
bp, peterz, mingo
Commit-ID: 86635715ee4228ded59f662dab36e9732b9c978f
Gitweb: https://git.kernel.org/tip/86635715ee4228ded59f662dab36e9732b9c978f
Author: Andy Lutomirski <luto@kernel.org>
AuthorDate: Mon, 3 Sep 2018 15:59:44 -0700
Committer: Thomas Gleixner <tglx@linutronix.de>
CommitDate: Sat, 8 Sep 2018 11:53:16 +0200
x86/pti/64: Remove the SYSCALL64 entry trampoline
The SYSCALL64 trampoline has a couple of nice properties:
- The usual sequence of SWAPGS followed by two GS-relative accesses to
set up RSP is somewhat slow because the GS-relative accesses need
to wait for SWAPGS to finish. The trampoline approach allows
RIP-relative accesses to set up RSP, which avoids the stall.
- The trampoline avoids any percpu access before CR3 is set up,
which means that no percpu memory needs to be mapped in the user
page tables. This prevents using Meltdown to read any percpu memory
outside the cpu_entry_area and prevents using timing leaks
to directly locate the percpu areas.
The downsides of using a trampoline may outweigh the upsides, however.
It adds an extra non-contiguous I$ cache line to system calls, and it
forces an indirect jump to transfer control back to the normal kernel
text after CR3 is set up. The latter is because x86 lacks a 64-bit
direct jump instruction that could jump from the trampoline to the entry
text. With retpolines enabled, the indirect jump is extremely slow.
Change the code to map the percpu TSS into the user page tables to allow
the non-trampoline SYSCALL64 path to work under PTI. This does not add a
new direct information leak, since the TSS is readable by Meltdown from the
cpu_entry_area alias regardless. It does allow a timing attack to locate
the percpu area, but KASLR is more or less a lost cause against local
attack on CPUs vulnerable to Meltdown regardless. As far as I'm concerned,
on current hardware, KASLR is only useful to mitigate remote attacks that
try to attack the kernel without first gaining RCE against a vulnerable
user process.
On Skylake, with CONFIG_RETPOLINE=y and KPTI on, this reduces syscall
overhead from ~237ns to ~228ns.
There is a possible alternative approach: Move the trampoline within 2G of
the entry text and make a separate copy for each CPU. This would allow a
direct jump to rejoin the normal entry path. There are pro's and con's for
this approach:
+ It avoids a pipeline stall
- It executes from an extra page and read from another extra page during
the syscall. The latter is because it needs to use a relative
addressing mode to find sp1 -- it's the same *cacheline*, but accessed
using an alias, so it's an extra TLB entry.
- Slightly more memory. This would be one page per CPU for a simple
implementation and 64-ish bytes per CPU or one page per node for a more
complex implementation.
- More code complexity.
The current approach is chosen for simplicity and because the alternative
does not provide a significant benefit, which makes it worth.
[ tglx: Added the alternative discussion to the changelog ]
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/8c7c6e483612c3e4e10ca89495dc160b1aa66878.1536015544.git.luto@kernel.org
---
arch/x86/entry/entry_64.S | 69 +----------------------------------
arch/x86/include/asm/cpu_entry_area.h | 2 -
arch/x86/include/asm/sections.h | 1 -
arch/x86/kernel/asm-offsets.c | 2 -
arch/x86/kernel/cpu/common.c | 11 +-----
arch/x86/kernel/kprobes/core.c | 10 +----
arch/x86/kernel/vmlinux.lds.S | 10 -----
arch/x86/mm/cpu_entry_area.c | 36 ------------------
arch/x86/mm/pti.c | 33 ++++++++++++++++-
9 files changed, 36 insertions(+), 138 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 7e82e553183a..0d728142467f 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -142,67 +142,6 @@ END(native_usergs_sysret64)
* with them due to bugs in both AMD and Intel CPUs.
*/
- .pushsection .entry_trampoline, "ax"
-
-/*
- * The code in here gets remapped into cpu_entry_area's trampoline. This means
- * that the assembler and linker have the wrong idea as to where this code
- * lives (and, in fact, it's mapped more than once, so it's not even at a
- * fixed address). So we can't reference any symbols outside the entry
- * trampoline and expect it to work.
- *
- * Instead, we carefully abuse %rip-relative addressing.
- * _entry_trampoline(%rip) refers to the start of the remapped) entry
- * trampoline. We can thus find cpu_entry_area with this macro:
- */
-
-#define CPU_ENTRY_AREA \
- _entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)
-
-/* The top word of the SYSENTER stack is hot and is usable as scratch space. */
-#define RSP_SCRATCH CPU_ENTRY_AREA_entry_stack + \
- SIZEOF_entry_stack - 8 + CPU_ENTRY_AREA
-
-ENTRY(entry_SYSCALL_64_trampoline)
- UNWIND_HINT_EMPTY
- swapgs
-
- /* Stash the user RSP. */
- movq %rsp, RSP_SCRATCH
-
- /* Note: using %rsp as a scratch reg. */
- SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
-
- /* Load the top of the task stack into RSP */
- movq CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
-
- /* Start building the simulated IRET frame. */
- pushq $__USER_DS /* pt_regs->ss */
- pushq RSP_SCRATCH /* pt_regs->sp */
- pushq %r11 /* pt_regs->flags */
- pushq $__USER_CS /* pt_regs->cs */
- pushq %rcx /* pt_regs->ip */
-
- /*
- * x86 lacks a near absolute jump, and we can't jump to the real
- * entry text with a relative jump. We could push the target
- * address and then use retq, but this destroys the pipeline on
- * many CPUs (wasting over 20 cycles on Sandy Bridge). Instead,
- * spill RDI and restore it in a second-stage trampoline.
- */
- pushq %rdi
- movq $entry_SYSCALL_64_stage2, %rdi
- JMP_NOSPEC %rdi
-END(entry_SYSCALL_64_trampoline)
-
- .popsection
-
-ENTRY(entry_SYSCALL_64_stage2)
- UNWIND_HINT_EMPTY
- popq %rdi
- jmp entry_SYSCALL_64_after_hwframe
-END(entry_SYSCALL_64_stage2)
-
ENTRY(entry_SYSCALL_64)
UNWIND_HINT_EMPTY
/*
@@ -212,13 +151,9 @@ ENTRY(entry_SYSCALL_64)
*/
swapgs
- /*
- * This path is only taken when PAGE_TABLE_ISOLATION is disabled so it
- * is not required to switch CR3.
- *
- * tss.sp2 is scratch space.
- */
+ /* tss.sp2 is scratch space. */
movq %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
+ SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
/* Construct struct pt_regs on stack */
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 4a7884b8dca5..29c706415443 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -30,8 +30,6 @@ struct cpu_entry_area {
*/
struct tss_struct tss;
- char entry_trampoline[PAGE_SIZE];
-
#ifdef CONFIG_X86_64
/*
* Exception stacks used for IST entries.
diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
index 4a911a382ade..8ea1cfdbeabc 100644
--- a/arch/x86/include/asm/sections.h
+++ b/arch/x86/include/asm/sections.h
@@ -11,7 +11,6 @@ extern char __end_rodata_aligned[];
#if defined(CONFIG_X86_64)
extern char __end_rodata_hpage_align[];
-extern char __entry_trampoline_start[], __entry_trampoline_end[];
#endif
#endif /* _ASM_X86_SECTIONS_H */
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index fc2e90d3429a..083c01309027 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -99,8 +99,6 @@ void common(void) {
OFFSET(TLB_STATE_user_pcid_flush_mask, tlb_state, user_pcid_flush_mask);
/* Layout info for cpu_entry_area */
- OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
- OFFSET(CPU_ENTRY_AREA_entry_trampoline, cpu_entry_area, entry_trampoline);
OFFSET(CPU_ENTRY_AREA_entry_stack, cpu_entry_area, entry_stack_page);
DEFINE(SIZEOF_entry_stack, sizeof(struct entry_stack));
DEFINE(MASK_entry_stack, (~(sizeof(struct entry_stack) - 1)));
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 44c4ef3d989b..a3c939656579 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1531,19 +1531,10 @@ EXPORT_PER_CPU_SYMBOL(__preempt_count);
/* May not be marked __init: used by software suspend */
void syscall_init(void)
{
- extern char _entry_trampoline[];
- extern char entry_SYSCALL_64_trampoline[];
-
int cpu = smp_processor_id();
- unsigned long SYSCALL64_entry_trampoline =
- (unsigned long)get_cpu_entry_area(cpu)->entry_trampoline +
- (entry_SYSCALL_64_trampoline - _entry_trampoline);
wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
- if (static_cpu_has(X86_FEATURE_PTI))
- wrmsrl(MSR_LSTAR, SYSCALL64_entry_trampoline);
- else
- wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
+ wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
#ifdef CONFIG_IA32_EMULATION
wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index b0d1e81c96bb..f802cf5b4478 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -1066,18 +1066,10 @@ NOKPROBE_SYMBOL(kprobe_exceptions_notify);
bool arch_within_kprobe_blacklist(unsigned long addr)
{
- bool is_in_entry_trampoline_section = false;
-
-#ifdef CONFIG_X86_64
- is_in_entry_trampoline_section =
- (addr >= (unsigned long)__entry_trampoline_start &&
- addr < (unsigned long)__entry_trampoline_end);
-#endif
return (addr >= (unsigned long)__kprobes_text_start &&
addr < (unsigned long)__kprobes_text_end) ||
(addr >= (unsigned long)__entry_text_start &&
- addr < (unsigned long)__entry_text_end) ||
- is_in_entry_trampoline_section;
+ addr < (unsigned long)__entry_text_end);
}
int __init arch_init_kprobes(void)
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 8bde0a419f86..9c77d2df9c27 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -118,16 +118,6 @@ SECTIONS
*(.fixup)
*(.gnu.warning)
-#ifdef CONFIG_X86_64
- . = ALIGN(PAGE_SIZE);
- __entry_trampoline_start = .;
- _entry_trampoline = .;
- *(.entry_trampoline)
- . = ALIGN(PAGE_SIZE);
- __entry_trampoline_end = .;
- ASSERT(. - _entry_trampoline == PAGE_SIZE, "entry trampoline is too big");
-#endif
-
#ifdef CONFIG_RETPOLINE
__indirect_thunk_start = .;
*(.text.__x86.indirect_thunk)
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c
index 076ebdce9bd4..12d7e7fb4efd 100644
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -15,7 +15,6 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(struct entry_stack_page, entry_stack_storage)
#ifdef CONFIG_X86_64
static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
-static DEFINE_PER_CPU(struct kcore_list, kcore_entry_trampoline);
#endif
struct cpu_entry_area *get_cpu_entry_area(int cpu)
@@ -83,8 +82,6 @@ static void percpu_setup_debug_store(int cpu)
static void __init setup_cpu_entry_area(int cpu)
{
#ifdef CONFIG_X86_64
- extern char _entry_trampoline[];
-
/* On 64-bit systems, we use a read-only fixmap GDT and TSS. */
pgprot_t gdt_prot = PAGE_KERNEL_RO;
pgprot_t tss_prot = PAGE_KERNEL_RO;
@@ -146,43 +143,10 @@ static void __init setup_cpu_entry_area(int cpu)
cea_map_percpu_pages(&get_cpu_entry_area(cpu)->exception_stacks,
&per_cpu(exception_stacks, cpu),
sizeof(exception_stacks) / PAGE_SIZE, PAGE_KERNEL);
-
- cea_set_pte(&get_cpu_entry_area(cpu)->entry_trampoline,
- __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
- /*
- * The cpu_entry_area alias addresses are not in the kernel binary
- * so they do not show up in /proc/kcore normally. This adds entries
- * for them manually.
- */
- kclist_add_remap(&per_cpu(kcore_entry_trampoline, cpu),
- _entry_trampoline,
- &get_cpu_entry_area(cpu)->entry_trampoline, PAGE_SIZE);
#endif
percpu_setup_debug_store(cpu);
}
-#ifdef CONFIG_X86_64
-int arch_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
- char *name)
-{
- unsigned int cpu, ncpu = 0;
-
- if (symnum >= num_possible_cpus())
- return -EINVAL;
-
- for_each_possible_cpu(cpu) {
- if (ncpu++ >= symnum)
- break;
- }
-
- *value = (unsigned long)&get_cpu_entry_area(cpu)->entry_trampoline;
- *type = 't';
- strlcpy(name, "__entry_SYSCALL_64_trampoline", KSYM_NAME_LEN);
-
- return 0;
-}
-#endif
-
static __init void setup_cpu_entry_area_ptes(void)
{
#ifdef CONFIG_X86_32
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index c1fc1ae6b429..4fee5c3003ed 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -434,11 +434,42 @@ static void __init pti_clone_p4d(unsigned long addr)
}
/*
- * Clone the CPU_ENTRY_AREA into the user space visible page table.
+ * Clone the CPU_ENTRY_AREA and associated data into the user space visible
+ * page table.
*/
static void __init pti_clone_user_shared(void)
{
+ unsigned int cpu;
+
pti_clone_p4d(CPU_ENTRY_AREA_BASE);
+
+ for_each_possible_cpu(cpu) {
+ /*
+ * The SYSCALL64 entry code needs to be able to find the
+ * thread stack and needs one word of scratch space in which
+ * to spill a register. All of this lives in the TSS, in
+ * the sp1 and sp2 slots.
+ *
+ * This is done for all possible CPUs during boot to ensure
+ * that it's propagated to all mms. If we were to add one of
+ * these mappings during CPU hotplug, we would need to take
+ * some measure to make sure that every mm that subsequently
+ * ran on that CPU would have the relevant PGD entry in its
+ * pagetables. The usual vmalloc_fault() mechanism would not
+ * work for page faults taken in entry_SYSCALL_64 before RSP
+ * is set up.
+ */
+
+ unsigned long va = (unsigned long)&per_cpu(cpu_tss_rw, cpu);
+ phys_addr_t pa = per_cpu_ptr_to_phys((void *)va);
+ pte_t *target_pte;
+
+ target_pte = pti_user_pagetable_walk_pte(va);
+ if (WARN_ON(!target_pte))
+ return;
+
+ *target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL);
+ }
}
#else /* CONFIG_X86_64 */
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [tip:x86/pti] x86/pti/64: Remove the SYSCALL64 entry trampoline
2018-09-03 22:59 ` [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline Andy Lutomirski
` (4 preceding siblings ...)
2018-09-08 9:57 ` tip-bot for Andy Lutomirski
@ 2018-09-12 19:33 ` tip-bot for Andy Lutomirski
2018-09-12 19:36 ` tip-bot for Andy Lutomirski
6 siblings, 0 replies; 24+ messages in thread
From: tip-bot for Andy Lutomirski @ 2018-09-12 19:33 UTC (permalink / raw)
To: linux-tip-commits
Cc: bp, jpoimboe, ak, linux-kernel, joro, acme, jolsa, peterz,
dave.hansen, tglx, hpa, mingo, adrian.hunter, alexander.shishkin,
luto, bp, torvalds
Commit-ID: e536a56190d412b0f98dbd4dde608f9f7081bb6d
Gitweb: https://git.kernel.org/tip/e536a56190d412b0f98dbd4dde608f9f7081bb6d
Author: Andy Lutomirski <luto@kernel.org>
AuthorDate: Mon, 3 Sep 2018 15:59:44 -0700
Committer: Thomas Gleixner <tglx@linutronix.de>
CommitDate: Wed, 12 Sep 2018 21:29:09 +0200
x86/pti/64: Remove the SYSCALL64 entry trampoline
The SYSCALL64 trampoline has a couple of nice properties:
- The usual sequence of SWAPGS followed by two GS-relative accesses to
set up RSP is somewhat slow because the GS-relative accesses need
to wait for SWAPGS to finish. The trampoline approach allows
RIP-relative accesses to set up RSP, which avoids the stall.
- The trampoline avoids any percpu access before CR3 is set up,
which means that no percpu memory needs to be mapped in the user
page tables. This prevents using Meltdown to read any percpu memory
outside the cpu_entry_area and prevents using timing leaks
to directly locate the percpu areas.
The downsides of using a trampoline may outweigh the upsides, however.
It adds an extra non-contiguous I$ cache line to system calls, and it
forces an indirect jump to transfer control back to the normal kernel
text after CR3 is set up. The latter is because x86 lacks a 64-bit
direct jump instruction that could jump from the trampoline to the entry
text. With retpolines enabled, the indirect jump is extremely slow.
Change the code to map the percpu TSS into the user page tables to allow
the non-trampoline SYSCALL64 path to work under PTI. This does not add a
new direct information leak, since the TSS is readable by Meltdown from the
cpu_entry_area alias regardless. It does allow a timing attack to locate
the percpu area, but KASLR is more or less a lost cause against local
attack on CPUs vulnerable to Meltdown regardless. As far as I'm concerned,
on current hardware, KASLR is only useful to mitigate remote attacks that
try to attack the kernel without first gaining RCE against a vulnerable
user process.
On Skylake, with CONFIG_RETPOLINE=y and KPTI on, this reduces syscall
overhead from ~237ns to ~228ns.
There is a possible alternative approach: Move the trampoline within 2G of
the entry text and make a separate copy for each CPU. This would allow a
direct jump to rejoin the normal entry path. There are pro's and con's for
this approach:
+ It avoids a pipeline stall
- It executes from an extra page and read from another extra page during
the syscall. The latter is because it needs to use a relative
addressing mode to find sp1 -- it's the same *cacheline*, but accessed
using an alias, so it's an extra TLB entry.
- Slightly more memory. This would be one page per CPU for a simple
implementation and 64-ish bytes per CPU or one page per node for a more
complex implementation.
- More code complexity.
The current approach is chosen for simplicity and because the alternative
does not provide a significant benefit, which makes it worth.
[ tglx: Added the alternative discussion to the changelog ]
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/8c7c6e483612c3e4e10ca89495dc160b1aa66878.1536015544.git.luto@kernel.org
---
arch/x86/entry/entry_64.S | 69 +----------------------------------
arch/x86/include/asm/cpu_entry_area.h | 2 -
arch/x86/include/asm/sections.h | 1 -
arch/x86/kernel/asm-offsets.c | 2 -
arch/x86/kernel/cpu/common.c | 13 +------
arch/x86/kernel/kprobes/core.c | 10 +----
arch/x86/kernel/vmlinux.lds.S | 10 -----
arch/x86/mm/cpu_entry_area.c | 36 ------------------
arch/x86/mm/pti.c | 33 ++++++++++++++++-
9 files changed, 36 insertions(+), 140 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 7e82e553183a..0d728142467f 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -142,67 +142,6 @@ END(native_usergs_sysret64)
* with them due to bugs in both AMD and Intel CPUs.
*/
- .pushsection .entry_trampoline, "ax"
-
-/*
- * The code in here gets remapped into cpu_entry_area's trampoline. This means
- * that the assembler and linker have the wrong idea as to where this code
- * lives (and, in fact, it's mapped more than once, so it's not even at a
- * fixed address). So we can't reference any symbols outside the entry
- * trampoline and expect it to work.
- *
- * Instead, we carefully abuse %rip-relative addressing.
- * _entry_trampoline(%rip) refers to the start of the remapped) entry
- * trampoline. We can thus find cpu_entry_area with this macro:
- */
-
-#define CPU_ENTRY_AREA \
- _entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)
-
-/* The top word of the SYSENTER stack is hot and is usable as scratch space. */
-#define RSP_SCRATCH CPU_ENTRY_AREA_entry_stack + \
- SIZEOF_entry_stack - 8 + CPU_ENTRY_AREA
-
-ENTRY(entry_SYSCALL_64_trampoline)
- UNWIND_HINT_EMPTY
- swapgs
-
- /* Stash the user RSP. */
- movq %rsp, RSP_SCRATCH
-
- /* Note: using %rsp as a scratch reg. */
- SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
-
- /* Load the top of the task stack into RSP */
- movq CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
-
- /* Start building the simulated IRET frame. */
- pushq $__USER_DS /* pt_regs->ss */
- pushq RSP_SCRATCH /* pt_regs->sp */
- pushq %r11 /* pt_regs->flags */
- pushq $__USER_CS /* pt_regs->cs */
- pushq %rcx /* pt_regs->ip */
-
- /*
- * x86 lacks a near absolute jump, and we can't jump to the real
- * entry text with a relative jump. We could push the target
- * address and then use retq, but this destroys the pipeline on
- * many CPUs (wasting over 20 cycles on Sandy Bridge). Instead,
- * spill RDI and restore it in a second-stage trampoline.
- */
- pushq %rdi
- movq $entry_SYSCALL_64_stage2, %rdi
- JMP_NOSPEC %rdi
-END(entry_SYSCALL_64_trampoline)
-
- .popsection
-
-ENTRY(entry_SYSCALL_64_stage2)
- UNWIND_HINT_EMPTY
- popq %rdi
- jmp entry_SYSCALL_64_after_hwframe
-END(entry_SYSCALL_64_stage2)
-
ENTRY(entry_SYSCALL_64)
UNWIND_HINT_EMPTY
/*
@@ -212,13 +151,9 @@ ENTRY(entry_SYSCALL_64)
*/
swapgs
- /*
- * This path is only taken when PAGE_TABLE_ISOLATION is disabled so it
- * is not required to switch CR3.
- *
- * tss.sp2 is scratch space.
- */
+ /* tss.sp2 is scratch space. */
movq %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
+ SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
/* Construct struct pt_regs on stack */
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 4a7884b8dca5..29c706415443 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -30,8 +30,6 @@ struct cpu_entry_area {
*/
struct tss_struct tss;
- char entry_trampoline[PAGE_SIZE];
-
#ifdef CONFIG_X86_64
/*
* Exception stacks used for IST entries.
diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
index 4a911a382ade..8ea1cfdbeabc 100644
--- a/arch/x86/include/asm/sections.h
+++ b/arch/x86/include/asm/sections.h
@@ -11,7 +11,6 @@ extern char __end_rodata_aligned[];
#if defined(CONFIG_X86_64)
extern char __end_rodata_hpage_align[];
-extern char __entry_trampoline_start[], __entry_trampoline_end[];
#endif
#endif /* _ASM_X86_SECTIONS_H */
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index fc2e90d3429a..083c01309027 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -99,8 +99,6 @@ void common(void) {
OFFSET(TLB_STATE_user_pcid_flush_mask, tlb_state, user_pcid_flush_mask);
/* Layout info for cpu_entry_area */
- OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
- OFFSET(CPU_ENTRY_AREA_entry_trampoline, cpu_entry_area, entry_trampoline);
OFFSET(CPU_ENTRY_AREA_entry_stack, cpu_entry_area, entry_stack_page);
DEFINE(SIZEOF_entry_stack, sizeof(struct entry_stack));
DEFINE(MASK_entry_stack, (~(sizeof(struct entry_stack) - 1)));
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 44c4ef3d989b..155082ae8b3f 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1531,19 +1531,8 @@ EXPORT_PER_CPU_SYMBOL(__preempt_count);
/* May not be marked __init: used by software suspend */
void syscall_init(void)
{
- extern char _entry_trampoline[];
- extern char entry_SYSCALL_64_trampoline[];
-
- int cpu = smp_processor_id();
- unsigned long SYSCALL64_entry_trampoline =
- (unsigned long)get_cpu_entry_area(cpu)->entry_trampoline +
- (entry_SYSCALL_64_trampoline - _entry_trampoline);
-
wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
- if (static_cpu_has(X86_FEATURE_PTI))
- wrmsrl(MSR_LSTAR, SYSCALL64_entry_trampoline);
- else
- wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
+ wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
#ifdef CONFIG_IA32_EMULATION
wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index b0d1e81c96bb..f802cf5b4478 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -1066,18 +1066,10 @@ NOKPROBE_SYMBOL(kprobe_exceptions_notify);
bool arch_within_kprobe_blacklist(unsigned long addr)
{
- bool is_in_entry_trampoline_section = false;
-
-#ifdef CONFIG_X86_64
- is_in_entry_trampoline_section =
- (addr >= (unsigned long)__entry_trampoline_start &&
- addr < (unsigned long)__entry_trampoline_end);
-#endif
return (addr >= (unsigned long)__kprobes_text_start &&
addr < (unsigned long)__kprobes_text_end) ||
(addr >= (unsigned long)__entry_text_start &&
- addr < (unsigned long)__entry_text_end) ||
- is_in_entry_trampoline_section;
+ addr < (unsigned long)__entry_text_end);
}
int __init arch_init_kprobes(void)
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 8bde0a419f86..9c77d2df9c27 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -118,16 +118,6 @@ SECTIONS
*(.fixup)
*(.gnu.warning)
-#ifdef CONFIG_X86_64
- . = ALIGN(PAGE_SIZE);
- __entry_trampoline_start = .;
- _entry_trampoline = .;
- *(.entry_trampoline)
- . = ALIGN(PAGE_SIZE);
- __entry_trampoline_end = .;
- ASSERT(. - _entry_trampoline == PAGE_SIZE, "entry trampoline is too big");
-#endif
-
#ifdef CONFIG_RETPOLINE
__indirect_thunk_start = .;
*(.text.__x86.indirect_thunk)
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c
index 076ebdce9bd4..12d7e7fb4efd 100644
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -15,7 +15,6 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(struct entry_stack_page, entry_stack_storage)
#ifdef CONFIG_X86_64
static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
-static DEFINE_PER_CPU(struct kcore_list, kcore_entry_trampoline);
#endif
struct cpu_entry_area *get_cpu_entry_area(int cpu)
@@ -83,8 +82,6 @@ static void percpu_setup_debug_store(int cpu)
static void __init setup_cpu_entry_area(int cpu)
{
#ifdef CONFIG_X86_64
- extern char _entry_trampoline[];
-
/* On 64-bit systems, we use a read-only fixmap GDT and TSS. */
pgprot_t gdt_prot = PAGE_KERNEL_RO;
pgprot_t tss_prot = PAGE_KERNEL_RO;
@@ -146,43 +143,10 @@ static void __init setup_cpu_entry_area(int cpu)
cea_map_percpu_pages(&get_cpu_entry_area(cpu)->exception_stacks,
&per_cpu(exception_stacks, cpu),
sizeof(exception_stacks) / PAGE_SIZE, PAGE_KERNEL);
-
- cea_set_pte(&get_cpu_entry_area(cpu)->entry_trampoline,
- __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
- /*
- * The cpu_entry_area alias addresses are not in the kernel binary
- * so they do not show up in /proc/kcore normally. This adds entries
- * for them manually.
- */
- kclist_add_remap(&per_cpu(kcore_entry_trampoline, cpu),
- _entry_trampoline,
- &get_cpu_entry_area(cpu)->entry_trampoline, PAGE_SIZE);
#endif
percpu_setup_debug_store(cpu);
}
-#ifdef CONFIG_X86_64
-int arch_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
- char *name)
-{
- unsigned int cpu, ncpu = 0;
-
- if (symnum >= num_possible_cpus())
- return -EINVAL;
-
- for_each_possible_cpu(cpu) {
- if (ncpu++ >= symnum)
- break;
- }
-
- *value = (unsigned long)&get_cpu_entry_area(cpu)->entry_trampoline;
- *type = 't';
- strlcpy(name, "__entry_SYSCALL_64_trampoline", KSYM_NAME_LEN);
-
- return 0;
-}
-#endif
-
static __init void setup_cpu_entry_area_ptes(void)
{
#ifdef CONFIG_X86_32
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index c1fc1ae6b429..4fee5c3003ed 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -434,11 +434,42 @@ static void __init pti_clone_p4d(unsigned long addr)
}
/*
- * Clone the CPU_ENTRY_AREA into the user space visible page table.
+ * Clone the CPU_ENTRY_AREA and associated data into the user space visible
+ * page table.
*/
static void __init pti_clone_user_shared(void)
{
+ unsigned int cpu;
+
pti_clone_p4d(CPU_ENTRY_AREA_BASE);
+
+ for_each_possible_cpu(cpu) {
+ /*
+ * The SYSCALL64 entry code needs to be able to find the
+ * thread stack and needs one word of scratch space in which
+ * to spill a register. All of this lives in the TSS, in
+ * the sp1 and sp2 slots.
+ *
+ * This is done for all possible CPUs during boot to ensure
+ * that it's propagated to all mms. If we were to add one of
+ * these mappings during CPU hotplug, we would need to take
+ * some measure to make sure that every mm that subsequently
+ * ran on that CPU would have the relevant PGD entry in its
+ * pagetables. The usual vmalloc_fault() mechanism would not
+ * work for page faults taken in entry_SYSCALL_64 before RSP
+ * is set up.
+ */
+
+ unsigned long va = (unsigned long)&per_cpu(cpu_tss_rw, cpu);
+ phys_addr_t pa = per_cpu_ptr_to_phys((void *)va);
+ pte_t *target_pte;
+
+ target_pte = pti_user_pagetable_walk_pte(va);
+ if (WARN_ON(!target_pte))
+ return;
+
+ *target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL);
+ }
}
#else /* CONFIG_X86_64 */
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [tip:x86/pti] x86/pti/64: Remove the SYSCALL64 entry trampoline
2018-09-03 22:59 ` [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline Andy Lutomirski
` (5 preceding siblings ...)
2018-09-12 19:33 ` tip-bot for Andy Lutomirski
@ 2018-09-12 19:36 ` tip-bot for Andy Lutomirski
6 siblings, 0 replies; 24+ messages in thread
From: tip-bot for Andy Lutomirski @ 2018-09-12 19:36 UTC (permalink / raw)
To: linux-tip-commits
Cc: tglx, alexander.shishkin, luto, joro, mingo, ak, dave.hansen,
linux-kernel, adrian.hunter, torvalds, hpa, bp, jpoimboe, acme,
jolsa, bp, peterz
Commit-ID: bf904d2762ee6fc1e4acfcb0772bbfb4a27ad8a6
Gitweb: https://git.kernel.org/tip/bf904d2762ee6fc1e4acfcb0772bbfb4a27ad8a6
Author: Andy Lutomirski <luto@kernel.org>
AuthorDate: Mon, 3 Sep 2018 15:59:44 -0700
Committer: Thomas Gleixner <tglx@linutronix.de>
CommitDate: Wed, 12 Sep 2018 21:33:53 +0200
x86/pti/64: Remove the SYSCALL64 entry trampoline
The SYSCALL64 trampoline has a couple of nice properties:
- The usual sequence of SWAPGS followed by two GS-relative accesses to
set up RSP is somewhat slow because the GS-relative accesses need
to wait for SWAPGS to finish. The trampoline approach allows
RIP-relative accesses to set up RSP, which avoids the stall.
- The trampoline avoids any percpu access before CR3 is set up,
which means that no percpu memory needs to be mapped in the user
page tables. This prevents using Meltdown to read any percpu memory
outside the cpu_entry_area and prevents using timing leaks
to directly locate the percpu areas.
The downsides of using a trampoline may outweigh the upsides, however.
It adds an extra non-contiguous I$ cache line to system calls, and it
forces an indirect jump to transfer control back to the normal kernel
text after CR3 is set up. The latter is because x86 lacks a 64-bit
direct jump instruction that could jump from the trampoline to the entry
text. With retpolines enabled, the indirect jump is extremely slow.
Change the code to map the percpu TSS into the user page tables to allow
the non-trampoline SYSCALL64 path to work under PTI. This does not add a
new direct information leak, since the TSS is readable by Meltdown from the
cpu_entry_area alias regardless. It does allow a timing attack to locate
the percpu area, but KASLR is more or less a lost cause against local
attack on CPUs vulnerable to Meltdown regardless. As far as I'm concerned,
on current hardware, KASLR is only useful to mitigate remote attacks that
try to attack the kernel without first gaining RCE against a vulnerable
user process.
On Skylake, with CONFIG_RETPOLINE=y and KPTI on, this reduces syscall
overhead from ~237ns to ~228ns.
There is a possible alternative approach: Move the trampoline within 2G of
the entry text and make a separate copy for each CPU. This would allow a
direct jump to rejoin the normal entry path. There are pro's and con's for
this approach:
+ It avoids a pipeline stall
- It executes from an extra page and read from another extra page during
the syscall. The latter is because it needs to use a relative
addressing mode to find sp1 -- it's the same *cacheline*, but accessed
using an alias, so it's an extra TLB entry.
- Slightly more memory. This would be one page per CPU for a simple
implementation and 64-ish bytes per CPU or one page per node for a more
complex implementation.
- More code complexity.
The current approach is chosen for simplicity and because the alternative
does not provide a significant benefit, which makes it worth.
[ tglx: Added the alternative discussion to the changelog ]
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/8c7c6e483612c3e4e10ca89495dc160b1aa66878.1536015544.git.luto@kernel.org
---
arch/x86/entry/entry_64.S | 69 +----------------------------------
arch/x86/include/asm/cpu_entry_area.h | 2 -
arch/x86/include/asm/sections.h | 1 -
arch/x86/kernel/asm-offsets.c | 2 -
arch/x86/kernel/cpu/common.c | 13 +------
arch/x86/kernel/kprobes/core.c | 10 +----
arch/x86/kernel/vmlinux.lds.S | 10 -----
arch/x86/mm/cpu_entry_area.c | 36 ------------------
arch/x86/mm/pti.c | 33 ++++++++++++++++-
9 files changed, 37 insertions(+), 139 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 7e82e553183a..0d728142467f 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -142,67 +142,6 @@ END(native_usergs_sysret64)
* with them due to bugs in both AMD and Intel CPUs.
*/
- .pushsection .entry_trampoline, "ax"
-
-/*
- * The code in here gets remapped into cpu_entry_area's trampoline. This means
- * that the assembler and linker have the wrong idea as to where this code
- * lives (and, in fact, it's mapped more than once, so it's not even at a
- * fixed address). So we can't reference any symbols outside the entry
- * trampoline and expect it to work.
- *
- * Instead, we carefully abuse %rip-relative addressing.
- * _entry_trampoline(%rip) refers to the start of the remapped) entry
- * trampoline. We can thus find cpu_entry_area with this macro:
- */
-
-#define CPU_ENTRY_AREA \
- _entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)
-
-/* The top word of the SYSENTER stack is hot and is usable as scratch space. */
-#define RSP_SCRATCH CPU_ENTRY_AREA_entry_stack + \
- SIZEOF_entry_stack - 8 + CPU_ENTRY_AREA
-
-ENTRY(entry_SYSCALL_64_trampoline)
- UNWIND_HINT_EMPTY
- swapgs
-
- /* Stash the user RSP. */
- movq %rsp, RSP_SCRATCH
-
- /* Note: using %rsp as a scratch reg. */
- SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
-
- /* Load the top of the task stack into RSP */
- movq CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
-
- /* Start building the simulated IRET frame. */
- pushq $__USER_DS /* pt_regs->ss */
- pushq RSP_SCRATCH /* pt_regs->sp */
- pushq %r11 /* pt_regs->flags */
- pushq $__USER_CS /* pt_regs->cs */
- pushq %rcx /* pt_regs->ip */
-
- /*
- * x86 lacks a near absolute jump, and we can't jump to the real
- * entry text with a relative jump. We could push the target
- * address and then use retq, but this destroys the pipeline on
- * many CPUs (wasting over 20 cycles on Sandy Bridge). Instead,
- * spill RDI and restore it in a second-stage trampoline.
- */
- pushq %rdi
- movq $entry_SYSCALL_64_stage2, %rdi
- JMP_NOSPEC %rdi
-END(entry_SYSCALL_64_trampoline)
-
- .popsection
-
-ENTRY(entry_SYSCALL_64_stage2)
- UNWIND_HINT_EMPTY
- popq %rdi
- jmp entry_SYSCALL_64_after_hwframe
-END(entry_SYSCALL_64_stage2)
-
ENTRY(entry_SYSCALL_64)
UNWIND_HINT_EMPTY
/*
@@ -212,13 +151,9 @@ ENTRY(entry_SYSCALL_64)
*/
swapgs
- /*
- * This path is only taken when PAGE_TABLE_ISOLATION is disabled so it
- * is not required to switch CR3.
- *
- * tss.sp2 is scratch space.
- */
+ /* tss.sp2 is scratch space. */
movq %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
+ SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
/* Construct struct pt_regs on stack */
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 4a7884b8dca5..29c706415443 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -30,8 +30,6 @@ struct cpu_entry_area {
*/
struct tss_struct tss;
- char entry_trampoline[PAGE_SIZE];
-
#ifdef CONFIG_X86_64
/*
* Exception stacks used for IST entries.
diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
index 4a911a382ade..8ea1cfdbeabc 100644
--- a/arch/x86/include/asm/sections.h
+++ b/arch/x86/include/asm/sections.h
@@ -11,7 +11,6 @@ extern char __end_rodata_aligned[];
#if defined(CONFIG_X86_64)
extern char __end_rodata_hpage_align[];
-extern char __entry_trampoline_start[], __entry_trampoline_end[];
#endif
#endif /* _ASM_X86_SECTIONS_H */
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index fc2e90d3429a..083c01309027 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -99,8 +99,6 @@ void common(void) {
OFFSET(TLB_STATE_user_pcid_flush_mask, tlb_state, user_pcid_flush_mask);
/* Layout info for cpu_entry_area */
- OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
- OFFSET(CPU_ENTRY_AREA_entry_trampoline, cpu_entry_area, entry_trampoline);
OFFSET(CPU_ENTRY_AREA_entry_stack, cpu_entry_area, entry_stack_page);
DEFINE(SIZEOF_entry_stack, sizeof(struct entry_stack));
DEFINE(MASK_entry_stack, (~(sizeof(struct entry_stack) - 1)));
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 44c4ef3d989b..2d5b1fa5f9c6 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1531,19 +1531,10 @@ EXPORT_PER_CPU_SYMBOL(__preempt_count);
/* May not be marked __init: used by software suspend */
void syscall_init(void)
{
- extern char _entry_trampoline[];
- extern char entry_SYSCALL_64_trampoline[];
-
- int cpu = smp_processor_id();
- unsigned long SYSCALL64_entry_trampoline =
- (unsigned long)get_cpu_entry_area(cpu)->entry_trampoline +
- (entry_SYSCALL_64_trampoline - _entry_trampoline);
+ int __maybe_unused cpu = smp_processor_id();
wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
- if (static_cpu_has(X86_FEATURE_PTI))
- wrmsrl(MSR_LSTAR, SYSCALL64_entry_trampoline);
- else
- wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
+ wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
#ifdef CONFIG_IA32_EMULATION
wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index b0d1e81c96bb..f802cf5b4478 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -1066,18 +1066,10 @@ NOKPROBE_SYMBOL(kprobe_exceptions_notify);
bool arch_within_kprobe_blacklist(unsigned long addr)
{
- bool is_in_entry_trampoline_section = false;
-
-#ifdef CONFIG_X86_64
- is_in_entry_trampoline_section =
- (addr >= (unsigned long)__entry_trampoline_start &&
- addr < (unsigned long)__entry_trampoline_end);
-#endif
return (addr >= (unsigned long)__kprobes_text_start &&
addr < (unsigned long)__kprobes_text_end) ||
(addr >= (unsigned long)__entry_text_start &&
- addr < (unsigned long)__entry_text_end) ||
- is_in_entry_trampoline_section;
+ addr < (unsigned long)__entry_text_end);
}
int __init arch_init_kprobes(void)
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 8bde0a419f86..9c77d2df9c27 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -118,16 +118,6 @@ SECTIONS
*(.fixup)
*(.gnu.warning)
-#ifdef CONFIG_X86_64
- . = ALIGN(PAGE_SIZE);
- __entry_trampoline_start = .;
- _entry_trampoline = .;
- *(.entry_trampoline)
- . = ALIGN(PAGE_SIZE);
- __entry_trampoline_end = .;
- ASSERT(. - _entry_trampoline == PAGE_SIZE, "entry trampoline is too big");
-#endif
-
#ifdef CONFIG_RETPOLINE
__indirect_thunk_start = .;
*(.text.__x86.indirect_thunk)
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c
index 076ebdce9bd4..12d7e7fb4efd 100644
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -15,7 +15,6 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(struct entry_stack_page, entry_stack_storage)
#ifdef CONFIG_X86_64
static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
-static DEFINE_PER_CPU(struct kcore_list, kcore_entry_trampoline);
#endif
struct cpu_entry_area *get_cpu_entry_area(int cpu)
@@ -83,8 +82,6 @@ static void percpu_setup_debug_store(int cpu)
static void __init setup_cpu_entry_area(int cpu)
{
#ifdef CONFIG_X86_64
- extern char _entry_trampoline[];
-
/* On 64-bit systems, we use a read-only fixmap GDT and TSS. */
pgprot_t gdt_prot = PAGE_KERNEL_RO;
pgprot_t tss_prot = PAGE_KERNEL_RO;
@@ -146,43 +143,10 @@ static void __init setup_cpu_entry_area(int cpu)
cea_map_percpu_pages(&get_cpu_entry_area(cpu)->exception_stacks,
&per_cpu(exception_stacks, cpu),
sizeof(exception_stacks) / PAGE_SIZE, PAGE_KERNEL);
-
- cea_set_pte(&get_cpu_entry_area(cpu)->entry_trampoline,
- __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
- /*
- * The cpu_entry_area alias addresses are not in the kernel binary
- * so they do not show up in /proc/kcore normally. This adds entries
- * for them manually.
- */
- kclist_add_remap(&per_cpu(kcore_entry_trampoline, cpu),
- _entry_trampoline,
- &get_cpu_entry_area(cpu)->entry_trampoline, PAGE_SIZE);
#endif
percpu_setup_debug_store(cpu);
}
-#ifdef CONFIG_X86_64
-int arch_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
- char *name)
-{
- unsigned int cpu, ncpu = 0;
-
- if (symnum >= num_possible_cpus())
- return -EINVAL;
-
- for_each_possible_cpu(cpu) {
- if (ncpu++ >= symnum)
- break;
- }
-
- *value = (unsigned long)&get_cpu_entry_area(cpu)->entry_trampoline;
- *type = 't';
- strlcpy(name, "__entry_SYSCALL_64_trampoline", KSYM_NAME_LEN);
-
- return 0;
-}
-#endif
-
static __init void setup_cpu_entry_area_ptes(void)
{
#ifdef CONFIG_X86_32
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index c1fc1ae6b429..4fee5c3003ed 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -434,11 +434,42 @@ static void __init pti_clone_p4d(unsigned long addr)
}
/*
- * Clone the CPU_ENTRY_AREA into the user space visible page table.
+ * Clone the CPU_ENTRY_AREA and associated data into the user space visible
+ * page table.
*/
static void __init pti_clone_user_shared(void)
{
+ unsigned int cpu;
+
pti_clone_p4d(CPU_ENTRY_AREA_BASE);
+
+ for_each_possible_cpu(cpu) {
+ /*
+ * The SYSCALL64 entry code needs to be able to find the
+ * thread stack and needs one word of scratch space in which
+ * to spill a register. All of this lives in the TSS, in
+ * the sp1 and sp2 slots.
+ *
+ * This is done for all possible CPUs during boot to ensure
+ * that it's propagated to all mms. If we were to add one of
+ * these mappings during CPU hotplug, we would need to take
+ * some measure to make sure that every mm that subsequently
+ * ran on that CPU would have the relevant PGD entry in its
+ * pagetables. The usual vmalloc_fault() mechanism would not
+ * work for page faults taken in entry_SYSCALL_64 before RSP
+ * is set up.
+ */
+
+ unsigned long va = (unsigned long)&per_cpu(cpu_tss_rw, cpu);
+ phys_addr_t pa = per_cpu_ptr_to_phys((void *)va);
+ pte_t *target_pte;
+
+ target_pte = pti_user_pagetable_walk_pte(va);
+ if (WARN_ON(!target_pte))
+ return;
+
+ *target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL);
+ }
}
#else /* CONFIG_X86_64 */
^ permalink raw reply related [flat|nested] 24+ messages in thread