[00/15] x86: Enable FSGSBASE instructions
mbox series

Message ID 1521481767-22113-1-git-send-email-chang.seok.bae@intel.com
Headers show
Series
  • x86: Enable FSGSBASE instructions
Related show

Message

Bae, Chang Seok March 19, 2018, 5:49 p.m. UTC
FSGSBASE is 64-bit instruction set to allow read/write
FS/GS base from any privilege. As introduced from
Ivybridge, enabling effort has been revolving quite long
[2,3,4] for various reasons. After extended discussions [1],
this patchset is proposed to introduce new ABIs of
customizing FS/GS base (separate from its selector).

FSGSBASE-enabled VM can be located on hosts with
either HW virtualization or SW emulation. KVM advertises
FSGSBASE when physical CPU has and emulation is 
supported in QEMU/TCG [5]. In a pool of mixed systems, VMM
may disable FSGSBASE for seamless VM migrations [6].

A couple of major benefits are expected. Kernel will have
performance improvement in context switch by skipping MSR
write for GS base. User-level programs (such as JAVA-based)
benefit from avoiding system calls to edit FS/GS base.

Changes when FSGSBASE enabled:
* In context switch, a thread's FS/GS base is secured
regardless of its selector [1].
* (Subsequently) when ptracer updates FS/GS base, it is
(always) carried onto the tracee. This is an outcome of
discussions with our GDB contact. Also, checked with other
toolchains [7, 8].
* On paranoid entry, GS base is compared with the kernel’s
to decide the necessity of SWAPGS.

[1] Recent discussion on LKML:
https://marc.info/?t=150147053700001&r=1&w=2
[2] Andy Lutomirski’s rebase work :
https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/fsgsbase
[3] Patch set shown in year 2016:
https://marc.info/?t=145857711900001&r=1&w=2
[4] First patch set: https://lkml.org/lkml/2015/4/10/573
[5] QEMU with FSGSBASE emulation:
https://github.com/qemu/qemu/blob/026aaf47c02b79036feb830206cfebb2a726510d/target/i386/translate.c#L8186
[6] 5-level EPT:
http://lkml.kernel.org/r/9ddf602b-6c8b-8c1e-ab46-07ed12366593@redhat.com
[7] RR/FSGSBASE:
https://mail.mozilla.org/pipermail/rr-dev/2018-March/000616.html
[8] CRIU/FSGSBASE:
https://lists.openvz.org/pipermail/criu/2018-March/040654.html

Andi Kleen (1):
  x86/fsgsbase/64: Add intrinsics/macros for FSGSBASE instructions

Andy Lutomirski (4):
  x86/fsgsbase/64: Make ptrace read FS/GS base accurately
  x86/fsgsbase/64: Add 'unsafe_fsgsbase' to enable CR4.FSGSBASE
  x86/fsgsbase/64: Preserve FS/GS state in __switch_to if FSGSBASE is on
  x86/fsgsbase/64: Enable FSGSBASE by default and add a chicken bit

Chang S. Bae (10):
  x86/fsgsbase/64: Introduce FS/GS base helper functions
  x86/fsgsbase/64: Use FS/GS base helpers in core dump
  x86/fsgsbase/64: Factor out load FS/GS segments from __switch_to
  x86/ptrace: A new macro to get an offset of user_regs_struct
  x86/fsgsbase/64: Add putregs() to handle multiple elements' setting
  x86/fsgsbase/64: putregs() in a reverse order
  x86/fsgsbase/64: Enable FSGSBASE instructions in helper functions
  x86/fsgsbase/64: When copying a thread, use FSGSBASE if enabled
  x86/fsgsbase/64: With FSGSBASE, compare GS bases on paranoid_entry
  x86/fsgsbase/64: Support legacy behavior when FS/GS updated by ptracer

 Documentation/admin-guide/kernel-parameters.txt |   2 +
 arch/x86/entry/entry_64.S                       |  54 ++++-
 arch/x86/include/asm/elf.h                      |   6 +-
 arch/x86/include/asm/fsgsbase.h                 | 200 +++++++++++++++++
 arch/x86/include/asm/inst.h                     |  15 ++
 arch/x86/kernel/cpu/common.c                    |  20 ++
 arch/x86/kernel/process_64.c                    | 274 ++++++++++++++++++++----
 arch/x86/kernel/ptrace.c                        | 253 ++++++++++++++++------
 8 files changed, 696 insertions(+), 128 deletions(-)
 create mode 100644 arch/x86/include/asm/fsgsbase.h

Comments

Dave Hansen March 19, 2018, 8:05 p.m. UTC | #1
On 03/19/2018 10:49 AM, Chang S. Bae wrote:
> When FSGSBASE is enabled, SWAPGS needs if and only if (current)
> GS base is not the kernel's.

Do you mean "SWAPGS is needed..."?

> FSGSBASE instructions allow user to write any value on GS base;
> even negative. Sign check on the current GS base is not
> sufficient. Fortunately, reading GS base is fast. Kernel GS
> base is also known from the offset table with the CPU number.
> 
> GS-compatible RDPID macro is included.

This description could use some work.  I think you're trying to say
that, currently, userspace can't modify GS base and the kernel's
conventions are that a negative GS base means it is a kernel value and a
positive GS base means it is a user vale.  But, with your new patches,
userspace can put arbitrary data in there, breaking the exising assumption.

Correct?

This also needs to explain a bit of the theory about how we go finding
the kernel GS base value.

Also, this is expected to improve paranoid_entry speed, right?  The
rdmsr goes away so this should be faster.  Should that be mentioned?

>  /*
> - * Save all registers in pt_regs, and switch gs if needed.
> - * Use slow, but surefire "are we in kernel?" check.
> - * Return: ebx=0: need swapgs on exit, ebx=1: otherwise
> + * Save all registers in pt_regs.
> + *
> + * SWAPGS needs when it comes from user space.

The grammar here needs some work.

	"SWAPGS is needed when entering from userspace".

>  To check where-it-from,
> + * read GS base from RDMSR/MSR_GS_BASE and check if negative or not.
> + * This works without FSGSBASE.
> + * When FSGSBASE enabled, arbitrary GS base can be put by a user-level
> + * task, which means negative value is possible. Direct comparison
> + * between the current and kernel GS bases determines the necessity of
> + * SWAPGS; do if and only if unmatched.
> + *
> + * Return: ebx=0: need SWAPGS on exit, ebx=1: otherwise
>   */

I don't think this really belongs in a comment above the function.  I'd
just explain overall that we are trying to determine if we interrupted
userspace or not.

>  ENTRY(paranoid_entry)
>  	UNWIND_HINT_FUNC
>  	cld
>  	PUSH_AND_CLEAR_REGS save_ret=1
>  	ENCODE_FRAME_POINTER 8
> -	movl	$1, %ebx
> -	movl	$MSR_GS_BASE, %ecx
> -	rdmsr
> -	testl	%edx, %edx
> -	js	1f				/* negative -> in kernel */
> -	SWAPGS
> -	xorl	%ebx, %ebx
>  
> -1:
> +	/*
> +	 * As long as this PTI macro doesn't depend on kernel GS base,
> +	 * we can do it early. This is because READ_KERNEL_GSBASE
> +	 * references data in kernel space.
> +	 */
>  	SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=%rax save_reg=%r14

We used to depend on some per-cpu data in here, but we no longer do.  So
I think this is OK.

> +	movl	$1, %ebx

I know it wasn't commented before, but please add a comment about what
this is doing.

> +	/*
> +	 * Read current GS base with RDGSBASE. Kernel GS base is found
> +	 * by CPU number and its offset value.
> +	 */
> +	ALTERNATIVE "jmp .Lparanoid_entry_no_fsgsbase",	\
> +		"RDGSBASE %rdx", X86_FEATURE_FSGSBASE

I'd rather this be:

	ALTERNATIVE "jmp .Lparanoid_entry_no_fsgsbase",	"nop",\
		X86_FEATURE_FSGSBASE
	
	RDGSBASE	   %rdx
	READ_KERNEL_GSBASE %rax
	/* See if the kernel GS_BASE value is in GS base register */
	cmpq	%rdx, %rax
	...

It's a lot more readable than what you have.

...
> +	jne	.Lparanoid_entry_swapgs
> +	ret
> +
> +.Lparanoid_entry_no_fsgsbase:
> +	/*
> +	 * A (slow) RDMSR is surefire without FSGSBASE.

I'd say:

	FSGSBASE is not in use, so depend on the kernel-enforced
	convention that a negative GS base indicates a kernel value.

> +	 * The READ_MSR_GSBASE macro scratches %ecx, %eax, and %edx.

"clobbers" is the normal terminology for this, not "scratches".

> +	READ_MSR_GSBASE save_reg=%edx
> +	testl	%edx, %edx	/* negative -> in kernel */
> +	jns	.Lparanoid_entry_swapgs
> +	ret
> +
> +.Lparanoid_entry_swapgs:
> +	SWAPGS
> +	xorl	%ebx, %ebx
>  	ret
>  END(paranoid_entry)

^^ Please comment the xorl, especially now that it's farther away from
the comment explaining what it is for.

> diff --git a/arch/x86/include/asm/fsgsbase.h b/arch/x86/include/asm/fsgsbase.h
> index 8936b7f..76d3457 100644
> --- a/arch/x86/include/asm/fsgsbase.h
> +++ b/arch/x86/include/asm/fsgsbase.h
> @@ -140,6 +140,55 @@ void  write_shadow_gsbase(unsigned long gsbase);
>  	MODRM 0xd0 wrgsbase_opd 1
>  .endm
>  
> +#if CONFIG_SMP
> +
> +.macro READ_KERNEL_GSBASE_RDPID reg:req

This needs some explanation.  Maybe:

/*
 * Fetch the per-cpu GSBASE value for this processor and
 * put it in @reg.  We normally use %GS for accessing per-cpu
 * data, but we are setting up %GS here and obviously can not
 * use %GS itself to access per-cpu data.
 */

> +	RDPID	\reg
> +
> +	/*
> +	 * processor id is written during vDSO (virtual dynamic shared object)
> +	 * initialization. 12 bits for the CPU and 8 bits for the node.
> +	 */
> +	andq	$0xFFF, \reg

This begs the question: when do we initialize the VDSO?  Is that before
or after the first paranoid_entry interrupt?  Also, why the magic
number?  Shouldn't this come out of a macro somewhere rather than having
to hard-code and spell out the convention?

> +	/*
> +	 * Kernel GS base is looked up from the __per_cpu_offset list with
> +	 * the CPU number (processor id).
> +	 */
> +	movq	__per_cpu_offset(, \reg, 8), \reg
> +.endm
> +
> +.macro READ_KERNEL_GSBASE_CPU_SEG_LIMIT reg:req
> +	/* CPU number is found from the limit of PER_CPU entry in GDT */
> +	movq	$__PER_CPU_SEG, \reg
> +	lsl	\reg, \reg
> +
> +	/* Same as READ_KERNEL_GSBASE_RDPID */
> +	andq	$0xFFF, \reg
> +	movq	__per_cpu_offset(, \reg, 8), \reg
> +.endm
> +
> +.macro READ_KERNEL_GSBASE reg:req
> +	ALTERNATIVE "READ_KERNEL_GSBASE_CPU_SEG_LIMIT \reg", \
> +		"READ_KERNEL_GSBASE_RDPID \reg", X86_FEATURE_RDPID
> +.endm

Can I suggest a different indentation?

	ALTERNATIVE \
		"READ_KERNEL_GSBASE_CPU_SEG_LIMIT \reg", \
		"READ_KERNEL_GSBASE_RDPID \reg",
		X86_FEATURE_RDPID
Bae, Chang Seok March 19, 2018, 9:12 p.m. UTC | #2
On 3/19/2018 01:05 PM, Dave Hansen wrote:
>> When FSGSBASE is enabled, SWAPGS needs if and only if (current)
>> GS base is not the kernel's.
> Do you mean "SWAPGS is needed..."?
Yes, will change.

>> FSGSBASE instructions allow user to write any value on GS base;
>> even negative. Sign check on the current GS base is not
>> sufficient. Fortunately, reading GS base is fast. Kernel GS
>> base is also known from the offset table with the CPU number.
>>
>> GS-compatible RDPID macro is included.

> This description could use some work.  I think you're trying to say
> that, currently, userspace can't modify GS base and the kernel's
> conventions are that a negative GS base means it is a kernel value and a
> positive GS base means it is a user vale.  But, with your new patches,
> userspace can put arbitrary data in there, breaking the exising assumption.
> Correct?
Yes.

> This also needs to explain a bit of the theory about how we go finding
> the kernel GS base value.
> Also, this is expected to improve paranoid_entry speed, right?  The
> rdmsr goes away so this should be faster.  Should that be mentioned?
I'm a bit reluctant to claim any performance benefit here yet (without having any strong empirical evidence).

>> + * SWAPGS needs when it comes from user space.
> The grammar here needs some work.
>        "SWAPGS is needed when entering from userspace".
Okay

>>  To check where-it-from,
>> + * read GS base from RDMSR/MSR_GS_BASE and check if negative or not.
>> + * This works without FSGSBASE.
>> + * When FSGSBASE enabled, arbitrary GS base can be put by a user-level
>> + * task, which means negative value is possible. Direct comparison
>> + * between the current and kernel GS bases determines the necessity of
>> + * SWAPGS; do if and only if unmatched.
>> + *
>> + * Return: ebx=0: need SWAPGS on exit, ebx=1: otherwise
>>   */
> I don't think this really belongs in a comment above the function.  I'd
> just explain overall that we are trying to determine if we interrupted
> userspace or not.
I'd rather sync-up with you about comments before posting a revision.

>> +     movl    $1, %ebx
> I know it wasn't commented before, but please add a comment about what
> this is doing.
Okay, as long as this comment doesn't go too much.

>> +     /*
>> +      * Read current GS base with RDGSBASE. Kernel GS base is found
>> +      * by CPU number and its offset value.
>> +      */
>> +     ALTERNATIVE "jmp .Lparanoid_entry_no_fsgsbase", \
>> +             "RDGSBASE %rdx", X86_FEATURE_FSGSBASE
>I'd rather this be:
>        ALTERNATIVE "jmp .Lparanoid_entry_no_fsgsbase", "nop",\
>                X86_FEATURE_FSGSBASE
>        RDGSBASE           %rdx
>        READ_KERNEL_GSBASE %rax
>        /* See if the kernel GS_BASE value is in GS base register */
>        cmpq    %rdx, %rax
>        ...
>It's a lot more readable than what you have.
Yes, if it helps your readability.

>> +     jne     .Lparanoid_entry_swapgs
>> +     ret
>> +
>> +.Lparanoid_entry_no_fsgsbase:
>> +     /*
>> +      * A (slow) RDMSR is surefire without FSGSBASE.
>> I'd say:
>>        FSGSBASE is not in use, so depend on the kernel-enforced
>>        convention that a negative GS base indicates a kernel value.
Okay, as the detail comment at the entry should be moved like that.

>> +      * The READ_MSR_GSBASE macro scratches %ecx, %eax, and %edx.
> "clobbers" is the normal terminology for this, not "scratches".
Got it.

>> +.Lparanoid_entry_swapgs:
>> +     SWAPGS
>> +     xorl    %ebx, %ebx
>>       ret
>>  END(paranoid_entry)
> ^^ Please comment the xorl, especially now that it's farther away from
> the comment explaining what it is for.
Okay (but if not too much)

> +.macro READ_KERNEL_GSBASE_RDPID reg:req
> This needs some explanation.  Maybe:
> /*
>  * Fetch the per-cpu GSBASE value for this processor and
>  * put it in @reg.  We normally use %GS for accessing per-cpu
>  * data, but we are setting up %GS here and obviously can not
>  * use %GS itself to access per-cpu data.
>  */
Let me think.

>> +     /*
>> +      * processor id is written during vDSO (virtual dynamic shared object)
>> +      * initialization. 12 bits for the CPU and 8 bits for the node.
>> +      */
>> +     andq    $0xFFF, \reg
> This begs the question: when do we initialize the VDSO?  Is that before
> or after the first paranoid_entry interrupt?  Also, why the magic
> number?  Shouldn't this come out of a macro somewhere rather than having
> to hard-code and spell out the convention?
Hoped the comment to be explanatory; but, agree that the hard-code anyways is bad.

>> +.macro READ_KERNEL_GSBASE reg:req
>> +     ALTERNATIVE "READ_KERNEL_GSBASE_CPU_SEG_LIMIT \reg", \
>> +             "READ_KERNEL_GSBASE_RDPID \reg", X86_FEATURE_RDPID
>> +.endm
> Can I suggest a different indentation?
>        ALTERNATIVE \
>                "READ_KERNEL_GSBASE_CPU_SEG_LIMIT \reg", \
>                "READ_KERNEL_GSBASE_RDPID \reg",
>                X86_FEATURE_RDPID
Sure, you can.
Andy Lutomirski March 20, 2018, 3:05 p.m. UTC | #3
On Mon, Mar 19, 2018 at 5:49 PM, Chang S. Bae <chang.seok.bae@intel.com> wrote:
> FSGSBASE is 64-bit instruction set to allow read/write
> FS/GS base from any privilege. As introduced from
> Ivybridge, enabling effort has been revolving quite long
> [2,3,4] for various reasons. After extended discussions [1],
> this patchset is proposed to introduce new ABIs of
> customizing FS/GS base (separate from its selector).
>
> FSGSBASE-enabled VM can be located on hosts with
> either HW virtualization or SW emulation. KVM advertises
> FSGSBASE when physical CPU has and emulation is
> supported in QEMU/TCG [5]. In a pool of mixed systems, VMM
> may disable FSGSBASE for seamless VM migrations [6].
>
> A couple of major benefits are expected. Kernel will have
> performance improvement in context switch by skipping MSR
> write for GS base. User-level programs (such as JAVA-based)
> benefit from avoiding system calls to edit FS/GS base.

Can you describe what changed since the last submission?  It looks
like a lot has changed and this series is much more complicated and
much more fragile than it used to be.  Why?
Andy Lutomirski March 20, 2018, 3:06 p.m. UTC | #4
On Tue, Mar 20, 2018 at 3:05 PM, Andy Lutomirski <luto@kernel.org> wrote:
> On Mon, Mar 19, 2018 at 5:49 PM, Chang S. Bae <chang.seok.bae@intel.com> wrote:
>> FSGSBASE is 64-bit instruction set to allow read/write
>> FS/GS base from any privilege. As introduced from
>> Ivybridge, enabling effort has been revolving quite long
>> [2,3,4] for various reasons. After extended discussions [1],
>> this patchset is proposed to introduce new ABIs of
>> customizing FS/GS base (separate from its selector).
>>
>> FSGSBASE-enabled VM can be located on hosts with
>> either HW virtualization or SW emulation. KVM advertises
>> FSGSBASE when physical CPU has and emulation is
>> supported in QEMU/TCG [5]. In a pool of mixed systems, VMM
>> may disable FSGSBASE for seamless VM migrations [6].
>>
>> A couple of major benefits are expected. Kernel will have
>> performance improvement in context switch by skipping MSR
>> write for GS base. User-level programs (such as JAVA-based)
>> benefit from avoiding system calls to edit FS/GS base.
>
> Can you describe what changed since the last submission?  It looks
> like a lot has changed and this series is much more complicated and
> much more fragile than it used to be.  Why?

Also, I've been getting lots of kbuild bot complaints about something
that looks like this patches from your git tree as recently as March
17.  Have you fixed those problems?
Bae, Chang Seok March 20, 2018, 4:33 p.m. UTC | #5
On 3/20/18, 08:07, "Andy Lutomirski" <luto@kernel.org> wrote:
> Can you describe what changed since the last submission?  It looks
> like a lot has changed and this series is much more complicated and
> much more fragile than it used to be.  Why?

According to your overall comments, most concerning part you have right now
is paranoid entry and ptrace legacy support. (correct me if I'm wong)
Design-wise, I think they're major changes -- I can follow-up with 
a summary if you want. I responded about paranoid entry. ptrace
legacy support is simply new that never before AFAIK.

> Also, I've been getting lots of kbuild bot complaints about something
> that looks like this patches from your git tree as recently as March
> 17.  Have you fixed those problems?

Last week, I don't have any major issue at all. I don't know what major
issues you saw. Most likely it is just silly rebase issue I bet.
Andy Lutomirski March 21, 2018, 12:40 a.m. UTC | #6
On Tue, Mar 20, 2018 at 4:33 PM, Bae, Chang Seok
<chang.seok.bae@intel.com> wrote:
> On 3/20/18, 08:07, "Andy Lutomirski" <luto@kernel.org> wrote:
>> Can you describe what changed since the last submission?  It looks
>> like a lot has changed and this series is much more complicated and
>> much more fragile than it used to be.  Why?
>
> According to your overall comments, most concerning part you have right now
> is paranoid entry and ptrace legacy support. (correct me if I'm wong)
> Design-wise, I think they're major changes -- I can follow-up with
> a summary if you want. I responded about paranoid entry. ptrace
> legacy support is simply new that never before AFAIK.

Summary please.
Bae, Chang Seok March 21, 2018, 3:15 p.m. UTC | #7
> Summary please.

These deltas I found as comparing with your old codes [4]
* Paranoid entry changes to avoid vulnerability 
* Ptrace changes to support the legacy behavior (conditionally)
* Dropped [1] to be aligned with what is concluded in [3]
* Dropped [2] since save_fsgs() doesn’t copy base when index is zero; now open code with FSGSBASE instructions.
* Re-wrote assembly macros and dropped (redundant) SWAPGS intrinsic
* Revised helper functions -- dropped unused helpers and factor out write_task_fs/gsbase
* Introducing all of helper functions at the beginning of patch series
* Factor out load_fsgs()
* One of the test cases will be included later (as published 15 patches already)

[1] Refresh FS and GS when modify_ldt changes an entry
https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fsgsbase&id=fbe2c2a0ba5b879f8229ca7dfc0d434dbe31a805
[2] When copying a thread, safe FS and GS for real
https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fsgsbase&id=e51eceeefbd587b514ca357b745b46e8e2746d34 
[3] FSGSBASE ABI discussion
https://marc.info/?l=linux-kernel&m=150212735223392&w=2 
[4] https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/fsgsbase