From: Andy Lutomirski <luto@amacapital.net>
To: Ingo Molnar <mingo@kernel.org>
Cc: "Andy Lutomirski" <luto@kernel.org>, "X86 ML" <x86@kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"Frédéric Weisbecker" <fweisbec@gmail.com>,
	"Rik van Riel" <riel@redhat.com>,
	"Oleg Nesterov" <oleg@redhat.com>,
	"Denys Vlasenko" <vda.linux@googlemail.com>,
	"Borislav Petkov" <bp@alien8.de>,
	"Kees Cook" <keescook@chromium.org>,
	"Brian Gerst" <brgerst@gmail.com>
Subject: Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code
Date: Wed, 17 Jun 2015 07:23:49 -0700
Message-ID: <CALCETrU2oEiHiqb9gu+ZnDU+zOMk+JqDG2dYFVHsAh5xm2tGtw@mail.gmail.com>
In-Reply-To: <20150617103226.GA30325@gmail.com>

On Wed, Jun 17, 2015 at 3:32 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@kernel.org> wrote:
>
>> The main things that are missing are that I haven't done the 32-bit parts
>> (anyone want to help?) and therefore I haven't deleted the old C code.  I also
>> think this may break UML for trivial reasons.
>
> So I'd suggest moving most of the SYSRET fast path to C too.
>
> This is what it looks like now after your patches:
>
>         testl   $_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
>         jnz     tracesys
> entry_SYSCALL_64_fastpath:
> #if __SYSCALL_MASK == ~0
>         cmpq    $__NR_syscall_max, %rax
> #else
>         andl    $__SYSCALL_MASK, %eax
>         cmpl    $__NR_syscall_max, %eax
> #endif
>         ja      1f                              /* return -ENOSYS (already in pt_regs->ax) */
>         movq    %r10, %rcx
>         call    *sys_call_table(, %rax, 8)
>         movq    %rax, RAX(%rsp)
> 1:
> /*
>  * Syscall return path ending with SYSRET (fast path).
>  * Has incompletely filled pt_regs.
>  */
>         LOCKDEP_SYS_EXIT
>         /*
>          * We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
>          * it is too small to ever cause noticeable irq latency.
>          */
>         DISABLE_INTERRUPTS(CLBR_NONE)
>
>         /*
>          * We must check ti flags with interrupts (or at least preemption)
>          * off because we must *never* return to userspace without
>          * processing exit work that is enqueued if we're preempted here.
>          * In particular, returning to userspace with any of the one-shot
>          * flags (TIF_NOTIFY_RESUME, TIF_USER_RETURN_NOTIFY, etc) set is
>          * very bad.
>          */
>         testl   $_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
>         jnz     int_ret_from_sys_call_irqs_off  /* Go to the slow path */
>
> Most of that can be done in C.
>
> And I think we could convert the IRET syscall return slow path to C too:
>
> GLOBAL(int_ret_from_sys_call)
>         SAVE_EXTRA_REGS
>         movq    %rsp, %rdi
>         call    syscall_return_slowpath /* returns with IRQs disabled */
>         RESTORE_EXTRA_REGS
>
>         /*
>          * Try to use SYSRET instead of IRET if we're returning to
>          * a completely clean 64-bit userspace context.
>          */
>         movq    RCX(%rsp), %rcx
>         movq    RIP(%rsp), %r11
>         cmpq    %rcx, %r11                      /* RCX == RIP */
>         jne     opportunistic_sysret_failed
>
>         /*
>          * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
>          * in kernel space.  This essentially lets the user take over
>          * the kernel, since userspace controls RSP.
>          *
>          * If width of "canonical tail" ever becomes variable, this will need
>          * to be updated to remain correct on both old and new CPUs.
>          */
>         .ifne __VIRTUAL_MASK_SHIFT - 47
>         .error "virtual address width changed -- SYSRET checks need update"
>         .endif
>
>         /* Change top 16 bits to be the sign-extension of 47th bit */
>         shl     $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
>         sar     $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
>
>         /* If this changed %rcx, it was not canonical */
>         cmpq    %rcx, %r11
>         jne     opportunistic_sysret_failed
>
>         cmpq    $__USER_CS, CS(%rsp)            /* CS must match SYSRET */
>         jne     opportunistic_sysret_failed
>
>         movq    R11(%rsp), %r11
>         cmpq    %r11, EFLAGS(%rsp)              /* R11 == RFLAGS */
>         jne     opportunistic_sysret_failed
>
>         /*
>          * SYSRET can't restore RF.  SYSRET can restore TF, but unlike IRET,
>          * restoring TF results in a trap from userspace immediately after
>          * SYSRET.  This would cause an infinite loop whenever #DB happens
>          * with register state that satisfies the opportunistic SYSRET
>          * conditions.  For example, single-stepping this user code:
>          *
>          *           movq       $stuck_here, %rcx
>          *           pushfq
>          *           popq %r11
>          *   stuck_here:
>          *
>          * would never get past 'stuck_here'.
>          */
>         testq   $(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
>         jnz     opportunistic_sysret_failed
>
>         /* nothing to check for RSP */
>
>         cmpq    $__USER_DS, SS(%rsp)            /* SS must match SYSRET */
>         jne     opportunistic_sysret_failed
>
>         /*
>          * We win! This label is here just for ease of understanding
>          * perf profiles. Nothing jumps here.
>          */
> syscall_return_via_sysret:
>         /* rcx and r11 are already restored (see code above) */
>         RESTORE_C_REGS_EXCEPT_RCX_R11
>         movq    RSP(%rsp), %rsp
>         USERGS_SYSRET64
>
> opportunistic_sysret_failed:
>         SWAPGS
>         jmp     restore_c_regs_and_iret
> END(entry_SYSCALL_64)
>
>
> Basically there would be a single C function we'd call, which returns a condition
> (or fixes up its return address on the stack directly) to determine between the
> SYSRET and IRET return paths.
>
> Moving this to C too has immediate benefits: that way we could easily add
> instrumentation to see how efficient these various return methods are, etc.
>
> I.e. I don't think there's two ways about this: once the entry code moves to the
> domain of C code, we get the best benefits by moving as much of it as possible.

This is almost certainly true.  There are a lot more cleanups possible here.

I want to nail down the 32-bit case first so we can delete the old code.
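
For illustration only, the "single C function that returns a condition"
idea quoted above could look roughly like the user-space sketch below.
It mirrors the checks in the quoted assembly one for one.  This is not
code from the patch series: struct regs_model, is_canonical() and
can_use_sysret() are hypothetical names, and the constants are
assumptions standing in for the kernel's __USER_CS, __USER_DS,
X86_EFLAGS_TF, X86_EFLAGS_RF and __VIRTUAL_MASK_SHIFT on x86-64.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed stand-ins for the kernel constants used in the quoted asm. */
#define VIRTUAL_MASK_SHIFT 47
#define EFLAGS_TF (1ULL << 8)   /* trap flag */
#define EFLAGS_RF (1ULL << 16)  /* resume flag */
#define USER_CS 0x33
#define USER_DS 0x2b

/* Hypothetical, simplified model of the saved user register frame. */
struct regs_model {
	uint64_t r11, cx, ip, cs, flags, sp, ss;
};

/* Canonical means bits 63..47 are all zeros or all ones. */
static bool is_canonical(uint64_t addr)
{
	uint64_t top = addr >> VIRTUAL_MASK_SHIFT;

	return top == 0 || top == (1ULL << (64 - VIRTUAL_MASK_SHIFT)) - 1;
}

/* Returns true if SYSRET would restore this frame exactly, else use IRET. */
static bool can_use_sysret(const struct regs_model *r)
{
	if (r->cx != r->ip)                      /* SYSRET loads RIP from RCX */
		return false;
	if (!is_canonical(r->ip))                /* non-canonical RIP would #GP */
		return false;
	if (r->cs != USER_CS)                    /* CS must match SYSRET */
		return false;
	if (r->r11 != r->flags)                  /* SYSRET loads RFLAGS from R11 */
		return false;
	if (r->flags & (EFLAGS_RF | EFLAGS_TF))  /* RF/TF need IRET */
		return false;
	if (r->ss != USER_DS)                    /* SS must match SYSRET */
		return false;
	return true;                             /* nothing to check for RSP */
}
```

With a predicate like this, the assembly would only need to branch on
the returned value, and counting how often each return path is taken
becomes a matter of adding a counter next to each "return".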

>
> The only low level bits remaining in assembly will be low level hardware ABI
> details: saving registers and restoring registers to the expected format - no
> 'active' code whatsoever.

I think this is true for syscalls.  Getting the weird special cases
(IRET and GS fault) for error_entry to work correctly in C could be
tricky.
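
As a rough picture of the syscall case, the bounds check and table call
from the quoted fast path might be modelled in C as below.  This is a
user-space sketch under stated assumptions, not the kernel's code:
struct demo_regs, demo_table and demo_do_syscall() are hypothetical
names, and the two table entries are placeholders.

```c
#include <errno.h>
#include <stddef.h>

typedef long (*syscall_fn)(long, long, long, long, long, long);

/* Placeholder handlers standing in for real syscall implementations. */
static long demo_zero(long a, long b, long c, long d, long e, long f)
{
	(void)a; (void)b; (void)c; (void)d; (void)e; (void)f;
	return 0;
}

static long demo_add(long a, long b, long c, long d, long e, long f)
{
	(void)c; (void)d; (void)e; (void)f;
	return a + b;
}

static const syscall_fn demo_table[] = { demo_zero, demo_add };
#define DEMO_NR_MAX (sizeof(demo_table) / sizeof(demo_table[0]) - 1)

/* Hypothetical, simplified register frame; orig_ax holds the syscall nr. */
struct demo_regs {
	long ax, orig_ax, di, si, dx, r10, r8, r9;
};

static void demo_do_syscall(struct demo_regs *regs)
{
	/* Unsigned compare rejects negative numbers too, like "ja" in the asm. */
	unsigned long nr = (unsigned long)regs->orig_ax;

	regs->ax = -ENOSYS;		/* default result for out-of-range nr */
	if (nr <= DEMO_NR_MAX)
		regs->ax = demo_table[nr](regs->di, regs->si, regs->dx,
					  regs->r10, regs->r8, regs->r9);
}
```

The fourth argument comes from r10 rather than rcx, which matches the
"movq %r10, %rcx" in the quoted assembly.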

--Andy
