linux-kernel.vger.kernel.org archive mirror
From: Brian Gerst <brgerst@gmail.com>
To: Joerg Roedel <joro@8bytes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@kernel.org>, "H . Peter Anvin" <hpa@zytor.com>,
	"the arch/x86 maintainers" <x86@kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andy Lutomirski <luto@kernel.org>,
	Dave Hansen <dave.hansen@intel.com>,
	Josh Poimboeuf <jpoimboe@redhat.com>,
	Juergen Gross <jgross@suse.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Borislav Petkov <bp@alien8.de>, Jiri Kosina <jkosina@suse.cz>,
	Boris Ostrovsky <boris.ostrovsky@oracle.com>,
	David Laight <David.Laight@aculab.com>,
	Denys Vlasenko <dvlasenk@redhat.com>,
	Eduardo Valentin <eduval@amazon.com>,
	Greg KH <gregkh@linuxfoundation.org>,
	Will Deacon <will.deacon@arm.com>,
	"Liguori, Anthony" <aliguori@amazon.com>,
	Daniel Gruss <daniel.gruss@iaik.tugraz.at>,
	Hugh Dickins <hughd@google.com>, Kees Cook <keescook@google.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Waiman Long <llong@redhat.com>, Pavel Machek <pavel@ucw.cz>,
	Joerg Roedel <jroedel@suse.de>
Subject: Re: [PATCH 11/34] x86/entry/32: Handle Entry from Kernel-Mode on Entry-Stack
Date: Mon, 5 Mar 2018 11:41:01 -0500	[thread overview]
Message-ID: <CAMzpN2h3xkhw_A4VeeA47=oykKgxXeumHM-q0QpaA8+fwFVRjw@mail.gmail.com> (raw)
In-Reply-To: <1520245563-8444-12-git-send-email-joro@8bytes.org>

On Mon, Mar 5, 2018 at 5:25 AM, Joerg Roedel <joro@8bytes.org> wrote:
> From: Joerg Roedel <jroedel@suse.de>
>
> It can happen that we enter the kernel from kernel-mode and
> on the entry-stack. The most common way this happens is when
> we get an exception while loading the user-space segment
> registers on the kernel-to-userspace exit path.
>
> The segment loading needs to be done after the entry-stack
> switch, because the stack-switch needs kernel %fs for
> per_cpu access.
>
> When this happens, we need to make sure that we leave the
> kernel with the entry-stack again, so that the interrupted
> code-path runs on the right stack when switching to the
> user-cr3.
>
> We do this by detecting this condition on kernel-entry by
> checking CS.RPL and %esp, and if it happens, we copy over
> the complete content of the entry stack to the task-stack.
> This needs to be done because once we enter the exception
> handlers we might be scheduled out or even migrated to a
> different CPU, so that we can't rely on the entry-stack
> contents. We also leave a marker in the stack-frame to
> detect this condition on the exit path.
>
> On the exit path the copy is reversed, we copy all of the
> remaining task-stack back to the entry-stack and switch
> to it.
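[Editor's note: the detect-copy-mark scheme the commit message describes can be modeled in C. This is a hypothetical sketch for illustration only; the function names are invented here, and the real logic is the assembly in the diff below.]

```c
#include <assert.h>
#include <stdint.h>

#define CS_FROM_ENTRY_STACK  (1u << 31)
#define SEGMENT_RPL_MASK     0x3u

/* The one pt_regs slot this scheme cares about: the saved CS dword. */
struct frame { uint32_t cs; };

/* CS.RPL == 0 means the interrupted code was running in kernel mode. */
static int entered_from_kernel(const struct frame *f)
{
	return (f->cs & SEGMENT_RPL_MASK) == 0;
}

/* Same arithmetic as the asm: round %esp down to the entry-stack base,
 * add the stack size to get the top, then subtract %esp to get the
 * number of live bytes that must be copied to the task stack. */
static uint32_t bytes_on_entry_stack(uint32_t esp, uint32_t mask, uint32_t size)
{
	return (esp & mask) + size - esp;
}

/* Tag the saved CS so the iret path knows to copy everything back. */
static void mark_from_entry_stack(struct frame *f)
{
	f->cs |= CS_FROM_ENTRY_STACK;
}
```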
>
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  arch/x86/entry/entry_32.S | 110 +++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 109 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
> index bb0bd896..3a84945 100644
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -299,6 +299,9 @@
>   * copied there. So allocate the stack-frame on the task-stack and
>   * switch to it before we do any copying.
>   */
> +
> +#define CS_FROM_ENTRY_STACK    (1 << 31)
> +
>  .macro SWITCH_TO_KERNEL_STACK
>
>         ALTERNATIVE     "", "jmp .Lend_\@", X86_FEATURE_XENPV
> @@ -320,6 +323,10 @@
>         /* Load top of task-stack into %edi */
>         movl    TSS_entry_stack(%edi), %edi
>
> +       /* Special case - entry from kernel mode via entry stack */
> +       testl   $SEGMENT_RPL_MASK, PT_CS(%esp)
> +       jz      .Lentry_from_kernel_\@
> +
>         /* Bytes to copy */
>         movl    $PTREGS_SIZE, %ecx
>
> @@ -333,8 +340,8 @@
>          */
>         addl    $(4 * 4), %ecx
>
> -.Lcopy_pt_regs_\@:
>  #endif
> +.Lcopy_pt_regs_\@:
>
>         /* Allocate frame on task-stack */
>         subl    %ecx, %edi
> @@ -350,6 +357,56 @@
>         cld
>         rep movsl
>
> +       jmp .Lend_\@
> +
> +.Lentry_from_kernel_\@:
> +
> +       /*
> +        * This handles the case when we enter the kernel from
> +        * kernel-mode and %esp points to the entry-stack. When this
> +        * happens we need to switch to the task-stack to run C code,
> +        * but switch back to the entry-stack again when we approach
> +        * iret and return to the interrupted code-path. This usually
> +        * happens when we hit an exception while restoring user-space
> +        * segment registers on the way back to user-space.
> +        *
> +        * When we switch to the task-stack here, we can't trust the
> +        * contents of the entry-stack anymore, as the exception handler
> +        * might be scheduled out or moved to another CPU. Therefore we
> +        * copy the complete entry-stack to the task-stack and set a
> +        * marker in the iret-frame (bit 31 of the CS dword) to detect
> +        * what we've done on the iret path.

We don't need to worry about preemption changing the entry stack.  The
faults that IRET or segment loads can generate just run the exception
fixup handler and return.  Interrupts were disabled when the fault
occurred, so the kernel cannot be preempted.  The other case to watch
is #DB on SYSENTER, but that simply returns and doesn't sleep either.

We can keep the same process as the existing debug/NMI handlers -
leave the current exception pt_regs on the entry stack and just switch
to the task stack for the call to the handler.  Then switch back to
the entry stack and continue.  No copying needed.
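[Editor's note: a hypothetical C model of the no-copy scheme described above; the names are invented and the real code would be assembly. The exception frame stays on the entry stack, and only the stack pointer moves for the duration of the handler call.]

```c
#include <assert.h>
#include <stdint.h>

struct pt_regs { uint32_t cs; uint32_t ip; };

static int handler_calls;
static void count_handler(struct pt_regs *regs) { (void)regs; handler_calls++; }

/* Model of the debug/NMI-style approach: pt_regs stay put on the
 * entry stack; we switch to the task stack only while calling the C
 * handler, then switch back before iret.  Nothing is copied. */
static uintptr_t run_handler_on_task_stack(uintptr_t entry_sp,
					   uintptr_t task_sp,
					   void (*handler)(struct pt_regs *),
					   struct pt_regs *regs)
{
	uintptr_t sp = task_sp;   /* "movl task_sp, %esp" before the call  */
	(void)sp;                 /* the model does not actually use it    */
	handler(regs);            /* regs still point into the entry stack */
	return entry_sp;          /* "movl entry_sp, %esp" before iret     */
}
```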

> +        *
> +        * On the iret path we copy everything back and switch to the
> +        * entry-stack, so that the interrupted kernel code-path
> +        * continues on the same stack it was interrupted with.
> +        *
> +        * Be aware that an NMI can happen anytime in this code.
> +        *
> +        * %esi: Entry-Stack pointer (same as %esp)
> +        * %edi: Top of the task stack
> +        */
> +
> +       /* Calculate number of bytes on the entry stack in %ecx */
> +       movl    %esi, %ecx
> +
> +       /* %ecx to the top of entry-stack */
> +       andl    $(MASK_entry_stack), %ecx
> +       addl    $(SIZEOF_entry_stack), %ecx
> +
> +       /* Number of bytes on the entry stack to %ecx */
> +       sub     %esi, %ecx
> +
> +       /* Mark stackframe as coming from entry stack */
> +       orl     $CS_FROM_ENTRY_STACK, PT_CS(%esp)

Not all 32-bit processors will zero-extend segment pushes.  You will
need to explicitly clear the bit in the case where we didn't switch
CR3.
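[Editor's note: the zero-extension pitfall can be made concrete with a small C model; both function names here are invented for illustration. If the CPU writes only the low 16 bits of the CS slot, bit 31 may already be set by stale stack data, so every path that does not set the marker must clear it explicitly.]

```c
#include <assert.h>
#include <stdint.h>

#define CS_FROM_ENTRY_STACK (1u << 31)

/* Model of a segment push that updates only the low 16 bits of the
 * 32-bit stack slot, leaving stale garbage in the upper half --
 * the behavior some 32-bit processors exhibit. */
static uint32_t push_cs_no_zero_extend(uint32_t stale_slot, uint16_t cs)
{
	return (stale_slot & 0xffff0000u) | cs;
}

/* The entry code must therefore clear the marker on paths that do
 * not set it, rather than trusting the pushed value. */
static uint32_t sanitize_saved_cs(uint32_t slot)
{
	return slot & ~CS_FROM_ENTRY_STACK;
}
```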

> +
> +       /*
> +        * %esi and %edi are unchanged, %ecx contains the number of
> +        * bytes to copy. The code at .Lcopy_pt_regs_\@ will allocate
> +        * the stack-frame on task-stack and copy everything over
> +        */
> +       jmp .Lcopy_pt_regs_\@
> +
>  .Lend_\@:
>  .endm
>
> @@ -408,6 +465,56 @@
>  .endm
>
>  /*
> + * This macro handles the case when we return to kernel-mode on the iret
> + * path and have to switch back to the entry stack.
> + *
> + * See the comments below the .Lentry_from_kernel_\@ label in the
> + * SWITCH_TO_KERNEL_STACK macro for more details.
> + */
> +.macro PARANOID_EXIT_TO_KERNEL_MODE
> +
> +       /*
> +        * Test if we entered the kernel with the entry-stack. Most
> +        * likely we did not, because this code only runs on the
> +        * return-to-kernel path.
> +        */
> +       testl   $CS_FROM_ENTRY_STACK, PT_CS(%esp)
> +       jz      .Lend_\@
> +
> +       /* Unlikely slow-path */
> +
> +       /* Clear marker from stack-frame */
> +       andl    $(~CS_FROM_ENTRY_STACK), PT_CS(%esp)
> +
> +       /* Copy the remaining task-stack contents to entry-stack */
> +       movl    %esp, %esi
> +       movl    PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %edi
> +
> +       /* Bytes on the task-stack to ecx */
> +       movl    PER_CPU_VAR(cpu_current_top_of_stack), %ecx
> +       subl    %esi, %ecx
> +
> +       /* Allocate stack-frame on entry-stack */
> +       subl    %ecx, %edi
> +
> +       /*
> +        * Save future stack-pointer, we must not switch until the
> +        * copy is done, otherwise the NMI handler could destroy the
> +        * contents of the task-stack we are about to copy.
> +        */
> +       movl    %edi, %ebx
> +
> +       /* Do the copy */
> +       shrl    $2, %ecx
> +       cld
> +       rep movsl
> +
> +       /* Safe to switch to entry-stack now */
> +       movl    %ebx, %esp
> +
> +.Lend_\@:
> +.endm
> +/*
>   * %eax: prev task
>   * %edx: next task
>   */
> @@ -765,6 +872,7 @@ restore_all:
>
>  restore_all_kernel:
>         TRACE_IRQS_IRET
> +       PARANOID_EXIT_TO_KERNEL_MODE
>         RESTORE_REGS 4
>         jmp     .Lirq_return
>
> --
> 2.7.4
>
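[Editor's note: for reference, the exit-path reversal in the quoted PARANOID_EXIT_TO_KERNEL_MODE macro amounts to the following C sketch (a hypothetical model with invented names): compute how many bytes remain on the task stack, carve a frame on the entry stack, copy, and only then switch the stack pointer.]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of the copy-back: bytes remaining on the task stack are
 * top_of_task_stack - esp; the frame is allocated on the entry stack
 * by subtracting that count from its top (TSS.sp0), the data is
 * copied, and the new stack pointer is returned last -- mirroring
 * the asm, which must not switch %esp until the copy is done,
 * or an NMI could clobber the bytes being copied. */
static uint8_t *copy_back_to_entry_stack(uint8_t *esp,
					 uint8_t *task_stack_top,
					 uint8_t *entry_stack_top)
{
	size_t bytes = (size_t)(task_stack_top - esp);
	uint8_t *new_sp = entry_stack_top - bytes;  /* allocate frame   */
	memcpy(new_sp, esp, bytes);                 /* rep movsl        */
	return new_sp;                              /* switch %esp last */
}
```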

--
Brian Gerst


Thread overview: 56+ messages
2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x86-32 Joerg Roedel
2018-03-05 10:25 ` [PATCH 01/34] x86/asm-offsets: Move TSS_sp0 and TSS_sp1 to asm-offsets.c Joerg Roedel
2018-03-05 10:25 ` [PATCH 02/34] x86/entry/32: Rename TSS_sysenter_sp0 to TSS_entry_stack Joerg Roedel
2018-03-05 10:25 ` [PATCH 03/34] x86/entry/32: Load task stack from x86_tss.sp1 in SYSENTER handler Joerg Roedel
2018-03-05 10:25 ` [PATCH 04/34] x86/entry/32: Put ESPFIX code into a macro Joerg Roedel
2018-03-05 10:25 ` [PATCH 05/34] x86/entry/32: Unshare NMI return path Joerg Roedel
2018-03-05 10:25 ` [PATCH 06/34] x86/entry/32: Split off return-to-kernel path Joerg Roedel
2018-03-05 10:25 ` [PATCH 07/34] x86/entry/32: Restore segments before int registers Joerg Roedel
2018-03-05 12:17   ` Linus Torvalds
2018-03-05 13:12     ` Joerg Roedel
2018-03-05 14:51       ` Brian Gerst
2018-03-05 16:44         ` Joerg Roedel
2018-03-05 17:21           ` Brian Gerst
2018-03-05 18:23       ` Linus Torvalds
2018-03-05 18:36         ` Joerg Roedel
2018-03-05 20:38         ` Brian Gerst
2018-03-05 20:50           ` Linus Torvalds
2018-03-05 21:35             ` Joerg Roedel
2018-03-05 21:58               ` Linus Torvalds
2018-03-05 22:03                 ` H. Peter Anvin
2018-03-06  7:04                   ` Ingo Molnar
2018-03-06 13:45                     ` Dave Hansen
2018-03-06  8:38                 ` Joerg Roedel
2018-03-05 10:25 ` [PATCH 08/34] x86/entry/32: Enter the kernel via trampoline stack Joerg Roedel
2018-03-05 10:25 ` [PATCH 09/34] x86/entry/32: Leave " Joerg Roedel
2018-03-05 10:25 ` [PATCH 10/34] x86/entry/32: Introduce SAVE_ALL_NMI and RESTORE_ALL_NMI Joerg Roedel
2018-03-05 10:25 ` [PATCH 11/34] x86/entry/32: Handle Entry from Kernel-Mode on Entry-Stack Joerg Roedel
2018-03-05 16:41   ` Brian Gerst [this message]
2018-03-05 18:25     ` Joerg Roedel
2018-03-05 20:32       ` Brian Gerst
2018-03-06 12:27     ` Joerg Roedel
2018-03-05 10:25 ` [PATCH 12/34] x86/entry/32: Simplify debug entry point Joerg Roedel
2018-03-05 10:25 ` [PATCH 13/34] x86/entry/32: Add PTI cr3 switches to NMI handler code Joerg Roedel
2018-03-05 10:25 ` [PATCH 14/34] x86/entry/32: Add PTI cr3 switch to non-NMI entry/exit points Joerg Roedel
2018-03-05 10:25 ` [PATCH 15/34] x86/pgtable: Rename pti_set_user_pgd to pti_set_user_pgtbl Joerg Roedel
2018-03-05 10:25 ` [PATCH 16/34] x86/pgtable/pae: Unshare kernel PMDs when PTI is enabled Joerg Roedel
2018-03-05 10:25 ` [PATCH 17/34] x86/pgtable/32: Allocate 8k page-tables " Joerg Roedel
2018-03-05 10:25 ` [PATCH 18/34] x86/pgtable: Move pgdp kernel/user conversion functions to pgtable.h Joerg Roedel
2018-03-05 10:25 ` [PATCH 19/34] x86/pgtable: Move pti_set_user_pgtbl() " Joerg Roedel
2018-03-05 10:25 ` [PATCH 20/34] x86/pgtable: Move two more functions from pgtable_64.h " Joerg Roedel
2018-03-05 10:25 ` [PATCH 21/34] x86/mm/pae: Populate valid user PGD entries Joerg Roedel
2018-03-05 10:25 ` [PATCH 22/34] x86/mm/pae: Populate the user page-table with user pgd's Joerg Roedel
2018-03-05 10:25 ` [PATCH 23/34] x86/mm/legacy: " Joerg Roedel
2018-03-05 10:25 ` [PATCH 24/34] x86/mm/pti: Add an overflow check to pti_clone_pmds() Joerg Roedel
2018-03-05 10:25 ` [PATCH 25/34] x86/mm/pti: Define X86_CR3_PTI_PCID_USER_BIT on x86_32 Joerg Roedel
2018-03-05 10:25 ` [PATCH 26/34] x86/mm/pti: Clone CPU_ENTRY_AREA on PMD level " Joerg Roedel
2018-03-05 10:25 ` [PATCH 27/34] x86/mm/dump_pagetables: Define INIT_PGD Joerg Roedel
2018-03-05 10:25 ` [PATCH 28/34] x86/pgtable/pae: Use separate kernel PMDs for user page-table Joerg Roedel
2018-03-05 10:25 ` [PATCH 29/34] x86/ldt: Reserve address-space range on 32 bit for the LDT Joerg Roedel
2018-03-05 10:25 ` [PATCH 30/34] x86/ldt: Define LDT_END_ADDR Joerg Roedel
2018-03-05 10:26 ` [PATCH 31/34] x86/ldt: Split out sanity check in map_ldt_struct() Joerg Roedel
2018-03-05 10:26 ` [PATCH 32/34] x86/ldt: Enable LDT user-mapping for PAE Joerg Roedel
2018-03-05 10:26 ` [PATCH 33/34] x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32 Joerg Roedel
2018-03-05 10:26 ` [PATCH 34/34] x86/mm/pti: Add Warning when booting on a PCIE capable CPU Joerg Roedel
2018-03-05 13:39   ` Waiman Long
2018-03-05 16:09   ` Denys Vlasenko
