From: Andy Lutomirski
Date: Wed, 5 Sep 2018 14:31:28 -0700
Subject: Re: [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline
To: Peter Zijlstra
Cc: Andy Lutomirski, X86 ML, Borislav Petkov, LKML, Dave Hansen,
	Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Linus Torvalds, Josh Poimboeuf, Joerg Roedel, Jiri Olsa, Andi Kleen
In-Reply-To: <20180904070455.GX24124@hirez.programming.kicks-ass.net>
References: <8c7c6e483612c3e4e10ca89495dc160b1aa66878.1536015544.git.luto@kernel.org>
 <20180904070455.GX24124@hirez.programming.kicks-ass.net>
List-ID: X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Sep 4, 2018 at 12:04 AM, Peter Zijlstra wrote:
> On Mon, Sep 03, 2018 at 03:59:44PM -0700, Andy Lutomirski wrote:
>> The SYSCALL64 trampoline has a couple of nice properties:
>>
>> - The usual sequence of SWAPGS followed by two GS-relative accesses to
>>   set up RSP is somewhat slow because the GS-relative accesses need
>>   to wait for SWAPGS to finish. The trampoline approach allows
>>   RIP-relative accesses to set up RSP, which avoids the stall.
>>
>> - The trampoline avoids any percpu access before CR3 is set up,
>>   which means that no percpu memory needs to be mapped in the user
>>   page tables. This prevents using Meltdown to read any percpu memory
>>   outside the cpu_entry_area and prevents using timing leaks
>>   to directly locate the percpu areas.
>>
>> The downsides of using a trampoline may outweigh the upsides, however.
>> It adds an extra non-contiguous I$ cache line to system calls, and it
>> forces an indirect jump to transfer control back to the normal kernel
>> text after CR3 is set up. The latter is because x86 lacks a 64-bit
>> direct jump instruction that could jump from the trampoline to the
>> entry text. With retpolines enabled, the indirect jump is extremely
>> slow.
>>
>> This patch changes the code to map the percpu TSS into the user page
>> tables to allow the non-trampoline SYSCALL64 path to work under PTI.
>> This does not add a new direct information leak, since the TSS is
>> readable by Meltdown from the cpu_entry_area alias regardless. It
>> does allow a timing attack to locate the percpu area, but KASLR is
>> more or less a lost cause against local attack on CPUs vulnerable to
>> Meltdown regardless. As far as I'm concerned, on current hardware,
>> KASLR is only useful to mitigate remote attacks that try to attack
>> the kernel without first gaining RCE against a vulnerable user
>> process.
>>
>> On Skylake, with CONFIG_RETPOLINE=y and KPTI on, this reduces
>> syscall overhead from ~237ns to ~228ns.
>>
>> There is a possible alternative approach: we could instead move the
>> trampoline within 2G of the entry text and make a separate copy for
>> each CPU. Then we could use a direct jump to rejoin the normal
>> entry path.
>
> Can we have a few words on why this solution and not this alternative? I
> mean, you raise the possibility, but then surely you chose not to
> implement that. Might as well share that with us.

I can give some pros and cons.
With the other approach:

- We avoid a pipeline stall.

- We execute from an extra page and read from another extra page during
  the syscall. (The latter is because we need to use a relative
  addressing mode to find sp1 -- it's the same *cacheline* we'd use
  anyway, but we're accessing it using an alias, so it's an extra TLB
  entry.)

- We use more memory. This would be one page per CPU for a simple
  implementation and 64-ish bytes per CPU or one page per node for a
  more complex implementation.

- More code complexity.

I'm not convinced this is a good tradeoff.
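
For anyone following along, the two entry sequences under discussion look
roughly like this. This is a heavily simplified sketch, not the literal
arch/x86/entry/entry_64.S code; label and macro names are approximate:

```
/* Trampoline path (removed by this patch): lives in a per-CPU alias in
 * cpu_entry_area, so the scratch slot is reachable RIP-relative and no
 * percpu memory is touched before the CR3 switch. */
entry_SYSCALL_64_trampoline:
	swapgs
	movq	%rsp, RSP_SCRATCH	/* RIP-relative: no stall on swapgs */
	/* ... switch to kernel CR3 ... */
	pushq	%rdi
	movq	$entry_SYSCALL_64_stage2, %rdi
	JMP_NOSPEC %rdi			/* retpoline: the slow indirect jump */

/* Non-trampoline path (what this patch uses): requires the percpu TSS
 * to be mapped in the user page tables so sp1 is usable under PTI. */
entry_SYSCALL_64:
	swapgs
	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp1)	/* GS-relative:
							   waits for swapgs */
	/* ... switch to kernel CR3, then fall through into ordinary
	   kernel text -- no indirect jump needed ... */
```

The alternative being weighed above would keep the first shape but place a
copy of the trampoline within 2G of the entry text per CPU, so the final
JMP_NOSPEC could become a direct jmp.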