From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER, INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id E09C9C4167B for ; Tue, 8 Dec 2020 17:50:27 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 44F8823B85 for ; Tue, 8 Dec 2020 17:50:24 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 44F8823B85 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=alien8.de Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 2D5A56B005D; Tue, 8 Dec 2020 12:50:24 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 286A26B006C; Tue, 8 Dec 2020 12:50:24 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 1753A6B0073; Tue, 8 Dec 2020 12:50:24 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0113.hostedemail.com [216.40.44.113]) by kanga.kvack.org (Postfix) with ESMTP id F34166B005D for ; Tue, 8 Dec 2020 12:50:23 -0500 (EST) Received: from smtpin20.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay05.hostedemail.com (Postfix) with ESMTP id AB52D181AEF1A for ; Tue, 8 Dec 2020 17:50:23 +0000 (UTC) X-FDA: 77570854326.20.hose06_2a17cef273e8 Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin20.hostedemail.com (Postfix) with ESMTP id 82676180C0609 for ; Tue, 8 Dec 2020 17:50:23 +0000 (UTC) X-HE-Tag: hose06_2a17cef273e8 X-Filterd-Recvd-Size: 8977 Received: from mail.skyhub.de (mail.skyhub.de [5.9.137.197]) by imf23.hostedemail.com (Postfix) with ESMTP for ; Tue, 8 Dec 2020 17:50:21 +0000 (UTC) Received: from zn.tnic (p200300ec2f0f0800de0994b63dbef0e2.dip0.t-ipconnect.de [IPv6:2003:ec:2f0f:800:de09:94b6:3dbe:f0e2]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.skyhub.de (SuperMail on ZX Spectrum 128k) with ESMTPSA id EA5661EC053C; Tue, 8 Dec 2020 18:50:19 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=alien8.de; s=dkim; t=1607449820; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:in-reply-to:in-reply-to: references:references; bh=Wll8a/0W46BCEqmGk8dxucjixTUIbOAnwRfJKIjdQCc=; b=XaFS6azD6t5Xr9XLknktT4H2gC2aKHD100ijn2oQAJPTqPsYb68kxYhjhUTn6bste02pBy v6XpqQK7/HJVLGnAlcfsLC4eIBw2MPYnBz5Sr4KTYV27ZIYues1CiD0CFpBO9LwdogTO1c YBw3l/PmIywECt9wztleEqpoBpFzKVE= Date: Tue, 8 Dec 2020 18:50:14 +0100 From: Borislav Petkov To: Yu-cheng Yu Cc: x86@kernel.org, "H. Peter Anvin" , Thomas Gleixner , Ingo Molnar , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann , Andy Lutomirski , Balbir Singh , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H.J. Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , "Ravi V. Shankar" , Vedvyas Shanbhogue , Dave Martin , Weijiang Yang , Pengfei Xu Subject: Re: [PATCH v15 08/26] x86/mm: Introduce _PAGE_COW Message-ID: <20201208175014.GD27920@zn.tnic> References: <20201110162211.9207-1-yu-cheng.yu@intel.com> <20201110162211.9207-9-yu-cheng.yu@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <20201110162211.9207-9-yu-cheng.yu@intel.com> X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Tue, Nov 10, 2020 at 08:21:53AM -0800, Yu-cheng Yu wrote: > There is essentially no room left in the x86 hardware PTEs on some OSes > (not Linux). That left the hardware architects looking for a way to > represent a new memory type (shadow stack) within the existing bits. > They chose to repurpose a lightly-used state: Write=0,Dirty=1. It is not clear to me what the definition and semantics of that bit is. +#define _PAGE_BIT_COW _PAGE_BIT_SOFTW5 /* copy-on-write */ Is it set by hw or by sw and hw uses it to know it is a shadow stack page, and so on. I think you should lead with its definition. > The reason it's lightly used is that Dirty=1 is normally set by hardware > and cannot normally be set by hardware on a Write=0 PTE. Software must > normally be involved to create one of these PTEs, so software can simply > opt to not create them. > > But that leaves us with a Linux problem: we need to ensure we never create Please use passive voice in your commit message: no "we" or "I", etc. > Write=0,Dirty=1 PTEs. In places where we do create them, we need to find > an alternative way to represent them _without_ using the same hardware bit > combination. Thus, enter _PAGE_COW. This results in the following: > > (a) A modified, copy-on-write (COW) page: (R/O + _PAGE_COW) > (b) A R/O page that has been COW'ed: (R/O + _PAGE_COW) Both are "R/O + _PAGE_COW". Where's the difference? The dirty bit? > The user page is in a R/O VMA, and get_user_pages() needs a writable > copy. The page fault handler creates a copy of the page and sets > the new copy's PTE as R/O and _PAGE_COW. > (c) A shadow stack PTE: (R/O + _PAGE_DIRTY_HW) So W=0, D=1 ? > (d) A shared shadow stack PTE: (R/O + _PAGE_COW) > When a shadow stack page is being shared among processes (this happens > at fork()), its PTE is cleared of _PAGE_DIRTY_HW, so the next shadow > stack access causes a fault, and the page is duplicated and > _PAGE_DIRTY_HW is set again. This is the COW equivalent for shadow > stack pages, even though it's copy-on-access rather than copy-on-write. > (e) A page where the processor observed a Write=1 PTE, started a write, set > Dirty=1, but then observed a Write=0 PTE. How does that happen? Something changed the PTE's W bit to 0 in-between? > That's possible today, but > will not happen on processors that support shadow stack. > > Use _PAGE_COW in pte_wrprotect() and _PAGE_DIRTY_HW in pte_mkwrite(). > Apply the same changes to pmd and pud. > > When this patch is applied, there are six free bits left in the 64-bit PTE. s/When this patch is applied/After this/ Avoid having "This patch" or "This commit" in the commit message. It is tautologically useless. Also, do $ git grep 'This patch' Documentation/process for more details. > There are no more free bits in the 32-bit PTE (except for PAE) and shadow > stack is not implemented for the 32-bit kernel. > > Signed-off-by: Yu-cheng Yu > --- > arch/x86/include/asm/pgtable.h | 120 ++++++++++++++++++++++++--- > arch/x86/include/asm/pgtable_types.h | 41 ++++++++- > 2 files changed, 150 insertions(+), 11 deletions(-) > > diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h > index b23697658b28..c88c7ccf0318 100644 > --- a/arch/x86/include/asm/pgtable.h > +++ b/arch/x86/include/asm/pgtable.h > @@ -121,9 +121,9 @@ extern pmdval_t early_pmd_flags; > * The following only work if pte_present() is true. > * Undefined behaviour if not.. > */ > -static inline int pte_dirty(pte_t pte) > +static inline bool pte_dirty(pte_t pte) > { > - return pte_flags(pte) & _PAGE_DIRTY_HW; > + return pte_flags(pte) & _PAGE_DIRTY_BITS; Why? Does _PAGE_COW mean dirty too? > @@ -343,6 +349,17 @@ static inline pte_t pte_mkold(pte_t pte) > > static inline pte_t pte_wrprotect(pte_t pte) > { > + /* > + * Blindly clearing _PAGE_RW might accidentally create > + * a shadow stack PTE (RW=0,Dirty=1). Move the hardware > + * dirty value to the software bit. > + */ > + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) { > + pte.pte |= (pte.pte & _PAGE_DIRTY_HW) >> > + _PAGE_BIT_DIRTY_HW << _PAGE_BIT_COW; Let that line stick out. And that shifting is not grokkable at a quick glance, at least not to me. Simplify? > static inline pmd_t pmd_wrprotect(pmd_t pmd) > { > + /* > + * Blindly clearing _PAGE_RW might accidentally create > + * a shadow stack PMD (RW=0,Dirty=1). Move the hardware > + * dirty value to the software bit. This whole carefully sidestepping the possiblity of creating a shadow stack pXd is kinda sucky... > diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h > index 7462a574fc93..5f764d8d9bae 100644 > --- a/arch/x86/include/asm/pgtable_types.h > +++ b/arch/x86/include/asm/pgtable_types.h > @@ -23,7 +23,8 @@ > #define _PAGE_BIT_SOFTW2 10 /* " */ > #define _PAGE_BIT_SOFTW3 11 /* " */ > #define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */ > -#define _PAGE_BIT_SOFTW4 58 /* available for programmer */ > +#define _PAGE_BIT_SOFTW4 57 /* available for programmer */ > +#define _PAGE_BIT_SOFTW5 58 /* available for programmer */ > #define _PAGE_BIT_PKEY_BIT0 59 /* Protection Keys, bit 1/4 */ > #define _PAGE_BIT_PKEY_BIT1 60 /* Protection Keys, bit 2/4 */ > #define _PAGE_BIT_PKEY_BIT2 61 /* Protection Keys, bit 3/4 */ > @@ -36,6 +37,16 @@ > #define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */ > #define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4 > > +/* > + * This bit indicates a copy-on-write page, and is different from > + * _PAGE_BIT_SOFT_DIRTY, which tracks which pages a task writes to. > + */ > +#ifdef CONFIG_X86_64 CONFIG_X86_64 ? Do all x86 machines out there support CET? If anything, CONFIG_X86_CET... > +#define _PAGE_BIT_COW _PAGE_BIT_SOFTW5 /* copy-on-write */ > +#else > +#define _PAGE_BIT_COW 0 > +#endif > + > /* If _PAGE_BIT_PRESENT is clear, we use these: */ > /* - if the user mapped it with PROT_NONE; pte_present gives true */ -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette