From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <79e0a85e-1ec4-e359-649d-618ca79c36f7@redhat.com>
Date: Thu, 26 Jan 2023 09:57:27 +0100
From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
Subject: Re: [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk
To: "Edgecombe, Rick P", "bsingharora@gmail.com", "hpa@zytor.com",
 "Syromiatnikov, Eugene", "peterz@infradead.org", "rdunlap@infradead.org",
 "keescook@chromium.org", "Eranian, Stephane", "kirill.shutemov@linux.intel.com",
 "dave.hansen@linux.intel.com", "linux-mm@kvack.org", "fweimer@redhat.com",
 "nadav.amit@gmail.com", "jannh@google.com", "dethoma@microsoft.com",
 "kcc@google.com", "linux-arch@vger.kernel.org", "pavel@ucw.cz",
 "oleg@redhat.com", "hjl.tools@gmail.com", "akpm@linux-foundation.org",
 "Yang, Weijiang", "Lutomirski, Andy", "arnd@arndb.de",
 "jamorris@linux.microsoft.com", "linux-doc@vger.kernel.org", "bp@alien8.de",
 "Schimpe, Christina", "x86@kernel.org", "mike.kravetz@oracle.com",
 "tglx@linutronix.de", "andrew.cooper3@citrix.com", "john.allen@amd.com",
 "rppt@kernel.org", "mingo@redhat.com", "corbet@lwn.net",
 "linux-kernel@vger.kernel.org", "linux-api@vger.kernel.org",
 "gorcunov@gmail.com"
Cc: "Yu, Yu-cheng"
References: <20230119212317.8324-1-rick.p.edgecombe@intel.com>
 <20230119212317.8324-19-rick.p.edgecombe@intel.com>
 <7f63d13d-7940-afb6-8b25-26fdf3804e00@redhat.com>
 <50cf64932507ba60639eca28692e7df285bcc0a7.camel@intel.com>
 <1327c608-1473-af4f-d962-c24f04f3952c@redhat.com>
 <8c3820ae1448de4baffe7c476b4b5d9ba0a309ff.camel@intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
On 25.01.23 19:43, Edgecombe, Rick P wrote:
> On Wed, 2023-01-25 at 10:27 +0100, David Hildenbrand wrote:
>>>> Roughly speaking: if we abstract it that way and get all of the
>>>> "how to set it writable now?" out of core-MM, it not only is
>>>> cleaner and less error prone, it might even allow other
>>>> architectures that implement something comparable (e.g., using a
>>>> dedicated HW bit) to actually reuse some of that work. Otherwise
>>>> most of that "shstk" is really just x86-specific ...
>>>>
>>>> I guess the only cases we have to special-case would be the page
>>>> pinning code, where pte_write() would indicate that the PTE is
>>>> writable (well, it is, just not by "ordinary CPU instruction"
>>>> context directly): but you do that already, so ... :)
>>>>
>>>> Sorry for stumbling over this so late; I only started looking
>>>> into this when you CCed me on that one patch.
>>>
>>> Sorry for not calling more attention to it earlier. Appreciate your
>>> comments.
>>>
>>> Previous versions of this series had changed some of these
>>> pte_mkwrite() calls to maybe_mkwrite(), which of course takes a
>>> vma. This way an x86 implementation could use the VM_SHADOW_STACK
>>> vma flag to decide between pte_mkwrite() and pte_mkwrite_shstk().
>>> The feedback was that in some of these code paths "maybe" isn't
>>> really an option: it *needs* to make the PTE writable. Even though
>>> the logic was the same, the name of the function made it look
>>> wrong.
>>>
>>> But another option could be to change pte_mkwrite() to take a vma.
>>> This would save using another software bit on x86, but instead
>>> requires a small change to each arch's pte_mkwrite().
>>
>> I played with that idea briefly as well, but discarded it.
>> I was not able to convince myself that it wouldn't also be required
>> to pass in the VMA for things like pte_dirty(), pte_mkdirty(),
>> pte_write(), ... which would end up fairly ugly (or even impossible
>> in things like GUP-fast).
>>
>> For example, I wonder how we'd handle stuff like do_numa_page()
>> cleanly and correctly, where we use pte_modify() + pte_mkwrite(),
>> and either call might set the PTE writable and maintain the dirty
>> bit ...
>
> pte_modify() is handled like this currently:
>
> https://lore.kernel.org/lkml/20230119212317.8324-12-rick.p.edgecombe@intel.com/
>
> There have been a couple of iterations on that. The current solution
> is to do the Dirty->SavedDirty fixup, if needed, after the new prots
> are added.
>
> Of course pte_modify() can't know whether you are attempting to
> create a shadow stack PTE with the prot you are passing in. But the
> callers today explicitly call pte_mkwrite() after filling in the
> other bits with pte_modify().

See below on my MAP_PRIVATE vs. MAP_SHARED comment.

> Today this patch causes the pte_mkwrite() to be skipped, and another
> fault may be required in the mprotect() and NUMA cases, but if we
> change pte_mkwrite() to take a VMA we can just make it shadow stack
> to start.
>
> It might be worth mentioning that there was a suggestion in the past
> to try to have the shadow stack bits come out of vm_get_page_prot(),
> but MM code would then try to map the zero page as (shadow stack)
> writable when there was a normal (non-shadow-stack) read access. So
> I had to abandon that approach and rely on explicit calls to
> pte_mkwrite()/pte_mkwrite_shstk() to make it shadow stack.

Thanks, do you have a pointer?

>
>> Having that said, maybe it could work with only a single
>> saved-dirty bit and passing in the VMA for pte_mkwrite() only.
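The Dirty->SavedDirty fixup Rick describes can be modeled in a few lines of userspace C. This is only a sketch of the idea, not the kernel implementation: the bit positions and the `_PAGE_SAVED_DIRTY` name are illustrative stand-ins for whatever the arch actually defines.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins; the real x86 bit layout differs. */
#define _PAGE_RW          (1u << 1)
#define _PAGE_DIRTY       (1u << 6)
#define _PAGE_SAVED_DIRTY (1u << 11)  /* software bit */

typedef struct { uint32_t flags; } pte_t;

/*
 * On x86, Write=0,Dirty=1 *is* the shadow stack encoding. So after
 * pte_modify() applies new protections, a PTE that ended up
 * write-protected but hardware-dirty must have Dirty moved into the
 * software SavedDirty bit, or it would suddenly look like shadow stack.
 */
static pte_t saved_dirty_fixup(pte_t pte)
{
    if (!(pte.flags & _PAGE_RW) && (pte.flags & _PAGE_DIRTY)) {
        pte.flags &= ~_PAGE_DIRTY;
        pte.flags |= _PAGE_SAVED_DIRTY;
    }
    return pte;
}
```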
>> pte_wrprotect() would detect "writable=0,dirty=1" and move the
>> dirty bit to the saved-dirty bit instead, resulting in
>> "writable=0,dirty=0,saved-dirty=1".
>>
>> pte_dirty() would return dirty==1 || saved-dirty==1.
>>
>> pte_mkdirty() would either set dirty=1 or saved-dirty=1, depending
>> on the writable bit.
>>
>> pte_mkclean() would clear both bits.
>>
>> pte_write() would detect "writable==1 || (writable==0 && dirty==1)".
>>
>> pte_mkwrite() would act according to the VMA, and in addition,
>> merge the saved-dirty bit into the dirty bit.
>>
>> pte_modify() and mk_pte() ... would require more thought ...
>
> Not sure I'm following what the mk_pte() problem would be. You mean
> if Write=0,Dirty=1 is manually added to the prot?
>
> Shouldn't people generally use the pte_mkwrite() helpers unless they
> are drawing from a prot that was already created with the helpers or
> vm_get_page_prot()?

pte_mkwrite() is mostly only used (except for writenotify ...) for
MAP_PRIVATE memory ("COW-able"). For MAP_SHARED memory,
vma->vm_page_prot in a VM_WRITE mapping already contains the write
permissions; pte_mkwrite() is not necessary (again, unless
writenotify is active).

I assume shstk VMAs don't apply to MAP_SHARED VMAs, which is why you
didn't stumble over that issue yet? Because I don't see how it could
work with MAP_SHARED VMAs.

The other thing I had in mind was that we have to make sure that
we're not accidentally setting "Write=0,Dirty=1" in mk_pte() /
pte_modify(). Assume we had a "Write=1,Dirty=1" PTE and we
effectively wrprotect using pte_modify(): we have to make sure to
move the dirty bit to the saved-dirty bit.

> I think they can't manually create prots from bits in core mm code,
> right? And x86 arch code already has to be aware of shadow stack.
> It's a bit of an assumption I guess, but I think maybe not too crazy
> of one?

I think that's true. Arch code is supposed to deal with that IIRC.
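For illustration, the semantics listed above can be checked with a small userspace model. This is a sketch of the proposed scheme only, assuming a single software saved-dirty bit; the names, bit positions, and the VM_SHADOW_STACK value are illustrative, not the real x86 definitions.

```c
#include <assert.h>
#include <stdint.h>

#define _PAGE_RW          (1u << 1)
#define _PAGE_DIRTY       (1u << 6)
#define _PAGE_SAVED_DIRTY (1u << 11)  /* software bit, illustrative */
#define VM_SHADOW_STACK   (1u << 0)   /* illustrative value */

typedef struct { uint32_t flags; } pte_t;
struct vm_area_struct { uint32_t vm_flags; };

/* wrprotect: Write=0,Dirty=1 would look like shadow stack, so park
 * the hardware Dirty bit in SavedDirty. */
static pte_t pte_wrprotect(pte_t pte)
{
    pte.flags &= ~_PAGE_RW;
    if (pte.flags & _PAGE_DIRTY) {
        pte.flags &= ~_PAGE_DIRTY;
        pte.flags |= _PAGE_SAVED_DIRTY;
    }
    return pte;
}

/* dirty==1 || saved-dirty==1 */
static int pte_dirty(pte_t pte)
{
    return !!(pte.flags & (_PAGE_DIRTY | _PAGE_SAVED_DIRTY));
}

/* Only writable PTEs may carry the hardware Dirty bit. */
static pte_t pte_mkdirty(pte_t pte)
{
    pte.flags |= (pte.flags & _PAGE_RW) ? _PAGE_DIRTY : _PAGE_SAVED_DIRTY;
    return pte;
}

static pte_t pte_mkclean(pte_t pte)
{
    pte.flags &= ~(_PAGE_DIRTY | _PAGE_SAVED_DIRTY);
    return pte;
}

/* writable==1 || (writable==0 && dirty==1): shadow stack PTEs also
 * count as writable. */
static int pte_write(pte_t pte)
{
    return !!(pte.flags & (_PAGE_RW | _PAGE_DIRTY));
}

/* mkwrite: act according to the VMA and merge SavedDirty back into
 * Dirty, where it is unambiguous again. */
static pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte)
{
    if (pte.flags & _PAGE_SAVED_DIRTY) {
        pte.flags &= ~_PAGE_SAVED_DIRTY;
        pte.flags |= _PAGE_DIRTY;
    }
    if (vma->vm_flags & VM_SHADOW_STACK)
        pte.flags |= _PAGE_DIRTY;   /* shadow stack: Write=0,Dirty=1 */
    else
        pte.flags |= _PAGE_RW;
    return pte;
}
```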
>
>> Further, ptep_modify_prot_commit() might have to be adjusted to
>> properly flush in all relevant cases IIRC.
>
> Sorry, I'm not following. Can you elaborate? There is an adjustment
> made in pte_flags_need_flush().

Note that I did not fully review all bits of this patch set; I'm just
throwing out what was on my mind. If it's already handled, great.

>
>>> x86's pte_mkwrite() would then be pretty close to maybe_mkwrite(),
>>> but maybe it could additionally warn if the vma is not writable.
>>> It also seems more aligned with your changes to stop taking hints
>>> from PTE bits and just look at the VMA? (I'm thinking about the
>>> dropping of the dirty check in GUP and dropping pte_savedwrite().)
>>
>> The soft-shstk bit wouldn't be a hint, it would logically change
>> the "type" of the PTE such that any other PTE functions can do the
>> right thing without having to consume the VMA.
>
> Yea, true.
>
> Thanks for your comments and ideas here. I'll give the:
>
>     pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte)
>
> ...solution a try.

Good!

-- 
Thanks,

David / dhildenb
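[From the call-site side, the signature change agreed on above can be sketched as a minimal userspace model. The helper names pte_mkwrite_novma()/pte_mkwrite_shstk() and all bit/flag values are illustrative stand-ins, not what the series necessarily implements.]

```c
#include <assert.h>
#include <stdint.h>

#define _PAGE_RW        (1u << 1)
#define _PAGE_DIRTY     (1u << 6)
#define VM_SHADOW_STACK (1u << 0)   /* illustrative value */

typedef struct { uint32_t flags; } pte_t;
struct vm_area_struct { uint32_t vm_flags; };

/* Arch-provided primitives (x86-flavored sketch). */
static pte_t pte_mkwrite_novma(pte_t pte)   /* ordinary writable */
{
    pte.flags |= _PAGE_RW;
    return pte;
}

static pte_t pte_mkwrite_shstk(pte_t pte)   /* Write=0,Dirty=1 */
{
    pte.flags &= ~_PAGE_RW;
    pte.flags |= _PAGE_DIRTY;
    return pte;
}

/*
 * Core-MM always calls this one; the arch picks the right encoding
 * based on the VMA, so callers never choose between two helpers.
 */
static pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte)
{
    if (vma->vm_flags & VM_SHADOW_STACK)
        return pte_mkwrite_shstk(pte);
    return pte_mkwrite_novma(pte);
}
```

The point of the dispatch is exactly what the thread converged on: the "how to set it writable now?" decision moves out of core-MM and into the one arch helper that can see the VMA.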