From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.3 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,NICE_REPLY_A,SPF_HELO_NONE, SPF_PASS,USER_AGENT_SANE_1 autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id ACB12C43462 for ; Tue, 4 May 2021 20:49:09 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 23F12611AB for ; Tue, 4 May 2021 20:49:09 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 23F12611AB Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=intel.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 82E796B0036; Tue, 4 May 2021 16:49:08 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 7DFC56B006E; Tue, 4 May 2021 16:49:08 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 632E36B0070; Tue, 4 May 2021 16:49:08 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0206.hostedemail.com [216.40.44.206]) by kanga.kvack.org (Postfix) with ESMTP id 488606B0036 for ; Tue, 4 May 2021 16:49:08 -0400 (EDT) Received: from smtpin12.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with ESMTP id 0099DA2D4 for ; Tue, 4 May 2021 20:49:07 +0000 (UTC) X-FDA: 78104738376.12.99960EF Received: from mga02.intel.com (mga02.intel.com [134.134.136.20]) by imf01.hostedemail.com (Postfix) with ESMTP id 00149500152C for ; Tue, 4 May 2021 20:48:54 +0000 (UTC) IronPort-SDR: NPGle7sRN3fE1d0bYKKfGI58YUSAlQFGyesuhxLMqTcmSgkFu5cXAY2nl+66oG0xRWBnjQYOHC vjKiqEFbU/aQ== X-IronPort-AV: E=McAfee;i="6200,9189,9974"; a="185207484" X-IronPort-AV: E=Sophos;i="5.82,272,1613462400"; d="scan'208";a="185207484" Received: from orsmga003.jf.intel.com ([10.7.209.27]) by orsmga101.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 04 May 2021 13:49:04 -0700 IronPort-SDR: 6KDt1n8pfcb9gJH0NzrYGgA5EZXyDWXzCYb41j9NgDFcQB0vnbTXvsm/VpOSZC2ClLdLm+iNKj OCE0Yn8pJhDw== X-IronPort-AV: E=Sophos;i="5.82,272,1613462400"; d="scan'208";a="388940055" Received: from yyu32-mobl1.amr.corp.intel.com (HELO [10.251.140.235]) ([10.251.140.235]) by orsmga003-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 04 May 2021 13:49:03 -0700 Subject: Re: extending ucontext (Re: [PATCH v26 25/30] x86/cet/shstk: Handle signals for shadow stack) From: "Yu, Yu-cheng" To: Andy Lutomirski Cc: linux-arch , X86 ML , "H. Peter Anvin" , Thomas Gleixner , Ingo Molnar , LKML , "open list:DOCUMENTATION" , Linux-MM , Linux API , Arnd Bergmann , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H.J. Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , "Ravi V. Shankar" , Vedvyas Shanbhogue , Dave Martin , Weijiang Yang , Pengfei Xu , Haitao Huang References: <20210427204315.24153-1-yu-cheng.yu@intel.com> <20210427204315.24153-26-yu-cheng.yu@intel.com> <8fd86049-930d-c9b7-379c-56c02a12cd77@intel.com> Message-ID: <5fc5dea4-0705-2aad-cf8f-7ff78a5e518a@intel.com> Date: Tue, 4 May 2021 13:49:02 -0700 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.10.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US X-Rspamd-Server: rspam05 X-Rspamd-Queue-Id: 00149500152C X-Stat-Signature: 1ntqbwk1ogaqosuyby9zgid4e9ta385r Authentication-Results: imf01.hostedemail.com; dkim=none; dmarc=fail reason="No valid SPF, No valid DKIM" header.from=intel.com (policy=none); spf=none (imf01.hostedemail.com: domain of yu-cheng.yu@intel.com has no SPF policy when checking 134.134.136.20) smtp.mailfrom=yu-cheng.yu@intel.com Received-SPF: none (intel.com>: No applicable sender policy available) receiver=imf01; identity=mailfrom; envelope-from=""; helo=mga02.intel.com; client-ip=134.134.136.20 X-HE-DKIM-Result: none/none X-HE-Tag: 1620161334-482135 Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 4/30/2021 11:32 AM, Yu, Yu-cheng wrote: > On 4/30/2021 10:47 AM, Andy Lutomirski wrote: >> On Fri, Apr 30, 2021 at 10:00 AM Yu, Yu-cheng =20 >> wrote: >>> >>> On 4/28/2021 4:03 PM, Andy Lutomirski wrote: >>>> On Tue, Apr 27, 2021 at 1:44 PM Yu-cheng Yu =20 >>>> wrote: >>>>> >>>>> When shadow stack is enabled, a task's shadow stack states must be=20 >>>>> saved >>>>> along with the signal context and later restored in sigreturn. =20 >>>>> However, >>>>> currently there is no systematic facility for extending a signal=20 >>>>> context. >>>>> There is some space left in the ucontext, but changing ucontext is=20 >>>>> likely >>>>> to create compatibility issues and there is not enough space for=20 >>>>> further >>>>> extensions. >>>>> >>>>> Introduce a signal context extension struct 'sc_ext', which is used= =20 >>>>> to save >>>>> shadow stack restore token address.=C2=A0 The extension is located = above=20 >>>>> the fpu >>>>> states, plus alignment.=C2=A0 The struct can be extended (such as t= he ibt's >>>>> wait_endbr status to be introduced later), and sc_ext.total_size fi= eld >>>>> keeps track of total size. >>>> >>>> I still don't like this. >>>> >>>> Here's how the signal layout works, for better or for worse: >>>> >>>> The kernel has: >>>> >>>> struct rt_sigframe { >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 char __user *pretcode; >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct ucontext uc; >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct siginfo info; >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* fp state follows here */ >>>> }; >>>> >>>> This is roughly the actual signal frame.=C2=A0 But userspace does no= t have >>>> this struct declared, and user code does not know the sizes of the >>>> fields.=C2=A0 So it's accessed in a nonsensical way.=C2=A0 The signa= l handler >>>> function is passed a pointer to the whole sigframe implicitly in RSP= , >>>> a pointer to &frame->info in RSI, anda pointer to &frame->uc in RDX. >>>> User code can *find* the fp state by following a pointer from >>>> mcontext, which is, in turn, found via uc: >>>> >>>> struct ucontext { >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 unsigned long=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0 uc_flags; >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct ucontext=C2=A0 *uc_link; >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 stack_t=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0 uc_stack; >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct sigcontext uc_mcontext;=C2=A0 = <-- fp pointer is in here >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 sigset_t=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= uc_sigmask;=C2=A0=C2=A0=C2=A0 /* mask last for extensibility */ >>>> }; >>>> >>>> The kernel, in sigreturn, works a bit differently.=C2=A0 The sigretu= rn >>>> variants know the base address of the frame but don't have the benef= it >>>> of receiving pointers to the fields.=C2=A0 So instead the kernel tak= es >>>> advantage of the fact that it knows the offset to uc and parses uc >>>> accordingly.=C2=A0 And the kernel follows the pointer in mcontext to= find >>>> the fp state.=C2=A0 The latter bit is quite important later.=C2=A0 T= he kernel >>>> does not parse info at all. >>>> >>>> The fp state is its own mess.=C2=A0 When XSAVE happened, Intel kindl= y (?) >>>> gave us a software defined area between the "legacy" x87 region and >>>> the modern supposedly extensible part.=C2=A0 Linux sticks the follow= ing >>>> structure in that hole: >>>> >>>> struct _fpx_sw_bytes { >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * If set to FP_XSTATE_MAGIC1 th= en this is an xstate context. >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * 0 if a legacy frame. >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 */ >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 __u32=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 magic1; >>>> >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * Total size of the fpstate are= a: >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 *=C2=A0 - if magic1 =3D=3D 0 th= en it's sizeof(struct _fpstate) >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 *=C2=A0 - if magic1 =3D=3D FP_X= STATE_MAGIC1 then it's sizeof(struct=20 >>>> _xstate) >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 *=C2=A0=C2=A0=C2=A0 plus extens= ions (if any) >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 */ >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 __u32=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 extended_size; >>>> >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * Feature bit mask (including F= P/SSE/extended state) that is=20 >>>> present >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * in the memory layout: >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 */ >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 __u64=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 xfeatures; >>>> >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * Actual XSAVE state size, base= d on the xfeatures saved in=20 >>>> the layout. >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * 'extended_size' is greater th= an 'xstate_size': >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 */ >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 __u32=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 xstate_size; >>>> >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* For future use: */ >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 __u32=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 padding[7]; >>>> }; >>>> >>>> >>>> That's where we are right now upstream.=C2=A0 The kernel has a parse= r for >>>> the FPU state that is bugs piled upon bugs and is going to have to b= e >>>> rewritten sometime soon.=C2=A0 On top of all this, we have two upcom= ing >>>> features, both of which require different kinds of extensions: >>>> >>>> 1. AVX-512.=C2=A0 (Yeah, you thought this story was over a few years= ago, >>>> but no.=C2=A0 And AMX makes it worse.)=C2=A0 To make a long story sh= ort, we >>>> promised user code many years ago that a signal frame fit in 2048 >>>> bytes with some room to spare.=C2=A0 With AVX-512 this is false.=C2=A0= With AMX >>>> it's so wrong it's not even funny.=C2=A0 The only way out of the mes= s >>>> anyone has come up with involves making the length of the FPU state >>>> vary depending on which features are INIT, i.e. making it more compa= ct >>>> than "compact" mode is.=C2=A0 This has a side effect: it's no longer >>>> possible to modify the state in place, because enabling a feature wi= th >>>> no space allocated will make the structure bigger, and the stack won= 't >>>> have room.=C2=A0 Fortunately, one can relocate the entire FPU state,= update >>>> the pointer in mcontext, and the kernel will happily follow the >>>> pointer.=C2=A0 So new code on a new kernel using a super-compact sta= te >>>> could expand the state by allocating new memory (on the heap? very >>>> awkwardly on the stack?) and changing the pointer.=C2=A0 For all we = know, >>>> some code already fiddles with the pointer.=C2=A0 This is great, exc= ept >>>> that your patch sticks more data at the end of the FPU block that no >>>> one is expecting, and your sigreturn code follows that pointer, and >>>> will read off into lala land. >>>> >>> >>> Then, what about we don't do that at all.=C2=A0 Is it possible from n= ow on we >>> don't stick more data at the end, and take the relocating-fpu approac= h? >>> >>>> 2. CET.=C2=A0 CET wants us to find a few more bytes somewhere, and t= hose >>>> bytes logically belong in ucontext, and here we are. >>>> >>> >>> Fortunately, we can spare CET the need of ucontext extension.=C2=A0 W= hen the >>> kernel handles sigreturn, the user-mode shadow stack pointer is right= at >>> the restore token.=C2=A0 There is no need to put that in ucontext. >> >> That seems entirely reasonable.=C2=A0 This might also avoid needing to >> teach CRIU about CET at all. >> >>> >>> However, the WAIT_ENDBR status needs to be saved/restored for signals= . >>> Since IBT is now dependent on shadow stack, we can use a spare bit of >>> the shadow stack restore token for that. >> >> That seems like unnecessary ABI coupling.=C2=A0 We have plenty of bits= in >> uc_flags, and we have an entire reserved word in sigcontext.=C2=A0 How >> about just sticking this bit in one of those places? >=20 > Yes, I will make it UC_WAIT_ENDBR. Personally, I think an explicit flag is cleaner than using a reserved=20 word somewhere. However, there is a small issue: ia32 has no uc_flags. This series can support legacy apps up to now. But, instead of creating=20 too many special cases, perhaps we should drop CET support of ia32? Thoughts? Thanks, Yu-cheng