From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.2 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id DEB90C433ED for ; Wed, 28 Apr 2021 23:04:18 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id BC9EC61449 for ; Wed, 28 Apr 2021 23:04:18 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229939AbhD1XE4 (ORCPT ); Wed, 28 Apr 2021 19:04:56 -0400 Received: from mail.kernel.org ([198.145.29.99]:44174 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229593AbhD1XEx (ORCPT ); Wed, 28 Apr 2021 19:04:53 -0400 Received: by mail.kernel.org (Postfix) with ESMTPSA id DC9FE61461 for ; Wed, 28 Apr 2021 23:04:07 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1619651047; bh=0afqfeFLgdB0fTNzaLFOMLYgWe3yB1y7QqmEFNzjedY=; h=References:In-Reply-To:From:Date:Subject:To:Cc:From; b=s8CPOaIFrcZ6bgfFP6FstGuiS/Ep8lWIPMWnb/Iw39Kg8z0S6Hu4s34I7h0HF9dsO +jFJk3d8n+w7lQ3eNw+yzriu6NxFWTCNH+jtp5oeyK8l/QS+jeRDJ+d1OnU0USEegh +uux6L2gxngj6rNRFS8k+uTH3ngbWIj4yS5oYVpyMQM86An6XE9OCy2mYbVY4ECwti AsMh5rD4V7hbfjc3SXOiaKr7Q6bDENZ5iN1pV8hPkHPzM/l8lhf29EM91z/4YxbmpZ TmBqhOZm3Up25Hcy3mCbTpIA4g3ot3UsnqAZyVor05LSO9XfRFejgMe+QRf7O7MPDh OmWSmvkCLu+yA== Received: by mail-ej1-f41.google.com with SMTP id u21so97030046ejo.13 for ; Wed, 28 Apr 2021 16:04:07 -0700 (PDT) X-Gm-Message-State: AOAM532v3d1ZHg69IXKaju+ozd+sqw5TPApgiNNWZmNj8oWIoJia+pI0 1bNlYoUZVmz6Gwgnyb/VGnB8kGAdcO62Aws5k4mMFA== X-Google-Smtp-Source: ABdhPJzcb8Hw7Su0s8Rgxkv7jYWxNcuy2WTKMRFpTCb8fN9sWcUB+ds4ywq5Mo5SLRg2DAoSXXOtfYklNGY1ADo5ZH8= X-Received: by 2002:a17:906:6896:: with SMTP id n22mr32665459ejr.316.1619651045990; Wed, 28 Apr 2021 16:04:05 -0700 (PDT) MIME-Version: 1.0 References: <20210427204315.24153-1-yu-cheng.yu@intel.com> <20210427204315.24153-26-yu-cheng.yu@intel.com> In-Reply-To: <20210427204315.24153-26-yu-cheng.yu@intel.com> From: Andy Lutomirski Date: Wed, 28 Apr 2021 16:03:55 -0700 X-Gmail-Original-Message-ID: Message-ID: Subject: extending ucontext (Re: [PATCH v26 25/30] x86/cet/shstk: Handle signals for shadow stack) To: Yu-cheng Yu , linux-arch Cc: X86 ML , "H. Peter Anvin" , Thomas Gleixner , Ingo Molnar , LKML , "open list:DOCUMENTATION" , Linux-MM , Linux API , Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H.J. Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , "Ravi V. Shankar" , Vedvyas Shanbhogue , Dave Martin , Weijiang Yang , Pengfei Xu , Haitao Huang Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Apr 27, 2021 at 1:44 PM Yu-cheng Yu wrote: > > When shadow stack is enabled, a task's shadow stack states must be saved > along with the signal context and later restored in sigreturn. However, > currently there is no systematic facility for extending a signal context. > There is some space left in the ucontext, but changing ucontext is likely > to create compatibility issues and there is not enough space for further > extensions. > > Introduce a signal context extension struct 'sc_ext', which is used to save > shadow stack restore token address. The extension is located above the fpu > states, plus alignment. The struct can be extended (such as the ibt's > wait_endbr status to be introduced later), and sc_ext.total_size field > keeps track of total size. I still don't like this. Here's how the signal layout works, for better or for worse: The kernel has: struct rt_sigframe { char __user *pretcode; struct ucontext uc; struct siginfo info; /* fp state follows here */ }; This is roughly the actual signal frame. But userspace does not have this struct declared, and user code does not know the sizes of the fields. So it's accessed in a nonsensical way. The signal handler function is passed a pointer to the whole sigframe implicitly in RSP, a pointer to &frame->info in RSI, anda pointer to &frame->uc in RDX. User code can *find* the fp state by following a pointer from mcontext, which is, in turn, found via uc: struct ucontext { unsigned long uc_flags; struct ucontext *uc_link; stack_t uc_stack; struct sigcontext uc_mcontext; <-- fp pointer is in here sigset_t uc_sigmask; /* mask last for extensibility */ }; The kernel, in sigreturn, works a bit differently. The sigreturn variants know the base address of the frame but don't have the benefit of receiving pointers to the fields. So instead the kernel takes advantage of the fact that it knows the offset to uc and parses uc accordingly. And the kernel follows the pointer in mcontext to find the fp state. The latter bit is quite important later. The kernel does not parse info at all. The fp state is its own mess. When XSAVE happened, Intel kindly (?) gave us a software defined area between the "legacy" x87 region and the modern supposedly extensible part. Linux sticks the following structure in that hole: struct _fpx_sw_bytes { /* * If set to FP_XSTATE_MAGIC1 then this is an xstate context. * 0 if a legacy frame. */ __u32 magic1; /* * Total size of the fpstate area: * * - if magic1 == 0 then it's sizeof(struct _fpstate) * - if magic1 == FP_XSTATE_MAGIC1 then it's sizeof(struct _xstate) * plus extensions (if any) */ __u32 extended_size; /* * Feature bit mask (including FP/SSE/extended state) that is present * in the memory layout: */ __u64 xfeatures; /* * Actual XSAVE state size, based on the xfeatures saved in the layout. * 'extended_size' is greater than 'xstate_size': */ __u32 xstate_size; /* For future use: */ __u32 padding[7]; }; That's where we are right now upstream. The kernel has a parser for the FPU state that is bugs piled upon bugs and is going to have to be rewritten sometime soon. On top of all this, we have two upcoming features, both of which require different kinds of extensions: 1. AVX-512. (Yeah, you thought this story was over a few years ago, but no. And AMX makes it worse.) To make a long story short, we promised user code many years ago that a signal frame fit in 2048 bytes with some room to spare. With AVX-512 this is false. With AMX it's so wrong it's not even funny. The only way out of the mess anyone has come up with involves making the length of the FPU state vary depending on which features are INIT, i.e. making it more compact than "compact" mode is. This has a side effect: it's no longer possible to modify the state in place, because enabling a feature with no space allocated will make the structure bigger, and the stack won't have room. Fortunately, one can relocate the entire FPU state, update the pointer in mcontext, and the kernel will happily follow the pointer. So new code on a new kernel using a super-compact state could expand the state by allocating new memory (on the heap? very awkwardly on the stack?) and changing the pointer. For all we know, some code already fiddles with the pointer. This is great, except that your patch sticks more data at the end of the FPU block that no one is expecting, and your sigreturn code follows that pointer, and will read off into lala land. 2. CET. CET wants us to find a few more bytes somewhere, and those bytes logically belong in ucontext, and here we are. This is *almost*, but not quite, easy: struct ucontext is already variable length! Unfortunately, the whole variable length portion is used up by uc_sigmask. So I propose that we introduce a brand new bona fide extension mechanism. It works like this: First, we add a struct ucontext_extension at the end. It looks like: struct ucontext_extension { u64 length; /* sizeof(struct ucontext_extension) */ u64 flags; /* we will want this some day */ [CET stuff here] [future stuff here] }; And we locate it by scrounging a word somewhere in ucontext to give the offset from the beginning of struct ucontext to ucontext_extension. We indicate the presence of this feature using a new uc_flags bit. I can think of a couple of vaguely reasonable places: a) the reserved word in sigcontext. This is fine for x86 but not so great if other architectures want to do this. b) uc_link. Fine everywhere but powerpc. Oops. c) use the high bits of uc_flags. After all, once we add extensions, we don't need new flags, so we can steal 16 high bits of uc_flags for this. I think I'm in favor of (c). We do: (uc_flags & 0xffff0000) == 0: extension not present Otherwise the extension region is at ucontext + (uc_flags >> 16). And sigreturn finds the extension the same way, because CRIU can already migrate a signal frame from one kernel to another, your patch breaks this, and having sigreturn hardcode the offset would also break it. What do you think? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.2 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 75C52C43461 for ; Wed, 28 Apr 2021 23:04:10 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 1484461448 for ; Wed, 28 Apr 2021 23:04:10 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 1484461448 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=kernel.org Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 915BD6B006C; Wed, 28 Apr 2021 19:04:09 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 8C55F6B006E; Wed, 28 Apr 2021 19:04:09 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 73EAF6B0070; Wed, 28 Apr 2021 19:04:09 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0084.hostedemail.com [216.40.44.84]) by kanga.kvack.org (Postfix) with ESMTP id 532246B006C for ; Wed, 28 Apr 2021 19:04:09 -0400 (EDT) Received: from smtpin14.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id 1837B180AD820 for ; Wed, 28 Apr 2021 23:04:09 +0000 (UTC) X-FDA: 78083305818.14.95A92D2 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by imf20.hostedemail.com (Postfix) with ESMTP id 00355FC for ; Wed, 28 Apr 2021 23:03:59 +0000 (UTC) Received: by mail.kernel.org (Postfix) with ESMTPSA id A3A996144F for ; Wed, 28 Apr 2021 23:04:07 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1619651047; bh=0afqfeFLgdB0fTNzaLFOMLYgWe3yB1y7QqmEFNzjedY=; h=References:In-Reply-To:From:Date:Subject:To:Cc:From; b=s8CPOaIFrcZ6bgfFP6FstGuiS/Ep8lWIPMWnb/Iw39Kg8z0S6Hu4s34I7h0HF9dsO +jFJk3d8n+w7lQ3eNw+yzriu6NxFWTCNH+jtp5oeyK8l/QS+jeRDJ+d1OnU0USEegh +uux6L2gxngj6rNRFS8k+uTH3ngbWIj4yS5oYVpyMQM86An6XE9OCy2mYbVY4ECwti AsMh5rD4V7hbfjc3SXOiaKr7Q6bDENZ5iN1pV8hPkHPzM/l8lhf29EM91z/4YxbmpZ TmBqhOZm3Up25Hcy3mCbTpIA4g3ot3UsnqAZyVor05LSO9XfRFejgMe+QRf7O7MPDh OmWSmvkCLu+yA== Received: by mail-ej1-f53.google.com with SMTP id r12so96999780ejr.5 for ; Wed, 28 Apr 2021 16:04:07 -0700 (PDT) X-Gm-Message-State: AOAM530TwD3ha5eSQhaHiI4sMNeII9dqP4+raknCHiPu62pk+3U05JED QA5gYj4XLwPEb6L2i+PA3nTKz3f10TqX3wGG+WqDrQ== X-Google-Smtp-Source: ABdhPJzcb8Hw7Su0s8Rgxkv7jYWxNcuy2WTKMRFpTCb8fN9sWcUB+ds4ywq5Mo5SLRg2DAoSXXOtfYklNGY1ADo5ZH8= X-Received: by 2002:a17:906:6896:: with SMTP id n22mr32665459ejr.316.1619651045990; Wed, 28 Apr 2021 16:04:05 -0700 (PDT) MIME-Version: 1.0 References: <20210427204315.24153-1-yu-cheng.yu@intel.com> <20210427204315.24153-26-yu-cheng.yu@intel.com> In-Reply-To: <20210427204315.24153-26-yu-cheng.yu@intel.com> From: Andy Lutomirski Date: Wed, 28 Apr 2021 16:03:55 -0700 X-Gmail-Original-Message-ID: Message-ID: Subject: extending ucontext (Re: [PATCH v26 25/30] x86/cet/shstk: Handle signals for shadow stack) To: Yu-cheng Yu , linux-arch Cc: X86 ML , "H. Peter Anvin" , Thomas Gleixner , Ingo Molnar , LKML , "open list:DOCUMENTATION" , Linux-MM , Linux API , Arnd Bergmann , Andy Lutomirski , Balbir Singh , Borislav Petkov , Cyrill Gorcunov , Dave Hansen , Eugene Syromiatnikov , Florian Weimer , "H.J. Lu" , Jann Horn , Jonathan Corbet , Kees Cook , Mike Kravetz , Nadav Amit , Oleg Nesterov , Pavel Machek , Peter Zijlstra , Randy Dunlap , "Ravi V. Shankar" , Vedvyas Shanbhogue , Dave Martin , Weijiang Yang , Pengfei Xu , Haitao Huang Content-Type: text/plain; charset="UTF-8" X-Rspamd-Server: rspam03 X-Rspamd-Queue-Id: 00355FC X-Stat-Signature: zzu8woh6xz3ey9n63ec7rdwxgxwsniy7 Received-SPF: none (kernel.org>: No applicable sender policy available) receiver=imf20; identity=mailfrom; envelope-from=""; helo=mail.kernel.org; client-ip=198.145.29.99 X-HE-DKIM-Result: pass/pass X-HE-Tag: 1619651039-49886 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Tue, Apr 27, 2021 at 1:44 PM Yu-cheng Yu wrote: > > When shadow stack is enabled, a task's shadow stack states must be saved > along with the signal context and later restored in sigreturn. However, > currently there is no systematic facility for extending a signal context. > There is some space left in the ucontext, but changing ucontext is likely > to create compatibility issues and there is not enough space for further > extensions. > > Introduce a signal context extension struct 'sc_ext', which is used to save > shadow stack restore token address. The extension is located above the fpu > states, plus alignment. The struct can be extended (such as the ibt's > wait_endbr status to be introduced later), and sc_ext.total_size field > keeps track of total size. I still don't like this. Here's how the signal layout works, for better or for worse: The kernel has: struct rt_sigframe { char __user *pretcode; struct ucontext uc; struct siginfo info; /* fp state follows here */ }; This is roughly the actual signal frame. But userspace does not have this struct declared, and user code does not know the sizes of the fields. So it's accessed in a nonsensical way. The signal handler function is passed a pointer to the whole sigframe implicitly in RSP, a pointer to &frame->info in RSI, anda pointer to &frame->uc in RDX. User code can *find* the fp state by following a pointer from mcontext, which is, in turn, found via uc: struct ucontext { unsigned long uc_flags; struct ucontext *uc_link; stack_t uc_stack; struct sigcontext uc_mcontext; <-- fp pointer is in here sigset_t uc_sigmask; /* mask last for extensibility */ }; The kernel, in sigreturn, works a bit differently. The sigreturn variants know the base address of the frame but don't have the benefit of receiving pointers to the fields. So instead the kernel takes advantage of the fact that it knows the offset to uc and parses uc accordingly. And the kernel follows the pointer in mcontext to find the fp state. The latter bit is quite important later. The kernel does not parse info at all. The fp state is its own mess. When XSAVE happened, Intel kindly (?) gave us a software defined area between the "legacy" x87 region and the modern supposedly extensible part. Linux sticks the following structure in that hole: struct _fpx_sw_bytes { /* * If set to FP_XSTATE_MAGIC1 then this is an xstate context. * 0 if a legacy frame. */ __u32 magic1; /* * Total size of the fpstate area: * * - if magic1 == 0 then it's sizeof(struct _fpstate) * - if magic1 == FP_XSTATE_MAGIC1 then it's sizeof(struct _xstate) * plus extensions (if any) */ __u32 extended_size; /* * Feature bit mask (including FP/SSE/extended state) that is present * in the memory layout: */ __u64 xfeatures; /* * Actual XSAVE state size, based on the xfeatures saved in the layout. * 'extended_size' is greater than 'xstate_size': */ __u32 xstate_size; /* For future use: */ __u32 padding[7]; }; That's where we are right now upstream. The kernel has a parser for the FPU state that is bugs piled upon bugs and is going to have to be rewritten sometime soon. On top of all this, we have two upcoming features, both of which require different kinds of extensions: 1. AVX-512. (Yeah, you thought this story was over a few years ago, but no. And AMX makes it worse.) To make a long story short, we promised user code many years ago that a signal frame fit in 2048 bytes with some room to spare. With AVX-512 this is false. With AMX it's so wrong it's not even funny. The only way out of the mess anyone has come up with involves making the length of the FPU state vary depending on which features are INIT, i.e. making it more compact than "compact" mode is. This has a side effect: it's no longer possible to modify the state in place, because enabling a feature with no space allocated will make the structure bigger, and the stack won't have room. Fortunately, one can relocate the entire FPU state, update the pointer in mcontext, and the kernel will happily follow the pointer. So new code on a new kernel using a super-compact state could expand the state by allocating new memory (on the heap? very awkwardly on the stack?) and changing the pointer. For all we know, some code already fiddles with the pointer. This is great, except that your patch sticks more data at the end of the FPU block that no one is expecting, and your sigreturn code follows that pointer, and will read off into lala land. 2. CET. CET wants us to find a few more bytes somewhere, and those bytes logically belong in ucontext, and here we are. This is *almost*, but not quite, easy: struct ucontext is already variable length! Unfortunately, the whole variable length portion is used up by uc_sigmask. So I propose that we introduce a brand new bona fide extension mechanism. It works like this: First, we add a struct ucontext_extension at the end. It looks like: struct ucontext_extension { u64 length; /* sizeof(struct ucontext_extension) */ u64 flags; /* we will want this some day */ [CET stuff here] [future stuff here] }; And we locate it by scrounging a word somewhere in ucontext to give the offset from the beginning of struct ucontext to ucontext_extension. We indicate the presence of this feature using a new uc_flags bit. I can think of a couple of vaguely reasonable places: a) the reserved word in sigcontext. This is fine for x86 but not so great if other architectures want to do this. b) uc_link. Fine everywhere but powerpc. Oops. c) use the high bits of uc_flags. After all, once we add extensions, we don't need new flags, so we can steal 16 high bits of uc_flags for this. I think I'm in favor of (c). We do: (uc_flags & 0xffff0000) == 0: extension not present Otherwise the extension region is at ucontext + (uc_flags >> 16). And sigreturn finds the extension the same way, because CRIU can already migrate a signal frame from one kernel to another, your patch breaks this, and having sigreturn hardcode the offset would also break it. What do you think?