From: Peter Collingbourne
Date: Mon, 12 Oct 2020 21:03:49 -0700
Subject: Re: [PATCH] arm64: add prctl(PR_PAC_SET_ENABLED_KEYS)
To: Dave Martin
Cc: Andrey Konovalov, Kevin Brodsky, Kostya Serebryany, Linux ARM, Catalin Marinas, Vincenzo Frascino, Will Deacon, Evgenii Stepanov
In-Reply-To: <20200907110316.GP6642@arm.com>
References: <20200801011152.39838-1-pcc@google.com> <20200819101809.GD6642@arm.com> <20200824144910.GO6642@arm.com> <20200826163726.GV6642@arm.com> <20200901160231.GB6642@arm.com> <20200907110316.GP6642@arm.com>
List-Id: linux-arm-kernel@lists.infradead.org

On Mon, Sep 7, 2020 at 4:03 AM Dave Martin wrote:
>
> On Wed, Sep 02, 2020 at 09:31:55PM -0700, Peter Collingbourne wrote:
> > On Tue, Sep 1, 2020 at 9:02 AM Dave Martin wrote:
> > >
> > > On Wed, Aug 26, 2020 at 07:55:51PM -0700, Peter Collingbourne wrote:
> > > > On Wed, Aug 26, 2020 at 9:37 AM Dave Martin wrote:
> > > > >
> > > > > On Mon, Aug 24, 2020 at 05:23:27PM -0700, Peter Collingbourne wrote:
> > > > > > On Mon, Aug 24, 2020 at 7:49 AM Dave Martin wrote:
> > > > > > >
> > > > > > > On Wed, Aug 19, 2020 at 02:25:45PM -0700, Peter Collingbourne wrote:
> > > > > > > > On Wed, Aug 19, 2020 at 3:18 AM Dave Martin wrote:
> > > > > > > > >
> > > > > > > > > On Fri, Jul 31, 2020 at 06:11:52PM -0700, Peter Collingbourne wrote:
> > > > > >
> > > > > > [...]
> > > > > >
> > > > > > > > > > diff --git a/arch/arm64/include/asm/asm_pointer_auth.h b/arch/arm64/include/asm/asm_pointer_auth.h
> > > > > > > > > > index 52dead2a8640..d121fa5fed5f 100644
> > > > > > > > > > --- a/arch/arm64/include/asm/asm_pointer_auth.h
> > > > > > > > > > +++ b/arch/arm64/include/asm/asm_pointer_auth.h
> > > > > > > > > > @@ -31,6 +31,14 @@ alternative_else_nop_endif
> > > > > > > > > > 	ldp	\tmp2, \tmp3, [\tmp1, #PTRAUTH_USER_KEY_APDB]
> > > > > > > > > > 	msr_s	SYS_APDBKEYLO_EL1, \tmp2
> > > > > > > > > > 	msr_s	SYS_APDBKEYHI_EL1, \tmp3
> > > > > > > > > > +
> > > > > > > > > > +	ldr	\tmp2, [\tsk, #THREAD_SCTLR_ENXX_MASK]
> > > > > > > > > > +	cbz	\tmp2, .Laddr_auth_skip_\@
> > > > > > > > >
> > > > > > > > > I wonder whether it would make sense to simply store the thread's base
> > > > > > > > > SCTLR value (containing the ENxx bits), rather than storing the ENxx
> > > > > > > > > bits separately. There may be reasons outside this snippet why this
> > > > > > > > > isn't such a good idea though -- I haven't gone digging so far.
> > > > > > > >
> > > > > > > > As far as I know (as I learned [1] from the MTE patch series), it can
> > > > > > > > be expensive to access SCTLR_EL1, therefore I arrange to only access
> > > > > > > > SCTLR_EL1 in the hopefully-uncommon case where a process has disabled
> > > > > > > > keys. Detecting the "process has disabled keys" case is quite simple
> > > > > > > > if we only store the disabled keys mask here, not so much if we store
> > > > > > > > the full value of SCTLR_EL1.
> > > > > > >
> > > > > > > My thought was that we would still only write SCTLR_EL1 if needed, but
> > > > > > > we would do the write-if-needed across the whole register in one go.
> > > > > > > This would be easier to extend if we have to twiddle additional
> > > > > > > SCTLR_EL1 bits in the future. If the key enable bits are the only
> > > > > > > affected bits for now then we could of course defer this factoring until
> > > > > > > later. (I vaguely remember a similar discussion in the past, but
> > > > > > > possibly it was about the pauth keys anyway, rather than something
> > > > > > > else.)
> > > > > >
> > > > > > We will also need a per-task SCTLR_EL1.TCF0 setting for MTE. We could
> > > > > > potentially combine the two, so that both
> > > > > > prctl(PR_PAC_SET_ENABLED_KEYS) and prctl(PR_SET_TAGGED_ADDR_CTRL) end
> > > > > > up affecting a common SCTLR_EL1 field in the task_struct, and we do:
> > > > > >
> > > > > > On kernel entry:
> > > > > > - read SCTLR_EL1 from task_struct (or from the system register if we
> > > > > >   decide that's cheap enough)
> > > > > > - if EnIA clear, set it, and write the value to the system register.
> > > > > >
> > > > > > On kernel exit:
> > > > > > - read system register SCTLR_EL1
> > > > > > - read SCTLR_EL1 from task_struct
> > > > > > - if they are different, write the task's SCTLR_EL1 to the system register.
> > > > > >
> > > > > > But this would require at least an unconditional read of SCTLR_EL1 per
> > > > > > kernel exit. The comment that I referenced says that "accesses" to the
> > > > > > register are expensive. I was operating under the assumption that both
> > > > > > reads and writes are expensive, but if it's actually only writes that
> > > > > > are expensive, we may be able to get away with the above.
> > > > >
> > > > > The kernel is in full control over what's in SCTLR_EL1, so perhaps we
> > > > > could cache what we wrote percpu, if the overhead of reading it is
> > > > > problematic.
> > > >
> > > > Maybe, but that sounds like the kind of complexity that I'd like to avoid.
> > >
> > > Ack, this is more something to explore if the other approaches turn out
> > > to be more costly.
> > > > I went ahead and did some performance measurements using the following
> > > > program, which issues 10M invalid syscalls and exits. This should give
> > > > us something as close as possible to a measurement of a kernel entry
> > > > and exit on its own.
> > > >
> > > > .globl _start
> > > > _start:
> > > > 	movz	x1, :abs_g1:10000000
> > > > 	movk	x1, :abs_g0_nc:10000000
> > > >
> > > > .Lloop:
> > > > 	mov	x8, #65536 // invalid
> > > > 	svc	#0
> > > > 	sub	x1, x1, #1
> > > > 	cbnz	x1, .Lloop
> > > >
> > > > 	mov	x0, #0
> > > > 	mov	x8, #0x5d // exit
> > > > 	svc	#0
> > > >
> > > > On a Pixel 4 (which according to Wikipedia has a CPU based on
> > > > Cortex-A76/A55) I measured the median execution time of 1000 runs at
> > > > 0.554511s, implying an entry/exit cost of 55.5ns. If I make this
> > > > modification to the kernel (Pixel 4 uses kernel version 4.14 so this
> > > > won't apply cleanly to modern kernels; this is in the same place as
> > > > the ptrauth_keys_install_user macro invocation in a modern kernel):
> > > >
> > > > diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> > > > index 303447c765f7..b7c72d767f3e 100644
> > > > --- a/arch/arm64/kernel/entry.S
> > > > +++ b/arch/arm64/kernel/entry.S
> > > > @@ -340,6 +340,7 @@ alternative_else_nop_endif
> > > > #endif
> > > > 3:
> > > > 	apply_ssbd 0, 5f, x0, x1
> > > > +	mrs	x0, sctlr_el1
> > > > 5:
> > > > 	.endif
> > > >
> > > > the median execution time goes to 0.565604s, implying a cost of 1.1ns
> > > > to read the system register.
> > > > I also measured the cost of reading from a percpu variable at kernel
> > > > exit like so:
> > > >
> > > > diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> > > > index 303447c765f7..c5a89785f650 100644
> > > > --- a/arch/arm64/kernel/entry.S
> > > > +++ b/arch/arm64/kernel/entry.S
> > > > @@ -340,6 +340,7 @@ alternative_else_nop_endif
> > > > #endif
> > > > 3:
> > > > 	apply_ssbd 0, 5f, x0, x1
> > > > +	ldr_this_cpu x0, sctlr_el1_cache, x1
> > > > 5:
> > > > 	.endif
> > > >
> > > > diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> > > > index 0b6ca5479f1f..ac9c9a6179d4 100644
> > > > --- a/arch/arm64/kernel/traps.c
> > > > +++ b/arch/arm64/kernel/traps.c
> > > > @@ -52,6 +52,8 @@
> > > > #include
> > > > #include
> > > >
> > > > +DEFINE_PER_CPU(unsigned long, sctlr_el1_cache);
> > > > +
> > > > static const char *handler[] = {
> > > > 	"Synchronous Abort",
> > > > 	"IRQ",
> > > >
> > > > With that the median execution time goes to 0.600186s, implying a cost
> > > > of 4.5ns. So at least on existing hardware, it looks like reading
> > > > SCTLR_EL1 directly is probably our cheapest option if we're going to
> > > > read *something*. Therefore I'll look at implementing this with
> > > > unconditional reads from SCTLR_EL1 in v2.
> > >
> > > Thanks for the investigation!
> > >
> > > I'm not too surprised by this outcome. There is a possibility that
> > > reading SCTLR_EL1 sucks badly on some implementations, but on others it
> > > may be little more than a register rename. We should probably wait for
> > > evidence before panicking about it.
> > >
> > > So long as this SCTLR_EL1 switching is completely disabled via
> > > alternatives on hardware to which it isn't applicable, I expect that
> > > should be a reasonable compromise.
> > One disadvantage of disabling SCTLR_EL1 switching on inapplicable
> > hardware is that it complicates the mechanism for enabling and
> > disabling deprecated instruction support (see
> > arch/arm64/kernel/armv8_deprecated.c), which is currently implemented
> > by using on_each_cpu to set or clear certain bits in SCTLR_EL1 on each
> > CPU. This is incompatible with an entry/exit path that assumes full
> > control over SCTLR_EL1.
> >
> > I would propose the following instruction sequences in entry.S, which
> > are justified by further experiments run on a Pixel 4 showing that:
> > 1) an unconditional SCTLR_EL1 write on kernel entry/exit costs 3.0ns,
> >    whereas conditional writes cost 4.5ns
> > 2) an SCTLR_EL1 read on kernel entry is 0.3ns faster than a load from
> >    task_struct
> > 3) an unconditional SCTLR_EL1 write on kernel exit costs 0.7ns.
> >
> > diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> > index 55af8b504b65..9e0acd5e7527 100644
> > --- a/arch/arm64/kernel/entry.S
> > +++ b/arch/arm64/kernel/entry.S
> > @@ -186,6 +186,12 @@ alternative_cb_end
> >
> > 	ptrauth_keys_install_kernel tsk, x20, x22, x23
> >
> > +alternative_if ARM64_HAS_ADDRESS_AUTH
> > +	mrs	x22, sctlr_el1
> > +	orr	x22, x22, SCTLR_ELx_ENIA
> > +	msr	sctlr_el1, x22
> > +alternative_else_nop_endif
> > +
> > 	scs_load tsk, x20
> > .else
> > 	add	x21, sp, #S_FRAME_SIZE
> > @@ -297,6 +303,9 @@ alternative_else_nop_endif
> > 	/* No kernel C function calls after this as user keys are set. */
> > 	ptrauth_keys_install_user tsk, x0, x1, x2
> >
> > +	ldr	x0, [tsk, THREAD_SCTLR]
> > +	msr	sctlr_el1, x0
> > +
> > 	apply_ssbd 0, x0, x1
> > .endif
> >
> > With these instruction sequences I would propose to reimplement
> > toggling deprecated instruction support as follows:
> > 1) Stop the machine using stop_machine.
> > 2) While the machine is stopped, adjust the deprecated instruction
> >    support feature bits in each task_struct.
> > This will make toggling deprecated instruction support more expensive,
> > but that seems like the right tradeoff since I imagine that these
> > features are rarely toggled, and making it more efficient to toggle
> > them would mean more instructions in the entry/exit hot path.
> >
> > This seems like it will work well if SCTLR_EL1 is set on kernel exit.
> > The deprecated feature bits are CP15BEN and SED, and each of them
> > affects features that are only implemented at EL0, so it seems like we
> > can wait until kernel exit to update SCTLR_EL1. But if we don't set
> > SCTLR_EL1 on exit on certain hardware, we will need to have both
> > mechanisms: the existing mechanism on hardware that doesn't set
> > SCTLR_EL1 on exit, and the new mechanism on hardware that does. I
> > wouldn't mind that too much, but I'm not sure that it's worth it to
> > save 0.7ns (I'm not claiming that my numbers are universal, just that
> > we should probably favor simpler approaches if we don't have evidence
> > that they are significantly more expensive than more complex ones).
> >
> > Peter
>
> This approach does look close to the simplest option for the entry/
> exit path -- I'll let others comment on whether the actual overhead
> numbers sound acceptable.
>
> I think there are more places where the kernel modifies SCTLR_EL1 than
> just for the deprecated instruction emulation though. Do you think your
> approach will work with all of them, and can they all tolerate a
> stop_machine()?

You're right, there are other places. Looking at the places written in
assembly, I think none of them expect the value to be retained. But there
are more places where we set this register in C that I hadn't checked.
For example, in cpufeature.c we have:

static void cpu_emulate_effective_ctr(const struct arm64_cpu_capabilities *__unused)
{
	/*
	 * If the CPU exposes raw CTR_EL0.IDC = 0, while effectively
	 * CTR_EL0.IDC = 1 (from CLIDR values), we need to trap accesses
	 * to the CTR_EL0 on this CPU and emulate it with the real/safe
	 * value.
	 */
	if (!(read_cpuid_cachetype() & BIT(CTR_IDC_SHIFT)))
		sysreg_clear_set(sctlr_el1, SCTLR_EL1_UCT, 0);
}

So we may want heterogeneous values of SCTLR_EL1 on different cores,
unless for example we reimplement this routine to trap accesses to
CTR_EL0 homogeneously.

Furthermore, it is unclear that it would be valid to write SCTLR_EL1
directly with the result of reading a thread_struct field. My belief that
this behavior may be incorrect comes from the documented semantics of
reserved fields in SCTLR_EL1. These bits are defined as RES0, which means
that software must use a read-modify-write (SBZP) approach when writing
the register, and an approach that reads a value from thread_struct and
then writes exactly that value to SCTLR_EL1 would not necessarily
preserve RES0 bits that differ between cores.

This led me to go back to considering approaches where we only update
EnIA on entry and exit, and leave the updating of the other bits to task
switch, at the same time as TCF0 in Catalin's MTE series. Such approaches
seem like they will perform better as well, if extrapolating from
existing hardware is anything to go by.

It seems that my previous measurements on Pixel 4 may have been
inaccurate because /sys/devices/system/cpu/*/cpufreq/scaling_governor was
set to its default value, so frequency scaling may have interfered with
the results. I ran a new set of measurements with scaling_governor set to
powersave, which gave much more stable results. With the newly stable
results I decided to take the median of 100 runs rather than 1000, to
compensate for the slower run time caused by the powersave governor.
These results showed that a conditional MSR is faster than an
unconditional MSR, and that an LDR from thread_struct to obtain the state
of the incoming task is slightly faster than an MRS. The full results,
with references to patches, are below:

Baseline:                         316.7ns
Conditional MSR based on LDR [1]: 319.6ns
Unconditional MSR [2]:            353.4ns
Conditional MSR based on MRS [3]: 323.8ns

I also ran the same experiment on a DragonBoard 845c. Since this board
can run the mainline kernel, it gives us results that we may expect to be
slightly more valid for changes to the mainline kernel than the 4.14
kernel used by Pixel 4. On this board I saw two "clusters" of results, so
I took the median of the faster cluster, which was more stable (I don't
think this was due to big.LITTLE, because I saw very different numbers if
I pinned the process to a LITTLE core):

Baseline:                     450.5ns
Conditional MSR based on LDR: 457.8ns
Unconditional MSR:            487.0ns
Conditional MSR based on MRS: 461.5ns

So we see very similar performance impacts to Pixel 4. Although this
board is Cortex-A75/A55, the uarch generation prior to Pixel 4's
Cortex-A76/A55, it also tells us that on big Cortex uarches SCTLR_EL1
performance is somewhat stable over time, at least to the extent that we
can extrapolate from two data points. Of course, we may tune the
instruction sequences again once hardware with PAC support is publicly
available.

Thus I think that to begin with we should use instruction sequences
similar to those in [1], protected via alternatives so that we don't
impact performance on existing hardware. I will develop my patch on top
of the MTE series, since the two will interfere with each other and it
seems plausible that MTE will land first.
Peter

[1] https://android-review.googlesource.com/c/kernel/msm/+/1457681
[2] https://android-review.googlesource.com/c/kernel/msm/+/1456578
[3] https://android-review.googlesource.com/c/kernel/msm/+/1456579