From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.2 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 15082C433E0 for ; Tue, 29 Dec 2020 00:57:49 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id D7F20207FB for ; Tue, 29 Dec 2020 00:57:48 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730549AbgL2A5d (ORCPT ); Mon, 28 Dec 2020 19:57:33 -0500 Received: from mail.kernel.org ([198.145.29.99]:49308 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730229AbgL2A5c (ORCPT ); Mon, 28 Dec 2020 19:57:32 -0500 Received: by mail.kernel.org (Postfix) with ESMTPSA id 406572222A for ; Tue, 29 Dec 2020 00:56:51 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1609203411; bh=5ea1GP7R9ThOqNi8XEr5gxNXJU/ubpRT/SpIDHm2c0M=; h=References:In-Reply-To:From:Date:Subject:To:Cc:From; b=bOnhgK3ds/goNLDhJWb0r7fRqd65KN5BXdE4oYSm7k5zkRDAV/1pooaKncbxQMwqf fOUNJPfyVyB6nkeEQNc6maqX0cSIgI+eyBG9L1sgaBgPd6EuwaKhSYBROHQnL4TJGL Cq3PeZ3nvlkFmpumXZ6nt9sK04hHjLnE7fJk3gNqGhoR0Vn8m2i5GbX4pO+WXnbinV CGz/aqbTkNVecOveVLht/LqNxen/d+Me/6DXPqdBMRvND/LlObgSwJ3/9TqmR0Rtc4 h04e4s9Al0uBq8gpPaU+BYRPynbzikeks9DGeoUdJshRNB307Y1HbkY/YqTEuVF/FC wa+4hCR8K1IBA== Received: by mail-wm1-f44.google.com with SMTP id c124so791807wma.5 for ; Mon, 28 Dec 2020 16:56:51 -0800 (PST) X-Gm-Message-State: AOAM5305wmxNXzTObJCicOEO12esXBjBPAHniX0HWtTw6094Hyegv37d HnQOCFB1qY7/4dAMtEBGdzZpViG6c79mrlW3TCDvlg== X-Google-Smtp-Source: ABdhPJwWLZzdWjQFkiOrEi0hD/sw8QEhosbYgv30yS3Mb/zDERHlmT9dUrmSuLKTMJNeve8TtWbk4bbcTCBfJLD539U= X-Received: by 2002:a1c:6741:: with SMTP id b62mr1172357wmc.21.1609203409735; Mon, 28 Dec 2020 16:56:49 -0800 (PST) MIME-Version: 1.0 References: <1836294649.3345.1609100294833.JavaMail.zimbra@efficios.com> <20201228102537.GG1551@shell.armlinux.org.uk> <20201228190852.GI1551@shell.armlinux.org.uk> <1086654515.3607.1609187556216.JavaMail.zimbra@efficios.com> <1609200902.me5niwm2t6.astroid@bobo.none> In-Reply-To: <1609200902.me5niwm2t6.astroid@bobo.none> From: Andy Lutomirski Date: Mon, 28 Dec 2020 16:56:36 -0800 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode() To: Nicholas Piggin Cc: Andy Lutomirski , Mathieu Desnoyers , Arnd Bergmann , Benjamin Herrenschmidt , Catalin Marinas , Jann Horn , linux-arm-kernel , "Russell King, ARM Linux" , linux-kernel , linuxppc-dev , Michael Ellerman , paulmck , Paul Mackerras , Peter Zijlstra , stable , Will Deacon , x86 Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Dec 28, 2020 at 4:36 PM Nicholas Piggin wrote: > > Excerpts from Andy Lutomirski's message of December 29, 2020 7:06 am: > > On Mon, Dec 28, 2020 at 12:32 PM Mathieu Desnoyers > > wrote: > >> > >> ----- On Dec 28, 2020, at 2:44 PM, Andy Lutomirski luto@kernel.org wrote: > >> > >> > On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin > >> > wrote: > >> >> > >> >> On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote: > >> >> > After chatting with rmk about this (but without claiming that any of > >> >> > this is his opinion), based on the manpage, I think membarrier() > >> >> > currently doesn't really claim to be synchronizing caches? It just > >> >> > serializes cores. So arguably if userspace wants to use membarrier() > >> >> > to synchronize code changes, userspace should first do the code > >> >> > change, then flush icache as appropriate for the architecture, and > >> >> > then do the membarrier() to ensure that the old code is unused? > >> > >> ^ exactly, yes. > >> > >> >> > > >> >> > For 32-bit arm, rmk pointed out that that would be the cacheflush() > >> >> > syscall. That might cause you to end up with two IPIs instead of one > >> >> > in total, but we probably don't care _that_ much about extra IPIs on > >> >> > 32-bit arm? > >> > >> This was the original thinking, yes. The cacheflush IPI will flush specific > >> regions of code, and the membarrier IPI issues context synchronizing > >> instructions. > > APIs should be written in terms of the service they provide to > userspace, and in highest level terms as possible, rather than directing > hardware to do some low level operation. Unfortunately we're stuck with > this for now. We could deprecate it and replace it though. > > If userspace wants to modify code and ensure that after the system call > returns then no other thread will be executing the previous code, then > there should be an API for that. It could actually combine the two IPIs > into one for architectures that require both too. I agree. The membarrier API for SYNC_CORE is pretty nasty. I would much prefer a real API for JITs to use. > > >> > >> Architectures with coherent i/d caches don't need the cacheflush step. > > > > There are different levels of coherency -- VIVT architectures may have > > differing requirements compared to PIPT, etc. > > > > In any case, I feel like the approach taken by the documentation is > > fundamentally confusing. Architectures don't all speak the same > > language How about something like: > > > > The SYNC_CORE operation causes all threads in the caller's address > > space (including the caller) to execute an architecture-defined > > barrier operation. membarrier() will ensure that this barrier is > > executed at a time such that all data writes done by the calling > > thread before membarrier() are made visible by the barrier. > > Additional architecture-dependent cache management operations may be > > required to use this for JIT code. > > As said this isn't what SYNC_CORE does, and it's not what powerpc > context synchronizing instructions do either, it will very much > re-order visibility of stores around such an instruction. Perhaps the docs should be entirely arch-specific. It may well be impossible to state what it does in an arch-neutral way. > > A thread completes store instructions into a store queue, which is > as far as a context synchronizing event goes. Visibility comes at > some indeterminite time later. As currently implemented, it has the same visibility semantics as regular membarrier, too. So if I do: a = 1; membarrier(SYNC_CORE); b = 1; and another thread does: while (READ_ONCE(b) != 1) ; barrier(); assert(a == 1); then the assertion will pass. Similarly, one can do this, I hope: memcpy(codeptr, [some new instructions], len); arch_dependent_cache_flush(codeptr, len); membarrier(SYNC_CORE); ready = 1; and another thread does: while (READ_ONCE(ready) != 1) ; barrier(); (*codeptr)(); arch_dependent_cache_flush is a nop on x86. On arm and arm64, it appears to be a syscall, although maybe arm64 can do it from userspace. I still don't know what it is on powerpc. Even using the term "cache" here is misleading. x86 chips have all kinds of barely-documented instruction caches, and they have varying degrees of coherency. The architecture actually promises that, if you do a certain incantation, then you get the desired result. membarrier() allows a user to do this incantation. But trying to replicate the incantation verbatim on an architecture like ARM is insufficient, and trying to flush the things that are documented as being caches on x86 is expensive and a complete waste of time on x86. When you write some JIT code, you do *not* want to flush it all the way to main memory, especially on CPUs don't have the ability to write back invalidating. (That's most CPUs.) Even on x86, I suspect that the various decoded insn caches are rather more coherent than documented, and I have questions in to Intel about this. No answers yet. So perhaps the right approach is to say that membarrier() helps you perform the architecture-specific sequence of steps needed to safely modify code. On x86, you use it like this. On arm64, you do this other thing. On powerpc, you do something else. > > I would be surprised if x86's serializing instructions were different > than powerpc. rdtsc ordering or flushing stores to cache would be > surprising. > At the very least, x86 has several levels of what ARM might call "context synchronization" AFAICT. STAC, CLAC, and POPF do a form of context synchronization in that the changes they cause to the MMU take effect immediately, but they are not documented as synchronizing the instruction stream. "Serializing" instructions do all kinds of things, not all of which may be architecturally visible at all. MFENCE and LFENCE do various complicated things, and LFENCE has magic retroactive capabilities on old CPUs that were not documented when those CPUs were released. SFENCE does a different form of synchronization entirely. LOCK does something else. (The relationship between LOCK and MFENCE is confusing at best.) RDTSC doesn't serialize anything at all, but RDTSCP does provide a form of serialization that's kind of ilke LFENCE. It's a mess. Even the manuals are inconsistent about what "serialize" means. JMP has its own magic on x86, but only on very very old CPUs. The specific instruction that flushes everything into the coherency domain is SFENCE, and SFENCE is not, for normal purposes, needed for self- or cross-modifying code. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.8 required=3.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 96148C433DB for ; Tue, 29 Dec 2020 00:58:30 +0000 (UTC) Received: from lists.ozlabs.org (lists.ozlabs.org [203.11.71.2]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id C814C207D0 for ; Tue, 29 Dec 2020 00:58:29 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org C814C207D0 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=kernel.org Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=linuxppc-dev-bounces+linuxppc-dev=archiver.kernel.org@lists.ozlabs.org Received: from bilbo.ozlabs.org (lists.ozlabs.org [IPv6:2401:3900:2:1::3]) by lists.ozlabs.org (Postfix) with ESMTP id 4D4bdH43QgzDqGC for ; Tue, 29 Dec 2020 11:58:27 +1100 (AEDT) Authentication-Results: lists.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=kernel.org (client-ip=198.145.29.99; helo=mail.kernel.org; envelope-from=luto@kernel.org; receiver=) Authentication-Results: lists.ozlabs.org; dmarc=pass (p=none dis=none) header.from=kernel.org Authentication-Results: lists.ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=kernel.org header.i=@kernel.org header.a=rsa-sha256 header.s=k20201202 header.b=bOnhgK3d; dkim-atps=neutral Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id 4D4bbV43mgzDqDM for ; Tue, 29 Dec 2020 11:56:54 +1100 (AEDT) Received: by mail.kernel.org (Postfix) with ESMTPSA id 579DD22472 for ; Tue, 29 Dec 2020 00:56:51 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1609203411; bh=5ea1GP7R9ThOqNi8XEr5gxNXJU/ubpRT/SpIDHm2c0M=; h=References:In-Reply-To:From:Date:Subject:To:Cc:From; b=bOnhgK3ds/goNLDhJWb0r7fRqd65KN5BXdE4oYSm7k5zkRDAV/1pooaKncbxQMwqf fOUNJPfyVyB6nkeEQNc6maqX0cSIgI+eyBG9L1sgaBgPd6EuwaKhSYBROHQnL4TJGL Cq3PeZ3nvlkFmpumXZ6nt9sK04hHjLnE7fJk3gNqGhoR0Vn8m2i5GbX4pO+WXnbinV CGz/aqbTkNVecOveVLht/LqNxen/d+Me/6DXPqdBMRvND/LlObgSwJ3/9TqmR0Rtc4 h04e4s9Al0uBq8gpPaU+BYRPynbzikeks9DGeoUdJshRNB307Y1HbkY/YqTEuVF/FC wa+4hCR8K1IBA== Received: by mail-wm1-f54.google.com with SMTP id k10so794972wmi.3 for ; Mon, 28 Dec 2020 16:56:51 -0800 (PST) X-Gm-Message-State: AOAM533pCMR7brdB1+qXkUHN8tRMdoJkZz9kWBC/x4vmSGukwyVbZqYB Q/EonIDSLIQat2zWeBL9ql83Tb3feaCT6On+ll6xmA== X-Google-Smtp-Source: ABdhPJwWLZzdWjQFkiOrEi0hD/sw8QEhosbYgv30yS3Mb/zDERHlmT9dUrmSuLKTMJNeve8TtWbk4bbcTCBfJLD539U= X-Received: by 2002:a1c:6741:: with SMTP id b62mr1172357wmc.21.1609203409735; Mon, 28 Dec 2020 16:56:49 -0800 (PST) MIME-Version: 1.0 References: <1836294649.3345.1609100294833.JavaMail.zimbra@efficios.com> <20201228102537.GG1551@shell.armlinux.org.uk> <20201228190852.GI1551@shell.armlinux.org.uk> <1086654515.3607.1609187556216.JavaMail.zimbra@efficios.com> <1609200902.me5niwm2t6.astroid@bobo.none> In-Reply-To: <1609200902.me5niwm2t6.astroid@bobo.none> From: Andy Lutomirski Date: Mon, 28 Dec 2020 16:56:36 -0800 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode() To: Nicholas Piggin Content-Type: text/plain; charset="UTF-8" X-BeenThere: linuxppc-dev@lists.ozlabs.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Linux on PowerPC Developers Mail List List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: paulmck , Arnd Bergmann , Jann Horn , Peter Zijlstra , x86 , "Russell King, ARM Linux" , stable , linux-kernel , Will Deacon , Mathieu Desnoyers , Andy Lutomirski , Catalin Marinas , Paul Mackerras , linuxppc-dev , linux-arm-kernel Errors-To: linuxppc-dev-bounces+linuxppc-dev=archiver.kernel.org@lists.ozlabs.org Sender: "Linuxppc-dev" On Mon, Dec 28, 2020 at 4:36 PM Nicholas Piggin wrote: > > Excerpts from Andy Lutomirski's message of December 29, 2020 7:06 am: > > On Mon, Dec 28, 2020 at 12:32 PM Mathieu Desnoyers > > wrote: > >> > >> ----- On Dec 28, 2020, at 2:44 PM, Andy Lutomirski luto@kernel.org wrote: > >> > >> > On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin > >> > wrote: > >> >> > >> >> On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote: > >> >> > After chatting with rmk about this (but without claiming that any of > >> >> > this is his opinion), based on the manpage, I think membarrier() > >> >> > currently doesn't really claim to be synchronizing caches? It just > >> >> > serializes cores. So arguably if userspace wants to use membarrier() > >> >> > to synchronize code changes, userspace should first do the code > >> >> > change, then flush icache as appropriate for the architecture, and > >> >> > then do the membarrier() to ensure that the old code is unused? > >> > >> ^ exactly, yes. > >> > >> >> > > >> >> > For 32-bit arm, rmk pointed out that that would be the cacheflush() > >> >> > syscall. That might cause you to end up with two IPIs instead of one > >> >> > in total, but we probably don't care _that_ much about extra IPIs on > >> >> > 32-bit arm? > >> > >> This was the original thinking, yes. The cacheflush IPI will flush specific > >> regions of code, and the membarrier IPI issues context synchronizing > >> instructions. > > APIs should be written in terms of the service they provide to > userspace, and in highest level terms as possible, rather than directing > hardware to do some low level operation. Unfortunately we're stuck with > this for now. We could deprecate it and replace it though. > > If userspace wants to modify code and ensure that after the system call > returns then no other thread will be executing the previous code, then > there should be an API for that. It could actually combine the two IPIs > into one for architectures that require both too. I agree. The membarrier API for SYNC_CORE is pretty nasty. I would much prefer a real API for JITs to use. > > >> > >> Architectures with coherent i/d caches don't need the cacheflush step. > > > > There are different levels of coherency -- VIVT architectures may have > > differing requirements compared to PIPT, etc. > > > > In any case, I feel like the approach taken by the documentation is > > fundamentally confusing. Architectures don't all speak the same > > language How about something like: > > > > The SYNC_CORE operation causes all threads in the caller's address > > space (including the caller) to execute an architecture-defined > > barrier operation. membarrier() will ensure that this barrier is > > executed at a time such that all data writes done by the calling > > thread before membarrier() are made visible by the barrier. > > Additional architecture-dependent cache management operations may be > > required to use this for JIT code. > > As said this isn't what SYNC_CORE does, and it's not what powerpc > context synchronizing instructions do either, it will very much > re-order visibility of stores around such an instruction. Perhaps the docs should be entirely arch-specific. It may well be impossible to state what it does in an arch-neutral way. > > A thread completes store instructions into a store queue, which is > as far as a context synchronizing event goes. Visibility comes at > some indeterminite time later. As currently implemented, it has the same visibility semantics as regular membarrier, too. So if I do: a = 1; membarrier(SYNC_CORE); b = 1; and another thread does: while (READ_ONCE(b) != 1) ; barrier(); assert(a == 1); then the assertion will pass. Similarly, one can do this, I hope: memcpy(codeptr, [some new instructions], len); arch_dependent_cache_flush(codeptr, len); membarrier(SYNC_CORE); ready = 1; and another thread does: while (READ_ONCE(ready) != 1) ; barrier(); (*codeptr)(); arch_dependent_cache_flush is a nop on x86. On arm and arm64, it appears to be a syscall, although maybe arm64 can do it from userspace. I still don't know what it is on powerpc. Even using the term "cache" here is misleading. x86 chips have all kinds of barely-documented instruction caches, and they have varying degrees of coherency. The architecture actually promises that, if you do a certain incantation, then you get the desired result. membarrier() allows a user to do this incantation. But trying to replicate the incantation verbatim on an architecture like ARM is insufficient, and trying to flush the things that are documented as being caches on x86 is expensive and a complete waste of time on x86. When you write some JIT code, you do *not* want to flush it all the way to main memory, especially on CPUs don't have the ability to write back invalidating. (That's most CPUs.) Even on x86, I suspect that the various decoded insn caches are rather more coherent than documented, and I have questions in to Intel about this. No answers yet. So perhaps the right approach is to say that membarrier() helps you perform the architecture-specific sequence of steps needed to safely modify code. On x86, you use it like this. On arm64, you do this other thing. On powerpc, you do something else. > > I would be surprised if x86's serializing instructions were different > than powerpc. rdtsc ordering or flushing stores to cache would be > surprising. > At the very least, x86 has several levels of what ARM might call "context synchronization" AFAICT. STAC, CLAC, and POPF do a form of context synchronization in that the changes they cause to the MMU take effect immediately, but they are not documented as synchronizing the instruction stream. "Serializing" instructions do all kinds of things, not all of which may be architecturally visible at all. MFENCE and LFENCE do various complicated things, and LFENCE has magic retroactive capabilities on old CPUs that were not documented when those CPUs were released. SFENCE does a different form of synchronization entirely. LOCK does something else. (The relationship between LOCK and MFENCE is confusing at best.) RDTSC doesn't serialize anything at all, but RDTSCP does provide a form of serialization that's kind of ilke LFENCE. It's a mess. Even the manuals are inconsistent about what "serialize" means. JMP has its own magic on x86, but only on very very old CPUs. The specific instruction that flushes everything into the coherency domain is SFENCE, and SFENCE is not, for normal purposes, needed for self- or cross-modifying code. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-4.2 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5B97BC433E0 for ; Tue, 29 Dec 2020 00:58:08 +0000 (UTC) Received: from merlin.infradead.org (merlin.infradead.org [205.233.59.134]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 181562079A for ; Tue, 29 Dec 2020 00:58:08 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 181562079A Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=kernel.org Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=merlin.20170209; h=Sender:Content-Transfer-Encoding: Content-Type:Cc:List-Subscribe:List-Help:List-Post:List-Archive: List-Unsubscribe:List-Id:To:Subject:Message-ID:Date:From:In-Reply-To: References:MIME-Version:Reply-To:Content-ID:Content-Description:Resent-Date: Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Owner; bh=nk21B0r3EGi0pftNvoQQr5Dia8w1QiToOjBRDtM+YI4=; b=CFh0hPyKBKB7LTejf/cmvU4MW R6E+rIqTvBMFZ1JR9a97JL2yK2FXb0poUw/I9jTMjSNdcsAvXbqtPHzETi/y36EUKx/+JGd0ugUcc e76+1y8mXXo5+hvwpPsxwaFiJrKGiyAHim13kbEVs0ZqtVKGDW++CvmnF2mkfcDlMR5AWvw9g0peq EtdJgxVJF2483HMirYEFwnvc5D2w47khtHzNOwYW3z6eN9XoZZQMeHRzi3/KB3rdKGx9YwkL8Nx8X HdJuNDPVAv8lUL33YM+MI/xGZT+zb6wEAsMwiUoGQQskaYQ/MpVZY8v15WOLFmoF7H8IzLVP9f/+O A7GyCpsSw==; Received: from localhost ([::1] helo=merlin.infradead.org) by merlin.infradead.org with esmtp (Exim 4.92.3 #3 (Red Hat Linux)) id 1ku3Jn-0001VN-DL; Tue, 29 Dec 2020 00:56:55 +0000 Received: from mail.kernel.org ([198.145.29.99]) by merlin.infradead.org with esmtps (Exim 4.92.3 #3 (Red Hat Linux)) id 1ku3Jk-0001UK-KS for linux-arm-kernel@lists.infradead.org; Tue, 29 Dec 2020 00:56:53 +0000 Received: by mail.kernel.org (Postfix) with ESMTPSA id 2EAF1207FB for ; Tue, 29 Dec 2020 00:56:51 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1609203411; bh=5ea1GP7R9ThOqNi8XEr5gxNXJU/ubpRT/SpIDHm2c0M=; h=References:In-Reply-To:From:Date:Subject:To:Cc:From; b=bOnhgK3ds/goNLDhJWb0r7fRqd65KN5BXdE4oYSm7k5zkRDAV/1pooaKncbxQMwqf fOUNJPfyVyB6nkeEQNc6maqX0cSIgI+eyBG9L1sgaBgPd6EuwaKhSYBROHQnL4TJGL Cq3PeZ3nvlkFmpumXZ6nt9sK04hHjLnE7fJk3gNqGhoR0Vn8m2i5GbX4pO+WXnbinV CGz/aqbTkNVecOveVLht/LqNxen/d+Me/6DXPqdBMRvND/LlObgSwJ3/9TqmR0Rtc4 h04e4s9Al0uBq8gpPaU+BYRPynbzikeks9DGeoUdJshRNB307Y1HbkY/YqTEuVF/FC wa+4hCR8K1IBA== Received: by mail-wm1-f44.google.com with SMTP id v14so801704wml.1 for ; Mon, 28 Dec 2020 16:56:51 -0800 (PST) X-Gm-Message-State: AOAM532vvyHzoe4XysBp0biVJgXA9l0OV4NmDxzV4yNqTr/6w2VPyrtq tA4+M5pWKjAxMHasgFDfun+VVCyr5vVo5xh3GDX7Cw== X-Google-Smtp-Source: ABdhPJwWLZzdWjQFkiOrEi0hD/sw8QEhosbYgv30yS3Mb/zDERHlmT9dUrmSuLKTMJNeve8TtWbk4bbcTCBfJLD539U= X-Received: by 2002:a1c:6741:: with SMTP id b62mr1172357wmc.21.1609203409735; Mon, 28 Dec 2020 16:56:49 -0800 (PST) MIME-Version: 1.0 References: <1836294649.3345.1609100294833.JavaMail.zimbra@efficios.com> <20201228102537.GG1551@shell.armlinux.org.uk> <20201228190852.GI1551@shell.armlinux.org.uk> <1086654515.3607.1609187556216.JavaMail.zimbra@efficios.com> <1609200902.me5niwm2t6.astroid@bobo.none> In-Reply-To: <1609200902.me5niwm2t6.astroid@bobo.none> From: Andy Lutomirski Date: Mon, 28 Dec 2020 16:56:36 -0800 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode() To: Nicholas Piggin X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20201228_195652_870370_E25D6D6D X-CRM114-Status: GOOD ( 40.78 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: paulmck , Arnd Bergmann , Jann Horn , Peter Zijlstra , Benjamin Herrenschmidt , x86 , "Russell King, ARM Linux" , stable , linux-kernel , Will Deacon , Mathieu Desnoyers , Michael Ellerman , Andy Lutomirski , Catalin Marinas , Paul Mackerras , linuxppc-dev , linux-arm-kernel Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org On Mon, Dec 28, 2020 at 4:36 PM Nicholas Piggin wrote: > > Excerpts from Andy Lutomirski's message of December 29, 2020 7:06 am: > > On Mon, Dec 28, 2020 at 12:32 PM Mathieu Desnoyers > > wrote: > >> > >> ----- On Dec 28, 2020, at 2:44 PM, Andy Lutomirski luto@kernel.org wrote: > >> > >> > On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin > >> > wrote: > >> >> > >> >> On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote: > >> >> > After chatting with rmk about this (but without claiming that any of > >> >> > this is his opinion), based on the manpage, I think membarrier() > >> >> > currently doesn't really claim to be synchronizing caches? It just > >> >> > serializes cores. So arguably if userspace wants to use membarrier() > >> >> > to synchronize code changes, userspace should first do the code > >> >> > change, then flush icache as appropriate for the architecture, and > >> >> > then do the membarrier() to ensure that the old code is unused? > >> > >> ^ exactly, yes. > >> > >> >> > > >> >> > For 32-bit arm, rmk pointed out that that would be the cacheflush() > >> >> > syscall. That might cause you to end up with two IPIs instead of one > >> >> > in total, but we probably don't care _that_ much about extra IPIs on > >> >> > 32-bit arm? > >> > >> This was the original thinking, yes. The cacheflush IPI will flush specific > >> regions of code, and the membarrier IPI issues context synchronizing > >> instructions. > > APIs should be written in terms of the service they provide to > userspace, and in highest level terms as possible, rather than directing > hardware to do some low level operation. Unfortunately we're stuck with > this for now. We could deprecate it and replace it though. > > If userspace wants to modify code and ensure that after the system call > returns then no other thread will be executing the previous code, then > there should be an API for that. It could actually combine the two IPIs > into one for architectures that require both too. I agree. The membarrier API for SYNC_CORE is pretty nasty. I would much prefer a real API for JITs to use. > > >> > >> Architectures with coherent i/d caches don't need the cacheflush step. > > > > There are different levels of coherency -- VIVT architectures may have > > differing requirements compared to PIPT, etc. > > > > In any case, I feel like the approach taken by the documentation is > > fundamentally confusing. Architectures don't all speak the same > > language How about something like: > > > > The SYNC_CORE operation causes all threads in the caller's address > > space (including the caller) to execute an architecture-defined > > barrier operation. membarrier() will ensure that this barrier is > > executed at a time such that all data writes done by the calling > > thread before membarrier() are made visible by the barrier. > > Additional architecture-dependent cache management operations may be > > required to use this for JIT code. > > As said this isn't what SYNC_CORE does, and it's not what powerpc > context synchronizing instructions do either, it will very much > re-order visibility of stores around such an instruction. Perhaps the docs should be entirely arch-specific. It may well be impossible to state what it does in an arch-neutral way. > > A thread completes store instructions into a store queue, which is > as far as a context synchronizing event goes. Visibility comes at > some indeterminite time later. As currently implemented, it has the same visibility semantics as regular membarrier, too. So if I do: a = 1; membarrier(SYNC_CORE); b = 1; and another thread does: while (READ_ONCE(b) != 1) ; barrier(); assert(a == 1); then the assertion will pass. Similarly, one can do this, I hope: memcpy(codeptr, [some new instructions], len); arch_dependent_cache_flush(codeptr, len); membarrier(SYNC_CORE); ready = 1; and another thread does: while (READ_ONCE(ready) != 1) ; barrier(); (*codeptr)(); arch_dependent_cache_flush is a nop on x86. On arm and arm64, it appears to be a syscall, although maybe arm64 can do it from userspace. I still don't know what it is on powerpc. Even using the term "cache" here is misleading. x86 chips have all kinds of barely-documented instruction caches, and they have varying degrees of coherency. The architecture actually promises that, if you do a certain incantation, then you get the desired result. membarrier() allows a user to do this incantation. But trying to replicate the incantation verbatim on an architecture like ARM is insufficient, and trying to flush the things that are documented as being caches on x86 is expensive and a complete waste of time on x86. When you write some JIT code, you do *not* want to flush it all the way to main memory, especially on CPUs don't have the ability to write back invalidating. (That's most CPUs.) Even on x86, I suspect that the various decoded insn caches are rather more coherent than documented, and I have questions in to Intel about this. No answers yet. So perhaps the right approach is to say that membarrier() helps you perform the architecture-specific sequence of steps needed to safely modify code. On x86, you use it like this. On arm64, you do this other thing. On powerpc, you do something else. > > I would be surprised if x86's serializing instructions were different > than powerpc. rdtsc ordering or flushing stores to cache would be > surprising. > At the very least, x86 has several levels of what ARM might call "context synchronization" AFAICT. STAC, CLAC, and POPF do a form of context synchronization in that the changes they cause to the MMU take effect immediately, but they are not documented as synchronizing the instruction stream. "Serializing" instructions do all kinds of things, not all of which may be architecturally visible at all. MFENCE and LFENCE do various complicated things, and LFENCE has magic retroactive capabilities on old CPUs that were not documented when those CPUs were released. SFENCE does a different form of synchronization entirely. LOCK does something else. (The relationship between LOCK and MFENCE is confusing at best.) RDTSC doesn't serialize anything at all, but RDTSCP does provide a form of serialization that's kind of ilke LFENCE. It's a mess. Even the manuals are inconsistent about what "serialize" means. JMP has its own magic on x86, but only on very very old CPUs. The specific instruction that flushes everything into the coherency domain is SFENCE, and SFENCE is not, for normal purposes, needed for self- or cross-modifying code. _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel